Details

Interview Time:  
November 25, 2025 7:30 PM
Targeted Company:  
Targeted Level:  
Staff+

Record

Record Link:  
Record

Feedback

Strengths

  • You quickly surfaced the core functional requirements of a CI/CD platform and didn’t get stuck on details early on.

  • You proactively checked assumptions on scalability with the interviewer, which shows good communication and alignment habits.

  • You were able to think in terms of snapshots (hourly/daily incremental changes), which is important for build artifacts and state over time.

  • You listened well to hints and used them to refine your non-functional requirements.

  • You clearly explained how triggering works with webhooks, wiring the Git events to your system.

  • You expanded the workflow entity schema correctly, capturing the right metadata.

  • You introduced an event queue to absorb high-volume incoming requests, which is exactly the right direction for scalability.

  • You added a “running status” entity to track worker id and start time as the runtime view of a job run – a good modeling step.

  • You mentioned a scheduler component for job status management and lifecycle control.

  • For retries, you correctly noted that querying for retries on a latency-sensitive path is expensive, and that updating the retry count in place is the better strategy.

  • You proposed using a cache keyed by (timestamp + job_id) → status, which is a reasonable approach for quick status lookups.
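As a reference point for the webhook and event-queue items above, here is a minimal sketch of how a webhook handler might hand events to a bounded queue so that bursts are absorbed rather than executed inline. The names `PushEvent`, `handle_webhook`, `drain`, the payload fields, and the queue size are all hypothetical, and the in-process `queue.Queue` only stands in for whatever message broker a real system would use.

```python
import queue
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PushEvent:
    """Hypothetical payload extracted from a Git webhook call."""
    repo: str
    commit_sha: str


# Bounded in-process queue (assumption: stands in for a real message broker).
# A burst of pushes fills the buffer instead of overloading downstream workers.
event_queue: "queue.Queue[PushEvent]" = queue.Queue(maxsize=1000)


def handle_webhook(payload: dict) -> int:
    """Validate the payload, enqueue it, and return an HTTP-style status code."""
    try:
        event = PushEvent(repo=payload["repo"], commit_sha=payload["commit_sha"])
    except KeyError:
        return 400  # malformed payload
    try:
        event_queue.put_nowait(event)
    except queue.Full:
        return 429  # backpressure: ask the sender to retry later
    return 202  # accepted; the actual work happens asynchronously


def drain(max_events: int) -> List[PushEvent]:
    """Consumer side: pull up to max_events for the scheduler to process."""
    events: List[PushEvent] = []
    while len(events) < max_events:
        try:
            events.append(event_queue.get_nowait())
        except queue.Empty:
            break
    return events
```

The handler only validates and enqueues, so its latency stays flat under load; how fast events are consumed is decided separately on the `drain` side.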

Growth Areas / Suggestions

  • When talking about completion rate, anchor it in fault tolerance and reliability (e.g., % of jobs that succeed within SLO, including retries) instead of just raw counts.

  • Rather than “execution time” in the abstract, emphasize latency to trigger (from Git push → first job starts) as a concrete SLO.

  • Call out that long-running builds are usually not acceptable; explicitly mention timeouts and how your system terminates jobs that run past their deadline.

  • In your architecture, you can draw a dedicated Git repo metadata store and a Git fetcher service that reads from Git and populates configs, instead of reading configs directly from the main DB.

  • The biggest gap: you had the system execute jobs directly after fetching configs from the DB, instead of placing them into a separate job queue for controlled scheduling. At large scale, you must queue and schedule runs, not execute them synchronously.

  • You jumped into NFRs a bit too early, before fully clarifying the data model and status transitions of workflow & job entities.

  • For an async system, clearly separate:


    • Scheduler (decides what to run, when, with which constraints), and

    • Executor/Workers (actually run containers / steps).

  • Also model two core entities:


    • workflow (overall pipeline / DAG definition)

    • job_run / step_run (individual executions with status).

  • Make sure the job structure captures dependencies as a DAG / map, plus decision logic (e.g., only run job B if A succeeds).

  • For timeout / termination, introduce components like:


    • Job Orchestrator (drives state machine of jobs: PENDING → RUNNING → SUCCESS/FAILED/TIMED_OUT)

    • Timeout Manager (periodically scans for over-deadline jobs and marks them as TIMED_OUT and stops workers).

  • For retries, the job id should remain stable. Retries should be tracked with:


    • A retry_count field, and

    • Possibly a backoff schedule (exponential backoff driven by the orchestrator).

  • The interviewer asked for a very concrete status flow: how retry count changes, how start_time is recorded per retry, and how status transitions happen. This is a good area to practice detailing your state machine + timeline.

  • For status updates to clients, instead of repeated long-polling:


    • Use a persistent connection like Server-Sent Events (SSE) or WebSockets.

    • On the backend, publish job status updates via pub/sub, and have the API layer push them to clients who subscribed to those job ids.
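The pub/sub suggestion in the last item could be sketched roughly as follows. This is an illustration under stated assumptions, not a production design: `StatusBroker` is a hypothetical in-process stand-in for a real pub/sub system, the `job_status` event name is made up, and each listener represents an open client connection held by the API layer. The frame layout (`event:` and `data:` lines followed by a blank line) follows the Server-Sent Events wire format.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A listener receives one already-formatted SSE frame (e.g. writes it to a socket).
Listener = Callable[[str], None]


def to_sse_frame(job_id: str, status: str) -> str:
    """Format one update as an SSE frame: event line, data line, blank line."""
    body = json.dumps({"job_id": job_id, "status": status})
    return f"event: job_status\ndata: {body}\n\n"


class StatusBroker:
    """In-process stand-in for a pub/sub system (assumption for illustration)."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, job_id: str, listener: Listener) -> None:
        """Register a client connection for updates on a single job id."""
        self._listeners[job_id].append(listener)

    def publish(self, job_id: str, status: str) -> int:
        """Push a status change to every subscriber; return how many were notified."""
        frame = to_sse_frame(job_id, status)
        listeners = self._listeners.get(job_id, [])
        for listener in listeners:
            listener(frame)
        return len(listeners)
```

Clients subscribe once per job id and receive pushes only when the status changes, which removes the repeated requests that polling would generate.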

Overall, you were able to follow the full conversation and arrive at a reasonable design with hints. The next step is to lean more proactively into orchestration, state machines, and queues without needing as many prompts.
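To practice the concrete status flow the interviewer asked for, it may help to write the state machine down as code. The sketch below is one possible shape, with several assumptions: a retried attempt returns to PENDING before running again, backoff doubles from a 2-second base, and the field names other than `retry_count` and `start_time` (such as `next_attempt_at` and `history`) are invented for illustration. The job id stays the same across attempts, and timestamps are passed in explicitly so the timeline is easy to trace.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"


@dataclass
class JobRun:
    job_id: str                          # stable across retries
    timeout_s: float
    max_retries: int = 3
    status: Status = Status.PENDING
    retry_count: int = 0
    start_time: Optional[float] = None   # start of the current attempt only
    next_attempt_at: float = 0.0         # earliest time the next attempt may start
    # One entry per finished attempt: (retry_count, start_time, outcome).
    history: List[Tuple[int, float, Status]] = field(default_factory=list)


def backoff_s(retry_count: int, base_s: float = 2.0) -> float:
    """Exponential backoff: 2s, 4s, 8s, ... for retry 1, 2, 3, ..."""
    return base_s * 2 ** (retry_count - 1)


def start(job: JobRun, now: float) -> None:
    """PENDING -> RUNNING; records a fresh start_time for this attempt."""
    if job.status is not Status.PENDING or now < job.next_attempt_at:
        raise ValueError("job is not ready to start")
    job.status = Status.RUNNING
    job.start_time = now


def finish(job: JobRun, now: float, ok: bool) -> None:
    """Called when a worker reports the result of a RUNNING attempt."""
    if job.status is not Status.RUNNING:
        raise ValueError("job is not running")
    _close_attempt(job, now, Status.SUCCESS if ok else Status.FAILED)


def check_timeout(job: JobRun, now: float) -> bool:
    """Timeout-manager scan: time out a RUNNING attempt past its deadline."""
    if job.status is Status.RUNNING and now - job.start_time > job.timeout_s:
        _close_attempt(job, now, Status.TIMED_OUT)
        return True
    return False


def _close_attempt(job: JobRun, now: float, outcome: Status) -> None:
    """Record the attempt, then either schedule a retry or settle on the outcome."""
    job.history.append((job.retry_count, job.start_time, outcome))
    if outcome is not Status.SUCCESS and job.retry_count < job.max_retries:
        job.retry_count += 1
        job.next_attempt_at = now + backoff_s(job.retry_count)
        job.status = Status.PENDING   # assumption: retries re-enter PENDING
        job.start_time = None
    else:
        job.status = outcome
```

Walking one job through `start`, `check_timeout`, `start`, and `finish` by hand, and writing down `retry_count`, `start_time`, and `status` after each call, is a quick way to rehearse the timeline out loud.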

Homework / Next Steps

  • Redraw your CI/CD design in Excalidraw and explicitly show:


    • Entities: workflow, job_run, step_run, artifact, log, running_status, etc.

    • How data & status flow: Git → webhook → scheduler → queue → workers → artifact store → status updates.

    • How dependencies and retries are modeled in the job DAG.

  • As a reference, review the ShowOffer CI/CD system design video or webpage and compare your diagram to the reference architecture. Look for:


    • Where your design is already aligned, and

    • Where you still need a clearer separation of “control plane vs data plane” and “scheduler vs executor.”
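When redrawing the job DAG, a small model of the dependency logic can help check the diagram. The sketch below assumes a hypothetical four-job pipeline stored as a map from job name to its upstream jobs, and plain strings for statuses; it shows the "only run job B if A succeeds" rule and how an upstream failure blocks everything downstream.

```python
from typing import Dict, List, Set

# Hypothetical pipeline: job name -> names of the jobs it depends on.
DAG: Dict[str, List[str]] = {
    "build": [],
    "unit_test": ["build"],
    "lint": [],
    "deploy": ["unit_test", "lint"],
}


def runnable_jobs(dag: Dict[str, List[str]], status: Dict[str, str]) -> List[str]:
    """Return PENDING jobs whose dependencies have all succeeded, sorted by name.

    Jobs missing from the status map are treated as PENDING.
    """
    return sorted(
        job
        for job, deps in dag.items()
        if status.get(job, "PENDING") == "PENDING"
        and all(status.get(dep) == "SUCCESS" for dep in deps)
    )


def blocked_jobs(dag: Dict[str, List[str]], status: Dict[str, str]) -> Set[str]:
    """Return jobs that can no longer run because an upstream job failed.

    Blocking propagates transitively, so the scan repeats until nothing changes.
    """
    blocked: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for job, deps in dag.items():
            if job in blocked:
                continue
            if any(
                status.get(dep) in ("FAILED", "TIMED_OUT") or dep in blocked
                for dep in deps
            ):
                blocked.add(job)
                changed = True
    return blocked
```

A scheduler loop would call `runnable_jobs` after every status change and enqueue the result, which keeps the decision of what to run separate from the workers that run it.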