Details

Interview Time:

November 25, 2025 7:30 PM

Targeted Company:

Targeted Level:

Staff+

Record

Record Link:

Record

Feedback

Strengths

You quickly surfaced the core functional requirements of a CI/CD platform and didn’t get stuck on details early on.
You proactively checked assumptions on scalability with the interviewer, which shows good communication and alignment habits.
You were able to think in terms of snapshots (hourly/daily incremental changes), which is important for build artifacts and state over time.
You listened well to hints and used them to refine your non-functional requirements.
You clearly explained how triggering works with webhooks, wiring the Git events to your system.
You expanded the workflow entity schema correctly, capturing the right metadata.
You introduced an event queue to absorb high-volume incoming requests, which is exactly the right direction for scalability.
You added a “running status” entity to track worker id and start time as the runtime view of a job run – a good modeling step.
You mentioned a scheduler component for job status management and lifecycle control.
For retries, you correctly noted that scanning for retry candidates on a latency-sensitive path is expensive, and that updating the retry count directly on the job record is a better strategy.
You proposed using a cache keyed by (timestamp + job_id) → status, which is a reasonable approach for quick status lookups.

Growth Areas / Suggestions

When talking about completion rate, anchor it in fault tolerance and reliability (e.g., % of jobs that succeed within SLO, including retries) instead of just raw counts.
Rather than “execution time” in the abstract, emphasize latency to trigger (from Git push → first job starts) as a concrete SLO.
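To make these two SLOs concrete, here is a minimal sketch of how they could be computed from run records. The `RunRecord` fields and the 100-second SLO in the usage note are illustrative assumptions, not something defined in the interview.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-run record; field names are assumptions."""
    pushed_at: float               # time of the Git push
    started_at: Optional[float]    # time the first job started, if it ever did
    finished_at: Optional[float]   # time the run reached a terminal state
    succeeded: bool                # final outcome after all retries


def completion_rate(records: list[RunRecord], slo_s: float) -> float:
    """Fraction of runs that succeeded within `slo_s` seconds of the push."""
    if not records:
        return 0.0
    ok = sum(
        1
        for r in records
        if r.succeeded
        and r.finished_at is not None
        and r.finished_at - r.pushed_at <= slo_s
    )
    return ok / len(records)


def trigger_latencies(records: list[RunRecord]) -> list[float]:
    """Git push to first-job-start latency for every run that started."""
    return [r.started_at - r.pushed_at for r in records if r.started_at is not None]
```

With this framing, a run that succeeds only after blowing through the SLO counts against the completion rate, which ties the metric to reliability rather than to raw counts.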
Call out that long-running builds are usually not acceptable; explicitly mention timeouts and how your system terminates jobs that exceed their deadline.
In your architecture, you can draw a dedicated Git repo metadata store and a Git fetcher service that reads from Git and populates configs, instead of reading configs directly from the main DB.
The biggest gap: you had the system execute jobs directly after fetching configs from the DB, instead of placing them into a separate job queue for controlled scheduling. At large scale, you must queue and schedule runs, not execute them synchronously.
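As a rough illustration of this point, the sketch below puts a bounded queue between the component that accepts runs and the workers that execute them. It uses an in-memory `queue.Queue` as a stand-in for a durable broker; `JobRequest`, `JobQueue`, and `drain` are hypothetical names.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobRequest:
    job_id: str
    workflow_id: str


class JobQueue:
    """Bounded in-memory queue; a real system would use a durable broker."""

    def __init__(self, max_pending: int) -> None:
        self._q: "queue.Queue[JobRequest]" = queue.Queue(maxsize=max_pending)

    def enqueue(self, job: JobRequest) -> bool:
        """Scheduler side: accept the job, or report back-pressure when full."""
        try:
            self._q.put_nowait(job)
            return True
        except queue.Full:
            return False

    def dequeue(self) -> Optional[JobRequest]:
        """Worker side: take the oldest queued job, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def drain(job_queue: JobQueue, free_slots: int) -> list[str]:
    """Start at most `free_slots` jobs and leave the rest queued."""
    started = []
    for _ in range(free_slots):
        job = job_queue.dequeue()
        if job is None:
            break
        started.append(job.job_id)
    return started
```

The key property is that a burst of webhooks only grows the queue. The number of jobs actually running is bounded by worker capacity (`free_slots`), not by the arrival rate.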
You jumped into NFRs a bit too early, before fully clarifying the data model and status transitions of workflow & job entities.
For an async system, clearly separate:
- Scheduler (decides what to run, when, with which constraints), and
- Executor/Workers (actually run containers / steps).
Also model two core entities:
- workflow (overall pipeline / DAG definition)
- job_run / step_run (individual executions with status).
Make sure the job structure captures dependencies as a DAG / map, plus decision logic (e.g., only run job B if A succeeds).
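One possible way to model this, sketched under assumptions: each job lists its upstream jobs in `needs`, and a hypothetical `run_if` field carries the decision logic ("all_success" or "always"). The scheduler then asks which jobs are runnable given the current statuses.

```python
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    name: str
    needs: list[str] = field(default_factory=list)  # upstream job names (DAG edges)
    run_if: str = "all_success"  # hypothetical condition: "all_success" or "always"


def runnable_jobs(jobs: dict[str, JobSpec], status: dict[str, str]) -> list[str]:
    """Return PENDING jobs whose dependencies have finished and whose condition holds."""
    ready = []
    for name, spec in jobs.items():
        if status.get(name, "PENDING") != "PENDING":
            continue  # already running or finished
        upstream = [status.get(dep, "PENDING") for dep in spec.needs]
        if any(s in ("PENDING", "RUNNING") for s in upstream):
            continue  # at least one dependency has not finished yet
        if spec.run_if == "always" or all(s == "SUCCESS" for s in upstream):
            ready.append(name)
    return ready
```

For example, with build → test → deploy plus a notify job that runs "always" after test, a failed test leaves deploy blocked while notify becomes runnable. A fuller design would also mark blocked jobs as skipped so the workflow can reach a terminal state.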
For timeout / termination, introduce components like:
- Job Orchestrator (drives state machine of jobs: PENDING → RUNNING → SUCCESS/FAILED/TIMED_OUT)
- Timeout Manager (periodically scans for jobs past their deadline, marks them TIMED_OUT, and stops their workers).
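The two components above can be sketched as a small state machine plus a periodic sweep. This is a minimal illustration, not a full design: the `JobRun` fields, the default timeout, and the function names are assumptions, and stopping the worker is only indicated by a comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"


# Legal transitions; terminal states have no outgoing edges.
ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.SUCCESS, Status.FAILED, Status.TIMED_OUT},
}


@dataclass
class JobRun:
    job_id: str
    status: Status = Status.PENDING
    start_time: Optional[float] = None
    timeout_s: float = 3600.0  # assumed default deadline


def transition(run: JobRun, new_status: Status, now: float) -> None:
    """Orchestrator: apply a transition, rejecting anything the state machine forbids."""
    if new_status not in ALLOWED.get(run.status, set()):
        raise ValueError(f"illegal transition {run.status.value} -> {new_status.value}")
    if new_status is Status.RUNNING:
        run.start_time = now
    run.status = new_status


def sweep_timeouts(runs: list[JobRun], now: float) -> list[str]:
    """Timeout manager: mark RUNNING jobs past their deadline as TIMED_OUT."""
    expired = []
    for run in runs:
        if run.status is Status.RUNNING and now - run.start_time > run.timeout_s:
            transition(run, Status.TIMED_OUT, now)
            expired.append(run.job_id)  # a real system would also signal the worker to stop
    return expired
```

Routing every status change through `transition` is what makes the flow easy to explain in an interview: an illegal move such as TIMED_OUT → RUNNING fails loudly instead of silently corrupting state.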
For retries, the job id should remain stable. Retries should be tracked with:
- A retry_count field, and
- Possibly a backoff schedule (exponential backoff driven by the orchestrator).
The interviewer asked for a very concrete status flow: how retry count changes, how start_time is recorded per retry, and how status transitions happen. This is a good area to practice detailing your state machine + timeline.
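To practice that timeline, here is a minimal sketch of one job going through retries. It assumes a stable `job_id`, a `retry_count` that increments on each failure, one `Attempt` record per try (so `start_time` is kept per retry), and a doubling backoff. The field names and the backoff formula are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attempt:
    number: int                      # 0 for the first try, 1 for the first retry, ...
    start_time: float
    end_time: Optional[float] = None
    outcome: Optional[str] = None


@dataclass
class RetryingJob:
    job_id: str                      # stays the same across all retries
    max_retries: int = 3
    base_delay_s: float = 2.0
    retry_count: int = 0
    status: str = "PENDING"
    next_run_at: float = 0.0
    attempts: list[Attempt] = field(default_factory=list)

    def start(self, now: float) -> None:
        """PENDING -> RUNNING; record a fresh start_time for this attempt."""
        self.status = "RUNNING"
        self.attempts.append(Attempt(number=self.retry_count, start_time=now))

    def fail(self, now: float) -> None:
        """RUNNING -> PENDING (with backoff) while retries remain, else FAILED."""
        attempt = self.attempts[-1]
        attempt.end_time, attempt.outcome = now, "FAILED"
        if self.retry_count < self.max_retries:
            delay = self.base_delay_s * (2 ** self.retry_count)  # exponential backoff
            self.retry_count += 1
            self.status = "PENDING"
            self.next_run_at = now + delay
        else:
            self.status = "FAILED"
```

Walking through it with `max_retries=2` and `base_delay_s=2.0`: a failure at t=1 reschedules the job for t=3 with `retry_count=1`, a failure at t=4 reschedules for t=8 with `retry_count=2`, and a third failure moves the job to FAILED. The `attempts` list keeps each start time, which is the timeline the interviewer was asking for.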
For status updates to clients, instead of repeated long-polling:
- Use a persistent connection like Server-Sent Events (SSE) or WebSockets.
- On the backend, publish job status updates via pub/sub, and have the API layer push them to clients who subscribed to those job ids.
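The push path above can be sketched as an in-process pub/sub broker plus an SSE frame formatter. This is a simplified stand-in: a real deployment would likely use an external pub/sub system and an HTTP framework to hold the connections, and the `StatusBroker` class and the `job_status` event name are assumptions.

```python
import json
from collections import defaultdict
from typing import Callable

Subscriber = Callable[[str, str], None]


class StatusBroker:
    """In-memory pub/sub keyed by job id (stand-in for an external broker)."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Subscriber]] = defaultdict(list)

    def subscribe(self, job_id: str, callback: Subscriber) -> None:
        self._subs[job_id].append(callback)

    def publish(self, job_id: str, status: str) -> int:
        """Deliver the update to every subscriber of this job; return how many got it."""
        callbacks = self._subs.get(job_id, [])
        for cb in callbacks:
            cb(job_id, status)
        return len(callbacks)


def sse_frame(job_id: str, status: str) -> str:
    """Format one status update as a Server-Sent Events frame."""
    payload = json.dumps({"job_id": job_id, "status": status})
    return f"event: job_status\ndata: {payload}\n\n"
```

In the API layer, each open SSE connection would register a callback that writes `sse_frame(...)` to its response stream, so clients receive updates only for the job ids they subscribed to and never need to poll.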

Overall, you were able to follow the full conversation and arrive at a reasonable design with hints. The next step is to lean more proactively into orchestration, state machines, and queues without needing as many prompts.

Homework / Next Steps

Redraw your CI/CD design in Excalidraw and explicitly show:
- Entities: workflow, job_run, step_run, artifact, log, running_status, etc.
- How data & status flow: Git → webhook → scheduler → queue → workers → artifact store → status updates.
- How dependencies and retries are modeled in the job DAG.
As a reference, review the ShowOffer CI/CD system design video or webpage and compare your diagram to the reference architecture. Look for:
- Where your design is already aligned, and
- Where you still need a clearer separation of “control plane vs data plane” and “scheduler vs executor.”
