Strengths
- You quickly surfaced the core functional requirements of a CI/CD platform and didn’t get stuck on details early on.
- You proactively checked assumptions on scalability with the interviewer, which shows good communication and alignment habits.
- You were able to think in terms of snapshots (hourly/daily incremental changes), which is important for build artifacts and state over time.
- You listened well to hints and used them to refine your non-functional requirements.
- You clearly explained how triggering works with webhooks, wiring the Git events to your system.
- You expanded the workflow entity schema correctly, capturing the right metadata.
- You introduced an event queue to absorb high-volume incoming requests, which is exactly the right direction for scalability.
- You added a “running status” entity to track worker id and start time as the runtime view of a job run – a good modeling step.
- You mentioned a scheduler component for job status management and lifecycle control.
- For retries, you correctly noted that querying for retry state on a latency-sensitive path is expensive, and that updating the retry count in place is a better strategy.
- You proposed using a cache keyed by (timestamp + job_id) → status, which is a reasonable approach for quick status lookups.
Growth Areas / Suggestions
- When talking about completion rate, anchor it in fault tolerance and reliability (e.g., % of jobs that succeed within SLO, including retries) instead of just raw counts.
- Rather than “execution time” in the abstract, emphasize latency to trigger (from Git push → first job starts) as a concrete SLO.
- Call out that long-running builds are usually not acceptable; explicitly mention timeouts and how your system terminates jobs that exceed their time limit.
- In your architecture, you can draw a dedicated Git repo metadata store and a Git fetcher service that reads from Git and populates configs, instead of reading configs directly from the main DB.
- The biggest gap: you had the system execute jobs directly after fetching configs from the DB, instead of placing them into a separate job queue for controlled scheduling. At large scale, you must queue and schedule runs, not execute them synchronously.
- You jumped into NFRs a bit too early, before fully clarifying the data model and status transitions of workflow & job entities.
- For an async system, clearly separate:
  - Scheduler (decides what to run, when, and with which constraints), and
  - Executor/Workers (actually run containers / steps).
- Also model two core entities:
  - workflow (overall pipeline / DAG definition)
  - job_run / step_run (individual executions with status).
- Make sure the job structure captures dependencies as a DAG / map, plus decision logic (e.g., only run job B if A succeeds).
- For timeout / termination, introduce components like:
  - Job Orchestrator (drives the state machine of jobs: PENDING → RUNNING → SUCCESS/FAILED/TIMED_OUT)
  - Timeout Manager (periodically scans for over-deadline jobs, marks them as TIMED_OUT, and stops their workers).
- For retries, the job id should remain stable. Retries should be tracked with:
  - A retry_count field, and
  - Possibly a backoff schedule (exponential backoff driven by the orchestrator).
- The interviewer asked for a very concrete status flow: how retry count changes, how start_time is recorded per retry, and how status transitions happen. This is a good area to practice detailing your state machine + timeline.
- For status updates to clients, instead of repeated long-polling:
  - Use a persistent connection like Server-Sent Events (SSE) or WebSockets.
  - On the backend, publish job status updates via pub/sub, and have the API layer push them to clients that have subscribed to those job ids.
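
To make the pub/sub suggestion concrete, here is a minimal in-memory sketch of the fan-out step. It is an illustration under stated assumptions, not a production design: `StatusBroker`, `sse_frame`, and the `job_status` event name are hypothetical, the broker stands in for a real pub/sub system, and `send` stands in for writing to an open SSE connection.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, List


def sse_frame(job_id: str, status: str) -> str:
    """Format one Server-Sent Events message; the blank line ends the event."""
    payload = json.dumps({"job_id": job_id, "status": status})
    return f"event: job_status\ndata: {payload}\n\n"


class StatusBroker:
    """Hypothetical in-process stand-in for a pub/sub topic per job id."""

    def __init__(self) -> None:
        # job_id -> callbacks that write a frame to one connected client
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, job_id: str, send: Callable[[str], None]) -> None:
        self._subscribers[job_id].append(send)

    def publish(self, job_id: str, status: str) -> int:
        """Push one status update to every subscriber; return how many got it."""
        frame = sse_frame(job_id, status)
        subscribers = self._subscribers.get(job_id, [])
        for send in subscribers:
            send(frame)
        return len(subscribers)
```

The point of the sketch is the direction of flow: workers publish once per status change, and the API layer pushes to interested clients, so no client has to poll.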
Overall, you were able to follow the full conversation and arrive at a reasonable design with hints. The next step is to lean more proactively into orchestration, state machines, and queues without needing as many prompts.
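
One way to practice the concrete status flow is to write the state machine down as code. The sketch below is only one possible shape, with assumptions labeled: the function names, the 2-second backoff base, and the choice not to retry timed-out jobs are all hypothetical. It keeps the job id stable, increments `retry_count` per failed attempt, and records a fresh `start_time` on each attempt.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"


# Legal transitions; RUNNING -> PENDING is the retry edge.
ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.SUCCESS, Status.FAILED, Status.TIMED_OUT, Status.PENDING},
    Status.SUCCESS: set(),
    Status.FAILED: set(),
    Status.TIMED_OUT: set(),
}


@dataclass
class JobRun:
    job_id: str                      # stays the same across retries
    timeout_s: float
    max_retries: int = 3
    status: Status = Status.PENDING
    retry_count: int = 0
    start_time: Optional[float] = None
    next_attempt_at: float = 0.0


def _move(job: JobRun, new: Status) -> None:
    if new not in ALLOWED[job.status]:
        raise ValueError(f"illegal transition {job.status.name} -> {new.name}")
    job.status = new


def backoff_delay(retry_count: int, base_s: float = 2.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... (base is an assumption)."""
    return base_s * 2 ** (retry_count - 1)


def start(job: JobRun, now: float) -> None:
    _move(job, Status.RUNNING)
    job.start_time = now             # recorded again on every attempt


def fail_attempt(job: JobRun, now: float) -> None:
    """Requeue with backoff while retries remain, otherwise mark FAILED."""
    if job.retry_count < job.max_retries:
        _move(job, Status.PENDING)
        job.retry_count += 1
        job.start_time = None
        job.next_attempt_at = now + backoff_delay(job.retry_count)
    else:
        _move(job, Status.FAILED)


def sweep_timeouts(jobs: List[JobRun], now: float) -> List[str]:
    """Timeout Manager pass: mark over-deadline RUNNING jobs as TIMED_OUT."""
    timed_out = []
    for job in jobs:
        if job.status is Status.RUNNING and now - job.start_time > job.timeout_s:
            _move(job, Status.TIMED_OUT)
            timed_out.append(job.job_id)
    return timed_out
```

Walking a single job through `start`, `fail_attempt`, `start` again, and a `sweep_timeouts` pass gives exactly the kind of timeline the interviewer was asking for.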
Homework / Next Steps
- Redraw your CI/CD design in Excalidraw and explicitly show:
  - Entities: workflow, job_run, step_run, artifact, log, running_status, etc.
  - How data & status flow: Git → webhook → scheduler → queue → workers → artifact store → status updates.
  - How dependencies and retries are modeled in the job DAG.
- As a reference, review the ShowOffer CI/CD system design video or webpage and compare your diagram to the reference architecture. Look for:
  - Where your design is already aligned, and
  - Where you still need a clearer separation of “control plane vs data plane” and “scheduler vs executor.”
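
For the homework item on modeling dependencies in the job DAG, a small sketch like the following may be a useful starting point before drawing. It assumes a hypothetical `needs` map (job name to the jobs it depends on) and plain string statuses; `execution_order` and `ready_jobs` are illustrative names. It uses `graphlib` from the Python standard library (3.9+), which raises `CycleError` if the map is not a valid DAG.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def execution_order(needs: Dict[str, List[str]]) -> List[str]:
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(needs).static_order())


def ready_jobs(needs: Dict[str, List[str]], status: Dict[str, str]) -> List[str]:
    """Jobs not yet started whose dependencies have all succeeded.

    Jobs missing from `status` are treated as PENDING. A job whose
    dependency FAILED never becomes ready ("only run B if A succeeds").
    """
    return sorted(
        job
        for job, deps in needs.items()
        if status.get(job, "PENDING") == "PENDING"
        and all(status.get(dep) == "SUCCESS" for dep in deps)
    )


# Example pipeline: test needs build, deploy needs test.
needs = {"build": [], "test": ["build"], "deploy": ["test"]}
print(ready_jobs(needs, {"build": "SUCCESS"}))  # ['test']
```

In a diagram, `ready_jobs` corresponds to the scheduler's decision step: it only hands work to the queue once upstream jobs have succeeded.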