GitHub Actions

💡 System design interviews at companies like OpenAI and Meta

These interviews often go beyond textbook answers—they test how you reason under constraints, design for scale, and handle failure gracefully.

Here are just a few of the questions you might face in CI/CD design:

  • Concurrency & Queuing
    "When multiple developers push code at once, how would you orchestrate and prioritize builds to optimize resource usage and minimize delays?"
  • Multi-Tenant Architecture
    "If multiple teams or repositories share the same CI/CD platform, how would you isolate their pipelines, enforce fairness, and prevent noisy neighbor issues?"
  • Immutable Build Artifacts
    "Would you enforce artifact immutability? If so, how would you ensure reproducibility and manage cache invalidation?"
  • Rollback Safety
    "How would you design safe rollback mechanisms in a multi-service environment without risking cascading failures?"

We’ll walk through the core architectural principles, then go deeper into battle-tested strategies for resilient deployments, efficient pipelines, and developer productivity at scale.

Whether you're preparing for a high-stakes interview or architecting your own system, this guide is built to help you think like an experienced engineer.

CI/CD isn’t just another tech buzzword — it’s the backbone of modern software development that keeps your team shipping fast without breaking things (well, most of the time! 😅).

Let’s first check what CI/CD means:

CI (Continuous Integration): Your Code’s Quality Guardian

Think of CI as that super diligent colleague who checks everyone’s work before it gets merged. Every time someone pushes code:

  • Code gets automatically integrated into the main branch
  • Tests run immediately to catch bugs before they spread
  • Fast feedback loops mean developers know within minutes if they broke something

It’s like having a 24/7 code reviewer that never gets tired or grumpy!

CD (Continuous Delivery/Deployment): Your Release Automation Hero

Now this is where things get exciting:

  • Continuous Delivery keeps your code in a “ready-to-ship” state at all times. Think of it as you being dressed up and ready for the party, but you still decide when to actually go.
  • Continuous Deployment takes it a step further — every change that passes your tests automatically goes to production. It’s like having a fully automated assembly line that just keeps churning out releases.

Functional Requirements

  • FR1: Trigger workflows on each Git push using the repo-defined workflow file.
  • FR2: Run workflows in isolated sandboxes with scoped secrets.
  • FR3: Provide real-time visibility into execution (status, step progress, logs).

Non-functional Requirements

💡 Non-functional Metrics Assumptions

Let’s assume the system handles ~10 pushes per second on average, and each workflow takes about 15 minutes to run. That translates to roughly 10,000 concurrent workflows under normal load. During peak traffic — such as coordinated releases or mass test retries — the system may need to handle up to 10× the baseline, i.e., 100,000 concurrent workflows.

This framing helps us size the CI/CD platform not just for today’s team size but for realistic workload bursts, ensuring fast feedback, minimal queuing, and uninterrupted developer productivity.

  • NFR1: The system shall support concurrent execution of at least 100,000 jobs across the cluster.
  • NFR2: The system shall enforce per-job container isolation to limit blast radius and prevent cross-workflow interference.
  • NFR3: The system shall enforce per-organization/repo quotas to guarantee fairness and prevent noisy-neighbor effects.
  • NFR4: The system shall persist workflow state and artifacts across failures to enable safe recovery.
  • NFR5: The system shall process workflow triggers within 10 seconds of receiving a Git event during peak time.

High Level Design

💡 How to Deliver High-Level Design in Your Interviews?

Many candidates worry that if they don’t mention everything upfront, they’ll miss key evaluation points — especially since later stages of the interview often focus on deep dives. That concern is valid.

But rather than front loading everything (which often becomes noisy or ungrounded), it’s better to start from flow, show awareness of interesting decision points, and invite the interviewer to go deeper later.

Here’s a structured 5-step approach:

  1. Break Each Functional Requirement into Building Blocks
    Break each functional requirement into clear, sequential building blocks (e.g., “On Git push → fetch YAML → parse DAG → enqueue jobs”). These steps form the scaffold of your design.
  2. Apply Core Design Principles per Block
    At each block, ask questions like:
    • Should it be sync or async?
    • What are the durability guarantees?
    • Do we need deduplication, retries, or fallbacks?
    This aligns your architecture to real-world tradeoffs.
  3. Derive Entities from Flow, Not the Other Way Around
    Instead of guessing tables upfront, identify which objects need to persist across blocks. Then, define entities (e.g., Run, Job, Step) and expand key fields based on what’s required in the flow.
  4. Draw a Text or Component Diagram to Tie It Together
    Use a clean, labeled diagram to explain how services and data interact. Show ownership (e.g., “Runner writes logs to Trigger Service”) and lifecycle (e.g., “Run → Jobs → Steps → Logs”).
  5. Circle Back to Deep Dives as Needed
    It’s totally okay if you don’t go deep on every piece early on. Instead, set expectations:
    “Here we stream logs through Redis and flush to blob storage — I’ll highlight more around log fanout and buffering later if that’s of interest.”
    This shows you see the depth, but prioritize flow — a strong signal of system-level thinking.

This approach ensures your design is anchored, intentional, and scalable, without overloading the interviewer early or falling into checklist-style design.

FR1: Trigger workflows on each Git push using the repo-defined workflow file.

In a modern CI/CD system, every pipeline is triggered by a Git event — such as a push, pull request, or merge. To manage this reliably at scale, we introduce an Ingress Service as the entry point. The Ingress Service receives these Git events via provider webhooks (e.g., GitHub), fetches the workflow definition pinned to the commit SHA, validates it, constructs a DAG of jobs and dependencies, and persists a durable workflow run plan. This plan is designed to be recoverable, replayable, and observable, serving as the single source of truth for reliable downstream execution.

Core responsibilities covered in this functional requirement

  • Event ingestion & authentication: Securely receive and validate webhook events from Git providers.
  • Deduplication: Git providers use at-least-once delivery, so events may be resent due to retries or network issues; the system must deduplicate them.
  • Workflow retrieval: Fetch the workflow definition pinned to the referenced commit SHA.
  • Validation: Ensure configuration correctness, syntax validity, and policy compliance.
  • Execution planning: Build a DAG of jobs and dependencies, then persist the run plan in a durable storage.

Key Design Principles

  • Asynchronous-first: Acknowledge webhooks quickly; defer heavy lifting off the hot path.
  • Effectively-once processing: Deduplicate at ingress, preserve immutability, and ensure events are consumed reliably.
  • Durable state persistence: The system should persist workflow state and artifacts so that execution can survive crashes, support recovery, and provide auditability.

Entities

GitHub Event → WorkflowRun (one per trigger) → Tasks (one per task, DAG edges from prerequisites[]) → TaskAttempts (one per execution try).

POST /push_event: Ingest a Git push event, authenticate and deduplicate it, then trigger the creation of a workflow run plan.

  • Repository — A registered source code repository within the CI/CD system. It is the origin of workflow definitions and the source of truth for pipeline execution.
  • Workflow_Run — A versioned specification, tied to a specific commit, that defines the pipeline. It declares Tasks, their dependencies, and the steps each task must execute.
  • Task — A single node within the workflow’s DAG. Each task is executed in its own isolated environment (e.g., container, VM, or sandbox) to ensure reproducibility and fault isolation.
  • Task_Attempt — An individual execution of a task. Each attempt represents a concrete command or action run by the system, ordered and tracked for retries, logging, and observability.

Let’s walk through how this works step by step.

Step 1: GitHub Push Event

GitHub Actions is powered internally by GitHub’s event system, which is built on webhooks. When something happens (push, pull request, issue comment, release, etc.), GitHub fires an internal webhook-like event.

Step 2: Authenticate the GitHub Event

The first responsibility is to authenticate the event via API Gateway, since processing an unauthenticated or replayed request could allow attackers to trigger arbitrary builds or replay past builds.

How are GitHub webhook events authenticated?

We expose a public endpoint (POST /webhooks/git) where the handler auto-detects the Git provider based on headers. Authentication varies by provider:

  • GitHub: HMAC-SHA256 via X-Hub-Signature-256
  • GitLab: Shared token via X-Gitlab-Token
  • Bitbucket: JWT or shared HMAC (if configured)

All signatures are verified against the raw request body using constant-time comparison to mitigate timing attacks. We also support dual-secret rotation so keys can be rotated without downtime.

HMAC (Hash-based Message Authentication Code) is a cryptographic technique that uses a secret shared key and a hash function to verify both the integrity and authenticity of a message. The sender combines the message and secret key to create an HMAC tag, which is sent along with the message. The receiver uses the same secret key to recalculate the HMAC tag from the received message. If the calculated tag matches the transmitted tag, the message is confirmed as authentic and untampered.
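
As an illustration, here is a minimal verification sketch in Python for the GitHub-style X-Hub-Signature-256 header. The environment-variable name and helper function are assumptions of this sketch, not part of any provider SDK.

    import hashlib
    import hmac
    import os

    def verify_github_signature(raw_body: bytes, signature_header: str) -> bool:
        # Shared webhook secret; the env var name is an assumption in this sketch.
        secret = os.environ["WEBHOOK_SECRET"].encode()
        expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
        # Constant-time comparison mitigates timing attacks.
        return hmac.compare_digest(expected, signature_header or "")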

Step 3: Deduplication

Git providers use at-least-once delivery, meaning the same event may be sent multiple times due to retries, delays, or transient network failures. Without protection, this can cause duplicate workflow runs, wasted compute, and inconsistent system state. We cover deduplication in detail in DD1.

Step 4: Fetch & Validate Workflow Definition

Once the event is authenticated and deduplicated, we fetch the workflow definition (e.g., .ci/workflows.yaml) pinned to the exact commit SHA referenced in the event.

💡 Fetch by commit SHA (not branch) to ensure builds use an immutable snapshot of the codebase — making them reproducible and safe from “moving target” changes.

Step 5: Parse and Validate the Workflow

After fetching, the workflow YAML is parsed into an internal representation of jobs and dependencies. But we don’t stop at syntax checks—validation is multi-layered to enforce correctness and policy:

  1. Schema validation: fields, formats, types.
  2. Policy linting: enforce org rules.
  3. Semantic validation: detect cycles in the task graph formed by prerequisites[] and downstream[] edges, invalid job references, and undeclared or reused artifact names.
💡 Stricter validation may block some edge cases, but it reduces runtime failures and improves security. Early failure (failed(config)) gives developers fast feedback and saves cluster resources.
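
To make the semantic check concrete, below is a small validation sketch using Kahn's algorithm over the prerequisites[] edges. The dict-based task shape and function name are assumptions for illustration.

    from collections import deque

    def validate_dag(tasks):
        """tasks maps task name -> list of prerequisite task names.
        Returns a topological order, or raises on unknown references or cycles."""
        for name, prereqs in tasks.items():
            for p in prereqs:
                if p not in tasks:
                    raise ValueError(f"task {name!r} references unknown task {p!r}")
        indegree = {name: len(set(prereqs)) for name, prereqs in tasks.items()}
        ready = deque(n for n, d in indegree.items() if d == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            # Promote tasks whose prerequisites are now all complete.
            for name, prereqs in tasks.items():
                if node in prereqs:
                    indegree[name] -= 1
                    if indegree[name] == 0:
                        ready.append(name)
        if len(order) != len(tasks):
            raise ValueError("workflow DAG contains a cycle")
        return order

For example, validate_dag({"build": [], "test": ["build"], "deploy": ["test"]}) returns ["build", "test", "deploy"], while a cyclic graph raises immediately, failing the run early with failed(config).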

Step 6: Construct DAG

Once validated, the workflow definition is transformed into a Directed Acyclic Graph (DAG) of tasks. The execution model is layered:

DAG vs. Step-by-Step Workflow

In a CI/CD system, execution can be modeled as either a DAG or a Step-by-Step flow:
  • DAG → Enables parallelism, fan-out/fan-in, matrix builds, and independent retries. Pipelines run faster and are more resilient, but require more complex scheduling and state management.
  • Step-by-Step → Each stage runs sequentially after the previous one. Simpler and easier to reason about, but less scalable—failures often force full pipeline re-runs and block parallelism.
Recommendation: Default to a DAG for scalability and selective retries, but collapse to Step-by-Step when the pipeline is strictly linear to minimize overhead.

This structure ensures durability (workflow state survives crashes), parallelism (jobs can be scheduled independently), and auditability (every run, job, and retry is traceable).

1. Workflow_run: Provides a reliable top-level object for retries, history, and observability — even if schedulers crash.

  • Captures the entire triggered workflow as a durable record.
  • Links together the GitHub event, commit SHA, workflow file, and run status.

2. Tasks: Breaking runs into tasks exposes parallelism, enforces dependencies, and isolates execution.

  • Each task in the YAML becomes a Task node in the DAG.
📘 What is a task in the CI/CD YAML config?

  • A task represents a logical unit of work within the workflow.
  • It specifies:
    • Execution environment (e.g., container image, VM, or runner)
    • Set of steps (commands or actions) to run
    • Resource requirements
    • Retry behavior
    • Artifacts it consumes or produces
  • Example: In a CI pipeline, jobs might include build, test, lint, or deploy.
  • Dependencies are modeled as edges; matrix strategies expand into multiple task variants.
  • Stores specs: image, command, resources, timeouts, artifact contracts, and retry limits.

3. Task_Attempts: Attempts make transitions idempotent and auditable — retries don’t corrupt prior state.

  • Every actual execution is a Task_Attempt row, recording worker ID, start/end times, exit codes, and logs.
  • Retries create new attempts instead of overwriting history.
💡 Durable Persistence

The DAG is stored in a durable store, so if the controller crashes, a reconciler can reload state and resume execution from where it left off.

While this adds minor latency and storage overhead, in CI/CD systems durability and correctness far outweigh raw speed, ensuring reliable and observable pipelines.
💡 Challenges

Event ingestion is often the first bottleneck and a frequent source of correctness bugs. Key edge cases include:

  • Absorb bursts of high-volume webhook traffic without dropping events
  • Deduplicate retries or replayed events from providers
  • Persist state before processing to guarantee recovery after crashes

These are the foundations for building reliable ingestion pipelines. Details on handling each appear in the Deep Dive section.

FR2: Schedule and execute workflow jobs in isolated container environments

Once the Ingress Service has parsed the workflow and persisted the task DAG, the system control passes to the Scheduler, which is responsible for driving execution across distributed, containerized workers. Here are the primary responsibilities of this schedule and execution part:

  • Enforce execution order based on the DAG
  • Efficiently allocate compute resources across jobs to ensure multi-tenant fairness
  • Enforce tenant isolation and security
  • Share build artifacts among execution runners

Entities:

Artifact - A file (build output, logs, etc.) produced by a task or task_attempt.

Let’s go through how to achieve those design goals above:

Step 1: Enforce execution order based on the DAG

A naive scheduler has to scan all tasks repeatedly to figure out which ones can run.

  • Each scheduling cycle, it checks every task in the workflow run.
  • For each task, it looks at the prerequisites[] list and verifies if all prerequisites are complete.

This works for small DAGs, but at scale (thousands of jobs per workflow or thousands of workflows at once), it becomes expensive and inefficient — wasting CPU cycles re-checking blocked tasks. And it also makes fairness harder, since the scheduler has no quick way to prioritize just the runnable tasks.

📂 Ready Queue

A Ready Queue is a queue of tasks whose dependencies have all been satisfied.
  • When a task finishes successfully, the scheduler checks all tasks that depend on it.
  • If any dependent now has all prerequisites met, it is moved into the Ready Queue.
  • The scheduler only pulls tasks from this Ready Queue when assigning work to executors.

Benefits:
  1. Efficiency → Only eligible tasks are considered, avoiding repeated scans.
  2. Low latency → Tasks start immediately once dependencies complete.
  3. Fairness & Prioritization → The queue can be ordered by tenant, priority, or SLA.
  4. Scalability → Handles thousands of workflows without excessive compute overhead.
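
Below is a minimal in-memory sketch of that bookkeeping; a production scheduler would persist this state and order the ready queue by tenant, priority, or SLA rather than plain FIFO.

    from collections import deque

    class ReadyQueue:
        def __init__(self, tasks):
            # tasks maps task name -> iterable of prerequisite task names.
            self.pending = {name: set(prereqs) for name, prereqs in tasks.items()}
            self.dependents = {}  # prerequisite -> set of tasks waiting on it
            for name, prereqs in tasks.items():
                for p in prereqs:
                    self.dependents.setdefault(p, set()).add(name)
            self.ready = deque(n for n, unmet in self.pending.items() if not unmet)

        def pop_ready(self):
            """Scheduler pulls the next runnable task, or None if nothing is ready."""
            return self.ready.popleft() if self.ready else None

        def mark_done(self, task):
            """On task success, promote dependents whose prerequisites are all met."""
            for dep in self.dependents.get(task, ()):
                self.pending[dep].discard(task)
                if not self.pending[dep]:
                    self.ready.append(dep)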

Step 2: Efficiently allocate compute resources across jobs

In a CI/CD system, every Git event can trigger dozens or even hundreds of jobs, all competing for shared compute. These jobs may demand very different resources — from lightweight lint checks to GPU-heavy integration tests. If we don’t manage allocation carefully, we risk 3 big issues:

  • a single team monopolizing the cluster
  • noisy neighbors degrading performance
  • critical jobs like hotfix deploys getting delayed behind less important work

🧠 What are the common resource allocation strategies?

Large-scale CI/CD platforms and compute clusters typically rely on different strategies to balance efficiency, fairness, and isolation. Below are three common models:

1️⃣ Static Allocation

Each team or organization is given a fixed quota of runners or nodes. It’s predictable and guarantees isolation since no one can exceed their share. However, it suffers from poor utilization—idle capacity remains unused while other queues wait. This leads to over-provisioning and wasted resources.

2️⃣ Dynamic Scheduling with Queues

Instead of fixed allocations, all resources are pooled, and the scheduler places jobs into queues with quotas and priorities applied. This allows bursts and ensures fairness. It’s more efficient and elastic but introduces complexity—requiring robust queue management, fairness policies, and starvation prevention.

3️⃣ Resource Isolation

Each job runs inside its own container or micro-VM with enforced CPU and memory quotas. This prevents noisy-neighbor effects and provides strong multi-tenant isolation. The trade-off is operational overhead—cold starts, CPU/memory fragmentation, and additional latency.

💡 In practice: Modern CI/CD platforms combine these models— using a dynamic scheduler for utilization and fairness, paired with container-level isolation for safety. This enables the system to absorb bursts, prevent tenant interference, and maximize cluster efficiency.

Step 3: Enqueue the Task_Attempt to the Job Queue

When a job is ready to run, the scheduler creates a task attempt—a record representing one execution of that job—and places it into a distributed job queue. This queue acts as a buffer between the scheduling logic and the compute layer, ensuring that tasks are stored reliably, can be retried if needed, and are available for any available runner to pick up when capacity allows.

Step 4: Runners Pull from Queue

Runner agents, which are the workers that actually execute CI/CD tasks, continuously poll the job queue for available work. When a runner claims a task, it reserves the job, prepares an isolated execution environment (e.g., container or VM), pulls the required code and dependencies, and then runs the defined steps, reporting logs, status, and artifacts back to the CI/CD system.

Step 5: Isolated Execution

Each task runs in its own sandbox, typically materialized as a Pod (containers) in Kubernetes or a micro-VM. At launch, the system injects the full execution context:

  • Image & command — the container image (digest-pinned for immutability) and the command to execute.
  • Secrets — short-lived credentials injected at runtime (via environment variables or mounted volumes), scoped only to that task.
  • Artifacts
    • Inputs: pulled from the artifact storage (e.g., build outputs, test datasets) before execution.
    • Outputs: uploaded back into the artifact storage under a versioned path, ensuring immutability and traceability for downstream tasks.
  • Isolation controls — namespaces, quotas, and network policies that sandbox each execution and prevent noisy-neighbor effects.

📦 Containers vs Micro-VMs

Containers are faster and more resource-efficient but offer weaker isolation. Micro-VMs provide stronger security boundaries at the cost of slower cold starts and higher memory overhead. We use containers by default and reserve micro-VMs for sensitive workflows—such as jobs that manage secrets or access cloud provider credentials.

Step 6: Build Output (Artifacts) Publishing

Build artifacts are the output files generated during the CI (Continuous Integration) phase. These are the deliverables of a build job that downstream jobs — including deployment — will rely on. Common examples include:

  • Compiled binaries (e.g., .jar, .exe, .wasm)
  • Docker images
  • Static web assets (e.g., .js, .css, .html)
  • Configuration files, schemas, or signed packages
  • Test reports, logs, or coverage summaries

In a modern CI/CD system, CI and CD often run on separate compute nodes or even different clusters. To decouple them while maintaining correctness, build artifacts act as a contract between these stages.

💡 Why Artifacts Matter Here

Artifacts are the glue between tasks. Without them:

  • Jobs would have to recompute or re-download inputs every time, causing waste.
  • Downstream jobs could see inconsistent outputs if artifacts were mutable.

By coupling task execution with the artifact:

  • Reproducibility → Every run uses the same versioned inputs.
  • Traceability → Outputs are tied back to workflow run and commit SHA.
  • Rollbacks → If a deployment fails, an older artifact can be redeployed instantly.

We will discuss this more in DD 7 later.

Below, we walk through how CI/CD workers integrate with artifact storage.

CI Worker (Build Phase)

After building, the CI job uploads artifacts to a durable artifact store, such as:

  • Object storage (e.g., S3, GCS)
  • Container registry (e.g., ECR, GCR)

Artifacts are content-addressed (via SHA) or version-tagged for immutability. Metadata (e.g., commit SHA, run ID, checksum) is persisted to track lineage.

Artifact Metadata

Centralized Artifact Metadata ensures that artifacts:

  • Are durable and recoverable
  • Can be fetched independently of the original build node
  • Support retries, rollbacks, and auditing

CD Worker (Deploy Phase)

When deployment jobs are triggered, they fetch the relevant artifacts using a unique identifier (e.g., commit SHA or version tag). The CD pipeline does not re-build — it relies on the previously stored, validated artifacts. This allows safe, reproducible deployments, even if the original code branch has changed.

Although this post mainly covers workflow management and notification flow, it’s also helpful for interviews to understand how CI/CD workers function internally in case interviewers probe for deeper details.

🗂️ How runner works in CI Pipeline

Once the scheduler assigns a job, the runner takes over. The runner carries out the work in an isolated environment.

1. Pulling the Job Spec

The runner fetches the job specification from the scheduler or queue. This spec defines:

  • Container image or VM template that sets up the environment (toolchains, dependencies).
  • Command/entrypoint to execute (e.g., make build, pytest, npm run lint).
  • Resource profile (CPU, memory, disk, GPU) required for the job.
  • Optional timeouts, retries, and artifact expectations.

2. Executing Steps

Inside the isolated runtime (container, VM, or sandbox), the runner executes pipeline steps. Typical stages include:

  • Build: Compile code, resolve dependencies, package binaries.
  • Test: Run unit, integration, or system tests.
  • Lint/Analysis: Static analysis, code style checks, vulnerability scanning.

Isolation is enforced using cgroups/quotas so one job cannot interfere with others on the same node.

🗂️ How runner works in CD Pipeline

In Continuous Deployment, once a build and test stage passes, the pipeline automatically pushes changes to staging or production. The runner is the execution agent that performs deployment steps in an isolated, controlled environment.

1. Pulling the Deployment Spec

  • The runner retrieves the job spec from the scheduler.
  • This spec includes:
    • Environment details (staging, canary, production).
    • Deployment strategy (blue/green, rolling, canary).
    • Commands/scripts to execute (e.g., kubectl apply, helm upgrade, Terraform apply).
    • Resource requirements and credentials/secrets for accessing target systems.

2. Executing Deployment Steps

Inside its sandboxed environment (container, VM, or micro-VM), the runner carries out deployment steps such as:

  • Fetching build artifacts (e.g., Docker images, binaries) from the artifact storage.
  • Applying infrastructure changes (provisioning, database migrations).
  • Rolling out application updates to the cluster or servers.
  • Running health checks to confirm the service is up and stable.

FR3: Provide real-time visibility into execution (status, step progress, logs).

Once a CI/CD workflow is triggered, engineers expect a live view of what’s happening — not just after-the-fact results. This FR focuses on delivering real-time feedback loops that allow developers to observe, debug, and react as their workflow progresses. Once a workflow is triggered and the DAG is persisted, the CI/CD system doesn’t just execute jobs — it also needs to manage the full lifecycle of those executions: tracking what happened, supporting manual recovery, and enabling data to flow cleanly between jobs and environments.

Step 1. Streaming Logs

Each task (or, more precisely, each task_attempt) produces logs that are streamed live to both the UI and persistent storage.

  • Executors stream stdout and stderr to a log collector, which:
    • Buffers logs locally (to survive network flaps)
    • Batches and ships logs to a central pipeline (e.g., Kafka → log store)
    • Tags logs with (workflow_run_id, job_name, step_name, attempt_id)

💡 Logs are often written to a scalable backend like:

Executors push logs and status into a message queue, from which separate consumers fan them out to object storage (for durable chunks), a search index (for query), and a real-time pub/sub channel (for live UI streaming).

  • Object Storage (e.g., S3 with gzip chunks) → bulk raw log data from job steps (stdout/stderr), plus archived artifacts (binaries, reports, coverage files).
  • Search Index (e.g., Elasticsearch) → structured log lines and metadata (workflow ID, job, step, timestamps, error strings) to support fast queries, filtering, and debugging.
  • Real-time Pub/Sub (e.g., Redis Pub/Sub for UI streaming) → live log frames and task status events (RUNNING, FAILED, SUCCEEDED, heartbeat) streamed instantly to the CI/CD dashboard.

The frontend uses SSE to tail logs live, falling back to static rendering post-execution.

Step 2. Realtime updates of Live Job & Step Status

At any point during execution, the system should expose:

  • Workflow-level state: pending, running, success, failed, canceled
  • Per-step/task status: granular view of each Task in the DAG
  • Timing breakdowns: queued time, execution time, retry counts, exit codes

This is driven by a state machine behind the scenes:

  • Every job emits transitions (e.g., PENDING → RUNNING → SUCCEEDED)
  • These transitions are captured in a durable database (e.g., Task_Attempt table)
  • A status service subscribes to updates and pushes events to clients via SSE.

A real-time update system with a notification service based on pub/sub works by decoupling senders (publishers) from receivers (subscribers) so updates can flow quickly and scale easily:

  • Publishers: Whenever something changes — e.g., a build finishes, a new comment appears, or a job’s status updates — the producing service publishes an event message to a topic in the notification system.
  • Pub/Sub Broker: A message broker (e.g., Redis) takes these events and fans them out to all interested subscribers without the publisher needing to know who they are or how many exist.
  • Subscribers: Services or clients (such as web servers, mobile apps, or WebSocket gateways) subscribe to the topic. When a new event arrives, they immediately receive the message and can push a notification or update their UI.

This pattern enables low-latency, scalable real-time updates, since publishers don’t have to directly contact each client, and the notification service handles delivery, retries, and fan-out.
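
A hedged sketch of this pattern using Redis pub/sub via the redis-py client is shown below; the channel naming and event shape are assumptions for illustration.

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def publish_status(workflow_run_id, task_id, status):
        """Producer side: emit a status transition onto a per-run channel."""
        event = {"workflow_run_id": workflow_run_id, "task_id": task_id, "status": status}
        r.publish(f"runs:{workflow_run_id}:status", json.dumps(event))

    def tail_status(workflow_run_id):
        """Subscriber side (e.g., an SSE gateway): block on the channel and yield events."""
        pubsub = r.pubsub()
        pubsub.subscribe(f"runs:{workflow_run_id}:status")
        for message in pubsub.listen():
            if message["type"] == "message":
                yield json.loads(message["data"])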

💡 Challenges in Enabling Live Observability

  • Low-latency delivery: Keeping log updates fast enough for real-time viewing while dealing with unpredictable network latency and high event throughput.
  • Backpressure and overload: Preventing slow or overloaded clients from overwhelming the system and causing delays or dropped updates for others.
  • Scalable fan-out: Supporting many simultaneous connections without degrading performance as the number of viewers grows.
  • Storage and retention costs: Managing the high volume of logs produced by builds while controlling long-term storage expenses.
  • Efficient searchability: Providing fast, reliable search and filtering across large log datasets without excessive indexing or query overhead.

Deep Dive

DD1: Reliable Ingestion of Push Events Under Burst Traffic

💡 What is the event ingestion risk?
Coupling API latency with build scheduling can cause dropped triggers during bursts, deploys, or crashes. Use immediate acknowledgment plus durable enqueue to absorb spikes and ensure safe retries.

CI/CD systems need to handle massive webhook volumes — especially during peak hours when thousands of developers push code, merge PRs, or trigger deployments. If ingestion isn’t designed carefully, the system risks:

  • Dropped events (due to restarts or network hiccups)
  • Duplicate processing (due to webhook retries)
  • Coupled latency between webhook and downstream scheduling

So let’s break this down across three core responsibilities:

  1. Immediate Acknowledgment, Deferred Processing
💡 The cardinal rule of webhook design is: Acknowledge fast, process later.

Webhook providers like GitHub, GitLab, or Bitbucket expect a 2xx response within a few seconds. If your server is slow or restarts mid-request, they’ll retry — sometimes multiple times — resulting in either duplicates or missed events.

To solve this:

  • As soon as a valid event is received, we acknowledge with 202 Accepted.
  • Then we enqueue the event into a durable buffer such as Kafka. This design ensures that event ingestion is fast and reliable — we can immediately acknowledge the webhook provider and absorb bursts of traffic without overloading downstream services.

By decoupling ingestion from processing, we allow workflow builders and schedulers to scale independently, apply retries safely, and recover gracefully from failures. The durable log also gives us strong operational benefits: replay for debugging or backfilling, ordered consumption per repository, and auditability of every event received. Together, these properties protect the system against spikes, backpressure, and downstream outages while preserving correctness and resilience.
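
A minimal sketch of acknowledge-fast-then-enqueue, assuming Flask and the kafka-python client; the endpoint path, topic name, and payload handling are illustrative only.

    import json

    from flask import Flask, request
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    @app.route("/webhooks/git", methods=["POST"])
    def ingest_webhook():
        raw = request.get_data()  # raw body, also needed for signature verification
        # Signature verification and deduplication (covered elsewhere) would run here.
        payload = json.loads(raw or b"{}")
        # Key by repository so Kafka preserves per-repo ordering.
        repo_key = str(payload.get("repository", {}).get("full_name", "unknown")).encode()
        producer.send("git-events", key=repo_key, value=raw)
        return "", 202  # acknowledge fast; the heavy lifting happens downstream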

  2. Deduplication for Effectively-Once Delivery

Even with fast responses, webhook providers retry on failures — which introduces at-least-once semantics at the network layer. Without proper handling, you could process the same event multiple times — triggering redundant builds, double deployments, or duplicated artifacts.

To handle this, we implement deduplication at the ingress layer.

How it works:

  • Each webhook includes a unique delivery ID in the header:
    • GitHub → X-GitHub-Delivery
    • GitLab → X-Gitlab-Event-UUID
    • Bitbucket → X-Request-UUID
  • We extract this ID and form a key:
  • dedup:{provider}:{delivery_id}
  • This key is checked and stored in Redis with a TTL (e.g., 7–14 days):
    • If it exists → duplicate → skip enqueue
    • If not → first seen → store and proceed

Downstream systems receive each webhook effectively once, despite retries. A global DB table with unique constraint is safer, but slower under burst load — Redis is the sweet spot for scale.
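
A sketch of that check with redis-py, using SET with NX and EX so the check-and-store is a single atomic call; the exact TTL is an assumption within the 7–14 day range above.

    import redis

    r = redis.Redis(host="localhost", port=6379)
    DEDUP_TTL_SECONDS = 14 * 24 * 3600  # ~14 days, roughly matching provider retry windows

    def is_first_delivery(provider, delivery_id):
        """True if this delivery has not been seen before (and records it atomically)."""
        key = f"dedup:{provider}:{delivery_id}"
        # SET key 1 NX EX ttl succeeds only if the key does not already exist.
        return bool(r.set(key, 1, nx=True, ex=DEDUP_TTL_SECONDS))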

💡 What if the provider doesn’t send a UUID?

When the provider doesn’t send a UUID, construct a natural key per event type + a payload digest fallback:

Natural keys (examples):

  • push: {repo_id}:{ref}:{after_sha}
  • pull_request: {repo_id}:{pr_number}:{action}:{head_sha}
  • issue_comment: {repo_id}:{issue_number}:{comment_id}:{action}
  • tag/release: {repo_id}:{event_type}:{release_id|tag_ref}:{action}
  • pipeline/workflow: {repo_id}:{run_id|workflow_id}:{action} (if present)

  3. Durable Event Logging & Replay

Even with deduplication and queuing, systems still fail — nodes crash, disks fill up, downstream services become unavailable.

To make ingestion resilient to unexpected restarts, we:

  • Log the raw webhook event to a durable store (e.g., blob storage, write-ahead log)
  • Add metadata: delivery ID, received timestamp, validation results
  • Optionally support manual or auto replay of missed events by re-publishing from logs

This gives us:

  • Auditability (What was received?)
  • Observability (How many retries? Which failed?)
  • Replayability (Can we reprocess a dropped event safely?)
📂 Smooth traffic bursts
CI/CD event flow is spiky — e.g., mass pushes during work hours or batch PR merges can overwhelm the scheduler. A durable log (like Kafka) absorbs these spikes so downstream systems can process at a steady pace.

Diagram Updated:

DD2: Horizontal Scalability of Scheduling and Execution

💡 What is the scalability risk?
As CI/CD usage grows across a company—especially in large organizations with hundreds of teams and thousands of daily workflows—your system must scale from tens to thousands of concurrent jobs. At this scale, naive designs break down.

To scale reliably, we adopt 3 core principles:

  1. Partition the Workload

Break the global job queue into shards, based on some natural partitioning key:

  • Repository
  • Organization/tenant
  • Workflow run ID hash

Each shard has its own queue, and schedulers can independently process different shards in parallel.

📦 Example:

Shard 0 → GitHub org: alpha
Shard 1 → GitHub org: beta
Shard 2 → GitHub org: gamma

This approach provides:

  • Scalability: no single bottleneck
  • Fairness: tenants don't starve each other
  • Isolation: failures in one shard don’t impact others

  2. Distributed Workers + Central Orchestrator

Each scheduler instance runs independently, pulling from its assigned shard. It:

  • Selects READY tasks
  • Allocates compute (CPU/GPU/zone-specific)
  • Pushes tasks to a worker agent or container runtime (e.g., K8s)

Workers report back via heartbeats:

  • Status updates: RUNNING, SUCCESS, FAILED
  • Logs, metrics, resource usage

This loop is stateless per worker — all orchestration state lives in a shared DB or distributed state store, which allows:

  • Failover if a worker dies
  • Rehydration after restarts
  • Safe reprocessing of incomplete tasks

  3. Idempotent State Transitions

In distributed systems, things fail and retry constantly:

  • Workers crash mid-run
  • Messages are duplicated or arrive out of order
  • Schedulers race to claim the same task

To avoid inconsistency, every state transition is idempotent.

For example:

  • Before running a task, a worker checks task.status == READY
  • On retry, the same transition (READY → RUNNING) is no-op if already complete
  • Use optimistic locking or versioning to ensure safe updates

This prevents:

  • Double execution
  • Lost updates
  • Race conditions across schedulers
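
As a concrete sketch, the READY → RUNNING claim can be a single conditional update; the table and column names are assumptions, and the cursor is any DB-API cursor using psycopg2-style placeholders.

    def claim_task(cursor, task_id, attempt_id, worker_id):
        """Atomically move a task from READY to RUNNING; only one scheduler wins."""
        cursor.execute(
            """
            UPDATE tasks
               SET status = 'RUNNING',
                   current_attempt_id = %s,
                   claimed_by = %s,
                   version = version + 1
             WHERE id = %s
               AND status = 'READY'  -- no-op if already claimed or completed
            """,
            (attempt_id, worker_id, task_id),
        )
        return cursor.rowcount == 1  # True only for the scheduler that won the claim

Because the WHERE clause filters on the current status, replaying the same transition is a no-op, which is exactly the idempotency property described above.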

  4. Elastic Scaling with Autoscaling + Hot Shard Protection

To scale under burst load:

  • Use autoscaling policies for workers and schedulers based on queue depth, job latency, or CPU utilization
  • Detect and isolate hot shards (e.g., one repo pushing constantly)
    • Apply per-tenant rate limits
    • Queue jobs separately by org or workflow type
    • Prioritize based on SLA, priority class, or fairness policy (e.g., weighted fair scheduling)
💡 Summary
Scaling CI/CD scheduling isn't about blindly adding more workers. It requires:
  • Sharded queues for isolation and throughput
  • Durable, centralized orchestration state
  • Idempotent transitions to handle retries, failures, and concurrency
  • Fairness policies and autoscaling to deal with bursty load

With this design, the system becomes elastic, reliable, and multi-tenant aware, scaling to thousands of concurrent workflows while preserving correctness and visibility.

Diagram Updated:

DD3: Durable Workflow Orchestration with Retries, Timeouts, and Cancellations

💡 What is needed for a durable system?

In a CI/CD system, workflows are rarely a single task. They’re composed of multiple interdependent jobs—builds, tests, deploys, smoke checks—organized as a directed acyclic graph (DAG). To ensure correctness and resiliency across these workflows, we need an orchestration layer that is:

  • Durable — survives crashes, reboots, and network partitions
  • Idempotent — reprocessing won’t cause side effects or duplicates
  • Reactive — handles retries, timeouts, and cancellations cleanly

Let’s walk through the key design decisions for a durable orchestrator.

Rely on Persisted State

A naive system might hold the workflow DAG in memory or rely on in-flight messages for orchestration logic. But this design is fragile:

  • If a scheduler crashes mid-run, all state is lost.
  • If a retry or timeout is triggered mid-transition, the job may be double-executed or skipped.

Solution: Treat workflow state as durable and queryable data.

This includes:

  • WorkflowRun: overall execution status and metadata
  • Task: DAG nodes with job definitions and dependencies
  • TaskAttempt: every execution attempt with exit code, timestamps, and retry info
  • ExecutionEvent (Logs): transitions (e.g., RUNNING, CANCELED) as immutable logs

This data can be stored in a transactional store (e.g., Postgres, Spanner) or a state machine backend like Temporal/Cadence.

Apply Retries with Backoff

Failures happen — flaky integration tests, network timeouts, container pull errors. A resilient orchestrator supports automatic retries, using:

  • Retry budget: max retry attempts (e.g., 5)
  • Backoff strategy: exponential backoff with jitter (e.g., 1 sec, 2 sec, 4 sec, 8 sec, 16 sec etc)
  • Retryable error classification: don’t waste resources on non-retryable errors (e.g., network timeouts = retryable; config errors = not retryable)

Each retry attempt creates a new TaskAttempt, preserving a clean audit trail and avoiding mutation of past runs.

📘 Why retry policies must be persisted?

So reconcilers or workers can pick up where the previous one left off, without relying on memory or timers alone.
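
A sketch of computing the next retry time with exponential backoff and full jitter; the default base, cap, and retryable error classes are assumptions for illustration.

    import random
    from datetime import datetime, timedelta, timezone

    RETRYABLE_ERRORS = {"network_timeout", "worker_lost", "image_pull_backoff"}

    def should_retry(error_class, attempt, max_attempts=5):
        """Only spend the retry budget on errors classified as retryable."""
        return error_class in RETRYABLE_ERRORS and attempt < max_attempts

    def next_retry_at(attempt, base_seconds=1.0, cap_seconds=300.0):
        """Exponential backoff with full jitter: delay in [0, min(cap, base * 2^attempt)]."""
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delay = random.uniform(0, ceiling)
        return datetime.now(timezone.utc) + timedelta(seconds=delay)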

Enforce timeouts

Jobs may hang indefinitely — due to a misbehaving process, a deadlock, or data skew. The orchestrator must enforce:

  • Hard timeouts (e.g., 15 minutes per job)
  • Global workflow TTLs (e.g., fail all after 2 hours)
  • Graceful termination (SIGTERM → wait → SIGKILL)

Timeouts must be reconciled from persisted state so enforcement can resume after restarts or leadership elections.

Key logic:

  • Persist start_time in the DB
  • Calculate deadline = start_time + timeout
  • Periodically check now > deadline and apply cancellation
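
A sketch of that reconciliation loop over persisted state; the table shape, the db connection object, and the cancel_attempt helper are assumptions.

    import time
    from datetime import datetime, timedelta, timezone

    def reconcile_timeouts(db, cancel_attempt, poll_seconds=30):
        """Periodically time out RUNNING attempts whose deadline has passed.
        `db` is assumed to be a DB-API connection; `cancel_attempt` is a hypothetical helper."""
        while True:
            now = datetime.now(timezone.utc)
            rows = db.execute(
                "SELECT id, start_time, timeout_seconds FROM task_attempts WHERE status = 'RUNNING'"
            ).fetchall()
            for attempt_id, start_time, timeout_seconds in rows:
                deadline = start_time + timedelta(seconds=timeout_seconds)
                if now > deadline:
                    cancel_attempt(attempt_id, reason="TIMED_OUT")  # SIGTERM, grace period, then SIGKILL
            time.sleep(poll_seconds)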

Deal with Cancellations Gracefully

Users may want to cancel workflows due to stale branches, broken configs, or misfires. Cancellations can also cascade from failed parent jobs.

In our philosophy of design, each cancellation must:

  • Update the workflow and task status to CANCELED
  • Kill any running containers via the executor
  • Skip downstream tasks (mark as SKIPPED(canceled))
  • Emit cancellation events for logging and UI to users

Again, cancellations must be persisted — so retries don’t revive canceled jobs, and state is cleanly propagated through the DAG.

📘 Idempotent Transitions

All workflow transitions (e.g., PENDING → RUNNING → SUCCESS) must be idempotent, because:

  • Events may be replayed
  • Reconcilers may crash mid-transition
  • Messages may be delivered more than once

To support this:

  • Use versioned updates (e.g., optimistic locking or last_updated_at)
  • Store transitions in append-only logs (ExecutionEvent) for audit and debugging
  • Ensure executors handle duplicate “run” requests without side effects (e.g., checking if task already completed)

In essence, workflow orchestration is not just about job execution—it’s about managing correctness over time, despite failure. Durable, idempotent state is the backbone of that correctness.

Diagram Updated:

DD4: Fault Tolerance Across Jobs and Infrastructure Failures

💡 What is the problem?

In distributed CI/CD systems, failure is not an edge case — it’s the expected operating condition. Machines crash, containers get killed, disks fail, and sometimes entire regions go offline.

The goal of this deep dive is to show how to design fault tolerance across job execution, orchestration, and infrastructure layers — so that jobs remain resilient and recoverable, and failures stay invisible to the end user.

Before we discuss how to defend, let’s understand what can go wrong:

  • Node crash: worker node rebooted during job execution
  • Network partition: job is running but can’t reach the orchestrator or vault
  • Container eviction: K8s OOM killer or spot instance termination
  • Executor bug or panic: internal crash during log upload or status report
  • Full zone/region outage: cloud outage or control plane unavailability

Without protection, these failures can lead to:

  • Zombie jobs that run forever but are invisible
  • Double execution due to retries
  • Stuck workflows due to orphaned states

The following solutions address these failure modes:

  1. Durable Checkpointing of Job State

Each job should report durable status transitions:

  • PENDING → RUNNING → SUCCESS/FAILED/TIMED_OUT/CANCELED

These transitions are persisted in a central DB (e.g., Postgres, Spanner) and logged as immutable ExecutionEvents. No job status lives in memory alone.

This allows the orchestrator to recover current state after a crash — for example:

  • If a node dies mid-job, the orchestrator sees RUNNING without heartbeat
  • After timeout, the job is retried or failed cleanly

Even logs, artifacts, and metadata should be uploaded incrementally (e.g., logs every 5s) so progress isn’t lost on failure.

  2. Idempotent Execution & Transitions

Retries are inevitable — whether due to worker restarts or network retries.

To avoid duplication and corruption:

  • Each job run is tagged with a unique task_attempt_id
  • Executors check current state before transitioning
  • State updates use optimistic locking or version fields

This means:

  • If a task is marked SUCCESS, replaying a RUNNING event is ignored
  • If a task is retried, only the new attempt ID is allowed to write

  3. Retry on a Different Runner

If a worker crashes mid-execution:

  • The orchestrator detects loss of heartbeat (e.g., 30s TTL)
  • Marks current attempt as FAILED(worker_lost)
  • Enqueues a new attempt on a different worker
  • Downstream steps continue only after a clean success

This ensures no dependency is left incomplete — even if multiple failures occur.

4. Graceful Cancellation & Timeout Handling

When cancelling or timing out a job:

  • Send SIGTERM to the container
  • Wait for a grace period (e.g., 10s), then send SIGKILL
  • Mark the job as CANCELED or TIMED_OUT durably

Downstream jobs depending on this step are skipped with reason propagation, ensuring the DAG stays consistent.
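
For a runner that manages the job as a local subprocess, the terminate-with-grace-period sequence looks roughly like the sketch below; in Kubernetes, the equivalent behavior comes from pod deletion with terminationGracePeriodSeconds.

    import subprocess

    def terminate_gracefully(proc: subprocess.Popen, grace_seconds=10):
        """Send SIGTERM, wait up to the grace period, then SIGKILL if still running."""
        proc.terminate()  # SIGTERM: let the job flush logs and clean up
        try:
            return proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL: hard stop
            return proc.wait()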

Diagram Updated:

DD5: Real-Time Logs and Status Updates at Scale

💡 What is needed?

In modern CI/CD systems, developers expect real-time visibility — they want to tail logs as jobs execute, see step-by-step status updates, and debug failures on the fly. This is easy in small systems. But at scale — with thousands of jobs running concurrently — it becomes a distributed systems problem involving throughput, backpressure, fanout, and cost control.

A simple implementation might let workers stream logs directly to connected clients or store them immediately in a relational DB. This works fine at small scale, but breaks down quickly:

  • Storage overload: log spam floods disk and IOPS (e.g., job loops printing 1000s of lines)
  • Client backpressure: browsers lag or crash if logs are streamed unthrottled
  • Poor fault tolerance: if the log stream is lost mid-run, logs disappear permanently
  • No replay or historical search: you can’t “go back” to review logs after completion

We need a more robust architecture: Pub/Sub + Object Storage Model. The modern approach separates log ingestion, log fanout, and log storage into three distinct layers:

Ingestion: Worker to Log Ingestor (Pub/Sub)

As the job runs, the worker:

  • Streams logs in small batches (e.g., every 2–5 seconds or every 8KB)
  • Publishes them into a log ingestion pipeline, such as Kafka

Each log line is tagged with:

  • workflow_run_id
  • task_id and attempt_id
  • timestamp, stream_type (stdout / stderr)

This stream is durable and replayable, so logs aren’t lost if the client disconnects or crashes.

Fanout: Live Log Tailing to Clients

Clients (CLI, web UI, API consumers) connect via:

  • WebSocket or SSE (Server-Sent Events)
💡 WebSocket or SSE

Default to SSE for CI/CD notifications, use WebSocket only when you truly need full duplex.

When to use SSE (recommended for most CI/CD UIs): Primarily server→client updates — build/deploy status, step progress, live log tail.

When to use WebSocket: You need client→server interactivity on the same stream — attaching an interactive shell, pausing/resuming steps, log search with server-side cursoring, or binary payloads (e.g., PTY streams).

  • Query APIs like:
  • GET /logs/{workflow_run_id}/{step}?tail=true

The UI subscribes to the log stream using Kafka consumer groups or a log gateway service that:

  • Buffers and debounces updates to avoid overloading the browser
  • Sends periodic heartbeat messages to indicate activity
  • Handles reconnect/resume using checkpoint offsets
💡 To prevent overloading clients:
  • Apply log rate limits per user/session
  • Collapse repeated lines or debounce updates
  • Offer filters (e.g., only stderr, only step X)
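
A minimal sketch of an SSE tailing endpoint with Flask, reading live frames from a per-run Redis pub/sub channel; the route and channel naming are assumptions.

    import redis
    from flask import Flask, Response

    app = Flask(__name__)
    r = redis.Redis()

    @app.route("/logs/<workflow_run_id>/stream")
    def stream_logs(workflow_run_id):
        def generate():
            pubsub = r.pubsub()
            pubsub.subscribe(f"runs:{workflow_run_id}:logs")
            for message in pubsub.listen():
                if message["type"] == "message":
                    # SSE frame format: "data: <payload>\n\n"
                    yield f"data: {message['data'].decode()}\n\n"
        return Response(generate(), mimetype="text/event-stream")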

Storage: Persist to Object Storage for History

Once a job completes (or after time windows), logs are:

  • Aggregated into chunked files (e.g., gzip-compressed text)
  • Uploaded to object storage (e.g., S3, GCS)
  • Indexed with metadata for search and retrieval

Storage paths follow a deterministic format:

/logs/{workflow_run_id}/{step_name}/{attempt_id}/stdout.log.gz

This enables:

  • Long-term retention
  • Retrospective debugging and sharing
  • Compliance with audit/logging policies

Updated Diagram:

DD6: Secure Management of Secrets in a Multi-Step Workflow

In any CI/CD system, workflow jobs often need access to sensitive credentials — such as cloud provider API keys, Docker registry tokens, or deployment SSH keys. But with great flexibility comes great risk: improper handling of secrets can lead to credential leaks, unauthorized access, or insider abuse.

There are two common injection methods: environment variables and volume mounts.

  1. Environment Variables

Secrets are pulled at task startup and injected into the container’s environment:

Advantages:

  • Simple to implement and use — most apps can consume secrets directly from env vars.
  • Doesn’t require changes to file paths or mounted volumes.
  • Compatible across all container runtimes.

Trade-offs:

  • Secrets appear in the process environment; if the app crashes or logs its env, they can leak.
  • Available to every process inside the container — no fine-grained scoping.
  • Can’t restrict read permissions or rotate easily within long-lived processes.

Best for: short-lived jobs, low- to medium-sensitivity secrets, language-agnostic workflows.

  2. Volume Mounts (e.g., tmpfs or projected volumes)

Secrets are written into ephemeral, in-memory volumes and mounted into the container file system at known paths (e.g., /secrets/aws/token).

Advantages:

  • Secrets can be scoped tightly with filesystem permissions (chmod 400, owned by the job user).
  • Avoids putting secrets in process environment.
  • Some systems (e.g., Kubernetes, Vault Agent) can auto-rotate secrets on disk.

Trade-offs:

  • Requires the app to read from a file path — might need config changes.
  • Slightly more complex setup and cleanup logic.
  • Mount points must be scrubbed after job completion.

Best for: high-sensitivity secrets, longer-lived jobs, or workflows with file-based integrations (e.g., service account JSONs).

💡 Additional Design Principles
  1. Store secrets in a centralized vault
    • Use systems like HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager
    • Secrets are versioned, access-controlled, and audited
    • Secret values are never hardcoded in config files or repos
  2. Inject secrets only at runtime
    • Secrets are fetched right before task execution
    • Injected into the container using:
      • Environment variables (e.g., AWS_ACCESS_KEY_ID)
      • In-memory volume mounts (tmpfs, projected ServiceAccount tokens)
    • No secrets exist in the scheduler or orchestrator memory
  3. Scope secrets per job or step
    • Jobs only receive secrets they explicitly declare
    • You can’t accidentally inherit secrets from other steps
    • Example:
    jobs:
      deploy:
        secrets: [aws-prod-deploy-token]
  4. Revoke secrets after use
    • Secrets are cleaned from memory and temp volumes after container exit
    • Prefer short-lived session tokens (e.g., STS tokens, JWTs with TTL)
    • Optionally rotate or invalidate secrets post-deployment

DD7: Artifact Management and Sharing Among Jobs

The artifact layer handles artifacts—the data that flows between jobs—by storing them in versioned, immutable object storage, ensuring that jobs remain deterministic and that deployments can be traced, verified, and even rolled back without rebuilds.

Once jobs start running, they may produce data needed by downstream jobs—that’s where artifacts come in.

Jobs declare:

  • artifacts.outputs: files or directories to upload after success.
  • artifacts.inputs: required assets from upstream jobs before execution.

We upload outputs to object storage (e.g., S3, GCS) using structured paths like:

/artifacts/{workflow_run_id}/{job_name}/{artifact_name}

Inputs are validated:

  • Existence: must be present before the task is marked ready.
  • Checksum: verify SHA256 hash to prevent corruption.
  • Access control: tasks can only access artifacts declared in config.

This design is rooted in three principles:

  • Artifacts are immutable. Once uploaded, they can’t be mutated—this guarantees reproducibility.
  • Deployments are traceable. We can trace any binary in production back to the Git SHA, the build task that produced it, and the test task that validated it.
  • Rollbacks are instant. We can redeploy a prior artifact version without triggering a rebuild, just by restoring from object storage.
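
A hedged sketch of publishing an output to an immutable, versioned path with a recorded checksum, using boto3; the bucket name and metadata fields are assumptions.

    import hashlib

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "ci-artifacts"  # assumed bucket name

    def publish_artifact(workflow_run_id, job_name, artifact_name, path):
        """Upload an output under a deterministic, versioned key and return its SHA256."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        key = f"artifacts/{workflow_run_id}/{job_name}/{artifact_name}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=data,
                      Metadata={"sha256": digest, "workflow_run_id": workflow_run_id})
        return digest  # recorded in artifact metadata so downstream jobs can verify integrity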

DD8: Build Artifacts and Optimization

There are several key design decisions when we talk about artifact building:

  • Accurate change detection
    • Track all build inputs: source files, transitive dependencies, compiler flags, toolchains, environment variables.
    • Only rebuild when a relevant input fingerprint changes.
  • Dependency correctness
    • Build order is derived from an explicit Directed Acyclic Graph (DAG) of targets.
    • A downstream target must never run before its prerequisites succeed.
    • Missing or misdeclared deps should be surfaced as errors, not silently ignored.
  • Reproducible builds
    • Same source + same environment ⇒ identical output bits.
    • No reliance on developer machine state, $PATH, or unpinned tools.
  • Work reuse / caching
    • Local cache to skip recompilation on repeat runs.
    • Remote cache to share build outputs across engineers and CI agents.
💡 Change Detection & Caching Details
  • Hashes: SHA256 of file contents, flags, compiler, platform.
  • Rebuild scope: Only nodes whose hashes changed + their downstream dependents.
  • Remote cache: Dedupe work across dev laptops & CI runners; must handle:
    • Key stability (hash of inputs must be deterministic)
    • Eviction policies for space
    • Security (signed artifacts / ACLs)
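
A sketch of deriving a deterministic cache key from declared build inputs (file contents, flags, toolchain, platform); any change to an input yields a new key and therefore a rebuild.

    import hashlib

    def cache_key(source_files, flags, toolchain, platform):
        """Stable fingerprint over all declared build inputs."""
        h = hashlib.sha256()
        for path in sorted(source_files):  # sort for key stability
            h.update(path.encode())
            with open(path, "rb") as f:
                h.update(hashlib.sha256(f.read()).digest())
        for flag in flags:
            h.update(flag.encode())
        h.update(toolchain.encode())
        h.update(platform.encode())
        return h.hexdigest()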

DD9: Globally Continuous Deployment

Globally Continuous Deployment is an approach where every code change, once built and tested, can be safely and automatically released to all regions and environments without manual gating. The system coordinates rollouts across data centers or clouds, applies staged releases (canary → progressive → full), monitors health signals in real time, and can roll back instantly if needed.

📄 Shipping large artifacts globally?

For globally continuous deployment, build immutable, content-addressable artifacts (e.g., digest-pinned images), replicate them to regional registries or edge caches, verify availability before rollout, then orchestrate a safe staged release (canary → progressive → full) across regions with real-time health checks and fast rollback.
