💡 What is ChatGPT? What is ChatGPT Playground?

ChatGPT is an AI system that can understand text you type and generate human-like responses. You give it a prompt — a question, a task, or a piece of text — and it produces a coherent continuation or answer. It’s powered by large language models trained by OpenAI, but for this system design, you can think of it simply as an API that takes text input and returns text output.

The Playground is a lightweight, interactive web tool that lets users experiment with ChatGPT.

It provides a simple interface where you can:

  • type any prompt,
  • adjust response settings (like creativity or output length),
  • and see ChatGPT’s response instantly.

It’s essentially a sandbox for testing prompts and model behavior, without needing to write code or build an application. For our system design, the Playground represents the minimal product that lets users play with ChatGPT through a clean UI backed by a simple backend service.

Scope & Assumptions (What We’re Actually Designing)

We are designing the application backend for “The Playground” plus just enough frontend behavior to make the data flow clear.

In Scope

  • A backend service that:
    • Accepts prompts + user-selected parameters from the UI.
    • Validates and normalizes those parameters.
    • Calls a pre-trained, externally hosted GPT API (think “given & reliable black box”).
    • Streams or returns completions back to the frontend.
  • Persistence + APIs for:
    • Saving presets (prompt + parameters).
    • Listing/loading existing presets.
  • Basic observability on our side (logging, metrics, simple rate limiting) as part of the backend design.
  • A very light frontend assumption:
    • Collect prompt + params
    • Display generated tokens as a stream

Non-Goals / Out of Scope

These are explicitly outside the scope of this system design:

  • Model Training / Tuning
    • No pretraining, fine-tuning, RLHF, or model-selection algorithms.
    • We treat the GPT model as a pre-trained, already-deployed API managed by another team.
  • Core ML / Infra Architecture
    • No design of GPU clusters, model sharding, serving infra, or vector stores.
    • Latency/throughput considerations are only for our backend, not the internal workings of the model service.
  • Rich Frontend / UX Design
    • No detailed React component tree, styling system, or complex client-side state management.
    • We assume the frontend can display text and controls and can handle simple API responses.
  • Auth, Billing, and User Management
    • We assume the caller is already authenticated (e.g., behind company SSO or API gateway).
    • No payments, quota accounting, or subscription/billing flows.
System Scale
Metric Value Explanation
Daily Active Users (DAU) 10 million Core base of engaged, logged-in users generating prompts
Avg Prompts per User per Day 10 Typical user iterates 10 times with retries, tweaks, follow-ups
Daily Prompt Volume 100 million prompts/day 10M × 10
Avg Prompt QPS 1,000 QPS 100M prompts / 100,000 seconds
Peak Prompt QPS 10,000 QPS 10× burst factor to handle peak usage hours
Concurrent Generations 100,000 10K QPS × 10s per generation = 100K live responses
Tokens per Prompt (input + output) 1,000 tokens ~300 in + ~700 out
Daily Token Throughput 100 billion tokens 1K tokens × 100M prompts
Preset Saves per Day 10 million Assume 1 in 10 prompts leads to a save action
Concurrent Logged-In Users / Sessions 1 million Editing, exploring, reading, not actively submitting
Preset DB Growth 10 GB/day 1KB per preset × 10M/day = 10GB
Model Calls per Second (with retries) 10–20K/sec 10K baseline + retries/tool calls = max 20K
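
These figures are deliberately round. As a sanity check, here is a tiny Python sketch that reproduces the arithmetic in the table (the constants are the table's assumptions, not measurements):

DAU = 10_000_000              # daily active users
PROMPTS_PER_USER = 10         # average prompts per user per day
SECONDS_PER_DAY = 100_000     # 86,400 rounded up for easy math
TOKENS_PER_PROMPT = 1_000     # ~300 in + ~700 out
PRESET_SIZE_BYTES = 1_000     # ~1 KB per saved preset

daily_prompts = DAU * PROMPTS_PER_USER                              # 100M prompts/day
avg_qps = daily_prompts / SECONDS_PER_DAY                           # ~1,000 QPS
peak_qps = 10 * avg_qps                                             # ~10,000 QPS with a 10x burst factor
concurrent_generations = peak_qps * 10                              # ~10s per generation -> ~100K live streams
daily_tokens = daily_prompts * TOKENS_PER_PROMPT                    # ~100B tokens/day
preset_growth_gb = (daily_prompts / 10) * PRESET_SIZE_BYTES / 1e9   # ~10 GB/day of new presets

print(daily_prompts, avg_qps, peak_qps, concurrent_generations, daily_tokens, preset_growth_gb)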

Functional Requirements

FR1 – Submit Prompt & View Completion

Users should be able to submit a free-text prompt and view the model response typed out live.

Use case:

A marketing intern types: “Write a tagline for an ice cream shop” and clicks Submit.

The UI streams back: “Taste the Joy of Summer at Our Creamery!”, typed out like a chat response.

FR2 – Adjust Generation Parameters

Users can tweak generation settings like temperature, max tokens, etc., and observe how output changes.

Use case:

A copywriter reruns the same tagline prompt with temperature=0.2 and then 0.9, comparing the creative differences.

FR3 – Save Prompt as a Preset

Users can save a prompt and its parameters as a reusable named preset.

Use case:

After crafting a strong “Brand Tagline Generator,” the user taps Save, names it “Taglines – Playful,” and stores it.

FR4 – Search, Load & Run Existing Presets

Users can search for and run saved presets using new input text.

Use case:

The user types “summarize” into the preset search bar, selects “Summarize for a 2nd grader” from the filtered list, pastes in a paragraph, and clicks Submit to get a simplified version.

Non-Functional Requirements

NFR1 – Latency: Time-to-First-Token ≤ 300 ms, End-to-End Completion ≤ 2 sec (p95)

Why this matters:

In a system handling 10,000 prompt submissions per second, latency isn’t just about speed — it directly affects perceived quality and user engagement. A slow UI feels broken even when the system is technically functional. Instant token streaming gives users a sense of progress and keeps the feedback loop tight — especially critical when users are rapidly iterating prompts.

Deep dive challenge:

How do we prevent slow token starts or long tails from degrading UX?

NFR2 – Availability: 99.9% Success Rate Across Prompt Submission Flow

Why this matters:

At 100M prompts/day, a 0.1% failure rate still means 100,000 broken generations daily. Users expect reliability from AI tools, especially when using saved presets or collaborating across sessions. High availability ensures the Playground remains usable even when parts of the system — like the model — are temporarily degraded, without breaking prompt editing, UI rendering, or preset loading.

Deep dive challenge:

How do we prevent a partial outage from breaking all completions?

NFR3 – Rate Limiting: ≤ 60 Requests per User per Minute, ≤ 2 Concurrent Generations per User

Why this matters:

In a 10M DAU system, even 1% of users misbehaving (intentionally or not) could generate millions of excess requests per minute, potentially spiking model cost and degrading experience for others. Enforcing smart limits and cool-downs protects system stability while keeping honest users unaffected — especially during bursty activity or tab abuse.

Deep dive challenge:

How do we prevent spam from one user from choking the system?

NFR4 – Scalability: Support 10k QPS and 100k Concurrent Streaming Sessions

Why this matters:

At scale, usage is never evenly distributed. Peaks, viral usage spikes, and time zone overlaps mean your system must elastically scale while still streaming responses in real time. Supporting 100K concurrent streams without running out of memory, dropping connections, or causing cold-start delays is critical to avoid backlogs and latency cliffs.

Deep dive challenge:

How do we prevent sudden traffic spikes from overwhelming stream infra?

High Level Design (Delivery Framework: API → Entity → Workflow → Diagram)

FR1 – Submit Prompt & View Completion

💡 Note on Assumptions & Model

This is a working solution (not perfect), focused on core functionality. Here are some assumptions reasonably made:

  • User is already authenticated (via SSO or token).
  • Model is a hosted GPT-3 variant (e.g., text-davinci-003) accessed via API; we refer to it simply as GPT‑3 in the rest of this doc.
  • We do not persist prompt data — we process and stream responses in real time.

API Signature

Request

{ "event": "generate", "request_id": "uuid-1234", // Optional idempotency key "prompt": "Write a tagline for an ice cream shop", // User's input prompt "temperature": 0.9, // Controls randomness (0 = deterministic, 1 = creative) "max_tokens": 256, // Limits response length "top_p": 1, // Top-p sampling: limits to top cumulative probability mass "frequency_penalty": 0.5, // Penalizes repeated phrases "model": "text-davinci-003", // GPT-3 variant "stream": true // Enables real-time token streaming via WebSocket }

Response

{ "request_id": "uuid-1234", "completion": "Taste the Joy of Summer at Our Creamery!" }

(Optional) Stop Request

{ "event": "stop", "request_id": "uuid-1234" }

Entity

PromptRequest: This is a short-lived object used to validate and relay a single generation request. This object exists in memory for the duration of the request. It is not saved to a database unless explicitly logged.

Field Type Description
request_id UUID Optional deduplication key
user_id String Who submitted the prompt
prompt Text Raw prompt text from user
model String Model name (e.g., GPT‑3)
temperature Float Creativity level
max_tokens Integer Response limit
top_p Float Sampling restriction
frequency_penalty Float Controls repetition
timestamp ISOTime When the request was made
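
For illustration only, a minimal in-memory representation of this entity might look like the sketch below (a Python dataclass; the default values are assumptions, and nothing here implies a storage schema):

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PromptRequest:
    # Short-lived, in-memory request object; never persisted unless explicitly logged.
    user_id: str
    prompt: str
    model: str = "text-davinci-003"
    temperature: float = 0.7
    max_tokens: int = 256
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    request_id: Optional[str] = None   # optional idempotency / dedup key
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))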

Workflow: Prompt Submission via Streaming

  1. User types a prompt and chooses parameters in the Playground UI (we assume a WebSocket connection has already been established).
  2. Frontend sends a WebSocket message representing the prompt submission, matching the POST /v1/completions schema. This schema aligns with our REST spec for composability, but is transmitted over WebSocket.
  3. API Gateway receives the request:
    • It does not authenticate (auth is assumed to be handled upstream, e.g., via login/session cookie or API token).
    • It applies basic rate limits and usage quotas (we will discuss later).
  4. Completions Service:
    • Validates parameters (e.g., max tokens, numeric ranges)
    • Checks request_id for idempotency
    • Constructs a model-ready payload and sends it to GPT-3 (black box)
  5. GPT-3 begins processing — it takes time to run inference but starts returning tokens one by one as they’re ready.
  6. Tokens are streamed back to the Completions Service, which:
    • Uses WebSocket to relay tokens to the user’s browser in real time
    • Each token is immediately sent as a WebSocket event — the user sees the text “typing out”
  7. Frontend renders the streamed tokens linearly until the model signals the end of generation.
💡 Why do we use WebSocket? What are the other options?

WebSocket is the right fit for streaming GPT completions in the Playground because:

  • Bi-directional communication: WebSocket allows the client to cancel a generation midstream — critical for responsiveness when users hit "Stop" or edit their prompt.
  • Low latency, persistent connection: A single WebSocket stays open for real-time token delivery, reducing the overhead of repeated HTTP requests.
  • Backpressure handling: Enables the server to pause or slow the stream if the client can’t keep up — useful for throttling and flow control.
  • Typed UX compatibility: WebSocket easily supports “typing” behavior — streaming partial tokens while still letting the client send messages (e.g., cancel, edit, feedback) at any point.
Other Options Why It Falls Short
SSE (Server-Sent Events) One-way only (server → client). Can’t support “stop generation” or live client input midstream.
HTTP Polling / Long Polling High latency, inefficient under load. Doesn’t scale for 100K+ concurrent sessions.
gRPC Streaming Optimized for backend services, not browsers. Poor support in JS clients, adds transport overhead.
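
To make the bi-directional flow concrete, here is a minimal client sketch using the Python websockets library. The endpoint URL and the token/done message fields are assumptions for illustration; the real Playground schema may differ.

import asyncio
import json
import websockets  # third-party "websockets" package

async def run_prompt():
    # Hypothetical endpoint; auth is assumed to be handled upstream (cookies / gateway).
    async with websockets.connect("wss://playground.example.com/ws") as ws:
        await ws.send(json.dumps({
            "event": "generate",
            "request_id": "uuid-1234",
            "prompt": "Write a tagline for an ice cream shop",
            "temperature": 0.9,
            "max_tokens": 256,
            "stream": True,
        }))
        async for raw in ws:                 # each streamed token arrives as its own message
            msg = json.loads(raw)
            if msg.get("event") == "done":
                break
            print(msg.get("token", ""), end="", flush=True)
            # Bi-directional: the client could cancel midstream at any point, e.g.
            # await ws.send(json.dumps({"event": "stop", "request_id": "uuid-1234"}))

# asyncio.run(run_prompt())  # requires a real Playground WebSocket endpoint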

FR2 – Adjust Generation Parameters

API Signature

This is the same WebSocket message structure as FR1, but FR2 emphasizes how we validate and tune the model’s behavior using exposed parameters.

{ "event": "generate", "request_id": "uuid-5678", "prompt": "Give me 3 startup ideas", "temperature": 0.3, // Lower = more focused, deterministic "max_tokens": 150, // Shorter completion "top_p": 0.8, // Narrower sampling pool "frequency_penalty": 0.7, // Avoid repeated ideas "model": "text-davinci-003", // GPT-3 variant "stream": true // Stream tokens via WebSocket }

Entity

GenerationParameters - This is a structured sub-object within the prompt request. It is not stored; it is validated and passed through to GPT-3. Every field is optional, and sane defaults are applied when a field is missing.

💡 Completion Parameters (Field Spec)
Field Type Purpose
temperature Float Controls randomness (0–1)
max_tokens Int Limits completion length
top_p Float Top probability sampling filter
frequency_penalty Float Penalizes repetition
model String Which GPT model to use
stream Bool Enables real-time delivery
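
A minimal sketch of the validation-and-defaults step (the ranges, e.g. temperature ∈ [0, 1] and max_tokens ≤ 2048, and the default values are assumptions; the real service may enforce different limits):

DEFAULTS = {"temperature": 0.7, "max_tokens": 256, "top_p": 1.0,
            "frequency_penalty": 0.0, "model": "text-davinci-003", "stream": True}

def validate_params(raw: dict) -> dict:
    """Apply defaults for missing fields and reject out-of-range values before calling the model."""
    params = {**DEFAULTS, **{k: v for k, v in raw.items() if k in DEFAULTS and v is not None}}
    if not 0.0 <= params["temperature"] <= 1.0:
        raise ValueError("temperature must be in [0, 1]")
    if not 1 <= params["max_tokens"] <= 2048:
        raise ValueError("max_tokens must be between 1 and 2048")
    if not 0.0 < params["top_p"] <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if not 0.0 <= params["frequency_penalty"] <= 2.0:
        raise ValueError("frequency_penalty must be in [0, 2]")
    return params

print(validate_params({"temperature": 0.3, "max_tokens": 150}))  # missing fields fall back to defaults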

Workflow: Parameterized Prompt Submission

  1. User adjusts generation settings in the Playground UI via sliders, dropdowns, or advanced options.
  2. Frontend sends a WebSocket message to backend, embedding the adjusted parameters along with the prompt.
  3. API Gateway receives the message and applies user-level rate limits and quotas.
  4. Completions Service:
    • Validates parameter values (e.g. temperature ∈ [0,1], max_tokens ≤ 2048)
    • Fills in sane defaults if optional parameters are missing
    • (Optional) Logs the full GenerationParameters object for A/B testing or observability
    • Assembles a model-ready payload and forwards to GPT-3 API
  5. Model API uses these parameters to shape generation behavior.
  6. Streaming back of tokens proceeds via WebSocket, same as FR1:
    • Each token is sent incrementally
    • Client can hit “Stop” at any time to cancel
    • Optional: log or visualize how different settings affect outputs

Design Diagram

FR3 – Save Prompt as a Preset

API Signature

Request

POST /v1/presets
Content-Type: application/json

{
  "name": "Taglines – Playful",                       // User-assigned display name
  "prompt": "Write a tagline for an ice cream shop",  // Prompt to save
  "temperature": 0.9,
  "max_tokens": 256,
  "top_p": 1,
  "frequency_penalty": 0.5,
  "model": "GPT-3"                                    // Model this preset is tuned for
}

Response

{ "preset_id": "preset_9f3d7c", // Returned ID for later reference "status": "saved" }

Entity

Presets - user-owned, persistent records storing prompt + parameters. We only need a simple schema.

💡 Preset Schema
Field Type Description
preset_id String Unique ID (generated)
user_id String Owner of the preset
name String Display name
prompt Text Prompt text
temperature Float Generation setting
max_tokens Integer Generation setting
top_p Float Generation setting
frequency_penalty Float Generation setting
model String Which model to use (e.g. GPT-3)
created_at Timestamp When the preset was created

Note: Versioning, org-sharing, or metadata tagging can be added later but are not needed in this scoped flow.

Workflow: Save Preset Flow

  1. User edits prompt and generation settings in the Playground UI and clicks Save, entering a preset name.
  2. Frontend sends a POST /v1/presets request over the existing WebSocket connection.
  3. API Gateway receives the request and:
    • Verifies basic auth/session (upstream)
    • Passes it to the Preset Service
  4. Preset Service:
    • Validates required fields (name, valid parameter ranges)
    • Assigns a globally unique preset_id
    • Persists the preset into the Preset DB
    • (Optional: logs metric or audit entry)
  5. Returns confirmation with preset_id to the frontend for future use.
💡 Note: Create-and-Confirm Pattern
This is a classic create-and-confirm pattern — unlike completions, no streaming is needed. All logic is synchronous.
💡 How does the frontend send a POST /v1/presets request over WebSocket?

WebSocket doesn’t natively support HTTP verbs like POST, but in this system we emulate RESTful behavior by sending structured JSON messages over the WebSocket connection.

Here’s how it works in practice:

{
  "op": "POST",
  "path": "/v1/presets",
  "body": {
    "name": "Taglines — Playful",
    "prompt": "Write a tagline for an ice cream shop",
    "temperature": 0.9,
    "max_tokens": 256,
    "top_p": 1,
    "frequency_penalty": 0.5,
    "model": "GPT-3"
  }
}
    

The backend WebSocket handler interprets this as a logical POST /v1/presets call and routes it to the appropriate service (like PresetService), just as if it came from HTTP.
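
A minimal sketch of that envelope routing, with the service calls stubbed out (handler names are illustrative, not the real service code):

def preset_service_create(user_id: str, body: dict) -> dict:
    # Stub standing in for the real Preset Service call.
    return {"status": 200, "body": {"preset_id": "preset_9f3d7c", "status": "saved"}}

def preset_service_search(user_id: str, body: dict) -> dict:
    return {"status": 200, "body": {"presets": []}}

ROUTES = {
    ("POST", "/v1/presets"): preset_service_create,
    ("GET", "/v1/presets"): preset_service_search,
}

def handle_ws_message(message: dict, user_id: str) -> dict:
    """Treat an {"op", "path", "body"} envelope received over WebSocket like an HTTP call."""
    handler = ROUTES.get((message.get("op"), message.get("path")))
    if handler is None:
        return {"status": 404, "error": "unknown route"}
    return handler(user_id, message.get("body") or {})

print(handle_ws_message(
    {"op": "POST", "path": "/v1/presets", "body": {"name": "Taglines – Playful"}},
    user_id="user_42",
))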

This lets us reuse REST-style endpoints while maintaining a single, persistent WebSocket connection for both:

  • Real-time completions (FR1, FR2), and
  • Non-streaming actions like saving or loading presets (FR3, FR4).

Design Diagram

FR4 – Search, Load & Run Existing Presets

API Signature

1. Search Presets

GET /v1/presets?query=summarize&page=1&page_size=10
→ Returns a list of preset metadata matching the search term

2. Load and run a specific preset

POST /v1/completions
Content-Type: application/json

{
  "preset_id": "preset_9f3d7c",              // Preset to use
  "input_context": "Paste article here..."   // User-provided context to apply
}

Note:

  • query is optional; empty returns all presets.
  • preset_id is used to hydrate the prompt + generation parameters.
  • input_context is user-provided runtime content.

Entity

presets - same as FR3

Workflow 1: Search

  1. User searches for presets
    • Types into a search bar (e.g. "summarize")
    • UI sends GET /v1/presets?query=summarize via WebSocket
    • API-GW forwards to Preset Service, which:
      • Filters presets by user_id + fuzzy name match
      • Returns paginated results

Workflow 2: Load → Run

  1. User selects a preset and provides context
    • UI sends POST /v1/completions via WebSocket with:
      • preset_id
      • input_context
  2. Completions Service:
    • Calls Preset Service to hydrate the preset:
      • Fetches prompt + generation parameters by preset_id
      • Verifies user ownership
    • Merges prompt + input_context
    • Constructs full payload and sends it to the model
  3. Model API (e.g. GPT-3):
    • Receives hydrated, parameterized prompt
    • Generates completion tokens
  4. Streaming Response:
    • Tokens are streamed back incrementally via WebSocket
  5. User sees output typed out live

Design Diagram

Summary – What’s Working

💡 What’s Working in the System Design
Area What’s Working
Delivery Framework All 4 FRs follow a clear four-part structure: API, Entity, Workflow, and Design Diagram — making the system easy to follow.
Component Reuse Shared components like Completion Service, Preset Service, and Model API are cleanly reused across FR1–FR4 without unnecessary branching.
Streaming Transport WebSocket is used consistently for all real-time interactions — enabling both incremental token delivery and user-side cancellation.
Auth and Routing API Gateway cleanly handles routing, rate limiting, and session checks (with full authentication assumed upstream) — acting as the single enforcement choke point.
Model Abstraction The GPT-3 inference API is treated as a stateless black box — allowing us to focus on orchestration and delivery without modeling internals.
Preset UX Flow FR3 and FR4 together enable users to save, search, and re-run prompts — supporting a smooth reuse loop and better iteration speed.

Limitations – What’s Not Fully Covered Yet

⚠️ NFR Risks and Gaps in the HLD (Click to Expand)
NFR Open Question What’s Missing in the HLD
Latency (p95 TTFT ≤ 300 ms) How do we handle slow starts and model-side tail latency? We haven’t modeled delays from tokenization, queueing, or cold starts on the Model API. No timeouts, retries, or first-token SLA guard.
Availability (≥99.9% uptime, graceful fallback) What happens when model or preset services fail mid-flow? No degraded mode is defined — we don’t show retry logic, circuit breakers, or fallbacks (e.g., show cached completions, default prompt, or error UI).
Rate Limiting & Abuse Control How do we prevent spam, flooding, or misuse? While API-GW handles rate limiting, we don’t yet define per-user quotas, burst detection, or abuse controls like CAPTCHA or behavioral scoring.
Scalability (10K QPS, 100K concurrent sessions) Can our services handle concurrent WebSocket streams at scale? We use WebSockets but haven’t shown how Completion Service handles pressure — no mention of stream buffer limits, backpressure signaling, or instance autoscaling.

The current high-level design successfully delivers all four core functional requirements with clear APIs, well‑scoped entities, step‑by‑step workflows, and consistent diagrams. Responsibilities are cleanly split between the Completion Service, Preset Service, and API-GW, and the model remains a simple black‑box inference API. All four FRs follow the same routing pattern through API-GW, and WebSocket is used consistently for real-time communication — enabling incremental token streaming, typed‑out UX, and mid‑generation cancellation. Overall, the HLD provides a solid, end‑to‑end foundation for how the Playground accepts prompts, tunes parameters, saves presets, searches them, and streams model output back to users.

However, several deeper system concerns are intentionally left unresolved at this stage. The design does not yet address long‑tail or cold‑start latency from the model API, nor does it specify how to recover from failures in either the model service or the preset database. Rate limiting is currently described only at a high level; we have not covered per-user quotas, abuse protection heuristics, or multi‑tab enforcement. Finally, the HLD does not yet describe how Completion Service and WebSocket infrastructure scale under 10K QPS and 100K+ concurrent streams, including buffer management, backpressure, or autoscaling behavior. These areas will be explored in the upcoming deep dives — one per NFR.

Deep Dives (What breaks the HLD → Options Discussions → Solution)

DD1 - Low Latency - e.g., How do we prevent slow token starts or long tails from degrading UX?

💡 Goal: Ensure fast first-token delivery (Click to Expand)

Goal: Ensure users see the first token typed out quickly (p95 ≤ 300 ms), and that generation continues smoothly without long pauses or jitter.

💡 What Breaks the Current High Level Design? (Click to expand)

Latency regressions in any of these stages can cause slow starts or long-tail delays:

  1. Model Cold Starts
    GPT models can be cold (e.g., spun down for cost-saving or scaling) — leading to 1–3s spin-up time before the first token is even generated.
  2. Delays at Model API
    High QPS or sudden load spikes can cause prompt requests to queue behind other jobs — delaying time-to-first-token even if our own services are healthy.
  3. Streaming Flush & Jitter in WebSocket Path
    Even after the model starts generating, if the Completions Service or WebSocket layer buffers tokens or batches sends too aggressively, tokens arrive in uneven bursts and the frontend sees choppy “typing” instead of smooth, real-time output.

Issue 1: Model Cold Start – How do we minimize or eliminate it?

💡 What Options Do We Have? (Click to expand)

Option 1: Let OpenAI Handle It (No Action)
If we treat the model API as a black box, we accept its internal behavior and cold starts. This is simplest, but leads to unpredictable first-token latency during idle periods or scale-ups.

Option 2: Ping Warmers / Scheduled Requests
We run periodic “warmup prompts” (e.g., no-op dummy prompts every 30s) to keep popular models hot. This avoids spin-up time but adds cost and assumes predictable traffic patterns.

Option 3: Use a Model Proxy Layer (Preferred)
We introduce a lightweight Model Proxy Service that tracks prompt volumes per model and proactively issues warmup calls during low-load windows. It also exposes health metrics (e.g., time-to-first-token trends) to help the Completion Service dynamically route to faster models.

Solution: Option 3 – Use a Model Proxy Layer

We add a Model Proxy Layer between the Completions Service and the GPT-3 API. Its job is to sit in the middle and make smarter, faster calls to the model.

1. Real-time latency monitoring

The proxy tracks how fast each model endpoint returns the first token and how often it fails. If one endpoint is slow or flaky, we simply stop sending traffic there.

2. Smarter warmups

Instead of pinging all the time, the proxy uses recent traffic to decide when to warm a model. It warms when things are about to go cold, and backs off when normal traffic is enough to keep things hot.

3. Better routing

For each request, the proxy chooses the “healthiest” model replica — not the random one. That means fewer cold workers, fewer queues, and faster first tokens.

4. Shield from upstream changes

If the model provider changes how they scale or warm instances, we adapt inside the proxy. The rest of our system still talks to one stable, predictable endpoint.

5. Future-proof hook

Over time, this proxy is where we can add retries, circuit breakers, per-user limits, or shadow traffic for new models — without touching the rest of the Playground.
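
A minimal sketch of the latency-aware routing idea, assuming the proxy keeps a rolling window of time-to-first-token (TTFT) samples per endpoint (endpoint names, window size, and tie-breaking are illustrative):

import random
from collections import defaultdict, deque

class ModelProxyRouter:
    """Pick the healthiest model endpoint based on recent time-to-first-token (TTFT)."""

    def __init__(self, endpoints, window=50):
        self.endpoints = list(endpoints)
        self.ttft_samples = defaultdict(lambda: deque(maxlen=window))  # endpoint -> recent TTFTs (ms)

    def record_ttft(self, endpoint: str, ttft_ms: float) -> None:
        self.ttft_samples[endpoint].append(ttft_ms)

    def pick_endpoint(self) -> str:
        def avg_ttft(ep):
            samples = self.ttft_samples[ep]
            return sum(samples) / len(samples) if samples else float("inf")  # no data = assume cold
        # Prefer the endpoint with the lowest recent TTFT; break ties randomly.
        return min(self.endpoints, key=lambda ep: (avg_ttft(ep), random.random()))

router = ModelProxyRouter(["gpt3-us-east-1", "gpt3-us-west-2"])
router.record_ttft("gpt3-us-east-1", 180.0)
router.record_ttft("gpt3-us-west-2", 950.0)
print(router.pick_endpoint())  # -> "gpt3-us-east-1"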

Issue 2 – Delay in Our Own Completions Service

Even when the model responds quickly, users may still experience visible lag. This often stems from internal queuing delays before a request is even forwarded to the model. These delays typically happen during high load — when the Completions Service receives more prompt submissions than it can concurrently process — resulting in backlogs, slow start times, or even dropped requests if not handled gracefully.

💡 What Options Do We Have? (Click to expand)

Option 1 – Thread-per-Request Model
In this model, each prompt submission is handled by a dedicated thread or coroutine. While this is simple and works well under light load, it breaks down as concurrency grows. Thread exhaustion, high context switching overhead, and unpredictable latency spikes emerge once the number of active requests surpasses system limits. This option also lacks any form of controlled queuing or backpressure.

Option 2 – FIFO Queue + Worker Pool (Preferred)
Requests are added to an internal bounded FIFO queue, and a worker pool of fixed-size consumers pulls jobs off the queue and forwards them to the model API. This model brings order and predictability to the system: incoming requests are accepted if the queue has space, otherwise rejected or retried. Workers can be tuned based on CPU, memory, or expected throughput. Crucially, this allows for latency isolation and better tail behavior, since burst traffic is absorbed temporarily and smoothed out.

Option 3 – Adaptive Load Shedding (Reject Early)
Under this model, the service monitors real-time load (CPU, memory, queue depth) and starts rejecting requests early when under pressure. While this reduces overload risk, it also leads to unpredictable UX, especially for users who experience rejection despite fast model response time. It’s effective only when paired with robust retry logic or global load balancers — and even then, the user-perceived performance might be inconsistent.

Solution: Option 2 – FIFO Queue + Worker Pool

We choose Option 2: FIFO queue with a worker pool, as it gives us controlled admission, predictable latency behavior, and clean decoupling between request ingestion and model execution.

With this approach, the Completions Service introduces an internal request queue, bounded to a size that reflects system capacity (e.g. 5× the number of workers). Incoming prompt submissions are enqueued, and a pool of worker threads or coroutines asynchronously pulls from the queue and processes the request (i.e., validates, constructs payload, and forwards to the Model Proxy Layer). This design allows:

  • Backpressure: Instead of overwhelming the system, bursts are queued and processed at a steady rate.
  • Graceful degradation: If the queue fills, we can reject new requests cleanly or offload to another region.
  • Latency isolation: Spikes from one user or cohort don’t delay others, as each request flows through the same bounded pipeline.
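
A minimal asyncio sketch of this bounded-queue-plus-workers admission pattern (queue and worker sizes are illustrative, and the model call is stubbed):

import asyncio

QUEUE_SIZE = 500        # illustrative: ~5x the worker count
NUM_WORKERS = 100       # illustrative sizing

async def call_model_proxy(request: dict):
    await asyncio.sleep(0.01)                     # stand-in for the real inference call

async def worker(name: str, queue: asyncio.Queue):
    while True:
        request = await queue.get()               # pull the next prompt request
        try:
            await call_model_proxy(request)       # validate, build payload, stream tokens (stubbed)
        finally:
            queue.task_done()

async def submit(queue: asyncio.Queue, request: dict) -> bool:
    try:
        queue.put_nowait(request)                 # admit only if the bounded queue has space
        return True                               # fast ACK: "generation started"
    except asyncio.QueueFull:
        return False                              # fail fast: "system busy, please retry"

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_SIZE)
    workers = [asyncio.create_task(worker(f"w{i}", queue)) for i in range(NUM_WORKERS)]
    accepted = await submit(queue, {"prompt": "Write a tagline for an ice cream shop"})
    print("accepted" if accepted else "busy")
    await queue.join()
    for w in workers:
        w.cancel()

asyncio.run(main())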

Issue 3 – Streaming Flush & Jitter in the WebSocket Path

Even with WebSockets in place, users can still feel the system is “laggy” if we buffer tokens too long before sending them, or if they arrive in uneven bursts. This shows up as choppy typing: nothing for a while, then a chunk of text, then another gap. These delays often come from how the Completions Service and WebSocket handler decide when to flush outgoing messages — especially under load, when we’re tempted to batch sends for efficiency.

💡 What Options Do We Have? (Click to expand)

Option 1 – Fixed-Interval Batching
In this model, the Completions Service buffers tokens and flushes them to the client every N milliseconds (for example, every 50–100ms), or once a certain number of tokens are accumulated. This reduces the number of WebSocket frames and can improve CPU/network efficiency. However, it also introduces artificial delay: a token that’s ready now might wait tens of milliseconds before being sent, and small batching bugs can turn into visible pauses in the UI.

Option 2 – Flush on Every Token (Preferred)
Here, every time the model emits a token, the Completions Service immediately sends it over the WebSocket to the browser. This maximizes responsiveness: the user sees characters appear almost as quickly as the model can generate them. The tradeoff is higher overhead — more frames, more syscalls, more frequent writes — but in exchange we get the smoothest “typing” effect and the clearest mapping between backend progress and frontend experience.

Option 3 – Hybrid Threshold-Based Flush
This approach starts from immediate flushing but introduces small safeguards: for example, flush every token unless we are already flushing very frequently, in which case we briefly buffer for a few milliseconds or a few tokens. The idea is to keep UX smooth while trimming worst-case overhead. The downside is extra complexity and more parameters to tune (time thresholds, token counts), and edge cases where behavior can become harder to reason about.

Solution: Option 2 – Flush on Every Token

We choose Option 2: flush each token as soon as it’s ready, because the Playground is fundamentally a latency-sensitive, UX-first product. The primary goal is to make the response feel alive and interactive, not to minimize the number of frames on the wire. With this strategy, the Completions Service’s WebSocket handler writes tokens to the socket immediately as they arrive from the Model Proxy Layer, and the browser renders them as they land. This keeps the perceived time-to-first-token extremely low and avoids jitter caused by batching. The components involved are the Completions Service WebSocket handler (which controls when to send frames), the API Gateway/WebSocket terminator (which must support low-latency forwarding), and the frontend renderer (which appends incoming tokens directly to the UI). Together, they ensure that as soon as the model speaks, the user sees it — with no extra buffering in between.
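
A minimal sketch of the flush-on-every-token relay (the WebSocket send and the model stream are stubbed, and the message field names are illustrative):

import asyncio
import json

async def stream_tokens(ws_send, token_iterator):
    """Relay each model token to the client the moment it arrives (no batching).

    ws_send is whatever coroutine the WebSocket layer exposes for sending a frame;
    token_iterator is the async stream of tokens coming back from the Model Proxy.
    """
    async for token in token_iterator:
        await ws_send(json.dumps({"event": "token", "token": token}))  # flush immediately
    await ws_send(json.dumps({"event": "done"}))

# Tiny demo with stand-ins for the model stream and the socket:
async def fake_model_stream():
    for token in ["Taste ", "the ", "Joy ", "of ", "Summer!"]:
        await asyncio.sleep(0.05)   # simulate inference pacing
        yield token

async def fake_ws_send(frame: str):
    print(frame)

asyncio.run(stream_tokens(fake_ws_send, fake_model_stream()))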

Workflow

1. User → API-GW over WebSocket

The browser already has a WebSocket open to the API Gateway. When the user hits Submit, the prompt + parameters are sent as a WS message. This avoids a fresh HTTP handshake and gives us a persistent, low-latency channel for both requests and streamed tokens.

2. API-GW → Completions Service (WS)

The API-GW does auth / basic rate limiting, then forwards the WS message to the WS Ingress Handler inside the Completions Service over another WebSocket hop.

3. WS Ingress Handler → Bounded FIFO Queue (Issue 2)

The WS Ingress Handler does lightweight validation (shape, required fields, user id) and tries to push the request into the bounded FIFO queue.

  • If the queue has space, the request is enqueued immediately (fast ACK to user: “generation started”).
  • If the queue is full, we fail fast with a friendly “system busy, please retry” instead of silently adding huge latency.

4. Worker Pool pulls from Queue (Issue 2)

A fixed-size Worker Pool continuously pulls requests from the queue. Each worker:

  • Hydrates any preset if needed (via preset service / DB path, unchanged).
  • Builds the final model payload.
  • Calls the Model Proxy instead of talking to GPT-3 directly.
  • This decouples “submit rate” from “how fast we can safely run inference” and smooths bursts without unbounded tail latency.

5. Worker → Model Proxy (Issue 1)

The worker sends the request to the Model Proxy. The proxy:

  • Uses latency metrics to select the healthiest, already-warm model replica.
  • Triggers warmups & routing adjustments behind the scenes to avoid cold or overloaded instances.
  • As a result, time-to-first-token is minimized even when upstream infrastructure is scaling or fluctuating.

6. Model-API → Model Proxy → Worker (Issue 1)

GPT-3 (Model-API) starts generating tokens. These tokens stream back to the Model Proxy, which forwards them to the worker. Because the proxy has already avoided slow replicas, tokens arrive steadily with fewer cold-start spikes.

7. Worker → WS Streamer (flush every token, Issue 3)

As each token arrives from the proxy, the worker hands it to the WS streamer inside the Completions Service. The WS streamer immediately sends that token over the WebSocket connection — no batching, no fixed-interval buffering.

8. WS Streamer → API-GW → User (Issue 3)

Tokens travel back over the existing WS path through the API-GW to the browser. The browser appends them to the UI as they arrive, giving the user a smooth “typed out” experience with minimal jitter.

9. Presets Path (unchanged latency-wise)

If a preset is involved, the worker briefly calls the preset service → DB (Preset) to hydrate prompt + parameters before step 5. This path is unaffected by DD1 but still sits outside the hot token streaming loop.

Design Diagram

💡 Why Don’t We Use a Cache for Low Latency in DD1? (Click to Expand)

It’s tempting to say “just cache completions,” but for a Playground-style product the traffic pattern is extremely long-tail: almost every user prompt is unique, and even small edits (prompt wording, temperature, max_tokens, model) change the output. That means a cache keyed on (prompt + params + model) would have a very low hit rate, so we’d pay the complexity and memory cost of a cache without meaningfully improving p95 latency for most users.

More importantly, the dominant latency in this system comes from running the model (GPU inference) and how we move tokens over the wire, not from reading a small object from our own storage. DD1 focuses on levers that help every request: avoiding cold/slow replicas (Model Proxy), removing our own queue delay (FIFO + workers), and flushing tokens immediately over WebSocket. Those three give consistent latency wins across the board, while caching would only help a tiny subset of repeated, deterministic prompts.

That said, caching can still be useful later — for example, as a fallback when the model is degraded (DD2: Availability), or for heavily reused demo / template prompts, or for preset search metadata (DD4: Scalability). So we don’t ignore cache entirely; we just don’t treat it as a first-class low-latency tool for the general Playground completion path.

DD2 - High Availability - e.g., How do we prevent a partial outage from breaking all completions?

💡 Goal: Maintain Playground availability under partial failures (Click to expand)

Keep the Playground usable even when some components are unhealthy. A single model region, Completions pod, or Preset DB issue should not make every prompt fail. We aim for 99.9%+ success on prompt submissions with clear degraded behavior instead of silent breakage.

💡 What Breaks the Current High Level Design? (Click to expand)

The HLD assumes all dependencies are healthy. Partial failures can still bring down completions:

  1. Model / Region Outages
    If one upstream model endpoint or region is slow or down, all requests routed there may hang or fail, even if other regions are fine.
  2. Preset Service / DB Failures
    If Preset Service or its DB is degraded, any flow involving presets (search, load, run) can break completions, even though raw “prompt + params” usage could still work.
  3. Completions Pod / WebSocket Path Failures
    If a subset of Completions pods or WebSocket terminators are unhealthy, in-flight streams may drop and new generations may be routed to bad instances, creating the impression that “everything is down.”

Issue 1 – Model / Region Outage – How do we avoid a single bad upstream from breaking completions?

Solution: Resilient Model Proxy with Circuit Breakers

We strengthen the Model Proxy Layer so it doesn’t just optimize latency; it also protects us from partial model outages.

1. Per-endpoint health tracking

The proxy maintains rolling metrics for each upstream target (e.g., model=GPT-3, region=us-east-1 vs us-west-2): error rate, timeouts, TTFT. When a target crosses a threshold (e.g., >5% errors or TTFT > X ms), it is marked degraded.

2. Circuit breakers and fail-fast

For degraded endpoints, the proxy “opens the circuit” and stops sending user traffic there. Instead of letting Completions requests hang on a bad region, we fail fast or reroute to a healthier region/model. This keeps broken endpoints from dragging down global success rate.

3. Multi-region / multi-model routing

When a circuit is open, the proxy routes requests to alternate targets: another region running the same model, or a backup model (e.g., smaller but more reliable). This gives us a graceful degradation path: slower or slightly different quality is better than total failure.

4. Simple behavior for Completions Service

From the Completions Service perspective, it still just calls the Model Proxy once. All the complex logic (health checks, circuit breakers, region failover) lives inside the proxy, so the rest of the Playground stays simple.
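
A minimal sketch of a per-endpoint circuit breaker (thresholds, cooldowns, and endpoint names are illustrative; the real proxy would also factor in error rates and TTFT as described above):

import time

class CircuitBreaker:
    """Per-endpoint breaker: open after too many recent failures, retry after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None          # None = circuit closed (healthy)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Half-open: let one probe request through to test recovery.
            return True
        return False

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()

breakers = {"gpt3-us-east-1": CircuitBreaker(), "gpt3-us-west-2": CircuitBreaker()}
healthy = [ep for ep, br in breakers.items() if br.allow_request()]
print(healthy)   # route only to endpoints whose circuit is closed (or half-open)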

Issue 2 – Preset Service / DB Failures – How do we avoid presets breaking completions?

💡 What Options Do We Have? (Click to expand)

Option 1 – Hard Dependency on Preset Service
In this model, any run that references a preset_id must successfully call Preset Service and its DB. If Preset Service is slow or down, the entire completion request fails. This couples the main completion path tightly to Preset availability and makes a relatively “nice-to-have” feature capable of breaking core UX. Simple to build, but bad for high availability.

Option 2 – Soft-Dependency with Degraded Mode Only
Here we treat presets as optional: if Preset Service is down, preset_id hydration fails fast and we return an error like “Presets temporarily unavailable, please paste your prompt manually.” Core completions still work, but users lose presets entirely during outages. This protects availability but feels a bit passive and user-hostile when outages are longer.

Option 3 – Preset Resilience Layer: Cache + Degraded Writes (Preferred)
We add a Preset Resilience Layer around the Preset Service and DB. The idea is: reads should still work from cached or recent data, and writes should fail gracefully or queue instead of blocking completions. Concretely, this layer introduces:

  • A Preset Read Cache (e.g. Redis) for each user’s recent presets and popular shared presets.
  • A Background Sync Job that keeps the cache in sync from the DB under normal conditions.
  • A Write Queue or “Saved Locally” Mode for new presets when the DB is down, so users can keep working and sync later.

Solution: Option 3 – Preset Resilience Layer (Cache + Degraded Writes)

We choose Option 3 because it keeps presets from being a single point of failure, while still giving users a mostly working preset experience during partial outages.

1. Fast reads from Preset Read Cache

Under normal operation, when Completions or the frontend needs to search or hydrate presets, the Preset Service first checks a Preset Read Cache keyed by user_id and preset_id (plus optional org/shared scopes). Recent presets (e.g. last N per user) and frequently used presets are kept hot in this cache. If the DB is slow or temporarily unreachable, we can still serve stale-but-usable presets from cache instead of failing immediately.

2. Background sync from DB → Cache

A background sync job or change-stream subscriber keeps the cache updated from the Preset DB: new presets, updates, deletions. This is eventually consistent, which is fine for presets. Under normal conditions, cache hit rates are high for everyday usage (recent and favorite presets).

3. Degraded writes when DB is unhealthy

When the Preset DB is down or failing health checks, new POST /v1/presets requests don’t block completions. Instead, we:

  • Either enqueue these writes into a durable write queue to be applied later when the DB recovers, and mark them as “pending sync”.
  • Or store them only in the cache with a clear “unsynced” flag, telling users presets may not be permanent until the outage is over.

In both cases, users can still use those presets in the short term, even if they aren’t fully persisted yet.

4. UI and behavior during degraded mode

When the Preset Service detects DB issues, it reports a degraded state. The UI can:

  • Show a banner like “Preset changes may not be saved permanently. You can still use recent presets.”
  • Still allow selection of cached presets, and still allow completions to run using those values.

Core “prompt → completion” remains unaffected, and a large chunk of preset UX still works from cache.

5. Completions remain loosely coupled

Importantly, Completions never hard-depends on live DB access. It calls Preset Service; Preset Service serves from cache if DB is unhealthy. If even cache fails, we fall back to the simple degraded behavior from Option 2: tell the user presets aren’t available and let them paste prompt/params manually. This layering lets us be actively resilient in most outages and only fall back to passive degradation in worst-case scenarios.
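
A minimal sketch of the cache-first read with graceful degradation (an in-memory dict stands in for the Redis read cache, the DB call is stubbed, and key names are illustrative):

from typing import Optional

PRESET_CACHE = {}          # stand-in for a Redis read cache keyed by preset_id

class PresetDBDown(Exception):
    pass

def load_preset_from_db(preset_id: str) -> dict:
    # Stub for the real Preset DB read; raises PresetDBDown while the DB is unhealthy.
    raise PresetDBDown()

def get_preset(preset_id: str) -> Optional[dict]:
    """Cache-first read; fall back to the DB on a miss, degrade gracefully if the DB is down."""
    cached = PRESET_CACHE.get(preset_id)
    if cached is not None:
        return cached                          # hot path: recent / popular presets live in cache
    try:
        preset = load_preset_from_db(preset_id)
        PRESET_CACHE[preset_id] = preset       # the background sync job normally keeps this fresh
        return preset
    except PresetDBDown:
        return None                            # caller falls back to "presets unavailable" UX

PRESET_CACHE["preset_9f3d7c"] = {"prompt": "Write a tagline for an ice cream shop", "temperature": 0.9}
print(get_preset("preset_9f3d7c"))   # served from cache even while the DB is down
print(get_preset("preset_missing"))  # miss + DB down -> None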

Issue 3 – Completions / WebSocket Failures – How do we handle in-flight and new requests?

💡 What Options Do We Have? (Click to expand)

Option 1 – Let Streams Die Silently
If a Completions pod crashes or a WebSocket terminator becomes unhealthy, in-flight streams just drop. Users see a frozen or broken response, and new requests may continue to be routed to the bad instance until health checks catch up. This feels like “the whole Playground is broken.”

Option 2 – Stateless Completions with Fast Failover (Preferred)
We keep the Completions Service stateless per request (state lives in the model stream and WebSocket session), deploy multiple pods behind the API-GW, and use health checks to quickly remove bad instances from rotation. If a connection drops, the user gets a clear error and can resubmit the same request_id to a healthy instance.

Option 3 – Full Stream Resume Across Nodes
We implement complex logic to resume token streaming on another node after failure, using request_id and model offsets. This is hard to do correctly with live model inference and adds a lot of complexity for relatively rare events. Likely overkill for a Playground.

Solution: Option 2 – Stateless Completions + Fast Failover

We design Completions + WebSocket handling to fail fast and recover quickly instead of trying to magically resume streams.

1. Stateless Completions Service

Each prompt request is handled independently by one Completions worker. No shared in-memory state is required across requests. If a pod dies, only in-flight requests on that pod are affected.

2. Health-checked, multi-pod deployment

The API-GW routes WebSocket connections to multiple Completions pods. Liveness and readiness probes ensure unhealthy pods are removed from rotation quickly. New WebSocket connections are directed only to healthy pods.

3. Fail-fast behavior on connection loss

If a WebSocket connection drops mid-stream, the frontend shows a clear error (“Generation interrupted, please retry”) instead of hanging. Because the request is identified by request_id, the user can simply resubmit, and the new request goes to a healthy pod.
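
A minimal client-side sketch of this fail-fast-and-resubmit behavior, assuming the Python websockets library and a hypothetical endpoint (the message fields mirror the earlier FR1 sketch and are illustrative):

import asyncio
import json
import websockets

async def generate_with_retry(payload: dict, max_attempts: int = 3) -> str:
    """Resubmit the same request_id if the stream drops; health checks route the retry to a healthy pod."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with websockets.connect("wss://playground.example.com/ws") as ws:
                await ws.send(json.dumps(payload))
                chunks = []
                async for raw in ws:
                    msg = json.loads(raw)
                    if msg.get("event") == "done":
                        return "".join(chunks)
                    chunks.append(msg.get("token", ""))
        except websockets.ConnectionClosedError:
            print(f"Generation interrupted (attempt {attempt}), retrying...")
            await asyncio.sleep(1)   # brief backoff before the resubmit
    raise RuntimeError("Generation failed after retries")

payload = {"event": "generate", "request_id": "uuid-1234", "prompt": "Hello", "stream": True}
# asyncio.run(generate_with_retry(payload))  # requires a real Playground WebSocket endpoint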

4. Protection against partial infra failure

Even if one Completions pod or WS terminator is misbehaving, it is isolated and removed from the pool. Other pods continue serving completions, so the outage is partial and contained, not system-wide.

Workflow – What High Availability Looks Like in Practice

1. User submits a prompt over WebSocket

The browser sends prompt + parameters (or a preset_id) over an existing WebSocket connection to the API Gateway. From the user’s point of view, it’s the same “hit Submit and see it type back” flow as in the HLD.

2. API-GW routes to a healthy Completions pod

The API-GW/WebSocket terminator uses liveness and readiness checks to route this WS session to a healthy Completions pod. Any pod that is crashing or failing probes is taken out of rotation, so new requests don’t land on unhealthy instances (Issue 3 solution: stateless + fast failover).

3. Completions enqueues and processes the request (same as DD1)

Inside the Completions Service, the WS ingress handler performs quick validation and enqueues the request into the bounded FIFO queue. A worker from the pool pulls it off the queue and starts processing, just like in DD1. If the queue is full, the service fails fast with a clear error instead of adding hidden tail latency or risking overload.

4. Preset hydration goes through the Preset Resilience Layer

If the request references a preset_id, the worker calls the Preset Service, which now fronts a Preset Read Cache and the Preset DB.

  • Under normal conditions, it reads from cache or DB and returns full prompt + parameters.
  • If the DB is slow or down, it serves stale-but-usable data from cache, so the completion can still run.
  • If both DB and cache are unavailable, it fails fast with a clear message so the user can paste prompt/params manually (Issue 2 solution: cache + degraded writes, soft dependency).

5. New preset saves use degraded writes when DB is unhealthy

When the user saves a new preset during a DB outage, the Preset Service writes to a durable write queue or “cache-only unsynced” store instead of blocking. The UI warns that changes may not be permanent, but the user can still immediately use that preset value for completions. The DB is updated later when it recovers.

6. Completions worker calls the Model Proxy, not the raw model

Once prompt + params are ready, the worker calls the Model Proxy Layer. The proxy examines per-region metrics (error rate, timeouts, TTFT) and uses circuit breakers to avoid bad upstreams.

  • If one region is degraded, its circuit is “open” and traffic is not sent there.
  • Requests are routed to healthy regions or fallback models instead, so the user still gets a completion, possibly a bit slower but not a hard failure (Issue 1 solution: resilient proxy).

7. Tokens stream back over WebSocket from a healthy pipeline

As the chosen model begins generating tokens, they stream via Model Proxy → Completions worker → WS streamer → API-GW → browser. If a Completions pod or WS terminator crashes mid-stream, that one WebSocket connection drops; the frontend shows “Generation interrupted, please retry,” and a retry goes to a different healthy pod thanks to health checks and statelessness (Issue 3 solution).

Design Diagram

💡 Summary – Overall user experience during partial outages (Click to expand)
  • A broken model region is routed around by the Model Proxy.
  • A sick Preset DB is hidden behind cache + degraded writes, with clear but non-fatal UX.
  • A bad Completions pod is quickly removed from rotation, and only in-flight requests on that pod are affected.

Most users continue to submit prompts and see completions successfully, and even when pieces are unhealthy, the system fails gracefully instead of making the Playground feel completely down.

DD3 - Rate Limiting - e.g., How do we prevent spam from one user from choking the system?

💡 Goal: (Click to expand)
Prevent a single user, script, or IP from flooding the Playground or driving runaway model cost, while keeping normal users almost never rate-limited. Concretely we target something like ≤ 60 requests per user per minute and ≤ 2 concurrent generations per user, enforced across tabs and devices.
💡 What Breaks the Current High Level Design? (Click to expand)
Without explicit rate limiting, a few bad actors can hurt everyone:
  1. Multi-tab / multi-device spam
    A user (or bot) can open many tabs, each with its own WebSocket, and fire POST /v1/completions messages in parallel. Per-connection limits can’t see the whole picture, so one user can easily generate hundreds of requests per minute.
  2. Queue flooding inside the Completions Service
    Even with DD1’s bounded FIFO queue and worker pool, a single noisy user could enqueue a lot of requests and dominate the head of the queue. Other users then see long delays or “frozen” generations, even though the system is technically still up.

Issue 1 – Per-User Burst Control (QPS): How do we cap a single user across tabs/devices?

💡 What Options Do We Have? (Click to expand)

Option 1 – Client-side Throttling Only

We throttle in the UI (e.g., grey out Submit after N requests). This is easy but trivial to bypass: custom scripts, modified clients, or direct API calls ignore the UI. It provides zero protection for backend capacity or model cost.

Option 2 – Local In-Memory Limits per API-GW Node

Each API-GW instance keeps an in-memory counter for each user/IP and rejects when usage exceeds the limit. Better than nothing, but once we scale API-GW horizontally, one user hitting 3 gateways can get ~3× the intended allowance. We also can’t easily tweak limits globally because each node only has a partial view.

Option 3 – Centralized Distributed Rate Limiter (Preferred)

All API-GW nodes consult a shared Rate Limit Service backed by a fast store (e.g., Redis). That service maintains token/leaky buckets for (user_id, org_id, IP, route) and answers “allow or deny?” for each new completion request. Because all gateways talk to the same limiter, limits hold no matter how many nodes we run.

Solution: Option 3 – Centralized Distributed Rate Limiter

We add a Rate Limit Service as a small, dedicated component that API-GW calls before admitting a new completion.

1. Shared counters across gateways

When the user hits Submit, the WebSocket message that starts a new completion is treated as one logical request. API-GW calls the Rate Limit Service with (user_id, org_id, IP, "completions"). The limiter uses Redis to track sliding-window or token-bucket counters and decides if the user is still under the configured limit (for example, ≤ 60/min).
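
A minimal sketch of the shared counter check using redis-py (this simplified version uses a fixed one-minute window; the real limiter might use sliding windows or token buckets, and key names are illustrative):

import time
import redis  # redis-py client; assumes a reachable Redis instance

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

LIMIT_PER_MINUTE = 60

def allow_completion(user_id: str) -> bool:
    """Fixed-window counter shared by every API-GW node (simplified for illustration)."""
    window = int(time.time() // 60)                 # current one-minute bucket
    key = f"rl:completions:{user_id}:{window}"
    count = r.incr(key)                             # atomic across all gateway nodes
    if count == 1:
        r.expire(key, 120)                          # old windows clean themselves up
    return count <= LIMIT_PER_MINUTE

# The gateway would call this before admitting a new "generate" message, e.g.:
# if not allow_completion("user_42"):
#     send_error(code="rate_limit_exceeded", retry_after=30)   # hypothetical helper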

2. Consistent behavior across tabs and devices

Because the counters are keyed on user_id (and optionally IP/org_id) in a shared store, it doesn’t matter whether the requests come from three tabs, two devices, or a script. All of them consume the same bucket, so abuse is contained.

3. User-friendly errors, not silent failures

If the limiter denies a request, API-GW immediately sends a structured WebSocket message like:

{ "type": "error", "code": "rate_limit_exceeded", "retry_after": 30 }

The frontend can show a clear message and avoid spamming retries. Normal users rarely see this; heavy users get a clear explanation instead of mysterious hangs.

4. Building blocks involved

  • API-GW: enforcement point; every new completion request checks the limiter.
  • Rate Limit Service: small stateless API implementing token buckets, backed by Redis.
  • Redis: stores per-user / per-IP counters.
  • Completions Service / Model Proxy: unchanged; they only see traffic that already passed the limiter.

Issue 2 – Concurrent Generations: How do we stop one user from hogging the worker pool?

Solution: Option 3 – Per-User Concurrency Tokens

We add lightweight concurrency tracking in the same Rate Limit Service, coordinated with Completions.

1. Admission check at API-GW

When a “start completion” message arrives, API-GW makes two checks against the Rate Limit Service:

  • QPS bucket (Issue 1: “am I under my per-minute limit?”)
  • Concurrency bucket (“how many active generations do I already have?”)

If the user already has 2 active generations, the concurrency check fails and API-GW sends back an error like code: "concurrent_generations_limit_exceeded".

2. Increment on start, decrement on finish

Once a request passes both checks and is put onto the bounded FIFO queue, the Rate Limit Service increments the user’s “in-flight” counter. When the generation finishes or the user hits Stop, the Completions Service (or API-GW) sends a tiny “release” call to decrement the counter.

  • If the WebSocket drops unexpectedly, we rely on a short TTL on the concurrency entry (for example, 60–90 seconds) so stuck slots auto-clear.
  • Optionally, a watchdog in Completions can detect timeouts and explicitly release tokens.
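
A minimal sketch of this acquire/release flow with the TTL safety net, using redis-py (key names, cap, and TTL are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

MAX_CONCURRENT = 2
SLOT_TTL_SECONDS = 90            # safety net: a stuck slot auto-expires if a socket drops silently

def try_acquire_generation_slot(user_id: str) -> bool:
    key = f"concurrency:{user_id}"
    active = r.incr(key)                        # optimistic increment
    r.expire(key, SLOT_TTL_SECONDS)             # refresh the TTL so the counter cannot stick forever
    if active > MAX_CONCURRENT:
        r.decr(key)                             # over the cap: roll back and reject
        return False
    return True

def release_generation_slot(user_id: str) -> None:
    key = f"concurrency:{user_id}"
    if int(r.get(key) or 0) > 0:
        r.decr(key)                             # called when the generation finishes or the user hits Stop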

3. Fair sharing of the worker pool

Because each user can hold at most 2 tokens, they can never fully monopolize the worker pool. The bounded FIFO queue still orders all jobs, but the per-user cap ensures bursts from one user are naturally limited and other users’ jobs continue to make progress.

4. Building blocks involved

  • API-GW: checks both QPS and concurrency counters before enqueuing.
  • Rate Limit Service + Redis: maintain “active generations per user” with TTL.
  • Completions Service: pushes accepted requests into the FIFO queue and signals completion to release concurrency slots.
  • Worker Pool: unchanged; it pulls jobs from the queue and runs them, now protected from single-user hogging.

Workflow – Prevents One User from Choking the System

1. User starts a generation over WebSocket

The browser sends a “start completion” message to the WS Gateway over an existing WebSocket connection.

2. Global per-user QPS check at the edge

Before forwarding the request, the WS Gateway calls the Rate Limit Service (Redis-backed) with (user_id, org_id, IP).

  • If the user exceeds their requests-per-minute budget, the gateway immediately responds with a clear rate_limit_exceeded error over WebSocket.
  • If under limit, the request continues.

3. Per-user concurrency check (active_generations)

The Rate Limit Service also tracks how many active generations a user has.

  • If the user is already at the max (for example, 2 concurrent generations), the gateway returns concurrent_generations_limit_exceeded.
  • Otherwise, it increments active_generations[user_id] and allows the request.

4. Accepted requests enter the bounded queue

For allowed requests, the WS Gateway forwards the message to a Completions pod.

  • The WS Ingress Handler enqueues it into the pod’s bounded FIFO queue.
  • If the queue is full, the pod responds “busy, please retry” and no more load is added.

5. Worker pool runs the job and streams tokens

A worker pulls the job from the queue, hydrates any preset, calls the Model Proxy, and streams tokens back over WebSocket (flush-every-token) to the user.

6. Concurrency token is released on finish/stop

When the generation completes, errors out, or the user hits Stop, the Completions pod (or gateway) sends a small “release” call to the Rate Limit Service.

  • active_generations[user_id] is decremented.
  • The user can now start another completion without violating the concurrency cap.

Together, the QPS bucket and active_generations tokens make sure one user cannot flood the system, even with many tabs or devices, while normal users almost never hit a limit.

Design Diagram

DD4 - High Scalability - How do we prevent sudden traffic spikes from overwhelming stream infra?

💡 Goal (Click to expand)

Goal: Safely handle 10k+ prompt QPS and 100k+ concurrent WebSocket streams during spikes, without melting the API-GW, overfilling queues, or starving the model. The system should scale horizontally and shed load in a controlled, user-friendly way.

💡 What Breaks the Current High Level Design? (Click to expand)

Even with good latency and availability, sudden surges can still hurt us:

  1. WebSocket Connection Fan-In at the Edge
    A popular announcement or live event can cause tens of thousands of new WebSocket connections within seconds. If the API-GW / WS tier cannot scale quickly enough, we can run out of file descriptors, memory, or CPU, causing dropped connections for everyone.
  2. Completions Service Hotspots and Saturation
    If traffic isn’t evenly spread, a subset of Completions pods can get overloaded: their queues fill, workers saturate CPU, and latency spikes. Even with a bounded FIFO, we need a way to scale out before we hit the wall.
  3. Preset Search & Listing at Scale
    Preset search (FR4) starts as a simple LIKE %query% on a relational DB. At millions of presets, this becomes slow and expensive, competing with other DB workloads and turning “search bar” usage into a scalability bottleneck.

Issue 1 – WebSocket Connection Fan-In at the Edge

💡 What Options Do We Have? (Click to expand)

Option 1 – Single API-GW Cluster Doing Everything
We keep one generic API-GW cluster that terminates all WebSockets and serves all REST traffic. It can scale horizontally, but capacity planning is tricky: long-lived WS connections compete with REST traffic for the same resources, and spikes can exhaust per-node connection limits. It also makes autoscaling signals harder to tune, because long-lived WS connections and short REST calls behave very differently.

Option 2 – Horizontally Scaled WS Tier (Preferred)
We treat the WebSocket termination layer as a dedicated tier (even if it’s still conceptually “API-GW” in diagrams): a pool of WS gateway pods behind a load balancer, responsible mainly for:

  • Terminating WS connections from browsers
  • Running auth + rate limiting
  • Forwarding messages to Completions pods
These pods are scaled horizontally based on the number of active connections, messages per second, and CPU/memory. REST endpoints (e.g., for presets) can live on a separate HTTP tier and scale on different signals.

Option 3 – Push WS Termination Down Into Completions Pods
Each Completions pod holds client WebSockets directly. That removes one hop but couples connection capacity to compute capacity: when you add pods for CPU, you also add WS capacity, and any skew (many idle connections vs. few active requests) makes tuning hard. Pod restarts also drop many client connections at once.

Solution: Option 2 – Dedicated, Horizontally Scaled WebSocket Tier

We conceptually split the edge into:

  • A WebSocket Gateway tier (the API-GW WS side) that:
    • Terminates WS connections
    • Enforces per-user rate limits (from DD3)
    • Forwards messages to Completions pods over internal WS/gRPC
  • A separate HTTP tier for REST-ish traffic (preset CRUD, health, etc.)

Autoscaling rules for the WS tier are tuned on:

  • Active connection count per pod
  • Messages per second
  • CPU / memory thresholds

When traffic spikes, new WS connections are spread over more gateway pods. Existing pods aren’t overloaded with connection bookkeeping, so they can keep forwarding messages to Completions with low overhead.
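
As a back-of-envelope sketch, the desired WS tier size can be driven by the most demanding of the three signals. All per-pod capacities below are assumptions for illustration, not measured numbers.

```python
# Illustrative WS-tier sizing: take the max of connection-, message-, and CPU-driven needs.
import math

CONNS_PER_POD = 20_000        # long-lived WS connections one gateway pod can hold (assumed)
MSGS_PER_SEC_PER_POD = 2_000  # message-forwarding throughput per pod (assumed)
TARGET_CPU_UTIL = 0.6         # scale before CPU saturates

def desired_ws_replicas(active_conns: int, msgs_per_sec: float,
                        current_replicas: int, avg_cpu_util: float) -> int:
    by_conns = math.ceil(active_conns / CONNS_PER_POD)
    by_msgs = math.ceil(msgs_per_sec / MSGS_PER_SEC_PER_POD)
    by_cpu = math.ceil(current_replicas * (avg_cpu_util / TARGET_CPU_UTIL))
    return max(1, by_conns, by_msgs, by_cpu)
```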

Issue 2 – Completions Service Hotspots & Saturation

💡 What Options Do We Have? (Click to expand)

Option 1 – Scale Only on CPU
We configure a Horizontal Pod Autoscaler (HPA) for Completions based only on CPU usage. This is easy to set up, but CPU often lags behind real pressure: queues might be filling while CPU is still moderate, and by the time CPU crosses the threshold, we are already dropping or timing out requests.

Option 2 – Scale on QPS Only
We track prompt QPS and increase pod count when QPS goes up. Better, but still incomplete: one pod might see much more traffic than another (a hotspot), and QPS doesn't capture how much work is still in flight (queue depth and active generations).

Option 3 – Multi-signal Autoscaling with Per-Pod Concurrency (Preferred)
We make Completions scaling driven by a mix of signals:

  • Average queue depth per pod (from the bounded FIFO)
  • Number of active generations per pod
  • CPU (and possibly memory) as a safety signal
We also enforce a per‑pod concurrency cap: when a pod’s queue length or active generations exceed a threshold, the WS gateway starts sending new requests to other pods, or returns “busy” if the whole pool is saturated.
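
A rough sketch of how these signals could combine into a scale-out / scale-in decision is below; the thresholds are illustrative and would normally be expressed as HPA or custom-controller configuration rather than application code.

```python
# Illustrative multi-signal scaling decision for the Completions pool.
def scaling_decision(avg_queue_depth: float, avg_active_generations: float,
                     avg_cpu_util: float) -> str:
    if avg_queue_depth > 20 or avg_active_generations > 8 or avg_cpu_util > 0.75:
        return "scale_out"    # pressure building: add Completions pods
    if avg_queue_depth < 1 and avg_cpu_util < 0.3:
        return "scale_in"     # plenty of headroom: remove pods safely
    return "hold"
```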

Solution: Option 3 – Multi-signal Autoscaling + Concurrency Caps

In this design:

  1. Each Completions pod exposes metrics:
    • queue_length for its bounded FIFO
    • active_generations (currently running model calls)
    • CPU usage
  2. A cluster autoscaler / HPA uses these metrics:
    • If avg queue length per pod stays high for N seconds → scale out more pods
    • If queue length is near 0 and CPU is low → scale in safely
  3. The WS Gateway is aware of per-pod load:
    • It routes new requests to pods with lower queue depth
    • If all pods report “queue full”, it can return a fast “system busy, please retry” to the client instead of letting queues grow unbounded
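
A minimal sketch of that load-aware routing step: send the request to the pod with the smallest queue, or report "system busy" if every pod is at capacity. PodStats and the tie-breaking rule are assumptions for illustration.

```python
# Least-loaded routing at the WS gateway, based on per-pod queue depth.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PodStats:
    pod_id: str
    queue_length: int
    queue_capacity: int
    active_generations: int

def pick_pod(pods: List[PodStats]) -> Optional[str]:
    candidates = [p for p in pods if p.queue_length < p.queue_capacity]
    if not candidates:
        return None          # all queues full -> "system busy, please retry"
    best = min(candidates, key=lambda p: (p.queue_length, p.active_generations))
    return best.pod_id
```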

Result: when a sudden spike hits, we don’t just overload a few unlucky pods. The system spreads load across the cluster, adds capacity proactively, and protects queues with hard limits.

Issue 3 – Preset Search at Scale (Elasticsearch / Search Index)

Preset search isn’t on the streaming hot path, but at large scale it can easily become a DB bottleneck under heavy use, especially when users lean on presets during a spike.

💡 What Options Do We Have? (Click to expand)

Option 1 – DB LIKE Queries Forever
We keep FR4 as implemented in HLD: SELECT ... WHERE user_id = ? AND name ILIKE %query%. This works at small scale, but at tens of millions of presets it becomes slow, index-unfriendly, and can lock or thrash the DB. It also competes with other DB workloads (preset writes, cache sync), hurting overall system health.

Option 2 – Heavy Index Tuning + Partial Text Search in DB
We add trigram / full-text indexes and limit searches to prefix or tokenized matches. This buys us time but keeps search and OLTP on the same DB, so both compete for I/O and cache. Scaling this usually means sharding the DB or adding replicas dedicated to search, which is still less flexible than a search engine.

Option 3 – Dedicated Search Index (Preferred)
We offload search to a search engine like Elasticsearch (or a managed equivalent):

  • Presets are indexed into a search cluster asynchronously
  • GET /v1/presets?query=... hits the search index, not the main DB
  • The index is optimized for text search, scoring, pagination, and filtering

Solution: Option 3 – Search Index for Presets (Elasticsearch)

We enhance the Preset Service by adding an Elasticsearch-based search tier on top of the existing Preset DB and cache.

Write path fan-out

When a preset is saved or updated:

  • It is first written to the Preset DB as the source of truth.
  • In parallel, we emit a small message (or consume a CDC stream) into a Search Indexer job.
  • The Search Indexer upserts a document in Elasticsearch, indexed by user_id, preset_id, name, tags, and any other filterable fields (e.g. created_at, is_favorite).
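
A minimal sketch of the indexer upsert, assuming the official elasticsearch Python client (8.x-style API). The index name, endpoint, and field set are placeholders mirroring the fields listed above.

```python
# Search Indexer: upsert one preset document into Elasticsearch, keyed by preset_id.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

def index_preset(preset: dict) -> None:
    es.index(
        index="presets",
        id=preset["preset_id"],               # idempotent upsert keyed by preset_id
        document={
            "user_id": preset["user_id"],
            "name": preset["name"],
            "tags": preset.get("tags", []),
            "created_at": preset["created_at"],
            "is_favorite": preset.get("is_favorite", False),
        },
    )
```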

Read path via Elasticsearch

For GET /v1/presets?query=...:

  • The Preset Service issues a search query to Elasticsearch scoped by user_id and text query (full-text on name/tags).
  • Elasticsearch returns a ranked list of matching preset_ids plus light metadata (name, snippet).
  • If the client needs full preset details, the Preset Service then hydrates those IDs from the Preset Cache (Redis) or directly from the Preset DB.
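
The read path might look like the sketch below, again assuming the 8.x-style client: a search scoped by user_id with full-text matching on name/tags, returning ranked preset_ids for the Preset Service to hydrate from cache or DB.

```python
# Preset search: filter by user_id, full-text match on name/tags, return ranked IDs.
from elasticsearch import Elasticsearch

def search_presets(es: Elasticsearch, user_id: str, query: str, size: int = 20) -> list:
    resp = es.search(
        index="presets",
        size=size,
        query={
            "bool": {
                "filter": [{"term": {"user_id": user_id}}],
                "must": [{"multi_match": {"query": query, "fields": ["name^2", "tags"]}}],
            }
        },
    )
    return [hit["_id"] for hit in resp["hits"]["hits"]]
```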

Scalable search tier

The Elasticsearch cluster scales independently based on:

  • Query QPS (more search nodes behind a coordinator when search traffic grows)
  • Index size (sharding by user_id or time, adding shards/replicas as data grows)
  • CPU / memory on search nodes (tuned for inverted index and text scoring)
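
For concreteness, the shard/replica knobs live in the index settings; the values below are placeholders to show where that tuning happens, not recommended numbers.

```python
# Illustrative index creation with explicit shard/replica counts and a simple mapping.
from elasticsearch import Elasticsearch

def create_presets_index(es: Elasticsearch) -> None:
    es.indices.create(
        index="presets",
        settings={"number_of_shards": 6, "number_of_replicas": 1},
        mappings={
            "properties": {
                "user_id": {"type": "keyword"},
                "name": {"type": "text"},
                "tags": {"type": "keyword"},
                "created_at": {"type": "date"},
                "is_favorite": {"type": "boolean"},
            }
        },
    )
```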

Workflow – Handles Traffic Spikes

1. WS connections spike at the edge

A surge of users opens the Playground. Browsers establish WebSocket connections to the WS Gateway Pool, which auto-scales based on active connections and messages per second, so no single gateway node melts down.

2. Per-user limits filter abuse early

When a user hits Submit, the WS gateway calls the Rate Limit Service (Redis-backed) to check per-user QPS and concurrent generations. Abusive users get fast, explicit errors; only allowed requests are forwarded to Completions.

3. Requests are steered to less-loaded pods

The WS gateway routes each allowed request to a Completions pod with low queue depth and healthy metrics, spreading load so no single pod becomes a hotspot.

4. Bounded queues protect pods from overload

Inside each Completions pod, the WS Ingress Handler pushes requests into a bounded FIFO queue. A worker pool pulls from the queue up to a safe concurrency limit; if the queue is full, the pod signals “busy” and the gateway returns a friendly “retry later” instead of silently stalling.

5. Autoscaling adds more Completions capacity

Cluster autoscaling uses pod metrics (queue length, active generations, CPU) to spin up more Completions pods during the spike and scale back down when traffic drops, keeping latency and error rates under control.

6. Model + presets scale independently

Workers call the Model Proxy (which already does multi-region routing from DD1/DD2) and, when needed, the Preset Service. Preset search uses Elasticsearch, so heavy preset queries don’t compete with core completions for DB capacity during spikes.

Design Diagram

Final Thoughts

Stepping back, this Playground design does what we set out to do: start from four small, concrete user flows and grow them into a production-shaped system. FR1–FR4 give us a clean backbone: a WebSocket-based completions path, parameter control, presets as a first-class reusable asset, and a search → load → run loop that matches how people actually use ChatGPT. The HLD keeps the components small and well-named — API-GW, Completions Service, Preset Service, Model Proxy, WS gateways — so in an interview you can “walk the graph” without getting lost.

The deep dives then layer in the hard stuff without blowing up the core design. DD1 focuses on end-to-end latency with three levers: a Model Proxy to dodge cold/slow replicas, a bounded FIFO + worker pool to tame our own queues, and a WS streamer that flushes every token for that “typing” feel. DD2 makes the same components resilient: circuit breakers and multi-region routing in the Model Proxy, cache + degraded writes for presets, and stateless, health-checked Completions pods so individual failures don’t look like global outages. DD3 fences off bad actors with Redis-backed, per-user limits and queue-level protection, while DD4 zooms out to system-wide scale — WebSocket gateway pool, global L7 routing, autoscaling Completions pods, and Elasticsearch to keep preset search fast as data grows.

If you had more time, obvious “v2” directions would be richer safety and moderation hooks, multi-tenant org features, and deeper observability (per-tenant SLOs, cost dashboards, replay tooling). But for interview purposes, this doc already tells a complete story: you can deliver the basic product, you know exactly where it breaks under real-world scale and abuse, and you have concrete, technically credible plans to fix those weaknesses without over-engineering from day one.

Finally, thanks for reading our article. We have dedicated coaches for these system design interview questions, and we're here to help you level up your tech interview performance.

Good luck with the final design diagram:

Coach + Mock
Practice with a Senior+ engineer who just got an offer from your dream (FAANG) company.
Schedule Now