Problem Understanding and Clarification
- What is a Webhook System and its application?
- A webhook is a mechanism that allows one application to notify another when an event occurs
- Essentially, it's a real-time notification system that pushes data to a designated endpoint when a specific event happens, instead of having to continuously check for updates.
- For example, when a consumer makes a payment on Amazon to buy a product, a webhook connects Amazon's payment gateway to an external payment network (e.g., Visa, Mastercard). The webhook listens for status changes on the payment event from the external payment network, such as "payment success" or "payment rejected", and triggers the corresponding transaction update in Amazon's database.
- Clarify the requirements and define the scope.
Clarifying these questions helps pin down the functional and non-functional requirements and can also shape a better design. For example:
- if the system needs high throughput and high volume with low-latency delivery, then it is more reasonable to consider a stream instead of a queue.
- if there is a strong security requirement, then you need to be ready to discuss how to ensure data integrity (i.e., data cannot be intercepted and modified in transit or at rest).
Functional Requirements
- Clients should be able to register, update, and delete a webhook with a callback URL
- Clients should be able to receive events from the source application through the webhook
- System should support event filtering based on event types.
Non-Functional Requirements
- highly scalable to support up to 1B events per day (we should estimate peak QPS; see the back-of-envelope calculation after this list)
- highly available for receiving, processing and delivering events
- fault-tolerant (monitor and retry on failures)
- secure and prevent unauthorized access and data tampering
- highly durable and ensure no data loss
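Back-of-envelope: 1B events/day ÷ 86,400 seconds/day ≈ 11,600 events per second on average. Assuming a peak-to-average ratio of roughly 3x (an assumption to revisit once traffic patterns are clarified), the system should be sized for roughly 35K events per second at peak, and each ingested event may fan out into multiple webhook deliveries.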
API Design
(1) Register a webhook
(2) Update / Delete a webhook
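A rough sketch of how a client might call these APIs (Python with the requests library; the base URL, auth scheme, and exact response fields are assumptions, though the request fields mirror the registration workflow described below):

```python
import requests

BASE_URL = "https://webhooks.example.com/api/v1"    # hypothetical host
HEADERS = {"Authorization": "Bearer <jwt_token>"}   # JWT-based auth, as in Workflow 1

# (1) Register a webhook
resp = requests.post(
    f"{BASE_URL}/webhooks",
    headers=HEADERS,
    json={
        "callback_url": "https://client.example.com/hooks/payments",
        "event_types": ["payment.succeeded", "order.created"],
        "trigger_conditions": {"amount_gte": 100},   # optional filtering criteria
        "retry_policy": {"max_attempts": 5},         # optional; defaults apply otherwise
    },
    timeout=10,
)
webhook = resp.json()  # expected to include webhook_id, signing_secret, verification status

# (2) Update / delete a webhook
requests.patch(f"{BASE_URL}/webhooks/{webhook['webhook_id']}",
               headers=HEADERS, json={"event_types": ["payment.succeeded"]}, timeout=10)
requests.delete(f"{BASE_URL}/webhooks/{webhook['webhook_id']}", headers=HEADERS, timeout=10)
```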
Event (Data) Flow
Before we get into the high-level design, let's look at the end-to-end event (data) flow within the webhook system. The primary function of a webhook system is to enable real-time communication by delivering events (e.g., payment updates) from a source application (e.g., Stripe) to a client's designated callback URL. It ensures reliable, at-least-once delivery for sensitive applications (such as payments).
Here is a typical sequence of steps:
1. Event trigger in source application: When a specific action occurs in the source application (e.g., a payment is successfully processed in Stripe, triggering a payment_success event), the application generates an event payload (a JSON object containing event details).
2. Source application pushes the event to Webhook system: The source application sends the event to the webhook system via an HTTP POST request to an event ingestion endpoint.
3. Webhook system processes and delivers events: The webhook system processes the event, identifies matching subscriptions, and delivers to registered callback URLs with proper security headers and retry logic.
4. Client processes the new event: Upon receiving the webhook, the client validates the signature, processes the event, and returns an HTTP 200 status code to confirm successful receipt.
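For illustration, an ingested event might look roughly like this (the field names follow the attributes the Event Watcher extracts later in this design; the exact schema is an assumption):

```python
# Hypothetical event payload posted by a source application to the ingestion endpoint.
event = {
    "event_id": "evt_123",              # unique per event; used for deduplication
    "client_id": "acct_42",             # determines which client's webhooks should fire
    "event_type": "payment.succeeded",
    "timestamp": "2024-01-15T10:30:00Z",
    "payload": {                        # event details, treated as opaque by the webhook system
        "payment_id": "pay_789",
        "amount": 2499,
        "currency": "USD",
    },
}
```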
High-Level Design

The core components of our high-level design include:
Webhook Manager
serves as the central registry and configuration store for all webhook subscriptions, storing all webhook settings in a database with a Redis cache layer for low-latency lookups.
- exposes RESTful APIs for webhook CRUD operations and validates callback URLs through DNS resolution, HTTPS enforcement, and reachability tests.
- manages authentication and authorization for all webhook operations, generates signing secrets for HMAC verification, or store/process client provided public keys.
Event Watcher (Event Processing Service)
operates as a set of horizontally scalable microservices that receive events from source applications and route them to the appropriate webhooks.
- accepts events via HTTP POST, performs initial validation (schema validation and deduplication), and transforms events into a canonical format.
- queries matching webhook subscriptions from cache and database, applies event filtering based on trigger conditions and then publishes delivery events directly to delivery queues.
Message Queue or Stream
provides reliable, scalable event processing with multiple tiers based on priority and delivery requirements. The queue structure includes high-priority queues for critical events such as payments and security alerts, standard queues for regular business events, and a dead-letter queue (DLQ) for events that have exhausted all retry attempts.
Delivery Worker
consumes events from queues and delivers them to callback URLs using HTTP POST requests.
- retrieves event data, generates HMAC signatures using signing secrets, and sends HTTP requests to callback URLs.
- updates delivery status and handles retries with rate limiting per webhook endpoint to respect client infrastructure limitations.
Monitoring & Alerting
- Monitors the health of all internal components (through heartbeat messages or periodic ping/health-check APIs) and collects real-time status (resource consumption, throughput, queue metrics).
- Alerts on (1) worker failures and (2) failed events in the DLQ.
Workflow 1: Register a new Webhook
The client sends a POST request to /api/v1/webhooks with the callback URL, event types, and configuration details:
- callback_url: The destination endpoint that will receive event notifications
- event_types: List of events to subscribe to (e.g., "payment.succeeded", "order.created")
- trigger_conditions: Optional filtering criteria to receive only specific instances of events
- retry_policy: Optional custom retry settings (defaults apply if not specified)
- (optional) public_key: a client-provided public key used to encrypt sensitive payload data
The Webhook Manager validates the JWT token and verifies client permissions for webhook creation. The system performs initial validation including URL format verification and HTTPS requirement enforcement, DNS resolution of the callback domain, reachability testing with HTTP HEAD request, and event type validation against supported types.
After successful validation, the Webhook Manager stores the complete webhook configuration in the database, and essential lookup information is cached in Redis for performance. Finally, the Webhook Manager returns a response containing the webhook_id, signing_secret, and verification status to the client.
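A minimal sketch of that validation step (Python with the requests library; production code would also add safeguards such as blocking callback URLs that resolve to private IP ranges):

```python
import socket
from urllib.parse import urlparse

import requests

SUPPORTED_EVENT_TYPES = {"payment.succeeded", "order.created"}  # example set

def validate_webhook_registration(callback_url: str, event_types: list[str]) -> None:
    """Raise ValueError if the registration request is invalid."""
    parsed = urlparse(callback_url)

    # 1. URL format verification + HTTPS enforcement
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError("callback_url must be a valid https:// URL")

    # 2. DNS resolution of the callback domain
    try:
        socket.getaddrinfo(parsed.hostname, parsed.port or 443)
    except socket.gaierror:
        raise ValueError("callback domain does not resolve")

    # 3. Reachability test with an HTTP HEAD request
    try:
        requests.head(callback_url, timeout=5, allow_redirects=False)
    except requests.RequestException:
        raise ValueError("callback URL is not reachable")

    # 4. Event type validation against supported types
    unknown = set(event_types) - SUPPORTED_EVENT_TYPES
    if unknown:
        raise ValueError(f"unsupported event types: {unknown}")
```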
Workflow 2: Event Delivery
We will split into two steps: (1) Event Ingestion & Processing and (2) Delivery Execution.
(1) Event Ingestion & Processing
When a source application generates an event (such as a payment being processed or an order status changing), it sends the event to the Event Watcher via an HTTP request.
Upon receiving the new event, the Event Watcher extracts the event's core attributes (event_id, client_id, event_type, timestamp, and payload), validates the event schema to ensure all required fields are present, and checks for duplicates based on event_id and client_id. For validated events, the Event Watcher identifies all webhook subscriptions that should receive this event by matching the client_id, event_type, active status, and any filtering conditions. This subscription lookup uses optimized database indices (PK: webhook_id; GSI: client_id & event_type) and the cache.
For each matching webhook, the Event Watcher generates an HMAC signature. This signature enables clients to verify the payload's integrity. After signing and encrypting the payload, the Event Watcher places a message in the appropriate queue for each target webhook, referencing the event_id and webhook_id.
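A sketch of how that signature might be generated: HMAC-SHA256 over the serialized payload, later attached to the delivery request as headers (the header names and timestamp binding are assumptions, not a specific provider's format):

```python
import hashlib
import hmac
import json
import time

def sign_payload(payload: dict, signing_secret: str) -> dict:
    """Return headers carrying the signature so clients can verify payload integrity."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    timestamp = str(int(time.time()))
    # Bind the timestamp into the signed message to limit the replay window.
    message = f"{timestamp}.{body}".encode("utf-8")
    signature = hmac.new(signing_secret.encode("utf-8"), message, hashlib.sha256).hexdigest()
    return {
        "X-Webhook-Timestamp": timestamp,            # hypothetical header names
        "X-Webhook-Signature": f"sha256={signature}",
    }
```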
(2) Delivery Execution
The delivery process begins when a Delivery Worker pulls a message from the queue. The Delivery Worker retrieves the full event data and webhook configuration from the database, and then prepares an HTTP POST request to the webhook's callback URL.
The Delivery Worker sends the HTTP request with a pre-configured connection timeout (say, 10 seconds). On receiving an HTTP response, the Delivery Worker determines whether delivery succeeded. If no response is received before the timeout, it retries with exponential backoff. After exhausting all retries, the Delivery Worker moves the event to the DLQ and updates the event's delivery status to FAILED in the database.
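A simplified sketch of that delivery loop (the queue and database interfaces are abstracted as dlq and db; a real worker would usually rely on the queue's redelivery/visibility-timeout mechanics rather than sleeping in-process):

```python
import time

import requests

MAX_ATTEMPTS = 5
BASE_BACKOFF_SECONDS = 2
REQUEST_TIMEOUT_SECONDS = 10   # pre-configured connection/read timeout

def deliver(event_id: str, webhook_id: str, callback_url: str,
            body: str, headers: dict, dlq, db) -> None:
    """Attempt delivery with exponential backoff; move to DLQ after exhausting retries."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = requests.post(callback_url, data=body, headers=headers,
                                 timeout=REQUEST_TIMEOUT_SECONDS)
            if 200 <= resp.status_code < 300:
                db.update_delivery_status(event_id, webhook_id, "DELIVERED")
                return
            if 400 <= resp.status_code < 500:
                break              # client error: stop retrying, park in DLQ for later replay
            # 5xx or unexpected status: fall through and retry
        except requests.RequestException:
            pass                   # timeout / connection error: retry
        if attempt < MAX_ATTEMPTS:
            time.sleep(BASE_BACKOFF_SECONDS * 2 ** (attempt - 1))   # exponential backoff

    dlq.publish({"event_id": event_id, "webhook_id": webhook_id})
    db.update_delivery_status(event_id, webhook_id, "FAILED")
```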
Design Options
Dimension 1: Choose between Queue (AWS SQS) vs Stream (Kafka, AWS Kinesis Stream)
Option 1: Queue
- moderate throughput (~a few thousand events per second)
- guarantees at-least-once delivery and ordering (with FIFO queues)
- scales automatically (e.g., by increasing batch size and the number of consumers)
- BUT not designed for continuous or real-time data streams
Option 2: Stream
- supports high throughput (up to millions of events per second)
- fits cases where real-time, low-latency processing is a requirement
Recommendation: The choice depends on clarified functional and non-functional requirements. In general, queues are sufficient for most webhook systems, while Kafka and streams should be used for ultra-high throughputs. For our case with a requirement to support up to 1 billion events per day (approximately 11,500 events per second average), AWS SQS provides sufficient throughput while offering managed infrastructure, automatic scaling, and built-in retry mechanisms that align perfectly with webhook delivery requirements.
Dimension 2: With vs Without Database
Option 1: Queue-Only (No Persistent Storage)
This approach works when events are non-critical and clients can tolerate an occasional missed delivery, when there is no need for auditing, replay, or complex analytics, when event volume is low to moderate such that SQS's 14-day maximum retention is enough for retries, and when duplicates are not a concern or can be handled by clients.
Option 2: Database + Queue (Recommended)
This approach is necessary when guaranteed delivery is required, when there is a need to pull historical events for auditing, compliance, and debugging (such as analyzing financial transactions from Stripe), and when the system needs to handle deduplication, additional processing, or complex analytics.
Recommendation:
For the database, there isn't a single "right" choice; the key is to show the thought process and reasoning behind it. For this design, we use Option 2 with a database. The main database use cases include storing structured events with fixed schemas, ensuring no data loss, handling write-heavy workloads with minimal update contention, and supporting high throughput for both reads and writes.
Both SQL databases like CockroachDB and PostgreSQL and NoSQL databases like Cassandra and DynamoDB are fine choices. Regardless of the choice, it's important to revisit how the selected database satisfies the above use cases and requirements (e.g., PostgreSQL provides ACID guarantees for reliable event storage, supports JSONB for flexible event payloads, and offers table partitioning for handling large-scale data efficiently).
Deep Dive
Security
1. Verify source application and prevent unauthorized event injection.
When source applications (like payment processors - Stripe, e-commerce platforms - Shopify) want to send events to our webhook system, they provide us with their API credentials (API Key + Secret) for authentication.
- (Source Application Onboarding) Source applications provide us with their API credentials during integration.
- (Event Ingestion Authentication) Source applications include their own credentials when sending events.
- (Verification Process) Our Event Watcher validates that:
- API key matches what the source application provided us during onboarding
- Request signature matches the expected HMAC-SHA256 computed with their provided secret
- Timestamp is within 5-minute window (prevents replay attacks)
- Event type is allowed for this source application
- For additional security, we could maintain an IP allowlist for all source applications, so that events/requests from non-allowlisted IPs are discarded.
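A sketch of this ingestion-side verification (the header layout and credential store are assumptions; the timestamp check enforces the 5-minute replay window):

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_SECONDS = 300   # 5-minute window to prevent replay attacks

# Hypothetical credential store populated during source-application onboarding.
SOURCE_CREDENTIALS = {
    "stripe_prod_key": {"secret": "s3cr3t", "allowed_event_types": {"payment.succeeded"}},
}

def verify_source_request(api_key: str, timestamp: str, signature: str,
                          raw_body: bytes, event_type: str) -> bool:
    creds = SOURCE_CREDENTIALS.get(api_key)
    if creds is None:
        return False                                       # unknown API key

    if abs(time.time() - int(timestamp)) > REPLAY_WINDOW_SECONDS:
        return False                                       # outside the replay window

    expected = hmac.new(creds["secret"].encode(),
                        timestamp.encode() + b"." + raw_body,
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False                                       # signature mismatch

    return event_type in creds["allowed_event_types"]      # event type allowed for this source
```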
2. Support data encryption and ensure data integrity.
When delivering events to client endpoints, delivery worker signs every webhook delivery using the webhook-specific signing secret. Clients verify our signature using the signing secret we provided.
For sensitive data fields, the Delivery Worker can encrypt selected fields or the entire payload using client-provided RSA public keys. Clients can decrypt the payload upon receiving new events.
Fault Tolerant
1. Guaranteed Delivery
Guaranteed delivery is usually implemented with acknowledgements ("ACK"). Recall from "Design ChatApp" that the client app (installed on mobile or desktop) and the chat backend can adhere to a structured, consistent acknowledgment message format, which can then be used to track the last successful delivery. In the webhook case, we cannot really force all our clients to respond with a structured response or follow an acknowledgment protocol. But we can still handle this by at least interpreting the HTTP status code. For example:
- 200: assume success/acknowledged.
- 4xx: client error; do not retry immediately. Mark the event as failed, move it to the DLQ, and replay failed events once the client-side issue is resolved.
- 5xx and other unexpected responses: retry.
Most webhook systems (including Stripe) implement "best-effort delivery" rather than true guaranteed delivery. What we can guarantee is:
- (At least once & Retry) events will be attempted multiple times;
- (Persistent Storage) events won’t be lost from webhook system;
- (Delivery Tracking) event delivery attempts will be persisted/tracked in DB and failed events will be moved to DLQ for further processing.
2. Ensure No Data Loss
- Database durability.
- PostgreSQL: primary + secondary (additional read-only replicas)
- NoSQL: DynamoDB (fully managed by AWS), Cassandra (specify the replication factor)
- Queue durability.
- Kafka (and AWS SQS) are built with data durability guarantees through replication. In case of consumer failure (in this case, the delivery workers), we can increase message retention to 14 days (the SQS maximum) or even longer with Kafka, providing enough time for delivery retries.
- Regular Backup/Snapshot.
- We could take additional DB-level backup at a regular/fixed frequency (or adaptive to event rates.) In case of data loss, we could restore from snapshots.
3. Monitor failure and handle retry
- using external tools (Datadog, CloudWatch, etc.) or implementing health-check APIs.
- leverage SQS visibility timeout with exponential backoff.
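With SQS, a consumer can extend a message's visibility timeout to implement per-message backoff without losing the message; a rough sketch using boto3 (queue URL and receipt handle are placeholders):

```python
import boto3

sqs = boto3.client("sqs")

def backoff_retry(queue_url: str, receipt_handle: str, attempt: int) -> None:
    """Delay redelivery by extending the visibility timeout exponentially (capped at 12 hours)."""
    delay = min(30 * (2 ** attempt), 43_200)   # SQS visibility timeout maximum is 12 hours
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=delay,
    )
```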
High-scalability
1. Handle “hot client”
- Per-Client Queue Partitioning: Events are distributed across multiple SQS queues using client_id-based partitioning. This ensures that a hot client's high event volume doesn't block processing for other clients. Each partition can be scaled independently based on the volume of that specific client.
- Adaptive Rate Limiting: The system implements per-webhook rate limiting to protect client infrastructure (see the token-bucket sketch after this list):
- Default: 1K requests/second per webhook
- Configurable during webhook registration
- Adaptive scaling based on client response times and error rates
- Priority-based Delivery: Multiple queue tiers handle different client priorities:
- high-priority queue (time-sensitive events requiring immediate delivery)
- standard queue (standard events that can tolerate up to a few seconds of delay)
- bulk queue (latency is not a concern; prefer fewer requests and minimal interruption/impact to clients)
- Client-Specific Scaling: For clients receiving millions of events, the system can allocate more dedicated Delivery Worker instances (or a dedicated delivery worker pool) to ensure consistent performance.
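As referenced in the rate-limiting bullet above, per-webhook limits could be enforced with a simple token bucket; a minimal in-process sketch (a real deployment would more likely keep the counters in a shared store such as Redis so all delivery workers see the same state):

```python
import time

class TokenBucket:
    """Per-webhook rate limiter: allows `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per webhook, e.g. the default of 1K requests/second:
limiter = TokenBucket(rate=1000, capacity=1000)
if limiter.allow():
    pass  # proceed with delivery; otherwise requeue or delay the message
```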
2. Handle “hot source application”
- Horizontal Scaling: The Event Watcher can automatically scale based on ingestion rate and/or throughput.
- Source-Specific Rate Limiting: Aggressive rate limiting for sources showing unusual patterns:
- Baseline rate established from historical data
- Alert when source exceeds 3x baseline for 5 minutes
- Temporary throttling when source exceeds 10x baseline
- Database Write Optimization: For extreme loads, implement database optimizations:
- Batch inserts for multiple events (see the sketch after this list)
- Asynchronous replication for read replicas
- Table partitioning by date and client_id for improved write performance
- (Optional) Database Connection Pooling: Optimized connection pools handle write-heavy workloads.
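For the batch-insert optimization mentioned above, a sketch using psycopg2's execute_values (the connection string, table, and column names are illustrative):

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=webhooks")   # placeholder connection string

events = [
    ("evt_123", "acct_42", "payment.succeeded", '{"amount": 2499}'),
    ("evt_124", "acct_42", "order.created", '{"order_id": "ord_9"}'),
]

with conn, conn.cursor() as cur:
    # One round trip for many rows instead of one INSERT statement per event.
    execute_values(
        cur,
        "INSERT INTO events (event_id, client_id, event_type, payload) VALUES %s",
        events,
    )
```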
Observability (Be able to track delivery)
We need to track the delivery status of every webhook event, including multiple retry attempts, response times, and error details. This tracking data must be accessible to clients through APIs so they can monitor their webhook delivery performance and troubleshoot issues.
The core design question is where to store this delivery tracking information.
1. Extend the existing events table
We could add delivery status columns directly to the events table: delivery_status, last_attempt_time, retry_count, last_error_message. This keeps all event-related data in one location and simplifies queries when we need both event content and delivery status.
2. Create a separate delivery tracking table
We could maintain delivery tracking in a dedicated table, with each record representing one delivery attempt to one webhook endpoint.
The issue with extending the events table is the one-to-many problem. A single event can be delivered to multiple webhooks (when multiple clients subscribe to the same event type), and each webhook delivery can have multiple retry attempts.
If we store delivery status in the events table, we cannot represent the reality that event evt_123 might be successfully delivered to webhook wh_456 but still retrying delivery to webhook wh_789. There's no single "delivery status" that accurately represents this mixed state across multiple webhooks.
Recommendation
We choose Option 2 because it properly models the one-to-many relationship between events and delivery attempts. Additionally, events and delivery tracking have fundamentally different access patterns: events are created once and rarely modified, while delivery status updates frequently as workers process delivery attempts. Separating these concerns allows us to optimize each table independently and avoid write contention on the events table.
Further Read: Monitor and Alert (for failed and delayed delivery)
When delivery attempts exhaust all retries, events move to a Dead Letter Queue (DLQ). Our monitoring focuses on the volume of failed events to detect both individual webhook issues and system-wide problems.
We can monitor and set different alerts based on the number of events in DLQ for a specific time window.
- If we're using AWS SQS + AWS CloudWatch, we can create CloudWatch alarms based on the volume of failed events in the DLQ (see the sketch after this list).
- If we're using some other queue service/system, we can run a background monitor service that queries the DLQ at a fixed frequency, counts the events added in the last minute/hour, and groups them by webhook_id. When a threshold is exceeded, the service sends alerts.
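If we're on SQS + CloudWatch, the alarm in the first bullet could be created roughly like this (the queue name, threshold, and alarm action are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the DLQ holds more than 100 failed events for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="webhook-dlq-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "webhook-delivery-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:webhook-oncall"],  # placeholder SNS topic
)
```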