
Webhook

Written by
Staff Eng at Meta
Last revisited
June 6, 2025

Problem Understanding and Clarification

  • What is a Webhook System and its application?
    • A webhook is a mechanism that allows one application to notify another when an event occurs
    • Essentially, it's a real-time notification system that pushes data to a designated endpoint when a specific event happens, instead of having to continuously check for updates.
    • For example, when a consumer makes a payment on Amazon to buy a product, a webhook connects Amazon's payment gateway to an external payment network (e.g., Visa, Mastercard). It listens for status changes on the payment event from the external payment network, such as “payment success” or “payment rejected”, and triggers the corresponding transaction updates in Amazon's database.
  • Clarify the requirements, such as:
  • What types of events trigger webhooks? Common triggers include user actions (e.g. message sent, status update) and system events (e.g. job completion, threshold exceeded).
  • Expected volume of events: Depends on scale—some systems may emit millions to billions of events per day.
  • Delivery guarantees: Do you need at-least-once or exactly-once delivery semantics? This affects system design and reliability.
  • Event ordering guarantees: Do events need to be processed in order? If yes, this adds complexity and constraints on parallelism.
  • Latency requirements: Are events expected to be delivered in real time (e.g. sub-second)? Latency SLAs shape your transport and infrastructure decisions.
  • Security/authentication needs: Must the data be encrypted in transit? Should webhooks be signed or authenticated using headers or tokens?
  • Define the scope:
  • Is this a standalone webhook system or part of a larger platform (e.g., embedded within a source application such as a payment system)?

Clarifying these questions helps pin down the functional and non-functional requirements, and can also shape a better design. For example:

  • If the system needs high throughput and high volume with low-latency delivery, it is more reasonable to consider a stream instead of a queue.
  • If there is a strong security requirement, be ready to discuss how to ensure data integrity (i.e., data cannot be intercepted and modified in transit or at rest).

Functional Requirements

  • Clients should be able to register, update, and delete a webhook with a callback URL
  • Clients should be able to receive events from the source application through the webhook system
  • System should support event filtering based on event types.

Using the previous payment system as an example:

  • “Client” can be treated as Amazon internal payment service.
  • “Source” can be treated as External Payment Network (e.g., Visa, Mastercard, etc).
  • “Callback URL” signifies where the source application should send events when a particular event occurs; in this case, Amazon's payment endpoint API.

Non-Functional Requirements

  • highly scalable to support up to 1B events per day (size for peak QPS, not just the daily average; see the back-of-envelope below)
  • highly available for receiving, processing and delivering events
  • fault-tolerant (monitor and retry on failures)
  • secure and prevent unauthorized access and data tampering
  • highly durable and ensure no data loss
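A quick back-of-envelope for the scalability target: 1B events/day ÷ 86,400 s/day ≈ 11,600 events per second on average. Assuming peak traffic runs 3-5x the average (an assumption worth confirming with the interviewer), the system should be provisioned for roughly 35K-60K events per second at peak.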

API Design

(1) Register a webhook

POST /api/v1/webhooks

header:
  Authorization: Bearer {JWT_TOKEN}

request_body:
{
  "callback_url": "string (required, must be HTTPS)",
  "event_types": ["string"],            // optional, e.g., ["payment_success", "order_updated"]
  "trigger_conditions": {
    "filters": { "amount_gt": 1000, "currency": "USD" }
  },
  "retry_config": {
    "max_retries": 5,
    "initial_delay_ms": 1000,
    "max_delay_ms": 30000,
    "backoff_multiplier": 2.0
  },
  "rate_limit": { "requests_per_minute": 100 }
}

response: 200 (OK), 400 (Bad Request), 401 (Unauthorized), 5xx (Server Error)
{
  "webhook_id": "uuid",
  "status": "active|inactive",
  "signing_secret": "string (server-generated for HMAC)",
  "created_at": "ISO8601 timestamp",
  "callback_url_verified": "boolean"
}

(2) Update / Delete a webhook

PATCH (DELETE) /api/v1/webhooks/{webhook_id}

header:
  Authorization: Bearer {JWT_TOKEN}

request_body:
{
  "callback_url": "string",             // optional
  "event_types": ["string"],            // optional
  "status": "active|inactive",          // optional
  "retry_config": { ... }               // optional
}

response: 200 (OK), 400 (Bad Request), 401 (Unauthorized), 404 (Not Found)
{
  "webhook_id": "uuid",
  "status": "active|inactive",
  "updated_at": "ISO8601 timestamp"
}

Event (Data) Flow

Before we get into the high-level design, let's look at the end-to-end event (data) flow within a webhook system. The primary function of a webhook system is to enable real-time communication by delivering events (e.g., payment updates) from a source application (e.g., Stripe) to a client's designated callback URL. It ensures reliable, at-least-once delivery for sensitive applications (such as payments).

You may not need to cover this part in your interview for the sake of time. However, we find many candidates are confused about the end-to-end process, so we provide this content to give you a clear understanding of how the data flow and API calls fit together.

Here is a typical sequence of steps:

1. Event trigger in source application: When a specific action occurs in the source application (e.g., a payment is successfully processed in Stripe, triggering a payment_success event), the application generates an event payload (a JSON object containing the event details).

{ "eventId": "evt_123", "eventType": "payment_success", "data": { "paymentId": "pay_456", "amount": 1000, "currency": "USD" }, "timestamp": "2025-05-18T18:06:00Z", "source": "payment_service", "version": "1.0" }

2. Source application pushes the event to Webhook system: The source application sends the event to the webhook system via an HTTP POST request to an event ingestion endpoint.

3. Webhook system processes and delivers events: The webhook system processes the event, identifies matching subscriptions, and delivers to registered callback URLs with proper security headers and retry logic.

4. Client processes the new event: Upon receiving the webhook, the client validates the signature, processes the event, and returns an HTTP 200 status code to confirm successful receipt.

Where exactly is the webhook in this flow?

  • Webhook System: A standalone intermediary system hosted on its own infrastructure. It acts as a bridge between source applications and clients, typically implemented as a separate microservice.
  • Source Application Side: The source application is the originator of events (e.g., payment_success) based on actions within its domain. It sends these events to the webhook system via HTTP POST requests to the ingestion API. It does not manage delivery.
  • Client Side: The client receives events and registers webhooks with callback URLs (like setting up a mailbox). It uses REST APIs to configure where and how to receive those events.

High-Level Design

The core components of our high-level design include:

Webhook Manager

serves as the central registry and configuration store for all webhook subscriptions, storing all webhook settings in a database with a Redis cache layer for low-latency lookups.

  • exposes RESTful APIs for webhook CRUD operations and validates callback URLs through DNS resolution, HTTPS enforcement, and reachability tests.
  • manages authentication and authorization for all webhook operations, generates signing secrets for HMAC verification, or stores/processes client-provided public keys.

Ensuring data integrity (no tampering) is an interesting topic. For now, it is enough to briefly mention that events (or, more specifically, event payloads) will be encrypted in transit. You can say “we will deep dive into encryption and security requirements later” (comparing different options: public-key encryption vs. HMAC).

Event Watcher (Event Processing Service)

operates as a set of horizontally scalable microservices that receive events from source applications and route them to the appropriate webhooks.

  • accepts events via HTTP POST, performs initial validation (schema validation and deduplication), and transforms events into a canonical format.
  • queries matching webhook subscriptions from the cache and database, applies event filtering based on trigger conditions, and then publishes delivery messages directly to the delivery queues.

Message Queue or Stream

provides reliable, scalable event processing with multiple tiers based on priority and delivery requirements. The queue structure includes high-priority queues for critical events such as payments and security alerts, standard queues for regular business events, and a dead-letter queue (DLQ) for events that have exhausted all retry attempts.

Delivery Worker

consumes events from the queues and delivers them to callback URLs using HTTP POST requests.

  • retrieves event data, generates HMAC signatures using signing secrets, and sends HTTP requests to callback URLs.
  • updates delivery status and handles retries with rate limiting per webhook endpoint to respect client infrastructure limitations.

Monitoring & Alerting

  • Monitors the health of all internal components (through heartbeat messages or periodic pings/health APIs) and collects real-time status (resource consumption, throughput, queue metrics).
  • Alerts on (1) worker failures and (2) failed events in the DLQ.

Workflow 1: Register a new Webhook

The client sends a POST request to /api/v1/webhooks with callback URL, event types, and configuration details:

  • callback_url: The destination endpoint that will receive event notifications
  • event_types: List of events to subscribe to (e.g., "payment.succeeded", "order.created")
  • trigger_conditions: Optional filtering criteria to receive only specific instances of events
  • retry_policy: Optional custom retry settings (defaults apply if not specified)
  • (optional) public_key: used to encrypt sensitive payload data

The Webhook Manager validates the JWT token and verifies the client's permissions for webhook creation. The system performs initial validation, including URL format verification and HTTPS enforcement, DNS resolution of the callback domain, reachability testing with an HTTP HEAD request, and event type validation against supported types.

After successful validation, the Webhook Manager stores the complete webhook configuration in the database, and essential lookup information is cached in Redis for performance. Finally, the Webhook Manager returns a response containing the webhook_id, signing_secret, and verification status to the client.
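For illustration, here is a minimal sketch of that validation step, assuming Node.js 18+ (global fetch) and the built-in dns/promises module; the function name is hypothetical:

const dns = require('dns').promises;

// Validate a callback URL before persisting the webhook:
// HTTPS enforcement, DNS resolution, and a reachability probe.
async function validateCallbackUrl(callbackUrl) {
  const url = new URL(callbackUrl);               // throws on malformed URLs
  if (url.protocol !== 'https:') {
    throw new Error('callback_url must use HTTPS');
  }
  await dns.lookup(url.hostname);                 // throws if DNS does not resolve
  const res = await fetch(callbackUrl, { method: 'HEAD' }); // reachability test
  if (res.status >= 500) {
    throw new Error(`callback endpoint unhealthy: ${res.status}`);
  }
  return true;
}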

Workflow 2: Event Delivery

We will split into two steps: (1) Event Ingestion & Processing and (2) Delivery Execution.

(1) Event Ingestion & Processing

When a source application generates an event (such as a payment being processed or an order status changing), it sends the event to the Event Watcher via an HTTP request.

Upon receiving a new event, the Event Watcher extracts the event's core attributes (event_id, client_id, event_type, timestamp, and payload), validates the event schema to ensure all required fields are present, and checks for duplicates based on event_id and client_id. For validated events, the Event Watcher identifies all webhook subscriptions that should receive this event by matching the client_id, event_type, active status, and any filtering conditions. This subscription lookup uses optimized database indices (PK: webhook_id; GSI: client_id & event_type) and the cache.

For each matching webhook, the Event Watcher generates an HMAC signature, which enables the client to verify payload integrity. After signing and (where configured) encrypting the payload, the Event Watcher places a message in the appropriate queue for each target webhook, referencing the event_id and webhook_id.
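To make the filter-matching step concrete, here is a hedged sketch of evaluating trigger_conditions.filters (from the registration API above) against an event payload; the helper name is an illustrative assumption:

// Evaluate a webhook's trigger_conditions.filters against event data.
// Supports equality and a "_gt" suffix for greater-than comparisons.
function matchesFilters(filters, eventData) {
  return Object.entries(filters || {}).every(([key, expected]) => {
    if (key.endsWith('_gt')) {
      const field = key.slice(0, -3);             // "amount_gt" -> "amount"
      return Number(eventData[field]) > Number(expected);
    }
    return eventData[key] === expected;           // e.g., currency === "USD"
  });
}

// Example: matches the filters from the registration API above.
matchesFilters({ amount_gt: 1000, currency: 'USD' },
               { paymentId: 'pay_456', amount: 1500, currency: 'USD' }); // true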

(2) Delivery Execution

The delivery process begins when a Delivery Worker pulls a message from the queue. The Delivery Worker retrieves the full event data and webhook configuration from the DB, and then prepares an HTTP POST request to the webhook’s callback URL.

The Delivery Worker sends the HTTP request with a pre-configured connection timeout (say, 10 seconds). On receiving an HTTP response, the Delivery Worker determines whether delivery succeeded. If no response is received before the timeout, it retries with exponential backoff. After exhausting all retries, the Delivery Worker moves the event to the DLQ and updates the event's delivery status to FAILED in the DB.
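A minimal sketch of this retry loop, reusing the retry_config fields from the registration API; deliver() is a hypothetical helper that performs the signed HTTP POST:

// Retry with exponential backoff, capped at max_delay_ms.
// With max_retries = 5 this makes 1 initial attempt + 5 retries.
async function deliverWithRetry(event, webhook) {
  const { max_retries, initial_delay_ms, max_delay_ms, backoff_multiplier } =
    webhook.retry_config;
  let delay = initial_delay_ms;
  for (let attempt = 0; attempt <= max_retries; attempt++) {
    try {
      return await deliver(event, webhook);       // signed HTTP POST, 10s timeout
    } catch (err) {
      if (attempt === max_retries) throw err;     // exhausted: caller moves event to DLQ
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * backoff_multiplier, max_delay_ms);
    }
  }
}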

Design Options

Dimension 1: Choose between Queue (AWS SQS) vs Stream (Kafka, AWS Kinesis Stream)

Option 1: Queue

  • moderate throughput (~a few thousand events per second)
  • guarantees at-least-once delivery and ordering (with FIFO queues)
  • scales automatically (by increasing batch_size and shard_count)
  • BUT not designed for continuous or real-time data streams

Option 2: Stream

  • supports high throughput (up to millions of events per second)
  • suited when real-time, low-latency processing is a requirement

Recommendation: The choice depends on the clarified functional and non-functional requirements. In general, queues are sufficient for most webhook systems, while Kafka and other streams should be used for ultra-high throughput. For our case, with a requirement to support up to 1 billion events per day (approximately 11,600 events per second on average), AWS SQS provides sufficient throughput while offering managed infrastructure, automatic scaling, and built-in retry mechanisms that align well with webhook delivery requirements.

Dimension 2: With vs Without Database

Option 1: Queue-Only (No Persistent Storage)

This approach works when events are non-critical and clients can tolerate an occasional missed delivery, when there is no need for auditing, replay, or complex analytics, when event volume is low to moderate such that SQS's 14-day retention is enough for retries, and when duplicates are not a concern (or can be handled by clients).

Option 2: Database + Queue (Recommended)

This approach is necessary when guaranteed delivery is required, when historical events must be retrievable for auditing, compliance, and debugging (such as analyzing financial transactions from Stripe), and when the system needs to handle deduplication, additional processing, or complex analytics.

Recommendation:

For the database, there isn't a single "right" choice; the key is to show the thought process and reasoning behind it. For this design, we use Option 2 with a database. The main database use cases include storing structured events with fixed schemas, ensuring no data loss, handling write-heavy workloads with minimal update contention, and supporting high throughput for both reads and writes.

Both SQL databases like CockroachDB and PostgreSQL and NoSQL databases like Cassandra and DynamoDB are fine choices. Regardless of the choice, it's important to revisit how the selected database satisfies the above use cases and requirements (e.g., PostgreSQL provides ACID guarantees for reliable event storage, supports JSONB for flexible event payloads, and offers table partitioning to handle large-scale data efficiently).

Deep Dive

Security

1. Verify source application and prevent unauthorized event injection.

When source applications (payment processors like Stripe, e-commerce platforms like Shopify) want to send events to our webhook system, they provide us with their API credentials (API key + secret) for authentication.

  • (Source Application Onboarding) Source applications provide us with their API credentials during integration:
{ "source_name": "stripe_payment_processor", "api_key": "sk_live_stripe_provided_key", // Stripe provides this to us "webhook_secret": "whsec_stripe_provided_secret", // Stripe provides this to us "allowed_event_types": ["payment_intent.succeeded", "payment_intent.failed"]}
  • (Event Ingestion Authentication) Source applications include their own credentials when sending events:
POST /api/v1/events/ingest
Authorization: Bearer sk_live_stripe_provided_key    <-- API key
X-Source-Signature: sha256=HMAC-SHA256(whsec_stripe_provided_secret, request_body)
X-Source-Timestamp: 1705312200
Content-Type: application/json

{
  "event_id": "evt_123",
  "event_type": "payment_intent.succeeded",
  "data": { ... }
}


  • (Verification Process) Our Event Watcher validates:
    • API key matches what the source application provided us during onboarding
    • Request signature matches expected HMAC-SHA256 using their provided secret
    • Timestamp is within 5-minute window (prevents replay attacks)
    • Event type is allowed for this source application
    • For additional security, we could maintain an IP allowlist for all source applications, so that events/requests from non-allowlisted IPs are discarded.
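Here is a hedged sketch of those checks, using Node's built-in crypto module; the header names mirror the ingestion example above, req.rawBody assumes the raw request bytes were captured by middleware, and the event-type allowlist check is omitted for brevity:

const crypto = require('crypto');

const FIVE_MINUTES = 5 * 60;                      // replay-attack window, in seconds

function verifyIngestRequest(req, source) {
  // source = credentials captured at onboarding (api_key, webhook_secret, ...)
  const token = (req.headers['authorization'] || '').replace('Bearer ', '');
  if (token !== source.api_key) return false;

  const ts = Number(req.headers['x-source-timestamp']);
  if (Math.abs(Date.now() / 1000 - ts) > FIVE_MINUTES) return false;

  // Sign the raw request bytes, not re-serialized JSON.
  const expected = 'sha256=' + crypto
    .createHmac('sha256', source.webhook_secret)
    .update(req.rawBody)
    .digest('hex');
  const given = req.headers['x-source-signature'] || '';
  return given.length === expected.length &&
         crypto.timingSafeEqual(Buffer.from(given), Buffer.from(expected));
}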

2. Support data encryption and ensure data integrity.

When delivering events to client endpoints, the Delivery Worker signs every webhook delivery using the webhook-specific signing secret. Clients verify the signature using the signing secret we provided at registration.

// Our Delivery Worker signs the payload
const payload_to_sign = timestamp + "." + JSON.stringify(event_data);
const signature = HMAC_SHA256(webhook.signing_secret, payload_to_sign);

For sensitive data fields, the Delivery Worker can encrypt selected fields or the entire payload using a client-provided RSA public key. The client can then decrypt the payload on receiving new events.
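On the client side, verification could look like this minimal sketch (Node's crypto module; the exact header carrying the signature is an assumption of this example):

const crypto = require('crypto');

// Client-side check: recompute the HMAC over "timestamp.payload" and
// compare it to the received signature in constant time.
function verifyWebhook(signingSecret, timestamp, rawBody, signatureHeader) {
  const expected = crypto
    .createHmac('sha256', signingSecret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');
  return signatureHeader.length === expected.length &&
         crypto.timingSafeEqual(Buffer.from(signatureHeader),
                                Buffer.from(expected));
}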

Fault Tolerance

1. Guaranteed Delivery

Guaranteed delivery is usually implemented with acknowledgements (”ACKs”). Recall from “Design ChatApp” that the client app (installed on mobile or desktop) and the chat backend can adhere to a structured, consistent acknowledgment message format, which can then be used to track the last successful delivery. In the webhook case, we cannot force all clients to respond with a structured response or follow an acknowledgment protocol, but we can at least interpret the HTTP status code. For example:

  • 200: we can assume success/acknowledged.
  • 4xx: client error; don’t immediately retry. Mark the event as failed, move it to the DLQ, and replay failed events once the client-side issue is resolved.
  • 5xx and other unexpected responses: retry.
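Translated into code, the policy above might look like the following sketch (treating any 2xx as success; the action labels are illustrative):

// Map the client's HTTP response status to a delivery action.
function classifyResponse(status) {
  if (status >= 200 && status < 300) return 'ACK';   // success / acknowledged
  if (status >= 400 && status < 500) return 'DLQ';   // client error: park and replay later
  return 'RETRY';                                    // 5xx or unexpected: retry with backoff
}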

Most webhook systems (including Stripe's) implement “best-effort delivery” rather than true guaranteed delivery. What we can guarantee is:

  • (At-least-once & Retry) events will be attempted multiple times;
  • (Persistent Storage) events won’t be lost from the webhook system;
  • (Delivery Tracking) event delivery attempts will be persisted/tracked in the DB, and failed events will be moved to the DLQ for further processing.

2. Ensure No Data Loss

  • Database durability.
    • PostgreSQL: primary + secondary (additional read-only replicas)
    • NoSQL: DynamoDB (fully managed by AWS), Cassandra (specify the replication factor)
  • Queue durability.
    • Both Kafka and AWS SQS were built with data durability guarantees through replication. In case of consumer failure (in this case, the delivery workers), we can increase message retention to 14 days (the SQS maximum) or an even longer period with Kafka, providing enough time for delivery retries.
  • Regular Backup/Snapshot.
    • We could take additional DB-level backups at a regular/fixed frequency (or adaptively to event rates). In case of data loss, we could restore from snapshots.

3. Monitor failures and handle retries

  • Use external tools (Datadog, CloudWatch, etc.) or implement health-check APIs.
  • Leverage the SQS visibility timeout with exponential backoff, as sketched below.
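For example (a sketch, assuming the AWS SDK for JavaScript v3), a delivery worker can extend a failed message's visibility timeout along the backoff curve so SQS redelivers it later:

const { SQSClient, ChangeMessageVisibilityCommand } =
  require('@aws-sdk/client-sqs');

const sqs = new SQSClient({});

// After a failed attempt, hide the message for the next backoff interval
// instead of deleting it; SQS redelivers it once the timeout expires.
async function backoffRetry(queueUrl, receiptHandle, attempt) {
  const delaySec = Math.min(2 ** attempt, 900);    // exponential backoff, capped at 15 min
  await sqs.send(new ChangeMessageVisibilityCommand({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: delaySec,
  }));
}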

High Scalability

1. Handle “hot client”

  • A "hot client" refers to a client that has registered webhooks receiving an unusually high volume of events, potentially overwhelming their infrastructure or creating
  • Per-Client Queue Partitioning: Events are distributed across multiple SQS queues using client_id-based partitioning. This ensures that a hot client's high event volume doesn't block processing for other clients. Each partition can be scaled independently based on the volume of that specific client.
  • Adaptive Rate Limiting: The system implements per-webhook rate limiting to protect client infrastructure:
    • Default: 1K requests/second per webhook
    • Configurable during webhook registration
    • Adaptive scaling based on client response times and error rates
    When a client starts responding slowly (>5 seconds) or returning errors (>5% failure rate), the delivery worker automatically reduces the delivery rate to prevent overwhelming the client.
  • Priority-based Delivery: Multiple queue tiers handle different client priorities:
    1. high-priority queue (time-sensitive events requiring immediate delivery)
    2. standard queue (standard events that can tolerate a delay of up to a few seconds)
    3. bulk queue (latency is not a requirement; prefer reduced request volume and minimal interruption/impact to clients)
  • Client-Specific Scaling: For clients receiving millions of events, the system can allocate more dedicated Delivery Worker instances (or a dedicated delivery worker pool) to ensure consistent performance.
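A minimal sketch of the client_id-based queue partitioning mentioned above (hash-modulo assignment; the partition count and queue naming are illustrative):

const crypto = require('crypto');

const NUM_PARTITIONS = 16;                        // one SQS queue per partition

// Stable hash of client_id -> queue index, so one hot client's traffic
// stays in its own partition and can be scaled independently.
function queueForClient(clientId) {
  const hash = crypto.createHash('sha256').update(clientId).digest();
  const index = hash.readUInt32BE(0) % NUM_PARTITIONS;
  return `webhook-delivery-queue-${index}`;
}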

2. Handle “hot source application”

  • A "hot source application" refers to a source that suddenly generates an extremely high volume of events, potentially overwhelming the ingestion and processing pipeline.
  • Horizontal Scaling: The Event Watcher can automatically scale based on ingestion rate and/or throughput.
  • Source-Specific Rate Limiting: Aggressive rate limiting for sources showing unusual patterns:
    • Baseline rate established from historical data
    • Alert when source exceeds 3x baseline for 5 minutes
    • Temporary throttling when source exceeds 10x baseline
  • Database Write Optimization: For extreme loads, implement database optimizations:
    • Batch inserts for multiple events
    • Asynchronous replication for read replicas
    • Table partitioning by date and client_id for improved write performance
  • (Optional) Database Connection Pooling: optimized connection pools handle write-heavy workloads; see the sketch below.
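A hedged sketch using node-postgres, combining the pooling and batch-insert ideas above (the pool sizes and the events table schema are illustrative assumptions):

const { Pool } = require('pg');

// A bounded pool keeps the connection count stable under write-heavy load;
// workers reuse connections instead of opening one per insert.
const pool = new Pool({
  max: 50,                          // upper bound on concurrent connections
  idleTimeoutMillis: 30000,         // recycle idle connections
  connectionTimeoutMillis: 2000,    // fail fast when the pool is saturated
});

// Batch insert: one round trip for many events
// (assumed table: events(event_id TEXT PRIMARY KEY, payload JSONB)).
async function insertEvents(events) {
  const values = events
    .map((_, i) => `($${2 * i + 1}, $${2 * i + 2})`)
    .join(', ');
  const params = events.flatMap((e) => [e.event_id, JSON.stringify(e)]);
  await pool.query(
    `INSERT INTO events (event_id, payload) VALUES ${values}
     ON CONFLICT (event_id) DO NOTHING`,          // idempotent under retries
    params
  );
}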

Observability (Be able to track delivery)

We need to track the delivery status of every webhook event, including multiple retry attempts, response times, and error details. This tracking data must be accessible to clients through APIs so they can monitor their webhook delivery performance and troubleshoot issues.

The core design question is where to store this delivery tracking information.

1. Extend the existing events table

We could add delivery status columns directly to the events table: delivery_status, last_attempt_time, retry_count, last_error_message. This keeps all event-related data in one location and simplifies queries when we need both event content and delivery status.

2. Create a separate delivery tracking table

We could maintain delivery tracking in a dedicated table, with each record representing one delivery attempt to one webhook endpoint.

The issue with extending the events table is the one-to-many problem. A single event can be delivered to multiple webhooks (when multiple clients subscribe to the same event type), and each webhook delivery can have multiple retry attempts.

If we store delivery status in the events table, we cannot represent the reality that event evt_123 might be successfully delivered to webhook wh_456 but still retrying delivery to webhook wh_789. There's no single "delivery status" that accurately represents this mixed state across multiple webhooks.

Recommendation

We choose Option 2 because it properly models the one-to-many relationship between events and delivery attempts. Additionally, events and delivery tracking have fundamentally different access patterns: events are created once and rarely modified, while delivery status updates frequently as workers process delivery attempts. Separating these concerns allows us to optimize each table independently and avoid write contention on the events table.
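For illustration, a single delivery-tracking record under Option 2 might look like this (the field names are assumptions, not a fixed schema):

{
  "delivery_id": "dlv_001",
  "event_id": "evt_123",
  "webhook_id": "wh_789",
  "attempt": 3,
  "status": "RETRYING",
  "http_status": 503,
  "response_time_ms": 4800,
  "last_error": "upstream timeout",
  "next_retry_at": "2025-05-18T18:12:00Z"
}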

Further Reading: Monitor and Alert (for failed and delayed delivery)

When delivery attempts exhaust all retries, events move to a Dead Letter Queue (DLQ). Our monitoring focuses on the volume of failed events to detect both individual webhook issues and system-wide problems.

We can monitor and set different alerts based on the number of events in DLQ for a specific time window.

  • If we’re using AWS SQS + AWS CloudWatch, we can create CloudWatch alarms based on the volume of failed events in the DLQ.
  • If we’re using some other queue service/system, we can run a background monitor service that queries the DLQ at a fixed frequency, counting events added in the last minute/hour and grouping them by webhook_id. When a threshold is exceeded, the service sends alerts.
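A minimal sketch of such a background monitor, assuming the AWS SDK for JavaScript v3; the queue URL, threshold, and alert sink are illustrative:

const { SQSClient, GetQueueAttributesCommand } =
  require('@aws-sdk/client-sqs');

const sqs = new SQSClient({});
const DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/webhook-dlq';
const THRESHOLD = 100;                            // failed events per check window

// Stand-in for a real alert sink (PagerDuty, Slack, etc.).
async function sendAlert(message) {
  console.error('[ALERT]', message);
}

// Poll the DLQ depth once a minute and alert when it exceeds the threshold.
async function checkDlqDepth() {
  const { Attributes } = await sqs.send(new GetQueueAttributesCommand({
    QueueUrl: DLQ_URL,
    AttributeNames: ['ApproximateNumberOfMessages'],
  }));
  const depth = Number(Attributes.ApproximateNumberOfMessages);
  if (depth > THRESHOLD) {
    await sendAlert(`DLQ depth ${depth} exceeds threshold ${THRESHOLD}`);
  }
}

setInterval(checkDlqDepth, 60 * 1000);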