Webhooks at Scale Best Practices and Lessons Learned Stripe London / July 15th, London @hookdeck @leggetter phil@hookdeck.com
A presentation at Stripe London in July 2025 in London, UK, by Phil Leggetter
Phil Leggetter, Head of DX
100 billion webhooks and counting
[Diagram labels: SMS API, CHARGES API, WEBHOOK, AWS S3]
Webhooks are the gateway drug to event-driven architecture.
Why are webhooks hard?
  Event-driven paradigms
  No control over producer
  Producer inconsistencies
  Out-of-order
  Duplicates (at-least-once)
  Queuing, DLQs, alerting
  Bursty traffic
  Security, authentication & verification
  Tight timeout limits
  Payload format
  Troubleshooting & retries
  Retries and guarantees
  Fragmented dev experience
Problem: Out-of-order & at-least-once
Providers don’t guarantee order
An at-least-once delivery guarantee means you must design with idempotency in mind
  subscription.updated → Update database: subscription doesn’t exist?
  subscription.created → Create in database: unique constraint conflict
  invoice.paid → Send payment confirmation email
  invoice.paid → Send payment confirmation email again?
  subscription.cancelled → Update DB to set cancel date
  subscription.updated → Wait, is the subscription cancelled or not?
“An operation is said to be idempotent if performing it multiple times produces the same result as performing it once.”
Idempotency Solution 1: Fetch before processing
How:
  For each webhook, retrieve the most recent data from the API
  Update database or take action based on latest data
Best when:
  Syncing record state
  High API rate limits (low risk of exceeding)
  Stripe new thin events
[Image: Stripe thin event containing only reference IDs, taken from Stripe docs]
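Fetch before processing could look something like the sketch below. It is not from the slides: the `object_id` field, the `fetch_latest` callable, and the in-memory `provider` and `local_db` dictionaries are hypothetical stand-ins for a real thin-event payload, API client, and database. Deleting the local copy when the object is missing upstream is also an assumption.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, str]


def handle_thin_event(
    event: Dict[str, str],
    fetch_latest: Callable[[str], Optional[Record]],
    db: Dict[str, Record],
) -> None:
    """Use the event only as a trigger; sync local state from the provider API."""
    object_id = event["object_id"]  # hypothetical field name
    latest = fetch_latest(object_id)  # one API call per webhook
    if latest is None:
        db.pop(object_id, None)  # assumption: gone upstream means delete locally
    else:
        db[object_id] = latest  # overwrite with the provider's current state


# Fake provider state standing in for a real API client.
provider: Dict[str, Record] = {"sub_1": {"status": "cancelled"}}
local_db: Dict[str, Record] = {}

# Out-of-order and duplicated deliveries all converge on the same result.
for event_type in [
    "subscription.updated",
    "subscription.created",
    "subscription.cancelled",
    "subscription.updated",
]:
    handle_thin_event({"type": event_type, "object_id": "sub_1"}, provider.get, local_db)
```

Because the handler never trusts the payload, event order and duplicates stop mattering; the cost is one API call per webhook, which is why the slide pairs this with high API rate limits.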
Idempotency Solution 2: Upsert by date
How:
  Create or update the record only when the date is more recent
  Take action conditional on the record data matching the upsert date
  Upsert must always be transactional!
[Image: Example using Postgres]
Best when:
  High throughput / API rate limit constraint
  Need to know if data is the most recent
  Must be keeping track of records
  Storage needs to allow transactional upsert
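A minimal sketch of upsert by date follows. The slide's example uses Postgres; this one swaps in SQLite (3.24 or later) through Python's standard library so it runs without a server, and the `ON CONFLICT ... DO UPDATE ... WHERE` form is close to the Postgres one. The `subscriptions` table, its columns, and the integer `updated_at` timestamp are assumptions for illustration.

```python
import sqlite3

# The WHERE clause turns the update into a no-op for stale or duplicate events.
UPSERT_IF_NEWER = (
    "INSERT INTO subscriptions (id, status, updated_at) VALUES (?, ?, ?) "
    "ON CONFLICT (id) DO UPDATE SET "
    "status = excluded.status, updated_at = excluded.updated_at "
    "WHERE excluded.updated_at > subscriptions.updated_at"
)


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical subscriptions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, updated_at INTEGER NOT NULL)"
    )
    return conn


def upsert_if_newer(conn: sqlite3.Connection, sub_id: str, status: str, event_ts: int) -> bool:
    """Return True if the row was written, False if the event was stale or a duplicate."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(UPSERT_IF_NEWER, (sub_id, status, event_ts))
        return cur.rowcount == 1
```

The boolean result is what lets a caller take action (for example, send an email) only when this event actually became the latest state.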
Idempotency Solution 3: Tracking processing state
How:
  Keep track of each unique event ID's processing status
  For each webhook, check the processing status
  Return an error (e.g. 409) if already processing
[Image: Implementation example using NodeJS]
Best when:
  Not storing or keeping track of records
  Need a generic implementation
  Depending on downstream dependencies like email providers or 3rd party APIs
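The slide's example uses NodeJS; the sketch below shows the same idea in Python with an in-memory dictionary, which would need to be a shared store in a multi-instance deployment. The `EventTracker` class is hypothetical, and the choices to return 200 for an already-completed event and 500 (clearing the status) on failure are assumptions, not something the slides specify.

```python
import threading
from typing import Callable, Dict


class EventTracker:
    """Tracks per-event processing status so duplicates are not processed twice."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._lock = threading.Lock()

    def handle(self, event_id: str, process: Callable[[], None]) -> int:
        """Run `process` at most once per event ID and return an HTTP-style status."""
        with self._lock:
            status = self._status.get(event_id)
            if status == "processing":
                return 409  # a duplicate arrived while the first is in flight
            if status == "done":
                return 200  # assumption: acknowledge completed duplicates
            self._status[event_id] = "processing"
        try:
            process()
        except Exception:
            with self._lock:
                del self._status[event_id]  # let a later retry try again
            return 500
        with self._lock:
            self._status[event_id] = "done"
        return 200
```

The check and the transition to "processing" happen under one lock, so two concurrent deliveries of the same event cannot both start work.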
Problem: Bursty traffic & tight timeouts
  High sustained volume and sudden bursts
  Flash sales, BFCM, batch processing, retry storms
[Chart labels: Response time, Webhooks per second, Timeout]
Solution: Decoupled, scalable ingestion
  Horizontally scalable, stateless ingestion layer
  Load balancing and autoscaling
  Use message queues or buffers at the edge
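A queue-first ingestion layer can be sketched as below, using Python's in-process `queue.Queue` as a stand-in for a real message queue. The function names and the 202 response code are assumptions; the point is only that the ingestion path does nothing but enqueue and acknowledge, while a separate worker does the processing.

```python
import json
import queue
import threading
from typing import Callable


def make_ingest(buffer: "queue.Queue") -> Callable[[bytes], int]:
    """Build a stateless ingest function that only enqueues and acknowledges."""

    def ingest(raw_body: bytes) -> int:
        buffer.put(raw_body)  # no parsing or business logic on the hot path
        return 202  # assumption: acknowledge as accepted for later processing

    return ingest


def run_worker(buffer: "queue.Queue", handle: Callable[[dict], None]) -> None:
    """Consume queued webhook bodies until a None sentinel is received."""
    while True:
        raw = buffer.get()
        try:
            if raw is None:
                return
            handle(json.loads(raw))
        finally:
            buffer.task_done()
```

Because `ingest` holds no state of its own, any number of copies can run behind a load balancer, and response time stays flat regardless of how slow the handler is.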
Problem: Overwhelming downstream systems
  Traffic spikes put pressure on backend services
  Event handlers are resource-constrained
  Processing rates can’t always match ingestion rates, causing backpressure
Solution: Throughput control & backpressure management
  Audit, test, and document scalability constraints
  Control throughput via concurrency & number of consumers
  Set up alerts for backpressure (queue depth / max age)
  Implement prioritization through multiple queues/consumers
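One way to picture throughput control and backpressure alerting is the sketch below. The `drain` and `backpressure_alerts` names and the default thresholds are hypothetical; a real system would read queue depth and oldest-message age from its queue's own metrics.

```python
import queue
import threading
from typing import Callable, List


def drain(buffer: "queue.Queue", handle: Callable[[object], None], concurrency: int) -> None:
    """Process everything currently queued using at most `concurrency` consumers."""

    def consume() -> None:
        while True:
            try:
                item = buffer.get_nowait()
            except queue.Empty:
                return  # nothing left for this consumer
            try:
                handle(item)
            finally:
                buffer.task_done()

    threads = [threading.Thread(target=consume) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def backpressure_alerts(
    depth: int, oldest_age_s: float, max_depth: int = 1000, max_age_s: float = 60.0
) -> List[str]:
    """Return alert messages when queue depth or oldest-message age crosses a limit."""
    alerts = []
    if depth > max_depth:
        alerts.append(f"queue depth {depth} exceeds {max_depth}")
    if oldest_age_s > max_age_s:
        alerts.append(f"oldest message age {oldest_age_s}s exceeds {max_age_s}s")
    return alerts
```

The number of consumers caps the load placed on downstream services, and the alert check surfaces the moment ingestion starts outpacing processing.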
Problem: Failure recovery & data integrity
  App downtime or bugs = failed deliveries
  Events may be lost without retry handling
  Repeated failures can block other events
Solution: Guaranteed processing or reconciliation
Guaranteed processing:
  Automated retry with backoff; vendor retry != guarantee
  Route failed events to DLQs & set up alerts for DLQs
  Test & validate ACK/NACK logic!
Reconciliation:
  Fetch all data from a time period (slow & loses incremental changes)
  Vendors with an Events API (like Stripe) make this easier
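Retry with backoff plus a dead-letter queue might be sketched as follows. The function name, the exponential schedule, and the list used as a DLQ are all assumptions for illustration; `sleep` is injectable so the logic can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, List


def process_with_retry(
    event: Dict[str, Any],
    handle: Callable[[Dict[str, Any]], None],
    dead_letters: List[Dict[str, Any]],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True (ACK) on success, or False after routing the event to the DLQ."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            handle(event)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: park the event so it stops blocking others.
                dead_letters.append(
                    {"event": event, "error": repr(exc), "attempts": attempt}
                )
                return False
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    return False  # unreachable; kept so every path returns a bool
```

Anything that lands in `dead_letters` is a candidate for an alert and a later manual or automated replay.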
Problem: Lack of observability
  Hard to trace webhook delivery across systems
  Limited visibility into failure causes and retries
  No central view of webhook status or history
  Sparse alerting about failure and backpressure issues
Solution: Centralized event logs
  Per-event logging and process tracing
  Searchable event history
  Retry / replay capabilities
  Alerting and notifications on processing issues
[Dashboard mockup: a Prod / Metrics view (24h total events, error rate, event count) and a Prod / Events view listing events by status and source, with a Retry action]
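A centralized event log with search and retry can be sketched in a few lines. Everything here (the `EventLog` and `LogEntry` names, treating a status of 400 or above as a failure, mapping handler exceptions to 500) is a hypothetical illustration rather than any particular product's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LogEntry:
    event_id: str
    source: str
    event_type: str
    payload: dict
    attempts: List[int] = field(default_factory=list)  # status code per attempt

    @property
    def failed(self) -> bool:
        return bool(self.attempts) and self.attempts[-1] >= 400


class EventLog:
    """Stores every event and every delivery attempt in one place."""

    def __init__(self) -> None:
        self._entries: Dict[str, LogEntry] = {}

    def deliver(self, event_id: str, source: str, event_type: str,
                payload: dict, handler: Callable[[dict], int]) -> int:
        """Run the handler, record the resulting status, and return it."""
        entry = self._entries.setdefault(
            event_id, LogEntry(event_id, source, event_type, payload)
        )
        try:
            status = handler(entry.payload)
        except Exception:
            status = 500  # assumption: unhandled errors count as server errors
        entry.attempts.append(status)
        return status

    def search(self, source: Optional[str] = None, failed_only: bool = False) -> List[LogEntry]:
        """Filter the event history by source and/or latest-attempt failure."""
        return [
            e for e in self._entries.values()
            if (source is None or e.source == source) and (not failed_only or e.failed)
        ]

    def error_rate(self) -> float:
        """Fraction of events whose latest attempt failed."""
        entries = list(self._entries.values())
        return sum(e.failed for e in entries) / len(entries) if entries else 0.0

    def retry(self, event_id: str, handler: Callable[[dict], int]) -> int:
        """Replay a stored event through the handler."""
        e = self._entries[event_id]
        return self.deliver(e.event_id, e.source, e.event_type, e.payload, handler)
```

Keeping the payload alongside the attempt history is what makes replay possible without asking the provider to resend.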
What’s next? A new era for webhooks event delivery
Webhooks === 2007! Years without meaningful change have come to an end.
Thin events, native filtering, etc.: Evolution in webhooks support, e.g. Stripe thin events & Shopify native filters
Event Destinations: Alternative event delivery methods to deliver directly to your message bus
Event Gateways: Cloud infrastructure primitive to manage event interoperability
[Diagram labels: AWS EVENTBRIDGE, HOOKDECK, GCP PUB/SUB]
What’s next? Event Destinations
OSS implementation → eventdestinations.org
[Diagram labels: PRODUCER-1, PRODUCER-2, PRODUCER-3, PRODUCER-4; [ EVENT DESTINATIONS ]; AWS EVENTBRIDGE, KAFKA, GCP PUB/SUB, HOOKDECK, RABBITMQ + MORE…; APP LOGIC]
What’s next? Event Gateways
Event Gateway Components Serverless Ingestion, Routing & Filtering
Event Gateway Components Message Queue & Throughput Control
Event Gateway Components Event Log & Retries
Webhooks at Scale: Key Takeaways
Webhooks ≠ Simple: They’re your first step into distributed, event-driven architecture.
Adopt battle-tested patterns: Idempotency, queue-first ingestion, and fetch-before-process are your friends.
Scale brings failure, duplication, overload: Plan for retries, back pressure, and observability from day one.
Modern problems need modern infra: Event Gateways and Event Destinations are emerging to simplify webhook delivery.
The landscape is shifting: Platforms like Stripe and Shopify are investing heavily. Your infra should evolve too.
Phil Leggetter Head of DX Q&A @hookdeck @leggetter phil@hookdeck.com