Webhooks at Scale: Best Practices and Lessons Learned

A presentation at Stripe London in July 2025 in London, UK, by Phil Leggetter

Slide 1

Webhooks at Scale: Best Practices and Lessons Learned
Stripe London / July 15th, London
@hookdeck @leggetter phil@hookdeck.com

Slide 2

Phil Leggetter, Head of DX
100 billion webhooks and counting
[Diagram labels: SMS API, CHARGES API, WEBHOOK, AWS S3]

Slide 3

Webhooks are the gateway drug to event-driven architecture.

Slide 4

Why are webhooks hard?
Event-driven paradigms / No control over producer / Producer inconsistencies
- Out-of-order
- Duplicates (at-least-once)
- Queuing, DLQs, alerting
- Bursty traffic
- Security, authentication & verification
- Tight timeout limits
- Payload format
- Troubleshooting & retries
- Retries and guarantees
- Fragmented dev experience

Slide 5

Problem
Out-of-order & at-least-once
- Providers don’t guarantee order
- At-least-once delivery guarantee means you must design with idempotency in mind
subscription.updated → Update database: subscription doesn’t exist?
subscription.created → Create in database: unique constraint conflict
invoice.paid → Send payment confirmation email
invoice.paid → Send payment confirmation email again?
subscription.cancelled → Update DB to set cancel date
subscription.updated → Wait, is the subscription cancelled or not?

Slide 6

“An operation is said to be idempotent if performing it multiple times produces the same result as performing it once.”

Slide 7

Idempotency Solution 1
Fetch before processing
How
- For each webhook, retrieve the most recent data from the API
- Update database or take action based on latest data
Best when
- Syncing records state
- High API rate-limits (low risk of exceeding)
- Stripe new thin events
[Image: Stripe thin event containing only reference IDs, taken from Stripe docs]
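
The fetch-before-processing steps can be sketched as follows. This is a minimal illustration, not Stripe's API: the `related_object_id` field, the `fetch_latest` callable, and the in-memory `provider_state` and `local_db` dictionaries are all hypothetical stand-ins for a real thin event, a real API client, and a real database.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def handle_thin_event(
    event: Dict[str, str],
    fetch_latest: Callable[[str], Record],
    local_db: Dict[str, Record],
) -> Record:
    """Ignore any state implied by the event; fetch and store the latest record."""
    object_id = event["related_object_id"]  # hypothetical field name
    latest = fetch_latest(object_id)        # one API call per webhook
    local_db[object_id] = latest            # overwrite: safe to repeat or reorder
    return latest


# Simulated provider state: the subscription has already been cancelled.
provider_state = {"sub_123": {"id": "sub_123", "status": "cancelled"}}
local_db: Dict[str, Record] = {}

# Events arrive out of order and duplicated, yet the stored result is the same.
events = [
    {"type": "subscription.cancelled", "related_object_id": "sub_123"},
    {"type": "subscription.updated", "related_object_id": "sub_123"},
    {"type": "subscription.updated", "related_object_id": "sub_123"},
]
for event in events:
    handle_thin_event(event, lambda oid: dict(provider_state[oid]), local_db)
```

Because every delivery triggers an API call, this approach only fits when the provider's rate limits comfortably exceed the webhook volume.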

Slide 9

Idempotency Solution 2
Upsert by date
How
- Create or update the record only when the date is more recent
- Take action conditional on the record data matching the upsert date
- Upsert must always be transactional!
[Image: Example using Postgres]
Best when
- High throughput / API rate limit constraint
- Need to know if data is the most recent
- Must be keeping track of records
- Storage needs to allow transactional upsert
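
A minimal sketch of upsert by date follows. The slide's example uses Postgres; SQLite (3.24 or later) is swapped in here only so the snippet is self-contained, and the `subscriptions` table, its columns, and the `upsert_by_date` helper are hypothetical. Postgres accepts the same `ON CONFLICT ... DO UPDATE ... WHERE` form.

```python
import sqlite3

# The WHERE clause makes the update a no-op when the incoming event is older
# than (or as old as) the stored row.
UPSERT_SQL = """
INSERT INTO subscriptions (id, status, updated_at)
VALUES (:id, :status, :updated_at)
ON CONFLICT(id) DO UPDATE SET
    status = excluded.status,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > subscriptions.updated_at
"""


def upsert_by_date(conn: sqlite3.Connection, sub_id: str, status: str, updated_at: int) -> bool:
    """Upsert in one transaction; return True if the stored date matches this event."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(UPSERT_SQL, {"id": sub_id, "status": status, "updated_at": updated_at})
        row = conn.execute(
            "SELECT updated_at FROM subscriptions WHERE id = ?", (sub_id,)
        ).fetchone()
    return row[0] == updated_at


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriptions ("
    "id TEXT PRIMARY KEY, status TEXT NOT NULL, updated_at INTEGER NOT NULL)"
)

# Events arrive out of order: the event dated 100 is delivered after the one dated 200.
results = [
    upsert_by_date(conn, "sub_123", "active", 200),
    upsert_by_date(conn, "sub_123", "incomplete", 100),
    upsert_by_date(conn, "sub_123", "cancelled", 300),
]
final_status = conn.execute(
    "SELECT status FROM subscriptions WHERE id = 'sub_123'"
).fetchone()[0]
```

The returned flag is the "take action only if the record matches the upsert date" check. A redelivery carrying the same date would also match, so a side effect that must never repeat still needs event ID tracking.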

Slide 10

Idempotency Solution 3
Tracking processing state
How
- Keep track of each unique event ID processing status
- For each webhook, check the processing status
- Return an error (e.g. 409) if already processing
[Image: Implementation example using NodeJS]
Best when
- Not storing or keeping track of records
- Need generic implementation
- Depend on downstream dependencies like email provider or 3rd party APIs
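
The slide's implementation example uses NodeJS; the sketch below shows the same idea in Python under simplifying assumptions. The `ProcessingTracker` class is hypothetical and keeps state in process memory, whereas a real deployment would presumably use a shared store so that all ingestion instances see the same status.

```python
import threading
from typing import Callable, Dict


class ProcessingTracker:
    """Tracks per-event status: absent, "processing", or "done"."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._lock = threading.Lock()

    def handle(self, event_id: str, process: Callable[[], None]) -> int:
        """Run process() at most once per event ID; return an HTTP-style status."""
        with self._lock:
            status = self._status.get(event_id)
            if status == "done":
                return 200  # duplicate of a finished event: acknowledge, do nothing
            if status == "processing":
                return 409  # another delivery of this event is in flight
            self._status[event_id] = "processing"
        try:
            process()
        except Exception:
            with self._lock:
                del self._status[event_id]  # forget the claim so a retry can run
            return 500
        with self._lock:
            self._status[event_id] = "done"
        return 200


sent = []
tracker = ProcessingTracker()
# The same invoice.paid event is delivered twice; the email is sent once.
first = tracker.handle("evt_1", lambda: sent.append("payment confirmation"))
second = tracker.handle("evt_1", lambda: sent.append("payment confirmation"))
```

Returning 409 for an in-flight duplicate lets the provider retry later, by which time the first attempt has either finished or failed.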

Slide 11

Problem
Bursty traffic & tight timeouts
- High sustained volume and sudden bursts
- Flash sales, BFCM, batch processing, retry storms
[Chart labels: Response time, Webhooks per second, Timeout]

Slide 12

Solution
Decoupled, scalable ingestion
- Horizontally scalable, stateless ingestion layer
- Load balancing and autoscaling
- Use message queues or buffers at the edge
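
Queue-first ingestion can be illustrated with a small sketch. It assumes an in-process `queue.Queue` as the buffer and a single worker thread as the consumer; the `ingest` and `worker` functions are hypothetical, and a production setup would use a durable message queue instead.

```python
import queue
import threading

event_queue: "queue.Queue[dict]" = queue.Queue()
processed = []


def ingest(payload: dict) -> int:
    """Stateless ingestion: buffer the event and acknowledge immediately."""
    event_queue.put(payload)
    return 202  # accepted; processing happens later, outside the request


def worker() -> None:
    """Consume events at the worker's own pace, independent of ingestion."""
    while True:
        event = event_queue.get()
        try:
            processed.append(event["id"])  # placeholder for slow business logic
        finally:
            event_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

# A burst of webhooks is acknowledged without waiting for processing.
statuses = [ingest({"id": f"evt_{i}"}) for i in range(5)]
event_queue.join()  # demo only: wait until the worker has drained the queue
```

The handler's response time no longer depends on processing time, which is what keeps it inside the provider's timeout during a burst.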

Slide 13

Problem
Overwhelming downstream systems
- Traffic spikes put pressure on backend services
- Event handlers are resource-constrained
- Processing rates can’t always match ingestion rates, causing backpressure

Solution
Throughput control & backpressure management
- Audit, test, and document scalability constraints
- Control throughput via concurrency & number of consumers
- Set up alerts for backpressure (queue depth / max age)
- Implement prioritization through multiple queues/consumers
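
A rough sketch of the two backpressure signals named above (queue depth and max age) together with a concurrency cap. The `BackpressureLimits` thresholds, the `backpressure_alerts` and `take_batch` helpers, and the timestamps are all illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

# Each pending item is (enqueued_at_seconds, event_id), oldest first.
Pending = Deque[Tuple[float, str]]


@dataclass
class BackpressureLimits:
    max_depth: int
    max_age_seconds: float


def backpressure_alerts(pending: Pending, now: float, limits: BackpressureLimits) -> List[str]:
    """Return alert messages when queue depth or oldest-event age exceeds limits."""
    alerts = []
    if len(pending) > limits.max_depth:
        alerts.append(f"queue depth {len(pending)} exceeds {limits.max_depth}")
    if pending:
        oldest_age = now - pending[0][0]
        if oldest_age > limits.max_age_seconds:
            alerts.append(f"oldest event is {oldest_age:.0f}s old")
    return alerts


def take_batch(pending: Pending, max_concurrency: int) -> List[str]:
    """Release at most max_concurrency events to consumers at a time."""
    batch = []
    while pending and len(batch) < max_concurrency:
        batch.append(pending.popleft()[1])
    return batch


pending: Pending = deque((float(t), f"evt_{t}") for t in range(6))
limits = BackpressureLimits(max_depth=4, max_age_seconds=30.0)

alerts_before = backpressure_alerts(pending, now=60.0, limits=limits)
batch = take_batch(pending, max_concurrency=2)
alerts_after = backpressure_alerts(pending, now=60.0, limits=limits)
```

Depth and age are complementary: depth catches a sudden burst, while age catches a slow consumer even when the queue is short.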

Slide 14

Problem
Failure recovery & data integrity
- App downtime or bugs = failed deliveries
- Events may be lost without retry handling
- Repeated failures can block other events

Solution
Guaranteed processing or reconciliation
Guaranteed processing
- Automated retry with backoff; vendor retry != guarantee
- Route failed events to DLQs & set up alerts for DLQs
- Test & validate ACK/NACK logic!
Reconciliation
- Fetch all data from time period (slow & loses incremental changes)
- Vendors with Events API (like Stripe) make this easier
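
Automated retry with backoff and a dead-letter queue can be sketched as below. The `process_with_retry` function, its default attempt count and delays, and the list used as a DLQ are hypothetical; `sleep` is injectable so the demo runs without waiting.

```python
import time
from typing import Callable, List


def process_with_retry(
    event: dict,
    handler: Callable[[dict], None],
    dead_letter_queue: List[dict],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry with exponential backoff; route to the DLQ once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True  # ACK: the event was processed
        except Exception:
            if attempt == max_attempts:
                break
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    dead_letter_queue.append(event)  # NACK: keep the event for inspection or replay
    return False


delays: List[float] = []
dlq: List[dict] = []
calls = {"n": 0}


def flaky(event: dict) -> None:
    """Fails twice, then succeeds (simulates a brief downstream outage)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("downstream unavailable")


def always_fails(event: dict) -> None:
    raise RuntimeError("bug in handler")


ok = process_with_retry({"id": "evt_1"}, flaky, dlq, sleep=delays.append)
failed = process_with_retry({"id": "evt_2"}, always_fails, dlq, sleep=delays.append)
```

Moving an exhausted event to the DLQ rather than retrying forever is what stops one poisoned event from blocking the events behind it.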

Slide 15

Problem
Lack of observability
- Hard to trace webhook delivery across systems
- Limited visibility into failure causes and retries
- No central view of webhook status or history
- Sparse alerting about failure and backpressure issues

Solution
Centralized event logs
- Per-event logging and process tracing
- Searchable event history
- Retry / replay capabilities
- Alerting and notifications on processing issues
[Dashboard mockup: Prod / Metrics (24h, Total events 1,200,452, Error rate 8.63/min, Event count); Prod / Events (422 stripe → invoice, 200 stripe → subscriptions, 200 stripe → invoice, Retry)]
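
A centralized event log with search and replay might look roughly like this. The `EventLog` and `EventRecord` classes, their fields, and the in-memory storage are hypothetical; the status codes in the demo mirror the dashboard mockup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class EventRecord:
    event_id: str
    source: str
    topic: str
    payload: dict
    attempts: List[int] = field(default_factory=list)  # response status per attempt

    @property
    def last_status(self) -> Optional[int]:
        return self.attempts[-1] if self.attempts else None


class EventLog:
    """Keeps every event and every delivery attempt in one searchable place."""

    def __init__(self) -> None:
        self._records: Dict[str, EventRecord] = {}

    def record(self, event_id: str, source: str, topic: str, payload: dict, status: int) -> None:
        rec = self._records.setdefault(event_id, EventRecord(event_id, source, topic, payload))
        rec.attempts.append(status)

    def search(self, failed_only: bool = False) -> List[EventRecord]:
        records = list(self._records.values())
        if failed_only:
            records = [r for r in records if r.last_status is not None and r.last_status >= 400]
        return records

    def replay(self, event_id: str, deliver: Callable[[dict], int]) -> int:
        """Re-deliver the stored payload and log the new attempt."""
        rec = self._records[event_id]
        status = deliver(rec.payload)
        rec.attempts.append(status)
        return status


log = EventLog()
log.record("evt_1", "stripe", "invoice", {"id": "in_1"}, 422)
log.record("evt_2", "stripe", "subscriptions", {"id": "sub_1"}, 200)

failed_before = [r.event_id for r in log.search(failed_only=True)]
replay_status = log.replay("evt_1", lambda payload: 200)  # handler fixed, replay succeeds
failed_after = [r.event_id for r in log.search(failed_only=True)]
```

Storing the payload alongside the attempt history is what makes replay possible after a bug fix, without asking the provider to resend.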

Slide 16

What’s next?
A new era for webhooks event delivery
Webhooks === 2007! Years without meaningful change have come to an end.
Thin events, native filtering, etc.
- Evolution in webhooks support, e.g. Stripe thin events & Shopify native filters
Event Destinations
- Alternative event delivery methods to deliver directly to your message bus
Event Gateways
- Cloud infrastructure primitive to manage event interoperability
[Diagram labels: AWS EVENTBRIDGE, HOOKDECK, GCP PUB/SUB]

Slide 17

What’s next?
Event Destinations
→ eventdestinations.org
[Diagram labels: PRODUCER-1, PRODUCER-2, PRODUCER-3, PRODUCER-4, EVENT DESTINATIONS, AWS EVENTBRIDGE, KAFKA, GCP PUB/SUB, HOOKDECK, RABBITMQ, + MORE…, APP LOGIC, OSS implementation]

Slide 18

What’s next?
 Event Gateways

Slide 19

Event Gateway Components
 Serverless Ingestion, Routing & Filtering

Slide 20

Event Gateway Components
 Message Queue & Throughput Control

Slide 21

Event Gateway Components
 Event Log & Retries

Slide 22

Webhooks at Scale: Key Takeaways
Webhooks ≠ Simple
- They’re your first step into distributed, event-driven architecture.
Adopt battle-tested patterns
- Idempotency, queue-first ingestion, and fetch-before-process are your friends.
Scale brings failure, duplication, overload
- Plan for retries, back pressure, and observability from day one.
Modern problems need modern infra
- Event Gateways and Event Destinations are emerging to simplify webhook delivery.
The landscape is shifting
- Platforms like Stripe and Shopify are investing heavily. Your infra should evolve too.
Phil Leggetter, Head of DX

Slide 23

Q&A
Phil Leggetter, Head of DX
@hookdeck @leggetter phil@hookdeck.com