Webhooks at Scale Best Practices and Lessons Learned

Shopify Partners North / Jan 27th, Leeds
@hookdeck @leggetter phil@hookdeck.com

Phil Leggetter, Head of DX
 100 billion webhooks and counting
 [Diagram labels: SMS API, Charges API, webhook, AWS S3]

Webhooks are the gateway drug to event-driven architecture.

Why are webhooks hard?
 Three themes: event-driven paradigms, no control over the producer, producer inconsistencies
 Out-of-order delivery
 Duplicates (at-least-once)
 Queuing, DLQs, alerting
 Bursty traffic
 Security, authentication & verification
 Tight timeout limits
 Payload format
 Troubleshooting & retries
 Retries and guarantees
 Fragmented dev experience

Problem
 Out-of-order & at-least-once
 Providers don’t guarantee order.
 An at-least-once delivery guarantee means you must design with idempotency in mind.
 orders/update → Update database: record doesn’t exist?
 orders/create → Create in database: unique constraint conflict
 orders/paid → Send payment confirmation email
 orders/paid → Send payment confirmation email again?
 orders/delete → Update DB to set cancel date
 orders/update → Wait, is the order deleted or not?

“An operation is said to be idempotent if performing it multiple times produces the same result as performing it once.”

Idempotency Solution 1
 Fetch before processing
 How
 For each webhook, retrieve the most recent data from the API.
 Update the database or take action based on the latest data.
 Best when
 Syncing record state
 Low API read costs (can afford extra fetches)
 The webhook is treated as a notification, a.k.a. “thin events” (inspired by Stripe “thin events”).
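
A minimal sketch of fetch-before-processing, under stated assumptions: the `order_id` payload field, the `OrderStore` class, and the `fetch_order` callable are hypothetical stand-ins for your own payload shape, database, and provider API client. The point it illustrates is that the handler ignores the webhook body and syncs from the API's current state, so duplicate and out-of-order deliveries converge on the same result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class OrderStore:
    """In-memory stand-in for the application's database (hypothetical)."""
    rows: Dict[str, dict] = field(default_factory=dict)

    def save(self, order: dict) -> None:
        # Overwriting by ID is idempotent: saving the same state twice changes nothing.
        self.rows[order["id"]] = order


def handle_thin_event(event: dict,
                      fetch_order: Callable[[str], dict],
                      store: OrderStore) -> None:
    """Treat the webhook as a notification and sync from the API's current state."""
    order_id = event["order_id"]      # assumed payload field
    latest = fetch_order(order_id)    # the API, not the webhook body, is the source of truth
    store.save(latest)


# Fake API for illustration; a real client would call the provider's REST API.
api_state = {"o1": {"id": "o1", "status": "paid"}}


def fetch(order_id: str) -> dict:
    return dict(api_state[order_id])


store = OrderStore()
# Out-of-order and duplicate deliveries all end in the same stored state.
for topic in ["orders/paid", "orders/create", "orders/paid"]:
    handle_thin_event({"topic": topic, "order_id": "o1"}, fetch, store)
```

The trade-off is one extra API read per webhook, which is why the slide lists low read costs as a precondition.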

Idempotency Solution 2
 Upsert by updated_at
 How
 Create or update the record only when the incoming updated_at is more recent than the stored one.
 Take action only if the stored record matches the updated_at you just upserted (that is, your write won).
 Upsert must always be transactional!
 (The slide shows an example using Postgres.)
 Best when
 High throughput / API rate limit constraints
 Need to know if data is the most recent
 Must be keeping track of records
 Storage must allow transactional upsert
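
The slide's Postgres example is not reproduced in this text, so the following is only a sketch of the same idea using Python's built-in `sqlite3` module as a stand-in. SQLite (3.24+) and Postgres both accept an `ON CONFLICT ... DO UPDATE ... WHERE` clause of this form. The `orders` table, its columns, and the `upsert_order` helper are assumptions for illustration, and timestamps are assumed to be ISO-8601 UTC strings in one consistent format so that string comparison matches time order.

```python
import sqlite3

# The WHERE clause turns a stale or duplicate payload into a no-op.
UPSERT_SQL = """INSERT INTO orders (id, status, updated_at)
VALUES (:id, :status, :updated_at)
ON CONFLICT (id) DO UPDATE
SET status = excluded.status,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > orders.updated_at"""


def upsert_order(conn: sqlite3.Connection, order: dict) -> bool:
    """Return True if the row was written, False if the payload was stale or a duplicate."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on exception
        conn.execute(UPSERT_SQL, order)
    return conn.total_changes > before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)")

first = upsert_order(conn, {"id": "o1", "status": "paid", "updated_at": "2025-01-27T10:05:00Z"})
# An older event arriving late must not overwrite newer data.
stale = upsert_order(conn, {"id": "o1", "status": "created", "updated_at": "2025-01-27T10:00:00Z"})
# A redelivery of the same event is also ignored.
duplicate = upsert_order(conn, {"id": "o1", "status": "paid", "updated_at": "2025-01-27T10:05:00Z"})

stored_status = conn.execute("SELECT status FROM orders WHERE id = 'o1'").fetchone()[0]
```

The boolean return value is what lets you make side effects conditional: send the email only when the upsert actually wrote the row.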

Idempotency Solution 3
 Tracking processing state
 How
 Keep track of the processing status of each unique event ID.
 For each webhook, check the processing status.
 Return an error (e.g. 409) if the event is already processing.
 (The slide shows an implementation example using NodeJS.)
 Best when
 Not storing or keeping track of records
 Need a generic implementation
 Depending on downstream dependencies such as an email provider or third-party APIs
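
The slide's NodeJS implementation is not included in this text. As a rough sketch of the same pattern in Python, the following keeps a per-event status in memory. The `EventTracker` class, the `handle_webhook` function, and the choice of returning 200 for already-completed events are assumptions; a production version would keep the status in a shared store (a database or cache) with an atomic claim operation.

```python
from enum import Enum
from threading import Lock
from typing import Callable, Dict, Optional


class Status(Enum):
    PROCESSING = "processing"
    DONE = "done"


class EventTracker:
    """In-memory status store keyed by event ID (hypothetical)."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}
        self._lock = Lock()

    def try_start(self, event_id: str) -> Optional[Status]:
        """Atomically claim the event. Returns the existing status if already seen."""
        with self._lock:
            existing = self._status.get(event_id)
            if existing is None:
                self._status[event_id] = Status.PROCESSING
            return existing

    def finish(self, event_id: str) -> None:
        with self._lock:
            self._status[event_id] = Status.DONE

    def release(self, event_id: str) -> None:
        # Forget a failed attempt so the provider's retry can claim it again.
        with self._lock:
            self._status.pop(event_id, None)


def handle_webhook(event_id: str, work: Callable[[], None], tracker: EventTracker) -> int:
    existing = tracker.try_start(event_id)
    if existing is Status.PROCESSING:
        return 409  # another delivery of this event is in flight
    if existing is Status.DONE:
        return 200  # duplicate: acknowledge without repeating the side effect
    try:
        work()
    except Exception:
        tracker.release(event_id)
        return 500
    tracker.finish(event_id)
    return 200


tracker = EventTracker()
emails_sent = []
codes = [handle_webhook("evt_1", lambda: emails_sent.append("evt_1"), tracker) for _ in range(2)]

tracker.try_start("evt_2")  # simulate a delivery that is still being processed
in_flight_code = handle_webhook("evt_2", lambda: emails_sent.append("evt_2"), tracker)
```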

Problem
 Bursty traffic & tight timeouts
 High sustained volume and sudden bursts
 Flash sales, BFCM, batch processing, retry storms
 [Chart labels: response time, webhooks per second, timeout]

Solution
 Decoupled, scalable ingestion
 Horizontally scalable, stateless ingestion layer
 Load balancing and autoscaling
 Use message queues or buffers at the edge
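
As an illustration of queue-first ingestion, the sketch below separates the request path from processing using only the Python standard library. The in-process `queue.Queue` is a stand-in for a real message queue, and the `ingest` and `worker` names and the 202 response are assumptions. The request path does nothing except enqueue and acknowledge, so response time stays flat during a burst.

```python
import queue
import threading

buffer = queue.Queue()   # stand-in for a durable message queue
processed = []


def ingest(payload: dict) -> int:
    """Request path: enqueue and acknowledge, with no business logic."""
    buffer.put(payload)
    return 202


def worker() -> None:
    """Processing path: runs at its own pace, independent of the request path."""
    while True:
        event = buffer.get()
        if event is None:        # sentinel used to stop the worker
            buffer.task_done()
            break
        processed.append(event["id"])
        buffer.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

# A burst of 100 webhooks is acknowledged immediately, then drained by the worker.
statuses = [ingest({"id": i}) for i in range(100)]
buffer.join()
buffer.put(None)
consumer.join()
```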

Problem
 Overwhelming downstream systems
 Traffic spikes put pressure on backend services
 Event handlers are resource-constrained
 Processing rates can’t always match ingestion rates, causing backpressure

Solution
 Throughput control & backpressure management
 Audit, test, and document scalability constraints
 Control throughput via concurrency & number of consumers
 Set up alerts for backpressure (queue depth / max age)
 Implement prioritization through multiple queues/consumers
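
A small sketch of two of the ideas above: capping concurrency and alerting on queue depth and maximum age. The `MonitoredQueue` class, the `backpressure_alerts` and `drain` functions, and the thresholds are all hypothetical. Time is passed in explicitly as `now` so the example is deterministic.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Deque, List, Tuple


class MonitoredQueue:
    """FIFO buffer that remembers when each event was enqueued (hypothetical)."""

    def __init__(self) -> None:
        self._items: Deque[Tuple[float, dict]] = deque()

    def put(self, event: dict, now: float) -> None:
        self._items.append((now, event))

    def get(self) -> dict:
        return self._items.popleft()[1]

    def depth(self) -> int:
        return len(self._items)

    def max_age(self, now: float) -> float:
        # The head of a FIFO queue is always the oldest event.
        return now - self._items[0][0] if self._items else 0.0


def backpressure_alerts(q: MonitoredQueue, now: float,
                        max_depth: int, max_age_s: float) -> List[str]:
    alerts = []
    if q.depth() > max_depth:
        alerts.append(f"queue depth {q.depth()} exceeds {max_depth}")
    if q.max_age(now) > max_age_s:
        alerts.append(f"oldest event is {q.max_age(now):.0f}s old, limit is {max_age_s:.0f}s")
    return alerts


def drain(q: MonitoredQueue, handler: Callable[[dict], object], concurrency: int) -> list:
    """Process everything queued with at most `concurrency` handlers in flight."""
    events = [q.get() for _ in range(q.depth())]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(handler, events))


q = MonitoredQueue()
for i, enqueued_at in enumerate([0.0, 10.0, 20.0]):
    q.put({"id": i}, now=enqueued_at)

alerts = backpressure_alerts(q, now=100.0, max_depth=2, max_age_s=60.0)
handled = drain(q, lambda event: event["id"], concurrency=2)
```

Setting `concurrency` from the documented limits of the downstream system is what keeps a spike at ingestion from becoming a spike at the database or a third-party API.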

Problem
 Failure recovery & data integrity
 App downtime or bugs = failed deliveries
 Events may be lost without retry handling
 Repeated failures can block other events

Solution
 Guaranteed processing or reconciliation
 Guaranteed processing
 Automated retry with backoff; vendor retry != guarantee
 Route failed events to DLQs & set up alerts for DLQs
 Test & validate ACK/NACK logic!
 Reconciliation
 Fetch all data from a time period (slow & loses incremental changes)
 Vendors with an Events API (like Stripe) make this easier
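
The retry-and-DLQ flow could be sketched roughly as follows. The `process_with_retry` function, its defaults, and the list used as a dead-letter queue are assumptions for illustration; the `sleep` parameter is injectable so the backoff schedule can be checked without waiting.

```python
import time
from typing import Callable, List


def process_with_retry(event: dict,
                       handler: Callable[[dict], None],
                       dlq: List[dict],
                       max_attempts: int = 4,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True to ACK. After the final failure, route the event to the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                # Keep the failure context so the event can be inspected and replayed.
                dlq.append({"event": event, "error": repr(exc), "attempts": attempt})
                return False
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff: 0.5s, 1s, 2s, ...
    return False


calls = {"count": 0}


def flaky(event: dict) -> None:
    """Fails twice, then succeeds, like a dependency recovering from an outage."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("downstream unavailable")


def always_broken(event: dict) -> None:
    raise ValueError("bug in handler")


dlq: List[dict] = []
delays: List[float] = []
recovered = process_with_retry({"id": "evt_1"}, flaky, dlq, sleep=delays.append)
dead = process_with_retry({"id": "evt_2"}, always_broken, dlq, sleep=lambda seconds: None)
```

Because a failing event ends up in the DLQ instead of being retried forever, it cannot block the events queued behind it.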

Problem
 Lack of observability
 Hard to trace webhook delivery across systems
 Limited visibility into failure causes and retries
 No central view of webhook status or history
 Sparse alerting about failure and backpressure issues

Solution
 Centralized event logs
 Per-event logging and process tracing
 Searchable event history
 Retry / replay capabilities
 Alerting and notifications on processing issues
 [Dashboard screenshot: Prod / Metrics (total events 24h: 1,200,452; error rate: 8.63/min) and a Prod / Events list showing status codes (422, 200) for shopify → orders and shopify → products, with a Retry action]
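
To make the centralized event log concrete, here is a minimal in-memory sketch covering per-event logging, search, an error-rate metric, and retry. The `EventLog` and `Attempt` names, the event fields, and the 200/500 status mapping are hypothetical; a real log would be persisted and indexed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Attempt:
    event_id: str
    source: str
    topic: str
    status: int
    error: Optional[str] = None


class EventLog:
    """Records every delivery attempt and keeps payloads for later replay."""

    def __init__(self) -> None:
        self.attempts: List[Attempt] = []
        self._events: Dict[str, dict] = {}

    def deliver(self, event: dict, handler: Callable[[dict], None]) -> int:
        self._events[event["id"]] = event
        try:
            handler(event["payload"])
            status, error = 200, None
        except Exception as exc:
            status, error = 500, repr(exc)
        self.attempts.append(Attempt(event["id"], event["source"], event["topic"], status, error))
        return status

    def search(self, source: Optional[str] = None, status: Optional[int] = None) -> List[Attempt]:
        return [a for a in self.attempts
                if (source is None or a.source == source)
                and (status is None or a.status == status)]

    def error_rate(self) -> float:
        if not self.attempts:
            return 0.0
        return sum(1 for a in self.attempts if a.status >= 400) / len(self.attempts)

    def retry(self, event_id: str, handler: Callable[[dict], None]) -> int:
        # Replays the stored payload and logs the new attempt alongside the old one.
        return self.deliver(self._events[event_id], handler)


def broken(payload: dict) -> None:
    raise RuntimeError("db unavailable")


def fixed(payload: dict) -> None:
    pass


log = EventLog()
event = {"id": "evt_1", "source": "shopify", "topic": "orders", "payload": {"order_id": "o1"}}
first = log.deliver(event, broken)
failures = log.search(source="shopify", status=500)
second = log.retry("evt_1", fixed)
```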

What’s next?
 A new era for webhooks event delivery
 Webhooks === 2007! Years without meaningful change have come to an end.
 Thin events, native filtering, etc.
 Evolution in webhook support, e.g. Stripe thin events and Shopify native filters.
 Event Destinations
 Alternative event delivery methods that deliver directly to your message bus (AWS EventBridge, Hookdeck, GCP Pub/Sub).
 Event Gateways
 A cloud infrastructure primitive to manage event interoperability.

What’s next?
 Event Destinations
 OSS implementation → eventdestinations.org
 [Diagram labels: PRODUCER-1 through PRODUCER-4; Event Destinations: AWS EventBridge, Kafka, GCP Pub/Sub, Hookdeck, RabbitMQ + more…; app logic]

What’s next?
 Event Gateways

Event Gateway Components
 Serverless Ingestion, Routing & Filtering

Event Gateway Components
 Message Queue & Throughput Control (live demo)

Event Gateway Components
 Event Log & Retries (live demo)

Webhooks at Scale: Key Takeaways
 Webhooks ≠ Simple
 They’re your first step into distributed, event-driven architecture.
 Scale brings failure, duplication, overload
 Plan for retries, back pressure, and observability from day one.
 Adopt battle-tested patterns
 Idempotency, queue-first ingestion, and fetch-before-process are your friends.
 The landscape is shifting
 Platforms like Shopify and Stripe are investing heavily. Your infra should evolve too.
 Modern problems need modern infra
 Event Gateways and Event Destinations are emerging to simplify webhook delivery.

Q&A
 Phil Leggetter, Head of DX
 @hookdeck @leggetter phil@hookdeck.com