Reliably Handling Webhooks at Scale: Best Practices and Lessons Learned

A presentation at Shopify Partners North Developers Meetup in January 2026 in Leeds, UK by Phil Leggetter

Slide 1

Webhooks at Scale: Best Practices and Lessons Learned
 Shopify Partners North / Jan 27th, Leeds
 @hookdeck @leggetter phil@hookdeck.com

Slide 2

Phil Leggetter, Head of DX
 100 billion webhooks and counting
 [Diagram labels: SMS API, Charges API, Webhook, AWS S3]

Slide 3

Webhooks are the gateway drug to event-driven architecture.

Slide 4

Why are webhooks hard?
 Event-driven paradigms
 No control over producer
 Producer inconsistencies
 Out-of-order
 Duplicates (at-least-once)
 Queuing, DLQs, alerting
 Bursty traffic
 Security, authentication & verification
 Tight timeout limits
 Payload format
 Troubleshooting & retries
 Retries and guarantees
 Fragmented dev experience

Slide 5

Problem
 Out-of-order & at-least-once
 Providers don’t guarantee order
 An at-least-once delivery guarantee means you must design with idempotency in mind
  orders/update → Update database: order doesn’t exist yet?
  orders/create → Create in database: unique constraint conflict
  orders/paid → Send payment confirmation email
  orders/paid → Send payment confirmation email again?
  orders/delete → Update DB to set cancel date
  orders/update → Wait, is the order deleted or not?

Slide 6

“An operation is said to be idempotent if performing it multiple times produces the same result as performing it once.”

Slide 7

Idempotency Solution 1
 Fetch before processing
 How
  For each webhook, retrieve the most recent data from the API
  Update the database or take action based on the latest data
 Best when
  Syncing record state
  Low API read costs (can afford extra fetches)
 Webhook as notification, a.k.a. “thin events” (inspired by Stripe’s thin events)
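
A minimal sketch of the fetch-before-processing idea, in Python. The `fetch_order` callable and the dictionary-backed `db` are hypothetical stand-ins for a provider's read API and a local store; they are not part of any real client library. The point it illustrates is that the handler ignores the payload body and always syncs from the current source of truth, so duplicates and reordering converge on the same state.

```python
from typing import Callable, Dict

Order = Dict[str, str]


def handle_thin_event(
    event: Dict[str, str],
    fetch_order: Callable[[str], Order],  # hypothetical: wraps the provider's read API
    db: Dict[str, Order],                 # hypothetical: stands in for a real database
) -> Order:
    """Treat the webhook as a notification and sync from the source of truth."""
    order_id = event["order_id"]
    latest = fetch_order(order_id)  # read current state, ignore the payload body
    db[order_id] = latest           # overwrite, so replays and reordering converge
    return latest


# Simulated provider state: the order has already been paid.
provider: Dict[str, Order] = {"o1": {"id": "o1", "status": "paid"}}
db: Dict[str, Order] = {}

# Events arrive out of order and duplicated; the stored state ends up the same.
for topic in ["orders/paid", "orders/create", "orders/paid"]:
    handle_thin_event(
        {"topic": topic, "order_id": "o1"},
        lambda oid: dict(provider[oid]),
        db,
    )
```

The trade-off is one extra API read per webhook, which is why the slide lists low read costs as a precondition.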

Slide 8

Idempotency Solution 2
 Upsert by updated_at
 How
  Create or update the record only when the updated_at is more recent
  Take action conditional on the record data matching the upsert date
  Upsert must always be transactional!
  [Example using Postgres]
 Best when
  High throughput / API rate limit constraints
  Need to know if data is the most recent
  Must be keeping track of records
  Storage must allow transactional upsert
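
A sketch of a conditional upsert keyed on `updated_at`. The slide's own example uses Postgres; this one swaps in Python's built-in `sqlite3` module so it runs standalone, and the `ON CONFLICT ... DO UPDATE ... WHERE` clause it relies on is also available in Postgres. The table layout and the `upsert_order` helper are assumptions for illustration, and timestamps are assumed to be ISO 8601 strings in one consistent format so that string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

# Insert the row, or update it only when the incoming event is strictly newer.
UPSERT = """
INSERT INTO orders (id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (id) DO UPDATE
SET status = excluded.status, updated_at = excluded.updated_at
WHERE excluded.updated_at > orders.updated_at
"""


def upsert_order(order_id: str, status: str, updated_at: str) -> bool:
    """Apply the event if it is newer; return True when a row was written."""
    with conn:  # one transaction: commit on success, roll back on error
        before = conn.total_changes
        conn.execute(UPSERT, (order_id, status, updated_at))
        return conn.total_changes > before


applied = upsert_order("o1", "paid", "2026-01-27T10:05:00Z")
stale = upsert_order("o1", "created", "2026-01-27T10:00:00Z")    # older event
duplicate = upsert_order("o1", "paid", "2026-01-27T10:05:00Z")   # same event again
```

Side effects such as sending an email can then be gated on the return value, so stale and duplicate deliveries do not trigger them.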

Slide 9

Idempotency Solution 3
 Tracking processing state
 How
  Keep track of the processing status of each unique event ID
  For each webhook, check the processing status
  Return an error (e.g. 409) if already processing
  [Implementation example using NodeJS]
 Best when
  Not storing or keeping track of records
  Need a generic implementation
  Depend on downstream dependencies like an email provider or 3rd party APIs

Slide 10

Problem
 Bursty traffic & tight timeouts
 High sustained volume and sudden bursts
 Flash sales, BFCM, batch processing, retry storms
 [Chart labels: response time, webhooks per second, timeout]

Slide 11

Solution
 Decoupled, scalable ingestion
 Horizontally scalable, stateless ingestion layer
 Load balancing and autoscaling
 Use message queues or buffers at the edge
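
A sketch of queue-first ingestion: the receiving path only enqueues and acknowledges, and a separate consumer does the work at its own pace. The standard-library `queue.Queue` stands in for a durable message queue, and the `ingest` and `worker` names, the buffer size, and the 202/503 responses are assumptions for illustration.

```python
import queue
import threading

buffer = queue.Queue(maxsize=10_000)  # stand-in for a durable queue or edge buffer
processed = []


def ingest(payload: dict) -> int:
    """Ingestion path: enqueue and acknowledge, with no business logic."""
    try:
        buffer.put_nowait(payload)
    except queue.Full:
        return 503   # shed load; the provider is expected to retry later
    return 202       # accepted; processing happens asynchronously


def worker() -> None:
    """Consumer: drains the buffer independently of the ingestion path."""
    while True:
        item = buffer.get()
        if item is None:               # sentinel: stop the worker
            buffer.task_done()
            return
        processed.append(item["id"])   # real event handling would go here
        buffer.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()
codes = [ingest({"id": i}) for i in range(5)]
buffer.put(None)   # ask the worker to stop once the backlog is drained
buffer.join()      # wait until every queued item has been handled
```

Because `ingest` never waits on processing, its response time stays flat during a burst, which is what keeps it inside the provider's timeout.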

Slide 12

Problem
 Overwhelming downstream systems
 Traffic spikes put pressure on backend services
 Event handlers are resource-constrained
 Processing rates can’t always match ingestion rates, causing backpressure
Solution
 Throughput control & backpressure management
 Audit, test, and document scalability constraints
 Control throughput via concurrency & number of consumers
 Set up alerts for backpressure (queue depth / max age)
 Implement prioritization through multiple queues/consumers
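
A sketch of the two controls named above: capping concurrency with a fixed number of consumers, and a backpressure check on queue depth and maximum age. The limits, the `handle` and `should_alert` names, and the simulated 10 ms downstream call are assumptions chosen for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 2   # assumed limit that the downstream service can absorb
_active = 0
peak_active = 0
_lock = threading.Lock()


def handle(event_id: int) -> int:
    """Process one event while recording how many handlers run at once."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)   # simulate a slow downstream call
    with _lock:
        _active -= 1
    return event_id


def should_alert(depth: int, oldest_age_s: float,
                 max_depth: int = 1000, max_age_s: float = 60.0) -> bool:
    """Backpressure signal: queue too deep, or the oldest event waiting too long."""
    return depth > max_depth or oldest_age_s > max_age_s


# The pool size caps concurrency no matter how large the burst is.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
    results = list(pool.map(handle, range(10)))
```

Prioritization can follow the same shape: one pool per queue, each with its own concurrency limit.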

Slide 13

Problem
 Failure recovery & data integrity
 App downtime or bugs = failed deliveries
 Events may be lost without retry handling
 Repeated failures can block other events
Solution
 Guaranteed processing or reconciliation
 Guaranteed processing
  Automated retry with backoff; vendor retry != guarantee
  Route failed events to DLQs & set up alerts for DLQs
  Test & validate ACK/NACK logic!
 Reconciliation
  Fetch all data from a time period (slow & loses incremental changes)
  Vendors with an Events API (like Stripe) make this easier
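
A sketch of automated retry with exponential backoff and a dead-letter queue. The `deliver` and `backoff_delays` names, the retry count, and the list used as a DLQ are assumptions; a real setup would typically use the retry and dead-letter features of its queueing system. The `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, List, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Exponential schedule: base, 2*base, 4*base, ... with each delay capped."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def deliver(event: dict, handler: Callable[[dict], None], dlq: List[dict],
            max_retries: int = 3, base: float = 1.0,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run the handler, retrying with backoff; park the event in the DLQ if all tries fail."""
    delays = backoff_delays(max_retries, base)
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):   # first try plus max_retries retries
        try:
            handler(event)
            return True                      # ACK
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(delays[attempt])       # wait before the next try
    dlq.append({"event": event, "error": repr(last_error)})
    return False                             # NACK: the event now sits in the DLQ


dlq: List[dict] = []
slept: List[float] = []
ok = deliver({"id": "evt_1"}, lambda e: 1 / 0, dlq, max_retries=2, sleep=slept.append)
```

Events in the DLQ keep their payload and last error, so they can be inspected and replayed once the underlying bug or outage is fixed.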

Slide 14

Problem
 Lack of observability
 Hard to trace webhook delivery across systems
 Limited visibility into failure causes and retries
 No central view of webhook status or history
 Sparse alerting about failure and backpressure issues
Solution
 Centralized event logs
 Per-event logging and process tracing
 Searchable event history
 Retry / replay capabilities
 Alerting and notifications on processing issues
 [Dashboard mockup: Prod / Metrics (total events 24h, error rate) and Prod / Events (event list by source and topic, with status codes and a Retry action)]
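
A sketch of what a centralized event log has to record to support per-event tracing, search, and replay. The `EventLog` and `EventRecord` classes are hypothetical and keep everything in memory; they only illustrate the data shape, not any particular product's API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EventRecord:
    event_id: str
    source: str
    topic: str
    payload: dict
    attempts: List[dict] = field(default_factory=list)  # one entry per delivery try

    @property
    def status(self) -> str:
        return self.attempts[-1]["status"] if self.attempts else "pending"


class EventLog:
    """In-memory stand-in for a searchable, centralized event store."""

    def __init__(self) -> None:
        self._records: Dict[str, EventRecord] = {}

    def record(self, event_id: str, source: str, topic: str, payload: dict) -> EventRecord:
        # setdefault keeps the first record if the same event ID arrives again
        return self._records.setdefault(
            event_id, EventRecord(event_id, source, topic, payload))

    def attempt(self, event_id: str, handler: Callable[[dict], None]) -> str:
        """Run the handler and log the outcome of this attempt."""
        rec = self._records[event_id]
        try:
            handler(rec.payload)
            outcome = {"status": "succeeded", "error": None}
        except Exception as exc:
            outcome = {"status": "failed", "error": repr(exc)}
        rec.attempts.append({**outcome, "at": time.time()})
        return outcome["status"]

    def search(self, **filters: str) -> List[EventRecord]:
        """Filter by any record attribute, e.g. source, topic, or status."""
        return [r for r in self._records.values()
                if all(getattr(r, k) == v for k, v in filters.items())]

    def replay_failed(self, handler: Callable[[dict], None]) -> int:
        """Retry every failed event; return how many succeeded this time."""
        failed = self.search(status="failed")
        return sum(self.attempt(r.event_id, handler) == "succeeded" for r in failed)


log = EventLog()
log.record("evt_1", "shopify", "orders", {"id": 1})
log.record("evt_2", "shopify", "products", {"id": 2})
log.attempt("evt_1", lambda p: 1 / 0)    # simulated handler bug
log.attempt("evt_2", lambda p: None)
```

Keeping every attempt, rather than only the latest status, is what makes failure causes and retry history visible after the fact.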

Slide 15

What’s next?
 A new era for webhooks event delivery
 Webhooks === 2007! Years without meaningful change have come to an end.
 Thin events, native filtering, etc.
  Evolution in webhooks support, e.g. Stripe thin events & Shopify native filters
 Event Destinations
  Alternative event delivery methods to deliver directly to your message bus
 Event Gateways
  Cloud infrastructure primitive to manage event interoperability
 [Diagram labels: AWS EventBridge, Hookdeck, GCP Pub/Sub]

Slide 16

What’s next?
 Event Destinations
 → eventdestinations.org
 OSS implementation
 [Diagram: PRODUCER-1 to PRODUCER-4 and app logic → Event Destinations → AWS EventBridge, Kafka, GCP Pub/Sub, Hookdeck, RabbitMQ + more…]

Slide 17

What’s next?
 Event Gateways

Slide 18

Event Gateway Components
 Serverless Ingestion, Routing & Filtering

Slide 19

Event Gateway Components
 Serverless Ingestion, Routing & Filtering

Slide 20

Event Gateway Components
 Serverless Ingestion, Routing & Filtering

Slide 21

Event Gateway Components
 Serverless Ingestion, Routing & Filtering

Slide 22

Event Gateway Components
 Message Queue & Throughput Control
 Live Demo

Slide 23

Event Gateway Components
 Event Log & Retries
 Live Demo

Slide 24

Webhooks at Scale: Key Takeaways
 Webhooks ≠ Simple
  They’re your first step into distributed, event-driven architecture.
 Scale brings failure, duplication, overload
  Plan for retries, backpressure, and observability from day one.
 Adopt battle-tested patterns
  Idempotency, queue-first ingestion, and fetch-before-process are your friends.
 The landscape is shifting
  Platforms like Shopify and Stripe are investing heavily. Your infra should evolve too.
 Modern problems need modern infra
  Event Gateways and Event Destinations are emerging to simplify webhook delivery.
 Phil Leggetter, Head of DX

Slide 25

Q&A
 Phil Leggetter, Head of DX
 @hookdeck @leggetter phil@hookdeck.com