ELIXIR + CQRS ARCHITECTING FOR AVAILABILITY, OPERABILITY, AND MAINTAINABILITY AT PAGERDUTY Jon Grieman jon@pagerduty.com

PAGERDUTY PAGERDUTY IS A LEADING DIGITAL OPERATIONS PLATFORM - CONNECTING PEOPLE TO REAL TIME OPPORTUNITIES

WE PAGE PEOPLE

WHO ARE YOU? JON GRIEMAN ▸ 3.5 Years at PagerDuty ▸ First team to choose Elixir ▸ Elixir since 2016

TODAY'S TALK ▸ The story of a service ▸ Design, Architecture and Evolution of one of our services ▸ Why Elixir was a good fit ▸ Operability and Maintainability

BACKGROUND

DESIGN ELEMENTS OF THE NEW SYSTEM Creation Querying

TECHNIQUES FOR GOING ASYNCHRONOUS ▸ System Can No Longer Expect Immediate Existence ▸ Upstream ID Generation ▸ Progressive Degradation ▸ Monitoring

UPSTREAM ID GENERATION
Timestamp: 2019-08-29 14:15:00 = 0x5d684054
Random Bits: 0xfc3540a0
Composite Temporal ID: 0x5d684054fc3540a0
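A minimal sketch of how such an ID could be composed upstream, written in Python for illustration rather than Elixir. It assumes the layout suggested by the slide: a 32-bit Unix timestamp in seconds in the high half (0x5d684054 is 1567113300, i.e. 2019-08-29 21:15:00 UTC, or 14:15:00 at UTC-7) and 32 random bits in the low half. The function names are hypothetical and not taken from the talk.

```python
import secrets
import time

TIMESTAMP_BITS = 32  # assumption: Unix seconds in the high half
RANDOM_BITS = 32     # assumption: random bits in the low half


def compose_id(timestamp_s: int, random_bits: int) -> int:
    """Pack a timestamp and a random value into one 64-bit integer."""
    if not 0 <= timestamp_s < 2**TIMESTAMP_BITS:
        raise ValueError("timestamp does not fit in 32 bits")
    if not 0 <= random_bits < 2**RANDOM_BITS:
        raise ValueError("random value does not fit in 32 bits")
    return (timestamp_s << RANDOM_BITS) | random_bits


def generate_id(now_s=None) -> int:
    """Generate an ID on the caller's side, without asking the database."""
    timestamp_s = int(time.time()) if now_s is None else now_s
    return compose_id(timestamp_s, secrets.randbits(RANDOM_BITS))


def split_id(composite_id: int):
    """Recover (timestamp, random bits) from a composite ID."""
    return composite_id >> RANDOM_BITS, composite_id & (2**RANDOM_BITS - 1)


# The values from the slide:
print(hex(compose_id(0x5D684054, 0xFC3540A0)))  # 0x5d684054fc3540a0
```

Because the ID is produced by the upstream system, a caller can hand it out before the record exists in the database, which is what the asynchronous design requires.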

THE BIRTHDAY PROBLEM
$$1 - \left(\frac{M-1}{M}\right)^{\frac{N(N-1)}{2}} = P$$
N: Number of Random Values
M: Range of Possible Values
P: Probability of Collision
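The formula can be evaluated directly. The sketch below is illustrative Python; the formula itself is the usual approximation that treats each of the N(N-1)/2 pairs as independent. Taking M = 2^32 assumes 32 random bits per ID, and taking N as the number of IDs generated within the same one-second timestamp is likewise an assumption, not a figure from the talk.

```python
import math


def collision_probability(n: int, m: int) -> float:
    """P = 1 - ((M-1)/M)^(N(N-1)/2), computed stably for very large M."""
    pairs = n * (n - 1) / 2
    # log1p and expm1 avoid losing precision when 1/M is tiny.
    return -math.expm1(pairs * math.log1p(-1.0 / m))


# Classic sanity check: 23 people and 365 days give roughly 50%.
print(round(collision_probability(23, 365), 3))

# Assumed scenario: 1000 IDs sharing one timestamp, 32 random bits each.
print(collision_probability(1000, 2**32))
```

Under these assumptions the collision probability stays small only while the number of IDs per timestamp tick is far below the square root of M.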

COMPOSITE KEY ADVANTAGES
0x294c7b029    0x5d68405400fcd001
0x294c7b02a    0x5d6840543590ca95
0x294c7b02b    0x5d6841231c3540a0
0x294c7b02c    0x5d68412c83d843b2
0x294c7b02d    0x5d68412cf822b0ac
0x294c7b02e    0x5d6842c8de1c52a8
0x294c7b02f    0x5d6842c86c29299b
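The slide lists the values without spelling out the advantages. The Python sketch below illustrates two properties that follow from the layout, assuming the high 32 bits are a Unix timestamp: the creation time can be read back from the ID, and a time window maps to a contiguous ID range. The helper names are hypothetical.

```python
RANDOM_BITS = 32  # assumption: low 32 bits are random

# Composite IDs from the slide, in the order shown.
COMPOSITE_IDS = [
    0x5D68405400FCD001,
    0x5D6840543590CA95,
    0x5D6841231C3540A0,
    0x5D68412C83D843B2,
    0x5D68412CF822B0AC,
    0x5D6842C8DE1C52A8,
    0x5D6842C86C29299B,
]


def id_timestamp(composite_id: int) -> int:
    """Read the embedded timestamp back out of an ID."""
    return composite_id >> RANDOM_BITS


def id_range_for_window(start_s: int, end_s: int):
    """Inclusive ID bounds covering every ID created in [start_s, end_s]."""
    low = start_s << RANDOM_BITS
    high = (end_s << RANDOM_BITS) | ((1 << RANDOM_BITS) - 1)
    return low, high


low, high = id_range_for_window(0x5D68412C, 0x5D68412C)
in_window = [i for i in COMPOSITE_IDS if low <= i <= high]
print([hex(i) for i in in_window])
```

Note that the last two IDs on the slide share a timestamp, so their relative order is decided by the random bits: ordering by ID matches creation order only down to the timestamp resolution.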

SEPARATION ISSUES [Figure: Multi Dependency vs. Single Dependency]

MONITORING ASYNCHRONOUS QUEUE ▸ Monitor your queue ▸ Backups are visible as missing data ▸ Throughput capacity gates recovery rates ▸ Stay ahead; find ways to get ahead
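The point that throughput capacity gates recovery can be made concrete with a little arithmetic: a backlog drains only at the rate by which consumer capacity exceeds the arrival rate. The Python sketch below is illustrative; the field names and the numbers are assumptions, not metrics from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueSample:
    newest_offset: int     # latest offset written by producers
    committed_offset: int  # latest offset the consumer has confirmed
    arrival_rate: float    # messages per second coming in
    capacity: float        # max messages per second the consumer can process


def lag(sample: QueueSample) -> int:
    """Messages written but not yet confirmed as processed."""
    return max(sample.newest_offset - sample.committed_offset, 0)


def recovery_seconds(sample: QueueSample) -> Optional[float]:
    """Time to drain the backlog, or None if the consumer cannot get ahead."""
    backlog = lag(sample)
    if backlog == 0:
        return 0.0
    headroom = sample.capacity - sample.arrival_rate
    if headroom <= 0:
        return None  # backlog never shrinks without more capacity
    return backlog / headroom


sample = QueueSample(newest_offset=10_000, committed_offset=4_000,
                     arrival_rate=100.0, capacity=150.0)
print(lag(sample), recovery_seconds(sample))  # 6000 120.0
```

With only 50 messages per second of headroom, a 6000-message backlog takes two minutes to clear, during which newly created data is not yet visible to readers.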

SEPARATION OF MODIFICATION AND READ ▸ Command Query Responsibility Segregation (or separation) Creation Querying

CQRS AT ITS HEART IS THE NOTION THAT YOU CAN USE A DIFFERENT MODEL TO UPDATE INFORMATION THAN THE MODEL YOU USE TO READ INFORMATION. Martin Fowler https://martinfowler.com/bliki/CQRS.html
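As a rough illustration of the idea in the quote, the Python sketch below keeps a write-side model (a command handled by a creator) separate from a read-side model (a summary returned by a querier). The class names echo the Creation and Querying labels on the slides, but the fields, the shared dictionary standing in for the database, and the 20-character preview are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CreateRecord:
    """Write-side model: a command describing what should be stored."""
    record_id: int
    body: str


@dataclass(frozen=True)
class RecordSummary:
    """Read-side model: shaped for what clients want to see."""
    record_id: int
    preview: str


class Creator:
    """Handles commands only; never serves reads."""

    def __init__(self, table: Dict[int, str]) -> None:
        self._table = table

    def handle(self, command: CreateRecord) -> bool:
        # Idempotent insert: a redelivered command is ignored.
        if command.record_id in self._table:
            return False
        self._table[command.record_id] = command.body
        return True


class Querier:
    """Serves reads only; never modifies the table."""

    def __init__(self, table: Dict[int, str]) -> None:
        self._table = table

    def get(self, record_id: int) -> Optional[RecordSummary]:
        body = self._table.get(record_id)
        if body is None:
            return None  # may simply not have been created yet
        return RecordSummary(record_id, body[:20])
```

Because the two sides share only the stored data, each can be deployed, scaled, and monitored on its own.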

CQRS + ELIXIR

BENEFITS OF SEPARATION : SCALING

BENEFITS OF SEPARATION : MONITORING ▸ Different Load Profiles ▸ Spot trends ▸ OS & Hardware level metrics

Container Orchestrator Cluster Creator Jobs Querier Job

Upstream Systems Kafka Creator Database Cluster Querier Client Systems

LASP A SUITE OF LIBRARIES AIMED AT PROVIDING A COMPREHENSIVE PROGRAMMING SYSTEM FOR PLANETARY SCALE ELIXIR AND ERLANG APPLICATIONS LASP Project Description

Region 1 Region 2

Region 1 X X X Region 2 ???

Queue 1    Queue 2

  1. Write 0xF00 (success)
  2. Write 0xF00 (failure)
  3. Record 0xF00 was a failure
  4. Was 0xF00 a success?
  5. 0xF00 failed: halt processing

MATCHING PROGRESS TO THE LOG Backup

AN INCIDENT ▸ Bad Creation Request ▸ Mismatch with Kafka ▸ All data within a Message Set had to be processed before any progress could be confirmed ▸ If processing failed for long enough, the system couldn’t recover without intervention
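The failure mode described above can be sketched in a few lines of Python. The first function models all-or-nothing progress over a message set, where one bad request blocks the confirmed offset no matter how often the set is retried. The second shows one common alternative, setting the bad message aside so progress can still be confirmed. The talk does not say how the incident was actually resolved, so the quarantine approach, the function names, and the `validate` handler are assumptions for illustration only.

```python
def validate(payload: str) -> None:
    """Hypothetical handler: rejects a malformed creation request."""
    if payload == "bad":
        raise ValueError("bad creation request")


def process_message_set(messages, handler, committed_offset):
    """All-or-nothing: progress is confirmed only if every message succeeds."""
    for _offset, payload in messages:
        handler(payload)  # any exception aborts the whole set
    return messages[-1][0] if messages else committed_offset


def retry_message_set(messages, handler, committed_offset, attempts):
    """Retrying the same set cannot help while it contains a bad message."""
    for _ in range(attempts):
        try:
            return process_message_set(messages, handler, committed_offset)
        except ValueError:
            continue
    return committed_offset  # no progress was confirmed


def process_with_quarantine(messages, handler, committed_offset):
    """Alternative: set bad messages aside and still confirm progress."""
    quarantined = []
    for offset, payload in messages:
        try:
            handler(payload)
        except ValueError:
            quarantined.append((offset, payload))
    new_offset = messages[-1][0] if messages else committed_offset
    return new_offset, quarantined


message_set = [(10, "a"), (11, "bad"), (12, "c")]
print(retry_message_set(message_set, validate, 9, attempts=3))  # 9
print(process_with_quarantine(message_set, validate, 9))
```

In the all-or-nothing case the good messages at offsets 10 and 12 are also held back, and the backlog behind the stuck set keeps growing for as long as the failure lasts.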

NON LINEARITIES PERFORMANCE

THERE’S ONLY ONE TEST ENVIRONMENT THAT MATTERS PRODUCTION

Region 1 Stack 1 Stack 2 Region 2

HOW IT WENT

SCHEDULED MAINTENANCE SCHEDULED MAINTENANCE IS FOR CARS, NOT FOR SAAS. Tim Armandpour SVP, Engineering, PagerDuty

GOING FURTHER : REPLACE THE DB ENGINE ENTIRELY

CHANGING THE ENGINES No one noticed No Downtime / Maintenance Windows No Incidents No Negative Customer Impact

ARCHITECTURAL ADVANTAGES ▸ Discussions of CQRS usually focus on architecture and code ▸ Benefits to Operability and Maintainability are under-appreciated

OTHER SYSTEMS ▸ Not all PagerDuty systems follow this pattern ▸ Use it where suitable ▸ Later uses in more complex situations ▸ Leveraging ETS ▸ Event Sourcing approaches ▸ Snapshotting of materialized views
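The slide only names event sourcing and snapshotting of materialized views, so the Python sketch below is a generic illustration of how the two fit together, not a description of any PagerDuty system. A view is rebuilt by replaying events, and a snapshot records how far replay had reached so that later rebuilds can start from there. In Elixir the view might live in an ETS table; here a plain dictionary stands in for it, and the event shapes are hypothetical.

```python
def apply_event(view: dict, event: tuple) -> dict:
    """Fold one event into the materialized view, returning a new view."""
    kind, key, value = event
    updated = dict(view)
    if kind == "set":
        updated[key] = value
    elif kind == "delete":
        updated.pop(key, None)
    return updated


def rebuild(events: list, snapshot=None):
    """Replay events into a view, optionally resuming from a snapshot.

    A snapshot is (number_of_events_already_applied, view_at_that_point).
    """
    applied, view = snapshot if snapshot is not None else (0, {})
    for event in events[applied:]:
        view = apply_event(view, event)
    return len(events), view


events = [
    ("set", "a", 1),
    ("set", "b", 2),
    ("delete", "a", None),
    ("set", "c", 3),
]

full = rebuild(events)               # replay everything
snapshot = rebuild(events[:2])       # state after the first two events
resumed = rebuild(events, snapshot)  # replay only what came after
print(full == resumed)
```

Resuming from a snapshot yields the same view as a full replay while touching only the events recorded after the snapshot was taken.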

THANKS! Q&A

WE ARE HIRING! ▸ Hiring Elixir Developers! ▸ Toronto, San Francisco, & Atlanta ▸ Would love to hear from you! jon@pagerduty.com jon on elixirforums

Images and Figures ▸ Birthday Paradox Graph: Wikipedia, Birthday Problem ▸ Server by Vectorstall from the Noun Project ▸ queue by Kirby Wu from the Noun Project ▸ database by Saifurrijal from the Noun Project ▸ log by Shastry from the Noun Project ▸ Lock by Creative Stall from the Noun Project ▸ PagerDuty Engineer Photo: PagerDuty ▸ Other Photos: Jon Grieman