Elixir + CQRS - Architecting for Availability, Operability, and Maintainability At PagerDuty

A presentation at ElixirConf2019 in August 2019 in Aurora, CO, USA by Jon Grieman

Slide 1

ELIXIR + CQRS - ARCHITECTING FOR AVAILABILITY, OPERABILITY, AND MAINTAINABILITY AT PAGERDUTY
Jon Grieman
jon@pagerduty.com

Slide 2

PAGERDUTY
PAGERDUTY IS A LEADING DIGITAL OPERATIONS PLATFORM - CONNECTING PEOPLE TO REAL-TIME OPPORTUNITIES

Slide 3

WE PAGE PEOPLE

Slide 4

WHO ARE YOU? JON GRIEMAN
▸ 3.5 Years at PagerDuty
▸ First team to choose Elixir
▸ Elixir since 2016

Slide 5

TODAY'S TALK
▸ The story of a service
▸ Design, Architecture and Evolution of one of our services
▸ Why Elixir was a good fit
▸ Operability and Maintainability

Slide 6

BACKGROUND

Slide 7

DESIGN ELEMENTS OF THE NEW SYSTEM
Creation
Querying

Slide 8

TECHNIQUES FOR GOING ASYNCHRONOUS
▸ System Can No Longer Expect Immediate Existence
▸ Upstream ID Generation
▸ Progressive Degradation
▸ Monitoring

Slide 9

UPSTREAM ID GENERATION
Timestamp: 2019-08-29 14:15:00 = 0x5d684054
Random Bits: 0xfc3540a0
Composite Temporal ID: 0x5d684054fc3540a0
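
The composition shown on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than the talk's Elixir code, and it assumes a 32-bit Unix-seconds timestamp in the high half and 32 random bits in the low half, which is what the slide's hex values suggest. The function name `composite_temporal_id` is hypothetical.

```python
import secrets
import time


def composite_temporal_id(unix_seconds=None, random_bits=None):
    """Build a 64-bit ID: 32-bit timestamp (high half) + 32 random bits (low half).

    Assumed layout, inferred from the slide's example values.
    """
    if unix_seconds is None:
        unix_seconds = int(time.time())
    if random_bits is None:
        random_bits = secrets.randbits(32)
    # Mask both halves to 32 bits so neither can spill into the other.
    return ((unix_seconds & 0xFFFFFFFF) << 32) | (random_bits & 0xFFFFFFFF)


# Reproduce the slide's example from its two hex halves.
slide_id = composite_temporal_id(0x5D684054, 0xFC3540A0)
print(hex(slide_id))  # 0x5d684054fc3540a0
```

Because the ID is built from the clock and a local random source, any upstream system could generate it without asking the downstream store for a sequence number.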

Slide 10

THE BIRTHDAY PROBLEM
$$1 - \left(\frac{M-1}{M}\right)^{\frac{N(N-1)}{2}} = P$$
N: Number of Random Values
M: Range of Possible Values
P: Probability of Collision
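
To get a feel for the numbers, the formula can be evaluated directly. The sketch below is illustrative Python; the figure of 1,000 IDs generated within the same second is a made-up load, and M = 2^32 assumes the 32 random bits shown on the previous slide.

```python
import math


def collision_probability(n, m):
    """Approximate P = 1 - ((m - 1) / m) ** (n * (n - 1) / 2)."""
    pairs = n * (n - 1) / 2
    # log1p/expm1 keep precision when 1/m is tiny, e.g. m = 2**32.
    return -math.expm1(pairs * math.log1p(-1.0 / m))


# Classic birthday case: 23 people, 365 possible days.
classic = collision_probability(23, 365)

# Hypothetical load: 1,000 IDs sharing one timestamp, 32 random bits each.
same_second = collision_probability(1000, 2**32)

print(f"{classic:.3f}")
print(f"{same_second:.2e}")
```

Only IDs that share a timestamp compete for the same random range, so N here is the per-second creation count, not the total number of records.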

Slide 11

COMPOSITE KEY ADVANTAGES
0x294c7b029    0x5d68405400fcd001
0x294c7b02a    0x5d6840543590ca95
0x294c7b02b    0x5d6841231c3540a0
0x294c7b02c    0x5d68412c83d843b2
0x294c7b02d    0x5d68412cf822b0ac
0x294c7b02e    0x5d6842c8de1c52a8
0x294c7b02f    0x5d6842c86c29299b
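
One property visible in the 64-bit values listed on this slide is that they stay ordered by creation time, much like a sequential key. The sketch below checks that, assuming the values use the timestamp-high, random-low layout from the Upstream ID Generation slide. The helper name `timestamp_of` is hypothetical.

```python
# The 64-bit values from the slide.
composite_ids = [
    0x5D68405400FCD001,
    0x5D6840543590CA95,
    0x5D6841231C3540A0,
    0x5D68412C83D843B2,
    0x5D68412CF822B0AC,
    0x5D6842C8DE1C52A8,
    0x5D6842C86C29299B,
]


def timestamp_of(composite_id):
    """Recover the Unix-seconds prefix from the high 32 bits (assumed layout)."""
    return composite_id >> 32


timestamps = [timestamp_of(i) for i in composite_ids]

# Ordered by second of creation...
time_ordered = timestamps == sorted(timestamps)
# ...but not strictly ordered within a second, where the random bits decide.
fully_ordered = composite_ids == sorted(composite_ids)

print(time_ordered, fully_ordered)
```

The last two values share a timestamp and appear in descending random-bit order, which is why the second check differs from the first.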

Slide 12

SEPARATION ISSUES
[Diagram: Multi Dependency vs. Single Dependency]

Slide 13

MONITORING ASYNCHRONOUS QUEUE
▸ Monitor your queue
▸ Backups are visible as missing data
▸ Throughput capacity gates recovery rates
▸ Stay ahead, Find ways to get ahead
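
The point that throughput capacity gates recovery can be made concrete with a little arithmetic: a backlog only drains at the rate by which processing capacity exceeds incoming traffic. The sketch below is illustrative Python, and all numbers in it are hypothetical.

```python
def recovery_seconds(backlog, capacity_per_s, inflow_per_s):
    """Time to drain a backlog while new messages keep arriving."""
    headroom = capacity_per_s - inflow_per_s
    if headroom <= 0:
        # No spare capacity: the backlog never shrinks.
        return float("inf")
    return backlog / headroom


# Hypothetical: 60,000 queued messages, 1,200 msg/s capacity, 1,000 msg/s inflow.
print(recovery_seconds(60_000, 1_200, 1_000))  # 300.0

# With capacity equal to inflow, the consumer keeps pace but never catches up.
print(recovery_seconds(60_000, 1_000, 1_000))  # inf
```

Doubling the headroom halves the recovery time, which is one reason to look for ways to get ahead before a backup happens.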

Slide 14

Slide 15

SEPARATION OF MODIFICATION AND READ
▸ Command Query Responsibility Segregation (or separation)
Creation
Querying

Slide 16

CQRS AT ITS HEART IS THE NOTION THAT YOU CAN USE A DIFFERENT MODEL TO UPDATE INFORMATION THAN THE MODEL YOU USE TO READ INFORMATION.
Martin Fowler
https://martinfowler.com/bliki/CQRS.html
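
A minimal sketch of that idea follows, in Python for illustration rather than the talk's Elixir. The class names `Creator` and `Querier` borrow the labels used in the talk's diagrams, but everything else here (`CreateRecord`, `Store`, the method names, the return shape) is hypothetical and not PagerDuty's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class CreateRecord:
    """Command: the write-side model, carrying an upstream-generated ID."""
    record_id: int
    payload: str


@dataclass
class Store:
    """In-memory stand-in for the shared database."""
    rows: dict = field(default_factory=dict)


class Creator:
    """Write side: accepts commands and persists them."""

    def __init__(self, store):
        self._store = store

    def handle(self, command):
        # setdefault makes redelivery of the same command a no-op.
        self._store.rows.setdefault(command.record_id, command.payload)


class Querier:
    """Read side: never writes, and returns its own read-oriented shape."""

    def __init__(self, store):
        self._store = store

    def find(self, record_id):
        payload = self._store.rows.get(record_id)
        if payload is None:
            return None  # may simply not have been created yet
        return {"id": hex(record_id), "payload": payload}


store = Store()
Creator(store).handle(CreateRecord(0x5D684054FC3540A0, "incident opened"))
print(Querier(store).find(0x5D684054FC3540A0))
```

Since the two sides share only the store, each could be deployed, scaled, and monitored on its own.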

Slide 17

CQRS + ELIXIR

Slide 18

BENEFITS OF SEPARATION: SCALING

Slide 19

BENEFITS OF SEPARATION: MONITORING
▸ Different Load Profiles
▸ Spot trends
▸ OS & Hardware level metrics

Slide 20

[Diagram: Container Orchestrator Cluster, Creator Jobs, Querier Job]

Slide 21

[Diagram: Upstream Systems, Kafka, Creator, Database Cluster, Querier, Client Systems]

Slide 22

LASP
A SUITE OF LIBRARIES AIMED AT PROVIDING A COMPREHENSIVE PROGRAMMING SYSTEM FOR PLANETARY SCALE ELIXIR AND ERLANG APPLICATIONS
LASP Project Description

Slide 23

[Diagram: Upstream Systems, Kafka, Creator, Database Cluster, Querier, Client Systems]

Slide 24

Region 1 Region 2

Slide 25

Region 1 Region 2

Slide 26

Region 1 X X X Region 2 ???

Slide 27

Region 1 Region 2

Slide 28

Queue 1    Queue 2

  1. Write 0xF00 (success)
  2. Write 0xF00 (failure)
  3. Record 0xF00 was a failure
  4. Was 0xF00 a success?
  5. 0xF00 failed, halt processing
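
The five steps above can be read as a coordination check between two queue consumers. The sketch below shows one possible reading in illustrative Python: each consumer records the outcome of its write in a shared log, and a consumer halts when it learns that a write for the same ID failed elsewhere. The slide does not say which consumer performs which step, so the assignment here, along with the names `OutcomeLog` and `QueueConsumer`, is an assumption.

```python
class OutcomeLog:
    """Hypothetical shared record of write outcomes, keyed by record ID."""

    def __init__(self):
        self._outcomes = {}

    def record(self, record_id, succeeded):
        # A failure is sticky: one failed write marks the ID as failed.
        previous = self._outcomes.get(record_id, True)
        self._outcomes[record_id] = previous and succeeded

    def succeeded(self, record_id):
        return self._outcomes.get(record_id, True)


class QueueConsumer:
    def __init__(self, name, outcomes, write):
        self.name = name
        self.halted = False
        self._outcomes = outcomes
        self._write = write  # callable returning True on success

    def process(self, record_id):
        if self.halted:
            return
        # Steps 1-3: attempt the write and record how it went.
        self._outcomes.record(record_id, self._write(record_id))

    def confirm(self, record_id):
        # Steps 4-5: ask whether the ID succeeded everywhere; halt if not.
        if not self._outcomes.succeeded(record_id):
            self.halted = True
        return not self.halted


outcomes = OutcomeLog()
queue_1 = QueueConsumer("Queue 1", outcomes, lambda record_id: True)
queue_2 = QueueConsumer("Queue 2", outcomes, lambda record_id: False)

queue_1.process(0xF00)          # 1. write succeeds
queue_2.process(0xF00)          # 2-3. write fails and is recorded
print(queue_1.confirm(0xF00))   # 4-5. learns of the failure and halts
```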

Slide 29

Region 1 Region 2

Slide 30

MATCHING PROGRESS TO THE LOG
Backup

Slide 31

Region 1 Region 2

Slide 32

AN INCIDENT
▸ Bad Creation Request
▸ Mismatch with Kafka
▸ All data within Message Set had to be processed before any progress could be confirmed
▸ If failed for long enough, couldn’t recover without intervention
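
The third bullet describes all-or-nothing progress over a message set. The sketch below models that behavior in plain Python to show why a single bad request can stall everything behind it. It does not use any Kafka client API; `process_message_set` and the `handle` callback are hypothetical names.

```python
def process_message_set(messages, handle, committed_offset):
    """Process a batch; advance the offset only if every message succeeds.

    messages: list of (offset, payload) pairs in offset order.
    Returns the new committed offset.
    """
    for offset, payload in messages:
        try:
            handle(payload)
        except Exception:
            # One bad message blocks confirmation for the whole set.
            return committed_offset
    return messages[-1][0] if messages else committed_offset


def handle(payload):
    if payload == "bad":
        raise ValueError("cannot create record")


batch = [(10, "ok"), (11, "bad"), (12, "ok")]

# Retrying the same set never moves the committed offset forward.
offset = 9
for _ in range(3):
    offset = process_message_set(batch, handle, offset)
print(offset)
```

While the consumer retries, new messages keep arriving behind the stuck set, which fits the final bullet's warning about failures that last long enough.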

Slide 33

NON LINEARITIES PERFORMANCE

Slide 34

THERE’S ONLY ONE TEST ENVIRONMENT THAT MATTERS
PRODUCTION

Slide 35

Region 1 Stack 1 Stack 2 Region 2

Slide 36

HOW IT WENT

Slide 37

Region 1 Stack 1 Stack 2 Region 2

Slide 38

SCHEDULED MAINTENANCE
SCHEDULED MAINTENANCE IS FOR CARS, NOT FOR SAAS.
Tim Armandpour
SVP, Engineering, PagerDuty

Slide 39

Region 1 Stack 1 Stack 2 Region 2

Slide 40

GOING FURTHER: REPLACE THE DB ENGINE ENTIRELY

Slide 41

CHANGING THE ENGINES
No one noticed
No Downtime / Maintenance Windows
No Incidents
No Negative Customer Impact

Slide 42

ARCHITECTURAL ADVANTAGES
▸ Discussions of CQRS usually focus on architecture and code
▸ Benefits to Operability and Maintainability are under-appreciated

Slide 43

OTHER SYSTEMS
▸ Not all PagerDuty systems follow this pattern
▸ Use it where suitable
▸ Later uses in more complex situations
▸ Leveraging ETS
▸ Event Sourcing approaches
▸ Snapshotting of materialized views

Slide 44

THANKS! Q&A

Slide 45

WE ARE HIRING!
▸ Hiring Elixir Developers!
▸ Toronto, San Francisco, & Atlanta
▸ Would love to hear from you!
jon@pagerduty.com
jon on elixirforums

Slide 46

Images and Figures
Birthday Paradox Graph: Wikipedia, Birthday Problem
Server by Vectorstall from the Noun Project
queue by Kirby Wu from the Noun Project
database by Saifurrijal from the Noun Project
log by Shastry from the Noun Project
Lock by Creative Stall from the Noun Project
PagerDuty Engineer Photo: PagerDuty
Other Photos: Jon Grieman