ELIXIR + CQRS
ARCHITECTING FOR AVAILABILITY, OPERABILITY, AND MAINTAINABILITY AT PAGERDUTY
Jon Grieman
jon@pagerduty.com
Slide 2
PAGERDUTY
PAGERDUTY IS A LEADING DIGITAL OPERATIONS PLATFORM - CONNECTING PEOPLE TO REAL TIME OPPORTUNITIES
Slide 3
WE PAGE PEOPLE
Slide 4
WHO ARE YOU?
JON GRIEMAN
▸ 3.5 Years at PagerDuty
▸ First team to choose Elixir
▸ Elixir since 2016
Slide 5
TODAY'S TALK
▸ The story of a service
▸ Design, Architecture and Evolution of one of our services
▸ Why Elixir was a good fit
▸ Operability and Maintainability
Slide 6
BACKGROUND
Slide 7
DESIGN ELEMENTS OF THE NEW SYSTEM
Creation
Querying
Slide 8
TECHNIQUES FOR GOING ASYNCHRONOUS
▸ System Can No Longer Expect Immediate Existence
▸ Upstream ID Generation
▸ Progressive Degradation
▸ Monitoring
Slide 9
UPSTREAM ID GENERATION
Timestamp: 2019-08-29 14:15:00 = 0x5d684054
Random Bits: 0xfc3540a0
Composite Temporal ID: 0x5d684054fc3540a0
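The composition shown on this slide can be sketched in a few lines. This is a minimal Python illustration, not the service's actual code: the function names are hypothetical, and the 32-bit timestamp and 32-bit random widths are assumptions read off the eight hex digits in each half of the example. The timestamp half, 0x5d684054, is 1567113300 Unix seconds, which is 2019-08-29 21:15:00 UTC, so the 14:15:00 on the slide appears to be a UTC-7 local time.

```python
import secrets
import time
from typing import Optional, Tuple

# Assumption: both widths are inferred from the slide's example, not confirmed.
TIMESTAMP_BITS = 32
RANDOM_BITS = 32


def composite_temporal_id(unix_seconds: Optional[int] = None,
                          random_bits: Optional[int] = None) -> int:
    """Pack a Unix timestamp into the high bits and random bits into the low bits."""
    if unix_seconds is None:
        unix_seconds = int(time.time())
    if random_bits is None:
        random_bits = secrets.randbits(RANDOM_BITS)
    if not 0 <= unix_seconds < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in the timestamp field")
    if not 0 <= random_bits < (1 << RANDOM_BITS):
        raise ValueError("random value does not fit in the random field")
    return (unix_seconds << RANDOM_BITS) | random_bits


def split_id(composite: int) -> Tuple[int, int]:
    """Recover (unix_seconds, random_bits) from a composite ID."""
    return composite >> RANDOM_BITS, composite & ((1 << RANDOM_BITS) - 1)


# Reproduces the values on the slide.
example = composite_temporal_id(0x5D684054, 0xFC3540A0)
print(hex(example))  # 0x5d684054fc3540a0
```

Because the timestamp occupies the high bits, IDs generated by any upstream system sort roughly by creation time without coordinating with the service that eventually stores the record.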
Slide 10
THE BIRTHDAY PROBLEM
$$P = 1 - \left(\frac{M-1}{M}\right)^{\frac{N(N-1)}{2}}$$
N: Number of Random Values
M: Range of Possible Values
P: Probability of Collision
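To get a feel for the numbers, the formula on this slide can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the choice of M = 2^32 assumes the 32 random bits suggested by the composite ID example, with collisions only mattering among IDs that share the same timestamp value.

```python
import math


def collision_probability(n: int, m: int) -> float:
    """P = 1 - ((M - 1) / M) ** (N * (N - 1) / 2), as given on the slide.

    log1p and expm1 keep the result accurate when 1/M is tiny.
    """
    pairs = n * (n - 1) / 2
    return -math.expm1(pairs * math.log1p(-1.0 / m))


# Assumption: 32 random bits per ID, compared within a single timestamp value.
M = 2 ** 32
for n in (10, 1_000, 100_000):
    print(f"N={n:>7}: P={collision_probability(n, M):.6g}")
```

The exponent counts pairs of values, so the probability grows roughly with the square of the number of IDs created within one timestamp value.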
SEPARATION ISSUES
[Figure: Multi Dependency vs. Single Dependency]
Slide 13
MONITORING ASYNCHRONOUS QUEUE
▸ Monitor your queue
▸ Backups are visible as missing data
▸ Throughput capacity gates recovery rates
▸ Stay ahead, Find ways to get ahead
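The point that throughput capacity gates recovery rates comes down to simple arithmetic: a backlog only drains at the rate by which processing capacity exceeds incoming traffic. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import math


def recovery_seconds(backlog: float, capacity_per_s: float, arrival_per_s: float) -> float:
    """Time to drain a queue backlog while new messages keep arriving.

    Only the headroom (capacity minus arrivals) works on the backlog.
    With no headroom the backlog never drains, so infinity is returned.
    """
    headroom = capacity_per_s - arrival_per_s
    if headroom <= 0:
        return math.inf
    return backlog / headroom


# Illustrative numbers only: 6000 queued messages, 150 msg/s capacity, 100 msg/s arriving.
print(recovery_seconds(6000, 150, 100))  # 120.0
```

Under these assumptions, doubling the headroom halves the recovery time, which is one way to read "find ways to get ahead".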
Slide 14
Slide 15
SEPARATION OF MODIFICATION AND READ
▸ Command Query Responsibility Segregation (or separation)
Creation
Querying
Slide 16
CQRS
AT ITS HEART IS THE NOTION THAT YOU CAN USE A DIFFERENT MODEL TO UPDATE INFORMATION THAN THE MODEL YOU USE TO READ INFORMATION.
Martin Fowler
https://martinfowler.com/bliki/CQRS.html
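A toy illustration of that separation follows. It is a sketch in Python rather than the service's Elixir code; the Creator and Querier names are borrowed from the diagrams in this deck, while the `Store` class, the `CreateRecord` command, and the idempotent-write behaviour are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CreateRecord:
    """Command carrying an ID that was generated upstream."""
    record_id: int
    payload: str


class Store:
    """In-memory stand-in for the shared database cluster."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}


class Creator:
    """Write side: only accepts commands and never serves reads."""

    def __init__(self, store: Store) -> None:
        self._store = store

    def handle(self, command: CreateRecord) -> None:
        # Assumption: replayed commands must not overwrite an existing record.
        self._store.rows.setdefault(command.record_id, command.payload)


class Querier:
    """Read side: only answers queries and never modifies data."""

    def __init__(self, store: Store) -> None:
        self._store = store

    def get(self, record_id: int) -> Optional[str]:
        # None means "not created yet", which callers must tolerate
        # once creation is asynchronous.
        return self._store.rows.get(record_id)


store = Store()
Creator(store).handle(CreateRecord(0xF00, "incident opened"))
print(Querier(store).get(0xF00))  # incident opened
print(Querier(store).get(0xBAD))  # None
```

Because the two sides share nothing but the store, each can be deployed, scaled, and monitored on its own.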
Slide 17
CQRS + ELIXIR
Slide 18
BENEFITS OF SEPARATION : SCALING
Slide 19
BENEFITS OF SEPARATION : MONITORING
▸ Different Load Profiles ▸ Spot trends ▸ OS & Hardware level metrics
Slide 20
Container Orchestrator Cluster
Creator Jobs
Querier Job
Slide 21
Upstream Systems
Kafka
Creator
Database Cluster
Querier
Client Systems
Slide 22
LASP
A SUITE OF LIBRARIES AIMED AT PROVIDING A COMPREHENSIVE PROGRAMMING SYSTEM FOR PLANETARY SCALE ELIXIR AND ERLANG APPLICATIONS
LASP Project Description
Slide 23
Upstream Systems
Kafka
Creator
Database Cluster
Querier
Client Systems
Slide 24
Region 1
Region 2
Slide 25
Region 1
Region 2
Slide 26
Region 1
X X
X Region 2
???
Slide 27
Region 1
Region 2
Slide 28
Queue 1: Write 0xF00 (success)
Queue 2: Write 0xF00 (failure)
Record 0xF00 was a failure
Was 0xF00 a success?
0xF00 failed, halt processing
Slide 29
Region 1
Region 2
Slide 30
MATCHING PROGRESS TO THE LOG
Backup
Slide 31
Region 1
Region 2
Slide 32
AN INCIDENT
▸ Bad Creation Request
▸ Mismatch with Kafka
▸ All data within Message Set had to be processed before any progress could be confirmed
▸ If failed for long enough, couldn’t recover without intervention
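The failure mode described in these bullets can be simulated without Kafka. The sketch below is a simplified model, not the real consumer and not a Kafka client API: `confirm_progress`, `BadRequest`, and the message contents are all hypothetical. It assumes progress is confirmed once per message set, only after every message in the set has been handled.

```python
from typing import Callable, List, Sequence, Tuple


class BadRequest(Exception):
    """Raised by the (hypothetical) handler for a request it cannot process."""


def confirm_progress(message_sets: Sequence[Sequence[str]],
                     handle: Callable[[str], None]) -> int:
    """Return how many message sets were fully processed and confirmed."""
    confirmed = 0
    for message_set in message_sets:
        try:
            for message in message_set:
                handle(message)
        except BadRequest:
            # One bad request blocks its whole set: nothing in the set is
            # confirmed, and no later set is attempted.
            break
        confirmed += 1
    return confirmed


def demo() -> Tuple[int, List[str]]:
    handled: List[str] = []

    def handle(message: str) -> None:
        if message == "bad":
            raise BadRequest(message)
        handled.append(message)

    sets = [["a", "b"], ["c", "bad", "d"], ["e"]]
    return confirm_progress(sets, handle), handled


print(demo())  # (1, ['a', 'b', 'c'])
```

In this model message "c" was handled but never confirmed, so a retry handles it again while "d" and "e" keep waiting behind the bad request. The backlog grows for as long as the failure persists.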
Slide 33
NON LINEARITIES PERFORMANCE
Slide 34
THERE’S ONLY ONE TEST ENVIRONMENT THAT MATTERS
PRODUCTION
Slide 35
Region 1
Stack 1
Stack 2
Region 2
Slide 36
HOW IT WENT
Slide 37
Region 1
Stack 1
Stack 2
Region 2
Slide 38
SCHEDULED MAINTENANCE
SCHEDULED MAINTENANCE IS FOR CARS, NOT FOR SAAS.
Tim Armandpour
SVP, Engineering, PagerDuty
Slide 39
Region 1
Stack 1
Stack 2
Region 2
Slide 40
GOING FURTHER : REPLACE THE DB ENGINE ENTIRELY
Slide 41
CHANGING THE ENGINES
No one noticed
No Downtime / Maintenance Windows
No Incidents
No Negative Customer Impact
Slide 42
ARCHITECTURAL ADVANTAGES
▸ Discussions of CQRS usually focus on architecture and code
▸ Benefits to Operability and Maintainability are under-appreciated
Slide 43
OTHER SYSTEMS
▸ Not all PagerDuty systems follow this pattern
▸ Use it where suitable
▸ Later uses in more complex situations
▸ Leveraging ETS
▸ Event Sourcing approaches
▸ Snapshotting of materialized views
Slide 44
THANKS!
Q&A
Slide 45
WE ARE HIRING!
▸ Hiring Elixir Developers!
▸ Toronto, San Francisco, & Atlanta
▸ Would love to hear from you!
jon@pagerduty.com
jon on elixirforums
Slide 46
Images and Figures
Birthday Paradox Graph: Wikipedia, Birthday Problem
Server by Vectorstall from the Noun Project
queue by Kirby Wu from the Noun Project
database by Saifurrijal from the Noun Project
log by Shastry from the Noun Project
Lock by Creative Stall from the Noun Project
PagerDuty Engineer Photo: PagerDuty
Other Photos: Jon Grieman