Reliability is no Accident

A presentation at SLOconf Monthly Meetup in January 2022 in by Julie Gunderson

Slide 1

The Road To Resilience: Chaos Engineering, GameDays, & Disaster Recovery Julie Gunderson, Sr. Reliability Advocate @Gremlin @julie_gund

Slide 2

Reliability is no Accident @julie_gund

Slide 3

Slide 4

Slide 5

Slide 6

Julie Gunderson Sr. Reliability Advocate Julie@gremlin.com @julie_gund

Slide 7

Gene Kim Jez Humble Nicole Forsgren @julie_gund

Slide 8

Accelerate was published in 2018 by Nicole Forsgren, Jez Humble and Gene Kim. The evidence collected prior to the release of the book refuted the bimodal notion in tech that you have to choose between speed and stability. They discovered that speed actually depends on stability, so good tech practices give you both. @julie_gund

Slide 9

Today we will explore three key practices that improve tempo and stability @julie_gund

Slide 10

What is tempo and stability? Tempo is measured by deployment frequency and change lead time Stability is measured by mean time to recover (MTTR) and change failure rate @julie_gund

Slide 11

What is tempo and stability? Tempo Deployment Frequency The rate that software is deployed to production or an app store (e.g. within a range of multiple times a day to once a year) Tempo Change Lead Time The time it takes to go from a customer making a request to the request being satisfied Stability Mean Time To Recover (MTTR) The mean time it takes a company to recover from downtime of their software Stability Change Failure Rate The likelihood of defect changes (e.g. ⅕ ) @julie_gund

Slide 12

What are the three key practices? @julie_gund

Slide 13

Three Key Practices: 1. Chaos Engineering 2. GameDays 3. Disaster Recovery @julie_gund

Slide 14

“Build systems that are designed to be deployed easily, can detect and tolerate failures, and can have various components of the system updated independently.” - Accelerate @julie_gund

Slide 15

What is the best way to know if your system can detect and tolerate failure? @julie_gund

Slide 16

Chaos Engineering

Slide 17

Chaos Engineering (Let’s see a dependency demo!) @julie_gund

Slide 18

https://github.com/GoogleCloudPlatform/bank-of-anthos The architecture of our banking application

Slide 19

https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience? @julie_gund

Slide 20

https://app.gremlin.com/attacks/new/kubernetes Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience? @julie_gund

Slide 21

The balance appears as $—This could make the user think they have no money in their account @julie_gund

Slide 22

The user is still able to make a deposit of $1000 while the Balance Reader service is in a blackhole. @julie_gund

Slide 23

The user is unable to send payments. They will see an error that the payment failed due to Balance Reader. @julie_gund

Slide 24

The user is unable to send payments. They will see an error that the payment failed due to Balance Reader. @julie_gund

Slide 25

Free demo environment to learn about black holes 1. Use this link to install with minikube on google cloud shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_ repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshe ll_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md 2. Click minikube → start 3. In Cloud Shell terminal, run kubectl apply -f bank-of-anthos/extras/jwt/jwt-secret.yaml 4. Click <> Cloud Code → Run on Kubernetes, change to port 4503 5. To create black holes, create a namespace for gremlin and install gremlin as helm chart https://github.com/gremlin/helm @julie_gund

Slide 26

https://github.com/GoogleCloudPlatform/bank-of-anthos Now let’s see how a blackhole impacts our Transaction History service @julie_gund

Slide 27

https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home Does blackholing transaction history result in graceful degradation of the customer experience? @julie_gund

Slide 28

https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home We will get an error message “Error: Could Not Load Transactions” @julie_gund

Slide 29

kubectl scale deployment transactionhistory —replicas=2 What can we do to mitigate against a blackhole? Depending on the service, scaling replicas may work well @julie_gund

Slide 30

kubectl get pods Now we have 2 transaction history pods @julie_gund

Slide 31

https://app.gremlin.com/attacks/new/kubernetes Let’s send 50% of transaction history pods into a blackhole @julie_gund

Slide 32

https://github.com/GoogleCloudPlatform/bank-of-anthos Deployment Set: Transaction History replicas=2 Pod 1: Transaction History Pod 2: Transaction History There will be a very short outage and then the other pod will take over @julie_gund

Slide 33

https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home We are still able to see transaction history and no longer receive error messages. @julie_gund

Slide 34

GoogleCloudPlatform/bank-of-anthos @julie_gund

Slide 35

Key Practice #2: GameDays @julie_gund

Slide 36

How can GameDays Improve Tempo and Stability? Accelerate shares that Game Days are a great way to build relationships within an organization. This is a cultural must-do to become a high-performing organization. @julie_gund

Slide 37

What is an Example GameDay? Invite 4+ people to attend (2+ teams). Spike load (Gatling) and introduce failure (Gremlin). Minimum time required = 10min @julie_gund

Slide 38

What are More Example GameDays? ● Dependency Testing (Flex) ● Capacity Plan Testing ● Autoscaling Testing @julie_gund

Slide 39

How can Disaster Recovery Improve Tempo and Stability? “For DiRT-style events to be successful, an organization first needs to accept system and process failures as a means of learning… we design tests that require engineers from several groups who might not normally work together to interact with each other. That way, should a real large-scale disaster ever strike, these people will already have strong working relationships” - Kripa Krishnan, Director of Cloud Operations @ Google @julie_gund

Slide 40

What is an example DiRT? Invite 4+ people to attend. Failover 5 core services using the Gremlin blackhole. (safer than a shutdown and faster to recover) Minimum time required = 5 minutes @julie_gund

Slide 41

Can you automate this? Yes!!! @julie_gund

Slide 42

Are you ready to become a Gremlin-certified Chaos Engineering Practitioner? gremlin.com/certification

Slide 43

State of Chaos Engineering Report “Top-performing Chaos Engineering teams boast four nines of availability (52 min of downtime a year) with an MTTR of less than one hour.” - Kolton Andrus gremlin.com/state-of-chaos-engineering/2021 @julie_gund

Slide 44

Thank you! Thank you! @julie_gund