SRE: The Good, The Bad, and the Ouch

A presentation at PREVAIL in October 2021 by Holly Cummins

Slide 1

SRE: the good, the bad, and the ouch PREVAIL 2021 - An IBM Academy of Technology Conference © 2021 IBM Corporation

Slide 2

Speaker: Holly Cummins, Innovation Leader, IBM SPEED, IBM Garage alum, @holly_cummins

Slide 3

PREVAIL Technical Conference 2021 what is SRE? @holly_cummins

Slide 4

SRE: what ops would be like if it was done by software engineers

Slide 5

why SRE?

Slide 6

reliability is very important

Slide 7

old ops

Slide 8

old ops: manual

Slide 9

old ops: manual, repetitive

Slide 10

old ops: manual, repetitive, siloed

Slide 11

old ops: manual, repetitive, siloed, not aligned to business goals

Slide 12

old ops: manual, repetitive, siloed, not aligned to business goals, unable to handle complexity of cloud native

Slide 13

eliminate repetitive tasks

Slide 14

eliminate toil

Slide 15

aligned incentives

Slide 16

failure is a symptom, not a cause

Slide 17

devops?

Slide 18

SRE / DevOps: automate everything

Slide 19

SRE / DevOps: holistic + collaborative

Slide 20

what could possibly go wrong?

Slide 21

true story, the cunning rebrand: “we’re SRE now” (IBM Garage)

Slide 22

true story, the cunning rebrand: “we’re SRE now”

Slide 23

SRE: what ops would be like if it was done by software engineers
software engineer

Slide 24

SRE: what ops would be like if it was done by software engineers
software engineer

Slide 25

SRE: what ops would be like if it was done by software engineers
ops, software engineer

Slide 26

SRE: what ops would be like if it was done by ops?
ops

Slide 27

SRE: what ops would be like if it was done by ops?
ops

Slide 28

true story, the cunning rebrand: “we are just as good: we have scripts”

Slide 29

what triggers the scripts?

Slide 30

how much contact do SRE have with dev?
SRE, dev

Slide 31

how much contact do SRE have with dev?
SRE, dev

Slide 32

are there any SRE NFRs in the dev backlog?
SRE, dev

Slide 33

true story, the cunning rebrand: “we do SRE … in silos”

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

I am not designed for this.

Slide 39

two war rooms

Slide 40

team mainframe, team mobile

Slide 41

team mainframe: we’re responsible for stability of the mainframe
team mobile: we’re responsible for stability of the front end

Slide 42

team mainframe: we’re responsible for stability of the mainframe … as long as it’s used correctly
team mobile: we’re responsible for stability of the front end
the ambassador

Slide 43

true story, dots aren’t connected: “we have a ticket per team, not per incident”

Slide 44

“we want to do SRE but we don’t have enough permissions on our systems”

Slide 45

“the DBAs don’t trust us”

Slide 46

“it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”

Slide 47

“it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”

Slide 48

silos cost

Slide 49

true story, the gap between intent and reality: “we do post-mortems after every incident … maybe”

Slide 50

Slide 51

measure the number of incidents

Slide 52

measure the number of incidents
measure the number of post-mortems

Slide 53

measure the number of incidents
measure the number of post-mortems
see if they match
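
A minimal sketch of the comparison on this slide, assuming incidents and post-mortems can both be exported as lists of incident identifiers. The function name and the `INC-…` identifiers are hypothetical and do not come from the talk.

```python
def postmortem_coverage(incident_ids, postmortem_incident_ids):
    """Compare recorded incidents with the incidents that got a post-mortem.

    Returns (coverage, missing): the fraction of incidents that were reviewed
    and the sorted list of incidents with no post-mortem on record.
    """
    incidents = set(incident_ids)
    # Ignore post-mortems that point at unknown incidents.
    reviewed = set(postmortem_incident_ids) & incidents
    missing = sorted(incidents - reviewed)
    coverage = len(reviewed) / len(incidents) if incidents else 1.0
    return coverage, missing


# Hypothetical data: four incidents, three post-mortems.
coverage, missing = postmortem_coverage(
    ["INC-1", "INC-2", "INC-3", "INC-4"],
    ["INC-1", "INC-3", "INC-4"],
)
print(f"{coverage:.0%} of incidents had a post-mortem; missing: {missing}")
```

If the two counts match, `missing` is empty; anything else is a list of incidents worth following up.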

Slide 54

Slide 55

advanced metrics:

Slide 56

advanced metrics:
how many people were in the post-mortem?

Slide 57

advanced metrics:
how many people were in the post-mortem?
does it include more than the people directly involved?

Slide 58

advanced metrics:
how many people were in the post-mortem?
does it include more than the people directly involved?
did we invite more than our own team?

Slide 59

true story: “no one says anything in our blameless post-mortems”

Slide 60

‘blameless’ post-mortem

Slide 61

if involvement in an incident is punished, people will avoid engaging with systems

Slide 62

“great idea, go build that!”
if ideas are punished with extra work, people will try not to have ideas

Slide 63

true story, the perverse incentive: “we have success metrics”

Slide 64

metrics are good

Slide 65

SREs are data-driven

Slide 66

but …

Slide 67

as senior leaders, be careful what you incentivise

Slide 68

be careful what behaviours you discourage

Slide 69

true story, the perverse incentive: “we count how many incidents we have; if the number goes down, it means we are working better”

Slide 70

outstanding quality!

Slide 71

delivery excellence!

Slide 72

fewer people working → fewer incidents

Slide 73

new release → more incidents

Slide 74

what should you measure?

Slide 75

Slide 76

make work visible

Slide 77

true story, the email timesink: “we never seem to complete the work we planned”

Slide 78

1 sprint
theory

Slide 79

1 sprint
theory: 50% story points

Slide 80

1 sprint
theory: 50% unplanned work (tickets), 50% story points

Slide 81

1 sprint
theory: 50% unplanned work (tickets), 50% story points
reality

Slide 82

1 sprint
theory: 50% unplanned work (tickets), 50% story points
reality: 10% story points

Slide 83

1 sprint
theory: 50% unplanned work (tickets), 50% story points
reality: 50% tickets, 10% story points

Slide 84

1 sprint
theory: 50% unplanned work (tickets), 50% story points
reality: 50% tickets, 40% ??, 10% story points

Slide 85

1 sprint
theory: 50% unplanned work (tickets), 50% story points
reality: 50% tickets, 40% ??, 10% story points

Slide 86

Slide 87

“can you just …”

Slide 88

“can you just …”
“how do I do this?”

Slide 89

“can you just …”
“how do I do this?”
“where is this documented?”

Slide 90

“can you just …”
“how do I do this?”
“where is this documented?”

Slide 91

Slide 92

this wasn’t a team failure

Slide 93

this wasn’t a team failure
it was a data quality issue

Slide 94

this wasn’t a team failure
it was a data quality issue
it was a process issue

Slide 95

track work
use the data to eliminate toil
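
As an illustration of what "track work" could produce, the sketch below totals logged hours by category and reports each category's share of the sprint. Everything here is an assumption for illustration: the function name, the category labels, and the hours, which are invented only to mirror the 50% / 40% / 10% split from the sprint slides.

```python
from collections import Counter


def work_breakdown(entries):
    """Return (category, share_of_total_hours) pairs, largest share first.

    `entries` is an iterable of (category, hours) tuples.
    """
    hours = Counter()
    for category, spent in entries:
        hours[category] += spent
    total = sum(hours.values())
    if total == 0:
        return []
    return [(category, spent / total) for category, spent in hours.most_common()]


# Hypothetical sprint log; the untracked "can you just ..." requests only
# show up once someone writes them down.
sprint_log = [
    ("tickets", 12), ("ad hoc requests", 10), ("story points", 4),
    ("tickets", 8), ("ad hoc requests", 6),
]
for category, share in work_breakdown(sprint_log):
    print(f"{category}: {share:.0%}")
```

The largest categories that are not planned work are the natural first candidates for automation or better documentation.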

Slide 96

measure blockers

Slide 97

mean time to failure?
mean time to detect problems?

Slide 98

what is failure in a complex system?
if a system goes down but user experience is fine, does that count?

Slide 99

measure “what have I learned”
measure “have I made sure it won’t happen again”

Slide 100

true client story, value on the shelf: “we can’t actually release this.”

Slide 101

Slide 102

what’s stopping more frequent deploys?

Slide 103

“it costs too much to release”

Slide 104

“it costs too much to release”
you can fix that

Slide 105

“we can’t ship until we have more confidence in the quality”

Slide 106

“we can’t ship until we have more confidence in the quality”
you can fix that

Slide 107

deferred wiring

Slide 108

feature flags
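
A minimal sketch of a feature flag, assuming flags are read from environment-style variables. The `FEATURE_` prefix, the function names, and the pricing logic are all hypothetical; the talk does not prescribe an implementation. The point is only that new code can be deployed switched off and released later by flipping the flag.

```python
import os


def flag_enabled(name, environ=None):
    """Read a boolean flag from an environment-style mapping (default: off)."""
    environ = os.environ if environ is None else environ
    value = environ.get(f"FEATURE_{name.upper()}", "")
    return value.lower() in {"1", "true", "on"}


def checkout_total(cart, environ=None):
    """Price a cart; the new code path ships dark until the flag is on."""
    if flag_enabled("new_pricing", environ):
        return round(sum(cart) * 0.9, 2)  # hypothetical new behaviour
    return sum(cart)  # existing behaviour


print(checkout_total([10, 20], {}))
print(checkout_total([10, 20], {"FEATURE_NEW_PRICING": "true"}))
```

Passing the mapping explicitly keeps the example testable without touching the real process environment.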

Slide 109

true client story, the monolithic microservices: “we can’t release this microservice… we deploy all our microservices at the same time… because otherwise nothing works.”

Slide 110

let’s talk about microservices

Slide 111

true client story, the peril of microservices: “every time we change code, something breaks”

Slide 112

just because a system runs across 6 containers doesn’t mean it’s decoupled

Slide 113

Slide 114

Mars Climate Orbiter

Slide 115

Courtesy NASA/JPL-Caltech

Slide 116

Slide 117

Slide 118

distributing did not help

Slide 119

metric units
distributing did not help

Slide 120

metric units, imperial units
distributing did not help
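
One way to guard an interface against a metric/imperial mismatch is to pass values that carry their unit instead of bare numbers. The sketch below is an illustration only: the `Impulse` type and function names are hypothetical, and impulse is used simply as an example quantity. The conversion factor is the standard definition of one pound-force in newtons.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # one pound-force second in newton-seconds


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value):
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value):
        # Convert at the boundary, so every consumer sees one unit.
        return cls(value * LBF_S_TO_N_S)


def plan_correction(impulse):
    """Consumer side: refuse bare numbers whose unit is unknown."""
    if not isinstance(impulse, Impulse):
        raise TypeError("expected an Impulse, not a bare number")
    return impulse.newton_seconds


print(plan_correction(Impulse.from_pound_force_seconds(1.0)))
```

Splitting the system into separate components does nothing for this class of bug; an explicit contract at the boundary does.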

Slide 121

testing

Slide 122

Cluster + Ariane 5
$370 million loss
https://en.wikipedia.org/wiki/Cluster_(spacecraft)

Slide 123

Slide 124

they tested it …

Slide 125

they tested it …
but stubbed out one component.

Slide 126

they tested it …
but stubbed out one component.
that component was the one that broke.

Slide 127

the Ariane failed in 36 seconds
you can’t A/B test a $370 million rocket

Slide 128

testing will always be incomplete
aim for recoverability

Slide 129

Slide 130

resilience

Slide 131

resilience
recoverability

Slide 132

observability

Slide 133

they often couldn’t see the orbiter

Slide 134

feedback is good engineering

Slide 135

when SRE is right it is great

Slide 136

bank

Slide 137

remember this bank?
team mainframe, team mobile

Slide 138

remember this bank?
team mainframe
team mobile: we’re responsible for stability of the front end

Slide 139

remember this bank?
team mainframe: we’re responsible for stability of the mainframe … as long as it’s used correctly
team mobile: we’re responsible for stability of the front end
the ambassador

Slide 140

one team: web front-end, back-end, another department …

Slide 141

one team: mobile front-end, web front-end, back-end, another department …

Slide 142

one team
CI/CD pipelines, canary deploys
CI/CD pipelines, big-bang deploys onto AIX
one team, range of techniques

Slide 143

by the way …

Slide 144

big bang deploys

Slide 145

big bang deploys: 50% failure rate

Slide 146

big bang deploys: 50% failure rate
canary deploys

Slide 147

big bang deploys: 50% failure rate
canary deploys: 10% failure rate
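
For readers unfamiliar with the term, a canary deploy exposes a new version to a small slice of traffic before everyone gets it. The sketch below shows one common way to pick that slice deterministically; it is not taken from the talk, and the function name and user-id scheme are hypothetical.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Decide whether a user is served by the canary version.

    Hashing the user id gives each user a stable bucket from 0 to 99, so the
    same user keeps seeing the same version while the percentage is unchanged.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


# Hypothetical rollout: start with 10% of users on the new version.
users = [f"user-{n}" for n in range(1000)]
on_canary = sum(routes_to_canary(user, 10) for user in users)
print(f"{on_canary} of {len(users)} users routed to the canary")
```

Raising `canary_percent` step by step widens the rollout, and setting it back to 0 is the rollback.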

Slide 148

industrial

Slide 149

remember the suspicious DBAs?

Slide 150

two root problems:
• automation
• trust and transparency

Slide 151

trigger automation via Slack

Slide 152

because it was transparent, DBAs were happy and automated more things

Slide 153

what happens when things go wrong?

Slide 154

Slide 155

leadership need to provide a safety net.

Slide 156

celebrate success
celebrate failure

Slide 157

celebrate success
celebrate learning

Slide 158

transformation endurance

Slide 159

remember the why

Slide 160

better, safer, faster, happier

Slide 161

PREVAIL 2021 - An IBM Academy of Technology Conference. The information in this presentation is representative of the presenter and their views and opinions are not necessarily those of IBM or of the IBM Academy of Technology.