How Do You Infect Your Organization With Humane Ops?

A presentation at DevOpsDays Riga 2018 in September 2018 in Riga, Latvia by Matt Stratton

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

THE DATA

PagerDuty commissioned a study of over 10,000 companies across 100 different segments.

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Remember, memes are another way of evolving across generations. This happens in the world of Snow Crash, but it can happen in your organization as well.

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

If we don’t treat every outage or alert as something to learn from or something to improve, we run the risk of the Normalization of Deviance effect.

In this case, we start to treat alerts or degradations as acceptable. Our standards suffer. We let things slip through the cracks.

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Andy Fleener, Platform Operations Manager, SportsEngine: “We review every alert from the last 24 hours/weekend every day. No broken windows.”
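A daily review like the one described in the quote needs a rule for which alerts are in scope. The following is a minimal sketch, not Sportsengine's actual process: the alert shape (a dict with a `fired_at` timestamp), the function names, and the choice of a 72-hour window on Mondays to cover the weekend are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def review_window_start(now: datetime) -> datetime:
    """Return the start of the review window ending at `now`.

    Assumption: reviews happen once per working day, so a Monday review
    looks back 72 hours to cover the weekend; other days look back 24 hours.
    """
    hours = 72 if now.weekday() == 0 else 24  # weekday() == 0 means Monday
    return now - timedelta(hours=hours)


def alerts_to_review(alerts: list, now: datetime) -> list:
    """Return every alert that fired inside the review window, none skipped."""
    start = review_window_start(now)
    return [alert for alert in alerts if start <= alert["fired_at"] <= now]


if __name__ == "__main__":
    # Hypothetical alerts; a real list would come from your alerting tool.
    sample = [
        {"name": "disk-full", "fired_at": datetime(2018, 9, 20, 14, 0)},
        {"name": "high-latency", "fired_at": datetime(2018, 9, 22, 3, 30)},
    ]
    monday_morning = datetime(2018, 9, 24, 9, 0)
    for alert in alerts_to_review(sample, monday_morning):
        print(alert["name"], alert["fired_at"].isoformat())
```

The point of the sketch is that the filter has no severity threshold: every alert in the window is on the review list, which is what "no broken windows" asks for.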

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Let’s make sure that we are setting the proper expectations. We don’t want to just expect five nines of reliability because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?

Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent. Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.
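To make the "why do you need five?" question concrete, it can help to translate nines into a yearly downtime budget. This is a rough sketch of that arithmetic only; the function name is made up for illustration, and it assumes a 365-day year with availability measured over the whole year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001 (0.01 percent)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, going from four nines to five shrinks the yearly budget from roughly 53 minutes to roughly 5. Whether that difference is worth its cost is the business-outcome question raised above.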

Slide 42

Slide 43

Slide 44

Slide 45

Slide 46

Don’t over-design systems. Resume-driven development is almost always a recipe for on-call disasters.

Slide 47

At the heart of every complex resilient system is the hubris that someone believed they could predict everything that could go wrong. Fate, and the internet, laughs.

Slide 48

Slide 49

Slide 50

Ask how the on-call engineer is feeling during stand-ups. Give them the opportunity to mention that they might be burning out.

Slide 51

Slide 52

Slide 53

Slide 54

Slide 55

Slide 56

Slide 57

Slide 58

Slide 59

Volunteer to help as an incident commander. (What’s that? Maybe we should have them!)

Slide 60

You want to get all the right people on the call as soon as you need to…but you also want to get them OFF of the call as soon as possible.

Slide 61

Slide 62

Slide 63

Slide 64

Slide 65

Slide 66

Slide 67

Slide 68

Slide 69

Slide 70

Slide 71

Slide 72

Even if it’s not on a card.

Slide 73

Slide 74

Slide 75

Slide 76

These might seem obvious, but if they’re so obvious, I assume you’ve done them already?

Slide 77

Slide 78

Slide 79

Slide 80

Slide 81

Slide 82

Slide 83

Slide 84

Slide 85

Slide 86

Slide 87

Slide 88