Heresy and Evangelism: Schism in the Church of Data

A presentation at DevOpsDays Nashville 2019 in May 2019 in Nashville, TN, USA by Aaron Aldrich

Slide 1

Slide 1

Heresy & Evangelism Schism in the church of monitoring (@elastic) - Aaron Aldrich (@crayzeigh) 1

Slide 2

Slide 2

Hi! ! 4 Community @ 4 Reach me here: 4 aaron.aldrich@elastic.co 4 @CrayZeigh 4 Slides are here: 4 noti.st/crayzeigh 4 This picture is amazing, come @ me. 2

Slide 3

Slide 3

A word from our sponsor (@elastic) - Aaron Aldrich (@crayzeigh) 3

Slide 4

Slide 4

4 We make: 4 Elasticsearch 4 Logstash 4 Kibana 4 Beats 4 Elastic APM (open tracing, ooo) 4 We host: 4 Elastic Search Service 4 Site Search 4 App Search 4 You can run it all where ever 4 Open Source 4 We’re hiring (Fully Distributed, oooh, aaah) 4 Talk to me later (@elastic) - Aaron Aldrich (@crayzeigh) 4

Slide 5

Slide 5

Let’s find out where we’re at. (@elastic) - Aaron Aldrich (@crayzeigh) 5

Slide 6

Slide 6

How many of you deal with monitoring as a job function? (@elastic) - Aaron Aldrich (@crayzeigh) 6

Slide 7

Slide 7

How many of you touch monitoring in some way? (@elastic) - Aaron Aldrich (@crayzeigh) 7

Slide 8

Slide 8

Uptime Performance/Resource Utilization Response time? (@elastic) - Aaron Aldrich (@crayzeigh) 8

Slide 9

Slide 9

Why? (@elastic) - Aaron Aldrich (@crayzeigh) 9

Slide 10

Slide 10

Things Fall Apart * * something about a slouching beast (@elastic) - Aaron Aldrich (@crayzeigh) 10

Slide 11

Slide 11

Incidents Suck (@elastic) - Aaron Aldrich (@crayzeigh) 11

Slide 12

Slide 12

Locus of Control (@elastic) - Aaron Aldrich (@crayzeigh) 12

Slide 13

Slide 13

(@elastic) - Aaron Aldrich (@crayzeigh) 13

Slide 14

Slide 14

(@elastic) - Aaron Aldrich (@crayzeigh) 14

Slide 15

Slide 15

(@elastic) - Aaron Aldrich (@crayzeigh) 15

Slide 16

Slide 16

(@elastic) - Aaron Aldrich (@crayzeigh) 16

Slide 17

Slide 17

100% (@elastic) - Aaron Aldrich (@crayzeigh) 17

Slide 18

Slide 18

99.999% (@elastic) - Aaron Aldrich (@crayzeigh) 18

Slide 19

Slide 19

Just a minute! (@elastic) - Aaron Aldrich (@crayzeigh) 19

Slide 20

Slide 20

Eine Minute, bitte! ! Stolen Joke, if you know where it’s from we’re probably friends (@elastic) - Aaron Aldrich (@crayzeigh) 20

Slide 21

Slide 21

NINES don’t matter… (@elastic) - Aaron Aldrich (@crayzeigh) 21

Slide 22

Slide 22

(@elastic) - Aaron Aldrich (@crayzeigh) 22

Slide 23

Slide 23

NINES don’t matter when USERS aren’t HAPPY ~ Charity Majors (@mipsytipsy) (@elastic) - Aaron Aldrich (@crayzeigh) 23

Slide 24

Slide 24

She doesn’t care whether or not [the datacenter is literally on fire], just as long as the ship’s coming in. ! Cake - Italian Leather Sofa [Lightly Interpreted] (@elastic) - Aaron Aldrich (@crayzeigh) 24

Slide 25

Slide 25

How does your business make money? (@elastic) - Aaron Aldrich (@crayzeigh) 25

Slide 26

Slide 26

How do you help? (@elastic) - Aaron Aldrich (@crayzeigh) 26

Slide 27

Slide 27

DevOps is about delivering Value (@elastic) - Aaron Aldrich (@crayzeigh) 27

Slide 28

Slide 28

(@elastic) - Aaron Aldrich (@crayzeigh) 28

Slide 29

Slide 29

(@elastic) - Aaron Aldrich (@crayzeigh) 29

Slide 30

Slide 30

Observability (@elastic) - Aaron Aldrich (@crayzeigh) 30

Slide 31

Slide 31

Isn’t it just monitoring with better SEO? - You (@elastic) - Aaron Aldrich (@crayzeigh) 31

Slide 32

Slide 32

You’re not wrong… (@elastic) - Aaron Aldrich (@crayzeigh) 32

Slide 33

Slide 33

(@elastic) - Aaron Aldrich (@crayzeigh) 33

Slide 34

Slide 34

Traditional Architecture 4 Predictable 4 Obvious relationships 4 able to be easily modeled 4 System Health is an accurate predictor of user experience 4 Dashboards are useful and valuable (@elastic) - Aaron Aldrich (@crayzeigh) 34

Slide 35

Slide 35

(@elastic) - Aaron Aldrich (@crayzeigh) 35

Slide 36

Slide 36

Complex Systems 4 Always changing 4 Difficult or impossible to model 4 emergent behavior (unknown-unknowns) 4 non-linear relationships 4 feedback loops 4 can adapt and have memory 4 can be nested 4 System health and user experience are no longer directly related (@elastic) - Aaron Aldrich (@crayzeigh) 36

Slide 37

Slide 37

Root Cause is a myth (@elastic) - Aaron Aldrich (@crayzeigh) 37

Slide 38

Slide 38

(@elastic) - Aaron Aldrich (@crayzeigh) 38

Slide 39

Slide 39

One-in-a-million chances crop up nine times out of ten ~ Sir Terry Pratchett ! “Pterry” for short, which gives me joy (@elastic) - Aaron Aldrich (@crayzeigh) 39

Slide 40

Slide 40

SRE (@elastic) - Aaron Aldrich (@crayzeigh) 40

Slide 41

Slide 41

(@elastic) - Aaron Aldrich (@crayzeigh) SLI SLO SLA 41

Slide 42

Slide 42

Services not systems (@elastic) - Aaron Aldrich (@crayzeigh) 42

Slide 43

Slide 43

(@elastic) - Aaron Aldrich (@crayzeigh) 43

Slide 44

Slide 44

Site Reliability Engineering 4 (SLI) What is availability? 4 (SLO) How much do we actually need? 4 (SLA) What happens when we’re not meeting this target? (@elastic) - Aaron Aldrich (@crayzeigh) 44

Slide 45

Slide 45

Site Reliability Engineering 4 (SLI) What is availability? 4 (SLO) How much do we actually need? 4 (SLA) What happens when we’re not meeting this target? (@elastic) - Aaron Aldrich (@crayzeigh) 45

Slide 46

Slide 46

Service Level Indicators 4 Is it up? 4 200OK 4 latency 4 percentiles or medians for meaning (@elastic) - Aaron Aldrich (@crayzeigh) 46

Slide 47

Slide 47

Service Level Indicators 4 Is it up? ! 4 200OK 4 latency 4 percentiles or medians for meaning ! Never trust averages, they hide data (@elastic) - Aaron Aldrich (@crayzeigh) 47

Slide 48

Slide 48

Never trust averages, they hide data (@elastic) - Aaron Aldrich (@crayzeigh) 48

Slide 49

Slide 49

The 99th percentile latency of requests received in <300 ms and responded to with a 200 status (@elastic) - Aaron Aldrich (@crayzeigh) 49

Slide 50

Slide 50

Service Level Objectives How much availability do we need? (@elastic) - Aaron Aldrich (@crayzeigh) 50

Slide 51

Slide 51

99% (@elastic) - Aaron Aldrich (@crayzeigh) 51

Slide 52

Slide 52

99.9% (@elastic) - Aaron Aldrich (@crayzeigh) 52

Slide 53

Slide 53

99.99% (@elastic) - Aaron Aldrich (@crayzeigh) 53

Slide 54

Slide 54

99.999% (@elastic) - Aaron Aldrich (@crayzeigh) 54

Slide 55

Slide 55

Each 9 is exponentially more expensive to provide (@elastic) - Aaron Aldrich (@crayzeigh) 55

Slide 56

Slide 56

availability avg per year avg per day 99% 3.65 days 14.4 minutes 99.9% 8.76 hours 1.44 minutes 99.99% 52.56 minutes 8.64 seconds 99.999% 5.25 minutes 863 ms (@elastic) - Aaron Aldrich (@crayzeigh) 56

Slide 57

Slide 57

A good SLO barely keeps users happy (these should be driving your alerts) (@elastic) - Aaron Aldrich (@crayzeigh) 57

Slide 58

Slide 58

Error Budgets (@elastic) - Aaron Aldrich (@crayzeigh) 58

Slide 59

Slide 59

It’s GOOD to have errors (@elastic) - Aaron Aldrich (@crayzeigh) 59

Slide 60

Slide 60

(@elastic) - Aaron Aldrich (@crayzeigh) 60

Slide 61

Slide 61

Error Budgets Bring Balance to the Force (@elastic) - Aaron Aldrich (@crayzeigh) 61

Slide 62

Slide 62

SLAs = (@elastic) - Aaron Aldrich (@crayzeigh) 62

Slide 63

Slide 63

SLAs = (@elastic) - Aaron Aldrich (@crayzeigh) 63

Slide 64

Slide 64

What about the fire? (@elastic) - Aaron Aldrich (@crayzeigh) 64

Slide 65

Slide 65

(@elastic) - Aaron Aldrich (@crayzeigh) 65

Slide 66

Slide 66

(@elastic) - Aaron Aldrich (@crayzeigh) 66

Slide 67

Slide 67

(@elastic) - Aaron Aldrich (@crayzeigh) 67

Slide 68

Slide 68

(@elastic) - Aaron Aldrich (@crayzeigh) 68

Slide 69

Slide 69

(@elastic) - Aaron Aldrich (@crayzeigh) 69

Slide 70

Slide 70

Observability A system is observable when you can ask arbitrary questions about it and receive meaningful answers without having to resort to writing new code or command line tools. It lets you discover unknown-unknowns and debug in production. (@elastic) - Aaron Aldrich (@crayzeigh) 70

Slide 71

Slide 71

Three Pillars of Observability 4 Metrics 4 Logs 4 APM (@elastic) - Aaron Aldrich (@crayzeigh) 71

Slide 72

Slide 72

These aren’t pillars. (@elastic) - Aaron Aldrich (@crayzeigh) 72

Slide 73

Slide 73

(@elastic) - Aaron Aldrich (@crayzeigh) 73

Slide 74

Slide 74

Three Pillars of Carpentry? stahp. (@elastic) - Aaron Aldrich (@crayzeigh) 74

Slide 75

Slide 75

They’re tools, not pillars You need to know how to use them (@elastic) - Aaron Aldrich (@crayzeigh) 75

Slide 76

Slide 76

Metrics 4 Great, not on their own ! 4 largely contextless 4 need further notation to be valuable (tags) 4 Easy to store lots of them 4 collection can be a pain ! Check out Open Metrics! openmetrics.io (@elastic) - Aaron Aldrich (@crayzeigh) 76

Slide 77

Slide 77

High Cardinality Data 4 UUIDs 4 raw queries 4 comments 4 firstname, lastname 4 PID/PPID 4 app ID 4 device ID 4 build ID 4 IP:port 4 shopping cart ID 4 userid (@elastic) - Aaron Aldrich (@crayzeigh) 77

Slide 78

Slide 78

What’s better at carrying Cardinality? (@elastic) - Aaron Aldrich (@crayzeigh) 78

Slide 79

Slide 79

Events! (@elastic) - Aaron Aldrich (@crayzeigh) 79

Slide 80

Slide 80

(Logs) (@elastic) - Aaron Aldrich (@crayzeigh) 80

Slide 81

Slide 81

But please not these: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] “GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12846 64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] “GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1” 200 4523 64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] “GET /mailman/listinfo/hsdivision HTTP/1.1” 200 6291 64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] “GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1” 200 7352 64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] “GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1” 200 5253 64.242.88.10 - - [07/Mar/2004:16:23:12 -0800] “GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore¶m1=1.12¶m2=1.12 HTTP/1.1” 200 11382 64.242.88.10 - - [07/Mar/2004:16:24:16 -0800] “GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1” 200 4924 64.242.88.10 - - [07/Mar/2004:16:29:16 -0800] “GET /twiki/bin/edit/Main/Header_checks?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:30:29 -0800] “GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:31:48 -0800] “GET /twiki/bin/view/TWiki/WebTopicEditTemplate HTTP/1.1” 200 3732 64.242.88.10 - - [07/Mar/2004:16:32:50 -0800] “GET /twiki/bin/view/Main/WebChanges HTTP/1.1” 200 40520 64.242.88.10 - - [07/Mar/2004:16:33:53 -0800] “GET /twiki/bin/edit/Main/Smtpd_etrn_restrictions?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:35:19 -0800] “GET /mailman/listinfo/business HTTP/1.1” 200 6379 64.242.88.10 - - [07/Mar/2004:16:36:22 -0800] “GET /twiki/bin/rdiff/Main/WebIndex?rev1=1.2&rev2=1.1 HTTP/1.1” 200 46373 64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] “GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1” 200 4140 64.242.88.10 - - [07/Mar/2004:16:39:24 -0800] “GET /twiki/bin/view/Main/TokyoOffice HTTP/1.1” 200 3853 64.242.88.10 - - [07/Mar/2004:16:43:54 -0800] “GET /twiki/bin/view/Main/MikeMannix HTTP/1.1” 200 3686 64.242.88.10 - - [07/Mar/2004:16:45:56 -0800] “GET /twiki/bin/attach/Main/PostfixCommands HTTP/1.1” 401 12846 64.242.88.10 - - [07/Mar/2004:16:47:12 -0800] “GET /robots.txt HTTP/1.1” 200 68 64.242.88.10 - - [07/Mar/2004:16:47:46 -0800] “GET /twiki/bin/rdiff/Know/ReadmeFirst?rev1=1.5&rev2=1.4 HTTP/1.1” 200 5724 64.242.88.10 - - [07/Mar/2004:16:49:04 -0800] “GET /twiki/bin/view/Main/TWikiGroups?rev=1.2 HTTP/1.1” 200 5162 64.242.88.10 - - [07/Mar/2004:16:50:54 -0800] “GET /twiki/bin/rdiff/Main/ConfigurationVariables HTTP/1.1” 200 59679 64.242.88.10 - - [07/Mar/2004:16:52:35 -0800] “GET /twiki/bin/edit/Main/Flush_service_name?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:53:46 -0800] “GET /twiki/bin/rdiff/TWiki/TWikiRegistration HTTP/1.1” 200 34395 64.242.88.10 - - [07/Mar/2004:16:54:55 -0800] “GET /twiki/bin/rdiff/Main/NicholasLee HTTP/1.1” 200 7235 64.242.88.10 - - [07/Mar/2004:16:56:39 -0800] “GET /twiki/bin/view/Sandbox/WebHome?rev=1.6 HTTP/1.1” 200 8545 64.242.88.10 - - [07/Mar/2004:16:58:54 -0800] “GET /mailman/listinfo/administration HTTP/1.1” 200 6459 lordgun.org - - [07/Mar/2004:17:01:53 -0800] “GET /razor.html HTTP/1.1” 200 2869 64.242.88.10 - - [07/Mar/2004:17:09:01 -0800] “GET /twiki/bin/search/Main/SearchResult?scope=text®ex=on&search=Joris%20*Benschop[^A-Za-z] HTTP/1.1” 200 4284 (@elastic) - Aaron Aldrich (@crayzeigh) 81

Slide 82

Slide 82

Structured Data ! { “message”:”user_deleted”, “user”: { “id”:6, “email”:”crayzeigh@example.com”, “created_at”:”2015-12-11T04:31:46.828Z”, “updated_at”:”2015-12-11T04:32:18.340Z”, “name”:”crayzeigh”, “role”:”user”, “invitation_token”:null, “invitation_created_at”:null, “invitation_sent_at”:null, “invitation_accepted_at”:null, “invitation_limit”:null, “invited_by_id”:null, “invited_by_type”:null, “invitations_count”:0 }, “@timestamp”:”2015-12-11T13:35:50.070+00:00”, “@version”:”1”, “severity”:”INFO”, “host”:”app1-web1”, “type”:”apps” } ! from James Turnbull: https://www.kartar.net/2015/12/structured-logging/ (@elastic) - Aaron Aldrich (@crayzeigh) 82

Slide 83

Slide 83

Generate LOTS of events use sampling to store them (@elastic) - Aaron Aldrich (@crayzeigh) 83

Slide 84

Slide 84

OK let’s talk about APM (@elastic) - Aaron Aldrich (@crayzeigh) 84

Slide 85

Slide 85

Distributed Tracing ! Check out Open tracing fron CNCF: opentracing.io (@elastic) - Aaron Aldrich (@crayzeigh) 85

Slide 86

Slide 86

Instrumentation: SLIs are a good place to start (@elastic) - Aaron Aldrich (@crayzeigh) 86

Slide 87

Slide 87

Kill Staging: Test in Production (@elastic) - Aaron Aldrich (@crayzeigh) 87

Slide 88

Slide 88

(@elastic) - Aaron Aldrich (@crayzeigh) 88

Slide 89

Slide 89

This doesn’t eliminate QA or testing (please test before prod) (@elastic) - Aaron Aldrich (@crayzeigh) 89

Slide 90

Slide 90

Kill your staging environment 4 always out of sync 4 can’t replicate prod traffic anyway 4 definitely can’t replicate real users 4 replace with feature flags and canary deploys ! Launch Darkly talks about this a lot. You should listen to what they have to say. (@elastic) - Aaron Aldrich (@crayzeigh) 90

Slide 91

Slide 91

O11y ❤ ‘s QA Start leveraging a common toolset (@elastic) - Aaron Aldrich (@crayzeigh) 91

Slide 92

Slide 92

Every Dashboard sucks (@elastic) - Aaron Aldrich (@crayzeigh) 92

Slide 93

Slide 93

(@elastic) - Aaron Aldrich (@crayzeigh) 93

Slide 94

Slide 94

Not really, some dashboards are pretty good (@elastic) - Aaron Aldrich (@crayzeigh) 94

Slide 95

Slide 95

(@elastic) - Aaron Aldrich (@crayzeigh) 95

Slide 96

Slide 96

It’s about Storytelling know your audience (@elastic) - Aaron Aldrich (@crayzeigh) 96

Slide 97

Slide 97

Ops & Incident Response 4 Interactive 4 Iterative 4 Involve search bars (@elastic) - Aaron Aldrich (@crayzeigh) 97

Slide 98

Slide 98

Vendor Warning: Search & Common Data Schema (@elastic) - Aaron Aldrich (@crayzeigh) 98

Slide 99

Slide 99

Making O11y Evangelists (@elastic) - Aaron Aldrich (@crayzeigh) 99

Slide 100

Slide 100

Don’t just start making changes (@elastic) - Aaron Aldrich (@crayzeigh) 100

Slide 101

Slide 101

(@elastic) - Aaron Aldrich (@crayzeigh) 101

Slide 102

Slide 102

History is important (@elastic) - Aaron Aldrich (@crayzeigh) 102

Slide 103

Slide 103

Change conducted poorly breaks organizations (@elastic) - Aaron Aldrich (@crayzeigh) 103

Slide 104

Slide 104

top-down mandated change never works ☠ Did you know “defenestration” is the act of throwing someone out a window? (@elastic) - Aaron Aldrich (@crayzeigh) 104

Slide 105

Slide 105

Talk to other parts of the business to understand what stories they value (@elastic) - Aaron Aldrich (@crayzeigh) 105

Slide 106

Slide 106

LISTEN It’s all about context (@elastic) - Aaron Aldrich (@crayzeigh) 106

Slide 107

Slide 107

Start measuring business values (@elastic) - Aaron Aldrich (@crayzeigh) 107

Slide 108

Slide 108

Who else might care about dashboards? (@elastic) - Aaron Aldrich (@crayzeigh) 108

Slide 109

Slide 109

What data can we expose to the rest of the business? (@elastic) - Aaron Aldrich (@crayzeigh) 109

Slide 110

Slide 110

110

Slide 111

Slide 111

111

Slide 112

Slide 112

112

Slide 113

Slide 113

Dashboards help tell stories with context (@elastic) - Aaron Aldrich (@crayzeigh) 113

Slide 114

Slide 114

Share results Good and Bad (@elastic) - Aaron Aldrich (@crayzeigh) 114

Slide 115

Slide 115

Are your systems up? Are they responding acceptably? (@elastic) - Aaron Aldrich (@crayzeigh) 115

Slide 116

Slide 116

Who cares? (@elastic) - Aaron Aldrich (@crayzeigh) 116

Slide 117

Slide 117

(@elastic) - Aaron Aldrich (@crayzeigh) 117

Slide 118

Slide 118

Are your services delivering value? (@elastic) - Aaron Aldrich (@crayzeigh) 118