Production testing through monitoring

A presentation at DevOpsDays Rockies in April 2016 in Denver, CO, USA by Leon Fayer

Slide 1

Troubleshooting with monitoring Testing in production DevOps monitoring [something] testing [something] monitoring [something] in production Leon Fayer @papa_fire

Slide 2

WHO AM I?
๏ engineer for 20+ years
๏ professional cynic
๏ @ OmniTI
๏ build and operate big systems
๏ we are hiring!
๏ omniti.com/is/hiring
THAT’S ME
❖ @papa_fire
❖ leon@omniti.com
❖ fayerplay.com
❖ slideshare.net/LeonFayer1

Slide 3

I HATE TESTING

Slide 4

testing is required

Slide 5

testing is not enough

Slide 6

> unit testing
> functional testing
> resilience testing
> performance testing
> …

Slide 7

testing can give a false sense of security

Slide 8

testing is deterministic

Slide 9

data problem

Slide 10

> quantity of data
> frequency of data
> quality of data

Slide 11

example Wolfe+585

Slide 12

example Hubert Blaine Wolfeschlegelsteinhausenbergerdorffwelchevoralternwarengewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvorangreifendurchihrraubgierigfeindewelchevoralternzwolfhunderttausendjahresvorandieerscheinenvonderersteerdemenschderraumschiffgenachtmittungsteinundsiebeniridiumelektrischmotorsgebrauchlichtalsseinursprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchennachbarschaftdersternwelchegehabtbewohnbarplanetenkreisedrehensichundwohinderneuerassevonverstandigmenschlichkeitkonntefortpflanzenundsicherfreuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvorandererintelligentgeschopfsvonhinzwischensternartigraum, Sr.

Slide 13

user problem

Slide 14

“Users (n) - distributed fault injection test suite for production”

Slide 15

example Corrupted Blood bug

Slide 16

example

Slide 17

other factors

Slide 18

> lack of foresight (Y2K bug)
> too many use-cases (female Tauren bug)
> change to assumptions

Slide 19

testing is great for “known knowns”

Slide 20

testing is ok for “known unknowns”

Slide 21

testing is bad for “unknown unknowns”

Slide 22

enter monitoring

Slide 23

why monitor?

Slide 24

because testing isn’t enough

Slide 25

> software is never perfect
> systems are complex
> external dependency worry
> proactive is better than reactive
> …

Slide 26

because things change

Slide 27

because things change in production

Slide 28

what to monitor?

Slide 29

“In God we trust, all others we monitor”

Slide 30

> systems
> databases
> applications
> integration points
> performance
> user behavior
> …

Slide 31

is it enough?

Slide 32

is it too much?

Slide 33

what is important?

Slide 34

what is important? (i.e. what to alert on)

Slide 35

example
> servers up and running
> HTTP checks return 200
> tweets are lost
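
A minimal Python sketch of the gap this example points at: the system-level check passes while the business-level one fails. The talk does not say how these checks were built, so every name here (`Snapshot`, `tweets_received`, `tweets_stored`) and the 1% loss threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One monitoring sample; all fields are hypothetical."""
    http_status: int       # result of the HTTP health check
    tweets_received: int   # tweets accepted by the front end
    tweets_stored: int     # tweets that actually reached storage


def system_check(s: Snapshot) -> bool:
    """System-level view: the endpoint answers with 200."""
    return s.http_status == 200


def business_check(s: Snapshot, max_loss_ratio: float = 0.01) -> bool:
    """Business-level view: accepted tweets are not being lost."""
    if s.tweets_received == 0:
        return True  # nothing could be lost in this interval
    lost = s.tweets_received - s.tweets_stored
    return lost / s.tweets_received <= max_loss_ratio


# Servers are up and HTTP returns 200, yet most tweets never get stored.
sample = Snapshot(http_status=200, tweets_received=1000, tweets_stored=400)
print(system_check(sample), business_check(sample))
```

Alerting on `business_check` rather than `system_check` is one way to read "what is important": the second function watches the outcome the business cares about, not the machinery underneath it.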

Slide 36

s/system checks/unit tests/

Slide 37

“I don’t give a **** if the datacenter is on fire as long as I am still making money” - CEO

Slide 38

we monitor because things change

Slide 39

changes affect business

Slide 40

top-down approach
> understand business
> define baseline
> correlate data
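
The last two steps can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the method used in the talk: the baseline is taken to be the mean and standard deviation of recent samples, "correlate" is reduced to listing which metrics left their baseline in the same interval, and the metric names and numbers are made up.

```python
from statistics import mean, stdev


def baseline(history: list[float]) -> tuple[float, float]:
    """Baseline as (mean, standard deviation); needs at least two samples."""
    return mean(history), stdev(history)


def deviates(value: float, history: list[float], tolerance: float = 3.0) -> bool:
    """True if value is more than `tolerance` standard deviations off the mean."""
    avg, spread = baseline(history)
    return abs(value - avg) > tolerance * spread


def correlate(current: dict[str, float],
              histories: dict[str, list[float]]) -> list[str]:
    """Names of all metrics that left their baseline in the same interval."""
    return sorted(name for name, value in current.items()
                  if deviates(value, histories[name]))


# Hypothetical metrics: start from the business number, then look sideways.
histories = {
    "revenue": [100, 102, 98, 101, 99],
    "traffic": [1000, 1010, 990, 1005, 995],
    "email_sent": [500, 510, 490, 505, 495],
}
current = {"revenue": 60, "traffic": 1002, "email_sent": 120}
print(correlate(current, histories))
```

With these numbers, revenue and email volume both fall outside their baselines while traffic does not, which narrows down where to look first.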

Slide 41

example
๏ online marketing company
๏ major e-commerce component
๏ ~100 million users
๏ 1 billion emails/month
๏ 300,000 lines of code
๏ 5600+ metrics collected

Slide 42

it all starts with a call …

Slide 43

revenue

Slide 44

revenue + traffic

Slide 45

revenue + traffic + load time

Slide 46

revenue + traffic + load time + db

Slide 47

revenue + traffic + load time + db + email

Slide 48

what if … email wasn’t monitored?

Slide 49

what if … email wasn’t monitored? (it would be after this)

Slide 50

instrumentation is never done

Slide 51

example
> same symptoms
> higher decline rates
> all metrics are within norm

Slide 52

example
> same symptoms
> higher decline rates
> all metrics are within norm
AmEx blocked
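
One way a cause like this could surface is by breaking the decline rate down per card brand instead of watching only the aggregate. The slides do not say how the AmEx block was found, so the Python sketch below is an assumption: the function name, the `(card_brand, approved)` record shape, and the transaction counts are all hypothetical.

```python
from collections import defaultdict


def decline_rates(transactions):
    """Decline rate per card brand.

    transactions: iterable of (card_brand, approved) pairs (hypothetical shape).
    """
    totals = defaultdict(int)
    declined = defaultdict(int)
    for brand, approved in transactions:
        totals[brand] += 1
        if not approved:
            declined[brand] += 1
    return {brand: declined[brand] / totals[brand] for brand in totals}


# Made-up data: most Visa payments are approved, every AmEx payment is declined.
txns = ([("visa", True)] * 85
        + [("visa", False)] * 5
        + [("amex", False)] * 10)

overall = sum(1 for _, approved in txns if not approved) / len(txns)
rates = decline_rates(txns)
print(f"overall={overall:.2f}", {b: round(r, 2) for b, r in rates.items()})
```

The aggregate number only says that declines are up; the per-brand view shows that a single brand accounts for the jump.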

Slide 53

tl;dr

Slide 54

testing and monitoring, not testing or monitoring

Slide 55

understand the business

Slide 56

continuous improvement

Slide 57

{also bad at conclusions}

Slide 58

THANK YOU questions?