Troubleshooting with monitoring Testing in production DevOps monitoring [something] testing [something] monitoring [something] in production Leon Fayer @papa_fire

WHO AM I? ๏ engineer for 20+ years ๏ professional cynic ๏ @ OmniTI ๏ build and operate big systems ๏ we are hiring! ๏ omniti.com/is/hiring THAT’S ME ❖ @papa_fire ❖ leon@omniti.com ❖ fayerplay.com ❖ slideshare.net/LeonFayer1

I HATE TESTING

testing is required

testing is not enough

unit testing > functional testing > resilience testing > performance testing >…

testing can give a false sense of security

testing is deterministic

data problem

quantity of data > frequency of data > quality of data

example Wolfe+585

example Hubert Blaine Wolfeschlegelsteinhausenbergerdorffwelchevoralternwarengewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvorangreifendurchihrraubgierigfeindewelchevoralternzwolfhunderttausendjahresvorandieerscheinenvonderersteerdemenschderraumschiffgenachtmittungsteinundsiebeniridiumelektrischmotorsgebrauchlichtalsseinursprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchennachbarschaftdersternwelchegehabtbewohnbarplanetenkreisedrehensichundwohinderneuerassevonverstandigmenschlichkeitkonntefortpflanzenundsicherfreuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvorandererintelligentgeschopfsvonhinzwischensternartigraum, Sr.

user problem

“Users (n) - distributed fault injection test suite for production”

example Corrupted Blood bug

example

other factors

lack of foresight (Y2K bug) > too many use-cases (female Tauren bug) > change to assumptions

testing is great for “known knowns”

testing is ok for “known unknowns”

testing is bad for “unknown unknowns”

enter monitoring

why monitor?

because testing isn’t enough

software is never perfect > systems are complex > external dependency worry > proactive is better than reactive >…

because things change

because things change in production

what to monitor?

“in God we trust, all others we monitor”

systems > databases > applications > integration points > performance > user behavior >…

is it enough?

is it too much?

what is important? (i.e. what to alert on)

example > servers up and running > HTTP checks return 200 > tweets are lost
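
The example above describes a state where every system-level check is green while the product itself is failing. A minimal sketch of that gap in Python might look like the following. The `Snapshot` fields, both counters, and the 1% loss threshold are hypothetical names and values chosen for illustration; none of them come from the incident itself.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    servers_up: bool       # system-level check
    http_status: int       # HTTP-level check
    tweets_accepted: int   # hypothetical counter: tweets the API acknowledged
    tweets_stored: int     # hypothetical counter: tweets that reached storage


def system_checks_pass(s: Snapshot) -> bool:
    """The checks from the slide: servers are up and HTTP returns 200."""
    return s.servers_up and s.http_status == 200


def lost_tweet_ratio(s: Snapshot) -> float:
    """Fraction of acknowledged tweets that never reached storage."""
    if s.tweets_accepted == 0:
        return 0.0
    return (s.tweets_accepted - s.tweets_stored) / s.tweets_accepted


def should_alert(s: Snapshot, max_loss: float = 0.01) -> bool:
    """Alert on the business outcome, not only on system health."""
    return (not system_checks_pass(s)) or lost_tweet_ratio(s) > max_loss


# Everything looks healthy, yet 10% of tweets are being lost.
snapshot = Snapshot(servers_up=True, http_status=200,
                    tweets_accepted=1000, tweets_stored=900)
print(system_checks_pass(snapshot), should_alert(snapshot))
```

Here the system checks pass and the alert still fires, because the alert is tied to what users actually lose rather than to whether the machines respond.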

s/system checks/unit tests/

“I don’t give a **** if the datacenter is on fire as long as I am still making money” - CEO

we monitor because things change

changes affect business

top-down approach > understand business > define baseline > correlate data
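
The last two steps of the top-down approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the baseline is taken to be a mean and standard deviation over recent history, "correlate" is taken to mean a plain Pearson coefficient, and every number and the `tolerance` parameter are made up. The talk does not say which statistics were actually used.

```python
from statistics import mean, stdev


def baseline(history):
    """Assumed baseline: mean and sample standard deviation of past values."""
    return mean(history), stdev(history)


def is_anomalous(value, history, tolerance=3.0):
    """Flag a value more than `tolerance` standard deviations off baseline."""
    mu, sigma = baseline(history)
    return abs(value - mu) > tolerance * sigma


def pearson(xs, ys):
    """Pearson correlation between two equally long metric series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / (vx * vy) ** 0.5


# Hypothetical hourly figures, start with the business metric (revenue).
revenue_history = [100, 102, 98, 101, 99]
print(is_anomalous(60, revenue_history))   # a sharp drop stands out

# Then look for a technical metric that moved together with it.
revenue = [100, 90, 80, 70, 60]
emails_sent = [50, 45, 40, 35, 30]
print(pearson(revenue, emails_sent))
```

A high correlation only points at where to look next; it does not by itself establish that the technical metric caused the business change.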

example ๏ online marketing company ๏ major e-commerce component ๏ ~100 million users ๏ 1 billion emails/month ๏ 300,000 lines of code ๏ 5600+ metrics collected

it all starts with a call …

revenue

revenue + traffic

revenue + traffic + load time

revenue + traffic + load time + db

revenue + traffic + load time + db + email

what if … email wasn’t monitored? (it would be after this)

instrumentation is never done

example > same symptoms > higher decline rates > all metrics are within norm > AmEx blocked
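
One way to see how a blocked card network can hide behind aggregate numbers is to break the decline rate down by card type. The sketch below is an assumption about what such instrumentation could look like, not a description of how the problem was actually found; the transaction tuples and counts are invented for illustration.

```python
from collections import defaultdict


def decline_rates(transactions):
    """Return the decline rate per card type.

    `transactions` is an iterable of (card_type, approved) pairs,
    a hypothetical shape chosen for this sketch.
    """
    totals = defaultdict(int)
    declined = defaultdict(int)
    for card_type, approved in transactions:
        totals[card_type] += 1
        if not approved:
            declined[card_type] += 1
    return {card: declined[card] / totals[card] for card in totals}


# Invented sample: Visa behaves normally, every AmEx payment is declined.
transactions = (
    [("visa", True)] * 45
    + [("visa", False)] * 5
    + [("amex", False)] * 10
)

overall = sum(1 for _, ok in transactions if not ok) / len(transactions)
rates = decline_rates(transactions)
print(overall, rates)
```

The overall rate is elevated but ambiguous, while the per-card breakdown shows one segment failing completely, which is the kind of metric that gets added after an incident like this one.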

tl;dr

testing and monitoring not testing or monitoring

understand the business

continuous improvement

{also bad at conclusions}

THANK YOU questions?