Sensory Friendly Monitoring: Keeping the Noise Down (Ignite Format)

A presentation at DevOpsDays Charlotte 2019 (February 2019, Charlotte, NC, USA) by Quintessence Anx

Slide 1

Sensory Friendly Monitoring: Keeping the noise down
QuintessenceAnx

Slide 2

When we try to know everything … @QuintessenceAnx // @logzio

Slide 3

Too much noise can …
● … bury important / high severity alerts in a sea of low priority notices
● … cause engineering teams to start muting alarms or whole notification sources
  ○ … which in turn means the people who need to be notified aren’t.

Slide 4

Turning the dial down too far…
…yields mistrust in the quiet: we become hypervigilant.
So we need to find a happy medium.

Slide 5

Consider: the cost of noise

Slide 6

Your brain on alerts: osmotic retention

Slide 7

Time cost: ~25 minutes
(All alerts are fictional.)

Slide 8

Quality cost

Slide 9

Multitasking cost

Slide 10

How do we reduce the noise?

Slide 11

Know the sources of noise

Slide 12

Be aware of how you contribute to your noise

Slide 13

Self-induced noise
Communicate and set boundaries for what works for you
(All alerts are fictional.)

Slide 14

External noise
Categorize and create workflows around the noise

Slide 15

Categorize and flow
● False negatives
● False positives
● Fragility
● Frequency (fix it)
● What needs to be known
● Who needs to know it
● How soon should they know
● How should they be notified
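
The four flow questions on this slide (what, who, how soon, how) can be written down as a small routing table. What follows is only a minimal Python sketch and not something shown in the talk: the severity names, audiences, delays, and channels are hypothetical placeholders you would replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    audience: str           # who needs to know it
    max_delay_minutes: int  # how soon they should know
    channel: str            # how they should be notified


# Hypothetical severities, audiences, and channels, for illustration only.
ROUTES = {
    "critical": Route("on-call engineer", 0, "page"),
    "warning": Route("owning team", 60, "chat"),
    "info": Route("owning team", 1440, "daily digest"),
}


def route_alert(severity: str) -> Route:
    """Look up the route for a severity.

    Unknown severities fall back to the "warning" route so that an
    uncategorized alert is still seen by someone instead of being dropped.
    """
    return ROUTES.get(severity, ROUTES["warning"])


print(route_alert("critical"))
print(route_alert("not-yet-categorized"))
```

Keeping the table explicit makes it easy to review during cleanup: an alert that nobody can place in a row is a candidate for fixing or removal.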

Slide 16

Resiliency builds trust
● How reliable are your tools and services?
● How much notification duplication is needed?
● Do you have the ability to switch alert endpoints / mechanisms in the event of a service outage?
● Do you regularly evaluate the reliability of your services?
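
One way to read the question about switching alert endpoints is as an ordered list of channels, where the next one is tried when the previous one fails. The sketch below assumes exactly that and nothing more; the function name, the stub notifiers, and the sample message are hypothetical and do not correspond to any real paging or SMS API.

```python
from typing import Callable, List, Sequence, Tuple

Notifier = Callable[[str], None]


def notify_with_fallback(message: str,
                         notifiers: Sequence[Tuple[str, Notifier]]) -> str:
    """Try each (name, notifier) pair in order.

    Returns the name of the first channel that accepted the message and
    raises RuntimeError if every channel failed.
    """
    errors: List[str] = []
    for name, send in notifiers:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves on to the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all notification channels failed: " + "; ".join(errors))


# Hypothetical stand-ins for real integrations.
def broken_pager(message: str) -> None:
    raise ConnectionError("pager service unavailable")


sent: List[str] = []


def sms_stub(message: str) -> None:
    sent.append(message)


used = notify_with_fallback(
    "disk full on db-1",
    [("pager", broken_pager), ("sms", sms_stub)],
)
print(f"delivered via {used}")
```

A fallback chain like this only builds trust if it is exercised regularly, which ties back to the last question on the slide.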

Slide 17

Keep Alerts Relevant: Sprint Cleaning

Slide 18

Presentation available on GitHub: https://github.com/quintessence/presentations

Slide 19

Additional Reading
● “The Cost of Interrupted Work: More Speed and Stress”, Gloria Mark, Dept. of Informatics, UC Irvine
  https://www.ics.uci.edu/~gmark/chi08-mark.pdf
● “Are digital distractions harming labour productivity?”, The Economist
  https://www.economist.com/finance-and-economics/2017/12/07/are-digital-distractions-harming-labour-productivity
● “Brief Interruptions Spawn Errors”, Michigan State University
  https://msutoday.msu.edu/news/2013/brief-interruptions-spawn-errors/
● “Tenets of SRE”, Stephen Thorne, Sr Google SRE
  https://medium.com/@jerub/tenets-of-sre-8af6238ae8a8

Slide 20

@QuintessenceAnx, Developer Advocate