Best Practices for Monitoring: How to Combat Alert Fatigue

A presentation at the How to Combat Alert Fatigue Webinar, hosted by Logz.io in March 2019, by Quintessence Anx

Slide 1

WEBINAR Best Practices for Monitoring: How to Combat Alert Fatigue

Slide 2

Hello!
● Developer Advocate for Logz.io
● a.k.a. “Recovering SRE”
● Worked in IT community for 10 yrs, including ~5 years in “Cloud Engineering” / Infrastructure … i.e. SREing before it was cool (j.k. it was always cool)
● Mentors underrepresented minorities in tech in the Buffalo NY / Niagara Falls region

Quintessence Anx
Developer Advocate
quinn@logz.io
@QuintessenceAnx

Slide 3

Agenda
● Establish healthy thought patterns regarding
  ○ Why to monitor
  ○ How to create a workflow around monitoring / alerts
  ○ How to set up and maintain boundaries for monitoring and its accompanying noise

Slide 4

When we try to know everything…

Slide 5

Too much noise can…
● …bury important / high severity alerts in a sea of low priority notices
● …causing engineering teams to start muting alarms or whole alarm sources
● …which in turn means the people who need to be notified won’t be.

Slide 6

Turning the dial back too far, however…

Slide 7

Let’s find a happy medium
All alerts are fictional.

Slide 8

What is the cost of noise?

Slide 9

Your brain on alerts

Slide 10

Time cost: ~25 minutes

Slide 11

Quality cost

Slide 12

Cost of multitasking

Slide 13

So how do we reduce the noise?

Slide 14

Be aware, not overwhelmed
● Determine the sources of noise
● Categorize the types of noise
● Channel the noise into a productive workflow
● Create a routine to clear the clutter

Slide 15

Sources of noise
● Logging / alert system
● Knowledge base
● Ticketing system
● Chat integrations
● Repetition
  ○ …and you

Slide 16

Wait, I need to be aware of myself? (Absolutely.)
All alerts are fictional.

Slide 17

How often do you…
● …check your email?
● …check your social media?
● …check your text messages?
● …check your Apple / Google messages?
● …the list goes on.

Slide 18

Communication & Boundaries
● Plan for set times to focus on your work and mute non-critical alerts
● This includes messages from friends & family
● When setting boundaries, make sure your friends, family, and co-workers know what you consider to be relevant emergencies
● Set reasonable expectations for yourself and others

Slide 19

What about external sources of noise?
All alerts are fictional.

Slide 20

Categorizing your noise
● False positives
● False negatives
● Fragility
● Frequency (just fix it)

Slide 21

Save time by creating your noise flow
● What needs to be known
● Who needs to know it
● How soon should they know
● How should they be notified
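
The four questions on this slide can be written down as a small routing table. The following is only a sketch under stated assumptions: the severity levels, audiences, delays, and channel names are hypothetical placeholders, not values from the talk or from any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """What needs to be known, reduced to three hypothetical levels."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Route:
    audience: str       # who needs to know it
    max_delay_min: int  # how soon they should know, in minutes
    channel: str        # how they should be notified


# Hypothetical routing table; every value here is an illustrative assumption.
ROUTES = {
    Severity.HIGH: Route("on-call engineer", 5, "page"),
    Severity.MEDIUM: Route("owning team", 60, "chat"),
    Severity.LOW: Route("owning team", 24 * 60, "ticket"),
}


def route_alert(severity: Severity) -> Route:
    """Look up who to notify, how soon, and through which channel."""
    return ROUTES[severity]


if __name__ == "__main__":
    print(route_alert(Severity.HIGH))
```

Keeping the answers in one table makes it easier to review them together when the team revisits its alerts.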

Slide 22

Re-evaluate redundancy
Know when to add a little complexity to stop a vacuum.

Slide 23

Resilient noise builds trust
● How reliable are your tools and services?
● How much notification duplication is needed?
● Do you have the ability to switch alert endpoints in the event of a service outage?
● Do you regularly evaluate the reliability of your services (external and internal)?
All alerts are fictional.
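
One way to picture "switching alert endpoints in the event of a service outage" is an ordered list of senders that is tried until one succeeds. This is a minimal sketch, not a real integration: `notify_with_fallback`, `chat_sender_down`, and `email_sender_ok` are hypothetical names, and the two senders are stand-ins for whatever chat or email services a team actually uses.

```python
from typing import Callable, Iterable, Optional, Tuple

# A sender takes the message and returns True if delivery succeeded.
Sender = Callable[[str], bool]


def notify_with_fallback(
    message: str, senders: Iterable[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each endpoint in order; return the name of the first that delivers."""
    for name, send in senders:
        try:
            if send(message):
                return name
        except Exception:
            # Assumption: any error means the endpoint is down, so try the next.
            continue
    return None  # every endpoint failed


def chat_sender_down(message: str) -> bool:
    """Hypothetical stand-in for a chat integration that is having an outage."""
    raise ConnectionError("chat service unavailable")


def email_sender_ok(message: str) -> bool:
    """Hypothetical stand-in for a working email endpoint."""
    return True


if __name__ == "__main__":
    used = notify_with_fallback(
        "disk usage high",
        [("chat", chat_sender_down), ("email", email_sender_ok)],
    )
    print(used)
```

The `None` return matters as much as the success case: if every endpoint fails, that is itself something the team needs to find out about.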

Slide 24

Keep alerts relevant with “sprint cleaning”
For every alert triggered, ask:
● Was the notification needed?
● How was the incident resolved?
● Can the solution be automated?
● Is the solution permanent?
● How urgently was a solution needed?
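
The review questions above could be captured as one record per triggered alert. The sketch below is an assumption-laden illustration: the `AlertReview` fields mirror the five questions, but the order of checks in `suggest_action` and the wording of the suggestions are hypothetical and not prescribed by the talk.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    alert_name: str
    notification_needed: bool  # Was the notification needed?
    resolution: str            # How was the incident resolved?
    can_automate: bool         # Can the solution be automated?
    fix_is_permanent: bool     # Is the solution permanent?
    was_urgent: bool           # How urgently was a solution needed?


def suggest_action(review: AlertReview) -> str:
    """Turn a review into a follow-up; the priority order is an assumption."""
    if not review.notification_needed:
        return "remove or mute the alert"
    if review.can_automate:
        return "automate the fix"
    if not review.fix_is_permanent:
        return "schedule a permanent fix"
    if not review.was_urgent:
        return "lower the alert priority"
    return "keep as is"


if __name__ == "__main__":
    review = AlertReview(
        alert_name="disk-usage-warning",  # hypothetical alert name
        notification_needed=True,
        resolution="cleared old logs by hand",
        can_automate=True,
        fix_is_permanent=False,
        was_urgent=False,
    )
    print(suggest_action(review))
```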

Slide 25

Summing it up

Slide 26

Next steps
● Logz.io blog - Building Monitors you can Trust: https://logz.io/blog/building-monitors/
● TechBeacon - How to use monitoring for innovation and resilience, not firefighting: https://techbeacon.com/app-dev-testing/how-use-monitoring-innovation-resilience-not-firefighting

Quintessence Anx
Developer Advocate
quinn@logz.io
@QuintessenceAnx

Slide 27

Questions

Slide 28

Thanks!