WEBINAR Best Practices for Monitoring: How to Combat Alert Fatigue
Slide 2
Hello!
● Developer Advocate for Logz.io
● a.k.a. “Recovering SRE”
● Worked in the IT community for 10 years, including ~5 years in “Cloud Engineering” / Infrastructure … i.e. SREing before it was cool (j/k, it was always cool)
● Mentors underrepresented minorities in tech in the Buffalo, NY / Niagara Falls region
Quintessence Anx Developer Advocate quinn@logz.io @QuintessenceAnx
Slide 3
Agenda
● Establish healthy thought patterns regarding
○ Why to monitor
○ How to create a workflow around monitoring / alerts
○ How to set up and maintain boundaries for monitoring and its accompanying noise
Slide 4
When we try to know everything…
Slide 5
Too much noise can…
● …bury important / high severity alerts in a sea of low priority notices
● …causing engineering teams to start muting alarms or whole alarm sources
● …which in turn means the people who need to be notified won’t be.
Slide 6
Turning the dial back too far, however…
Slide 7
Let’s find a happy medium
All alerts are fictional.
Slide 8
What is the cost of noise?
Slide 9
Your brain on alerts
Slide 10
Time cost
~25 Minutes
Slide 11
Quality cost
Slide 12
Cost of multitasking
Slide 13
So how do we reduce the noise?
Slide 14
Be aware, not overwhelmed
● Determine the sources of noise
● Categorize the types of noise
● Channel the noise into a productive workflow
● Create a routine to clear the clutter
Slide 15
Sources of noise
● Logging / alert system
● Knowledge base
● Ticketing system
● Chat integrations
● Repetition
○ …and you
Slide 16
Wait, I need to be aware of myself? (Absolutely.)
Slide 17
How often do you…
● …check your email?
● …check your social media?
● …check your text messages?
● …check your Apple / Google messages?
● …the list goes on.
Slide 18
Communication & Boundaries
● Plan for set times to focus on your work and mute non-critical alerts
● This includes messages from friends & family
● When setting boundaries, make sure your friends, family, and co-workers know what you consider to be relevant emergencies
● Set reasonable expectations for yourself and others
Slide 19
What about external sources of noise?
Slide 20
Categorizing your noise
● False positives
● False negatives
● Fragility
● Frequency (just fix it)
Slide 21
Save time by creating your noise flow
● What needs to be known
● Who needs to know it
● How soon should they know
● How should they be notified
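The four questions above can be captured as a small routing table. What follows is a minimal Python sketch, not a Logz.io feature: the severity names, recipients, time limits, channels, and the `route_alert` helper are all hypothetical, and sending unknown severities to the "warning" route is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    who: str             # who needs to know it
    within_minutes: int  # how soon they should know
    channel: str         # how they should be notified


# Hypothetical table: "what needs to be known" is keyed by severity.
ROUTES = {
    "critical": Route(who="on-call engineer", within_minutes=5, channel="page"),
    "warning": Route(who="owning team", within_minutes=60, channel="chat"),
    "info": Route(who="owning team", within_minutes=24 * 60, channel="ticket"),
}


def route_alert(severity: str) -> Route:
    """Look up the route for a severity.

    Assumption: an unrecognized severity is treated as a warning so it is
    seen by a person without paging anyone.
    """
    return ROUTES.get(severity, ROUTES["warning"])
```

Writing the table down makes it easy to spot alerts that page someone but could wait for a ticket.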
Slide 22
Re-evaluate redundancy
Know when to add a little complexity to stop a vacuum.
Slide 23
Resilient noise builds trust
● How reliable are your tools and services?
● How much notification duplication is needed?
● Do you have the ability to switch alert endpoints in the event of a service outage?
● Do you regularly evaluate the reliability of your services (external and internal)?
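One way to picture switching alert endpoints during an outage is an ordered list of notifiers that are tried in turn. This is a minimal Python sketch under stated assumptions: `notify_with_fallback` and the notifier callables are hypothetical stand-ins, and a real setup would call your actual paging, chat, or email integrations.

```python
from typing import Callable, Optional, Sequence, Tuple

# A notifier takes a message and raises an exception if delivery fails.
Notifier = Callable[[str], None]


def notify_with_fallback(
    message: str, notifiers: Sequence[Tuple[str, Notifier]]
) -> Optional[str]:
    """Try each (name, notifier) pair in order.

    Returns the name of the first endpoint that accepted the message,
    or None if every endpoint failed.
    """
    for name, send in notifiers:
        try:
            send(message)
            return name
        except Exception:
            # Assumption: any exception means this endpoint is unavailable,
            # so move on to the next one.
            continue
    return None
```

A `None` result means nobody was notified, which is worth treating as an incident of its own.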
Slide 24
Keep alerts relevant with “sprint cleaning”
For every alert triggered, ask:
● Was the notification needed?
● How was the incident resolved?
● Can the solution be automated?
● Is the solution permanent?
● How urgently was a solution needed?
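The review questions above can be recorded per alert and turned into a suggested follow-up. The sketch below is illustrative only: the `AlertReview` fields mirror the five questions, while the `recommend` helper, its decision order, and its wording are assumptions rather than a prescribed process.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    name: str
    notification_needed: bool  # Was the notification needed?
    resolution: str            # How was the incident resolved?
    can_automate: bool         # Can the solution be automated?
    fix_is_permanent: bool     # Is the solution permanent?
    was_urgent: bool           # How urgently was a solution needed?


def recommend(review: AlertReview) -> str:
    """Suggest one follow-up action for an alert (hypothetical ordering)."""
    if not review.notification_needed:
        return "remove or mute the alert"
    if review.can_automate and not review.fix_is_permanent:
        return "automate the fix"
    if not review.was_urgent:
        return "lower the alert priority"
    return "keep the alert as is"
```

For example, an alert that was needed but resolved by a manual restart each time would come back as "automate the fix".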
Slide 25
Summing it up
Slide 26
Next steps
● Logz.io blog - Building Monitors you can Trust: https://logz.io/blog/building-monitors/
● TechBeacon - How to use monitoring for innovation and resilience, not firefighting: https://techbeacon.com/app-dev-testing/how-use-monitoring-innovation-resilience-not-firefighting