A presentation at DevOpsDays New York City 2019 in New York, NY, USA by Quintessence Anx
The ability to monitor infrastructure has been exploding, with new tools on the market and new integrations that let those tools speak to one another. That leads to even more tools and, potentially, to a very loud monitoring environment in which members of the engineering team find themselves muting channels, individual alerts, or even alert sources so they can focus long enough to complete other tasks. There has to be a better way: a way to configure comprehensive alerts that send out notifications with the appropriate level of urgency to the appropriate persons at the appropriate time. And in fact there is. During this talk I’ll be walking through different alert patterns and discussing what we need to know, who needs to know it, and how soon and how often they need to know it.
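As a rough illustration of those questions (what needs to be known, who needs to know it, and how soon), here is a minimal Python sketch of an urgency-based routing table. It is not taken from the talk: the `Alert` and `Route` types, the urgency levels, and the recipients and channels in `ROUTES` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Urgency(Enum):
    HIGH = "high"      # someone must act now, at any hour
    MEDIUM = "medium"  # can wait for business hours
    LOW = "low"        # can wait for the next scheduled review


@dataclass(frozen=True)
class Alert:
    source: str        # what raised the alert
    summary: str       # what needs to be known
    urgency: Urgency
    actionable: bool   # can a human actually do something about it?


@dataclass(frozen=True)
class Route:
    recipient: str     # who needs to know
    channel: str       # how they are notified
    when: str          # how soon they need to know


# Hypothetical routing table: every value here is an assumption for illustration.
ROUTES = {
    Urgency.HIGH: Route("on-call engineer", "page", "immediately"),
    Urgency.MEDIUM: Route("owning team", "ticket", "next business day"),
    Urgency.LOW: Route("owning team", "weekly digest", "next review"),
}


def route(alert: Alert) -> Optional[Route]:
    """Return where an alert should go, or None if nobody should be notified."""
    if not alert.actionable:
        # Non-actionable alerts are noise: keep them in logs, do not notify.
        return None
    return ROUTES[alert.urgency]


outage = Alert("database", "primary is unreachable", Urgency.HIGH, True)
chatter = Alert("cron", "nightly job finished", Urgency.LOW, False)
print(route(outage))
print(route(chatter))
```

Keeping the table separate from the `route` function means the "who" and "how soon" can be revisited during a periodic alert review without touching the logic.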
Here’s what was said about this presentation on social media.
#devopsdays #devopsdaysnyc @QuintessenceAnx: Sensory-Friendly Monitoring: Keeping the Noise Down
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: Keep your alerts relevant. Have a regular cadence to check the alerts: did you need the notification? How was it resolved? Can we automate the resolution?
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: As much as we need to turn down the noise, we also need to think about redundancy. We need to not failover our alerts to a single place without a secondary backup.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: What needs to be known, who needs to know it, how fast do they need to know, and how should they be notified?
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: But what about external sources of noise? When you categorize them, there are some common types: false positives, false negatives, fragility, and frequency. They are not actionable or they are predictable.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: When you’re setting these DND boundaries, be clear about what does rise to the level of an alert-worthy emergency? What can be in the queue? Set reasonable expectations for yourself and others.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: How to manage - Plan for set times to focus on your work and mute non-critical alerts, including from your friends and family.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: How often do you get push notifications? How often do you check your email/slack/twitter/texts.....?
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: Sources of noise may include: logging, alerts, knowledge bases, ticketing system, chat integrations, repetition, humans, and yourself.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: 1) Determine the sources of noise 2) Categorize the types of noise 3) Channel the noise into a productive workflow 4) Create a routine to clear the clutter.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: How do we reduce the noise? Be aware, not overwhelmed.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: What does it cost us to multitask? On a low-cognitive load task, the error rate doubles if you ask someone to have a side conversation.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: Interruptions also reduce quality. Even if you get the time back to reset from interruption, you may still not bring your quality back to the baseline of non-interruption.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: What is the cost of the noise? Some studies say it may cost as much as 25 minutes to reset to a task or state.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: When you turn down the alerts, it makes people nervous. As a species, it’s not great for us to be alarmed by the absence of alarms.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: Trying to keep track of everything makes all your alerts into very loud white noise. And then we start ignoring them, because some of them aren’t relevant.
— Heidi @ home (@wiredferret) January 24, 2019
#devopsdays #devopsdaysnyc @QuintessenceAnx: I used to work at a startup, and I felt like I needed to know everything, because I was the first and only infra engineer. I need to know what is happening anywhere at any time.
— Heidi @ home (@wiredferret) January 24, 2019
Hey it’s time for @QuintessenceAnx! pic.twitter.com/bHZ5pRv8uR
— DevopsDays NYC (@devopsdaysNYC) January 24, 2019
Heading to #DevOpsDaysNYC? Don't miss @QuintessenceAnx's talk at 4:40 to learn some great tips for reducing noise in monitoring. #DevOpsDays pic.twitter.com/JniSJoFWcT
— Logz.io (@logzio) January 24, 2019
Heading to #DevOpsDaysNYC? Don't miss QuintessenceAnx's talk at 4:40 to learn some great tips for reducing noise in monitoring. #DevOpsDays pic.twitter.com/ARpcntsDBu
— Bianca (@biancalewisSA) January 24, 2019
“it’s a little easier in astronomy, that supernova is going to be there tomorrow” - vs your operational metrics that just scrolled by while slack is down... - @QuintessenceAnx @logzio #DevOpsDaysNYC pic.twitter.com/Pr7wRLWUZ7
— Dr. Erik Riedel (@er1p) January 24, 2019
Thank you for a fabulous talk on how to reduce ops noise, @QuintessenceAnx. First-time speaker buddies from @devopsdaysChi reunited! https://t.co/5PoIoQV67n
— Lilia Gutnik (@superlilia) January 25, 2019
for those who want to read the papers, here are the refs - @QuintessenceAnx @logzio - keeping the noise down - #DevOpsDaysNYC pic.twitter.com/1vecFuyTR9
— Dr. Erik Riedel (@er1p) January 24, 2019
Thanks @QuintessenceAnx! This is really solid talk about creating quality, resilient alerting. There’s a whole other talk about how to evaluate alerts too. Awesome! #DevOpsDaysnyc
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Next full-length talk from #devopsdaysNYC is @QuintessenceAnx about operator-friendly monitoring!
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Genuinely loving this talk by @QuintessenceAnx on reducing noise and improving focus. pic.twitter.com/0PWdWmutnG
— Tierney Cyren (@bitandbang) January 24, 2019
As someone with ADHD, I love setting up noise because it helps me being more productive, but too much can 100% destroy my productivity. It’s extremely helpful to hear @QuintessenceAnx’s suggestions on how to reduce noise to *only* what I need.
— Tierney Cyren (@bitandbang) January 24, 2019
Kicking off #DevOpsDays in NYC at @Viacom! I love this event early in the year to remind me there’s great people trying to make tech better: @mattstratton @thelongshanx @tmclaughbos @wiredferret @jaydestro @kmugrage @lizthegrey @QuintessenceAnx @ohaiwalt more I’m sure I missed
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
I would ❤️ to watch one of your talks. Thx for the link.
— donavon {...♥️} (@donavon) January 25, 2019
Clean alerts periodically. Get used to deleting things that you don’t use. Make sure your checks are working and checking the right things. Periodic cleaning keeps things working well. @QuintessenceAnx #DevOpsDaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
@QuintessenceAnx great presentation. Missed the links at the end. Can you tweet them out?
— Boris Berenberg (@imatincr) January 24, 2019
Speaking of being friendly to operators, and reducing stress [after a group stretch break, yay, thank you!]
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
We think we need to integrate with everything and listen to/watch everything at once. #devopsdaysNYC
Everything became very loud white noise. So now important things get lost in the noise and people start muting things. #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
People also in the opposite extreme can think that things are too quiet and get hypervigilant.
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
What can we do that's somewhere in the middle? #devopsdaysNYC
Interruptions and dumping state costs at least 30 minutes of your time per interruption. #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
It's not just time, it's quality as well. Even if you give someone their time back, it doesn't make up for the lost concentration. #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Working longer hours and overtime won't save you. Neither will multitasking.
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
People who think they're superheroes and capable of anything wind up with massive error rates from side conversations. #devopsdaysNYC
You can dig out by categorizing the noise and channeling it into productive places and building proper workflow around it so it's not loud for literally everyone. #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
You can quickly discover that one event in a system can trigger an avalanche of alerts and notifications.
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
But people send email and tweets too! #devopsdaysNYC
Or what about those open floorplans and phones ringing? #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Make sure you set appropriate boundaries and expectations about when you'll be focusing and how you can receive notifications. #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
What about external sources of noise? Figure out what's false positive/negative, and how correlated to actual impact it is. #devopsdaysNYC
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
"Make sure that your alerts only go to dedicated SREs..." [eep. maybe not? specialize who has flow, but don't always make the sre do it]. #devopsdaysnyc
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Look at your dependencies. Slackops only works as long as slack is up. 3 hours is a long time, even if it only happens rarely. #devopsdaysnyc
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Can you tell the difference between your service being down and it just having quiet activity? let's reduce our stress and focus on the labor of love that is our engineering jobs. #devopsdaysnyc
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Make sure that you do spring cleaning of your alerts. was it needed? was the solution automated/made permanent? was it actually urgent? [ed: we have a name for this at Google: actionable, urgent, repeated, and user-impacting] [fin] #devopsdaysnyc
— Liz Fong-Jones (方禮真) (@lizthegrey) January 24, 2019
Excited for “Sensory Friendly Monitoring” from @QuintessenceAnx at #devopsdaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
It’s easy to fall into the trap of wanting to know everything about everything. We generate too much noise and lose important data. When we dial it back it becomes suspicious, how do we find the medium? @QuintessenceAnx #devopsdaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Interruptions cost (avg) 25 minutes. You lose the state not just the time when interrupted. It takes time to get back into work. Interruptions add up quickly. @QuintessenceAnx #devopsdaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Further interruptions affect quality. Even when time was given back from an interruption, quality doesn’t necessarily improve. @QuintessenceAnx #devopsdaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
So can we do multitasking to make up for it? No. This increases error rates. [Truth! Multitasking is a myth. More so for some than others.] @QuintessenceAnx #devopsdaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
We want to be AWARE of the noise not OVERWHELMED by it.
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Where does it come from?
Categorize the types.
Channel it into a productive workflow.
Create a routine to clear the clutter. @QuintessenceAnx #DevOpsDaysnyc
Some noise is obvious some is self generated, email? Texts? Twitter?
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Set communication boundaries so people know when you might not be available for interruptions. Let people know what is considered an emergency. @QuintessenceAnx #DevOpsDaysNYC
Make sure noise is categorized.
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
False positives
False negatives
Fragility
Frequency @QuintessenceAnx #DevOpsDaysNYC
Make sure the noise is flowing properly, humans shouldn’t act as message carriers here. What needs to be known? Who needs to know? How soon do they need to know (immediately? During business or 24hr?) Don’t wake people up if we don’t have to. @QuintessenceAnx #DevOpsDaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Re-evaluate redundancies. It may add a little complexity but it can prevent a signal vacuum. [prepare for failure!] If slack goes down, what happens to notifications? @QuintessenceAnx #DevOpsDaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
Resilient noise builds trust. If we’re used to noisy alerts or aren’t sure that our alternate notification methods work, how can we trust silence? Evaluate your message resiliency periodically. @QuintessenceAnx #DevOpsDaysNYC
— aaron aldrich @ nowhere/everywhere (@crayzeigh) January 24, 2019
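The redundancy point raised in the notes above (if Slack goes down, what happens to notifications?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something shown in the talk: `notify_with_fallback` and the stub senders `send_chat` and `send_sms` are hypothetical, and a real setup would call an actual chat or paging service instead.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A sender takes a message and raises an exception if delivery fails.
Sender = Callable[[str], None]


def notify_with_fallback(
    message: str, channels: Iterable[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each channel in order; return the name of the first that succeeds.

    Returns None if every channel failed, which the caller should treat as
    an incident in its own right (a signal vacuum).
    """
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            # This channel is unavailable; fall through to the next one.
            continue
    return None


# Hypothetical stub senders standing in for real integrations.
delivered: List[str] = []


def send_chat(message: str) -> None:
    raise ConnectionError("chat service is down")


def send_sms(message: str) -> None:
    delivered.append(message)


used = notify_with_fallback(
    "disk full on db-1", [("chat", send_chat), ("sms", send_sms)]
)
print(used, delivered)
```

Running the same function periodically with a test message is one way to check that the secondary channel still works before it is needed.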