A presentation at PREVAIL by Holly Cummins
SRE: the good, the bad, and the ouch PREVAIL 2021 - An IBM Academy of Technology Conference © 2021 IBM Corporation
Speaker: Holly Cummins, Innovation Leader, IBM SPEED; IBM Garage alum; @holly_cummins
PREVAIL Technical Conference 2021 what is SRE? @holly_cummins
SRE: what ops would be like if it was done by software engineers
why SRE?
reliability is very important
old ops: manual, repetitive, siloed, not aligned to business goals, unable to handle complexity of cloud native
eliminate repetitive tasks
eliminate toil
aligned incentives
failure is a symptom, not a cause
devops?
SRE and DevOps: automate everything
SRE and DevOps: holistic + collaborative
what could possibly go wrong?
true story: the cunning rebrand. “we’re SRE now”
SRE: what ops would be like if it was done by software engineers [pictured: ops, software engineer]
SRE: what ops would be like if it was done by ops? [pictured: ops]
true story: the cunning rebrand. “we are just as good: we have scripts”
what triggers the scripts?
how much contact do SRE have with dev?
are there any SRE NFRs in the dev backlog?
true story: the cunning rebrand. “we do SRE … in silos”
I am not designed for this.
two war rooms
team mainframe: we’re responsible for stability of the mainframe; team mobile: we’re responsible for stability of the front end
team mainframe: we’re responsible for stability of the mainframe … as long as it’s used correctly; team mobile: we’re responsible for stability of the front end; the ambassador
true story: dots aren’t connected. “we have a ticket per team, not per incident”
“we want to do SRE but we don’t have enough permissions on our systems”
“the DBAs don’t trust us”
“it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”
silos cost
true story: the gap between intent and reality. “we do post-mortems after every incident … maybe”
measure the number of incidents. measure the number of post-mortems. see if they match.
advanced metrics: how many people were in the post-mortem? does it include more than the people directly involved? did we invite more than our own team?
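The checks on these slides (do incidents and post-mortems match, and how broad was each review) can be sketched in a few lines. This is a minimal illustration, not a tool from the talk: the `PostMortem` record, its field names, and the incident IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostMortem:
    # Hypothetical record shape; real data would come from a ticketing system.
    incident_id: str
    attendees: frozenset   # everyone who joined the review
    responders: frozenset  # people directly involved in the incident
    teams: frozenset       # teams represented among the attendees


def missing_postmortems(incident_ids, postmortems):
    """Incidents that never got a post-mortem, sorted for stable output."""
    reviewed = {pm.incident_id for pm in postmortems}
    return sorted(set(incident_ids) - reviewed)


def coverage(incident_ids, postmortems):
    """Fraction of distinct incidents that have a post-mortem."""
    unique = set(incident_ids)
    if not unique:
        return 1.0
    return 1 - len(missing_postmortems(unique, postmortems)) / len(unique)


def breadth(pm, home_team):
    """The three 'advanced metrics' questions for one post-mortem."""
    return {
        "attendees": len(pm.attendees),
        "beyond_responders": bool(pm.attendees - pm.responders),
        "beyond_own_team": bool(pm.teams - {home_team}),
    }


# Made-up example data.
incidents = ["INC-1", "INC-2", "INC-3", "INC-4"]
reviews = [
    PostMortem("INC-1", frozenset({"ana", "raj", "li"}),
               frozenset({"ana", "raj"}), frozenset({"payments", "platform"})),
    PostMortem("INC-3", frozenset({"ana"}),
               frozenset({"ana"}), frozenset({"payments"})),
]

print(coverage(incidents, reviews))             # share of incidents reviewed
print(missing_postmortems(incidents, reviews))  # incidents with no review
print(breadth(reviews[0], home_team="payments"))
```

A coverage value below 1.0 is the "maybe" in "we do post-mortems after every incident … maybe"; the breadth fields show whether a review reached anyone outside the responders or the home team.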
true story: “no one says anything in our blameless post-mortems”
‘blameless’ post-mortem
if involvement in an incident is punished, people will avoid engaging with systems
“great idea, go build that!” if ideas are punished with extra work, people will try not to have ideas
true story: the perverse incentive. “we have success metrics”
metrics are good
SREs are data-driven
but …
as senior leaders, be careful what you incentivise
be careful what behaviours you discourage
true story: the perverse incentive. “we count how many incidents we have; if the number goes down, it means we are working better”
outstanding quality!
delivery excellence!
fewer people working → fewer incidents
new release → more incidents
what should you measure?
make work visible
true story: the email timesink. “we never seem to complete the work we planned”
1 sprint, theory: 50% unplanned work (tickets), 50% story points
1 sprint, reality: 50% tickets, 10% story points, 40% ??
“can you just …” “how do I do this?” “where is this documented?”
this wasn’t a team failure. it was a data quality issue. it was a process issue.
track work. use the data to eliminate toil
measure blockers
mean time to failure? mean time to detect problems?
what is failure in a complex system? if a system goes down but user experience is fine, does that count?
measure “what have I learned”. measure “have I made sure it won’t happen again”
true client story: value on the shelf. “we can’t actually release this.”
what’s stopping more frequent deploys?
“it costs too much to release”: you can fix that
“we can’t ship until we have more confidence in the quality”: you can fix that
deferred wiring
feature flags
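A feature flag lets code ship to production while the new behaviour stays switched off, which separates deploying from releasing. The sketch below is one minimal way to do it, assuming flags are read from environment variables named `FEATURE_<NAME>`; that naming scheme and the `checkout` functions are invented for illustration and are not from the talk.

```python
import os


def flag_enabled(name, env=None):
    """Return True if the (hypothetical) FEATURE_<NAME> variable is switched on."""
    env = os.environ if env is None else env
    value = env.get(f"FEATURE_{name.upper()}", "off")
    return value.strip().lower() in {"1", "true", "on"}


def legacy_checkout(total):
    return f"legacy checkout: {total:.2f}"


def new_checkout(total):
    return f"new checkout: {total:.2f}"


def checkout(total, env=None):
    # The new path is deployed but dormant until the flag is flipped.
    if flag_enabled("new_checkout", env):
        return new_checkout(total)
    return legacy_checkout(total)


print(checkout(10, env={}))                              # flag absent: old path
print(checkout(10, env={"FEATURE_NEW_CHECKOUT": "on"}))  # flag on: new path
```

Because the default is "off", an unfinished feature can be merged and deployed without being visible, and switching it back off is a configuration change rather than a rollback.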
true client story: the monolithic microservices. “we can’t release this microservice… we deploy all our microservices at the same time… because otherwise nothing works.”
let’s talk about microservices
true client story: the peril of microservices. “every time we change code, something breaks”
just because a system runs across 6 containers doesn’t mean it’s decoupled
mars climate orbiter
[image courtesy NASA/JPL-Caltech]
metric units, imperial units: distributing did not help
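The unit mismatch in the Mars example is a coupling problem at an interface: one side produced impulse values in pound-force seconds while the other side read them as newton-seconds. One common defence, sketched here under assumed names (`Impulse` and `apply_trajectory_correction` are hypothetical, not the mission's software), is to pass a value that carries its unit instead of a bare number.

```python
from dataclasses import dataclass

# 1 lbf is defined as exactly 4.4482216152605 N.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored in one canonical unit (newton-seconds)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value):
        return cls(float(value))

    @classmethod
    def from_pound_force_seconds(cls, value):
        # Conversion happens once, at the boundary where the number enters.
        return cls(float(value) * LBF_S_TO_N_S)


def apply_trajectory_correction(impulse):
    """Hypothetical consumer: refuses bare numbers whose unit is unknown."""
    if not isinstance(impulse, Impulse):
        raise TypeError("expected an Impulse, got a bare number with no unit")
    return impulse.newton_seconds


metric = Impulse.from_newton_seconds(10)
imperial = Impulse.from_pound_force_seconds(10)
print(apply_trajectory_correction(metric))
print(apply_trajectory_correction(imperial))
```

Splitting the producer and consumer into separate services does nothing to remove this kind of hidden contract; making the unit part of the type at least turns the mismatch into an immediate error.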
testing
Cluster + Ariane 5: $370 million loss https://en.wikipedia.org/wiki/Cluster_(spacecraft)
they tested it … but stubbed out one component. that component was the one that broke.
the ariane failed in 36 seconds. you can’t a/b test a $370 million rocket
testing will always be incomplete. aim for recoverability
resilience, recoverability
observability
they often couldn’t see the orbiter
feedback is good engineering
when SRE is right it is great
bank
remember this bank? team mainframe: we’re responsible for stability of the mainframe … as long as it’s used correctly; team mobile: we’re responsible for stability of the front end; the ambassador
one team: web front-end, back-end, another department …
one team: mobile front-end, web front-end, back-end, another department …
one team: CI/CD pipelines, canary deploys; CI/CD pipelines, big-bang deploys onto AIX. one team, range of techniques
by the way …
big bang deploys: 50% failure rate. canary deploys: 10% failure rate
industrial
remember the suspicious DBAs?
two root problems:
• automation
• trust and transparency
trigger automation via slack
because it was transparent, DBAs were happy and automated more things
what happens when things go wrong?
leadership need to provide a safety net.
celebrate success, celebrate failure
celebrate success, celebrate learning
transformation endurance
remember the why
better, safer, faster, happier
The information in this presentation is representative of the presenter and their views and opinions are not necessarily those of IBM or of the IBM Academy of Technology.
SRE sounds like a plan with no drawbacks … but making it work in practice can be trickier than the theory says. This talk shares stories of SRE wins and SRE accidents.