PREVAIL Technical Conference 2021
reliability is very important
@holly_cummins
Slide 7
old ops
Slide 8
old ops: manual
Slide 9
old ops: manual, repetitive
Slide 10
old ops: manual, repetitive, siloed
Slide 11
old ops: manual, repetitive, siloed, not aligned to business goals
Slide 12
old ops: manual, repetitive, siloed, not aligned to business goals, unable to handle complexity of cloud native
what could possibly go wrong?
Slide 21
true story
the cunning rebrand
“we’re SRE now”
IBM Garage
Slide 23
SRE: what ops would be like if it was done by software engineers
(image label: software engineer)
Slide 25
SRE: what ops would be like if it was done by software engineers
(image labels: ops, software engineer)
Slide 26
SRE: what ops would be like if it was done by ops?
(image label: ops)
Slide 28
true story
the cunning rebrand
“we are just as good: we have scripts”
IBM Garage
Slide 29
what triggers the scripts?
Slide 30
how much contact do SRE have with dev?
SRE
dev
Slide 32
are there any SRE NFRs in the dev backlog?
SRE
dev
Slide 33
true story
the cunning rebrand
“we do SRE … in silos”
IBM Garage
Slide 34
Slide 35
Slide 36
Slide 37
Slide 38
I am not designed for this.
Slide 39
two war rooms
Slide 40
team mainframe
team mobile
Slide 41
we’re responsible for stability of the front end
we’re responsible for stability of the mainframe
team mainframe
team mobile
Slide 42
we’re responsible for stability of the mainframe … as long as it’s used correctly
we’re responsible for stability of the front end
the ambassador
team mainframe
team mobile
Slide 43
true story
dots aren’t connected
“we have a ticket per team, not per incident”
IBM Garage
Slide 44
“we want to do SRE but we don’t have enough permissions on our systems”
Slide 45
“the DBAs don’t trust us”
Slide 46
“it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”
advanced metrics: how many people were in the post-mortem?
Slide 57
advanced metrics: how many people were in the post-mortem?
does it include more than the people directly involved?
Slide 58
advanced metrics: how many people were in the post-mortem?
does it include more than the people directly involved?
did we invite more than our own team?
Slide 59
true story
“no one says anything in our blameless post-mortems”
IBM Garage
if involvement in an incident is punished, people will avoid engaging with systems
Slide 62
“great idea, go build that!”
if ideas are punished with extra work, people will try not to have ideas
Slide 63
true story
the perverse incentive
“we have success metrics”
IBM Garage
Slide 64
metrics are good
Slide 65
SREs are data-driven
Slide 66
but …
Slide 67
as senior leaders, be careful what you incentivise
Slide 68
be careful what behaviours you discourage
Slide 69
true story
the perverse incentive
“we count how many incidents we have; if the number goes down, it means we are working better”
IBM Garage
mean time to failure? mean time to detect problems?
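As a rough illustration of the two metrics named on this slide, the sketch below computes mean time to detect (the average gap between a fault starting and someone noticing it) and mean time to failure (the average uptime between one recovery and the next fault). The `Incident` record, its field names, and the sample timestamps are all hypothetical; none of them come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical incident record; the field names are assumptions.
    started: datetime   # when the fault actually began
    detected: datetime  # when someone (or something) noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents):
    """Average delay between a fault starting and being noticed."""
    gaps = [i.detected - i.started for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)


def mean_time_to_failure(incidents):
    """Average uptime between one recovery and the start of the next fault."""
    ordered = sorted(incidents, key=lambda i: i.started)
    uptimes = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    return sum(uptimes, timedelta()) / len(uptimes)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 9, 20), datetime(2021, 1, 1, 10, 0)),
    Incident(datetime(2021, 1, 3, 10, 0), datetime(2021, 1, 3, 10, 10), datetime(2021, 1, 3, 11, 0)),
    Incident(datetime(2021, 1, 6, 11, 0), datetime(2021, 1, 6, 11, 30), datetime(2021, 1, 6, 12, 0)),
]

print(mean_time_to_detect(incidents))   # 0:20:00
print(mean_time_to_failure(incidents))  # 2 days, 12:00:00
```

Neither number says whether users noticed anything, so a figure like this is only a starting point for the question of what counts as failure.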
Slide 98
what is failure in a complex system?
if a system goes down but user experience is fine, does that count?
Slide 99
measure “what have I learned”
measure “have I made sure it won’t happen again”
Slide 100
true client story
value on the shelf
“we can’t actually release this.”
IBM Garage
Slide 101
Slide 102
what’s stopping more frequent deploys?
Slide 103
“it costs too much to release”
Slide 104
“it costs too much to release”
you can fix that
Slide 105
“we can’t ship until we have more confidence in the quality”
Slide 106
“we can’t ship until we have more confidence in the quality”
you can fix that
true client story
the monolithic microservices
“we can’t release this microservice… we deploy all our microservices at the same time… because otherwise nothing works.”
IBM Garage
Slide 110
let’s talk about microservices
Slide 111
true client story
the peril of microservices
“every time we change code, something breaks”
IBM Garage
Slide 112
just because a system runs across 6 containers doesn’t mean it’s decoupled
Slide 113
Slide 114
Mars Climate Orbiter
when SRE is right it is great
Slide 136
bank
Slide 137
remember this bank?
team mainframe
team mobile
Slide 138
remember this bank?
we’re responsible for stability of the front end
team mainframe
team mobile
Slide 139
remember this bank?
we’re responsible for stability of the mainframe … as long as it’s used correctly
we’re responsible for stability of the front end
the ambassador
team mainframe
team mobile
Slide 140
one team
web front-end
back-end
another department …
Slide 141
one team
mobile front-end
web front-end
back-end
another department …
Slide 142
one team
CI/CD pipelines
canary deploys
CI/CD pipelines
big-bang deploys onto AIX
one team, range of techniques
Slide 143
by the way …
Slide 144
big bang deploys
Slide 145
big bang deploys: 50% failure rate
Slide 146
big bang deploys: 50% failure rate
canary deploys
Slide 147
big bang deploys: 50% failure rate
canary deploys: 10% failure rate
Slide 148
industrial
Slide 149
remember the suspicious DBAs?
#IBMGarage
Slide 150
two root problems:
• automation
• trust and transparency
Slide 151
trigger automation via slack
Slide 152
because it was transparent, DBAs were happy and automated more things
Slide 153
what happens when things go wrong?
Slide 154
Slide 155
leadership need to provide a safety net.