A presentation by Euan Finlay at DevOpsDays London (London, UK) in September 2018
Don't Panic! How to Cope Now You're Responsible for Production Euan Finlay @efinlay24
https://theantimedia.com/bbc-accidentally-reports-the-death-of-queen-elizabeth-ii/
https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12
https://twitter.com/Ludo_Dufour/status/672401158653218816
https://www.ft.com/content/c01f37ec-99c1-11e5-987b-d6cdef1b205c
/usr/bin/whoami @efinlay24
/usr/bin/whodoiworkfor No such file or directory.
https://www.ft.com
You've just been told you're on call. (and you're mildly terrified)
Obligatory audience interaction.
Everyone feels the same way at the start. (I still do today)
How do you get to the point where you're more comfortable?
A tenuous link to A Christmas Carol.
The Ghosts of Incidents... > Future Present Past
The Ghost of Incidents Future
Handling incidents is the same as any other skill.
Get comfortable with your alerts. (and bin the rubbish ones)
Have a plan for when things break.
Keep your documentation up to date.
Practice regularly.
The One Where We Decommissioned All Our Production Servers
Break things, and see what happens. Did your systems do what you expected?
The Planned Datacenter Disconnect
We got complacent, and stopped running datacenter failure tests...
The Unplanned Datacenter Disconnect
Have a central place for reporting changes and problems.
The Unplanned Datacenter Disconnect (Part II)
We should have followed our own advice.
We're not perfect. (but we always try to improve)
The Ghosts of Incidents... Future > Present Past
The Ghost of Incidents Present
Calm down, take a deep breath: it's (probably) ok.
Don't dive straight in. Go back to first principles.
What's the actual impact?
"All incidents are equal, but some incidents are more equal than others." George Orwell, probably
What's already been tried?
Is there definitely a problem?
What's the minimum viable solution?
Get it running before you get it fixed.
Check the basics first.
Don't be afraid to call for help.
The One Where a Manager Falls Through the Ceiling
The One Where a Director Falls Through the Ceiling
Communication is key. Especially to your customers.
Put someone in charge.
Create a temporary incident channel.
If you think you're over-communicating, it's probably just the right amount.
Tired people don't think good.
The One Where We Had to Serve Traffic from Staging
It wasn't great, but it wasn't the end of the world.
The Ghosts of Incidents... Future Present > Past
The Ghost of Incidents Past
Congratulations! You survived. It probably wasn't that bad, was it?
Run a post-mortem with everyone involved.
Incident reports are important.
Prioritise follow-up actions.
https://blog.travis-ci.com/2018-04-03-incident-post-mortem
Identify what can be done better next time.
The One with the Badly Named Database
Don't name your pre-production database 'pprod'. Seriously, who does that?
https://twitter.com/iamdevloper/status/1040171187601633280
Nearly the end. (don't clap yet)
Problems will always happen. (and that's ok)
The end. (please clap)
@efinlay24 euan.finlay@ft.com We're hiring! https://ft.com/dev/null https://aboutus.ft.com/en-gb/careers/ Image links: https://goo.gl/3DeojV