Don't Panic!

A presentation at Velocity in November 2018 in London, UK by Euan Finlay

Slide 1

Don't Panic! How to Cope Now You're Responsible for Production. Euan Finlay, @efinlay24

Hi! Thanks for the introduction. Despite this being a technical conference, the scariest production incident I've been part of in my four years at the Financial Times wasn't actually caused by anything in our technology stack.

Slide 2

Back in 2015, the European Central Bank was preparing to make an important announcement, with everyone across the financial industry expecting interest rates to be cut. When major events like this take place, it's standard practice across the media to prepare articles for the different outcomes, so that we can be ready to publish without needing to write an article from scratch.

Slide 3

https://theantimedia.com/bbc-accidentally-reports-the-death-of-queen-elizabeth-ii/

Sometimes mistakes happen, and the wrong information gets published - for example, when the BBC incorrectly announced the Queen's death on Twitter. In our case, we were updating a draft article about the ECB holding interest rates steady, which was the opposite of what everyone expected to happen.

Slide 4

https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12

However, instead of updating the draft, the wrong article was accidentally published live, 10 minutes ahead of the announcement embargo.

Slide 5

https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12

To make things worse, our systems then automatically published the associated tweet.

Slide 6

https://twitter.com/Ludo_Dufour/status/672401158653218816

Due to the high level of trust that people place in our reporting, everyone who read the article thought that the announcement was surprising, but factually accurate. The incorrect article caused the exchange rate to spike by 0.4%, which was a significant shift in the market. This was a huge issue for the FT, with some serious implications for us. We immediately removed the incorrect article, then published the real announcement alongside a correction.

Slide 7

https://www.ft.com/content/c01f37ec-99c1-11e5-987b-d6cdef1b205c

After that, the exchange rate dropped back to normal levels. While there wasn't any long-term damage to the markets, this had a serious impact on our brand and our reputation, and there were some extremely stressed people across the Editorial, Technology and Communications teams that morning. It's quite scary to think that a mistake like this can have a serious real-world impact.

Slide 8

However, the incident was handled well by the teams involved, and their quick actions went a long way towards mitigating damage to the business. In response, we made a number of improvements to the technical side of publishing, as well as to our processes, and we put measures in place to make it much harder to accidentally publish a draft article. We also had problems with the false article being cached in various places after we'd taken it down - these days, the legacy services that used to run our front-end website have been replaced, so that's less of a worry for us.

Slide 9

/usr/bin/whoami @efinlay24 #VelocityConf

I'm a Senior Integration Engineer at the Financial Times. I originally started off in desktop support, then spent some time as a Linux sysadmin. Currently I lead a team at the FT who support and maintain our backend Content APIs, working with k8s, Docker and Go microservices.

Slide 10

/usr/bin/whodoiworkfor No such file or directory. @efinlay24 #VelocityConf

Although we're most famous for the newspaper, we're primarily a digital content company. Last year was the first time revenue from our online content overtook the physical paper and our advertising.

Slide 11

https://www.ft.com

Our content and our website are absolutely critical to our survival - we invest heavily in technology, and we've got many teams working on different areas of the business. At the FT, we're big believers in DevOps practices and empowered engineering teams, trusting them to make the best decisions around technology, architecture, and delivery. As part of that, our teams fully own, run and support their services, from the very beginning to the end of the product lifecycle.

Slide 12

We're hiring! https://ft.com/dev/null @efinlay24 #VelocityConf

We currently have some open positions, so if you enjoy my talk, please take a look, and feel free to come and chat to me afterwards.

Slide 13

You've just been told you're on call. And you're mildly terrified. @efinlay24 #VelocityConf

Maybe you're part of a team that's been running for a while: you're proud of the services that your team have built and deployed, but it's still intimidating being told that you're now responsible for dealing with production issues. Or maybe you're a developer or engineer who's recently moved to a new company: you've had time to settle in and you're familiar with your services, but now you're being asked to join the on-call rota.

Slide 14

Obligatory audience interaction. @efinlay24 #VelocityConf

Hands up if you've been on call. Hands up if you've never had to support production. Hands up if you don't like putting up your hand in the middle of conference talks.

I remember what it felt like the first time I was called out:
- it was terrifying
- I was asked to fix a service I knew nothing about
- I couldn't find the documentation
- I thought it was the worst thing in the world
- I thought about quitting and becoming a llama farmer

Slide 15

Everyone feels the same at the start. I still do today. @efinlay24 #VelocityConf

It's something like imposter syndrome: no matter how good you are, I suspect everyone feels something similar the first time they get called out or start handling production incidents. Even now, with experience, I still get a twinge of fear whenever my phone goes off in the middle of the night. What if it's something I can't fix? What if I'm not good enough?

Slide 16

How do you get more comfortable? @efinlay24 #VelocityConf

So how DO you get to the point where you're comfortable on call? The idea behind this talk was to think about what tips and advice I could give, so that you're not dreading that phone call or that Slack message of "everything is broken!" When writing this talk, I was told it helps to have a bit of a theme. So I thought to myself: who else is quite grumpy (like lots of sysadmins), and who else gets woken up at 2am? The answer was... CLICK

Slide 17

A tenuous link to A Christmas Carol.

Scrooge, from A Christmas Carol - which actually leads nicely into a structure for a talk about dealing with production incidents, because, much like A Christmas Carol, there are:

Slide 18

The Ghosts of Incidents... > Future

There are preparations we can make right now, to help us the next time we get called out or have major problems.

Slide 19

The Ghosts of Incidents... Future > Present

There are actions we should take when something actually breaks.

Slide 20

The Ghosts of Incidents... Future Present > Past

And there are tasks we should do after an incident, to improve things for next time.

Slide 21

The Ghost of Incidents Future

So, knowing that things will go wrong at some point, what can we do to plan ahead?

Slide 22

Handling incidents is the same as any other skill. @efinlay24 #VelocityConf

Handling incidents IS the same as any other skill: it can be learned, and taught, and practiced. If the first time you try to do this is without any training, with no plan of action, after a phone call at 2am, it's not going to go very well. If you're familiar with dealing with your alerts, and with what can go wrong, you'll be a lot more relaxed when you get called out.

Slide 23

Get comfortable with your alerts. @efinlay24 #VelocityConf

Get people on your team to rotate through support regularly, in hours, when everyone else is available to help and support them. It'll get everyone familiar with what can go wrong, the alerts that go off, and the monitoring tools - and does everyone have access? You never want that moment of "hmm, I can't access the documentation" during an emergency...

Slide 24

Bin the alerts you don't care about. @efinlay24 #VelocityConf

On a similar theme, alert noise and overload are bad. Every alert should be actionable - otherwise you may lose real issues in the noise, or conversely, you may be called out for no reason. I work with Sarah Wells, who has done an excellent talk on alert overload, which I definitely recommend watching.
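
The talk doesn't name a monitoring stack, but as a hedged sketch of what "every alert should be actionable" can look like in practice, here's a Prometheus-style rule; the job name, threshold, and runbook link are invented for illustration. The point is that it only pages when the problem is real and sustained, and it points straight at a runbook.

```yaml
# Illustrative Prometheus-style rule - not the FT's; names and numbers are placeholders.
groups:
  - name: content-api-alerts
    rules:
      - alert: ContentAPIHighErrorRate
        # Only fire when the error rate is genuinely bad, and has been for a
        # while - so every page is one a human actually needs to act on.
        expr: |
          sum(rate(http_requests_total{job="content-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="content-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Content API 5xx rate above 5% for 10 minutes"
          runbook: "https://example.com/runbooks/content-api"  # placeholder link
```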

Slide 25

Have a plan for when things break. @efinlay24 #VelocityConf

Depending on where you work, or what you work on, an incident response plan might look very different. Maybe your company is large enough to have a first-line team that escalates the problems they can't fix themselves up to you, and your response plan is very formal and defined across the whole business. Alternatively, if you're in a small start-up, it might just be a handful of you getting called out by PagerDuty or an automated equivalent - in that case, you may just need a quick verbal discussion with your team, or some guidelines jotted down in a Google document somewhere. Either way, you don't want to be wondering what you're meant to be doing when alerts start going off.

Slide 26

Keep your documentation up to date. @efinlay24 #VelocityConf

Nobody likes writing documentation, but you need to have information on what to do when services break. What a service does, where it lives, and how it works is a good start. Service panic guides or runbooks are important as well - these should contain solutions for common problems that might occur. Write them as though it's 2am and you've just been woken up: you only want the essentials to get things fixed and up and running again. It's awful trying to fix a system that has no documentation, when the only person who knew about it has left.
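
One possible shape for a panic guide, sketched as structured YAML purely for illustration - the service name, links and fixes below are all made up, but the headings follow the advice above: what it does, where it lives, and the handful of known fixes you'd want at 2am.

```yaml
# Hypothetical panic-guide skeleton - every value here is a placeholder.
service: content-api
what_it_does: "Serves published article content to the website and syndication partners"
where_it_lives: "Kubernetes clusters prod-eu and prod-us, namespace 'content'"
dashboards:
  - https://example.com/grafana/content-api
logs: "Kibana query: app:content-api"
common_failures:
  - symptom: "Healthcheck failing, 5xx responses from /content/{id}"
    try_first: "Check for a recent deployment and roll it back if one correlates"
  - symptom: "High latency, CPU pegged in one region"
    try_first: "Fail traffic over to the healthy region, then investigate"
escalation: "Post in the incident channel; page the next person on the rota if stuck"
```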

Slide 27

Practice regularly. @efinlay24 #VelocityConf

Once you have those guides, run through them on a regular basis to make sure they still work, and that people are familiar with them. We did this in fairly spectacular fashion one day, when we performed an unscheduled test of our disaster recovery procedures. We were provisioning a new production cluster using Ansible, and we ran our playbook, which should have created 5 new instances for us. It turns out that if you're not careful with the way you write your playbooks, and say "give me 5 instances", Ansible ensures that you have 5 instances in total across all of your production clusters.
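
I don't know what the FT's playbook actually looked like, but the gotcha described matches the `exact_count` behaviour of Ansible's classic `ec2` module. A sketch of the trap, with placeholder AMI and tags:

```yaml
# Illustrative only - not the real playbook. With `exact_count` and a count_tag
# that matches every production box, the ec2 module terminates instances until
# only 5 matching ones are left - very different from "launch 5 more".
- name: Provision cluster nodes
  ec2:
    image: ami-0123456789abcdef0      # placeholder AMI
    instance_type: t2.medium
    instance_tags:
      environment: production
      cluster: new-cluster
    exact_count: 5                    # "give me 5 instances"...
    count_tag:
      environment: production         # ...counted across everything with this tag
```

One safer pattern is to make the count_tag unique to the new cluster, so the count can't sweep up unrelated instances.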

Slide 28

The one where we decommissioned all our production servers

The first we knew of this was when all of our alerts went off, and I heard a very quiet "oh no" from my friend sat next to me. Was it a problem? Absolutely - we had a brief outage while we worked out what was going on.

Slide 29

DRINK

But we were back up and running fairly quickly - we failed over to our backup cluster, and as we already had a guide on how to spin everything back up from scratch, it wasn't too long before we were back to normal. We definitely wouldn't have wanted to find out that our DR guide didn't work at that point in time...

Slide 30

Break things, and see what happens. Did your systems do what you expected? @efinlay24 #VelocityConf

As an extension of that, actively break things, and check that your services behave the way you expect! Do your recovery guides still work? Does your alerting and monitoring work correctly? Crystal Hirschorn talked about this in more depth in her keynote this morning. Chaos Monkey is probably the most well-known example, but you can do this manually in a controlled way, too. Run DR tests to make sure you iron out any bugs in your processes.

Slide 31

The Planned Datacenter Disconnect

We ran a planned DR test - disconnecting the network to one of our datacenters, to make sure we could recover if it happened in a real-world situation. We pulled the plug, and the Ops dashboard lit up with red lights, just as we expected. We then moved to step 2: failing over our systems to the healthy datacenter. It was at that point we found that our failover service didn't work if one of the datacenters was unreachable... We fixed that up fairly quickly :)

Slide 32

We got complacent, and stopped running datacenter failure tests... @efinlay24 #VelocityConf

However, the last time we ran one of those tests was about 2 years ago - maybe a bit more. For context, we've been moving away from our physical DCs for the last few years, as part of a big cost-reduction push. We've migrated a lot of our legacy services to the cloud, replaced them with externally hosted solutions, or decommissioned them altogether. However, there are still a number of important services that run out of our 2 DCs. You can probably guess what the next slide is going to be...

Slide 33

The Unplanned Datacenter Disconnect

Because recently, we had a live test of exactly what happens when one of our DCs drops off the network. Naturally, it happened on a weekend, in the middle of August, when lots of people were on holiday without access to their laptops or the internet. As you can imagine, two years on from the last test, a lot of the info we had was out of date, and it turned out that there were a lot more important services depending on those datacenters than we expected.

Slide 34

Have a central place for reporting changes and problems. @efinlay24 #VelocityConf

One of the useful things that we have at the FT is a chat channel that anyone can join. We use this channel to communicate changes that are in progress, and to report potential issues or problems that are happening. As you can imagine, this channel got VERY busy during the outage, with lots of reports and issues from across the business.

Slide 35

Failover of these legacy services isn't simple or automated, either - it's a fiddly manual process involving DNS updates, configuration changes, and service restarts. Since it had been so long since we'd last done it, we had to hope that the people with the knowledge were available, and that they could remember what the failover steps were. We did get everything stabilised, though, without any major customer impact. Later, we found out that the cause of the problems was another customer decommissioning their network equipment - in the process of doing that, they cut the fiber connection that provided internet access to all of our servers… There's no way we could have predicted that happening. It took roughly 5 days for network connectivity to be fully restored, and for all of our services to be back to high availability and running out of both DCs.

Slide 36

The Unplanned Datacenter Disconnect II: The Network Strikes Back

Which was really fortunate, because a day later, we had a core network switch fail in our OTHER datacenter. So we had to repeat the same exercise, but in the other direction… We were definitely a bit more practiced the second time around.

Slide 37

We should have followed our own advice. @efinlay24 #VelocityConf

If we'd practiced regularly and tested the failover process, we'd have been a lot quicker, and we'd have had a much better understanding of the impact and what needed to happen. But we still recovered without any major business impact, and we learnt a lot about the remaining legacy services that are important to us. We also uncovered a whole set of issues that we didn't know about, such as:
- an HA service that wasn't able to connect to one DC at all
- one of our monitoring tools really doesn't like it when it can't reliably connect to its monitoring target
- some of our legacy services run out of memory when all traffic is routed to a single location

Slide 38

We're not perfect. But we always try to improve. @efinlay24 #VelocityConf

All of those issues are now fixed, or in the process of being fixed. Most importantly, we've learnt from our experience, and used it to improve our legacy failover documentation and processes.

Slide 39

The Ghosts of Incidents... Future > Present Past

Those are some ideas that you can go away, think about, and get started on before something goes wrong.

Slide 40

The Ghost of Incidents Present

But something's happened - alerts have gone off, and you've been called, or been asked to investigate. What are the first steps that you should take?

Slide 41

Calm down, and take a deep breath: it's probably ok. @efinlay24 #VelocityConf

Take a deep breath. Dealing with incidents is stressful, but do what you can to remind yourself that it's not the end of the world. For most of us, if our website goes down or a service fails, it's not completely catastrophic in the grand scheme of things. Unless you work at a nuclear power plant, in which case you should probably be a bit more worried.

Slide 42

Don't dive straight in. Go back to first principles. @efinlay24 #VelocityConf

It's always tempting to immediately jump in and start trying to solve the problem. Go back to basics first - treat it the same as anything else you do, and get as much information as possible before you start. Generally speaking, no matter what the problem is, there's a certain set of questions that I'll always ask myself before digging into it further.

Slide 43

What's the actual impact? @efinlay24 #VelocityConf

For example, for my team at the FT - we're a content company, so our most critical considerations are: can the journalists publish content? Can customers access the website? A problem preventing the news from going out is a huge issue, and we'll immediately get multiple people investigating. However, if our Jenkins box alerts that it's running low on disk space over a weekend? I'm unlikely to care, and I'll fix it on Monday. If it's overnight and it doesn't need to be fixed immediately, perhaps it's safer to wait until morning, when you have a clearer head and more eyes on the fix.

Slide 44

"All incidents are equal, but some incidents are more equal than others." George Orwell, probably. @efinlay24 #VelocityConf

DRINK

Some things to consider: Is it affecting your customers? Is the issue blocking other teams right now? Is there a brand impact - does it make your company look bad? For example, the issue I talked about earlier, where we published the wrong article - that was DEFINITELY something that needed to be handled immediately.

Slide 45

What's already been tried? @efinlay24 #VelocityConf

Let's assume it's important and needs investigation - what's already been tried? Maybe nothing, if you're the first responder. Maybe first line have already run through the obvious solutions: restarting, failing over, etc. Or maybe your teammates have tried some fixes. Get as much information as possible - vague details can sometimes hide the actual problem. "I've restarted it" << what's "it"? The service? Their laptop? CLICK

Slide 46

Have they restarted the whole internet? I hope not...

Slide 47

Is there definitely a problem? @efinlay24 #VelocityConf

And validate that the problem does exist. There are times when you'll get reports like "the website is slow", which could mean anything from "my home wifi router is broken" all the way to "there's been another denial of service attack on Dyn, and 90% of the internet has fallen over".

Slide 48

Or maybe the monitoring system is broken and has started spamming out alerts to everyone - that's happened to us before. So get as much information as you can: it's worth spending a couple of minutes just to validate that there is definitely a problem before you start jumping into trying to fix things. Let's assume there is indeed a problem.

Slide 49

What's the minimum viable solution? @efinlay24 #VelocityConf

What's the least amount of effort you can spend to bypass the problem and get back online? Depending on what your service is, this is often more important than fixing the root cause.

Slide 50

Get it running before you get it fixed. @efinlay24 #VelocityConf

For example: Can we just fail over? Can we just roll back a release? Can we just restore a snapshot?

Slide 51

Go back to basics. Don't forget to check the logs. @efinlay24 #VelocityConf

But if you don't have a simple way to restore service, you'll need to investigate. This will entirely depend on your system architecture and your issue, but good starting points might be:
- check the logs
- check the disk, memory, CPU, and network traffic
- have you checked the steps in your panic guide?
- has there been a new release / deployment?
- was there planned work around the time the problems started?

Slide 52

And are there other known issues or outages happening? For example, when the Dyn attack broke the internet back in 2016, or issues with AWS in the past, where S3 or EC2 have fallen over for a whole region.

Slide 53

Let's assume that whatever's gone wrong isn't simple to solve… You've done your investigation. You've tried the obvious solutions. You're still stuck. Everything is still on fire.

Slide 54

Don't be afraid to call for help. @efinlay24 #VelocityConf

That's ok! Call for backup if you can - it's often better to bring other people in and get help quickly. You can't always fix everything on your own, and that's ok. Do the basic investigation first and confirm it's not a simple problem, but don't be afraid to get assistance if you need it. Nobody will think less of you.

Slide 55

The One Where a Manager Falls Through the Ceiling

At my previous company, we had an aircon alert for the office server room which went off at the weekend. Our Tech Director got an alert from our monitoring system and popped in to fix it - he lived near the office. However, he couldn't remember the door code. At this point, most people would have called us to get the new code. Instead, he decided to crawl through the false ceiling of the office to get into the server room. In case you're not aware, office ceilings aren't designed to support the weight of a person.

Slide 56

The One Where a Director Falls Through the Ceiling

It gave way, and he fell into the server room - which caused us even MORE problems than the aircon, which we then had to sort out on Monday… It turns out that wild servers are easily startled: they don't like having people fall on top of them unexpectedly, and they definitely don't like breathing in many years of accumulated ceiling dust.

Slide 57

(it didn't look like this)

In his defence, his reasoning was that he didn't want to disturb us on a weekend, which I can respect. But it would have been MUCH better for everyone if he'd just called us up and asked for help - we could have fixed the problem together, without needing to replace the ceiling afterwards.

Slide 58

Communication is key. Especially to your customers. @efinlay24 #VelocityConf

Communication is really important. It's really irritating if you're trying to use a service or product, it's not working, and you've no idea if it's being investigated, or if people are even aware of the issue.

Slide 59

Even though everything might be on fire, you still need to communicate with your customers and the business. This is quite difficult when you're the person trying to actually fix the problem, so...

Slide 60

Put someone in charge. @efinlay24 #VelocityConf

Make someone the incident manager - they need to be in charge of handling communication. It's extremely hard to multitask in normal work, let alone during a stressful production incident. Give one person the task of updating the business and your customers; make sure they provide regular updates, and that they shield the people trying to fix things from interruptions by senior management. Beth Long and Elisa Binette did a far more detailed talk on incident command earlier today - I thoroughly enjoyed it, and I recommend you check out the video.

Slide 61

If you're like us and have alerts / notifications in your chat channels, then you've seen things like this before - alert spam makes the channel impossible to use for trying to discuss and solve the issue.

Slide 62

Create a temporary incident channel. @efinlay24 #VelocityConf

Spin up a new channel or group specifically for this incident. This comes back to communication - if you have multiple people investigating and fixing a problem, you need somewhere they can coordinate. This is especially true if there are people from multiple teams, areas of the business, or even companies. It also helps a lot with building an incident timeline later on, so that you can go back and see who was doing what, when.

Slide 63

If you've ever tried to fix something with multiple people, it often ends up like this. Having a central place to talk helps with coordination. Make sure people share what they're doing, changing, and investigating... The last thing you want is: "oh, wait - you were in the middle of rolling back the database? But I've just changed the network settings" - and now your database is corrupt.

Slide 64

This is an example of one of our temporary incident channels. As engineers trying to fix the problem, we are:
- discussing the issue
- sharing logs & graphs
- announcing any changes, tests or fixes that we're running

Slide 65

If you think you're over-communicating, it's probably just the right amount. @efinlay24 #VelocityConf

I've mentioned communication already, but it's so important. Provide high-level updates every half hour, just to let people know you're still working on it. There's nothing worse than someone saying "I'm looking into it", and then nothing for an hour - and you wonder: is it fixed? Are they still investigating? Have they gone for lunch?

Slide 66

Tired people don't think good. @efinlay24 #VelocityConf

When you're tired and extremely stressed, you make mistakes. Make sure people take breaks, especially if the problem is long-running! It's hard to make yourself take 15 minutes to go and get a coffee or go for a walk while things are still broken, because you feel obligated to stay until it's solved. But if you don't, you'll be less effective, miss obvious things, and maybe even make the problem worse. For long-running incidents, this may even involve rotating in shifts - or, in large companies, handing over to other teams.

Slide 67

The longest-running incident I've been part of in the Content team was when our EU production cluster started failing due to CPU load at around 5pm on a Thursday, just as everyone was about to leave for the day. There was no increase in traffic, so we spent some time trying to identify the cause of the issue, with no success. Given that our US cluster is healthy, we fail all of our traffic to the US. The US then starts failing as well… We continue to spend several hours:
- trying to work out what's causing the problem
- attempting to get into a state where we can serve any content at all
- swearing quite a lot
By this point it's around 10pm, the entire team is completely exhausted, and we're struggling for ideas.

Slide 68

The one where we had to serve traffic from staging

Eventually, it takes our Director of Engineering to suggest routing our traffic through our staging environment, manually editing our configuration files to pull data from our old legacy platform, which we were migrating away from - fortunately we hadn't decommissioned it yet. It was roughly around midnight before we managed to get to a point where we were serving stable traffic, in a very roundabout way.

Slide 69

We all went home, then continued to investigate the next day. We eventually identified the root cause as an update to a database query that made it extremely slow, which caused the overload on our databases, which eventually caused the cluster to collapse.

Slide 70

It wasn't great, but it wasn't the end of the world. @efinlay24 #VelocityConf

Fortunately, the ft.com team are rather good, and the site failed gracefully in this situation, even though our backend APIs were extremely unreliable. There was zero end-user downtime, but we did serve stale content for several hours. This would've been a problem in a breaking news situation, but fortunately it all worked out.

Slide 71

The Ghosts of Incidents... Future Present > Past

So those are some (hopefully useful) tips for how to deal with incidents in progress.

Slide 72

The Ghost of Incidents Past

What do you need to do once the dust has settled, and you're back online?

Slide 73

Congratulations! You survived. It probably wasn't that bad, was it? @efinlay24 #VelocityConf

Take some time, for your own mental health. Incidents are stressful, and if you've been working all day yesterday and through the night, you need to take some time to recover.

Slide 74

Run a learning review with everyone involved. @efinlay24 #VelocityConf

Emma Button suggested using the term "learning review" instead of "post-mortem", which I quite like. This is an opportunity to discuss the incident with everyone involved, especially if the incident had a serious impact or involved multiple people. The objective isn't to point fingers and assign blame - it's to discuss what worked, what didn't, and what can be improved for next time. Do it soon after the incident: if you leave it too long, everyone will forget things and move on to other work.

Slide 75

Incident reports are important. @efinlay24 #VelocityConf

Again, I don't like writing documentation, but incident reports are extremely valuable. This is where keeping a timeline comes in handy. Depending on where you work, incident reports may be required and very formal, and you might need to make them public if you have external customers. Even if you don't, it's worth having them internally for your team, so that you have a record of previous incidents that you can refer back to later.

Slide 76

DRINK

There's nothing worse than running into a production problem, and your friend saying: "oh yeah, it's exactly like that time when this happened last year!" - and then NOBODY BEING ABLE TO REMEMBER HOW TO FIX IT. So make sure you write up what happened and what the solution was.

Slide 77

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

An excellent example of an incident report is from when GitLab had a fairly serious outage back at the start of 2017. To summarise really briefly: GitLab were investigating a load issue on their production database, and a number of unfortunate events compounded the original problem, eventually leading to an engineer accidentally deleting production data. To make things worse, they then found that their database backups had been failing silently for some time...
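
The cure for backups that fail silently is to restore them somewhere on a schedule and check the result. As a hedged sketch only (the image, script and database are placeholders, and this isn't what GitLab or the FT actually run), that can be as simple as a Kubernetes CronJob that fails loudly when a restore doesn't work:

```yaml
# Hypothetical scheduled restore test - every name here is a placeholder.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-restore-test
spec:
  schedule: "0 3 * * *"          # every night at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: restore-test
              image: example/restore-test:latest   # placeholder image
              # The (hypothetical) script restores last night's backup into a
              # scratch database and exits non-zero if the data looks wrong,
              # so a broken backup pages someone instead of failing silently.
              command: ["/scripts/restore-and-verify.sh"]
```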

Slide 78

"Until a restore is attempted, a backup is both successful and unsuccessful." Definitely not Erwin Schrödinger. https://twitter.com/TessSchrodinger/status/534042916264873984

Which meant it was much harder for them to recover... I've been in similar situations myself, though not at that scale - and I know I just wanted to crawl into a hole and hide forever. I definitely recommend checking out the full report; it's a really interesting - and really scary - read.

Slide 79

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

What really impressed a lot of people, though - myself included - was the way in which they handled the incident. They released a very open, honest incident report to the public, with a detailed timeline of what happened, and why.

Slide 80

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

They also made their disaster recovery process completely open to everyone, which is really unusual: they had a public Google Doc which they used to keep track of their progress, and they also livestreamed the recovery on YouTube. I'm not suggesting that everyone should do this - personally, I'm not convinced a livestream would have a good mental impact on the engineers trying to recover from the problem - but it did show that GitLab were committed to keeping their customers up to date with the status of the recovery.

Slide 81

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Finally, they identified a number of improvements and fixes to ensure a similar incident couldn't happen again. Again, these were shared publicly, so that their customers could see their progress and status.

Slide 82

Identify what can be improved for next time. @efinlay24 #VelocityConf

This is the most important part of the post-incident process. It can encompass multiple things - and not all of them may be technical or code-related. Nobody's response plans are perfect when they start out, and you should improve them as you go. Maybe the call-out or escalation process didn't work very well and needs to be improved, perhaps the documentation was incomplete and needs to be updated, or there are some obvious bugs that need to be fixed.

Slide 83

Prioritise follow-up actions. @efinlay24 #VelocityConf

And once you've identified them, make sure they get done. If you leave it too long, people will forget the details, or move on to other things.

Slide 84

https://blog.github.com/2018-10-30-oct21-post-incident-analysis

I'll briefly mention GitHub's incident report as well, from their outage last week - this was released a couple of days ago. Again, it's well written and very interesting, so I recommend checking it out.

Slide 85

The One with the Badly Named Database

We had an outage at a previous company I worked at, where a business analyst ran some scripts against the production database, thinking they were connected to pre-production. The production database was named prod - as you'd expect - but the pre-production database... CLICK

Slide 86

Please don't name your pre-production database: 'pprod' @efinlay24 #VelocityConf

...was named pprod. I didn't name it, but at the top of our list of actions were: rename that database, don't give databases really similar names in future, and...

Slide 87

https://twitter.com/iamdevloper/status/1040171187601633280

...restrict access to production, so people can't accidentally connect.

Slide 88

Nearly the end. Don't clap yet. @efinlay24 #VelocityConf

That's pretty much everything I wanted to cover - I hope you've found it interesting and useful.

Slide 89

Problems will always happen. And that's ok. @efinlay24 #VelocityConf

To sum up: incidents and issues are just another part of what we deal with in technology. It's how we plan for them, respond to them, and then improve things afterwards that makes the difference. For those of you who are new to support - I hope I've not completely scared you off! Hopefully you've got some ideas of things you can go away and do after this, to make your lives easier when something does eventually go wrong. And if you need to go on call, you'll have some plans in place to cope - so it won't be quite as terrifying.

Slide 90

The end. Please clap. @efinlay24 #VelocityConf

Please clap. clap clap clap

Slide 91

@efinlay24 euan.finlay@ft.com We're hiring! https://ft.com/dev/null

Bye!