Don’t Panic!

A presentation at DevOps and Dine in May 2019 in London, UK, by Euan Finlay

Slide 1

Don’t Panic! How to Cope Now You’re Responsible for Production - Euan Finlay, @efinlay24

Hi! Thanks for the introduction. Despite this being a technical talk, the scariest production incident I’ve been part of in my four years at the Financial Times wasn’t actually caused by anything in our technology stack.

Slide 2

Back in 2015, the European Central Bank was preparing to make an important announcement, with the financial industry expecting interest rates to be cut. It’s standard practice in the media to prepare articles for major events like this, so that we don’t need to write a story from scratch. Sometimes, mistakes happen, and the wrong information gets published. For example… CLICK

Slide 3

https://www.theguardian.com/uk-news/2015/jun/03/queens-health-bbc-tweet-global-news-alert/

…like when the BBC incorrectly announced the Queen’s death on Twitter. In our case, we were updating a draft article about interest rates remaining steady - which was the opposite of what everyone expected to happen.

Slide 4

https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12/

However, instead of updating the draft, we accidentally published the wrong story live, 10 minutes ahead of the announcement embargo.

Slide 5

https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12/

To make things worse, our systems then automatically sent out the associated tweet. Because people place a lot of trust in our reporting, everyone who read the article assumed the announcement was a surprise, but factually accurate.

Slide 6

https://twitter.com/Ludo_Dufour/status/672401158653218816/

This caused the exchange rate to spike by 0.4% - which was actually a significant shift in the market. We immediately removed the incorrect story, then published the real announcement… CLICK …alongside a correction.

Slide 7

https://www.ft.com/content/c01f37ec-99c1-11e5-987b-d6cdef1b205c/

While there wasn’t any long-term damage to the markets, this wasn’t good for our brand and reputation, and there were some very stressed people across the Editorial, Technology and Communications teams that morning. It’s quite scary to think that a mistake like this can have a serious real-world impact.

Slide 8

That said, the incident was handled really well by the teams involved, and if it wasn’t for their quick actions, it could have been much worse. In response, we made a number of improvements to our systems and processes, which made it much harder to accidentally publish a draft article.

Slide 9

/usr/bin/whoami @efinlay24

I’m a Senior Engineer at the Financial Times, currently working on the Operations & Reliability team, responsible for the FT’s production systems and website globally.

Slide 10

/usr/bin/whodoiworkfor No such file or directory. @efinlay24

Although we’re most famous for the newspaper, we’re primarily a digital content company. Two years ago, revenue from our online subscriptions overtook the physical paper and advertising.

Slide 11

https://www.ft.com/

This means that our content and website are absolutely critical to our business, and we invest heavily in technology. We try to empower our engineering teams, and as part of that, they design, run and support their own systems - from the beginning to the very end of the product lifecycle.

Slide 12

Your team is now on call. And you’re mildly terrified. @efinlay24

So, maybe you’ve been leading a team for a while: you’re proud of the services that you’ve built and deployed, but it’s still intimidating being told you now need to deal with production issues. Or perhaps you’re in a leadership role and have recently moved to a new company: you’ve had time to settle in and get familiar with your teams, but now you’re encouraging them to be involved in production support, and they’re worried about how they’ll cope when things go wrong.

Slide 13

Obligatory audience interaction. @efinlay24

Hands up if you’ve been on call before. Hands up if you’ve never had to support production services. Hands up if you don’t like putting up your hand in the middle of talks. I still remember what it felt like the first time I was called out: it was terrifying, I was asked to fix a service I knew nothing about, I couldn’t find the documentation, and I thought about quitting technology entirely and running away to become a llama farmer.

Slide 14

Everyone feels the same when they start out. I still do today. @efinlay24

It’s something like imposter syndrome - I suspect we all feel something similar the first time we start handling production incidents. Even now, with more experience, I still get a twinge of fear whenever my phone goes off in the middle of the night: what if it’s something I can’t fix? What if I’m just not good enough? And if I feel like that, imagine being a junior engineer, with much less support experience.

Slide 15

How do you get comfortable with supporting production? @efinlay24

The idea behind this talk was how to enable our teams to become more comfortable, so that they’re not dreading that phone call, or that message of “the website is down!”. When writing this talk, I was told it helps to have a tenuous theme, so I thought to myself: who else is quite grumpy (like lots of sysadmins), and who else gets woken up at 2am? The answer was… CLICK

Slide 16

A tenuous link to A Christmas Carol.

Scrooge, from A Christmas Carol - which actually leads nicely into a talk structure about dealing with production incidents, because, much like A Christmas Carol, there are:

Slide 17

The Ghosts of Incidents… > Future

…things we can plan for now, to help our teams before a problem happens.

Slide 18

The Ghosts of Incidents… Future > Present

…actions we need to take when something actually breaks.

Slide 19

The Ghosts of Incidents… Future Present > Past

…and things we need to do after an incident, to prevent problems from recurring.

Slide 20

The Ghost of Incidents Future

So, knowing that things will go wrong at some point, how do we plan ahead?

Slide 21

Handling incidents is the same as any other skill. @efinlay24

Handling incidents IS the same as any other skill: it can be learned, taught, and practiced. If the first time people on your teams try to do this is without any training, with no plan of action, after a phone call at 2am - it’s not going to go well.

Slide 22

Get comfortable with your alerts. @efinlay24

At the FT, we regularly rotate people through in-hours support, when the whole team is available to help and provide advice. This keeps everyone familiar with what can go wrong, the alerts that go off, and the monitoring tools and how to use them.

Slide 23

Delete the alerts you don’t care about. @efinlay24

Think carefully about what alerts you create, as alert noise and overload is bad. Ideally, every alert should be business-critical and actionable - otherwise they’ll just get ignored, and real issues might be missed.
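The talk doesn’t name a specific monitoring stack, but as a rough sketch of what “business-critical and actionable” can look like, a Prometheus-style rule might alert on the user-facing symptom and link straight to the panic guide. The metric names, threshold and runbook URL below are illustrative assumptions, not anything from the talk:

```yaml
# Hypothetical example - alert on the symptom users feel, not on every
# low-level metric, and make it actionable with a runbook link.
groups:
  - name: website
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of website requests are failing"
          runbook: "https://example.com/panic-guides/website"  # placeholder
```

Anything that wouldn’t justify waking someone up, or that has no clear next action, is a candidate for deletion.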

Slide 24

Have a plan for when things break. @efinlay24

It’s really important to have a response plan ahead of time. In large companies, this might be quite formal and well defined; in a small startup, you may just have some short guidelines so people know what’s expected of them. Either way, make sure your teams aren’t left wondering what they’re meant to do when alerts start going off.

Slide 25

Keep the documentation up to date. @efinlay24

While I don’t think anyone especially likes writing documentation, it’s important to have information on what to do when services break. Create panic guides with common solutions for problems that might occur, and write them as though it’s 2am and you’ve just been woken up - only include the essentials needed to get things up and running again. Have separate disaster recovery documentation, containing more detail on how to recover from a major outage.

Slide 26

Practice regularly. @efinlay24

Once you’ve created those guides, run through them regularly so that everyone is familiar with them. A couple of years back, we performed an unscheduled test of our disaster recovery procedures. We were creating a new production cluster, and ran our Ansible playbook to create five new servers. It turns out that if you’re not careful with the way you write your playbooks and say “give me 5 servers”, Ansible will ensure that you have 5 servers in total - and delete all the others.
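The talk doesn’t show the playbook in question, but as a hedged illustration of how that behaviour can arise: with Ansible’s EC2 provisioning, exact_count means “this many matching instances in total”, so any surplus instances that match the tag get terminated. The module parameters, tags and AMI below are assumptions for the sketch, not the FT’s real configuration:

```yaml
# Illustrative sketch only.
# exact_count: 5 converges on five instances matching count_tag -
# if more than five already exist, Ansible terminates the extras.
- hosts: localhost
  connection: local
  tasks:
    - name: Ensure the cluster has exactly 5 servers
      ec2:
        image: ami-0123456789abcdef0   # placeholder AMI
        instance_type: t3.medium
        instance_tags:
          cluster: prod
        count_tag:
          cluster: prod
        exact_count: 5
```

A safer pattern is to scope the tag tightly to the new cluster (and check what already matches it) before letting a playbook enforce totals.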

Slide 27

The one where we decommissioned all our production servers

The first we knew of this was when all of our alerts went off, and I heard a very quiet “oh no” from my friend sat next to me. We had a brief outage while our automatic failover system switched across to our backup cluster. DRINK

Slide 28

Slide 28

however, since we already had a plan for how to spin everything back up from scratch it wasn’t too long before we were back to normal operation we definitely wouldn’t have wanted to find out that our disaster recovery process didn’t work at that point in time…

Slide 29

Break things, and see what happens. Did your systems do what you expected? @efinlay24

As an extension of that, encourage teams to actively break things, and check that your services behave correctly. Chaos Monkey is probably the most well-known example, but you can do this manually in a more controlled way, too.

Slide 30

The Planned Datacenter Disconnect

We ran a planned DR test, disconnecting the network to one of our datacenters. We pulled the plug, and the monitoring dashboard lit up with red lights - just as we expected. We then tried to fail our systems over to the healthy datacenter. It was at that point we found that a critical part of the failover system didn’t work if one of the datacenters was offline… We were glad we found that before a real problem occurred.

Slide 31

We got complacent, and stopped running datacenter failure tests… @efinlay24

However, the last time we ran one of those tests was a couple of years ago. While we’ve migrated the majority of our systems to the cloud, there are still a few important services that run out of our two datacenters. You can probably guess what the next slide is going to be…

Slide 32

The Unplanned Datacenter Disconnect

Because late last year, we had a live test of exactly what happens when one of our datacenters drops off the network. Naturally, it happened on a weekend in the middle of August, when lots of people were on holiday without access to the internet. Worse, two years after the last test, our processes and guides weren’t fully up to date.

Slide 33

Have a central place for reporting changes and problems. @efinlay24

At the FT, we have a chat channel that anyone can join, which we use to communicate changes that are in progress, and to report potential issues or problems that are happening. As you can imagine, this channel got VERY busy during the outage, with lots of reports from across the business.

Slide 34

While the failover process took longer than we’d have liked, we got everything stabilised without any major business impact. Later, we found that the problem had been caused by another customer in the datacenter accidentally cutting the fiber connection providing internet access to all of our servers.

Slide 35

We’re not perfect. But we always try to improve. @efinlay24

We should have followed our own advice - if we’d practiced regularly, we would have been much more prepared when the failure happened. That said, we still kept everything running that week, without any major business impact. Most importantly, we’ve learnt from our experience, and used it to improve our failover processes.

Slide 36

The Ghosts of Incidents… Future > Present Past

Those are some ideas that you can think about and work on with your teams before something goes wrong.

Slide 37

The Ghost of Incidents Present

But something’s happened - alerts have gone off, and we’ve been called, or asked to investigate. What are the first steps we should take?

Slide 38

Calm down, and take a deep breath. It’s probably ok. @efinlay24

Encourage your teams to take a step back and assess the situation. Dealing with incidents is stressful, but it’s important to remind ourselves that it’s not the end of the world. For most of us, if our website goes down or a service fails, it’s usually not catastrophic in the grand scheme of things.

Slide 39

Don’t dive straight in. Go back to first principles. @efinlay24

It’s always tempting to immediately jump in and start trying to solve the problem, but generally speaking, there’s a set of questions that I’ll always ask myself before digging into an issue further.

Slide 40

What’s the actual impact? @efinlay24

What’s the actual impact? At the FT, our most critical considerations are: can the journalists publish content? Can customers access the website? DRINK

Slide 41

“All incidents are equal, but some incidents are more equal than others.” George Orwell, probably. @efinlay24

A problem preventing the news from going out is a huge issue, and we’ll immediately get multiple people investigating ASAP. However, if a build server warns that it’s running low on disk space over a weekend? I’m unlikely to care, and I’ll fix it during office hours.

Slide 42

What’s already been tried? @efinlay24

Has anything already been done by other people? Get as much information as possible - vague details can sometimes hide the actual problem.

Slide 43

“I’ve restarted it” << what’s “it”? The web service? The server itself? Have they restarted the whole internet?

Slide 44

Is there definitely a problem? @efinlay24

Confirm the problem definitely exists. People can report things like “the website is slow”, which can mean anything from “a badly written database query is causing timeouts” to “my laptop has decided to download updates, and I’m on a spotty wifi connection”. Or maybe… CLICK

Slide 45

…the monitoring system is broken and has started firing false alerts to everyone - that’s happened to us before. Let’s assume there is indeed a problem.

Slide 46

What’s the minimum viable solution? @efinlay24

What’s the least amount of effort we can spend to get back online? Depending on what the issue is, this is often a higher priority than fixing the root cause.

Slide 47

Get it running before you get it fixed. @efinlay24

For example: can we just fail over? Can we just roll back a release? Can we just restore a snapshot?

Slide 48

Go back to basics. @efinlay24

But if there isn’t a simple way to restore service, we’ll need to investigate. This will depend entirely on your system architecture and the issue, but good starting points tend to be: checking the logs and monitoring; running through the steps in the panic guide; and asking whether there was a new release, or other planned work, around the time the problems started.

Slide 49

Are there other known issues or outages happening outside of our control? For example, when the Dyn denial of service attack broke the internet back in 2016, or past issues with AWS where services have gone down for a whole region.

Slide 50

Let’s assume that it’s not simple to solve: we’ve done our initial investigation, we’ve tried our standard recovery solutions, and we’re still stuck.

Slide 51

Don’t be afraid to call for help. @efinlay24

That’s ok! Encourage people to call for backup, sooner rather than later. It’s important to have a culture where teams feel comfortable asking for help - we can’t always fix everything on our own, and that’s to be expected. In a previous company, an alert went off on a Sunday afternoon, warning that the aircon in our office server room had stopped working.

Slide 52

The One Where a Manager Falls Through the Ceiling

Our Tech Director saw this alert and popped in to fix it - they lived near the office. However, they didn’t know the new security code for the server room door. At this point, most people would have called us to get the new code. Instead, because it was an emergency, they decided to crawl through the false ceiling of the office to get into the server room. In case you’re not aware, most office ceilings aren’t designed to support the weight of a person.

Slide 53

The One Where a Director Falls Through the Ceiling

It gave way, and they fell into the server room, which ended up causing us a lot more problems than just a broken aircon unit. It turns out that wild servers are easily startled - they don’t like having people fall on top of them unexpectedly, and they definitely don’t like breathing in several years of accumulated ceiling dust. DRINK

Slide 54

(it didn’t look like this)

In their defence, our director’s reasoning was that they didn’t want to disturb us on a weekend because they could fix it themselves, which I can respect. But it would have been MUCH better for everyone if they’d just asked for help, and we could have fixed the problem together - without needing to replace the ceiling afterwards.

Slide 55

Communication is key. Especially to our customers. @efinlay24

This leads into my next point, which is that communication is really important. It’s a huge problem when there’s an ongoing incident, but nobody is sure what the status is.

Slide 56

Even though everything might be on fire, we still need to communicate with the business. This is quite difficult when our team is focused on actually fixing the problem, so…

Slide 57

Put someone in charge. @efinlay24

Someone needs to take the role of incident manager, which frees everyone else to dive deeper into the problem without multitasking. They’re responsible for providing regular status updates and preventing interruptions to the people fixing the issue. Beth Long and Elisa Binette did a really good talk at Velocity last year about the incident command role at New Relic - I recommend checking out the video, and I’ll share the link on Twitter afterwards.

Slide 58

Having alerts and notifications in your chat channels can be useful, but during an outage, it can make a channel impossible to use for discussion.

Slide 59

And when you’re trying to fix a problem with multiple people, it can often end up like this - so we need somewhere to coordinate the investigation.

Slide 60

Create a temporary incident channel. @efinlay24

Having a single temporary space helps the incident manager keep everyone on the same page, and it’s valuable to use as a timeline later, to see what actions have been taken.

Slide 61

Make sure that everybody shares information and reports what they’re doing - you don’t want two people making conflicting changes, and potentially making the problem worse.

Slide 62

If you think you’re over-communicating, it’s probably just the right amount. @efinlay24

I’ve mentioned communication already, but it’s so important. Provide high-level updates on a regular basis, just to let people know the problem is still being worked on.

Slide 63

Tired people don’t think good. @efinlay24

When people are tired and extremely stressed, we all make mistakes. Make sure everyone takes breaks, especially if the problem is long-running - otherwise people will be less effective, or accidentally make things worse. Depending on the duration, this may even involve rotating in shifts, or handing over to other teams.

Slide 64

The longest-running incident in the Content team was when our EU cluster started failing due to CPU load, just as everyone was about to leave for the day. We did some initial investigation, then switched all of our traffic to the healthy US cluster - which then started failing as well… We spent the next five hours investigating and attempting to get our clusters stable. We were completely exhausted, and struggling for ideas.

Slide 65

The one where we had to serve traffic from staging

Eventually, our director of engineering suggested routing traffic through our staging environment, and manually editing configuration files to pull data from our old legacy platform. By the time we managed to reliably serve traffic, it was around midnight.

Slide 66

We continued the investigation the next day, and eventually identified the root cause as an update to a query which overloaded our databases. This had a cascade effect on our other services, eventually causing the cluster to collapse.

Slide 67

It wasn’t great, but it wasn’t the end of the world. @efinlay24

It wasn’t the best situation to be in, but the ft.com website team are great, and design their systems to fail gracefully in situations like this. We served stale content during the outage, but there was zero downtime for our customers.

Slide 68

The Ghosts of Incidents… Future Present > Past

So those are some tips for how to deal with incidents in progress.

Slide 69

The Ghost of Incidents Past

What do we need to do once the dust has settled, and we’re back online?

Slide 70

Congratulations! We survived. It probably wasn’t that bad, was it? @efinlay24

Encourage everyone involved to take some time out, for their mental health. Incidents are stressful, and if people have been working out of hours, they need time to recover.

Slide 71

Run a learning review with everyone involved. @efinlay24

At DevOpsDays in London last year, Emma Button suggested using the term “learning review” instead of “post-mortem”, which I quite like. The objective isn’t to point fingers and assign blame - it’s an opportunity to discuss what worked, what didn’t, and what can be improved for next time. Do it soon afterwards, otherwise people will forget the details and move on to other work.

Slide 72

Incident reports are important. @efinlay24

Incident reports are valuable, and this is where keeping a timeline comes in handy. It’s useful to log what happened and how we fixed previous problems, so we can refer back to them in the future. DRINK CLICK XKCD

Slide 73

There’s nothing worse than having a production issue and someone saying “oh, it’s exactly like when this happened last year!” - but nobody can remember what was done to fix it. Make sure the solution gets written up.

Slide 74

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

A great example is from when GitLab had a fairly serious outage back at the start of 2017. To briefly summarise: GitLab were investigating a load issue on their production database, and a number of unfortunate events compounded the original problem, eventually leading to an engineer accidentally deleting production data.

Slide 75

“Until a restore is attempted, a backup is both successful and unsuccessful.” Erwin Schrödinger? @efinlay24

To make things worse, they then found that their database backups had been failing silently for some time, which meant they suffered permanent loss of customer data.

Slide 76

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

What really impressed a lot of people, though, was how they handled the outage. They released a very open, honest incident report to the public, with a detailed timeline of what happened, and why.

Slide 77

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

They also made their recovery process public, and livestreamed it on YouTube. I’m not suggesting that everyone should do this - personally, I don’t think a livestream would be good for the mental state of the teams trying to fix the problem - but it did show that GitLab were committed to keeping their customers up to date with the status of the recovery.

Slide 78

Identify what can be improved for next time. @efinlay24

Follow-up actions are the most important part of this process. They can cover lots of things - and not all of them may be technical or code-related. Our response plans are never perfect when we start out, and they should be improved over time.

Slide 79

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Maybe the escalation process needs to be improved, perhaps the documentation was out of date, or there are bugs that need to be fixed. Again, GitLab shared these actions publicly so that their customers could see their progress and status. I definitely recommend checking out the full report - it’s a really interesting read.

Slide 80

Nearly the end. Don’t clap yet. @efinlay24

So that’s pretty much everything I wanted to talk about today - I hope you’ve found it interesting and useful, and that you now have some ideas of how to help your teams cope the next time something breaks.

Slide 81

Failure is inevitable. And that’s ok. @efinlay24

To sum up: problems and outages are just another part of what we deal with in technology. It’s how we help our teams plan for them, respond to them, and then improve things afterwards that makes the difference.

Slide 82

The end. “Please clap.” Jeb Bush, 2016 @efinlay24

clap clap clap #LeadDevMeetup

Slide 83

We’re hiring! https://ft.com/dev/null/ @efinlay24 euan.finlay@ft.com

Bye!