Don’t Panic!

A presentation at DevReach in October 2019 in Sofia, Bulgaria by Euan Finlay

Slide 1

Don’t Panic! How to Cope Now You’re Responsible for Production Euan Finlay @efinlay24 | #DevReach2019 Hi! Thanks for coming despite this being a technical talk the scariest production incident that I’ve been part of in my five years at the Financial Times wasn’t actually caused by anything in our technology stack

Slide 2

back in 2015, the European Central Bank was preparing to make an important announcement, with the financial industry expecting interest rates to be cut it’s standard practice in the media to prepare articles for major events like this so that we don’t need to write a story from scratch sometimes, mistakes happen, and the wrong information gets published for example, CLICK [ when the BBC incorrectly announced the Queen’s death on Twitter ]

Slide 3

https://www.theguardian.com/uk-news/2015/jun/03/queens-health-bbc-tweet-global-news-alert/ when the BBC incorrectly announced the Queen’s death on Twitter in our case, we were updating a draft article about interest rates remaining steady which was the opposite of what everyone expected to happen

Slide 4

https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12/ however, instead of updating the draft, we accidentally published the wrong story live 10 minutes ahead of the announcement embargo

Slide 5

https://www.businessinsider.com/ft-publishes-incorrect-ecb-announcement-early-2015-12/ to make things worse, our systems then automatically sent out the associated tweet because our readers place a lot of trust in our reporting everyone who read the article thought that the announcement was a surprise, but factually accurate

Slide 6

https://twitter.com/Ludo_Dufour/status/672401158653218816/ this caused the exchange rate to spike by 0.4% - which was actually a significant shift in the market we immediately removed the incorrect story, then published the real announcement… CLICK [ alongside a correction ]

Slide 7

https://www.ft.com/content/c01f37ec-99c1-11e5-987b-d6cdef1b205c/ while there wasn’t any long-term damage to the markets, this wasn’t good for our brand and reputation there were some very stressed people across the Editorial, Technology and Communications teams that morning it’s quite scary to think that a mistake like this can have a serious real-world impact

Slide 8

that said, the incident was handled really well by the teams involved and if it wasn’t for their quick actions, it could have been much worse in response to this, we made a number of improvements to our systems and processes which made it much harder to accidentally publish a draft article

Slide 9

/usr/bin/whoami @efinlay24 #DevReach2019 I’m a Senior Engineer at the Financial Times currently, I work on the Operations & Reliability team, responsible for the website and all of our production systems across the globe

Slide 10

/usr/bin/whodoiworkfor No such file or directory. @efinlay24 #DevReach2019 although the FT is most famous for our newspaper, we’re primarily a digital content company 3 years ago, revenue from our online subscriptions overtook the physical paper and advertising

Slide 11

https://www.ft.com/ this means that our content and website are absolutely critical to our business, and we invest heavily in technology we try to empower our cross-functional engineering teams and as part of that, they design, run & support their systems - from the beginning, to the very end of the product lifecycle to give you a sense of scale, [ we have a total of 1211 live production systems ]

Slide 12

1211 Production Systems we have a total of 1211 live production systems

Slide 13

245 Platinum Systems 245 of those are what we call “platinum”, and provide business-critical capabilities these are things such as our journalists being able to publish content, or customers being able to access the ft.com website

Slide 14

~150 Daily Releases we believe in devops and agile working practices, which means that we release to production roughly 150 times per day

Slide 15

~150 (including Fridays) …and yes, that includes fridays :)

Slide 16

60+ Third-Party Providers and there are over 60 third-party providers that the FT integrates with and depends upon - companies like Google, Vodafone and Verizon

Slide 17

our main technology hubs are currently London, Manila and Sofia (hi!)

Slide 18

Lots of pins but we have more bureaus and journalism offices across the world they’re not our main locations, but we still have people and services there that we need to support as you can imagine, that’s a huge amount of technology and infrastructure to keep track of and that’s where the Operations Support team come into play

Slide 19

we provide 24x7 first-line support for all our systems and products, so our team is split between London

Slide 20

and Manila for major issues, or anything we can’t fix, we escalate to the engineering teams who build and own the services that’s what the focus of my talk is today - handling major incidents, and helping development teams support their services in production

Slide 21

Your team is now on call. And you’re mildly terrified. @efinlay24 #DevReach2019 so, maybe you’re an engineer that’s been part of your team for a while: you’re proud of the services that you’ve built and deployed but it’s still intimidating being told you now need to deal with production issues Or perhaps you’re in a leadership role and have recently moved to a new company: you’ve had time to settle in and get familiar with your teams, but now you’re encouraging them to be involved in production support and they’re worried about how they’ll cope when things go wrong

Slide 22

Obligatory audience interaction. @efinlay24 #DevReach2019 hands up if you’ve been on call before hands up if you’ve never had to support production services hands up if you don’t like putting up your hand in the middle of talks I still remember what it felt like the first time I was called out: - it was terrifying - I was asked to fix a service I knew nothing about - I couldn’t find the documentation - it was genuinely an awful experience

Slide 23

Everyone feels the same when they start out. I still do today. @efinlay24 #DevReach2019 It’s something like imposter syndrome I suspect we all feel something similar the first time we start handling production incidents even now, with more experience - I still get a twinge of fear whenever my phone goes off in the middle of the night - what if it’s something I can’t fix? - what if I’m just not good enough? and if I feel like that, imagine being a junior engineer, with much less support experience

Slide 24

How do we get comfortable with supporting production? @efinlay24 #DevReach2019 so the idea behind this talk was how to enable our teams to become more comfortable so that they’re not dreading that phone call, or that message of “the website is down!” when writing this talk, I was told it helps to have a tenuous theme so I thought to myself, who else is quite grumpy (like lots of sysadmins) and who else gets woken up at 3am? the answer was: CLICK [ Scrooge, from A Christmas Carol ]

Slide 25

A tenuous link to A Christmas Carol. Scrooge, from A Christmas Carol which actually leads nicely into a talk structure about dealing with production incidents because, much like A Christmas Carol, there are: CLICK [ things we can plan for now, ]

Slide 26

The Ghosts of Incidents… > Future things we can plan for now, to help our teams before a problem happens

Slide 27

The Ghosts of Incidents… Future > Present actions we need to take when something actually breaks

Slide 28

The Ghosts of Incidents… Future Present > Past And there are things we need to do after an incident, to prevent problems from reoccurring

Slide 29

so knowing that things will go wrong at some point how do we plan ahead?

Slide 30

The Ghosts of Incidents… > Future Present Past let’s talk about some of the things we can do right now, while everything is stable and we have time available to plan

Slide 31

Handling incidents is the same as any other skill. @efinlay24 #DevReach2019 handling incidents IS the same as any other skill It can be learned, and taught, and practiced If the first time people on your teams try to do this is: - without any training - with no plan of action - after a phone call at 3am, it’s not going to go well

Slide 32

Get comfortable with your alerts. @efinlay24 #DevReach2019 at the FT, our engineering teams rotate people through in-hours support, when the whole team is available to help and provide advice we call this our OpsCop rota, and many companies will do something similar this keeps everyone familiar with - what can go wrong - the alerts that go off - the monitoring tools, and how to use them

Slide 33

Delete the alerts you don’t care about. @efinlay24 #DevReach2019 think carefully about what alerts you create, as alert noise and overload is bad ideally, every alert should be business critical and actionable otherwise, they’ll just get ignored, and real issues might be missed
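as a purely illustrative sketch (the alert format and fields here are invented, not our real alerting config), a small CI-style check can flag alerts that nobody could act on - no owning team, no severity, no runbook:

```python
# Hypothetical alert definitions - in reality these would live in your
# monitoring tool's config; the fields here are assumptions for the example.
ALERTS = [
    {"name": "content-publish-failing", "severity": "critical",
     "team": "content-platform", "runbook": "https://runbooks.example.com/content-publish"},
    {"name": "disk-70-percent", "severity": "info", "team": None, "runbook": None},
]

REQUIRED_FIELDS = ("severity", "team", "runbook")

def unactionable(alerts):
    """Return the names of alerts missing an owner, a severity or a runbook link."""
    return [a["name"] for a in alerts if not all(a.get(field) for field in REQUIRED_FIELDS)]

if __name__ == "__main__":
    bad = unactionable(ALERTS)
    if bad:
        raise SystemExit(f"Delete or fix these alerts - nobody will act on them: {bad}")
```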

Slide 34

Have a plan for when things break. @efinlay24 #DevReach2019 it’s really important to have a response plan ahead of time in large companies, this might be quite formal and well defined in a small startup though, you may just have some short guidelines so people know what’s expected of them Either way, make sure your teams aren’t wondering what they’re meant to do when alerts start going off

Slide 35

Keep your documentation up to date. @efinlay24 #DevReach2019 while I don’t think anyone especially likes writing documentation, it’s important to have information on what to do when services break create runbooks with common troubleshooting steps for problems that might occur write them as though it’s 3am and you’ve just been woken up - only include the essentials needed to get things up and running again have separate disaster recovery documentation, containing more detail on how to recover from a major outage

Slide 36

Biz Ops we use a home-built tool called Biz Ops for this

Slide 37

The central place to find info on all of the FT’s systems, products and teams. short for Business Operations, it’s where we store information about all of our systems, products and teams at the FT and powered by a neo4j graph database

Slide 38

this allows us to quickly find troubleshooting and support information in an emergency
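as a rough illustration of that kind of lookup (the node labels, properties and connection details below are hypothetical, not the real Biz Ops schema), a query through the neo4j Python driver might look something like this:

```python
# A minimal sketch of querying a systems graph for support information.
# Assumes hypothetical System and Team nodes joined by a SUPPORTED_BY relationship.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def support_info(system_code):
    query = """
    MATCH (s:System {code: $code})-[:SUPPORTED_BY]->(t:Team)
    RETURN s.name AS system, s.runbookUrl AS runbook,
           t.name AS team, t.slackChannel AS slack
    """
    with driver.session() as session:
        return [dict(record) for record in session.run(query, code=system_code)]

# e.g. which team owns the system, and where their runbook lives
print(support_info("content-api"))
```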

Slide 39

and if we can’t fix it - who we can contact for further support, if needed

Slide 40

all of this data is automatically populated from a markdown file, which lives in each of our repositories and this ensures that it’s easy for teams to update their documentation and troubleshooting information, whenever they make changes to their code
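the shape of that pipeline is roughly as follows - a hedged sketch only, with a made-up file name, front matter fields and catalogue endpoint rather than the real Biz Ops format:

```python
# Read a metadata/runbook file from a repository and push it to a catalogue API.
# Everything here (runbook.md, the YAML front matter fields, /systems/<code>) is
# an assumption for illustration, not the actual implementation.
import pathlib

import requests
import yaml

def publish_system_metadata(repo_root, catalogue_url):
    text = pathlib.Path(repo_root, "runbook.md").read_text()
    # Assume a YAML front matter block delimited by "---" at the top of the file.
    _, front_matter, body = text.split("---", 2)
    metadata = yaml.safe_load(front_matter)

    response = requests.put(
        f"{catalogue_url}/systems/{metadata['systemCode']}",
        json={
            "name": metadata["name"],
            "team": metadata["team"],
            "troubleshooting": body.strip(),
        },
        timeout=10,
    )
    response.raise_for_status()
```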

Slide 41

Practice regularly. @efinlay24 #DevReach2019 Once you’ve created troubleshooting and disaster recovery guides, run through them regularly so that everyone is familiar with them a few years back, we performed an unscheduled test of our disaster recovery procedures we were creating a new production cluster, and ran our Ansible playbook to create 5 new servers for us it turns out that if you’re not careful with the way you write your playbooks, and say “give me 5 servers” Ansible will ensure that you have 5 servers, in total, and delete all the others
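the danger is in declarative “ensure exactly N” semantics - here’s a rough sketch of that convergence logic in boto3 rather than Ansible (the tag names, region and launch template are invented for the example; the point is the final branch):

```python
# Converge on "exactly N servers with this tag" - which also means terminating
# anything above that number. Tag names, region and launch template are made up.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def ensure_exact_count(cluster_tag, desired, launch_template_id):
    result = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Cluster", "Values": [cluster_tag]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )
    instance_ids = [
        i["InstanceId"]
        for reservation in result["Reservations"]
        for i in reservation["Instances"]
    ]

    if len(instance_ids) < desired:
        missing = desired - len(instance_ids)
        ec2.run_instances(
            LaunchTemplate={"LaunchTemplateId": launch_template_id},
            MinCount=missing,
            MaxCount=missing,
        )
    elif len(instance_ids) > desired:
        # The surprise: "give me 5 servers" means 5 in total,
        # so every existing server beyond that gets terminated.
        ec2.terminate_instances(InstanceIds=instance_ids[desired:])
```

Ansible’s exact-count style parameters behave the same way, which is the trap we fell into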

Slide 42

The one where we decommissioned all our production servers the first we knew of this was when all of our alerts went off, and I heard a very quiet “oh no” from my friend sitting next to me we had a brief outage, but our automatic failover system quickly kicked in, and switched across to our backup cluster DRINK

Slide 43

because we already had a plan for how to spin everything back up from scratch and had practiced it regularly it wasn’t too long before we were back to normal operation we definitely wouldn’t have wanted to find out that our disaster recovery process didn’t work at that point in time…

Slide 44

Break things, and see what happens. Did your systems do what you expected? @efinlay24 #DevReach2019 As an extension of that - encourage teams to actively break things, and check that our services behave correctly the Simian Army is probably the most well-known automated example, but you can start off by doing this manually in a more controlled way, too

Slide 45

The Planned Datacenter Disconnect We ran a planned disaster recovery test, disconnecting the network to one of our datacenters we pulled the plug, and the monitoring dashboard lit up with red lights - just as we expected we then tried to fail our systems over to the healthy datacenter It was at that point we found that a critical part of the failover system didn’t work, if one of the datacenters was offline… we were glad we found that before a real problem occurred

Slide 46

We got complacent, and stopped running datacenter failure tests… @efinlay24 #DevReach2019 however, the last time that we ran one of those tests was a couple of years ago while we’ve migrated the majority of our systems to the cloud, there are still a few important services that run out of our 2 datacenters you can probably guess what the next slide is going to be…

Slide 47

The Unplanned Datacenter Disconnect because late last year, we had a live test of what exactly happens when one of our DCs drops off the network naturally, it happened on a weekend in the middle of August, when lots of people were travelling or on holiday

Slide 48

our incident lead was coordinating everything from her phone, while in a car on the motorway (don’t worry, she wasn’t driving)

Slide 49

I myself was hiking in the Romanian hills I was lucky to find a cafe where I was able to sit down, get online, and help investigate the issue with a beer

Slide 50

Have a central place for reporting changes and problems. @efinlay24 #DevReach2019 at the FT, we have a chat channel that anyone can join, which we use to: communicate changes that are in progress, and to report potential issues or problems that are happening as you can imagine, this channel got VERY busy during the outage, with lots of reports across the business

Slide 51

while the failover process took longer than we’d have liked, we got everything stabilised, without any major business impact later, we found that the problem had been caused by another customer in the datacenter accidentally cutting the fiber connection providing internet access to all of our servers

Slide 52

We’re not perfect. But we always try to improve. @efinlay24 #DevReach2019 we should have followed our own advice if we’d practiced regularly, we would have been much more prepared when the failure happened that said, we still kept everything running that week, without any major business impact and most importantly - we learnt from our experience, and used it to improve

Slide 53

Response https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents one new way that we’re improving our processes is with Response this is an incident management Slackbot which was recently open-sourced by Monzo they’re a digital mobile-only bank based in the UK

Slide 54

An easy way to report technology problems that could affect the business. https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents Response gives us a quick and simple way for anyone to report an incident, by typing /incident into any of our Slack channels

Slide 55

https://github.com/monzo/response/ the user fills out a short form asking for details about the problem, and then it notifies our incident channel so that we can quickly get the right people involved we’ve found it extremely helpful, and it was fairly easy to get set up you can try it out yourself - it’s available on GitHub here
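Response itself is a Django app, so this isn’t its code - but as a minimal sketch of the underlying pattern (the Flask endpoint, channel names and Slack token setup are assumptions for the example), a /incident slash command handler boils down to something like:

```python
# A sketch of a Slack slash command that spins up a dedicated incident channel
# and announces it in a central channel. Not Monzo's Response implementation.
import os
import time

from flask import Flask, request
from slack_sdk import WebClient

app = Flask(__name__)
slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

@app.route("/slack/incident", methods=["POST"])
def declare_incident():
    summary = request.form.get("text", "unknown issue")
    reporter = request.form.get("user_id")

    # Create a dedicated channel for this incident, e.g. #inc-1570000000
    channel = slack.conversations_create(name=f"inc-{int(time.time())}")
    channel_id = channel["channel"]["id"]

    slack.conversations_invite(channel=channel_id, users=[reporter])
    slack.chat_postMessage(
        channel="#incidents",  # the central reporting channel
        text=f"<@{reporter}> declared an incident: {summary} - join <#{channel_id}>",
    )
    return "Incident declared - check the new channel.", 200
```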

Slide 56

https://www.ft.com/tech-principles/ another recent change is to embed operability into our tech principles this is really helpful, as it clearly sets the expectations of what we consider important at the FT and what we expect from our engineering teams I really like this one, especially as someone who is often on the sharp end of production incidents

Slide 57

so, those are some ideas that you can think about and work on with your teams before something goes wrong

Slide 58

The Ghosts of Incidents… Future > Present Past but something’s happened - alerts have gone off, and we’ve been called or been asked to investigate what are the first steps that we should take?

Slide 59

Calm down, and take a deep breath. It’s probably ok. @efinlay24 #DevReach2019 encourage your teams to take a step back and assess the situation. Dealing with incidents is stressful - but it’s important to remind ourselves that it’s not the end of the world. for most of us, if our website goes down, or a service fails, it’s usually not catastrophic in the grand scheme of things

Slide 60

Don’t dive straight in. Go back to first principles. @efinlay24 #DevReach2019 It’s always tempting to immediately jump in and start trying to solve the problem but generally speaking, there’s a set of questions that I’ll always ask myself, before digging into an issue further

Slide 61

What’s the actual impact? @efinlay24 #DevReach2019 what’s the actual business impact? at the FT, our most critical considerations are: can the journalists publish content? can customers access the website? DRINK

Slide 62

“All incidents are equal, but some incidents are more equal than others.” George Orwell, probably. @efinlay24 #DevReach2019 A problem preventing the news from going out is a huge issue, and we’ll immediately get multiple people investigating ASAP. however, if the self service system that allows users to change their passwords breaks during the weekend? that’s not a big deal, and we’ll fix it during office hours

Slide 63

What’s already been tried? @efinlay24 #DevReach2019 has anything already been done by other people? Get as much information as possible - vague details can sometimes hide the actual problem

Slide 64

“I’ve restarted it” << what’s it? the web service? the server? or maybe their own laptop?

Slide 65

Is there definitely a problem? @efinlay24 #DevReach2019 confirm the problem definitely exists people can report things like “the website is slow” which can mean anything from “a badly written database query is causing timeouts” to “my laptop has decided to download updates, and I’m on a spotty wifi connection” or maybe CLICK [ the monitoring system is broken ]

Slide 66

the monitoring system is broken and has started firing false alerts to everyone - that’s happened to us before let’s assume there is indeed a problem

Slide 67

What’s the Minimum Viable Solution? @efinlay24 #DevReach2019 what’s the least amount of effort we can spend to get back online? depending on what our issue is, this is often higher priority than fixing the root cause

Slide 68

Get it running before you get it fixed. @efinlay24 #DevReach2019 For example: Can we just fail over? Can we just roll back a release? Can we just restore a snapshot?

Slide 69

Go back to basics. @efinlay24 #DevReach2019 if there isn’t a simple way to restore service, we’ll need to investigate This will entirely depend on your system architecture, and the issue - but good starting points tend to be: checking the logs and monitoring have you run through the steps in the runbook? was there a new release, or other planned work around the time the problems started? CLICK [ are there other known issues or outages happening outside of our control? ]

Slide 70

are there other known issues or outages happening outside of our control? for example, when the Dyn denial of service attack broke the internet back in 2016 or issues with AWS in the past, where services have gone down for a whole region

Slide 71

Let’s assume that it’s not simple to solve… we’ve done our initial investigation we’ve tried our standard recovery solutions we’re still stuck

Slide 72

Don’t be afraid to call for help. @efinlay24 #DevReach2019 That’s ok! encourage people to call for backup, sooner rather than later it’s important to have a culture of psychological safety, where teams feel comfortable asking for help we can’t always fix everything on our own - and that’s to be expected

Slide 73

Psychological safety is paramount. What makes an effective team at Google? https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/introduction/ Google and re:work recently released the results of Project Aristotle this was a research project conducted over several years, aiming to identify what makes an effective team at Google

Slide 74

https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/introduction/ the most important factor they found was that effective teams had a strong sense of psychological safety they were able to take risks, admit when they didn’t have answers, and felt comfortable asking for help I definitely recommend reading the full report - there is some really useful information there along with tools that you can use to measure and improve team safety

Slide 75

https://www.youtube.com/watch?v=LhoLuui9gX8 Amy Edmondson presented a TED talk on this topic as well she outlines 3 steps that everyone can take to build psychological safety within a team:

Slide 76

  1. Opportunities, not failures. https://www.youtube.com/watch?v=LhoLuui9gX8 first, frame problems as an opportunity to learn, not as a failure on the individual’s part

Slide 77

1. Opportunities, not failures. 2. Acknowledge our own fallibility. https://www.youtube.com/watch?v=LhoLuui9gX8 second, be open and honest with our teams when we make mistakes, or when we don’t know the answer

Slide 78

1. Opportunities, not failures. 2. Acknowledge our own fallibility. 3. Model curiosity, ask questions. https://www.youtube.com/watch?v=LhoLuui9gX8 and finally, demonstrate and encourage curiosity - make sure you ask questions publicly it’s a short 10 minute talk, which I recommend checking out on the topic of calling for help: in a previous company that I used to work for, an alert went off on a Sunday afternoon, warning that the aircon in our office server room had stopped working

Slide 79

The One Where a Director Falls Through the Ceiling our Tech Director saw this alert, and popped in to fix it - they lived near the office however, they didn’t know the new security code for the server room door at this point, most people would have called us to get the new code instead, because it was an emergency, they decided to crawl through the false ceiling of the office to get into the server room in case you’re not aware, most office ceilings aren’t designed to support the weight of a person

Slide 80

The One Where a Director Falls Through the Ceiling it gave way, and they fell into the server room, which ended up causing us a lot more problems than just a broken aircon unit it turns out that wild servers are easily startled - they don’t like having people fall on top of them unexpectedly and they definitely don’t like breathing in several years of accumulated ceiling dust DRINK

Slide 81

(it didn’t look like this) In their defence, our director’s reasoning was that they didn’t want to disturb us on a weekend because they could fix it themselves, which I can respect but it would have been MUCH better for everyone if they’d just asked for help, and we could have fixed the problem together without needing to replace the ceiling afterwards

Slide 82

Communication is key. Especially to our customers. @efinlay24 #DevReach2019 this leads into my next point, which is that communication is really important communication underpins almost everything that we do in technology and software development - it’s one of the most valuable skills any engineer can have

Slide 83

who here saw Jody Davis’ amazing keynote this morning? one thing she said really stood out to me: roughly 25% of the work we do is technical - the other 75% is teamwork, collaboration and communication so much of what we do relies on frequent, open and honest communication - whether that’s with our teammates, our stakeholders, or our users it’s a huge problem when there’s an ongoing incident, but nobody is sure what the status is

Slide 84

even though everything might be on fire, we still need to communicate with the business this is quite difficult when our team is focused on actually fixing the problem, so CLICK [ someone needs to take the role of incident lead ]

Slide 85

Designate an incident lead. @efinlay24 #DevReach2019 someone needs to take the role of incident lead, which frees everyone else to dive deeper into the problem without multitasking they’re responsible for providing regular status updates, coordinating the investigation and preventing interruptions to the people fixing the issue Beth Long and Elisa Binette did a really good talk at Velocity last year, about the incident command role at New Relic I recommend checking out the video - I’ll share the link on Twitter afterwards

Slide 86

https://blog.newrelic.com/engineering/on-call-and-incident-response-new-relic-best-practices/ there’s also a great blog post by Beth about how they do on call and their wider incident response processes I found it to be really useful, and you can use this as a step by step guide to improve things with your teams too

Slide 87

having alerts & notifications in your chat channels can be useful but during an outage, it can make a channel impossible to use for discussion

Slide 88

and when you’re trying to fix a problem with multiple people, it can often end up like this so we need somewhere to coordinate the investigation

Slide 89

Create a temporary incident channel. @efinlay24 #DevReach2019 having a single temporary space helps the incident lead to keep everyone on the same page and it’s valuable to use as a timeline later, to see what actions have been taken

Slide 90

Make sure that everybody shares information and reports what they’re doing you don’t want two people making conflicting changes, and potentially making the problem worse Monzo’s Response slackbot makes it really easy for us to do this at the click of a button, we can spin up a new channel dedicated to this incident

Slide 91

it also provides a page for the incident, where we can pin important messages and information to the timeline this makes it easy for people joining mid-incident to quickly catch up, and get our current status

Slide 92

If you think you’re over-communicating, it’s probably just the right amount. @efinlay24 #DevReach2019 I’ve mentioned communication already, but it’s so important Provide high-level updates on a regular basis just to let people know the problem is still being worked on

Slide 93

Tired people don’t think good. @efinlay24 #DevReach2019 when people are tired and extremely stressed, we all make mistakes Make sure everyone takes breaks, especially if the problem is long-running otherwise people will be less effective, or accidentally make things worse depending on the duration, this may even involve rotating in shifts, or handing over to other teams

Slide 94

Sometimes we have to leave things broken. @efinlay24 #DevReach2019 but at some point we have to make a call about whether we still keep working on the problem if we can mitigate the business impact, it may be better for everyone to go home, rest and carry on the next day rather than work through the night

Slide 95

the longest running incident I’ve been part of was when our EU content cluster started failing due to CPU load, just as everyone was about to leave for the day we did some initial investigation, then switched all of our traffic to the healthy US cluster which then started failing as well… we spent the next five hours investigating, and attempting to get our clusters stable we were completely exhausted, and struggling for ideas

Slide 96

The one where we had to serve traffic from staging eventually, our director of engineering suggested routing traffic through our staging environment, and manually editing configuration files to pull data from our old legacy platform by the time we managed to reliably serve traffic, it was around midnight

Slide 97

we continued the investigation again the next day and eventually identified the root cause as an update to a query which overloaded our databases this had a cascade effect on our other services, eventually causing the cluster to collapse

Slide 98

It wasn’t great, but it wasn’t the end of the world. @efinlay24 #DevReach2019 it wasn’t the best situation to be in, but the ft.com website team are great, and design their systems to fail gracefully in situations like this we served some stale content during the outage, but there was zero downtime to our customers

Slide 99

so there’s some tips for how to deal with incidents in progress

Slide 100

The Ghosts of Incidents… Future Present > Past what do we need to do once the dust has settled, and we’re back online?

Slide 101

Congratulations! You survived. It probably wasn’t that bad, was it? @efinlay24 #DevReach2019 encourage everyone involved to take some time out, for their mental health Incidents are stressful, and if people have been working out of hours, they need to take some time to recover

Slide 102

Run an incident review with everyone involved. Nobody died, so it’s not a post-mortem. @efinlay24 #DevReach2019 At DevOpsDays in London last year, Emma Button suggested using the term “incident review” instead of “post-mortem”, which I quite like the objective isn’t to point fingers and assign blame but it’s an opportunity to discuss what worked, what didn’t, and what can be improved for next time Do it soon afterwards, otherwise people will forget the details, and move on to other work

Slide 103

Incident reports are important. @efinlay24 #DevReach2019 incident reports are valuable, and this is where keeping a timeline comes in handy it’s useful to log what happened and how we fixed previous problems, so we can refer back to them in the future DRINK CLICK [ XKCD ]

Slide 104

there’s nothing worse than having a production issue, and someone says: “oh, it’s exactly like when this happened last year!” but nobody can remember what was done to fix it make sure the solution gets written up

Slide 105

we don’t make our reports public, but we do share them internally across our tech teams we use the issues page on a GitHub repository for this - it makes it easy for us to draft and collaborate on reports as well as filter and search on historical report data
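keeping reports as issues also means they’re searchable from scripts - a hedged sketch using GitHub’s issue search API (the org/repo name and label are placeholders, not our real repository):

```python
# Search past incident reports kept as GitHub issues, so "this looks like
# last year's outage" can actually be found again. Repo and label are made up.
import requests

def search_incident_reports(keyword, repo="example-org/incident-reports"):
    response = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"repo:{repo} label:incident {keyword}"},
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    response.raise_for_status()
    return [f"{item['title']} - {item['html_url']}" for item in response.json()["items"]]

for report in search_incident_reports("database CPU"):
    print(report)
```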

Slide 106

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ a great incident report example is from when GitLab had a fairly serious outage back at the start of 2017 to briefly summarise, GitLab were investigating a load issue on their production database there were a number of unfortunate events that compounded the original problem and eventually led to an engineer accidentally deleting production data

Slide 107

“Until a restore is attempted, a backup is both successful and unsuccessful.” Erwin Schrödinger? @efinlay24 #DevReach2019 to make things worse, they then found that their database backups had been failing silently for some time which meant they suffered permanent loss of customer data

Slide 108

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ what really impressed a lot of people though was how they handled the outage they released a very open, honest incident report to the public, with a detailed timeline of what happened, and why

Slide 109

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ they also made their recovery process public, and livestreamed it on YouTube I’m not suggesting that everyone should do this personally, I don’t think a livestream would have a good mental impact on the teams trying to fix the problem but it did show that GitLab were committed to keeping their customers up to date with the status of the recovery

Slide 110

Identify what can be improved for next time. @efinlay24 #DevReach2019 follow-up actions are the most important part of this process they can cover lots of things - and not all of them may be technical or code-related our response plans are never perfect when we start out, and they should be improved over time

Slide 111

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ Maybe the escalation process needs to be improved perhaps the documentation was out of date or there are bugs that need to be fixed again, GitLab shared these actions publicly so that their customers could see their progress and status I definitely recommend checking the full report out it’s a really interesting read

Slide 112

https://monzo.com/blog/2019/06/20/why-bank-transfers-failed-on-30th-may-2019 I’m going to briefly mention Monzo again, who I talked about earlier in relation to the incident response Slackbot they believe in transparency, and are great at releasing detailed and clear writeups of what happens behind the scenes when something goes wrong

Slide 113

https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on-july-29th if you’re slightly strange like me and enjoy reading other people’s incident reports, I definitely recommend checking them out I find it’s a really interesting insight into how they run and operate their systems, especially as finance is an industry that I’ve not personally worked in before

Slide 114

Nearly the end. Don’t clap yet. @efinlay24 #DevReach2019 so that’s pretty much everything I wanted to talk about today - I hope you’ve found it interesting and useful hopefully, you have some ideas of how to help your teams to cope the next time something breaks

Slide 115

Feedback is welcome. https://www.telerik.com/devreach/day2feedback it’s my first time speaking at DevReach, and I’d love your help to make this talk even better it would be great if you could submit feedback to the conference organisers by scanning this QR code or go to telerik.com / devreach / day2feedback plus I’m happy to take questions, feedback, or hear stories of your own after this talk I’ll be around for the rest of the conference :)

Slide 116

Failure is inevitable. And that’s ok. @efinlay24 #DevReach2019 to sum up, problems and outages can happen anywhere, at any time, and are just another part of what we deal with in technology we can’t prevent them happening entirely - but it’s how we help our teams plan for them, respond to them, and then improve things afterwards that makes the difference.

Slide 117

The end. “Please clap.” Jeb Bush, 2016 @efinlay24 #DevReach2019 thank you

Slide 118

We’re hiring in Sofia! https://ft.com/dev/null/ @efinlay24 euan.finlay@ft.com clap clap clap