Switching Horses Midstream: The challenge of migrating 150 microservices to Kubernetes

A presentation at iQuest 2.0 in October 2018 in Cluj-Napoca, Romania by Euan Finlay

Slide 1

Switching Horses Midstream: The challenge of migrating 150 microservices to Kubernetes. Euan Finlay, @efinlay24. Hi! Thank you for the introduction. Back in 2015, the Content team at the Financial Times were having problems. We had been struggling with the stability of our containerised platform, and our developers were constantly firefighting, which was increasing their stress levels and dropping morale across the team. How did we get into that situation? And how did we manage to turn things around, with the help of iQuest?

Slide 2

@efinlay24 We often give technical talks as though our story was a journey: we set out to reach a goal, we faced some difficulties along the way, but we got to our destination and everything worked out. I don't think that's what it's really like - I think it only looks that way because we're looking back after the event. In reality, we don't necessarily know exactly where we're going. We have a destination in mind, but it's more like we've heard of a great place to travel to, and it's somewhere in this direction.

Slide 3

@efinlay24 In addition, we often don't start out from home. It's rare to be starting completely from scratch on a project, and even when we're working on something new, we only get to have no baggage right at the very beginning. The previous decisions we've made have a big impact on what we can do - and they can make things tricky later on.

Slide 4

/usr/bin/whoami @efinlay24 I'm a Senior Integration Engineer at the Financial Times. I originally started off in desktop support, did some time as a Linux sysadmin, and now work in the DevOps area. Currently I lead a team at the FT who help to support and maintain the backend Content APIs and services.

Slide 5

/usr/bin/whodoiworkfor No such file or directory. @efinlay24 Although we're best known as a newspaper, we're primarily a digital content company. Last year was the first time revenue from our digital subscriptions overtook both the physical paper and advertising.

Slide 6

https://www.ft.com Our content and our website are absolutely critical to our survival. We invest heavily in technology, and we have many teams working across different areas of the business. At the FT, we're big believers in DevOps practices and empowered engineering teams, trusting them to make the best decisions around technology, architecture, and delivery. As part of that, our teams fully own, run and support their services - from the very beginning to the end of the product lifecycle.

Slide 7

https://www.ft.com https://www.iquestgroup.com iQuest have been a key part of our digital transformation journey - we've partnered closely with them over the last 10 years. Roughly half of the developers and engineers on the Content platform are from iQuest, and their development expertise and domain knowledge have been instrumental to many of the improvements and changes we've made along the way.

Slide 8

@efinlay24 Our content platform itself is made up of many microservices, written in Go and Java and running in Docker containers. We're cloud-first where possible, and the majority of our infrastructure is hosted on Amazon Web Services.

Slide 9

@efinlay24 A high-level view looks like this: we take content from multiple systems and transform it into a common format. We annotate that content via natural language processing, then load information about the millions of concepts that we use in that annotation process. Finally, we make it all available via APIs to our clients - both internal and external. In practice, though, it looks something more like...
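Before the diagram on the next slide, a quick illustration of what "transform it into a common format" means in practice. This is a hypothetical Go sketch - the type and field names are invented for this example, not the FT's real content model - showing content from one source system being normalised into a single internal shape that the annotation step and the APIs can then rely on.

```go
package content

import "time"

// Content is the common internal representation, whichever system it came from.
type Content struct {
	UUID        string
	Title       string
	Body        string
	PublishedAt time.Time
	Annotations []Annotation // concepts attached during the NLP annotation step
}

// Annotation links a piece of content to a concept (a person, organisation, topic, ...).
type Annotation struct {
	ConceptID string
	Predicate string // e.g. "mentions" or "about"
}

// SourceAPayload stands in for whatever one upstream publishing system sends us.
type SourceAPayload struct {
	ID        string
	Headline  string
	BodyXML   string
	Published time.Time
}

// FromSourceA maps one source system's payload into the common format.
// Each source system would get its own small mapper like this.
func FromSourceA(src SourceAPayload) Content {
	return Content{
		UUID:        src.ID,
		Title:       src.Headline,
		Body:        src.BodyXML,
		PublishedAt: src.Published,
	}
}
```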

Slide 10

@efinlay24 This - which is only half of our service architecture diagram. You won't be able to see any detail here, because it's really complex. Because we use a microservice architecture, our system is made up of around 150 small services, each with a single responsibility.

Slide 11

Running highly available services is complicated. @efinlay24 We need resilience and availability, which means we need multiple copies of the same service running on different underlying hardware. For us, that means around 600 containers - multiple copies of each of our roughly 150 services - running on a much smaller number of Amazon instances. This is where cluster orchestrators come into play: it's impossible to manage this number of services manually, so we need automated deployment, scaling, and management of our containerised applications.

Slide 12

2015 > Docker in production. 2016 2017 2018 We started running Docker containers in production in mid-2015, which at that point was pretty cutting edge. At the time, there were several competing products, but it would have been challenging to put them all together into a working platform that fit our needs. Instead, we ended up building a cluster orchestrator ourselves.

Slide 13

https://blog.gardeviance.org/2014/03/on-mapping-and-evolution-axis.html Simon Wardley is an IT strategist who gave an excellent talk this year at KubeCon. He talks about how successful technology goes through a maturity curve: it starts off as something for experts, becomes more common, turns into a product you can buy, and eventually becomes a commodity. A good example from a long time ago is electricity - we wouldn't dream of building our own power stations now. A more recent example is the evolution of the cloud: these days, computing is a commodity. Back in 2015, Docker and containerisation were still far over on the left-hand side of that curve.

Slide 14

Spend your innovation tokens wisely. http://mcfunley.com/choose-boring-technology There's another great blog post, from Dan McKinley of Etsy, about how to approach building technology solutions. He describes a limit on the amount of new and innovative work you can do at any one time, which he calls innovation tokens. Imagine you have a limited supply of these tokens in your wallet, and you get to choose what you spend them on. Use a brand new database technology? You're spending an innovation token. Choose a new programming language? That's another token. We definitely had to spend our tokens to build our own orchestration platform - so why did we do it?

Slide 15

For us, the benefits outweighed the risks. @efinlay24 Well, for us, containerisation offered some big advantages compared to our old systems. Before we introduced containers, we were running each service on its own instance, which is a really inefficient use of resources: individual microservices don't use much memory or CPU, and the costs quickly add up.

Slide 16

Our AWS costs dropped by around 40%. @efinlay24 After we introduced containers, we could easily run lots of services on each instance. We ended up running our entire production platform across a handful of large instances per region, instead of several hundred small VMs. That was about a 40% reduction in our AWS costs.

Slide 17

It was much easier to build and deploy new services. @efinlay24 Before, we had to provision infrastructure for each service, to give us staging environments and cross-region resilience. On our containerised platform, deploying a new service to production didn't involve any new infrastructure: we only had to write a single service file, and deployment just needed a pull request in GitHub. It was much quicker and less error-prone, and allowed us to experiment more with our services.

Slide 18

However, supporting a home-built platform can be difficult. @efinlay24 Those were great benefits, but the gains we got were offset by some other things. Because we built the platform ourselves, we had nowhere to go for help when something went wrong, and documentation of the decisions that were made, and why, was very rarely a priority.

Slide 19

It's even harder when you don't fully understand how it works. @efinlay24 In our case, most of the containerisation work was done by a small group of developers and engineers. The key members of that group all moved on from the FT over the course of a few months, which left us with a serious knowledge gap. When that happened, we ended up supporting something we didn't completely understand.

Slide 20

Choose boring technology. http://mcfunley.com/choose-boring-technology The flip side of innovation tokens is that we should choose boring technology most of the time. Boring isn't a bad thing: boring technologies are well understood and easily supportable. You want to innovate on the things that differentiate your company from your competitors - but the platforms you run those things on? You want those to be boring.

Slide 21

2015 2016 > Tools started maturing. 2017 2018 When we started using containers, we didn't have a boring alternative to building our own platform. But by the end of 2016, there were people successfully running off-the-shelf cluster orchestrators in production.

Slide 22

https://blog.gardeviance.org/2014/03/on-mapping-and-evolution-axis.html Things were moving from custom-built to product, and even towards commodity, and we had to take advantage of that - because once something is available as a product, we only want to build it ourselves if it's absolutely core to our business.

Slide 23

We are not a cluster orchestration company. @efinlay24 The FT is not a cluster orchestration company. We're a digital news organisation, and that's where we have to focus our innovation, where possible.

Slide 24

In late 2016, we started to investigate alternatives. @efinlay24 We ran a workshop over a few days to evaluate our options.

Slide 25

Metric for success #1: Reduce the amount of time spent keeping production healthy. @efinlay24 We agreed two metrics in the workshop. The first was to reduce the amount of time spent supporting the platform: when things went wrong in the internals of our stack, it was often hard to diagnose and fix the problem, and worse, there was no one else we could escalate it to. The second metric is amusing, but totally serious.

Slide 26

Metric for success #2: Reduce the number of sarcastic comments on Slack. @efinlay24 You remember I mentioned at the start of this talk that team morale had dropped very low? Developers were venting their frustration, joking about how moving platform would fix all of our problems.

Slide 27

@efinlay24 We were all really fed up with the number of production incidents, and worse, people in other areas of the business were starting to comment on the number of problems we were having as well.

Slide 28

We chose Kubernetes. @efinlay24 After a few days evaluating our options, we picked Kubernetes. We preferred it to the alternatives we assessed, and it fit all our requirements. We also liked that multiple cloud providers were starting to support k8s, and we were hoping it would become an emerging standard.

Slide 29

https://twitter.com/lizrice/status/828872836777385984 Early in 2017, Liz Rice ran this poll. It looked like people were using k8s successfully, and that we'd be able to learn from their experiences. The later announcement of Elastic Kubernetes Service by Amazon was a nice confirmation that we were thinking on the right lines. Today, at the end of 2018, all of the main cloud providers have a managed k8s offering, and the future of the ecosystem looks healthy.

Slide 30

Using leading edge technologies requires you to be comfortable with change. @efinlay24 At the FT we often work at the cutting edge of technology. There are a lot of benefits to this - for example, our move to microservices and continuous deployment five years ago took us from 12 releases per year to over 2,000. But sometimes being at the leading edge means you need to change, because you tried something and it didn't work out, or because the area has matured and you can now buy a product that gives you the same benefits. That will happen from time to time - and it's not necessarily a bad thing.

Slide 31

[Switching horses midstream image] Even though we had good reasons to migrate, it was still a major challenge. We don't really want to switch horses midstream - we've got enough other projects going on.

Slide 32

2015 2016 2017 > Kubernetes migration begins. 2018 In our case, we had lots of services in production and under active development. Prior to the migration, we had around 150 microservices, there were five other teams delivering new functionality, and all of that work had complicated dependencies.

Slide 33

Lots of other work going on at the same time. @efinlay24 This is a map of the different work streams we had back in 2017. Like before, I'm not expecting you to be able to see the detail - and this is actually only a third of the full diagram. It was extremely complex, and our migration team had to make sure they didn't impact any of it.

Slide 34

Running in parallel complicates things further. @efinlay24 If we had been starting from scratch, we would have built the platform, then moved services across a few at a time. However, we had to know that k8s could support the complicated routing and failover logic of our stack, which meant migrating a large proportion of our services simultaneously to gain confidence. That took time, which meant we ran both our old and new platforms in parallel for quite a while.

Slide 35

During our parallel run, there were over 2000 code releases. @efinlay24 We also had to be careful that our developers could keep working normally. For example, if we had made a change that added just 10 minutes to each deployment, then over the course of 2,000 releases that works out to around 47 working days (2,000 × 10 minutes is roughly 333 hours, or 47 seven-hour days). So what was actually involved in the migration process?

Slide 36

https://helm.sh/ The first step was to replace our old service descriptor files with Helm charts - Helm being a package manager for Kubernetes. These files were pretty standard across our microservices, so this didn't take too long.

Slide 37

Integrating the service into a templated Jenkins pipeline. The second step was to integrate each service into a Jenkins pipeline for builds and deployment. Again, because this was heavily templated, only small changes were needed for each service. So, great - neither of those two things was especially difficult.

Slide 38

Each individual change wasn't huge, but 150 small changes add up quickly. @efinlay24 But while the changes to each service weren't that large, spending just 30 minutes on each of 150 microservices is equivalent to 10 working days (150 × 30 minutes is 75 hours).

Slide 39

Unfortunately, we discovered a lot of broken things... @efinlay24 And in reality, it took a lot longer than 30 minutes per service to make those changes, deploy in parallel, and then test. There were several reasons for this.

Slide 40

Some services hadn't been built for a very long time. @efinlay24 Some of our microservices had been working happily, unchanged, for years. That meant that when we did need to change them, we were hit with unexpected problems. We found that some of our services were pulling in a couple of years of package updates, or worse, that some packages had changed entirely, which completely broke our builds. We had to fix these problems up and make sure they didn't happen again in the future.
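One way to keep rebuilds predictable - offered here as a general illustration rather than the approach the FT took - is to pin dependency versions explicitly. For a Go service that might be a go.mod along the lines of the hypothetical sketch below (the module path and dependencies are invented), where exact versions plus the checksums in go.sum mean a rebuild years later pulls the same code as the original build, instead of silently picking up years of upstream changes.

```go
// go.mod - hypothetical example
module github.com/example/content-service

go 1.11

require (
	github.com/gorilla/mux v1.6.2     // pinned: rebuilds always use this exact version
	github.com/sirupsen/logrus v1.0.6 // upstream renames are the kind of change that otherwise break old builds
)
```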

Slide 41

Nightly builds can help, even if you don't deploy them. @efinlay24 Other teams at the FT build all of their services every night, which means you pick up build problems as they arrive. It's worth doing, because you don't want to have to resolve lots of build problems just to get a critical bug fix or security fix out to production.

Slide 42

Not all of our service health endpoints worked correctly. @efinlay24 We also came across services where healthcheck or good-to-go endpoints hadn't been implemented correctly, or were missing entirely. That hadn't caused us problems before, but k8s relies on these endpoints to know when a service is ready to receive traffic, or when to step in if it's unhealthy. We had to make sure these endpoints were working as intended, otherwise they would have caused us real problems later on down the line.
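To make that concrete, here is a minimal Go sketch of the kind of endpoints involved. The paths and checks are illustrative assumptions rather than the FT's actual standard; the idea is that a Kubernetes liveness probe points at the healthcheck and a readiness probe points at the good-to-go endpoint.

```go
package main

import "net/http"

// dependenciesReachable stands in for whatever checks a real service needs,
// e.g. "can I reach my database or message queue?". Hypothetical helper.
func dependenciesReachable() bool {
	return true
}

func main() {
	mux := http.NewServeMux()

	// Healthcheck: "the process is up and not wedged". A liveness probe pointed
	// here tells Kubernetes when to restart the container.
	mux.HandleFunc("/__health", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"ok": true}`))
	})

	// Good-to-go: "it's safe to send me traffic". A readiness probe pointed here
	// means Kubernetes only routes requests to the pod while this returns 200.
	mux.HandleFunc("/__gtg", func(w http.ResponseWriter, r *http.Request) {
		if !dependenciesReachable() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", mux)
}
```

Getting the semantics right matters: a readiness endpoint that always returns 200 means Kubernetes will happily route traffic to a pod that can't actually serve it.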

Slide 43

Some of our services didn't restart gracefully. @efinlay24 On a similar note, when older services started up, they expected their dependencies to be ready and waiting, which wasn't always the case. That didn't matter when our containers weren't restarted very often, but k8s moves and restarts services much more frequently, so we had to make sure our services could cope.
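As an illustration of the kind of change needed, here is a hedged Go sketch of startup code that retries a dependency with backoff instead of exiting on the first failure; the address and timings are made up for the example.

```go
package main

import (
	"errors"
	"log"
	"net"
	"time"
)

// waitForDependency keeps trying to reach a dependency (here just a TCP dial,
// standing in for a database or another service) until it succeeds or we give up.
// Under Kubernetes a pod may start before its dependencies, so exiting on the
// first failure just causes a crash loop.
func waitForDependency(addr string, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	wait := time.Second

	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil // dependency is up, carry on starting
		}
		if time.Now().After(deadline) {
			return errors.New("dependency still unavailable: " + err.Error())
		}
		log.Printf("dependency %s not ready (%v), retrying in %s", addr, err, wait)
		time.Sleep(wait)
		if wait < 30*time.Second {
			wait *= 2 // simple exponential backoff
		}
	}
}

func main() {
	// Hypothetical dependency address; in a real service this would come from config.
	if err := waitForDependency("db:5432", 2*time.Minute); err != nil {
		log.Fatal(err)
	}
	// ...only start serving traffic (and passing readiness checks) once dependencies are up.
}
```

Combined with a readiness endpoint like the one above, the service simply reports "not ready" until its dependencies appear, rather than falling over.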

Slide 44

We always want to improve things. @efinlay24 And finally, when we started working on these services that hadn't been looked at for a while, it was really tempting to start fixing all the other little problems we came across and adding small improvements - and that can really spiral out into a lot of extra time.

Slide 45

We had to get everyone involved. @efinlay24 All of those things add up, so each service migration ended up taking far more than 30 minutes. We got every developer across all of our teams involved, and we spent a few days focused on migrating services across. This was partly because we needed everyone on the team to understand the new platform, and partly because we didn't want the people dedicated to the k8s migration to have to spend a month doing the same repetitive tasks over and over.

Slide 46

Feedback from our teams was essential. @efinlay24 It was slow going, but it got us good feedback from everyone, and helped them understand the new tools. It definitely had an impact on the other work those teams should have been doing, but we were able to explain to our product owner why it was a necessity.

Slide 47

We should have swarmed on the work for longer. @efinlay24 With hindsight, if we were to do things differently, the single biggest change would have been for everyone to swarm on the work from the beginning, not just for a few days. We paid a price for the size of the migration team - it was small, because we hadn't agreed as much funding for this project as we would have liked.

Slide 48

Running in parallel increased our release overheads. @efinlay24 In addition, running in parallel for several months brought its own challenges: the longer we ran, the longer we had to release code to two stacks, then test against two stacks.

Slide 49

...and also increased our AWS costs. @efinlay24 It increased our monthly AWS bills as well - which was fine for a short duration, but we had to make sure we communicated why our costs had increased to the people who needed to know. Our costs would have been even higher if we hadn't turned off parts of our old platform to save money.

Slide 50

Not just AWS costs, either. @efinlay24 It wasn't just our runtime costs that were affected, though. We were running load and soak tests against both our old and new platforms, which generated a lot of logs, and all of those logs got sent to Splunk, our log aggregator. Doubling the log volumes actually blew through our license limit a couple of times, which didn't make us very popular across the business.

Slide 51

https://www.youtube.com/watch?v=sJx_emIiABk Alice Goldfuss gave a great talk at Lead Dev London this year, called The Container Operator's Manual. It covers a lot of the things we wish we'd known before starting our migration project, and talks about the extra problems and considerations that people often don't realise. If your company is considering running containers in production, I'd say it's a must-watch.

Slide 52

Platform migration is a marathon, not a sprint. @efinlay24 In total, the migration to k8s took us just over a year. From speaking to other companies that have been through similar projects, that's about normal. However, our original estimates were far lower than this, and we hadn't planned for a lot of the additional work that we uncovered as we went. Which brings me to this...

Slide 53

Don't underestimate the time and resources required to migrate to Kubernetes. @efinlay24 ...which is the one piece of advice I would give to anyone considering implementing k8s in production. It will take time, it will take investment, and it requires a dedicated team.

Slide 54

Our iQuest colleagues were essential to making this successful. Thank you, Tommy and Sorin. :) @efinlay24 I have a huge amount of respect for the two iQuest engineers who were fully committed to the k8s implementation and migration - it was really tough on them. Without their dedication to the project, and a lot of hard work on their part, none of this would have been possible. They built the infrastructure and the pipelines that were required, and after that, they got stuck in and spent months migrating service after service. It's extremely hard to stay motivated in those circumstances. They did, and we're very grateful to them for that.

Slide 55

2015 2016 2017 2018 > Kubernetes go-live. Finally, though, our goal was in sight. We switched over just before a key part of our old stack reached end of life. The last few months were intense, and swarming all of our teams onto the project had an impact on other work we were trying to do. But we made it.

Slide 56

Everything went smoothly. @efinlay24 The actual switchover between the two stacks went very smoothly. We did the migration via DNS, switching requests across to our new clusters one API endpoint at a time. We had a couple of small issues: we'd cleaned up some bugs in our services, and it turned out a handful of our customers were relying on those bugs. Neither was a major problem, and we got them fixed up fairly quickly.

Slide 57

Was it worth it? @efinlay24 So did we get the results we were looking for? We think so.

Slide 58

We have a far more stable platform. @efinlay24 We have a considerably more stable platform. In the month following k8s go-live, we had 3 production incidents, versus 13 in the same period in 2017. We've also had far fewer out-of-hours platform incidents, and in some cases k8s actually recovered before we'd even logged on to our laptops.

Slide 59

We have happier developers. @efinlay24 Since we have fewer incidents, we no longer have sarcastic comments in Slack about the stability of our platform - so that was a success as well.

Slide 60

We can learn from others. And we can share our knowledge. @efinlay24 We now use a technology that we can Google. We can watch talks, discuss our problems with other companies, and send people on training.

Slide 61

@efinlay24 And yes - we managed to reduce our costs. Compared to our old containerised stack, we achieved a further 35% reduction in hosting and support costs. Even with the cost of the migration, we predict we'll break even within 3 years compared to our old stack.

Slide 62

Nearly the end. (don't clap yet) @efinlay24 That's pretty much everything I wanted to cover today - I hope you've found it interesting and useful.

Slide 63

@efinlay24 That was our journey. We didn't start from the easiest place, and we didn't necessarily end up exactly where we expected either - but that's OK. We have to keep asking ourselves: is this a good place to stop, or do we need to go further? We learnt a lot along the way, and we now have a stable, cheaper platform. Our teams are much happier as well - they're able to spend their time building new functionality, rather than supporting our platform.

Slide 64

Sarah Wells @sarahjwells https://www.youtube.com/watch?v=H06qrNmGqyE Before I finish, I'd also like to credit Sarah Wells, who's a Technical Director at the FT. While she wasn't able to come along to this event today, this talk is heavily based on her opening keynote at KubeCon back in May - I definitely recommend checking it out.

Slide 65

The end. (please clap) clap clap clap

Slide 66

@efinlay24 euan.finlay@ft.com bye