How Netlify migrated to multicloud while no one noticed

A presentation at Velocity Conf in June 2018 by Ryan Neal

Slide 1

How Netlify Migrated to a Multicloud Architecture

And no one noticed

Slide 2

Who am I?

ryan@ rybit @ry_boflavin

Slides 3-5

Who am I?

Dog Dad

Engineer

Fire Spinner

Slide 6

Engineer of things

Tech Passions

Distributed Systems

Streaming Data System

Infrastructure Automation

System Design

Worked

Raytheon

Palantir Middle East

Yelp

Netlify

Slide 7

What is Netlify?

Netlify is the simplest way to build, deploy, and manage web projects on the JAMstack. We're changing the way the web is built by collapsing the modern front-end development process into a single, simplified workflow.

full CI/CD

prerendering

content delivery

lambda deployment

routing layer

split testing

identity provider

dns provider

...

Slides 8-9

What is Netlify?

Over

  • 5 million sites
  • 4,000 requests/sec
  • 1,200 deploys/hour

full CI/CD

prerendering

content delivery

lambda deployment

routing layer

split testing

identity provider

dns provider

...

Slide 10

What am I going to talk about?

1. Intro to the system
2. Why we did all this work
3. How we accomplished it
4. The actual migration
5. Next steps

Slides 11-14

[Diagram: Content Delivery. DNS, CDN (cdn nodes), Origin Cluster (origin nodes), DB Cluster, Cloud Files]

Slide 15

Plan for failure

Redundancy is a priority

Everything is horizontally scalable

Everything runs in a cluster

Health checking for everything

Slide 16

Getting Data into the system

Slides 17-23

[Diagram: Content Delivery. DNS, CDN (cdn nodes), Origin Cluster (origin nodes), DB Cluster, Cloud Files]

Slide 24

Getting Data out of the system

Slides 25-34

[Diagram: Content Delivery. DNS, CDN (cdn nodes), Origin Cluster (origin nodes), DB Cluster, Cloud Files]

Slide 35

Cool, but where are the actual servers?

Slides 36-42

[Diagram: But where does it live? DB Cluster, Origin Cluster (origin nodes), CDN]

Slides 43-50

[Diagram: And when it fails? DB Cluster, Origin Cluster (origin nodes), CDN, DNS]

Slide 51

Slide 52

What happens when things go wrong?

Slides 53-55

What happens when things go wrong?

CDN stays up

Keep serving cached content

Higher traffic sites are going to be happier
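The "keep serving cached content" behaviour can be sketched as a cache that falls back to a stale entry when the origin is unreachable. This is a minimal illustration, not Netlify's CDN code: `EdgeCache`, `fetch_origin`, and the TTL handling are all hypothetical names and choices. It also hints at why higher traffic sites fare better: their pages are more likely to already be in the cache when the origin goes away.

```python
import time


class EdgeCache:
    """Hypothetical edge cache that serves stale content if the origin is down."""

    def __init__(self, fetch_origin, ttl=60.0, clock=time.monotonic):
        self.fetch_origin = fetch_origin  # callable: path -> bytes, may raise ConnectionError
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # path -> (stored_at, body)

    def get(self, path):
        now = self.clock()
        cached = self.entries.get(path)
        # Fresh hit: answer from cache without touching the origin.
        if cached and now - cached[0] < self.ttl:
            return cached[1]
        try:
            body = self.fetch_origin(path)
        except ConnectionError:
            # Origin is down: a stale copy is better than an outage.
            if cached:
                return cached[1]
            raise
        self.entries[path] = (now, body)
        return body
```

A path that was never requested before the failure still errors out, which is the "degraded" part of degraded service.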

Slide 56

Multicloud Setup

Providers fail with no notice

Degraded perf > outage

Same cloud is fastest

Slides 57-58

Multicloud Setup

Providers fail with no notice

Degraded perf > outage

Same cloud is fastest

AWS

  • Elastic Compute Cloud
  • S3

GCP

  • Compute Engine
  • Cloud Storage

RAX

  • Cloud Servers
  • Cloud Files

Slide 59

Slide 60

Why do all of this?

Because clouds fail

Slides 61-62

But how do we build around that?

Slides 63-64

Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover

Slide 65

Assumption Checking

https://github.com/rybit/cloud-bench

Slide 66

Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover

Slides 67-69

Replicate it all

[Diagram: origin nodes, primary, CF, with steps 1, 2, 3]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1
}

Upload mask

RAX = 1, AWS = 2, GCP = 4

Example:

m = 6 → AWS & GCP

m = 3 → AWS & RAX

m = 1 → RAX only
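The upload mask is a plain bitmask, so decoding it takes a few lines. The sketch below uses the bit values from the slide (RAX = 1, AWS = 2, GCP = 4); the `Cloud` enum and the helper names `clouds_in` and `missing_from` are illustrative and do not come from the talk.

```python
from enum import IntFlag


class Cloud(IntFlag):
    # Bit values as given on the slide.
    RAX = 1
    AWS = 2
    GCP = 4


ALL_CLOUDS = Cloud.RAX | Cloud.AWS | Cloud.GCP


def clouds_in(mask: int) -> list[str]:
    """Names of the clouds whose bit is set in the upload mask."""
    return [c.name for c in (Cloud.RAX, Cloud.AWS, Cloud.GCP) if mask & c]


def missing_from(mask: int) -> list[str]:
    """Names of the clouds that still need a copy of the blob."""
    return clouds_in(int(ALL_CLOUDS) & ~mask)


print(clouds_in(6))     # ['AWS', 'GCP']
print(clouds_in(3))     # ['RAX', 'AWS']
print(missing_from(1))  # ['AWS', 'GCP']
```

A mask of 7 means the blob is present in all three clouds, so `missing_from(7)` is empty.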

Slide 70

BlobSync

Done out of band from the request cycle

Constantly queries for unreplicated blobs

Pulls object down, pushes to the other clouds

Records progress and errors
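One pass of a BlobSync-style replicator might look like the sketch below. It is an assumption-heavy illustration: the real service is not shown in the slides, the object stores are modelled as in-memory dicts keyed by the cloud's mask bit, and `sync_once` and the `errors` list are hypothetical names. Blob records carry the upload mask `m`, with RAX = 1, AWS = 2, GCP = 4.

```python
RAX, AWS, GCP = 1, 2, 4
ALL = RAX | AWS | GCP


def sync_once(blobs, stores, errors):
    """Copy every under-replicated blob to the clouds that lack it.

    blobs:  list of records like {"sha": ..., "m": <upload mask>}
    stores: {cloud_bit: {sha: bytes}}, stand-ins for the real object stores
    errors: list that collects (sha, message) tuples
    """
    for blob in blobs:
        if blob["m"] == ALL:
            continue  # already everywhere
        sha = blob["sha"]
        # Pull the object down from any cloud that is recorded as having it.
        source = next(
            (bit for bit, store in stores.items() if blob["m"] & bit and sha in store),
            None,
        )
        if source is None:
            errors.append((sha, "no readable source copy"))
            continue
        data = stores[source][sha]
        # Push to every cloud whose bit is not yet set, recording progress.
        for bit, store in stores.items():
            if blob["m"] & bit:
                continue
            try:
                store[sha] = data
                blob["m"] |= bit
            except Exception as exc:  # a real store could fail mid-upload
                errors.append((sha, str(exc)))
```

Running this in a loop, outside the request path, matches the "out of band" and "constantly queries" points above; a blob whose upload fails simply keeps its old mask and is retried on the next pass.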

Slides 71-74

BlobSync

Done out of band from the request cycle

Constantly queries for unreplicated blobs

Pulls object down, pushes to the other clouds

Records progress and errors

[Diagram: BlobSync, CF, S3, GCS]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1
}

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 7
}

Slide 75

Replicate it all

[Diagram: origin nodes, primary, CF, with steps 1, 2, 3]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1
}

Slide 76

Replicate it all

[Diagram: origin nodes, primary, CF, with steps 1, 2, 3]

{
"_id" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"size" : 9935,
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"created_at" : ISODate("2018-06-07T21:02:29.240Z"),
"m" : 1,
"r" : true
}

Replication Flag

Sparse index in mongo

Slide 77

State of the world

[Diagram: origin nodes, primary, CF, CDN]

Slide 78

State of the world

[Diagram: origin nodes, primary, CF, GCS, S3, BlobSync, CDN]

Slide 79

Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover

Slide 80

State of the world

[Diagram: origin nodes, CF, GCS, S3, BlobSync, CDN, primary]

Slide 81

Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover

Slide 82

Cloud Agnostic Origin Services

Generic cloud storage interface

Automatic failover

Prefer staying in cloud

Forceable overrides
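The four points above can be sketched together: one storage interface, a lookup that prefers the origin's own cloud, falls over to the next provider on error, and accepts an override. Everything named here (`BlobStore`, `MemoryStore`, `fetch`, the `force` parameter) is hypothetical, and treating the override as "try this cloud first" is an assumption. The `mask` argument is the upload mask from the replication slides (RAX = 1, AWS = 2, GCP = 4), so only clouds recorded as holding the blob are tried.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Generic storage interface: one implementation per provider."""
    name: str
    bit: int

    def get(self, sha: str) -> bytes: ...


class MemoryStore:
    """In-memory stand-in for one provider's object store."""

    def __init__(self, name, bit, blobs, healthy=True):
        self.name, self.bit = name, bit
        self.blobs = dict(blobs)
        self.healthy = healthy

    def get(self, sha):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.blobs[sha]


def fetch(sha, mask, stores, local, force=None):
    """Return (cloud_name, data), trying the forced or local cloud first."""
    holders = [s for s in stores if mask & s.bit]
    first = force or local
    # Stable sort: the preferred cloud moves to the front, the rest keep order.
    ordered = sorted(holders, key=lambda s: s.name != first)
    last_error = None
    for store in ordered:
        try:
            return store.name, store.get(sha)
        except (ConnectionError, KeyError) as exc:
            last_error = exc  # fail over to the next provider
    raise LookupError(f"{sha} not available in any cloud") from last_error
```

With a blob replicated everywhere (mask 7), an origin in GCP reads from GCP; if its local provider is down, the same call quietly returns the copy from the next cloud.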

Slides 83-89

Smart Resolution

[Diagram: origin, CDN, primary]

{
"sha" : "9c74b7c31a3c04634ddf1d54d2339e0163dcf4a7",
"m" : 7
}

Slide 90

Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover

Slides 91-92

State of the world

[Diagram: origin nodes with CF, origin nodes with GCS, origin nodes with S3, BlobSync, CDN, primary]

Slides 93-94

Steps to Multicloud

1. Double check assumptions
2. Replicate all the objects
3. Prepare the database
4. Make the origin services cloud agnostic
5. Test everything
6. Do the actual cutover

Slide 95

Pulling the trigger

1. Spin up enough origin services
2. Fail over the DB
3. Update the Consul entry
4. Aggressively stare at monitors
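The slides do not say what the Consul entry looks like. Assuming it is a key in Consul's KV store, step 3 could be a single HTTP `PUT` to the `/v1/kv/<key>` endpoint. In this sketch the key name `origin/primary-cloud`, the value `gcp`, and the local agent address are all made up for illustration; the request is only built, not sent.

```python
import urllib.request


def consul_kv_put(key, value, base="http://127.0.0.1:8500"):
    """Build a PUT request against Consul's KV HTTP API (does not send it)."""
    url = f"{base}/v1/kv/{key}"
    return urllib.request.Request(url, data=value.encode(), method="PUT")


# Hypothetical key and value; the real entry is not shown in the talk.
req = consul_kv_put("origin/primary-cloud", "gcp")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would perform the actual update.
```

Keeping the switch to one small write is what makes the last step, watching the monitors, the hard part.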

Slide 96

State of the world

[Diagram: origin nodes with CF, origin nodes with GCS, origin nodes with S3, BlobSync, CDN, primary]

Slides 97-98

State of the world

[Diagram: primary, origin nodes with CF, origin nodes with GCS, origin nodes with S3, BlobSync, CDN]

Slide 99

Slide 100

Summary

Redundant everything

Cloud agnostic origin and CDN

Programmable infrastructure

Out of band replication

Smart routing

Automated failover

Slide 101

So now what?

Set up a trickle of traffic to the live standby

Automate the traffic switch

Speed up network scale-up

More monitoring
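A trickle of traffic to a standby is often done by hashing something stable about the request into a bucket and sending a small share of buckets to the standby. The sketch below shows that general technique under stated assumptions; it is not a description of what Netlify built, and `pick_backend`, `standby_percent`, and the backend labels are hypothetical.

```python
import hashlib


def pick_backend(request_id, standby_percent=1.0):
    """Deterministically send roughly `standby_percent` % of requests to standby."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the request into one of 10,000 buckets (0.01 % granularity).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return "standby" if bucket < standby_percent * 100 else "primary"
```

Because the choice depends only on the request identifier, the same request always lands on the same side, and raising `standby_percent` step by step toward 100 turns the trickle into the automated traffic switch from the second bullet.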

Slide 102

WE ARE HIRING

Slide 103

Find me to talk!

ryan@ rybit @ry_boflavin