Migrate 3 million websites without anybody noticing

A presentation at DevOpsCon Berlin in October 2020 in Berlin, Germany by Horacio Gonzalez

Slide 1

Migrate 3 million websites without anybody noticing Vincent Cassé & Horacio Gonzalez 2020-10-13

Slide 2

Who are we? Introducing ourselves and introducing OVHcloud

Slide 3

Vincent Cassé @vcasse Host for millions of websites. Breakfast and HTTPS included. Engineering Manager at

Slide 4

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 5

OVHcloud: A Global Leader
● 200k private cloud VMs running
● 1 Dedicated IaaS Europe
● 30 datacenters
● Own 20 Tbps network with 35 PoPs
● 1.3M customers in 138 countries
● Hosting capacity: 1.3M physical servers
● 360k servers already deployed

Slide 6

OVHcloud: 4 Universes of Products
[Diagram: product catalogue for the four universes: WebCloud, Baremetal Cloud, Public Cloud, Hosted Private Cloud]

Slide 7

“OVH: We Host You” (On vous héberge)

Slide 8

Webhosting at OVHcloud
● Biggest webhoster in Europe
● 6 million websites
● 60 Gb/s
● 6 billion HTTP requests (excluding CDN caches)
● 15 000 web servers

Slide 9

Webhosting at OVHcloud: a short history
● Hosting in P19 (Paris) since 2003
● The web has changed since 1999
● New datacenter opening: Gravelines in 2016

Slide 10

What’s the hoster’s job?

Slide 11

apt-get install apache2 php7 mysql-server?
● Store data
● Run source code

Slide 12

Why did we want to leave Paris?
● Hardware end of life
● Natural decrease too slow

Slide 13

Why was it difficult?

Slide 14

Risk management
Probability by magnitude
● 0.1% for 1 website: 1 in 1000 chance
● 0.1% for 100 websites: 1 in 10 chance
● 0.1% for 3 million: 3 000 times

Slide 15

Risk management
Probability by magnitude
● 0.1% for 1 website: 1 in 1000 chance
● 0.1% for 100 websites: 1 in 10 chance
● 0.1% for 3 million: 3 000 times
Risk = Impact * Probability
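
The figures on this slide follow from simple arithmetic. A minimal sketch, assuming every website fails independently with the same 0.1% probability (an assumption made for illustration; the function names are hypothetical):

```python
def chance_of_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent migrations fails."""
    return 1.0 - (1.0 - p) ** n


def expected_failures(p: float, n: int) -> float:
    """Average number of failed migrations among n websites."""
    return p * n


if __name__ == "__main__":
    p = 0.001  # 0.1% failure probability per website (figure from the slide)
    for n in (1, 100, 3_000_000):
        print(n, chance_of_at_least_one_failure(p, n), expected_failures(p, n))
```

For 100 websites the chance of at least one failure is a little under 10%, which is the "1 in 10" of the slide; for 3 million websites the expected number of failures is 3 000.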

Slide 16

Split brain definition Split-brain is a computer term, based on an analogy with the medical Split-brain syndrome. It indicates data or availability inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or a failure condition based on servers not communicating and synchronizing their data to each other. https://en.wikipedia.org/wiki/Split-brain_(computing)

Slide 17

Hosting architecture: view for one website

Slide 18

Load balancing and fault tolerance

Slide 19

Fault domain

Slide 20

Difference between P19 & Gravelines

Slide 21

Files constraint
● Customer dependencies: source code / images / JavaScript…
● Rsync limitations
● Block copy implies migrating all customers of a filerz

Slide 22

Clusters constraint
● High-cost infrastructure is shared per cluster (load balancer, IP…)
● DNS zone relies on customer configurations
● IP migration implies migrating all customers of the cluster

Slide 23

Database constraint
● Database linked to one hosting account but…
● Exhaustive knowledge = comprehensive mastery of source code
● Breaking zero websites implies migrating all websites at the same time

Slide 24

Database constraint²
● Database naming uses a subdomain of mysql.db
● But it is a “recent” feature (5 years)
● Old usages are incompatible with the Gravelines datacenter

Slide 25

Slide 26

So, how do we migrate?

Slide 27

Be punk! Break the rules
If we accept all the constraints:
● Either migrate the sites one by one, knowing each website
● Or migrate them all at the same time (TCP over Trucks?)

Slide 28

Database naming

Slide 29

Database naming : ProxySQL

Slide 30

Database naming not known
● Network tunnel between the two datacenters
● Impact: +10 ms latency for each request
● “Best effort”
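
To get a feel for the +10 ms impact, here is a minimal sketch. It assumes that a page issues its SQL requests one after another, so the extra latency adds up; the 30 requests in the usage note are a hypothetical figure, not a number from the talk.

```python
TUNNEL_LATENCY_MS = 10  # extra latency per SQL request (figure from the slide)


def extra_page_time_ms(sequential_sql_requests: int) -> int:
    """Extra time added to one page when every SQL request crosses the tunnel."""
    return sequential_sql_requests * TUNNEL_LATENCY_MS
```

Under these assumptions, a page running 30 sequential SQL requests would be about 300 ms slower while its database is reached through the tunnel.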

Slide 31

ProxySQL and latency

Slide 32

SQL proxy and latency: +10 ms
[Diagram: databases db 1 … db 2500 in P19 and Gravelines, with the names mysql55-XXX.plan-service, mysql55-XXX.plan and dbXXX.plan]

Slide 33

Shared IP constraint
127.0.0.1 ::1
[Diagram: “To migrate” and “Already migrated”, P19 and GRA]

Slide 34

Shared IP constraint
127.0.0.1 ::1
[Diagram: “To migrate”; Web servers and Filerz in P19 and GRA]

Slide 35

File constraints
Are we able to migrate all customers of a filerz at the same time?

Slide 36

Party time! Let’s migrate!

Slide 37

IP switch
● Information system adaptation
● Load balancer patch
● Network tunnel
● Tools & monitoring

Slide 38

ProxySQL
● Configuration automation
● Risk-managed deployment: 1 / 10 / 100 / 1000
● SQL proxying at scale: some surprises
  ○ MySQL and password storage format…
  ○ ARP table
  ○ Old database management
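
One way to read "1 / 10 / 100 / 1000" is a rollout in growing batches, where each batch is checked before the next, larger one starts. A minimal sketch under that assumption (the function name and the item names are hypothetical):

```python
from typing import Iterator, List, Sequence


def staged_batches(
    items: Sequence[str],
    stage_sizes: Sequence[int] = (1, 10, 100, 1000),
) -> Iterator[List[str]]:
    """Yield batches of growing size: 1 item, then 10, then 100, then 1000.

    Once the listed stages are used up, the last stage size is reused
    until every item has been handed out.
    """
    position = 0
    stage = 0
    while position < len(items):
        size = stage_sizes[min(stage, len(stage_sizes) - 1)]
        yield list(items[position:position + size])
        position += size
        stage += 1
```

A caller would deploy one batch, check monitoring, and only then ask for the next batch, so a mistake found at the first stage affects a single item rather than a thousand.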

Slide 39

Migration plan
● Migration filerz by filerz
● Databases linked to a migrated hosting account are migrated at the same time
● 1 IP switch at a time, so 1 cluster at a time
● Clusters migrated in order of risk level, from least risky to most risky
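
The last point can be sketched by reusing the earlier formula, Risk = Impact * Probability. The fields and values below are hypothetical; the talk does not say how the risk level of a cluster was actually scored.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Cluster:
    name: str
    impact: float       # hypothetical: e.g. number of websites on the cluster
    probability: float  # hypothetical: estimated chance of a failed migration

    @property
    def risk(self) -> float:
        # Risk = Impact * Probability, as on the risk management slide
        return self.impact * self.probability


def migration_order(clusters: Iterable[Cluster]) -> List[Cluster]:
    """Return clusters from least risky to most risky."""
    return sorted(clusters, key=lambda cluster: cluster.risk)
```

Starting with the least risky clusters lets the team improve its tooling before touching the clusters where a failure would hurt most.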

Slide 40

Chronological timeline
● Hardware order
● D-60: Setup new cluster
● D-40: Setup filerz
● D-30: Setup database hosts
● D-30: Setup filerz
● D-30: Communication
● D-15: Communication
● D-10: Cluster tests
● D-7: IP switch
● D-1: Accelerate filerz incremental copies
● Night 1: Migration of X filerz
● Night N: Migration of the last filerz
● D+1: Close P19 cluster infra
● Decommissioning

Slide 41

IP Switch (D-7)
1. Destination cluster and network tunnel tests
2. Send communication to support and customers
3. SSL jobs redirections
4. Set up all SSL on destination load balancer
5. For each IPv4 / IPv6 address:
   • Route IP to new load balancer
   • Test websites at Paris and Gravelines
6. Route CDN to new infra

Slide 42

Filerz migration: during the night
1. Cluster websites tests
2. Cut monitoring of the cluster
3. Launch incremental copy
4. Close websites (maintenance mode)
5. Wait for PHP timeout
6. Close file access from the filerz
7. Launch last incremental copy
8. When data are in Gravelines: launch database migration
9. Update configurations of migrated hostings (IS, infrastructure…)
10. Reopen hosting accounts
11. Wait for the end of database migration
12. Test websites again and check all is OK
13. Enable cluster monitoring
14. Notify customers about the end of operations
15. Go to bed!
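
In this list the database migration is launched at step 8 but only awaited at step 11, so steps 9 and 10 overlap with it. A minimal sketch of that ordering, in which `run_step` and `migrate_databases` are hypothetical callables standing in for the real tooling:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

BEFORE_DATABASES = [
    "test cluster websites",
    "cut cluster monitoring",
    "launch incremental copy",
    "close websites (maintenance mode)",
    "wait for PHP timeout",
    "close file access from the filerz",
    "launch last incremental copy",
]
DURING_DATABASES = [
    "update configurations of migrated hostings",
    "reopen hosting accounts",
]
AFTER_DATABASES = [
    "test websites again",
    "enable cluster monitoring",
    "notify customers",
]


def migrate_filerz_night(
    run_step: Callable[[str], None],
    migrate_databases: Callable[[], None],
) -> None:
    """Run the night's steps in order; databases migrate in the background."""
    for name in BEFORE_DATABASES:                      # steps 1 to 7
        run_step(name)
    with ThreadPoolExecutor(max_workers=1) as pool:
        database_job = pool.submit(migrate_databases)  # step 8
        for name in DURING_DATABASES:                  # steps 9 and 10
            run_step(name)
        database_job.result()                          # step 11: wait, re-raise errors
    for name in AFTER_DATABASES:                       # steps 12 to 14
        run_step(name)
```

An exception raised by any step stops the sequence, which is the behaviour one would want before reopening or re-monitoring a half-migrated cluster.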

Slide 43

Migration: and databases?
For each database:
1. Put database in read-only mode
2. Dump database
3. Import database on destination cluster and put the new one in read-write mode
4. Redirect DNS name to the new server
5. Set up SQL proxy to new server
6. Close old database in Paris
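
The six steps can be walked through with an in-memory simulation. This is only an illustration of the ordering: `DatabaseServer`, the `dns` dictionary and the `proxy_routes` dictionary are hypothetical stand-ins, not real MySQL, DNS or ProxySQL calls.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DatabaseServer:
    """In-memory stand-in for a database server (illustration only)."""
    name: str
    databases: Dict[str, dict] = field(default_factory=dict)
    read_only: Dict[str, bool] = field(default_factory=dict)


def migrate_database(
    db_name: str,
    source: DatabaseServer,
    destination: DatabaseServer,
    dns: Dict[str, str],
    proxy_routes: Dict[str, str],
) -> None:
    # 1. Put database in read-only mode so the dump stays consistent
    source.read_only[db_name] = True
    # 2. Dump database
    dump = dict(source.databases[db_name])
    # 3. Import on destination and put the new copy in read-write mode
    destination.databases[db_name] = dump
    destination.read_only[db_name] = False
    # 4. Redirect DNS name to the new server
    dns[db_name] = destination.name
    # 5. Point the SQL proxy to the new server
    proxy_routes[db_name] = destination.name
    # 6. Close old database in Paris
    del source.databases[db_name]
    del source.read_only[db_name]
```

Writes are refused between steps 1 and 3, which is why the old copy can be dropped at step 6 without losing data.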

Slide 44

Migration: and databases?
● Distribution of operations on all servers
● Orchestration
  ○ storing information inside… a database!

Slide 45

Migration: and databases?
● Distribution of operations on all servers
● Orchestration
  ○ storing information inside… a database!
Record: 13 502 databases migrated in 1 hour 13 minutes
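
"Distribution of operations on all servers" can be pictured as one work list per server, processed in parallel. A minimal sketch with a thread pool; the function names and the `migrate_one` callable are hypothetical, and the real orchestration kept its state in a database rather than in memory.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple


def group_by_server(databases: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (database, server) pairs into one work list per server."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for database, server in databases:
        groups[server].append(database)
    return dict(groups)


def migrate_all(
    databases: Iterable[Tuple[str, str]],
    migrate_one: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Migrate databases server by server, with the servers handled in parallel."""
    groups = group_by_server(databases)

    def migrate_server(names: List[str]) -> List[str]:
        # Databases on the same server are handled one after another.
        return [migrate_one(name) for name in names]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_server = list(pool.map(migrate_server, groups.values()))
    return [result for results in per_server for result in results]
```

For scale, the record quoted on the slide, 13 502 databases in 73 minutes, averages out to about 185 databases per minute, or roughly 3 per second.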

Slide 46

Organisation

Slide 47

Challenges
• Technical, but that was this presentation up to this slide
• Infrastructure work split across specialized teams (database, web servers, storage servers, datacenters, server factory, support, load balancers, CDN, network…)
• Legacy
• Loooong migration

Slide 48

Continuous improvement organisation
● Build the migration plan
● Implement and test the plan
● Migrate, then improve the migration after each week

Slide 49

Change management

Slide 50

--verbose?
• Why we decided to migrate three million websites? https://www.ovh.com/blog/web-hosting-why-we-decided-to-migrate-three-million-websites/
• How to host 3 million websites? https://www.ovh.com/blog/web-hosting-how-to-host-3-million-websites/
• How to migrate 3 Million web sites? https://www.ovh.com/blog/web-hosting-how-to-migrate-3-million-web-sites/
• How do our databases work? https://www.ovh.com/blog/web-hosting-how-do-our-databases-work/
• How to win at the massive database migration game https://www.ovh.com/blog/how-to-win-at-the-massive-database-migration-game/
• migrate-datacentre –quiet: How do we seamlessly migrate a datacentre? https://www.ovh.com/blog/migrate-datacentre-quiet-how-do-we-seamlessly-migrate-a-datacentre/
• A day in the life of a ProxySQL at OVHcloud https://www.ovh.com/blog/a-day-in-the-life-of-a-proxysql-at-ovhcloud/
• Another day in ProxySQL life: sharing is caring https://www.ovh.com/blog/another-day-in-proxysql-life-sharing-is-caring/
More soon on https://ovh.com/blog