Migrate 3 million websites without anybody noticing Vincent Cassé & Horacio Gonzalez 2020-10-13
A presentation at DevOpsCon Berlin in October 2020 in Berlin, Germany by Horacio Gonzalez
Who are we? Introducing ourselves and introducing OVHcloud
Vincent Cassé @vcasse Host for millions of websites. Breakfast and HTTPS included. Engineering Manager at
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter
OVHcloud: A Global Leader 200k Private cloud VMs running 1 Dedicated IaaS Europe 30 Datacenters Own 20Tbps Network with 35 PoPs
1.3M Customers in 138 Countries Hosting capacity: 1.3M Physical Servers 360k Servers already deployed
OVHcloud: 4 Universes of Products [diagram: product matrix covering WebCloud, Baremetal Cloud, Public Cloud and Hosted Private Cloud]
“OVH: We Host You” (On vous héberge)
Webhosting at OVHcloud ● Biggest webhoster in Europe ● 6 million websites ● 60 Gb/s ● 6 billion HTTP requests (excluding CDN caches) ● 15 000 web servers
Webhosting at OVHcloud: a brief history ● Hosting in P19 (Paris) since 2003 ● The web has changed since 1999 ● New datacenter opening: Gravelines in 2016
What’s the hoster’s job?
apt-get install apache2 php7 mysql-server? ● Store data ● Run source code
Why did we want to leave Paris? ● Hardware end of life ● Natural decrease too slow
Why was it difficult?
Risk management Probability by magnitude ● 0.1% for 1 website: 1 in 1 000 chance ● 0.1% for 100 websites: about a 1 in 10 chance ● 0.1% for 3 million websites: about 3 000 occurrences Risk = Impact × Probability
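The arithmetic behind this slide can be checked with a small sketch. It assumes that each website fails independently with the same probability, which the slide does not state; the function names are ours.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n websites breaks.

    Assumption (not from the talk): failures are independent
    and every website has the same failure probability p.
    """
    return 1.0 - (1.0 - p) ** n


def expected_failures(p: float, n: int) -> float:
    """Expected number of broken websites among n."""
    return p * n


if __name__ == "__main__":
    p = 0.001  # the 0.1% figure from the slide
    for n in (1, 100, 3_000_000):
        print(n, p_at_least_one(p, n), expected_failures(p, n))
```

With 100 websites the chance of at least one problem is a little under 10%, and with 3 million websites a problem is practically certain, with about 3 000 affected websites expected.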
Split brain definition Split-brain is a computer term, based on an analogy with the medical Split-brain syndrome. It indicates data or availability inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or a failure condition based on servers not communicating and synchronizing their data to each other. https://en.wikipedia.org/wiki/Split-brain_(computing)
Hosting architecture. View for one website
Load balancing and fault tolerance
Fault domain
Difference between P19 & Gravelines
Files constraint ● Customer dependencies: source code / images / JavaScript… ● Rsync limitations ● Block copy implies migrating all customers of a filerz
Clusters constraint ● High-cost infrastructure is shared per cluster (load balancer, IP…) ● DNS zones rely on customer configurations ● IP migration implies migrating all customers of the cluster
Database constraint ● Database linked to one hosting account, but… ● Exhaustive knowledge = comprehensive mastery of source code ● Breaking zero websites implies migrating, at the same time, all websites at
Database constraint² ● Database naming uses a subdomain of mysql.db ● But this is a “recent” feature (5 years old) ● Older usages are incompatible with the Gravelines datacenter
So, how do we migrate?
Be punk! Break the rules If we respect all constraints: ● Either migrate the sites one by one, knowing each website ● Or migrate everything at the same time (TCP over trucks?)
Database naming
Database naming : ProxySQL
Database naming not known ● Network tunnel between the two datacenters ● Impact: +10 ms latency for each request ● “Best effort”
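To get a feel for what +10 ms per request means for a page, here is a rough sketch. It assumes the page issues its SQL queries one after the other, so the penalty adds up; the base render time and query count in the example are hypothetical, only the 10 ms figure comes from the slide.

```python
TUNNEL_PENALTY_MS = 10.0  # from the slide: +10 ms for each request through the tunnel


def page_time_ms(base_ms: float, sequential_queries: int,
                 penalty_ms: float = TUNNEL_PENALTY_MS) -> float:
    """Page generation time when every SQL query crosses the tunnel.

    Assumption (ours): queries run sequentially, so each one
    pays the full tunnel penalty.
    """
    return base_ms + sequential_queries * penalty_ms


if __name__ == "__main__":
    # Hypothetical page: 200 ms locally, 30 sequential queries.
    print(page_time_ms(200.0, 30))
```

Under these assumptions a page that rendered in 200 ms needs 500 ms once its database sits on the other side of the tunnel, which is why this path is only a "best effort" fallback.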
ProxySQL and latency
SQL proxy and latency [diagram: database hosts db 1 … db 2500 in P19 and Gravelines, reached through names such as mysql55-XXX.plan-service, mysql55-XXX.plan and dbXXX.plan; +10 ms between the two datacenters]
Shared IP constraint 127.0.0.1 ::1 [diagram: websites still to migrate in P19, already migrated websites in GRA]
Shared IP constraint 127.0.0.1 ::1 [diagram: web servers and filerz to migrate in P19, web servers and filerz in GRA]
File constraints Are we able to migrate all customers of a filerz at the same time?
Party time! Let’s migrate!
IP switch ● Information system adaptation ● Load balancer patch ● Network tunnel ● Tools & monitoring
ProxySQL ● Configuration automation ● Risk-managed deployment: 1 / 10 / 100 / 1000 ● SQL proxying at scale: some surprises ○ MySQL and password storage format… ○ ARP table ○ Old database management
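One way to read "1 / 10 / 100 / 1000" is as batch sizes for a progressive rollout: deploy to one target, check it, then widen. The sketch below follows that reading, which is our assumption; `deploy` and `healthy` are hypothetical callbacks, not OVHcloud tooling.

```python
from typing import Callable, Iterator, List, Sequence


def staged_batches(items: Sequence[str],
                   stages: Sequence[int] = (1, 10, 100, 1000)) -> Iterator[List[str]]:
    """Yield batches of growing size; the last stage size repeats until done."""
    start = 0
    stage = 0
    while start < len(items):
        size = stages[min(stage, len(stages) - 1)]
        yield list(items[start:start + size])
        start += size
        stage += 1


def rollout(items: Sequence[str],
            deploy: Callable[[str], None],
            healthy: Callable[[str], bool],
            stages: Sequence[int] = (1, 10, 100, 1000)) -> List[str]:
    """Deploy batch by batch and stop at the first unhealthy batch.

    Returns the items belonging to fully healthy batches.
    """
    done: List[str] = []
    for batch in staged_batches(items, stages):
        for item in batch:
            deploy(item)
        if not all(healthy(item) for item in batch):
            break  # stop before widening the blast radius
        done.extend(batch)
    return done
```

The point of the growing batches is that a surprise found on the first target costs one database, not a thousand.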
Migration plan ● Migrate filerz by filerz ● Databases related to a migrated hosting account migrate at the same time ● 1 IP switch at a time, so 1 cluster at a time ● Clusters migrated in order of risk level, from least risky to most risky
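The ordering rules of this plan can be sketched as a small data model. Everything here is illustrative: the `Cluster` and `Filerz` classes, the numeric `risk` score and the example names are our assumptions, not the actual information system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Filerz:
    name: str
    databases: List[str] = field(default_factory=list)  # databases of the hosted accounts


@dataclass
class Cluster:
    name: str
    risk: int  # hypothetical score: lower means less risky
    filerz: List[Filerz] = field(default_factory=list)


def build_plan(clusters: List[Cluster]) -> List[Tuple[str, str, List[str]]]:
    """Return (cluster, filerz, databases) steps in migration order.

    Clusters go from least to most risky, one cluster at a time;
    inside a cluster, each filerz is migrated with its databases.
    """
    plan: List[Tuple[str, str, List[str]]] = []
    for cluster in sorted(clusters, key=lambda c: c.risk):
        for filerz in cluster.filerz:
            plan.append((cluster.name, filerz.name, list(filerz.databases)))
    return plan
```

For example, a cluster with risk 1 is fully planned before a cluster with risk 3, whatever order they were declared in.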
Chronological timeline ● Hardware order ● D-60: Setup new cluster ● D-40: Setup filerz ● D-30: Setup database hosts ● D-30: Setup filerz ● D-30: Communication ● D-15: Communication ● D-10: Cluster tests ● D-7: IP switch ● D-1: Accelerate filerz incremental copies ● Night 1: Migration of X filerz ● Night N: Migration of the last filerz ● D+1: Close P19 cluster infra ● Decommissioning
IP Switch (D-7) 1. Destination cluster and network tunnel tests 2. Send communication to support and customers 3. SSL jobs redirections 4. Set up all SSL certificates on the destination load balancer 5. For each IPv4 / IPv6 address: • Route IP to new load balancer • Test websites at Paris and Gravelines 6. Route CDN to new infra
Filerz migration: during the night 1. Test the cluster's websites 2. Cut monitoring of the cluster 3. Launch an incremental copy 4. Close websites (maintenance mode) 5. Wait for PHP timeouts 6. Close file access from the filerz 7. Launch the last incremental copy 8. When the data is in Gravelines: launch database migration 9. Update configurations of migrated hosting accounts (IS, infrastructure…) 10. Reopen hosting accounts 11. Wait for the end of database migration 12. Test websites again and check that all is OK 13. Enable cluster monitoring 14. Notify customers of the end of operations 15. Go to bed!
Migration: and databases? For each database: 1. Put database in read-only mode 2. Dump database 3. Import database on destination cluster and put the new one in read-write mode 4. Redirect DNS name to the new server 5. Set up SQL proxy to point to the new server 6. Close old database in Paris
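The six steps above can be sketched as an in-memory simulation. This is not the real tooling: servers, DNS and proxy routes are plain Python dictionaries standing in for the actual systems, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Server:
    name: str
    # database name -> {"rows": [...], "mode": "rw" or "ro"}
    databases: Dict[str, dict] = field(default_factory=dict)


def migrate_database(db: str, source: Server, dest: Server,
                     dns: Dict[str, str], proxy_routes: Dict[str, str]) -> None:
    """Move one database from source to dest, following the slide's order."""
    src = source.databases[db]
    src["mode"] = "ro"                                   # 1. read-only: no more writes
    dump = list(src["rows"])                             # 2. dump
    dest.databases[db] = {"rows": dump, "mode": "rw"}    # 3. import, open in read-write
    dns[db] = dest.name                                  # 4. DNS name points to new server
    proxy_routes[db] = dest.name                         # 5. SQL proxy points to new server
    del source.databases[db]                             # 6. close old database in Paris
```

Freezing writes before the dump is what keeps the two copies from diverging: at no point do both sides accept writes for the same database.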
Migration: and databases? ● Distribution of operations across all servers ● Orchestration ○ storing information inside… a database! Record: 13 502 databases migrated in 1 hour 13 minutes
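The record translates into a throughput figure with simple arithmetic. The sketch only uses the two numbers from the slide; the "serial budget" is our own way of showing why the work had to be distributed.

```python
DATABASES = 13_502
DURATION_S = (1 * 60 + 13) * 60  # 1 hour 13 minutes, in seconds


def throughput(count: int, seconds: float) -> float:
    """Average number of items completed per second."""
    return count / seconds


per_second = throughput(DATABASES, DURATION_S)
per_minute = per_second * 60
# Time each database could take if they were migrated strictly one at a time.
serial_budget_s = DURATION_S / DATABASES

if __name__ == "__main__":
    print(f"{per_minute:.0f} databases/minute, {per_second:.2f} databases/second")
    print(f"serial budget: {serial_budget_s:.3f} s per database")
```

That is roughly 185 databases per minute. Done one at a time, each read-only, dump, import and redirect cycle would have about a third of a second, hence the distribution of operations across all servers.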
Organisation
Challenges
Continuous improvement organisation
● Build the migration plan ● Implement and test the plan ● Migrate, then improve the migration after each week
Change management
--verbose? • Why we decided to migrate three million websites? https://www.ovh.com/blog/web-hosting-why-we-decided-to-migrate-three-million-websites/ • How to host 3 million websites? https://www.ovh.com/blog/web-hosting-how-to-host-3-million-websites/ • How to migrate 3 Million web sites? https://www.ovh.com/blog/web-hosting-how-to-migrate-3-million-web-sites/ • How do our databases work? https://www.ovh.com/blog/web-hosting-how-do-our-databases-work/ • How to win at the massive database migration game https://www.ovh.com/blog/how-to-win-at-the-massive-database-migration-game/ • migrate-datacentre --quiet: How do we seamlessly migrate a datacentre? https://www.ovh.com/blog/migrate-datacentre-quiet-how-do-we-seamlessly-migrate-a-datacentre/ • A day in the life of a ProxySQL at OVHcloud https://www.ovh.com/blog/a-day-in-the-life-of-a-proxysql-at-ovhcloud/ • Another day in ProxySQL life: sharing is caring https://www.ovh.com/blog/another-day-in-proxysql-life-sharing-is-caring/ More soon on https://ovh.com/blog