A presentation at Sunny Tech in Montpellier, France by Horacio Gonzalez
Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany #Sunnytech @LostInBrittany
Who are we? Introducing myself and introducing OVH
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter
OVH: Key Figures
● 1.3M customers worldwide in 138 countries
● 1.5 billion euros investment over five years
● 28 datacenters (growing)
● 350k dedicated servers
● 200k Private cloud VMs running
● 650k Public cloud instances created in a month
● 20TB bandwidth capacity
● 35 points of presence
● 4TB anti-DDoS capacity
● Hosting capacity: 1.3M physical servers
OVH: Our solutions
Families: Cloud, Web Hosting, Mobile Hosting, Telecom
Products: VPS, Containers, Dedicated Server, Domain names, VoIP, Public Cloud, Compute, Data Storage, Email, SMS/Fax, Private Cloud, Network and Database, CDN, Virtual desktop, Security, Object Storage, Web hosting, Cloud HubiC, Over theBox, Licences, Cloud Desktop, Securities, MS Office, Hybrid Cloud, Messaging, MS solutions
Once upon a time… Because I love telling tales
This talk is about a tale… A true one nevertheless
And as in most tales It begins with a mission
And a band of heroes Swept up in the adventure
They fight against mishaps And all kinds of foes
They build mighty fortresses Pushing the limits of the possible
And defend them day after day Against all odds
But we don’t know the end yet Because this tale isn’t finished
It begins with a mission Build a metrics platform for OVH
A long time ago…
A long time ago… Monitoring: Does the system work?
Moving from monolith to μservices (diagram, built up over several slides: one App, then several App instances, then the same instances together with RPXY, LB and Cache components)
What could go wrong? (same diagram: RPXY, LB, four App instances, Cache)
Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
We need to have insights Observability: Understand how it works
OVH decided to go metrics-oriented
A metrics platform for OVH For all OVH
Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them
What is OVH Metrics? Managed Cloud Platform for Time Series
OVH monitoring story We had lots of partial solutions…
OVH monitoring story One Platform to unify them all What should we build it on?
OVH monitoring story Including a really big
OpenTSDB drawbacks OpenTSDB RowKey Design!
OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
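To make the row key point concrete, here is a minimal sketch of why a regex filter costs so much more than a prefix lookup on a sorted key space. It is a deliberately simplified model: the `<metric>|<host>` key layout, the sample metric names and both helper functions are assumptions made for illustration, not OpenTSDB's actual key encoding or API.

```python
import bisect
import re

# Hypothetical, simplified row keys of the form "<metric>|<host>", kept sorted
# the way a key-ordered store such as HBase keeps its rows.
ROW_KEYS = sorted(
    f"{metric}|host{h:03d}"
    for metric in ("cpu.usage", "disk.io", "mem.free")
    for h in range(100)
)


def prefix_scan(keys, prefix):
    """Jump to the first key >= prefix, then stop at the first non-matching key."""
    start = bisect.bisect_left(keys, prefix)
    matches, examined = [], 0
    for key in keys[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, examined


def regex_scan(keys, pattern):
    """A regex on the key gives no usable start or stop point: test every key."""
    rx = re.compile(pattern)
    matches = [key for key in keys if rx.search(key)]
    return matches, len(keys)


prefix_matches, prefix_examined = prefix_scan(ROW_KEYS, "disk.io|")
regex_matches, regex_examined = regex_scan(ROW_KEYS, r"host00[0-9]$")
print(len(prefix_matches), prefix_examined)
print(len(regex_matches), regex_examined)
```

In this toy model the prefix scan only reads the rows it returns plus one, while the regex scan reads every key however few of them match, which is the behaviour the slide summarises as full table scans.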
OpenTSDB other flaws
● Compaction (or append writes)
● /api/query : 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …
Scaling OpenTSDB
Metrics needs First need: To be massively scalable
Analytics is the key to success Fetching data is only the tip of the iceberg
Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer
Metrics needs Second need: To have rich query capabilities
Enter Warp 10… Open-source Time series Database
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript
Did you say scalability? From the smallest to the largest…
More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile
● WarpScript Query Language (standalone, distributed)
● Support more data types
OVH Observability Metrics Platform
Building an ecosystem From Warp 10 to OVH Metrics
What protocols should we support? Who must do the effort?
Open source monitoring tools
Open source monitoring tools Why choose? Let’s support all of them!
Metrics Platform
Metrics Platform One endpoint per protocol: https://<protocol>.<region>.metrics.ovh.net with <protocol> being graphite, influx, opentsdb, prometheus, Warp10, tsl, …
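The endpoint scheme on this slide can be sketched as a small helper. This is only an illustration of the naming pattern: the function name, the lowercase `warp10` hostname spelling and the `gra1` region are assumptions, and the real list of protocols is longer (the slide ends with "…").

```python
# Protocol names as listed on the slide; the lowercase spelling of "warp10"
# in the hostname is an assumption.
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10", "tsl")


def metrics_endpoint(protocol: str, region: str) -> str:
    """Build the base URL for one protocol in one region (hypothetical helper)."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return f"https://{protocol}.{region}.metrics.ovh.net"


# "gra1" is a hypothetical region name used only for illustration.
for protocol in PROTOCOLS:
    print(metrics_endpoint(protocol, "gra1"))
```

The point of the scheme is that an existing agent keeps speaking its own protocol and only its target host changes.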
TSL
select("cpu.usage_system")
  .where("cpu~cpu[0-7]*")
  .last(12h)
  .sampleBy(5m,max)
  .groupBy(mean)
  .rate()
github.com/ovh/tsl
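For readers unfamiliar with this kind of pipeline, the three last steps of the TSL query can be approximated in plain Python. This is a rough sketch of the idea only, not TSL's implementation: the helper names, the bucket alignment rule, the per-second rate definition and the sample data are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def sample_by(points, span, aggregate):
    """Downsample one series: group points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % span].append(value)
    return sorted((ts, aggregate(vals)) for ts, vals in buckets.items())


def group_by(series_list, aggregate):
    """Merge several aligned series into one, timestamp by timestamp."""
    merged = defaultdict(list)
    for points in series_list:
        for ts, value in points:
            merged[ts].append(value)
    return sorted((ts, aggregate(vals)) for ts, vals in merged.items())


def rate(points):
    """Per-second change between consecutive points."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


# Made-up samples (timestamp in seconds, value) for two CPUs.
cpu0 = [(0, 10.0), (120, 30.0), (300, 50.0), (420, 40.0)]
cpu1 = [(60, 20.0), (240, 10.0), (360, 70.0), (540, 30.0)]

sampled = [sample_by(s, 300, max) for s in (cpu0, cpu1)]  # like sampleBy(5m, max)
grouped = group_by(sampled, mean)                         # like groupBy(mean)
result = rate(grouped)                                    # like rate()
print(sampled, grouped, result)
```

With a platform like the one described here, these steps run next to the data, so only the final, much smaller series travels back to the client.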
Metrics Live In-memory, high-performance Metrics instances
In-memory: Metrics live millions of writes/s
In-memory: Metrics live
Monitoring is only the beginning OVH Metrics answers many other use cases
Gravelines rack temperatures
Even medical research… Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns on perinatal mortality
Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)
Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications
SREing Metrics With great power comes great responsibility
Metrics’s metrics 432.000.000.000 datapoints / day
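To put the daily figure in perspective, it can be converted into a per-second rate. The short calculation below assumes the load is spread evenly over the day, which real traffic is not, so it gives an average rather than a peak.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

# Average ingestion rate, assuming a perfectly even load over the day.
datapoints_per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY
print(f"{datapoints_per_second:,} datapoints/s")  # prints "5,000,000 datapoints/s"
```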
Our stack overview
● More than 650 machines operated by 5 people
● >95% dedicated servers
● No Docker, only SystemD
● Running many Apache projects:
○ Hadoop
○ HBase
○ Zookeeper
○ Flink
● And Warp 10
Our biggest Hadoop cluster
● 200 datanodes
● 60k regions of 10Gb
● ~2.3 PB of capacity
● 8.5Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s
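Dividing the cluster figures by the number of datanodes gives a rough per-node picture. The sketch below makes two assumptions that the slide does not state: that "10Gb" means 10 gigabytes per region, and that load and storage are spread evenly across nodes. Decimal units (1 PB = 1000 TB) are used throughout.

```python
DATANODES = 200
REGIONS = 60_000
REGION_SIZE_GB = 10      # assumption: "10Gb" on the slide means 10 gigabytes
CAPACITY_PB = 2.3
WRITES_PER_S = 1_500_000
READS_PER_S = 3_000_000

# Even spread across datanodes is an assumption; real clusters are skewed.
regions_per_node = REGIONS / DATANODES
region_data_tb = REGIONS * REGION_SIZE_GB / 1000
capacity_per_node_tb = CAPACITY_PB * 1000 / DATANODES
writes_per_node = WRITES_PER_S / DATANODES
reads_per_node = READS_PER_S / DATANODES

print(f"{regions_per_node:.0f} regions per datanode")
print(f"{region_data_tb:.0f} TB held in regions")
print(f"{capacity_per_node_tb:.1f} TB of capacity per datanode")
print(f"{writes_per_node:,.0f} writes/s and {reads_per_node:,.0f} reads/s per datanode")
```

Under these assumptions each datanode hosts about 300 regions, offers roughly 11.5 TB of capacity, and absorbs on average 7,500 writes and 15,000 reads per second.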
Hadoop needs a lot of
Warp10: distributed overview
Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4TB of disk
Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM
Why should you care?
Why should you care? (>30s)
The only way to optimize: measure Logs Metrics
Monitoring JVM with metrics
Tuning G1 is hard
Our programming stack
● We mostly use garbage-collected languages such as
○ Go
○ Java
○ JavaScript
Our programming stack However, we use non-garbage-collected languages such as Rust when needed
Our friends for µservices
We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group
Conclusion That’s all folks!
What to do when you must monitor the whole infrastructure of the biggest European hosting and cloud provider? How to choose a tool when the most used ones fail to scale to your needs? How to build a Metrics platform to unify, reconcile and replace years of fragmented, partial legacy solutions?
In this talk we will relate our experience building and maintaining OVH Metrics, the platform used to monitor all OVH infrastructure. We needed to go to places where most monitoring solutions hadn’t gone before: the platform had to operate at the scale of the biggest European hosting and cloud provider, with 27 data centers, more than 300k servers (bare metal!), and hundreds of products to fulfill our mission to host 1.3 million customers.
You will hear about time series, about open source solutions pushed to the limit, about HBase clusters operated at the extreme, and about how a small team leveraged the power of a handful of open source solutions and lots of coding glue to build one of the most performant monitoring solutions ever.
Here’s what was said about this presentation on social media.
And @LostInBrittany is talking to us about monitoring at @OVH at @SunnyTech_MTP #SeaTechAndSun pic.twitter.com/57KMHmR0jh
— David Pilato🇪🇺🇫🇷 (@dadoonet) June 28, 2019
Monitoring OVH: 300k servers, 27 DCs, one metrics platform, with @LostInBrittany #sunnyTech #SeaTechAndSun #Montpellier #devops pic.twitter.com/lLj2AjMWAU
— Sunny Tech (@SunnyTech_MTP) June 28, 2019
Lecture hall almost full (and still filling up) for @LostInBrittany's presentation on monitoring at @OVH! #SunnyTech #SeaTechAndSun pic.twitter.com/SBUnSFYCGg
— Aurélien Hebert (@AurrelH95) June 28, 2019