A presentation at RivieraDev in Sophia Antipolis, France by Horacio Gonzalez
Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform Horacio Gonzalez Pierre Zemb @LostInBrittany @PierreZ
Who are we? Introducing myself and introducing OVH
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter
Pierre Zemb @PierreZ Software Engineer working on distributed systems
OVH: A Global Leader on Cloud
● 200k Private cloud VMs running
● No. 1 Dedicated IaaS in Europe
● 2018: 27 Datacenters
● 2020: 50 Datacenters
● Own 15 Tbps Network with 35 PoPs
● Hosting capacity: 1.3M Physical Servers
● 360k Servers already deployed
● 1.3M Customers in 138 Countries
OVH: Key Figures
● 1.3M Customers worldwide in 138 Countries
● 1.5 billion euros investment over five years
● 28 Datacenters (growing)
● 350k Dedicated Servers
● 200k Private cloud VMs running
● 650k Public cloud Instances created in a month
● 20TB bandwidth capacity
Ranking & Recognition
● 1st European Cloud Provider*
● 1st Hosting provider in Europe
● 1st Provider Microsoft Exchange
● Certified vCloud Datacenter
● Certified Kubernetes platform (CNCF)
● VMware Global Service Provider 2013-2016
● Veeam Best Cloud Partner of the year (2018)
OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions
Once upon a time… Because I love telling tales
This talk is about a tale… A true one nevertheless
And as in most tales It begins with a mission
And a band of heroes Engulfed into the adventure
They fight against mishaps And all kinds of foes
They build mighty fortresses Pushing the limits of the possible
And defend them day after day Against all odds
But we don’t know the end yet Because this tale isn’t finished yet
It begins with a mission Build a metrics platform for OVH
A long time ago…
A long time ago… Monitoring: Does the system work?
Moving from monolith to μservices App
Moving from monolith to μservices App App App
Moving from monolith to μservices App App App App
Moving from monolith to μservices RPXY LB App App App App Cache
What could go wrong? RPXY LB App App App App Cache
Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
We need to have insights Observability: Understand how it works
OVH decided to go metrics-oriented
A metrics platform for OVH For all OVH
Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them
What is OVH Metrics? Managed Cloud Platform for Time Series
OVH monitoring story We had lots of partial solutions…
OVH monitoring story One Platform to unify them all What should we build it on?
OVH monitoring story Including a really big
OpenTSDB drawbacks OpenTSDB RowKey Design!
OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
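A minimal sketch of why a regex filter degrades into a full scan, assuming a deliberately simplified store in which rows are kept sorted by a key of the form "<metric>|<tag values>". This is not OpenTSDB's actual byte layout, and the names `ROW_KEYS`, `prefix_scan` and `regex_scan` are hypothetical. A lookup by key prefix can jump straight into the sorted keys, while a regex on a tag value has to test every row.

```python
import bisect
import re

# Hypothetical, simplified row keys: "<metric>|<tag values>", kept sorted.
ROW_KEYS = sorted(
    f"{metric}|host=h{host:03d}"
    for metric in ("cpu.usage", "disk.free", "mem.used")
    for host in range(100)
)


def prefix_scan(keys, prefix):
    """Return (matching keys, number of rows examined) using the sort order."""
    start = bisect.bisect_left(keys, prefix)
    matches = []
    examined = 0
    for key in keys[start:]:
        examined += 1
        if not key.startswith(prefix):
            break  # sorted keys: nothing after this row can match
        matches.append(key)
    return matches, examined


def regex_scan(keys, pattern):
    """A regex on the key cannot use the sort order: every row is tested."""
    compiled = re.compile(pattern)
    matches = [key for key in keys if compiled.search(key)]
    return matches, len(keys)
```

With these 300 toy rows, `prefix_scan(ROW_KEYS, "disk.free|")` only touches the rows of one metric, while `regex_scan(ROW_KEYS, r"host=h00[0-4]$")` has to examine all of them; at hundreds of millions of series that difference shows up as query latency.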
OpenTSDB other flaws
● Compaction (or append writes)
● /api/query : 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …
Scaling OpenTSDB
Metrics needs First need: To be massively scalable
Analytics is the key to success Fetching data is only the tip of the iceberg
Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer
Metrics needs Second need: To have rich query capabilities
Enter Warp 10… Open-source Time series Database
More than a Time Series DB Warp 10 is a software platform that ● Ingests and stores time series ● Manipulates and analyzes time series
Manipulating Time Series with Warp 10 A true Time Series analysis toolbox ○ Hundreds of functions ○ Manipulation frameworks ○ Analysis workflow
Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript
Did you say scalability? From the smallest to the largest…
More Warp 10 goodness ● Secured & multi tenant ● Synchronous (transactions) ● In memory Index ● Better Performance ● No cardinality issues ● Better Scalability ● Lockfree ingestion ● Versatile ● WarpScript Query Language (standalone, distributed) ● Support more data types
OVH Observability Metrics Platform
Building an ecosystem From Warp 10 to OVH Metrics
What protocols should we support? Who must do the effort?
Open source monitoring tools
Open source monitoring tools Why choose? Let’s support all of them!
Metrics Platform
Metrics Platform https://<protocol>.<region>.metrics.ovh.net with <protocol> one of: graphite, influx, opentsdb, prometheus, Warp10, tsl, …
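The endpoint naming shown on this slide can be sketched as a small helper. This is only an illustration of the URL pattern: the function name `metrics_endpoint`, the lowercase spelling of the protocol names in the host, and the placeholder region used in the usage note are assumptions, not something stated in the talk.

```python
# Protocol names as listed on the slide; the slide's trailing "…" means the
# real list is longer, so this tuple is knowingly incomplete.
KNOWN_PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10", "tsl")


def metrics_endpoint(protocol: str, region: str) -> str:
    """Build https://<protocol>.<region>.metrics.ovh.net for a listed protocol."""
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    if not region:
        raise ValueError("region must not be empty")
    return f"https://{protocol}.{region}.metrics.ovh.net"
```

For example, `metrics_endpoint("tsl", "my-region")` yields `https://tsl.my-region.metrics.ovh.net`, where `my-region` stands in for a real region name. The point of the design is that each tool keeps speaking its own protocol and only the host name changes.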
TSL (github.com/ovh/tsl)
select("cpu.usage_system")
  .where("cpu~cpu[0-7]*")
  .last(12h)
  .sampleBy(5m,max)
  .groupBy(mean)
  .rate()
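As a rough illustration of what the last three steps of such a query compute, here is a plain-Python sketch over in-memory lists. It is not how TSL or Warp 10 is implemented, and the bucket alignment and the definition of the rate are assumptions made for the example; the helper names and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sample_by_max(points, span):
    """Bucket (timestamp, value) points into `span`-second windows, keeping the max."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % span].append(value)
    return {start: max(values) for start, values in sorted(buckets.items())}


def group_by_mean(series_list):
    """Merge several bucketed series into one, averaging values that share a bucket."""
    merged = defaultdict(list)
    for series in series_list:
        for start, value in series.items():
            merged[start].append(value)
    return {start: mean(values) for start, values in sorted(merged.items())}


def rate(series):
    """Per-second change between consecutive buckets."""
    items = sorted(series.items())
    return {
        t1: (v1 - v0) / (t1 - t0)
        for (t0, v0), (t1, v1) in zip(items, items[1:])
    }


# Two hypothetical CPU series, timestamps in seconds.
cpu0 = [(0, 10.0), (120, 30.0), (300, 40.0), (450, 70.0)]
cpu1 = [(60, 20.0), (240, 50.0), (360, 100.0)]

sampled = [sample_by_max(s, 300) for s in (cpu0, cpu1)]  # like sampleBy(5m, max)
result = rate(group_by_mean(sampled))                    # like groupBy(mean).rate()
```

The value of expressing this as a query is that all three steps run next to the data, so only the final, much smaller series travels back to the client.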
Metrics Live In-memory, high-performance Metrics instances
In-memory: Metrics live millions of writes/s
Monitoring is only the beginning OVH Metrics answers many other use cases
Gravelines rack temperature
Even medical research… Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns in perinatal mortality
Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)
Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications
SREing Metrics With great power comes great responsibility
Metrics’s metrics 432.000.000.000 datapoints / day
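To put the daily figure in perspective, it can be converted to an average per-second rate. The snippet below is simple arithmetic on the number quoted on the slide; it says nothing about peak load, which the slide does not give.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

average_per_second = DATAPOINTS_PER_DAY / SECONDS_PER_DAY
print(f"{average_per_second:,.0f} datapoints/s on average")
# prints "5,000,000 datapoints/s on average"
```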
Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only systemd
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ ZooKeeper
  ○ Flink
● And Warp 10
Our biggest Hadoop cluster
● 200 datanodes
● ~60k regions of 10Gb
● 2.3 PB of capacity
● 8.5Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s
Hadoop needs a lot of
Warp10: distributed overview
Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4TB of Disk
Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM
Why should you care?
Why should you care? (>30s)
The only way to optimize: measure Logs Metrics
Monitoring JVM with metrics
Tuning G1 is hard
Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript
Our programming stack However, we use non-garbage-collected languages such as Rust when needed
Our friends for µservices
We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group
Conclusion That’s all folks!
What do you do when you must monitor the whole infrastructure of the biggest European hosting and cloud provider? How do you choose a tool when the most widely used ones fail to scale to your needs? How do you build a Metrics platform to unify, reconcile and replace years of fragmented, partial legacy solutions?
In this talk we will relate our experience building and maintaining OVH Metrics, the platform used to monitor all OVH infrastructure. We needed to go to places where most monitoring solutions hadn’t gone before. The platform had to operate at the scale of the biggest European hosting and cloud provider: 27 data centers, more than 300k servers (bare metal!), and hundreds of products to fulfill our mission to host 1.3 million customers.
You will hear about time series, about open source solutions pushed to the limit, about HBase clusters operated at the extreme, and about how a small team leveraged the power of a handful of open source solutions and lots of coding glue to build one of the most performant monitoring solutions ever.
Here’s what was said about this presentation on social media.
@LostInBrittany and @PierreZ tell us their "never-ending story" of building the metrics platform 👏🏻 @RivieraDEV pic.twitter.com/iVwKjJ2BnR
— Cecile (@CecileHbh) May 16, 2019
@PierreZ @LostInBrittany speaking about Wap 10 @ @RivieraDEV. Awesome presentation guys! pic.twitter.com/q0bdJ0m2mw
— Nikita Rousseau (@nirousseau) May 16, 2019
This is not product placement 😜@SenXHQ @warp10io in the spotlight 😎@PierreZ @LostInBrittany @SylvainLareyre OK for the hug 😊 pic.twitter.com/M60jw7Wmyv
— JobOpportunIT (@JobOpportunIT_) May 16, 2019
Great talk @RivieraDEV from @LostInBrittany and @PierreZ on Monitoring @OVH #SenX pic.twitter.com/Voxwa4KyIf
— Tiffany Souterre (@TiffanySouterre) May 16, 2019
Apparently @LostInBrittany wants a tweet. Well, there you go, checked. #RivieraDev pic.twitter.com/ABuIfACKoU
— Carine (@CarineReignault) May 16, 2019
Off we go with Monitoring OVH: 300k servers, 27 DCs, one metrics platform with @LostInBrittany and @PierreZ pic.twitter.com/GB5CsLoz6m
— RivieraDEV (@RivieraDEV) May 16, 2019