A presentation at Devops D-Day 2019 in Marseille, France by Horacio Gonzalez
DEVOPS D-DAY #5 Monitoring OVH: 350k servers, 30 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany DEVOPS D-DAY #5 @LostInBrittany
Who are we? Introducing myself and introducing OVH OVHcloud
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter
OVH: A Global Leader on Cloud ● 250k Private cloud VMs running ● No. 1 Dedicated IaaS in Europe ● 30 Datacenters ● Own 20Tbps Network with 35 PoPs ● Hosting capacity: 1.3M Physical Servers ● 360k Servers already deployed ● 1.3M Customers in 138 Countries
OVH: Our solutions Product families: Cloud, Web Hosting, Mobile Hosting, Telecom. Products shown: VPS, Containers, Dedicated Server, Domain names, VoIP, Public Cloud, Compute, Data Storage, Email, SMS/Fax, Private Cloud, Network and Security, Database, CDN, Virtual desktop, Object Storage, Web hosting, Cloud HubiC, OverTheBox, Licences, Cloud Desktop, Securities, MS Office, Hybrid Cloud, Messaging, MS solutions
And don’t forget, next week… OVHcloud Summit https://summit.ovhcloud.com/
Once upon a time… Because I love telling tales
This talk is about a tale… A true one nevertheless
And as in most tales It begins with a mission
And a band of heroes Engulfed into the adventure
They fight against mishaps And all kinds of foes
They build mighty fortresses Pushing the limits of the possible
And defend them day after day Against all odds
But we don’t know the end yet Because this tale isn’t finished yet
It begins with a mission Build a metrics platform for OVH
A long time ago… Monitoring: Does the system work?
Moving from monolith to μservices [Diagram, built up step by step: RPXY, LB, Cache, App, App, App, Bus, DB, App, Slaves]
What could go wrong? [Same diagram: RPXY, LB, Cache, App, App, App, Bus, DB, App, Slaves]
Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
We need to have insights Observability: How does the system work?
OVH decided to go metrics-oriented
A metrics platform for OVH For all OVH
Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them
What is OVH Metrics? Managed Cloud Platform for Time Series
OVH monitoring story We had lots of partial solutions…
OVH monitoring story One Platform to unify them all What should we build it on?
OVH monitoring story Including a really big
OpenTSDB drawbacks OpenTSDB RowKey Design!
OpenTSDB Rowkey design flaws ● .regex. => full table scans ● High cardinality issues (Query latencies) We needed something able to manage hundreds of millions of time series OpenTSDB didn’t scale for us
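To make the regex problem concrete, here is a toy sketch in Python. It assumes a simplified row key of the form metric, then time bucket, then tags, which follows the general shape of an OpenTSDB row key but uses readable strings instead of encoded identifiers. The table, metric names and helper functions are all hypothetical.

```python
import bisect
import re

# Toy "table": sorted row keys shaped like metric|hour|tags
# (simplified, hypothetical layout; real keys are binary-encoded).
ROWS = sorted(
    f"{metric}|{hour:04d}|host={host}"
    for metric in ("cpu.usage", "mem.used")
    for hour in range(24)
    for host in ("web1", "web2", "db1", "db2")
)

def prefix_scan(prefix):
    """Known metric and hour: binary-search to the prefix, then read only nearby rows."""
    start = bisect.bisect_left(ROWS, prefix)
    hits, touched = [], 0
    for key in ROWS[start:]:
        touched += 1
        if not key.startswith(prefix):
            break  # first row past the prefix, stop reading
        hits.append(key)
    return hits, touched

def regex_scan(pattern):
    """Regex on the key: no usable prefix, so every row has to be examined."""
    rx = re.compile(pattern)
    hits = [key for key in ROWS if rx.search(key)]
    return hits, len(ROWS)

prefix_hits, prefix_touched = prefix_scan("cpu.usage|0003|")
regex_hits, regex_touched = regex_scan(r"host=web\d$")
print(prefix_touched, "rows read for the prefix scan")
print(regex_touched, "rows read for the regex scan")
```

The prefix scan reads a handful of rows whatever the table size, while the regex scan grows with the whole table, which is the behaviour the slide calls a full table scan.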
OpenTSDB other flaws ● Compaction (or append writes) ● /api/query : 1 endpoint per function? ● Asynchronous ● Unauthenticated ● …
Scaling OpenTSDB
Metrics needs First need: To be massively scalable
Analytics is the key to success Fetching data is only the tip of the iceberg
Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer
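A rough way to picture this point is to compare how much data leaves the database in each case. The sketch below is only an illustration in Python: the series, the two fetch functions and the JSON payloads are hypothetical stand-ins, not an OVH Metrics or Warp 10 API.

```python
import json
import random

random.seed(0)
# One hypothetical series: one point every 10 s for a day (8640 points).
points = [(t, random.random()) for t in range(0, 86_400, 10)]

def fetch_raw():
    """Client-side analysis: the database ships every point to the user."""
    return json.dumps(points)

def fetch_hourly_max():
    """Database-side analysis: only one aggregated value per hour is shipped."""
    hourly = {}
    for t, v in points:
        hour = t // 3600
        hourly[hour] = max(hourly.get(hour, v), v)
    return json.dumps(sorted(hourly.items()))

raw_payload = fetch_raw()
agg_payload = fetch_hourly_max()
print(len(raw_payload), "bytes when the client fetches everything")
print(len(agg_payload), "bytes when the database aggregates first")
```

With many series and many users, the gap between the two payloads is what makes analysis inside the database the scalable option.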
Metrics needs Second need: To have rich query capabilities
Enter Warp 10… Open-source Time series Database
More than a Time Series DB Warp 10 is a software platform that ● Ingests and stores time series ● Manipulates and analyzes time series
Manipulating Time Series with Warp 10 A true Time Series analysis toolbox ○ Hundreds of functions ○ Manipulation frameworks ○ Analysis workflow
Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript
Did you say scalability? From the smallest to the largest…
More Warp 10 goodness ● Secured & multi tenant ● Synchronous (transactions) ● In memory Index ● Better Performance ● No cardinality issues ● Better Scalability ● Lockfree ingestion ● Versatile ● WarpScript Query Language (standalone, distributed) ● Support more data types
OVH Observability Metrics Platform
Building an ecosystem From Warp 10 to OVH Metrics
What protocols should we support? Who must do the effort?
Open source monitoring tools
Open source monitoring tools Why choose? Let’s support all of them!
Metrics Platform
Metrics Platform https://{graphite, influx, opentsdb, prometheus, warp10, tsl, …}.<region>.metrics.ovh.net
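The naming scheme on this slide can be sketched as a small helper. This is an illustration only: the function name is made up, the protocol list is limited to the names visible on the slide, and the region value used in the example is a placeholder.

```python
# Protocol names as they appear on the slide; the real list may be longer ("…").
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10", "tsl")

def endpoint(protocol: str, region: str) -> str:
    """Build https://<protocol>.<region>.metrics.ovh.net for a known protocol."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    return f"https://{protocol}.{region}.metrics.ovh.net"

# "gra1" is a placeholder region used for illustration.
for name in PROTOCOLS:
    print(endpoint(name, "gra1"))
```

The point of the design is that each tool keeps speaking its own protocol and only the hostname changes.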
TSL select("cpu.usage_system").where("cpu~cpu[0-7]*").last(12h).sampleBy(5m,max).groupBy(mean).rate() github.com/ovh/tsl
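To read the TSL pipeline above step by step, here is a plain Python imitation of what each stage does to the data. It is not how TSL is implemented, and the exact TSL semantics (bucket alignment, regex matching rules) may differ. The sample data and helper functions are hypothetical, and the last(12h) time filter is left out to keep the sketch short.

```python
import re
from collections import defaultdict

def sample_by(series, span, agg):
    """Bucket (timestamp, value) points into fixed windows and aggregate each window."""
    buckets = defaultdict(list)
    for ts, value in series:
        buckets[ts - ts % span].append(value)
    return sorted((ts, agg(vals)) for ts, vals in buckets.items())

def group_by(all_series, agg):
    """Merge several aligned series into one, timestamp by timestamp."""
    merged = defaultdict(list)
    for series in all_series:
        for ts, value in series:
            merged[ts].append(value)
    return sorted((ts, agg(vals)) for ts, vals in merged.items())

def rate(series):
    """Per-second change between consecutive points."""
    return [(t1, (v1 - v0) / (t1 - t0))
            for (t0, v0), (t1, v1) in zip(series, series[1:])]

def mean(values):
    return sum(values) / len(values)

# Hypothetical data: value of the "cpu" label -> [(timestamp in seconds, value)]
data = {
    "cpu0": [(0, 1.0), (60, 3.0), (300, 5.0), (360, 7.0)],
    "cpu1": [(0, 2.0), (60, 5.0), (300, 9.0), (360, 6.0)],
    "cpu-total": [(0, 50.0), (300, 60.0)],
}

# where("cpu~cpu[0-7]*"): keep series whose label matches (full match assumed here)
selected = [pts for cpu, pts in data.items() if re.fullmatch(r"cpu[0-7]*", cpu)]
# sampleBy(5m, max): one maximum per 300 s window
sampled = [sample_by(pts, 300, max) for pts in selected]
# groupBy(mean): average the series together
grouped = group_by(sampled, mean)
# rate(): per-second variation
result = rate(grouped)
print(grouped)
print(result)
```

Each chained call in TSL narrows or reshapes the series before the next one runs, which is why the whole expression can be evaluated on the platform side.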
Metrics Live In-memory, high-performance Metrics instances
In-memory: Metrics live millions of writes/s
Monitoring is only the beginning OVH Metrics answers many other use cases
Gravelines rack temperature
Even medical research… Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns on perinatal mortality
Use case families • Billing (e.g. bill on monthly max consumption) • Monitoring (APM, infrastructure, appliances, …) • IoT (Manage devices, operator integration, …) • Geo Location (Manage localized fleets)
Use cases • DC Temperature/Elec/Cooling map • Pay as you go billing (PCI/IPLB) • GSCAN • Monitoring • ML Model scoring (Anti-Fraud) • Pattern Detection for medical applications
SREing Metrics With great power comes great responsibility
Metrics’s metrics 432,000,000,000 datapoints / day
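To put the daily figure in perspective, it can be converted to an average per-second rate. The short Python calculation below assumes a perfectly even load across the day, which real traffic will not follow.

```python
DATAPOINTS_PER_DAY = 432_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average ingestion rate, assuming the load is spread evenly over the day.
per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY
print(f"{per_second:,} datapoints/s on average")  # prints "5,000,000 datapoints/s on average"
```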
Our stack overview ● More than 666 machines operated by 5 people ● >95% dedicated servers ● No Docker, only systemd ● Running many Apache projects: ○ Hadoop ○ HBase ○ ZooKeeper ○ Flink ● And Warp 10
Our biggest Hadoop cluster ● 200 datanodes ● ~60k regions of 10Gb ● 2.3 PB of capacity ● 8.5Gb/s of bandwidth ● 1.5M writes/s ● 3M reads/s
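Dividing these cluster totals by the number of datanodes gives a feel for what one machine handles. This Python calculation is a back-of-the-envelope sketch that assumes regions and traffic are spread perfectly evenly, which is rarely true in practice.

```python
DATANODES = 200
REGIONS = 60_000          # "~60k" on the slide, so an approximation
WRITES_PER_S = 1_500_000
READS_PER_S = 3_000_000

# Per-datanode averages under the even-spread assumption.
regions_per_node = REGIONS // DATANODES
writes_per_node = WRITES_PER_S // DATANODES
reads_per_node = READS_PER_S // DATANODES

print(regions_per_node, "regions per datanode")
print(writes_per_node, "writes/s per datanode")
print(reads_per_node, "reads/s per datanode")
```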
Hadoop needs a lot of …
Warp10: distributed overview
Hadoop nodes Most of the nodes are the following: ● 16 to 32 cores ● 64 to 128 GB of RAM ● 12 to 16 TB But we also have some huge nodes: ● 2x 20 cores (Xeon Gold) ● 320 GB of RAM ● 12x 4 TB of disk
Warp10 nodes ● Ingress (cpu-bound): 32 cores, 128 GB of RAM ● Egress (cpu-bound): 32 cores, 128 GB of RAM ● Directory (ram-bound): 48 cores, 512 GB of RAM ● Store (cpu-bound): 32 cores, 128 GB of RAM
Why should you care? (>30s)
The only way to optimize: measure ● App: What is my application doing? How many HTTP calls? ● Runtime: What is my runtime doing? How many GCs were triggered? ● Host: Is there a hardware failure? How much disk do I have left? ● Signals: Logs, Metrics
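The three layers on this slide (application, runtime, host) can be illustrated with one small collector. The sketch below uses Python and its own garbage collector as a stand-in for the JVM discussed in the talk. The metric names, the Counter class and the request handler are hypothetical, and nothing is actually sent to a metrics platform.

```python
import gc
import shutil
import time

class Counter:
    """Minimal application-level counter (hypothetical helper)."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

http_calls = Counter()

def handle_request():
    """Pretend request handler: the application layer counts its own HTTP calls."""
    http_calls.inc()

def collect_metrics(path="."):
    """Take one snapshot across the app, runtime and host layers."""
    disk = shutil.disk_usage(path)
    return {
        "timestamp": int(time.time()),
        # App: how many HTTP calls?
        "app.http.calls": http_calls.value,
        # Runtime: how many GC runs so far, summed over all generations?
        "runtime.gc.collections": sum(g["collections"] for g in gc.get_stats()),
        # Host: how much disk is left?
        "host.disk.free_bytes": disk.free,
        "host.disk.used_ratio": disk.used / disk.total,
    }

for _ in range(3):
    handle_request()

snapshot = collect_metrics()
for name, value in snapshot.items():
    print(name, value)
```

Taking such snapshots at a fixed interval and pushing them as time series is what turns these one-off questions into trends that can be compared over time.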
Monitoring JVM with metrics
Tuning G1 is hard
Our programming stack ● We mostly use garbage-collected languages such as ○ Go ○ Java ○ JavaScript
Our programming stack However, we use non-garbage-collected languages such as Rust when needed
Our friends for µservices
We open-source Code contribution: ● https://github.com/ovh/beamium ● https://github.com/ovh/noderig ● https://github.com/ovh/tsl ● https://github.com/ovh/ovh-warp10-datasource ● https://github.com/ovh/ovh-tsl-datasource ● … Involved in: ● Warp10 community ● Apache HBase/Flink development ● Prometheus/InfluxData discussions ● TS Query Language Working group
Conclusion That’s all folks!
How do you monitor the entire infrastructure of the largest European cloud provider? How do you choose a tool when the most popular ones don't hold up at this scale? How do you build a Metrics platform to unify, reconcile and replace years of fragmented legacy and partial solutions?
In this talk we share our experience building and maintaining OVH Metrics, the platform used to monitor all of OVH's infrastructure. We needed to go where most monitoring solutions have never gone, operating at the scale of the largest European cloud and hosting provider: 27 data centers, more than 300k servers (physical ones!) and hundreds of products to fulfil our mission for our 1.3 million customers.
Come and hear this story of time series, of open-source solutions pushed to the extreme, of HBase clusters operated at the limit of their capacity, and of how a small team relied on a handful of open-source solutions and a good dose of home-grown code to build one of the most powerful monitoring solutions in the world.