A presentation at OVH Bordeaux Meetup in Bordeaux, France by Horacio Gonzalez
Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany Monitoring OVH @LostInBrittany
Who are we? Introducing myself and introducing OVH
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter
OVH: A Global Leader on Cloud ● 200k Private cloud VMs running ● 1 Dedicated IaaS Europe ● 2018: 27 Datacenters ● 2020: 50 Datacenters ● Own 15 Tbps Network with 35 PoPs ● Hosting capacity: 1.3M Physical Servers ● 360k Servers already deployed ● 1.3M Customers in 138 Countries
OVH: Key Figures ● 1.3M Customers worldwide in 138 Countries ● 1.5 billion euros investment over five years ● 28 Datacenters (growing) ● 350k Dedicated Servers ● 200k Private cloud VMs running ● 650k Public cloud Instances created in a month ● 20TB bandwidth capacity
Ranking & Recognition ● 1st European Cloud Provider* ● 1st Hosting provider in Europe ● 1st Provider Microsoft Exchange ● Certified vCloud Datacenter ● Certified Kubernetes platform (CNCF) ● VMware Global Service Provider 2013-2016 ● Veeam Best Cloud Partner of the year (2018)
OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions
Once upon a time… Because I love telling tales
This talk is about a tale… A true one nevertheless
And as in most tales It begins with a mission
And a band of heroes Engulfed into the adventure
They fight against mishaps And all kinds of foes
They build mighty fortresses Pushing the limits of the possible
And defend them day after day Against all odds
But we don’t know the end yet Because this tale isn’t finished
It begins with a mission Build a metrics platform for OVH
Why do we need metrics? To make better decisions by using numbers
Why do we need metrics? We want our code to add value
Why do we need metrics? We need to make better decisions about our code
Why do we need metrics? Code adds value when it runs not when we write it
Why do we need metrics? We need to know what our code does when it runs
Why do we need metrics? We can’t do this unless we measure it
Why do we need metrics? We have a mental model of what our code does
Why do we need metrics? This representation can be wrong
Why do we need metrics? We can’t know until we measure it
Find the bottleneck “The app is slow.” - User
Find the bottleneck “The app is slow.” - User “The page takes 500ms!” - Ops
Find the bottleneck? SQL Query? Template Rendering? Session Storage?
Find the bottleneck? We don’t know
With observability: SQL Query: 53 ms ● Template Rendering: 1 ms ● Session Storage: 315 ms
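A per-step breakdown like the one on the slide comes from timing each stage of a request separately. The following is a minimal sketch of that idea, not the instrumentation used at OVH: the `timed` helper, the step names and the `time.sleep` calls standing in for real work are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per step, in milliseconds.
timings_ms = {}


@contextmanager
def timed(step):
    """Record how long the enclosed block takes under the given step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings_ms[step] = timings_ms.get(step, 0.0) + elapsed_ms


def handle_request():
    # The sleeps are placeholders for real work; durations are illustrative.
    with timed("sql_query"):
        time.sleep(0.005)
    with timed("template_rendering"):
        time.sleep(0.001)
    with timed("session_storage"):
        time.sleep(0.030)


handle_request()
slowest = max(timings_ms, key=timings_ms.get)
for step, ms in timings_ms.items():
    print(f"{step}: {ms:.1f} ms")
print("slowest step:", slowest)
```

Once each step reports its own duration, the bottleneck stops being a guess: the slowest entry points at where to look first.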
Why do we need metrics? We improve our mental model by measuring what our code does
Why do we need metrics? We use our mental model to decide what to do
Why do we need metrics? A better mental model makes us better at deciding what to do
Why do we need metrics? Better decisions make us better at generating value
Why do we need metrics? Measuring makes your app better
It began with a mission Build a metrics platform for OVH
A metrics platform for OVH For all OVH
Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them
What is OVH Metrics? Managed Cloud Platform for Time Series
OVH monitoring story We had lots of partial solutions…
OVH monitoring story One Platform to unify them all What should we build it on?
OVH monitoring story Including a really big
OpenTSDB drawbacks OpenTSDB RowKey Design!
OpenTSDB Rowkey design flaws ● .regex. => full table scans ● High cardinality issues (Query latencies) We needed something able to manage hundreds of millions of time series. OpenTSDB didn’t scale for us.
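Why a regex on tags turns into a scan can be illustrated with a toy model of a sorted key space. This is a simplified sketch, assuming the commonly documented OpenTSDB key layout (metric, then base timestamp, then tag pairs); the tuples, metric names and host names below are invented for illustration and are not real row keys.

```python
import re
from bisect import bisect_left

# Toy row keys: (metric, base_hour, tags). Real keys are binary UIDs;
# tuples keep the ordering idea readable. Sorted like an HBase table.
rows = sorted([
    ("cpu.load", 0, (("host", "web-1"),)),
    ("cpu.load", 0, (("host", "web-2"),)),
    ("cpu.load", 0, (("host", "db-1"),)),
    ("cpu.load", 1, (("host", "web-1"),)),
    ("mem.used", 0, (("host", "web-1"),)),
])


def scan(metric, first_hour, last_hour):
    """Range scan: the key prefix (metric, hour) narrows the rows read."""
    lo = bisect_left(rows, (metric, first_hour))
    hi = bisect_left(rows, (metric, last_hour + 1))
    return rows[lo:hi]


def query(metric, first_hour, last_hour, tag, pattern):
    """Tags sit at the end of the key, so a regex on a tag value cannot
    narrow the range: every candidate row has to be inspected."""
    candidates = scan(metric, first_hour, last_hour)
    regex = re.compile(pattern)
    matches = [r for r in candidates
               if regex.fullmatch(dict(r[2]).get(tag, ""))]
    return matches, len(candidates)


matches, rows_read = query("cpu.load", 0, 0, "host", r"db-.*")
print(f"{len(matches)} matching row(s), {rows_read} rows read")
```

In this toy table the cost is trivial, but the ratio of rows read to rows returned grows with the number of distinct tag combinations, which is where the high cardinality issue on the slide comes from.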
OpenTSDB other flaws ● Compaction (or append writes) ● /api/query: 1 endpoint per function? ● Asynchronous ● Unauthenticated ● …
Scaling OpenTSDB
Metrics needs First need: To be massively scalable
Analytics is the key to success Fetching data is only the tip of the iceberg
Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer
Metrics needs Second need: To have rich query capabilities
Enter Warp 10… Open-source Time series Database
More than a Time Series DB Warp 10 is a software platform that ● Ingests and stores time series ● Manipulates and analyzes time series
Manipulating Time Series with Warp 10 A true Time Series analysis toolbox ○ Hundreds of functions ○ Manipulation frameworks ○ Analysis workflow
Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript
Did you say scalability? From the smallest to the largest…
More Warp 10 goodness ● Secured & multi-tenant ● Synchronous (transactions) ● In-memory index ● Better performance ● No cardinality issues ● Better scalability ● Lock-free ingestion ● Versatile ● WarpScript Query Language (standalone, distributed) ● Supports more data types
OVH Observability Metrics Platform
Metrics Data Platform
Building an ecosystem From Warp 10 to OVH Metrics
Multi-protocol Why choose? We need them all!
Open source monitoring tools
Open source monitoring tools Why choose? Let’s support all of them!
Metrics Platform
Metrics Platform https://{graphite, influx, opentsdb, prometheus, warp10, …}.<region>.metrics.ovh.net
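The slide's hostname pattern gives one endpoint per protocol and region. A small sketch of how a client might build those base URLs follows; the helper name `metrics_endpoint` and the region value `gra1` are hypothetical, and only the protocol names and the `<protocol>.<region>.metrics.ovh.net` shape come from the slide.

```python
# Protocol names as listed on the slide (the slide's list ends with "…",
# so this tuple is not exhaustive).
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10")


def metrics_endpoint(protocol: str, region: str) -> str:
    """Build the base URL for one protocol in one region.

    Hypothetical helper: only the hostname pattern comes from the talk.
    """
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return f"https://{protocol}.{region}.metrics.ovh.net"


# "gra1" is a made-up region label used only to show the substitution.
for protocol in PROTOCOLS:
    print(metrics_endpoint(protocol, "gra1"))
```

The point of the pattern is that an existing agent or dashboard keeps speaking its own protocol and only the hostname changes.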
Metrics Live In-memory, high-performance Metrics instances
In-memory: Metrics live More than 120 million writes/s
Monitoring is only the beginning OVH Metrics answers many other use cases
Use case families • Billing (e.g. bill on monthly max consumption) • Monitoring (APM, infrastructure, appliances, …) • IoT (Manage devices, operator integration, …) • Geo Location (Manage localized fleets)
Use cases • DC Temperature/Elec/Cooling map • Pay as you go billing (PCI/IPLB) • GSCAN • Monitoring • ML Model scoring (Anti-Fraud) • Pattern Detection for medical applications
SREing Metrics With great power comes great responsibility
Metrics’ own metrics 432 000 000 000 datapoints / day
Metrics’ own metrics 10 Tb / day
Metrics’ own metrics 5 000 000 dp/s
Metrics’ own metrics 500 000 000 series
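The daily and per-second figures above are the same number seen at two time scales, which a few lines of arithmetic confirm. The per-series average at the end is derived here from the slide figures and was not stated in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400

datapoints_per_second = 5_000_000   # from the slide
series = 500_000_000                # from the slide

# 5 000 000 dp/s sustained over a day gives the daily figure on the slide.
datapoints_per_day = datapoints_per_second * SECONDS_PER_DAY
print(f"{datapoints_per_day:,} datapoints / day")

# Derived, not from the talk: the average load spread over every series.
points_per_series_per_day = datapoints_per_day // series
average_interval_s = SECONDS_PER_DAY // points_per_series_per_day
print(f"{points_per_series_per_day} points per series per day, "
      f"one every {average_interval_s} s on average")
```

This is only an average: real series are likely to report at very different rates.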
Our cluster sizes GRA: ● 150 nodes ● 2 PB ● 1.1 Gbps BHS: ● 30 nodes ● 400 TB ● 120 Mbps
Our cluster architecture
Detecting errors Before it’s too late
Extract errors from logs
Tailor Forward logs and extract metrics!
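Turning logs into metrics boils down to parsing each line and counting what matters. The sketch below shows the general idea only; it is not how Tailor is implemented, and the log line format, the `LOG_LINE` pattern and the sample lines are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <message>".
LOG_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def count_levels(lines):
    """Count log lines per level; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


# Invented sample lines, for illustration only.
sample = [
    "2019-01-01T00:00:00Z INFO region opened",
    "2019-01-01T00:00:01Z ERROR write failed",
    "2019-01-01T00:00:02Z WARN slow flush",
    "2019-01-01T00:00:03Z ERROR write failed",
    "not a log line",
]

counts = count_levels(sample)
for level, n in sorted(counts.items()):
    print(f"log.lines{{level={level}}} {n}")
```

An error count per level, pushed as a time series, can then be graphed and alerted on like any other metric.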
Monitoring the JVM
Documentation
JVM GC The good, the bad and the ugly
The good
The bad
… and the ugly #java #jdk11 #zgc
Monitoring HBase
Number of open regions
Queues length
Number of read and write requests
Preserve data locality
Host health
Pokédex Inventory all animals.
Merging all data sources
Global visualization
Correlate information
Sacha The best tamer
An awesome CLI
Retrieving bare information
Create region map
Move region to another region server
Drain regions of the region server
Managing multiple hardware profiles
Balance the cluster
Conclusion That’s all folks!
What do you do when you must monitor the whole infrastructure of the biggest European hosting and cloud provider? How do you choose a tool when the most used ones fail to scale to your needs? How do you build a Metrics platform to unify, reconcile and replace years of fragmented, partial legacy solutions?
In this talk we will relate our experience building and maintaining OVH Metrics, the platform used to monitor all OVH infrastructure. We needed to go to places where most monitoring solutions hadn’t gone before. The platform needed to operate at the scale of the biggest European hosting and cloud provider: 27 data centers, more than 300k servers (bare metal!), and hundreds of products to fulfill our mission to host 1.3 million customers.
You will hear about time series, about open source solutions pushed to the limit, about HBase clusters operated at the extreme, and about how a small team leveraged the power of a handful of open source solutions and lots of coding glue to build one of the most performant monitoring solutions ever.
The following resources were mentioned during the presentation or are useful additional information.