Monitoring OVH 300k servers, 27 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany Monitoring @LostInBrittany

Temporary outline
- Intro: us and OVH (5 minutes)
- Intro: our talk (2 minutes)
- Make better decisions by using numbers (5 minutes)
- Building OVH Metrics (10 minutes)
- Conclusion (2 minutes)
- Bye bye (1 minute)

Who are we? Introducing myself and introducing OVH

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek

OVH: Key Figures
● 1.3M customers worldwide in 138 countries
● 1.5 billion euros investment over five years
● 30 datacenters (growing)
● 350k dedicated servers
● 200k private cloud VMs running
● 650k public cloud instances created in a month
● 15 Tbps bandwidth capacity
● 2,500 employees in 19 countries
● 18 years of innovation
● 35 points of presence
● 4TB anti-DDoS capacity
● Hosting capacity: 1.3M physical servers

OVH: A Global Leader on Cloud
● 200k private cloud VMs running
● #1 Dedicated IaaS Europe
● 2018: 27 datacenters
● 2020: 50 datacenters
● Own 15 Tbps network with 35 PoPs
● Hosting capacity: 1.3M physical servers
● 360k servers already deployed
● 1.3M customers in 138 countries

Ranking & Recognition
● 1st European Cloud Provider*
● 1st Hosting provider in Europe
● 1st Provider Microsoft Exchange
● Certified vCloud Datacenter
● Certified Kubernetes platform (CNCF)
● VMware Global Service Provider 2013-2016
● Veeam Best Cloud Partner of the year (2018)
* Netcraft 2017

OVH: Our solutions Cloud, Web Hosting, Mobile Hosting, Telecom, VPS, Containers, Dedicated Server, Domain names, VoIP, Public Cloud, Compute, Data Storage, Email, SMS/Fax, Private Cloud, Network and Database, CDN, Virtual desktop, Dedicated server, Security, Object Storage, Web hosting, Cloud, hubiC, OverTheBox, Licences, Cloud Desktop, Securities, MS Office, Hybrid Cloud, Messaging, MS solutions

Once upon a time… Because I love telling tales

This talk is about a tale… A true one nevertheless

And as in most tales It begins with a mission

And a band of heroes Engulfed into the adventure

They fight against mishaps And all kinds of foes

They build mighty fortresses Pushing the limits of the possible

And defend them day after day Against all odds

But we don’t know yet the end Because this tale isn’t finished yet

It begins with a mission Build a metrics platform for OVH

Why do we need metrics? To make better decisions by using numbers

Why do we need metrics? We want our code to add value

Why do we need metrics? We need to make better decisions about our code

Why do we need metrics? Code adds value when it runs, not when we write it

Why do we need metrics? We need to know what our code does when it runs

Why do we need metrics? We can’t do this unless we measure it

Why do we need metrics? We have a mental model of what our code does

Why do we need metrics? This representation can be wrong

Why do we need metrics? We can’t know until we measure it

Find the bottleneck “The app is slow.” - User “The page takes 500ms!” - Ops

Find the bottleneck SQL Query? Template Rendering? Session Storage?

Find the bottleneck We don’t know

Find the bottleneck With observability:
● SQL Query: 53ms
● Template Rendering: 1ms
● Session Storage: 315ms
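The per-step breakdown above can be obtained by timing each stage of a request separately. The following is a minimal sketch, not the instrumentation used at OVH: the `timed` helper, the stage names and the `time.sleep` calls standing in for real work are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager

# Accumulated duration per stage, in milliseconds.
timings_ms = {}


@contextmanager
def timed(name):
    """Measure the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings_ms[name] = timings_ms.get(name, 0.0) + elapsed_ms


def handle_request():
    """Hypothetical request handler; the sleeps stand in for real work."""
    with timed("sql_query"):
        time.sleep(0.005)
    with timed("template_rendering"):
        time.sleep(0.001)
    with timed("session_storage"):
        time.sleep(0.020)


def slowest(timings):
    """Return the name of the stage with the largest accumulated time."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    handle_request()
    for stage, ms in timings_ms.items():
        print(f"{stage}: {ms:.1f}ms")
    print("bottleneck:", slowest(timings_ms))
```

With one timer per stage, "the page is slow" turns into a ranked list of candidates, and the largest entry is the first place to look.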

Why do we need metrics? We improve our mental model by measuring what our code does

Why do we need metrics? We use our mental model to decide what to do

Why do we need metrics? A better mental model makes us better at deciding what to do

Why do we need metrics? Better decisions make us better at generating value

Why do we need metrics? Measuring makes your app better

It began with a mission Build a metrics platform for OVH

A metrics platform for OVH For all OVH

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

What is OVH Metrics? Managed Cloud Platform for Time Series

OVH monitoring story We had lots of partial solutions…

OVH monitoring story One Platform to unify them all What should we build it on?

OVH monitoring story Including a really big

OpenTSDB drawbacks OpenTSDB RowKey Design !

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series. OpenTSDB didn’t scale for us.
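To see why a regex filter leads to a full scan, consider a toy model of a sorted key-value table. This is a sketch under simplifying assumptions: real OpenTSDB rowkeys are binary and also encode a timestamp, whereas the string keys, the `|` separator and the function names below are hypothetical and only illustrate the access pattern.

```python
import bisect
import re

# Hypothetical, simplified rowkeys "<metric>|<tags>", kept sorted like rows in HBase.
ROWKEYS = sorted([
    "cpu.load|dc=bhs,host=web01",
    "cpu.load|dc=gra,host=db01",
    "cpu.load|dc=gra,host=web01",
    "cpu.load|dc=gra,host=web02",
    "mem.used|dc=gra,host=web01",
])


def prefix_scan(keys, prefix):
    """Seek to the first candidate key, then read only the contiguous range."""
    start = bisect.bisect_left(keys, prefix)
    matched, visited = [], 0
    for key in keys[start:]:
        visited += 1
        if not key.startswith(prefix):
            break  # sorted keys: nothing further can match
        matched.append(key)
    return matched, visited


def regex_scan(keys, pattern):
    """A regex on the tag part cannot use the sort order: every key is tested."""
    rx = re.compile(pattern)
    matched = [key for key in keys if rx.search(key)]
    return matched, len(keys)


if __name__ == "__main__":
    print(prefix_scan(ROWKEYS, "cpu.load|dc=gra"))
    print(regex_scan(ROWKEYS, r"host=web\d+$"))
```

In this model the prefix scan touches only the matching range plus one key, while the regex scan always touches the whole table, so its cost grows with the total number of series rather than with the size of the answer.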

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query : 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Scaling OpenTSDB

Metrics needs First need: To be massively scalable

Analytics is the key to success Fetching data is only the tip of the iceberg

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer

Metrics needs Second need: To have rich query capabilities

Enter Warp 10… Open-source Time series Database

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript

Did you say scalability? From the smallest to the largest…

More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile
● WarpScript Query Language (standalone, distributed)
● Support more data types

Metrics Data Platform

Building an ecosystem From Warp 10 to OVH Metrics

Multi-protocol Why choose? We need them all!

Open source monitoring tools Why choose? Let’s support all of them!

Metrics Platform

Metrics Platform https://{graphite, influx, opentsdb, prometheus, warp10, …}.<region>.metrics.ovh.net
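The naming scheme on the slide, one hostname per protocol and region, can be expressed as a small helper. This is only a sketch of the URL pattern shown above: the function name is hypothetical, the region value `gra1` is an illustrative guess rather than a documented identifier, and paths, ports and authentication are not covered by the slide and are left out.

```python
# Protocols named on the slide; the slide's "…" suggests the list is not exhaustive.
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10")


def metrics_endpoint(protocol, region):
    """Build the base URL https://<protocol>.<region>.metrics.ovh.net."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    if not region:
        raise ValueError("region must be a non-empty string")
    return f"https://{protocol}.{region}.metrics.ovh.net"


if __name__ == "__main__":
    # "gra1" is a made-up region value used only for the demonstration.
    for protocol in PROTOCOLS:
        print(metrics_endpoint(protocol, "gra1"))
```

The point of the scheme is that an existing tool keeps speaking its own protocol and only the hostname it targets changes.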

Metrics Live In-memory, high-performance Metrics instances

In-memory: Metrics live 120+ million writes/s

Monitoring is only the beginning OVH Metrics answers many other use cases

Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)

Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications

SREing Metrics With great power comes great responsibility

Metrics’ own metrics 432 000 000 000 datapoints / day

Metrics’ own metrics 10 Tb / day

Metrics’ own metrics 5 000 000 dp/s

Metrics’ own metrics 500 000 000 series
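The daily and per-second figures above describe the same load, which a few lines of arithmetic confirm. The per-series average at the end is derived here for illustration and does not appear on the slides; it assumes the load is spread evenly, which real workloads are not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400

datapoints_per_second = 5_000_000
series_count = 500_000_000

# 5 000 000 dp/s sustained over a full day.
datapoints_per_day = datapoints_per_second * SECONDS_PER_DAY

# Derived figure, assuming an even spread across all series.
avg_datapoints_per_series_per_day = datapoints_per_day / series_count
avg_seconds_between_datapoints = SECONDS_PER_DAY / avg_datapoints_per_series_per_day

print(f"{datapoints_per_day:,} datapoints / day")
print(f"{avg_datapoints_per_series_per_day:.0f} datapoints per series per day")
print(f"one datapoint per series every {avg_seconds_between_datapoints:.0f} s")
```

So 5 000 000 dp/s is exactly 432 000 000 000 datapoints per day, and with 500 000 000 series that averages out to one datapoint per series every 100 seconds.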

Our cluster sizes
GRA:
● 150 nodes
● 2 PB
● 1.1 Gbps
BHS:
● 30 nodes
● 400 TB
● 120 Mbps

Our cluster architecture [Diagram: Warp10 Ingress, Warp10 Directory, Kafka, Warp10 Egress, Warp10 Store, Region server + Datanode]

Detecting errors Before it’s too late

Extract errors from logs

Tailor Forward logs and extract metrics!
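Turning logs into metrics can be as simple as counting lines per severity level and emitting the counts as data points. The sketch below illustrates that idea only; it is not how Tailor works, and the log format, metric name and data point layout are all assumptions.

```python
import re
from collections import Counter

# Assumed log format: the severity level appears as an upper-case word in each line.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARN|ERROR|FATAL)\b")


def count_levels(lines):
    """Count log lines per severity level; lines without a level are ignored."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def to_datapoints(counts, timestamp, name="logs.count"):
    """Turn the counts into one data point per level (hypothetical layout)."""
    return [
        {"name": name, "labels": {"level": level}, "timestamp": timestamp, "value": n}
        for level, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = [
        "2019-01-01T10:00:00Z INFO request served in 12ms",
        "2019-01-01T10:00:01Z ERROR connection reset by peer",
        "2019-01-01T10:00:02Z ERROR timeout while writing",
        "a line without any level",
    ]
    for point in to_datapoints(count_levels(sample), timestamp=1546336800):
        print(point)
```

Once the error count is a time series, it can be graphed and alerted on like any other metric instead of being searched for after the fact.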

Monitoring the JVM

Documentation

JVM GC The good, the bad and the ugly

The good

The bad

… and the ugly #java #jdk11 #zgc

Monitoring HBase

Number of open regions

Queues length

Number of read and write requests

Preserve data locality

Host health

Pokédex Inventory all animals.

Merging all data sources

Global visualization

Correlate information

Sacha The best tamer

An awesome CLI

Retrieving basic information

Create region map

Move region to another region server

Drain regions of the region server

Managing multiple hardware profiles

Balance the cluster

Conclusion That’s all folks!