Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform
Horacio Gonzalez @LostInBrittany
Pierre Zemb @PierreZ
@LostInBrittany @PierreZ
Slide 2
Who are we?
Introducing myself and introducing OVH
Slide 3
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek
Flutter
Slide 4
Pierre Zemb @PierreZ Software Engineer working on distributed systems
Slide 5
OVH: A Global Leader on Cloud
● 200k Private cloud VMs running
● No. 1 Dedicated IaaS in Europe
● 2018: 27 Datacenters
● 2020: 50 Datacenters
● Own 15 Tbps Network with 35 PoPs
● Hosting capacity: 1.3M Physical Servers
● 360k Servers already deployed
● 1.3M Customers in 138 Countries
Slide 6
OVH: Key Figures
● 1.3M Customers worldwide in 138 Countries
● 1.5 billion euros investment over five years
● 28 Datacenters (growing)
● 350k Dedicated Servers
● 200k Private cloud VMs running
● 650k Public cloud Instances created in a month
● 20TB bandwidth capacity
● 2,500 Employees in 19 countries
● 18 Years of Innovation
● 35 Points of presence
● 4TB Anti DDoS capacity
● Hosting capacity: 1.3M Physical Servers
Slide 7
Ranking & Recognition
● 1st European Cloud Provider*
● 1st Hosting provider in Europe
● 1st Provider Microsoft Exchange
● Certified vCloud Datacenter
● Certified Kubernetes platform (CNCF)
● VMware Global Service Provider 2013-2016
● Veeam Best Cloud Partner of the year (2018)
* Netcraft 2017
Slide 8
OVH: Our solutions Cloud
Web Hosting
Mobile Hosting
Telecom
VPS
Containers ▪ Dedicated Server
Domain names
VoIP
Public Cloud
Compute ▪ Data Storage
Email
SMS/Fax
Private Cloud
▪ Network and Database
CDN
Virtual desktop
Dedicated server
Security Object Storage
Web hosting
Cloud hubiC OverTheBox
▪ Licences
Cloud Desktop
Securities
MS Office
Hybrid Cloud
Messaging
MS solutions
Slide 9
Once upon a time…
Because I love telling tales
Slide 10
This talk is about a tale…
A true one nevertheless
Slide 11
And as in most tales
It begins with a mission
Slide 12
And a band of heroes
Engulfed in the adventure
Slide 13
They fight against mishaps
And all kinds of foes
Slide 14
They build mighty fortresses
Pushing the limits of the possible
Slide 15
And defend them day after day
Against all odds
Slide 16
But we don’t know the end yet
Because this tale isn’t finished yet
Slide 17
It begins with a mission
Build a metrics platform for OVH
Slide 18
A long time ago…
Slide 19
A long time ago…
Monitoring: Does the system work?
Slide 20
Moving from monolith to μservices
[Diagram: a single App box]
Slide 21
Moving from monolith to μservices
[Diagram: three App boxes]
Slide 22
Moving from monolith to μservices
[Diagram: four App boxes]
Slide 23
Moving from monolith to μservices
[Diagram: four App boxes]
Slide 24
Moving from monolith to μservices
[Diagram: RPXY, LB, four App boxes and a Cache]
Slide 25
What could go wrong?
[Diagram: RPXY, LB, four App boxes and a Cache]
Slide 26
Microservices are a distributed system
GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
Slide 27
We need to have insights
Observability: Understand how it works
Slide 28
OVH decided to go metrics-oriented
Slide 29
A metrics platform for OVH
For all OVH
Slide 30
Building OVH Metrics
One Platform to unify them all,
One Platform to find them,
One Platform to bring them all
and in the Metrics monitor them
Slide 31
What is OVH Metrics?
Managed Cloud Platform for Time Series
Slide 32
OVH monitoring story
We had lots of partial solutions…
Slide 33
OVH monitoring story
One Platform to unify them all
What should we build it on?
Slide 34
OVH monitoring story
Including a really big
OpenTSDB
Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
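To make the row-key limitation concrete, here is a minimal sketch in Python. It is not OpenTSDB code: the `metric|host=...` key layout, `ROW_KEYS` and both helper functions are assumptions made up for illustration. The idea it shows is general to sorted key stores: a prefix lookup can seek straight to the matching range, while a regex on a non-prefix part of the key has to test every row.

```python
import bisect
import re

# Hypothetical sorted row keys: "<metric>|host=<host>" (not OpenTSDB's real layout).
ROW_KEYS = sorted(
    f"{metric}|host=srv{n:04d}"
    for metric in ("cpu.load", "disk.used", "net.rx")
    for n in range(1000)
)


def prefix_scan(keys, prefix):
    """Seek to the prefix in sorted keys; return (matches, rows examined)."""
    index = bisect.bisect_left(keys, prefix)
    matches = []
    examined = 0
    while index < len(keys):
        examined += 1
        if not keys[index].startswith(prefix):
            break  # sorted order: no later key can match
        matches.append(keys[index])
        index += 1
    return matches, examined


def regex_scan(keys, pattern):
    """A regex on a non-prefix part of the key must test every row."""
    compiled = re.compile(pattern)
    matches = [key for key in keys if compiled.search(key)]
    return matches, len(keys)


by_prefix, seen_prefix = prefix_scan(ROW_KEYS, "cpu.load|host=srv000")
by_regex, seen_regex = regex_scan(ROW_KEYS, r"host=srv000\d$")
print(f"prefix: {len(by_prefix)} matches, {seen_prefix} rows examined")
print(f"regex:  {len(by_regex)} matches, {seen_regex} rows examined")
```

With 3,000 toy keys the gap is small in absolute terms; with hundreds of millions of series, examining every row is what turns into query latency.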
Metrics needs
First need: To be massively scalable
Slide 40
Analytics is the key to success
Fetching data is only the tip of the iceberg
Slide 41
Analysing metrics data
To be scalable, analysis must be done in the database, not on the user’s computer
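A rough sketch of why this matters, in Python. Everything here is hypothetical (the `RAW_POINTS` series and both fetch functions stand in for a real database and its client); the point is only to compare how much data leaves the database when the client aggregates versus when the database does.

```python
from statistics import mean

# Hypothetical raw series: one datapoint per second for one hour.
RAW_POINTS = [(ts, float(ts % 60)) for ts in range(3600)]


def fetch_raw(points):
    """Client-side analysis: every datapoint crosses the network."""
    return list(points)


def fetch_bucketized_mean(points, bucket_seconds):
    """Database-side analysis: aggregate per time bucket, return only the result."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_seconds, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


raw = fetch_raw(RAW_POINTS)
summary = fetch_bucketized_mean(RAW_POINTS, bucket_seconds=300)
print(f"client-side: {len(raw)} datapoints transferred")
print(f"in-database: {len(summary)} datapoints transferred")
```

For one series over one hour the difference is 3,600 points against 12; multiplied by many series and longer windows, only the second approach stays practical.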
Slide 42
Metrics needs
Second need: To have rich query capabilities
Slide 43
Enter Warp 10…
Open-source Time series Database
Slide 44
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
Slide 45
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
Slide 46
Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript
Slide 47
Did you say scalability?
From the smallest to the largest…
Slide 48
More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile (standalone, distributed)
● WarpScript Query Language
● Support more data types
Metrics Live
In-memory, high-performance Metrics instances
Slide 64
In-memory: Metrics live
millions of writes/s
Slide 65
In-memory: Metrics live
Slide 66
In-memory: Metrics live
Slide 67
Monitoring is only the beginning
OVH Metrics answers many other use cases
Slide 68
Gravelines rack temperature
Slide 69
Even medical research…
Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns in perinatal mortality
Slide 70
Use cases families
● Billing (e.g. bill on monthly max consumption)
● Monitoring (APM, infrastructure, appliances, …)
● IoT (Manage devices, operator integration, …)
● Geo Location (Manage localized fleets)
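As an illustration of the billing case (bill on monthly max consumption), here is a small Python sketch. The samples, the price and the function names are all invented for the example; they do not describe how OVH actually bills. It reduces a consumption series to one peak per month and prices that peak.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical consumption samples: (ISO timestamp, units in use at that instant).
SAMPLES = [
    ("2019-01-03T10:00:00", 12.0),
    ("2019-01-17T22:30:00", 48.5),
    ("2019-01-29T04:15:00", 31.0),
    ("2019-02-02T09:00:00", 20.0),
    ("2019-02-20T18:45:00", 27.5),
]

PRICE_PER_UNIT = 0.10  # assumed price, for illustration only


def monthly_max(samples):
    """Return the highest value seen in each calendar month (values assumed >= 0)."""
    peaks = defaultdict(float)
    for stamp, value in samples:
        month = datetime.fromisoformat(stamp).strftime("%Y-%m")
        peaks[month] = max(peaks[month], value)
    return dict(peaks)


def monthly_bill(samples, price_per_unit):
    """Price each month on its peak consumption."""
    return {
        month: round(peak * price_per_unit, 2)
        for month, peak in monthly_max(samples).items()
    }


peaks = monthly_max(SAMPLES)
bill = monthly_bill(SAMPLES, PRICE_PER_UNIT)
print(peaks)
print(bill)
```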
Slide 71
Use cases
● DC Temperature/Elec/Cooling map
● Pay as you go billing (PCI/IPLB)
● GSCAN
● Monitoring
● ML Model scoring (Anti-Fraud)
● Pattern Detection for medical applications
Slide 72
SREing Metrics
With great power comes great responsibility
Slide 73
Metrics’s metrics
432,000,000,000 datapoints / day
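To put the daily figure in perspective, a quick conversion to an average per-second rate (plain arithmetic on the number from the slide; real traffic is of course not perfectly flat over the day):

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average ingestion rate, assuming a uniform spread over the day.
datapoints_per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY
print(f"{datapoints_per_second:,} datapoints/s on average")
# prints "5,000,000 datapoints/s on average"
```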
Slide 74
Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only systemd
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ ZooKeeper
  ○ Flink
● And Warp 10
Slide 75
Our biggest Hadoop cluster
● 200 datanodes
● ~60k regions of 10 GB
● 2.3 PB of capacity
● 8.5 Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s
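A back-of-the-envelope view of what these cluster figures mean per machine. This sketch assumes load and regions are spread evenly across datanodes, which a real cluster does not guarantee; the helper `per_node` is just a name chosen for the example.

```python
# Figures quoted on the slide for the biggest cluster.
DATANODES = 200
REGIONS = 60_000  # "~60k regions", rounded
WRITES_PER_SECOND = 1_500_000
READS_PER_SECOND = 3_000_000


def per_node(total, nodes=DATANODES):
    """Average share of a cluster-wide total carried by one datanode."""
    return total / nodes


regions_per_node = per_node(REGIONS)
writes_per_node = per_node(WRITES_PER_SECOND)
reads_per_node = per_node(READS_PER_SECOND)

print(f"{regions_per_node:.0f} regions per datanode")
print(f"{writes_per_node:,.0f} writes/s per datanode")
print(f"{reads_per_node:,.0f} reads/s per datanode")
```

Under that even-spread assumption, each datanode hosts about 300 regions and serves roughly 7,500 writes/s and 15,000 reads/s.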
Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4TB of Disk
Slide 83
Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM
Slide 84
Why should you care?
Slide 85
Why should you care? (>30s)
Slide 86
The only way to optimize: measure
Logs
Metrics
Slide 87
Monitoring JVM with metrics
Slide 88
Monitoring JVM with metrics
Slide 89
Monitoring JVM with metrics
Slide 90
Monitoring JVM with metrics
Slide 91
Monitoring JVM with metrics
Slide 92
Tuning G1 is hard
Slide 93
Tuning G1 is hard
Slide 94
Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript
Slide 95
Our programming stack
However, we use non-garbage-collected languages such as Rust when needed
Slide 96
Our friends for µservices
Slide 97
We
open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group
Slide 98
Conclusion
That’s all folks!