Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany
Slide 2
Who are we? Introducing myself and introducing OVH
Slide 3
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek
Flutter
Slide 4
OVH: Key Figures
● 1.3M customers worldwide in 138 countries
● 1.5 billion euros of investment over five years
● 28 datacenters (growing)
● 350k dedicated servers
● 200k private cloud VMs running
● 650k public cloud instances created in a month
● 20TB bandwidth capacity
● 35 points of presence
● 4TB anti-DDoS capacity
● Hosting capacity: 1.3M physical servers
● 2,500 employees in 19 countries
● 20 years of innovation
Slide 5
OVH: Our solutions Cloud
Web Hosting
Mobile Hosting
Telecom
VPS
Containers ▪ Dedicated Server
Domain names
VoIP
Public Cloud
Compute ▪ Data Storage
Email
SMS/Fax
Private Cloud
▪ Network and Database
CDN
Virtual desktop
Dedicated server
Security Object Storage
Web hosting
Cloud HubiC OverTheBox
▪ Licences
Cloud Desktop
Securities
MS Office
Hybrid Cloud
Messaging
MS solutions
Slide 6
Once upon a time… Because I love telling tales
Slide 7
This talk is about a tale…
A true one nevertheless
Slide 8
And as in most tales
It begins with a mission
Slide 9
And a band of heroes
Engulfed in the adventure
Slide 10
They fight against mishaps
And all kinds of foes
Slide 11
They build mighty fortresses
Pushing the limits of the possible
Slide 12
And defend them day after day
Against all odds
Slide 13
But we don’t yet know the end
Because this tale isn’t finished yet
Slide 14
It begins with a mission Build a metrics platform for OVH
Slide 15
A long time ago…
Slide 16
A long time ago…
Monitoring: Does the system work?
Slide 17
Moving from monolith to μservices
App
Slide 18
Moving from monolith to μservices App App
App
Slide 19
Moving from monolith to μservices App App App DB App
Slaves
Slide 20
Moving from monolith to μservices App App App
Bus
DB App
Slaves
Slide 21
Moving from monolith to μservices RPXY
LB
Cache
App App App
Bus
DB App
Slaves
Slide 22
What could go wrong? RPXY
LB
Cache
App App App
Bus
DB App
Slaves
Slide 23
Microservices are a distributed system
GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
Slide 24
We need to have insights
Observability: Understand how it works
Slide 25
OVH decided to go metrics-oriented
Slide 26
A metrics platform for OVH
For all of OVH
Slide 27
Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them
Slide 28
What is OVH Metrics?
Managed Cloud Platform for Time Series
Slide 29
OVH monitoring story We had lots of partial solutions…
Slide 30
OVH monitoring story One Platform to unify them all What should we build it on?
Slide 31
OVH monitoring story
Including a really big
OpenTSDB
Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
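Why a regex on row keys hurts can be sketched with a toy model. The snippet below is only an illustration under stated assumptions: the `metric|host=NNNN` key layout and the helper names are hypothetical and are not OpenTSDB's real row key encoding. It shows that a prefix lookup on sorted keys touches only the matching range, while a regex has to examine every key.

```python
import bisect
import re

# Hypothetical row keys, sorted lexicographically as in a key-ordered store.
# The "metric|host=NNNN" layout is an illustration, not OpenTSDB's real encoding.
ROW_KEYS = sorted(
    f"{metric}|host={host:04d}"
    for metric in ("cpu.load", "disk.free", "net.rx")
    for host in range(1000)
)


def prefix_scan(keys, prefix):
    """Seek to the prefix with a binary search, then read until it stops matching."""
    start = bisect.bisect_left(keys, prefix)
    matches = []
    examined = 0
    for key in keys[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, examined


def regex_scan(keys, pattern):
    """A regex cannot use the key order, so every key has to be examined."""
    compiled = re.compile(pattern)
    matches = [key for key in keys if compiled.search(key)]
    return matches, len(keys)


prefix_matches, prefix_examined = prefix_scan(ROW_KEYS, "disk.free|")
regex_matches, regex_examined = regex_scan(ROW_KEYS, r"host=0042$")

print(f"prefix scan: {len(prefix_matches)} matches, {prefix_examined} keys examined")
print(f"regex scan:  {len(regex_matches)} matches, {regex_examined} keys examined")
```

In this toy model the regex returns three series but still reads all 3,000 keys. That cost grows with the total number of series, not with the size of the answer.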
Metrics needs
First need: To be massively scalable
Slide 37
Analytics is the key to success
Fetching data is only the tip of the iceberg
Slide 38
Analysing metrics data
To be scalable, analysis must be done in the database, not on the user’s computer
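A minimal sketch of the idea, assuming a hypothetical in-memory series and two made-up functions (`fetch_raw`, `fetch_bucketized_mean`) standing in for a database API: computing per-minute means where the data lives returns 60 points, while fetching raw data for client-side analysis moves all 3,600.

```python
from statistics import mean

# Hypothetical raw series: one point per second for one hour, as (timestamp_s, value).
RAW = [(t, float(t % 60)) for t in range(3600)]


def fetch_raw(start, end):
    """Client-side analysis: every raw point crosses the network."""
    return [(t, v) for t, v in RAW if start <= t < end]


def fetch_bucketized_mean(start, end, bucket_s):
    """Database-side analysis: only one mean per bucket is returned."""
    buckets = {}
    for t, v in RAW:
        if start <= t < end:
            buckets.setdefault(t - t % bucket_s, []).append(v)
    return [(bucket, mean(values)) for bucket, values in sorted(buckets.items())]


client_side = fetch_raw(0, 3600)
server_side = fetch_bucketized_mean(0, 3600, 60)

print(len(client_side), "points transferred for client-side analysis")
print(len(server_side), "points transferred after in-database aggregation")
```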
Slide 39
Metrics needs
Second need: To have rich query capabilities
Slide 40
Enter Warp 10… Open-source Time series Database
Slide 41
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
Slide 42
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
Slide 43
Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript
Slide 44
Did you say scalability?
From the smallest to the largest…
Slide 45
More Warp 10 goodness
● Secured & multi-tenant
● Synchronous (transactions)
● In-memory index
● Better performance
● No cardinality issues
● Better scalability
● Lock-free ingestion
● Versatile (standalone, distributed)
● WarpScript query language
● Support for more data types
Metrics Live In-memory, high-performance Metrics instances
Slide 61
In-memory: Metrics live
millions of writes/s
Slide 62
In-memory: Metrics live
Slide 63
In-memory: Metrics live
Slide 64
Monitoring is only the beginning
OVH Metrics answers many other use cases
Slide 65
Gravelines rack temperature
Slide 66
Even medical research…
Metrics’ Pattern Detection feature helped gynaecology research prove patterns in perinatal mortality
Slide 67
Use case families
● Billing (e.g. bill on monthly max consumption)
● Monitoring (APM, infrastructure, appliances, …)
● IoT (manage devices, operator integration, …)
● Geo Location (manage localized fleets)
Slide 68
Use cases
● DC Temperature/Elec/Cooling map
● Pay as you go billing (PCI/IPLB)
● GSCAN
● Monitoring
● ML Model scoring (Anti-Fraud)
● Pattern Detection for medical applications
Slide 69
SREing Metrics
With great power comes great responsibility
Slide 70
Metrics’s metrics
432.000.000.000 datapoints / day
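To put that daily figure in perspective, it can be converted to a per-second rate. This is plain arithmetic on the number from the slide; it gives an average and says nothing about peaks.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

# Average ingestion rate over a full day.
per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY

print(f"{per_second:,} datapoints/s on average")
```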
Slide 71
Our stack overview
● More than 650 machines operated by 5 people
● >95% dedicated servers
● No Docker, only systemd
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ ZooKeeper
  ○ Flink
● And Warp 10
Slide 72
Our biggest Hadoop cluster
200 datanodes
60k regions of 10 GB
~2.3 PB of capacity
8.5 Gb/s of bandwidth
1.5M writes/s
3M reads/s
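A rough per-node view of those cluster figures, assuming (this is an assumption, not something the slide states) that regions and traffic are spread evenly across the 200 datanodes:

```python
# Cluster-wide figures from the slide.
DATANODES = 200
REGIONS = 60_000
WRITES_PER_S = 1_500_000
READS_PER_S = 3_000_000

# Assumption: perfectly even distribution across datanodes.
regions_per_node = REGIONS // DATANODES
writes_per_node = WRITES_PER_S // DATANODES
reads_per_node = READS_PER_S // DATANODES

print(f"{regions_per_node} regions, {writes_per_node:,} writes/s "
      f"and {reads_per_node:,} reads/s per datanode")
```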
Slide 73
Hadoop needs a lot of
Slide 74
Warp10: distributed overview
Slide 75
Warp10: distributed overview
Slide 76
Warp10: distributed overview
Slide 77
Warp10: distributed overview
Slide 78
Warp10: distributed overview
Slide 79
Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4 TB of disk
Slide 80
Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM
Slide 81
Why should you care?
Slide 82
Why should you care? (>30s)
Slide 83
The only way to optimize: measure
● What is my application doing?
● What is my runtime doing?
● How many GCs were triggered?
● Is there a hardware failure?
● How many HTTP calls?
● How much disk space do I have left?
[Diagram labels: App, Runtime, Host; Logs, Metrics]
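A minimal sketch of answering those questions with numbers, one per layer (app, runtime, host). The talk is about the JVM, so this swaps in Python's own runtime for illustration. The metric names and the `handle_request` and `snapshot` helpers are hypothetical; `gc.get_stats()` and `shutil.disk_usage()` are standard library calls.

```python
import gc
import shutil
import time

http_calls = 0  # application-level counter, incremented by the app itself


def handle_request():
    """Stand-in for a real HTTP handler: it only counts the call."""
    global http_calls
    http_calls += 1


def snapshot(path="."):
    """Collect one measurement per layer: app, runtime and host."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),
        # App: how many HTTP calls?
        "app.http.calls": http_calls,
        # Runtime: how many GC runs, summed over all generations?
        "runtime.gc.collections": sum(s["collections"] for s in gc.get_stats()),
        # Host: how much disk space is left?
        "host.disk.free.bytes": usage.free,
    }


for _ in range(3):
    handle_request()

metrics = snapshot()
for name, value in metrics.items():
    print(name, value)
```

Taken at regular intervals, such snapshots become time series that can be pushed to a metrics platform and compared over time.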
Slide 84
Monitoring JVM with metrics
Slide 85
Monitoring JVM with metrics
Slide 86
Monitoring JVM with metrics
Slide 87
Monitoring JVM with metrics
Slide 88
Monitoring JVM with metrics
Slide 89
Tuning G1 is hard
Slide 90
Tuning G1 is hard
Slide 91
Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript
Slide 92
Our programming stack
However, we use non-garbage-collected languages such as Rust when needed
Slide 93
Our friends for µservices
Slide 94
We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group