Monitoring OVH: 350k servers, 30 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany
Slide 2
Who are we? Introducing myself and introducing OVH OVHcloud
Slide 3
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek
Flutter
Slide 4
OVH: A Global Leader on Cloud
● 250k Private cloud VMs running
● #1 Dedicated IaaS in Europe
● 30 Datacenters
● Own 20 Tbps network with 35 PoPs
● Hosting capacity: 1.3M physical servers
● 360k servers already deployed
● 1.3M customers in 138 countries
Slide 5
OVH: Our solutions
Cloud, Web Hosting, Mobile Hosting, Telecom
VPS, Containers, Dedicated Server, Domain names, VoIP, Public Cloud, Compute, Data Storage, Email, SMS/Fax, Private Cloud, Network and Database, CDN, Virtual desktop, Security, Object Storage, hubiC, OverTheBox, Licences, Cloud Desktop, Securities, MS Office, Hybrid Cloud, Messaging, MS solutions
Slide 6
And don’t forget, next week…
OVHcloud Summit https://summit.ovhcloud.com/
Slide 7
Once upon a time… Because I love telling tales
Slide 8
This talk is about a tale…
A true one nevertheless
Slide 9
And as in most tales
It begins with a mission
Slide 10
And a band of heroes
Engulfed into the adventure
Slide 11
They fight against mishaps
And all kinds of foes
Slide 12
They build mighty fortresses
Pushing the limits of the possible
Slide 13
And defend them day after day
Against all odds
Slide 14
But we don’t know the end yet
Because this tale isn’t finished
Slide 15
It begins with a mission Build a metrics platform for OVH
Slide 16
A long time ago…
Slide 17
A long time ago…
Monitoring: Does the system work?
Slide 18
Moving from monolith to μservices
[Diagram: App]
Slide 19
Moving from monolith to μservices
[Diagram: App ×3]
Slide 20
Moving from monolith to μservices
[Diagram: App ×4, DB, Slaves]
Slide 21
Moving from monolith to μservices
[Diagram: App ×4, Bus, DB, Slaves]
Slide 22
Moving from monolith to μservices
[Diagram: RPXY, LB, Cache, App ×4, Bus, DB, Slaves]
Slide 23
What could go wrong?
[Diagram: RPXY, LB, Cache, App ×4, Bus, DB, Slaves]
Slide 24
Microservices are a distributed system
GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
Slide 25
We need to have insights
Observability: How does the system work?
Slide 26
OVH decided to go metrics-oriented
Slide 27
A metrics platform for OVH
For all OVH
Slide 28
Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them
Slide 29
What is OVH Metrics?
Managed Cloud Platform for Time Series
Slide 30
OVH monitoring story We had lots of partial solutions…
Slide 31
OVH monitoring story One Platform to unify them all What should we build it on?
Slide 32
OVH monitoring story
Including a really big
OpenTSDB
Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
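Why a regex on tags tends to end in a full scan can be sketched with a toy sorted key store. This is only an illustration under assumptions: the `metric|hour|tags` key layout, the `ROWS` data and both helper functions are hypothetical, and real OpenTSDB row keys are binary. What matters is that the tags sit at the end of the key, so the sort order helps a metric prefix lookup but not a tag pattern.

```python
import bisect
import re

# Hypothetical, simplified row keys: "<metric>|<hour>|<tags>".
# Only the ordering (metric first, tags last) matters for this sketch.
ROWS = sorted(
    f"{metric}|{hour:02d}|host={host}"
    for metric in ("cpu.load", "disk.free", "net.rx")
    for hour in range(24)
    for host in ("web01", "web02", "db01", "db02")
)


def prefix_scan(rows, prefix):
    """Range scan: binary-search the start, stop at the first non-matching key."""
    start = bisect.bisect_left(rows, prefix)
    hits, touched = [], 0
    for key in rows[start:]:
        touched += 1
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits, touched


def regex_scan(rows, pattern):
    """Tag regex: the key order cannot help, so every row is examined."""
    rx = re.compile(pattern)
    hits = [key for key in rows if rx.search(key)]
    return hits, len(rows)


if __name__ == "__main__":
    _, touched = prefix_scan(ROWS, "cpu.load|03|")
    print("prefix scan touched", touched, "of", len(ROWS), "rows")
    _, touched = regex_scan(ROWS, r"host=db\d+$")
    print("regex scan touched", touched, "of", len(ROWS), "rows")
```

In this toy store the prefix lookup touches a handful of rows while the regex touches all of them, and the gap grows with the number of series.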
Metrics needs
First need: To be massively scalable
Slide 38
Analytics is the key to success
Fetching data is only the tip of the iceberg
Slide 39
Analysing metrics data
To be scalable, analysis must be done in the database, not on the user’s computer
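The difference between analysing on the client and in the database can be sketched as follows. This is a minimal illustration, not the platform's implementation: `raw_fetch`, `bucketized_fetch` and the synthetic `series` are hypothetical names, and the in-database step is simulated by reducing each time bucket to its mean before anything is returned.

```python
from statistics import mean


def raw_fetch(series):
    """Client-side analysis: every datapoint crosses the network."""
    return list(series)


def bucketized_fetch(series, bucket_span):
    """In-database analysis: reduce each time bucket to its mean before returning."""
    buckets = {}
    for ts, value in series:
        buckets.setdefault(ts - ts % bucket_span, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# One hour of one-second datapoints (synthetic data for illustration).
series = [(ts, float(ts % 60)) for ts in range(3600)]

if __name__ == "__main__":
    print(len(raw_fetch(series)), "points sent for client-side analysis")
    print(len(bucketized_fetch(series, 60)), "points sent after in-database reduction")
```

With one-minute buckets the reply is 60 times smaller than the raw series, and the client receives only what it needs to plot or alert on.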
Slide 40
Metrics needs
Second need: To have rich query capabilities
Slide 41
Enter Warp 10… Open-source Time series Database
Slide 42
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
Slide 43
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
Slide 44
Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript
Slide 45
Did you say scalability?
From the smallest to the largest…
Slide 46
More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile (standalone, distributed)
● WarpScript Query Language
● Support more data types
Metrics Live In-memory, high-performance Metrics instances
Slide 62
In-memory: Metrics live
millions of writes/s
Slide 63
In-memory: Metrics live
Slide 64
In-memory: Metrics live
Slide 65
Monitoring is only the beginning
OVH Metrics answers many other use cases
Slide 66
Gravelines rack temperature
Slide 67
Even medical research…
Metrics’ Pattern Detection feature helped gynaecology research prove patterns in perinatal mortality
Slide 68
Use cases families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)
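The billing example above (bill on monthly max consumption) can be sketched in a few lines. This is a hedged illustration only: `monthly_max`, `invoice`, the sample datapoints and the unit price are all hypothetical, and the real billing pipeline is not described in the talk.

```python
from datetime import datetime, timezone


def monthly_max(points):
    """Return {"YYYY-MM": peak value} for (unix_seconds, value) datapoints."""
    peaks = {}
    for ts, value in points:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        if month not in peaks or value > peaks[month]:
            peaks[month] = value
    return peaks


def invoice(points, unit_price):
    """Bill each month on its peak consumption (hypothetical pricing rule)."""
    return {month: peak * unit_price for month, peak in monthly_max(points).items()}


def at(year, month, day):
    """Helper: UTC midnight of the given date as a Unix timestamp."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Synthetic consumption datapoints for illustration.
points = [(at(2019, 1, 5), 3.0), (at(2019, 1, 20), 7.5), (at(2019, 2, 2), 4.0)]

if __name__ == "__main__":
    print(invoice(points, unit_price=2.0))
```

On a time series platform this reduction would normally run next to the data, so only one value per month and per customer has to leave the database.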
Slide 69
Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications
Slide 70
SREing Metrics
With great power comes great responsibility
Slide 71
Metrics’s metrics
432.000.000.000 datapoints / day
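To put the daily figure in perspective, it can be converted to an average ingestion rate. The snippet below is plain arithmetic on the number from the slide; it says nothing about peaks, which are presumably higher than the average.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(datapoints_per_day):
    """Average datapoints per second over a full day."""
    return datapoints_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{average_rate(DATAPOINTS_PER_DAY):,.0f} datapoints/s on average")
```

That works out to 5 million datapoints per second, around the clock.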
Slide 72
Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only systemd
● Running many Apache projects:
○ Hadoop
○ HBase
○ ZooKeeper
○ Flink
● And Warp 10
Slide 73
Our biggest Hadoop cluster
200 datanodes
~60k regions of 10Gb
2.3 PB of capacity
8.5Gb/s of bandwidth
1.5M writes/s
3M reads/s
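Dividing the cluster totals by the number of datanodes gives a rough per-node picture. This is a back-of-the-envelope sketch that assumes an even spread across nodes and decimal units (1 PB = 1000 TB); neither assumption comes from the talk.

```python
DATANODES = 200              # from the slide
CAPACITY_TB = 2300           # 2.3 PB, assuming 1 PB = 1000 TB
WRITES_PER_SECOND = 1_500_000
READS_PER_SECOND = 3_000_000


def per_node(total, nodes=DATANODES):
    """Average share of a cluster-wide total carried by one datanode."""
    return total / nodes


if __name__ == "__main__":
    print(f"{per_node(CAPACITY_TB)} TB of capacity per datanode")
    print(f"{per_node(WRITES_PER_SECOND):,.0f} writes/s per datanode")
    print(f"{per_node(READS_PER_SECOND):,.0f} reads/s per datanode")
```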
Slide 74
Hadoop needs a lot of
Slide 75
Warp10: distributed overview
Slide 76
Warp10: distributed overview
Slide 77
Warp10: distributed overview
Slide 78
Warp10: distributed overview
Slide 79
Warp10: distributed overview
Slide 80
Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4 TB of disk
Slide 81
Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM
Slide 82
Why should you care?
Slide 83
Why should you care? (>30s)
Slide 84
The only way to optimize: measure
[Diagram: App, Runtime, Host; Logs, Metrics]
● What is my application doing?
● What is my runtime doing?
● How many GCs were triggered?
● Is there a hardware failure?
● How many HTTP calls?
● How much disk do I have left?
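Two of these questions (GC activity and free disk space) can be answered from inside a process with the standard library alone. The sketch below is a Python stand-in, since the talk itself focuses on the JVM: the metric names, `collect_metrics` and `snapshot` are hypothetical, and nothing here is pushed to a metrics backend.

```python
import gc
import shutil
import time


def collect_metrics(path="."):
    """Sample a few runtime and host gauges as {name: value}."""
    usage = shutil.disk_usage(path)
    metrics = {
        "host.disk.free.bytes": usage.free,
        "host.disk.total.bytes": usage.total,
    }
    # gc.get_stats() returns one dict per generation with a "collections" counter.
    for generation, stats in enumerate(gc.get_stats()):
        metrics[f"runtime.gc.gen{generation}.collections"] = stats["collections"]
    return metrics


def snapshot(path="."):
    """Attach a timestamp so successive samples form a time series."""
    return {"timestamp": time.time(), "metrics": collect_metrics(path)}


if __name__ == "__main__":
    sample = snapshot()
    for name, value in sorted(sample["metrics"].items()):
        print(sample["timestamp"], name, value)
```

Sampling such a snapshot at a fixed interval and shipping it to the metrics platform is what turns these one-off questions into series that can be graphed and alerted on.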
Slide 85
Monitoring JVM with metrics
Slide 86
Monitoring JVM with metrics
Slide 87
Monitoring JVM with metrics
Slide 88
Monitoring JVM with metrics
Slide 89
Monitoring JVM with metrics
Slide 90
Tuning G1 is hard
Slide 91
Tuning G1 is hard
Slide 92
Our programming stack
● We mostly use garbage-collected languages such as
○ Go
○ Java
○ JavaScript
Slide 93
Our programming stack
However, we use non-garbage-collected languages such as Rust when needed
Slide 94
Our friends for µservices
Slide 95
We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group