Monitoring OVH: 350k servers, 30 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany
Slide 2
Who are we? Introducing myself and introducing OVH OVHcloud
Slide 3
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek
Flutter
Slide 4
OVHcloud: A Global Leader
● 250k Private cloud VMs running
● #1 Dedicated IaaS Europe
● 30 Datacenters
● Own 20Tbps Network with 35 PoPs
● Hosting capacity: 1.3M Physical Servers
● 360k Servers already deployed
● 1.3M Customers in 138 Countries
Slide 5
OVHcloud: Our solutions Cloud
Web Hosting
Mobile Hosting
Telecom
VPS
Containers ▪ Dedicated Server
Domain names
VoIP
Public Cloud
Compute ▪ Data Storage
Email
SMS/Fax
Private Cloud
▪ Network and Database
CDN
Virtual desktop
Dedicated server
Security Object Storage
Web hosting
Cloud HubiC OverTheBox
▪ Licences
Cloud Desktop
Securities
MS Office
Hybrid Cloud
Messaging
MS solutions
Slide 6
Once upon a time… Because I love telling tales
Slide 7
This talk is about a tale…
A true one nevertheless
Slide 8
And as in most tales
It begins with a mission
Slide 9
And a band of heroes
Engulfed in the adventure
Slide 10
They fight against mishaps
And all kinds of foes
Slide 11
They build mighty fortresses
Pushing the limits of the possible
Slide 12
And defend them day after day
Against all odds
Slide 13
But we don’t know the end yet
Because this tale isn’t finished yet
Slide 14
It begins with a mission Build a metrics platform for OVH
Slide 15
A long time ago…
Slide 16
A long time ago…
Monitoring: Does the system work?
Slide 17
Moving from monolith to μservices
App
Slide 18
Moving from monolith to μservices
App App
App
Slide 19
Moving from monolith to μservices
App App App DB App
Slaves
Slide 20
Moving from monolith to μservices
App App App
Bus
DB App
Slaves
Slide 21
Moving from monolith to μservices
RPXY
LB
Cache
App App App
Bus
DB App
Slaves
Slide 22
What could go wrong?
RPXY
LB
Cache
App App App
Bus
DB App
Slaves
Slide 23
Microservices are a distributed system
GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill
Slide 24
We need to have insights
Observability: How does the system work?
Slide 25
OVH decided to go metrics-oriented
Slide 26
A metrics platform for OVH
For all OVH
Slide 27
Building OVH Metrics
One Platform to unify them all, One Platform to find them,
One Platform to bring them all and in the Metrics monitor them
Slide 28
What is OVH Metrics?
Managed Cloud Platform for Time Series
Slide 29
OVH monitoring story We had lots of partial solutions…
Slide 30
OVH monitoring story One Platform to unify them all What should we build it on?
Slide 31
OVH monitoring story
Including a really big
Slide 32
OpenTSDB drawbacks
OpenTSDB RowKey Design
Slide 33
OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
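To make the row key problem concrete, here is a minimal Python sketch. It assumes OpenTSDB's documented default layout (3-byte metric UID, 4-byte hour-aligned base timestamp, then sorted 3-byte tag key/value UID pairs) and ignores salting. The UID tables, metric names and the `scan`/`query` helpers are hypothetical, made up for illustration.

```python
import re
import struct
from bisect import bisect_left

# Hypothetical UID tables; a real OpenTSDB stores these mappings itself.
METRICS = {"cpu.load": 1, "mem.used": 2}
TAGKS = {"host": 1, "dc": 2}
TAGVS = {"web-1": 1, "web-2": 2, "db-1": 3, "gra": 4, "rbx": 5}
TAGV_NAMES = {uid_: name for name, uid_ in TAGVS.items()}


def uid(n: int) -> bytes:
    """Encode a UID on 3 bytes, the assumed default width."""
    return n.to_bytes(3, "big")


def row_key(metric: str, ts: int, tags: dict) -> bytes:
    """metric UID + hour-aligned base timestamp + sorted tagk/tagv UID pairs."""
    base = ts - ts % 3600
    pairs = sorted((TAGKS[k], TAGVS[v]) for k, v in tags.items())
    return uid(METRICS[metric]) + struct.pack(">I", base) + b"".join(
        uid(k) + uid(v) for k, v in pairs
    )


def scan(keys, metric, start, end):
    """Prefix scan: the sorted store can only seek on metric + time range."""
    lo = uid(METRICS[metric]) + struct.pack(">I", start - start % 3600)
    hi = uid(METRICS[metric]) + struct.pack(">I", end - end % 3600 + 3600)
    return keys[bisect_left(keys, lo):bisect_left(keys, hi)]


def query(keys, metric, start, end, tagk, pattern):
    """A regex on a tag value cannot narrow the scan: every row in the
    metric/time range is read, then filtered one by one."""
    scanned = scan(keys, metric, start, end)
    wanted = uid(TAGKS[tagk])
    matched = []
    for key in scanned:
        body = key[7:]  # skip metric UID (3 bytes) + base timestamp (4 bytes)
        for i in range(0, len(body), 6):
            if body[i:i + 3] == wanted:
                name = TAGV_NAMES[int.from_bytes(body[i + 3:i + 6], "big")]
                if re.fullmatch(pattern, name):
                    matched.append(key)
    return len(scanned), matched


keys = sorted(
    [row_key("cpu.load", 7200, {"host": h, "dc": d})
     for h, d in [("web-1", "gra"), ("web-2", "gra"), ("db-1", "rbx")]]
    + [row_key("mem.used", 7200, {"host": "db-1", "dc": "rbx"})]
)
scanned, matched = query(keys, "cpu.load", 7200, 7300, "host", r"db-.*")
print(f"rows scanned: {scanned}, rows matched: {len(matched)}")
```

In this toy store, one row matches but every row of the metric in the time range is read. With hundreds of millions of series behind a single metric name, that ratio is what drives query latency.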
Metrics needs
First need: To be massively scalable
Slide 37
Analytics is the key to success
Fetching data is only the tip of the iceberg
Slide 38
Analysing metrics data
To be scalable, analysis must be done in the database, not on the user’s computer
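A rough Python illustration of why this matters, using a toy in-memory series rather than a real database. The function names `fetch_raw` and `fetch_bucketized` are hypothetical and are not part of any Warp 10 or OpenTSDB API; the sketch only counts how many points would have to cross the network in each approach.

```python
from statistics import fmean

# Toy in-memory "database": one series, one point per second over a day.
SERIES = [(ts, float(ts % 60)) for ts in range(86_400)]


def fetch_raw(start: int, end: int) -> list:
    """Client-side analysis: every raw point crosses the network."""
    return [(ts, v) for ts, v in SERIES if start <= ts < end]


def fetch_bucketized(start: int, end: int, span: int) -> list:
    """Server-side analysis: the database averages each bucket of `span`
    seconds, and only the aggregates cross the network."""
    buckets = {}
    for ts, v in SERIES:
        if start <= ts < end:
            buckets.setdefault(ts - ts % span, []).append(v)
    return [(b, fmean(vs)) for b, vs in sorted(buckets.items())]


raw = fetch_raw(0, 86_400)
hourly = fetch_bucketized(0, 86_400, 3_600)
print(len(raw), "points transferred for client-side analysis")
print(len(hourly), "points transferred for server-side analysis")
```

For a single one-day series the difference is 86,400 points against 24; multiplied by thousands of series per dashboard, only the second approach stays practical.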
Slide 39
Metrics needs
Second need: To have rich query capabilities
Slide 40
Enter Warp 10… Open-source Time series Database
Slide 41
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
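On the ingestion side, a data point is pushed as one text line. The sketch below formats such a line in Python, assuming the Warp 10 input format as described in its public documentation (`TS/LAT:LON/ELEV NAME{LABELS} VALUE`, with geo fields left empty and timestamps in microseconds, which I believe is the default time unit). The helper name `to_gts_line` is hypothetical.

```python
from urllib.parse import quote


def to_gts_line(ts_us: int, name: str, labels: dict, value) -> str:
    """Format one data point as `TS// NAME{LABELS} VALUE` (no geo data)."""

    def enc(s: str) -> str:
        # Percent-encode everything except unreserved characters.
        return quote(s, safe="")

    label_str = ",".join(f"{enc(k)}={enc(v)}" for k, v in sorted(labels.items()))
    if isinstance(value, bool):
        rendered = "T" if value else "F"      # booleans are written T or F
    elif isinstance(value, str):
        rendered = "'" + enc(value) + "'"     # strings are single-quoted
    else:
        rendered = repr(value)                # integers and floats as-is
    return f"{ts_us}// {enc(name)}{{{label_str}}} {rendered}"


line = to_gts_line(1380475081000000, "foo", {"label0": "val0", "label1": "val1"}, 123)
print(line)
```

The labels play the role that tags play in other time series databases: a series is identified by its class name plus its label set.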
Slide 42
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
Slide 43
Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript
Slide 44
Did you say scalability?
From the smallest to the largest…
Slide 45
More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile (standalone, distributed)
● WarpScript Query Language
● Support more data types
Slide 46
OVH Observability Metrics Platform
Slide 47
Building an ecosystem From Warp 10 to OVH Metrics
Slide 48
What protocols should we support? Who must make the effort?
Slide 49
Open source monitoring tools
Slide 50
Open source monitoring tools
Slide 51
Open source monitoring tools
Slide 52
Open source monitoring tools
Slide 53
Open source monitoring tools
Slide 54
Open source monitoring tools
Slide 55
Open source monitoring tools
Why choose? Let’s support all of them!
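One way to picture "support all of them" is a thin translation layer per protocol in front of a single storage format. The Python sketch below is an assumption about the general shape, not a description of the actual OVH Metrics code: the `DataPoint` type and the `ingest` dispatcher are hypothetical, and the two protocols shown (OpenTSDB telnet-style `put` lines and Graphite plaintext lines) are my own picks for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """Hypothetical common representation behind every protocol."""
    name: str
    timestamp: int
    value: float
    labels: tuple = ()


def parse_opentsdb_put(line: str) -> DataPoint:
    """OpenTSDB telnet style: put <metric> <timestamp> <value> <tagk=tagv>..."""
    verb, name, ts, value, *tags = line.split()
    if verb != "put":
        raise ValueError(f"not a put line: {line!r}")
    labels = tuple(sorted(tuple(t.split("=", 1)) for t in tags))
    return DataPoint(name, int(ts), float(value), labels)


def parse_graphite(line: str) -> DataPoint:
    """Graphite plaintext style: <path> <value> <timestamp>, no labels."""
    name, value, ts = line.split()
    return DataPoint(name, int(ts), float(value))


PARSERS = {"opentsdb": parse_opentsdb_put, "graphite": parse_graphite}


def ingest(protocol: str, line: str) -> DataPoint:
    """Dispatch on the protocol; storage only ever sees DataPoint."""
    return PARSERS[protocol](line)


a = ingest("opentsdb", "put sys.cpu.user 1356998400 42.5 host=web01 cpu=0")
b = ingest("graphite", "sys.cpu.user 42.5 1356998400")
print(a)
print(b)
```

With this shape, the effort of supporting a new tool sits in one small parser on the platform side, and users keep the agents and clients they already run.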
Metrics Live In-memory, high-performance Metrics instances
Slide 61
In-memory: Metrics live
millions of writes/s
Slide 62
In-memory: Metrics live
Slide 63
In-memory: Metrics live
Slide 64
Monitoring is only the beginning OVH Metrics answer to many other use cases
Slide 65
Gravelines rack temperature
Slide 66
Even medical research…
Metrics’ Pattern Detection feature helped gynaecology research prove patterns in perinatal mortality
Slide 67
Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)
Slide 68
Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications
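The billing use case ("bill on monthly max consumption") boils down to a per-customer, per-month maximum over a consumption series. A minimal Python sketch, with made-up customers and values and a hypothetical `monthly_max` helper; a real deployment would run the equivalent aggregation inside the time series database.

```python
from datetime import datetime, timezone


def monthly_max(points):
    """Return {(customer, 'YYYY-MM'): peak value} from (customer, ts, value)
    tuples, where ts is a Unix timestamp in seconds (UTC)."""
    peaks = {}
    for customer, ts, value in points:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        key = (customer, month)
        if key not in peaks or value > peaks[key]:
            peaks[key] = value
    return peaks


JAN_1 = 1546300800  # 2019-01-01T00:00:00Z
FEB_1 = 1548979200  # 2019-02-01T00:00:00Z
points = [
    ("alice", JAN_1, 120.0),
    ("alice", JAN_1 + 86_400, 310.0),
    ("alice", FEB_1, 90.0),
    ("bob", JAN_1, 15.0),
]
peaks = monthly_max(points)
for (customer, month), peak in sorted(peaks.items()):
    print(customer, month, peak)
```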
Slide 69
SREing Metrics With great power comes great responsibility
Slide 70
Metrics’s metrics
Slide 71
Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only SystemD
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ Zookeeper
  ○ Flink
● And Warp 10
Slide 72
Our biggest Hadoop cluster
Slide 73
Hadoop needs a lot of
Slide 74
Warp10: distributed overview
Slide 75
Warp10: distributed overview
Slide 76
Warp10: distributed overview
Slide 77
Warp10: distributed overview
Slide 78
Warp10: distributed overview
Slide 79
Hadoop nodes
Slide 80
Warp10 nodes
Slide 81
Why should you care?
Slide 82
Why should you care? (>30s)
Slide 83
The only way to optimize: measure
● What is my application doing?
● What is my runtime doing?
● How many GCs were triggered?
● Is there a hardware failure?
● How many HTTP calls?
● How much disk do I have left?
Layers: App, Runtime, Host
Sources: Logs, Metrics
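Two of these questions (GC activity at the runtime level, free disk at the host level) can be answered with a few lines of instrumentation. The talk is about the JVM; the sketch below is a Python analogue using only the standard library, and the metric names in the returned dictionary are hypothetical.

```python
import gc
import shutil
import time

gc_runs = 0


def _on_gc(phase: str, info: dict) -> None:
    """Count finished garbage collections (a runtime-level metric)."""
    global gc_runs
    if phase == "stop":
        gc_runs += 1


# CPython calls every function in gc.callbacks before and after a collection.
gc.callbacks.append(_on_gc)


def collect_metrics(path: str = "/") -> dict:
    """Snapshot one runtime metric and one host metric."""
    usage = shutil.disk_usage(path)
    return {
        "ts": int(time.time()),
        "runtime.gc.count": gc_runs,
        "host.disk.free.bytes": usage.free,
    }


gc.collect()  # force one collection so the counter moves
snapshot = collect_metrics()
print(snapshot)
```

Pushed at a regular interval, snapshots like this one become time series that can be graphed and alerted on next to application-level metrics.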
Slide 84
Monitoring JVM with metrics
Slide 85
Monitoring JVM with metrics
Slide 86
Monitoring JVM with metrics
Slide 87
Monitoring JVM with metrics
Slide 88
Monitoring JVM with metrics
Slide 89
Tuning G1 is hard
Slide 90
Tuning G1 is hard
Slide 91
Our programming stack
Slide 92
Our programming stack
Slide 93
Our friends for µservices
Slide 94
We
open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group