Monitoring OVH
300k servers, 27 DCs… and one Metrics platform
Horacio Gonzalez @LostInBrittany
Monitoring
@LostInBrittany
Slide 2
Temporary outline
- Intro: us and OVH (5 minutes)
- Intro: our talk (2 minutes)
- Make better decisions by using numbers (5 minutes)
- Building OVH Metrics (10 minutes)
- Conclusion (2 minutes)
- Bye bye (1 minute)
Slide 3
Who are we? Introducing myself and introducing OVH
Slide 4
Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek
Slide 5
OVH: Key Figures
● 1.3M customers worldwide in 138 countries
● 1.5 billion euros investment over five years
● 30 datacenters (growing)
● 350k dedicated servers
● 200k private cloud VMs running
● 650k public cloud instances created in a month
● 15TB bandwidth capacity
● 2 500 employees in 19 countries
● 18 years of innovation
● 35 points of presence
● 4TB anti-DDoS capacity
● Hosting capacity: 1.3M physical servers
Slide 6
OVH: A Global Leader on Cloud
● No. 1 dedicated IaaS in Europe
● 200k private cloud VMs running
● 2018: 27 datacenters
● 2020: 50 datacenters
● Own 15 Tbps network with 35 PoPs
● Hosting capacity: 1.3M physical servers
● 360k servers already deployed
● 1.3M customers in 138 countries
Slide 7
Ranking & Recognition
● 1st European Cloud Provider*
● 1st Hosting provider in Europe
● 1st Provider Microsoft Exchange
● Certified vCloud Datacenter
● Certified Kubernetes platform (CNCF)
● VMware Global Service Provider 2013-2016
● Veeam Best Cloud Partner of the year (2018)
* Netcraft 2017
Slide 8
OVH: Our solutions
Categories: Cloud, Web Hosting, Mobile Hosting, Telecom
Products: VPS, Containers, Dedicated Server, Domain names, VoIP, Public Cloud, Compute, Data Storage, Email, SMS/Fax, Private Cloud, Network and Database, CDN, Virtual desktop, Security, Object Storage, Web hosting, Cloud HubiC, OverTheBox, Licences, Cloud Desktop, MS Office, Hybrid Cloud, Messaging, MS solutions
Slide 9
Once upon a time… Because I love telling tales
Slide 10
This talk is about a tale…
A true one nevertheless
Slide 11
And as in most tales
It begins with a mission
Slide 12
And a band of heroes
Engulfed into the adventure
Slide 13
They fight against mishaps
And all kinds of foes
Slide 14
They build mighty fortresses
Pushing the limits of the possible
Slide 15
And defend them day after day
Against all odds
Slide 16
But we don’t know the end yet
Because this tale isn’t finished
Slide 17
It begins with a mission
Build a metrics platform for OVH
Slide 18
Why do we need metrics?
To make better decisions by using numbers
Slide 19
Why do we need metrics?
We want our code to add value
Slide 20
Why do we need metrics?
We need to make better decisions about our code
Slide 21
Why do we need metrics?
Code adds value when it runs, not when we write it
Slide 22
Why do we need metrics?
We need to know what our code does when it runs
Slide 23
Why do we need metrics?
We can’t do this unless we measure it
Slide 24
Why do we need metrics?
We have a mental model of what our code does
Slide 25
Why do we need metrics?
This representation can be wrong
Slide 26
Why do we need metrics?
We can’t know until we measure it
Slide 27
Find the bottleneck
“The app is slow.” - User
Slide 28
Find the bottleneck
“The app is slow.” - User
“The page takes 500ms!” - Ops
Find the bottleneck
?
We don’t know
Slide 31
Find the bottleneck
With observability:
● SQL Query: 53ms
● Template Rendering: 1ms
● Session Storage: 315ms
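A per-step breakdown like the one above can be produced by timing each part of a request separately. The following is a minimal sketch, not the instrumentation used at OVH: the `timed` helper, the step names and the `time.sleep` calls standing in for real work are all assumptions made for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per named step, in milliseconds.
timings_ms = defaultdict(float)


@contextmanager
def timed(step):
    """Record how long the wrapped block takes under the given step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[step] += (time.perf_counter() - start) * 1000.0


def handle_request():
    # The sleeps are placeholders standing in for real work (hypothetical).
    with timed("sql_query"):
        time.sleep(0.005)
    with timed("template_rendering"):
        time.sleep(0.001)
    with timed("session_storage"):
        time.sleep(0.03)


def slowest_step(timings):
    """Return the name of the step with the largest accumulated time."""
    return max(timings, key=timings.get)


handle_request()
for step, ms in sorted(timings_ms.items(), key=lambda kv: -kv[1]):
    print(f"{step:<20} {ms:8.1f} ms")
print("bottleneck:", slowest_step(timings_ms))
```

With numbers per step, "the page takes 500ms" turns into a concrete place to look.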
Slide 33
Why do we need metrics?
We improve our mental model by measuring what our code does
Slide 34
Why do we need metrics?
We use our mental model to decide what to do
Slide 35
Why do we need metrics?
A better mental model makes us better at deciding what to do
Slide 36
Why do we need metrics?
Better decisions make us better at generating value
Slide 37
Why do we need metrics?
Measuring makes your app better
Slide 38
It began with a mission
Build a metrics platform for OVH
Slide 39
A metrics platform for OVH
For all of OVH
Slide 40
Building OVH Metrics
One Platform to unify them all,
One Platform to find them,
One Platform to bring them all
and in the Metrics monitor them
Slide 41
What is OVH Metrics?
Managed Cloud Platform for Time Series
Slide 42
OVH monitoring story
We had lots of partial solutions…
Slide 43
OVH monitoring story
One Platform to unify them all
What should we build it on?
Slide 44
OVH monitoring story
Including a really big
OpenTSDB
Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
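Why a regex filter degrades into a full table scan can be illustrated with a toy sorted key-value store. This is a simplified sketch under stated assumptions: the `metric|host|hour` key layout is invented for the example and is not OpenTSDB's actual rowkey encoding, and the cost is counted as the number of rows examined.

```python
import bisect
import re

# Toy model of a sorted key-value store: row keys kept in lexicographic order.
# The "metric|host|hour" layout is a simplification for illustration only.
rows = sorted(
    f"{metric}|host{h:03d}|{hour:02d}"
    for metric in ("cpu.load", "mem.used", "net.rx")
    for h in range(100)
    for hour in range(24)
)


def prefix_scan(keys, prefix):
    """Seek to the prefix with a binary search, then read only matching rows."""
    matched, examined = [], 0
    for i in range(bisect.bisect_left(keys, prefix), len(keys)):
        examined += 1
        if not keys[i].startswith(prefix):
            break  # keys are sorted, so no later row can match
        matched.append(keys[i])
    return matched, examined


def regex_scan(keys, pattern):
    """A pattern with no fixed prefix cannot seek: every row is examined."""
    rx = re.compile(pattern)
    return [k for k in keys if rx.search(k)], len(keys)


prefix_hits, prefix_cost = prefix_scan(rows, "cpu.load|host042|")
regex_hits, regex_cost = regex_scan(rows, r"\|host042\|")
print(f"prefix scan: {len(prefix_hits)} rows matched, {prefix_cost} examined")
print(f"regex scan:  {len(regex_hits)} rows matched, {regex_cost} examined")
```

In this toy model the prefix scan examines 25 rows to return 24, while the regex has to examine all 7200 rows to return 72. The gap grows with the number of series, which is where high cardinality starts to hurt query latency.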
Metrics needs
First need: To be massively scalable
Slide 50
Analytics is the key to success
Fetching data is only the tip of the iceberg
Slide 51
Analysing metrics data
To be scalable, analysis must be done in the database, not on the user’s computer
Slide 52
Metrics needs
Second need: To have rich query capabilities
Slide 53
Enter Warp 10…
Open-source Time Series Database
Slide 54
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
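To make "ingests time series" concrete, here is a small sketch that formats one datapoint as a text line in the style of the Warp 10 GTS input format (`TS/LAT:LON/ELEV NAME{LABELS} VALUE`, with location and elevation left empty). It reflects my reading of that format and should be checked against the Warp 10 documentation. The `gts_line` helper, the metric name and the labels are hypothetical, and the timestamp is assumed to be in microseconds.

```python
from urllib.parse import quote


def gts_line(ts_us, name, labels, value):
    """Format one datapoint as 'TS// NAME{LABELS} VALUE' (no location/elevation).

    Assumptions: timestamps are in microseconds, names and labels are
    percent-encoded, strings are single-quoted, booleans are T or F.
    """
    label_str = ",".join(
        f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in sorted(labels.items())
    )
    if isinstance(value, bool):
        rendered = "T" if value else "F"
    elif isinstance(value, str):
        rendered = "'" + quote(value, safe="") + "'"
    else:
        rendered = repr(value)
    return f"{ts_us}// {quote(name, safe='')}{{{label_str}}} {rendered}"


line = gts_line(1380475081000000, "cpu.load", {"host": "web1", "dc": "gra"}, 0.42)
print(line)
```

One series is identified by its class name plus its set of labels, so the same `cpu.load` name with a different `host` label is a different series.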
Slide 55
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
Slide 56
Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript
Slide 57
Did you say scalability?
From the smallest to the largest…
Slide 58
More Warp 10 goodness
● Secured & multi tenant
● In memory Index
● No cardinality issues
● Lockfree ingestion
● WarpScript Query Language
● Support more data types
● Synchronous (transactions)
● Better Performance
● Better Scalability
● Versatile (standalone, distributed)
Slide 59
Metrics Data Platform
Slide 60
Metrics Data Platform
Slide 61
Building an ecosystem
From Warp 10 to OVH Metrics
Slide 62
Multi-protocol
Why choose? We need them all!
Slide 63
Open source monitoring tools
Slide 69
Open source monitoring tools
Why choose? Let’s support all of them!
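Supporting several tools means accepting the same datapoint in several wire formats. The sketch below renders one datapoint in three widely used formats. The slides as transcribed do not name the supported protocols, so Graphite plaintext, InfluxDB line protocol and OpenTSDB JSON are chosen here as assumptions for illustration, and the helper names are hypothetical.

```python
import json


def to_graphite(path, value, ts):
    """Graphite plaintext protocol: '<path> <value> <unix seconds>'."""
    return f"{path} {value} {ts}"


def to_influx_line(measurement, tags, value, ts):
    """InfluxDB line protocol: 'measurement,tags field=value <nanoseconds>'."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_str} value={value} {ts * 1_000_000_000}"


def to_opentsdb_json(metric, tags, value, ts):
    """OpenTSDB-style JSON body for a single datapoint."""
    return json.dumps(
        {"metric": metric, "timestamp": ts, "value": value, "tags": tags},
        sort_keys=True,
    )


ts = 1546300800  # 2019-01-01T00:00:00Z, in seconds
graphite = to_graphite("web1.cpu.load", 0.42, ts)
influx = to_influx_line("cpu.load", {"host": "web1"}, 0.42, ts)
opentsdb = to_opentsdb_json("cpu.load", {"host": "web1"}, 0.42, ts)
print(graphite)
print(influx)
print(opentsdb)
```

All three carry the same information (name, tags, value, timestamp), which is what makes it feasible to translate them into a single storage model behind the protocol endpoints.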
Metrics Live
In-memory, high-performance Metrics instances
Slide 73
In-memory: Metrics live
+120 million writes/s
Slide 76
Monitoring is only the beginning
OVH Metrics answers many other use cases
Slide 77
Use case families
● Billing (e.g. bill on monthly max consumption)
● Monitoring (APM, infrastructure, appliances, …)
● IoT (Manage devices, operator integration, …)
● Geo Location (Manage localized fleets)
Slide 78
Use cases
● DC Temperature/Elec/Cooling map
● Pay as you go billing (PCI/IPLB)
● GSCAN
● Monitoring
● ML Model scoring (Anti-Fraud)
● Pattern Detection for medical applications
Slide 79
SREing Metrics
With great power comes great responsibility
Slide 80
Metrics’ own metrics
432 000 000 000 datapoints / day
Slide 81
Metrics’ own metrics
10 Tb / day
Slide 82
Metrics’ own metrics
5 000 000 dp/s
Slide 83
Metrics’ own metrics
500 000 000 series
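The daily and per-second figures above are the same number seen at two scales, which a few lines of arithmetic confirm. The average interval per series is an extra derived figure and assumes every series is equally active, which real workloads are not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400

datapoints_per_second = 5_000_000
series = 500_000_000

datapoints_per_day = datapoints_per_second * SECONDS_PER_DAY

# Average time between two points of one series, assuming all series are
# equally active (a simplification; real series are pushed at varying rates).
avg_seconds_between_points = series / datapoints_per_second

print(f"{datapoints_per_day:,} datapoints/day")
print(f"one point per series every {avg_seconds_between_points:.0f} s on average")
```

5 000 000 dp/s times 86 400 s gives exactly the 432 000 000 000 datapoints per day quoted on the slide.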
Our cluster architecture
Diagram components:
● Warp10 Ingress
● Warp10 Directory (x2)
● Kafka
● Warp10 Egress (x2)
● Warp10 Store (x2)
● Region server + Datanode (x4)
Slide 86
Detecting errors
Before it’s too late
Slide 87
Extract errors from logs
Slide 88
Tailor
Forward logs and extract metrics!
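The idea of turning logs into metrics can be sketched in a few lines: parse each line, count events per log level, and emit the counts as datapoints. This is not how Tailor is implemented; the log format, the `count_levels` helper and the `log.events` metric name are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical log format: "<date> <time> <LEVEL> <logger> - <message>".
LEVEL_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) ")


def count_levels(lines):
    """Count log lines per level, ignoring lines that do not match the format."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


sample = [
    "2019-01-15 10:00:01 INFO ingress - datapoints accepted",
    "2019-01-15 10:00:02 ERROR store - write to region failed",
    "2019-01-15 10:00:02 WARN directory - slow label lookup",
    "2019-01-15 10:00:03 ERROR store - write to region failed",
    "java.io.IOException: connection reset",  # stack trace line, skipped
]

counts = count_levels(sample)
for level, n in sorted(counts.items()):
    print(f"log.events{{level={level}}} {n}")
```

Once the error count is a time series, a rising error rate can be alerted on before anyone has to read the logs.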
Slide 89
Monitoring the JVM
Slide 90
Documentation
Slide 91
JVM GC
The good, the bad and the ugly
Slide 92
The good
Slide 93
The bad
Slide 94
… and the ugly
#java #jdk11 #zgc
Slide 95
Monitoring HBase
Slide 96
Number of open regions
Slide 97
Queue lengths
Slide 98
Number of read and write requests
Slide 99
Preserve data locality
Slide 100
Host health
Slide 101
Pokédex
Inventory all animals.
Slide 102
Merging all data sources
Slide 103
Global visualization
Slide 104
Correlate information
Slide 105
Sacha
The best tamer
Slide 106
An awesome CLI
Slide 107
Retrieving bare information
Slide 108
Create region map
Slide 109
Move region to another region server
Slide 110
Drain regions of the region server