Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at Meilleur Dev de France 2018 in October 2018 in Paris, France by Horacio Gonzalez

Slide 1

Monitoring OVH: 300k servers, 27 DCs... and one Metrics platform
Horacio Gonzalez (@LostInBrittany), Kevin Georges (@0xd33d33), Steven Le Roux (@StevenLeRoux)
Monitoring @ovh

Slide 2

Who are we? Introducing ourselves and introducing OVH

Slide 3

Horacio Gonzalez (@LostInBrittany): Spaniard lost in Brittany, developer, dreamer and all-around geek

Slide 4

Kevin Georges (@0xd33d33): Engineering Manager. Working on Observability and Kubernetes. Distributed system addict. Warp10 / HBase / HDFS / Zookeeper / ETCD / Kubernetes

Slide 5

Steven Le Roux (@StevenLeRoux): Principal Engineer. From networking to Distributed. Unconventional life rider.

Slide 6

OVH: Key Figures
• 1.3M customers worldwide in 138 countries
• 1.5 billion euros investment over five years
• 30 datacenters (growing)
• 350k dedicated servers
• 200k private cloud VMs running
• 650k public cloud instances created in a month
• 15TB bandwidth capacity
• 2 500 employees in 19 countries
• 19 years of innovation
• 35 points of presence
• 4TB anti-DDoS capacity
• Hosting capacity: 1.3M physical servers

Slide 7

OVH: A Global Leader on Cloud
• 200k private cloud VMs running
• 1 Dedicated IaaS Europe
• 2018: 27 datacenters
• 2020: 50 datacenters
• Own 15 Tbps network with 35 PoPs
• Hosting capacity: 1.3M physical servers
• 360k servers already deployed
• 1.3M customers in 138 countries

Slide 8

Ranking & Recognition
• 1st European Cloud Provider*
• 1st Hosting provider in Europe
• 1st Provider Microsoft Exchange
• Certified vCloud Datacenter
• Certified Kubernetes platform (CNCF)
• VMware Global Service Provider 2013-2016
• Veeam Best Cloud Partner of the year (2018)
* Netcraft 2017

Slide 9

OVH: Our solutions (Cloud, Web Hosting, Mobile Hosting, Telecom): VPS, Containers, Dedicated Server, Domain names, VoIP, Public Cloud, Compute, Data Storage, Email, SMS/Fax, Private Cloud, Network and Database, CDN, Virtual desktop, Dedicated server, Security, Object Storage, Web hosting, Cloud HubiC, OverTheBox, Licences, Cloud Desktop, Securities, MS Office, Hybrid Cloud, Messaging, MS solutions

Slide 10

Once upon a time... Because we love telling tales

Slide 11

This talk is about a tale... A true one nevertheless

Slide 12

And as in most tales It begins with a mission

Slide 13

And a band of heroes Engulfed into the adventure

Slide 14

They fight against mishaps And all kinds of foes

Slide 15

They build a mighty citadel Pushing the limits of Physics

Slide 16

And defend it day after day Against all odds

Slide 17

But we don't know the end yet Because this tale isn't finished yet

Slide 18

It begins with a mission Build a metrics platform for OVH

Slide 19

It began with a mission Build a metrics platform for OVH

Slide 20

Why do we need metrics? To make better decisions by using numbers

Slide 21

Why do we need metrics? We need to make better decisions about our code

Slide 22

Why do we need metrics? We want our code to add value

Slide 23

Why do we need metrics? Code adds value when it runs, not when we write it

Slide 24

Why do we need metrics? We need to know what our code does when it runs

Slide 25

Why do we need metrics? We can’t do this unless we measure it

Slide 26

Why do we need metrics? We have a mental model of what our code does

Slide 27

Why do we need metrics? This representation can be wrong

Slide 28

Why do we need metrics? We can’t know until we measure it

Slide 29

Find the bottleneck “The app is slow.” - User

Slide 30

Find the bottleneck “The app is slow.” - User “The page takes 500ms!” - Ops

Slide 31

Find the bottleneck: SQL Query? Template Rendering? Session Storage?

Slide 32

Find the bottleneck: we don't know

Slide 33

Find the bottleneck
With observability:
SQL Query: 53ms
Template Rendering: 1ms
Session Storage: 315ms

Slide 34

Find the bottleneck
With observability:
SQL Query: 53ms
Template Rendering: 1ms
Session Storage: 315ms
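A per-phase breakdown like the one on this slide can be obtained by timing each step of a request separately. The sketch below is only an illustration under stated assumptions: the `timed` helper, the `handle_request` function and the `time.sleep` calls standing in for real work are all hypothetical and are not taken from the talk.

```python
import time
from contextlib import contextmanager

# Accumulated duration per phase, in milliseconds.
timings_ms = {}


@contextmanager
def timed(name):
    """Record the wall-clock duration of the wrapped block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = (time.perf_counter() - start) * 1000.0
        timings_ms[name] = timings_ms.get(name, 0.0) + elapsed


def handle_request():
    # Hypothetical request handler: the sleeps stand in for real work.
    with timed("sql_query"):
        time.sleep(0.005)
    with timed("template_rendering"):
        time.sleep(0.001)
    with timed("session_storage"):
        time.sleep(0.03)


def slowest_phase(timings):
    """Return the name of the phase with the largest accumulated duration."""
    return max(timings, key=timings.get)


handle_request()
for phase, ms in timings_ms.items():
    print(f"{phase}: {ms:.0f}ms")
print("bottleneck:", slowest_phase(timings_ms))
```

With numbers per phase, the question "why is the page slow?" turns into a lookup instead of a guess.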

Slide 35

Why do we need metrics? We improve our mental model by measuring what our code does

Slide 36

Why do we need metrics? We use our mental model to decide what to do

Slide 37

Why do we need metrics? A better mental model makes us better at deciding what to do

Slide 38

Why do we need metrics? Better decisions make us better at generating value

Slide 39

Why do we need metrics? Measuring makes your app better

Slide 40

It began with a mission Build a metrics platform for OVH

Slide 41

A metrics platform for OVH

Slide 42

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 43

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 44

OVH monitoring story We had lots of partial solutions...

Slide 45

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 46

OVH monitoring story First try

Slide 47

OpenTSDB drawbacks: OpenTSDB RowKey design!

Slide 48

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● Cardinality issues (query latencies)

Slide 49

OpenTSDB other flaws
● Compactions (or append writes)
● /api/query: 1 endpoint per function?
● Asynchronous
● Unauthenticated
● ...

Slide 50

Scaling OpenTSDB

Slide 51

Metrics needs First need: To be massively scalable

Slide 52

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 53

Analysing metrics data To be scalable, analysis must be done in the database, not on the user's computer

Slide 54

Metrics needs Second need: To have rich query capabilities

Slide 55

Enter Warp 10... Open-source Time series Database

Slide 56

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Slide 57

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 58

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript

Slide 59

Did you say scalability? From the smallest to the largest...

Slide 60

Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile
● WarpScript Query Language (standalone, distributed)
● Support more data types

Slide 61

Metrics Data Platform

Slide 62

Metrics Data Platform

Slide 63

Leverage an ecosystem and choose the right one...

Slide 64

Multi-protocol Why choose? We need them all!

Slide 65

Open source monitoring tools

Slide 66

Open source monitoring tools

Slide 67

Open source monitoring tools

Slide 68

Open source monitoring tools

Slide 69

Open source monitoring tools

Slide 70

Open source monitoring tools

Slide 71

Open source monitoring tools Why choose? Let’s support all of them!

Slide 72

Metrics Platform

Slide 73

Metrics Platform
https://<protocol>.<region>.metrics.ovh.net, where <protocol> is one of: graphite, influx, opentsdb, prometheus, warp10, ...
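The slide shows one hostname per supported protocol, all following the same pattern. As a minimal sketch, the helper below builds such a base URL from a protocol name and a region. The function name `endpoint` and the region value `gra1` are assumptions for illustration only; the slide gives the region as a placeholder and lists no paths or ports.

```python
# Protocol names as listed on the slide (the slide ends the list with "...").
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10")


def endpoint(protocol, region):
    """Build the base URL for a protocol in a region, following the slide's pattern."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return f"https://{protocol}.{region}.metrics.ovh.net"


# "gra1" is a hypothetical region identifier used only for this example.
for name in PROTOCOLS:
    print(endpoint(name, "gra1"))
```

The point of the design is that an existing collector only needs a new host name: it keeps speaking the protocol it already knows.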

Slide 74

Metrics Live In-memory, high-performance Metrics instances

Slide 75

In-memory: Metrics live +120 million writes/s

Slide 76

In-memory: Metrics live

Slide 77

In-memory: Metrics live

Slide 78

Monitoring is only the beginning OVH Metrics answers many other use cases

Slide 79

Use cases families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, ...)
• IoT (Manage devices, operator integration, ...)
• Geo Location (Manage localized fleets)

Slide 80

Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications

Slide 81

Conclusion That's all folks!

Slide 82

SREing Metrics With great power comes great responsibility

Slide 83

Metrics' own metrics 432 000 000 000 datapoints / day

Slide 84

Metrics' own metrics 10 Tb / day

Slide 85

Metrics' own metrics 5 000 000 dp/s

Slide 86

Metrics' own metrics 500 000 000 series
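The daily and per-second figures on these slides agree with each other, which a short calculation confirms. The bytes-per-datapoint estimate at the end rests on an assumption that is not stated in the talk, namely that "10 Tb / day" means 10 terabytes (10^13 bytes) per day; read it as an order-of-magnitude check only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400 seconds

# Figure from the slides: 5 000 000 datapoints per second.
datapoints_per_second = 5_000_000
datapoints_per_day = datapoints_per_second * SECONDS_PER_DAY

# Assumption: "10 Tb / day" is read as 10 terabytes, i.e. 10 * 10**12 bytes.
bytes_per_day = 10 * 10**12
bytes_per_datapoint = bytes_per_day / datapoints_per_day

print(f"{datapoints_per_day:,} datapoints/day")  # 432,000,000,000 datapoints/day
print(f"about {bytes_per_datapoint:.0f} bytes per datapoint")
```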

Slide 87

Our clusters size
GRA: 150 nodes, 2 PB, 1.1 Gbps
BHS: 30 nodes, 400 TB, 120 Mbps

Slide 88

Our cluster architecture
[Diagram: Warp10 Ingress, Warp10 Directory (x2), Kafka, Warp10 Egress (x2), Warp10 Store (x2), Region server + Datanode (x4)]

Slide 89

Detecting errors Before it's too late

Slide 90

HBase is designed to fail It’s really good at it

Slide 91

HBase fails in infinite ways
NETWORK: Zookeeper timeout, Network latency, Network bandwidth
STORAGE: Slow disk, Failed disk, Corrupted block
COMPUTE: Java GC, Region compaction, Delete handling
Handlers exhaustion

Slide 92

Extract errors from logs

Slide 93

Tailor: filter logs, extract metrics, detect patterns, perform correlations
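The first three steps named on this slide (filter, extract, detect) can be illustrated with a few lines of code. This is a sketch of the general idea, not Tailor's implementation: the log line format, the metric names and the two message patterns are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical log format: "<date> <time> <LEVEL> <logger>: <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) +(?P<logger>\S+): (?P<msg>.*)$"
)

# Hypothetical message patterns worth counting.
PATTERNS = {
    "zookeeper_timeout": re.compile(r"session timed out", re.IGNORECASE),
    "slow_sync": re.compile(r"slow sync", re.IGNORECASE),
}


def extract_metrics(lines):
    """Filter WARN/ERROR lines and turn them into counters keyed by (metric, label)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group("level") not in ("WARN", "ERROR"):
            continue  # filter: ignore unparsable and low-severity lines
        counts[("log.lines", match.group("level"))] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(match.group("msg")):
                counts[("log.pattern", name)] += 1
    return counts


sample = [
    "2018-10-01 12:00:00,123 WARN zk.ClientCnxn: Client session timed out",
    "2018-10-01 12:00:01,000 INFO regionserver.HRegion: Flushed memstore",
    "2018-10-01 12:00:02,500 ERROR wal.FSHLog: Slow sync cost: 512 ms",
    "not a log line",
]
for key, value in sorted(extract_metrics(sample).items()):
    print(key, value)
```

Once log events are counters, they can be stored as time series and correlated with the other metrics of the same host.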

Slide 94

Monitoring the JVM

Slide 95

Documentation

Slide 96

Documentation The option -XX:G1SummarizeRSetStatsPeriod in combination with gc+remset=trace level logging shows if this coarsening occurs. If so, then the X in the line Did <X> coarsenings in the Before GC Summary section shows a high value. The -XX:G1RSetRegionEntries option could be increased significantly to decrease the amount of these coarsenings. https://docs.oracle.com/javase/10/gctuning/garbage-first-garbage-collector-tuning.htm

Slide 97

Let's observe what is happening

Slide 98

JVM GC The good, the bad and the ugly

Slide 99

The good

Slide 100

The bad

Slide 101

… and the ugly #java #jdk11 #zgc

Slide 102

Monitoring HBase

Slide 103

Number of open regions

Slide 104

Queue lengths

Slide 105

Number of read and write requests

Slide 106

Data locality

Slide 107

Host health

Slide 108

Pokédex & Pokeball Inventory all animals.

Slide 109

Merging all data sources

Slide 110

Global visualization

Slide 111

Correlate information

Slide 112

Sacha The best tamer

Slide 113

An awesome CLI

Slide 114

Retrieving bare information

Slide 115

Create region map

Slide 116

Move region to another region server

Slide 117

Drain regions of the region server
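Draining a region server amounts to planning a move for each of its regions onto the remaining servers. The slides do not show how the CLI decides on targets, so the sketch below swaps in a simple, hypothetical strategy: always send the next region to the server that currently holds the fewest regions. The `plan_drain` function and the shape of `region_map` are assumptions for illustration; the code only computes a plan and does not talk to HBase.

```python
import heapq


def plan_drain(region_map, server):
    """Plan moves of every region on `server` to the least-loaded other servers.

    region_map: dict mapping region server name -> list of region names.
    Returns a list of (region, target_server) tuples.
    """
    # Min-heap of (current region count, server name) for all other servers.
    targets = [
        (len(regions), name)
        for name, regions in region_map.items()
        if name != server
    ]
    if not targets:
        raise ValueError("no other region server to drain to")
    heapq.heapify(targets)

    moves = []
    for region in region_map[server]:
        load, target = heapq.heappop(targets)
        moves.append((region, target))
        # The target now hosts one more region.
        heapq.heappush(targets, (load + 1, target))
    return moves


# Hypothetical region map with three region servers.
cluster = {"rs1": ["a", "b", "c"], "rs2": ["d"], "rs3": []}
for region, target in plan_drain(cluster, "rs1"):
    print(f"move {region} -> {target}")
```

Counting regions treats all servers as equal; with multiple hardware profiles, the load in the heap would need to be weighted by each server's capacity.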

Slide 118

Managing multiple hardware profiles

Slide 119

Balance the cluster

Slide 120

Conclusion That's all folks!