Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at Codemotion Amsterdam in April 2019 in Amsterdam, Netherlands by Horacio Gonzalez

Slide 1

Amsterdam | April 2 - 3, 2019 Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany

Slide 2

Who are we? Introducing myself and introducing OVH

Slide 3

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 4

OVH: Key Figures
  • 1.3M Customers worldwide in 138 Countries
  • 1.5 billion euros investment over five years
  • 28 Datacenters (growing)
  • 350k Dedicated Servers
  • 200k Private cloud VMs running
  • 650k Public cloud Instances created in a month
  • 20TB bandwidth capacity
  • 2 500 Employees in 19 countries
  • 18 Years of Innovation
  • 35 Points of presence
  • 4TB Anti DDoS capacity
  • Hosting capacity: 1.3M Physical Servers

Slide 5

OVH: A Global Leader on Cloud
  • 200k Private cloud VMs running
  • 1 Dedicated IaaS Europe
  • 2018: 27 Datacenters
  • 2020: 50 Datacenters
  • Own 15 Tbps Network with 35 PoPs
  • Hosting capacity: 1.3M Physical Servers
  • 360k Servers already deployed
  • 1.3M Customers in 138 Countries

Slide 6

Ranking & Recognition
  • 1st European Cloud Provider*
  • 1st Hosting provider in Europe
  • 1st Provider Microsoft Exchange
  • Certified vCloud Datacenter
  • Certified Kubernetes platform (CNCF)
  • VMware Global Service Provider 2013-2016
  • Veeam Best Cloud Partner of the year (2018)

* Netcraft 2017

Slide 7

OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers Dedicated Server Domain names VoIP Public Cloud Compute Data Storage Email SMS/Fax Private Cloud Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 8

Once upon a time… Because I love telling tales

Slide 9

This talk is about a tale… A true one nevertheless

Slide 10

And as in most tales It begins with a mission

Slide 11

And a band of heroes Engulfed into the adventure

Slide 12

They fight against mishaps And all kinds of foes

Slide 13

They build mighty fortresses Pushing the limits of possible

Slide 14

And defend them day after day Against all odds

Slide 15

But we don’t yet know the end Because this tale isn’t finished yet

Slide 16

It begins with a mission Build a metrics platform for OVH

Slide 17

Why do we need metrics? To make better decisions by using numbers

Slide 18

Why do we need metrics? We want our code to add value

Slide 19

Why do we need metrics? We need to make better decisions about our code

Slide 20

Why do we need metrics? Code adds value when it runs, not when we write it

Slide 21

Why do we need metrics? We need to know what our code does when it runs

Slide 22

Why do we need metrics? We can’t do this unless we measure it

Slide 23

Why do we need metrics? We have a mental model of what our code does

Slide 24

Why do we need metrics? This representation can be wrong

Slide 25

Why do we need metrics? We can’t know until we measure it

Slide 26

Find the bottleneck “The app is slow.” - User

Slide 27

Find the bottleneck “The app is slow.” - User “The page takes 500ms!” - Ops

Slide 28

Find the bottleneck: SQL Query? Template Rendering? Session Storage?

Slide 29

Find the bottleneck? We don’t know

Slide 30

Find the bottleneck

With observability:
SQL Query: 53ms
Template Rendering: 1ms
Session Storage: 315ms

Slide 31

Find the bottleneck

With observability:
SQL Query: 53ms
Template Rendering: 1ms
Session Storage: 315ms
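
A minimal sketch of how per-stage timings like the ones on this slide could be collected. The `timed` helper, the `timings` store and the stage functions are hypothetical names used for illustration, and the sleeps merely stand in for real work; nothing here is taken from the OVH codebase.

```python
import time
from contextlib import contextmanager

# Hypothetical in-process store: stage name -> list of durations in milliseconds.
timings = {}

@contextmanager
def timed(stage):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.setdefault(stage, []).append(elapsed_ms)

def handle_request():
    # The sleeps stand in for real work; the durations are illustrative only.
    with timed("sql_query"):
        time.sleep(0.002)
    with timed("template_rendering"):
        time.sleep(0.001)
    with timed("session_storage"):
        time.sleep(0.05)

def slowest_stage():
    """Return the stage with the largest total recorded time."""
    return max(timings, key=lambda stage: sum(timings[stage]))

handle_request()
print(slowest_stage())
```

Once each stage reports its own duration, the question "which part is slow?" becomes a lookup rather than a guess.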

Slide 32

Why do we need metrics? We improve our mental model by measuring what our code does

Slide 33

Why do we need metrics? We use our mental model to decide what to do

Slide 34

Why do we need metrics? A better mental model makes us better at deciding what to do

Slide 35

Why do we need metrics? Better decisions make us better at generating value

Slide 36

Why do we need metrics? Measuring makes your app better

Slide 37

It began with a mission Build a metrics platform for OVH

Slide 38

A metrics platform for OVH For all OVH

Slide 39

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 40

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 41

OVH monitoring story We had lots of partial solutions…

Slide 42

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 43

OVH monitoring story Including a really big

Slide 44

OpenTSDB drawbacks OpenTSDB RowKey Design!

Slide 45

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us

Slide 46

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query: 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Slide 47

Scaling OpenTSDB

Slide 48

Metrics needs First need: To be massively scalable

Slide 49

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 50

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer

Slide 51

Metrics needs Second need: To have rich query capabilities

Slide 52

Enter Warp 10… Open-source Time series Database

Slide 53

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Slide 54

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 55

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript

Slide 56

Did you say scalability? From the smallest to the largest…

Slide 57

More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile
● WarpScript Query Language (standalone, distributed)
● Support more data types

Slide 58

OVH Observability Metrics Platform

Slide 59

Metrics Data Platform

Slide 60

Building an ecosystem From Warp 10 to OVH Metrics

Slide 61

Multi-protocol Why choose? We need them all!

Slide 62

Open source monitoring tools

Slide 63

Open source monitoring tools

Slide 64

Open source monitoring tools

Slide 65

Open source monitoring tools

Slide 66

Open source monitoring tools

Slide 67

Open source monitoring tools

Slide 68

Open source monitoring tools Why choose? Let’s support all of them!

Slide 69

Metrics Platform

Slide 70

Metrics Platform https://<protocol>.<region>.metrics.ovh.net with <protocol> one of: graphite, influx, opentsdb, prometheus, warp10, …
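
The naming scheme on this slide can be sketched as a small helper. This is an illustration only: the `endpoint` function and the `my-region` placeholder are assumptions, and the protocol list contains just the names visible on the slide (which ends with "…", so there may be others).

```python
# Protocol names as listed on the slide; the slide's "…" suggests there are more.
PROTOCOLS = ["graphite", "influx", "opentsdb", "prometheus", "warp10"]

def endpoint(protocol, region):
    """Build an endpoint URL following the pattern shown on the slide."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol}")
    return f"https://{protocol}.{region}.metrics.ovh.net"

# "my-region" is a placeholder, not a real region name.
for name in PROTOCOLS:
    print(endpoint(name, "my-region"))
```

The point of the scheme is that one platform speaks every protocol: only the first label of the hostname changes.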

Slide 71

Metrics Live In-memory, high-performance Metrics instances

Slide 72

In-memory: Metrics live +120 million writes/s

Slide 73

In-memory: Metrics live

Slide 74

In-memory: Metrics live

Slide 75

Monitoring is only the beginning OVH Metrics answers many other use cases

Slide 76

Use cases families
  • Billing (e.g. bill on monthly max consumption)
  • Monitoring (APM, infrastructure, appliances, …)
  • IoT (Manage devices, operator integration, …)
  • Geo Location (Manage localized fleets)
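
As an illustration of the billing example (bill on monthly max consumption), here is a minimal sketch that reduces a time series to its monthly peaks and prices them. The function names, the sample readings and the unit price are all made up for this example; the real billing pipeline is not described in the slides.

```python
from datetime import datetime

def monthly_max(points):
    """Return the highest value seen in each (year, month)."""
    peaks = {}
    for timestamp, value in points:
        month = (timestamp.year, timestamp.month)
        if month not in peaks or value > peaks[month]:
            peaks[month] = value
    return peaks

def bill(points, unit_price):
    """Price each month on its peak consumption (hypothetical pricing rule)."""
    return {month: peak * unit_price for month, peak in monthly_max(points).items()}

# Made-up consumption readings: (timestamp, units consumed).
readings = [
    (datetime(2019, 3, 1), 10),
    (datetime(2019, 3, 15), 25),
    (datetime(2019, 3, 30), 15),
    (datetime(2019, 4, 2), 5),
]
print(bill(readings, unit_price=2))
```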

Slide 77

Use cases
  • DC Temperature/Elec/Cooling map
  • Pay as you go billing (PCI/IPLB)
  • GSCAN
  • Monitoring
  • ML Model scoring (Anti-Fraud)
  • Pattern Detection for medical applications

Slide 78

SREing Metrics With great power comes great responsibility

Slide 79

Metrics’ own metrics 432 000 000 000 datapoints / day

Slide 80

Metrics’ own metrics 10 Tb / day

Slide 81

Metrics’ own metrics 5 000 000 dp/s

Slide 82

Metrics’ own metrics 500 000 000 series
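
The figures on these slides are consistent with each other, which a quick calculation shows. The per-series average below is only a derived illustration: it assumes datapoints are spread evenly across all series, which the slides do not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86 400

datapoints_per_second = 5_000_000          # from the slides
series_count = 500_000_000                 # from the slides

# 5 000 000 dp/s sustained over a day gives the daily total quoted earlier.
datapoints_per_day = datapoints_per_second * SECONDS_PER_DAY

# Derived averages, assuming an even spread over all series (an assumption).
avg_points_per_series_per_day = datapoints_per_day / series_count
avg_seconds_between_points = SECONDS_PER_DAY / avg_points_per_series_per_day

print(datapoints_per_day)                  # 432000000000
print(avg_points_per_series_per_day)       # 864.0
print(avg_seconds_between_points)          # 100.0
```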

Slide 83

Our clusters size
GRA:
● 150 nodes
● 2 PB
● 1.1 Gbps
BHS:
● 30 nodes
● 400 TB
● 120 Mbps

Slide 84

Our cluster architecture

Slide 85

Detecting errors Before it’s too late

Slide 86

Extract errors from logs

Slide 87

Tailor Forward logs and extract metrics!
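
The idea of turning logs into metrics can be sketched in a few lines. This is not how Tailor is implemented (the slides do not show that); the log layout, the regular expression and the sample lines are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed log layout: "<timestamp> <LEVEL> <message>"; real logs will differ.
LOG_LINE = re.compile(r"^\S+\s+(?P<level>DEBUG|INFO|WARN|ERROR)\s+(?P<message>.*)$")

def count_levels(lines):
    """Turn raw log lines into one counter per log level."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts

# Made-up sample lines for illustration.
sample = [
    "2019-04-02T10:00:00Z INFO request served",
    "2019-04-02T10:00:01Z ERROR region server unreachable",
    "2019-04-02T10:00:02Z ERROR write timeout",
    "not a structured line",
]
metrics = count_levels(sample)
print(metrics["ERROR"], metrics["INFO"])
```

Counters like these can then be pushed to the metrics platform as ordinary time series, so error rates become graphable and alertable.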

Slide 88

Monitoring the JVM

Slide 89

Documentation

Slide 90

JVM GC The good, the bad and the ugly

Slide 91

The good

Slide 92

The bad

Slide 93

… and the ugly #java #jdk11 #zgc

Slide 94

Monitoring HBase

Slide 95

Number of open regions

Slide 96

Queue lengths

Slide 97

Number of read and write requests

Slide 98

Preserve data locality

Slide 99

Host health

Slide 100

Pokédex Inventory all animals.

Slide 101

Merging all data sources

Slide 102

Global visualization

Slide 103

Correlate information

Slide 104

Sacha The best tamer

Slide 105

An awesome CLI

Slide 106

Retrieving bare information

Slide 107

Create region map

Slide 108

Move region to another region server

Slide 109

Drain regions of the region server

Slide 110

Managing multiple hardware profiles

Slide 111

Balance the cluster

Slide 112

Conclusion That’s all folks!