Monitoring OVH: 300k servers, 27 DCs, one single metrics platform

A presentation at Devops D-Day 2019 in November 2019 in Marseille, France by Horacio Gonzalez

Slide 1

DEVOPS D-DAY #5 Monitoring OVH: 350k servers, 30 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany

Slide 2

Who are we? Introducing myself and introducing OVH OVHcloud

Slide 3

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 4

OVH: A Global Leader on Cloud
● 250k private cloud VMs running
● #1 Dedicated IaaS Europe
● 30 datacenters
● Own 20 Tbps network with 35 PoPs
● Hosting capacity: 1.3M physical servers
● 360k servers already deployed
● 1.3M customers in 138 countries

Slide 5

OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 6

And don’t forget, next week… OVHcloud Summit https://summit.ovhcloud.com/

Slide 7

Once upon a time… Because I love telling tales

Slide 8

This talk is about a tale… A true one nevertheless

Slide 9

And as in most tales It begins with a mission

Slide 10

And a band of heroes Engulfed in the adventure

Slide 11

They fight against mishaps And all kinds of foes

Slide 12

They build mighty fortresses Pushing the limits of the possible

Slide 13

And defend them day after day Against all odds

Slide 14

But we don’t know the end yet Because this tale isn’t finished

Slide 15

It begins with a mission Build a metrics platform for OVH

Slide 16

A long time ago…

Slide 17

A long time ago… Monitoring: Does the system work?

Slide 18

Moving from monolith to μservices [diagram: App]

Slide 19

Moving from monolith to μservices [diagram: App, App, App]

Slide 20

Moving from monolith to μservices [diagram: App, App, App, DB, App, Slaves]

Slide 21

Moving from monolith to μservices [diagram: App, App, App, Bus, DB, App, Slaves]

Slide 22

Moving from monolith to μservices [diagram: RPXY, LB, Cache, App, App, App, Bus, DB, App, Slaves]

Slide 23

What could go wrong? [diagram: RPXY, LB, Cache, App, App, App, Bus, DB, App, Slaves]

Slide 24

Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill

Slide 25

We need to have insights Observability: How does the system work?

Slide 26

OVH decided to go metrics-oriented

Slide 27

A metrics platform for OVH For all OVH

Slide 28

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 29

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 30

OVH monitoring story We had lots of partial solutions…

Slide 31

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 32

OVH monitoring story Including a really big

Slide 33

OpenTSDB drawbacks OpenTSDB RowKey Design!

Slide 34

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
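A minimal sketch of why the row key layout hurts on high-cardinality metrics. It assumes OpenTSDB's default key layout (3-byte metric UID, 4-byte hour-aligned base timestamp, then tag key/value UID pairs) and ignores salting; the helper names `uid`, `row_key` and `candidate_rows` and all UID values are made up for illustration. Because the tag pairs sit at the end of the key, a filter on tag values cannot narrow the key range: every row for the metric and hour is a candidate.

```python
import struct

UID_WIDTH = 3  # default UID width in bytes (assumed unchanged)


def uid(n: int) -> bytes:
    """Encode a numeric UID as a fixed-width big-endian byte string."""
    return n.to_bytes(UID_WIDTH, "big")


def row_key(metric_uid: int, timestamp: int, tags: dict) -> bytes:
    """Simplified row key: metric UID + hour-aligned base time + tag pairs."""
    base_time = timestamp - (timestamp % 3600)
    key = uid(metric_uid) + struct.pack(">I", base_time)
    for tagk_uid, tagv_uid in sorted(tags.items()):
        key += uid(tagk_uid) + uid(tagv_uid)
    return key


def candidate_rows(keys, metric_uid: int, base_time: int):
    """Rows a scan has to read before any tag filter can be applied."""
    prefix = uid(metric_uid) + struct.pack(">I", base_time)
    return [k for k in sorted(keys) if k.startswith(prefix)]


# 100 hosts reporting the same metric in the same hour (made-up UIDs).
keys = [row_key(1, 18000, {2: host}) for host in range(100)]
print(len(candidate_rows(keys, 1, 18000)))
```

With hundreds of millions of series, the number of candidate rows per metric and hour grows with the number of tag combinations, which is where the query latencies come from.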

Slide 35

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query: 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Slide 36

Scaling OpenTSDB

Slide 37

Metrics needs First need: To be massively scalable

Slide 38

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 39

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer

Slide 40

Metrics needs Second need: To have rich query capabilities

Slide 41

Enter Warp 10… Open-source Time series Database

Slide 42

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Slide 43

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 44

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript

Slide 45

Did you say scalability? From the smallest to the largest…

Slide 46

More Warp 10 goodness
● Secured & multi-tenant
● In-memory index
● No cardinality issues
● Lock-free ingestion
● WarpScript query language
● Support for more data types
● Synchronous (transactions)
● Better performance
● Better scalability
● Versatile (standalone, distributed)

Slide 47

OVH Observability Metrics Platform

Slide 48

Building an ecosystem From Warp 10 to OVH Metrics

Slide 49

What protocols should we support? Who must make the effort?

Slide 50

Open source monitoring tools

Slide 51

Open source monitoring tools

Slide 52

Open source monitoring tools

Slide 53

Open source monitoring tools

Slide 54

Open source monitoring tools

Slide 55

Open source monitoring tools

Slide 56

Open source monitoring tools Why choose? Let’s support all of them!

Slide 57

Metrics Platform

Slide 58

Metrics Platform
https://<protocol>.<region>.metrics.ovh.net
where <protocol> is one of: graphite, influx, opentsdb, prometheus, Warp10, tsl, …

Slide 59

Metrics Platform
https://<protocol>.<region>.metrics.ovh.net
where <protocol> is one of: graphite, influx, opentsdb, prometheus, Warp10, tsl, …

Slide 60

TSL
select("cpu.usage_system")
  .where("cpu~cpu[0-7]*")
  .last(12h)
  .sampleBy(5m, max)
  .groupBy(mean)
  .rate()
github.com/ovh/tsl
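To make the last three steps of that TSL query concrete, here is a rough Python approximation of what sampleBy, groupBy and rate do to already-fetched series. It is a sketch of the semantics as I read them, not TSL's implementation: the function names are mine, timestamps are plain seconds, each bucket is labelled by its start time (an assumption), and the select/where/last steps are taken as already done.

```python
from collections import defaultdict
from statistics import mean


def sample_by(points, span, agg):
    """Bucket (timestamp, value) points into windows of `span` seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % span].append(value)
    return sorted((ts, agg(vals)) for ts, vals in buckets.items())


def group_by(series_list, agg):
    """Merge several aligned series into one, aggregating tick by tick."""
    merged = defaultdict(list)
    for series in series_list:
        for ts, value in series:
            merged[ts].append(value)
    return sorted((ts, agg(vals)) for ts, vals in merged.items())


def rate(points):
    """Per-second rate of change between consecutive points."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


# Two made-up CPU series, sampled by 5 minutes (300 s) with max,
# averaged across series, then turned into a rate.
cpu0 = [(0, 1.0), (100, 3.0), (300, 2.0), (400, 6.0)]
cpu1 = [(0, 5.0), (200, 1.0), (300, 4.0), (500, 2.0)]
sampled = [sample_by(s, 300, max) for s in (cpu0, cpu1)]
result = rate(group_by(sampled, mean))
print(result)
```

The point of running this inside the platform rather than in client code is that only the final, small series leaves the database.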

Slide 61

Metrics Live In-memory, high-performance Metrics instances

Slide 62

In-memory: Metrics live millions of writes/s

Slide 63

In-memory: Metrics live

Slide 64

In-memory: Metrics live

Slide 65

Monitoring is only the beginning OVH Metrics answers many other use cases

Slide 66

Gravelines rack temperature

Slide 67

Even medical research… Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns on perinatal mortality

Slide 68

Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (manage devices, operator integration, …)
• Geo Location (manage localized fleets)
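As an illustration of the billing example ("bill on monthly max consumption"), here is a small sketch that reduces a consumption series to one peak per calendar month and prices it. The function names, the UTC month boundaries and the `unit_price` parameter are assumptions for the example, not how OVH bills.

```python
from datetime import datetime, timezone


def monthly_max(points):
    """Map (year, month) to the highest value seen in that month (UTC)."""
    peaks = {}
    for ts, value in points:
        d = datetime.fromtimestamp(ts, tz=timezone.utc)
        key = (d.year, d.month)
        peaks[key] = max(value, peaks.get(key, value))
    return peaks


def invoice(points, unit_price):
    """Price each month on its peak consumption (hypothetical pricing)."""
    return {month: peak * unit_price for month, peak in monthly_max(points).items()}


def ts(year, month, day):
    """Helper: Unix timestamp for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Made-up consumption samples across two months.
usage = [(ts(2019, 10, 3), 40), (ts(2019, 10, 20), 75), (ts(2019, 11, 5), 30)]
print(invoice(usage, unit_price=2))
```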

Slide 69

Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-fraud)
• Pattern Detection for medical applications

Slide 70

SREing Metrics With great power comes great responsibility

Slide 71

Metrics’s metrics 432,000,000,000 datapoints / day
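To put the daily figure in perspective, the arithmetic below converts it to an average per-second rate. It assumes traffic is spread evenly over the day, which real ingestion is not, so treat the result as an average rather than a peak.

```python
DATAPOINTS_PER_DAY = 432_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average ingestion rate if the load were perfectly flat over the day.
avg_per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY
print(f"{avg_per_second:,} datapoints/s on average")
```

That works out to 5 million datapoints per second on average.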

Slide 72

Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only systemd
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ ZooKeeper
  ○ Flink
● And Warp 10

Slide 73

Our biggest Hadoop cluster
● 200 datanodes
● ~60k regions of 10Gb
● 2.3 PB of capacity
● 8.5Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s

Slide 74

Hadoop needs a lot of

Slide 75

Warp10: distributed overview

Slide 76

Warp10: distributed overview

Slide 77

Warp10: distributed overview

Slide 78

Warp10: distributed overview

Slide 79

Warp10: distributed overview

Slide 80

Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4 TB of disk

Slide 81

Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM

Slide 82

Why should you care?

Slide 83

Why should you care? (>30s)

Slide 84

The only way to optimize: measure
● App: What is my application doing? How many HTTP calls?
● Runtime: What is my runtime doing? How many GCs were triggered?
● Host: Is there a hardware failure? How much disk do I have left?
● Sources: Logs, Metrics
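A small sketch of the three layers from that slide (app, runtime, host), each answered with one metric. The talk's examples are about the JVM; this uses the Python runtime as a stand-in, and the metric names, the `http_calls` counter and the `handle_request` function are made up for illustration.

```python
import gc
import shutil

http_calls = 0  # hypothetical application-level counter


def handle_request():
    """Stand-in for a request handler that counts each HTTP call."""
    global http_calls
    http_calls += 1


def snapshot(path="."):
    """Collect one metric per layer: app, runtime and host."""
    return {
        # App: how many HTTP calls?
        "app.http.calls": http_calls,
        # Runtime: how many garbage collections were triggered?
        "runtime.gc.collections": sum(g["collections"] for g in gc.get_stats()),
        # Host: how much disk do I have left?
        "host.disk.free.bytes": shutil.disk_usage(path).free,
    }


handle_request()
handle_request()
print(snapshot())
```

In a real setup these values would be pushed to the metrics platform at a fixed interval instead of printed.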

Slide 85

Monitoring JVM with metrics

Slide 86

Monitoring JVM with metrics

Slide 87

Monitoring JVM with metrics

Slide 88

Monitoring JVM with metrics

Slide 89

Monitoring JVM with metrics

Slide 90

Tuning G1 is hard

Slide 91

Tuning G1 is hard

Slide 92

Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript

Slide 93

Our programming stack However, we use non-garbage-collected languages such as Rust when needed

Slide 94

Our friends for µservices

Slide 95

We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group

Slide 96

Conclusion That’s all folks!