Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at DevFest Toulouse in October 2019 in Toulouse, France by Horacio Gonzalez

Slide 1

Monitoring OVH: 350k servers, 30 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany

Slide 2

Who are we? Introducing myself and introducing OVH OVHcloud

Slide 3

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 4

OVH: A Global Leader on Cloud
● 250k Private cloud VMs running
● 1 Dedicated IaaS Europe
● 30 Datacenters
● Own 20Tbps Network with 35 PoPs
● Hosting capacity: 1.3M Physical Servers
● 360k Servers already deployed
● 1.3M Customers in 138 Countries

Slide 5

OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 6

And don’t forget, next week… OVHcloud Summit https://summit.ovhcloud.com/

Slide 7

Once upon a time… Because I love telling tales

Slide 8

This talk is about a tale… A true one nevertheless

Slide 9

And as in most tales It begins with a mission

Slide 10

And a band of heroes Engulfed into the adventure

Slide 11

They fight against mishaps And all kinds of foes

Slide 12

They build mighty fortresses Pushing the limits of the possible

Slide 13

And defend them day after day Against all odds

Slide 14

But we don’t yet know the end Because this tale isn’t finished

Slide 15

It begins with a mission Build a metrics platform for OVH

Slide 16

A long time ago…

Slide 17

A long time ago… Monitoring: Does the system work?

Slide 18

Moving from monolith to μservices App

Slide 19

Moving from monolith to μservices App App App

Slide 20

Moving from monolith to μservices App App App DB App Slaves

Slide 21

Moving from monolith to μservices App App App Bus DB App Slaves

Slide 22

Moving from monolith to μservices RPXY LB Cache App App App Bus DB App Slaves

Slide 23

What could go wrong? RPXY LB Cache App App App Bus DB App Slaves

Slide 24

Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill

Slide 25

We need to have insights Observability: How does the system work?

Slide 26

OVH decided to go metrics-oriented

Slide 27

A metrics platform for OVH For all OVH

Slide 28

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 29

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 30

OVH monitoring story We had lots of partial solutions…

Slide 31

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 32

OVH monitoring story Including a really big

Slide 33

OpenTSDB drawbacks OpenTSDB RowKey Design !

Slide 34

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us

Slide 35

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query : 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Slide 36

Scaling OpenTSDB

Slide 37

Metrics needs First need: To be massively scalable

Slide 38

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 39

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer

Slide 40

Metrics needs Second need: To have rich query capabilities

Slide 41

Enter Warp 10… Open-source Time series Database

Slide 42

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Slide 43

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 44

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript

Slide 45

Did you say scalability? From the smallest to the largest…

Slide 46

More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● No cardinality issues
● Lockfree ingestion
● WarpScript Query Language
● Support more data types
● Better Performance
● Better Scalability
● Versatile (standalone, distributed)

Slide 47

OVH Observability Metrics Platform

Slide 48

Building an ecosystem From Warp 10 to OVH Metrics

Slide 49

What protocols should we support? Who must make the effort?

Slide 50

Open source monitoring tools

Slide 51

Open source monitoring tools

Slide 52

Open source monitoring tools

Slide 53

Open source monitoring tools

Slide 54

Open source monitoring tools

Slide 55

Open source monitoring tools

Slide 56

Open source monitoring tools Why choose? Let’s support all of them!

Slide 57

Metrics Platform

Slide 58

Metrics Platform
https://<protocol>.<region>.metrics.ovh.net
Protocols: graphite, influx, opentsdb, prometheus, Warp10, tsl, …

Slide 59

Metrics Platform
https://<protocol>.<region>.metrics.ovh.net
Protocols: graphite, influx, opentsdb, prometheus, Warp10, tsl, …
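As a minimal sketch of the naming scheme on this slide, the helper below builds one endpoint URL per protocol. The function name `endpoint` and the region value `gra1` are assumptions made for illustration; they are not taken from the talk.

```python
# Protocol names as listed on the slide (lowercased for use in a hostname).
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10", "tsl")


def endpoint(protocol: str, region: str) -> str:
    """Hypothetical helper: build https://<protocol>.<region>.metrics.ovh.net."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    return f"https://{protocol}.{region}.metrics.ovh.net"


if __name__ == "__main__":
    # "gra1" is an assumed region identifier, used only as an example.
    for name in PROTOCOLS:
        print(endpoint(name, "gra1"))
```

The point of the design is that a client keeps its existing protocol and only changes the host it writes to.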

Slide 60

TSL
select("cpu.usage_system")
  .where("cpu~cpu[0-7]*")
  .last(12h)
  .sampleBy(5m,max)
  .groupBy(mean)
  .rate()
github.com/ovh/tsl
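To make the last three steps of this query concrete, here is a rough pure-Python emulation on in-memory data. It is a sketch under assumptions, not how TSL or Warp 10 is implemented: buckets are labelled by their start time, `rate` is taken to mean per-second change between consecutive points, and the `where` and `last` steps are left out. The function names and the sample series are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sample_by(points, span, aggregator):
    """Downsample one series: bucket (timestamp, value) pairs into span-second buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % span].append(value)
    return {tick: aggregator(vals) for tick, vals in sorted(buckets.items())}


def group_by(series_list, aggregator):
    """Merge several bucketed series into one, aggregating values tick by tick."""
    merged = defaultdict(list)
    for series in series_list:
        for tick, value in series.items():
            merged[tick].append(value)
    return {tick: aggregator(vals) for tick, vals in sorted(merged.items())}


def rate(series):
    """Per-second change between consecutive points of one series."""
    ticks = sorted(series)
    return {
        cur: (series[cur] - series[prev]) / (cur - prev)
        for prev, cur in zip(ticks, ticks[1:])
    }


# Two made-up series (timestamps in seconds), standing in for cpu0 and cpu1.
cpu0 = [(0, 10.0), (120, 30.0), (300, 40.0), (420, 70.0)]
cpu1 = [(0, 20.0), (200, 10.0), (300, 100.0), (500, 80.0)]

sampled = [sample_by(s, 300, max) for s in (cpu0, cpu1)]  # .sampleBy(5m,max)
grouped = group_by(sampled, mean)                         # .groupBy(mean)
result = rate(grouped)                                    # .rate()

if __name__ == "__main__":
    print(grouped)
    print(result)
```

Each step shrinks the data: many raw points per series become one point per bucket, many series become one, and only the final series leaves the platform.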

Slide 61

Metrics Live In-memory, high-performance Metrics instances

Slide 62

In-memory: Metrics live millions of writes/s

Slide 63

In-memory: Metrics live

Slide 64

In-memory: Metrics live

Slide 65

Monitoring is only the beginning OVH Metrics answers many other use cases

Slide 66

Gravelines rack temperature

Slide 67

Even medical research… Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns on perinatal mortality

Slide 68

Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)

Slide 69

Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications

Slide 70

SREing Metrics With great power comes great responsibility

Slide 71

Metrics’s metrics 432.000.000.000 datapoints / day
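To put the daily figure in perspective, a short calculation converts it to an average per-second rate. It assumes the load is spread evenly over the day, which real traffic rarely is, so treat the result as a mean rather than a peak.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

# Integer division is exact here: 432e9 is a multiple of 86,400.
per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"{per_second:,} datapoints/s on average")  # 5,000,000 datapoints/s on average
```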

Slide 72

Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only SystemD
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ Zookeeper
  ○ Flink
● And Warp 10

Slide 73

Our biggest Hadoop cluster
● 200 datanodes
● ~60k regions of 10 GB
● 2.3 PB of capacity
● 8.5 Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s
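A back-of-the-envelope check relates the region count to the stated capacity. Everything beyond the slide's own numbers is an assumption: each region is read as holding at most 10 gigabytes, and a replication factor of 3 (the usual HDFS default) is assumed because the talk does not state one. The result is an upper bound, not a measured usage figure.

```python
REGIONS = 60_000     # "~60k regions" from the slide
REGION_SIZE_GB = 10  # assumption: region size read as 10 gigabytes
REPLICATION = 3      # assumption: HDFS default replication factor
CAPACITY_PB = 2.3    # capacity from the slide

logical_tb = REGIONS * REGION_SIZE_GB / 1_000  # decimal units: 1 TB = 1,000 GB
raw_pb = logical_tb * REPLICATION / 1_000      # 1 PB = 1,000 TB

if __name__ == "__main__":
    print(f"logical data: up to {logical_tb:.0f} TB")
    print(f"raw with replication: up to {raw_pb:.1f} PB of {CAPACITY_PB} PB")
```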

Slide 74

Hadoop needs a lot of

Slide 75

Warp10: distributed overview

Slide 76

Warp10: distributed overview

Slide 77

Warp10: distributed overview

Slide 78

Warp10: distributed overview

Slide 79

Warp10: distributed overview

Slide 80

Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4 TB of disk

Slide 81

Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM

Slide 82

Why should you care?

Slide 83

Why should you care? (>30s)

Slide 84

The only way to optimize: measure
Diagram labels: App, Runtime, Host; Logs, Metrics
● What is my application doing?
● What is my runtime doing?
● How many GCs were triggered?
● Is there a hardware failure?
● How many HTTP calls?
● How much disk do I have left?
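Two of these questions, how many garbage collections ran and how much disk is left, can be answered with a few lines of standard-library Python. This is an illustrative sketch only: the talk monitors the JVM, not Python, and the metric names and the `collect_metrics` function are hypothetical.

```python
import gc
import shutil
import time


def collect_metrics(path="."):
    """Return a flat dict of hypothetical metric names to current values."""
    metrics = {}
    # Runtime: number of garbage collections run so far, per GC generation.
    for generation, stats in enumerate(gc.get_stats()):
        metrics[f"runtime.gc.gen{generation}.collections"] = stats["collections"]
    # Host: free and total space on the filesystem containing `path`.
    usage = shutil.disk_usage(path)
    metrics["host.disk.free_bytes"] = usage.free
    metrics["host.disk.total_bytes"] = usage.total
    return metrics


if __name__ == "__main__":
    now = int(time.time())
    # One "name timestamp value" line per metric; the layout is an arbitrary choice.
    for name, value in sorted(collect_metrics().items()):
        print(f"{name} {now} {value}")
```

Run periodically and shipped to a metrics backend, readings like these turn the questions on the slide into time series that can be graphed and alerted on.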

Slide 85

Monitoring JVM with metrics

Slide 86

Monitoring JVM with metrics

Slide 87

Monitoring JVM with metrics

Slide 88

Monitoring JVM with metrics

Slide 89

Monitoring JVM with metrics

Slide 90

Tuning G1 is hard

Slide 91

Tuning G1 is hard

Slide 92

Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript

Slide 93

Our programming stack However, we use non-garbage-collected languages such as Rust when needed

Slide 94

Our friends for µservices

Slide 95

We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group

Slide 96

Conclusion That’s all folks!