Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at RivieraDev in May 2019 in Sophia Antipolis, France by Horacio Gonzalez

Slide 1

Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform
Horacio Gonzalez (@LostInBrittany) and Pierre Zemb (@PierreZ)

Slide 2

Who are we? Introducing myself and introducing OVH

Slide 3

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 4

Pierre Zemb @PierreZ Software Engineer working on distributed systems

Slide 5

OVH: A Global Leader on Cloud
  • 200k Private cloud VMs running
  • 1 Dedicated IaaS Europe
  • 2018: 27 Datacenters
  • 2020: 50 Datacenters
  • Own 15 Tbps Network with 35 PoPs
  • Hosting capacity: 1.3M Physical Servers
  • 360k Servers already deployed
  • 1.3M Customers in 138 Countries

Slide 6

OVH: Key Figures
  • 1.3M Customers worldwide in 138 Countries
  • 1.5 billion euros investment over five years
  • 28 Datacenters (growing)
  • 350k Dedicated Servers
  • 200k Private cloud VMs running
  • 650k Public cloud Instances created in a month
  • 20TB bandwidth capacity
  • 2,500 Employees in 19 countries
  • 18 Years of Innovation
  • 35 Points of presence
  • 4TB Anti DDoS capacity
  • Hosting capacity: 1.3M Physical Servers

Slide 7

Ranking & Recognition
  • 1st European Cloud Provider*
  • 1st Hosting provider in Europe
  • 1st Provider Microsoft Exchange
  • Certified vCloud Datacenter
  • Certified Kubernetes platform (CNCF)
  • VMware Global Service Provider 2013-2016
  • Veeam Best Cloud Partner of the year (2018)
* Netcraft 2017

Slide 8

OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 9

Once upon a time… Because I love telling tales

Slide 10

This talk is about a tale… A true one nevertheless

Slide 11

And as in most tales It begins with a mission

Slide 12

And a band of heroes Engulfed into the adventure

Slide 13

They fight against mishaps And all kinds of foes

Slide 14

They build mighty fortresses Pushing the limits of the possible

Slide 15

And defend them day after day Against all odds

Slide 16

But we don’t know the end yet Because this tale isn’t finished

Slide 17

It begins with a mission Build a metrics platform for OVH

Slide 18

A long time ago…

Slide 19

A long time ago… Monitoring: Does the system work?

Slide 20

Moving from monolith to μservices [Diagram: a single App]

Slide 21

Moving from monolith to μservices [Diagram: three Apps]

Slide 22

Moving from monolith to μservices [Diagram: four Apps]

Slide 23

Moving from monolith to μservices [Diagram: four Apps]

Slide 24

Moving from monolith to μservices [Diagram: RPXY, LB, four Apps, Cache]

Slide 25

What could go wrong? [Diagram: RPXY, LB, four Apps, Cache]

Slide 26

Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill

Slide 27

We need to have insights Observability: Understand how it works

Slide 28

OVH decided to go metrics-oriented

Slide 29

A metrics platform for OVH For all OVH

Slide 30

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 31

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 32

OVH monitoring story We had lots of partial solutions…

Slide 33

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 34

OVH monitoring story Including a really big

Slide 35

OpenTSDB drawbacks OpenTSDB RowKey Design!

Slide 36

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
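
The regex point can be illustrated with a toy model. The sketch below assumes a simplified row key made of the metric, an hourly base timestamp and the sorted tags (real OpenTSDB uses binary UIDs and optional salting); `row_key` and `scan` are hypothetical helpers, not OpenTSDB APIs. Because tags come last in the key, a regex on a tag value cannot narrow the key range, so every row of the metric in the time window is examined.

```python
import re
from bisect import bisect_left

def row_key(metric, base_ts, tags):
    """Simplified stand-in for a row key: metric, hour bucket, sorted tags."""
    return (metric, base_ts, tuple(sorted(tags.items())))

def scan(rows, metric, start_ts, end_ts, tag_filter):
    """Range-scan sorted rows on (metric, time), then filter tags row by row."""
    lo = bisect_left(rows, (metric, start_ts))
    hi = bisect_left(rows, (metric, end_ts))
    examined = rows[lo:hi]  # the tag filter cannot shrink this range
    matched = [r for r in examined if tag_filter(dict(r[2]))]
    return len(examined), matched

# 1000 hosts reporting the same metric over two hourly buckets (toy data).
rows = sorted(
    row_key("cpu", ts, {"host": f"h{i}"})
    for ts in (0, 3600)
    for i in range(1000)
)
pattern = re.compile(r"h42")
examined, matched = scan(
    rows, "cpu", 0, 3600, lambda tags: pattern.fullmatch(tags["host"]) is not None
)
print(examined, len(matched))  # 1000 rows examined to return 1 series
```

The cost of the query grows with the cardinality of the metric rather than with the size of the result, which is the latency problem named on the slide.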

Slide 37

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query: 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Slide 38

Scaling OpenTSDB

Slide 39

Metrics needs First need: To be massively scalable

Slide 40

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 41

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer

Slide 42

Metrics needs Second need: To have rich query capabilities

Slide 43

Enter Warp 10… Open-source Time series Database

Slide 44

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
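
To give an idea of what ingestion looks like, here is a minimal sketch that formats one datapoint in the Warp 10 GTS input format (`TS// NAME{LABELS} VALUE`, with latitude, longitude and elevation left empty). The helper name `gts_line` is hypothetical, timestamps are assumed to be in microseconds (the platform default, which a deployment may change), and only numeric values are handled.

```python
from urllib.parse import quote

def gts_line(ts_us, name, labels, value):
    """Format one numeric datapoint as a GTS input line (no geo information)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("this sketch only handles numeric values")
    # Class name and labels are percent-encoded; labels are sorted for stable output.
    encoded = ",".join(
        f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in sorted(labels.items())
    )
    return f"{ts_us}// {quote(name, safe='')}{{{encoded}}} {value}"

line = gts_line(1557900000000000, "cpu.usage_system", {"cpu": "cpu0", "host": "web 1"}, 12.5)
print(line)
```

Lines in this shape are what a client would send in the body of an update request, one datapoint per line.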

Slide 45

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 46

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript

Slide 47

Did you say scalability? From the smallest to the largest…

Slide 48

More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile
● WarpScript Query Language (standalone, distributed)
● Support more data types

Slide 49

OVH Observability Metrics Platform

Slide 50

Building an ecosystem From Warp 10 to OVH Metrics

Slide 51

What protocols should we support? Who must make the effort?

Slide 52

Open source monitoring tools

Slide 53

Open source monitoring tools

Slide 54

Open source monitoring tools

Slide 55

Open source monitoring tools

Slide 56

Open source monitoring tools

Slide 57

Open source monitoring tools

Slide 58

Open source monitoring tools Why choose? Let’s support all of them!

Slide 59

Metrics Platform

Slide 60

Metrics Platform
https://{graphite, influx, opentsdb, prometheus, warp10, tsl, …}.<region>.metrics.ovh.net

Slide 61

Metrics Platform
https://{graphite, influx, opentsdb, prometheus, warp10, tsl, …}.<region>.metrics.ovh.net
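
The endpoint scheme on this slide can be sketched as a small helper: one hostname per protocol and region. The function name `endpoint` and the region value `gra1` used in the example are assumptions for illustration; the protocol list is the one shown on the slide, which ends with "…", so it is not exhaustive.

```python
# Protocols named on the slide (the slide's list is open-ended).
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10", "tsl")

def endpoint(protocol: str, region: str) -> str:
    """Build the base URL for a given protocol and region."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"protocol not listed on the slide: {protocol}")
    return f"https://{protocol}.{region}.metrics.ovh.net"

# "gra1" is a placeholder region name for this example.
for proto in PROTOCOLS:
    print(endpoint(proto, "gra1"))
```

The design choice the slide makes is that users keep their existing tooling and only change the hostname they point it at.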

Slide 62

TSL
select("cpu.usage_system")
  .where("cpu~cpu[0-7]*")
  .last(12h)
  .sampleBy(5m,max)
  .groupBy(mean)
  .rate()
github.com/ovh/tsl
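
To make the TSL pipeline on this slide concrete, here is a minimal Python sketch that emulates each step on in-memory data. It is not how TSL is implemented: the function names, second-based timestamps, full-match regex semantics, labelling buckets by their start time and a per-second rate are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

def select(series, name):
    """Keep series whose metric name matches."""
    return [s for s in series if s["name"] == name]

def where(series, label, pattern):
    """Keep series whose label value fully matches the regex (assumed semantics)."""
    rx = re.compile(pattern)
    return [s for s in series if rx.fullmatch(s["labels"].get(label, ""))]

def last(points, span, now):
    """Keep points from the last `span` seconds."""
    return [(t, v) for t, v in points if now - span < t <= now]

def sample_by(points, bucket, agg):
    """Downsample into fixed-width buckets, labelled by bucket start."""
    buckets = defaultdict(list)
    for t, v in points:
        buckets[t - t % bucket].append(v)
    return [(t, agg(vs)) for t, vs in sorted(buckets.items())]

def group_by(series_points, agg):
    """Merge several series into one, aggregating values that share a timestamp."""
    merged = defaultdict(list)
    for points in series_points:
        for t, v in points:
            merged[t].append(v)
    return [(t, agg(vs)) for t, vs in sorted(merged.items())]

def rate(points):
    """Per-second rate of change between consecutive points."""
    return [(t2, (v2 - v1) / (t2 - t1)) for (t1, v1), (t2, v2) in zip(points, points[1:])]

def run(series, now):
    picked = where(select(series, "cpu.usage_system"), "cpu", r"cpu[0-7]*")
    sampled = [sample_by(last(s["points"], 12 * 3600, now), 5 * 60, max) for s in picked]
    return rate(group_by(sampled, mean))

data = [
    {"name": "cpu.usage_system", "labels": {"cpu": "cpu0"}, "points": [(0, 1.0), (100, 3.0), (300, 5.0)]},
    {"name": "cpu.usage_system", "labels": {"cpu": "cpu1"}, "points": [(0, 3.0), (300, 9.0)]},
    {"name": "cpu.usage_system", "labels": {"cpu": "cpu8"}, "points": [(0, 100.0)]},
]
result = run(data, now=600)
print(result)
```

In the toy data, `cpu8` is filtered out by the regex, the two remaining series are downsampled to 5-minute maxima, averaged together, and turned into a rate.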

Slide 63

Metrics Live In-memory, high-performance Metrics instances

Slide 64

In-memory: Metrics live millions of writes/s

Slide 65

In-memory: Metrics live

Slide 66

In-memory: Metrics live

Slide 67

Monitoring is only the beginning OVH Metrics answers many other use cases

Slide 68

Gravelines rack temperature

Slide 69

Even medical research… Metrics’ Pattern Detection feature helped Gynaecology Research to prove patterns in perinatal mortality

Slide 70

Use cases families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, …)
• IoT (Manage devices, operator integration, …)
• Geo Location (Manage localized fleets)

Slide 71

Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications

Slide 72

SREing Metrics With great power comes great responsibility

Slide 73

Metrics’ metrics 432,000,000,000 datapoints / day
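
As a worked example, the daily figure on this slide can be converted into a sustained average ingestion rate by dividing by the number of seconds in a day. This is only arithmetic on the slide's number; real traffic is unlikely to be perfectly flat.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

def per_second(per_day: int) -> float:
    """Average rate per second for a daily total."""
    return per_day / SECONDS_PER_DAY

rate_per_s = per_second(DATAPOINTS_PER_DAY)
print(f"{rate_per_s:,.0f} datapoints/s")  # 5,000,000 datapoints/s
```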

Slide 74

Our stack overview
● More than 666 machines operated by 5 people
● >95% dedicated servers
● No Docker, only systemd
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ ZooKeeper
  ○ Flink
● And Warp 10

Slide 75

Our biggest Hadoop cluster
● 200 datanodes
● ~60k regions of 10Gb
● 2.3 PB of capacity
● 8.5Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s
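
A quick back-of-the-envelope calculation helps read these cluster figures per machine. The sketch below simply divides the slide's totals by the number of datanodes; it assumes an even spread across nodes, decimal units (1 PB = 1,000 TB) and that regions are served from the same machines as the datanodes, none of which the slide states.

```python
DATANODES = 200
REGIONS = 60_000           # "~60k regions"
CAPACITY_TB = 2_300        # 2.3 PB, assuming 1 PB = 1,000 TB
WRITES_PER_S = 1_500_000
READS_PER_S = 3_000_000

def per_node(total: float, nodes: int = DATANODES) -> float:
    """Average share of a cluster-wide total carried by one node."""
    return total / nodes

print(f"regions/node:  {per_node(REGIONS):.0f}")
print(f"capacity/node: {per_node(CAPACITY_TB):.1f} TB")
print(f"writes/s/node: {per_node(WRITES_PER_S):.0f}")
print(f"reads/s/node:  {per_node(READS_PER_S):.0f}")
```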

Slide 76

Hadoop needs a lot of

Slide 77

Warp10: distributed overview

Slide 78

Warp10: distributed overview

Slide 79

Warp10: distributed overview

Slide 80

Warp10: distributed overview

Slide 81

Warp10: distributed overview

Slide 82

Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4 TB of disk

Slide 83

Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM

Slide 84

Why should you care?

Slide 85

Why should you care? (>30s)

Slide 86

The only way to optimize: measure Logs Metrics

Slide 87

Monitoring JVM with metrics

Slide 88

Monitoring JVM with metrics

Slide 89

Monitoring JVM with metrics

Slide 90

Monitoring JVM with metrics

Slide 91

Monitoring JVM with metrics

Slide 92

Tuning G1 is hard

Slide 93

Tuning G1 is hard

Slide 94

Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript

Slide 95

Our programming stack However, we use non-garbage-collected languages such as Rust when needed

Slide 96

Our friends for µservices

Slide 97

We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group

Slide 98

Conclusion That’s all folks!