Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at DEVOPS.BARCELONA in June 2019 in Barcelona, Spain by Horacio Gonzalez

Slide 1

Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform
Horacio Gonzalez @LostInBrittany

Slide 2

Who are we?
Introducing myself and introducing OVH

Slide 3

Horacio Gonzalez @LostInBrittany
Spaniard lost in Brittany, developer, dreamer and all-around geek
Flutter

Slide 4

OVH: Key Figures
● 1.3M customers worldwide in 138 countries
● 1.5 billion euros investment over five years
● 28 datacenters (growing)
● 350k dedicated servers
● 200k private cloud VMs running
● 650k public cloud instances created in a month
● 20TB bandwidth capacity
● 35 points of presence
● 4TB anti-DDoS capacity
● Hosting capacity: 1.3M physical servers
● 2,500 employees in 19 countries
● 20 years of innovation

Slide 5

OVH: Our solutions
Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 6

Once upon a time…
Because I love telling tales

Slide 7

This talk is about a tale…
A true one nevertheless

Slide 8

And as in most tales
It begins with a mission

Slide 9

And a band of heroes
Engulfed into the adventure

Slide 10

They fight against mishaps
And all kinds of foes

Slide 11

They build mighty fortresses
Pushing the limits of the possible

Slide 12

And defend them day after day
Against all odds

Slide 13

But we don’t know the end yet
Because this tale isn’t finished

Slide 14

It begins with a mission
Build a metrics platform for OVH

Slide 15

A long time ago…

Slide 16

A long time ago…
Monitoring: Does the system work?

Slide 17

Moving from monolith to μservices
[Diagram: App]

Slide 18

Moving from monolith to μservices
[Diagram: App, App, App]

Slide 19

Moving from monolith to μservices
[Diagram: App, App, App, DB, App, Slaves]

Slide 20

Moving from monolith to μservices
[Diagram: App, App, App, Bus, DB, App, Slaves]

Slide 21

Moving from monolith to μservices
[Diagram: RPXY, LB, Cache, App, App, App, Bus, DB, App, Slaves]

Slide 22

What could go wrong?
[Diagram: RPXY, LB, Cache, App, App, App, Bus, DB, App, Slaves]

Slide 23

Microservices are a distributed system
GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill

Slide 24

We need to have insights
Observability: Understand how it works

Slide 25

OVH decided to go metrics-oriented

Slide 26

A metrics platform for OVH
For all OVH

Slide 27

Building OVH Metrics
One Platform to unify them all,
One Platform to find them,
One Platform to bring them all
and in the Metrics monitor them

Slide 28

What is OVH Metrics?
Managed Cloud Platform for Time Series

Slide 29

OVH monitoring story
We had lots of partial solutions…

Slide 30

OVH monitoring story
One Platform to unify them all
What should we build it on?

Slide 31

OVH monitoring story
Including a really big

Slide 32

OpenTSDB drawbacks
OpenTSDB RowKey Design!

Slide 33

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
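The full-table-scan complaint can be illustrated with a minimal sketch. It assumes the row key layout described in the OpenTSDB documentation (metric UID, then the base-hour timestamp, then tag key/value UID pairs), with 3-byte UIDs and no salting. The UID tables, the sample data and the helper names (`row_key`, `scan_by_tag`) are hypothetical and only serve the illustration. Because the tag pairs sit after the timestamp, a filter on a tag value cannot be turned into a key prefix, so every row of the metric in the time range has to be visited.

```python
import struct

# Hypothetical UID tables; a real OpenTSDB assigns and stores these itself.
METRIC_UIDS = {"cpu.usage_system": 1}
TAGK_UIDS = {"host": 1, "cpu": 2}
TAGV_UIDS = {"web1": 1, "web2": 2, "cpu0": 3, "cpu1": 4}
UID_WIDTH = 3  # assumed default UID width in bytes


def uid(n):
    """Encode a UID as a fixed-width big-endian byte string."""
    return n.to_bytes(UID_WIDTH, "big")


def row_key(metric, timestamp, tags):
    """Build metric UID + base-hour timestamp + tag pairs ordered by tag key UID."""
    base_hour = timestamp - timestamp % 3600
    key = uid(METRIC_UIDS[metric]) + struct.pack(">I", base_hour)
    for tagk, tagv in sorted((TAGK_UIDS[k], TAGV_UIDS[v]) for k, v in tags.items()):
        key += uid(tagk) + uid(tagv)
    return key


def decode_tags(key):
    """Return {tag key UID: tag value UID} from the tail of a row key."""
    body = key[UID_WIDTH + 4:]
    step = 2 * UID_WIDTH
    return {
        int.from_bytes(body[i:i + UID_WIDTH], "big"):
        int.from_bytes(body[i + UID_WIDTH:i + step], "big")
        for i in range(0, len(body), step)
    }


def scan_by_tag(rows, metric, timestamp, tagk, tagv):
    """Only metric + hour form a usable prefix; the tag filter runs on every row."""
    prefix = row_key(metric, timestamp, {})
    visited, matched = 0, []
    for key in rows:
        if key.startswith(prefix):
            visited += 1
            if decode_tags(key).get(TAGK_UIDS[tagk]) == TAGV_UIDS[tagv]:
                matched.append(key)
    return visited, matched


# Four series of the same metric: 2 hosts x 2 CPUs.
ROWS = [
    row_key("cpu.usage_system", 1560000000, {"host": h, "cpu": c})
    for h in ("web1", "web2")
    for c in ("cpu0", "cpu1")
]
visited, matched = scan_by_tag(ROWS, "cpu.usage_system", 1560000000, "cpu", "cpu0")
print(f"visited {visited} rows to find {len(matched)} matches")
```

With four series the cost is invisible, but the visited count grows with the cardinality of the metric, not with the number of matching series, which is the latency problem the slide points at.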

Slide 34

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query: 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Slide 35

Scaling OpenTSDB

Slide 36

Metrics needs
First need: To be massively scalable

Slide 37

Analytics is the key to success
Fetching data is only the tip of the iceberg

Slide 38

Analysing metrics data
To be scalable, analysis must be done in the database, not on the user’s computer

Slide 39

Metrics needs
Second need: To have rich query capabilities

Slide 40

Enter Warp 10…
Open-source Time Series Database

Slide 41

More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Slide 42

Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 43

Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript

Slide 44

Did you say scalability?
From the smallest to the largest…

Slide 45

More Warp 10 goodness
● Secured & multi tenant
● Synchronous (transactions)
● In memory Index
● Better Performance
● No cardinality issues
● Better Scalability
● Lockfree ingestion
● Versatile
● WarpScript Query Language (standalone, distributed)
● Support more data types

Slide 46

OVH Observability Metrics Platform

Slide 47

Building an ecosystem
From Warp 10 to OVH Metrics

Slide 48

What protocols should we support?
Who must make the effort?

Slide 49

Open source monitoring tools

Slide 50

Open source monitoring tools

Slide 51

Open source monitoring tools

Slide 52

Open source monitoring tools

Slide 53

Open source monitoring tools

Slide 54

Open source monitoring tools

Slide 55

Open source monitoring tools
Why choose? Let’s support all of them!

Slide 56

Metrics Platform

Slide 57

Metrics Platform
https://{graphite, influx, opentsdb, prometheus, Warp10, tsl, …}.<region>.metrics.ovh.net

Slide 58

Metrics Platform
https://{graphite, influx, opentsdb, prometheus, Warp10, tsl, …}.<region>.metrics.ovh.net
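The one-hostname-per-protocol idea can be sketched in a few lines. This is only an illustration of the URL pattern shown on the slide: the region name `gra1`, the lowercase `warp10` host label and the helper names are assumptions. The datapoint dictionary follows the standard OpenTSDB HTTP write format (`metric`, `timestamp`, `value`, `tags`); the slides do not say which paths or authentication scheme the platform expects, so nothing is sent over the network here.

```python
import json

# Protocol names taken from the slide; "warp10" in lowercase is an assumption.
PROTOCOLS = ("graphite", "influx", "opentsdb", "prometheus", "warp10", "tsl")


def endpoint(protocol, region):
    """Build the base URL for one protocol in one region, following the slide's pattern."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    return f"https://{protocol}.{region}.metrics.ovh.net"


def opentsdb_datapoint(metric, timestamp, value, tags):
    """Shape one datapoint the way the OpenTSDB HTTP write API expects it."""
    return {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags}


# "gra1" is a hypothetical region name used only for the example.
url = endpoint("opentsdb", "gra1")
body = json.dumps([opentsdb_datapoint("cpu.usage_system", 1560000000, 12.5, {"host": "web1"})])
print(url)
print(body)
```

The point of the design is that an existing collector keeps speaking its own protocol and only its target host changes.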

Slide 59

TSL
select("cpu.usage_system")
  .where("cpu~cpu[0-7]*")
  .last(12h)
  .sampleBy(5m,max)
  .groupBy(mean)
  .rate()
github.com/ovh/tsl
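To make the chain easier to read, here is a rough Python approximation of what the last three steps of the query do to the selected series. It is a sketch of the general idea only, not TSL's implementation: the bucket alignment (buckets labelled by their start time), the definition of `rate` (difference per second between consecutive points) and the sample data are all assumptions.

```python
from collections import defaultdict


def sample_by(points, span, agg):
    """Downsample one series: group points into fixed-width buckets and aggregate each."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % span].append(value)
    return [(start, agg(values)) for start, values in sorted(buckets.items())]


def group_by(series_list, agg):
    """Merge several aligned series into one by aggregating values tick by tick."""
    merged = defaultdict(list)
    for points in series_list:
        for ts, value in points:
            merged[ts].append(value)
    return [(ts, agg(values)) for ts, values in sorted(merged.items())]


def rate(points):
    """Per-second change between consecutive points."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


def mean(values):
    return sum(values) / len(values)


# Two hypothetical series (timestamps in seconds), one per CPU label.
cpu0 = [(0, 10.0), (120, 14.0), (300, 20.0), (420, 18.0)]
cpu1 = [(60, 6.0), (360, 10.0)]

sampled = [sample_by(s, 300, max) for s in (cpu0, cpu1)]  # .sampleBy(5m,max)
grouped = group_by(sampled, mean)                          # .groupBy(mean)
result = rate(grouped)                                     # .rate()
print(grouped)
print(result)
```

In the platform these steps run server-side, next to the data, which is the "analysis must be done in the database" requirement stated earlier in the talk.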

Slide 60

Metrics Live
In-memory, high-performance Metrics instances

Slide 61

In-memory: Metrics live
millions of writes/s

Slide 62

In-memory: Metrics live

Slide 63

In-memory: Metrics live

Slide 64

Monitoring is only the beginning
OVH Metrics answers many other use cases

Slide 65

Gravelines rack temperature

Slide 66

Even medical research…
Metrics’ Pattern Detection feature helped gynaecology research prove patterns in perinatal mortality

Slide 67

Use case families
● Billing (e.g. bill on monthly max consumption)
● Monitoring (APM, infrastructure, appliances, …)
● IoT (manage devices, operator integration, …)
● Geo Location (manage localized fleets)

Slide 68

Use cases
● DC Temperature/Elec/Cooling map
● Pay as you go billing (PCI/IPLB)
● GSCAN
● Monitoring
● ML Model scoring (Anti-Fraud)
● Pattern Detection for medical applications

Slide 69

SREing Metrics
With great power comes great responsibility

Slide 70

Metrics’s metrics
432,000,000,000 datapoints / day
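A quick back-of-the-envelope conversion puts the daily figure on a per-second scale. The calculation below only uses the number on the slide and assumes a uniform spread over the day, so it gives an average; actual peaks are not stated in the talk.

```python
DATAPOINTS_PER_DAY = 432_000_000_000  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

# Average ingestion rate, assuming the load is spread evenly over the day.
avg_per_second = DATAPOINTS_PER_DAY // SECONDS_PER_DAY
print(f"{avg_per_second:,} datapoints/s on average")  # 5,000,000 datapoints/s on average
```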

Slide 71

Our stack overview
● More than 650 machines operated by 5 people
● >95% dedicated servers
● No Docker, only SystemD
● Running many Apache projects:
  ○ Hadoop
  ○ HBase
  ○ ZooKeeper
  ○ Flink
● And Warp 10

Slide 72

Our biggest Hadoop cluster
● 200 datanodes
● 60k regions of 10Gb
● ~2.3 PB of capacity
● 8.5Gb/s of bandwidth
● 1.5M writes/s
● 3M reads/s

Slide 73

Hadoop needs a lot of

Slide 74

Warp10: distributed overview

Slide 75

Warp10: distributed overview

Slide 76

Warp10: distributed overview

Slide 77

Warp10: distributed overview

Slide 78

Warp10: distributed overview

Slide 79

Hadoop nodes
Most of the nodes are the following:
● 16 to 32 cores
● 64 to 128 GB of RAM
● 12 to 16 TB
But we also have some huge nodes:
● 2x 20 cores (Xeon Gold)
● 320 GB of RAM
● 12x 4TB of disk

Slide 80

Warp10 nodes
Ingress (cpu-bound):
● 32 cores
● 128 GB of RAM
Egress (cpu-bound):
● 32 cores
● 128 GB of RAM
Directory (ram-bound):
● 48 cores
● 512 GB of RAM
Store (cpu-bound):
● 32 cores
● 128 GB of RAM

Slide 81

Why should you care?

Slide 82

Why should you care? (>30s)

Slide 83

The only way to optimize: measure
● What is my application doing?
● What is my runtime doing?
● How many GCs were triggered?
● Is there a hardware failure?
● How many HTTP calls?
● How much disk do I have left?
[Diagram: App, Runtime, Host; Logs, Metrics]
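As a small illustration of turning one of these questions ("how many GCs were triggered?") into a datapoint, the sketch below counts garbage collections and formats the count as one line of the Warp 10 input format (`TS// NAME{LABELS} VALUE` when no location is attached). It uses Python's own `gc` module as a stand-in for the JVM counters the talk goes on to show; the metric name `app.gc.runs`, the label, and the microsecond timestamp unit are assumptions, and nothing is pushed anywhere.

```python
import gc
from urllib.parse import quote

gc_runs = 0


def _count_gc(phase, info):
    """gc callback: count each completed collection."""
    global gc_runs
    if phase == "stop":
        gc_runs += 1


gc.callbacks.append(_count_gc)


def gts_line(ts_us, name, labels, value):
    """Format one numeric datapoint; names and label values are percent-encoded."""
    encoded = ",".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(labels.items())
    )
    return f"{ts_us}// {quote(name, safe='')}{{{encoded}}} {value}"


gc.collect()  # force one collection so the counter has something to report
line = gts_line(1560000000000000, "app.gc.runs", {"host": "web1"}, gc_runs)
print(line)
```

A collector would normally scrape such counters and forward them in one of the protocols the platform accepts.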

Slide 84

Monitoring JVM with metrics

Slide 85

Monitoring JVM with metrics

Slide 86

Monitoring JVM with metrics

Slide 87

Monitoring JVM with metrics

Slide 88

Monitoring JVM with metrics

Slide 89

Tuning G1 is hard

Slide 90

Tuning G1 is hard

Slide 91

Our programming stack
● We mostly use garbage-collected languages such as
  ○ Go
  ○ Java
  ○ JavaScript

Slide 92

Our programming stack
However, we use non-garbage-collected languages such as Rust when needed

Slide 93

Our friends for µservices

Slide 94

We open-source
Code contribution:
● https://github.com/ovh/beamium
● https://github.com/ovh/noderig
● https://github.com/ovh/tsl
● https://github.com/ovh/ovh-warp10-datasource
● https://github.com/ovh/ovh-tsl-datasource
● …
Involved in:
● Warp10 community
● Apache HBase/Flink development
● Prometheus/InfluxData discussions
● TS Query Language Working group

Slide 95

Conclusion
That’s all folks!