Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at DEVOPS.BARCELONA in June 2019 in Barcelona, Spain by Horacio Gonzalez

Slide 1

Monitoring OVH: 300k servers, 28 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany

Slide 2

Who are we? Introducing myself and introducing OVH

Slide 3

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 4

OVH: Key Figures ● 1.3M customers worldwide in 138 countries ● 1.5 billion euros of investment over five years ● 28 datacenters (growing) ● 350k dedicated servers ● 200k private cloud VMs running ● 650k public cloud instances created in a month ● 20TB bandwidth capacity ● 35 points of presence ● 4TB anti-DDoS capacity ● Hosting capacity: 1.3M physical servers ● 2,500 employees in 19 countries ● 20 years of innovation

Slide 5

OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers Dedicated Server Domain names VoIP Public Cloud Compute Data Storage Email SMS/Fax Private Cloud Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 6

Once upon a time… Because I love telling tales

Slide 7

This talk is about a tale… A true one nevertheless

Slide 8

And as in most tales It begins with a mission

Slide 9

And a band of heroes Engulfed into the adventure

Slide 10

They fight against mishaps And all kinds of foes

Slide 11

They build mighty fortresses Pushing the limits of the possible

Slide 12

And defend them day after day Against all odds

Slide 13

But we don’t know the end yet Because this tale isn’t finished

Slide 14

It begins with a mission Build a metrics platform for OVH

Slide 15

A long time ago…

Slide 16

A long time ago… Monitoring: Does the system work?

Slide 17

Moving from monolith to μservices App

Slide 18

Moving from monolith to μservices App App App

Slide 19

Moving from monolith to μservices App App App DB App Slaves

Slide 20

Moving from monolith to μservices App App App Bus DB App Slaves

Slide 21

Moving from monolith to μservices RPXY LB Cache App App App Bus DB App Slaves

Slide 22

What could go wrong? RPXY LB Cache App App App Bus DB App Slaves

Slide 23

Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill

Slide 24

We need to have insights Observability : Understand how it works

Slide 25

OVH decided to go metrics-oriented

Slide 26

A metrics platform for OVH For all OVH

Slide 27

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 28

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 29

OVH monitoring story We had lots of partial solutions…

Slide 30

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 31

OVH monitoring story Including a really big

Slide 32

OpenTSDB drawbacks OpenTSDB RowKey Design !

Slide 33

OpenTSDB Rowkey design flaws ● .regex. => full table scans ● High cardinality issues (Query latencies) We needed something able to manage hundreds of millions of time series. OpenTSDB didn’t scale for us
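
To make the regex point concrete, here is a minimal Python sketch of why a regex on a tag value cannot be answered with a narrow seek. It assumes the default OpenTSDB row key layout (3-byte metric UID, 4-byte hourly base timestamp, then 3-byte tag key and tag value UID pairs). The in-memory UID tables, the sorted list standing in for HBase, and every function name are hypothetical, made up for this illustration only.

```python
import bisect
import re
import struct

# Hypothetical in-memory UID tables (OpenTSDB keeps these in a separate table).
UID_BY_NAME = {}
NAME_BY_UID = {}


def uid(name):
    """Return a 3-byte UID for a metric, tag key or tag value name."""
    if name not in UID_BY_NAME:
        u = (len(UID_BY_NAME) + 1).to_bytes(3, "big")
        UID_BY_NAME[name] = u
        NAME_BY_UID[u] = name
    return UID_BY_NAME[name]


def row_key(metric, base_ts, tags):
    """metric UID + 4-byte base timestamp + sorted (tagk UID, tagv UID) pairs."""
    pairs = sorted((uid(k), uid(v)) for k, v in tags.items())
    return uid(metric) + struct.pack(">I", base_ts) + b"".join(k + v for k, v in pairs)


def decode_tags(key):
    """Turn the tag part of a row key back into a {name: value} dict."""
    body = key[7:]  # skip metric UID (3 bytes) and timestamp (4 bytes)
    return {
        NAME_BY_UID[body[i:i + 3]]: NAME_BY_UID[body[i + 3:i + 6]]
        for i in range(0, len(body), 6)
    }


def regex_scan(sorted_keys, metric, tag, pattern):
    """Count rows read vs rows kept when filtering a tag value by regex.

    Tag values sit after the timestamp as opaque UIDs, so the only usable
    seek is the metric prefix; every row under it is read and decoded.
    """
    prefix = uid(metric)
    rx = re.compile(pattern)
    start = bisect.bisect_left(sorted_keys, prefix)
    examined = matched = 0
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        examined += 1
        if rx.fullmatch(decode_tags(key).get(tag, "")):
            matched += 1
    return examined, matched


# Toy data set: 2 metrics x 100 hosts x 3 hourly rows.
keys = sorted(
    row_key(metric, 1559347200 + 3600 * hour, {"host": f"web-{n:02d}"})
    for metric in ("cpu", "mem")
    for n in range(100)
    for hour in range(3)
)

examined, matched = regex_scan(keys, "cpu", "host", r"web-0[0-4]")
print(f"rows examined: {examined}, rows matched: {matched}")
```

In this toy data set only a small fraction of the rows read actually match, and the gap widens as the number of distinct tag values grows, which is one way to picture the high cardinality issue listed on the slide.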

Slide 34

OpenTSDB other flaws ● Compaction (or append writes) ● /api/query : 1 endpoint per function? ● Asynchronous ● Unauthenticated ● …

Slide 35

Scaling OpenTSDB

Slide 36

Metrics needs First need: To be massively scalable

Slide 37

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 38

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer
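
A rough Python sketch of the difference, under stated assumptions: `ToyStore` is a hypothetical in-memory stand-in for a time series database, not the OVH Metrics or Warp 10 API, and "points sent" is a crude proxy for network and client cost. It compares fetching a full day of per-second points and averaging on the client with asking the store for hourly averages.

```python
from statistics import mean


class ToyStore:
    """Hypothetical in-memory store that counts the points it sends out."""

    def __init__(self, series):
        self.series = series  # name -> list of (timestamp, value)
        self.points_sent = 0

    def fetch(self, name):
        """Client-side analysis: ship every raw point to the caller."""
        points = self.series[name]
        self.points_sent += len(points)
        return list(points)

    def query_mean(self, name, bucket):
        """In-database analysis: aggregate here, ship one value per bucket."""
        points = self.series[name]
        result = [
            mean(v for _, v in points[i:i + bucket])
            for i in range(0, len(points), bucket)
        ]
        self.points_sent += len(result)
        return result


# One day of data at one point per second.
day = [(t, t % 60) for t in range(86_400)]

client_side = ToyStore({"cpu": day})
raw = client_side.fetch("cpu")
hourly_on_client = [
    mean(v for _, v in raw[i:i + 3600]) for i in range(0, len(raw), 3600)
]

server_side = ToyStore({"cpu": day})
hourly_in_db = server_side.query_mean("cpu", bucket=3600)

print(client_side.points_sent, server_side.points_sent)
```

Both paths produce the same hourly averages; the only difference in this sketch is how many points had to leave the store to get them.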

Slide 39

Metrics needs Second need: To have rich query capabilities

Slide 40

Enter Warp 10… Open-source Time series Database

Slide 41

More than a Time Series DB Warp 10 is a software platform that ● Ingests and stores time series ● Manipulates and analyzes time series

Slide 42

Manipulating Time Series with Warp 10 A true Time Series analysis toolbox ○ Hundreds of functions ○ Manipulation frameworks ○ Analysis workflow

Slide 43

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript
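
WarpScript is a stack-based language written in postfix notation: values are pushed onto a stack and functions consume what is on top of it. The Python sketch below is not WarpScript and does not use its function library; it is a tiny hypothetical evaluator, with word names chosen for this illustration, that only shows how a postfix program is read.

```python
def run(program):
    """Evaluate a whitespace-separated postfix program and return the stack."""
    # Hypothetical two-argument words for this sketch only.
    words = {
        "+": lambda a, b: a + b,
        "*": lambda a, b: a * b,
        "max": max,
    }
    stack = []
    for token in program.split():
        if token in words:
            b = stack.pop()  # the top of the stack is the second argument
            a = stack.pop()
            stack.append(words[token](a, b))
        else:
            stack.append(float(token))  # anything else is a number to push
    return stack


print(run("2 3 + 4 *"))
```

Reading left to right, `2 3 +` leaves 5 on the stack and `4 *` turns it into 20, so a whole analysis can be expressed as one pipeline of pushes and function calls.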

Slide 44

Did you say scalability? From the smallest to the largest…

Slide 45

More Warp 10 goodness ● Secured & multi tenant ● Synchronous (transactions) ● In memory Index ● Better Performance ● No cardinality issues ● Better Scalability ● Lockfree ingestion ● Versatile (standalone, distributed) ● WarpScript Query Language ● Support more data types

Slide 46

OVH Observability Metrics Platform

Slide 47

Building an ecosystem From Warp 10 to OVH Metrics

Slide 48