Monitoring OVH: 300k servers, 27 DCs and one Metrics platform

A presentation at DevFest Toulouse in October 2019 in Toulouse, France by Horacio Gonzalez

Slide 1

Monitoring OVH: 350k servers, 30 DCs… and one Metrics platform Horacio Gonzalez @LostInBrittany

Slide 2

Who are we? Introducing myself and introducing OVH OVHcloud

Slide 3

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek Flutter

Slide 4

OVH: A Global Leader on Cloud
● 250k Private cloud VMs running
● 1 Dedicated IaaS Europe
● 30 Datacenters
● Own 20Tbps Network with 35 PoPs
● Hosting capacity: 1.3M Physical Servers
● 360k Servers already deployed
● 1.3M Customers in 138 Countries

Slide 5

OVH: Our solutions Cloud Web Hosting Mobile Hosting Telecom VPS Containers ▪ Dedicated Server Domain names VoIP Public Cloud Compute ▪ Data Storage Email SMS/Fax Private Cloud ▪ Network and Database CDN Virtual desktop Dedicated server Security Object Storage Web hosting Cloud HubiC OverTheBox ▪ Licences Cloud Desktop Securities MS Office Hybrid Cloud Messaging MS solutions

Slide 6

And don’t forget, next week… OVHcloud Summit https://summit.ovhcloud.com/

Slide 7

Once upon a time… Because I love telling tales

Slide 8

This talk is about a tale… A true one nevertheless

Slide 9

And as in most tales It begins with a mission

Slide 10

And a band of heroes Engulfed into the adventure

Slide 11

They fight against mishaps And all kinds of foes

Slide 12

They build mighty fortresses Pushing the limits of the possible

Slide 13

And defend them day after day Against all odds

Slide 14

But we don’t know the end yet Because this tale isn’t finished

Slide 15

It begins with a mission Build a metrics platform for OVH

Slide 16

A long time ago…

Slide 17

A long time ago… Monitoring: Does the system work?

Slide 18

Moving from monolith to μservices (diagram: one App)

Slide 19

Moving from monolith to μservices (diagram: three Apps)

Slide 20

Moving from monolith to μservices (diagram: four Apps, DB, Slaves)

Slide 21

Moving from monolith to μservices (diagram: four Apps, Bus, DB, Slaves)

Slide 22

Moving from monolith to μservices (diagram: RPXY, LB, Cache, four Apps, Bus, DB, Slaves)

Slide 23

What could go wrong? (diagram: RPXY, LB, Cache, four Apps, Bus, DB, Slaves)

Slide 24

Microservices are a distributed system GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill

Slide 25

We need to have insights Observability: How does the system work?

Slide 26

OVH decided to go metrics-oriented

Slide 27

A metrics platform for OVH For all OVH

Slide 28

Building OVH Metrics One Platform to unify them all, One Platform to find them, One Platform to bring them all and in the Metrics monitor them

Slide 29

What is OVH Metrics? Managed Cloud Platform for Time Series

Slide 30

OVH monitoring story We had lots of partial solutions…

Slide 31

OVH monitoring story One Platform to unify them all What should we build it on?

Slide 32

OVH monitoring story Including a really big

Slide 33

OpenTSDB drawbacks OpenTSDB RowKey Design!

Slide 34

OpenTSDB Rowkey design flaws
● .regex. => full table scans
● High cardinality issues (Query latencies)
We needed something able to manage hundreds of millions of time series
OpenTSDB didn’t scale for us
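
Why a regex filter ends in a full scan can be sketched with a toy model. This is only an illustration in Python, not OpenTSDB code: it assumes row keys laid out as metric, then base hour, then tags, kept in sorted order, and the names make_key, range_scan and regex_scan are hypothetical. A key prefix maps to one contiguous range, while a regex gives no range at all.

```python
import bisect
import re

def make_key(metric, base_hour, tags):
    """Toy row key (assumed layout): metric, then base hour, then sorted tag pairs."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}|{base_hour:06d}|{tag_part}"

def range_scan(table, prefix):
    """Keys sharing a prefix are contiguous in a sorted table: two binary searches."""
    lo = bisect.bisect_left(table, prefix)
    hi = bisect.bisect_left(table, prefix + "\uffff")
    return table[lo:hi]

def regex_scan(table, pattern):
    """A regex gives no key range, so every row in the table is examined."""
    rx = re.compile(pattern)
    hits = [key for key in table if rx.search(key)]
    return hits, len(table)

# 2 metrics x 3 hours x 100 hosts = 600 rows, sorted like an ordered key-value store.
table = sorted(
    make_key(metric, hour, {"host": f"srv{i:05d}", "dc": f"dc{i % 3}"})
    for metric in ("cpu.load", "mem.used")
    for hour in range(3)
    for i in range(100)
)

by_prefix = range_scan(table, "cpu.load|000001|")
hits, examined = regex_scan(table, r"host=srv00042$")
print(len(by_prefix), len(hits), examined)
```

In this toy table the prefix lookup touches only the 100 rows it returns, while the regex examines all 600 rows to return 6. The gap widens as the number of series grows, which is the cardinality problem named on the slide.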

Slide 35

OpenTSDB other flaws
● Compaction (or append writes)
● /api/query: 1 endpoint per function?
● Asynchronous
● Unauthenticated
● …

Slide 36

Scaling OpenTSDB

Slide 37

Metrics needs First need: To be massively scalable

Slide 38

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 39

Analysing metrics data To be scalable, analysis must be done in the database, not on the user’s computer
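
The point can be illustrated with a small Python sketch. It is not the platform's code: bucketize_mean is a hypothetical helper standing in for any aggregation run next to the data, and the series is synthetic. What matters is how much smaller the result is than the raw points the client would otherwise have to download.

```python
from collections import defaultdict
from statistics import mean

def bucketize_mean(points, bucket_span):
    """Average (timestamp, value) points per fixed-width time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp on the start of its bucket.
        buckets[ts - ts % bucket_span].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Synthetic series: one point per second for one hour.
raw = [(t, float(t % 60)) for t in range(3600)]

# If this runs database-side, only the reduced series crosses the network.
reduced = bucketize_mean(raw, 300)
print(len(raw), len(reduced))
```

Here 3600 raw points become 12 five-minute averages. With hundreds of millions of series, shipping raw data to every client for this kind of reduction would not scale.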

Slide 40

Metrics needs Second need: To have rich query capabilities

Slide 41

Enter Warp 10… Open-source Time series Database

Slide 42

More than a Time Series DB Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
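
To make the ingestion side concrete, here is a hedged Python sketch that formats one data point as a text line. It assumes the GTS input format described in the Warp 10 documentation, TS/LAT:LON/ELEV NAME{LABELS} VALUE, with location and elevation left empty and a timestamp in microseconds. The helper gts_line is hypothetical and only handles numeric values.

```python
from urllib.parse import quote

def gts_line(ts_us, name, labels, value):
    """Format one numeric point as 'TS// NAME{LABELS} VALUE' (assumed format)."""
    # Percent-encode names and label values so separators cannot break the line.
    label_part = ",".join(
        f"{quote(key, safe='')}={quote(str(val), safe='')}"
        for key, val in sorted(labels.items())
    )
    return f"{ts_us}// {quote(name, safe='')}{{{label_part}}} {value}"

line = gts_line(1570000000000000, "cpu.load", {"host": "srv1", "dc": "gra"}, 0.75)
print(line)
```

The metric name, label names and values above are made up for the example. Check the Warp 10 documentation for the exact rules on string and boolean values before relying on this.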

Slide 43

Manipulating Time Series with Warp 10 A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 44

Manipulating Time Series with Warp 10 A Time Series manipulation language WarpScript
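
WarpScript is a stack-based language: values are pushed on a stack and functions consume what is on top. The Python toy below only illustrates that postfix style. It is not WarpScript, and the operations scale and mean are hypothetical names invented for the example.

```python
def run(program):
    """Evaluate a tiny postfix program over a stack of values."""
    stack = []
    for token in program:
        if token == "scale":
            # Pops a factor and a series, pushes the scaled series.
            factor = stack.pop()
            series = stack.pop()
            stack.append([(ts, value * factor) for ts, value in series])
        elif token == "mean":
            # Pops a series, pushes the mean of its values.
            series = stack.pop()
            stack.append(sum(value for _, value in series) / len(series))
        else:
            # Anything else is data: push it.
            stack.append(token)
    return stack

series = [(0, 1.0), (1, 2.0), (2, 3.0)]
result = run([series, 10.0, "scale", "mean"])
print(result)
```

The program reads left to right: push the series, push the factor, scale, then average. Chaining steps this way is what lets a whole analysis run server-side in a single request.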

Slide 45

Did you say scalability? From the smallest to the largest…

Slide 46

More Warp 10 goodness
● Secured & multi-tenant
● Synchronous (transactions)
● In-memory index
● Better performance
● No cardinality issues
● Better scalability
● Lock-free ingestion
● Versatile (standalone, distributed)
● WarpScript Query Language
● Support more data types

Slide 47

OVH Observability Metrics Platform

Slide 48

Building an ecosystem From Warp 10 to OVH Metrics

Slide 49