Slide 1
Application Metrics
with Prometheus
Rafael Dohms
@rdohms
Slide 2
Slide 3
“The Prometheus Scientist Method”
Slide 4
Slide 5
jobs.usabilla.com
Rafael Dohms
Staff Engineer
rdohms
doh.ms
Slide 6
Feedback
Slide 7
Kafka / DDD / Autonomous Microservices / Monitoring
Slide 8
Slide 9
Slide 10
Metrics are insights into the current state of your application.
Slide 11
Metrics tell you if your service is healthy.
Slide 12
Canary Deploys
Oksana Latysheva
Slide 13
Metrics tell you what is wrong.
Slide 14
Metrics tell you what is right.
Slide 15
Metrics tell you what will soon be wrong.
Slide 16
Metrics tell you where to start looking.
Slide 17
Site Reliability Engineering
Slide 18
Slide 19
SLIs
Service Level Indicators
“A quantitative measure of some aspect of your application”
The response time of a request was 150ms
Source: Site Reliability Engineering - O’Reilly
Slide 20
SLOs
Service Level Objectives
“A target value or a range of values for something measured by an SLI”
Request response times should be below 200ms
Source: Site Reliability Engineering - O’Reilly
Slide 21
SLOs help you drive architectural decisions, like optimisation
Response time SLO: 150 ms
95th percentile of processing time (PHP time): 5ms
As a result we decided to invest more time in exploring the problem domain and not optimising our stack.
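To make the comparison between a percentile SLI and an SLO concrete, here is a minimal sketch. The `percentile` helper, the nearest-rank method and the sample values are illustrative assumptions, not something shown in the talk; only the 150 ms target comes from the slide.

```php
<?php
declare(strict_types=1);

/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 * (Hypothetical helper, not from the talk.)
 */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('No samples given.');
    }
    sort($samples);
    $rank = (int) ceil($p / 100 * count($samples));

    return (float) $samples[max($rank, 1) - 1];
}

// Made-up response times in milliseconds.
$responseTimesMs = [4, 5, 3, 6, 5, 4, 120, 5, 4, 5];
$sloMs = 150; // response time SLO from the slide

$p95 = percentile($responseTimesMs, 95);
echo $p95 <= $sloMs ? "SLO met\n" : "SLO missed\n";
```

With these made-up samples a single slow request dominates the 95th percentile, which is why percentiles are usually a more useful SLI than an average.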
Slide 22
SLAs
Service Level Agreements
“An explicit or implicit contract with your customer, that includes consequences of missing their SLOs”
The 99th percentile of request response times should meet our SLO, or we will refund users
Source: Site Reliability Engineering - O’Reilly
Slide 23
Slide 24
–Etsy Engineering
“If it moves, we track it.”
https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
Slide 25
Metrics
Telemetry: What is happening right now?
Statistics: How often does this happen?
Slide 26
Telemetry
“the process of recording and transmitting the readings of an instrument”
Slide 27
Statistics / Analytics
“the practice of collecting and analysing numerical data in large quantities”
Slide 28
Slide 29
I really miss Ayrton Senna
Slide 30
Statistics: incoming feedback items with origin information
Telemetry: response time of public endpoints
Slide 31
“If it moves, we track it.”
Slide 32
Request Latency
System Throughput
Error Rate
Availability
Resource Usage
Slide 33
Incoming Data, Peak frequency
CPU, Memory, Disk Space, Bandwidth
node, PHP, Nginx, Database
Slide 34
Measure Monitoring
Measure measurements
Slide 35
Slide 36
Slide 37
SLIs
Slide 38
Slide 39
SLIs may change according to who is looking at the data.
Slide 40
Understanding the nature of your system
Slide 41
User-Facing serving system?
availability, throughput, latency
Slide 42
Storage System?
availability, durability, latency
Slide 43
Big Data Systems?
throughput, end-to-end latency
Slide 44
User-Facing and Big Data Systems
Slide 45
User-Facing and Big Data Systems
๏ SLIs
Response time in the “receive” endpoint
Turn around time, from “receive” to “show”
Individual processing time per step
Data counting: how many, what nature
Slide 46
SLIs: more relevant to the development team
Slide 47
๏ Other Metrics
node, nginx, php-fpm, java metrics
server metrics: cpu, memory, disk space
Size of cluster
Kafka health
Slide 48
Other metrics: more relevant to the infrastructure team
Slide 49
Slide 50
Target value: SLI value = target
Target range: lower bound <= SLI value <= upper bound
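A target range check is simple enough to sketch in a few lines. The function name `meetsTargetRange` and the example bounds are assumptions for illustration; the 200 ms upper bound echoes the earlier response time example.

```php
<?php
declare(strict_types=1);

/**
 * Returns true when the SLI value lies within the inclusive target range.
 * Passing null for a bound leaves that side open.
 * (Hypothetical helper for illustration.)
 */
function meetsTargetRange(float $sli, ?float $lower, ?float $upper): bool
{
    if ($lower !== null && $sli < $lower) {
        return false;
    }
    if ($upper !== null && $sli > $upper) {
        return false;
    }

    return true;
}

// Upper bound only: response time should be at most 200 ms.
var_dump(meetsTargetRange(150.0, null, 200.0)); // bool(true)

// Two-sided range with made-up bounds.
var_dump(meetsTargetRange(250.0, 100.0, 200.0)); // bool(false)
```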
Slide 51
Don’t pick a target based on current performance
What is the business need?
What are users trying to achieve?
How much impact does it have on the user experience?
Slide 52
How long can it take between the user clicking submit and a confirmation that our servers received the data?
Slide 53
“Immediate”
“We sell as real time”
“500ms, too much HTML”
“I don’t know”
Slide 54
What is human perception of immediate? 100ms
Collection API should respond within 150ms
Slide 55
Some, but not too many.
Can you settle an argument or priority based on it?
Slide 56
Don’t overachieve.
The Chubby example.
Slide 57
Adapt.
Evolve.
Redefine SLOs as your product evolves.
Slide 58
Slide 59
Attach consequences to your Objectives.
Slide 60
The night is dark and full of loopholes.
Take a friend from legal with you.
Slide 61
Safety Margins.
Like setting the alarm 5 minutes before the meeting.
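One way to picture a safety margin is an internal limit that is tighter than the externally agreed one, so that a warning fires before the agreement is actually broken. The sketch below assumes a hypothetical `classify` helper and an arbitrary 20% margin; neither comes from the talk.

```php
<?php
declare(strict_types=1);

/**
 * Classifies a measured value against an external limit and a tighter
 * internal limit derived from a safety margin.
 * (Hypothetical helper; the 20% margin below is an arbitrary example.)
 */
function classify(float $valueMs, float $slaLimitMs, float $marginRatio): string
{
    $internalLimitMs = $slaLimitMs * (1 - $marginRatio);

    if ($valueMs > $slaLimitMs) {
        return 'breached';
    }
    if ($valueMs > $internalLimitMs) {
        return 'threatened';
    }

    return 'ok';
}

echo classify(120.0, 200.0, 0.2), "\n"; // ok
echo classify(170.0, 200.0, 0.2), "\n"; // threatened
echo classify(210.0, 200.0, 0.2), "\n"; // breached
```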
Slide 62
Slide 63
Slide 64
Push Model
scale this!
Slide 65
Pull Model
scale this!
Slide 66
Prometheus
Telemetry
Statistics
Prometheus
StatsD, InfluxDB, etc…
+
Long Term Storage
Slide 67
Counter: cumulative metric that represents a single number that only increases
Gauge: a counter that can go up or down
Histogram: samples and count of observations over time
Summary: same as a histogram but with a stream of quantiles over a sliding window
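The difference between a counter and a gauge can be sketched with two toy classes. `ToyCounter` and `ToyGauge` are made up for illustration and are not the classes of any Prometheus client library; they only show the "only increases" versus "up or down" behaviour.

```php
<?php
declare(strict_types=1);

/** Toy counter: the value can only grow. Not a client library class. */
final class ToyCounter
{
    private float $value = 0.0;

    public function incBy(float $amount): void
    {
        if ($amount < 0) {
            throw new InvalidArgumentException('A counter cannot decrease.');
        }
        $this->value += $amount;
    }

    public function get(): float
    {
        return $this->value;
    }
}

/** Toy gauge: the value can be set to anything, up or down. */
final class ToyGauge
{
    private float $value = 0.0;

    public function set(float $value): void
    {
        $this->value = $value;
    }

    public function get(): float
    {
        return $this->value;
    }
}

$requests = new ToyCounter();
$requests->incBy(1);
$requests->incBy(5);

$queueSize = new ToyGauge();
$queueSize->set(12);
$queueSize->set(3);

echo $requests->get(), ' ', $queueSize->get(), "\n"; // 6 3
```

A request count fits a counter because it never goes down; a queue size fits a gauge because it rises and falls.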
Slide 68
jimdo/prometheus_client_php
Slide 69
your code: writes to local storage
/metrics: reads from local storage
Prometheus: reads from /metrics
Slide 70
<?php
use Prometheus\Counter;
use Prometheus\Histogram;
use Prometheus\Storage\APC;

require_once 'vendor/autoload.php';

// Local storage that keeps metric values between requests.
$adapter = new APC();

// Histogram with labels "status" and "url" and four bucket boundaries.
$histogram = new Histogram(
    $adapter,
    'my_app',
    'response_time_ms',
    'This measures ....',
    ['status', 'url'],
    [0, 10, 50, 100]
);
// Record one measurement of 15 for status=200, url=/url.
$histogram->observe(15, ['200', '/url']);

// Counter with the same label names.
$counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']);
$counter->inc(['200', '/url']);      // add 1
$counter->incBy(5, ['200', '/url']); // add 5
Slide 71
Slide 72
$adapter = new APC();
Storage adapters: APC / APCu, Redis
Slide 73
$histogram = new Histogram(
    $adapter,
    'my_app',              // namespace
    'response_time_ms',    // metric name
    'This measures ....',  // help
    ['status', 'url'],     // label names
    [0, 10, 50, 100]       // buckets
);
Slide 74
$histogram->observe(
    15,               // measurement
    ['200', '/url']   // label values
);
Slide 75
$counter = new Counter(
    $adapter,
    'my_app',          // namespace
    'count_total',     // metric name
    'How many...',     // help
    ['status', 'url']  // labels
);
Slide 76
$counter->inc(['200', '/url']);
$counter->incBy(5, ['200', '/url']);
Slide 77
Slide 78
<?php
use Prometheus\RenderTextFormat;
use Prometheus\Storage\APC;

require_once 'vendor/autoload.php';

$adapter = new APC();
$renderer = new RenderTextFormat();
$result = $renderer->render($adapter->collect());

echo $result;
Slide 79
Slide 80
# HELP my_app_count_total How many...
# TYPE my_app_count_total counter
my_app_count_total{status="200",url="/url"} 6
# HELP my_app_response_time_ms This measures ....
# TYPE my_app_response_time_ms histogram
my_app_response_time_ms_bucket{status="200",url="/url",le="0"} 0
my_app_response_time_ms_bucket{status="200",url="/url",le="10"} 0
my_app_response_time_ms_bucket{status="200",url="/url",le="50"} 1
my_app_response_time_ms_bucket{status="200",url="/url",le="100"} 1
my_app_response_time_ms_bucket{status="200",url="/url",le="+Inf"} 1
my_app_response_time_ms_count{status="200",url="/url"} 1
my_app_response_time_ms_sum{status="200",url="/url"} 16
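The `_bucket` lines are cumulative: each `le` bucket counts every observation less than or equal to its bound, which is why a single observation of 15 shows up in `le="50"`, `le="100"` and `le="+Inf"`. The toy function below reproduces that counting; `cumulativeBuckets` is a made-up name for illustration and is not how the client library is implemented.

```php
<?php
declare(strict_types=1);

/**
 * Counts observations into cumulative "less than or equal" buckets.
 * (Toy function for illustration, not the client library's code.)
 *
 * @param array<int|float> $observations
 * @param array<int|float> $upperBounds  sorted ascending
 * @return array<int|string,int>         bucket bound => cumulative count
 */
function cumulativeBuckets(array $observations, array $upperBounds): array
{
    $buckets = [];
    foreach ($upperBounds as $bound) {
        $buckets[(string) $bound] = 0;
    }
    $buckets['+Inf'] = 0;

    foreach ($observations as $value) {
        foreach ($upperBounds as $bound) {
            // An observation lands in every bucket whose bound it does not exceed.
            if ($value <= $bound) {
                $buckets[(string) $bound]++;
            }
        }
        $buckets['+Inf']++;
    }

    return $buckets;
}

print_r(cumulativeBuckets([15], [0, 10, 50, 100]));
```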
Slide 81
“Demos always fail.”
–Rafael (yesterday)
“I’ll just try this live demo again.”
–Also Rafael (today)
http://localhost:9090/graph
http://localhost:8180/metrics
http://localhost:8180/index
https://github.com/rdohms/talk-app-metrics
Slide 82
You can’t act on what you can’t see.
Slide 83
Slide 84
Slide 85
Metrics without actionability are just numbers on a screen.
Slide 86
Act as soon as an SLO is threatened.
Slide 87
Thank you.
Drop me some feedback at Usabilla and make this talk better.
@rdohms
http://slides.doh.ms