Alerting with Time Series
Fabian Reinartz, CoreOS
github.com/fabxc



slide-1
SLIDE 1

Alerting with Time Series

github.com/fabxc @fabxc Fabian Reinartz, CoreOS

slide-2
SLIDE 2

Stream of <timestamp, value> pairs associated with an identifier

http_requests_total{job="nginx",instance="1.2.3.4:80",path="/status",status="200"}

  1348 @ 1480502384
  1899 @ 1480502389
  2023 @ 1480502394

http_requests_total{job="nginx",instance="1.2.3.1:80",path="/settings",status="201"}
http_requests_total{job="nginx",instance="1.2.3.5:80",path="/",status="500"}
...

Time Series
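The data model on this slide can be sketched in a few lines of Python. This is only an illustration: the `Series` class and its `series_id` method are made-up names, not part of any Prometheus client library.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One time series: an identifier plus a stream of (timestamp, value) pairs."""
    name: str
    labels: dict
    samples: list = field(default_factory=list)  # [(unix_timestamp, value), ...]

    def series_id(self) -> str:
        # Render the identifier the way the slide writes it: name{label="value",...}
        body = ",".join(f'{k}="{v}"' for k, v in self.labels.items())
        return f"{self.name}{{{body}}}"


s = Series(
    "http_requests_total",
    {"job": "nginx", "instance": "1.2.3.4:80", "path": "/status", "status": "200"},
    [(1480502384, 1348), (1480502389, 1899), (1480502394, 2023)],
)
print(s.series_id())
```

The metric name and the full label set together form the identifier, so changing any label value yields a different series.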

slide-3
SLIDE 3

Stream of <timestamp, value> pairs associated with an identifier

sum by(path,status) (rate(http_requests_total{job="nginx"}[5m]))

{path="/status",status="200"} 32.13 @ 1480502384
{path="/status",status="500"} 19.133 @ 1480502394
{path="/profile",status="200"} 44.52 @ 1480502389

Time Series
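A rough Python sketch of what `sum by(path,status) (rate(...))` computes. It is deliberately simplified: `simple_rate` ignores counter resets and window-boundary extrapolation, and both helper names and the per-instance rates are assumptions made for illustration.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    Simplification: ignores counter resets and window-boundary extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_by(keys, series):
    """Group (labels, value) pairs by the given label names and sum each group."""
    out = defaultdict(float)
    for labels, value in series:
        out[tuple((k, labels[k]) for k in keys)] += value
    return dict(out)


# Samples from the earlier slide: a counter observed three times, 5s apart.
r = simple_rate([(1480502384, 1348), (1480502389, 1899), (1480502394, 2023)])

# Hypothetical per-instance rates, collapsed to one value per (path, status).
rates = [
    ({"instance": "a", "path": "/status", "status": "200"}, 20.0),
    ({"instance": "b", "path": "/status", "status": "200"}, 12.0),
    ({"instance": "a", "path": "/status", "status": "500"}, 3.0),
]
totals = sum_by(["path", "status"], rates)
```

The result is again a set of time series, now identified only by the labels that were kept.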

slide-4
SLIDE 4

Prometheus

[Diagram: Prometheus finds targets through service discovery (Kubernetes, AWS, Consul, custom...) and scrapes them; Grafana, the HTTP API, and the UI read from Prometheus]

slide-5
SLIDE 5

A lot of traffic to monitor

Monitoring traffic should not be proportional to user traffic

slide-6
SLIDE 6

A lot of targets to monitor

A single host can run hundreds of machines/procs/containers/...

slide-7
SLIDE 7

Targets constantly change

Deployments, scaling up, scaling down, rolling-updates

slide-8
SLIDE 8

Need a fleet-wide view

What’s my 99th percentile request latency across all frontends?

slide-9
SLIDE 9

Drill-down for investigation

Which pod/node/... has turned unhealthy? How and why?

slide-10
SLIDE 10

Monitor all levels, with the same system

Query and correlate metrics across the stack

slide-11
SLIDE 11

Translate that to

Meaningful Alerting

slide-12
SLIDE 12

Anomaly Detection

Automated Alert Correlation

Self-Healing

Machine Learning

slide-13
SLIDE 13

Anomaly Detection

If you are actually monitoring at scale, something will always correlate. Huge effort to eliminate a huge number of false positives. Huge chance of introducing false negatives.

slide-14
SLIDE 14

Prometheus Alerts

(current state != desired state) = alerts

slide-15
SLIDE 15

Symptom-based pages

Urgent issues – Does it hurt your user?

[Diagram: user, system, and four dependencies]

slide-16
SLIDE 16

Latency

Four Golden Signals

[Diagram: user, system, and four dependencies]

slide-17
SLIDE 17

Traffic

Four Golden Signals

[Diagram: user, system, and four dependencies]

slide-18
SLIDE 18

Errors

Four Golden Signals

[Diagram: user, system, and four dependencies]

slide-19
SLIDE 19

Cause-based warnings

Helpful context, non-urgent problems

[Diagram: user, system, and four dependencies]

slide-20
SLIDE 20

Saturation / Capacity

Four Golden Signals

[Diagram: user, system, and four dependencies]

slide-21
SLIDE 21

Prometheus Alerts

ALERT <alert name>
  IF <PromQL vector expression>
  FOR <duration>
  LABELS { ... }
  ANNOTATIONS { ... }

Each result entry is one alert:

<elem1> <val1>
<elem2> <val2>
<elem3> <val3>
...

slide-22
SLIDE 22

etcd_has_leader{job="etcd", instance="A"} 0
etcd_has_leader{job="etcd", instance="B"} 0
etcd_has_leader{job="etcd", instance="C"} 1

slide-23
SLIDE 23

Prometheus Alerts

ALERT EtcdNoLeader
  IF etcd_has_leader == 0
  FOR 1m
  LABELS { severity="page" }

{job="etcd",instance="A"} 0.0
{job="etcd",instance="B"} 0.0

{job="etcd",alertname="EtcdNoLeader",severity="page",instance="A"}
{job="etcd",alertname="EtcdNoLeader",severity="page",instance="B"}
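How result elements turn into alerts can be sketched as follows. This is a simplified model: it ignores the `FOR` duration (the pending state) and annotations, and `alerts_from_result` is a hypothetical helper, not Prometheus code.

```python
def alerts_from_result(alertname, rule_labels, result):
    """Turn each element of a rule's result vector into one alert label set."""
    alerts = []
    for labels, _value in result:
        alert = dict(labels)             # labels of the result element
        alert["alertname"] = alertname   # name of the alerting rule
        alert.update(rule_labels)        # labels from the LABELS clause
        alerts.append(alert)
    return alerts


# The three series from the previous slide; the rule keeps those equal to 0.
vector = [
    ({"job": "etcd", "instance": "A"}, 0),
    ({"job": "etcd", "instance": "B"}, 0),
    ({"job": "etcd", "instance": "C"}, 1),
]
matching = [(labels, value) for labels, value in vector if value == 0]
alerts = alerts_from_result("EtcdNoLeader", {"severity": "page"}, matching)
```

Instance C still has a leader, so it drops out of the result and produces no alert.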

slide-24
SLIDE 24

requests_total{instance="web-1", path="/index", method="GET"} 8913435
requests_total{instance="web-1", path="/index", method="POST"} 34845
requests_total{instance="web-3", path="/api/profile", method="GET"} 654118
requests_total{instance="web-2", path="/api/profile", method="GET"} 774540
…

request_errors_total{instance="web-1", path="/index", method="GET"} 84513
request_errors_total{instance="web-1", path="/index", method="POST"} 434
request_errors_total{instance="web-3", path="/api/profile", method="GET"} 6562
request_errors_total{instance="web-2", path="/api/profile", method="GET"} 3571
…

slide-25
SLIDE 25

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{} 534

slide-26
SLIDE 26

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{} 534

WRONG

Absolute threshold alerting rule needs constant tuning as traffic changes

slide-27
SLIDE 27

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{} 534

[Plot: traffic changes over days]

slide-28
SLIDE 28

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{} 534

[Plot: traffic changes over months]

slide-29
SLIDE 29

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m])) > 500

{} 534

[Plot: traffic when you release awesome feature X]

slide-30
SLIDE 30

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
     / sum(rate(requests_total[5m])) > 0.01

{} 1.8354
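A small sketch of why the ratio adapts to traffic where the absolute threshold does not. The two traffic levels are invented for illustration; only the thresholds (500 errors/s and 0.01) come from the slides.

```python
def absolute_rule(errors_per_s, threshold=500.0):
    """The rule from the earlier slides: fire on a fixed number of errors per second."""
    return errors_per_s > threshold


def ratio_rule(errors_per_s, requests_per_s, threshold=0.01):
    """The rule on this slide: fire when more than 1% of requests fail."""
    return errors_per_s / requests_per_s > threshold


# Hypothetical traffic levels, in requests and errors per second.
quiet_night = {"requests": 1_000.0, "errors": 400.0}    # 40% of requests fail
launch_day = {"requests": 100_000.0, "errors": 600.0}   # 0.6% of requests fail

for name, t in [("quiet_night", quiet_night), ("launch_day", launch_day)]:
    print(name, absolute_rule(t["errors"]), ratio_rule(t["errors"], t["requests"]))
```

Under these assumptions the absolute rule misses the quiet-night outage and fires on launch day, while the ratio rule does the opposite.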

slide-31
SLIDE 31

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
     / sum(rate(requests_total[5m])) > 0.01

{} 1.8354

slide-32
SLIDE 32

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
     / sum(rate(requests_total[5m])) > 0.01

{} 1.8354

WRONG

No dimensionality in result: loss of detail, signal cancelation

slide-33
SLIDE 33

ALERT HighErrorRate
  IF sum(rate(request_errors_total[5m]))
     / sum(rate(requests_total[5m])) > 0.01

{} 1.8354

[Plot: high error / low traffic; low error / high traffic; total sum]
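Signal cancelation can be reproduced with two made-up paths: one with a high error ratio and little traffic, one with a low error ratio and a lot of traffic. All numbers below are assumptions chosen for illustration.

```python
# Hypothetical per-path rates, in requests and errors per second.
paths = {
    "/api/profile": {"errors": 30.0, "requests": 100.0},   # high error / low traffic
    "/index": {"errors": 10.0, "requests": 10_000.0},      # low error / high traffic
}

# Ratio over the total sum: the busy, healthy path drowns out the broken one.
total_ratio = (
    sum(p["errors"] for p in paths.values())
    / sum(p["requests"] for p in paths.values())
)

# Ratio per path: the broken path stands out.
per_path = {name: p["errors"] / p["requests"] for name, p in paths.items()}
firing = [name for name, ratio in per_path.items() if ratio > 0.01]

print(round(total_ratio, 4), firing)
```

The total ratio stays below 0.01 even though 30% of requests to one path fail.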

slide-34
SLIDE 34

ALERT HighErrorRate
  IF sum by(instance, path) (rate(request_errors_total[5m]))
     / sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/comments"} 0.02435
{instance="web-1", path="/api/comments"} 0.01055
{instance="web-2", path="/api/profile"} 0.34124

slide-35
SLIDE 35

ALERT HighErrorRate
  IF sum by(instance, path) (rate(request_errors_total[5m]))
     / sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/v1/comments"} 0.022435
...

WRONG

Wrong dimensions: dimensions of fault tolerance (here, instance) should be aggregated away

slide-36
SLIDE 36

ALERT HighErrorRate
  IF sum by(instance, path) (rate(request_errors_total[5m]))
     / sum by(instance, path) (rate(requests_total[5m])) > 0.01

{instance="web-2", path="/api/v1/comments"} 0.02435
...

[Plot: instance 1 vs. instances 2..1000]

slide-37
SLIDE 37

ALERT HighErrorRate
  IF sum without(instance) (rate(request_errors_total[5m]))
     / sum without(instance) (rate(requests_total[5m])) > 0.01

{method="GET", path="/api/v1/comments"} 0.02435
{method="POST", path="/api/v1/comments"} 0.015
{method="POST", path="/api/v1/profile"} 0.34124
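The effect of aggregating away `instance` can be sketched with a hypothetical fleet of 1000 instances where a single one misbehaves. The helper `aggregated_ratio` and all rates are assumptions made for illustration.

```python
from collections import defaultdict


def aggregated_ratio(errors, requests, drop=()):
    """Sum both sides without the `drop` labels, then divide matching groups."""
    def agg(series):
        out = defaultdict(float)
        for labels, value in series:
            key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
            out[key] += value
        return out

    e, r = agg(errors), agg(requests)
    return {key: e[key] / r[key] for key in e if r.get(key, 0) > 0}


PATH = "/api/v1/comments"


def labels(i):
    return {"instance": f"web-{i}", "method": "GET", "path": PATH}


# Hypothetical fleet: 1000 instances at 10 req/s each; only web-1 fails (5 errors/s).
requests = [(labels(i), 10.0) for i in range(1, 1001)]
errors = [(labels(i), 5.0 if i == 1 else 0.0) for i in range(1, 1001)]

per_instance = aggregated_ratio(errors, requests)
paging = [key for key, ratio in per_instance.items() if ratio > 0.01]
fleet = aggregated_ratio(errors, requests, drop=("instance",))
```

Keeping `instance` pages for one broken instance out of 1000; dropping it leaves a single fleet-wide ratio of 0.0005, well below the threshold, while `method` and `path` stay available for drill-down.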

slide-38
SLIDE 38

ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  ...

[Plot: time axis from -1h through now to +4h]
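The idea behind `predict_linear` is a least-squares line through the samples of the last hour, extended four hours ahead. The sketch below is a simplified stand-in, not Prometheus's implementation: it extrapolates from the last sample's timestamp, and the disk samples are made up.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit a least-squares line to (t, v) samples and evaluate it
    `seconds_ahead` seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


GIB = 2**30

# Hypothetical last hour: free space falls by 1 GiB every 10 minutes, from 20 GiB.
samples = [(i * 600, (20 - i) * GIB) for i in range(7)]  # t = 0 .. 3600 s

predicted = predict_linear_sketch(samples, 4 * 3600)
will_fill = predicted < 0
```

With 14 GiB free now and 6 GiB lost per hour, the line crosses zero in under four hours, so the condition holds.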

slide-39
SLIDE 39

ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  ANNOTATIONS {
    summary = "device filling up",
    description = "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
  }

slide-40
SLIDE 40

Alertmanager

Aggregate, deduplicate, and route alerts

slide-41
SLIDE 41

Prometheus

[Diagram: Prometheus, with targets and service discovery (Kubernetes, AWS, Consul, custom...), sends alerts to Alertmanager, which notifies Email, Slack, PagerDuty, OpsGenie, ...]

slide-42
SLIDE 42

Alerting Rule   Alerting Rule   Alerting Rule   Alerting Rule   ...

04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
...
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST

slide-43
SLIDE 43

Alerting Rule   Alerting Rule   Alerting Rule   Alerting Rule   ...

[Diagram: alerting rules feed Alertmanager, which notifies Chat, JIRA, PagerDuty, ...]

You have 15 alerts for Service X in zone eu-west
  3x HighLatency
  10x HighErrorRate
  2x CacheServerSlow
Individual alerts: ...
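Grouping raw alerts into one notification can be sketched as below. This only models the idea; `group_alerts` and `summarize` are hypothetical helpers and do not reflect how Alertmanager is implemented or configured. The alert counts are the ones from the slide.

```python
from collections import Counter, defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(k) for k in group_by)].append(alert)
    return dict(groups)


def summarize(service, zone, alerts):
    """Render one notification for a whole group of alerts."""
    counts = Counter(a["alertname"] for a in alerts)
    lines = [f"You have {len(alerts)} alerts for Service {service} in zone {zone}"]
    lines += [f"{n}x {name}" for name, n in counts.items()]
    return "\n".join(lines)


# The counts from the slide, as individual raw alerts.
raw = [
    {"alertname": name, "service": "X", "zone": "eu-west"}
    for name, n in [("HighLatency", 3), ("HighErrorRate", 10), ("CacheServerSlow", 2)]
    for _ in range(n)
]

groups = group_alerts(raw, ["service", "zone"])
message = summarize("X", "eu-west", groups[("X", "eu-west")])
print(message)
```

Fifteen raw alerts collapse into a single notification for the (service, zone) group.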

slide-44
SLIDE 44

Inhibition

{alertname="LatencyHigh", severity="page", ..., zone="eu-west"}
...
{alertname="LatencyHigh", severity="page", ..., zone="eu-west"}
{alertname="ErrorsHigh", severity="page", ..., zone="eu-west"}
...
{alertname="ServiceDown", severity="page", ..., zone="eu-west"}
{alertname="DatacenterOnFire", severity="page", zone="eu-west"}

if active, mute everything else in same zone
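A minimal sketch of inhibition, assuming one source alert name and a list of labels that must match. The `inhibit` helper and the extra `us-east` alert are made up for illustration; this is not Alertmanager's matching logic.

```python
def inhibit(alerts, source_name, equal):
    """Drop alerts that share the `equal` labels with an active source alert."""
    sources = [a for a in alerts if a["alertname"] == source_name]

    def muted(alert):
        if alert["alertname"] == source_name:
            return False  # the source alert itself is always delivered
        return any(all(alert.get(k) == s.get(k) for k in equal) for s in sources)

    return [a for a in alerts if not muted(a)]


active = [
    {"alertname": "LatencyHigh", "severity": "page", "zone": "eu-west"},
    {"alertname": "ErrorsHigh", "severity": "page", "zone": "eu-west"},
    {"alertname": "ServiceDown", "severity": "page", "zone": "eu-west"},
    {"alertname": "DatacenterOnFire", "severity": "page", "zone": "eu-west"},
    {"alertname": "LatencyHigh", "severity": "page", "zone": "us-east"},  # other zone
]

delivered = inhibit(active, "DatacenterOnFire", equal=["zone"])
```

Only `DatacenterOnFire` and the alert from the unaffected zone remain.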

slide-45
SLIDE 45

Anomaly Detection

slide-46
SLIDE 46

Practical Example 1

job:requests:rate5m =
  sum by(job) (rate(requests_total[5m]))

job:requests:holt_winters_rate1h =
  holt_winters(job:requests:rate5m[1h], 0.6, 0.4)

slide-47
SLIDE 47

Practical Example 1

ALERT AbnormalTraffic
  IF abs(
       job:requests:rate5m
       - job:requests:holt_winters_rate1h offset 7d
     )
     > 0.2 * job:requests:holt_winters_rate1h offset 7d
  FOR 10m
  ...
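The rule compares the current rate with a smoothed value from a week ago. The sketch below uses textbook double exponential smoothing with a smoothing factor and a trend factor; it is an approximation and may differ in detail from what `holt_winters` does internally. The function names and sample values are assumptions.

```python
def double_exponential_smoothing(values, sf, tf):
    """Smooth a series with level factor `sf` and trend factor `tf`;
    return the smoothed value at the last sample."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    level, trend = values[0], values[1] - values[0]
    for x in values[1:]:
        previous = level
        level = sf * x + (1 - sf) * (previous + trend)
        trend = tf * (level - previous) + (1 - tf) * trend
    return level


def is_abnormal(current, baseline, tolerance=0.2):
    """The alert condition: deviation of more than 20% from the baseline."""
    return abs(current - baseline) > tolerance * baseline


# Hypothetical request rates from the same hour one week ago.
last_week = [100.0, 110.0, 120.0, 130.0]
baseline = double_exponential_smoothing(last_week, 0.6, 0.4)

print(baseline, is_abnormal(180.0, baseline), is_abnormal(140.0, baseline))
```

Comparing against the same time a week earlier lets the threshold follow weekly traffic patterns instead of a fixed number.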

slide-48
SLIDE 48

Practical Example 2

instance:latency_seconds:mean5m
  > on (job) group_left()
  (
    avg by (job)(instance:latency_seconds:mean5m)
    + on (job) 2 * stddev by (job)(instance:latency_seconds:mean5m)
  )

slide-49
SLIDE 49

Practical Example 2

(
  instance:latency_seconds:mean5m
    > on (job) group_left()
    (
      avg by (job)(instance:latency_seconds:mean5m)
      + on (job) 2 * stddev by (job)(instance:latency_seconds:mean5m)
    )
)
> on (job) group_left()
  1.2 * avg by (job)(instance:latency_seconds:mean5m)

slide-50
SLIDE 50

Practical Example 2

(
  instance:latency_seconds:mean5m
    > on (job) group_left()
    (
      avg by (job)(instance:latency_seconds:mean5m)
      + on (job) 2 * stddev by (job)(instance:latency_seconds:mean5m)
    )
)
> on (job) group_left()
  1.2 * avg by (job)(instance:latency_seconds:mean5m)
and on (job)
  avg by (job)(instance:latency_seconds_count:rate5m) > 1
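The three conditions of the final expression can be sketched for a single job as follows. The function name, its parameters, and the latency values are assumptions for illustration; population standard deviation is used for `stddev`.

```python
from statistics import mean, pstdev


def latency_outliers(latency, request_rate, k=2.0, min_factor=1.2, min_rate=1.0):
    """Instances of one job whose mean latency is more than `k` standard
    deviations above the job average AND more than `min_factor` times that
    average, provided the job serves more than `min_rate` req/s on average."""
    if mean(request_rate.values()) <= min_rate:
        return []  # too little traffic for the statistics to mean anything
    avg = mean(latency.values())
    sd = pstdev(latency.values())
    return sorted(
        instance
        for instance, value in latency.items()
        if value > avg + k * sd and value > min_factor * avg
    )


# Hypothetical job: nine instances at 100 ms, one at 1 s.
latency = {f"web-{i}": 0.1 for i in range(1, 10)}
latency["web-10"] = 1.0
busy = {name: 5.0 for name in latency}   # 5 req/s per instance
idle = {name: 0.5 for name in latency}   # 0.5 req/s per instance

print(latency_outliers(latency, busy), latency_outliers(latency, idle))
```

The 1.2x factor guards against tiny standard deviations, and the traffic condition guards against noise from near-idle jobs.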

slide-51
SLIDE 51

Self-Healing

slide-52
SLIDE 52

[Diagram: Prometheus scrapes the node and sends alerts to Alertmanager; Alertmanager notifies a webhook (wh); the webhook takes action on the node]
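The webhook end of this loop could look roughly like the sketch below. It only parses a notification body and decides what to do; it assumes the Alertmanager webhook JSON shape with an `alerts` list whose entries carry `status` and `labels`. The alert names and remediation commands are entirely hypothetical.

```python
import json

# Hypothetical mapping from alert name to a remediation command.
REMEDIATIONS = {"NodeDiskFull": "clean-tmp", "ServiceDown": "restart-service"}


def plan_actions(body):
    """Return (command, instance) pairs for firing alerts we know how to fix."""
    payload = json.loads(body)
    actions = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no action
        labels = alert.get("labels", {})
        command = REMEDIATIONS.get(labels.get("alertname"))
        if command:
            actions.append((command, labels.get("instance")))
    return actions


# A made-up notification body with one actionable, one resolved, one unknown alert.
body = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "NodeDiskFull", "instance": "node-1"}},
        {"status": "resolved", "labels": {"alertname": "ServiceDown", "instance": "node-2"}},
        {"status": "firing", "labels": {"alertname": "HighLatency", "instance": "node-3"}},
    ],
})

actions = plan_actions(body)
```

Keeping the decision logic separate from the HTTP handler makes it easy to test before letting it act on real nodes.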

slide-53
SLIDE 53

Conclusion

  • Symptom-based pages + cause-based warnings provide good coverage and insight into service availability
  • Design alerts that are adaptive to change, preserve as many dimensions as possible, aggregate away dimensions of fault tolerance
  • Use linear prediction for capacity planning and saturation detection
  • Advanced alerting expressions allow for well-scoped and practical anomaly detection
  • Raw alerts are not meant for human consumption
  • The Alertmanager aggregates, silences, and routes groups of alerts as meaningful notifications

slide-54
SLIDE 54

coreos.com/fest @coreosfest

May 31 - June 1, 2017 San Francisco

slide-55
SLIDE 55

Join us!

careers: coreos.com/careers (now in Berlin!)