Where does CoreOS fit in? Automating Monitoring infrastructure - - PowerPoint PPT Presentation

slide-1
SLIDE 1

Alertmanager

and high availability

Frederic Branczyk

Software Engineer at CoreOS Prometheus/Alertmanager/Kubernetes @brancz

slide-2
SLIDE 2

Where does CoreOS fit in?

  • Automating Monitoring infrastructure
  • Prometheus + Kubernetes
slide-3
SLIDE 3

What will I be talking about?

  • From alert to notification
  • High availability contract
  • High availability implementation
  • Implications on operating HA Alertmanager
slide-4
SLIDE 4

Alertmanager Features

  • Receives and groups alerts
  • Deduplicates alerts
  • Sends notifications to providers
      ○ PagerDuty, email, Slack, etc.
  • Silencing
slide-5
SLIDE 5

Prometheus & Alertmanager

slide-6
SLIDE 6

Alerting Rule, Alerting Rule, Alerting Rule, Alerting Rule, ...

04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
. . .
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST

slide-7
SLIDE 7

Grouped in one notification

  • 3 x HighLatency
  • 10 x HighErrorRate
  • 2 x CacheServerSlow
  • (+individual Alerts)
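
A minimal sketch of the grouping and deduplication idea from the two slides above, not Alertmanager's actual code. The function names `group_alerts` and `summarize` and the choice of `service` and `zone` as grouping labels are assumptions made for illustration; the sample alerts are the first four lines of the stream on slide 6.

```python
from collections import Counter

# First four alerts from the slide 6 stream (the third repeats the second).
ALERTS = [
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "GET"},
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "GET"},
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "GET"},
    {"alertname": "HighErrorRate", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "POST"},
]


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Alerts with an identical label set are stored once per bucket,
    which is the deduplication step.
    """
    groups = {}
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        identity = frozenset(labels.items())  # full label set identifies an alert
        groups.setdefault(key, {})[identity] = labels
    return {key: list(members.values()) for key, members in groups.items()}


def summarize(group):
    """Count alerts per alert name, as in the '3 x HighLatency' summary."""
    return Counter(labels["alertname"] for labels in group)


groups = group_alerts(ALERTS, ["service", "zone"])
for key, members in groups.items():
    print(key, dict(summarize(members)))
```

All four sample alerts share the same `service` and `zone`, so they fall into a single group and would produce a single notification.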
slide-8
SLIDE 8

Boiled down:

Alertmanager reliably sends notifications

slide-9
SLIDE 9

High Availability

slide-10
SLIDE 10

Infrastructure Scaling Story

[Diagram: two Prometheus servers and two Alertmanager instances; the Alertmanagers are connected by Gossip. Microservice 1, Microservice 2 and Microservice 3 appear once per Prometheus server.]

...

slide-11
SLIDE 11

Why decoupled?

  • Keep Prometheus alerting simple
  • High availability of Prometheus
  • No state sharing between Prometheus
slide-12
SLIDE 12

Example Alerting Rule

ALERT NoLeader
  IF etcd_has_leader == 0
  FOR 10m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "etcd no leader",
    description = "etcd instance has no leader",
  }

slide-13
SLIDE 13

Alert Evaluation in Prometheus

Rule 1 Rule 2 Rule 3 ...

  • Evaluate Rule/Alert
  • Fire alert against Alertmanager

Repeat every *rule evaluation interval*

slide-14
SLIDE 14

Simple configuration

  • Resolve alerts in 5m
  • Group by job label
  • Group for 10 seconds
  • Send via webhook receiver

global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
slide-15
SLIDE 15

Notification Pipeline

  • Silence: Do not continue
  • Wait: Position in cluster multiplied by 5 seconds
  • Dedup: Has notification already been sent?
  • Send: Send notification via favorite provider
  • Gossip: Tell other peers notification has been sent
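
The five pipeline stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the Alertmanager implementation: `run_pipeline`, `wait_seconds`, and the use of a plain `set` as the notification log are hypothetical, and the wait is returned as a number instead of actually sleeping.

```python
PEER_STEP_SECONDS = 5  # from the slide: position in cluster multiplied by 5 seconds


def wait_seconds(position):
    """Delay before a peer at the given cluster position tries to notify."""
    return position * PEER_STEP_SECONDS


def run_pipeline(group_key, position, silenced, notification_log, send, gossip):
    """Run one alert group through Silence, Wait, Dedup, Send, Gossip.

    Returns (outcome, delay). A real peer would sleep for `delay` seconds
    before the dedup check; this sketch only reports the delay.
    """
    if silenced:                       # Silence: do not continue
        return ("silenced", 0)
    delay = wait_seconds(position)     # Wait
    if group_key in notification_log:  # Dedup: already sent by some peer?
        return ("deduplicated", delay)
    send(group_key)                    # Send via the configured provider
    notification_log.add(group_key)
    gossip(group_key)                  # Gossip: tell other peers it was sent
    return ("sent", delay)


log, sent, gossiped = set(), [], []
print(run_pipeline("job=api", 0, False, log, sent.append, gossiped.append))
print(run_pipeline("job=api", 1, False, log, sent.append, gossiped.append))
```

In the usage lines both calls share one log, which stands in for a peer that has already received the gossip from the first sender, so the second call is deduplicated.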

slide-16
SLIDE 16

What is gossiped?

  • Yes
      ○ Sent notifications
      ○ Silences
  • No
      ○ Received alerts

slide-17
SLIDE 17

How? CRDTs!

  • Conflict-free replicated data type
  • Associativity (a+(b+c)=(a+b)+c)
  • Commutativity (a+b=b+a)
  • Idempotence (a+a=a)
  • Well suited for AP systems
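
The three properties can be checked on the simplest possible merge function, set union. This is a generic illustration of a state-based CRDT merge, not the data type Alertmanager uses; the names `merge`, `a`, `b`, and `c` are placeholders.

```python
from itertools import permutations


def merge(x, y):
    """Set union: a merge that is associative, commutative and idempotent."""
    return x | y


a = frozenset({"notification-1"})
b = frozenset({"notification-2"})
c = frozenset({"notification-1", "notification-3"})

associative = merge(a, merge(b, c)) == merge(merge(a, b), c)
commutative = merge(a, b) == merge(b, a)
idempotent = merge(a, a) == a

# Consequence: peers converge no matter in which order updates arrive.
results = set()
for order in permutations([a, b, c]):
    state = frozenset()
    for update in order:
        state = merge(state, update)
    results.add(state)

print(associative, commutative, idempotent, len(results))
```

Because of these properties, a peer can receive the same gossip message twice, or receive messages out of order, and still end up in the same state as every other peer.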
slide-18
SLIDE 18

Yes, but how? mesh by Weaveworks!

  • Eventually consistent
  • LWW-element-set
  • Mergeable log of records
  • Merges based on UID
      ○ On conflict latest timestamp wins
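
A rough sketch of a last-write-wins merge keyed by UID, assuming each record is a `(timestamp, value)` pair. It does not use the Weaveworks mesh API; `merge`, the integer timestamps, and the record layout are hypothetical. Ties on the timestamp are broken by comparing the values, which is an assumption added here so that the merge stays commutative.

```python
def merge(local, incoming):
    """Merge two {uid: (timestamp, value)} maps; the latest timestamp wins.

    Tuples compare by timestamp first, then by value, so equal timestamps
    are resolved the same way on every peer.
    """
    merged = dict(local)
    for uid, record in incoming.items():
        if uid not in merged or record > merged[uid]:
            merged[uid] = record
    return merged


# Two silences known to a peer, then a gossiped update for UID 1.
state = {
    1: (10, "Query, Start, End"),
    2: (11, "Query, Start, End"),
}
delta = {1: (20, "Query, Start1, End")}
stale = {1: (5, "Query, OldStart, End")}

state = merge(state, delta)  # newer record replaces UID 1
state = merge(state, stale)  # older record is ignored
print(state[1])
```

Merging the same delta again changes nothing, and merging in either direction gives the same result, which matches the CRDT properties listed on the previous slide.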

slide-19
SLIDE 19

Why not etcd?

  • Simple operation
      ○ Fewer moving pieces
      ○ Single binary
  • Want: AP not CP
slide-20
SLIDE 20

Silences

slide-21
SLIDE 21

Create Silences

Create Silence

Alertmanager 0: Silences Database

ID   Values
1    Query, Start, End
2    Query, Start, End

Alertmanager 1: Silences Database

ID   Values
1    Query, Start, End
2    Query, Start, End

Gossip Delta (ID: 2 ...) → Merge Gossip Data

slide-22
SLIDE 22

Update Silences

Update Silence (UID: 1, Start: Start1)

Alertmanager 0: Silences Database

ID   Values
1    Query, Start, End
2    Query, Start, End

Alertmanager 1: Silences Database

ID   Values
1    Query, Start, End
2    Query, Start, End

Gossip Delta (ID: 1, Start: Start1) → Merge Gossip Data

After the merge, row 1 in both databases reads:

1    Query, Start1, End

slide-23
SLIDE 23

Notification Log

slide-24
SLIDE 24

Non silenced alert example

Prometheus → Alertmanager 0, Alertmanager 1

Alertmanager 0:

  • Wait 0s
  • Dedup: Not sent → Send
  • Gossip

Alertmanager 1:

  • Wait 5s
  • Receive Gossip Data
  • Deduplicate → Do not send
slide-25
SLIDE 25

Gossip Partition

Prometheus → Alertmanager 0, Alertmanager 1

Alertmanager 0:

  • Wait 0s
  • Dedup: Not sent → Send
  • Gossip (blocked by the Network Partition)

Alertmanager 1:

  • Wait 5s
  • Dedup: Not sent → Send
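
The two scenarios on slides 24 and 25 can be compared with a small simulation. It is a sketch under simplifying assumptions: peers act strictly in order of their wait time, gossip is either delivered instantly or not at all, and `simulate` and the group key `"alert-group"` are made-up names.

```python
def simulate(peer_count, partitioned):
    """Return the positions of the peers that end up sending a notification.

    Each peer keeps its own notification log. Peers are visited in order of
    their position, which mirrors the staggered wait (0s, 5s, ...).
    """
    logs = [set() for _ in range(peer_count)]
    senders = []
    for position in range(peer_count):
        if "alert-group" in logs[position]:  # Dedup: someone already sent
            continue
        senders.append(position)             # Send
        logs[position].add("alert-group")
        if not partitioned:                  # Gossip reaches the other peers
            for log in logs:
                log.add("alert-group")
    return senders


print("healthy gossip:", simulate(2, partitioned=False))
print("partitioned:   ", simulate(2, partitioned=True))
```

With working gossip only the first peer sends. Under a partition every peer sends, so the failure mode is a duplicate notification and not a missed one, which fits the "AP not CP" goal from slide 19.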

slide-26
SLIDE 26

Notification Log

Alert Firing

Alertmanager 0: Notification Log

UID  Values
1    Resolve, Notify, TS, ...
2    Resolve, Notify, TS, ...

Alertmanager 1: Notification Log

UID  Values
1    Resolve, Notify, TS, ...
2    Resolve, Notify, TS, ...

Gossip Delta (UID: 2 ...) → Merge Gossip Data

slide-27
SLIDE 27

Group Key

  • Group at runtime
      ○ By Group By labels
  • XOR with Route
  • Concat with Receiver

global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
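
One way to read the three steps on this slide is sketched below. The slide only names the steps, so everything else is an assumption: the SHA-256 based `fingerprint`, the `route_id` string, and the `hash:receiver` output format are illustrative choices and do not claim to match Alertmanager's real group key encoding.

```python
import hashlib


def fingerprint(pairs):
    """Order-independent 64-bit hash of (name, value) pairs (assumed scheme)."""
    digest = hashlib.sha256()
    for name, value in sorted(pairs):
        digest.update(name.encode() + b"\xff" + value.encode() + b"\xff")
    return int.from_bytes(digest.digest()[:8], "big")


def group_key(alert_labels, group_by, route_id, receiver):
    """Hash the Group By labels, XOR with the route, concat the receiver."""
    group_labels = {k: v for k, v in alert_labels.items() if k in group_by}
    key = fingerprint(group_labels.items()) ^ fingerprint([("route", route_id)])
    return f"{key:016x}:{receiver}"


# With group_by: ['job'], alerts that differ only in other labels share a key.
a = group_key({"job": "api", "instance": "10.0.0.1"}, ["job"], "root", "webhook")
b = group_key({"job": "api", "instance": "10.0.0.2"}, ["job"], "root", "webhook")
c = group_key({"job": "db", "instance": "10.0.0.3"}, ["job"], "root", "webhook")
print(a == b, a == c)
```

Because every peer can compute the same key from the same alert and configuration, the key can serve as the entry that peers look up in the gossiped notification log.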
slide-28
SLIDE 28

DEMO!

slide-29
SLIDE 29

frederic.branczyk@coreos.com
GitHub: @brancz
Twitter: @fredbrancz

QUESTIONS?

Thanks! We’re hiring: coreos.com/careers

Let’s talk! #prometheus on Freenode More events: coreos.com/community

LONGER CHAT?

also in Berlin!