PromCon 2019: Fun and profit with Alertmanager, Simon Pasquier




SLIDE 1

Prometheus

PromCon 2019 Fun and profit with Alertmanager

Simon Pasquier (@SimonHiker), November 7, 2019

SLIDE 2

Who am I?

  • Software engineer working at Red Hat
  • Alertmanager & consul_exporter maintainer

SLIDE 3

Alerting craft


SLIDE 5

Guidelines

  • Think about which labels to propagate.
  • “Complex” alerts can be harmful.
  • Spend some time to learn the template language.

SLIDE 6

When will I be notified that something’s broken?

SLIDE 7

expr: foo > 0
for: 2m

Timeline: foo.set(1)

SLIDE 8

expr: foo > 0
for: 2m

Timeline: foo.set(1) → scrape
Lanes: Prometheus, Alertmanager
Delay: scrape interval

SLIDE 9

expr: foo > 0
for: 2m

Timeline: foo.set(1) → scrape → evaluation (pending)
Lanes: Prometheus, Alertmanager
Delay: evaluation interval

SLIDE 10

expr: foo > 0
for: 2m

Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending)
Lanes: Prometheus, Alertmanager
Delay: evaluation interval

SLIDE 11

expr: foo > 0
for: 2m

Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending) → evaluation (firing)
Lanes: Prometheus, Alertmanager
Delay: at least 2m
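The pending-to-firing transition on this timeline can be sketched as a small state machine. This is a simplified model, not Prometheus code: the function name, the sample format, and the 60-second evaluation interval used below are assumptions made for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state at each rule evaluation.

    evaluations: list of (timestamp_seconds, value) pairs, one per evaluation.
    The rule modeled here is `expr: foo > 0` with a `for` duration.
    """
    active_since = None  # timestamp of the first evaluation where expr held
    states = []
    for timestamp, value in evaluations:
        if value > 0:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # The condition stopped holding: the timer starts over.
            active_since = None
            states.append("inactive")
    return states


# Evaluations every 60s (assumed interval), for: 2m.
print(alert_states([(0, 1), (60, 1), (120, 1)], for_seconds=120))
# A flapping series never reaches "firing".
print(alert_states([(0, 1), (60, 0), (120, 1)], for_seconds=120))
```

With these inputs the first series goes pending, pending, firing, which matches the three evaluations drawn on the slide; the second series shows why a `for` duration filters out short-lived spikes.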

SLIDE 12

expr: foo > 0
for: 2m

Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending) → evaluation (firing) → notification
Lanes: Prometheus, Alertmanager
Delay: group_wait

SLIDE 13

expr: foo > 0
for: 2m

Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending) → evaluation (firing) → notification
Lanes: Prometheus, Alertmanager
Total delay: scrape interval + evaluation interval + for + group_wait
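The sum on this slide is easy to evaluate for a concrete setup. A minimal sketch: only `for: 2m` comes from the slides, while the 30-second scrape interval, evaluation interval, and group_wait are hypothetical values, and the function name is made up for this example.

```python
from datetime import timedelta


def estimated_notification_delay(scrape_interval, evaluation_interval,
                                 for_duration, group_wait):
    """Add up the four delays named on the slide."""
    return scrape_interval + evaluation_interval + for_duration + group_wait


delay = estimated_notification_delay(
    scrape_interval=timedelta(seconds=30),      # hypothetical
    evaluation_interval=timedelta(seconds=30),  # hypothetical
    for_duration=timedelta(minutes=2),          # from the rule on the slide
    group_wait=timedelta(seconds=30),           # hypothetical
)
print(delay)
```

Under these assumed settings the estimate comes to three and a half minutes between foo.set(1) and the first notification.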

SLIDE 14

Things to know

  • Use “for” to avoid flapping alerts.
  • group_interval for subsequent updates (including resolution).
  • repeat_interval for reminders.
    ○ `--data.retention` flag (#1806).
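The slide pairs repeat_interval with the `--data.retention` flag without giving details. One plausible reading, stated here as an assumption, is that a repeat_interval longer than the retention period can make reminders arrive more often than intended, because the record of the previous notification may expire first. The check below sketches that reading; the function name is hypothetical and the 120h default is assumed rather than taken from the slides.

```python
from datetime import timedelta

# Assumed default value of --data.retention.
DEFAULT_RETENTION = timedelta(hours=120)


def repeat_interval_fits_retention(repeat_interval, retention=DEFAULT_RETENTION):
    """Return True when repeat_interval does not exceed the retention period."""
    return repeat_interval <= retention


print(repeat_interval_fits_retention(timedelta(hours=4)))
print(repeat_interval_fits_retention(timedelta(days=7)))
```

A check like this could run against every route in a configuration before deploying it.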

SLIDE 15

Routing


SLIDE 17

Guidelines

  • Keep it simple.
  • First level routes to match services/teams.
  • Use amtool or routing tree editor to test/validate.

SLIDE 18

To continue or not?

Fictional scenario:

  • All notifications should go to Slack.
  • Alerts with job=app should email the app team.
    ○ severity=critical should page the app team too.
  • Alerts with severity=critical should page the ops team.
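One way to reason about this scenario is to simulate the routing tree. The sketch below is a simplified Python model of route matching with a `continue` option, not Alertmanager's implementation: it supports only equality matchers, and the receiver names (slack, app-email, app-pager, ops-pager) and the tree layout are hypothetical choices that happen to satisfy the scenario.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label -> required value
    continue_: bool = False                     # keep testing later siblings
    routes: list = field(default_factory=list)  # child routes, in order


def receivers_for(route, labels):
    """Return the receivers that get an alert with the given labels."""
    if any(labels.get(name) != value for name, value in route.match.items()):
        return []
    found = []
    for child in route.routes:
        hits = receivers_for(child, labels)
        found.extend(hits)
        # Without `continue`, the first matching child ends the search.
        if hits and not child.continue_:
            break
    # If no child matched, the current route handles the alert itself.
    return found or [route.receiver]


tree = Route("slack", routes=[
    Route("slack", continue_=True),
    Route("app-email", match={"job": "app"}, continue_=True),
    Route("app-pager", match={"job": "app", "severity": "critical"},
          continue_=True),
    Route("ops-pager", match={"severity": "critical"}),
])

print(receivers_for(tree, {"job": "app", "severity": "critical"}))
print(receivers_for(tree, {"job": "db", "severity": "warning"}))
```

Dropping `continue_=True` from the first child would stop every alert at Slack, which is the kind of mistake the title of the slide alludes to.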

SLIDE 19

Silences, inhibitions, oh my!

SLIDE 20

Inhibition rule

SLIDE 21

Gotchas

  • Pick the appropriate silence duration (#1639).
  • Corner cases with incident management systems.
  • Inhibiting alerts can’t inhibit themselves (#666).
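The last gotcha is easier to see with a toy model of an inhibition rule. The sketch below is an assumption-laden simplification, not Alertmanager's code: it uses equality matchers only, the parameter names mirror the source, target, and equal parts of an inhibition rule, and the alert labels are invented.

```python
def matches(labels, matchers):
    """True when every matcher label has the required value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(alert, firing, source_match, target_match, equal):
    """Decide whether `alert` is muted by any alert in `firing`."""
    if not matches(alert, target_match):
        return False
    alert_is_source = matches(alert, source_match)
    for source in firing:
        if not matches(source, source_match):
            continue
        # An alert matching both sides is not muted by sources that also
        # match both sides, so it cannot inhibit itself.
        if alert_is_source and matches(source, target_match):
            continue
        if all(source.get(name) == alert.get(name) for name in equal):
            return True
    return False


node_down = {"alertname": "NodeDown", "node": "n1"}
disk_full = {"alertname": "DiskFull", "node": "n1"}
rule = dict(source_match={"alertname": "NodeDown"}, target_match={},
            equal=["node"])

print(is_inhibited(disk_full, [node_down], **rule))
print(is_inhibited(node_down, [node_down], **rule))
```

Here the empty target matcher matches every alert, so NodeDown sits on both sides of the rule; the model still lets it notify while DiskFull on the same node is muted.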

SLIDE 22

High availability

SLIDE 23

High availability

  • Broadcast silences and notification logs.
  • Based on the hashicorp/memberlist library.
  • Requires a dedicated TCP/UDP port.
    ○ UDP for small messages (⩽ 700 bytes)
    ○ TCP otherwise
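The size rule above can be written as a one-line predicate. A minimal sketch: the constant and function names are hypothetical and are not memberlist or Alertmanager identifiers; only the 700-byte threshold comes from the slide.

```python
UDP_MAX_BYTES = 700  # threshold quoted on the slide


def transport_for(payload: bytes) -> str:
    """Pick the transport for a cluster message based on its size."""
    return "udp" if len(payload) <= UDP_MAX_BYTES else "tcp"


print(transport_for(b"x" * 700))
print(transport_for(b"x" * 701))
```

Because both transports are used, the cluster port has to be reachable over TCP and UDP between all peers.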

SLIDE 24

alertmanager-0

  • --cluster.peer=""

SLIDE 25

alertmanager-1

  • --cluster.peer=alertmanager-0:9094

SLIDE 26

alertmanager-2

  • --cluster.peer=alertmanager-0:9094
  • --cluster.peer=alertmanager-0:9094

SLIDE 27

Diagram labels: Position: 0, Position: 2, Position: 1

SLIDE 28

High availability flags

  • Server
    ○ --cluster.listen-address
    ○ --cluster.advertise-address
  • Peering
    ○ --cluster.peer
    ○ --cluster.peer-timeout (15s)
    ○ --cluster.settle-timeout (1m)
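The slides do not spell out how a peer's position relates to the peer timeout. A common description of the behavior, treated here as an assumption, is that each peer waits its position multiplied by the peer timeout before sending a notification, so that lower-positioned peers get the chance to notify first. The function name is made up for this sketch; the 15s default is the one listed above.

```python
from datetime import timedelta

PEER_TIMEOUT = timedelta(seconds=15)  # default listed on the slide


def notification_wait(position, peer_timeout=PEER_TIMEOUT):
    """Delay before a peer at the given position sends its notification."""
    return position * peer_timeout


for position in (0, 1, 2):
    print(position, notification_wait(position))
```

In a three-peer cluster this gives waits of 0s, 15s, and 30s, which is why a longer peer timeout delays notifications when the first peers are unavailable.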

SLIDE 29

High availability flags (continued)

  • Data exchange
    ○ --cluster.gossip-interval (250ms)
    ○ --cluster.pushpull-interval (1m)
    ○ --cluster.tcp-timeout (10s)
  • Probes
    ○ --cluster.probe-timeout (500ms)
    ○ --cluster.probe-interval (1s)
  • Reconnection
    ○ --cluster.reconnect-interval (10s)
    ○ --cluster.reconnect-timeout (6h)

SLIDE 30

Hidden stuff

  • Peer names refreshed every 15 seconds.
  • Messages gossiped to half of the nodes (min. 3).
  • Gossip queue size of 4096 messages.
  • Settle phase stops after 3 “stable” iterations.
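Two of these rules lend themselves to a short sketch. Both functions are illustrative models with hypothetical names: rounding "half of the nodes" down with integer division and treating an iteration as "stable" when the observed peer count is unchanged since the previous poll are assumptions, not statements about the actual implementation.

```python
MIN_GOSSIP_TARGETS = 3
STABLE_ITERATIONS = 3


def gossip_targets(cluster_size):
    """Half of the nodes (rounded down, assumed), but never fewer than 3."""
    return max(MIN_GOSSIP_TARGETS, cluster_size // 2)


def polls_until_settled(peer_counts):
    """Return how many polls it takes to see 3 stable iterations in a row.

    peer_counts: peer count observed at each poll. Returns None if the
    sequence ends before the cluster looks settled.
    """
    stable = 0
    previous = None
    for polls, count in enumerate(peer_counts, start=1):
        stable = stable + 1 if count == previous else 0
        if stable >= STABLE_ITERATIONS:
            return polls
        previous = count
    return None


print(gossip_targets(4), gossip_targets(10))
print(polls_until_settled([1, 2, 3, 3, 3, 3]))
```

In this model a cluster that is still discovering peers keeps resetting the stability counter, so the settle phase only ends once membership has stopped changing for three polls.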

SLIDE 31

Future work

  • Encryption & authentication using mTLS (#1819).
  • Better support for advertised address (#1909).

SLIDE 32

Conclusion

  • Test all the things.
  • Keep it simple.
  • We ❤ contributions!

SLIDE 33

Thanks!

Simon Pasquier pasquier.simon@gmail.com @SimonHiker

SLIDE 34

Psst, we’re hiring!