PromCon 2019 Fun and profit with Alertmanager (Simon Pasquier)

SLIDE 1

Prometheus

PromCon 2019 Fun and profit with Alertmanager

Simon Pasquier (@SimonHiker), November 7, 2019

SLIDE 2

Who am I?

Software engineer working at Red Hat
Alertmanager & consul_exporter maintainer

SLIDE 3

Alerting craft

SLIDE 4

SLIDE 5

Guidelines

Think about which labels to propagate.
“Complex” alerts can be harmful.
Spend some time learning the template language.

SLIDE 6

When will I be notified that something’s broken?

SLIDE 7

expr: foo > 0
for: 2m

[Timeline: foo.set(1)]

SLIDE 8

expr: foo > 0
for: 2m

[Timeline with Prometheus and Alertmanager lanes: foo.set(1), scrape. Highlighted delay: scrape interval]

SLIDE 9

expr: foo > 0
for: 2m

[Timeline with Prometheus and Alertmanager lanes: foo.set(1), scrape, evaluation (pending). Highlighted delay: evaluation interval]

SLIDE 10

expr: foo > 0
for: 2m

[Timeline with Prometheus and Alertmanager lanes: foo.set(1), scrape, evaluation (pending), evaluation (pending). Highlighted delay: evaluation interval]

SLIDE 11

expr: foo > 0
for: 2m

[Timeline with Prometheus and Alertmanager lanes: foo.set(1), scrape, evaluation (pending), evaluation (pending), evaluation (firing). Highlighted delay: at least 2m]

SLIDE 12

expr: foo > 0
for: 2m

[Timeline with Prometheus and Alertmanager lanes: foo.set(1), scrape, evaluation (pending), evaluation (pending), evaluation (firing), notification. Highlighted delay: group_wait]

SLIDE 13

expr: foo > 0
for: 2m

[Timeline with Prometheus and Alertmanager lanes: foo.set(1), scrape, evaluation (pending), evaluation (pending), evaluation (firing), notification]

Total delay: scrape interval + evaluation interval + for + group_wait
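The sum on this slide can be worked through with concrete numbers. The sketch below is only an illustration: the function name `worst_case_notification_delay` is hypothetical, and the 30s values for the scrape interval, evaluation interval and group_wait are assumptions, not values from the talk. Only `for: 2m` comes from the slides.

```python
from datetime import timedelta


def worst_case_notification_delay(scrape_interval, evaluation_interval,
                                  for_duration, group_wait):
    """Sum from the slide: scrape interval + evaluation interval + for + group_wait."""
    return scrape_interval + evaluation_interval + for_duration + group_wait


delay = worst_case_notification_delay(
    scrape_interval=timedelta(seconds=30),      # assumed value
    evaluation_interval=timedelta(seconds=30),  # assumed value
    for_duration=timedelta(minutes=2),          # "for: 2m" from the slides
    group_wait=timedelta(seconds=30),           # assumed value
)
print(delay)  # 0:03:30
```

With these assumed settings, roughly three and a half minutes can pass between `foo.set(1)` and the first notification.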

SLIDE 14

Things to know

Use “for” to avoid flapping alerts.
group_interval for subsequent updates (including resolution).
repeat_interval for reminders.
○ `--data.retention` flag (#1806).

SLIDE 15

Routing

SLIDE 16

SLIDE 17

Guidelines

Keep it simple.
First level routes to match services/teams.
Use amtool or the routing tree editor to test and validate.

SLIDE 18

To continue or not?

Fictional scenario:

All notifications should go to Slack.
Alerts with job=app should email the app team.
○ severity=critical should page the app team too.
Alerts with severity=critical should page the ops team.
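One way to reason about the scenario is to simulate the routing walk. The sketch below is a simplified model, not Alertmanager's code: it assumes a depth-first traversal where a matching child stops the search among its siblings unless it sets `continue`, and where a route falls back to its own receiver when no child matches. The helper names (`matches`, `resolve`), the receiver names and the tree layout are all hypothetical; the talk's own solution is not in the transcript.

```python
def matches(route, labels):
    """A route matches when every label in its `match` map equals the alert's label."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def resolve(route, labels):
    """Depth-first walk: try children in order, stop at the first matching child
    unless it sets `continue`, and fall back to the route's own receiver."""
    if not matches(route, labels):
        return []
    found = []
    for child in route.get("routes", []):
        sub = resolve(child, labels)
        found.extend(sub)
        if sub and not child.get("continue", False):
            break
    return found or [route["receiver"]]


# Hypothetical tree for the fictional scenario; receiver names are made up.
TREE = {
    "receiver": "slack",
    "routes": [
        {"receiver": "slack", "continue": True},
        {"match": {"job": "app"}, "receiver": "app-email", "continue": True},
        {"match": {"job": "app", "severity": "critical"},
         "receiver": "app-pager", "continue": True},
        {"match": {"severity": "critical"}, "receiver": "ops-pager"},
    ],
}

for labels in (
    {"job": "db", "severity": "warning"},
    {"job": "app", "severity": "warning"},
    {"job": "app", "severity": "critical"},
):
    print(labels, "->", resolve(TREE, labels))
```

In this model, dropping `continue` from the first child would send every alert to Slack only, which is the trade-off the slide title asks about.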

SLIDE 19

Silences, inhibitions, oh my!

SLIDE 20

Inhibition rule

SLIDE 21

Gotchas

Pick the appropriate silence duration (#1639).
Corner cases with incident management systems.
Inhibiting alerts can’t inhibit themselves (#666).
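The self-inhibition gotcha is easier to see in a small model. The sketch below is an assumption-laden illustration, not Alertmanager's implementation: it treats a rule as a `source_match` map, a `target_match` map and a list of `equal` labels, and it models the safeguard as "an alert matching both sides is not inhibited by alerts that also match both sides". The function name `is_inhibited`, the alert names and the rules are hypothetical, and the exact behaviour may differ between Alertmanager versions.

```python
def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by some firing source alert under `rule`."""
    def matches(alert, matchers):
        return all(alert.get(k) == v for k, v in matchers.items())

    if not matches(target, rule["target_match"]):
        return False
    target_is_source = matches(target, rule["source_match"])
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # Modelled safeguard: alerts matching both sides do not inhibit each other.
        if target_is_source and matches(source, rule["target_match"]):
            continue
        # Missing labels are treated like empty values when comparing `equal` labels.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}
node_down = {"alertname": "NodeDown", "severity": "critical", "cluster": "eu"}
latency_eu = {"alertname": "HighLatency", "severity": "warning", "cluster": "eu"}
latency_us = {"alertname": "HighLatency", "severity": "warning", "cluster": "us"}
firing = [node_down, latency_eu, latency_us]

print(is_inhibited(latency_eu, firing, rule))  # True
print(is_inhibited(latency_us, firing, rule))  # False

# A rule whose source and target matchers overlap: the alert matches both sides.
overlapping = {
    "source_match": {"cluster": "eu"},
    "target_match": {"cluster": "eu"},
    "equal": ["cluster"],
}
print(is_inhibited(node_down, [node_down], overlapping))  # False
```

Choosing source and target matchers that never overlap avoids depending on this safeguard at all.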

SLIDE 22

High availability

SLIDE 23

High availability

Broadcast silences and notification logs.
Based on the hashicorp/memberlist library.
Requires a dedicated TCP/UDP port.
○ UDP for small messages (⩽ 700 bytes)
○ TCP otherwise

SLIDE 24

alertmanager-0

--cluster.peer=""

SLIDE 25

alertmanager-1

--cluster.peer=alertmanager-0:9094

SLIDE 26

alertmanager-2

--cluster.peer=alertmanager-0:9094
--cluster.peer=alertmanager-0:9094

SLIDE 27

[Diagram: cluster peers labelled Position: 0, Position: 2, Position: 1]

SLIDE 28

High availability flags

Server
○ --cluster.listen-address
○ --cluster.advertise-address
Peering
○ --cluster.peer
○ --cluster.peer-timeout (15s)
○ --cluster.settle-timeout (1m)

SLIDE 29

High availability flags (continued)

Data exchange
○ --cluster.gossip-interval (250ms)
○ --cluster.pushpull-interval (1m)
○ --cluster.tcp-timeout (10s)
Probes
○ --cluster.probe-timeout (500ms)
○ --cluster.probe-interval (1s)
Reconnection
○ --cluster.reconnect-interval (10s)
○ --cluster.reconnect-timeout (6h)

SLIDE 30

Hidden stuff

Peer names refreshed every 15 seconds.
Messages gossiped to half of the nodes (min. 3).
Gossip queue size of 4096 messages.
Settle phase stops after 3 “stable” iterations.
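The "half of the nodes (min. 3)" rule can be written out as a one-liner. This is a sketch of the statement on the slide, not the actual implementation: the function name `gossip_fanout` is hypothetical, and rounding down with integer division is an assumption.

```python
def gossip_fanout(cluster_size: int) -> int:
    """Half of the nodes (rounded down, an assumption), but never fewer than 3."""
    return max(3, cluster_size // 2)


for size in (3, 5, 10, 20):
    print(size, "->", gossip_fanout(size))
```

For small clusters the minimum dominates: under this reading, anything up to 7 nodes yields 3.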

SLIDE 31

Future work

Encryption & authentication using mTLS (#1819).
Better support for advertised address (#1909).

SLIDE 32

Conclusion

Test all the things.
Keep it simple.
We ❤ contributions!

SLIDE 33

Thanks!

Simon Pasquier pasquier.simon@gmail.com @SimonHiker

SLIDE 34

Psst, we’re hiring!