
PromCon 2019: Fun and profit with Alertmanager, Simon Pasquier

PromCon 2019: Fun and profit with Alertmanager. Simon Pasquier (@SimonHiker), November 7, 2019. Who am I? Software engineer working at Red Hat; Alertmanager & consul_exporter maintainer.


  1. PromCon 2019 Fun and profit with Alertmanager Simon Pasquier (@SimonHiker), November 7, 2019 Prometheus

  2. Who am I? ● Software engineer working at Red Hat ● Alertmanager & consul_exporter maintainer

  3. Alerting craft

  4. Prometheus

  5. Guidelines ● Think about which labels to propagate. ● “Complex” alerts can be harmful. ● Spend some time to learn the template language.

  6. When will I be notified that something’s broken?

  7. expr: foo > 0, for: 2m. Timeline: foo.set(1)

  8. expr: foo > 0, for: 2m. Scrape interval. Timeline: foo.set(1) → scrape. (Diagram lanes: Prometheus, Alertmanager.)

  9. expr: foo > 0, for: 2m. Evaluation interval. Timeline: foo.set(1) → scrape → evaluation (pending).

  10. expr: foo > 0, for: 2m. Evaluation interval. Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending).

  11. expr: foo > 0, for: 2m. At least 2m. Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending) → evaluation (firing).

  12. expr: foo > 0, for: 2m. group_wait. Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending) → evaluation (firing) → notification.

  13. expr: foo > 0, for: 2m. Total: scrape interval + evaluation interval + for + group_wait. Timeline: foo.set(1) → scrape → evaluation (pending) → evaluation (pending) → evaluation (firing) → notification.
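The sum on slide 13 lends itself to a quick back-of-the-envelope calculation. The following is a minimal sketch: the function name and the example values for the scrape interval, evaluation interval and group_wait are assumptions chosen for illustration; only `for: 2m` comes from the slides. It ignores network and delivery latency.

```python
def worst_case_notification_delay(scrape_interval, evaluation_interval,
                                  for_duration, group_wait):
    """Approximate worst-case delay, in seconds, between the metric changing
    and the first notification, following the sum shown on the slide."""
    return scrape_interval + evaluation_interval + for_duration + group_wait


# Hypothetical settings: 15s scrapes, 30s rule evaluations, for: 2m, 30s group_wait.
delay = worst_case_notification_delay(
    scrape_interval=15,
    evaluation_interval=30,
    for_duration=120,
    group_wait=30,
)
print(f"worst case: {delay}s")
```

Plugging in the real values from a Prometheus and Alertmanager configuration gives a rough answer to "when will I be notified?" before any alert fires.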

  14. Things to know ● Use “for” to avoid flapping alerts. ● group_interval for subsequent updates (including resolution). ● repeat_interval for reminders. ○ `--data.retention` flag (#1806).

  15. Routing

  16. Prometheus

  17. Guidelines ● Keep it simple. ● First level routes to match services/teams. ● Use amtool or routing tree editor to test/validate.

  18. To continue or not? Fictional scenario: ● All notifications should go to Slack. ● Alerts with job=app should email the app team. ○ severity=critical should page the app team too. ● Alerts with severity=critical should page the ops team.
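One way to reason about the fictional scenario is to model the routing tree in a few lines. The sketch below is a simplified model, not Alertmanager's code: it supports equality matchers only, and the receiver names (`default`, `slack`, `app-email`, `app-pager`, `ops-pager`) and the flat tree layout are assumptions made for illustration. It follows the documented behaviour that sibling routes are tried in order, that matching stops at the first matching sibling unless `continue` is set, and that a node is used only when none of its children match.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label equality matchers
    continue_: bool = False                      # mirrors the `continue` setting
    routes: list = field(default_factory=list)  # child routes


def matches(route, labels):
    """True when every matcher of the route is satisfied by the alert labels."""
    return all(labels.get(k) == v for k, v in route.match.items())


def resolve(route, labels):
    """Return the receivers notified for an alert with the given labels."""
    if not matches(route, labels):
        return []
    found = []
    for child in route.routes:
        child_found = resolve(child, labels)
        found.extend(child_found)
        # Without `continue`, the first matching sibling ends the search.
        if child_found and not child.continue_:
            break
    # The node itself is used only when none of its children matched.
    return found or [route.receiver]


# Hypothetical tree for the scenario: every first-level route that should not
# swallow the alert sets continue.
TREE = Route(receiver="default", routes=[
    Route("slack", continue_=True),
    Route("app-email", {"job": "app"}, continue_=True),
    Route("app-pager", {"job": "app", "severity": "critical"}, continue_=True),
    Route("ops-pager", {"severity": "critical"}),
])

print(resolve(TREE, {"job": "app", "severity": "critical"}))
```

Dropping `continue_=True` from the `slack` route would make every alert stop there, which is the kind of mistake amtool or the routing tree editor helps to catch.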

  19. Silences, inhibitions, oh my!

  20. Inhibition rule

  21. Gotchas ● Pick the appropriate silence duration (#1639). ● Corner cases with incident management systems. ● Inhibiting alerts can’t inhibit themselves (#666).
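The last gotcha is easier to see with a small model of an inhibition rule. This is a simplified sketch, not Alertmanager's implementation: it handles equality matchers only, treats a missing label as an empty value, and the rule, alert names and labels are hypothetical. It assumes that when the target alert also matches the source side, source alerts matching both sides are skipped, which is what keeps an alert from inhibiting itself.

```python
def matches(labels, matchers):
    """True when every matcher is satisfied; missing labels count as empty."""
    return all(labels.get(k, "") == v for k, v in matchers.items())


def is_inhibited(target, firing, rule):
    """Return True if some firing source alert inhibits the target alert."""
    if not matches(target, rule["target_match"]):
        return False
    target_is_source = matches(target, rule["source_match"])
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # Two-sided match: skip sources that are also targets, so that an
        # alert matching both sides cannot inhibit itself.
        if target_is_source and matches(source, rule["target_match"]):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


# Hypothetical rule: ClusterDown mutes the infra team's alerts in the same cluster.
RULE = {
    "source_match": {"alertname": "ClusterDown"},
    "target_match": {"team": "infra"},
    "equal": ["cluster"],
}
cluster_down = {"alertname": "ClusterDown", "team": "infra", "cluster": "eu"}
disk_full = {"alertname": "DiskFull", "team": "infra", "cluster": "eu"}
firing = [cluster_down, disk_full]

print(is_inhibited(disk_full, firing, RULE))
print(is_inhibited(cluster_down, firing, RULE))
```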

  22. High availability

  23. High availability ● Broadcast silences and notification logs. ● Based on the hashicorp/memberlist library. ● Requires a dedicated TCP/UDP port. ○ UDP for small messages (⩽ 700 bytes) ○ TCP otherwise

  24. --cluster.peer="" alertmanager-0

  25. --cluster.peer=alertmanager-0:9094 alertmanager-1

  26. --cluster.peer=alertmanager-0:9094 --cluster.peer=alertmanager-0:9094 alertmanager-2

  27. Position: 0 Position: 2 Position: 1

  28. High availability flags ● Server --cluster.listen-address --cluster.advertise-address ● Peering --cluster.peer --cluster.peer-timeout (15s) --cluster.settle-timeout (1m)

  29. High availability flags (continued) ● Data exchange --cluster.gossip-interval (250ms) --cluster.pushpull-interval (1m) --cluster.tcp-timeout (10s) ● Probes --cluster.probe-timeout (500ms) --cluster.probe-interval (1s) ● Reconnection --cluster.reconnect-interval (10s) --cluster.reconnect-timeout (6h)
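The peer positions from slide 27 and the `--cluster.peer-timeout` default can be tied together with a short sketch. The slides do not spell out the relationship, so treat the following as an assumption: each peer waits for its position multiplied by the peer timeout before sending a notification, which gives lower-positioned peers time to notify first and gossip the notification log. The function name is hypothetical.

```python
def notification_wait(position, peer_timeout=15.0):
    """Seconds a peer waits before notifying, assuming wait = position * timeout."""
    return position * peer_timeout


# Three peers with positions 0, 1 and 2 and the default 15s peer timeout.
for position in range(3):
    print(f"position {position}: waits {notification_wait(position):.0f}s")
```

Under that assumption, the peer at position 0 notifies immediately, and the others send only if the notification log shows that nobody did so before their wait expired.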

  30. Hidden stuff ● Peer names refreshed every 15 seconds. ● Messages gossiped to half of the nodes (min. 3). ● Gossip queue size of 4096 messages. ● Settle phase stops after 3 “stable” iterations.
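Two of these hidden rules can be expressed as small functions. This is a sketch under stated assumptions, not the actual implementation: the function names are hypothetical, "half of the nodes" is taken to mean integer division, and a "stable" iteration is taken to mean a poll in which the number of known peers is unchanged from the previous poll.

```python
def gossip_fanout(known_peers):
    """Number of nodes a message is gossiped to: half the peers, at least 3."""
    return max(3, known_peers // 2)


def polls_until_settled(peer_counts, required_stable=3):
    """Return how many polls it takes to see `required_stable` consecutive
    unchanged peer counts, or None if the sequence never settles."""
    stable = 0
    previous = None
    for polls, count in enumerate(peer_counts, start=1):
        if count == previous:
            stable += 1
        else:
            stable = 0  # membership changed, start counting again
        previous = count
        if stable >= required_stable:
            return polls
    return None


print(gossip_fanout(10))
print(polls_until_settled([1, 2, 3, 3, 3, 3]))
```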

  31. Future work ● Encryption & authentication using mTLS (#1819). ● Better support for advertised address (#1909).

  32. Conclusion ● Test all the things. ● Keep it simple. ● We ❤ contributions!

  33. Thanks! Simon Pasquier pasquier.simon@gmail.com @SimonHiker

  34. Psst, we’re hiring!
