Where does CoreOS fit in? Automating Monitoring infrastructure

  1. Alertmanager and high availability. Frederic Branczyk, Software Engineer at CoreOS (Prometheus/Alertmanager/Kubernetes), @brancz

  2. Where does CoreOS fit in? ● Automating Monitoring infrastructure ● Prometheus + Kubernetes

  3. What will I be talking about? ● From alert to notification ● High availability contract ● High availability implementation ● Implications on operating HA Alertmanager

  4. Alertmanager Features ● Receives and groups alerts ● Deduplicates alerts ● Sends notifications to providers ○ PagerDuty, email, Slack, etc. ● Silencing

  5. Prometheus & Alertmanager

  6. Alerting Rule, Alerting Rule, ..., Alerting Rule, Alerting Rule
     04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
     04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
     04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
     04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
     04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
     04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
     04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
     . . .
     04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
     04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST

  7. Grouped in one notification ● 3 x HighLatency ● 10 x HighErrorRate ● 2 x CacheServerSlow ● (+individual Alerts)
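
The grouping step can be sketched in a few lines of Python. This is an illustration only, not Alertmanager's implementation: the `group_alerts` and `summarize` helpers, the dictionary layout, and the choice of `service` and `zone` as grouping labels are assumptions made for the example, and the input is a small subset of the alerts listed on slide 6.

```python
from collections import Counter, defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def summarize(group):
    """Count alerts per alert name, as on the 'grouped' slide."""
    return Counter(alert["alertname"] for alert in group)


# A subset of the alerts from slide 6 (hypothetical dict representation).
alerts = [
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "GET"},
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "GET"},
    {"alertname": "HighErrorRate", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "POST"},
    {"alertname": "CacheServerSlow", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "POST"},
]

groups = group_alerts(alerts, group_by=["service", "zone"])
summary = summarize(groups[("X", "eu-west")])
for name, count in summary.items():
    print(f"{count} x {name}")
```

All four alerts share the same `service` and `zone`, so they land in a single group and would go out as one notification.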

  8. Boiled down: Alertmanager reliably sends notifications

  9. High Availability

  10. Infrastructure Scaling Story [Diagram: Microservice 1, Microservice 2, and Microservice 3 with a Prometheus and an Alertmanager, repeated for a second group (...); Gossip links the Alertmanagers]

  11. Why decoupled? ● Keep Prometheus alerting simple ● High availability of Prometheus ● No state sharing between Prometheus servers

  12. Example Alerting Rule
          ALERT NoLeader
            IF etcd_has_leader == 0
            FOR 10m
            LABELS { severity = "warning" }
            ANNOTATIONS {
              summary = "etcd no leader",
              description = "etcd instance has no leader",
            }

  13. Alert Evaluation in Prometheus (Rule 1, Rule 2, Rule 3, ...) ● Evaluate Rule/Alert ● Fire alert against Alertmanager ● Repeat in *rule evaluation interval*
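
How the `FOR 10m` clause of the example rule interacts with the repeated evaluation can be sketched as a tiny state machine. This is a simplified model rather than Prometheus code: the `AlertState` class, its field names, and the 60-second evaluation interval are assumptions chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert across evaluations (hypothetical helper)."""
    for_seconds: float                    # the FOR duration of the rule
    active_since: Optional[float] = None  # when the condition first held

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            # Condition no longer holds: reset the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"   # would be sent to Alertmanager
        return "pending"      # condition holds, FOR not yet elapsed


state = AlertState(for_seconds=600)  # FOR 10m
interval = 60                        # assumed rule evaluation interval
states = [state.evaluate(True, t) for t in range(0, 721, interval)]
after_recovery = state.evaluate(False, 780)
```

In this model the alert stays pending while `etcd_has_leader == 0` keeps holding, and only starts firing once the condition has held for the full ten minutes.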

  14. Simple configuration
      ● Resolve alerts in 5m
      ● Group by job label
      ● Group for 10 seconds
      ● Send via webhook receiver

          global:
            resolve_timeout: 5m
          route:
            group_by: ['job']
            group_wait: 10s
            group_interval: 10s
            repeat_interval: 1h
            receiver: 'webhook'
          receivers:
          - name: 'webhook'
            webhook_configs:
            - url: 'http://127.0.0.1:5001/'

  15. Notification Pipeline: Silence → Wait → Dedup → Send → Gossip
      ● Silence: Do not continue
      ● Wait: Position in cluster multiplied by 5 seconds
      ● Dedup: Has notification already been sent?
      ● Send: Send notification via favorite provider
      ● Gossip: Tell other peers notification has been sent
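
The order of the pipeline stages and the position-based wait can be sketched as follows. This is a minimal model of the slide, not the Alertmanager source: `wait_seconds`, `run_pipeline`, and the boolean inputs are hypothetical names, and the wait is only computed, not actually slept.

```python
PEER_WAIT_SECONDS = 5  # per-position delay, taken from the slide


def wait_seconds(position: int) -> int:
    """Wait stage: position in the cluster multiplied by 5 seconds."""
    return position * PEER_WAIT_SECONDS


def run_pipeline(position: int, silenced: bool, already_sent: bool):
    """Return the stages an alert group passes through and the outcome."""
    stages = ["silence"]
    if silenced:
        return stages, "dropped"   # Silence: do not continue
    stages.append(f"wait {wait_seconds(position)}s")
    stages.append("dedup")
    if already_sent:
        return stages, "skipped"   # a peer already sent the notification
    stages += ["send", "gossip"]   # notify, then tell the other peers
    return stages, "sent"


first_peer = run_pipeline(position=0, silenced=False, already_sent=False)
second_peer = run_pipeline(position=1, silenced=False, already_sent=True)
```

The peer at position 0 does not wait at all, so in the normal case it sends first and the later peers find the notification already recorded when they reach the dedup stage.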

  16. What is gossiped? ● Yes ○ Sent notifications ○ Silences ● No ○ Received alerts

  17. How? CRDTs! ● Conflict-free replicated data type ● Associativity (a+(b+c)=(a+b)+c) ● Commutativity (a+b=b+a) ● Idempotence (a+a=a) ● Well suited for AP systems
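
The three properties can be checked on one of the simplest CRDTs, a grow-only set merged by union. This is only an illustration of the algebra; the set contents are made-up identifiers, and Alertmanager's own data types are more involved.

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Merge two replicas of a grow-only set by taking their union."""
    return a | b


# Three replicas with hypothetical silence identifiers.
a = frozenset({"silence-1"})
b = frozenset({"silence-2"})
c = frozenset({"silence-1", "silence-3"})

associative = merge(a, merge(b, c)) == merge(merge(a, b), c)
commutative = merge(a, b) == merge(b, a)
idempotent = merge(a, a) == a

print(associative, commutative, idempotent)
```

Because the merge has these properties, replicas can receive the same updates in any order, in any grouping, and more than once, and still converge to the same state.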

  18. Yes, but how? mesh by Weaveworks! ● Eventually consistent ● LWW-element-set ● Mergeable log of records ● Merges based on UID ○ On conflict latest timestamp wins
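
A merge keyed on UID where the latest timestamp wins can be sketched like this. It is an assumption-laden illustration, not the mesh library's API: the `Record` type, its fields, and the `merge` function are hypothetical, and equal timestamps simply keep the local record, whereas a real implementation needs a deterministic tie-break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    uid: str
    timestamp: float  # time of the last write to this record
    value: str


def merge(local: dict, delta: dict) -> dict:
    """Merge gossiped records into local state; newest timestamp wins."""
    merged = dict(local)
    for uid, record in delta.items():
        current = merged.get(uid)
        if current is None or record.timestamp > current.timestamp:
            merged[uid] = record
    return merged


# Replica 0 holds an updated silence 1; replica 1 holds the old version
# of silence 1 plus a silence 2 that replica 0 has not seen yet.
am0 = {"1": Record("1", 20.0, "Query, Start1, End")}
am1 = {
    "1": Record("1", 10.0, "Query, Start, End"),
    "2": Record("2", 15.0, "Query, Start, End"),
}

merged_on_am0 = merge(am0, am1)
merged_on_am1 = merge(am1, am0)
```

Both replicas end up with the same two records, and merging the same delta a second time changes nothing.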

  19. Why not etcd? ● Simple operation ○ Fewer moving pieces ○ Single binary ● Want: AP not CP

  20. Silences

  21. Create Silences [Diagram: a Create Silence request arrives at Alertmanager 0; a Gossip Delta (ID: 2 ...) goes to Alertmanager 1, which performs Merge Gossip Data. Each Silences Database (columns ID, Values) then holds 1: Query, Start, End and 2: Query, Start, End]

  22. Update Silences [Diagram: an Update Silence request (UID: 1, Start: Start1) arrives at one Alertmanager; a Gossip Delta (ID: 1, Start: Start1) goes to the other, which performs Merge Gossip Data. In each Silences Database (columns ID, Values), entry 1 changes from Query, Start, End to Query, Start1, End; entry 2 remains Query, Start, End]

  23. Notification Log

  24. Non-silenced alert example [Diagram: Prometheus → Alertmanager 0, Alertmanager 1]
      Alertmanager 0 ● Wait 0s ● Dedup: Not sent → Send ● Gossip
      Alertmanager 1 ● Wait 5s ● Receive Gossip Data ● Deduplicate → Do not send

  25. Gossip Partition [Diagram: Prometheus → Alertmanager 0, Alertmanager 1, with a Network Partition between the two Alertmanagers]
      Alertmanager 0 ● Wait 0s ● Dedup: Not sent → Send ● Gossip
      Alertmanager 1 ● Wait 5s ● Dedup: Not sent → Send
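
The difference between the healthy case and the partitioned case can be reproduced with a small simulation. Everything here is a simplified model under stated assumptions: the `Alertmanager` class, the `deliver` function, and the in-memory notification log are hypothetical, and waiting is modeled only by processing peers in order of their wait time.

```python
class Alertmanager:
    """Toy peer with a position in the cluster and a notification log."""

    def __init__(self, position: int):
        self.position = position
        self.notification_log = set()

    def wait_seconds(self) -> int:
        return self.position * 5  # position multiplied by 5 seconds

    def process(self, group_key: str) -> bool:
        """Dedup against the local log; return True if a notification is sent."""
        if group_key in self.notification_log:
            return False
        self.notification_log.add(group_key)
        return True


def deliver(group_key: str, peers, partitioned: bool) -> int:
    """Every peer receives the alert; return how many notifications go out."""
    notifications = 0
    for peer in sorted(peers, key=Alertmanager.wait_seconds):
        if peer.process(group_key):
            notifications += 1
            if not partitioned:
                # Gossip: tell the other peers the notification was sent.
                for other in peers:
                    other.notification_log.add(group_key)
    return notifications


healthy = deliver("group-a", [Alertmanager(0), Alertmanager(1)], partitioned=False)
split = deliver("group-a", [Alertmanager(0), Alertmanager(1)], partitioned=True)
print(healthy, split)
```

With gossip working, one notification goes out; during the partition both peers send, so the failure mode is a duplicate notification rather than a missed one.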

  26. Notification Log [Diagram: an Alert Firing event arrives at Alertmanager 0; a Gossip Delta (UID: 2 ...) goes to Alertmanager 1, which performs Merge Gossip Data. Each Notification Log (columns UID, Values) then holds 1: Resolve,Notify,TS,... and 2: Resolve,Notify,TS,...]

  27. Group Key
      ● Group at runtime
        ○ By Group By labels
      ● XOR with Route
      ● Concat with Receiver

          global:
            resolve_timeout: 5m
          route:
            group_by: ['job']
            group_wait: 10s
            group_interval: 10s
            repeat_interval: 1h
            receiver: 'webhook'
          receivers:
          - name: 'webhook'
            webhook_configs:
            - url: 'http://127.0.0.1:5001/'

  28. DEMO!

  29. Thanks! QUESTIONS? LONGER CHAT? frederic.branczyk@coreos.com Let’s talk! GitHub: @brancz #prometheus on Freenode Twitter: @fredbrancz More events: coreos.com/community We’re hiring: coreos.com/careers also in Berlin!
