Monitoring Cloudflare's planet-scale edge network with Prometheus - PowerPoint PPT Presentation
SLIDE 1

Monitoring Cloudflare's planet-scale edge network with Prometheus

Matt Bostock

SLIDE 2

@mattbostock Platform Operations

SLIDE 3

Prometheus for monitoring

  • Alerting on critical production issues
  • Incident response
  • Post-mortem analysis
  • Metrics, but not long-term storage
SLIDE 4

What does Cloudflare do?

CDN: moving content physically closer to visitors with our CDN.

Website optimization: caching, TLS 1.3, HTTP/2, Server Push, AMP, origin load-balancing, smart routing.

DNS: Cloudflare is one of the fastest managed DNS providers in the world.

SLIDE 5

Cloudflare’s anycast edge network

  • 115+ data centers globally
  • 1.2M DNS requests/second
  • 10% of Internet requests every day
  • 5M HTTP requests/second
  • 6M+ websites, apps & APIs in 150 countries

SLIDE 6

Cloudflare’s Prometheus deployment

  • 4.6M time series max per server
  • 4 top-level Prometheus servers
  • 185 Prometheus servers currently in production
  • 72k samples ingested per second, max per server
  • 250GB max size of data on disk

SLIDE 7

Edge Points of Presence (PoPs)

  • Routing via anycast
  • Configured identically
  • Independent
SLIDE 8

Services in each PoP

  • HTTP
  • DNS
  • Replicated key-value store
  • Attack mitigation
SLIDE 9

Core data centers

  • Enterprise log share (HTTP access logs for Enterprise customers)
  • Customer analytics
  • Logging: auditd, HTTP errors, DNS errors, syslog
  • Application and operational metrics
  • Internal and customer-facing APIs
SLIDE 10

Services in core data centers

  • PaaS: Marathon, Mesos, Chronos, Docker, Sentry
  • Object storage: Ceph
  • Data streams: Kafka, Flink, Spark
  • Analytics: ClickHouse (OLAP), CitusDB (sharded PostgreSQL)
  • Hadoop: HDFS, HBase, OpenTSDB
  • Logging: Elasticsearch, Kibana
  • Config management: Salt
  • Misc: MySQL
SLIDE 11

Prometheus queries

SLIDE 12

# Percentage of active (healthy) disks in each RAID array
node_md_disks_active / node_md_disks * 100

SLIDE 13

# Number of distinct kernel versions running across the fleet
count(count(node_uname_info) by (release))

SLIDE 14

# Average disk read latency (ms per read) over the last 2 minutes
rate(node_disk_read_time_ms[2m]) / rate(node_disk_reads_completed[2m])

SLIDE 15

Metrics for alerting

SLIDE 16

# Percentage of Alertmanager requests that returned a 5xx error
sum(rate(http_requests_total{job="alertmanager", code=~"5.."}[2m])) / sum(rate(http_requests_total{job="alertmanager"}[2m])) * 100 > 0

SLIDE 17

# HBase DataNodes whose disk usage deviates from the cluster-wide
# usage ratio by more than 10 percentage points
count(
  abs(
    (hbase_namenode_FSNamesystemState_CapacityUsed / hbase_namenode_FSNamesystemState_CapacityTotal)
    - on() group_right()
    (hadoop_datanode_fs_DfsUsed / hadoop_datanode_fs_Capacity)
  ) * 100 > 10
)

SLIDE 18

Prometheus architecture

SLIDE 19

Before, we used Nagios

  • Tuned for high volume of checks
  • Hundreds of thousands of checks
  • One machine in one central location
  • Alerting backend for our custom metrics pipeline

SLIDE 20

Specification Comments

SLIDE 21

Inside each PoP

(Diagram: a Prometheus server scrapes each server in the PoP)


SLIDE 23

Inside each PoP: High availability

(Diagram: two Prometheus servers each scrape every server in the PoP)

SLIDE 24

Federation

(Diagram: a top-level Prometheus server in the core federates from the San Jose, Frankfurt and Santiago PoPs)

SLIDE 25

Federation configuration

- job_name: 'federate'
  scheme: https
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Scrape target health
      - '{__name__="up"}'
      # Colo-level aggregate metrics
      - '{__name__=~"colo(?:_.+)?:.+"}'

SLIDE 26

Federation configuration

- job_name: 'federate'
  scheme: https
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Scrape target health
      - '{__name__="up"}'
      # Colo-level aggregate metrics
      - '{__name__=~"colo(?:_.+)?:.+"}'

colo:* colo_job:*
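The colo:* and colo_job:* patterns match aggregate recording rules computed inside each PoP, following Prometheus's level:metric:operation naming convention. A hypothetical example in the Prometheus 1.x rule syntax (the metric name and rate window are illustrative assumptions, not from the slides):

```
colo_job:http_requests:rate2m = sum by (colo, job) (rate(http_requests_total[2m]))
```

Federating only these pre-aggregated series, plus up, is what keeps the top-level servers' series count manageable.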

SLIDE 28

Federation: High availability

(Diagram: two top-level Prometheus servers in the core, each federating from the San Jose, Frankfurt and Santiago PoPs)

SLIDE 29

Federation: High availability

(Diagram: top-level Prometheus servers in CORE US and CORE EU, each federating from the San Jose, Frankfurt and Santiago PoPs)

SLIDE 30

Retention and sample frequency

  • 15 days’ retention
  • Metrics scraped every 60 seconds
    ○ Federation: every 30 seconds
  • No downsampling
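A quick sanity check on the deployment numbers: at a 60-second scrape interval, a server at the 4.6M time-series maximum would ingest on the order of

```
4,600,000 series / 60 s ≈ 77,000 samples/s
```

in line with the 72k samples per second per-server maximum quoted earlier (assuming both maxima describe comparable servers).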
SLIDE 31

Exporters we use

Purpose                                      | Name
System (CPU, memory, TCP, RAID, etc.)        | Node exporter
Network probes (HTTP, TCP, ICMP ping)        | Blackbox exporter
Log matches (hung tasks, controller errors)  | mtail
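mtail here refers to small programs that turn log pattern matches into Prometheus metrics. A minimal sketch of what a hung-task matcher might look like (the pattern and counter name are illustrative assumptions, not from the slides):

```
# Hypothetical mtail program: count kernel hung-task messages
counter hung_tasks

/task \S+ blocked for more than \d+ seconds/ {
  hung_tasks++
}
```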

SLIDE 32

Deploying exporters

  • One exporter per service instance
  • Separate concerns
  • Deploy in same failure domain
SLIDE 33

Alerting

SLIDE 34

Alerting

(Diagram: an Alertmanager in the core receives alerts from the San Jose, Frankfurt and Santiago PoPs)

SLIDE 35

Alerting: High availability (soon)

(Diagram: Alertmanagers in CORE US and CORE EU, each receiving alerts from the San Jose, Frankfurt and Santiago PoPs)

SLIDE 36

Writing alerting rules

  • Test the query on past data
SLIDE 37

Writing alerting rules

  • Test the query on past data
  • Descriptive name with adjective or adverb
SLIDE 38

RAID_Array

SLIDE 39

RAID_Health_Degraded

SLIDE 40

Writing alerting rules

  • Test the query on past data
  • Descriptive name with adjective/adverb
  • Must have an alert reference
SLIDE 41

Writing alerting rules

  • Test the query on past data
  • Descriptive name with adjective/adverb
  • Must have an alert reference
  • Must be actionable
SLIDE 42

Writing alerting rules

  • Test the query on past data
  • Descriptive name with adjective/adverb
  • Must have an alert reference
  • Must be actionable
  • Keep it simple
SLIDE 43

Example alerting rule

ALERT RAID_Health_Degraded
  IF node_md_disks - node_md_disks_active > 0
  LABELS { notify="jira-sre" }
  ANNOTATIONS {
    summary = "{{ $value }} disks in {{ $labels.device }} on {{ $labels.instance }} are faulty",
    dashboard = "https://grafana.internal/disk-health?var-instance={{ $labels.instance }}",
    link = "https://wiki.internal/ALERT+Raid+Health",
  }

SLIDE 44

Monitoring your monitoring

SLIDE 45

PagerDuty escalation drill

ALERT SRE_Escalation_Drill
  IF (hour() % 8 == 1 and minute() >= 35)
    or (hour() % 8 == 2 and minute() < 20)
  LABELS { notify="escalate-sre" }
  ANNOTATIONS {
    dashboard = "https://cloudflare.pagerduty.com/",
    link = "https://wiki.internal/display/OPS/ALERT+Escalation+Drill",
    summary = "This is a drill to test that alerts are being correctly escalated. Please ack the PagerDuty notification."
  }

SLIDE 46

Monitoring Prometheus

  • Mesh: each Prometheus monitors other Prometheus servers in the same datacenter
  • Top-down: top-level Prometheus servers monitor datacenter-level Prometheus servers

SLIDE 47

Monitoring Alertmanager

  • Use Grafana’s alerting mechanism to page
  • Alert if notifications sent is zero even though notifications were received

SLIDE 48

Monitoring Alertmanager

# Fires when Alertmanager received alerts but sent no notifications;
# "or vector(0)" ensures the expression always returns a sample
(
  sum(rate(alertmanager_alerts_received_total{job="alertmanager"}[5m])) without(status, instance) > 0
  and
  sum(rate(alertmanager_notifications_total{job="alertmanager"}[5m])) without(integration, instance) == 0
)
or vector(0)
SLIDE 50

Alert routing

SLIDE 51

Alert routing

notify="hipchat-sre escalate-sre"

SLIDE 52

Alert routing

- match_re:
    notify: (?:.*\s+)?hipchat-sre(?:\s+.*)?
  receiver: hipchat-sre
  continue: true
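This matcher is one branch of Alertmanager's routing tree; a sketch of how it might sit in context (the default route and the escalate-sre branch are assumptions for illustration):

```yaml
route:
  receiver: default
  routes:
    - match_re:
        notify: (?:.*\s+)?hipchat-sre(?:\s+.*)?
      receiver: hipchat-sre
      continue: true   # keep evaluating so one alert can hit several receivers
    - match_re:
        notify: (?:.*\s+)?escalate-sre(?:\s+.*)?
      receiver: escalate-sre
```

continue: true is what lets a space-separated notify label such as "hipchat-sre escalate-sre" fan out to multiple receivers.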

SLIDE 53

Routing tree

SLIDE 59

amtool

matt➜~» go get -u github.com/prometheus/alertmanager/cmd/amtool
matt➜~» amtool silence add \
    --expire 4h \
    --comment https://jira.internal/TICKET-1234 \
    alertname=HDFS_Capacity_Almost_Exhausted

SLIDE 60

Pain points

SLIDE 61

Storage pressure

  • Use -storage.local.target-heap-size
  • Set -storage.local.series-file-shrink-ratio to 0.3 or above

SLIDE 62

Alertmanager races, deadlocks, timeouts, oh my

SLIDE 63

Cardinality explosion

mbostock@host:~$ sudo cp /data/prometheus/data/heads.db ~
mbostock@host:~$ sudo chown mbostock: ~/heads.db
mbostock@host:~$ storagetool dump-heads heads.db | awk '{ print $2 }' | sed 's/{.*//' | sed 's/METRIC=//' | sort | uniq -c | sort -n
...snip...
 678869 eyom_eyomCPTOPON_numsub
 678876 eyom_eyomCPTOPON_hhiinv
 679193 eyom_eyomCPTOPON_hhi
2314366 eyom_eyomCPTOPON_rank
2314988 eyom_eyomCPTOPON_speed
2993974 eyom_eyomCPTOPON_share

SLIDE 64

Standardise on metric labels early

  • Especially probes: source versus target
  • Identifying environments
  • Identifying clusters
  • Identifying deployments of the same app in different roles

SLIDE 65

Next steps

SLIDE 66

Prometheus 2.0

  • Lower disk I/O and memory requirements
  • Better handling of metrics churn
SLIDE 67

Integration with long term storage

  • Ship metrics from Prometheus (remote write)
  • One query language: PromQL
SLIDE 68

More improvements

  • Federate one set of metrics per datacenter
  • Highly-available Alertmanager
  • Visual similarity search
  • Alert menus; loading alerting rules dynamically
  • Priority-based alert routing
SLIDE 69

More information

blog.cloudflare.com
github.com/cloudflare
Try Prometheus 2.0: prometheus.io/blog
Questions? @mattbostock

SLIDE 70

Thanks!

blog.cloudflare.com
github.com/cloudflare
Try Prometheus 2.0: prometheus.io/blog
Questions? @mattbostock