Monitoring at Scale: Migrating to Prometheus at Fastly (PROMCON 2018)



slide-1
SLIDE 1

Monitoring at Scale

Migrating to Prometheus at Fastly

PROMCON 2018 | Marcus Barczak

@ickymettle

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

How were we monitoring Fastly?

slide-6
SLIDE 6

+

slide-7
SLIDE 7

Growing pains with Ganglia

๏ Operational overhead.
๏ Limited graphing functions.
๏ No alerting support.
๏ No real API for consuming metric data.

slide-8
SLIDE 8

[Figure: logos joined by plus signs, including a SaaS system]

slide-9
SLIDE 9

Growing pains doubled

๏ Now supporting two systems.
๏ Where do I put my metrics?
๏ Still writing external plugins and agents.
๏ Monitoring treated as a "post-release" phase.

slide-10
SLIDE 10

Scaling our infrastructure horizontally required scaling our monitoring vertically.

slide-11
SLIDE 11

Third time lucky

slide-12
SLIDE 12

Project goals

๏ Scale with our infrastructure growth.
๏ Be easy to deploy and operate.
๏ Engineer-friendly instrumentation libraries.
๏ First-class API support for data access.
๏ Reinvigorate our monitoring culture.

See: https://peter.bourgon.org/observability-the-hard-parts/

slide-13
SLIDE 13

?

slide-14
SLIDE 14
slide-15
SLIDE 15

Getting started

๏ Build a proof of concept.
๏ Pair with a pilot team to instrument their services.
๏ Iterate through the rest.
๏ Run both systems in parallel.
๏ Decommission the SaaS system and Ganglia.

slide-16
SLIDE 16

Infrastructure build

slide-17
SLIDE 17

[Diagram: one pair of Prometheus servers in a site]

SJC: prometheus A and prometheus B each scrape targets.

slide-18
SLIDE 18

[Diagram: one pair of Prometheus servers per site]

SJC: prometheus A and prometheus B each scrape targets.
JFK: prometheus A and prometheus B each scrape targets.
ATL: prometheus A and prometheus B each scrape targets.
slide-19
SLIDE 19

[Diagram: one pair of Prometheus servers per site, plus a central stack in GCP]

SJC: prometheus A and prometheus B each scrape targets.
JFK: prometheus A and prometheus B each scrape targets.
ATL: prometheus A and prometheus B each scrape targets.
GCP: federator A, federator B, frontend stack.
slide-20
SLIDE 20

[Diagram: one pair of Prometheus servers per site, plus a central stack in GCP]

SJC: prometheus A and prometheus B each scrape targets.
JFK: prometheus A and prometheus B each scrape targets.
ATL: prometheus A and prometheus B each scrape targets.
GCP: federator A, federator B, frontend stack.

Query Traffic (TLS)
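The slides do not say how the federators pull data from the per-site servers. As a minimal sketch, assuming they use Prometheus's standard /federate endpoint with match[] selectors, a federation URL could be built as below. The hostname and the selectors are hypothetical and are not taken from the talk.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    # A list of pairs lets the same key ("match[]") repeat in the query string.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical host and selectors, for illustration only.
    print(federate_url(
        "https://prometheus-a.sjc.example",
        ['{job="node_exporter"}', '{__name__=~"job:.*"}'],
    ))
```

Restricting match[] to a few jobs or to aggregated recording rules keeps the federated volume small compared with what each site scrapes locally.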

slide-21
SLIDE 21

Prometheus Server Software Stack

๏ Ghost Tunnel: TLS termination and auth.
๏ Service Discovery Sidecar: target configuration.
๏ Rules Loader: recording and alert rules.
๏ Prometheus

slide-22
SLIDE 22

Prometheus Server Software Stack

๏ Ghost Tunnel: TLS termination and auth.
๏ Service Discovery Sidecar: target configuration.
๏ Rules Loader: recording and alert rules.
๏ Prometheus

Typical Server Software Stack

๏ Service Discovery Proxy: service discovery and TLS exporter proxy.
๏ Exporters: built into services or run as a sidecar.

slide-23
SLIDE 23

Build your own service discovery?

slide-24
SLIDE 24

Fastly's infrastructure is bare metal hardware: no cloud conveniences.

slide-25
SLIDE 25

Service discovery requirements

๏ Automatic discovery of targets.
๏ Self-service registration of exporter endpoints.
๏ TLS encryption for all exporter traffic.
๏ Minimal exposure of exporter TCP ports.

slide-26
SLIDE 26

Prometheus Server Software Stack

๏ Ghost Tunnel: TLS termination and auth.
๏ PromSD Sidecar: target configuration.
๏ Prometheus

Typical Server Software Stack

๏ PromSD Proxy: service discovery and TLS exporter proxy.
๏ Exporters: built into services or run as a sidecar.

๏ The PromSD sidecar generates config for Prometheus.
๏ Prometheus scrapes proxied targets over TLS.
๏ The PromSD sidecar queries the PromSD proxy for available targets.

slide-27
SLIDE 27

PromSD sidecar

1. Fetch the list of hosts in a datacenter from configly:

    "exporter_hosts": [ "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4" ]

2. Request the /targets endpoint of the promsd proxy on each host to get the list of available scrape targets.
3. Output all targets as a file service discovery JSON file:

    {
      "targets": [ "10.0.0.1:9702", "10.0.0.2:9702" ],
      "labels": {
        "__metrics_path__": "/node_exporter_9100/metrics",
        "job": "node_exporter"
      }
    },
    {
      "targets": [ "10.0.0.1:9702", "10.0.0.2:9702" ],
      "labels": {
        "__metrics_path__": "/varnishstat_exporter_19102/metrics",
        "job": "varnishstat_exporter"
      }
    }

4. Prometheus reads the file and scrapes the configured targets.
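The grouping step of the sidecar can be sketched as follows. This is a minimal illustration, not Fastly's implementation: the shape of the data returned by each proxy's /targets endpoint is not shown in the slides, so the `targets_by_host` mapping and the `build_file_sd` helper are assumptions. Only the proxy port 9702, the path pattern, and the output format come from the slide.

```python
import json


def build_file_sd(targets_by_host, proxy_port=9702):
    """Group proxied exporters into Prometheus file-SD target groups.

    targets_by_host maps a host IP to the exporters its proxy reported,
    as {exporter_name: local_port}. This input shape is hypothetical.
    """
    groups = {}
    for host, exporters in sorted(targets_by_host.items()):
        for job, port in sorted(exporters.items()):
            # Path pattern taken from the slide: /<exporter>_<port>/metrics
            path = f"/{job}_{port}/metrics"
            group = groups.setdefault((job, path), {
                "targets": [],
                "labels": {"__metrics_path__": path, "job": job},
            })
            # Every exporter on a host is reached through the one proxy port.
            group["targets"].append(f"{host}:{proxy_port}")
    return [groups[key] for key in sorted(groups)]


if __name__ == "__main__":
    discovered = {
        "10.0.0.1": {"node_exporter": 9100, "varnishstat_exporter": 19102},
        "10.0.0.2": {"node_exporter": 9100, "varnishstat_exporter": 19102},
    }
    # Written to disk, this list is what a file_sd_configs entry would read.
    print(json.dumps(build_file_sd(discovered), indent=2))
```

Grouping by job and metrics path produces one target group per exporter type, which matches the two groups shown on the slide.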

slide-28
SLIDE 28

PromSD proxy

1. Fetch the list of installed systemd services from systemd: node_exporter, process_exporter, varnishstat_exporter.
2. For each corresponding systemd service, fetch the local exporter target address from configly:

    "node_exporter": { "prometheus_properties": { "target": "127.0.0.1:9100" } },
    …
    "varnishstat_exporter": { "prometheus_properties": { "target": "127.0.0.1:19102" } }

3. Expose an API used by prometheus and the promsd sidecar:

    /node_exporter_9100/metrics
    /varnish_exporter_19102/metrics
    /targets
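How the proxy might derive its routes from these two inputs can be sketched as below. The `proxy_routes` helper, the `sshd` example service, and the assumption that each exporter serves its metrics at /metrics on the local target address are all hypothetical. The config shape and the /<exporter>_<port>/metrics pattern follow the slide.

```python
def proxy_routes(installed_services, exporter_config):
    """Map each registered exporter to the path the proxy would expose.

    installed_services: names of installed systemd services.
    exporter_config: {name: {"prometheus_properties": {"target": "ip:port"}}}
    Returns {exposed_path: local_upstream_url}.
    """
    routes = {}
    for name in sorted(installed_services):
        props = exporter_config.get(name, {}).get("prometheus_properties")
        if not props:
            continue  # installed, but not registered as an exporter
        target = props["target"]
        port = target.rsplit(":", 1)[1]
        # Assumption: the exporter serves /metrics on its local address.
        routes[f"/{name}_{port}/metrics"] = f"http://{target}/metrics"
    return routes


if __name__ == "__main__":
    config = {
        "node_exporter": {"prometheus_properties": {"target": "127.0.0.1:9100"}},
        "varnishstat_exporter": {"prometheus_properties": {"target": "127.0.0.1:19102"}},
    }
    installed = ["node_exporter", "sshd", "varnishstat_exporter"]
    for path, upstream in proxy_routes(installed, config).items():
        print(path, "->", upstream)
```

Because the routes are derived from config rather than hard-coded, registering a new exporter only needs a new config entry, and the host still exposes a single proxy port.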

slide-29
SLIDE 29

It worked!

๏ Really easy to leverage the file SD mechanism.
๏ New targets can be added with one line of config.
๏ TLS and authentication everywhere.
๏ Single exporter port open per host.

slide-30
SLIDE 30

Prometheus Adoption

slide-31
SLIDE 31

Prometheus at Scale at Fastly

114 Prometheus servers globally

28.4M time series

2.2M samples/second
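To put these totals in per-server terms, the arithmetic below derives fleet-wide averages. It is only a rough reading of the slide: it assumes the totals are summed evenly over all 114 servers, and the slide does not say whether series scraped by both members of an A/B pair are counted twice.

```python
def fleet_averages(servers, total_series, samples_per_second):
    """Return average load per server and the implied scrape interval."""
    return {
        "series_per_server": total_series / servers,
        "samples_per_second_per_server": samples_per_second / servers,
        # If every series yields one sample per scrape, the average
        # scrape interval is series divided by samples per second.
        "implied_scrape_interval_s": total_series / samples_per_second,
    }


if __name__ == "__main__":
    # Figures from the slide: 114 servers, 28.4M series, 2.2M samples/s.
    for name, value in fleet_averages(114, 28.4e6, 2.2e6).items():
        print(f"{name}: {value:,.1f}")
```

Under those assumptions this works out to roughly 249,000 series and 19,300 samples per second per server, with an average scrape interval of about 13 seconds.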

slide-32
SLIDE 32

... a few hours later

slide-33
SLIDE 33

Prometheus wins

๏ Engineers love it.
๏ Dashboard and alert quality have increased.
๏ PromQL enables some deep insights.
๏ Scaling linearly with our infrastructure growth.

slide-34
SLIDE 34

Still some rough edges

๏ Metrics exploration without prior knowledge.
๏ Alertmanager's flexibility.
๏ Federation and global views.
๏ Long-term storage is still an open question.

slide-35
SLIDE 35

😎

slide-36
SLIDE 36

Thanks!

@ickymettle fastly.com