M3 and Prometheus: Monitoring at Planet Scale for Everyone (Berlin, 2019-11-06) - PowerPoint PPT Presentation


SLIDE 1

M3 and Prometheus

Monitoring at Planet Scale for Everyone

Berlin, 2019-11-06 Rob Skillington & Łukasz Szczęsny

SLIDE 2

Who are we?

Rob Skillington

CTO at Chronosphere, @robskillington, M3DB creator, OpenMetrics contributor

Łukasz Szczęsny

Senior SRE at Chronosphere, @wybczu, M3 contributor

SLIDE 3

Let’s talk

  • Monitoring an increasing number of things…
  • Metrics being used as a platform more than ever…
  • Operating in many regions or environments…
  • M3 and Prometheus/Graphite…

SLIDE 4

High dimensionality metrics?


SLIDE 5

Example system being monitored

[Diagram: two regions, eu-west and eu-north, each running a frontend app backed by mysql and redis, serving clients v1.3 and v2.0]

SLIDE 6

Example system being monitored

[Diagram: the same eu-west/eu-north system, with clients v1.3 and v2.0 calling the eu-west frontend app]

Which code path to debug? We need to detect the failure and isolate it to:

  • route = /api/search
  • region = eu-west
  • client-version = v2.0

SLIDE 7

  • Let’s debug this using the HTTP status codes delivered by the frontends:

http_status_code

Let’s use high dimensionality metrics

Route (~100)     Status Code (~5)    Region (~12)    Client App Version (~40)
/api/search      2xx                 eu-east         1.3
/api/order       4xx                 eu-west         2.0
...              ...                 ...             ...

SLIDE 8

Revisiting this example...

[Diagram: the same eu-west/eu-north system, with clients v1.3 and v2.0 calling the eu-west frontend app]

The failure is isolated to:

  • route = /api/search
  • region = eu-west
  • client-version = v2.0
SLIDE 9

Ideally we would see...

SLIDE 10

How many time series is that?

100 routes * 5 status codes * 12 regions * 40 client versions

= 240,000 unique time series

Route (~100)     Status Code (~5)    Region (~12)    Client App Version (~40)
/api/search      2xx                 eu-east         1.3
/api/order       4xx                 eu-west         2.0
...              ...                 ...              ...

SLIDE 11

You can roll up metrics to make viewing fast

[Diagram: many raw series (region, client and status label combinations such as region=eu-west client=v1.2 status=2xx, region=eu-north client=v1.4 status=5xx, ...) rolled up into aggregate series per route and status class: status=2xx route=/api/search, status=4xx route=/api/search, status=5xx route=/api/search]

Partial-solution #1
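One common way to implement this kind of roll-up with Prometheus is a recording rule that aggregates away the high-cardinality labels. A minimal sketch, assuming a counter named http_requests_total with the labels from the earlier table (the rule name and interval are illustrative):

```yaml
groups:
  - name: http_rollups
    interval: 30s
    rules:
      # Keep only route and status; sum away region and client labels.
      - record: route_status:http_requests:rate5m
        expr: sum by (route, status) (rate(http_requests_total[5m]))
```

Dashboards and coarse alerts can then query the cheap rolled-up series while the raw series remain available for drill-down.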

SLIDE 12

For drill-down and highly granular alerting

[Diagram: the status=5xx route=/api/search series broken down further by user_country (de, us, lt, pl, ...) and client version]

240k time series is expensive but not too bad..? However, add any other dimension and it gets out of control (any multiplier on 240k explodes into the millions quickly).

e.g. unique user country codes = 249
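The cardinality arithmetic above is just a product of the distinct values per label; a minimal sketch in Go (the helper name is illustrative):

```go
package main

import "fmt"

// cardinality returns the number of unique label combinations:
// the product of each label's distinct-value count.
func cardinality(valueCounts ...int) int {
	total := 1
	for _, n := range valueCounts {
		total *= n
	}
	return total
}

func main() {
	// Counts from the slides: routes, status codes, regions, client versions.
	fmt.Println(cardinality(100, 5, 12, 40)) // 240000 unique series

	// Adding one more dimension (249 user country codes) multiplies again.
	fmt.Println(cardinality(100, 5, 12, 40, 249)) // 59760000 unique series
}
```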

SLIDE 13

What is Prometheus? What is M3?

Prometheus:

  • An open source monitoring system and time series database.
  • All-in-one, single-node monitoring solution using metrics.
  • First built at SoundCloud (began 2012, open sourced in 2014).

M3:

  • A distributed monitoring system and time series database, compatible as remote storage for Prometheus.
  • Built at Uber to scale monitoring horizontally and cost-effectively (began 2015, open sourced in 2018).

SLIDE 14

A single Prometheus instance can hold a reasonable amount of data (and you should always get started using Prometheus)

“This is fine.. I’m okay with the events that are unfolding currently”

Ok great, but what do I need?

SLIDE 15

Ok great, but what do I need?

SLIDE 16

Ok great, but what do I need?

Can I fit a service’s high cardinality metrics into an existing Prometheus instance? How do I scale up easily?

[Diagram: services mysql, my-cache, my-api and my-frontend all scraped by a single Prometheus instance]

SLIDE 17

So what is M3 and how does it help?

A horizontally scalable platform that supports multiple metric formats.

[Diagram: in each cloud region (#0 through #N), M3 Query and an M3 Coordinator with aggregation sit in front of a replicated M3DB cluster; Prometheus and Graphite write in, while Grafana (PromQL, Graphite) and alerting engines query out]

SLIDE 18

Why M3

  • 1. Suitable for many scenarios
  • 2. Scalable to billions of metrics
  • 3. Focus on simple operation
SLIDE 19
  • 1. Suitable for many scenarios

Cloud Native, Kubernetes, On-Prem, Multi-Region; Prometheus and Graphite compatible

SLIDE 20
  • 1. Suitable for many scenarios

M3 and Prometheus

  • Store metrics for weeks, months or years
  • Store metrics at different retentions based on mapping rules (e.g. app:nginx endpoints:/api*)
  • Scale up storage just by adding more nodes
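Retention-by-mapping-rule is configured in the m3coordinator downsampling section. A hedged sketch of what such a rule can look like; the exact keys can vary between M3 versions, and the resolutions and retentions here are illustrative:

```yaml
downsample:
  rules:
    mappingRules:
      - name: "nginx api endpoints"
        filter: "app:nginx endpoints:/api*"
        aggregations: ["Last"]
        storagePolicies:
          - resolution: 1m
            retention: 720h   # ~30 days at 1m resolution
          - resolution: 1h
            retention: 8760h  # ~1 year at 1h resolution
```

Metrics matching the filter are written at both resolutions, so long-range queries stay cheap while recent data stays detailed.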
SLIDE 21

[Diagram: My App scraped by two Prometheus instances, queried by Grafana and alerting]

SLIDE 22

DEMO

SLIDE 23

Prometheus remote read and write to M3DB

[Diagram: My App is scraped by two Prometheus instances, which remote read/write to a replicated M3DB cluster; Grafana and alerting query through Prometheus]
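Wiring Prometheus up this way is done in prometheus.yml. A minimal sketch, assuming an m3coordinator instance reachable at the hostname m3coordinator on its default port 7201:

```yaml
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"

remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```

Prometheus keeps scraping and alerting as before; M3 transparently becomes the long-term, replicated store behind it.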

SLIDE 24
  • 1. Suitable for many scenarios

M3 and Graphite

  • Ingests the Carbon TCP line protocol
  • Supports the Graphite query API
SLIDE 25

Graphite

[Diagram: My App sends Graphite metrics over the Carbon TCP line protocol into a replicated M3DB cluster, where Graphite and Prometheus metrics are stored side-by-side; Grafana and alerting query them]
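Carbon ingestion is switched on in the m3coordinator configuration. A hedged sketch; the listen port, pattern and storage policies below are illustrative:

```yaml
carbon:
  ingester:
    listenAddress: "0.0.0.0:7204"
    rules:
      # Match all Carbon metrics and store them at 10s resolution for 48h.
      - pattern: .*
        policies:
          - resolution: 10s
            retention: 48h
```

Pointing your existing carbon-relay or app at this address is then enough; no Graphite/Whisper servers are needed.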

SLIDE 26
  • 2. Scalable to billions of metrics

SLIDE 27
  • 2. Scalable to billions of metrics

M3 at scale

  • Collects metrics for 1000s of applications
  • No onboarding to monitoring or provisioning of servers (just add storage nodes as required)

SLIDE 28

  • 2. Scalable to billions of metrics

The reverse index uses FST segments, like Elasticsearch with Apache Lucene. It can run regexp queries over billions of metric names and dimensions, unlike other solutions out there.

[Diagram: m3coordinator and M3 Query fan a query out to M3DB nodes; each storage node finds metrics matching the query and returns them in parallel, knowing exactly where to extract series data from its local store]
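In practice this means a single PromQL query can regexp-match across dimensions; a sketch using the illustrative metric and labels from the earlier example:

```promql
# 5xx error rate for any /api route in any EU region,
# matched by regexp over labels (names are illustrative).
sum by (route, status) (
  rate(http_requests_total{route=~"/api/.*", region=~"eu-.*", status=~"5.."}[5m])
)
```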

SLIDE 29

Multi-Region

Global view with region-local storage

[Diagram: Grafana and alerting send a PromQL or Graphite query through an HTTP load balancer to M3 Query / M3 Coordinator in any of Regions 1-3; each region has its own replicated M3DB cluster]

SLIDE 30

Architected for Reliability and Scale

  • Global metrics collection and query
  • Low inter-region network bandwidth; data always kept in region
  • Replication across availability zones within a region as soon as a metric is collected

  • 2. Scalable to billions of metrics
SLIDE 31
  • 3. Focus on simple operation
SLIDE 32
  • M3 can be deployed on premise without any dependencies - it’s easy to get started.
    ○ One binary and a YAML configuration file
    ○ Can be easily deployed using your favourite config management tool
  • The clustered version is open source
    ○ HA setup is pretty straightforward
    ○ Scaling a cluster used to require a lot of manual work

  • 3. Focus on simple operation
SLIDE 33
  • M3 runs on Kubernetes, and the M3DB k8s operator can manage the cluster for you! See more at https://github.com/m3db/m3db-operator

  • 3. Focus on simple operation
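With the operator, a cluster is declared as a custom resource. A hedged sketch of what such a manifest can look like; the image tag, shard count and zone names are illustrative, and the exact fields are defined by the operator version you run:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: demo-cluster
spec:
  image: quay.io/m3db/m3dbnode:latest
  replicationFactor: 3
  numberOfShards: 256
  isolationGroups:
    # One replica per availability zone for HA.
    - name: zone-a
      numInstances: 1
    - name: zone-b
      numInstances: 1
    - name: zone-c
      numInstances: 1
```

Applying the manifest lets the operator handle placement, replication and scaling instead of the manual steps mentioned above.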
SLIDE 34

Why M3

  • 1. Suitable for many scenarios
  • 2. Scalable to billions of metrics
  • 3. Focus on simple operation
SLIDE 35

Come say hi!

SLIDE 36

M3 GitHub Monorepo (Apache 2 licensed): https://github.com/m3db/m3
M3 Slack: https://bit.ly/m3slack
Chronosphere: https://chronosphere.io
Twitter: https://twitter.com/chronosphereio

Thank you and Q&A

SLIDE 37

License: Apache 2
Website: https://www.m3db.io
Docs: https://docs.m3db.io
Mailing list: https://groups.google.com/forum/#!forum/m3db

M3 Links and References

SLIDE 38

M3 and Prometheus with read/write isolation

[Diagram: single region; My App is scraped by Prometheus, which writes through a dedicated M3 Coordinator, local to each availability zone, that coordinates replication into M3DB; Grafana and alerting read through a dedicated M3 Query, isolating queries from impacting writes]

SLIDE 39

Real time alerting of application metrics

What is Prometheus and M3 used for?

SLIDE 40

Tracking business metrics (e.g., searches for “books” with category “biographies” in a region):

m.Tagged(Tags{region="eu-west", category="books", subcategory="biographies"}).Counter("searches").Inc(1)

What is Prometheus and M3 used for?

SLIDE 41

Infrastructure metrics such as network routing and datacenter health

What is Prometheus and M3 used for?