M3 and Prometheus: Monitoring at Planet Scale for Everyone (Berlin, 2019-11-06) - PowerPoint PPT Presentation


SLIDE 1

M3 and Prometheus

Monitoring at Planet Scale for Everyone

Berlin, 2019-11-06 Rob Skillington & Łukasz Szczęsny

SLIDE 2

Who are we?

Rob Skillington

CTO at Chronosphere, @robskillington, M3DB creator, OpenMetrics contributor

Łukasz Szczęsny

Senior SRE at Chronosphere, @wybczu, M3 contributor

SLIDE 3

Let’s talk

  • Monitoring an increasing number of things…
  • Metrics being used as a platform more than ever…
  • Operating in many regions or environments…
  • M3 and Prometheus/Graphite…

SLIDE 4

High dimensionality metrics?


SLIDE 5

Example system being monitored

[Diagram: two regions, eu-west and eu-north, each running a frontend app backed by mysql and redis, serving clients v1.3 and v2.0]

SLIDE 6

Example system being monitored

[Diagram: the same eu-west/eu-north system, with clients v1.3 and v2.0 calling the eu-west frontend app]

Which code path to debug? We need to detect the failure and isolate it to:

  • route = /api/search
  • region = eu-west
  • client-version = v2.0

SLIDE 7

  • Let’s debug this using the HTTP status codes delivered by the frontends:

http_status_code

Let’s use high dimensionality metrics

Route (~100)     Status Code (~5)    Region (~12)    Client App Version (~40)
/api/search      2xx                 eu-east         1.3
/api/order       4xx                 eu-west         2.0
...              ...                 ...             ...

SLIDE 8

Revisiting this example...

[Diagram: the same eu-west/eu-north system, with clients v1.3 and v2.0 calling the eu-west frontend app]

The failure is isolated to:

  • route = /api/search
  • region = eu-west
  • client-version = v2.0
SLIDE 9

Ideally we would see...

SLIDE 10

How many time series is that?

100 routes * 5 status codes * 12 regions * 40 client versions

= 240,000 unique time series

Route (~100)     Status Code (~5)    Region (~12)    Client App Version (~40)
/api/search      2xx                 eu-east         1.3
/api/order       4xx                 eu-west         2.0
...              ...                 ...              ...

SLIDE 11

You can roll up metrics to make viewing fast

[Diagram: many raw series (region, client and status label combinations such as region=eu-west client=v1.2 status=2xx, region=eu-north client=v1.4 status=5xx, ...) rolled up into aggregate series per route and status class: status=2xx route=/api/search, status=4xx route=/api/search, status=5xx route=/api/search]

Partial-solution #1
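One common way to implement this kind of roll-up with Prometheus is a recording rule that aggregates away the high-cardinality labels. A minimal sketch, assuming a counter named http_requests_total with the labels from the earlier table (the rule name and interval are illustrative):

```yaml
groups:
  - name: http_rollups
    interval: 30s
    rules:
      # Keep only route and status; sum away region and client labels.
      - record: route_status:http_requests:rate5m
        expr: sum by (route, status) (rate(http_requests_total[5m]))
```

Dashboards and coarse alerts can then query the cheap rolled-up series while the raw series remain available for drill-down.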

SLIDE 12

For drill-down and highly granular alerting

[Diagram: the status=5xx route=/api/search series broken down further by user_country (de, us, lt, pl, ...) and client version]

240k time series is expensive but not too bad..? However, add any other dimension and it gets out of control (any multiplier on 240k explodes into the millions quickly).

e.g. unique user country codes = 249
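The cardinality arithmetic above is just a product of the distinct values per label; a minimal sketch in Go (the helper name is illustrative):

```go
package main

import "fmt"

// cardinality returns the number of unique label combinations:
// the product of each label's distinct-value count.
func cardinality(valueCounts ...int) int {
	total := 1
	for _, n := range valueCounts {
		total *= n
	}
	return total
}

func main() {
	// Counts from the slides: routes, status codes, regions, client versions.
	fmt.Println(cardinality(100, 5, 12, 40)) // 240000 unique series

	// Adding one more dimension (249 user country codes) multiplies again.
	fmt.Println(cardinality(100, 5, 12, 40, 249)) // 59760000 unique series
}
```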

SLIDE 13

What is Prometheus? What is M3?

Prometheus:

  • An open source monitoring system and time series database.
  • All-in-one, single-node monitoring solution using metrics.
  • First built at SoundCloud (began 2012, open sourced in 2014).

M3:

  • A distributed monitoring system and time series database, compatible as remote storage for Prometheus.
  • Built at Uber to scale monitoring horizontally and cost-effectively (began 2015, open sourced in 2018).

SLIDE 14

A single Prometheus instance can hold a reasonable amount of data (and you should always get started using Prometheus)

“This is fine.. I’m okay with the events that are unfolding currently”

Ok great, but what do I need?

SLIDE 15

Ok great, but what do I need?

SLIDE 16

Ok great, but what do I need?

Can I fit a service’s high cardinality metrics into an existing Prometheus instance? How do I scale up easily?

[Diagram: services mysql, my-cache, my-api and my-frontend all scraped by a single Prometheus instance]

SLIDE 17

So what is M3 and how does it help?

A horizontally scalable platform that supports multiple metric formats.

[Diagram: in each cloud region (#0 through #N), M3 Query and an M3 Coordinator with aggregation sit in front of a replicated M3DB cluster; Prometheus and Graphite write in, while Grafana (PromQL, Graphite) and alerting engines query out]

SLIDE 18

Why M3

  • 1. Suitable for many scenarios
  • 2. Scalable to billions of metrics
  • 3. Focus on simple operation
SLIDE 19
  • 1. Suitable for many scenarios

Cloud Native, Kubernetes, On-Prem, Multi-Region; Prometheus and Graphite compatible

SLIDE 20
  • 1. Suitable for many scenarios

M3 and Prometheus

  • Store metrics for weeks, months or years
  • Store metrics at different retentions based on mapping rules (e.g. app:nginx endpoints:/api*)
  • Scale up storage just by adding more nodes
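Retention-by-mapping-rule is configured in the m3coordinator downsampling section. A hedged sketch of what such a rule can look like; the exact keys can vary between M3 versions, and the resolutions and retentions here are illustrative:

```yaml
downsample:
  rules:
    mappingRules:
      - name: "nginx api endpoints"
        filter: "app:nginx endpoints:/api*"
        aggregations: ["Last"]
        storagePolicies:
          - resolution: 1m
            retention: 720h   # ~30 days at 1m resolution
          - resolution: 1h
            retention: 8760h  # ~1 year at 1h resolution
```

Metrics matching the filter are written at both resolutions, so long-range queries stay cheap while recent data stays detailed.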
SLIDE 21

[Diagram: My App scraped by two Prometheus instances, queried by Grafana and alerting]

SLIDE 22

DEMO

SLIDE 23

Prometheus remote read and write to M3DB

[Diagram: My App is scraped by two Prometheus instances, which remote read/write to a replicated M3DB cluster; Grafana and alerting query through Prometheus]
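Wiring Prometheus up this way is done in prometheus.yml. A minimal sketch, assuming an m3coordinator instance reachable at the hostname m3coordinator on its default port 7201:

```yaml
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"

remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```

Prometheus keeps scraping and alerting as before; M3 transparently becomes the long-term, replicated store behind it.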

SLIDE 24
  • 1. Suitable for many scenarios

M3 and Graphite

  • Ingests the Carbon TCP line protocol
  • Supports the Graphite query API
SLIDE 25

Graphite

[Diagram: My App sends Graphite metrics over the Carbon TCP line protocol into a replicated M3DB cluster, where Graphite and Prometheus metrics are stored side-by-side; Grafana and alerting query them]
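Carbon ingestion is switched on in the m3coordinator configuration. A hedged sketch; the listen port, pattern and storage policies below are illustrative:

```yaml
carbon:
  ingester:
    listenAddress: "0.0.0.0:7204"
    rules:
      # Match all Carbon metrics and store them at 10s resolution for 48h.
      - pattern: .*
        policies:
          - resolution: 10s
            retention: 48h
```

Pointing your existing carbon-relay or app at this address is then enough; no Graphite/Whisper servers are needed.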

SLIDE 26
  • 2. Scalable to billions of metrics

SLIDE 27
  • 2. Scalable to billions of metrics

M3 at scale

  • Collects metrics for 1000s of applications
  • No onboarding to monitoring or provisioning of servers (just add storage nodes as required)

SLIDE 28

  • 2. Scalable to billions of metrics

The reverse index uses FST segments, like Elasticsearch with Apache Lucene. It can run regexp queries over billions of metric names and dimensions, unlike other solutions out there.

[Diagram: m3coordinator and M3 Query fan a query out to M3DB nodes; each storage node finds metrics matching the query and returns them in parallel, knowing exactly where to extract series data from its local store]
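In practice this means a single PromQL query can regexp-match across dimensions; a sketch using the illustrative metric and labels from the earlier example:

```promql
# 5xx error rate for any /api route in any EU region,
# matched by regexp over labels (names are illustrative).
sum by (route, status) (
  rate(http_requests_total{route=~"/api/.*", region=~"eu-.*", status=~"5.."}[5m])
)
```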

SLIDE 29

Multi-Region

Global view with region-local storage

[Diagram: Grafana and alerting send a PromQL or Graphite query through an HTTP load balancer to M3 Query / M3 Coordinator in any of Regions 1-3; each region has its own replicated M3DB cluster]

SLIDE 30

Architected for Reliability and Scale

  • Global metrics collection and query
  • Low inter-region network bandwidth; data always kept in region
  • Replication across availability zones within a region as soon as a metric is collected

  • 2. Scalable to billions of metrics
SLIDE 31
  • 3. Focus on simple operation
SLIDE 32
  • M3 can be deployed on premise without any dependencies - it’s easy to get started.
    ○ One binary and a YAML configuration file
    ○ Can be easily deployed using your favourite config management tool
  • The clustered version is open source
    ○ HA setup is pretty straightforward
    ○ Scaling a cluster used to require a lot of manual work

  • 3. Focus on simple operation
SLIDE 33
  • M3 runs on Kubernetes, and the M3DB k8s operator can manage the cluster for you! See more at https://github.com/m3db/m3db-operator

  • 3. Focus on simple operation
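With the operator, a cluster is declared as a custom resource. A hedged sketch of what such a manifest can look like; the image tag, shard count and zone names are illustrative, and the exact fields are defined by the operator version you run:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: demo-cluster
spec:
  image: quay.io/m3db/m3dbnode:latest
  replicationFactor: 3
  numberOfShards: 256
  isolationGroups:
    # One replica per availability zone for HA.
    - name: zone-a
      numInstances: 1
    - name: zone-b
      numInstances: 1
    - name: zone-c
      numInstances: 1
```

Applying the manifest lets the operator handle placement, replication and scaling instead of the manual steps mentioned above.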
SLIDE 34

Why M3

  • 1. Suitable for many scenarios
  • 2. Scalable to billions of metrics
  • 3. Focus on simple operation
SLIDE 35

Come say hi!

SLIDE 36

M3 GitHub Monorepo (Apache 2 licensed): https://github.com/m3db/m3
M3 Slack: https://bit.ly/m3slack
Chronosphere: https://chronosphere.io
Twitter: https://twitter.com/chronosphereio

Thank you and Q&A

SLIDE 37

License: Apache 2
Website: https://www.m3db.io
Docs: https://docs.m3db.io
Mailing list: https://groups.google.com/forum/#!forum/m3db

M3 Links and References

SLIDE 38

M3 and Prometheus with read/write isolation

[Diagram: single region; My App is scraped by Prometheus, which writes through a dedicated M3 Coordinator, local to each availability zone, that coordinates replication into M3DB; Grafana and alerting read through a dedicated M3 Query, isolating queries from impacting writes]

SLIDE 39

Real time alerting of application metrics

What is Prometheus and M3 used for?

SLIDE 40

Tracking business metrics (e.g., searches for “books” with category “biographies” in a region):

m.Tagged(Tags{region="eu-west", category="books", subcategory="biographies"}).Counter("searches").Inc(1)

What is Prometheus and M3 used for?

SLIDE 41

Infrastructure metrics such as network routing and datacenter health

What is Prometheus and M3 used for?