Scaling Automated Database Monitoring at Uber with M3 and Prometheus

Richard Artoul


SLIDE 1

Scaling Automated Database Monitoring at Uber with M3 and Prometheus

Richard Artoul

SLIDE 2

Agenda

01 Automated database monitoring
02 Why scaling automated monitoring is hard
03 M3 architecture and why it scales so well
04 How you can use M3

SLIDE 3

Uber’s “Architecture”

[Diagram: Uber's architecture in 2015 vs. 2019]

  • 4000+ microservices
○ Most of which directly or indirectly depend on storage
  • 22+ storage technologies
○ Ranging from C* to MySQL
  • 1000s of dedicated servers running databases
○ Monitoring all of these is hard

SLIDE 4

Monitoring Databases

[Diagram: monitoring layers: Hardware, Technology, Application]

SLIDE 5

Hardware Level Metrics

SLIDE 6

Technology Level Metrics


SLIDE 7

Application Level Metrics

  • Number of successes, errors, and latency, broken down by:

○ All queries against a given database
○ All queries issued by a specific service
○ A specific query
■ SELECT * FROM TABLE WHERE USER_ID = ?
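As a rough sketch of what such instrumentation can look like (my illustration using the Prometheus Go client, not Uber's actual code; the metric and label names are assumptions):

    package main

    import (
        "fmt"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // Per-query metrics, labeled so they can be rolled up by database,
    // calling service, or individual (normalized) query fingerprint.
    var queryDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "db_query_duration_seconds",
            Help: "Latency of database queries.",
        },
        []string{"database", "service", "query", "status"},
    )

    func main() {
        prometheus.MustRegister(queryDuration)

        // Record one successful run of a normalized query fingerprint.
        queryDuration.WithLabelValues(
            "users-db", "trip-service",
            "SELECT * FROM TABLE WHERE USER_ID = ?", "success",
        ).Observe((42 * time.Millisecond).Seconds())

        fmt.Println("recorded 1 observation")
    }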

SLIDE 8

SLIDE 9

What’s so hard about that?

SLIDE 10

Monitoring Applications at Scale

1200 microservices w/ dedicated storage
× 100 instances per service
× 20 instances per DB cluster
× 20 queries per service
× 10 metrics per query
= 480 million unique time series

480 million unique time series × per-series pricing = 100+ million dollars!

SLIDE 11

Workload

  • Writes per second (post-replication): 110M
  • Datapoints emitted pre-aggregation: 800M
  • Unique metric IDs: 9B
  • Datapoints read per second: 200B

SLIDE 12

How do we do it?

SLIDE 13

A Brief History of M3

  • 2014-2015: Graphite + WhisperDB

○ No replication, operations were ‘cumbersome’

  • 2015-2016: Cassandra

○ Solved operational issues
○ 16x YoY growth
○ Expensive (> 1500 Cassandra hosts)
○ Compactions => had to run at R.F.=2

  • 2016-Today: M3DB
SLIDE 14

M3DB Overview

SLIDE 15

M3DB

An open source distributed time series database

  • Store data points of arbitrary timestamp precision, at any resolution, for any retention
  • Tag (key/value pair) based inverted index
  • Optimized file-system storage with minimal need for compactions of time series data for real-time workloads

SLIDE 16

High-Level Architecture

Like a Log-Structured Merge Tree (LSM), except that where a typical LSM has leveled or size-based compaction, M3DB has time-window compaction.
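A minimal sketch of the time-window idea (my illustration, not M3DB source; the two-hour block size is an assumption, in M3DB it is configurable per namespace):

    package main

    import (
        "fmt"
        "time"
    )

    // blockSize is an assumed block size for the sketch.
    const blockSize = 2 * time.Hour

    // blockStart maps a datapoint timestamp to the start of its
    // time-window block. All datapoints in the same window land in the
    // same block, so sealing a block once its window has passed takes
    // the place of LSM-style leveled or size-based compaction.
    func blockStart(ts time.Time) time.Time {
        return ts.Truncate(blockSize)
    }

    func main() {
        t := time.Date(2019, 5, 1, 13, 37, 0, 0, time.UTC)
        fmt.Println(blockStart(t)) // 2019-05-01 12:00:00 +0000 UTC
    }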

SLIDE 17

Topology and Consistency

  • Strongly consistent topology (using etcd)

○ No gossip
○ Replicated with zone/rack-aware layout and configurable replication factor

  • Consistency managed via synchronous quorum writes and reads (see the sketch below)

○ Configurable consistency level for both reads and writes
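As a sketch of the quorum arithmetic (my illustration): with a given replication factor, a majority quorum needs acknowledgements from more than half the replicas, so every read overlaps at least one fully acknowledged write.

    package main

    import "fmt"

    // quorum returns the acknowledgements required for a majority
    // quorum given a replication factor: more than half the replicas.
    func quorum(replicationFactor int) int {
        return replicationFactor/2 + 1
    }

    func main() {
        fmt.Println(quorum(3)) // 2: one replica can be down while
        // quorum reads and writes still succeed
    }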

SLIDE 18

Cost Savings and Performance

  • Disk usage in 2017

○ ~1.4PB for Cassandra at R.F.=2
○ ~200TB for M3DB at R.F.=3

  • Much higher throughput per node with M3DB

○ Hundreds of thousands of writes/s on commodity hardware

SLIDE 19

But what about the index?

SLIDE 20

Centralized Elasticsearch Index

  • Actual time series data was stored in Cassandra and later M3DB
  • But indexing of the data (for querying) was handled by Elasticsearch
  • Worked for us for a long time
  • … but scaling writes and reads required a lot of engineering
SLIDE 21

Elasticsearch Index - Write Path

[Diagram: write path. m3agg feeds the E.S Indexer, which writes to Elasticsearch; a Redis "don't write" cache in front of the indexer suppresses re-indexing of recently seen metrics (sketched below); the time series data itself is written to M3DB.]

An influx of new metrics is caused by:

1. Large service deployments
2. Datacenter failovers
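A minimal sketch of the "don't write" cache idea (my reconstruction; the production system used Redis rather than an in-process map):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // dontWriteCache remembers recently indexed metric IDs so the
    // indexing tier is only hit the first time a metric is seen
    // within the TTL window.
    type dontWriteCache struct {
        mu   sync.Mutex
        ttl  time.Duration
        seen map[string]time.Time
    }

    // shouldIndex reports whether the metric still needs indexing and
    // marks it as indexed for the next TTL window.
    func (c *dontWriteCache) shouldIndex(id string, now time.Time) bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        if t, ok := c.seen[id]; ok && now.Sub(t) < c.ttl {
            return false // indexed recently: don't write again
        }
        c.seen[id] = now
        return true
    }

    func main() {
        c := &dontWriteCache{ttl: time.Hour, seen: map[string]time.Time{}}
        now := time.Now()
        fmt.Println(c.shouldIndex("db_query_duration_seconds{...}", now)) // true
        fmt.Println(c.shouldIndex("db_query_duration_seconds{...}", now)) // false
    }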

SLIDE 22

Elasticsearch Index - Read Path

[Diagram: read path. A query (1) hits Elasticsearch to resolve which time series match, then (2) fetches the matching data from M3DB.]

SLIDE 23

Elasticsearch Index - Read Path

[Diagram: a Redis query cache sits between the query path and Elasticsearch.] The cache needs a high T.T.L to prevent overwhelming E.S … but a high T.T.L means a long delay before new time series become queryable.

SLIDE 24

Elasticsearch Index - Read Path

[Diagram: the query fans out to two Elasticsearch clusters, each fronted by its own Redis query cache: "E.S Short" with a short-TTL cache for recently created series, and "E.S Long" with a long-TTL cache. Results are merged on read.]
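A sketch of the merge-on-read step (my illustration; the function names are assumptions): query both indexes and return the deduplicated union of matching series IDs.

    package main

    import "fmt"

    // mergeOnRead queries the short-term and long-term indexes and
    // returns the deduplicated union of matching series IDs.
    func mergeOnRead(queryShort, queryLong func(q string) []string, q string) []string {
        seen := make(map[string]struct{})
        var out []string
        for _, ids := range [][]string{queryShort(q), queryLong(q)} {
            for _, id := range ids {
                if _, ok := seen[id]; !ok {
                    seen[id] = struct{}{}
                    out = append(out, id)
                }
            }
        }
        return out
    }

    func main() {
        short := func(q string) []string { return []string{"series-a", "series-b"} }
        long := func(q string) []string { return []string{"series-b", "series-c"} }
        fmt.Println(mergeOnRead(short, long, `service="foo"`)) // [series-a series-b series-c]
    }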

SLIDE 25

Elasticsearch Index - Final Breakdown

  • Two Elasticsearch clusters with separate configuration
  • Two query caches
  • Two don't-write caches
  • A stateful indexing tier that requires consistent hashing, an in-memory cache, and breaks everything if you deploy it too quickly
  • Another service just for automatically rotating the short-term Elasticsearch cluster indexes
  • A background process that's always running, trying to delete stale documents from the long-term index

SLIDE 26

M3 Inverted Index

  • Not nearly as expressive or feature-rich as Elasticsearch
  • … but like M3DB, it’s designed for high throughput
  • Temporal by nature (T.T.Ls are cheap)
  • Fewer moving parts
SLIDE 27

M3 Inverted Index

  • Primary use case: support queries of the form

○ service = "foo" AND
○ endpoint = "bar" AND
○ client_version regexp matches "3.*"

[Diagram: query plan tree. client_version="3.*" expands to the OR (union) of matching values such as client_version="3.1" and client_version="3.2"; the three terms are then combined with AND (intersection).]

AND → Intersection
OR → Union
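For comparison, this is the same shape of query a Prometheus user writes as a label selector (standard PromQL matcher syntax; the label names follow the slide's example):

    {service="foo", endpoint="bar", client_version=~"3.*"}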

SLIDE 28

M3 Inverted Index - F.S.Ts

  • The inverted index is similar to Lucene in that it uses F.S.T segments to build an efficient and compressed structure for fast regexp matching.
  • Each label has its own F.S.T that can be searched to find the offset at which to unpack another data structure containing the set of metric IDs associated with a particular label value (the postings list).

Encoded relationships:

are --> 4
ate --> 2
see --> 3

Compressed + fast regexp!
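The slide's example can be reproduced with vellum, an open source Go FST library in the same family as Lucene's FSTs (a sketch; error handling mostly elided):

    package main

    import (
        "bytes"
        "fmt"

        "github.com/couchbase/vellum"
    )

    func main() {
        // Build an FST mapping terms to integer outputs (think:
        // postings list offsets). Keys must be inserted in sorted order.
        var buf bytes.Buffer
        b, err := vellum.New(&buf, nil)
        if err != nil {
            panic(err)
        }
        b.Insert([]byte("are"), 4)
        b.Insert([]byte("ate"), 2)
        b.Insert([]byte("see"), 3)
        b.Close()

        // Shared prefixes ("a") and suffixes ("e") are stored once, so
        // the structure is compressed, and it can be walked with an
        // automaton for fast regexp matching.
        fst, _ := vellum.Load(buf.Bytes())
        val, exists, _ := fst.Get([]byte("are"))
        fmt.Println(val, exists) // 4 true
    }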

SLIDE 29

M3 Inverted Index - Postings List

  • For every term of the form service="foo" we need to store the set of metric IDs (integers) that match; this set is called a postings list.
  • The index is broken into blocks and segments, so we need to be able to calculate intersections (AND, across terms) and unions (OR, within a term).

[Diagram: a 12 P.M. -> 2 P.M. index block in which service="foo", endpoint="bar", and client_version="3.*" are INTERSECTed; client_version="3.*" is itself the UNION of client_version="3.1" and client_version="3.2".]

Intersect → AND
Union → OR

The primary data structure for the postings list in M3DB is the roaring bitmap (example below).
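A sketch of those set operations using an open source roaring bitmap library for Go (the metric IDs are made up; M3DB's internal bitmap implementation differs):

    package main

    import (
        "fmt"

        "github.com/RoaringBitmap/roaring"
    )

    func main() {
        // Postings lists: the metric IDs matching each label value.
        serviceFoo := roaring.BitmapOf(1, 2, 3, 7, 9)
        endpointBar := roaring.BitmapOf(2, 3, 9, 12)
        version31 := roaring.BitmapOf(3, 9)
        version32 := roaring.BitmapOf(2, 14)

        // client_version="3.*" is the union (OR) of matching values...
        version3x := roaring.Or(version31, version32)

        // ...and the full query is the intersection (AND) across terms.
        result := roaring.And(serviceFoo, endpointBar)
        result.And(version3x)

        fmt.Println(result.ToArray()) // [2 3 9]
    }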

SLIDE 30

M3 Inverted Index - File Structure

────────────────────────────── Time ──────────────────────────────▶

/var/lib/m3db/data/namespace-a/shard-0
┌───────────────┬───────────────┬───────────────┬───────────────┐
│ Fileset File  │ Fileset File  │ Fileset File  │ Fileset File  │
│ Block         │ Block         │ Block         │ Block         │
└───────────────┴───────────────┴───────────────┴───────────────┘

/var/lib/m3db/index/namespace-a
┌───────────────────────────────┬───────────────────────────────┐
│ Index Fileset File            │ Index Fileset File            │
│ Block                         │ Block                         │
└───────────────────────────────┴───────────────────────────────┘

(Data fileset files are written per shard, index fileset files per namespace; index blocks span a larger time window than data blocks.)

SLIDE 31

M3 Summary

  • Time series compression + temporal data and file structures
  • Distributed and horizontally scalable by default
  • Complexity pushed to the storage layer
SLIDE 32

How you can use M3

SLIDE 33

M3 and Prometheus

Scalable monitoring

SLIDE 34

M3 and Prometheus

[Diagram: My App is scraped by Prometheus; Prometheus remote-writes through the M3 Coordinator to three M3DB nodes; Grafana dashboards and alerting sit on top.]

Directly query M3 using the coordinator as a single Grafana datasource.
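A minimal prometheus.yml sketch of that integration (the m3coordinator hostname is a placeholder; 7201 is the coordinator's default port):

    # Ship Prometheus data into M3 and read it back through the coordinator.
    remote_write:
      - url: "http://m3coordinator:7201/api/v1/prom/remote/write"

    remote_read:
      - url: "http://m3coordinator:7201/api/v1/prom/remote/read"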

SLIDE 35

Roadmap

SLIDE 36

Roadmap

  • Bring the Kubernetes operator out of alpha with further lifecycle management
  • Arbitrary out-of-order writes, for writing data into the past and backfilling
  • Asynchronous cross-region replication (for disaster recovery)
  • Evolving M3DB into a generic time series database (think event store)

○ Efficient compression of events in the form of Protobuf messages

SLIDE 37

We’re Hiring!

  • Want to work on M3DB? We’re hiring!

○ Reach out to me at rartoul@uber.com