Monarch: Google’s planet-scale streaming monitoring infrastructure

slide-1
SLIDE 1

Monarch

Google’s planet-scale streaming monitoring infrastructure.

slide-2
SLIDE 2

  • Background
  • Architecture and Data Model
  • Queries
  • Using Monarch
  • Monarch Platform
  • Lessons Learned re: Scaling

slide-3
SLIDE 3

Monitoring at Google

Ref: https://www.google.com/about/datacenters/inside/locations/index.html

slide-4
SLIDE 4

Monitoring at Google

Ref: https://www.google.com/about/datacenters/inside/locations/index.html

Global Span

Huge Volume

Many Kinds

  • Hardware/networking
  • OS
  • Infrastructure services
  • Big, user-facing services
  • Smaller services

Constant change

slide-5
SLIDE 5

Essentials of Monarch Scaling

  • Maintain good hygiene
  • Scale horizontally
  • Reduce dimensions early

slide-6
SLIDE 6

  • Background
  • Architecture and Data Model
  • Queries
  • Using Monarch
  • Monarch Platform
  • Lessons Learned re: Scaling

slide-7
SLIDE 7

Global Extent

Ref: https://www.google.com/about/datacenters/inside/locations/index.html

slide-8
SLIDE 8

Monarch Zone

Monitor Locally

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingest Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library.]

slide-9
SLIDE 9

Monarch Zone: Ingestion, Retention and Queries

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingest Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-10
SLIDE 10

Monarch Zone: Ingestion

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingest Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-11
SLIDE 11

Metrics

Metric: /http/server/response_latencies (Distribution, cumulative)

  path (string)   status_code_class (int64)   values
  /statusz        200                         ...
  /inspectz       200                         ...
  /requestz       500                         ...
  /requestz       200                         ...

slide-12
SLIDE 12

Target Schema

Target schema: BorgTask

  user (string)   job (string)   cell (string)   task_num (int)
  jones           server         ip              32

slide-13
SLIDE 13

Monarch Zone: Ingestion

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingestion Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-14
SLIDE 14

Monarch Zone: Retention

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingestion Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-15
SLIDE 15

Streams

stream-identifier:
  BorgTask: user=jones, job=server, cell=ip, task_num=32
  /http/server/response_latencies: path=/inspectz, status_code_class=200

history:
  timestamp   value
  1:21        ...
  1:20        ...
  1:19        ...
  ...         ...
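One way to picture the stream model on this slide is as a keyed time series: the stream identifier combines the target's field values with the metric's field values, and the history is a list of timestamped points. The following is a minimal Python sketch under that reading. The class names `StreamId` and `Stream` are hypothetical, not Monarch's API, and the point values are left as `None` because the slide elides them.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass(frozen=True)
class StreamId:
    """Hypothetical stream identifier: target and metric names plus their field values."""
    target_schema: str
    target_fields: Tuple[Tuple[str, Any], ...]
    metric: str
    metric_fields: Tuple[Tuple[str, Any], ...]


@dataclass
class Stream:
    """A stream identifier together with its history of (timestamp, value) points."""
    stream_id: StreamId
    history: List[Tuple[str, Any]] = field(default_factory=list)

    def append(self, timestamp: str, value: Any) -> None:
        self.history.append((timestamp, value))


# Field values taken from the slides; the structure itself is an assumption.
sid = StreamId(
    target_schema="BorgTask",
    target_fields=(("user", "jones"), ("job", "server"),
                   ("cell", "ip"), ("task_num", 32)),
    metric="/http/server/response_latencies",
    metric_fields=(("path", "/inspectz"), ("status_code_class", 200)),
)

stream = Stream(sid)
for ts in ("1:19", "1:20", "1:21"):
    stream.append(ts, None)  # values are elided ("...") on the slide

# Because StreamId is frozen (hashable), it can serve as a dictionary key.
index = {stream.stream_id: stream}
```

Making the identifier immutable and hashable is a design choice for this sketch: it lets a store look up a stream's history directly by its full set of field values.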

slide-16
SLIDE 16

The Data Model for Queries

BorgTask :: /rpc/server/server_latencies

[Table: one row per stream. The stream-id columns (labeled user, status_code_class, path, job, cell, task_num on the slide) identify the stream, with sample values such as jones, server, ip, 876. The time series column, server_latencies, holds each stream's points over a time range such as 10:52-1:21.]

slide-17
SLIDE 17

Monarch Zone: Retention

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingestion Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-18
SLIDE 18

Monarch Zone: Query

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingestion Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-19
SLIDE 19

Monarch Zone: Evaluation and Notification

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Sample Server, Ingestion Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-20
SLIDE 20

Monarch Zone

[Diagram: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Sample Server, Ingestion Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library. Notification and Query are also labeled.]

slide-21
SLIDE 21

Ref: https://www.google.com/about/datacenters/inside/locations/index.html

slide-22
SLIDE 22

Local > Global View

[Diagram: labeled components are Root Mixer, Evaluator, Config Server, and Leaves.]

slide-23
SLIDE 23

Global Monarch

[Diagram: global Monarch. Labeled components: Root Mixer, Evaluator, Config Server, Leaves (global zone), Configuration, and the Zone Mixers of the Zones. Notification and Query are also labeled.]

slide-24
SLIDE 24

Global Monarch

[Diagram: global Monarch. Labeled components: Root Mixer, Evaluator, Config Server, Leaves (global zone), Configuration, and the Zone Mixers of the Zones. Notification and Query are also labeled.]

slide-25
SLIDE 25

Global Monarch

[Diagram: global Monarch. Labeled components: Root Mixer, Evaluator, Config Server, Leaves (global zone), Configuration, and the Zone Mixers of the Zones. Notification and Query are also labeled.]

slide-26
SLIDE 26

Integrated Monarch

[Diagram: the Monarch Zones, each with many Leaves, shown together with Global Monarch.]

slide-27
SLIDE 27

  • Background
  • Architecture and Data Model
  • Queries
  • Using Monarch
  • Monarch Platform
  • Lessons Learned re: Scaling

slide-28
SLIDE 28

Query

    Query(
        Fetch(Raw('BorgTask', '/http/server/response_latency'),
              {'user': 'gmail', 'status_code_class': 200})
        | Window(Delta('5m'))
        | GroupBy([job, cell], Sum())
        | Point(Percentile(95)),
        '1h', '5m')

  • Also: Join, PickTopStreams, MapStreamId, Union
  • General expressions
  • A large set of aggregation functions
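To make the operator pipeline in this query more concrete, here is a toy, in-memory Python sketch of what each stage could compute: filter streams by field values, take the delta of a cumulative distribution over a window, sum distributions per group, and read a percentile from the summed buckets. This is not Monarch's implementation. The function names, the bucket bounds in `BOUNDS`, and all sample data are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds (ms) for a latency distribution.
BOUNDS = [10, 50, 100, 500]

# Toy streams: field values plus cumulative bucket counts keyed by minute.
STREAMS = [
    {"labels": {"user": "gmail", "job": "fe", "cell": "aa", "status_code_class": 200},
     "points": {0: [0, 0, 0, 0], 5: [90, 8, 1, 1]}},
    {"labels": {"user": "gmail", "job": "fe", "cell": "aa", "status_code_class": 200},
     "points": {0: [10, 0, 0, 0], 5: [100, 5, 5, 0]}},
    {"labels": {"user": "gmail", "job": "fe", "cell": "aa", "status_code_class": 500},
     "points": {0: [0, 0, 0, 0], 5: [0, 0, 0, 9]}},
]


def fetch(streams, predicate):
    """Keep streams whose field values match every entry in the predicate."""
    return [s for s in streams
            if all(s["labels"].get(k) == v for k, v in predicate.items())]


def window_delta(streams, start, end):
    """Turn cumulative bucket counts into the change between two timestamps."""
    out = []
    for s in streams:
        delta = [b - a for a, b in zip(s["points"][start], s["points"][end])]
        out.append({"labels": s["labels"], "value": delta})
    return out


def group_by_sum(streams, keys):
    """Sum distributions bucket by bucket for streams sharing the given fields."""
    groups = defaultdict(lambda: [0] * len(BOUNDS))
    for s in streams:
        key = tuple(s["labels"][k] for k in keys)
        groups[key] = [x + y for x, y in zip(groups[key], s["value"])]
    return dict(groups)


def percentile(buckets, pct):
    """Return the upper bound of the first bucket reaching the percentile."""
    threshold = sum(buckets) * pct / 100.0
    running = 0
    for bound, count in zip(BOUNDS, buckets):
        running += count
        if running >= threshold:
            return bound
    return BOUNDS[-1]


fetched = fetch(STREAMS, {"user": "gmail", "status_code_class": 200})
deltas = window_delta(fetched, 0, 5)
grouped = group_by_sum(deltas, ["job", "cell"])
result = {key: percentile(buckets, 95) for key, buckets in grouped.items()}
print(result)
```

Note the order of operations: the distributions are summed across streams first and the percentile is taken last, which is the order the pipeline on the slide uses.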

slide-29
SLIDE 29

The Life of a Query

[Diagram: the path of a Query. Levels shown: Root Mixer, Zone Mixer, Leaf, Repo. Operator labels: Fetch, Window, GroupBy, Point. A Response is returned.]

slide-30
SLIDE 30

The Life of a Query

[Diagram: the path of a Query. Levels shown: Root Mixer, Zone Mixer, Leaf, Repo. Operator labels, shown twice: Fetch, Window, GroupBy, Point. A Response is returned.]

slide-31
SLIDE 31

The Life of a Query

[Diagram: the path of a Query. Levels shown: Root Mixer, Zone Mixer, Leaf, Repo. Operator labels, in the order shown: Fetch, Window, GroupBy, Point (twice); Fetch, Window, GroupBy; Fetch. A Response is returned.]

slide-32
SLIDE 32

The Life of a Query

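[Diagram: the path of a Query. Levels shown: Root Mixer, Zone Mixer, Leaf, Repo. Operator labels, in the order shown: GroupBy, Point; Fetch, Window, GroupBy; Fetch. A Response is returned.]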
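These slides show the same operators appearing at more than one level of the hierarchy. A common way to arrange that, sketched here as an assumption rather than a description of Monarch's internals, is for lower levels to compute partial aggregates over their local data and for each mixer to merge the partials it receives. The function names `leaf_partial` and `mixer_merge` and the sample data are hypothetical.

```python
from collections import Counter


def leaf_partial(points, keys):
    """Leaf level: pre-aggregate local (labels, value) points by group key."""
    partial = Counter()
    for labels, value in points:
        partial[tuple(labels[k] for k in keys)] += value
    return partial


def mixer_merge(partials):
    """Mixer level: merge partial sums from children without seeing raw points."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values per key
    return merged


KEYS = ["job", "cell"]

# Hypothetical data held by three leaves in two zones.
leaf_a = [({"job": "fe", "cell": "aa"}, 3),
          ({"job": "fe", "cell": "aa"}, 4),
          ({"job": "be", "cell": "aa"}, 1)]
leaf_b = [({"job": "fe", "cell": "aa"}, 5)]
leaf_c = [({"job": "be", "cell": "bb"}, 2)]

zone_1 = mixer_merge([leaf_partial(leaf_a, KEYS), leaf_partial(leaf_b, KEYS)])
zone_2 = mixer_merge([leaf_partial(leaf_c, KEYS)])
root = mixer_merge([zone_1, zone_2])
print(dict(root))
```

This works because a sum can be computed from partial sums. In this sketch the amount of data sent upward depends on the number of groups, not on the number of raw points each leaf holds.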

slide-33
SLIDE 33

  • Background
  • Architecture and Data Model
  • Queries
  • Using Monarch
  • Monarch Platform
  • Lessons Learned re: Scaling

slide-34
SLIDE 34

Panopticon

slide-35
SLIDE 35

Using Panopticon

Retention Policy

slide-36
SLIDE 36

Using Panopticon

  • Retention Policy
  • Query

slide-37
SLIDE 37

Using Panopticon

  • Retention Policy
  • Query
  • Configure alert

slide-38
SLIDE 38

Using Panopticon

  • Retention Policy
  • Query
  • Configure alert
  • Set up consoles

slide-39
SLIDE 39

  • Background
  • Architecture and Data Model
  • Queries
  • Using Monarch
  • Monarch Platform
  • Lessons Learned re: Scaling

slide-40
SLIDE 40

Monarch as Platform

  • A custom console service
  • Python-based configuration libraries that encode best practices
  • Really automatic monitoring
  • Cross-company monitoring
  • SLA definition and alerting
  • Automated monitoring of rollouts

. . .

slide-41
SLIDE 41

Google Stackdriver

  • Monarch is the backend for Google Stackdriver
  • Monitors cloud customers and Google services used by those customers
  • A good deal of important development to do this:
      • Encryption at rest
      • Carefully controlled and audited access
      • Different ways of naming things and data model

slide-42
SLIDE 42

  • Background
  • Architecture and Data Model
  • Queries
  • Using Monarch
  • Monarch Platform
  • Lessons Learned re: Scaling

slide-43
SLIDE 43

Lessons Learned re: Scaling

  • Maintain Good Hygiene
  • Scale horizontally -- only -- and it’s hard!
  • Reduce dimensions early

slide-44
SLIDE 44

Lessons Learned - Good Hygiene

  • Concurrency: don’t make long tails longer.
  • Periodically assess all components.
  • Always be deprecating.
  • Study outliers carefully!

slide-45
SLIDE 45

Lessons Learned - Scaling Horizontally

  • It’s hard, but it’s the only way.
  • Increase the number of leaves and zones.
  • Watch out for:
      • Centralized services that become bottlenecks.
      • Non-constant per-backend costs.
      • Query fan-out.
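A short calculation shows why query fan-out is worth watching as leaves and zones are added. Assume, purely for illustration, that each backend is independently slow with some small probability; then the chance that a fanned-out query waits on at least one slow backend grows quickly with the number of backends. The function name and the numbers below are assumptions, not figures from the talk.

```python
def p_any_slow(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent backends is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


# Illustrative only: 1% of responses from any single backend are slow.
for fanout in (1, 10, 100, 1000):
    print(fanout, round(p_any_slow(0.01, fanout), 3))
```

Under these assumptions, a response that is rare for one backend becomes the common case for a query that touches hundreds of them.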

slide-46
SLIDE 46

Lessons Learned - Reduce Dimensions Early

  • Aggregate data as it arrives.
  • Configuration and data multiplexing are important.
  • Users must be able to see “through” the aggregation.
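As a rough illustration of aggregating data as it arrives, the Python sketch below sums incoming values by a reduced set of fields, dropping `task_num`, while recording which dropped field values contributed to each total. Recording contributors is only one possible reading of seeing "through" the aggregation. The class `IngestAggregator` and its behavior are hypothetical and are not described in the slides.

```python
from collections import defaultdict


class IngestAggregator:
    """Hypothetical ingest-time aggregator that drops some fields early."""

    def __init__(self, keep_fields):
        self.keep_fields = tuple(keep_fields)
        self.totals = defaultdict(float)      # aggregated value per reduced key
        self.contributors = defaultdict(set)  # dropped field values per reduced key

    def ingest(self, labels, value):
        key = tuple(labels[f] for f in self.keep_fields)
        self.totals[key] += value
        dropped = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in self.keep_fields))
        self.contributors[key].add(dropped)


agg = IngestAggregator(["user", "job", "cell"])
for task_num, value in [(0, 2.0), (1, 3.0), (2, 5.0)]:
    agg.ingest({"user": "jones", "job": "server", "cell": "ip",
                "task_num": task_num}, value)

key = ("jones", "server", "ip")
print(agg.totals[key], len(agg.contributors[key]))
```

The stored total has one entry per (user, job, cell) instead of one per task, and the contributor set still lets a user ask which tasks fed a given total.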

slide-47
SLIDE 47

Lessons Learned - See through aggregation

slide-48
SLIDE 48

Lessons Learned - See through aggregation

slide-49
SLIDE 49

Lessons Learned - See through aggregation

slide-50
SLIDE 50

Lessons Learned re: Scaling

  • Maintain Good Hygiene
  • Scale horizontally -- only -- and it’s hard!
  • Reduce dimensions early

This is a sampling of lessons we’ve learned -- there are many more.

slide-51
SLIDE 51

Thank You