Monarch
Google’s planet-scale streaming monitoring infrastructure.
Outline
Background
Architecture and Data Model
Queries
Using Monarch
Monarch Platform
Lessons Learned re: Scaling
Monitoring at Google
Ref: https://www.google.com/about/datacenters/inside/locations/index.html
Global span
Huge volume
Many kinds
Constant change
Essentials of Monarch Scaling
Maintain good hygiene
Scale horizontally
Reduce dimensions early
Global Extent
Ref: https://www.google.com/about/datacenters/inside/locations/index.html
Monarch Zone
Monitor Locally
[Figure: a Monarch zone. Labeled components: Configuration, Zone Mixer, Evaluator, Ingest Router, Assigner, Leaves, Recovery Logs, Repository, and a Target with the Streamz Library.]
Monarch Zone: Ingestion, Retention and Queries
[Figure: the Monarch zone diagram, now also labeled with Notification and Query.]
Monarch Zone: Ingestion
Metrics
Metric: /http/server/response_latencies (Distribution, cumulative)
Fields: path (string), status_code_class (int64)
Example field values: /statusz 200; /inspectz 200; /requestz 500; /requestz 200

Target Schema
Schema: BorgTask
Fields: user (string), job (string), cell (string), task_num (int)
Example values: jones, server, ip, 32
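One way to picture the two schemas is as typed field lists. The sketch below is illustrative only: the `Schema` class and its `validate` method are hypothetical names, not Monarch's API. Only the schema names, field names, field types and sample values come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """A named, ordered list of (field name, Python type) pairs (hypothetical)."""
    name: str
    fields: tuple

    def validate(self, values: dict) -> bool:
        """True if `values` has exactly the schema's fields, with matching types."""
        if set(values) != {field_name for field_name, _ in self.fields}:
            return False
        return all(isinstance(values[n], t) for n, t in self.fields)


# Target schema from the slide: identifies the monitored entity.
BORG_TASK = Schema(
    "BorgTask",
    (("user", str), ("job", str), ("cell", str), ("task_num", int)),
)

# Metric fields from the slide: they qualify each measurement.
RESPONSE_LATENCIES = Schema(
    "/http/server/response_latencies",
    (("path", str), ("status_code_class", int)),
)

target = {"user": "jones", "job": "server", "cell": "ip", "task_num": 32}
metric = {"path": "/requestz", "status_code_class": 500}
ok = BORG_TASK.validate(target) and RESPONSE_LATENCIES.validate(metric)
```

Keeping the target schema separate from the metric fields mirrors the slides, where the same BorgTask target can carry many different metrics.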
Monarch Zone: Retention
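Streams
A stream is a stream identifier plus a history of (timestamp, value) points.
Stream identifier: BorgTask (jones, server, ip, 32) with /http/server/response_latencies (/inspectz, 200)
History: one value at each of the timestamps 1:19, 1:20, 1:21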
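The stream idea can be sketched as a small data structure: an identifier built from the target and metric field values, plus an append-only history. The class and attribute names here are hypothetical, timestamps are written as minutes (1:19 becomes 79), and the values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical in-memory stream: identifier plus (timestamp, value) history."""
    target: tuple          # target field values, e.g. ("jones", "server", "ip", 32)
    metric: str            # metric name
    metric_fields: tuple   # metric field values, e.g. ("/inspectz", 200)
    history: list = field(default_factory=list)

    @property
    def stream_id(self):
        """Everything except the history identifies the stream."""
        return (self.target, self.metric, self.metric_fields)

    def append(self, timestamp, value):
        """Add a point; timestamps must strictly increase."""
        if self.history and timestamp <= self.history[-1][0]:
            raise ValueError("timestamps must increase")
        self.history.append((timestamp, value))


s = Stream(("jones", "server", "ip", 32),
           "/http/server/response_latencies",
           ("/inspectz", 200))
for minute, value in [(79, 12.0), (80, 15.5), (81, 11.2)]:  # made-up values
    s.append(minute, value)
```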
The Data Model for Queries
Table: BorgTask :: /rpc/server/server_latencies
Stream-id columns, as labeled on the slide: user, status_code_class, path, job, cell, task_num
Time series column: server_latencies

user   job     cell  task_num  (metric fields)  server_latencies
jones  server  ip    ?         DB    Alloc      10:52-1:21 10:42-01:21 ...
...
jones  server  ip    876       DB    Query      10:52-1:21 10:42-01:21 ...
jones  server  ip    877       DB    Undo       10:52-1:21 10:42-01:21 ...
...
emons  client  qr    33        Help  Ask        07:33-4:49 07:38-4:49 ...
Monarch Zone: Query
Monarch Zone: Evaluation and Notification
[Figure: the Monarch zone diagram; this version also shows a Sample Server alongside the Ingestion Router.]
Local > Global View
[Figure: global components. Labels: Root Mixer, Evaluator, Leaves, Config Server.]
Global Monarch
[Figure: Global Monarch. Labels: Root Mixer, Notification, Query, Evaluator, Config Server, Leaves (global zone), Configuration, Zone Mixers, Zones.]
Integrated Monarch
[Figure: Global Monarch shown together with the Monarch zones and their leaves.]
Query
Query(
    Fetch(Raw('BorgTask', '/http/server/response_latency'),
          {'user': 'gmail', 'status_code_class': 200})
    | Window(Delta('5m'))
    | GroupBy([job, cell], Sum())
    | Point(Percentile(95)),
    '1h', '5m')
Also: Join, PickTopStreams, MapStreamId, Union
General expressions
A large set of aggregation functions
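To make the pipeline concrete, here is a much-simplified, in-memory sketch of the four stages for a single 5-minute window. It is not Monarch's implementation or API: the function names, the bucket bounds, the histogram-as-list representation and the sample data are all assumptions made for illustration, and the percentile is approximated by the upper bound of the bucket it falls in.

```python
# Hypothetical histogram bucket upper bounds, in milliseconds.
BUCKET_UPPER_MS = [10, 100, 1000]


def fetch(streams, filters):
    """Fetch stage: keep streams whose fields match every filter."""
    return [s for s in streams
            if all(s["fields"].get(k) == v for k, v in filters.items())]


def window_delta(stream, start, end):
    """Window(Delta) stage: per-bucket increase of a cumulative histogram."""
    points = dict(stream["points"])
    return [after - before
            for before, after in zip(points[start], points[end])]


def group_by_sum(streams, keys, start, end):
    """GroupBy(Sum) stage: add up windowed histograms that share the key fields."""
    groups = {}
    for s in streams:
        key = tuple(s["fields"][k] for k in keys)
        delta = window_delta(s, start, end)
        previous = groups.get(key, [0] * len(delta))
        groups[key] = [a + b for a, b in zip(previous, delta)]
    return groups


def percentile(histogram, p):
    """Point(Percentile) stage: bucket bound holding the p-th percentile."""
    total = sum(histogram)
    if total == 0:
        return None
    running = 0
    for upper, count in zip(BUCKET_UPPER_MS, histogram):
        running += count
        if running >= total * p / 100:
            return upper
    return BUCKET_UPPER_MS[-1]


def make_stream(user, task_num, at_0, at_5):
    """Build a made-up stream with cumulative histograms at minutes 0 and 5."""
    fields = {"user": user, "job": "fe", "cell": "aa",
              "task_num": task_num, "status_code_class": 200}
    return {"fields": fields, "points": [(0, at_0), (5, at_5)]}


streams = [
    make_stream("gmail", 0, [5, 1, 0], [95, 6, 1]),    # delta [90, 5, 1]
    make_stream("gmail", 1, [0, 0, 0], [85, 14, 1]),   # delta [85, 14, 1]
    make_stream("other", 0, [0, 0, 0], [1, 1, 500]),   # filtered out by Fetch
]

selected = fetch(streams, {"user": "gmail", "status_code_class": 200})
grouped = group_by_sum(selected, ["job", "cell"], start=0, end=5)
result = {key: percentile(hist, 95) for key, hist in grouped.items()}
```

The query on the slide additionally evaluates over one hour at 5-minute steps; this sketch covers one window only.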
The Life of a Query
[Figure, built up over four slides: a query passes through Root Mixer, Zone Mixer, Leaf and Repo, and a response comes back. The first build shows Fetch, Window, GroupBy, Point once; later builds repeat the stages further down the chain. The final build is labeled GroupBy, Point; then Fetch, Window, GroupBy; then Fetch.]
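The build-up appears to show query stages being evaluated at lower levels so that less data travels upward. A minimal sketch of that idea for a sum-based GroupBy follows. It assumes a hypothetical two-zone topology and made-up data, and it only demonstrates that partial sums computed at the leaves and merged by the mixers give the same answer as aggregating everything in one place. Which stage Monarch actually runs at which level is not specified here.

```python
from collections import Counter


def leaf_group_by(points, keys):
    """Leaf: sum the values this leaf holds, per group key."""
    out = Counter()
    for fields, value in points:
        out[tuple(fields[k] for k in keys)] += value
    return out


def merge(partials):
    """Mixer: add up the partial sums returned by its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


def point(job, cell, task_num, value):
    return ({"job": job, "cell": cell, "task_num": task_num}, value)


# Hypothetical topology: zone name -> list of leaves -> points held by each leaf.
zones = {
    "zone-a": [
        [point("fe", "aa", 0, 3), point("fe", "aa", 1, 4)],
        [point("be", "aa", 0, 10)],
    ],
    "zone-b": [
        [point("fe", "bb", 0, 7), point("fe", "aa", 2, 1)],
    ],
}
keys = ["job", "cell"]

# Leaves aggregate first, zone mixers merge leaves, the root mixer merges zones.
zone_results = [merge(leaf_group_by(leaf, keys) for leaf in leaves)
                for leaves in zones.values()]
root_result = merge(zone_results)

# Reference: aggregate every point in one place.
all_points = [p for leaves in zones.values() for leaf in leaves for p in leaf]
flat_result = leaf_group_by(all_points, keys)
```

This works because sums can be combined in any grouping. A percentile cannot be merged the same way from per-leaf percentiles, which is consistent with Point appearing only at the upper level in the final build.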
Panopticon
Using Panopticon
Retention policy
Query
Configure alert
Set up consoles
Monarch as Platform
A custom console service
Python-based configuration libraries that encode best practices
Really automatic monitoring
Cross-company monitoring
SLA definition and alerting
Automated monitoring of rollouts
Google Stackdriver
Monarch is the backend for Google Stackdriver.
It monitors cloud customers and the Google services those customers use.
Supporting this took a good deal of important development:
  Encryption at rest
  Carefully controlled and audited access
  Different ways of naming things and a different data model
Lessons Learned re: Scaling
Maintain good hygiene
Scale horizontally: it is the only way, and it’s hard!
Reduce dimensions early
Lessons Learned - Good Hygiene
Concurrency: don’t make long tails longer.
Periodically assess all components.
Always be deprecating.
Study outliers carefully!
Lessons Learned - Scaling Horizontally
It’s hard, but it’s the only way.
Increase the number of leaves and zones.
Watch out for:
  Centralized services that become bottlenecks.
  Non-constant per-backend costs.
  Query fan-out.
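Why query fan-out deserves attention can be illustrated with basic probability. Assume, purely for illustration, that each backend is independently slow with probability p and that a query must wait for all n backends it fans out to. The query is then slow with probability 1 - (1 - p)^n. The function name and the 1% figure below are assumptions, not measurements from Monarch.

```python
def p_slow_query(p_slow_backend: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent backends is slow."""
    return 1.0 - (1.0 - p_slow_backend) ** fan_out


# With a 1% chance of a slow backend, widen the fan-out and watch the tail grow.
table = {n: round(p_slow_query(0.01, n), 3) for n in (1, 10, 100, 1000)}
```

Under these assumptions a fan-out of 100 already makes roughly 63% of queries wait on at least one slow backend, which is one way to read "don't make long tails longer".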
Lessons Learned - Reduce Dimensions Early
Aggregate data as it arrives.
Configuration and data multiplexing are important.
Users must be able to see “through” the aggregation.
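A minimal sketch of aggregating on arrival: each incoming point is added to a running sum keyed only by the fields that are kept, so a dimension such as task_num never reaches storage. The `IngestAggregator` class, its `contributors` map (one hypothetical way to keep a pointer back to the raw streams so users can see through the aggregate) and the sample data are all assumptions, not a description of how Monarch does it.

```python
from collections import defaultdict


class IngestAggregator:
    """Hypothetical ingest-time aggregator that drops all but `keep_fields`."""

    def __init__(self, keep_fields):
        self.keep_fields = tuple(keep_fields)
        self.sums = defaultdict(int)          # (reduced key, timestamp) -> sum
        self.contributors = defaultdict(set)  # reduced key -> dropped field values

    def ingest(self, fields, timestamp, value):
        """Add one point to its group and remember which raw stream it came from."""
        key = tuple(fields[f] for f in self.keep_fields)
        self.sums[(key, timestamp)] += value
        dropped = tuple(sorted(
            (k, v) for k, v in fields.items() if k not in self.keep_fields))
        self.contributors[key].add(dropped)


agg = IngestAggregator(["job", "cell"])
agg.ingest({"job": "server", "cell": "ip", "task_num": 0}, 60, 3)
agg.ingest({"job": "server", "cell": "ip", "task_num": 1}, 60, 4)
agg.ingest({"job": "server", "cell": "ip", "task_num": 0}, 120, 5)
```

After these three calls the aggregator stores two sums for the group ("server", "ip") rather than three raw points, and `contributors` records that tasks 0 and 1 fed that group.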
Lessons Learned - See through aggregation
Lessons Learned re: Scaling
Maintain good hygiene
Scale horizontally: it is the only way, and it’s hard!
Reduce dimensions early
This is a sampling of lessons we’ve learned; there are many more.