Atlas Scalable time-series management Brian Harrington December - PowerPoint PPT Presentation

Atlas Scalable time-series management Brian Harrington December 16th, 2014

About me • Brian • 4 years in May • Mostly focus on backend • Insight engineering • Enables and drives continuous improvement of real-time operational insight into our customer experience across operational environments.

Our role • Prevention Stability • Is my system working? • Test > Canary > Prod Rate of change • MTTD - mean time to detect • MTTR - mean time to resolution

Netflix likes monitoring • Hadoop, Hive, Spark, … • CloudWatch, Boundary, AppDynamics, Teradata, SumoLogic, … • JMX, SNMP, sar, … • Atlas, Chronos, Edda, Mantis, Turbine, Chukwa, …

What is Atlas? • Atlas is the system Netflix uses to manage dimensional time series data for near real-time operational insight. • Metric volume has doubled almost every quarter since I started. We have grown from 2M to 1.2B. • [Chart: Number of Metrics, y-axis 0 to 1,200,000,000, x-axis 5/1/2011 to 6/1/2014, with markers "Atlas proposed" and "Atlas is primary"]
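As a rough check on the growth claim, the arithmetic can be sketched as follows. It assumes the 2M and 1.2B figures correspond to the first and last dates on the chart axis (5/1/2011 and 6/1/2014), which the slide does not state explicitly.

```python
import math

# Figures from the slide; pairing them with the chart's end dates is an assumption.
start_metrics = 2_000_000
end_metrics = 1_200_000_000
months_elapsed = (2014 - 2011) * 12 + (6 - 5)  # 5/1/2011 to 6/1/2014

# Number of times the metric count doubled over the period.
doublings = math.log2(end_metrics / start_metrics)

# Average number of months per doubling under that assumption.
months_per_doubling = months_elapsed / doublings

print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
```

Under these assumptions the count doubled roughly nine times in about twelve quarters, so "almost every quarter" describes a doubling period of about four months.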

Insight Categories • Operational vs Business intelligence • Operations: What is happening now? • BI: What are the trends over time? • Time series vs Events • Do you need to query for a particular event? • Or just see a summary of events over time?

Where we started • Epic • Predecessor to Atlas • CGI script in front of RRDTool • MySQL for metadata and RRD files on disk • Data center • Falling over at around 2M metrics

Requirements • Don’t lose functionality • Retention: 2w + a few days • Scale • Query explicitly based on dimensions

Amount of time • [Pie chart: Time range for graph requests; legend: <1w, >1w; slice values: 97%, 3%] • [Pie chart: Time range for graph requests with shifts; legend: Others, Shift 1w, Shift 2w; slice values: 32%, 41%, 27%]

Any guesses?

Scale • Define scalable? • We can throw hardware at it • Write volume • Read volume

How much input data?

Graph 1: apiproxy • Number of time series matched: 206 • Number of blocks: 824 • Number of input data points: 37,080 • Number of output data points: 540 • Number of output lines: 3

Graph 2: nccp • Number of time series matched: 12M • Number of blocks: 48M • Number of input data points: 2.16B • Number of output data points: 540 • Number of output lines: 3
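The figures on the two graph slides are internally consistent, and the arithmetic behind them can be sketched as below. The per-series and per-block constants (4 blocks per series, 45 points per block, 180 points per output line) are inferred by dividing the slide's numbers; they are not stated in the talk, and `query_cost` is a hypothetical helper.

```python
def query_cost(series, blocks_per_series, points_per_block, lines, points_per_line):
    """Return block and data point counts for one graph request."""
    blocks = series * blocks_per_series
    return {
        "blocks": blocks,
        "input_points": blocks * points_per_block,
        "output_points": lines * points_per_line,
    }

# Constants inferred from the slides: 824 / 206 = 4, 37,080 / 824 = 45, 540 / 3 = 180.
apiproxy = query_cost(206, 4, 45, 3, 180)
nccp = query_cost(12_000_000, 4, 45, 3, 180)

# How many input points are read per output point drawn.
reduction = nccp["input_points"] // nccp["output_points"]

print(apiproxy)
print(nccp)
print(reduction)
```

Both graphs produce the same 540 output points, but the nccp request has to read millions of input points for each point it draws.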

Why dimensions? • Example metric name • com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US • How do you query this? • Slide annotation: Regex
Key        Value
name       nccp.successful.requests
nccprt     authorization
devtypid   101
clver      PHL_0AB
uiver      UI_169_mid
geo        US
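The contrast on this slide can be sketched in a few lines. With a flat name, a query has to know where each field sits in the string and match it with a regex; with key/value dimensions it becomes a plain lookup. The helper names `matches_flat` and `matches_tags` are hypothetical and are not part of Atlas.

```python
import re

# Flat metric name from the slide.
FLAT_NAME = (
    "com.netflix.eds.nccp.successful.requests.uiversion."
    "nccprt-authorization.devtypid-101.clver-PHL_0AB."
    "uiver-UI_169_mid.geo-US"
)

# The same metric expressed as the key/value pairs shown on the slide.
TAGS = {
    "name": "nccp.successful.requests",
    "nccprt": "authorization",
    "devtypid": "101",
    "clver": "PHL_0AB",
    "uiver": "UI_169_mid",
    "geo": "US",
}

def matches_flat(name, geo):
    """Regex match that depends on the geo field being the last segment."""
    return re.search(r"\.geo-" + re.escape(geo) + r"$", name) is not None

def matches_tags(tags, **wanted):
    """Exact match on any combination of dimensions, in any order."""
    return all(tags.get(key) == value for key, value in wanted.items())

print(matches_flat(FLAT_NAME, "US"))
print(matches_tags(TAGS, geo="US", devtypid="101"))
```

The regex version breaks as soon as a field is added, removed, or reordered in the name, whereas the dimensional version does not depend on position.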

Perspective • Service owner • Library owner • UI team • CDN team managing caches in ISPs • Cross-functional • Performance and capacity team • Site reliability • Exploratory

Problem 1: parity • Normalization and consolidation • Flexible legends, scale independently of chart • Math, in particular handling of NaN values • Holt-Winters • Visualization options • Deep linking

General query layer • Global • Regions: us-east-1, eu-west-1, us-nflx-1, us-west-2, … • Backends: Main, CloudWatch, Epic, Custom, … • Island model: geographic regions should be isolated

Stack language • Embedding and linking are important to us • GET request • URL-friendly stack language • Few special symbols (comma, colon, parenthesis) • Easy to extend • Usability • Basic operations • Query: and, or, equal, regex, has key, not • Aggregation: sum, count, min, max • Consolidation: aggregate across time • Math: add, subtract, multiply, etc. • Boolean: and, or, lt, gt, etc. • Graph settings: legends, area, transparency

Stack language summary • Punctuation: comma, colon, and parenthesis • Operations start with colon • Comma is the separator • Parentheses are used for lists • Example: • nf.cluster,discovery,:eq,(,nf.zone,),:by • select * where nf.cluster == "discovery" group by nf.zone
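The rules in this summary are enough to sketch a toy evaluator. The following is a minimal illustration and not the Atlas implementation: it supports only four operations, represents results as nested tuples, and assumes that tokens inside parentheses are kept as literals.

```python
def apply_op(name, stack):
    """Apply one operation to the stack (only a few operations are sketched)."""
    if name == "eq":
        value = stack.pop()
        key = stack.pop()
        stack.append(("eq", key, value))
    elif name == "and":
        right = stack.pop()
        left = stack.pop()
        stack.append(("and", left, right))
    elif name == "sum":
        stack.append(("sum", stack.pop()))
    elif name == "by":
        keys = stack.pop()
        query = stack.pop()
        stack.append(("by", query, keys))
    else:
        raise ValueError(f"unsupported operation: {name}")

def evaluate(expr):
    """Evaluate a comma-separated stack expression and return the final stack."""
    stack = []
    list_starts = []  # stack positions where an open "(" began
    for token in expr.split(","):
        token = token.strip()
        if not token:
            continue
        if token == "(":
            list_starts.append(len(stack))
        elif token == ")":
            start = list_starts.pop()
            items = stack[start:]
            del stack[start:]
            stack.append(items)
        elif list_starts:
            stack.append(token)  # assumption: list contents are literals
        elif token.startswith(":"):
            apply_op(token[1:], stack)
        else:
            stack.append(token)
    return stack

print(evaluate("nf.cluster,discovery,:eq,(,nf.zone,),:by"))
```

Because every token is either a literal or a colon-prefixed operation, the whole expression fits in a URL query parameter without escaping.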

Simple graph
/api/v1/graph?
  e=2012-01-01T00:00&
  q=name,sps,:eq,nf.cluster,nccp-silverlight,:eq,:and,:sum
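A graph request like this one can be assembled with the standard library. The sketch below is illustrative: `graph_url` is a hypothetical helper, the path and parameter names are taken from the slide, and no host is assumed.

```python
from urllib.parse import urlencode

def graph_url(end_time, query_tokens):
    """Build the relative graph URL from an end time and a list of stack tokens."""
    params = {"e": end_time, "q": ",".join(query_tokens)}
    # Keep commas and colons unescaped so the expression stays readable in the URL.
    return "/api/v1/graph?" + urlencode(params, safe=",:")

url = graph_url(
    "2012-01-01T00:00",
    ["name", "sps", ":eq", "nf.cluster", "nccp-silverlight", ":eq", ":and", ":sum"],
)
print(url)
```

Since the expression uses only commas, colons, and parentheses as punctuation, the resulting link can be pasted or embedded as-is.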

More complex graph
# Query for input line
nf.cluster,alerttest,:eq,
name,requestsPerSecond,:eq,
:and,:sum,
# Create a copy on the stack
:dup,
# Create a DES line using the expr
# on top of the stack
:des-simple,
# Multiply, used to set threshold
0.9,:mul,
# a b => a b abs(a - b)
:2over,:sub,:abs,
# Take line on top of stack
# and set it to area with transparency
:area,40,:alpha,
# Item on bottom of stack moved to
# top, set legend
:rot,$name,:legend,
:rot,prediction,:legend,
:rot,delta,:legend
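The stack manipulation in this expression is easier to follow with a symbolic trace. The sketch below models only `:dup`, `:2over`, `:rot`, `:sub`, and `:abs`, using the stack effects described in the slide comments, with strings standing in for time series expressions. It is an illustration, not the Atlas interpreter.

```python
def run(program):
    """Run a list of tokens against a symbolic stack and return the stack."""
    stack = []
    for word in program:
        if word == ":dup":
            stack.append(stack[-1])        # a => a a
        elif word == ":2over":
            stack.extend(stack[-2:])       # a b => a b a b
        elif word == ":rot":
            stack.append(stack.pop(0))     # bottom item moves to the top
        elif word == ":sub":
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} - {right})")
        elif word == ":abs":
            stack.append(f"abs({stack.pop()})")
        else:
            stack.append(word)             # anything else is a literal
    return stack

# a b => a b abs(a - b), as in the slide comment.
print(run(["a", "b", ":2over", ":sub", ":abs"]))

# Three rotations over three items visit each one and restore the original order.
print(run(["input", "prediction", "delta", ":rot", ":rot", ":rot"]))
```

This shows why the expression ends with three `:rot` steps: each one brings the next line to the top so a legend can be attached, and after the third the lines are back in their original order.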

More complex graph
# Query for input line
nf.cluster,alerttest,:eq,
name,requestsPerSecond,:eq,
:and,:sum,
# Create a copy on the stack
:dup,
# Create a DES line using the expr
# on top of the stack
:des-simple,
# Multiply, used to set threshold
0.9,:mul,
# a b => a b (a < b)
:2over,:lt,
# Take line on top of stack
# and set it to area with transparency
:area,40,:alpha,
# Item on bottom of stack moved to
# top, set legend
:rot,$name,:legend,
:rot,prediction,:legend,
:rot,:vspan,40,:alpha
