Chukwa: a scalable log collector
Ari Rabkin and Randy Katz, UC Berkeley
USENIX LISA 2010
With thanks to Eric Yang, Jerome Boulon, Bill Graham, Corbin Hoenes, and all the other Chukwa developers, contributors, and users
Why collect logs?
- Many uses
– Need logs to monitor/debug systems
– Machine learning is getting increasingly good at detecting anomalies automatically.
– Web log analysis is key to many businesses
- Easier to process if centralized
Three Bets
- 1. MapReduce processing is necessary at scale.
- 2. Reliability matters for log collection.
- 3. Should use Hadoop, not re-write storage and processing layers.
Leveraging Hadoop
- Really want to use HDFS for storage and MapReduce for processing.
+ Highly scalable, highly robust
+ Good integrity properties.
- HDFS has quirks
- Files should be big
- No concurrent appends
- Weak synchronization semantics
The architecture
- [Architecture diagram] Data (App1 log, App2 log, metrics, …) flows into an Agent (one per node, latency of seconds), then to a Collector (one per ~100 nodes, seconds), then into a Data Sink on HDFS (5 minutes), where MapReduce jobs feed Archival Storage (kept indefinitely) and a SQL DB (or HBase).
Design envelope
- [Chart: design envelope in terms of data rate per host (bytes/sec) vs. number of hosts. Regions: "Don't need Chukwa: use NFS instead"; "Chukwa not needed – clients should write direct to HDFS"; "Need better FS! Need more aggressive batching or fan-in control".]
Respecting boundaries
- Architecture captures the boundary between monitoring and production services
– Important in practice!
– Particularly nice in cloud context
- [Diagram: the system being monitored (app logs, metrics, agent) sits on one side of the boundary; the monitoring system (collector, data sink, structured storage) sits on the other, connected by a control protocol.]
Comparison
- [Chart comparing metrics-oriented systems such as Amazon CloudWatch with log collection.]
Data sources
- We optimize for the case of logs on disk
– Supports legacy systems
– Writes to local disk almost always succeed
– Kept in memory in practice – fs caching
- Can also handle other data sources – adaptors are pluggable (see the sketch below)
– Support syslog, other UDP, JMS messages.
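To make "pluggable adaptors" concrete, here is a minimal sketch of asking a running agent to start tailing a log file over its control socket. The control port (9093), the adaptor class name, and the exact command syntax are assumptions for illustration, not taken from the slides; check the Chukwa documentation for the real command format.

    import java.io.PrintWriter;
    import java.net.Socket;

    // Sketch: tell a local Chukwa agent to start tailing /var/log/app1.log.
    // Port number, adaptor class, and command format are illustrative assumptions.
    public class AddLogAdaptor {
        public static void main(String[] args) throws Exception {
            try (Socket control = new Socket("localhost", 9093);   // assumed agent control port
                 PrintWriter out = new PrintWriter(control.getOutputStream(), true)) {
                // add <adaptor class> <datatype> <path> <initial offset>
                out.println("add filetailer.FileTailingAdaptor AppLog /var/log/app1.log 0");
            }
        }
    }

The same mechanism is what makes adaptors for syslog, other UDP sources, and JMS possible: each adaptor class knows how to pull from its source and hand chunks to the agent.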
Reliability
- Agents can crash
- Record how much data from each source has been written successfully.
- Resume at that point after crash
- Fix duplicates in the storage layer
- [Diagram: the agent tracks, per source, how much data has been sent to the collector and committed to HDFS versus sent but not yet committed.]
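A minimal sketch of the recovery idea described above, assuming a hypothetical one-offset-per-source checkpoint file; the real agent's checkpoint format is not shown in the slides.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch of agent-side crash recovery: persist the committed offset per
    // log source, and resume tailing from that offset after a restart.
    // File names and the single-offset checkpoint format are illustrative.
    public class OffsetCheckpoint {
        private final Path checkpoint;

        public OffsetCheckpoint(Path checkpoint) { this.checkpoint = checkpoint; }

        // Called when the collector acknowledges that bytes up to 'offset' are durable.
        public void commit(long offset) throws IOException {
            Files.writeString(checkpoint, Long.toString(offset));
        }

        // After a crash, restart reading the source at the last committed offset.
        public long resumeFrom() throws IOException {
            return Files.exists(checkpoint)
                    ? Long.parseLong(Files.readString(checkpoint).trim())
                    : 0L;
        }

        public static void main(String[] args) throws IOException {
            OffsetCheckpoint cp = new OffsetCheckpoint(Paths.get("/var/lib/agent/app1.offset"));
            try (RandomAccessFile log = new RandomAccessFile("/var/log/app1.log", "r")) {
                log.seek(cp.resumeFrom());   // skip data already committed
                // ... read and send new data; anything re-sent because it was not yet
                // committed shows up as a duplicate and is fixed in the storage layer.
            }
        }
    }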
Incorporating Asynchrony
- What about collector crashes?
- Want to tolerate asynchronous HDFS writes without blocking the agent
- Solution: async. acks
- Tell agent where data will be written if the write succeeds.
- Uses single-writer aspect of HDFS
- [Diagram: the collector acknowledges "your data will land in Foo.done at offset 3000"; the agent later queries the length of Foo.done and, once it reaches 3000, considers the data committed.]
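To make the asynchronous-acknowledgement check concrete, here is a sketch under the assumption that the collector's ack names a sink file plus an end offset. The class and method names are hypothetical; the length query itself uses the standard Hadoop FileSystem API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the agent-side commit check for asynchronous acks. The collector
    // has promised: "if my HDFS write succeeds, your chunk ends at offset
    // 'promisedEnd' in 'sinkFile'". Because an HDFS file has a single writer that
    // only appends, the file reaching that length means the chunk is durable.
    public class AsyncAckChecker {
        private final FileSystem fs;

        public AsyncAckChecker(Configuration conf) throws Exception {
            this.fs = FileSystem.get(conf);
        }

        public boolean isCommitted(String sinkFile, long promisedEnd) throws Exception {
            Path p = new Path(sinkFile);
            return fs.exists(p) && fs.getFileStatus(p).getLen() >= promisedEnd;
        }
    }

If the collector dies before the write lands, the file never reaches the promised length, so the agent re-sends through another collector; the resulting duplicates are removed later in the storage layer.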
Fast path
- [Architecture diagram, as before: agents (one per node, seconds) feed collectors (one per ~100 nodes, seconds), which write to the Data Sink on HDFS (5 minutes) for MapReduce processing into Cleaned Data Storage (kept indefinitely). The addition: collectors also stream data directly to fast-path clients within seconds.]
Two modes
- Robust delivery
– Data visible in minutes
– Collects everything
– Stores to HDFS
– Will resend after a crash
– Facilitates MapReduce
– Used for bulk analysis
- Prompt delivery
– Data visible in seconds
– User-specified filter
– Written over a socket
– Delivered at most once
– Facilitates near-real-time monitoring
– Used for real-time graphing
Overhead [with Cloudstone]
- [Bar chart: Cloudstone operations per second without Chukwa vs. with Chukwa; y-axis spans roughly 46-54 ops/sec.]
Collection rates
- Tested on EC2
- Able to write 30 MB/sec/collector
- Note: data is about 12 months old
- [Plot: collector write rate (MB/sec) vs. agent send rate (MB/sec).]
Collection rates
- Scales linearly
- Able to saturate underlying FS
- [Plot: total throughput (MB/sec) vs. number of collectors.]
Experiences
- Currently in use at:
- UC Berkeley's RAD Lab, to monitor Cloud experiments
- CBS Interactive, Selective Media, and Tynt for web log analysis
– Dozens of machines
– Gigabytes to Terabytes per day
- Other sites too…we don't have a census
Related Work
System                     Handles logs  Crash recovery?  Metadata  Interface  Agent-side control
Ganglia/Nagios/other NMS   No            No               No        UDP        No
Scribe                     Yes           No               No        RPC        Yes
Flume                      Yes           Yes              Yes       flexible   No
Chukwa                     Yes           Yes              Yes       flexible   Yes
Next steps
- Tighten security, to make Chukwa suitable for world-facing deployments
- Adjustable durability
– Should be able to buffer arbitrary non-file data for reliability
- HBase for near-real-time metrics display
- Built-in indexing
- Your idea here: Exploit open source!
Conclusions
- Chukwa is a distributed log collection system that is
- Practical
– In use at several sites
- Scalable
– Builds on Hadoop for storage and processing
- Reliable
– Able to tolerate multiple concurrent failures without losing or mangling data
- Open Source
– Former Hadoop subproject, currently in Apache incubation, en route to becoming a top-level project.
Questions?
…vs Splunk
- Significant overlap with Splunk.
– Splunk uses syslog for transport.
– Recently shifted towards MapReduce for evaluation.
- Chukwa on its own doesn’t [yet] do indexing or analysis.
- Chukwa helps extract data from systems
– Reliably
– Customizably
Assumptions about App
- Processing should happen off-node. (Production hosts are sacrosanct.)
- Data should be available within minutes
– Sub-minute delivery a non-goal.
- Data rates between 1 and 100 KB/sec/node
– Architecture tuned for these cases, but Chukwa could be adapted to handle lower/higher rates.
- No assumptions about data format
- Administrator or app needs to tell Chukwa where logs live.
– Support for directly streaming logs as well.
On the back end
- Chukwa has a notion of parsed records, with complex schemas
– Can put into structured storage
– Display with HICC, a portal-style web interface.
Not storage, not processing
- Chukwa is a collection system.
– Not responsible for storage:
- Use HDFS.
- Our model is store-everything, prune late
– Not responsible for processing
- Use MapReduce, or custom layer on HDFS
- Responsible for facilitating storage and processing
- Framework for processing collected data
- Includes Pig support
Goal: Low Footprint
- Wanted minimal footprint on system and minimal changes to user workflow.
– Application logging need not change.
– Local logs stay put; Chukwa just copies them.
– Can either specify filenames in static config, or else do some dynamic discovery.
- Minimal human-produced metadata
– We track what data source + host a chunk came from. Can store additional tags.
– Chunks are numbered; can reconstruct order.
– No schemas required to collect data
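For illustration, a minimal chunk-like record carrying the metadata just listed (source, host, data type, sequence number, optional tags). This is a sketch only; the field names are assumptions, not the actual Chukwa Chunk API.

    import java.util.Map;

    // Illustrative sketch of per-chunk metadata; names are assumptions,
    // not the real Chukwa interfaces.
    public final class LogChunk {
        public final String source;            // e.g. which log file the bytes came from
        public final String host;              // machine that produced the data
        public final String dataType;          // operator-chosen label; no schema required
        public final long seqId;               // sequence number, lets consumers reconstruct order
        public final Map<String, String> tags; // optional additional tags
        public final byte[] data;              // raw, unparsed log bytes

        public LogChunk(String source, String host, String dataType,
                        long seqId, Map<String, String> tags, byte[] data) {
            this.source = source;
            this.host = host;
            this.dataType = dataType;
            this.seqId = seqId;
            this.tags = tags;
            this.data = data;
        }
    }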
MapReduce and Hadoop
- Major motivation for Chukwa was storing and analyzing Hadoop logs.
– At Yahoo!, common to dynamically allocate hundreds of nodes for a particular task.
– This can generate MBs of logs per second.
– Log analysis becomes difficult
Why Ganglia doesn’t do this
- Many systems for metrics collection
– Ganglia particularly well-known.
– Many similar systems, including network management systems like OpenView
– Focus on collecting and aggregating metrics in scalable low-cost way
- But logs aren’t metrics. Want to archive everything, not summarize aggressively.
- Really want reliable delivery; missing key parts of logs might make rest useless
Clouds
- Log processing needs to be scalable, since apps can get big quickly
- This used to be a problem for the Microsofts and Googles of the world. Now it affects many more.
- Can’t rely on local storage
– Nodes are ephemeral
– Need to move logs off-node
- Can’t do analysis on single host
– The data is too big
Questions about Goals
- How many nodes? How much data?
- What data sources and delivery semantics?
- Processing expressiveness?
- Storage?
Chukwa goals
- How many nodes? How much data?
– Scale to thousands of nodes. Hundreds of KB/sec/node on average; bursts above that are OK.
- What data sources and delivery semantics?
– Console Logs and Metrics.
– Reliable delivery (as much as possible). Minutes of delay are OK.
- Processing expressiveness?
– MapReduce
- Storage?
– Should be able to store data indefinitely. Support petabytes of stored data.
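A rough back-of-the-envelope check on these targets, using the numbers elsewhere in the deck: 1,000 nodes at 100 KB/sec/node is about 100 MB/sec in aggregate, and with one collector per 100 nodes each collector sees roughly 10 MB/sec, comfortably below the ~30 MB/sec per-collector write rate measured on EC2.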
In contrast
- Ganglia, Network Management systems, and Amazon’s CloudWatch are all metrics-oriented.
– Goal is collecting and disseminating numerical metrics data in a scalable way.
- Significantly different problem.
– Metrics have well defined semantics
– Can tolerate data loss
– Easy to aggregate/compress for archiving
– Often time-critical
- Chukwa can serve these purposes, but isn’t optimized for them.
Real-time Chukwa
- Chukwa was originally designed to support batch processing of logs
– Minutes of latency OK.
- But we can do [best effort] real-time “for free”
– Watch data go past at the collector
– Check chunks against a search pattern, forward matching ones to a listener via TCP (see the sketch below).
– Don’t need long-term storage or reliable delivery (do those via the regular data path)
- Director uses this real-time path.
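A minimal sketch of the collector-side fast path under the assumptions above: chunks matching a user-specified pattern are also copied to a registered listener over a plain TCP socket. The class and method names are hypothetical, not Chukwa's actual fast-path API.

    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;

    // Sketch of best-effort real-time forwarding at the collector: as chunks
    // stream past on the normal path, matching ones are also copied to a
    // listener over TCP. No retries, no buffering: delivery is at most once.
    public class FastPathForwarder {
        private final Pattern filter;
        private final Socket listener;   // registered by the real-time client

        public FastPathForwarder(String regex, String host, int port) throws Exception {
            this.filter = Pattern.compile(regex);
            this.listener = new Socket(host, port);
        }

        // Called for every chunk on the regular (robust) path.
        public void onChunk(byte[] data) {
            String text = new String(data, StandardCharsets.UTF_8);
            if (!filter.matcher(text).find()) {
                return;  // not interesting to this listener
            }
            try {
                OutputStream out = listener.getOutputStream();
                out.write(data);
                out.flush();
            } catch (Exception e) {
                // Best effort: drop on error; the robust HDFS path still has the data.
            }
        }
    }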
Related work summary
- Ganglia (and traditional NMS) don’t do large data volumes or data rates
- Facebook’s Scribe+Hive
– Scribe is streaming, not batch
– Hive is batch, and atop Hadoop
– Doesn't do collection or visualization.
– Doesn’t have strong reliability properties
- Flume (from Cloudera)