In-situ MapReduce for Log Processing

In-situ MapReduce for Log Processing
Dionysios Logothetis, Kevin Webb, Kenneth Yocum (UC San Diego)
Chris Trezzo (Salesforce Inc.)
USENIX Annual Technical Conference, June 2011

Log analytics
• Data centers with 1000s of servers
• Generating logs with valuable information
• Data-intensive computing: store and analyze TBs of logs
Examples:
• Click logs: ad-targeting, personalization
• Social media feeds: brand monitoring
• Purchase logs: fraud detection
• System logs: anomaly detection, debugging

Log analytics today
• "Store-first-query-later"
  – Migrate logs to dedicated clusters
Problems:
• Scale
  – e.g. Facebook collects 100TB a day!
  – Data migration stresses network and disks
• Failures
  – e.g. server is unreachable
  – Delay analysis or process incomplete data
• Timeliness
  – e.g. long data migration times
  – Hinders real-time apps: ad-targeting, fraud detection
[Figure: servers store logs first; MapReduce on a dedicated cluster queries them later]

In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers
• MapReduce for continuous data
• Ability to trade fidelity for latency
Optimized for:
• Highly selective workloads
  – e.g. up to 80% of the data filtered or summarized!
• Online analytics
  – e.g. ad re-targeting based on most recent clicks
[Figure: MapReduce runs on the servers, ahead of the dedicated cluster]

An iMR query
The same:
• MapReduce API
  – map(r) → {k, v}: extract/filter data
  – reduce({k, v[]}) → v': data aggregation
  – combine({k, v[]}) → v': early, partial aggregation
The new:
• Provides continuous results
• Because logs are continuous
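
The three API functions above can be sketched in Python as follows. This is only an illustration of the map/combine/reduce roles, not iMR's actual interface: the record format ("user action") and the helper names `group`, `run_node`, and `run_root` are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(record):
    """Extract/filter: emit (user, 1) for click records, drop everything else."""
    fields = record.split()
    if len(fields) == 2 and fields[1] == "click":
        yield fields[0], 1

def combine_fn(key, values):
    """Early, partial aggregation on the node that produced the data."""
    return sum(values)

def reduce_fn(key, values):
    """Final aggregation over the partial values for one key."""
    return sum(values)

def group(pairs):
    """Group (key, value) pairs into {key: [values]} (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def run_node(records):
    """Map and combine locally, as a server would before sending data on."""
    pairs = (kv for r in records for kv in map_fn(r))
    return {k: combine_fn(k, vs) for k, vs in group(pairs).items()}

def run_root(partials):
    """Reduce the partial results received from all nodes."""
    pairs = (kv for p in partials for kv in p.items())
    return {k: reduce_fn(k, vs) for k, vs in group(pairs).items()}

node_a = run_node(["alice click", "bob view", "alice click"])
node_b = run_node(["alice click", "bob click"])
result = run_root([node_a, node_b])
print(result)
```

Because the combine step runs on each node, only one partial count per key leaves the server, which is the point of early aggregation for highly selective workloads.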

Continuous MapReduce
• iMR input is an infinite stream of logs
• Bound input with sliding windows:
  – Range of data
  – Update frequency
  – e.g. process user clicks over the last 60''… and update analysis every 15''
• Nodes output stream of results, one for each window
• Analysis continuously updated with new data
[Figure: log entries on a timeline (0'', 30'', 60'', 90'') flow through Map, Combine and Reduce]
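
A minimal sketch of bounding a stream with a sliding window, using the 60-second range and 15-second update frequency from the example above. The function name `sliding_windows`, the half-open interval convention, and the finite event list standing in for the infinite stream are assumptions of this sketch.

```python
def sliding_windows(events, range_s=60, slide_s=15):
    """Yield (window_end, payloads) once per update interval.

    events: list of (timestamp_seconds, payload), sorted by timestamp.
    Each window covers the half-open interval [window_end - range_s, window_end).
    """
    if not events:
        return
    last_ts = events[-1][0]
    end = slide_s
    # Keep sliding until the window end has passed the last event.
    while end - slide_s <= last_ts:
        start = end - range_s
        yield end, [p for ts, p in events if start <= ts < end]
        end += slide_s

clicks = [(5, "a"), (20, "b"), (50, "c"), (70, "d")]
windows = list(sliding_windows(clicks))
for end, payloads in windows:
    print(end, payloads)
```

Consecutive windows share most of their data (here, three quarters of the range), which is the overlap that pane-based processing is meant to avoid recomputing.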

Processing windows in-network
• Aggregation trees for efficiency
  – Distribute processing load
  – Reduce network traffic
Problem:
• Overlapping data
  – Processed multiple times: wastes CPU
  – Sent to the root multiple times: wastes network
[Figure: log entries with overlapping data pass through Map and Combine on each node; windows are combined in-network and reduced at the query root]

Efficient processing with panes
• Eliminate redundant work
• Divide window into panes (sub-windows)
• Each pane is processed and sent only once
• Root combines panes to produce window
• Saves CPU & network resources, faster analysis
[Figure: timeline (0'', 30'', 60'', 90'') divided into panes P1–P4; panes pass through Map and Combine and are merged at Reduce]
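
The pane idea can be sketched as below: each event is aggregated into exactly one pane, and a window is produced by merging the panes it covers. The 15-second pane length, the four panes per window, and all function names are assumptions chosen for illustration, not values taken from the iMR prototype.

```python
from collections import Counter

PANE_S = 15            # assumed pane length in seconds
PANES_PER_WINDOW = 4   # assumed 60 s window range / 15 s panes

def pane_id(ts):
    """Index of the pane that a timestamp falls into."""
    return ts // PANE_S

def combine_panes(events):
    """Node side: partially aggregate each pane exactly once."""
    panes = {}
    for ts, key in events:
        panes.setdefault(pane_id(ts), Counter())[key] += 1
    return panes

def window_result(panes, last_pane):
    """Root side: merge the already-aggregated panes covering one window."""
    total = Counter()
    for p in range(last_pane - PANES_PER_WINDOW + 1, last_pane + 1):
        total.update(panes.get(p, Counter()))
    return total

events = [(5, "x"), (20, "y"), (50, "x"), (70, "x"), (72, "y")]
panes = combine_panes(events)
w1 = window_result(panes, last_pane=3)  # panes 0..3, i.e. seconds 0-60
w2 = window_result(panes, last_pane=4)  # panes 1..4, i.e. seconds 15-75
print(dict(w1), dict(w2))
```

Panes 1 to 3 are reused by both windows without being re-processed or re-sent; only pane 4 is new work for the second window.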

Impact of data loss on analysis
• Servers may get overloaded or fail
• Apps may have latency requirements
• Data loss is unavoidable to ensure timeliness
Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
[Figure: panes P1–P4 pass through Map, Combine and Reduce; one pane is lost (X), leaving the result uncertain (?)]

Quantifying data fidelity
• Data are naturally distributed across:
  – Space (server nodes)
  – Time (processing window)
• Panes describe temporal and spatial nature of data
• C² metric: annotates result windows with a "scoreboard"
  – Marks successfully received panes
[Figure: scoreboard grid with nodes N1–N4 on the space axis and panes P1–P4 on the time axis]
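
One way to picture the scoreboard is a node-by-pane grid of booleans, as in the sketch below. The dictionary layout and the names `make_scoreboard`, `mark_received`, and `fidelity` are hypothetical; the slides do not specify how iMR represents the scoreboard internally.

```python
def make_scoreboard(nodes, panes):
    """Create a node x pane grid with every pane marked as not yet received."""
    return {n: {p: False for p in panes} for n in nodes}

def mark_received(board, node, pane):
    """Mark one pane from one node as successfully received."""
    board[node][pane] = True

def fidelity(board):
    """Fraction of (node, pane) cells that were received."""
    cells = [ok for row in board.values() for ok in row.values()]
    return sum(cells) / len(cells)

nodes = ["N1", "N2", "N3", "N4"]
panes = ["P1", "P2", "P3", "P4"]
board = make_scoreboard(nodes, panes)
for node in ("N1", "N2"):
    for pane in panes:
        mark_received(board, node, pane)
print(fidelity(board))
```

The grid keeps more information than the single fidelity number: it also records which nodes and which time ranges are missing from a result.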

Trading fidelity for latency
• Use C² spec to trade fidelity for latency
Users may specify:
• Maximum latency requirement
  – e.g. process window within 60sec
• Minimum fidelity
  – e.g. at least 50% of the total data
• Different ways to meet minimum fidelity
  – Impact latency and accuracy of analysis
• We identified 4 useful classes of C² specifications
[Figure: two scoreboards (nodes N1–N4 × panes P1–P4) that meet the same minimum fidelity with different sets of panes]
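
A rough sketch of how a specification with a maximum latency and a minimum fidelity might drive the decision to emit a window. The emit rule used here (emit once the fidelity target is met, or when the latency bound expires, whichever comes first) is an assumption of this sketch, and `C2Spec` and `should_emit` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class C2Spec:
    min_fidelity: float    # e.g. 0.5 for "at least 50% of the total data"
    max_latency_s: float   # e.g. 60 for "process window within 60sec"

def should_emit(spec, received, expected, elapsed_s):
    """Assumed rule: emit when fidelity is reached or the latency bound expires."""
    fidelity = received / expected
    return fidelity >= spec.min_fidelity or elapsed_s >= spec.max_latency_s

spec = C2Spec(min_fidelity=0.5, max_latency_s=60)
print(should_emit(spec, received=4, expected=16, elapsed_s=10))   # too little data, still early
print(should_emit(spec, received=8, expected=16, elapsed_s=10))   # fidelity target met
print(should_emit(spec, received=4, expected=16, elapsed_s=60))   # latency bound reached
```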

Minimizing result latency
• Minimum fidelity with earlier results
  – e.g. 50% of the data
• Gives freedom to decrease latency
  – Returns the earliest data available
  – e.g. data from the fastest servers
• Appropriate for uniformly distributed events
  – Accurately summarizes relative event frequencies
[Figure: scoreboard, nodes N1–N4 × panes P1–P4]

Sampling non-uniform events
• Minimum fidelity with random sampling
  – e.g. random 50% of the data
• Less freedom to decrease latency
  – Included data may not be the first available
• Appropriate even for non-uniform data
  – Reproduces relative occurrence of events
[Figure: scoreboard, nodes N1–N4 × panes P1–P4]
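
The difference between taking the earliest available data and taking a random sample can be sketched as two pane-selection policies over the same arrival times. The arrival times, the fixed random seed, and the function names below are made up for illustration.

```python
import math
import random

def earliest_panes(arrivals, min_fidelity):
    """Pick the first panes to arrive until the fidelity target is met."""
    need = math.ceil(min_fidelity * len(arrivals))
    return sorted(arrivals, key=arrivals.get)[:need]

def random_panes(arrivals, min_fidelity, rng):
    """Pick a uniformly random subset of panes of the required size."""
    need = math.ceil(min_fidelity * len(arrivals))
    return rng.sample(sorted(arrivals), need)

def latency(arrivals, chosen):
    """The result is ready once the slowest chosen pane has arrived."""
    return max(arrivals[c] for c in chosen)

# Hypothetical arrival times in seconds, keyed by (node, pane): N1 is fast, N2 is slow.
arrivals = {("N1", "P1"): 1, ("N1", "P2"): 2, ("N2", "P1"): 8, ("N2", "P2"): 9}

early = earliest_panes(arrivals, 0.5)
sampled = random_panes(arrivals, 0.5, random.Random(0))
print(early, latency(arrivals, early))
print(sampled, latency(arrivals, sampled))
```

The earliest-first policy returns only data from the fast node, which is fine when events are uniformly distributed but biased otherwise; the random policy can never finish sooner than earliest-first, since it may have to wait for the slow node.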

Correlating events across time and space
Leverage knowledge about data distribution
Temporal completeness:
• Include all data from a node or no data at all
  – e.g. all data from 50% of the nodes
• Useful when events are local to a node
  – e.g. counting events on a per node basis
Spatial completeness:
• Each pane contains data from all nodes
• Useful for correlating events across servers
  – e.g. click sessionization
[Figure: two scoreboards (nodes N1–N4 × panes P1–P4), one for temporal and one for spatial completeness]
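
Both completeness notions can be expressed as checks over a node-by-pane grid of received flags. The grid layout (`{node: {pane: bool}}`) and the function names are assumptions made for this sketch.

```python
def temporally_complete(board):
    """True if every node contributed all of its panes or none of them."""
    return all(
        all(row.values()) or not any(row.values())
        for row in board.values()
    )

def spatially_complete_panes(board):
    """Panes for which data from every node was received."""
    panes = list(next(iter(board.values())))
    return [p for p in panes if all(row[p] for row in board.values())]

# All data from N1 and nothing from N2: temporally complete.
by_node = {"N1": {"P1": True, "P2": True},
           "N2": {"P1": False, "P2": False}}

# Pane P1 from every node, pane P2 from none: P1 is spatially complete.
by_pane = {"N1": {"P1": True, "P2": False},
           "N2": {"P1": True, "P2": False}}

print(temporally_complete(by_node), spatially_complete_panes(by_node))
print(temporally_complete(by_pane), spatially_complete_panes(by_pane))
```

Both example grids hold 50% of the data, yet they support different analyses: the first suits per-node counts, the second suits correlating events across servers within a pane.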

Prototype
• Builds upon Mortar distributed stream processor [Logothetis et al., USENIX'08]
  – Sliding windows
  – In-network aggregation trees
• Extended to support:
  – MapReduce API
  – Pane-based processing
  – Fault tolerance mechanisms: operator restart, adaptive data routing

Processing data in-situ
• Analysis co-located with client-facing services
• Limited CPU resources for log analysis
• Goal: use available resources intelligently
• Load shedding mechanism
  – Nodes monitor local processing rate
  – Shed panes that cannot be processed on time
• Increases result fidelity under time and resource constraints
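
A simplified sketch of the load shedding idea: given a measured local processing rate and a deadline, a node keeps the panes it can finish in time and sheds the rest. The greedy in-order policy, the single shared deadline, and the name `shed_panes` are assumptions of this sketch; the slides do not describe the prototype's actual policy.

```python
def shed_panes(pane_sizes, rate_rps, deadline_s):
    """Split panes into (kept, shed) under a processing-time budget.

    pane_sizes: list of (pane_id, record_count) in arrival order.
    rate_rps:   locally measured processing rate, in records per second.
    deadline_s: time budget available for processing, in seconds.
    """
    kept, shed = [], []
    busy_s = 0.0
    for pane, n_records in pane_sizes:
        cost_s = n_records / rate_rps
        if busy_s + cost_s <= deadline_s:
            busy_s += cost_s
            kept.append(pane)
        else:
            # This pane cannot be processed on time, so skip it entirely.
            shed.append(pane)
    return kept, shed

kept, shed = shed_panes([("P1", 400), ("P2", 800), ("P3", 300)],
                        rate_rps=100, deadline_s=10)
print(kept, shed)
```

Shedding the oversized pane up front leaves enough budget to finish a later one, instead of spending CPU on work that would miss the deadline anyway.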

Evaluation
• System scalability
• Usefulness of C² metric
  – Understanding incomplete results
  – Trading fidelity for latency
  – Applications:
    • Click-stream sessionization
    • HDFS failure detection
• Processing data in-situ
  – Improving fidelity under load with load shedding
  – Minimize impact on services
