MapReduce Online
Tyson Condie UC Berkeley
Joint work with Neil Conway, Peter Alvaro, and Joseph M. Hellerstein (UC Berkeley) Khaled Elmeleegy and Russell Sears (Yahoo! Research)
– Tuned for massive data parallelism
– Many maps operate on portions of the input
– Many reduces, each assigned specific groups
– Runtimes range from minutes to hours
– Execute on tens to thousands of machines
– Failures are common (fault tolerance is crucial)
– Operators complete before producing any output
– Atomic data exchange between operators
– Final answers only
– No infinite streams
– BUT we must retain fault tolerance
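The wordcount walkthrough on the following slides corresponds roughly to the standard Hadoop wordcount. A minimal sketch in the classic org.apache.hadoop.mapred API; the class names and wiring here are illustrative, not taken from the talk:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Map: each map task works on its portion (split) of the input and
  // emits (word, 1) for every word it sees.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);   // e.g., ("Cat", 1), ("Rabbit", 1), ...
      }
    }
  }

  // Reduce: each reduce task is assigned specific groups (words) and
  // sums the counts for each group it owns.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));  // e.g., ("Cat", 2)
    }
  }
}
```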
Master Client Submit wordcount Workers
map reduce reduce
schedule
Block 1
HDFS
Cat Rabbit Cat Rabbit Dog Turtle
Workers
map reduce reduce
Cat, 1 Rabbit, 1
Cat, 1 Turtle, 1 Rabbit, 1 Dog, 1
Workers
map reduce reduce
Cat, 1 Cat, 1 Dog, 1 Rabbit, 1 Rabbit, 1 Turtle, 1
Workers
map reduce reduce
Cat, 2 Dog, 1 Rabbit, 2 Turtle, 1
Workers
map reduce reduce
Cat, 2 Dog, 1 Rabbit, 2 Turtle, 1
Local FS
Master Workers
reduce reduce
Local FS
Map finished Map output location
Workers
reduce reduce
Local FS
HTTP get
Workers
reduce reduce
Cat, 5,1,3,4,… Dog, 1,4,2,5,… Rabbit, 2,5,1,7,… Turtle, 4,2,3,3,…
Map 1, Map 2, . . ., Map k Map 1, Map 2, . . ., Map k
Workers
reduce reduce HDFS
Cat, 25 Dog, 14 Rabbit, 23 Turtle, 16 Cat, 5,1,3,4,… Dog, 1,4,2,5,… Rabbit, 2,5,1,7,… Turtle, 4,2,3,3,…
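In the walkthrough above, each reduce merges the sorted runs from Map 1..k before summing. Reusing the summing reduce as a combiner collapses partial counts during that merge; a hedged wiring sketch (other job setup details omitted) that reuses the WordCount classes from the earlier sketch:

```java
import org.apache.hadoop.mapred.JobConf;

public class CombineWiring {
  /** Reuse the summing reduce as a combiner so partial counts are
   *  collapsed while map output is merged, before the final reduce. */
  public static JobConf wire() {
    JobConf conf = new JobConf(WordCount.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setCombinerClass(WordCount.Reduce.class);
    conf.setReducerClass(WordCount.Reduce.class);
    return conf;
  }
}
```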
Master Workers
map reduce reduce
Schedule Schedule + Map location (ASAP)
pipeline request
Workers
map reduce reduce
Workers
map reduce reduce
Merge and combine Merge and combine
➔ Also done in blocking but more so when pipelining
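A self-contained sketch of the pipeline request shown above, with hypothetical names (this is not the actual HOP code): the master pushes map output locations to reducers as soon as possible, and each reducer pulls and merges spill files while maps are still running.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: a reducer that pulls map output as soon as the
 *  master announces it, rather than waiting for each map to finish. */
public class PipelinedShuffleSketch {

  /** Announcement from the master: "map M has produced spill S at host H". */
  record SpillLocation(int mapId, int spillId, String host) {}

  private final BlockingQueue<SpillLocation> announcements = new LinkedBlockingQueue<>();

  /** Called when the master pushes a map output location to this reducer (ASAP). */
  public void onMapLocation(SpillLocation loc) {
    announcements.add(loc);
  }

  /** Reducer-side fetch loop: overlaps the shuffle with maps that are still running. */
  public void fetchLoop() throws InterruptedException {
    while (true) {
      SpillLocation loc = announcements.take();   // block until a spill is available
      byte[] spill = httpGet(loc.host(), loc.mapId(), loc.spillId());
      mergeIntoReducerInput(spill);               // merge (and combine) incrementally
    }
  }

  private byte[] httpGet(String host, int mapId, int spillId) {
    // Placeholder for the HTTP fetch the real framework performs.
    return new byte[0];
  }

  private void mergeIntoReducerInput(byte[] spill) {
    // Placeholder: merge the sorted spill into the reducer's in-progress input.
  }
}
```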
– Reduce treats in-progress map output as tentative
– If a map dies, its output is thrown away
– If a map succeeds, its output is accepted
– Spill files have deterministic boundaries and are assigned a sequence number
– Correctness: reduce tasks apply each spill file idempotently
– Optimization: map tasks avoid sending redundant spill files
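A hedged sketch (illustrative only, not the HOP implementation) of the bookkeeping described above: spills are identified by a (map attempt, sequence number) pair so they are applied at most once, treated as tentative until the map attempt commits, and discarded wholesale if it fails.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative reducer-side bookkeeping for pipelined map output. */
public class TentativeSpillTracker {
  // Spills already applied, per map attempt: re-sent spills are ignored (idempotence).
  private final Map<String, Set<Integer>> applied = new HashMap<>();
  // Map attempts whose output has been accepted (the attempt completed successfully).
  private final Set<String> committedAttempts = new HashSet<>();

  /** Returns true if this spill should be merged now; false if it is a duplicate. */
  public boolean accept(String mapAttempt, int spillSeq) {
    return applied.computeIfAbsent(mapAttempt, k -> new HashSet<>()).add(spillSeq);
  }

  /** Map attempt finished: its tentative output becomes final. */
  public void commit(String mapAttempt) {
    committedAttempts.add(mapAttempt);
  }

  /** Map attempt died: throw away everything it pipelined to us. */
  public void abort(String mapAttempt) {
    applied.remove(mapAttempt);
    // The reducer would also discard any tentative merged data from this attempt.
  }

  public boolean isCommitted(String mapAttempt) {
    return committedAttempts.contains(mapAttempt);
  }
}
```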
Read input file (Block 1, Block 2) from HDFS → map, map → reduce, reduce → write snapshot answer to HDFS
– Wikipedia traffic statistics (1TB)
– Webpage clicks/hour
– 5066 compressed files (each file = 1 hour click logs)
– Group by language and hour
– Count clicks
– Final answer ≈ (intermediate click count * scale‐up factor)
and fraction of hour (each file = 1 hour click logs)
– Taken less than 2 minutes into a ~2 hour job!
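A concrete reading of the estimate above, with made-up numbers (the real scale-up factor also accounts for the fraction of the current hour processed):

```java
/** Illustrative arithmetic only; the values are invented, not from the talk. */
public class SnapshotEstimate {
  public static void main(String[] args) {
    long intermediateClicks = 1_200_000;      // clicks seen so far for one (language, hour) group
    double sampleFraction = 80.0 / 5066.0;    // hourly files processed so far / total hourly files
    double scaleUpFactor = 1.0 / sampleFraction;
    double estimatedFinal = intermediateClicks * scaleUpFactor;
    System.out.printf("estimated final count ~ %.0f%n", estimatedFinal);
  }
}
```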
!"#$!!% &"#$!'% ("#$!'% )"#$!'% *"#$!'% +"#$!'% ,"#$!'%
./012% 3456782% 9:3412% 53:;<4% 70<67<4% =<><4383% >?6782% >?:0/5/383% :/887<4% 8><4782% @74<6%<48A3:% B<;>63%9:<1C?4% D?E%>:?5:388%
– Job progress assumes hours are uniformly sampled
– Sample fraction ≈ sample distribution of each hour
!" !#$" !#%" !#&" !#'" !#(" !#)" !#*" !#+" !" %'!" '+!" *%!" ,)!" $%!!" $''!" $)+!" $,%!" %$)!" %'!!" %)'!" %++!" &$%!" &&)!" &)!!" &+'!" '!+!" '&%!" '()!" (&'!" !"#$%#&%'(&&)&' *+,-'./-0/1'
567083"916:;.<"
– Increasing block size => fewer maps with longer runtimes
– 20 extra‐large EC2 nodes: 4 cores, 15GB RAM
– Two jobs: large vs. small block size
– Both jobs hard-coded to use 60 reduce tasks
– Especially in blocking mode
– BUT incurs extra sorting between shuffle and reduce steps
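A hedged sketch of how such a job might be configured with the classic JobConf API; the split sizes below are illustrative, not the values used in the talk.

```java
import org.apache.hadoop.mapred.JobConf;

public class BlockSizeExperiment {
  /** Illustrative configuration only: fix 60 reducers and vary the effective
   *  map input size, so the number of maps (and their runtimes) changes. */
  public static JobConf configure(boolean largeBlocks) {
    JobConf conf = new JobConf(BlockSizeExperiment.class);
    conf.setNumReduceTasks(60);                           // both runs hard-coded to 60 reduces
    long splitBytes = largeBlocks ? 512L * 1024 * 1024    // few maps, long runtimes
                                  : 32L * 1024 * 1024;    // many maps, short runtimes
    conf.setLong("mapred.min.split.size", splitBytes);
    return conf;
  }
}
```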
!"# $!"# %!"# &!"# '!"# (!!"# !# )# (!# ()# $!# $)# *!# *)# %!# %)# !"#$"%&&' ()*%'+*),-.%&/'
011'23'34#56),$'+78"$%'34#56&/'
+,-#-./0.122# 314561#-./0.122# !"# $!"# %!"# &!"# '!"# (!!"# !# )# (!# ()# $!# $)# *!# *)# %!# %)# !"#$"%&&' ()*%'+*),-.%&/'
01123'!)4%5),),$'+67"$%'35#89&/'
+,-#-./0.122# 314561#-./0.122#
4 minutes < 1 minute
Reduce step (75%‐100%) Shuffle step (0%‐75%)
– BUT idle periods still exist in the blocking-mode shuffle step
– AND increases scheduler overhead (3120 maps)
– AND increases HDFS (NameNode) memory pressure
– Based on runtime dynamics (reducer load, network capacity, etc.)
!"# $!"# %!"# &!"# '!"# (!!"# !# )# (!# ()# $!# $)# *!# *)# %!# !"#$"%&&' ()*%'+*),-.%&/'
01123'34#56),$'+7*844'34#56&/'
+,-#-./0.122# 314561#-./0.122# !"# $!"# %!"# &!"# '!"# (!!"# !# )# (!# ()# $!# $)# *!# *)# %!# !"#$"%&&' ()*%'+*),-.%&/'
01123'!)4%5),),$'+6*755'35#89&/'
+,-#-./0.122# 314561#-./0.122#
<< 1 minute < 1 minute
– So the job contains 4 maps and (a hard-coded) 2 reduces
2 maps sort and send output to the reducers; the 3rd map finishes, sorts, and sends to the reducers
– So the job contains 4 maps and (a hard-coded) 2 reduces
Reduce task: 6.5 minute idle period; reduce task performing final merge-sort; job completion when reduce finishes
Mapper CPU Reducer CPU
Amazon CloudWatch
Reduce task 6.5 minute idle period
Map step more I/O bound
13 min. 7 min.
Amazon CloudWatch
Steady network traffic
– An early view of the result from a running computation
– Interactive data analysis (you say when to stop)
– Tasks operate on infinite data streams
– Real-time data analysis
– Improve CPU and I/O overlap
– Steady network traffic (fewer load spikes)
– Improve cluster utilization (reducers do more work)
– Economy of Mechanism
– Implemented as a continuous map task
– Record statistics of interest (/proc, log files, etc.)
– Implemented as reduce tasks
– Aggregate statistics along machine, rack, datacenter
– Reduce windows: 1, 5, and 15 second load averages
– Alert triggered after some threshold
– Faster than the (~5 second) TaskTracker reporting interval
➔ Feedback loop to the JobTracker for better scheduling
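A self-contained sketch (hypothetical names, not the demo code) of the aggregation the reduce side performs: keep 1/5/15-second windows over per-second samples and raise an alert when a threshold is crossed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative reduce-side aggregation: sliding windows over per-second
 *  samples (e.g., a value read from /proc by a continuous map task), with
 *  an alert when the 1-second average exceeds a threshold. */
public class LoadWindowAggregator {
  private final Deque<Double> samples = new ArrayDeque<>();  // newest first, one per second
  private final double alertThreshold;

  public LoadWindowAggregator(double alertThreshold) {
    this.alertThreshold = alertThreshold;
  }

  /** Called once per second with the latest sampled value. */
  public void addSample(double value) {
    samples.addFirst(value);
    while (samples.size() > 15) samples.removeLast();        // keep at most 15 seconds

    double oneSec = average(1), fiveSec = average(5), fifteenSec = average(15);
    if (oneSec > alertThreshold) {
      System.out.printf("ALERT: load %.1f exceeds %.1f (5s=%.1f, 15s=%.1f)%n",
                        oneSec, alertThreshold, fiveSec, fifteenSec);
    }
  }

  private double average(int window) {
    int n = 0;
    double sum = 0;
    for (double v : samples) {
      if (n == window) break;
      sum += v;
      n++;
    }
    return n == 0 ? 0 : sum / n;
  }
}
```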
!" #!!!!" $!!!!" %!!!!" &!!!!" '!!!!" (!!!!" )!!!!" *!!!!" +!!!!" #!!!!!" !" '" #!" #'" $!" $'" %!" !"#$%&%'"(($)& *+,$&-%$./0)%1&
2345+$6&7$4$.8/0&
– Single master node (JobTracker), many worker nodes (TaskTrackers)
– Client submits a job to the JobTracker
– JobTracker splits each job into tasks (map/reduce)
– Assigns tasks to TaskTrackers on demand
– Single name node, many data nodes
– Data is stored as fixed-size (e.g., 64MB) blocks
– HDFS typically holds map input and reduce output
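A minimal sketch using the standard FileSystem API to show where job data lives: map input is read from HDFS blocks and the final reduce output is written back to HDFS. The paths below are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIoSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();   // picks up the cluster's fs settings
    FileSystem fs = FileSystem.get(conf);

    // Map input: a file stored in HDFS as fixed-size blocks (e.g., 64MB).
    try (FSDataInputStream in = fs.open(new Path("/wordcount/input/part-00000"))) {
      byte[] buf = new byte[4096];
      int read = in.read(buf);
      System.out.println("read " + read + " bytes of map input");
    }

    // Reduce output: written back to HDFS when the job commits.
    try (FSDataOutputStream out = fs.create(new Path("/wordcount/output/part-00000"))) {
      out.writeBytes("Cat\t2\n");
    }
  }
}
```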