Tiled-MapReduce
Optimizing Resource Usages of Data-parallel Applications on Multicore
Rong Chen, Haibo Chen, Binyu Zang
Parallel Processing Institute, Fudan University
Data-parallel applications have emerged and increased rapidly over the past 10 years
- the amount of data processed per day in 2008
- the storage used for 3D rendering (Avatar)*
* http://www.information-management.com/newsletters/avatar_data_processing-10016774-1.html
MapReduce hides the details of parallelism, data distribution, fault tolerance and load balance for data-parallel applications from the programmer
The programmer only provides the Functionality, using the two primitives of the MapReduce runtime:

Map (input)
    for each word in input
        emit (word, 1)

Reduce (key, values)
    int sum = 0;
    for each value in values
        sum += value;
    emit (key, sum)

Example: Word Count
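To make the two primitives concrete, below is a minimal, self-contained C sketch of WordCount's Map and Reduce; the emit() callback and the driver in main() are hypothetical stand-ins for what a real MapReduce runtime (e.g. Phoenix) would provide, not its actual API.

    /* Minimal WordCount sketch in C. The emit() callback and the driver are
     * hypothetical stand-ins for a real MapReduce runtime, not its API. */
    #include <stdio.h>
    #include <string.h>

    /* emit: the runtime would collect these pairs; here we just print them. */
    static void emit(const char *key, long value) {
        printf("(%s, %ld)\n", key, value);
    }

    /* Map: split the input chunk into words and emit (word, 1) for each. */
    static void map(char *input) {
        for (char *word = strtok(input, " \t\n"); word != NULL;
             word = strtok(NULL, " \t\n"))
            emit(word, 1);
    }

    /* Reduce: sum all values emitted for one key and emit (key, sum). */
    static void reduce(const char *key, const long *values, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += values[i];
        emit(key, sum);
    }

    int main(void) {
        char text[] = "but boy but";
        map(text);                      /* emits (but,1) (boy,1) (but,1) */
        long but_values[] = {1, 1};
        reduce("but", but_values, 2);   /* emits (but,2) */
        return 0;
    }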
Multicore is commercially prevalent, and many-core will appear in the near future
[Figure: core counts growing from 1X to 4X, 8X and 64X]
A MapReduce runtime for shared-memory
> CMPs and SMPs
> NUMA
Features
> Parallelism: threads
> Communication: shared address space
Heavily optimized runtime
> Runtime algorithms, e.g. locality-aware task distribution
> Scalable data structures, e.g. hash table
> OS interaction, e.g. memory allocator, thread pool
[Figure: execution flow of the MapReduce runtime on multicore. Worker threads start; the input is loaded from disk into the input buffer in main memory; Map workers (M) produce key/value arrays in the intermediate buffer; Reduce workers (R) aggregate them into the final buffer; Merge produces the result in the output buffer; the output file is written to disk and the buffers are freed.]
Problems of the MapReduce runtime on multicore
1. High memory usage: the input data is held in memory all the time
   e.g. WordCount with 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data)
2. Poor data locality
   e.g. WordCount with 4GB input has about a 25% L2 cache miss rate
3. Strict dependency barriers between phases
Tiled-MapReduce programming model
- Tiling strategy
- Fault tolerance (in paper)
Three optimizations for the Tiled-MapReduce runtime
- Input Data Buffer Reuse
- NUCA/NUMA-aware Scheduler
- Software Pipeline
The "Tiling Strategy": partition a large MapReduce job into a number of small sub-jobs and process them iteratively
Requirement: the Reduce operation must be associative, so that the partial results of sub-jobs can be combined later
Hadoop meets the requirement
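As a sanity check of what associativity buys us, the small C sketch below (illustrative only, with made-up values) verifies that reducing per-sub-job partial sums gives the same answer as one global reduce, which is exactly what allows the partial results of sub-jobs to be combined later.

    /* Associativity check for a WordCount-style reduction (illustrative):
     * reducing per-sub-job partial sums equals one global reduce. */
    #include <assert.h>
    #include <stddef.h>

    static long reduce_sum(const long *values, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += values[i];
        return sum;
    }

    int main(void) {
        long all[]   = {1, 1, 1, 1, 1};   /* five occurrences of one word  */
        long tile1[] = {1, 1, 1};         /* values seen by sub-job 1      */
        long tile2[] = {1, 1};            /* values seen by sub-job 2      */
        long partial[] = { reduce_sum(tile1, 3),   /* Combine of sub-job 1 */
                           reduce_sum(tile2, 2) }; /* Combine of sub-job 2 */

        /* The final Reduce over partial results matches a global Reduce. */
        assert(reduce_sum(partial, 2) == reduce_sum(all, 5));
        return 0;
    }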
Extensions to the MapReduce Model
- Add a loop of Map and Reduce phases; each pass of the loop (an iteration) processes one sub-job
- Rename the Reduce phase inside the loop to the Combine phase
- Add a final Reduce phase to process the partial results of all iterations
[Diagram: Start, then a loop of (Map, Combine), then Reduce, Merge, End]
The Tiled-MapReduce programming model
[Figure: execution flow under Tiled-MapReduce. For each iteration, a tile of input inside the iteration window is loaded into memory; Map workers (M) fill the intermediate buffer and Combine workers (C) compress it into a small iteration buffer; after all iterations, Reduce workers (R) process the iteration buffers into the final buffer, which is merged into the result, written out and freed.]
OPT1: Input Data Buffer Reuse
Problem: high memory usage; the input data occupies memory during the entire lifecycle of the job
Observation: most of the input data is no longer needed after the Combine phase
   e.g. WordCount: 1 copy for all duplicated words
Keeping only a small window of input in memory also improves data locality
Extension of the interface:
   Acquire: load input data to memory
   Release: free input data from memory
Runtime Interface
   Ostrich:            acquire / release
   Google MapReduce:   reader / writer
   Hadoop:             constructor / close
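A rough idea of how acquire/release might look for a tile-at-a-time input reader is sketched below in C; the function names, the 4MB tile size and the fread-based loading are assumptions for illustration, not the actual Ostrich interface.

    /* Sketch of acquire/release (illustrative, not the Ostrich API): each
     * iteration acquires one tile of input and releases it after Map and
     * Combine are done, so only one tile is resident at a time. */
    #include <stdio.h>
    #include <stdlib.h>

    #define TILE_SIZE (4 * 1024 * 1024)   /* 4MB tile, an arbitrary choice */

    /* acquire: load the next tile of input data into memory. */
    static char *acquire(FILE *fp, size_t *len) {
        char *buf = malloc(TILE_SIZE);
        if (buf == NULL) return NULL;
        *len = fread(buf, 1, TILE_SIZE, fp);
        if (*len == 0) { free(buf); return NULL; }
        return buf;
    }

    /* release: free the tile of input data once it has been combined. */
    static void release(char *buf) {
        free(buf);
    }

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        FILE *fp = fopen(argv[1], "rb");
        if (fp == NULL) return 1;

        size_t len, total = 0;
        for (char *tile; (tile = acquire(fp, &len)) != NULL; release(tile)) {
            /* ... the Map and Combine phases would run on this tile here ... */
            total += len;
        }
        printf("processed %zu bytes with one tile in memory at a time\n", total);
        fclose(fp);
        return 0;
    }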
[Figure: input data buffer reuse in action. Each iteration acquires one tile of input into the same input buffer; Map (M) and Combine (C) workers compress it into a new, much smaller buffer; release then frees the input buffer so it can be reused by the next iteration.]
OPT2: NUCA/NUMA-aware Scheduler
Problem: poor data locality of the MapReduce runtime on multicore
Tiled-MapReduce improves data locality: with input data buffer reuse (OPT1), the working set of an iteration can fit into the last level cache
The memory hierarchy of multicore is organized in a non-uniform cache access (NUCA) way
   e.g. local/remote L2 cache access: 14/110 cycles*
We therefore add a NUCA/NUMA-aware scheduler
* Intel 16-core machine with four 1.6GHz quad-core Xeon chips
[Figure: NUCA/NUMA-aware scheduler. Instead of a single master with a flat pool of workers, worker threads are organized into groups, one per chip sharing a cache, each led by a repeater/worker; the master dispatches sub-jobs from a job queue to the groups; each group keeps its intermediate and iteration buffers in local memory, and only the final buffer is shared.]
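One low-level ingredient of such a scheduler is pinning each group's workers to the cores of one chip so that they share a last-level cache; the Linux-specific sketch below shows the idea with pthread_attr_setaffinity_np, assuming (hypothetically) four quad-core chips whose cores are numbered contiguously. It is not the actual Ostrich scheduler.

    /* Sketch: pin each worker to one core so that a group of workers shares
     * a chip's cache (Linux GNU extension; illustrative only). Assumes four
     * quad-core chips with contiguously numbered cores. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NUM_CHIPS       4
    #define CORES_PER_CHIP  4

    static void *worker(void *arg) {
        long id = (long)arg;
        printf("worker %ld (group %ld) runs on core %d\n",
               id, id / CORES_PER_CHIP, sched_getcpu());
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_CHIPS * CORES_PER_CHIP];
        for (long id = 0; id < NUM_CHIPS * CORES_PER_CHIP; id++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)id, &set);           /* worker id maps to core id */
            pthread_attr_t attr;
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&threads[id], &attr, worker, (void *)id);
            pthread_attr_destroy(&attr);
        }
        for (long id = 0; id < NUM_CHIPS * CORES_PER_CHIP; id++)
            pthread_join(threads[id], NULL);
        return 0;
    }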
OPT3: Software Pipeline
Problem: data dependency; the barrier at the end of each phase makes every worker wait for the slowest worker in that phase
Observation: there is no data dependency between the Combine phase of one sub-job and the Map phase of its successor
Solution: pipeline the Combine phase of a sub-job with the Map phase of its successor
[Figure: without the pipeline, cores sit idle at the barriers between the Map and Combine phases; with the software pipeline, the Combine phase of one iteration overlaps with the Map phase of the next, shortening total execution time.]
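A minimal sketch of the idea in C (illustrative only, with toy map_tile/combine_tile stand-ins rather than real phases): while the main thread combines tile i, a helper thread already maps tile i+1, which is safe because the two have no data dependency.

    /* Toy software pipeline (illustrative): Map of tile i+1 runs in a helper
     * thread while the main thread runs Combine of tile i. */
    #include <pthread.h>
    #include <stdio.h>

    #define N_TILES 4

    typedef struct { int tile; long mapped; } map_arg_t;

    /* Stand-in for the Map phase of one tile. */
    static void *map_tile(void *p) {
        map_arg_t *a = p;
        a->mapped = a->tile + 1;
        return NULL;
    }

    /* Stand-in for the Combine phase of one tile's mapped output. */
    static long combine_tile(long mapped) {
        return mapped * 10;
    }

    int main(void) {
        long partial[N_TILES];
        map_arg_t cur = { .tile = 0 }, next;
        map_tile(&cur);                                  /* Map tile 0 up front */

        for (int i = 0; i < N_TILES; i++) {
            pthread_t t;
            int have_next = (i + 1 < N_TILES);
            if (have_next) {                             /* start Map of tile i+1 */
                next.tile = i + 1;
                pthread_create(&t, NULL, map_tile, &next);
            }
            partial[i] = combine_tile(cur.mapped);       /* Combine tile i, overlapped */
            if (have_next) {
                pthread_join(t, NULL);                   /* Map of tile i+1 done */
                cur = next;
            }
        }
        for (int i = 0; i < N_TILES; i++)
            printf("partial[%d] = %ld\n", i, partial[i]);
        return 0;
    }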
Platform: Intel 16-core machine (4 quad-core chips), 32GB main memory, Debian Linux with kernel 2.6.24
Systems: Phoenix-2 with Streamflow*, Ostrich with Streamflow
   * Scalable locality-conscious multithreaded memory allocation - ISMM'06
Applications: Inverted Index (II), Word Count (WC), Distributed Sort (DS), Log Statistics (LS)
[Table: number of keys and of duplicates per key (many/few/no) for each application]
Programmability: code modification for Acquire / Release
   Application               Acquire    Release
   Inverted Index (II)       11         11
   WordCount (WC)            Default    Default
   Distributed Sort (DS)     3          3
   Log Statistics (LS)       Default    Default
[Figure: speedup of Ostrich over Phoenix for WC, DS, LS and II; the speedup ranges from 1.2X to 3.3X.]
[Figure: scalability of Phoenix and Ostrich for WC, DS, LS and II as the number of cores grows.]
Other extensions to the MapReduce model
Other implementations of the MapReduce runtime, e.g. Phoenix and Metis [MIT-TR]
The environment differences between cluster and multicore open new design spaces and optimization opportunities
We presented Tiled-MapReduce and its three optimizations; Ostrich outperforms Phoenix by up to 3.3X
Parallel Processing Institute http://ppi.fudan.edu.cn
Ostrich
The largest bird, and the one with the top land speed
[Figure: memory consumption (GB), split into Input and Intermediate data, of Phoenix (PHO) and Ostrich (OST) for WC, DS, LS and II under the configurations PHO-1/OST-1, PHO-2/OST-2 and PHO-4/OST-4.]
[Figure: speedup from the NUCA/NUMA-aware scheduler (with vs. without) for WC, DS, LS and II on 4, 8, 12 and 16 cores.]
[Figure: L2 cache miss rate of Phoenix and Ostrich for WC, DS, LS and II.]
[Figure: execution time (sec) broken down into Map, Combine (active/idle), Reduce and Merge for WC, DS, LS and II; the /P variants use the software pipeline.]