Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore (PowerPoint presentation)


SLIDE 1

Tiled-MapReduce
Optimizing Resource Usages of Data-parallel Applications on Multicore

Rong Chen, Haibo Chen, Binyu Zang
Parallel Processing Institute, Fudan University

SLIDE 2: Data-Parallel Applications

Data-parallel applications have emerged and increased rapidly in the past 10 years
• Google processed about 24 petabytes of data per day in 2008
• The movie "Avatar" takes over 1 petabyte of local storage for 3D rendering *

* http://www.information-management.com/newsletters/avatar_data_processing-10016774-1.html

SLIDE 3: Data-Parallel Programming Model

MapReduce: a simple programming model for data-parallel applications

SLIDE 4: Data-Parallel Programming Model

MapReduce: a simple programming model for data-parallel applications

[Diagram: programmer; Parallelism, Functionality, Data Distribution, Fault Tolerance, Load Balance]

SLIDE 5: Data-Parallel Programming Model

MapReduce: a simple programming model for data-parallel applications

[Diagram: the MapReduce Runtime provides two primitives, Map (input) and Reduce (key, values); the programmer supplies the Functionality]

SLIDE 6: Data-Parallel Programming Model

MapReduce: a simple programming model for data-parallel applications

The programmer supplies only the Functionality of the two primitives, e.g. Word Count:

Map (input):
    for each word in input
        emit (word, 1)

Reduce (key, values):
    int sum = 0;
    for each value in values
        sum += value;
    emit (key, sum)
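
The Map and Reduce bodies above are slide pseudocode. A minimal runnable sketch of the same Word Count logic (plain Python standing in for the C interface of a real MapReduce runtime; the function names and the dictionary-based grouping are illustrative assumptions):

    from collections import defaultdict

    def word_count_map(chunk):
        # Map(input): emit (word, 1) for every word in the chunk
        return [(word, 1) for word in chunk.split()]

    def word_count_reduce(key, values):
        # Reduce(key, values): sum all the counts emitted for this word
        return key, sum(values)

    def run_word_count(chunks):
        # The runtime's job: group intermediate pairs by key, then reduce each group
        groups = defaultdict(list)
        for chunk in chunks:
            for key, value in word_count_map(chunk):
                groups[key].append(value)
        return dict(word_count_reduce(k, vs) for k, vs in groups.items())

    print(run_word_count(["but boy", "boy"]))   # {'but': 1, 'boy': 2}

Only the two functions are application code; grouping, scheduling, and everything else is the runtime's responsibility, which is the point of the model.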

SLIDE 7: Multicore

Multicore has recently become commercially prevalent
• Quad-core and eight-core chips are common
• Tens or hundreds of cores on a single chip will appear in the near future

[Diagram: core counts scaling 1X, 4X, 8X, 64X]

slide-8
SLIDE 8

Phoenix [HPCA’07 IISWC’09]

Map

apRedu educe ce on

  • n Mul

ultic ticore

  • re

A MapReduce runtime for shared-memory > CMPs and SMPs > NUMA

SLIDE 9: MapReduce on Multicore

Phoenix [HPCA'07, IISWC'09]

A MapReduce runtime for shared memory
  > CMPs and SMPs
  > NUMA

Features
  > Parallelism: threads
  > Communication: shared address space

SLIDE 10: MapReduce on Multicore

Phoenix [HPCA'07, IISWC'09]

A MapReduce runtime for shared memory
  > CMPs and SMPs
  > NUMA

Features
  > Parallelism: threads
  > Communication: shared address space

Heavily optimized runtime
  > Runtime algorithms, e.g. locality-aware task distribution
  > Scalable data structures, e.g. hash table
  > OS interaction, e.g. memory allocator, thread pool
SLIDE 11: Implementation on Multicore

[Diagram: Disk, Main Memory, Processors]

SLIDE 12: Implementation on Multicore

[Diagram: Start; worker threads on the processors; the input is loaded from Disk into the Input Buffer in main memory]

SLIDE 13: Implementation on Multicore

[Diagram: the Intermediate Buffer is allocated in main memory]

SLIDE 14: Implementation on Multicore

[Diagram: Map workers (M) scan the input (e.g. the words "but" and "boy") and emit (word, 1) pairs into the key array and value array of the Intermediate Buffer]

SLIDE 15: Implementation on Multicore

[Diagram: Reduce workers (R) consume the Intermediate Buffer and fill the Final Buffer]

SLIDE 16: Implementation on Multicore

[Diagram: Reduce workers aggregate each key's values into a single value in the Final Buffer]

SLIDE 17: Implementation on Multicore

[Diagram: Merge combines the Final Buffer into the Result in the Output Buffer]

slide-18
SLIDE 18

Main Memory Processors Start M M M M R R R R

Merge

End Disk Input Free Output

Imple

mplementat mentation ion

  • n
  • n

Mul

ulticore ticore

Write File

.. .. .. .. .. .. .. .. ..
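
Slides 11 through 18 walk one job through the runtime end to end: load, Map, Reduce, Merge, write, free. A rough single-threaded sketch of that flow (worker threads and the Phoenix C API are elided; the function name and buffer layout are illustrative assumptions):

    from collections import defaultdict

    def run_job(path, map_fn, reduce_fn):
        # Load: the whole input stays in main memory for the entire job
        with open(path) as f:
            input_buffer = f.read()

        # Map phase: fill the intermediate buffer (key -> list of values)
        intermediate = defaultdict(list)
        for key, value in map_fn(input_buffer):
            intermediate[key].append(value)

        # Reduce phase: one (key, result) pair per key goes into the final buffer
        final_buffer = [reduce_fn(key, values) for key, values in intermediate.items()]

        # Merge phase: produce the ordered result; only afterwards are the buffers freed
        return sorted(final_buffer)

The property to notice is that input_buffer and intermediate both live for the whole job, which is exactly what the next slides criticize.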

SLIDE 19: Deficiency of MapReduce on Multicore

SLIDE 20: Deficiency of MapReduce on Multicore

High memory usage
• Keep the whole input data in main memory all the time
  e.g. WordCount with 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data)

SLIDE 21: Deficiency of MapReduce on Multicore

High memory usage
• Keep the whole input data in main memory all the time
  e.g. WordCount with 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data)

Poor data locality
• Process all input data at one time
  e.g. WordCount with 4GB input has about a 25% L2 cache miss rate

SLIDE 22: Deficiency of MapReduce on Multicore

High memory usage
• Keep the whole input data in main memory all the time
  e.g. WordCount with 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data)

Poor data locality
• Process all input data at one time
  e.g. WordCount with 4GB input has about a 25% L2 cache miss rate

Strict dependency barriers
• CPUs idle at the exchange of phases

SLIDE 23: Deficiency of MapReduce on Multicore

High memory usage
• Keep the whole input data in main memory all the time

Poor data locality
• Process all input data at one time

Strict dependency barriers
• CPUs idle at the exchange of phases

Solution: Tiled-MapReduce

SLIDE 24: Contribution

Tiled-MapReduce programming model
  − Tiling strategy
  − Fault tolerance (in the paper)

Three optimizations for the Tiled-MapReduce runtime
  − Input Data Buffer Reuse
  − NUCA/NUMA-aware Scheduler
  − Software Pipeline
SLIDE 25: Outline

1. Tiled-MapReduce
2. Optimizations on TMR
3. Evaluation
4. Conclusion

SLIDE 26: Outline

1. Tiled-MapReduce
2. Optimizations on TMR
3. Evaluation
4. Conclusion

SLIDE 27: Tiled-MapReduce

“Tiling Strategy”
• Divide a large MapReduce job into a number of independent small sub-jobs
• Iteratively process one sub-job at a time

SLIDE 28: Tiled-MapReduce

“Tiling Strategy”
• Divide a large MapReduce job into a number of independent small sub-jobs
• Iteratively process one sub-job at a time

Requirement
• The Reduce function must be commutative and associative
• All 26 applications in the test suites of Phoenix and Hadoop meet the requirement
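
The requirement matters because tiling applies the Reduce logic piecewise: each sub-job is reduced on its own and the partial results are reduced again later. For a commutative and associative function such as a sum, the regrouping cannot change the answer; a tiny illustration:

    from functools import reduce

    values = [3, 1, 4, 1, 5, 9, 2, 6]
    add = lambda a, b: a + b                 # commutative and associative

    whole = reduce(add, values)              # reduce the whole job at once

    tiles = [values[0:4], values[4:8]]       # two sub-jobs (tiles)
    partials = [reduce(add, tile) for tile in tiles]
    assert reduce(add, partials) == whole    # 9 + 22 == 31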

SLIDE 29: Tiled-MapReduce

Extensions to the MapReduce Model

[Diagram: Start → Map → Reduce → Merge → End]

SLIDE 30: Tiled-MapReduce

Extensions to the MapReduce Model
1. Replace the Map phase with a loop of Map and Reduce phases

[Diagram: Start → loop of (Map → Reduce) → Reduce → Merge → End]

SLIDE 31: Tiled-MapReduce

Extensions to the MapReduce Model
1. Replace the Map phase with a loop of Map and Reduce phases
2. Process one sub-job in each iteration

[Diagram: Start → loop of (Map → Reduce) → Reduce → Merge → End]

SLIDE 32: Tiled-MapReduce

Extensions to the MapReduce Model
1. Replace the Map phase with a loop of Map and Reduce phases
2. Process one sub-job in each iteration
3. Rename the Reduce phase within the loop to the Combine phase

[Diagram: Start → loop of (Map → Combine) → Reduce → Merge → End]

SLIDE 33: Tiled-MapReduce

Extensions to the MapReduce Model
1. Replace the Map phase with a loop of Map and Reduce phases
2. Process one sub-job in each iteration
3. Rename the Reduce phase within the loop to the Combine phase
4. Modify the Reduce phase to process the partial results of all iterations

[Diagram: Start → loop of (Map → Combine) → Reduce → Merge → End]
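
Putting the four extensions together turns the single-pass driver sketched after slide 18 into a loop of Map and Combine over sub-jobs, followed by a final Reduce over the partial results. Again a hedged Python sketch (the fixed-size read, the tile_size parameter, and the boundary handling are illustrative assumptions, not the Ostrich implementation):

    from collections import defaultdict

    def run_tiled_job(path, map_fn, reduce_fn, tile_size):
        partials = defaultdict(list)              # partial results of all iterations

        with open(path) as f:
            while True:
                # One sub-job (tile) per iteration; record boundaries at tile
                # edges are ignored in this sketch
                tile = f.read(tile_size)
                if not tile:
                    break

                # Map phase of the sub-job
                intermediate = defaultdict(list)
                for key, value in map_fn(tile):
                    intermediate[key].append(value)

                # Combine phase: reduce the sub-job locally (hence the requirement
                # that reduce_fn be commutative and associative) and keep only the
                # compact partial result
                for key, values in intermediate.items():
                    partials[key].append(reduce_fn(key, values)[1])

        # Final Reduce over the partial results of all iterations, then Merge
        final_buffer = [reduce_fn(key, values) for key, values in partials.items()]
        return sorted(final_buffer)

Only one tile's input and intermediate data are live at a time, which is what the memory and locality optimizations below build on.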

SLIDE 34: Prototype of Tiled-MapReduce

Ostrich: a prototype of the Tiled-MapReduce programming model
• Demonstrates the effectiveness of the TMR programming model
• Based on the Phoenix runtime
• Follows its data structures and algorithms
SLIDE 35: Ostrich Implementation

[Diagram: Start; worker threads; input on Disk; the Intermediate Buffer in main memory]

SLIDE 36: Ostrich Implementation

[Diagram: Map workers (M) process one iteration window of the input]

SLIDE 37: Ostrich Implementation

[Diagram: Combine workers (C) aggregate the sub-job's intermediate results into the Iteration Buffer]

SLIDE 38: Ostrich Implementation

[Diagram: Map and Combine repeat for the next iteration window]

SLIDE 39: Ostrich Implementation

[Diagram: Map and Combine continue until the whole input has been processed]

SLIDE 40: Ostrich Implementation

[Diagram: Reduce workers (R) process the partial results of all iterations into the Final Buffer]

SLIDE 41: Ostrich Implementation

[Diagram: Merge combines the Final Buffer into the Result]

SLIDE 42: Ostrich Implementation

[Diagram: the Output is written, the buffers are freed, and the job ends]

SLIDE 43: Outline

1. Tiled-MapReduce
2. Optimizations on TMR
3. Evaluation
4. Conclusion

SLIDE 44: OPT1: Memory Reuse

SLIDE 45: OPT1: Memory Reuse

High memory usage
• Keep the whole input data in memory during the entire lifecycle

SLIDE 46: OPT1: Memory Reuse

High memory usage
• Keep the whole input data in memory during the entire lifecycle

Observation
• Only a small part of the input data is actually necessary
  e.g. WordCount: 1 copy for all duplicated words

SLIDE 47: OPT1: Memory Reuse

High memory usage
• Keep the whole input data in memory during the entire lifecycle

Observation
• Only a small part of the input data is actually necessary
  e.g. WordCount: 1 copy for all duplicated words
• Aggregating these data improves data locality

SLIDE 48: OPT1: Memory Reuse

Input Data Memory Reuse
• Copy the necessary data to a new buffer in each Combine phase
• Only hold the input data of the current sub-job in memory
• Reuse the Input Buffer among sub-jobs

SLIDE 49: OPT1: Memory Reuse

Extension of the Interface
• Provide 2 optional interfaces
  Acquire: load input data into memory
  Release: free input data from memory
• Their counterparts in other runtimes:

Runtime Interface
  Ostrich:           acquire / release
  Google MapReduce:  reader / writer
  Hadoop:            constructor / close
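
The names acquire and release come from the slide; the class and signatures below are assumptions, not the Ostrich API. A sketch of how the two optional interfaces could feed the tiled driver so that only the current sub-job's input is resident:

    class FileInput:
        # Input source exposing the two optional interfaces
        def __init__(self, path, tile_size):
            self.f = open(path, "rb")
            self.tile_size = tile_size
            self.buffer = None

        def acquire(self):
            # Acquire: load the next sub-job's input data into memory,
            # reusing the same buffer slot across sub-jobs
            self.buffer = self.f.read(self.tile_size)
            return self.buffer

        def release(self):
            # Release: free the finished sub-job's input; the Combine phase
            # has already copied whatever it still needs into a new buffer
            self.buffer = None

The driver would call acquire() before each iteration's Map phase and release() after its Combine phase.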

SLIDE 50: Input Data Memory Reuse

[Diagram: acquire loads one sub-job's input from Disk into the Input Buffer; Map workers (M) process it]

SLIDE 51: Input Data Reuse

[Diagram: Map workers (M) scan the current sub-job's input (e.g. the words "Baby" and "But")]

SLIDE 52: Input Data Reuse

[Diagram: Combine workers (C) allocate a New Buffer]

SLIDE 53: Input Data Reuse

[Diagram: the Combine phase copies only the necessary data (e.g. one copy of "Baby" and "But") into the New Buffer]

SLIDE 54: Input Data Reuse

[Diagram: release frees the finished sub-job's input; the Input Buffer is reused by the next sub-job]

SLIDE 55: OPT2: Locality Optimization

SLIDE 56: OPT2: Locality Optimization

Poor data locality of the MapReduce runtime on multicore
• Process all input data at one time

SLIDE 57: OPT2: Locality Optimization

Poor data locality of the MapReduce runtime on multicore
• Process all input data at one time

Tiled-MapReduce improves data locality
• Make the working set of each sub-job fit into the last-level cache
• Aggregate partial results in the Combine phase (in OPT1)
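
One way to read the first bullet: the iteration window is sized so that a sub-job's input plus its intermediate data fit in the shared last-level cache. A back-of-the-envelope sketch (the 4MB cache size and the expansion factor are made-up illustration values, not measured Ostrich parameters):

    LLC_BYTES = 4 * 1024 * 1024       # assumed shared last-level cache per chip
    EXPANSION = 2.5                   # assumed intermediate-data bytes per input byte

    def iteration_window(llc_bytes=LLC_BYTES, expansion=EXPANSION):
        # Input tile + intermediate results should stay within the LLC
        return int(llc_bytes / (1 + expansion))

    print(iteration_window())         # ~1.2 MB window for a 4 MB cache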

SLIDE 58: OPT2: Locality Optimization

Memory Hierarchy
• Multicore hardware usually organizes caches in a non-uniform cache access (NUCA) way
• Cross-chip operations are expensive *
  e.g. local/remote L2 cache access: 14/110 cycles

* Intel 16-core machine with 4 Xeon 1.6GHz quad-core chips

SLIDE 59: OPT2: Locality Optimization

Memory Hierarchy
• Multicore hardware usually organizes caches in a non-uniform cache access (NUCA) way
• Cross-chip operations are expensive *
  e.g. local/remote L2 cache access: 14/110 cycles

NUCA/NUMA-aware scheduler
• Eliminate remote cache and memory accesses
• Run each sub-job on a single chip

* Intel 16-core machine with 4 Xeon 1.6GHz quad-core chips
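
A minimal Linux sketch of the second bullet, pinning every worker of a sub-job to the cores of one chip. The 4-cores-per-chip numbering mirrors the example machine in the footnote, but the topology map and helper names are assumptions, not the Ostrich scheduler:

    import os
    from threading import Thread

    CORES_PER_CHIP = 4    # assumed: cores 0-3 on chip 0, 4-7 on chip 1, ...

    def run_subjob_on_chip(chip, num_workers, worker_fn):
        cores = set(range(chip * CORES_PER_CHIP, (chip + 1) * CORES_PER_CHIP))
        threads = []
        for wid in range(num_workers):
            def work(wid=wid):
                # Pin this worker thread to the chip's own cores, so its cache
                # and (by first-touch allocation) its memory stay local
                os.sched_setaffinity(0, cores)
                worker_fn(wid)
            t = Thread(target=work)
            t.start()
            threads.append(t)
        for t in threads:
            t.join()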

SLIDE 60: NUCA/NUMA-Aware Scheduler

[Diagram: 8 cores on 2 chips, each core with its own cache ($), each chip with a shared cache and local main memory; a master/worker thread plus worker threads on the remaining cores]

SLIDE 61: NUCA/NUMA-Aware Scheduler

[Diagram: workers are grouped per chip, each group led by a repeater/worker and coordinated by a master]

SLIDE 62: NUCA/NUMA-Aware Scheduler

[Diagram: the master dispatches sub-jobs to the per-chip groups through a job queue]

SLIDE 63: NUCA/NUMA-Aware Scheduler

[Diagram: each chip's Intermediate Buffer and Iteration Buffer are placed in its local memory, along with the Final Buffer]

SLIDE 64: OPT3: CPU Optimization

SLIDE 65: OPT3: CPU Optimization

Data Dependency
• Strict barriers after the Map and Reduce phases
• The execution time of a job is determined by the slowest worker in each phase

SLIDE 66: OPT3: CPU Optimization

Data Dependency
• Strict barriers after the Map and Reduce phases
• The execution time of a job is determined by the slowest worker in each phase

Observation
• No data dependency between one sub-job's Combine phase and its successor's Map phase

SLIDE 67: OPT3: CPU Optimization

Software Pipeline
• Overlap the Combine phase of the current sub-job with the Map phase of its successor
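
A hedged sketch of that overlap with one background thread: while sub-job i's Combine runs, the caller already starts sub-job i+1's Map. It only shows the dependency structure, not the Ostrich pipeline (and in CPython the GIL limits true parallelism unless map_fn/combine_fn release it):

    from threading import Thread

    def run_pipelined(tiles, map_fn, combine_fn):
        partials = []
        combiner = None
        for tile in tiles:
            intermediate = map_fn(tile)        # Map of the next sub-job runs here ...
            if combiner:
                combiner.join()                # ... overlapping Combine of the previous one
            combiner = Thread(target=lambda data=intermediate:
                              partials.append(combine_fn(data)))
            combiner.start()
        if combiner:
            combiner.join()
        return partials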

SLIDE 68: Software Pipeline

[Diagram: timeline across 4 cores showing Map, Combine, and Idle time]

SLIDE 69: Software Pipeline

[Diagram: barriers between phases leave cores Idle while they wait for the slowest worker]

SLIDE 70: Software Pipeline

[Diagram: with the software pipeline, the next sub-job's Map fills the Idle time during Combine, yielding a speedup]

SLIDE 71: Outline

1. Tiled-MapReduce
2. Optimizations on TMR
3. Evaluation
4. Conclusion

SLIDE 72: Configuration

Platform
• Intel 16-core machine (4 quad-core chips)
• 32GB main memory
• Debian Linux with kernel v2.6.24

Systems
• Phoenix-2 with streamflow *
• Ostrich with streamflow *

* Scalable locality-conscious multithreaded memory allocation, ISMM'06

SLIDE 73: Configuration

Applications
• Inverted Index (II)
• WordCount (WC)
• Distributed Sort (DS)
• Log Statistics (LS)

Each application is characterized by its number of keys and of duplicates per key (one, few, many, or none).

SLIDE 74: Burden of Programmer

Code Modification
• Support input data memory reuse
• Acquire and Release need only small changes per application: 11 or 3 lines of code, or simply the default implementations

Applications: Inverted Index (II), WordCount (WC), Distributed Sort (DS), Log Statistics (LS)

SLIDE 75: Overall Performance

[Chart: speedup of Ostrich over Phoenix for WC, DS, LS, and II, ranging from 1.2X up to 3.3X]

SLIDE 76: Scalability

[Chart: scalability of Phoenix and Ostrich for WC, DS, LS, and II as the number of cores grows from 1 to 16]

SLIDE 77: Related Work

Other extensions to the MapReduce model
• Database: Map-Reduce-Merge [SIGMOD'07]
• Online aggregation: MapReduce Online [NSDI'10]

Other implementations of the MapReduce runtime
• Cluster: Hadoop [Apache, OSDI'08]
• Shared memory: Phoenix [HPCA'07, IISWC'09] and Metis [MIT-TR]
• GPGPU: Mars [PACT'07]
• Heterogeneous: MapCG [PACT'10]
SLIDE 78: Conclusion

 Environment differences between cluster and multicore open new design spaces and optimization opportunities
 Tiled-MapReduce and the three optimizations
 Ostrich outperforms Phoenix by up to 3.3X

SLIDE 79: Thanks

Thanks
Questions?

Parallel Processing Institute
http://ppi.fudan.edu.cn

Ostrich: the bird with the top land speed, and the largest of all birds

SLIDE 80
SLIDE 81: Memory Consumption

[Chart: memory consumption (GB) of Phoenix (PHO) vs. Ostrich (OST) for WC, DS, LS, and II, broken into Input and Intermediate data]

SLIDE 82: NUCA/NUMA-Aware Scheduler

[Chart: speedup (0.8 to 1.4) with vs. without the NUCA/NUMA-aware scheduler for WC, DS, LS, and II on 4, 8, 12, and 16 cores]

SLIDE 83: Exploit Locality

[Chart: L2 cache miss rate (0% to 30%) of Phoenix vs. Ostrich for WC, DS, LS, and II]

SLIDE 84: Software Pipeline

[Chart: execution time (sec) broken into Map, Combine (Active), Combine (Idle), Reduce, and Merge phases for WC, DS, LS, and II, with and without the pipeline (/P)]