
Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Michael Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS, October 2006, San Jose, CA
http://cag.csail.mit.edu/streamit



SLIDE 2

Multicores Are Here!

[Figure: processors plotted by year (x-axis, 1970 to 20??) against number of cores (y-axis, 1 to 512). Chips shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]

SLIDE 3

Multicores Are Here!

For uniprocessors, C was:

  • Portable
  • High Performance
  • Composable
  • Malleable
  • Maintainable

Uniprocessors: C is the common machine language

[Figure: processors plotted by year (1970 to 20??) against number of cores (1 to 512), as on slide 2.]

SLIDE 4

Multicores Are Here!

What is the common machine language for multicores?

[Figure: processors plotted by year (1970 to 20??) against number of cores (1 to 512), as on slide 2.]

SLIDE 5

Common Machine Languages

Uniprocessors:

  • Common Properties

– Single flow of control
– Single memory image

  • Differences (with the compiler phase that abstracts each away)

– Register File: Register Allocation
– ISA: Instruction Selection
– Functional Units: Instruction Scheduling

Multicores:

  • Common Properties

– Multiple flows of control
– Multiple local memories

  • Differences

– Number and capabilities of cores
– Communication Model
– Synchronization Model

von-Neumann languages represent the common properties and abstract away the differences.

Need common machine language(s) for multicores.

SLIDE 6

Streaming as a Common Machine Language

  • Regular and repeating computation
  • Independent filters with explicit communication

– Segregated address spaces and multiple program counters

  • Natural expression of parallelism:

– Producer / Consumer dependencies
– Enables powerful, whole-program transformations

[Figure: FM radio stream graph with filters AtoD, FMDemod, Scatter, LPF1, LPF2, LPF3, HPF1, HPF2, HPF3, Gather, Adder, Speaker.]

SLIDE 7

Types of Parallelism

Task Parallelism

– Parallelism explicit in algorithm
– Between filters without producer/consumer relationship

Data Parallelism

– Peel iterations of filter, place within scatter/gather pair (fission)
– Can’t parallelize filters with state

Pipeline Parallelism

– Between producers and consumers
– Stateful filters can be parallelized

[Figure: stream graph with a Scatter/Gather pair; the region labeled Task is highlighted.]

SLIDE 8

Types of Parallelism

Task Parallelism

– Parallelism explicit in algorithm
– Between filters without producer/consumer relationship

Data Parallelism

– Between iterations of a stateless filter
– Place within scatter/gather pair (fission)
– Can’t parallelize filters with state

Pipeline Parallelism

– Between producers and consumers
– Stateful filters can be parallelized

[Figure: stream graph with two Scatter/Gather pairs; regions are labeled Task, Pipeline, Data, and Data Parallel.]

SLIDE 9

Types of Parallelism

Traditionally:

Task Parallelism

– Thread (fork/join) parallelism

Data Parallelism

– Data parallel loop (forall)

Pipeline Parallelism

– Usually exploited in hardware

[Figure: stream graph with two Scatter/Gather pairs; regions are labeled Task, Pipeline, and Data.]

SLIDE 10

Problem Statement

Given:

– Stream graph with compute and communication estimate for each filter
– Computation and communication resources of the target machine

Find:

– Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

SLIDE 11

Our 3-Phase Solution

1. Coarsen: Fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters

Compile to a 16 core architecture

– 11.2x mean throughput speedup over single core

[Figure: the three phases in sequence: Coarsen Granularity, Data Parallelize, Software Pipeline.]

SLIDE 12

Outline

  • StreamIt language overview
  • Mapping to multicores

– Baseline techniques
– Our 3-phase solution

SLIDE 13

The StreamIt Project

  • Applications

– DES and Serpent [PLDI 05]
– MPEG-2 [IPDPS 06]
– SAR, DSP benchmarks, JPEG, …

  • Programmability

– StreamIt Language (CC 02)
– Teleport Messaging (PPOPP 05)
– Programming Environment in Eclipse (P-PHEC 05)

  • Domain Specific Optimizations

– Linear Analysis and Optimization (PLDI 03)
– Optimizations for bit streaming (PLDI 05)
– Linear State Space Analysis (CASES 05)

  • Architecture Specific Optimizations

– Compiling for Communication-Exposed Architectures (ASPLOS 02)
– Phased Scheduling (LCTES 03)
– Cache Aware Optimization (LCTES 05)
– Load-Balanced Rendering (Graphics Hardware 05)

[Figure: compiler flow. StreamIt Program, Front-end, Stream-Aware Optimizations, then four backends: Uniprocessor backend, Cluster backend, Raw backend, IBM X10 backend. Outputs shown: C/C++, MPI-like C/C++, C per tile + msg code, Streaming X10 runtime. Also shown: Annotated Java and Simulator (Java Library).]

SLIDE 14

Model of Computation

  • Synchronous Dataflow [Lee ’92]

– Graph of autonomous filters
– Communicate via FIFO channels

  • Static I/O rates

– Compiler decides on an order of execution (schedule)
– Static estimation of computation

[Figure: example stream graph with filters A/D, Band Pass, Duplicate, and four LED Detect filters.]
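Because the I/O rates are static, a compiler can solve for how many times each filter must fire per steady state before generating any code. The following is a minimal Python sketch of that idea for a straight pipeline only; the function name `steady_state_repetitions` and the example rates are illustrative assumptions, not part of the StreamIt compiler.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def steady_state_repetitions(rates):
    """Smallest positive firing counts for a pipeline of filters.

    `rates` lists (pop, push) per filter, upstream first (hypothetical
    input format). Adjacent filters must balance:
    reps[i] * push[i] == reps[i + 1] * pop[i + 1].
    """
    reps = [Fraction(1)]
    for (_, push), (pop, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push / pop)
    # Scale the rational solution to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps), 1)
    ints = [int(r * scale) for r in reps]
    common = reduce(gcd, ints)
    return [v // common for v in ints]


# Hypothetical rates: source pushes 2, middle pops 3 / pushes 1, sink pops 2.
example = steady_state_repetitions([(0, 2), (3, 1), (2, 0)])
```

With these rates the source fires 3 times, the middle filter twice, and the sink once, so every channel returns to its starting occupancy after one steady state.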

SLIDE 15

Example StreamIt Filter

[Figure: FIR reading a numbered input tape and writing a numbered output tape.]

float->float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateless
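To make the peek/pop/push semantics concrete, here is a minimal Python sketch that mimics one firing of the FIR work function on a FIFO channel. The helper names `fir_work` and `run_fir` and the use of a `deque` as the channel are assumptions made for illustration; they are not StreamIt APIs.

```python
from collections import deque


def fir_work(channel, weights):
    """One firing with rates push 1, pop 1, peek N (N = len(weights))."""
    result = 0.0
    for i, w in enumerate(weights):
        result += w * channel[i]   # peek(i): read without consuming
    channel.popleft()              # pop(): consume exactly one item
    return result                  # push(result)


def run_fir(samples, weights):
    """Fire the filter while at least N items are available to peek."""
    channel = deque(samples)
    out = []
    while len(channel) >= len(weights):
        out.append(fir_work(channel, weights))
    return out


smoothed = run_fir([1, 2, 3, 4], [0.5, 0.5])
```

Each firing reads a window of N items but consumes only one, so consecutive firings overlap on N - 1 items. The function keeps nothing between firings, which is what makes the filter stateless.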

SLIDE 16

Example StreamIt Filter

[Figure: FIR reading a numbered input tape and writing a numbered output tape.]

float->float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateful
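The reason state blocks data parallelism can be shown with a much smaller filter than the adaptive FIR. The sketch below uses a hypothetical `RunningSum` filter (not from the slides; `adaptChannel` is not defined in the source, so it is not modeled here) and compares sequential execution with a naive 2-way round-robin split.

```python
class RunningSum:
    """Hypothetical stateful filter: output depends on all earlier inputs."""

    def __init__(self):
        self.total = 0  # field that persists across firings

    def work(self, x):
        self.total += x
        return self.total


def run(filter_obj, items):
    return [filter_obj.work(x) for x in items]


items = [1, 2, 3, 4]
sequential = run(RunningSum(), items)

# Naive 2-way fission: each copy sees only every second item.
lanes = [run(RunningSum(), items[k::2]) for k in range(2)]
fissed = [lanes[j % 2][j // 2] for j in range(len(items))]
```

The two results differ because each copy carries its own private state, so a stateful filter has to be parallelized some other way, for example by pipelining it against its neighbors.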

SLIDE 17

StreamIt Language Overview

  • StreamIt is a novel language for streaming

– Exposes parallelism and communication
– Architecture independent
– Modular and composable: simple structures are composed to create complex graphs
– Malleable: change program behavior with small modifications

[Figure: the hierarchical stream structures: filter, pipeline, splitjoin (splitter and joiner, parallel computation), and feedback loop (joiner and splitter). Each child may be any StreamIt language construct.]

SLIDE 18

Outline

  • StreamIt language overview
  • Mapping to multicores

– Baseline techniques
– Our 3-phase solution

SLIDE 19

Baseline 1: Task Parallelism

  • Inherent task parallelism between two processing pipelines
  • Task Parallel Model:

– Only parallelize explicit task parallelism
– Fork/join parallelism

  • Execute this on a 2 core machine: ~2x speedup over single core
  • What about 4, 16, 1024, … cores?

[Figure: stream graph. A Splitter feeds two parallel pipelines, each containing BandPass, Compress, Process, Expand, and BandStop; a Joiner merges them into Adder.]
SLIDE 20

Evaluation: Task Parallelism

Raw Microprocessor

– 16 in-order, single-issue cores with D$ and I$
– 16 memory banks, each bank with DMA
– Cycle accurate simulator

[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis ticks 1 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]

Parallelism: Not matched to target!
Synchronization: Not matched to target!

SLIDE 21

Baseline 2: Fine-Grained Data Parallelism

  • Each of the filters in the example is stateless
  • Fine-grained Data Parallel Model:

– Fiss each stateless filter N ways (N is number of cores)
– Remove scatter/gather if possible

  • We can introduce data parallelism

– Example: 4 cores

  • Each fission group occupies entire machine

[Figure: the two-pipeline graph before and after fine-grained fission for 4 cores. After fission, each of BandPass, Compress, Process, Expand, BandStop, and Adder appears replicated inside its own Splitter/Joiner pair.]
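Fission of a stateless filter can be sketched as a round-robin scatter, N independent copies of the work function, and a round-robin gather. The Python below is a minimal illustration assuming a filter that pops 1 and pushes 1 with no peeking; the name `fiss` is borrowed from the slide's verb, and the lanes run sequentially here rather than on separate cores.

```python
def fiss(work, n):
    """Wrap a stateless pop-1/push-1 work function in an n-way scatter/gather."""

    def run(items):
        # Scatter: item j goes to lane j % n.
        lanes = [items[k::n] for k in range(n)]
        # Each lane is independent, so it could execute on its own core.
        results = [[work(x) for x in lane] for lane in lanes]
        # Gather: read the lanes back in the same round-robin order.
        return [results[j % n][j // n] for j in range(len(items))]

    return run


square_4way = fiss(lambda x: x * x, 4)
fissed_output = square_4way(list(range(10)))
```

The output order matches sequential execution, but every item now crosses a splitter and a joiner, which is the synchronization cost the fine-grained baseline pays for each filter in the graph.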

SLIDE 22

Evaluation: Fine-Grained Data Parallelism

[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis ticks 1 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean. Series: Task; Fine-Grained Data.]

Good Parallelism! Too Much Synchronization!

SLIDE 23

Outline

  • StreamIt language overview
  • Mapping to multicores

– Baseline techniques
– Our 3-phase solution

SLIDE 24

Phase 1: Coarsen the Stream Graph

  • Before data-parallelism is exploited
  • Fuse stateless pipelines as much as possible without introducing state

– Don’t fuse stateless with stateful
– Don’t fuse a peeking filter with anything upstream

[Figure: the original stream graph (Splitter, two pipelines of BandPass, Compress, Process, Expand, BandStop, then Joiner and Adder). Four filters are marked "Peek".]

SLIDE 25

Phase 1: Coarsen the Stream Graph

  • Before data-parallelism is exploited
  • Fuse stateless pipelines as much as possible without introducing state

– Don’t fuse stateless with stateful
– Don’t fuse a peeking filter with anything upstream

  • Benefits:

– Reduces global communication and synchronization
– Exposes inter-node optimization opportunities

[Figure: coarsened graph. In each pipeline, BandPass, Compress, Process, and Expand are fused into one filter and BandStop remains separate; the Joiner feeds Adder.]
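The two fusion rules can be written as a small predicate applied along a pipeline. The Python sketch below is an illustration under simplifying assumptions: every filter is modeled as a per-item function (rates and the peek window itself are not modeled), and the `Filter` record, `can_fuse`, and `coarsen` are hypothetical names rather than the compiler's data structures.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    name: str
    work: Callable[[float], float]
    stateful: bool = False
    peeks: bool = False


def can_fuse(up: Filter, down: Filter) -> bool:
    """Slide rules: never mix in a stateful filter, and never fuse a
    peeking filter with anything upstream of it."""
    return not up.stateful and not down.stateful and not down.peeks


def coarsen(pipeline: List[Filter]) -> List[Filter]:
    """Greedily fuse adjacent filters of one pipeline, upstream first."""
    if not pipeline:
        return []
    fused = [pipeline[0]]
    for f in pipeline[1:]:
        prev = fused[-1]
        if can_fuse(prev, f):
            fused[-1] = Filter(prev.name + "+" + f.name,
                               lambda x, a=prev.work, b=f.work: b(a(x)),
                               peeks=prev.peeks)
        else:
            fused.append(f)
    return fused


# Work functions are placeholders; only the peeking flags follow the slides.
branch = [Filter("BandPass", lambda x: x + 1, peeks=True),
          Filter("Compress", lambda x: x * 2),
          Filter("Process", lambda x: x - 3),
          Filter("Expand", lambda x: x * 10),
          Filter("BandStop", lambda x: x + 5, peeks=True)]
coarse = coarsen(branch)
```

Under these assumptions the five-filter branch collapses to two filters, matching the grouping drawn on the slide: the peeking BandStop cannot absorb the filters upstream of it.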

SLIDE 26

Phase 2: Data Parallelize

Data Parallelize for 4 cores

  • Adder: fiss 4 ways, to occupy entire chip

[Figure: coarsened graph (Splitter, two branches of fused BandPass/Compress/Process/Expand followed by BandStop, Joiner) with Adder replicated 4 times inside a Splitter/Joiner pair.]

SLIDE 27

Phase 2: Data Parallelize

Data Parallelize for 4 cores

  • Task parallelism! Each fused filter does equal work
  • Fiss each filter 2 times to occupy entire chip

[Figure: each fused BandPass/Compress/Process/Expand filter is replicated 2 times inside its own Splitter/Joiner pair; BandStop is not yet fissed; Adder remains fissed 4 ways.]

SLIDE 28

Phase 2: Data Parallelize

Data Parallelize for 4 cores

  • Task parallelism, each filter does equal work
  • Fiss each filter 2 times to occupy entire chip
  • Task-conscious data parallelization

– Preserve task parallelism

  • Benefits:

– Reduces global communication and synchronization

[Figure: both the fused BandPass/Compress/Process/Expand filters and the BandStop filters are replicated 2 times inside Splitter/Joiner pairs; Adder is fissed 4 ways.]
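One way to read "task-conscious" is that filters already running in parallel share the cores, so each is fissed only enough for the group to fill the chip. The Python sketch below uses a simple proportional heuristic that is an assumption of this example, not the rule stated in the talk; the function name `fission_factors` and the work values are hypothetical.

```python
def fission_factors(work, cores):
    """Split `cores` among task-parallel filters in proportion to their work.

    `work` maps filter name to a static work estimate. The proportional
    rule is an illustrative assumption; rounding means the factors are
    not guaranteed to sum to `cores` for uneven work.
    """
    total = sum(work.values())
    return {name: max(1, round(cores * w / total))
            for name, w in work.items()}


# Two task-parallel fused filters doing equal (hypothetical) work, 4 cores.
pair = fission_factors({"fused_left": 10, "fused_right": 10}, 4)

# A filter with no task-parallel sibling gets the whole chip.
alone = fission_factors({"Adder": 3}, 4)
```

For the slide's example this gives 2 copies of each fused filter instead of 4, halving the number of scatter/gather edges while still occupying all four cores.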

SLIDE 29

Evaluation: Coarse-Grained Data Parallelism

[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis ticks 1 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean. Series: Task; Fine-Grained Data; Coarse-Grained Task + Data.]

Good Parallelism! Low Synchronization!

SLIDE 30

Simplified Vocoder

Target a 4 core machine

[Figure: simplified Vocoder stream graph. Filters: two AdaptDFT (inside a Splitter/Joiner), RectPolar, two branches each containing Amplify, Diff, UnWrap, and Accum (inside a Splitter/Joiner), and PolarRect. Work estimates shown on the filters: 6 6 20 2 1 1 1 2 1 1 1 20. Annotations: "Data Parallel" (twice) and "Data Parallel, but too little work!"]

SLIDE 31

Data Parallelize

Target a 4 core machine

[Figure: Vocoder after data parallelization. RectPolar is shown replicated inside Splitter/Joiner pairs, with per-copy work 5; the AdaptDFT filters and the Amplify, Diff, UnWrap, Accum branches are unchanged. Work estimates shown: 6 6 20 2 1 1 1 2 1 1 1 20 5 5.]

SLIDE 32

Data + Task Parallel Execution

Target 4 core machine

[Figure: schedule of the data-parallelized Vocoder, cores versus time; one steady state takes time 21. Work estimates shown: 6 6 2 1 1 1 2 1 1 1 5 5.]
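The figure of 21 can be reproduced with a short calculation if one assumes the graph executes as four back-to-back stages, each waiting for its slowest filter. The grouping of work estimates into stages below is an assumption read off the slide's numbers (both 20-unit filters fissed 4 ways to 5 each, and each branch summing to 2 + 1 + 1 + 1 = 5); `staged_time` is a hypothetical helper.

```python
def staged_time(stages):
    """Time for stages that run one after another, each stage taking as
    long as its slowest filter (filters within a stage run in parallel)."""
    return sum(max(stage) for stage in stages)


vocoder_stages = [
    [6, 6],                   # two AdaptDFT filters, task parallel
    [5, 5, 5, 5],             # one 20-unit filter fissed 4 ways
    [2 + 1 + 1 + 1] * 2,      # two task-parallel branches
    [5, 5, 5, 5],             # the other 20-unit filter fissed 4 ways
]
data_task_time = staged_time(vocoder_stages)

# Perfect load balance would need only total work / cores.
total_work = sum(sum(stage) for stage in vocoder_stages)
lower_bound = -(-total_work // 4)   # ceiling division for 4 cores
```

Under these assumptions the staged schedule takes 21 while the total work of 62 spread over 4 cores needs at least 16: two cores sit idle during the two-filter stages, which is the slack the next slides go after.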

SLIDE 33

We Can Do Better!

Target 4 core machine

[Figure: a tighter schedule of the same filters, cores versus time; one steady state takes time 16. Work estimates shown: 6 6 2 1 1 1 2 1 1 1 5 5.]

SLIDE 34

Phase 3: Coarse-Grained Software Pipelining

  • New steady-state is free of dependencies
  • Schedule new steady-state using a greedy partitioning

[Figure: execution timeline divided into a Prologue and a New Steady State; four RectPolar instances are shown.]
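Why the new steady state has no dependencies can be seen in a small simulation: after a prologue fills the buffers, every filter in a step reads only data produced in an earlier step. The Python below is a minimal sketch for a chain of pop-1/push-1 stages; `software_pipeline` and the `_EMPTY` marker are illustrative assumptions, not the compiler's scheduler.

```python
_EMPTY = object()  # marks a buffer slot holding no item


def software_pipeline(stages, inputs):
    """Run a chain of per-item stages in software-pipelined order.

    In each step, stage i consumes what stage i - 1 produced in the
    previous step, so all stages of one step are mutually independent.
    """
    n = len(stages)
    latch = [_EMPTY] * n            # output of each stage from the last step
    feed = iter(inputs)
    outputs = []
    for _ in range(len(inputs) + n - 1):   # includes prologue and drain
        new_latch = [_EMPTY] * n
        for i, stage in enumerate(stages):  # order within a step is irrelevant
            src = next(feed, _EMPTY) if i == 0 else latch[i - 1]
            if src is not _EMPTY:
                new_latch[i] = stage(src)
        latch = new_latch
        if latch[-1] is not _EMPTY:
            outputs.append(latch[-1])
    return outputs


chain = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
pipelined = software_pipeline(chain, [1, 2, 3])
```

The first two steps form the prologue, in which only the upstream stages fire. From then on each step runs all three stages on different items, so the stages can be placed on cores in any arrangement, including stateful ones.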

SLIDE 35

Greedy Partitioning

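Target 4 core machine

[Figure: greedy partitioning of the new steady state, cores versus time, reaching time 16. A "To Schedule:" list holds the filters not yet placed.]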
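A common greedy partitioning is to sort filters by decreasing work and always give the next one to the least-loaded core. The Python sketch below applies that heuristic to the Vocoder estimates from the earlier slides. The choice of this particular heuristic, the filter names, and the assumption that both 20-unit filters are fissed into four 5-unit copies are all assumptions of this example.

```python
import heapq


def greedy_partition(work, cores):
    """Assign filters to cores, largest work first, onto the least-loaded core."""
    heap = [(0, c) for c in range(cores)]        # (current load, core id)
    heapq.heapify(heap)
    assignment = {c: [] for c in range(cores)}
    for name, cost in sorted(work.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


vocoder = {"AdaptDFT_0": 6, "AdaptDFT_1": 6}
vocoder.update({f"RectPolar_{i}": 5 for i in range(4)})
vocoder.update({f"PolarRect_{i}": 5 for i in range(4)})
for b in range(2):
    # Which branch filter costs 2 is not recoverable from the slide text,
    # so the costs are attached to generic, hypothetical names.
    for k, cost in enumerate([2, 1, 1, 1]):
        vocoder[f"branch{b}_f{k}"] = cost

assignment, makespan = greedy_partition(vocoder, 4)
```

With these inputs the loads come out as 16, 16, 15, and 15, which matches the time of 16 shown on the slide and is optimal here because the total work is 62. Ignoring dependencies is valid only because the software-pipelined steady state has none.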

SLIDE 36

Evaluation: Coarse-Grained Task + Data + Software Pipelining

[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis ticks 1 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean. Series: Task; Fine-Grained Data; Coarse-Grained Task + Data; Coarse-Grained Task + Data + Software Pipeline.]

Best Parallelism! Lowest Synchronization!

SLIDE 37

Generalizing to Other Multicores

  • Architectural requirements:

– Compiler controlled local memories with DMA
– Efficient implementation of scatter/gather

  • To port to other architectures, consider:

– Local memory capacities
– Communication to computation tradeoff

  • Did not use processor-to-processor communication on Raw

SLIDE 38

Related Work

  • Streaming languages:

– Brook [Buck et al. ’04]
– StreamC/KernelC [Kapasi ’03, Das et al. ’06]
– Cg [Mark et al. ’03]
– SPUR [Zhang et al. ’05]

  • Streaming for Multicores:

– Brook [Liao et al., ’06]

  • Ptolemy [Lee ’95]
  • Explicit parallelism:

– OpenMP, MPI, & HPF

SLIDE 39

Conclusions

  • Streaming model naturally exposes task, data, and pipeline parallelism
  • This parallelism must be exploited at the correct granularity and combined correctly

                                                  Parallelism    Synchronization
Task                                              Not matched    Not matched
Fine-Grained Data                                 Good           High
Coarse-Grained Task + Data                        Good           Low
Coarse-Grained Task + Data + Software Pipeline    Best           Lowest

  • Good speedups across varied benchmark suite
  • Algorithms should be applicable across multicores