
Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs
Michael Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS, October 2006, San Jose, CA
http://cag.csail.mit.edu/streamit


  1. Title slide: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

  2. Multicores Are Here!
     [Figure: number of cores per chip (log scale, 1 to 512) versus year (1970 to 20??). Chips plotted include the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, Power6, PA-8800, Opteron, Opteron 4P, Yonah, PExtreme, Tanglewood, Xeon MP, Xbox360, Broadcom 1480, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Intel Tflops, Cisco CSR-1, Ambric AM2045, and Picochip PC102.]

  3. Multicores Are Here!
     Uniprocessors: C is the common machine language.
     For uniprocessors, C was:
     • Portable
     • High Performance
     • Composable
     • Malleable
     • Maintainable
     [Figure: the same cores-per-chip versus year plot as on slide 2.]

  4. Multicores Are Here!
     What is the common machine language for multicores?
     [Figure: the same cores-per-chip versus year plot as on slide 2.]

  5. Common Machine Languages
     Uniprocessors:
     • Common properties
       – Single flow of control
       – Single memory image
     • Differences (and the compiler task that handles each)
       – Register file (register allocation)
       – ISA (instruction selection)
       – Functional units (instruction scheduling)
     von-Neumann languages represent the common properties and abstract away the differences.
     Multicores:
     • Common properties
       – Multiple flows of control
       – Multiple local memories
     • Differences
       – Number and capabilities of cores
       – Communication model
       – Synchronization model
     Need common machine language(s) for multicores.

  6. Streaming as a Common Machine Language
     • Regular and repeating computation
     • Independent filters with explicit communication
       – Segregated address spaces and multiple program counters
     • Natural expression of parallelism:
       – Producer / consumer dependencies
       – Enables powerful, whole-program transformations
     [Figure: stream graph AtoD → FMDemod → Scatter → three parallel branches (LPF 1 → HPF 1, LPF 2 → HPF 2, LPF 3 → HPF 3) → Gather → Adder → Speaker.]

  7. Types of Parallelism
     Task Parallelism
     – Parallelism explicit in algorithm
     – Between filters without producer/consumer relationship
     Data Parallelism
     – Peel iterations of filter, place within scatter/gather pair (fission)
     – Can't parallelize filters with state
     Pipeline Parallelism
     – Between producers and consumers
     – Stateful filters can be parallelized
     [Figure: stream graph with a scatter/gather pair; the task-parallel branches are labeled.]

  8. Types of Parallelism
     Task Parallelism
     – Parallelism explicit in algorithm
     – Between filters without producer/consumer relationship
     Data Parallelism
     – Between iterations of a stateless filter
     – Place within scatter/gather pair (fission)
     – Can't parallelize filters with state
     Pipeline Parallelism
     – Between producers and consumers
     – Stateful filters can be parallelized
     [Figure: stream graph with scatter/gather pairs; regions labeled Task, Data (a data-parallel filter inside its own scatter/gather pair), and Pipeline.]

  9. Types of Parallelism
     Traditionally:
     Task Parallelism
     – Thread (fork/join) parallelism
     Data Parallelism
     – Data parallel loop (forall)
     Pipeline Parallelism
     – Usually exploited in hardware
     [Figure: the same stream graph, with regions labeled Task, Data, and Pipeline.]

  10. Problem Statement
      Given:
      – Stream graph with compute and communication estimate for each filter
      – Computation and communication resources of the target machine
      Find:
      – Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

  11. Our 3-Phase Solution
      Coarsen Granularity → Data Parallelize → Software Pipeline
      1. Coarsen: fuse stateless sections of the graph
      2. Data Parallelize: parallelize stateless filters
      3. Software Pipeline: parallelize stateful filters
      Compile to a 16 core architecture
      – 11.2x mean throughput speedup over single core

  12. Outline
      • StreamIt language overview
      • Mapping to multicores
        – Baseline techniques
        – Our 3-phase solution

  13. The StreamIt Project
      • Applications
        – DES and Serpent [PLDI 05]
        – MPEG-2 [IPDPS 06]
        – SAR, DSP benchmarks, JPEG, …
      • Programmability
        – StreamIt Language (CC 02)
        – Teleport Messaging (PPOPP 05)
        – Programming Environment in Eclipse (P-PHEC 05)
      • Domain Specific Optimizations
        – Linear Analysis and Optimization (PLDI 03)
        – Optimizations for bit streaming (PLDI 05)
        – Linear State Space Analysis (CASES 05)
      • Architecture Specific Optimizations
        – Compiling for Communication-Exposed Architectures (ASPLOS 02)
        – Phased Scheduling (LCTES 03)
        – Cache Aware Optimization (LCTES 05)
        – Load-Balanced Rendering (Graphics Hardware 05)
      [Figure: compiler flow. StreamIt Program → Front-end → Annotated Java, which feeds either the Simulator (Java Library) or the Stream-Aware Optimizations. The optimizer feeds four backends: Uniprocessor backend (C/C++), Cluster backend (MPI-like C/C++), Raw backend (C per tile + msg code), and IBM X10 backend (Streaming X10 runtime).]

  14. Model of Computation
      • Synchronous Dataflow [Lee ‘92]
        – Graph of autonomous filters
        – Communicate via FIFO channels
      • Static I/O rates
        – Compiler decides on an order of execution (schedule)
        – Static estimation of computation
      [Figure: stream graph A/D → Band Pass → Duplicate → four parallel Detect filters, each followed by an LED.]
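
Because I/O rates are static, a compiler can work out how often each filter must fire so that every channel stays balanced. The following is a minimal sketch of that idea for a linear pipeline only, not the StreamIt scheduler itself; the function name and the example rates are assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def steady_state_repetitions(rates):
    """rates: list of (pop, push) per filter in a linear pipeline.

    Returns the smallest positive integer firing counts such that each
    channel receives exactly as many items as its consumer removes.
    """
    reps = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        # Balance equation: reps_prev * push_prev == reps_next * pop_next
        reps.append(reps[-1] * push_prev / pop_next)
    # Scale the rational solution to the smallest all-integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps), 1)
    return [int(r * scale) for r in reps]


# Hypothetical (pop, push) rates: a source, a 2-to-3 rate converter,
# a 1-to-1 filter, and a sink that consumes 2 items per firing.
print(steady_state_repetitions([(0, 1), (2, 3), (1, 1), (2, 0)]))  # [4, 2, 6, 3]
```

With these rates the source fires 4 times and produces 4 items, the converter fires twice (consuming 4, producing 6), the 1-to-1 filter fires 6 times, and the sink fires 3 times, so no channel grows or drains over one steady-state iteration.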

  15. Example StreamIt Filter (stateless)
      [Figure: an FIR filter peeking at a sliding window over input items 0 to 11 and producing output items 0, 1, …]

          float->float filter FIR(int N, float[N] weights) {
              // Each firing peeks at N items, pops 1, and pushes 1.
              work push 1 pop 1 peek N {
                  float result = 0;
                  for (int i = 0; i < N; i++) {
                      result += weights[i] * peek(i);
                  }
                  pop();
                  push(result);
              }
          }
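
To make the peek/pop/push semantics of the FIR filter concrete, here is a small Python simulation of its work function over a FIFO channel. It is a sketch of the behavior described on the slide, not StreamIt's runtime; `run_fir` and the sample data are hypothetical.

```python
from collections import deque


def run_fir(weights, samples):
    """Simulate the FIR work function: peek N, pop 1, push 1 per firing."""
    n = len(weights)
    channel = deque(samples)  # input FIFO channel
    out = []
    # The filter can fire only while at least N items are available to peek.
    while len(channel) >= n:
        result = sum(weights[i] * channel[i] for i in range(n))  # peek(i)
        channel.popleft()   # pop()
        out.append(result)  # push(result)
    return out


print(run_fir([0.5, 0.5], [0, 2, 4, 6]))  # [1.0, 3.0, 5.0]
```

Each firing consumes a single item, so the window slides by one position per firing, and the last N - 1 items stay in the channel waiting for more input.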

  16. Example StreamIt Filter (stateful)
      [Figure: an FIR filter peeking at a sliding window over input items 0 to 11 and producing output items 0, 1, …]

          float->float filter FIR(int N) {
              // weights is now a field that persists across firings,
              // which makes the filter stateful.
              float[N] weights;
              work push 1 pop 1 peek N {
                  float result = 0;
                  weights = adaptChannel(weights);
                  for (int i = 0; i < N; i++) {
                      result += weights[i] * peek(i);
                  }
                  pop();
                  push(result);
              }
          }
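
The difference between stateless and stateful filters matters because only the former can be fissed safely. The sketch below illustrates this with two hypothetical one-in, one-out filters (a scaler and a running sum, neither taken from the slides): spreading items round-robin over independent copies preserves the output of the stateless filter but not of the stateful one.

```python
def make_scale():
    """Stateless: the output depends only on the current input item."""
    return lambda x: 2 * x


def make_running_sum():
    """Stateful: keeps a total that is updated on every firing."""
    total = 0

    def work(x):
        nonlocal total
        total += x
        return total

    return work


def run(make_filter, items):
    """Run a single copy of the filter over all items, in order."""
    work = make_filter()
    return [work(x) for x in items]


def run_fissed(make_filter, items, ways):
    """Round-robin scatter to `ways` independent copies, round-robin gather."""
    copies = [make_filter() for _ in range(ways)]
    return [copies[i % ways](x) for i, x in enumerate(items)]


data = [1, 2, 3, 4]
print(run(make_scale, data) == run_fissed(make_scale, data, 2))              # True
print(run(make_running_sum, data) == run_fissed(make_running_sum, data, 2))  # False
```

Each copy of the running sum sees only every other item, so its private total diverges from the sequential one. Keeping a stateful filter as a single unit and overlapping it with its producers and consumers avoids that problem.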

  17. StreamIt Language Overview
      • StreamIt is a novel language for streaming
        – Exposes parallelism and communication
        – Architecture independent
        – Modular and composable
          – Simple structures composed to create complex graphs
        – Malleable
          – Change program behavior with small modifications
      [Figure: the four StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (parallel computation between a splitter and a joiner); feedback loop (a joiner and a splitter forming a cycle).]

  18. Outline
      • StreamIt language overview
      • Mapping to multicores
        – Baseline techniques
        – Our 3-phase solution

  19. Baseline 1: Task Parallelism
      • Inherent task parallelism between two processing pipelines
      • Task Parallel Model:
        – Only parallelize explicit task parallelism
        – Fork/join parallelism
      • Execute this on a 2 core machine: ~2x speedup over single core
      • What about 4, 16, 1024, … cores?
      [Figure: Splitter → two parallel pipelines, each BandPass → Compress → Process → Expand → BandStop → Joiner → Adder.]
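
The question about larger core counts can be made concrete with a rough bound. The sketch below assumes each task-parallel branch runs whole on one core and ignores communication cost; the function name and the work estimates are hypothetical, not measurements from the talk.

```python
def task_parallel_speedup(branch_work, serial_work, n_cores):
    """Speedup bound when every branch is placed whole on a single core.

    branch_work: work estimate for each task-parallel branch.
    serial_work: work that cannot overlap with the branches.
    """
    loads = [0] * n_cores
    # Greedy placement: heaviest branch first, onto the least-loaded core.
    for work in sorted(branch_work, reverse=True):
        loads[loads.index(min(loads))] += work
    single_core = sum(branch_work) + serial_work
    parallel = max(loads) + serial_work
    return single_core / parallel


# Two equally heavy pipelines, as in the example graph (assumed estimates).
for cores in (2, 4, 16):
    print(cores, task_parallel_speedup([100, 100], serial_work=0, n_cores=cores))
```

Under these assumptions the bound is 2.0 at every core count: with only two branches in the graph, extra cores have nothing to run.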

  20. Evaluation: Task Parallelism
      Raw Microprocessor:
      – 16 inorder, single-issue cores with D$ and I$
      – 16 memory banks, each bank with DMA
      – Cycle accurate simulator
      Parallelism: Not matched to target!
      Synchronization: Not matched to target!
      [Figure: bar chart of throughput normalized to single-core StreamIt (y-axis 0 to 19), one bar per benchmark plus a geometric mean; the individual benchmark labels are not recoverable from the extraction.]

  21. Baseline 2: Fine-Grained Data Parallelism
      • Each of the filters in the example is stateless
      • Fine-grained Data Parallel Model:
        – Fiss each stateless filter N ways (N is number of cores)
        – Remove scatter/gather if possible
      • We can introduce data parallelism
        – Example: 4 cores
      • Each fission group occupies entire machine
      [Figure: the two-pipeline example graph with every filter (BandPass, Compress, Process, Expand, BandStop, and Adder) fissed 4 ways inside its own Splitter/Joiner pair.]
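
To see how much synchronization fine-grained fission introduces, the sketch below builds the fissed version of one pipeline as a plain list of stages and counts the nodes. It is an illustration under simple assumptions: the helper `fiss_pipeline` is hypothetical, and the "remove scatter/gather if possible" step is deliberately left out.

```python
def fiss_pipeline(filters, n_cores):
    """Fine-grained data-parallel version of a pipeline, as a list of stages.

    Each stateless filter becomes: splitter -> n_cores copies -> joiner.
    """
    stages = []
    for name in filters:
        stages.append(["Splitter"])
        stages.append([f"{name}_{i}" for i in range(n_cores)])
        stages.append(["Joiner"])
    return stages


pipeline = ["BandPass", "Compress", "Process", "Expand", "BandStop"]
stages = fiss_pipeline(pipeline, 4)

sync_names = ("Splitter", "Joiner")
n_workers = sum(len(s) for s in stages if s[0] not in sync_names)
n_sync = sum(1 for s in stages if s[0] in sync_names)
print(n_workers, n_sync)  # 20 10
```

Five filters on 4 cores become 20 filter copies plus 10 splitter and joiner nodes, and every fission group spans the whole machine, so each stage boundary is a machine-wide scatter and gather.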
