Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs
Michael Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS, October 2006, San Jose, CA
http://cag.csail.mit.edu/streamit

Multicores Are Here!
[Figure: number of cores per chip (y-axis, 1 to 512) versus year (x-axis, 1970 to 20??). Chips plotted include the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, Power6, PExtreme, Yonah, PA-8800, Opteron, Tanglewood, Xeon MP, Xbox360, Opteron 4P, Broadcom 1480, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Intel Tflops, Cisco CSR-1, Ambric AM2045, and Picochip PC102.]

Multicores Are Here!
For uniprocessors, C was:
• Portable
• High Performance
• Composable
• Malleable
• Maintainable
Uniprocessors: C is the common machine language
[Figure: the chart of cores per chip versus year, 1970 to 20??]

Multicores Are Here!
What is the common machine language for multicores?
[Figure: the chart of cores per chip versus year, 1970 to 20??]

Common Machine Languages
Uniprocessors:
Common Properties
• Single flow of control
• Single memory image
Differences:
• Register File (Register Allocation)
• ISA (Instruction Selection)
• Functional Units (Instruction Scheduling)
von-Neumann languages represent the common properties and abstract away the differences
Multicores:
Common Properties
• Multiple flows of control
• Multiple local memories
Differences:
• Number and capabilities of cores
• Communication Model
• Synchronization Model
Need common machine language(s) for multicores

Streaming as a Common Machine Language
• Regular and repeating computation
• Independent filters with explicit communication
  – Segregated address spaces and multiple program counters
• Natural expression of parallelism:
  – Producer / Consumer dependencies
  – Enables powerful, whole-program transformations
[Figure: stream graph AtoD → FMDemod → Scatter → three parallel branches (LPF 1 → HPF 1, LPF 2 → HPF 2, LPF 3 → HPF 3) → Gather → Adder → Speaker]

Types of Parallelism
Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship
Data Parallelism
  – Peel iterations of filter, place within scatter/gather pair (fission)
  – Can’t parallelize filters with state
Pipeline Parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized
[Figure: stream graph with a Scatter/Gather pair; the task-parallel branches are labeled "Task"]

Types of Parallelism
Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship
Data Parallelism
  – Between iterations of a stateless filter
  – Place within scatter/gather pair (fission)
  – Can’t parallelize filters with state
Pipeline Parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized
[Figure: stream graph with nested Scatter/Gather pairs; regions are labeled "Data Parallel", "Pipeline", "Data", and "Task"]
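The distinction between stateless and stateful filters can be illustrated with a small simulation. The following is a minimal Python sketch, not StreamIt and not the authors' compiler: `fiss`, `make_scale`, and `make_running_sum` are hypothetical names, and the round-robin scatter/gather is an assumed splitting policy. It shows that fissing a stateless filter preserves its output, while fissing a stateful filter does not.

```python
def run_filter(make_work, items):
    """Run one fresh copy of a filter over a list of input items."""
    work = make_work()
    return [work(x) for x in items]

def fiss(make_work, items, n):
    """Round-robin scatter to n independent copies, then round-robin gather."""
    lanes = [items[i::n] for i in range(n)]                # scatter
    outs = [run_filter(make_work, lane) for lane in lanes]  # n copies
    return [outs[i % n][i // n] for i in range(len(items))]  # gather

def make_scale():
    """Stateless filter: each output depends only on the current input."""
    return lambda x: 2 * x

def make_running_sum():
    """Stateful filter: each output depends on every earlier input."""
    total = 0
    def work(x):
        nonlocal total
        total += x
        return total
    return work

data = list(range(1, 9))
stateless_ok = fiss(make_scale, data, 4) == run_filter(make_scale, data)
stateful_ok = fiss(make_running_sum, data, 4) == run_filter(make_running_sum, data)
print(stateless_ok, stateful_ok)
```

Each fissed copy of the running sum only sees every fourth item, so its private state diverges from the sequential version. That is why such a filter has to be handled through pipeline parallelism instead.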

Types of Parallelism
Traditionally:
Task Parallelism
  – Thread (fork/join) parallelism
Data Parallelism
  – Data parallel loop (forall)
Pipeline Parallelism
  – Usually exploited in hardware
[Figure: stream graph with nested Scatter/Gather pairs; regions are labeled "Pipeline", "Data", and "Task"]

Problem Statement
Given:
  – Stream graph with compute and communication estimate for each filter
  – Computation and communication resources of the target machine
Find:
  – Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

Our 3-Phase Solution
[Figure: three phases in sequence: Coarsen Granularity → Data Parallelize → Software Pipeline]
1. Coarsen: Fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters
Compile to a 16 core architecture
  – 11.2x mean throughput speedup over single core
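The coarsening phase can be pictured on a simple linear pipeline. This is a rough Python sketch under strong assumptions: the `Filter` record, its `work` estimate, and the `coarsen` function are hypothetical, and the only rule applied is "merge adjacent stateless filters". The actual compiler works on general stream graphs and may apply further conditions that this slide does not state.

```python
from dataclasses import dataclass

@dataclass
class Filter:
    name: str
    stateful: bool
    work: int  # hypothetical static work estimate

def coarsen(pipeline):
    """Fuse each run of adjacent stateless filters into one coarse filter."""
    fused = []
    for f in pipeline:
        if fused and not f.stateful and not fused[-1].stateful:
            prev = fused.pop()
            # The fused filter does the work of both and stays stateless.
            fused.append(Filter(prev.name + "+" + f.name, False, prev.work + f.work))
        else:
            fused.append(f)
    return fused

pipeline = [
    Filter("A", False, 10),
    Filter("B", False, 20),
    Filter("C", True, 5),
    Filter("D", False, 7),
]
coarse = coarsen(pipeline)
print([f.name for f in coarse])
```

After this step the fused stateless filters are the candidates for data parallelization (phase 2), and the stateful filter `C` is left for software pipelining (phase 3).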

Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution

The StreamIt Project
• Applications
  – DES and Serpent [PLDI 05]
  – MPEG-2 [IPDPS 06]
  – SAR, DSP benchmarks, JPEG, …
• Programmability
  – StreamIt Language (CC 02)
  – Teleport Messaging (PPOPP 05)
  – Programming Environment in Eclipse (P-PHEC 05)
• Domain Specific Optimizations
  – Linear Analysis and Optimization (PLDI 03)
  – Optimizations for bit streaming (PLDI 05)
  – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations
  – Compiling for Communication-Exposed Architectures (ASPLOS 02)
  – Phased Scheduling (LCTES 03)
  – Cache Aware Optimization (LCTES 05)
  – Load-Balanced Rendering (Graphics Hardware 05)
[Figure: compiler flow. StreamIt Program → Front-end → Annotated Java → Simulator (Java Library) and Stream-Aware Optimizations → four backends: Uniprocessor backend (C/C++), Cluster backend (MPI-like C/C++), Raw backend (C per tile + msg code), IBM X10 backend (Streaming X10 runtime)]

Model of Computation
• Synchronous Dataflow [Lee ‘92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
• Static I/O rates
  – Compiler decides on an order of execution (schedule)
  – Static estimation of computation
[Figure: stream graph A/D → Band Pass → Duplicate → four parallel branches (Detect → LED)]
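Because the I/O rates are static, a steady-state schedule can be derived before the program runs. The following Python sketch shows the standard balance-equation calculation for a linear pipeline. The function name `repetitions` and the example rates are assumptions for illustration, and it does not reproduce the StreamIt scheduler itself.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetitions(rates):
    """Smallest firing counts that leave every channel's occupancy unchanged.

    rates: one (pop, push) pair per filter, in pipeline order. For each
    channel the balance equation is reps[src] * push == reps[dst] * pop.
    """
    reps = [Fraction(1)]
    for (_, push), (pop, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push / pop)
    # Scale the rational solution to the smallest all-integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps), 1)
    return [int(r * scale) for r in reps]

# Hypothetical rates: A pushes 2 per firing, B pops 3 and pushes 1, C pops 2.
schedule = repetitions([(1, 2), (3, 1), (2, 1)])
print(schedule)
```

In this example A fires 3 times and produces 6 items, B fires twice to consume them and produce 2 items, and C fires once to consume those, so one pass of the schedule returns every channel to its starting occupancy.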

Example StreamIt Filter
[Figure: input tape (items 0 to 11) → FIR → output tape (items 0, 1)]
Stateless:
    float->float filter FIR (int N, float[N] weights) {
        work push 1 pop 1 peek N {
            float result = 0;
            for (int i = 0; i < N; i++) {
                result += weights[i] * peek(i);
            }
            pop();
            push(result);
        }
    }
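To make the peek/pop/push semantics concrete, here is a small Python simulation of the work function above. It is a sketch only: the function `fir` and the list-based input tape are illustrative assumptions rather than part of StreamIt. Each firing reads a window of N items without consuming them, then consumes one item and produces one result.

```python
def fir(stream, weights):
    """Simulate repeated firings of the FIR work function on a finite tape."""
    n = len(weights)
    out = []
    pos = 0  # index of the next item on the input tape
    # The filter can fire only while N items are available to peek.
    while pos + n <= len(stream):
        result = 0.0
        for i in range(n):
            result += weights[i] * stream[pos + i]  # peek(i)
        pos += 1            # pop(): consume exactly one item
        out.append(result)  # push(result): produce exactly one item
    return out

samples = fir([1, 2, 3, 4, 5], [0.5, 0.5])
print(samples)
```

With a two-tap averaging window the five input items yield four outputs, because the last item never has a full window after it.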

Example StreamIt Filter
[Figure: input tape (items 0 to 11) → FIR → output tape (items 0, 1)]
Stateful:
    float->float filter FIR (int N) {
        float[N] weights;
        work push 1 pop 1 peek N {
            float result = 0;
            weights = adaptChannel(weights);
            for (int i = 0; i < N; i++) {
                result += weights[i] * peek(i);
            }
            pop();
            push(result);
        }
    }

StreamIt Language Overview
• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable
    – Simple structures composed to create complex graphs
  – Malleable
    – Change program behavior with small modifications
[Figure: the StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (parallel computation between a splitter and a joiner); feedback loop (with a splitter and a joiner)]

Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution

Baseline 1: Task Parallelism
• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
• Execute this on a 2 core machine: ~2x speedup over single core
• What about 4, 16, 1024, … cores?
[Figure: Splitter → two parallel pipelines (BandPass → Compress → Process → Expand → BandStop) → Joiner → Adder]
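The scaling question can be made concrete with a back-of-the-envelope model. The Python sketch below is an illustration under stated assumptions, not a result from the talk: the work figures are invented, each branch is assumed to run start to finish on a single core, communication cost is ignored, and `task_parallel_speedup` is a hypothetical helper.

```python
def task_parallel_speedup(branch_work, serial_work, cores):
    """Estimated speedup when only whole branches can run in parallel."""
    # Extra cores beyond the number of branches have nothing to run.
    loads = [0] * min(cores, len(branch_work))
    # Greedy placement: give the largest remaining branch to the idlest core.
    for w in sorted(branch_work, reverse=True):
        loads[loads.index(min(loads))] += w
    parallel_time = serial_work + max(loads)
    return (sum(branch_work) + serial_work) / parallel_time

branches = [100, 100]  # hypothetical work for the two pipelines
adder = 2              # hypothetical work for the final Adder

for p in (1, 2, 4, 16, 1024):
    print(p, round(task_parallel_speedup(branches, adder, p), 2))
```

Under these assumptions the estimate rises to just under 2x on 2 cores and then stays flat, because the graph exposes only two explicit tasks no matter how many cores are available.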

Evaluation: Task Parallelism
Raw Microprocessor
• 16 inorder, single-issue cores with D$ and I$
• 16 memory banks, each bank with DMA
• Cycle accurate simulator
Parallelism: Not matched to target!
Synchronization: Not matched to target!
[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis, 0 to 19), one bar per benchmark]

Baseline 2: Fine-Grained Data Parallelism
• Each of the filters in the example is stateless
• Fine-grained Data Parallel Model:
  – Fiss each stateless filter N ways (N is number of cores)
  – Remove scatter/gather if possible
• We can introduce data parallelism
  – Example: 4 cores
• Each fission group occupies entire machine
[Figure: the example graph after fission; each filter (BandPass, Compress, Process, Expand, BandStop, Adder) is replicated between its own Splitter and Joiner]
