
An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware
Tao Chen, Shreesha Srinath, Christopher Batten, G. Edward Suh
Computer Systems Laboratory
School of Electrical and Computer Engineering
Cornell University
51st Int’l Symp. on Microarchitecture, Fall 2018

• Motivation • Computation Model Accelerator Architecture Design Methodology Evaluation

Accelerating Static Parallel Algorithms on Reconfigurable Hardware

◮ Emerging CPU+FPGA platforms (Xilinx Zynq, Altera Cyclone SoC)
◮ HLS maps parallelism statically to highly pipelined and parallel PEs

    for (int i=0; i<n; i++)
      c[i] = a[i] + b[i];

    __kernel void vvadd( __global int* c, __global int* a,
                         __global int* b, int n )
    {
      int id = get_global_id(0);
      if ( id < n )
        c[id] = a[id] + b[id];
    }

[Diagram labels: High-Level Synthesis; Reconfigurable Hardware (FPGA); General-Purpose CPU; Shared Memory System]

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware C. Batten 2 / 18

Programmers are increasingly moving from thread- to task-centric programming

◮ Task-parallel programming frameworks enable creating tasks dynamically as the program executes
  ⊲ Intel Cilk Plus, Intel C++ TBB, Microsoft’s .NET TPL, Java’s Fork/Join, OpenMP
◮ Benefits of this approach:
  ⊲ hierarchical data structures
  ⊲ divide-and-conquer algorithms
  ⊲ adaptive algorithms
  ⊲ arbitrary nesting, composition
  ⊲ automatic load balancing
  ⊲ efficient in theory and practice

    int fib( int n ) {
      if (n < 2)
        return n;
      int x = spawn fib(n-1);
      int y = fib(n-2);
      sync;
      return x + y;
    }

[Diagram labels: Reconfigurable Hardware (FPGA); General-Purpose CPU; Shared Memory System]

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

Outline:
  Motivation
  Computation Model
  Accelerator Architecture
  Design Methodology
  Evaluation

[Section divider: Computation Model]

Explicit Continuation Passing

[Figure: task graphs for three patterns (Fork/Join Pattern, Data-Flow Pattern, Data-Parallel Pattern). Recoverable labels: parent task A; child tasks B and C; successor task D with arguments arg1 and arg2; further tasks E, F, and G; edges "spawn" and "make successor"; continuations cont = 〈D,1〉, cont = 〈D,2〉, and cont = 〈G,2〉.]

Example of Explicit Continuation Passing w/ Cilk

Call/return version:

    int fib( int n )
    {
      if (n < 2)
        return n;
      int x = spawn fib(n-1);
      int y = fib(n-2);
      sync;
      return x + y;
    }

Explicit continuation-passing version:

    task fib( cont int k, int n )
    {
      if ( n < 2 )
        send_argument( k, n );
      else {
        cont int x, y;
        spawn_next sum( k, ?x, ?y );
        spawn fib( x, n-1 );
        spawn fib( y, n-2 );
      }
    }

    task sum( cont int k, int x, int y )
    {
      send_argument( k, x+y );
    }

◮ Cilk-1 used explicit continuation passing (JPDC’96)
◮ Cilk-5 used call/return semantics for parallelism (PLDI’98)
◮ Explicit continuation passing is an elegant match for hardware
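To make the continuation-passing version concrete, here is a minimal sketch in plain C that emulates it on a single sequential ready stack. It is an illustration under stated assumptions, not the Cilk-1 runtime or the accelerator described in the talk: the names `Cont`, `Closure`, `FibTask`, `spawn_fib`, and `fib_cps` are hypothetical, the capacity `CAP` and the limit `n <= 90` are arbitrary choices, and the `sum` successor runs inline as soon as its second argument arrives instead of being scheduled as a separate task.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Closure Closure;

/* A continuation names one argument slot of a pending successor task. */
typedef struct { Closure *target; int slot; } Cont;

/* Pending successor (plays the role of "sum"): fires once both arguments arrive. */
struct Closure {
    Cont      k;        /* receives x + y; a NULL target marks the final result */
    long long args[2];
    int       missing;  /* join counter: arguments still outstanding */
};

typedef struct { Cont k; int n; } FibTask;

enum { CAP = 128 };                 /* ready-stack capacity (hypothetical bound) */
static FibTask   ready[CAP];
static int       top;
static long long final_result;

/* spawn fib( k, n ): push a ready task whose continuation is <target, slot>. */
static void spawn_fib(Closure *target, int slot, int n) {
    assert(top < CAP);
    ready[top].k.target = target;
    ready[top].k.slot = slot;
    ready[top].n = n;
    top++;
}

/* send_argument( k, value ): fill the slot; run the successor when complete. */
static void send_argument(Cont k, long long value) {
    if (k.target == NULL) { final_result = value; return; }
    Closure *c = k.target;
    c->args[k.slot] = value;
    if (--c->missing == 0) {
        Cont next = c->k;
        long long sum = c->args[0] + c->args[1];
        free(c);
        send_argument(next, sum);   /* body of task sum( k, x, y ), run inline */
    }
}

/* Run fib(n) as explicit continuation-passing tasks; returns -1 if n is out of range. */
long long fib_cps(int n) {
    if (n < 0 || n > 90) return -1; /* result fits in long long; stack depth stays under CAP */
    top = 0;
    spawn_fib(NULL, 0, n);          /* root task: its continuation is the final result */
    while (top > 0) {
        FibTask t = ready[--top];
        if (t.n < 2) {
            send_argument(t.k, t.n);
        } else {
            Closure *c = malloc(sizeof *c);   /* spawn_next sum( k, ?x, ?y ) */
            if (c == NULL) abort();
            c->k = t.k;
            c->missing = 2;
            spawn_fib(c, 0, t.n - 1);         /* spawn fib( x, n-1 ) */
            spawn_fib(c, 1, t.n - 2);         /* spawn fib( y, n-2 ) */
        }
    }
    return final_result;
}
```

Calling `fib_cps(10)` drains the ready stack and returns the value delivered to the root continuation. No task ever returns a value to a caller: results only flow forward through `send_argument`, which is the property the slide singles out as a good match for hardware.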

[Section divider: Accelerator Architecture]

Scheduling Tasks with Work Stealing

◮ Work stealing has good performance, space requirements, and communication overheads in both theory and practice

[Animation: four processing elements (PE 0 to PE 3), each with a task queue and a work-in-progress slot. The builds show the following sequence.]

  1. All task queues are empty and no work is in progress.
  2. Task A is in progress.
  3. Spawn Task B: Task B is placed in a task queue while Task A is in progress.
  4. Dequeue Task B: Task B is now in progress.
  5. Spawn Task C: Task C is placed in a task queue while Task B is in progress.
  6. Spawn Task D: the task queue holds Task C and Task D while Task B is in progress.
  7. Steal Task D, Steal Task C: Task B, Task D, and Task C are in progress.
  8. Spawn Task E, Spawn Task F: Task E and Task F are queued while Task D and Task C are in progress.
  9. Steal Task E, Steal Task F: Task E, Task D, Task C, and Task F are in progress.
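The scheduling sequence above can be mimicked with a small sequential simulation in C. This is a sketch under assumptions, not the hardware scheduler from the talk: the workload (summing the integers in a range by recursive splitting), the constants `NUM_PES`, `QUEUE_CAP`, and `GRAIN`, and every function name are hypothetical. It also assumes the conventional policy in which a PE pops the newest task from its own queue and a thief takes the oldest task from a victim's queue, which the slides do not spell out.

```c
#include <assert.h>

enum { NUM_PES = 4, QUEUE_CAP = 64, GRAIN = 4 };

typedef struct { int lo, hi; } Task;     /* work item: sum the integers in [lo, hi) */

typedef struct {
    Task items[QUEUE_CAP];               /* circular buffer */
    int  head, count;
} Deque;

typedef struct { long long sum; int steals; } Stats;

static void push_tail(Deque *q, Task t) {
    assert(q->count < QUEUE_CAP);
    q->items[(q->head + q->count) % QUEUE_CAP] = t;
    q->count++;
}

static Task pop_tail(Deque *q) {         /* owner takes the newest task */
    q->count--;
    return q->items[(q->head + q->count) % QUEUE_CAP];
}

static Task steal_head(Deque *q) {       /* thief takes the oldest task */
    Task t = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return t;
}

/* Simulate the PEs in lock-step rounds until no PE can find a task. */
Stats run_work_stealing(int n) {
    Deque queues[NUM_PES] = {0};
    Stats stats = {0, 0};
    int busy = 1;

    push_tail(&queues[0], (Task){0, n}); /* root task starts on PE 0 */
    while (busy) {
        busy = 0;
        for (int pe = 0; pe < NUM_PES; pe++) {
            Task t;
            if (queues[pe].count > 0) {
                t = pop_tail(&queues[pe]);
            } else {
                int victim = -1;
                for (int d = 1; d < NUM_PES; d++) {
                    int v = (pe + d) % NUM_PES;
                    if (queues[v].count > 0) { victim = v; break; }
                }
                if (victim < 0) continue;            /* nothing to steal this round */
                t = steal_head(&queues[victim]);
                stats.steals++;
            }
            busy = 1;
            if (t.hi - t.lo <= GRAIN) {
                for (int i = t.lo; i < t.hi; i++) stats.sum += i;  /* leaf task */
            } else {
                int mid = t.lo + (t.hi - t.lo) / 2;
                push_tail(&queues[pe], (Task){mid, t.hi});         /* spawn two children */
                push_tail(&queues[pe], (Task){t.lo, mid});
            }
        }
    }
    return stats;
}
```

`run_work_stealing(100)` returns the total over all leaf tasks together with the number of steals. Idle PEs pull work from busy ones without any central dispatcher, which is the load-balancing behavior the animation illustrates. A real implementation would need concurrent deques; the lock-step loop here only shows the policy.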

“Flexible” Architectural Template

[Figure: block diagram of the architectural template. Recoverable labels: FPGA; Tile (two shown); Worker; Processing Element (task in, task out); TMU (steal, task, succ); Pending Task Store; Arg & Task Router; Networks (Stealing Net IF, Arg/Task Net IF); CPU; Interface; L1$ (three shown); Cache Coherent Interconnect; L2 Cache; Off-Chip DRAM.]
