An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware



  1. An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware
     Tao Chen, Shreesha Srinath, Christopher Batten, G. Edward Suh
     Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University
     51st Int'l Symp. on Microarchitecture, Fall 2018

  2. Accelerating Static Parallel Algorithms on Reconfigurable Hardware
     ◮ Emerging CPU+FPGA platforms (Xilinx Zynq, Altera Cyclone SoC) pair a general-purpose CPU with reconfigurable hardware (FPGA) over a shared memory system
     ◮ High-level synthesis (HLS) maps parallelism statically to highly pipelined and parallel PEs

     Sequential vector add:

       for ( int i = 0; i < n; i++ )
         c[i] = a[i] + b[i];

     OpenCL kernel for high-level synthesis:

       __kernel void vvadd( __global int* c, __global int* a,
                            __global int* b, int n )
       {
         int id = get_global_id(0);
         if ( id < n )
           c[id] = a[id] + b[id];
       }

  3. Programmers are increasingly moving from thread- to task-centric programming
     ◮ Task-parallel programming frameworks enable creating tasks dynamically as the program executes
       ⊲ Intel Cilk Plus, Intel C++ TBB, Microsoft's .NET TPL, Java's Fork/Join, OpenMP
     ◮ Benefits of this approach:
       ⊲ hierarchical data structures
       ⊲ divide-and-conquer algos
       ⊲ adaptive algorithms
       ⊲ arbitrary nesting, composition
       ⊲ automatic load balancing
       ⊲ efficient in theory and practice

       int fib( int n )
       {
         if ( n < 2 )
           return n;
         int x = spawn fib(n-1);
         int y = fib(n-2);
         sync;
         return x + y;
       }
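     Note: the fib example above uses Cilk-style spawn/sync keywords. As a concrete rendering in one of the frameworks the slide lists, here is a minimal sketch of the same computation using OpenMP tasks; the main driver and the fib(20) input are illustrative additions, not from the talk.

       #include <stdio.h>

       int fib( int n )
       {
         if ( n < 2 )
           return n;
         int x, y;
         #pragma omp task shared(x) firstprivate(n)
         x = fib( n - 1 );            /* spawned child task                    */
         y = fib( n - 2 );            /* parent continues with the other half  */
         #pragma omp taskwait          /* sync: wait for the spawned child      */
         return x + y;
       }

       int main( void )
       {
         int result;
         #pragma omp parallel
         #pragma omp single
         result = fib( 20 );
         printf( "fib(20) = %d\n", result );
         return 0;
       }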

  4. Outline: Motivation, Computation Model, Accelerator Architecture, Design Methodology, Evaluation

  5. Outline: Computation Model (next section)

  6. Explicit Continuation Passing
     ◮ A parent task (A) spawns child tasks (B, C) and makes a successor task (D); each spawn carries a continuation (e.g., cont = ⟨D,1⟩, cont = ⟨D,2⟩) naming the successor task and the argument slot the child's result will fill
     [Figure: task graphs expressing the fork/join, data-flow, and data-parallel patterns with explicit continuation passing]

  7. Example of Explicit Continuation Passing w/ Cilk

     Cilk-5 style (call/return):

       int fib( int n )
       {
         if ( n < 2 )
           return n;
         int x = spawn fib(n-1);
         int y = fib(n-2);
         sync;
         return x + y;
       }

     Cilk-1 style (explicit continuation passing):

       task fib( cont int k, int n )
       {
         if ( n < 2 )
           send_argument( k, n );
         else {
           cont int x, y;
           spawn_next sum( k, ?x, ?y );
           spawn fib( x, n-1 );
           spawn fib( y, n-2 );
         }
       }

       task sum( cont int k, int x, int y )
       {
         send_argument( k, x+y );
       }

     ◮ Cilk-1 used explicit continuation passing (JPDC'96)
     ◮ Cilk-5 used call/return semantics for parallelism (PLDI'98)
     ◮ Explicit continuation passing is an elegant match for hardware
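     To make the continuation-passing semantics concrete, here is a minimal, sequential plain-C sketch (not from the talk): a continuation is a struct holding argument slots and a join counter, and send_argument fills a slot and fires the successor once the last argument arrives. Names such as cont_t, fib_task, and done_task are illustrative assumptions; a real runtime would enqueue fired tasks for parallel workers and reclaim continuations instead of leaking them.

       #include <stdio.h>
       #include <stdlib.h>

       typedef struct cont {
         int args[2];                  /* argument slots of the successor task */
         int missing;                  /* join counter: slots still unfilled   */
         void (*run)( struct cont* k, int slot, int x, int y ); /* successor   */
         struct cont* k;               /* continuation the successor sends to  */
         int slot;                     /* argument slot in that continuation   */
       } cont_t;

       /* Fill one slot; fire the successor when the last argument arrives.
        * (Continuations are leaked for brevity in this sketch.)             */
       static void send_argument( cont_t* c, int slot, int value )
       {
         c->args[slot] = value;
         if ( --c->missing == 0 )
           c->run( c->k, c->slot, c->args[0], c->args[1] );
       }

       /* Successor task: sum the two child results, forward the total. */
       static void sum_task( cont_t* k, int slot, int x, int y )
       {
         send_argument( k, slot, x + y );
       }

       /* fib in continuation-passing style (children run inline here). */
       static void fib_task( cont_t* k, int slot, int n )
       {
         if ( n < 2 ) {
           send_argument( k, slot, n );             /* base case              */
         } else {
           cont_t* succ = malloc( sizeof(cont_t) ); /* spawn_next sum(k,?x,?y) */
           succ->missing = 2;
           succ->run     = sum_task;
           succ->k       = k;
           succ->slot    = slot;
           fib_task( succ, 0, n - 1 );              /* spawn fib(x, n-1)      */
           fib_task( succ, 1, n - 2 );              /* spawn fib(y, n-2)      */
         }
       }

       static int final_result;

       /* Root continuation: just record the single result. */
       static void done_task( cont_t* k, int slot, int x, int y )
       {
         (void)k; (void)slot; (void)y;
         final_result = x;
       }

       int main( void )
       {
         cont_t root = { { 0, 0 }, 1, done_task, NULL, 0 };
         fib_task( &root, 0, 10 );
         printf( "fib(10) = %d\n", final_result );  /* prints 55 */
         return 0;
       }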

  8. Outline: Accelerator Architecture (next section)

  9. Scheduling Tasks with Work Stealing (a sequence of animation builds of one slide)
     ◮ Each PE (PE 0 through PE 3) has its own task queue and a work-in-progress slot
     ◮ Build sequence: PE 0 starts running Task A and spawns Task B into its queue; when A finishes, PE 0 dequeues Task B; Task B spawns Tasks C and D, which idle PEs steal from PE 0's queue; the thieves in turn spawn Tasks E and F, which the remaining idle PEs steal
     ◮ Work stealing has good performance, space requirements, and communication overheads in both theory and practice
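     The per-PE task queue behaves as a double-ended queue: the owner pushes and pops newly spawned tasks at one end (LIFO, good locality), while thieves steal the oldest tasks from the other end. Below is a minimal plain-C sketch of that discipline; function names and the fixed-size buffer are illustrative assumptions, and all synchronization is omitted (production runtimes use concurrent, often lock-free, deques such as Chase-Lev).

       #include <stdio.h>

       #define QSIZE 64

       typedef struct { int id; } task_t;        /* stand-in for a task descriptor */

       typedef struct {
         task_t buf[QSIZE];
         int top;                                /* next entry a thief would steal */
         int bottom;                             /* next free slot for the owner   */
       } deque_t;

       static void push_bottom( deque_t* q, task_t t )   /* owner: spawn a task    */
       {
         q->buf[q->bottom % QSIZE] = t;
         q->bottom++;
       }

       static int pop_bottom( deque_t* q, task_t* out )  /* owner: take newest     */
       {
         if ( q->bottom == q->top ) return 0;            /* queue is empty         */
         q->bottom--;
         *out = q->buf[q->bottom % QSIZE];
         return 1;
       }

       static int steal_top( deque_t* q, task_t* out )   /* thief: take oldest     */
       {
         if ( q->top == q->bottom ) return 0;            /* nothing to steal       */
         *out = q->buf[q->top % QSIZE];
         q->top++;
         return 1;
       }

       int main( void )
       {
         deque_t q = { .top = 0, .bottom = 0 };
         task_t t;

         push_bottom( &q, (task_t){ .id = 1 } );         /* owner spawns task 1    */
         push_bottom( &q, (task_t){ .id = 2 } );         /* ... then task 2        */

         if ( steal_top( &q, &t ) )  printf( "thief stole task %d\n", t.id ); /* 1 */
         if ( pop_bottom( &q, &t ) ) printf( "owner ran task %d\n",  t.id );  /* 2 */
         return 0;
       }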

  18. "Flexible" Architectural Template
      ◮ [Figure: block diagram of the architectural template]
      ◮ FPGA fabric: an array of tiles connected by a stealing network and an arg/task network (Stealing Net IF, Arg/Task Net IF), plus a CPU interface
      ◮ Each tile: a worker (a TMU coupled to a processing element, exchanging task-in/task-out, steal, and succ messages), a pending task store, an arg & task cache, and a router
      ◮ Memory system: L1 caches, a cache-coherent interconnect shared with the general-purpose CPU, an L2 cache, and off-chip DRAM
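      To relate the template back to the computation model, the sketch below shows one plausible shape for the task descriptors that workers might exchange over the arg/task and stealing networks, mirroring the explicit continuation-passing model on slide 7. Every field name and width here is an assumption for illustration, not taken from the paper.

        #include <stdint.h>

        typedef struct {
          uint32_t successor_id;  /* pending successor task that receives the result */
          uint8_t  arg_slot;      /* which argument slot of that successor to fill   */
        } continuation_t;

        typedef struct {
          uint16_t       func_id; /* which task function the PE should execute       */
          uint32_t       args[4]; /* inline arguments (e.g., n for fib)              */
          continuation_t cont;    /* where to send this task's result                */
        } task_desc_t;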
