An Architectural Framework for Accelerating Dynamic Parallel - - PowerPoint PPT Presentation

an architectural framework for accelerating dynamic
SMART_READER_LITE
LIVE PREVIEW

An Architectural Framework for Accelerating Dynamic Parallel - - PowerPoint PPT Presentation

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten , G. Edward Suh Computer Systems Laboratory School of Electrical and Computer Engineering Cornell


slide-1
SLIDE 1

An Architectural Framework for Accelerating Dynamic Parallel Algorithms

  • n Reconfigurable Hardware

Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh

Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University 51st Int’l Symp. on Microarchitecture Fall 2018

slide-2
SLIDE 2
  • Motivation •

Computation Model Accelerator Architecture Design Methodology Evaluation

Accelerating Static Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

for (int i=0; i<n; i++) c[i] = a[i] + b[i];

High Level Synthesis

__kernel void vvadd( __global int* c, __global int* a, __global int* b, int n ) { int id = get_global_id(0); if ( id < n ) c[id] = a[id] + b[id]; }

◮ Emerging CPU+FPGA platforms

(Xilinx Zynq, Altera Cyclone SoC)

◮ HLS maps parallelism statically to

highly pipelined and parallel PEs

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 2 / 18

slide-3
SLIDE 3
  • Motivation •

Computation Model Accelerator Architecture Design Methodology Evaluation

Programmers are increasingly moving from thread- to task-centric programming

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys ◮ Task-parallel programming

frameworks enable creating tasks dynamically as the program executes

⊲ Intel Cilk Plus, Intel C++ TBB,

Microsoft’s .NET TPL, Java’s Fork/Join, OpenMP

◮ Benefits of this approach:

⊲ hierarchical data structures ⊲ divide-and-conquer algos ⊲ adaptive algorithms ⊲ arbitrary nesting, composition ⊲ automatic load balancing ⊲ efficient in theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 3 / 18

slide-4
SLIDE 4

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 4 / 18

slide-5
SLIDE 5

Motivation

  • Computation Model •

Accelerator Architecture Design Methodology Evaluation

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 5 / 18

slide-6
SLIDE 6

Motivation

  • Computation Model •

Accelerator Architecture Design Methodology Evaluation

Explicit Continuation Passing

A

spawn spawn

C B

child task parent task cont = 〈D,2〉 cont = 〈D,1〉 make successor

E F G D

spawn cont = 〈G,2〉 make successor cont = 〈D,2〉 arg1 arg2 successor task

Data-Parallel Pattern Data-Flow Pattern Fork/Join Pattern

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 6 / 18

slide-7
SLIDE 7

Motivation

  • Computation Model •

Accelerator Architecture Design Methodology Evaluation

Example of Explicit Continuation Passing w/ Cilk

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; } task fib( cont int k, int n ) { if ( n < 2 ) send_argument( k, n ); else { cont int x, y; spawn_next sum( k, ?x, ?y ); spawn fib( x, n-1 ); spawn fib( y, n-2 ); } } task sum( cont int k, int x, int y ) { send_argument( k, x+y ); }

◮ Cilk-1 used explicit continuation passing (JPDC’96) ◮ Cilk-5 used call/return semantics for parallelism (PLDI’98) ◮ Explicit continuation passing is an elegant match for hardware

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 7 / 18

slide-8
SLIDE 8

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 8 / 18

slide-9
SLIDE 9

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-10
SLIDE 10

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task A

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-11
SLIDE 11

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task B Task A Spawn Task B

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-12
SLIDE 12

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task B Dequeue Task B

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-13
SLIDE 13

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task C Task B Spawn Task C

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-14
SLIDE 14

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task B Spawn Task D Task D Task C

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-15
SLIDE 15

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task C Task D Steal Task D Steal Task C Task B

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-16
SLIDE 16

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task C Task D Task E Task F Spawn Task F Spawn Task E

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-17
SLIDE 17

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

Scheduling Tasks with Work Stealing

Work in Progress Task Queues PE 0 PE 1 PE 2 PE 3 Task C Task D Task E Task F Steal Task E Steal Task F

◮ Work stealing has good performance, space requirements, and

communication overheads in both theory and practice

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 9 / 18

slide-18
SLIDE 18

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-19
SLIDE 19

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

TMU dequeues task and sends to worker

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-20
SLIDE 20

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

Worker sends request to pending task store to create a successor

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-21
SLIDE 21

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

Pending task store sends worker response with ID for creating continuations

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-22
SLIDE 22

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

Worker sends spawed tasks to TMU

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-23
SLIDE 23

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

Worker sends return value to arg/task router

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-24
SLIDE 24

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

Worker sends return value to arg/task router

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-25
SLIDE 25

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

If this is the final argument pending task store sends now ready task back to TMU

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-26
SLIDE 26

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

If task queue is empty, TMU randomly selects victim to steal from

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-27
SLIDE 27

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Flexible” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF

If task queue is empty, TMU randomly selects victim to steal from

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 10 / 18

slide-28
SLIDE 28

Motivation Computation Model

  • Accelerator Architecture •

Design Methodology Evaluation

“Lite” Architectural Template

Networks Interface Tile Tile L1$ L1$ CPU L1$ Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA TMU task

  • ut

Worker task in Processing Element task Arg/Task Net IF

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 11 / 18

slide-29
SLIDE 29

Motivation Computation Model Accelerator Architecture

  • Design Methodology •

Evaluation

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 12 / 18

slide-30
SLIDE 30

Motivation Computation Model Accelerator Architecture

  • Design Methodology •

Evaluation

Design Methodology

Worker Specification (C++) Architecture Template (PyMTL) Accelerator Generator (PyMTL) Accelerator RTL (Verilog) Vivado HLS Worker RTL (Verilog) Architecture Template (PyMTL)

Networks Interface Tile Tile L1$ L1$ TMU task

  • ut

Empty Worker task in Processing Element steal succ task Pending Task Store Arg & Task Router Stealing Net IF Arg/Task Net IF Empty Worker

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 13 / 18

slide-31
SLIDE 31

Motivation Computation Model Accelerator Architecture

  • Design Methodology •

Evaluation

Design Methodology

Worker Specification (C++) Architecture Template (PyMTL) Accelerator Generator (PyMTL) Accelerator RTL (Verilog) Vivado HLS Worker RTL (Verilog)

void FibWorkerHLS( TaskInPort<FibTask> tin, TaskOutPort<FibTask> tout, SuccReqPort sreq, SuccRespPort sresp, ArgOutPort aout ) { FibTask task = task_in.read(); task_k_t k = task.k; if (task.type == FIB) { int n = task.x; if (n < 2) send_arg( Arg(k, n), aout ); else { k = make_succ(SUM,k,2,sreq,sresp); spawn(FibTask(FIB,k,1,n-2), tout); spawn(FibTask(FIB,k,0,n-1), tout); } } else if (task.type == SUM) { int sum = task.x + task.y; send_arg(Arg(k, sum), aout); } }

Worker Specification (C++) Wrapper Around TBB (C++) C++ Compiler Standard x86 or ARM Binary

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 13 / 18

slide-32
SLIDE 32

Motivation Computation Model Accelerator Architecture Design Methodology

  • Evaluation •

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 14 / 18

slide-33
SLIDE 33

Motivation Computation Model Accelerator Architecture Design Methodology

  • Evaluation •

Applications

Name Suite Description Pattern nw in-house Needleman-Wunsch Algorithm data-flow quicksort in-house quicksort algorithm fork/join cilksort Cilk apps parallel merge sort algorithm fork/join queens Cilk apps N-queens problem fork/join knapsack Cilk apps 0-1 knapsack problem fork/join uts UTS unbalanced tree search fork/join bbgemm MachSuite blocked matrix multiplication data-parallel bfsqueue MachSuite breadth first search data-parallel spmvcrs MachSuite sparse matrix-vector mult data-parallel stencil2d MachSuite 3D stencil computation data-parallel

◮ Optimized software baseline implemented using Intel Cilk Plus with

ARM NEON auto-vectorization

◮ C++ application driver/worker implemented with design methodology

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 15 / 18

slide-34
SLIDE 34

Motivation Computation Model Accelerator Architecture Design Methodology

  • Evaluation •

Current and Future CPU+FPGA Platforms

Networks Interface Tile 4PEs Tile 4PEs SBuf ARM A9 ARM A9 L1$ 32KB L1$ 32KB Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA SBuf Arbiter Networks Interface Tile 4PEs Tile 4PEs L1$ 32KB L1$ 32KB 8x 8x ARM OOO ARM OOO 8x L1$ 32KB L1$ 32KB 8x Cache Coherent Interconnect L2 Cache Off-Chip DRAM FPGA

S e e P a p e r

◮ Current Zynq-7000 SoC Platform

⊲ Prototype using Zedboard ⊲ Two 667MHz ARM Cortex-A9 cores ⊲ Xilinx 7-series integrated FPGA fabric

(modest capacity, 142MHz)

⊲ Xcel uses stream buffers ⊲ Lower BW: FPGA ↔ coherent mem sys

◮ Future CPU+FPGA Platform

⊲ Simulation study using gem5 ⊲ Eight 1GHz ARM 4-way OOO cores ⊲ Xilinx 7-series integrated FPGA fabric

(larger capacity, 200MHz)

⊲ Xcel uses coherent 2x-pumped 32KB L1$ ⊲ Higher BW: FPGA ↔ coherent mem sys

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 16 / 18

slide-35
SLIDE 35

Motivation Computation Model Accelerator Architecture Design Methodology

  • Evaluation •

Speedup on Future CPU+FPGA Platform

Speedup vs. SW on One OOO Core nw quicksort cilksort queens knapsack uts bbgemm bfsqueue spmvcrs stencil2d FlexArch Cilk on Eight OOO Cores LiteArch

◮ FlexArch is 4× faster than 8 cores, 24× faster than 1 core (geo mean) ◮ FlexArch is faster than LiteArch for more dynamic algorithms (load balancing) ◮ See paper for scalability, resource usage, energy efficiency, cache size study

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 17 / 18

slide-36
SLIDE 36

Motivation Computation Model Accelerator Architecture Design Methodology Evaluation

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys

int fib( int n ) { if (n < 2) return n; int x = spawn fib(n-1); int y = fib(n-2); sync; return x + y; }

General Purpose CPU Reconfig Hardware (FPGA) Shared Mem Sys ◮ Importance of exploring techniques for

accelerating more complex applications

  • n reconfigurable hardware

◮ We have described a promising

approach to accelerate dynamic parallel algorithms

⊲ computation model using explicit

continuation passing

⊲ accelerator architecture based on

work stealing

⊲ design methodology combining a

PyMTL-based architectural template with high-level synthesis

  • C. Batten

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware 18 / 18