SLIDE 1

Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware

David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun
Stanford University, ISCA 2016

SLIDE 2

FPGAs in Data Centers

- Increasing interest in the use of FPGAs as application accelerators in data centers
- Key advantage: Performance/Watt

SLIDE 3

Problem: Large Design Spaces

- Design spaces grow exponentially with the number of parameters
- Even relatively small designs can have very large spaces
- Parameters can change runtime by orders of magnitude
- Parameters typically aren't independent
- Manual exploration is tedious and may result in suboptimal designs
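To make the exponential growth concrete, here is a small sketch that enumerates a design space built from the kinds of parameters discussed in this talk. The specific parameter names and value ranges are made up for illustration, not taken from the paper:

```python
from itertools import product

# Hypothetical parameters for a tiled accelerator design.
tile_sizes = [16, 32, 64, 128, 256, 512, 1024, 2048]  # 8 choices
outer_par  = range(1, 9)                              # 8 choices
inner_par  = range(1, 17)                             # 16 choices
pipelining = [False, True]                            # 2 choices

# The full design space is the Cartesian product of all parameter choices.
space = list(product(tile_sizes, outer_par, inner_par, pipelining))
print(len(space))  # 8 * 8 * 16 * 2 = 2048 points from just four parameters
```

Every additional parameter multiplies the space again, which is why manual exploration quickly becomes infeasible.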

SLIDE 4

Design Space Example: Dot Product

Algorithm: Dot Product of Vectors A and B. Small and simple, but slow!

[Diagram: DRAM holding vectors A and B; the FPGA loads Tile A and Tile B into scratchpads, multiplies elements (×), and accumulates (+) into register acc]

SLIDE 5

Important Parameters: Tile Sizes

Algorithm: Dot Product of Vectors A and B

Increasing tile size:
- Increases the length of DRAM accesses (runtime)
- Increases exploited spatial locality (runtime)
- Increases local memory sizes (area)

[Diagram: same dot product architecture as Slide 4]

SLIDE 6

Important Parameters: Pipelining

Algorithm: Dot Product of Vectors A and B

Pipelining:
- Overlaps memory and compute (runtime)
- Increases local memory sizes (area)
- Adds synchronization logic (area)

[Diagram: FPGA split into Stage 1, which loads Tile A and Tile B from DRAM into double buffers, and Stage 2, which multiply-accumulates into register acc]

SLIDE 7

Important Parameters: Parallelization

Algorithm: Dot Product of Vectors A and B

Parallelization:
- Improves element throughput (runtime)
- Duplicates compute resources (area)

[Diagram: multiple parallel multipliers reading Tile A and Tile B, combined by an adder tree into register acc]

SLIDE 8

Language/Tool Requirements

Tools compared: VHDL, Verilog, LegUp, Vivado HLS, OpenCL SDK, Aladdin, DHDL

Requirements:
- Targets FPGAs
- Enables pipelining at arbitrary loop levels
- Exposes design parameters to the compiler
- Evaluates designs prior to synthesis
- Explores the design space automatically
- Generates synthesizable code

[Comparison table: which tools meet which requirements]

SLIDE 9

Delite Hardware Definition Language

- Includes a variety of parameterized templates
  - Parallel patterns with implicit parallelization factors
  - Pipeline constructs for pipelining at arbitrary levels
  - Explicit size parameters for loop step sizes and buffer sizes
- All parameters are exposed to the compiler
- The compiler includes latency and area models for quick design evaluation
- The compiler automatically explores the design space
- Generates synthesizable MaxJ HGL after exploration

SLIDE 10

Dot Product DHDL Diagram

[Diagram: vectors A and B in DRAM are loaded into Tile A and Tile B; an inner Reduce multiplies and sums tile elements; an outer Reduce accumulates the per-tile results. Annotated parameters: tile size (B), pipelining toggle, parallelism factors #1, #2, and #3]

SLIDE 11

Dot Product in DHDL

val output  = Reg[Float]
val vectorA = OffChipMem[Float](N)
val vectorB = OffChipMem[Float](N)
Reduce(N by B)(output){ i =>
  val tileA = Scratchpad[Float](B)
  val tileB = Scratchpad[Float](B)
  val acc   = Reg[Float]
  tileA load vectorA(i :: i+B)
  tileB load vectorB(i :: i+B)
  Reduce(B by 1)(acc){ j =>
    tileA(j) * tileB(j)
  }{a, b => a + b}
}{a, b => a + b}

Annotated parameters: tile size (B), pipelining toggle, parallelism factors #1, #2, and #3.
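The nested Reduce structure above can be sketched functionally in Python. This shows the tiling and two-level reduction only; the parallelism factors and pipelining are hardware concerns and appear here as plain sequential loops:

```python
def dot_product(vector_a, vector_b, B):
    """Tiled dot product mirroring the nested DHDL Reduce structure."""
    N = len(vector_a)
    output = 0.0                      # outer reduction register
    for i in range(0, N, B):          # outer Reduce: N by B (tile size B)
        tile_a = vector_a[i:i+B]      # load tile from "off-chip" memory
        tile_b = vector_b[i:i+B]
        acc = 0.0                     # inner reduction register
        for j in range(len(tile_a)):  # inner Reduce: B by 1
            acc += tile_a[j] * tile_b[j]
        output += acc                 # combine per-tile partial sums
    return output

print(dot_product([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], B=2))  # 70.0
```

In hardware, the outer loop becomes a tile-level pipeline, the tile loads become DRAM bursts into scratchpads, and the inner loop is unrolled by the parallelization factors.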

SLIDE 12

DHDL to Hardware

Toolflow: DHDL + design space → simple analyses → design space exploration → fixed DHDL → code generation → MaxJ HGL → MaxCompiler + Altera toolchain

SLIDE 13

DHDL Enables Fast DSE

- DHDL program → concise IR of parameterized templates
- Parameterized templates allow simple linear models → fast estimation (no unrolling, no scheduling needed)
- Easily derived space constraints enable space pruning → smaller spaces
- Together these enable fast design space exploration
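As a rough sketch of how constraints prune the space before estimation, here is a toy exploration loop. The area and runtime functions and the budget are invented placeholders, not the paper's actual models:

```python
from itertools import product

def area_estimate(tile, par):
    # Hypothetical linear area model: scratchpad storage + duplicated compute.
    return tile * 4 + par * 100

def runtime_estimate(n, tile, par):
    # Hypothetical latency model: per-tile compute plus fixed per-tile overhead.
    return (n // tile) * (tile // par + 10)

N, AREA_BUDGET = 4096, 3000
space = product([64, 128, 256, 512, 1024], [1, 2, 4, 8, 16])

# Prune infeasible points with the area constraint, then estimate only survivors.
feasible = [(t, p) for t, p in space if area_estimate(t, p) <= AREA_BUDGET]
best = min(feasible, key=lambda tp: runtime_estimate(N, *tp))
print(len(feasible), best)  # 19 of 25 points survive pruning; best = (256, 16)
```

Because the models are cheap closed-form expressions over template parameters, each surviving point costs only microseconds to evaluate, which is what makes exhaustive search over pruned spaces practical.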

SLIDE 14

Latency Modeling

- Analytical model
  - Uses depth-first search to find the critical path of pipelines
  - Accurate estimation requires data size annotations
- Main-memory model
  - Mathematical model fit to observed runtimes
  - Parameterized by:
    - Number of contending readers/writers
    - Number of commands issued in sequence
    - Command length
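The critical-path search in the analytical model can be sketched as a depth-first traversal of a pipeline dependence graph. The graph shape and node latencies below are made up for illustration:

```python
def critical_path(graph, latency, node, memo=None):
    """Longest-latency path ending at `node`, via depth-first search with memoization."""
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    preds = graph.get(node, [])  # predecessors of this node, if any
    longest_pred = max((critical_path(graph, latency, p, memo) for p in preds),
                       default=0)
    memo[node] = longest_pred + latency[node]
    return memo[node]

# Hypothetical pipeline: two loads feed a multiply, then an add, then a store.
graph = {"mul": ["loadA", "loadB"], "add": ["mul"], "store": ["add"]}
latency = {"loadA": 100, "loadB": 120, "mul": 6, "add": 1, "store": 100}
print(critical_path(graph, latency, "store"))  # 120 + 6 + 1 + 100 = 227
```

The pipeline's latency is set by its slowest dependence chain, which is why the two loads overlap (100 vs. 120 cycles) and only the longer one contributes to the total.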

SLIDE 15

Area Modeling

- Analytical model
  - Simple summation of the area of each template
  - Includes estimates for delay lines and banked memories
- Neural network models
  - Model routing costs and memory duplication
  - Simple 3-layer networks suffice here (we use 11-6-1)
  - Trained on a set of about 200 characterization designs
- Total area = analytical area + neural network area
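The total-area formula can be sketched as follows, with a hand-rolled feedforward pass standing in for the trained 11-6-1 network. All weights, features, and template areas here are placeholders, not values from the paper:

```python
import math

def nn_area(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a tiny 11-6-1 network estimating routing/duplication area."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def total_area(template_areas, features, params):
    # Analytical part: simple summation of per-template area estimates.
    analytical = sum(template_areas)
    # Neural-net part: captures effects the summation misses (routing, duplication).
    return analytical + nn_area(features, *params)

# Placeholder 11-element feature vector and zero weights, so the net outputs its bias.
features = [0.1] * 11
params = ([[0.0] * 11] * 6, [0.0] * 6, [0.0] * 6, 500.0)
print(total_area([1200, 800, 450], features, params))  # 2450 + 500 = 2950.0
```

Splitting the model this way keeps the analytical part interpretable while letting a small learned correction absorb the nonlinear effects that are hard to model by hand.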

SLIDE 16

Evaluation

- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?
- Performance: How good is the best generated design?

SLIDE 17

Results: Model Accuracy (Area)

Area models follow important trends and are accurate enough to drive automatic design space exploration.

[Chart: modeled vs. synthesized resource usage (ALMs, BRAMs, DSPs; 20% to 100%) for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm]

SLIDE 18

Results: Model Accuracy (Latency)

Latency models follow important trends and are accurate enough to drive automatic design space exploration.

Average error:
  dotproduct     2.8%
  outerprod      1.3%
  tpchq6         3.1%
  blackscholes   3.4%
  gda            6.7%
  kmeans         7.0%
  gemm          18.4%

SLIDE 19

Results: Prediction Speed

DHDL:
  Benchmark        Designs   Search Time
  Dot Product        5,426   5.3 ms / design
  Outer Product      1,702    30 ms / design
  TPCHQ6             5,426   8.2 ms / design
  Blackscholes         572    27 ms / design
  Matrix Multiply   70,740    11 ms / design
  K-Means           75,200    20 ms / design
  GDA               42,800    17 ms / design

Vivado HLS:
  GDA                  250   1.85 min / design

6533x speedup over HLS! (For GDA: 1.85 min ≈ 111,000 ms, and 111,000 ms / 17 ms ≈ 6,530 per design, consistent with the reported figure.)
SLIDE 20

Results: GDA Design Space

- Performance is limited by available BRAMs
- The design space for GDA spans four orders of magnitude

[Scatter plots: cycles (log scale, 10^7 to 10^10) vs. resource usage (% of maximum ALMs, DSPs, BRAMs); legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point]
SLIDE 21

Evaluation: Multi-Core Comparison

- FPGA
  - Altera Stratix V (28 nm)
  - 150 MHz clock
  - Peak main memory bandwidth of 37.5 GB/sec
- Multi-core CPU
  - Intel Xeon E5-2630 (32 nm)
  - 2.3 GHz
  - Peak main memory bandwidth of 42.6 GB/sec
  - 6 cores, 6 threads
  - Multi-threaded C++ code generated from Delite
- Execution time = FPGA execution time
  - Does not include CPU ↔ FPGA communication or configuration time

SLIDE 22

Results: Comparison with Multi-Core

Speedup over the multi-core CPU:
  dotproduct     1.07x
  outerprod      2.42x
  tpchq6         1.11x
  blackscholes  16.73x
  gda            4.55x
  kmeans         1.15x
  gemm           0.1x

[Chart grouped the benchmarks into memory-bound and compute-bound]

Gemm uses multi-threaded OpenBLAS on the CPU.

SLIDE 23

Summary

- DHDL exposes large design spaces to the compiler
- Parameterized templates enable fast, accurate estimators
- Fast estimators enable rapid automated DSE
- Up to 6533x faster estimation compared to Vivado HLS
- Up to 16.7x speedup over a 6-core CPU

SLIDE 24

[Blank slide]

SLIDE 25

Results: TPCHQ6 Design Space

[Scatter plots: cycles (log scale, 10^6 to 10^8) vs. resource usage (% of maximum ALMs, DSPs, BRAMs); legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point]

SLIDE 26

Results: Blackscholes Design Space

[Scatter plots: cycles (log scale, 10^6 to 10^8) vs. resource usage (% of maximum ALMs, DSPs, BRAMs); legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point]