
Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware

David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun. Stanford University. ISCA 2016.


  1. Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware. David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun. Stanford University. ISCA 2016.

  2. FPGAs in Data Centers
     - Increasing interest in using FPGAs as application accelerators in data centers
     - Key advantage: performance per watt

  3. Problem: Large Design Spaces
     - Design spaces grow exponentially with the number of parameters
     - Even relatively small designs can have very large spaces
     - Parameters can change runtime by orders of magnitude
     - Parameters typically aren't independent
     - Manual exploration is tedious and may result in suboptimal designs
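The exponential growth on this slide can be illustrated with a short Python sketch. The parameter names and value ranges below are hypothetical, chosen only for illustration; they are not taken from the talk.

```python
from itertools import product
from math import prod

# Hypothetical parameter ranges for a small design (illustrative values only).
PARAMS = {
    "tile_size": [64, 128, 256, 512, 1024],
    "outer_par": [1, 2, 4, 8],
    "inner_par": [1, 2, 4, 8, 16],
    "pipelined": [False, True],
}


def space_size(params):
    """The number of design points is the product of the option counts."""
    return prod(len(options) for options in params.values())


def enumerate_space(params):
    """Yield every design point as a dict of parameter name -> value."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))
```

Each additional parameter multiplies the space by its number of options, so even four small ranges already give hundreds of candidate designs.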

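  4. Design Space Example: Dot Product
     - Algorithm: dot product of vectors A and B
     - [Diagram: vectors A and B in DRAM are loaded into on-chip tiles (Tile A, Tile B) on the FPGA, multiplied, and summed into an accumulator register (acc). Key: scratchpad, register, operation.]
     - Small and simple, but slow!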

  5. Important Parameters: Tile Sizes
     - Algorithm: dot product of vectors A and B
     - [Diagram: same dot product design as the previous slide, with the tiles highlighted]
     - Increases length of DRAM accesses (affects runtime)
     - Increases exploited spatial locality (affects runtime)
     - Increases local memory sizes (affects area)
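A minimal sketch of the tile-size tradeoff, assuming one DRAM command per tile per vector and one scratchpad of `tile_size` words per vector. Both assumptions and the function name are illustrative, not the authors' model.

```python
from math import ceil


def tile_tradeoff(n, tile_size, num_vectors=2):
    """Return (DRAM commands issued, on-chip scratchpad words).

    Assumes each tile is fetched with a single DRAM command and that
    every input vector gets its own scratchpad of tile_size words.
    """
    commands = num_vectors * ceil(n / tile_size)
    scratchpad_words = num_vectors * tile_size
    return commands, scratchpad_words
```

Larger tiles mean fewer, longer DRAM accesses (better runtime) at the cost of more local memory (more area), matching the three bullets above.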

  6. Important Parameters: Pipelining
     - Algorithm: dot product of vectors A and B
     - [Diagram: the design split into Stage 1 (tile loads) and Stage 2 (multiply-accumulate), with double buffers between the stages. Key: double buffer, register, operation.]
     - Overlaps memory and compute (affects runtime)
     - Increases local memory sizes (affects area)
     - Adds synchronization logic (affects area)
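The runtime effect of overlapping memory and compute can be sketched with a standard two-stage pipeline estimate. The cycle counts passed in are hypothetical inputs, and this is a textbook approximation rather than the latency model used in DHDL.

```python
def sequential_cycles(num_tiles, load_cycles, compute_cycles):
    """Without pipelining, every tile pays for its load and then its compute."""
    return num_tiles * (load_cycles + compute_cycles)


def pipelined_cycles(num_tiles, load_cycles, compute_cycles):
    """With double buffering, the load of tile i+1 overlaps the compute of tile i.

    The first load and the last compute cannot be hidden; in between,
    each step costs as much as the slower of the two stages.
    """
    if num_tiles == 0:
        return 0
    steady_state = (num_tiles - 1) * max(load_cycles, compute_cycles)
    return load_cycles + steady_state + compute_cycles
```

For example, with 10 tiles, 100 load cycles, and 80 compute cycles per tile, the sequential estimate is 1800 cycles and the pipelined estimate is 1080 cycles.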

  7. Important Parameters: Parallelization
     - Algorithm: dot product of vectors A and B
     - [Diagram: several multipliers operate on tile elements in parallel and feed an adder tree before the accumulator. Key: scratchpad, register, operation.]
     - Improves element throughput (affects runtime)
     - Duplicates compute resources (affects area)
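A software sketch of the parallelized datapath, assuming `par` multipliers feeding a balanced adder tree. The function names are hypothetical, and the code only models the arithmetic, not hardware timing.

```python
def tree_reduce(values, combine):
    """Combine values pairwise, level by level, as an adder tree would.

    Returns (result, depth), where depth is the number of tree levels.
    """
    level = list(values)
    if not level:
        raise ValueError("tree_reduce needs at least one value")
    depth = 0
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def parallel_dot(a, b, par):
    """Dot product that consumes `par` element pairs per step."""
    acc = 0.0
    for i in range(0, len(a), par):
        products = [x * y for x, y in zip(a[i:i + par], b[i:i + par])]
        partial, _ = tree_reduce(products, lambda x, y: x + y)
        acc += partial
    return acc
```

Doubling `par` halves the number of steps but doubles the multipliers and adds one level to the adder tree, which is the runtime/area tradeoff on the slide.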

  8. Language/Tool Requirements
     - Tools compared: VHDL, Verilog, LegUp, Vivado HLS, OpenCL SDK, Aladdin, DHDL
     - Requirements:
       - Targets FPGAs
       - Enables pipelining at arbitrary loop levels
       - Exposes design parameters to the compiler
       - Evaluates designs prior to synthesis
       - Explores design space automatically
       - Generates synthesizable code
     - [Comparison table; per-tool entries not recoverable]

  9. Delite Hardware Definition Language
     - Includes a variety of parameterized templates
       - Parallel patterns with implicit parallelization factors
       - Pipeline constructs for pipelining at arbitrary levels
       - Explicit size parameters for loop step size and buffer sizes
     - All parameters are exposed to the compiler
     - Compiler includes latency and area models for quick design evaluation
     - Compiler automatically explores the design space
     - Generates synthesizable MaxJ HGL after exploration

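  10. Dot Product DHDL Diagram
      - [Diagram: tiles of size B are loaded from vectors A and B in DRAM into Tile A and Tile B; an inner Reduce multiplies and sums the tile elements, and an outer Reduce accumulates the per-tile results into out. Annotated parameters: tile size (B), parallelism factors #1, #2, and #3, pipelining toggle.]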

  11. Dot Product in DHDL

          val output = Reg[Float]
          val vectorA = OffChipMem[Float](N)
          val vectorB = OffChipMem[Float](N)
          Reduce(N by B)(output){ i =>        // parallelism factor #1, pipelining toggle
            val tileA = Scratchpad[Float](B)  // tile size (B)
            val tileB = Scratchpad[Float](B)
            val acc = Reg[Float]
            tileA load vectorA(i :: i+B)      // parallelism factor #2
            tileB load vectorB(i :: i+B)
            Reduce(B by 1)(acc){ j =>         // parallelism factor #3
              tileA(j) * tileB(j)
            }{a, b => a + b}
          }{a, b => a + b}
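For readers unfamiliar with DHDL, the nested Reduce structure on this slide can be read as the following plain Python reference. This is a software sketch of the same computation, not DHDL output, and `dot_product_tiled` is a hypothetical name.

```python
def dot_product_tiled(vector_a, vector_b, tile_size):
    """Software reference for the two nested Reduce patterns."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    n = len(vector_a)
    output = 0.0
    for i in range(0, n, tile_size):            # Reduce(N by B)(output)
        tile_a = vector_a[i:i + tile_size]      # tileA load vectorA(i :: i+B)
        tile_b = vector_b[i:i + tile_size]      # tileB load vectorB(i :: i+B)
        acc = 0.0
        for j in range(len(tile_a)):            # Reduce(B by 1)(acc)
            acc += tile_a[j] * tile_b[j]
        output += acc                           # outer combine: a + b
    return output
```

The tile size only changes how the work is grouped; the result is the same for any `tile_size`, up to floating-point reassociation.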

  12. DHDL to Hardware
      - Flow: DHDL -> Simple Analyses -> DHDL + Design Space -> Design Space Exploration -> Fixed DHDL -> Code Generation -> MaxJ HGL -> MaxCompiler + Altera Toolchain

  13. DHDL Enables Fast DSE
      - [Diagram relating: DHDL program; parameterized templates; concise IR; simple, linear, easily derived models and space constraints; no unrolling; no scheduling; fast estimation; space pruning; smaller spaces; fast design space exploration]
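One way to picture space pruning with simple constraints is the sketch below. The three constraints (tile size divides the vector length, parallelism factor divides the tile size, scratchpads fit an on-chip budget) are assumptions made for illustration; the slide does not list the actual constraints used by the compiler.

```python
from itertools import product


def pruned_space(n, tile_sizes, pars, max_words, num_tiles_on_chip=2):
    """Yield (tile_size, par) pairs that pass cheap validity checks.

    All three checks are hypothetical examples of simple constraints
    that can be evaluated without estimating or synthesizing the design.
    """
    for tile, par in product(tile_sizes, pars):
        if n % tile != 0:                        # tiles must cover the vector evenly
            continue
        if tile % par != 0:                      # lanes must divide the tile evenly
            continue
        if num_tiles_on_chip * tile > max_words: # scratchpads must fit on chip
            continue
        yield tile, par
```

With `n=960`, tile sizes `[64, 96, 128, 192, 2048]`, factors `[1, 2, 3, 4]`, and a 512-word budget, 11 of the 20 raw combinations survive, so fewer points need to be estimated.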

  14. Latency Modeling
      - Analytical model
        - Uses depth-first search to get the critical path of pipelines
        - Accurate estimation requires data size annotations
      - Main-memory model
        - Mathematical model fit to observed runtimes
        - Parameterized by:
          - Number of contending readers/writers
          - Number of commands issued in sequence
          - Command length
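The critical-path step can be sketched as a memoized depth-first search over a dataflow graph. The graph encoding and the node latencies used in the example are hypothetical, and the graph is assumed to be acyclic; the slide does not describe the compiler's actual data structures.

```python
from functools import lru_cache


def critical_path(latency, deps):
    """Length in cycles of the longest dependency chain.

    latency: dict mapping node -> cycles for that node.
    deps:    dict mapping node -> list of nodes it depends on.
    Assumes the graph has no cycles.
    """
    @lru_cache(maxsize=None)
    def finish(node):
        # A node starts once its slowest predecessor has finished.
        start = max((finish(p) for p in deps.get(node, ())), default=0)
        return start + latency[node]

    return max((finish(node) for node in latency), default=0)
```

For a made-up pipeline body with two 1-cycle loads feeding a 6-cycle multiply, a 4-cycle add, and a 1-cycle store, the critical path is 12 cycles.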

  15. Area Modeling
      - Analytical model
        - Simple summation of the area of each template
        - Includes estimates for delay lines and banked memories
      - Neural network models
        - Model routing costs and memory duplication
        - Simple 3-layer networks suffice here (we use 11-6-1)
        - Trained on a set of about 200 characterization designs
      - Total area = analytical area + neural net area
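A minimal sketch of the hybrid estimate, assuming an 11-input, 6-hidden-unit, 1-output network with a tanh hidden layer and a linear output. The activation function, the input features, and the random placeholder weights are all assumptions; the slide gives only the 11-6-1 shape and the final sum.

```python
import math
import random


def random_weights(seed=0, n_in=11, n_hidden=6):
    """Placeholder weights; a real model would be trained on characterization designs."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of an 11-6-1 network (tanh hidden layer is an assumption)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def total_area(template_areas, features, weights):
    """Total area = analytical sum over templates + neural-network term."""
    analytical = sum(template_areas)
    return analytical + mlp_forward(features, *weights)
```

The analytical term captures what is known per template, while the learned term stands in for effects such as routing and memory duplication that are hard to write down in closed form.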

  16. Evaluation
      - Accuracy: How accurate are the models, compared to observations?
      - Speed: How fast are the predictions, compared to commercial tools?
      - Space: Do the design parameters help capture an interesting space?
      - Performance: How good is the best generated design?

  17. Results: Model Accuracy (Area)
      - [Bar chart: modeled vs. synthesized resource usage (%) for ALMs, BRAMs, and DSPs across dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm]
      - Area models follow important trends and are accurate enough to drive automatic design space exploration

  18. Results: Model Accuracy (Latency)
      - [Bar chart: average error (%) for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm. Bar labels, in extraction order: 18.4%, 6.7%, 7%, 3.4%, 3.1%, 2.8%, 1.3%; the label-to-benchmark mapping is not recoverable.]
      - Latency models follow important trends and are accurate enough to drive automatic design space exploration

  19. Results: Prediction Speed

          Tool        Benchmark        Designs  Search time
          DHDL        Dot Product        5,426  5.3 ms / design
          DHDL        Outer Product      1,702  30 ms / design
          DHDL        TPCHQ6             5,426  8.2 ms / design
          DHDL        Blackscholes         572  27 ms / design
          DHDL        Matrix Multiply   70,740  11 ms / design
          DHDL        K-Means           75,200  20 ms / design
          DHDL        GDA               42,800  17 ms / design
          Vivado HLS  GDA                  250  1.85 min / design

      - 6533x speedup over HLS!
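The headline speedup can be roughly reproduced from the two GDA rows, assuming that is the comparison the slide intends. The arithmetic below uses the rounded per-design times shown on the slide.

```python
HLS_MINUTES_PER_DESIGN = 1.85   # Vivado HLS, GDA row
DHDL_MS_PER_DESIGN = 17.0       # DHDL, GDA row


def per_design_speedup(hls_minutes, dhdl_ms):
    """Ratio of per-design estimation times, both converted to seconds."""
    return (hls_minutes * 60.0) / (dhdl_ms / 1000.0)


def total_search_seconds(num_designs, seconds_per_design):
    """Time to estimate every design in a space at a fixed per-design cost."""
    return num_designs * seconds_per_design


speedup = per_design_speedup(HLS_MINUTES_PER_DESIGN, DHDL_MS_PER_DESIGN)
```

The ratio comes out at about 6,529x, close to the 6533x quoted on the slide; the small gap is plausibly due to rounding in the per-design times. At 17 ms per design, all 42,800 GDA points take roughly 12 minutes to estimate.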

  20. Results: GDA Design Space
      - [Scatter plots: cycles (log scale, 10^7 to 10^10) vs. resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]
      - Performance limited by available BRAMs
      - Space for GDA spans four orders of magnitude
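Selecting Pareto-optimal designs from a set of estimates can be sketched as below. For simplicity the sketch collapses the three resources into a single area percentage, which is an assumption; the plots on the slide show ALMs, DSPs, and BRAMs separately. The sample points in the note are made up.

```python
def valid_designs(designs, max_area=100.0):
    """Keep designs that fit on the device (area given as % of maximum)."""
    return [(area, cycles) for area, cycles in designs if area <= max_area]


def pareto_front(designs):
    """Return designs not dominated in (area, cycles); lower is better for both.

    After sorting by area, a design is on the front only if it is
    strictly faster than every design that uses no more area.
    """
    front = []
    best_cycles = float("inf")
    for area, cycles in sorted(designs):
        if cycles < best_cycles:
            front.append((area, cycles))
            best_cycles = cycles
    return front
```

For the made-up points `(20, 1e9)`, `(40, 1e8)`, `(40, 5e8)`, `(60, 2e8)`, `(80, 1e7)`, and `(120, 1e6)`, the last is invalid because it exceeds 100% area, and the front is `(20, 1e9)`, `(40, 1e8)`, `(80, 1e7)`.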

  21. Evaluation: Multi-Core Comparison
      - FPGA
        - Altera Stratix V (28 nm)
        - 150 MHz clock
        - Peak main memory bandwidth of 37.5 GB/sec
      - Multi-core CPU
        - Intel Xeon E5-2630 (32 nm)
        - 2.3 GHz
        - Peak main memory bandwidth of 42.6 GB/sec
        - 6 cores, 6 threads
        - Multi-threaded C++ code generated from Delite
      - Execution time = FPGA execution time
        - Does not include CPU-FPGA communication or configuration time

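  22. Results: Comparison with Multi-Core
      - [Bar chart: speedup over the multi-core CPU for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm, grouped into compute-bound and memory-bound benchmarks. Bar labels, in extraction order: 16.73, 4.55, 2.42, 1.11, 1.15, 1.07, 0.1; the label-to-benchmark mapping is not recoverable.]
      - Gemm uses multi-threaded OpenBLAS on the CPU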

  23. Summary
      - DHDL exposes large design spaces to the compiler
      - Parameterized templates enable fast, accurate estimators
      - Fast estimators enable rapid automated DSE
      - Up to 6533x faster estimation compared to Vivado HLS
      - Up to 16.7x speedup over 6-core CPU


  25. Results: TPCHQ6 Design Space
      - [Scatter plots: cycles (log scale, 10^6 to 10^8) vs. resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]

  26. Results: Blackscholes Design Space
      - [Scatter plots: cycles (log scale, 10^6 to 10^8) vs. resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]
