  1. Towards Scalable and Efficient FPGA Stencil Accelerators
  Gaël Deest¹, Nicolas Estibals¹, Tomofumi Yuki², Steven Derrien¹, Sanjay Rajopadhye³
  ¹IRISA / Université de Rennes 1 / Cairn  ²INRIA / LIP / ENS Lyon  ³Colorado State University
  January 19th, 2016

  2. Stencil Computations
  Important class of algorithms:
  ◮ Iterative grid updates
  ◮ Uniform dependences
  Examples:
  ◮ Solving partial differential equations
  ◮ Computer simulations (physics, seismology, etc.)
  ◮ (Real-time) image/video processing
  Strong need for efficient hardware implementations.

  3. Application Domains
  Two main application types with vastly different goals:
  HPC:
  ◮ “Be as fast as possible”
  ◮ No real-time constraints
  Embedded Systems:
  ◮ “Be fast enough”
  ◮ Real-time constraints
  For now, we focus on FPGAs from the HPC perspective.

  4. FPGAs as Stencil Accelerators?
  CPU: ≈10 cores, DDR, ≈10 GB/s off-chip
  GPU: ≈100 cores, GDDR, ≈100 GB/s off-chip
  FPGA: ≈1000 cores, DDR, ≈1 GB/s off-chip
  Features:
  ◮ Large on-chip bandwidth
  ◮ Fine-grained pipelining
  ◮ Customizable datapath / arithmetic
  Drawbacks:
  ◮ Small off-chip bandwidth
  ◮ Difficult to program
  ◮ Lower clock frequencies

  5. Design Challenges
  At least two problems:
  ◮ Increase throughput with parallelization:
    ◮ Multiple PEs
    ◮ Pipelining
  ◮ Decrease bandwidth usage:
    ◮ Use on-chip memory to maximize reuse
    ◮ Choose the memory mapping carefully to enable burst accesses

  6. Stencils “Done Right” for FPGAs
  Observation:
  ◮ Many different strategies exist:
    ◮ Multiple-level tiling
    ◮ Deep pipelining
    ◮ Time skewing
    ◮ ...
  ◮ No single work puts them all together.
  Key features:
  ◮ Target one large, deeply pipelined PE...
  ◮ ...instead of many small PEs
  ◮ Manage throughput/bandwidth with two-level tiling

  7. Multiple-Level Tiling
  Composition of two or more tiling transformations to account for:
  ◮ Memory hierarchies and locality:
    ◮ Registers, caches, RAM, disks, ...
  ◮ Multiple levels of parallelism:
    ◮ Instruction-level, thread-level, ...
  In this work:
  1. Inner tiling level: parallelism.
  2. Outer tiling level: communication.

  8. Overview of Our Approach
  Core ideas:
  1. Execute inner, Datapath-Level (DL) tiles on a single, pipelined “macro-operator”.
  ◮ Fire a new tile execution each cycle.
  ◮ Delegate operator pipelining to HLS.
  2. Group DL-tiles into Communication-Level (CL) tiles to decrease bandwidth requirements.
  ◮ Store intermediate results on chip.

  9. Outline
  Introduction
  Approach
  Evaluation
  Related Work and Comparison
  Future Work & Conclusion

  10. Running Example: Jacobi (3-point, 1D data)
  Simplified code:

  for (t = 1; t < T; t++)
    for (x = 1; x < N - 1; x++)
      f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3;

  Dependence vectors: (−1, −1), (−1, 0), (−1, 1)

  11. Datapath-Level Tiling
  Skewing: (t, x) ↦ (t, x + t)
  [Figure: iteration space before and after skewing, partitioned into rectangular DL-tiles.]

  14. Datapath-Level Tile Operator

  for (t = ...) {
    for (x = ...) {
  #pragma HLS PIPELINE II=1
      for (tt = ...) {
  #pragma HLS UNROLL
        for (xx = ...) {
  #pragma HLS UNROLL
          int tg = t + tt, xg = x + xx - tg;
          f[tg][xg] = (f[tg-1][xg-1] + f[tg-1][xg] + f[tg-1][xg+1]) / 3;
        }
      }
    }
  }

  Types of parallelism:
  ◮ Operation-level parallelism (exposed by unrolling).
  ◮ Temporal parallelism (through pipelined tile executions).

  15. Pipelined Execution
  Pipelined execution requires inter-tile parallelism.
  [Figures: original dependences, tile-level dependences, Gauss-Seidel dependences.]

  16. Wavefronts of Datapath-Level Tiles
  Skewing: (t, x) ↦ (t + x, x)
  [Figure: tile space skewed so that independent tiles line up in wavefronts.]

  19. Managing the Compute/IO Ratio
  Problem: suppose direct pipelining of 2×2 DL-tiles. At each clock cycle:
  ◮ A new tile enters the pipeline.
  ◮ Six 32-bit values are fetched from off-chip memory.
  At 100 MHz, bandwidth usage is 19.2 Gbit/s!
  Solution: use a second tiling level to decrease bandwidth requirements.

  20. Communication-Level Tiling
  Shape constraints:
  ◮ Constant-height wavefronts
  ◮ Enable the use of simple FIFOs for intermediate results
  Size constraints:
  ◮ Tiles per wavefront ≥ pipeline depth
  ◮ Bandwidth requirements ≤ off-chip limit
  ◮ Size of FIFOs ≤ on-chip capacity

  24. Communication-Level Tile Shape
  Hyperparallelepipedic (rectangular) tiles satisfy all shape constraints.

  25. Communication
  Two aspects:
  On-chip communication:
  ◮ Between DL-tiles
  ◮ Uses FIFOs
  Off-chip communication:
  ◮ Between CL-tiles
  ◮ Uses memory accesses

  26. On-Chip Communication
  We use Canonic Multi-Projections (Yuki and Rajopadhye, 2011).
  Main ideas:
  ◮ Communicate along canonical axes.
  ◮ Project diagonal dependences on canonical directions.
  ◮ Some values are redundantly stored.
  [Figure: a tile with buffers buff_t(in), buff_t(out), buff_x(in), buff_x(out) along each axis.]

  27. Off-Chip Communication
  Between CL-tiles (assuming lexicographic ordering):
  ◮ Data can be reused along the innermost dimension.
  ◮ Data from/to other tiles must be fetched/stored off-chip.
    ◮ Complex shape
  ◮ Key for performance: use burst accesses.
    ◮ Maximize contiguity with a clever memory mapping.


  29. Outline
  Introduction
  Approach
  Evaluation
  Related Work and Comparison
  Future Work & Conclusion

  30. Metrics
  ◮ Hardware-related metrics:
    ◮ Macro-operator pipeline depth
    ◮ Area (slices, BRAM & DSP)
  ◮ Performance-related metrics (at steady state):
    ◮ Throughput
    ◮ Required bandwidth

  31. Preliminary Results: Parallelism Scalability
  [Chart: steady-state throughput (3.4 up to 38.4 GFlop/s), computation resource usage (2% up to 44%), and pipeline depth (61 up to 229) across DL-tile sizes 2×2, 2×4, 4×2, 4×4, 8×8, 2×2×2, 3×3×3 and 4×4×4.]
  Choose the DL-tile size to control:
  ◮ Computational throughput
  ◮ Computational resource usage
  ◮ Macro-operator latency and pipeline depth

  32. Preliminary Results: Bandwidth Usage Control
  [Chart: steady-state bandwidth (2.2 GB/s down to 0.5 GB/s) and BRAM usage (between 6% and 42%) across CL-tile sizes from n×15×14 to n×59×59, for a 4×4×4 DL-tile.]
  Enlarging CL-tiles:
  ◮ Does not change throughput
  ◮ Reduces bandwidth requirements
  ◮ Has a low impact on hardware resources

  33. Outline
  Introduction
  Approach
  Evaluation
  Related Work and Comparison
  Future Work & Conclusion

  34. Related Work
  ◮ Hardware implementations:
    ◮ Many ad hoc / naive architectures
    ◮ Systolic architectures (LSGP)
    ◮ PolyOpt/HLS (Pouchet et al., 2013)
      ◮ Tiling to control the compute/IO balance
    ◮ Alias et al., 2012
      ◮ Single, pipelined operator
      ◮ Innermost loop body only
  ◮ Tiling method:
    ◮ “Jagged Tiling” (Shrestha et al., 2015)

  35. Outline
  Introduction
  Approach
  Evaluation
  Related Work and Comparison
  Future Work & Conclusion

  36. Future Work
  ◮ Finalize the implementation
  ◮ Go beyond Jacobi
  ◮ Explore other number representations:
    ◮ Fixed-point
    ◮ Block floating-point
    ◮ Custom floating-point
  ◮ Hardware/software codesign
  ◮ ...

  37. Conclusion
  ◮ A design template for FPGA stencil accelerators
  ◮ Two levels of control:
    ◮ Throughput (DL-tile size)
    ◮ Bandwidth requirements (CL-tile size)
  ◮ Maximizes use of pipeline parallelism

  38. Thank You — Questions?
