Spatial: A Language and Compiler for Application Accelerators (PowerPoint presentation transcript)

Spatial: A Language and Compiler for Application Accelerators
Raghu Prabhakar, Stanford University / SambaNova Systems
TVM Conference, Dec 13, 2018

The Future Is (Probably) Reconfigurable
[Chart: energy efficiency (MOPS/mW) on a log scale from 1,000 to 10,000, with ASICs at the top]


  1. Dot Product in Spatial

A tiled, pipelined reduction. The diagram shows vectorA, vectorB, and output in DRAM, with double-buffered tiles (tileA(0)/tileA(1), tileB(0)/tileB(1)) on the FPGA: Stage 1 loads the tiles, Stage 2 multiplies and reduces each tile into the register acc, and an outer reduce combines the per-tile results into output.

    val output  = ArgOut[Float]
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Accel {
      Reduce(output)(N by B){ i =>
        val tileA = SRAM[Float](B)
        val tileB = SRAM[Float](B)
        val acc   = Reg[Float]
        tileA load vectorA(i :: i+B)
        tileB load vectorB(i :: i+B)
        Reduce(acc)(B by 1){ j =>
          tileA(j) * tileB(j)
        }{a, b => a + b}
      }{a, b => a + b}
    }

  2. Dot Product in Spatial

The same code as slide 1; this build highlights the outer reduce function, the trailing {a, b => a + b}, which combines the per-tile partial sums into output (Stage 3 in the diagram).

  3. Dot Product in Spatial

The same code again; this build labels the implicit design parameters: the tile size B, the banking strategy for the tiles, three parallelism factors (on the outer reduce, the tile loads, and the inner reduce), and a metapipelining toggle on the outer controller.
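For concreteness, these design parameters can be written directly in the source. A sketch below, using the `value (min -> max)` parameter-range notation from the Spatial paper; the exact spelling varies across Spatial versions, and the ranges here are invented for illustration:

    // Hypothetical parameterized dot product; ranges are illustrative.
    val B  = 64 (64 -> 1024)  // tile size, swept by the tuner
    val P1 = 1 (1 -> 8)       // parallelism factor on the outer reduce
    val P2 = 1 (1 -> 16)      // parallelism factor on the inner reduce

    Accel {
      Reduce(output)(N by B par P1){ i =>
        val tileA = SRAM[Float](B)
        val tileB = SRAM[Float](B)
        val acc   = Reg[Float]
        tileA load vectorA(i :: i+B)
        tileB load vectorB(i :: i+B)
        Reduce(acc)(B by 1 par P2){ j => tileA(j) * tileB(j) }{_ + _}
      }{_ + _}
    }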

  4. Dot Product in Spatial

    val output  = ArgOut[Float]
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Accel {
      Reduce(output)(N by B){ i =>
        val tileA = SRAM[Float](B)
        val tileB = SRAM[Float](B)
        val acc   = Reg[Float]
        tileA load vectorA(i :: i+B)
        tileB load vectorB(i :: i+B)
        Reduce(acc)(B by 1){ j =>
          tileA(j) * tileB(j)
        }{a, b => a + b}
      }{a, b => a + b}
    }
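For context, a complete Spatial application wraps this kernel with host-side setup and readback. A minimal sketch, assuming the `SpatialApp` boilerplate and the `setMem`/`getArg` host calls described in the Spatial documentation; exact signatures differ across releases:

    import spatial.dsl._

    object DotProduct extends SpatialApp {
      def main(args: Array[String]): Unit = {
        val N = 1024
        val B = 64
        val output  = ArgOut[Float]
        val vectorA = DRAM[Float](N)
        val vectorB = DRAM[Float](N)

        // Host: populate accelerator DRAM before launching the kernel.
        setMem(vectorA, Array.tabulate(N){ i => i.to[Float] })
        setMem(vectorB, Array.tabulate(N){ i => 1.to[Float] })

        Accel {
          Reduce(output)(N by B){ i =>
            val tileA = SRAM[Float](B)
            val tileB = SRAM[Float](B)
            val acc   = Reg[Float]
            tileA load vectorA(i :: i+B)
            tileB load vectorB(i :: i+B)
            Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{_ + _}
          }{_ + _}
        }

        // Host: read the scalar result back after the kernel finishes.
        println("dot product = " + getArg(output))
      }
    }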

  5. Dot Product in Spatial

The Spatial program and its design parameters together form the input to the compiler.

  6. The Spatial Compiler

Input: the Spatial program.

  7. The Spatial Compiler

The program is first lowered to the Spatial IR.

  8. The Spatial Compiler

The Spatial IR, together with the design parameters, flows through the following passes (the slide's legend distinguishes intermediate representations, IR transformations, IR analyses, and code generation):

■ Control inference
■ Control scheduling
■ Access pattern analysis
■ Memory banking / buffering
■ Area / runtime analysis
■ [Optional] design tuning, which feeds modified parameters back into the flow
■ Pipeline unrolling
■ Pipeline retiming
■ Host resource allocation
■ Control signal inference
■ Chisel code generation

  9. Control Scheduling

■ Creates loop pipeline schedules
■ Detects data dependencies across loop iterations
■ Calculates the initiation interval of each pipeline
■ Sets the maximum depth of buffers
■ Supports arbitrarily nested pipelines (commercial HLS tools don't support this)
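As a small illustration of nesting, a sketch of a three-stage outer pipeline with a pipelined inner loop, in the style of the dot-product example (`vectorA` and `vectorOut` are assumed to be `DRAM[Float](N)` handles; the scheduler overlaps the stages across outer iterations):

    Foreach(N by B){ i =>                  // outer controller, metapipelined
      val tileIn  = SRAM[Float](B)
      val tileOut = SRAM[Float](B)
      tileIn load vectorA(i :: i+B)        // stage 1: DRAM -> SRAM
      Foreach(B by 1){ j =>                // stage 2: pipelined compute
        tileOut(j) = tileIn(j) * 2
      }
      vectorOut(i :: i+B) store tileOut    // stage 3: SRAM -> DRAM
    }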

  10. Design Tuning

The optional design tuning pass sits in a loop with the earlier analyses: area/runtime analysis evaluates a parameter setting, and the tuner feeds modified parameters back into the flow.

  11. Design Space Parameters

Example: vectorA ∙ vectorB. The baseline FPGA design streams elements through a single tileA/tileB pair into one multiplier and accumulator under a single counter (legend: control, compute, SRAM, registers). Small and simple, but slow!

  12. Important Parameters: Buffer Sizes

Larger tiles (tileA, tileB):
■ Increase the length of DRAM accesses (improves runtime)
■ Increase the exploited locality (improves runtime)
■ Increase local memory sizes (costs area)

  13. Important Parameters: Pipelining

Metapipelining requires double buffering (tileA(0)/tileA(1), tileB(0)/tileB(1)) so that Stage 1 and Stage 2 can run concurrently.
■ Overlaps memory and compute (improves runtime)
■ Increases local memory sizes (costs area)
■ Adds synchronization logic (costs area)

  14. Important Parameters: Parallelization

Duplicating the multiply lanes and counters:
■ Improves element throughput (improves runtime)
■ Duplicates compute resources (costs area)

  15. Important Parameters: Memory Banking

Parallelization requires banking the tiles so that all lanes can access memory in the same cycle.
■ Improves memory bandwidth (improves runtime)
■ May duplicate memory resources (costs area)
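In source form, parallelization is just a `par` annotation on a counter, and banking is inferred by the compiler rather than written by the user. A sketch against the dot-product tiles from earlier:

    // Four lanes evaluate tileA(j) * tileB(j) in the same cycle, so the
    // compiler banks tileA and tileB four ways to serve all four reads.
    Reduce(acc)(B by 1 par 4){ j =>
      tileA(j) * tileB(j)
    }{_ + _}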

  16. Design Tuning

Design tuning revisited: the tuner sits in the analysis loop of the compiler flow (slide 10), producing modified parameters.

  17. Design Tuning

The original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area and runtime of each point

  18. Design Tuning

In addition to the original method (pre-prune, randomly sample ~100,000 points, model each point), the proposed tuning method:
■ Active learning with HyperMapper (more details in the paper)

  19. Design Tuning

The tuning loop is fast: no slow IR transformations run inside the loop, only parameter updates and the area/runtime models.
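To make "sample ~100,000 points and model each one" concrete, here is a self-contained sketch in plain Scala of that loop over a stub cost model. Everything in it (the Point class, the models, the ranges, the area cap) is invented for illustration; the real flow queries Spatial's area/runtime analyses, and HyperMapper replaces the random sampling with active learning:

    import scala.util.Random

    object TunerSketch {
      // A design point: tile size and two parallelization factors.
      case class Point(b: Int, p1: Int, p2: Int)

      // Stub models; a real tuner would query Spatial's analyses instead.
      def area(p: Point): Double    = p.b.toDouble * p.p1 * p.p2
      def runtime(p: Point): Double = 1e6 / (math.min(p.b, 256) * p.p1 * p.p2)

      def main(args: Array[String]): Unit = {
        val rng = new Random(0)
        val samples = Seq.fill(100000) {          // ~100,000 random points
          Point(b  = 1 << (6 + rng.nextInt(5)),   // tile size 64..1024
                p1 = 1 << rng.nextInt(4),         // outer par 1..8
                p2 = 1 << rng.nextInt(5))         // inner par 1..16
        }
        val areaCap = 50000.0                     // prune: must fit the device
        val best = samples.filter(area(_) <= areaCap).minBy(runtime)
        println(s"best point: $best, modeled runtime: ${runtime(best)}")
      }
    }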

  22. The Spatial Compiler: The Rest

Code generation:
■ Synthesizable Chisel for the accelerator
■ C++ code for the host CPU

  23. Evaluation: Performance

■ FPGA: Amazon EC2 F1 instance (Xilinx VU9P), fixed clock rate of 150 MHz
■ Applications:
  ■ SDAccel: hand-optimized, hand-tuned implementations
  ■ Spatial: hand-written, automatically tuned implementations
■ Execution time = FPGA execution time

  24. Performance (Spatial vs. SDAccel)

Average of 2.9x faster hardware than SDAccel.
[Bar chart: speedup over SDAccel, 0 to 15x, with per-benchmark speedups of 8.5x, 1.4x, 3.5x, 1.4x, 1.6x, 14.1x, and 1.3x; labeled benchmarks include BlackScholes, GEMM, PageRank, and TPC-H Q6]

  25. Productivity: Lines of Code

Average of 42% shorter programs versus SDAccel.
[Bar chart: lines of code, 0 to 250, SDAccel versus Spatial, with per-benchmark reductions of 47%, 12%, 60%, 44%, 31%, 66%, and 35%; labeled benchmarks include BlackScholes, GEMM, PageRank, and TPC-H Q6]

  26. Evaluation: Portability

■ FPGA 1: Amazon EC2 F1 instance, Xilinx VU9P, 19.2 GB/s DRAM bandwidth (single channel)
■ FPGA 2: Xilinx Zynq ZC706, 4.3 GB/s DRAM bandwidth
■ Applications: Spatial, hand-written, automatically tuned implementations
■ Fixed clock rate of 150 MHz

  27. Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets.
Porting: speedup (VU9P / Zynq) obtained only from moving to the larger FPGA.
[Bar chart, 0 to 30x: porting speedups of 2.5x, 2.5x, 1.2x, 1.3x, 2.5x, 4.6x, and 2.5x; labeled benchmarks include BlackScholes, GEMM, PageRank, and TPC-H Q6]

VU9P / ZC706 resource ratios:
■ DRAM bandwidth: 4.5x
■ LUTs (general-purpose compute): 47.3x
■ DSPs (integer FMA): 7.6x
■ On-chip memory: 4.0x (no URAM used on the VU9P)

  28. Portability: VU9P vs. Zynq ZC706

Adds the tuning bars: speedup obtained only from re-tuning the design parameters for the larger FPGA; tuning speedups of 1.7x, 1.0x, 1.1x, 2.7x, 9.4x, 2.6x, and 2.1x. (Resource ratios as on slide 27.)

  29. Portability: VU9P vs. Zynq ZC706

The total speedup is the product: Porting × Tuning, giving 2.2x, 2.5x, 5.0x, 6.8x, 23.4x, 6.5x, and 2.5x. (Resource ratios as on slide 27.)

  30. Portability: Plasticine CGRA

Identical Spatial source, multiple targets: even reconfigurable hardware that isn't an FPGA.

    Benchmark     DRAM BW (%)      Resource Utilization (%)    Speedup
                  Load    Store    PCU     PMU     AG          vs. VU9P
    BlackScholes  77.4    12.9     73.4    10.9    20.6         1.6
    GDA           24.0     0.2     95.3    73.4    38.2         9.8
    GEMM          20.5     2.1     96.8    64.1    11.7        55.0
    K-Means        8.0     0.4     89.1    57.8    17.6         6.3
    TPC-H Q6      97.2     0.0     29.7    37.5    70.6         1.6

Prabhakar et al., Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA '17).

  31. Halide to Spatial

  32. What is Halide?

● DSL for computational photography
● Separation between the algorithm (what to compute) and the schedule (how to compute it)
● Straightforward to express and iterate over various schedules

Algorithm:

    Var x, y;
    Func f;
    f(x, y) = x + y;

Implementations, Schedule #1:

    f.tile(x, y, xi, yi, 8, 8);


  33. What is Halide?

The same algorithm with a different schedule yields a different implementation.

Schedule #2:

    f.parallel(y);
    f.vectorize(x, 8);

  34. Why use Halide as a Front-End to Spatial?

● Separation of concerns
  ○ High-level transformations such as tiling and vectorization can happen in Halide
  ○ Lifts the hard work of transforming loop nests into Halide
  ○ The optimized code can then be lowered into Spatial
● Loop-based IR
  ○ Maps easily onto Spatial's front-end (see the example on the following slides)


  35. Halide IR

The algorithm and schedule below lower to the loop-based Halide IR that follows.

Algorithm and schedule:

    // Algorithm
    Var x, y;
    Func f;
    f(x, y) = x + y;

    // Schedule
    f.parallel(y);
    f.vectorize(x, 8);

    f.realize(32, 32);

Lowered Halide IR:

    produce f {
      let t6 = (f.extent.0 + f.min.0)
      let t7 = (f.min.1*f.stride.1)
      let t8 = max((f.extent.0/8), 0)
      let t3 = (t8 < ((f.extent.0 + 7)/8))
      let t2 = (0 - t7)
      let t5 = (((t6 - t7) - f.min.0) + -8)
      let t4 = (t6 + -8)
      parallel (f.s0.y, f.min.1, f.extent.1) {
        let t10 = ((f.s0.y*f.stride.1) + t2)
        let t9 = (f.min.0 + f.s0.y)
        for (f.s0.x.x, 0, t8) {
          f[ramp(((f.s0.x.x*8) + t10), 1, 8)] = ramp(((f.s0.x.x*8) + t9), 1, 8)
        }
        if (t3) {
          f[ramp(((f.s0.y*f.stride.1) + t5), 1, 8)] = ramp((f.s0.y + t4), 1, 8)
        }
      }
    }

  36. Example: Halide to Spatial

Halide source (algorithm plus a Spatial-targeting schedule):

    // Algorithm
    f(x, y) = x + y;
    g(x, y) = (f(x, y) + f(x, y+1))/2;

    // Schedule
    g.in().spatial();
    g.store_in(MemoryType::SRAM)
     .compute_at(g.in(), Var::outermost());
    g.tile(x, y, xo, yo, xi, yi, 4, 4);
    f.compute_root();
    f.in()
     .copy_to_device()
     .store_in(MemoryType::SRAM)
     .compute_at(g, xo);
    g.in().copy_to_host();
    wrapper.compile_to_spatial(...);

  37. Example: Halide to Spatial

Generated Spatial code for the schedule on slide 36:

    val g_wrapper = DRAM[Int](16, 16)
    Accel {
      val g = SRAM[Int](16, 16)
      Foreach(0 until 4 by 1) { yo =>
        Foreach(0 until 4 by 1) { xo =>
          val f_wrapper = SRAM[Int](4, 5)
          f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5)
          Foreach(0 until 4 by 1) { yi =>
            Foreach(0 until 4 by 1) { xi =>
              g(xo*4+xi, yo*4+yi) =
                (f_wrapper(xi,yi) + f_wrapper(xi,yi+1))/2
            }
          }
        }
      }
      g_wrapper store g
    }

  38. Example: Halide to Spatial

Same code as slide 37. Callout: the `Accel { ... }` block marks the region computed on the accelerator, driven by `g.in().spatial()` in the schedule.

  39. Example: Halide to Spatial

Same code as slide 37. Callout: `g.store_in(MemoryType::SRAM)` allocates SRAM to store 'g' on the accelerator (`val g = SRAM[Int](16, 16)`).

  40. Example: Halide to Spatial

Same code as slide 37. Callout: `g.tile(x, y, xo, yo, xi, yi, 4, 4)` tiles g, producing the `yo`/`xo` outer loops and the `yi`/`xi` inner loops.

  41. Example: Halide to Spatial

Same code as slide 37. Callout: `f.in().copy_to_device()` loads 'f' into the accelerator's memory (`f_wrapper load f(...)`).

  42. Example: Halide to Spatial

Same code as slide 37. Callout: `.compute_at(g, xo)` with `.store_in(MemoryType::SRAM)` performs the load at loop level 'xo' and stores it in SRAM (`val f_wrapper = SRAM[Int](4, 5)` inside the `xo` loop).

  43. Example: Halide to Spatial

Same code as slide 37. Callout: `g.in().copy_to_host()` stores 'g' back into the host's DRAM (`g_wrapper store g`).

  44. Conclusion

■ Reconfigurable architectures are becoming key for performance and energy efficiency
■ Current programming solutions for reconfigurables are still inadequate
■ We need to rethink high-level synthesis outside of the C box:
  ■ Memory hierarchy for optimization
  ■ Design parameters for tuning
  ■ Arbitrarily nestable pipelines
■ Spatial prototypes these language and compiler criteria:
  ■ Performance: average speedup of 2.9x versus SDAccel on the VU9P
  ■ Productivity: an average of 42% less code than SDAccel
  ■ Portability: transparent portability through internal support for automated design tuning (HyperMapper)

Spatial is open source: https://spatial-lang.org/

  50. Backup Slides

  51. The Team

David Koeplinger, Matt Feldman, Raghu Prabhakar, Yaqi Zhang, Luigi Nardi, Stefan Hadjis, Tian Zhao, Ruben Fiszel, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun

  52. Custom ASICs

■ Good for widely used, fixed specifications (like compression)
■ Expensive, with long design turnarounds for fast-developing fields like ML
[Chart: ML arXiv papers per year (0 to 20,000) and relative growth since 2009 (0 to 20x), 2009 through 2017; sources: Jeff Dean, Scaled ML 2018, and Kunle Olukotun, ISCA 2018]

  56. C + Pragmas Example

Add 512 integers originating from accelerator DRAM:

    void sum(int *mem) {
      mem[512] = 0;
      for (int i = 0; i < 512; i++) {
        mem[512] += mem[i];
      }
    }

Through a commercial HLS tool, this naive version runs in 27,236 clock cycles, about 100x too long.

  58. C + Pragmas Example

Add 512 integers originating from external DRAM, restructured to perform well under HLS:

    #define CHUNKSIZE (sizeof(MPort)/sizeof(int))  /* ints per wide DRAM word */
    #define LOOPCOUNT (512/CHUNKSIZE)

    void sum(MPort *mem) {
      MPort buff[LOOPCOUNT];                        /* local variable */
      memcpy(buff, mem, LOOPCOUNT * sizeof(MPort)); /* burst access */
      int sum = 0;
      for (int i = 0; i < LOOPCOUNT; i++) {         /* restructured loop */
    #pragma PIPELINE
        for (int j = 0; j < CHUNKSIZE; j++) {
    #pragma UNROLL
          /* bit-shift to extract individual elements from the wide word */
          sum += (int)(buff[i] >> (j * sizeof(int) * 8));
        }
      }
      mem[512] = sum;
    }

Runtime: 302 clock cycles.

  59. C + Pragmas Example

The same code, annotated: MPort is the width of the DRAM controller interface; the memcpy is a burst access into a local variable; the loop nest is restructured around the port width; PIPELINE and UNROLL are special compiler directives; and the bit shifting extracts individual elements from each wide word.
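For contrast with the restructured C above, the same 512-element sum written with the Spatial constructs from earlier in this talk. A sketch: the input is assumed to be a `DRAM[Int](512)` named `mem`, and the parallelization factor is arbitrary:

    val out = ArgOut[Int]
    Accel {
      val buff = SRAM[Int](512)
      buff load mem(0 :: 512)             // burst load: DRAM -> SRAM
      Reduce(out)(512 by 1 par 16){ i =>  // banked, parallel reduction
        buff(i)
      }{_ + _}
    }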

  60. Hardware Design Considerations

1. Finite physical compute and memory resources
2. Aggressive pipelining is required for performance
   ■ Maximize the useful execution time of compute resources
3. Disjoint memory space
   ■ No hardware-managed memory hierarchy
4. Huge design parameter spaces
   ■ Parameters are interdependent and change runtime by orders of magnitude
5. Others: pipeline timing, clocking, etc.

  66. Local Memory Analysis Example

    Foreach(N by 1){ r =>
      val a = SRAM[Float](D)
      val b = SRAM[Float](D)
      val c = SRAM[Float](D)
      Foreach(D par 2){ i =>        // two parallel writes to a
        a(i) = ...
      }
      Reduce(sum)(D par 2){ j =>    // two parallel random-access reads of a, addressed by b
        a(b(j))
      }{(a,b) => a + b}
      Foreach(D par 2){ k =>        // two parallel streaming reads of a, writes to c
        c(k) = a(k) * sum
      }
    }
