slide-1
SLIDE 1

Spatial: A Language and Compiler
 for Application Accelerators

Raghu Prabhakar

Stanford University / SambaNova Systems

TVM Conference Dec 13, 2018

slide-2
SLIDE 2

The Future Is (Probably) Reconfigurable

[Chart: Energy Efficiency (MOPS/mW, 0.1 to 10,000) vs. Programmability. Dedicated ASICs sit at the high-efficiency, non-programmable end; reconfigurable CGRAs and FPGAs sit in the middle; instruction-based GPUs and CPUs sit at the more-programmable, lower-efficiency end.]

slide-3
SLIDE 3

The Future Is (Probably) Reconfigurable

[Same efficiency vs. programmability chart as Slide 2, annotated with deployed reconfigurable systems:]

25x perf/W vs. CPU: XPU (Hot Chips ’17)
287 MOps/mW: Brainwave (ISCA ’18)

slide-4
SLIDE 4

The Future Is (Probably) Reconfigurable

[Same efficiency vs. programmability chart as Slide 2, annotated with deployed reconfigurable systems:]

25x perf/W vs. CPU: XPU (Hot Chips ’17)
287 MOps/mW: Brainwave (ISCA ’18)
77x perf/W vs. FPGA: Plasticine (ISCA ’17)

slide-5
SLIDE 5

Key Question

How can we more productively target reconfigurable architectures like FPGAs?

3

slide-6
SLIDE 6

Key Question

How can we more productively target reconfigurable architectures like FPGAs?

Performance: fast and efficient designs
Productivity: fast and efficient programmers
Portability: target-generic solutions

slide-7
SLIDE 7

HDLs

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

slide-8
SLIDE 8

HDLs

Performance

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

✓ Arbitrary RTL

slide-9
SLIDE 9

HDLs

Performance Portability

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

✓ Arbitrary RTL
✘ Significant target-specific code

slide-10
SLIDE 10

HDLs

Performance Productivity Portability

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

✓ Arbitrary RTL
✘ No high-level abstractions
✘ Significant target-specific code

slide-11
SLIDE 11

C + Pragmas

5

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-12
SLIDE 12

C + Pragmas

Performance

5

✘ No memory hierarchy
✘ No arbitrary pipelining

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-13
SLIDE 13

C + Pragmas

Performance Portability

5

✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrary pipelining

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-14
SLIDE 14

C + Pragmas

Performance Productivity Portability

5

✓ Nested loops
✘ Difficult to optimize
✘ Ad-hoc mix of software/hardware
✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrary pipelining

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-15
SLIDE 15

Rethinking HLS

6

HDLs C + Pragmas

Improved HLS

slide-16
SLIDE 16

Rethinking HLS

Performance

6

✓ Memory hierarchy
✓ Arbitrary pipelining

HDLs C + Pragmas

Improved HLS

slide-17
SLIDE 17

Rethinking HLS

Performance Portability

6

✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic source across reconfigurable architectures

HDLs C + Pragmas

Improved HLS

slide-18
SLIDE 18

Rethinking HLS

Performance Productivity Portability

6

✓ Nested loops
✓ Automatic memory banking/buffering
✓ Implicit design parameters (unrolling, banking, etc.)
✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic source across reconfigurable architectures
✓ Automated design tuning

HDLs C + Pragmas

Improved HLS

slide-19
SLIDE 19

Introducing Spatial

■ Programming language to simplify configurable accelerator design
■ Constructs to express:
  ■ Hierarchical parallel and pipelined data paths
  ■ Explicit memory hierarchies
■ Simple APIs to manage CPU ↔ accelerator communication
■ Open source: https://spatial-lang.org/
■ Allows programmers to focus on the “interesting stuff”
■ Designed for performance-oriented programmers
■ More intuitive than CUDA: dataflow instead of threads

David Koeplinger et al, “Spatial: A Language And Compiler For Application Accelerators”, PLDI 2018

slide-20
SLIDE 20

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

slide-21
SLIDE 21

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image = DRAM[UInt8](H,W)

slide-22
SLIDE 22

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image  = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)

val fifo = FIFO[Float](D)
val lbuf = LineBuffer[Int](R,C)

slide-23
SLIDE 23

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image  = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)
val fifo   = FIFO[Float](D)
val lbuf   = LineBuffer[Int](R,C)

buffer load image(i, j::j+C)   // dense
buffer gather image(a)         // sparse

slide-24
SLIDE 24

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image  = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)
val fifo   = FIFO[Float](D)
val lbuf   = LineBuffer[Int](R,C)
val accum  = Reg[Double]
val pixels = RegFile[UInt8](R,C)

buffer load image(i, j::j+C)   // dense
buffer gather image(a)         // sparse
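The dense load and sparse gather above differ only in how addresses are generated; a small Python model may make the distinction concrete (the function names and data layout are illustrative, not Spatial API):

```python
def dense_load(image, i, j, C):
    # Models: buffer load image(i, j::j+C)
    # One contiguous burst: C consecutive elements of row i.
    return image[i][j:j+C]

def sparse_gather(memory, addresses):
    # Models: buffer gather image(a)
    # One element per address; addresses may be arbitrary.
    return [memory[a] for a in addresses]
```

On real hardware the dense form maps to DRAM bursts, while the gather issues one request per address, which is why Spatial exposes them as distinct operations.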

slide-25
SLIDE 25

Spatial: Control And Design Parameters

9

slide-26
SLIDE 26

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Spatial: Control And Design Parameters

9

slide-27
SLIDE 27

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Stream.Foreach(0 until N){i => … }

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Implicit/Explicit control schemes

(also optional, but can be used to override compiler)

Spatial: Control And Design Parameters

9

slide-28
SLIDE 28

val B = 64 (64 → 1024)
val buffer = SRAM[Float](B)
Foreach(N by B){i => … }

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Stream.Foreach(0 until N){i => … }

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Explicit size parameters for loop step size and buffer sizes

(informs compiler it can tune this value)

Implicit/Explicit control schemes

(also optional, but can be used to override compiler)

Spatial: Control And Design Parameters

9

slide-29
SLIDE 29

val B = 64 (64 → 1024)
val buffer = SRAM[Float](B)
Foreach(N by B){i => … }

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Stream.Foreach(0 until N){i => … }

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Explicit size parameters for loop step size and buffer sizes

(informs compiler it can tune this value)

Implicit/Explicit control schemes

(also optional, but can be used to override compiler)

Foreach(64 par 16){i =>
  buffer(i) // Parallel read
}

Implicit memory banking and buffering schemes for parallelized access

Spatial: Control And Design Parameters

9

slide-30
SLIDE 30

Dot Product in Spatial

val output  = ArgOut[Float]
val vectorA = DRAM[Float](N)
val vectorB = DRAM[Float](N)

Accel {
  Reduce(output)(N by B){ i =>
    val tileA = SRAM[Float](B)
    val tileB = SRAM[Float](B)
    val acc = Reg[Float]
    tileA load vectorA(i :: i+B)
    tileB load vectorB(i :: i+B)
    Reduce(acc)(B by 1){ j =>
      tileA(j) * tileB(j)
    }{(a, b) => a + b}
  }{(a, b) => a + b}
}

Off-chip memory declarations: vectorA and vectorB live in DRAM; output is an ArgOut register on the FPGA.

slide-31
SLIDE 31

Dot Product in Spatial

(Same dot product code as Slide 30.)

Explicit work division in IR

slide-32
SLIDE 32

Dot Product in Spatial

(Same dot product code as Slide 30.)

Tiled reduction (outer)

slide-33
SLIDE 33

Dot Product in Spatial

(Same dot product code as Slide 30.)

On-chip memory declarations: the tileA and tileB SRAMs (each double-buffered: (0)/(1)) and the acc register

slide-34
SLIDE 34

Dot Product in Spatial

(Same dot product code as Slide 30.)

DRAM ↔ SRAM transfers via load (Spatial also has store, scatter, and gather)

slide-35
SLIDE 35

Dot Product in Spatial

(Same dot product code as Slide 30.)

Tiled reduction (pipelined): Stage 1 loads tileA/tileB while Stage 2 computes × and + into acc

slide-36
SLIDE 36

Dot Product in Spatial

(Same dot product code as Slide 30.)

Outer reduce function: Stage 3 combines each tile’s partial sum into output

slide-37
SLIDE 37

Dot Product in Spatial

(Same dot product code as Slide 30.)

Implicit design parameters in this program: tile size (B), banking strategy, parallelism factors #1-#3, and a metapipelining toggle

slide-38
SLIDE 38

Dot Product in Spatial

(Same dot product code as Slide 30.)
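Functionally, the tiled reduction is just a dot product computed tile by tile; here is a Python model of the same loop structure (the tile size B is a tunable parameter, as in the Spatial version):

```python
def dot_product(vector_a, vector_b, B=64):
    # Outer Reduce: step through the vectors in tiles of size B.
    output = 0.0
    for i in range(0, len(vector_a), B):
        tile_a = vector_a[i:i+B]   # models: tileA load vectorA(i::i+B)
        tile_b = vector_b[i:i+B]   # models: tileB load vectorB(i::i+B)
        # Inner Reduce: accumulate one tile into acc.
        acc = 0.0
        for j in range(len(tile_a)):
            acc += tile_a[j] * tile_b[j]
        # Outer reduce function: combine per-tile partial sums.
        output += acc
    return output
```

On hardware, each level of this nest becomes a controller, and the per-tile partial sums are what the metapipeline overlaps with the next tile’s loads.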

slide-39
SLIDE 39

Dot Product in Spatial

Spatial Program

Design Parameters

25

slide-40
SLIDE 40

Spatial Program

The Spatial Compiler

26

slide-41
SLIDE 41

Spatial IR

The Spatial Compiler

26

slide-42
SLIDE 42

The Spatial Compiler

[Compiler diagram: passes over the Spatial IR and its Design Parameters]
• Access Pattern Analysis
• Control Inference
• Control Scheduling
• Mem. Banking/Buffering
• Pipeline Unrolling
• Pipeline Retiming
• [Optional] Design Tuning (with Area/Runtime Analysis)
• Host Resource Allocation
• Control Signal Inference
• Chisel Code Generation

Legend: IR Transformation, IR Analysis, Code Generation

slide-43
SLIDE 43

Control Scheduling

(Compiler pipeline diagram as on Slide 42, with Control Scheduling highlighted.)

■ Creates loop pipeline schedules
  • Detects data dependencies across loop iterations
  • Calculates the initiation interval of pipelines
  • Sets the maximum depth of buffers
■ Supports arbitrarily nested pipelines (commercial HLS tools don’t support this)

slide-44
SLIDE 44

Design Tuning

(Compiler pipeline diagram as on Slide 42; Design Tuning’s Area/Runtime Analysis feeds Modified Parameters back into the IR.)

slide-45
SLIDE 45

Design Space Parameters Example: vectorA ∙ vectorB

[Diagram: DRAM holding vectorA and vectorB feeds an FPGA datapath: a counter (ctr), tileA and tileB SRAMs, one × unit, one + unit, and the acc register. Legend: Control, Compute, Regs, SRAM.]

Small and simple, but slow!

slide-46
SLIDE 46

Important Parameters: Buffer Sizes

■ Increases length of DRAM accesses (runtime)
■ Increases exploited locality (runtime)
■ Increases local memory sizes (area)

[Datapath diagram as on Slide 45, with larger tileA/tileB buffers.]

slide-47
SLIDE 47

Important Parameters: Pipelining

■ Overlaps memory and compute (runtime)
■ Increases local memory sizes (area)
■ Adds synchronization logic (area)

[Diagram: a two-stage metapipeline. Stage 1 loads tileA/tileB into double buffers while Stage 2 computes × and + into acc. Legend: Control, Compute, Regs, SRAM, Double Buffer.]

Metapipelining requires buffering
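A rough cycle-count model shows why metapipelining pays for its buffers. This is an idealized sketch (fixed per-stage latencies, no memory contention), not the Spatial compiler’s actual analysis:

```python
def sequential_cycles(stage_cycles, n_iters):
    # No pipelining: each iteration runs load and compute back-to-back.
    return n_iters * sum(stage_cycles)

def metapipelined_cycles(stage_cycles, n_iters):
    # Double buffering lets stages overlap across iterations, so steady-state
    # throughput is set by the slowest stage (the initiation interval),
    # plus fill time for the earlier stages.
    ii = max(stage_cycles)
    return (len(stage_cycles) - 1) * ii + n_iters * ii
```

For example, with a 100-cycle tile load and a 50-cycle compute over 10 tiles, the sequential version takes 1,500 cycles and the metapipelined one 1,100: the extra SRAM buys overlap of memory and compute.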

slide-48
SLIDE 48

Important Parameters: Parallelization

■ Improves element throughput (runtime)
■ Duplicates compute resources (area)

[Diagram: the datapath from Slide 45 with parallel counters (ctr) and duplicated × and + units reducing into acc.]

slide-49
SLIDE 49

Important Parameters: Memory Banking

■ Improves memory bandwidth (runtime)
■ May duplicate memory resources (area)

[Diagram: the parallelized datapath from Slide 48, with tileA and tileB implemented as banked SRAMs.]

Parallelization requires banking
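The banking idea can be sketched with a cyclic scheme (element i lives in bank i mod N). This is a simplified model of one common strategy, not the Spatial compiler’s actual banking algorithm:

```python
def make_banks(data, n_banks):
    # Cyclic banking: element i lives in bank i % n_banks, offset i // n_banks.
    banks = [[] for _ in range(n_banks)]
    for i, value in enumerate(data):
        banks[i % n_banks].append(value)
    return banks

def parallel_read(banks, addresses):
    # One read port per bank per cycle: every address must hit a distinct bank.
    n = len(banks)
    if len({a % n for a in addresses}) != len(addresses):
        raise ValueError("bank conflict: two addresses map to the same bank")
    return [banks[a % n][a // n] for a in addresses]
```

A stride-1 parallel access of width 16 over 16 banks is conflict-free, which is the read pattern of the earlier `Foreach(64 par 16)` example; a stride of 16 would collide.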

slide-50
SLIDE 50

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

slide-51
SLIDE 51

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

Original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area/runtime of each point

slide-52
SLIDE 52

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

Original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area/runtime of each point

Proposed tuning method:
■ Active learning: HyperMapper (more details in the paper)

slide-53
SLIDE 53

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

Original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area/runtime of each point

Proposed tuning method:
■ Active learning: HyperMapper (more details in the paper)
■ Fast: no slow transformers in the loop
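The original random-sampling tuner is easy to model. A Python sketch follows; the runtime/area models and the budget are placeholders (the real compiler uses its own analytical estimators, and HyperMapper replaces random sampling with active learning):

```python
import random

def random_search(design_points, runtime_of, area_of, area_budget,
                  n_samples=1000, seed=0):
    # Sample candidate parameter settings at random, keep the fastest
    # design whose estimated area fits on the target device.
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        point = rng.choice(design_points)
        if area_of(point) > area_budget:
            continue  # pre-prune: infeasible designs are discarded
        if best is None or runtime_of(point) < runtime_of(best):
            best = point
    return best
```

Because each sample is only evaluated against cheap models (no IR transformations, no synthesis), sampling ~100,000 points stays fast.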


slide-56
SLIDE 56

The Spatial Compiler: The Rest

(Compiler pipeline diagram as on Slide 42.)

Code generation:
■ Synthesizable Chisel
■ C++ code for the host CPU

slide-57
SLIDE 57

Evaluation: Performance

■ FPGA:
  ■ Amazon EC2 F1 instance: Xilinx VU9P FPGA
  ■ Fixed clock rate of 150 MHz
■ Applications:
  ■ SDAccel: hand-optimized, tuned implementations
  ■ Spatial: hand-written, automatically tuned implementations
■ Execution time = FPGA execution time

slide-58
SLIDE 58

Performance (Spatial vs. SDAccel)

[Bar chart: speedup over SDAccel per benchmark, including BlackScholes, GEMM, PageRank, and TPC-H Q6: 8.5x, 1.4x, 1.6x, 1.4x, 3.5x, 14.1x, 1.3x]

Average: 2.9x faster hardware than SDAccel

slide-59
SLIDE 59

Productivity: Lines of Code

[Bar chart: lines of code, SDAccel vs. Spatial, for benchmarks including BlackScholes, GEMM, PageRank, and TPC-H Q6; per-benchmark reductions of 12%, 60%, 47%, 44%, 31%, 66%, and 35%]

Average: 42% shorter programs versus SDAccel

slide-60
SLIDE 60

Evaluation: Portability

■ FPGA 1:
  ■ Amazon EC2 F1 instance: Xilinx VU9P FPGA
  ■ 19.2 GB/s DRAM bandwidth (single channel)
■ FPGA 2:
  ■ Xilinx Zynq ZC706
  ■ 4.3 GB/s DRAM bandwidth
■ Applications:
  ■ Spatial: hand-written, automatically tuned implementations
■ Fixed clock rate of 150 MHz

slide-61
SLIDE 61

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Porting: speedup (VU9P / ZC706) only from moving to the larger FPGA

[Bar chart: porting speedups per benchmark, including BlackScholes, GEMM, PageRank, and TPC-H Q6: 2.5x, 1.2x, 2.5x, 2.5x, 1.3x, 2.5x, 4.6x]

VU9P / ZC706 resource ratios:
  DRAM bandwidth: 4.5x
  LUTs (general-purpose compute): 47.3x
  DSPs (integer FMA): 7.6x
  On-chip memory*: 4.0x
  (* No URAM used on VU9P)

slide-62
SLIDE 62

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Tuning: speedup only from tuning parameters for the larger FPGA

[Bar chart: tuning speedups per benchmark: 2.6x, 2.1x, 9.4x, 2.7x, 1.7x, 1.0x, 1.1x]

(VU9P / ZC706 resource ratios as on Slide 61.)

slide-63
SLIDE 63

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Product = Porting × Tuning

[Bar chart: combined speedups per benchmark: 6.5x, 2.5x, 23.4x, 6.8x, 2.2x, 2.5x, 5.0x]

(VU9P / ZC706 resource ratios as on Slide 61.)

slide-64
SLIDE 64

Portability: Plasticine CGRA

Identical Spatial source, multiple targets: even reconfigurable hardware that isn’t an FPGA!

Benchmark     | DRAM BW (%) Load / Store | Resource Util. (%) PCU / PMU / AG | Speedup vs. VU9P
BlackScholes  | 77.4 / 12.9              | 73.4 / 10.9 / 20.6                | 1.6
GDA           | 24.0 /  0.2              | 95.3 / 73.4 / 38.2                | 9.8
GEMM          | 20.5 /  2.1              | 96.8 / 64.1 / 11.7                | 55.0
K-Means       |  8.0 /  0.4              | 89.1 / 57.8 / 17.6                | 6.3
TPC-H Q6      | 97.2 /  0.0              | 29.7 / 37.5 / 70.6                | 1.6

Prabhakar et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA ‘17)

slide-65
SLIDE 65

Halide to Spatial

slide-66
SLIDE 66
  • DSL for computational photography
  • Separation between algorithm (what to compute) and schedule (how to compute)
  • Straightforward to express and iterate over various schedules

What is Halide?

Var x, y;
 Func f;
 f(x, y) = x + y;

Algorithm

f.tile(x,y,xi,yi,8,8);


Schedule #1

slide-67
SLIDE 67
  • DSL for computational photography
  • Separation between algorithm (what to compute) and schedule (how to compute)
  • Straightforward to express and iterate over various schedules

What is Halide?

Var x, y;
 Func f;
 f(x, y) = x + y;

Algorithm

f.parallel(y);
 f.vectorize(x, 8);

Schedule #2

slide-68
SLIDE 68

Why use Halide as Front-End to Spatial?

• Separation of concerns
  ○ High-level transformations (tiling, vectorization, etc.) can happen in Halide
  ○ Lift the hard work of transforming loop nests to Halide
  ○ Optimized code can be lowered into Spatial
• Loop-based IR
  ○ Easy mapping to the Spatial front-end

slide-69
SLIDE 69

Halide IR

// Algorithm
 Var x, y;
 Func f;
 f(x, y) = x + y;
 
 // Schedule
 f.parallel(y);
 f.vectorize(x, 8);
 
 f.realize(32, 32);

produce f {
  let t6 = (f.extent.0 + f.min.0)
  let t7 = (f.min.1*f.stride.1)
  let t8 = max((f.extent.0/8), 0)
  let t3 = (t8 < ((f.extent.0 + 7)/8))
  let t2 = (0 - t7)
  let t5 = (((t6 - t7) - f.min.0) + -8)
  let t4 = (t6 + -8)
  parallel (f.s0.y, f.min.1, f.extent.1) {
    let t10 = ((f.s0.y*f.stride.1) + t2)
    let t9 = (f.min.0 + f.s0.y)
    for (f.s0.x.x, 0, t8) {
      f[ramp(((f.s0.x.x*8) + t10), 1, 8)] = ramp(((f.s0.x.x*8) + t9), 1, 8)
    }
    if (t3) {
      f[ramp(((f.s0.y*f.stride.1) + t5), 1, 8)] = ramp((f.s0.y + t4), 1, 8)
    }
  }
}

slide-70
SLIDE 70

Example: Halide to Spatial

// Algorithm
f(x, y) = x + y;
g(x, y) = (f(x, y) + f(x, y+1))/2;

// Schedule
g.in().spatial();
g.store_in(MemoryType::SRAM)
 .compute_at(g.in(), Var::outermost());
g.tile(x, y, xo, yo, xi, yi, 4, 4);

f.compute_root();
f.in().copy_to_device()
      .store_in(MemoryType::SRAM)
      .compute_at(g, xo);
g.in().copy_to_host();
wrapper.compile_to_spatial(...);

slide-71
SLIDE 71

Example: Halide to Spatial

(Halide algorithm and schedule as on Slide 70.)

Generated Spatial:

val g_wrapper = DRAM[Int](16, 16);
Accel {
  val g = SRAM[Int](16, 16);
  Foreach(0 until 4 by 1) { yo =>
    Foreach(0 until 4 by 1) { xo =>
      val f_wrapper = SRAM[Int](4, 5);
      f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5);
      Foreach(0 until 4 by 1) { yi =>
        Foreach(0 until 4 by 1) { xi =>
          g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi) + f_wrapper(xi,yi+1))/2;
        }
      }
    }
  }
  g_wrapper store g;
}
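One can check that the tiled loop nest computes the same values as the Halide algorithm. A Python model follows; the 16x16 size and 4x4 tiles match the example, and the staging buffer mirrors f_wrapper, including its extra row in y for the f(x, y+1) access:

```python
def f(x, y):
    return x + y

def g_reference(W, H):
    # Halide algorithm: g(x, y) = (f(x, y) + f(x, y+1)) / 2 (integer division)
    return [[(f(x, y) + f(x, y + 1)) // 2 for x in range(W)] for y in range(H)]

def g_tiled(W=16, H=16, T=4):
    # Model of the generated loop nest: for each 4x4 output tile,
    # stage a 4x5 tile of f into a local buffer, then compute from it.
    g = [[0] * W for _ in range(H)]
    for yo in range(H // T):
        for xo in range(W // T):
            # models: f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5)
            f_wrapper = [[f(xo * T + xi, yo * T + yi) for yi in range(T + 1)]
                         for xi in range(T)]
            for yi in range(T):
                for xi in range(T):
                    g[yo * T + yi][xo * T + xi] = \
                        (f_wrapper[xi][yi] + f_wrapper[xi][yi + 1]) // 2
    return g
```

The extra column of the 4x5 staging buffer is why the load spans yo*4 to yo*4+5: the stencil reads one row beyond the output tile.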

slide-72
SLIDE 72

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Compute at Accelerator (g.in().spatial())

slide-73
SLIDE 73

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Allocate SRAM to store ‘g’ (g.store_in(MemoryType::SRAM))

slide-74
SLIDE 74

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Tile g (g.tile(x, y, xo, yo, xi, yi, 4, 4))

slide-75
SLIDE 75

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Load ‘f’ into the accelerator’s memory (f.in().copy_to_device())

slide-76
SLIDE 76

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Do the load at loop level ‘xo’ and store it in SRAM (.store_in(MemoryType::SRAM).compute_at(g, xo))

slide-77
SLIDE 77

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Store ‘g’ back into the host’s DRAM (g.in().copy_to_host())

slide-78
SLIDE 78

Conclusion

37


slide-83
SLIDE 83

Conclusion

■ Reconfigurable architectures are becoming key for performance / energy efficiency
■ Current programming solutions for reconfigurables are still inadequate
■ Need to rethink outside of the C box for high-level synthesis:
  • Memory hierarchy for optimization
  • Design parameters for tuning
  • Arbitrarily nestable pipelines
■ Spatial prototypes these language and compiler criteria:
  ■ Average speedup of 2.9x versus SDAccel on the VU9P
  ■ Average 42% less code than SDAccel
  ■ Achieves transparent portability through internal support for automated design tuning (HyperMapper)

Spatial is open source: https://spatial-lang.org/

Performance Productivity Portability

slide-84
SLIDE 84

Backup Slides

slide-85
SLIDE 85

The Team

Raghu Prabhakar Yaqi Zhang David Koeplinger Matt Feldman Tian Zhao Ardavan Pedram Christos Kozyrakis Kunle Olukotun Stefan Hadjis Ruben Fiszel Luigi Nardi

38

slide-86
SLIDE 86

Custom ASICs

slide-87
SLIDE 87

Custom ASICs

Good for widely used, fixed specifications (like compression)

slide-88
SLIDE 88

Custom ASICs

Good for widely used, fixed specifications (like compression)
Expensive, with long design turnarounds for rapidly developing fields like ML

slide-89
SLIDE 89

Custom ASICs

Good for widely used, fixed specifications (like compression)
Expensive, with long design turnarounds for rapidly developing fields like ML

Time

Jeff Dean, Scaled ML 2018 Kunle Olukotun, ISCA 2018

[Chart: relative number of ML arXiv papers per year, 2009-2017]

slide-90
SLIDE 90

C + Pragmas Example

Add 512 integers originating from accelerator DRAM

void sum(int* mem) {
  mem[512] = 0;
  for(int i = 0; i < 512; i++) {
    mem[512] += mem[i];
  }
}


slide-91
SLIDE 91

C + Pragmas Example

Add 512 integers originating from accelerator DRAM

void sum(int* mem) {
  mem[512] = 0;
  for(int i = 0; i < 512; i++) {
    mem[512] += mem[i];
  }
}


Commercial HLS Tool

Runtime: 27,236 clock cycles (100x too long!)
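As a rough software model (plain C, not HLS output), the naive kernel is just a serial accumulation; in the generated hardware each iteration issues a separate narrow DRAM access, which is why the cycle count balloons:

```c
/* Software model (not HLS code) of the naive kernel: one element is
   read and accumulated per loop iteration, mirroring the 512 serial,
   narrow DRAM accesses the tool generates. */
int sum_naive(const int *mem, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += mem[i];  /* serial: each add waits on the previous read */
    }
    return acc;
}
```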

slide-92
SLIDE 92

C + Pragmas Example

Add 512 integers originating from external DRAM


#define CHUNKSIZE (sizeof(MPort)/sizeof(int))
#define LOOPCOUNT (512/CHUNKSIZE)

void sum(MPort* mem) {
  MPort buff[LOOPCOUNT];
  memcpy(buff, mem, LOOPCOUNT * sizeof(MPort));
  int sum = 0;
  for(int i = 0; i < LOOPCOUNT; i++) {
    #pragma PIPELINE
    for(int j = 0; j < CHUNKSIZE; j++) {
      #pragma UNROLL
      sum += (int) (buff[i] >> j*sizeof(int)*8);
    }
  }
  mem[512] = sum;
}

Runtime: 302 clock cycles

slide-93
SLIDE 93

C + Pragmas Example

Add 512 integers originating from external DRAM


#define CHUNKSIZE (sizeof(MPort)/sizeof(int))
#define LOOPCOUNT (512/CHUNKSIZE)

void sum(MPort* mem) {
  MPort buff[LOOPCOUNT];
  memcpy(buff, mem, LOOPCOUNT * sizeof(MPort));
  int sum = 0;
  for(int i = 0; i < LOOPCOUNT; i++) {
    #pragma PIPELINE
    for(int j = 0; j < CHUNKSIZE; j++) {
      #pragma UNROLL
      sum += (int) (buff[i] >> j*sizeof(int)*8);
    }
  }
  mem[512] = sum;
}

Width of DRAM controller interface
Burst access
Use local variable
Special compiler directives
Loop restructuring
Bit shifting to extract individual elements
Special compiler directives

Runtime: 302 clock cycles
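The restructured version can likewise be modeled in plain C. Here MPort is modeled as a 64-bit word, which is an assumption (real DRAM interfaces are often wider): read wide words once, then unpack the packed 32-bit elements with shifts, exactly the "bit shifting to extract individual elements" pattern above.

```c
#include <stdint.h>

typedef uint64_t MPort;  /* assumption: model the DRAM port as 64 bits */
#define CHUNKSIZE ((int)(sizeof(MPort) / sizeof(int32_t)))  /* ints per word */

/* Software model of the burst + unpack pattern: each wide word carries
   CHUNKSIZE packed 32-bit integers, extracted by shifting. */
int32_t sum_burst(const MPort *mem, int nints) {
    int nwords = nints / CHUNKSIZE;
    int32_t acc = 0;
    for (int i = 0; i < nwords; i++) {
        for (int j = 0; j < CHUNKSIZE; j++) {
            acc += (int32_t)(mem[i] >> (j * 32));  /* element j of word i */
        }
    }
    return acc;
}
```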

slide-94
SLIDE 94

Hardware Design Considerations

slide-95
SLIDE 95

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
slide-96
SLIDE 96

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

slide-97
SLIDE 97

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

  • 3. Disjoint memory space

■ No hardware managed memory hierarchy

slide-98
SLIDE 98

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

  • 3. Disjoint memory space

■ No hardware managed memory hierarchy

  • 4. Huge design parameter spaces

■ Parameters are interdependent, change runtime by orders of magnitude

slide-99
SLIDE 99

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

  • 3. Disjoint memory space

■ No hardware managed memory hierarchy

  • 4. Huge design parameter spaces

■ Parameters are interdependent, change runtime by orders of magnitude

  • 5. Others… pipeline timing, clocking, etc.
slide-100
SLIDE 100

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}


slide-101
SLIDE 101

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}


slide-102
SLIDE 102

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}]

slide-103
SLIDE 103

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}]

slide-104
SLIDE 104

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}; read at 2k, 2k+1 by Foreach{k}]

slide-105
SLIDE 105

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}; read at 2k, 2k+1 by Foreach{k}; a write port plus a read port form 1 "instance" of a]

slide-106
SLIDE 106

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}; read at 2k, 2k+1 by Foreach{k}; a write port plus a read port form 1 "instance" of a]

slide-107
SLIDE 107

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-108
SLIDE 108

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-109
SLIDE 109

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: candidate instances of a for the Foreach{i} write and the Reduce{j} read]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-110
SLIDE 110

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: candidate instances of a for the Foreach{i} write and the Reduce{j} read]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-111
SLIDE 111

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 1 between the Foreach{i} write and the Reduce{j} read; this instance of a is buffered]

slide-112
SLIDE 112

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 1 between the Foreach{i} write and the Reduce{j} read; this instance of a is buffered (~4-8x memory)]
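The buffering rule behind these memory multipliers can be stated simply (a sketch of the general N-buffering idea, not the compiler's exact cost model): a memory whose reader sits d metapipeline stages after its writer needs d+1 rotating copies, so distance 1 is classic double buffering.

```c
/* N-buffering sketch: with metapipeline distance d between the write
   stage and the read stage, iteration k writes one copy while
   iteration k-d still reads its own, so d+1 copies suffice
   (d = 1 is double buffering). */
int buffer_depth(int metapipe_distance) {
    return metapipe_distance + 1;
}
```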

slide-113
SLIDE 113

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: the Foreach{i} write (2i, 2i+1) and the Foreach{k} read (2k, 2k+1)]

slide-114
SLIDE 114


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 2 between the Foreach{i} write (2i, 2i+1) and the Foreach{k} read (2k, 2k+1); buffered instance of a]

slide-115
SLIDE 115


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 2 between the Foreach{i} write (2i, 2i+1) and the Foreach{k} read (2k, 2k+1); buffered instance of a (~3-6x memory)]

slide-116
SLIDE 116


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: all three accesses on a: Foreach{i} write (2i, 2i+1), Reduce{j} read (b(2j), b(2j+1)), Foreach{k} read (2k, 2k+1); one instance of a per read]

slide-117
SLIDE 117


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: one instance of a per read across Foreach{i}, Reduce{j}, and Foreach{k} (~7-14x memory)]

slide-118
SLIDE 118


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: one instance of a per read across Foreach{i}, Reduce{j}, and Foreach{k}]

Step 2: Greedily combine (merge) instances

  • Don’t combine if there are port

conflicts

  • Don’t combine if the cost of merging is

greater than sum of unmerged **Recompute banking for merged instances!

(~7-14x memory)

slide-119
SLIDE 119

Local Memory Analysis

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: merged instances of a]

Step 2: Greedily combine (merge) instances

  • Don’t combine if there are bank

conflicts

  • Don’t combine if the cost of merging is

greater than sum of unmerged **Recompute banking for merged instances! a a a

slide-120
SLIDE 120

Local Memory Analysis

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: merged instances of a]

Step 2: Greedily combine (merge) instances

  • Don’t combine if there are bank

conflicts

  • Don’t combine if the cost of merging is

greater than sum of unmerged **Recompute banking for merged instances! a a a

(~5-10x memory) (40% less)
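The banking decision itself can be sketched in plain C (a toy model, not the compiler's actual analysis): with cyclic banking, bank(addr) = addr mod B, the two lanes of a par-2 access at addresses 2i and 2i+1 always fall in different banks, so both reads can be serviced in the same cycle.

```c
/* Toy cyclic-banking model: address -> addr % nbanks. */
int bank_of(int addr, int nbanks) {
    return addr % nbanks;
}

/* Check that lanes 2i and 2i+1 never collide over `trips` iterations;
   with nbanks = 2 they never can (one even, one odd address). */
int par2_conflict_free(int nbanks, int trips) {
    for (int i = 0; i < trips; i++) {
        if (bank_of(2 * i, nbanks) == bank_of(2 * i + 1, nbanks)) {
            return 0;  /* both lanes hit the same bank: conflict */
        }
    }
    return 1;
}
```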

slide-121
SLIDE 121

Kernel-Based Approach

Manually implement each DSL operation; use a simple compiler to stitch them together

slide-122
SLIDE 122

Kernel-Based Approach

Performance

Manually implement each DSL operation; use a simple compiler to stitch them together

Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering

slide-123
SLIDE 123

Kernel-Based Approach

Performance Productivity

High level specification: no hardware design knowledge required

Manually implement each DSL operation; use a simple compiler to stitch them together

Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering

slide-124
SLIDE 124

Kernel-Based Approach

Performance Productivity Portability

High level specification: no hardware design knowledge required
Reasonably target-generic if done right

Manually implement each DSL operation; use a simple compiler to stitch them together

Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering

slide-125
SLIDE 125

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Stochastic Gradient Descent in Spatial

slide-126
SLIDE 126

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types

Stochastic Gradient Descent in Spatial
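FixPt[TRUE,_9,_23] reads as a signed fixed-point type with 9 integer and 23 fractional bits (that bit-field reading of the type parameters is my interpretation). A minimal C model of such a Q-format type, using an int32 with 23 fraction bits:

```c
#include <stdint.h>

/* Minimal model of a signed fixed-point value with 23 fractional bits
   in a 32-bit word (an interpretation of FixPt[TRUE,_9,_23]). */
#define FRAC_BITS 23
typedef int32_t fix32;

fix32 fix_from_int(int v) { return (fix32)(v * (1 << FRAC_BITS)); }

fix32 fix_add(fix32 a, fix32 b) { return a + b; }

/* Multiply in 64 bits, then shift back down to FRAC_BITS fraction. */
fix32 fix_mul(fix32 a, fix32 b) {
    return (fix32)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}
```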

slide-127
SLIDE 127

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations

Stochastic Gradient Descent in Spatial

slide-128
SLIDE 128

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope

Stochastic Gradient Descent in Spatial

slide-129
SLIDE 129

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations

Stochastic Gradient Descent in Spatial

slide-130
SLIDE 130

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer

Stochastic Gradient Descent in Spatial

slide-131
SLIDE 131

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop

Stochastic Gradient Descent in Spatial

slide-132
SLIDE 132

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer

Stochastic Gradient Descent in Spatial

slide-133
SLIDE 133

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer
Debugging breakpoint

Stochastic Gradient Descent in Spatial

slide-134
SLIDE 134

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

slide-135
SLIDE 135

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y

slide-136
SLIDE 136

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y
Explicit memory transfer

slide-137
SLIDE 137

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y
Explicit memory transfer
Gradient computation

slide-138
SLIDE 138

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y
Explicit memory transfer
Gradient computation
Weight update
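The epoch body boils down to one SGD step. A plain-C model makes the dataflow explicit (floats instead of the fixed-point types, and A named here as the learning rate, are both simplifications; the yCache logic is omitted):

```c
/* Plain-C model of the epoch body's math: predict with a dot product
   (the Reduce), then update every weight by the scaled error (the
   Foreach). */
void sgd_step(float *w, const float *x, float y, float A, int D) {
    float yHat = 0.0f;
    for (int j = 0; j < D; j++) {
        yHat += w[j] * x[j];          /* Reduce: yHat = w . x */
    }
    float yErr = yHat - y;
    for (int i = 0; i < D; i++) {
        w[i] -= A * yErr * x[i];      /* Foreach: gradient update */
    }
}
```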

slide-139
SLIDE 139

SGD in Spatial: Hardware

[Diagram: generated hardware dataflow. DRAM arrays data, y, and weights connect to the FPGA; blocks shown include 15. Sequential.Foreach, 13. load into x, 27. if … else (yAddr, yCache, yPt), 37. load, 41. Reduce into yHat via × and + units, the subtraction producing yErr, 45. Foreach (×) updating wK, and 22. store back to weights.]