slide-1
SLIDE 1

Spatial: A Language and Compiler
 for Application Accelerators

Raghu Prabhakar

Stanford University / SambaNova Systems

TVM Conference Dec 13, 2018

slide-2
SLIDE 2

The Future Is (Probably) Reconfigurable

[Chart: Energy Efficiency (MOPS/mW, 0.1 to 10,000) vs. Programmability. Dedicated ASICs sit at the high-efficiency, non-programmable end; reconfigurable CGRAs and FPGAs sit in the middle; instruction-based GPUs and CPUs sit at the more-programmable, lower-efficiency end.]

slide-3
SLIDE 3

The Future Is (Probably) Reconfigurable

[Same efficiency vs. programmability chart as Slide 2, annotated with deployed reconfigurable systems:]

25x perf/W vs. CPU: XPU (Hot Chips ’17)
287 MOps/mW: Brainwave (ISCA ’18)

slide-4
SLIDE 4

The Future Is (Probably) Reconfigurable

[Same efficiency vs. programmability chart as Slide 2, annotated with deployed reconfigurable systems:]

25x perf/W vs. CPU: XPU (Hot Chips ’17)
287 MOps/mW: Brainwave (ISCA ’18)
77x perf/W vs. FPGA: Plasticine (ISCA ’17)

slide-5
SLIDE 5

Key Question

How can we more productively target reconfigurable architectures like FPGAs?

3

slide-6
SLIDE 6

Key Question

How can we more productively target reconfigurable architectures like FPGAs?

Performance: fast and efficient designs
Productivity: fast and efficient programmers
Portability: target-generic solutions

slide-7
SLIDE 7

HDLs

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

slide-8
SLIDE 8

HDLs

Performance

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

✓ Arbitrary RTL

slide-9
SLIDE 9

HDLs

Performance Portability

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

✓ Arbitrary RTL
✘ Significant target-specific code

slide-10
SLIDE 10

HDLs

Performance Productivity Portability

4

Hardware Description Languages (HDLs)

e.g. Verilog, VHDL, Chisel, Bluespec

✓ Arbitrary RTL
✘ No high-level abstractions
✘ Significant target-specific code

slide-11
SLIDE 11

C + Pragmas

5

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-12
SLIDE 12

C + Pragmas

Performance

5

✘ No memory hierarchy
✘ No arbitrary pipelining

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-13
SLIDE 13

C + Pragmas

Performance Portability

5

✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrary pipelining

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-14
SLIDE 14

C + Pragmas

Performance Productivity Portability

5

✓ Nested loops
✘ Difficult to optimize
✘ Ad-hoc mix of software/hardware
✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrary pipelining

Existing High Level Synthesis (C + Pragmas)

e.g. Vivado HLS, SDAccel, Altera OpenCL

HDLs

slide-15
SLIDE 15

Rethinking HLS

6

HDLs C + Pragmas

Improved HLS

slide-16
SLIDE 16

Rethinking HLS

Performance

6

✓ Memory hierarchy
✓ Arbitrary pipelining

HDLs C + Pragmas

Improved HLS

slide-17
SLIDE 17

Rethinking HLS

Performance Portability

6

✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic source across reconfigurable architectures

HDLs C + Pragmas

Improved HLS

slide-18
SLIDE 18

Rethinking HLS

Performance Productivity Portability

6

✓ Nested loops
✓ Automatic memory banking/buffering
✓ Implicit design parameters (unrolling, banking, etc.)
✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic source across reconfigurable architectures
✓ Automated design tuning

HDLs C + Pragmas

Improved HLS

slide-19
SLIDE 19

Introducing Spatial

■ Programming language to simplify configurable accelerator design
■ Constructs to express:
  ■ Hierarchical parallel and pipelined data paths
  ■ Explicit memory hierarchies
■ Simple APIs to manage CPU ↔ accelerator communication
■ Open source: https://spatial-lang.org/
■ Allows programmers to focus on the “interesting stuff”
■ Designed for performance-oriented programmers
■ More intuitive than CUDA: dataflow instead of threads

David Koeplinger et al, “Spatial: A Language And Compiler For Application Accelerators”, PLDI 2018

slide-20
SLIDE 20

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

slide-21
SLIDE 21

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image = DRAM[UInt8](H,W)

slide-22
SLIDE 22

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image  = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)

val fifo = FIFO[Float](D)
val lbuf = LineBuffer[Int](R,C)

slide-23
SLIDE 23

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image  = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)
val fifo   = FIFO[Float](D)
val lbuf   = LineBuffer[Int](R,C)

buffer load image(i, j::j+C)   // dense
buffer gather image(a)         // sparse

slide-24
SLIDE 24

Spatial: Memory Hierarchy

8

DDR DRAM (GB) → On-Chip SRAM (MB) → Local Regs (KB)

val image  = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)
val fifo   = FIFO[Float](D)
val lbuf   = LineBuffer[Int](R,C)
val accum  = Reg[Double]
val pixels = RegFile[UInt8](R,C)

buffer load image(i, j::j+C)   // dense
buffer gather image(a)         // sparse
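The dense load and sparse gather above differ only in how addresses are generated; a small Python model may make the distinction concrete (the function names and data layout are illustrative, not Spatial API):

```python
def dense_load(image, i, j, C):
    # Models: buffer load image(i, j::j+C)
    # One contiguous burst: C consecutive elements of row i.
    return image[i][j:j+C]

def sparse_gather(memory, addresses):
    # Models: buffer gather image(a)
    # One element per address; addresses may be arbitrary.
    return [memory[a] for a in addresses]
```

On real hardware the dense form maps to DRAM bursts, while the gather issues one request per address, which is why Spatial exposes them as distinct operations.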

slide-25
SLIDE 25

Spatial: Control And Design Parameters

9

slide-26
SLIDE 26

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Spatial: Control And Design Parameters

9

slide-27
SLIDE 27

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Stream.Foreach(0 until N){i => … }

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Implicit/Explicit control schemes

(also optional, but can be used to override compiler)

Spatial: Control And Design Parameters

9

slide-28
SLIDE 28

val B = 64 (64 → 1024)
val buffer = SRAM[Float](B)
Foreach(N by B){i => … }

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Stream.Foreach(0 until N){i => … }

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Explicit size parameters for loop step size and buffer sizes

(informs compiler it can tune this value)

Implicit/Explicit control schemes

(also optional, but can be used to override compiler)

Spatial: Control And Design Parameters

9

slide-29
SLIDE 29

val B = 64 (64 → 1024)
val buffer = SRAM[Float](B)
Foreach(N by B){i => … }

val P = 16 (1 → 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Stream.Foreach(0 until N){i => … }

Implicit/Explicit parallelization factors

(optional, but can be explicitly declared)

Explicit size parameters for loop step size and buffer sizes

(informs compiler it can tune this value)

Implicit/Explicit control schemes

(also optional, but can be used to override compiler)

Foreach(64 par 16){i =>
  buffer(i) // Parallel read
}

Implicit memory banking and buffering schemes for parallelized access

Spatial: Control And Design Parameters

9

slide-30
SLIDE 30

Dot Product in Spatial

val output  = ArgOut[Float]
val vectorA = DRAM[Float](N)
val vectorB = DRAM[Float](N)

Accel {
  Reduce(output)(N by B){ i =>
    val tileA = SRAM[Float](B)
    val tileB = SRAM[Float](B)
    val acc = Reg[Float]
    tileA load vectorA(i :: i+B)
    tileB load vectorB(i :: i+B)
    Reduce(acc)(B by 1){ j =>
      tileA(j) * tileB(j)
    }{(a, b) => a + b}
  }{(a, b) => a + b}
}

Off-chip memory declarations: vectorA and vectorB live in DRAM; output is an ArgOut register on the FPGA.

slide-31
SLIDE 31

Dot Product in Spatial

(Same dot product code as Slide 30.)

Explicit work division in IR

slide-32
SLIDE 32

Dot Product in Spatial

(Same dot product code as Slide 30.)

Tiled reduction (outer)

slide-33
SLIDE 33

Dot Product in Spatial

(Same dot product code as Slide 30.)

On-chip memory declarations: the tileA and tileB SRAMs (each double-buffered: (0)/(1)) and the acc register

slide-34
SLIDE 34

Dot Product in Spatial

(Same dot product code as Slide 30.)

DRAM ↔ SRAM transfers via load (Spatial also has store, scatter, and gather)

slide-35
SLIDE 35

Dot Product in Spatial

(Same dot product code as Slide 30.)

Tiled reduction (pipelined): Stage 1 loads tileA/tileB while Stage 2 computes × and + into acc

slide-36
SLIDE 36

Dot Product in Spatial

(Same dot product code as Slide 30.)

Outer reduce function: Stage 3 combines each tile’s partial sum into output

slide-37
SLIDE 37

Dot Product in Spatial

(Same dot product code as Slide 30.)

Implicit design parameters in this program: tile size (B), banking strategy, parallelism factors #1-#3, and a metapipelining toggle

slide-38
SLIDE 38

Dot Product in Spatial

(Same dot product code as Slide 30.)
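Functionally, the tiled reduction is just a dot product computed tile by tile; here is a Python model of the same loop structure (the tile size B is a tunable parameter, as in the Spatial version):

```python
def dot_product(vector_a, vector_b, B=64):
    # Outer Reduce: step through the vectors in tiles of size B.
    output = 0.0
    for i in range(0, len(vector_a), B):
        tile_a = vector_a[i:i+B]   # models: tileA load vectorA(i::i+B)
        tile_b = vector_b[i:i+B]   # models: tileB load vectorB(i::i+B)
        # Inner Reduce: accumulate one tile into acc.
        acc = 0.0
        for j in range(len(tile_a)):
            acc += tile_a[j] * tile_b[j]
        # Outer reduce function: combine per-tile partial sums.
        output += acc
    return output
```

On hardware, each level of this nest becomes a controller, and the per-tile partial sums are what the metapipeline overlaps with the next tile’s loads.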

slide-39
SLIDE 39

Dot Product in Spatial

Spatial Program

Design Parameters

25

slide-40
SLIDE 40

Spatial Program

The Spatial Compiler

26

slide-41
SLIDE 41

Spatial IR

The Spatial Compiler

26

slide-42
SLIDE 42

The Spatial Compiler

[Compiler diagram: passes over the Spatial IR and its Design Parameters]
• Access Pattern Analysis
• Control Inference
• Control Scheduling
• Mem. Banking/Buffering
• Pipeline Unrolling
• Pipeline Retiming
• [Optional] Design Tuning (with Area/Runtime Analysis)
• Host Resource Allocation
• Control Signal Inference
• Chisel Code Generation

Legend: IR Transformation, IR Analysis, Code Generation

slide-43
SLIDE 43

Control Scheduling

(Compiler pipeline diagram as on Slide 42, with Control Scheduling highlighted.)

■ Creates loop pipeline schedules
  • Detects data dependencies across loop iterations
  • Calculates the initiation interval of pipelines
  • Sets the maximum depth of buffers
■ Supports arbitrarily nested pipelines (commercial HLS tools don’t support this)

slide-44
SLIDE 44

Design Tuning

(Compiler pipeline diagram as on Slide 42; Design Tuning’s Area/Runtime Analysis feeds Modified Parameters back into the IR.)

slide-45
SLIDE 45

Design Space Parameters Example: vectorA ∙ vectorB

[Diagram: DRAM holding vectorA and vectorB feeds an FPGA datapath: a counter (ctr), tileA and tileB SRAMs, one × unit, one + unit, and the acc register. Legend: Control, Compute, Regs, SRAM.]

Small and simple, but slow!

slide-46
SLIDE 46

Important Parameters: Buffer Sizes

■ Increases length of DRAM accesses (runtime)
■ Increases exploited locality (runtime)
■ Increases local memory sizes (area)

[Datapath diagram as on Slide 45, with larger tileA/tileB buffers.]

slide-47
SLIDE 47

Important Parameters: Pipelining

■ Overlaps memory and compute (runtime)
■ Increases local memory sizes (area)
■ Adds synchronization logic (area)

[Diagram: a two-stage metapipeline. Stage 1 loads tileA/tileB into double buffers while Stage 2 computes × and + into acc. Legend: Control, Compute, Regs, SRAM, Double Buffer.]

Metapipelining requires buffering
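A rough cycle-count model shows why metapipelining pays for its buffers. This is an idealized sketch (fixed per-stage latencies, no memory contention), not the Spatial compiler’s actual analysis:

```python
def sequential_cycles(stage_cycles, n_iters):
    # No pipelining: each iteration runs load and compute back-to-back.
    return n_iters * sum(stage_cycles)

def metapipelined_cycles(stage_cycles, n_iters):
    # Double buffering lets stages overlap across iterations, so steady-state
    # throughput is set by the slowest stage (the initiation interval),
    # plus fill time for the earlier stages.
    ii = max(stage_cycles)
    return (len(stage_cycles) - 1) * ii + n_iters * ii
```

For example, with a 100-cycle tile load and a 50-cycle compute over 10 tiles, the sequential version takes 1,500 cycles and the metapipelined one 1,100: the extra SRAM buys overlap of memory and compute.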

slide-48
SLIDE 48

Important Parameters: Parallelization

■ Improves element throughput (runtime)
■ Duplicates compute resources (area)

[Diagram: the datapath from Slide 45 with parallel counters (ctr) and duplicated × and + units reducing into acc.]

slide-49
SLIDE 49

Important Parameters: Memory Banking

■ Improves memory bandwidth (runtime)
■ May duplicate memory resources (area)

[Diagram: the parallelized datapath from Slide 48, with tileA and tileB implemented as banked SRAMs.]

Parallelization requires banking
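The banking idea can be sketched with a cyclic scheme (element i lives in bank i mod N). This is a simplified model of one common strategy, not the Spatial compiler’s actual banking algorithm:

```python
def make_banks(data, n_banks):
    # Cyclic banking: element i lives in bank i % n_banks, offset i // n_banks.
    banks = [[] for _ in range(n_banks)]
    for i, value in enumerate(data):
        banks[i % n_banks].append(value)
    return banks

def parallel_read(banks, addresses):
    # One read port per bank per cycle: every address must hit a distinct bank.
    n = len(banks)
    if len({a % n for a in addresses}) != len(addresses):
        raise ValueError("bank conflict: two addresses map to the same bank")
    return [banks[a % n][a // n] for a in addresses]
```

A stride-1 parallel access of width 16 over 16 banks is conflict-free, which is the read pattern of the earlier `Foreach(64 par 16)` example; a stride of 16 would collide.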

slide-50
SLIDE 50

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

slide-51
SLIDE 51

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

Original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area/runtime of each point

slide-52
SLIDE 52

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

Original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area/runtime of each point

Proposed tuning method:
■ Active learning: HyperMapper (more details in the paper)

slide-53
SLIDE 53

Design Tuning

(Compiler pipeline diagram as on Slide 42.)

Original tuning method:
■ Pre-prune the space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model the area/runtime of each point

Proposed tuning method:
■ Active learning: HyperMapper (more details in the paper)
■ Fast: no slow transformers in the loop
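The original random-sampling tuner is easy to model. A Python sketch follows; the runtime/area models and the budget are placeholders (the real compiler uses its own analytical estimators, and HyperMapper replaces random sampling with active learning):

```python
import random

def random_search(design_points, runtime_of, area_of, area_budget,
                  n_samples=1000, seed=0):
    # Sample candidate parameter settings at random, keep the fastest
    # design whose estimated area fits on the target device.
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        point = rng.choice(design_points)
        if area_of(point) > area_budget:
            continue  # pre-prune: infeasible designs are discarded
        if best is None or runtime_of(point) < runtime_of(best):
            best = point
    return best
```

Because each sample is only evaluated against cheap models (no IR transformations, no synthesis), sampling ~100,000 points stays fast.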


slide-56
SLIDE 56

The Spatial Compiler: The Rest

(Compiler pipeline diagram as on Slide 42.)

Code generation:
■ Synthesizable Chisel
■ C++ code for the host CPU

slide-57
SLIDE 57

Evaluation: Performance

■ FPGA:
  ■ Amazon EC2 F1 instance: Xilinx VU9P FPGA
  ■ Fixed clock rate of 150 MHz
■ Applications:
  ■ SDAccel: hand-optimized, tuned implementations
  ■ Spatial: hand-written, automatically tuned implementations
■ Execution time = FPGA execution time

slide-58
SLIDE 58

Performance (Spatial vs. SDAccel)

[Bar chart: speedup over SDAccel per benchmark, including BlackScholes, GEMM, PageRank, and TPC-H Q6: 8.5x, 1.4x, 1.6x, 1.4x, 3.5x, 14.1x, 1.3x]

Average: 2.9x faster hardware than SDAccel

slide-59
SLIDE 59

Productivity: Lines of Code

[Bar chart: lines of code, SDAccel vs. Spatial, for benchmarks including BlackScholes, GEMM, PageRank, and TPC-H Q6; per-benchmark reductions of 12%, 60%, 47%, 44%, 31%, 66%, and 35%]

Average: 42% shorter programs versus SDAccel

slide-60
SLIDE 60

Evaluation: Portability

■ FPGA 1:
  ■ Amazon EC2 F1 instance: Xilinx VU9P FPGA
  ■ 19.2 GB/s DRAM bandwidth (single channel)
■ FPGA 2:
  ■ Xilinx Zynq ZC706
  ■ 4.3 GB/s DRAM bandwidth
■ Applications:
  ■ Spatial: hand-written, automatically tuned implementations
■ Fixed clock rate of 150 MHz

slide-61
SLIDE 61

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Porting: speedup (VU9P / ZC706) only from moving to the larger FPGA

[Bar chart: porting speedups per benchmark, including BlackScholes, GEMM, PageRank, and TPC-H Q6: 2.5x, 1.2x, 2.5x, 2.5x, 1.3x, 2.5x, 4.6x]

VU9P / ZC706 resource ratios:
  DRAM bandwidth: 4.5x
  LUTs (general-purpose compute): 47.3x
  DSPs (integer FMA): 7.6x
  On-chip memory*: 4.0x
  (* No URAM used on VU9P)

slide-62
SLIDE 62

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Tuning: speedup only from tuning parameters for the larger FPGA

[Bar chart: tuning speedups per benchmark: 2.6x, 2.1x, 9.4x, 2.7x, 1.7x, 1.0x, 1.1x]

(VU9P / ZC706 resource ratios as on Slide 61.)

slide-63
SLIDE 63

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Product = Porting × Tuning

[Bar chart: combined speedups per benchmark: 6.5x, 2.5x, 23.4x, 6.8x, 2.2x, 2.5x, 5.0x]

(VU9P / ZC706 resource ratios as on Slide 61.)

slide-64
SLIDE 64

Portability: Plasticine CGRA

Identical Spatial source, multiple targets: even reconfigurable hardware that isn’t an FPGA!

Benchmark     | DRAM BW (%) Load / Store | Resource Util. (%) PCU / PMU / AG | Speedup vs. VU9P
BlackScholes  | 77.4 / 12.9              | 73.4 / 10.9 / 20.6                | 1.6
GDA           | 24.0 /  0.2              | 95.3 / 73.4 / 38.2                | 9.8
GEMM          | 20.5 /  2.1              | 96.8 / 64.1 / 11.7                | 55.0
K-Means       |  8.0 /  0.4              | 89.1 / 57.8 / 17.6                | 6.3
TPC-H Q6      | 97.2 /  0.0              | 29.7 / 37.5 / 70.6                | 1.6

Prabhakar et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA ‘17)

slide-65
SLIDE 65

Halide to Spatial

slide-66
SLIDE 66
  • DSL for computational photography
  • Separation between algorithm (what to compute) and schedule (how to compute)
  • Straightforward to express and iterate over various schedules

What is Halide?

Var x, y;
 Func f;
 f(x, y) = x + y;

Algorithm

f.tile(x,y,xi,yi,8,8);


Schedule #1

slide-67
SLIDE 67
  • DSL for computational photography
  • Separation between algorithm (what to compute) and schedule (how to compute)
  • Straightforward to express and iterate over various schedules

What is Halide?

Var x, y;
 Func f;
 f(x, y) = x + y;

Algorithm

f.parallel(y);
 f.vectorize(x, 8);

Schedule #2

slide-68
SLIDE 68

Why use Halide as Front-End to Spatial?

• Separation of concerns
  ○ High-level transformations (tiling, vectorization, etc.) can happen in Halide
  ○ Lift the hard work of transforming loop nests to Halide
  ○ Optimized code can be lowered into Spatial
• Loop-based IR
  ○ Easy mapping to the Spatial front-end

slide-69
SLIDE 69

Halide IR

// Algorithm
 Var x, y;
 Func f;
 f(x, y) = x + y;
 
 // Schedule
 f.parallel(y);
 f.vectorize(x, 8);
 
 f.realize(32, 32);

produce f {
  let t6 = (f.extent.0 + f.min.0)
  let t7 = (f.min.1*f.stride.1)
  let t8 = max((f.extent.0/8), 0)
  let t3 = (t8 < ((f.extent.0 + 7)/8))
  let t2 = (0 - t7)
  let t5 = (((t6 - t7) - f.min.0) + -8)
  let t4 = (t6 + -8)
  parallel (f.s0.y, f.min.1, f.extent.1) {
    let t10 = ((f.s0.y*f.stride.1) + t2)
    let t9 = (f.min.0 + f.s0.y)
    for (f.s0.x.x, 0, t8) {
      f[ramp(((f.s0.x.x*8) + t10), 1, 8)] = ramp(((f.s0.x.x*8) + t9), 1, 8)
    }
    if (t3) {
      f[ramp(((f.s0.y*f.stride.1) + t5), 1, 8)] = ramp((f.s0.y + t4), 1, 8)
    }
  }
}

slide-70
SLIDE 70

Example: Halide to Spatial

// Algorithm
f(x, y) = x + y;
g(x, y) = (f(x, y) + f(x, y+1))/2;

// Schedule
g.in().spatial();
g.store_in(MemoryType::SRAM)
 .compute_at(g.in(), Var::outermost());
g.tile(x, y, xo, yo, xi, yi, 4, 4);

f.compute_root();
f.in().copy_to_device()
      .store_in(MemoryType::SRAM)
      .compute_at(g, xo);
g.in().copy_to_host();
wrapper.compile_to_spatial(...);

slide-71
SLIDE 71

Example: Halide to Spatial

(Halide algorithm and schedule as on Slide 70.)

Generated Spatial:

val g_wrapper = DRAM[Int](16, 16);
Accel {
  val g = SRAM[Int](16, 16);
  Foreach(0 until 4 by 1) { yo =>
    Foreach(0 until 4 by 1) { xo =>
      val f_wrapper = SRAM[Int](4, 5);
      f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5);
      Foreach(0 until 4 by 1) { yi =>
        Foreach(0 until 4 by 1) { xi =>
          g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi) + f_wrapper(xi,yi+1))/2;
        }
      }
    }
  }
  g_wrapper store g;
}
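One can check that the tiled loop nest computes the same values as the Halide algorithm. A Python model follows; the 16x16 size and 4x4 tiles match the example, and the staging buffer mirrors f_wrapper, including its extra row in y for the f(x, y+1) access:

```python
def f(x, y):
    return x + y

def g_reference(W, H):
    # Halide algorithm: g(x, y) = (f(x, y) + f(x, y+1)) / 2 (integer division)
    return [[(f(x, y) + f(x, y + 1)) // 2 for x in range(W)] for y in range(H)]

def g_tiled(W=16, H=16, T=4):
    # Model of the generated loop nest: for each 4x4 output tile,
    # stage a 4x5 tile of f into a local buffer, then compute from it.
    g = [[0] * W for _ in range(H)]
    for yo in range(H // T):
        for xo in range(W // T):
            # models: f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5)
            f_wrapper = [[f(xo * T + xi, yo * T + yi) for yi in range(T + 1)]
                         for xi in range(T)]
            for yi in range(T):
                for xi in range(T):
                    g[yo * T + yi][xo * T + xi] = \
                        (f_wrapper[xi][yi] + f_wrapper[xi][yi + 1]) // 2
    return g
```

The extra column of the 4x5 staging buffer is why the load spans yo*4 to yo*4+5: the stencil reads one row beyond the output tile.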

slide-72
SLIDE 72

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Compute at Accelerator (g.in().spatial())

slide-73
SLIDE 73

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Allocate SRAM to store ‘g’ (g.store_in(MemoryType::SRAM))

slide-74
SLIDE 74

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Tile g (g.tile(x, y, xo, yo, xi, yi, 4, 4))

slide-75
SLIDE 75

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Load ‘f’ into the accelerator’s memory (f.in().copy_to_device())

slide-76
SLIDE 76

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Do the load at loop level ‘xo’ and store it in SRAM (.store_in(MemoryType::SRAM).compute_at(g, xo))

slide-77
SLIDE 77

Example: Halide to Spatial

(Halide schedule and generated Spatial code as on Slides 70-71.)

Store ‘g’ back into the host’s DRAM (g.in().copy_to_host())

slide-78
SLIDE 78

Conclusion

37


slide-83
SLIDE 83

Conclusion

■ Reconfigurable architectures are becoming key for performance / energy efficiency
■ Current programming solutions for reconfigurables are still inadequate
■ Need to rethink outside of the C box for high-level synthesis:
  • Memory hierarchy for optimization
  • Design parameters for tuning
  • Arbitrarily nestable pipelines
■ Spatial prototypes these language and compiler criteria:
  ■ Average speedup of 2.9x versus SDAccel on the VU9P
  ■ Average 42% less code than SDAccel
  ■ Achieves transparent portability through internal support for automated design tuning (HyperMapper)

Spatial is open source: https://spatial-lang.org/

Performance Productivity Portability

slide-84
SLIDE 84

Backup Slides

slide-85
SLIDE 85

The Team

Raghu Prabhakar Yaqi Zhang David Koeplinger Matt Feldman Tian Zhao Ardavan Pedram Christos Kozyrakis Kunle Olukotun Stefan Hadjis Ruben Fiszel Luigi Nardi

38

slide-86
SLIDE 86

Custom ASICs

slide-87
SLIDE 87

Custom ASICs

Good for widely used, fixed specifications (like compression)

slide-88
SLIDE 88

Custom ASICs

Good for widely used, fixed specifications (like compression)
Expensive, with long design turnarounds for rapidly developing fields like ML

slide-89
SLIDE 89

Custom ASICs

Good for widely used, fixed specifications (like compression)
Expensive, with long design turnarounds for rapidly developing fields like ML

Time

Jeff Dean, Scaled ML 2018 Kunle Olukotun, ISCA 2018

[Chart: relative number of ML arXiv papers per year, 2009-2017]

slide-90
SLIDE 90

C + Pragmas Example

Add 512 integers originating from accelerator DRAM

void sum(int* mem) {
  mem[512] = 0;
  for(int i = 0; i < 512; i++) {
    mem[512] += mem[i];
  }
}


slide-91
SLIDE 91

C + Pragmas Example

Add 512 integers originating from accelerator DRAM

void sum(int* mem) {
  mem[512] = 0;
  for(int i = 0; i < 512; i++) {
    mem[512] += mem[i];
  }
}


Commercial HLS Tool

Runtime: 27,236 clock cycles (100x too long!)
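As a rough software model (plain C, not HLS output), the naive kernel is just a serial accumulation; in the generated hardware each iteration issues a separate narrow DRAM access, which is why the cycle count balloons:

```c
/* Software model (not HLS code) of the naive kernel: one element is
   read and accumulated per loop iteration, mirroring the 512 serial,
   narrow DRAM accesses the tool generates. */
int sum_naive(const int *mem, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += mem[i];  /* serial: each add waits on the previous read */
    }
    return acc;
}
```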

slide-92
SLIDE 92

C + Pragmas Example

Add 512 integers originating from external DRAM


#define CHUNKSIZE (sizeof(MPort)/sizeof(int))
#define LOOPCOUNT (512/CHUNKSIZE)

void sum(MPort* mem) {
  MPort buff[LOOPCOUNT];
  memcpy(buff, mem, LOOPCOUNT * sizeof(MPort));
  int sum = 0;
  for(int i = 0; i < LOOPCOUNT; i++) {
    #pragma PIPELINE
    for(int j = 0; j < CHUNKSIZE; j++) {
      #pragma UNROLL
      sum += (int) (buff[i] >> j*sizeof(int)*8);
    }
  }
  mem[512] = sum;
}

Runtime: 302 clock cycles

slide-93
SLIDE 93

C + Pragmas Example

Add 512 integers originating from external DRAM


#define CHUNKSIZE (sizeof(MPort)/sizeof(int))
#define LOOPCOUNT (512/CHUNKSIZE)

void sum(MPort* mem) {
  MPort buff[LOOPCOUNT];
  memcpy(buff, mem, LOOPCOUNT * sizeof(MPort));
  int sum = 0;
  for(int i = 0; i < LOOPCOUNT; i++) {
    #pragma PIPELINE
    for(int j = 0; j < CHUNKSIZE; j++) {
      #pragma UNROLL
      sum += (int) (buff[i] >> j*sizeof(int)*8);
    }
  }
  mem[512] = sum;
}

Width of DRAM controller interface
Burst access
Use local variable
Special compiler directives
Loop restructuring
Bit shifting to extract individual elements
Special compiler directives

Runtime: 302 clock cycles
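The restructured version can likewise be modeled in plain C. Here MPort is modeled as a 64-bit word, which is an assumption (real DRAM interfaces are often wider): read wide words once, then unpack the packed 32-bit elements with shifts, exactly the "bit shifting to extract individual elements" pattern above.

```c
#include <stdint.h>

typedef uint64_t MPort;  /* assumption: model the DRAM port as 64 bits */
#define CHUNKSIZE ((int)(sizeof(MPort) / sizeof(int32_t)))  /* ints per word */

/* Software model of the burst + unpack pattern: each wide word carries
   CHUNKSIZE packed 32-bit integers, extracted by shifting. */
int32_t sum_burst(const MPort *mem, int nints) {
    int nwords = nints / CHUNKSIZE;
    int32_t acc = 0;
    for (int i = 0; i < nwords; i++) {
        for (int j = 0; j < CHUNKSIZE; j++) {
            acc += (int32_t)(mem[i] >> (j * 32));  /* element j of word i */
        }
    }
    return acc;
}
```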

slide-94
SLIDE 94

Hardware Design Considerations

slide-95
SLIDE 95

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
slide-96
SLIDE 96

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

slide-97
SLIDE 97

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

  • 3. Disjoint memory space

■ No hardware managed memory hierarchy

slide-98
SLIDE 98

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

  • 3. Disjoint memory space

■ No hardware managed memory hierarchy

  • 4. Huge design parameter spaces

■ Parameters are interdependent, change runtime by orders of magnitude

slide-99
SLIDE 99

Hardware Design Considerations

  • 1. Finite physical compute and memory resources
  • 2. Requires aggressive pipelining for performance

■ Maximize useful execution time of compute resources

  • 3. Disjoint memory space

■ No hardware managed memory hierarchy

  • 4. Huge design parameter spaces

■ Parameters are interdependent, change runtime by orders of magnitude

  • 5. Others… pipeline timing, clocking, etc.
slide-100
SLIDE 100

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}


slide-101
SLIDE 101

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}


slide-102
SLIDE 102

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}]

slide-103
SLIDE 103

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}]

slide-104
SLIDE 104

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}; read at 2k, 2k+1 by Foreach{k}]

slide-105
SLIDE 105

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}; read at 2k, 2k+1 by Foreach{k}; a write port plus a read port form 1 "instance" of a]

slide-106
SLIDE 106

Local Memory Analysis Example

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}; read at 2k, 2k+1 by Foreach{k}; a write port plus a read port form 1 "instance" of a]

slide-107
SLIDE 107

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-108
SLIDE 108

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: a is written at 2i, 2i+1 by Foreach{i}; read at b(2j), b(2j+1) by Reduce{j}]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-109
SLIDE 109

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: candidate instances of a for the Foreach{i} write and the Reduce{j} read]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-110
SLIDE 110

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: candidate instances of a for the Foreach{i} write and the Reduce{j} read]

Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read

slide-111
SLIDE 111

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 1 between the Foreach{i} write and the Reduce{j} read; this instance of a is buffered]

slide-112
SLIDE 112

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 1 between the Foreach{i} write and the Reduce{j} read; this instance of a is buffered (~4-8x memory)]
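The buffering rule behind these memory multipliers can be stated simply (a sketch of the general N-buffering idea, not the compiler's exact cost model): a memory whose reader sits d metapipeline stages after its writer needs d+1 rotating copies, so distance 1 is classic double buffering.

```c
/* N-buffering sketch: with metapipeline distance d between the write
   stage and the read stage, iteration k writes one copy while
   iteration k-d still reads its own, so d+1 copies suffice
   (d = 1 is double buffering). */
int buffer_depth(int metapipe_distance) {
    return metapipe_distance + 1;
}
```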

slide-113
SLIDE 113

Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: the Foreach{i} write (2i, 2i+1) and the Foreach{k} read (2k, 2k+1)]

slide-114
SLIDE 114


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 2 between the Foreach{i} write (2i, 2i+1) and the Foreach{k} read (2k, 2k+1); buffered instance of a]

slide-115
SLIDE 115


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: Metapipeline Distance = 2 between the Foreach{i} write (2i, 2i+1) and the Foreach{k} read (2k, 2k+1); buffered instance of a (~3-6x memory)]

slide-116
SLIDE 116


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: all three accesses on a: Foreach{i} write (2i, 2i+1), Reduce{j} read (b(2j), b(2j+1)), Foreach{k} read (2k, 2k+1); one instance of a per read]

slide-117
SLIDE 117


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: one instance of a per read across Foreach{i}, Reduce{j}, and Foreach{k} (~7-14x memory)]

slide-118
SLIDE 118


Local Memory Analysis Example (Cont.)

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: one instance of a per read across Foreach{i}, Reduce{j}, and Foreach{k}]

Step 2: Greedily combine (merge) instances

  • Don’t combine if there are port

conflicts

  • Don’t combine if the cost of merging is

greater than sum of unmerged **Recompute banking for merged instances!

(~7-14x memory)

slide-119
SLIDE 119

Local Memory Analysis

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: merged instances of a]

Step 2: Greedily combine (merge) instances

  • Don’t combine if there are bank

conflicts

  • Don’t combine if the cost of merging is

greater than sum of unmerged **Recompute banking for merged instances! a a a

slide-120
SLIDE 120

Local Memory Analysis

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Diagram: merged instances of a]

Step 2: Greedily combine (merge) instances

  • Don’t combine if there are bank

conflicts

  • Don’t combine if the cost of merging is

greater than sum of unmerged **Recompute banking for merged instances! a a a

(~5-10x memory) (40% less)
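The banking decision itself can be sketched in plain C (a toy model, not the compiler's actual analysis): with cyclic banking, bank(addr) = addr mod B, the two lanes of a par-2 access at addresses 2i and 2i+1 always fall in different banks, so both reads can be serviced in the same cycle.

```c
/* Toy cyclic-banking model: address -> addr % nbanks. */
int bank_of(int addr, int nbanks) {
    return addr % nbanks;
}

/* Check that lanes 2i and 2i+1 never collide over `trips` iterations;
   with nbanks = 2 they never can (one even, one odd address). */
int par2_conflict_free(int nbanks, int trips) {
    for (int i = 0; i < trips; i++) {
        if (bank_of(2 * i, nbanks) == bank_of(2 * i + 1, nbanks)) {
            return 0;  /* both lanes hit the same bank: conflict */
        }
    }
    return 1;
}
```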

slide-121
SLIDE 121

Kernel-Based Approach

Manually implement each DSL operation; use a simple compiler to stitch them together

slide-122
SLIDE 122

Kernel-Based Approach

Performance

Manually implement each DSL operation; use a simple compiler to stitch them together

Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering

slide-123
SLIDE 123

Kernel-Based Approach

Performance Productivity

High level specification: no hardware design knowledge required

Manually implement each DSL operation; use a simple compiler to stitch them together

Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering

slide-124
SLIDE 124

Kernel-Based Approach

Performance Productivity Portability

High level specification: no hardware design knowledge required
Reasonably target-generic if done right

Manually implement each DSL operation; use a simple compiler to stitch them together

Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering

slide-125
SLIDE 125

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Stochastic Gradient Descent in Spatial

slide-126
SLIDE 126

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types

Stochastic Gradient Descent in Spatial
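FixPt[TRUE,_9,_23] reads as a signed fixed-point type with 9 integer and 23 fractional bits (that bit-field reading of the type parameters is my interpretation). A minimal C model of such a Q-format type, using an int32 with 23 fraction bits:

```c
#include <stdint.h>

/* Minimal model of a signed fixed-point value with 23 fractional bits
   in a 32-bit word (an interpretation of FixPt[TRUE,_9,_23]). */
#define FRAC_BITS 23
typedef int32_t fix32;

fix32 fix_from_int(int v) { return (fix32)(v * (1 << FRAC_BITS)); }

fix32 fix_add(fix32 a, fix32 b) { return a + b; }

/* Multiply in 64 bits, then shift back down to FRAC_BITS fraction. */
fix32 fix_mul(fix32 a, fix32 b) {
    return (fix32)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}
```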

slide-127
SLIDE 127

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations

Stochastic Gradient Descent in Spatial

slide-128
SLIDE 128

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope

Stochastic Gradient Descent in Spatial

slide-129
SLIDE 129

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations

Stochastic Gradient Descent in Spatial

slide-130
SLIDE 130

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer

Stochastic Gradient Descent in Spatial

slide-131
SLIDE 131

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop

Stochastic Gradient Descent in Spatial

slide-132
SLIDE 132

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer

Stochastic Gradient Descent in Spatial

slide-133
SLIDE 133

type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0::D) store wK
}

Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer
Debugging breakpoint

Stochastic Gradient Descent in Spatial

slide-134
SLIDE 134

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

slide-135
SLIDE 135

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y

slide-136
SLIDE 136

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y
Explicit memory transfer

slide-137
SLIDE 137

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y
Explicit memory transfer
Gradient computation

slide-138
SLIDE 138

SGD in Spatial

def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}

Custom caching for random access on y
Explicit memory transfer
Gradient computation
Weight update
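The epoch body boils down to one SGD step. A plain-C model makes the dataflow explicit (floats instead of the fixed-point types, and A named here as the learning rate, are both simplifications; the yCache logic is omitted):

```c
/* Plain-C model of the epoch body's math: predict with a dot product
   (the Reduce), then update every weight by the scaled error (the
   Foreach). */
void sgd_step(float *w, const float *x, float y, float A, int D) {
    float yHat = 0.0f;
    for (int j = 0; j < D; j++) {
        yHat += w[j] * x[j];          /* Reduce: yHat = w . x */
    }
    float yErr = yHat - y;
    for (int i = 0; i < D; i++) {
        w[i] -= A * yErr * x[i];      /* Foreach: gradient update */
    }
}
```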

slide-139
SLIDE 139

SGD in Spatial: Hardware

[Diagram: generated hardware dataflow. DRAM arrays data, y, and weights connect to the FPGA; blocks shown include 15. Sequential.Foreach, 13. load into x, 27. if … else (yAddr, yCache, yPt), 37. load, 41. Reduce into yHat via × and + units, the subtraction producing yErr, 45. Foreach (×) updating wK, and 22. store back to weights.]