Spatial: A Language and Compiler for Application Accelerators
David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun
PLDI, June 21, 2018
[Figure: energy breakdown of a CPU, showing the register file, L1 instruction and data caches, L2 cache, DRAM interface, and the arithmetic/logic, floating point, and control units. Annotations: a 32-bit ADD costs ~0.5 pJ, dwarfed by register file access, I-cache access, and other control overheads. Legend: control, compute, registers, SRAM. Panel title: Instruction-Based.]
Mark Horowitz, Computing's Energy Problem (and what we can do about it), ISSCC 2014
vectorA · vectorB, instruction-based (x86):

    mov  r8, rcx      ; pointer into vectorA
    add  r8, 8
    mov  r9, rdx      ; pointer into vectorB
    add  r9, 8
    mov  rcx, rax     ; loop count
    mov  rax, 0       ; accumulator
    .calc:
    mov  rbx, [r9]
    imul rbx, [r8]    ; multiply one pair of elements
    add  rax, rbx     ; accumulate
    add  r8, 8        ; advance pointers
    add  r9, 8
    loop .calc
Configuration-based

[Figure: side-by-side comparison. Instruction-based CPU* (register file, L1 instruction and data caches, L2 cache, DRAM, arithmetic/logic, floating point, and control units; *not to scale) versus configuration-based custom hardware* (DRAM feeding vectorA and vectorB buffers, a multiply-accumulate datapath (×, +) into an acc register, and a counter (ctr) with a small control block (ctrl); *also not to scale). Legend: control, compute, registers, SRAM.]
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 10,000) versus programmability (not programmable, less programmable, more programmable). Dedicated ASICs are the most efficient but not programmable; reconfigurable CGRAs and FPGAs are less programmable but far more efficient than instruction-based GPUs and CPUs. Examples: XPU, 25x perf/W vs. CPU (HotChips '17); Brainwave, 287 MOps/mW (ISCA '18).]
Goals: fast and efficient designs, fast and efficient programmers, and target-generic solutions.
x86
Domain Specificity Abstraction
Domain-Specific Multi-Domain General Purpose Higher Level Lower Level
Instruction-Based Architectures (CPUs)
Lower Level
Reconfigurable Architectures (FPGAs) Abstraction
VHDL Verilog
Netlist
MyHDL
Halide
“What?” “How?” “How?”
7
Existing options for programming FPGAs: HDLs (e.g. Verilog, VHDL, Chisel, Bluespec) and C + hardware pragmas via HLS tools (e.g. Vivado HLS, SDAccel, Altera OpenCL).
Requirements for an accelerator design language, and how well C + pragmas meets them:
- Enables analysis of access patterns
- Aids on-chip memory optimization and specialization
- Enables customized memory controllers based on access patterns
- Enables automatic design tuning in the compiler
- Exploits nested parallelism
[Figure: FPGA design for a tiled dot product. DRAM holds vectorA and vectorB; tiles are loaded into on-chip tileA and tileB SRAMs; a multiply-accumulate datapath (×, +) reduces into an acc register under counter (ctr) control. Legend: control, compute, registers, SRAM.]
Metapipelining requires buffering.

[Figure: the same design, metapipelined. Stage 1 loads the next tiles from DRAM into double buffers (tileA(0)/tileA(1), tileB(0)/tileB(1)) while Stage 2 computes the multiply-accumulate into acc. Legend: control, compute, registers, SRAM, double buffer.]
Parallelization requires banking.

[Figure: the same design, parallelized. Multiple multiply lanes (×) read tileA and tileB concurrently, so the tile SRAMs must be banked; an adder tree (+) combines the lane products into acc, all under counter (ctr) control. Legend: control, compute, registers, SRAM, banked SRAM.]
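A minimal Spatial sketch of the loop structure behind these figures (vecA and vecB are hypothetical DRAM[Float] vectors; app boilerplate is omitted, as on the slides). The comments note the hardware the compiler derives:

    Foreach(N by B){ i =>               // outer pipeline (metapipeline)
      val tileA = SRAM[Float](B)        // buffered across the two stages
      val tileB = SRAM[Float](B)        // below: double buffers in hardware
      val acc   = Reg[Float]
      tileA load vecA(i :: i+B)         // Stage 1: DRAM -> SRAM loads
      tileB load vecB(i :: i+B)
      Reduce(acc)(B by 1 par 4){ j =>   // Stage 2: par 4 creates 4 lanes, so
        tileA(j) * tileB(j)             // tileA/tileB are banked 4 ways; lane
      }{(a,b) => a + b}                 // products combine in an adder tree
    }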
[Figure: the abstraction chart again, now with HDLs and C + pragmas plotted for reconfigurable architectures (FPGAs); both still describe "how" rather than "what".]
[Figure: Spatial added to the chart as a higher-level option for reconfigurable architectures.]
Spatial memory templates and transfers:

    val image  = DRAM[UInt8](H,W)         // off-chip
    val buffer = SRAM[UInt8](C)           // on-chip scratchpad
    val fifo   = FIFO[Float](D)
    val lbuf   = LineBuffer[Int](R,C)
    val accum  = Reg[Double]
    val pixels = RegFile[UInt8](R,C)

    buffer load image(i, j::j+C)          // dense transfer
    buffer gather image(a)                // sparse transfer

Design parameters and parallelization:

    val B = 64 (64 → 1024)                // range informs the compiler it can tune this value
    val buffer = SRAM[Float](B)
    Foreach(N by B){i => ... }

    val P = 16 (1 → 32)
    Reduce(0)(N by 1 par P){i => data(i) }{(a,b) => a + b}

    Stream.Foreach(0 until N){i => ... }

Parallelization factors are optional: the compiler tunes them automatically, but they can be declared explicitly to override it.

    Foreach(64 par 16){i =>
      buffer(i)                           // parallel read
    }
Dot product in Spatial:

    val output  = ArgOut[Float]
    val vectorA = DRAM[Float](N)          // off-chip memory declarations
    val vectorB = DRAM[Float](N)
    Accel {
      Reduce(output)(N by B){ i =>        // tiled outer reduction
        val tileA = SRAM[Float](B)        // on-chip memory declarations
        val tileB = SRAM[Float](B)
        val acc   = Reg[Float]
        tileA load vectorA(i :: i+B)      // DRAM -> SRAM transfers (store,
        tileB load vectorB(i :: i+B)      // scatter, and gather also exist)
        Reduce(acc)(B by 1){ j =>         // pipelined inner reduction
          tileA(j) * tileB(j)
        }{(a,b) => a + b}
      }{(a,b) => a + b}                   // outer reduce function
    }
Walking through the code above:
- Off-chip memory declarations: vectorA and vectorB live in DRAM; output is a register visible to the host.
- The Accel block makes the work division explicit: everything inside it runs on the FPGA, everything outside on the host.
- Tiled reduction (outer): Reduce(output)(N by B) processes the vectors one B-element tile at a time.
- On-chip memory declarations: tileA, tileB, and acc become SRAM and register instances on the chip.
- load performs DRAM → SRAM transfers (store, scatter, and gather are also available).
- Tiled reduction (pipelined): the inner Reduce(acc)(B by 1) multiplies and accumulates within a tile.
- The outer reduce function combines per-tile results into output.

[Figure: the hardware generated for this program. Stage 1 loads double-buffered tileA(0)/tileA(1) and tileB(0)/tileB(1) from DRAM; Stage 2 computes × and + into acc; Stage 3 applies the outer reduce function (+) across tiles.]
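A hedged sketch of how this kernel sits inside a complete application (the @spatial boilerplate and the setMem/getArg host calls follow Spatial's app API as I understand it; exact signatures vary across Spatial versions, so treat this as an assumption):

    import spatial.dsl._

    @spatial object DotProduct extends SpatialApp {
      def main(args: Array[String]): Unit = {
        val N = 1024
        val B = 64
        val output  = ArgOut[Float]
        val vectorA = DRAM[Float](N)
        val vectorB = DRAM[Float](N)

        // Host: populate off-chip memory before launching the accelerator
        setMem(vectorA, Array.tabulate(N){ i => i.to[Float] })
        setMem(vectorB, Array.tabulate(N){ i => 1.to[Float] })

        Accel {
          // dot-product kernel from above
        }

        // Host: read the scalar result once the Accel block finishes
        println("dot = " + getArg(output))
      }
    }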
The Spatial compiler takes two inputs: the Spatial program and its design parameters.
Spatial compiler pipeline (each pass is an IR transformation, an IR analysis, or code generation):
1. Control inference
2. Control scheduling
3. Access pattern analysis (memory banking and buffering)
4. [Optional] Design tuning, driven by area/runtime analysis
5. Pipeline unrolling
6. Pipeline retiming
7. Host resource allocation
8. Control signal inference
9. Chisel code generation
Control scheduling:
- Creates loop pipeline schedules
- Detects data dependencies across loop iterations
- Calculates the initiation interval of each pipeline
- Sets the maximum depth of buffers
- Supports arbitrarily nested pipelines, which commercial HLS tools do not (see the sketch below)
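A small sketch of the nesting this enables (dataIn and sum are hypothetical; boilerplate omitted). Each controller is scheduled independently, and a loop-carried dependency is what forces a longer initiation interval:

    Accel {
      Foreach(N by B){ i =>                 // outer controller: metapipeline
        val tile = SRAM[Float](B)
        val sum  = Reg[Float]
        tile load dataIn(i :: i+B)          // stage 1: memory transfer
        Foreach(B by 1){ j =>               // stage 2: inner pipeline, II = 1
          tile(j) = tile(j) * 2             // (no dependency across iterations)
        }
        Reduce(sum)(B by 1){ j =>           // stage 3: the accumulation carries a
          tile(j)                           // dependency through sum, so its II is
        }{(a,b) => a + b}                   // bounded by the latency of the add
      }
    }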
Memory banking and buffering:
- Insight: a banking strategy can be determined from the accesses within a single loop nest
- Spatial's contribution: finds a (near) optimal banking across all of a memory's accesses
- Algorithm in a nutshell: [figure: the parallelized dot-product datapath, with tileA and tileB banked to serve the parallel × lanes feeding the + tree and acc]
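For intuition, a hedged sketch of the conclusion the analysis reaches for a simple affine access pattern (s and out are hypothetical SRAMs; boilerplate omitted):

    val s   = SRAM[Float](B)
    val out = SRAM[Float](B)
    Foreach(B by 1 par 4){ j =>
      // Unrolled 4 ways, one cycle issues reads s(j), s(j+1), s(j+2), s(j+3).
      // Cyclic banking with 4 banks (bank = address mod 4) places the four
      // concurrent reads in distinct banks, so no port duplication is needed.
      out(j) = s(j) * 2
    }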
[Optional] Design tuning (a feedback loop: tuning proposes modified parameters, and area/runtime analysis evaluates them):
- Pre-prune the space using simple heuristics
- Randomly sample ~100,000 design points (sketched below)
- Model the area and runtime of each point
- Active learning: HyperMapper
- Fast: no slow IR transformations inside the tuning loop
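A minimal sketch of this style of random-search tuning (plain Scala with stand-in area/runtime models and hypothetical parameter ranges; Spatial's real tuner and HyperMapper are more sophisticated):

    import scala.util.Random

    object TunerSketch extends App {
      // A stand-in design point: the real space has many more parameters
      case class Design(tileSize: Int, par: Int)

      // Stand-in analytical models; Spatial derives these from the IR
      def area(d: Design): Double    = d.tileSize * d.par * 1.5
      def runtime(d: Design): Double = 1e6 / (d.tileSize * d.par)

      val budget = 50000.0                  // hypothetical area budget
      val space = for {
        t <- Seq(64, 128, 256, 512, 1024)   // pre-pruned parameter ranges
        p <- Seq(1, 2, 4, 8, 16, 32)
      } yield Design(t, p)

      // Randomly sample the pruned space, keep the points that fit the
      // budget under the area model, and take the modeled-fastest design
      val samples = Seq.fill(100000){ space(Random.nextInt(space.length)) }
      val best = samples.filter(d => area(d) <= budget).minBy(d => runtime(d))
      println(s"best design: $best")
    }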
Code generation: synthesizable Chisel for the accelerator and C++ for the host CPU.
Evaluation versus SDAccel:
- Amazon EC2 F1 instance: Xilinx VU9P FPGA, fixed 150 MHz clock
- SDAccel: hand-optimized, hand-tuned implementations
- Spatial: hand-written, automatically tuned implementations
[Figure: SDAccel vs. Spatial results on BlackScholes, GDA, GEMM, K-Means, PageRank, Smith-Waterman, and TPC-H Q6 (y-axis 0 to 15).]
[Figure: SDAccel vs. Spatial per benchmark on the same suite (y-axis 0 to 250).]
Portability evaluation:
- Amazon EC2 F1 instance: Xilinx VU9P FPGA, 19.2 GB/s DRAM bandwidth (single channel)
- Xilinx Zynq ZC706: 4.3 GB/s DRAM bandwidth
- Spatial: hand-written, automatically tuned implementations
- Fixed 150 MHz clock on both
[Figure: per-benchmark results (BlackScholes, GDA, GEMM, K-Means, PageRank, Smith-Waterman, TPC-H Q6) when moving from the ZC706 to the VU9P (y-axis 0 to 25).]

VU9P / ZC706 resource ratios: DRAM bandwidth 4.5x; LUTs (general-purpose compute) 47.3x; DSPs (integer FMA) 7.6x; on-chip memory 4.0x (no URAM used on the VU9P).
Targeting the Plasticine CGRA:

    Benchmark     | DRAM BW (%)  | Resource Utilization (%) | Speedup
                  | Load | Store | PCU  | PMU  | AG       |
    BlackScholes  | 77.4 | 12.9  | 73.4 | 10.9 | 20.6     | 1.6
    GDA           | 24.0 |  0.2  | 95.3 | 73.4 | 38.2     | 9.8
    GEMM          | 20.5 |  2.1  | 96.8 | 64.1 | 11.7     | 55.0
    K-Means       |  8.0 |  0.4  | 89.1 | 57.8 | 17.6     | 6.3
    TPC-H Q6      | 97.2 |  0.0  | 29.7 | 37.5 | 70.6     | 1.6
Prabhakar et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA ‘17)
Conclusions:
- Reconfigurable architectures are becoming key to performance and energy efficiency
- Current programming solutions for reconfigurables are still inadequate
- High-level synthesis needs to think outside the C box:
  - A memory hierarchy for optimization
  - Design parameters for tuning
  - Arbitrarily nestable pipelines
- Spatial prototypes these language and compiler criteria:
  - Average speedup of 2.9x versus SDAccel on the VU9P
  - Average of 42% less code than SDAccel
  - Transparent portability through built-in automated design tuning (HyperMapper)
Performance Productivity Portability
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun, Stefan Hadjis, Ruben Fiszel, Luigi Nardi