Building FPGA-Targeted Accelerators with HeteroCL


SLIDE 1

Building FPGA-Targeted Accelerators with HeteroCL

Zhiru Zhang

School of ECE, Cornell University csl.cornell.edu/~zhiruz

In collaboration with
  Cornell: Yi-Hsiang Lai, Shaojie Xiang, Yuwei Hu
  UCLA: Yuze Chi, Jie Wang, Cody Yu, Jason Cong

TVM Workshop @ UW, 12/5/2019

SLIDE 2

HeteroCL Overview

▸ A programming framework built with TVM for productive hardware specialization

– Flexible: mixed declarative & imperative programming
– Efficient: mapping to high-performance spatial architecture templates
– Portable: clean decoupling of algorithm & hardware customizations

[Figure: algorithm spec (declarative + imperative) enters HeteroCL, which applies compute, data type, and memory customization, targeting CPUs, custom Xcels (e.g., FPGAs), and processors + accelerators; high-level DSLs can sit on top]

github.com/cornell-zhang/heterocl

Y.-H. Lai, et al., HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, FPGA’2019 (Best Paper Award)

SLIDE 3

Essential Techniques for Hardware Specialization

Compute customization
  • Parallelization, pipelining, …

[Figure: a 32-bit multiply-add datapath mapped onto an array of PEs]

SLIDE 4

Essential Techniques for Hardware Specialization

Compute customization
  • Parallelization, pipelining, …
Data type customization
  • Low-bitwidth integer, fixed point, …

[Figure: the same PE array, now with 16-bit multiply-add datapaths]

SLIDE 5

Essential Techniques for Hardware Specialization

Compute customization
  • Parallelization, pipelining, …
Data type customization
  • Low-bitwidth integer, fixed point, …
Memory customization
  • Banking, data reuse, streaming, …

[Figure: accelerator with loader/unloader, FIFOs, and a scratchpad between memory/storage and the PE array]

SLIDE 6

FPGA as a Programmable Accelerator

▸ Massive amount of fine-grained parallelism

– Highly parallel / deeply pipelined architecture
– Distributed data/control dispatch

▸ Silicon configurable to fit the application

– Compute at the desired numerical accuracy
– Customized memory hierarchy

▸ High performance/watt

– Low clock speed
– Pre-fabricated architecture blocks

[Figure: AWS F1 FPGA instance, Xilinx UltraScale+ VU9P: ~2 million logic blocks, ~5,000 DSP blocks, ~300 Mb block RAM. Figure source: David Pellerin, AWS]

But FPGAs are really hard to PROGRAM

SLIDE 7

Increasing Use of High-Level Synthesis (HLS)

[Chart: number of HLS publications per year, 2012-2018; 3,000+ papers since 2012]

RTL Verilog:

module dut(rst, clk, q);
  input rst;
  input clk;
  output [7:0] q;
  reg [7:0] c;
  always @(posedge clk) begin
    if (rst == 1'b1)
      c <= 8'b00000000;
    else
      c <= c + 1;
  end
  assign q = c;
endmodule

vs. HLS C:

uint8 dut() {
  static uint8 c;
  c += 1;
  return c;
}

SLIDE 8

FPGA Programming with HLS

▸ Example: convolution

for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x, y] += image[x+r, y+c] * kernel[r, c]

After applying custom compute (loop tiling), custom data type (quantization), and custom memory (reuse buffers):

#pragma HLS array_partition variable=filter dim=0
hls::LineBuffer<3, N, ap_fixed<8,4> > buf;
hls::Window<3, 3, ap_fixed<8,4> > window;
for (int y = 0; y < N; y++) {
  for (int xo = 0; xo < N/M; xo++) {
    #pragma HLS pipeline II=1
    for (int xi = 0; xi < M; xi++) {
      int x = xo*M + xi;
      ap_fixed<8,4> acc = 0;
      ap_fixed<8,4> in = image[y][x];
      buf.shift_up(x);
      buf.insert_top(in, x);
      window.shift_left();
      for (int r = 0; r < 2; r++)
        window.insert(buf.getval(r, x), r, 2);
      window.insert(in, 2, 2);
      if (y >= 2 && x >= 2) {
        for (int r = 0; r < 3; r++) {
          for (int c = 0; c < 3; c++) {
            acc += window.getval(r, c) * kernel[r][c];
          }
        }
        out[y-2][x-2] = acc;
      }
}}}

Entangled hardware customization and algorithm:
  • Less portable
  • Less maintainable
  • Less productive

[Figure: three separate HLS programs, each entangling an algorithm with its own compute, data type, and memory customizations]
SLIDE 9

Decoupling Algorithm from Hardware Customizations

HLS C: entangled algorithm specification and customization schemes [1,2,3]
HeteroCL: fully decoupled customization schemes + a clean abstraction capturing their interdependence

[Figure: in HLS C, each of algorithms #1-3 carries its own compute, data type, and memory customization; in HeteroCL, algorithms #1-3 share fully decoupled customization schemes]

SLIDE 10

Decoupled Compute Customization

Algorithm (declarative programming, TVM based), in HeteroCL:

r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c], axis=[r, c]))

Equivalent HLS code (before customization):

for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x, y] += image[x+r, y+c] * kernel[r, c]

Decoupled customization:

s = hcl.create_schedule()
xo, xi = s[out].split(out.x, factor=M)   # tile loop
s[out].reorder(xi, xo, out.y)            # reorder loops

Resulting HLS loop nest:

for (int xi = 0; xi < M; xi++)
  for (int xo = 0; xo < N/M; xo++)
    for (int y = 0; y < N; y++)
      for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
          out[xi+xo*M, y] += image[xi+xo*M+r, y+c] * kernel[r, c]

Customization primitives are portable and less error-prone.
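To make the split + reorder transformation concrete, here is a minimal plain-Python sketch (not HeteroCL or its generated code; sizes N and M are made up) showing that tiling the x loop and reordering the nest changes only the iteration order, not the result:

```python
# Plain-Python sketch of what split(out.x, factor=M) + reorder(xi, xo, y)
# do to the convolution loop nest above.
N, M = 8, 4  # hypothetical sizes; M divides N

image = [[(i * 31 + j * 7) % 13 for j in range(N + 2)] for i in range(N + 2)]
kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

def conv_naive():
    out = [[0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            for r in range(3):
                for c in range(3):
                    out[x][y] += image[x + r][y + c] * kernel[r][c]
    return out

def conv_split_reorder():
    # split x into (xo, xi) with factor M, then reorder to (xi, xo, y)
    out = [[0] * N for _ in range(N)]
    for xi in range(M):
        for xo in range(N // M):
            for y in range(N):
                x = xo * M + xi
                for r in range(3):
                    for c in range(3):
                        out[x][y] += image[x + r][y + c] * kernel[r][c]
    return out

assert conv_naive() == conv_split_reorder()
```

The schedule reshapes the hardware (e.g., enabling pipelining over xi) while the functional result is untouched, which is the point of decoupling.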

SLIDE 11

Decoupled Data Type Customization

r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c], axis=[r, c]))

Quantize/downsize, sweeping the bitwidth without touching the algorithm:

for i in range(2, 8):
    s = hcl.create_scheme()
    s.quantize([out], Fixed(i, i-2))

[Figure: data type layouts - 32-bit floating-point (1b sign, 8b exponent, 23b mantissa); 16-bit brain floating-point, bfloat (1b sign, 8b exponent, 7b mantissa); 8-bit fixed-point Fixed(8, 6) (2b integer, 6b fraction); 2-bit integer Int(2)]

▸ Bit-accurate data type support (e.g., Int(15), Fixed(7,4))

– W.I.P.: custom floating-point types (e.g., bfloat16)

▸ Decoupled customization primitives: downsize & quantize
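As a numerical illustration, the sketch below models the arithmetic that a Fixed(w, f)-style type implies: f fractional bits inside a w-bit two's-complement word. This is a hypothetical helper, not HeteroCL's API or implementation:

```python
# Hypothetical model of Fixed(total_bits, frac_bits) quantization:
# round to the nearest multiple of 2^-frac_bits, saturating at the type's range.
def quantize_fixed(x, total_bits, frac_bits):
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))      # smallest raw (scaled integer) value
    hi = (1 << (total_bits - 1)) - 1   # largest raw value
    raw = round(x * scale)
    raw = max(lo, min(hi, raw))        # saturate
    return raw / scale

# Fixed(8, 6): 2 integer bits (incl. sign), 6 fraction bits -> step 1/64, range [-2, 2)
assert quantize_fixed(0.50, 8, 6) == 0.5
assert quantize_fixed(0.51, 8, 6) == 33 / 64   # nearest representable value
assert quantize_fixed(3.70, 8, 6) == 127 / 64  # saturates at the max
```

Sweeping Fixed(i, i-2) as in the slide trades range and precision against hardware cost, one line at a time.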

SLIDE 12

Decoupled Memory Customization

▸ Inferring custom on-chip storage with the reuse_at() primitive

r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c], axis=[r, c]))

s = hcl.create_schedule()
linebuf = s[image].reuse_at(out, out.y)

Equivalent loop nest:

for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x, y] += image[x+r, y+c] * kernel[r, c]

[Figure: a line buffer inferred over image, reused across iterations of out's y loop]

SLIDE 13

Decoupled Memory Customization

▸ Inferring custom on-chip storage with the reuse_at() primitive

r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c], axis=[r, c]))

s = hcl.create_schedule()
linebuf = s[image].reuse_at(out, out.y)
winbuf = s[linebuf].reuse_at(out, out.x)

Equivalent loop nest:

for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x, y] += image[x+r, y+c] * kernel[r, c]

[Figure: a window buffer inferred over the line buffer, reused across iterations of out's x loop]

SLIDE 14

Decoupled Memory Customization

▸ Inferring custom on-chip storage with the reuse_at() primitive

r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c], axis=[r, c]))

s = hcl.create_schedule()
linebuf = s[image].reuse_at(out, out.y)
winbuf = s[linebuf].reuse_at(out, out.x)

for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x, y] += image[x+r, y+c] * kernel[r, c]

[Figure: image feeds the line buffer, the line buffer feeds the window buffer, and the window is multiplied by the kernel to produce out]
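The reuse pattern that reuse_at() infers can be sketched in plain Python (this is a model of the line buffer + window buffer structure, not HeteroCL's generated hardware; sizes are made up): each image pixel is fetched once, shifted through a 3-row line buffer and a 3x3 sliding window, and the convolution still matches the naive version.

```python
# Plain-Python model of a 3-row line buffer plus 3x3 window buffer for a
# 3x3 convolution over an (N+2)x(N+2) image producing an NxN output.
N = 6
image = [[(i * 5 + j * 3) % 17 for j in range(N + 2)] for i in range(N + 2)]
kernel = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]

def conv_naive():
    out = [[0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            out[y][x] = sum(image[y + r][x + c] * kernel[r][c]
                            for r in range(3) for c in range(3))
    return out

def conv_linebuffer():
    out = [[0] * N for _ in range(N)]
    linebuf = [[0] * (N + 2) for _ in range(3)]   # holds the last 3 image rows
    for y in range(N + 2):
        linebuf = linebuf[1:] + [image[y]]        # shift rows up, insert new row
        if y < 2:
            continue                              # not enough rows buffered yet
        window = [[0] * 3 for _ in range(3)]      # 3x3 window slides along x
        for x in range(N + 2):
            for r in range(3):                    # shift window left, insert column
                window[r] = window[r][1:] + [linebuf[r][x]]
            if x >= 2:
                out[y - 2][x - 2] = sum(window[r][c] * kernel[r][c]
                                        for r in range(3) for c in range(3))
    return out

assert conv_naive() == conv_linebuffer()
```

The naive version reads each pixel up to 9 times from the image array; the buffered version reads it exactly once, which is what the inferred on-chip storage buys on the FPGA.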

SLIDE 15

Decoupled Data Placement

▸ A unified interface for specifying data placement & movement

from heterocl import platform

@hcl.def_()
def conv(input, kernel):
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    return hcl.compute((N, N),
        lambda y, x: hcl.sum(input[x+r, y+c]*kernel[r, c], axis=[r, c]))

out1 = conv(image, kernel1, "conv1")
out2 = conv(out1, kernel2, "conv2")

s = hcl.create_schedule()
p = platform.fpga_soc
f = hcl.build(p)

[Figure: host holds image, kernel1, kernel2, out1, out2; the Xcel has conv1 and conv2 compute units]

SLIDE 16

Decoupled Data Placement

▸ A unified interface for specifying data placement & movement between

– Host and accelerator

from heterocl import platform

@hcl.def_()
def conv(input, kernel):
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    return hcl.compute((N, N),
        lambda y, x: hcl.sum(input[x+r, y+c]*kernel[r, c], axis=[r, c]))

out1 = conv(image, kernel1, "conv1")
out2 = conv(out1, kernel2, "conv2")

s = hcl.create_schedule()
p = platform.fpga_soc
s.to([image, kernel1, kernel2], p.xcel)
s.to(out2, p.host)

Compute placement is inferred automatically.

[Figure: image, kernel1, and kernel2 stream from host to Xcel; out2 streams back to the host]

SLIDE 17

Decoupled Data Placement

▸ A unified interface for specifying data placement & movement between

– Host and accelerator
– Sub-modules in the accelerator

from heterocl import platform

@hcl.def_()
def conv(input, kernel):
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    return hcl.compute((N, N),
        lambda y, x: hcl.sum(input[x+r, y+c]*kernel[r, c], axis=[r, c]))

out1 = conv(image, kernel1, "conv1")
out2 = conv(out1, kernel2, "conv2")

s = hcl.create_schedule()
p = platform.fpga_soc
s.to([image, kernel1, kernel2], p.xcel)
s.to(out2, p.host)
s.to(out1, conv2)

[Figure: out1 now streams directly from conv1 to conv2 inside the Xcel instead of round-tripping through memory]
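Streaming an intermediate tensor between sub-modules can be sketched in software terms (this models the idea behind s.to(out1, conv2), not HeteroCL's implementation; the stage functions are placeholders): two stages connected by a bounded FIFO run concurrently instead of synchronizing through a shared buffer.

```python
# Two pipeline stages communicating through a bounded FIFO, modeled with
# queue.Queue and threads; None is used as an end-of-stream marker.
import queue
import threading

def stage(fn, inq, outq):
    while True:
        item = inq.get()
        if item is None:          # propagate end-of-stream
            outq.put(None)
            return
        outq.put(fn(item))

inq, midq, outq = queue.Queue(), queue.Queue(maxsize=4), queue.Queue()
# Placeholder stage functions standing in for conv1 and conv2:
conv1 = threading.Thread(target=stage, args=(lambda x: x + 1, inq, midq))
conv2 = threading.Thread(target=stage, args=(lambda x: x * 2, midq, outq))
conv1.start(); conv2.start()

for v in [1, 2, 3]:
    inq.put(v)
inq.put(None)

results = []
while (v := outq.get()) is not None:
    results.append(v)
conv1.join(); conv2.join()
assert results == [4, 6, 8]
```

The bounded middle queue plays the role of the on-chip stream: conv2 starts consuming as soon as conv1 produces, and out1 never touches off-chip memory.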

SLIDE 18

Currently Supported Customization Primitives

Compute customization
  C.split(i, v): split loop i of operation C into a two-level loop nest, with v as the factor of the inner loop.
  C.fuse(i, j): fuse two sub-loops i and j of operation C in the same loop nest into one.
  C.reorder(i, j): switch the order of sub-loops i and j of operation C in the same loop nest.
  P.compute_at(C, i): merge the computation of operation P into loop level i of operation C.
  C.unroll(i, v): unroll loop i of operation C by factor v.
  C.parallel(i): schedule loop i of operation C in parallel.
  C.pipeline(i, v): pipeline loop i of operation C with target initiation interval v.

Data type customization
  downsize(t, d): downsize a list of tensors t to type d.
  quantize(t, d): quantize a list of tensors t to type d.

Memory customization
  C.partition(i, v): partition dimension i of tensor C with a factor v.
  C.reshape(i, v): pack dimension i of tensor C into words with a factor v.
  memmap(t, m): map a list of tensors t with mode m (vertical or horizontal) to new tensors.
  P.reuse_at(C, i): create a reuse buffer storing the values of tensor P that are reused at dimension i of operation C.
  to(t, d, m): move a list of tensors t to device d with mode m.

Macros for spatial architecture templates
  C.stencil(): implement operation C with a stencil dataflow architecture using the SODA framework.
  C.systolic(): implement operation C with systolic arrays using the PolySA framework.

SLIDE 19

Targeting Spatial Architectures in HeteroCL

▸ The HeteroCL compiler generates highly efficient spatial architectures with:
  1. SODA for stencil code (i.e., data elements of a tensor accessed in a fixed pattern)
  2. PolySA for systolic arrays (i.e., a homogeneous array of locally connected compute units)

• SODA backend [J. Cong, et al., ICCAD'18]
  – Dataflow architecture that guarantees full data reuse without banking conflicts
• PolySA backend [J. Cong, et al., ICCAD'18]
  – Produces a variety of systolic arrays via polyhedral transformations
  – Incorporates additional architecture optimizations (banking, SIMD, latency hiding, etc.)

[Figure: twelve systolic array variants (CNN 1-12) obtained by mapping different loop dimensions (OUT_IMG_H, OUT_IMG_W, OUT_NUM, P, Q) onto the PE array; CNN 3 and CNN 8 match designs from prior work [7], [25]]
SLIDE 20

Imperative Programming in HeteroCL

▸ HeteroCL further provides an embedded imperative DSL

– Not all algorithms can be described in declarative tensor-style code

▸ Imperative & declarative programs share a unified interface for customization primitives

with hcl.for_(0, N) as y:
    with hcl.for_(0, N) as x:
        with hcl.for_(0, 3) as r:
            with hcl.for_(0, 3) as c:
                out[x, y] += image[x+r, y+c] * kernel[r, c]

s = hcl.create_schedule()
s[out].split(out.x, M)
linebuf = s[image].reuse_at(out, out.y)
s.quantize([out], Fixed(6, 4))

SLIDE 21

HeteroCL Compilation Flow

HeteroCL:

B = hcl.compute((10,), lambda x: A[x] + 2)
s = hcl.create_schedule()
s[B].unroll(B.axis[0])

Extended TVM/Halide IR:

produce B {
  unrolled (x, 0, 10) {
    B[x] = (A[x] + 2)
  }
}

Back ends:
  • LLVM code gen -> CPUs
  • HLS code gen -> HLS compilers (Falcon Merlin, Intel OpenCL, Xilinx Vivado HLS) -> FPGAs
  • Spatial architecture templates (SODA, PolySA, T2S)
  • On-going work: near-memory accelerators

[Figure: SODA stencil microarchitecture - FW (forwarding) modules implement FIFOs and distribute data; PE (compute) modules implement the kernel function; reuse buffers (RB) sit between them]
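The unroll primitive in the example above can be illustrated with a toy plain-Python sketch (assumed semantics, not the TVM/Halide IR itself): the loop body is replicated at compile time with the loop variable substituted by constants, which is what `unrolled (x, 0, 10)` denotes.

```python
# Rolled vs. fully unrolled versions of B = hcl.compute((10,), lambda x: A[x] + 2).
A = [3 * i for i in range(10)]

def compute_loop():
    B = [0] * 10
    for x in range(10):          # rolled loop, one body instance
        B[x] = A[x] + 2
    return B

def compute_unrolled():
    # "unrolled (x, 0, 10)": ten copies of the body, no loop control;
    # in hardware this becomes ten parallel adders.
    return [A[0] + 2, A[1] + 2, A[2] + 2, A[3] + 2, A[4] + 2,
            A[5] + 2, A[6] + 2, A[7] + 2, A[8] + 2, A[9] + 2]

assert compute_loop() == compute_unrolled()
```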

SLIDE 22

HeteroCL Evaluation on Cloud FPGAs (Amazon AWS F1)

Stencil back end (performance in GOPs as customizations are applied):

  Benchmark (Application)      | +stencil | +unroll | +quantize | Theoretical (limited by memory bandwidth)
  Seidel (image processing)    | 0.2      | 2.9     | 5.9       | 6.8
  Gaussian (image processing)  | 1.1      | 6.7     | 13.2      | 15.6
  Jacobi (linear algebra)      | 0.4      | 2.3     | 5.0       | 5.4

Systolic back end:

  Benchmark (Application) | Back end        | Data type | Performance (GOPs) | Speedup
  GEMM (linear algebra)   | CPU (Intel MKL) | float32   | 76.0               | 1.0
                          | FPGA (HeteroCL) | float32   | 245.9              | 3.2
                          | FPGA (HeteroCL) | fixed16   | 807.6              | 10.6
  LeNet (deep learning)   | CPU (TVM TOPI)  | float32   | 15.4               | 1.0
                          | FPGA (HeteroCL) | float32   | 79.8               | 5.2
                          | FPGA (HeteroCL) | fixed16   | 137.8              | 8.9

General back end:

  Benchmark (Application)                      | Speedup
  KNN Digit Recognition (image classification) | 12.5
  K-Means (machine learning)                   | 16.0
  Smith-Waterman (genomic sequencing)          | 20.9

Rapidly achieve good speedups for a rich set of applications.

SLIDE 23

Another Case Study: Binarized Neural Network (BNN)

▸ Design competition in Cornell ECE 5775 [1]

– 34 graduate & senior undergraduate students enrolled in a recent course offering
– Using HLS to accelerate a simple BNN on a Zynq FPGA
  • Higher speedup over the ARM CPU => higher score
  • Only two students achieved a speedup higher than 50x within 2 weeks

[Chart: histogram of # students vs. speedup, in buckets <10x, 10~20x, 20~30x, 30~40x, 40~50x, >50x]

[1] https://www.csl.cornell.edu/courses/ece5775/

[Figure: binarized convolution - binary input feature maps convolved (⨂) with binary filters produce partial sums, which are accumulated into the output feature maps]

SLIDE 24

Optimized BNN in HLS C

template<int M, int N, int I, int L>
void conv(ap_int<32> input[MAX_FMAP_PACK_SIZE], ap_int<32> output[MAX_FMAP_PACK_SIZE],
          const ap_int<8> threshold[MAX_FMAP], hls::LineBuffer<F, I, bit> buf[M]) {
  int O = I - F + 1, ifmap_size = I * I, ofmap_size = O * O;
  hls::Window<F, F, bit> window[M];
  for (int y = 0; y < O; y++) {
    for (int m = 0; m < M; m++) {
      #pragma HLS pipeline
      for (int x = 0; x < F - 1; x++) {
        int i_index = x + (y + F - 1) * I + m * ifmap_size;
        bit newBit = GET_BIT(input, i_index, PACK_WIDTH_LOG);
        fillBuffer<F, I>(window[m], buf[m], x, newBit);
      }
    }
    for (int x = 0; x < O; x++) {
      for (int m = 0; m < M; m++) {
        int i_index = x + F - 1 + (y + F - 1) * I + m * ifmap_size;
        bit newBit = GET_BIT(input, i_index, PACK_WIDTH_LOG);
        fillBuffer<F, I>(window[m], buf[m], x + F - 1, newBit);
      }
      for (int n = 0; n < N; n++) {
        #pragma HLS pipeline
        int sum = 0;
        int o_index = x + y * O + n * ofmap_size;
        for (int m = 0; m < M; m++) {
          int one_out = 0, mac_num = 0;
          for (int c = 0; c < F; c++) {
            for (int r = 0; r < F; r++) {
              if (if_mac(x + c, y + r, I)) {  // neglect padding pixels in mac
                int i_index = x + c + (y + r) * I + m * ifmap_size;
                int w_index = c + r * F + (n + m * N) * FILTER_SIZE;
                if (L == 0)
                  one_out += window[m].getval(r, c) == w_conv1[w_index];
                else
                  one_out += window[m].getval(r, c) == w_conv2[w_index];
                mac_num++;
              }
            }
          }
          sum += (one_out << 1) - mac_num;
        }
        SET_BIT(output, o_index, PACK_WIDTH_LOG, sum > threshold[o_index] ? 1 : 0);
      }
    }
  }
}

Applied customization techniques
  • Compute: tiling, pipelining, reordering
  • Data type: bit packing
  • Memory: partitioning, line buffer, window buffer
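The arithmetic trick behind the `sum += (one_out << 1) - mac_num` line above can be checked in a few lines of plain Python (a sketch, with made-up bit vectors): when {0,1} encodes {-1,+1}, the +/-1 dot product equals twice the number of matching bits minus the number of bits, i.e., an XNOR-popcount.

```python
# Reference +/-1 dot product vs. the XNOR-popcount identity used in the BNN.
def bin_dot_reference(a_bits, w_bits):
    to_pm1 = lambda b: 1 if b else -1          # decode {0,1} -> {-1,+1}
    return sum(to_pm1(a) * to_pm1(w) for a, w in zip(a_bits, w_bits))

def bin_dot_xnor(a_bits, w_bits):
    one_out = sum(a == w for a, w in zip(a_bits, w_bits))  # matching bits
    mac_num = len(a_bits)
    return (one_out << 1) - mac_num            # 2*matches - total

a = [1, 0, 1, 1, 0, 0, 1]
w = [1, 1, 0, 1, 0, 1, 1]
assert bin_dot_reference(a, w) == bin_dot_xnor(a, w)
```

This is why the binarized MAC reduces to bit comparisons and a popcount, which map very cheaply to FPGA LUTs.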

SLIDE 25

Optimized BNN in HeteroCL

▸ Development time: < 3 days
▸ Final speedup: 63x

rc = hcl.reduce_axis(0, in_fmaps)
ry = hcl.reduce_axis(0, F)
rx = hcl.reduce_axis(0, F)
C = hcl.compute((1, out_fmaps, O, O),
    lambda nn, ff, yy, xx: hcl.select(
        hcl.sum(A[nn,rc,yy+ry,xx+rx] * B[ff,rc,ry,rx], axis=[rc,ry,rx])
            > threshold[nn,ff,yy,xx], 1, 0),
    dtype=hcl.UInt(1))

s.quantize(C, hcl.UInt(32))
s[C].split(C.axis[1], factor=5)
s[C].unroll(C.axis[2], factor=5)
s[C].pipeline(C.axis[3])
lb = s[A].reuse_at(C, C.axis[0])
wb = s[lb].reuse_at(C, C.axis[1])

✓ More productive  ✓ More maintainable

SLIDE 26

Quick Aside: Decoupled Spatial Mapping with T2S (joint work with Intel Labs)

Algorithm:

Func C
C(i, j) = 0
C(i, j) += A(i, k) * B(k, j)

Spatial mapping (+ custom communication):

C.tile(i, j, k, ii, jj, kk, II, JJ, KK)
C.isolate_producer(A, A_feeder)
C.unroll(ii, jj)
A_feeder.isolate_producer(A, A_loader)
A_loader.remove(jj)
A_feeder.buffer(ii, DOUBLE)
C.forward(A_feeder, +jj)
A_feeder.unroll(ii).scatter(A, +ii)
C.isolate_consumer(C, C_drainer)
C_drainer.isolate_consumer_chain(C, C_collector, C_unloader)
C_drainer.unroll(ii).unroll(jj).gather(C, -ii)
C_collector.unroll(jj).gather(C, -jj)

• N. Srivastava, et al., T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations, FCCM'2019

SLIDE 27

High-Performance GEMM on FPGA using T2S

                        | Baseline   | T2S        | Ninja
  LOC                   | 70         | 20         | 750
  Systolic array size   | -          | 10 x 8     | 10 x 8
  Vector length         | 16 x float | 16 x float | 16 x float
  # Logic elements      | 131 K      | 214 K      | 230 K
  # DSPs                | 1,032      | 1,282      | 1,280
  # RAMs                | 1,534      | 1,384      | 1,069
  Frequency (MHz)       | 189        | 215        | 245
  Throughput (GFLOPS)   | 311        | 549        | 626

~90% of the performance of the ninja implementation with ~3% of the code

  • Baseline: NDRange-style OpenCL, tuned for Intel Arria 10
  • Ninja: handwritten industry design optimized by an expert
SLIDE 28

Accelerating Tensor Decomposition Kernels with T2S

▸ Tensor decomposition kernels

  MTTKRP: E(j, k) += B(j, l, m) * C(l, k) * D(m, k)
  TTM:    D(j, k, l) += B(j, k, m) * C(m, l)
  TTMc:   E(j, k, l) += B(j, m, n) * C(m, k) * D(n, l)

Evaluation on Arria-10 FPGA:

  Benchmark | LOC | Systolic array size | Logic usage | DSP usage | RAM usage | Frequency (MHz) | Throughput (GFLOPS)
  MTTKRP    | 28  | 8 x 9               | 53 %        | 81 %      | 56 %      | 204             | 700
  TTM       | 30  | 8 x 11              | 64 %        | 93 %      | 88 %      | 201             | 562
  TTMc      | 37  | 8 x 10              | 54 %        | 90 %      | 62 %      | 205             | 738

~80-90% DSP utilization and 560-740 GFLOPS on the FPGA
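To pin down the index pattern of the MTTKRP kernel above, here is a hedged plain-Python reference (made-up sizes and data; this is the mathematical kernel, not T2S output). The check exploits distributivity: the m-reduction can be factored out, which is the kind of reassociation a spatial mapping relies on.

```python
# Reference MTTKRP: E(j, k) += B(j, l, m) * C(l, k) * D(m, k)
J, K, L, M = 2, 3, 4, 5  # hypothetical dimensions
B = [[[(j + 2 * l + m) % 7 for m in range(M)] for l in range(L)] for j in range(J)]
C = [[(l + k) % 5 for k in range(K)] for l in range(L)]
D = [[(m * k + 1) % 3 for k in range(K)] for m in range(M)]

def mttkrp_naive():
    E = [[0] * K for _ in range(J)]
    for j in range(J):
        for k in range(K):
            for l in range(L):
                for m in range(M):
                    E[j][k] += B[j][l][m] * C[l][k] * D[m][k]
    return E

def mttkrp_factored():
    # Factor the inner m-reduction: sum_m B[j][l][m]*D[m][k], then scale by C[l][k]
    E = [[0] * K for _ in range(J)]
    for j in range(J):
        for k in range(K):
            for l in range(L):
                t = sum(B[j][l][m] * D[m][k] for m in range(M))
                E[j][k] += t * C[l][k]
    return E

assert mttkrp_naive() == mttkrp_factored()
```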

SLIDE 29

Concluding Remarks

▸ HeteroCL offers a new programming model for building FPGA-targeted accelerators

– Flexible: mixed declarative & imperative code
– Efficient: mapping to high-perf. spatial architectures
– Portable: decoupling of algo. & HW customizations

▸ Ongoing efforts

– High-throughput data streaming
– ML-assisted DSE
– Near-memory sparse compute

[Figure: HeteroCL overview - algorithm spec (declarative + imperative), compute/data type/memory customization, targeting CPUs and custom Xcels (e.g., FPGAs)]

SLIDE 30

github.com/cornell-zhang/heterocl

Thanks + Q&A