Building FPGA-Targeted Accelerators with HeteroCL



  1. Building FPGA-Targeted Accelerators with HeteroCL
     Zhiru Zhang, School of ECE, Cornell University (csl.cornell.edu/~zhiruz)
     In collaboration with Cornell: Yi-Hsiang Lai, Shaojie Xiang, Yuwei Hu; UCLA: Yuze Chi, Jie Wang, Cody Yu, Jason Cong
     TVM Workshop @ UW, 12/5/2019

  2. HeteroCL Overview
     ▸ A programming framework built with TVM for productive hardware specialization
       – Flexible: mixed declarative & imperative programming
       – Efficient: mapping to high-performance spatial architecture templates
       – Portable: clean decoupling of algorithm & hardware customizations
     ▸ HeteroCL sits between high-level DSLs and the hardware (CPUs and custom accelerators such as FPGAs), layering an algorithm spec (declarative + imperative) over compute, data type, and memory customizations
     github.com/cornell-zhang/heterocl
     Y.-H. Lai et al., "HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing," FPGA'2019 (Best Paper Award)

  3. Essential Techniques for Hardware Specialization
     ▸ Compute customization: parallelization, pipelining
     [Figure: a 2-D array of processing elements (PEs) performing 32-bit multiplies and adds]

  4. Essential Techniques for Hardware Specialization (cont.)
     ▸ Compute customization: parallelization, pipelining
     ▸ Data type customization: low-bitwidth integer, fixed point, ...
     [Figure: the same PE array with the 32-bit operators narrowed to 16-bit multiplies and adds]

  5. Essential Techniques for Hardware Specialization (cont.)
     ▸ Compute customization: parallelization, pipelining
     ▸ Data type customization: low-bitwidth integer, fixed point, ...
     ▸ Memory customization: banking, data reuse, streaming, ...
     [Figure: the accelerator's PE array fed by a loader/unloader from memory/storage through FIFOs and a scratchpad]

  6. FPGA as a Programmable Accelerator
     ▸ Massive amount of fine-grained parallelism
       – Highly parallel / deeply pipelined architecture
       – Distributed data/control dispatch
     ▸ Silicon configurable to fit the application
       – Compute at desired numerical accuracy
       – Customized memory hierarchy
     ▸ High performance/watt
       – Low clock speed
       – Pre-fabricated architecture blocks
     But FPGAs are really hard to PROGRAM.
     AWS F1 FPGA instance: Xilinx UltraScale+ VU9P (~2 million logic blocks, ~5000 DSP blocks, ~300 Mb block RAM) [Figure source: David Pellerin, AWS]

  7. Increasing Use of High-Level Synthesis (HLS)
     ▸ The same free-running 8-bit counter in HLS C vs. RTL Verilog:

HLS C:
```c
uint8 dut() {
  static uint8 c;
  c += 1;
}
```

RTL Verilog:
```verilog
module dut(rst, clk, q);
  input rst;
  input clk;
  output [7:0] q;
  reg [7:0] c;
  always @(posedge clk) begin
    if (rst == 1'b1) begin
      c <= 8'b00000000;
    end else begin
      c <= c + 1;
    end
  end
  assign q = c;
endmodule
```

     ▸ 3000+ HLS papers since 2012
     [Figure: number of HLS publications per year, 2012–2018]

  8. FPGA Programming with HLS
     ▸ Example: 3×3 convolution

The algorithm itself:
```c
for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x][y] += image[x+r][y+c] * kernel[r][c];
```

The optimized HLS code, with the customizations entangled into it:
```c
#pragma HLS array_partition variable=filter dim=0
hls::LineBuffer<3, N, ap_fixed<8,4> > buf;
hls::Window<3, 3, ap_fixed<8,4> > window;
for (int y = 0; y < N; y++) {
  for (int xo = 0; xo < N/M; xo++) {          // custom compute (loop tiling)
#pragma HLS pipeline II=1
    for (int xi = 0; xi < M; xi++) {
      int x = xo*M + xi;
      ap_fixed<8,4> acc = 0;                  // custom data type (quantization)
      ap_fixed<8,4> in = image[y][x];
      buf.shift_up(x);                        // custom memory (reuse buffers)
      buf.insert_top(in, x);
      window.shift_left();
      for (int r = 0; r < 2; r++)
        window.insert(buf.getval(r, x), r, 2);
      window.insert(in, 2, 2);
      if (y >= 2 && x >= 2) {
        for (int r = 0; r < 3; r++) {
          for (int c = 0; c < 3; c++) {
            acc += window.getval(r, c) * kernel[r][c];
          }
        }
        out[y-2][x-2] = acc;
      }
    }
  }
}
```

     ▸ Entangled hardware customization and algorithm:
       – Less portable
       – Less maintainable
       – Less productive
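The line-buffer / window-buffer structure that the HLS code above builds by hand can be modeled in plain Python. This is an illustrative sketch (the names `conv_naive` and `conv_linebuffer` and the size `N` are ours, not the slide's or the `hls::` library's): it streams pixels through a 3-row line buffer and a 3×3 window, and should produce the same valid-region output as the naive loop nest.

```python
N = 8  # illustrative image size

def conv_naive(image, kernel):
    """Reference 3x3 convolution (valid region only)."""
    out = [[0] * (N - 2) for _ in range(N - 2)]
    for y in range(N - 2):
        for x in range(N - 2):
            for r in range(3):
                for c in range(3):
                    out[y][x] += image[y + r][x + c] * kernel[r][c]
    return out

def conv_linebuffer(image, kernel):
    """Same convolution, streaming each pixel once through a
    3-row line buffer and a 3x3 window buffer."""
    buf = [[0] * N for _ in range(3)]      # line buffer: last 3 rows
    window = [[0] * 3 for _ in range(3)]   # 3x3 sliding window
    out = [[0] * (N - 2) for _ in range(N - 2)]
    for y in range(N):
        for x in range(N):
            pixel = image[y][x]
            # shift this column of the line buffer up, insert the new pixel
            for r in range(2):
                buf[r][x] = buf[r + 1][x]
            buf[2][x] = pixel
            # slide the window left, insert the fresh column from the buffer
            for r in range(3):
                window[r][0], window[r][1] = window[r][1], window[r][2]
                window[r][2] = buf[r][x]
            if y >= 2 and x >= 2:
                out[y - 2][x - 2] = sum(window[r][c] * kernel[r][c]
                                        for r in range(3) for c in range(3))
    return out
```

The point of the slide stands out in the sketch too: the reuse structure (line buffer, window shifts) dominates the code, while the actual convolution is the last three lines.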

  9. Decoupling Algorithm from Hardware Customizations
     ▸ HLS C: each algorithm variant carries its own entangled compute, data type, and memory customization schemes
     ▸ HeteroCL: the algorithm is written once, with compute, data type, and memory customizations fully decoupled from it
       + Clean abstraction capturing the interdependence among customizations

  10. Decoupled Compute Customization
      ▸ Declarative programming (TVM-based) for the algorithm:

```python
r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                         axis=[r, c]))
```

equivalent to the HLS loop nest:

```c
for (int y = 0; y < N; y++)
  for (int x = 0; x < N; x++)
    for (int r = 0; r < 3; r++)
      for (int c = 0; c < 3; c++)
        out[x][y] += image[x+r][y+c] * kernel[r][c];
```

      ▸ Decoupled customization (tile loop, reorder loops):

```python
s = hcl.create_schedule()
xo, xi = s[out].split(out.x, factor=M)
s[out].reorder(xi, xo, out.y)
```

which transforms the loop nest into:

```c
for (int xi = 0; xi < M; xi++)
  for (int xo = 0; xo < N/M; xo++)
    for (int y = 0; y < N; y++)
      for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
          out[xi+xo*M][y] += image[xi+xo*M+r][y+c] * kernel[r][c];
```

      ▸ Customization primitives are portable and less error-prone
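Why split + reorder is safe to apply as a schedule transformation can be seen in a pure-Python sketch: the tiled, reordered loops visit exactly the same set of iterations as the original nest, just in a different order. `N`, `M`, and the function names here are illustrative placeholders, not HeteroCL code.

```python
N, M = 8, 4  # illustrative sizes; M must divide N for this sketch

def original_order():
    """Iteration order of the untransformed (y, x) loop nest."""
    return [(x, y) for y in range(N) for x in range(N)]

def tiled_reordered():
    """x split into (xo, xi) with factor M, loops reordered to (xi, xo, y)."""
    return [(xo * M + xi, y)
            for xi in range(M)
            for xo in range(N // M)
            for y in range(N)]
```

Because the two orders cover the same iteration set once each, any loop body whose result is order-independent (like the convolution's accumulation) computes the same answer under both.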

  11. Decoupled Data Type Customization
      ▸ Bit-accurate data type support (e.g., Int(15), Fixed(7,4))
        – W.I.P.: custom floating-point types (e.g., bfloat16)
      ▸ Example formats:
        – 32-bit floating-point: sign 1b, exponent 8b, mantissa 23b
        – 16-bit brain floating-point (bfloat16): sign 1b, exponent 8b, mantissa 7b
        – 8-bit fixed-point Fixed(8, 6): integer 2b, fraction 6b
        – 2-bit integer Int(2)
      ▸ Decoupled customization primitives: downsize & quantize

```python
r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                         axis=[r, c]))

# quantize/downsize: sweep fixed-point widths
for i in range(2, 8):
    s = hcl.create_scheme()
    s.quantize([out], Fixed(i, i-2))
```
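The numeric effect of quantizing to a `Fixed(w, f)` type (total width `w`, `f` fractional bits, as in the slide's `Fixed(8, 6)` = 2 integer bits + 6 fraction bits) can be sketched in plain Python: round to a multiple of 2^-f and saturate to the signed range. This models what such a quantization does to values; it is not HeteroCL's implementation, and the function name is ours.

```python
def quantize_fixed(x, w, f):
    """Round x to a signed fixed-point value with w total bits,
    f of them fractional, saturating on overflow."""
    scale = 1 << f                 # 2**f quantization steps per unit
    lo = -(1 << (w - 1))           # smallest representable raw code
    hi = (1 << (w - 1)) - 1        # largest representable raw code
    raw = int(round(x * scale))    # round to the nearest step
    raw = max(lo, min(hi, raw))    # saturate
    return raw / scale
```

For `Fixed(8, 6)` this gives a representable range of [-2, 1.984375] in steps of 1/64, which matches the slide's "integer 2b, fraction 6b" layout.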

  12. Decoupled Memory Customization
      ▸ Inferring custom on-chip storage with the reuse_at() primitive:

```python
r = hcl.reduce_axis(0, 3)
c = hcl.reduce_axis(0, 3)
out = hcl.compute((N, N),
    lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                         axis=[r, c]))

s = hcl.create_schedule()
linebuf = s[image].reuse_at(out, out.y)   # infer a line buffer
```

      [Figure: a line buffer caching rows of image as out is produced]

  13. Decoupled Memory Customization (cont.)
      ▸ A second reuse_at(), applied to the inferred line buffer, yields a window buffer:

```python
s = hcl.create_schedule()
linebuf = s[image].reuse_at(out, out.y)    # line buffer
winbuf  = s[linebuf].reuse_at(out, out.x)  # window buffer
```

      [Figure: the line buffer feeding a 3×3 window buffer]

  14. Decoupled Memory Customization (cont.)
      ▸ The resulting design: image streams through the line buffer into the window buffer, which is convolved (⊗) with the kernel to produce out

```python
s = hcl.create_schedule()
linebuf = s[image].reuse_at(out, out.y)    # line buffer
winbuf  = s[linebuf].reuse_at(out, out.x)  # window buffer
```
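A rough way to see why the inferred reuse buffers pay off is to count reads of `image`. Without reuse, every output pixel of the 3×3 convolution re-reads its whole neighborhood; with the line and window buffers, each input pixel enters on-chip storage exactly once. The size `N` and the counting itself are an illustrative back-of-the-envelope model, not HeteroCL output.

```python
N = 64  # illustrative image size

# without reuse buffers: every valid output pixel reads a full 3x3 neighborhood
reads_naive = 9 * (N - 2) * (N - 2)

# with line + window buffers: each input pixel is fetched exactly once
reads_buffered = N * N
```

For N = 64 this is 34,596 reads versus 4,096, roughly the 9x reduction one expects from fully reusing a 3×3 window.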

  15. Decoupled Data Placement
      ▸ A unified interface for specifying data placement & movement:

```python
from heterocl import platform

@hcl.def_()
def conv(input, kernel):
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    return hcl.compute((N, N),
        lambda y, x: hcl.sum(input[x+r, y+c]*kernel[r, c],
                             axis=[r, c]))

out1 = conv(image, kernel1, "conv1")
out2 = conv(out1, kernel2, "conv2")

s = hcl.create_schedule()
p = platform.fpga_soc
f = hcl.build(p)
```

      [Figure: the host holding image, kernels, out1, out2, with conv1 and conv2 as compute units on the accelerator]

  16. Decoupled Data Placement (cont.)
      ▸ ... between host and accelerator, and streaming between compute units:

```python
s = hcl.create_schedule()
p = platform.fpga_soc
s.to([image, kernel1, kernel2], p.xcel)   # move the inputs to the accelerator
s.to(out2, p.host)                        # move the final result back to the host
```

      ▸ Compute placement is inferred automatically: conv1 and conv2 land on the accelerator, with out1 streamed between them instead of round-tripping through the host
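A software analogue of streaming one kernel's output into the next: if conv2 consumes conv1's results element by element, the intermediate is never materialized as a whole, which mirrors the effect of the data-placement primitives connecting two compute units on the accelerator. This sketch uses 1-D streams and 3-tap kernels purely for illustration; the function names are ours, not HeteroCL's.

```python
def conv1d_stream(samples, kernel):
    """Yield valid 3-tap convolution outputs as inputs arrive."""
    window = []
    for s in samples:
        window.append(s)
        if len(window) > 3:
            window.pop(0)          # keep only the last 3 samples
        if len(window) == 3:
            yield sum(w * k for w, k in zip(window, kernel))

def pipeline(samples, kernel1, kernel2):
    """Chain two streaming convolutions; the intermediate stream
    flows directly from the first into the second."""
    return list(conv1d_stream(conv1d_stream(samples, kernel1), kernel2))
```

Because `conv1d_stream` is a generator, `pipeline` holds at most two 3-element windows at any time, regardless of the input length; the dataflow shape is the same as the conv1 → conv2 chain on the slide.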
