HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing



1. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing
Yi-Hsiang Lai¹, Yuze Chi², Yuwei Hu¹, Jie Wang², Cody Hao Yu²,³, Yuan Zhou¹, Jason Cong², Zhiru Zhang¹
¹Cornell University  ²University of California, Los Angeles  ³Falcon Computing Solutions, Inc.

2. Essential Techniques for Hardware Acceleration
Compute customization: parallelization, pipelining, etc.
Data type customization: low-bitwidth integers, fixed-point types, etc.
Memory customization: banking, data reuse (e.g., FIFOs), etc.
There exists interdependence among the different customizations.

3. Hardware Customization in High-Level Synthesis
Driving example: a 3x3 convolution kernel.

The algorithm itself is a simple four-deep loop nest:

    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c]

The customized HLS code entangles that algorithm with compute customization (loop tiling), data type customization (quantization), and memory customization (reuse buffers):

    #pragma HLS array_partition variable=filter dim=0
    hls::LineBuffer<3, N, ap_fixed<8,4> > buf;     // custom memory (reuse buffers)
    hls::Window<3, 3, ap_fixed<8,4> > window;
    for (int y = 0; y < N; y++) {
      for (int xo = 0; xo < N/M; xo++) {           // custom compute (loop tiling)
    #pragma HLS pipeline II=1
        for (int xi = 0; xi < M; xi++) {
          int x = xo*M + xi;
          ap_fixed<8,4> acc = 0;                   // custom data type (quantization)
          ap_fixed<8,4> in = image[y][x];
          buf.shift_up(x);
          buf.insert_top(in, x);
          window.shift_left();
          for (int r = 0; r < 2; r++)
            window.insert(buf.getval(r, x), r, 2);
          window.insert(in, 2, 2);
          if (y >= 2 && x >= 2) {
            for (int r = 0; r < 3; r++)
              for (int c = 0; c < 3; c++)
                acc += window.getval(r, c) * kernel[r][c];
            out[y-2][x-2] = acc;
          }
    }}}

Entangling the hardware customization with the algorithm makes the code less portable, less maintainable, and less productive to write.
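The slide's index convention (out[x, y] accumulating image[x+r, y+c]) is easy to misread, so here is a plain NumPy reference of the same computation. The function name conv3x3 and the choice to size out to the valid (N-2)x(N-2) region, a boundary detail the slides elide, are ours.

    import numpy as np

    def conv3x3(image, kernel):
        # Reference semantics of the slide's loop nest:
        #   out[x, y] += image[x+r, y+c] * kernel[r, c]
        # out covers only the valid region so no index runs out of bounds.
        N = image.shape[0]
        out = np.zeros((N - 2, N - 2), dtype=image.dtype)
        for y in range(N - 2):
            for x in range(N - 2):
                for r in range(3):
                    for c in range(3):
                        out[x, y] += image[x + r, y + c] * kernel[r, c]
        return out

    result = conv3x3(np.random.rand(8, 8), np.random.rand(3, 3))  # 6x6 output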

4. Decoupling Algorithm from Hardware Customization
HLS C [1,2,3]: entangled algorithm specification and customization schemes; every algorithm variant carries its own compute, data type, and memory customization.
Halide, TVM, etc. [4,5,6,7,8]: decoupled temporal schedules; compute customization is separated from the algorithm, but data type and memory customization remain entangled with it.
HeteroCL: fully decoupled customization schemes; compute, data type, and memory customization are all separated from the algorithm, with a clean abstraction capturing the interdependence among them.

[1] Intel HLS. [2] Xilinx Vivado HLS. [3] Canis et al., FPGA'11. [4] Ragan-Kelley et al., SIGPLAN'13. [5] Baghdadi et al., arXiv'18. [6] Rong et al., arXiv'17. [7] Pu et al., TACO'17. [8] Chen et al., arXiv'18.

5. Decoupled Compute Customization
Algorithm, written once in HeteroCL's declarative style:

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                             axis=[r, c]))

Equivalent HLS code:

    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c]

Decoupled compute customization: tile a loop, then reorder the loops.

    s = hcl.create_schedule()
    xo, xi = s[out].split(out.x, factor=M)
    s[out].reorder(xi, xo, out.y)

Resulting HLS code:

    for (int xi = 0; xi < M; xi++)
      for (int xo = 0; xo < N/M; xo++)
        for (int y = 0; y < N; y++)
          for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
              out[xi+xo*M, y] += image[xi+xo*M+r, y+c] * kernel[r, c]

Customization primitives are more productive and less labor-intensive than rewriting the HLS code by hand.
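As a sanity check on what split and reorder do, here is a minimal pure-Python sketch (not the HeteroCL API) showing that the tiled, reordered loop nest computes exactly the same result as the original; the values of N and M and the random inputs are our own choices.

    import numpy as np

    N, M = 8, 4
    image = np.random.rand(N + 2, N + 2)
    kernel = np.random.rand(3, 3)

    # Original loop order: y, x, r, c.
    ref = np.zeros((N, N))
    for y in range(N):
        for x in range(N):
            for r in range(3):
                for c in range(3):
                    ref[x, y] += image[x + r, y + c] * kernel[r, c]

    # After split(out.x, factor=M) and reorder(xi, xo, out.y): the same
    # arithmetic in a different iteration order, so the result is identical.
    tiled = np.zeros((N, N))
    for xi in range(M):
        for xo in range(N // M):
            for y in range(N):
                for r in range(3):
                    for c in range(3):
                        x = xi + xo * M
                        tiled[x, y] += image[x + r, y + c] * kernel[r, c]

    assert np.allclose(ref, tiled)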

6. Decoupled Memory Customization
Primitives can be applied in a user-defined sequence. The starting point is the same convolution algorithm and its equivalent HLS loops:

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                             axis=[r, c]))

HLS equivalent:

    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c]

7. Decoupled Memory Customization
With the algorithm unchanged from above, the first reuse_at primitive inserts a line buffer that reuses image rows along out.y (a toy model of the buffer follows below):

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)

(Diagram: a line buffer holding three image rows sits between image and out.)
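A toy Python model of what this reuse_at materializes: a line buffer keeping the last three image rows on chip, so each pixel is fetched from external memory only once. The class and method names (LineBuffer, push_row, column) are our own illustrative names, not HeteroCL API.

    from collections import deque
    import numpy as np

    class LineBuffer:
        # Toy model of the buffer reuse_at creates along out.y: keep the
        # last 3 rows on chip so each image row is fetched only once.
        def __init__(self):
            self.rows = deque(maxlen=3)   # oldest row evicted automatically

        def push_row(self, row):
            self.rows.append(row)

        def column(self, x):
            # the 3x1 slice of pixels the window buffer consumes next
            return [row[x] for row in self.rows]

    buf = LineBuffer()
    image = np.arange(32).reshape(4, 8)
    for y in range(3):
        buf.push_row(image[y])
    print(buf.column(2))   # pixels (0,2), (1,2), (2,2): [2, 10, 18]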

8. Decoupled Memory Customization
A second reuse_at, applied to the line buffer itself, adds a window buffer along out.x:

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)
    winbuf = s[linebuf].reuse_at(out, out.x)

(Diagram: image feeds the line buffer, which feeds a 3x3 window buffer that supplies out.)

9. Decoupled Memory Customization
With both primitives applied, the generated datapath streams the image through a line buffer and a 3x3 window buffer, which feed the convolution (⨂) with the kernel:

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)
    winbuf = s[linebuf].reuse_at(out, out.x)

(Diagram: image → line buffer → window buffer → ⨂ kernel → out.)
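To make the payoff concrete, here is an illustrative Python sketch of the full reuse chain that counts off-chip accesses: with the buffers in place, every image pixel is read exactly once, and the 3x3 window is served entirely from on-chip state. The out[y-2][x-2] indexing mirrors the HLS code on slide 3; slicing the window straight out of the line buffer is our simplification of the separate window-buffer registers.

    import numpy as np

    def conv_with_reuse(image, kernel):
        # Convolution served by a 3-row line buffer plus a 3x3 window,
        # counting off-chip accesses: each pixel is read exactly once.
        N = image.shape[0]
        out = np.zeros((N - 2, N - 2))
        linebuf = np.zeros((3, N))                      # last three rows, on chip
        reads = 0
        for y in range(N):
            for x in range(N):
                linebuf[:, x] = np.roll(linebuf[:, x], -1)  # shift column up
                linebuf[2, x] = image[y, x]                 # the only off-chip read
                reads += 1
                if y >= 2 and x >= 2:
                    window = linebuf[:, x - 2:x + 1]        # 3x3 window buffer
                    out[y - 2, x - 2] = np.sum(window * kernel)
        assert reads == N * N                           # one fetch per pixel, total
        return out

    result = conv_with_reuse(np.arange(36.).reshape(6, 6), np.ones((3, 3)))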

10. Decoupled Data Type Customization
Bit-accurate data type support (e.g., Int(15), Fixed(7, 4)).
Decoupled customization primitives: downsize and quantize.

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                             axis=[r, c]))

    s = hcl.create_scheme()
    s.quantize([out], Fixed(6, 4))
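A minimal sketch of what a type like Fixed(6, 4) means, assuming (as the notation suggests) 6 total bits with 4 fractional bits, round-to-nearest, and saturation; HeteroCL's exact rounding and overflow behavior may differ, and quantize_fixed is our own helper.

    def quantize_fixed(value, total_bits, frac_bits):
        # Round to the nearest multiple of 2**-frac_bits, then saturate
        # to the signed range a total_bits-wide word can hold.
        scale = 1 << frac_bits
        lo = -(1 << (total_bits - 1)) / scale
        hi = ((1 << (total_bits - 1)) - 1) / scale
        q = round(value * scale) / scale
        return max(lo, min(hi, q))

    print(quantize_fixed(1.37, 6, 4))   # 1.375: nearest multiple of 1/16
    print(quantize_fixed(5.00, 6, 4))   # 1.9375: saturated at the maximum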

11. Decoupled Data Type Customization
Because quantization is decoupled from the algorithm, exploring the bitwidth is a simple loop:

    for i in range(2, 8):
        s = hcl.create_scheme()
        s.quantize([out], Fixed(i, i-2))

(Plot: accuracy (%) versus total bitwidth from 2 to 8 bits, showing the trade-off between accuracy and resource usage for a neural network.)
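The slide's sweep can be mimicked in plain Python with a toy quantizer to see the error side of the trade-off; the random data and the mean-absolute-error metric are ours, not the neural-network accuracy the slide plots.

    import numpy as np

    def quantize_fixed(v, total, frac):
        # Vectorized toy model of Fixed(total, frac): round, then saturate.
        scale = 1 << frac
        lo = -(1 << (total - 1)) / scale
        hi = ((1 << (total - 1)) - 1) / scale
        return np.clip(np.round(v * scale) / scale, lo, hi)

    rng = np.random.default_rng(0)
    data = rng.uniform(-1, 1, size=1000)
    for i in range(2, 8):                   # same sweep as the slide
        q = quantize_fixed(data, i, i - 2)  # Fixed(i, i-2)
        print(f"Fixed({i},{i - 2}): mean abs error = {np.abs(q - data).mean():.4f}")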

12. Currently Supported Customization Primitives
Compute customization
Data type customization
Memory customization
⨁ Macros for spatial architecture templates

13. Macro for Stencil with Dataflow Architecture
A stencil is a sliding window applied to a tensor, for applications where data elements are updated with fixed, local patterns.
Incorporates SODA [Y. Chi, et al., ICCAD'18]: scalable reuse buffers with the minimum buffer size that achieves the highest throughput.

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                             axis=[r, c]))

    s = hcl.create_schedule()
    s[out].stencil()

(Diagram: input streams pass through pipelined reuse buffers (RB) and forwarding modules (FW) into processing elements (PE). PE: compute module that implements the kernel function. FW: forwarding module that implements a FIFO and distributes data.)
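A toy 1D version of this dataflow structure, with a deque playing the FIFO/forwarding role and one PE firing once its 3-element window is full. The function name and weights are illustrative, and this collapsed single-FIFO model is not SODA's generated architecture.

    from collections import deque

    def stencil_1d(stream, weights=(0.25, 0.5, 0.25)):
        # FW role: a FIFO delays the input stream so the PE sees three
        # consecutive elements.  PE role: apply the kernel per "cycle".
        fifo = deque()
        out = []
        for v in stream:               # one new element per cycle
            fifo.append(v)
            if len(fifo) == 3:         # PE fires once the window is full
                out.append(sum(w * x for w, x in zip(weights, fifo)))
                fifo.popleft()         # slide the window by one element
        return out

    print(stencil_1d([1, 2, 3, 4, 5]))   # [2.0, 3.0, 4.0]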

14. Macro for Systolic Array
A systolic array is a group of PEs, each locally connected to its neighbors, for applications with perfectly nested loops and uniform dependences.
Incorporates PolySA [J. Cong, et al., ICCAD'18]: systematic and efficient design space exploration that reaches performance comparable to manual designs [X. Wei, et al., DAC'17] within hours.

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c]*kernel[r, c],
                             axis=[r, c]))

    s = hcl.create_schedule()
    s[out].systolic()

(Diagram: a loader streams data from off-chip DDRs to feeders backed by on-chip BRAMs; the feeders drive a 4x4 grid of locally connected PEs.)
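For intuition, here is a small cycle-level Python simulation of an output-stationary systolic array for matrix multiply, the canonical perfectly-nested-loop workload with uniform dependences; every PE talks only to its right and bottom neighbors. This is our illustrative model, not the design PolySA emits.

    import numpy as np

    def systolic_matmul(A, B):
        # A streams in from the left (row i delayed by i cycles), B from
        # the top (column j delayed by j cycles).  Each PE multiplies the
        # values passing through it, accumulates locally, and forwards
        # them right/down: all communication is nearest-neighbor.
        n = A.shape[0]
        acc = np.zeros((n, n))
        a_reg = np.zeros((n, n))    # A value currently held in each PE
        b_reg = np.zeros((n, n))    # B value currently held in each PE
        for t in range(3 * n - 2):             # cycles to drain the array
            for i in reversed(range(n)):       # reversed order so reads see
                for j in reversed(range(n)):   # previous-cycle registers
                    a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < n else 0.0)
                    b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < n else 0.0)
                    a_reg[i, j], b_reg[i, j] = a_in, b_in
                    acc[i, j] += a_in * b_in
        return acc

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    assert np.allclose(systolic_matmul(A, B), A @ B)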

15. Imperative Programming in HeteroCL
HeteroCL further provides an embedded imperative DSL, since not all algorithms can be described with declarative code. A DSL is needed because normal Python is too flexible; not all of its semantics are synthesizable (see the staging sketch below).
The interface for applying hardware customization is unified across imperative and declarative code:

    with hcl.for_(0, N) as y:
        with hcl.for_(0, N) as x:
            with hcl.for_(0, 3) as r:
                with hcl.for_(0, 3) as c:
                    out[x, y] += image[x+r, y+c] * kernel[r, c]

    s = hcl.create_schedule()
    s[out].split(out.x, M)
    linebuf = s[image].reuse_at(out, out.y)
    s.quantize([out], Fixed(6, 4))
    # ...
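Why a with-based construct instead of a plain Python for? Because hcl.for_ must build a loop description the compiler can analyze and synthesize rather than execute the body in Python. This toy staging sketch (our own, not HeteroCL's implementation) shows the idea:

    from contextlib import contextmanager

    loop_nest = []                 # the "IR" our toy DSL accumulates

    @contextmanager
    def for_(lo, hi, name):
        # Unlike a Python for-loop, this does not run the body once per
        # iteration: it records a loop node that a compiler could later
        # lower to synthesizable hardware.
        loop_nest.append(("for", name, lo, hi))
        yield name                 # stands in for the loop variable
        loop_nest.append(("end", name))

    N = 8
    with for_(0, N, "y") as y:
        with for_(0, N, "x") as x:
            loop_nest.append(("stmt", f"out[{x},{y}] += ..."))

    print(loop_nest)               # a data structure, not executed loops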

16. Explore the Interdependence: Dot Product

    i = hcl.reduce_axis(0, N)
    return hcl.compute((1,),
        lambda x: hcl.sum(local_A[i] * local_B[i], axis=i))

Sweeping the element bitwidth W exercises all three customization types at once:

    for W in [4, 8, 16, 32]:
        NUM_PE = BANDWIDTH / W
        xo, xi = s[psum].split(x, NUM_PE)    # compute
        s[psum].unroll(xi)                   # compute
        s.quantize(local_A, hcl.Fixed(W))    # data type
        s[local_A].partition(NUM_PE)         # memory

Performance = min(compute throughput, #elements per I/O access), and both terms depend on NUM_PE = BANDWIDTH / W.

(Diagram: a DMA moves local_A and local_B from off-chip memory to NUM_PE parallel PEs whose products feed an adder tree. Plots: performance versus NUM_PE for W = 8, 16, and 32.)
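A back-of-the-envelope version of that performance model; the BANDWIDTH figure, the units, and the simplification that compute throughput equals the number of unrolled PEs are our assumptions, not numbers from the talk.

    BANDWIDTH = 512                        # bits per cycle off-chip (assumed)

    for W in [8, 16, 32]:                  # element bitwidth
        elems_per_cycle = BANDWIDTH // W   # I/O bound: elements per cycle
        for num_pe in [4, 16, 64, 128]:    # compute bound: MACs per cycle
            perf = min(num_pe, elems_per_cycle)
            print(f"W={W:2d} NUM_PE={num_pe:3d} -> {perf:3d} MACs/cycle")

For W = 32 the output saturates at 16 MACs/cycle no matter how many PEs are added, which is the interdependence the slide's three plots illustrate: the best NUM_PE depends on the chosen data type.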
