hardware design and analysis of efficient loop coarsening
play

Hardware Design and Analysis of Efficient Loop Coarsening and Border - PowerPoint PPT Presentation

Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing M. Akif zkan, Oliver Reiche, Frank Hannig, and Jrgen Teich Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nrnberg


  1. Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen Teich Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nürnberg ASAP , July 11, 2017, Seattle

  2. Motivation: Coarse-grained parallelism on FPGA Memory bandwidth limits can be reached by processing multiple pixels per cycle: • A memory bandwidth of 12 GBytes/s for each DDR3 channel is feasible on a “modern” FPGA, which leads to 512 bit wide interfaces for around 200 MHz logic frequency (Zynq zc706) • High-speed serial transceiver technology on FPGAs enables the communica- tion interfaces to operate at high data rates … … … …

  3. Motivation: Coarse-grained parallelism on FPGA Memory bandwidth limits can be reached by processing multiple pixels per cycle: • A memory bandwidth of 12 GBytes/s for each DDR3 channel is feasible on a “modern” FPGA, which leads to 512 bit wide interfaces for around 200 MHz logic frequency (Zynq zc706) • High-speed serial transceiver technology on FPGAs enables the communica- tion interfaces to operate at high data rates … … … … How to provide efficient coarse-grained parallelism for image processing applications?

  4. Motivation: Image Processing Applications We can define three characteristic data operations in image processing applications: input image output image Point Operators: Output data is determined by single input data input image output image Local Operators: Output data is determined by a local region of the in- put data (stencil pattern-based calculations) input image output image Global Operators: Output data is determined by all of the input data

  5. Motivation: Image Processing Applications A great portion of image processing applications can be described as task graphs of point, local, and global operators: dx sx gx input output gxy sxy hc gy dy sy An example task graph for Harris Corner Detection (square: local operator, circle: point operator)

  6. Motivation: Parallelization of Image Processing Applications A naive way would be replicating the accelerator hardware: dx sx gx dx sx gx dx sx gx dx sx gx hc input output hc gxy sxy gxy sxy gxy sxy gxy sxy hc gy dy sy gy dy sy dy gy hc sy gy dy sy

  7. Motivation: Parallelization of Image Processing Applications Is there a more resource-efficient approach? {sx, sx, {gx, gx, gx, gx} sx, sx} output input {dx, dx, dx, dx} {hc, {sxy, hc, sxy, {gxy, gxy, gxy, gxy} hc, sxy, hc} sxy} {dy, dy, dy, dy} {sy, sy, {gy, gy, gy, gy} sy, sy}

  8. Motivation: Parallelization of point operators Coarse-grained parallelization of point operators is rather straightforward: input input f {f, f, f, f} output output The throughput is linear with the resource usage (when further data-path optimizations are ignored).

  9. Motivation: Parallelization of point operators Coarse-grained parallelization of point operators is rather straightforward: input input f {f, f, f, f} output output The throughput is linear with the resource usage (when further data-path optimizations are ignored). What are efficient parallelization methods for local operators?

  10. Motivation: Image border handling • a fundamental image processing issue for local operators • mostly overlooked by the digital hardware design community • should be considered together with coarse-grained parallelization 0 0 0 1 2 3 3 3 5 4 4 5 6 7 7 6 10 9 8 9 10 11 10 9 c c c c c c c c 0 0 0 3 3 3 0 0 3 3 6 5 5 6 7 6 5 c c c c c c c c 1 2 1 1 2 2 4 c c c c 0 0 0 1 2 3 3 3 1 0 0 1 2 3 3 2 2 1 0 1 2 3 2 1 0 1 2 3 4 4 4 5 6 7 7 7 5 4 4 5 6 7 7 6 6 5 4 5 6 7 6 5 c c 4 5 6 7 c c 8 8 8 9 10 11 11 11 9 8 8 9 10 11 11 10 10 9 8 9 10 11 10 9 c c 8 9 10 11 c c 12 12 12 13 14 15 15 15 13 12 12 13 14 15 15 14 14 13 12 13 14 15 14 13 c c 12 13 14 15 c c c c c c c c c c 12 12 12 13 14 15 15 15 13 12 12 13 14 15 15 14 10 9 8 9 10 11 10 9 12 12 12 13 14 15 15 15 9 8 8 9 10 11 11 10 6 5 4 5 6 7 6 5 c c c c c c c c (a) clamp (b) mirror (c) mirror-101 (d) constant Common border handling modes.

  11. Outline Loop Coarsening Border Handling Best Architecture Selection Evaluation and Results

  12. Loop Coarsening

  13. Loop Coarsening: Schmid’s 1 Approach Coarsening the outer horizontal loop of a 2D input by a factor of v : for(int y = 0; y < IMAGE_HEIGHT; y++){ for(int x = 0; x < IMAGE_WIDTH; x + v){ (DataBeatType*)(out[y][x]) = local_op(stencil_p1(y, x), ..); } } … … … … Raster order processing facilitates burst mode read, thus highest external memory bandwidth! 1 M. Schmid, O. Reiche, F . Hannig, and J. Teich, “Loop coarsening in C-based high-level synthesis”, ASAP15. ASAP’17 2 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  14. Loop Coarsening: Schmid’s Approach The line buffer and sliding window are modified to store so-called data beats . Sliding Window Line Bu ff er … … … … f f f f … … The throughput is sub-linear with the resource usage . ASAP’17 3 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  15. Loop Coarsening: Schmid’s Approach The line buffer and sliding window are modified to store so-called data beats . Sliding Window Line Bu ff er … … … … f f f f … … The throughput is sub-linear with the resource usage . ASAP’17 3 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  16. Loop Coarsening: Schmid’s Sliding Window current timestamp: (coarsening) initial latency - 1 FETCH: 0 1 2 3 CALC: OUT: Line shift shift input Bu ff er … … f f f f … … (kernel width) w = 3, (coarsening factor) v = 4 ASAP’17 4 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  17. Loop Coarsening: Schmid’s Sliding Window current timestamp: (coarsening) initial latency FETCH: 0 1 2 3 4 5 6 7 CALC: 0 1 2 3 OUT: 0 1 2 3 Line shift shift input Bu ff er … … f f f f … … (kernel width) w = 3, (coarsening factor) v = 4 ASAP’17 4 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  18. Loop Coarsening: Schmid’s Sliding Window current timestamp: (coarsening) initial latency +1 FETCH: 0 1 2 3 4 5 6 7 8 9 10 11 CALC: 0 1 2 3 4 5 6 7 OUT: 0 1 2 3 4 5 6 7 Line shift shift input Bu ff er … … f f f f … … (kernel width) w = 3, (coarsening factor) v = 4 Deploys additional registers when r w mod v � = 0 ASAP’17 4 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  19. Loop Coarsening: Fetch and Calc (F&C) Redundant registers in Schmid’s architecture are eliminated Schmid’s shift shift input Line Bu ff er Fetch And Calc shift shift input … … … … f f f f (kernel width) w = 3, (coarsening factor) v = 4 ASAP’17 5 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  20. Loop Coarsening: Calc and Pack (C&P) current timestamp: (coarsening) initial latency - 1 FETCH: 0 1 2 3 CALC: x 0 1 2 OUT: shift input Line Bu ff er … … f f f f 0 1 2 … … (kernel width) w = 3, (coarsening factor) v = 4 ASAP’17 6 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  21. Loop Coarsening: Calc and Pack (C&P) current timestamp: (coarsening) initial latency FETCH: 0 1 2 3 4 5 6 7 CALC: x 0 1 2 3 4 5 6 OUT: 0 1 2 3 shift input Line Bu ff er … … f f f f 4 5 6 … … 0 1 2 3 (kernel width) w = 3, (coarsening factor) v = 4 output ([ x , x + v ] , t ) = pack { out ([ 0 , v − r w − 1 ] , t − 1 ) , out ([ v − r w , v − 1 ] , t ) } ASAP’17 6 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  22. Loop Coarsening: Calc and Pack (C&P) current timestamp: (coarsening) initial latency + 1 FETCH: 0 1 2 3 4 5 6 7 8 9 10 11 CALC: x 0 1 2 3 4 5 6 7 8 9 10 OUT: 0 1 2 3 4 5 6 7 shift input Line Bu ff er … … f f f f 8 9 10 … … 4 5 6 7 (kernel width) w = 3, (coarsening factor) v = 4 output ([ x , x + v ] , t ) = pack { out ([ 0 , v − r w − 1 ] , t − 1 ) , out ([ v − r w , v − 1 ] , t ) } ASAP’17 6 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  23. Loop Coarsening Overview Resource usage is the # of registers when border handling is ignored: shift shift input Schmid’s: k in · h · ( v + 2 · ( v ·⌈ r w / v ⌉ )) f f f f shift shift input Fetch And Calc: C F&C reg = k in · h · ( r w + v · ( ⌈ r w / v ⌉ + 1 )) f f f f shift input Calc and Pack: C C&P reg = k in · h · ( 2 · r w + v )+ k out · ( v − ( r w mod v )) f f f f ASAP’17 7 M. Akif Özkan | Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing

  24. Border Handling

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend