

slide-1
SLIDE 1

Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing

  • M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen Teich

Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nürnberg
ASAP, July 11, 2017, Seattle

slide-3
SLIDE 3

Motivation: Coarse-grained parallelism on FPGA

Memory bandwidth limits can be reached by processing multiple pixels per cycle:

  • A memory bandwidth of 12 GBytes/s for each DDR3 channel is feasible on a “modern” FPGA, which leads to 512-bit-wide interfaces at a logic frequency of around 200 MHz (Zynq zc706)
  • High-speed serial transceiver technology on FPGAs enables the communication interfaces to operate at high data rates

How to provide efficient coarse-grained parallelism for image processing applications?

slide-4
SLIDE 4

Motivation: Image Processing Applications

We can define three characteristic data operations in image processing applications:

Point Operators: Output data is determined by single input data
Local Operators: Output data is determined by a local region of the input data (stencil pattern-based calculations)
Global Operators: Output data is determined by all of the input data

[Figure: input image and output image for each of the three operator types]

slide-5
SLIDE 5

Motivation: Image Processing Applications

A great portion of image processing applications can be described as task graphs of point, local, and global operators.

[Figure: An example task graph for Harris Corner Detection with the nodes input, dx, dy, sx, sy, sxy, gx, gy, gxy, hc, and output (square: local operator, circle: point operator)]

slide-6
SLIDE 6

Motivation: Parallelization of Image Processing Applications

A naive way would be replicating the accelerator hardware:

[Figure: the Harris Corner task graph (dx, dy, sx, sy, sxy, gx, gy, gxy, hc) replicated four times between input and output]

slide-7
SLIDE 7

Motivation: Parallelization of Image Processing Applications

Is there a more resource-efficient approach?

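[Figure: a single task graph between input and output in which every node is coarsened by a factor of four: {dx, dx, dx, dx}, {dy, dy, dy, dy}, {sx, sx, sx, sx}, {sy, sy, sy, sy}, {sxy, sxy, sxy, sxy}, {gx, gx, gx, gx}, {gy, gy, gy, gy}, {gxy, gxy, gxy, gxy}, {hc, hc, hc, hc}]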

slide-9
SLIDE 9

Motivation: Parallelization of point operators

Coarse-grained parallelization of point operators is rather straightforward:

[Figure: a point operator f between input and output, next to its coarsened version {f, f, f, f}]

The throughput is linear with the resource usage (when further data-path optimizations are ignored).

What are efficient parallelization methods for local operators?

slide-10
SLIDE 10

Motivation: Image border handling

  • a fundamental image processing issue for local operators
  • mostly overlooked by the digital hardware design community
  • should be considered together with coarse-grained parallelization

Common border handling modes, shown for a 4×4 image (pixels 0 to 15) with a border of two pixels:

(a) clamp

     0   0   0   1   2   3   3   3
     0   0   0   1   2   3   3   3
     0   0   0   1   2   3   3   3
     4   4   4   5   6   7   7   7
     8   8   8   9  10  11  11  11
    12  12  12  13  14  15  15  15
    12  12  12  13  14  15  15  15
    12  12  12  13  14  15  15  15

(b) mirror

     5   4   4   5   6   7   7   6
     1   0   0   1   2   3   3   2
     1   0   0   1   2   3   3   2
     5   4   4   5   6   7   7   6
     9   8   8   9  10  11  11  10
    13  12  12  13  14  15  15  14
    13  12  12  13  14  15  15  14
     9   8   8   9  10  11  11  10

(c) mirror-101

    10   9   8   9  10  11  10   9
     6   5   4   5   6   7   6   5
     2   1   0   1   2   3   2   1
     6   5   4   5   6   7   6   5
    10   9   8   9  10  11  10   9
    14  13  12  13  14  15  14  13
    10   9   8   9  10  11  10   9
     6   5   4   5   6   7   6   5

(d) constant

     c   c   c   c   c   c   c   c
     c   c   c   c   c   c   c   c
     c   c   0   1   2   3   c   c
     c   c   4   5   6   7   c   c
     c   c   8   9  10  11   c   c
     c   c  12  13  14  15   c   c
     c   c   c   c   c   c   c   c
     c   c   c   c   c   c   c   c
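
The four modes differ only in how an out-of-range coordinate is mapped back into the image. A minimal C sketch of that mapping follows; the enum, the function name `border_index`, and the convention of returning -1 for the constant mode are illustrative assumptions, not taken from the slides. It assumes the border is narrower than the image.

```c
#include <assert.h>

/* Border modes named after the slide; the enum itself is illustrative. */
typedef enum {
    BORDER_CLAMP,
    BORDER_MIRROR,
    BORDER_MIRROR_101,
    BORDER_CONSTANT
} border_mode_t;

/* Maps a possibly out-of-range coordinate i onto an index in [0, n).
 * For BORDER_CONSTANT an out-of-range i yields -1, and the caller is
 * expected to substitute the constant c instead of reading a pixel.
 * Assumes the overshoot is smaller than n. */
int border_index(int i, int n, border_mode_t mode)
{
    assert(n > 1);
    if (i >= 0 && i < n) {
        return i;                               /* inside the image */
    }
    switch (mode) {
    case BORDER_CLAMP:                          /* repeat the edge pixel */
        return (i < 0) ? 0 : n - 1;
    case BORDER_MIRROR:                         /* edge pixel is duplicated */
        return (i < 0) ? -i - 1 : 2 * n - 1 - i;
    case BORDER_MIRROR_101:                     /* edge pixel is not duplicated */
        return (i < 0) ? -i : 2 * n - 2 - i;
    default:                                    /* BORDER_CONSTANT */
        return -1;
    }
}
```

Applied per row and per column with n = 4, this reproduces the index patterns of grids (a) to (c); for example, coordinate -2 maps to 0 (clamp), 1 (mirror), and 2 (mirror-101).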

slide-11
SLIDE 11

Outline

  • Loop Coarsening
  • Border Handling
  • Best Architecture Selection
  • Evaluation and Results

slide-12
SLIDE 12

Loop Coarsening

slide-13
SLIDE 13

Loop Coarsening: Schmid’s¹ Approach

Coarsening the outer horizontal loop of a 2D input by a factor of v :

    for (int y = 0; y < IMAGE_HEIGHT; y++) {
        for (int x = 0; x < IMAGE_WIDTH; x += v) {  // one data beat of v pixels per iteration
            *(DataBeatType*)(&out[y][x]) = local_op(stencil_p1(y, x), ..);
        }
    }

Raster order processing facilitates burst mode read, thus highest external memory bandwidth!
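
A self-contained C sketch of the same idea on a single image row, under stated assumptions: the coarsening factor `V = 4`, the 3-tap clamped sum `local_op3`, and the function names are hypothetical stand-ins for the elided `local_op` and stencil arguments. The inner loop stands for the v parallel instances of f that an HLS tool would obtain by full unrolling.

```c
#include <assert.h>

enum { V = 4 };  /* coarsening factor v; the value 4 is an assumption */

/* Hypothetical local operator: 3-tap horizontal sum with clamped borders. */
static int local_op3(const int *row, int width, int x)
{
    int left  = (x > 0) ? x - 1 : 0;
    int right = (x < width - 1) ? x + 1 : width - 1;
    return row[left] + row[x] + row[right];
}

/* One iteration of the x loop produces a whole data beat of V pixels.
 * In hardware the inner loop would be unrolled into V copies of f. */
void filter_row_coarsened(const int *in, int *out, int width)
{
    assert(width > 0 && width % V == 0);
    for (int x = 0; x < width; x += V) {
        for (int k = 0; k < V; k++) {
            out[x + k] = local_op3(in, width, x + k);
        }
    }
}
```

The row width is required to be a multiple of V so that every iteration reads and writes a full beat.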

¹ M. Schmid, O. Reiche, F. Hannig, and J. Teich, “Loop coarsening in C-based high-level synthesis”, ASAP’15.

  • M. Akif Özkan

| Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing ASAP’17 2

slide-14
SLIDE 14

Loop Coarsening: Schmid’s Approach

The line buffer and sliding window are modified to store so-called data beats.

[Figure: a line buffer and a sliding window that store data beats and feed four parallel instances of f]

The throughput is sub-linear with the resource usage.

slide-16
SLIDE 16

Loop Coarsening: Schmid’s Sliding Window

Current timestamp: (coarsening) initial latency − 1
    FETCH: 0 1 2 3
    CALC:
    OUT:

[Figure: Schmid’s sliding window; the input beat and the line buffer are shifted into the window, which feeds four parallel instances of f]

(kernel width) w = 3, (coarsening factor) v = 4

slide-17
SLIDE 17

Loop Coarsening: Schmid’s Sliding Window

Current timestamp: (coarsening) initial latency
    FETCH: 0 1 2 3 4 5 6 7
    CALC:  0 1 2 3
    OUT:   0 1 2 3

[Figure: Schmid’s sliding window; the input beat and the line buffer are shifted into the window, which feeds four parallel instances of f]

(kernel width) w = 3, (coarsening factor) v = 4

slide-18
SLIDE 18

Loop Coarsening: Schmid’s Sliding Window

Current timestamp: (coarsening) initial latency + 1
    FETCH: 0 1 2 3 4 5 6 7 8 9 10 11
    CALC:  0 1 2 3 4 5 6 7
    OUT:   0 1 2 3 4 5 6 7

[Figure: Schmid’s sliding window; the input beat and the line buffer are shifted into the window, which feeds four parallel instances of f]

(kernel width) w = 3, (coarsening factor) v = 4

Deploys additional registers when rw mod v ≠ 0

slide-19
SLIDE 19

Loop Coarsening: Fetch and Calc (F&C)

Redundant registers in Schmid’s architecture are eliminated

[Figure: Schmid’s sliding window next to the Fetch and Calc sliding window; each is fed by the input and the line buffer and drives four parallel instances of f]

(kernel width) w = 3, (coarsening factor) v = 4

slide-20
SLIDE 20

Loop Coarsening: Calc and Pack (C&P)

Current timestamp: (coarsening) initial latency − 1
    FETCH: 0 1 2 3
    CALC:  x 0 1 2
    OUT:

[Figure: Calc and Pack sliding window; the input beat and the line buffer feed four parallel instances of f, whose results are held in output registers]

(kernel width) w = 3, (coarsening factor) v = 4

slide-21
SLIDE 21

Loop Coarsening: Calc and Pack (C&P)

Current timestamp: (coarsening) initial latency
    FETCH: 0 1 2 3 4 5 6 7
    CALC:  x 0 1 2 3 4 5 6
    OUT:   0 1 2 3

[Figure: Calc and Pack sliding window; the input beat and the line buffer feed four parallel instances of f, whose results are held in output registers]

(kernel width) w = 3, (coarsening factor) v = 4

output([x, x + v], t) = pack{out([0, v − rw − 1], t − 1), out([v − rw, v − 1], t)}
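
The pack step can be sketched in C as follows. This is an illustration only: the function name `pack_beat` and the lane numbering are assumptions inferred from the CALC/OUT timeline above (CALC produces x 0 1 2 and then 3 4 5 6, while OUT emits 0 1 2 3), not code from the authors.

```c
#include <assert.h>

/* Assembles one output beat of v results from two consecutive CALC beats.
 * Output positions 0 .. v-rw-1 come from the previous cycle (its lanes
 * rw .. v-1), positions v-rw .. v-1 from the current cycle (lanes 0 .. rw-1). */
void pack_beat(const int *calc_prev, const int *calc_cur, int *out, int v, int rw)
{
    assert(v > 0 && rw >= 0 && rw < v);
    for (int k = 0; k < v - rw; k++) {
        out[k] = calc_prev[rw + k];        /* results kept from cycle t - 1 */
    }
    for (int k = 0; k < rw; k++) {
        out[v - rw + k] = calc_cur[k];     /* results produced in cycle t */
    }
}
```

With v = 4 and rw = 1, the previous beat {x, 0, 1, 2} and the current beat {3, 4, 5, 6} are packed into {0, 1, 2, 3}, which matches the OUT row of the timeline.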

slide-22
SLIDE 22

Loop Coarsening: Calc and Pack (C&P)

Current timestamp: (coarsening) initial latency + 1
    FETCH: 0 1 2 3 4 5 6 7 8 9 10 11
    CALC:  x 0 1 2 3 4 5 6 7 8 9 10
    OUT:   0 1 2 3 4 5 6 7

[Figure: Calc and Pack sliding window; the input beat and the line buffer feed four parallel instances of f, whose results are held in output registers]

(kernel width) w = 3, (coarsening factor) v = 4

output([x, x + v], t) = pack{out([0, v − rw − 1], t − 1), out([v − rw, v − 1], t)}

slide-23
SLIDE 23

Loop Coarsening Overview

Resource usage is the # of registers when border handling is ignored:

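Schmid’s: $C^{Schmid's}_{reg} = k_{in} \cdot h \cdot (v + 2 \cdot (v \cdot \lceil r_w / v \rceil))$

Fetch and Calc: $C^{F\&C}_{reg} = k_{in} \cdot h \cdot (r_w + v \cdot (\lceil r_w / v \rceil + 1))$

Calc and Pack: $C^{C\&P}_{reg} = k_{in} \cdot h \cdot (2 \cdot r_w + v) + k_{out} \cdot (v - (r_w \bmod v))$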
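
The three register-count formulas translate directly into C. This is a small sketch for experimenting with different w, v, and bit widths; the function names are illustrative, kin and kout are taken to be bit widths, and rw = ⌊w/2⌋ is passed in by the caller.

```c
#include <assert.h>

/* Integer ceiling of a / b for positive operands. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Schmid's architecture: kin * h * (v + 2 * (v * ceil(rw / v))) */
int regs_schmid(int kin, int h, int rw, int v)
{
    assert(v > 0);
    return kin * h * (v + 2 * (v * ceil_div(rw, v)));
}

/* Fetch and Calc: kin * h * (rw + v * (ceil(rw / v) + 1)) */
int regs_fetch_and_calc(int kin, int h, int rw, int v)
{
    assert(v > 0);
    return kin * h * (rw + v * (ceil_div(rw, v) + 1));
}

/* Calc and Pack: kin * h * (2 * rw + v) + kout * (v - (rw mod v)) */
int regs_calc_and_pack(int kin, int kout, int h, int rw, int v)
{
    assert(v > 0);
    return kin * h * (2 * rw + v) + kout * (v - (rw % v));
}
```

For instance, `regs_calc_and_pack(32, 32, 5, 2, 32)` evaluates to 6720 and `regs_schmid(32, 5, 2, 32)` to 15360.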

[Figure: the sliding windows of the three coarsening architectures, each driving four parallel instances of f]

slide-24
SLIDE 24

Border Handling

slide-26
SLIDE 26

Image Border Handling

Conditional Selection: appropriate input is selected before the calculation. Focus of this work!

Padding: input is enlarged with the appropriate border pixels
  • output is smaller than the input: (W + ⌊w/2⌋, H + ⌊h/2⌋) -> (W, H)
  • stalls the stream before (or during) the processing of a local operator!

Modification of the calculation: calculation depends on the image coordinates. Specific to the target algorithm!

[Figure: a 12×12 grid showing an image extended by a two-pixel border in mirror-101 mode]

slide-27
SLIDE 27

Analysis of Border Handling: Selections

Is there an optimal implementation in terms of area resources and speed?
How many selections does the optimal implementation require?

[Figure: a 5×5 sliding window over the mirror-101 extended image; selections labelled 00 to 44 route the window pixels to the inputs of f]

slide-28
SLIDE 28

Analysis of Border Handling: Selections

Let Pi,j(x,y) be a mapping to the (i,j)th output from the sliding window, e.g.:

    P0,2(1,0): sw(2,2) -> out(0,2)
    P1,2(1,0): sw(1,2) -> out(1,2)
    P2,2(1,0): sw(2,2) -> out(2,2)
    P3,2(1,0): sw(3,2) -> out(3,2)
    P4,2(1,0): sw(4,2) -> out(4,2)

[Figure: a 5×5 sliding window over the mirror-101 extended image; selections labelled 00 to 44 route the window pixels to the inputs of f]

slide-30
SLIDE 30

Analysis of Border Handling: Selections

Let $P_{i,j}(x,y)$ be a mapping to the (i,j)th output from the sliding window, and $I_s$ be the union of all input combinations; then the conditional selections in x and y are orthogonal to each other:

$|I_s| = |I_w \times I_h| = |I_w| \cdot |I_h| = \left(1 + 2 \sum_{i=2}^{\lceil w/2 \rceil} i\right) \times \left(1 + 2 \sum_{i=2}^{\lceil h/2 \rceil} i\right) = 121$ (for 5-by-5)

          i:   0      0      0      1      1      2      3      3      4      4      4
          x:   else   1      0      else   0      all    else   W−1    else   W−2    W−1
    j  y
    0  else    (0,0)  (2,0)  (4,0)  (1,0)  (3,0)  (2,0)  (3,0)  (1,0)  (4,0)  (2,0)  (0,0)
    0  1       (0,2)  (2,2)  (4,2)  (1,2)  (3,2)  (2,2)  (3,2)  (1,2)  (4,2)  (2,2)  (0,2)
    0  0       (0,4)  (2,4)  (4,4)  (1,4)  (3,4)  (2,4)  (3,4)  (1,4)  (4,4)  (2,4)  (0,4)
    1  else    (0,1)  (2,1)  (4,1)  (1,1)  (3,1)  (2,1)  (3,1)  (1,1)  (4,1)  (2,1)  (0,1)
    1  0       (0,3)  (2,3)  (4,3)  (1,3)  (3,3)  (2,3)  (3,3)  (1,3)  (4,3)  (2,3)  (0,3)
    2  all     (0,2)  (2,2)  (4,2)  (1,2)  (3,2)  (2,2)  (3,2)  (1,2)  (4,2)  (2,2)  (0,2)
    3  else    (0,3)  (2,3)  (4,3)  (1,3)  (3,3)  (2,3)  (3,3)  (1,3)  (4,3)  (2,3)  (0,3)
    3  H−1     (0,1)  (2,1)  (4,1)  (1,1)  (3,1)  (2,1)  (3,1)  (1,1)  (4,1)  (2,1)  (0,1)
    4  else    (0,4)  (2,4)  (4,4)  (1,4)  (3,4)  (2,4)  (3,4)  (1,4)  (4,4)  (2,4)  (0,4)
    4  H−2     (0,2)  (2,2)  (4,2)  (1,2)  (3,2)  (2,2)  (3,2)  (1,2)  (4,2)  (2,2)  (0,2)
    4  H−1     (0,0)  (2,0)  (4,0)  (1,0)  (3,0)  (2,0)  (3,0)  (1,0)  (4,0)  (2,0)  (0,0)

$I_s$ for a 5×5 local operator with mirror-101 border mode. Coordinates else and all cover redundant indices.

slide-31
SLIDE 31

Analysis of Border Handling: Selections

Let $P_{i,j}(x,y)$ be a mapping to the (i,j)th output from the sliding window, and $I_s$ be the union of all input combinations; then the conditional selections in x and y are orthogonal to each other:

$|I_s| = |I_w \times I_h| = |I_w| \cdot |I_h| = \left(1 + 2 \sum_{i=2}^{\lceil w/2 \rceil} i\right) \times \left(1 + 2 \sum_{i=2}^{\lceil h/2 \rceil} i\right) = 121$ (for 5-by-5)

This implies that $|I_h| + h \cdot |I_w| = 66$ mappings (for 5-by-5) are sufficient when the x and y selections are separated.
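
The two counts can be checked with a few lines of C. The sketch below evaluates the sums literally; the function names are illustrative and odd window sizes are assumed.

```c
#include <assert.h>

/* |I_w| = 1 + 2 * sum_{i=2}^{ceil(w/2)} i, for an odd window size w. */
int selections_1d(int w)
{
    assert(w >= 1 && w % 2 == 1);
    int sum = 0;
    for (int i = 2; i <= (w + 1) / 2; i++) {
        sum += i;
    }
    return 1 + 2 * sum;
}

/* Joint selection in x and y: |I_s| = |I_w| * |I_h|. */
int selections_joint(int w, int h)
{
    return selections_1d(w) * selections_1d(h);
}

/* Separated selection: one row selection plus h column selections,
 * |I_h| + h * |I_w|. */
int selections_separated(int w, int h)
{
    return selections_1d(h) + h * selections_1d(w);
}
```

For w = h = 5 this gives 11 selections per dimension, 121 for the joint case, and 66 for the separated case.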

[Figure: separated border handling; the input pixel and four line buffers feed one row selection, which is followed by five column selections, one per window row]

slide-32
SLIDE 32

Hardware Architecture: Type-0

  • Separated border handling architecture
  • Uses the row selection for the column selection

[Figure: separated border handling architecture; the input pixel and four line buffers feed one row selection, which is followed by five column selections, one per window row]

[Figure: row selection (mirror-101) mapping the inputs in0 to in4 onto the outputs out0 to out4]

slide-35
SLIDE 35

Hardware Architecture: Type-0

  • Separated border handling architecture
  • Uses the row selection for the column selection

[Figure: column selection (mirror-101) fed by the input, and row selection (mirror-101) mapping the inputs in0 to in4 onto the outputs out0 to out4]

$C^{Type\text{-}0}_{reg}(b) = h \cdot k_{in} \cdot (2 \cdot r_w)$

$T^{Type\text{-}0}_{CriticalPath} = \begin{cases} T(MUX[r_w + 1]), & \text{mirror-101, mirror, clamp} \\ T(MUX[2]), & \text{clamp2, constant} \end{cases}$

Exploit the temporal locality in raster-order processing, improve the column selection!

slide-36
SLIDE 36

Analysis of Border Handling: Temporal Locality

[Figure: register groups Rleft, Rmid, Rright, R′right, and Rfetch, connected by shift and assign operations and fed by the input]

                 sliding window (w = 7)            additional registers (rw = 3)
    x = W − 4:   17  18  19  20  21  22  23        21  22  23
    x = W − 3:   18  19  20  21  22  23  r0        22  23   0
    x = W − 2:   19  20  21  22  23  r0  r1        23   0   1
    x = W − 1:   20  21  22  23  r0  r1  r2         0   1   2
    x = 0:       l0  l1  l2   0   1   2   3         1   2   3
    x = 1:       l1  l2   0   1   2   3   4         2   3   4

Temporal data flow for column data selection with a local operator of size w = 7. Blue background denotes valid image regions; lX and rX represent the appropriate pixel values for the corresponding border handling mode.

slide-37
SLIDE 37

Analysis of Border Handling: Temporal Locality

[Figure: temporal data flow table for w = 7, repeated from the previous slide, with the register groups Rleft, Rmid, Rright, R′right, and Rfetch]

Assuming that the streaming is not stalled and one pixel is read in each cycle, at least rw pixels per row must be fetched at x = 0 in order to be able to initialize all column pixels. Therefore, the minimum number of registers in a row is

$C^{min}_{reg} = h \cdot k_{in} \cdot (w + r_w)$    (1)

slide-38
SLIDE 38

Analysis of Border Handling: Temporal Locality

[Figure: temporal data flow table for w = 7, repeated from the previous slides, with a MUX[2] in front of the registers of the groups Rleft, Rmid, Rright, R′right, and Rfetch]

Only the data selection before Rfetch and the blue portion of R′right can be optimized when the minimum number of registers is used in raster-order processing.

slide-39
SLIDE 39

Column Selection Architectures: Type-1

[Figure: temporal data flow table for w = 7, repeated from the previous slides, with a MUX[2] in front of the registers of the groups Rleft, Rmid, Rright, R′right, and Rfetch]

Let’s minimize the selection at R′right according to the analysis.

slide-40
SLIDE 40

Column Selection Architectures: Type-2

[Figure: register groups Rleft, Rmid, Rright, R′right, and Rfetch with a MUX[2] in front of the registers; in the Type-2 variant a register group Rnew is shifted into Rmid, Rright, and R′right]

                 Rmid   Rright         R′right
    x = W − 4:   20     21  22  23     r0  r1  r2
    x = W − 3:   21     22  23  r0     r1  r2   0
    x = W − 2:   22     23  r0  r1     r2   0   1
    x = W − 1:   23     r0  r1  r2      0   1   2
    x = 0:        0      1   2   3

Let’s minimize the selection before Rfetch according to the analysis.

slide-41
SLIDE 41

Column Selection Architectures: Mirror border mode

Type-0:
  − not resource efficient
  + full flexibility for all the border modes

Type-1:
  + resource efficient for a great portion of the design space (w, v, border mode)

Type-2:
  + fastest architecture
  + Pareto-optimal depending on w, v, and technology mapping

[Figure: the three column selection architectures for the mirror border mode, each fed by the input]

slide-42
SLIDE 42

Best Architecture Selection

slide-43
SLIDE 43

Considered Architectures

Architectures:
  • Loop Coarsening Architectures: Schmid’s, Fetch and Calc (F&C), Calc and Pack (C&P)
  • Border Handling Modes: Mirror, Mirror-101, Clamp, Constant
  • Border Handling Types: Naive; Separated (Type-0, Type-1, Type-2)

slide-44
SLIDE 44

Best Architecture Selection

input : w, h, borderMode, v, kout, kin, designGoal
output: BorderHandlingPattern, CoarseningArch

func selectParetoOptimal(BorderHandlingPattern, CoarseningArch,
                         w, h, borderMode, v, kout, kin, designGoal)
    rw = ⌊w/2⌋
    if borderMode = UNDEFINED then
        if kout < kin · h then
            CoarseningArch ← Calc and Pack
        else
            CoarseningArch ← Fetch and Calc
        end
        BorderHandlingPattern ← none
    else
        if rw · (kin · h − kout + 1) < v · (kin · h − kout) then
            CoarseningArch ← Calc and Pack
        else
            CoarseningArch ← Fetch and Calc
        end
        if borderMode = (CLAMP ∨ CONSTANT) then
            BorderHandlingPattern ← Type-1
        else // borderMode = (MIRROR ∨ MIRROR-101)
            if (designGoal = speed) ∨ ((rw + 1) MUX[2] − MUX[rvw + 1] − MUX[2] < 0) then
                BorderHandlingPattern ← Type-2
            else
                BorderHandlingPattern ← Type-1
            end
        end
    end
end
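
A C rendering of the selection procedure, as a sketch under stated assumptions: the type names are illustrative, and the technology-dependent costs MUX[2] and MUX[rvw + 1] are not derived here but passed in by the caller as the hypothetical parameters `mux2_cost` and `mux_wide_cost`.

```c
#include <assert.h>

typedef enum { BM_UNDEFINED, BM_CLAMP, BM_CONSTANT, BM_MIRROR, BM_MIRROR_101 } border_mode_t;
typedef enum { GOAL_AREA, GOAL_SPEED } design_goal_t;
typedef enum { ARCH_FETCH_AND_CALC, ARCH_CALC_AND_PACK } coarsening_arch_t;
typedef enum { BH_NONE, BH_TYPE_1, BH_TYPE_2 } bh_pattern_t;

typedef struct {
    coarsening_arch_t arch;
    bh_pattern_t pattern;
} selection_t;

/* mux2_cost and mux_wide_cost stand for the costs of MUX[2] and of the
 * wide multiplexer in the pseudocode; both are caller-supplied assumptions. */
selection_t select_pareto_optimal(int w, int h, border_mode_t mode, int v,
                                  int kout, int kin, design_goal_t goal,
                                  int mux2_cost, int mux_wide_cost)
{
    assert(w > 0 && h > 0 && v > 0);
    selection_t s;
    int rw = w / 2;                 /* rw = floor(w / 2) */
    int diff = kin * h - kout;      /* kin * h - kout */

    if (mode == BM_UNDEFINED) {     /* no border handling requested */
        s.arch = (kout < kin * h) ? ARCH_CALC_AND_PACK : ARCH_FETCH_AND_CALC;
        s.pattern = BH_NONE;
        return s;
    }

    s.arch = (rw * (diff + 1) < v * diff) ? ARCH_CALC_AND_PACK : ARCH_FETCH_AND_CALC;

    if (mode == BM_CLAMP || mode == BM_CONSTANT) {
        s.pattern = BH_TYPE_1;
    } else if (goal == GOAL_SPEED ||
               (rw + 1) * mux2_cost - mux_wide_cost - mux2_cost < 0) {
        s.pattern = BH_TYPE_2;      /* mirror or mirror-101 */
    } else {
        s.pattern = BH_TYPE_1;
    }
    return s;
}
```

Returning both results in one struct replaces the two output parameters of the pseudocode; the decision logic itself is unchanged.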

slide-45
SLIDE 45

Evaluation and Results

slide-46
SLIDE 46

Comparison of Loop Coarsening Architectures

[Figure: LUT and FF usage of Calc and Pack (C&P) and Fetch and Calc (F&C) for w = 3 and w = 11, plotted over the values 1, 2, 4, 8, 16, 32, 64]

HLS estimation results of the proposed coarsening architectures (target clock frequency is 200 MHz, and no border handling is applied)

slide-47
SLIDE 47

Comparison of Border Handling Architectures

[Figure: LUT and FF usage of the Naïve, Type-0, and Type-1 border handling architectures for w = 11, plotted over the values 1, 2, 4, 8]

HLS estimation results of the proposed border handling architectures (target clock frequency is 200 MHz, and border mode is mirror)

slide-48
SLIDE 48

Type-1 vs Type-2

    Parameters          | Implementation (CPtar = 3.1 ns)
    v   w/h  BH Patt.   | SLICE  BRAM  FF    LUT   CPpsyn  CPimp
    1   7    Naïve      | 482    6     1976  1174  2.55    2.74
    1   7    Type-0     | 415    6     1843  878   2.56    2.85
    1   7    Type-1     | 381    6     1620  741   2.56    2.83
    1   7    Type-2     | 316    6     1409  851   2.52    2.83

Table: HLS implementation results for a Mean Filter.

slide-49
SLIDE 49

Discussion

    for (int y = 0; y < IMAGE_HEIGHT; y++) {
        for (int x = 0; x < IMAGE_WIDTH; x += v) {  // one data beat of v pixels per iteration
            *(DataBeatType*)(&out[y][x]) = local_op(stencil_p1(y, x), ..);
        }
    }

Architectures:
  • Loop Coarsening Architectures: Schmid’s, Fetch and Calc (F&C), Calc and Pack (C&P)
  • Border Handling Modes: Mirror, Mirror-101, Clamp, Constant
  • Border Handling Types: Naive; Separated (Type-0, Type-1, Type-2)

Thanks for listening. Any questions?

Title: Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing
Speaker: M. Akif Özkan, akif.oezkan@fau.de

slide-50
SLIDE 50

Backup Slides

slide-51
SLIDE 51

A Deeper Look into Implementation Results

slide-52
SLIDE 52

Coarsening Architectures

    Parameters        | Estimation (CPtar = 20 ns) | Estimation (CPtar = 3.1 ns) | Implementation (CPtar = 3.1 ns)
    v   w/h  Coars.   | BRAM  FF    LUT  CPes      | BRAM  FF    LUT  CPes       | SLICE  BRAM  FF     LUT   DSP  CPpsyn  CPimp
    1   5    C&P      | 4     304   93   13.50     | 4     378   93   3.10       | 152    4     600    270   29   2.48    2.55
    1   5    F&C      | 4     304   93   13.50     | 4     378   93   3.10       | 139    4     600    269   29   2.48    2.61
    2   5    C&P      | 4     339   86   14.43     | 4     446   88   3.09       | 215    4     873    448   14   2.52    2.63
    2   5    F&C      | 4     339   86   14.43     | 4     446   88   3.09       | 222    4     873    449   14   2.52    2.46
    8   5    C&P      | 8     663   82   14.43     | 8     954   84   3.06       | 675    8     2589   1565  6    2.39    2.56
    8   5    F&C      | 8     855   82   14.43     | 8     1146  84   3.06       | 603    8     2781   1566  6    2.39    2.74
    32  5    C&P      | 32    1995  75   14.43     | 32    3045  77   3.03       | 1951   32    9256   5367  6    2.38    2.83
    32  5    F&C      | 32    2955  75   14.43     | 32    4005  77   3.03       | 2023   32    10216  5367  6    2.38    3.09

Table: HLS estimation results for a local operator and implementation results for a Mean Filter.

slide-53
SLIDE 53

Border Handling Architectures

    Parameters         | Estimation (CPtar = 20 ns) | Estimation (CPtar = 3.1 ns) | Implementation (CPtar = 3.1 ns)
    v  w/h  BH Patt.   | BRAM  FF    LUT   CPes     | BRAM  FF    LUT   CPes      | SLICE  BRAM  FF    LUT   CPpsyn  CPimp
    1  5    Naïve      | 4     471   575   13.5     | 4     643   576   3.10      | 213    4     879   533   2.63    2.67
    1  5    Type-0     | 4     438   381   13.5     | 4     522   385   3.10      | 192    4     772   422   2.53    2.65
    1  5    Type-1     | 4     398   341   13.5     | 4     482   345   3.10      | 176    4     732   419   2.52    2.58
    1  5    Type-2     | 4     403   459   13.5     | 4     487   468   3.10      | 179    4     742   475   2.52    2.53
    2  5    Naïve      | 4     495   538   14.4     | 4     664   542   3.09      | 288    4     1182  741   2.46    2.65
    2  5    Type-0     | 4     432   344   14.4     | 4     565   349   3.09      | 268    4     1053  625   2.52    2.65
    2  5    Type-1     | 4     432   344   14.4     | 4     565   349   3.09      | 244    4     1053  626   2.52    2.60
    2  5    Type-2     | 4     435   501   14.4     | 4     569   511   3.09      | 262    4     1061  700   2.53    2.80
    1  7    Naïve      | 6     855   1443  15.0     | 6     1434  1446  3.10      | 482    6     1976  1174  2.55    2.74
    1  7    Type-0     | 6     806   867   15.0     | 6     1332  871   3.10      | 415    6     1843  878   2.56    2.85
    1  7    Type-1     | 6     693   642   15.0     | 6     1107  646   3.10      | 381    6     1620  741   2.56    2.83
    1  7    Type-2     | 6     698   808   15.0     | 6     885   817   3.10      | 316    6     1409  851   2.52    2.83
    2  7    Naïve      | 6     957   1308  15.9     | 6     1589  1339  3.09      | 653    6     2715  1595  2.53    2.70
    2  7    Type-0     | 6     862   730   15.9     | 6     1496  735   3.09      | 595    6     2570  1219  2.52    2.95
    2  7    Type-1     | 6     806   674   15.9     | 6     1384  679   3.09      | 554    6     2459  1189  2.52    2.76
    2  7    Type-2     | 6     923   1287  17.2     | 6     1107  1298  16.1      | 514    6     2194  1405  2.53    2.94
    1  11   Naïve      | 10    2033  5390  16.6     | 10    4324  5393  3.10      | 1352   10    5601  3852  2.55    2.87
    1  11   Type-0     | 10    1952  2997  16.6     | 10    3549  3018  3.10      | 1028   10    4781  2627  2.55    2.95
    1  11   Type-1     | 10    1597  1583  16.6     | 10    1892  1594  3.45      | 711    10    3104  1883  2.55    2.67
    1  11   Type-2     | 10    1601  1845  16.6     | 10    1891  1862  3.10      | 685    10    3114  2062  2.52    2.90
    2  11   Naïve      | 10    2184  4562  16.7     | 10    4597  4566  3.09      | 1607   10    6839  4366  2.56    2.91
    2  11   Type-0     | 10    2025  2160  16.7     | 10    3372  2228  3.09      | 1286   10    5663  3183  2.53    2.82
    2  11   Type-1     | 10    1760  1719  16.7     | 10    2843  1787  3.09      | 1204   10    5136  2956  2.56    2.95
    2  11   Type-2     | 10    1989  3564  25.3     | 10    2230  3639  25.3      | 1221   10    4537  3724  2.53    2.85

Table: HLS estimation results for a local operator and implementation results for a Mean Filter.

slide-54
SLIDE 54

Border Handling in case of Loop Coarsening (F&C)

slide-55
SLIDE 55

Loop Coarsening: Architecture Selection

[Figure: register contents of two F&C Type-1 column selection examples, each fed by the input]

(a) F&C Type-1, which basically is min(rw,v) = 2 parallel Type-1 column selections for w = 3.

(b) F&C Type-1 mirror border handling for w = 9 and v = 2, which basically is min(rw,v) = 2 parallel Type-1 column selections for w = 5.

slide-56
SLIDE 56

Loop Coarsening Overview

slide-58
SLIDE 58

Loop Coarsening: Architecture Selection

The region in which C&P is better when border handling is ignored:

$C^{F\&C - C\&P}_{reg} > 0$

$C^{F\&C - C\&P}_{reg} = k_{in} \cdot h \cdot (r_w + v \cdot (\lceil r_w / v \rceil + 1)) - \left(k_{in} \cdot h \cdot (2 \cdot r_w + v) + k_{out} \cdot (v - (r_w \bmod v))\right)$

$C^{F\&C - C\&P}_{reg} = (k_{in} \cdot h - k_{out}) \cdot (v - (r_w \bmod v))$ for $r_w \bmod v \neq 0$

How significant is the improvement? Coarsening of an algorithm with v = 1024/32 = 32 and w = h = 5 (so rw = 2, with kin = kout = 32 bits):

$C^{Schmid's}_{reg} = 32 \cdot 5 \cdot (32 + 2 \cdot (32 \cdot \lceil 2/32 \rceil)) = 15360$ bits

$C^{F\&C}_{reg} = 32 \cdot 5 \cdot (2 + 32 \cdot (\lceil 2/32 \rceil + 1)) = 10560$ bits

$C^{C\&P}_{reg} = 32 \cdot 5 \cdot (2 \cdot 2 + 32) + 32 \cdot (32 - 2) = 6720$ bits

slide-59
SLIDE 59

Analyses of Border Handling

slide-62
SLIDE 62

Analysis of Border Handling: Temporal Locality

[Figure: temporal data flow table for column data selection with w = 7, with the register groups Rleft, Rmid, Rright, R′right, and Rfetch]

Except at x = 0, border handling can be achieved only through data selection that appropriately feeds Rfetch and shifts the content stored in Rright, Rmid, and Rleft.

slide-63
SLIDE 63

Analysis of Border Handling: Temporal Locality

[Figure: temporal data flow table for column data selection with w = 7, with the register groups Rleft, Rmid, Rright, R′right, and Rfetch]

All registers, except Rfetch, should be able to read from R′right in order to initialize all column pixels at x = 0.

slide-64
SLIDE 64

Analysis of Border Handling: Temporal Locality

[Figure: temporal data flow table for column data selection with w = 7, with a MUX[2] in front of the registers of the groups Rleft, Rmid, Rright, R′right, and Rfetch]

There must be at least one MUX[2] before any register in Rright and Rleft.
