Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing
- M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen Teich
Hardware/Software Co-Design, Friedrich-Alexander University Erlangen-Nürnberg
[Figure: An example task graph for Harris Corner Detection (square: local operator, circle: point operator), with the input image and the stages dx, dy, sx, sy, sxy, gx, gy, gxy, and hc. In the coarsened graph every stage appears four times, e.g. {dx, dx, dx, dx}.]
Common border handling modes (4 × 4 image with pixel values 0 to 15, border of two pixels):

(a) clamp
     0  0  0  1  2  3  3  3
     0  0  0  1  2  3  3  3
     0  0  0  1  2  3  3  3
     4  4  4  5  6  7  7  7
     8  8  8  9 10 11 11 11
    12 12 12 13 14 15 15 15
    12 12 12 13 14 15 15 15
    12 12 12 13 14 15 15 15

(b) mirror
     5  4  4  5  6  7  7  6
     1  0  0  1  2  3  3  2
     1  0  0  1  2  3  3  2
     5  4  4  5  6  7  7  6
     9  8  8  9 10 11 11 10
    13 12 12 13 14 15 15 14
    13 12 12 13 14 15 15 14
     9  8  8  9 10 11 11 10

(c) mirror-101
    10  9  8  9 10 11 10  9
     6  5  4  5  6  7  6  5
     2  1  0  1  2  3  2  1
     6  5  4  5  6  7  6  5
    10  9  8  9 10 11 10  9
    14 13 12 13 14 15 14 13
    10  9  8  9 10 11 10  9
     6  5  4  5  6  7  6  5

(d) constant
     c  c  c  c  c  c  c  c
     c  c  c  c  c  c  c  c
     c  c  0  1  2  3  c  c
     c  c  4  5  6  7  c  c
     c  c  8  9 10 11  c  c
     c  c 12 13 14 15  c  c
     c  c  c  c  c  c  c  c
     c  c  c  c  c  c  c  c
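The four modes differ only in how an out-of-range coordinate is mapped back into the image. The following Python sketch illustrates that mapping for the padded images listed above. The helper names `border_index` and `pad` are hypothetical, and the sketch assumes a border no wider than the image (a single reflection) and a 4 × 4 image holding the values 0 to 15.

```python
def border_index(i, n, mode):
    """Map coordinate i to a valid index in [0, n); None means 'use the constant'."""
    if 0 <= i < n:
        return i
    if mode == "clamp":
        return 0 if i < 0 else n - 1
    if mode == "mirror":          # edge pixel is repeated: ... 1 0 | 0 1 ...
        return -i - 1 if i < 0 else 2 * n - 1 - i
    if mode == "mirror-101":      # edge pixel is not repeated: ... 2 1 | 0 1 2 ...
        return -i if i < 0 else 2 * n - 2 - i
    if mode == "constant":
        return None
    raise ValueError(f"unknown border mode: {mode}")


def pad(image, border, mode, constant="c"):
    """Return the image extended by `border` pixels on every side."""
    height, width = len(image), len(image[0])
    padded = []
    for y in range(-border, height + border):
        sy = border_index(y, height, mode)
        row = []
        for x in range(-border, width + border):
            sx = border_index(x, width, mode)
            row.append(constant if sy is None or sx is None else image[sy][sx])
        padded.append(row)
    return padded


# Assumed example image: 4 x 4, values 0..15 in row-major order.
image = [[4 * r + c for c in range(4)] for r in range(4)]
for mode in ("clamp", "mirror", "mirror-101", "constant"):
    print(mode)
    for row in pad(image, 2, mode):
        print(" ".join(f"{p:>2}" for p in row))
```

In hardware the same mapping is realized with multiplexers rather than index arithmetic, which is why the choice of mode affects the multiplexer cost discussed later in the deck.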
… Hannig, and J. Teich, “Loop coarsening in C-based high-level synthesis”, ASAP’15.
| Hardware/Software Co-Design | Efficient Loop Coarsening and Border Handling for Image Processing ASAP’17 2
[Figure: a line buffer feeds a sliding window, on which four instances of the function f operate.]
[Figure: line buffer, input, and shift registers feeding four instances of f; kernel width w = 3, coarsening factor v = 4.]
[Figure: Schmid’s architecture and Fetch and Calc, each drawn with a line buffer plus input and shift registers feeding four instances of f; kernel width w = 3, coarsening factor v = 4.]
[Figure, three animation frames: a line buffer with shift and input registers feeding four instances of f. The frames are annotated with the values 1 to 2, then 1 to 6, then 4 to 10; kernel width w = 3, coarsening factor v = 4.]
reg = kin · h · (rw + v · (⌈rw/v⌉ + 1))
reg = kin · h · (2 · rw + v) + kout · (v − (rw mod v))
[Figure: three register layouts built from shift and input registers, each feeding four instances of f.]
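To see how the two register formulas scale with the coarsening factor, they can be evaluated directly. The Python sketch below does only that. The function names are hypothetical, and the slide does not say which formula belongs to which architecture, so they are simply called the first and the second formula; the parameter values in the loop are assumptions chosen for illustration.

```python
import math


def regs_first_formula(k_in, h, rw, v):
    """reg = kin * h * (rw + v * (ceil(rw / v) + 1))"""
    return k_in * h * (rw + v * (math.ceil(rw / v) + 1))


def regs_second_formula(k_in, k_out, h, rw, v):
    """reg = kin * h * (2 * rw + v) + kout * (v - (rw mod v))"""
    return k_in * h * (2 * rw + v) + k_out * (v - (rw % v))


# Assumed example: 8-bit input and output pixels, 5 x 5 window (h = 5, rw = 2).
k_in, k_out, h, rw = 8, 8, 5, 2
for v in (1, 2, 4, 8, 16, 32):
    first = regs_first_formula(k_in, h, rw, v)
    second = regs_second_formula(k_in, k_out, h, rw, v)
    print(f"v = {v:>2}: first = {first:>5} bits, second = {second:>5} bits")
```

For large v the first formula grows with roughly 2 · v · kin · h, while the second grows with v · (kin · h + kout), so the second needs fewer registers whenever kout is smaller than kin · h.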
mirror-101 (example image extended by a border of two pixels):
    10  9  8  9 10 11 12 13 14 15 14 13
     6  5  4  5  6  7  8  9 10 11 10  9
     2  1  0  1  2  3  4  5  6  7  6  5
     6  5  4  5  6  7  8  9 10 11 10  9
    10  9  8  9 10 11 12 13 14 15 14 13
    14 13 12 13 14 15 16 17 18 19 18 17
    18 17 16 17 18 19 20 21 22 23 22 21
    22 21 20 21 22 23 24 25 26 27 26 25
    26 25 24 25 26 27 28 29 30 31 30 29
    30 29 28 29 30 31 32 33 34 35 34 33
    26 25 24 25 26 27 28 29 30 31 30 29
    22 21 20 21 22 23 24 25 26 27 26 25
[Figure: a 5 × 5 sliding window feeding f, placed on the mirror-101 extended image.]
Selections:
    00 10 20 30 40
    01 11 21 31 41
    02 12 22 32 42
    03 13 23 33 43
    04 14 24 34 44
i:       0      0      0      1      1      2      3      3      4      4      4
x:       else   1      0      else   0      all    else   W−1    else   W−2    W−1
j  y
0  else  (0,0)  (2,0)  (4,0)  (1,0)  (3,0)  (2,0)  (3,0)  (1,0)  (4,0)  (2,0)  (0,0)
0  1     (0,2)  (2,2)  (4,2)  (1,2)  (3,2)  (2,2)  (3,2)  (1,2)  (4,2)  (2,2)  (0,2)
0  0     (0,4)  (2,4)  (4,4)  (1,4)  (3,4)  (2,4)  (3,4)  (1,4)  (4,4)  (2,4)  (0,4)
1  else  (0,1)  (2,1)  (4,1)  (1,1)  (3,1)  (2,1)  (3,1)  (1,1)  (4,1)  (2,1)  (0,1)
1  0     (0,3)  (2,3)  (4,3)  (1,3)  (3,3)  (2,3)  (3,3)  (1,3)  (4,3)  (2,3)  (0,3)
2  all   (0,2)  (2,2)  (4,2)  (1,2)  (3,2)  (2,2)  (3,2)  (1,2)  (4,2)  (2,2)  (0,2)
3  else  (0,3)  (2,3)  (4,3)  (1,3)  (3,3)  (2,3)  (3,3)  (1,3)  (4,3)  (2,3)  (0,3)
3  H−1   (0,1)  (2,1)  (4,1)  (1,1)  (3,1)  (2,1)  (3,1)  (1,1)  (4,1)  (2,1)  (0,1)
4  else  (0,4)  (2,4)  (4,4)  (1,4)  (3,4)  (2,4)  (3,4)  (1,4)  (4,4)  (2,4)  (0,4)
4  H−2   (0,2)  (2,2)  (4,2)  (1,2)  (3,2)  (2,2)  (3,2)  (1,2)  (4,2)  (2,2)  (0,2)
4  H−1   (0,0)  (2,0)  (4,0)  (1,0)  (3,0)  (2,0)  (3,0)  (1,0)  (4,0)  (2,0)  (0,0)
Table: Is for a 5 × 5 local operator with mirror-101 border mode. The coordinates else and all cover the redundant indices.
[Figure: Separated border handling architecture, drawn with the input pixel, four line buffers, one row selection block, and five column selection blocks.]
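The selection table for the 5 × 5 mirror-101 case has a separable structure: the first component of every entry depends only on i and x, the second only on j and y. That is what allows row and column selection to be split into independent blocks. The Python sketch below recomputes the entries under that reading. The function names are hypothetical, and the window is assumed to be centered on (x, y) with indices 0 to 4 in each direction.

```python
def mirror101(p, n):
    """Reflect position p into [0, n) without repeating the edge pixel."""
    if p < 0:
        return -p
    if p >= n:
        return 2 * n - 2 - p
    return p


def select_1d(k, c, n, size=5):
    """Window index that supplies window element k when the window is centered at c."""
    radius = size // 2
    first = c - radius                      # image position of window element 0
    return mirror101(first + k, n) - first  # back to window coordinates


def selection(i, j, x, y, W, H, size=5):
    """Entry of the selection table: (column index, row index) inside the window."""
    return (select_1d(i, x, W, size), select_1d(j, y, H, size))


# Assumed image size for the demonstration.
W, H = 8, 8
for x, y in ((0, 0), (1, 0), (3, 3), (W - 1, H - 1)):
    print(f"(x, y) = ({x}, {y})")
    for j in range(5):
        print("  " + " ".join(str(selection(i, j, x, y, W, H)) for i in range(5)))
```

Because `select_1d` is called once per axis, the column blocks never need to know y and the row block never needs to know x, which mirrors the separated architecture.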
[Figure: Column selection (mirror-101) and row selection (mirror-101), drawn with the signals input and in0 to in4.]
Type-0: cost CType-0 (reg), critical path TType-0
CriticalPath = { …, for mirror-101, mirror, clamp; T(MUX[2]), for clamp2, constant }
[Figure: register structure labeled Rfetch, Rleft, Rmid, Rright, and R0 right, connected by shift, assign, and input paths.]
Register contents per column x (l0 to l2 and r0 to r2 stand for left and right border values):
    x = W − 4:  17 18 19 20 21 22 23 21 22 23
    x = W − 3:  18 19 20 21 22 23 r0 22 23  0
    x = W − 2:  19 20 21 22 23 r0 r1 23  0  1
    x = W − 1:  20 21 22 23 r0 r1 r2  0  1  2
    x = 0:      l0 l1 l2  0  1  2  3  1  2  3
    x = 1:      l1 l2  0  1  2  3  4  2  3  4
reg = h · kin · (w + rw)
… right can be optimized, …

[Figure: register structure labeled Rnew, Rmid, Rright, and R0 right, connected by a shift path.]
    x = W − 4:  20 21 22 23 r0 r1 r2
    x = W − 3:  21 22 23 r0 r1 r2  0
    x = W − 2:  22 23 r0 r1 r2  0  1
    x = W − 1:  23 r0 r1 r2  0  1  2
    x = 0:       0  1  2  3
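The first register listing can be reproduced with a few lines of Python, which makes the pattern at the line boundary easier to follow. This is only a sketch under assumptions that the slide does not state explicitly: the row holds the pixels 0 to W − 1 with W = 24, the first seven values of each line are a window of width w = 7 centered at x, and the last three values are the next rw = 3 pixels of the input stream, wrapping into the following row. The function names are hypothetical.

```python
def window_contents(x, W, rw):
    """Window centered at x; out-of-row positions become border symbols l* / r*."""
    contents = []
    for p in range(x - rw, x + rw + 1):
        if p < 0:
            contents.append(f"l{p + rw}")   # left border value
        elif p >= W:
            contents.append(f"r{p - W}")    # right border value
        else:
            contents.append(str(p))
    return contents


def fetch_contents(x, W, rw):
    """Next rw stream pixels after x, wrapping around to the start of the next row."""
    return [str((x + 1 + k) % W) for k in range(rw)]


W, rw = 24, 3  # assumed row width and window radius (w = 2 * rw + 1 = 7)
for x in (W - 4, W - 3, W - 2, W - 1, 0, 1):
    values = window_contents(x, W, rw) + fetch_contents(x, W, rw)
    print(f"x = {x:>2}: " + " ".join(f"{s:>2}" for s in values))
```

Each line holds w + rw = 10 values, which matches the factor (w + rw) in the register formula reg = h · kin · (w + rw).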
input: w, h, borderMode, v, kout, kin, designGoal
func selectParetoOptimal(BorderHandlingPattern, CoarseningArch,
                         w, h, borderMode, v, kout, kin, designGoal)
    rw = ⌊w/2⌋
    if borderMode = UNDEFINED then
        if kout < kin · h then
            CoarseningArch ← Calc and Pack
        else
            CoarseningArch ← Fetch and Calc
        end
        BorderHandlingPattern ← none
    else
        if rw · (kin · h − kout + 1) < v · (kin · h − kout) then
            CoarseningArch ← Calc and Pack
        else
            CoarseningArch ← Fetch and Calc
        end
        if borderMode = (CLAMP ∨ CONSTANT) then
            BorderHandlingPattern ← Type-1
        else // borderMode = (MIRROR ∨ MIRROR-101)
            if (designGoal = speed) ∨ ((rw + 1) · MUX[2] − MUX[rvw + 1] − MUX[2] < 0) then
                BorderHandlingPattern ← Type-2
            else
                BorderHandlingPattern ← Type-1
            end
        end
    end
end
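The selection procedure translates almost line by line into executable form. The Python sketch below follows the pseudocode and returns the two outputs as a tuple. Two things are assumptions: the multiplexer cost MUX[n] is passed in as a callable `mux_cost`, because the slide does not define it, and the quantity written as rvw in the pseudocode is taken as a plain input `rvw`. The example cost function (n − 1 two-input multiplexers for an n-input multiplexer tree) is a placeholder, not the authors' cost model.

```python
from typing import Callable, Optional, Tuple


def select_pareto_optimal(
    w: int, h: int, border_mode: str, v: int, k_out: int, k_in: int,
    design_goal: str, mux_cost: Callable[[int], float], rvw: int,
) -> Tuple[Optional[str], str]:
    """Return (border_handling_pattern, coarsening_arch) as in the pseudocode."""
    rw = w // 2  # floor(w / 2)

    if border_mode == "undefined":
        arch = "Calc and Pack" if k_out < k_in * h else "Fetch and Calc"
        return None, arch  # no border handling pattern

    if rw * (k_in * h - k_out + 1) < v * (k_in * h - k_out):
        arch = "Calc and Pack"
    else:
        arch = "Fetch and Calc"

    if border_mode in ("clamp", "constant"):
        pattern = "Type-1"
    elif border_mode in ("mirror", "mirror-101"):
        mux_term = (rw + 1) * mux_cost(2) - mux_cost(rvw + 1) - mux_cost(2)
        pattern = "Type-2" if design_goal == "speed" or mux_term < 0 else "Type-1"
    else:
        raise ValueError(f"unknown border mode: {border_mode}")
    return pattern, arch


def example_mux_cost(n: int) -> int:
    """Placeholder cost: an n-input multiplexer tree uses n - 1 two-input muxes."""
    return n - 1


print(select_pareto_optimal(5, 5, "mirror", 4, 8, 8, "speed", example_mux_cost, rvw=2))
```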
[Figure: two plots, LUT and FF, comparing Calc and Pack (C&P) with Fetch and Calc (F&C) for w = 3 and w = 11; x-axis ticks 1, 2, 4, 8, 16, 32, 64.]
HLS estimation results of the proposed coarsening architectures (target clock frequency is 200 MHz, and no border handling is applied)
[Figure: two plots, LUT and FF, comparing Naïve, Type-0, and Type-1 border handling for w = 11; x-axis ticks 1, 2, 4, 8.]
HLS estimation results of the proposed border handling architectures (target clock frequency is 200 MHz, and border mode is mirror)
Table: HLS implementation results for a Mean Filter.
for (int y = 0; y < IMAGE_HEIGHT; y++) {
    // each iteration produces one data beat of v pixels
    for (int x = 0; x < IMAGE_WIDTH; x += v) {
        *(DataBeatType*)(&out[y][x]) = local_op(stencil_p1(y, x), ..);
    }
}

Architectures
Loop Coarsening Architectures: Schmid’s, Fetch and Calc (F&C), Calc and Pack (C&P)
Border Handling Types: Naive, Separated, Type-0, Type-1, Type-2
Border Handling Modes: Mirror, Mirror-101, Clamp, Constant
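The coarsened loop advances x by v and computes v output pixels per iteration, which must give the same result as the pixel-by-pixel loop. The Python sketch below checks this on a single image row. It is an illustration only: the 1-D mean filter, the clamp border, and all function names are assumptions, and the list comprehension stands in for the v parallel instances of the local operator in hardware.

```python
def local_op(window):
    """Assumed local operator: integer mean of the window."""
    return sum(window) // len(window)


def stencil(row, x, w):
    """Window of width w centered at x, using clamp border handling."""
    rw = w // 2
    last = len(row) - 1
    return [row[min(max(x + d, 0), last)] for d in range(-rw, rw + 1)]


def filter_row(row, w):
    """Reference: one output pixel per iteration."""
    return [local_op(stencil(row, x, w)) for x in range(len(row))]


def filter_row_coarsened(row, w, v):
    """Coarsened loop: x advances by v, each iteration emits one beat of v pixels."""
    if len(row) % v != 0:
        raise ValueError("row width must be a multiple of the coarsening factor v")
    out = []
    for x in range(0, len(row), v):
        beat = [local_op(stencil(row, x + k, w)) for k in range(v)]
        out.extend(beat)
    return out


row = [3, 1, 4, 1, 5, 9, 2, 6]
print(filter_row(row, 3))
print(filter_row_coarsened(row, 3, 4))
```

The two printed lists are identical; coarsening changes how many pixels are produced per iteration, not the values.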
Parameters      | Estimation (CPtar = 20 ns) | Estimation (CPtar = 3.1 ns) | Implementation (CPtar = 3.1 ns)
 v  w/h  Coars. | BRAM     FF    LUT    CPes | BRAM     FF    LUT     CPes | SLICE BRAM     FF   LUT  DSP  CPpsyn  CPimp
 1    5  C&P    |    4    304     93   13.50 |    4    378     93     3.10 |   152    4    600   270   29    2.48   2.55
 1    5  F&C    |    4    304     93   13.50 |    4    378     93     3.10 |   139    4    600   269   29    2.48   2.61
 2    5  C&P    |    4    339     86   14.43 |    4    446     88     3.09 |   215    4    873   448   14    2.52   2.63
 2    5  F&C    |    4    339     86   14.43 |    4    446     88     3.09 |   222    4    873   449   14    2.52   2.46
 8    5  C&P    |    8    663     82   14.43 |    8    954     84     3.06 |   675    8   2589  1565    6    2.39   2.56
 8    5  F&C    |    8    855     82   14.43 |    8   1146     84     3.06 |   603    8   2781  1566    6    2.39   2.74
32    5  C&P    |   32   1995     75   14.43 |   32   3045     77     3.03 |  1951   32   9256  5367    6    2.38   2.83
32    5  F&C    |   32   2955     75   14.43 |   32   4005     77     3.03 |  2023   32  10216  5367    6    2.38   3.09
Table: HLS estimation results for a local operator and implementation results for a Mean Filter (coarsening architectures; CP values in ns).
Parameters        | Estimation (CPtar = 20 ns) | Estimation (CPtar = 3.1 ns) | Implementation (CPtar = 3.1 ns)
 v  w/h  BH Patt. | BRAM     FF    LUT    CPes | BRAM     FF    LUT     CPes | SLICE BRAM     FF   LUT  CPpsyn  CPimp
 1    5  Naïve    |    4    471    575    13.5 |    4    643    576     3.10 |   213    4    879   533    2.63   2.67
 1    5  Type-0   |    4    438    381    13.5 |    4    522    385     3.10 |   192    4    772   422    2.53   2.65
 1    5  Type-1   |    4    398    341    13.5 |    4    482    345     3.10 |   176    4    732   419    2.52   2.58
 1    5  Type-2   |    4    403    459    13.5 |    4    487    468     3.10 |   179    4    742   475    2.52   2.53
 2    5  Naïve    |    4    495    538    14.4 |    4    664    542     3.09 |   288    4   1182   741    2.46   2.65
 2    5  Type-0   |    4    432    344    14.4 |    4    565    349     3.09 |   268    4   1053   625    2.52   2.65
 2    5  Type-1   |    4    432    344    14.4 |    4    565    349     3.09 |   244    4   1053   626    2.52   2.60
 2    5  Type-2   |    4    435    501    14.4 |    4    569    511     3.09 |   262    4   1061   700    2.53   2.80
 1    7  Naïve    |    6    855   1443    15.0 |    6   1434   1446     3.10 |   482    6   1976  1174    2.55   2.74
 1    7  Type-0   |    6    806    867    15.0 |    6   1332    871     3.10 |   415    6   1843   878    2.56   2.85
 1    7  Type-1   |    6    693    642    15.0 |    6   1107    646     3.10 |   381    6   1620   741    2.56   2.83
 1    7  Type-2   |    6    698    808    15.0 |    6    885    817     3.10 |   316    6   1409   851    2.52   2.83
 2    7  Naïve    |    6    957   1308    15.9 |    6   1589   1339     3.09 |   653    6   2715  1595    2.53   2.70
 2    7  Type-0   |    6    862    730    15.9 |    6   1496    735     3.09 |   595    6   2570  1219    2.52   2.95
 2    7  Type-1   |    6    806    674    15.9 |    6   1384    679     3.09 |   554    6   2459  1189    2.52   2.76
 2    7  Type-2   |    6    923   1287    17.2 |    6   1107   1298     16.1 |   514    6   2194  1405    2.53   2.94
 1   11  Naïve    |   10   2033   5390    16.6 |   10   4324   5393     3.10 |  1352   10   5601  3852    2.55   2.87
 1   11  Type-0   |   10   1952   2997    16.6 |   10   3549   3018     3.10 |  1028   10   4781  2627    2.55   2.95
 1   11  Type-1   |   10   1597   1583    16.6 |   10   1892   1594     3.45 |   711   10   3104  1883    2.55   2.67
 1   11  Type-2   |   10   1601   1845    16.6 |   10   1891   1862     3.10 |   685   10   3114  2062    2.52   2.90
 2   11  Naïve    |   10   2184   4562    16.7 |   10   4597   4566     3.09 |  1607   10   6839  4366    2.56   2.91
 2   11  Type-0   |   10   2025   2160    16.7 |   10   3372   2228     3.09 |  1286   10   5663  3183    2.53   2.82
 2   11  Type-1   |   10   1760   1719    16.7 |   10   2843   1787     3.09 |  1204   10   5136  2956    2.56   2.95
 2   11  Type-2   |   10   1989   3564    25.3 |   10   2230   3639     25.3 |  1221   10   4537  3724    2.53   2.85
Table: HLS estimation results for a local operator and implementation results for a Mean Filter (border handling patterns; CP values in ns).
[Figure, two panels fed from the input:
(a) F&C Type-1, which is min(rw, v) = 2 parallel Type-1 column selections for w = 3.
(b) F&C Type-1 mirror border handling for w = 9 and v = 2, which is min(rw, v) = 2 parallel Type-1 column selections for w = 5.]
reg = kin · h · (rw + v · (⌈rw/v⌉ + 1))
reg = kin · h · (2 · rw + v) + kout · (v − (rw mod v))
[Figure: three register layouts built from shift and input registers, each feeding four instances of f.]
For kin = kout = 32 bits, h = 5, rw = 2, and v = 32:
reg = 32 · 5 · (2 + 32 · (⌈2/32⌉ + 1)) = 10560 bits
reg = 32 · 5 · (2 · 2 + 32) + 32 · (32 − 2) = 6720 bits
[Backup figure: the register structure (Rfetch, Rleft, Rmid, Rright, R0 right) and its contents for x = W − 4 to x = 1, repeated from the earlier slide, with reg = h · kin · (w + rw).]
… right in order to initialize all …
… right can be optimized, …