FPGAs for Image Processing
A DSL and program transformations Rob Stewart Greg Michaelson Idress Ibrahim Deepayan Bhowmik Andy Wallace Paulo Garcia
Heriot-Watt University
10 May 2016
What I will say
1. EPSRC Rathlin project interested in
"A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation". D. Thomas et al. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.
program =
  image1 = imread 512 512;
  image2 = imap image1 (λ[.]
  image3 = imap image2 (λ[.]
  image4 = map image3 (λ[x] -> [min 255 (x + 50) ]);
  image4;
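The last stage of the program above, a pointwise map that brightens each pixel by 50 and saturates at 255, can be modelled in software. This is a minimal sketch in Python, not RIPL and not the generated hardware: the names `pixel_map` and `brighten` and the small test image are hypothetical, and images are assumed to be plain lists of rows.

```python
# Software model of the pointwise `map` stage from the RIPL program.
# Assumption: an image is a list of rows of integer pixels.

def pixel_map(image, fn):
    """Apply fn to every pixel, preserving the image dimensions."""
    return [[fn(p) for p in row] for row in image]

def brighten(x):
    """Add 50 and saturate at 255, as in λ[x] -> [min 255 (x + 50)]."""
    return min(255, x + 50)

image3 = [[0, 100, 205], [206, 250, 255]]  # hypothetical 3x2 test image
image4 = pixel_map(image3, brighten)
print(image4)
```

The saturation matters for 8-bit pixels: without `min 255`, any input above 205 would overflow the pixel range.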
[Figure: imap indexing around the current pixel, with positions [.-1], [.], [.+1]]
midpoint
RIPL: λ[.] ([.-1] + [.] + [.+1]) / 3
State transitions: [state diagram with states σ, labelled init and stream]
map : I(M,N) → ([P]A → [P]A) → I(M,N)
imap : I(M,N) → (Pi → P) → I(M,N)
scaleRow : I(M,N) → ([P]A → [P]B) → I(M∗(B/A),N)
scaleCol : I(M,N) → ([P]A → [P]B) → I(M,N∗(B/A))
filter2D : I(M,N) → (x, y) : (Int, Int) → [K](x∗y) → I(M,N)
zipWith : I(M,N) → I(M,N) → ([P]A → [P]A → [P]A) → I(M,N)
unzip : I(M,N) → (Pi → P) → (Pi → P) → (I(M,N), I(M,N))
foldScalar : I(M,N) → Int → (P → Int → Int) → Int
foldVector : I(M,N) → Int → a : Int → (P → [Int]a → [Int]a) → [Int]a
transpose : I(M,N) → I(N,M)
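To make the skeleton signatures concrete, here is a rough software model of four of them in Python. It is only a sketch under stated assumptions: images are lists of rows, M is taken to be the row length, `zip_with` is simplified to one pixel at a time (chunk size A = 1), and all function names are hypothetical stand-ins rather than RIPL's implementation.

```python
# Assumption: an image I(M,N) is a list of N rows, each with M integer pixels.

def scale_row(image, a, b, fn):
    """scaleRow: fn maps each chunk of `a` pixels to `b` pixels,
    so the row length M becomes M * (b / a)."""
    out = []
    for row in image:
        assert len(row) % a == 0, "row length must be a multiple of a"
        new_row = []
        for i in range(0, len(row), a):
            chunk = fn(row[i:i + a])
            assert len(chunk) == b, "fn must return exactly b pixels"
            new_row.extend(chunk)
        out.append(new_row)
    return out

def zip_with(image1, image2, fn):
    """zipWith, simplified to combine two images pixel by pixel."""
    return [[fn(p, q) for p, q in zip(r1, r2)]
            for r1, r2 in zip(image1, image2)]

def fold_scalar(image, init, fn):
    """foldScalar: reduce all pixels to one Int with fn(pixel, acc)."""
    acc = init
    for row in image:
        for p in row:
            acc = fn(p, acc)
    return acc

def transpose(image):
    """transpose: I(M,N) becomes I(N,M)."""
    return [list(col) for col in zip(*image)]

img = [[1, 2, 3, 4], [5, 6, 7, 8]]  # hypothetical test image
halved = scale_row(img, 2, 1, lambda c: [(c[0] + c[1]) // 2])
total = fold_scalar(img, 0, lambda p, acc: acc + p)
print(halved, total, transpose(img))
```

Here `scale_row` with A = 2 and B = 1 halves the row length, matching the I(M∗(B/A),N) result type.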
[Figure labels: runtime scheduling, expressiveness, memory bound]
Transition label [a,b]→[c,d], next state σy, S′
init:
σ0, [0, 0, 0]  --[23,27]→∅-->  σ1, [27, 23, 0]
stream:
σ1, [27, 23, 0]  --[28]→[27]-->  σ1, [27, 23, 28]
σ1, [23, 27, 28]  --[34]→[28]-->  σ1, [34, 23, 28]
σ1, [34, 23, 28]  --[92]→[51]-->  σ1, [34, 92, 28]
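The two-phase behaviour in the trace above (an init state that consumes pixels without producing any, then a stream state that consumes one pixel and produces one) can be sketched as a small state machine. This Python model is an illustration only: it uses a plain sliding window and integer division, so it does not reproduce the slide's store layout or its output values, and the names `imap_window3` and `mean3` are hypothetical. Edge pixels are ignored.

```python
from collections import deque

def imap_window3(stream, fn):
    """Stream pixels through a 3-element store.

    init:   consume the first two pixels, emit nothing.
    stream: consume one pixel, emit fn over the three stored pixels.
    """
    window = deque(maxlen=3)  # the oldest pixel drops out automatically
    state = "init"
    for pixel in stream:
        window.append(pixel)
        if state == "init":
            if len(window) == 2:
                state = "stream"  # store is primed, start producing
            continue
        yield fn(*window)

def mean3(left, centre, right):
    """Integer mean of three neighbouring pixels."""
    return (left + centre + right) // 3

outputs = list(imap_window3([23, 27, 28, 34, 92], mean3))
print(outputs)
```

A generator is used so that the model, like the hardware pipeline it imitates, never holds more than three pixels of the image at once.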
Columns: Slice LUT, Slice registers, Block RAM/FIFO, DSP48E, FMax (MHz)
Naive: 3664 8777 88 49 55.41
Final_XY: 76 80 721.48
Centre_XY: 182 199 530.81
Stream_to_YUV: 90 287 24 420.07
update_model: 1042 2399 30 148.74
YUV2RGB: 300 957 7 126.71
displacement: 545 1326 2 9 73.40
update_weight: 556 1544 14 4 66.46
kArray_derv: 437 1074 1 18 55.44
kArray_evaluation: 460 1148 1 18 55.41
Profile Guided Dataflow Transformation for FPGAs & CPUs.
Dataflow, in The Journal of Signal Processing Systems, Springer, 2015.
Columns: Functionality, Transformation, Registers, Slice LUTs, BRAM, DSP, Clock (MHz)
Stream to YUV
  None: 90 287 24 420.0
  Loop elimination: 27 85 386.7
YUV to RGB
  None: 300 957 7 126.7
  Actor fusion: 99 353 182.8
Displacement
  None: 545 1326 2 9 73.4
  Task parallelism: 791 1210 7 9 110.0
Update weight
  None: 556 1544 14 4 66.5
  Fission: 12352 19878 55 128 72.5
Square root
  Just square root (none): 346 548 4 72.5
  Lookup: 139 227 32 368.2
  Combined: 7907 38544 1028 225.9
k-array derive
  None: 437 1074 1 18 55.4
  Loop promotion: 4447 12484 5 144 52.7