FPGAs for Image Processing A DSL and program transformations Rob - - PowerPoint PPT Presentation




slide-1
SLIDE 1

FPGAs for Image Processing

A DSL and program transformations

Rob Stewart, Greg Michaelson, Idress Ibrahim, Deepayan Bhowmik, Andy Wallace, Paulo Garcia

Heriot-Watt University

10 May 2016

slide-2
SLIDE 2

What I will say

  1. The EPSRC Rathlin project is interested in remote image processing.
  2. We’ve developed a DSL for FPGAs called RIPL.
  3. Dataflow IR transformations between RIPL and the FPGA help.

Low-power, accelerated remote image processing.

slide-3
SLIDE 3

FPGAs vs GPUs

FPGAs
  ✦ energy efficient
  ✦ sometimes faster
  ✪ hard to program
  ✪ hard to optimise

GPUs
  ✦ fast floating point
  ✦ fast SIMD parallelism
  ✪ uses lots of energy
  ✪ poor performance with irregular memory access

slide-4
SLIDE 4

FPGAs vs CPUs

"A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation". D. Thomas et al. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.

slide-5
SLIDE 5

Block RAM on an FPGA

slide-6
SLIDE 6

DSPs on an FPGA

slide-7
SLIDE 7

RIPL in an FPGA

slide-8
SLIDE 8

RIPL in an FPGA

slide-9
SLIDE 9

Part 1 of 4: RIPL skeletons.

slide-10
SLIDE 10

A RIPL program

program =
  image1 = imread 512 512;
  image2 = imap image1 (λ[.] -> ([.-1] + [.] + [.+1]) / 3);
  image3 = imap image2 (λ[.] -> ([.-1] + [.] + [.+1]) / 3);
  image4 = map image3 (λ[x] -> [min 255 (x + 50)]);
  out image4;
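As a rough illustration of what this program computes, the Python sketch below applies the same three stages to a single row of pixels. The names `imap_mean3`, `brighten` and `program` are made up for the sketch, and two behaviours are assumptions because the slides do not state them: neighbours outside the row count as 0, and division truncates.

```python
def imap_mean3(row):
    """Replace each pixel by the mean of itself and its two neighbours.

    Assumption: neighbours outside the row count as 0 and division truncates.
    """
    padded = [0] + list(row) + [0]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) // 3
            for i in range(1, len(padded) - 1)]


def brighten(row, amount=50):
    """Add `amount` to every pixel, saturating at 255, as in min 255 (x + 50)."""
    return [min(255, x + amount) for x in row]


def program(row):
    """Mirror the RIPL program on one row: smooth twice, then brighten."""
    image2 = imap_mean3(row)      # first smoothing pass
    image3 = imap_mean3(image2)   # second smoothing pass
    return brighten(image3)       # image4


if __name__ == "__main__":
    print(program([30, 60, 90, 240, 240]))
```

The real RIPL program works on a 512 x 512 image; a single row is used here only to keep the sketch short.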

slide-11
SLIDE 11

Memory efficient skeletons

[Figure: the midpoint skeleton λ[.] -> ([.-1] + [.] + [.+1]) / 3 drawn as state transitions over the pixel stream, with an init step followed by stream steps over the window [.-1], [.], [.+1].]

Images are just streams of pixels.
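Because an image arrives as a stream, a skeleton such as the midpoint `imap` only needs to hold a small window of pixels, never a whole frame. A minimal Python sketch of that idea follows. The generator name `stream_mean3` is hypothetical, and skipping the edge pixels and truncating the division are assumptions of the sketch, not documented RIPL behaviour.

```python
from collections import deque


def stream_mean3(pixels):
    """Yield the mean of each run of three consecutive pixels.

    Only three pixels are held at any time, regardless of image size.
    """
    window = deque(maxlen=3)          # the skeleton's entire internal state
    for p in pixels:
        window.append(p)              # the oldest pixel falls out automatically
        if len(window) == 3:
            yield sum(window) // 3


if __name__ == "__main__":
    print(list(stream_mean3(iter([23, 27, 28, 34, 92]))))
```

The input can be any iterator, so the same function works whether the pixels come from a list or from a camera feed.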

slide-12
SLIDE 12

RIPL skeletons

map        : I(M,N) → ([P]A → [P]A) → I(M,N)
imap       : I(M,N) → (Pi → P) → I(M,N)
scaleRow   : I(M,N) → ([P]A → [P]B) → I(M∗(B/A),N)
scaleCol   : I(M,N) → ([P]A → [P]B) → I(M,N∗(B/A))
filter2D   : I(M,N) → (x, y) : (Int, Int) → [K](x∗y) → I(M,N)
zipWith    : I(M,N) → I(M,N) → ([P]A → [P]A → [P]A) → I(M,N)
unzip      : I(M,N) → (Pi → P) → (Pi → P) → (I(M,N), I(M,N))
foldScalar : I(M,N) → Int → (P → Int → Int) → Int
foldVector : I(M,N) → Int → a : Int → (P → [Int]a → [Int]a) → [Int]a
transpose  : I(M,N) → I(N,M)
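To give a feel for these signatures, here is a small Python sketch of three of the skeletons over images stored as lists of rows. The function names `zip_with`, `fold_scalar` and `transpose` are Python stand-ins, and the sketch assumes the simplest case in which the user function sees one pixel at a time (A = 1). It says nothing about how RIPL implements them in hardware.

```python
def zip_with(image_a, image_b, f):
    """Combine two equally sized images pixel by pixel."""
    return [[f(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]


def fold_scalar(image, init, f):
    """Reduce an image to one integer; f takes (pixel, accumulator)."""
    acc = init
    for row in image:
        for pixel in row:
            acc = f(pixel, acc)
    return acc


def transpose(image):
    """Swap rows and columns."""
    return [list(col) for col in zip(*image)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[10, 20], [30, 40]]
    print(zip_with(a, b, lambda x, y: x + y))
    print(fold_scalar(a, 0, lambda p, acc: acc + p))
    print(transpose([[1, 2, 3], [4, 5, 6]]))
```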

slide-13
SLIDE 13

RIPL to FPGAs

  1. Use algorithmic skeletons.
  2. Compile RIPL → pipelined parallel dataflow graphs.
  3. Optimise: apply dataflow transformations.
  4. Compile dataflow graph → hardware description with Verilog.
  5. Synthesise Verilog for an FPGA.
  6. Send bitstream to the FPGA.
slide-14
SLIDE 14

Part 2 of 4: RIPL to dataflow.

slide-15
SLIDE 15

RIPL to dataflow

slide-16
SLIDE 16

RIPL's dataflow constraints

[Figure: the SDF, CSDF and DPN dataflow models compared on runtime scheduling, expressiveness and memory bound.]

slide-17
SLIDE 17

RIPL's small step dataflow semantics

A skeleton implementation is a set of transition rules.

$$\sigma_x, S \xrightarrow{[a,b] \rightarrow [c,d]} \sigma_y, S'$$

  • Transition from σx to σy
  • Start with internal state S, end with S′
  • Consumes [a, b] pixels, generates [c, d] pixels
  • "What" is computed is defined by the RIPL programmer
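One way to read such a rule is as a step function that takes the current state and some input pixels and returns the next state and the output pixels. The Python sketch below does this for a three-pixel mean. The state names, the two-pixel buffer and the truncating division are assumptions chosen for the sketch, so it illustrates the shape of a transition rule without reproducing RIPL's actual rules.

```python
from itertools import islice


def step(state, buf, pixels):
    """One transition: (state, buf) consumes `pixels`, returns (state', buf', outputs)."""
    if state == "sigma0":
        # Priming rule: consume two pixels, emit nothing.
        a, b = pixels
        return "sigma1", [a, b], []
    # Steady-state rule: consume one pixel, emit one pixel.
    (c,) = pixels
    window = buf + [c]
    return "sigma1", window[1:], [sum(window) // 3]


def run(pixels):
    """Fire transitions until the stream cannot satisfy the next rule."""
    state, buf, out = "sigma0", [], []
    stream = iter(pixels)
    while True:
        need = 2 if state == "sigma0" else 1   # pixels the next rule consumes
        chunk = list(islice(stream, need))
        if len(chunk) < need:
            return out
        state, buf, emitted = step(state, buf, chunk)
        out.extend(emitted)


if __name__ == "__main__":
    print(run([23, 27, 28, 34, 92]))
```

Keeping `step` free of side effects makes each rule easy to check in isolation, which is the main appeal of a small-step formulation.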
slide-18
SLIDE 18

RIPL's small step dataflow semantics

image2 = imap image1 (λ[.] -> ([.-1] + [.] + [.+1]) / 3);

[Figure: the midpoint skeleton drawn as state transitions over the pixel stream, with an init step followed by stream steps over the window [.-1], [.], [.+1].]

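$$
\begin{aligned}
\sigma_0, [0, 0, 0] &\xrightarrow{[23,27] \rightarrow \emptyset} \sigma_1, [27, 23, 0] \\
\sigma_1, [27, 23, 0] &\xrightarrow{[28] \rightarrow [27]} \sigma_1, [27, 23, 28] \\
\sigma_1, [27, 23, 28] &\xrightarrow{[34] \rightarrow [28]} \sigma_1, [34, 23, 28] \\
\sigma_1, [34, 23, 28] &\xrightarrow{[92] \rightarrow [51]} \sigma_1, [34, 92, 28]
\end{aligned}
$$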

slide-19
SLIDE 19

Part 3 of 4: optimising dataflow.

slide-20
SLIDE 20

Dataflow profiling

Find bottlenecks using the open-source TURNUS tool:

  • critical dataflow path
  • actors with high computational latency
  • low clock frequency
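To make "critical dataflow path" concrete, the sketch below finds the highest-latency path through a small acyclic actor graph. It is not TURNUS's API or its algorithm: the function `critical_path`, the actor names and the latency numbers are all invented for illustration, and the graph is assumed to contain no cycles.

```python
from functools import lru_cache


def critical_path(latency, edges):
    """Return (total latency, path) of the costliest path in an acyclic graph.

    latency: dict mapping actor name to its latency.
    edges:   list of (producer, consumer) pairs.
    """
    succ = {actor: [] for actor in latency}
    for src, dst in edges:
        succ[src].append(dst)

    @lru_cache(maxsize=None)
    def best_from(actor):
        # Costliest continuation after this actor, or nothing at a sink.
        tails = [best_from(n) for n in succ[actor]]
        tail_cost, tail_path = max(tails, default=(0, ()))
        return latency[actor] + tail_cost, (actor,) + tail_path

    return max(best_from(actor) for actor in latency)


if __name__ == "__main__":
    # Hypothetical actors and latencies.
    latency = {"read": 1, "blur": 4, "threshold": 2, "write": 1}
    edges = [("read", "blur"), ("read", "threshold"),
             ("blur", "write"), ("threshold", "write")]
    print(critical_path(latency, edges))
```

An actor that sits on this path and also has a high latency is a natural first candidate for transformation.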
slide-21
SLIDE 21

Actor              Slice LUT  Slice registers  Block RAM/FIFO  DSP48E  FMax (MHz)
Naive                   3664             8777              88      49       55.41
Final_XY                  76               80               -       -      721.48
Centre_XY                182              199               -       -      530.81
Stream_to_YUV             90              287              24       -      420.07
update_model            1042             2399              30       -      148.74
YUV2RGB                  300              957               7       -      126.71
displacement             545             1326               2       9       73.40
update_weight            556             1544              14       4       66.46
kArray_derv              437             1074               1      18       55.44
kArray_evaluation        460             1148               1      18       55.41

slide-22
SLIDE 22

Manual dataflow transformation

Profile Guided Dataflow Transformation for FPGAs & CPUs. R. Stewart, D. Bhowmik, G. Michaelson, A. Wallace. Special Issue on Dataflow, in The Journal of Signal Processing Systems, Springer, 2015.

Functionality   Transformation           Registers  Slice LUTs  BRAM  DSP  Clock (MHz)
Stream to YUV   None                            90         287    24    -        420.0
Stream to YUV   Loop elimination                27          85     -    -        386.7
YUV to RGB      None                           300         957     7    -        126.7
YUV to RGB      Actor fusion                    99         353     -    -        182.8
Displacement    None                           545        1326     2    9         73.4
Displacement    Task parallelism               791        1210     7    9        110.0
Update weight   None                           556        1544    14    4         66.5
Update weight   Fission                      12352       19878    55  128         72.5
Square root     Just square root (none)        346         548     -    4         72.5
Square root     Lookup                         139         227    32    -        368.2
Square root     Combined                      7907       38544  1028    -        225.9
k-array derive  None                           437        1074     1   18         55.4
k-array derive  Loop promotion                4447       12484     5  144         52.7
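The "Lookup" transformation trades computation for memory: results are precomputed into a table so that each use becomes a single read. A minimal Python sketch of that trade is shown below. The 8-bit input range, the integer square root and the names `SQRT_TABLE` and `sqrt_lookup` are assumptions for illustration. The slides give only the transformation's name and its resource figures.

```python
import math

# Assumption: inputs are 8-bit values, so 256 entries cover every case.
SQRT_TABLE = [math.isqrt(v) for v in range(256)]


def sqrt_lookup(v):
    """Integer square root by table read instead of arithmetic."""
    return SQRT_TABLE[v]


if __name__ == "__main__":
    print(sqrt_lookup(144))
```

On an FPGA a table like this would typically sit in on-chip memory, which fits the general pattern of a lookup using memory in place of arithmetic resources.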

slide-23
SLIDE 23

Interactive dataflow transformation

Task parallel decomposition video: http://goo.gl/awBWg4

Data parallel fan out/fan in video: http://goo.gl/0iwVCM
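The data parallel fan out/fan in pattern can be sketched in a few lines of Python: distribute rows round-robin over several workers, process each share, then merge the results back into the original order. The helper names `fan_out`, `fan_in` and `data_parallel` are hypothetical, and the workers run sequentially here. On an FPGA each share would go to its own copy of the actor.

```python
def fan_out(rows, n):
    """Deal rows round-robin into n shares."""
    return [rows[i::n] for i in range(n)]


def fan_in(parts):
    """Merge round-robin shares back into the original order."""
    n = len(parts)
    total = sum(len(p) for p in parts)
    return [parts[i % n][i // n] for i in range(total)]


def data_parallel(rows, f, n):
    """Apply f to every row via n workers; the result equals a plain map."""
    shares = fan_out(rows, n)
    processed = [[f(row) for row in share] for share in shares]
    return fan_in(processed)


if __name__ == "__main__":
    print(data_parallel(list(range(7)), lambda x: x * 10, 3))
```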

slide-24
SLIDE 24

Part 4 of 4: evaluation.

slide-25
SLIDE 25
slide-26
SLIDE 26

Power performance

Sub-module               Power (W)
Camera 24MHz             0.043330
Camera 100MHz            0.106660
Visual Saliency 50MHz    0.045940
Visual Saliency 85MHz    0.078098
Visual Saliency 100MHz   0.091310

slide-27
SLIDE 27

Space performance

Resource                    Usage  Occupation
DSP48E1s                        3          1%
FIFO36E1s                       2          1%
External IOB33s                80         40%
RAMB18E1s                     135         48%
RAMB36E1s                      26         18%
Slices                       2812         21%
Slice Registers              4989          4%
Slice LUTS                   7357         13%
Slice LUT-Flip Flop pairs    8457         15%

slide-28
SLIDE 28

Throughput performance

        Processing time (ms)  Frame rate (FPS)
FPGA                 19.0623                52
CPU                      189                 5

Current experiments show RIPL performance of 50-160 FPS.
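The frame rates follow directly from the processing times if frames are handled one after another, which is the assumption made in this short Python check. The function name `frames_per_second` is just for the sketch.

```python
def frames_per_second(ms_per_frame):
    """Frames per second when each frame takes ms_per_frame milliseconds."""
    return 1000.0 / ms_per_frame


fpga_fps = frames_per_second(19.0623)
cpu_fps = frames_per_second(189)
speedup = 189 / 19.0623   # ratio of CPU time to FPGA time

if __name__ == "__main__":
    print(int(fpga_fps), int(cpu_fps), round(speedup, 1))
```

Truncating the two rates gives the whole-number frame rates reported for the FPGA and the CPU, and the ratio of the processing times is just under ten.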

slide-29
SLIDE 29

Our contribution

  • A new image processing DSL for FPGAs.
  • Small step operational dataflow semantics for skeletons.
  • Identified profiling metrics that matter for FPGAs.
  • A graphical dataflow transformations framework.
  • FPGA-based image processing system architecture.
slide-30
SLIDE 30

Future work

  • Evaluate RIPL's expressivity for real-world computer vision.
  • Many dataflow implementations for each skeleton.
  • Machine learning to construct & prune the search space of all possible dataflow representations of a single RIPL program.
  • Integrate transformations with the dataflow profiling tool.
  • Automated compiler-based transformation.

Thanks. R.Stewart@hw.ac.uk