FPGAs for Image Processing A DSL and program transformations Rob - PowerPoint PPT Presentation

FPGAs for Image Processing: A DSL and program transformations
Rob Stewart, Greg Michaelson, Idress Ibrahim, Deepayan Bhowmik, Andy Wallace, Paulo Garcia
Heriot-Watt University, 10 May 2016

What I will say
1. The EPSRC Rathlin project is interested in remote image processing.
2. We've developed a DSL for FPGAs called RIPL.
3. Dataflow IR transformations between RIPL and the FPGA help.
The goal: low powered, accelerated remote image processing.

FPGAs vs GPUs
FPGAs
✦ energy efficient
✦ sometimes faster
✪ hard to program
✪ hard to optimise
GPUs
✦ fast floating point
✦ fast SIMD parallelism
✪ uses lots of energy
✪ poor performance with irregular memory access

FPGAs vs CPUs
"A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation". D. Thomas et al. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.

Block RAM on an FPGA

DSPs on an FPGA

RIPL in an FPGA

Part 1 of 4: RIPL skeletons.

A RIPL program
program =
  image1 = imread 512 512;
  image2 = imap image1 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
  image3 = imap image2 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
  image4 = map image3 (λ [x] -> [min 255 (x + 50)]);
  out image4;
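To make the data flow of this program concrete, here is a minimal Python sketch of what the three stages compute. It is not the RIPL compiler's behaviour: the names `imap`, `mean3`, `brighten` and `program` are hypothetical helpers, the neighbours `[.-1]` and `[.+1]` are assumed to be the left and right pixels in the same row, edges are assumed to be clamped, and division is assumed to be integer division.

```python
from typing import Callable, List

Image = List[List[int]]

def imap(image: Image, f: Callable[[List[int], int], int]) -> Image:
    """Apply f at every pixel position; f sees the whole row and the index."""
    return [[f(row, i) for i in range(len(row))] for row in image]

def mean3(row: List[int], i: int) -> int:
    # Assumption: out-of-range neighbours are clamped to the nearest edge pixel.
    left = row[max(i - 1, 0)]
    right = row[min(i + 1, len(row) - 1)]
    return (left + row[i] + right) // 3

def brighten(image: Image) -> Image:
    # map with a one-pixel window: add 50 and saturate at 255.
    return [[min(255, x + 50) for x in row] for row in image]

def program(image1: Image) -> Image:
    image2 = imap(image1, mean3)   # first 3-pixel mean
    image3 = imap(image2, mean3)   # second 3-pixel mean
    return brighten(image3)        # brightness shift with clamping

result = program([[0, 30, 60, 240, 250]])
```

Each stage only needs a small window of the pixel stream, which is what lets the stages be chained as a pipeline.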

Memory efficient skeletons
RIPL: λ [.] ([.-1] + [.] + [.+1]) / 3
[Figure: state transitions for this function, with an init phase and a stream phase over a three-slot window (indices 0 to 2).]
Images are just streams of pixels.

RIPL skeletons
map : I(M, N) → ([P]_A → [P]_A) → I(M, N)
imap : I(M, N) → (P_i → P) → I(M, N)
scaleRow : I(M, N) → ([P]_A → [P]_B) → I(M ∗ (B/A), N)
scaleCol : I(M, N) → ([P]_A → [P]_B) → I(M, N ∗ (B/A))
filter2D : I(M, N) → (x, y) : (Int, Int) → [K]_(x∗y) → I(M, N)
zipWith : I(M, N) → I(M, N) → ([P]_A → [P]_A → [P]_A) → I(M, N)
unzip : I(M, N) → (P_i → P) → (P_i → P) → (I(M, N), I(M, N))
foldScalar : I(M, N) → Int → (P → Int → Int) → Int
foldVector : I(M, N) → Int → a : Int → (P → [Int]_a → [Int]_a) → [Int]_a
transpose : I(M, N) → I(N, M)
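As a rough illustration of how the shapes in these signatures behave, the following Python sketch models three of the skeletons on nested lists. The function names `scale_row`, `fold_scalar` and `transpose` are hypothetical, and the sketch assumes M is the row length and that the row length is a multiple of A; it says nothing about how RIPL implements them in hardware.

```python
from typing import Callable, List

Image = List[List[int]]

def scale_row(image: Image, a: int, f: Callable[[List[int]], List[int]]) -> Image:
    """Replace each chunk of a pixels in a row with the pixels f returns for it."""
    return [[p for k in range(0, len(row), a) for p in f(row[k:k + a])]
            for row in image]

def fold_scalar(image: Image, init: int, f: Callable[[int, int], int]) -> int:
    """Reduce the whole pixel stream to one integer, starting from init."""
    acc = init
    for row in image:
        for p in row:
            acc = f(p, acc)
    return acc

def transpose(image: Image) -> Image:
    """Swap rows and columns: I(M, N) becomes I(N, M)."""
    return [list(col) for col in zip(*image)]

image = [[1, 2, 3, 4], [5, 6, 7, 8]]
halved = scale_row(image, 2, lambda chunk: [sum(chunk) // 2])  # A = 2, B = 1
total = fold_scalar(image, 0, lambda p, acc: acc + p)
flipped = transpose(image)
```

With A = 2 and B = 1 the row length goes from 4 to 4 ∗ (1/2) = 2, matching the `scaleRow` result type.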

RIPL to FPGAs
1. Use algorithmic skeletons.
2. Compile RIPL → pipelined parallel dataflow graphs.
3. Optimise: apply dataflow transformations.
4. Compile dataflow graph → hardware description in Verilog.
5. Synthesise Verilog for an FPGA.
6. Send bitstream to the FPGA.

Part 2 of 4: RIPL to dataflow.

RIPL to dataflow

RIPL's dataflow constraints
[Table: the SDF, CSDF and DPN dataflow models compared on memory bound, runtime scheduling and expressiveness.]

RIPL's small step dataflow semantics
A skeleton implementation is a set of transition rules:
$$\langle \sigma_x, S \rangle \xrightarrow{[a,b] \mapsto [c,d]} \langle \sigma_y, S' \rangle$$
• Transition from σ_x to σ_y
• Start with internal state S, end with S′
• Consumes [a, b] pixels, generates [c, d] pixels
• "What" is computed is defined by the RIPL programmer

RIPL's small step dataflow semantics
image2 = imap image1 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
[Figure: state transitions for this imap, with an init phase and a stream phase over a three-slot window (indices 0 to 2).]
$$\langle \sigma_0, [0,0,0] \rangle \xrightarrow{[23,27] \mapsto \emptyset} \langle \sigma_1, [27,23,0] \rangle$$
$$\langle \sigma_1, [27,23,0] \rangle \xrightarrow{[28] \mapsto [27]} \langle \sigma_1, [27,23,28] \rangle$$
$$\langle \sigma_1, [27,23,28] \rangle \xrightarrow{[34] \mapsto [28]} \langle \sigma_1, [34,23,28] \rangle$$
$$\langle \sigma_1, [34,23,28] \rangle \xrightarrow{[92] \mapsto [51]} \langle \sigma_1, [34,92,28] \rangle$$
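The state updates in this trace can be mimicked with a small Python state machine. This is only a sketch: `run_imap`, `INIT_SLOTS` and `STREAM_SLOTS` are hypothetical names, the slot order is inferred from the example states rather than taken from any RIPL documentation, and the emitted value is assumed to be the integer mean of the three slots, so emitted values may not match the slide in every step.

```python
from typing import List, Tuple

# Slot order inferred from the example trace (an assumption):
# the two init pixels land in slots 1 and 0, then the stream phase cycles 2, 0, 1.
INIT_SLOTS = [1, 0]
STREAM_SLOTS = [2, 0, 1]

def run_imap(pixels: List[int]) -> Tuple[List[List[int]], List[int]]:
    """Feed pixels through the two-phase actor; return state snapshots and outputs."""
    state = [0, 0, 0]
    snapshots: List[List[int]] = []
    outputs: List[int] = []
    # init phase (sigma_0 -> sigma_1): consume two pixels, emit nothing.
    for slot, pixel in zip(INIT_SLOTS, pixels[:2]):
        state[slot] = pixel
    snapshots.append(list(state))
    # stream phase (sigma_1 -> sigma_1): consume one pixel, emit one pixel.
    for i, pixel in enumerate(pixels[2:]):
        state[STREAM_SLOTS[i % 3]] = pixel
        snapshots.append(list(state))
        outputs.append(sum(state) // 3)  # assumed integer mean of the window
    return snapshots, outputs

snapshots, outputs = run_imap([23, 27, 28, 34, 92])
```

The point of the three-slot buffer is that the actor never stores more than three pixels, however large the image is.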

Part 3 of 4: optimising dataflow.

Dataflow profiling
Find bottlenecks using the open source TURNUS tool:
• critical dataflow path
• actors with high computational latency
• low clock frequency

Design / actor      Slice LUT   Slice registers   Block RAM/FIFO   DSP48E   FMax (MHz)
Naive               3664        8777              88               49       55.41
Final_XY            76          80                0                0        721.48
Centre_XY           182         199               0                0        530.81
Stream_to_YUV       90          287               24               0        420.07
update_model        1042        2399              30               0        148.74
YUV2RGB             300         957               7                0        126.71
displacement        545         1326              2                9        73.40
update_weight       556         1544              14               4        66.46
kArray_derv         437         1074              1                18       55.44
kArray_evaluation   460         1148              1                18       55.41
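One way to read the per-actor figures above is that the slowest actor sets the clock for the whole design: the Naive FMax of 55.41 MHz equals that of kArray_evaluation. The short Python sketch below ranks the actors by FMax to surface transformation candidates; the variable names are hypothetical and this is not part of TURNUS.

```python
# FMax per actor in MHz, copied from the profiling table.
actor_fmax_mhz = {
    "Final_XY": 721.48,
    "Centre_XY": 530.81,
    "Stream_to_YUV": 420.07,
    "update_model": 148.74,
    "YUV2RGB": 126.71,
    "displacement": 73.40,
    "update_weight": 66.46,
    "kArray_derv": 55.44,
    "kArray_evaluation": 55.41,
}

# The actor with the lowest FMax is the clock-frequency bottleneck.
slowest = min(actor_fmax_mhz, key=actor_fmax_mhz.get)

# Ranking from slowest to fastest gives an order in which to try transformations.
ranked = sorted(actor_fmax_mhz.items(), key=lambda kv: kv[1])
```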

Manual dataflow transformation
Profile Guided Dataflow Transformation for FPGAs & CPUs. R. Stewart, D. Bhowmik, G. Michaelson, A. Wallace. Special Issue on Dataflow, in The Journal of Signal Processing Systems, Springer, 2015.

Functionality    Transformation            Registers   Slice LUTs   BRAM   DSP   Clock (MHz)
Stream to YUV    None                      90          287          24     0     420.0
Stream to YUV    Loop elimination          27          85           0      0     386.7
YUV to RGB       None                      300         957          7      0     126.7
YUV to RGB       Actor fusion              99          353          0      0     182.8
Displacement     None                      545         1326         2      9     73.4
Displacement     Task parallelism          791         1210         7      9     110.0
Update weight    None                      556         1544         14     4     66.5
Update weight    Fission                   12352       19878        55     128   72.5
Update weight    Just square root (none)   346         548          0      4     72.5
Update weight    Square root lookup        139         227          32     0     368.2
Update weight    Combined                  7907        38544        1028   0     225.9
k-array derive   None                      437         1074         1      18    55.4
k-array derive   Loop promotion            4447        12484        5      144   52.7

Interactive dataflow transformation
Task parallel decomposition video: http://goo.gl/awBWg4
Data parallel fan out/fan in video: http://goo.gl/0iwVCM

Part 4 of 4: evaluation.

Power performance
Sub-module                Power (W)
Camera 24 MHz             0.043330
Camera 100 MHz            0.106660
Visual Saliency 50 MHz    0.045940
Visual Saliency 85 MHz    0.078098
Visual Saliency 100 MHz   0.091310
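A quick check on the Visual Saliency figures suggests that power grows roughly in proportion to clock frequency. The Python sketch below just divides each reading by its clock; the variable names are hypothetical and no power model beyond this ratio is implied.

```python
# Visual Saliency power readings in watts, keyed by clock frequency in MHz.
saliency_power_w = {50: 0.045940, 85: 0.078098, 100: 0.091310}

# Milliwatts per MHz at each clock setting.
mw_per_mhz = {mhz: 1000 * watts / mhz for mhz, watts in saliency_power_w.items()}
```

All three ratios come out between 0.91 and 0.92 mW per MHz, so for this sub-module the clock can be traded against power almost linearly over the measured range.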

Space performance
Resource                    Usage   Occupation
DSP48E1s                    3       1%
FIFO36E1s                   2       1%
External IOB33s             80      40%
RAMB18E1s                   135     48%
RAMB36E1s                   26      18%
Slices                      2812    21%
Slice Registers             4989    4%
Slice LUTs                  7357    13%
Slice LUT-Flip Flop pairs   8457    15%

Throughput performance
        Processing time (ms)   Frame rate (FPS)
FPGA    19.0623                52
CPU     189                    5
Current experiments show RIPL performance of 50-160 FPS.
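The frame rates follow directly from the per-frame processing times. As a worked check in Python (the function name `frames_per_second` is a hypothetical helper):

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame processing time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

fpga_fps = frames_per_second(19.0623)  # about 52.5, reported as 52
cpu_fps = frames_per_second(189)       # about 5.3, reported as 5

# Ratio of CPU time to FPGA time per frame.
speedup = 189 / 19.0623
```

On these two measurements the FPGA is a little under ten times faster per frame than the CPU.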

Our contribution
• A new image processing DSL for FPGAs.
• Small step operational dataflow semantics for skeletons.
• Identified profiling metrics that matter for FPGAs.
• A graphical dataflow transformations framework.
• FPGA-based image processing system architecture.

Future work
• Evaluate RIPL's expressivity for real world computer vision.
• Many dataflow implementations for each skeleton.
• Machine learning to construct and prune the search space of all possible dataflow representations of a single RIPL program.
• Integrate transformations with the dataflow profiling tool.
• Automated compiler based transformation.
Thanks. R.Stewart@hw.ac.uk
