Automatic Streamization of Image Processing Applications LCPC 2014

DSL & Streaming Language | Compilation and Execution Model | Optimizations | Experimental Results
Automatic Streamization of Image Processing Applications
LCPC 2014
Pierre Guillou, Fabien Coelho, François Irigoin
MINES ParisTech, PSL Research University
Hillsboro, OR, September 15, 2014

Context
- Image processing applications
- Computing systems:
    - CPUs (multi/many cores)
    - Accelerators (GPUs, FPGAs...)

DSL → Streaming Language → Manycore Accelerator
Domain Specific Languages:
- High-level
- Easy-to-use
- Hardware agnostic
- C embedded language: FREIA
Streaming languages:
- Easily target multi/many-core architectures
- Image processing applications
- Verbose
- Examples: StreamIt, Sigma-C

Manycore Processor
[Figure: block diagram labeled Host CPU, Host RAM, PCI-Express, I/O clusters, DDR interface, Attached DDR3, Compute clusters]
Kalray MPPA-256:
- 256 VLIW cores
- 2 MB/cluster
- 10 W

Outline
1. DSL & Streaming Language
2. Compilation and Execution Model
3. Optimizations
4. Experimental Results

Image Processing DSL: FREIA
FRamework for Embedded Image Applications:
- Sequential embedded C code
- High-level image processing operators
Example:
    freia_aipo_erode_8c(im1, im0, kernel);   // morphological
    freia_aipo_dilate_8c(im2, im1, kernel);  // morphological
    freia_aipo_and(im3, im2, im0);           // arithmetic
[Figure: dataflow graph of the example: im0 → ero → im1 → dil → im2 → and → im3, with im0 also feeding and]

Image Operators
- Arithmetic operators: unary and binary (+ − × / min max = & | ∼)
- Morphological operators: selection + min/max/avg
- Reduction operators: min/max/sum

Sigma-C Agents
[Figure: agent foo with two inputs (input 0, input 1) and one output]
    agent foo() {
      interface {                  // define I/O channels
        in<int> in0, in1;          // 2 input integer channels
        out<int> out0;             // 1 output integer channel
        spec{in0[2], in1,          // define flow scheduling
             out0[3]};
      }
      void start() exchange        // DO SOMETHING!
          (in0 i0[2], in1 i1, out0 o[3]) {
        o[0] = i0[0], o[1] = i1, o[2] = i0[1];
      }
    }

From Agents to Subgraphs
[Figure: subgraph bar containing Agent 1, Agent 2, Subgraph 3, Agent 4 and Agent 5]
    subgraph bar() {
      interface {                        // define I/O channels
        in<int> in0[2];
        out<int> out0, out1;
        spec{ { in0[][3]; out0 }; { out1[2] } };
      }
      map {
        agent a1 = new Agent1();         // instantiate agents
        agent a3 = new Subgraph3();
        ...
        connect(in0[0], a1.input0);      // I/O connections
        ...
        connect(a5.output, out1);
        connect(a1.output0, a2.input);   // internal connections
        ...
        connect(a3.output, a5.input1);
      }
    }

Input & Output
From FREIA sequential C code:
    freia_aipo_erode_8c(im1, im0, kernel);   // morphological
    freia_aipo_dilate_8c(im2, im1, kernel);  // morphological
    freia_aipo_and(im3, im2, im0);           // arithmetic
To Sigma-C subgraph:
    subgraph foo() {
      int16_t kernel[9] = {0,1,0, 0,1,0, 0,1,0};
      ...
      agent ero = new img_erode(kernel);
      agent dil = new img_dilate(kernel);
      agent and = new img_and_img();
      ...
      connect(ero.output, dil.input);
      connect(dil.output, and.input);
      ...
    }
[Figure: dataflow graph: im0 → ero → im1 → dil → im2 → and → im3, with im0 also feeding and]

From DSL Code to Streaming Code
1. Build sequences of basic image operations
    - composed operator inlining
    - partial evaluation
    - loop unrolling
2. Extract and optimize image expressions → DAG
    - common subexpression elimination
    - unused image computations removal
    - copy propagation
3. Generate target code
    - 1 DAG corresponds to 1 subgraph
    - 1 vertex corresponds to 1 agent
    - Subgraph activation
4. Use image operator library
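As an illustration of the DAG optimization step, common subexpression elimination over a list of image operations might look like the following. This is only a sketch, not the compiler's actual implementation: the `img_op` record, the operator codes, `MAX_IMG` and the function name are all hypothetical, and it assumes each image id is written once and that removed images are intermediates rather than program outputs.

```c
#include <assert.h>

#define MAX_IMG 64                      /* hypothetical bound on image ids */

enum { OP_ERODE, OP_DILATE, OP_AND };   /* hypothetical operator codes */

typedef struct {
    int op;        /* operator code */
    int in0, in1;  /* input image ids; in1 is -1 for unary operators */
    int out;       /* output image id, assumed to be written only once */
} img_op;

/* Drops every operation that recomputes an (op, in0, in1) triple already
   kept, and rewires later uses to the first result.
   Compacts ops[] in place and returns the number of operations kept. */
int eliminate_common_subexpr(img_op *ops, int n)
{
    int alias[MAX_IMG];   /* alias[i] = image that now stands for image i */
    int kept = 0;

    for (int i = 0; i < MAX_IMG; i++)
        alias[i] = i;

    for (int i = 0; i < n; i++) {
        img_op o = ops[i];
        assert(o.in0 >= 0 && o.in0 < MAX_IMG && o.in1 < MAX_IMG);
        assert(o.out >= 0 && o.out < MAX_IMG);

        /* Read inputs through the alias table first. */
        o.in0 = alias[o.in0];
        if (o.in1 >= 0)
            o.in1 = alias[o.in1];

        /* Look for an identical operation among those already kept. */
        int dup = -1;
        for (int j = 0; j < kept && dup < 0; j++)
            if (ops[j].op == o.op && ops[j].in0 == o.in0 && ops[j].in1 == o.in1)
                dup = j;

        if (dup >= 0)
            alias[o.out] = ops[dup].out;   /* reuse the earlier result */
        else
            ops[kept++] = o;
    }
    return kept;
}
```

For example, given `erode(im0) → im1`, `erode(im0) → im2`, `and(im1, im2) → im3`, the sketch keeps two operations and rewrites the last one to `and(im1, im1) → im3`. It does not canonicalize the operand order of commutative operators.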

Execution Scheme
[Figure: timeline across four lanes (Control code, Host run-time, Accelerator run-time, Compute cores) with the labels: load from HD, launch a, transfer, stream images, agent 1a ... agent na, store result, launch b, transfer, stream images, agent 1b, store result, write on HD]

Mapping Sigma-C Graphs
Graph throughput constraints:
- Slowest node in critical path ⇒ split slow nodes, merge fast nodes
Agent constraints:
- #agents ≤ 256 (1 agent / compute core)
- mem(1 agent) ≤ 128 kB (2 MB for 16 cores)
- Fixed iteration overhead: pack pixels
Mapping constraints:
- NoC comms between clusters: use few clusters
- Constant activation time: use few large graphs
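The per-agent memory bound follows from dividing a cluster's 2 MB among its 16 cores. The arithmetic can be sketched as below; the function names are hypothetical, and the 2-byte pixel size used in the usage note is an assumption (suggested only by the `int16_t` kernel in the Sigma-C example).

```c
#include <assert.h>
#include <stddef.h>

/* Memory budget of one agent, in kB, when a cluster's memory is shared
   evenly by its cores (one agent per compute core). */
size_t agent_budget_kb(size_t cluster_kb, size_t cores_per_cluster)
{
    assert(cores_per_cluster > 0);
    return cluster_kb / cores_per_cluster;
}

/* Returns 1 if a buffer of `rows` rows of `width` pixels fits in the
   budget, 0 otherwise. Ignores code and stack size, which a real agent
   must also fit in the same memory. */
int rows_fit(size_t width, size_t rows, size_t bytes_per_pixel, size_t budget_kb)
{
    return width * rows * bytes_per_pixel <= budget_kb * 1024;
}
```

With these definitions, `agent_budget_kb(2048, 16)` gives 128. Under the 2-byte pixel assumption, three 512-pixel rows (3 kB) fit easily, whereas a whole 512 × 512 image (512 kB) does not, which is consistent with agents operating on rows rather than on full images.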

Agent Granularity
[Figure: bar chart of normalized execution times per pixel on MPPA-256, series labeled 128, 256, 512 and 640, for anr999, deblocking, licensePlate, retina, toggle and GMEAN]
- Fixed iteration overhead → pack pixels
- Small memory → avoid large structures
- Stencil ops → manage overlap
⇒ operate on image rows

Optimization of Morphological Agents
Morphological agents are the bottlenecks:
- 3 × 3 boolean matrix mask for selecting neighbors
- min, max or avg on selected neighbors
- Often combined in deep pipelines
Some optimizations have been implemented:
- Agent buffer of 3 rows fed in a round-robin manner
- Innermost loop written in VLIW assembly code
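The 3 × 3 mask and the 3-row round-robin buffer can be illustrated in plain C. This is a sketch under stated assumptions, not the Sigma-C agent itself: the names `erode_row`, `erode_image` and `MAX_W` are hypothetical, pixels are assumed to be `int16_t`, neighbors outside the image are simply ignored, and the inner loop is ordinary C rather than VLIW assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_W 640   /* hypothetical maximum row width */

/* Erode one row: each output pixel is the minimum over the neighbors
   selected by the 3x3 boolean mask. rows[0..2] are the rows above, at and
   below the current one; a NULL entry means the row lies outside the image. */
void erode_row(const int16_t *rows[3], const uint8_t mask[9],
               int width, int16_t *out)
{
    for (int x = 0; x < width; x++) {
        int16_t m = INT16_MAX;
        for (int dy = 0; dy < 3; dy++) {
            if (!rows[dy])
                continue;
            for (int dx = -1; dx <= 1; dx++) {
                int nx = x + dx;
                if (nx < 0 || nx >= width)
                    continue;
                if (mask[dy * 3 + dx + 1] && rows[dy][nx] < m)
                    m = rows[dy][nx];
            }
        }
        out[x] = m;
    }
}

/* Stream an image row by row through a 3-row buffer. Row y lives in slot
   y % 3, so loading row y+1 overwrites row y-2, which is no longer needed. */
void erode_image(const int16_t *in, int16_t *out, int width, int height,
                 const uint8_t mask[9])
{
    int16_t buf[3][MAX_W];

    assert(width > 0 && width <= MAX_W && height > 0);
    memcpy(buf[0], in, (size_t)width * sizeof(int16_t));
    for (int y = 0; y < height; y++) {
        if (y + 1 < height)
            memcpy(buf[(y + 1) % 3], in + (size_t)(y + 1) * width,
                   (size_t)width * sizeof(int16_t));
        const int16_t *rows[3] = {
            y > 0 ? buf[(y - 1) % 3] : NULL,
            buf[y % 3],
            y + 1 < height ? buf[(y + 1) % 3] : NULL
        };
        erode_row(rows, mask, width, out + (size_t)y * width);
    }
}
```

Replacing the minimum with a maximum gives the corresponding dilation. Only three rows are resident at any time, which keeps the working set far below the per-agent memory limit.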

Bottleneck Reduction: Graph Transformation
Data Parallelization of Morphological Agents
[Figure: three variants of a morphological agent: (a) one morpho agent processing one row; (b) split, two morpho agents on two half-rows, join; (c) split, three morpho agents on three thirds of a row, join]
[Figure: bar chart of normalized execution time for cases (a), (b) and (c), for anr999, antibio, deblocking, licensePlate, oop, retina, toggle and GMEAN]
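Splitting a row among several morphological agents requires each part to read a few extra pixels on each side, since a 3 × 3 stencil looks one pixel to the left and right. The sketch below computes such ranges; the `row_part` type and `split_row` function are hypothetical names, and the halo handling is an assumption about how the overlap could be managed, not a description of the generated split and join agents.

```c
#include <assert.h>

typedef struct {
    int out_begin, out_end;  /* pixels this part produces: [out_begin, out_end) */
    int in_begin, in_end;    /* pixels it must read, halo included */
} row_part;

/* Split a row of `width` pixels into `parts` contiguous pieces of nearly
   equal size, each extended by `halo` pixels clamped to the row bounds. */
void split_row(int width, int parts, int halo, row_part *p)
{
    assert(width > 0 && parts > 0 && parts <= width && halo >= 0);
    for (int i = 0; i < parts; i++) {
        p[i].out_begin = (int)((long)width * i / parts);
        p[i].out_end   = (int)((long)width * (i + 1) / parts);
        p[i].in_begin  = p[i].out_begin - halo < 0 ? 0 : p[i].out_begin - halo;
        p[i].in_end    = p[i].out_end + halo > width ? width : p[i].out_end + halo;
    }
}
```

For a 512-pixel row split in two with a halo of 1, the first part produces pixels [0, 256) and reads [0, 257); the second produces [256, 512) and reads [255, 512).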

Reduce Number of Used Cores: Graph Transformation
Aggregation of Arithmetic Agents
- #agents ≤ 256
- Fast agents can be aggregated to use fewer cores
- Arithmetic operators are fast: good candidates for aggregation
[Figure: bar chart of normalized execution time with no compound agent, 2 operators/compound agent and 4 operators/compound agent, for antibio, burner, licensePlate, oop, retina, toggle and GMEAN]
⇒ fewer cores used / same execution time

Reduce Control Overhead: Enlarge Graphs
While Unrolling for Convergent Transformations
    do {
      p = c;   // p and c depend on the processed image
      ...      // a converging operation
      freia_aipo_global_vol(img, &c);
    } while (c != p);
[Figure: bar chart of normalized execution time without unrolling and with unrolling factors 2, 4, 8 and 16, for antibio, burner, retina and GMEAN]
- #control overhead ↘
- #agents ↗
- #speculative execution ↗
⇒ tradeoff: unroll by 8
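The tradeoff can be made concrete with a toy model. The sketch below is an assumption-laden stand-in, not FREIA code: `shrink` plays the role of the converging image operation, the integer value itself plays the role of the volume returned by `freia_aipo_global_vol`, and `converge` and `loop_stats` are hypothetical names. It counts how many times the operation runs and how many convergence checks the control code performs.

```c
#include <assert.h>

typedef struct {
    int steps;   /* executions of the converging operation */
    int checks;  /* convergence tests performed by the control code */
} loop_stats;

/* Stand-in for a converging operation: halves its argument and reaches
   the fixed point 0 for any non-negative start value. */
int shrink(int v)
{
    return v / 2;
}

/* Run the do/while loop with its body unrolled `unroll` times.
   Once the fixed point is reached, extra steps leave the value unchanged,
   so testing only every `unroll` steps still detects convergence. */
loop_stats converge(int start, int unroll)
{
    loop_stats s = {0, 0};
    int c = start, p;

    assert(start >= 0 && unroll >= 1);
    do {
        p = c;
        for (int i = 0; i < unroll; i++) {
            c = shrink(c);
            s.steps++;
        }
        s.checks++;
    } while (c != p);
    return s;
}
```

Starting from 100, the loop without unrolling runs 8 steps and 8 checks; unrolled by 4 it runs 12 steps and 3 checks; unrolled by 8 it runs 16 steps and 2 checks. Fewer checks mean less control overhead, while the extra steps are the speculative work mentioned on the slide.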

Benchmark Suite
                      ------- #operators -------
Apps           LoC    arith  morpho  red  Total   #subg  #clust  image size
anr999          87        1      20    2     23       1       2  224 × 288
antibio        200        8      41   25     74       8       6  256 × 256
burner         510       18     410    3    431       3      16  256 × 256
deblocking     161       23       9    2     34       2      10  512 × 512
licensePlate   203        4      65    0     69       1       5  640 × 383
oop            442        7      10    0     17       1       2  350 × 288
retina         469       15      38    3     56       3       4  256 × 256
toggle         143        8       6    1     15       1       1  512 × 512

Target Systems
Target                Hardware kind  Backend  Max W
SPoC                  FPGA           FPGA        26
Terapix               FPGA           FPGA        26
Intel dual-core       2c CPU         OpenCL      65
AMD quad-core         4c CPU         OpenCL      60
NV Geforce GTX 8800   GPU            OpenCL     120
NV Quadro 600         GPU            OpenCL      40
NV Tesla 2050C        GPU            OpenCL     240
Kalray MPPA-256       Manycore       Sigma-C     10

Relative Execution Times
[Figure: bar chart on a logarithmic scale (0.01 to 10) of relative execution times on Kalray MPPA-256, SPoC, Terapix, Intel dual-core, AMD quad-core, NVIDIA GeForce 8800 GTX, NVIDIA Quadro 600 and NVIDIA Tesla C 2050, for anr999, antibio, burner, deblocking, licensePlate, oop, retina, toggle and GMEAN]
Reference: MPPA = 1.0
