Automatic Streamization of Image Processing Applications (LCPC 2014)

SLIDE 1

DSL & Streaming Language Compilation and Execution Model Optimizations Experimental Results

Automatic Streamization of Image Processing Applications

LCPC 2014 Pierre Guillou Fabien Coelho François Irigoin

MINES ParisTech, PSL Research University

Hillsboro, OR, September 15, 2014

SLIDE 2

Context

Image processing applications
Computing systems
  CPUs (multi/many cores)
  Accelerators (GPUs, FPGAs, ...)

SLIDE 3

DSL → Streaming Language → Manycore Accelerator

Domain Specific Languages:
  High-level
  Easy-to-use
  Hardware agnostic
  C embedded language: FREIA
Streaming languages:
  Easily target multi/manycore architectures
  Image processing applications
  Verbose
  Examples: StreamIt, Sigma-C

SLIDE 4

Manycore Processor

[Figure: block diagram of the manycore processor: four I/O clusters, the compute clusters, the host CPU and host RAM connected through PCI-Express, and attached DDR3 behind a DDR interface]

Kalray MPPA-256:
  256 VLIW cores
  2 MB/cluster
  10 W

SLIDE 5

Outline

1. DSL & Streaming Language
2. Compilation and Execution Model
3. Optimizations
4. Experimental Results

SLIDE 6

Image Processing DSL: FREIA

FRamework for Embedded Image Applications:
  Sequential embedded C code
  High-level image processing operators
Example:

  freia_aipo_erode_8c(im1, im0, kernel);   // morphological
  freia_aipo_dilate_8c(im2, im1, kernel);  // morphological
  freia_aipo_and(im3, im2, im0);           // arithmetic

[Figure: data-flow graph im0 → ero → im1 → dil → im2 → and → im3, with im0 also feeding and]
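To make the data flow of the three calls above concrete, here is a minimal plain-C sketch of what such a chain computes on a tiny 8-bit image. It is not the FREIA implementation: the names `morpho3x3`, `and_img` and `erode_dilate_and`, the `W`/`H` dimensions, and the border policy (out-of-image neighbors are skipped) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { W = 5, H = 5 };  /* hypothetical tiny image size */

/* 3x3 morphological operator: min (erode) or max (dilate) over the
 * neighbors selected by a 3x3 boolean kernel. Out-of-image neighbors
 * are skipped (an assumption, not FREIA's documented border policy). */
void morpho3x3(uint8_t *dst, const uint8_t *src,
               const int16_t kernel[9], int take_max)
{
    assert(dst != src);  /* not an in-place operator */
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int acc = take_max ? 0 : 255;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int ny = y + dy, nx = x + dx;
                    if (!kernel[(dy + 1) * 3 + (dx + 1)]) continue;
                    if (ny < 0 || ny >= H || nx < 0 || nx >= W) continue;
                    int v = src[ny * W + nx];
                    if (take_max ? v > acc : v < acc) acc = v;
                }
            }
            dst[y * W + x] = (uint8_t)acc;
        }
    }
}

/* Pixel-wise bitwise AND of two images. */
void and_img(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < W * H; i++) dst[i] = (uint8_t)(a[i] & b[i]);
}

/* Same data flow as the slide: im0 -> ero -> im1 -> dil -> im2 -> and -> im3. */
void erode_dilate_and(uint8_t *im3, const uint8_t *im0,
                      const int16_t kernel[9])
{
    uint8_t im1[W * H], im2[W * H];
    morpho3x3(im1, im0, kernel, 0);  /* erode  */
    morpho3x3(im2, im1, kernel, 1);  /* dilate */
    and_img(im3, im2, im0);          /* and    */
}
```

With a full 3 × 3 kernel, calling `erode_dilate_and` on an image containing a single bright pixel on a dark background yields an all-dark image: the erosion removes the isolated pixel and the dilation cannot bring it back.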

SLIDE 7

Image Operators

Arithmetic operators
  unary, binary
  + − × / min max = & | ∼
Morphological operators
  selection + min/max/avg
Reduction operators
  min/max/sum

SLIDE 8

Sigma-C Agents

[Figure: agent foo with two inputs (input 0, input 1) and one output]

⇒

  agent foo() {
    interface {                 // define I/O channels
      in<int> in0, in1;         // 2 input integer channels
      out<int> out0;            // 1 output integer channel
      spec{in0[2], in1,         // define flow scheduling
           out0[3]};
    }
    void start() exchange       // DO SOMETHING!
        (in0 i0[2], in1 i1, out0 o[3]) {
      o[0] = i0[0], o[1] = i1, o[2] = i0[1];
    }
  }
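The flow specification of agent foo (two values read from in0 and one from in1 for three values written to out0) can be mimicked in plain C. This is only a sketch: `foo_exchange` mirrors the body of the slide's `start()` function, while `foo_run` is a hypothetical stand-in for the Sigma-C run-time scheduler, which the slides do not describe.

```c
#include <assert.h>
#include <stddef.h>

/* One activation of the agent: reads 2 values from in0 and 1 from in1,
 * writes 3 values to out0, interleaved as on the slide. */
void foo_exchange(const int i0[2], int i1, int o[3])
{
    o[0] = i0[0];
    o[1] = i1;
    o[2] = i0[1];
}

/* Hypothetical driver standing in for the run-time: fires the agent as
 * long as a full set of input tokens and enough output room remain.
 * Returns the number of values written to out0. */
size_t foo_run(const int *in0, size_t n0, const int *in1, size_t n1,
               int *out0, size_t out_cap)
{
    assert(in0 != NULL && in1 != NULL && out0 != NULL);
    size_t fired = 0;
    while (2 * (fired + 1) <= n0 && fired + 1 <= n1 &&
           3 * (fired + 1) <= out_cap) {
        foo_exchange(&in0[2 * fired], in1[fired], &out0[3 * fired]);
        fired++;
    }
    return 3 * fired;
}
```

Leftover tokens (for example a fifth value on in0 with no partner) simply stay unconsumed, which is how a fixed production/consumption rate behaves in a data-flow model.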

SLIDE 9

From Agents to Subgraphs

[Figure: subgraph bar containing Agent 1, Agent 2, Subgraph 3, Agent 4 and Agent 5]

  subgraph bar() {
    interface {                          // define I/O channels
      in<int> in0[2];
      out<int> out0, out1;
      spec{ { in0[][3]; out0 }; { out1[2] } };
    }
    map {
      agent a1 = new Agent1();           // instantiate agents
      agent a3 = new Subgraph3();
      ...
      connect(in0[0], a1.input0);        // I/O connections
      ...
      connect(a5.output, out1);
      connect(a1.output0, a2.input);     // internal connections
      ...
      connect(a3.output, a5.input1);
    }
  }

SLIDE 10

Input & Output

From FREIA sequential C code:

  freia_aipo_erode_8c(im1, im0, kernel);   // morphological
  freia_aipo_dilate_8c(im2, im1, kernel);  // morphological
  freia_aipo_and(im3, im2, im0);           // arithmetic

To Sigma-C subgraph:

  subgraph foo() {
    int16_t kernel[9] = {0,1,0, 0,1,0, 0,1,0};
    ...
    agent ero = new img_erode(kernel);
    agent dil = new img_dilate(kernel);
    agent and = new img_and_img();
    ...
    connect(ero.output, dil.input);
    connect(dil.output, and.input);
    ...
  }

[Figure: data-flow graph im0 → ero → im1 → dil → im2 → and → im3, with im0 also feeding and]

SLIDE 11

From DSL Code to Streaming Code

1. Build sequences of basic image operations
     composed operator inlining
     partial evaluation
     loop unrolling
2. Extract and optimize image expressions → DAG
     common subexpression elimination
     unused image computations removal
     copy propagation
3. Generate target code
     1 DAG → 1 subgraph
     1 vertex → 1 agent
     Subgraph activation
4. Use image operator library
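As an illustration of the DAG optimization step, the sketch below runs common subexpression elimination over a small image-expression DAG. The `node` layout, the `op` enumeration and the quadratic search are assumptions chosen for brevity; the slides do not show how the compiler represents or optimizes its DAGs.

```c
#include <assert.h>

enum op { OP_INPUT, OP_ERODE, OP_DILATE, OP_AND };

struct node {
    enum op op;
    int a, b;  /* operand node indices, -1 if unused */
};

/* Common subexpression elimination over a topologically ordered node
 * array: rep[i] receives the index of the earliest node computing the
 * same value as node i. Returns the number of nodes eliminated.
 * Operand order matters here: commutativity is not exploited. */
int cse(const struct node *n, int count, int *rep)
{
    int removed = 0;
    for (int i = 0; i < count; i++) {
        assert(n[i].a < i && n[i].b < i);  /* operands precede users */
        int a = n[i].a >= 0 ? rep[n[i].a] : -1;
        int b = n[i].b >= 0 ? rep[n[i].b] : -1;
        rep[i] = i;
        if (n[i].op == OP_INPUT) continue;  /* inputs are never merged */
        for (int j = 0; j < i; j++) {
            if (rep[j] != j || n[j].op != n[i].op) continue;
            int ja = n[j].a >= 0 ? rep[n[j].a] : -1;
            int jb = n[j].b >= 0 ? rep[n[j].b] : -1;
            if (ja == a && jb == b) {
                rep[i] = j;  /* node i duplicates node j */
                removed++;
                break;
            }
        }
    }
    return removed;
}
```

Because operands are compared through `rep`, merging two identical erosions also exposes the two dilations that consume them as duplicates, so one pass is enough on a topologically ordered array.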

SLIDE 12

Execution Scheme

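[Figure: execution timeline with four lanes: control code, host run-time, accelerator run-time, compute cores. The control code loads from HD, launches a, launches b, then writes on HD. Each launch triggers a transfer, after which images are streamed through the agents (agent 1a ... agent na, agent 1b) and the result is stored.]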

SLIDE 13

Mapping Sigma-C Graphs

Graph throughput constraints:
  Slowest node in critical path ⇒ split slow nodes, merge fast nodes
Agent constraints:
  1 agent / compute core ⇒ #agents ≤ 256
  2 MB for 16 cores ⇒ mem(1 agent) ≤ 128 kB
  Fixed iteration overhead ⇒ pack pixels
Mapping constraints:
  NoC comms between clusters ⇒ use few clusters
  Constant activation time ⇒ use few large graphs
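The agent constraints reduce to simple arithmetic: 2 MB shared by 16 cores leaves 128 kB per agent, and 256 cores in groups of 16 gives 16 compute clusters. The following sketch encodes these checks; `graph_fits` and `clusters_needed` are hypothetical helper names, and dense packing of agents into clusters is an assumption (it gives a lower bound only, not the mapping the tool chain actually produces).

```c
#include <assert.h>
#include <stddef.h>

enum {
    CORES             = 256,              /* MPPA-256 compute cores */
    CORES_PER_CLUSTER = 16,
    CLUSTER_MEM       = 2 * 1024 * 1024,  /* 2 MB per cluster */
    AGENT_MEM_MAX     = CLUSTER_MEM / CORES_PER_CLUSTER  /* 128 kB */
};

/* Returns 1 if a graph satisfies the agent constraints from the slide
 * (one agent per compute core, per-agent memory budget), else 0. */
int graph_fits(const size_t *agent_mem, size_t n_agents)
{
    assert(agent_mem != NULL || n_agents == 0);
    if (n_agents > (size_t)CORES) return 0;
    for (size_t i = 0; i < n_agents; i++)
        if (agent_mem[i] > (size_t)AGENT_MEM_MAX) return 0;
    return 1;
}

/* Lower bound on the number of clusters, assuming dense packing. */
size_t clusters_needed(size_t n_agents)
{
    return (n_agents + CORES_PER_CLUSTER - 1) / CORES_PER_CLUSTER;
}
```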

SLIDE 14

Agent Granularity

[Figure: bar chart of normalized execution times per pixel on MPPA-256 for anr999, deblocking, licensePlate, retina, toggle and GMEAN; series labeled 128, 256, 512 and 640]

Fixed iteration overhead → pack pixels
Small memory → avoid large structures
Stencil ops → manage overlap
⇒ operate on image rows

SLIDE 15

Optimization of Morphological Agents

Morphological agents are the bottlenecks:
  3 × 3 boolean matrix mask for selecting neighbors
  min, max or avg on selected neighbors
  Often combined in deep pipelines
Some optimizations have been implemented:
  Agent buffer of 3 rows fed in a round-robin manner
  Innermost loop written in VLIW assembly code
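The 3-row buffer fed in a round-robin manner can be sketched in plain C as follows. This is an illustration only, not the Sigma-C agent or its VLIW inner loop: the `morpho_agent` structure, the function names, the fixed `ROW_W`, the full 3 × 3 mask and the border policy (missing neighbors are ignored) are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ROW_W = 8 };  /* hypothetical row width */

struct morpho_agent {
    uint8_t rows[3][ROW_W];  /* ring buffer holding the last 3 rows */
    int received;            /* number of rows pushed so far */
};

/* Erosion (min) with a full 3x3 mask, using the nr rows in r[];
 * out-of-row neighbors are ignored. */
void erode_row(uint8_t *out, const uint8_t *r[], int nr)
{
    for (int x = 0; x < ROW_W; x++) {
        uint8_t m = 255;
        for (int k = 0; k < nr; k++)
            for (int dx = -1; dx <= 1; dx++) {
                int nx = x + dx;
                if (nx >= 0 && nx < ROW_W && r[k][nx] < m) m = r[k][nx];
            }
        out[x] = m;
    }
}

/* Push input row y. Returns 1 and fills `out` with eroded row y-1
 * (one row of latency) once it can be computed, else returns 0. */
int morpho_push(struct morpho_agent *a, const uint8_t *row, uint8_t *out)
{
    int y = a->received++;
    memcpy(a->rows[y % 3], row, ROW_W);  /* round-robin slot */
    if (y == 0) return 0;                /* need a second row first */
    const uint8_t *r[3];
    int nr = 0;
    if (y >= 2) r[nr++] = a->rows[(y - 2) % 3];
    r[nr++] = a->rows[(y - 1) % 3];
    r[nr++] = a->rows[y % 3];
    erode_row(out, r, nr);
    return 1;
}

/* Emit the last row, which has no row below it. */
void morpho_flush(struct morpho_agent *a, uint8_t *out)
{
    int y = a->received;  /* one past the last row index */
    assert(y >= 1);
    const uint8_t *r[2];
    int nr = 0;
    if (y >= 2) r[nr++] = a->rows[(y - 2) % 3];
    r[nr++] = a->rows[(y - 1) % 3];
    erode_row(out, r, nr);
}
```

Only three rows are ever resident, whatever the image height, which is what keeps the agent inside a small per-core memory budget.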

SLIDE 16

Bottleneck Reduction: Graph Transformation

Data Parallelization of Morphological Agents

[Figure: three agent graphs. (a) one row: 1 row → morpho → 1 row. (b) two half-rows: 1 row → split → 2 × morpho → join → 1 row. (c) three thirds of a row: 1 row → split → 3 × morpho → join → 1 row.]

[Figure: bar chart of normalized execution time for anr999, antibio, deblocking, licensePlate, oop, retina, toggle and GMEAN; series: case (a), case (b), case (c)]
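The split/morpho/join pattern of cases (b) and (c) can be sketched in C on the horizontal part of a 3 × 3 erosion. The one-pixel halo copied by the split step is an assumption about how overlap is managed, and `min3_range`, `row_one_agent`, `row_split` and `MAX_W` are hypothetical names; here the slices run sequentially, whereas on the target each would sit on its own core.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_W = 1024 };  /* hypothetical maximum row width */

/* Horizontal 3-wide min: out[x] = min(in[x-1], in[x], in[x+1]) for
 * x in [lo, hi), reading only neighbors inside [0, n). */
void min3_range(uint8_t *out, const uint8_t *in, int n, int lo, int hi)
{
    assert(0 <= lo && lo <= hi && hi <= n);
    for (int x = lo; x < hi; x++) {
        uint8_t m = in[x];
        if (x > 0 && in[x - 1] < m) m = in[x - 1];
        if (x + 1 < n && in[x + 1] < m) m = in[x + 1];
        out[x] = m;
    }
}

/* Case (a): one agent handles the whole row. */
void row_one_agent(uint8_t *out, const uint8_t *in, int n)
{
    min3_range(out, in, n, 0, n);
}

/* Cases (b) and (c): split the row into `parts` slices, process each
 * slice independently, then join the results. */
void row_split(uint8_t *out, const uint8_t *in, int n, int parts)
{
    assert(parts >= 1 && n >= 0 && n <= MAX_W);
    for (int p = 0; p < parts; p++) {
        int lo = n * p / parts, hi = n * (p + 1) / parts;
        /* split: copy the slice plus a 1-pixel halo on each side */
        int slo = lo > 0 ? lo - 1 : lo;
        int shi = hi < n ? hi + 1 : hi;
        int len = shi - slo;
        uint8_t piece[MAX_W], res[MAX_W];
        memcpy(piece, in + slo, (size_t)len);
        /* morpho: each piece is processed on its own */
        min3_range(res, piece, len, 0, len);
        /* join: keep only the pixels this slice owns */
        memcpy(out + lo, res + (lo - slo), (size_t)(hi - lo));
    }
}
```

The halo results are computed with a truncated neighborhood and then discarded by the join, so the split versions produce exactly the same row as the single agent.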

SLIDE 17

Reduce Number of Used Cores: Graph Transformation

Aggregation of Arithmetic Agents

Fast agents can be aggregated to use fewer cores (#agents ≤ 256)
Arithmetic operators are fast: good candidates for aggregation

[Figure: bar chart of normalized execution time for antibio, burner, licensePlate, oop, retina, toggle and GMEAN; series: no compound agent, 2 operators/compound agent, 4 operators/compound agent]

⇒ fewer cores used, same execution time
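A compound agent can be pictured as a single loop that applies a chain of arithmetic operators to each pixel, instead of one agent (and one core) per operator. The sketch below assumes a linear chain of binary operators; `binop`, `simple_agent` and `compound_agent` are hypothetical names, not the generated Sigma-C code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t (*binop)(uint8_t, uint8_t);

uint8_t op_and(uint8_t a, uint8_t b) { return (uint8_t)(a & b); }
uint8_t op_max(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Without aggregation: one agent per operator, each making its own
 * pass over the row. */
void simple_agent(uint8_t *out, const uint8_t *lhs, const uint8_t *rhs,
                  int n, binop f)
{
    for (int i = 0; i < n; i++) out[i] = f(lhs[i], rhs[i]);
}

/* Compound agent: applies the whole operator chain in a single pass,
 * so the chain occupies one core instead of n_ops cores. */
void compound_agent(uint8_t *out, const uint8_t *in,
                    const uint8_t *const *rhs, const binop *ops,
                    int n_ops, int n)
{
    assert(n_ops >= 1);
    for (int i = 0; i < n; i++) {
        uint8_t v = in[i];
        for (int k = 0; k < n_ops; k++) v = ops[k](v, rhs[k][i]);
        out[i] = v;
    }
}
```

Both versions compute the same pixels; the compound form trades pipeline parallelism between cheap operators for a smaller core count, which matches the slide's observation that execution time stays the same.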

SLIDE 18

Reduce Control Overhead: Enlarge Graphs

While Unrolling for Convergent Transformations

  do {
    p = c;    // p and c depend on the processed image
    ...       // a converging operation
    freia_aipo_global_vol(img, &c);
  } while (c != p);

[Figure: bar chart of normalized execution time for antibio, burner, retina and GMEAN; series: without unrolling, unrolling factor 2, u.f. 4, u.f. 8, u.f. 16]

control overhead ↘, #agents ↗, speculative execution ↗
⇒ tradeoff: unroll by 8
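The tradeoff can be reproduced on a toy convergent loop. In the sketch below, `step` is a hypothetical stand-in for the converging image operation (it halves an integer until it reaches 0), and the inner `for` loop stands in for the straight-line copies a compiler would emit when unrolling; the counters show fewer convergence tests at the price of speculative steps.

```c
#include <assert.h>

/* Hypothetical converging operation: halves the value until 0. */
int step(int v) { return v / 2; }

struct stats { int steps; int checks; };

/* Original loop shape: one convergence test per step. */
int run_rolled(int v, struct stats *s)
{
    int p, c = v;
    do {
        p = c;
        c = step(c);
        s->steps++;
        s->checks++;
    } while (c != p);
    return c;
}

/* Unrolled by `factor`: `factor` steps per convergence test. Steps
 * executed after convergence are speculative, hence wasted work. */
int run_unrolled(int v, int factor, struct stats *s)
{
    assert(factor >= 1);
    int p, c = v;
    do {
        p = c;
        for (int k = 0; k < factor; k++) {
            c = step(c);
            s->steps++;
        }
        s->checks++;
    } while (c != p);
    return c;
}
```

Starting from 20, the rolled loop performs 6 steps and 6 tests, while unrolling by 4 performs 12 steps and only 3 tests: control overhead goes down, speculative work goes up.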

SLIDE 19

Benchmark Suite

(arith, morpho, red and Total are #operators; - marks an empty cell)

Apps.          LoC  arith  morpho  red  Total  #subg  #clust  image size
anr999          87      1      20    2     23      1       2  224 × 288
antibio        200      8      41   25     74      8       6  256 × 256
burner         510     18     410    3    431      3      16  256 × 256
deblocking     161     23       9    2     34      2      10  512 × 512
licensePlate   203      4      65    -     69      1       5  640 × 383
oop            442      7      10    -     17      1       2  350 × 288
retina         469     15      38    3     56      3       4  256 × 256
toggle         143      8       6    1     15      1       1  512 × 512

SLIDE 20

Target Systems

Targets               hardware kind  backend  max W
SPoC                  FPGA           FPGA        26
Terapix               FPGA           FPGA        26
Intel dual-core       2c CPU         OpenCL      65
AMD quad-core         4c CPU         OpenCL      60
NV Geforce GTX 8800   GPU            OpenCL     120
NV Quadro 600         GPU            OpenCL      40
NV Tesla 2050C        GPU            OpenCL     240
Kalray MPPA-256       Manycore       Sigma-C     10

SLIDE 21

Relative Execution Times

[Figure: bar chart of relative execution times (axis ticks 0.01, 0.1, 1, 10) for anr999, antibio, burner, deblocking, licensePlate, oop, retina, toggle and GMEAN; series: Kalray MPPA-256, SPoC, Terapix, Intel dual-core, AMD quad-core, NVIDIA GeForce 8800 GTX, NVIDIA Quadro 600, NVIDIA Tesla C 2050]

Reference: MPPA = 1.0

SLIDE 22

Relative Energy Consumption

[Figure: bar chart of relative energy consumption (axis ticks 0.1, 1, 10, 100) for anr999, antibio, burner, deblocking, licensePlate, oop, retina, toggle and GMEAN; series: Kalray MPPA-256, SPoC, Terapix, Intel dual-core, AMD quad-core, NVIDIA GeForce 8800 GTX, NVIDIA Quadro 600, NVIDIA Tesla C 2050, MPPA ideal]

Reference: MPPA = 1.0
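The slides do not state how the energy figures were derived. One plausible reading, offered here purely as an assumption, is that each target's energy is estimated as its maximum power (the max W column of the target table) multiplied by its execution time, then normalized to the MPPA. The symbols below are introduced for illustration only:

```latex
E_{\mathrm{rel}}(T)
  = \frac{P_{\max}(T)\, t(T)}{P_{\max}(\mathrm{MPPA})\, t(\mathrm{MPPA})}
  = \frac{P_{\max}(T)}{10\ \mathrm{W}} \; t_{\mathrm{rel}}(T)
```

Here $t_{\mathrm{rel}}(T)$ is the relative execution time of target $T$ with the MPPA as reference. Under this reading, a 240 W target would have to run 24 times faster than the MPPA to reach the same energy.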

SLIDE 23

Conclusion

Summary:
  Image processing DSL → streaming language
  Using a source-to-source compiler
  Targeting manycore processors
Contributions:
  Generation of Sigma-C subgraphs from FREIA applications
  Optimizations for running on the Kalray MPPA-256
  Energy results: MPPA can compete with dedicated accelerators
Future Work:
  Better use of the MPPA compute power
    Map non-concurrent subgraphs on the same cores
    Power off unused clusters
  Automatic generation of specific convolutions with partial evaluation
  Exploit data parallelization when profitable

SLIDE 24

Automatic Streamization of Image Processing Applications

LCPC 2014 Pierre Guillou Fabien Coelho François Irigoin

MINES ParisTech, PSL Research University

Hillsboro, OR, September 15, 2014

SLIDE 25

Compilation Chain

[Figure: compilation chain. Original application (image processing DSL: FREIA) → source-to-source compiler (PIPS) → control code and streaming kernels. Control code → GCC → control binary. Streaming kernels + operator library → target-specific compiler → compute binaries.]

SLIDE 26

Applicability

Other manycore targets:
  Intel Xeon Phi
    ∼ 60 cores on an interconnect ring
    no clusters, no shared memory
    512 kB L2 cache/core
  Tilera TILE-Gx
    up to 72 cores with L1 and L2 cache
    no clusters, no shared memory
    2D NoC
Other streaming languages:
  StreamIt
    agents → filters
    subgraphs → pipelines/splitjoins/feedback loops
