API-Compilation for Image Hardware Accelerators, Fabien Coelho & François Irigoin

SLIDE 1

Coelho & Irigoin MINES ParisTech

API-Compilation for Image Hardware Accelerators

Fabien Coelho & François Irigoin

ANR project: FREIA, software environment for image application development on modern architectures

SLIDE 2

Terapix Hardware Accelerator

  • µP + 128 SIMD PE array, 1024 pixels per PE, neighbor communications
  • computation in parallel with communication (in or out), using double buffers
  • issues: small memory implies tiles, 5.3 pixels/cycle bandwidth with DDR
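The tile-by-tile, double-buffered data flow can be sketched in plain C. This is only an illustration: the tile size `TILE`, the stand-in kernel `compute_tile` and the function `process_double_buffered` are hypothetical names, not part of Terapix or FREIA, and the load of the next tile runs sequentially here where real hardware would overlap it with the computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TILE = 4 };  /* hypothetical tile size in pixels, not a Terapix figure */

/* Stand-in for an accelerator kernel: invert 8-bit pixels in place. */
static void compute_tile(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(255 - buf[i]);
}

/* Process n pixels tile by tile with two alternating buffers.
 * On real hardware the load of tile t+1 would overlap the computation
 * of tile t; here both steps run one after the other. */
void process_double_buffered(const unsigned char *in, unsigned char *out, size_t n)
{
    unsigned char buf[2][TILE];
    size_t ntiles = (n + TILE - 1) / TILE;
    if (ntiles == 0)
        return;

    /* Prologue: load the first tile into buffer 0. */
    memcpy(buf[0], in, n < TILE ? n : TILE);

    for (size_t t = 0; t < ntiles; t++) {
        int cur = (int)(t % 2), nxt = 1 - cur;
        size_t off = t * TILE;
        size_t len = (n - off < TILE) ? n - off : TILE;

        if (t + 1 < ntiles) {
            /* Prefetch the next tile into the other buffer. */
            size_t noff = off + TILE;
            size_t nlen = (n - noff < TILE) ? n - noff : TILE;
            memcpy(buf[nxt], in + noff, nlen);
        }
        compute_tile(buf[cur], len);       /* compute on the current buffer */
        memcpy(out + off, buf[cur], len);  /* store the result */
    }
}
```

Calling `process_double_buffered(in, out, n)` gives the same result as applying the kernel to the whole image at once; only the memory footprint changes, since two tiles are resident instead of the full image.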

SLIDE 3

SPoC Hardware Accelerator

  • 2 paths, 5 image ops + reductions, 4 pixels/cycle bandwidth
  • pipeline of 8 vector units

[Figure: one vector unit with 16-bit pixel inputs and outputs and ALU, MX, MORPH, THR and MES blocks, repeated 8 times along the pipeline]

SLIDE 4

Portability vs Performance?

Portability: write one generic code
Performance: rewrite the code for every accelerator

SLIDE 5

(Pure) Library Approach?

  • domain-specific API, optimized (by hand)
  • small library: not enough operator aggregation, missed opportunities
  • large library: cost? portability? (e.g. VSIPL: 1000s of functions)

(Pure) Compiler Approach?

  • start from source, inline functions, loop fusion...
  • issues: complexity, impact of stencils, conditions for borders...

SLIDE 6

Mixed Library/Compiler Approach

Input: small domain-specific image-level API in plain C

  • basic/composed operators relevant to application developers
  • library implemented (optimized?) by hand, quickly available

Locality: hardware and runtime handle loop fusion details!

  • SPoC: delay lines with cyclic buffers
  • Terapix: overlapping tiling induces redundant computations, µ-code

Compilation: get ops, merge ops, schedule, allocate
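The cost of overlapping tiling can be estimated with a few lines of C. This is a sketch under stated assumptions: tiles are horizontal bands, each band needs `halo` extra rows on each side (one row per chained 3x3 operator is assumed), and the function name `rows_processed` is hypothetical.

```c
#include <assert.h>

/* Rows actually processed when an image of `height` rows is cut into
 * bands producing `useful` rows each, with `halo` extra rows on each
 * side of a band (clipped at the image borders). */
long rows_processed(long height, long useful, long halo)
{
    assert(height > 0 && useful > 0 && halo >= 0);
    long total = 0;
    for (long start = 0; start < height; start += useful) {
        long end = start + useful;      /* exclusive end of the useful rows */
        if (end > height) end = height;
        long lo = start - halo;         /* first row loaded for this band */
        if (lo < 0) lo = 0;
        long hi = end + halo;           /* exclusive end of loaded rows */
        if (hi > height) hi = height;
        total += hi - lo;
    }
    return total;
}
```

For instance, a 100-row image cut into bands of 20 useful rows with a 2-row halo processes 116 rows instead of 100; the extra 16 rows are the redundant computations induced by the overlap.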

SLIDE 7

ANR999: running example excerpt

// SKIPPED declarations and inits
freia_common_rx_image(in, &fin);         // INPUT
freia_global_min(in, &min);              // COMPUTE
freia_global_vol(in, &vol);
freia_dilate(od, in, 8, 10);
freia_gradient(og, in, 8, 10);
printf("min=%d, vol=%d\n", min, vol);    // OUTPUT
freia_common_tx_image(od, &fout);
freia_common_tx_image(og, &fout);

SLIDE 8

Compilation Strategy

Standard techniques for low-cost implementation

  • 1. Build large basic blocks of elementary operations (generic):
       inlining, scalar constant propagation, loop unrolling, dead-code elimination
  • 2. Build and optimize DAGs of image operations (generic):
       constant propagation, CSE, SDC, copy propagation
  • 3. Generate code for target (specific):
       SPoC: DAG splitting and scheduling, compaction, cutting
       Terapix: DAG splitting, scheduling, memory allocation
       OpenCL: DAG splitting, simple operation aggregation

SLIDE 9

2.1 Build Image Expression DAG

[Figure: image expression DAG from the Video Survey application, with erosion (E8), dilation (D8), threshold (thr), min, max and arithmetic nodes]

  • expression DAG of simple image operations:
    morpho, ALU, threshold, measure, copies, scalar ops
  • arrows: image and scalar dependencies

SLIDE 10

2.2 Optimize DAG

[Figure: image expression DAG before and after optimization. Nodes: input i, chains of E8 and D8 operations, vol, min, outputs d and g. Annotated calls: freia_dilate, freia_erode and freia_gradient, each with connexity=8 and depth=10]
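The effect of common subexpression elimination on this kind of DAG can be sketched on a 1-D signal. Assumptions, all for illustration only: the gradient is taken as dilation minus erosion (the standard morphological gradient), a 3-neighbour window stands in for the 8-connected one, and `step`, `morpho`, `run_naive` and `run_shared` are hypothetical names, not FREIA functions.

```c
#include <assert.h>
#include <string.h>

enum { N = 16 };   /* hypothetical 1-D "image" length */

/* One elementary step: 3-neighbour max (dilate) or min (erode). */
static void step(int *dst, const int *src, int is_max)
{
    for (int i = 0; i < N; i++) {
        int v = src[i];
        if (i > 0 && (is_max ? src[i - 1] > v : src[i - 1] < v)) v = src[i - 1];
        if (i < N - 1 && (is_max ? src[i + 1] > v : src[i + 1] < v)) v = src[i + 1];
        dst[i] = v;
    }
}

/* Composed operator: `depth` elementary steps, as after unrolling. */
static void morpho(int *dst, const int *src, int depth, int is_max)
{
    int a[N], b[N];
    memcpy(a, src, sizeof a);
    for (int d = 0; d < depth; d++) {
        step(b, a, is_max);
        memcpy(a, b, sizeof a);
    }
    memcpy(dst, a, sizeof a);
}

/* Before CSE: output d and the gradient g each compute their own
 * dilation. Returns the number of elementary steps executed. */
int run_naive(int *d, int *g, const int *in, int depth)
{
    int dil[N], ero[N];
    morpho(d, in, depth, 1);    /* dilation for output d */
    morpho(dil, in, depth, 1);  /* same dilation again, inside the gradient */
    morpho(ero, in, depth, 0);
    for (int i = 0; i < N; i++) g[i] = dil[i] - ero[i];
    return 3 * depth;
}

/* After CSE: the dilation chain is computed once and shared. */
int run_shared(int *d, int *g, const int *in, int depth)
{
    int ero[N];
    morpho(d, in, depth, 1);
    morpho(ero, in, depth, 0);
    for (int i = 0; i < N; i++) g[i] = d[i] - ero[i];
    return 2 * depth;
}
```

With depth 10, the shared version runs 20 elementary steps instead of 30 and produces identical d and g.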

Anr999

SLIDE 11

3. Target-dependent code generator

mostly NP-complete problems: greedy heuristics are used to split the DAG and schedule ops
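A greedy split can be as simple as the sketch below. It is far cruder than a real code generator and rests on assumptions: ops are already numbered in topological order, each op has at most one predecessor, and a helper call holds at most `cap` ops. The function `greedy_split` and the capacity model are hypothetical, not the heuristics used for SPoC, Terapix or OpenCL.

```c
#include <assert.h>

/* Assign each op of a DAG to a helper call, greedily.
 * pred[i] is the predecessor of op i (pred[i] < i), or -1 when op i
 * reads the input image. Each helper holds at most `cap` ops; a new
 * helper is opened when the current one is full.
 * Returns the number of helpers; helper_of[i] is the helper of op i. */
int greedy_split(const int *pred, int nops, int cap, int *helper_of)
{
    assert(cap > 0);
    int helper = 0, used = 0;
    for (int i = 0; i < nops; i++) {
        assert(pred[i] < i);              /* topological numbering */
        if (used == cap) {                /* current helper is full */
            helper++;
            used = 0;
        }
        helper_of[i] = helper;
        used++;
        /* A dependency never points to a later helper. */
        if (pred[i] >= 0)
            assert(helper_of[pred[i]] <= helper_of[i]);
    }
    return nops > 0 ? helper + 1 : 0;
}
```

For a chain of 20 ops and a made-up capacity of 16, this yields two helper calls: ops 0 to 15 in the first, ops 16 to 19 in the second.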

SPoC

[Figure: DAG split into two helper calls (spoc helper 0, spoc helper 1), each a chain of E8 and D8 operations]

Terapix

[Figure: DAG split into four helper calls (terapix helper 0 to 3)]

OpenCL

[Figure: DAG split into two helper calls (OpenCL helper 0, OpenCL helper 1)]

SLIDE 12

Performance: aggregated speedups for 9 applications

Hardware                 Target              H/L    L/C    H/C
FPGA accelerators        SPoC               14.2    6.5   91.5
                         Terapix            20.5    2.3   47.6
Multi-cores (OpenCL)     Intel dual-core     0.9    2.0    1.9
                         AMD quad-core       1.3    2.7    3.5
NVIDIA GPGPU (OpenCL)    GeForce 8800 GTX      -    7.8      -
                         Quadro 600            -   22.1      -
                         Tesla C 2050          -   10.2      -

H: one thread on host, L: library version, C: compiled version
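As a sanity check on the three columns: for a single application, H/C is exactly the product of H/L and L/C. The reported figures are aggregated over 9 applications and rounded, so the products below are only expected to come close; how the aggregation was done is not stated here.

```latex
\[
\begin{aligned}
\text{SPoC:}            &\quad 14.2 \times 6.5 = 92.3  &&\text{(reported H/C: } 91.5\text{)}\\
\text{Terapix:}         &\quad 20.5 \times 2.3 = 47.15 &&\text{(reported H/C: } 47.6\text{)}\\
\text{Intel dual-core:} &\quad 0.9 \times 2.0 = 1.8    &&\text{(reported H/C: } 1.9\text{)}\\
\text{AMD quad-core:}   &\quad 1.3 \times 2.7 = 3.51   &&\text{(reported H/C: } 3.5\text{)}
\end{aligned}
\]
```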

SLIDE 13

Implementation in PIPS: adds 5% to the code base

  • source-to-source, easier to debug output
  • phase 1: reuse (more or less) standard phases: 155000 LOCs
  • phase 2: DAG building, optimization, utils: 4000 LOCs
  • phase 3: code generation for three targets: 4400 LOCs
    (SPoC 1900 LOCs, Terapix 1400 LOCs, OpenCL 1100 LOCs)
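The 5% figure is consistent with the line counts listed, taking the new code to be phases 2 and 3 and the base to be the reused phases (an assumption about how the figure was derived):

```latex
\[
\frac{4000 + 4400}{155000} = \frac{8400}{155000} \approx 0.054 \approx 5\%,
\qquad
1900 + 1400 + 1100 = 4400 .
\]
```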

http://pips4u.org/

SLIDE 14

Benefits: cost-effective reusable applications!

Portability: through a small common API
Performance: through high-level, coarse-grain, low-cost compilation

Key success factors

Co-design API / compiler / runtime / hardware

  • overlapping tiling moved from compiler to runtime
  • double buffers moved from runtime to compiler
  • borders management moved to runtime and hardware

Source-to-source eases development and testing
Functional simulators help testing

SLIDE 15

Applicability

Apps: quite static (but not only!) structure and behavior
API: one data type, a few dozen ops, a lot of parallelism
Hardware: well suited, hides loop fusion...

Future Work

  • Kalray MPPA data-flow model target?
  • new applications? new transformations?
  • consider other application domains?

SLIDE 16

Questions?

SLIDE 17

Hardware Accelerators

  • more or less domain specific
  • ASIC, FPGA, GPGPU, multi-cores. . .
  • embedded? real-time? systems

Motivation?

  • better execution time
  • lower energy footprint
  • (hide) intellectual property
  • product life time: up to 30 years

Two accelerators: Terapix (128 PE SIMD) and SPoC (chained vector)

SLIDE 18

2.2 Optimize DAG (1)

[Figure: image expression DAG from the Deblocking application, before and after optimization; nodes include E8, D8, conv, min, max and arithmetic operators]

SLIDE 19

Application Domain: image processing

  • algebra on images: one data type, basic (hw) and composed ops
  • example applications: OOP (22 ops), Retina (106 ops), Antibio (49 ops), Burner (422 ops)
