API-Compilation for Image Hardware Accelerators
Fabien Coelho & François Irigoin
ANR project: FREIA software environment for image application development on modern architectures
Coelho & Irigoin MINES ParisTech
Terapix Hardware Accelerator
- µP + 128 SIMD PE array, 1024 pixels per PE, neighbor communications
- computation in parallel with communication (in or out): double buffer
- issues: small memory implies tiles, 5.3 pixels/cycle bandwidth with DDR
SPoC Hardware Accelerator
Vector Unit
2 paths, 5 image ops + reductions, 4 pixels/cycle bandwidth
[Figure: vector unit with 16-bit pixel inputs and outputs; units ALU, MX, MORPH, THR, MES]
Pipeline of 8 units
[Figure: 8 vector units chained in a pipeline]
Portability vs Performance?
Portability: write one generic code
Performance: rewrite code for every accelerator
(Pure) Library Approach?
- domain-specific API, optimized (by hand)
- small library: not enough operator aggregation, missed opportunities
- large library: cost? portability?
VSIPL: 1000s of functions
(Pure) Compiler Approach?
- start from source, inline functions, loop fusion...
- issues: complexity, impact of stencils, conditions for borders...
Mixed Library/Compiler Approach
Input: small domain-specific image-level API in plain C
- basic/composed operators relevant to application developers
- library implemented (optimized?) by hand, quickly available
Locality: hardware and runtime handle loop fusion details!
- SPoC: delay lines with cyclic buffers
- Terapix: overlapping tiling induces redundant computations, µ-code
Compilation: get ops, merge ops, schedule, allocate
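The cost of the redundant computations induced by overlapping tiling can be estimated with a rough sketch. It assumes a hypothetical square tile of side `tile` pixels and a chain of `depth` 3x3 (8-connexity) operators, each of which invalidates a one-pixel border of the tile. The function name, the model and the tile sizes are illustrative assumptions, not taken from the Terapix runtime.

```c
#include <assert.h>

/* Hypothetical model: a tile of side `tile` pixels is loaded with its halo.
 * Each 3x3 operator invalidates a 1-pixel border, so after `depth` chained
 * operators only the (tile - 2*depth)^2 central pixels are valid.
 * Returns the number of computed pixels per valid output pixel, in percent
 * (100 means no overhead), or -1 if no valid pixel remains. */
int tiling_overhead_percent(int tile, int depth)
{
    assert(tile > 0 && depth >= 0);
    int valid = tile - 2 * depth;
    if (valid <= 0)
        return -1;
    /* in this simple model every operator is applied to the whole tile */
    return (100 * tile * tile) / (valid * valid);
}
```

Under these assumptions, a 128-pixel-wide tile with 10 chained operators computes about 1.4 pixels per useful pixel, which illustrates why the tile size matters when on-chip memory is small.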
ANR999: running example excerpt
// SKIPPED declarations and inits
// INPUT
freia_common_rx_image(in, &fin);
// COMPUTE
freia_global_min(in, &min);
freia_global_vol(in, &vol);
freia_dilate(od, in, 8, 10);
freia_gradient(og, in, 8, 10);
printf("min=%d, vol=%d\n", min, vol);
// OUTPUT
freia_common_tx_image(od, &fout);
freia_common_tx_image(og, &fout);
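To make the composed operators of the excerpt concrete, here is a minimal plain-C sketch of a morphological gradient built as a dilation minus an erosion, each being a chain of `depth` elementary 8-connexity steps. It is not the FREIA implementation: the `pixel` type, the fixed 8x8 image size, the function names and the border policy (out-of-image neighbors are ignored) are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define W 8
#define H 8

typedef unsigned short pixel;   /* 16-bit pixels, as on the accelerators */

/* One 8-connexity step: each pixel becomes the max (dilate) or the min
 * (erode) of its 3x3 neighborhood. Assumption: neighbors that fall
 * outside the image are ignored. */
static void step8(pixel dst[H][W], pixel src[H][W], int dilate)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            pixel v = src[y][x];
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= H || nx < 0 || nx >= W)
                        continue;
                    pixel n = src[ny][nx];
                    if (dilate ? n > v : n < v)
                        v = n;
                }
            dst[y][x] = v;
        }
}

/* Apply `depth` elementary steps, as a dilate or erode of that depth. */
static void morpho(pixel dst[H][W], pixel src[H][W], int depth, int dilate)
{
    pixel tmp[H][W];
    memcpy(dst, src, sizeof(tmp));
    for (int i = 0; i < depth; i++) {
        step8(tmp, dst, dilate);
        memcpy(dst, tmp, sizeof(tmp));
    }
}

/* Gradient sketch: dilation minus erosion, pixel by pixel. */
void sketch_gradient(pixel dst[H][W], pixel src[H][W], int depth)
{
    pixel d[H][W], e[H][W];
    morpho(d, src, depth, 1);
    morpho(e, src, depth, 0);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            dst[y][x] = (pixel)(d[y][x] - e[y][x]);
}
```

In this sketch the dilation and the gradient of the same input both recompute the same chain of dilation steps, which is the kind of redundancy a DAG-level optimization can remove.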
Compilation Strategy
Standard techniques for low-cost implementation
1. Build large basic blocks of elementary operations (generic):
inlining, scalar constant propagation, loop unrolling, dead-code elimination
2. Build and optimize DAGs of image operations (generic):
constant propagation, CSE, SDC, copy propagation
3. Generate code for target (specific):
SPoC: DAG splitting and scheduling, compaction, cutting
Terapix: DAG splitting, scheduling, memory allocation
OpenCL: DAG splitting, simple operation aggregation
2.1 Build Image Expression DAG
[Figure: image expression DAG, from Video Survey]
- expression DAG of simple image operations: morpho, ALU, threshold, measure, copies, scalar ops
- arrows: image and scalar dependencies
2.2 Optimize DAG
[Figure: Anr999 DAG optimization: freia_dilate, freia_erode and freia_gradient (connexity=8, depth=10) expanded into elementary E8 and D8 operations on input i, with outputs vol, min, d and g]
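The effect of common subexpression elimination on such a DAG can be sketched in a few lines of C. This is not the PIPS implementation: the node representation, the linear search used to detect identical nodes and all names are assumptions chosen for brevity, and the scalar reductions (min, vol) of the running example are left out.

```c
#include <assert.h>

enum op { OP_INPUT, OP_D8, OP_E8, OP_SUB };

/* a, b: indices of the operand nodes, -1 when unused */
struct node { enum op op; int a, b; };

#define MAX_NODES 128
struct dag { struct node n[MAX_NODES]; int count; int cse; };

/* Add a node, or return an existing identical one when CSE is enabled. */
int dag_add(struct dag *g, enum op op, int a, int b)
{
    if (g->cse)
        for (int i = 0; i < g->count; i++)
            if (g->n[i].op == op && g->n[i].a == a && g->n[i].b == b)
                return i;
    assert(g->count < MAX_NODES);
    g->n[g->count] = (struct node){ op, a, b };
    return g->count++;
}

/* Chain `depth` elementary operators starting from node `src`. */
int dag_chain(struct dag *g, enum op op, int src, int depth)
{
    for (int i = 0; i < depth; i++)
        src = dag_add(g, op, src, -1);
    return src;
}

/* Build the image part of the running example and return the node count:
 * od = dilate(in), og = dilate(in) - erode(in). */
int build_anr999(int cse, int depth)
{
    struct dag g = { .count = 0, .cse = cse };
    int in = dag_add(&g, OP_INPUT, -1, -1);
    dag_chain(&g, OP_D8, in, depth);            /* freia_dilate */
    int d = dag_chain(&g, OP_D8, in, depth);    /* freia_gradient, dilate part */
    int e = dag_chain(&g, OP_E8, in, depth);    /* freia_gradient, erode part */
    dag_add(&g, OP_SUB, d, e);
    return g.count;
}
```

With depth 10, this sketch builds 32 nodes without CSE and 22 with it, because the dilation chain shared by the dilate and the gradient is kept only once.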
3. Target-dependent code generator
mostly NP-complete problems: greedy heuristics to split the DAG and schedule ops
SPoC
[Figure: Anr999 DAG split into 2 SPoC helpers]
Terapix
[Figure: Anr999 DAG split into 4 Terapix helpers]
OpenCL
[Figure: Anr999 DAG split into 2 OpenCL helpers]
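As an illustration of what a greedy split can look like, the sketch below assigns each operation of a DAG to the earliest helper call that comes no earlier than its predecessors and still has room. It is a deliberately simplified assumption, not the heuristic used in PIPS: the single `capacity` limit, the two-operand node encoding and all names are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 64

/* pred[i][0..1]: indices of the operations that op i depends on, or -1.
 * Assumption: operations are numbered in topological order.
 * helper_of[i] receives the helper call chosen for op i.
 * Returns the number of helper calls used. */
int greedy_split(int n, int pred[][2], int capacity, int helper_of[])
{
    int load[MAX_OPS] = { 0 };   /* ops already placed in each helper */
    int helpers = 0;
    assert(n <= MAX_OPS && capacity > 0);
    for (int i = 0; i < n; i++) {
        int h = 0;
        /* an op cannot run before the helpers of its predecessors */
        for (int k = 0; k < 2; k++) {
            int p = pred[i][k];
            if (p >= 0) {
                assert(p < i);
                if (helper_of[p] > h)
                    h = helper_of[p];
            }
        }
        /* greedy choice: first helper from there with a free slot */
        while (load[h] >= capacity)
            h++;
        helper_of[i] = h;
        load[h]++;
        if (h + 1 > helpers)
            helpers = h + 1;
    }
    return helpers;
}
```

For a chain of 10 operations and a capacity of 8, this sketch places the first 8 operations in one helper and the last 2 in a second one.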
Performance: aggregated speedups for 9 applications
Hardware              Target             H/L    L/C    H/C
FPGA Accelerators     SPoC               14.2   6.5    91.5
                      Terapix            20.5   2.3    47.6
Multi-cores OpenCL    Intel dual-core    0.9    2.0    1.9
                      AMD quad-core      1.3    2.7    3.5
GPGPU NVIDIA OpenCL   GeForce 8800 GTX   -      7.8    -
                      Quadro 600         -      22.1   -
                      Tesla C 2050       -      10.2   -
H: one thread on host, L: library version, C: compiled version
Implementation in PIPS: add 5% to code base
- source-to-source, easier to debug output
- phase 1 – reuse (more or less) standard phases: 155000 LOCs
- phase 2 – DAG building, optimization, utils: 4000 LOCs
- phase 3 – code generation for three targets: 4400 LOCs
SPoC: 1900 LOCs, Terapix: 1400 LOCs, OpenCL: 1100 LOCs
http://pips4u.org/
Benefits: Cost effective reusable applications!
Portability: through small common API
Performance: through high-level coarse-grain low-cost compilation
Key success factors
Co-design API / compiler / runtime / hardware
- overlapping tiling moved from compiler to runtime
- double buffers moved from runtime to compiler
- borders management moved to runtime and hardware
Source-to-source eases development and testing
Functional simulators help testing
Applicability
Apps: quite static (but not only!) structure and behavior
API: one data type, few dozen ops, a lot of parallelism
Hardware: well suited, hides loop fusion...
Future Work
- Kalray MPPA data-flow model target?
- new applications? new transformations?
- consider other application domains?
Questions?
Hardware Accelerators
- more or less domain specific
- ASIC, FPGA, GPGPU, multi-cores...
- embedded? real-time? systems
Motivation?
- better execution time
- lower energy footprint
- (hide) intellectual property
- product life time: up to 30 years
Two accelerators: Terapix (128 PE SIMD) and SPoC (chained vector)
2.2 Optimize DAG (1)
[Figure: DAG before and after optimization, from Deblocking; operators include E8, D8, conv, min, max and arithmetic ops]