API-Compilation for Image Hardware Accelerators
Fabien Coelho & François Irigoin, MINES ParisTech
ANR project: FREIA, software environment for image application development on modern architectures

Terapix Hardware Accelerator
• µP + 128 SIMD PE array, 1024 pixels per PE, neighbor communications
• computation runs in parallel with communication (in or out) through double buffering
• issues: small memory implies tiles, 5.3 pixels/cycle bandwidth with DDR

SPoC Hardware Accelerator
Vector Unit: 2 paths, 5 image ops + reductions, 4 pixels/cycle bandwidth
Pipeline of 8 units
[Figure: one vector unit with two 16-bit pixel paths through MORPH, THR, MX, MES and ALU blocks, then the pipeline of 8 such units]

Portability vs Performance?
Portability: write one generic code
Performance: rewrite the code for every accelerator

(Pure) Library Approach?
• domain-specific API, optimized (by hand)
• small library: not enough operator aggregation, missed opportunities
• large library: cost? portability? VSIPL has 1000s of functions

(Pure) Compiler Approach?
• start from source, inline functions, loop fusion...
• issues: complexity, impact of stencils, conditions for borders...

Mixed Library/Compiler Approach
Input: small domain-specific image-level API in plain C
  basic/composed operators relevant to application developers
  library implemented (optimized?) by hand, quickly available
Locality: hardware and runtime handle loop fusion details!
  SPoC: delay lines with cyclic buffers
  Terapix: overlapping tiling induces redundant computations, µ-code
Compilation: get ops, merge ops, schedule, allocate
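The cost of overlapping tiling can be estimated with a little arithmetic. The sketch below assumes that each fused 3x3 operator needs a 1-pixel border, so a tile on which `depth` operators are chained must be loaded with a halo of `depth` pixels on every side. The function names and the tile sizes are hypothetical and are not taken from the Terapix runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Pixels loaded for one tile of w x h useful pixels when `depth` chained
 * 3x3 operators are fused on it (assumption: 1-pixel border per operator,
 * hence a halo of `depth` pixels on each side). */
size_t tile_loaded_pixels(size_t w, size_t h, size_t depth)
{
    return (w + 2 * depth) * (h + 2 * depth);
}

/* Ratio of loaded pixels to useful pixels; 1.0 means no redundancy. */
double tile_overhead(size_t w, size_t h, size_t depth)
{
    assert(w > 0 && h > 0);
    return (double)tile_loaded_pixels(w, h, depth) / (double)(w * h);
}
```

For a hypothetical 128 x 64 tile and 10 chained operators, `tile_overhead(128, 64, 10)` is about 1.5, which illustrates why small tiles make deep operator chains expensive.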

ANR999: running example excerpt

    // SKIPPED declarations and inits
    freia_common_rx_image(in, &fin);        // INPUT
    freia_global_min(in, &min);             // COMPUTE
    freia_global_vol(in, &vol);
    freia_dilate(od, in, 8, 10);
    freia_gradient(og, in, 8, 10);
    printf("min=%d, vol=%d\n", min, vol);   // OUTPUT
    freia_common_tx_image(od, &fout);
    freia_common_tx_image(og, &fout);
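To make the two image operators of the excerpt concrete, here is a plain software sketch of what an 8-connected dilation of a given depth and the morphological gradient (dilation minus erosion) compute. It is not the FREIA implementation: the `sketch_*` names, the `image` type, the tiny 8 x 8 size and the border rule (neighbors outside the image are ignored) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { W = 8, H = 8 };                        /* hypothetical image size */
typedef struct { uint16_t px[H][W]; } image;  /* hypothetical image type */

/* One 3x3 step: each pixel becomes the max (dilate) or min (erode) of its
 * 8-connected neighborhood; neighbors outside the image are ignored. */
static void morpho_step(image *out, const image *in, int want_max)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            uint16_t v = in->px[y][x];
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= H || nx < 0 || nx >= W) continue;
                    uint16_t p = in->px[ny][nx];
                    if (want_max ? p > v : p < v) v = p;
                }
            out->px[y][x] = v;
        }
}

/* Apply `depth` successive steps, working on local copies. */
static void morpho(image *out, const image *in, int depth, int want_max)
{
    image cur = *in, next;
    for (int i = 0; i < depth; i++) {
        morpho_step(&next, &cur, want_max);
        cur = next;
    }
    *out = cur;
}

void sketch_dilate(image *out, const image *in, int depth) { morpho(out, in, depth, 1); }
void sketch_erode(image *out, const image *in, int depth)  { morpho(out, in, depth, 0); }

/* Morphological gradient: dilation minus erosion, pixel by pixel. */
void sketch_gradient(image *out, const image *in, int depth)
{
    image d, e;
    sketch_dilate(&d, in, depth);
    sketch_erode(&e, in, depth);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            out->px[y][x] = (uint16_t)(d.px[y][x] - e.px[y][x]);
}
```

With depth 10, each call runs 10 elementary 3x3 steps, which is why the gradient and the dilation of the excerpt share a long common chain of operations.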

Compilation Strategy
Standard techniques for low-cost implementation
1. Build large basic blocks of elementary operations (generic):
   inlining, scalar constant propagation, loop unrolling, dead-code elimination
2. Build and optimize DAGs of image operations (generic):
   constant propagation, CSE, SDC, copy propagation
3. Generate code for target (specific):
   SPoC: DAG splitting and scheduling, compaction, cutting
   Terapix: DAG splitting, scheduling, memory allocation
   OpenCL: DAG splitting, simple operation aggregation

2.1 Build Image Expression DAG
[Figure: image expression DAG from the Video Survey application]
• expression DAG of simple image operations: morpho, ALU, threshold, measure, copies, scalar ops
• arrows: image and scalar dependencies

2.2 Optimize DAG
[Figure: ANR999 DAG before and after optimization. freia_gradient (connexity=8, depth=10) expands into freia_erode (10 E8), freia_dilate (10 D8) and a subtraction producing g; its D8 chain is then shared with the separate freia_dilate producing d. Nodes: input i, outputs d and g, measures vol and min]
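The sharing of the dilation chain is an instance of common subexpression elimination on the image DAG. The following sketch rebuilds the ANR999 excerpt with and without it. It is a toy: the data structures, the names and the linear search for an identical node are assumptions made for illustration, not the PIPS implementation.

```c
#include <assert.h>

enum op { OP_INPUT, OP_E8, OP_D8, OP_SUB, OP_MIN, OP_VOL };

struct node { enum op op; int in0, in1; };   /* -1 means "no input" */

struct dag { struct node nodes[64]; int count; int cse; };

/* Return the index of a node computing (op, in0, in1). When g->cse is set,
 * an identical existing node is reused instead of creating a new one. */
int dag_add(struct dag *g, enum op op, int in0, int in1)
{
    if (g->cse)
        for (int i = 0; i < g->count; i++)
            if (g->nodes[i].op == op &&
                g->nodes[i].in0 == in0 && g->nodes[i].in1 == in1)
                return i;
    assert(g->count < 64);
    g->nodes[g->count] = (struct node){ op, in0, in1 };
    return g->count++;
}

/* Chain `depth` copies of a unary operator starting from node `src`. */
int dag_chain(struct dag *g, enum op op, int src, int depth)
{
    for (int i = 0; i < depth; i++)
        src = dag_add(g, op, src, -1);
    return src;
}

/* Build the ANR999 excerpt and return the number of DAG nodes. */
int build_anr999(int cse)
{
    struct dag g = { .count = 0, .cse = cse };
    int i = dag_add(&g, OP_INPUT, -1, -1);
    dag_add(&g, OP_MIN, i, -1);               /* freia_global_min       */
    dag_add(&g, OP_VOL, i, -1);               /* freia_global_vol       */
    dag_chain(&g, OP_D8, i, 10);              /* freia_dilate           */
    int d = dag_chain(&g, OP_D8, i, 10);      /* freia_gradient: dilate */
    int e = dag_chain(&g, OP_E8, i, 10);      /* freia_gradient: erode  */
    dag_add(&g, OP_SUB, d, e);                /* freia_gradient: d - e  */
    return g.count;
}
```

In this toy model the DAG has 34 nodes without sharing and 24 with it: the second chain of 10 D8 nodes disappears because every node in it has the same operator and the same input as a node of the first chain.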

3. Target-dependent code generator
mostly NP-complete problems, greedy heuristics to split DAG and schedule ops
[Figure: ANR999 DAG split into helper functions for each target: SPoC (spoc helper 0 and 1), Terapix (terapix helper 0 to 3), OpenCL (OpenCL helper 0 and 1)]
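As a rough illustration of why a DAG must be split, the sketch below cuts a DAG into helper calls so that no helper spans more than `capacity` dependent stages, for instance the 8 units of the SPoC pipeline. This simple level-based cut merely stands in for the greedy heuristics mentioned on the slide; it is not the algorithm used in PIPS, and all names are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 64

/* An operation with up to two predecessors (-1 = reads the input image). */
struct op { int pred0, pred1; };

/* Assign each op to a helper call so that no helper spans more than
 * `capacity` dependent stages. Ops must be listed in topological order
 * (predecessors before successors). Returns the number of helpers;
 * helper_of[i] receives the helper index of op i. */
int split_by_depth(const struct op *ops, int n, int capacity, int *helper_of)
{
    int level[MAX_OPS];
    int helpers = 0;
    assert(n <= MAX_OPS && capacity > 0);
    for (int i = 0; i < n; i++) {
        int l0 = ops[i].pred0 >= 0 ? level[ops[i].pred0] : 0;
        int l1 = ops[i].pred1 >= 0 ? level[ops[i].pred1] : 0;
        level[i] = 1 + (l0 > l1 ? l0 : l1);          /* stage of op i */
        helper_of[i] = (level[i] - 1) / capacity;    /* window of stages */
        if (helper_of[i] + 1 > helpers)
            helpers = helper_of[i] + 1;
    }
    return helpers;
}
```

Applied to two parallel chains of 10 operations followed by a subtraction, with a capacity of 8, this cut yields two helpers: stages 1 to 8 in the first, stages 9 to 11 in the second. A real code generator also has to respect the resources of each unit and pass intermediate images between helpers, which this sketch ignores.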

Performance
Aggregated speedups for 9 applications

Hardware                  Target               H/L    L/C    H/C
FPGA accelerators         SPoC                 6.5   14.2   91.5
                          Terapix             20.5    2.3   47.6
Multi-cores (OpenCL)      Intel dual-core      0.9    2.0    1.9
                          AMD quad-core        1.3    2.7    3.5
GPGPU (NVIDIA, OpenCL)    GeForce 8800 GTX       -    7.8      -
                          Quadro 600             -   22.1      -
                          Tesla C 2050           -   10.2      -

H: one thread on host, L: library version, C: compiled version

Implementation in PIPS: adds 5% to the code base
• source-to-source, easier to debug output
• phase 1, reuse (more or less) standard phases: 155000 LOCs
• phase 2, DAG building, optimization, utils: 4000 LOCs
• phase 3, code generation for three targets: 4400 LOCs
  SPoC 1900 LOCs, Terapix 1400 LOCs, OpenCL 1100 LOCs
http://pips4u.org/

Benefits: cost-effective reusable applications!
Portability through a small common API
Performance through high-level, coarse-grain, low-cost compilation

Key success factors
Co-design of API / compiler / runtime / hardware
• overlapping tiling moved from compiler to runtime
• double buffers moved from runtime to compiler
• borders management moved to runtime and hardware
Source-to-source eases development and testing
Functional simulators help testing

Applicability
Apps: quite static (but not only!) structure and behavior
API: one data type, a few dozen ops, a lot of parallelism
Hardware: well suited, hides loop fusion...

Future Work
• Kalray MPPA data-flow model target?
• new applications? new transformations?
• consider other application domains?

Questions?

Hardware Accelerators
• more or less domain specific
• ASIC, FPGA, GPGPU, multi-cores...
• embedded? real-time? systems

Motivation?
• better execution time
• lower energy footprint
• (hide) intellectual property
• product life time: up to 30 years

Two accelerators: Terapix (128 PE SIMD) and SPoC (chained vector)
