
FPGA-based Acceleration: we need source to source compilers!
João M.P. Cardoso
João Bispo, Pedro Pinto, Luís Reis, Tiago Carvalho, Ricardo Nobre, and Nuno Paulino
University of Porto, FEUP/INESC-TEC, Porto, Portugal
Email: jmpc@acm.org

An Exciting Reconfigurable Computing Era!
• Widely spreading…
• “The FPGAs are 40 times faster than a CPU at processing Bing’s custom algorithms, Burger says.”

Compiling to hardware: Timeline
(Figure: timeline spanning the 80’s, 90’s, 00’s, 10’s, and 20’s.)

Compilation to FPGAs (hardware)
• From software to hardware
  • Generating hardware specific to the input software
  • Achieving performance benefits (acceleration), energy savings,…
• Of paramount importance to the mainstream adoption of FPGAs
  • Efficient compilation will improve designer productivity and will make the use of FPGA technology viable for software programmers
• The Challenge:
  • Added complexity of the extensive set of execution models supported by FPGAs makes efficient compilation (and programming) very hard
  • We have not yet solved the parallel programming problem, sort of …
• High-Level Synthesis (hardware generation from C) has become a real solution!

Outline
• Intro
• Why source to source compilers?
• Simple code restructuring example
• Our source to source compilation approaches
• Our source to source compilers
• Ongoing work
• Some challenges
• Conclusion

Why source to source compilers?

Code Restructuring: 3D Path Planner
• Target: ML507 Xilinx Virtex-5 board, PowerPC@400 MHz, CCUs@100 MHz
• See: J. M. P. Cardoso, et al., Specifying Compiler Strategies for FPGA-based Systems. FCCM 2012
• Optimizations combined across Strategies 1 to 8:
  • Loop fission and move
  • Replicate array 3×
  • Map gridit to HW core
  • Pointer-based accesses and strength reduction
  • Unroll 2×
  • Eliminating array accesses
  • Move data access
  • Specialization (3 HW cores)
  • Transfer pot data according to gridit call
  • Transfer obstacles data according to gridit call
  • On-demand obstacles data transfer
• Strategy 8: 6.8× faster than pure software solution

Speedup over the pure software solution, per strategy (figure):

Strategy   1     2     3     4     5     6     7     8
Speedup    1.94  5.01  5.61  5.94  6.08  6.68  6.72  6.80

FPGA resources per implementation:

Implementation (strategy)   1       2,3,4   5,6     7,8
# Slice Registers as FF     901     939     956     2,470
# Slice LUTs                1,182   1,284   1,308   2,148
# occupied Slices           531     663     642     1,004
# BlockRAM / # DSP48Es      34/6    34/6    98/6    98/12

Source: EU-Funded FP7 REFLECT project

ANTAREX: AutoTuning and Adaptivity appRoach for Energy efficient eXascale HPC systems, FET-HPC, H2020 Project
Example of a strategy from the ANTAREX project:
• Create multiple versions of function “A”
• Insert calls to timers for measuring the execution time of the function
• Substitute the call to the original function with the possibility to execute one of the versions based on a parameter
• Instantiate an autotuner and insert calls to the autotuner and communication of execution time
• Use the parameter output by the autotuner to select between the versions of the function at runtime
• Apply to each version a different optimization strategy
All these steps are performed at the source code level!
All these steps can be specified as (LARA) recipes automatically applied to source code!
Silvano et al., ACM CF’2016
http://www.antarex-project.eu
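The strategy steps listed above can be pictured with a minimal C sketch of what the transformed source might look like. Everything in it is an assumption made for illustration: the function A (here an array sum), its two versions, and the tiny autotuner stand-in are hypothetical and are not the ANTAREX or LARA APIs.

```c
#include <stddef.h>
#include <time.h>

#define NUM_VERSIONS 2

/* Version 0 of the hypothetical function "A": straightforward sum. */
long A_v0(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += data[i];
    return sum;
}

/* Version 1: same result, loop unrolled 4x (a different optimization strategy). */
long A_v1(const int *data, size_t n) {
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        sum += (long)data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < n; i++) sum += data[i];   /* remainder iterations */
    return sum;
}

/* Stand-in autotuner state (hypothetical): last measured time per version. */
typedef struct {
    double seconds[NUM_VERSIONS];
    int    tried[NUM_VERSIONS];
} autotuner;

/* Outputs the parameter that selects a version:
   untried versions first, then the fastest one seen so far. */
int autotuner_select(const autotuner *t) {
    int best = 0;
    for (int v = 0; v < NUM_VERSIONS; v++) {
        if (!t->tried[v]) return v;
        if (t->seconds[v] < t->seconds[best]) best = v;
    }
    return best;
}

/* Communicates a measured execution time back to the autotuner. */
void autotuner_report(autotuner *t, int version, double seconds) {
    t->seconds[version] = seconds;
    t->tried[version] = 1;
}

/* Replaces the call to the original function: selects a version,
   times it, and reports the time. */
long A(autotuner *t, const int *data, size_t n) {
    static long (*const versions[NUM_VERSIONS])(const int *, size_t) = { A_v0, A_v1 };
    int v = autotuner_select(t);
    clock_t start = clock();
    long result = versions[v](data, n);
    clock_t stop = clock();
    autotuner_report(t, v, (double)(stop - start) / CLOCKS_PER_SEC);
    return result;
}
```

A caller keeps one zero-initialized autotuner per call site and calls A(&tuner, data, n) where it used to call the original function. A real autotuner would typically keep exploring and react to changing inputs; this stand-in tries each version once and then sticks with the fastest measurement.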

Experiments make evident the importance of source to source transformations
• Target: 2 × Intel Xeon CPU E5-2630 v3 @ 2.40GHz (8-core CPUs)

Why source to source compilers?
• Translate from one programming language to another programming language (Lang. A → Lang. B)
• Take advantage of mature tool flows (Lang. A → Lang. A)
  • backend, target-aware compilers, synthesis tools
• Apply target-aware and/or tool flow-aware code transformations

Source to source compilation
• Code optimizations (loop unrolling, loop tiling, etc.)
• Task-level parallelism and pipelining
• Generation of multiple code versions (multiversioning)
• Specialization/customization according to data
• Memoization
• Hardware/software partitioning (including insertion of synchronization and communication primitives)
• Instrumentation
• …
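As one illustration of the transformations listed above, memoization can be applied purely at the source level by wrapping a side-effect-free function with a small lookup table. The sketch below is hypothetical: the function isqrt_slow, the table size, and the direct-mapped replacement policy are assumptions chosen to keep the example short, not something taken from the presentation.

```c
#include <stdint.h>

#define MEMO_SLOTS 64u   /* hypothetical table size */

unsigned long kernel_calls = 0;   /* instrumentation: counts real evaluations */

/* Stand-in for an expensive pure function: integer square root by linear search. */
uint32_t isqrt_slow(uint32_t n) {
    kernel_calls++;
    uint32_t r = 0;
    while ((uint64_t)(r + 1) * (r + 1) <= n) r++;
    return r;
}

typedef struct {
    uint32_t key;
    uint32_t value;
    int      valid;
} memo_entry;

memo_entry memo[MEMO_SLOTS];   /* zero-initialized: all slots start invalid */

/* Memoized wrapper a source to source tool could substitute for calls to
   isqrt_slow. Direct-mapped table: a colliding key overwrites the slot. */
uint32_t isqrt_memo(uint32_t n) {
    memo_entry *e = &memo[n % MEMO_SLOTS];
    if (e->valid && e->key == n) return e->value;   /* hit: skip the computation */
    e->key = n;
    e->value = isqrt_slow(n);                       /* miss: compute and store */
    e->valid = 1;
    return e->value;
}
```

The transformation is only valid when the wrapped function is pure, which is the kind of information a user-provided strategy or directive can supply when the compiler cannot prove it.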

Source to source compilation
• Target code is legible (good for debugging)!
• Not tied to a specific target compiler (tool flow) or target architecture!
• Not all optimizations can be done at source code level!
• Some code transformations are too specific and without enough application potential to justify inclusion in a compiler (unless the application is too important and must be continuously reshaped)

Code restructuring
• Manual
  • Programmers need to know the impact of code styles and structures on the generated architecture (similar to HDL developers, although at a different level)
• Fully automatic with a source to source compiler (refactoring tool)
  • Need to devise the code transformations to apply and their ordering
  • Need source to source compilers integrating a vast portfolio of code transformations!
• Semi-automatic with a source to source compiler (refactoring tool)
  • Code transformations automatically applied but guided by users
  • Users can define their own code transformations!

Simple code restructuring example

Code Restructuring: FIR Example

Original code:

    // x is an input array
    // y is an output array
    #define c0 2
    #define c1 4
    #define c2 4
    #define c3 2
    #define M 256 // no. of samples
    #define N 4 // no. of coeff.
    int c[N] = {c0, c1, c2, c3};
    ...
    // Loop 1:
    for (int j=N-1; j<M; j++) {
      output=0;
      // Loop 2:
      for (int i=0; i<N; i++) {
        output+=c[i]*x[j-i];
      }
      y[j] = output;
    }

Loop 2 fully unrolled, four array reads per iteration (II=2, 1 sample per 2 clock cycles):

    // Loop 1
    for (int j=3; j<M; j++) {
      x_3=x[j];
      x_2=x[j-1];
      x_1=x[j-2];
      x_0=x[j-3];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      y[j] = output;
    }

Previous samples kept in scalars, one array read per iteration (II=1, 1 sample per clock cycle):

    x_0=x[0];
    x_1=x[1];
    x_2=x[2];
    // Loop 1
    for (int j=3; j<M; j++) {
      x_3=x[j];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      x_0=x_1;
      x_1=x_2;
      x_2=x_3;
      y[j] = output;
    }

Code Restructuring: FIR Example
See: João M. P. Cardoso, Markus Weinhardt, High-Level Synthesis. FPGAs for Software Programmers 2016.

Loop 1 additionally unrolled 2× (II=1, 2 samples per clock cycle):

    x_0=x[0];
    x_1=x[1];
    x_2=x[2];
    // Loop 1 (bound is j+1<M so that x[j+1] and y[j+1] stay inside the arrays)
    for (j=3; j+1<M; j+=2) {
      x_3=x[j];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      x_0=x_1;
      x_1=x_2;
      x_2=x_3;
      y[j] = output;
      x_3=x[j+1];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      x_0=x_1;
      x_1=x_2;
      x_2=x_3;
      y[j+1] = output;
    }
    // Leftover sample when M-3 is odd (as with M=256)
    if (j<M) {
      x_3=x[j];
      y[j] = c0*x_3+c1*x_2+c2*x_1+c3*x_0;
    }

(Shown alongside the II=2 version, 1 sample per 2 clock cycles, and the II=1 version, 1 sample per clock cycle.)
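The FIR variants can be checked against each other in plain C before handing any of them to an HLS tool. The sketch below is a minimal, self-contained rendering under a few assumptions: the function names are hypothetical, the constants are written as an enum instead of the slides' #define lines, and the 2× unrolled version is a slightly condensed variant that adds a remainder step for the last sample.

```c
/* Constants from the slides, expressed as an enum to avoid macro clashes. */
enum { M = 256, N = 4, c0 = 2, c1 = 4, c2 = 4, c3 = 2 };

/* Original form: inner loop over the coefficient array. */
void fir_original(const int x[M], int y[M]) {
    const int c[N] = {c0, c1, c2, c3};
    for (int j = N - 1; j < M; j++) {
        int output = 0;
        for (int i = 0; i < N; i++) output += c[i] * x[j - i];
        y[j] = output;
    }
}

/* Restructured form: previous samples live in scalars, one array read per iteration. */
void fir_scalar(const int x[M], int y[M]) {
    int x_0 = x[0], x_1 = x[1], x_2 = x[2];
    for (int j = 3; j < M; j++) {
        int x_3 = x[j];
        y[j] = c0 * x_3 + c1 * x_2 + c2 * x_1 + c3 * x_0;
        x_0 = x_1;   /* rotate the sample window */
        x_1 = x_2;
        x_2 = x_3;
    }
}

/* Unrolled 2x: two outputs per iteration, plus one leftover sample if needed. */
void fir_unrolled2(const int x[M], int y[M]) {
    int x_0 = x[0], x_1 = x[1], x_2 = x[2];
    int j;
    for (j = 3; j + 1 < M; j += 2) {
        int x_3 = x[j];
        int x_4 = x[j + 1];
        y[j]     = c0 * x_3 + c1 * x_2 + c2 * x_1 + c3 * x_0;
        y[j + 1] = c0 * x_4 + c1 * x_3 + c2 * x_2 + c3 * x_1;
        x_0 = x_2;   /* advance the window by two samples */
        x_1 = x_3;
        x_2 = x_4;
    }
    if (j < M)       /* leftover sample when M-3 is odd */
        y[j] = c0 * x[j] + c1 * x_2 + c2 * x_1 + c3 * x_0;
}
```

Running all three on the same input and comparing y[3] through y[M-1] is a cheap regression test for the restructuring; the cycle counts quoted on the slides depend on the HLS tool and the memory ports and cannot be observed from this software model.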

Our source to source compilation approaches

Assumptions considering HLS from C
• It is possible to generate efficient hardware accelerators from “massaged” C code (+ directives)
• Directives will aid compilers with the information they cannot automatically extract/expose
• Directives will instruct compilers to apply what they cannot easily devise
• HLS will be extended to deal with directive driven programming models (e.g., OpenMP) + concurrency
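To make the "C code + directives" idea concrete, here is a small sketch of a FIR kernel annotated with HLS directives. The pragma spelling is an assumption that follows the Xilinx Vivado/Vitis HLS style (PIPELINE and UNROLL); other HLS tools use different directives, and a standard C compiler simply ignores pragmas it does not know. The function name is hypothetical.

```c
enum { M = 256, N = 4 };   /* number of samples and of coefficients */

/* FIR kernel in its original loop form, with directives asking the HLS tool
   to pipeline the outer loop and fully unroll the inner one. */
void fir_directives(const int x[M], int y[M]) {
    const int c[N] = {2, 4, 4, 2};
    for (int j = N - 1; j < M; j++) {
#pragma HLS PIPELINE II=1
        int output = 0;
        for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
            output += c[i] * x[j - i];
        }
        y[j] = output;
    }
}
```

The directive only requests an initiation interval of 1. With four reads of x per iteration the tool may not be able to meet it, which is where source-level restructuring of the array accesses comes in.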

Focus
• C/OpenCL (+ directives + directive driven programming models) as our intermediate representation (IR)
  • Compiler generates target-specific code in this IR
  • Then, HLS and backend FPGA tools are used
• This IR still misses other ways to express coarse-grained concurrency (e.g., communicating sequential processes, OpenMP directives)

LARA-based tool flow
(Figure: the Compiler Toolset takes the Application (C, MATLAB) and Aspects / Strategies (LARA), together with a Library of Aspects / Strategies, and produces Code Output and Analysis Output.)

REFLECT tool flow (figure)
• Inputs: Application (C, MATLAB); Aspects and Strategies (LARA)
• LARA Front-End, producing the Aspect-IR
• Design-Space Exploration (DSE)
• Source to Source: hardware/software partitioning, code insertion
• Kernels for Hw/Sw Components (C) + Annotations
• Best Practices; Hardware/Software Cores; Hardware/Software Templates
• Compiler Toolset (aspect-oriented weaving):
  • C Front-End, producing the CDFG-IR
  • Optimizer (Software/Hardware)
    • high-level optimizations: function inlining, loop unrolling, loop tiling
    • middle-level optimizations: word length analysis
  • Back-End (code generators): low-level optimizations, retargetability
• Outputs: Assembly, VHDL-RTL
Source: EU-Funded FP7 REFLECT project
