[PPT] - for Coarse-Grained FPGA Overlays Abhishek Kumar Jain, Douglas L. PowerPoint Presentation

SLIDE 1

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

1

Abhishek Kumar Jain, Douglas L. Maskell School of Computer Science and Engineering Nanyang Technological University (NTU), Singapore 3rd International Workshop on Overlay Architectures for FPGAs (OLAF) 22nd Feb 2017, Monterey, CA, USA Suhaib A. Fahmy School of Engineering University of Warwick, UK

SLIDE 2

Hardware Accelerators

Many different platforms for hardware acceleration of compute intensive

applications

These include GPUs, FPGAs, etc.
GPUs are more widely used due to
The ease of use and better design productivity
Better support for the OpenCL programming model
Faster design cycles

2

SLIDE 3

Hardware Accelerators

3

1. Low level of programming abstraction (RTL) – HLS tools, Eg: Vivado-HLS 2. Runtime management (interfaces and drivers) – Newer tools, Eg: SDSoC, SDAccel, AOCL 3. Long compile times and slow switching between application kernels – Overlays 4. A lack of application portability and performance scalability – Overlays + OpenCL

Many different platforms for hardware acceleration of compute intensive

applications

These include GPUs, FPGAs, etc.
GPUs are more widely used due to
The ease of use and better design productivity
Better support for the OpenCL programming model
Faster design cycles
Issues with FPGA based accelerators (past and current)

1. Low level of programming abstraction (RTL) 2. Runtime management (interfaces and drivers) 3. Long compile times and slow switching between application kernels 4. A lack of application portability and performance scalability

SLIDE 4

So what is an Overlay?

A coarse-grained circuit abstraction which sits on top of the FPGA fabric
Many similarities to CGRAs
Because it is coarse-grained, it provides easier application mapping,

faster compilation and faster application kernel configuration

4 Images from University of Toronto [3]

SLIDE 5

So what is an Overlay?

6

Intermediate Fabric (University of Florida) [4] Mesh-of-FU Overlay (University of Toronto) [3] DySER (UW Madison) [5] MXP Overlay (VectorBlox) [1] SCGRA Overlay (HKUST, Hong Kong) [2]

1. Severance, Aaron, and Guy GF Lemieux. "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor." CODES+ ISSS, 2013.
2. Liu, Cheng, Ho-Cheung Ng, and Hayden Kwok-Hay So. "QuickDough: a rapid fpga loop accelerator design framework using soft CGRA overlay." FPT 2015.
3. Capalija, Davor, and Tarek S. Abdelrahman. "A high-performance overlay architecture for pipelined execution of data flow graphs." FPL 2013.
4. G. Stitt and J. Coole, “Intermediate fabrics: Virtual architectures for near-instant FPGA compilation,” IEEE ESL, vol. 3(3), 2011.
5. J. Benson et al., "Design, integration and implementation of the DySER hardware accelerator into OpenSPARC," HPCA 2012.

SLIDE 6

Classification based on Architecture[6]

6

FU-0 FU-1 FU-2 FU-3 Overlay (Time-multiplexed) Similar to conventional CGRAs Overlay (Spatially-configured)

Time multiplexed
Spatially configured
Packet switched, and
Circuit switched
6. Kapre, Nachiket, et al. "Packet switched vs. time multiplexed FPGA overlay networks." FCCM 2006.

Images from: University of Toronto [3]

SLIDE 7

Started looking at SC overlays in 2013

7 175 50 100 150 200 250 300 350 400 450 Fmax

Fmax

6.3 20 40 60 80 100 120 140 GOPS

GOPS

7620 1000 2000 3000 4000 5000 6000 7000 8000 9000 LUTs/GOPS

LUTs/GOPS

FU area

SLIDE 8

Started looking at SC overlays in 2013

8 175 300 50 100 150 200 250 300 350 400 450 Fmax

Fmax

6.3 115 20 40 60 80 100 120 140 GOPS

GOPS

7620 320 1000 2000 3000 4000 5000 6000 7000 8000 9000 LUTs/GOPS

LUTs/GOPS

FU area

Zynq: 128 DSP blocks at 300 MHz V7: 800 DSP blocks at 380 MHz

Jain, Fahmy, Maskell

SLIDE 9

Started looking at SC overlays in 2013

9 175 300 395 50 100 150 200 250 300 350 400 450 Fmax

Fmax

6.3 115 23 20 40 60 80 100 120 140 GOPS

GOPS

7620 320 58 1000 2000 3000 4000 5000 6000 7000 8000 9000 LUTs/GOPS

LUTs/GOPS

FU area

Zynq: 128 DSP blocks at 300 MHz V7: 800 DSP blocks at 380 MHz

Jain, Fahmy, Maskell

SLIDE 10

Coarse-grained Overlays

10

1000x faster place and route Place and route within a second

n embedded ARM in Zynq

FU area

Zynq: 128 DSP blocks at 300 MHz V7: 800 DSP blocks at 380 MHz

Jain, Fahmy, Maskell

Get a very fast configuration

time. (μs rather than ms)

SLIDE 11

Coarse-grained Overlays

So can we use these overlays to provide a More GPU like experience?

OpenCL on GPU allows:
Fast compilation and configuration
Application portability across other accelerators (LLVM IR as abstraction layer)
Performance scaling by exploiting just-in-time (JIT) compilation

11

SLIDE 12

Coarse-grained Overlays + OpenCL

12

Intermediate Fabric Overlay (University of Florida) [7] TILT Overlay (University of Toronto) [8]

Recent work focused on exposing overlay as an OpenCL device
Provides a more GPU like experience by exploiting fast compilation and configuration
Does not exploit kernel replication feature of OpenCL like the one used by GPUs
1. Coole, James, and Greg Stitt. "Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts." IEEE Micro, 2014
2. Rashid, Rafat, J. Gregory Steffan, and Vaughn Betz. "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS." FPT 2014.

SLIDE 13

Performance Scaling on GPUs

13

Can automatically (at runtime) scale performance

if additional hardware resource is avaliable

Idea:
Compiling application kernel at runtime from kernel

source (JIT)

unroll/replicate the kernel based on the availability of

hardware resources

Direct runtime compilation for FPGAs is infeasible
Overlays can allow runtime compilation and

performance scaling!

time time

1. S. Gao and J. Chritz. "Characterization of OpenCL on a Scalable FPGA Architecture" ReConFig 2014

SLIDE 14

Coarse-grained Overlays + OpenCL

14

Dual-DSP FU aware DFG transformation Resource-aware Replication 8 Kernel Instances DISO Architecture Dual-DISO Architecture Dual-DISO Architecture

SLIDE 15

15

Resource-aware Performance Scaling

Proposed approach can help in performance

scaling on providing more hardware resources

Instead of mapping single copy of kernel on 8x8

array, compiler can replicate the kernel 16 times

SLIDE 16

Coarse-grained Overlays + OpenCL

16 (POCL)

SLIDE 17

17 1 10 100 1000 10000 100000 1000000 10000000 Overlay FPGA ARMv7 Cortex-A9 CPU NVIDIA Quadro 2000 Intel Xeon CPU E5-1650 Kernel compile time Write Execute Read

OpenCL Kernel Execution Profile

Application: Apply negate kernel (total inversion, 255 – pixel value) on 400x225

grayscale image

X-axis shows time in milliseconds

SLIDE 18

18

Summary

Proposed a resource-aware Just-in-Time OpenCL compiler for FPGA overlays
Embedded ARM processor on Zynq device can compile kernels within a second
Future Work:
Integration within POCL framework on Zynq-v2
Execution of OpenCL benchmarks
Comparison with GPU
Continue the DSP-based overlay research
Integration of multiple DeCO in the accelerator framework
Efficient time multiplexed overlays
Compilation of TensorFlow Graphs onto Overlays

Device dependent. ie., LLVM backend Device dependent. ie., device drivers

✓

SLIDE 19

Thank you

19

SLIDE 20

Results

20

SLIDE 21

21

Summary

Similar work: Supporting ρ-VEX vector processor as an OpenCL device using POCL

SLIDE 22

22

Summary

Similar work: Supporting ρ-VEX vector processor as an OpenCL device using POCL
Bad performance for the evaluated benchmark
Edge-detection algorithm applied on 640x480 image using 3x3 mask size

SLIDE 23

Additional Slides for tool-flow

23

SLIDE 24

Additional Slides for tool-flow

24

SLIDE 25

Compilation

25

SLIDE 26

Compilation

26

SLIDE 27

Compilation

27

SLIDE 28

Compilation

28

SLIDE 29

Resource-aware Performance Scaling

29

SLIDE 30

Kernels

30

SLIDE 31

Runtime Management of Accelerators using OpenCL

31

configure Data Write Trigger for processing Data Read

SLIDE 32

OpenCL as a programming model

32

SLIDE 33

OpenCL as a programming model

33

SLIDE 34

OpenCL as a programming model

34

SLIDE 35

Overlays

35

Time- multiplexed

Nearest-neighbor style – SCGRA, CARBON Customized – TILT, Remorph

Spatially- configured

Nearest-neighbor style – QUKU, FPCA, Mesh-of-FU Island-style – Intermediate Fabrics, Reconfiguration contexts, DySER

SLIDE 36

Concept of Time-multiplexing

36

SLIDE 37

Time-multiplexed Overlays: VectorBlox MXP

37

SLIDE 38

Time-multiplexed Overlays: SCGRA

38

SLIDE 39

Spatially-Configured Overlays

39

SLIDE 40

Success Story

40

SLIDE 41

FPGAs in Cloud

41

1/3rd of the cloud service provider nodes to use FPGAs by 2020
Microsoft: Doubling the throughput of Bing search engine using FPGAs
Microsoft Azure Cloud services and Amazon EC2 (FPGA-backed F1 instances)
Nicole Hemsoth, " The FPGA Accelerated Cloud Push Just Got Stronger," 30 November 2016, THENEXTPLATFORM

SLIDE 42

Main issue?

Long compilation times (specifically place and route times)

42

OpenCL Application RTL code generation from OpenCL kernel RTL code Verification RTL code generation for interfaces Final RTL code (Verilog/VHDL)

Final RTL code (Verilog/VHDL) Synthesis Technology Mapping Place Route Configuration Bitstream

Need for not only software-like abstractions

but also fast development cycles

Similar to application development for GPUs
Resource aware Just-in-Time compilation can

enable performance scaling

SLIDE 43

OpenCL compiler for Coarse-grained Overlays

43

FFT

*

+/-

* * * *

+/- +/- +/-

FFT IFFT

*

Intermediate Fabric (IF) “Context” FFT

* *

FFT

IFFT OpenCL HLS Intermediate Fabric Place & Route __kernel void kernelA(int *data) { … } Synthesized Netlist FPGA

Intermediate Fabric Overlay (University of Florida) TILT Overlay (University of Toronto)

SLIDE 44

Dual-DISO Functional Unit

44 SRLSEL 24 MUXSEL 8

SRLs SRLs SRLs SRLs

MUL B Register Pre-Adder C M INMODE OPMODE B A D C 1 ALUMODE P 16 16 16 16 5 7 4 1 4 16 DSP48E1 X Y Z MUXSEL 8 MUL B Register Pre-Adder C M INMODE OPMODE B A D C 1 ALUMODE P 16 16 16 16 5 7 4 1 4 16 DSP48E1 X Y Z

MUXSEL 4 32 CONST

SLIDE 45

Summary

45 131 148 175 338 300 50 100 150 200 250 300 350 400 Fmax

Fmax

IF IF(opt) DSP-DySER DISO Dual-DISO 25.6 29 6.3 65 115 20 40 60 80 100 120 140 GOPS

GOPS

IF IF(opt) DSP-DySER DISO Dual-DISO 3550 1725 7620 430 320 1000 2000 3000 4000 5000 6000 7000 8000 9000 LUTs/GOPS

LUTs/GOPS

IF IF(opt) DSP-DySER DISO Dual-DISO

Landy, Aaron, and Greg Stitt. "A low-overhead interconnect architecture for virtual reconfigurable fabrics." ACM CASES 2012.