Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
Abhishek Kumar Jain, Douglas L. Maskell, Suhaib A. Fahmy


slide-1
SLIDE 1

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

Abhishek Kumar Jain, Douglas L. Maskell
School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore

Suhaib A. Fahmy
School of Engineering, University of Warwick, UK

3rd International Workshop on Overlay Architectures for FPGAs (OLAF), 22 February 2017, Monterey, CA, USA

slide-2
SLIDE 2

Hardware Accelerators

  • Many different platforms exist for hardware acceleration of compute-intensive applications
  • These include GPUs, FPGAs, etc.
  • GPUs are more widely used due to:
      • Ease of use and better design productivity
      • Better support for the OpenCL programming model
      • Faster design cycles

slide-3
SLIDE 3

Hardware Accelerators

  • Issues with FPGA-based accelerators (past and current), and what addresses each:

1. Low level of programming abstraction (RTL). Addressed by HLS tools, e.g. Vivado HLS
2. Runtime management (interfaces and drivers). Addressed by newer tools, e.g. SDSoC, SDAccel, AOCL
3. Long compile times and slow switching between application kernels. Addressed by overlays
4. Lack of application portability and performance scalability. Addressed by overlays + OpenCL

slide-4
SLIDE 4

So what is an Overlay?

  • A coarse-grained circuit abstraction that sits on top of the FPGA fabric
  • Has many similarities to coarse-grained reconfigurable arrays (CGRAs)
  • Because it is coarse-grained, it provides easier application mapping, faster compilation, and faster application kernel configuration

Images from University of Toronto [3]

slide-5
SLIDE 5

So what is an Overlay?

  • Intermediate Fabric (University of Florida) [4]
  • Mesh-of-FU Overlay (University of Toronto) [3]
  • DySER (UW Madison) [5]
  • MXP Overlay (VectorBlox) [1]
  • SCGRA Overlay (HKUST, Hong Kong) [2]

[1] Severance, Aaron, and Guy G. F. Lemieux. "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor." CODES+ISSS, 2013.
[2] Liu, Cheng, Ho-Cheung Ng, and Hayden Kwok-Hay So. "QuickDough: a rapid FPGA loop accelerator design framework using soft CGRA overlay." FPT 2015.
[3] Capalija, Davor, and Tarek S. Abdelrahman. "A high-performance overlay architecture for pipelined execution of data flow graphs." FPL 2013.
[4] Stitt, Greg, and James Coole. "Intermediate fabrics: Virtual architectures for near-instant FPGA compilation." IEEE ESL, vol. 3(3), 2011.
[5] Benson, J., et al. "Design, integration and implementation of the DySER hardware accelerator into OpenSPARC." HPCA 2012.
slide-6
SLIDE 6

Classification based on Architecture [6]

[Figure labels: FU-0, FU-1, FU-2, FU-3; Overlay (time-multiplexed), similar to conventional CGRAs; Overlay (spatially configured)]

  • Time multiplexed
  • Spatially configured
  • Packet switched
  • Circuit switched

[6] Kapre, Nachiket, et al. "Packet switched vs. time multiplexed FPGA overlay networks." FCCM 2006.

Images from University of Toronto [3]

slide-7
SLIDE 7

Started looking at SC overlays in 2013

[Bar charts: Fmax (MHz), throughput (GOPS), and FU area (LUTs/GOPS). Values shown: Fmax 175; GOPS 6.3; LUTs/GOPS 7620.]

slide-8
SLIDE 8

Started looking at SC overlays in 2013

[Bar charts: Fmax (MHz), throughput (GOPS), and FU area (LUTs/GOPS). Values shown: Fmax 175 and 300; GOPS 6.3 and 115; LUTs/GOPS 7620 and 320.]

Zynq: 128 DSP blocks at 300 MHz
V7: 800 DSP blocks at 380 MHz

Jain, Fahmy, Maskell
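
The 115 GOPS value is consistent with a simple peak-throughput estimate for the Zynq device. As a rough sketch, assume each DSP block can perform up to three arithmetic operations per cycle (pre-add, multiply, and ALU operation). That count is an assumption and is not stated on the slide. Peak GOPS is then the DSP count times operations per cycle times clock frequency:

```python
def peak_gops(dsp_blocks, fmax_mhz, ops_per_dsp=3):
    """Estimate peak throughput in GOPS.

    ops_per_dsp=3 is an assumption (pre-add, multiply, ALU operation),
    not a figure taken from the slides.
    """
    # blocks * ops/cycle * MHz gives mega-operations per second;
    # dividing by 1000 converts to giga-operations per second.
    return dsp_blocks * ops_per_dsp * fmax_mhz / 1000.0


# Zynq configuration from the slide: 128 DSP blocks at 300 MHz.
zynq_gops = peak_gops(128, 300)
print(round(zynq_gops, 1))
```

Under that assumption the Zynq estimate comes to about 115 GOPS, in line with the charted value.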

slide-9
SLIDE 9

Started looking at SC overlays in 2013

[Bar charts: Fmax (MHz), throughput (GOPS), and FU area (LUTs/GOPS). Values shown: Fmax 175, 300, and 395; GOPS 6.3, 115, and 23; LUTs/GOPS 7620, 320, and 58.]

Zynq: 128 DSP blocks at 300 MHz
V7: 800 DSP blocks at 380 MHz

Jain, Fahmy, Maskell

slide-10
SLIDE 10

Coarse-grained Overlays

  • 1000x faster place and route
  • Place and route within a second on the embedded ARM in Zynq
  • Very fast configuration time (μs rather than ms)

[Figure: FU area. Zynq: 128 DSP blocks at 300 MHz; V7: 800 DSP blocks at 380 MHz. Jain, Fahmy, Maskell]
slide-11
SLIDE 11

Coarse-grained Overlays

So can we use these overlays to provide a more GPU-like experience?

  • OpenCL on a GPU allows:
      • Fast compilation and configuration
      • Application portability across other accelerators (LLVM IR as the abstraction layer)
      • Performance scaling by exploiting just-in-time (JIT) compilation

slide-12
SLIDE 12

Coarse-grained Overlays + OpenCL

[Figures: Intermediate Fabric Overlay (University of Florida) [7]; TILT Overlay (University of Toronto) [8]]

  • Recent work has focused on exposing the overlay as an OpenCL device
  • This provides a more GPU-like experience by exploiting fast compilation and configuration
  • It does not exploit OpenCL's kernel replication feature in the way GPUs do

[7] Coole, James, and Greg Stitt. "Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts." IEEE Micro, 2014.
[8] Rashid, Rafat, J. Gregory Steffan, and Vaughn Betz. "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS." FPT 2014.
slide-13
SLIDE 13

Performance Scaling on GPUs

  • GPUs can automatically scale performance at runtime if additional hardware resources are available
  • Idea:
      • Compile the application kernel at runtime from the kernel source (JIT)
      • Unroll or replicate the kernel based on the availability of hardware resources
  • Direct runtime compilation for FPGAs is infeasible
  • Overlays can allow runtime compilation and performance scaling!

[Figure: two execution timelines, each with a time axis]

  • S. Gao and J. Chritz. "Characterization of OpenCL on a Scalable FPGA Architecture." ReConFig 2014.
slide-14
SLIDE 14

Coarse-grained Overlays + OpenCL

[Figure labels: dual-DSP FU-aware DFG transformation; resource-aware replication (8 kernel instances); DISO architecture; Dual-DISO architecture]

slide-15
SLIDE 15


Resource-aware Performance Scaling

  • The proposed approach lets performance scale as more hardware resources are provided
  • Instead of mapping a single copy of the kernel onto an 8x8 array, the compiler can replicate the kernel 16 times
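
The replication decision can be sketched as a simple resource check. This is a minimal illustration, not the compiler's actual algorithm. It assumes replication is limited only by the number of functional units (FUs) and ignores routing and I/O constraints. The function name and the 4-FU kernel size are hypothetical, chosen so that an 8x8 array yields the 16 copies mentioned above.

```python
def replication_factor(rows, cols, fus_per_kernel):
    """Return how many kernel copies fit on a rows x cols overlay.

    Assumption: each copy needs `fus_per_kernel` functional units and
    nothing else (routing and I/O limits are ignored in this sketch).
    """
    if fus_per_kernel <= 0:
        raise ValueError("fus_per_kernel must be positive")
    total_fus = rows * cols
    # Integer division: only whole kernel copies can be placed.
    return total_fus // fus_per_kernel


# Hypothetical kernel that occupies 4 FUs, mapped onto an 8x8 overlay.
copies = replication_factor(8, 8, 4)
print(copies)
```

A larger overlay raises the factor without any change to the kernel source, which is the sense in which JIT compilation enables performance scaling.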

slide-16
SLIDE 16

Coarse-grained Overlays + OpenCL

[Figure label: POCL]

slide-17
SLIDE 17

OpenCL Kernel Execution Profile

[Chart: time in milliseconds on a logarithmic X-axis (1 to 10,000,000) for Overlay, FPGA, ARMv7 Cortex-A9 CPU, NVIDIA Quadro 2000, and Intel Xeon CPU E5-1650. Each bar is broken down into kernel compile time, write, execute, and read.]

  • Application: negate kernel (total inversion, 255 - pixel value) applied to a 400x225 grayscale image
  • X-axis shows time in milliseconds
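
For reference, the negate operation is a single subtraction per pixel. The sketch below models one work-item per pixel in plain Python rather than OpenCL C. The function names, the flat row-major image layout, and the synthetic test image are assumptions for illustration only.

```python
WIDTH, HEIGHT = 400, 225  # image size from the slide


def negate_work_item(src, dst, gid):
    """Body of one work-item: invert a single 8-bit grayscale pixel."""
    dst[gid] = 255 - src[gid]


def run_negate(src):
    """Run the work-item body once per pixel, as a kernel launch would."""
    dst = [0] * len(src)
    for gid in range(len(src)):
        negate_work_item(src, dst, gid)
    return dst


# Synthetic grayscale image stored as a flat row-major list (assumed layout).
image = [(x + y) % 256 for y in range(HEIGHT) for x in range(WIDTH)]
result = run_negate(image)
```

Because every pixel is independent, such a kernel can be replicated across an overlay with no communication between copies.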
slide-18
SLIDE 18


Summary

  • Proposed a resource-aware just-in-time OpenCL compiler for FPGA overlays
  • The embedded ARM processor on a Zynq device can compile kernels within a second
  • Future work:
      • Integration within the POCL framework on Zynq-v2
      • Execution of OpenCL benchmarks
      • Comparison with GPU
      • Continue the DSP-based overlay research
      • Integration of multiple DeCO in the accelerator framework
      • Efficient time-multiplexed overlays
      • Compilation of TensorFlow graphs onto overlays

[Figure labels: "Device dependent, i.e., LLVM backend"; "Device dependent, i.e., device drivers"]

slide-19
SLIDE 19

Thank you

slide-20
SLIDE 20

Results

slide-21
SLIDE 21

Summary

  • Similar work: Supporting ρ-VEX vector processor as an OpenCL device using POCL
slide-22
SLIDE 22

Summary

  • Similar work: Supporting ρ-VEX vector processor as an OpenCL device using POCL
  • Poor performance for the evaluated benchmark
  • Benchmark: edge-detection algorithm applied to a 640x480 image with a 3x3 mask
slide-23
SLIDE 23

Additional Slides for tool-flow

slide-24
SLIDE 24

Additional Slides for tool-flow

slide-25
SLIDE 25

Compilation

slide-26
SLIDE 26

Compilation

slide-27
SLIDE 27

Compilation

slide-28
SLIDE 28

Compilation

slide-29
SLIDE 29

Resource-aware Performance Scaling

slide-30
SLIDE 30

Kernels

slide-31
SLIDE 31

Runtime Management of Accelerators using OpenCL

1. Configure
2. Data write
3. Trigger for processing
4. Data read
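
The four host-side steps can be sketched against a stand-in device. `MockOverlayDevice` and `run_on_device` are hypothetical names invented for this illustration. They are not part of OpenCL, POCL, or any overlay driver, and the "kernel" here is an ordinary Python function applied on the host.

```python
class MockOverlayDevice:
    """Hypothetical stand-in for an accelerator; not a real driver API."""

    def __init__(self):
        self.kernel = None
        self.in_buf = []
        self.out_buf = []
        self.log = []  # records the order of host operations

    def configure(self, kernel):
        # Step 1: load the kernel configuration onto the device.
        self.kernel = kernel
        self.log.append("configure")

    def write(self, data):
        # Step 2: copy input data into the device buffer.
        self.in_buf = list(data)
        self.log.append("write")

    def trigger(self):
        # Step 3: start processing; requires a configured kernel.
        if self.kernel is None:
            raise RuntimeError("device must be configured before it is triggered")
        self.out_buf = [self.kernel(x) for x in self.in_buf]
        self.log.append("trigger")

    def read(self):
        # Step 4: copy results back to the host.
        self.log.append("read")
        return list(self.out_buf)


def run_on_device(device, kernel, data):
    """Run the configure, write, trigger, read sequence once."""
    device.configure(kernel)
    device.write(data)
    device.trigger()
    return device.read()


device = MockOverlayDevice()
output = run_on_device(device, lambda pixel: 255 - pixel, [0, 100, 255])
```

Keeping the steps separate mirrors the execution profile categories (write, execute, read), so each phase could be timed independently.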

slide-32
SLIDE 32

OpenCL as a programming model

slide-33
SLIDE 33

OpenCL as a programming model

slide-34
SLIDE 34

OpenCL as a programming model

slide-35
SLIDE 35

Overlays

  • Time-multiplexed
      • Nearest-neighbor style: SCGRA, CARBON
      • Customized: TILT, Remorph
  • Spatially configured
      • Nearest-neighbor style: QUKU, FPCA, Mesh-of-FU
      • Island-style: Intermediate Fabrics, Reconfiguration contexts, DySER

slide-36
SLIDE 36

Concept of Time-multiplexing

slide-37
SLIDE 37

Time-multiplexed Overlays: VectorBlox MXP

slide-38
SLIDE 38

Time-multiplexed Overlays: SCGRA

slide-39
SLIDE 39

Spatially-Configured Overlays

slide-40
SLIDE 40

Success Story

slide-41
SLIDE 41

FPGAs in Cloud

  • One third of cloud service provider nodes are projected to use FPGAs by 2020
  • Microsoft: doubling the throughput of the Bing search engine using FPGAs
  • Microsoft Azure cloud services and Amazon EC2 (FPGA-backed F1 instances)
  • Nicole Hemsoth, "The FPGA Accelerated Cloud Push Just Got Stronger," The Next Platform, 30 November 2016.
slide-42
SLIDE 42

Main issue?

  • Long compilation times (specifically place and route times)

[Flow diagram, front end: OpenCL application, RTL code generation from OpenCL kernel, RTL code verification, RTL code generation for interfaces, final RTL code (Verilog/VHDL)]

[Flow diagram, back end: final RTL code (Verilog/VHDL), synthesis, technology mapping, place, route, configuration bitstream]

  • Need not only software-like abstractions but also fast development cycles
  • Similar to application development for GPUs
  • Resource-aware just-in-time compilation can enable performance scaling

slide-43
SLIDE 43

OpenCL compiler for Coarse-grained Overlays

[Figure: Intermediate Fabric (IF) "context" containing FFT and IFFT cores with multiply (*) and add/subtract (+/-) units. Flow labels: OpenCL kernel (__kernel void kernelA(int *data) { … }), OpenCL HLS, synthesized netlist, Intermediate Fabric place & route, FPGA.]

Intermediate Fabric Overlay (University of Florida)
TILT Overlay (University of Toronto)

slide-44
SLIDE 44

Dual-DISO Functional Unit

[Figure: Dual-DISO functional unit built from two DSP48E1 blocks. Labeled elements: SRLs with SRLSEL (24), MUXSEL (8, 8, and 4), 32-bit CONST; per DSP48E1: pre-adder, multiplier (MUL), B register, ports A, B, C, D, M, and P, multiplexers X, Y, and Z, and control inputs INMODE, OPMODE, and ALUMODE.]

slide-45
SLIDE 45

Summary

Overlay      Fmax (MHz)   GOPS   LUTs/GOPS
IF           131          25.6   3550
IF(opt)      148          29     1725
DSP-DySER    175          6.3    7620
DISO         338          65     430
Dual-DISO    300          115    320

  • Landy, Aaron, and Greg Stitt. "A low-overhead interconnect architecture for virtual reconfigurable fabrics." ACM CASES 2012.