fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs




SLIDE 1

fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs

Stylianos I. Venieris, Christos-Savvas Bouganis stylianos.venieris10@imperial.ac.uk FCCM 2016, Washington DC 2 May 2016

SLIDE 2

Deep Learning and AI


SLIDE 3

Deep Learning Success Stories - ConvNets

  • Image Recognition (Microsoft, 2015)
  • Image Captioning (Microsoft, 2015)
  • “DeepFace” (Facebook, 2014)

SLIDE 4

Deep Learning on FPGAs


  • Hand-tuned implementations
  • Memory I/O Optimisation
  • Design Space Exploration
SLIDE 5

What is missing?


Deep Learning Developers:
  • Caffe
  • TensorFlow
  • Theano
  • Torch

FPGA:
  – ConvNet Functionality
  – Optimised for High Performance

The missing piece: fpgaConvNet

SLIDE 6

Our approach - fpgaConvNet

[Toolflow diagram: the Deep Learning expert supplies a ConvNet description and the FPGA target platform specifications; fpgaConvNet performs automated design space exploration and ConvNet hardware mapping, and outputs a bitstream.]

SLIDE 7

Convolutional Neural Networks (ConvNets)

[Typical ConvNet pipeline: convolutional + nonlinearity → pooling → convolutional + nonlinearity → pooling]

SLIDE 8
fpgaConvNet – ConvNet Modelling Framework

  • Synchronous Data Flow
    – ConvNet as a data-driven graph
    – Represented as a matrix
    – Each layer mapped to a tunable set of hardware building blocks

Streaming + Analytical power
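The SDF view above can be sketched concretely. A minimal illustration in Python (the token rates, actor names, and `balances` helper are assumptions for exposition, not the paper's implementation): each edge of the data-driven graph gets a row in a topology matrix, and checking the balance equations confirms that a static schedule with bounded buffers exists.

```python
# Sketch: a small ConvNet pipeline modelled as a Synchronous Data Flow
# (SDF) graph via its topology matrix. Rates are illustrative assumptions.

# Actors: conv -> nonlin -> pool (2x2 pooling consumes 4 tokens per firing).
# topology[e][a] = tokens produced (+) or consumed (-) on edge e by actor a
# per firing.
topology = [
    [+1, -1,  0],   # edge 0: conv -> nonlin
    [ 0, +1, -4],   # edge 1: nonlin -> pool
]

def balances(matrix, repetitions):
    """Check the SDF balance equations: matrix @ repetitions == 0,
    i.e. the graph admits a periodic static schedule."""
    return all(
        sum(rate * reps for rate, reps in zip(row, repetitions)) == 0
        for row in matrix
    )

# Repetition vector: fire conv and nonlin 4 times per pool firing.
q = [4, 4, 1]
print(balances(topology, q))   # True -> static schedule exists
```

Because the model is a matrix, the folding and partitioning actions on later slides become algebraic operations on this representation.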

SLIDE 9

fpgaConvNet – Modelling ConvNets with SDF

[Diagram: a ConvNet mapped to a hardware SDF graph; input data flows through the memory interface into a convolutional layer with 4 filters, then a nonlinearity layer and a pooling layer.]

SLIDE 10

fpgaConvNet – Design Space Perspective

Bottlenecks:
  – Limited compute resources
  – Limited off-chip memory bandwidth
  – Limited on-chip memory for model parameters

[Plot: throughput vs. area design space, marking the current design point and the resource bounds of FPGA 1 and FPGA 2.]

Define a set of actions to move around the design space.

SLIDE 11

Action 1: Coarse-grained Folding

4 Convolutions / cycle:
  1) Exceeding the available compute resources
  2) Not enough off-chip memory bandwidth

SLIDE 12

Action 1: Coarse-grained Folding

2 Convolutions / cycle: reduced compute resources and required bandwidth

Action 2: Fine-grained Folding
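The coarse-grained folding trade-off can be illustrated with a toy resource model (the per-unit constants below are assumptions, not the paper's cost model): halving the number of parallel convolution units halves both DSP usage and off-chip bandwidth demand, which is the move from 4 to 2 convolutions per cycle shown above, at the cost of throughput.

```python
# Sketch of the coarse-grained folding trade-off. The DSP and byte
# counts per convolution unit are illustrative assumptions.

def fold(parallel_convs, dsp_per_conv=9, bytes_per_conv=8, clock_hz=100e6):
    """Resource and bandwidth demand of a convolution stage performing
    `parallel_convs` convolutions per cycle."""
    return {
        "convs_per_cycle": parallel_convs,
        "dsps": parallel_convs * dsp_per_conv,
        "bandwidth_bytes_per_s": parallel_convs * bytes_per_conv * clock_hz,
    }

full   = fold(4)   # may exceed the device: too many DSPs, too much bandwidth
folded = fold(2)   # coarse-grained folding: half the demand, half the rate
print(full["dsps"], folded["dsps"])   # 36 18
```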

SLIDE 13

Action 3: Partitioning through Reconfiguration

When a single bitstream exceeds the available compute resources, or there is not enough on-chip memory, the pipeline (Input Data → Conv Layer 1 → Nonlin Layer 1 → Pool Layer 1 → Conv Layer 2 → Nonlin Layer 2 → Pool Layer 2) is partitioned into Subgraph 1, Subgraph 2 and Subgraph 3, each compiled to its own bitstream (Bitstream 1, Bitstream 2, Bitstream 3). Between FPGA reconfigurations, off-chip memory stages the data flow: Input Data → Intermediate Results → Intermediate Results → Final Results.
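The partitioning idea can be sketched as a greedy split of the layer pipeline into subgraphs that each fit an on-chip budget. The layer costs, the budget value, and the `partition` helper below are all hypothetical, chosen only to show the mechanism:

```python
# Sketch: greedy partitioning of a ConvNet into subgraphs that each fit
# an assumed on-chip resource budget; each subgraph becomes one bitstream,
# with intermediate results staged in off-chip memory between
# FPGA reconfigurations.

layers = [  # (name, assumed resource cost)
    ("Conv1", 50), ("Nonlin1", 5), ("Pool1", 10),
    ("Conv2", 60), ("Nonlin2", 5), ("Pool2", 10),
]

def partition(layers, budget):
    subgraphs, current, used = [], [], 0
    for name, cost in layers:
        if current and used + cost > budget:  # would overflow: cut here
            subgraphs.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    subgraphs.append(current)
    return subgraphs

print(partition(layers, budget=70))
# [['Conv1', 'Nonlin1', 'Pool1'], ['Conv2', 'Nonlin2'], ['Pool2']]
# -> three subgraphs, hence three bitstreams
```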

SLIDE 14

fpgaConvNet – SDF Analytical Power

  • Synchronous Data Flow
    – Actions as algebraic operations
    – Any local action propagates through the network
    – Static scheduling
    – Analytical Performance Model
    – Cast DSE to a formal resource-constrained optimisation

[Diagram: two alternative designs (Design 1, Design 2) for a stage with Window Size = K and Pool Size = P, showing hardware stages and their interconnections.]
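Casting DSE as a resource-constrained optimisation can be sketched with a toy analytical model. The per-unit DSP cost and ops-per-convolution figures are assumptions; only the 220-DSP budget comes from the evaluation platform on the next slide:

```python
# Sketch: design space exploration as "maximise predicted throughput
# subject to the DSP budget". The analytical model here is deliberately
# minimal; the paper's model is richer.

DSP_BUDGET = 220          # Zynq XC7Z020 (from the evaluation setup)
DSP_PER_UNIT = 9          # assumed DSPs per parallel convolution unit
CLOCK_HZ = 100e6

def throughput(parallel_units, ops_per_conv=18):
    """Predicted GOps/s under the toy model."""
    return parallel_units * ops_per_conv * CLOCK_HZ / 1e9

# Enumerate feasible folding factors and keep the fastest one.
best = max(
    (f for f in range(1, 65) if f * DSP_PER_UNIT <= DSP_BUDGET),
    key=throughput,
)
print(best, throughput(best))   # 24 43.2
```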

SLIDE 15

Evaluation - Experimental Setup

  • fpgaConvNet
    – Xilinx Zynq-7000 XC7Z020 SoC with 220 DSPs at 100 MHz
    – Q8.8 fixed-point precision to match existing work (also supports floating-point)
    – Current toolflow supports the Vivado HLS toolchain
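The Q8.8 format mentioned above is a 16-bit fixed-point representation with 8 integer and 8 fractional bits. A small sketch of quantising to it (the helper names are illustrative, not part of the fpgaConvNet toolflow):

```python
# Sketch of Q8.8 fixed-point: 16-bit signed values, 8 fractional bits.

FRAC_BITS = 8
SCALE = 1 << FRAC_BITS   # 256

def to_q8_8(x):
    """Quantise a float to Q8.8, saturating to the signed 16-bit range."""
    q = round(x * SCALE)
    return max(-(1 << 15), min((1 << 15) - 1, q))

def from_q8_8(q):
    """Convert a Q8.8 integer back to a float."""
    return q / SCALE

w = 0.15625                    # exactly representable: 40 / 256
print(from_q8_8(to_q8_8(w)))   # 0.15625, no rounding error for this value
```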

SLIDE 16

Performance Model Accuracy

[Chart: measured vs. predicted performance (GOps/s) for CFF, LeNet-5, MPCNN, CNP, Sign Recognition ConvNet, and Scene Labelling ConvNet.]

Error between 1.73% and 11.76%

SLIDE 17

fpgaConvNet vs. Existing FPGA Work

17

0.00005 0.0001 0.00015 0.0002 0.00025 0.0003 0.00035 0.0004 0.00045 0.0005

Hand-tuned [1] Memory-optimised [2]

Performance Density Comparison (GOps/s/Slice)

Existing Work (GOps/s/Slice) fpgaConvNet (GOps/s/Slice)

1.62×

[1] C. Farabet et al., “CNP: An FPGA-Based Processor for Convolutional Networks”, in FPL, IEEE, 2009.
[2] M. Peemen et al., “Memory-centric accelerator design for Convolutional Neural Networks”, in ICCD, IEEE, 2013.

SLIDE 18

fpgaConvNet vs. Existing Embedded GPU Work

[Chart: performance efficiency comparison (GOps/s/Watt) against a hand-tuned embedded GPU implementation [3].]

Hand-tuned Embedded GPU
  • Tegra K1 at 800 MHz
  • Memory Bandwidth: 12 GB/s

fpgaConvNet
  • Zynq-7000 XC7Z020 at 100 MHz
  • Memory Bandwidth: 4.26 GB/s

[3] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with convolutional networks”, in DAC, ACM/EDAC/IEEE, 2015.

SLIDE 19

Conclusions

  • Caffe
  • TensorFlow
  • Theano
  • Torch

Deep Learning Developers

fpgaConvNet