fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs - PowerPoint PPT Presentation


  1. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs Stylianos I. Venieris, Christos-Savvas Bouganis stylianos.venieris10@imperial.ac.uk FCCM 2016, Washington DC, 2 May 2016

  2. Deep Learning and AI

  3. Deep Learning Success Stories - ConvNets • Image Recognition (Microsoft, 2015) • “Deep Face” (Facebook, 2014) • Image Captioning (Microsoft, 2015)

  4. Deep Learning on FPGAs • Memory I/O Optimisation • Hand-tuned implementations • Design Space Exploration

  5. What is missing? • Deep Learning developers: Caffe, TensorFlow, Theano, Torch • FPGA: ConvNet functionality, optimised for high performance • fpgaConvNet bridges the two

  6. Our approach - fpgaConvNet • Inputs, supplied by the Deep Learning expert: ConvNet description, FPGA target platform, specifications • fpgaConvNet: automated design space exploration, ConvNet hardware mapping • Output: bitstream

  7. Convolutional Neural Networks (ConvNets) • Pipeline: convolutional + nonlinearity → pooling → convolutional + nonlinearity → pooling

  8. fpgaConvNet – ConvNet Modelling Framework • Synchronous Data Flow streaming – ConvNet as a data-driven graph – Represented as a matrix, giving analytical power – Each layer mapped to a tunable set of hardware building blocks
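The matrix view of an SDF graph can be sketched as follows; a minimal illustration over a toy four-actor pipeline (the actor names, token rates and edge list are hypothetical, not fpgaConvNet's actual model):

```python
# Hypothetical sketch: an SDF topology matrix for a tiny ConvNet pipeline.
import numpy as np

# Actors: mem_in -> conv -> pool -> mem_out
# Edges: (producer, consumer, tokens_produced, tokens_consumed)
actors = ["mem_in", "conv", "pool", "mem_out"]
edges = [
    ("mem_in", "conv", 1, 1),     # conv consumes one pixel window per firing
    ("conv",   "pool", 4, 4),     # conv emits 4 feature values, pool takes 4
    ("pool",   "mem_out", 1, 1),
]

# Topology matrix Gamma: one row per edge, one column per actor;
# production rates are positive, consumption rates negative.
gamma = np.zeros((len(edges), len(actors)))
for i, (src, dst, prod, cons) in enumerate(edges):
    gamma[i, actors.index(src)] = prod
    gamma[i, actors.index(dst)] = -cons

# A consistent SDF graph has rank(Gamma) = |actors| - 1, so a static
# schedule exists; the repetition vector spans Gamma's null space.
print(np.linalg.matrix_rank(gamma))  # -> 3 for this 4-actor graph
```

Because the graph is a matrix, transformations on the hardware (the "actions" of the following slides) become algebraic operations on Gamma.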

  9. fpgaConvNet – Modelling ConvNets with SDF [Diagram: a ConvNet with an input data layer, a convolutional layer with 4 filters, a nonlinearity and a pooling layer, mapped to a hardware SDF graph with a memory interface]

  10. fpgaConvNet – Design Space Perspective [Chart: throughput vs. area design space for FPGA 1 and FPGA 2, with the current design point marked] Bottlenecks define a set of actions to move around the design space: – Limited compute resources – Limited off-chip memory bandwidth – Limited on-chip memory for model parameters

  11. Action 1: Coarse-grained Folding • 4 convolutions / cycle • Problems: 1) exceeding the available compute resources, 2) not enough off-chip memory bandwidth

  12. Action 1: Coarse-grained Folding • 2 convolutions / cycle – reduces compute resources and required bandwidth • Action 2: Fine-grained Folding
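The trade-off behind coarse-grained folding can be sketched numerically. This is a hypothetical cost model (the per-convolution DSP and word counts assume a 3×3 kernel and 2-byte Q8.8 words; they are not the paper's figures):

```python
# Illustrative coarse-grained folding trade-off (hypothetical numbers:
# a 3x3 kernel needs 9 MACs, Q8.8 words are 2 bytes, clock is 100 MHz).
def folded_cost(parallel_convs, dsps_per_conv=9, words_per_conv=9,
                bytes_per_word=2, clock_hz=100e6):
    """Unrolling more convolutions per cycle scales both the compute
    resources and the streaming bandwidth demand linearly."""
    dsps = parallel_convs * dsps_per_conv
    bandwidth = parallel_convs * words_per_conv * bytes_per_word * clock_hz
    return dsps, bandwidth

# Folding from 4 down to 2 convolutions/cycle halves both demands,
# e.g. bringing the stream under a 4.26 GB/s off-chip budget.
print(folded_cost(4))  # (36 DSPs, 7.2 GB/s)
print(folded_cost(2))  # (18 DSPs, 3.6 GB/s)
```

In this toy model the 4-way design is infeasible on bandwidth alone, which is exactly the situation folding resolves at the cost of throughput.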

  13. Action 3: Partitioning through Reconfiguration • The ConvNet is split into subgraphs (Subgraph 1, 2, 3, …), each containing its layers’ Conv, Nonlin and Pool stages and mapped to its own bitstream (Bitstream 1, 2, 3, …) • Off-chip memory holds the input data, the intermediate results passed between subgraphs, and the final results • FPGA reconfiguration between subgraphs • Applied when exceeding the available compute resources or when there is not enough on-chip memory
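The partitioning idea can be sketched as a greedy packing of consecutive layers into subgraphs that fit an on-chip resource budget, each subgraph then corresponding to one bitstream. The layer names, costs and the greedy policy are illustrative assumptions, not fpgaConvNet's actual partitioner:

```python
# Hypothetical sketch of partitioning through reconfiguration: greedily
# pack consecutive layers into subgraphs under a resource budget; each
# subgraph would become one bitstream, with a reconfiguration in between.
def partition(layer_costs, budget):
    subgraphs, current, used = [], [], 0
    for layer, cost in layer_costs:
        if used + cost > budget and current:
            subgraphs.append(current)   # close subgraph -> reconfigure
            current, used = [], 0
        current.append(layer)
        used += cost
    if current:
        subgraphs.append(current)
    return subgraphs

# Made-up layer costs against a budget of 100 resource units:
layers = [("conv1", 60), ("pool1", 10), ("conv2", 80), ("pool2", 10)]
print(partition(layers, budget=100))
# -> [['conv1', 'pool1'], ['conv2', 'pool2']]
```

The price of each subgraph boundary is a full FPGA reconfiguration plus a round trip of intermediate results through off-chip memory.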

  14. fpgaConvNet – SDF Analytical Power [Diagram: two candidate hardware designs (Design 1, Design 2) with different hardware stages and interconnections; window size = K, pool size = P] Synchronous Data Flow: • Actions as algebraic operations – any local action propagates through the network • Static scheduling • Analytical performance model • Casts DSE as a formal resource-constrained optimisation
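Casting DSE as a resource-constrained optimisation can be sketched as a small enumerative search over folding factors. Everything here (the linear cost model, the 9-MAC window, the byte counts) is an illustrative assumption, not the paper's formulation:

```python
# Hypothetical sketch: search folding factors for the highest predicted
# throughput subject to DSP and off-chip bandwidth constraints.
def explore(dsp_budget, bw_budget, clock_hz=100e6):
    best = None
    for folds in range(1, 65):             # candidate parallel conv units
        dsps = folds * 9                   # assume 9 MACs per 3x3 window
        bandwidth = folds * 18 * clock_hz  # assume 18 bytes streamed/cycle
        if dsps > dsp_budget or bandwidth > bw_budget:
            continue                       # infeasible design point
        gops = folds * 18 * clock_hz / 1e9  # predicted GOps/s (2 ops/MAC)
        if best is None or gops > best[1]:
            best = (folds, gops)
    return best

# With 220 DSPs and 4.26 GB/s (the XC7Z020 numbers from the slides),
# this toy model turns out bandwidth-bound rather than DSP-bound:
print(explore(dsp_budget=220, bw_budget=4.26e9))
```

The analytical performance model is what makes this loop cheap: each candidate is scored by a formula rather than by synthesis.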

  15. Evaluation - Experimental Setup • fpgaConvNet – Xilinx Zynq-7000 XC7Z020 SoC with 220 DSPs at 100 MHz – Q8.8 fixed-point precision to match existing work (also supports floating-point) – Current toolflow supports the Vivado HLS toolchain
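As a side note, the Q8.8 format above packs 8 integer and 8 fractional bits into a 16-bit word. A minimal sketch of such arithmetic (the helper names are ours, not fpgaConvNet's):

```python
# Q8.8 fixed point: 16-bit value, 8 integer bits + 8 fractional bits.
SCALE = 1 << 8  # 2^8 steps per unit, i.e. 1/256 resolution

def to_q88(x):
    return int(round(x * SCALE))

def from_q88(q):
    return q / SCALE

def q88_mul(a, b):
    # The product of two Q8.8 values is Q16.16; shift back to Q8.8.
    return (a * b) >> 8

a, b = to_q88(1.5), to_q88(-2.25)
print(from_q88(q88_mul(a, b)))  # -> -3.375
```

In hardware this maps to a 16-bit multiplier and a shift, which is why fixed point is so much cheaper than floating point on the DSP blocks.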

  16. Performance Model Accuracy [Chart: measured vs. predicted performance (GOps/s) for the Scene Labelling ConvNet, Sign Recognition ConvNet, CNP, MPCNN, LeNet-5 and CFF benchmarks] • Error between 1.73% and 11.76%

  17. fpgaConvNet vs. Existing FPGA Work [Chart: performance density comparison (GOps/s/Slice) against hand-tuned [1] and memory-optimised [2] designs, with an annotated 1.62× improvement] [1] C. Farabet et al., “CNP: An FPGA-Based Processor for Convolutional Networks”, in FPL, IEEE, 2009. [2] M. Peemen et al., “Memory-centric accelerator design for Convolutional Neural Networks”, in ICCD, IEEE, 2013.

  18. fpgaConvNet vs. Existing Embedded GPU Work [Chart: performance efficiency comparison (GOps/s/Watt)] • Hand-tuned embedded GPU [3]: Tegra K1 at 800 MHz, 12 GB/s memory bandwidth • fpgaConvNet: Zynq-7000 XC7Z020 at 100 MHz, 4.26 GB/s memory bandwidth [3] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with convolutional networks”, in DAC, ACM/EDAC/IEEE, 2015.

  19. Conclusions • fpgaConvNet bridges Deep Learning developers’ frameworks (Caffe, TensorFlow, Theano, Torch) and FPGAs
