Accelerating Convolutional Neural Networks on FPGA SoC

Francesco Restuccia, Ph.D. Fellow

04/12/2020

Overview

  • Background
  • Xilinx CHaiDNN framework
    • Overview
    • Working flow
    • Hardware architecture
  • Xilinx DNNDK framework
    • Overview
    • Working flow
    • The Deep Learning Processor Unit (DPU)
  • Xilinx Vitis AI framework
    • Overview
    • Vitis AI for edge applications

Background


FPGA SoC Architecture

[Figure: FPGA SoC block diagram highlighting the FPGA Programmable Logic]


FPGA SoC Architecture - 2

[Figure: FPGA SoC block diagram highlighting the ARM-based Processing System]


FPGA SoC Architecture - 3

[Figure: FPGA SoC block diagram showing the hardware accelerators, the AXI-based interconnect, the DRAM controller, the ARM processors, and the FPGA-PS interfaces]


What’s a hardware accelerator?

  • A piece of hardware developed to compute a specific functionality
  • Placed in the Programmable Logic (FPGA)
  • Master/slave interfaces for interconnection (AXI standard)
  • Can be described in different ways

What’s a hardware accelerator? - 2

  • Hardware accelerator (HWA) development methods:
    • Hardware description languages (VHDL, Verilog, SystemVerilog, …)
    • Hardware construction languages (Chisel)
    • High-level synthesis (HLS)
    • Buying an Intellectual Property block (IP block)
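As a concrete illustration of the HLS path, here is a minimal Vivado HLS sketch (illustrative only; the function, ports, and bundle names are not from the slides): the AXI master interface lets the accelerator move data autonomously from/to DRAM, while the AXI-Lite slave interface exposes the control registers driven by the processor.

    // Minimal Vivado HLS sketch of a hardware accelerator (illustrative).
    // m_axi pragmas create AXI master ports for autonomous data movement;
    // s_axilite pragmas create the AXI-Lite slave control interface.
    extern "C" void vadd(const int *a, const int *b, int *out, int n) {
    #pragma HLS INTERFACE m_axi     port=a   offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi     port=b   offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
    #pragma HLS INTERFACE s_axilite port=a      bundle=control
    #pragma HLS INTERFACE s_axilite port=b      bundle=control
    #pragma HLS INTERFACE s_axilite port=out    bundle=control
    #pragma HLS INTERFACE s_axilite port=n      bundle=control
    #pragma HLS INTERFACE s_axilite port=return bundle=control
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
            out[i] = a[i] + b[i];  // the "specific functionality" of this HWA
        }
    }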

FPGA SoCs are used to run inference!

  • FPGA SoCs and FPGA MPSoCs are mainly used for embedded applications
  • FPGA MPSoCs and SoCs offer good inference performance:
    • In some cases comparable with GPU SoCs (in fps)
    • Better power efficiency
  • The training of neural networks is generally performed on big GPUs
  • In the following slides we assume a pre-trained neural network as the starting point:
    • The network model (Caffe .prototxt)
    • The trained weights (Caffe .caffemodel)

Considered platforms

Xilinx Zynq UltraScale+ MPSoC
  • Processing System
    • 4 general-purpose Cortex-A53 cores
    • 2 Cortex-R5 cores (real-time)
  • Programmable Logic (huge)
    • Zynq UltraScale+ logic

Xilinx Zynq-7000 (PYNQ Z-7020)
  • Processing System
    • 2 general-purpose Cortex-A9 cores
  • Programmable Logic
    • Artix 7-series FPGA
Credits: www.xilinx.com, www.digilent.com

Xilinx CHaiDNN


CHaiDNN: Overview

  • Open-source library for the acceleration of deep neural networks
  • Designed for the Xilinx UltraScale+ MPSoC
  • Designed to accelerate convolutional neural networks on images
  • High-level synthesis (HLS) based accelerators
  • Compatible with multiple popular classification networks (AlexNet, GoogLeNet, ResNet, etc.)
Credits: xilinx.com

CHaiDNN: Install

  • Official GitHub repository: https://github.com/Xilinx/CHaiDNN
  • Build the system: either use the prebuilt system, or build the hardware accelerator and the software
  • Download the network model
  • Prepare the SD card image
  • Flash the image onto the SD card


How does CHaiDNN work?

  • Given a Caffe model and weights file (.prototxt and .caffemodel)
  • The model is parsed at runtime by the CHaiDNN application, which:
    • Initializes the hardware accelerators and the data structures
    • Schedules the operations between the processors and the hardware accelerators
    • Issues the calls for acceleration through the CHaiDNN API

[Diagram: Caffe model → parsed by the CHaiDNN application → scheduler → execute on the processors or call the CHaiDNN API → hardware accelerators]
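The sketch below renders this paradigm in C++; it is illustrative only (the type and function names are made up, not the actual CHaiDNN API): the scheduler running on the PS walks the parsed layer list and dispatches each layer either to the hardware accelerators or to the processors.

    // Illustrative sketch of the CHaiDNN execution paradigm (hypothetical
    // names, not the real API): the scheduler running on the PS dispatches
    // each parsed layer either to a hardware accelerator or to the CPU.
    #include <vector>

    struct Layer {
        bool hw_supported;  // can a hardware accelerator execute this layer?
        // layer type, parameters, input/output buffers, ...
    };

    void issue_to_accelerator(const Layer &) { /* wrap a CHaiDNN API call */ }
    void run_on_processor(const Layer &)     { /* e.g. SoftMax, permute */ }

    void run_network(const std::vector<Layer> &layers) {
        for (const Layer &l : layers) {
            if (l.hw_supported)
                issue_to_accelerator(l);  // convolution, pooling, ...
            else
                run_on_processor(l);      // layers left to the processors
        }
    }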


Some important considerations

  • FPGA SoCs have limited on-chip memory resources
  • Neural network models generally use 16- or 32-bit floats -> too expensive for FPGA SoCs! (for instance, AlexNet's roughly 60 million weights already take about 240 MB as 32-bit floats, against a few MB of on-chip memory)
  • Moving data from/to the off-chip memory (DRAM) is very expensive:
    • Increases the power consumption
    • Degrades the performance
  • The gain in execution (acceleration) would be lost in moving the data (weights) from/to the off-chip memory


Quantization process

  • To achieve good performance, the neural network model must be quantized to run on an FPGA SoC!
  • CHaiDNN provides a tool for quantization, XportDNN
  • Better performance, at the cost of an accuracy loss ("minimal" according to the CHaiDNN developers)
  • CHaiDNN supports 6-bit and 8-bit integer quantized data
  • CHaiDNN provides quantized models for popular networks (ModelZoo)
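To make the quantization step concrete, here is a minimal sketch of 8-bit fixed-point quantization, the general scheme behind such tools (the rounding and saturation policy shown here is an assumption, not XportDNN's exact algorithm):

    // Minimal sketch of 8-bit fixed-point quantization (illustrative).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Map a float to int8 with 'frac_bits' fractional bits:
    // scale, round, then saturate to the representable range [-128, 127].
    int8_t quantize8(float x, int frac_bits) {
        long q = std::lround(x * (float)(1 << frac_bits));
        return (int8_t)std::min(127L, std::max(-128L, q));
    }

    // Recover the (approximate) float value.
    float dequantize8(int8_t q, int frac_bits) {
        return (float)q / (float)(1 << frac_bits);
    }

    // Example: with frac_bits = 5, 0.40625f -> 13 -> 0.40625f (exact),
    // while 0.41f -> 13 -> 0.40625f (a small quantization error).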

Playing with the ModelZoo

  • Ready-to-run popular neural network models (quantized)

[Figures: AlexNet, GoogLeNet, VGG, and ResNet50 network diagrams; credits: wikimedia.org, medium.com, researchgate.com]


FPGA SoC vs GPU SoC

[Figure: FPGA SoC vs GPU SoC comparison; credits: https://github.com/Xilinx/CHaiDNN]

Why FPGA SoC for NN acceleration?

FPGA SoC:
  • HWA development is not trivial
  • Execution on FPGA is highly predictable
  • Lower power consumption
  • Highly suitable for critical embedded applications!

VS

GPU SoC:
  • Easy to program (CUDA)
  • Time predictability is poor
  • Lower power efficiency

[Image: Xilinx ZYNQ UltraScale+ FPGA SoC; credits: xilinx.com]

Accelerate your own network

  • Works only for convolutional neural networks operating on images
  • Quantize the neural network (XportDNN)
  • Build the CHaiDNN code
  • Cross-compile the code
  • Put the model and the executable on the SD card
  • The hardware accelerator does not change with the network!

[Diagram: Caffe model (float) → quantization (XportDNN) → build the CHaiDNN code → (cross-)compile the code → execute the network]


The CHaiDNN hardware architecture

[Figure: CHaiDNN block design from Xilinx Vivado 2018.2]


The CHaiDNN hardware architecture - 2

[Figure: CHaiDNN block design with the ZYNQ UltraScale+ PS highlighted]


ZYNQ Processing System

  • Runs the CHaiDNN application:
    • Schedules the operations
    • Issues the requests for hardware acceleration
  • The Processing System executes:
    • L2 normalization
    • Permute
    • Inner product
    • SoftMax
    • Fully connected layers

Hardware Accelerators

[Figure: CHaiDNN block design with the hardware accelerators highlighted]


PoolTop accelerator

  • High-level synthesis (HLS) synthesized block
  • Custom adapter for the AXI Interconnect
  • Operations:
    • Pooling (max, average)

Convolution accelerator

  • High-level synthesis (HLS) synthesized block
  • Custom adapter for the AXI Interconnect
  • Operations:
    • Convolution
    • Normalization
    • Scale and bias
    • Element-wise addition
    • ReLU

DeConvolution accelerator

  • High-level synthesis (HLS) synthesized block
  • Custom adapter for the AXI Interconnect
  • Operation:
    • Deconvolution

AXI Interconnects

[Figure: CHaiDNN block design with the control AXI Interconnect highlighted]


AXI Interconnects - 2

[Figure: CHaiDNN block design with the data AXI Interconnect highlighted]


Final remarks on CHaiDNN

  • The Processing System and the hardware accelerators cooperate to execute the network
  • The hardware accelerators are not "custom-made" for the network
  • The Processing System schedules the operations
  • The Processing System issues the requests for hardware acceleration
  • The hardware accelerators are autonomous in reading/writing data from/to the memory
  • In some cases, the inference performance is comparable with GPU SoCs

Xilinx DNNDK


DNNDK: Overview

  • The Deep Neural Network Development Kit (DNNDK) is a full-stack deep learning SDK for the Deep Learning Processor Unit (DPU)
  • Designed for the Xilinx UltraScale+ MPSoC and ZYNQ-7000 platforms
  • Designed to accelerate convolutional neural networks on images
  • Compatible with multiple popular classification networks (AlexNet, GoogLeNet, ResNet, etc.)

DNNDK User Guide, Xilinx, UG1327 (v1.5)

The Deep Learning Processor Unit

  • The Deep Learning Processor Unit (DPU) is the hardware accelerator for DNNDK
  • Placed in the Programmable Logic
  • Custom size, to fit the Xilinx UltraScale+ MPSoC and ZYNQ-7000 platforms
  • Designed to accelerate popular convolutional neural networks (VGG, GoogLeNet, ResNet, YOLO, etc.)

DNNDK User Guide, Xilinx, UG1327 (v1.5)

DPU hardware architecture

  • Computing array of hybrid processing elements
  • Local on-chip support memory
  • Instruction fetch unit to fetch the instructions from memory
  • On-board scheduler
  • Autonomous high-speed data access
DNNDK User Guide, Xilinx, UG1327 (v1.5)

What's the difference from CHaiDNN?

DNNDK:
  • The Deep Learning Processor Unit defines its own instruction set
  • Instructions are generated by the DNNDK tools and fetched from the DRAM memory
  • The scheduler for the hardware operations is internal
  • Custom size, to fit both the Xilinx UltraScale+ MPSoC and ZYNQ-7000 platforms

CHaiDNN:
  • The hardware accelerator is a collection of AXI MM devices (no instruction set)
  • The hardware accelerators are managed directly by the PS
  • The scheduler for the hardware operations runs in the Processing System
  • Fixed size, not configurable

Customize the DPU

  • Change the number of DPU cores on the board
  • Parallelism of the convolutional unit
  • Total amount of on-chip memory
  • ReLU type
  • Hardware SoftMax implementation
  • … changing these features has an impact on performance and resource consumption!

DNNDK workflow

  • Compress the neural network model (DECENT)
  • Compile the neural network model (DNNC)
    • Build the DPU executable
  • Build the software application
    • DNNDK API
  • Compile and link the hybrid DPU application
  • Run the hybrid DPU executable

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


DNNDK workflow

  • Compress the neural network model (DECENT)

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


DECENT: network quantization

  • Deep compression tool for pruning and quantization, run on the host Linux PC
  • Input: Caffe model (float)
  • Prunes the network (removes useless nodes)
  • Converts the float inputs & weights into 8-bit integers
  • Output: quantized 8-bit integer network (inputs and weights)
DNNDK User Guide, Xilinx, UG1327 (v1.5)

DNNDK workflow

  • Compress the neural network model (DECENT)
  • Compile the neural network model (DNNC)
    • Build the DPU executable

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


DNNC tool

  • Tool for network compilation for the DNNDK framework; runs on the host Linux PC
  • Input:
    • The quantized 8-bit neural network (from DECENT)
  • Output:
    • The DPU kernel, the executable which will run on the DPU
    • The list of unsupported layers (which must be deployed on the CPU)
DNNDK User Guide, Xilinx, UG1327 (v1.5)
DNNDK workflow

  • Compress the neural network model (DECENT)
  • Compile the neural network model (DNNC)
    • Build the DPU executable
  • Build the software application
    • DNNDK API
  • Compile and link the hybrid DPU application
  • Run the hybrid DPU executable

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


Building the DNNDK software

  • The executable for the DPU is ready (DNNC)
  • Build the executable for the CPU:
    • Initialize the data structures
    • Initialize the DPU
    • Implement the layers not supported by the DPU
    • Add the pre-processing and post-processing (if needed)
  • Link the hybrid executable
  • Run the network on the FPGA SoC platform!
DNNDK User Guide, Xilinx, UG1327 (v1.5)
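The skeleton below sketches such a hybrid application using the N2Cube API documented in UG1327; the kernel name "mynet" and the elided pre/post-processing are placeholders for what DNNC generates for a specific network.

    // Sketch of a hybrid CPU+DPU application (DNNDK N2Cube API, UG1327).
    // "mynet" is a placeholder for the kernel name produced by DNNC.
    #include <dnndk/dnndk.h>

    int main() {
        dpuOpen();                                     // attach to the DPU driver
        DPUKernel *kernel = dpuLoadKernel("mynet");    // load the DNNC-built kernel
        DPUTask   *task   = dpuCreateTask(kernel, 0);  // create an inference task

        // ... initialize the data structures and fill the input tensor
        //     (pre-processing on the CPU) ...

        dpuRunTask(task);                              // run the DPU part of the network

        // ... execute the layers not supported by the DPU (e.g. SoftMax)
        //     and the post-processing on the CPU, then read the results ...

        dpuDestroyTask(task);                          // release the task
        dpuDestroyKernel(kernel);                      // release the kernel
        dpuClose();                                    // detach from the DPU driver
        return 0;
    }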

Final remarks on DNNDK

  • DNNDK is a full-stack framework to accelerate neural networks on FPGA SoCs
  • DNNDK differs from CHaiDNN in the execution paradigm
  • The DNNDK framework has been incorporated into the new Xilinx Vitis AI framework
  • The DPU is the state-of-the-art hardware accelerator for NNs on FPGA SoCs by Xilinx
  • The DPU is also used in the Vitis AI framework

Xilinx Vitis AI


Vitis AI: Overview

  • State-of-the-art framework for deep neural network acceleration on Xilinx SoCs and cards
  • Two different flows:
    • Cloud flow
    • Edge flow
  • Derives directly from DNNDK
  • Currently designed to accelerate convolutional neural networks operating on images
  • Compatible with popular neural networks
Vitis AI User Guide, Xilinx, UG1414 (v1.0)

Vitis is not Vitis AI!

  • Vitis is a unified software platform for the development of Xilinx FPGA SoCs, MPSoCs, and cards
    • Edge applications (embedded systems)
    • Cloud applications
    • Includes HLS, frameworks, RTL-based accelerators, etc.
  • Vitis AI is Xilinx's development platform for AI inference on Xilinx platforms
    • Edge applications (SoC and MPSoC)
    • Cloud applications (Alveo cards)
Credits: xilinx.com

DPU hardware architecture for cloud

  • Previously known as Xilinx xDNN
  • Different structure compared with the DPU for edge
  • Designed for Xilinx Alveo FPGA cards
Vitis AI User Guide, Xilinx, UG1414 (v1.0)
DPU hardware architecture for Edge

  • The same architecture as in DNNDK
  • Vitis AI is back-compatible with DNNDK applications

Vitis AI User Guide, Xilinx, UG1414 (v1.0)
Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network
  • Compile the neural network model
    • vai_c_caffe and vai_c_tensorflow
    • Output: executable for the DPU
  • Create the Vitis AI application
    • Unified API with DNNDK
    • Implement the layers not supported by the DPU
  • Cross-compile & link the hybrid application
  • Run on the FPGA SoC or MPSoC

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]

Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]

AI Optimizer and AI Quantizer

  • Same approach as in DNNDK
  • vai_q_caffe and vai_q_tensorflow tools for pruning and quantization
  • From a floating-point neural network model to a quantized 8-bit model

Vitis AI User Guide, Xilinx, UG1414 (v1.0)
Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network
  • Compile the neural network model
    • vai_c_caffe and vai_c_tensorflow
    • Output: executable for the DPU

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]

AI Compiler

  • Again, the same approach seen in DNNDK
  • vai_c_caffe and vai_c_tensorflow tools
  • From the quantized and pruned network to the DPU executable (DPU instructions)

Vitis AI User Guide, Xilinx, UG1414 (v1.0)
Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network
  • Compile the neural network model
    • vai_c_caffe and vai_c_tensorflow
    • Output: executable for the DPU
  • Create the Vitis AI application
    • Unified API with DNNDK
    • Implement the layers not supported by the DPU
  • Cross-compile & link the hybrid application
  • Run on the FPGA SoC or MPSoC

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]
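Since the edge API is unified with DNNDK, the hybrid-application skeleton shown earlier in the DNNDK section (dpuOpen / dpuLoadKernel / dpuCreateTask / dpuRunTask) should apply here essentially unchanged, assuming the Vitis AI runtime on the target exposes the same DNNDK-style calls.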


Final remarks

  • Three different frameworks for the hardware acceleration of neural networks on FPGA SoCs
  • CHaiDNN is the only open-source framework
    • However, it is not updated by Xilinx anymore (last commit 2 years ago)
  • Currently, DNNDK and Vitis AI are the two frameworks best supported by Xilinx
    • Acceleration is based on the Deep Learning Processor Unit (DPU)
    • Makes hardware acceleration easier for the developer
    • However, the general-purpose accelerator (DPU) is not customized for the network
  • Tradeoff between performance and development time

References

  • The CHaiDNN official GitHub, Xilinx: https://github.com/Xilinx/CHaiDNN
  • DNNDK User Guide, Xilinx, UG1327 (v1.5): https://www.xilinx.com/support/documentation/user_guides/ug1327-dnndk-user-guide.pdf
  • Vitis AI User Guide, Xilinx, UG1414 (v1.0): https://www.xilinx.com/support/documentation/sw_manuals/vitis_ai/1_0/ug1414-vitis-ai.pdf


Thanks for your attention

francesco.restuccia@santannapisa.it
