Accelerating Convolutional Neural Networks on FPGA SoC

Francesco Restuccia, Ph.D. Fellow

04/12/2020

Overview

  • Background
  • Xilinx CHaiDNN framework
    • Overview
    • Working flow
    • Hardware architecture
  • Xilinx DNNDK framework
    • Overview
    • Working flow
    • The Deep Learning Processor Unit (DPU)
  • Xilinx Vitis AI framework
    • Overview
    • Vitis AI for edge applications

Background


FPGA SoC Architecture

[Figure: FPGA SoC block diagram highlighting the FPGA Programmable Logic]


FPGA SoC Architecture - 2

[Figure: FPGA SoC block diagram highlighting the ARM-based Processing System]


FPGA SoC Architecture - 3

[Figure: FPGA SoC block diagram showing the hardware accelerators, the AXI-based interconnect, the DRAM controller, the ARM processors, and the FPGA-PS interfaces]


What’s a hardware accelerator?

  • A piece of hardware developed to compute a specific functionality
  • Placed in the Programmable Logic (FPGA)
  • Master/slave interfaces for interconnection (AXI standard)
  • Can be described in different ways

What’s a hardware accelerator? - 2

  • Hardware accelerator (HWA) development methods:
    • Hardware description languages (VHDL, Verilog, SystemVerilog, …)
    • Hardware construction languages (Chisel)
    • High-level synthesis (HLS)
    • Buying an Intellectual Property block (IP block)
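As a concrete illustration of the HLS path, here is a minimal Vivado HLS sketch (illustrative only; the function, ports, and bundle names are not from the slides): the AXI master interface lets the accelerator move data autonomously from/to DRAM, while the AXI-Lite slave interface exposes the control registers driven by the processor.

    // Minimal Vivado HLS sketch of a hardware accelerator (illustrative).
    // m_axi pragmas create AXI master ports for autonomous data movement;
    // s_axilite pragmas create the AXI-Lite slave control interface.
    extern "C" void vadd(const int *a, const int *b, int *out, int n) {
    #pragma HLS INTERFACE m_axi     port=a   offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi     port=b   offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
    #pragma HLS INTERFACE s_axilite port=a      bundle=control
    #pragma HLS INTERFACE s_axilite port=b      bundle=control
    #pragma HLS INTERFACE s_axilite port=out    bundle=control
    #pragma HLS INTERFACE s_axilite port=n      bundle=control
    #pragma HLS INTERFACE s_axilite port=return bundle=control
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
            out[i] = a[i] + b[i];  // the "specific functionality" of this HWA
        }
    }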

FPGA SoCs are used to run inference!

  • FPGA SoCs and FPGA MPSoCs are mainly used for embedded applications
  • FPGA MPSoCs and SoCs offer good inference performance:
    • In some cases comparable with GPU SoCs (in fps)
    • Better power efficiency
  • The training of neural networks is generally performed on big GPUs
  • In the following slides we assume a pre-trained neural network as the starting point:
    • The network model (Caffe .prototxt)
    • The trained weights (Caffe .caffemodel)

Considered platforms

Xilinx Zynq UltraScale+ MPSoC
  • Processing System
    • 4 general-purpose Cortex-A53 cores
    • 2 Cortex-R5 cores (real-time)
  • Programmable Logic (huge)
    • Zynq UltraScale+ logic

Xilinx Zynq-7000 (PYNQ Z-7020)
  • Processing System
    • 2 general-purpose Cortex-A9 cores
  • Programmable Logic
    • Artix 7-series FPGA
Credits: www.xilinx.com, www.digilent.com

Xilinx CHaiDNN


CHaiDNN: Overview

  • Open-source library for the acceleration of deep neural networks
  • Designed for the Xilinx UltraScale+ MPSoC
  • Designed to accelerate convolutional neural networks on images
  • High-level synthesis (HLS) based accelerators
  • Compatible with multiple popular classification networks (AlexNet, GoogLeNet, ResNet, etc.)
Credits: xilinx.com

CHaiDNN: Install

  • Official GitHub repository: https://github.com/Xilinx/CHaiDNN
  • Build the system: either use the prebuilt system, or build the hardware accelerator and the software
  • Download the network model
  • Prepare the SD card image
  • Flash the image onto the SD card


How does CHaiDNN work?

  • Given a Caffe model and weights file (.prototxt and .caffemodel)
  • The model is parsed at runtime by the CHaiDNN application, which:
    • Initializes the hardware accelerators and the data structures
    • Schedules the operations between the processors and the hardware accelerators
    • Issues the calls for acceleration through the CHaiDNN API

[Diagram: Caffe model → parsed by the CHaiDNN application → scheduler → execute on the processors or call the CHaiDNN API → hardware accelerators]
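The sketch below renders this paradigm in C++; it is illustrative only (the type and function names are made up, not the actual CHaiDNN API): the scheduler running on the PS walks the parsed layer list and dispatches each layer either to the hardware accelerators or to the processors.

    // Illustrative sketch of the CHaiDNN execution paradigm (hypothetical
    // names, not the real API): the scheduler running on the PS dispatches
    // each parsed layer either to a hardware accelerator or to the CPU.
    #include <vector>

    struct Layer {
        bool hw_supported;  // can a hardware accelerator execute this layer?
        // layer type, parameters, input/output buffers, ...
    };

    void issue_to_accelerator(const Layer &) { /* wrap a CHaiDNN API call */ }
    void run_on_processor(const Layer &)     { /* e.g. SoftMax, permute */ }

    void run_network(const std::vector<Layer> &layers) {
        for (const Layer &l : layers) {
            if (l.hw_supported)
                issue_to_accelerator(l);  // convolution, pooling, ...
            else
                run_on_processor(l);      // layers left to the processors
        }
    }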


Some important considerations

  • FPGA SoCs have limited on-chip memory resources
  • Neural network models generally use 16- or 32-bit floats -> too expensive for FPGA SoCs! (for instance, AlexNet's roughly 60 million weights already take about 240 MB as 32-bit floats, against a few MB of on-chip memory)
  • Moving data from/to the off-chip memory (DRAM) is very expensive:
    • Increases the power consumption
    • Degrades the performance
  • The gain in execution (acceleration) would be lost in moving the data (weights) from/to the off-chip memory


Quantization process

  • To achieve good performance, the neural network model must be quantized to run on an FPGA SoC!
  • CHaiDNN provides a tool for quantization, XportDNN
  • Better performance, at the cost of an accuracy loss ("minimal" according to the CHaiDNN developers)
  • CHaiDNN supports 6-bit and 8-bit integer quantized data
  • CHaiDNN provides quantized models for popular networks (ModelZoo)
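To make the quantization step concrete, here is a minimal sketch of 8-bit fixed-point quantization, the general scheme behind such tools (the rounding and saturation policy shown here is an assumption, not XportDNN's exact algorithm):

    // Minimal sketch of 8-bit fixed-point quantization (illustrative).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Map a float to int8 with 'frac_bits' fractional bits:
    // scale, round, then saturate to the representable range [-128, 127].
    int8_t quantize8(float x, int frac_bits) {
        long q = std::lround(x * (float)(1 << frac_bits));
        return (int8_t)std::min(127L, std::max(-128L, q));
    }

    // Recover the (approximate) float value.
    float dequantize8(int8_t q, int frac_bits) {
        return (float)q / (float)(1 << frac_bits);
    }

    // Example: with frac_bits = 5, 0.40625f -> 13 -> 0.40625f (exact),
    // while 0.41f -> 13 -> 0.40625f (a small quantization error).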

Playing with the ModelZoo

  • Ready-to-run popular neural network models (quantized)

[Figures: AlexNet, GoogLeNet, VGG, and ResNet50 network diagrams; credits: wikimedia.org, medium.com, researchgate.com]


FPGA SoC vs GPU SoC

[Figure: FPGA SoC vs GPU SoC comparison; credits: https://github.com/Xilinx/CHaiDNN]

Why FPGA SoC for NN acceleration?

FPGA SoC:
  • HWA development is not trivial
  • Execution on FPGA is highly predictable
  • Lower power consumption
  • Highly suitable for critical embedded applications!

VS

GPU SoC:
  • Easy to program (CUDA)
  • Time predictability is poor
  • Lower power efficiency

[Image: Xilinx ZYNQ UltraScale+ FPGA SoC; credits: xilinx.com]

Accelerate your own network

  • Works only for convolutional neural networks operating on images
  • Quantize the neural network (XportDNN)
  • Build the CHaiDNN code
  • Cross-compile the code
  • Put the model and the executable on the SD card
  • The hardware accelerator does not change with the network!

[Diagram: Caffe model (float) → quantization (XportDNN) → build the CHaiDNN code → (cross-)compile the code → execute the network]


The CHaiDNN hardware architecture

[Figure: CHaiDNN block design from Xilinx Vivado 2018.2]


The CHaiDNN hardware architecture - 2

[Figure: CHaiDNN block design with the ZYNQ UltraScale+ PS highlighted]


ZYNQ Processing System

  • Runs the CHaiDNN application:
    • Schedules the operations
    • Issues the requests for hardware acceleration
  • The Processing System executes:
    • L2 normalization
    • Permute
    • Inner product
    • SoftMax
    • Fully connected layers

Hardware Accelerators

[Figure: CHaiDNN block design with the hardware accelerators highlighted]


PoolTop accelerator

  • High-level synthesis (HLS) synthesized block
  • Custom adapter for the AXI Interconnect
  • Operations:
    • Pooling (max, average)

Convolution accelerator

  • High-level synthesis (HLS) synthesized block
  • Custom adapter for the AXI Interconnect
  • Operations:
    • Convolution
    • Normalization
    • Scale and bias
    • Element-wise addition
    • ReLU

DeConvolution accelerator

  • High-level synthesis (HLS) synthesized block
  • Custom adapter for the AXI Interconnect
  • Operation:
    • Deconvolution

AXI Interconnects

[Figure: CHaiDNN block design with the control AXI Interconnect highlighted]


AXI Interconnects - 2

[Figure: CHaiDNN block design with the data AXI Interconnect highlighted]


Final remarks on CHaiDNN

  • The Processing System and the hardware accelerators cooperate to execute the network
  • The hardware accelerators are not "custom-made" for the network
  • The Processing System schedules the operations
  • The Processing System issues the requests for hardware acceleration
  • The hardware accelerators are autonomous in reading/writing data from/to the memory
  • In some cases, the inference performance is comparable with GPU SoCs

Xilinx DNNDK


DNNDK: Overview

  • The Deep Neural Network Development Kit (DNNDK) is a full-stack deep learning SDK for the Deep Learning Processor Unit (DPU)
  • Designed for the Xilinx UltraScale+ MPSoC and ZYNQ-7000 platforms
  • Designed to accelerate convolutional neural networks on images
  • Compatible with multiple popular classification networks (AlexNet, GoogLeNet, ResNet, etc.)

DNNDK User Guide, Xilinx, UG1327 (v1.5)

The Deep Learning Processor Unit

  • The Deep Learning Processor Unit (DPU) is the hardware accelerator for DNNDK
  • Placed in the Programmable Logic
  • Custom size, to fit the Xilinx UltraScale+ MPSoC and ZYNQ-7000 platforms
  • Designed to accelerate popular convolutional neural networks (VGG, GoogLeNet, ResNet, YOLO, etc.)

DNNDK User Guide, Xilinx, UG1327 (v1.5)

DPU hardware architecture

  • Computing array of hybrid processing elements
  • Local on-chip support memory
  • Instruction fetch unit to fetch the instructions from memory
  • On-board scheduler
  • Autonomous high-speed data access
DNNDK User Guide, Xilinx, UG1327 (v1.5)

What's the difference from CHaiDNN?

DNNDK:
  • The Deep Learning Processor Unit defines its own instruction set
  • Instructions are generated by the DNNDK tools and fetched from the DRAM memory
  • The scheduler for the hardware operations is internal
  • Custom size, to fit both the Xilinx UltraScale+ MPSoC and ZYNQ-7000 platforms

CHaiDNN:
  • The hardware accelerator is a collection of AXI MM devices (no instruction set)
  • The hardware accelerators are managed directly by the PS
  • The scheduler for the hardware operations runs in the Processing System
  • Fixed size, not configurable

Customize the DPU

  • Change the number of DPU cores on the board
  • Parallelism of the convolutional unit
  • Total amount of on-chip memory
  • ReLU type
  • Hardware SoftMax implementation
  • … changing these features has an impact on performance and resource consumption!

DNNDK workflow

  • Compress the neural network model (DECENT)
  • Compile the neural network model (DNNC)
    • Build the DPU executable
  • Build the software application
    • DNNDK API
  • Compile and link the hybrid DPU application
  • Run the hybrid DPU executable

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


DNNDK workflow

  • Compress the neural network model (DECENT)

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


DECENT: network quantization

  • Deep compression tool for pruning and quantization, run on the host Linux PC
  • Input: Caffe model (float)
  • Prunes the network (removes useless nodes)
  • Converts the float inputs & weights into 8-bit integers
  • Output: quantized 8-bit integer network (inputs and weights)
DNNDK User Guide, Xilinx, UG1327 (v1.5)

DNNDK workflow

  • Compress the neural network model (DECENT)
  • Compile the neural network model (DNNC)
    • Build the DPU executable

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


DNNC tool

  • Tool for network compilation for the DNNDK framework; runs on the host Linux PC
  • Input:
    • The quantized 8-bit neural network (from DECENT)
  • Output:
    • The DPU kernel, the executable which will run on the DPU
    • The list of unsupported layers (which must be deployed on the CPU)
DNNDK User Guide, Xilinx, UG1327 (v1.5)
DNNDK workflow

  • Compress the neural network model (DECENT)
  • Compile the neural network model (DNNC)
    • Build the DPU executable
  • Build the software application
    • DNNDK API
  • Compile and link the hybrid DPU application
  • Run the hybrid DPU executable

[Diagram: Caffe model (float) → quantization & pruning (DECENT) → build the DPU executable (DNNC) → build the software app → cross-compile & link]


Building the DNNDK software

  • The executable for the DPU is ready (DNNC)
  • Build the executable for the CPU:
    • Initialize the data structures
    • Initialize the DPU
    • Implement the layers not supported by the DPU
    • Add the pre-processing and post-processing (if needed)
  • Link the hybrid executable
  • Run the network on the FPGA SoC platform!
DNNDK User Guide, Xilinx, UG1327 (v1.5)
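The skeleton below sketches such a hybrid application using the N2Cube API documented in UG1327; the kernel name "mynet" and the elided pre/post-processing are placeholders for what DNNC generates for a specific network.

    // Sketch of a hybrid CPU+DPU application (DNNDK N2Cube API, UG1327).
    // "mynet" is a placeholder for the kernel name produced by DNNC.
    #include <dnndk/dnndk.h>

    int main() {
        dpuOpen();                                     // attach to the DPU driver
        DPUKernel *kernel = dpuLoadKernel("mynet");    // load the DNNC-built kernel
        DPUTask   *task   = dpuCreateTask(kernel, 0);  // create an inference task

        // ... initialize the data structures and fill the input tensor
        //     (pre-processing on the CPU) ...

        dpuRunTask(task);                              // run the DPU part of the network

        // ... execute the layers not supported by the DPU (e.g. SoftMax)
        //     and the post-processing on the CPU, then read the results ...

        dpuDestroyTask(task);                          // release the task
        dpuDestroyKernel(kernel);                      // release the kernel
        dpuClose();                                    // detach from the DPU driver
        return 0;
    }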

Final remarks on DNNDK

  • DNNDK is a full-stack framework to accelerate neural networks on FPGA SoCs
  • DNNDK differs from CHaiDNN in the execution paradigm
  • The DNNDK framework has been incorporated into the new Xilinx Vitis AI framework
  • The DPU is the state-of-the-art hardware accelerator for NNs on FPGA SoCs by Xilinx
  • The DPU is also used in the Vitis AI framework

Xilinx Vitis AI


Vitis AI: Overview

  • State-of-the-art framework for deep neural network acceleration on Xilinx SoCs and cards
  • Two different flows:
    • Cloud flow
    • Edge flow
  • Derives directly from DNNDK
  • Currently designed to accelerate convolutional neural networks operating on images
  • Compatible with popular neural networks
Vitis AI User Guide, Xilinx, UG1414 (v1.0)

Vitis is not Vitis AI!

  • Vitis is a unified software platform for the development of Xilinx FPGA SoCs, MPSoCs, and cards
    • Edge applications (embedded systems)
    • Cloud applications
    • Includes HLS, frameworks, RTL-based accelerators, etc.
  • Vitis AI is Xilinx's development platform for AI inference on Xilinx platforms
    • Edge applications (SoC and MPSoC)
    • Cloud applications (Alveo cards)
Credits: xilinx.com

DPU hardware architecture for cloud

  • Previously known as Xilinx xDNN
  • Different structure compared with the DPU for edge
  • Designed for Xilinx Alveo FPGA cards
Vitis AI User Guide, Xilinx, UG1414 (v1.0)
DPU hardware architecture for Edge

  • The same architecture as in DNNDK
  • Vitis AI is back-compatible with DNNDK applications

Vitis AI User Guide, Xilinx, UG1414 (v1.0)
Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network
  • Compile the neural network model
    • vai_c_caffe and vai_c_tensorflow
    • Output: executable for the DPU
  • Create the Vitis AI application
    • Unified API with DNNDK
    • Implement the layers not supported by the DPU
  • Cross-compile & link the hybrid application
  • Run on the FPGA SoC or MPSoC

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]

Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]

AI Optimizer and AI Quantizer

  • Same approach as in DNNDK
  • vai_q_caffe and vai_q_tensorflow tools for pruning and quantization
  • From a floating-point neural network model to a quantized 8-bit model

Vitis AI User Guide, Xilinx, UG1414 (v1.0)
Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network
  • Compile the neural network model
    • vai_c_caffe and vai_c_tensorflow
    • Output: executable for the DPU

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]

AI Compiler

  • Again, the same approach seen in DNNDK
  • vai_c_caffe and vai_c_tensorflow tools
  • From the quantized and pruned network to the DPU executable (DPU instructions)

Vitis AI User Guide, Xilinx, UG1414 (v1.0)
Vitis AI working flow (Edge)

  • Quantize the neural network model
    • vai_q_caffe and vai_q_tensorflow
    • Output: quantized and pruned network
  • Compile the neural network model
    • vai_c_caffe and vai_c_tensorflow
    • Output: executable for the DPU
  • Create the Vitis AI application
    • Unified API with DNNDK
    • Implement the layers not supported by the DPU
  • Cross-compile & link the hybrid application
  • Run on the FPGA SoC or MPSoC

[Diagram: network model (float) → quantization (vai_q_caffe) → compile for the DPU (vai_c_caffe) → create the software app → cross-compile & link]
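Since the edge API is unified with DNNDK, the hybrid-application skeleton shown earlier in the DNNDK section (dpuOpen / dpuLoadKernel / dpuCreateTask / dpuRunTask) should apply here essentially unchanged, assuming the Vitis AI runtime on the target exposes the same DNNDK-style calls.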


Final remarks

  • Three different frameworks for the hardware acceleration of neural networks on FPGA SoCs
  • CHaiDNN is the only open-source framework
    • However, it is not updated by Xilinx anymore (last commit 2 years ago)
  • Currently, DNNDK and Vitis AI are the two frameworks best supported by Xilinx
    • Acceleration is based on the Deep Learning Processor Unit (DPU)
    • Makes hardware acceleration easier for the developer
    • However, the general-purpose accelerator (DPU) is not customized for the network
  • Tradeoff between performance and development time

References

  • The CHaiDNN official GitHub, Xilinx: https://github.com/Xilinx/CHaiDNN
  • DNNDK User Guide, Xilinx, UG1327 (v1.5): https://www.xilinx.com/support/documentation/user_guides/ug1327-dnndk-user-guide.pdf
  • Vitis AI User Guide, Xilinx, UG1414 (v1.0): https://www.xilinx.com/support/documentation/sw_manuals/vitis_ai/1_0/ug1414-vitis-ai.pdf


Thanks for your attention

francesco.restuccia@santannapisa.it
