

SLIDE 1

FPGA implementation of Quantized (Deep) Neural Network (QNN) Accelerators

Biruk Seyoum, PhD Student, Scuola Superiore Sant'Anna, Pisa

Contents

- Introduction
- Pipelined FPGA DNN accelerators
- Roof-line model and optimizing FPGA DNN accelerators
- Quantized Neural Networks (QNNs)
- Introduction to FINN
- FINN demo

SLIDE 2

Some DNN Applications

- Computer vision
- Speech recognition
- Bioinformatics
- ...

Some DNN properties

- Enormous storage and compute requirements:
  - VGG-16 (528 MB, 19.6 GFLOPs)
  - ResNet-152 (232 MB, 11.3 GFLOPs)
  - GoogLeNet (51 MB, 1.5 GFLOPs)
- Computation is dominated by floating-point operations

SLIDE 3

FPGA DNN accelerator deployment

- Cloud
  - Hosts larger DNNs
  - Low inference latency
  - Suitable for applications insensitive to user-interaction latency (machine translation, NN business applications, forecasting, etc.)
  - Offloading data to the cloud carries energy and transmission-latency costs
- Embedded platforms
  - Suitable for real-time safety-critical applications (autonomous driving, speech recognition, etc.)
  - Ideal for smaller DNNs

SLIDE 4

FPGA DNN acceleration Architecture

The ZYNQ7010 SoC

- Single accelerator architecture
  - Single compute engine for all layers
  - Usually the most compute-intensive layer is offloaded to the FPGA engine

SLIDE 5

FPGA DNN acceleration Architecture

- Pipelined architecture
  - Each DNN layer maps to a HW implementation on the FPGA
  - Adjacent layers are connected using buffers

Focus of this presentation

SLIDE 6

Pipelined DNN FPGA accelerators

- DDR RD/WR: input images, parameters, outputs
- Compute operations: MAC, compare, div, etc.

For N DNN layers and a batch size of B:
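As a hedged sketch (this is the standard pipeline-timing relation, filled in here because the slide's timing diagram did not survive extraction; it is not copied from the slides), let t_i be the per-frame processing time of layer i:

    T_total = t_1 + t_2 + ... + t_N + (B - 1) * max_i(t_i)

The first frame pays the full sum of stage latencies; each subsequent frame of the batch emerges one slowest-stage period later, so for large B the throughput approaches 1 / max_i(t_i).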

SLIDE 7

Pipelined DNN FPGA accelerators

For N layers with per-layer, per-frame inference time t_i, the steady-state throughput is set by the slowest layer (see the relation above).

Pipelined DNN FPGA accelerators: performance

Bottleneck: peak FPGA compute capacity

SLIDE 8

Pipelined DNN FPGA accelerators: performance

Bottleneck: off-chip memory bandwidth

Accelerator performance depends on both:

- Peak FPGA compute capacity
- Off-chip memory (DRAM) bandwidth

SLIDE 9

Pipelined DNN FPGA accelerators: performance

The peak FPGA compute capacity is obtained from the hardware specification or from benchmarks (e.g., FLOPs/sec).

How do we relate accelerator performance to the FPGA compute capacity and the off-chip DRAM traffic?

SLIDE 10

Roof-line model

- A state-of-the-art performance model for multicore architectures
- Can easily be adapted to FPGA accelerators whose inputs are stored in DRAM
- Correlates compute performance, memory performance, and operational intensity in a 2D graph
- Operational intensity (arithmetic intensity) is the number of operations performed on the FPGA per byte of DRAM traffic
- Measured in Ops/byte (e.g., FLOPs/byte)
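As a hypothetical worked example (the numbers are illustrative, not from the slides): a layer that performs 2 GFLOPs while moving 100 MB to and from DRAM has an operational intensity of 2e9 / 1e8 = 20 FLOPs/byte.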

SLIDE 11

Roof-line model

(Figure: example computations with high vs. low operational intensity.)

Roof-line model

Attainable performance (GFLOPs/s) = min { Peak compute capacity, Memory bandwidth * Operational intensity }
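A minimal sketch of this relation in Python (the peak-compute and bandwidth figures below are illustrative, not from the slides):

    def attainable_gflops(peak_gflops, mem_bw_gb_s, op_intensity):
        # Performance is capped by the lower of the compute roof and the
        # memory roof (bandwidth * operational intensity).
        return min(peak_gflops, mem_bw_gb_s * op_intensity)

    print(attainable_gflops(100.0, 6.4, 2.0))   # 12.8  -> memory bound
    print(attainable_gflops(100.0, 6.4, 32.0))  # 100.0 -> compute bound

Below the intersection of the two roofs, raising operational intensity buys performance; past it, only a faster FPGA helps.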

SLIDE 12

Roof-line model

SLIDE 13

How to increase the performance of an FPGA accelerator

- Increase memory bandwidth (when memory bound)
- Migrate to a more powerful FPGA (when compute bound)

SLIDE 14

How to increase the performance of an FPGA accelerator

- Increase the operational intensity

Increasing operational intensity

Operational intensity can be increased by:

- Maximizing the number of operations performed on data fetched from DRAM before it is written back
  - E.g., implementing the DNN layers in a pipeline
- Reducing the precision of computation, so that more data can be brought in simultaneously
  - Using lower-precision computation in general, e.g., 16-bit floating point, 8-bit integer, etc. (a worked example follows this list)
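To make the precision argument concrete (hypothetical numbers, not from the slides): quantizing weights from 32-bit to 8-bit cuts their DRAM traffic by 4x while the operation count stays the same, so operational intensity rises roughly 4x. With the illustrative 6.4 GB/s bandwidth from the roof-line sketch above, a memory-bound layer at 2 Ops/byte (12.8 GOps/s) would climb to 8 Ops/byte (51.2 GOps/s).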

SLIDE 15

Quantized Neural Networks (QNNs)

- Deep neural networks are typically over-parametrized
- Remedies to this problem include:
  - Pruning
  - Weight sharing
  - Weight quantization (our topic)
- Weight quantization involves representing weights and parameters as low-precision integers, e.g., 8-bit, 4-bit, and in the extreme case binary

Benefits of weight quantization

- Reduced weight memory footprint
  - Reduced DRAM memory footprint of weights
  - E.g., AlexNet drops from 64 MB to ~2 MB with 1-bit weights (a 32x reduction relative to 32-bit weights)
  - The weights can even fit inside the on-chip memory of the FPGA, i.e., stored inside the accelerator

SLIDE 16

Benefits of weight quantization

- Faster computation
  - Reduced-precision integer arithmetic is much faster than floating-point computation (the number of computations per clock cycle is higher)
  - Computation (e.g., MAC) on reduced-precision integers is more FPGA friendly:
    - Floating-point computation uses DSPs (scarce)
    - Reduced-precision computation uses LUTs (abundant)
  - Compute engines on the FPGA also consume fewer resources

The FINN framework

SLIDE 17

Acceleration flow in FINN

Mapping flow in FINN

SLIDE 18

Quantization-aware training

Brevitas: A PyTorch library for quantization-aware training

ONNX representation

- FINN uses an ONNX-based intermediate representation
- Brevitas provides FINN-ONNX export (a sketch follows below)
- Quantization information is exported as annotations

https://github.com/Xilinx/brevitas
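A minimal sketch of this flow, assuming the Brevitas API of the FINN-era releases (QuantLinear/QuantReLU and brevitas.onnx.export_finn_onnx; newer Brevitas versions use a QONNX-based export instead). The bit widths, layer sizes, and file name are illustrative, not the network from the demo:

    import torch.nn as nn
    from brevitas.nn import QuantLinear, QuantReLU

    # A small quantized MLP; quantization is simulated in the forward
    # pass, so it trains like an ordinary PyTorch model.
    model = nn.Sequential(
        QuantLinear(784, 256, bias=True, weight_bit_width=2),
        QuantReLU(bit_width=2),
        QuantLinear(256, 10, bias=True, weight_bit_width=2),
    )

    # ... train the model, then export it to FINN-ONNX; the chosen
    # bit widths travel with the graph as annotations.
    import brevitas.onnx as bo
    bo.export_finn_onnx(model, (1, 784), "model.onnx")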

SLIDE 19

The FINN compiler

- Generates the Vivado HLS based mapping of BNN layers to FPGAs
- Streaming architecture
  - Each layer is mapped to a dedicated compute engine
  - Compute engines are pipelined and communicate via on-chip data streams

The FINN flow

SLIDE 20

The FINN compute engine

- Each layer has a single compute engine
- The number of Processing Elements (PEs) and the Single Instruction Multiple Data (SIMD) lanes of each PE determine the throughput of the layer (see the folding sketch below)
- PE and SIMD also determine the resource consumption of the engine
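A hedged sketch of the throughput side of this trade-off, following the folding relation used in FINN-style matrix-vector engines (H and W denote a layer's weight-matrix dimensions; they are notation introduced for this sketch, not symbols from the slides). Assuming PE divides H and SIMD divides W:

    cycles_per_frame ~= (H / PE) * (W / SIMD)

Doubling PE or SIMD roughly halves the layer's cycle count while roughly doubling the engine's resource consumption.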

A FINN compute engine

Features of BNN computation

SLIDE 21

DEMO

FINN Demo

SLIDE 22

In this demo you will see:

- The finn codebase
- Accelerator IP generation using Vivado_hls
- A quick summary of a Vivado partial reconfiguration flow
- Application of FINN to classify the CIFAR-10 dataset on the Pynq-Z1 board

Vivado DPR flow

SLIDE 23

Architecture of the BNN

Thank you. Questions?