Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

SLIDE 1

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula†

School of Electrical, Computer and Energy Engineering

†School of Computing, Informatics, Decision Systems Engineering

Arizona State University, Tempe, USA

SLIDE 2

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
  • 2 -
SLIDE 3

Convolutional Neural Networks (CNN)

  • Dominant approach for recognition and detection tasks
  • Highly iterative with a few computing primitives
  • Composed of multiple types of layers
  • Evolving rapidly with more layers to achieve higher accuracy

[Figure: Input Image → Convolution + Activation → Pooling (Subsampling) → Convolution + Activation → Fully-connected (Inner Product); feature maps propagate through from a few to >100 layers]

SLIDE 4

CNN Layers and Structure

  • Convolution (conv or cccp)
    – 3D MAC operations
    – Constitutes >90% of the total operations
  • Pooling (pool)
    – Keeps the maximum or average value of pixels
  • LRN (norm)
    – Local response normalization: non-linear
  • Fully-connected (fc)
    – Matrix-vector multiplication
    – Requires a large volume of weights
  • CNN structures for image classification
    – AlexNet [A. Krizhevsky, NIPS2012]
    – NIN [M. Lin, ICLR2014]
SLIDE 5

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 6

Comparison of CNN Accelerators

Comparison criteria: Throughput, Resource Utilization, Energy Efficiency, Reconfigurability, Design Speed

  • Software, GPU [Y. Jia, Caffe; M. Abadi, TensorFlow]
    – Flexible deep learning frameworks with modularity
    – Accelerated on GPUs with thousands of parallel cores
    – High power consumption (>100 W)

SLIDE 7

Comparison of CNN Accelerators

  • HLS, FPGA [C. Zhang, FPGA2015; N. Suda, FPGA2016]
    – High-level synthesis (e.g., OpenCL) based FPGA accelerators
    – Short turnaround time and fast design optimization
    – Cannot exploit low-level hardware structures

SLIDE 8

Comparison of CNN Accelerators

  • RTL, generic CNN accelerator [C. Farabet, CVPR2011]
    – Agnostic to the CNN model configuration
    – Inefficient hardware resource usage

SLIDE 9

Comparison of CNN Accelerators

  • RTL, optimized for a specific CNN [J. Qiu, FPGA2016]
    – High efficiency with greater acceleration
    – Poor flexibility, long turnaround time
    – Requires in-depth understanding of FPGA/ASIC design

SLIDE 10

Comparison of CNN Accelerators

  • Proposed RTL compiler
    – Modular and scalable hardware design framework
    – Integrates the flexibility of HLS with the finer-level optimization of RTL

SLIDE 11

Comparison of CNN Accelerators

Summary: the five approaches (Software on GPU; HLS on FPGA; RTL, generic CNN accelerator; RTL, optimized for a specific CNN; the proposed RTL compiler) are compared on Throughput, Resource Utilization, Energy Efficiency, Reconfigurability, and Design Speed.

SLIDE 12

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 13

Proposed CNN RTL Compiler

  • Modular and scalable hardware design framework
  • Compile end-to-end CNNs into efficient RTL codes for FPGA/ASIC

[Compilation flow: CNN models → RTL compiler (Python) → parameterized RTL scripts (Verilog) → FPGA design tools (e.g., Quartus) → FPGA programming file]

Compiler inputs (CNN model):
  • Connection of layers
  • Type of layers
  • Number and size of kernel/feature maps

Compiler inputs (computing resources):
  • Number of multipliers

Generated RTL:
  • Top-level system
  • Conv/Pool/Norm/FC modules
  • RTL DMA controller
  • On-chip buffers
  • Data router
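As a rough illustration of what such a compiler front end can look like, here is a minimal Python sketch (all function and field names are hypothetical, not from the paper's actual compiler) that turns a layer list and a multiplier budget into a Verilog parameter snippet:

```python
# Hypothetical sketch of the compiler front end: it reads a layer list
# (type, connectivity order, kernel/feature-map sizes) plus a multiplier
# budget and emits a Verilog parameter snippet for the hand-written RTL
# modules. Names and output format are illustrative, not the paper's code.

def emit_verilog_params(layers, num_multipliers):
    lines = ["// Auto-generated CNN layer parameters",
             f"parameter NUM_MULT = {num_multipliers};"]
    for i, layer in enumerate(layers):
        prefix = f"{layer['type'].upper()}{i}"
        lines.append(
            f"parameter {prefix}_K = {layer.get('K', 1)}, "
            f"{prefix}_NIF = {layer['Nif']}, {prefix}_NOF = {layer['Nof']};")
    return "\n".join(lines)

# Toy example: first conv layer and last fc layer of an AlexNet-like model
print(emit_verilog_params(
    [{"type": "conv", "K": 11, "Nif": 3, "Nof": 96},
     {"type": "fc", "Nif": 9216, "Nof": 4096}], 256))
```

The real compiler emits full module instantiations, buffers, and a DMA controller; this sketch only shows the parameterization idea.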
SLIDE 14

Convolution Parameters and Loops


  • Loop-1: MAC within a kernel window of K×K
  • Loop-2: scan within one input feature map along X×Y
  • Loop-3: across the Nif input feature maps
  • Loop-4: across the Nof output feature maps

[Figure: Nif input feature maps (Xi×Yi) are convolved with Nof sets of Nif K×K kernel (filter) maps to produce Nof output feature maps (Xo×Yo)]
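The four loops can be written out directly; the following NumPy sketch is the straightforward (unoptimized) reference form, with loop roles as labeled on the slide (stride 1 and no padding are simplifying assumptions):

```python
# Reference form of the four convolution loops, stride 1, no padding.
# Variable names follow the slide: Nif/Nof input/output feature maps,
# K x K kernel window, Xi x Yi input size, Xo x Yo output size.
import numpy as np

def conv_layer(in_fmaps, kernels):
    # in_fmaps: (Nif, Yi, Xi); kernels: (Nof, Nif, K, K)
    Nif, Yi, Xi = in_fmaps.shape
    Nof, _, K, _ = kernels.shape
    Yo, Xo = Yi - K + 1, Xi - K + 1
    out = np.zeros((Nof, Yo, Xo))
    for no in range(Nof):                  # Loop-4: output feature maps
        for ni in range(Nif):              # Loop-3: input feature maps
            for y in range(Yo):            # Loop-2: scan within one map
                for x in range(Xo):
                    for ky in range(K):    # Loop-1: MAC within K x K window
                        for kx in range(K):
                            out[no, y, x] += (in_fmaps[ni, y + ky, x + kx]
                                              * kernels[no, ni, ky, kx])
    return out
```

The accelerator's design space is exactly which of these loops to unroll in hardware, which the next slide addresses.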

SLIDE 15

Strategy to Accelerate Convolution


[Figure: the K×K kernel window slides over the input feature maps; Loop-3 is unrolled across input feature maps and Loop-4 across output feature maps]

  • If Nm > Nif: fully unroll Loop-3 and further unroll Loop-4
    – Compute Nm/Nif output feature maps in parallel with shared features
  • If Nm < Nif: partially unroll Loop-3
    – Repeat the kernel window sliding Nif/Nm times
  • Serially compute Loop-1 before Loop-2: reduces the number of partial sums

(Nm = number of multipliers)
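The allocation rule above can be sketched as a small helper (the dictionary keys are illustrative, and even divisibility of Nm and Nif is assumed, as with the power-of-two sizes discussed later):

```python
# Sketch of the slide's multiplier-allocation rule (Nm = # of multipliers).
# Assumes Nm and Nif divide evenly; key names are illustrative.

def unroll_factors(Nm, Nif):
    if Nm >= Nif:
        # Fully unroll Loop-3; further unroll Loop-4 so that Nm // Nif
        # output feature maps are computed in parallel with shared features.
        return {"loop3_parallel": Nif,
                "loop4_parallel": Nm // Nif,
                "loop3_passes": 1}
    # Partially unroll Loop-3; repeat kernel-window sliding Nif // Nm times.
    return {"loop3_parallel": Nm,
            "loop4_parallel": 1,
            "loop3_passes": Nif // Nm}

# e.g. 256 multipliers on a layer with Nif = 64 input maps gives 64-way
# Loop-3 parallelism with 4 output maps computed concurrently.
```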

SLIDE 16

CONV Module and Components

  • Control logic

– Control the sliding of the four loops by counters
– Counters are parameterized by K, X, Y, Nif and Nof of each layer
– Generate buffer addresses

SLIDE 17

CONV Module and Components

  • Adder Trees

– # of fan-in = Nif; # of adders = Nm/Nif
– Sum results from Nif parallel multipliers
– Accumulate within one kernel window (K×K)
– Shared by convolution layers with identical Nif

  • ReLU = max(pixel, 0)

– Check the sign bit
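The sign-bit check can be sketched for the 10-bit fixed-point features used later in the paper (the bit-level form below mirrors the hardware idea but is only an illustration):

```python
# ReLU = max(pixel, 0) reduces to a sign-bit check on the fixed-point word.
# WIDTH matches the paper's 10-bit feature precision; the Python form is
# an illustration of the hardware behavior, not generated RTL.

WIDTH = 10
SIGN_MASK = 1 << (WIDTH - 1)

def relu_fixed(word):
    """word: WIDTH-bit two's-complement value encoded as 0 .. 2**WIDTH - 1."""
    return 0 if word & SIGN_MASK else word  # negative -> 0, else pass through
```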

SLIDE 18
POOL, NORM, and FC Modules

  • POOL (MAX or AVE) Module
  • NORM Module
  • FC Module
    – Performs matrix-vector multiplication (a special form of convolution)
    – Shares multipliers with CONV
    – Adders are shared across all FC layers
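As a software reference for the POOL module's MAX mode, a minimal sketch follows (the window size and stride here are illustrative; the actual pooling parameters are layer-dependent):

```python
# Sketch of the POOL module's MAX mode: keep the maximum pixel in each
# non-overlapping window. Window size is illustrative (default 2x2).
import numpy as np

def max_pool(fmap, window=2):
    Y, X = fmap.shape
    Yo, Xo = Y // window, X // window
    out = np.empty((Yo, Xo))
    for y in range(Yo):
        for x in range(Xo):
            out[y, x] = fmap[y * window:(y + 1) * window,
                             x * window:(x + 1) * window].max()
    return out
```

AVE mode would substitute `.mean()` for `.max()` over the same window.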
SLIDE 19

Integration of Modules

  • Overall CNN Accelerator
SLIDE 20

Integration of Modules (Controller)

  • Controller

– Directs the layer-by-layer serial computation of the modules

SLIDE 21

Integration of Modules (Data Router)

  • Feature Data Router

– Select write and read data of two adjacent modules
– Assign buffer outputs to POOL or the shared multipliers

SLIDE 22

Integration of Modules (Memory)

  • Feature Buffers

– Feature maps are stored in separate on-chip RAMs

SLIDE 23

Integration of Modules (Memory)

  • Weight Buffers

– FC weights transfer is overlapped with its computation
– CONV weights transfer completes before its computation

SLIDE 24

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 25

Experimental Setup & FPGA System

  • AlexNet and NIN CNN models
  • Stand-alone DE5-Net board with Altera Stratix-V GXA7 FPGA chip

– 622K logic elements, 256 DSP blocks, 2560 M20K RAMs

  • Synthesized by Altera Quartus tool.

[System diagram: standard Altera IP controls the transfer of data from flash memory to SDRAM and then starts the CNN acceleration; data are then transferred from SDRAM to on-chip RAMs]

SLIDE 26

Experimental Results

| Metric | J. Qiu, FPGA2016 | C. Zhang, FPGA2015 | N. Suda, FPGA2016 | This work | This work |
| FPGA | Zynq XC7Z045 | Virtex-7 VX485T | Stratix-V GXA7 | Stratix-V GXA7 | Stratix-V GXA7 |
| Design Entry | RTL | C-language | OpenCL | RTL Compiler | RTL Compiler |
| CNN Model | VGG-16 | AlexNet | AlexNet | AlexNet | NIN |
| # of op. per image | 30.76 GOP | 1.33 GOP | 1.46 GOP | 1.46 GOP | 2.2 GOP |
| DSP Utilization | 780 (89%) | 2,240 (80%) | 256 (100%) | 256 (100%) | 256 (100%) |
| Logic Utilization (a) | 183K (84%) | 186K (61%) | 114K (49%) | 121K (52%) | 112K (48%) |
| On-chip RAM (b) | 486 (87%) | 1,024 (50%) | 1,893 (74%) | 1,552 (61%) | 2,330 (91%) |
| Convolution throughput | 187.80 GOPS | 61.6 GFOPS | 67.5 GOPS | 134.1 GOPS | 117.3 GOPS |
| Overall throughput | 136.97 GOPS | N/A | 60.2 GOPS | 114.5 GOPS | 117.3 GOPS |

  • a. Xilinx FPGAs in LUTs and Altera FPGAs in ALMs
  • b. Xilinx FPGAs in BRAMs (36 Kb) and Altera FPGAs in M20K RAMs (20 Kb)
  • Compared to OpenCL design, 1.9X overall throughput improvement

– On the same FPGA board
– Using similar hardware resources

  • Compared to HLS design, 2X convolution throughput improvement
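The workload and throughput figures in the table imply the per-image latency directly, which is a useful sanity check on the numbers:

```python
# Per-image latency follows from operations per image divided by overall
# throughput, using the table's figures for this work.

def latency_ms(gop_per_image, gops):
    return gop_per_image / gops * 1e3

alexnet_ms = latency_ms(1.46, 114.5)   # roughly 12.75 ms per image
nin_ms = latency_ms(2.2, 117.3)        # roughly 18.76 ms per image
```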
SLIDE 27

Experimental Results

(Comparison table repeated from Slide 26.)
  • Model customized RTL and more DSPs improve throughput
  • More regular structure of VGG benefits the performance

– Uniform kernel map sizes, Nif in powers of two, no norm layers

SLIDE 28

Experimental Results

(Comparison table repeated from Slide 26.)
  • NIN has more convolution layers and more operations
  • Similar throughput can be achieved for both models
SLIDE 29

Timing Breakdown of Layers

[Chart: execution time (ms) per layer for AlexNet and NIN, across CONV1–CONV5, CCCPs, DMA_CONV, POOLs, and FC6/7/8]

  • FC latency is determined by the DMA transfer delay, which covers the computation latency.
  • DMA transfer latency of CONV weights is NOT hidden.
  • AlexNet: convolution = 8.74 ms (68.5% of total); NIN: convolution layers = 17.04 ms (90.9% of total)

SLIDE 30

Logic and DSP Block Utilization

  • Stratix-V GXA7 has only 256 DSP blocks.
  • Multipliers are implemented by both logic elements and DSP blocks.
  • Layers with the same Nif are combined into one module with a shared adder tree

[Charts: logic utilization in ALMs and number of DSP blocks, for AlexNet and NIN]

SLIDE 31

On-chip RAM Breakdown

  • RAMs are stacked for modules with shallow word depths
  • RAMs are shared by non-consecutive modules
  • Weight buffers receive weights from external memory

[Chart: on-chip RAM utilization in M20K blocks for AlexNet and NIN]

SLIDE 32

Power Measurement and Breakdown

  • Measured power of the DE5-Net board
    – running nothing: 16.5 W
    – running AlexNet: 19.5 W
    – running NIN: 19.1 W
  • Simulated power of the Stratix-V chip running AlexNet: 12.8 W

[Charts: simulation-based power breakdown of the FPGA chip (AlexNet accelerator, DDR3 controller, mSGDMA, I/O, others) and of the AlexNet accelerator itself (multipliers, CONVs, FC, NORMs, POOLs, RAM, routing)]

SLIDE 33

ImageNet Accuracy

  • Data width: Features = 10-bit, Weights = 8-bit
  • The portions of integer and fractional bits are adjusted according to the range of values in each layer.


Model accuracy comparison:

| CNN model | Software (Caffe, 32-bit) Top-1 | Top-5 | FPGA (this work) Top-1 | Top-5 |
| AlexNet | 56.78% | 79.72% | 55.64% | 79.32% |
| NIN | 56.14% | 79.32% | 55.74% | 78.96% |

Tested on 5K images from ImageNet 2012 validation database.

Reduced data width requirements while retaining a comparable accuracy level.
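A minimal sketch of such per-layer fixed-point quantization follows (the rounding and saturation policy here is an assumption for illustration, not taken from the paper):

```python
# Sketch of per-layer fixed-point quantization: a value is represented in
# total_bits two's-complement with frac_bits fractional bits (10-bit
# features / 8-bit weights in the paper). Round-and-saturate policy is an
# illustrative assumption.

def quantize(value, total_bits, frac_bits):
    scale = 1 << frac_bits
    q = round(value * scale)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, q))            # saturate to the representable range
    return q / scale                   # back to a real value for comparison

# A layer with larger activations gets more integer bits (fewer frac bits):
x = quantize(3.7, 10, 6)               # 10-bit feature, 6 fractional bits
```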

SLIDE 34

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 35

Conclusion

  • Modularized and scalable RTL design for CNN
  • Demonstrated on Altera Stratix-V GXA7 FPGA
  • End-to-end implementation of deep CNNs
  • 114.5 GOPS for AlexNet and 117.3 GOPS for NIN
  • 1.9X performance improvement compared to an OpenCL design on the same FPGA
  • Future work: increase generality and efficiency for larger state-of-the-art CNNs
SLIDE 36

Thanks! Questions?
