Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

SLIDE 1

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula†

School of Electrical, Computer and Energy Engineering

†School of Computing, Informatics, Decision Systems Engineering

Arizona State University, Tempe, USA

SLIDE 2

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
  • 2 -
SLIDE 3

Convolutional Neural Networks (CNN)

  • Dominant approach for recognition and detection tasks
  • Highly iterative with a few computing primitives
  • Composed of multiple types of layers
  • Evolving rapidly with more layers to achieve higher accuracy

[Figure: Input Image → Convolution + Activation → Pooling (Subsampling) → Convolution + Activation → Fully-connected (Inner Product); feature maps propagate through from a few to >100 layers]

SLIDE 4

CNN Layers and Structure

  • Convolution (conv or cccp)
    – 3D MAC operations
    – Constitutes >90% of the total operations
  • Pooling (pool)
    – Keeps the maximum or average value of pixels
  • LRN (norm)
    – Local response normalization: non-linear
  • Fully-connected (fc)
    – Matrix-vector multiplication
    – Requires a large volume of weights
  • CNN structures for image classification
    – AlexNet [A. Krizhevsky, NIPS2012]
    – NIN [M. Lin, ICLR2014]
SLIDE 5

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 6

Comparison of CNN Accelerators

Comparison criteria: Throughput, Resource Utilization, Energy Efficiency, Reconfigurability, Design Speed

  • Software, GPU [Y. Jia, Caffe; M. Abadi, TensorFlow]
    – Flexible deep learning frameworks with modularity
    – Accelerated on GPUs with thousands of parallel cores
    – High power consumption (>100 W)

SLIDE 7

Comparison of CNN Accelerators

  • HLS, FPGA [C. Zhang, FPGA2015; N. Suda, FPGA2016]
    – High-level synthesis (e.g., OpenCL) based FPGA accelerators
    – Short turnaround time and fast design optimization
    – Cannot exploit low-level hardware structures

SLIDE 8

Comparison of CNN Accelerators

  • RTL, generic CNN accelerator [C. Farabet, CVPR2011]
    – Agnostic to the CNN model configuration
    – Inefficient hardware resource usage

SLIDE 9

Comparison of CNN Accelerators

  • RTL, optimized for a specific CNN [J. Qiu, FPGA2016]
    – High efficiency with greater acceleration
    – Poor flexibility, long turnaround time
    – Requires in-depth understanding of FPGA/ASIC design

SLIDE 10

Comparison of CNN Accelerators

  • Proposed RTL compiler
    – Modular and scalable hardware design framework
    – Integrates the flexibility of HLS with the finer-level optimization of RTL

SLIDE 11

Comparison of CNN Accelerators

Summary: the five approaches (Software on GPU; HLS on FPGA; RTL, generic CNN accelerator; RTL, optimized for a specific CNN; the proposed RTL compiler) are compared on Throughput, Resource Utilization, Energy Efficiency, Reconfigurability, and Design Speed.

SLIDE 12

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 13

Proposed CNN RTL Compiler

  • Modular and scalable hardware design framework
  • Compile end-to-end CNNs into efficient RTL codes for FPGA/ASIC

[Compilation flow: CNN models → RTL compiler (Python) → parameterized RTL scripts (Verilog) → FPGA design tools (e.g., Quartus) → FPGA programming file]

Compiler inputs (CNN model):
  • Connection of layers
  • Type of layers
  • Number and size of kernel/feature maps

Compiler inputs (computing resources):
  • Number of multipliers

Generated RTL:
  • Top-level system
  • Conv/Pool/Norm/FC modules
  • RTL DMA controller
  • On-chip buffers
  • Data router
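As a rough illustration of what such a compiler front end can look like, here is a minimal Python sketch (all function and field names are hypothetical, not from the paper's actual compiler) that turns a layer list and a multiplier budget into a Verilog parameter snippet:

```python
# Hypothetical sketch of the compiler front end: it reads a layer list
# (type, connectivity order, kernel/feature-map sizes) plus a multiplier
# budget and emits a Verilog parameter snippet for the hand-written RTL
# modules. Names and output format are illustrative, not the paper's code.

def emit_verilog_params(layers, num_multipliers):
    lines = ["// Auto-generated CNN layer parameters",
             f"parameter NUM_MULT = {num_multipliers};"]
    for i, layer in enumerate(layers):
        prefix = f"{layer['type'].upper()}{i}"
        lines.append(
            f"parameter {prefix}_K = {layer.get('K', 1)}, "
            f"{prefix}_NIF = {layer['Nif']}, {prefix}_NOF = {layer['Nof']};")
    return "\n".join(lines)

# Toy example: first conv layer and last fc layer of an AlexNet-like model
print(emit_verilog_params(
    [{"type": "conv", "K": 11, "Nif": 3, "Nof": 96},
     {"type": "fc", "Nif": 9216, "Nof": 4096}], 256))
```

The real compiler emits full module instantiations, buffers, and a DMA controller; this sketch only shows the parameterization idea.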
SLIDE 14

Convolution Parameters and Loops


  • Loop-1: MAC within a kernel window of K×K
  • Loop-2: scan within one input feature map along X×Y
  • Loop-3: across the Nif input feature maps
  • Loop-4: across the Nof output feature maps

[Figure: Nif input feature maps (Xi×Yi) are convolved with Nof sets of Nif K×K kernel (filter) maps to produce Nof output feature maps (Xo×Yo)]
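The four loops can be written out directly; the following NumPy sketch is the straightforward (unoptimized) reference form, with loop roles as labeled on the slide (stride 1 and no padding are simplifying assumptions):

```python
# Reference form of the four convolution loops, stride 1, no padding.
# Variable names follow the slide: Nif/Nof input/output feature maps,
# K x K kernel window, Xi x Yi input size, Xo x Yo output size.
import numpy as np

def conv_layer(in_fmaps, kernels):
    # in_fmaps: (Nif, Yi, Xi); kernels: (Nof, Nif, K, K)
    Nif, Yi, Xi = in_fmaps.shape
    Nof, _, K, _ = kernels.shape
    Yo, Xo = Yi - K + 1, Xi - K + 1
    out = np.zeros((Nof, Yo, Xo))
    for no in range(Nof):                  # Loop-4: output feature maps
        for ni in range(Nif):              # Loop-3: input feature maps
            for y in range(Yo):            # Loop-2: scan within one map
                for x in range(Xo):
                    for ky in range(K):    # Loop-1: MAC within K x K window
                        for kx in range(K):
                            out[no, y, x] += (in_fmaps[ni, y + ky, x + kx]
                                              * kernels[no, ni, ky, kx])
    return out
```

The accelerator's design space is exactly which of these loops to unroll in hardware, which the next slide addresses.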

SLIDE 15

Strategy to Accelerate Convolution


[Figure: the K×K kernel window slides over the input feature maps; Loop-3 is unrolled across input feature maps and Loop-4 across output feature maps]

  • If Nm > Nif: fully unroll Loop-3 and further unroll Loop-4
    – Compute Nm/Nif output feature maps in parallel with shared features
  • If Nm < Nif: partially unroll Loop-3
    – Repeat the kernel window sliding Nif/Nm times
  • Serially compute Loop-1 before Loop-2: reduces the number of partial sums

(Nm = number of multipliers)
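The allocation rule above can be sketched as a small helper (the dictionary keys are illustrative, and even divisibility of Nm and Nif is assumed, as with the power-of-two sizes discussed later):

```python
# Sketch of the slide's multiplier-allocation rule (Nm = # of multipliers).
# Assumes Nm and Nif divide evenly; key names are illustrative.

def unroll_factors(Nm, Nif):
    if Nm >= Nif:
        # Fully unroll Loop-3; further unroll Loop-4 so that Nm // Nif
        # output feature maps are computed in parallel with shared features.
        return {"loop3_parallel": Nif,
                "loop4_parallel": Nm // Nif,
                "loop3_passes": 1}
    # Partially unroll Loop-3; repeat kernel-window sliding Nif // Nm times.
    return {"loop3_parallel": Nm,
            "loop4_parallel": 1,
            "loop3_passes": Nif // Nm}

# e.g. 256 multipliers on a layer with Nif = 64 input maps gives 64-way
# Loop-3 parallelism with 4 output maps computed concurrently.
```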

SLIDE 16

CONV Module and Components

  • Control logic

– Control the sliding of the four loops by counters
– Counters are parameterized by K, X, Y, Nif and Nof of each layer
– Generate buffer addresses

SLIDE 17

CONV Module and Components

  • Adder Trees

– # of fan-in = Nif; # of adders = Nm/Nif
– Sum results from Nif parallel multipliers
– Accumulate within one kernel window (K×K)
– Shared by convolution layers with identical Nif

  • ReLU = max(pixel, 0)

– Check the sign bit
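The sign-bit check can be sketched for the 10-bit fixed-point features used later in the paper (the bit-level form below mirrors the hardware idea but is only an illustration):

```python
# ReLU = max(pixel, 0) reduces to a sign-bit check on the fixed-point word.
# WIDTH matches the paper's 10-bit feature precision; the Python form is
# an illustration of the hardware behavior, not generated RTL.

WIDTH = 10
SIGN_MASK = 1 << (WIDTH - 1)

def relu_fixed(word):
    """word: WIDTH-bit two's-complement value encoded as 0 .. 2**WIDTH - 1."""
    return 0 if word & SIGN_MASK else word  # negative -> 0, else pass through
```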

SLIDE 18
POOL, NORM, and FC Modules

  • POOL (MAX or AVE) Module
  • NORM Module
  • FC Module
    – Performs matrix-vector multiplication (a special form of convolution)
    – Shares multipliers with CONV
    – Adders are shared across all FC layers
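As a software reference for the POOL module's MAX mode, a minimal sketch follows (the window size and stride here are illustrative; the actual pooling parameters are layer-dependent):

```python
# Sketch of the POOL module's MAX mode: keep the maximum pixel in each
# non-overlapping window. Window size is illustrative (default 2x2).
import numpy as np

def max_pool(fmap, window=2):
    Y, X = fmap.shape
    Yo, Xo = Y // window, X // window
    out = np.empty((Yo, Xo))
    for y in range(Yo):
        for x in range(Xo):
            out[y, x] = fmap[y * window:(y + 1) * window,
                             x * window:(x + 1) * window].max()
    return out
```

AVE mode would substitute `.mean()` for `.max()` over the same window.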
SLIDE 19

Integration of Modules

  • Overall CNN Accelerator
SLIDE 20

Integration of Modules (Controller)

  • Controller

– Directs the layer-by-layer serial computation of the modules

SLIDE 21

Integration of Modules (Data Router)

  • Feature Data Router

– Select write and read data of two adjacent modules
– Assign buffer outputs to POOL or the shared multipliers

SLIDE 22

Integration of Modules (Memory)

  • Feature Buffers

– Feature maps are stored in separate on-chip RAMs

SLIDE 23

Integration of Modules (Memory)

  • Weight Buffers

– FC weights transfer is overlapped with its computation
– CONV weights transfer completes before its computation

SLIDE 24

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 25

Experimental Setup & FPGA System

  • AlexNet and NIN CNN models
  • Stand-alone DE5-Net board with Altera Stratix-V GXA7 FPGA chip

– 622K logic elements, 256 DSP blocks, 2560 M20K RAMs

  • Synthesized by Altera Quartus tool.

[System diagram: standard Altera IP controls the transfer of data from flash memory to SDRAM and then starts the CNN acceleration; data are then transferred from SDRAM to on-chip RAMs]

SLIDE 26

Experimental Results

| Metric | J. Qiu, FPGA2016 | C. Zhang, FPGA2015 | N. Suda, FPGA2016 | This work | This work |
| FPGA | Zynq XC7Z045 | Virtex-7 VX485T | Stratix-V GXA7 | Stratix-V GXA7 | Stratix-V GXA7 |
| Design Entry | RTL | C-language | OpenCL | RTL Compiler | RTL Compiler |
| CNN Model | VGG-16 | AlexNet | AlexNet | AlexNet | NIN |
| # of op. per image | 30.76 GOP | 1.33 GOP | 1.46 GOP | 1.46 GOP | 2.2 GOP |
| DSP Utilization | 780 (89%) | 2,240 (80%) | 256 (100%) | 256 (100%) | 256 (100%) |
| Logic Utilization (a) | 183K (84%) | 186K (61%) | 114K (49%) | 121K (52%) | 112K (48%) |
| On-chip RAM (b) | 486 (87%) | 1,024 (50%) | 1,893 (74%) | 1,552 (61%) | 2,330 (91%) |
| Convolution throughput | 187.80 GOPS | 61.6 GFOPS | 67.5 GOPS | 134.1 GOPS | 117.3 GOPS |
| Overall throughput | 136.97 GOPS | N/A | 60.2 GOPS | 114.5 GOPS | 117.3 GOPS |

  • a. Xilinx FPGAs in LUTs and Altera FPGAs in ALMs
  • b. Xilinx FPGAs in BRAMs (36 Kb) and Altera FPGAs in M20K RAMs (20 Kb)
  • Compared to OpenCL design, 1.9X overall throughput improvement

– On the same FPGA board
– Using similar hardware resources

  • Compared to HLS design, 2X convolution throughput improvement
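The workload and throughput figures in the table imply the per-image latency directly, which is a useful sanity check on the numbers:

```python
# Per-image latency follows from operations per image divided by overall
# throughput, using the table's figures for this work.

def latency_ms(gop_per_image, gops):
    return gop_per_image / gops * 1e3

alexnet_ms = latency_ms(1.46, 114.5)   # roughly 12.75 ms per image
nin_ms = latency_ms(2.2, 117.3)        # roughly 18.76 ms per image
```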
SLIDE 27

Experimental Results

(Comparison table repeated from Slide 26.)
  • Model customized RTL and more DSPs improve throughput
  • More regular structure of VGG benefits the performance

– Uniform kernel map sizes, Nif in powers of two, no norm layers

SLIDE 28

Experimental Results

(Comparison table repeated from Slide 26.)
  • NIN has more convolution layers and more operations
  • Similar throughput can be achieved for both models
SLIDE 29

Timing Breakdown of Layers

[Chart: execution time (ms) per layer for AlexNet and NIN, across CONV1–CONV5, CCCPs, DMA_CONV, POOLs, and FC6/7/8]

  • FC latency is determined by the DMA transfer delay, which covers the computation latency.
  • DMA transfer latency of CONV weights is NOT hidden.
  • AlexNet: convolution = 8.74 ms (68.5% of total); NIN: convolution layers = 17.04 ms (90.9% of total)

SLIDE 30

Logic and DSP Block Utilization

  • Stratix-V GXA7 has only 256 DSP blocks.
  • Multipliers are implemented by both logic elements and DSP blocks.
  • Layers with the same Nif are combined into one module with a shared adder tree

[Charts: logic utilization in ALMs and number of DSP blocks, for AlexNet and NIN]

SLIDE 31

On-chip RAM Breakdown

  • RAMs are stacked for modules with shallow word depths
  • RAMs are shared by non-consecutive modules
  • Weight buffers receive weights from external memory

[Chart: on-chip RAM utilization in M20K blocks for AlexNet and NIN]

SLIDE 32

Power Measurement and Breakdown

  • Measured power of the DE5-Net board
    – running nothing: 16.5 W
    – running AlexNet: 19.5 W
    – running NIN: 19.1 W
  • Simulated power of the Stratix-V chip running AlexNet: 12.8 W

[Charts: simulation-based power breakdown of the FPGA chip (AlexNet accelerator, DDR3 controller, mSGDMA, I/O, others) and of the AlexNet accelerator itself (multipliers, CONVs, FC, NORMs, POOLs, RAM, routing)]

SLIDE 33

ImageNet Accuracy

  • Data width: Features = 10-bit, Weights = 8-bit
  • The portions of integer and fractional bits are adjusted according to the range of values in each layer.


Model accuracy comparison:

| CNN model | Software (Caffe, 32-bit) Top-1 | Top-5 | FPGA (this work) Top-1 | Top-5 |
| AlexNet | 56.78% | 79.72% | 55.64% | 79.32% |
| NIN | 56.14% | 79.32% | 55.74% | 78.96% |

Tested on 5K images from ImageNet 2012 validation database.

Reduced data width requirements while retaining a comparable accuracy level.
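A minimal sketch of such per-layer fixed-point quantization follows (the rounding and saturation policy here is an assumption for illustration, not taken from the paper):

```python
# Sketch of per-layer fixed-point quantization: a value is represented in
# total_bits two's-complement with frac_bits fractional bits (10-bit
# features / 8-bit weights in the paper). Round-and-saturate policy is an
# illustrative assumption.

def quantize(value, total_bits, frac_bits):
    scale = 1 << frac_bits
    q = round(value * scale)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, q))            # saturate to the representable range
    return q / scale                   # back to a real value for comparison

# A layer with larger activations gets more integer bits (fewer frac bits):
x = quantize(3.7, 10, 6)               # 10-bit feature, 6 fractional bits
```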

SLIDE 34

Outline

  • Overview of CNN Algorithms
  • Current CNN Accelerators & Motivation
  • Proposed Modular CNN RTL Compiler
  • Experimental Results
  • Conclusion
SLIDE 35

Conclusion

  • Modularized and scalable RTL design for CNN
  • Demonstrated on Altera Stratix-V GXA7 FPGA
  • End-to-end implementation of deep CNNs
  • 114.5 GOPS for AlexNet and 117.3 GOPS for NIN
  • 1.9X performance improvement compared to an OpenCL design on the same FPGA
  • Future work: increase generality and efficiency for larger state-of-the-art CNNs
SLIDE 36

Thanks! Questions?
