Automatic Compiler Based FPGA Accelerator for CNN Training

SLIDE 1

Automatic Compiler Based FPGA Accelerator for CNN Training

Shreyas Venkataramanaiah1, Yufei Ma1, Shihui Yin1, Eriko Nurvitadhi2, Aravind Dasu3, Yu Cao1, Jae-sun Seo1

1 School of ECEE, Arizona State University, Tempe, AZ, USA
2 Intel Labs, Intel Corporation, OR, USA
3 Programmable Solutions Group, Intel Corporation, CA, USA

SLIDE 2

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 3

Introduction

▪ Challenges in training of neural networks

‒ Large storage, memory bandwidth, and energy consumption
‒ New DNN structures are rapidly evolving and being developed for diverse applications

▪ GPUs are power hungry
▪ ASICs lack programmability and cannot anticipate future DNNs
▪ FPGAs are flexible

‒ Reconfigurable, scalable training hardware
‒ Can support low-precision or sparse matrix computations

SLIDE 4

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 5

CNN Training Algorithm

Forward Pass

▪ Each training image is associated with a label
▪ The loss function estimates the network performance and provides the error value
▪ ReLU: store the activation gradients
▪ Maxpool: store the selected pixel position

[Diagram: CNN training dataflow. The input image passes through Conv, Pool, Conv, Pool, and FC layers to a 1x10 output and the loss; the backward path uses upsampling and convolutions to produce local gradients, and vector multiplications produce the weight gradients used for the weight update.]
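The ReLU and maxpool bookkeeping above can be illustrated with a minimal NumPy sketch; the shapes, function names, and 2x2 pooling window are illustrative assumptions, not the accelerator's actual configuration. ReLU keeps a 0/1 activation-gradient mask and maxpool keeps the position of the selected pixel for the later upsampling step.

```python
import numpy as np

def relu_forward(x):
    # ReLU: store the activation gradient (1 where x > 0) for the backward pass
    act_grad = (x > 0).astype(x.dtype)
    return x * act_grad, act_grad

def maxpool2x2_forward(x):
    # Maxpool: x is (C, H, W) with even H, W; store the selected pixel position per window
    C, H, W = x.shape
    windows = x.reshape(C, H // 2, 2, W // 2, 2).transpose(0, 1, 3, 2, 4)
    windows = windows.reshape(C, H // 2, W // 2, 4)
    idx = windows.argmax(axis=-1)                       # pooling indices kept for upsampling
    out = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return out, idx

# Toy usage: one feature map stack passes through ReLU and a 2x2 maxpool
fmap = np.random.randn(16, 8, 8).astype(np.float32)
a, act_grad = relu_forward(fmap)
p, pool_idx = maxpool2x2_forward(a)
print(p.shape, pool_idx.shape)                          # (16, 4, 4) (16, 4, 4)
```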

SLIDE 6

CNN Training Algorithm

Backward Pass

▪ Error values are propagated back through the network
▪ Flipped kernels are used in the convolutions
▪ ReLU: gradients are scaled by the stored activation gradients
▪ Maxpool: upsample the image using the pooling indices

[Diagram: the same training dataflow with the backward-pass path highlighted; the 1x10 error is propagated through the FC layer, upsampling stages, and convolutions with flipped kernels to produce the local gradients.]
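Continuing the sketch under the same assumptions (single channel, stride 1, no padding, 2x2 pooling), the backward pass routes each gradient to the stored maxpool position, scales it by the saved ReLU activation gradient, and convolves with the flipped kernel:

```python
import numpy as np

def maxpool2x2_backward(grad_out, idx, in_shape):
    # Upsample: route each gradient to the stored max position (pooling indices)
    C, H, W = in_shape
    windows = np.zeros((C, H // 2, W // 2, 4), dtype=grad_out.dtype)
    np.put_along_axis(windows, idx[..., None], grad_out[..., None], axis=-1)
    return windows.reshape(C, H // 2, W // 2, 2, 2).transpose(0, 1, 3, 2, 4).reshape(C, H, W)

def relu_backward(grad, act_grad):
    # Gradients are scaled (masked) by the stored activation gradients
    return grad * act_grad

def conv_backward_input(grad_out, kernel):
    # Local gradients via a "full" convolution with the flipped kernel
    k = kernel[::-1, ::-1]
    kh, kw = k.shape
    padded = np.pad(grad_out, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    out = np.zeros((padded.shape[0] - kh + 1, padded.shape[1] - kw + 1), dtype=grad_out.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return out

# Toy usage with randomly generated gradients, pooling indices, and ReLU masks
grad_next = np.random.randn(16, 4, 4).astype(np.float32)
pool_idx = np.random.randint(0, 4, size=(16, 4, 4))
act_grad = (np.random.randn(16, 8, 8) > 0).astype(np.float32)
local_grad = relu_backward(maxpool2x2_backward(grad_next, pool_idx, (16, 8, 8)), act_grad)
print(local_grad.shape)                                 # (16, 8, 8)
```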

SLIDE 7

CNN Training Algorithm

Weight Update

▪ Weight gradients are computed and accumulated
▪ Convolutions involve intra-tile accumulations
▪ New weights are computed at the end of each batch
▪ Learning rate (𝛽) and momentum (𝛾) parameters are used

[Diagram: the same training dataflow with the weight-update path highlighted; the accumulated weight gradients Δx from each layer are combined with the learning rate 𝛽 and momentum 𝛾 to produce the new weights.]
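A minimal sketch of the batch-level update, assuming a standard SGD-with-momentum rule for the learning rate (𝛽) and momentum (𝛾); the dummy_weight_grad helper and all sizes are placeholders for illustration, not the accelerator's implementation:

```python
import numpy as np

beta, gamma = 0.01, 0.9                                 # learning rate and momentum
w = np.random.randn(16, 3, 3).astype(np.float32)        # one layer's kernels (illustrative)
velocity = np.zeros_like(w)

def dummy_weight_grad(w, img):
    # Placeholder for the per-image weight-gradient computation (convolution of
    # activations with local gradients, accumulated intra-tile in hardware)
    return 0.001 * np.ones_like(w)

batch = [np.random.randn(3, 32, 32).astype(np.float32) for _ in range(40)]

# Weight gradients are accumulated over the batch; weights are updated once at the end
w_grad_acc = np.zeros_like(w)
for img in batch:                                       # images processed sequentially
    w_grad_acc += dummy_weight_grad(w, img)
velocity = gamma * velocity - beta * (w_grad_acc / len(batch))
w = w + velocity                                        # new weights at the end of the batch
```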

SLIDE 8

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 9

Proposed RTL Compiler

▪ Loop unrolling and tiling factors
▪ CNN architecture
  • Layer details: conv, pool, upsamp, scaling, weight update, flatten, loss
  • Fixed-point precision of each layer's parameters
  • Layer scheduling
▪ Initialize memory
  • Initial weights and biases
  • Training data, labels
  • Base addresses for gradients, activations & weights
▪ RTL model library
  • Highly parameterized, flexible RTL files supporting CNN training operations
▪ Configure hardware
  • Generate parameters based on the CNN

The RTL Compiler for CNN Training produces the top-level RTL integrated with the training H/W modules and the DRAM init files, followed by FPGA synthesis and mapping.

The RTL compiler generates the training accelerator from a high-level CNN description.
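For intuition, below is a hypothetical example of the kind of high-level CNN description such a compiler could consume; every field name is illustrative and is not the tool's actual input format. The layer list follows the CIFAR-10 1X network reported in the results.

```python
# Hypothetical high-level CNN training description (field names are assumptions)
cnn_config = {
    "layers": [
        {"type": "conv", "in_ch": 3,  "out_ch": 16, "kernel": 3},
        {"type": "conv", "in_ch": 16, "out_ch": 16, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "conv", "in_ch": 16, "out_ch": 32, "kernel": 3},
        {"type": "conv", "in_ch": 32, "out_ch": 32, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "conv", "in_ch": 32, "out_ch": 64, "kernel": 3},
        {"type": "conv", "in_ch": 64, "out_ch": 64, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "flatten"},
        {"type": "fc", "out": 10},
        {"type": "loss", "kind": "softmax_cross_entropy"},
    ],
    "precision": {"activations": 16, "weights": 16, "gradients": 16},   # fixed-point bits
    "unroll": {"Pox": 4, "Poy": 4, "Pof": 8},                           # loop unroll factors
    "training": {"batch_size": 40, "lr": 0.01, "momentum": 0.9},
}
```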

SLIDE 10

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 11

Overall Architecture

[Block diagram. Computing modules: PE array with Conv/FC control, pooling unit, UPSA (demux/mult), ReLU/scale/loss, and weight update (WU). On-chip buffers: input, weight, output, AG, IDX, old weight, transposable weight, and weight gradient buffers/accumulator. Control and data movement: global control logic, data gather, data scatter, data router, DMA manager, and DMA. Buses: pixel data, weight data, and index/AG.]

AG – Activation gradients, IDX – Maxpool Indices, UPSA – Upsampling

SLIDE 12

Overall Architecture

[Same architecture block diagram and legend as Slide 11.]

SLIDE 13

Overall Architecture

[Same architecture block diagram and legend as Slide 11.]

SLIDE 14

MAC Array

| Training phase | Input px buffer | Weight buffer   | Output buffer    |
|----------------|-----------------|-----------------|------------------|
| FP             | Activations     | Normal kernels  | Activations      |
| BP             | Local gradients | Flipped kernels | Local gradients  |
| WU             | Activations     | Local gradients | Kernel gradients |

[Diagram: MAC array fed by the input pixel buffer, local gradient buffer, and transposable weight buffer through the data and weight routers; the Pox, Pof, pad, stride, and kernel size parameters and the training phase control how input data from DRAM is routed.]

▪ Output stationary dataflow
▪ Data/weight re-use to minimize partial sum movement
▪ Reconfigurable MAC array to support all phases of training
▪ MAC array size is user-determined through the loop unroll factors (Pox, Poy, Pof)
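A behavioral NumPy sketch of an output-stationary convolution loop nest, assuming Pox, Poy, and Pof are unroll (tile) factors over the output width, height, and output feature maps; the partial sums for one output tile stay in a local accumulator (psum), mirroring the MAC array, while input pixels and weights are streamed in and re-used:

```python
import numpy as np

def conv_output_stationary(inp, wt, Pox=2, Poy=2, Pof=2):
    # inp: (Nif, H, W) input feature maps; wt: (Nof, Nif, K, K) kernels; stride 1, no padding
    Nif, H, W = inp.shape
    Nof, _, K, _ = wt.shape
    Hout, Wout = H - K + 1, W - K + 1
    out = np.zeros((Nof, Hout, Wout), dtype=inp.dtype)
    for of0 in range(0, Nof, Pof):                  # tile over output feature maps
        for oy0 in range(0, Hout, Poy):             # tile over output rows
            for ox0 in range(0, Wout, Pox):         # tile over output columns
                nf, ny, nx = min(Pof, Nof - of0), min(Poy, Hout - oy0), min(Pox, Wout - ox0)
                # psum models the Pof x Poy x Pox accumulators held inside the MAC array
                psum = np.zeros((nf, ny, nx), dtype=inp.dtype)
                for kif in range(Nif):              # input feature maps
                    for ky in range(K):             # kernel rows
                        for kx in range(K):         # kernel columns
                            for of in range(nf):
                                for oy in range(ny):
                                    for ox in range(nx):
                                        psum[of, oy, ox] += (
                                            inp[kif, oy0 + oy + ky, ox0 + ox + kx]
                                            * wt[of0 + of, kif, ky, kx])
                out[of0:of0 + nf, oy0:oy0 + ny, ox0:ox0 + nx] = psum   # write finished outputs
    return out

x = np.random.randn(3, 8, 8).astype(np.float32)
k = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(conv_output_stationary(x, k).shape)           # (4, 6, 6)
```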

SLIDE 15

Transposable Weight Buffers


[Diagram: transposable weight storage. A 4x4 block of weights between the input feature maps of layer L and the output feature maps of layer L+1 is read in its natural order during FP and in transposed order during BP; the weights are kept as a block-circulant matrix in independent column buffers (C0-C3), and read controls to the transposable buffer select the FP, BP, or WU access pattern and read addresses.]
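One common way to realize such transposable storage is a circulant mapping of the weight block across independent column buffers, so that either a full row (FP access pattern) or a full column (BP access pattern) can be fetched in a single access; the behavioral sketch below assumes this mapping for illustration and is not the actual RTL:

```python
import numpy as np

N = 4
W = np.arange(N * N).reshape(N, N)                  # logical N x N weight block

# Write: element W[i][j] goes to column buffer (j - i) % N at address i (circulant layout)
buffers = np.zeros((N, N), dtype=W.dtype)           # buffers[b][addr]
for i in range(N):
    for j in range(N):
        buffers[(j - i) % N, i] = W[i, j]

def read_row_fp(i):
    # FP: all buffers are read at the same address i; buffer b holds column (i + b) % N
    row = np.empty(N, dtype=W.dtype)
    for b in range(N):
        row[(i + b) % N] = buffers[b, i]
    return row

def read_col_bp(j):
    # BP: buffer b is read at address (j - b) % N, which holds row (j - b) % N of column j
    col = np.empty(N, dtype=W.dtype)
    for b in range(N):
        col[(j - b) % N] = buffers[b, (j - b) % N]
    return col

assert np.array_equal(read_row_fp(2), W[2, :])      # row access for the forward pass
assert np.array_equal(read_col_bp(1), W[:, 1])      # transposed access for the backward pass
```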

SLIDE 16

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 17

Results


Resource utilization and training latency per epoch:

| CNN         | DSP  | ALM   | BRAM  | Latency, BS-10 (s) | BS-20 (s) | BS-40 (s) |
|-------------|------|-------|-------|--------------------|-----------|-----------|
| CIFAR-10 1X | 30%  | 19%   | 4.4%  | 18.2               | 18        | 18.01     |
| CIFAR-10 2X | 58%  | 44%   | 9.5%  | 41.7               | 41.3      | 41        |
| CIFAR-10 4X | 100% | 76.2% | 22.4% | 98.2               | 96.8      | 96.18     |

Throughput and energy efficiency vs. GPU:

| CNN         | Throughput (GOPs): Titan XP, BS 1 | Titan XP, BS 40 | FPGA, BS 1/40 | Efficiency (GOPs/W): Titan XP, BS 1 | Titan XP, BS 40 | FPGA, BS 1/40 |
|-------------|-----------------------------------|-----------------|---------------|-------------------------------------|-----------------|---------------|
| CIFAR-10 1X | 45.6                              | 551.8           | 163           | 0.5                                 | 3.7             | 7.9           |
| CIFAR-10 2X | 128.8                             | 1337.9          | 282           | 1.3                                 | 8.3             | 8.59          |
| CIFAR-10 4X | 331.4                             | 2353.7          | 479           | 2.9                                 | 13.5            | 9.49          |

▪ Peak throughput of 479 GOPs
▪ Better energy efficiency than GPUs for smaller batch sizes
▪ Limited by DRAM bandwidth
▪ Images in a batch are processed sequentially

CIFAR-10 1X: 2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC
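The network shorthand can be read as n(fCk) meaning n conv layers with f k×k filters, MP a max-pooling layer, and FC the fully connected classifier; a small parser sketch under that assumed interpretation:

```python
import re

def parse_cnn_spec(spec):
    # Expand a shorthand like "2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC" into a layer list
    layers = []
    for tok in spec.split("-"):
        m = re.fullmatch(r"(?:(\d+)\()?(\d+)C(\d+)\)?", tok)
        if m:                                       # n(fCk) or fCk: repeated conv layers
            rep = int(m.group(1) or 1)
            layers += [("conv", int(m.group(2)), int(m.group(3)))] * rep
        elif tok == "MP":
            layers.append(("maxpool", 2))
        elif tok == "FC":
            layers.append(("fc",))
    return layers

print(parse_cnn_spec("2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC"))
```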

SLIDE 18


Latency Breakdown

▪ Latency of the CIFAR-10 4X CNN for one iteration of a batch
▪ Overall, ~20% of the latency is due to logic and ~80% due to DRAM access
▪ The weight update phase is memory intensive

‒ It contributes ~51% of the overall latency

[Pie charts: logic vs. DRAM latency split for each training phase.]

SLIDE 19

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 20

Conclusion

▪ Automatic RTL compiler-based CNN training accelerator
▪ Implemented a parameterized RTL library to support CNN training operations
▪ Evaluated performance on an Intel Stratix-10 GX FPGA for three CNNs on the CIFAR-10 dataset
▪ Achieved up to 479 GOPs throughput

SLIDE 21

Acknowledgements

C-BRIC (Center for BRain-Inspired Computing)

We thank Intel Corporation for supporting and funding this research work. This work was also partially supported by NSF grant 1652866 and by C-BRIC, one of six centers in JUMP, an SRC program sponsored by DARPA.