Automatic Compiler Based FPGA Accelerator for CNN Training

SLIDE 1

Automatic Compiler Based FPGA Accelerator for CNN Training

Shreyas Venkataramanaiah1, Yufei Ma1, Shihui Yin1, Eriko Nurvitadhi2, Aravind Dasu3, Yu Cao1, Jae-sun Seo1

1 School of ECEE, Arizona State University, Tempe, AZ, USA
2 Intel Labs, Intel Corporation, OR, USA
3 Programmable Solutions Group, Intel Corporation, CA, USA

SLIDE 2

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 3

Introduction

▪ Challenges in training of neural networks

‒ Large storage, memory bandwidth, and energy consumption
‒ New DNN structures are rapidly evolving and being developed for diverse applications

▪ GPUs are power hungry
▪ ASICs lack programmability and cannot anticipate future DNNs
▪ FPGAs are flexible

‒ Reconfigurable, scalable training hardware
‒ Can support low-precision or sparse matrix computations

SLIDE 4

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 5

CNN Training Algorithm

Forward Pass

▪ Each training image is associated with a label
▪ The loss function estimates the network performance and provides the error value
▪ ReLU: store the activation gradients
▪ Maxpool: store the selected pixel position

[Diagram: CNN training dataflow. The input image passes through Conv, Pool, Conv, Pool, and FC layers to a 1x10 output and the loss; the backward path uses upsampling and convolutions to produce local gradients, and vector multiplications produce the weight gradients used for the weight update.]
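The ReLU and maxpool bookkeeping above can be illustrated with a minimal NumPy sketch; the shapes, function names, and 2x2 pooling window are illustrative assumptions, not the accelerator's actual configuration. ReLU keeps a 0/1 activation-gradient mask and maxpool keeps the position of the selected pixel for the later upsampling step.

```python
import numpy as np

def relu_forward(x):
    # ReLU: store the activation gradient (1 where x > 0) for the backward pass
    act_grad = (x > 0).astype(x.dtype)
    return x * act_grad, act_grad

def maxpool2x2_forward(x):
    # Maxpool: x is (C, H, W) with even H, W; store the selected pixel position per window
    C, H, W = x.shape
    windows = x.reshape(C, H // 2, 2, W // 2, 2).transpose(0, 1, 3, 2, 4)
    windows = windows.reshape(C, H // 2, W // 2, 4)
    idx = windows.argmax(axis=-1)                       # pooling indices kept for upsampling
    out = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return out, idx

# Toy usage: one feature map stack passes through ReLU and a 2x2 maxpool
fmap = np.random.randn(16, 8, 8).astype(np.float32)
a, act_grad = relu_forward(fmap)
p, pool_idx = maxpool2x2_forward(a)
print(p.shape, pool_idx.shape)                          # (16, 4, 4) (16, 4, 4)
```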

SLIDE 6

CNN Training Algorithm

Backward Pass

▪ Error values are propagated back through the network
▪ Flipped kernels are used in the convolutions
▪ ReLU: gradients are scaled by the stored activation gradients
▪ Maxpool: upsample the image using the pooling indices

[Diagram: the same training dataflow with the backward-pass path highlighted; the 1x10 error is propagated through the FC layer, upsampling stages, and convolutions with flipped kernels to produce the local gradients.]
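Continuing the sketch under the same assumptions (single channel, stride 1, no padding, 2x2 pooling), the backward pass routes each gradient to the stored maxpool position, scales it by the saved ReLU activation gradient, and convolves with the flipped kernel:

```python
import numpy as np

def maxpool2x2_backward(grad_out, idx, in_shape):
    # Upsample: route each gradient to the stored max position (pooling indices)
    C, H, W = in_shape
    windows = np.zeros((C, H // 2, W // 2, 4), dtype=grad_out.dtype)
    np.put_along_axis(windows, idx[..., None], grad_out[..., None], axis=-1)
    return windows.reshape(C, H // 2, W // 2, 2, 2).transpose(0, 1, 3, 2, 4).reshape(C, H, W)

def relu_backward(grad, act_grad):
    # Gradients are scaled (masked) by the stored activation gradients
    return grad * act_grad

def conv_backward_input(grad_out, kernel):
    # Local gradients via a "full" convolution with the flipped kernel
    k = kernel[::-1, ::-1]
    kh, kw = k.shape
    padded = np.pad(grad_out, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    out = np.zeros((padded.shape[0] - kh + 1, padded.shape[1] - kw + 1), dtype=grad_out.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return out

# Toy usage with randomly generated gradients, pooling indices, and ReLU masks
grad_next = np.random.randn(16, 4, 4).astype(np.float32)
pool_idx = np.random.randint(0, 4, size=(16, 4, 4))
act_grad = (np.random.randn(16, 8, 8) > 0).astype(np.float32)
local_grad = relu_backward(maxpool2x2_backward(grad_next, pool_idx, (16, 8, 8)), act_grad)
print(local_grad.shape)                                 # (16, 8, 8)
```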

SLIDE 7

CNN Training Algorithm

Weight Update

▪ Weight gradients are computed and accumulated
▪ Convolutions involve intra-tile accumulations
▪ New weights are computed at the end of each batch
▪ Learning rate (𝛽) and momentum (𝛾) parameters are used

[Diagram: the same training dataflow with the weight-update path highlighted; the accumulated weight gradients Δx from each layer are combined with the learning rate 𝛽 and momentum 𝛾 to produce the new weights.]
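A minimal sketch of the batch-level update, assuming a standard SGD-with-momentum rule for the learning rate (𝛽) and momentum (𝛾); the dummy_weight_grad helper and all sizes are placeholders for illustration, not the accelerator's implementation:

```python
import numpy as np

beta, gamma = 0.01, 0.9                                 # learning rate and momentum
w = np.random.randn(16, 3, 3).astype(np.float32)        # one layer's kernels (illustrative)
velocity = np.zeros_like(w)

def dummy_weight_grad(w, img):
    # Placeholder for the per-image weight-gradient computation (convolution of
    # activations with local gradients, accumulated intra-tile in hardware)
    return 0.001 * np.ones_like(w)

batch = [np.random.randn(3, 32, 32).astype(np.float32) for _ in range(40)]

# Weight gradients are accumulated over the batch; weights are updated once at the end
w_grad_acc = np.zeros_like(w)
for img in batch:                                       # images processed sequentially
    w_grad_acc += dummy_weight_grad(w, img)
velocity = gamma * velocity - beta * (w_grad_acc / len(batch))
w = w + velocity                                        # new weights at the end of the batch
```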

SLIDE 8

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 9

Proposed RTL Compiler

▪ Loop unrolling and tiling factors
▪ CNN architecture
  • Layer details: conv, pool, upsamp, scaling, weight update, flatten, loss
  • Fixed-point precision of each layer's parameters
  • Layer scheduling
▪ Initialize memory
  • Initial weights and biases
  • Training data, labels
  • Base addresses for gradients, activations & weights
▪ RTL model library
  • Highly parameterized, flexible RTL files supporting CNN training operations
▪ Configure hardware
  • Generate parameters based on the CNN

The RTL Compiler for CNN Training produces the top-level RTL integrated with the training H/W modules and the DRAM init files, followed by FPGA synthesis and mapping.

The RTL compiler generates the training accelerator from a high-level CNN description.
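For intuition, below is a hypothetical example of the kind of high-level CNN description such a compiler could consume; every field name is illustrative and is not the tool's actual input format. The layer list follows the CIFAR-10 1X network reported in the results.

```python
# Hypothetical high-level CNN training description (field names are assumptions)
cnn_config = {
    "layers": [
        {"type": "conv", "in_ch": 3,  "out_ch": 16, "kernel": 3},
        {"type": "conv", "in_ch": 16, "out_ch": 16, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "conv", "in_ch": 16, "out_ch": 32, "kernel": 3},
        {"type": "conv", "in_ch": 32, "out_ch": 32, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "conv", "in_ch": 32, "out_ch": 64, "kernel": 3},
        {"type": "conv", "in_ch": 64, "out_ch": 64, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "flatten"},
        {"type": "fc", "out": 10},
        {"type": "loss", "kind": "softmax_cross_entropy"},
    ],
    "precision": {"activations": 16, "weights": 16, "gradients": 16},   # fixed-point bits
    "unroll": {"Pox": 4, "Poy": 4, "Pof": 8},                           # loop unroll factors
    "training": {"batch_size": 40, "lr": 0.01, "momentum": 0.9},
}
```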

SLIDE 10

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 11

Overall Architecture

[Block diagram. Computing modules: PE array with Conv/FC control, pooling unit, UPSA (demux/mult), ReLU/scale/loss, and weight update (WU). On-chip buffers: input, weight, output, AG, IDX, old weight, transposable weight, and weight gradient buffers/accumulator. Control and data movement: global control logic, data gather, data scatter, data router, DMA manager, and DMA. Buses: pixel data, weight data, and index/AG.]

AG – Activation gradients, IDX – Maxpool Indices, UPSA – Upsampling

SLIDE 12

Overall Architecture

[Same architecture block diagram and legend as Slide 11.]

SLIDE 13

Overall Architecture

[Same architecture block diagram and legend as Slide 11.]

SLIDE 14

MAC Array

| Training phase | Input px buffer | Weight buffer   | Output buffer    |
|----------------|-----------------|-----------------|------------------|
| FP             | Activations     | Normal kernels  | Activations      |
| BP             | Local gradients | Flipped kernels | Local gradients  |
| WU             | Activations     | Local gradients | Kernel gradients |

[Diagram: MAC array fed by the input pixel buffer, local gradient buffer, and transposable weight buffer through the data and weight routers; the Pox, Pof, pad, stride, and kernel size parameters and the training phase control how input data from DRAM is routed.]

▪ Output stationary dataflow
▪ Data/weight re-use to minimize partial sum movement
▪ Reconfigurable MAC array to support all phases of training
▪ MAC array size is user-determined through the loop unroll factors (Pox, Poy, Pof)
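A behavioral NumPy sketch of an output-stationary convolution loop nest, assuming Pox, Poy, and Pof are unroll (tile) factors over the output width, height, and output feature maps; the partial sums for one output tile stay in a local accumulator (psum), mirroring the MAC array, while input pixels and weights are streamed in and re-used:

```python
import numpy as np

def conv_output_stationary(inp, wt, Pox=2, Poy=2, Pof=2):
    # inp: (Nif, H, W) input feature maps; wt: (Nof, Nif, K, K) kernels; stride 1, no padding
    Nif, H, W = inp.shape
    Nof, _, K, _ = wt.shape
    Hout, Wout = H - K + 1, W - K + 1
    out = np.zeros((Nof, Hout, Wout), dtype=inp.dtype)
    for of0 in range(0, Nof, Pof):                  # tile over output feature maps
        for oy0 in range(0, Hout, Poy):             # tile over output rows
            for ox0 in range(0, Wout, Pox):         # tile over output columns
                nf, ny, nx = min(Pof, Nof - of0), min(Poy, Hout - oy0), min(Pox, Wout - ox0)
                # psum models the Pof x Poy x Pox accumulators held inside the MAC array
                psum = np.zeros((nf, ny, nx), dtype=inp.dtype)
                for kif in range(Nif):              # input feature maps
                    for ky in range(K):             # kernel rows
                        for kx in range(K):         # kernel columns
                            for of in range(nf):
                                for oy in range(ny):
                                    for ox in range(nx):
                                        psum[of, oy, ox] += (
                                            inp[kif, oy0 + oy + ky, ox0 + ox + kx]
                                            * wt[of0 + of, kif, ky, kx])
                out[of0:of0 + nf, oy0:oy0 + ny, ox0:ox0 + nx] = psum   # write finished outputs
    return out

x = np.random.randn(3, 8, 8).astype(np.float32)
k = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(conv_output_stationary(x, k).shape)           # (4, 6, 6)
```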

SLIDE 15

Transposable Weight Buffers


[Diagram: transposable weight storage. A 4x4 block of weights between the input feature maps of layer L and the output feature maps of layer L+1 is read in its natural order during FP and in transposed order during BP; the weights are kept as a block-circulant matrix in independent column buffers (C0-C3), and read controls to the transposable buffer select the FP, BP, or WU access pattern and read addresses.]
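One common way to realize such transposable storage is a circulant mapping of the weight block across independent column buffers, so that either a full row (FP access pattern) or a full column (BP access pattern) can be fetched in a single access; the behavioral sketch below assumes this mapping for illustration and is not the actual RTL:

```python
import numpy as np

N = 4
W = np.arange(N * N).reshape(N, N)                  # logical N x N weight block

# Write: element W[i][j] goes to column buffer (j - i) % N at address i (circulant layout)
buffers = np.zeros((N, N), dtype=W.dtype)           # buffers[b][addr]
for i in range(N):
    for j in range(N):
        buffers[(j - i) % N, i] = W[i, j]

def read_row_fp(i):
    # FP: all buffers are read at the same address i; buffer b holds column (i + b) % N
    row = np.empty(N, dtype=W.dtype)
    for b in range(N):
        row[(i + b) % N] = buffers[b, i]
    return row

def read_col_bp(j):
    # BP: buffer b is read at address (j - b) % N, which holds row (j - b) % N of column j
    col = np.empty(N, dtype=W.dtype)
    for b in range(N):
        col[(j - b) % N] = buffers[b, (j - b) % N]
    return col

assert np.array_equal(read_row_fp(2), W[2, :])      # row access for the forward pass
assert np.array_equal(read_col_bp(1), W[:, 1])      # transposed access for the backward pass
```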

SLIDE 16

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 17

Results


Resource utilization and training latency per epoch:

| CNN         | DSP  | ALM   | BRAM  | Latency, BS-10 (s) | BS-20 (s) | BS-40 (s) |
|-------------|------|-------|-------|--------------------|-----------|-----------|
| CIFAR-10 1X | 30%  | 19%   | 4.4%  | 18.2               | 18        | 18.01     |
| CIFAR-10 2X | 58%  | 44%   | 9.5%  | 41.7               | 41.3      | 41        |
| CIFAR-10 4X | 100% | 76.2% | 22.4% | 98.2               | 96.8      | 96.18     |

Throughput and energy efficiency vs. GPU:

| CNN         | Throughput (GOPs): Titan XP, BS 1 | Titan XP, BS 40 | FPGA, BS 1/40 | Efficiency (GOPs/W): Titan XP, BS 1 | Titan XP, BS 40 | FPGA, BS 1/40 |
|-------------|-----------------------------------|-----------------|---------------|-------------------------------------|-----------------|---------------|
| CIFAR-10 1X | 45.6                              | 551.8           | 163           | 0.5                                 | 3.7             | 7.9           |
| CIFAR-10 2X | 128.8                             | 1337.9          | 282           | 1.3                                 | 8.3             | 8.59          |
| CIFAR-10 4X | 331.4                             | 2353.7          | 479           | 2.9                                 | 13.5            | 9.49          |

▪ Peak throughput of 479 GOPs
▪ Better energy efficiency than GPUs for smaller batch sizes
▪ Limited by DRAM bandwidth
▪ Images in a batch are processed sequentially

CIFAR-10 1X: 2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC
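The network shorthand can be read as n(fCk) meaning n conv layers with f k×k filters, MP a max-pooling layer, and FC the fully connected classifier; a small parser sketch under that assumed interpretation:

```python
import re

def parse_cnn_spec(spec):
    # Expand a shorthand like "2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC" into a layer list
    layers = []
    for tok in spec.split("-"):
        m = re.fullmatch(r"(?:(\d+)\()?(\d+)C(\d+)\)?", tok)
        if m:                                       # n(fCk) or fCk: repeated conv layers
            rep = int(m.group(1) or 1)
            layers += [("conv", int(m.group(2)), int(m.group(3)))] * rep
        elif tok == "MP":
            layers.append(("maxpool", 2))
        elif tok == "FC":
            layers.append(("fc",))
    return layers

print(parse_cnn_spec("2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC"))
```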

SLIDE 18


Latency Breakdown

▪ Latency of the CIFAR-10 4X CNN for one iteration of a batch
▪ Overall, ~20% of the latency is due to logic and ~80% due to DRAM access
▪ The weight update phase is memory intensive

‒ It contributes ~51% of the overall latency

[Pie charts: logic vs. DRAM latency split for each training phase.]

SLIDE 19

Outline

▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion

SLIDE 20

Conclusion

▪ Automatic RTL compiler-based CNN training accelerator
▪ Implemented a parameterized RTL library to support CNN training operations
▪ Evaluated performance on an Intel Stratix-10 GX FPGA for three CNNs on the CIFAR-10 dataset
▪ Achieved up to 479 GOPs throughput

SLIDE 21

Acknowledgements

C-BRIC (Center for BRain-Inspired Computing)

We thank Intel Corporation for supporting and funding this research work. This work was also partially supported by NSF grant 1652866 and by C-BRIC, one of six centers in JUMP, an SRC program sponsored by DARPA.