SLIDE 1


A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Xiaoyu Yu (1), Yuwei Wang (1), Jie Miao (1), Ephrem Wu (2), Heng Zhang (1), Yu Meng (1), Bo Zhang (1), Biao Min (1), Dewei Chen (1), Jianlin Gao (1)
(1) Tencent, Shenzhen, China   (2) Xilinx, Inc., San Jose, CA 95124, USA

SLIDE 2

About Tencent

• One of the top 5 Internet companies by market value
• Monthly active users reach 1 billion for WeChat and 0.8 billion for QQ
• Founded in 1998
• Users in over 200 countries
• Image and video workloads: photos and videos from WeChat Moments, live video streaming, profile photos, images in group chats

SLIDE 3

Background  CNN Models are widely used in Tencent  Billions of operations per inference task × Billions of task each day  Models are still in fast evolution.  A reconfigurable accelerator is desirable  Three key objectives:

  • Support different CNN models, easy to try
  • Achieve higher performance to lower TCO
  • Low latency
SLIDE 4

Framework for General-Purpose CNN Acceleration

• More and more CNN models are emerging
• Operator classification:

  • Convolution: 19% of operator types, more than 95% of the computation cost
  • Non-convolution: 81% of operator types, less than 5% of the computation cost

• Different design strategies for the two classes (see the sketch below):

  • Convolution: focus on performance improvement
  • Non-convolution: support the many different operator types
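As a rough illustration of this split, the sketch below (not from the paper; the operator list and counts are hypothetical) tallies how operator types and computation cost divide between convolution and everything else:

```python
# Hypothetical sketch: classify a model's operators and tally their compute cost.
# The operator list is made up for illustration; real numbers come from profiling.
from collections import defaultdict

CONV_OPS = {"Conv2D"}  # handled by the convolution engine; all others are "non-convolution"

# (op_type, estimated multiply-accumulates) for an imaginary network
ops = [
    ("Conv2D", 3.2e9), ("Conv2D", 1.1e9), ("Conv2D", 0.4e9),
    ("MaxPool", 1e6), ("BatchNorm", 2e6), ("Relu", 1e6), ("ElementAdd", 1e6),
]

cost, kinds = defaultdict(float), defaultdict(set)
for name, macs in ops:
    group = "convolution" if name in CONV_OPS else "non-convolution"
    cost[group] += macs
    kinds[group].add(name)

total = sum(cost.values())
for group in ("convolution", "non-convolution"):
    share = 100 * cost[group] / total
    print(f"{group}: {len(kinds[group])} op types, {share:.1f}% of compute")
```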
SLIDE 5

Unified Computing Engine for Convolution—Supertile

[Figure: enhanced processing element (EPE) — a DSP performing multiply-accumulate (× +) on an activation input, with weight updates steered through MUXes, a weight cache, and buffers A–D.]

[1] E. Wu, X. Zhang, D. Berman, and I. Cho, "A high-throughput reconfigurable processing array for neural networks," in Proc. 27th Int. Conf. on Field Programmable Logic and Applications (FPL), 2017, pp. 1–4.

• Performance = Freq × Dsp_num × Ops_per_dsp
• The supertile method runs the DSPs at twice the clock rate of the surrounding logic [1]
• Enhanced Processing Element (EPE)
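A minimal behavioural sketch of why double-pumping raises Ops_per_dsp: over one cycle of the slower fabric clock, the DSP sees two fast-clock phases and can retire two multiply-accumulates. This only illustrates the idea, not the RTL, and the function name is hypothetical.

```python
def epe_fabric_cycle(acc, activations, weights):
    """Behavioural model of one EPE over one fabric-clock cycle.

    The DSP runs at twice the fabric clock, so it performs two
    multiply-accumulate operations per fabric cycle (Ops_per_dsp = 2),
    doubling throughput for the same amount of surrounding logic.
    """
    for phase in range(2):                       # two fast-clock phases
        acc += activations[phase] * weights[phase]
    return acc

print(epe_fabric_cycle(0, [2, 3], [5, 7]))       # 2*5 + 3*7 = 31
```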

SLIDE 6

Unified Computing Engine for Convolution – Supertile Unit (SU)

• A supertile unit is an m×n array of EPEs
• The input feature map tile (IFT), with Cin channels, is held in the input buffer
• 2n kernel groups, each with m channels, supply the weights
• Temporary results form an output feature map tile (OFT) with 2n channels and are written to the output buffer

[Figure: convolution of an input feature map tile (IFT) with 2n kernel groups on one SU.]
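A minimal functional sketch of what one SU computes under the description above (an m-channel input tile convolved with 2n kernel groups to produce a 2n-channel output tile). This is a NumPy behavioural model of the arithmetic, not the hardware mapping onto the m×n EPE grid:

```python
import numpy as np

def su_convolve(ift, kernels, stride=1):
    """Behavioural model of one supertile unit (SU).

    ift     : (m, H, W)          input feature map tile with m channels
    kernels : (2n, m, K, K)      2n kernel groups, each with m channels
    returns : (2n, H_out, W_out) output feature map tile (OFT)
    """
    m, H, W = ift.shape
    out_ch, m_k, K, _ = kernels.shape
    assert m == m_k, "kernel channel count must match the IFT"
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    oft = np.zeros((out_ch, H_out, W_out), dtype=np.float32)
    for oc in range(out_ch):                       # one of the 2n output channels
        for y in range(H_out):
            for x in range(W_out):
                window = ift[:, y*stride:y*stride+K, x*stride:x*stride+K]
                oft[oc, y, x] = np.sum(window * kernels[oc])  # m-channel dot product
    return oft

oft = su_convolve(np.random.rand(16, 8, 8), np.random.rand(32, 16, 3, 3))
print(oft.shape)   # (32, 6, 6)
```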

SLIDE 7

Unified Computing Engine for Convolution – Scaled-up SU

[Figure: (a) for a 3×3, stride-1 convolution, consecutive sliding windows W0, W1, W2, W3, ... of the input feature map are dispatched to SU 0–3 in turn, so SU 0 handles W0, W4, ..., SU 1 handles W1, W5, ..., and so on; (b) the input buffer feeds the four supertile units through per-SU broadcast-cache sets (BC set 0–3), each SU writes to its own output-buffer set (OB set 0–3), and an assemble reader collects the OB sets into the output buffer.]

• Two challenges:

  • Task partitioning across the SUs
  • Data bandwidth requirements would be multiplied

• Solutions (sketched below):

  • Interleaved task dispatching
  • Dispatching-assembling buffering model
  • Broadcast cache (BC)
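A minimal sketch of the interleaved dispatching shown in figure (a): consecutive sliding windows are handed to SU 0–3 in round-robin order, so each SU processes every fourth window and no coarse-grained partition of the layer is needed. Illustrative code only; the names are hypothetical.

```python
NUM_SU = 4

def dispatch_windows(num_windows):
    """Round-robin (interleaved) assignment of sliding windows to SUs.

    Window W_i goes to SU (i mod 4): SU 0 gets W0, W4, W8, ...,
    SU 1 gets W1, W5, W9, ..., keeping all four SUs busy on one layer.
    """
    assignment = {su: [] for su in range(NUM_SU)}
    for w in range(num_windows):
        assignment[w % NUM_SU].append(f"W{w}")
    return assignment

print(dispatch_windows(12))
# {0: ['W0', 'W4', 'W8'], 1: ['W1', 'W5', 'W9'],
#  2: ['W2', 'W6', 'W10'], 3: ['W3', 'W7', 'W11']}
```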
SLIDE 8

Unified Computing Engine for Convolution – Broadcast Cache

[Figure: data buffering in the broadcast cache for kernel size 3×3, stride 1 — rows 1–10 are held in a circular buffer and read through a sliding window with WindowStride = Stride × 4.]

• The broadcast cache is a circular buffer placed between the input buffer and the SUs (sketched below)
• BC-window-stride = 4 × convolution stride
• It raises the effective read bandwidth delivered to the SUs to four times that of the input buffer (16 GB/s to 64 GB/s in this design)
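A minimal behavioural sketch of the broadcast cache as a circular buffer shared by four SUs, under the 3×3, stride-1 assumptions above. Each datum is read from the input buffer once and then served to all four SUs' overlapping windows, which is where the roughly fourfold bandwidth gain comes from; the class and its methods are hypothetical.

```python
class BroadcastCache:
    """Toy model: a circular buffer of feature-map rows shared by 4 SUs."""

    def __init__(self, ncols, nrows=4):
        self.data = [[0] * ncols for _ in range(nrows)]   # circular in rows

    def fill_row(self, row_id, row):
        """One input-buffer read fills one cached row (written once)."""
        self.data[row_id % len(self.data)] = row

    def read_window(self, su_id, step, top_row, kernel=3, stride=1):
        """Window read by SU su_id at its step-th position.

        Consecutive windows go to SU 0..3 in turn, so the window seen by a
        given SU advances by 4 columns per step:
        BC-window-stride = 4 * convolution stride.
        """
        col = (4 * step + su_id) * stride
        return [self.data[(top_row + r) % len(self.data)][col:col + kernel]
                for r in range(kernel)]
```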


SLIDE 9

Non-convolution Ops in Inference

• Challenges:

  • Many types of non-convolution ops
  • Limited remaining resources

• Solutions (see the dispatch sketch below):

  • Apply a different design strategy to each class of operator
  • Filter Processing Unit (FPU): MaxPool / AvgPool / DepthwiseConv / BN / ReLU / ReLU6
  • Customization for operations that work across channels: LRN
  • Operator fusion: ElementAdd / ReLU / DynamicQuantization
  • Functional-logic sharing
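The per-class strategy can be pictured as a simple dispatch table, as in the sketch below. The class membership mirrors the slide, but the code, names, and fallback behaviour are illustrative assumptions, not the actual toolchain.

```python
# Hypothetical illustration of routing non-convolution operators to a strategy.
FPU_OPS    = {"MaxPool", "AvgPool", "DepthwiseConv", "BN", "Relu", "Relu6"}
CUSTOM_OPS = {"LRN"}                                       # works across channels
FUSED_OPS  = {"ElementAdd", "Relu", "DynamicQuantization"} # fused with convolution

def strategy(op_type, follows_conv=False):
    if op_type in FUSED_OPS and follows_conv:
        return "fuse into preceding convolution"
    if op_type in FPU_OPS:
        return "filter processing unit (FPU)"
    if op_type in CUSTOM_OPS:
        return "customized logic"
    return "unsupported on the accelerator"

print(strategy("MaxPool"))                        # filter processing unit (FPU)
print(strategy("ElementAdd", follows_conv=True))  # fuse into preceding convolution
print(strategy("LRN"))                            # customized logic
```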
SLIDE 10

Postprocessing – FPU

[Figure: filter processing unit — a function-sharing part (ucmd fetch and decode, slice-loop control, address generation, kernel-load control, ucmd buffer, kernel buffer) drives a worker part of 2n parallel ALUs (+, ×, >) operating on the output buffer (OB).]

• Common styles shared by these operators:

  • A two-level data-access pattern
  • No operations across channels
  • Similar parameters
  • Pointwise operations handled as a special case (see the sketch below)
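A minimal sketch of the "filter-style" loop these operators share: a K×K window slides over each channel independently (nothing crosses channels), and a pointwise operation is simply the K = 1 case. This is a behavioural illustration of the shared structure, not the FPU microcode; the reducer functions are assumptions.

```python
import numpy as np

def filter_style_op(x, K, stride, reduce_fn):
    """Shared loop structure for FPU-class operators on one channel.

    x         : (H, W) single channel (no cross-channel operations)
    K, stride : window size and stride
    reduce_fn : how each window is reduced (max -> MaxPool, mean -> AvgPool,
                weighted sum -> DepthwiseConv; K = 1 makes pointwise ops
                such as BN scaling or ReLU a special case)
    """
    H, W = x.shape
    Ho, Wo = (H - K) // stride + 1, (W - K) // stride + 1
    y = np.empty((Ho, Wo), dtype=np.float32)
    for i in range(Ho):
        for j in range(Wo):
            y[i, j] = reduce_fn(x[i*stride:i*stride+K, j*stride:j*stride+K])
    return y

x = np.random.rand(8, 8).astype(np.float32)
maxpool = filter_style_op(x, K=2, stride=2, reduce_fn=np.max)
relu    = filter_style_op(x, K=1, stride=1, reduce_fn=lambda w: max(w[0, 0], 0.0))
print(maxpool.shape, relu.shape)   # (4, 4) (8, 8)
```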

SLIDE 11

Postprocessing – FPU

• Reconfigurable ALU

[Figure: reconfigurable ALU datapath — input ports premul_port_a, premul_port_b, and mid_share_port_b feed a pre-multiplier (pre_mul); its result passes through a middle adder (mid_adder) or comparator against a threshold (mid_cmp), selected by add_func_set and cmp_func_set; registers and MUXes then route the result to a back multiplier (back_mul) with port backmul_port_b.]
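A behavioural sketch of such a three-stage ALU (multiply, then add or compare against a threshold, then multiply), with stages effectively bypassed by neutral operands. The exact port semantics are assumptions made for illustration, not taken from the design.

```python
def reconfigurable_alu(a, b, *, premul_b=1.0, add_b=0.0, threshold=0.0,
                       backmul_b=1.0, mid="add"):
    """Toy model of a pre_mul -> (mid_adder | mid_cmp) -> back_mul datapath.

    mid = "add": y = (a * premul_b + add_b) * backmul_b        (scale and shift)
    mid = "cmp": y = (v if v > threshold else b) * backmul_b,  v = a * premul_b
                 (ReLU/max-style selection against a threshold)
    Passing premul_b = 1 or backmul_b = 1 effectively bypasses that stage.
    """
    v = a * premul_b
    v = v + add_b if mid == "add" else (v if v > threshold else b)
    return v * backmul_b

relu  = lambda x: reconfigurable_alu(x, 0.0, mid="cmp", threshold=0.0)
scale = lambda x: reconfigurable_alu(x, 0.0, add_b=0.5, backmul_b=2.0)
print(relu(-3.0), relu(3.0), scale(1.0))   # 0.0 3.0 3.0
```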

SLIDE 14

Postprocessing – Operator Fusion

• Avoids extra memory accesses between operators
• Four operations are fused with the convolution (sketched below)

[2] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2016.
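A minimal sketch of what fusing post-operations into the convolution means: the element-wise add, ReLU, and dynamic quantization are applied to each output value while it is still on chip, instead of writing an intermediate tensor out and reading it back for every operator. Purely illustrative; the function names are made up and the quantization is approximated by a round.

```python
import numpy as np

def conv_post_ops_unfused(conv_out, residual, scale):
    """Unfused: each operator produces (and re-reads) a full intermediate tensor."""
    t1 = conv_out + residual            # ElementAdd          (extra write + read)
    t2 = np.maximum(t1, 0.0)            # ReLU                (extra write + read)
    return np.round(t2 * scale)         # DynamicQuantization (approximated here)

def conv_post_ops_fused(conv_out, residual, scale):
    """Fused: the same arithmetic applied per element before results leave the
    accelerator, so the intermediates t1 and t2 never touch memory."""
    return np.round(np.maximum(conv_out + residual, 0.0) * scale)

a, b = np.random.randn(4, 4), np.random.randn(4, 4)
assert np.array_equal(conv_post_ops_unfused(a, b, 8.0), conv_post_ops_fused(a, b, 8.0))
```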

SLIDE 15

System Overview


SLIDE 18

System Overview

• Frequency

  • 500 MHz for the EPEs
  • 250 MHz for the rest of the logic

• DSP

  • Each SU: 512 DSPs
  • Each CNN engine: 4 SUs, 2048 DSPs
  • Whole chip: two CNN engines, 4096 DSPs, providing 4.2 TOP/s @ int16 for convolution (see the check below)

• Memory

  • IB: 4.2 Mbit × 2; read bandwidth 16 GB/s, raised to an effective 64 GB/s with the BC
  • OB: 4.2 Mbit × 2; bandwidth 64 GB/s × 2 (read and write)
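As a back-of-the-envelope check, plugging these figures into the performance formula from the supertile slide roughly reproduces the quoted peak, under the assumption that each double-pumped DSP contributes 2 operations per fast-clock cycle:

```python
# Peak-performance check: Performance = Freq * Dsp_num * Ops_per_dsp
freq_hz     = 500e6    # EPE (DSP) clock
dsp_per_su  = 512
su_per_eng  = 4
engines     = 2
ops_per_dsp = 2        # double-pumped DSP (assumption: 2 ops per fast cycle)

dsps = dsp_per_su * su_per_eng * engines            # 4096
peak = freq_hz * dsps * ops_per_dsp                 # ~4.1e12
print(f"{dsps} DSPs -> {peak / 1e12:.1f} TOP/s")    # ~4.1 TOP/s (4.2 TOP/s quoted)
```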

SLIDE 19

Experimental Results: Performance on Three Models

                              AlexNet         GoogLeNet       HCNet
Data precision                16-bit          16-bit          16-bit
Clock (MHz)                   250/500         250/500         225/450
Batch size                    4               2               4
CNN size (MOPs)               1331.6/1448.8   3081.0/3083.1   444
Throughput (FPS)              1753.8          527.7           1465.1
Performance (GOP/s)           2335.4          1625.9          650.5
Latency (ms)                  2.3             3.8             2.7
Power (watts)                 62.6            56.6            57.6
Speedup vs. P4 (7 ms)         1.4             3.9             3.4
Energy efficiency (GOP/s/W)   37.3            28.7            11.3

• Networks: AlexNet, GoogLeNet, and HCNet (high-concurrency network)

SLIDE 20

Experimental Results: Comparison with FPGA-Based Accelerators

                           [3]            [4]            [5]         Ours        Ours
FPGA chip                  Arria10-1150   Virtex7-690t   KU115       KU115       KU115
Network                    VGG            AlexNet        VGG         GoogLeNet   AlexNet
CNN size (GOPs)            30.8           1.4            30.8        3.1         1.3
Freq (MHz)                 385            150            235         250/500     250/500
Precision                  Fix16          Fix16          Fix16       Fix16       Fix16
DSPs (used/total)          2756/3036      2833/3600      4318/5520   4214/5520   4214/5520
Peak performance (TOP/s)   2.1            0.8            2.1         4.2         4.2
Real performance (TOP/s)   1.79           0.6            2           1.63        2.3

[3] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2017.
[4] C. Zhang et al., "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in Proc. 35th Int. Conf. on Computer-Aided Design (ICCAD), 2016, p. 12.
[5] X. Zhang et al., "DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proc. Int. Conf. on Computer-Aided Design (ICCAD), 2018.

SLIDE 21

Experimental Results: Comparison with CPUs and GPU in the Data Center

Processor         Per server   TOP/s (16-bit)   TOP/s (FP32)   nm   MHz       On-chip memory (MB)   Off-chip memory BW (GB/s)   Power (W)   Release
Intel E5-2680v4   2            –                –              14   2400      35 × 2                76.8 × 2                    120         2016 Q1
NVIDIA P4         1            –                5.5            16   1000      10 [38]               192                         50-75       2016 Q3
Xilinx KU115      1            4.2              –              20   250/500   11.8                  38.4                        50-66       2014 Q4

SLIDE 23

Comparison with CPU and GPU

• Limitations of the FPGA relative to the P4:

  • An older fabrication process (20 nm vs. 16 nm)
  • About 20% of the memory bandwidth (38.4 GB/s vs. 192 GB/s)
  • Roughly 1/4 of the clock frequency

• Achievements:

  • Superior performance in the latency-sensitive test
  • 89% of the throughput at 1/57 of the latency in the throughput-sensitive test
  • Performance can be improved further by moving to a 16 nm UltraScale+ VU9P
SLIDE 24

Conclusion

• A unified framework that handles different CNN models and makes new models easy to try
• Supertile EPEs are scaled up into multiple SUs with an interleaved task-dispatching method to push back the computation bound
• The bandwidth limitation is overcome with the dispatching-assembling buffering model and the broadcast cache (BC)
• A configurable FPU supports many types of non-convolution operators

Performance: 4.2 TOP/s in fix16
Latency: 50× lower than the GPU
TCO: 149% vs. 32% compared with the CPU
Application: serving 1 billion people every day

SLIDE 25


Thank you!

Contact: kevinxiaoyu@tencent.com