SLIDE 1


A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Xiaoyu Yu (1), Yuwei Wang (1), Jie Miao (1), Ephrem Wu (2), Heng Zhang (1), Yu Meng (1), Bo Zhang (1), Biao Min (1), Dewei Chen (1), Jianlin Gao (1)
(1) Tencent, Shenzhen, China   (2) Xilinx, Inc., San Jose, CA 95124, USA

SLIDE 2

About Tencent

• One of the top 5 Internet companies by market value
• Monthly active users reach 1 billion for WeChat and 0.8 billion for QQ
• Founded in 1998
• Users in over 200 countries
• Image and video workloads: photos and videos from WeChat Moments, live video streaming, profile photos, images in group chats

SLIDE 3

Background  CNN Models are widely used in Tencent  Billions of operations per inference task × Billions of task each day  Models are still in fast evolution.  A reconfigurable accelerator is desirable  Three key objectives:

  • Support different CNN models, easy to try
  • Achieve higher performance to lower TCO
  • Low latency
SLIDE 4

Framework for General-Purpose CNN Acceleration

• More and more CNN models are emerging
• Operator classification:

  • Convolution: 19% of operator types, more than 95% of the computation cost
  • Non-convolution: 81% of operator types, less than 5% of the computation cost

• Different design strategies for the two classes (see the sketch below):

  • Convolution: focus on performance improvement
  • Non-convolution: support the many different operator types
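As a rough illustration of this split, the sketch below (not from the paper; the operator list and counts are hypothetical) tallies how operator types and computation cost divide between convolution and everything else:

```python
# Hypothetical sketch: classify a model's operators and tally their compute cost.
# The operator list is made up for illustration; real numbers come from profiling.
from collections import defaultdict

CONV_OPS = {"Conv2D"}  # handled by the convolution engine; all others are "non-convolution"

# (op_type, estimated multiply-accumulates) for an imaginary network
ops = [
    ("Conv2D", 3.2e9), ("Conv2D", 1.1e9), ("Conv2D", 0.4e9),
    ("MaxPool", 1e6), ("BatchNorm", 2e6), ("Relu", 1e6), ("ElementAdd", 1e6),
]

cost, kinds = defaultdict(float), defaultdict(set)
for name, macs in ops:
    group = "convolution" if name in CONV_OPS else "non-convolution"
    cost[group] += macs
    kinds[group].add(name)

total = sum(cost.values())
for group in ("convolution", "non-convolution"):
    share = 100 * cost[group] / total
    print(f"{group}: {len(kinds[group])} op types, {share:.1f}% of compute")
```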
SLIDE 5

Unified Computing Engine for Convolution—Supertile

[Figure: enhanced processing element (EPE) — a DSP performing multiply-accumulate (× +) on an activation input, with weight updates steered through MUXes, a weight cache, and buffers A–D.]

[1] E. Wu, X. Zhang, D. Berman, and I. Cho, "A high-throughput reconfigurable processing array for neural networks," in Proc. 27th Int. Conf. on Field Programmable Logic and Applications (FPL), 2017, pp. 1–4.

• Performance = Freq × Dsp_num × Ops_per_dsp
• The supertile method runs the DSPs at twice the clock rate of the surrounding logic [1]
• Enhanced Processing Element (EPE)
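A minimal behavioural sketch of why double-pumping raises Ops_per_dsp: over one cycle of the slower fabric clock, the DSP sees two fast-clock phases and can retire two multiply-accumulates. This only illustrates the idea, not the RTL, and the function name is hypothetical.

```python
def epe_fabric_cycle(acc, activations, weights):
    """Behavioural model of one EPE over one fabric-clock cycle.

    The DSP runs at twice the fabric clock, so it performs two
    multiply-accumulate operations per fabric cycle (Ops_per_dsp = 2),
    doubling throughput for the same amount of surrounding logic.
    """
    for phase in range(2):                       # two fast-clock phases
        acc += activations[phase] * weights[phase]
    return acc

print(epe_fabric_cycle(0, [2, 3], [5, 7]))       # 2*5 + 3*7 = 31
```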

SLIDE 6

Unified Computing Engine for Convolution – Supertile Unit (SU)

• A supertile unit is an m×n array of EPEs
• The input feature map tile (IFT), with Cin channels, is held in the input buffer
• 2n kernel groups, each with m channels, supply the weights
• Temporary results form an output feature map tile (OFT) with 2n channels and are written to the output buffer

[Figure: convolution of an input feature map tile (IFT) with 2n kernel groups on one SU.]
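A minimal functional sketch of what one SU computes under the description above (an m-channel input tile convolved with 2n kernel groups to produce a 2n-channel output tile). This is a NumPy behavioural model of the arithmetic, not the hardware mapping onto the m×n EPE grid:

```python
import numpy as np

def su_convolve(ift, kernels, stride=1):
    """Behavioural model of one supertile unit (SU).

    ift     : (m, H, W)          input feature map tile with m channels
    kernels : (2n, m, K, K)      2n kernel groups, each with m channels
    returns : (2n, H_out, W_out) output feature map tile (OFT)
    """
    m, H, W = ift.shape
    out_ch, m_k, K, _ = kernels.shape
    assert m == m_k, "kernel channel count must match the IFT"
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    oft = np.zeros((out_ch, H_out, W_out), dtype=np.float32)
    for oc in range(out_ch):                       # one of the 2n output channels
        for y in range(H_out):
            for x in range(W_out):
                window = ift[:, y*stride:y*stride+K, x*stride:x*stride+K]
                oft[oc, y, x] = np.sum(window * kernels[oc])  # m-channel dot product
    return oft

oft = su_convolve(np.random.rand(16, 8, 8), np.random.rand(32, 16, 3, 3))
print(oft.shape)   # (32, 6, 6)
```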

SLIDE 7

Unified Computing Engine for Convolution – Scaled-up SU

[Figure: (a) for a 3×3, stride-1 convolution, consecutive sliding windows W0, W1, W2, W3, ... of the input feature map are dispatched to SU 0–3 in turn, so SU 0 handles W0, W4, ..., SU 1 handles W1, W5, ..., and so on; (b) the input buffer feeds the four supertile units through per-SU broadcast-cache sets (BC set 0–3), each SU writes to its own output-buffer set (OB set 0–3), and an assemble reader collects the OB sets into the output buffer.]

• Two challenges:

  • Task partitioning across the SUs
  • Data bandwidth requirements would be multiplied

• Solutions (sketched below):

  • Interleaved task dispatching
  • Dispatching-assembling buffering model
  • Broadcast cache (BC)
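A minimal sketch of the interleaved dispatching shown in figure (a): consecutive sliding windows are handed to SU 0–3 in round-robin order, so each SU processes every fourth window and no coarse-grained partition of the layer is needed. Illustrative code only; the names are hypothetical.

```python
NUM_SU = 4

def dispatch_windows(num_windows):
    """Round-robin (interleaved) assignment of sliding windows to SUs.

    Window W_i goes to SU (i mod 4): SU 0 gets W0, W4, W8, ...,
    SU 1 gets W1, W5, W9, ..., keeping all four SUs busy on one layer.
    """
    assignment = {su: [] for su in range(NUM_SU)}
    for w in range(num_windows):
        assignment[w % NUM_SU].append(f"W{w}")
    return assignment

print(dispatch_windows(12))
# {0: ['W0', 'W4', 'W8'], 1: ['W1', 'W5', 'W9'],
#  2: ['W2', 'W6', 'W10'], 3: ['W3', 'W7', 'W11']}
```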
SLIDE 8

Unified Computing Engine for Convolution – Broadcast Cache

[Figure: data buffering in the broadcast cache for kernel size 3×3, stride 1 — rows 1–10 are held in a circular buffer and read through a sliding window with WindowStride = Stride × 4.]

• The broadcast cache is a circular buffer placed between the input buffer and the SUs (sketched below)
• BC-window-stride = 4 × convolution stride
• It raises the effective read bandwidth delivered to the SUs to four times that of the input buffer (16 GB/s to 64 GB/s in this design)
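A minimal behavioural sketch of the broadcast cache as a circular buffer shared by four SUs, under the 3×3, stride-1 assumptions above. Each datum is read from the input buffer once and then served to all four SUs' overlapping windows, which is where the roughly fourfold bandwidth gain comes from; the class and its methods are hypothetical.

```python
class BroadcastCache:
    """Toy model: a circular buffer of feature-map rows shared by 4 SUs."""

    def __init__(self, ncols, nrows=4):
        self.data = [[0] * ncols for _ in range(nrows)]   # circular in rows

    def fill_row(self, row_id, row):
        """One input-buffer read fills one cached row (written once)."""
        self.data[row_id % len(self.data)] = row

    def read_window(self, su_id, step, top_row, kernel=3, stride=1):
        """Window read by SU su_id at its step-th position.

        Consecutive windows go to SU 0..3 in turn, so the window seen by a
        given SU advances by 4 columns per step:
        BC-window-stride = 4 * convolution stride.
        """
        col = (4 * step + su_id) * stride
        return [self.data[(top_row + r) % len(self.data)][col:col + kernel]
                for r in range(kernel)]
```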


SLIDE 9

Non-convolution Ops in Inference

• Challenges:

  • Many types of non-convolution ops
  • Limited remaining resources

• Solutions (see the dispatch sketch below):

  • Apply a different design strategy to each class of operator
  • Filter Processing Unit (FPU): MaxPool / AvgPool / DepthwiseConv / BN / ReLU / ReLU6
  • Customization for operations that work across channels: LRN
  • Operator fusion: ElementAdd / ReLU / DynamicQuantization
  • Functional-logic sharing
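The per-class strategy can be pictured as a simple dispatch table, as in the sketch below. The class membership mirrors the slide, but the code, names, and fallback behaviour are illustrative assumptions, not the actual toolchain.

```python
# Hypothetical illustration of routing non-convolution operators to a strategy.
FPU_OPS    = {"MaxPool", "AvgPool", "DepthwiseConv", "BN", "Relu", "Relu6"}
CUSTOM_OPS = {"LRN"}                                       # works across channels
FUSED_OPS  = {"ElementAdd", "Relu", "DynamicQuantization"} # fused with convolution

def strategy(op_type, follows_conv=False):
    if op_type in FUSED_OPS and follows_conv:
        return "fuse into preceding convolution"
    if op_type in FPU_OPS:
        return "filter processing unit (FPU)"
    if op_type in CUSTOM_OPS:
        return "customized logic"
    return "unsupported on the accelerator"

print(strategy("MaxPool"))                        # filter processing unit (FPU)
print(strategy("ElementAdd", follows_conv=True))  # fuse into preceding convolution
print(strategy("LRN"))                            # customized logic
```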
SLIDE 10

Postprocessing – FPU

[Figure: filter processing unit — a function-sharing part (ucmd fetch and decode, slice-loop control, address generation, kernel-load control, ucmd buffer, kernel buffer) drives a worker part of 2n parallel ALUs (+, ×, >) operating on the output buffer (OB).]

• Common styles shared by these operators:

  • A two-level data-access pattern
  • No operations across channels
  • Similar parameters
  • Pointwise operations handled as a special case (see the sketch below)
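A minimal sketch of the "filter-style" loop these operators share: a K×K window slides over each channel independently (nothing crosses channels), and a pointwise operation is simply the K = 1 case. This is a behavioural illustration of the shared structure, not the FPU microcode; the reducer functions are assumptions.

```python
import numpy as np

def filter_style_op(x, K, stride, reduce_fn):
    """Shared loop structure for FPU-class operators on one channel.

    x         : (H, W) single channel (no cross-channel operations)
    K, stride : window size and stride
    reduce_fn : how each window is reduced (max -> MaxPool, mean -> AvgPool,
                weighted sum -> DepthwiseConv; K = 1 makes pointwise ops
                such as BN scaling or ReLU a special case)
    """
    H, W = x.shape
    Ho, Wo = (H - K) // stride + 1, (W - K) // stride + 1
    y = np.empty((Ho, Wo), dtype=np.float32)
    for i in range(Ho):
        for j in range(Wo):
            y[i, j] = reduce_fn(x[i*stride:i*stride+K, j*stride:j*stride+K])
    return y

x = np.random.rand(8, 8).astype(np.float32)
maxpool = filter_style_op(x, K=2, stride=2, reduce_fn=np.max)
relu    = filter_style_op(x, K=1, stride=1, reduce_fn=lambda w: max(w[0, 0], 0.0))
print(maxpool.shape, relu.shape)   # (4, 4) (8, 8)
```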

SLIDE 11

Postprocessing – FPU

• Reconfigurable ALU

[Figure: reconfigurable ALU datapath — input ports premul_port_a, premul_port_b, and mid_share_port_b feed a pre-multiplier (pre_mul); its result passes through a middle adder (mid_adder) or comparator against a threshold (mid_cmp), selected by add_func_set and cmp_func_set; registers and MUXes then route the result to a back multiplier (back_mul) with port backmul_port_b.]
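A behavioural sketch of such a three-stage ALU (multiply, then add or compare against a threshold, then multiply), with stages effectively bypassed by neutral operands. The exact port semantics are assumptions made for illustration, not taken from the design.

```python
def reconfigurable_alu(a, b, *, premul_b=1.0, add_b=0.0, threshold=0.0,
                       backmul_b=1.0, mid="add"):
    """Toy model of a pre_mul -> (mid_adder | mid_cmp) -> back_mul datapath.

    mid = "add": y = (a * premul_b + add_b) * backmul_b        (scale and shift)
    mid = "cmp": y = (v if v > threshold else b) * backmul_b,  v = a * premul_b
                 (ReLU/max-style selection against a threshold)
    Passing premul_b = 1 or backmul_b = 1 effectively bypasses that stage.
    """
    v = a * premul_b
    v = v + add_b if mid == "add" else (v if v > threshold else b)
    return v * backmul_b

relu  = lambda x: reconfigurable_alu(x, 0.0, mid="cmp", threshold=0.0)
scale = lambda x: reconfigurable_alu(x, 0.0, add_b=0.5, backmul_b=2.0)
print(relu(-3.0), relu(3.0), scale(1.0))   # 0.0 3.0 3.0
```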

SLIDE 14

Postprocessing – Operator Fusion

• Avoids extra memory accesses between operators
• Four operations are fused with the convolution (sketched below)

[2] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2016.
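A minimal sketch of what fusing post-operations into the convolution means: the element-wise add, ReLU, and dynamic quantization are applied to each output value while it is still on chip, instead of writing an intermediate tensor out and reading it back for every operator. Purely illustrative; the function names are made up and the quantization is approximated by a round.

```python
import numpy as np

def conv_post_ops_unfused(conv_out, residual, scale):
    """Unfused: each operator produces (and re-reads) a full intermediate tensor."""
    t1 = conv_out + residual            # ElementAdd          (extra write + read)
    t2 = np.maximum(t1, 0.0)            # ReLU                (extra write + read)
    return np.round(t2 * scale)         # DynamicQuantization (approximated here)

def conv_post_ops_fused(conv_out, residual, scale):
    """Fused: the same arithmetic applied per element before results leave the
    accelerator, so the intermediates t1 and t2 never touch memory."""
    return np.round(np.maximum(conv_out + residual, 0.0) * scale)

a, b = np.random.randn(4, 4), np.random.randn(4, 4)
assert np.array_equal(conv_post_ops_unfused(a, b, 8.0), conv_post_ops_fused(a, b, 8.0))
```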

SLIDE 15

System Overview


SLIDE 18

System Overview

• Frequency

  • 500 MHz for the EPEs
  • 250 MHz for the rest of the logic

• DSP

  • Each SU: 512 DSPs
  • Each CNN engine: 4 SUs, 2048 DSPs
  • Whole chip: two CNN engines, 4096 DSPs, providing 4.2 TOP/s @ int16 for convolution (see the check below)

• Memory

  • IB: 4.2 Mbit × 2; read bandwidth 16 GB/s, raised to an effective 64 GB/s with the BC
  • OB: 4.2 Mbit × 2; bandwidth 64 GB/s × 2 (read and write)
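As a back-of-the-envelope check, plugging these figures into the performance formula from the supertile slide roughly reproduces the quoted peak, under the assumption that each double-pumped DSP contributes 2 operations per fast-clock cycle:

```python
# Peak-performance check: Performance = Freq * Dsp_num * Ops_per_dsp
freq_hz     = 500e6    # EPE (DSP) clock
dsp_per_su  = 512
su_per_eng  = 4
engines     = 2
ops_per_dsp = 2        # double-pumped DSP (assumption: 2 ops per fast cycle)

dsps = dsp_per_su * su_per_eng * engines            # 4096
peak = freq_hz * dsps * ops_per_dsp                 # ~4.1e12
print(f"{dsps} DSPs -> {peak / 1e12:.1f} TOP/s")    # ~4.1 TOP/s (4.2 TOP/s quoted)
```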

SLIDE 19

Experimental Results: Performance on Three Models

                              AlexNet         GoogLeNet       HCNet
Data precision                16-bit          16-bit          16-bit
Clock (MHz)                   250/500         250/500         225/450
Batch size                    4               2               4
CNN size (MOPs)               1331.6/1448.8   3081.0/3083.1   444
Throughput (FPS)              1753.8          527.7           1465.1
Performance (GOP/s)           2335.4          1625.9          650.5
Latency (ms)                  2.3             3.8             2.7
Power (watts)                 62.6            56.6            57.6
Speedup vs. P4 (7 ms)         1.4             3.9             3.4
Energy efficiency (GOP/s/W)   37.3            28.7            11.3

• Networks: AlexNet, GoogLeNet, and HCNet (high-concurrency network)

SLIDE 20

Experimental Results: Comparison with FPGA-Based Accelerators

                           [3]            [4]            [5]         Ours        Ours
FPGA chip                  Arria10-1150   Virtex7-690t   KU115       KU115       KU115
Network                    VGG            AlexNet        VGG         GoogLeNet   AlexNet
CNN size (GOPs)            30.8           1.4            30.8        3.1         1.3
Freq (MHz)                 385            150            235         250/500     250/500
Precision                  Fix16          Fix16          Fix16       Fix16       Fix16
DSPs (used/total)          2756/3036      2833/3600      4318/5520   4214/5520   4214/5520
Peak performance (TOP/s)   2.1            0.8            2.1         4.2         4.2
Real performance (TOP/s)   1.79           0.6            2           1.63        2.3

[3] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2017.
[4] C. Zhang et al., "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in Proc. 35th Int. Conf. on Computer-Aided Design (ICCAD), 2016, p. 12.
[5] X. Zhang et al., "DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proc. Int. Conf. on Computer-Aided Design (ICCAD), 2018.

SLIDE 21

Experimental Results: Comparison with CPUs and GPU in the Data Center

Processor         Per server   TOP/s (16-bit)   TOP/s (FP32)   nm   MHz       On-chip memory (MB)   Off-chip memory BW (GB/s)   Power (W)   Release
Intel E5-2680v4   2            –                –              14   2400      35 × 2                76.8 × 2                    120         2016 Q1
NVIDIA P4         1            –                5.5            16   1000      10 [38]               192                         50-75       2016 Q3
Xilinx KU115      1            4.2              –              20   250/500   11.8                  38.4                        50-66       2014 Q4

SLIDE 23

Comparison with CPU and GPU

• Limitations of the FPGA relative to the P4:

  • An older fabrication process (20 nm vs. 16 nm)
  • About 20% of the memory bandwidth (38.4 GB/s vs. 192 GB/s)
  • Roughly 1/4 of the clock frequency

• Achievements:

  • Superior performance in the latency-sensitive test
  • 89% of the throughput at 1/57 of the latency in the throughput-sensitive test
  • Performance can be improved further by moving to a 16 nm UltraScale+ VU9P
SLIDE 24

Conclusion

• A unified framework that handles different CNN models and makes new models easy to try
• Supertile EPEs are scaled up into multiple SUs with an interleaved task-dispatching method to push back the computation bound
• The bandwidth limitation is overcome with the dispatching-assembling buffering model and the broadcast cache (BC)
• A configurable FPU supports many types of non-convolution operators

Performance: 4.2 TOP/s in fix16
Latency: 50× lower than the GPU
TCO: 149% vs. 32% compared with the CPU
Application: serving 1 billion people every day

SLIDE 25


Thank you!

Contact: kevinxiaoyu@tencent.com