Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization

Meng Li*, YiLei Li*, Pierce Chuang, Liangzhen Lai, and Vikas Chandra
EMC2 Workshop @ NeurIPS 2019
Facebook Silicon AI Research



Motivation


Dataflow processing is widely exploited to amortize memory access energy
Datapath energy becomes important for dataflow accelerators

  • Consists of compute energy in the processing elements (PEs) and data propagation energy among PEs

[Figure: output-stationary and input-stationary PE-array dataflows; energy breakdowns (PE array / datapath, buffer, misc.) show that the PE array accounts for 57.7% of total energy in Thinker [Yin+, JSSC'18] and the datapath for 87.3% in ShiDianNao [Du+, ISCA'15]]


Motivation


In dataflow processing, operands are streamed into the compute array
Datapath energy is determined by the total bit flips induced by operand streaming
Target: propose post-training and training-aware techniques to reduce the bit flips of weight streaming

[Figure: rows of the weight matrix W[k, c] (K x C) and of the activation matrix A (C x H x W) are streamed into the PE array in sequence]

K, C, H, and W denote the output channels, input channels, output height, and output width, respectively

[Plot: normalized energy vs. total bit flips]
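As a concrete illustration of the metric, the total bit flips induced by streaming a quantized weight matrix row by row can be counted as below (the function name and the list-of-ints representation are our own, not from the slides):

```python
def row_bit_flips(weights):
    """Total bit flips induced by streaming the rows of a quantized
    weight matrix (a list of rows of small non-negative ints) into a
    PE array one after another.  The flips between two consecutive
    rows equal their Hamming distance: the popcount of their XOR."""
    flips = 0
    for prev, cur in zip(weights, weights[1:]):
        # XOR marks the toggled bit positions; bin(...).count("1") is popcount
        flips += sum(bin(p ^ c).count("1") for p, c in zip(prev, cur))
    return flips
```

For example, streaming 0b0000 followed by 0b1111 toggles all four datapath wires once, so `row_bit_flips([[0b0000], [0b1111]])` is 4.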


Post-Training Optimization: Output Channel Reordering


To reduce bit flips, the most straightforward technique is output channel reordering

  • Output channel reordering can be mapped to a traveling salesman problem, which can be approximately solved with efficient greedy algorithms

[Figure: example K x C weight matrix with 2-bit weights; reordering the K output channels reduces the Hamming distance between consecutively streamed rows]
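A minimal nearest-neighbour sketch of the greedy heuristic for the TSP formulation (our own variant; the exact greedy algorithm used in the paper may differ):

```python
def hamming(row_a, row_b):
    """Hamming distance between two rows of quantized weights."""
    return sum(bin(a ^ b).count("1") for a, b in zip(row_a, row_b))

def greedy_channel_order(weights):
    """Nearest-neighbour heuristic: start from output channel 0, then
    repeatedly stream next the unvisited channel with the smallest
    Hamming distance to the last streamed one."""
    order = [0]
    remaining = set(range(1, len(weights)))
    while remaining:
        last = weights[order[-1]]
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda i: hamming(last, weights[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

For the rows [0b0000, 0b1111, 0b0001], streaming in the original order costs 4 + 3 = 7 flips, while the greedy order [0, 2, 1] costs only 1 + 3 = 4.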


Post-Training Optimization: Input Channel Clustering


For most networks, the channel dimension can be larger than the compute array size, so weight matrices need to be segmented first and then fed into the compute array

  • Each weight sub-matrix can use different output channel orders
  • Before segmenting the weight matrix, different input channels can be clustered first

Propose an iterative assignment-and-update approach for input channel clustering

[Figure: the C input channels are clustered into two clusters; the K output channels of each cluster's sub-matrix are then reordered independently]
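The slides do not spell out the iterative assignment-and-update procedure; the k-means-style sketch below (our own simplification, not the paper's algorithm) conveys the idea: assign each input channel (a column of the weight matrix) to the cluster with the nearest representative under Hamming distance, then update each representative by a bitwise majority vote.

```python
def col_hamming(col_a, col_b):
    """Hamming distance between two input-channel columns."""
    return sum(bin(a ^ b).count("1") for a, b in zip(col_a, col_b))

def bitwise_majority(cols, bits=4):
    """Per-position, per-bit majority vote over a group of columns."""
    rep = []
    for values in zip(*cols):
        v = 0
        for b in range(bits):
            if sum((x >> b) & 1 for x in values) * 2 > len(values):
                v |= 1 << b
        rep.append(v)
    return rep

def cluster_input_channels(cols, n_clusters, iters=5, bits=4):
    """Alternate between assigning columns to the nearest cluster
    representative and updating each representative (assignment/update)."""
    reps = [list(c) for c in cols[:n_clusters]]  # naive initialization
    assignment = [0] * len(cols)
    for _ in range(iters):
        assignment = [
            min(range(n_clusters), key=lambda j: col_hamming(c, reps[j]))
            for c in cols
        ]
        for j in range(n_clusters):
            members = [cols[i] for i, a in enumerate(assignment) if a == j]
            if members:
                reps[j] = bitwise_majority(members, bits)
    return assignment
```

Once the columns are grouped, each cluster's sub-matrix can be given its own output channel order, as on the slide.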


Experimental Results


Post-training optimization technique comparison

  • Use 1x1 Conv in MobileNetV2 and 3x3 Conv in ResNet26 for evaluation

Combine post-training and training-aware optimization

  • Incorporate bit flip loss into the loss function
  • Use MobileNetV2 trained on CIFAR-100 for evaluation
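The slides only state that a bit-flip loss is added to the loss function; the sketch below (names and the weighting factor `lam` are our own) shows the shape of such a combined objective evaluated on already-quantized weights. In actual training the bit-flip term would need a differentiable relaxation, which is not shown here.

```python
def bit_flip_loss(q_weights):
    """Total Hamming distance between consecutively streamed rows of a
    quantized weight matrix (list of rows of small non-negative ints)."""
    return sum(
        bin(p ^ c).count("1")
        for prev, cur in zip(q_weights, q_weights[1:])
        for p, c in zip(prev, cur)
    )

def combined_loss(task_loss, q_weights, lam=1e-4):
    # Training-aware objective: task loss plus a weighted bit-flip term
    return task_loss + lam * bit_flip_loss(q_weights)
```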

[Chart: average Hamming distance reduction vs. channels per cluster (8, 16, 32, 64) for MobileNetV2 and ResNet26, comparing Baseline, Direct Reorder, and Cluster-then-Reorder]

[Chart: Hamming distance reduction and energy reduction for Baseline, Post-Training, Training-Aware, and Combined optimization]