SLIDE 1

Ultra-low-bit Neural Network Quantization

Peisong Wang

Institute of Automation, Chinese Academy of Sciences

peisong.wang@nlpr.ia.ac.cn

2020.06.03

Collaborators: Weixiang Xu, Tianli Zhao, Fanrong Li, Xiangyu He, Gang Li, Jian Cheng, Cong Leng

SLIDE 2

From: Russ Salakhutdinov


Background: Deep Learning

SLIDE 3

Classification, Detection, Segmentation: Convolutional Neural Networks


Background: Application of CNNs

SLIDE 4


Train ResNet-50, from several days down to:

  • Facebook: 1 hour
  • Fast.ai: 18 min
  • Tencent: 6.6 min
  • Sony: 3.7 min
  • Google: 2.2 min
  • SenseTime: 1.5 min

Background: Training

SLIDE 5

Face Unlock, Intelligent Robot, Intelligent Surveillance, Self-Driving Car, AR/VR


Background: Real World Applications

  • Low inference speed
  • Large memory/storage
  • High power consumption

SLIDE 6
  • Low-rank Decomposition
  • Sparse/Pruning
  • Quantization
  • Knowledge Distillation
  • ……


Network Acceleration and Compression

SLIDE 7

S: sign, E: exponent, M: mantissa

FP32: S (1 bit) | E (8 bits) | M (23 bits)

value = (−1)^S × 1.M × 2^(E−127)

Int8: S (1 bit) | M (7 bits)
Int4: S (1 bit) | M (3 bits)

value = (−1)^S × M

Fixed-point representation
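The two formulas above can be checked in a few lines. This is a minimal sketch for normalized FP32 numbers only (no subnormals, infinities, or NaNs); `fp32_fields` and `fp32_value` are illustrative helper names, not part of any slide.

```python
import struct

def fp32_fields(x):
    # Reinterpret the float's 32 bits: sign (1) | exponent (8) | mantissa (23).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp32_value(sign, exponent, mantissa):
    # Normalized-number formula from the slide: (-1)^S * 1.M * 2^(E - 127).
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)

s, e, m = fp32_fields(-6.5)   # -6.5 = (-1)^1 * 1.625 * 2^2
assert (s, e, m) == (1, 129, 5242880)
assert fp32_value(s, e, m) == -6.5
```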

SLIDE 8
  • Saving memory
  • Saving energy
  • Saving time
  • Saving area

Mark Horowitz, Computing’s Energy Problem. ISSCC 2014.


Why Fixed-point quantization?

SLIDE 9

An n-bit code gives 2ⁿ values: 000…000 ~ 111…111

  Code      Non-uniform    Uniform    Logarithmic
  0…000     c₀             0          —
  0…001     c₁             1          1
  0…010     c₂             2          2
  0…011     c₃             3          4
  0…100     c₄             4          8
  0…101     c₅             5          16
  0…110     c₆             6          32
  …         …              …          …
  1…111     c₂ⁿ₋₁          2ⁿ − 1     2^(2ⁿ−1)

Non-uniform, uniform, and logarithmic quantization: scalar quantization with/without constraints.

Type of quantization
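The uniform and logarithmic columns of the table can be sketched as simple quantizers. Function names and the `scale` parameter are illustrative; the non-uniform case needs a learned codebook and is omitted.

```python
import numpy as np

def uniform_quantize(x, n_bits, scale):
    # Uniform levels k * scale, k = 0 .. 2^n - 1 (unsigned, as in the table).
    q = np.clip(np.round(x / scale), 0, 2 ** n_bits - 1)
    return q * scale

def log_quantize(x, n_bits):
    # Logarithmic levels +/- 2^k, k = 0 .. 2^n - 1: round log2|x| to an integer.
    k = np.clip(np.round(np.log2(np.maximum(np.abs(x), 1e-12))), 0, 2 ** n_bits - 1)
    return np.sign(x) * 2.0 ** k

# Logarithmic levels grow exponentially, so large values snap to powers of two.
print(log_quantize(np.array([5.0, -13.0]), 3))
```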

SLIDE 10
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 11
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 12

Binary: Sparsity-inducing BNN


Binary == {−1, +1}

[Figure: binary weight matrices with all entries in {−1, +1}]

Binary = two states (b₁, b₂). Which two states to use?

Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng, “Sparsity-inducing Binarized Neural Networks”, AAAI, 2020.

Previous binary approach:

SLIDE 13

Sparsity-inducing BNN


How to accelerate a BNN with 0/1 activations?

(−1, +1) → (b₁, b₂): reparameterization with an affine transformation
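The affine reparameterization can be verified numerically. This is a toy sketch with hand-picked vectors, not the paper's training procedure:

```python
import numpy as np

w = np.array([1, -1, 1, 1, -1, -1, 1, -1])   # binary weights in {-1, +1}
b = np.array([1, 0, 1, 1, 0, 1, 0, 0])       # sparse 0/1 activations

# Affine view: a = 2b - 1 maps {0, 1} back to {-1, +1}, so
#   w . b = (w . a + sum(w)) / 2
# i.e. the 0/1 dot product equals a {-1,+1} dot product plus a per-filter
# constant, while zeros in b let whole multiply-accumulates be skipped.
a = 2 * b - 1
assert w @ b == (w @ a + w.sum()) / 2
```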

SLIDE 14

Sparsity-inducing BNN


How to determine the threshold of 0/1 binarization?

  • Binarization at the zero-point: weights/activations roughly follow a normal distribution, giving large quantization error.

He Z, Fan D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation. CVPR 2019.

  • Binarization at a threshold τ: the mutual information I(z; ẑ) of two discrete random variables z and ẑ can be defined as

    I(z; ẑ) = Σ_{z, ẑ} p(z, ẑ) log [ p(z, ẑ) / (p(z) p(ẑ)) ]
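For a deterministic threshold, ẑ is a function of z, so I(z; ẑ) reduces to the entropy H(ẑ), which depends only on q = p(ẑ = 0). A small sketch (the paper's actual threshold selection is not reproduced here):

```python
import math

def binary_entropy(q):
    # H(q) = -q log2 q - (1 - q) log2 (1 - q): for deterministic
    # binarization z_hat = 1[z > tau], I(z; z_hat) = H(z_hat) = H(q),
    # a function of q = p(z_hat = 0) alone.
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

# Entropy peaks at q = 0.5 and falls off as the code gets sparser.
assert binary_entropy(0.5) == 1.0
assert binary_entropy(0.9) < binary_entropy(0.5)
```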

SLIDE 15

Sparsity-inducing BNN


Mutual information can be formulated as a function of q = p(ẑ = 0).

[Figure: ablation study on the selection of the threshold, on AlexNet]

How to determine the threshold of 0/1 binarization?

SLIDE 16


Sparsity-inducing BNN

[Tables: comparison with 2-bit methods on AlexNet and ResNet-18]

  • Extends to other network structures
  • Without bells and whistles

Experiments:

SLIDE 17

Sparsity-inducing BNN


Run-time speedup:

Tianli Zhao, Xiangyu He, Jian Cheng. BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs. ACM MM 2018.

SLIDE 18
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 19


Soft Threshold Ternary Networks

Previous ternary problem:

[Figure: hard thresholds at ±Δ partition weights into {−1, 0, +1}]

From hard to soft threshold:

Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. “Soft Threshold Ternary Networks”, IJCAI, 2020.

Binary + Binary = Ternary
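The "Binary + Binary = Ternary" identity is easy to check. A toy sketch (not the paper's training scheme):

```python
import numpy as np

b1 = np.array([1, -1, 1, -1, 1])
b2 = np.array([1, 1, -1, -1, -1])

# Summing two {-1,+1} binary tensors yields values in {-2, 0, +2}:
# a ternary tensor up to a scale of 2, so a ternary layer can be run
# as two binary (XNOR/popcount-friendly) operations.
t = b1 + b2
assert set(t.tolist()) <= {-2, 0, 2}
```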

SLIDE 20


Soft Threshold Ternary Networks

  • Ternarize both weights and activations
  • Without the constraint of Δ
  • Soft threshold

SLIDE 21


Soft Threshold Ternary Networks

ImageNet Results:

SLIDE 22
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 23


One-hot Networks

To obtain a more efficient quantizer, reduce the bit-width:

  • INT-8: −128 ~ 127
  • INT-7: −64 ~ 63
  • INT-6: −32 ~ 31
  • INT-5: −16 ~ 15
  • INT-4: −8 ~ 7
  • INT-3: −4 ~ 3

Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE 2020

Or keep 8 bits (sign + 7 magnitude bits) and reduce the number of non-zero bits instead, for both activations and weights:

  • 7-hot: −128 ~ 127
  • 6-hot: −127 ~ 126
  • Two-hot: −96 ~ 96
  • One-hot: −64 ~ 64

SLIDE 24


One-hot Networks

One-hot weight (logarithmic):

  • Only one non-zero bit in weights
  • Multiplication → bit shift of the activation

[1] H. Tann, S. Hashemi, R. I. Bahar, S. Reda, “Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks”, DAC'17
[2] S. Sharify et al., “Laconic Deep Learning Inference Acceleration”, ISCA'19

One-hot weight + One-hot activation:

  • Only one non-zero bit in weights/activations
  • Multiplication → addition + encoding [2]
  • Effectual bits: exponent bits + sign bit; 8-bit → 3+1 bits
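The hardware claim can be sketched in integer arithmetic. Function names are illustrative, and real designs operate on encoded fixed-point values rather than Python ints:

```python
def onehot_mul(a, w_exp, w_sign):
    # One-hot weight +/- 2^w_exp: the multiply is a bit shift of the activation.
    p = a << w_exp
    return -p if w_sign else p

def onehot_onehot_mul(a_exp, a_sign, w_exp, w_sign):
    # When both operands are one-hot (+/- 2^e), the product is
    # +/- 2^(a_exp + w_exp): multiplication becomes exponent addition,
    # with the sign recovered by XOR of the two sign bits.
    sign = a_sign ^ w_sign
    mag = 1 << (a_exp + w_exp)
    return -mag if sign else mag

assert onehot_mul(5, 3, 0) == 40            # 5 * 2^3
assert onehot_onehot_mul(2, 0, 3, 1) == -32 # (2^2) * (-2^3)
```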

SLIDE 25


One-hot Networks

Baselines: 16/16 DaDianNao [1], 8/8 Laconic [2]
Platform: Xilinx ZC706 Dev Board, Vivado HLS 2018.2

[1] Y. Chen et al., “DaDianNao: A Machine-Learning Supercomputer,” MICRO'14 [2] S. Sharify et al., “Laconic Deep Learning Inference Acceleration,” ISCA'19

SLIDE 26
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 27


Bit-Split for Post-training Network Quantization

Training-aware quantization:
Pre-trained Model → Network Quantization → Finetune using data/labels

Post-training quantization:
Pre-trained Model → Network Quantization

  • Data-free
  • BP-free
  • Hyper-parameter free
  • Easy to use

Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML 2020.

SLIDE 28


Bit-Split for Post-training Network Quantization

Post-training quantization

Szymon Migacz. 8-bit Inference with TensorRT. GTC 2017

Calibration options: Min-Max, Min-Max with clip, or minimize the KL distance.

Problem:
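The weakness of plain Min-Max calibration can be reproduced in a few lines: one outlier stretches the range and wastes the quantization grid on everyone else. `quantize_sym` and the clip value are illustrative, not the TensorRT implementation:

```python
import numpy as np

def quantize_sym(x, n_bits, clip=None):
    # Symmetric linear quantizer; `clip` optionally shrinks the range so a
    # few outliers do not inflate the step size for all other values.
    r = np.abs(x).max() if clip is None else clip
    levels = 2 ** (n_bits - 1) - 1
    scale = r / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale

x = np.append(np.random.default_rng(0).normal(size=1000), 10.0)  # one outlier
err_minmax = np.mean((x - quantize_sym(x, 4)) ** 2)
err_clipped = np.mean((x - quantize_sym(x, 4, clip=3.0)) ** 2)
# Clipping the outlier costs it accuracy but lowers the overall error.
assert err_clipped < err_minmax
```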

SLIDE 29


Bit-Split for Post-training Network Quantization

Problem: Optimization:

SLIDE 30


Bit-Split for Post-training Network Quantization

Weight Quantization: Weight and Activation Quantization:
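The representation behind bit-split can be sketched as follows. `bit_split` is a hypothetical helper that only shows the decomposition q = Σₖ 2ᵏ·bₖ with ternary bₖ; the paper's per-bit optimization and stitching steps are not reproduced here:

```python
import numpy as np

def bit_split(q, n_bits):
    # One way to split signed integers in [-(2^(n-1)-1), 2^(n-1)-1] into
    # n-1 ternary "bit" tensors b_k in {-1, 0, +1} with q = sum_k 2^k * b_k:
    # distribute the sign over the binary expansion of the magnitude.
    sign, mag = np.sign(q), np.abs(q)
    return [sign * ((mag >> k) & 1) for k in range(n_bits - 1)]

q = np.array([-5, 0, 3, 7])        # 4-bit signed weights
bits = bit_split(q, 4)
# Stitching the bits back with their powers of two recovers q exactly.
assert np.array_equal(sum(2 ** k * b for k, b in enumerate(bits)), q)
```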

SLIDE 31

[1] Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng. “Sparsity-inducing Binarized Neural Networks”. AAAI, 2020.
[2] Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. “Soft Threshold Ternary Networks”. IJCAI, 2020.
[3] Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. “Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations”. DATE, 2020.
[4] Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. “Towards Accurate Post-training Network Quantization via Bit-Split and Stitching”. ICML, 2020.
[5] Fanrong Li, Zitao Mo, Peisong Wang, Zejian Liu, Jiayun Zhang, Gang Li, Qinghao Hu, Xiangyu He, Cong Leng, Yang Zhang and Jian Cheng. “A System-Level Solution for Low-Power Object Detection”. ICCV Workshop, 2019.
[6] Tianli Zhao, Xiangyu He, Jian Cheng. “BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs”. ACM MM, 2018.


References

SLIDE 32

Thanks for your attention.
