SLIDE 1

Ultra-low-bit Neural Network Quantization

Peisong Wang

Institute of Automation, Chinese Academy of Sciences

peisong.wang@nlpr.ia.ac.cn

2020.06.03

Collaborators: Weixiang Xu, Tianli Zhao, Fanrong Li, Xiangyu He, Gang Li, Jian Cheng, Cong Leng

SLIDE 2

From: Russ Salakhutdinov


Background: Deep Learning

SLIDE 3

Classification, Detection, Segmentation: Convolutional Neural Networks


Background: Application of CNNs

SLIDE 4


Train ResNet-50, from several days down to:

  • Facebook: 1 hour
  • Fast.ai: 18 min
  • Tencent: 6.6 min
  • Sony: 3.7 min
  • Google: 2.2 min
  • SenseTime: 1.5 min

Background: Training

SLIDE 5

Face Unlock, Intelligent Robot, Intelligent Surveillance, Self-Driving Car, AR/VR


Background: Real World Applications

  • Low inference speed
  • Large memory/storage
  • High power consumption

SLIDE 6
  • Low-rank Decomposition
  • Sparse/Pruning
  • Quantization
  • Knowledge Distillation
  • ……


Network Acceleration and Compression

SLIDE 7

S: sign, E: exponent, M: mantissa

FP32: S (1 bit) | E (8 bits) | M (23 bits)

value = (−1)^S × 1.M × 2^(E−127)

Int8: S (1 bit) | M (7 bits)
Int4: S (1 bit) | M (3 bits)

value = (−1)^S × M

Fixed-point representation
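The two formulas above can be checked in a few lines. This is a minimal sketch for normalized FP32 numbers only (no subnormals, infinities, or NaNs); `fp32_fields` and `fp32_value` are illustrative helper names, not part of any slide.

```python
import struct

def fp32_fields(x):
    # Reinterpret the float's 32 bits: sign (1) | exponent (8) | mantissa (23).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp32_value(sign, exponent, mantissa):
    # Normalized-number formula from the slide: (-1)^S * 1.M * 2^(E - 127).
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)

s, e, m = fp32_fields(-6.5)   # -6.5 = (-1)^1 * 1.625 * 2^2
assert (s, e, m) == (1, 129, 5242880)
assert fp32_value(s, e, m) == -6.5
```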

SLIDE 8
  • Saving memory
  • Saving energy
  • Saving time
  • Saving area

Mark Horowitz, Computing’s Energy Problem. ISSCC 2014.


Why Fixed-point quantization?

SLIDE 9

An n-bit code gives 2ⁿ values: 000…000 ~ 111…111

  Code      Non-uniform    Uniform    Logarithmic
  0…000     c₀             0          —
  0…001     c₁             1          1
  0…010     c₂             2          2
  0…011     c₃             3          4
  0…100     c₄             4          8
  0…101     c₅             5          16
  0…110     c₆             6          32
  …         …              …          …
  1…111     c₂ⁿ₋₁          2ⁿ − 1     2^(2ⁿ−1)

Non-uniform, uniform, and logarithmic quantization: scalar quantization with/without constraints.

Type of quantization
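The uniform and logarithmic columns of the table can be sketched as simple quantizers. Function names and the `scale` parameter are illustrative; the non-uniform case needs a learned codebook and is omitted.

```python
import numpy as np

def uniform_quantize(x, n_bits, scale):
    # Uniform levels k * scale, k = 0 .. 2^n - 1 (unsigned, as in the table).
    q = np.clip(np.round(x / scale), 0, 2 ** n_bits - 1)
    return q * scale

def log_quantize(x, n_bits):
    # Logarithmic levels +/- 2^k, k = 0 .. 2^n - 1: round log2|x| to an integer.
    k = np.clip(np.round(np.log2(np.maximum(np.abs(x), 1e-12))), 0, 2 ** n_bits - 1)
    return np.sign(x) * 2.0 ** k

# Logarithmic levels grow exponentially, so large values snap to powers of two.
print(log_quantize(np.array([5.0, -13.0]), 3))
```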

SLIDE 10
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 11
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 12

Binary: Sparsity-inducing BNN


Binary == {−1, +1}

[Figure: binary weight matrices with all entries in {−1, +1}]

Binary = two states (b₁, b₂). Which two states to use?

Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng, “Sparsity-inducing Binarized Neural Networks”, AAAI, 2020.

Previous binary approach:

SLIDE 13

Sparsity-inducing BNN


How to accelerate a BNN with 0/1 activations?

(−1, +1) → (b₁, b₂): reparameterization with an affine transformation
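The affine reparameterization can be verified numerically. This is a toy sketch with hand-picked vectors, not the paper's training procedure:

```python
import numpy as np

w = np.array([1, -1, 1, 1, -1, -1, 1, -1])   # binary weights in {-1, +1}
b = np.array([1, 0, 1, 1, 0, 1, 0, 0])       # sparse 0/1 activations

# Affine view: a = 2b - 1 maps {0, 1} back to {-1, +1}, so
#   w . b = (w . a + sum(w)) / 2
# i.e. the 0/1 dot product equals a {-1,+1} dot product plus a per-filter
# constant, while zeros in b let whole multiply-accumulates be skipped.
a = 2 * b - 1
assert w @ b == (w @ a + w.sum()) / 2
```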

SLIDE 14

Sparsity-inducing BNN


How to determine the threshold of 0/1 binarization?

  • Binarization at the zero-point: weights/activations roughly follow a normal distribution, giving large quantization error.

He Z, Fan D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation. CVPR 2019.

  • Binarization at a threshold τ: the mutual information I(z; ẑ) of two discrete random variables z and ẑ can be defined as

    I(z; ẑ) = Σ_{z, ẑ} p(z, ẑ) log [ p(z, ẑ) / (p(z) p(ẑ)) ]
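For a deterministic threshold, ẑ is a function of z, so I(z; ẑ) reduces to the entropy H(ẑ), which depends only on q = p(ẑ = 0). A small sketch (the paper's actual threshold selection is not reproduced here):

```python
import math

def binary_entropy(q):
    # H(q) = -q log2 q - (1 - q) log2 (1 - q): for deterministic
    # binarization z_hat = 1[z > tau], I(z; z_hat) = H(z_hat) = H(q),
    # a function of q = p(z_hat = 0) alone.
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

# Entropy peaks at q = 0.5 and falls off as the code gets sparser.
assert binary_entropy(0.5) == 1.0
assert binary_entropy(0.9) < binary_entropy(0.5)
```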

SLIDE 15

Sparsity-inducing BNN


Mutual information can be formulated as a function of q = p(ẑ = 0).

[Figure: ablation study on the selection of the threshold, on AlexNet]

How to determine the threshold of 0/1 binarization?

SLIDE 16


Sparsity-inducing BNN

[Tables: comparison with 2-bit methods on AlexNet and ResNet-18]

  • Extends to other network structures
  • Without bells and whistles

Experiments:

SLIDE 17

Sparsity-inducing BNN


Run-time speedup:

Tianli Zhao, Xiangyu He, Jian Cheng. BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs. ACM MM 2018.

SLIDE 18
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 19


Soft Threshold Ternary Networks

Previous ternary problem:

[Figure: hard thresholds at ±Δ partition weights into {−1, 0, +1}]

From hard to soft threshold:

Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. “Soft Threshold Ternary Networks”, IJCAI, 2020.

Binary + Binary = Ternary
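The "Binary + Binary = Ternary" identity is easy to check. A toy sketch (not the paper's training scheme):

```python
import numpy as np

b1 = np.array([1, -1, 1, -1, 1])
b2 = np.array([1, 1, -1, -1, -1])

# Summing two {-1,+1} binary tensors yields values in {-2, 0, +2}:
# a ternary tensor up to a scale of 2, so a ternary layer can be run
# as two binary (XNOR/popcount-friendly) operations.
t = b1 + b2
assert set(t.tolist()) <= {-2, 0, 2}
```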

SLIDE 20


Soft Threshold Ternary Networks

  • Ternarize both weights and activations
  • Without the constraint of Δ
  • Soft threshold

SLIDE 21


Soft Threshold Ternary Networks

ImageNet Results:

SLIDE 22
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 23


One-hot Networks

To obtain a more efficient quantizer, reduce the bit-width:

  • INT-8: −128 ~ 127
  • INT-7: −64 ~ 63
  • INT-6: −32 ~ 31
  • INT-5: −16 ~ 15
  • INT-4: −8 ~ 7
  • INT-3: −4 ~ 3

Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE 2020

Or keep 8 bits (sign + 7 magnitude bits) and reduce the number of non-zero bits instead, for both activations and weights:

  • 7-hot: −128 ~ 127
  • 6-hot: −127 ~ 126
  • Two-hot: −96 ~ 96
  • One-hot: −64 ~ 64

SLIDE 24


One-hot Networks

One-hot weight (logarithmic):

  • Only one non-zero bit in weights
  • Multiplication → bit shift of the activation

[1] H. Tann, S. Hashemi, R. I. Bahar, S. Reda, “Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks”, DAC'17
[2] S. Sharify et al., “Laconic Deep Learning Inference Acceleration”, ISCA'19

One-hot weight + One-hot activation:

  • Only one non-zero bit in weights/activations
  • Multiplication → addition + encoding [2]
  • Effectual bits: exponent bits + sign bit; 8-bit → 3+1 bits
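The hardware claim can be sketched in integer arithmetic. Function names are illustrative, and real designs operate on encoded fixed-point values rather than Python ints:

```python
def onehot_mul(a, w_exp, w_sign):
    # One-hot weight +/- 2^w_exp: the multiply is a bit shift of the activation.
    p = a << w_exp
    return -p if w_sign else p

def onehot_onehot_mul(a_exp, a_sign, w_exp, w_sign):
    # When both operands are one-hot (+/- 2^e), the product is
    # +/- 2^(a_exp + w_exp): multiplication becomes exponent addition,
    # with the sign recovered by XOR of the two sign bits.
    sign = a_sign ^ w_sign
    mag = 1 << (a_exp + w_exp)
    return -mag if sign else mag

assert onehot_mul(5, 3, 0) == 40            # 5 * 2^3
assert onehot_onehot_mul(2, 0, 3, 1) == -32 # (2^2) * (-2^3)
```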

SLIDE 25


One-hot Networks

Baselines: 16/16 DaDianNao [1], 8/8 Laconic [2]
Platform: Xilinx ZC706 Dev Board, Vivado HLS 2018.2

[1] Y. Chen et al., “DaDianNao: A Machine-Learning Supercomputer,” MICRO'14 [2] S. Sharify et al., “Laconic Deep Learning Inference Acceleration,” ISCA'19

SLIDE 26
  • Sparsity-inducing Binarized Neural Networks. AAAI, 2020.
  • Soft Threshold Ternary Networks. IJCAI, 2020.
  • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020.
  • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.


Contents

SLIDE 27


Bit-Split for Post-training Network Quantization

Training-aware quantization:
Pre-trained Model → Network Quantization → Finetune using data/labels

Post-training quantization:
Pre-trained Model → Network Quantization

  • Data-free
  • BP-free
  • Hyper-parameter free
  • Easy to use

Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML 2020.

SLIDE 28


Bit-Split for Post-training Network Quantization

Post-training quantization

Szymon Migacz. 8-bit Inference with TensorRT. GTC 2017

Calibration options: Min-Max, Min-Max with clip, or minimize the KL distance.

Problem:
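The weakness of plain Min-Max calibration can be reproduced in a few lines: one outlier stretches the range and wastes the quantization grid on everyone else. `quantize_sym` and the clip value are illustrative, not the TensorRT implementation:

```python
import numpy as np

def quantize_sym(x, n_bits, clip=None):
    # Symmetric linear quantizer; `clip` optionally shrinks the range so a
    # few outliers do not inflate the step size for all other values.
    r = np.abs(x).max() if clip is None else clip
    levels = 2 ** (n_bits - 1) - 1
    scale = r / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale

x = np.append(np.random.default_rng(0).normal(size=1000), 10.0)  # one outlier
err_minmax = np.mean((x - quantize_sym(x, 4)) ** 2)
err_clipped = np.mean((x - quantize_sym(x, 4, clip=3.0)) ** 2)
# Clipping the outlier costs it accuracy but lowers the overall error.
assert err_clipped < err_minmax
```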

SLIDE 29


Bit-Split for Post-training Network Quantization

Problem: Optimization:

SLIDE 30


Bit-Split for Post-training Network Quantization

Weight Quantization: Weight and Activation Quantization:
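The representation behind bit-split can be sketched as follows. `bit_split` is a hypothetical helper that only shows the decomposition q = Σₖ 2ᵏ·bₖ with ternary bₖ; the paper's per-bit optimization and stitching steps are not reproduced here:

```python
import numpy as np

def bit_split(q, n_bits):
    # One way to split signed integers in [-(2^(n-1)-1), 2^(n-1)-1] into
    # n-1 ternary "bit" tensors b_k in {-1, 0, +1} with q = sum_k 2^k * b_k:
    # distribute the sign over the binary expansion of the magnitude.
    sign, mag = np.sign(q), np.abs(q)
    return [sign * ((mag >> k) & 1) for k in range(n_bits - 1)]

q = np.array([-5, 0, 3, 7])        # 4-bit signed weights
bits = bit_split(q, 4)
# Stitching the bits back with their powers of two recovers q exactly.
assert np.array_equal(sum(2 ** k * b for k, b in enumerate(bits)), q)
```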

SLIDE 31

[1] Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng. “Sparsity-inducing Binarized Neural Networks”. AAAI, 2020.
[2] Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. “Soft Threshold Ternary Networks”. IJCAI, 2020.
[3] Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. “Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations”. DATE, 2020.
[4] Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. “Towards Accurate Post-training Network Quantization via Bit-Split and Stitching”. ICML, 2020.
[5] Fanrong Li, Zitao Mo, Peisong Wang, Zejian Liu, Jiayun Zhang, Gang Li, Qinghao Hu, Xiangyu He, Cong Leng, Yang Zhang and Jian Cheng. “A System-Level Solution for Low-Power Object Detection”. ICCV Workshop, 2019.
[6] Tianli Zhao, Xiangyu He, Jian Cheng. “BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs”. ACM MM, 2018.


References

SLIDE 32

Thanks for your attention.
