Ultra-low-bit Neural Network Quantization
Peisong Wang
Institute of Automation, Chinese Academy of Sciences
peisong.wang@nlpr.ia.ac.cn
2020.06.03
Collaborators: Weixiang Xu, Tianli Zhao, Fanrong Li, Xiangyu He, Gang Li, Jian Cheng, Cong Leng
From: Russ Salakhutdinov
Application scenarios: Face Unlock, Intelligent Robot, Intelligent Surveillance, Self-Driving Car, AR/VR
Challenges of deploying deep networks:
- Low inference speed
- Large memory/storage footprint
- High power consumption
Numeric formats (S = sign, E = exponent, M = mantissa/magnitude bits):
- FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits
- Int8: 1 sign bit, 7 magnitude bits
- Int4: 1 sign bit, 3 magnitude bits
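The formats above trade precision bits for range. As a rough sketch of how a float tensor is mapped into such a signed integer format, here is symmetric min-max quantization (function names and the scale-selection rule are illustrative; real toolchains choose scales more carefully):

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric uniform quantization: map floats to signed integers.

    Illustrative sketch only; the min-max scale used here is a
    simplification of what production frameworks do.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for Int8, 7 for Int4
    scale = np.max(np.abs(x)) / qmax        # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # map integers back to the float grid they represent
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.3, 1.0], dtype=np.float32)
q8, s8 = quantize_symmetric(x, num_bits=8)
x_hat = dequantize(q8, s8)
```

Dropping from 8 to 4 bits only changes `qmax` from 127 to 7, which is why the quantization error grows so quickly at ultra-low bit-widths.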
Mark Horowitz. Computing's Energy Problem. ISSCC 2014.
Scalar quantization, with or without constraints on the codebook. For a k-bit code:

Code    | Non-uniform  | Uniform  | Logarithmic
0…000   | D_0          | 0        | 0
0…001   | D_1          | 1        | 1
0…010   | D_2          | 2        | 2
0…011   | D_3          | 3        | 4
0…100   | D_4          | 4        | 8
0…101   | D_5          | 5        | 16
0…110   | D_6          | 6        | 32
…       | …            | …        | …
1…111   | D_(2^k − 1)  | 2^k − 1  | 2^(2^k − 2)
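The constrained codebooks in this table can be constructed directly; a small sketch (names are illustrative, and the non-uniform levels D_i are learned rather than fixed, so only the uniform and logarithmic codebooks are generated here):

```python
import numpy as np

def codebook(k, mode):
    """Return the 2^k reconstruction levels of a k-bit scalar quantizer.

    mode: 'uniform'     -> 0, 1, 2, ..., 2^k - 1
          'logarithmic' -> 0, 1, 2, 4, ..., 2^(2^k - 2)  (powers of two)
    """
    n = 2 ** k
    if mode == 'uniform':
        return np.arange(n, dtype=np.float64)
    if mode == 'logarithmic':
        return np.array([0.0] + [2.0 ** i for i in range(n - 1)])
    raise ValueError(mode)

def quantize_to_codebook(x, levels):
    # scalar quantization: assign each value to its nearest level
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]
```

The logarithmic codebook covers a much wider dynamic range with the same number of codes, at the cost of coarse resolution for large values.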
Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng, “Sparsity-inducing Binarized Neural Networks”, AAAI, 2020.
{−1, +1}
(b₁, b₂)
- Binarization at the zero-point
- Weights follow a roughly normal distribution
- Large quantization error

He Z, Fan D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation. CVPR 2019.

- Binarization at a threshold 𝜄
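A minimal sketch of the contrast, assuming the sparsity-inducing variant maps values below the threshold to 0 and the rest to +1 (the paper's exact scheme may differ; the function names and the threshold value are illustrative):

```python
import numpy as np

def binarize_at_zero(x):
    # conventional binarization: sign(x), values in {-1, +1}, no zeros
    return np.where(x >= 0, 1.0, -1.0)

def binarize_at_threshold(x, tau):
    # sparsity-inducing variant (sketch): values below the threshold tau
    # become 0, the rest +1, so a larger tau gives sparser activations
    return np.where(x >= tau, 1.0, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=10000)           # roughly N(0, 1), as on the slide
dense = binarize_at_zero(x)
sparse = binarize_at_threshold(x, tau=1.0)
sparsity = np.mean(sparse == 0)      # ~ P(x < 1) ≈ 0.84 for N(0, 1)
```

With a normal input distribution, moving the binarization point away from zero directly controls the fraction of zeros, which is what the mutual-information analysis below optimizes.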
- The mutual information J(y; ẑ) of two discrete random variables y and ẑ is defined as J(y; ẑ) = Σ_{y,ẑ} p(y, ẑ) log [ p(y, ẑ) / (p(y) p(ẑ)) ]
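For discrete variables given as a joint distribution table, this definition can be evaluated directly; a small illustrative sketch:

```python
import numpy as np

def mutual_information(p_joint):
    """I = sum_{y,z} p(y,z) * log( p(y,z) / (p(y) p(z)) )
    for two discrete variables, given their joint distribution as a 2D array."""
    py = p_joint.sum(axis=1, keepdims=True)   # marginal of y (rows)
    pz = p_joint.sum(axis=0, keepdims=True)   # marginal of z (columns)
    mask = p_joint > 0                        # 0 * log 0 contributes nothing
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (py * pz)[mask])))

# independent variables -> zero mutual information
p_indep = np.outer([0.5, 0.5], [0.3, 0.7])
# perfectly correlated binary variables -> I = H(y) = log 2
p_corr = np.array([[0.5, 0.0], [0.0, 0.5]])
```

For binary y and ẑ this quantity depends only on a few probabilities, which is what lets the slide reduce it to a function of q(y = 0).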
Mutual information can be formulated as a function of q(y = 0) = q.

Ablation study on the selection of the threshold on AlexNet.
Comparison with 2-bit methods on AlexNet and ResNet-18:
- We extend our method to other network structures
- Without bells and whistles
Tianli Zhao, Xiangyu He, Jian Cheng. BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs. ACM MM 2018.
Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. “Soft Threshold Ternary Networks”, IJCAI, 2020.
Binary + Binary = Ternary
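The identity is easy to check numerically: the sum of two binary tensors with values in {−1, +1} always lands in {−2, 0, +2}, i.e. a scaled ternary code (a sketch; scaling by 1/2 recovers the usual {−1, 0, +1} levels):

```python
import numpy as np

# Two parallel binary branches whose sum is ternary: adding b1 and b2,
# each in {-1, +1}, can only produce -2, 0, or +2.
rng = np.random.default_rng(1)
b1 = rng.choice([-1.0, 1.0], size=1000)
b2 = rng.choice([-1.0, 1.0], size=1000)
ternary = b1 + b2
levels = np.unique(ternary)          # subset of {-2, 0, 2}
```

This is why a ternary network can reuse binary compute kernels: each ternary layer decomposes into two binary ones whose outputs are summed.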
Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE 2020.

[Figure: bit-width (signed INT-3 up to INT-8) versus the number of non-zero bits per value, from one-hot and two-hot up to 6-hot and 7-hot encodings.]
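An n-hot constraint restricts each quantized value to at most n non-zero bits, so a one-hot weight is a single power of two and multiplication degenerates to a shift. A sketch of one way to impose the constraint, keeping the n most-significant set bits (illustrative only; the paper's actual quantizer may differ):

```python
import numpy as np

def n_hot_quantize(q, n):
    """Keep only the n most-significant set bits of each non-negative
    integer. With n=1 every value becomes a single power of two
    ('one-hot'), so a multiply reduces to a bit shift."""
    out = np.zeros_like(q)
    for i, v in enumerate(q):
        v = int(v)
        kept = 0
        bit = v.bit_length() - 1
        while bit >= 0 and kept < n:
            if v & (1 << bit):       # this bit is set: keep it
                out[i] |= 1 << bit
                kept += 1
            bit -= 1                 # move toward less-significant bits
    return out

q = np.array([0, 3, 100, 127])
one_hot = n_hot_quantize(q, 1)   # -> [0, 2, 64, 64]
two_hot = n_hot_quantize(q, 2)   # -> [0, 3, 96, 96]
```

Increasing n trades hardware cost (one adder tree lane per "hot" bit) for accuracy, which is the trade-off the figure above illustrates.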
[1] H. Tann, S. Hashemi, R. I. Bahar, S. Reda. Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks. DAC 2017.
[2] S. Sharify et al. Laconic Deep Learning Inference Acceleration. ISCA 2019.
[1] Y. Chen et al. DaDianNao: A Machine-Learning Supercomputer. MICRO 2014.
[2] S. Sharify et al. Laconic Deep Learning Inference Acceleration. ISCA 2019.
Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML 2020.
Training-based quantization: Pre-trained Model → Network Quantization → Finetune using data/labels
Post-training quantization: Pre-trained Model → Network Quantization
- Data-free
- BP-free
- Hyper-parameter free
- Easy to use
Szymon Migacz. 8-bit Inference with TensorRT. GTC 2017
Calibration strategies: Min-Max, Min-Max with clipping, or minimizing the KL distance between the original and the quantized distributions.
Problem:
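The gap between plain min-max and clipped calibration is easy to see on a distribution with outliers; a sketch with a simple percentile clip standing in for the KL-based threshold search (function names and the percentile value are illustrative):

```python
import numpy as np

def scale_minmax(x, num_bits=8):
    # plain min-max calibration: the scale must cover every outlier
    return np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)

def scale_clipped(x, num_bits=8, percentile=99.9):
    # clip the long tail before computing the scale; a few outliers
    # saturate, but the bulk of the values get much finer resolution
    t = np.percentile(np.abs(x), percentile)
    return t / (2 ** (num_bits - 1) - 1)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=10000), [50.0]])   # one large outlier
s_mm = scale_minmax(x)        # step size dominated by the outlier
s_cl = scale_clipped(x)       # far smaller step size for the bulk
```

The KL-based approach automates the choice of the clipping threshold per layer instead of fixing a percentile by hand.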
Problem: Optimization:
Weight Quantization:

Weight and Activation Quantization:
[1] Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao, Jian Cheng. Sparsity-inducing Binarized Neural Networks. AAAI 2020.
[2] Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang, Jian Cheng. Soft Threshold Ternary Networks. IJCAI 2020.
[3] Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE 2020.
[4] Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML 2020.
[5] Fanrong Li, Zitao Mo, Peisong Wang, Zejian Liu, Jiayun Zhang, Gang Li, Qinghao Hu, Xiangyu He, Cong Leng, Yang Zhang, Jian Cheng. A System-Level Solution for Low-Power Object Detection. ICCV Workshop 2019.
[6] Tianli Zhao, Xiangyu He, Jian Cheng. BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs. ACM MM 2018.