

SLIDE 1

Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures

Qingchang Han1,2, Yongmin Hu1, Fengwei Yu2, Hailong Yang1, Bing Liu2, Peng Hu1,2, Ruihao Gong1,2, Yanfei Wang2, Rui Wang1, Zhongzhi Luan1, Depei Qian1
1 School of Computer Science and Engineering, Beihang University, Beijing, China
2 SenseTime Research

SLIDE 2

Outline

◼ Background & Motivation
  ◼ CNN & Quantized Neural Network
  ◼ Low-bit Computation on Modern Computer Architectures

◼ Optimization Methods
  ◼ Low-bit Convolution on ARM CPU
  ◼ Low-bit Convolution on NVIDIA GPU

◼ Evaluation
  ◼ Experiment Setup
  ◼ Performance Analysis

◼ Conclusion

SLIDE 3

Outline

◼ Background & Motivation
  ◼ CNN & Quantized Neural Network
  ◼ Low-bit Computation on Modern Computer Architectures

◼ Optimization Methods
  ◼ Low-bit Convolution on ARM CPU
  ◼ Low-bit Convolution on NVIDIA GPU

◼ Evaluation
  ◼ Experiment Setup
  ◼ Performance Analysis

◼ Conclusion

SLIDE 4

Convolutional Neural Network

[Figure: CNN application domains (computer vision, autonomous driving, recommendation systems, speech recognition) and a typical CNN pipeline: Input → Convolution → Pooling → Convolution → Pooling → Flatten → FC → Output]

The computational complexity and memory footprint of CNNs need to be optimized.

Convolution layers account for 90%∼99% of the computation and runtime [Chen et al., ISSCC'16].

SLIDE 5

Model Compression

◼ Model compression reduces computational complexity while maintaining acceptable accuracy
  ◼ Network Pruning
  ◼ Model Quantization

◼ Model Quantization
  ◼ Maps data to a smaller set of numerical representations
  ◼ Improves performance and reduces memory footprint while preserving accuracy

◼ Example: int8 Conv2d quantization

[Figure: FP32 layout (1-bit sign, 8-bit exponent, 23-bit mantissa) vs. INT8 layout (1-bit sign, 7-bit mantissa), connected by quantize/dequantize steps]

x_int = round(x_f / scale)
x_q = clip(x_int, −128, 127)
x_f ≈ scale × x_q
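As an illustration, a minimal C++ sketch of this symmetric, scale-only int8 mapping (the paper's exact quantization scheme may differ in details such as zero-points or per-channel scales):

#include <algorithm>
#include <cmath>
#include <cstdint>

// quantize: x_int = round(x_f / scale), x_q = clip(x_int, -128, 127)
int8_t quantize(float x_f, float scale) {
    int x_int = (int)std::lround(x_f / scale);
    return (int8_t)std::clamp(x_int, -128, 127);
}

// dequantize: x_f ≈ scale * x_q
float dequantize(int8_t x_q, float scale) {
    return scale * (float)x_q;
}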

SLIDE 6

Accuracy of Quantized Neural Network

◼ Recent works have demonstrated the accuracy of quantized neural networks
  ◼ 8-bit quantized models can almost reach the same accuracy as their full-precision counterparts
  ◼ Lower-bit quantized models (e.g., 2∼4-bit) lose only a small amount of accuracy compared to the full-precision ones

◼ However, achieving the optimal performance of QNNs across different computer architectures is challenging and less studied in the literature

Accuracy Comparison of Low-bit QNNs on ImageNet [Esser et al., ICLR’20]

SLIDE 7

The Target Architectures for Optimization

◼ Most widely used architectures for CNN inference
  ◼ Edge devices: ARM CPU
  ◼ Cloud accelerators: NVIDIA GPU

◼ Both provide architectural support for low-bit arithmetic instructions
  ◼ ARM CPU: MLA/SMLAL
  ◼ NVIDIA GPU: dp4a / mma (Tensor Core)

[Figures: cumulative shipments of ARM-based chips to date; market share of cloud accelerator types]

SLIDE 8

Low-bit Computation Support on ARM CPU

◼ Low-bit arithmetic instructions in the ARMv8.1 architecture

[Figure: instruction semantics.
MLA (multiply-accumulate): 16×8-bit × 16×8-bit → 16×8-bit, products accumulated in 8-bit lanes.
SMLAL (signed multiply-accumulate long): 16×8-bit × 16×8-bit → 8×16-bit, widened products accumulated in 16-bit lanes.
SADDW (signed add wide): widens 8×16-bit partial sums into 4×32-bit accumulators.]
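For reference, these instructions map onto standard ACLE NEON intrinsics; a minimal sketch of their semantics (illustrative only, not the paper's kernel code):

#include <arm_neon.h>

// MLA: 8-bit multiply-accumulate, results stay in 8-bit lanes
int8x16_t mla_example(int8x16_t acc8, int8x16_t a, int8x16_t b) {
    return vmlaq_s8(acc8, a, b);
}

// SMLAL: widening multiply-accumulate, 8-bit x 8-bit -> 16-bit lanes
int16x8_t smlal_example(int16x8_t acc16, int8x16_t a, int8x16_t b) {
    acc16 = vmlal_s8(acc16, vget_low_s8(a), vget_low_s8(b));   // low half
    return vmlal_high_s8(acc16, a, b);                         // high half (SMLAL2)
}

// SADDW: widen 16-bit partial sums into 32-bit accumulators
int32x4_t saddw_example(int32x4_t acc32, int16x8_t partial) {
    acc32 = vaddw_s16(acc32, vget_low_s16(partial));           // low half
    return vaddw_high_s16(acc32, partial);                     // high half (SADDW2)
}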

SLIDE 9

Low-bit Computation Support on NVIDIA GPU

[Figure: Turing SM layout, four processing blocks each containing CUDA cores, Tensor Cores, a register file and a warp scheduler, sharing an L1 data cache / shared memory]

◼ Tensor Core
  ◼ Natively supports mixed-precision GEMM (INT8/INT4 inputs with INT32 accumulation)
  ◼ INT8/INT4/INT1 supported on Turing Tensor Cores
  ◼ Powerful inference performance
    ◼ RTX 2080 Ti delivers up to 215.2 TOPS of INT8 inference performance

◼ Using Tensor Cores (see the sketch below)
  ◼ WMMA API
  ◼ PTX mma instructions (e.g., mma.m8n8k16)
  ◼ Vendor libraries: cuBLAS/cuDNN (fp16 only for now)
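As a minimal illustration of the WMMA path (not the paper's kernel), one warp can compute a 16×16×16 int8 tile as follows; the tile shape and matrix layouts are assumptions:

#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 int8 GEMM tile on Tensor Cores:
// int8 x int8 inputs accumulated into int32.
__global__ void wmma_s8_tile(const signed char *A, const signed char *B, int *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c;
    wmma::fill_fragment(c, 0);
    wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}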

SLIDE 10

Existing Framework/Library Supporting Low-bit Conv2d

◼ There is no public work that supports extremely low-bit convolution covering a wide range of bit widths on ARM CPU (2∼8-bit) and NVIDIA GPU (4-bit/8-bit)

◼ This missing support motivates us to provide efficient implementations on ARM CPU and NVIDIA GPU

ARM CPU
  ◼ ncnn: 8-bit Conv2d (GEMM-based & Winograd)
  ◼ QNNPACK: 8-bit Conv2d (indirect convolution)
  ◼ TFLite: 8-bit Conv2d
  ◼ TVM: 1/2-bit Conv2d (popcount), 8-bit Conv2d (spatial pack)

NVIDIA GPU
  ◼ cuDNN: 8-bit Conv2d (dp4a), 16-bit Conv2d (Tensor Core)
  ◼ TensorRT: 8-bit Conv2d (Tensor Core)
  ◼ CUTLASS: 1/4/8-bit GEMM (Tensor Core)

SLIDE 11

Outline

◼ Background & Motivation
  ◼ CNN & Quantized Neural Network
  ◼ Low-bit Computation on Modern Computer Architectures

◼ Optimization Methods
  ◼ Low-bit Convolution on ARM CPU
  ◼ Low-bit Convolution on NVIDIA GPU

◼ Evaluation
  ◼ Experiment Setup
  ◼ Performance Analysis

◼ Conclusion

SLIDE 12

Re-designing GEMM Computation on ARM CPU

[Figure: re-designed GEMM micro-kernel. Buffer A, Buffer B and Buffer C reside in registers; Matrix A, Matrix B and Matrix C reside in memory. Steps ①∼④ below.]

◼ Re-designed GEMM micro-kernel (a sketch follows)
  1. Load one column of Matrix A into Buffer A
  2. Load one row of Matrix B into Buffer B, and replicate it into each row of Buffer B
  3. Perform element-wise multiplication between Buffer A and each column-vector of Buffer B, and accumulate the results into Buffer C
  4. After all the calculations are done, copy the data of Buffer C into Matrix C
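A scalar C sketch of one micro-kernel step (the tile sizes MR/NR are illustrative assumptions; the real kernel vectorizes this loop with the instructions described earlier):

#include <stdint.h>

#define MR 8   /* rows of Buffer A (illustrative) */
#define NR 8   /* columns of Buffer B (illustrative) */

/* One k-step of C += A * B: colA is a column of Matrix A (Buffer A),
   rowB is a row of Matrix B replicated into Buffer B. */
void microkernel_step(const int8_t colA[MR], const int8_t rowB[NR],
                      int32_t bufC[MR][NR]) {
    for (int j = 0; j < NR; ++j)        /* each column-vector of Buffer B */
        for (int i = 0; i < MR; ++i)    /* element-wise multiply with Buffer A */
            bufC[i][j] += (int32_t)colA[i] * (int32_t)rowB[j];
}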

SLIDE 13


Re-designing GEMM Computation on ARM CPU

◼ Data padding and packing optimization
  ◼ Perform zero-padding when a dimension of the data is not a multiple of the required dimension
  ◼ Perform data packing to enable contiguous data access (a sketch follows)

[Figure: Matrix A and Matrix B are zero-padded and packed into Packed Matrix A and Packed Matrix B so that the micro-kernel reads contiguous, aligned data]
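A minimal C sketch of the packing step for one column panel (the layout and names are illustrative assumptions, not the paper's exact routine):

#include <stdint.h>

/* Pack column `col` of an m-row row-major matrix A into a contiguous buffer,
   zero-padding the tail so the packed length m_padded is a multiple of the
   vector width required by the micro-kernel. */
void pack_col(const int8_t *A, int lda, int m, int m_padded, int col,
              int8_t *packed) {
    for (int i = 0; i < m_padded; ++i)
        packed[i] = (i < m) ? A[i * lda + col] : 0;  /* zero-pad beyond m */
}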

SLIDE 14

Instruction and Register Allocation Optimization on ARM CPU

◼ Optimized instruction schemes for GEMM (a sketch follows the figure)
  ◼ For 4∼8-bit GEMM, we choose the SMLAL and SADDW instructions
  ◼ For 2∼3-bit GEMM, we choose the MLA and SADDW instructions

[Figure: instruction schemes.
4∼8-bit: ① SMLAL accumulates products into 8×16-bit lanes until overflow becomes possible, ② SADDW flushes the 16-bit partial sums into 4×32-bit accumulators.
2∼3-bit: ① MLA accumulates products into 16×8-bit lanes until overflow becomes possible, ② an 8-bit SADDW widens them into 16-bit partial sums, ③ a 16-bit SADDW flushes into 4×32-bit accumulators.]
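A hedged NEON sketch of the 4∼8-bit scheme's "accumulate until overflow" idea (the flush period and flat-array loop are illustrative; the paper's kernel works on register tiles). With b-bit operands the largest product magnitude is 2^(2b−2), so a 16-bit lane can safely absorb about 32767 / 2^(2b−2) SMLAL results between SADDW flushes:

#include <arm_neon.h>
#include <stdint.h>

int32x4_t dot_lowbit(const int8_t *a, const int8_t *b, int k, int bits) {
    /* number of SMLALs a 16-bit lane can safely absorb for b-bit inputs */
    const int flush_period = 32767 >> (2 * bits - 2);
    int32x4_t acc32 = vdupq_n_s32(0);
    int i = 0;
    while (i < k) {
        int16x8_t acc16 = vdupq_n_s16(0);
        for (int j = 0; j < flush_period && i < k; ++j, i += 8)
            acc16 = vmlal_s8(acc16, vld1_s8(a + i), vld1_s8(b + i)); /* SMLAL */
        /* SADDW: flush 16-bit partial sums into 32-bit accumulators */
        acc32 = vaddw_s16(acc32, vget_low_s16(acc16));
        acc32 = vaddw_high_s16(acc32, acc16);
    }
    return acc32;
}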

SLIDE 15

Instruction and Register Allocation Optimization on ARM CPU

◼ Register allocation optimization
  ◼ Separate allocation schemes for 4∼8-bit and 2∼3-bit input data
  ◼ Buffer A and Buffer B are double-buffered to overlap loads with computation

[Figure: register allocation. 4∼8-bit (SMLAL) scheme: Buffer A in v0/v1 (double-buffered), Buffer B in v2∼v5/v6∼v9 (double-buffered), 16-bit temporary results in v10∼v17, 32-bit Buffer C in v18∼v31, addresses in x0∼x3. 2∼3-bit (MLA) scheme: Buffer A in v0∼v3, Buffer B in v4∼v7, 8-bit temporary results in v8∼v11, 16-bit temporary results in v12∼v19 (8-bit SADDW), 32-bit Buffer C in v20∼v31 (16-bit SADDW), addresses in x0∼x7.]

SLIDE 16

Winograd Optimization on ARM CPU

◼ Winograd method
  ◼ Achieves acceleration by reducing the number of multiplications
  ◼ Converts the convolution computation to the following form:
      Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)]A

◼ Apply F(2x2, 3x3) to 4∼6-bit convolution
  ◼ Ensure the transformed data stays within 8-bit precision
  ◼ F(2x2, 3x3): inputs of no more than 6 bits keep the transformed data within 8 bits
  ◼ F(4x4, 3x3): unacceptable increase of the numerical range

◼ What about 2∼3-bit convolution?
  ◼ The maximum theoretical speedup of F(2x2, 3x3) is 2.25×; however, the MLA instruction used by the direct 2∼3-bit kernels is 2× faster than the SMLAL instruction that the widened Winograd-transformed data requires
  ◼ This largely offsets the performance advantage of the Winograd method
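A worked sketch of the bit-range argument for F(2x2, 3x3), derived here from the standard transform matrices rather than taken verbatim from the paper:

% Every entry of the F(2x2, 3x3) input-transform matrix B^T lies in {0, +1, -1}:
\[
B^\top =
\begin{pmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 1 & 0\\
0 & -1 & 1 & 0\\
0 & 1 & 0 & -1
\end{pmatrix}
\]
% Each element of B^T d B is therefore a signed sum of at most four b-bit
% inputs, so its magnitude is bounded by
\[
\left| (B^\top d B)_{ij} \right| \;\le\; 4 \cdot 2^{\,b-1} \;=\; 2^{\,b+1},
\]
% i.e. the transform adds two bits, and 6-bit inputs stay within int8 range.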

For more details, please refer to our paper.

SLIDE 17

Implicit-precomp GEMM Method on GPU

◼ Implicit GEMM
  ◼ Avoids explicit matrix transformation in global memory, reducing the memory footprint

◼ Precomputed buffer
  ◼ Stores the offsets of the input elements in a precomputed buffer (a sketch follows)

[Figure: implicit im2col. Matrix A (M×K) is never materialized; a precomputed buffer holds the offsets of its elements in the input tensor (N×IH×IW×IC). M = N×OH×OW, K = KH×KW×IC, N = OC.]
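A host-side C++ sketch of how such an offset buffer could be built for an NHWC input (the relative-offset convention and unit stride/dilation are my assumptions):

#include <vector>

// For each column k = (kh, kw, ic) of the implicit im2col matrix, store the
// element offset into the NHWC input relative to the first input element of
// an output pixel; the kernel adds this to each pixel's base offset.
std::vector<int> build_offset_buffer(int IW, int IC, int KH, int KW) {
    std::vector<int> offsets(KH * KW * IC);
    for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw)
            for (int ic = 0; ic < IC; ++ic)
                offsets[(kh * KW + kw) * IC + ic] = (kh * IW + kw) * IC + ic;
    return offsets;
}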

SLIDE 18

Data Partition along the Thread Hierarchy on GPU

(a) Grid level
◼ Divide matrices A, B and C into tiles of size MTile, NTile and KTile (M = N×OH×OW, K = KH×KW×IC, N = OC)

[Figure: data partition along the thread hierarchy. (a) Grid level: Matrix A, B and C tiles in global memory (A_Tile MTile×KTile, B_Tile KTile×NTile, C_Tile MTile×NTile). (b) Block level: A_Tile/B_Tile staged in shared memory, C_Tile in registers, divided into MFrag/NFrag fragments with the K dimension stepped by KStep. (c) Warp level: A_Fragment/B_Fragment/C_Fragment in registers.]

SLIDE 19

Data Partition along Thread Hierarchy on GPU


(b) Block level
◼ Divide C_Tile, A_Tile and B_Tile into fragments by blockRowWarpNum and blockColWarpNum
◼ Split the KTile loop by KStep

SLIDE 20

Data Partition along Thread Hierarchy on GPU


(c) Warp level
◼ Call Tensor Cores through mma instructions to perform the matrix multiplication (a hedged sketch follows)
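For illustration, a single Turing int8 Tensor Core operation issued through inline PTX; per the PTX ISA, m8n8k16 packs four s8 values per 32-bit register for A and B, with two s32 accumulator registers. This is a minimal sketch, not the paper's exact code:

// D = A * B + C across one warp, int8 inputs, int32 accumulation (sm_75+)
__device__ void mma_s8_m8n8k16(unsigned a, unsigned b, int &c0, int &c1) {
    asm volatile(
        "mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 "
        "{%0,%1}, {%2}, {%3}, {%0,%1};\n"
        : "+r"(c0), "+r"(c1)
        : "r"(a), "r"(b));
}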

SLIDE 21

Data Partition along Thread Hierarchy on GPU

◼ Auto-tuning of tiling parameters
  ◼ Use C++ function templates to generate multiple kernels with different combinations of parameters (a sketch follows)
  ◼ Choose the best one through profile runs
  ◼ The optimal tiling parameters only need to be determined once per convolution shape, with negligible overhead

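A hedged CUDA sketch of template-generated kernel variants plus a one-time profile run; the kernel body, candidate tilings and launch configuration are illustrative assumptions:

#include <cstdint>
#include <cuda_runtime.h>

template <int MTile, int NTile, int KTile>
__global__ void conv2d_kernel(const int8_t *A, const int8_t *B, int32_t *C) {
    /* tiled implicit-GEMM body parameterized by MTile/NTile/KTile */
}

typedef void (*Kernel)(const int8_t *, const int8_t *, int32_t *);

static Kernel candidates[] = {          // compile-time instantiations
    conv2d_kernel<128, 128, 32>,
    conv2d_kernel<64, 128, 64>,
    conv2d_kernel<64, 64, 64>,
};

// One-time profile run per convolution shape: time each candidate, keep the best.
int profile_best(const int8_t *A, const int8_t *B, int32_t *C) {
    int best = 0;
    float best_ms = 1e30f;
    for (int i = 0; i < 3; ++i) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        candidates[i]<<<dim3(64, 64), dim3(256)>>>(A, B, C);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        if (ms < best_ms) { best_ms = ms; best = i; }
        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    return best;   // cached and reused for this convolution shape
}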

SLIDE 22

Multi-level Memory Access Optimization on GPU

• 1. Coalesced access to global memory

[Figure: each of the 32 threads in a warp loads 16 consecutive bytes with a single 128-bit load; a quarter-warp covers 128 contiguous bytes, the full warp 512 bytes]

// Example code: each thread issues one 128-bit load
// char* src_t;   (source pointer in global memory)
// char* dst_t;   (destination pointer)
*((int4*)dst_t) = __ldg((int4*)(src_t + threadIdx.x * 16));

SLIDE 23

Multi-level Memory Access Optimization on GPU


• 2. Reordering memory accesses in shared memory
◼ Reduces the number of LDS instructions to 1/4 of the original

[Figure: shared-memory access pattern before reordering (threads T0, T1, … each mapped to four separate locations along K×M, so every element costs a separate LDS) and after reordering (accesses regrouped so that each thread fetches its four elements with one vectorized LDS instead of four)]

SLIDE 24

Multi-level Memory Access Optimization on GPU

• 3. Overlapping computation and memory access using registers
◼ A temporary buffer in registers prefetches the data required for the next iteration
◼ Processes ① (prefetching the next tile from global memory) and ④ (the mma computation on the current tile) can be performed simultaneously

• 4. In-place calculation of bias and re-quantization
◼ After the mma calculation finishes, bias and re-quantization are applied directly on the registers
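A hedged sketch of the resulting main-loop structure; the helper routines stand in for the register prefetch, shared-memory store and mma steps, and their names, signatures and the register-buffer size are assumptions (only the A-side prefetch is shown, B is handled analogously):

#include <cstdint>
#include <cuda_runtime.h>

__device__ void prefetch_to_regs(const int8_t *gmem, int k, int4 *regbuf)
    { /* (1) 128-bit global loads into the register buffer */ }
__device__ void store_regs_to_smem(const int4 *regbuf, int8_t *smem)
    { /* (2)/(3) spill the register buffer into shared memory */ }
__device__ void mma_on_smem(const int8_t *smemA, const int8_t *smemB,
                            int32_t *acc)
    { /* (4) Tensor Core mma over the tile currently in shared memory */ }

__device__ void main_loop(const int8_t *gA, int8_t *smemA, const int8_t *gB,
                          int8_t *smemB, int32_t *acc, int K, int KStep) {
    int4 regbuf[8];                        // temporary register buffer
    prefetch_to_regs(gA, 0, regbuf);       // prime the pipeline
    for (int k = 0; k < K; k += KStep) {
        store_regs_to_smem(regbuf, smemA); // make the current tile visible
        __syncthreads();
        if (k + KStep < K)
            prefetch_to_regs(gA, k + KStep, regbuf); // (1) next tile overlaps...
        mma_on_smem(smemA, smemB, acc);              // (4) ...with the mma work
        __syncthreads();
    }
    // epilogue: bias and re-quantization applied in place on the `acc` registers
}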

SLIDE 25

Quantization Fusion on GPU

• 1. Fusion of convolution and dequantization
◼ Transform the results from int32 to float32 directly in the convolution kernel
◼ Skip storing the intermediate results in the int8 data type

[Figure: kernel pipelines before and after fusion. Quantize → Conv2d → Dequantize becomes Quantize → Conv2d+Dequantize; Quantize → Conv2d → Dequantize → ReLU → Quantize becomes Quantize → Conv2d+ReLU]

• 2. Fusion of convolution and ReLU
◼ Change the truncation range of re-quantization in the convolution kernel (a sketch follows)
◼ Eliminate the overhead of unnecessary computation and memory accesses
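A minimal sketch of folding ReLU into re-quantization by raising the lower clamp bound from −128 to 0; the function name and rounding mode are my assumptions:

__device__ int8_t requantize_relu(int32_t acc, float scale) {
    int q = __float2int_rn((float)acc * scale);  // re-quantize the int32 result
    return (int8_t)max(0, min(127, q));          // ReLU folded into the clamp
}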


For more details, please refer to our paper.

SLIDE 26

Outline

◼ Background & Motivation
  ◼ CNN & Quantized Neural Network
  ◼ Low-bit Computation on Modern Computer Architectures

◼ Optimization Methods
  ◼ Low-bit Convolution on ARM CPU
  ◼ Low-bit Convolution on NVIDIA GPU

◼ Evaluation
  ◼ Experiment Setup
  ◼ Performance Analysis

◼ Conclusion

SLIDE 27

Experiment Setup

◼ Hardware and software

◼ Models
  ◼ ResNet-50 (all non-redundant layers)
  ◼ DenseNet-121

◼ Batch size
  ◼ ARM: 1
  ◼ GPU: 1 & 16

◼ Methods for comparison
  ◼ ARM:
    ◼ ncnn 8-bit Conv2d (baseline)
    ◼ TVM 2-bit Conv2d
  ◼ GPU:
    ◼ cuDNN 8-bit Conv2d with the dp4a instruction (baseline)
    ◼ TensorRT 8-bit Conv2d with Tensor Cores

SLIDE 28

Performance Comparison on ARM CPU

◼ Our optimized 2∼7-bit convolution kernels outperform ncnn in most layers of ResNet-50, with average speedups of 1.60×, 1.54×, 1.38×, 1.38×, 1.34× and 1.27×, respectively

SLIDE 29

Performance Comparison with TVM on ARM CPU

◼ Our 2-bit implementation outperforms TVM in most cases (16 out of 19), with a highest speedup of 2.11× and an average speedup of 1.78×

SLIDE 30

Performance Comparison on NVIDIA GPU

◼ With a batch size of 1, our 4-bit and 8-bit convolution kernels outperform TensorRT in most cases, with average speedups of 1.78× and 1.44×, respectively

◼ With a batch size of 16, our 4-bit kernels also outperform TensorRT in 12 layers, with an average speedup of 1.46×

SLIDE 31

Performance Improvement with Profile Runs on GPU

◼ With profile runs enabled, the average speedups of the 4-bit and 8-bit convolution kernels are 2.29× and 2.91×, respectively

SLIDE 32

Space Overhead

◼ GPU: negligible overhead from the precomputed buffer

◼ ARM: space overhead from the im2col, data padding and packing operations
  ◼ The baseline is the space occupied by the activations and weights of each layer
  ◼ The im2col overhead of some layers (e.g., conv2 and conv6) is relatively high
  ◼ The im2col space overhead is determined by the convolution kernel size, stride and input size

SLIDE 33

Outline

◼ Background & Motivation
  ◼ CNN & Quantized Neural Network
  ◼ Low-bit Computation on Modern Computer Architectures

◼ Optimization Methods
  ◼ Low-bit Convolution on ARM CPU
  ◼ Low-bit Convolution on NVIDIA GPU

◼ Evaluation
  ◼ Experiment Setup
  ◼ Performance Analysis

◼ Conclusion

SLIDE 34

Conclusion

◼ Explored extremely low-bit convolution optimizations

◼ ARM CPU
  ◼ Re-designed GEMM computation
  ◼ Instruction and register allocation optimization
  ◼ Winograd optimization

◼ NVIDIA GPU
  ◼ Data partition along the thread hierarchy
  ◼ Multi-level memory access optimization
  ◼ Quantization fusion

◼ Significant speedups compared to existing frameworks/libraries
  ◼ ARM CPU: 1.60× (2-bit) / 1.38× (4-bit)
  ◼ NVIDIA GPU: 5.26× (4-bit) / 4.31× (8-bit)

SLIDE 35

Thanks! Q&A