1. CENG5030 Part 2-4: CNN Inaccurate Speedup-2 — Quantization. Bei Yu (latest update: March 25, 2019), Spring 2019.

2. These slides contain/adapt materials developed by:
   ◮ Suyog Gupta et al. (2015). "Deep learning with limited numerical precision". In: Proc. ICML, pp. 1737–1746
   ◮ Ritchie Zhao et al. (2017). "Accelerating binarized convolutional neural networks with software-programmable FPGAs". In: Proc. FPGA, pp. 15–24
   ◮ Mohammad Rastegari et al. (2016). "XNOR-Net: ImageNet classification using binary convolutional neural networks". In: Proc. ECCV, pp. 525–542

4. "What should I learn to do well in computer vision research?" "I want to do research on a topic with DEEP LEARNING in it!"

5. DEEP LEARNING

6. GPU Server

7. Ohhh No!!!

8. State-of-the-art recognition methods are very expensive: • Memory • Computation • Power

9. Overview: Fixed-Point Representation; Binary/Ternary Network; Reading List

10. Overview: Fixed-Point Representation; Binary/Ternary Network; Reading List

11–13. Fixed-Point vs. Floating-Point [figure-only slides comparing the two number representations]

14. Fixed-Point Arithmetic: number representation and granularity.¹ In the notation of Gupta et al., a fixed-point number ⟨IL, FL⟩ has IL integer bits and FL fractional bits (word length WL = IL + FL), and its granularity, the smallest representable step, is ε = 2^(−FL).
    ¹ Suyog Gupta et al. (2015). "Deep learning with limited numerical precision". In: Proc. ICML, pp. 1737–1746.

15. Fixed-Point Arithmetic: the Multiply-and-ACCumulate (MACC) unit combines a WL-bit multiplier with a wide 48-bit accumulator.

16. Fixed-Point Arithmetic, Rounding Modes: round-to-nearest rounds x to the closest multiple of ε = 2^(−FL).

17. Fixed-Point Arithmetic, Rounding Modes: stochastic rounding rounds x up to ⌊x⌋ + ε with probability (x − ⌊x⌋)/ε and down to ⌊x⌋ otherwise, so there is a non-zero probability of rounding to either neighbour. The scheme is unbiased: the expected rounding error is zero.
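A minimal NumPy sketch of the two rounding modes (the function names are mine, and clipping to the ⟨IL, FL⟩ range is omitted for brevity):

```python
import numpy as np

def to_fixed_nearest(x, fl):
    """Round-to-nearest: snap x to the closest multiple of eps = 2**-fl."""
    eps = 2.0 ** -fl
    return np.round(x / eps) * eps

def to_fixed_stochastic(x, fl, rng=None):
    """Stochastic rounding: round up with probability (x - floor(x)) / eps,
    down otherwise, which makes the expected rounding error zero."""
    rng = rng or np.random.default_rng()
    eps = 2.0 ** -fl
    scaled = np.asarray(x) / eps
    floor = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - floor)
    return (floor + round_up) * eps
```

Averaged over many calls, to_fixed_stochastic(x, fl) returns x, whereas to_fixed_nearest maps any magnitude below ε/2 to zero.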

18. MNIST, fully-connected DNNs: [convergence plots comparing fixed-point training at fractional lengths FL 8, FL 10, and FL 14 against a 32-bit float baseline; arrows indicate decreasing precision]

19. MNIST, fully-connected DNNs (round-to-nearest, FL 8 / 10 / 14 vs. float): for small fractional lengths (FL < 12), the large majority of weight updates are rounded to zero under the round-to-nearest scheme, so convergence slows down and there is a noticeable degradation in classification accuracy.

20. MNIST, fully-connected DNNs: stochastic rounding preserves gradient information (statistically), so there is no degradation in convergence properties and the test error is nearly equal to that obtained with 32-bit floats.
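A self-contained illustration of why this matters for small updates, assuming an FL = 10 grid (the update value below is made up):

```python
import numpy as np

eps = 2.0 ** -10                       # grid spacing at FL = 10 (~0.00098)
update = 3e-4                          # gradient update smaller than eps/2
print(np.round(update / eps) * eps)    # round-to-nearest: always 0.0, update lost
rng = np.random.default_rng(0)
samples = eps * (rng.random(100_000) < update / eps)   # stochastic rounding
print(samples.mean())                  # ~3e-4: the update survives in expectation
```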

21. FPGA prototyping, GEMM with stochastic rounding: [block diagram of the prototype on a Xilinx Kintex K325T FPGA — 8 GB DDR3 SO-DIMM, DDR3 controller with AXI interface, L2 cache in BRAM, input FIFOs for matrices A and B, and an n × n wavefront systolic array of DSP-based Multiply-and-ACCumulate (MACC) units. The top-level controller and memory hierarchy are designed to maximize data reuse; the systolic array computes the matrix product AB, with arrows indicating dataflow.]

22. Maximizing data reuse. Matrix A is [N × K], Matrix B is [K × M]. Inner loop: cycle through columns of Matrix B (M/n iterations); outer loop: cycle through rows of Matrix A (K/(p·n) iterations). Reuse factor for Matrix A: M times; reuse factor for Matrix B: p·n times. The A-cache holds p·n rows and the B-cache holds n columns, where n is the dimension of the systolic array and p is a parameter chosen based on available BRAM resources.
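A schematic rendering of that loop structure in NumPy (parameter names follow the slide; the real design streams tiles through FIFOs into the systolic array rather than looping in software):

```python
import numpy as np

def blocked_gemm(A, B, n, p):
    """C = A @ B with the blocking described on the slide: the outer loop
    walks row-blocks of A (p*n rows held in the A-cache), the inner loop
    walks column-blocks of B (n columns fed to the n x n systolic array)."""
    N, K = A.shape
    _, M = B.shape
    C = np.zeros((N, M))
    for i0 in range(0, N, p * n):            # outer loop: next block of A rows
        A_blk = A[i0:i0 + p * n, :]
        for j0 in range(0, M, n):            # inner loop: next block of B columns
            B_blk = B[:, j0:j0 + n]          # each B column is reused p*n times here
            # on the FPGA the n x n systolic array computes this tile product
            C[i0:i0 + p * n, j0:j0 + n] = A_blk @ B_blk
    # each row of A is reused M times in total; each column of B, p*n times per block
    return C
```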

23. Stochastic rounding in the output path: [diagram — results leave the MACC units through local registers, pass through ROUND units implemented in DSPs, and are written to the output-C FIFOs]

24. Stochastic rounding, ROUND unit: take the accumulated result, add a pseudo-random number generated with an LFSR to the LSBs that are to be rounded off, truncate those LSBs, and saturate to the representable limits if the result exceeds the range. These operations can be implemented efficiently using a single DSP unit.
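A bit-level sketch of that ROUND step in Python (the 16-bit LFSR, output width, and saturation limits are illustrative choices, not taken from the paper):

```python
def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); state must be nonzero."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & 0xFFFF

def stochastic_round_hw(acc, drop_bits, state, out_bits=16):
    """Round a wide accumulator by dropping drop_bits LSBs: add an
    LFSR-generated random value to the bits being dropped, truncate,
    then saturate to a signed out_bits-wide result."""
    state = lfsr16(state)
    rnd = state & ((1 << drop_bits) - 1)    # random value spanning the dropped LSBs
    rounded = (acc + rnd) >> drop_bits      # add, then truncate the LSBs
    hi = (1 << (out_bits - 1)) - 1          # saturate if the result exceeds the range
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, rounded)), state
```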

25. Overview: Fixed-Point Representation; Binary/Ternary Network; Reading List

26. Binarized Neural Networks (BNN). Key differences from a CNN: (1) inputs are binarized (−1 or +1); (2) weights are binarized (−1 or +1); (3) results are binarized after batch normalization. In a CNN, a real-valued input map convolved with real-valued weights gives a real-valued output map. In a BNN, a binary input map convolved with binary weights gives an integer output map, which is batch-normalized, y_i = γ·(x_i − μ)/σ + β, and then binarized: z_i = +1 if y_i ≥ 0, −1 otherwise.
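A small NumPy sketch of that order of operations for a dense binarized layer (the batch-norm parameters are per-output scalars here and the function name is mine):

```python
import numpy as np

def bnn_layer(x_bin, w_bin, gamma, beta, mu, sigma):
    """x_bin and w_bin contain only +1/-1 values. The accumulation is an
    integer; batch normalization rescales it and sign() re-binarizes it."""
    acc = x_bin @ w_bin                      # integer-valued pre-activation
    y = gamma * (acc - mu) / sigma + beta    # batch normalization
    return np.where(y >= 0, 1, -1)           # binarize for the next layer
```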

27. BNN CIFAR-10 architecture [2]: feature-map dimensions shrink 32×32 → 16×16 → 8×8 → 4×4, with 128, 256, and 512 feature maps in the conv stages and dense layers of width 1024, 1024, and 10. Six conv layers, three dense layers, and three max-pooling layers; all conv filters are 3×3; the first conv layer takes floating-point input; total model size is 13.4 Mbits (after hardware optimizations).
    [2] M. Courbariaux et al. "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1". arXiv:1602.02830, Feb 2016.
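Read as a layer list, the slide's figure suggests the usual VGG-style BinaryNet stack; the exact ordering of channel counts and the 2×2 pooling below are assumptions based on the feature-map sizes shown:

```python
# (layer, width); all conv filters are 3x3
bnn_cifar10 = [
    ("conv", 128), ("conv", 128), ("maxpool", 2),   # 32x32 -> 16x16
    ("conv", 256), ("conv", 256), ("maxpool", 2),   # 16x16 -> 8x8
    ("conv", 512), ("conv", 512), ("maxpool", 2),   # 8x8  -> 4x4
    ("dense", 1024), ("dense", 1024), ("dense", 10),
]
```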

28. Advantages of BNN:
    1. Floating-point ops are replaced with binary logic ops. Encoding {+1, −1} as {0, 1}, multiplies become XORs; conv/dense layers compute dot products with XOR and popcount; the operations can map to LUT fabric instead of DSPs (see the sketch below).

       b1   b2   b1 × b2   |   b1   b2   b1 XOR b2
       +1   +1     +1      |    0    0       0
       +1   −1     −1      |    0    1       1
       −1   +1     −1      |    1    0       1
       −1   −1     +1      |    1    1       0

    2. Binarized weights may reduce total model size, although fewer bits per weight may be offset by needing more weights.
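A bit-level sketch of the XOR-and-popcount dot product, with the +1→0 / −1→1 encoding from the table (packing into Python ints; a real kernel would use machine words and hardware popcount):

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1,-1} vectors packed into integers
    (bit i encodes element i, with +1 -> 0 and -1 -> 1)."""
    differing = bin(a_bits ^ b_bits).count("1")   # positions where the product is -1
    return n - 2 * differing                      # (#matches) - (#mismatches)

# a = [+1, -1, +1, +1] -> bits 0b0010;  b = [-1, -1, +1, -1] -> bits 0b1011
print(binary_dot(0b0010, 0b1011, 4))   # 0, same as sum(a[i] * b[i])
```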

29. BNN vs. CNN parameter efficiency:

       Architecture            Depth   Param bits (float)   Param bits (fixed-point)   Error rate (%)
       ResNet [3] (CIFAR-10)    164         51.9M                  13.0M*                  11.26
       BNN [2]                    9            –                   13.4M                   11.40

       * Assuming each float parameter can be quantized to 8-bit fixed-point.
    Comparison: the 8-bit assumption for ResNet is conservative, and BNN is based on VGG (a less advanced architecture), yet BNN comes within 0.14% of ResNet's error with a comparable bit budget — BNN seems to hold promise.
    [2] M. Courbariaux et al. "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1". arXiv:1602.02830, Feb 2016.
    [3] K. He, X. Zhang, S. Ren, and J. Sun. "Identity Mappings in Deep Residual Networks". In: Proc. ECCV, 2016.

30–31. XNOR-Net overview² (savings relative to full-precision convolution):

       Network                   Input ∗ Weight    Operations        Memory    Computation
       Standard convolution      Real ∗ Real       +, −, ×           1×        1×
       Binary Weight Networks    Real ∗ Binary     +, −              ~32×      ~2×
       XNOR-Networks             Binary ∗ Binary   XNOR, bit-count   ~32×      ~58×

    ² Mohammad Rastegari et al. (2016). "XNOR-Net: ImageNet classification using binary convolutional neural networks". In: Proc. ECCV, pp. 525–542.

32. Binary-weight approximation: the real-valued convolution I ∗ W is approximated with a binarized filter, I ∗ W ≈ I ∗ W_B, where W_B = sign(W), so the dot products X^T W are replaced by their binarized counterparts.

33. Quantization error of the binary approximation W_B = sign(W): [figure comparing the real-valued filter W with its binarized counterpart; the slide quotes a value of ≈ 0.75]

34. Optimal scaling factor: approximate W ≈ α·W_B and compute α by solving
       α*, W_B* = argmin over W_B, α of ||W − α·W_B||²,
    which gives W_B* = sign(W) and α* = (1/n)·||W||_ℓ1, the mean absolute value of W.
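A tiny numerical check of that closed form (the 4-element filter below is made up):

```python
import numpy as np

W = np.array([0.6, -0.2, 0.9, -0.3])
W_B = np.sign(W)              # optimal binary filter: [+1, -1, +1, -1]
alpha = np.abs(W).mean()      # alpha* = ||W||_1 / n = 2.0 / 4 = 0.5
print(alpha * W_B)            # [ 0.5 -0.5  0.5 -0.5], the best scaled binary fit to W
```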
