1. CENG5030 Part 2-4: CNN Inaccurate Speedup-2 — Quantization. Bei Yu (latest update: March 25, 2019), Spring 2019.

2. These slides contain/adapt materials developed by:
   ◮ Suyog Gupta et al. (2015). "Deep learning with limited numerical precision". In: Proc. ICML, pp. 1737–1746
   ◮ Ritchie Zhao et al. (2017). "Accelerating binarized convolutional neural networks with software-programmable FPGAs". In: Proc. FPGA, pp. 15–24
   ◮ Mohammad Rastegari et al. (2016). "XNOR-Net: ImageNet classification using binary convolutional neural networks". In: Proc. ECCV, pp. 525–542

4. "What should I learn to do well in computer vision research?" "I want to do research on a topic with DEEP LEARNING in it!"

5. DEEP LEARNING

6. GPU Server

7. Ohhh No!!!

8. State-of-the-art recognition methods are very expensive: • Memory • Computation • Power

9. Overview: Fixed-Point Representation; Binary/Ternary Network; Reading List

10. Overview: Fixed-Point Representation; Binary/Ternary Network; Reading List

11–13. Fixed-Point vs. Floating-Point [figure-only slides comparing the two number representations]

14. Fixed-Point Arithmetic: number representation and granularity.¹ In the notation of Gupta et al., a fixed-point number ⟨IL, FL⟩ has IL integer bits and FL fractional bits (word length WL = IL + FL), and its granularity, the smallest representable step, is ε = 2^(−FL).
    ¹ Suyog Gupta et al. (2015). "Deep learning with limited numerical precision". In: Proc. ICML, pp. 1737–1746.

15. Fixed-Point Arithmetic: the Multiply-and-ACCumulate (MACC) unit combines a WL-bit multiplier with a wide 48-bit accumulator.

16. Fixed-Point Arithmetic, Rounding Modes: round-to-nearest rounds x to the closest multiple of ε = 2^(−FL).

17. Fixed-Point Arithmetic, Rounding Modes: stochastic rounding rounds x up to ⌊x⌋ + ε with probability (x − ⌊x⌋)/ε and down to ⌊x⌋ otherwise, so there is a non-zero probability of rounding to either neighbour. The scheme is unbiased: the expected rounding error is zero.
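A minimal NumPy sketch of the two rounding modes (the function names are mine, and clipping to the ⟨IL, FL⟩ range is omitted for brevity):

```python
import numpy as np

def to_fixed_nearest(x, fl):
    """Round-to-nearest: snap x to the closest multiple of eps = 2**-fl."""
    eps = 2.0 ** -fl
    return np.round(x / eps) * eps

def to_fixed_stochastic(x, fl, rng=None):
    """Stochastic rounding: round up with probability (x - floor(x)) / eps,
    down otherwise, which makes the expected rounding error zero."""
    rng = rng or np.random.default_rng()
    eps = 2.0 ** -fl
    scaled = np.asarray(x) / eps
    floor = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - floor)
    return (floor + round_up) * eps
```

Averaged over many calls, to_fixed_stochastic(x, fl) returns x, whereas to_fixed_nearest maps any magnitude below ε/2 to zero.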

18. MNIST, fully-connected DNNs: [convergence plots comparing fixed-point training at fractional lengths FL 8, FL 10, and FL 14 against a 32-bit float baseline; arrows indicate decreasing precision]

19. MNIST, fully-connected DNNs (round-to-nearest, FL 8 / 10 / 14 vs. float): for small fractional lengths (FL < 12), the large majority of weight updates are rounded to zero under the round-to-nearest scheme, so convergence slows down and there is a noticeable degradation in classification accuracy.

20. MNIST, fully-connected DNNs: stochastic rounding preserves gradient information (statistically), so there is no degradation in convergence properties and the test error is nearly equal to that obtained with 32-bit floats.
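A self-contained illustration of why this matters for small updates, assuming an FL = 10 grid (the update value below is made up):

```python
import numpy as np

eps = 2.0 ** -10                       # grid spacing at FL = 10 (~0.00098)
update = 3e-4                          # gradient update smaller than eps/2
print(np.round(update / eps) * eps)    # round-to-nearest: always 0.0, update lost
rng = np.random.default_rng(0)
samples = eps * (rng.random(100_000) < update / eps)   # stochastic rounding
print(samples.mean())                  # ~3e-4: the update survives in expectation
```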

21. FPGA prototyping, GEMM with stochastic rounding: [block diagram of the prototype on a Xilinx Kintex K325T FPGA — 8 GB DDR3 SO-DIMM, DDR3 controller with AXI interface, L2 cache in BRAM, input FIFOs for matrices A and B, and an n × n wavefront systolic array of DSP-based Multiply-and-ACCumulate (MACC) units. The top-level controller and memory hierarchy are designed to maximize data reuse; the systolic array computes the matrix product AB, with arrows indicating dataflow.]

22. Maximizing data reuse. Matrix A is [N × K], Matrix B is [K × M]. Inner loop: cycle through columns of Matrix B (M/n iterations); outer loop: cycle through rows of Matrix A (K/(p·n) iterations). Reuse factor for Matrix A: M times; reuse factor for Matrix B: p·n times. The A-cache holds p·n rows and the B-cache holds n columns, where n is the dimension of the systolic array and p is a parameter chosen based on available BRAM resources.
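A schematic rendering of that loop structure in NumPy (parameter names follow the slide; the real design streams tiles through FIFOs into the systolic array rather than looping in software):

```python
import numpy as np

def blocked_gemm(A, B, n, p):
    """C = A @ B with the blocking described on the slide: the outer loop
    walks row-blocks of A (p*n rows held in the A-cache), the inner loop
    walks column-blocks of B (n columns fed to the n x n systolic array)."""
    N, K = A.shape
    _, M = B.shape
    C = np.zeros((N, M))
    for i0 in range(0, N, p * n):            # outer loop: next block of A rows
        A_blk = A[i0:i0 + p * n, :]
        for j0 in range(0, M, n):            # inner loop: next block of B columns
            B_blk = B[:, j0:j0 + n]          # each B column is reused p*n times here
            # on the FPGA the n x n systolic array computes this tile product
            C[i0:i0 + p * n, j0:j0 + n] = A_blk @ B_blk
    # each row of A is reused M times in total; each column of B, p*n times per block
    return C
```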

23. Stochastic rounding in the output path: [diagram — results leave the MACC units through local registers, pass through ROUND units implemented in DSPs, and are written to the output-C FIFOs]

24. Stochastic rounding, ROUND unit: take the accumulated result, add a pseudo-random number generated with an LFSR to the LSBs that are to be rounded off, truncate those LSBs, and saturate to the representable limits if the result exceeds the range. These operations can be implemented efficiently using a single DSP unit.
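A bit-level sketch of that ROUND step in Python (the 16-bit LFSR, output width, and saturation limits are illustrative choices, not taken from the paper):

```python
def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); state must be nonzero."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & 0xFFFF

def stochastic_round_hw(acc, drop_bits, state, out_bits=16):
    """Round a wide accumulator by dropping drop_bits LSBs: add an
    LFSR-generated random value to the bits being dropped, truncate,
    then saturate to a signed out_bits-wide result."""
    state = lfsr16(state)
    rnd = state & ((1 << drop_bits) - 1)    # random value spanning the dropped LSBs
    rounded = (acc + rnd) >> drop_bits      # add, then truncate the LSBs
    hi = (1 << (out_bits - 1)) - 1          # saturate if the result exceeds the range
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, rounded)), state
```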

25. Overview: Fixed-Point Representation; Binary/Ternary Network; Reading List

26. Binarized Neural Networks (BNN). Key differences from a CNN: (1) inputs are binarized (−1 or +1); (2) weights are binarized (−1 or +1); (3) results are binarized after batch normalization. In a CNN, a real-valued input map convolved with real-valued weights gives a real-valued output map. In a BNN, a binary input map convolved with binary weights gives an integer output map, which is batch-normalized, y_i = γ·(x_i − μ)/σ + β, and then binarized: z_i = +1 if y_i ≥ 0, −1 otherwise.
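A small NumPy sketch of that order of operations for a dense binarized layer (the batch-norm parameters are per-output scalars here and the function name is mine):

```python
import numpy as np

def bnn_layer(x_bin, w_bin, gamma, beta, mu, sigma):
    """x_bin and w_bin contain only +1/-1 values. The accumulation is an
    integer; batch normalization rescales it and sign() re-binarizes it."""
    acc = x_bin @ w_bin                      # integer-valued pre-activation
    y = gamma * (acc - mu) / sigma + beta    # batch normalization
    return np.where(y >= 0, 1, -1)           # binarize for the next layer
```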

27. BNN CIFAR-10 architecture [2]: feature-map dimensions shrink 32×32 → 16×16 → 8×8 → 4×4, with 128, 256, and 512 feature maps in the conv stages and dense layers of width 1024, 1024, and 10. Six conv layers, three dense layers, and three max-pooling layers; all conv filters are 3×3; the first conv layer takes floating-point input; total model size is 13.4 Mbits (after hardware optimizations).
    [2] M. Courbariaux et al. "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1". arXiv:1602.02830, Feb 2016.
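Read as a layer list, the slide's figure suggests the usual VGG-style BinaryNet stack; the exact ordering of channel counts and the 2×2 pooling below are assumptions based on the feature-map sizes shown:

```python
# (layer, width); all conv filters are 3x3
bnn_cifar10 = [
    ("conv", 128), ("conv", 128), ("maxpool", 2),   # 32x32 -> 16x16
    ("conv", 256), ("conv", 256), ("maxpool", 2),   # 16x16 -> 8x8
    ("conv", 512), ("conv", 512), ("maxpool", 2),   # 8x8  -> 4x4
    ("dense", 1024), ("dense", 1024), ("dense", 10),
]
```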

28. Advantages of BNN:
    1. Floating-point ops are replaced with binary logic ops. Encoding {+1, −1} as {0, 1}, multiplies become XORs; conv/dense layers compute dot products with XOR and popcount; the operations can map to LUT fabric instead of DSPs (see the sketch below).

       b1   b2   b1 × b2   |   b1   b2   b1 XOR b2
       +1   +1     +1      |    0    0       0
       +1   −1     −1      |    0    1       1
       −1   +1     −1      |    1    0       1
       −1   −1     +1      |    1    1       0

    2. Binarized weights may reduce total model size, although fewer bits per weight may be offset by needing more weights.
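A bit-level sketch of the XOR-and-popcount dot product, with the +1→0 / −1→1 encoding from the table (packing into Python ints; a real kernel would use machine words and hardware popcount):

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1,-1} vectors packed into integers
    (bit i encodes element i, with +1 -> 0 and -1 -> 1)."""
    differing = bin(a_bits ^ b_bits).count("1")   # positions where the product is -1
    return n - 2 * differing                      # (#matches) - (#mismatches)

# a = [+1, -1, +1, +1] -> bits 0b0010;  b = [-1, -1, +1, -1] -> bits 0b1011
print(binary_dot(0b0010, 0b1011, 4))   # 0, same as sum(a[i] * b[i])
```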

29. BNN vs. CNN parameter efficiency:

       Architecture            Depth   Param bits (float)   Param bits (fixed-point)   Error rate (%)
       ResNet [3] (CIFAR-10)    164         51.9M                  13.0M*                  11.26
       BNN [2]                    9            –                   13.4M                   11.40

       * Assuming each float parameter can be quantized to 8-bit fixed-point.
    Comparison: the 8-bit assumption for ResNet is conservative, and BNN is based on VGG (a less advanced architecture), yet BNN comes within 0.14% of ResNet's error with a comparable bit budget — BNN seems to hold promise.
    [2] M. Courbariaux et al. "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1". arXiv:1602.02830, Feb 2016.
    [3] K. He, X. Zhang, S. Ren, and J. Sun. "Identity Mappings in Deep Residual Networks". In: Proc. ECCV, 2016.

30–31. XNOR-Net overview² (savings relative to full-precision convolution):

       Network                   Input ∗ Weight    Operations        Memory    Computation
       Standard convolution      Real ∗ Real       +, −, ×           1×        1×
       Binary Weight Networks    Real ∗ Binary     +, −              ~32×      ~2×
       XNOR-Networks             Binary ∗ Binary   XNOR, bit-count   ~32×      ~58×

    ² Mohammad Rastegari et al. (2016). "XNOR-Net: ImageNet classification using binary convolutional neural networks". In: Proc. ECCV, pp. 525–542.

32. Binary-weight approximation: the real-valued convolution I ∗ W is approximated with a binarized filter, I ∗ W ≈ I ∗ W_B, where W_B = sign(W), so the dot products X^T W are replaced by their binarized counterparts.

33. Quantization error of the binary approximation W_B = sign(W): [figure comparing the real-valued filter W with its binarized counterpart; the slide quotes a value of ≈ 0.75]

34. Optimal scaling factor: approximate W ≈ α·W_B and compute α by solving
       α*, W_B* = argmin over W_B, α of ||W − α·W_B||²,
    which gives W_B* = sign(W) and α* = (1/n)·||W||_ℓ1, the mean absolute value of W.
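A tiny numerical check of that closed form (the 4-element filter below is made up):

```python
import numpy as np

W = np.array([0.6, -0.2, 0.9, -0.3])
W_B = np.sign(W)              # optimal binary filter: [+1, -1, +1, -1]
alpha = np.abs(W).mean()      # alpha* = ||W||_1 / n = 2.0 / 4 = 0.5
print(alpha * W_B)            # [ 0.5 -0.5  0.5 -0.5], the best scaled binary fit to W
```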
