Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine


  1. Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine. Song Han, CVA group, Stanford University. Apr 7, 2016

  2. Intro about me and my advisor
  Song Han:
  • Fourth-year PhD student with Prof. Bill Dally at Stanford.
  • Research interest: deep learning model compression and hardware acceleration, to make inference more efficient for deployment.
  • Recent work on "Deep Compression" and "EIE: Efficient Inference Engine", covered by TheNextPlatform, O'Reilly, TechEmergence, and HackerNews.
  Bill Dally:
  • Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group.
  • Chief Scientist of NVIDIA.
  • Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM, and recipient of numerous other awards.

  3. Thanks to my collaborators
  • NVIDIA: Jeff Pool, John Tran, Bill Dally
  • Stanford: Xingyu Liu, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
  • Tsinghua: Huizi Mao, Song Yao, Yu Wang
  • Berkeley: Forrest Iandola, Matthew Moskewicz, Khalid Ashraf, Kurt Keutzer
  You'll be interested in his GTC talk: S6417 - FireCaffe

  4. This Talk:
  • Deep Compression [1,2]: a deep neural network model compression pipeline.
  • EIE Accelerator [3]: an efficient inference engine that accelerates the compressed deep neural network model.
  • SqueezeNet++ [4,5]: ConvNet architecture design space exploration.
  [1] Han et al. "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015.
  [2] Han et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.
  [3] Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016.
  [4] Iandola, Han, et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size", arXiv 2016.
  [5] Yao, Han, et al. "Hardware-friendly convolutional neural network with even-number filter size", ICLR 2016 workshop.

  5. Deep Learning: the Next Wave of AI. Image Recognition, Speech Recognition, Natural Language Processing

  6. Applications

  7. The Problem: If Running DNN on Mobile … App developers suffer from the model size. "At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." (Andrew Ng)

  8. The Problem: If Running DNN on Mobile … Hardware engineers suffer from the model size (embedded systems have limited resources).

  9. The Problem: If Running DNN on the Cloud … Network delay, power budget, user privacy. Intelligent, but inefficient.

  10. Deep Compression. Problem 1: Model Size. Solution 1: Deep Compression.
  • Smaller size: compress mobile app size by 35x-50x
  • Accuracy: no loss of accuracy, even improved accuracy
  • Speedup: make inference faster

  11. EIE Accelerator. Problem 2: Latency, Power, Energy. Solution 2: ASIC accelerator.
  • Offline: no dependency on a network connection
  • Real time: no network delay, high frame rate
  • Low power: high energy efficiency that preserves battery

  12. Part 1: Deep Compression
  • AlexNet: 35x, 240MB => 6.9MB => 0.47MB (510x)
  • VGG16: 49x, 552MB => 11.3MB
  • With no loss of accuracy on ImageNet-2012.
  • Weights fit in on-chip SRAM, taking 120x less energy than DRAM.
  1. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015.
  2. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016.
  3. Iandola, Han, et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", ECCV submission.

  13. 1. Pruning. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

  14. Pruning: Motivation
  • Trillions of synapses are generated in the human brain during the first few months after birth.
  • At 1 year old, the synapse count peaks at 1,000 trillion, and pruning begins to occur.
  • By 10 years old, a child has nearly 500 trillion synapses.
  • This 'pruning' mechanism removes redundant connections in the brain.
  [1] Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.

  15. Retrain to Recover Accuracy. [Figure: accuracy loss vs. parameters pruned away (40%-100%), comparing L1/L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain.] Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
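To make the prune-and-retrain loop concrete, here is a minimal NumPy sketch on a toy linear layer. The layer, learning rate, and pruning schedule are illustrative stand-ins, not the paper's setup (which prunes a trained network and retrains with SGD in a deep-learning framework):

```python
import numpy as np

# Toy "layer": fit W so that X @ W approximates Y, then prune and retrain.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64)).astype(np.float32)
W_true = rng.normal(size=(64, 10)).astype(np.float32)
Y = X @ W_true

W = rng.normal(size=(64, 10)).astype(np.float32) * 0.1
mask = np.ones_like(W, dtype=bool)
lr = 0.1

for prune_round in range(3):
    # Prune: drop the 50% smallest-magnitude surviving weights each round.
    threshold = np.quantile(np.abs(W[mask]), 0.5)
    mask &= np.abs(W) >= threshold
    W *= mask
    # Retrain: gradient steps on the squared error; pruned weights stay zero.
    for _ in range(200):
        grad = X.T @ (X @ W - Y) / len(X)
        W -= lr * grad
        W *= mask  # re-apply the mask so pruned connections stay dead

print(f"surviving weights: {mask.mean():.1%}")  # ~12.5% after three rounds
```

The key detail the slide's curves capture is the interleaving: pruning a little, retraining, then pruning again loses far less accuracy than pruning everything at once.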

  16. Pruning: Results on 4 ConvNets. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

  17. AlexNet & VGGNet. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

  18. Mask Visualization. Visualization of the first FC layer's sparsity pattern of LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the unpruned parameters in the center of the images, since the digits are written in the center.

  19. Pruning NeuralTalk and LSTM. Image captioning:
  • Pruning away 90% of the parameters in NeuralTalk doesn't hurt the BLEU score with proper retraining.
  Related work: Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"; Mao et al., "Explain Images with Multimodal Recurrent Neural Networks"; Vinyals et al., "Show and Tell: A Neural Image Caption Generator"; Donahue et al., "Long-term Recurrent Convolutional Networks for Visual Recognition and Description"; Chen and Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation"

  20. Pruning NeuralTalk and LSTM
  • Original: a basketball player in a white uniform is playing with a ball. Pruned 90%: a basketball player in a white uniform is playing with a basketball
  • Original: a brown dog is running through a grassy field. Pruned 90%: a brown dog is running through a grassy area
  • Original: a soccer player in red is running in the field. Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
  Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

  21. Pruning Neural Machine Translation. Abi See, "CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation"

  22. Pruning Neural Machine Translation. Word embedding and LSTM weight visualizations: dark means zero (redundant), white means non-zero (useful). Abi See, "CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation"

  23. Speedup (FC layers)
  • Intel Core i7 5930K: MKL CBLAS GEMV vs. MKL SPBLAS CSRMV
  • NVIDIA GeForce GTX Titan X: cuBLAS GEMV vs. cuSPARSE CSRMV
  • NVIDIA Tegra K1: cuBLAS GEMV vs. cuSPARSE CSRMV
  Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
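Behind these benchmarks is the same comparison on every platform: dense GEMV against sparse matrix-vector multiplication on the pruned weight matrix stored in CSR format. A rough SciPy stand-in for the MKL/cuSPARSE kernels; the matrix size and 90% sparsity are illustrative, and real speedups depend heavily on the kernel and hardware:

```python
import time

import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096)).astype(np.float32)
W[rng.random(W.shape) < 0.9] = 0.0  # ~90% of weights pruned away
x = rng.normal(size=4096).astype(np.float32)

W_csr = csr_matrix(W)  # CSR stores only the surviving ~10% of weights

t0 = time.perf_counter()
y_dense = W @ x        # dense GEMV: touches every weight
t1 = time.perf_counter()
y_sparse = W_csr @ x   # sparse SpMV: touches only non-zeros
t2 = time.perf_counter()

assert np.allclose(y_dense, y_sparse, atol=1e-2)
print(f"dense GEMV {t1 - t0:.5f}s, CSR SpMV {t2 - t1:.5f}s")
```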

  24. Energy Efficiency (FC layers)
  • Intel Core i7 5930K: CPU socket and DRAM power reported by the pcm-power utility
  • NVIDIA GeForce GTX Titan X: power reported by the nvidia-smi utility
  • NVIDIA Tegra K1: total power measured with a power meter; assuming 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% of power consumed by peripheral components, 60% of the total is attributed to AP+DRAM
  Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016

  25. 2. Weight Sharing (Trained Quantization). Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016

  26. Weight Sharing: Overview. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
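A minimal sketch of the weight-sharing step, assuming 4-bit codes (16 shared values) and using SciPy's k-means in place of the authors' implementation; the weight matrix here is a random stand-in:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
W = rng.normal(size=(300, 100)).astype(np.float32)  # stand-in weight matrix

bits = 4
n_clusters = 2 ** bits  # 4-bit codes -> 16 shared weight values

# Cluster all weights of the layer; each weight keeps only its cluster index.
codebook, labels = kmeans2(W.reshape(-1, 1), n_clusters, minit="++")
W_shared = codebook[labels].reshape(W.shape)  # decode: look up the codebook

dense_bits = W.size * 32
coded_bits = W.size * bits + n_clusters * 32  # indices + fp32 codebook
print(f"weight sharing alone: {dense_bits / coded_bits:.1f}x smaller")
```

Storage then holds one small index per weight plus a tiny codebook per layer, which is where most of the quantization savings come from.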

  27. Finetune Centroids. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
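Continuing the sketch above: during fine-tuning, the gradients of all weights that share a centroid are summed, and the sum updates that shared value. The gradient below is a random placeholder for dL/dW from backprop, and the learning rate is illustrative:

```python
lr = 0.01
grad = rng.normal(size=W.size).astype(np.float32)  # placeholder for dL/dW

# Group-and-sum: accumulate the gradient of every weight into its centroid.
centroid_grad = np.bincount(labels, weights=grad, minlength=n_clusters)
codebook[:, 0] -= lr * centroid_grad
W_shared = codebook[labels].reshape(W.shape)  # re-decode after the update
```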

  28. Accuracy vs. #Bits on 5 Conv Layers + 3 FC Layers
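For intuition on the size side of this trade-off: with b-bit codes, each weight stores only an index into a 2^b-entry codebook. A back-of-the-envelope sketch, assuming roughly AlexNet's 61M parameters and fp32 centroids, and ignoring pruning and Huffman coding:

```python
def quantized_megabytes(n_weights: int, bits: int) -> float:
    # Per-weight indices plus one fp32 codebook shared by the layer.
    codebook_bits = (2 ** bits) * 32
    return (n_weights * bits + codebook_bits) / 8 / 1e6

for bits in (2, 4, 8):
    print(f"{bits}-bit codes: {quantized_megabytes(61_000_000, bits):.0f} MB")
```

Against ~244 MB of fp32 weights, even 8-bit codes cut storage 4x, which is why the accuracy-vs-bits curves matter: the fewer bits the layers tolerate, the smaller the model.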
