Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine
Song Han, CVA Group, Stanford University, Apr 7, 2016
Intro: about me and my advisor. I am a fourth-year PhD student working with Prof. Bill Dally at Stanford. My research focuses on deep neural network model compression and hardware acceleration, to make inference more efficient for deployment. Our work on the "Efficient Inference Engine" has been covered by TheNextPlatform, O'Reilly, TechEmergence, and HackerNews.
Prof. Bill Dally, of the Stanford Computer Science department, leads the Concurrent VLSI Architecture (CVA) Group. He is a member of the American Academy of Arts & Sciences, a Fellow of the IEEE, a Fellow of the ACM, and the recipient of numerous other awards.
You may also be interested in the GTC talk S6417 - FireCaffe.
[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks," NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016.
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA 2016.
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv 2016.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size," ICLR 2016 workshop.
Deep learning applications: image recognition, speech recognition, natural language processing.
"At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." —Andrew Ng
If Running DNN on Mobile…
Intelligent but Inefficient
If Running DNN on the Cloud…
Network delay, power budget, and user privacy become concerns.
Compress mobile app size by 35x-50x
No loss of accuracy (in some cases even improved accuracy)
Make inference faster
No dependency on network connection
No network delay; high frame rate
High energy efficiency that preserves battery life
This talk covers three projects: Deep Compression (pruning, trained quantization and Huffman coding, ICLR 2016), SqueezeNet (ECCV submission), and EIE (ISCA 2016).
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning also happens in the human brain: synapses grow rapidly in the first years of life and are then gradually pruned away. [1] Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.
[Figure: accuracy loss vs. fraction of parameters pruned away, comparing five settings: L1/L2 regularization with and without retraining, and L2 regularization with iterative pruning and retraining. Iterative pruning with retraining prunes the most parameters without accuracy loss.]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
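To make the prune-and-retrain loop concrete, here is a minimal NumPy sketch of magnitude-based iterative pruning. The percentile threshold and the `retrain` placeholder are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def iterative_prune(weights, retrain, sparsity_schedule=(0.5, 0.7, 0.9)):
    """Alternate pruning and retraining; `retrain` stands in for a few epochs
    of SGD in which gradients are masked so pruned weights stay at zero."""
    mask = np.ones_like(weights, dtype=bool)
    for sparsity in sparsity_schedule:
        weights, mask = prune_by_magnitude(weights, sparsity)
        weights = retrain(weights, mask)  # recover accuracy before pruning more
    return weights, mask
```

Pruning to high sparsity in one shot hurts accuracy; ramping the sparsity up gradually and letting the surviving weights retrain between steps is what makes the aggressive rates shown in the figure possible.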
Visualization of the sparsity pattern of the first FC layer of LeNet-300-100. It has a banded structure repeated 28 times, corresponding to the un-pruned parameters attending to the center of the image, since the digits are written in the center.
(Slide credit: Fei-Fei Li, Andrej Karpathy & Justin Johnson, CS231n Lecture 10, 8 Feb 2016.)
Explain Images with Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Show and Tell: A Neural Image Caption Generator, Vinyals et al.; Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.; Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick.
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions".
[Figure: image-captioning examples after pruning the RNN/LSTM; generated caption fragments include "with a ball", "playing with a basketball", "shirt is running through a field".]
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
[Figure: sparsity patterns of the word embedding and LSTM weight matrices in a neural machine translation model. Dark means zero (redundant); white means non-zero (useful).]
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Mobile power budget: after DC conversion loss, 85% regulator efficiency, and 15% of power consumed by peripheral components, roughly 60% of the budget remains for AP+DRAM power.
Figure 7: Pruning doesn't hurt quantization. Dashed lines: quantization on the unpruned network; solid lines: quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits whether or not the network has been pruned. Although pruning reduces the number of parameters, quantization works as well as on the unpruned network, or even better (the 3-bit case in the left figure).
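The trained quantization behind these curves is weight sharing: cluster each layer's weights with k-means, keep only a small codebook of shared values plus a per-weight cluster index, and fine-tune the centroids during retraining. A minimal sketch of the clustering step, using scikit-learn's KMeans as a stand-in (centroid fine-tuning is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

def weight_share(weights, bits=4):
    """Cluster weights into 2**bits shared values (k-means codebook)."""
    k = 2 ** bits
    km = KMeans(n_clusters=k, n_init=1).fit(weights.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()  # 2**bits fp32 centroids
    indices = km.labels_.astype(np.uint8)   # one small index per weight
    return codebook[indices].reshape(weights.shape), codebook, indices
```

Storage per weight drops from 32 bits to `bits` bits, plus a codebook of 2**bits floats per layer, which is where most of the post-pruning compression comes from.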
A Huffman code is an optimal prefix code commonly used for lossless data compression. It produces a variable-length code table for encoding source symbols, derived from the occurrence probability of each symbol. As in other entropy-coding methods, more common symbols are represented with fewer bits than less common ones, saving space overall.
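Since the quantized weight indices are highly non-uniform, they compress well under entropy coding. A toy illustration built on Python's heapq (a generic textbook Huffman coder, not the paper's implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-code table {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    # Heap entries: (weight, tiebreak, tree) where a tree is a symbol
    # or a (left, right) pair; the tiebreak keeps comparisons well-defined.
    heap = [(w, i, s) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    code = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"
    walk(heap[0][2])
    return code

print(huffman_code("aaaaaaabbbccd"))  # frequent 'a' gets 1 bit, rare 'd' gets 3
```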
Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," arXiv 2016.
[Figure: per-layer parameter counts of SqueezeNet, conv1 through conv_final, split into remaining parameters vs. parameters pruned away.]
Fig 2. Deep Compression is compatible with even extremely efficient network architectures such as SqueezeNet: it can be pruned 3x and quantized to 6 bits without loss of accuracy. Fig 1: SqueezeNet architecture.
[Fire module: Input (64 channels) → 1x1 Conv Squeeze (16) → {1x1 Conv Expand (64), 3x3 Conv Expand (64)} → Concat → Output (128 channels).]
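To see why the squeeze layer saves parameters, here is a small parameter-count calculation for a Fire module (weights plus biases; the fire2 numbers match Table 1 later in the deck):

```python
def fire_params(c_in, s1x1, e1x1, e3x3):
    """Parameter count of a Fire module (weights + biases)."""
    squeeze = c_in * s1x1 + s1x1          # 1x1 squeeze layer
    expand1 = s1x1 * e1x1 + e1x1          # 1x1 expand layer
    expand3 = s1x1 * e3x3 * 3 * 3 + e3x3  # 3x3 expand layer
    return squeeze + expand1 + expand3

# fire2: 96 input channels, squeeze to 16, expand to 64 + 64 = 128 outputs.
# The 3x3 filters see only the 16 squeezed channels, not all 96.
print(fire_params(96, 16, 64, 64))  # 11920
```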
Combining an efficient model with pruning and weight sharing:
CNN architecture | Compression approach | Data type | Model size (original → compressed) | Reduction vs. AlexNet | Top-1 / Top-5 ImageNet accuracy
AlexNet | None (baseline) | 32 bit | 240MB | 1x | 57.2% / 80.3%
AlexNet | SVD [3] | 32 bit | 240MB → 48MB | 5x | 56.0% / 79.4%
AlexNet | Network Pruning [4] | 32 bit | 240MB → 27MB | 9x | 57.2% / 80.3%
AlexNet | Deep Compression [5] | 5-8 bit | 240MB → 6.9MB | 35x | 57.2% / 80.3%
SqueezeNet (ours) | None | 32 bit | 4.8MB | 50x | 57.5% / 80.3%
SqueezeNet (ours) | Deep Compression | 8 bit | 4.8MB → 0.66MB | 363x | 57.5% / 80.3%
SqueezeNet (ours) | Deep Compression | 6 bit | 4.8MB → 0.47MB | 510x | 57.5% / 80.3%
Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," arXiv 2016.
Compression without retraining: ✓ no training needed, ✓ fast (3 minutes); ✗ only 5x-10x compression rate, ✗ ~1% loss of accuracy.
Deep Compression: ✓ 35x-50x compression rate, ✓ no loss of accuracy; ✗ training is needed, ✗ slow.
Summary: Deep Compression reduces the storage required by neural networks without affecting accuracy, by finding the right connections and quantizing the weights. The pipeline is: prune the redundant connections, enforce weight sharing, then apply Huffman encoding. It reduces AlexNet storage by 35x and VGG-16 by 49x, without loss of accuracy.
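As a back-of-envelope check of where AlexNet's 35x comes from, the stage-by-stage sizes below are approximate values reported in the Deep Compression paper; the accounting itself is mine:

```python
# Approximate AlexNet sizes (MB) after each Deep Compression stage.
original = 240.0
stages = [
    ("pruning (9x fewer weights)", 240.0 / 9),  # ~27 MB
    ("+ trained quantization",     8.9),        # shared weights + small indices
    ("+ Huffman coding",           6.9),        # entropy-coded indices
]
for name, size in stages:
    print(f"{name}: {size:.1f} MB ({original / size:.0f}x)")
```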
Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," submitted to ECCV.
Table 1. SqueezeNet architectural dimensions. Columns: layer name/type | output size | filter size / stride (if not a fire layer) | depth | s1x1 (#1x1 squeeze) / e1x1 (#1x1 expand) / e3x3 (#3x3 expand) | s1x1 / e1x1 / e3x3 sparsity | #bits | #parameters before → after pruning.

input image | 224x224x3
conv1 | 111x111x96 | 7x7/2 (x96) | depth 1 | 100% (7x7) sparsity | 6bit | 14,208 → 14,208
maxpool1 | 55x55x96 | 3x3/2
fire2 | 55x55x128 | depth 2 | 16 / 64 / 64 | 100% / 100% / 33% | 6bit | 11,920 → 5,746
fire3 | 55x55x128 | depth 2 | 16 / 64 / 64 | 100% / 100% / 33% | 6bit | 12,432 → 6,258
fire4 | 55x55x256 | depth 2 | 32 / 128 / 128 | 100% / 100% / 33% | 6bit | 45,344 → 20,646
maxpool4 | 27x27x256 | 3x3/2
fire5 | 27x27x256 | depth 2 | 32 / 128 / 128 | 100% / 100% / 33% | 6bit | 49,440 → 24,742
fire6 | 27x27x384 | depth 2 | 48 / 192 / 192 | 100% / 50% / 33% | 6bit | 104,880 → 44,700
fire7 | 27x27x384 | depth 2 | 48 / 192 / 192 | 50% / 100% / 33% | 6bit | 111,024 → 46,236
fire8 | 27x27x512 | depth 2 | 64 / 256 / 256 | 100% / 50% / 33% | 6bit | 188,992 → 77,581
maxpool8 | 13x13x512 | 3x3/2
fire9 | 13x13x512 | depth 2 | 64 / 256 / 256 | 50% / 100% / 30% | 6bit | 197,184 → 77,581
conv10 | 13x13x1000 | 1x1/1 (x1000) | depth 1 | 20% (3x3) sparsity | 6bit | 513,000 → 103,400
avgpool10 | 1x1x1000 | 13x13/1
Total | 1,248,424 → 421,098
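A quick sanity check of the model sizes implied by Table 1 (my arithmetic, not the paper's; the gap between the raw 6-bit payload and the reported 0.47MB compressed size is sparse-index and codebook overhead):

```python
# Storage implied by Table 1's parameter counts.
params_before = 1_248_424
params_after = 421_098

fp32_mb = params_before * 32 / 8 / 1e6  # ~5.0 MB, consistent with 4.8MB figure
code_mb = params_after * 6 / 8 / 1e6    # ~0.32 MB of 6-bit weight codes

print(f"uncompressed fp32: {fp32_mb:.2f} MB")
print(f"pruned 6-bit codes: {code_mb:.2f} MB (plus indices -> ~0.47 MB total)")
```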
Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size
SqueezeNet | 57.5% | 80.3% | 4.8MB
SqueezeNet++ | 59.5% | 81.5% | 7.1MB
"labrador retriever dog"
conv1
96
fire2
128
fire3
128
fire4
256
fire5
256
fire6
384
fire7
384
fire8
512
fire9
512
conv10
1000
softmax maxpool/2 maxpool/2 maxpool/2 global avgpool conv1
96
fire2
128
fire3
128
fire4
256
fire5
256
fire6
384
fire7
384
fire8
512
fire9
512
conv10
1000
softmax maxpool/2 maxpool/2 maxpool/2 global avgpool conv1
96
fire2
128
fire3
128
fire4
256
fire5
256
fire6
384
fire7
384
fire8
512
fire9
512
conv10
1000
softmax maxpool/2 maxpool/2 maxpool/2 global avgpool conv1x1 conv1x1 conv1x1 conv1x1
96
Vanilla Fire module: Input (64) → 1x1 Conv Squeeze (16) → {1x1 Conv Expand (64), 3x3 Conv Expand (64)} → Concat → Output (128).
Fire module with Simple Bypass: the same, plus an identity bypass from the module input (128) added element-wise to the output (128); requires matching channel counts.
Fire module with Complex Bypass: the same, plus a 1x1 Conv on the bypass path (64 → 128), so input and output channel counts need not match.
Table 3. SqueezeNet accuracy and model size using different macroarchitectures.
Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size
Vanilla SqueezeNet | 57.5% | 80.3% | 4.8MB
SqueezeNet + Simple Bypass | 60.4% | 82.5% | 4.8MB
SqueezeNet + Complex Bypass | 58.8% | 82.0% | 7.7MB
Table 5. Improving accuracy with dense→sparse→dense (DSD) training.
Architecture | Top-1 Accuracy | Top-5 Accuracy | Model Size
SqueezeNet | 57.5% | 80.3% | 4.8MB
SqueezeNet (DSD) | 61.8% | 83.5% | 4.8MB
After training under the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum. Understanding why this works is an interesting area of future work.
[Diagram: Dense → (constrain) → Sparse → (relax) → Dense.]
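A minimal sketch of the DSD schedule in NumPy; `train_step` is a stand-in for the real optimizer, and the mask construction follows the magnitude-pruning sketch earlier:

```python
import numpy as np

def dsd_train(weights, train_step, steps=1000, sparsity=0.3):
    """Dense -> Sparse -> Dense training schedule (sketch)."""
    for _ in range(steps):                # dense: train normally
        weights = train_step(weights, mask=None)
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) > threshold
    for _ in range(steps):                # sparse: constrain with the mask
        weights = train_step(weights * mask, mask=mask)
    for _ in range(steps):                # dense again: relax the constraint
        weights = train_step(weights, mask=None)
    return weights
```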
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
[Diagram: EIE architecture: an array of processing elements (PEs) under a central control unit; the compressed model is partitioned across the PEs, each holding a slice of the sparse weight matrix in on-chip SRAM.]
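EIE computes sparse matrix-vector products directly on the compressed representation: it skips zero activations entirely, touches only stored non-zero weights, and decodes 4-bit weight-sharing indices through a codebook. A minimal single-PE software model in NumPy (the storage layout and names here are my own; real EIE interleaves matrix rows across PEs in hardware):

```python
import numpy as np

def eie_matvec(n_rows, col_ptr, row_idx, w_idx, codebook, x):
    """y = W @ x with W stored column-wise (CSC-style): column j's non-zeros
    live in positions col_ptr[j]:col_ptr[j+1]; row_idx gives their rows, and
    w_idx holds 4-bit codebook indices instead of fp32 values."""
    y = np.zeros(n_rows)
    for j in np.nonzero(x)[0]:            # skip zero activations entirely
        lo, hi = col_ptr[j], col_ptr[j + 1]
        y[row_idx[lo:hi]] += codebook[w_idx[lo:hi]] * x[j]
    return y
```

Because only non-zero activations trigger work and each weight costs 4 bits plus a short index, the whole compressed model can sit in on-chip SRAM instead of DRAM, which is where most of the energy savings come from.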
Press coverage:
http://www.nextplatform.com/2015/12/08/emergent-chip-vastly-accelerates-deep-neural-networks/
https://www.oreilly.com/ideas/compressed-representations-in-the-age-of-big-data
http://techemergence.com/a-limitless-pill-for-deep-neural-networks/
https://news.ycombinator.com/item?id=10881683
From computation, to mobile computation, to intelligent mobile computation.
Model Compression:
[1] Han et al., "Learning both Weights and Connections for Efficient Neural Networks," NIPS 2015.
[2] Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016; Deep Learning Symposium, NIPS 2015.

Hardware Acceleration:
[3] Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA 2016.

CNN Architecture Design Space Exploration:
[4] Iandola, Han, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv 2016.
[5] Yao, Han, et al., "Hardware-friendly convolutional neural network with even-number filter size," ICLR 2016 workshop.

Contact: songhan@stanford.edu