PRUNING CONVOLUTIONAL NEURAL NETWORKS
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz
2017
WHY CAN WE PRUNE CNNS?
Optimization failures: some neurons are "dead" (little activation)
Small Dataset
Caltech-UCSD Birds (200 classes, <6000 images); example classes: Oriole, Goldfinch
[Diagram: Small Dataset → Training → Small Network → Accuracy, Size/Speed]

Small Dataset + Fine-tuning a Large Pretrained Network (AlexNet, VGG16, ResNet)
[Diagram: Small Dataset → Fine-tuning → Large Pretrained Network → Accuracy, Size/Speed]

Small Dataset + Large Pretrained Network + Pruning
[Diagram: Small Dataset → Large Pretrained Network → Fine-tuning + Pruning → Smaller Network → Accuracy, Size/Speed]
Ratio of floating point operations:

             Convolutional layers   Fully connected layers
VGG16        99%                    1%
AlexNet      89%                    11%
R3DCNN       90%                    10%

To reduce computation, we focus pruning on convolutional units (our focus). A rough count, sketched below, shows why.
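A back-of-the-envelope sketch of why conv layers dominate, using the standard published VGG16 layer shapes (the helper names here are ours, not from the talk):

```python
# Rough FLOP (multiply-accumulate) counts for conv vs. fully connected layers.

def conv_macs(c_in, c_out, k, h_out, w_out):
    # each output pixel of each output channel needs k*k*c_in multiply-adds
    return c_in * c_out * k * k * h_out * w_out

def fc_macs(n_in, n_out):
    # one multiply-add per weight
    return n_in * n_out

# VGG16's first conv layer: 3 -> 64 channels, 3x3 kernels, 224x224 output
print(conv_macs(3, 64, 3, 224, 224))    # ~87M MACs
# A mid-network conv layer: 512 -> 512 channels at 28x28 resolution
print(conv_macs(512, 512, 3, 28, 28))   # ~1.85G MACs, and there are 13 conv layers
# VGG16's largest fully connected layer: 25088 -> 4096
print(fc_macs(25088, 4096))             # ~103M MACs, but only 3 fc layers total
```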
Pruning granularity:
- Individual connections between neurons/feature maps: requires specialized SW/HW for full speed-up
- Whole neurons/feature maps: our focus (see the sketch below)
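To illustrate the difference, a minimal PyTorch sketch (ours, not the talk's code): zeroing individual weights leaves the tensor shape unchanged, so dense hardware does the same work, while removing a whole feature map yields a smaller dense layer that is faster everywhere:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Fine-grained pruning: zero individual weights. Speed-ups need sparse SW/HW.
with torch.no_grad():
    conv.weight *= (torch.rand_like(conv.weight) > 0.5)

# Coarse pruning: remove an entire output feature map. (The next layer's
# input channels must be shrunk to match, which is omitted here.)
def drop_feature_map(conv, idx):
    keep = [i for i in range(conv.out_channels) if i != idx]
    smaller = nn.Conv2d(conv.in_channels, conv.out_channels - 1,
                        conv.kernel_size, conv.stride, conv.padding)
    with torch.no_grad():
        smaller.weight.copy_(conv.weight[keep])
        smaller.bias.copy_(conv.bias[keep])
    return smaller

conv = drop_feature_map(conv, idx=3)
print(conv.out_channels)  # 127
```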
Notation:
$C$: training cost function
$\mathcal{D}$: training data
$W$: network weights
$W'$: pruned network weights

Training: $\min_W C(\mathcal{D}|W)$

Pruning: $\min_{W'} |C(\mathcal{D}|W') - C(\mathcal{D}|W)|$ subject to $\|W'\|_0 \le B$

($\|\cdot\|_0$: the $\ell_0$ norm, i.e. the number of non-zero elements)
Exact solution: a combinatorial optimization problem → too computationally expensive.
$2^W = 2^{4224} \approx 3.55 \times 10^{1271}$ possible subnetworks (VGG16 has 4,224 convolutional feature maps)
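That count is easy to reproduce (a quick sanity check, assuming the units chosen among are VGG16's convolutional feature maps):

```python
# Number of ways to choose a subset of VGG16's convolutional feature maps.
n_maps = 2 * 64 + 2 * 128 + 3 * 256 + 6 * 512   # VGG16 conv feature maps = 4224
subsets = 2 ** n_maps                            # every map is kept or removed
print(n_maps)             # 4224
print(len(str(subsets)))  # 1272 digits: hopeless to enumerate exhaustively
```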
Greedy pruning instead: remove one unit at a time (the same independence assumption as in back-propagation).
Algorithm:
1) Estimate the importance of neurons (units)
2) Rank the units
3) Remove the least important unit
4) Fine-tune the network for K iterations
5) Go back to step 1
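In code, the loop might look like this minimal sketch; `estimate_importance`, `remove_unit`, `fine_tune`, and `count_units` are hypothetical callbacks standing in for the criterion, the network surgery, and the training step:

```python
def greedy_prune(network, data, target_units, *, estimate_importance,
                 remove_unit, fine_tune, count_units, K=100):
    """Greedy pruning: repeatedly drop the least important unit, then recover."""
    while count_units(network) > target_units:
        scores = estimate_importance(network, data)   # 1) importance per unit
        weakest = min(scores, key=scores.get)         # 2) rank, take the minimum
        remove_unit(network, weakest)                 # 3) remove least important unit
        fine_tune(network, data, iterations=K)        # 4) fine-tune for K iterations
    return network                                    # 5) loop back to step 1
```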
Method                            Test accuracy
*SIFT+SVM                         19%
CNN trained from scratch          25%
*OverFeat+SVM                     62%
Our baseline (VGG16 fine-tuned)   72.2%
R-CNN                             74%
*Pose-CNN                         76%
*R-CNN+                           82%

(* requires additional attributes)
[Figure: oracle importance of units, from the first layer to the last layer]
[Figure: unit rank per layer (lower is better); only convolutional layers shown]
Absolute difference in cost from removing a neuron:
$|\Delta C(h)| = |C(\mathcal{D}, h = 0) - C(\mathcal{D}, h)| \approx \left|\frac{\partial C}{\partial h} \, h\right|$ (higher-order terms ignored)
$\partial C / \partial h$: gradient of the cost w.r.t. the activation; $h$: the unit's output. Both are computed during standard backprop.
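A sketch of how this criterion could be computed for one conv layer in PyTorch (variable names are ours; `activation` must have `retain_grad()` called before the backward pass):

```python
import torch

def taylor_importance(activation):
    """First-order Taylor criterion, one score per feature map.

    activation: layer output of shape (batch, channels, H, W) whose .grad
    was populated by a standard backward pass (call activation.retain_grad()
    before loss.backward()).
    """
    contribution = activation * activation.grad   # dC/dh * h, elementwise
    spatial_avg = contribution.mean(dim=(2, 3))   # average over pixels
    return spatial_avg.abs().mean(dim=0)          # |.|, then average over the batch

# usage sketch:
#   activation.retain_grad(); loss.backward()
#   scores = taylor_importance(activation)   # one importance score per feature map
```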
OBD (Optimal Brain Damage): second-order expansion. At a converged solution the first-order term is assumed to be zero and is ignored, so OBD keeps the second-order term, which needs extra computation of second-order derivatives.

We propose: the absolute value of the first-order expansion.
Let $z = \frac{\partial C}{\partial h} h$. For a perfectly trained model $\mathbb{E}[z] = 0$, but $\mathbb{E}[|z|] = \sigma\sqrt{2/\pi}$ if $z$ is Gaussian, so $|z|$ still carries signal.

✓ No extra computations
✓ We look at the absolute difference
✗ Can't predict the exact change in loss
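The Gaussian claim is easy to verify numerically (a quick check, not from the talk):

```python
import math
import random

# If z ~ N(0, sigma^2) then E[z] = 0, but E[|z|] = sigma * sqrt(2/pi), so the
# absolute first-order term still ranks units even at a training optimum.
sigma = 1.0
mean_abs = sum(abs(random.gauss(0.0, sigma)) for _ in range(1_000_000)) / 1_000_000
print(mean_abs)                         # ~0.798 empirically
print(sigma * math.sqrt(2 / math.pi))   # 0.7978... analytically
```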
Mean rank correlation with the oracle (across layers):

Min weight         0.27
Activation         0.56
OBD                0.59
Taylor expansion   0.73

[Figure: per-layer correlation with the oracle for each criterion (Min weight, Activation, OBD, Taylor expansion), conv layers 1-13]
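These numbers are Spearman rank correlations between each criterion's ranking and the oracle's; a sketch of the measurement (the scores below are made-up placeholders):

```python
from scipy.stats import spearmanr

oracle = [0.9, 0.1, 0.5, 0.7, 0.3]   # hypothetical oracle importance per unit
taylor = [0.8, 0.2, 0.4, 0.9, 0.3]   # hypothetical Taylor-criterion scores

rho, _ = spearmanr(oracle, taylor)   # 1.0 would match the oracle perfectly
print(rho)                           # 0.9 for these placeholder scores
```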
[Figure: VGG16, FLOPs per unit by layer #]
[Figure: GFLOPs vs. number of convolutional kernels during pruning]
Dataset: 102 classes, ~2k training images, ~6k testing images
Varying the number of fine-tuning updates between pruning iterations: 10, 30, 60, 1000
Result: 3.8× reduction in FLOPs, 2.4× actual speed-up.
[Figure: top-5 accuracy on the validation set during pruning]
GFLOPs   FLOPs reduction   Actual speed-up   Top-5 (validation set)
31       1×                1×                89.5%
12       2.6×              2.5×              —
8        3.9×              3.3×              —
3D-CNN with recurrent layers fine-tuned for 25 dynamic gestures
P. Molchanov, "Gesture recognition with 3D CNNs," GTC 2016
[Figure: reduction in FLOPs vs. drop in accuracy and actual speed-up]
Kaggle Dogs vs. Cats classification (Marco Lugo's solution, 3rd place): 25,000 images
[Figure: full network vs. pruned network]
[Figure: number of convolutional units vs. pruning iteration]
Full network: 52,672 units → Pruned network: 3,472 units (≈15× compression)