
PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz. 2017.


  1. PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz. 2017.

  2. WHY CAN WE PRUNE CNNS?

  3. WHY CAN WE PRUNE CNNS?
     Optimization "failures":
     • Some neurons are "dead": little activation
     • Some neurons are uncorrelated with the output
     Modern CNNs are overparameterized:
     • VGG16 has 138M parameters
     • AlexNet has 61M parameters
     • ImageNet has only 1.2M images

  4. PRUNING FOR TRANSFER LEARNING
     Small dataset: Caltech-UCSD Birds (200 classes, <6000 images)

  5. PRUNING FOR TRANSFER LEARNING
     [Diagram: training a small network from scratch on the small dataset, judged on accuracy (telling orioles from goldfinches) and on size/speed.]

  6. PRUNING FOR TRANSFER LEARNING
     [Diagram: fine-tuning a large pretrained network (AlexNet, VGG16, ResNet) on the small dataset, judged on accuracy and on size/speed.]

  7. PRUNING FOR TRANSFER LEARNING
     [Diagram: fine-tuning a large pretrained network (AlexNet, VGG16, ResNet) on the small dataset, then pruning it into a smaller network, judged on accuracy and on size/speed.]

  8. TYPES OF UNITS
     • Convolutional units (our focus): heavy on computation, small on storage
     • Fully connected (dense) units: fast to compute, heavy on storage
     Ratio of floating point operations, convolutional vs. fully connected layers:
     VGG16: 99% / 1%; AlexNet: 89% / 11%; R3DCNN: 90% / 10%
     To reduce computation, we focus pruning on convolutional units.

  9. TYPES OF PRUNING
     (Spectrum: fine pruning - coarse pruning - no pruning)
     Fine pruning:
     • Removes individual connections between neurons/feature maps
     • May require special SW/HW for a full speed-up
     Coarse pruning (our focus):
     • Removes entire neurons/feature maps
     • Instant speed-up, no change to HW/SW needed

  10. NETWORK PRUNING

  11. NETWORK PRUNING
      Training: $\min_{X} D(X, \mathcal{E})$
      $D$: training cost function; $\mathcal{E}$: training data; $X$: network weights; $X'$: pruned network weights.

  12. NETWORK PRUNING
      Training: $\min_{X} D(X, \mathcal{E})$
      Pruning: $\min_{X'} \left| D(X', \mathcal{E}) - D(X, \mathcal{E}) \right|$ s.t. $X' \subset X$, $\|X'\|_0 \le C$
      $D$: training cost function; $\mathcal{E}$: training data; $X$: network weights; $X'$: pruned network weights.

  13. NETWORK PRUNING
      Training: $\min_{X} D(X, \mathcal{E})$
      Pruning: $\min_{X'} \left| D(X', \mathcal{E}) - D(X, \mathcal{E}) \right|$ s.t. $X' \subset X$, $\|X'\|_0 \le C$
      $D$: training cost function; $\mathcal{E}$: training data; $X$: network weights; $X'$: pruned network weights; $\|\cdot\|_0$: the $\ell_0$ norm, i.e. the number of non-zero elements.

  14. NETWORK PRUNING
      Exact solution: a combinatorial optimization problem, too computationally expensive.
      • VGG-16 has $|X| = 4224$ convolutional units, so there are $2^{4224} \approx 3.6 \times 10^{1271}$ possible subsets of units.

  15. NETWORK PRUNING
      Exact solution: a combinatorial optimization problem, too computationally expensive ($2^{4224} \approx 3.6 \times 10^{1271}$ possible subsets of VGG-16's 4224 convolutional units).
      Greedy pruning:
      • Assumes all neurons are independent (the same assumption made for backpropagation)
      • Iteratively removes the neuron with the smallest contribution
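      As a quick arithmetic check (not on the slide), the size of that search space can be counted directly:

      # Number of possible subsets of VGG-16's 4224 prunable convolutional units.
      n_units = 4224
      n_subsets = 2 ** n_units        # each unit is either kept or removed
      print(len(str(n_subsets)))      # 1272 decimal digits, roughly 3.6e1271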

  16. GREEDY NETWORK PRUNING
      Iterative pruning algorithm:
      1) Estimate the importance of neurons (units)
      2) Rank the units
      3) Remove the least important unit
      4) Fine-tune the network for K iterations
      5) Go back to step 1)
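      A minimal Python sketch of this loop (illustrative only; num_units, estimate_importance, remove_unit, and fine_tune are hypothetical helpers standing in for the framework-specific pieces):

      def prune(model, data, target_units, K=30):
          """Greedy iterative pruning: drop one unit at a time, fine-tune, repeat."""
          while num_units(model) > target_units:
              scores = estimate_importance(model, data)    # 1) importance per unit
              weakest = min(scores, key=scores.get)        # 2) rank: pick the least important
              remove_unit(model, weakest)                   # 3) remove it
              fine_tune(model, data, iterations=K)          # 4) brief fine-tuning
          return model                                      # 5) repeat until the budget is met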

  17. ORACLE

  18. ORACLE
      Caltech-UCSD Birds-200-2011 dataset:
      • 200 classes, <6000 training images
      Method                               Test accuracy
      S. Belongie et al., *SIFT+SVM        19%
      CNN trained from scratch             25%
      S. Razavian et al., *OverFeat+SVM    62%
      Our baseline, VGG16 fine-tuned       72.2%
      N. Zhang et al., R-CNN               74%
      S. Branson et al., *Pose-CNN         76%
      J. Krause et al., *R-CNN+            82%
      (* requires additional attributes)

  19. ORACLE
      VGG16 on the Birds-200 dataset:
      • Exhaustively computed the change in loss from removing each unit, one at a time
      [Plot: per-unit change in loss, units ordered from the first layer to the last layer.]
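      A sketch of how such an oracle could be computed (not the authors' code; list_conv_units, zero_out_unit, and evaluate_loss are hypothetical helpers):

      def compute_oracle(model, data):
          """Change in loss caused by removing each convolutional unit in isolation."""
          base_loss = evaluate_loss(model, data)
          oracle = {}
          for unit in list_conv_units(model):
              with zero_out_unit(model, unit):              # temporarily mask this unit's output
                  oracle[unit] = abs(evaluate_loss(model, data) - base_loss)
          return oracle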

  20. ORACLE
      VGG-16 on Birds-200 (only convolutional layers):
      • On average, the first layers are more important
      • Every layer has very important units
      • Every layer has unimportant units
      • Layers with pooling are more important
      [Plot: oracle rank per layer (lower is better) vs. layer #.]

  21. APPROXIMATING THE ORACLE

  22. APPROXIMATING THE ORACLE
      Candidate criteria:
      • Average activation (discard units with lower activations)
      • Minimum weight (discard units with a lower $\ell_2$ norm of their weights)
      • First-order Taylor expansion (TE): the absolute difference in cost caused by removing a neuron,
        $\Theta_{TE}(h_j) = \left| D(h_j = 0) - D(h_j) \right| \approx \left| \frac{\partial D}{\partial h_j} \, h_j \right|$,
        where $\partial D / \partial h_j$ is the gradient of the cost w.r.t. the activation $h_j$ (the unit's output) and higher-order terms are ignored. Both factors are computed during standard backprop.
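      A minimal PyTorch-style sketch of this criterion (an illustration, not the authors' code), assuming the layer's activation and its gradient have been captured, e.g. with hooks, as tensors of shape (N, C, H, W):

      import torch

      def taylor_criterion(activation: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
          """First-order Taylor importance: one score per feature map, shape (C,).

          Averages dD/dh * h over spatial positions, takes the absolute value,
          then averages over the batch.
          """
          per_map = (activation * grad).mean(dim=(2, 3))   # (N, C)
          return per_map.abs().mean(dim=0)                 # (C,)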

  23. APPROXIMATING THE ORACLE
      Candidate criteria (continued):
      • Alternative: Optimal Brain Damage (OBD), Y. LeCun et al., 1990
      • Uses second-order derivatives to estimate the importance of a neuron:
        $\Theta_{OBD}(h_j) = \frac{1}{2} \frac{\partial^2 D}{\partial h_j^2} \, h_j^2$,
        from a second-order expansion in which the first-order term is assumed to be $0$ (converged model) and higher-order terms are ignored.
      • Needs extra computation of second-order derivatives.

  24. APPROXIMATING THE ORACLE
      Comparison to OBD:
      • OBD: second-order expansion, with the first-order term assumed to be $0$ for a perfectly trained model
      • We propose: the absolute value of the first-order expansion
      Writing $y = \frac{\partial D}{\partial h_j} \, h_j$: for a perfectly trained model $\mathbb{E}[y] = 0$, but $\mathbb{E}[|y|] = \sigma \sqrt{2/\pi}$ if $y$ is Gaussian with variance $\sigma^2$.

  25. APPROXIMATING THE ORACLE
      Comparison to OBD (continued):
      • OBD: second-order expansion, with the first-order term assumed to be $0$ for a perfectly trained model
      • We propose: the absolute value of the first-order expansion, $y = \frac{\partial D}{\partial h_j} \, h_j$
      • Pro: no extra computations; we look at the absolute difference in cost
      • Con: can't predict the exact change in loss
      For a perfectly trained model $\mathbb{E}[y] = 0$, yet $\mathbb{E}[|y|] = \sigma \sqrt{2/\pi}$ if $y$ is Gaussian with variance $\sigma^2$, so the criterion stays informative.
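      For reference, the Gaussian identity used above (a standard result, not derived on the slide): for zero-mean $y$ with variance $\sigma^2$,

      \[
      \mathbb{E}\,[|y|]
        = \int_{-\infty}^{\infty} |y| \, \frac{1}{\sigma\sqrt{2\pi}} \, e^{-y^2/(2\sigma^2)} \, dy
        = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} y \, e^{-y^2/(2\sigma^2)} \, dy
        = \frac{2}{\sigma\sqrt{2\pi}} \, \sigma^2
        = \sigma \sqrt{\tfrac{2}{\pi}} .
      \]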

  26. EVALUATING PRUNING CRITERIA
      Spearman's rank correlation with the oracle, VGG16 on Birds-200:
      [Bar chart: mean rank correlation with the oracle across layers for Min weight, Activation, OBD, and Taylor expansion; values read from the chart are 0.27, 0.56, 0.59, and 0.73, with the Taylor expansion criterion the highest.]
      [Line plot: per-layer correlation with the oracle, layers 1-13.]
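      A small sketch of how this comparison could be computed (illustrative only; per-unit scores are assumed to be stored in dicts keyed by unit id):

      from scipy.stats import spearmanr

      def rank_correlation(criterion_scores, oracle_scores):
          """Spearman's rank correlation between a pruning criterion and the oracle."""
          units = sorted(oracle_scores)                    # fixed unit order
          rho, _ = spearmanr([criterion_scores[u] for u in units],
                             [oracle_scores[u] for u in units])
          return rho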

  27. EVALUATING PRUNING CRITERIA
      Pruning with an objective:
      • Regularize the criterion with an objective term (a sketch follows below)
      • The regularizer can be: FLOPs, memory, bandwidth, or exact inference time on the target device
      [Bar chart: FLOPs per unit for each convolutional layer of VGG16.]
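      One simple form such a regularizer could take (an assumption, not spelled out on the slide; lam is a hypothetical trade-off weight):

      def regularized_score(importance, flops_per_unit, lam=1e-3):
          """Bias the importance criterion by a resource cost, here FLOPs per unit."""
          # Units in computationally expensive layers receive a larger penalty,
          # so pruning preferentially removes them.
          return importance - lam * flops_per_unit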

  28. RESULTS

  29. RESULTS
      VGG16 on the Birds-200 dataset:
      • Remove 1 convolutional unit every 30 updates

  30. RESULTS
      VGG16 on the Birds-200 dataset:
      • Training from scratch doesn't work
      • Taylor expansion gives the best result of any pruning metric
      [Plots: accuracy vs. number of convolutional kernels and vs. GFLOPs.]

  31. RESULTS
      AlexNet on Oxford Flowers-102 (102 classes, ~2k training images, ~6k testing images):
      • Varying the number of updates between pruning iterations: 1000, 60, 30, and 10 updates

  32. RESULTS
      AlexNet on Oxford Flowers-102 (102 classes, ~2k training images, ~6k testing images):
      • Varying the number of updates between pruning iterations: 1000, 60, 30, and 10 updates
      • 3.8x FLOPs reduction, 2.4x actual speed-up

  33. RESULTS
      VGG16 on ImageNet (top-5 accuracy, validation set):
      • Pruned over 7 epochs

  34. RESULTS
      VGG16 on ImageNet (top-5 accuracy, validation set):
      • Pruned over 7 epochs
      • Fine-tuned for 7 epochs
      GFLOPs   FLOPs reduction   Actual speed-up   Top-5
      31       1x                -                 89.5%
      12       2.6x              2.5x              -2.5%
      8        3.9x              3.3x              -5.0%
