SLIDE 1

PRUNING CONVOLUTIONAL NEURAL NETWORKS

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz

2017

SLIDE 2

WHY CAN WE PRUNE CNNS?

SLIDE 3

WHY CAN WE PRUNE CNNS?

Optimization "failures":

  • Some neurons are "dead": little activation
  • Some neurons are uncorrelated with output

Modern CNNs are overparameterized:

  • VGG16 has 138M parameters
  • AlexNet has 61M parameters
  • Yet ImageNet has only 1.2M training images
SLIDE 4

PRUNING FOR TRANSFER LEARNING

Small dataset: Caltech-UCSD Birds (200 classes, <6,000 images)

SLIDE 5

PRUNING FOR TRANSFER LEARNING

[Diagram: Small Dataset (Oriole, Goldfinch) → Small Network Training → Accuracy, Size/Speed]

SLIDE 6

PRUNING FOR TRANSFER LEARNING

[Diagram: Small Dataset (Oriole, Goldfinch) → Fine-tuning a Large Pretrained Network (AlexNet, VGG16, ResNet) → Accuracy, Size/Speed]

SLIDE 7

PRUNING FOR TRANSFER LEARNING

[Diagram: Small Dataset (Oriole, Goldfinch) → Fine-tuning a Large Pretrained Network (AlexNet, VGG16, ResNet) → Pruning → Smaller Network → Accuracy, Size/Speed]

SLIDE 8

TYPES OF UNITS

  • Convolutional units
      • Heavy on computation
      • Light on storage
  • Fully connected (dense) units
      • Fast to compute
      • Heavy on storage

Ratio of floating-point operations:

             Convolutional layers   Fully connected layers
    VGG16            99%                     1%
    AlexNet          89%                    11%
    R3DCNN           90%                    10%

To reduce computation, we focus pruning on convolutional units.
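These ratios follow from how each layer type spends its operations. A rough sketch of the counting (my helper functions, counting each multiply-add as 2 FLOPs and using standard VGG16 layer shapes):

```python
def conv_flops(k, c_in, c_out, h_out, w_out):
    # Each output value needs k*k*c_in multiply-adds; count each as 2 FLOPs.
    return 2 * k * k * c_in * c_out * h_out * w_out

def fc_flops(n_in, n_out):
    # A dense layer is one matrix-vector product: n_in * n_out multiply-adds.
    return 2 * n_in * n_out

# VGG16's very first conv layer already rivals its largest dense layer:
print(conv_flops(3, 3, 64, 224, 224))  # ~0.17 GFLOPs for conv1_1
print(fc_flops(25088, 4096))           # ~0.21 GFLOPs for fc6
```

Summed over all thirteen conv layers, convolutions dominate the total, which is why pruning targets them.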

SLIDE 9

TYPES OF PRUNING

[Diagram: no pruning vs. fine pruning vs. coarse pruning]

Fine pruning:
  • Removes individual connections between neurons/feature maps
  • May require special SW/HW for the full speed-up

Coarse pruning (our focus):
  • Removes entire neurons/feature maps
  • Instant speed-up
  • No change to HW/SW

SLIDE 10

NETWORK PRUNING

SLIDE 11

NETWORK PRUNING

Notation: $D$ is the training cost function, $\mathcal{D}$ the training data, $X$ the network weights, and $X'$ the pruned network weights.

Training: $\min_{X} D(X, \mathcal{D})$

SLIDE 12

NETWORK PRUNING

Training: $\min_{X} D(X, \mathcal{D})$

Pruning: $\min_{X'} \left| D(X', \mathcal{D}) - D(X, \mathcal{D}) \right|$ subject to $X' \subset X$, $|X'| < C$

SLIDE 13

NETWORK PRUNING

Pruning: $\min_{X'} \left| D(X', \mathcal{D}) - D(X, \mathcal{D}) \right|$ subject to $X' \subset X$, $\|X'\|_0 \le C$

Here $\|\cdot\|_0$ is the $\ell_0$ norm: the number of non-zero elements.

SLIDE 14

NETWORK PRUNING

Exact solution: a combinatorial optimization problem, far too expensive to solve.

  • VGG-16 has $|X| = 4224$ convolutional units, giving $2^{4224} \approx 3.6 \times 10^{1271}$ possible subsets of units to evaluate.

SLIDE 15

NETWORK PRUNING

Exact solution: a combinatorial optimization problem, far too expensive to solve ($2^{4224}$ subsets for VGG-16's 4224 convolutional units).

Greedy pruning:
  • Assumes all neurons are independent (the same assumption backpropagation makes)
  • Iteratively removes the neuron with the smallest contribution

SLIDE 16

GREEDY NETWORK PRUNING

Iterative pruning algorithm:
  1) Estimate the importance of neurons (units)
  2) Rank the units
  3) Remove the least important unit
  4) Fine-tune the network for K iterations
  5) Go back to step 1
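A minimal sketch of this loop in Python; `estimate_importance`, `remove_unit`, and `finetune` are hypothetical helpers standing in for the criterion, the network surgery, and the SGD fine-tuning steps:

```python
def greedy_prune(model, data, units_to_remove, K=30):
    """Greedy iterative pruning (sketch, not the authors' code).

    estimate_importance, remove_unit, and finetune are hypothetical
    helpers for the importance criterion, network surgery, and SGD.
    """
    for _ in range(units_to_remove):
        scores = estimate_importance(model, data)  # 1) per-unit importance
        worst = min(scores, key=scores.get)        # 2) rank: find weakest unit
        remove_unit(model, worst)                  # 3) prune it
        finetune(model, data, iterations=K)        # 4) recover, then repeat
    return model
```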

SLIDE 17

ORACLE

SLIDE 18

ORACLE

Caltech-UCSD Birds-200-2011 dataset:
  • 200 classes
  • <6,000 training images

    Method                                   Test accuracy
    SIFT+SVM* (S. Belongie et al.)                19%
    CNN trained from scratch                      25%
    OverFeat+SVM* (S. Razavian et al.)            62%
    VGG16 fine-tuned (our baseline)               72.2%
    R-CNN (N. Zhang et al.)                       74%
    Pose-CNN* (S. Branson et al.)                 76%
    R-CNN+* (J. Krause et al.)                    82%

    * requires additional attributes

SLIDE 19

ORACLE

  • Exhaustively compute the change in loss caused by removing each unit, one at a time (VGG16 on the Birds-200 dataset)

[Plot: per-unit oracle importance, ordered from the first layer to the last layer]
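A minimal sketch of the oracle computation; `loss_on`, `all_units`, `zero_unit`, and `restore_unit` are hypothetical helpers (`loss_on` evaluates the cost $D$, `zero_unit` temporarily masks one feature map):

```python
def oracle_ranking(model, data):
    """Exhaustive oracle (sketch): loss change from removing each unit."""
    base = loss_on(model, data)
    importance = {}
    for unit in all_units(model):
        zero_unit(model, unit)            # remove this one unit in isolation
        importance[unit] = abs(loss_on(model, data) - base)
        restore_unit(model, unit)         # undo before testing the next unit
    return importance                     # lower score = prune earlier
```

This is far too slow to use in practice (one full evaluation per unit), which is why the next slides look for cheap approximations.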

SLIDE 20

ORACLE

VGG-16 on Birds-200 (convolutional layers only):
  • On average, the first layers are more important
  • Every layer has some very important units
  • Every layer has some unimportant units
  • Layers with pooling are more important

[Plot: oracle rank (lower is better) vs. layer #]

SLIDE 21

APPROXIMATING THE ORACLE

SLIDE 22

APPROXIMATING THE ORACLE

Candidate criteria:
  • Average activation (discard units with lower average activation)
  • Minimum weight (discard units with smaller $\ell_2$ norm of weights)
  • First-order Taylor expansion (TE), the absolute difference in cost from removing a neuron:

$\left| \Delta D(h_j) \right| = \left| D(\mathcal{D}, h_j = 0) - D(\mathcal{D}, h_j) \right| \approx \left| \frac{\partial D}{\partial h_j} h_j \right|$ (higher-order terms ignored)

Here $\frac{\partial D}{\partial h_j}$ is the gradient of the cost w.r.t. the activation $h_j$, and $h_j$ is the unit's output; both are computed during standard backprop.
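A minimal sketch of the TE criterion for one convolutional layer, assuming PyTorch tensors for the activation and its gradient (both can be captured during a normal backward pass, e.g. with hooks); the exact reduction over batch and spatial dimensions here is a simple variant, not necessarily the paper's:

```python
import torch

def taylor_criterion(activation: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """First-order Taylor importance, one score per feature map (sketch).

    activation: (N, C, H, W) output of a conv layer
    grad:       dD/dh for that activation, same shape; both come
                out of a standard backward pass (e.g. via hooks).
    """
    contribution = (activation * grad).sum(dim=(2, 3))  # accumulate over H, W
    return contribution.abs().mean(dim=0)               # |dD/dh * h|, batch mean
```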

SLIDE 23

APPROXIMATING THE ORACLE

Candidate criteria, an alternative: Optimal Brain Damage (OBD), Y. LeCun et al., 1990
  • Uses second-order derivatives to estimate the importance of neurons:

$\Delta D(h_j) \approx \frac{\partial D}{\partial h_j} h_j + \frac{1}{2} \frac{\partial^2 D}{\partial h_j^2} h_j^2$, where the first-order term is assumed to be $0$ for a converged model and higher-order terms are ignored.

  • Needs extra computation of second-order derivatives

SLIDE 24

APPROXIMATING THE ORACLE

Comparison to OBD:
  • OBD uses the second-order expansion and assumes the first-order term is $=0$ for a converged model
  • We propose the absolute value of the first-order expansion

Let $z = \frac{\partial D}{\partial h_j} h_j$. For a perfectly trained model $\mathbb{E}[z] = 0$, but the expected absolute value does not vanish: if $z$ is Gaussian with variance $\sigma^2$, then $\mathbb{E}|z| = \sigma\sqrt{2/\pi}$.
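For reference, here is the standard computation behind that last fact (the mean of a half-normal distribution; this derivation is added here, not on the slide):

```latex
% For z ~ N(0, sigma^2), the mean of |z| (half-normal distribution):
\mathbb{E}\lvert z \rvert
  = \int_{-\infty}^{\infty} \lvert z \rvert \,
      \frac{1}{\sigma\sqrt{2\pi}}\, e^{-z^2 / 2\sigma^2}\, dz
  = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} z\, e^{-z^2 / 2\sigma^2}\, dz
  = \frac{2}{\sigma\sqrt{2\pi}}\, \sigma^2
  = \sigma \sqrt{\frac{2}{\pi}} \;>\; 0 .
```

So even at a loss minimum, where the average first-order term is zero, its average magnitude is not, and that magnitude is exactly what the criterion ranks.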

SLIDE 25

APPROXIMATING THE ORACLE

Comparison to OBD (second-order expansion) vs. our proposal (absolute value of the first-order expansion), with $z = \frac{\partial D}{\partial h_j} h_j$:

  ✓ No extra computations beyond standard backprop
  ✓ We look at the absolute difference in cost
  ✗ Can't predict the exact change in loss, only its magnitude

For a perfectly trained model $\mathbb{E}[z] = 0$, yet $\mathbb{E}|z| > 0$ when $z$ is Gaussian.

SLIDE 26

EVALUATING PRUNING CRITERIA

Spearman's rank correlation with the oracle, VGG16 on Birds-200.

Mean rank correlation across layers:

    Criterion           Correlation with oracle
    Min weight                  0.27
    Activation                  0.56
    OBD                         0.59
    Taylor expansion            0.73

[Plot: per-layer correlation with the oracle for Min weight, Activation, OBD, and Taylor expansion across layers 1-13]
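Spearman correlation between a criterion's ranking and the oracle's is straightforward to compute; a small sketch with made-up scores (`scipy.stats.spearmanr` is the standard routine; the numbers are illustrative, not from the talk):

```python
from scipy.stats import spearmanr

# Illustrative per-unit scores for one layer (made up, not paper data).
oracle = [0.9, 0.1, 0.5, 0.7, 0.2]   # exhaustive |change in loss| per unit
taylor = [0.8, 0.2, 0.4, 0.6, 0.1]   # criterion being evaluated

rho, _ = spearmanr(oracle, taylor)
print(rho)  # 0.9: the criterion ranks units almost exactly like the oracle
```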

SLIDE 27

EVALUATING PRUNING CRITERIA

Pruning with an objective (VGG16):
  • Regularize the criterion with the objective (see the sketch below)
  • The regularizer can be:
      • FLOPs
      • Memory
      • Bandwidth
      • Target device
      • Exact inference time

[Plot: FLOPs per unit vs. layer #, layers 1-14]
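A minimal sketch of such a regularized criterion, assuming per-unit importance scores and a per-unit FLOPs cost are already available; the helper and the trade-off weight `lam` are illustrative, not values from the talk:

```python
def regularized_score(theta, unit_flops, lam=1e-3):
    """Rank units by importance minus a compute-cost penalty (sketch).

    theta:      importance of the unit (e.g. the Taylor criterion)
    unit_flops: FLOPs saved by removing the unit
    lam:        illustrative trade-off weight, not a value from the talk
    """
    # Any measurable cost (memory, bandwidth, device latency) can
    # replace FLOPs here; expensive-to-keep units get pruned sooner.
    return theta - lam * unit_flops
```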

SLIDE 28

RESULTS

SLIDE 29

RESULTS

VGG16 on the Birds-200 dataset:
  • Remove 1 convolutional unit every 30 updates
SLIDE 30

RESULTS

VGG16 on the Birds-200 dataset:
  • Training from scratch doesn't work
  • The Taylor criterion gives the best result of any pruning metric

[Plots: accuracy vs. GFLOPs and vs. # of convolutional kernels]

SLIDE 31

RESULTS

AlexNet on Oxford Flowers-102 (102 classes, ~2k training images, ~6k testing images)

[Plot: accuracy while varying the number of updates between pruning iterations: 10, 30, 60, and 1000 updates]

SLIDE 32

RESULTS

AlexNet on Oxford Flowers-102 (102 classes, ~2k training images, ~6k testing images)

[Plot: accuracy while varying the number of updates between pruning iterations: 10, 30, 60, and 1000 updates]

3.8x FLOPs reduction, 2.4x actual speed-up

SLIDE 33

RESULTS

VGG16 on ImageNet:
  • Pruned over 7 epochs

[Plot: top-5 accuracy on the validation set during pruning]

SLIDE 34

RESULTS

VGG16 on ImageNet:
  • Pruned over 7 epochs
  • Fine-tuned for 7 more epochs

    GFLOPs   FLOPs reduction   Actual speed-up   Top-5 (validation)
      31          1x                  -               89.5%
      12          2.6x               2.5x             -2.5%
       8          3.9x               3.3x             -5.0%

    (Top-5 changes are relative to the 89.5% baseline.)

SLIDE 35

RESULTS

R3DCNN for gesture recognition: a 3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures.

P. Molchanov, Gesture recognition with 3D CNNs, GTC 2016

SLIDE 36

RESULTS

R3DCNN for gesture recognition: a 3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures.

P. Molchanov, Gesture recognition with 3D CNNs, GTC 2016

  • 12.6x reduction in FLOPs
  • 2.5% drop in accuracy
  • 5.2x speed-up

SLIDE 37

How many neurons do we need to classify a cat?

SLIDE 38

DOGS VS. CATS

Kaggle Dogs vs. Cats classification (25,000 images); starting from Marco Lugo's 3rd-place solution.

SLIDE 39

DOGS VS. CATS

Fine-tuned ResNet-101:
  • Full network: 99.2%
  • Pruned network: 99.0%

SLIDE 40

DOGS VS. CATS

Fine-tuned ResNet-101:
  • Full network: 99.2%
  • Pruned network: 99.0%
  • 15x compression

[Plot: number of convolutional units vs. pruning iteration, falling from 52,672 units to 3,472 units]

SLIDE 41

CONCLUSIONS

  • Pruning as greedy feature selection
  • New criteria based on Taylor expansion
  • Pruning is especially effective (and necessary!) for transfer learning
  • Pruning can incorporate desired objectives (such as FLOPs)
  • Read more in our ICLR2017 paper: https://openreview.net/pdf?id=SJGCiw5gl
SLIDE 42

THANK YOU!