SLIDE 1

PRUNING CONVOLUTIONAL NEURAL NETWORKS

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz

2017

SLIDE 2

WHY CAN WE PRUNE CNNS?

SLIDE 3

WHY CAN WE PRUNE CNNS?

Optimization "failures":

  • Some neurons are "dead": little activation
  • Some neurons are uncorrelated with output

Modern CNNs are overparameterized:

  • VGG16 has 138M parameters
  • AlexNet has 61M parameters
  • Yet ImageNet has only 1.2M training images
SLIDE 4

PRUNING FOR TRANSFER LEARNING

Small dataset: Caltech-UCSD Birds (200 classes, <6,000 images)

SLIDE 5

PRUNING FOR TRANSFER LEARNING

[Diagram: Small Dataset (Oriole, Goldfinch) → Small Network Training → Accuracy, Size/Speed]

SLIDE 6

PRUNING FOR TRANSFER LEARNING

[Diagram: Small Dataset (Oriole, Goldfinch) → Fine-tuning a Large Pretrained Network (AlexNet, VGG16, ResNet) → Accuracy, Size/Speed]

SLIDE 7

PRUNING FOR TRANSFER LEARNING

[Diagram: Small Dataset (Oriole, Goldfinch) → Fine-tuning a Large Pretrained Network (AlexNet, VGG16, ResNet) → Pruning → Smaller Network → Accuracy, Size/Speed]

SLIDE 8

TYPES OF UNITS

  • Convolutional units
      • Heavy on computation
      • Light on storage
  • Fully connected (dense) units
      • Fast to compute
      • Heavy on storage

Ratio of floating-point operations:

             Convolutional layers   Fully connected layers
    VGG16            99%                     1%
    AlexNet          89%                    11%
    R3DCNN           90%                    10%

To reduce computation, we focus pruning on convolutional units.
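These ratios follow from how each layer type spends its operations. A rough sketch of the counting (my helper functions, counting each multiply-add as 2 FLOPs and using standard VGG16 layer shapes):

```python
def conv_flops(k, c_in, c_out, h_out, w_out):
    # Each output value needs k*k*c_in multiply-adds; count each as 2 FLOPs.
    return 2 * k * k * c_in * c_out * h_out * w_out

def fc_flops(n_in, n_out):
    # A dense layer is one matrix-vector product: n_in * n_out multiply-adds.
    return 2 * n_in * n_out

# VGG16's very first conv layer already rivals its largest dense layer:
print(conv_flops(3, 3, 64, 224, 224))  # ~0.17 GFLOPs for conv1_1
print(fc_flops(25088, 4096))           # ~0.21 GFLOPs for fc6
```

Summed over all thirteen conv layers, convolutions dominate the total, which is why pruning targets them.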

SLIDE 9

TYPES OF PRUNING

[Diagram: no pruning vs. fine pruning vs. coarse pruning]

Fine pruning:
  • Removes individual connections between neurons/feature maps
  • May require special SW/HW for the full speed-up

Coarse pruning (our focus):
  • Removes entire neurons/feature maps
  • Instant speed-up
  • No change to HW/SW

SLIDE 10

NETWORK PRUNING

SLIDE 11

NETWORK PRUNING

Notation: $D$ is the training cost function, $\mathcal{D}$ the training data, $X$ the network weights, and $X'$ the pruned network weights.

Training: $\min_{X} D(X, \mathcal{D})$

SLIDE 12

NETWORK PRUNING

Training: $\min_{X} D(X, \mathcal{D})$

Pruning: $\min_{X'} \left| D(X', \mathcal{D}) - D(X, \mathcal{D}) \right|$ subject to $X' \subset X$, $|X'| < C$

SLIDE 13

NETWORK PRUNING

Pruning: $\min_{X'} \left| D(X', \mathcal{D}) - D(X, \mathcal{D}) \right|$ subject to $X' \subset X$, $\|X'\|_0 \le C$

Here $\|\cdot\|_0$ is the $\ell_0$ norm: the number of non-zero elements.

SLIDE 14

NETWORK PRUNING

Exact solution: a combinatorial optimization problem, far too expensive to solve.

  • VGG-16 has $|X| = 4224$ convolutional units, giving $2^{4224} \approx 3.6 \times 10^{1271}$ possible subsets of units to evaluate.

SLIDE 15

NETWORK PRUNING

Exact solution: a combinatorial optimization problem, far too expensive to solve ($2^{4224}$ subsets for VGG-16's 4224 convolutional units).

Greedy pruning:
  • Assumes all neurons are independent (the same assumption backpropagation makes)
  • Iteratively removes the neuron with the smallest contribution

SLIDE 16

GREEDY NETWORK PRUNING

Iterative pruning algorithm:
  1) Estimate the importance of neurons (units)
  2) Rank the units
  3) Remove the least important unit
  4) Fine-tune the network for K iterations
  5) Go back to step 1
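A minimal sketch of this loop in Python; `estimate_importance`, `remove_unit`, and `finetune` are hypothetical helpers standing in for the criterion, the network surgery, and the SGD fine-tuning steps:

```python
def greedy_prune(model, data, units_to_remove, K=30):
    """Greedy iterative pruning (sketch, not the authors' code).

    estimate_importance, remove_unit, and finetune are hypothetical
    helpers for the importance criterion, network surgery, and SGD.
    """
    for _ in range(units_to_remove):
        scores = estimate_importance(model, data)  # 1) per-unit importance
        worst = min(scores, key=scores.get)        # 2) rank: find weakest unit
        remove_unit(model, worst)                  # 3) prune it
        finetune(model, data, iterations=K)        # 4) recover, then repeat
    return model
```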

SLIDE 17

ORACLE

SLIDE 18

ORACLE

Caltech-UCSD Birds-200-2011 dataset:
  • 200 classes
  • <6,000 training images

    Method                                   Test accuracy
    SIFT+SVM* (S. Belongie et al.)                19%
    CNN trained from scratch                      25%
    OverFeat+SVM* (S. Razavian et al.)            62%
    VGG16 fine-tuned (our baseline)               72.2%
    R-CNN (N. Zhang et al.)                       74%
    Pose-CNN* (S. Branson et al.)                 76%
    R-CNN+* (J. Krause et al.)                    82%

    * requires additional attributes

SLIDE 19

ORACLE

  • Exhaustively compute the change in loss caused by removing each unit, one at a time (VGG16 on the Birds-200 dataset)

[Plot: per-unit oracle importance, ordered from the first layer to the last layer]
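A minimal sketch of the oracle computation; `loss_on`, `all_units`, `zero_unit`, and `restore_unit` are hypothetical helpers (`loss_on` evaluates the cost $D$, `zero_unit` temporarily masks one feature map):

```python
def oracle_ranking(model, data):
    """Exhaustive oracle (sketch): loss change from removing each unit."""
    base = loss_on(model, data)
    importance = {}
    for unit in all_units(model):
        zero_unit(model, unit)            # remove this one unit in isolation
        importance[unit] = abs(loss_on(model, data) - base)
        restore_unit(model, unit)         # undo before testing the next unit
    return importance                     # lower score = prune earlier
```

This is far too slow to use in practice (one full evaluation per unit), which is why the next slides look for cheap approximations.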

SLIDE 20

ORACLE

VGG-16 on Birds-200 (convolutional layers only):
  • On average, the first layers are more important
  • Every layer has some very important units
  • Every layer has some unimportant units
  • Layers with pooling are more important

[Plot: oracle rank (lower is better) vs. layer #]

SLIDE 21

APPROXIMATING THE ORACLE

SLIDE 22

APPROXIMATING THE ORACLE

Candidate criteria:
  • Average activation (discard units with lower average activation)
  • Minimum weight (discard units with smaller $\ell_2$ norm of weights)
  • First-order Taylor expansion (TE), the absolute difference in cost from removing a neuron:

$\left| \Delta D(h_j) \right| = \left| D(\mathcal{D}, h_j = 0) - D(\mathcal{D}, h_j) \right| \approx \left| \frac{\partial D}{\partial h_j} h_j \right|$ (higher-order terms ignored)

Here $\frac{\partial D}{\partial h_j}$ is the gradient of the cost w.r.t. the activation $h_j$, and $h_j$ is the unit's output; both are computed during standard backprop.
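A minimal sketch of the TE criterion for one convolutional layer, assuming PyTorch tensors for the activation and its gradient (both can be captured during a normal backward pass, e.g. with hooks); the exact reduction over batch and spatial dimensions here is a simple variant, not necessarily the paper's:

```python
import torch

def taylor_criterion(activation: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """First-order Taylor importance, one score per feature map (sketch).

    activation: (N, C, H, W) output of a conv layer
    grad:       dD/dh for that activation, same shape; both come
                out of a standard backward pass (e.g. via hooks).
    """
    contribution = (activation * grad).sum(dim=(2, 3))  # accumulate over H, W
    return contribution.abs().mean(dim=0)               # |dD/dh * h|, batch mean
```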

SLIDE 23

APPROXIMATING THE ORACLE

Candidate criteria, an alternative: Optimal Brain Damage (OBD), Y. LeCun et al., 1990
  • Uses second-order derivatives to estimate the importance of neurons:

$\Delta D(h_j) \approx \frac{\partial D}{\partial h_j} h_j + \frac{1}{2} \frac{\partial^2 D}{\partial h_j^2} h_j^2$, where the first-order term is assumed to be $0$ for a converged model and higher-order terms are ignored.

  • Needs extra computation of second-order derivatives

SLIDE 24

APPROXIMATING THE ORACLE

Comparison to OBD:
  • OBD uses the second-order expansion and assumes the first-order term is $=0$ for a converged model
  • We propose the absolute value of the first-order expansion

Let $z = \frac{\partial D}{\partial h_j} h_j$. For a perfectly trained model $\mathbb{E}[z] = 0$, but the expected absolute value does not vanish: if $z$ is Gaussian with variance $\sigma^2$, then $\mathbb{E}|z| = \sigma\sqrt{2/\pi}$.
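For reference, here is the standard computation behind that last fact (the mean of a half-normal distribution; this derivation is added here, not on the slide):

```latex
% For z ~ N(0, sigma^2), the mean of |z| (half-normal distribution):
\mathbb{E}\lvert z \rvert
  = \int_{-\infty}^{\infty} \lvert z \rvert \,
      \frac{1}{\sigma\sqrt{2\pi}}\, e^{-z^2 / 2\sigma^2}\, dz
  = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} z\, e^{-z^2 / 2\sigma^2}\, dz
  = \frac{2}{\sigma\sqrt{2\pi}}\, \sigma^2
  = \sigma \sqrt{\frac{2}{\pi}} \;>\; 0 .
```

So even at a loss minimum, where the average first-order term is zero, its average magnitude is not, and that magnitude is exactly what the criterion ranks.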

SLIDE 25

APPROXIMATING THE ORACLE

Comparison to OBD (second-order expansion) vs. our proposal (absolute value of the first-order expansion), with $z = \frac{\partial D}{\partial h_j} h_j$:

  ✓ No extra computations beyond standard backprop
  ✓ We look at the absolute difference in cost
  ✗ Can't predict the exact change in loss, only its magnitude

For a perfectly trained model $\mathbb{E}[z] = 0$, yet $\mathbb{E}|z| > 0$ when $z$ is Gaussian.

SLIDE 26

EVALUATING PRUNING CRITERIA

Spearman's rank correlation with the oracle, VGG16 on Birds-200.

Mean rank correlation across layers:

    Criterion           Correlation with oracle
    Min weight                  0.27
    Activation                  0.56
    OBD                         0.59
    Taylor expansion            0.73

[Plot: per-layer correlation with the oracle for Min weight, Activation, OBD, and Taylor expansion across layers 1-13]
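Spearman correlation between a criterion's ranking and the oracle's is straightforward to compute; a small sketch with made-up scores (`scipy.stats.spearmanr` is the standard routine; the numbers are illustrative, not from the talk):

```python
from scipy.stats import spearmanr

# Illustrative per-unit scores for one layer (made up, not paper data).
oracle = [0.9, 0.1, 0.5, 0.7, 0.2]   # exhaustive |change in loss| per unit
taylor = [0.8, 0.2, 0.4, 0.6, 0.1]   # criterion being evaluated

rho, _ = spearmanr(oracle, taylor)
print(rho)  # 0.9: the criterion ranks units almost exactly like the oracle
```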

SLIDE 27

EVALUATING PRUNING CRITERIA

Pruning with an objective (VGG16):
  • Regularize the criterion with the objective (see the sketch below)
  • The regularizer can be:
      • FLOPs
      • Memory
      • Bandwidth
      • Target device
      • Exact inference time

[Plot: FLOPs per unit vs. layer #, layers 1-14]
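A minimal sketch of such a regularized criterion, assuming per-unit importance scores and a per-unit FLOPs cost are already available; the helper and the trade-off weight `lam` are illustrative, not values from the talk:

```python
def regularized_score(theta, unit_flops, lam=1e-3):
    """Rank units by importance minus a compute-cost penalty (sketch).

    theta:      importance of the unit (e.g. the Taylor criterion)
    unit_flops: FLOPs saved by removing the unit
    lam:        illustrative trade-off weight, not a value from the talk
    """
    # Any measurable cost (memory, bandwidth, device latency) can
    # replace FLOPs here; expensive-to-keep units get pruned sooner.
    return theta - lam * unit_flops
```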

SLIDE 28

RESULTS

SLIDE 29

RESULTS

VGG16 on the Birds-200 dataset:
  • Remove 1 convolutional unit every 30 updates
SLIDE 30

RESULTS

VGG16 on the Birds-200 dataset:
  • Training from scratch doesn't work
  • The Taylor criterion gives the best result of any pruning metric

[Plots: accuracy vs. GFLOPs and vs. # of convolutional kernels]

SLIDE 31

RESULTS

AlexNet on Oxford Flowers-102 (102 classes, ~2k training images, ~6k testing images)

[Plot: accuracy while varying the number of updates between pruning iterations: 10, 30, 60, and 1000 updates]

SLIDE 32

RESULTS

AlexNet on Oxford Flowers-102 (102 classes, ~2k training images, ~6k testing images)

[Plot: accuracy while varying the number of updates between pruning iterations: 10, 30, 60, and 1000 updates]

3.8x FLOPs reduction, 2.4x actual speed-up

SLIDE 33

RESULTS

VGG16 on ImageNet:
  • Pruned over 7 epochs

[Plot: top-5 accuracy on the validation set during pruning]

SLIDE 34

RESULTS

VGG16 on ImageNet:
  • Pruned over 7 epochs
  • Fine-tuned for 7 more epochs

    GFLOPs   FLOPs reduction   Actual speed-up   Top-5 (validation)
      31          1x                  -               89.5%
      12          2.6x               2.5x             -2.5%
       8          3.9x               3.3x             -5.0%

    (Top-5 changes are relative to the 89.5% baseline.)

SLIDE 35

RESULTS

R3DCNN for gesture recognition: a 3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures.

P. Molchanov, Gesture recognition with 3D CNNs, GTC 2016

SLIDE 36

RESULTS

R3DCNN for gesture recognition: a 3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures.

P. Molchanov, Gesture recognition with 3D CNNs, GTC 2016

  • 12.6x reduction in FLOPs
  • 2.5% drop in accuracy
  • 5.2x speed-up

SLIDE 37

How many neurons do we need to classify a cat?

SLIDE 38

DOGS VS. CATS

Kaggle Dogs vs. Cats classification (25,000 images); starting from Marco Lugo's 3rd-place solution.

SLIDE 39

DOGS VS. CATS

Fine-tuned ResNet-101:
  • Full network: 99.2%
  • Pruned network: 99.0%

SLIDE 40

DOGS VS. CATS

Fine-tuned ResNet-101:
  • Full network: 99.2%
  • Pruned network: 99.0%
  • 15x compression

[Plot: number of convolutional units vs. pruning iteration, falling from 52,672 units to 3,472 units]

SLIDE 41

CONCLUSIONS

  • Pruning as greedy feature selection
  • New criteria based on Taylor expansion
  • Pruning is especially effective (and necessary!) for transfer learning
  • Pruning can incorporate desired objectives (such as FLOPs)
  • Read more in our ICLR2017 paper: https://openreview.net/pdf?id=SJGCiw5gl
SLIDE 42

THANK YOU!