

SLIDE 1

What is the State of Neural Network Pruning?

Davis Blalock* Jose Javier Gonzalez* Jonathan Frankle John V. Guttag

*equal contribution

SLIDE 2

Blalock & Gonzalez

Overview

Meta-analysis of neural network pruning: we aggregated results across 81 pruning papers and pruned hundreds of networks under controlled conditions

  • Some surprising findings…

ShrinkBench: an open-source library to facilitate development and standardized evaluation of neural network pruning methods

SLIDE 3

Part 0: Background

SLIDE 4

Neural Network Pruning

  • Neural networks are often accurate, but large
  • Pruning: systematically removing parameters from a network
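The simplest widely-used instance of this idea is magnitude pruning: remove the parameters with the smallest absolute values. A minimal NumPy sketch (the function name and example values are illustrative, not from any particular paper):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, fraction: float) -> np.ndarray:
    """Zero out the `fraction` of entries with the smallest absolute value."""
    k = int(fraction * weights.size)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value among all entries
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([0.1, -2.0, 0.05, 3.0, -0.5])
# The two smallest-magnitude entries (0.1 and 0.05) are zeroed
print(magnitude_prune(w, 0.4))
```

Note that ties at the threshold may prune slightly more than the requested fraction; real libraries handle this more carefully.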

SLIDE 5

Typical Pruning Pipeline

Data + Model → Pruning Algorithm → Finetuning → Evaluation

Many design choices:

  • Scoring importance of parameters
  • Schedule of pruning, training / finetuning
  • Structure of induced sparsity
  • Finetuning details: optimizer, duration, hyperparameters

SLIDE 6

Evaluating Neural Network Pruning

  • Goal: increase efficiency of the network as much as possible with minimal drop in quality
  • Metrics:
  • Quality = accuracy
  • Efficiency = FLOPs, compression, latency…
  • Must use comparable tradeoffs

[Figure: accuracy of pruned network vs. efficiency]
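"Comparable tradeoffs" means evaluating methods at the same operating point. One way to do that, sketched below with made-up accuracy curves, is to interpolate each method's curve at a common compression ratio:

```python
import numpy as np

# Hypothetical accuracy-vs-compression curves for two pruning methods.
# The methods were evaluated at different compression ratios, so comparing
# raw table entries would be apples-to-oranges.
compression_a = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
accuracy_a    = np.array([76.1, 76.0, 75.5, 74.0, 70.5])
compression_b = np.array([1.0, 3.0, 6.0, 12.0])
accuracy_b    = np.array([76.1, 75.6, 74.4, 71.5])

def accuracy_at(target, compression, accuracy):
    """Linearly interpolate a method's accuracy at a target compression ratio."""
    return float(np.interp(target, compression, accuracy))

# Comparable tradeoff: both methods evaluated at 4x compression
print(accuracy_at(4.0, compression_a, accuracy_a))  # 75.5
print(accuracy_at(4.0, compression_b, accuracy_b))  # ~75.2
```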

SLIDE 7

Part 1: Meta-Analysis

SLIDE 8

Overview of Meta-Analysis

  • We aggregated results across 81 pruning papers
  • Mostly published in top venues
  • Corpus closed under experimental comparison

Venue          # of Papers
arXiv only     22
NeurIPS        16
ICLR           11
CVPR            9
ICML            4
ECCV            4
BMVC            3
IEEE Access     2
Other          10

SLIDE 9

Robust Findings

  • Pruning works
  • Almost any heuristic improves efficiency with little performance drop
  • Many methods are better than random pruning
  • Don’t prune all layers uniformly
  • Sparse models are better for a fixed # of parameters
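The "don’t prune all layers uniformly" finding shows up even in a toy example: if one magnitude threshold is shared globally across layers, layers whose weights live at smaller scales get pruned far more aggressively than a uniform per-layer scheme would prune them. A sketch with synthetic layers (all names and scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic layers whose weights live at very different scales
layers = {"conv1": rng.normal(0.0, 1.0, 1000), "fc": rng.normal(0.0, 0.1, 1000)}

def global_masks(layers, prune_fraction):
    """Global magnitude pruning: one threshold shared across all layers."""
    all_scores = np.concatenate([np.abs(w) for w in layers.values()])
    threshold = np.quantile(all_scores, prune_fraction)
    return {name: np.abs(w) > threshold for name, w in layers.items()}

masks = global_masks(layers, 0.5)  # prune 50% of weights overall
for name, mask in masks.items():
    print(f"{name}: kept {mask.mean():.0%} of weights")
# Uniform pruning would keep 50% in each layer; the global scheme instead
# keeps most of conv1 and prunes the small-scale fc layer heavily.
```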

SLIDE 10


Better Pruning vs Better Architecture

SLIDE 11

Ideal Results Over Time

(Dataset, Architecture, X metric, Y metric, Hyperparameters) → Curve

[Figure: accuracy vs. compression ratio, one curve per year, 2015–2019]

SLIDE 12

Ideal Results Over Time

[Figures: accuracy vs. compression ratio and vs. theoretical speedup, 2015–2019, for VGG-16, AlexNet, and ResNet-50 on ImageNet]

SLIDE 13

Actual Results Over Time

[Figures: accuracy vs. compression ratio and vs. theoretical speedup, 2015–2019, for VGG-16, AlexNet, and ResNet-50 on ImageNet]

SLIDE 14

Quantifying the Problem

  • Among 81 papers:
  • 49 datasets
  • 132 architectures
  • 195 (dataset, architecture) pairs
  • Vicious cycle: extreme burden to compare to existing methods

All (dataset, architecture) pairs used in at least 4 papers:

Dataset     Architecture     # of Papers Using Pair
ImageNet    VGG-16           22
CIFAR-10    ResNet-56        14
ImageNet    ResNet-50        14
ImageNet    CaffeNet         11
ImageNet    AlexNet           9
CIFAR-10    CIFAR-VGG         8
ImageNet    ResNet-34         6
ImageNet    ResNet-18         6
CIFAR-10    ResNet-110        5
CIFAR-10    PreResNet-164     4
CIFAR-10    ResNet-32         4

SLIDE 15

Dearth of Reported Comparisons

  • Presence of comparisons:
  • Most papers compare to at most 1 other method
  • 40% of papers have never been compared to
  • Pre-2010s methods almost completely ignored
  • Reinventing the wheel:
  • Magnitude-based pruning: Janowsky (1989)
  • Gradient times magnitude: Mozer & Smolensky (1989)
  • “Reviving” pruned weights: Tresp et al. (1997)

SLIDE 16

Pop quiz!

  • Alice’s network has 10 million parameters. She prunes 8 million of them. What compression ratio might she report in her paper?
  • A. 80%
  • B. 20%
  • C. 5x
  • D. No reported compression ratio

SLIDE 17

Pop quiz!

  • Alice’s network has 10 million parameters. She prunes 8 million of them. What compression ratio might she report in her paper?
  • A. 80%
  • B. 20%
  • C. 5x
  • D. No reported compression ratio
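Choices A, B, and C all describe the same experiment; they are just different reporting conventions, which is part of why pruning results are hard to compare across papers. The arithmetic:

```python
total_params, pruned_params = 10_000_000, 8_000_000
remaining = total_params - pruned_params

fraction_pruned = pruned_params / total_params  # "80% of weights pruned"  (choice A)
fraction_kept   = remaining / total_params      # "20% of weights remain"  (choice B)
compression     = total_params / remaining      # "5x compression"         (choice C)

print(fraction_pruned, fraction_kept, compression)  # 0.8 0.2 5.0
```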

SLIDE 18

Pop quiz!

  • According to the literature, how many FLOPs does it take to run inference using AlexNet on ImageNet?
  • A. 371 million
  • B. 500 million
  • C. 724 million
  • D. 1.5 billion

SLIDE 19

Pop quiz!

  • According to the literature, how many FLOPs does it take to run inference using AlexNet on ImageNet?
  • A. 371 million
  • B. 500 million
  • C. 724 million
  • D. 1.5 billion

SLIDE 20

Part 2: ShrinkBench

SLIDE 21

Why ShrinkBench?

  • Want to hold everything but the pruning algorithm constant
  • Improved rigor, development time

[Diagram: Data + Model → Pruning Algorithm → Finetuning → Evaluation; every stage except the pruning algorithm is a potential confounding factor]

SLIDE 22

Masking API

Model (+ Data) → Pruning Masks → Accuracy Curve

[Figure: weight tensors and the binary pruning masks applied to them]

  • Lets algorithm return arbitrary masks for weight tensors
  • Standardizes all other aspects of training and evaluation
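In code, such a masking interface might look roughly like this: a pruning method maps weight tensors to binary masks, and the benchmark handles everything else (applying masks, finetuning, evaluation). This is an illustrative sketch, not the actual ShrinkBench API; the example values echo the slide's weight tensors:

```python
import numpy as np

def magnitude_masks(weights, prune_fraction):
    """Example 'pruning method': keep the globally largest-magnitude weights.
    Returns one binary mask per weight tensor."""
    scores = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(scores, prune_fraction)
    return {name: (np.abs(w) > threshold).astype(w.dtype)
            for name, w in weights.items()}

def apply_masks(weights, masks):
    """The benchmark side: zero out masked weights; training and
    evaluation details stay fixed across all methods."""
    return {name: w * masks[name] for name, w in weights.items()}

weights = {"layer1": np.array([[2.1, 4.6, 0.8], [0.1, 0.2, 1.5]]),
           "layer2": np.array([4.9, 2.3, 2.5])}
masks = magnitude_masks(weights, 0.5)
pruned = apply_masks(weights, masks)
kept = sum(int(m.sum()) for m in masks.values())
print(kept)  # 4 of the 9 weights survive
```

Because the method only returns masks, any scoring heuristic can be swapped in without touching the rest of the pipeline.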

SLIDE 23

Crucial to Vary Amount of Pruning & Architecture

[Figure: accuracy vs. amount of pruning for CIFAR-VGG and ResNet-56]

SLIDE 24

Compression and Speedup are not Interchangeable

[Figure: ResNet-18 on ImageNet]

SLIDE 25

Using Identical Initial Weights Is Essential

[Figure: ResNet-56 on CIFAR-10]

SLIDE 26

Conclusion

  • Pruning works
  • But not as well as improving the architecture
  • But we have no idea which methods work best
  • The field suffers from extreme fragmentation in experimental setups
  • We introduce a library/benchmark to address this
  • Faster progress in the future; interesting findings already

https://github.com/jjgo/shrinkbench

SLIDE 27

Questions?