

SLIDE 1

Loss Valleys and Generalization in Deep Learning

Andrew Gordon Wilson

Assistant Professor, Cornell University
https://people.orie.cornell.edu/andrew
The Robotic Vision Probabilistic Object Detection Challenge
CVPR, Long Beach, CA, June 17, 2019

1 / 41

SLIDE 2

Model Selection

[Figure: airline passengers (thousands) vs. year, 1949–1961]

Which model should we choose?

(1): $f_1(x) = a_0 + a_1 x$
(2): $f_2(x) = \sum_{j=0}^{3} a_j x^j$
(3): $f_3(x) = \sum_{j=0}^{10^4} a_j x^j$
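To make the question concrete, here is a minimal sketch (Python; the data are a synthetic stand-in for the airline series, and degree 15 stands in for the very flexible $10^4$-term model) comparing plain least-squares fits of the three model classes on held-out data:

```python
import numpy as np

# Synthetic stand-in for the airline series: trend + seasonal wiggle + noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 48)
y = 100 + 400 * x + 30 * np.sin(12 * np.pi * x) + 10 * rng.standard_normal(x.size)

x_tr, y_tr = x[:36], y[:36]        # fit on the first three "years"
x_te, y_te = x[36:], y[36:]        # evaluate on the held-out final "year"

for degree in (1, 3, 15):          # 15 stands in for the very flexible f3
    coeffs = np.polyfit(x_tr, y_tr, degree)
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    print(f"degree {degree:2d}: held-out RMSE = {rmse:.1f}")
```

Under plain maximum likelihood the most flexible model typically fits the training window best and extrapolates worst, which sets up the discussion of support and inductive biases on the next slides.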

2 / 41

SLIDE 3

Bayesian or Frequentist?

3 / 41

SLIDE 4

How do we learn?

◮ The ability for a system to learn is determined by its support (which solutions are a priori possible) and inductive biases (which solutions are a priori likely).
◮ An influx of new massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.

[Figure: p(data | model) spread over all possible datasets, for simple, medium, and flexible models]

4 / 41

SLIDE 5

Bayesian Deep Learning

Why?

◮ A powerful framework for model construction and understanding generalization
◮ Uncertainty representation and calibration (crucial for decision making)
◮ Better point estimates
◮ Interpretably incorporate prior knowledge and domain expertise
◮ It was the most successful approach at the end of the second wave of neural networks (Neal, 1998).
◮ Neural nets are much less mysterious when viewed through the lens of probability theory.

Why not?

◮ Can be computationally intractable (but doesn’t have to be).
◮ Can involve a lot of moving parts (but doesn’t have to).

There has been exciting progress in the last two years addressing these limitations as part of an extremely fruitful research direction.

5 / 41

SLIDE 6

Wide Optima Generalize Better

Keskar et al. (2017)

◮ Bayesian integration will give very different predictions, especially in deep learning!

6 / 41

SLIDE 7

Mode Connectivity

[Figure: three cross-sections of the train loss surface, with contour levels from 0.065 to > 5, showing low-loss curves connecting independently trained modes]

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs Advances in Neural Information Processing Systems (NeurIPS), 2018

  • T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson

7 / 41

SLIDE 8

Cyclical Learning Rate Schedule
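The schedule itself appears as a figure on the slide. As a sketch (following the cyclical schedule commonly paired with SWA; the cycle length and endpoint values below are arbitrary placeholders), the learning rate decays linearly within each cycle and then jumps back up:

```python
def cyclical_lr(i, cycle_length=50, lr_max=0.05, lr_min=0.0005):
    """Learning rate at iteration (or epoch) i, 1-indexed: decays linearly from
    lr_max down to lr_min within each cycle, then resets to lr_max."""
    t = ((i - 1) % cycle_length + 1) / cycle_length   # position within the cycle
    return (1 - t) * lr_max + t * lr_min
```

With a cyclical schedule, weights are collected at the cycle minima, where SGD has settled into good regions of the loss surface.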

8 / 41

SLIDE 9

Trajectory of SGD

[Figure: test error (%) over a plane in weight space containing SGD iterates W1, W2, W3 and their average WSWA]

9 / 41

SLIDE 10

Trajectory of SGD

[Figure: test error (%) over a plane in weight space containing SGD iterates W1, W2, W3 and their average WSWA]

10 / 41

SLIDE 11

Trajectory of SGD

[Figure: test error (%) over a plane in weight space containing SGD iterates W1, W2, W3 and their average WSWA]

11 / 41

SLIDE 12

SWA Algorithm

◮ Use a learning rate that doesn’t decay to zero (cyclical or constant)
◮ Average weights:
  ◮ Cyclical LR: at the end of each cycle
  ◮ Constant LR: at the end of each epoch
◮ Recompute batch normalization statistics at the end of training; in practice, do one additional forward pass on the training data (a minimal sketch follows below).
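A minimal sketch of the constant-LR variant, assuming a PyTorch-style `model`, `loader`, and `optimizer` set up elsewhere:

```python
import copy
import torch
import torch.nn.functional as F

def train_swa(model, loader, optimizer, num_epochs, swa_start):
    """Constant-LR SWA: run SGD without decaying the LR and keep a running
    average of the weights, updated at the end of each epoch."""
    swa_model = copy.deepcopy(model)
    n_averaged = 0
    for epoch in range(num_epochs):
        for x, y in loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            n_averaged += 1
            for p_swa, p in zip(swa_model.parameters(), model.parameters()):
                p_swa.data += (p.data - p_swa.data) / n_averaged  # running mean
    # Recompute batch-norm statistics with one extra forward pass over the data
    # (in practice you would reset the BN running buffers first).
    swa_model.train()
    with torch.no_grad():
        for x, _ in loader:
            swa_model(x)
    return swa_model
```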

12 / 41

SLIDE 13

Trajectory of SGD

[Figure: left, test error (%) in the plane of W1, W2, W3, and WSWA; middle and right, test error (%) and train loss at epoch 125 in a plane containing WSGD and WSWA]

13 / 41

SLIDE 14

Following Random Paths


14 / 41

SLIDE 15

Path from wSWA to wSGD

[Figure: test error (%) and train loss as a function of distance along the path from wSWA to wSGD]

15 / 41

SLIDE 16

Approximating an FGE Ensemble

Because the points sampled from an FGE ensemble take small steps in weight space by design, we can do a linearization analysis to show that
$$f(w_{\mathrm{SWA}}) \approx \frac{1}{n} \sum_{i=1}^{n} f(w_i).$$
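Sketching the reasoning: expand each ensemble member to first order around the average weights (valid because the FGE iterates stay close together by construction),

$$f(w_i) \approx f(w_{\mathrm{SWA}}) + \big\langle \nabla f(w_{\mathrm{SWA}}),\, w_i - w_{\mathrm{SWA}} \big\rangle, \qquad w_{\mathrm{SWA}} = \frac{1}{n}\sum_{i=1}^{n} w_i .$$

Averaging over $i$ makes the first-order term vanish, since the deviations $w_i - w_{\mathrm{SWA}}$ sum to zero, so $\frac{1}{n}\sum_i f(w_i) \approx f(w_{\mathrm{SWA}})$ up to second-order terms.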

16 / 41

SLIDE 17

SWA Results, CIFAR

17 / 41

SLIDE 18

SWA Results, ImageNet (Top-1 Error Rate)

18 / 41

SLIDE 19

Sampling from a High Dimensional Gaussian

SGD (with constant LR) proposals are on the surface of a hypersphere. Averaging lets us go inside the sphere to a point of higher density.
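A quick numerical illustration (the dimension and number of iterates here are arbitrary): draws from a high-dimensional standard Gaussian concentrate on a shell of radius about √d, while their average lies much closer to the mode, where the density is far higher.

```python
import numpy as np

d, n = 1000, 10                               # dimension, number of "SGD proposals"
rng = np.random.default_rng(0)
samples = rng.standard_normal((n, d))         # draws from N(0, I_d)

print(np.sqrt(d))                                     # ~31.6: radius of the shell
print(np.linalg.norm(samples, axis=1).round(1))       # each draw has norm close to sqrt(d)
print(np.linalg.norm(samples.mean(axis=0)).round(1))  # the average: norm ~ sqrt(d/n) ~ 10
```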

19 / 41

SLIDE 20

High Constant LR

[Figure: test error (%) vs. training epochs for SGD with a high constant learning rate, with and without weight averaging (SWA)]

Side observation: Averaging bad models does not give good solutions. Averaging bad weights can give great solutions.

20 / 41

SLIDE 21

Stochastic Weight Averaging

◮ Simple drop-in replacement for SGD or other optimizers
◮ Works by finding flat regions of the loss surface
◮ No runtime overhead, but often significant improvements in generalization for many tasks
◮ Available in PyTorch contrib (call optim.swa)
◮ https://people.orie.cornell.edu/andrew/code

Averaging Weights Leads to Wider Optima and Better Generalization, UAI 2018

  • P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, A.G. Wilson.

21 / 41

SLIDE 22

Uncertainty Representation with SWAG

1. Leverage theory that shows SGD with a constant learning rate is approximately sampling from a Gaussian distribution.
2. Compute the first two moments of the SGD trajectory (SWA computes just the first).
3. Use these moments to construct a Gaussian approximation in weight space.
4. Sample from this Gaussian distribution, pass samples through the predictive distribution, and form a Bayesian model average (sketched below).

A Simple Baseline for Bayesian Uncertainty in Deep Learning
  • W. Maddox, P. Izmailov, T. Garipov, D. Vetrov, A.G. Wilson
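A sketch of the diagonal special case of this recipe; the full method also keeps a low-rank deviation matrix for the covariance and recomputes batch-norm statistics for each sample. Here `predict` is a hypothetical function mapping a model to predictive probabilities on test inputs:

```python
import copy
import torch
import torch.nn.functional as F

def swag_diagonal(model, loader, optimizer, num_epochs, num_samples, predict):
    """Diagonal SWAG: track the first two moments of the SGD iterates, then
    Bayesian-model-average predictions over samples from the fitted Gaussian."""
    params = list(model.parameters())
    mean = [torch.zeros_like(p) for p in params]
    sq_mean = [torch.zeros_like(p) for p in params]
    n = 0
    for epoch in range(num_epochs):                 # constant-LR SGD
        for x, y in loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
        n += 1                                      # moment update once per epoch
        for m, s, p in zip(mean, sq_mean, params):
            m += (p.data - m) / n
            s += (p.data ** 2 - s) / n

    avg_pred = 0.0
    for _ in range(num_samples):                    # Bayesian model average
        sample = copy.deepcopy(model)
        for p, m, s in zip(sample.parameters(), mean, sq_mean):
            std = (s - m ** 2).clamp(min=0).sqrt()
            p.data = m + std * torch.randn_like(m)
        avg_pred = avg_pred + predict(sample) / num_samples
    return avg_pred
```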

22 / 41

SLIDE 23

Uncertainty Calibration

[Figure: reliability diagrams (confidence − accuracy vs. confidence, i.e. max predicted probability) for WideResNet28x10 on CIFAR-100, WideResNet28x10 on CIFAR-10 → STL-10, DenseNet-161 on ImageNet, and ResNet-152 on ImageNet]

23 / 41

SLIDE 24

Uncertainty Likelihood

24 / 41

SLIDE 25

Subspace Inference for Bayesian Deep Learning

A modular approach:

◮ Construct a subspace of the network’s high-dimensional parameter space
◮ Perform inference directly in the subspace
◮ Sample from the approximate posterior for Bayesian model averaging

We can approximate the posterior of a WideResNet with 36 million parameters in a 5D subspace and achieve state-of-the-art results!

25 / 41

SLIDE 26

Subspace Construction

◮ Choose a shift $\hat{w}$ and basis vectors $\{d_1, \ldots, d_k\}$.
◮ Define the subspace $S = \{w \mid w = \hat{w} + t_1 d_1 + \cdots + t_k d_k\}$.
◮ Likelihood: $p(\mathcal{D} \mid t) = p_{\mathcal{M}}(\mathcal{D} \mid w = \hat{w} + Pt)$, where $P = [d_1, \ldots, d_k]$ stacks the basis vectors as columns.

26 / 41

SLIDE 27

Inference

◮ Approximate inference over the parameters t
  ◮ MCMC, Variational Inference, Normalizing Flows, . . .
◮ Bayesian model averaging at test time:

$$p(\mathcal{D}^* \mid \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} p_{\mathcal{M}}\big(\mathcal{D}^* \mid \tilde{w}_j = \hat{w} + P \tilde{t}_j\big), \qquad \tilde{t}_j \sim q(t \mid \mathcal{D}) \qquad (1)$$
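In code, the model average in Eq. (1) is a short loop. This is only a sketch: `posterior_sample` and `set_weights` are hypothetical helpers for drawing a sample from q(t | D) and writing a flat weight vector back into the network.

```python
import torch

def subspace_model_average(model, x, w_hat, P, posterior_sample, set_weights, J=30):
    """Eq. (1): average predictive distributions over J samples from q(t | D).
    w_hat: flattened shift vector; P: (num_params, k) matrix of basis directions."""
    probs = 0.0
    for _ in range(J):
        t = posterior_sample()            # t ~ q(t | D), a k-dimensional tensor
        w = w_hat + P @ t                 # map the subspace sample back to weights
        set_weights(model, w)             # write the flat vector into the network
        probs = probs + torch.softmax(model(x), dim=-1) / J
    return probs
```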

27 / 41

SLIDE 28

Subspace Choice

We want a subspace that

◮ Contains diverse models which give rise to different predictions
◮ Is cheap to construct

28 / 41

SLIDE 29

Random Subspace

◮ Directions $d_1, \ldots, d_k \sim \mathcal{N}(0, I_p)$
◮ Use a pre-trained solution as the shift $\hat{w}$
◮ Subspace $S = \{w \mid w = \hat{w} + Pt\}$

29 / 41

SLIDE 30

PCA of SGD Trajectory

◮ Run SGD with a high constant learning rate from a pre-trained solution
◮ Collect snapshots of the weights $w_i$
◮ Use the SWA solution as the shift: $\hat{w} = \frac{1}{M} \sum_i w_i$
◮ $\{d_1, \ldots, d_k\}$ are the first k PCA components of the deviation vectors $\hat{w} - w_i$ (see the sketch below).
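A minimal sketch of this construction, with each snapshot flattened into a single parameter vector:

```python
import numpy as np

def pca_subspace(snapshots, k):
    """Shift and k-dimensional PCA basis from SGD weight snapshots.
    snapshots: array of shape (M, num_params), one flattened iterate per row."""
    W = np.asarray(snapshots)
    w_hat = W.mean(axis=0)                 # the SWA solution, used as the shift
    A = w_hat - W                          # deviation vectors, shape (M, num_params)
    # Top-k right singular vectors of the deviations give the PCA directions
    # (assumes k <= M, the number of snapshots).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = Vt[:k].T                           # (num_params, k) projection matrix
    return w_hat, P
```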

30 / 41

SLIDE 31

Curve Subspace

31 / 41

SLIDE 32

Subspace Comparison (Regression)

32 / 41

SLIDE 33

Subspace Comparison (Classification)

Subspace Inference for Bayesian Deep Learning

  • P. Izmailov, W. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, A.G. Wilson

33 / 41

SLIDE 34

Semi-Supervised Learning

◮ Make label predictions using structure from both unlabelled and labelled training data.
◮ Can quantify recent advances in unsupervised learning.
◮ Crucial for reducing the dependency of deep learning on large labelled datasets.

34 / 41

SLIDE 35

Semi-Supervised Learning

There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average
  • B. Athiwaratkun, M. Finzi, P. Izmailov, A.G. Wilson
ICLR 2019

$$\mathcal{L}(w_f) = \underbrace{\sum_{(x,y)\in \mathcal{D}_L} \ell_{\mathrm{CE}}(w_f, x, y)}_{\mathcal{L}_{\mathrm{CE}}} + \underbrace{\sum_{x \in \mathcal{D}_L \cup \mathcal{D}_U} \ell_{\mathrm{cons}}(w_f, x)}_{\mathcal{L}_{\mathrm{cons}}}$$
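The consistency term penalizes disagreement between predictions on two stochastic perturbations of the same, possibly unlabelled, input. Below is a sketch of one common Π-model-style choice (the paper also covers teacher-student variants where the target comes from an averaged model); `perturb` is an assumed stochastic augmentation function:

```python
import torch

def consistency_loss(model, x, perturb):
    """l_cons(w_f, x): mean-squared disagreement between predictions on two
    stochastic perturbations of the same input; x may be unlabelled."""
    p1 = torch.softmax(model(perturb(x)), dim=-1)
    p2 = torch.softmax(model(perturb(x)), dim=-1).detach()  # one view as the target
    return ((p1 - p2) ** 2).sum(dim=-1).mean()
```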

35 / 41

SLIDE 36

Semi-Supervised Learning

World record results on semi-supervised vision benchmarks

36 / 41

SLIDE 37

SWALP: Stochastic Weight Averaging in Low Precision Training

◮ End-to-end training entirely in low precision.
◮ Can outperform full-precision SGD even with all numbers quantized down to 8 bits, including gradient accumulators.
◮ Averaging combines weights that have been rounded up with those that have been rounded down.
◮ Quantizing in a flat region does not hurt the loss.
◮ SWALP converges arbitrarily close to the optimal solution.
◮ Special relevance to new GPU architectures.

[Figure: representable points in low precision, the low-precision SGD trajectory, and the SWALP solution obtained as the weight average]
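To see why averaging rounded-up and rounded-down weights helps, consider stochastic rounding, the unbiased quantizer typically used for this kind of low-precision training (a sketch; `delta` is an assumed fixed-point grid spacing): individual iterates live on a coarse grid, but their average recovers precision the grid itself cannot represent.

```python
import numpy as np

def stochastic_round(w, delta, rng):
    """Round each entry of w to a multiple of delta, up or down at random, with
    probabilities chosen so the rounding is unbiased: E[round(w)] = w."""
    lower = np.floor(w / delta) * delta
    p_up = (w - lower) / delta                     # probability of rounding up
    return lower + delta * (rng.random(w.shape) < p_up)

rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
rounded = np.stack([stochastic_round(w_true, 0.25, rng) for _ in range(1000)])
print(w_true.round(3))
print(rounded.mean(axis=0).round(3))               # the average lies between grid points
```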

37 / 41

SLIDE 38

Conclusions

By considering the geometry of the loss surfaces, we can:

◮ Develop optimization procedures which provide better generalization, and good performance for low precision training.
◮ Develop scalable approaches to Bayesian deep learning, which provide both better point predictions and uncertainty representation and calibration.

Code is available: https://people.orie.cornell.edu/andrew

38 / 41

SLIDE 39

Scalable Gaussian Processes

◮ Run GPs on millions of points in seconds, vs. thousands of points in hours.
◮ Outperforms stand-alone deep neural networks by learning deep kernels.
◮ Approach accelerated by kernel approximations which admit fast matrix-vector multiplies (Wilson and Nickisch, 2015).
◮ Harmonizes with GPU acceleration.
◮ O(n) training and O(1) testing (instead of O(n³) training and O(n²) testing).
◮ Implemented in our new library GPyTorch: gpytorch.ai (see the sketch below).
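A sketch of a KISS-GP-style model in GPyTorch, using the structured kernel interpolation of Wilson and Nickisch (2015) that supplies the fast matrix-vector multiplies; the class names follow the GPyTorch API, but constructor details may differ across versions.

```python
import torch
import gpytorch

class KissGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Structured kernel interpolation over an inducing grid gives kernel
        # matrices that admit fast matrix-vector multiplies.
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()),
            grid_size=100, num_dims=1,
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

train_x = torch.linspace(0, 1, 1000)
train_y = torch.sin(20 * train_x) + 0.1 * torch.randn(train_x.size(0))
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = KissGPModel(train_x, train_y, likelihood)
```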

39 / 41

SLIDE 40

LSTM Kernels

◮ We derive kernels which have recurrent LSTM inductive biases, and apply them to autonomous vehicles, where predictive uncertainty is critical.

[Figure: predicted vehicle trajectories, East (mi) vs. North (mi), coloured by speed]

Learning Scalable Deep Kernels with Recurrent Structure

  • M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, E. P. Xing

Journal of Machine Learning Research (JMLR), 2017

40 / 41

SLIDE 41

GP-LSTM Predictive Distributions

−5 5 10 20 30 40 50 Front distance, m −5 5 −5 5 Side distance, m −5 5 −5 5 −5 5 10 20 30 40 50 Front distance, m −5 5 −5 5 Side distance, m −5 5 −5 5 41 / 41