Using Loss Surface Geometry for Practical Bayesian Deep Learning

SLIDE 1

Using Loss Surface Geometry for Practical Bayesian Deep Learning

Andrew Gordon Wilson

https://cims.nyu.edu/~andrewgw
New York University

Bayesian Deep Learning Workshop
Advances in Neural Information Processing Systems
December 13, 2019

Collaborators: Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov

SLIDE 2

Model Selection

[Figure: airline passengers (thousands) by year, 1949–1961]

Which model should we choose?

(1): f1(x) = a0 + a1 x
(2): f2(x) = ∑_{j=0}^{3} aj x^j
(3): f3(x) = ∑_{j=0}^{10^4} aj x^j
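To make the comparison concrete, here is a minimal sketch (an illustration added in this transcript, not from the slides) that fits polynomial models of increasing order by least squares; the data is a toy stand-in for the airline series:

```python
# Minimal sketch: compare polynomial models of increasing order on toy data.
import numpy as np

# Hypothetical stand-in for the airline passenger series (thousands), monthly.
x = np.linspace(1949, 1961, 144)
y = 100 + 30 * (x - 1949) + 20 * np.sin(2 * np.pi * x)  # toy trend + seasonality

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns predictions on the training inputs."""
    t = (x - x.min()) / (x.max() - x.min())  # rescale inputs for conditioning
    return np.polyval(np.polyfit(t, y, degree), t)

for degree in [1, 3, 10]:  # degree 10 stands in for the slide's (huge) 10^4 case
    rmse = np.sqrt(np.mean((y - fit_poly(x, y, degree)) ** 2))
    print(f"degree {degree:2d}: train RMSE = {rmse:.2f}")
```

Training fit alone always favors the most flexible model; the p(data|model) view on the next slide is what trades off fit against complexity.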

SLIDE 3

How do we learn?

◮ The ability of a system to learn is determined by its support (which solutions are a priori possible) and its inductive biases (which solutions are a priori likely).

◮ An influx of new massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.

[Figure: p(data|model) over all possible datasets, for simple, medium, and flexible models]

SLIDE 4

Bayesian Deep Learning

Why?

Why?

◮ A powerful framework for model construction and understanding generalization.
◮ Uncertainty representation and calibration (crucial for decision making).
◮ Better point estimates.
◮ Interpretably incorporate prior knowledge and domain expertise.
◮ It was the most successful approach at the end of the second wave of neural networks (Neal, 1998).
◮ Neural nets are much less mysterious when viewed through the lens of probability theory.

Why not?

◮ Can be computationally intractable (but doesn't have to be).
◮ Can involve a lot of moving parts (but doesn't have to).

There has been exciting progress in the last year addressing these limitations.

SLIDE 5

Wide Optima Generalize Better

Keskar et al. (2017)

◮ Bayesian integration will give very different predictions in deep learning especially!

SLIDE 6

Bayesian Deep Learning

Sum rule: p(x) = ∑_y p(x, y).   Product rule: p(x, y) = p(x|y) p(y) = p(y|x) p(x).

p(y|x∗, y, X) = ∫ p(y|x∗, w) p(w|y, X) dw .   (1)

◮ Think of each setting of w as a different model. Eq. (1) is a Bayesian model average: an average of infinitely many models weighted by their posterior probabilities.

◮ Automatically calibrated complexity, even with highly flexible models.

◮ Can view classical training as using an approximate posterior q(w|y, X) = δ(w = wMAP).

◮ Typically we are more interested in the induced distribution over functions than in the parameters w. It can be hard to have intuitions for the prior p(w).
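As a sketch of how Eq. (1) is used in practice (added in this transcript, not from the talk), the integral is replaced by a Monte Carlo average over posterior samples; `posterior_samples` is a hypothetical stand-in for draws from any approximate posterior:

```python
# Monte Carlo approximation to the Bayesian model average in Eq. (1):
# p(y | x*, D) ≈ (1/J) sum_j p(y | x*, w_j),  w_j ~ p(w | D).
import torch

def bayesian_model_average(model, x_star, posterior_samples):
    """Average the predictive distribution over posterior weight samples.

    posterior_samples: list of state_dicts, each one draw w_j ~ p(w | D)
    (hypothetical: any approximate posterior sampler could supply these).
    """
    probs = []
    with torch.no_grad():
        for w in posterior_samples:
            model.load_state_dict(w)                         # set w = w_j
            probs.append(torch.softmax(model(x_star), dim=-1))  # p(y | x*, w_j)
    return torch.stack(probs).mean(dim=0)                    # (1/J) sum_j ...
```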

SLIDE 7

Mode Connectivity

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson. NeurIPS 2018.

SLIDES 8–11

Mode Connectivity

[Figure slides: loss-surface visualizations of mode connectivity]

SLIDE 12

Uncertainty Representation with SWAG

1. Leverage theory showing that SGD with a constant learning rate approximately samples from a Gaussian distribution.

2. Compute the first two moments of the SGD trajectory (SWA computes just the first).

3. Use these moments to construct a Gaussian approximation in weight space.

4. Sample from this Gaussian distribution, pass the samples through the predictive distribution, and form a Bayesian model average:

p(y∗|D) ≈ (1/J) ∑_{j=1}^{J} p(y∗|wj),   wj ∼ q(w|D),   q(w|D) = N(w̄, K)

w̄ = (1/T) ∑_t wt

K = (1/2) [ (1/(T−1)) ∑_t (wt − w̄)(wt − w̄)ᵀ + (1/(T−1)) ∑_t diag((wt − w̄)²) ]

A Simple Baseline for Bayesian Uncertainty in Deep Learning
W. Maddox, P. Izmailov, T. Garipov, D. Vetrov, A.G. Wilson. NeurIPS 2019.
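A compact sketch of steps 1–4 (an added illustration, not the authors' released code): only the diagonal part of K is kept, the low-rank deviation term is omitted for brevity, and `loader` and `loss_fn` are assumed.

```python
# Sketch of SWAG-diagonal: constant-LR SGD, running first/second moments,
# then a Gaussian approximation in weight space to sample from.
import torch

def collect_swag_moments(model, loader, optimizer, loss_fn, num_epochs, collect_freq=1):
    """Run constant-learning-rate SGD; accumulate moments of the iterates w_t."""
    mean, sq_mean, n = None, None, 0
    for epoch in range(num_epochs):
        for x, y in loader:                      # standard SGD epoch
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch % collect_freq == 0:            # snapshot w_t
            w = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
            if mean is None:
                mean, sq_mean = w.clone(), w ** 2
            else:
                mean = (n * mean + w) / (n + 1)          # running w_bar
                sq_mean = (n * sq_mean + w ** 2) / (n + 1)
            n += 1
    var = (sq_mean - mean ** 2).clamp(min=1e-30)         # diagonal of the covariance
    return mean, var

def sample_swag_diag(mean, var, scale=0.5):
    """Draw w_j ~ N(w_bar, scale * diag(var)); scale=1/2 mirrors the 1/2 in K."""
    return mean + (scale * var).sqrt() * torch.randn_like(mean)
```

Samples can be loaded back with torch.nn.utils.vector_to_parameters and averaged through the predictive distribution, as in the Slide 6 sketch.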

SLIDE 13

Trajectory in PCA Subspace

SLIDE 14

Uncertainty Calibration

[Reliability diagrams: confidence (max prob) on the x-axis vs. confidence − accuracy on the y-axis, for four settings:]

◮ WideResNet28x10, CIFAR-100
◮ WideResNet28x10, CIFAR-10 → STL-10
◮ DenseNet-161, ImageNet
◮ ResNet-152, ImageNet
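The quantity on the y-axis above can be computed in a few lines; a sketch (added for this transcript) assuming standard reliability-diagram binning, with equal-mass confidence bins:

```python
# Per-bin (confidence - accuracy): positive values indicate overconfidence.
import numpy as np

def confidence_vs_accuracy(probs, labels, n_bins=6):
    """probs: (N, C) predicted class probabilities; labels: (N,) true classes."""
    conf = probs.max(axis=1)                  # confidence = max predicted probability
    correct = (probs.argmax(axis=1) == labels)
    edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1))  # equal-mass bins
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf <= hi)
        if mask.any():
            gaps.append((conf[mask].mean(), conf[mask].mean() - correct[mask].mean()))
    return gaps                               # (bin confidence, confidence - accuracy)
```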

SLIDE 15

SWAG Regression Uncertainty

SLIDE 16

SWAG Visualization

SLIDE 17

Subspace Inference for Bayesian Deep Learning

A modular approach:

◮ Construct a subspace of the network's high-dimensional parameter space
◮ Perform inference directly in the subspace
◮ Sample from the approximate posterior for Bayesian model averaging

We can approximate the posterior of a WideResNet with 36 million parameters in a 5D subspace and achieve state-of-the-art results!

SLIDE 18

Subspace Construction

◮ Choose a shift ŵ and basis vectors {d1, . . . , dk}.

◮ Define the subspace S = {w | w = ŵ + z1 d1 + · · · + zk dk}.

◮ Likelihood: p(D|z) = pM(D | w = ŵ + Pz).

◮ Posterior inference: p(z|D) ∝ p(D|z) p(z).
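A minimal sketch of this parameterization (an added illustration): P stacks the basis vectors as columns, `loss_fn` is assumed to return the summed negative log-likelihood, and the actual inference step over z (e.g. MCMC or VI in the low-dimensional subspace) is left abstract.

```python
# Evaluate log p(z | D) up to a constant, for inference in the k-dim subspace.
import torch

def make_subspace_log_posterior(model, loss_fn, loader, w_hat, P, prior_std=1.0):
    """w_hat: (d,) shift; P: (d, k) with columns d_1, ..., d_k (e.g. PCA
    directions of the SGD trajectory). Assumes a N(0, prior_std^2 I) prior on z."""
    def log_posterior(z):                           # z: (k,)
        w = w_hat + P @ z                           # map subspace point to weights
        torch.nn.utils.vector_to_parameters(w, model.parameters())
        log_lik = 0.0
        with torch.no_grad():
            for x, y in loader:                     # log p(D|z) = log p_M(D | w_hat + Pz)
                log_lik = log_lik - loss_fn(model(x), y)  # summed NLL -> log-likelihood
        log_prior = -0.5 * (z / prior_std).pow(2).sum()   # log p(z) up to a constant
        return log_lik + log_prior
    return log_posterior
```

Because z is only, say, 5-dimensional, even simple samplers run over this log-posterior cheaply.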

SLIDES 19–33

Curve Subspace Traversal

[Figure slides: frame-by-frame traversal of the curve subspace]

SLIDE 34

Subspace Comparison (Regression)

SLIDE 35

Subspace Comparison (Classification)

Accuracy and NLL on CIFAR-100.

Bayesian methods also lead to better point predictions in deep learning!

Subspace Inference for Bayesian Deep Learning
P. Izmailov, W. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, A.G. Wilson. UAI 2019.

SLIDE 36

Conclusions

◮ Neural networks represent many compelling solutions to a given problem, and are very underspecified by the available data. This is the perfect situation for Bayesian marginalization.

◮ Even if we cannot perfectly express our priors, or perform full Bayesian inference, we can try our best and get much better point predictions as well as improved calibration. We can view standard training as an impoverished Bayesian approximation.

◮ By exploiting information about the loss geometry in training, we can scale Bayesian neural networks to ImageNet with improvements in accuracy and calibration, and essentially no runtime overhead.

SLIDE 37

Join Us! There is a postdoc opening in my group! Join an energetic and ambitious team of scientists in New York City, looking to address big open questions in core machine learning.

SLIDE 38

Scalable Gaussian Processes

◮ Run exact GPs on millions of points in minutes.
◮ Outperforms stand-alone deep neural networks by learning deep kernels.
◮ Implemented in our new library GPyTorch: gpytorch.ai
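A minimal exact-GP regression example in GPyTorch (added here following the library's standard usage, not code from the talk; see gpytorch.ai for the full tutorials):

```python
# Exact GP regression in GPyTorch: define the model, fit hyperparameters by
# maximizing the marginal likelihood, then query the posterior predictive.
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(100)  # toy data

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(50):                        # fit kernel/noise hyperparameters
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(torch.linspace(0, 1.5, 50)))  # posterior predictive
```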

SLIDE 39

Gaussian processes: a function space view

Gaussian processes provide an intuitive function-space perspective on learning and generalization.

GP posterior ∝ Likelihood × GP prior:
p(f(x)|D) ∝ p(D|f(x)) p(f(x))

[Figure, two panels: sample prior functions and sample posterior functions; inputs x vs. outputs f(x)]
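A short sketch (added here; standard GP regression formulas, with an assumed RBF kernel and toy observations) that reproduces the two panels:

```python
# Draw sample functions from a GP prior and from the closed-form GP posterior.
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

xs = np.linspace(-10, 10, 200)           # test inputs
X = np.array([-5.0, 0.0, 3.0])           # toy observed inputs
y = np.array([-1.0, 2.0, 0.5])           # toy observed outputs
noise = 1e-2

# Prior samples: f ~ N(0, K(xs, xs)).
K = rbf(xs, xs) + 1e-6 * np.eye(len(xs))
prior_samples = np.linalg.cholesky(K) @ np.random.randn(len(xs), 3)

# Posterior p(f(x)|D) ∝ p(D|f(x)) p(f(x)), available in closed form:
Kxx = rbf(X, X) + noise * np.eye(len(X))
Ksx = rbf(xs, X)
mu = Ksx @ np.linalg.solve(Kxx, y)                    # posterior mean
cov = K - Ksx @ np.linalg.solve(Kxx, Ksx.T)           # posterior covariance
post_samples = mu[:, None] + np.linalg.cholesky(
    cov + 1e-6 * np.eye(len(xs))) @ np.random.randn(len(xs), 3)
```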

SLIDE 40

BoTorch: Bayesian Optimization in PyTorch

◮ Probabilistic active learning.
◮ Black-box objectives, hyperparameter tuning, A/B testing, global optimization.
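A minimal Bayesian-optimization step in BoTorch (added here following the library's standard usage, not code from the talk; the quadratic objective is a toy stand-in for a black box):

```python
# One BO step: fit a GP surrogate, then maximize Expected Improvement.
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(10, 2, dtype=torch.double)              # evaluated points in [0,1]^2
train_Y = -((train_X - 0.5) ** 2).sum(dim=-1, keepdim=True)  # toy black-box values

model = SingleTaskGP(train_X, train_Y)                       # GP surrogate
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

acq = ExpectedImprovement(model, best_f=train_Y.max())       # improvement over incumbent
candidate, _ = optimize_acqf(                                # propose the next query point
    acq,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double),
    q=1, num_restarts=5, raw_samples=64,
)
```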

SLIDE 41

Probabilistic Reinforcement Learning

Robust, sample-efficient online decision making under uncertainty.

SLIDE 42

References

• Stochastic Weight Averaging in PyTorch: https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
• Semi-supervised Learning with Normalizing Flows. To appear.
• W. Maddox, P. Izmailov, T. Garipov, D. Vetrov, A.G. Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. Advances in Neural Information Processing Systems (NeurIPS), 2019.
• K.A. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Weinberger, A.G. Wilson. Exact Gaussian Processes on a Million Data Points. Advances in Neural Information Processing Systems (NeurIPS), 2019.
• P. Izmailov, W. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, A.G. Wilson. Subspace Inference for Bayesian Deep Learning. Uncertainty in Artificial Intelligence (UAI), 2019.
• G. Yang, T. Chen, P. Kirichenko, J. Bai, A.G. Wilson, C. De Sa. SWALP: Stochastic Weight Averaging in Low Precision Training. International Conference on Machine Learning (ICML), 2019.
• W. Herlands, D.B. Neill, H. Nickisch, A.G. Wilson. Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction. Journal of Machine Learning Research (JMLR), 2019.
• B. Athiwaratkun, M. Finzi, P. Izmailov, A.G. Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. International Conference on Learning Representations (ICLR), 2019.
• T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. Advances in Neural Information Processing Systems (NeurIPS), 2018.
• J. Gardner, G. Pleiss, D. Bindel, K. Weinberger, A.G. Wilson. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. Advances in Neural Information Processing Systems (NeurIPS), 2018.
• C.E. Rasmussen and Z. Ghahramani. Occam's Razor. Advances in Neural Information Processing Systems (NeurIPS), 2001.
• D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
• C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, A.G. Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. Uncertainty in Artificial Intelligence (UAI), 2018.

SLIDE 43

References

• G. Pleiss, J. Gardner, K.Q. Weinberger, A.G. Wilson. Constant-Time Predictive Distributions for Gaussian Processes. International Conference on Machine Learning (ICML), 2018.
• B. Athiwaratkun, A.G. Wilson, A. Anandkumar. Probabilistic FastText. Association for Computational Linguistics (ACL), 2018.
• B. Athiwaratkun, A.G. Wilson. Hierarchical Density Order Embeddings. International Conference on Learning Representations (ICLR), 2018.
• Y. Saatchi, A.G. Wilson. Bayesian GAN. Neural Information Processing Systems (NeurIPS), 2017.
• B. Athiwaratkun, A.G. Wilson. Multimodal Word Distributions. Association for Computational Linguistics (ACL), 2017.
• M. Al-Shedivat, A.G. Wilson, Y. Saatchi, Z. Hu, E.P. Xing. Learning Scalable Deep Kernels with Recurrent Structure. Journal of Machine Learning Research (JMLR), 2017.
• A.G. Wilson, Z. Hu, R. Salakhutdinov, E.P. Xing. Stochastic Variational Deep Kernel Learning. Neural Information Processing Systems (NeurIPS), 2016.
• A.G. Wilson, Z. Hu, R. Salakhutdinov, E.P. Xing. Deep Kernel Learning. Artificial Intelligence and Statistics (AISTATS), 2016.
• A.G. Wilson, C. Dann, C.G. Lucas, E.P. Xing. The Human Kernel. Neural Information Processing Systems (NeurIPS), 2015.
• A.G. Wilson, H. Nickisch. Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). International Conference on Machine Learning (ICML), 2015.
• A.G. Wilson. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge, October 2014.