Using Loss Surface Geometry for Practical Bayesian Deep Learning

SLIDE 1

Using Loss Surface Geometry for Practical Bayesian Deep Learning

Andrew Gordon Wilson

https://cims.nyu.edu/~andrewgw
New York University

Bayesian Deep Learning Workshop
Advances in Neural Information Processing Systems
December 13, 2019

Collaborators: Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov

SLIDE 2

Model Selection

[Figure: airline passengers (thousands) by year, 1949–1961]

Which model should we choose?

(1): f1(x) = a0 + a1 x
(2): f2(x) = ∑_{j=0}^{3} aj x^j
(3): f3(x) = ∑_{j=0}^{10^4} aj x^j
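To make the comparison concrete, here is a minimal sketch (an illustration added in this transcript, not from the slides) that fits polynomial models of increasing order by least squares; the data is a toy stand-in for the airline series:

```python
# Minimal sketch: compare polynomial models of increasing order on toy data.
import numpy as np

# Hypothetical stand-in for the airline passenger series (thousands), monthly.
x = np.linspace(1949, 1961, 144)
y = 100 + 30 * (x - 1949) + 20 * np.sin(2 * np.pi * x)  # toy trend + seasonality

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns predictions on the training inputs."""
    t = (x - x.min()) / (x.max() - x.min())  # rescale inputs for conditioning
    return np.polyval(np.polyfit(t, y, degree), t)

for degree in [1, 3, 10]:  # degree 10 stands in for the slide's (huge) 10^4 case
    rmse = np.sqrt(np.mean((y - fit_poly(x, y, degree)) ** 2))
    print(f"degree {degree:2d}: train RMSE = {rmse:.2f}")
```

Training fit alone always favors the most flexible model; the p(data|model) view on the next slide is what trades off fit against complexity.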

SLIDE 3

How do we learn?

◮ The ability of a system to learn is determined by its support (which solutions are a priori possible) and its inductive biases (which solutions are a priori likely).

◮ An influx of new massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.

[Figure: p(data|model) over all possible datasets, for simple, medium, and flexible models]

SLIDE 4

Bayesian Deep Learning

Why?

Why?

◮ A powerful framework for model construction and understanding generalization.
◮ Uncertainty representation and calibration (crucial for decision making).
◮ Better point estimates.
◮ Interpretably incorporate prior knowledge and domain expertise.
◮ It was the most successful approach at the end of the second wave of neural networks (Neal, 1998).
◮ Neural nets are much less mysterious when viewed through the lens of probability theory.

Why not?

◮ Can be computationally intractable (but doesn't have to be).
◮ Can involve a lot of moving parts (but doesn't have to).

There has been exciting progress in the last year addressing these limitations.

SLIDE 5

Wide Optima Generalize Better

Keskar et al. (2017)

◮ Bayesian integration will give very different predictions in deep learning especially!

SLIDE 6

Bayesian Deep Learning

Sum rule: p(x) = ∑_y p(x, y).   Product rule: p(x, y) = p(x|y) p(y) = p(y|x) p(x).

p(y|x∗, y, X) = ∫ p(y|x∗, w) p(w|y, X) dw .   (1)

◮ Think of each setting of w as a different model. Eq. (1) is a Bayesian model average: an average of infinitely many models weighted by their posterior probabilities.

◮ Automatically calibrated complexity, even with highly flexible models.

◮ Can view classical training as using an approximate posterior q(w|y, X) = δ(w = wMAP).

◮ Typically we are more interested in the induced distribution over functions than in the parameters w. It can be hard to have intuitions for the prior p(w).
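As a sketch of how Eq. (1) is used in practice (added in this transcript, not from the talk), the integral is replaced by a Monte Carlo average over posterior samples; `posterior_samples` is a hypothetical stand-in for draws from any approximate posterior:

```python
# Monte Carlo approximation to the Bayesian model average in Eq. (1):
# p(y | x*, D) ≈ (1/J) sum_j p(y | x*, w_j),  w_j ~ p(w | D).
import torch

def bayesian_model_average(model, x_star, posterior_samples):
    """Average the predictive distribution over posterior weight samples.

    posterior_samples: list of state_dicts, each one draw w_j ~ p(w | D)
    (hypothetical: any approximate posterior sampler could supply these).
    """
    probs = []
    with torch.no_grad():
        for w in posterior_samples:
            model.load_state_dict(w)                         # set w = w_j
            probs.append(torch.softmax(model(x_star), dim=-1))  # p(y | x*, w_j)
    return torch.stack(probs).mean(dim=0)                    # (1/J) sum_j ...
```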

SLIDE 7

Mode Connectivity

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson. NeurIPS 2018.

SLIDES 8–11

Mode Connectivity

[Figure slides: loss-surface visualizations of mode connectivity]

SLIDE 12

Uncertainty Representation with SWAG

1. Leverage theory showing that SGD with a constant learning rate approximately samples from a Gaussian distribution.

2. Compute the first two moments of the SGD trajectory (SWA computes just the first).

3. Use these moments to construct a Gaussian approximation in weight space.

4. Sample from this Gaussian distribution, pass the samples through the predictive distribution, and form a Bayesian model average:

p(y∗|D) ≈ (1/J) ∑_{j=1}^{J} p(y∗|wj),   wj ∼ q(w|D),   q(w|D) = N(w̄, K)

w̄ = (1/T) ∑_t wt

K = (1/2) [ (1/(T−1)) ∑_t (wt − w̄)(wt − w̄)ᵀ + (1/(T−1)) ∑_t diag((wt − w̄)²) ]

A Simple Baseline for Bayesian Uncertainty in Deep Learning
W. Maddox, P. Izmailov, T. Garipov, D. Vetrov, A.G. Wilson. NeurIPS 2019.
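A compact sketch of steps 1–4 (an added illustration, not the authors' released code): only the diagonal part of K is kept, the low-rank deviation term is omitted for brevity, and `loader` and `loss_fn` are assumed.

```python
# Sketch of SWAG-diagonal: constant-LR SGD, running first/second moments,
# then a Gaussian approximation in weight space to sample from.
import torch

def collect_swag_moments(model, loader, optimizer, loss_fn, num_epochs, collect_freq=1):
    """Run constant-learning-rate SGD; accumulate moments of the iterates w_t."""
    mean, sq_mean, n = None, None, 0
    for epoch in range(num_epochs):
        for x, y in loader:                      # standard SGD epoch
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch % collect_freq == 0:            # snapshot w_t
            w = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
            if mean is None:
                mean, sq_mean = w.clone(), w ** 2
            else:
                mean = (n * mean + w) / (n + 1)          # running w_bar
                sq_mean = (n * sq_mean + w ** 2) / (n + 1)
            n += 1
    var = (sq_mean - mean ** 2).clamp(min=1e-30)         # diagonal of the covariance
    return mean, var

def sample_swag_diag(mean, var, scale=0.5):
    """Draw w_j ~ N(w_bar, scale * diag(var)); scale=1/2 mirrors the 1/2 in K."""
    return mean + (scale * var).sqrt() * torch.randn_like(mean)
```

Samples can be loaded back with torch.nn.utils.vector_to_parameters and averaged through the predictive distribution, as in the Slide 6 sketch.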

SLIDE 13

Trajectory in PCA Subspace

SLIDE 14

Uncertainty Calibration

[Reliability diagrams: confidence (max prob) on the x-axis vs. confidence − accuracy on the y-axis, for four settings:]

◮ WideResNet28x10, CIFAR-100
◮ WideResNet28x10, CIFAR-10 → STL-10
◮ DenseNet-161, ImageNet
◮ ResNet-152, ImageNet
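The quantity on the y-axis above can be computed in a few lines; a sketch (added for this transcript) assuming standard reliability-diagram binning, with equal-mass confidence bins:

```python
# Per-bin (confidence - accuracy): positive values indicate overconfidence.
import numpy as np

def confidence_vs_accuracy(probs, labels, n_bins=6):
    """probs: (N, C) predicted class probabilities; labels: (N,) true classes."""
    conf = probs.max(axis=1)                  # confidence = max predicted probability
    correct = (probs.argmax(axis=1) == labels)
    edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1))  # equal-mass bins
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf <= hi)
        if mask.any():
            gaps.append((conf[mask].mean(), conf[mask].mean() - correct[mask].mean()))
    return gaps                               # (bin confidence, confidence - accuracy)
```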

SLIDE 15

SWAG Regression Uncertainty

SLIDE 16

SWAG Visualization

SLIDE 17

Subspace Inference for Bayesian Deep Learning

A modular approach:

◮ Construct a subspace of the network's high-dimensional parameter space
◮ Perform inference directly in the subspace
◮ Sample from the approximate posterior for Bayesian model averaging

We can approximate the posterior of a WideResNet with 36 million parameters in a 5D subspace and achieve state-of-the-art results!

SLIDE 18

Subspace Construction

◮ Choose a shift ŵ and basis vectors {d1, . . . , dk}.

◮ Define the subspace S = {w | w = ŵ + z1 d1 + · · · + zk dk}.

◮ Likelihood: p(D|z) = pM(D | w = ŵ + Pz).

◮ Posterior inference: p(z|D) ∝ p(D|z) p(z).
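A minimal sketch of this parameterization (an added illustration): P stacks the basis vectors as columns, `loss_fn` is assumed to return the summed negative log-likelihood, and the actual inference step over z (e.g. MCMC or VI in the low-dimensional subspace) is left abstract.

```python
# Evaluate log p(z | D) up to a constant, for inference in the k-dim subspace.
import torch

def make_subspace_log_posterior(model, loss_fn, loader, w_hat, P, prior_std=1.0):
    """w_hat: (d,) shift; P: (d, k) with columns d_1, ..., d_k (e.g. PCA
    directions of the SGD trajectory). Assumes a N(0, prior_std^2 I) prior on z."""
    def log_posterior(z):                           # z: (k,)
        w = w_hat + P @ z                           # map subspace point to weights
        torch.nn.utils.vector_to_parameters(w, model.parameters())
        log_lik = 0.0
        with torch.no_grad():
            for x, y in loader:                     # log p(D|z) = log p_M(D | w_hat + Pz)
                log_lik = log_lik - loss_fn(model(x), y)  # summed NLL -> log-likelihood
        log_prior = -0.5 * (z / prior_std).pow(2).sum()   # log p(z) up to a constant
        return log_lik + log_prior
    return log_posterior
```

Because z is only, say, 5-dimensional, even simple samplers run over this log-posterior cheaply.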

SLIDES 19–33

Curve Subspace Traversal

[Figure slides: frame-by-frame traversal of the curve subspace]

SLIDE 34

Subspace Comparison (Regression)

SLIDE 35

Subspace Comparison (Classification)

Accuracy and NLL on CIFAR-100.

Bayesian methods also lead to better point predictions in deep learning!

Subspace Inference for Bayesian Deep Learning
P. Izmailov, W. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, A.G. Wilson. UAI 2019.

SLIDE 36

Conclusions

◮ Neural networks represent many compelling solutions to a given problem, and are very underspecified by the available data. This is the perfect situation for Bayesian marginalization.

◮ Even if we cannot perfectly express our priors, or perform full Bayesian inference, we can try our best and get much better point predictions as well as improved calibration. We can view standard training as an impoverished Bayesian approximation.

◮ By exploiting information about the loss geometry in training, we can scale Bayesian neural networks to ImageNet with improvements in accuracy and calibration, and essentially no runtime overhead.

SLIDE 37

Join Us! There is a postdoc opening in my group! Join an energetic and ambitious team of scientists in New York City, looking to address big open questions in core machine learning.

SLIDE 38

Scalable Gaussian Processes

◮ Run exact GPs on millions of points in minutes.
◮ Outperforms stand-alone deep neural networks by learning deep kernels.
◮ Implemented in our new library GPyTorch: gpytorch.ai
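A minimal exact-GP regression example in GPyTorch (added here following the library's standard usage, not code from the talk; see gpytorch.ai for the full tutorials):

```python
# Exact GP regression in GPyTorch: define the model, fit hyperparameters by
# maximizing the marginal likelihood, then query the posterior predictive.
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(100)  # toy data

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(50):                        # fit kernel/noise hyperparameters
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(torch.linspace(0, 1.5, 50)))  # posterior predictive
```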

SLIDE 39

Gaussian processes: a function space view

Gaussian processes provide an intuitive function-space perspective on learning and generalization.

GP posterior ∝ Likelihood × GP prior:
p(f(x)|D) ∝ p(D|f(x)) p(f(x))

[Figure, two panels: sample prior functions and sample posterior functions; inputs x vs. outputs f(x)]
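A short sketch (added here; standard GP regression formulas, with an assumed RBF kernel and toy observations) that reproduces the two panels:

```python
# Draw sample functions from a GP prior and from the closed-form GP posterior.
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

xs = np.linspace(-10, 10, 200)           # test inputs
X = np.array([-5.0, 0.0, 3.0])           # toy observed inputs
y = np.array([-1.0, 2.0, 0.5])           # toy observed outputs
noise = 1e-2

# Prior samples: f ~ N(0, K(xs, xs)).
K = rbf(xs, xs) + 1e-6 * np.eye(len(xs))
prior_samples = np.linalg.cholesky(K) @ np.random.randn(len(xs), 3)

# Posterior p(f(x)|D) ∝ p(D|f(x)) p(f(x)), available in closed form:
Kxx = rbf(X, X) + noise * np.eye(len(X))
Ksx = rbf(xs, X)
mu = Ksx @ np.linalg.solve(Kxx, y)                    # posterior mean
cov = K - Ksx @ np.linalg.solve(Kxx, Ksx.T)           # posterior covariance
post_samples = mu[:, None] + np.linalg.cholesky(
    cov + 1e-6 * np.eye(len(xs))) @ np.random.randn(len(xs), 3)
```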

SLIDE 40

BoTorch: Bayesian Optimization in PyTorch

◮ Probabilistic active learning.
◮ Black-box objectives, hyperparameter tuning, A/B testing, global optimization.
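A minimal Bayesian-optimization step in BoTorch (added here following the library's standard usage, not code from the talk; the quadratic objective is a toy stand-in for a black box):

```python
# One BO step: fit a GP surrogate, then maximize Expected Improvement.
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(10, 2, dtype=torch.double)              # evaluated points in [0,1]^2
train_Y = -((train_X - 0.5) ** 2).sum(dim=-1, keepdim=True)  # toy black-box values

model = SingleTaskGP(train_X, train_Y)                       # GP surrogate
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

acq = ExpectedImprovement(model, best_f=train_Y.max())       # improvement over incumbent
candidate, _ = optimize_acqf(                                # propose the next query point
    acq,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double),
    q=1, num_restarts=5, raw_samples=64,
)
```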

SLIDE 41

Probabilistic Reinforcement Learning

Robust, sample-efficient online decision making under uncertainty.

SLIDE 42

References

• Stochastic Weight Averaging in PyTorch: https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
• Semi-supervised Learning with Normalizing Flows. To appear.
• W. Maddox, P. Izmailov, T. Garipov, D. Vetrov, A.G. Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. Advances in Neural Information Processing Systems (NeurIPS), 2019.
• K.A. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Weinberger, A.G. Wilson. Exact Gaussian Processes on a Million Data Points. Advances in Neural Information Processing Systems (NeurIPS), 2019.
• P. Izmailov, W. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, A.G. Wilson. Subspace Inference for Bayesian Deep Learning. Uncertainty in Artificial Intelligence (UAI), 2019.
• G. Yang, T. Chen, P. Kirichenko, J. Bai, A.G. Wilson, C. De Sa. SWALP: Stochastic Weight Averaging in Low Precision Training. International Conference on Machine Learning (ICML), 2019.
• W. Herlands, D.B. Neill, H. Nickisch, A.G. Wilson. Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction. Journal of Machine Learning Research (JMLR), 2019.
• B. Athiwaratkun, M. Finzi, P. Izmailov, A.G. Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. International Conference on Learning Representations (ICLR), 2019.
• T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. Advances in Neural Information Processing Systems (NeurIPS), 2018.
• J. Gardner, G. Pleiss, D. Bindel, K. Weinberger, A.G. Wilson. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. Advances in Neural Information Processing Systems (NeurIPS), 2018.
• C.E. Rasmussen and Z. Ghahramani. Occam's Razor. Advances in Neural Information Processing Systems (NeurIPS), 2001.
• D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
• C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, A.G. Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. Uncertainty in Artificial Intelligence (UAI), 2018.

SLIDE 43

References

• G. Pleiss, J. Gardner, K.Q. Weinberger, A.G. Wilson. Constant-Time Predictive Distributions for Gaussian Processes. International Conference on Machine Learning (ICML), 2018.
• B. Athiwaratkun, A.G. Wilson, A. Anandkumar. Probabilistic FastText. Association for Computational Linguistics (ACL), 2018.
• B. Athiwaratkun, A.G. Wilson. Hierarchical Density Order Embeddings. International Conference on Learning Representations (ICLR), 2018.
• Y. Saatchi, A.G. Wilson. Bayesian GAN. Neural Information Processing Systems (NeurIPS), 2017.
• B. Athiwaratkun, A.G. Wilson. Multimodal Word Distributions. Association for Computational Linguistics (ACL), 2017.
• M. Al-Shedivat, A.G. Wilson, Y. Saatchi, Z. Hu, E.P. Xing. Learning Scalable Deep Kernels with Recurrent Structure. Journal of Machine Learning Research (JMLR), 2017.
• A.G. Wilson, Z. Hu, R. Salakhutdinov, E.P. Xing. Stochastic Variational Deep Kernel Learning. Neural Information Processing Systems (NeurIPS), 2016.
• A.G. Wilson, Z. Hu, R. Salakhutdinov, E.P. Xing. Deep Kernel Learning. Artificial Intelligence and Statistics (AISTATS), 2016.
• A.G. Wilson, C. Dann, C.G. Lucas, E.P. Xing. The Human Kernel. Neural Information Processing Systems (NeurIPS), 2015.
• A.G. Wilson, H. Nickisch. Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). International Conference on Machine Learning (ICML), 2015.
• A.G. Wilson. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge, October 2014.