CSC2541 Lecture 5: Natural Gradient (Roger Grosse)


  1. CSC2541 Lecture 5: Natural Gradient (Roger Grosse)

  2. Motivation
     - Two classes of optimization procedures used throughout ML:
       - (stochastic) gradient descent, with momentum, and maybe coordinate-wise rescaling (e.g. Adam): can take many iterations to converge, especially if the problem is ill-conditioned
       - coordinate descent (e.g. EM): requires full-batch updates, which are expensive for large datasets
     - Natural gradient is an elegant solution to both problems.
     - How it fits in with this course:
       - This lecture: it's an elegant and efficient way of doing variational inference
       - Later: using probabilistic modeling to make natural gradient practical for neural nets
       - Bonus groundbreaking result: natural gradient can be interpreted as variational inference!

  3. Motivation
     - SGD bounces around in high curvature directions and makes slow progress in low curvature directions. (Note: this cartoon understates the problem by orders of magnitude!)
     - This happens because when we train a neural net (or some other ML model), we are optimizing over a complicated manifold of functions. Mapping a manifold to a flat coordinate system distorts distances.
     - Natural gradient: compute the gradient on the globe, not on the map.

  4. Motivation: Invariances
     - Suppose we have the following dataset for linear regression:

           x1       x2         t
           114.8    0.00323    5.1
           338.1    0.00183    3.2
           98.8     0.00279    4.1
           ...      ...        ...

     - This can happen since the inputs have arbitrary units.
     - Which weight, w1 or w2, will receive a larger gradient descent update? Which one do you want to receive a larger update?
     - Note: the figure vastly understates the narrowness of the ravine!

  5. Motivation: Invariances
     - Or maybe x1 and x2 correspond to years:

           x1      x2      t
           2003    2005    3.3
           2001    2008    4.8
           1998    2003    2.9
           ...     ...     ...

  6. Motivation: Invariances
     - Consider minimizing a function h(x), where x is measured in feet.
     - Gradient descent update: x ← x − α dh/dx
     - But dh/dx has units 1/feet. So we're adding feet and 1/feet, which is nonsense. This is why gradient descent has problems with badly scaled data.
     - Natural gradient is a dimensionally correct optimization algorithm. In fact, the updates are equivalent (to first order) in any coordinate system!

  7. Steepest Descent (Rosenbrock example)
     - The gradient defines a linear approximation to a function:
           h(x + Δx) ≈ h(x) + ∇h(x)⊤ Δx
     - We don't trust this approximation globally. Steepest descent tries to prevent the update from moving too far, in terms of some dissimilarity measure D:
           x_{k+1} ← argmin_x [ ∇h(x_k)⊤ (x − x_k) + λ D(x, x_k) ]
     - Gradient descent can be seen as steepest descent with D(x, x′) = ½ ‖x − x′‖² (a quick numerical check of this is sketched below). Not a very interesting D, since it depends on the coordinate system.
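A minimal numerical check of that last point, as assumed example code rather than anything from the lecture (the function names, the use of SciPy, and the choice λ = 10 are my own): solving the steepest-descent subproblem with the Euclidean D on the Rosenbrock function recovers an ordinary gradient step with step size 1/λ.

```python
# Minimal sketch (assumed example): with D(x, x') = 1/2 ||x - x'||^2, the
# steepest-descent subproblem reduces to a plain gradient step of size 1/lambda.
import numpy as np
from scipy.optimize import minimize

def h(x):
    # Rosenbrock function, the example mentioned on the slide
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_h(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

lam = 10.0                       # penalty strength lambda (assumed value)
x_k = np.array([-1.0, 1.0])
g = grad_h(x_k)

# Subproblem: minimize  grad_h(x_k)^T (x - x_k) + lam * D(x, x_k)
subproblem = lambda x: g @ (x - x_k) + lam * 0.5 * np.sum((x - x_k)**2)
x_numeric = minimize(subproblem, x_k).x

# Closed form: a plain gradient step with step size 1/lam
x_closed = x_k - g / lam
print(np.allclose(x_numeric, x_closed, atol=1e-5))  # True
```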

  8. Steepest Descent
     - A more interesting class of dissimilarity measures is Mahalanobis metrics:
           D(x, x′) = (x − x′)⊤ A (x − x′)
     - Steepest descent update (sketched in code below):
           x ← x − λ⁻¹ A⁻¹ ∇h(x)
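A minimal sketch of that preconditioned update; the toy quadratic and the choice A = H are illustrative assumptions, not the lecture's example. The step solves the linear system A d = ∇h(x) rather than forming A⁻¹ explicitly.

```python
# Minimal sketch (assumed toy setup): one steepest-descent step under a
# Mahalanobis metric, i.e. the preconditioned update
#   x <- x - (1/lambda) A^{-1} grad h(x).
import numpy as np

def mahalanobis_step(x, grad, A, lam):
    # Solve A d = grad instead of inverting A explicitly (cheaper, more stable).
    d = np.linalg.solve(A, grad)
    return x - d / lam

# Badly scaled quadratic h(x) = 1/2 x^T H x (assumed example).
H = np.diag([100.0, 1.0])
grad_h = lambda x: H @ x

x = np.array([1.0, 1.0])
for _ in range(5):
    # Choosing A = H makes every coordinate shrink at the same rate,
    # removing the ill-conditioning that plain gradient descent suffers from.
    x = mahalanobis_step(x, grad_h(x), A=H, lam=2.0)
print(x)  # both coordinates shrink by the same factor (1/2 per step)
```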

  9. Steepest Descent
     - It's hard to compute the steepest descent update for an arbitrary D. But we can approximate it with a Mahalanobis metric by taking the second-order Taylor approximation:
           D(x, x′) ≈ ½ (x − x′)⊤ (∂²D/∂x²) (x − x′)
     - One interesting example: simulating gradient descent on a different space. (Rosenbrock example)
     - Later in this course, we'll use this insight to train neural nets much faster.

  10. Fisher Metric
     - If we're fitting a probabilistic model, the optimization variables parameterize a probability distribution. The obvious dissimilarity measure is KL divergence:
           D(θ, θ′) = D_KL(p_θ ‖ p_θ′)
     - The second-order Taylor approximation to KL divergence is given by the Fisher information matrix (estimated numerically in the sketch below):
           ∂²D_KL / ∂θ² = F = Cov_{x∼p_θ}(∇_θ log p_θ(x))
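A minimal sketch of estimating F by Monte Carlo, under assumed choices (a univariate Gaussian with parameters θ = (μ, σ) and a hand-derived score function); the closed form diag(1/σ², 2/σ²) used for comparison is the standard Fisher matrix for this parameterization.

```python
# Minimal sketch (assumed example): Monte Carlo estimate of the Fisher
# information F = Cov_{x ~ p_theta}(grad_theta log p_theta(x)) for a
# univariate Gaussian with parameters theta = (mu, sigma).
import numpy as np

def score(x, mu, sigma):
    # Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma).
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.stack([d_mu, d_sigma], axis=-1)

mu, sigma = 0.5, 2.0
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=1_000_000)

s = score(samples, mu, sigma)        # one score vector per sample
F_mc = np.cov(s, rowvar=False)       # Monte Carlo estimate of F

F_exact = np.diag([1 / sigma**2, 2 / sigma**2])  # closed form for (mu, sigma)
print(np.round(F_mc, 3))             # approximately [[0.25, 0], [0, 0.5]]
print(F_exact)
```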

  11. Fisher Metric
     - Fisher metric for two different parameterizations of a Gaussian: [figure]

  12. Fisher Metric
     - KL divergence is an intrinsic dissimilarity measure on distributions: it doesn't care how the distributions are parameterized.
     - Therefore, steepest descent in the Fisher metric (which approximates KL divergence) is invariant to parameterization, to first order. This is why it's called natural gradient.
     - Update rule (sketched in code below):
           θ ← θ − α F⁻¹ ∇_θ h
     - This can converge much faster than ordinary gradient descent. (example)
     - Hoffman et al. found that if you're doing variational inference on conjugate exponential families, the natural gradient updates are surprisingly elegant, even nicer than ordinary gradient descent!
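A minimal sketch of the natural gradient update, under assumed choices (maximum likelihood for a Gaussian's (μ, σ), the closed-form Fisher matrix from the previous sketch, and an arbitrary step size); this is illustrative, not the lecture's example.

```python
# Minimal sketch (assumed example): natural gradient steps
#   theta <- theta - alpha * F^{-1} grad_theta h(theta)
# for maximum likelihood fitting of a Gaussian's (mu, sigma).
import numpy as np

def nll_grad(theta, data):
    # Gradient of the negative log likelihood of N(mu, sigma^2).
    mu, sigma = theta
    d_mu = -np.sum(data - mu) / sigma**2
    d_sigma = len(data) / sigma - np.sum((data - mu)**2) / sigma**3
    return np.array([d_mu, d_sigma])

def fisher(theta, n):
    # Fisher information of n i.i.d. samples in the (mu, sigma) parameterization.
    _, sigma = theta
    return n * np.diag([1 / sigma**2, 2 / sigma**2])

rng = np.random.default_rng(0)
data = rng.normal(3.0, 0.1, size=1000)   # true mu = 3.0, sigma = 0.1

theta = np.array([0.0, 1.0])             # deliberately bad initialization
alpha = 0.5
for _ in range(50):
    g = nll_grad(theta, data)
    theta = theta - alpha * np.linalg.solve(fisher(theta, len(data)), g)
print(theta)  # close to (3.0, 0.1); plain gradient descent on this objective
              # needs a far smaller step size because of the poor scaling
```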
