

SLIDE 1: Lecture 4: Optimization

Justin Johnson, September 16, 2019

SLIDE 2: Waitlist Update

We will open the course for enrollment later today / tomorrow.

SLIDE 3: Reminder: Assignment 1

Was due yesterday! (But you do have late days…)

SLIDE 4: Assignment 2

  • Will be released today
  • Use SGD to train linear classifiers and fully-connected networks
  • After today, you can do the linear classifiers section
  • After Wednesday, you can do the fully-connected networks section
  • If you have a hard time computing derivatives, wait for next Monday's lecture on backprop
  • Due Monday, September 30, 11:59pm (two weeks from today)

SLIDE 5: Course Update

  • A1: 10%
  • A2: 10%
  • A3: 10%
  • A4: 10%
  • A5: 10%
  • A6: 10%
  • Midterm: 20%
  • Final: 20%

SLIDE 6: Course Update: No Final Exam

Old weighting:
  • A1: 10%
  • A2: 10%
  • A3: 10%
  • A4: 10%
  • A5: 10%
  • A6: 10%
  • Midterm: 20%
  • Final: 20%

New weighting:
  • A1: 10%
  • A2: 13%
  • A3: 13%
  • A4: 13%
  • A5: 13%
  • A6: 13%
  • Midterm: 25%
  • Final: none

Expect A5 and A6 to be longer than the other homework.

SLIDE 7: Last Time: Linear Classifiers

f(x, W) = Wx

  • Algebraic viewpoint: f(x, W) = Wx
  • Visual viewpoint: one template per class
  • Geometric viewpoint: hyperplanes cutting up space

SLIDE 8: Last Time: Loss Functions Quantify Preferences

  • We have some dataset of (x, y)
  • We have a score function: f(x, W) = Wx (linear classifier)
  • We have a loss function (e.g. Softmax or SVM), plus a full loss over the whole dataset

SLIDE 9: Last Time: Loss Functions Quantify Preferences

  • We have some dataset of (x, y)
  • We have a score function: f(x, W) = Wx (linear classifier)
  • We have a loss function (e.g. Softmax or SVM), plus a full loss over the whole dataset

Q: How do we find the best W?
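
The score and loss formulas on these slides were rendered as images; as a reconstruction of the standard setup (the regularization weight λ and regularizer R are assumptions about the notation):

$$ s = f(x; W) = Wx, \qquad L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i; W),\, y_i\big) + \lambda R(W) $$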

SLIDE 10: Optimization

SLIDE 11-12: (Figure slides)

Walking man image is CC0 1.0 public domain. This image is CC0 1.0 public domain.

SLIDE 13-15: Idea #1: Random Search (bad idea!)

Randomly sample many values of W and keep whichever gives the lowest loss.

15.5% accuracy! Not bad! (SOTA is ~95%)
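
The slide's code was not captured; here is a minimal sketch of what random search over W might look like. The names loss_fn, X, and y are assumptions: a hypothetical loss function and CIFAR-10-shaped data (X with a bias column, 10 classes).

```python
import numpy as np

def random_search(loss_fn, X, y, num_classes=10, num_trials=1000):
    """Try many random weight matrices and keep the one with the lowest loss."""
    best_loss, best_W = float('inf'), None
    for _ in range(num_trials):
        W = np.random.randn(X.shape[1], num_classes) * 1e-4  # random guess
        loss = loss_fn(X, y, W)
        if loss < best_loss:
            best_loss, best_W = loss, W
    return best_W, best_loss
```

Evaluating the best W found this way on the test set is what produced the 15.5% accuracy quoted above.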

SLIDE 16-17: Idea #2: Follow the Slope

In 1-dimension, the derivative of a function gives the slope.

SLIDE 18: Idea #2: Follow the Slope

In 1-dimension, the derivative of a function gives the slope.

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.
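
In symbols (a restatement of the text above, not taken from the slide images):

$$ \nabla_W L = \Big(\tfrac{\partial L}{\partial W_1}, \ldots, \tfrac{\partial L}{\partial W_d}\Big), \qquad \text{slope along a unit direction } \mathbf{u}: \ \nabla_W L \cdot \mathbf{u}, \qquad \text{steepest descent direction: } -\nabla_W L $$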

SLIDE 19-26: Numeric gradient example

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
gradient dL/dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]

W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25322
dL/dW (first dim) ≈ (1.25322 - 1.25347) / 0.0001 = -2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25353
dL/dW (second dim) ≈ (1.25353 - 1.25347) / 0.0001 = 0.6

W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
dL/dW (third dim) ≈ (1.25347 - 1.25347) / 0.0001 = 0.0

gradient dL/dW so far: [-2.5, 0.6, 0.0, ?, ?, ?, ?, ?, ?, …]

Numeric Gradient (see the code sketch below):
  • Slow: O(#dimensions)
  • Approximate

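
A minimal sketch of the one-sided finite-difference procedure illustrated above, treating the loss as a black-box f(W); the function name and h = 1e-4 mirror the slides' example.

```python
import numpy as np

def numeric_gradient(f, W, h=1e-4):
    """Approximate dL/dW one dimension at a time: (f(W + h*e_i) - f(W)) / h."""
    grad = np.zeros_like(W)
    loss_0 = f(W)                                 # e.g. 1.25347 in the example above
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h                          # bump one dimension by h
        grad[idx] = (f(W) - loss_0) / h           # one-sided finite difference
        W[idx] = old                              # restore W
        it.iternext()
    return grad
```

This is why the numeric gradient is slow: it needs one extra loss evaluation per dimension of W.
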
SLIDE 27-28: Loss is a Function of W: Analytic Gradient

The loss is a function of W, and we want dL/dW. Use calculus to compute an analytic gradient.

(Both images on these slides are in the public domain.)

SLIDE 29-30:

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
gradient dL/dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …]

dL/dW = ... (some function of the data and W)

(In practice we will compute dL/dW using backpropagation; see Lecture 6.)

SLIDE 31: Computing Gradients

  • Numeric gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

In practice: Always use the analytic gradient, but check your implementation with the numeric gradient. This is called a gradient check (see the sketch below).
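
A sketch of such a gradient check, comparing a precomputed analytic gradient against centered finite differences at a few random coordinates; the helper name and tolerances are assumptions.

```python
import numpy as np

def gradient_check(f, analytic_grad, W, num_checks=10, h=1e-6):
    """Compare analytic_grad (precomputed dL/dW) to numeric estimates at random entries of W."""
    for _ in range(num_checks):
        idx = tuple(np.random.randint(d) for d in W.shape)   # pick a random entry of W
        old = W[idx]
        W[idx] = old + h; f_plus = f(W)
        W[idx] = old - h; f_minus = f(W)
        W[idx] = old                                         # restore W
        numeric = (f_plus - f_minus) / (2 * h)               # centered finite difference
        rel_error = (abs(numeric - analytic_grad[idx])
                     / max(abs(numeric) + abs(analytic_grad[idx]), 1e-12))
        print('numeric %f, analytic %f, relative error %e'
              % (numeric, analytic_grad[idx], rel_error))
```
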


SLIDE 35: Gradient Descent

Iteratively step in the direction of the negative gradient (the direction of local steepest descent).

Hyperparameters (see the code sketch below):
  • Weight initialization method
  • Number of steps
  • Learning rate

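
The slide's code was an image; a minimal sketch of vanilla gradient descent, assuming a hypothetical compute_gradient(w) that returns dL/dw over the full dataset.

```python
import numpy as np

def gradient_descent(compute_gradient, w_init, learning_rate=1e-2, num_steps=100):
    """Vanilla gradient descent: repeatedly step along the negative gradient."""
    w = w_init.copy()                       # weight initialization method
    for _ in range(num_steps):              # number of steps
        dw = compute_gradient(w)            # full-batch gradient of the loss
        w -= learning_rate * dw             # step in the direction of steepest descent
    return w
```
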
SLIDE 36: Gradient Descent

(Figure: loss contours over two weights W_1, W_2; starting from the original W, each step moves in the negative gradient direction.)

SLIDE 38: Batch Gradient Descent

The full sum over all N training examples is expensive when N is large!
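
The formulas on this slide were images; reconstructed from the standard full-batch objective (λ and R are the regularization notation assumed earlier):

$$ L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(x_i, y_i, W) + \lambda R(W), \qquad \nabla_W L(W) = \frac{1}{N}\sum_{i=1}^{N} \nabla_W L_i(x_i, y_i, W) + \lambda \nabla_W R(W) $$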

SLIDE 39: Stochastic Gradient Descent (SGD)

The full sum is expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common minibatch sizes.

Hyperparameters (see the code sketch below):
  • Weight initialization
  • Number of steps
  • Learning rate
  • Batch size
  • Data sampling

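
A minimal SGD sketch, assuming hypothetical arrays X, y and a loss_gradient(X_batch, y_batch, w) helper; the batch size of 64 is one of the common choices mentioned above.

```python
import numpy as np

def sgd(loss_gradient, X, y, w_init, learning_rate=1e-2, batch_size=64, num_steps=1000):
    """Stochastic gradient descent: approximate the full sum with a minibatch each step."""
    w = w_init.copy()
    N = X.shape[0]
    for _ in range(num_steps):
        batch = np.random.choice(N, batch_size, replace=False)   # data sampling
        dw = loss_gradient(X[batch], y[batch], w)                 # minibatch gradient
        w -= learning_rate * dw
    return w
```
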
SLIDE 40: Stochastic Gradient Descent (SGD)

Think of the loss as an expectation over the full data distribution p_data, and approximate the expectation via sampling.
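
In symbols (a reconstruction; the slide's formulas were images), with minibatch samples drawn from p_data:

$$ L(W) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\big[L(x, y, W)\big] + \lambda R(W) \;\approx\; \frac{1}{B}\sum_{i=1}^{B} L(x_i, y_i, W) + \lambda R(W) $$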


SLIDE 42: Interactive Web Demo

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/

SLIDE 43: Problems with SGD

What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?

The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
SLIDE 44: Problems with SGD

What if the loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along the shallow dimension, jitter along the steep direction.

The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
SLIDE 45: Problems with SGD

What if the loss function has a local minimum or saddle point?

(Figures: a local minimum; a saddle point)

SLIDE 46: Problems with SGD

What if the loss function has a local minimum or saddle point? Zero gradient; gradient descent gets stuck.

(Figures: a local minimum; a saddle point)

SLIDE 47: Problems with SGD

Our gradients come from minibatches, so they can be noisy!

SLIDE 48: SGD

SLIDE 49: SGD + Momentum

  • Build up "velocity" as a running mean of gradients
  • Rho gives "friction"; typically rho = 0.9 or 0.99

SGD vs SGD+Momentum (see the code sketch below).

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013
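
The two update rules on the slide were images; a sketch of both, with grad_fn standing in for the (minibatch) gradient computation.

```python
import numpy as np

def sgd_step(w, dw, learning_rate):
    """Plain SGD update."""
    return w - learning_rate * dw

def sgd_momentum(grad_fn, w, learning_rate=1e-2, rho=0.9, num_steps=100):
    """SGD + Momentum: build up velocity as a running mean of gradients; rho acts as friction."""
    v = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        v = rho * v + dw                  # running mean of gradients
        w = w - learning_rate * v         # step along the velocity
    return w
```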

SLIDE 50: SGD + Momentum

You may see SGD+Momentum formulated in different ways, but they are equivalent: they give the same sequence of x.

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

SLIDE 51: SGD + Momentum

(Figure: SGD vs SGD+Momentum trajectories on cases with local minima, saddle points, gradient noise, and poor conditioning.)

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

SLIDE 52: SGD + Momentum

Momentum update: combine the gradient at the current point with the velocity to get the step used to update the weights.

(Figure: gradient, velocity, and actual step vectors.)

Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

SLIDE 53: Nesterov Momentum

Momentum update: combine the gradient at the current point with the velocity to get the step used to update the weights.

Nesterov Momentum: "look ahead" to the point where updating using the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction.

(Figure: gradient, velocity, and actual step vectors for both updates.)

Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

SLIDE 54: Nesterov Momentum

"Look ahead" to the point where updating using the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction.

(Figure: gradient, velocity, and actual step vectors.)

SLIDE 55: Nesterov Momentum

"Look ahead" to the point where updating using the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction.

Annoying: usually we want the update in terms of x_t and the gradient at x_t.

SLIDE 56: Nesterov Momentum

Annoying: usually we want the update in terms of x_t and the gradient at x_t. Change of variables and rearrange (see the code sketch below).
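
A sketch of the rearranged Nesterov update in terms of the current iterate; the (1 + rho) form follows the change-of-variables trick mentioned above, and grad_fn is a hypothetical gradient helper.

```python
import numpy as np

def nesterov_momentum(grad_fn, w, learning_rate=1e-2, rho=0.9, num_steps=100):
    """Nesterov momentum, rewritten so the gradient is evaluated at the current w."""
    v = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        old_v = v
        v = rho * v - learning_rate * dw
        w = w - rho * old_v + (1 + rho) * v   # the "look ahead" folded into the w update
    return w
```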

SLIDE 57: Nesterov Momentum

(Figure: trajectories of SGD, SGD+Momentum, and Nesterov.)

SLIDE 58: AdaGrad

Add element-wise scaling of the gradient based on the historical sum of squares in each dimension: "per-parameter learning rates" or "adaptive learning rates" (see the code sketch below).

Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
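
A sketch of the AdaGrad update; the 1e-7 term avoids division by zero and is a common choice, not taken from the slide, and grad_fn is a hypothetical gradient helper.

```python
import numpy as np

def adagrad(grad_fn, w, learning_rate=1e-2, num_steps=100):
    """AdaGrad: scale each dimension by the historical sum of squared gradients."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        grad_squared += dw * dw                                  # per-dimension history
        w = w - learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
    return w
```

Because grad_squared only grows, steps along "steep" directions shrink over time, which is the behavior the next slides ask about.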

SLIDE 59: AdaGrad

Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011

SLIDE 60: AdaGrad

Q: What happens with AdaGrad?

SLIDE 61: AdaGrad

Q: What happens with AdaGrad? Progress along "steep" directions is damped; progress along "flat" directions is accelerated.

SLIDE 62: RMSProp: "Leaky AdaGrad"

AdaGrad vs RMSProp (see the code sketch below).

Tieleman and Hinton, 2012
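
A sketch of the contrast: RMSProp replaces AdaGrad's running sum with a leaky (exponentially decaying) average; decay_rate is the extra hyperparameter, and grad_fn is again a hypothetical gradient helper.

```python
import numpy as np

def rmsprop(grad_fn, w, learning_rate=1e-2, decay_rate=0.99, num_steps=100):
    """RMSProp ("leaky AdaGrad"): decaying average of squared gradients instead of a running sum."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
        w = w - learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
    return w
```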

SLIDE 63: RMSProp

(Figure: trajectories of SGD, SGD+Momentum, and RMSProp.)

SLIDE 64: Adam (almost): RMSProp + Momentum

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

SLIDE 65: Adam (almost): RMSProp + Momentum

(Formula figure: the Adam first-moment update, annotated "Momentum", shown alongside SGD+Momentum.)

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

SLIDE 66: Adam (almost): RMSProp + Momentum

(Formula figure: the Adam update annotated with "Momentum" and "AdaGrad / RMSProp" terms, shown alongside RMSProp.)

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

SLIDE 67: Adam (almost): RMSProp + Momentum

Q: What happens at t = 0? (Assume beta2 = 0.999)

(Formula figure annotated with "Momentum", "AdaGrad / RMSProp", and "Bias correction" terms.)

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

SLIDE 68: Adam (almost): RMSProp + Momentum

Bias correction: corrects for the fact that the first and second moment estimates start at zero (see the code sketch below).

(Formula figure annotated with "Momentum", "AdaGrad / RMSProp", and "Bias correction" terms.)

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
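
A sketch of the full Adam update with bias correction, using the hyperparameter names from the slides (beta1, beta2); grad_fn is a hypothetical gradient helper.

```python
import numpy as np

def adam(grad_fn, w, learning_rate=1e-3, beta1=0.9, beta2=0.999, num_steps=100):
    """Adam: momentum-style first moment, RMSProp-style second moment, plus bias correction."""
    moment1 = np.zeros_like(w)
    moment2 = np.zeros_like(w)
    for t in range(1, num_steps + 1):                         # t starts at 1 for bias correction
        dw = grad_fn(w)
        moment1 = beta1 * moment1 + (1 - beta1) * dw          # Momentum
        moment2 = beta2 * moment2 + (1 - beta2) * dw * dw     # AdaGrad / RMSProp
        m1_hat = moment1 / (1 - beta1 ** t)                   # Bias correction
        m2_hat = moment2 / (1 - beta2 ** t)
        w = w - learning_rate * m1_hat / (np.sqrt(m2_hat) + 1e-7)
    return w
```

Without the bias correction, moment2 is close to zero on the very first step (with beta2 = 0.999), so the update would divide by a tiny number and could blow up, which is the answer to the question on the previous slide.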

SLIDE 69: Adam (almost): RMSProp + Momentum

Bias correction corrects for the fact that the first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3, 5e-4, or 1e-4 is a great starting point for many models!

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

SLIDE 70: Adam: Very Common in Practice!

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3, 5e-4, or 1e-4 is a great starting point for many models!

Gkioxari, Malik, and Johnson, ICCV 2019; Zhu, Kaplan, Johnson, and Fei-Fei, ECCV 2018; Johnson, Gupta, and Fei-Fei, CVPR 2018; Gupta, Johnson, et al, CVPR 2018; Bakhtin, van der Maaten, Johnson, Gustafson, and Girshick, NeurIPS 2019

SLIDE 71: Adam

(Figure: trajectories of SGD, SGD+Momentum, RMSProp, and Adam.)

SLIDE 72: Optimization Algorithm Comparison

Algorithm      | Tracks first moments (Momentum) | Tracks second moments (Adaptive learning rates) | Leaky second moments | Bias correction for moment estimates
SGD            | ✗                               | ✗                                               | ✗                    | ✗
SGD+Momentum   | ✓                               | ✗                                               | ✗                    | ✗
Nesterov       | ✓                               | ✗                                               | ✗                    | ✗
AdaGrad        | ✗                               | ✓                                               | ✗                    | ✗
RMSProp        | ✗                               | ✓                                               | ✓                    | ✗
Adam           | ✓                               | ✓                                               | ✓                    | ✓

SLIDE 73: So far: First-Order Optimization

(Figure: loss as a function of w1.)

SLIDE 74: So far: First-Order Optimization

  • 1. Use the gradient to make a linear approximation
  • 2. Step to minimize the approximation

(Figure: loss as a function of w1, with a linear approximation.)

SLIDE 75: Second-Order Optimization

  • 1. Use the gradient and Hessian to make a quadratic approximation
  • 2. Step to minimize the approximation

(Figure: loss as a function of w1, with a quadratic approximation.)

SLIDE 76: Second-Order Optimization

  • 1. Use the gradient and Hessian to make a quadratic approximation
  • 2. Step to minimize the approximation

Take bigger steps in areas of low curvature.
SLIDE 77: Second-Order Optimization

Second-order Taylor expansion; solving for the critical point, we obtain the Newton parameter update (reconstructed below):
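
The expansion and update on this slide were images; the standard forms they refer to are:

$$ L(w) \approx L(w_0) + (w - w_0)^\top \nabla_w L(w_0) + \tfrac{1}{2}(w - w_0)^\top \mathbf{H}_w L(w_0)\,(w - w_0) $$

$$ w^{*} = w_0 - \mathbf{H}_w L(w_0)^{-1}\, \nabla_w L(w_0) $$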

SLIDE 78: Second-Order Optimization

Second-order Taylor expansion; solving for the critical point, we obtain the Newton parameter update.

Q: Why is this impractical?

SLIDE 79: Second-Order Optimization

Second-order Taylor expansion; solving for the critical point, we obtain the Newton parameter update.

Q: Why is this impractical? The Hessian has O(N^2) elements, inverting it takes O(N^3), and N is (tens or hundreds of) millions.

SLIDE 80: Second-Order Optimization

  • Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
  • L-BFGS (Limited memory BFGS): does not form/store the full inverse Hessian.

SLIDE 81: Second-Order Optimization: L-BFGS

  • Usually works very well in full-batch, deterministic mode: if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
  • Does not transfer very well to the mini-batch setting; it gives bad results. Adapting second-order methods to the large-scale, stochastic setting is an active area of research.

Le et al, "On optimization methods for deep learning", ICML 2011
Ba et al, "Distributed second-order optimization using Kronecker-factored approximations", ICLR 2017

SLIDE 82: In Practice

  • Adam is a good default choice in many cases; SGD+Momentum can outperform Adam but may require more tuning.
  • If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise); see the sketch below.
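
As an illustration only (not from the slides): SciPy's off-the-shelf L-BFGS can be used for a full-batch, deterministic objective. The loss_and_grad argument is a hypothetical helper returning the scalar loss and the flattened gradient.

```python
import numpy as np
from scipy.optimize import minimize

def fit_full_batch_lbfgs(loss_and_grad, w_init, max_iter=100):
    """Full-batch, deterministic optimization with SciPy's L-BFGS-B.

    loss_and_grad(w_flat) must return (loss, gradient) for flattened weights.
    """
    result = minimize(loss_and_grad, w_init.ravel(), jac=True,
                      method='L-BFGS-B', options={'maxiter': max_iter})
    return result.x.reshape(w_init.shape)   # best weights found by L-BFGS
```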

SLIDE 83: Summary

  • 1. Use Linear Models for image classification problems
  • 2. Use Loss Functions (e.g. Softmax, SVM) to express preferences over different choices of weights
  • 3. Use Stochastic Gradient Descent to minimize our loss functions and train the model

SLIDE 84: Next Time: Neural Networks