
SLIDE 1

Advanced Machine Learning Gradient Descent for Non-Convex Functions

Amit Sethi Electrical Engineering, IIT Bombay

SLIDE 2

Learning outcomes for the lecture

  • Characterize non-convex loss surfaces with the Hessian
  • List issues with non-convex surfaces
  • Explain how certain optimization techniques help solve these issues

SLIDE 3

Contents

  • Characterizing non-convex loss surfaces
  • Issues with gradient descent
  • Issues with Newton’s method
  • Stochastic gradient descent to the rescue
  • Momentum and its variants
  • Saddle-free Newton
SLIDE 4

Why do we not get stuck in bad local minima?

  • Local minima are close to global minima in terms of error
  • Saddle points are much more likely at higher portions of the error surface (in high-dimensional weight space)
  • SGD (and other techniques) allows you to escape saddle points

SLIDE 5

Error surfaces and saddle points

Image sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif ; http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png

SLIDE 6

Eigenvalues of Hessian at critical points

Image source: http://i.stack.imgur.com/NsI2J.png

[Figure: Hessian eigenvalue patterns at critical points: long furrow, local minimum, saddle point, plateau]
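
A minimal sketch of this classification (my own illustration, not from the slides): compute the Hessian's eigenvalues at a critical point and read off the type. The test function f(x, y) = x² − y² is a hypothetical example whose Hessian at the origin is diag(2, −2).

    import numpy as np

    # Hypothetical test function f(x, y) = x**2 - y**2, critical point at the origin.
    # Its Hessian there is constant: diag(2, -2).
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])

    eigvals = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian

    if np.all(eigvals > 0):
        kind = "local minimum"
    elif np.all(eigvals < 0):
        kind = "local maximum"
    elif np.any(eigvals > 0) and np.any(eigvals < 0):
        kind = "saddle point"
    else:
        kind = "degenerate (plateau / long furrow along zero-eigenvalue directions)"

    print(eigvals, kind)  # [-2.  2.] saddle point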

SLIDE 7

A realistic picture

Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/

[Figure: a realistic loss landscape with global minima, local minima, saddle points, and local maxima labeled]

SLIDE 8

Is achieving global minima important?

  • The global minimum for the training data may not be the global minimum for the validation or test data
  • Local minima are often good enough

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15

SLIDE 9

Under certain assumptions, local minima are theoretically also of high quality

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15

  • Results:

    – The lowest critical values of the random loss form a band
    – The probability of finding a minimum outside that band diminishes exponentially with the size of the network
    – Empirical verification

  • Assumptions:

    – Fully-connected feed-forward neural network
    – Variable independence
    – Redundancy in network parametrization
    – Uniformity

SLIDE 10

Empirically, most minima are of high quality

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14

SLIDE 11

GD vs. Newton’s method

  • Gradient descent is based on a first-order approximation:
    f(θ* + Δθ) ≈ f(θ*) + gᵀΔθ,   with update Δθ = −η g
  • Newton's method is based on a second-order approximation:
    f(θ* + Δθ) ≈ f(θ*) + gᵀΔθ + ½ ΔθᵀH Δθ,   with update Δθ = −H⁻¹ g
  • In the basis of the Hessian's eigenvectors (with eigenvalues λᵢ), the second-order expansion becomes
    f(θ* + Δθ) = f(θ*) + ½ Σ i=1..n λᵢ Δvᵢ²

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
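
A minimal sketch of this contrast (my own example, not from the lecture): on a hypothetical convex quadratic f(w) = ½ wᵀHw, the Newton step −H⁻¹g lands on the minimum in one update, while the gradient-descent step −η g needs many iterations along low-curvature directions.

    import numpy as np

    # Hypothetical quadratic loss f(w) = 0.5 * w^T H w, with gradient g(w) = H w.
    H = np.array([[10.0, 0.0],
                  [0.0, 1.0]])          # positive definite but ill-conditioned
    grad = lambda w: H @ w

    w0 = np.array([1.0, 1.0])

    # Newton: delta_w = -H^{-1} g(w); lands on the minimum at the origin in one step.
    w_newton = w0 - np.linalg.solve(H, grad(w0))

    # Gradient descent: delta_w = -eta * g(w); slow along the shallow (low-curvature) axis.
    eta, w_gd = 0.1, w0.copy()
    for _ in range(20):
        w_gd = w_gd - eta * grad(w_gd)

    print(w_newton)   # [0. 0.]
    print(w_gd)       # ~[0.0, 0.12]: the low-curvature coordinate is still far from 0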

SLIDE 12

Disadvantages of 2nd order methods

  • Updates require O(d³) or at least O(d²) computation
  • May not work well for non-convex surfaces
  • Get attracted to saddle points: the Newton step −H⁻¹g jumps to the critical point of the local quadratic model, which can be a saddle when H has negative eigenvalues (illustrated below)
  • Not very good for batch-updates
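
A small illustration of the saddle-attraction point (my own example, not from the slides): for the hypothetical function f(x, y) = x² − y², whose only critical point is the saddle at the origin, the Newton step jumps exactly onto the saddle, while a gradient step moves away from it along the negative-curvature direction.

    import numpy as np

    # Hypothetical saddle example f(x, y) = x**2 - y**2.
    grad = lambda w: np.array([2.0 * w[0], -2.0 * w[1]])
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])   # constant Hessian, one negative eigenvalue

    w = np.array([0.5, 0.1])

    w_newton = w - np.linalg.solve(H, grad(w))   # -> [0, 0], the saddle point
    w_gd = w - 0.1 * grad(w)                     # -> [0.4, 0.12], moving away from the saddle in y

    print(w_newton, w_gd)
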
SLIDE 13

GD vs. SGD

  • GD:
  • wt+1 = wt − η g(wt), with g computed over all samples
  • SGD:
  • wt+1 = wt − η g(wt), with g computed over a random subset of samples

[Figure: one GD step using the gradient over all samples vs. one SGD step using the gradient over a random subset]
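
A minimal sketch of the two update rules above (the linear least-squares problem, data, learning rate, and batch size are my own assumptions, not from the slides): both use wt+1 = wt − η g(wt); they differ only in which samples the gradient is averaged over.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical linear least-squares data: y ≈ X @ w_true.
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

    def grad(w, idx):
        # Gradient of the mean squared error over the samples indexed by idx.
        Xb, yb = X[idx], y[idx]
        return 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)

    eta, batch_size = 0.05, 32
    w_gd, w_sgd = np.zeros(5), np.zeros(5)

    for t in range(200):
        # GD: one update using all samples.
        w_gd -= eta * grad(w_gd, np.arange(len(y)))
        # SGD: same rule, but the gradient is estimated from a random subset.
        w_sgd -= eta * grad(w_sgd, rng.choice(len(y), size=batch_size, replace=False))

    print(w_gd, w_sgd)   # both approach w_true; the SGD iterate is noisier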

SLIDE 14

Compare GD with SGD

  • GD requires more computations per update
  • SGD is noisier
SLIDE 15

SGD helps by changing the loss surface

  • Different mini-batches (or samples) have their own loss surfaces
  • The loss surface of the entire training set may be different
  • A local minimum of one loss surface may not be a local minimum of another (illustrated below)
  • This helps us escape local minima when using stochastic or mini-batch gradient descent
  • Mini-batch size depends on computational resource utilization
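
A small numerical illustration of the first three bullets (my own example; the 1-D model y ≈ w·x and the data are hypothetical): each mini-batch defines its own loss in w, and its minimizer generally differs from the full-data minimizer.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical 1-D least-squares problem: y ≈ w * x.
    x = rng.normal(size=200)
    y = 2.0 * x + rng.normal(size=200)

    def minimizer(idx):
        # Closed-form minimizer of the mini-batch loss sum((y - w*x)**2) over samples idx.
        return np.dot(x[idx], y[idx]) / np.dot(x[idx], x[idx])

    batches = np.array_split(rng.permutation(200), 10)
    print([round(minimizer(b), 3) for b in batches])   # each mini-batch has its own minimum
    print(round(minimizer(np.arange(200)), 3))         # minimum of the full training loss
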
SLIDE 16

Noise can be added in other ways to escape saddle points

  • Random mini-batches (SGD)
  • Add noise to the gradient or the update
  • Add noise to the input
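
A minimal sketch of the second option, adding noise to the gradient (the Gaussian noise and its decaying scale are my own assumptions following common practice, not something the slide specifies):

    import numpy as np

    rng = np.random.default_rng(2)

    def noisy_gradient_step(w, grad_fn, eta, t, sigma0=0.1, decay=0.55):
        # One gradient step with annealed Gaussian noise added to the gradient (assumed schedule).
        noise_std = sigma0 / (1.0 + t) ** decay
        return w - eta * (grad_fn(w) + rng.normal(scale=noise_std, size=w.shape))

    # On the hypothetical saddle f(x, y) = x**2 - y**2, starting on the ridge y = 0 traps plain GD;
    # the injected noise knocks w off the ridge, and descent then carries it away from the saddle.
    grad = lambda w: np.array([2.0 * w[0], -2.0 * w[1]])
    w = np.array([0.5, 0.0])
    for t in range(80):
        w = noisy_gradient_step(w, grad, eta=0.05, t=t)
    print(w)   # the y-coordinate has moved away from 0, unlike plain GD which would stay on the ridge
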
SLIDE 17

Learning rate scheduling

  • High learning rates explore faster earlier
    – But they can lead to divergence or a high final loss
  • Low learning rates fine-tune better later
    – But they can be very slow to converge
  • LR scheduling combines the advantages of both
    – Many schedules are possible: linear, exponential, square-root, step-wise, cosine (two are sketched below)

[Figure: training loss vs. training iterations under different learning-rate schedules]
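
A small sketch of two of the schedules listed above, step-wise and cosine (the base rate, decay factor, and horizon are example values of my own choosing):

    import math

    def step_lr(t, base_lr=0.1, drop=0.1, every=30):
        # Step-wise schedule: multiply the learning rate by `drop` every `every` iterations.
        return base_lr * drop ** (t // every)

    def cosine_lr(t, base_lr=0.1, total_steps=100):
        # Cosine schedule: decay smoothly from base_lr to 0 over total_steps iterations.
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(t, total_steps) / total_steps))

    print([round(step_lr(t), 4) for t in (0, 29, 30, 60, 90)])
    print([round(cosine_lr(t), 4) for t in (0, 25, 50, 75, 100)])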

SLIDE 18

Classical and Nesterov Momentum

  • GD:
  • wt+1 = wt − η g(wt)
  • Classical momentum:
  • vt+1 = α vt − η g(wt);
  • wt+1 = wt + vt+1
  • Nesterov momentum
  • vt+1 = α vt − η g(wt+αvt);
  • wt+1 = wt + vt+1
  • Better course-correction for bad velocity

[Figure: update vectors wt → wt+1 for classical momentum (α vt − η g(wt)) and Nesterov momentum (α vt − η g(wt + α vt))]

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013
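
A minimal sketch of both updates exactly as written on this slide (the gradient callable, learning rate η, and momentum coefficient α are placeholders of my own):

    import numpy as np

    def classical_momentum_step(w, v, grad_fn, eta=0.01, alpha=0.9):
        # Classical momentum: v_{t+1} = alpha*v_t - eta*g(w_t); w_{t+1} = w_t + v_{t+1}.
        v_next = alpha * v - eta * grad_fn(w)
        return w + v_next, v_next

    def nesterov_momentum_step(w, v, grad_fn, eta=0.01, alpha=0.9):
        # Nesterov momentum: the gradient is evaluated at the look-ahead point w_t + alpha*v_t.
        v_next = alpha * v - eta * grad_fn(w + alpha * v)
        return w + v_next, v_next

    # Example on a hypothetical quadratic bowl f(w) = 0.5 * ||w||^2, with g(w) = w.
    grad = lambda w: w
    w, v = np.ones(2), np.zeros(2)
    for _ in range(100):
        w, v = nesterov_momentum_step(w, v, grad)
    print(w)   # near the minimum at the origin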

SLIDE 19

AdaGrad, RMSProp, AdaDelta

  • Scales the gradient by a running norm of all the previous gradients
  • Per dimension (sketched in code below):
    xt+1 = xt − η g(xt) / ( √( Σ j=1..t g(xj)² ) + ε )
  • Automatically reduces the learning rate with t
  • Parameters with small gradients speed up
  • RMSProp and AdaDelta use a forgetting factor in the squared-gradient accumulator so that the updates do not become too small
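
A per-dimension sketch of the AdaGrad rule above and the RMSProp variant with a forgetting factor (the step sizes, ρ = 0.9, and ε are typical defaults assumed here, not values from the slide):

    import numpy as np

    def adagrad_step(x, acc, g, eta=0.01, eps=1e-8):
        # AdaGrad: accumulate the sum of squared gradients and scale each dimension by it.
        acc = acc + g ** 2
        x = x - eta * g / (np.sqrt(acc) + eps)
        return x, acc

    def rmsprop_step(x, acc, g, eta=0.001, rho=0.9, eps=1e-8):
        # RMSProp: exponential moving average (forgetting factor rho) instead of a full sum,
        # so the effective learning rate does not shrink towards zero.
        acc = rho * acc + (1.0 - rho) * g ** 2
        x = x - eta * g / (np.sqrt(acc) + eps)
        return x, acc

    # Usage on a hypothetical gradient:
    x, acc = np.array([1.0, -1.0]), np.zeros(2)
    x, acc = adagrad_step(x, acc, g=np.array([0.5, -2.0]))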

SLIDE 20

Adam optimizer combines AdaGrad and momentum

  • Initialize: m0 = 0, v0 = 0
  • Loop over t:
  • gt = ∇x ft(xt−1)   (get gradient)
  • mt = β1 mt−1 + (1 − β1) gt   (update first moment, biased)
  • vt = β2 vt−1 + (1 − β2) gt²   (update second moment, biased)
  • m̂t = mt / (1 − β1^t)   (correct bias in first moment)
  • v̂t = vt / (1 − β2^t)   (correct bias in second moment)
  • xt = xt−1 − α m̂t / (√v̂t + ε)   (update parameters)

“ADAM: A method for stochastic optimization” Kingma and Ba, ICLR’15
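
A direct transcription of the loop above into code (β1, β2, α, and ε are set to the defaults suggested by Kingma and Ba; the gradient callable and the quadratic test problem are placeholders of my own):

    import numpy as np

    def adam(grad_fn, x0, steps=1000, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adam: momentum-style first moment plus AdaGrad-style second moment, both bias-corrected.
        x = np.asarray(x0, dtype=float)
        m = np.zeros_like(x)   # first moment
        v = np.zeros_like(x)   # second moment
        for t in range(1, steps + 1):
            g = grad_fn(x)                        # get gradient
            m = beta1 * m + (1 - beta1) * g       # update first moment (biased)
            v = beta2 * v + (1 - beta2) * g ** 2  # update second moment (biased)
            m_hat = m / (1 - beta1 ** t)          # correct bias in first moment
            v_hat = v / (1 - beta2 ** t)          # correct bias in second moment
            x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)   # update parameters
        return x

    # Example on a hypothetical quadratic: moves towards the minimum at the origin.
    print(adam(lambda x: 2.0 * x, x0=[1.0, -3.0], alpha=0.01))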

SLIDE 21

Visualizing optimizers

Source: http://ruder.io/optimizing-gradient-descent/index.html