SLIDE 1

Optimization for Machine Learning

Tom Schaul schaul@cims.nyu.edu

SLIDE 2

Recap: Learning Machines

  • Learning machines (Neural Networks, etc.)
  • Forward passes produce a function of input
  • Trainable parameters (aka weights, biases, etc.)
  • Backward passes compute gradients w.r.t. target
  • Modular structure → chain rule (aka Backprop)
  • Loss function (aka energy, cost, error)
  • an expectation over samples from dataset
  • Today: algorithms for minimizing the loss

SLIDE 3

Flattening Parameters

  • Parameter space
  • Gradient from backprop
  • Element-wise correspondence
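A minimal sketch of this correspondence (assuming numpy; the layer shapes and the 0.01 step-size are purely illustrative):

    import numpy as np

    # Hypothetical per-layer parameters and their gradients (weights and biases)
    params = [np.random.randn(4, 3), np.zeros(4), np.random.randn(2, 4), np.zeros(2)]
    grads  = [np.ones((4, 3)), np.ones(4), np.ones((2, 4)), np.ones(2)]

    # Flatten into one parameter vector theta and a matching gradient vector g;
    # element i of g is the derivative of the loss w.r.t. element i of theta
    theta = np.concatenate([p.ravel() for p in params])
    g     = np.concatenate([dp.ravel() for dp in grads])

    # To apply an update, unflatten back into the original shapes
    def unflatten(vec, templates):
        out, i = [], 0
        for t in templates:
            out.append(vec[i:i + t.size].reshape(t.shape))
            i += t.size
        return out

    new_params = unflatten(theta - 0.01 * g, params)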

SLIDE 4

Energy Surfaces

  • We can visualize the loss as a function of the parameters

  • Properties:
  • Local optima
  • Saddle points
  • Steep cliffs
  • Narrow, bent valleys
  • Flat areas
  • Only convex in the simplest cases
  • Convex optimization tools are of limited use

SLIDE 5

Sample Variance

  • Every sample has a contribution to the loss
  • Sample distributions are complex
  • Sample gradients can have high variance

SLIDE 6

Optimization Types

  • First-order methods, aka gradient descent
  • use gradients
  • incremental steps downhill on surface
  • Second-order methods
  • use second derivatives (curvature)
  • attempt large jumps (into the bottom of the valley)
  • Zeroth-order methods, aka black-box
  • use only values of the loss function
  • somewhat random jumps

SLIDE 7

Batch vs. Stochastic

  • Batch methods are based on true loss
  • Reliable gradients, large updates
  • Stochastic methods use sample gradients
  • Many more updates, smaller steps
  • Minibatch methods interpolate in-between
  • Gradients are averaged over n samples

SLIDE 8

Gradient Descent

  • Step in direction of steepest descent
  • Gradient comes from backprop
  • How to choose the step-size?
  • Line search (extra evaluations)
  • A fixed number
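A minimal sketch of the resulting loop (assuming numpy; the toy quadratic loss and the fixed step-size 0.1 are illustrative):

    import numpy as np

    def loss(theta):
        return 0.5 * np.sum(theta ** 2)       # toy loss

    def grad(theta):
        return theta                          # its gradient (from backprop in practice)

    theta = np.array([3.0, -2.0])
    eta = 0.1                                 # fixed step-size (alternatively: line search)
    for _ in range(100):
        theta = theta - eta * grad(theta)     # step in the direction of steepest descent
    print(theta, loss(theta))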

SLIDE 9

Convergence of GD (1D)

  • Iteratively approach optimum

SLIDE 10

Optimal Learning Rate (1D)

  • Weight change
  • With quadratic loss
  • Optimal learning rate
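The formulas on the slide are not recoverable from the text, but the standard 1D analysis they presumably show is, for a quadratic loss with curvature h:

    L(w) = \tfrac{1}{2} h (w - w^\ast)^2, \qquad \Delta w = -\eta \frac{dL}{dw} = -\eta\, h (w - w^\ast)
    \eta_{\mathrm{opt}} = \frac{1}{h} \ \text{(one-step convergence)}, \qquad \eta > \frac{2}{h} \ \text{diverges}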

SLIDE 11

Convergence of GD (N-Dim)

  • Assumption: smooth loss function
  • Quadratic approximation around the optimum, with Hessian matrix H
  • Convergence condition: the update must shrink any deviation from the optimum
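A sketch of the standard reconstruction (w* denotes the optimum, H the Hessian there):

    L(w) \approx L(w^\ast) + \tfrac{1}{2}(w - w^\ast)^\top H (w - w^\ast), \qquad
    w_{t+1} - w^\ast = (I - \eta H)(w_t - w^\ast)
    \text{Convergence iff every eigenvalue of } I - \eta H \text{ lies in } (-1, 1),
    \ \text{i.e. } 0 < \eta < 2/\lambda_{\max}(H)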

SLIDE 12

Convergence of GD (N-Dim)

  • We change coordinates so that the Hessian H becomes diagonal
  • Then each coordinate can be treated separately
  • Intuition: GD in N dimensions is equivalent to N 1D-descents along the eigenvectors of H
  • Convergence if the 1D condition holds along every eigen-direction

SLIDE 13

GD Convergence: Example

  • Batch GD
  • Small learning rate
  • Convergence

SLIDE 14

GD Convergence: Example

  • Batch GD
  • Large learning rate
  • Divergence

SLIDE 15

GD Convergence: Example

  • Stochastic GD
  • Large learning rate
  • Fast convergence

SLIDE 16

Convergence Speed

  • With optimal fixed learning rate
  • One-step convergence in that direction
  • Slower in all others
  • Total number of iterations is proportional to the condition number of the Hessian
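In symbols (a standard statement of the same point, not read off the slide):

    \eta^\ast \approx \frac{1}{\lambda_{\max}}, \qquad
    \#\text{iterations} \ \propto\ \kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}}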

SLIDE 17

Optimal LR Estimation

  • A cheap way of finding the optimal learning rate (without computing H explicitly)

  • Part 1: cheap Hessian-vector products
  • Part 2: power method

SLIDE 18

Hessian-vector Products

  • Based on the finite-difference approximation H v ≈ (∇L(θ + εv) − ∇L(θ)) / ε, where the perturbed gradient ∇L(θ + εv) is obtained simply from one additional forward/backward pass (after perturbing the parameters)
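A minimal sketch (assuming a grad(theta) function that wraps the forward/backward pass; the value of eps is illustrative):

    import numpy as np

    def hessian_vector_product(grad, theta, v, eps=1e-5):
        # H v ≈ (∇L(θ + εv) − ∇L(θ)) / ε : one extra forward/backward pass
        return (grad(theta + eps * v) - grad(theta)) / eps

    # toy check with L(θ) = ½ θᵀ A θ, whose Hessian is A
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    grad = lambda th: A @ th
    theta = np.array([1.0, -1.0])
    v = np.array([0.3, 0.7])
    print(hessian_vector_product(grad, theta, v), A @ v)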

SLIDE 19

Power Method

  • We know that iterating v ← H v / ‖H v‖ converges to the principal eigenvector, with ‖H v‖ converging to the largest eigenvalue
  • With sample-estimates (on-line) we introduce some robustness by averaging (good enough after 10-100 samples)
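Putting the two parts together, a sketch of the whole estimator (again assuming a grad(theta) function; in the on-line setting grad would be a sample gradient and the eigenvalue estimate would be averaged over iterations):

    import numpy as np

    def estimate_optimal_lr(grad, theta, n_iters=50, eps=1e-5):
        v = np.random.randn(theta.size)
        for _ in range(n_iters):
            v_hat = v / np.linalg.norm(v)
            hv = (grad(theta + eps * v_hat) - grad(theta)) / eps   # ≈ H v_hat
            lam = np.linalg.norm(hv)   # converges to the largest eigenvalue
            v = hv                     # power-method iteration
        return 1.0 / lam               # learning rate on the order of 1 / λ_max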

SLIDE 20

Optimal LR Estimation

SLIDE 21

Conditioning of H

  • Some parameters are more sensitive than others
  • Very different scales
  • Illustration
  • Solution 1 (model/data)
  • Solution 2 (algorithm)

SLIDE 22

H-eigenvalues in Neural Nets (1)

  • Few large ones
  • Many medium ones
  • Spanning orders of magnitude

SLIDE 23

H-eigenvalues in Neural Nets (2)

  • Differences by layer
  • Steeper gradients on biases

SLIDE 24

H-Conditioning: Solution 1

  • Normalize data
  • Always useful, rarely sufficient
  • How?
  • Subtract the mean from the inputs
  • If possible: decorrelate the inputs
  • Divide by the standard deviation of each input (all unit-variance)
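A sketch of this recipe (assuming numpy and a data matrix X of shape [n_samples, n_features]; PCA is used here as one possible decorrelation step):

    import numpy as np

    def normalize_inputs(X, decorrelate=False):
        X = X - X.mean(axis=0)               # subtract the mean from each input
        if decorrelate:                       # if possible: decorrelate (e.g. via PCA)
            cov = np.cov(X, rowvar=False)
            _, vecs = np.linalg.eigh(cov)
            X = X @ vecs
        X = X / (X.std(axis=0) + 1e-8)        # divide by std: unit variance per input
        return X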

SLIDE 25

H-Conditioning: Solution 1

  • Normalize data
  • Structural choices
  • Non-linearities with zero-mean, unit-variance activations
  • Explicit normalization layers
  • Weight initialization such that all hidden activations have approximately zero mean and unit variance
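One common initialization consistent with this goal (a standard fan-in-scaled recipe, not necessarily the slide's exact prescription):

    import numpy as np

    def init_weights(fan_in, fan_out, rng=np.random):
        # std ~ 1/sqrt(fan_in) keeps pre-activations roughly zero-mean, unit-variance
        return rng.randn(fan_out, fan_in) / np.sqrt(fan_in)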

SLIDE 26

H-Conditioning: Solution 2

  • Algorithmic solution:
  • Take smaller steps in sensitive directions
  • One learning rate per parameter
  • Estimate diagonal Hessian
  • Small constant for stability
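A plausible reconstruction of the per-parameter rule, with h_ii the diagonal Hessian estimate and mu the small stabilizing constant (the exact formula on the slide is not recoverable):

    \eta_i = \frac{\eta}{|h_{ii}| + \mu}, \qquad
    \Delta w_i = -\eta_i \, \frac{\partial L}{\partial w_i}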

SLIDE 27

Hessian Estimation

  • Approximate the full Hessian
  • Finite-difference approximation of the k-th row
  • One forward/backward pass for each parameter (perturbed slightly)
  • Concatenate all the rows
  • Symmetrize the resulting matrix
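A sketch of that procedure (assuming a grad(theta) function; this costs one gradient evaluation per parameter, so it is only viable for small models):

    import numpy as np

    def finite_difference_hessian(grad, theta, eps=1e-5):
        n = theta.size
        H = np.zeros((n, n))
        g0 = grad(theta)
        for k in range(n):                        # one extra forward/backward per parameter
            e = np.zeros(n); e[k] = eps           # perturb parameter k slightly
            H[k] = (grad(theta + e) - g0) / eps   # k-th row of the Hessian
        return 0.5 * (H + H.T)                    # symmetrize the resulting matrix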

SLIDE 28

BBprop (1)

  • Cheaply approximate the Hessian in a modular architecture
  • Assume the second derivatives w.r.t. a module's outputs are known
  • Find those w.r.t. its inputs and parameters
  • Apply the chain rule
  • Positive-definite approximation
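A sketch of the diagonal second-derivative backprop for one unit with pre-activation a_i = sum_j w_{ij} x_j and output y_i = f(a_i), loosely following the Efficient BackProp reference listed at the end (the notation here is mine):

    \frac{\partial^2 E}{\partial a_i^2} \approx f'(a_i)^2 \, \frac{\partial^2 E}{\partial y_i^2}
    \quad \text{(dropping the } f''(a_i)\,\partial E/\partial y_i \text{ term keeps the estimate positive)}
    \frac{\partial^2 E}{\partial w_{ij}^2} \approx x_j^2 \, \frac{\partial^2 E}{\partial a_i^2}, \qquad
    \frac{\partial^2 E}{\partial x_j^2} \approx \sum_i w_{ij}^2 \, \frac{\partial^2 E}{\partial a_i^2}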

SLIDE 29

BBprop (2)

  • Just the diagonal terms
  • Take an exponential moving average of the estimates

SLIDE 30

Batch vs. Stochastic

  • Batch methods
  • True loss, reliable gradients, large updates
  • But:
  • Expensive on large datasets
  • Slowed by redundant samples
  • Stochastic methods
  • Many more updates, smaller steps
  • Minibatch methods
  • Gradients are averaged over n samples

SLIDE 31

Batch vs. Stochastic

  • Batch methods
  • Stochastic methods (SGD)
  • Many more updates, smaller steps
  • More aggressive
  • Also works online (e.g. streaming data)
  • Cooling schedule on the learning rate (guaranteed to converge; see the conditions after this list)

  • Minibatch methods
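The classic sufficient conditions for such a cooling schedule (the Robbins-Monro conditions, stated here as standard background rather than slide content):

    \sum_t \eta_t = \infty, \qquad \sum_t \eta_t^2 < \infty,
    \qquad \text{e.g. } \eta_t = \frac{\eta_0}{1 + t/\tau}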

SLIDE 32

Batch vs. Stochastic

  • Batch methods
  • Stochastic methods
  • Minibatch methods
  • Stochastic updates, but more accurate gradients based on a small number of samples

  • In-between SGD and batch GD
  • Not usually faster, but much easier to parallelize
  • Samples in a mini-batch should be diverse
  • Don’t forget to shuffle dataset!
  • Stratified sampling
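A minimal sketch of shuffled minibatch iteration (assuming numpy arrays X and y; stratified sampling would additionally balance labels within each batch):

    import numpy as np

    def minibatches(X, y, batch_size, rng=np.random):
        idx = rng.permutation(len(X))              # don't forget to shuffle the dataset!
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]  # gradients get averaged over these samples
            yield X[batch], y[batch]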

SLIDE 33

Variance-normalization

  • The SGD learning rate depends on the Hessian, but also on the sample variance
  • Intuition: parameters whose gradients vary wildly across samples should be updated with smaller learning rates than stable ones
  • Variance-scaled rates: scale each parameter's learning rate using running statistics of its gradient
  • Those statistics are exponential moving averages
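The exact formula on the slide is not recoverable, so the following sketch is an assumption, loosely in the spirit of the adaptive-learning-rates reference listed at the end: keep exponential moving averages of the gradient and squared gradient per parameter, and shrink the rate where the sample variance dominates.

    import numpy as np

    class VarianceScaledSGD:
        def __init__(self, n, eta=0.1, decay=0.9, eps=1e-8):
            self.eta, self.decay, self.eps = eta, decay, eps
            self.g_bar = np.zeros(n)     # EMA of the gradient
            self.g2_bar = np.zeros(n)    # EMA of the squared gradient

        def step(self, theta, g):
            self.g_bar = self.decay * self.g_bar + (1 - self.decay) * g
            self.g2_bar = self.decay * self.g2_bar + (1 - self.decay) * g * g
            # scale ≈ 1 where gradients agree across samples, ≈ 0 where variance dominates;
            # a diagonal-Hessian estimate could additionally divide the rate (see earlier slides)
            scale = self.g_bar ** 2 / (self.g2_bar + self.eps)
            return theta - self.eta * scale * g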

SLIDE 34

Variance-normalization

  • This scheme is adaptive: no need for tuning

SLIDE 35

Optimization Types

  • First-order methods, aka gradient descent
  • use gradients
  • incremental steps downhill on energy surface
  • Second-order methods
  • use second derivatives (curvature)
  • attempt large jumps (into the bottom of the valley)
  • Zeroth-order methods, aka black-box
  • use only values of the loss function
  • somewhat random jumps

SLIDE 36

Second-order Optimization

  • Newton’s method
  • Quasi-Newton (BFGS)
  • Conjugate gradients
  • Gauss-Newton (Levenberg-Marquardt)
  • Many more:
  • Momentum
  • Nesterov gradient
  • Natural gradient descent

SLIDE 37

Newton’s Method

  • Locally quadratic loss approximation
  • Minimize it w.r.t. the weight change
  • Jumps to the center of the quadratic bowl
  • Optimal (single step) if the quadratic approximation holds, no guarantees otherwise
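A sketch of the quadratic model and the resulting Newton step:

    L(w + \Delta w) \approx L(w) + \nabla L^\top \Delta w + \tfrac{1}{2}\, \Delta w^\top H \, \Delta w
    \quad\Rightarrow\quad \Delta w^\ast = -H^{-1} \nabla L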

SLIDE 38

Quasi-Newton / BFGS

  • Keep an estimate of the inverse Hessian M
  • Gradient premultiplied by M
  • M always positive-definite
  • Line search
  • Update M incrementally
  • e.g. in BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno)
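In practice this is usually delegated to a library; a sketch using SciPy (SciPy is not mentioned on the slides, and the Rosenbrock test function merely stands in for a real loss):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.zeros(5)
    # BFGS maintains an inverse-Hessian estimate internally and uses a line search
    result = minimize(rosen, x0, jac=rosen_der, method='BFGS')
    print(result.x, result.nit)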

SLIDE 39

Conjugate-Gradient (1)

  • Find minimum along descent direction
  • Using line search
  • Find a conjugate direction: at all points along this direction, the gradients are aligned
  • If the Hessian is the identity, conjugate coincides with orthogonal
  • This guarantees that the previous update is not spoiled

SLIDE 40

Conjugate-Gradient (2)

  • Conjugate direction
  • Fletcher-Reeves formula
  • Polak-Ribiere formula
  • CG update (always using line search):
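A sketch of the standard formulas (g_t is the gradient at step t and d_t the search direction; these follow common textbook statements rather than the slide's own rendering):

    d_{t+1} = -g_{t+1} + \beta_t \, d_t, \qquad
    w_{t+1} = w_t + \alpha_t d_t \ \text{(with } \alpha_t \text{ from line search)}
    \beta_t^{\mathrm{FR}} = \frac{g_{t+1}^\top g_{t+1}}{g_t^\top g_t} \ \text{(Fletcher-Reeves)}, \qquad
    \beta_t^{\mathrm{PR}} = \frac{g_{t+1}^\top (g_{t+1} - g_t)}{g_t^\top g_t} \ \text{(Polak-Ribiere)}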

SLIDE 41

Gauss-Newton / Levenberg-Marquardt

  • If the loss is mean-squared error (MSE), the squared Jacobian approximates the Hessian
  • Gauss-Newton
  • Levenberg-Marquardt
  • Avoids instability when some eigenvalues are small
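A sketch in symbols (J is the Jacobian of the network outputs w.r.t. the weights, r the residual vector, and lambda the Levenberg-Marquardt damping constant):

    H \approx J^\top J \ \text{(Gauss-Newton, MSE loss)}, \qquad
    \Delta w = -\left(J^\top J + \lambda I\right)^{-1} J^\top r \ \text{(Levenberg-Marquardt)}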

SLIDE 42

Second-order and Neural Nets

  • Don’t scale well
  • Use on small parameter spaces
  • Operate in batch mode
  • Use on small datasets
  • Or use mini-batches (well, maybe)
  • Sensitive to noise
  • Very good if very precise values are needed
  • Flip-side: more prone to overfitting to local optima
  • SGD is hard to beat on large problems

SLIDE 43

Optimization Types

  • First-order methods, aka gradient descent
  • use gradients
  • incremental steps downhill on energy surface
  • Second-order methods
  • use second derivatives (curvature)
  • attempt large jumps (into the bottom of the valley)
  • Zeroth-order methods, aka black-box
  • use only values of the loss function
  • somewhat random jumps

SLIDE 44

Natural Evolution Strategies (1)

Represent search points by a search distribution

  • that is parameterized
  • iteratively update distribution parameters,
  • until good enough fitness is found.

SLIDE 45

Natural Evolution Strategies (2)

Approach: minimize expected loss

  • by computing the gradient,
  • or its Monte Carlo approximation,
  • to update distribution parameters:
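A sketch of the gradient and its Monte Carlo approximation (f is the fitness/loss, pi(.|theta) the search distribution, lambda the population size, eta the learning rate):

    \nabla_\theta \, \mathbb{E}_{x \sim \pi(\cdot|\theta)}\!\left[f(x)\right]
      = \mathbb{E}\!\left[f(x)\, \nabla_\theta \log \pi(x|\theta)\right]
      \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} f(x_k)\, \nabla_\theta \log \pi(x_k|\theta),
    \qquad \theta \leftarrow \theta - \eta\, \widehat{\nabla_\theta}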

SLIDE 46

Natural Evolution Strategies (3)

But the steepest gradient is not scale-invariant,

  • so we use the natural gradient instead, obtained by premultiplying with the inverse Fisher information matrix
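In symbols (F is the Fisher information matrix of the search distribution):

    \tilde{\nabla}_\theta J = F^{-1} \nabla_\theta J, \qquad
    F = \mathbb{E}\!\left[\nabla_\theta \log \pi(x|\theta)\, \nabla_\theta \log \pi(x|\theta)^\top\right]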

[Figure: vanilla vs. natural gradient update directions]

SLIDE 47

Today’s Key Concepts

  • Navigating loss surfaces
  • First-order vs. second-order
  • Batch vs. stochastic
  • Hessians and conditioning
  • Finding step-sizes

SLIDE 48

Today’s Take-homes (1)

  • Always shuffle data
  • Always normalize data, if possible decorrelate
  • Use components that lead to zero-mean, unit-variance activations

  • Initialize weights carefully

SLIDE 49

Today’s Take-homes (2)

  • By default, use the SGD algorithm
  • Use minibatches on parallel machines
  • Either
  • use the power method to roughly estimate the optimal learning rate, or
  • scale learning rates element-wise by estimates of the diagonal Hessian and sample variance
  • Use conjugate gradients on small-scale and/or high-precision tasks
  • Use black-box methods as a last resort

SLIDE 50

References

  • Most of today's material: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
  • Adaptive learning rates: http://arxiv.org/abs/1206.1106
  • Natural Evolution Strategies: http://arxiv.org/abs/1106.4487
  • Any additional questions: schaul@cims.nyu.edu
