Optimization for Machine Learning
Tom Schaul (schaul@cims.nyu.edu)
Recap: Learning Machines
- Learning machines (Neural Networks, etc.)
- Forward passes produce a function of input
- Trainable parameters (aka weights, biases, etc.)
- Backward passes compute gradients w.r.t. target
- Modular structure: gradients propagate via the chain rule (aka Backprop)
- Loss function (aka energy, cost, error)
- an expectation over samples from the dataset
- Today: algorithms for minimizing the loss
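As a concrete illustration of the forward/backward passes and the chain rule, here is a minimal numpy sketch for a one-hidden-layer network with a squared-error loss; the names are illustrative, not from the slides.

    import numpy as np

    def forward(x, W1, W2):
        # forward pass: the network computes a function of its input
        h = np.tanh(W1 @ x)      # hidden activations
        y = W2 @ h               # output
        return h, y

    def backward(x, h, y, target, W2):
        # backward pass: chain rule (backprop) for loss = 0.5 * ||y - target||^2
        dy = y - target                  # dL/dy
        dW2 = np.outer(dy, h)            # dL/dW2
        dh = W2.T @ dy                   # dL/dh
        dpre = dh * (1.0 - h ** 2)       # back through tanh
        dW1 = np.outer(dpre, x)          # dL/dW1
        return dW1, dW2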
Flattening Parameters
- Parameter space: collect all trainable parameters into a single vector θ
- Gradient from backprop: a vector of the same dimension
- Element-wise correspondence between parameter and gradient entries (see the sketch below)
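A minimal sketch of the flattening with hypothetical helper names: all weight arrays are concatenated into one parameter vector, and the gradient vector from backprop lines up with it element-wise.

    import numpy as np

    def flatten(params):
        # concatenate all parameter arrays into a single vector theta
        return np.concatenate([p.ravel() for p in params])

    def unflatten(theta, shapes):
        # split the vector back into arrays with the original shapes
        out, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            out.append(theta[i:i + n].reshape(s))
            i += n
        return out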
Energy Surfaces
- We can visualize the loss as a function of
parameters
- Properties:
- Local optima
- Saddle points
- Steep cliffs
- Narrow, bent valleys
- Flat areas
- Only convex in the simplest cases
- Convex optimization tools are of limited use
Sample Variance
- Every sample has a contribution to the loss
- Sample distributions are complex
- Sample gradients can have high variance
Optimization Types
- First-order methods, aka gradient descent
- use gradients
- incremental steps downhill on surface
- Second-order methods
- use second derivatives (curvature)
- attempt large jumps (into the bottom of the valley)
- Zeroth-order methods, aka black-box
- use only values of the loss function
- somewhat random jumps
Batch vs. Stochastic
- Batch methods are based on true loss
- Reliable gradients, large updates
- Stochastic methods use sample gradients
- Many more updates, smaller steps
- Minibatch methods interpolate in-between
- Gradients are averaged over n samples
Gradient Descent
- Step in the direction of steepest descent: θ ← θ − η ∇L(θ)
- Gradient comes from backprop
- How to choose the step-size η?
- Line search (extra loss evaluations)
- Fixed number (a tuned learning rate); both options are sketched below
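A minimal sketch of the two step-size choices, with a fixed learning rate or a simple backtracking line search; the helper names are illustrative.

    def gd_step_fixed(theta, grad, lr=0.01):
        # fixed step-size: step in the direction of steepest descent
        return theta - lr * grad

    def gd_step_linesearch(theta, loss_fn, grad, lr=1.0, shrink=0.5):
        # simple backtracking line search: costs extra loss evaluations
        base = loss_fn(theta)
        while loss_fn(theta - lr * grad) > base and lr > 1e-12:
            lr *= shrink
        return theta - lr * grad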
Convergence of GD (1D)
- Iteratively approach optimum
Optimal Learning Rate (1D)
- Weight change
- With quadratic loss
- Optimal learning rate (derivation sketched below)
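A short worked version of the quadratic-loss argument behind these bullets (the standard derivation; w^* denotes the minimum):

    L(w) \approx L(w^*) + \tfrac{1}{2} L''(w^*)\,(w - w^*)^2
    \Delta w = -\eta\, L'(w) = -\eta\, L''(w^*)\,(w - w^*)
    w_{t+1} - w^* = \big(1 - \eta\, L''(w^*)\big)\,(w_t - w^*)
    \eta_{\text{opt}} = 1 / L''(w^*) \;\; (\text{one-step convergence}), \qquad \eta > 2 / L''(w^*) \;\Rightarrow\; \text{divergence}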
Convergence of GD (N-Dim)
- Assumption: smooth loss function
- Quadratic approximation around the optimum θ*, with Hessian matrix H: L(θ) ≈ L(θ*) + ½ (θ − θ*)^T H (θ − θ*)
- Convergence condition: the update map (I − η H) must shrink any vector, i.e. all its eigenvalues must have magnitude less than 1
Convergence of GD (N-Dim)
- We do a change of coordinates (a rotation into the eigenbasis of H) such that H becomes diagonal
- Then each coordinate is updated independently of the others
- Intuition: GD in N dimensions is equivalent to N 1D-descents along the eigenvectors of H
- Convergence if η < 2 / λ_max, where λ_max is the largest eigenvalue of H
GD Convergence: Example
- Batch GD
- Small learning rate
- Convergence
GD Convergence: Example
- Batch GD
- Large learning rate
- Divergence
GD Convergence: Example
- Stochastic GD
- Large learning rate
- Fast convergence
Convergence Speed
- With the optimal fixed learning rate η = 1 / λ_max
- One-step convergence in that (steepest) direction
- Slower in all others
- Total number of iterations is proportional to the condition number of the Hessian, λ_max / λ_min
Optimal LR Estimation
- A cheap way of finding the optimal learning rate η ≈ 1 / λ_max (without computing H explicitly)
- Part 1: cheap Hessian-vector products
- Part 2: power method
Hessian-vector Products
- Based on the finite-difference approximation H v ≈ ( ∇L(θ + ε v) − ∇L(θ) ) / ε, where ∇L(θ + ε v) is obtained simply from one additional forward/backward pass (after perturbing the parameters by ε v)
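A minimal numpy sketch of the finite-difference Hessian-vector product, assuming a function grad(theta) that returns the gradient from one forward/backward pass (illustrative names):

    import numpy as np

    def hessian_vector_product(grad, theta, v, eps=1e-6):
        # Hv ~= (grad(theta + eps*v) - grad(theta)) / eps
        # one extra forward/backward pass at the perturbed parameters
        return (grad(theta + eps * v) - grad(theta)) / eps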
Power Method
- We know that iterating v ← H v / ||H v|| converges to the principal eigenvector of H, with ||H v|| converging to the largest eigenvalue λ_max
- With sample-estimates (on-line) we introduce some robustness by averaging the successive estimates (good enough after 10-100 samples)
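A sketch of the power method built on the finite-difference Hessian-vector product, estimating the largest eigenvalue and hence a rough learning rate (assumed helper names, no on-line averaging shown):

    import numpy as np

    def estimate_max_eigenvalue(grad, theta, n_iters=50, eps=1e-6):
        v = np.random.randn(theta.size)
        v /= np.linalg.norm(v)
        lam = 0.0
        for _ in range(n_iters):
            hv = (grad(theta + eps * v) - grad(theta)) / eps  # finite-difference Hv
            lam = np.linalg.norm(hv)                          # converges to |lambda_max|
            v = hv / (lam + 1e-12)
        return lam

    # rough optimal learning rate:
    # lr = 1.0 / estimate_max_eigenvalue(grad, theta)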
Optimal LR Estimation
Conditioning of H
- Some parameters are more sensitive than others
- Very different scales
- Illustration
- Solution 1 (model/data)
- Solution 2 (algorithm)
H-eigenvalues in Neural Nets (1)
- Few large ones
- Many medium ones
- Spanning orders of magnitude
H-eigenvalues in Neural Nets (2)
- Differences by layer
- Steeper gradients on biases
H-Conditioning: Solution 1
- Normalize data
- Always useful, rarely sufficient
- How?
- Subtract mean from inputs
- If possible: decorrelate inputs
- Divide by standard deviation on each input (all unit-variance)
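A minimal sketch of the three normalization steps on a data matrix X of shape (n_samples, n_features); PCA-based rotation is one common way to do the decorrelation:

    import numpy as np

    def normalize(X):
        X = X - X.mean(axis=0)                  # subtract mean from inputs
        cov = np.cov(X, rowvar=False)           # feature covariance
        eigval, eigvec = np.linalg.eigh(cov)
        X = X @ eigvec                          # decorrelate inputs (rotate to PCA axes)
        X = X / (X.std(axis=0) + 1e-8)          # unit variance on each input
        return X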
H-Conditioning: Solution 1
- Normalize data
- Structural choices
- Non-linearities with zero-mean unit-variance activations
- Explicit normalization layers
- Weight initialization such that all hidden activations have approximately zero-mean unit-variance (see the sketch below)
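One common initialization of this kind, scaling by 1/sqrt(fan_in) so pre-activations keep roughly unit variance when the inputs do; this is a sketch of the general idea, exact recipes vary with the non-linearity:

    import numpy as np

    def init_weights(fan_in, fan_out):
        # pre-activation variance stays ~1 if inputs have zero mean and unit variance
        return np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)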
H-Conditioning: Solution 2
- Algorithmic solution:
- Take smaller steps in sensitive (high-curvature) directions
- One learning rate per parameter: η_i = η / ( |H_ii| + μ )
- Estimate the diagonal Hessian entries H_ii
- Small constant μ for stability
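A minimal sketch of the per-parameter update, assuming an estimate h_diag of the diagonal of the Hessian and a small constant mu for stability (illustrative names):

    import numpy as np

    def diag_hessian_step(theta, grad, h_diag, lr=1.0, mu=1e-2):
        # smaller steps in sensitive (high-curvature) directions
        return theta - lr * grad / (np.abs(h_diag) + mu)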
Hessian Estimation
- Approximate full Hessian
- Finite-difference approximation of the k-th row: H_k ≈ ( ∇L(θ + ε e_k) − ∇L(θ) ) / ε
- One extra forward/backward pass for each parameter (perturbed slightly)
- Concatenate all the rows
- Symmetrize the resulting matrix: H ← ½ (H + H^T)
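A sketch of the row-by-row finite-difference Hessian described above (one extra gradient evaluation per parameter, then symmetrize); only practical for small parameter counts:

    import numpy as np

    def finite_difference_hessian(grad, theta, eps=1e-6):
        n = theta.size
        H = np.zeros((n, n))
        g0 = grad(theta)
        for k in range(n):                        # one perturbation per parameter
            e = np.zeros(n)
            e[k] = eps
            H[k] = (grad(theta + e) - g0) / eps   # k-th row
        return 0.5 * (H + H.T)                    # symmetrize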
BBprop (1)
- Cheaply approximate the Hessian in a modular architecture
- Assume we have the second derivatives of the loss w.r.t. a module's output y
- Find the second derivatives w.r.t. its input x and its weights
- Apply the chain rule
- Positive-definite approximation
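A sketch of the kind of second-derivative chain rule this refers to (cf. the "Efficient BackProp" reference at the end): for a module y = f(x), dropping the term involving the module's own second derivative keeps the estimate positive semi-definite:

    \frac{\partial^2 L}{\partial x_i^2} \;\approx\; \sum_j \left(\frac{\partial y_j}{\partial x_i}\right)^2 \frac{\partial^2 L}{\partial y_j^2}

and analogously for the weights of, e.g., a linear layer y = W x:

    \frac{\partial^2 L}{\partial W_{ji}^2} \;\approx\; x_i^2 \, \frac{\partial^2 L}{\partial y_j^2}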
BBprop (2)
- Just the diagonal terms
- Take an exponential moving average of the estimates
Batch vs. Stochastic
- Batch methods
- True loss, reliable gradients, large updates
- But:
- Expensive on large datasets
- Slowed by redundant samples
- Stochastic methods
- Many more updates, smaller steps
- Minibatch methods
- Gradients are averaged over n samples
Batch vs. Stochastic
- Batch methods
- Stochastic methods (SGD)
- Many more updates, smaller steps
- More aggressive
- Also works online (e.g. streaming data)
- Cooling schedule on the learning rate, e.g. η_t ∝ 1/t (guaranteed to converge)
- Minibatch methods
Batch vs. Stochastic
- Batch methods
- Stochastic methods
- Minibatch methods
- Stochastic updates, but more accurate gradients
based on a small number of samples
- In-between SGD and batch GD
- Not usually faster, but much easier to parallelize
- Samples in a mini-batch should be diverse
- Don’t forget to shuffle dataset!
- Stratified sampling
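To make the minibatch recipe concrete, a minimal sketch with shuffling, assuming a user-supplied grad_on(batch, theta) helper that averages the sample gradients over the minibatch (illustrative names):

    import numpy as np

    def minibatch_sgd(theta, data, grad_on, lr=0.01, batch_size=32, n_epochs=10):
        n = len(data)
        for epoch in range(n_epochs):
            idx = np.random.permutation(n)                       # don't forget to shuffle!
            for start in range(0, n, batch_size):
                batch = [data[i] for i in idx[start:start + batch_size]]
                theta = theta - lr * grad_on(batch, theta)       # averaged gradient
        return theta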
Variance-normalization
- SGD learning rate depends on Hessian,
but also on sample variance
- Intuition: parameters whose gradients vary wildly across samples should be updated with smaller learning rates than stable ones
- Variance-scaled rates: shrink each parameter's learning rate when its gradient variance across samples is large relative to its average gradient
- Both quantities are tracked with exponential moving averages (see the sketch below)
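A rough sketch of one way to implement variance-scaled rates from exponential moving averages of the gradient and of its square; the exact scheme in the adaptive-learning-rate reference below differs in details, this only illustrates the intuition that noisy parameters get smaller rates:

    import numpy as np

    def variance_scaled_step(theta, grad, state, lr=1.0, decay=0.9, eps=1e-8):
        # initialize once with: state = {'g': np.zeros_like(theta), 'g2': np.zeros_like(theta)}
        state['g']  = decay * state['g']  + (1 - decay) * grad
        state['g2'] = decay * state['g2'] + (1 - decay) * grad ** 2
        scale = state['g'] ** 2 / (state['g2'] + eps)   # in [0, 1]: small when variance is high
        return theta - lr * scale * grad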
Variance-normalization
- This scheme is adaptive: no need for tuning
Optimization Types
- First-order methods, aka gradient descent
- use gradients
- incremental steps downhill on energy surface
- Second-order methods
- use second derivatives (curvature)
- attempt large jumps (into the bottom of the valley)
- Zeroth-order methods, aka black-box
- use only values of the loss function
- somewhat random jumps
Second-order Optimization
- Newton’s method
- Quasi-Newton (BFGS)
- Conjugate gradients
- Gauss-Newton (Levenberg-Marquardt)
- Many more:
- Momentum
- Nesterov gradient
- Natural gradient descent
Newton’s Method
- Locally quadratic loss: L(θ + Δ) ≈ L(θ) + ∇L(θ)^T Δ + ½ Δ^T H Δ
- Minimize w.r.t. the weight change Δ: Δ = −H^{-1} ∇L(θ)
- Jumps to the center of the quadratic bowl
- Optimal (single step) if the quadratic approximation holds, no guarantees otherwise
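A minimal sketch of one Newton step, solving H Δ = −∇L rather than inverting H explicitly:

    import numpy as np

    def newton_step(theta, grad, hessian):
        # jump towards the minimum of the local quadratic approximation
        delta = np.linalg.solve(hessian, -grad)
        return theta + delta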
Quasi-Newton / BFGS
- Keep an estimate of the inverse Hessian M
- Gradient premultiplied by M
- M always positive-definite
- Line search
- Update M incrementally
- e.g. in BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno)
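In practice one rarely implements BFGS by hand; a usage sketch with SciPy on a toy quadratic loss (the loss and gradient here are made up purely for illustration):

    import numpy as np
    from scipy.optimize import minimize

    # toy quadratic loss with an ill-conditioned Hessian
    A = np.diag([1.0, 10.0, 100.0])
    loss_fn = lambda theta: 0.5 * theta @ A @ theta
    grad_fn = lambda theta: A @ theta

    result = minimize(loss_fn, x0=np.ones(3), jac=grad_fn, method='BFGS')
    theta_opt = result.x   # close to the optimum at the origin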
Conjugate-Gradient (1)
- Find minimum along descent direction
- Using line search
- Find a conjugate direction: at all points along
this direction, the gradients are aligned
- If the Hessian is the identity, conjugate directions are simply orthogonal
- This guarantees
that the previous update is not spoiled
Conjugate-Gradient (2)
- Conjugate direction: d_t = −g_t + β_t d_{t−1}
- Fletcher-Reeves formula: β_t = (g_t · g_t) / (g_{t−1} · g_{t−1})
- Polak-Ribiere formula: β_t = g_t · (g_t − g_{t−1}) / (g_{t−1} · g_{t−1})
- CG update (always using line search): θ ← θ + α_t d_t, with α_t found by a line search along d_t
Gauss-Newton / Levenberg-Marquardt
- If the loss is the mean-squared error (MSE) between the output y and the target, the squared Jacobian J = ∂y/∂θ approximates the Hessian: H ≈ J^T J
- Gauss-Newton: Δ = −(J^T J)^{-1} ∇L
- Levenberg-Marquardt: Δ = −(J^T J + μ I)^{-1} ∇L
- Avoids instability when some eigenvalues are small
Second-order and Neural Nets
- Don’t scale well
- Use on small parameter spaces
- Operate in batch mode
- Use on small datasets
- Or use mini-batches (well, maybe)
- Sensitive to noise
- Very good if very precise values are needed
- Flip-side: more prone to overfitting to local optima
- SGD is hard to beat on large problems
Optimization Types
- First-order methods, aka gradient descent
- use gradients
- incremental steps downhill on energy surface
- Second-order methods
- use second derivatives (curvature)
- attempt large jumps (into the bottom of the valley)
- Zeroth-order methods, aka black-box
- use only values of the loss function
- somewhat random jumps
Natural Evolution Strategies (1)
Represent search points by a search distribution
- that is parameterized by distribution parameters (e.g. the mean and covariance of a Gaussian)
- iteratively update the distribution parameters,
- until a good enough fitness is found.
Natural Evolution Strategies (2)
Approach: minimize the expected loss under the search distribution
- by computing its gradient with respect to the distribution parameters,
- or its Monte Carlo approximation from a population of sampled search points,
- to update the distribution parameters with a learning rate (see the sketch below).
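A rough sketch of the Monte Carlo search-gradient step for an isotropic Gaussian search distribution with fixed variance, updating only the mean; full NES also adapts the covariance and uses the natural gradient from the next slide (all names here are illustrative):

    import numpy as np

    def search_gradient_step(mean, f, sigma=0.1, pop_size=20, lr=0.1):
        # sample a population from the search distribution N(mean, sigma^2 I)
        z = mean + sigma * np.random.randn(pop_size, mean.size)
        fitness = np.array([f(zk) for zk in z])
        # Monte Carlo gradient of E[f(z)] w.r.t. the mean, using
        # grad_mean log N(z; mean, sigma^2 I) = (z - mean) / sigma^2
        grad = np.mean(fitness[:, None] * (z - mean) / sigma ** 2, axis=0)
        return mean - lr * grad     # minimize the expected loss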
Natural Evolution Strategies (3)
But the steepest ("vanilla") gradient is not scale-invariant,
- so we use the natural gradient instead: premultiply by the inverse of the Fisher information matrix of the search distribution
(figure: vanilla vs. natural gradient directions)
Today’s Key Concepts
- Navigating loss surfaces
- First-order vs. second-order
- Batch vs. stochastic
- Hessians and conditioning
- Finding step-sizes
Today’s Take-homes (1)
- Always shuffle data
- Always normalize data, if possible decorrelate
- Use components that lead to zero-mean unit-variance activations
- Initialize weights carefully
Today’s Take-homes (2)
- By default, use SGD algorithm
- Use minibatches on parallel machines
- Either
- Use power method to roughly estimate optimal
learning rate, or
- Scale learning-rates element-wise by estimates of
diagonal Hessian and sample variance
- Use conjugate gradients on small-scale
and/or high-precision tasks
- Use black-box methods as last resort
References
- Most of today's material: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
- Adaptive learning rates: http://arxiv.org/abs/1206.1106
- Natural Evolution Strategies: http://arxiv.org/abs/1106.4487
- Any additional questions: schaul@cims.nyu.edu