

  1. Optimization for Machine Learning
     Tom Schaul (schaul@cims.nyu.edu)

  2. Recap: Learning Machines
     • Learning machines (Neural Networks, etc.)
     • Forward passes produce a function of the input
     • Trainable parameters (aka weights, biases, etc.)
     • Backward passes compute gradients of the loss w.r.t. the parameters
     • Modular structure → chain rule (aka Backprop)
     • Loss function (aka energy, cost, error)
       • an expectation over samples from the dataset
     • Today: algorithms for minimizing the loss

  3. Flattening Parameters
     • Parameter space: all weights collected into a single vector θ
     • Gradient from backprop: collected into a vector of the same size
     • Element-wise correspondence between parameters and gradient entries
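A minimal sketch of what the flattening looks like in NumPy; the layer shapes here are made up for illustration, and the gradients would normally be filled in by backprop:

    import numpy as np

    params = [np.random.randn(784, 100), np.random.randn(100),   # layer 1: W, b
              np.random.randn(100, 10),  np.random.randn(10)]    # layer 2: W, b
    grads  = [np.zeros_like(p) for p in params]                  # filled by backprop

    theta = np.concatenate([p.ravel() for p in params])          # parameter vector
    g     = np.concatenate([dp.ravel() for dp in grads])         # matching gradient vector

    def unflatten(theta, shapes=[p.shape for p in params]):
        """Split a flat vector back into per-layer arrays (element-wise correspondence)."""
        out, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            out.append(theta[i:i + n].reshape(s))
            i += n
        return out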

  4. Energy Surfaces
     • We can visualize the loss as a function of the parameters
     • Properties:
       • Local optima
       • Saddle points
       • Steep cliffs
       • Narrow, bent valleys
       • Flat areas
     • Only convex in the simplest cases
       • Convex optimization tools are of limited use

  5. Sample Variance
     • Every sample has a contribution to the loss
     • Sample distributions are complex
     • Sample gradients can have high variance

  6. Optimization Types
     • First-order methods, aka gradient descent
       • use gradients
       • incremental steps downhill on the surface
     • Second-order methods
       • use second derivatives (curvature)
       • attempt large jumps (into the bottom of the valley)
     • Zeroth-order methods, aka black-box
       • use only values of the loss function
       • somewhat random jumps

  7. Batch vs. Stochastic
     • Batch methods are based on the true loss
       • Reliable gradients, large updates
     • Stochastic methods use sample gradients
       • Many more updates, smaller steps
     • Minibatch methods interpolate in between
       • Gradients are averaged over n samples

  8. Gradient Descent
     • Step in the direction of steepest descent: Δθ = −η ∇L(θ)
     • The gradient comes from backprop
     • How to choose the step size η?
       • Line search (extra evaluations)
       • Fixed number
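A minimal sketch of batch gradient descent with a fixed step size, on a toy quadratic loss; loss_grad here is a hypothetical stand-in for a full forward/backward pass over the training set:

    import numpy as np

    def gradient_descent(theta, loss_grad, lr=0.1, n_steps=100):
        for _ in range(n_steps):
            loss, g = loss_grad(theta)
            theta = theta - lr * g          # step in the direction of steepest descent
        return theta

    # Toy quadratic loss 0.5 * theta^T H theta, badly conditioned on purpose
    H = np.diag([1.0, 10.0])
    loss_grad = lambda th: (0.5 * th @ H @ th, H @ th)
    theta_star = gradient_descent(np.array([1.0, 1.0]), loss_grad, lr=0.15)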

  9. Convergence of GD (1D)
     • Iteratively approach the optimum

  10. Optimal Learning Rate (1D)
      • Weight change: Δw = −η dE/dw
      • With a quadratic loss: dE/dw = (d²E/dw²)(w − w*)
      • Optimal learning rate: η_opt = (d²E/dw²)⁻¹, which reaches the minimum w* in a single step

  11. Convergence of GD (N-Dim)
      • Assumption: smooth loss function
      • Quadratic approximation around the optimum, with Hessian matrix H
      • Convergence condition: the update map (I − ηH) must shrink any vector

  12. Convergence of GD (N-Dim)
      • Change of coordinates (rotation into the eigenbasis of H) such that the Hessian becomes diagonal
      • Then the dimensions decouple
      • Intuition: GD in N dimensions is equivalent to N 1D descents along the eigenvectors of H
      • Convergence if η < 2/λ_i for every eigenvalue, i.e. η < 2/λ_max

  13. GD Convergence: Example
      • Batch GD
      • Small learning rate
      • Convergence

  14. GD Convergence: Example
      • Batch GD
      • Large learning rate
      • Divergence

  15. GD Convergence: Example
      • Stochastic GD
      • Large learning rate
      • Fast convergence

  16. Convergence Speed
      • With the optimal fixed learning rate (set by the largest-curvature direction)
      • One-step convergence in that direction
      • Slower in all the others
      • Total number of iterations is proportional to the condition number of the Hessian (λ_max/λ_min)

  17. Optimal LR Estimation
      • A cheap way of estimating the optimal learning rate (without computing H first)
      • Part 1: cheap Hessian-vector products
      • Part 2: the power method

  18. Hessian-vector Products
      • Based on the finite-difference approximation
        Hv ≈ (∇L(θ + εv) − ∇L(θ)) / ε
        where the perturbed gradient is obtained from one additional forward/backward pass (after perturbing the parameters)
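A sketch of the finite-difference Hessian-vector product; grad is a hypothetical function returning the gradient at a parameter vector (one backprop pass), and eps is a small perturbation:

    import numpy as np

    def hessian_vector_product(grad, theta, v, eps=1e-5):
        # Hv ≈ (grad(theta + eps*v) - grad(theta)) / eps
        return (grad(theta + eps * v) - grad(theta)) / eps

    # Sanity check on a quadratic loss 0.5 * theta^T H theta, whose gradient is H theta
    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad = lambda th: H @ th
    v = np.array([1.0, 0.0])
    print(hessian_vector_product(grad, np.zeros(2), v))   # ≈ H @ v = [3, 1]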

  19. Power Method
      • We know that iterating v ← Hv / ‖Hv‖ converges to the principal eigenvector, with ‖Hv‖ → λ_max
      • With sample estimates (online), we introduce some robustness by averaging the iterates
        (good enough after 10-100 samples)
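A sketch of the power method on top of the finite-difference Hessian-vector product; grad is the same hypothetical gradient function, and in the stochastic setting one would use a fresh sample gradient each iteration and average the eigenvalue estimates:

    import numpy as np

    def largest_eigenvalue(grad, theta, n_iter=50, eps=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(theta.shape)
        v /= np.linalg.norm(v)
        lam = 0.0
        for _ in range(n_iter):
            hv = (grad(theta + eps * v) - grad(theta)) / eps   # Hv
            lam = np.linalg.norm(hv)                           # current estimate of lambda_max
            v = hv / (lam + 1e-12)                             # normalize for the next iteration
        return lam

    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad = lambda th: H @ th
    print(largest_eigenvalue(grad, np.zeros(2)))   # ≈ 3.62, the largest eigenvalue of H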

  20. Optimal LR Estimation

  21. Conditioning of H
      • Some parameters are more sensitive than others
      • Very different scales
      • Illustration
      • Solution 1 (model/data)
      • Solution 2 (algorithm)

  22. H-eigenvalues in Neural Nets (1)
      • Few large ones
      • Many medium ones
      • Spanning orders of magnitude

  23. H-eigenvalues in Neural Nets (2)
      • Differences by layer
      • Steeper gradients on biases

  24. H-Conditioning: Solution 1
      • Normalize the data
        • Always useful, rarely sufficient
      • How?
        • Subtract the mean from the inputs
        • If possible: decorrelate the inputs
        • Divide by the standard deviation of each input (all unit variance)
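A sketch of this normalization; X is a hypothetical (n_samples, n_features) data matrix, and the decorrelation step simply rotates into the PCA basis:

    import numpy as np

    def normalize_inputs(X, decorrelate=False):
        X = X - X.mean(axis=0)                      # subtract the mean from each input
        if decorrelate:
            # rotate into the PCA basis so the inputs are (linearly) decorrelated
            cov = np.cov(X, rowvar=False)
            _, eigvecs = np.linalg.eigh(cov)
            X = X @ eigvecs
        X = X / (X.std(axis=0) + 1e-8)              # unit variance per input
        return X

    X = np.random.randn(1000, 5) * np.array([1, 10, 100, 0.1, 5]) + 3.0
    Xn = normalize_inputs(X, decorrelate=True)
    print(Xn.mean(axis=0).round(3), Xn.std(axis=0).round(3))   # ≈ zeros and ones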

  25. H-Conditioning: Solution 1
      • Normalize the data
      • Structural choices
        • Non-linearities with zero-mean, unit-variance activations
        • Explicit normalization layers
      • Weight initialization
        • such that all hidden activations have approximately zero mean and unit variance

  26. H-Conditioning: Solution 2
      • Algorithmic solution:
        • Take smaller steps in sensitive directions
        • One learning rate per parameter: η_i = η / (|H_ii| + μ)
        • Estimate the diagonal of the Hessian
        • Small constant μ for stability
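A sketch of the per-parameter update; h_diag is a hypothetical estimate of the diagonal curvature for each parameter (e.g. from bbprop below, or finite differences):

    import numpy as np

    def diagonal_scaled_update(theta, g, h_diag, eta=0.1, mu=1e-3):
        # smaller steps where curvature (sensitivity) is large; mu keeps it stable
        per_param_lr = eta / (np.abs(h_diag) + mu)
        return theta - per_param_lr * g

    theta = np.array([1.0, 1.0])
    g = np.array([1.0, 10.0])           # gradient
    h_diag = np.array([1.0, 10.0])      # curvature estimate: second direction is sensitive
    print(diagonal_scaled_update(theta, g, h_diag))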

  27. Hessian Estimation
      • Approximate the full Hessian
      • Finite-difference approximation of the k-th row:
        H[k, :] ≈ (∇L(θ + ε e_k) − ∇L(θ)) / ε
      • One forward/backward pass for each parameter (perturbed slightly)
      • Concatenate all the rows
      • Symmetrize the resulting matrix
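A sketch of this row-by-row construction; grad is again a hypothetical gradient function, and since each row costs one extra forward/backward pass this is only feasible for small models:

    import numpy as np

    def finite_difference_hessian(grad, theta, eps=1e-5):
        n = theta.size
        g0 = grad(theta)
        H = np.zeros((n, n))
        for k in range(n):
            e_k = np.zeros(n)
            e_k[k] = 1.0
            H[k, :] = (grad(theta + eps * e_k) - g0) / eps   # k-th row
        return 0.5 * (H + H.T)                               # symmetrize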

  28. BBprop (1)
      • Cheaply approximate the Hessian in a modular architecture (module with input x, output y)
      • Assume we have ∂²E/∂y²
      • Find ∂²E/∂x² and ∂²E/∂w²
      • Apply the chain rule
      • Positive-definite approximation
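As a hedged reconstruction, this appears to be the standard Gauss-Newton ("bbprop") recursion from LeCun et al.'s Efficient BackProp: for a module y = f(x, w), the term involving the module's own second derivative is dropped so that the estimate stays positive semi-definite,

    \frac{\partial^2 E}{\partial x_i^2} \;\approx\; \sum_j \left(\frac{\partial y_j}{\partial x_i}\right)^2 \frac{\partial^2 E}{\partial y_j^2}

and, for a linear module y = Wx,

    \frac{\partial^2 E}{\partial x_i^2} \approx \sum_j W_{ji}^2 \,\frac{\partial^2 E}{\partial y_j^2},
    \qquad
    \frac{\partial^2 E}{\partial w_{ji}^2} \approx x_i^2 \,\frac{\partial^2 E}{\partial y_j^2}.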

  29. BBprop (2)
      • Keep just the diagonal terms
      • Take an exponential moving average of the estimates

  30. Batch vs. Stochastic
      • Batch methods
        • True loss, reliable gradients, large updates
        • But:
          • Expensive on large datasets
          • Slowed by redundant samples
      • Stochastic methods
        • Many more updates, smaller steps
      • Minibatch methods
        • Gradients are averaged over n samples

  31. Batch vs. Stochastic
      • Batch methods
      • Stochastic methods (SGD)
        • Many more updates, smaller steps
        • More aggressive
        • Also works online (e.g. streaming data)
        • Cooling schedule on the learning rate (guaranteed to converge)
      • Minibatch methods

  32. Batch vs. Stochastic
      • Batch methods
      • Stochastic methods
      • Minibatch methods
        • Stochastic updates, but more accurate gradients based on a small number of samples
        • In between SGD and batch GD
        • Not usually faster, but much easier to parallelize
        • Samples in a minibatch should be diverse
          • Don't forget to shuffle the dataset!
          • Stratified sampling
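A sketch of minibatch SGD with per-epoch shuffling; grad_on_batch is a hypothetical function returning the gradient averaged over a set of samples:

    import numpy as np

    def minibatch_sgd(theta, X, y, grad_on_batch, lr=0.01, batch_size=32, n_epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        for _ in range(n_epochs):
            perm = rng.permutation(n)                     # shuffle the dataset each epoch
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]      # a (hopefully diverse) minibatch
                g = grad_on_batch(theta, X[idx], y[idx])  # gradient averaged over the batch
                theta = theta - lr * g
        return theta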

  33. Variance-normalization
      • The SGD learning rate depends on the Hessian, but also on the sample variance
      • Intuition: parameters whose gradients vary wildly across samples should be updated with smaller learning rates than stable ones
      • Variance-scaled rates, built from running estimates of each gradient's mean and of its square (see the sketch below)
        • those estimates are exponential moving averages
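A hedged sketch of one variance-scaled scheme in this spirit: keep exponential moving averages of each parameter's gradient and squared gradient, and shrink the rate where the gradient is noisy relative to its mean. The particular scaling used here (g_bar² / v_bar) is an assumption, not necessarily the slide's exact formula:

    import numpy as np

    class VarianceScaledSGD:
        def __init__(self, n_params, base_lr=0.1, decay=0.9, eps=1e-8):
            self.g_bar = np.zeros(n_params)   # EMA of the gradient
            self.v_bar = np.ones(n_params)    # EMA of the squared gradient
            self.base_lr, self.decay, self.eps = base_lr, decay, eps

        def step(self, theta, g):
            self.g_bar = self.decay * self.g_bar + (1 - self.decay) * g
            self.v_bar = self.decay * self.v_bar + (1 - self.decay) * g * g
            # large rate when the gradient is consistent across samples
            # (g_bar^2 close to v_bar), small when it is dominated by variance
            lr = self.base_lr * self.g_bar ** 2 / (self.v_bar + self.eps)
            return theta - lr * g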

  34. Variance-normalization
      • This scheme is adaptive: no need for tuning

  35. Optimization Types
      • First-order methods, aka gradient descent
        • use gradients
        • incremental steps downhill on the energy surface
      • Second-order methods
        • use second derivatives (curvature)
        • attempt large jumps (into the bottom of the valley)
      • Zeroth-order methods, aka black-box
        • use only values of the loss function
        • somewhat random jumps

  36. Second-order Optimization
      • Newton's method
      • Quasi-Newton (BFGS)
      • Conjugate gradients
      • Gauss-Newton (Levenberg-Marquardt)
      • Many more:
        • Momentum
        • Nesterov accelerated gradient
        • Natural gradient descent

  37. Newton's Method
      • Locally quadratic approximation of the loss
      • Minimize it w.r.t. the weight change: Δθ = −H⁻¹ ∇L(θ)
      • Jumps to the center of the quadratic bowl
      • Optimal (a single step) if the quadratic approximation holds; no guarantees otherwise
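A sketch of a single Newton step on a quadratic bowl, solving H Δθ = −g rather than inverting H explicitly; on a real loss surface the quadratic approximation (and positive definiteness of H) may not hold:

    import numpy as np

    def newton_step(theta, g, H):
        return theta + np.linalg.solve(H, -g)

    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    theta = np.array([1.0, -2.0])
    g = H @ theta                      # gradient of 0.5 * theta^T H theta
    print(newton_step(theta, g, H))    # ≈ [0, 0], the bottom of the bowl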

  38. Quasi-Newton / BFGS
      • Keep an estimate M of the inverse Hessian
      • Step along the gradient premultiplied by M
      • M is always positive definite
      • Line search
      • Update M incrementally
        • e.g. as in the BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno)
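In practice one rarely hand-codes the BFGS update; a sketch using SciPy's optimizer, which maintains the inverse-Hessian estimate and performs the line search internally. The quadratic loss is just a stand-in for a network's flattened loss and gradient:

    import numpy as np
    from scipy.optimize import minimize

    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    loss = lambda th: 0.5 * th @ H @ th
    grad = lambda th: H @ th

    result = minimize(loss, x0=np.array([1.0, -2.0]), jac=grad, method="BFGS")
    print(result.x)   # ≈ [0, 0]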
