
Advanced Machine Learning: Gradient Descent for Non-Convex Functions



  1. Advanced Machine Learning Gradient Descent for Non-Convex Functions Amit Sethi Electrical Engineering, IIT Bombay

  2. Learning outcomes for the lecture • Characterize non-convex loss surfaces with Hessian • List issues with non-convex surfaces • Explain how certain optimization techniques help solve these issues

  3. Contents • Characterizing non-convex loss surfaces • Issues with gradient descent • Issues with Newton’s method • Stochastic gradient descent to the rescue • Momentum and its variants • Saddle-free Newton

  4. Why do we not get stuck in bad local minima? • Local minima are close to the global minimum in terms of error • Saddle points are much more likely in the higher-error portions of the error surface (in high-dimensional weight space) • SGD (and other techniques) allows us to escape saddle points

  5. Error surfaces and saddle points (image sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif, http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png)

  6. Eigenvalues of the Hessian at critical points • Local minimum: all eigenvalues positive • Long furrow: some eigenvalues positive, others near zero • Plateau: eigenvalues near zero • Saddle point: eigenvalues of both signs http://i.stack.imgur.com/NsI2J.png
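As an illustration of this classification, here is a minimal NumPy sketch (the toy function and the tolerance are my own assumptions, not from the slides) that labels a critical point from the sign pattern of its Hessian eigenvalues:

```python
import numpy as np

# Toy 2-D surface f(x, y) = x^2 - y^2, which has a saddle point at the origin.
def hessian_f(x, y):
    # Hessian of x^2 - y^2 (constant for this toy function)
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

def classify_critical_point(H, tol=1e-8):
    """Label a critical point by the sign pattern of its Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "degenerate (plateau / furrow: some eigenvalues near zero)"

print(classify_critical_point(hessian_f(0.0, 0.0)))  # -> saddle point
```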

  7. A realistic picture of a loss landscape, showing global minima, local minima, local maxima, and saddle points. Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/

  8. Is achieving the global minimum important? • The global minimum for the training data may not be the global minimum for the validation or test data • Local minima are often good enough “The Loss Surfaces of Multilayer Networks”, Choromanska et al., JMLR’15

  9. Under certain assumptions, local minima are theoretically also of high quality • Results: – The lowest critical values of the random loss form a band – The probability of a minimum outside that band diminishes exponentially with the size of the network – Empirical verification • Assumptions: – Fully-connected feed-forward neural network – Variable independence – Redundancy in network parametrization – Uniformity “The Loss Surfaces of Multilayer Networks”, Choromanska et al., JMLR’15

  10. Empirically, most minima are of high quality “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, Dauphin et al., NIPS’14

  11. GD vs. Newton’s method • Gradient descent is based on a first-order approximation: $f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^{T} \Delta\theta$, giving the update $\Delta\theta = -\eta \, \nabla f$ • Newton’s method is based on a second-order approximation: $f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^{T} \Delta\theta + \tfrac{1}{2} \Delta\theta^{T} H \Delta\theta$, giving the update $\Delta\theta = -H^{-1} \nabla f$ • Near a critical point, in the eigenbasis of $H$: $f(\theta^* + \Delta\theta) = f(\theta^*) + \tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \, \Delta v_i^2$, where $\lambda_i$ are the eigenvalues of $H$ and $\Delta v_i$ is the motion along the $i$-th eigenvector “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, Dauphin et al., NIPS’14
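A small sketch, assuming a toy saddle surface f(w) = w0^2 - w1^2 (my own example, not from the slides), of how the two update rules behave; it also previews the next slide's point that Newton's update jumps straight onto the saddle, while gradient descent drifts away along the negative-curvature direction:

```python
import numpy as np

# Toy non-convex surface f(w) = w0^2 - w1^2 (a pure saddle at the origin).
def grad(w):
    return np.array([2.0 * w[0], -2.0 * w[1]])

def hessian(w):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

eta = 0.1
w_gd = np.array([1.0, 0.1])
w_newton = w_gd.copy()

for _ in range(50):
    # Gradient descent: first-order step, delta = -eta * gradient
    w_gd = w_gd - eta * grad(w_gd)
    # Newton's method: second-order step, delta = -H^{-1} * gradient
    w_newton = w_newton - np.linalg.solve(hessian(w_newton), grad(w_newton))

print("GD:    ", w_gd)       # the w1 component keeps growing: GD drifts away from the saddle
print("Newton:", w_newton)   # lands exactly on the saddle at (0, 0)
```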

  12. Disadvantages of 2nd-order methods • Updates require $O(d^3)$ or at least $O(d^2)$ computation • May not work well for non-convex surfaces • Get attracted to saddle points (how?) • Not very good for batch updates

  13. GD vs. SGD • GD: $w_{t+1} = w_t - \eta \, g(w_t)$, where $g$ is the gradient computed over all samples • SGD: $w_{t+1} = w_t - \eta \, \hat{g}(w_t)$, where $\hat{g}$ is the gradient computed over a random subset (mini-batch) of samples

  14. Compare GD with SGD • GD requires more computation per update • SGD updates are noisier (see the sketch below)
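A minimal sketch of the two update rules on an assumed least-squares problem; the synthetic data, batch size, and learning rate below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # illustrative inputs
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)    # illustrative targets

def grad(w, Xb, yb):
    # Gradient of the mean squared error 0.5 * ||Xb @ w - yb||^2 / len(yb)
    return Xb.T @ (Xb @ w - yb) / len(yb)

eta, batch_size = 0.1, 32
w_gd = np.zeros(5)
w_sgd = np.zeros(5)

for t in range(200):
    # GD: each update uses the gradient over *all* samples (expensive, smooth)
    w_gd -= eta * grad(w_gd, X, y)
    # SGD: each update uses a random mini-batch (cheap, noisy)
    idx = rng.choice(len(y), size=batch_size, replace=False)
    w_sgd -= eta * grad(w_sgd, X[idx], y[idx])
```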

  15. SGD helps by changing the loss surface • Different mini-batches (or individual samples) have their own loss surfaces • The loss surface of the entire training set may look different from any one of them • A local minimum of one loss surface may not be a local minimum of another • This helps stochastic or mini-batch gradient descent escape local minima • Mini-batch size is usually chosen based on computational resource utilization

  16. Noise can be added in other ways to escape saddle points • Random mini-batches (SGD) • Add noise to the gradient or the update • Add noise to the input
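For instance, the second option (noise added to the gradient) can be sketched as follows; the noise scale is an assumed hyperparameter, not from the slides:

```python
import numpy as np

def noisy_gradient_step(w, grad_fn, eta=0.1, noise_std=0.01, rng=None):
    """One descent step with Gaussian noise added to the gradient."""
    if rng is None:
        rng = np.random.default_rng()
    g = grad_fn(w)
    g_noisy = g + noise_std * rng.standard_normal(size=np.shape(g))
    return w - eta * g_noisy
```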

  17. Learning rate scheduling • High learning rates explore faster early on – But they can lead to divergence or a high final loss • Low learning rates fine-tune better later – But they can be very slow to converge • LR scheduling combines the advantages of both – Many schedules are possible: linear, exponential, square-root, step-wise, cosine (Figure: training loss vs. training iterations)
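Two of the schedules named above, step-wise and cosine, written as small Python functions; the decay factor, drop interval, and step counts are illustrative assumptions:

```python
import math

def step_lr(base_lr, epoch, drop_every=30, gamma=0.1):
    # Step-wise decay: multiply the learning rate by gamma every `drop_every` epochs
    return base_lr * (gamma ** (epoch // drop_every))

def cosine_lr(base_lr, step, total_steps, min_lr=0.0):
    # Cosine annealing from base_lr down to min_lr over total_steps
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos

# e.g. cosine_lr(0.1, 0, 100) == 0.1 and cosine_lr(0.1, 100, 100) == 0.0
```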

  18. Classical and Nesterov momentum • GD: $w_{t+1} = w_t - \eta \, g(w_t)$ • Classical momentum: $v_{t+1} = \alpha v_t - \eta \, g(w_t)$; $w_{t+1} = w_t + v_{t+1}$ • Nesterov momentum: $v_{t+1} = \alpha v_t - \eta \, g(w_t + \alpha v_t)$; $w_{t+1} = w_t + v_{t+1}$ • Better course-correction for a bad velocity Nesterov, “A method of solving a convex programming problem with convergence rate $O(1/k^2)$”, 1983; Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013
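Both update rules as a small Python sketch; `g` is a gradient function supplied by the caller, and the default α and η values are assumptions for illustration:

```python
def classical_momentum_step(w, v, g, eta=0.01, alpha=0.9):
    # v_{t+1} = alpha * v_t - eta * g(w_t);  w_{t+1} = w_t + v_{t+1}
    v_new = alpha * v - eta * g(w)
    return w + v_new, v_new

def nesterov_momentum_step(w, v, g, eta=0.01, alpha=0.9):
    # Same update, but the gradient is evaluated at the look-ahead point w_t + alpha * v_t
    v_new = alpha * v - eta * g(w + alpha * v)
    return w + v_new, v_new
```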

  19. AdaGrad, RMSProp, AdaDelta • Scales the gradient by a running norm of all the previous gradients • Per dimension: $x_{t+1} = x_t - \dfrac{\eta}{\sqrt{\sum_{\tau=1}^{t} g(x_\tau)^2 + \epsilon}} \, g(x_t)$ • Automatically reduces the learning rate with $t$ • Parameters with small gradients speed up • RMSProp and AdaDelta use a forgetting factor in the squared-gradient sum so that the updates do not become too small
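A per-parameter sketch of AdaGrad and of the RMSProp-style forgetting factor (AdaDelta is omitted); the default learning rates, decay factor, and ε are common choices assumed here, not taken from the slides:

```python
import numpy as np

def adagrad_step(w, g, accum, eta=0.01, eps=1e-8):
    # AdaGrad: accumulate the sum of squared gradients over all past steps
    accum = accum + g ** 2
    return w - eta * g / np.sqrt(accum + eps), accum

def rmsprop_step(w, g, accum, eta=0.001, rho=0.9, eps=1e-8):
    # RMSProp: exponential moving average (forgetting factor rho) of squared gradients
    accum = rho * accum + (1.0 - rho) * g ** 2
    return w - eta * g / np.sqrt(accum + eps), accum
```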

  20. Adam optimizer combines AdaGrad and momentum • Initialize $m_0 = 0$, $v_0 = 0$ • Loop over $t$: – Get gradient: $g_t = \nabla_x f_t(x_{t-1})$ – Update first moment (biased): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ – Update second moment (biased): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ – Correct bias in first moment: $\hat{m}_t = m_t / (1 - \beta_1^t)$ – Correct bias in second moment: $\hat{v}_t = v_t / (1 - \beta_2^t)$ – Update parameters: $x_t = x_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ “Adam: A method for stochastic optimization”, Kingma and Ba, ICLR’15
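The same loop written as a compact Python sketch; the default β1, β2, ε, and η follow the usual Adam defaults, and `grad_fn` is an assumed user-supplied gradient function:

```python
import numpy as np

def adam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first moment estimate
    v = np.zeros_like(w)   # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g        # biased first moment
        v = beta2 * v + (1.0 - beta2) * g ** 2   # biased second moment
        m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

# e.g. adam(lambda w: 2.0 * w, w0=[5.0, -3.0], eta=0.1) drives w toward the minimum of ||w||^2 at 0
```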

  21. Visualizing optimizers Source: http://ruder.io/optimizing-gradient-descent/index.html
