
SLIDE 1

Advanced Machine Learning Gradient Descent for Non-Convex Functions

Amit Sethi Electrical Engineering, IIT Bombay

SLIDE 2

Learning outcomes for the lecture

  • Characterize non-convex loss surfaces with the Hessian
  • List issues with non-convex surfaces
  • Explain how certain optimization techniques help solve these issues

SLIDE 3

Contents

  • Characterizing non-convex loss surfaces
  • Issues with gradient descent
  • Issues with Newton’s method
  • Stochastic gradient descent to the rescue
  • Momentum and its variants
  • Saddle-free Newton
SLIDE 4

Why do we not get stuck in bad local minima?

  • Local minima are close to global minima in terms of error
  • Saddle points are much more likely at higher portions of the error surface (in high-dimensional weight space)
  • SGD (and other techniques) allows you to escape saddle points

SLIDE 5

Error surfaces and saddle points

Image sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif ; http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png

SLIDE 6

Eigenvalues of Hessian at critical points

Image source: http://i.stack.imgur.com/NsI2J.png

[Figure: Hessian eigenvalue patterns at critical points: long furrow, local minimum, saddle point, plateau]
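
A minimal sketch of this classification (my own illustration, not from the slides): compute the Hessian's eigenvalues at a critical point and read off the type. The test function f(x, y) = x² − y² is a hypothetical example whose Hessian at the origin is diag(2, −2).

    import numpy as np

    # Hypothetical test function f(x, y) = x**2 - y**2, critical point at the origin.
    # Its Hessian there is constant: diag(2, -2).
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])

    eigvals = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian

    if np.all(eigvals > 0):
        kind = "local minimum"
    elif np.all(eigvals < 0):
        kind = "local maximum"
    elif np.any(eigvals > 0) and np.any(eigvals < 0):
        kind = "saddle point"
    else:
        kind = "degenerate (plateau / long furrow along zero-eigenvalue directions)"

    print(eigvals, kind)  # [-2.  2.] saddle point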

SLIDE 7

A realistic picture

Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/

[Figure: a realistic loss landscape with global minima, local minima, saddle points, and local maxima labeled]

SLIDE 8

Is achieving global minima important?

  • The global minimum for the training data may not be the global minimum for the validation or test data
  • Local minima are often good enough

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15

SLIDE 9

Under certain assumptions, local minima are theoretically also of high quality

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15

  • Results:

    – The lowest critical values of the random loss form a band
    – The probability of finding a minimum outside that band diminishes exponentially with the size of the network
    – Empirical verification

  • Assumptions:

    – Fully-connected feed-forward neural network
    – Variable independence
    – Redundancy in network parametrization
    – Uniformity

SLIDE 10

Empirically, most minima are of high quality

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14

SLIDE 11

GD vs. Newton’s method

  • Gradient descent is based on a first-order approximation:
    f(θ* + Δθ) ≈ f(θ*) + gᵀΔθ,   with update Δθ = −η g
  • Newton's method is based on a second-order approximation:
    f(θ* + Δθ) ≈ f(θ*) + gᵀΔθ + ½ ΔθᵀH Δθ,   with update Δθ = −H⁻¹ g
  • In the basis of the Hessian's eigenvectors (with eigenvalues λᵢ), the second-order expansion becomes
    f(θ* + Δθ) = f(θ*) + ½ Σ i=1..n λᵢ Δvᵢ²

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
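
A minimal sketch of this contrast (my own example, not from the lecture): on a hypothetical convex quadratic f(w) = ½ wᵀHw, the Newton step −H⁻¹g lands on the minimum in one update, while the gradient-descent step −η g needs many iterations along low-curvature directions.

    import numpy as np

    # Hypothetical quadratic loss f(w) = 0.5 * w^T H w, with gradient g(w) = H w.
    H = np.array([[10.0, 0.0],
                  [0.0, 1.0]])          # positive definite but ill-conditioned
    grad = lambda w: H @ w

    w0 = np.array([1.0, 1.0])

    # Newton: delta_w = -H^{-1} g(w); lands on the minimum at the origin in one step.
    w_newton = w0 - np.linalg.solve(H, grad(w0))

    # Gradient descent: delta_w = -eta * g(w); slow along the shallow (low-curvature) axis.
    eta, w_gd = 0.1, w0.copy()
    for _ in range(20):
        w_gd = w_gd - eta * grad(w_gd)

    print(w_newton)   # [0. 0.]
    print(w_gd)       # ~[0.0, 0.12]: the low-curvature coordinate is still far from 0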

SLIDE 12

Disadvantages of 2nd order methods

  • Updates require O(d³) or at least O(d²) computation
  • May not work well for non-convex surfaces
  • Get attracted to saddle points: the Newton step −H⁻¹g jumps to the critical point of the local quadratic model, which can be a saddle when H has negative eigenvalues (illustrated below)
  • Not very good for batch-updates
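
A small illustration of the saddle-attraction point (my own example, not from the slides): for the hypothetical function f(x, y) = x² − y², whose only critical point is the saddle at the origin, the Newton step jumps exactly onto the saddle, while a gradient step moves away from it along the negative-curvature direction.

    import numpy as np

    # Hypothetical saddle example f(x, y) = x**2 - y**2.
    grad = lambda w: np.array([2.0 * w[0], -2.0 * w[1]])
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])   # constant Hessian, one negative eigenvalue

    w = np.array([0.5, 0.1])

    w_newton = w - np.linalg.solve(H, grad(w))   # -> [0, 0], the saddle point
    w_gd = w - 0.1 * grad(w)                     # -> [0.4, 0.12], moving away from the saddle in y

    print(w_newton, w_gd)
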
SLIDE 13

GD vs. SGD

  • GD:
  • wt+1 = wt − η g(wt), with g computed over all samples
  • SGD:
  • wt+1 = wt − η g(wt), with g computed over a random subset of samples

[Figure: one GD step using the gradient over all samples vs. one SGD step using the gradient over a random subset]
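
A minimal sketch of the two update rules above (the linear least-squares problem, data, learning rate, and batch size are my own assumptions, not from the slides): both use wt+1 = wt − η g(wt); they differ only in which samples the gradient is averaged over.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical linear least-squares data: y ≈ X @ w_true.
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

    def grad(w, idx):
        # Gradient of the mean squared error over the samples indexed by idx.
        Xb, yb = X[idx], y[idx]
        return 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)

    eta, batch_size = 0.05, 32
    w_gd, w_sgd = np.zeros(5), np.zeros(5)

    for t in range(200):
        # GD: one update using all samples.
        w_gd -= eta * grad(w_gd, np.arange(len(y)))
        # SGD: same rule, but the gradient is estimated from a random subset.
        w_sgd -= eta * grad(w_sgd, rng.choice(len(y), size=batch_size, replace=False))

    print(w_gd, w_sgd)   # both approach w_true; the SGD iterate is noisier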

SLIDE 14

Compare GD with SGD

  • GD requires more computations per update
  • SGD is noisier
SLIDE 15

SGD helps by changing the loss surface

  • Different mini-batches (or samples) have their own loss surfaces
  • The loss surface of the entire training set may be different
  • A local minimum of one loss surface may not be a local minimum of another (illustrated below)
  • This helps us escape local minima when using stochastic or mini-batch gradient descent
  • Mini-batch size depends on computational resource utilization
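
A small numerical illustration of the first three bullets (my own example; the 1-D model y ≈ w·x and the data are hypothetical): each mini-batch defines its own loss in w, and its minimizer generally differs from the full-data minimizer.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical 1-D least-squares problem: y ≈ w * x.
    x = rng.normal(size=200)
    y = 2.0 * x + rng.normal(size=200)

    def minimizer(idx):
        # Closed-form minimizer of the mini-batch loss sum((y - w*x)**2) over samples idx.
        return np.dot(x[idx], y[idx]) / np.dot(x[idx], x[idx])

    batches = np.array_split(rng.permutation(200), 10)
    print([round(minimizer(b), 3) for b in batches])   # each mini-batch has its own minimum
    print(round(minimizer(np.arange(200)), 3))         # minimum of the full training loss
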
SLIDE 16

Noise can be added in other ways to escape saddle points

  • Random mini-batches (SGD)
  • Add noise to the gradient or the update
  • Add noise to the input
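
A minimal sketch of the second option, adding noise to the gradient (the Gaussian noise and its decaying scale are my own assumptions following common practice, not something the slide specifies):

    import numpy as np

    rng = np.random.default_rng(2)

    def noisy_gradient_step(w, grad_fn, eta, t, sigma0=0.1, decay=0.55):
        # One gradient step with annealed Gaussian noise added to the gradient (assumed schedule).
        noise_std = sigma0 / (1.0 + t) ** decay
        return w - eta * (grad_fn(w) + rng.normal(scale=noise_std, size=w.shape))

    # On the hypothetical saddle f(x, y) = x**2 - y**2, starting on the ridge y = 0 traps plain GD;
    # the injected noise knocks w off the ridge, and descent then carries it away from the saddle.
    grad = lambda w: np.array([2.0 * w[0], -2.0 * w[1]])
    w = np.array([0.5, 0.0])
    for t in range(80):
        w = noisy_gradient_step(w, grad, eta=0.05, t=t)
    print(w)   # the y-coordinate has moved away from 0, unlike plain GD which would stay on the ridge
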
SLIDE 17

Learning rate scheduling

  • High learning rates explore faster earlier
    – But they can lead to divergence or a high final loss
  • Low learning rates fine-tune better later
    – But they can be very slow to converge
  • LR scheduling combines the advantages of both
    – Many schedules are possible: linear, exponential, square-root, step-wise, cosine (two are sketched below)

[Figure: training loss vs. training iterations under different learning-rate schedules]
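
A small sketch of two of the schedules listed above, step-wise and cosine (the base rate, decay factor, and horizon are example values of my own choosing):

    import math

    def step_lr(t, base_lr=0.1, drop=0.1, every=30):
        # Step-wise schedule: multiply the learning rate by `drop` every `every` iterations.
        return base_lr * drop ** (t // every)

    def cosine_lr(t, base_lr=0.1, total_steps=100):
        # Cosine schedule: decay smoothly from base_lr to 0 over total_steps iterations.
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(t, total_steps) / total_steps))

    print([round(step_lr(t), 4) for t in (0, 29, 30, 60, 90)])
    print([round(cosine_lr(t), 4) for t in (0, 25, 50, 75, 100)])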

SLIDE 18

Classical and Nesterov Momentum

  • GD:
  • wt+1 = wt − η g(wt)
  • Classical momentum:
  • vt+1 = α vt − η g(wt);
  • wt+1 = wt + vt+1
  • Nesterov momentum
  • vt+1 = α vt − η g(wt+αvt);
  • wt+1 = wt + vt+1
  • Better course-correction for bad velocity

[Figure: update vectors wt → wt+1 for classical momentum (α vt − η g(wt)) and Nesterov momentum (α vt − η g(wt + α vt))]

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013
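
A minimal sketch of both updates exactly as written on this slide (the gradient callable, learning rate η, and momentum coefficient α are placeholders of my own):

    import numpy as np

    def classical_momentum_step(w, v, grad_fn, eta=0.01, alpha=0.9):
        # Classical momentum: v_{t+1} = alpha*v_t - eta*g(w_t); w_{t+1} = w_t + v_{t+1}.
        v_next = alpha * v - eta * grad_fn(w)
        return w + v_next, v_next

    def nesterov_momentum_step(w, v, grad_fn, eta=0.01, alpha=0.9):
        # Nesterov momentum: the gradient is evaluated at the look-ahead point w_t + alpha*v_t.
        v_next = alpha * v - eta * grad_fn(w + alpha * v)
        return w + v_next, v_next

    # Example on a hypothetical quadratic bowl f(w) = 0.5 * ||w||^2, with g(w) = w.
    grad = lambda w: w
    w, v = np.ones(2), np.zeros(2)
    for _ in range(100):
        w, v = nesterov_momentum_step(w, v, grad)
    print(w)   # near the minimum at the origin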

SLIDE 19

AdaGrad, RMSProp, AdaDelta

  • Scales the gradient by a running norm of all the previous gradients
  • Per dimension (sketched in code below):
    xt+1 = xt − η g(xt) / ( √( Σ j=1..t g(xj)² ) + ε )
  • Automatically reduces the learning rate with t
  • Parameters with small gradients speed up
  • RMSProp and AdaDelta use a forgetting factor in the squared-gradient accumulator so that the updates do not become too small
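
A per-dimension sketch of the AdaGrad rule above and the RMSProp variant with a forgetting factor (the step sizes, ρ = 0.9, and ε are typical defaults assumed here, not values from the slide):

    import numpy as np

    def adagrad_step(x, acc, g, eta=0.01, eps=1e-8):
        # AdaGrad: accumulate the sum of squared gradients and scale each dimension by it.
        acc = acc + g ** 2
        x = x - eta * g / (np.sqrt(acc) + eps)
        return x, acc

    def rmsprop_step(x, acc, g, eta=0.001, rho=0.9, eps=1e-8):
        # RMSProp: exponential moving average (forgetting factor rho) instead of a full sum,
        # so the effective learning rate does not shrink towards zero.
        acc = rho * acc + (1.0 - rho) * g ** 2
        x = x - eta * g / (np.sqrt(acc) + eps)
        return x, acc

    # Usage on a hypothetical gradient:
    x, acc = np.array([1.0, -1.0]), np.zeros(2)
    x, acc = adagrad_step(x, acc, g=np.array([0.5, -2.0]))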

SLIDE 20

Adam optimizer combines AdaGrad and momentum

  • Initialize: m0 = 0, v0 = 0
  • Loop over t:
  • gt = ∇x ft(xt−1)   (get gradient)
  • mt = β1 mt−1 + (1 − β1) gt   (update first moment, biased)
  • vt = β2 vt−1 + (1 − β2) gt²   (update second moment, biased)
  • m̂t = mt / (1 − β1^t)   (correct bias in first moment)
  • v̂t = vt / (1 − β2^t)   (correct bias in second moment)
  • xt = xt−1 − α m̂t / (√v̂t + ε)   (update parameters)

“ADAM: A method for stochastic optimization” Kingma and Ba, ICLR’15
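
A direct transcription of the loop above into code (β1, β2, α, and ε are set to the defaults suggested by Kingma and Ba; the gradient callable and the quadratic test problem are placeholders of my own):

    import numpy as np

    def adam(grad_fn, x0, steps=1000, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adam: momentum-style first moment plus AdaGrad-style second moment, both bias-corrected.
        x = np.asarray(x0, dtype=float)
        m = np.zeros_like(x)   # first moment
        v = np.zeros_like(x)   # second moment
        for t in range(1, steps + 1):
            g = grad_fn(x)                        # get gradient
            m = beta1 * m + (1 - beta1) * g       # update first moment (biased)
            v = beta2 * v + (1 - beta2) * g ** 2  # update second moment (biased)
            m_hat = m / (1 - beta1 ** t)          # correct bias in first moment
            v_hat = v / (1 - beta2 ** t)          # correct bias in second moment
            x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)   # update parameters
        return x

    # Example on a hypothetical quadratic: moves towards the minimum at the origin.
    print(adam(lambda x: 2.0 * x, x0=[1.0, -3.0], alpha=0.01))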

SLIDE 21

Visualizing optimizers

Source: http://ruder.io/optimizing-gradient-descent/index.html