Applied Machine Learning
Gradient Descent Methods
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Learning objectives:
- basic idea of gradient descent
- stochastic gradient descent
- method of momentum
- using an adaptive learning rate
- sub-gradient
Inference and learning of a model often involves optimization:
- discrete (combinatorial) vs. continuous variables
- constrained vs. unconstrained
For continuous optimization in ML:
- convex vs. non-convex
- looking for local vs. global optima?
- analytic gradient? analytic Hessian?
- stochastic vs. batch
- smooth vs. non-smooth
(bold on the slide: the setting considered in this class)
For a multivariate function J(w_1, w_2) we use partial derivatives instead of the derivative:

∂J/∂w_1 = lim_{ϵ→0} [J(w_1 + ϵ, w_2) − J(w_1, w_2)] / ϵ

= the derivative when the other variables are fixed.

gradient: the vector of all partial derivatives

∇J(w) = [∂J/∂w_1, ⋯, ∂J/∂w_D]^⊤

We can estimate this numerically if needed (use a small ϵ in the formula above).
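This numerical estimate can be sketched in a few lines of NumPy; the test function J here is a made-up example, not one from the slides:

```python
import numpy as np

def numerical_gradient(J, w, eps=1e-6):
    """Estimate the gradient of J at w by finite differences:
    each partial derivative is (J(w + eps*e_d) - J(w)) / eps."""
    grad = np.zeros_like(w)
    for d in range(len(w)):
        w_step = w.copy()
        w_step[d] += eps          # perturb only coordinate d
        grad[d] = (J(w_step) - J(w)) / eps
    return grad

# example: J(w1, w2) = w1^2 + 3*w2, so the true gradient is [2*w1, 3]
J = lambda w: w[0]**2 + 3*w[1]
print(numerical_gradient(J, np.array([1.0, 5.0])))  # approx [2., 3.]
```

This is useful as a sanity check of an analytically derived gradient, though it costs one extra function evaluation per dimension.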
Recall
Gradient descent is an iterative algorithm for optimization:
- starts from some w^{0}
- update using the gradient: w^{t+1} ← w^{t} − α∇J(w^{t})
- converges to a local minimum
α is the learning rate and J is the cost function (for maximization: the objective function).
image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
steepest descent direction:
∇J(w) = [∂J/∂w_1, ⋯, ∂J/∂w_D]^⊤ (new notation!)
A convex subset of R^N intersects any line in at most one line segment.
[figure: a convex vs. a non-convex set]
A convex function is a function for which the epigraph is a convex set.
epigraph: the set of all points above the graph

f(λw + (1 − λ)w′) ≤ λf(w) + (1 − λ)f(w′)   for 0 < λ < 1
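As a quick numerical sanity check of this definition (f(w) = w² is an assumed example, not from the slides):

```python
import numpy as np

# check f(lam*w + (1-lam)*w2) <= lam*f(w) + (1-lam)*f(w2)
# for the convex function f(w) = w**2, over random points and lambdas
f = lambda w: w**2
rng = np.random.default_rng(0)
for _ in range(1000):
    w, w2 = rng.normal(size=2) * 10
    lam = rng.uniform()
    assert f(lam*w + (1-lam)*w2) <= lam*f(w) + (1-lam)*f(w2) + 1e-9
print("convexity inequality holds for f(w) = w**2")
```

Passing such a randomized check does not prove convexity, but a single violation disproves it.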
Convex functions are easier to minimize:
- critical points are global minima
- gradient descent can find them: w^{t+1} ← w^{t} − α∇J(w^{t})
image: https://www.willamette.edu/~gorr/classes/cs449/momrate.html
convex vs. non-convex: for non-convex functions, gradient descent may only find a local optimum.
A concave function is the negative of a convex function (easy to maximize).
A linear function is convex: f(x) = w^⊤x.
A function is convex if its second derivative is non-negative everywhere: d²f/dx² ≥ 0 ∀x.
examples: x², eˣ, −log(x), −√x
a constant function is convex f(x) = c
A sum of convex functions is convex.
example (sum of squared errors):

J(w) = ∣∣Xw − y∣∣² = Σ_n (w^⊤x^(n) − y^(n))²
A maximum of convex functions is convex.
example: f(y) = max_{x∈[0,3]} x(3 − x)y⁴ = (9/4)y⁴
(note: this is not convex in x, yet the maximum is convex in y)
A composition of convex functions is generally not convex.
example: (−log(x))²
However, if f and g are convex and g is non-decreasing, then g(f(x)) is convex.
example: e^{f(x)} for convex f
Is the logistic regression cost function convex in the model parameters (w)?

J(w) = Σ_n^N y^(n) log(1 + e^{−w^⊤x^(n)}) + (1 − y^(n)) log(1 + e^{w^⊤x^(n)})

- w^⊤x^(n) is linear in w
- log(1 + e^z) is convex in z; checking that the second derivative is non-negative:
  ∂²/∂z² log(1 + e^z) = e^{−z}/(1 + e^{−z})² ≥ 0
- the same argument applies to the second term
- a sum of convex functions is convex, so J is convex
The gradient, in both cases:

∇J(w) = Σ_n x^(n)(ŷ^(n) − y^(n)) = X^⊤(ŷ − y)

- linear regression: ŷ = w^⊤x
- logistic regression: ŷ = σ(w^⊤x)
Compared to the direct solution for linear regression (recall), gradient descent can be much faster for large D: each gradient step costs O(ND) (two matrix multiplications), while the direct solution costs O(ND² + D³).
import numpy as np

logistic = lambda z: 1 / (1 + np.exp(-z))   # needed by gradient below

def gradient(x, y, w):
    N, D = x.shape
    yh = logistic(np.dot(x, w))
    grad = np.dot(x.T, yh - y) / N
    return grad

def GradientDescent(x,           # N x D
                    y,           # N
                    lr=.01,      # learning rate
                    eps=1e-2,    # termination condition
                    ):
    N, D = x.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(x, y, w)
        w = w - lr*g
    return w
(code on the previous page)
Implementing gradient descent is easy!
Some termination conditions:
- some max #iterations
- small gradient
- a small change in the objective
- increasing error on a validation set: early stopping (one way to avoid overfitting)
example: data (x^(n), −3x^(n) + noise)
using the direct solution method: w = (X^⊤X)^{−1}X^⊤y ≈ −3.2
gradient descent: w^{t+1} ← w^{t} − .01∇J(w^{t}), starting from some w^{0}
after 22 steps: w^{22} ≈ −3.2
[figures: data space, showing the fit y = wx; the cost function J(w)]
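A short script can reproduce an example of this kind (the synthetic data, learning rate, and step count below are assumptions for illustration, not the slide's exact settings):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, N)
y = -3 * x + 0.1 * rng.normal(size=N)     # data (x, -3x + noise)

X = x[:, None]                            # N x 1 design matrix

# direct solution: w = (X^T X)^{-1} X^T y
w_direct = np.linalg.solve(X.T @ X, X.T @ y)[0]

# gradient descent on J(w) = (1/2N) sum (w x - y)^2
w = 0.0
lr = 0.5
for t in range(200):
    grad = np.mean((w * x - y) * x)       # dJ/dw
    w -= lr * grad

print(w_direct, w)  # both close to -3
```

Both estimates solve the same normal equation, so gradient descent converges to the direct solution.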
Learning rate has a significant effect on GD too small: may take a long time to converge too large: it overshoots
example linear regression, D=2, 50 gradient steps
We can write the cost function as an average over instances:

J(w) = (1/N) Σ_{n=1}^N J_n(w)

The same is true for the partial derivatives:

∂J/∂w_j = (1/N) Σ_{n=1}^N ∂J_n/∂w_j

J_n is the cost for a single data-point, e.g. for linear regression:

J_n(w) = ½(w^⊤x^(n) − y^(n))²

therefore ∇J(w) = E_D[∇J_n(w)], the expectation over a data point drawn uniformly from the dataset D.
[figure: contour plot of the cost function + batch gradient updates]
batch gradient update: w ← w − α∇J(w)
with a small learning rate: guaranteed improvement at each step
image:https://jaykanidan.wordpress.com
Using the stochastic gradient: w ← w − α∇J_n(w)
- the steps are "on average" in the right direction
- each step uses the gradient of a different single-example cost J_n(w)
- each update is (1/N) of the cost of a batch gradient update, e.g., for linear regression O(D) instead of O(ND):

∇J_n(w) = (w^⊤x^(n) − y^(n)) x^(n)

image: https://jaykanidan.wordpress.com
example: logistic regression for the Iris dataset (D=2, α = .1)
[figures: batch gradient vs. stochastic gradient]
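A minimal stochastic-gradient loop for logistic regression, sketched on synthetic separable data (the Iris figures are not reproduced; the data and settings below are assumptions):

```python
import numpy as np

logistic = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
N, D = 200, 2
x = rng.normal(size=(N, D))
y = (x @ np.array([2.0, -1.0]) > 0).astype(float)  # labels from a known w

w = np.zeros(D)
lr = 0.1
for t in range(5000):
    n = rng.integers(N)                  # pick one random example
    yh = logistic(x[n] @ w)
    w -= lr * (yh - y[n]) * x[n]         # gradient of the single-example cost

acc = np.mean((logistic(x @ w) > 0.5) == (y == 1))
print(acc)
```

Each step touches a single example, so an update is O(D) regardless of the dataset size.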
Stochastic gradients are not zero even at the optimum w^* — how do we guarantee convergence?
idea: schedule a smaller learning rate over time, e.g., α^{t} = 10/t
Robbins–Monro: the learning-rate sequence we use should satisfy

Σ_{t=0}^∞ α^{t} = ∞   (so the iterates can cover the initial distance ∣∣w^{0} − w^*∣∣, however large)

Σ_{t=0}^∞ (α^{t})² < ∞   (the steps should go to zero)
Use a minibatch B ⊆ {1, …, N}, a subset of the dataset, to produce gradient estimates:

∇J_B(w) = (1/∣B∣) Σ_{n∈B} ∇J_n(w)

GD uses the full batch.
[figures: SGD minibatch-size=16 vs. SGD minibatch-size=1]
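Forming a minibatch gradient estimate can be sketched as follows, assuming the linear-regression cost from earlier (the synthetic data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = rng.normal(size=D)                   # some current parameter vector

def full_gradient(w):
    return X.T @ (X @ w - y) / N

def minibatch_gradient(w, batch_size=16):
    B = rng.choice(N, size=batch_size, replace=False)  # B ⊆ {1, ..., N}
    return X[B].T @ (X[B] @ w - y[B]) / batch_size

# the minibatch gradient is an unbiased but noisy estimate of the full one;
# averaging many minibatch estimates recovers the full gradient
est = np.mean([minibatch_gradient(w) for _ in range(2000)], axis=0)
print(np.round(est, 2), np.round(full_gradient(w), 2))
```

Larger minibatches reduce the variance of the estimate at a proportionally larger per-step cost.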
Gradient descent can oscillate a lot! Each gradient step is perpendicular to the isocontours; in SGD this is worsened by the noisy gradient estimate.
To help with oscillations, use a running average of gradients, where more recent gradients have higher weights:

Δw^{t} ← βΔw^{t−1} + (1 − β)∇J_B(w^{t−1})
w^{t} ← w^{t−1} − αΔw^{t}

A momentum of β = 0 reduces to SGD; a common value is β ≥ .9.
There are other variations of momentum with a similar idea.
Δw is effectively an exponential moving average:
Δw^{T} = Σ_{t=1}^{T} β^{T−t}(1 − β)∇J_B(w^{t−1})

weight for the most recent gradient (t = T): (1 − β)
weight for the oldest gradient (t = 1): (1 − β)β^{T−1}
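The recursion and its exponential-moving-average expansion can be checked against each other numerically (random vectors stand in for the minibatch gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, D = 0.9, 50, 4
g = rng.normal(size=(T + 1, D))          # g[t] stands in for grad J_B at step t

# recursive form: dw^{t} = beta * dw^{t-1} + (1 - beta) * g[t]
dw = np.zeros(D)
for t in range(1, T + 1):
    dw = beta * dw + (1 - beta) * g[t]

# explicit form: dw^{T} = sum_t beta^{T-t} * (1 - beta) * g[t]
dw_explicit = sum(beta**(T - t) * (1 - beta) * g[t] for t in range(1, T + 1))

print(np.allclose(dw, dw_explicit))  # True
```

Unrolling the recursion from Δw^{0} = 0 gives exactly the weighted sum, confirming the stated weights for the newest and oldest gradients.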
Example: logistic regression
- α = .5, β = 0, ∣B∣ = 8 (no momentum)
- α = .5, β = .99, ∣B∣ = 8 (with momentum)
See the beautiful demo at Distill: https://distill.pub/2017/momentum/
Use a different learning rate for each parameter, and make the learning rate adaptive:

S_d^{t} ← S_d^{t−1} + (∂J(w^{t−1})/∂w_d)²   (sum of squared derivatives over all iterations so far, for each individual parameter)

w_d^{t} ← w_d^{t−1} − (α/√(S_d^{t} + ϵ)) ∂J(w^{t−1})/∂w_d

The learning rate is adapted to previous updates; ϵ is there to avoid numerical issues (division by zero).
Useful when parameters are updated at different rates (e.g., when some features are often zero when using SGD).
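A sketch of this per-parameter update on an assumed toy quadratic whose two coordinates have very different curvatures:

```python
import numpy as np

# minimize J(w) = 0.5 * (10*w_0^2 + 0.1*w_1^2): very different curvatures
curv = np.array([10.0, 0.1])
grad = lambda w: curv * w

w = np.array([1.0, 1.0])
S = np.zeros(2)                         # running sum of squared derivatives
alpha, eps = 0.5, 1e-8
for t in range(500):
    g = grad(w)
    S += g**2                           # S_d accumulates (dJ/dw_d)^2
    w -= alpha * g / np.sqrt(S + eps)   # per-parameter step size

print(w)  # both coordinates approach 0 despite the curvature mismatch
```

Because each coordinate is normalized by its own gradient history, the steep and the flat direction make comparable progress.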
Problem: with this scheme (Adagrad), the learning rate goes to zero too quickly.
example: SGD vs. Adagrad, α = .1, ∣B∣ = 1, T = 80,000, ϵ = 1e−8
RMSProp (Root Mean Squared propagation) solves the problem of Adagrad's diminishing step-size: use an exponential moving average instead of a sum (similar to momentum):

S^{t} ← γS^{t−1} + (1 − γ)∇J(w^{t−1})²

w^{t} ← w^{t−1} − (α/√(S^{t} + ϵ)) ∇J(w^{t−1})

Otherwise identical to Adagrad. Note that S^{t} is a vector; with an abuse of notation, the square and the square root are element-wise.
Two ideas so far, both using exponential moving averages. Adam combines the two:

M^{t} ← β₁M^{t−1} + (1 − β₁)∇J(w^{t−1})   (identical to the method of momentum: a moving average of the first moment)

S^{t} ← β₂S^{t−1} + (1 − β₂)∇J(w^{t−1})²   (identical to RMSProp: a moving average of the second moment)

w^{t} ← w^{t−1} − α M̂^{t} / (√(Ŝ^{t}) + ϵ)
Since M and S are initialized to zero, at early stages they are biased towards zero. Adam therefore uses bias-corrected estimates:

M̂^{t} = M^{t}/(1 − β₁^t)

Ŝ^{t} = S^{t}/(1 − β₂^t)

w^{t} ← w^{t−1} − α M̂^{t} / (√(Ŝ^{t}) + ϵ)

For large time-steps the correction has no effect; for small t, it scales up the estimates.
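Putting the pieces together, an Adam-style loop with bias correction might look like this (the quadratic objective and α are assumptions; β₁ = .9, β₂ = .999 are the commonly recommended defaults):

```python
import numpy as np

grad = lambda w: 2 * w           # gradient of J(w) = ||w||^2

w = np.array([5.0, -3.0])
M = np.zeros(2)                  # first-moment estimate
S = np.zeros(2)                  # second-moment estimate
alpha, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(w)
    M = b1 * M + (1 - b1) * g
    S = b2 * S + (1 - b2) * g**2
    M_hat = M / (1 - b1**t)      # bias correction: scales up early estimates
    S_hat = S / (1 - b2**t)
    w = w - alpha * M_hat / (np.sqrt(S_hat) + eps)

print(w)  # both coordinates shrink toward 0
```

At t = 1 the corrections divide by (1 − β₁) and (1 − β₂), exactly undoing the zero initialization; for large t the denominators approach 1.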
the list of methods is growing ...
image:Alec Radford
logistic regression example
These methods have recommended ranges for their hyper-parameters (learning rate, momentum, etc.), but may still need some hyper-parameter tuning. They are all first-order methods: they only need the first derivative. Second-order methods can be much more effective, but also much more expensive.
Summary:
- learning: optimizing the model parameters (minimizing a cost function)
- gradient descent finds a local minimum; easy to implement (esp. using automatic differentiation); for convex functions it gives the global minimum
- stochastic GD for large datasets: use a mini-batch for a noisy but fast estimate of the gradient; Robbins–Monro condition: reduce the learning rate to help with the noise
- better (stochastic) gradient optimization:
  - momentum: exponential running average to help with the noise
  - Adagrad & RMSProp: per-parameter adaptive learning rate
  - Adam: combining these two ideas