Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent - PowerPoint PPT Presentation



SLIDE 1

Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

1 / 12

SLIDE 2

Announcements

◮ Midterm: Weds, Feb 7th. Policies:

  ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
  ◮ You must turn your sheet of notes in, with your name on it, at the conclusion of the exam, even if you never looked at it.
  ◮ You may not use electronic devices of any sort.

◮ A few comments on the course difficulty
◮ Today:

  New: GD and SGD


SLIDE 3

Course difficulty

Why is it difficult/what should we learn?

◮ homeworks
◮ exams
◮ grading


SLIDE 4

Review


SLIDE 5

Gradient Descent: Convergence

◮ Denote:

  z∗ = argmin_z F(z): the global minimum
  z(k): our parameter after k updates.

◮ Thm: Suppose F is convex and “L-smooth” (e.g. this works for the square loss and the logistic loss). Using a fixed step size η ≤ 1/L, we have:

  F(z(k)) − F(z∗) ≤ ‖z(0) − z∗‖² / (η · k)

  That is, the convergence rate is O(1/k).

◮ A constant learning rate means no parameter tuning!
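The O(1/k) guarantee can be checked numerically. The sketch below (not from the slides) runs fixed-step gradient descent on an illustrative convex, L-smooth function F(z) = z² (so L = 2 and z∗ = 0) and records both the gap F(z(k)) − F(z∗) and the theorem's bound at each step; the starting point and step size are arbitrary choices.

```python
def gd_gaps(z0, eta, steps):
    """Run fixed-step GD on F(z) = z^2 and return (gap, bound) pairs,
    where gap = F(z(k)) - F(z*) and bound = ||z(0) - z*||^2 / (eta * k)."""
    z = z0
    out = []
    for k in range(1, steps + 1):
        z = z - eta * (2.0 * z)              # gradient of z^2 is 2z
        out.append((z * z, z0 * z0 / (eta * k)))
    return out
```

Any fixed η ≤ 1/L = 0.5 works here; the gap should sit below the bound at every iteration.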


SLIDE 6

Probabilistic machine learning:

◮ define a probabilistic model relating random variables x to y
◮ estimate its parameters.


SLIDE 7

A Probabilistic Model for Binary Classification: Logistic Regression

◮ For Y ∈ {−1, 1} define pw,b(Y | X) as:

  1. Transform the feature vector x via the “activation” function:

     a = w · x + b

  2. Transform a into a binomial probability by passing it through the logistic function:

     pw,b(Y = +1 | x) = 1 / (1 + exp(−a)) = 1 / (1 + exp(−(w · x + b)))

  [Figure: plot of the logistic function 1/(1 + exp(−a)) for a ∈ [−10, 10].]
◮ If we learn pw,b(Y | x), we can (almost) do whatever we like!
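The two-step recipe above can be sketched directly; `logistic_prob` is an illustrative helper name, and the weights below are arbitrary. Writing the logistic function with −y·a covers both labels at once, since pw,b(+1 | x) + pw,b(−1 | x) = 1.

```python
import math

def logistic_prob(w, b, x, y):
    """p_{w,b}(Y = y | x) for y in {-1, +1}: compute the activation
    a = w . x + b, then pass it through the logistic function."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-y * a))
```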


SLIDE 8

Maximum Likelihood Estimation and the Log loss

The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).

◮ Mathematically: find ŵ that maximizes the probability of the labels y1, . . . , yn given the inputs x1, . . . , xn.
◮ The Maximum Likelihood Estimator (the ’MLE’) is:

  ŵ = argmax_w ∏_{n=1}^{N} pw(yn | xn) = argmin_w ∑_{n=1}^{N} − log pw(yn | xn)


SLIDE 9

The MLE for Logistic Regression

◮ the MLE for the logistic regression model:

  argmin_w ∑_{n=1}^{N} − log pw(yn | xn) = argmin_w ∑_{n=1}^{N} log(1 + exp(−yn w · xn))

◮ This is the logistic loss function that we saw earlier. ◮ How do we find the MLE?
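The equality on this slide is easy to verify term by term: for each example, −log pw(yn | xn) is exactly log(1 + exp(−yn w · xn)). A minimal sketch (helper names are illustrative, bias term omitted as on the slide):

```python
import math

def neg_log_lik(w, x, y):
    """-log p_w(y | x) under the logistic model (no bias term)."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return -math.log(1.0 / (1.0 + math.exp(-y * a)))

def logistic_loss(w, x, y):
    """The logistic loss log(1 + exp(-y * (w . x)))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * a))
```

The two functions agree up to floating-point rounding on any example, which is the per-term content of the identity above.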


SLIDE 10

Derivation for Log loss for Logistic Regression: scratch space


SLIDE 11

Today


SLIDE 12

Linear Regression as a Probabilistic Model

Linear regression defines pw(Y | X) as follows:

  1. Observe the feature vector x; transform it via the activation function:

     µ = w · x

  2. Let µ be the mean of a normal distribution and define the density:

     pw(Y | x) = (1 / (σ√(2π))) exp(−(Y − µ)² / (2σ²))

  3. Sample Y from pw(Y | x).
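The three steps above are a generative story, so they can be sketched as a sampler; `sample_y` is an illustrative name and the weights and σ below are arbitrary.

```python
import random

def sample_y(w, x, sigma, rng):
    """Generate Y from the linear-regression model: compute the
    activation mu = w . x, then draw Y ~ Normal(mu, sigma^2)."""
    mu = sum(wi * xi for wi, xi in zip(w, x))   # step 1: activation
    return rng.gauss(mu, sigma)                  # steps 2-3: sample
```

Averaging many draws recovers µ = w · x, which is what makes µ the model's prediction.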


SLIDE 13

Linear Regression-MLE is (Unregularized) Squared Loss Minimization!

  argmin_w ∑_{n=1}^{N} − log pw(yn | xn) ≡ argmin_w (1/N) ∑_{n=1}^{N} (yn − w · xn)²

  where each summand (yn − w · xn)² is SquaredLossn(w, b).

Where did the variance go? What is GD here?
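The variance question can be answered numerically: the average Gaussian negative log-likelihood differs from the scaled average squared loss only by a constant that does not depend on w, so both objectives have the same minimizer. A sketch with illustrative toy data:

```python
import math

def avg_gaussian_nll(w, xs, ys, sigma):
    """(1/N) sum_n -log p_w(y_n | x_n) for the Gaussian linear model."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = sum(wi * xi for wi, xi in zip(w, x))
        total += 0.5 * math.log(2.0 * math.pi * sigma ** 2) \
                 + (y - mu) ** 2 / (2.0 * sigma ** 2)
    return total / len(xs)

def avg_squared_loss(w, xs, ys):
    """(1/N) sum_n (y_n - w . x_n)^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = sum(wi * xi for wi, xi in zip(w, x))
        total += (y - mu) ** 2
    return total / len(xs)
```

For every w, avg_gaussian_nll(w) = (1/2)log(2πσ²) + avg_squared_loss(w)/(2σ²): σ only shifts and rescales the objective, which is where the variance "went".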


SLIDE 14

Loss Minimization & Gradient Descent

  w∗ = argmin_w (1/N) ∑_{n=1}^{N} ℓ(xn, yn, w) + R(w)

  where ℓn(w) denotes ℓ(xn, yn, w).

What is GD here? What do we do if N is large?
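GD on this objective averages the per-example gradients over all N points at every step, which is exactly what becomes expensive when N is large. A minimal sketch of one full-batch step, assuming the unregularized squared loss (R(w) = 0); `gd_step` is an illustrative name.

```python
def gd_step(w, xs, ys, eta):
    """One full-batch GD step on (1/N) sum_n (y_n - w . x_n)^2.
    Note the O(N * d) cost: every example contributes to the gradient."""
    n, d = len(xs), len(w)
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        residual = y - sum(wi * xi for wi, xi in zip(w, x))
        for j in range(d):
            grad[j] -= 2.0 * residual * x[j] / n   # gradient of the average loss
    return [wj - eta * gj for wj, gj in zip(w, grad)]
```

The inner loop over all of `xs` is the pain point that motivates the stochastic variant on the next slides.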


SLIDE 15

Stochastic Gradient Descent (SGD): by example

  argmin_w (1/N) ∑_{n=1}^{N} (yn − w · xn)²

◮ Gradient descent:
◮ Note we are computing an average. What is a crude way to estimate an average?
◮ Stochastic gradient descent:

Will it converge?


SLIDE 16

Stochastic Gradient Descent (SGD): by example


Will it converge? If the step size in SGD is a constant, we will not converge: each stochastic gradient is noisy, so the iterates keep bouncing around the minimizer rather than settling on it.


SLIDE 17

Stochastic Gradient Descent (SGD) (without regularization)

Data: loss functions ℓ(·), training data, number of iterations K, step sizes η(1), . . . , η(K)
Result: parameters w ∈ Rd

initialize: w(0) = 0;
for k ∈ {1, . . . , K} do
    i ∼ Uniform({1, . . . , N});
    w(k) = w(k−1) − η(k) · ∇w ℓi(w(k−1));
end
return w(K);

Algorithm 1: SGD
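Algorithm 1 translates almost line for line into code. The sketch below uses illustrative names (`sgd`, `grad_loss`, `squared_loss_grad`); `grad_loss` plays the role of ∇w ℓi.

```python
import random

def sgd(grad_loss, data, K, etas, d, seed=0):
    """Algorithm 1 (SGD): at each step draw one example uniformly at
    random and take a gradient step on that example's loss alone."""
    rng = random.Random(seed)
    w = [0.0] * d                                 # w^(0) = 0
    for k in range(1, K + 1):
        x, y = data[rng.randrange(len(data))]     # i ~ Uniform({1,...,N})
        g = grad_loss(w, x, y)                    # grad_w l_i(w^(k-1))
        w = [wj - etas[k - 1] * gj for wj, gj in zip(w, g)]
    return w                                      # w^(K)

def squared_loss_grad(w, x, y):
    """Gradient of (y - w . x)^2 with respect to w."""
    residual = y - sum(wi * xi for wi, xi in zip(w, x))
    return [-2.0 * residual * xi for xi in x]
```

Each step costs O(d) rather than O(N · d). In the toy run below the data is noiseless (every example agrees on the same w), so even a constant step size happens to converge; with noisy data the constant-step caveat from the previous slide applies.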


SLIDE 18

Stochastic Gradient Descent: Convergence

  w∗ = argmin_w (1/N) ∑_{n=1}^{N} ℓn(w)

◮ w(k): our parameter after k updates.
◮ Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There exists a way to decrease our step sizes η(k) over time so that our function value F(w(k)) will converge to the minimal function value F(w∗).
◮ This Thm is different from GD in that we need to turn down our step sizes over time!
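One common decreasing schedule consistent with the theorem is η(k) = η0/√k (this particular choice is an illustration, not taken from the slide). The sketch below runs SGD with that schedule on a noisy 1-D squared-loss problem whose examples are an equal mix of y = −1 and y = +1, so w∗ = 0; all names and constants are illustrative.

```python
import random

def sgd_decaying(K, eta0=0.5, seed=0):
    """SGD on (1/N) sum_n (w - y_n)^2 with steps eta(k) = eta0 / sqrt(k),
    where the y_n are an equal mix of -1 and +1 (so w* = 0)."""
    rng = random.Random(seed)
    ys = [-1.0, 1.0]
    w = 0.9                                   # a deliberately bad start
    for k in range(1, K + 1):
        y = ys[rng.randrange(2)]              # sample one example
        eta = eta0 / k ** 0.5                 # decreasing step size
        w = w - eta * 2.0 * (w - y)           # gradient of (w - y)^2 is 2(w - y)
    return w
```

Because each update is a convex combination of w and the sampled label (for η ≤ 0.5), the iterate stays in [−1, 1] while the shrinking steps damp the noise over time.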
