
Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent (PowerPoint presentation)



  1. Machine Learning (CSE 446): Gradient Descent and Stochastic Gradient Descent
     Sham M Kakade © 2018 University of Washington
     cse446-staff@cs.washington.edu

  2. Announcements
     ◮ Midterm: Weds, Feb 7th. Policies:
       ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
       ◮ You must turn in your sheet of notes, with your name on it, at the conclusion of the exam, even if you never looked at it.
       ◮ You may not use electronic devices of any sort.
     ◮ A few comments on the course difficulty.
     ◮ Today: New: GD and SGD

  3. Course difficulty
     Why is it difficult / what should we learn?
     ◮ homeworks
     ◮ exams
     ◮ grading

  4. Review

  5. Gradient Descent: Convergence
     ◮ Denote: z* = argmin_z F(z), the global minimum; z^(k), our parameter after k updates.
     ◮ Thm: Suppose F is convex and "L-smooth" (e.g. this holds for the square loss and the logistic loss). Using a fixed step size η ≤ 1/L, we have:

         F(z^(k)) − F(z*) ≤ ‖z^(0) − z*‖² / (η · k)

       That is, the convergence rate is O(1/k).
     ◮ A constant learning rate means no parameter tuning!
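As a minimal sketch of the theorem's setting (not from the slides): gradient descent on the average squared loss, with the fixed step size η = 1/L computed from the smoothness constant of F. The data and function names here are illustrative.

```python
import numpy as np

def gradient_descent(X, y, num_iters=500):
    """Fixed-step GD on F(w) = (1/N) * sum_n (y_n - w·x_n)^2."""
    N, d = X.shape
    # F is L-smooth with L = 2 * lambda_max(X^T X / N); use the fixed step eta = 1/L.
    L = 2.0 * np.linalg.eigvalsh(X.T @ X / N).max()
    eta = 1.0 / L
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)  # gradient of the average squared loss
        w = w - eta * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless targets, so z* = w_true
w_hat = gradient_descent(X, y)
```

With a constant step size of 1/L, no learning-rate tuning is needed, matching the last bullet above.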

  6. Probabilistic machine learning:
     ◮ define a probabilistic model relating random variables x to y
     ◮ estimate its parameters

  7. A Probabilistic Model for Binary Classification: Logistic Regression
     ◮ For Y ∈ {−1, +1}, define p_{w,b}(Y | x) as follows:
       1. Transform the feature vector x via the "activation" function: a = w · x + b
       2. Transform a into a binomial probability by passing it through the logistic function:

            p_{w,b}(Y = +1 | x) = 1 / (1 + exp(−a)) = 1 / (1 + exp(−(w · x + b)))

     [plot of the logistic function: an S-shaped curve rising from 0 to 1 over a ∈ [−10, 10]]
     ◮ If we learn p_{w,b}(Y | x), we can (almost) do whatever we like!
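The two-step transformation above can be sketched directly; the function name is mine, not from the slides. Writing the probability as 1/(1 + exp(−y·a)) covers both labels at once, since p(Y = −1 | x) = 1 − p(Y = +1 | x).

```python
import math

def logistic_prob(w, b, x, y):
    """p_{w,b}(Y = y | x) for y in {-1, +1}."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b  # activation a = w·x + b
    # logistic function applied to y*a handles both classes:
    # y=+1 gives 1/(1+exp(-a)); y=-1 gives 1/(1+exp(a)) = 1 - p(Y=+1|x)
    return 1.0 / (1.0 + math.exp(-y * a))
```

At a = 0 the model is maximally uncertain: both classes get probability 0.5.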

  8. Maximum Likelihood Estimation and the Log Loss
     The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).
     ◮ Mathematically: find ŵ that maximizes the probability of the labels y_1, ..., y_N given the inputs x_1, ..., x_N.
     ◮ The maximum likelihood estimator (the "MLE") is:

         ŵ = argmax_w ∏_{n=1}^N p_w(y_n | x_n) = argmin_w −∑_{n=1}^N log p_w(y_n | x_n)

  9. The MLE for Logistic Regression
     ◮ The MLE for the logistic regression model:

         argmin_w −∑_{n=1}^N log p_w(y_n | x_n) = argmin_w ∑_{n=1}^N log(1 + exp(−y_n w · x_n))

     ◮ This is the logistic loss function that we saw earlier.
     ◮ How do we find the MLE?
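A quick numerical check of this identity (function names are illustrative): the per-example negative log-likelihood under the logistic model equals the per-example logistic loss, since −log(1/(1 + exp(−y·a))) = log(1 + exp(−y·a)).

```python
import math

def neg_log_likelihood(w, x, y):
    """-log p_w(y | x) with p_w(Y = y | x) = 1/(1 + exp(-y * w·x))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-y * a))
    return -math.log(p)

def logistic_loss(w, x, y):
    """log(1 + exp(-y * w·x)), the per-example logistic loss."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * a))
```

Summing either quantity over the data gives the same objective, so minimizing the logistic loss is exactly maximum likelihood estimation.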

  10. Derivation of the log loss for logistic regression: scratch space

  11. Today

  12. Linear Regression as a Probabilistic Model
      Linear regression defines p_w(Y | x) as follows:
      1. Observe the feature vector x; transform it via the activation function: µ = w · x
      2. Let µ be the mean of a normal distribution and define the density:

           p_w(Y | x) = (1 / (σ√(2π))) exp(−(Y − µ)² / (2σ²))

      3. Sample Y from p_w(Y | x).

  13. Linear Regression: the MLE is (Unregularized) Squared Loss Minimization!

        argmin_w −∑_{n=1}^N log p_w(y_n | x_n) ≡ argmin_w (1/N) ∑_{n=1}^N (y_n − w · x_n)²

      where (y_n − w · x_n)² is SquaredLoss_n(w, b).
      Where did the variance go? What is GD here?
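To see where the variance went: σ only rescales the gradient of the negative log-likelihood by 1/σ², so it cannot change where the minimum sits. A small numerical check, with assumed (illustrative) names:

```python
import numpy as np

def gaussian_nll_grad(w, X, y, sigma):
    """Gradient w.r.t. w of sum_n -log p_w(y_n | x_n) under the normal model."""
    # -log p_w(y|x) = log(sigma * sqrt(2*pi)) + (y - w·x)^2 / (2 * sigma^2),
    # so only the squared-error term depends on w.
    return X.T @ (X @ w - y) / sigma**2

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
g1 = gaussian_nll_grad(w, X, y, sigma=1.0)
g2 = gaussian_nll_grad(w, X, y, sigma=2.0)
# g2 is g1 scaled by 1/4: same zeros, hence the same minimizer.
```

For GD, the 1/σ² factor can simply be absorbed into the step size.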

  14. Loss Minimization & Gradient Descent

        w* = argmin_w (1/N) ∑_{n=1}^N ℓ(x_n, y_n, w) + R(w),   with ℓ_n(w) := ℓ(x_n, y_n, w)

      What is GD here? What do we do if N is large?

  15. Stochastic Gradient Descent (SGD): by example

        argmin_w (1/N) ∑_{n=1}^N (y_n − w · x_n)²

      ◮ Gradient descent:
      ◮ Note we are computing an average. What is a crude way to estimate an average?
      ◮ Stochastic gradient descent: Will it converge?

  16. Stochastic Gradient Descent (SGD): by example

        argmin_w (1/N) ∑_{n=1}^N (y_n − w · x_n)²

      ◮ Gradient descent:
      ◮ Note we are computing an average. What is a crude way to estimate an average?
      ◮ Stochastic gradient descent: Will it converge?
        If the step size in SGD is a constant, we will not converge.

  17. Stochastic Gradient Descent (SGD) (without regularization)
      Data: loss functions ℓ_n(·), training data, number of iterations K, step sizes η^(1), ..., η^(K)
      Result: parameters w ∈ R^d
      initialize: w^(0) = 0;
      for k ∈ {1, ..., K} do
          i ~ Uniform({1, ..., N});
          w^(k) = w^(k−1) − η^(k) · ∇_w ℓ_i(w^(k−1));
      end
      return w^(K);
      Algorithm 1: SGD
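Algorithm 1 can be instantiated, as a sketch, with the squared loss from the earlier slides. The decreasing schedule η^(k) = 0.05/√k is one assumed choice, not prescribed by the slides; per the previous slide, a constant step size would not converge.

```python
import numpy as np

def sgd(X, y, K=20000, seed=0):
    """SGD (Algorithm 1) on ell_i(w) = (y_i - w·x_i)^2, no regularization."""
    N, d = X.shape
    w = np.zeros(d)                                # initialize: w^(0) = 0
    rng = np.random.default_rng(seed)
    for k in range(1, K + 1):
        i = rng.integers(N)                        # i ~ Uniform({1, ..., N})
        grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]    # ∇_w ell_i(w^(k-1)), a single-example gradient
        w = w - (0.05 / k**0.5) * grad_i           # decreasing step size η^(k) (assumed schedule)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                     # noiseless targets, so w* = w_true
w_hat = sgd(X, y)
```

Each iteration touches one example rather than all N, which is exactly what makes SGD attractive when N is large.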

  18. Stochastic Gradient Descent: Convergence

        w* = argmin_w (1/N) ∑_{n=1}^N ℓ_n(w)

      ◮ w^(k): our parameter after k updates.
      ◮ Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There exists a way to decrease our step sizes η^(k) over time so that our function value F(w^(k)) converges to the minimal function value F(w*).
      ◮ This theorem differs from the GD result in that we need to turn down our step sizes over time!
