
Machine Learning (CSE 446): Probabilistic View of Logistic Regression and Linear Regression - PowerPoint PPT Presentation



  1. Machine Learning (CSE 446): Probabilistic View of Logistic Regression and Linear Regression
     Sham M Kakade
     © 2018 University of Washington
     cse446-staff@cs.washington.edu

  2. Announcements
     ◮ Midterm: Weds, Feb 7th. Policies:
       ◮ You may use a single side of a single sheet of handwritten notes that you prepared.
       ◮ You must turn in your sheet of notes, with your name on it, at the conclusion of the exam, even if you never looked at it.
       ◮ You may not use electronic devices of any sort.
     ◮ Today:
       Review: Regularization and Optimization
       New: (wrap up GD) + probabilistic modeling!

  3. Review

  4. Regularization / Ridge Regression
     ◮ Regularize the optimization problem:
       $\min_w \frac{1}{N}\sum_{n=1}^{N} (y_n - w \cdot x_n)^2 + \lambda \|w\|^2 \;=\; \min_w \frac{1}{N} \|Y - X^\top w\|^2 + \lambda \|w\|^2$
     ◮ This particular case: "Ridge" Regression, Tikhonov regularization.
     ◮ The solution is the least squares estimator:
       $w_{\text{least squares}} = \left(\frac{1}{N} X^\top X + \lambda I\right)^{-1} \left(\frac{1}{N} X^\top Y\right)$
     ◮ Regularization is often necessary for the "exact" solution method, regardless of whether d is bigger or smaller than N.
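     A minimal NumPy sketch of the closed-form ridge estimator above; the data matrix X, targets Y, and regularization strength lam below are made-up placeholders for illustration, not anything from the course.

        import numpy as np

        def ridge_solution(X, Y, lam):
            """Closed-form ridge estimator: (X^T X / N + lam I)^{-1} (X^T Y / N)."""
            N, d = X.shape
            A = X.T @ X / N + lam * np.eye(d)
            b = X.T @ Y / N
            return np.linalg.solve(A, b)

        # Tiny synthetic example (assumed data, for illustration only).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5))
        Y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)
        print(ridge_solution(X, Y, lam=0.1))   # close to [1, 2, 3, 4, 5]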

  5. Gradient Descent
     ◮ Want to solve: $\min_z F(z)$
     ◮ How should we update z?

  6. Gradient Descent
     Data: function $F: \mathbb{R}^d \to \mathbb{R}$, number of iterations K, step sizes $\eta^{(1)}, \ldots, \eta^{(K)}$
     Result: $z \in \mathbb{R}^d$
     initialize: $z^{(0)} = 0$;
     for $k \in \{1, \ldots, K\}$ do
       $z^{(k)} = z^{(k-1)} - \eta^{(k)} \cdot \nabla_z F(z^{(k-1)})$;
     end
     return $z^{(K)}$;
     Algorithm 1: GradientDescent
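     A direct Python sketch of Algorithm 1; the gradient function grad_F, the iteration count K, and the step sizes below are illustrative choices.

        import numpy as np

        def gradient_descent(grad_F, d, K, step_sizes):
            """Algorithm 1: start at z^(0) = 0 and take K gradient steps."""
            z = np.zeros(d)
            for k in range(K):
                z = z - step_sizes[k] * grad_F(z)   # z^(k) = z^(k-1) - eta^(k) grad F(z^(k-1))
            return z

        # Example: minimize F(z) = ||z - c||^2, whose gradient is 2 (z - c).
        c = np.array([1.0, -2.0, 3.0])
        z_hat = gradient_descent(lambda z: 2 * (z - c), d=3, K=200, step_sizes=[0.1] * 200)
        print(z_hat)   # approaches c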

  7. Today

  8. Gradient Descent: Convergence
     ◮ Denote: $z^* = \operatorname{argmin}_z F(z)$, the global minimum; $z^{(k)}$, our parameter after k updates.
     ◮ Thm: Suppose F is convex and "L-smooth". Using a fixed step size $\eta \le \frac{1}{L}$, we have:
       $F(z^{(k)}) - F(z^*) \le \frac{\|z^{(0)} - z^*\|^2}{\eta \cdot k}$
       That is, the convergence rate is $O(\frac{1}{k})$.
     ◮ This Thm applies to both the square loss and logistic loss!
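     A small numerical sanity check of this bound, assuming the square loss on synthetic data; L is taken to be the largest eigenvalue of the Hessian $\frac{2}{N} X^\top X$, and the data below are made up for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(50, 3))
        Y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)

        F = lambda w: np.mean((Y - X @ w) ** 2)                 # square loss
        grad = lambda w: -2.0 / len(Y) * (X.T @ (Y - X @ w))    # its gradient

        L = np.linalg.eigvalsh(2.0 / len(Y) * X.T @ X).max()    # smoothness constant
        eta = 1.0 / L
        w_star = np.linalg.solve(X.T @ X, X.T @ Y)              # global minimizer

        w = np.zeros(3)                                         # z^(0) = 0
        for k in range(1, 101):
            w = w - eta * grad(w)
            bound = np.sum(w_star ** 2) / (eta * k)             # ||z^(0) - z*||^2 / (eta k)
            assert F(w) - F(w_star) <= bound + 1e-12
        print("O(1/k) bound held for k = 1..100")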

  9. Proof intuition: smoothness and GD Convergence
     ◮ L-smooth functions: "The gradients don't change quickly." Precisely, for all z, z′:
       $\|\nabla F(z) - \nabla F(z')\| \le L \|z - z'\|$
     ◮ Proof idea:
       1. If our gradient is large, we will make good progress decreasing our function value.
       2. If our gradient is small, we must have value near the optimal value.

  10. A better idea?
     ◮ Remember the Bayes optimal classifier. D(x, y) is the true probability of (x, y).
       $f^{(\mathrm{BO})}(x) = \operatorname{argmax}_y D(x, y) = \operatorname{argmax}_y D(y \mid x)$
     ◮ Of course, we don't have D(y | x). Probabilistic machine learning: define a probabilistic model relating random variables x to y and estimate its parameters.

  11. A Probabilistic Model for Binary Classification: Logistic Regression
     ◮ For Y ∈ {−1, +1}, define $p_{w,b}(Y \mid X)$ as:
       1. Transform the feature vector x via the "activation" function: $a = w \cdot x + b$
       2. Transform a into a binomial probability by passing it through the logistic function:
          $p_{w,b}(Y = +1 \mid x) = \frac{1}{1 + \exp(-a)} = \frac{1}{1 + \exp(-(w \cdot x + b))}$
     [Plot: the logistic (sigmoid) function, rising from 0 to 1 over the range −10 to 10.]
     ◮ If we learn $p_{w,b}(Y \mid x)$, we can (almost) do whatever we like!
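     A minimal sketch of steps 1 and 2 above; the weights w, bias b, and input x are made-up values.

        import numpy as np

        def p_plus(x, w, b):
            """p_{w,b}(Y = +1 | x) = 1 / (1 + exp(-(w . x + b)))."""
            a = np.dot(w, x) + b                 # step 1: activation
            return 1.0 / (1.0 + np.exp(-a))      # step 2: logistic function

        w = np.array([0.5, -1.0])    # assumed weights
        b = 0.2                      # assumed bias
        x = np.array([2.0, 1.0])     # assumed feature vector
        p = p_plus(x, w, b)
        print(p, 1.0 - p)            # P(Y = +1 | x) and P(Y = -1 | x)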

  12. Maximum Likelihood Estimation
     The principle of maximum likelihood estimation is to choose our parameters to make our observed data as likely as possible (under our model).
     ◮ Mathematically: find $\hat{w}$ that maximizes the probability of the labels $y_1, \ldots, y_N$ given the inputs $x_1, \ldots, x_N$.
     ◮ Note, by the i.i.d. assumption: $D(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{n=1}^{N} D(y_n \mid x_n)$
     ◮ The Maximum Likelihood Estimator (the 'MLE') is:
       $\hat{w} = \operatorname{argmax}_w \prod_{n=1}^{N} p_w(y_n \mid x_n)$

  13. Maximum Likelihood Estimation and the Log Loss
     ◮ The 'MLE' is:
       $\hat{w} = \operatorname{argmax}_w \prod_{n=1}^{N} p_w(y_n \mid x_n)$
       $= \operatorname{argmax}_w \log \prod_{n=1}^{N} p_w(y_n \mid x_n)$
       $= \operatorname{argmax}_w \sum_{n=1}^{N} \log p_w(y_n \mid x_n)$
       $= \operatorname{argmin}_w \sum_{n=1}^{N} -\log p_w(y_n \mid x_n)$
     ◮ This is referred to as the log loss.
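     A minimal sketch of the log loss for the logistic model (with y ∈ {−1, +1} and the bias folded into w); the synthetic data are assumptions for illustration.

        import numpy as np

        def log_loss(w, X, y):
            """sum_n -log p_w(y_n | x_n) for the logistic model, y_n in {-1, +1}."""
            probs = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # p_w(y_n | x_n)
            return -np.sum(np.log(probs))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 3))
        y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20))
        print(log_loss(np.zeros(3), X, y))   # equals 20 * log 2 at w = 0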

  14. The MLE for Logistic Regression
     ◮ The MLE for the logistic regression model:
       $\operatorname{argmin}_w \sum_{n=1}^{N} -\log p_w(y_n \mid x_n) = \operatorname{argmin}_w \sum_{n=1}^{N} \log\left(1 + \exp(-y_n\, w \cdot x_n)\right)$
     ◮ This is the logistic loss function that we saw earlier.
     ◮ How do we find the MLE?
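     One answer to "how do we find the MLE?" is the gradient descent algorithm from earlier in the lecture. Below is a hedged sketch of GD on the logistic loss; the step size, iteration count, and synthetic data are all assumptions.

        import numpy as np

        def sigmoid(t):
            return 1.0 / (1.0 + np.exp(-t))

        def logistic_loss_grad(w, X, y):
            """Gradient of sum_n log(1 + exp(-y_n w . x_n)) with respect to w."""
            return -(X.T @ (y * sigmoid(-y * (X @ w))))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 3))
        y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200))

        w = np.zeros(3)
        for _ in range(500):
            w = w - 0.01 * logistic_loss_grad(w, X, y)   # one gradient descent step
        print(w)   # approximate MLE for this synthetic data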

  15. Derivation of the Log Loss for Logistic Regression: scratch space

  16. Linear Regression as a Probabilistic Model
     Linear regression defines $p_w(Y \mid X)$ as follows:
     1. Observe the feature vector x; transform it via the activation function: $\mu = w \cdot x$
     2. Let µ be the mean of a normal distribution and define the density:
        $p_w(Y \mid x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(Y - \mu)^2}{2\sigma^2}\right)$
     3. Sample Y from $p_w(Y \mid x)$.
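     A minimal sketch of this generative story: compute µ = w · x, then sample Y from a normal with mean µ. The values of w, x, and σ are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        w = np.array([2.0, -1.0])   # assumed weights
        x = np.array([1.5, 0.5])    # assumed feature vector
        sigma = 0.3                 # assumed noise standard deviation

        mu = np.dot(w, x)                        # step 1: activation / mean
        Y = rng.normal(loc=mu, scale=sigma)      # step 3: one draw from p_w(Y | x)
        density = np.exp(-(Y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        print(mu, Y, density)                    # step 2's density evaluated at the draw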

  17. Linear Regression MLE is (Unregularized) Squared Loss Minimization!
     $\operatorname{argmin}_w \sum_{n=1}^{N} -\log p_w(y_n \mid x_n) \;\equiv\; \operatorname{argmin}_w \frac{1}{N} \sum_{n=1}^{N} \underbrace{(y_n - w \cdot x_n)^2}_{\text{SquaredLoss}_n(w, b)}$
     Where did the variance go?
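     Expanding the negative log-likelihood of the Gaussian model above makes the answer explicit:
       $-\log p_w(y_n \mid x_n) = \frac{(y_n - w \cdot x_n)^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right)$
     Summed over n, this is the squared loss scaled by the positive constant $\frac{1}{2\sigma^2}$ plus a term that does not depend on w; neither changes the argmin, so the variance only rescales (and shifts) the objective.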
