
Machine Learning (CSE 446): Probabilistic Machine Learning, MLE & MAP



  1. Machine Learning (CSE 446): Probabilistic Machine Learning, MLE & MAP. Sham M Kakade, © 2018 University of Washington. cse446-staff@cs.washington.edu

  2. Announcements
  ◮ Homeworks
    ◮ HW 3 posted. Get the most recent version.
    ◮ You must do the regular problems before you can get any extra credit.
    ◮ Extra credit is factored in after your scores are averaged together.
  ◮ Office hours today: 3-4pm
  ◮ Today:
    ◮ Review
    ◮ Probabilistic methods

  3. Review

  4. SGD: How do we set the step sizes?
  ◮ Theory: if you decay the step sizes according to some prescribed schedule, then SGD will converge to the right answer. The "classical" theory doesn't provide enough practical guidance.
  ◮ Practice:
    ◮ Starting step size: start it "large". If it is "too large", you either diverge or nothing improves; then back it off a little (say, to 1/4 of that value).
    ◮ When do we decay it? When your training error stops decreasing "enough".
    ◮ HW: you'll need to tune it a little. (A slower approach: sometimes you can just start it somewhat smaller than the "divergent" value and you will find something reasonable.)
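A minimal sketch of how these heuristics can be wired together, assuming a generic grad_fn/loss_fn interface; the function name and the default knobs (step0, decay, patience, tol) are illustrative, not from the lecture:

```python
import numpy as np

def sgd_with_decay(grad_fn, loss_fn, w0, data, step0=1.0, decay=0.25,
                   patience=2, epochs=50, tol=1e-4):
    """Plain SGD that shrinks the step size when training error stalls.

    grad_fn(w, example) and loss_fn(w, example) are placeholders for your
    model; the default knobs are illustrative, not prescribed values.
    """
    w, step = np.array(w0, dtype=float), step0
    best_loss, stalled = np.inf, 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            w -= step * grad_fn(w, data[i])              # one SGD update
        loss = np.mean([loss_fn(w, ex) for ex in data])  # training error
        if loss < best_loss - tol:                       # still decreasing "enough"
            best_loss, stalled = loss, 0
        else:
            stalled += 1
            if stalled >= patience:                      # progress stalled: decay
                step *= decay
                stalled = 0
    return w
```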

  5. SGD: How do we set the mini-batch size m?
  ◮ Theory: there are diminishing returns to increasing m.
  ◮ Practice: just keep cranking it up, and eventually you'll see that your code doesn't get any faster.
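For concreteness, a minimal sketch of a mini-batch gradient estimate (grad_fn is an assumed placeholder); the roughly 1/m reduction in the estimate's variance is one way to see the diminishing returns:

```python
import numpy as np

def minibatch_gradient(grad_fn, w, data, m, rng=None):
    """Average per-example gradients over a random mini-batch of size m.

    grad_fn(w, example) is a placeholder for your model's gradient. The
    variance of this estimate shrinks roughly like 1/m, so ever-larger
    mini-batches buy less and less.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(data), size=m, replace=False)
    return np.mean([grad_fn(w, data[i]) for i in idx], axis=0)
```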

  6. Regularization: How do we set it?
  ◮ Theory: really just says that λ controls your "model complexity".
    ◮ We DO know that "early stopping" for GD/SGD is (basically) doing L2 regularization for us,
    ◮ i.e. if we don't run for too long, then ‖w‖² won't become too big.
  ◮ Practice:
    ◮ Set it with a dev set!
    ◮ Exact methods (like matrix inverse / least squares): you always need to regularize, or something horrible happens....
    ◮ GD/SGD: sometimes (often?) it works just fine ignoring regularization.
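A minimal sketch of "set it with a dev set", assuming placeholder train and dev_error callables and an illustrative grid of λ values:

```python
import numpy as np

def pick_lambda(train, dev_error, lambdas=(0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    """Try each candidate lambda, keep the one with the lowest dev error.

    train(lam) -> fitted weights and dev_error(w) -> scalar error are
    placeholders for whatever model and metric the homework uses.
    """
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = train(lam)          # fit on the training set with this lambda
        err = dev_error(w)      # evaluate on the held-out dev set
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```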

  7. Today

  8. There is no magic in vector derivatives: scratch space

  9. There is no magic in vector derivatives: scratch space

  10. There is no magic in matrix derivatives: scratch space

  11. Understanding MLE. [Diagram: MLE shown as a box taking observations y_1, ... and producing the estimate π̂.] You can think of MLE as a "black box" for choosing parameter values.

  12. Understanding MLE. [Diagram: the Bernoulli story, with parameter π generating Y, shown next to the MLE box that maps the observations y_1, ... to π̂.]

  13. Understanding MLE. [Diagram: MLE for a conditional model; the box takes the pairs (x_1, y_1), ... and outputs ŵ and b̂.]

  14. Understanding MLE. [Diagram: the logistic regression story (x, w, b → weighted sum → logistic → Y) shown next to the MLE box that maps the (x_n, y_n) pairs to ŵ and b̂.]

  15. Probabilistic Stories. [Diagram: the Bernoulli story, π → Y; and the logistic regression story, x, w, b → weighted sum → logistic → Y.]

  16. Probabilistic Stories. [Diagram: the Bernoulli story (π → Y) paired with logistic regression (x, w, b → weighted sum → logistic → Y), and the Gaussian story (μ, σ² → Y) paired with linear regression (x, w, b → weighted sum, plus Gaussian noise σ² → Y).]

  17. MLE example: estimating the bias of a coin

  18. MLE example: estimating the bias of a coin

  19. Then and Now
  Before today, you knew how to do MLE:
  ◮ For a Bernoulli distribution: π̂ = count(+1) / (count(+1) + count(−1)) = count(+1) / N.
  ◮ For a Gaussian distribution: μ̂ = (1/N) ∑_{n=1}^{N} y_n (and similar for estimating the variance, σ̂²).
  Logistic regression and linear regression, respectively, generalize these so that the parameter is itself a function of x, so that we have a conditional model of Y given X.
  ◮ The practical difference is that the MLE doesn't have a closed form for these models. (So we use SGD and friends.)
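The closed forms above are one-liners in code; a minimal sketch, assuming labels in {−1, +1} for the Bernoulli case:

```python
import numpy as np

def bernoulli_mle(y):
    """pi_hat = count(+1) / N, for labels y_n in {-1, +1}."""
    y = np.asarray(y)
    return np.count_nonzero(y == 1) / len(y)

def gaussian_mle(y):
    """mu_hat = (1/N) sum_n y_n; sigma2_hat = (1/N) sum_n (y_n - mu_hat)^2."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    return mu, np.mean((y - mu) ** 2)
```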

  20. Remember: Linear Regression as a Probabilistic Model
  Linear regression defines p_w(Y | X) as follows:
  1. Observe the feature vector x; transform it via the activation function: μ = w · x.
  2. Let μ be the mean of a normal distribution and define the density: p_w(Y | x) = (1 / (σ√(2π))) exp(−(Y − μ)² / (2σ²)).
  3. Sample Y from p_w(Y | x).
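The three steps translate directly into a sampler; a minimal sketch, with sigma assumed to be a known constant (the function name is illustrative):

```python
import numpy as np

def sample_y(x, w, sigma=1.0, rng=None):
    """Follow the generative story: mu = w . x, then Y ~ Normal(mu, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    mu = float(np.dot(w, x))                 # steps 1-2: the activation is the mean
    return rng.normal(loc=mu, scale=sigma)   # step 3: sample Y
```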

  21. Remember: Linear Regression-MLE is (Unregularized) Squared Loss Minimization!
  argmin_w ∑_{n=1}^{N} −log p_w(y_n | x_n) ≡ argmin_w (1/N) ∑_{n=1}^{N} (y_n − w · x_n)², where (y_n − w · x_n)² is SquaredLoss_n(w, b).
  Where did the variance go?
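To answer "where did the variance go?", expand one term of the negative log-likelihood using the Gaussian density from the previous slide (a short derivation; σ is treated as a fixed constant):

```latex
-\log p_{\mathbf{w}}(y_n \mid \mathbf{x}_n)
  = \frac{(y_n - \mathbf{w}\cdot\mathbf{x}_n)^2}{2\sigma^2}
    + \log\bigl(\sigma\sqrt{2\pi}\bigr)
```

Summed over n, the second term is a constant in w and the first is the squared loss scaled by 1/(2σ²); additive constants and positive scalings do not change the argmin, so the variance drops out of the optimization.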

  22. Adding a "Prior" to the Probabilistic Story
  Probabilistic story:
  ◮ For n ∈ {1, ..., N}:
    ◮ Observe x_n.
    ◮ Transform it using parameters w to get p(Y = y | x_n, w).
    ◮ Sample y_n ∼ p(Y | x_n, w).

  23. Adding a "Prior" to the Probabilistic Story
  Probabilistic story (as before):
  ◮ For n ∈ {1, ..., N}:
    ◮ Observe x_n.
    ◮ Transform it using parameters w to get p(Y = y | x_n, w).
    ◮ Sample y_n ∼ p(Y | x_n, w).
  Probabilistic story with a "prior":
  ◮ Use hyperparameters α to define a prior distribution over random variables W, p_α(W).
  ◮ Sample w ∼ p_α(W = w).
  ◮ For n ∈ {1, ..., N}:
    ◮ Observe x_n.
    ◮ Transform it using parameters w and b to get p(Y | x_n, w).
    ◮ Sample y_n ∼ p(Y | x_n, w).
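As a concrete instance, a minimal sketch of the story with a prior for the linear-regression case: a zero-mean Gaussian prior with standard deviation alpha over w, then one Gaussian draw of y_n per row of X (names and defaults are illustrative):

```python
import numpy as np

def sample_story_with_prior(X, alpha=1.0, sigma=1.0, rng=None):
    """Sample weights from the prior, then sample each y_n given x_n and w.

    alpha is the prior standard deviation over the weights (the
    hyperparameter); sigma is the observation noise.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = rng.normal(loc=0.0, scale=alpha, size=d)   # w ~ p_alpha(W)
    y = rng.normal(loc=X @ w, scale=sigma)         # y_n ~ p(Y | x_n, w)
    return w, y
```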

  24. MLE vs. Maximum a Posteriori (MAP) Estimation
  ◮ Review: MLE
    ◮ We have a model Pr(Data | w).
    ◮ Find the w which maximizes the probability of the data you have observed: argmax_w Pr(Data | w).
  ◮ New: Maximum a Posteriori Estimation
    ◮ We also have a prior Pr(W = w).
    ◮ Now we have a posterior distribution: Pr(w | Data) = Pr(Data | w) Pr(W = w) / Pr(Data).
    ◮ Now suppose we are asked to provide our "best guess" at w. What should we do?

  25. Maximum a Posteriori (MAP) Estimation and Regularization
  ◮ MAP estimation: argmax_w Pr(w | Data).
  ◮ In many settings, this leads to ŵ = argmax_w [ log p_α(w) + ∑_{n=1}^{N} log p_w(y_n | x_n) ], i.e. the log prior plus the log likelihood.

  26. Maximum a Posteriori (MAP) Estimation and Regularization
  ◮ MAP estimation: argmax_w Pr(w | Data).
  ◮ In many settings, this leads to ŵ = argmax_w [ log p_α(w) + ∑_{n=1}^{N} log p_w(y_n | x_n) ], i.e. the log prior plus the log likelihood.
  Option 1: let p_α(W) be a zero-mean Gaussian distribution with standard deviation α. Then log p_α(w) = −(1/(2α²)) ‖w‖₂² + constant.

  27. Maximum a Posteriori (MAP) Estimation and Regularization
  ◮ MAP estimation: argmax_w Pr(w | Data).
  ◮ In many settings, this leads to ŵ = argmax_w [ log p_α(w) + ∑_{n=1}^{N} log p_w(y_n | x_n) ], i.e. the log prior plus the log likelihood.
  Option 1: let p_α(W) be a zero-mean Gaussian distribution with standard deviation α. Then log p_α(w) = −(1/(2α²)) ‖w‖₂² + constant.
  Option 2: let p_α(W_j) be a zero-location "Laplace" distribution with scale α. Then log p_α(w) = −(1/α) ‖w‖₁ + constant.
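Written as minimization of the negative log-posterior (dropping constants), the two options become the familiar penalized objectives; a minimal sketch for the linear-regression likelihood with unit noise variance (function names are illustrative):

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Gaussian likelihood with unit noise variance: half the sum of squared losses."""
    return 0.5 * np.sum((y - X @ w) ** 2)

def neg_log_posterior_l2(w, X, y, alpha):
    """Option 1 (Gaussian prior, std alpha): adds (1/(2 alpha^2)) ||w||_2^2."""
    return neg_log_likelihood(w, X, y) + np.sum(w ** 2) / (2 * alpha ** 2)

def neg_log_posterior_l1(w, X, y, alpha):
    """Option 2 (Laplace prior, scale alpha): adds (1/alpha) ||w||_1."""
    return neg_log_likelihood(w, X, y) + np.sum(np.abs(w)) / alpha
```

Maximizing the log-posterior is the same as minimizing either of these, which is exactly L2- or L1-regularized squared-loss minimization.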

  28. L2 vs. L1 Regularization

  29. Probabilistic Story: L2-Regularized Logistic Regression. [Diagram: a zero-mean Gaussian prior (variance σ²) over w feeds the logistic regression story (x, w, b → weighted sum → logistic → Y); MAP maps the (x_n, y_n) pairs to ŵ and b̂.]

  30. Why Go Probabilistic?
  ◮ Interpret the classifier's activation function as a (log) probability (density), which encodes uncertainty.
  ◮ Interpret the regularizer as a (log) probability (density), which encodes uncertainty.
  ◮ Leverage theory from statistics to get a better understanding of the guarantees we can hope for with our learning algorithms.
  ◮ Change your assumptions, turn the optimization crank, and get a new machine learning method.
  The key to success is to tell a probabilistic story that's reasonably close to reality, including the prior(s).
