Machine Learning (CSE 446): Probabilistic Machine Learning MLE & MAP



SLIDE 1

Machine Learning (CSE 446): Probabilistic Machine Learning MLE & MAP

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

SLIDE 2

Announcements

◮ Homeworks
  ◮ HW 3 posted. Get the most recent version.
  ◮ You must do the regular problems before obtaining any extra credit.
  ◮ Extra credit is factored in after your scores are averaged together.
◮ Office hours today: 3-4p
◮ Today:
  ◮ Review
  ◮ Probabilistic methods

SLIDE 3

Review

SLIDE 4

SGD: How do we set the step sizes?

◮ Theory: if you decay the step sizes according to some prescribed schedule, then SGD will converge to the right answer. The "classical" theory doesn't provide enough practical guidance.
◮ Practice:
  ◮ Starting step size: start it "large". If it is "too large", you either diverge or nothing improves; set it a bit smaller than that point (say, 1/4 of it).
  ◮ When do we decay it? When your training error stops decreasing "enough".
  ◮ HW: you'll need to tune it a little. (A slower approach: sometimes you can just start it somewhat smaller than the "divergent" value and you will find something reasonable. See the sketch below.)

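A minimal sketch of this recipe on a toy least-squares problem (my own illustration, assuming plain NumPy and SGD on the squared loss; none of the data, constants, or thresholds come from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

def run_epoch(w, step):
    """One pass of SGD on the squared loss; may return non-finite w if it diverges."""
    for i in rng.permutation(len(y)):
        w = w - step * (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i·w - y_i)^2
    return w

def train_mse(w):
    return float(np.mean((X @ w - y) ** 2))

# 1) Probe for a "large but not divergent" starting step size, then back off to ~1/4 of it.
#    (Probing too-large steps may print overflow warnings; that is the "divergence" signal.)
step = 10.0
while not np.all(np.isfinite(run_epoch(np.zeros(5), step))):
    step /= 2
step *= 0.25

# 2) Train, decaying the step size whenever training error stops improving "enough".
w, prev = np.zeros(5), np.inf
for epoch in range(30):
    w = run_epoch(w, step)
    loss = train_mse(w)
    if prev - loss < 1e-3 * max(prev, 1e-12):  # plateau: improvement below 0.1%
        step /= 2
    prev = loss
print(f"final step size {step:.4g}, training MSE {loss:.4g}")
```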
SLIDE 5

SGD: How do we set the mini-batch size m?

◮ Theory: there are diminishing returns to increasing m.
◮ Practice: just keep cranking it up, and eventually you'll see that your code doesn't get any faster (see the timing sketch below).

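A minimal timing sketch of that "crank it up" advice (my own illustration; the dimensions and batch sizes are arbitrary): compute a mini-batch gradient for increasing m and watch when throughput stops improving.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
d = 1000
w = rng.normal(size=d)

for m in [1, 8, 64, 512, 4096]:
    X = rng.normal(size=(m, d))
    y = rng.normal(size=m)
    t0 = time.perf_counter()
    for _ in range(50):
        grad = X.T @ (X @ w - y) / m  # mini-batch gradient of the mean squared loss
    dt = (time.perf_counter() - t0) / 50
    print(f"m={m:5d}  time/batch={dt * 1e3:8.3f} ms  examples/sec={m / dt:12,.0f}")
```

Once the examples/sec column flattens out, larger batches are no longer buying you anything per unit of compute.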
SLIDE 6

Regularization: How do we set it?

◮ Theory: really just says that λ controls your "model complexity".
  ◮ We DO know that "early stopping" for GD/SGD is (basically) doing L2 regularization for us,
  ◮ i.e. if we don't run for too long, then $\|w\|_2$ won't become too big.
◮ Practice:
  ◮ Set it with a dev set! (See the sketch below.)
  ◮ Exact methods (like matrix inverse / least squares): always need to regularize, or something horrible happens....
  ◮ GD/SGD: sometimes (often?) it works just fine ignoring regularization.
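A minimal sketch of "set λ with a dev set" (my own illustration, using ridge regression because it has a closed form; the λ grid and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(size=300)

X_tr, y_tr = X[:200], y[:200]      # training set
X_dev, y_dev = X[200:], y[200:]    # dev (held-out) set

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

best_lam, best_err = None, np.inf
for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    dev_err = float(np.mean((X_dev @ w - y_dev) ** 2))
    if dev_err < best_err:
        best_lam, best_err = lam, dev_err
print(f"chosen lambda = {best_lam}, dev MSE = {best_err:.3f}")
```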

SLIDE 7

Today

SLIDE 8

There is no magic in vector derivatives: scratch space

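These scratch slides are filled in on the board; as one example of the kind of calculation meant here (my own illustration, not the actual board work), the gradient of the squared loss, taken one coordinate at a time:

$$\begin{aligned}
f(w) &= \|Xw - y\|_2^2 = \sum_{n=1}^{N} \Big( \sum_{j=1}^{d} X_{nj} w_j - y_n \Big)^2 \\
\frac{\partial f}{\partial w_k} &= \sum_{n=1}^{N} 2 \Big( \sum_{j=1}^{d} X_{nj} w_j - y_n \Big) X_{nk} = 2 \big[ X^\top (Xw - y) \big]_k \\
\nabla_w f(w) &= 2\, X^\top (Xw - y).
\end{aligned}$$

No magic: every entry of the gradient is just an ordinary partial derivative of a scalar-valued function.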
SLIDE 9

There is no magic in vector derivatives: scratch space

SLIDE 10

There is no magic in matrix derivatives: scratch space

SLIDE 11

Understanding MLE

[Figure: data y1, . . . fed into an "MLE" box, which outputs the estimate π̂.]

You can think of MLE as a "black box" for choosing parameter values.

SLIDE 12

Understanding MLE

[Figure: the same picture, now with the generative story drawn in: the parameter π defines the distribution of Y that produced the data, and the MLE box outputs π̂.]

SLIDE 13

Understanding MLE

[Figure: the data are now pairs (x1, y1), . . . , and the MLE box outputs the estimates ŵ and b̂.]

SLIDE 14

Understanding MLE

[Figure: the same picture with the generative story drawn in: x is combined with w and b via a weighted sum, passed through the logistic function to give the distribution of Y; the MLE box outputs ŵ and b̂.]

SLIDE 15

Probabilistic Stories

[Figure: logistic regression as a probabilistic story: x, w, and b are combined via a weighted sum, the logistic function turns the result into π, and Y is drawn from a Bernoulli(π).]

SLIDE 16

Probabilistic Stories

[Figure: the logistic regression story as above, side by side with linear regression: x, w, and b are combined via a weighted sum to give μ, and Y is drawn from a Gaussian with mean μ and variance σ².]
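A minimal sketch of these two stories in code (my own illustration; the particular numbers are arbitrary and nothing here is course-provided):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)
w, b = rng.normal(size=d), 0.5

# Logistic regression story: Y | x ~ Bernoulli(pi), with pi = logistic(w·x + b).
pi = 1.0 / (1.0 + np.exp(-(w @ x + b)))
y_logistic = rng.binomial(1, pi)     # a sample in {0, 1}

# Linear regression story: Y | x ~ Normal(mu, sigma^2), with mu = w·x + b.
mu, sigma = w @ x + b, 1.0
y_linear = rng.normal(mu, sigma)     # a real-valued sample

print(f"pi={pi:.3f}, sampled label={y_logistic}; mu={mu:.3f}, sampled y={y_linear:.3f}")
```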

SLIDE 17

MLE example: estimating the bias of a coin

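This example is worked on the board; a standard derivation, consistent with the closed form quoted on the "Then and Now" slide below. Suppose we see N independent flips, N₊ of them +1 and N₋ of them −1, and the coin lands +1 with probability π:

$$\begin{aligned}
\log L(\pi) &= \log\big( \pi^{N_+} (1-\pi)^{N_-} \big) = N_+ \log \pi + N_- \log(1 - \pi) \\
\frac{d}{d\pi} \log L(\pi) &= \frac{N_+}{\pi} - \frac{N_-}{1 - \pi} = 0
\;\;\Longrightarrow\;\;
\hat{\pi} = \frac{N_+}{N_+ + N_-} = \frac{N_+}{N}.
\end{aligned}$$

(The second derivative is negative, so this stationary point is indeed the maximizer.)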
SLIDE 18

MLE example: estimating the bias of a coin

SLIDE 19

Then and Now

Before today, you knew how to do MLE:

◮ For a Bernoulli distribution: $\hat{\pi} = \frac{\mathrm{count}(+1)}{\mathrm{count}(+1) + \mathrm{count}(-1)} = \frac{N_+}{N}$
◮ For a Gaussian distribution: $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} y_n$ (and similarly for estimating the variance, $\hat{\sigma}^2$).

Logistic regression and linear regression, respectively, generalize these so that the parameter is itself a function of x, so that we have a conditional model of Y given X.

◮ The practical difference is that the MLE doesn't have a closed form for these models. (So we use SGD and friends.)

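A quick numerical sketch of those "before today" closed forms (my own illustration with simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: flips in {+1, -1} with true bias Pr(+1) = 0.7.
flips = rng.choice([+1, -1], size=1000, p=[0.7, 0.3])
pi_hat = np.sum(flips == +1) / len(flips)        # N_+ / N

# Gaussian: y_n ~ Normal(mu = 2.0, sigma = 1.5).
y = rng.normal(2.0, 1.5, size=1000)
mu_hat = y.mean()                                # (1/N) * sum_n y_n
sigma2_hat = np.mean((y - mu_hat) ** 2)          # MLE of the variance

print(f"pi_hat={pi_hat:.3f}, mu_hat={mu_hat:.3f}, sigma2_hat={sigma2_hat:.3f}")
```

For logistic regression there is no such one-line formula for ŵ, which is why we fall back on SGD and friends.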
SLIDE 20

Remember: Linear Regression as a Probabilistic Model

Linear regression defines pw(Y | X) as follows:

  1. Observe the feature vector x; transform it via the activation function: $\mu = w \cdot x$.
  2. Let $\mu$ be the mean of a normal distribution and define the density:
     $$p_w(Y \mid x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( \frac{-(Y - \mu)^2}{2\sigma^2} \right)$$
  3. Sample Y from $p_w(Y \mid x)$.

SLIDE 21

Remember: Linear Regression-MLE is (Unregularized) Squared Loss Minimization!

$$\operatorname{argmin}_{w} \; \sum_{n=1}^{N} -\log p_w(y_n \mid x_n) \;\;\equiv\;\; \operatorname{argmin}_{w} \; \frac{1}{N} \sum_{n=1}^{N} \underbrace{(y_n - w \cdot x_n)^2}_{\mathrm{SquaredLoss}_n(w, b)}$$

Where did the variance go?

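Filling in that step (a standard calculation using the density from the previous slide): taking the negative log of the Gaussian,

$$\sum_{n=1}^{N} -\log p_w(y_n \mid x_n) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 + N \log\big(\sigma \sqrt{2\pi}\big),$$

so for a fixed σ the variance only rescales the objective and shifts it by a constant; neither changes the minimizing w (and neither does the 1/N factor). That is where the variance "went."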
SLIDE 22

Adding a “Prior” to the Probabilistic Story

Probabilistic story:

◮ For n ∈ {1, . . . , N}:
  ◮ Observe xn.
  ◮ Transform it using parameters w to get p(Y = y | xn, w).
  ◮ Sample yn ∼ p(Y | xn, w).

SLIDE 23

Adding a “Prior” to the Probabilistic Story

Probabilistic story:

◮ For n ∈ {1, . . . , N}:
  ◮ Observe xn.
  ◮ Transform it using parameters w to get p(Y = y | xn, w).
  ◮ Sample yn ∼ p(Y | xn, w).

Probabilistic story with a "prior":

◮ Use hyperparameters α to define a prior distribution over random variables W, pα(W).
◮ Sample w ∼ pα(W = w).
◮ For n ∈ {1, . . . , N}:
  ◮ Observe xn.
  ◮ Transform it using parameters w and b to get p(Y | xn, w).
  ◮ Sample yn ∼ p(Y | xn, w).

(A small sampling sketch of this story follows below.)
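A minimal sketch of the story with a prior (my own illustration; I use a zero-mean Gaussian prior, matching "Option 1" on a later slide, and a logistic link for concreteness):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, alpha = 5, 3, 2.0

# Sample the parameters from the prior p_alpha(W): here a zero-mean Gaussian.
w = rng.normal(0.0, alpha, size=d)
b = rng.normal(0.0, alpha)

for n in range(N):
    x_n = rng.normal(size=d)                        # observe x_n
    pi_n = 1.0 / (1.0 + np.exp(-(w @ x_n + b)))     # transform to get p(Y | x_n, w)
    y_n = rng.binomial(1, pi_n)                     # sample y_n
    print(f"n={n}: pi={pi_n:.3f}, y={y_n}")
```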

SLIDE 24

MLE vs. Maximum a Posteriori (MAP) Estimation

◮ Review: MLE
  ◮ We have a model Pr(Data | w).
  ◮ Find the w which maximizes the probability of the data you have observed:
    $$\operatorname{argmax}_{w} \; \Pr(\mathrm{Data} \mid w)$$
◮ New: Maximum a Posteriori Estimation
  ◮ We also have a prior Pr(W = w).
  ◮ Now we have a posterior distribution:
    $$\Pr(w \mid \mathrm{Data}) = \frac{\Pr(\mathrm{Data} \mid w)\, \Pr(W = w)}{\Pr(\mathrm{Data})}$$
◮ Now suppose we are asked to provide our "best guess" at w. What should we do?

SLIDE 25

Maximum a Posteriori (MAP) Estimation and Regularization

◮ MAP estimation:
  $$\operatorname{argmax}_{w} \; \Pr(w \mid \mathrm{Data})$$
◮ In many settings, this leads to
  $$\hat{w} = \operatorname{argmax}_{w} \; \underbrace{\log p_\alpha(w)}_{\text{log prior}} \;+\; \underbrace{\sum_{n=1}^{N} \log p_w(y_n \mid x_n)}_{\text{log likelihood}}$$

SLIDE 26

Maximum a Posteriori (MAP) Estimation and Regularization

◮ MAP estimation:
  $$\operatorname{argmax}_{w} \; \Pr(w \mid \mathrm{Data})$$
◮ In many settings, this leads to
  $$\hat{w} = \operatorname{argmax}_{w} \; \underbrace{\log p_\alpha(w)}_{\text{log prior}} \;+\; \underbrace{\sum_{n=1}^{N} \log p_w(y_n \mid x_n)}_{\text{log likelihood}}$$

Option 1: let $p_\alpha(W)$ be a zero-mean Gaussian distribution with standard deviation α.
$$\log p_\alpha(w) = -\frac{1}{2\alpha^2} \|w\|_2^2 + \text{constant}$$

SLIDE 27

Maximum a Posteriori (MAP) Estimation and Regularization

◮ MAP estimation:
  $$\operatorname{argmax}_{w} \; \Pr(w \mid \mathrm{Data})$$
◮ In many settings, this leads to
  $$\hat{w} = \operatorname{argmax}_{w} \; \underbrace{\log p_\alpha(w)}_{\text{log prior}} \;+\; \underbrace{\sum_{n=1}^{N} \log p_w(y_n \mid x_n)}_{\text{log likelihood}}$$

Option 1: let $p_\alpha(W)$ be a zero-mean Gaussian distribution with standard deviation α.
$$\log p_\alpha(w) = -\frac{1}{2\alpha^2} \|w\|_2^2 + \text{constant}$$

Option 2: let $p_\alpha(W_j)$ be a zero-location "Laplace" distribution with scale α.
$$\log p_\alpha(w) = -\frac{1}{\alpha} \|w\|_1 + \text{constant}$$

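Connecting this back to regularization (a standard observation; the λ notation is mine): flipping the sign turns MAP into regularized loss minimization. With the Gaussian prior of Option 1,

$$\hat{w} = \operatorname{argmin}_{w} \; \sum_{n=1}^{N} -\log p_w(y_n \mid x_n) \;+\; \frac{1}{2\alpha^2} \|w\|_2^2,$$

i.e. L2 regularization with λ = 1/(2α²); the Laplace prior of Option 2 gives the L1 penalty (1/α)‖w‖₁ in the same way. A narrow prior (small α) means a large λ and heavy regularization; a wide prior (large α) means a small λ and light regularization.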
SLIDE 28

L2 vs. L1 Regularization

SLIDE 29

Probabilistic Story: L2-Regularized Logistic Regression

[Figure: the logistic regression story as before (x, w, b → weighted sum → logistic → Y), but the data now feed into a "MAP" box rather than an MLE box, which outputs ŵ and b̂; a prior with variance σ² is placed on the weights.]

SLIDE 30

Why Go Probabilistic?

◮ Interpret the classifier's activation function as a (log) probability (density), which encodes uncertainty.
◮ Interpret the regularizer as a (log) probability (density), which encodes uncertainty.
◮ Leverage theory from statistics to get a better understanding of the guarantees we can hope for with our learning algorithms.
◮ Change your assumptions, turn the optimization-crank, and get a new machine learning method. The key to success is to tell a probabilistic story that's reasonably close to reality, including the prior(s).