Maximum Likelihood Estimation
  1. Maximum Likelihood Estimation CS 446

  2. Maximum likelihood: abstract formulation
We've had one main "meta-algorithm" this semester:
◮ (Regularized) ERM principle: pick the model that minimizes an average loss over training data.
We've also discussed another: the "Maximum likelihood estimation (MLE)" principle:
◮ Pick a set of probability models for your data: P := { p_θ : θ ∈ Θ }.
◮ p_θ will denote both densities and masses; the literature is similarly inconsistent.
◮ Given samples (z_i)_{i=1}^n, pick the model that maximizes the likelihood:
    max_{θ ∈ Θ} L(θ) = max_{θ ∈ Θ} ln ∏_{i=1}^n p_θ(z_i) = max_{θ ∈ Θ} ∑_{i=1}^n ln p_θ(z_i),
where the ln(·) is for mathematical convenience, and z_i can be a labeled pair (x_i, y_i) or just x_i.
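Not part of the original slides, but a minimal sketch in code may help make the recipe concrete. It treats −ln p_θ(z_i) as a per-example loss (the ERM view of MLE) and maximizes the likelihood numerically for a Bernoulli family; the name neg_log_lik and the toy data are illustrative only.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Samples z_1, ..., z_n; here coin flips, but any data works once you
    # plug in the corresponding log-density/log-mass ln p_theta(z).
    z = np.array([1, 0, 1, 1, 0, 1, 1, 0])

    def neg_log_lik(theta):
        # ERM view: the average of the per-example loss -ln p_theta(z_i).
        return -np.mean(z * np.log(theta) + (1 - z) * np.log(1 - theta))

    # MLE: minimize the average negative log-likelihood over Theta = (0, 1).
    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x)  # approximately mean(z), matching the coin-flip derivation below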

  3. Connections between ERM and MLE
◮ We can often derive and justify many basic methods with either (e.g., least squares, logistic regression, k-means, ...).
◮ MLE ideas were used to derive VAEs, which we'll cover next week!
◮ Each perspective suggests some different details and interpretation.
◮ Both approaches rely upon seemingly arbitrary assumptions and choices.
◮ The success of MLE often seems to hinge upon an astute choice of model.
◮ Applied scientists often like MLE and its ilk due to interpretability and "usability": they can easily encode domain knowledge. We'll return to this.

  4. Example 1: coin flips.
◮ We flip a coin of bias θ ∈ [0, 1].
◮ Write down x_i = 0 for tails, x_i = 1 for heads; then
    p_θ(x_i) = x_i θ + (1 − x_i)(1 − θ),
or alternatively
    p_θ(x_i) = θ^{x_i} (1 − θ)^{1 − x_i}.
The second form will be more convenient.
◮ Writing H := ∑_i x_i and T := ∑_i (1 − x_i) = n − H for convenience,
    L(θ) = ∑_{i=1}^n [ x_i ln θ + (1 − x_i) ln(1 − θ) ] = H ln θ + T ln(1 − θ).
Differentiating and setting to 0,
    0 = H/θ − T/(1 − θ),
which gives θ = H/(T + H) = H/n.
◮ In this way, we've justified a natural algorithm.
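A quick numerical check of this closed form (not from the slides; the data below is made up): the estimate H/n is exactly the empirical frequency of heads.

    import numpy as np

    x = np.array([1, 1, 0, 1, 0, 0, 1, 1])  # 1 = heads, 0 = tails

    H = x.sum()              # number of heads
    T = len(x) - H           # number of tails
    theta_hat = H / (H + T)  # maximizer of H ln(theta) + T ln(1 - theta)

    print(theta_hat, x.mean())  # identical: the MLE is the empirical frequency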

  5. Example 2: mean of a Gaussian
◮ Suppose x_i ∼ N(µ, σ²), so θ = (µ, σ²), and
    ln p_θ(x_i) = ln [ exp(−(x_i − µ)² / (2σ²)) / √(2πσ²) ] = −(x_i − µ)² / (2σ²) − (1/2) ln(2πσ²).
◮ Therefore
    L(θ) = −(1/(2σ²)) ∑_{i=1}^n (x_i − µ)² + stuff without µ;
applying ∇_µ and setting to zero gives µ = (1/n) ∑_i x_i.
◮ A similar derivation gives σ² = (1/n) ∑_i (x_i − µ)².
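A short sketch (simulated data, plain NumPy; not from the slides) confirming both closed forms. Note that the MLE for σ² divides by n, unlike the unbiased estimator that divides by n − 1.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=3.0, size=1000)  # draws from N(mu = 2, sigma^2 = 9)

    mu_hat = x.mean()                        # MLE for mu: (1/n) sum_i x_i
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE for sigma^2: divides by n, not n - 1

    print(mu_hat, sigma2_hat)
    print(x.var())  # numpy's default var() also divides by n (ddof=0), so it matches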

  6. Discussion: Bayesian vs. frequentist perspectives
Question: (1/n) ∑_{i=1}^n x_i estimates a Gaussian µ parameter; but isn't it useful more generally?
Bayesian perspective: we choose a model and believe it well-approximates reality; learning its parameters determines underlying phenomena.
◮ Bayesian methods can handle model misspecification; LDA is an example which works well despite seemingly impractical assumptions.
Frequentist perspective: we ask certain questions, and reason about the accuracy of our answers.
◮ For many distributions, (1/n) ∑_{i=1}^n x_i is a valid estimate of the mean, moreover with confidence intervals of size 1/√n. This approach isn't free of assumptions: IID is there...
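To make the 1/√n claim concrete, here is a small sketch (not from the slides) of the standard normal-approximation confidence interval for a mean; the data is deliberately non-Gaussian, and the constant 1.96 corresponds to roughly 95% coverage.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=400)  # true mean 2.0, and clearly not Gaussian

    n = len(x)
    mean = x.mean()
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)  # half-width shrinks like 1/sqrt(n)

    print(mean - half_width, mean + half_width)  # an interval that typically covers 2.0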

  7. Discussion: Bayesian vs. frequentist perspectives (part 2)
◮ This discussion also appears in the form "generative vs. discriminative ML".
◮ As before: both philosophies can justify/derive the same algorithm; they differ on some details (e.g., choosing k in k-means).
IMO: it's nice having more tools
◮ (as mentioned before: the VAE is derived from the MLE perspective).

  8. Example 3: Least squares (recap)
If we assume Y | X ∼ N(w^T X, σ²), then
    L(w) = ∑_{i=1}^n ln p_w(x_i, y_i)
         = ∑_{i=1}^n [ ln p_w(y_i | x_i) + ln p(x_i) ]
         = ∑_{i=1}^n [ −(w^T x_i − y_i)² / (2σ²) ] + terms without w.
Therefore
    argmax_{w ∈ R^d} L(w) = argmin_{w ∈ R^d} ∑_{i=1}^n (w^T x_i − y_i)².
We can derive/justify the algorithm either way, but some refinements now differ with each perspective (e.g., regularization).
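A sketch (simulated data, illustrative names; not from the slides) showing both routes landing on the same w: the closed-form least-squares solution, and a direct numerical maximization of the Gaussian log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    n, d = 200, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.3, size=n)  # Y | X ~ N(w^T X, sigma^2)

    # ERM route: minimize sum_i (w^T x_i - y_i)^2 in closed form.
    w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    # MLE route: maximize L(w), i.e. minimize its negation; sigma^2 only
    # rescales the objective, so it does not change the argmax.
    w_mle = minimize(lambda w: np.sum((X @ w - y) ** 2), x0=np.zeros(d)).x

    print(w_ls, w_mle)  # essentially identical, and both are close to w_true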

  9. Example 4: Naive Bayes
◮ Let's try a simple prediction setup, with (Bayes) optimal classifier
    argmax_{y ∈ Y} p(Y = y | X = x).
(We haven't discussed this concept a lot, but it's widespread in ML.)
◮ One way to proceed is to learn p(Y | X) exactly; that's a pain.
◮ Let's assume the coordinates of X = (X_1, ..., X_d) are independent given Y:
    p(Y = y | X = x) = p(Y = y, X = x) / p(X = x) = p(X = x | Y = y) p(Y = y) / p(X = x)
                     = p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y) / p(X = x),
and
    argmax_{y ∈ Y} p(Y = y | X = x) = argmax_{y ∈ Y} p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y).
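A minimal sketch of this factored classifier for binary features (not the course code; all names and the smoothing constant are illustrative): per-class priors p(Y = y) and per-coordinate conditionals p(X_j = 1 | Y = y) are estimated by counting, then combined via the argmax above. Log-probabilities are used to avoid underflow.

    import numpy as np

    def fit_naive_bayes(X, y, alpha=1.0):
        # X: (n, d) binary features, y: (n,) labels; alpha is Laplace smoothing.
        classes = np.unique(y)
        priors = np.array([np.mean(y == c) for c in classes])              # p(Y = c)
        cond = np.array([(X[y == c].sum(axis=0) + alpha)
                         / ((y == c).sum() + 2 * alpha) for c in classes])  # p(X_j = 1 | Y = c)
        return classes, priors, cond

    def predict(x, classes, priors, cond):
        # argmax_y  ln p(Y = y) + sum_j ln p(X_j = x_j | Y = y)
        log_post = np.log(priors) + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)
        return classes[np.argmax(log_post)]

    X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
    y = np.array([1, 1, 0, 0])
    print(predict(np.array([1, 0, 0]), *fit_naive_bayes(X, y)))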

  10. Example 4: Naive Bayes (part 2)
    argmax_{y ∈ Y} p(Y = y | X = x) = argmax_{y ∈ Y} p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y).
Examples where this helps:
◮ Suppose X ∈ {0, 1}^d has an arbitrary distribution; it's specified with 2^d − 1 numbers, whereas the factored form above needs d numbers. To see how this can help, suppose x ∈ {0, 1}^d: instead of having to learn a probability model over 2^d possibilities, we now have to learn d + 1 models each with 2 possibilities (binary labels).
◮ HW5 will use the standard "Iris dataset". This data is continuous; Naive Bayes would approximate univariate distributions.
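For continuous data like Iris, the same factorization applies with one univariate Gaussian per feature and class, each fit exactly as in Example 2. A rough sketch (toy numbers, not the HW5 starter code):

    import numpy as np

    def fit_gaussian_nb(X, y):
        # Per class: a prior, plus per-feature (mean, variance) via Example 2's MLE.
        classes = np.unique(y)
        priors = np.array([np.mean(y == c) for c in classes])
        means = np.array([X[y == c].mean(axis=0) for c in classes])
        vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])  # jitter for stability
        return classes, priors, means, vars_

    def predict(x, classes, priors, means, vars_):
        # argmax_y  ln p(Y = y) + sum_j ln N(x_j; mean_{y,j}, var_{y,j})
        log_like = -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_).sum(axis=1)
        return classes[np.argmax(np.log(priors) + log_like)]

    X = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])  # toy stand-in for Iris
    y = np.array([0, 0, 1, 1])
    print(predict(np.array([6.5, 3.0]), *fit_gaussian_nb(X, y)))    # predicts class 1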

  11. Mixtures of Gaussians.

  12. k-means has spherical clusters?
Recall that k-means baked in spherical clusters.
    [Figure: scatter plot of two-dimensional clustered data.]
How about we model each cluster with a Gaussian?
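As a preview (not from the slides), the sketch below samples data like the plot above from a two-component Gaussian mixture with made-up, non-spherical covariances, and shows how a purely Euclidean (spherical) assignment rule, the kind k-means uses, ignores that covariance structure.

    import numpy as np

    rng = np.random.default_rng(3)

    # Two elongated Gaussian clusters; the means and covariances are invented.
    mean0, cov0 = np.array([0.0, 0.0]), np.array([[9.0, 4.0], [4.0, 3.0]])
    mean1, cov1 = np.array([8.0, 5.0]), np.array([[6.0, -3.0], [-3.0, 2.0]])

    n = 300
    z = rng.random(n) < 0.5  # latent cluster memberships
    x = np.where(z[:, None],
                 rng.multivariate_normal(mean0, cov0, size=n),
                 rng.multivariate_normal(mean1, cov1, size=n))

    # A k-means-style rule assigns each point to the nearer mean by Euclidean
    # distance, which ignores the elongated covariance structure entirely.
    d0 = np.linalg.norm(x - mean0, axis=1)
    d1 = np.linalg.norm(x - mean1, axis=1)
    print(np.mean((d0 < d1) != z))  # fraction of points the spherical rule mislabels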
