Maximum Likelihood Estimation
  1. Maximum Likelihood Estimation CS 446

  2. Maximum likelihood: abstract formulation
We've had one main "meta-algorithm" this semester:
◮ (Regularized) ERM principle: pick the model that minimizes an average loss over training data.
We've also discussed another: the "Maximum likelihood estimation (MLE)" principle:
◮ Pick a set of probability models for your data: P := { p_θ : θ ∈ Θ }.
◮ p_θ will denote both densities and masses; the literature is similarly inconsistent.
◮ Given samples (z_i)_{i=1}^n, pick the model that maximizes the likelihood:
    max_{θ ∈ Θ} L(θ) = max_{θ ∈ Θ} ln ∏_{i=1}^n p_θ(z_i) = max_{θ ∈ Θ} ∑_{i=1}^n ln p_θ(z_i),
where the ln(·) is for mathematical convenience, and z_i can be a labeled pair (x_i, y_i) or just x_i.
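Not part of the original slides, but a minimal sketch in code may help make the recipe concrete. It treats −ln p_θ(z_i) as a per-example loss (the ERM view of MLE) and maximizes the likelihood numerically for a Bernoulli family; the name neg_log_lik and the toy data are illustrative only.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Samples z_1, ..., z_n; here coin flips, but any data works once you
    # plug in the corresponding log-density/log-mass ln p_theta(z).
    z = np.array([1, 0, 1, 1, 0, 1, 1, 0])

    def neg_log_lik(theta):
        # ERM view: the average of the per-example loss -ln p_theta(z_i).
        return -np.mean(z * np.log(theta) + (1 - z) * np.log(1 - theta))

    # MLE: minimize the average negative log-likelihood over Theta = (0, 1).
    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x)  # approximately mean(z), matching the coin-flip derivation below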

  3. Connections between ERM and MLE
◮ We can often derive and justify many basic methods with either (e.g., least squares, logistic regression, k-means, ...).
◮ MLE ideas were used to derive VAEs, which we'll cover next week!
◮ Each perspective suggests some different details and interpretation.
◮ Both approaches rely upon seemingly arbitrary assumptions and choices.
◮ The success of MLE often seems to hinge upon an astute choice of model.
◮ Applied scientists often like MLE and its ilk due to interpretability and "usability": they can easily encode domain knowledge. We'll return to this.

  4. Example 1: coin flips.
◮ We flip a coin of bias θ ∈ [0, 1].
◮ Write down x_i = 0 for tails, x_i = 1 for heads; then
    p_θ(x_i) = x_i θ + (1 − x_i)(1 − θ),
or alternatively
    p_θ(x_i) = θ^{x_i} (1 − θ)^{1 − x_i}.
The second form will be more convenient.
◮ Writing H := ∑_i x_i and T := ∑_i (1 − x_i) = n − H for convenience,
    L(θ) = ∑_{i=1}^n [ x_i ln θ + (1 − x_i) ln(1 − θ) ] = H ln θ + T ln(1 − θ).
Differentiating and setting to 0,
    0 = H/θ − T/(1 − θ),
which gives θ = H/(T + H) = H/n.
◮ In this way, we've justified a natural algorithm.
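A quick numerical check of this closed form (not from the slides; the data below is made up): the estimate H/n is exactly the empirical frequency of heads.

    import numpy as np

    x = np.array([1, 1, 0, 1, 0, 0, 1, 1])  # 1 = heads, 0 = tails

    H = x.sum()              # number of heads
    T = len(x) - H           # number of tails
    theta_hat = H / (H + T)  # maximizer of H ln(theta) + T ln(1 - theta)

    print(theta_hat, x.mean())  # identical: the MLE is the empirical frequency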

  5. Example 2: mean of a Gaussian
◮ Suppose x_i ∼ N(µ, σ²), so θ = (µ, σ²), and
    ln p_θ(x_i) = ln [ exp(−(x_i − µ)² / (2σ²)) / √(2πσ²) ] = −(x_i − µ)² / (2σ²) − (1/2) ln(2πσ²).
◮ Therefore
    L(θ) = −(1/(2σ²)) ∑_{i=1}^n (x_i − µ)² + stuff without µ;
applying ∇_µ and setting to zero gives µ = (1/n) ∑_i x_i.
◮ A similar derivation gives σ² = (1/n) ∑_i (x_i − µ)².
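A short sketch (simulated data, plain NumPy; not from the slides) confirming both closed forms. Note that the MLE for σ² divides by n, unlike the unbiased estimator that divides by n − 1.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=3.0, size=1000)  # draws from N(mu = 2, sigma^2 = 9)

    mu_hat = x.mean()                        # MLE for mu: (1/n) sum_i x_i
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE for sigma^2: divides by n, not n - 1

    print(mu_hat, sigma2_hat)
    print(x.var())  # numpy's default var() also divides by n (ddof=0), so it matches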

  6. Discussion: Bayesian vs. frequentist perspectives
Question: (1/n) ∑_{i=1}^n x_i estimates a Gaussian µ parameter; but isn't it useful more generally?
Bayesian perspective: we choose a model and believe it well-approximates reality; learning its parameters determines underlying phenomena.
◮ Bayesian methods can handle model misspecification; LDA is an example which works well despite seemingly impractical assumptions.
Frequentist perspective: we ask certain questions, and reason about the accuracy of our answers.
◮ For many distributions, (1/n) ∑_{i=1}^n x_i is a valid estimate of the mean, moreover with confidence intervals of size 1/√n. This approach isn't free of assumptions: IID is there...
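To make the 1/√n claim concrete, here is a small sketch (not from the slides) of the standard normal-approximation confidence interval for a mean; the data is deliberately non-Gaussian, and the constant 1.96 corresponds to roughly 95% coverage.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=400)  # true mean 2.0, and clearly not Gaussian

    n = len(x)
    mean = x.mean()
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)  # half-width shrinks like 1/sqrt(n)

    print(mean - half_width, mean + half_width)  # an interval that typically covers 2.0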

  7. Discussion: Bayesian vs. frequentist perspectives (part 2)
◮ This discussion also appears in the form "generative vs. discriminative ML".
◮ As before: both philosophies can justify/derive the same algorithm; they differ on some details (e.g., choosing k in k-means).
IMO: it's nice having more tools
◮ (as mentioned before: the VAE is derived from the MLE perspective).

  8. Example 3: Least squares (recap)
If we assume Y | X ∼ N(w^T X, σ²), then
    L(w) = ∑_{i=1}^n ln p_w(x_i, y_i)
         = ∑_{i=1}^n [ ln p_w(y_i | x_i) + ln p(x_i) ]
         = ∑_{i=1}^n [ −(w^T x_i − y_i)² / (2σ²) ] + terms without w.
Therefore
    argmax_{w ∈ R^d} L(w) = argmin_{w ∈ R^d} ∑_{i=1}^n (w^T x_i − y_i)².
We can derive/justify the algorithm either way, but some refinements now differ with each perspective (e.g., regularization).
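A sketch (simulated data, illustrative names; not from the slides) showing both routes landing on the same w: the closed-form least-squares solution, and a direct numerical maximization of the Gaussian log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    n, d = 200, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.3, size=n)  # Y | X ~ N(w^T X, sigma^2)

    # ERM route: minimize sum_i (w^T x_i - y_i)^2 in closed form.
    w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    # MLE route: maximize L(w), i.e. minimize its negation; sigma^2 only
    # rescales the objective, so it does not change the argmax.
    w_mle = minimize(lambda w: np.sum((X @ w - y) ** 2), x0=np.zeros(d)).x

    print(w_ls, w_mle)  # essentially identical, and both are close to w_true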

  9. Example 4: Naive Bayes
◮ Let's try a simple prediction setup, with (Bayes) optimal classifier
    argmax_{y ∈ Y} p(Y = y | X = x).
(We haven't discussed this concept a lot, but it's widespread in ML.)
◮ One way to proceed is to learn p(Y | X) exactly; that's a pain.
◮ Let's assume the coordinates of X = (X_1, ..., X_d) are independent given Y:
    p(Y = y | X = x) = p(Y = y, X = x) / p(X = x) = p(X = x | Y = y) p(Y = y) / p(X = x)
                     = p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y) / p(X = x),
and
    argmax_{y ∈ Y} p(Y = y | X = x) = argmax_{y ∈ Y} p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y).
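A minimal sketch of this factored classifier for binary features (not the course code; all names and the smoothing constant are illustrative): per-class priors p(Y = y) and per-coordinate conditionals p(X_j = 1 | Y = y) are estimated by counting, then combined via the argmax above. Log-probabilities are used to avoid underflow.

    import numpy as np

    def fit_naive_bayes(X, y, alpha=1.0):
        # X: (n, d) binary features, y: (n,) labels; alpha is Laplace smoothing.
        classes = np.unique(y)
        priors = np.array([np.mean(y == c) for c in classes])              # p(Y = c)
        cond = np.array([(X[y == c].sum(axis=0) + alpha)
                         / ((y == c).sum() + 2 * alpha) for c in classes])  # p(X_j = 1 | Y = c)
        return classes, priors, cond

    def predict(x, classes, priors, cond):
        # argmax_y  ln p(Y = y) + sum_j ln p(X_j = x_j | Y = y)
        log_post = np.log(priors) + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)
        return classes[np.argmax(log_post)]

    X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
    y = np.array([1, 1, 0, 0])
    print(predict(np.array([1, 0, 0]), *fit_naive_bayes(X, y)))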

  10. Example 4: Naive Bayes (part 2)
    argmax_{y ∈ Y} p(Y = y | X = x) = argmax_{y ∈ Y} p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y).
Examples where this helps:
◮ Suppose X ∈ {0, 1}^d has an arbitrary distribution; it's specified with 2^d − 1 numbers, whereas the factored form above needs d numbers. To see how this can help, suppose x ∈ {0, 1}^d: instead of having to learn a probability model over 2^d possibilities, we now have to learn d + 1 models each with 2 possibilities (binary labels).
◮ HW5 will use the standard "Iris dataset". This data is continuous; Naive Bayes would approximate univariate distributions.
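For continuous data like Iris, the same factorization applies with one univariate Gaussian per feature and class, each fit exactly as in Example 2. A rough sketch (toy numbers, not the HW5 starter code):

    import numpy as np

    def fit_gaussian_nb(X, y):
        # Per class: a prior, plus per-feature (mean, variance) via Example 2's MLE.
        classes = np.unique(y)
        priors = np.array([np.mean(y == c) for c in classes])
        means = np.array([X[y == c].mean(axis=0) for c in classes])
        vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])  # jitter for stability
        return classes, priors, means, vars_

    def predict(x, classes, priors, means, vars_):
        # argmax_y  ln p(Y = y) + sum_j ln N(x_j; mean_{y,j}, var_{y,j})
        log_like = -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_).sum(axis=1)
        return classes[np.argmax(np.log(priors) + log_like)]

    X = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])  # toy stand-in for Iris
    y = np.array([0, 0, 1, 1])
    print(predict(np.array([6.5, 3.0]), *fit_gaussian_nb(X, y)))    # predicts class 1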

  11. Mixtures of Gaussians.

  12. k-means has spherical clusters?
Recall that k-means baked in spherical clusters.
    [Figure: scatter plot of two-dimensional clustered data.]
How about we model each cluster with a Gaussian?
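As a preview (not from the slides), the sketch below samples data like the plot above from a two-component Gaussian mixture with made-up, non-spherical covariances, and shows how a purely Euclidean (spherical) assignment rule, the kind k-means uses, ignores that covariance structure.

    import numpy as np

    rng = np.random.default_rng(3)

    # Two elongated Gaussian clusters; the means and covariances are invented.
    mean0, cov0 = np.array([0.0, 0.0]), np.array([[9.0, 4.0], [4.0, 3.0]])
    mean1, cov1 = np.array([8.0, 5.0]), np.array([[6.0, -3.0], [-3.0, 2.0]])

    n = 300
    z = rng.random(n) < 0.5  # latent cluster memberships
    x = np.where(z[:, None],
                 rng.multivariate_normal(mean0, cov0, size=n),
                 rng.multivariate_normal(mean1, cov1, size=n))

    # A k-means-style rule assigns each point to the nearer mean by Euclidean
    # distance, which ignores the elongated covariance structure entirely.
    d0 = np.linalg.norm(x - mean0, axis=1)
    d1 = np.linalg.norm(x - mean1, axis=1)
    print(np.mean((d0 < d1) != z))  # fraction of points the spherical rule mislabels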
