  1. Lecture 8: − Maximum Likelihood Estimation (MLE) (cont'd.) − Maximum a posteriori (MAP) estimation − Naïve Bayes Classifier. Aykut Erdem, March 2016, Hacettepe University

  2. Last time… Flipping a Coin. I have a coin; if I flip it, what's the probability that it will land heads up? Let us flip it a few times to estimate the probability. "Frequency of heads": the estimated probability is 3/5. (slide by Barnabás Póczos & Alex Smola)

  3. Last time… Flipping a Coin. "Frequency of heads": the estimated probability is 3/5. Questions: (1) Why frequency of heads? (2) How good is this estimate? (3) Why is this a machine learning problem? We are going to answer these questions. (slide by Barnabás Póczos & Alex Smola)

  4. Question (1): Why frequency of heads? • Frequency of heads is exactly the maximum likelihood estimator for this problem. • MLE has nice properties (interpretation, statistical guarantees, simple). (slide by Barnabás Póczos & Alex Smola)

  5. MLE for Bernoulli distribution. Data: D = {x_1, …, x_n}, a sequence of flips with P(Heads) = θ, P(Tails) = 1 − θ. Flips are i.i.d.: independent events, identically distributed according to a Bernoulli distribution. MLE: choose the θ that maximizes the probability of the observed data. (slide by Barnabás Póczos & Alex Smola)

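As a quick aside (not on the original slides), here is a minimal Python sketch of this estimator, using the lecture's 3-heads, 2-tails sample:

```python
import numpy as np

# Observed flips: 1 = heads, 0 = tails (the 3-heads / 2-tails sample from the lecture)
flips = np.array([1, 1, 0, 1, 0])

# The Bernoulli MLE is the sample mean, i.e. the frequency of heads
theta_mle = flips.mean()
print(theta_mle)  # 0.6, i.e. 3/5
```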

  9. Maximum Likelihood Estimation. MLE: choose θ that maximizes the probability of the observed data, θ̂_MLE = argmax_θ P(D | θ). Independent draws, identically distributed: P(D | θ) = ∏_i P(x_i | θ) = θ^α_H (1 − θ)^α_T, where α_H and α_T count heads and tails. (slide by Barnabás Póczos & Alex Smola)


  14. Maximum Likelihood Estimation. MLE: choose θ that maximizes the probability of observed data. Solving gives θ̂_MLE = α_H / (α_H + α_T). That's exactly the "frequency of heads". (slide by Barnabás Póczos & Alex Smola)

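The derivation built up across these slides, reconstructed in LaTeX (the slide equations were images and did not survive extraction; this is the standard calculation):

```latex
\hat\theta_{\mathrm{MLE}}
  = \arg\max_{\theta}\; \theta^{\alpha_H}(1-\theta)^{\alpha_T}
  = \arg\max_{\theta}\; \alpha_H \ln\theta + \alpha_T \ln(1-\theta)

\frac{d}{d\theta}\Bigl[\alpha_H \ln\theta + \alpha_T \ln(1-\theta)\Bigr]
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
  \quad\Longrightarrow\quad
  \hat\theta_{\mathrm{MLE}} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```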

  17. Question (2): How good is this MLE estimate? (slide by Barnabás Póczos & Alex Smola)

  18. How many flips do I need? I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails? Both give the same estimate, 3/5. Which estimator should we trust more? The more the merrier??? (slide by Barnabás Póczos & Alex Smola)

  19. Simple bound. Let θ* be the true parameter and θ̂ the MLE. For n = α_H + α_T flips and any ε > 0, Hoeffding's inequality gives: P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2nε²). (slide by Barnabás Póczos & Alex Smola)

  20. Probably Approximately Correct (PAC) Learning. I want to know the coin parameter θ within ε = 0.1 error with probability at least 1 − δ = 0.95. How many flips do I need? Inverting the Hoeffding bound gives the sample complexity: n ≥ ln(2/δ) / (2ε²). (slide by Barnabás Póczos & Alex Smola)
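Plugging the slide's numbers into that sample-complexity formula (a small sketch, not from the slides):

```python
import math

def flips_needed(eps: float, delta: float) -> int:
    """Sample complexity from the Hoeffding bound: n >= ln(2/delta) / (2 * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# The slide's setting: eps = 0.1, confidence 1 - delta = 0.95
print(flips_needed(eps=0.1, delta=0.05))  # 185 flips suffice
```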

  21. Question (3): Why is this a machine learning problem? Because we improve our performance (the accuracy of the predicted probability) at some task (predicting the probability of heads) with experience (the more coins we flip, the better we are). (slide by Barnabás Póczos & Alex Smola)

  22. What about continuous features? Let us try Gaussians… [figure: Gaussian densities N(µ, σ²) with µ = 0 and varying σ², and with varying µ] (slide by Barnabás Póczos & Alex Smola)

  23. MLE for Gaussian mean and variance. Choose θ = (µ, σ²) that maximizes the probability of the observed data. Independent, identically distributed draws give P(D | µ, σ²) = ∏_i N(x_i | µ, σ²), and maximizing yields µ̂_MLE = (1/n) Σ_i x_i and σ̂²_MLE = (1/n) Σ_i (x_i − µ̂)². (slide by Barnabás Póczos & Alex Smola)

  24. MLE for Gaussian mean and variance. Note: the MLE for the variance of a Gaussian is biased [the expected result of the estimation is not the true parameter!]. The unbiased variance estimator divides by n − 1 instead of n: σ̂²_unbiased = (1/(n − 1)) Σ_i (x_i − µ̂)². (slide by Barnabás Póczos & Alex Smola)
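A quick simulation (mine, not from the slides) that makes the bias visible; the true variance here is 4.0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
# Draw many small samples from N(0, 4), i.e. standard deviation 2
samples = rng.normal(0.0, 2.0, size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
sq_dev = np.sum((samples - mu_hat) ** 2, axis=1)

print(np.mean(sq_dev / n))        # ~3.2: the MLE underestimates the true 4.0
print(np.mean(sq_dev / (n - 1)))  # ~4.0: the unbiased estimator recovers it
```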

  25. What about prior knowledge? (MAP Estimation) (slide by Barnabás Póczos & Aarti Singh)

  26. What about prior knowledge? We know the coin is "close" to 50-50. What can we do now? The Bayesian way… Rather than estimating a single θ, we obtain a distribution over possible values of θ. [figure: broad prior centered at 50-50 before data; sharper distribution after data] (slide by Barnabás Póczos & Aarti Singh)

  27. Prior distribution. What prior? What distribution do we want for a prior? • Represents expert knowledge (philosophical approach) • Simple posterior form (engineer's approach). Uninformative priors: • Uniform distribution. Conjugate priors: • Closed-form representation of posterior • P(θ) and P(θ | D) have the same form. (slide by Barnabás Póczos & Aarti Singh)

  28. In order to proceed we will need: Bayes Rule. (slide by Barnabás Póczos & Aarti Singh)

  29. Chain Rule & Bayes Rule. Chain rule: P(A, B) = P(A | B) P(B). Bayes rule: P(B | A) = P(A | B) P(B) / P(A). Bayes rule is important for reverse conditioning. (slide by Barnabás Póczos & Aarti Singh)

  30. Bayesian Learning. Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D). Or equivalently: posterior ∝ likelihood × prior, i.e. P(θ | D) ∝ P(D | θ) P(θ). (slide by Barnabás Póczos & Aarti Singh)

  31. MAP estimation for Binomial distribution. Coin flip problem: the likelihood is Binomial, P(D | θ) ∝ θ^α_H (1 − θ)^α_T. If the prior is a Beta distribution, P(θ) ∝ θ^(β_H − 1) (1 − θ)^(β_T − 1), then the posterior is a Beta distribution, Beta(β_H + α_H, β_T + α_T). P(θ) and P(θ | D) have the same form! [Conjugate prior] (slide by Barnabás Póczos & Aarti Singh)
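A minimal sketch of this conjugate update in Python; the Beta(5, 5) prior encoding "close to 50-50" is an illustrative choice, not a value from the lecture:

```python
from scipy.stats import beta

beta_H, beta_T = 5, 5            # hypothetical prior pseudo-counts, roughly 50-50
alpha_H, alpha_T = 3, 2          # observed data: 3 heads, 2 tails

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior
posterior = beta(beta_H + alpha_H, beta_T + alpha_T)

# MAP estimate is the posterior mode: (a - 1) / (a + b - 2)
theta_map = (beta_H + alpha_H - 1) / (beta_H + alpha_H + beta_T + alpha_T - 2)
print(theta_map)         # 0.538..., pulled toward 0.5 by the prior (the MLE was 0.6)
print(posterior.mean())  # 0.533...
```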

  32. Beta distribution: P(θ; α, β) ∝ θ^(α − 1) (1 − θ)^(β − 1). More concentrated as the values of α, β increase. (slide by Barnabás Póczos & Aarti Singh)

  33. Beta conjugate prior. As n = α_H + α_T increases, i.e. as we get more samples, the effect of the prior is "washed out". (slide by Barnabás Póczos & Aarti Singh)
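To see the washout numerically (a sketch with made-up counts that keep the heads ratio fixed at 60%):

```python
from scipy.stats import beta

prior_H, prior_T = 5, 5   # same illustrative Beta(5, 5) prior as above
for n_heads, n_tails in [(3, 2), (30, 20), (300, 200)]:
    post = beta(prior_H + n_heads, prior_T + n_tails)
    print(n_heads + n_tails, round(post.mean(), 3))
# 5    0.533  <- close to the prior mean 0.5
# 50   0.583
# 500  0.598  <- approaching the MLE 0.6
```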

  34. Han Solo and Bayesian Priors. C3PO: "Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!" Han: "Never tell me the odds!" https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors

  35. MLE vs. MAP. Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of observed data. Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief. When is MAP the same as MLE? (slide by Barnabás Póczos & Aarti Singh)
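Side by side in LaTeX (the slide's formulas were images; these are the standard definitions). Note that with a uniform prior, P(θ) is constant and the two estimates coincide:

```latex
\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta} P(D \mid \theta)
\qquad
\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D)
                          = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)
```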

  36. From Binomial to Multinomial. Example: dice roll problem (6 outcomes instead of 2). The likelihood is Multinomial(θ = {θ_1, θ_2, ..., θ_k}). If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution: for the Multinomial, the conjugate prior is the Dirichlet distribution. http://en.wikipedia.org/wiki/Dirichlet_distribution (slide by Barnabás Póczos & Aarti Singh)
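The same conjugate update for the dice, sketched in Python with a hypothetical uniform Dirichlet(1, …, 1) prior and made-up roll counts:

```python
import numpy as np

prior = np.ones(6)                           # Dirichlet(1, ..., 1): uniform over the simplex
counts = np.array([10, 12, 8, 11, 9, 10])    # hypothetical observed roll counts

# Conjugacy: Dirichlet prior + Multinomial likelihood -> Dirichlet posterior
posterior = prior + counts
print(posterior / posterior.sum())           # posterior mean of each face probability
```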

  37. Bayesians vs. Frequentists. Frequentist to Bayesian: "You give a different answer for different priors." Bayesian to Frequentist: "You are no good when the sample is small." (slide by Barnabás Póczos & Aarti Singh)

  38. Recap: What about prior knowledge? (MAP Estimation) (slide by Barnabás Póczos & Aarti Singh)

  39. Recap: What about prior knowledge? We know the coin is "close" to 50-50. What can we do now? The Bayesian way… Rather than estimating a single θ, we obtain a distribution over possible values of θ. [figure: broad prior centered at 50-50 before data; sharper distribution after data] (slide by Barnabás Póczos & Aarti Singh)

  40. Recap: Chain Rule & Bayes Rule. Chain rule: P(A, B) = P(A | B) P(B). Bayes rule: P(B | A) = P(A | B) P(B) / P(A). (slide by Barnabás Póczos & Aarti Singh)

  41. Recap: Bayesian Learning. D is the measured data; our goal is to estimate the parameter θ. Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D). Or equivalently: posterior ∝ likelihood × prior. (slide by Barnabás Póczos & Aarti Singh)

  42. Recap: MAP estimation for Binomial distribution. In the coin flip problem the likelihood is Binomial: P(D | θ) ∝ θ^α_H (1 − θ)^α_T. If the prior is Beta: P(θ) ∝ θ^(β_H − 1) (1 − θ)^(β_T − 1), then the posterior is a Beta distribution: Beta(β_H + α_H, β_T + α_T). (slide by Barnabás Póczos & Aarti Singh)
