
MLE vs. MAP
Aarti Singh, Machine Learning 10-701/15-781, Sept 15, 2010



1. MLE vs. MAP
Aarti Singh, Machine Learning 10-701/15-781, Sept 15, 2010

2. MLE vs. MAP
• Maximum likelihood estimation (MLE): choose the parameter value that maximizes the probability of the observed data.
• Maximum a posteriori (MAP) estimation: choose the parameter value that is most probable given the observed data and a prior belief.
When is MAP the same as MLE?
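
In symbols, writing θ for the parameter and D for the observed data (a standard formulation of the two estimators):

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D \mid \theta) \qquad\qquad \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)$$

Since the two argmax problems differ only by the factor P(θ), MAP coincides with MLE exactly when the prior is uniform (constant in θ).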

3. MAP using Conjugate Prior
Coin flip problem: the likelihood is Binomial. If the prior is a Beta distribution, then the posterior is also a Beta distribution. For the Binomial likelihood, the conjugate prior is the Beta distribution.
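
Written out (the standard Beta-Binomial conjugacy, with h heads and t tails observed and prior Beta(α, β)):

$$P(\theta \mid D) \;\propto\; \underbrace{\theta^{h}(1-\theta)^{t}}_{\text{likelihood}} \cdot \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \;\;\Rightarrow\;\; \theta \mid D \sim \mathrm{Beta}(\alpha + h,\; \beta + t)$$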

4. MLE vs. MAP
What if we toss the coin too few times?
• You say: probability the next toss is a head = 0
• Billionaire says: You're fired! ...with prob 1
• A Beta prior is equivalent to extra coin flips (regularization)
• As n → ∞, the prior is "forgotten"
• But for small sample sizes, the prior is important!
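
A minimal sketch of this effect (hypothetical counts; a Beta(2, 2) prior acts like one extra head and one extra tail):

```python
# MLE vs. MAP for a coin: the Beta prior acts like extra (pseudo) coin flips.
heads, tails = 0, 3            # hypothetical small sample: three tails, no heads
alpha_h, alpha_t = 2, 2        # Beta(2, 2) prior

theta_mle = heads / (heads + tails)  # = 0.0 -> "probability next toss is a head = 0"
# MAP is the mode of the Beta(alpha_h + heads, alpha_t + tails) posterior:
theta_map = (heads + alpha_h - 1) / (heads + tails + alpha_h + alpha_t - 2)  # = 0.2

print(f"MLE: {theta_mle:.2f}, MAP: {theta_map:.2f}")
# As heads + tails grows, the pseudo-counts are swamped and MAP -> MLE.
```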

5. Bayesians vs. Frequentists
Bayesian, to the frequentist: "You are no good when the sample is small."
Frequentist, to the Bayesian: "You give a different answer for different priors."

6. What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians... X ~ N(μ, σ²)
[Figure: Gaussian density curves with μ = 0 and two different variances σ²]
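
The density in question (standard form):

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$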

7. Gaussian distribution
Data: D = sleep hours
[Figure: observed sleep hours plotted on an axis from 3 to 9]
• Parameters: μ (mean), σ² (variance)
• Sleep hours are i.i.d.:
  – independent events
  – identically distributed according to a Gaussian distribution
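
Under the i.i.d. assumption the likelihood factorizes (standard; writing D = {x_1, ..., x_n}):

$$P(D \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$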

8. Properties of Gaussians
• Affine transformation (multiplying by a scalar and adding a constant):
  – X ~ N(μ, σ²)
  – Y = aX + b ⇒ Y ~ N(aμ + b, a²σ²)
• Sum of independent Gaussians:
  – X ~ N(μ_X, σ²_X)
  – Y ~ N(μ_Y, σ²_Y)
  – Z = X + Y ⇒ Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
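
A quick numerical sanity check of both properties (a sketch using NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Affine transformation: X ~ N(2, 3^2), so Y = aX + b should be N(a*2 + b, a^2 * 3^2)
a, b = 4.0, -1.0
x = rng.normal(loc=2.0, scale=3.0, size=n)
y = a * x + b
print(y.mean(), y.var())   # ~ 7.0 and ~ 144.0 (= 4^2 * 3^2)

# Sum of independent Gaussians: means and variances add
u = rng.normal(loc=1.0, scale=2.0, size=n)
v = rng.normal(loc=5.0, scale=1.0, size=n)
z = u + v
print(z.mean(), z.var())   # ~ 6.0 and ~ 5.0 (= 2^2 + 1^2)
```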

9. MLE for Gaussian mean and variance

10. MLE for Gaussian mean and variance
Note: the MLE for the variance of a Gaussian is biased.
– The expected result of the estimation is not the true parameter!
– An unbiased variance estimator divides by n − 1 instead of n.
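
For n i.i.d. samples x_1, ..., x_n, the estimators in question are the standard closed forms:

$$\hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 \qquad \mathbb{E}\big[\hat{\sigma}^2_{\mathrm{MLE}}\big] = \frac{n-1}{n}\,\sigma^2$$

$$\hat{\sigma}^2_{\mathrm{unbiased}} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$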

11. MAP for Gaussian mean and variance
• Conjugate priors:
  – Mean: Gaussian prior
  – Variance: Wishart distribution (on the precision; inverse-Gamma on a scalar variance)
• Prior for the mean: μ ~ N(η, λ²)

12. MAP for Gaussian Mean (assuming known variance σ²)
MAP under a Gauss-Wishart prior: homework.
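
With prior μ ~ N(η, λ²) and known σ², the posterior over μ is Gaussian, so its mode equals its mean (a standard derivation):

$$\hat{\mu}_{\mathrm{MAP}} = \frac{\eta/\lambda^2 + \sum_i x_i/\sigma^2}{1/\lambda^2 + n/\sigma^2} = \frac{\sigma^2\,\eta + \lambda^2 \sum_{i=1}^{n} x_i}{\sigma^2 + n\,\lambda^2}$$

As n → ∞ the data terms dominate and the estimate approaches the sample mean, i.e. the MLE, matching the "prior is forgotten" remark earlier.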

13. What you should know...
• Learning parametric distributions: form known, parameters unknown
  – Bernoulli (θ, the probability of heads)
  – Gaussian (μ, mean, and σ², variance)
• MLE
• MAP

14. What loss function are we minimizing?
• Learning distributions/densities
  – Unsupervised learning (form of P known, except θ)
• Task: learn P(X; θ)
• Experience: D = {x_1, ..., x_n}
• Performance: negative log-likelihood loss
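
Spelled out, minimizing this loss over D is exactly maximum likelihood (a standard identity, since log is monotone):

$$\hat{\theta} = \arg\min_{\theta} \left( -\sum_{i=1}^{n} \log P(x_i \mid \theta) \right) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i \mid \theta)$$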

15. Recitation tomorrow!
• Linear algebra and Matlab
• Strongly recommended!!
• Place: NSH 1507 (note: change from last time)
• Time: 5-6 pm
Leman

16. Bayes Optimal Classifier
Aarti Singh, Machine Learning 10-701/15-781, Sept 15, 2010

17. Classification
Goal: map features X to labels Y (e.g., classify a document as Sports, Science, or News).
Performance measure: probability of error.
[Figure: documents with features X assigned to the labels Sports / Science / News]

18. Optimal Classification
Optimal predictor: the Bayes classifier, whose error is the Bayes risk.
• Even the optimal classifier makes mistakes: R(f*) > 0
• The optimal classifier depends on the unknown distribution
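
In the usual notation, the optimal predictor and its risk are:

$$f^{*} = \arg\min_{f} P(f(X) \neq Y), \qquad f^{*}(x) = \arg\max_{y} P(Y = y \mid X = x)$$

with the Bayes risk defined as $R(f^{*}) = P(f^{*}(X) \neq Y)$, the smallest error any classifier can achieve.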

19. Optimal Classifier
Bayes rule lets us rewrite the optimal classifier in terms of the class prior and the class conditional density.
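
Applying Bayes rule and dropping the denominator P(X = x), which does not depend on y:

$$P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)} \;\Rightarrow\; f^{*}(x) = \arg\max_{y}\ \underbrace{P(X = x \mid Y = y)}_{\text{class conditional density}}\ \underbrace{P(Y = y)}_{\text{class prior}}$$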

20. Example Decision Boundaries
• Gaussian class conditional densities (1 dimension/feature)
[Figure: two 1-D Gaussian class conditional densities and the resulting decision boundary]

21. Example Decision Boundaries
• Gaussian class conditional densities (2 dimensions/features)
[Figure: two 2-D Gaussian class conditional densities and the resulting decision boundary]
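
A sketch of the 1-D case (hypothetical means, priors, and a shared variance; with equal variances the boundary has a closed form):

```python
import numpy as np

# Two classes with Gaussian class conditionals sharing variance sigma^2.
mu0, mu1, sigma = 2.0, 5.0, 1.5        # hypothetical class-conditional parameters
p0, p1 = 0.7, 0.3                      # hypothetical class priors

# The Bayes classifier picks argmax_y P(x|y) P(y).  Setting
# p0 * N(x; mu0, sigma^2) = p1 * N(x; mu1, sigma^2) and solving for x
# gives a single threshold when the variances are equal:
x_star = (mu0 + mu1) / 2 + sigma**2 * np.log(p0 / p1) / (mu1 - mu0)
print(f"decision boundary at x = {x_star:.3f}")   # predict class 1 for x > x_star
```

Note how the prior shifts the boundary away from the midpoint (mu0 + mu1)/2 toward the less probable class.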

22. Learning the Optimal Classifier
Optimal classifier: f*(x) = argmax_y P(X = x | Y = y) P(Y = y)
Need to know:
• the class prior P(Y = y) for all y
• the class conditional density (likelihood) P(X = x | Y = y) for all x, y

23. Learning the Optimal Classifier
Task: predict whether or not a picnic spot is enjoyable.
Training data: X = (X_1, X_2, X_3, ..., X_d) and Y, with n rows.
Let's learn P(Y|X). How many parameters?
• Prior P(Y = y) for all y: K − 1, if there are K labels
• Likelihood P(X = x | Y = y) for all x, y: (2^d − 1)K, if there are d binary features

24. Learning the Optimal Classifier
Task: predict whether or not a picnic spot is enjoyable.
Training data: X = (X_1, X_2, X_3, ..., X_d) and Y, with n rows.
Let's learn P(Y|X). How many parameters? 2^d K − 1 in total (K classes, d binary features).
Need n >> 2^d K − 1 training examples to learn all the parameters.
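
To make the blow-up concrete (an illustrative count, not from the slides): with K = 2 classes and d = 30 binary features,

$$2^{30} \cdot 2 - 1 = 2{,}147{,}483{,}647,$$

i.e. over two billion parameters, and hence far more training examples than any realistic dataset provides.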
