

SLIDE 1

MLE vs. MAP

Aarti Singh

Machine Learning 10-701/15-781 Sept 15, 2010

SLIDE 2

MLE vs. MAP


When is MAP same as MLE?

  • Maximum Likelihood estimation (MLE)
    – Choose value that maximizes the probability of observed data
  • Maximum a posteriori (MAP) estimation
    – Choose value that is most probable given observed data and prior belief
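
In symbols (standard notation, not copied from the slide; θ is the parameter, D the observed data), and note that MAP coincides with MLE when the prior P(θ) is uniform:

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \, P(D \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, P(\theta \mid D)
                            = \arg\max_{\theta} \, P(D \mid \theta)\, P(\theta)
```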

SLIDE 3

MAP using Conjugate Prior

  • Coin flip problem
  • Likelihood is ~ Binomial
  • If prior is Beta distribution, then posterior is Beta distribution
  • For Binomial, conjugate prior is Beta distribution
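
A sketch of the update being described, in standard Beta–Binomial notation (the symbols α_H, α_T for observed heads/tails and β_H, β_T for the prior's pseudo-counts are my labels, not necessarily the slide's):

```latex
P(\theta) = \mathrm{Beta}(\beta_H, \beta_T) \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1},
\qquad
P(D \mid \theta) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T}
```
```latex
P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1}
\;\;\Rightarrow\;\;
\theta \mid D \sim \mathrm{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)
```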


SLIDE 4

MLE vs. MAP


  • Beta prior equivalent to extra coin flips (regularization)
  • As n → ∞, prior is “forgotten”
  • But, for small sample size, prior is important!

What if we toss the coin too few times?

  • You say: Probability next toss is a head = 0
  • Billionaire says: You’re fired! …with prob 1
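
A minimal numeric sketch of this point (my own illustration, not from the slides; the Beta(2, 2) prior is an arbitrary choice): with only a few tosses the MLE can hit 0, while the prior keeps the MAP estimate away from the extremes.

```python
import numpy as np

# Hypothetical small sample: 3 tosses, all tails (0 = tail, 1 = head)
tosses = np.array([0, 0, 0])
n_heads, n_tails = tosses.sum(), len(tosses) - tosses.sum()

# MLE: fraction of heads observed
theta_mle = n_heads / len(tosses)          # = 0.0, "next toss is a head with prob 0"

# MAP with a Beta(2, 2) prior: acts like one extra pseudo-head and one extra pseudo-tail
beta_h, beta_t = 2, 2
theta_map = (n_heads + beta_h - 1) / (n_heads + beta_h + n_tails + beta_t - 2)

print(f"MLE estimate: {theta_mle:.3f}")    # 0.000
print(f"MAP estimate: {theta_map:.3f}")    # 0.200 -- prior pulls the estimate off 0
```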

SLIDE 5

Bayesians vs. Frequentists


  • Bayesian to frequentist: “You are no good when the sample is small.”
  • Frequentist to Bayesian: “You give a different answer for different priors.”

SLIDE 6

What about continuous variables?


  • Billionaire says: If I am measuring a continuous variable, what can you do for me?
  • You say: Let me tell you about Gaussians…

P(x | μ, σ²) = (1 / (σ√(2π))) · exp( −(x − μ)² / (2σ²) )  =  N(μ, σ²)

[Figure: Gaussian density curves with mean μ = 0 and two different variances σ²]

SLIDE 7

Gaussian distribution


Data, D = {x₁, x₂, …, xₙ}  (n observed sleep-hour measurements)

  • Parameters: μ – mean, σ² – variance
  • Sleep hrs are i.i.d.:
    – Independent events
    – Identically distributed according to Gaussian distribution

[Figure: observed sleep-hour data points marked on a number line, roughly 3–9 hrs]

SLIDE 8

Properties of Gaussians


  • Affine transformation (multiplying by a scalar and adding a constant)
    – X ~ N(μ, σ²)
    – Y = aX + b  ⟹  Y ~ N(aμ + b, a²σ²)

  • Sum of (independent) Gaussians
    – X ~ N(μ_X, σ²_X)
    – Y ~ N(μ_Y, σ²_Y)
    – Z = X + Y  ⟹  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
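
A quick empirical check of both properties (illustrative only; the constants a, b and the means and variances are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Affine transformation: X ~ N(2, 3^2), so Y = aX + b should be N(a*2 + b, a^2 * 9)
a, b = 1.5, -4.0
X = rng.normal(loc=2.0, scale=3.0, size=n)
Y = a * X + b
print(Y.mean(), Y.var())        # ≈ -1.0 and 20.25

# Sum of independent Gaussians: means and variances add
X1 = rng.normal(loc=1.0, scale=2.0, size=n)   # N(1, 4)
X2 = rng.normal(loc=-3.0, scale=1.0, size=n)  # N(-3, 1)
Z = X1 + X2
print(Z.mean(), Z.var())        # ≈ -2.0 and 5.0
```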

SLIDE 9

MLE for Gaussian mean and variance
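
The derivations on this slide did not survive extraction; for D = {x₁, …, xₙ} drawn i.i.d. from N(μ, σ²), the standard MLE results they lead to are:

```latex
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - \hat{\mu}_{\mathrm{MLE}}\bigr)^2
```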


SLIDE 10

MLE for Gaussian mean and variance

Note: MLE for the variance of a Gaussian is biased

  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:  σ̂²_unbiased = (1 / (n − 1)) · Σᵢ (xᵢ − μ̂)²
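
A small numeric illustration of the bias (my own example): np.var uses the 1/n (MLE) form by default, and passing ddof=1 gives the 1/(n − 1) unbiased form.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_var, n = 7.0, 4.0, 5      # a tiny sample size exaggerates the bias

# Average each estimator over many repeated small samples
trials = 100_000
mle_est = np.empty(trials)
unbiased_est = np.empty(trials)
for t in range(trials):
    x = rng.normal(true_mu, np.sqrt(true_var), size=n)
    mle_est[t] = np.var(x)               # divides by n     -> biased low
    unbiased_est[t] = np.var(x, ddof=1)  # divides by n - 1 -> unbiased

print(mle_est.mean())       # ≈ (n-1)/n * true_var = 3.2
print(unbiased_est.mean())  # ≈ true_var = 4.0
```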


SLIDE 11

MAP for Gaussian mean and variance


  • Conjugate priors
    – Mean: Gaussian prior
    – Variance: Wishart distribution
  • Prior for mean:  P(μ) = N(η, λ²)

SLIDE 12

MAP for Gaussian Mean


MAP under Gauss-Wishart prior - Homework

(Assuming known variance σ²)
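
The slide's formula did not extract; assuming the prior P(μ) = N(η, λ²) from slide 11 and known variance σ², the standard posterior-mode result is:

```latex
\hat{\mu}_{\mathrm{MAP}} = \frac{\sigma^2\,\eta + \lambda^2 \sum_{i=1}^{n} x_i}{\sigma^2 + n\,\lambda^2}
```

As n grows, the data term dominates and this approaches the sample mean (the MLE), matching the "prior is forgotten" point from slide 4.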

SLIDE 13

What you should know…

  • Learning parametric distributions: form known, parameters unknown
    – Bernoulli (θ, probability of flip)
    – Gaussian (μ, mean and σ², variance)

  • MLE
  • MAP


SLIDE 14

What loss function are we minimizing?

  • Learning distributions/densities – Unsupervised learning
  • Task: Learn P(X; θ)  (know form of P, except θ)
  • Experience: D = {x₁, x₂, …, xₙ}
  • Performance: Negative log-likelihood loss
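
Written out (standard form, not recovered from the slide), minimizing the negative log-likelihood over D is exactly maximum likelihood estimation:

```latex
\hat{\theta} = \arg\min_{\theta} \; -\sum_{i=1}^{n} \log P(x_i ; \theta)
             = \arg\max_{\theta} \; \prod_{i=1}^{n} P(x_i ; \theta)
```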

SLIDE 15

Recitation Tomorrow!

  • Linear Algebra and Matlab
  • Strongly recommended!!
  • Place: NSH 1507 (Note: change from last time)
  • Time: 5-6 pm


Leman

SLIDE 16

Bayes Optimal Classifier

Aarti Singh

Machine Learning 10-701/15-781 Sept 15, 2010

SLIDE 17


Goal: Classification

  • Features, X
  • Labels, Y (e.g., Sports / Science / News)
  • Performance measure: Probability of Error

SLIDE 18

Optimal Classification

Optimal predictor: (Bayes classifier)


  • Even the optimal classifier makes mistakes: R(f*) > 0
  • Optimal classifier depends on the unknown distribution

Bayes risk
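
The equations on this slide did not extract; under 0/1 loss the standard definitions are:

```latex
f^*(x) = \arg\max_{y} \, P(Y = y \mid X = x),
\qquad
R(f) = P\bigl(f(X) \neq Y\bigr),
\qquad
R^* = R(f^*) \;\; \text{(Bayes risk)}
```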

SLIDE 19

Optimal Classifier

Bayes Rule: Optimal classifier:


Class conditional density Class prior
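
A reconstruction of the missing equations (standard forms): Bayes rule expresses the posterior in terms of the class-conditional density and the class prior named above, and the denominator can be dropped inside the argmax:

```latex
P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{P(X = x)}
\qquad\Rightarrow\qquad
f^*(x) = \arg\max_{y} \, P(X = x \mid Y = y)\,P(Y = y)
```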

SLIDE 20

Example Decision Boundaries

  • Gaussian class conditional densities (1-dimension/feature)


[Figure: two 1-D Gaussian class-conditional densities and the resulting decision boundary]
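
A small sketch of how such a boundary can be computed (illustrative parameters only, not the slide's): the boundary is where the prior-weighted class-conditional densities are equal.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities and priors
mu0, sigma0, prior0 = 0.0, 1.0, 0.5     # class 0
mu1, sigma1, prior1 = 3.0, 1.0, 0.5     # class 1

# Decision boundary: x where prior0 * p(x|y=0) == prior1 * p(x|y=1)
g = lambda x: prior0 * norm.pdf(x, mu0, sigma0) - prior1 * norm.pdf(x, mu1, sigma1)
boundary = brentq(g, mu0, mu1)
print(boundary)   # 1.5 here: equal priors and variances -> midpoint of the means
```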

SLIDE 21

Example Decision Boundaries

  • Gaussian class conditional densities (2-dimensions/features)


[Figure: contours of two 2-D Gaussian class-conditional densities and the resulting decision boundary]

SLIDE 22

Learning the Optimal Classifier

Optimal classifier:


Need to know:
  – Prior: P(Y = y) for all y  (class prior)
  – Likelihood: P(X = x | Y = y) for all x, y  (class-conditional density)

SLIDE 23

Learning the Optimal Classifier

Task: Predict whether or not a picnic spot is enjoyable

Let’s learn P(Y|X) – how many parameters?


Training data: n rows of X = (X₁ X₂ X₃ … X_d) with label Y

  – Prior: P(Y = y) for all y  →  K − 1 parameters if K labels
  – Likelihood: P(X = x | Y = y) for all x, y  →  (2^d − 1)K parameters if d binary features
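
A worked instance of the count (my own numbers, not the slide's), with K = 2 labels and d = 30 binary features:

```latex
\underbrace{(K-1)}_{\text{prior}} + \underbrace{(2^d - 1)K}_{\text{likelihood}}
 = 1 + (2^{30}-1)\cdot 2 = 2^{31} - 1 \approx 2.1 \times 10^{9}\ \text{parameters}
```

The total simplifies to 2^d·K − 1, which is exactly the figure quoted on the next slide.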

SLIDE 24

Learning the Optimal Classifier

Task: Predict whether or not a picnic spot is enjoyable

Let’s learn P(Y|X) – how many parameters?


Training data: n rows of X = (X₁ X₂ X₃ … X_d) with label Y

  – Total: 2^d·K − 1 parameters (K classes, d binary features)
  – Need n >> 2^d·K − 1 training examples to learn all parameters