Introduction to Machine Learning CMU-10701: 2. MLE, MAP, Bayes classification
SLIDE 1

Introduction to Machine Learning CMU-10701

  • 2. MLE, MAP, Bayes classification

Barnabás Póczos & Aarti Singh, Spring 2014

SLIDE 2

Administration


http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/index.html

  • Blackboard manager & peer grading: Dani
  • Webpage manager and Autolab: Pulkit
  • Camera man: Pengtao
  • Homework manager: Jit
  • Piazza manager: Prashant
  • Recitation: Wean 7500, 6pm-7pm, on Wednesdays

SLIDE 3

Outline

Theory:

Probabilities:

  • Dependence, Independence, Conditional Independence

Parameter estimation:

  • Maximum Likelihood Estimation (MLE)
  • Maximum a posteriori (MAP)

Bayes rule

  • Naïve Bayes Classifier

Application:

Naive Bayes Classifier for


  • Spam filtering
  • “Mind reading” = fMRI data processing
SLIDE 4

Independence

SLIDE 5

Independence


X and Y don’t contain information about each other: observing Y doesn’t help predict X, and observing X doesn’t help predict Y.

Examples:

  • Independent: winning on roulette this week and next week.
  • Dependent: Russian roulette.

Independent random variables: P(X, Y) = P(X) P(Y)

SLIDE 6

Dependent / Independent

[Figure: scatter plots of (X, Y) pairs, one with independent X, Y and one with dependent X, Y]


SLIDE 7

Conditionally Independent


Conditionally independent: knowing Z makes X and Y independent.

Examples:

  • Dependent: shoe size and reading skills. Conditionally independent: shoe size and reading skills given … age.
  • Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

SLIDE 8

Conditionally Independent

London taxi drivers: a survey pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving. Finally, another study pointed out that people wear coats when it rains…

SLIDE 9

Correlation ≠ Causation

xkcd.com


SLIDE 10

Conditional Independence

Formally: X is conditionally independent of Y given Z:

P(X | Y, Z) = P(X | Z)

Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)


Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn’t give more info about Thunder.

SLIDE 11

Conditional vs. Marginal Independence

C calls A and B separately and tells them a number n ∈ {1,…,10}. Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was. A thinks the number was na and B thinks it was nb.

Are na and nb marginally independent?
  • No; we expect e.g. P(na = 1 | nb = 1) > P(na = 1)

Are na and nb conditionally independent given n?
  • Yes, because if we know the true number, the outcomes na and nb are purely determined by the noise in each phone: P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

[Figure: graphical model in which n is a parent of both na and nb]
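To make the contrast concrete, here is a small simulation sketch (the code, including the flip_prob noise model, is my own illustration, not from the slides): the marginal probabilities differ, while the conditional ones match.

```python
import random

random.seed(0)

def noisy_hearing(n, flip_prob=0.3):
    # With probability flip_prob the listener mishears a uniform random number.
    return random.randint(1, 10) if random.random() < flip_prob else n

samples = []
for _ in range(200_000):
    n = random.randint(1, 10)  # the number C announces
    samples.append((n, noisy_hearing(n), noisy_hearing(n)))  # (n, na, nb)

# Marginal check: P(na=1 | nb=1) vs P(na=1) -- these differ (dependence).
p_na1 = sum(na == 1 for _, na, _ in samples) / len(samples)
given_nb1 = [na for _, na, nb in samples if nb == 1]
p_na1_nb1 = sum(na == 1 for na in given_nb1) / len(given_nb1)

# Conditional check: P(na=1 | nb=1, n=2) vs P(na=1 | n=2) -- these match.
given_n2 = [(na, nb) for n, na, nb in samples if n == 2]
p_na1_n2 = sum(na == 1 for na, _ in given_n2) / len(given_n2)
given_n2_nb1 = [na for na, nb in given_n2 if nb == 1]
p_na1_n2_nb1 = sum(na == 1 for na in given_n2_nb1) / len(given_n2_nb1)

print(f"P(na=1) = {p_na1:.3f},  P(na=1|nb=1) = {p_na1_nb1:.3f}")
print(f"P(na=1|n=2) = {p_na1_n2:.3f},  P(na=1|n=2,nb=1) = {p_na1_n2_nb1:.3f}")
```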

SLIDE 12

Parameter estimation: MLE, MAP

Estimating Probabilities


Our first machine learning problem:

SLIDE 13

Flipping a Coin

I have a coin; if I flip it, what’s the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability: say we see 3 heads in 5 flips.

The estimated probability is 3/5, the “frequency of heads”.

SLIDE 14

Flipping a Coin

The estimated probability is 3/5, the “frequency of heads”.

Questions:

  (1) Why frequency of heads???
  (2) How good is this estimation???
  (3) Why is this a machine learning problem???

We are going to answer these questions.

SLIDE 15

Question (1)


Why frequency of heads???

  • Frequency of heads is exactly the maximum likelihood estimator for this problem
  • MLE has nice properties (interpretation, statistical guarantees, simple)

SLIDE 16

Maximum Likelihood Estimation


SLIDE 17

MLE for Bernoulli distribution

Data, D = the observed sequence of flips

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:

  – Independent events
  – Identically distributed according to Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data:

θ̂_MLE = argmax_θ P(D | θ)

SLIDE 18

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data:

P(D | θ) = ∏ᵢ P(xᵢ | θ)          [independent draws]
         = θ^(αH) (1-θ)^(αT)      [identically distributed]

where αH is the number of heads and αT the number of tails.

SLIDE 19

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data:

θ̂ = argmax_θ ln P(D | θ) = argmax_θ [ αH ln θ + αT ln(1-θ) ]

Setting the derivative to zero, αH/θ - αT/(1-θ) = 0, gives

θ̂ = αH / (αH + αT)

That’s exactly the “frequency of heads”.
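As a sanity check, a minimal sketch of this result (my own code, not from the course): the closed-form maximizer matches a brute-force search over the log-likelihood.

```python
import math

def bernoulli_log_likelihood(theta, heads, tails):
    # ln P(D | theta) = aH * ln(theta) + aT * ln(1 - theta)
    return heads * math.log(theta) + tails * math.log(1 - theta)

heads, tails = 3, 2
theta_mle = heads / (heads + tails)   # closed form: frequency of heads = 0.6

# Grid search over (0, 1) recovers the same maximizer.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, heads, tails))
print(theta_mle, theta_grid)          # 0.6 0.6
```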

SLIDE 20

Question (2)


How good is this MLE estimation???

SLIDE 21

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails.

What if I flipped 30 heads and 20 tails?

  • Which estimator should we trust more?
  • The more the merrier???

SLIDE 22

Simple bound

Let θ* be the true parameter. For n = αH + αT flips and any ε > 0, Hoeffding’s inequality gives the simple bound:

P(|θ̂ - θ*| ≥ ε) ≤ 2 exp(-2nε²)

SLIDE 23

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1-δ = 0.95. How many flips do I need?

Sample complexity: require 2 exp(-2nε²) ≤ δ, which gives n ≥ ln(2/δ) / (2ε²).
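A one-function sketch of this computation (my own code; the function name coin_sample_complexity is mine):

```python
import math

def coin_sample_complexity(eps, delta):
    # Smallest n with 2 * exp(-2 * n * eps**2) <= delta.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(coin_sample_complexity(eps=0.1, delta=0.05))   # 185 flips
```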

SLIDE 24

Question (3)


Why is this a machine learning problem???

It fits the definition of learning: algorithms improve their performance (the accuracy of the predicted probability) at some task (predicting the probability of heads) with experience (the more coins we flip, the better we are).

SLIDE 25

What about continuous features?

Let us try Gaussians…

[Figure: Gaussian densities with mean µ = 0 and variance σ²; a number line of observed continuous values]

SLIDE 26

MLE for Gaussian mean and variance

Choose θ = (µ, σ²) that maximizes the probability of observed data:

P(D | θ) = ∏ᵢ P(xᵢ | θ)    [independent, identically distributed draws]

with P(x | µ, σ²) = (1 / √(2πσ²)) exp(-(x-µ)² / (2σ²)).

SLIDE 27

MLE for Gaussian mean and variance

The maximizers are:

µ̂_MLE = (1/n) Σᵢ xᵢ

σ̂²_MLE = (1/n) Σᵢ (xᵢ - µ̂)²

Note: the MLE for the variance of a Gaussian is biased [the expected result of the estimation is not the true parameter!]

Unbiased variance estimator:

σ̂²_unbiased = (1/(n-1)) Σᵢ (xᵢ - µ̂)²
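A minimal sketch of both estimators (my own code; the sample values are made up):

```python
def gaussian_mle(xs):
    n = len(xs)
    mu = sum(xs) / n                                   # MLE of the mean
    var_mle = sum((x - mu) ** 2 for x in xs) / n       # biased: E = (n-1)/n * sigma^2
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_mle, var_unbiased

print(gaussian_mle([3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]))
# mu = 6.0, var_mle = 4.0, var_unbiased ~ 4.667
```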

SLIDE 28

What about prior knowledge? (MAP Estimation)

SLIDE 29

What about prior knowledge?

We know the coin is “close” to 50-50. What can we do now?

The Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: prior distribution concentrated near 50-50 before data; posterior distribution after data]

SLIDE 30

Prior distribution

What prior? What distribution do we want for a prior?

  • Represents expert knowledge (philosophical approach)
  • Simple posterior form (engineer’s approach)

Uninformative priors:

  • Uniform distribution

Conjugate priors:

  • Closed-form representation of posterior
  • P(θ) and P(θ|D) have the same form


SLIDE 31

Bayes Rule


In order to proceed we will need Bayes rule.

SLIDE 32

Chain Rule & Bayes Rule


Chain rule: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y)

Bayes rule is important for reverse conditioning.
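To make reverse conditioning concrete, here is a tiny worked example (the numbers and the spam setting are my own illustration, chosen to foreshadow the spam-filtering application; they are not from the slides):

```python
# Suppose P(spam) = 0.2, and a word appears with P(word|spam) = 0.5
# and P(word|ham) = 0.05. We observe the word; what is P(spam|word)?
p_spam = 0.2
p_word_given_spam = 0.5
p_word_given_ham = 0.05

# Law of total probability (chain rule) for the denominator:
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes rule: P(spam|word) = P(word|spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)   # ~0.714
```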

SLIDE 33

Bayesian Learning

  • Use Bayes rule:  P(θ | D) = P(D | θ) P(θ) / P(D)
  • Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)

    posterior ∝ likelihood × prior

SLIDE 34

MLE vs. MAP

  • Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of observed data

    θ̂_MLE = argmax_θ P(D | θ)

  • Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief

    θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

When is MAP the same as MLE? (When the prior P(θ) is uniform, the two coincide.)

SLIDE 35

MAP estimation for Binomial distribution

Coin flip problem: the likelihood is Binomial

P(D | θ) = C(n, αH) θ^(αH) (1-θ)^(αT)

If the prior is a Beta distribution,

P(θ) = θ^(βH-1) (1-θ)^(βT-1) / B(βH, βT)

⇒ the posterior is a Beta distribution. Here B(βH, βT) is the Beta function, the normalizing constant of the Beta distribution.

SLIDE 36

MAP estimation for Binomial distribution

Likelihood is Binomial: P(D | θ) ∝ θ^(αH) (1-θ)^(αT)

Prior is Beta distribution: P(θ) ∝ θ^(βH-1) (1-θ)^(βT-1)

⇒ Posterior is Beta distribution: P(θ | D) ∝ θ^(αH+βH-1) (1-θ)^(αT+βT-1), i.e., Beta(αH+βH, αT+βT)

P(θ) and P(θ|D) have the same form! [Conjugate prior]
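A minimal sketch (my own code and variable names, not from the slides) of the resulting MAP estimate, the mode of the Beta posterior; a strong Beta(50, 50) prior encodes “close to 50-50”:

```python
def beta_binomial_map(heads, tails, b_heads, b_tails):
    # Posterior is Beta(b_heads + heads, b_tails + tails); its mode is the
    # MAP estimate (valid when both posterior parameters exceed 1).
    post_h, post_t = b_heads + heads, b_tails + tails
    return (post_h - 1) / (post_h + post_t - 2)

# 3 heads, 2 tails barely moves a Beta(50, 50) prior:
print(beta_binomial_map(3, 2, 50, 50))       # ~0.505 (vs. the MLE 0.6)
# With more data the prior is washed out:
print(beta_binomial_map(300, 200, 50, 50))   # ~0.584, moving toward 0.6
```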

SLIDE 37

Beta distribution

More concentrated as values of α, β increase


SLIDE 38

Beta conjugate prior

As n = αH + αT increases (as we get more samples), the effect of the prior is “washed out”.


SLIDE 39

From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)

Likelihood is Multinomial(θ = {θ1, θ2, … , θk})

For the Multinomial, the conjugate prior is the Dirichlet distribution: if the prior is a Dirichlet distribution, then the posterior is also a Dirichlet distribution.

http://en.wikipedia.org/wiki/Dirichlet_distribution

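A sketch of how little machinery this conjugacy needs (my own code; the prior and counts are made up for illustration): the Dirichlet update just adds the observed counts to the prior parameters, exactly as in the Beta-Binomial case.

```python
def dirichlet_posterior(prior_alphas, counts):
    # Prior Dirichlet(alpha_1..alpha_k) plus counts n_1..n_k
    # gives posterior Dirichlet(alpha_i + n_i).
    return [a + n for a, n in zip(prior_alphas, counts)]

prior = [2, 2, 2, 2, 2, 2]          # mild belief that the die is fair
counts = [10, 8, 12, 9, 11, 10]     # observed rolls of each face
posterior = dirichlet_posterior(prior, counts)

# Posterior mean for each face: (alpha_i + n_i) / sum_j (alpha_j + n_j)
total = sum(posterior)
print([round(a / total, 3) for a in posterior])
```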

SLIDE 40

Conjugate prior for Gaussian?


  • Conjugate prior on the mean: Gaussian
  • Conjugate prior on the covariance matrix: Inverse Wishart
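A minimal sketch (my own code) for the simplest case, a Gaussian prior on the mean with the variance σ² assumed known: the posterior mean interpolates between the prior mean and the sample mean.

```python
def gaussian_posterior_mean(xs, sigma2, mu0, s0_2):
    # Prior N(mu0, s0_2) on the mean, data x_1..x_n with known variance sigma2.
    # Posterior is Gaussian with the standard precision-weighted combination.
    n = len(xs)
    xbar = sum(xs) / n
    post_mean = (sigma2 * mu0 + n * s0_2 * xbar) / (sigma2 + n * s0_2)
    post_var = (sigma2 * s0_2) / (sigma2 + n * s0_2)
    return post_mean, post_var

print(gaussian_posterior_mean([4.1, 5.3, 4.8, 5.0], sigma2=1.0, mu0=0.0, s0_2=10.0))
# posterior mean ~4.68 (pulled slightly toward the prior mean 0), variance ~0.24
```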

SLIDE 41

Bayesians vs. Frequentists

The criticisms the two camps trade:

  • “You are no good when the sample is small!”
  • “You give a different answer for different priors!”