Lecture 7: Maximum Likelihood Estimation (MLE), Maximum a Posteriori (MAP)


slide-1
SLIDE 1

Lecture 7:

− Maximum Likelihood Estimation (MLE)
− Maximum a Posteriori (MAP)

Aykut Erdem

October 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 2 will be out on Thursday
    − It is due November 10 (i.e. in 2 weeks)
    − You will implement a Naive Bayes Classifier for sentiment analysis on movie reviews

2

slide-3
SLIDE 3

Administrative

  • Project proposal due October 31
  • A half-page description of:
    − the problem to be investigated,
    − why it is interesting,
    − what data you will use,
    − related work.

3

slide-4
SLIDE 4

Today

  • Probabilities
  • Dependence, Independence, Conditional Independence
  • Parameter estimation
  • Maximum Likelihood Estimation (MLE)
  • Maximum a Posteriori (MAP)

4

slide-5
SLIDE 5

Last time… Sample space

5

Def: A sample space Ω is the set of all possible outcomes of a (conceptual or physical) random experiment. (Ω can be finite or infinite.)

Examples:
  • Ω may be the set of all possible outcomes of a dice roll: {1, 2, 3, 4, 5, 6}
  • Pages of a book opened randomly: {1, ..., 157}
  • Real numbers for temperature, location, time, etc.

slide by Barnabás Póczos & Alex Smola
slide-6
SLIDE 6

Last time… Events

6

Def: An event A is a subset of the sample space Ω. We will ask the question: what is the probability of a particular event?

Examples: What is the probability that
  • the book is open at an odd-numbered page?
  • a dice roll gives a number < 4?
  • a random person's height X satisfies a < X < b?

slide by Barnabás Póczos & Alex Smola

slide-7
SLIDE 7
Last time… Probability

7

Def: Probability P(A), the probability that event (subset) A happens, is a function that maps the event A onto the interval [0, 1]. P(A) is also called the probability measure of A. Intuitively, P(A) is the volume of the region of the sample space covering the outcomes in which A is true; the remaining outcomes are those in which A is false.

Example: What is the probability that the number on the dice is 2 or 4? The event is true for outcomes {2, 4} and false for {1, 3, 5, 6}, so for a fair dice P(A) = 2/6 = 1/3.

slide by Barnabás Póczos & Alex Smola

slide-8
SLIDE 8

Last time… Kolmogorov Axioms

8

  • 0 ≤ P(A) ≤ 1
  • P(Ω) = 1
  • P(A₁ ∪ A₂ ∪ ...) = P(A₁) + P(A₂) + ... for disjoint events A₁, A₂, ...

Consequences:

  • P(∅) = 0
  • P(¬A) = 1 − P(A)
  • If A ⊆ B then P(A) ≤ P(B)

slide by Barnabás Póczos & Alex Smola

slide-9
SLIDE 9

Last time… Venn Diagram

9

[Figure: overlapping events A and B]

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

slide by Barnabás Póczos & Alex Smola

slide-10
SLIDE 10

Last time… Random Variables

10

Def: A real-valued random variable is a function of the outcome of a randomized experiment.

Examples (discrete random variables, X(ω) is discrete):
  • X(ω) = True if a randomly drawn person (ω) from our class (Ω) is female
  • X(ω) = the hometown of a randomly drawn person (ω) from our class (Ω)

slide by Barnabás Póczos & Alex Smola

slide-11
SLIDE 11

Last time… Discrete Distributions

11

  • Bernoulli distribution: Ber(p)
    P(X = 1) = p,  P(X = 0) = 1 − p

  • Binomial distribution: Bin(n, p)
    Suppose a coin with head prob. p is tossed n times. The probability of getting k heads and n−k tails is
    P(k heads) = C(n, k) p^k (1 − p)^(n−k)

slide by Barnabás Póczos & Alex Smola
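The Bernoulli and Binomial probabilities above can be checked numerically; a minimal Python sketch (the counts and parameters are illustrative):

```python
from math import comb

def bernoulli_pmf(x, p):
    """P(X = x) for X ~ Ber(p), with x in {0, 1}."""
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k heads in n tosses of a coin with head prob. p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of 3 heads (and 2 tails) in 5 tosses of a fair coin
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```

Summing `binomial_pmf(k, n, p)` over k = 0..n gives 1, as the axioms require.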


slide-15
SLIDE 15

Last time… Conditional Probability

15

P(X|Y) = Fraction of worlds in which event X is true given event Y is also true.

Example joint distribution:

              Flu      No Flu
Headache      1/80     7/80
No Headache   1/80     71/80

P(Headache | Flu) = P(Headache ∧ Flu) / P(Flu) = (1/80) / (1/80 + 1/80) = 1/2

slide by Barnabás Póczos & Alex Smola
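The table lookup above can be written as a few lines of Python; a small sketch (the event labels are just strings I chose):

```python
# joint probabilities from the headache/flu table above
joint = {
    ("headache", "flu"): 1 / 80,
    ("headache", "no flu"): 7 / 80,
    ("no headache", "flu"): 1 / 80,
    ("no headache", "no flu"): 71 / 80,
}

# marginal: P(Flu) = sum of the joint over headache states
p_flu = sum(p for (h, f), p in joint.items() if f == "flu")

# conditional: P(Headache | Flu) = P(Headache and Flu) / P(Flu)
p_h_given_flu = joint[("headache", "flu")] / p_flu
print(p_h_given_flu)  # 0.5
```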


slide-17
SLIDE 17

Independence

17

Independent random variables: P(X, Y) = P(X) P(Y)

X and Y don't contain information about each other. Observing Y doesn't help predict X; observing X doesn't help predict Y.

Examples:
  • Independent: winning on roulette this week and next week.
  • Dependent: Russian roulette.

slide by Barnabás Póczos & Alex Smola

slide-18
SLIDE 18

Dependent / Independent

18

[Figure: scatter plots contrasting independent X, Y with dependent X, Y]

slide by Barnabás Póczos & Alex Smola

slide-19
SLIDE 19


Conditionally Independent

19

Conditionally independent: Knowing Z makes X and Y independent.

Examples:
  • Dependent: shoe size of children and reading skills.
  • Conditionally independent: shoe size of children and reading skills, given age.
  • Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

slide by Barnabás Póczos & Alex Smola

slide-20
SLIDE 20

Conditionally Independent

  • London taxi drivers: A survey pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

20

Finally, another study pointed out that people wear coats when it rains…

slide by Barnabás Póczos & Alex Smola

slide-21
SLIDE 21

Correlation ≠ Causation

21

The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in.

Correlation: 0.666004

http://www.tylervigen.com

slide by Barnabás Póczos & Alex Smola

slide-22
SLIDE 22

Conditional Independence

Formally: X is conditionally independent of Y given Z:

P(X | Y, Z) = P(X | Z)

22

Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)

Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn't give more info about Thunder.

slide by Barnabás Póczos & Alex Smola


slide-25
SLIDE 25

Conditional vs. Marginal Independence

  • C calls A and B separately and tells them a number n ∈ {1,...,10}
  • Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was.
  • A thinks the number was na and B thinks it was nb.
  • Are na and nb marginally independent?
    − No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)
  • Are na and nb conditionally independent given n?
    − Yes, because if we know the true number, the outcomes na and nb are purely determined by the noise in each phone.
      P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

25

slide by Barnabás Póczos & Alex Smola
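The phone example can be simulated; a sketch under an assumed noise model (each listener hears the true number with probability 0.8, otherwise a uniform random guess — the 0.8 is my choice, not from the slide):

```python
import random

random.seed(0)

def heard(n, p_correct=0.8):
    """Noisy channel: hear the true number, else a uniform guess in 1..10."""
    return n if random.random() < p_correct else random.randint(1, 10)

trials = 200_000
samples = [(n, heard(n), heard(n))
           for n in (random.randint(1, 10) for _ in range(trials))]

# marginal: P(na = 1)
p_na1 = sum(1 for n, na, nb in samples if na == 1) / trials

# P(na = 1 | nb = 1): na and nb are dependent through the shared n
nb1 = [(n, na) for n, na, nb in samples if nb == 1]
p_na1_given_nb1 = sum(1 for n, na in nb1 if na == 1) / len(nb1)

print(p_na1, p_na1_given_nb1)  # the conditional is much larger than the marginal
```

Conditioning on the true n and repeating the same comparison makes the gap vanish, which is the conditional independence claimed on the slide.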

slide-26
SLIDE 26

Parameter estimation: MLE, MAP

26

Estimating Probabilities

slide by Barnabás Póczos & Alex Smola

slide-27
SLIDE 27

Flipping a Coin

27

I have a coin; if I flip it, what's the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability. Suppose we observe 3 heads in 5 flips. The estimated probability is:

3/5, the "frequency of heads"

slide by Barnabás Póczos & Alex Smola


slide-31
SLIDE 31

Flipping a Coin

Questions:

(1) Why frequency of heads???
(2) How good is this estimation???
(3) Why is this a machine learning problem???

We are going to answer these questions.

31

slide by Barnabás Póczos & Alex Smola

slide-32
SLIDE 32

Question (1)

Why frequency of heads???

  • Frequency of heads is exactly the maximum likelihood estimator for this problem
  • MLE has nice properties (interpretation, statistical guarantees, simple)

32

slide by Barnabás Póczos & Alex Smola

slide-33
SLIDE 33

33

Maximum Likelihood Estimation

slide by Barnabás Póczos & Alex Smola

slide-34
SLIDE 34

MLE for Bernoulli distribution

34

Data: D = the observed sequence of flips, with P(Heads) = θ, P(Tails) = 1 − θ

Flips are i.i.d.:
  − Independent events
  − Identically distributed according to the Bernoulli distribution

MLE: Choose the θ that maximizes the probability of the observed data:

θ̂_MLE = arg max_θ P(D | θ)
slide by Barnabás Póczos & Alex Smola
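"Choose θ that maximizes the probability of observed data" can be done by brute force before any calculus; a sketch for 3 heads and 2 tails:

```python
a_heads, a_tails = 3, 2  # observed counts

def likelihood(theta):
    """P(D | theta) for i.i.d. Bernoulli flips."""
    return theta**a_heads * (1 - theta)**a_tails

# brute-force search over a fine grid of candidate theta values
grid = [i / 1000 for i in range(1001)]
theta_mle = max(grid, key=likelihood)
print(theta_mle)  # 0.6, i.e. the frequency of heads 3/5
```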


slide-38
SLIDE 38

Maximum Likelihood Estimation

38

MLE: Choose θ that maximizes the probability of observed data:

P(D | θ) = ∏_i P(x_i | θ)          [independent draws]
         = θ^(αH) (1 − θ)^(αT)     [identically distributed]

where αH is the number of observed heads and αT the number of tails.

slide by Barnabás Póczos & Alex Smola
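The maximization above can be written out in full; the standard derivation (a reconstruction, using the deck's αH, αT counts):

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} P(D \mid \theta)
  = \arg\max_{\theta} \; \theta^{\alpha_H}(1-\theta)^{\alpha_T}

\ln P(D \mid \theta) = \alpha_H \ln \theta + \alpha_T \ln (1-\theta)

\frac{d}{d\theta} \ln P(D \mid \theta)
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Rightarrow\quad
\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```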


slide-43
SLIDE 43

Maximum Likelihood Estimation

43

MLE: Choose θ that maximizes the probability of observed data:

θ̂_MLE = αH / (αH + αT)

That's exactly the "Frequency of heads"!

slide by Barnabás Póczos & Alex Smola
slide-46
SLIDE 46

Question (2)

  • How good is this MLE estimation???

46

slide by Barnabás Póczos & Alex Smola

slide-47
SLIDE 47

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails?

  • Which estimator should we trust more?
  • The more the merrier???

47

slide by Barnabás Póczos & Alex Smola

slide-48
SLIDE 48

Let θ* be the true parameter. For n = αH + αT and for any ε > 0, Hoeffding's inequality gives:

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2nε²)

48

slide by Barnabás Póczos & Alex Smola
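The bound can be compared against simulation; a sketch (θ* = 0.5, n = 100, ε = 0.1 are arbitrary choices):

```python
import random
from math import exp

random.seed(1)

theta_star, n, eps, trials = 0.5, 100, 0.1, 2000

# right-hand side of Hoeffding's inequality
bound = 2 * exp(-2 * n * eps**2)

# empirical frequency of large deviations of the MLE from theta*
violations = 0
for _ in range(trials):
    heads = sum(random.random() < theta_star for _ in range(n))
    if abs(heads / n - theta_star) >= eps:
        violations += 1

print(violations / trials, "<=", bound)
```

The bound is loose (here roughly 0.27 versus an empirical rate near 0.06), but it holds for any distribution of the flips.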

slide-49
SLIDE 49

Probably Approximately Correct
 (PAC) Learning

I want to know the coin parameter θ, within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need?

Sample complexity: n ≥ ln(2/δ) / (2ε²)

49

slide by Barnabás Póczos & Alex Smola
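Inverting Hoeffding's bound gives the sample complexity n ≥ ln(2/δ) / (2ε²); as a quick calculator:

```python
from math import ceil, log

def coin_sample_complexity(eps, delta):
    """Flips needed so that P(|theta_hat - theta*| >= eps) <= delta."""
    return ceil(log(2 / delta) / (2 * eps**2))

# eps = 0.1 error with probability at least 1 - delta = 0.95
print(coin_sample_complexity(0.1, 0.05))  # 185 flips
```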

slide-50
SLIDE 50

Question (3)

Why is this a machine learning problem???

It fits the definition of learning: we
  • improve performance (accuracy of the predicted probability)
  • at some task (predicting the probability of heads)
  • with experience (the more coins we flip, the better we are)

50

slide by Barnabás Póczos & Alex Smola

slide-51
SLIDE 51

What about continuous
 features?

51

Let us try Gaussians…

[Figure: Gaussian density curves with mean µ = 0 and varying variance σ²]

slide by Barnabás Póczos & Alex Smola

slide-52
SLIDE 52

MLE for Gaussian mean
 and variance

52

Choose θ = (µ, σ²) that maximizes the probability of the observed data (independent draws, identically distributed):

µ̂_MLE = (1/n) Σ_i x_i
σ̂²_MLE = (1/n) Σ_i (x_i − µ̂_MLE)²

slide by Barnabás Póczos & Alex Smola

slide-53
SLIDE 53

MLE for Gaussian mean
 and variance

53

Note: the MLE for the variance of a Gaussian is biased.

[The expected result of the estimation is not the true parameter!]
Unbiased variance estimator:

σ̂²_unbiased = (1/(n − 1)) Σ_i (x_i − µ̂_MLE)²

slide by Barnabás Póczos & Alex Smola
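The two variance estimators differ only in the 1/n versus 1/(n−1) factor; a sketch on made-up data:

```python
data = [2.1, 1.9, 2.4, 2.0, 1.6]  # illustrative sample
n = len(data)

mu_mle = sum(data) / n

# MLE (biased): divide by n
var_mle = sum((x - mu_mle) ** 2 for x in data) / n

# unbiased: divide by n - 1
var_unbiased = sum((x - mu_mle) ** 2 for x in data) / (n - 1)

print(mu_mle, var_mle, var_unbiased)
```

The unbiased estimate is always slightly larger; the gap shrinks as n grows.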


slide-55
SLIDE 55

55

What about prior knowledge?
 (MAP Estimation)

slide by Barnabás Póczos & Aarti Singh

slide-56
SLIDE 56

What about prior knowledge?

56

We know the coin is "close" to 50-50. What can we do now?

The Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ.

[Figure: prior centered at 50-50 before data; posterior sharpened after data]

slide by Barnabás Póczos & Aarti Singh


slide-58
SLIDE 58

Prior distribution

  • What prior? What distribution do we want for a prior?
    − Represents expert knowledge (philosophical approach)
    − Simple posterior form (engineer's approach)

  • Uninformative priors:
    − Uniform distribution

  • Conjugate priors:
    − Closed-form representation of posterior
    − P(θ) and P(θ|D) have the same form

58

slide by Barnabás Póczos & Aarti Singh

slide-59
SLIDE 59

59

Bayes Rule

In order to proceed we will need:

slide by Barnabás Póczos & Aarti Singh

slide-60
SLIDE 60

Chain Rule & Bayes Rule

60

Chain rule: P(X, Y) = P(X | Y) P(Y)

Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)

Bayes rule is important for reverse conditioning.

slide by Barnabás Póczos & Aarti Singh

slide-61
SLIDE 61

Bayesian Learning

61

  • Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)

  • Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

posterior ∝ likelihood × prior

slide by Barnabás Póczos & Aarti Singh

slide-62
SLIDE 62

MAP estimation for Binomial distribution

62

Coin flip problem: the likelihood is Binomial.

If the prior is a Beta distribution, then the posterior is also a Beta distribution.

P(θ) and P(θ | D) have the same form! [Conjugate prior]

slide by Barnabás Póczos & Aarti Singh
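The Beta-Binomial update takes only a few lines; a sketch with a Beta(5, 5) prior encoding "close to 50-50" (the prior strength is my choice, not from the slide):

```python
a_heads, a_tails = 3, 2      # observed flips
prior_a, prior_b = 5, 5      # Beta(5, 5): belief the coin is near 50-50

# conjugacy: posterior is Beta(prior_a + a_heads, prior_b + a_tails)
post_a, post_b = prior_a + a_heads, prior_b + a_tails

# MAP = posterior mode = (a - 1) / (a + b - 2) for Beta(a, b) with a, b > 1
theta_map = (post_a - 1) / (post_a + post_b - 2)
theta_mle = a_heads / (a_heads + a_tails)

print(theta_mle, theta_map)  # MAP is pulled from 0.6 toward 0.5
```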

slide-63
SLIDE 63

Beta distribution

63

Beta(α, β): P(θ) ∝ θ^(α−1) (1 − θ)^(β−1). More concentrated as the values of α, β increase.

slide by Barnabás Póczos & Aarti Singh

slide-64
SLIDE 64

Beta conjugate prior

64

As we get more samples, the effect of the prior is "washed out": as n = αH + αT increases, the posterior concentrates around the MLE.

slide by Barnabás Póczos & Aarti Singh
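The "washed out" effect is easy to see numerically: keep the same prior and scale up the data while the heads fraction stays at 3/5 (prior and counts are illustrative):

```python
prior_a, prior_b = 5, 5  # fixed Beta(5, 5) prior

for scale in (1, 10, 100):
    a_heads, a_tails = 3 * scale, 2 * scale
    # posterior mode of Beta(prior_a + a_heads, prior_b + a_tails)
    theta_map = (prior_a + a_heads - 1) / (prior_a + prior_b + a_heads + a_tails - 2)
    print(a_heads + a_tails, round(theta_map, 3))
# as n grows, theta_map approaches the MLE 3/5 = 0.6
```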


slide-66
SLIDE 66

Han Solo and Bayesian Priors

C-3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!

66

https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors


slide-68
SLIDE 68

MLE vs. MAP

68

When is MAP the same as MLE? (When the prior P(θ) is uniform, maximizing the posterior reduces to maximizing the likelihood.)

  • Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of observed data:
    θ̂_MLE = arg max_θ P(D | θ)

  • Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief:
    θ̂_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

slide by Barnabás Póczos & Aarti Singh

slide-69
SLIDE 69

From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)
Likelihood is ~ Multinomial(θ = {θ1, θ2, ..., θk})
For the Multinomial, the conjugate prior is the Dirichlet distribution: if the prior is a Dirichlet distribution, then the posterior is also a Dirichlet distribution.
http://en.wikipedia.org/wiki/Dirichlet_distribution

69

slide by Barnabás Póczos & Aarti Singh
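For the dice problem the same update pattern holds; a sketch with hypothetical roll counts and a symmetric Dirichlet(2, ..., 2) prior (both are my choices for illustration):

```python
counts = [10, 8, 12, 9, 11, 10]  # hypothetical counts for faces 1..6
n = sum(counts)
k = len(counts)

# MLE: per-face frequencies
theta_mle = [c / n for c in counts]

# Dirichlet(2, ..., 2) prior acts like one pseudo-count per face
# (MAP mode of Dirichlet: (c_i + alpha_i - 1) / (n + sum(alpha) - k))
alpha = [2] * k
theta_map = [(c + a - 1) / (n + sum(alpha) - k) for c, a in zip(counts, alpha)]

print(theta_mle)
print(theta_map)  # smoothed toward the uniform 1/6
```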

slide-70
SLIDE 70

Bayesians vs. Frequentists

70

The Bayesian to the frequentist: "You are no good when the sample is small."
The frequentist to the Bayesian: "You give a different answer for different priors."

slide by Barnabás Póczos & Aarti Singh