Lecture 7: Maximum Likelihood Estimation (MLE), Maximum a Posteriori (MAP)


slide-1
SLIDE 1

Lecture 7:

− Maximum Likelihood Estimation (MLE)
− Maximum a Posteriori (MAP)

Aykut Erdem

October 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 2 will be out on Thursday
    − It is due November 10 (i.e. in 2 weeks)
    − You will implement a Naive Bayes Classifier for sentiment analysis on movie reviews

2

slide-3
SLIDE 3

Administrative

  • Project proposal due October 31
  • A half-page description of:
    − the problem to be investigated,
    − why it is interesting,
    − what data you will use,
    − related work.

3

slide-4
SLIDE 4

Today

  • Probabilities
  • Dependence, Independence, Conditional Independence
  • Parameter estimation
  • Maximum Likelihood Estimation (MLE)
  • Maximum a Posteriori (MAP)

4

slide-5
SLIDE 5

Last time… Sample space

5

Def: A sample space Ω is the set of all possible outcomes of a (conceptual or physical) random experiment. (Ω can be finite or infinite.)

Examples:
  • Ω may be the set of all possible outcomes of a dice roll: {1, 2, 3, 4, 5, 6}
  • Pages of a book opened randomly: {1, ..., 157}
  • Real numbers for temperature, location, time, etc.

slide by Barnabás Póczos & Alex Smola
slide-6
SLIDE 6

Last time… Events

6

Def: An event A is a subset of the sample space Ω. We will ask the question: what is the probability of a particular event?

Examples: What is the probability that
  • the book is open at an odd-numbered page?
  • a dice roll gives a number < 4?
  • a random person's height X satisfies a < X < b?

slide by Barnabás Póczos & Alex Smola

slide-7
SLIDE 7
Last time… Probability

7

Def: Probability P(A), the probability that event (subset) A happens, is a function that maps the event A onto the interval [0, 1]. P(A) is also called the probability measure of A. Intuitively, P(A) is the volume of the region of the sample space covering the outcomes in which A is true; the remaining outcomes are those in which A is false.

Example: What is the probability that the number on the dice is 2 or 4? The event is true for outcomes {2, 4} and false for {1, 3, 5, 6}, so for a fair dice P(A) = 2/6 = 1/3.

slide by Barnabás Póczos & Alex Smola

slide-8
SLIDE 8

Last time… Kolmogorov Axioms

8

  • 0 ≤ P(A) ≤ 1
  • P(Ω) = 1
  • P(A₁ ∪ A₂ ∪ ...) = P(A₁) + P(A₂) + ... for disjoint events A₁, A₂, ...

Consequences:

  • P(∅) = 0
  • P(¬A) = 1 − P(A)
  • If A ⊆ B then P(A) ≤ P(B)

slide by Barnabás Póczos & Alex Smola

slide-9
SLIDE 9

Last time… Venn Diagram

9

[Figure: overlapping events A and B]

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

slide by Barnabás Póczos & Alex Smola

slide-10
SLIDE 10

Last time… Random Variables

10

Def: A real-valued random variable is a function of the outcome of a randomized experiment.

Examples (discrete random variables, X(ω) is discrete):
  • X(ω) = True if a randomly drawn person (ω) from our class (Ω) is female
  • X(ω) = the hometown of a randomly drawn person (ω) from our class (Ω)

slide by Barnabás Póczos & Alex Smola

slide-11
SLIDE 11

Last time… Discrete Distributions

11

  • Bernoulli distribution: Ber(p)
    P(X = 1) = p,  P(X = 0) = 1 − p

  • Binomial distribution: Bin(n, p)
    Suppose a coin with head prob. p is tossed n times. The probability of getting k heads and n−k tails is
    P(k heads) = C(n, k) p^k (1 − p)^(n−k)

slide by Barnabás Póczos & Alex Smola
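The Bernoulli and Binomial probabilities above can be checked numerically; a minimal Python sketch (the counts and parameters are illustrative):

```python
from math import comb

def bernoulli_pmf(x, p):
    """P(X = x) for X ~ Ber(p), with x in {0, 1}."""
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k heads in n tosses of a coin with head prob. p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of 3 heads (and 2 tails) in 5 tosses of a fair coin
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```

Summing `binomial_pmf(k, n, p)` over k = 0..n gives 1, as the axioms require.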


slide-15
SLIDE 15

Last time… Conditional Probability

15

P(X|Y) = Fraction of worlds in which event X is true given event Y is also true.

Example joint distribution:

              Flu      No Flu
Headache      1/80     7/80
No Headache   1/80     71/80

P(Headache | Flu) = P(Headache ∧ Flu) / P(Flu) = (1/80) / (1/80 + 1/80) = 1/2

slide by Barnabás Póczos & Alex Smola
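The table lookup above can be written as a few lines of Python; a small sketch (the event labels are just strings I chose):

```python
# joint probabilities from the headache/flu table above
joint = {
    ("headache", "flu"): 1 / 80,
    ("headache", "no flu"): 7 / 80,
    ("no headache", "flu"): 1 / 80,
    ("no headache", "no flu"): 71 / 80,
}

# marginal: P(Flu) = sum of the joint over headache states
p_flu = sum(p for (h, f), p in joint.items() if f == "flu")

# conditional: P(Headache | Flu) = P(Headache and Flu) / P(Flu)
p_h_given_flu = joint[("headache", "flu")] / p_flu
print(p_h_given_flu)  # 0.5
```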


slide-17
SLIDE 17

Independence

17

Independent random variables: P(X, Y) = P(X) P(Y)

X and Y don't contain information about each other. Observing Y doesn't help predict X; observing X doesn't help predict Y.

Examples:
  • Independent: winning on roulette this week and next week.
  • Dependent: Russian roulette.

slide by Barnabás Póczos & Alex Smola

slide-18
SLIDE 18

Dependent / Independent

18

[Figure: scatter plots contrasting independent X, Y with dependent X, Y]

slide by Barnabás Póczos & Alex Smola

slide-19
SLIDE 19


Conditionally Independent

19

Conditionally independent: Knowing Z makes X and Y independent.

Examples:
  • Dependent: shoe size of children and reading skills.
  • Conditionally independent: shoe size of children and reading skills, given age.
  • Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

slide by Barnabás Póczos & Alex Smola

slide-20
SLIDE 20

Conditionally Independent

  • London taxi drivers: A survey pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

20

Finally, another study pointed out that people wear coats when it rains…

slide by Barnabás Póczos & Alex Smola

slide-21
SLIDE 21

Correlation ≠ Causation

21

The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in.

Correlation: 0.666004

http://www.tylervigen.com

slide by Barnabás Póczos & Alex Smola

slide-22
SLIDE 22

Conditional Independence

Formally: X is conditionally independent of Y given Z:

P(X | Y, Z) = P(X | Z)

22

Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)

Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn't give more info about Thunder.

slide by Barnabás Póczos & Alex Smola


slide-25
SLIDE 25

Conditional vs. Marginal Independence

  • C calls A and B separately and tells them a number n ∈ {1,...,10}
  • Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was.
  • A thinks the number was na and B thinks it was nb.
  • Are na and nb marginally independent?
    − No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)
  • Are na and nb conditionally independent given n?
    − Yes, because if we know the true number, the outcomes na and nb are purely determined by the noise in each phone.
      P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

25

slide by Barnabás Póczos & Alex Smola
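The phone example can be simulated; a sketch under an assumed noise model (each listener hears the true number with probability 0.8, otherwise a uniform random guess — the 0.8 is my choice, not from the slide):

```python
import random

random.seed(0)

def heard(n, p_correct=0.8):
    """Noisy channel: hear the true number, else a uniform guess in 1..10."""
    return n if random.random() < p_correct else random.randint(1, 10)

trials = 200_000
samples = [(n, heard(n), heard(n))
           for n in (random.randint(1, 10) for _ in range(trials))]

# marginal: P(na = 1)
p_na1 = sum(1 for n, na, nb in samples if na == 1) / trials

# P(na = 1 | nb = 1): na and nb are dependent through the shared n
nb1 = [(n, na) for n, na, nb in samples if nb == 1]
p_na1_given_nb1 = sum(1 for n, na in nb1 if na == 1) / len(nb1)

print(p_na1, p_na1_given_nb1)  # the conditional is much larger than the marginal
```

Conditioning on the true n and repeating the same comparison makes the gap vanish, which is the conditional independence claimed on the slide.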

slide-26
SLIDE 26

Parameter estimation: MLE, MAP

26

Estimating Probabilities

slide by Barnabás Póczos & Alex Smola

slide-27
SLIDE 27

Flipping a Coin

27

I have a coin; if I flip it, what's the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability. Suppose we observe 3 heads in 5 flips. The estimated probability is:

3/5, the "frequency of heads"

slide by Barnabás Póczos & Alex Smola


slide-31
SLIDE 31

Flipping a Coin

Questions:

(1) Why frequency of heads???
(2) How good is this estimation???
(3) Why is this a machine learning problem???

We are going to answer these questions.

31

slide by Barnabás Póczos & Alex Smola

slide-32
SLIDE 32

Question (1)

Why frequency of heads???

  • Frequency of heads is exactly the maximum likelihood estimator for this problem
  • MLE has nice properties (interpretation, statistical guarantees, simple)

32

slide by Barnabás Póczos & Alex Smola

slide-33
SLIDE 33

33

Maximum Likelihood Estimation

slide by Barnabás Póczos & Alex Smola

slide-34
SLIDE 34

MLE for Bernoulli distribution

34

Data: D = the observed sequence of flips, with P(Heads) = θ, P(Tails) = 1 − θ

Flips are i.i.d.:
  − Independent events
  − Identically distributed according to the Bernoulli distribution

MLE: Choose the θ that maximizes the probability of the observed data:

θ̂_MLE = arg max_θ P(D | θ)
slide by Barnabás Póczos & Alex Smola
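"Choose θ that maximizes the probability of observed data" can be done by brute force before any calculus; a sketch for 3 heads and 2 tails:

```python
a_heads, a_tails = 3, 2  # observed counts

def likelihood(theta):
    """P(D | theta) for i.i.d. Bernoulli flips."""
    return theta**a_heads * (1 - theta)**a_tails

# brute-force search over a fine grid of candidate theta values
grid = [i / 1000 for i in range(1001)]
theta_mle = max(grid, key=likelihood)
print(theta_mle)  # 0.6, i.e. the frequency of heads 3/5
```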


slide-38
SLIDE 38

Maximum Likelihood Estimation

38

MLE: Choose θ that maximizes the probability of observed data:

P(D | θ) = ∏_i P(x_i | θ)          [independent draws]
         = θ^(αH) (1 − θ)^(αT)     [identically distributed]

where αH is the number of observed heads and αT the number of tails.

slide by Barnabás Póczos & Alex Smola
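The maximization above can be written out in full; the standard derivation (a reconstruction, using the deck's αH, αT counts):

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} P(D \mid \theta)
  = \arg\max_{\theta} \; \theta^{\alpha_H}(1-\theta)^{\alpha_T}

\ln P(D \mid \theta) = \alpha_H \ln \theta + \alpha_T \ln (1-\theta)

\frac{d}{d\theta} \ln P(D \mid \theta)
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Rightarrow\quad
\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```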


slide-43
SLIDE 43

Maximum Likelihood Estimation

43

MLE: Choose θ that maximizes the probability of observed data:

θ̂_MLE = αH / (αH + αT)

That's exactly the "Frequency of heads"!

slide by Barnabás Póczos & Alex Smola
slide-46
SLIDE 46

Question (2)

  • How good is this MLE estimation???

46

slide by Barnabás Póczos & Alex Smola

slide-47
SLIDE 47

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails?

  • Which estimator should we trust more?
  • The more the merrier???

47

slide by Barnabás Póczos & Alex Smola

slide-48
SLIDE 48

Let θ* be the true parameter. For n = αH + αT and for any ε > 0, Hoeffding's inequality gives:

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2nε²)

48

slide by Barnabás Póczos & Alex Smola
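The bound can be compared against simulation; a sketch (θ* = 0.5, n = 100, ε = 0.1 are arbitrary choices):

```python
import random
from math import exp

random.seed(1)

theta_star, n, eps, trials = 0.5, 100, 0.1, 2000

# right-hand side of Hoeffding's inequality
bound = 2 * exp(-2 * n * eps**2)

# empirical frequency of large deviations of the MLE from theta*
violations = 0
for _ in range(trials):
    heads = sum(random.random() < theta_star for _ in range(n))
    if abs(heads / n - theta_star) >= eps:
        violations += 1

print(violations / trials, "<=", bound)
```

The bound is loose (here roughly 0.27 versus an empirical rate near 0.06), but it holds for any distribution of the flips.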

slide-49
SLIDE 49

Probably Approximately Correct
 (PAC) Learning

I want to know the coin parameter θ, within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need?

Sample complexity: n ≥ ln(2/δ) / (2ε²)

49

slide by Barnabás Póczos & Alex Smola
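Inverting Hoeffding's bound gives the sample complexity n ≥ ln(2/δ) / (2ε²); as a quick calculator:

```python
from math import ceil, log

def coin_sample_complexity(eps, delta):
    """Flips needed so that P(|theta_hat - theta*| >= eps) <= delta."""
    return ceil(log(2 / delta) / (2 * eps**2))

# eps = 0.1 error with probability at least 1 - delta = 0.95
print(coin_sample_complexity(0.1, 0.05))  # 185 flips
```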

slide-50
SLIDE 50

Question (3)

Why is this a machine learning problem???

It fits the definition of learning: we
  • improve performance (accuracy of the predicted probability)
  • at some task (predicting the probability of heads)
  • with experience (the more coins we flip, the better we are)

50

slide by Barnabás Póczos & Alex Smola

slide-51
SLIDE 51

What about continuous
 features?

51

Let us try Gaussians…

[Figure: Gaussian density curves with mean µ = 0 and varying variance σ²]

slide by Barnabás Póczos & Alex Smola

slide-52
SLIDE 52

MLE for Gaussian mean
 and variance

52

Choose θ = (µ, σ²) that maximizes the probability of the observed data (independent draws, identically distributed):

µ̂_MLE = (1/n) Σ_i x_i
σ̂²_MLE = (1/n) Σ_i (x_i − µ̂_MLE)²

slide by Barnabás Póczos & Alex Smola

slide-53
SLIDE 53

MLE for Gaussian mean
 and variance

53

Note: the MLE for the variance of a Gaussian is biased.

[The expected result of the estimation is not the true parameter!]
Unbiased variance estimator:

σ̂²_unbiased = (1/(n − 1)) Σ_i (x_i − µ̂_MLE)²

slide by Barnabás Póczos & Alex Smola
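The two variance estimators differ only in the 1/n versus 1/(n−1) factor; a sketch on made-up data:

```python
data = [2.1, 1.9, 2.4, 2.0, 1.6]  # illustrative sample
n = len(data)

mu_mle = sum(data) / n

# MLE (biased): divide by n
var_mle = sum((x - mu_mle) ** 2 for x in data) / n

# unbiased: divide by n - 1
var_unbiased = sum((x - mu_mle) ** 2 for x in data) / (n - 1)

print(mu_mle, var_mle, var_unbiased)
```

The unbiased estimate is always slightly larger; the gap shrinks as n grows.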


slide-55
SLIDE 55

55

What about prior knowledge?
 (MAP Estimation)

slide by Barnabás Póczos & Aarti Singh

slide-56
SLIDE 56

What about prior knowledge?

56

We know the coin is "close" to 50-50. What can we do now?

The Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ.

[Figure: prior centered at 50-50 before data; posterior sharpened after data]

slide by Barnabás Póczos & Aarti Singh


slide-58
SLIDE 58

Prior distribution

  • What prior? What distribution do we want for a prior?
    − Represents expert knowledge (philosophical approach)
    − Simple posterior form (engineer's approach)

  • Uninformative priors:
    − Uniform distribution

  • Conjugate priors:
    − Closed-form representation of posterior
    − P(θ) and P(θ|D) have the same form

58

slide by Barnabás Póczos & Aarti Singh

slide-59
SLIDE 59

59

Bayes Rule

In order to proceed we will need:

slide by Barnabás Póczos & Aarti Singh

slide-60
SLIDE 60

Chain Rule & Bayes Rule

60

Chain rule: P(X, Y) = P(X | Y) P(Y)

Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)

Bayes rule is important for reverse conditioning.

slide by Barnabás Póczos & Aarti Singh

slide-61
SLIDE 61

Bayesian Learning

61

  • Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)

  • Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

posterior ∝ likelihood × prior

slide by Barnabás Póczos & Aarti Singh

slide-62
SLIDE 62

MAP estimation for Binomial distribution

62

Coin flip problem: the likelihood is Binomial.

If the prior is a Beta distribution, then the posterior is also a Beta distribution.

P(θ) and P(θ | D) have the same form! [Conjugate prior]

slide by Barnabás Póczos & Aarti Singh
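The Beta-Binomial update takes only a few lines; a sketch with a Beta(5, 5) prior encoding "close to 50-50" (the prior strength is my choice, not from the slide):

```python
a_heads, a_tails = 3, 2      # observed flips
prior_a, prior_b = 5, 5      # Beta(5, 5): belief the coin is near 50-50

# conjugacy: posterior is Beta(prior_a + a_heads, prior_b + a_tails)
post_a, post_b = prior_a + a_heads, prior_b + a_tails

# MAP = posterior mode = (a - 1) / (a + b - 2) for Beta(a, b) with a, b > 1
theta_map = (post_a - 1) / (post_a + post_b - 2)
theta_mle = a_heads / (a_heads + a_tails)

print(theta_mle, theta_map)  # MAP is pulled from 0.6 toward 0.5
```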

slide-63
SLIDE 63

Beta distribution

63

Beta(α, β): P(θ) ∝ θ^(α−1) (1 − θ)^(β−1). More concentrated as the values of α, β increase.

slide by Barnabás Póczos & Aarti Singh

slide-64
SLIDE 64

Beta conjugate prior

64

As we get more samples, the effect of the prior is "washed out": as n = αH + αT increases, the posterior concentrates around the MLE.

slide by Barnabás Póczos & Aarti Singh
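The "washed out" effect is easy to see numerically: keep the same prior and scale up the data while the heads fraction stays at 3/5 (prior and counts are illustrative):

```python
prior_a, prior_b = 5, 5  # fixed Beta(5, 5) prior

for scale in (1, 10, 100):
    a_heads, a_tails = 3 * scale, 2 * scale
    # posterior mode of Beta(prior_a + a_heads, prior_b + a_tails)
    theta_map = (prior_a + a_heads - 1) / (prior_a + prior_b + a_heads + a_tails - 2)
    print(a_heads + a_tails, round(theta_map, 3))
# as n grows, theta_map approaches the MLE 3/5 = 0.6
```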


slide-66
SLIDE 66

Han Solo and Bayesian Priors

C-3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!

66

https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors


slide-68
SLIDE 68

MLE vs. MAP

68

When is MAP the same as MLE? (When the prior P(θ) is uniform, maximizing the posterior reduces to maximizing the likelihood.)

  • Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of observed data:
    θ̂_MLE = arg max_θ P(D | θ)

  • Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief:
    θ̂_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

slide by Barnabás Póczos & Aarti Singh

slide-69
SLIDE 69

From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)
Likelihood is ~ Multinomial(θ = {θ1, θ2, ..., θk})
For the Multinomial, the conjugate prior is the Dirichlet distribution: if the prior is a Dirichlet distribution, then the posterior is also a Dirichlet distribution.
http://en.wikipedia.org/wiki/Dirichlet_distribution

69

slide by Barnabás Póczos & Aarti Singh
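For the dice problem the same update pattern holds; a sketch with hypothetical roll counts and a symmetric Dirichlet(2, ..., 2) prior (both are my choices for illustration):

```python
counts = [10, 8, 12, 9, 11, 10]  # hypothetical counts for faces 1..6
n = sum(counts)
k = len(counts)

# MLE: per-face frequencies
theta_mle = [c / n for c in counts]

# Dirichlet(2, ..., 2) prior acts like one pseudo-count per face
# (MAP mode of Dirichlet: (c_i + alpha_i - 1) / (n + sum(alpha) - k))
alpha = [2] * k
theta_map = [(c + a - 1) / (n + sum(alpha) - k) for c, a in zip(counts, alpha)]

print(theta_mle)
print(theta_map)  # smoothed toward the uniform 1/6
```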

slide-70
SLIDE 70

Bayesians vs. Frequentists

70

The Bayesian to the frequentist: "You are no good when the sample is small."
The frequentist to the Bayesian: "You give a different answer for different priors."

slide by Barnabás Póczos & Aarti Singh