Lecture 7:
−Maximum Likelihood Estimation (MLE) −Maximum a Posteriori (MAP)
Aykut Erdem
October 2016 Hacettepe University
Administrative
− Assignment 2 will be out on Thursday
− It is due November 10 (i.e. in 2 weeks)
− You will implement
Course project proposal:
− problem to be investigated,
− why it is interesting,
− what data you will use,
− related work.
slide by Barnabás Póczos & Alex Smola
Sample space: the set of all possible outcomes. An event A is the subset of outcomes in which A is true; P(A) is the volume of that area within the sample space.
Example: What is the probability that the number on the die is 2 or 4?
The sample space splits into {1, 3, 5, 6} and {2, 4}, so the probability is 2/6 = 1/3.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
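The inclusion-exclusion rule above can be checked numerically on the die example. A minimal sketch using exact fractions; the event names are illustrative.

```python
# Numeric check of inclusion-exclusion, P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
# on a fair six-sided die. Events are sets of outcomes; probabilities are
# |event| / |sample space|.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}           # sample space
A = {2, 4, 6}                        # "even number"
B = {2, 4}                           # "the number is 2 or 4"

def P(event):
    return Fraction(len(event & omega), len(omega))

lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)
assert lhs == rhs == Fraction(1, 2)  # A ∪ B = {2, 4, 6}
```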
Def: a real-valued random variable is a function of the outcome in the sample space.
Example: X(ω) = 1 if the person (ω) drawn from our class is female, 0 otherwise.
Suppose a coin with head probability p is tossed n times. What is the probability of getting k heads and n−k tails?

P(k heads, n−k tails) = C(n, k) p^k (1−p)^(n−k)
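The binomial probability above can be computed directly with the standard library. A sketch; the particular n, k, p values are illustrative.

```python
# P(k heads, n-k tails) = C(n, k) * p^k * (1-p)^(n-k) for a coin with
# head probability p tossed n times.
from math import comb

def prob_k_heads(n, k, p):
    """Binomial probability of exactly k heads in n tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, 10 tosses, 5 heads: C(10, 5) / 2^10 = 252 / 1024
p5 = prob_k_heads(10, 5, 0.5)
assert abs(p5 - 252 / 1024) < 1e-12

# Sanity check: the probabilities over all k sum to 1.
assert abs(sum(prob_k_heads(10, k, 0.3) for k in range(11)) - 1.0) < 1e-12
```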
P(X|Y) = fraction of worlds in which event X is true, given that event Y is true: P(X|Y) = P(X ∩ Y) / P(Y).

Example (out of 80 equally likely worlds):

              Flu     No Flu
Headache      1/80    7/80
No Headache   1/80    71/80

P(Headache | Flu) = (1/80) / (2/80) = 1/2
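The headache/flu table can be encoded as a joint distribution and the conditional probability computed from it. A sketch with exact fractions; the dictionary layout is an assumption of this example.

```python
# Conditional probability from a joint table: P(X|Y) = P(X ∩ Y) / P(Y).
from fractions import Fraction

# Joint probabilities from the slide (out of 80 "worlds"):
joint = {
    ("headache", "flu"):       Fraction(1, 80),
    ("headache", "no flu"):    Fraction(7, 80),
    ("no headache", "flu"):    Fraction(1, 80),
    ("no headache", "no flu"): Fraction(71, 80),
}

p_flu = sum(p for (h, f), p in joint.items() if f == "flu")   # 2/80
p_headache_and_flu = joint[("headache", "flu")]               # 1/80
p_headache_given_flu = p_headache_and_flu / p_flu
assert p_headache_given_flu == Fraction(1, 2)
```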
Independent: winning on roulette this week and next week. Dependent: Russian roulette.

Independent random variables: P(X, Y) = P(X) P(Y), equivalently P(X | Y) = P(X).

[Figure: independent X, Y vs. dependent X, Y]
Dependent: shoe size of children and reading skills.
Conditionally independent: shoe size of children and reading skills, given age.
Storks deliver babies? A highly statistically significant correlation exists between stork populations and human birth rates across Europe.

Conditionally independent: knowing Z makes X and Y independent: P(X, Y | Z) = P(X | Z) P(Y | Z)
A study found a significant correlation between the number of accidents and wearing coats, and concluded that coats could hinder the movements of drivers and be the cause of accidents; drivers were advised to refrain from wearing coats when driving. Finally, another study pointed out that people wear coats when it rains…
The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in.
Correlation: 0.666004
http://www.tylervigen.com
P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Equivalent to: P(Thunder, Rain | Lightning) = P(Thunder | Lightning) P(Rain | Lightning)

Note: this does NOT mean Thunder is independent of Rain; but given Lightning, knowing Rain doesn't give more info about Thunder.
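The Thunder/Rain/Lightning statement can be verified on a toy joint distribution built so that Thunder and Rain are conditionally independent given Lightning. All the probability values below are made up for illustration.

```python
# Toy joint: P(T, R, L) = P(L) P(R|L) P(T|L), which enforces conditional
# independence of Thunder and Rain given Lightning by construction.
from itertools import product

p_light = {True: 0.1, False: 0.9}
p_rain_given_light = {True: 0.8, False: 0.2}
p_thunder_given_light = {True: 0.9, False: 0.05}

joint = {}
for t, r, l in product([True, False], repeat=3):
    pr = p_rain_given_light[l] if r else 1 - p_rain_given_light[l]
    pt = p_thunder_given_light[l] if t else 1 - p_thunder_given_light[l]
    joint[(t, r, l)] = p_light[l] * pr * pt

def p(pred):
    return sum(v for k, v in joint.items() if pred(*k))

# P(Thunder | Rain, Lightning) equals P(Thunder | Lightning):
lhs = p(lambda t, r, l: t and r and l) / p(lambda t, r, l: r and l)
rhs = p(lambda t, r, l: t and l) / p(lambda t, r, l: l)
assert abs(lhs - rhs) < 1e-12

# ...but Thunder is NOT marginally independent of Rain:
assert abs(p(lambda t, r, l: t and r)
           - p(lambda t, r, l: t) * p(lambda t, r, l: r)) > 1e-3
```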
Example: a number n is announced and two phones each record a noisy version of it, na and nb; from each recording we (independently) draw a conclusion about what the number was. Given n, the errors are purely determined by the noise in each phone, so na and nb are conditionally independent:

P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)
I have a coin; if I flip it, what's the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability. The estimated probability is 3/5: the "frequency of heads".

Why is the frequency the right estimate, and how good is it? We are going to answer these questions.
Flips are i.i.d.:
– Independent events
– Identically distributed according to Bernoulli distribution

Data D: a sequence of flips with αH heads and αT tails; P(Heads) = θ, P(Tails) = 1−θ

MLE: Choose θ that maximizes the probability of observed data
MLE: Choose θ that maximizes the probability of observed data:

θ̂_MLE = arg max_θ P(D | θ) = arg max_θ θ^αH (1−θ)^αT   (independent draws, identically distributed)

Taking the log and setting the derivative to zero:

d/dθ [αH ln θ + αT ln(1−θ)] = αH/θ − αT/(1−θ) = 0  ⟹  θ̂_MLE = αH / (αH + αT)

That's exactly the "frequency of heads".
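The closed-form MLE can be checked numerically by maximizing the log-likelihood over a grid; with 3 heads and 2 tails (as in the coin example) both give 3/5. The grid resolution is an arbitrary choice for this sketch.

```python
# Maximize the Bernoulli log-likelihood αH·log θ + αT·log(1−θ) on a grid
# and compare with the closed form αH / (αH + αT), the "frequency of heads".
from math import log

alpha_h, alpha_t = 3, 2              # 3 heads, 2 tails

def log_likelihood(theta):
    return alpha_h * log(theta) + alpha_t * log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)
theta_closed = alpha_h / (alpha_h + alpha_t)   # 3/5

assert abs(theta_grid - theta_closed) < 1e-3
assert theta_closed == 0.6
```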
How many flips do I need? Hoeffding's inequality bounds the probability that the estimate is far from the true parameter θ*:

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2 e^(−2nε²)
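Inverting the Hoeffding bound answers "how many flips?": to get error at most ε with probability at least 1 − δ, it suffices that n ≥ ln(2/δ) / (2ε²). A short calculator sketch; the ε and δ values are illustrative.

```python
# Sample-size bound from Hoeffding: 2·exp(−2nε²) ≤ δ  ⟺  n ≥ ln(2/δ)/(2ε²).
from math import ceil, exp, log

def flips_needed(eps, delta):
    return ceil(log(2 / delta) / (2 * eps**2))

n = flips_needed(eps=0.1, delta=0.05)
assert n == 185                       # ln(40) / 0.02 ≈ 184.4, rounded up

# Sanity check: with this n the Hoeffding bound is indeed ≤ δ.
assert 2 * exp(-2 * n * 0.1**2) <= 0.05
```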
[Figure: Gaussian densities with mean µ = 0 and different variances σ²]
MLE for the Gaussian: choose θ = (µ, σ²) that maximizes the probability of observed data (independent draws, identically distributed):

µ̂_MLE = (1/n) Σᵢ xᵢ,   σ̂²_MLE = (1/n) Σᵢ (xᵢ − µ̂)²
The MLE for σ² is biased: E[σ̂²_MLE] = ((n−1)/n) σ² ≠ σ². [Expected result of estimation is not the true parameter!] Unbiased variance estimator:

σ̂²_unbiased = (1/(n−1)) Σᵢ (xᵢ − µ̂)²
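The bias of the Gaussian variance MLE shows up clearly in simulation: averaged over many small datasets, the 1/n estimator undershoots σ² by roughly the factor (n−1)/n, while the 1/(n−1) version does not. A simulation sketch; the seed, sample size, and trial count are arbitrary choices.

```python
# Simulate many n-sample Gaussian datasets and compare the average of the
# biased MLE variance (divide by n) with the unbiased one (divide by n−1).
import random

random.seed(0)
mu_true, sigma2_true, n, trials = 0.0, 4.0, 5, 20000

mle_vars, unbiased_vars = [], []
for _ in range(trials):
    xs = [random.gauss(mu_true, sigma2_true**0.5) for _ in range(n)]
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    mle_vars.append(ss / n)           # biased: E ≈ (n−1)/n · σ²
    unbiased_vars.append(ss / (n - 1))

avg_mle = sum(mle_vars) / trials
avg_unbiased = sum(unbiased_vars) / trials
assert abs(avg_mle - (n - 1) / n * sigma2_true) < 0.15   # ≈ 3.2
assert abs(avg_unbiased - sigma2_true) < 0.15            # ≈ 4.0
```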
slide by Barnabás Póczos & Aarti Singh
We know the coin is "close" to 50-50. What can we do now?

Rather than estimating a single θ, we obtain a distribution over possible values of θ.

[Figure: belief over θ concentrated near 50-50 before data; sharper after data]
How to choose the prior?
− Represents expert knowledge (philosophical approach)
− Simple posterior form (engineer's approach)

Uninformative prior:
− Uniform distribution

Conjugate prior:
− Closed-form representation of posterior
− P(θ) and P(θ|D) have the same form
Chain rule: P(X, Y) = P(X | Y) P(Y)
Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)
Bayes rule is important for reverse conditioning.

P(θ | D) ∝ P(D | θ) P(θ)   (posterior ∝ likelihood × prior)
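Reverse conditioning with Bayes rule can be illustrated with the headache/flu numbers from earlier in the lecture: from P(Headache | Flu) = 1/2, P(Flu) = 1/40, and P(Headache) = 1/10 we recover P(Flu | Headache).

```python
# Bayes rule: P(Flu | Headache) = P(Headache | Flu) · P(Flu) / P(Headache).
from fractions import Fraction

p_h_given_f = Fraction(1, 2)
p_f = Fraction(1, 40)
p_h = Fraction(1, 10)

p_f_given_h = p_h_given_f * p_f / p_h
assert p_f_given_h == Fraction(1, 8)
```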
Coin flip problem: the likelihood is Binomial, P(D | θ) ∝ θ^αH (1−θ)^αT.

If the prior is a Beta distribution, P(θ) = Beta(θ; βH, βT) ∝ θ^(βH−1) (1−θ)^(βT−1), then the posterior is a Beta distribution: P(θ | D) = Beta(θ; βH + αH, βT + αT).

P(θ) and P(θ|D) have the same form! [Conjugate prior]

The Beta distribution becomes more concentrated as the values of α, β increase.
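The conjugate update above is just addition of counts. The sketch below verifies the closed-form posterior mean of Beta(βH + αH, βT + αT) against a brute-force grid computation; the prior pseudo-counts and data counts are illustrative.

```python
# Beta-Binomial conjugacy: prior Beta(βH, βT) + data (αH heads, αT tails)
# gives posterior Beta(βH + αH, βT + αT). Check its mean on a grid.
beta_h, beta_t = 5, 5                 # prior pseudo-counts ("close to 50-50")
alpha_h, alpha_t = 3, 2               # observed heads and tails

post_h, post_t = beta_h + alpha_h, beta_t + alpha_t   # Beta(8, 7)
posterior_mean = post_h / (post_h + post_t)           # 8/15

# Grid check: posterior ∝ θ^(post_h − 1) (1−θ)^(post_t − 1)
grid = [(i + 0.5) / 10000 for i in range(10000)]
w = [t ** (post_h - 1) * (1 - t) ** (post_t - 1) for t in grid]
z = sum(w)
grid_mean = sum(t * wi for t, wi in zip(grid, w)) / z
assert abs(grid_mean - posterior_mean) < 1e-4
```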
As we get more samples, the effect of the prior is "washed out": as n = αH + αT increases, the data dominate the prior.
C-3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1! Han: Never tell me the odds!

https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors
When is MAP the same as MLE?

Maximum likelihood estimation (MLE): choose the value that maximizes the probability of observed data:
θ̂_MLE = arg max_θ P(D | θ)

Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief:
θ̂_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

MAP is the same as MLE when the prior P(θ) is uniform.

Example: dice roll problem (6 outcomes instead of 2). The likelihood is Multinomial(θ = {θ1, θ2, ..., θk}). If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution: for the Multinomial, the conjugate prior is the Dirichlet distribution. http://en.wikipedia.org/wiki/Dirichlet_distribution
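For the coin flip with a Beta(βH, βT) prior, the MAP estimate has the closed form (αH + βH − 1) / (αH + αT + βH + βT − 2), which reduces to the MLE under a uniform Beta(1, 1) prior. A small comparison sketch; the counts are illustrative.

```python
# MAP vs. MLE for the Bernoulli parameter with a Beta prior.
def mle(a_h, a_t):
    return a_h / (a_h + a_t)

def map_est(a_h, a_t, b_h, b_t):
    # Mode of Beta(a_h + b_h, a_t + b_t), valid when both parameters > 1.
    return (a_h + b_h - 1) / (a_h + a_t + b_h + b_t - 2)

a_h, a_t = 3, 2                       # 3 heads, 2 tails

# An informative Beta(5, 5) prior pulls the estimate toward 0.5:
assert map_est(a_h, a_t, 5, 5) == 7 / 13

# With a uniform Beta(1, 1) prior, MAP coincides with MLE:
assert map_est(a_h, a_t, 1, 1) == mle(a_h, a_t) == 0.6
```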
Criticisms of the two approaches: MLE, "you are no good when the sample is small"; MAP, "you give a different answer for different priors".