Aykut Erdem // Hacettepe University // Fall 2019
Lecture 7:
Probability Review (cont’d.) Maximum Likelihood Estimation (MLE)
BBM406
Fundamentals of Machine Learning
photo: Chessex Borealis™ Aquerple Polyhedral
Administrative: Project
− problem to be investigated,
− why it is interesting,
− what data you will use,
− related work.
Deadlines in the syllabus are closer than they appear
Independence
Def: A sample space Ω is the set of all possible outcomes of a (conceptual or physical) random experiment. (Ω can be finite or infinite.)
Examples:
dice roll: Ω = {1, 2, 3, 4, 5, 6}
We will ask the question: What is the probability of a particular event?
Def: An event A is a subset of the sample space Ω.
P(A) is the volume of the region of the sample space in which A is true.
slide by Barnabás Póczos & Alex Smola
Example: What is the probability that the number on the dice is 2 or 4?
Event A = {2, 4}; the remaining outcomes are {1, 3, 5, 6}.
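A minimal sketch of the dice example, assuming a fair six-sided die so that every outcome has probability 1/6 (the slide does not state fairness). The helper name `prob` is illustrative only.

```python
from fractions import Fraction

# Sample space of one roll of a six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event, sample_space):
    """P(event), ASSUMING all outcomes in the sample space are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4}  # the event "the number on the dice is 2 or 4"
p_A = prob(A, omega)
print(p_A)  # 1/3
```

Under the fairness assumption the answer is simply the number of outcomes in A divided by the number of outcomes in Ω.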
Def: Probability P(A), the probability that event (subset) A happens, is a function that maps the event A onto the interval [0, 1]. P(A) is also called the probability measure of A.
Example:
Last time… Kolmogorov Axioms
Consequences:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
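The union rule can be checked numerically on a small example. The sketch below assumes a fair die, and the second event B is a hypothetical choice made only for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # fair die assumed: each outcome has probability 1/6

def prob(event):
    """P(event) under the equal-likelihood assumption."""
    return Fraction(len(event), len(omega))

A = {2, 4}     # "the number is 2 or 4"
B = {4, 5, 6}  # "the number is at least 4" (hypothetical second event)

lhs = prob(A | B)                      # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) − P(A ∩ B)
print(lhs, rhs)  # 2/3 2/3
```

The subtraction removes the outcome 4, which would otherwise be counted once in P(A) and once in P(B).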
Last time… Random Variables
Def: A real-valued random variable is a function of the outcome of a random experiment.
Examples:
whether a person (ω) drawn at random from our class (Ω) is female
Last time… Discrete Distributions
Suppose a coin with head probability p is tossed n times. What is the probability of getting k heads and n−k tails?
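The answer shown on the original slide was not recovered. The standard answer is the binomial distribution, P(k heads) = C(n, k) p^k (1−p)^(n−k); the sketch below assumes that formula, and the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent tosses with head prob. p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example values (chosen for illustration): 3 heads in 5 tosses of a fair coin.
p_3_of_5 = binomial_pmf(3, 5, 0.5)
print(p_3_of_5)  # 10 / 32 = 0.3125

# The probabilities over k = 0..n sum to one.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
```

The factor `comb(n, k)` counts the orderings in which the k heads can occur; each single ordering has probability p^k (1−p)^(n−k).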
Last time… Conditional Probability
P(X|Y) = Fraction of worlds in which event X is true given that event Y is true: P(X|Y) = P(X, Y) / P(Y).

              Flu     No Flu
Headache      1/80    7/80
No Headache   1/80    71/80
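A sketch of how a conditional probability follows from the four joint values 1/80, 7/80, 1/80, 71/80. The assignment of those values to cells (Headache/No Headache as rows, Flu/No Flu as columns) is an assumption about the flattened table, and the function names are illustrative.

```python
from fractions import Fraction

# Joint distribution P(Headache, Flu). ASSUMED layout: rows are Headache /
# No Headache, columns are Flu / No Flu.
joint = {
    ("headache", "flu"): Fraction(1, 80),
    ("headache", "no flu"): Fraction(7, 80),
    ("no headache", "flu"): Fraction(1, 80),
    ("no headache", "no flu"): Fraction(71, 80),
}

def marginal_flu(flu):
    """P(Flu = flu), obtained by summing the joint over the headache values."""
    return sum(p for (h, f), p in joint.items() if f == flu)

def conditional(headache, flu):
    """P(Headache = headache | Flu = flu) = P(headache, flu) / P(flu)."""
    return joint[(headache, flu)] / marginal_flu(flu)

p_flu = marginal_flu("flu")
p_h_given_flu = conditional("headache", "flu")
print(p_flu, p_h_given_flu)  # 1/40 1/2
```

Under this layout, conditioning on Flu restricts attention to the Flu column and renormalizes it so that it sums to one.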
Independent random variables: P(X, Y) = P(X) P(Y), equivalently P(X|Y) = P(X).
Y and X don’t contain information about each other. Observing Y doesn’t help predicting X. Observing X doesn’t help predicting Y.
Examples:
Independent: Winning on roulette this week and next week.
Dependent: Russian roulette
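Independence of two discrete random variables can be tested directly from a joint table by comparing every cell with the product of its marginals. The sketch below uses two hypothetical joint tables; the function name `is_independent` is illustrative.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """True if P(x, y) = P(x) P(y) for every pair (x, y); missing cells count as 0."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(xs, ys))

# Two fair coin tosses: every pair of outcomes has probability 1/4.
two_coins = {(a, b): Fraction(1, 4) for a, b in product("HT", repeat=2)}

# A fully dependent pair: the second variable always copies the first.
copied = {("H", "H"): Fraction(1, 2), ("T", "T"): Fraction(1, 2)}

print(is_independent(two_coins), is_independent(copied))  # True False
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.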
[Figure: scatter plots of Y against X for independent X, Y and for dependent X, Y]
Conditionally independent: Knowing Z makes X and Y independent
Examples:
Dependent: shoe size of children and reading skills
Conditionally independent: shoe size of children and reading skills given age
Storks deliver babies: Highly statistically significant correlation exists between stork populations and human birth rates across Europe.
A survey found a significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder movements of drivers and be the cause of accidents, and drivers were to be discouraged from wearing coats when driving.
Finally, another study pointed out that people wear coats when it rains…
Number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in
Correlation: 0.666004
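Formally: X is conditionally independent of Y given Z:
\[ P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z) \]
Equivalent to:
\[ P(X \mid Y, Z) = P(X \mid Z) \]
Note: P(Thunder | Rain, Lightning) = P(Thunder | Lightning) does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn’t give more info about Thunder.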
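The Thunder/Rain/Lightning note can be illustrated with a small joint distribution. All numbers below are hypothetical; the joint is built so that Rain and Thunder each depend only on Lightning, which is one way (an assumption of this sketch) to obtain conditional independence.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers: Lightning (l) influences both Rain (r) and Thunder (t).
p_l = {1: F(1, 5), 0: F(4, 5)}           # P(L = l)
p_r_given_l = {1: F(9, 10), 0: F(1, 5)}  # P(R = 1 | L = l)
p_t_given_l = {1: F(4, 5), 0: F(1, 20)}  # P(T = 1 | L = l)

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint built as P(l) P(r | l) P(t | l).
joint = {
    (l, r, t): p_l[l] * bern(p_r_given_l[l], r) * bern(p_t_given_l[l], t)
    for l, r, t in product((0, 1), repeat=3)
}

def prob(**fixed):
    """Sum the joint over all entries matching the fixed values, e.g. prob(l=1, t=1)."""
    return sum(p for (l, r, t), p in joint.items()
               if all({"l": l, "r": r, "t": t}[k] == v for k, v in fixed.items()))

# Given Lightning, knowing Rain does not change the probability of Thunder.
for l in (0, 1):
    assert prob(t=1, r=1, l=l) / prob(r=1, l=l) == prob(t=1, l=l) / prob(l=l)

# Without conditioning on Lightning, Rain does carry information about Thunder.
p_t = prob(t=1)
p_t_given_r = prob(t=1, r=1) / prob(r=1)
print(p_t, p_t_given_r)
```

The two printed values differ, so Thunder and Rain are dependent overall even though they are independent once Lightning is known.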
Estimating Probabilities
I have a coin. If I flip it, what’s the probability that it will fall with the head up?
Let us flip it a few times to estimate the probability: 3 heads, 2 tails.
The estimated probability is: 3/5 (“Frequency of heads”)
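The frequency-of-heads estimate can be sketched in a few lines. The order of the flips below is hypothetical (the original picture of the flips was not recovered); only the counts, 3 heads and 2 tails, matter.

```python
from fractions import Fraction

# A hypothetical record of five flips with 3 heads and 2 tails.
flips = ["H", "T", "H", "H", "T"]

def frequency_of_heads(flips):
    """Number of heads divided by the total number of flips."""
    return Fraction(flips.count("H"), len(flips))

estimate = frequency_of_heads(flips)
print(estimate)  # 3/5
```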
Questions:
(1) Why frequency of heads?
(2) How good is this estimation?
(3) Why is this a machine learning problem?
We are going to answer these questions.
Why frequency of heads?
Frequency of heads is exactly the maximum likelihood estimator for this problem.
MLE has nice properties (interpretation, statistical guarantees, simple).
Maximum Likelihood Estimation
Flips are i.i.d.:
– Independent events
– Identically distributed according to Bernoulli distribution
Data, D = the observed sequence of flips
P(Heads) = θ, P(Tails) = 1−θ
MLE: Choose θ that maximizes the probability of observed data
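MLE: Choose θ that maximizes the probability of observed data
\[
\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max_{\theta} P(D \mid \theta) \\
&= \arg\max_{\theta} \prod_{i=1}^{n} P(X_i \mid \theta) && \text{(independent draws)} \\
&= \arg\max_{\theta} \prod_{i:\, X_i = H} \theta \prod_{i:\, X_i = T} (1-\theta) && \text{(identically distributed)} \\
&= \arg\max_{\theta} \theta^{\alpha_H} (1-\theta)^{\alpha_T}
\end{aligned}
\]
where α_H is the number of heads and α_T is the number of tails in D.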
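The two i.i.d. steps can be made concrete: because the draws are independent the likelihood is a product over flips, and because they are identically distributed each factor is θ or 1−θ, so the product collapses to θ^α_H (1−θ)^α_T. The flip sequence and function names below are hypothetical.

```python
import math

flips = ["H", "T", "H", "H", "T"]  # hypothetical sequence: 3 heads, 2 tails

def likelihood(theta, flips):
    """P(D | theta) written as a product over the individual i.i.d. flips."""
    return math.prod(theta if x == "H" else 1 - theta for x in flips)

def likelihood_from_counts(theta, alpha_h, alpha_t):
    """The same likelihood written with the counts of heads and tails only."""
    return theta**alpha_h * (1 - theta)**alpha_t

alpha_h, alpha_t = flips.count("H"), flips.count("T")
for theta in (0.2, 0.5, 0.6):
    assert math.isclose(likelihood(theta, flips),
                        likelihood_from_counts(theta, alpha_h, alpha_t))

print(likelihood(0.5, flips))  # 0.5**5 = 0.03125
```

Only the counts enter the second form, so the order of the flips does not affect the likelihood.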
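MLE: Choose θ that maximizes the probability of observed data
\[
\hat{\theta}_{MLE} = \arg\max_{\theta} \theta^{\alpha_H} (1-\theta)^{\alpha_T}
= \arg\max_{\theta} \left[ \alpha_H \ln\theta + \alpha_T \ln(1-\theta) \right]
\]
where α_H and α_T are the numbers of heads and tails. Setting the derivative with respect to θ to zero:
\[
\frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}
\]
That’s exactly the “Frequency of heads”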
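As a sanity check, the closed form α_H / (α_H + α_T) can be compared against a brute-force search over θ. The grid search is only an illustration, not the method used on the slides; the function names and the grid resolution are assumptions.

```python
import math

def log_likelihood(theta, alpha_h, alpha_t):
    """Log of theta**alpha_h * (1 - theta)**alpha_t, for 0 < theta < 1."""
    return alpha_h * math.log(theta) + alpha_t * math.log(1 - theta)

def mle_closed_form(alpha_h, alpha_t):
    """Frequency of heads."""
    return alpha_h / (alpha_h + alpha_t)

def mle_grid(alpha_h, alpha_t, steps=1000):
    """Brute-force maximizer over theta = 1/steps, ..., (steps - 1)/steps."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda th: log_likelihood(th, alpha_h, alpha_t))

theta_hat = mle_closed_form(3, 2)
theta_grid = mle_grid(3, 2)
print(theta_hat, theta_grid)
```

Maximizing the log-likelihood gives the same θ as maximizing the likelihood because the logarithm is strictly increasing.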
I flipped the coin 5 times: 3 heads, 2 tails. What if I flipped 30 heads and 20 tails?
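Let θ* be the true parameter. For n = α_H + α_T and \(\hat{\theta}_{MLE} = \alpha_H / (\alpha_H + \alpha_T)\), and for any ε > 0,
Hoeffding’s inequality:
\[
P\left( \left| \hat{\theta}_{MLE} - \theta^{*} \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^{2}}
\]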
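To get a feel for the bound, the sketch below evaluates the standard two-sided Hoeffding bound 2·exp(−2nε²) for a few sample sizes. The values of n and ε = 0.1 are chosen for illustration, and the function name is hypothetical.

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound 2 * exp(-2 * n * eps**2) on P(|theta_hat - theta*| >= eps)."""
    return 2 * math.exp(-2 * n * eps**2)

b5 = hoeffding_bound(5, 0.1)      # about 1.81: larger than 1, so it says nothing
b50 = hoeffding_bound(50, 0.1)    # about 0.736
b500 = hoeffding_bound(500, 0.1)  # about 9.1e-05
print(b5, b50, b500)
```

With 5 flips the bound is vacuous, while with 50 flips it already starts to constrain the error, which is why 30 heads and 20 tails is more convincing than 3 heads and 2 tails even though the estimate is the same.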
Probably Approximately Correct (PAC) Learning
I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1−δ = 0.95. How many flips do I need?
Sample complexity:
\[
2 e^{-2 n \varepsilon^{2}} \le \delta
\quad\Longleftrightarrow\quad
n \ge \frac{\ln(2/\delta)}{2 \varepsilon^{2}}
\]
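A short sketch of the sample-complexity calculation for ε = 0.1 and δ = 0.05, assuming the bound 2·exp(−2nε²) ≤ δ is solved for n. The function name `flips_needed` is illustrative.

```python
import math

def flips_needed(eps, delta):
    """Smallest integer n with 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

n = flips_needed(eps=0.1, delta=0.05)
print(n)  # 185
```

Since ln(2/0.05) / (2 · 0.1²) ≈ 184.4, the smallest whole number of flips that satisfies the guarantee is 185.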
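Why is this a machine learning problem?
− improve their performance (accuracy of the predicted prob.)
− at some task (predicting the probability of heads)
− with experience (the more coins we flip, the better we are)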
What about continuous features?
Let us try Gaussians…
[Figure: Gaussian densities, annotated with mean µ = 0 and variance σ²]
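MLE for Gaussian mean and variance
Choose θ = (µ, σ²) that maximizes the probability of observed data
\[
\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max_{\theta} P(D \mid \theta) \\
&= \arg\max_{\theta} \prod_{i=1}^{n} P(X_i \mid \theta) && \text{(independent draws)} \\
&= \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left( -\frac{(X_i - \mu)^{2}}{2\sigma^{2}} \right) && \text{(identically distributed)}
\end{aligned}
\]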
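\[
\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} X_i,
\qquad
\hat{\sigma}^{2}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu}_{MLE})^{2}
\]
Note: MLE for the variance of a Gaussian is biased
[Expected result of estimation is not the true parameter!]
Unbiased variance estimator:
\[
\hat{\sigma}^{2}_{unbiased} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{\mu}_{MLE})^{2}
\]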
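The difference between the two variance estimators is only the divisor: n for the MLE and n − 1 for the unbiased estimator. The sketch below uses a hypothetical sample and cross-checks against Python's `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1.

```python
import statistics

# Hypothetical sample; any list of real-valued observations works.
data = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
n = len(data)

mu_mle = sum(data) / n                           # MLE of the mean
squared_dev = sum((x - mu_mle) ** 2 for x in data)
var_mle = squared_dev / n                        # MLE of the variance (biased)
var_unbiased = squared_dev / (n - 1)             # unbiased variance estimator

# Cross-check against the standard library's population and sample variance.
assert abs(var_mle - statistics.pvariance(data)) < 1e-12
assert abs(var_unbiased - statistics.variance(data)) < 1e-12
print(mu_mle, var_mle, var_unbiased)
```

The MLE of the variance is always smaller than the unbiased estimate by the factor (n − 1)/n, so the gap shrinks as the sample grows.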
MAP estimation Naïve Bayes Classifier