[PPT] - Lecture 7: Probability Review (contd.) Maximum Likelihood PowerPoint Presentation

SLIDE 1

Lecture 7:

− Probability Review (cont’d.)
− Maximum Likelihood Estimation (MLE)

Aykut Erdem

November 2018 Hacettepe University

SLIDE 2

Administrative

Assignment 2 will be out tonight

− It is due November 24 (i.e. in 2 weeks)
− You will implement a Naive Bayes classifier for fake news detection

SLIDE 3

Administrative

Project proposal due November 16
A half-page description covering:

− the problem to be investigated,
− why it is interesting,
− what data you will use,
− related work.

SLIDE 4

Deadlines in the syllabus are closer than they appear

SLIDE 5

Today

Probabilities
− Dependence, Independence, Conditional Independence

Parameter estimation
− Maximum Likelihood Estimation (MLE)
− Maximum a Posteriori (MAP)

SLIDE 6

Last time… Sample space

Examples:

− Ω may be the set of all possible outcomes of a dice roll (1,2,3,4,5,6)
− Pages of a book opened randomly (1-157)
− Real numbers for temperature, location, time, etc.

Def: A sample space Ω is the set of all possible outcomes of a (conceptual or physical) random experiment. (Ω can be finite or infinite.)

slide by Barnabás Póczos & Alex Smola

SLIDE 7

Last time… Events

We will ask the question: What is the probability of a particular event?

Def: Event A is a subset of the sample space Ω

Examples: What is the probability that

− the book is open at an odd-numbered page
− a dice roll gives a number < 4
− a random person’s height X satisfies a < X < b

slide by Barnabás Póczos & Alex Smola

SLIDE 8

Last time… Probability

Def: Probability P(A), the probability that event (subset) A happens, is a function that maps the event A onto the interval [0, 1]. P(A) is also called the probability measure of A.

[Figure: the sample space, split into outcomes in which A is true and outcomes in which A is false. P(A) is the volume of the area in which A is true.]

Example: What is the probability that the number on the dice is 2 or 4?

[Figure: the outcomes of a dice roll split into {2,4} and {1,3,5,6}]

slide by Barnabás Póczos & Alex Smola
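
The answer to the dice example can be checked by counting outcomes. This is a minimal sketch that assumes a fair six-sided dice (the slide does not say so explicitly); `probability` is a hypothetical helper, not something from the lecture.

```python
from fractions import Fraction

# Sample space of one dice roll; assumed fair, so all outcomes are equally likely.
omega = {1, 2, 3, 4, 5, 6}

def probability(event, sample_space):
    """P(event) under the uniform measure: |event ∩ Ω| / |Ω|."""
    return Fraction(len(event & sample_space), len(sample_space))

event_a = {2, 4}                   # "the number on the dice is 2 or 4"
p_a = probability(event_a, omega)
print(p_a)                         # 1/3
```

Using `Fraction` keeps the result exact, which makes it easy to compare with a hand calculation.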

SLIDE 9

Last time… Kolmogorov Axioms

Consequences:

slide by Barnabás Póczos & Alex Smola
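
The formulas of this slide did not survive extraction. For reference, the standard textbook statement of the Kolmogorov axioms and their usual consequences is sketched below; the exact selection shown on the original slide is an assumption.

```latex
% Axioms
P(A) \ge 0 \quad \text{for every event } A \\
P(\Omega) = 1 \\
P\Big(\bigcup_{i} A_i\Big) = \sum_{i} P(A_i)
  \quad \text{for pairwise disjoint events } A_1, A_2, \dots

% Consequences
P(\emptyset) = 0 \\
P(A^{c}) = 1 - P(A) \\
P(A \cup B) = P(A) + P(B) - P(A \cap B)
```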

SLIDE 10

Last time… Venn Diagram

[Figure: Venn diagram of two overlapping events A and B]

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

slide by Barnabás Póczos & Alex Smola

SLIDE 11

Last time… Random Variables

Def: A real-valued random variable is a function of the outcome of a randomized experiment.

Discrete random variable examples (Ω is discrete):

− X(ω) = True if a randomly drawn person (ω) from our class (Ω) is female
− X(ω) = The hometown X(ω) of a randomly drawn person (ω) from our class (Ω)

slide by Barnabás Póczos & Alex Smola

SLIDE 12

Last time… Discrete Distributions

Bernoulli distribution: Ber(p)

Binomial distribution: Bin(n,p)

Suppose a coin with head prob. p is tossed n times. What is the probability of getting k heads and n-k tails?

slide by Barnabás Póczos & Alex Smola
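
The question about k heads in n tosses is answered by the binomial probability mass function, C(n, k) p^k (1 − p)^(n − k). A minimal sketch of both distributions follows; the function names are illustrative and not from the lecture.

```python
from math import comb

def bernoulli_pmf(x, p):
    """P(X = x) for X ~ Ber(p), where x is 1 (head) or 0 (tail)."""
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(k heads and n - k tails) for Bin(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 heads in 5 tosses of a fair coin.
print(binomial_pmf(3, 5, 0.5))   # 0.3125
```

`math.comb` requires Python 3.8 or newer.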

SLIDE 16

Last time… Conditional Probability

P(X|Y) = Fraction of worlds in which event X is true given that event Y is true.

[Figure: Venn diagram of events X and Y with overlap XY]

              Flu     No Flu
Headache      1/80    7/80
No Headache   1/80    71/80

slide by Barnabás Póczos & Alex Smola
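
Conditional probabilities follow from the joint probabilities by dividing by a marginal, P(X|Y) = P(X, Y) / P(Y). The sketch below assumes the four fractions on the slide are the joint probabilities of (Headache, Flu), (Headache, No Flu), (No Headache, Flu) and (No Headache, No Flu), in that order; this reading is an assumption, and `marginal` is a hypothetical helper.

```python
from fractions import Fraction

# Joint probabilities; the assignment of fractions to cells is an ASSUMPTION
# about how the flattened slide table should be read.
joint = {
    ("headache", "flu"): Fraction(1, 80),
    ("headache", "no flu"): Fraction(7, 80),
    ("no headache", "flu"): Fraction(1, 80),
    ("no headache", "no flu"): Fraction(71, 80),
}

def marginal(index, value):
    """Sum the joint over all cells whose coordinate `index` equals `value`."""
    return sum(p for key, p in joint.items() if key[index] == value)

p_headache = marginal(0, "headache")
p_flu = marginal(1, "flu")

# P(X|Y) = P(X, Y) / P(Y)
p_flu_given_headache = joint[("headache", "flu")] / p_headache
p_headache_given_flu = joint[("headache", "flu")] / p_flu

print(p_flu_given_headache)   # 1/8
print(p_headache_given_flu)   # 1/2
```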

SLIDE 18

Independence

Y and X don’t contain information about each other.
Observing Y doesn’t help predicting X.
Observing X doesn’t help predicting Y.

Examples:

− Independent: Winning on roulette this week and next week.
− Dependent: Russian roulette

Independent random variables:

slide by Barnabás Póczos & Alex Smola
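
The defining formula for independent random variables is missing from the extracted slide. The standard definition, which matches the verbal description above, is:

```latex
P(X, Y) = P(X)\, P(Y)
\qquad \text{equivalently (when } P(Y) > 0\text{):} \qquad
P(X \mid Y) = P(X)
```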

SLIDE 19

Dependent / Independent

[Figure: two panels plotting X against Y, labeled “Independent X,Y” and “Dependent X,Y”]

slide by Barnabás Póczos & Alex Smola

SLIDE 20

Conditionally Independent

Conditionally independent: Knowing Z makes X and Y independent

Examples:

− Dependent: shoe size of children and reading skills
− Conditionally independent: shoe size of children and reading skills given age
− Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

slide by Barnabás Póczos & Alex Smola

SLIDE 21

Conditionally Independent

London taxi drivers: A survey pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

Finally, another study pointed out that people wear coats when it rains…

slide by Barnabás Póczos & Alex Smola

SLIDE 22

Correlation ≠ Causation

The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in


Correlation: 0.666004

SLIDE 23

Conditional Independence

Formally: X is conditionally independent of Y given Z

Equivalent to:

Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn’t give more info about Thunder.

slide by Barnabás Póczos & Alex Smola
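
The formulas of this slide were lost in extraction. A standard way to write the definition, its equivalent product form, and the Thunder/Rain/Lightning example from the note is sketched below; which of the two forms the original slide labeled "formally" is an assumption.

```latex
% X is conditionally independent of Y given Z
P(X \mid Y, Z) = P(X \mid Z)

% Equivalent product form
P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)

% Example from the note
P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})
```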

SLIDE 26

Conditional vs. Marginal Independence

C calls A and B separately and tells them a number n ∈ {1,...,10}.
Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was.
A thinks the number was na and B thinks it was nb.

Are na and nb marginally independent?
− No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)

Are na and nb conditionally independent given n?
− Yes, because if we know the true number, the outcomes na and nb are purely determined by the noise in each phone: P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

[Figure: diagram relating n, na and nb]

slide by Barnabás Póczos & Alex Smola
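
The two claims can be verified exactly on a toy noise model. The sketch below makes assumptions that are not in the lecture: C picks n uniformly, and each listener hears the correct number with a hypothetical probability `Q` and otherwise concludes one of the nine wrong numbers uniformly at random.

```python
from fractions import Fraction

NUMBERS = range(1, 11)      # n ∈ {1,...,10}
Q = Fraction(4, 5)          # hypothetical chance of hearing the number correctly
P_N = Fraction(1, 10)       # assumed: C picks n uniformly

def p_heard(heard, true_n):
    """P(listener concludes `heard` | true number); same noise model for A and B."""
    return Q if heard == true_n else (1 - Q) / 9

# Marginal P(na = 1): sum over the unknown true number n.
p_na1 = sum(P_N * p_heard(1, n) for n in NUMBERS)
p_nb1 = p_na1               # A and B share the same noise model

# Joint P(na = 1, nb = 1): the two phones are independent only GIVEN n.
p_na1_nb1 = sum(P_N * p_heard(1, n) * p_heard(1, n) for n in NUMBERS)
p_na1_given_nb1 = p_na1_nb1 / p_nb1

# Conditional on n = 2, knowing nb = 1 changes nothing about na.
p_na1_given_nb1_n2 = (p_heard(1, 2) * p_heard(1, 2)) / p_heard(1, 2)
p_na1_given_n2 = p_heard(1, 2)

print(p_na1, p_na1_given_nb1)                 # 1/10 29/45
print(p_na1_given_nb1_n2 == p_na1_given_n2)   # True
```

Learning that B concluded 1 makes it much more likely that the true number was 1, which is why na and nb are marginally dependent even though the noise in each phone is independent.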

SLIDE 27

Parameter estimation: MLE, MAP

Estimating Probabilities

slide by Barnabás Póczos & Alex Smola

SLIDE 28

Flipping a Coin

I have a coin, if I flip it, what’s the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability:

The estimated probability is: 3/5 (“Frequency of heads”)

slide by Barnabás Póczos & Alex Smola
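
The "frequency of heads" estimate is just a count divided by the number of flips. In the sketch below the particular sequence of flips is hypothetical; it is chosen only to match the 3 heads and 2 tails used in the lecture.

```python
# Hypothetical flip sequence with 3 heads and 2 tails.
flips = ["H", "T", "H", "H", "T"]

heads = flips.count("H")
theta_hat = heads / len(flips)    # frequency of heads

print(theta_hat)                  # 0.6
```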

SLIDE 32

Flipping a Coin

The estimated probability is: 3/5 (“Frequency of heads”)

Questions:

(1) Why frequency of heads???
(2) How good is this estimation???
(3) Why is this a machine learning problem???

We are going to answer these questions

slide by Barnabás Póczos & Alex Smola

SLIDE 33

Question (1)

Why frequency of heads???

− Frequency of heads is exactly the maximum likelihood estimator for this problem
− MLE has nice properties (interpretation, statistical guarantees, simple)

slide by Barnabás Póczos & Alex Smola

SLIDE 34

Maximum Likelihood Estimation

slide by Barnabás Póczos & Alex Smola

SLIDE 35

MLE for Bernoulli distribution

Data, D = the observed sequence of coin flips

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:

– Independent events
– Identically distributed according to Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data

slide by Barnabás Póczos & Alex Smola

SLIDE 39

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data

(The derivation steps use: independent draws, identically distributed)

slide by Barnabás Póczos & Alex Smola
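
The derivation itself did not survive extraction. A standard reconstruction is sketched below, writing αH and αT for the number of observed heads and tails (the notation used later in the lecture) and X_i for the i-th flip, which is a symbol introduced here for illustration.

```latex
\begin{align*}
\hat{\theta}_{MLE}
  &= \arg\max_{\theta} P(D \mid \theta) \\
  &= \arg\max_{\theta} \prod_{i=1}^{n} P(X_i \mid \theta)
     && \text{(independent draws)} \\
  &= \arg\max_{\theta} \theta^{\alpha_H} (1-\theta)^{\alpha_T}
     && \text{(identically distributed)} \\
  &= \arg\max_{\theta} \; \alpha_H \ln\theta + \alpha_T \ln(1-\theta)
\end{align*}

% Setting the derivative of the log-likelihood to zero:
\frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```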

SLIDE 44

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data

That’s exactly the “Frequency of heads”

slide by Barnabás Póczos & Alex Smola
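
The claim can be checked numerically without any calculus. The sketch below assumes the Bernoulli likelihood θ^αH (1 − θ)^αT with αH = 3 heads and αT = 2 tails, and simply searches a grid of candidate θ values; the grid resolution is an arbitrary choice.

```python
from math import log

alpha_h, alpha_t = 3, 2     # heads and tails in the lecture's 5-flip example

def log_likelihood(theta):
    """Log of theta^alpha_h * (1 - theta)^alpha_t for i.i.d. Bernoulli flips."""
    return alpha_h * log(theta) + alpha_t * log(1 - theta)

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded because log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
theta_best = max(grid, key=log_likelihood)

print(theta_best)           # 0.6
```

The maximizer coincides with the frequency of heads, 3/5.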

SLIDE 47

Question (2)

How good is this MLE estimation???

slide by Barnabás Póczos & Alex Smola

SLIDE 48

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails.
What if I flipped 30 heads and 20 tails?

Which estimator should we trust more?
The more the merrier???

slide by Barnabás Póczos & Alex Smola

SLIDE 49

Hoeffding’s inequality:

Let θ* be the true parameter. For n = αH + αT, and for any ε > 0:

slide by Barnabás Póczos & Alex Smola
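
The inequality itself is missing from the extracted slide. The usual form of Hoeffding’s inequality for the coin-flip MLE is sketched here; the estimator symbol is assumed to be the frequency of heads from the previous slides.

```latex
\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T},
\qquad
P\big(\,|\hat{\theta}_{MLE} - \theta^{*}| \ge \varepsilon\,\big) \;\le\; 2\, e^{-2 n \varepsilon^{2}}
```

The bound shrinks exponentially in n, which is why 30 heads and 20 tails deserve more trust than 3 heads and 2 tails even though both give the estimate 3/5.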

SLIDE 50

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ, within ε = 0.1 error, with probability at least 1-δ = 0.95. How many flips do I need?

Sample complexity:

slide by Barnabás Póczos & Alex Smola
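
The sample complexity formula is missing from the extracted slide. Assuming it is obtained from Hoeffding’s bound by requiring 2·exp(−2nε²) ≤ δ, solving for n gives n ≥ ln(2/δ) / (2ε²). A minimal sketch with a hypothetical helper `flips_needed`:

```python
from math import ceil, log

def flips_needed(epsilon, delta):
    """Smallest integer n with 2 * exp(-2 * n * epsilon**2) <= delta."""
    return ceil(log(2 / delta) / (2 * epsilon**2))

# epsilon = 0.1 error with probability at least 1 - delta = 0.95
print(flips_needed(0.1, 0.05))   # 185
```

This is a worst-case guarantee that holds for any θ*, so in practice fewer flips are often enough.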

SLIDE 51

Question (3)

Why is this a machine learning problem???

Machine learning algorithms

− improve their performance (accuracy of the predicted prob.)
− at some task (predicting the probability of heads)
− with experience (the more coins we flip the better we are)

slide by Barnabás Póczos & Alex Smola

SLIDE 52

What about continuous features?

Let us try Gaussians…

[Figure: Gaussian density curves labeled with mean µ = 0 and variance σ²]

slide by Barnabás Póczos & Alex Smola

SLIDE 53

MLE for Gaussian mean and variance

Choose θ = (µ, σ²) that maximizes the probability of observed data

(The derivation steps use: independent draws, identically distributed)

slide by Barnabás Póczos & Alex Smola
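
The formulas of this slide were lost in extraction. The standard likelihood and the resulting maximum likelihood estimates for n i.i.d. Gaussian samples are sketched below; x_1, …, x_n is notation introduced here for the observed data.

```latex
\begin{align*}
\hat{\theta}_{MLE}
  &= \arg\max_{\mu, \sigma^2} \prod_{i=1}^{n}
     \frac{1}{\sqrt{2\pi\sigma^{2}}}
     \exp\!\Big(-\frac{(x_i - \mu)^{2}}{2\sigma^{2}}\Big) \\[4pt]
\hat{\mu}_{MLE} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\
\hat{\sigma}^{2}_{MLE} &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^{2}
\end{align*}
```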

SLIDE 54

MLE for Gaussian mean and variance

Note: MLE for the variance of a Gaussian is biased
[Expected result of estimation is not the true parameter!]

Unbiased variance estimator:

slide by Barnabás Póczos & Alex Smola
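
The two variance estimators differ only in the denominator: the MLE divides the sum of squared deviations by n, while the unbiased estimator divides by n − 1. A minimal sketch on made-up data (the sample values are purely illustrative):

```python
# Illustrative sample; not data from the lecture.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mu_mle = sum(data) / n                              # MLE of the mean
sq_dev = sum((x - mu_mle) ** 2 for x in data)       # sum of squared deviations

var_mle = sq_dev / n                                # MLE of the variance (biased)
var_unbiased = sq_dev / (n - 1)                     # unbiased variance estimator

print(mu_mle, var_mle, var_unbiased)
```

The MLE is always the smaller of the two, and the gap vanishes as n grows.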

SLIDE 55

Next Class:

− MAP estimation
− Naïve Bayes Classifier
