Lecture 7: Probability Review (contd.) Maximum Likelihood - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Lecture 7:

− Probability Review (cont’d.)
− Maximum Likelihood Estimation (MLE)

Aykut Erdem

November 2018 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 2 will be out tonight

− It is due November 24 (i.e. in 2 weeks)
− You will implement a Naive Bayes classifier for fake news detection

slide-3
SLIDE 3

Administrative

  • Project proposal due November 16
  • A half page description

− problem to be investigated,
− why it is interesting,
− what data you will use,
− related work.

slide-4
SLIDE 4

Deadlines in the syllabus are closer than they appear

slide-5
SLIDE 5

Today

  • Probabilities
  • Dependence, Independence, Conditional Independence
  • Parameter estimation
  • Maximum Likelihood Estimation (MLE)
  • Maximum a Posteriori (MAP)

slide-6
SLIDE 6

Last time… Sample space

Def: A sample space Ω is the set of all possible outcomes of a (conceptual or physical) random experiment. (Ω can be finite or infinite.)

Examples:

  • Ω may be the set of all possible outcomes of a dice roll: {1, 2, 3, 4, 5, 6}
  • Pages of a book opened randomly (1-157)
  • Real numbers for temperature, location, time, etc.

slide by Barnabás Póczos & Alex Smola

slide-7
SLIDE 7

Last time… Events

We will ask the question: What is the probability of a particular event?

Def: Event A is a subset of the sample space Ω

Examples: What is the probability of

  • the book opening at an odd-numbered page
  • rolling a die and getting a number < 4
  • a random person’s height X satisfying a < X < b

slide by Barnabás Póczos & Alex Smola

slide-8
SLIDE 8
Last time… Probability

Def: Probability P(A), the probability that event (subset) A happens, is a function that maps the event A onto the interval [0, 1]. P(A) is also called the probability measure of A.

[Figure: the sample space divided into the outcomes in which A is true and the outcomes in which A is false. P(A) is the volume of the area.]

Example: What is the probability that the number on the dice is 2 or 4? Here A = {2, 4}, and the remaining outcomes are {1, 3, 5, 6}.

slide by Barnabás Póczos & Alex Smola
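
The dice example can be checked by enumerating the sample space. This is a minimal sketch that assumes a fair six-sided die (the slide text does not state the outcome probabilities); the helper name `prob` is illustrative only.

```python
from fractions import Fraction

# Sample space of one roll of a six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event, sample_space=omega):
    """P(event), ASSUMING all outcomes in the sample space are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4}                 # event: the number is 2 or 4
p_A = prob(A)              # 2 favourable outcomes out of 6
p_not_A = prob(omega - A)  # complement event: {1, 3, 5, 6}
print(p_A, p_not_A)
```

Under the fair-die assumption the two probabilities are 1/3 and 2/3, and they sum to 1 as expected for an event and its complement.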

slide-9
SLIDE 9

Last time… Kolmogorov Axioms

Consequences:

slide by Barnabás Póczos & Alex Smola
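
For reference, the axioms in their usual textbook form, together with consequences that are commonly derived from them. This is written in the notation of the earlier slides (events A, B as subsets of Ω) and is a standard statement, not a transcription of the slide.

```latex
% Kolmogorov axioms for a probability measure P on a sample space \Omega
\begin{align*}
&\text{(1)}\quad P(A) \ge 0 \quad \text{for every event } A \subseteq \Omega \\
&\text{(2)}\quad P(\Omega) = 1 \\
&\text{(3)}\quad P\Big(\bigcup_{i} A_i\Big) = \sum_{i} P(A_i)
   \quad \text{for pairwise disjoint events } A_1, A_2, \dots
\end{align*}
% Commonly listed consequences
\begin{align*}
P(\emptyset) &= 0 \\
P(A^{c}) &= 1 - P(A) \\
P(A \cup B) &= P(A) + P(B) - P(A \cap B)
\end{align*}
```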

slide-10
SLIDE 10

Last time… Venn Diagram

[Figure: Venn diagram of two overlapping events A and B]

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

slide by Barnabás Póczos & Alex Smola

slide-11
SLIDE 11

Last time… Random Variables

Def: A real-valued random variable is a function of the outcome of a randomized experiment

Examples:

  • Discrete random variable examples (Ω is discrete):
  • X(ω) = True if a randomly drawn person (ω) from our class (Ω) is female
  • X(ω) = The hometown of a randomly drawn person (ω) from our class (Ω)

slide by Barnabás Póczos & Alex Smola

slide-12
SLIDE 12

Last time… Discrete Distributions

  • Bernoulli distribution: Ber(p)
  • Binomial distribution: Bin(n,p)

Suppose a coin with head prob. p is tossed n times. What is the probability of getting k heads and n-k tails?

slide by Barnabás Póczos & Alex Smola
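
The question above is answered by the binomial probability mass function, P(k) = C(n,k) p^k (1-p)^(n-k). A minimal sketch follows; the function name `binomial_pmf` and the values of n and p are illustrative assumptions, not taken from the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k heads in n independent tosses) for a coin with head probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.5  # illustrative values, not from the slides
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(pmf)
print(sum(pmf))  # probabilities over k = 0..n sum to 1 (up to rounding)
```

With n = 1 the same function reduces to the Bernoulli distribution Ber(p).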

slide-16
SLIDE 16

Last time… Conditional Probability

P(X|Y) = Fraction of worlds in which X event is true given Y event is true.

Joint probabilities of Headache and Flu:

               Flu     No Flu
Headache       1/80    7/80
No Headache    1/80    71/80

slide by Barnabás Póczos & Alex Smola
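
A conditional probability can be read off a joint table by dividing a joint entry by a marginal, P(X|Y) = P(X,Y) / P(Y). The sketch below assumes the four values on the slide are the joint probabilities of (Headache, Flu), (Headache, No Flu), (No Headache, Flu) and (No Headache, No Flu), in that order; that reading of the layout is an assumption.

```python
from fractions import Fraction

# Joint probabilities P(Headache, Flu). The assignment of the four values
# to the four cells is an ASSUMPTION about the slide's table layout.
joint = {
    ("headache", "flu"): Fraction(1, 80),
    ("headache", "no flu"): Fraction(7, 80),
    ("no headache", "flu"): Fraction(1, 80),
    ("no headache", "no flu"): Fraction(71, 80),
}

# Marginals are obtained by summing the joint over the other variable.
p_flu = sum(p for (h, f), p in joint.items() if f == "flu")
p_headache = sum(p for (h, f), p in joint.items() if h == "headache")

# Conditional probability: P(Headache | Flu) = P(Headache, Flu) / P(Flu).
p_headache_given_flu = joint[("headache", "flu")] / p_flu
print(p_flu, p_headache, p_headache_given_flu)
```

Under this reading, P(Flu) = 1/40, P(Headache) = 1/10 and P(Headache | Flu) = 1/2, so knowing that someone has the flu raises the probability of a headache considerably.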

slide-18
SLIDE 18

Independence

Y and X don’t contain information about each other.
Observing Y doesn’t help in predicting X.
Observing X doesn’t help in predicting Y.

Examples:

Independent: Winning on roulette this week and next week.
Dependent: Russian roulette

Independent random variables:

slide by Barnabás Póczos & Alex Smola
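
In symbols, the usual definition of independence for two random variables X and Y is the following (standard textbook form, given here for reference):

```latex
% X and Y are independent iff the joint distribution factorizes
\[
P(X, Y) = P(X)\, P(Y)
\]
% Equivalently, whenever P(Y) > 0:
\[
P(X \mid Y) = P(X)
\]
```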

slide-19
SLIDE 19

Dependent / Independent

[Figure: two plots with axes X and Y, one titled "Independent X,Y" and the other "Dependent X,Y"]

slide by Barnabás Póczos & Alex Smola

slide-20
SLIDE 20

Conditionally Independent

Conditionally independent: Knowing Z makes X and Y independent

Examples:

Dependent: shoe size of children and reading skills
Conditionally independent: shoe size of children and reading skills given age

Storks deliver babies: A highly statistically significant correlation exists between stork populations and human birth rates across Europe.

slide by Barnabás Póczos & Alex Smola

slide-21
SLIDE 21

Conditionally Independent

  • London taxi drivers: A survey has pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

Finally, another study pointed out that people wear coats when it rains…

slide by Barnabás Póczos & Alex Smola

slide-22
SLIDE 22

Correlation ≠ Causation

The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in.

Correlation: 0.666004

slide-23
SLIDE 23

Conditional Independence

Formally: X is conditionally independent of Y given Z

Equivalent to:

Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn’t give more info about Thunder.

slide by Barnabás Póczos & Alex Smola
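
Written out, the standard definition of conditional independence and its equivalent factorized form are as follows. The Thunder/Rain/Lightning line instantiates the definition with the variables named in the note; the exact formulas shown on the slide may have been typeset differently.

```latex
% X is conditionally independent of Y given Z
\[
P(X \mid Y, Z) = P(X \mid Z)
\quad\Longleftrightarrow\quad
P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)
\]
% Instantiated for the example in the note
\[
P(\text{Thunder} \mid \text{Rain}, \text{Lightning})
  = P(\text{Thunder} \mid \text{Lightning})
\]
```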

slide-26
SLIDE 26

Conditional vs. Marginal Independence

  • C calls A and B separately and tells them a number n ∈ {1,...,10}
  • Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was.
  • A thinks the number was n_a and B thinks it was n_b.
  • Are n_a and n_b marginally independent?
  • No, we expect e.g. P(n_a = 1 | n_b = 1) > P(n_a = 1)
  • Are n_a and n_b conditionally independent given n?
  • Yes, because if we know the true number, the outcomes n_a and n_b are purely determined by the noise in each phone.
    P(n_a = 1 | n_b = 1, n = 2) = P(n_a = 1 | n = 2)

[Figure: diagram with nodes n, n_a and n_b]

slide by Barnabás Póczos & Alex Smola
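
The phone example can be verified exactly by building the joint distribution over (n, n_a, n_b). The sketch below rests on assumptions that are not in the slides: n is uniform on {1,...,10}, each listener hears the correct number with probability 4/5, and otherwise hears one of the nine wrong numbers uniformly at random. The names `p_hear` and `prob` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import product

NUMBERS = range(1, 11)  # n is in {1, ..., 10}
Q = Fraction(4, 5)      # ASSUMED probability of hearing the number correctly

def p_hear(heard, true):
    """ASSUMED noise model: correct w.p. Q, else uniform over the 9 wrong numbers."""
    return Q if heard == true else (1 - Q) / 9

# Joint P(n, na, nb): n uniform, and the two phones are noisy independently.
joint = {
    (n, na, nb): Fraction(1, 10) * p_hear(na, n) * p_hear(nb, n)
    for n, na, nb in product(NUMBERS, repeat=3)
}

def prob(pred):
    """Probability of the set of outcomes (n, na, nb) that satisfy pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# Marginal question: compare P(na=1 | nb=1) with P(na=1).
p_na1 = prob(lambda n, na, nb: na == 1)
p_na1_given_nb1 = (prob(lambda n, na, nb: na == 1 and nb == 1)
                   / prob(lambda n, na, nb: nb == 1))

# Conditional question: compare P(na=1 | nb=1, n=2) with P(na=1 | n=2).
lhs = (prob(lambda n, na, nb: na == 1 and nb == 1 and n == 2)
       / prob(lambda n, na, nb: nb == 1 and n == 2))
rhs = (prob(lambda n, na, nb: na == 1 and n == 2)
       / prob(lambda n, na, nb: n == 2))

print(p_na1, p_na1_given_nb1)  # the conditional exceeds the marginal
print(lhs == rhs)              # equal once the true number n is known
```

Exact fractions are used so that the equality in the conditional case is checked without floating-point tolerance.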

slide-27
SLIDE 27

Parameter estimation: MLE, MAP

Estimating Probabilities

slide by Barnabás Póczos & Alex Smola

slide-28
SLIDE 28

Flipping a Coin

I have a coin. If I flip it, what’s the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability:

The estimated probability is: 3/5 (“Frequency of heads”)

slide by Barnabás Póczos & Alex Smola

slide-32
SLIDE 32

Flipping a Coin

The estimated probability is: 3/5 (“Frequency of heads”)

Questions:

(1) Why frequency of heads???
(2) How good is this estimation???
(3) Why is this a machine learning problem???

We are going to answer these questions

slide by Barnabás Póczos & Alex Smola

slide-33
SLIDE 33

Question (1)

Why frequency of heads???

  • Frequency of heads is exactly the maximum likelihood estimator for this problem
  • MLE has nice properties (interpretation, statistical guarantees, simple)

slide by Barnabás Póczos & Alex Smola

slide-34
SLIDE 34

Maximum Likelihood Estimation

slide by Barnabás Póczos & Alex Smola

slide-35
SLIDE 35

MLE for Bernoulli distribution

Data, D = (the observed sequence of flips)

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:

– Independent events
– Identically distributed according to Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data

slide by Barnabás Póczos & Alex Smola
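
To make "choose θ that maximizes the probability of observed data" concrete, the likelihood can be evaluated on a grid of candidate values. This is a brute-force sketch for illustration, not the closed-form solution; it uses the counts of 3 heads and 2 tails that appear elsewhere in the lecture, and the function name `likelihood` and the grid resolution are assumptions.

```python
def likelihood(theta, heads, tails):
    """Probability of one particular i.i.d. flip sequence with the given counts."""
    return theta**heads * (1 - theta) ** tails

heads, tails = 3, 2  # 5 flips: 3 heads, 2 tails

# Evaluate the likelihood on a grid of candidate values for theta.
grid = [i / 100 for i in range(101)]
theta_best = max(grid, key=lambda t: likelihood(t, heads, tails))

print(theta_best)               # grid point with the highest likelihood
print(heads / (heads + tails))  # frequency of heads, for comparison
```

The grid maximizer coincides with the frequency of heads, which is what the closed-form MLE predicts.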

slide-39
SLIDE 39

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data

(independent draws, identically distributed)

slide by Barnabás Póczos & Alex Smola

slide-44
SLIDE 44

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data

That’s exactly the “Frequency of heads”

slide by Barnabás Póczos & Alex Smola

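
The standard derivation behind this result can be sketched as follows. It assumes that α_H and α_T denote the number of observed heads and tails (so n = α_H + α_T) and that D_i is the i-th flip; the index notation is an assumption made here for readability.

```latex
\begin{align*}
\hat{\theta}_{MLE}
  &= \arg\max_{\theta} P(D \mid \theta) \\
  &= \arg\max_{\theta} \prod_{i=1}^{n} P(D_i \mid \theta)
     && \text{(independent draws)} \\
  &= \arg\max_{\theta} \theta^{\alpha_H} (1-\theta)^{\alpha_T}
     && \text{(identically distributed)} \\
  &= \arg\max_{\theta} \big[ \alpha_H \ln\theta + \alpha_T \ln(1-\theta) \big]
     && \text{(log is monotone)}
\end{align*}
% Setting the derivative with respect to \theta to zero:
\[
\frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}
\]
```

For 3 heads and 2 tails this gives 3/5, the frequency of heads.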
slide-47
SLIDE 47

Question (2)

  • How good is this MLE estimation???

slide by Barnabás Póczos & Alex Smola

slide-48
SLIDE 48

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails.
What if I had flipped 30 heads and 20 tails?

  • Which estimator should we trust more?
  • The more the merrier???

slide by Barnabás Póczos & Alex Smola

slide-49
SLIDE 49

Hoeffding’s inequality:

Let θ* be the true parameter. For n = αH+αT, and for any ε>0:

slide by Barnabás Póczos & Alex Smola
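
In its standard form for n i.i.d. coin flips, Hoeffding's inequality bounds the probability that the estimate is far from the true parameter. The statement below is the textbook version; the symbol \hat{\theta} for the MLE α_H / n is an assumption, since the slide's own formula is not available here.

```latex
\[
P\big( \lvert \hat{\theta} - \theta^{*} \rvert \ge \varepsilon \big)
  \;\le\; 2\, e^{-2 n \varepsilon^{2}},
\qquad
\hat{\theta} = \frac{\alpha_H}{n}, \quad n = \alpha_H + \alpha_T
\]
```

The bound decays exponentially in n, which is why the estimate from 50 flips deserves more trust than the same estimate from 5 flips.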

slide-50
SLIDE 50

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ, within ε = 0.1 error with probability at least 1-δ = 0.95. How many flips do I need?

Sample complexity:

slide by Barnabás Póczos & Alex Smola
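
One way to answer the question is to require the standard Hoeffding bound 2·exp(−2nε²) to be at most δ and solve for n, which gives n ≥ ln(2/δ) / (2ε²). The sketch below evaluates that expression for the values on the slide; the function name `flips_needed` is hypothetical, and the result is a sufficient number of flips under this particular bound, not necessarily the smallest possible.

```python
import math

def flips_needed(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

n = flips_needed(eps=0.1, delta=0.05)
print(n)  # 185
```

Halving ε roughly quadruples the required number of flips, while tightening δ only adds a logarithmic cost.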

slide-51
SLIDE 51

Question (3)

Why is this a machine learning problem???

  • improve their performance (accuracy of the predicted prob.)
  • at some task (predicting the probability of heads)
  • with experience (the more coins we flip the better we are)

slide by Barnabás Póczos & Alex Smola

slide-52
SLIDE 52

What about continuous features?

Let us try Gaussians…

[Figure: Gaussian density curves labeled with mean µ=0 and variance σ²]

slide by Barnabás Póczos & Alex Smola

slide-53
SLIDE 53

MLE for Gaussian mean and variance

Choose θ = (µ,σ²) that maximizes the probability of observed data

(independent draws, identically distributed)

slide by Barnabás Póczos & Alex Smola
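
For i.i.d. observations x_1, ..., x_n, the Gaussian likelihood and the resulting closed-form maximizers are the standard ones shown below. The notation x_i for the observations is an assumption, since the slide's own formulas are not available here.

```latex
\begin{align*}
\hat{\theta}_{MLE}
  &= \arg\max_{\mu,\,\sigma^{2}} \prod_{i=1}^{n}
     \frac{1}{\sqrt{2\pi\sigma^{2}}}
     \exp\!\Big( -\frac{(x_i-\mu)^{2}}{2\sigma^{2}} \Big) \\
\hat{\mu}_{MLE} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\
\hat{\sigma}^{2}_{MLE} &= \frac{1}{n} \sum_{i=1}^{n} \big( x_i - \hat{\mu}_{MLE} \big)^{2}
\end{align*}
```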

slide-54
SLIDE 54

MLE for Gaussian mean and variance

Note: MLE for the variance of a Gaussian is biased
[Expected result of estimation is not the true parameter!]

Unbiased variance estimator:

slide by Barnabás Póczos & Alex Smola
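
The difference between the two variance estimators is only the divisor: the MLE divides the sum of squared deviations by n, while the usual unbiased estimator divides by n − 1. A minimal sketch on hypothetical data (the numbers below are made up for illustration and do not come from the slides):

```python
import statistics

data = [4.2, 5.1, 5.9, 6.3, 7.0]  # hypothetical observations, not from the slides
n = len(data)

mu_mle = sum(data) / n                         # MLE of the mean: sample average
sq_dev = sum((x - mu_mle) ** 2 for x in data)  # sum of squared deviations

var_mle = sq_dev / n             # MLE of the variance: divides by n (biased)
var_unbiased = sq_dev / (n - 1)  # unbiased estimator: divides by n - 1

print(mu_mle, var_mle, var_unbiased)
# The standard library implements both conventions:
print(statistics.pvariance(data), statistics.variance(data))
```

The MLE is always the smaller of the two, and the gap shrinks as n grows, so the bias matters mainly for small samples.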

slide-55
SLIDE 55

Next Class:

MAP estimation
Naïve Bayes Classifier
