BBM406 Fundamentals of Machine Learning, Lecture 7: Probability - PowerPoint PPT Presentation





slide-1
SLIDE 1

BBM406 Fundamentals of Machine Learning

Lecture 7: Probability Review (cont’d.), Maximum Likelihood Estimation (MLE)

Aykut Erdem // Hacettepe University // Fall 2019

photo: Chessex Borealis™ Aquerple Polyhedral

slide-2
SLIDE 2

Administrative

  • Project proposal due November 15
  • A half-page description:
    − problem to be investigated
    − why it is interesting
    − what data you will use
    − related work

slide-3
SLIDE 3

Deadlines in the syllabus are closer than they appear

slide-4
SLIDE 4

Today

  • Probabilities
  • Dependence, Independence, Conditional Independence
  • Parameter estimation
  • Maximum Likelihood Estimation (MLE)
  • Maximum a Posteriori (MAP)

slide-5
SLIDE 5

Last time… Sample space

Def: A sample space Ω is the set of all possible outcomes of a (conceptual or physical) random experiment. (Ω can be finite or infinite.)

Examples:
  • Ω may be the set of all possible outcomes of a die roll: {1, 2, 3, 4, 5, 6}
  • Pages of a book opened randomly: {1, …, 157}
  • Real numbers, for temperature, location, time, etc.

slide by Barnabás Póczos & Alex Smola
slide-6
SLIDE 6

Last time… Events

Def: An event A is a subset of the sample space Ω. We will ask the question: what is the probability of a particular event?

Examples: What is the probability that
  • a book opened at random is open at an odd-numbered page?
  • a die roll gives a number < 4?
  • a random person’s height X satisfies a < X < b?

slide by Barnabás Póczos & Alex Smola
slide-7
SLIDE 7
Last time… Probability

Def: Probability P(A), the probability that event (subset) A happens, is a function that maps the event A onto the interval [0, 1]. P(A) is also called the probability measure of A.

[Venn diagram: outcomes in which A is true vs. outcomes in which A is false; P(A) is the volume of A’s area within the sample space.]

Example: What is the probability that the number on a die is 2 or 4? Here A = {2, 4} and its complement is {1, 3, 5, 6}, so P(A) = 2/6 = 1/3.

slide by Barnabás Póczos & Alex Smola
slide-8
SLIDE 8

Last time… Kolmogorov Axioms

Axioms:
  1. P(A) ≥ 0 for every event A
  2. P(Ω) = 1
  3. For disjoint events A₁, A₂, …: P(A₁ ∪ A₂ ∪ …) = P(A₁) + P(A₂) + …

Consequences: P(∅) = 0; P(Aᶜ) = 1 − P(A); if A ⊆ B then P(A) ≤ P(B); P(A) ≤ 1.

slide by Barnabás Póczos & Alex Smola
slide-9
SLIDE 9

Last time… Venn Diagram

[Venn diagram: overlapping events A and B]

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

slide by Barnabás Póczos & Alex Smola
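The identity above can be checked directly on a finite sample space; a minimal sketch using two events on a fair die (the event choices are illustrative):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of a fair die
A = {2, 4, 6}                # event "even number"
B = {4, 5, 6}                # event "number > 3"

def P(event):
    # uniform probability measure on omega
    return Fraction(len(event & omega), len(omega))

# inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 2/3
```

Using `Fraction` keeps the arithmetic exact, so the identity holds with `==` rather than a floating-point tolerance.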
slide-10
SLIDE 10

Last time… Random Variables

Def: A real-valued random variable is a function of the outcome of a randomized experiment.

Examples (discrete random variables, where Ω is discrete):
  • X(ω) = True if a randomly drawn person ω from our class Ω is female
  • X(ω) = the hometown of a randomly drawn person ω from our class Ω

slide by Barnabás Póczos & Alex Smola
slide-11
SLIDE 11

Last time… Discrete Distributions

  • Bernoulli distribution Ber(p): P(X = 1) = p, P(X = 0) = 1 − p

Suppose a coin with head prob. p is tossed n times. What is the probability of getting k heads and n − k tails?

  • Binomial distribution Bin(n, p): P(k heads in n tosses) = C(n, k) p^k (1 − p)^(n−k)

slide by Barnabás Póczos & Alex Smola
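The Bin(n, p) mass function above is easy to sketch; the helper name `binomial_pmf` is mine:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(exactly k heads in n tosses of a coin with head probability p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Ber(p) is the n = 1 special case
assert binomial_pmf(1, 1, 0.3) == 0.3
# the probabilities over k = 0..n sum to 1
assert abs(sum(binomial_pmf(k, 10, 0.4) for k in range(11)) - 1.0) < 1e-12
```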
slide-15
SLIDE 15

Last time… Conditional Probability

P(X | Y) = fraction of worlds in which event X is true given event Y is true: P(X | Y) = P(X ∩ Y) / P(Y).

[Venn diagram: events X, Y and their intersection X ∩ Y]

Example joint distribution:

              Flu      No Flu
Headache      1/80     7/80
No Headache   1/80     71/80

P(Headache | Flu) = (1/80) / (1/80 + 1/80) = 1/2

slide by Barnabás Póczos & Alex Smola
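The conditional probability can be read straight off the joint table; a sketch (the dictionary layout is mine):

```python
from fractions import Fraction

# joint distribution P(Headache, Flu) from the slide
joint = {
    ("headache", "flu"):       Fraction(1, 80),
    ("headache", "no flu"):    Fraction(7, 80),
    ("no headache", "flu"):    Fraction(1, 80),
    ("no headache", "no flu"): Fraction(71, 80),
}

def conditional(x, y):
    # P(X = x | Y = y) = P(x, y) / P(y), marginalizing over X to get P(y)
    p_y = sum(p for (xv, yv), p in joint.items() if yv == y)
    return joint[(x, y)] / p_y

print(conditional("headache", "flu"))  # 1/2
```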
slide-17
SLIDE 17

Independence

Y and X don’t contain information about each other. Observing Y doesn’t help predicting X. Observing X doesn’t help predicting Y.

Examples:

Independent: Winning on roulette this week and next week. Dependent: Russian roulette

Independent random variables: P(X, Y) = P(X) P(Y), or equivalently P(X | Y) = P(X).

slide by Barnabás Póczos & Alex Smola
slide-18
SLIDE 18

Dependent / Independent

[Venn diagrams: independent X, Y vs. dependent X, Y]

slide by Barnabás Póczos & Alex Smola
slide-19
SLIDE 19

Conditionally Independent

Conditionally independent: knowing Z makes X and Y independent.

Examples:
  • Dependent: shoe size of children and reading skills
  • Conditionally independent: shoe size of children and reading skills, given age
  • Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

slide by Barnabás Póczos & Alex Smola
slide-20
SLIDE 20

Conditionally Independent

  • London taxi drivers: a survey pointed out a positive and significant correlation between the number of accidents and wearing coats. It concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

Finally, another study pointed out that people wear coats when it rains…

slide by Barnabás Póczos & Alex Smola
slide-21
SLIDE 21

Correlation ≠ Causation

The number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in.

Correlation: 0.666004

slide-22
SLIDE 22

Conditional Independence

Formally: X is conditionally independent of Y given Z:

P(X | Y, Z) = P(X | Z)

Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)

Note: P(Thunder | Rain, Lightning) = P(Thunder | Lightning) does NOT mean Thunder is independent of Rain; it means that, given Lightning, knowing Rain doesn’t give more info about Thunder.

slide by Barnabás Póczos & Alex Smola
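The distinction can be checked numerically on a toy joint distribution built so that Thunder depends on Rain only through Lightning; all the numbers below are made up for illustration:

```python
from itertools import product

# conditional tables (made-up numbers), chosen so the joint factorizes
# as P(R) P(L | R) P(T | L), which enforces T ⫫ R | L by construction
p_rain = {True: 0.3, False: 0.7}
p_light_given_rain = {True: {True: 0.5, False: 0.5},
                      False: {True: 0.1, False: 0.9}}
p_thunder_given_light = {True: {True: 0.9, False: 0.1},
                         False: {True: 0.05, False: 0.95}}

def joint(r, l, t):
    return p_rain[r] * p_light_given_rain[r][l] * p_thunder_given_light[l][t]

def P(pred):
    # probability of the set of (rain, lightning, thunder) worlds where pred holds
    return sum(joint(r, l, t)
               for r, l, t in product([True, False], repeat=3) if pred(r, l, t))

# P(Thunder | Lightning, Rain) == P(Thunder | Lightning): conditionally independent
lhs = P(lambda r, l, t: t and l and r) / P(lambda r, l, t: l and r)
rhs = P(lambda r, l, t: t and l) / P(lambda r, l, t: l)
assert abs(lhs - rhs) < 1e-12

# but Thunder and Rain are NOT unconditionally independent
assert abs(P(lambda r, l, t: t and r)
           - P(lambda r, l, t: t) * P(lambda r, l, t: r)) > 1e-3
```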
slide-25
SLIDE 25

Parameter estimation: MLE, MAP


Estimating Probabilities

slide by Barnabás Póczos & Alex Smola
slide-26
SLIDE 26

Flipping a Coin

I have a coin; if I flip it, what’s the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability.

The estimated probability is: 3/5 (“Frequency of heads”)

slide by Barnabás Póczos & Alex Smola
slide-30
SLIDE 30

Flipping a Coin

Questions:

(1) Why frequency of heads???
(2) How good is this estimation???
(3) Why is this a machine learning problem???

We are going to answer these questions.

The estimated probability is: 3/5 (“Frequency of heads”)

slide by Barnabás Póczos & Alex Smola
slide-31
SLIDE 31

Question (1)

Why frequency of heads???


  • Frequency of heads is exactly the maximum likelihood estimator for this problem
  • MLE has nice properties (interpretation, statistical guarantees, simple)

slide by Barnabás Póczos & Alex Smola
slide-32
SLIDE 32


Maximum Likelihood Estimation

slide by Barnabás Póczos & Alex Smola
slide-33
SLIDE 33

MLE for Bernoulli distribution

Data, D = the observed sequence of flips; P(Heads) = θ, P(Tails) = 1 − θ

Flips are i.i.d.:
  − independent events
  − identically distributed according to a Bernoulli distribution

MLE: Choose θ that maximizes the probability of the observed data

slide by Barnabás Póczos & Alex Smola
slide-37
SLIDE 37

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of the observed data:

θ̂ = argmax_θ P(D | θ)

Since the flips are independent draws, identically distributed:

P(D | θ) = ∏ᵢ P(xᵢ | θ) = θ^(α_H) (1 − θ)^(α_T)

where α_H and α_T are the numbers of heads and tails.

slide by Barnabás Póczos & Alex Smola
slide-42
SLIDE 42

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of the observed data. Setting the derivative of the log-likelihood to zero gives

θ̂_MLE = α_H / (α_H + α_T)

That’s exactly the “Frequency of heads”.

slide by Barnabás Póczos & Alex Smola
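The closed form θ̂ = α_H / (α_H + α_T) can be sanity-checked against a brute-force search over the likelihood; a sketch, not part of the slides:

```python
import math

def log_likelihood(theta, heads, tails):
    # log P(D | theta) for a Bernoulli coin
    return heads * math.log(theta) + tails * math.log(1 - theta)

heads, tails = 3, 2
theta_mle = heads / (heads + tails)   # closed-form MLE: frequency of heads

# a grid search over theta lands on the same value
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, heads, tails))
assert abs(theta_grid - theta_mle) < 1e-9
print(theta_mle)  # 0.6
```

Maximizing the log-likelihood is equivalent to maximizing the likelihood itself, since log is monotone; it just avoids underflow for long flip sequences.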
slide-45
SLIDE 45

Question (2)

  • How good is this MLE estimation???


slide by Barnabás Póczos & Alex Smola
slide-46
SLIDE 46

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped it 50 times: 30 heads and 20 tails?

  • Which estimator should we trust more?
  • The more the merrier???

slide by Barnabás Póczos & Alex Smola
slide-47
SLIDE 47

Let θ* be the true parameter. For n = α_H + α_T flips and any ε > 0,

Hoeffding’s inequality:

P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2nε²)

slide by Barnabás Póczos & Alex Smola
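The bound can be sketched and compared against simulation; the choice θ* = 0.6 below is arbitrary:

```python
import math
import random

def hoeffding_bound(n, eps):
    # P(|theta_hat - theta*| >= eps) <= 2 * exp(-2 * n * eps^2)
    return 2 * math.exp(-2 * n * eps * eps)

# empirical check: the observed failure rate stays below the bound
random.seed(0)
theta_star, n, eps, trials = 0.6, 100, 0.1, 2000
failures = 0
for _ in range(trials):
    theta_hat = sum(random.random() < theta_star for _ in range(n)) / n
    failures += abs(theta_hat - theta_star) >= eps
assert failures / trials <= hoeffding_bound(n, eps)
```

Hoeffding is distribution-free, so the bound is loose: for n = 100 and ε = 0.1 it gives about 0.27, while the simulated failure rate is much smaller.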
slide-48
SLIDE 48

Probably Approximate Correct 
 (PAC) Learning

I want to know the coin parameter θ within ε = 0.1 error with probability at least 1 − δ = 0.95. How many flips do I need?

Sample complexity: require 2e^(−2nε²) ≤ δ, i.e. n ≥ ln(2/δ) / (2ε²)

slide by Barnabás Póczos & Alex Smola
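Inverting Hoeffding’s bound for n gives the sample complexity directly; a sketch:

```python
import math

def sample_complexity(eps, delta):
    # smallest n with 2 * exp(-2 * n * eps^2) <= delta,
    # i.e. n >= ln(2 / delta) / (2 * eps^2)
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))

# eps = 0.1 error with probability at least 1 - delta = 0.95
print(sample_complexity(0.1, 0.05))  # 185
```

Note the 1/ε² dependence: halving the allowed error roughly quadruples the number of flips needed, while tightening δ costs only logarithmically.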
slide-49
SLIDE 49

Question (3)

Why is this a machine learning problem???

  • improve their performance (accuracy of the predicted probability)
  • at some task (predicting the probability of heads)
  • with experience (the more coins we flip, the better we are)

slide by Barnabás Póczos & Alex Smola
slide-50
SLIDE 50

What about continuous 
 features?

[Plots: Gaussian densities with µ = 0 and varying σ²]

Let us try Gaussians…

slide by Barnabás Póczos & Alex Smola
slide-51
SLIDE 51

MLE for Gaussian mean 
 and variance

Choose θ = (µ, σ²) that maximizes the probability of the observed data, assuming independent, identically distributed draws:

θ̂_MLE = argmax_{µ,σ²} ∏ᵢ p(xᵢ | µ, σ²),  where p(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

This gives µ̂_MLE = (1/n) Σᵢ xᵢ and σ̂²_MLE = (1/n) Σᵢ (xᵢ − µ̂)².

slide by Barnabás Póczos & Alex Smola
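The closed-form estimates are a few lines of code; the sketch below also computes the unbiased variant, which divides by n − 1 instead of n (the sample data is illustrative):

```python
def gaussian_mle(xs):
    # MLE for (mu, sigma^2) from samples xs, plus the unbiased variance
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n          # biased: divides by n
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_mle, var_unbiased

mu, v_mle, v_unb = gaussian_mle([2.0, 4.0, 6.0])
print(mu, v_mle, v_unb)  # 4.0, 8/3 ≈ 2.667, 4.0
```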
slide-52
SLIDE 52

MLE for Gaussian mean
 and variance

Note: the MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter!

Unbiased variance estimator:

σ̂²_unbiased = (1/(n − 1)) Σᵢ (xᵢ − µ̂)²

slide by Barnabás Póczos & Alex Smola
slide-53
SLIDE 53

Next Class:

MAP estimation
Naïve Bayes Classifier