SLIDE 1
Machine Learning 2007: Lecture 11 Instructor: Tim van Erven (Tim.van.Erven@cwi.nl) Website: www.cwi.nl/˜erven/teaching/0708/ml/

November 28, 2007

SLIDE 2

Overview

  • Organisational Matters
  • Models
  • Maximum Likelihood Parameter Estimation
  • Probability Theory
  • Bayesian Learning
    – The Bayesian Distribution
    – From Prior to Posterior
    – MAP Parameter Estimation
    – Bayesian Predictions
    – Discussion
    – Advanced Issues

SLIDE 3

Guest lecture:

  • Next week, Peter Grünwald will give a special guest lecture about minimum description length (MDL) learning.

This Lecture versus Mitchell:

  • Chapter 6 up to section 6.5.0 about Bayesian learning.
  • I present things in a better order.
  • Mitchell also covers the connection between MAP parameter estimation and least squares linear regression: It is good for you to study this, but I will not ask an exam question about it.


SLIDE 6

Prediction Example without Noise

Training data:

D = y1 y2 y3 y4 y5 y6 y7 y8
     0  1  0  1  0  1  0  1

Hypothesis Space:

H = {h1, h2, h3}
h1: yn = 0
h2: yn = 0 if n is odd, 1 if n is even
h3: yn = 1

By simple list-then-eliminate:

  • Only h2 is consistent with the training data.
  • Therefore we predict, in accordance with h2, that y9 = 0.
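The list-then-eliminate step above can be sketched in Python (the function names are mine, not from the slides):

```python
# List-then-eliminate: keep only the hypotheses that are consistent
# with the training data, then predict with the survivors.

def h1(n): return 0                       # always predicts 0
def h2(n): return 0 if n % 2 == 1 else 1  # 0 for odd n, 1 for even n
def h3(n): return 1                       # always predicts 1

D = [0, 1, 0, 1, 0, 1, 0, 1]              # y1, ..., y8

hypotheses = {"h1": h1, "h2": h2, "h3": h3}
consistent = {name: h for name, h in hypotheses.items()
              if all(h(n) == y for n, y in enumerate(D, start=1))}

print(sorted(consistent))     # ['h2']
print(consistent["h2"](9))    # y9 = 0
```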
SLIDE 8

Turning Hypotheses into Distributions

Models:

  • We may view each hypothesis as a probability distribution that gives probability 1 to a certain outcome.
  • A hypothesis space that contains such probabilistic hypotheses is called a (statistical) model.

The previous hypotheses as distributions:

M = {P1, P2, P3}
P1: P1(yn = 0) = 1
P2: P2(yn = 0) = 1 if n is odd, 0 if n is even
P3: P3(yn = 1) = 1

List-then-eliminate still works:

  • A probabilistic hypothesis is consistent with the data if it gives positive probability to the data.

SLIDE 10

Prediction Example with Noise

Noise:

  • Using probabilistic hypotheses is natural when there is noise in the data.
  • Suppose we observe a measurement error with some (small) probability ε.

This is easy to incorporate:

M = {P1, P2, P3}
P1: P1(yn = 0) = 1 − ε
P2: P2(yn = 0) = 1 − ε if n is odd, ε if n is even
P3: P3(yn = 1) = 1 − ε

List-then-eliminate does not work any more:

  • For example, P1(D = 0, 1, 0, 1, 0, 1, 0, 1) = ε^4 (1 − ε)^4.
  • Typically many or all probabilistic hypotheses in our model will be consistent with the data.
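The likelihood computation above can be sketched as follows; the numeric value of ε is an assumption for illustration only:

```python
from fractions import Fraction

eps = Fraction(1, 10)                  # assumed noise rate, for illustration
D = [0, 1, 0, 1, 0, 1, 0, 1]           # y1, ..., y8

def likelihood(predict, data, eps):
    """P(data) for a hypothesis that predicts y_n deterministically,
    except that each observation is flipped with probability eps."""
    p = Fraction(1)
    for n, y in enumerate(data, start=1):
        p *= (1 - eps) if y == predict(n) else eps
    return p

P1 = lambda n: 0                       # always predicts 0
P2 = lambda n: 0 if n % 2 == 1 else 1  # 0 for odd n, 1 for even n

print(likelihood(P1, D, eps) == eps**4 * (1 - eps)**4)  # True: four misses
print(likelihood(P2, D, eps) == (1 - eps)**8)           # True: no misses
```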


SLIDE 14

Parameters

Parameters index the elements of a hypothesis space:

H = {h1, h2, h3} ⇔ H = {hθ | θ ∈ {1, 2, 3}}

Usually in a convenient way:

Hypotheses are often expressed in terms of the parameters. In linear regression for example: H = {hw | w ∈ R^2} where hw: y = w0 + w1x.

Example where the hypothesis space is a model:

For example, in prediction of binary outcomes: M = {Pθ | θ ∈ {1/4, 1/2, 3/4}}, where Pθ(yn = 1) = θ.
SLIDE 15

Maximum Likelihood Parameter Estimation

Training data and model:

  • Data: D = y1 … y8, containing six ones and two zeros.
  • Model: M = {Pθ | θ ∈ {1/4, 1/2, 3/4}}, where Pθ(yn = 1) = θ.

Likelihood:

θ      Pθ(D)
1/4    (1/4)^6 (3/4)^2 = 9/65536
1/2    (1/2)^8 = 256/65536
3/4    (3/4)^6 (1/4)^2 = 729/65536

Maximum Likelihood Parameter Estimation:

θ̂ = arg max_θ Pθ(D) = 3/4
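The maximum likelihood computation on this slide, as a sketch in Python with exact fractions:

```python
from fractions import Fraction

thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
ones, zeros = 6, 2   # the data contain six 1s and two 0s

def likelihood(theta):
    # P_theta(D) = theta^{#ones} * (1 - theta)^{#zeros}
    return theta**ones * (1 - theta)**zeros

assert likelihood(Fraction(1, 4)) == Fraction(9, 65536)
assert likelihood(Fraction(1, 2)) == Fraction(256, 65536)
assert likelihood(Fraction(3, 4)) == Fraction(729, 65536)

theta_ml = max(thetas, key=likelihood)
print(theta_ml)   # 3/4
```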


SLIDE 17

Relating Unions and Intersections

For any two events A and B: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
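A quick numerical check of this identity; the sample space and events below are my own small example, not from the slides:

```python
from fractions import Fraction

omega = set("abcdefg")                       # a small sample space
def P(event):
    return Fraction(len(event), len(omega))  # uniform distribution

A, B = {"a", "b", "c"}, {"c", "d"}           # example events
assert P(A | B) == P(A) + P(B) - P(A & B)    # the identity holds
print(P(A | B))                              # 4/7
```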

SLIDE 21

The Law of Total Probability

[Figure: the outcomes a–g, cut into the parts A1, A2, A3, with B overlapping each part.]

  • Suppose Ω = {a, b, c, d, e, f, g}.
  • A partition of Ω cuts it into parts: let the parts be A1 = {a, b}, A2 = {c, d, e} and A3 = {f, g}. The parts do not overlap, and together cover Ω.
  • B = {b, d, f}

Law of Total Probability:

P(B) = Σ_{i=1}^{3} P(B ∩ Ai) = Σ_{i=1}^{3} P(B | Ai) P(Ai)
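The law can be checked directly on this example, assuming (for concreteness) the uniform distribution on Ω:

```python
from fractions import Fraction

omega = set("abcdefg")
def P(event):
    return Fraction(len(event), len(omega))   # uniform, for concreteness

parts = [{"a", "b"}, {"c", "d", "e"}, {"f", "g"}]  # the partition A1, A2, A3
B = {"b", "d", "f"}

total = sum(P(B & A) for A in parts)                   # sum of P(B ∩ Ai)
total2 = sum((P(B & A) / P(A)) * P(A) for A in parts)  # sum of P(B|Ai)P(Ai)
assert total == total2 == P(B)
print(P(B))   # 3/7
```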

SLIDE 23

Marginal Probability

  • Suppose we throw a blue and a red die.
  • Let X and Y be random variables, where X: outcome blue die; Y: outcome red die.
  • If we only know P(X, Y), how do we compute P(X)?

Marginal Probability of X:

For two fair dice every joint outcome has P(X = x, Y = y) = 1/36, and every row and column of the joint table sums to the marginal probability 1/6. For example:

P(X = 2) = Σ_{y=1}^{6} P(X = 2, Y = y) = 6 · 1/36 = 1/6
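The marginalisation on this slide, sketched in Python:

```python
from fractions import Fraction

# Joint distribution of two independent fair dice: P(X = x, Y = y) = 1/36.
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

def marginal_X(x):
    # P(X = x) = sum over all y of P(X = x, Y = y)
    return sum(joint[(x, y)] for y in range(1, 7))

print(marginal_X(2))   # 1/6
assert all(marginal_X(x) == Fraction(1, 6) for x in range(1, 7))
```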

SLIDE 26

Bayesian Learning

Very popular:

  • Bayesian learning can be used with any model, and even if we have multiple models.
  • It is widely used in machine learning.

Nice properties:

  • It avoids overfitting.
  • It makes the preference bias clearly visible.

Main idea:

  • Given some model with parameter θ, construct a single distribution PBayes on both the data D and the parameter θ.
  • Now we can compute the probability of
    – the parameters given the training data: PBayes(θ = 3/4 | D);
    – the next outcome given the training data: PBayes(yn+1 = 1 | D).

SLIDE 30

The Bayesian Distribution

Prior Distribution:

  • A model contains many distributions. For example, M = {Pθ | θ ∈ {1, . . . , 10}}.
  • We put a prior distribution π on the parameter θ.
  • π(θ) reflects our a priori¹ degree of belief that θ is the right parameter.

Definition of PBayes:

  • The single distribution PBayes on both parameters and data is defined by: PBayes(θ) = π(θ) and PBayes(D | θ) = Pθ(D).
  • This implies that PBayes(D, θ) = Pθ(D)π(θ).

¹ “A priori” means before seeing the data.

SLIDE 31

Example

Model, prior and training data:

  • Model: M = {Pθ | θ ∈ {1/4, 1/2, 3/4}}, where Pθ(yn = 1) = θ.
  • Prior: π(1/4) = π(1/2) = π(3/4) = 1/3.
  • Data: D = y1 … y8, with six ones and two zeros, as before.

Joint Probabilities:

PBayes(D, θ) = Pθ(D)π(θ):

θ      PBayes(D, θ)
1/4    1/3 · (1/4)^6 (3/4)^2 = 9/196608
1/2    1/3 · (1/2)^8 = 256/196608
3/4    1/3 · (3/4)^6 (1/4)^2 = 729/196608
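The joint probabilities in the table can be reproduced exactly with fractions:

```python
from fractions import Fraction

thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in thetas}     # uniform prior
ones, zeros = 6, 2                              # counts of 1s and 0s in D

def likelihood(theta):
    return theta**ones * (1 - theta)**zeros     # P_theta(D)

# P_Bayes(D, theta) = P_theta(D) * pi(theta)
joint = {t: likelihood(t) * prior[t] for t in thetas}

assert joint[Fraction(1, 4)] == Fraction(9, 196608)
assert joint[Fraction(1, 2)] == Fraction(256, 196608)
assert joint[Fraction(3, 4)] == Fraction(729, 196608)
```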

SLIDE 32

The Marginal Probability of the Data

The marginal probability of the data:

PBayes(D) = Σ_θ PBayes(D, θ) = Σ_θ Pθ(D)π(θ)

Example:

θ      PBayes(D, θ)
1/4    9/196608
1/2    256/196608
3/4    729/196608

⇒ PBayes(D) = (9 + 256 + 729)/196608 = 994/196608

Remarks:

  • The marginal probability PBayes(D) is a weighted average of Pθ(D), where each θ has the weight π(θ).
  • This weight π(θ) does not depend on the data.
SLIDE 36

From Prior to Posterior Distribution

Updating beliefs:

  • The prior π(θ) gives the probability of θ before we observe any data.
  • The posterior distribution PBayes(θ | D) gives the probability of θ after observing data D.
  • This is the Bayesian way to update beliefs about parameters based on data D.

Notation:

  • The prior and the posterior both represent beliefs about θ.
  • It is therefore common to write π(θ | D) for PBayes(θ | D).
SLIDE 37

Example

Previous example continued:

θ      PBayes(D, θ)
1/4    9/196608
1/2    256/196608
3/4    729/196608

⇒ PBayes(D) = 994/196608

Posterior probability:

π(θ | D) = PBayes(D, θ) / PBayes(D):

θ      π(θ | D)
1/4    (9/196608) / (994/196608) = 9/994
1/2    (256/196608) / (994/196608) = 256/994
3/4    (729/196608) / (994/196608) = 729/994

  • We started with equal prior probabilities.
  • After observing the data, θ = 3/4 is considered much more likely than the other θ.
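From joint probabilities to posterior, continuing with the same numbers:

```python
from fractions import Fraction

joint = {Fraction(1, 4): Fraction(9, 196608),
         Fraction(1, 2): Fraction(256, 196608),
         Fraction(3, 4): Fraction(729, 196608)}   # P_Bayes(D, theta)

p_data = sum(joint.values())                      # P_Bayes(D)
# pi(theta | D) = P_Bayes(D, theta) / P_Bayes(D)
posterior = {t: p / p_data for t, p in joint.items()}

assert p_data == Fraction(994, 196608)
assert posterior[Fraction(1, 4)] == Fraction(9, 994)
assert posterior[Fraction(3, 4)] == Fraction(729, 994)
```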

SLIDE 39

MAP Parameter Estimation

Definition:

The maximum a posteriori (MAP) parameter estimate is the parameter with the largest posterior (= a posteriori) probability: θMAP = arg max_θ π(θ | D)

Example continued:

θ      π(θ | D)
1/4    9/994
1/2    256/994
3/4    729/994

⇒ θMAP = 3/4

SLIDE 41

The Predictive Distribution

Definition:

  • Suppose D = y1, . . . , yn.
  • Then the Bayesian predictive distribution is PBayes(yn+1 | D).

Understanding the predictive distribution:

It can be shown that: PBayes(yn+1 | D) = Σ_θ Pθ(yn+1) π(θ | D)

  • The predictive probability PBayes(yn+1 | D) is a weighted average of Pθ(yn+1), where each θ has the weight π(θ | D).

SLIDE 42

Example Continued

Previous example continued:

  • Recall that in this example Pθ(yn+1 = 1) = θ.

θ      π(θ | D)
1/4    9/994
1/2    256/994
3/4    729/994

Predictive probability:

PBayes(yn+1 = 1 | D) = Σ_θ Pθ(yn+1 = 1) π(θ | D) = 1/4 · 9/994 + 1/2 · 256/994 + 3/4 · 729/994 ≈ 0.68

  • Notice that 0.68 is pretty close to 0.75.
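The predictive probability above, checked with exact arithmetic:

```python
from fractions import Fraction

posterior = {Fraction(1, 4): Fraction(9, 994),
             Fraction(1, 2): Fraction(256, 994),
             Fraction(3, 4): Fraction(729, 994)}   # pi(theta | D)

# P_Bayes(y_{n+1} = 1 | D) = sum over theta of P_theta(y = 1) * pi(theta | D),
# and in this model P_theta(y = 1) = theta.
predictive = sum(theta * w for theta, w in posterior.items())

assert predictive == Fraction(677, 994)
print(round(float(predictive), 2))   # 0.68
```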
slide-43
SLIDE 43

Overview

Organisational Matters Models Maximum Likelihood Parameter Estimation Probability Theory Bayesian Learning 29 / 35

  • Organisational Matters
  • Models
  • Maximum Likelihood Parameter Estimation
  • Probability Theory
  • Bayesian Learning

The Bayesian Distribution

From Prior to Posterior

MAP Parameter Estimation

Bayesian Predictions

Discussion

Advanced Issues

slide-44
SLIDE 44

MAP versus Predictive Distribution

  • Prediction with MAP: PθMAP(yn+1), where θMAP = arg max_θ π(θ | D).
  • Predictive distribution: PBayes(yn+1 | D) = Σ_θ Pθ(yn+1) π(θ | D).

New example:

Two hypotheses that predict a 1 with high probability, one MAP hypothesis that predicts a 0 with high probability:

Pθ(yn+1 = 1)    1/10   8/10   9/10
π(θ | D)        4/10   3/10   3/10

PBayes(yn+1 = 1 | D) = (4 · 1)/100 + (3 · 8)/100 + (3 · 9)/100 = 55/100

  • Together the hypotheses that predict 1 have a higher posterior probability than the MAP hypothesis that predicts 0.
  • If we use the MAP estimate, then we ignore their predictions!
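The same calculation as a sketch, showing how MAP and the predictive distribution disagree:

```python
from fractions import Fraction

# The three hypotheses from the slide.
p_one     = [Fraction(1, 10), Fraction(8, 10), Fraction(9, 10)]  # P_theta(y=1)
posterior = [Fraction(4, 10), Fraction(3, 10), Fraction(3, 10)]  # pi(theta|D)

i_map = max(range(3), key=lambda i: posterior[i])   # index of MAP hypothesis
map_pred = p_one[i_map]                             # its prediction for y = 1

# The Bayesian prediction averages over all hypotheses.
bayes_pred = sum(p * w for p, w in zip(p_one, posterior))

assert map_pred == Fraction(1, 10)       # MAP predicts 0 with probability 9/10
assert bayes_pred == Fraction(55, 100)   # the average favours y = 1
```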

SLIDE 46

The Prior Determines the Preference Bias

Marginal probability of the data:

PBayes(D) = Σ_θ PBayes(D, θ) = Σ_θ Pθ(D)π(θ)

Posterior distribution:

π(θ | D) = PBayes(D, θ) / PBayes(D) = Pθ(D)π(θ) / PBayes(D)

Dependence on the prior:

  • These are the two most important probabilities in Bayesian inference.
  • Both use Pθ(D) and π(θ).
  • Pθ(D) depends on the data, but π(θ) does not!
  • π(θ) determines the relative importance of each parameter θ.
  • However, if we get a lot of data, then the effect of Pθ(D) becomes much more important than the effect of the prior.
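The last point can be illustrated with a sketch; the skewed prior below is my own choice, not from the slides. With little data the prior dominates, but with more data in the same proportion the likelihood takes over:

```python
from fractions import Fraction

thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
# A prior deliberately biased against theta = 3/4, chosen for illustration.
prior = {Fraction(1, 4): Fraction(49, 100),
         Fraction(1, 2): Fraction(49, 100),
         Fraction(3, 4): Fraction(2, 100)}

def posterior(ones, zeros):
    # pi(theta | D) proportional to theta^{#ones} (1-theta)^{#zeros} pi(theta)
    joint = {t: t**ones * (1 - t)**zeros * prior[t] for t in thetas}
    total = sum(joint.values())
    return {t: joint[t] / total for t in thetas}

# Eight outcomes (six 1s, two 0s): the biased prior still dominates.
assert float(posterior(6, 2)[Fraction(3, 4)]) < 0.2
# Ten times as much data in the same proportion: the likelihood wins.
assert float(posterior(60, 20)[Fraction(3, 4)]) > 0.99
```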

SLIDE 49

Different Interpretations of Probability

  • Suppose P is a distribution on Ω = {a, b, c, d, e, f, g} and A = {c, d, f} is an event.

Frequentist: If we perform this same experiment n times, then the relative frequency of observing an outcome in A goes to P(A) as n → ∞.

  • Considers an infinite number of repetitions of the experiment.
  • Requires that it is possible (in principle) to observe the outcome of the experiment.
  • Objective: the same for everyone.

Subjective Bayesian:² Before observing the outcome of the experiment, P(A) is our degree of belief that we will get an outcome in A.

  • Considers only one repetition of the experiment.
  • Does not require that we can observe the outcome of the experiment.
  • Subjective: My probability may be different from your probability.

² There are other Bayesian interpretations of probability as well.

SLIDE 51

References

  • A.N. Shiryaev, “Probability”, Second Edition, 1996.
  • P. Grünwald, “The Minimum Description Length Principle”, 2007.
  • T.M. Mitchell, “Machine Learning”, McGraw-Hill, 1997.