SLIDE 1

An Introduction to Probabilistic Modeling

Oliver Stegle and Karsten Borgwardt, Machine Learning and Computational Biology Research Group, Max Planck Institute for Biological Cybernetics and Max Planck Institute for Developmental Biology, Tübingen

SLIDE 2

Motivation

Why probabilistic modeling?

◮ Inferences from data are intrinsically uncertain.
◮ Probability theory: model uncertainty instead of ignoring it!
◮ Applications: machine learning, data mining, pattern recognition, etc.
◮ Goal of this part of the course:
  ◮ Overview of probabilistic modeling
  ◮ Key concepts
  ◮ Focus on applications in bioinformatics


SLIDE 5

Motivation

Further reading, useful material

◮ Christopher M. Bishop: Pattern Recognition and Machine Learning.
  ◮ Good background; covers most of the course material and much more!
  ◮ Substantial parts of this tutorial borrow figures and ideas from this book.
◮ David J.C. MacKay: Information Theory, Inference, and Learning Algorithms.
  ◮ Well worth reading, though its overlap with the lecture synopsis is not quite as close.
  ◮ Freely available online.

SLIDE 6

Motivation

Lecture overview

1. An introduction to probabilistic modeling
2. Applications: linear models, hypothesis testing
3. An introduction to Gaussian processes
4. Applications: time series, model comparison
5. Applications: continued

SLIDE 8

Prerequisites

Outline

◮ Motivation
◮ Prerequisites
◮ Probability Theory
◮ Parameter Inference for the Gaussian
◮ Summary

SLIDE 9

Prerequisites

Key concepts

Data

◮ Let D denote a dataset consisting of N datapoints, D = {x_n, y_n}_{n=1}^N, with inputs x_n and outputs y_n.
◮ Typical setting (this course):
  ◮ x = {x_1, . . . , x_D} is multivariate, spanning D features for each observation (nodes in a graph, etc.).
  ◮ y is univariate (fitness, expression level, etc.).
◮ Notation:
  ◮ Scalars are printed as y.
  ◮ Vectors are printed in bold: x.
  ◮ Matrices are printed in capital bold: Σ.


SLIDE 12

Prerequisites

Key concepts

Predictions

◮ Observed dataset D = {x_n, y_n}_{n=1}^N, with inputs x_n and outputs y_n.
◮ Given D, what can we say about y⋆ at an unseen test input x⋆?

[Figure: training data (X, Y) with an unseen test input x⋆ marked by "?"]


SLIDE 14

Prerequisites

Key concepts

Model

◮ Observed dataset D = {x_n, y_n}_{n=1}^N.
◮ Given D, what can we say about y⋆ at an unseen test input x⋆?
◮ To make predictions we need to make assumptions.
◮ A model H encodes these assumptions and often depends on some parameters θ.
◮ Curve fitting: the model relates x to y, for example a linear model

  y = f(x | θ) = θ_0 + θ_1 · x

[Figure: training data (X, Y) with a fitted line and the test input x⋆]

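To make the curve-fitting picture concrete, here is a minimal Python sketch (not part of the original slides; the synthetic data and the helper name fit_linear are illustrative). It fits the linear model y = θ_0 + θ_1 · x by least squares, which coincides with maximum likelihood under i.i.d. Gaussian noise:

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = theta0 + theta1 * x."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min ||X theta - y||^2
    return theta

# Synthetic example: y = 1 + 2x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)

theta0, theta1 = fit_linear(x, y)
x_star = 0.5                        # unseen test input
y_star = theta0 + theta1 * x_star   # point prediction at x_star
print(theta0, theta1, y_star)
```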

SLIDE 16

Prerequisites

Key concepts

Uncertainty

◮ In virtually all steps there is uncertainty:
  ◮ Measurement uncertainty (D)
  ◮ Parameter uncertainty (θ)
  ◮ Uncertainty regarding the correct model (H)
◮ Uncertainty can occur in both inputs and outputs.
◮ How can we represent uncertainty?

[Figure: data (X, Y) with error bars illustrating measurement uncertainty]


SLIDE 19

Probability Theory

Outline

◮ Motivation
◮ Prerequisites
◮ Probability Theory
◮ Parameter Inference for the Gaussian
◮ Summary

SLIDE 20

Probability Theory

Probabilities

◮ Let X be a random variable, defined over a set X (or measurable space).
◮ P(X = x) denotes the probability that X takes the value x, abbreviated p(x).
◮ Probabilities are positive: P(X = x) ≥ 0.
◮ Probabilities sum to one:

  ∫_{x∈X} p(x) dx = 1 (continuous case),   Σ_{x∈X} p(x) = 1 (discrete case)

◮ Special case, no uncertainty: p(x) = δ(x − x̂).

SLIDE 21

Probability Theory

◮ Consider N draws of a pair (X, Y), where X = x_i occurred c_i times and the pair (X = x_i, Y = y_j) occurred n_ij times.

Joint probability:       P(X = x_i, Y = y_j) = n_ij / N
Marginal probability:    P(X = x_i) = c_i / N
Conditional probability: P(Y = y_j | X = x_i) = n_ij / c_i

(C.M. Bishop, Pattern Recognition and Machine Learning)


SLIDE 23

Probability Theory

Product rule: P(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i) · (c_i / N) = P(Y = y_j | X = x_i) P(X = x_i)

Sum rule: P(X = x_i) = c_i / N = (1/N) Σ_{j=1}^L n_ij = Σ_j P(X = x_i, Y = y_j)

(C.M. Bishop, Pattern Recognition and Machine Learning)

SLIDE 24

Probability Theory

The Rules of Probability

Sum & Product Rule

Sum rule:     p(x) = Σ_y p(x, y)

Product rule: p(x, y) = p(y | x) p(x)
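As a quick numerical check (a Python sketch, not from the slides; the 2×3 count table n_ij is invented for illustration), both rules can be verified directly on a discrete joint distribution built from counts:

```python
import numpy as np

# Invented count table n_ij: rows index x_i, columns index y_j
n = np.array([[3.0, 1.0, 2.0],
              [2.0, 4.0, 0.0]])
N = n.sum()

p_xy = n / N                        # joint:       P(X=x_i, Y=y_j)
p_x = p_xy.sum(axis=1)              # sum rule:    P(X=x_i) = sum_j P(X=x_i, Y=y_j)
p_y_given_x = p_xy / p_x[:, None]   # conditional: P(Y=y_j | X=x_i)

# Product rule: P(X, Y) = P(Y | X) P(X) is recovered exactly
assert np.allclose(p_xy, p_y_given_x * p_x[:, None])
print(p_x)                          # marginal of X
```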

SLIDE 25

Probability Theory

The Rules of Probability

Bayes' Theorem

◮ Using the product rule we obtain

  p(y | x) = p(x | y) p(y) / p(x),   where p(x) = Σ_y p(x | y) p(y)

SLIDE 26

Probability Theory

Bayesian probability calculus

◮ Bayes' rule is the basis for inference and learning.
◮ Assume we have a model with parameters θ, e.g. y = θ_0 + θ_1 · x.
◮ Goal: learn the parameters θ given the data D.

  p(θ | D) = p(D | θ) p(θ) / p(D)

  posterior ∝ likelihood · prior

[Figure: linear model fit to data (X, Y) with test input x⋆]

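To illustrate posterior ∝ likelihood · prior, here is a brute-force Python sketch (not from the slides; the grid, the data, the N(0, 1) prior, and the known noise level are all assumptions for illustration) that computes the posterior over the slope θ_1 of the linear model on a grid, with θ_0 fixed at 0:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)
y = 2.0 * x + 0.2 * rng.standard_normal(x.size)   # data from true slope 2

theta1 = np.linspace(-5.0, 5.0, 1001)             # grid over the slope
sigma = 0.2                                       # assumed known noise std

# log-likelihood per grid value: sum_n log N(y_n | theta1 * x_n, sigma^2)
loglik = -0.5 * ((y[None, :] - theta1[:, None] * x[None, :]) ** 2 / sigma**2).sum(axis=1)
logprior = -0.5 * theta1**2                       # Gaussian prior N(0, 1)

logpost = loglik + logprior
post = np.exp(logpost - logpost.max())            # unnormalized posterior
post /= post.sum() * (theta1[1] - theta1[0])      # normalize on the grid

print(theta1[np.argmax(post)])                    # posterior mode, close to 2
```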

SLIDE 28

Probability Theory

Information and Entropy

◮ Information is the reduction of uncertainty.
◮ The entropy H(X) is a quantitative description of uncertainty:
  ◮ H(X) = 0: certainty about X.
  ◮ H(X) is maximal if all possibilities are equally probable.
  ◮ Uncertainty and information are additive.
◮ These conditions are fulfilled by the entropy function:

  H(X) = − Σ_{x∈X} P(X = x) log P(X = x)

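A small Python sketch of the definition (the example distributions are invented for illustration): entropy is zero for a deterministic variable and maximal for a uniform one.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits; 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -(p[nz] * np.log2(p[nz])).sum()

print(entropy([1.0, 0.0, 0.0, 0.0]))   # 0.0 -> certainty
print(entropy([0.25] * 4))             # 2.0 -> maximal for 4 outcomes (log2 4)
print(entropy([0.7, 0.1, 0.1, 0.1]))   # somewhere in between
```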

SLIDE 30

Probability Theory

Definitions related to entropy and information

◮ Entropy is the average surprise, where the surprise of an outcome x is −log P(X = x):

  H(X) = Σ_{x∈X} P(X = x) (−log P(X = x))

◮ Conditional entropy:

  H(X | Y) = − Σ_{x∈X, y∈Y} P(X = x, Y = y) log P(X = x | Y = y)

◮ Mutual information:

  I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)

◮ Under independence of X and Y, p(x, y) = p(x)p(y), the mutual information vanishes: I(X; Y) = 0.

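A Python sketch computing these quantities from a joint table (the table is invented for illustration; the entropy helper from above is repeated so the snippet is self-contained):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -(p[nz] * np.log2(p[nz])).sum()

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])        # invented joint P(X, Y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)
print(mi)                              # > 0: X and Y are dependent

indep = np.outer(p_x, p_y)             # product of the marginals
print(entropy(p_x) + entropy(p_y) - entropy(indep))  # ~0 under independence
```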

SLIDE 34

Probability Theory

Entropy in action

The optimal weighing problem

◮ Given 12 balls, all equal except for one that is lighter or heavier.
◮ What is the ideal weighing strategy, and how many weighings are needed to identify the odd ball?
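As a sketch of the standard entropy argument (the slide leaves the answer open; the bound below is a well-known counting fact, not from the slides): there are 12 · 2 = 24 possible states (which ball is odd, and whether it is lighter or heavier), while each weighing on a balance has at most 3 outcomes (left heavier, right heavier, balanced) and thus yields at most log₂ 3 bits. This gives a lower bound on the number of weighings:

```latex
% Counting/entropy lower bound for the 12-ball puzzle
\#\text{weighings} \;\ge\; \frac{\log 24}{\log 3} \;=\; \log_3 24 \;\approx\; 2.89
\;\Rightarrow\; \text{at least } 3 \text{ weighings}
```

Three weighings indeed suffice, with a strategy that keeps the three outcomes of each weighing as close to equally probable as possible.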

SLIDE 35

Probability Theory

Probability distributions

◮ Gaussian:

  p(x | µ, σ²) = N(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

  [Figure: univariate Gaussian density]

◮ Multivariate Gaussian:

  p(x | µ, Σ) = N(x | µ, Σ) = (1/√|2πΣ|) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))

  [Figure: contours of a bivariate Gaussian with Σ = [[1, 0.8], [0.8, 1]]]

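A Python sketch (not from the slides; uses scipy.stats, parameter values are illustrative except for the covariance, which matches the figure) evaluating both densities:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate Gaussian N(x | mu=0, sigma^2=1)
print(norm.pdf(0.0, loc=0.0, scale=1.0))          # density at x = 0

# Bivariate Gaussian with the covariance from the slide
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
mvn = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)
print(mvn.pdf([0.5, 0.5]))                        # density at x = (0.5, 0.5)

# The 0.8 covariance shows up as correlated samples
samples = mvn.rvs(size=1000, random_state=0)
print(np.corrcoef(samples.T)[0, 1])               # close to 0.8
```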

SLIDE 37

Probability Theory

Probability distributions, continued

◮ Bernoulli (x ∈ {0, 1}):

  p(x | θ) = θ^x (1 − θ)^(1−x)

◮ Gamma (x > 0):

  p(x | a, b) = (b^a / Γ(a)) x^(a−1) e^(−bx)

  [Figure: Gamma density p(x | a = 1, b = 1)]

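A Python sketch of both densities (not from the slides; note that scipy.stats parameterizes the Gamma by a shape a and a scale, so the rate b above enters as scale = 1/b):

```python
from scipy.stats import bernoulli, gamma

theta = 0.3
print(bernoulli.pmf(1, theta))         # P(x=1) = theta = 0.3
print(bernoulli.pmf(0, theta))         # P(x=0) = 1 - theta = 0.7

a, b = 1.0, 1.0                        # shape a, rate b as on the slide
print(gamma.pdf(0.5, a, scale=1.0/b))  # p(x=0.5 | a=1, b=1) = e^{-0.5}
```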

SLIDE 39

Probability Theory

Probability distributions

The Gaussian revisited

◮ Gaussian PDF:

  N(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

◮ Positive: N(x | µ, σ²) > 0.
◮ Normalized: ∫_{−∞}^{+∞} N(x | µ, σ²) dx = 1 (check).
◮ Expectation: ⟨x⟩ = ∫_{−∞}^{+∞} N(x | µ, σ²) x dx = µ.
◮ Variance: Var[x] = ⟨x²⟩ − ⟨x⟩² = (µ² + σ²) − µ² = σ².

[Figure: univariate Gaussian density]

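These properties are easy to sanity-check numerically. A Python sketch (not from the slides; the parameter values are illustrative):

```python
import numpy as np

mu, sigma = 1.5, 0.7
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)

print(x.mean())                  # ~ mu       (expectation)
print(x.var())                   # ~ sigma^2  (variance)

# Normalization, via a Riemann sum on a wide grid
grid = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 10_001)
pdf = np.exp(-0.5 * (grid - mu) ** 2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
print((pdf * (grid[1] - grid[0])).sum())   # ~ 1
```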

SLIDE 41

Parameter Inference for the Gaussian

Outline

◮ Motivation
◮ Prerequisites
◮ Probability Theory
◮ Parameter Inference for the Gaussian
◮ Summary

SLIDE 42

Parameter Inference for the Gaussian

Inference for the Gaussian

Ingredients

◮ Data:

  D = {x_1, . . . , x_N}

◮ Model H_Gauss, the Gaussian PDF with parameters θ = {µ, σ²}:

  N(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

◮ Likelihood:

  p(D | θ) = ∏_{n=1}^N N(x_n | µ, σ²)

[Figure: Gaussian density p(x) with the likelihood contributions N(x_n | µ, σ²) of the datapoints x_n (C.M. Bishop, Pattern Recognition and Machine Learning)]


SLIDE 45

Parameter Inference for the Gaussian

Inference for the Gaussian

Maximum likelihood

◮ Likelihood:

  p(D | θ) = ∏_{n=1}^N N(x_n | µ, σ²)

◮ Maximum likelihood:

  θ̂ = argmax_θ p(D | θ)

(C.M. Bishop, Pattern Recognition and Machine Learning)

SLIDE 46

Parameter Inference for the Gaussian

Inference for the Gaussian

Maximum likelihood

◮ Maximizing the likelihood directly:

  θ̂ = argmax_θ p(D | θ) = argmax_θ ∏_{n=1}^N (1/√(2πσ²)) exp(−(x_n − µ)²/(2σ²))

◮ Since the logarithm is monotonic, we may equivalently maximize the log-likelihood:

  θ̂ = argmax_θ ln p(D | θ) = argmax_θ ln ∏_{n=1}^N (1/√(2πσ²)) exp(−(x_n − µ)²/(2σ²))

SLIDE 48

Parameter Inference for the Gaussian

Inference for the Gaussian

Maximum likelihood

◮ Expanding the log-likelihood:

  θ̂ = argmax_θ ln p(D | θ) = argmax_θ [ −(N/2) ln(2π) − (N/2) ln σ² − (1/(2σ²)) Σ_{n=1}^N (x_n − µ)² ]

◮ Setting the derivatives to zero:

  µ̂ :  d/dµ ln p(D | µ) = 0,   σ̂² :  d/dσ² ln p(D | σ²) = 0

slide-51
SLIDE 51

Parameter Inference for the Gaussian

Inference for the Gaussian

Maximum likelihood

◮ Maximum likelihood solutions

µML = 1 N

N

  • n=1

xn σ2

ML = 1

N

N

  • n=1

(xn − µML)2 Equivalent to common mean and variance estimators (almost).

◮ Maximum likelihood ignores parameter uncertainty

◮ Think of the ML solution for a single observed datapoint x1

µML1 = x1 σ2

ML1 = (x1 − µML1)2 = 0 ◮ How about Bayesian inference?

  • O. Stegle & K. Borgwardt

An introduction to probabilistic modeling T¨ ubingen 28

slide-52
SLIDE 52

Parameter Inference for the Gaussian

Inference for the Gaussian

Maximum likelihood

◮ Maximum likelihood solutions

µML = 1 N

N

  • n=1

xn σ2

ML = 1

N

N

  • n=1

(xn − µML)2 Equivalent to common mean and variance estimators (almost).

◮ Maximum likelihood ignores parameter uncertainty

◮ Think of the ML solution for a single observed datapoint x1

µML1 = x1 σ2

ML1 = (x1 − µML1)2 = 0 ◮ How about Bayesian inference?

  • O. Stegle & K. Borgwardt

An introduction to probabilistic modeling T¨ ubingen 28

slide-53
SLIDE 53

Parameter Inference for the Gaussian

Inference for the Gaussian

Maximum likelihood

◮ Maximum likelihood solutions

µML = 1 N

N

  • n=1

xn σ2

ML = 1

N

N

  • n=1

(xn − µML)2 Equivalent to common mean and variance estimators (almost).

◮ Maximum likelihood ignores parameter uncertainty

◮ Think of the ML solution for a single observed datapoint x1

µML1 = x1 σ2

ML1 = (x1 − µML1)2 = 0 ◮ How about Bayesian inference?

  • O. Stegle & K. Borgwardt

An introduction to probabilistic modeling T¨ ubingen 28
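A Python sketch of the ML estimators (not from the slides; synthetic data for illustration), including the degenerate single-datapoint case:

```python
import numpy as np

def ml_gaussian(x):
    """Maximum-likelihood estimates for a univariate Gaussian."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                        # mu_ML = (1/N) sum_n x_n
    var = ((x - mu) ** 2).mean()         # sigma^2_ML, the biased 1/N estimator
    return mu, var

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)
print(ml_gaussian(data))                 # ~ (2.0, 1.0)

print(ml_gaussian([3.7]))                # (3.7, 0.0): zero variance from one point
```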

SLIDE 54

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

Ingredients

◮ Data:

  D = {x_1, . . . , x_N}

◮ Model H_Gauss, the Gaussian PDF, now with θ = {µ}:

  N(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

◮ For simplicity: assume the variance σ² is known.
◮ Likelihood:

  p(D | µ) = ∏_{n=1}^N N(x_n | µ, σ²)

[Figure: Gaussian density p(x) with the likelihood contributions N(x_n | µ, σ²) of the datapoints x_n (C.M. Bishop, Pattern Recognition and Machine Learning)]
slide-55
SLIDE 55

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

Ingredients

◮ Data

D = {x1, . . . , xN}

◮ Model HGauss – Gaussian PDF

N

  • x
  • µ, σ2

= 1 √ 2πσ2 e−

1 2σ2 (x−µ)2

θ = {µ}

◮ For simplicity: assume variance σ2 is

known.

◮ Likelihood

p(D | µ) =

N

  • n=1

N

  • xn
  • µ, σ2

−6 −4 −2 2 4 6 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40

  • O. Stegle & K. Borgwardt

An introduction to probabilistic modeling T¨ ubingen 29

slide-56
SLIDE 56

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

Ingredients

◮ Data

D = {x1, . . . , xN}

◮ Model HGauss – Gaussian PDF

N

  • x
  • µ, σ2

= 1 √ 2πσ2 e−

1 2σ2 (x−µ)2

θ = {µ}

◮ For simplicity: assume variance σ2 is

known.

◮ Likelihood

p(D | µ) =

N

  • n=1

N

  • xn
  • µ, σ2

x p(x) xn N(xn|µ, σ2)

(C.M. Bishop, Pattern Recognition and Machine Learning)

  • O. Stegle & K. Borgwardt

An introduction to probabilistic modeling T¨ ubingen 29

SLIDE 57

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

Bayes' rule

◮ Combine the likelihood with a Gaussian prior over µ:

  p(µ) = N(µ | m_0, s_0²)

◮ The posterior is proportional to likelihood times prior:

  p(µ | D, σ²) ∝ p(D | µ, σ²) p(µ)

SLIDE 58

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

p(µ | D, σ²) ∝ p(D | µ) p(µ)

  = ∏_{n=1}^N (1/√(2πσ²)) exp(−(x_n − µ)²/(2σ²)) · (1/√(2πs_0²)) exp(−(µ − m_0)²/(2s_0²))

  = (1/√(2πσ²))^N (1/√(2πs_0²)) · exp[ −(1/(2s_0²)) (µ² − 2µm_0 + m_0²) − (1/(2σ²)) Σ_{n=1}^N (µ² − 2µx_n + x_n²) ]
    (the leading factor is a constant C_1)

  = C_2 exp[ −(1/2) (1/s_0² + N/σ²) ( µ² − 2µ · σ̂² (m_0/s_0² + (1/σ²) Σ_{n=1}^N x_n) ) + C_3 ],
    where 1/σ̂² = 1/s_0² + N/σ² and µ̂ = σ̂² (m_0/s_0² + (1/σ²) Σ_{n=1}^N x_n)

◮ The posterior parameters µ̂ and σ̂² follow as the new coefficients.
◮ Note: all the constants we dropped on the way yield the model evidence:

  p(µ | D, σ²) = p(D | µ) p(µ) / Z


SLIDE 61

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

◮ Posterior of the mean: p(µ | D, σ²) = N(µ | µ̂, σ̂²), where after some rewriting

  µ̂ = σ²/(Ns_0² + σ²) · m_0 + Ns_0²/(Ns_0² + σ²) · µ_ML,   with µ_ML = (1/N) Σ_{n=1}^N x_n

  1/σ̂² = 1/s_0² + N/σ²

◮ Limiting cases for no and an infinite amount of data:

  N = 0:    µ̂ = m_0,   σ̂² = s_0²
  N → ∞:   µ̂ = µ_ML,  σ̂² → 0
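A Python sketch of these closed-form updates (not from the slides; the prior parameters and data are chosen for illustration; the helper name posterior_mu is ours):

```python
import numpy as np

def posterior_mu(x, sigma2, m0, s02):
    """Posterior N(mu_hat, sigma_hat2) over the mean, variance sigma2 known."""
    x = np.asarray(x, dtype=float)
    N = x.size
    mu_ml = x.mean() if N > 0 else 0.0   # weight on mu_ml is zero when N == 0
    mu_hat = (sigma2 * m0 + N * s02 * mu_ml) / (N * s02 + sigma2)
    sigma_hat2 = 1.0 / (1.0 / s02 + N / sigma2)
    return mu_hat, sigma_hat2

rng = np.random.default_rng(0)
for N in [0, 1, 2, 10, 1000]:
    x = rng.normal(1.0, 0.5, size=N)
    print(N, posterior_mu(x, sigma2=0.25, m0=0.0, s02=1.0))
# N = 0 recovers the prior (m0, s02); large N approaches (mu_ML, 0).
```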

SLIDE 62

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

Examples

◮ Posterior p(µ | D, σ²) for increasing data sizes.

[Figure: posterior densities for N = 0, 1, 2, 10, sharpening around the true mean as N grows (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 63

Parameter Inference for the Gaussian

Conjugate priors

◮ It is no accident that the posterior

  p(µ | D, σ²) ∝ p(D | µ, σ²) p(µ)

  is tractable in closed form for the Gaussian.

Conjugate prior
p(θ) is a conjugate prior for a particular likelihood p(D | θ) if the posterior is of the same functional form as the prior.


SLIDE 65

Parameter Inference for the Gaussian

Conjugate priors

Exponential family distributions

◮ A large class of probability distributions (including all in this course) belongs to the exponential family and can be written as:

  p(x | θ) = h(x) g(θ) exp{θᵀ u(x)}

◮ For example, for the Gaussian:

  p(x | µ, σ²) = (1/√(2πσ²)) exp{−(1/(2σ²)) (x² − 2xµ + µ²)} = h(x) g(θ) exp{θᵀ u(x)}

  with θ = (µ/σ², −1/(2σ²))ᵀ, u(x) = (x, x²)ᵀ, h(x) = 1/√(2π), g(θ) = (−2θ₂)^(1/2) exp(θ₁²/(4θ₂))
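A quick numerical check (a Python sketch, not from the slides) that the natural-parameter form above reproduces the Gaussian density:

```python
import numpy as np

mu, sigma2 = 1.0, 2.0
theta1, theta2 = mu / sigma2, -1.0 / (2.0 * sigma2)   # natural parameters

def expfam_pdf(x):
    h = 1.0 / np.sqrt(2.0 * np.pi)
    g = np.sqrt(-2.0 * theta2) * np.exp(theta1**2 / (4.0 * theta2))
    return h * g * np.exp(theta1 * x + theta2 * x**2)

def gauss_pdf(x):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(expfam_pdf(x), gauss_pdf(x))       # identical densities
print(expfam_pdf(0.0), gauss_pdf(0.0))
```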

SLIDE 66

Parameter Inference for the Gaussian

Conjugate priors

Exponential family distributions

Conjugacy and exponential family distributions

◮ For all members of the exponential family it is possible to construct a conjugate prior.
◮ Intuition: the exponential form ensures that we can construct a prior that keeps its functional form.
◮ Conjugate priors for the Gaussian N(x | µ, σ²):
  ◮ For the mean: p(µ) = N(µ | m_0, s_0²)
  ◮ For the precision: p(1/σ²) = Γ(1/σ² | a_0, b_0)


SLIDE 68

Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

Sequential learning

◮ Bayes' rule naturally lends itself to sequential learning.
◮ Assume multiple datasets become available one by one: D_1, . . . , D_S.

  p_1(θ) ∝ p(D_1 | θ) p(θ)
  p_2(θ) ∝ p(D_2 | θ) p_1(θ)
  . . .

◮ Note: assuming the datasets are independent, sequential updates and a single learning step yield the same answer.
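A Python sketch demonstrating this equivalence for the Gaussian mean (not from the slides; it reuses the illustrative posterior_mu update from above, feeding each batch's posterior in as the next prior):

```python
import numpy as np

def posterior_mu(x, sigma2, m0, s02):
    x = np.asarray(x, dtype=float)
    N = x.size
    mu_ml = x.mean() if N > 0 else 0.0
    mu_hat = (sigma2 * m0 + N * s02 * mu_ml) / (N * s02 + sigma2)
    return mu_hat, 1.0 / (1.0 / s02 + N / sigma2)

rng = np.random.default_rng(0)
batches = [rng.normal(1.0, 0.5, size=n) for n in (3, 5, 7)]

# Sequential: each posterior becomes the prior for the next batch
m, s2 = 0.0, 1.0
for batch in batches:
    m, s2 = posterior_mu(batch, sigma2=0.25, m0=m, s02=s2)

# Batch: one update on all the data at once
m_all, s2_all = posterior_mu(np.concatenate(batches), sigma2=0.25, m0=0.0, s02=1.0)

print(np.allclose([m, s2], [m_all, s2_all]))   # True: same posterior
```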

SLIDE 69

Summary

Outline

◮ Motivation
◮ Prerequisites
◮ Probability Theory
◮ Parameter Inference for the Gaussian
◮ Summary

SLIDE 70

Summary

Summary

◮ Probability theory is the language of uncertainty.
◮ Key rules of probability: sum rule, product rule.
◮ Bayes' rule forms the foundation of learning (posterior ∝ likelihood · prior).
◮ The entropy quantifies uncertainty.
◮ Parameter learning using maximum likelihood.
◮ Bayesian inference for the Gaussian.