
SLIDE 1

Linear models

Oliver Stegle and Karsten Borgwardt
Machine Learning and Computational Biology Research Group, Max Planck Institute for Biological Cybernetics and Max Planck Institute for Developmental Biology, Tübingen

SLIDE 2

Motivation
Curve fitting

Tasks we are interested in:

◮ Making predictions
◮ Comparison of alternative models

[Figure: curve-fitting example — observed data (X, Y) and a new input x* at which a prediction is required.]


SLIDE 4

Motivation
Further reading, useful material

◮ Christopher M. Bishop: Pattern Recognition and Machine Learning.
◮ Good background; covers most of the course material and much more.
◮ This lecture is largely inspired by chapter 3 of the book.

SLIDE 5

Outline

SLIDE 6

Linear Regression
Outline

◮ Motivation
◮ Linear Regression
◮ Bayesian linear regression
◮ Model comparison and hypothesis testing
◮ Summary

SLIDE 7

Linear Regression
Regression: noise model and likelihood

◮ Given a dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, where each $x_n = (x_{n,1}, \dots, x_{n,D})$ is $D$-dimensional, fit the parameters $\theta$ of a regressor $f$ with added Gaussian noise:
  $y_n = f(x_n; \theta) + \epsilon_n$, where $p(\epsilon \mid \sigma^2) = \mathcal{N}(\epsilon \mid 0, \sigma^2)$.
◮ Equivalent likelihood formulation:
  $p(\mathbf{y} \mid \mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}\bigl(y_n \mid f(x_n), \sigma^2\bigr)$
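As an illustration of this noise model (not part of the original slides), the sketch below simulates data from a known linear function with Gaussian noise and evaluates the corresponding log-likelihood for two candidate weight vectors; all names (w_true, sigma, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: y_n = f(x_n; theta) + eps_n with f linear
N, D = 50, 3
sigma = 0.5
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def gaussian_log_likelihood(w, X, y, sigma2):
    """ln p(y | X, w, sigma^2) = sum_n ln N(y_n | w^T x_n, sigma^2)."""
    resid = y - X @ w
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

print(gaussian_log_likelihood(w_true, X, y, sigma**2))       # high for the true weights
print(gaussian_log_likelihood(np.zeros(D), X, y, sigma**2))  # much lower for w = 0
```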

SLIDE 8

Linear Regression
Regression: choosing a regressor

◮ Choose f to be linear:
  $p(\mathbf{y} \mid \mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}\bigl(y_n \mid \mathbf{w}^{\top} x_n + c, \sigma^2\bigr)$
◮ Consider the bias-free case, $c = 0$; otherwise include an additional column of ones in each $x_n$.

[Figure: equivalent graphical model of the linear regression likelihood.]


SLIDE 10

Linear Regression
Maximum likelihood

◮ Taking the logarithm, we obtain
  $\ln p(\mathbf{y} \mid \mathbf{w}, \mathbf{X}, \sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}\bigl(y_n \mid \mathbf{w}^{\top} x_n, \sigma^2\bigr) = -\frac{N}{2} \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \underbrace{\sum_{n=1}^{N} \bigl(y_n - \mathbf{w}^{\top} x_n\bigr)^2}_{\text{sum of squares}}$
◮ The likelihood is maximized when the squared error is minimized.
◮ Least squares and maximum likelihood are equivalent.


SLIDE 13

Linear Regression
Linear Regression and Least Squares

[Figure: the least-squares error is the vertical distance between each target $y_n$ and the model value $f(x_n, \mathbf{w})$. (C.M. Bishop, Pattern Recognition and Machine Learning)]

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl(y_n - \mathbf{w}^{\top} x_n\bigr)^2$

SLIDE 14

Linear Regression
Linear Regression and Least Squares

◮ Derivative w.r.t. a single weight entry $w_i$:
  $\frac{d}{dw_i} \ln p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = \frac{d}{dw_i}\left[-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}\cdot x_n)^2\right] = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}\cdot x_n)\, x_{n,i}$
◮ Set the gradient w.r.t. $\mathbf{w}$ to zero:
  $\nabla_{\mathbf{w}} \ln p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}\cdot x_n)\, x_n^{\top} = 0 \;\Longrightarrow\; \mathbf{w}_{\mathrm{ML}} = \underbrace{(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}}_{\text{pseudo-inverse}}\,\mathbf{y}$
◮ Here, the matrix $\mathbf{X}$ is defined as
  $\mathbf{X} = \begin{pmatrix} x_{1,1} & \dots & x_{1,D} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \dots & x_{N,D} \end{pmatrix}$
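The closed-form maximum-likelihood solution can be checked numerically. The sketch below (an illustration with made-up data, not from the slides) compares the normal-equation solution with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 100, 4
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Normal equations: w_ML = (X^T X)^{-1} X^T y
w_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically preferable): least squares via SVD
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_ml, w_lstsq))  # True
```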

SLIDE 15

Linear Regression
Polynomial Curve Fitting

◮ Use the polynomials up to degree $K$ to construct new features from $x$:
  $f(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_K x^K = \mathbf{w}^{\top}\phi(x)$, where we defined $\phi(x) = (1, x, x^2, \dots, x^K)$.
◮ Similarly, $\phi$ can be any feature mapping.
◮ Possible to show: the feature map $\phi$ can be expressed in terms of kernels (kernel trick).
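As a small illustration (not from the slides), polynomial features can be built explicitly and plugged into the same least-squares machinery; the degree K and the synthetic data below are arbitrary choices.

```python
import numpy as np

def polynomial_features(x, K):
    """phi(x) = (1, x, x^2, ..., x^K) for a 1-D input array x."""
    return np.vander(x, N=K + 1, increasing=True)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, size=30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

K = 3
Phi = polynomial_features(x, K)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares fit in feature space
y_hat = Phi @ w                                # fitted curve at the training inputs
```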


SLIDE 17

Linear Regression
Polynomial Curve Fitting: overfitting

◮ The degree of the polynomial is crucial to avoid under- and overfitting (see the sketch below).

[Figures: polynomial fits of degree M = 0, 1, 3 and 9 to the same data; low degrees underfit, M = 9 overfits. (C.M. Bishop, Pattern Recognition and Machine Learning)]
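A quick way to see the effect of the degree (an illustration on synthetic data, not part of the slides) is to compare training and held-out error for the degrees shown in the figures:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

for M in (0, 1, 3, 9):
    Phi_train = np.vander(x_train, N=M + 1, increasing=True)
    Phi_test = np.vander(x_test, N=M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    rmse = lambda Phi, y: np.sqrt(np.mean((y - Phi @ w) ** 2))
    print(f"M={M}: train RMSE {rmse(Phi_train, y_train):.3f}, "
          f"test RMSE {rmse(Phi_test, y_test):.3f}")
```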


SLIDE 21

Linear Regression
Regularized Least Squares

◮ Solutions to avoid overfitting:
  ◮ Intelligently choose K
  ◮ Regularize the regression weights w
◮ Construct a smoothed error function:
  $E(\mathbf{w}) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} \bigl(y_n - \mathbf{w}^{\top}\phi(x_n)\bigr)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2}\,\mathbf{w}^{\top}\mathbf{w}}_{\text{regularizer}}$
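Minimizing this regularized error also has a closed form, $\mathbf{w} = (\lambda\mathbf{I} + \Phi^{\top}\Phi)^{-1}\Phi^{\top}\mathbf{y}$. The sketch below (an illustration of that standard result on made-up data, not from the slides) implements it and shows how the regularizer shrinks the weights of a high-degree polynomial fit.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize 0.5*||y - Phi w||^2 + 0.5*lam*||w||^2 in closed form."""
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
Phi = np.vander(x, N=10, increasing=True)     # degree-9 polynomial features

w_unreg = ridge_fit(Phi, y, lam=1e-12)        # essentially unregularized
w_reg = ridge_fit(Phi, y, lam=1e-3)
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))  # regularization shrinks the weights
```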


SLIDE 23

Linear Regression
Regularized Least Squares: more general regularizers

◮ A more general regularization approach:
  $E(\mathbf{w}) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} \bigl(y_n - \mathbf{w}^{\top}\phi(x_n)\bigr)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |w_d|^q}_{\text{regularizer}}$

[Figure: contours of the regularizer for q = 0.5, 1, 2, 4; q = 2 is the quadratic regularizer, q = 1 is the Lasso and yields sparse solutions. (C.M. Bishop, Pattern Recognition and Machine Learning)]
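For q = 2 and q = 1 ready-made solvers exist. A minimal sketch with scikit-learn (assuming it is installed; the alpha values and data are arbitrary) contrasts the dense ridge solution with the sparse Lasso solution.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
N, D = 80, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 0.7]          # only three informative features
y = X @ w_true + rng.normal(scale=0.1, size=N)

ridge = Ridge(alpha=1.0).fit(X, y)     # q = 2: all coefficients shrunk, none exactly zero
lasso = Lasso(alpha=0.05).fit(X, y)    # q = 1: many coefficients exactly zero (sparse)

print(np.sum(np.abs(ridge.coef_) < 1e-6), "ridge coefficients are zero")
print(np.sum(np.abs(lasso.coef_) < 1e-6), "lasso coefficients are zero")
```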


SLIDE 26

Linear Regression
Loss functions and other methods

◮ Even more general: vary the loss function
  $E(\mathbf{w}) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} L\bigl(y_n - \mathbf{w}^{\top}\phi(x_n)\bigr)}_{\text{loss}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |w_d|^q}_{\text{regularizer}}$
◮ Many state-of-the-art machine learning methods can be expressed within this framework:
  ◮ Linear regression: squared loss, squared regularizer.
  ◮ Support vector machine: hinge loss, squared regularizer.
  ◮ Lasso: squared loss, L1 regularizer.
◮ Inference: minimize the cost function E(w), yielding a point estimate for w.
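When no closed form exists for a particular loss/regularizer pair, E(w) can simply be minimized numerically. The sketch below (an illustration, assuming SciPy is available; the Huber loss and all data are choices made here, not taken from the slides) does this for a robust loss with an L2 regularizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
N, D = 60, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)
y[:5] += 5.0                      # a few outliers

def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

def objective(w, lam=0.1):
    r = y - X @ w
    return np.sum(huber(r)) + 0.5 * lam * w @ w

w_hat = minimize(objective, x0=np.zeros(D)).x
print(w_hat)                      # close to w_true despite the outliers
```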


SLIDE 29

Linear Regression
Regularized Least Squares: probabilistic equivalent

◮ So far: minimization of error functions. Back to probabilities?
  $E(\mathbf{w}) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} \bigl(y_n - \mathbf{w}^{\top}\phi(x_n)\bigr)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2}\,\mathbf{w}^{\top}\mathbf{w}}_{\text{regularizer}}$
  $= -\ln p(\mathbf{y} \mid \mathbf{w}, \Phi(\mathbf{X}), \sigma^2) - \ln p(\mathbf{w}) = -\sum_{n=1}^{N} \ln \mathcal{N}\bigl(y_n \mid \mathbf{w}^{\top}\phi(x_n), \sigma^2\bigr) - \ln \mathcal{N}\bigl(\mathbf{w} \mid \mathbf{0}, \tfrac{1}{\lambda}\mathbf{I}\bigr)$
◮ Similarly: most other choices of regularizers and loss functions can be mapped to an equivalent probabilistic representation.


SLIDE 33

Bayesian linear regression
Outline

◮ Motivation
◮ Linear Regression
◮ Bayesian linear regression
◮ Model comparison and hypothesis testing
◮ Summary

SLIDE 34

Bayesian linear regression

◮ Likelihood as before:
  $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}\bigl(y_n \mid \mathbf{w}^{\top}\phi(x_n), \sigma^2\bigr)$
◮ Define a conjugate prior over w:
  $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$


SLIDE 36

Bayesian linear regression

◮ Posterior probability of w:
  $p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \sigma^2) \propto \prod_{n=1}^{N} \mathcal{N}\bigl(y_n \mid \mathbf{w}^{\top}\phi(x_n), \sigma^2\bigr)\cdot \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) = \mathcal{N}\bigl(\mathbf{y} \mid \Phi(\mathbf{X})\mathbf{w}, \sigma^2\mathbf{I}\bigr)\cdot \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_{\mathbf{w}}, \boldsymbol{\Sigma}_{\mathbf{w}})$
◮ where
  $\boldsymbol{\mu}_{\mathbf{w}} = \boldsymbol{\Sigma}_{\mathbf{w}}\bigl(\mathbf{S}_0^{-1}\mathbf{m}_0 + \tfrac{1}{\sigma^2}\Phi(\mathbf{X})^{\top}\mathbf{y}\bigr), \qquad \boldsymbol{\Sigma}_{\mathbf{w}} = \bigl(\mathbf{S}_0^{-1} + \tfrac{1}{\sigma^2}\Phi(\mathbf{X})^{\top}\Phi(\mathbf{X})\bigr)^{-1}$
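These posterior formulas translate directly into code. The sketch below (illustration only, with an arbitrary prior, noise level, and synthetic data) computes μ_w and Σ_w.

```python
import numpy as np

def posterior(Phi, y, sigma2, m0, S0):
    """Posterior N(w | mu_w, Sigma_w) for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    Sigma_w = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mu_w = Sigma_w @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mu_w, Sigma_w

rng = np.random.default_rng(7)
N, D = 30, 2
Phi = rng.normal(size=(N, D))
w_true = np.array([0.5, -0.3])
sigma2 = 0.05
y = Phi @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

mu_w, Sigma_w = posterior(Phi, y, sigma2, m0=np.zeros(D), S0=np.eye(D))
print(mu_w)        # close to w_true
print(Sigma_w)     # small posterior uncertainty
```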

SLIDE 37

Bayesian linear regression: prior choice

◮ A common choice is a prior that corresponds to regularized regression:
  $p(\mathbf{w}) = \mathcal{N}\bigl(\mathbf{w} \mid \mathbf{0}, \tfrac{1}{\lambda}\mathbf{I}\bigr)$
◮ In this case
  $\boldsymbol{\mu}_{\mathbf{w}} = \tfrac{1}{\sigma^2}\boldsymbol{\Sigma}_{\mathbf{w}}\Phi(\mathbf{X})^{\top}\mathbf{y}, \qquad \boldsymbol{\Sigma}_{\mathbf{w}} = \bigl(\lambda\mathbf{I} + \tfrac{1}{\sigma^2}\Phi(\mathbf{X})^{\top}\Phi(\mathbf{X})\bigr)^{-1}$


SLIDE 39

Bayesian linear regression: example

[Figures: Bayesian linear regression posterior and predictions after observing 0, 1, and 20 data points. (C.M. Bishop, Pattern Recognition and Machine Learning)]


SLIDE 42

Bayesian linear regression: making predictions

◮ Prediction for a fixed weight $\hat{\mathbf{w}}$ at input $x_\star$ is trivial:
  $p(y_\star \mid x_\star, \hat{\mathbf{w}}, \sigma^2) = \mathcal{N}\bigl(y_\star \mid \hat{\mathbf{w}}^{\top}\phi(x_\star), \sigma^2\bigr)$
◮ Integrate over w to take the posterior uncertainty into account:
  $p(y_\star \mid x_\star, \mathcal{D}) = \int_{\mathbf{w}} p(y_\star \mid x_\star, \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \sigma^2)\, d\mathbf{w} = \int_{\mathbf{w}} \mathcal{N}\bigl(y_\star \mid \mathbf{w}^{\top}\phi(x_\star), \sigma^2\bigr)\, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_{\mathbf{w}}, \boldsymbol{\Sigma}_{\mathbf{w}})\, d\mathbf{w} = \mathcal{N}\bigl(y_\star \mid \boldsymbol{\mu}_{\mathbf{w}}^{\top}\phi(x_\star),\; \sigma^2 + \phi(x_\star)^{\top}\boldsymbol{\Sigma}_{\mathbf{w}}\phi(x_\star)\bigr)$
◮ Key points:
  ◮ The prediction is again Gaussian.
  ◮ The predictive variance is increased due to the posterior uncertainty in w.
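The predictive mean and variance follow directly from μ_w and Σ_w; the sketch below (illustration only, with a hypothetical posterior such as the one returned by the earlier posterior sketch) evaluates them at a single test point.

```python
import numpy as np

def predict(phi_star, mu_w, Sigma_w, sigma2):
    """Predictive distribution N(y* | mean, var) at a feature vector phi(x*)."""
    mean = mu_w @ phi_star
    var = sigma2 + phi_star @ Sigma_w @ phi_star
    return mean, var

# Hypothetical posterior (e.g. as returned by the posterior sketch above)
mu_w = np.array([0.5, -0.3])
Sigma_w = np.array([[0.01, 0.0], [0.0, 0.02]])
sigma2 = 0.05

phi_star = np.array([1.0, 2.0])        # test point in feature space
mean, var = predict(phi_star, mu_w, Sigma_w, sigma2)
print(mean, np.sqrt(var))              # predictive mean and standard deviation
```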


SLIDE 45

Model comparison and hypothesis testing
Outline

◮ Motivation
◮ Linear Regression
◮ Bayesian linear regression
◮ Model comparison and hypothesis testing
◮ Summary

SLIDE 46

Model comparison and hypothesis testing
Model comparison: motivation

◮ What degree of polynomial describes the data best?
◮ Is the linear model at all appropriate?
◮ Association testing.

[Figure: genome-wide association testing — SNP genotypes across individuals (genome) are tested for association with phenotypes (phenome).]


SLIDE 48

Model comparison and hypothesis testing
Bayesian model comparison

◮ How do we choose among alternative models?
◮ Assume we want to choose among models $\mathcal{H}_0, \dots, \mathcal{H}_M$ for a dataset $\mathcal{D}$.
◮ Posterior probability for a particular model i:
  $p(\mathcal{H}_i \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid \mathcal{H}_i)}_{\text{evidence}}\;\underbrace{p(\mathcal{H}_i)}_{\text{prior}}$


SLIDE 50

Model comparison and hypothesis testing
Bayesian model comparison: how to calculate the evidence

◮ The evidence is not the model likelihood!
  $p(\mathcal{D} \mid \mathcal{H}_i) = \int_{\boldsymbol{\theta}} p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$ for model parameters $\boldsymbol{\theta}$.
◮ Remember:
  $p(\boldsymbol{\theta} \mid \mathcal{H}_i, \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{H}_i, \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{D} \mid \mathcal{H}_i)}, \qquad \text{posterior} = \frac{\text{likelihood}\cdot\text{prior}}{\text{evidence}}$
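For the linear-Gaussian model with a fixed noise variance and a zero-mean Gaussian prior on w, the evidence integral is itself Gaussian, $p(\mathbf{y}\mid\mathbf{X}) = \mathcal{N}(\mathbf{y}\mid\mathbf{0}, \sigma^2\mathbf{I} + \tfrac{1}{\lambda}\mathbf{X}\mathbf{X}^{\top})$. The sketch below (an illustration under those simplifying assumptions, which the slides do not spell out) evaluates its logarithm.

```python
import numpy as np

def log_evidence(X, y, sigma2, lam):
    """ln p(y | X) = ln N(y | 0, sigma2*I + (1/lam)*X X^T), assuming w ~ N(0, I/lam)."""
    N = len(y)
    C = sigma2 * np.eye(N) + X @ X.T / lam
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=40)
print(log_evidence(X, y, sigma2=0.09, lam=1.0))
```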


SLIDE 52

Model comparison and hypothesis testing
Bayesian model comparison: Occam's razor

◮ The evidence integral penalizes overly complex models.
◮ A model with few parameters and a lower maximum likelihood (H1) may win over a model with a peaked likelihood that requires many more parameters (H2).

[Figure: likelihood as a function of the MAP weight w for a simple model H1 and a more complex model H2. (C.M. Bishop, Pattern Recognition and Machine Learning)]


SLIDE 54

Model comparison and hypothesis testing
Application to GWA

◮ Consider an association study.
◮ H0: $p(\mathbf{y} \mid \mathcal{H}_0, \mathbf{X}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \sigma^2\mathbf{I})$ (no association), $\boldsymbol{\theta} = \{\sigma^2\}$
◮ H1: $p(\mathbf{y} \mid \mathcal{H}_1, \mathbf{X}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \sigma^2\mathbf{I})$ (linear association), $\boldsymbol{\theta} = \{\sigma^2, \mathbf{w}\}$
◮ Choosing conjugate priors for $\sigma^2$ and $\mathbf{w}$, the required integrals are tractable in closed form.


SLIDE 57

Model comparison and hypothesis testing
Application to GWA: scoring models

◮ The ratio of the evidences, the Bayes factor, is a common scoring metric to compare two models (here on a log scale):
  $\mathrm{BF} = \ln \frac{p(\mathcal{D} \mid \mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_0)}$

[Figure: LOD/Bayes-factor scores along a region of chromosome 7 around SLC35B4, with the 0.01% FPR threshold indicated.]
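Under the fixed-σ² simplification used in the evidence sketch above (an illustration; the slides instead place conjugate priors on both σ² and w), the log Bayes factor for a single SNP is just the difference of two log evidences. All data and parameter values below are hypothetical.

```python
import numpy as np

def log_evidence_h1(x, y, sigma2, lam):
    """ln p(y | H1): linear effect with w ~ N(0, 1/lam) for a single SNP genotype vector x."""
    N = len(y)
    C = sigma2 * np.eye(N) + np.outer(x, x) / lam
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def log_evidence_h0(y, sigma2):
    """ln p(y | H0): no association, y ~ N(0, sigma2*I)."""
    N = len(y)
    return -0.5 * (N * np.log(2 * np.pi * sigma2) + y @ y / sigma2)

rng = np.random.default_rng(9)
x = rng.integers(0, 3, size=200).astype(float)    # hypothetical genotypes (0/1/2)
y = 0.4 * x + rng.normal(scale=1.0, size=200)     # phenotype with a real effect
y -= y.mean()                                     # centre the phenotype

log_bf = log_evidence_h1(x, y, sigma2=1.0, lam=1.0) - log_evidence_h0(y, sigma2=1.0)
print(log_bf)   # positive: evidence favours the association model H1
```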


SLIDE 59

Model comparison and hypothesis testing
Application to GWA: posterior probability of an association

◮ Bayes factors are useful; however, we would like a probabilistic answer to how certain an association really is.
◮ Posterior probability of H1:
  $p(\mathcal{H}_1 \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{H}_1)\, p(\mathcal{H}_1)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \mathcal{H}_1)\, p(\mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_1)\, p(\mathcal{H}_1) + p(\mathcal{D} \mid \mathcal{H}_0)\, p(\mathcal{H}_0)}$
◮ Here $p(\mathcal{H}_1 \mid \mathcal{D}) + p(\mathcal{H}_0 \mid \mathcal{D}) = 1$, and $p(\mathcal{H}_1)$ is the prior probability of observing a real association.
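Given a log Bayes factor and a prior probability of association, this posterior is a one-line computation. The sketch below (illustration only; the prior values are arbitrary) works on the log-odds scale, which stays numerically stable for large Bayes factors.

```python
import numpy as np

def posterior_h1(log_bf, prior_h1):
    """p(H1 | D) from the log Bayes factor ln[p(D|H1)/p(D|H0)] and the prior p(H1)."""
    log_prior_odds = np.log(prior_h1) - np.log(1.0 - prior_h1)
    # posterior odds = Bayes factor * prior odds; logistic maps log-odds to a probability
    return 1.0 / (1.0 + np.exp(-(log_bf + log_prior_odds)))

print(posterior_h1(log_bf=5.0, prior_h1=1e-4))   # strong BF, but a sceptical prior
print(posterior_h1(log_bf=12.0, prior_h1=1e-4))  # overwhelming evidence
```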


SLIDE 62

Summary
Outline

◮ Motivation
◮ Linear Regression
◮ Bayesian linear regression
◮ Model comparison and hypothesis testing
◮ Summary

SLIDE 63

Summary

◮ Curve fitting and linear regression.
◮ Maximum likelihood and least squares regression are identical.
◮ Construction of features using a mapping φ.
◮ Regularized least squares.
◮ Bayesian linear regression.
◮ Model comparison and Occam's razor.