GWAS IV: Bayesian linear (variance component) models


SLIDE 1

GWAS IV: Bayesian linear (variance component) models

Dr. Oliver Stegle
Christoph Lippert
Prof. Dr. Karsten Borgwardt

Max Planck Institutes Tübingen, Germany
Tübingen, Summer 2011

SLIDE 2

Motivation

Regression

Linear regression:
◮ Making predictions
◮ Comparison of alternative models

Bayesian and regularized regression:
◮ Uncertainty in model parameters
◮ Generalized basis functions

[Figure: regression setting, output Y against input X with a query point x⋆]


SLIDE 5

Motivation

Further reading, useful material

◮ Christopher M. Bishop: Pattern Recognition and Machine Learning [Bishop, 2006]
◮ Sam Roweis: Gaussian identities [Roweis, 1999]

SLIDE 6

Outline

SLIDE 7

Linear Regression II

Outline

SLIDE 8

Linear Regression II

Regression

Noise model and likelihood

◮ Given a dataset D = {x_n, y_n}_{n=1}^N, where x_n = (x_{n,1}, . . . , x_{n,S}) is S-dimensional (for example S SNPs), fit parameters θ of a regressor f with added Gaussian noise:

  y_n = f(x_n; θ) + ε_n,  where p(ε | σ²) = N(ε | 0, σ²).

◮ Equivalent likelihood formulation:

  p(y | X) = ∏_{n=1}^N N(y_n | f(x_n), σ²)

SLIDE 9

Linear Regression II

Regression

Choosing a regressor

◮ Choose f to be linear:

  p(y | X) = ∏_{n=1}^N N(y_n | x_n · θ + c, σ²)

◮ Consider the bias-free case, c = 0; otherwise include an additional column of ones in each x_n.

[Figure: equivalent graphical model]

SLIDE 11

Linear Regression II

Linear Regression

Maximum likelihood

◮ Taking the logarithm, we obtain

  ln p(y | θ, X, σ²) = ∑_{n=1}^N ln N(y_n | x_n · θ, σ²)

                     = −(N/2) ln 2πσ² − (1/2σ²) ∑_{n=1}^N (y_n − x_n · θ)²

  where the last term is the sum of squares.

◮ The likelihood is maximized when the squared error is minimized.
◮ Least squares and maximum likelihood are equivalent.


SLIDE 14

Linear Regression II

Linear Regression and Least Squares

[Figure: least squares fit, the error measures the displacement between each observation y_n and the fitted value f(x_n, w) (C.M. Bishop, Pattern Recognition and Machine Learning)]

  E(θ) = (1/2) ∑_{n=1}^N (y_n − x_n · θ)²

SLIDE 15

Linear Regression II

Linear Regression and Least Squares

◮ Derivative w.r.t. a single weight entry θ_i:

  d/dθ_i ln p(y | θ, σ²) = d/dθ_i [ −(1/2σ²) ∑_{n=1}^N (y_n − x_n · θ)² ]

                         = (1/σ²) ∑_{n=1}^N (y_n − x_n · θ) x_{n,i}

◮ Set the gradient w.r.t. θ to zero:

  ∇_θ ln p(y | θ, σ²) = (1/σ²) ∑_{n=1}^N (y_n − x_n · θ) x_nᵀ = 0

  ⟹ θ_ML = (XᵀX)⁻¹Xᵀy,  where (XᵀX)⁻¹Xᵀ is the pseudo-inverse.

◮ Here, the matrix X is defined as

  X = [ x_{1,1} · · · x_{1,D}
           ⋮     ⋱     ⋮
        x_{N,1} · · · x_{N,D} ]

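To make the closed form concrete, here is a minimal NumPy sketch (simulated data; the dimensions and noise level are arbitrary choices for the example) that recovers θ_ML both by solving the normal equations and via the numerically preferable lstsq route:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate N samples with D features (e.g. D SNPs) and Gaussian noise.
N, D, sigma = 100, 5, 0.5
X = rng.standard_normal((N, D))
theta_true = rng.standard_normal(D)
y = X @ theta_true + sigma * rng.standard_normal(N)

# Maximum likelihood / least squares: theta_ML = (X^T X)^{-1} X^T y.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route via lstsq.
theta_ml_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_ml, theta_ml_lstsq))  # True
```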

SLIDE 18

Linear Regression II

Polynomial Curve Fitting

Motivation

◮ Non-linear relationships.
◮ Multiple SNPs playing a role for a particular phenotype.

[Figure: regression setting, output Y against input X with a query point x⋆]

SLIDE 19

Linear Regression II

Polynomial Curve Fitting

Univariate input x

◮ Use the polynomials up to degree K to construct new features from x:

  f(x, θ) = θ_0 + θ_1 x + θ_2 x² + · · · + θ_K x^K = ∑_{k=0}^K θ_k φ_k(x) = θᵀφ(x),

  where we defined φ(x) = (1, x, x², . . . , x^K).

◮ φ can be any feature mapping.
◮ Possible to show: the feature map φ can be expressed in terms of kernels (kernel trick).

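A short sketch of the feature-map idea, assuming plain NumPy: build φ(x) for a univariate input and fit θ with the same least-squares machinery as before (the sine target and degree K = 5 are arbitrary example choices):

```python
import numpy as np

def poly_features(x, K):
    """Map a univariate input vector x to phi(x) = (1, x, x^2, ..., x^K)."""
    return np.vander(x, N=K + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)  # non-linear target

Phi = poly_features(x, K=5)                      # design matrix, shape (50, 6)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least squares in feature space
y_hat = Phi @ theta
```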

SLIDE 21

Linear Regression II

Polynomial Curve Fitting

Overfitting

◮ The degree of the polynomial is crucial to avoid under- and overfitting.

[Figure: polynomial fits of degree M = 0, 1, 3, and 9 to the same data (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 25

Linear Regression II

Multivariate regression

Polynomial curve fitting:

  f(x, θ) = θ_0 + θ_1 x + · · · + θ_K x^K = ∑_{k=0}^K θ_k φ_k(x) = φ(x) · θ

Multivariate regression (SNPs):

  f(x, θ) = ∑_{s=1}^S θ_s x_s = x · θ

◮ Note: When fitting a single binary SNP genotype x_i, a linear model is most general!


SLIDE 27

Linear Regression II

Regularized Least Squares

◮ Solutions to avoid overfitting:
  1. Intelligently choose the number of dimensions
  2. Regularize the regression weights θ

◮ Quadratically regularized objective function:

  E(θ) = (1/2) ∑_{n=1}^N (y_n − φ(x_n) · θ)²  (squared error)  +  (λ/2) θᵀθ  (regularizer)

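Setting the gradient of this objective to zero gives the closed form θ_ridge = (ΦᵀΦ + λI)⁻¹Φᵀy; a minimal sketch, with arbitrary example data and λ:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize 0.5*||y - Phi @ theta||^2 + 0.5*lam*||theta||^2 in closed form."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((100, 10))
y = Phi[:, 0] + 0.1 * rng.standard_normal(100)
theta_ridge = ridge_fit(Phi, y, lam=1.0)  # shrunk towards zero relative to lam -> 0
```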

SLIDE 29

Linear Regression II

Regularized Least Squares

More general regularizers

◮ More general regularization:

  E(θ) = (1/2) ∑_{n=1}^N (y_n − φ(x_n) · θ)²  (squared error)  +  (λ/2) ∑_{d=1}^D |θ_d|^q  (regularizer)

[Figure: contours of the regularizer for q = 0.5, 1 (Lasso, sparse solutions), 2 (quadratic), and 4 (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 32

Linear Regression II

Loss functions and related methods

◮ Even more general: a general loss function

  E(θ) = (1/2) ∑_{n=1}^N L(y_n − φ(x_n) · θ)  (loss)  +  (λ/2) ∑_{d=1}^D |θ_d|^q  (regularizer)

◮ Many state-of-the-art machine learning methods can be expressed within this framework.
  ◮ Linear Regression: squared loss, squared regularizer.
  ◮ Support Vector Machine: hinge loss, squared regularizer.
  ◮ Lasso: squared loss, L1 regularizer.

◮ Inference: minimize the cost function E(θ), yielding a point estimate for θ.
◮ Q: How to determine q and a suitable loss function?


SLIDE 36

Linear Regression II

Loss functions and related methods

Cross validation: minimization of expected loss

For each candidate model H:
◮ Split the data into K folds
◮ Training-test evaluation for each fold
◮ Assess the average loss on the test sets:

  E_H = (1/K) ∑_{k=1}^K L_k^test

[Figure: K-fold cross validation, the total number of samples split into fold 1, fold 2, fold 3, each fold serving once as test set and otherwise as training set]
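A minimal self-contained sketch of this procedure for the ridge model above (squared test loss, hand-rolled fold splitting; the helper name and all values are illustrative):

```python
import numpy as np

def kfold_expected_loss(Phi, y, lam, K=3, seed=0):
    """Average squared test loss of ridge regression over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        D = Phi.shape[1]
        # Ridge fit on the training fold.
        theta = np.linalg.solve(Phi[train].T @ Phi[train] + lam * np.eye(D),
                                Phi[train].T @ y[train])
        # Squared loss on the held-out fold.
        losses.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(losses)  # E_H = (1/K) sum_k L_k^test
```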

SLIDE 37

Linear Regression II

Probabilistic interpretation

◮ So far: minimization of error functions.
◮ Back to probabilities?

  E(θ) = (1/2) ∑_{n=1}^N (y_n − φ(x_n) · θ)²  (squared error)  +  (λ/2) θᵀθ  (regularizer)

       = − ∑_{n=1}^N ln N(y_n | φ(x_n) · θ, σ²) − ln N(θ | 0, (1/λ) I)   (up to additive constants)

       = − ln p(y | θ, Φ(X), σ²) − ln p(θ)

◮ Most alternative choices of regularizers and loss functions can be mapped to an equivalent probabilistic representation in a similar way.
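A quick numeric check of this correspondence, assuming σ² = 1 so that λ matches the prior precision exactly and the two sides differ only by a θ-independent constant:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
N, D, lam = 20, 3, 2.0
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)

def E(theta):  # regularized squared error
    return 0.5 * np.sum((y - Phi @ theta) ** 2) + 0.5 * lam * theta @ theta

def neg_log_joint(theta):  # -ln p(y | theta) - ln p(theta), with sigma^2 = 1
    return (-norm.logpdf(y, loc=Phi @ theta, scale=1.0).sum()
            - multivariate_normal.logpdf(theta, cov=np.eye(D) / lam))

t1, t2 = rng.standard_normal(D), rng.standard_normal(D)
# The difference is the same constant at any theta:
print(np.isclose(neg_log_joint(t1) - E(t1), neg_log_joint(t2) - E(t2)))  # True
```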

SLIDE 41

Bayesian linear regression

Outline

SLIDE 42

Bayesian linear regression

◮ Likelihood as before:

  p(y | X, θ, σ²) = ∏_{n=1}^N N(y_n | φ(x_n) · θ, σ²)

◮ Define a conjugate prior over θ:

  p(θ) = N(θ | m_0, S_0)


SLIDE 44

Bayesian linear regression

◮ Posterior probability of θ:

  p(θ | y, X, σ²) ∝ ∏_{n=1}^N N(y_n | φ(x_n) · θ, σ²) · N(θ | m_0, S_0)

                  = N(y | Φ(X) · θ, σ²I) · N(θ | m_0, S_0)

                  = N(θ | μ_θ, Σ_θ)

◮ where

  μ_θ = Σ_θ (S_0⁻¹ m_0 + (1/σ²) Φ(X)ᵀy)

  Σ_θ = (S_0⁻¹ + (1/σ²) Φ(X)ᵀΦ(X))⁻¹
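A minimal NumPy sketch of these update equations (the helper name blr_posterior and the example data are illustrative):

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, m0, S0):
    """Posterior N(theta | mu, Sigma) for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    Sigma = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)
    mu = Sigma @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)
    return mu, Sigma

rng = np.random.default_rng(4)
Phi = rng.standard_normal((30, 4))
y = Phi @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.3 * rng.standard_normal(30)

# Ridge-style prior m0 = 0, S0 = (1/lambda) I, as on the next slide.
lam, sigma2 = 1.0, 0.09
mu, Sigma = blr_posterior(Phi, y, sigma2,
                          m0=np.zeros(4), S0=np.eye(4) / lam)
```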

SLIDE 45

Bayesian linear regression

Prior choice

◮ Choice of prior: regularized (ridge) regression

  p(θ) = N(θ | 0, (1/λ) I).

◮ In this case

  p(θ | y, X, σ²) ∝ N(θ | μ_θ, Σ_θ)

  μ_θ = Σ_θ (1/σ²) Φ(X)ᵀy

  Σ_θ = (λI + (1/σ²) Φ(X)ᵀΦ(X))⁻¹

◮ Equivalent to the maximum likelihood estimate for λ → 0!

SLIDE 48

Bayesian linear regression

Example

[Figure: posterior over regression functions after 0, 1, and 20 data points (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 51

Bayesian linear regression

Making predictions

◮ Prediction for a fixed weight θ̂ at input x⋆ is trivial:

  p(y⋆ | x⋆, θ̂, σ²) = N(y⋆ | φ(x⋆) · θ̂, σ²)

◮ Integrate over θ to take the posterior uncertainty into account:

  p(y⋆ | x⋆, D) = ∫_θ p(y⋆ | x⋆, θ, σ²) p(θ | X, y, σ²)

                = ∫_θ N(y⋆ | φ(x⋆) · θ, σ²) N(θ | μ_θ, Σ_θ)

                = N(y⋆ | φ(x⋆) · μ_θ, σ² + φ(x⋆)ᵀ Σ_θ φ(x⋆))

◮ Key:
  ◮ The prediction is again Gaussian.
  ◮ The predictive variance is increased due to the posterior uncertainty in θ.

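Continuing that sketch, the predictive mean and variance at a new input follow directly from μ_θ and Σ_θ (blr_posterior is the illustrative helper from the earlier block):

```python
import numpy as np

def blr_predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean and variance of y* at features phi(x*)."""
    mean = phi_star @ mu
    var = sigma2 + phi_star @ Sigma @ phi_star  # noise + posterior uncertainty
    return mean, var

# e.g. with mu, Sigma from blr_posterior and phi_star = Phi[0]:
# mean, var = blr_predict(Phi[0], mu, Sigma, sigma2)
```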

SLIDE 54

Model comparison and hypothesis testing

Outline

SLIDE 55

Model comparison and hypothesis testing

Model comparison

Motivation

◮ What degree of polynomial describes the data best?
◮ Is the linear model at all appropriate?
◮ Association testing.

[Figure: genome to phenome association, SNP sequences across individuals and their phenotypes y]

SLIDE 57

Model comparison and hypothesis testing

Bayesian model comparison

◮ How do we choose among alternative models?
◮ Assume we want to choose among models H_0, . . . , H_M for a dataset D.
◮ Posterior probability for a particular model i:

  p(H_i | D) ∝ p(H_i) · p(D | H_i),  where p(D | H_i) is the evidence and p(H_i) the prior.


SLIDE 59

Model comparison and hypothesis testing

Bayesian model comparison

How to calculate the evidence

◮ The evidence is not the model likelihood!

  p(D | H_i) = ∫_Θ dΘ p(D | Θ) p(Θ)   for model parameters Θ.

◮ Remember:

  p(Θ | H_i, D) = p(D | H_i, Θ) p(Θ) / p(D | H_i)

  posterior = likelihood · prior / evidence

SLIDE 61

Model comparison and hypothesis testing

Bayesian model comparison

Occam's razor

◮ The evidence integral penalizes overly complex models.
◮ A model with few parameters and lower maximum likelihood (H1) may win over a model with a peaked likelihood that requires many more parameters (H2).

[Figure: likelihood as a function of the parameter w around w_MAP for models H1 and H2 (C.M. Bishop, Pattern Recognition and Machine Learning)]


SLIDE 63

Model comparison and hypothesis testing

Application to GWA

Relevance of a single SNP

◮ Consider an association study.
◮ H0: no association

  p(y | H0, X, Θ_0) = N(y | 0, σ²I)

  p(D | H0) = ∫_{σ²} N(y | 0, σ²I) p(σ²)

◮ H1: linear association

  p(y | H1, x_i, Θ_1) = N(y | x_i · θ, σ²I)

  p(D | H1) = ∫_{σ²,θ} N(y | x_i · θ, σ²I) p(σ²) p(θ)

◮ Depending on the choice of priors, p(σ²) and p(θ), the required integrals are often tractable in closed form.


SLIDE 66

Model comparison and hypothesis testing

Application to GWA

Scoring models

◮ Similar to likelihood ratios, the ratio of the evidences, the Bayes factor, can be used to score alternative models:

  BF = ln [ p(D | H1) / p(D | H0) ]

[Figure: LOD scores and Bayes factors along a region of chromosome 7 near SLC35B4, with the 0.01% FPR threshold marked for each score]
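As a sketch of how such a score can be computed when both evidences are closed-form Gaussians (here the noise variance is held fixed rather than integrated over, and the weight gets a Gaussian prior as in the variance-component slides below, so this simplifies the integrals above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_bayes_factor(y, x, sigma2, sigma2_g):
    """ln p(y|H1)/p(y|H0), with theta ~ N(0, sigma2_g) marginalized out."""
    N = len(y)
    cov0 = sigma2 * np.eye(N)                # H0: no association
    cov1 = sigma2_g * np.outer(x, x) + cov0  # H1: linear association in SNP x
    return (multivariate_normal.logpdf(y, cov=cov1)
            - multivariate_normal.logpdf(y, cov=cov0))
```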

SLIDE 68

Model comparison and hypothesis testing

Application to GWA

Posterior probability of an association

◮ Bayes factors are useful; however, we would like a probabilistic answer for how certain an association really is.
◮ Posterior probability of H1:

  p(H1 | D) = p(D | H1) p(H1) / p(D)

            = p(D | H1) p(H1) / [ p(D | H1) p(H1) + p(D | H0) p(H0) ]

◮ p(H1 | D) + p(H0 | D) = 1; p(H1) is the prior probability of observing a real association.


SLIDE 71

Model comparison and hypothesis testing

Bayes factor versus likelihood ratio

Bayes factor:
◮ Models of different complexity can be objectively compared.
◮ Statistical significance as the posterior probability of a model.
◮ Typically hard to compute.

Likelihood ratio:
◮ The likelihood ratio scales with the number of parameters.
◮ Likelihood ratios have a known null distribution, yielding p-values.
◮ Often easy to compute.


SLIDE 73

Model comparison and hypothesis testing

Marginal likelihood of variance component models

◮ Consider a linear model, accounting for a set of measured SNPs X:

  p(y | X, θ, σ²) = N(y | ∑_{s=1}^S x_s θ_s, σ²I)

◮ Choose an identical Gaussian prior for all weights:

  p(θ) = ∏_{s=1}^S N(θ_s | 0, σ_g²)

◮ Marginal likelihood:

  p(y | X, σ², σ_g²) = ∫_θ N(y | Xθ, σ²I) N(θ | 0, σ_g²I)

                     = N(y | 0, σ_g² XXᵀ + σ²I)

◮ The number of hyperparameters is independent of the number of SNPs.
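A minimal NumPy/SciPy sketch of this marginal likelihood, which is all that is needed to fit σ² and σ_g². It uses the common XXᵀ/S normalization of the relatedness matrix (which only rescales σ_g²), and a coarse grid search stands in for a proper optimizer; the data and grid values are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def vc_log_marginal(y, K_g, sigma2, sigma2_g):
    """ln N(y | 0, sigma2_g * K_g + sigma2 * I)."""
    K = sigma2_g * K_g + sigma2 * np.eye(len(y))
    return multivariate_normal.logpdf(y, cov=K)

rng = np.random.default_rng(5)
N, S = 50, 200
X = rng.integers(0, 3, size=(N, S)).astype(float)  # toy 0/1/2 SNP genotypes
X = (X - X.mean(0)) / X.std(0)                     # standardize each SNP
K_g = X @ X.T / S                                  # genetic relatedness matrix
y = X @ rng.normal(0, 0.1, size=S) + rng.standard_normal(N)
y -= y.mean()

# Fit the two hyperparameters by a coarse grid search.
grid = [0.1, 0.25, 0.5, 1.0, 2.0]
s2, s2g = max(((a, b) for a in grid for b in grid),
              key=lambda p: vc_log_marginal(y, K_g, *p))
```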

SLIDE 77

Model comparison and hypothesis testing

Marginal likelihood of variance component models

Application to GWAs

The missing heritability paradox

◮ Complex traits are regulated by a large number of small effects.
◮ Human height: the best single SNP explains little variance.
◮ But: the parents are highly predictive for the height of the child!

SLIDE 78

Model comparison and hypothesis testing

Marginal likelihood of variance component models

Application to GWAs

Multivariate additive models for complex traits

◮ Multivariate model over causal SNPs:

  p(y | X, θ, σ²) = N(y | ∑_{s ∈ causal} x_s θ_s, σ²I)

◮ Common variance prior for causal SNPs: p(θ_s) = N(θ_s | 0, σ_g²)
◮ Marginalize out the weights:

  p(y | X, σ_g², σ_e²) = N(y | 0, σ_g² ∑_{s ∈ causal} x_s x_sᵀ + σ_e² I)

◮ Which SNPs are causal? Approximation: consider all SNPs [Yang et al., 2011]:

  p(y | X, σ_g², σ_e²) = N(y | 0, σ_g² XXᵀ + σ_e² I)


SLIDE 82

Model comparison and hypothesis testing

Marginal likelihood of variance component models

Application to GWAs

◮ Approximate variance model:

  p(y | X, σ_g², σ_e²) = N(y | 0, σ_g² XXᵀ + σ_e² I)

◮ Genetic variance σ_g² across chromosomes.
◮ Heritability:

  h² = σ_g² / (σ_g² + σ_e²)

[Yang et al., 2011]
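Continuing the grid-search sketch above, the heritability estimate is then a one-liner (s2 and s2g denote the illustrative fitted noise and genetic variances from that sketch):

```python
# s2, s2g: noise and genetic variances fitted in the earlier sketch.
h2 = s2g / (s2g + s2)  # heritability: share of variance explained by genetics
```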

SLIDE 85

Summary

Outline

SLIDE 86

Summary

◮ Generalized linear models for curve fitting and multivariate regression.
◮ Maximum likelihood and least squares regression are identical.
◮ Construction of features using a mapping φ.
◮ Regularized least squares and other models that correspond to different choices of loss functions.
◮ Bayesian linear regression.
◮ Model comparison and Occam's razor.
◮ Variance component models in GWAs.

SLIDE 87

Summary

Tasks

◮ Prove that the product of two Gaussians is Gaussian distributed.
◮ Try to understand the convolution formula of Gaussian random variables.

SLIDE 88

References

C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

S. Roweis. Gaussian identities. Technical report, 1999. URL http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf.

J. Yang, T. Manolio, L. Pasquale, E. Boerwinkle, N. Caporaso, J. Cunningham, M. de Andrade, B. Feenstra, E. Feingold, M. Hayes, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43(6):519–525, 2011.