GWAS IV: Bayesian linear (variance component) models


SLIDE 1

GWAS IV: Bayesian linear (variance component) models

Dr. Oliver Stegle
Christoph Lippert
Prof. Dr. Karsten Borgwardt

Max Planck Institutes Tübingen, Germany
Tübingen, Summer 2011

SLIDE 2

Motivation

Regression

Linear regression:
◮ Making predictions
◮ Comparison of alternative models

Bayesian and regularized regression:
◮ Uncertainty in model parameters
◮ Generalized basis functions

[Figure: regression setting, output Y against input X with a query point x⋆]


SLIDE 5

Motivation

Further reading, useful material

◮ Christopher M. Bishop: Pattern Recognition and Machine Learning [Bishop, 2006]
◮ Sam Roweis: Gaussian identities [Roweis, 1999]

SLIDE 6

Outline

SLIDE 7

Linear Regression II

Outline

SLIDE 8

Linear Regression II

Regression

Noise model and likelihood

◮ Given a dataset D = {x_n, y_n}_{n=1}^N, where x_n = (x_{n,1}, . . . , x_{n,S}) is S-dimensional (for example S SNPs), fit parameters θ of a regressor f with added Gaussian noise:

  y_n = f(x_n; θ) + ε_n,  where p(ε | σ²) = N(ε | 0, σ²).

◮ Equivalent likelihood formulation:

  p(y | X) = ∏_{n=1}^N N(y_n | f(x_n), σ²)

SLIDE 9

Linear Regression II

Regression

Choosing a regressor

◮ Choose f to be linear:

  p(y | X) = ∏_{n=1}^N N(y_n | x_n · θ + c, σ²)

◮ Consider the bias-free case, c = 0; otherwise include an additional column of ones in each x_n.

[Figure: equivalent graphical model]

SLIDE 11

Linear Regression II

Linear Regression

Maximum likelihood

◮ Taking the logarithm, we obtain

  ln p(y | θ, X, σ²) = ∑_{n=1}^N ln N(y_n | x_n · θ, σ²)

                     = −(N/2) ln 2πσ² − (1/2σ²) ∑_{n=1}^N (y_n − x_n · θ)²

  where the last term is the sum of squares.

◮ The likelihood is maximized when the squared error is minimized.
◮ Least squares and maximum likelihood are equivalent.


SLIDE 14

Linear Regression II

Linear Regression and Least Squares

[Figure: least squares fit, the error measures the displacement between each observation y_n and the fitted value f(x_n, w) (C.M. Bishop, Pattern Recognition and Machine Learning)]

  E(θ) = (1/2) ∑_{n=1}^N (y_n − x_n · θ)²

SLIDE 15

Linear Regression II

Linear Regression and Least Squares

◮ Derivative w.r.t. a single weight entry θ_i:

  d/dθ_i ln p(y | θ, σ²) = d/dθ_i [ −(1/2σ²) ∑_{n=1}^N (y_n − x_n · θ)² ]

                         = (1/σ²) ∑_{n=1}^N (y_n − x_n · θ) x_{n,i}

◮ Set the gradient w.r.t. θ to zero:

  ∇_θ ln p(y | θ, σ²) = (1/σ²) ∑_{n=1}^N (y_n − x_n · θ) x_nᵀ = 0

  ⟹ θ_ML = (XᵀX)⁻¹Xᵀy,  where (XᵀX)⁻¹Xᵀ is the pseudo-inverse.

◮ Here, the matrix X is defined as

  X = [ x_{1,1} · · · x_{1,D}
           ⋮     ⋱     ⋮
        x_{N,1} · · · x_{N,D} ]

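To make the closed form concrete, here is a minimal NumPy sketch (simulated data; the dimensions and noise level are arbitrary choices for the example) that recovers θ_ML both by solving the normal equations and via the numerically preferable lstsq route:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate N samples with D features (e.g. D SNPs) and Gaussian noise.
N, D, sigma = 100, 5, 0.5
X = rng.standard_normal((N, D))
theta_true = rng.standard_normal(D)
y = X @ theta_true + sigma * rng.standard_normal(N)

# Maximum likelihood / least squares: theta_ML = (X^T X)^{-1} X^T y.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route via lstsq.
theta_ml_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_ml, theta_ml_lstsq))  # True
```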

SLIDE 18

Linear Regression II

Polynomial Curve Fitting

Motivation

◮ Non-linear relationships.
◮ Multiple SNPs playing a role for a particular phenotype.

[Figure: regression setting, output Y against input X with a query point x⋆]

SLIDE 19

Linear Regression II

Polynomial Curve Fitting

Univariate input x

◮ Use the polynomials up to degree K to construct new features from x:

  f(x, θ) = θ_0 + θ_1 x + θ_2 x² + · · · + θ_K x^K = ∑_{k=0}^K θ_k φ_k(x) = θᵀφ(x),

  where we defined φ(x) = (1, x, x², . . . , x^K).

◮ φ can be any feature mapping.
◮ Possible to show: the feature map φ can be expressed in terms of kernels (kernel trick).

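A short sketch of the feature-map idea, assuming plain NumPy: build φ(x) for a univariate input and fit θ with the same least-squares machinery as before (the sine target and degree K = 5 are arbitrary example choices):

```python
import numpy as np

def poly_features(x, K):
    """Map a univariate input vector x to phi(x) = (1, x, x^2, ..., x^K)."""
    return np.vander(x, N=K + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)  # non-linear target

Phi = poly_features(x, K=5)                      # design matrix, shape (50, 6)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least squares in feature space
y_hat = Phi @ theta
```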

SLIDE 21

Linear Regression II

Polynomial Curve Fitting

Overfitting

◮ The degree of the polynomial is crucial to avoid under- and overfitting.

[Figure: polynomial fits of degree M = 0, 1, 3, and 9 to the same data (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 25

Linear Regression II

Multivariate regression

Polynomial curve fitting:

  f(x, θ) = θ_0 + θ_1 x + · · · + θ_K x^K = ∑_{k=0}^K θ_k φ_k(x) = φ(x) · θ

Multivariate regression (SNPs):

  f(x, θ) = ∑_{s=1}^S θ_s x_s = x · θ

◮ Note: When fitting a single binary SNP genotype x_i, a linear model is most general!


SLIDE 27

Linear Regression II

Regularized Least Squares

◮ Solutions to avoid overfitting:
  1. Intelligently choose the number of dimensions
  2. Regularize the regression weights θ

◮ Quadratically regularized objective function:

  E(θ) = (1/2) ∑_{n=1}^N (y_n − φ(x_n) · θ)²  (squared error)  +  (λ/2) θᵀθ  (regularizer)

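Setting the gradient of this objective to zero gives the closed form θ_ridge = (ΦᵀΦ + λI)⁻¹Φᵀy; a minimal sketch, with arbitrary example data and λ:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize 0.5*||y - Phi @ theta||^2 + 0.5*lam*||theta||^2 in closed form."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((100, 10))
y = Phi[:, 0] + 0.1 * rng.standard_normal(100)
theta_ridge = ridge_fit(Phi, y, lam=1.0)  # shrunk towards zero relative to lam -> 0
```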

SLIDE 29

Linear Regression II

Regularized Least Squares

More general regularizers

◮ More general regularization:

  E(θ) = (1/2) ∑_{n=1}^N (y_n − φ(x_n) · θ)²  (squared error)  +  (λ/2) ∑_{d=1}^D |θ_d|^q  (regularizer)

[Figure: contours of the regularizer for q = 0.5, 1 (Lasso, sparse solutions), 2 (quadratic), and 4 (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 32

Linear Regression II

Loss functions and related methods

◮ Even more general: a general loss function

  E(θ) = (1/2) ∑_{n=1}^N L(y_n − φ(x_n) · θ)  (loss)  +  (λ/2) ∑_{d=1}^D |θ_d|^q  (regularizer)

◮ Many state-of-the-art machine learning methods can be expressed within this framework.
  ◮ Linear Regression: squared loss, squared regularizer.
  ◮ Support Vector Machine: hinge loss, squared regularizer.
  ◮ Lasso: squared loss, L1 regularizer.

◮ Inference: minimize the cost function E(θ), yielding a point estimate for θ.
◮ Q: How to determine q and a suitable loss function?


SLIDE 36

Linear Regression II

Loss functions and related methods

Cross validation: minimization of expected loss

For each candidate model H:
◮ Split the data into K folds
◮ Training-test evaluation for each fold
◮ Assess the average loss on the test sets:

  E_H = (1/K) ∑_{k=1}^K L_k^test

[Figure: K-fold cross validation, the total number of samples split into fold 1, fold 2, fold 3, each fold serving once as test set and otherwise as training set]
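A minimal self-contained sketch of this procedure for the ridge model above (squared test loss, hand-rolled fold splitting; the helper name and all values are illustrative):

```python
import numpy as np

def kfold_expected_loss(Phi, y, lam, K=3, seed=0):
    """Average squared test loss of ridge regression over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        D = Phi.shape[1]
        # Ridge fit on the training fold.
        theta = np.linalg.solve(Phi[train].T @ Phi[train] + lam * np.eye(D),
                                Phi[train].T @ y[train])
        # Squared loss on the held-out fold.
        losses.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(losses)  # E_H = (1/K) sum_k L_k^test
```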

SLIDE 37

Linear Regression II

Probabilistic interpretation

◮ So far: minimization of error functions.
◮ Back to probabilities?

  E(θ) = (1/2) ∑_{n=1}^N (y_n − φ(x_n) · θ)²  (squared error)  +  (λ/2) θᵀθ  (regularizer)

       = − ∑_{n=1}^N ln N(y_n | φ(x_n) · θ, σ²) − ln N(θ | 0, (1/λ) I)   (up to additive constants)

       = − ln p(y | θ, Φ(X), σ²) − ln p(θ)

◮ Most alternative choices of regularizers and loss functions can be mapped to an equivalent probabilistic representation in a similar way.
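A quick numeric check of this correspondence, assuming σ² = 1 so that λ matches the prior precision exactly and the two sides differ only by a θ-independent constant:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
N, D, lam = 20, 3, 2.0
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)

def E(theta):  # regularized squared error
    return 0.5 * np.sum((y - Phi @ theta) ** 2) + 0.5 * lam * theta @ theta

def neg_log_joint(theta):  # -ln p(y | theta) - ln p(theta), with sigma^2 = 1
    return (-norm.logpdf(y, loc=Phi @ theta, scale=1.0).sum()
            - multivariate_normal.logpdf(theta, cov=np.eye(D) / lam))

t1, t2 = rng.standard_normal(D), rng.standard_normal(D)
# The difference is the same constant at any theta:
print(np.isclose(neg_log_joint(t1) - E(t1), neg_log_joint(t2) - E(t2)))  # True
```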

SLIDE 41

Bayesian linear regression

Outline

SLIDE 42

Bayesian linear regression

◮ Likelihood as before:

  p(y | X, θ, σ²) = ∏_{n=1}^N N(y_n | φ(x_n) · θ, σ²)

◮ Define a conjugate prior over θ:

  p(θ) = N(θ | m_0, S_0)


SLIDE 44

Bayesian linear regression

◮ Posterior probability of θ:

  p(θ | y, X, σ²) ∝ ∏_{n=1}^N N(y_n | φ(x_n) · θ, σ²) · N(θ | m_0, S_0)

                  = N(y | Φ(X) · θ, σ²I) · N(θ | m_0, S_0)

                  = N(θ | μ_θ, Σ_θ)

◮ where

  μ_θ = Σ_θ (S_0⁻¹ m_0 + (1/σ²) Φ(X)ᵀy)

  Σ_θ = (S_0⁻¹ + (1/σ²) Φ(X)ᵀΦ(X))⁻¹
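A minimal NumPy sketch of these update equations (the helper name blr_posterior and the example data are illustrative):

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, m0, S0):
    """Posterior N(theta | mu, Sigma) for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    Sigma = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)
    mu = Sigma @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)
    return mu, Sigma

rng = np.random.default_rng(4)
Phi = rng.standard_normal((30, 4))
y = Phi @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.3 * rng.standard_normal(30)

# Ridge-style prior m0 = 0, S0 = (1/lambda) I, as on the next slide.
lam, sigma2 = 1.0, 0.09
mu, Sigma = blr_posterior(Phi, y, sigma2,
                          m0=np.zeros(4), S0=np.eye(4) / lam)
```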

SLIDE 45

Bayesian linear regression

Prior choice

◮ Choice of prior: regularized (ridge) regression

  p(θ) = N(θ | 0, (1/λ) I).

◮ In this case

  p(θ | y, X, σ²) ∝ N(θ | μ_θ, Σ_θ)

  μ_θ = Σ_θ (1/σ²) Φ(X)ᵀy

  Σ_θ = (λI + (1/σ²) Φ(X)ᵀΦ(X))⁻¹

◮ Equivalent to the maximum likelihood estimate for λ → 0!

SLIDE 48

Bayesian linear regression

Example

[Figure: posterior over regression functions after 0, 1, and 20 data points (C.M. Bishop, Pattern Recognition and Machine Learning)]

SLIDE 51

Bayesian linear regression

Making predictions

◮ Prediction for a fixed weight θ̂ at input x⋆ is trivial:

  p(y⋆ | x⋆, θ̂, σ²) = N(y⋆ | φ(x⋆) · θ̂, σ²)

◮ Integrate over θ to take the posterior uncertainty into account:

  p(y⋆ | x⋆, D) = ∫_θ p(y⋆ | x⋆, θ, σ²) p(θ | X, y, σ²)

                = ∫_θ N(y⋆ | φ(x⋆) · θ, σ²) N(θ | μ_θ, Σ_θ)

                = N(y⋆ | φ(x⋆) · μ_θ, σ² + φ(x⋆)ᵀ Σ_θ φ(x⋆))

◮ Key:
  ◮ The prediction is again Gaussian.
  ◮ The predictive variance is increased due to the posterior uncertainty in θ.

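Continuing that sketch, the predictive mean and variance at a new input follow directly from μ_θ and Σ_θ (blr_posterior is the illustrative helper from the earlier block):

```python
import numpy as np

def blr_predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean and variance of y* at features phi(x*)."""
    mean = phi_star @ mu
    var = sigma2 + phi_star @ Sigma @ phi_star  # noise + posterior uncertainty
    return mean, var

# e.g. with mu, Sigma from blr_posterior and phi_star = Phi[0]:
# mean, var = blr_predict(Phi[0], mu, Sigma, sigma2)
```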

SLIDE 54

Model comparison and hypothesis testing

Outline

SLIDE 55

Model comparison and hypothesis testing

Model comparison

Motivation

◮ What degree of polynomial describes the data best?
◮ Is the linear model at all appropriate?
◮ Association testing.

[Figure: genome to phenome association, SNP sequences across individuals and their phenotypes y]

SLIDE 57

Model comparison and hypothesis testing

Bayesian model comparison

◮ How do we choose among alternative models?
◮ Assume we want to choose among models H_0, . . . , H_M for a dataset D.
◮ Posterior probability for a particular model i:

  p(H_i | D) ∝ p(H_i) · p(D | H_i),  where p(D | H_i) is the evidence and p(H_i) the prior.


SLIDE 59

Model comparison and hypothesis testing

Bayesian model comparison

How to calculate the evidence

◮ The evidence is not the model likelihood!

  p(D | H_i) = ∫_Θ dΘ p(D | Θ) p(Θ)   for model parameters Θ.

◮ Remember:

  p(Θ | H_i, D) = p(D | H_i, Θ) p(Θ) / p(D | H_i)

  posterior = likelihood · prior / evidence

SLIDE 61

Model comparison and hypothesis testing

Bayesian model comparison

Occam's razor

◮ The evidence integral penalizes overly complex models.
◮ A model with few parameters and lower maximum likelihood (H1) may win over a model with a peaked likelihood that requires many more parameters (H2).

[Figure: likelihood as a function of the parameter w around w_MAP for models H1 and H2 (C.M. Bishop, Pattern Recognition and Machine Learning)]


SLIDE 63

Model comparison and hypothesis testing

Application to GWA

Relevance of a single SNP

◮ Consider an association study.
◮ H0: no association

  p(y | H0, X, Θ_0) = N(y | 0, σ²I)

  p(D | H0) = ∫_{σ²} N(y | 0, σ²I) p(σ²)

◮ H1: linear association

  p(y | H1, x_i, Θ_1) = N(y | x_i · θ, σ²I)

  p(D | H1) = ∫_{σ²,θ} N(y | x_i · θ, σ²I) p(σ²) p(θ)

◮ Depending on the choice of priors, p(σ²) and p(θ), the required integrals are often tractable in closed form.


SLIDE 66

Model comparison and hypothesis testing

Application to GWA

Scoring models

◮ Similar to likelihood ratios, the ratio of the evidences, the Bayes factor, can be used to score alternative models:

  BF = ln [ p(D | H1) / p(D | H0) ]

[Figure: LOD scores and Bayes factors along a region of chromosome 7 near SLC35B4, with the 0.01% FPR threshold marked for each score]
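As a sketch of how such a score can be computed when both evidences are closed-form Gaussians (here the noise variance is held fixed rather than integrated over, and the weight gets a Gaussian prior as in the variance-component slides below, so this simplifies the integrals above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_bayes_factor(y, x, sigma2, sigma2_g):
    """ln p(y|H1)/p(y|H0), with theta ~ N(0, sigma2_g) marginalized out."""
    N = len(y)
    cov0 = sigma2 * np.eye(N)                # H0: no association
    cov1 = sigma2_g * np.outer(x, x) + cov0  # H1: linear association in SNP x
    return (multivariate_normal.logpdf(y, cov=cov1)
            - multivariate_normal.logpdf(y, cov=cov0))
```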

SLIDE 68

Model comparison and hypothesis testing

Application to GWA

Posterior probability of an association

◮ Bayes factors are useful; however, we would like a probabilistic answer for how certain an association really is.
◮ Posterior probability of H1:

  p(H1 | D) = p(D | H1) p(H1) / p(D)

            = p(D | H1) p(H1) / [ p(D | H1) p(H1) + p(D | H0) p(H0) ]

◮ p(H1 | D) + p(H0 | D) = 1; p(H1) is the prior probability of observing a real association.


SLIDE 71

Model comparison and hypothesis testing

Bayes factor versus likelihood ratio

Bayes factor:
◮ Models of different complexity can be objectively compared.
◮ Statistical significance as the posterior probability of a model.
◮ Typically hard to compute.

Likelihood ratio:
◮ The likelihood ratio scales with the number of parameters.
◮ Likelihood ratios have a known null distribution, yielding p-values.
◮ Often easy to compute.


SLIDE 73

Model comparison and hypothesis testing

Marginal likelihood of variance component models

◮ Consider a linear model, accounting for a set of measured SNPs X:

  p(y | X, θ, σ²) = N(y | ∑_{s=1}^S x_s θ_s, σ²I)

◮ Choose an identical Gaussian prior for all weights:

  p(θ) = ∏_{s=1}^S N(θ_s | 0, σ_g²)

◮ Marginal likelihood:

  p(y | X, σ², σ_g²) = ∫_θ N(y | Xθ, σ²I) N(θ | 0, σ_g²I)

                     = N(y | 0, σ_g² XXᵀ + σ²I)

◮ The number of hyperparameters is independent of the number of SNPs.
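A minimal NumPy/SciPy sketch of this marginal likelihood, which is all that is needed to fit σ² and σ_g². It uses the common XXᵀ/S normalization of the relatedness matrix (which only rescales σ_g²), and a coarse grid search stands in for a proper optimizer; the data and grid values are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def vc_log_marginal(y, K_g, sigma2, sigma2_g):
    """ln N(y | 0, sigma2_g * K_g + sigma2 * I)."""
    K = sigma2_g * K_g + sigma2 * np.eye(len(y))
    return multivariate_normal.logpdf(y, cov=K)

rng = np.random.default_rng(5)
N, S = 50, 200
X = rng.integers(0, 3, size=(N, S)).astype(float)  # toy 0/1/2 SNP genotypes
X = (X - X.mean(0)) / X.std(0)                     # standardize each SNP
K_g = X @ X.T / S                                  # genetic relatedness matrix
y = X @ rng.normal(0, 0.1, size=S) + rng.standard_normal(N)
y -= y.mean()

# Fit the two hyperparameters by a coarse grid search.
grid = [0.1, 0.25, 0.5, 1.0, 2.0]
s2, s2g = max(((a, b) for a in grid for b in grid),
              key=lambda p: vc_log_marginal(y, K_g, *p))
```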

SLIDE 77

Model comparison and hypothesis testing

Marginal likelihood of variance component models

Application to GWAs

The missing heritability paradox

◮ Complex traits are regulated by a large number of small effects.
◮ Human height: the best single SNP explains little variance.
◮ But: the parents are highly predictive for the height of the child!

SLIDE 78

Model comparison and hypothesis testing

Marginal likelihood of variance component models

Application to GWAs

Multivariate additive models for complex traits

◮ Multivariate model over causal SNPs:

  p(y | X, θ, σ²) = N(y | ∑_{s ∈ causal} x_s θ_s, σ²I)

◮ Common variance prior for causal SNPs: p(θ_s) = N(θ_s | 0, σ_g²)
◮ Marginalize out the weights:

  p(y | X, σ_g², σ_e²) = N(y | 0, σ_g² ∑_{s ∈ causal} x_s x_sᵀ + σ_e² I)

◮ Which SNPs are causal? Approximation: consider all SNPs [Yang et al., 2011]:

  p(y | X, σ_g², σ_e²) = N(y | 0, σ_g² XXᵀ + σ_e² I)


SLIDE 82

Model comparison and hypothesis testing

Marginal likelihood of variance component models

Application to GWAs

◮ Approximate variance model:

  p(y | X, σ_g², σ_e²) = N(y | 0, σ_g² XXᵀ + σ_e² I)

◮ Genetic variance σ_g² across chromosomes.
◮ Heritability:

  h² = σ_g² / (σ_g² + σ_e²)

[Yang et al., 2011]
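Continuing the grid-search sketch above, the heritability estimate is then a one-liner (s2 and s2g denote the illustrative fitted noise and genetic variances from that sketch):

```python
# s2, s2g: noise and genetic variances fitted in the earlier sketch.
h2 = s2g / (s2g + s2)  # heritability: share of variance explained by genetics
```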

SLIDE 85

Summary

Outline

SLIDE 86

Summary

◮ Generalized linear models for curve fitting and multivariate regression.
◮ Maximum likelihood and least squares regression are identical.
◮ Construction of features using a mapping φ.
◮ Regularized least squares and other models that correspond to different choices of loss functions.
◮ Bayesian linear regression.
◮ Model comparison and Occam's razor.
◮ Variance component models in GWAs.

SLIDE 87

Summary

Tasks

◮ Prove that the product of two Gaussians is Gaussian distributed.
◮ Try to understand the convolution formula of Gaussian random variables.

SLIDE 88

References

C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

S. Roweis. Gaussian identities. Technical report, 1999. URL http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf.

J. Yang, T. Manolio, L. Pasquale, E. Boerwinkle, N. Caporaso, J. Cunningham, M. de Andrade, B. Feenstra, E. Feingold, M. Hayes, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43(6):519–525, 2011.