SLIDE 1

Analysing Gene Expression Data Using Gaussian Processes

Lorenz Wernisch

School of Crystallography, Birkbeck College, London

SLIDE 2

Overview

• Gene regulatory networks, microarrays
• Time-series analysis by linear regression
• Bayesian inference, Occam's razor
• Extension to nonlinear models: Gaussian processes
• Applications
• Filtering with Gaussian processes

SLIDE 3

Gene Regulatory Networks

[Diagram: gene network with activating (+) and repressing (−) edges]

• Gene expression levels depend on external stimuli and on the activity of other genes (transcription factors)
• Microarrays measure the mRNA levels of genes
• Construction of gene networks from microarray data

SLIDE 4
  • A. thaliana: APRR family

[Plot: log expression levels vs. time/hrs (26 to 74) for APRR9, APRR7, APRR5, APRR3, TOC1/APRR1]

• Time series of A. thaliana under constant light
• 13 time points, every 4 hours from 26 to 74 hrs
• Data by Kieron Edwards and Andrew Millar
• APRR family: possible modulators for light sensitivity of the main circadian clock

SLIDE 5

Networks from time-series data

[Diagram: dependency graph of genes A, B, C unrolled over time slices t = 0, 1, 2, …]

• A static graph representing dependencies between genes has cycles
• Cycles unrolled in time yield an acyclic graph
• The network topology is repeated over time slices

SLIDE 6

Linear time-series model

$$x_t = \Phi x_{t-1} + \mu + w_t$$

• $x_t$ is the $N$-vector of RNA levels at time $t$ (of $N$ genes)
• $w_t$ is the $N$-vector of biological noise added at $t$
• $\mu$ is the $N$-vector of constant trend, ie constitutive expression

If there is no constant trend ($\mu = 0$), $\Phi$ can be estimated by standard regression:

$$\Phi' = (X_{t-1} X_{t-1}')^{-1} X_{t-1} X_t'$$

where $X_t$ and $X_{t-1}$ are $N \times (T-1)$ matrices with the time vectors $x_2, \ldots, x_T$ and $x_1, \ldots, x_{T-1}$ as columns
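A minimal numpy sketch of this regression step; the function name `estimate_phi` and the one-column-per-time-point layout are assumptions for illustration:

```python
import numpy as np

def estimate_phi(X):
    """Least-squares estimate of Phi in x_t = Phi x_{t-1} + w_t (mu = 0).

    X: (N, T) array of expression levels, one column per time point.
    Implements Phi' = (X_{t-1} X_{t-1}')^{-1} X_{t-1} X_t' from the slide.
    """
    X_prev = X[:, :-1]  # columns x_1, ..., x_{T-1}
    X_next = X[:, 1:]   # columns x_2, ..., x_T
    # Solve the normal equations rather than forming the inverse explicitly
    Phi_t = np.linalg.solve(X_prev @ X_prev.T, X_prev @ X_next.T)
    return Phi_t.T      # transpose Phi' to obtain Phi itself
```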

SLIDE 7

Estimating matrix for APRR family

Estimation by standard (least squares) regression:

        APRR9   APRR7   APRR5   APRR3   TOC1
APRR9   −0.59   −0.06    0.78    0.39   0.48
APRR7    0.56    0.35    0.34    0.29   0.21
APRR5   −0.80    0.15   −0.26    0.46   0.43
APRR3   −0.34   −0.94   −0.12   −0.13   0.05
TOC1    −0.11   −0.05    0.66    0.46   0.30

• Problem: each gene is connected to every other gene
• One could test for significance of the nonzero parameters, but this raises the usual problems of significance tests: significance levels, multiple testing, …

SLIDE 8

Bayesian models are simple

[Plot: data probability over data space for a simple and a complex model]

• Automatic complexity control, Occam's razor: a complex model covers many data sets, so each gets small probability; a simple model covers few data sets, so each gets large probability [MacKay, Neal]
• Automatic relevance determination: assume a Gaussian distribution for each matrix entry $a_{ij}$ with variance $\sigma^2_{ij}$ as a free parameter, integrate out the $a_{ij}$, and maximize $P(D \mid \text{model}, \{\sigma^2_{ij}\})$ [RVMs, Tipping]

SLIDE 9

Linear regression framework

$$t = \Phi w + \epsilon$$

Probability of the data given the parameters (likelihood):

$$p(t \mid w, \sigma^2) = \frac{1}{(2\pi)^{N/2} \sigma^N} \exp\left(-\frac{|t - \Phi w|^2}{2\sigma^2}\right)$$

Gaussian prior on the coefficients (weights) $w$:

$$p(w \mid \alpha) = (2\pi)^{-M/2} \prod_{m=1}^{M} \alpha_m^{1/2} \exp\left(-\frac{\alpha_m w_m^2}{2}\right)$$

where $\alpha_m$ is the precision (the inverse variance $1/\sigma_m^2$)
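A small numpy sketch of these two densities on the log scale (function names are illustrative):

```python
import numpy as np

def log_likelihood(t, Phi, w, sigma2):
    """log p(t | w, sigma^2) for the model t = Phi w + eps, eps ~ N(0, sigma^2 I)."""
    N = len(t)
    r = t - Phi @ w                 # residual vector
    return -0.5 * (N * np.log(2 * np.pi * sigma2) + r @ r / sigma2)

def log_prior(w, alpha):
    """log p(w | alpha): independent zero-mean Gaussians with precisions alpha_m."""
    return 0.5 * np.sum(np.log(alpha / (2 * np.pi)) - alpha * w**2)
```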

SLIDE 10

Maximum likelihood type II

Integrating out w:

$$p(t \mid \alpha, \sigma^2) = \frac{1}{(2\pi)^{N/2} |C|^{1/2}} \exp\left(-\tfrac{1}{2}\, t' C^{-1} t\right), \qquad C = \sigma^2 I + \Phi A^{-1} \Phi', \quad A = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$$

• Maximum likelihood estimation of the hyperparameters $\alpha$ by maximizing $p(t \mid \alpha, \sigma^2)$ (type II ML) brings Occam's razor to bear
• Tipping et al. suggest analytical solutions for iterative optimization, optimizing each $\alpha_i$ in turn
• Direct maximization, eg by conjugate gradients, seems to be at least as efficient
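A minimal numpy/scipy sketch of type II ML; it optimizes all hyperparameters jointly by conjugate gradients with numerical gradients, rather than the per-$\alpha_i$ analytical updates of Tipping et al., and the function names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(log_params, Phi, t):
    """-log p(t | alpha, sigma^2) with C = sigma^2 I + Phi A^{-1} Phi',
    A = diag(alpha). Hyperparameters live on the log scale to stay positive."""
    N, M = Phi.shape
    alpha = np.exp(log_params[:M])                   # precisions alpha_1..alpha_M
    sigma2 = np.exp(log_params[M])                   # noise variance
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T   # Phi A^{-1} Phi' via broadcasting
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + t @ np.linalg.solve(C, t) + N * np.log(2 * np.pi))

def fit_type2_ml(Phi, t):
    """Maximize the marginal likelihood over (alpha, sigma^2)."""
    res = minimize(neg_log_marginal, np.zeros(Phi.shape[1] + 1),
                   args=(Phi, t), method="CG")
    return np.exp(res.x[:-1]), np.exp(res.x[-1])     # alphas, sigma^2
```

As individual $\alpha_m$ grow very large during the optimization, the corresponding weights are pinned to zero, which is what produces the sparse estimates on the next slide.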

SLIDE 11

Sparse Bayesian estimates for APRR net

        APRR9   APRR7   APRR5   APRR3   TOC1
APRR9   −0.11    0.27   −0.90   −0.01
APRR7    0.00    0.28    0.00   −0.80
APRR5    0.28    0.39    0.00    0.00
APRR3    0.00    0.41    0.59    0.00
TOC1     0.00    0.37    0.52    0.00

Far fewer nonzero entries than in standard regression!

SLIDE 12

Reconstruction of APRR traces

[Two plots: measured and reconstructed log expression levels vs. time/hrs (26 to 74) for APRR9, APRR7, APRR5, APRR3, TOC1/APRR1]

Running the estimated dynamics from the initial conditions with zero process noise: good agreement

SLIDE 13

Sparse Bayesian estimates for LHY/TOC1 net

        LHY     TOC1    GI      PIF3
LHY      0.66    0.80   −0.78    0.00
TOC1    −0.34   −0.19    0.58   −0.10
GI       0.00   −0.87    0.65    0.00
PIF3     0.00    0.00    0.22   −0.14

• LHY is in negative feedback with TOC1
• A second negative feedback loop involves GI
• PIF3 just added for good measure

SLIDE 14

Reconstruction of LHY/TOC1 traces

[Two plots: measured and reconstructed log expression levels vs. time/hrs (26 to 74) for LHY, TOC1/APRR1, GI, PIF3]

Running the estimated dynamics from the initial conditions with zero process noise

SLIDE 15

Nonlinear dependencies

[3D plot: level of gene C as a nonlinear function of the levels of genes A and B]

• So far we assumed linear dependencies of the level of gene A on other gene levels
• Genes often operate as switches and complex gates with nonlinear interactions (eg exclusive or)
• Need to go beyond linear models: Gaussian processes (GP)

SLIDE 16

Gaussian process

• Input values: $d$-dimensional $x = (x_1, \ldots, x_N)$, $x_i \in \mathbb{R}^d$
• Target values: $t = (t_1, \ldots, t_N)$, $t_i \in \mathbb{R}$
• The joint distribution of the outputs $t$ is multivariate Gaussian $N(0, K)$
• Covariance matrix $K$:

$$K_{pq} = \beta_0 + C_L(x_p, x_q) + C_G(x_p, x_q) + \sigma^2_\epsilon I(p = q)$$

• $\beta_0$: overall constant
• $\sigma^2_\epsilon$: noise term along the diagonal of $K$
• $I(\cdot)$: indicator function

SLIDE 17

Covariance components

Linear covariance part

$$C_L(x_p, x_q) = x_p' B^{-1} x_q$$

with linear relevance parameters $B = \mathrm{diag}(\beta_1, \ldots, \beta_d)$

Squared exponential (Gaussian) covariance part

$$C_G(x_p, x_q) = \alpha_0 \exp\left(-\tfrac{1}{2}(x_p - x_q)' A^{-1} (x_p - x_q)\right)$$

with nonlinear relevance parameters $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_d)$ and scale parameter $\alpha_0$
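A sketch of the full covariance matrix $K$ from the last two slides in numpy; `gp_covariance` is an illustrative name:

```python
import numpy as np

def gp_covariance(X, beta0, beta, alpha0, alpha, sigma2_eps):
    """K_pq = beta0 + C_L(x_p, x_q) + C_G(x_p, x_q) + sigma^2_eps I(p = q).

    X: (N, d) inputs; beta, alpha: (d,) relevance parameters."""
    # Linear part: x_p' B^{-1} x_q with B = diag(beta)
    CL = (X / beta) @ X.T
    # Squared exponential part with A = diag(alpha) and scale alpha0
    diff = X[:, None, :] - X[None, :, :]                          # (N, N, d)
    CG = alpha0 * np.exp(-0.5 * np.sum(diff**2 / alpha, axis=-1))
    return beta0 + CL + CG + sigma2_eps * np.eye(len(X))
```

With diagonal $B$ and $A$, every input dimension carries its own linear and nonlinear relevance parameter, which is what the application slides below report.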

SLIDE 18

Compare with linear regression

Compare the linear covariance part plus noise:

$$C_L(x_p, x_q) + \sigma^2_\epsilon I = x_p' B^{-1} x_q + \sigma^2_\epsilon I$$

with the covariance matrix of a linear regression with the weights integrated out (see above):

$$C = \Phi A^{-1} \Phi' + \sigma^2_\epsilon I$$

The two are the same if

$$B = \mathrm{diag}(\alpha_1, \ldots, \alpha_p) = \mathrm{diag}(1/\sigma_1^2, \ldots, 1/\sigma_p^2)$$

and the rows of $\Phi$ are the input vectors $x_i$

SLIDE 19

Training of GP

Covariance parameters $\theta_{MAP}$ maximizing the posterior probability:

$$P(\theta \mid t, x) \propto P(t \mid x, \theta) P(\theta)$$

with

$$\log P(t \mid x, \theta) = -\tfrac{1}{2}\, t' K(x, \theta)^{-1} t - \tfrac{1}{2} \log|K(x, \theta)| - \tfrac{n}{2} \log 2\pi$$

Lognormal prior $P(\theta)$ with fixed $a$ and $b$:

$$\log \theta \sim N(a, b)$$

Optimization with conjugate gradients (using derivatives)
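A sketch of the MAP objective, reusing the hypothetical `gp_covariance` above; scipy's conjugate-gradient routine stands in for the derivative-based optimizer mentioned on the slide:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(log_theta, X, t, a, b, cov_fn):
    """-log P(theta | t, x) up to a constant: GP marginal likelihood
    plus the lognormal prior, i.e. log theta ~ N(a, b^2) elementwise."""
    # cov_fn is assumed to unpack theta into (beta0, beta, alpha0, alpha, sigma2_eps)
    K = cov_fn(X, np.exp(log_theta))
    n = len(t)
    _, logdet = np.linalg.slogdet(K)
    nll = 0.5 * (t @ np.linalg.solve(K, t) + logdet + n * np.log(2 * np.pi))
    nlp = np.sum((log_theta - a)**2) / (2 * b**2)    # prior on log theta
    return nll + nlp

# theta_MAP via e.g.:
# res = minimize(neg_log_posterior, theta0, args=(X, t, a, b, cov_fn), method="CG")
```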

SLIDE 20

Conditional mean and variance

For a new input point $x_*$, extend the covariance matrix:

$$\tilde{K} = \begin{pmatrix} K & k(x_*) \\ k(x_*)' & k(x_*, x_*) \end{pmatrix}$$

where

$$k(x_*) = \big(\beta_0 + C_L(x_*, x_q) + C_G(x_*, x_q)\big)_{q=1}^{N}$$

$$k(x_*, x_*) = \beta_0 + x_*' B^{-1} x_* + \alpha_0 + \sigma^2_\epsilon$$

Then $f(x_*)$ is Gaussian $N(\mu(x_*), \sigma^2(x_*))$ with

$$\mu(x_*) = k(x_*)' K^{-1} t, \qquad \sigma^2(x_*) = k(x_*, x_*) - k(x_*)' K^{-1} k(x_*)$$
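The predictive equations in a few lines of numpy; `gp_predict` is an illustrative name that takes the covariance pieces as precomputed arguments:

```python
import numpy as np

def gp_predict(K, k_star, k_star_star, t):
    """mu(x*) = k(x*)' K^{-1} t and
    sigma^2(x*) = k(x*, x*) - k(x*)' K^{-1} k(x*)."""
    v = np.linalg.solve(K, k_star)          # K^{-1} k(x*); K is symmetric
    return v @ t, k_star_star - k_star @ v  # predictive mean and variance
```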

SLIDE 21

GP on simulated static data

[3D scatter plot: 30 data points over inputs x, y, z]

Relevance parameters for $x_1, x_2, x_3$: nonlinear 0.21 ($x_1$), linear 0.35 ($x_2$); estimated sd 0.92

30 data points with $f(x_1, x_2, x_3) = 5 \sin(0.7 x_1) + 0.5 x_2 + \epsilon$ where $\epsilon \sim N(0, 1)$
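A sketch of how such a data set could be generated; the uniform input range is an assumption, as the slide gives only the target function:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-10, 10, size=(30, 3))          # 30 points, 3 inputs
t = 5 * np.sin(0.7 * X[:, 0]) + 0.5 * X[:, 1] \
    + rng.normal(0, 1, size=30)                 # eps ~ N(0, 1)
# x_3 (X[:, 2]) is irrelevant: its relevance parameters should drop to ~0
```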

SLIDE 22

GP on simulated time-series data

[Plot: trajectories of the 3 variables over time]

Artificial network of 3 variables connected by nonlinear relationships

$$x_{t+1} = 0.35\, x_t + 5 \sin(0.3\, y_t) + \epsilon_1$$
$$y_{t+1} = 0.4\, y_t + 5 \cos(0.3\, z_t) + \epsilon_2$$
$$z_{t+1} = 0.4\, z_t + 0.1\, y_t^2 - 2 + \epsilon_3$$

Stable cycling is easy to achieve with nonlinear networks
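A minimal simulation of this artificial network; the noise scale and zero initial conditions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                   # number of time steps, as in the plot
x, y, z = np.zeros(T), np.zeros(T), np.zeros(T)
for t in range(T - 1):
    e1, e2, e3 = rng.normal(0, 0.5, 3)   # assumed noise scale
    x[t + 1] = 0.35 * x[t] + 5 * np.sin(0.3 * y[t]) + e1
    y[t + 1] = 0.4 * y[t] + 5 * np.cos(0.3 * z[t]) + e2
    z[t + 1] = 0.4 * z[t] + 0.1 * y[t]**2 - 2 + e3
```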

SLIDE 23

GP on time-series

[Two 3D plots: fitted GP surface for variable 1 over its inputs]

Variable 1: the linear and nonlinear relevance parameters for input 3 are both 0

SLIDE 24

GP on time-series

[Two 3D plots: fitted GP surface for variable 2 over its inputs]

Variable 2: the linear and nonlinear relevance parameters for input 1 are both 0

SLIDE 25

GP on time-series

[Two 3D plots: fitted GP surface for variable 3 over its inputs]

Variable 3: the linear and nonlinear relevance parameters for input 1 are both 0

SLIDE 26

Gene network: LHY dependency

[3D plot: fitted dependency of LHY on TOC1 and GI]

• Nonlinear relevance: 0.01, 0.01, 0.73, 0.01
• Linear relevance: 0.81, 1.13, 0.45, 0.00
• Estimated sd: 0.18
• No dependency of LHY on PIF3
• Nonlinear dependency of LHY on TOC1 and GI (inputs LHY and PIF3 set to 0 for the plot)

SLIDE 27

Gene network: GI dependency

[3D plot: fitted dependency of GI on TOC1 and on itself]

• Nonlinear relevance: 0.01, 0.00, 0.78, 0.00
• Linear relevance: 0.00, 0.82, 0.17, 0.00
• Estimated sd: 0.30
• No dependency of GI on LHY or PIF3
• Linear (negative) dependency of GI on TOC1, nonlinear (positive) dependency of GI on itself

SLIDE 28

Light input pathway

[Plot: expression levels vs. time/hrs (26 to 74) for LHY, TOC1, GI, PHYA, PHYB, PHYC, PHYD, PHYE, CRY1, CRY2]

• Entrainment of the 24h rhythm via light input
• Phytochromes (phy): red, IR; cryptochromes (cry): blue, UV
• Cycling even under constant light conditions (Cry2, PhyA, PhyB)
• Bidirectional links from the central clock?

SLIDE 29

Light input pathway

[Network diagram: TOC1, LHY, GI, PHYA, CRY1, CRY2, PHYE, PHYD, PHYC, PHYB]

Chain of Phy and Cry regulation

SLIDE 30

Light input and PRR pathway

[Network diagram: GI, LHY, PRR7, TOC1, CRY1, PHYC, PRR9, ELF3, PRR5, TIC, LUX, ELF4, CRY2, PHYD, PHYE, PHYA, PRR3, PHYB]

SLIDE 31

State space model

$$x_t = f(x_{t-1}) + \epsilon_1$$
$$y_t = C x_t + \epsilon_2$$

• If the vector $y$ represents the observable variables (genes), use $C = (0, I)$
• $f(x) = (f_1(x), \ldots, f_d(x))$ is a vector of $d$ parallel GPs, each trained independently
• Extended Kalman filtering with GPs: modify the predictive mean and variance
• Iterate with ML type II estimation of the relevance parameters: an ARD-EM algorithm for GPs

SLIDE 32

Extended Kalman filter

$$P(x_i) = N(x_i \mid x_p, V_p), \qquad x_p = \tilde{\mu}(m_{i-1}, P_{i-1}), \quad V_p = \Sigma(m_{i-1}, P_{i-1}) + Q$$

$$P(x_i \mid t_i) = N(x_i \mid m_i, P_i), \qquad m_i = x_p + K(t_i - C x_p), \quad P_i = (I - KC) V_p$$

$$K = V_p C' (C V_p C' + R)^{-1}$$

Need to calculate the mean $\tilde{\mu}(u, S)$ and covariance $V_p = \Sigma(u, S)$ of the parallel GPs for an uncertain input $u \sim N(u, S)$ (similar to J. Quiñonero-Candela, A. Girard, and C. E. Rasmussen, 2003)
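A minimal numpy sketch of the measurement-update step above; the prediction step needs the GP moments $\tilde{\mu}$ and $\Sigma$ discussed on the next slide, and `ekf_update` is an illustrative name:

```python
import numpy as np

def ekf_update(x_p, V_p, t_i, C, R):
    """Kalman measurement update:
    m_i = x_p + K (t_i - C x_p),  P_i = (I - K C) V_p,
    with gain K = V_p C' (C V_p C' + R)^{-1}."""
    S = C @ V_p @ C.T + R               # innovation covariance
    K = np.linalg.solve(S, C @ V_p).T   # K = V_p C' S^{-1} (S, V_p symmetric)
    m_i = x_p + K @ (t_i - C @ x_p)
    P_i = (np.eye(len(x_p)) - K @ C) @ V_p
    return m_i, P_i
```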

SLIDE 33

Uncertain input for parallel GPs

• With the covariances $C_G$ and $C_L$, the mean and covariance are exact; eg

$$\int C_G(x_*, x_j)\, p_G(x_* \mid u, S)\, dx_*$$

is a combination of two Gaussians

• The variance of $f(x_*)$ is

$$E_{x_*}\big(\Sigma(x_*)\big) + \mathrm{var}_{x_*}\big(\tilde{\mu}(x_*)\big)$$

• $\Sigma(x_*)$ is composed of the covariances of each GP
• $\mathrm{var}_{x_*}(\tilde{\mu}(x_*))$ involves covariances across GPs

(solution along the lines of Quiñonero-Candela et al.)
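The slide computes these moments in closed form; as a sanity check, the same decomposition can be approximated by Monte Carlo (a sketch with illustrative names, returning only the per-GP variances rather than the full cross-GP covariance):

```python
import numpy as np

def predictive_moments_mc(mu_fn, var_fn, u, S, n=10000, seed=0):
    """Monte Carlo estimate of E[f(x*)] and of
    var f(x*) = E_{x*}[Sigma(x*)] + var_{x*}[mu(x*)] for x* ~ N(u, S)."""
    xs = np.random.default_rng(seed).multivariate_normal(u, S, size=n)
    mus = np.array([mu_fn(x) for x in xs])   # per-sample GP means
    vs = np.array([var_fn(x) for x in xs])   # per-sample GP variances
    return mus.mean(0), vs.mean(0) + mus.var(0)
```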

SLIDE 34

Reconstruction of hidden variable

[Two plots: trajectories of the 3 variables over time; the left panel shows the GP-EM reconstruction]

The 3rd variable (green) is treated as a hidden variable in the GP-EM reconstruction on the left-hand side

SLIDE 35

Conclusion

• Complexity control (Occam's razor) by Bayesian estimation of hyperparameters
• MAP estimation of hyperparameters (maximum likelihood type II) works fine
• Gaussian processes integrate linear and nonlinear components
• Downside: the setting of the prior parameters ($a$ and $b$ above) is critical, particularly the noise parameter in the case of noisy data
• GP-EM is possible but tricky due to the presence of many local optima
