Gene Networks Estimation: Extensions of the lasso (José Sánchez)



SLIDE 1

Gene Networks Estimation References

Gene Networks Estimation

Extensions of the lasso

José Sánchez

Mathematical Sciences, Chalmers University of Technology

Sep 12, 2013

SLIDE 2

Cancer systems biology

According to the central dogma of molecular biology, information cannot be transferred from a protein back to either DNA or RNA. This fact establishes a framework for the study of cancer at the molecular level.

SLIDE 4

Network Modeling

Why gene networks?

  • A gene regulatory network describes how genes interact with each other to form modules and carry out cell functions.
  • Gene networks help in systematically understanding complex molecular mechanisms.
  • They allow the identification of hub genes, which are potential disease drivers (Kendall et al., 2005; Mani et al., 2008; Nibbe et al., 2010; Slavov and Dawson, 2009).

Goals

  • Estimate joint gene regulatory networks for several types of cancer and data types.
  • Incorporate biologically meaningful constraints into the model (commonality, modularity).
  • Take into account the high dimensionality ($p \gg N$) of the problem.

SLIDE 7

Gaussian Graphical Models

A graph consists of a set of vertices $V$ and a set of edges $E \subseteq V \times V$. In a graphical model, the vertices correspond to a set of random variables $X = (X_1, X_2, \dots, X_p)$ coming from a distribution $P$.

A conditional independence graph (CIG) is a graphical model where the absence of an edge between variables $X_i$ and $X_j$ implies that they are conditionally independent given the rest, that is, $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$.

If the variables $X = (X_1, X_2, \dots, X_p)$ come from the multivariate normal distribution $N(0, \Sigma)$, the CIG corresponds to a Gaussian graphical model (Lauritzen, 1996). In this case the conditional independencies between the variables in the model (the missing edges in the graph) are given by the zeros of the inverse covariance matrix $\Theta = \Sigma^{-1}$: $\theta_{ij} = 0$ if and only if $X_i$ and $X_j$ are conditionally independent given the rest.
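The link between zeros of the precision matrix and missing edges can be illustrated with a small sketch. The 4-variable chain below is a hypothetical example, not data from the talk: the precision matrix is sparse, while the covariance matrix it implies is dense, so the graph has to be read from $\Theta$ and not from $\Sigma$.

```python
import numpy as np

# Hypothetical chain X1 - X2 - X3 - X4: only neighbours share an edge.
theta = np.array([
    [2.0, -1.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 2.0],
])
sigma = np.linalg.inv(theta)  # implied covariance matrix (dense)


def partial_correlations(precision):
    """Partial correlation of X_i and X_j given the rest: -theta_ij / sqrt(theta_ii * theta_jj)."""
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


pcor = partial_correlations(theta)

# Edges of the conditional independence graph are the non-zero off-diagonal entries.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if abs(theta[i, j]) > 1e-12]
print(edges)
print(np.round(sigma, 3))
```

Every pair of variables is marginally correlated here, yet only neighbouring pairs have a non-zero partial correlation.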

SLIDE 9

Gene Network Modeling

GGM for gene networks

Assume the gene measurements are $N(\mu, \Sigma)$ distributed and model them using a Gaussian graphical model. The links of the gene network are given by the non-zero entries of the precision matrix $\Theta = \Sigma^{-1}$. Because $p \gg N$, the empirical covariance matrix is singular and the precision matrix cannot be estimated by inverting it directly, so regularization (sparsity) has to be introduced.
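A quick numerical check of why direct estimation fails when $p > N$: the sketch below uses simulated data with hypothetical sizes (20 samples, 50 genes) and shows that the empirical covariance matrix has rank at most $N - 1$ after centering, so it has no inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 50            # hypothetical sizes with p > N
X = rng.standard_normal((n_samples, n_genes))
X -= X.mean(axis=0)                    # center each gene
S = X.T @ X / n_samples                # empirical covariance, p x p

rank = np.linalg.matrix_rank(S)
smallest_eig = np.linalg.eigvalsh(S)[0]   # eigvalsh returns ascending eigenvalues
print(rank, smallest_eig)
```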

Not the only methods

  • Bayesian networks.
  • Information theory-based methods.
  • Correlation-based methods.

SLIDE 10

Network Modeling: a high-dimensional problem

We may not be grapes, but estimation of (human) gene networks is still a high-dimensional problem.

Figure: image not preserved. Source: M. Pertea and S. Salzberg, Genome Biology, 2010.

SLIDE 12

The Lasso: an approach to the p >> N problem

Consider the usual linear regression setting with $p$-dimensional covariates $X_1, X_2, \dots, X_n$ and a univariate response $Y_1, Y_2, \dots, Y_n$. We model the response variable through a linear model

$$Y_i = \sum_{j=1}^{p} \beta_j X_i^{j} + \varepsilon_i, \qquad i = 1, 2, \dots, n.$$

The Lasso estimate of $\beta$ is given by the minimizer (Tibshirani, 1996)

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \frac{1}{n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}.$$
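As a rough illustration of the Lasso criterion above, the sketch below minimizes it with proximal gradient descent (ISTA). This solver is chosen here for brevity and is not mentioned in the talk; the simulated data, the sizes and the helper names are assumptions made for the example.

```python
import numpy as np


def soft_threshold(x, t):
    """Element-wise soft-thresholding: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def lasso_objective(X, y, beta, lam):
    """(1/n) * ||y - X beta||_2^2 + lam * ||beta||_1."""
    r = y - X @ beta
    return r @ r / X.shape[0] + lam * np.abs(beta).sum()


def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal gradient descent on the Lasso objective, started at zero."""
    n, p = X.shape
    # Lipschitz constant of the gradient of the squared-error term.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta) / n
        beta = soft_threshold(beta - grad / lipschitz, lam / lipschitz)
    return beta


# Simulated p >> N data with three active covariates (hypothetical values).
rng = np.random.default_rng(1)
n, p = 30, 100
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam_max = np.max(np.abs(2.0 * X.T @ y / n))   # smallest lam for which beta = 0 is optimal
beta_hat = lasso_ista(X, y, lam=0.1 * lam_max)
print(np.count_nonzero(beta_hat), "non-zero coefficients out of", p)
```

For $\lambda \geq \lambda_{\max}$ the estimate is identically zero; smaller values of $\lambda$ let coefficients enter the model.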

SLIDE 14

Penalized GGM for gene networks

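Maximize the $L_1$-penalized log-likelihood function for the precision matrix $\Theta$:

$$l(\Theta) = \ln[\det(\Theta)] - \operatorname{tr}(S\Theta) - g(\lambda, \Theta),$$

where $S = \frac{1}{n} X^T X$ is the empirical covariance matrix (written $S_k$ when computed for class $k$).

The graphical lasso (Friedman et al., 2008)

$$g(\lambda, \Theta) = \lambda \sum_{i \neq j} |\theta_{ij}|$$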

The group lasso (Yuan and Lin, 2007)

$$g(\lambda, \{\Theta\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} |\theta^k_{ij}| + \lambda_2 \sum_{i \neq j} \sqrt{\sum_{k=1}^{K} |\theta^k_{ij}|^2}$$

The fused lasso (Danaher et al., 2011)

$$g(\lambda, \{\Theta\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} |\theta^k_{ij}| + \lambda_2 \sum_{k<k'} \sum_{i,j} |\theta^k_{ij} - \theta^{k'}_{ij}|$$
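To make the three penalties concrete, here is a minimal sketch that evaluates them for a stack of $K$ precision matrices stored as an array of shape `(K, p, p)`. The function names and the toy matrices are assumptions for illustration, and the group penalty is taken in its usual form, the Euclidean norm across classes.

```python
import numpy as np


def offdiag_mask(p):
    """Boolean p x p mask that is True off the diagonal."""
    return ~np.eye(p, dtype=bool)


def graphical_lasso_penalty(theta, lam):
    """lam * sum_{i != j} |theta_ij| for a single p x p matrix."""
    return lam * np.abs(theta[offdiag_mask(theta.shape[0])]).sum()


def group_lasso_penalty(thetas, lam1, lam2):
    """Sparsity term plus the Euclidean norm across classes of each off-diagonal entry."""
    mask = offdiag_mask(thetas.shape[1])
    sparsity = np.abs(thetas[:, mask]).sum()
    group = np.sqrt((thetas ** 2).sum(axis=0))[mask].sum()
    return lam1 * sparsity + lam2 * group


def fused_lasso_penalty(thetas, lam1, lam2):
    """Sparsity term plus absolute differences between all pairs of classes."""
    K = thetas.shape[0]
    mask = offdiag_mask(thetas.shape[1])
    sparsity = np.abs(thetas[:, mask]).sum()
    fusion = sum(np.abs(thetas[k] - thetas[l]).sum()
                 for k in range(K) for l in range(k + 1, K))
    return lam1 * sparsity + lam2 * fusion


# Two hypothetical 2 x 2 precision matrices (K = 2 classes).
thetas = np.array([
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.0], [0.0, 2.0]],
])
print(graphical_lasso_penalty(thetas[0], lam=2.0))
print(group_lasso_penalty(thetas, lam1=1.0, lam2=1.0))
print(fused_lasso_penalty(thetas, lam1=1.0, lam2=1.0))
```

The group penalty only encourages a shared sparsity pattern across classes, whereas the fused penalty also pulls the values of corresponding entries together.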

SLIDE 15

Network Modeling: a high-dimensional problem

Specifically, we are interested in estimating the networks for 8 cancer types and 6 types of variables. The problem results in the estimation of about 485 million edges.

Data type      Variables
mRNA           7954
CNA            6562
miRNA          285
Methylation    3831
Mutation       469
Clinical       3

SLIDE 17

The Alternating Direction Method of Multipliers

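To jointly model sparse GGMs we propose an extended version of the fused lasso penalty. The networks are estimated by minimizing the penalized negative log-likelihood

$$l(\{\Theta\}) = \sum_{k=1}^{K} n_k \left[ \operatorname{tr}(S_k \Theta^k) - \ln \det(\Theta^k) \right] + g(\lambda, \{Z\}),$$

$$g(\lambda, \{Z\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left[ \alpha |Z^k_{ij}| + (1 - \alpha) (Z^k_{ij})^2 \right] + \lambda_2 \sum_{k<k'} \sum_{i,j} |Z^k_{ij} - Z^{k'}_{ij}|.$$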
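The joint criterion can be evaluated directly, which is handy for monitoring an optimizer. The sketch below assumes $Z^k = \Theta^k$, stores the $K$ matrices in arrays of shape `(K, p, p)`, and treats the likelihood term as a negative log-likelihood to be minimized; the function name and the toy inputs are hypothetical.

```python
import numpy as np


def joint_objective(thetas, covs, ns, lam1, lam2, alpha):
    """Sum_k n_k [tr(S_k Theta^k) - ln det Theta^k] plus the extended fused penalty."""
    K, p, _ = thetas.shape
    off = ~np.eye(p, dtype=bool)
    fit = 0.0
    for theta, S, n in zip(thetas, covs, ns):
        sign, logdet = np.linalg.slogdet(theta)
        if sign <= 0:
            return np.inf          # determinant not positive, so not positive definite
        fit += n * (np.trace(S @ theta) - logdet)
    z_off = thetas[:, off]
    elastic = (alpha * np.abs(z_off) + (1.0 - alpha) * z_off ** 2).sum()
    fusion = sum(np.abs(thetas[k] - thetas[l]).sum()
                 for k in range(K) for l in range(k + 1, K))
    return fit + lam1 * elastic + lam2 * fusion


# Toy input: two classes, identity covariances, hypothetical sample sizes.
covs = np.stack([np.eye(2), np.eye(2)])
thetas = np.array([
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
value = joint_objective(thetas, covs, ns=[10, 20], lam1=1.0, lam2=1.0, alpha=0.5)
print(value)
```

With $\alpha = 1$ the penalty reduces to the fused lasso penalty; $\alpha < 1$ adds a ridge-type term on the off-diagonal entries.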

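The ADMM (Boyd et al., 2011) can be applied to the general problem

$$\underset{\{\Theta\}, \{Z\}}{\text{minimize}} \quad f(\{\Theta\}) + g(\lambda, \{Z\}) \quad \text{subject to} \quad \Theta^k = Z^k, \quad k = 1, \dots, K,$$

where $f(\{\Theta\}) = \sum_{k=1}^{K} n_k \left[ \operatorname{tr}(S_k \Theta^k) - \ln \det(\Theta^k) \right]$ is the negative log-likelihood term.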

SLIDE 18

ADMM steps

ADMM solves this problem by defining the scaled augmented Lagrangian as follows:

$$L(\{\Theta\}, \{Z\}, \{U\}) = f(\{\Theta\}) + g(\lambda, \{Z\}) + \frac{\rho}{2} \sum_{k=1}^{K} \|\Theta^k - Z^k + U^k\|_F^2,$$

where $U^k$ are the dual variables. At iteration $m$, the variables $\{\Theta\}$, $\{Z\}$ and $\{U\}$ are updated according to

1. $\Theta^k_m \leftarrow \arg\min_{\{\Theta\}} L(\{\Theta\}, \{Z_{m-1}\}, \{U_{m-1}\})$
2. $Z^k_m \leftarrow \arg\min_{\{Z\}} L(\{\Theta_m\}, \{Z\}, \{U_{m-1}\})$
3. $U^k_m \leftarrow U^k_{m-1} + \Theta^k_m - Z^k_m$

for $k = 1, \dots, K$.
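The three updates can be sketched for the simplest special case: a single class ($K = 1$), $\alpha = 1$ and no fusion term, where the problem reduces to the graphical lasso and the $Z$-update is an element-wise soft-thresholding of the off-diagonal entries. This is a simplified illustration under those assumptions, not the joint estimator of the talk; the covariance matrix, sample size and choice $\rho = n$ are hypothetical.

```python
import numpy as np


def theta_update(S, Z, U, n, rho):
    """Step 1: closed-form minimizer via an eigendecomposition."""
    evals, V = np.linalg.eigh(rho / n * (Z - U) - S)
    d = n / (2.0 * rho) * (evals + np.sqrt(evals ** 2 + 4.0 * rho / n))
    return (V * d) @ V.T          # V diag(d) V^T


def z_update(A, lam, rho):
    """Step 2 for K = 1, alpha = 1: soft-threshold off-diagonals of A = Theta + U."""
    Z = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
    np.fill_diagonal(Z, np.diag(A))   # the diagonal is not penalized
    return Z


def admm_glasso(S, n, lam, rho=1.0, n_iter=500):
    """Run a fixed number of ADMM iterations; returns the last Theta and Z."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    for _ in range(n_iter):
        Theta = theta_update(S, Z, U, n, rho)   # step 1
        Z = z_update(Theta + U, lam, rho)       # step 2
        U = U + Theta - Z                       # step 3
    return Theta, Z


# Hypothetical 3 x 3 empirical covariance matrix from n = 50 samples.
S = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
n = 50
theta_mle, _ = admm_glasso(S, n, lam=0.0, rho=float(n))
_, z_sparse = admm_glasso(S, n, lam=100.0, rho=float(n))
print(np.round(theta_mle, 3))
print(np.round(z_sparse, 3))
```

With $\lambda = 0$ the iterates approach the unpenalized estimate $S^{-1}$, while a large $\lambda$ removes every edge from $Z$.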

SLIDE 19

ADMM, first step

For the first step, the function $g$ is constant, so the problem is to minimize the function

$$\sum_{k=1}^{K} n_k \left[ \operatorname{tr}(S_k \Theta^k) - \ln \det(\Theta^k) \right] + \frac{\rho}{2} \sum_{k=1}^{K} \|\Theta^k - Z^k + U^k\|_F^2$$

with respect to $\{\Theta\}$. Let $V D V^T$ be the eigendecomposition of $\frac{\rho}{n_k}(Z^k - U^k) - S_k$. The minimizer is given (Witten and Tibshirani, 2009) by $V \tilde{D} V^T$, where $\tilde{D}$ is diagonal with

$$\tilde{D}_{jj} = \frac{n_k}{2\rho} \left( D_{jj} + \sqrt{D_{jj}^2 + 4\rho/n_k} \right).$$
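The closed form can be checked by setting the gradient to zero for each class separately. The short derivation below is a sketch of that argument, using only the quantities defined on this slide.

```latex
\begin{align*}
0 &= n_k \left( S_k - (\Theta^k)^{-1} \right) + \rho \left( \Theta^k - Z^k + U^k \right) \\
\Longrightarrow \quad
\frac{\rho}{n_k} \Theta^k - (\Theta^k)^{-1}
  &= \frac{\rho}{n_k} (Z^k - U^k) - S_k = V D V^T .
\end{align*}
% Both sides share the eigenvectors V, so with \Theta^k = V \tilde{D} V^T
% each eigenvalue satisfies a scalar quadratic equation:
\begin{align*}
\frac{\rho}{n_k} \tilde{D}_{jj} - \frac{1}{\tilde{D}_{jj}} = D_{jj}
\quad \Longrightarrow \quad
\frac{\rho}{n_k} \tilde{D}_{jj}^2 - D_{jj} \tilde{D}_{jj} - 1 = 0
\quad \Longrightarrow \quad
\tilde{D}_{jj} = \frac{n_k}{2\rho} \left( D_{jj} + \sqrt{D_{jj}^2 + 4\rho/n_k} \right).
\end{align*}
```

Taking the positive root keeps every $\tilde{D}_{jj} > 0$, so the update is always positive definite.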

SLIDE 20

ADMM, second step

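For the second step, the function $f$ is constant, so the problem is to minimize the function

$$g(\lambda, \{Z\}) + \frac{\rho}{2} \sum_{k=1}^{K} \|\Theta^k - Z^k + U^k\|_F^2 = \frac{\rho}{2} \sum_{k=1}^{K} \|Z^k - A^k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left[ \alpha |Z^k_{ij}| + (1 - \alpha) (Z^k_{ij})^2 \right] + \lambda_2 \sum_{k<k'} \sum_{i,j} |Z^k_{ij} - Z^{k'}_{ij}|$$

with respect to $\{Z\}$, where $A^k = \Theta^k + U^k$. This problem is separable over the elements $(i, j)$, so we can solve separately the problems

$$\underset{\{Z_{ij}\}}{\text{minimize}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (Z^k_{ij} - A^k_{ij})^2 + \frac{\lambda_1}{\rho} I_{i \neq j} \sum_{k=1}^{K} \left[ \alpha |Z^k_{ij}| + (1 - \alpha) (Z^k_{ij})^2 \right] + \frac{\lambda_2}{\rho} \sum_{k<k'} |Z^k_{ij} - Z^{k'}_{ij}| \right\}.$$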

SLIDE 21

ADMM, second step

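Fix an element $(i, j)$ and drop its indices, so that $Z = (Z^1, \dots, Z^K)$ and $A = (A^1, \dots, A^K)$ are vectors of length $K$. Let

$$g_1(Z) = \frac{1}{2} \sum_{k=1}^{K} (Z^k - A^k)^2,$$

$$g_2(Z) = \sum_{k=1}^{K} \lambda_1^k \left[ \alpha |Z^k| + (1 - \alpha) (Z^k)^2 \right],$$

$$g_3(Z) = \sum_{k<k'} \lambda_2^{kk'} |Z^k - Z^{k'}| = \|\Lambda_2 \odot LZ\|_1,$$

where $\Lambda_2 = (\lambda_2^{kk'})$ is a vector of dimension $\frac{1}{2}K(K - 1)$, $\odot$ denotes the element-wise product, and $L$ is a $\frac{1}{2}K(K - 1)$-by-$K$ matrix with values in $\{-1, 0, 1\}$ corresponding to the pairwise differences to be penalized. This problem can be written as

$$\underset{Z, V, W}{\text{minimize}} \quad g_1(Z) + g_2(V) + g_3(W) \quad \text{subject to} \quad V = Z, \quad W = LZ.$$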

SLIDE 22

ADMM, second step

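In each iteration, the solutions to this problem are given by

$$Z = \left[ (\rho_1 + 1) I + \rho_2 L^T L \right]^{-1} \left[ A + \rho_1 \left( V - \frac{1}{\rho_1} P \right) + \rho_2 L^T \left( W - \frac{1}{\rho_2} Q \right) \right],$$

$$V = ST_{\lambda_1/\rho_1} \left( Z + \frac{1}{\rho_1} P \right),$$

$$W = ST_{\lambda_2/\rho_2} \left( LZ + \frac{1}{\rho_2} Q \right),$$

where $ST_t(x) = \operatorname{sign}(x) \max(|x| - t, 0)$ is the soft-thresholding operator, and $P$ and $Q$ are the dual variables for the constraints $V = Z$ and $W = LZ$, with penalty parameters $\rho_1$ and $\rho_2$.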
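A minimal sketch of this inner iteration for one element $(i, j)$ follows. It assumes $\alpha = 1$ (so the $V$-update is plain soft-thresholding, as in the formulas above), equal weights for all classes and pairs, and the standard ADMM dual updates $P \leftarrow P + \rho_1 (Z - V)$ and $Q \leftarrow Q + \rho_2 (LZ - W)$, which are not spelled out on the slide. The function names and the test vector are hypothetical.

```python
import numpy as np
from itertools import combinations


def soft_threshold(x, t):
    """Element-wise soft-thresholding: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def pairwise_difference_matrix(K):
    """Rows e_k - e_l for all pairs k < l, so that (L Z) holds the pairwise differences."""
    pairs = list(combinations(range(K), 2))
    L = np.zeros((len(pairs), K))
    for row, (k, l) in enumerate(pairs):
        L[row, k], L[row, l] = 1.0, -1.0
    return L


def fused_element_admm(A, lam1, lam2, rho1=1.0, rho2=1.0, n_iter=300):
    """Inner ADMM for one element (i, j); A holds the K values A^k_ij."""
    K = A.shape[0]
    L = pairwise_difference_matrix(K)
    M_inv = np.linalg.inv((rho1 + 1.0) * np.eye(K) + rho2 * L.T @ L)
    Z, V, P = np.zeros(K), np.zeros(K), np.zeros(K)
    W, Q = np.zeros(L.shape[0]), np.zeros(L.shape[0])
    for _ in range(n_iter):
        Z = M_inv @ (A + rho1 * (V - P / rho1) + rho2 * L.T @ (W - Q / rho2))
        V = soft_threshold(Z + P / rho1, lam1 / rho1)
        W = soft_threshold(L @ Z + Q / rho2, lam2 / rho2)
        # Dual updates in standard ADMM form (an assumption, not shown on the slide).
        P = P + rho1 * (Z - V)
        Q = Q + rho2 * (L @ Z - W)
    return Z


A = np.array([1.0, 1.2, 5.0])          # hypothetical values of one link in K = 3 classes
print(fused_element_admm(A, lam1=0.0, lam2=0.0))   # no penalty: Z stays at A
print(fused_element_admm(A, lam1=0.0, lam2=2.0))   # strong fusion: all classes share one value
```

With a sufficiently large fusing weight the three classes collapse to their common mean, which is the behaviour the fusion term is meant to produce.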
SLIDE 25

Selection of parameters via bootstrap

The most important parameters in the model are the sparsity parameter, $\lambda_1$, and the fusing parameter, $\lambda_2$. Here we propose to use the bootstrap and select values for the parameters that generate stable networks.

Consider first the sparsity parameter and assume we have $B$ bootstrap estimates of our networks. For class $k = 1, 2, \dots, K$ let

$$n^k_{ij} = \frac{1}{B} \sum_{b=1}^{B} I(\theta^k_{ij,b} \neq 0),$$

where $\theta^k_{ij,b}$ is the $b$-th bootstrap estimate for link $(i, j)$ in class $k$. Then $n^k_{ij}$ is an estimate of the probability of presence of link $(i, j)$ in cancer class $k$. For a given threshold $T_1$, a link is present in the final estimate if it is present in at least $100\,T_1\%$ of the bootstrap estimates, that is, if $n^k_{ij} \geq T_1$.
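The presence frequencies are simple to compute once the bootstrap estimates exist. The sketch below assumes they are already available as an array of shape `(B, p, p)` for one class; how they are fitted is not shown, and the function names, the numerical tolerance and the toy values are assumptions.

```python
import numpy as np


def link_frequencies(boot_thetas, tol=1e-8):
    """Fraction of bootstrap estimates in which each entry is non-zero."""
    return (np.abs(boot_thetas) > tol).mean(axis=0)


def stable_links(boot_thetas, T1, tol=1e-8):
    """Boolean matrix: True where the link frequency reaches the threshold T1."""
    return link_frequencies(boot_thetas, tol) >= T1


# Toy input: B = 4 bootstrap estimates of a 2 x 2 precision matrix.
boot = np.zeros((4, 2, 2))
boot[:, 0, 0] = boot[:, 1, 1] = 1.0
boot[:, 0, 1] = boot[:, 1, 0] = [0.4, 0.0, 0.3, 0.5]   # link present in 3 of 4 estimates

freq = link_frequencies(boot)
print(freq[0, 1])
print(stable_links(boot, T1=0.7)[0, 1], stable_links(boot, T1=0.8)[0, 1])
```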

SLIDE 27

Selection of parameters via bootstrap

To select the fusing parameter we proceed similarly. For classes $k, k' = 1, 2, \dots, K$ let

$$n^{kk'}_{ij} = \frac{\sum_{b=1}^{B} I(\theta^k_{ij,b} \neq \theta^{k'}_{ij,b},\ \theta^k_{ij,b} \neq 0,\ \theta^{k'}_{ij,b} \neq 0)}{\sum_{b=1}^{B} I(\theta^k_{ij,b} \neq 0,\ \theta^{k'}_{ij,b} \neq 0)},$$

which is an estimate of the probability that link $(i, j)$ is differential in classes $k$ and $k'$ given that it is present in both classes. For a given threshold $T_2$, if $n^{kk'}_{ij} \geq T_2$, then link $(i, j)$ is differential in classes $k$ and $k'$; otherwise it is fused.
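The differential frequency can be computed in the same way from two stacks of bootstrap estimates. This is a sketch under the assumption that both stacks have shape `(B, p, p)` and are aligned by bootstrap replicate; the function name, the tolerance and the toy values are hypothetical. Entries that are never present in both classes get `nan`.

```python
import numpy as np


def differential_frequencies(boot_k, boot_l, tol=1e-8):
    """Among replicates where a link is present in both classes, the fraction where it differs."""
    present = (np.abs(boot_k) > tol) & (np.abs(boot_l) > tol)
    differ = present & (np.abs(boot_k - boot_l) > tol)
    n_present = present.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = differ.sum(axis=0) / n_present
    return np.where(n_present > 0, freq, np.nan)


# Toy input: B = 4 replicates, p = 2, two classes.
boot_k = np.zeros((4, 2, 2))
boot_l = np.zeros((4, 2, 2))
for stack in (boot_k, boot_l):
    stack[:, 0, 0] = stack[:, 1, 1] = 1.0
boot_k[:, 0, 1] = boot_k[:, 1, 0] = [0.5, 0.5, 0.0, 0.3]
boot_l[:, 0, 1] = boot_l[:, 1, 0] = [0.5, 0.2, 0.4, 0.3]

diff_freq = differential_frequencies(boot_k, boot_l)
T2 = 0.5
is_differential = diff_freq >= T2
print(diff_freq[0, 1], is_differential[0, 1])
```

Here the link is present in both classes in three replicates and differs in one of them, so with $T_2 = 0.5$ it would be fused.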

SLIDE 28

Pipeline for TCGA data analysis

SLIDE 29

Validation

SLIDE 30

Biological analysis

SLIDE 32

References

  • S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • P. Danaher, P. Wang, and D. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. arXiv:1111.0324v1, 2011.
  • J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.
  • SD. Kendall, CM. Linardic, SJ. Adam, and CM. Counter. A network of genetic events sufficient to convert normal human cells to a tumorigenic state. Cancer Research, 65:9824–9828, 2005.
  • S. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
  • KM. Mani, C. Lefebvre, K. Wang, WK. Lim, K. Basso, et al. A systems biology approach to prediction of oncogenes and molecular perturbation targets in B-cell lymphomas. Molecular Systems Biology, 4(169), 2008.
  • RK. Nibbe, M. Koyuturk, and MR. Chance. An integrative -omics approach to identify functional sub-networks in human colorectal cancer. PLoS Computational Biology, 6(1):1–15, 2010.
  • N. Slavov and KA. Dawson. Correlation signature of the macroscopic states of the gene regulatory network in cancer. Proceedings of the National Academy of Sciences of the United States of America, 106(11):4079–4084, 2009.

  • R. Tibshirani. Regression shrinkage and selection via the lasso. Journal