

SLIDE 1

Bayesian Methods for Variable Selection with Applications to High-Dimensional Data

Part 2: Bayesian Models for Integrative Genomics

Marina Vannucci

Rice University, USA

ABS13-Italy 06/17-21/2013

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 2) ABS13-Italy 06/17-21/2013 1 / 34

slide-2
SLIDE 2

Part 2: Bayesian Models for Integrative Genomics

  • Summary of methods so far (annotated bibliography).
  • Models that incorporate a priori biological information.
  • Bayesian networks for genomic data integration.

Ref: Vannucci, M. and Stingo, F.C. (2011). Bayesian Models for Variable Selection that Incorporate Biological Information (with discussion). In Bayesian Statistics 9 (J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West eds.). Oxford: University Press, 659-678.

SLIDE 3

Identification of Genomic Biomarkers

  • DNA microarrays allow the parallel quantification of thousands of genes in a single experiment.
  • Goal: identification (selection) of biomarkers that predict a response (clinical outcome, survival time, etc.).
  • Major challenge: small n, large p.
  • Biomarker selection is important for treatment strategies and diagnostic tools.
  • Identifying individual genes as therapeutic targets is not sufficient.
  • Cancer drugs are designed to target specific pathways.

SLIDE 4

  • Pathways: ordered series of chemical reactions in a living cell that serve different functions.
  • A vast amount of biological knowledge is generated and stored in public databases: KEGG, Cell Signaling Technology (CST) Pathway, Invitrogen iPath, Reactome, ...
  • Pathways can be activated or inhibited at different points.
  • Also, genes are not independent biological elements. Information is available on “gene networks”, describing relations among genes both within and between pathways.
  • Signaling through branches or alternative pathways.

SLIDE 5

Available data and information:

  • Response variable: $Y_{n \times 1}$, log(time to distant metastasis)
  • Covariates (gene expressions): $X_{n \times p}$
  • Pathway-gene relationship: $S_{p \times K}$, where $s_{jk} = I\{\text{gene } j \in \text{pathway } k\}$
  • Gene-gene network: $R_{p \times p}$, where $r_{ij} = I\{\text{direct link between genes } i \text{ and } j\}$

Therefore,

  • We propose to incorporate pathway information in gene selection for disease prediction
  • Priors that account for the gene network
  • Select critical genes and pathways
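
The two indicator matrices S and R described above can be sketched as follows. This is only an illustration on a made-up toy annotation: the gene names, pathway names and links are hypothetical and are not taken from the study.

```python
# Hypothetical toy annotation; names are invented for illustration only.
genes = ["g1", "g2", "g3", "g4"]
pathways = {"pA": ["g1", "g2"], "pB": ["g2", "g3", "g4"]}
links = [("g1", "g2"), ("g3", "g4")]  # undirected gene-gene links

gene_index = {g: j for j, g in enumerate(genes)}
pathway_names = sorted(pathways)

# S is p x K with s_jk = 1 if gene j belongs to pathway k, else 0.
S = [[1 if g in pathways[k] else 0 for k in pathway_names] for g in genes]

# R is p x p and symmetric, with r_ij = 1 if genes i and j are directly linked.
p = len(genes)
R = [[0] * p for _ in range(p)]
for a, b in links:
    i, j = gene_index[a], gene_index[b]
    R[i][j] = R[j][i] = 1
```

A gene may belong to several pathways (here g2 has two ones in its row of S), which is why S is kept as a full indicator matrix rather than a single label per gene.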

SLIDE 6

Pathway analyses

  • Gene-set enrichment analysis (Subramanian et al., 2005)
  • Other pathway-based analyses:
      • Supergene (Park et al., 2007)
          • Cluster genes using GO, then filter by cluster size and PCs
          • Perform Lasso for the selection of clusters
          • Only selection on clusters, but not on genes
      • Markov random field model (Wei & Li, 2007 & 2008)
          • Gene selection. Identify differentially expressed genes between two experimental conditions utilizing the pathway structure information
      • Bayes models: (Telesca et al. 2008) for gene selection and (Li & Zhang, 2010) for “motifs” selection
  • We select both genes and pathways

SLIDE 7

Proposed Method

Pathway information is used

  1. in the likelihood
  2. to elicit the prior
  3. to structure the MCMC moves

SLIDE 8

Model - Pathway Scores and Priors

$Y = 1\alpha + T\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)$

  • $T$ is $n \times K$ and summarizes the group behavior of genes as PCA components obtained from the expression data of genes belonging to individual pathways.
  • Pathway selection via a latent $K$-vector $\theta$, with $\theta_k = 1$ if pathway $k$ is included and $\theta_k = 0$ otherwise, $k = 1, \ldots, K$.
  • Mixture prior on the regression coefficient $\beta_k$, indexed by $\theta_k$:

$\beta_k \mid \theta_k, \sigma^2 \sim \theta_k \cdot N(\beta_0, h\sigma^2) + (1 - \theta_k) \cdot \delta_0(\beta_k)$

  • Independent Bernoulli priors for the $\theta_k$'s and conjugate priors on $\alpha$, $\sigma^2$.
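
A minimal sketch of drawing from the mixture prior above, using only the Python standard library. The function name and the values of β0, h and σ² are assumptions chosen for illustration; K = 196 and the 10% prior inclusion probability are borrowed from the case study later in the talk.

```python
import math
import random

def sample_pathway_prior(K, incl_prob, beta0, h, sigma2, rng):
    """Draw (theta, beta) from independent Bernoulli indicators and the mixture prior."""
    theta, beta = [], []
    for _ in range(K):
        included = 1 if rng.random() < incl_prob else 0
        theta.append(included)
        # Slab: N(beta0, h * sigma2) when included; spike: point mass at zero otherwise.
        beta.append(rng.gauss(beta0, math.sqrt(h * sigma2)) if included else 0.0)
    return theta, beta

rng = random.Random(0)  # fixed seed so the draw is reproducible
theta, beta = sample_pathway_prior(
    K=196, incl_prob=0.10, beta0=0.0, h=1.0, sigma2=1.0, rng=rng
)
```

Every excluded pathway gets a coefficient of exactly zero, which is what makes the latent vector θ a selection indicator rather than a shrinkage weight.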

SLIDE 9

  • Gene selection via a latent $p$-vector $\gamma$, with $\gamma_j = 1$ if gene $j$ is included and $\gamma_j = 0$ otherwise, $j = 1, \ldots, p$.
  • Markov random field prior on $\gamma$:

$P(\gamma_j \mid \theta, \gamma_i, i \in N_j) = \dfrac{\exp(\gamma_j F(\gamma_j))}{1 + \exp(F(\gamma_j))}, \qquad F(\gamma_j) = \mu + \eta \sum_{i \in N_j} (2\gamma_i - 1)$

    with $N_j$ the set of neighbors of gene $j$ from included pathways.
  • $\mu$ controls sparsity. Higher $\eta$'s induce more neighbors to take on the same values.
  • We use a hyperprior for $\eta$: $\eta \sim \text{Gamma}(\alpha_\eta, \beta_\eta)$.
  • See also Wei & Li (2008, Ann. Appl. Stat.), Telesca et al. (2008), Li & Zhang (2010, JASA).
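
The conditional inclusion probability implied by the Markov random field prior above can be sketched as below. The function name, the toy neighbor lists and the values of µ and η are hypothetical; the neighbor lists are assumed to be already restricted to genes from included pathways.

```python
import math

def mrf_inclusion_prob(j, gamma, neighbors, mu, eta):
    """P(gamma_j = 1 | neighbors), with F = mu + eta * sum_{i in N_j} (2 * gamma_i - 1)."""
    F = mu + eta * sum(2 * gamma[i] - 1 for i in neighbors[j])
    # exp(F) / (1 + exp(F)) rewritten as a logistic function of F.
    return 1.0 / (1.0 + math.exp(-F))

# Toy network of three genes (illustrative values only).
gamma = [1, 1, 0]
neighbors = {0: [1, 2], 1: [0], 2: [0]}

p0 = mrf_inclusion_prob(0, gamma, neighbors, mu=-3.0, eta=0.5)  # one neighbor in, one out
p1 = mrf_inclusion_prob(1, gamma, neighbors, mu=-3.0, eta=0.5)  # its only neighbor is in
```

Gene 0 has one selected and one unselected neighbor, so the two contributions cancel and F reduces to µ; gene 1 has a selected neighbor, which raises its inclusion probability.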

SLIDE 10

Model Fitting and Posterior Inference

  • Integrate out $\alpha$, $\beta$, $\sigma^2$ to get the marginal posterior

$f(\theta, \gamma, \eta \mid Y, T) \propto f(Y \mid T, \theta, \gamma) \cdot p(\gamma, \theta \mid \eta) \cdot p(\eta)$

  • We use a 2-stage Metropolis algorithm to update $(\theta, \gamma)$:
      • pick a pathway $k$
      • pick a gene $j$ from pathway $k$
      • add/delete set of moves (with constraints)
    and update the parameter $\eta$ of the MRF by employing the general method proposed by Møller et al. (2006), which uses auxiliary variables.
  • Inference for pathways and genes can be made based on:
      • the $(\theta, \gamma)$ with largest joint posterior probability,
      • the $\theta_k$'s and $\gamma_j$'s with largest marginal posterior probabilities.
  • Prediction of future samples can be made via Bayesian model averaging.
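
Selection by marginal posterior probability, as mentioned above, amounts to averaging each indicator over the retained MCMC draws. The sketch below uses made-up draws, and the 0.5 cut-off is an assumption for illustration, not a value stated in the talk.

```python
def marginal_inclusion(samples):
    """Average each 0/1 indicator over the retained MCMC draws."""
    n = len(samples)
    return [sum(draw[k] for draw in samples) / n for k in range(len(samples[0]))]

def select(probs, cutoff=0.5):
    """Indices whose marginal posterior probability exceeds the cut-off."""
    return [k for k, pr in enumerate(probs) if pr > cutoff]

# Four hypothetical draws of a 3-element indicator vector.
draws = [(1, 0, 1), (1, 0, 0), (1, 1, 0), (1, 0, 1)]
probs = marginal_inclusion(draws)
selected = select(probs)
```

The same two functions apply unchanged to the pathway indicators θ and the gene indicators γ.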

SLIDE 11

Case Study on Breast Cancer - Van’t Veer et al. (2002, Nature)

  • Microarray data for 76 breast cancer patients, of which 33 developed distant metastases within 5 years.
  • X: gene expression. Y: log(time to distant metastasis).
  • Matrices S and R: gene-pathway and gene-gene relationships:
      • Link the probes to the Gene IDs (LocusLink) and link the Gene IDs to the pathways (KEGG).
      • R package KEGGgraph to download the gene network.
      • A total of 3,592 probes, mapped to 196 pathways, were included in the study.
  • Training and validation sets.
  • A priori we expect about 10% good pathways and 3% of the genes.
  • Vague priors on model parameters.
  • Two MCMC chains with 600,000 iterations (r = .9996).

SLIDE 12

Figure : Trace plots: Number of included pathways and genes (x-axis: iteration, ×10^5; y-axes: number of included pathways and number of included genes)

SLIDE 13

Prediction:

  • MSE = 1.57 (7 pathways & 12 genes)
  • MSE = 1.93 (11 genes, Sha et al. 2006, Bioinfo.)

Selection:

[Plot: marginal posterior probability (y-axis) by pathway (x-axis). Labeled pathways: Purine metabolism, MAPK signaling pathway, Cytokine-cytokine receptor interaction, Neuroactive ligand-receptor interaction, Cell cycle, Axon guidance, Cell adhesion molecules (CAMs), Complement and coagulation cascades, Regulation of actin cytoskeleton, Insulin signaling pathway, Pathways in cancer.]

SLIDE 14

Selection (cont’d):

Singleton genes (no direct neighbor selected): ACACB (10), C4A (8,12), CALM1 (10), CCNB2 (5), CD4 (7), CDC2 (5), CLDN11 (7), FZD9 (11), GYS2 (10), HIST1H2BN (12), IFNA7 (3), NFASC (7), NRCAM (7), PCK1 (10), PFKP (10), PPARGC1A (10), PXN (9)

Island 1: ACTB (9), ACTG1 (9), ITGA1 (9), ITGA7 (9), ITGB3 (9), ITGB4 (9), ITGB6 (9), ITGB8 (7,10), MYL5 (9), MYL9 (9), PDPK1 (10), PIK3CD (9,10,11), PLA2G4A (2), PLCG1 (11), PRKCA (2,11), PRKY (2,10), PRKY (2,10), PTGS2 (11), SOCS3 (10)

Island 2: ACVR1B (2,3,11), ACVR1B (2,3,11), TGFB3 (2,3,5,11)

Island 3: ENTPD3 (1), GMPS (1)

Table : The 41 selected genes divided by islands and with associated pathway indices (in parentheses). The pathway indices correspond to: 1-Purine metabolism, 2-MAPK signaling pathway, 3-Cytokine-cytokine receptor interaction, 4-Neuroactive ligand-receptor interaction, 5-Cell cycle, 6-Axon guidance, 7-Cell adhesion molecules (CAMs), 8-Complement and coagulation cascades, 9-Regulation of actin cytoskeleton, 10-Insulin signaling pathway, 11-Pathways in cancer, 12-Systemic lupus erythematosus.

SLIDE 15

Island 8: DUSP3, DUSP4, MAPK10

Figure : Some selected pathways and islands (sets of connected genes). Stingo et al. (Ann. Appl. Stat., 2011)

SLIDE 16

A more recent study on hypertension

  • Renal failure → high blood pressure → hypertension.
  • Salt-sensitive (SS) rats have a genetic imbalance that causes kidney failure.
  • mRNA data from SS rats on low- and high-salt diets.
  • “Pathways in cancer” largely regulates cell proliferation (CP).
  • Immunofluorescent analysis found increased CP in mTALs (nephron segments) under the high-salt diet.

Figure : From Yang et al. (Hypertension, 2013).

SLIDE 17

Summary

  • Goal: Identification of biomarkers that predict a response (clinical outcome, survival time, etc.)
  • The proposed method integrates experimental data with existing biological knowledge
  • The model incorporates information on pathways (groups of genes)
  • The prior incorporates gene-gene network information
  • We infer important pathways and important genes and also predict the response simultaneously

SLIDE 18

Currently working on extensions to genetic data

  • Genome-wide association studies have large n and very large p.
  • Goal: Achieve dimension reduction via creating gene scores based on SNP allele frequencies and studying their association to the phenotype.
  • Define gene scores by weighting observed SNP frequencies with the population frequencies defined by the Hardy-Weinberg equilibrium law.
  • A network prior takes into account a SNP-SNP network based on linkage disequilibrium describing non-random associations between SNPs.

SLIDE 19

Available data and information:

  • Response variable: $Y_{n \times 1}$, patient’s phenotype (disease/control)
  • Covariates (SNP allele frequencies): $X_{n \times p}$ (categorical covariates; 0,1,2)
  • SNP-gene relationship: $S_{p \times K}$, where $s_{jk} = I\{\text{SNP } j \in \text{gene } k\}$
  • SNP-SNP network: $R_{p \times p}$, where $r_{ij} = I\{\text{direct link between SNPs } i \text{ and } j\}$

Therefore,

  • We propose to incorporate gene information in SNP selection for disease prediction
  • Priors that account for the SNP-SNP network
  • Select critical genes and SNPs

SLIDE 20

Bayesian Network for Genomic Data Integration

  • Bayesian graphical model for miRNA regulatory inference
  • Model incorporates expression data of two kinds of units (genes and miRNAs)
  • The prior incorporates biological information
  • We infer edges as gene-miRNA interactions

SLIDE 21

Description of the Problem

  • 30% of genes in the human genome are regulated by microRNAs (Rajewsky, Nature Genetics 2006).
  • microRNAs (miRNAs) are small (∼22 nucleotide) RNAs.
  • The predicted genes are called miRNA targets or simply targets.
  • Our goal is to understand the regulatory process of microRNAs (miRNAs) on the genes (miRNA targets).

SLIDE 22

Neural Tube Defects Experiment

  • Experimental data from a study on hyperthermia as a developmental toxicant causing neural tube defects.
  • Mice were exposed in vivo to a 10-minute hyperthermia treatment on gestational day 8.5.
  • Litters were collected, and miRNAs and mRNAs were extracted from each sample for expression analysis.
  • Expression levels of G = 1,297 genes and M = 23 miRNAs on n = 11 mice.
  • Scores of possible miRNA-gene associations calculated by target prediction algorithms PicTar, miRanda, PITA and DIANA-microT (using sequence data or structure information).
  • Disease and control group.

SLIDE 23

microRNA Regulatory Network

  • Multiple responses, multiple predictors
  • Goal: Infer the regulatory process.
  • Few observations, a large number of target genes and a “small” number of miRNAs.
  • Also, scores of possible regulatory associations based on sequence/structure information

SLIDE 24

Proposed Model

  • Gaussian graphical models (GGMs) are graphs in which nodes represent random variables and the lack of an arc represents conditional independence.
  • A graph $G$ and the covariance matrix $\Omega$ entirely define a GGM (zero precision ⇔ lack of arc).
  • The proposed model:
      • uses a predetermined ordering of the nodes, based on biological considerations;
      • answers the baseline question “which miRNAs regulate which targets”;
      • takes into account constraints on the sign of the miRNA-target relations.

SLIDE 25

Directed Graphical Model Formulation (Bayesian Network)

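Given $Y_g$ the targets and $X_m$ the miRNAs,

$Z = (Y_1, Y_2, \ldots, Y_G, X_1, \ldots, X_M) \sim N(0; I_n, \Omega)$

Conditional independence of the targets given the miRNAs:

$Y_i \perp\!\!\!\perp Y_j \mid X_1, \ldots, X_M$

Equivalent to the system of equations

$\begin{cases} Y_1 = -X\beta_1 + \epsilon_{\sigma_1}, \\ \quad \vdots \\ Y_G = -X\beta_G + \epsilon_{\sigma_G}, \end{cases}$

where $\epsilon_{\sigma_g} \sim N(0, \sigma_g I_n)$. The parameters of the regression models are

$\beta_g = \Omega_{XX}^{-1} \Omega_{XY_g} \quad \text{and} \quad \sigma_g = \omega_{gg} - \Omega_{XY_g}' \Omega_{XX}^{-1} \Omega_{XY_g}.$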

SLIDE 26

The Prior Formulation

  • We impose our biological constraints by using Gamma distribution priors for the positive regression coefficients: $(\beta_{gm} \mid \sigma_g) \sim \text{Ga}(1, c\,\sigma_g)$.
  • We set Inverse-Gamma distributions for the error variances: $\sigma_g^{-1} \sim \text{Ga}((\delta + M)/2,\ d/2)$.
  • We introduce a $(G \times M)$ matrix $R$ with elements $r_{gm} = 1$ if the $m$-th miRNA is included in the regression of the $g$-th target and $r_{gm} = 0$ otherwise.

SLIDE 27

The Bayesian Hierarchical Structure

  • Each regression coefficient has a mixture distribution

$\pi(\beta_{gm} \mid \sigma_g, r_{gm}) = r_{gm}\,\text{Ga}(1, c\,\sigma_g) + (1 - r_{gm})\, I_{[\beta_{gm} = 0]}.$

  • The probability of inclusion is modeled as a function of the scores $s^j_{gm}$ (from PicTar, miRanda, PITA and DIANA-microT) as follows:

$P(r_{gm} = 1 \mid \tau) = \dfrac{\exp[\eta + \tau_1 s^1_{gm} + \tau_2 s^2_{gm} + \tau_3 s^3_{gm} + \tau_4 s^4_{gm} + \tau_5 s^5_{gm}]}{1 + \exp[\eta + \tau_1 s^1_{gm} + \tau_2 s^2_{gm} + \tau_3 s^3_{gm} + \tau_4 s^4_{gm} + \tau_5 s^5_{gm}]}$

  • We specify a hyperprior on the $\tau_j$'s as a gamma distribution: $\tau_j \sim \text{Ga}(a_\tau, b_\tau)$.
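
The logistic form of the prior inclusion probability above can be sketched as a small function. The function name and the numeric values of τ and the scores are hypothetical and serve only to show how the scores shift the probability.

```python
import math

def inclusion_prob(eta, tau, scores):
    """Prior P(r_gm = 1 | tau): logistic function of eta + sum_j tau_j * s^j_gm."""
    linear = eta + sum(t * s for t, s in zip(tau, scores))
    return 1.0 / (1.0 + math.exp(-linear))

tau = [0.02, 0.01, 0.01, 0.005, 0.005]   # illustrative positive weights
no_support = [0.0, 0.0, 0.0, 0.0, 0.0]   # no algorithm predicts the association
some_support = [50.0, 20.0, 0.0, 10.0, 0.0]

p_base = inclusion_prob(-3.0, tau, no_support)
p_scored = inclusion_prob(-3.0, tau, some_support)
```

Because the τ's have a gamma hyperprior they are positive, so a higher score from any prediction algorithm can only raise the prior probability that the corresponding edge is included.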

SLIDE 28

MCMC Inference

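Inference by a three-step Metropolis-Hastings algorithm:

  1. We accept the new value of $R$ with probability

$\min\left\{ \dfrac{f(Y \mid X(R^{new}), R^{new}, \sigma_g)\, \pi(R^{new} \mid \tau)}{f(Y \mid X(R^{old}), R^{old}, \sigma_g)\, \pi(R^{old} \mid \tau)},\ 1 \right\}.$

  2. We accept the new value of $\tau_j$ with probability

$\min\left\{ \dfrac{\pi(R \mid \tau_j^{new})\, \pi(\tau_j^{new})\, q(\tau_j^{old}; \tau_j^{new})}{\pi(R \mid \tau_j^{old})\, \pi(\tau_j^{old})\, q(\tau_j^{new}; \tau_j^{old})},\ 1 \right\}.$

  3. We accept the new value of $\sigma_g$ with probability

$\min\left\{ \dfrac{f(Y \mid X(R), R, \sigma_g^{new})\, \pi(\sigma_g^{new})\, q(\sigma_g^{old}; \sigma_g^{new})}{f(Y \mid X(R), R, \sigma_g^{old})\, \pi(\sigma_g^{old})\, q(\sigma_g^{new}; \sigma_g^{old})},\ 1 \right\}.$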
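
All three acceptance probabilities above share the same min{ratio, 1} form, so a single accept/reject helper can serve each step. The sketch below is a generic Metropolis-Hastings step computed on the log scale; the function and argument names are hypothetical, and the caller is assumed to supply the log of likelihood times prior and the log proposal densities.

```python
import math
import random

def mh_accept(log_target_new, log_target_old,
              log_q_old_given_new, log_q_new_given_old, rng):
    """Accept a proposal with probability min(1, ratio), computed on the log scale."""
    log_ratio = (log_target_new + log_q_old_given_new) \
        - (log_target_old + log_q_new_given_old)
    if log_ratio >= 0:
        return True  # ratio >= 1, always accept
    return rng.random() < math.exp(log_ratio)

rng = random.Random(1)
# A proposal that improves the target under a symmetric proposal is always accepted.
improved = mh_accept(0.0, -1.0, 0.0, 0.0, rng)
# A proposal with a vanishingly small ratio is rejected.
hopeless = mh_accept(-1000.0, 0.0, 0.0, 0.0, rng)
```

Working with logs avoids numerical underflow when the likelihood involves many targets; for the update of R, where no proposal ratio appears in the formula, the two proposal terms can be passed as zero.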

SLIDE 29

Some Results

  • Most genes are well predicted by a small number of miRNAs
  • η = −3: expected prior number of regressors = 1, 5% prior probability of selection
  • Posterior inference based on the posterior probability of edge inclusion
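
As a rough check of these prior settings, and assuming for the arithmetic only that the score terms contribute nothing to the linear predictor, η = −3 together with the M = 23 miRNAs of the experiment gives approximately:

```latex
\[
P(r_{gm} = 1) = \frac{e^{-3}}{1 + e^{-3}} \approx 0.047,
\qquad
M \cdot P(r_{gm} = 1) \approx 23 \times 0.047 \approx 1.1 ,
\]
```

which is in line with a prior probability of selection of about 5% and about one regressor expected per target.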

SLIDE 30

  • 93 arrows for a .8 cut-off (91 genes and 11 miRNAs), Bayesian FDR of 9.0%
  • Selection of the most important arcs is robust to η
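
The talk does not spell out how the Bayesian FDR is computed. One commonly used definition averages the posterior probability of non-inclusion over the selected edges, and the sketch below assumes that definition; the function name and the probabilities are made up for illustration.

```python
def bayesian_fdr(probs, cutoff):
    """Average posterior probability of a false inclusion among the selected edges."""
    selected = [p for p in probs if p > cutoff]
    if not selected:
        return 0.0  # nothing selected, so no false discoveries
    return sum(1.0 - p for p in selected) / len(selected)

# Hypothetical posterior probabilities of edge inclusion.
edge_probs = [0.95, 0.85, 0.9, 0.3]
fdr = bayesian_fdr(edge_probs, cutoff=0.8)
```

Raising the cut-off keeps only edges with higher posterior probability, so the estimated FDR can only go down as the selection becomes stricter.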

SLIDE 31

Posterior distributions of the τj’s

  • Analysis of the influence of the prior information
  • Results are not sensitive to the Gamma prior on τ

[Two density plots. Left: posterior π(τ1|Y,X) against τ1 for η = −2.5, η = −3 and η = −3.5. Right: posteriors π(τj|Y,X) against τj (×10^−3) for PicTar, miRanda, TargetScan and agg.]

Stingo et al. (2010, Ann. Appl. Stat., 4(4), 2024-2048)

SLIDE 32

Currently working on extensions to Genetic data

  • Genetical genomics: Integration of DNA and mRNA data.
  • We consider Comparative Genomic Hybridization (CGH) and Gene Expression (GE) data.
  • CGH data give information about changes at chromosome level.
  • Look at CGHs as “surrogates” of copy number states in a continuous scale.

SLIDE 33

  • Let $Y_{ig}$ denote the gene expression data, $X_{im}$ the CGH measurements, and $\xi_{im}$ the latent CGH state. Let $Z = [Y, X]$.
  • Assume that gene expression levels are affected by the latent CGH “states”.
  • Assume that the gene expression measurements are independent conditionally upon the copy numbers and that the copy number values are independent given their states:

$f(Z \mid \xi) = \prod_{g=1}^{G} f(Y_g \mid \xi, \beta_g) \prod_{m=1}^{M} \prod_{i=1}^{n} f(X_{im} \mid \xi_{im})$

  • Dependent regressions for the conditional model.
  • Hidden Markov model on the CGH data, to learn the states.

SLIDE 34

Summary

  • Bayesian variable selection has a wide range of applications in the analysis of large-scale genomic data.
  • Bayesian linear model for outcome prediction (or sample classification) with simultaneous selection of genes and pathways (or genes and SNPs).
  • Bayesian network for integration of data from different platforms (gene-miRNA regulatory networks or gene-CGH interactions).
  • Models and priors can easily incorporate biological knowledge (gene-gene networks; scores of regulatory association).
  • Inference via efficient stochastic search methods.
  • Also, imaging genetics (next).
