
Bayesian Methods for Variable Selection with Applications to High-Dimensional Data. Part 2: Bayesian Models for Integrative Genomics. Marina Vannucci, Rice University, USA. ABS13-Italy, 06/17-21/2013.


  1. Bayesian Methods for Variable Selection with Applications to High-Dimensional Data. Part 2: Bayesian Models for Integrative Genomics. Marina Vannucci, Rice University, USA. ABS13-Italy, 06/17-21/2013. Running footer: Marina Vannucci (Rice University, USA), Bayesian Variable Selection (Part 2), ABS13-Italy 06/17-21/2013.

  2. Part 2: Bayesian Models for Integrative Genomics. Summary of methods so far (annotated bibliography). Models that incorporate a priori biological information. Bayesian networks for genomic data integration. Ref: Vannucci, M. and Stingo, F.C. (2011). Bayesian Models for Variable Selection that Incorporate Biological Information (with discussion). In Bayesian Statistics 9 (J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West eds.). Oxford: University Press, 659-678.

  3. Identification of Genomic Biomarkers. DNA microarrays allow the parallel quantification of thousands of genes in a single experiment. Goal: identification (selection) of biomarkers that predict a response (clinical outcome, survival time, etc.). Major challenge: small n, large p. Biomarker selection is important for treatment strategies and diagnostic tools. Identifying individual genes as therapeutic targets is not sufficient: cancer drugs are designed to target specific pathways.

  4. Pathways: ordered series of chemical reactions in a living cell that serve different functions. A vast amount of biological knowledge is generated and stored in public databases: KEGG, Cell Signaling Technology (CST) Pathway, Invitrogen iPath, Reactome ... Pathways can be activated or inhibited at different points. Also, genes are not independent biological elements. Information is available on “gene networks”, describing relations among genes both within and between pathways. Signaling can occur through branches or alternative pathways.

  5. Available data and information. Response variable: $Y$ ($n \times 1$), log(time to distant metastasis). Covariates (gene expressions): $X$ ($n \times p$). Pathway-gene relationship: $S$ ($p \times K$), where $s_{jk} = I\{\text{gene } j \in \text{pathway } k\}$. Gene-gene network: $R$ ($p \times p$), where $r_{ij} = I\{\text{direct link between genes } i \text{ and } j\}$. Therefore, we propose to: incorporate pathway information in gene selection for disease prediction; use priors that account for the gene network; select critical genes and pathways.
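
As a rough illustration of the indicator matrices $S$ and $R$ defined on this slide, the sketch below builds them from toy inputs. The gene names, pathway memberships, and links are hypothetical, and treating links as undirected (so that $R$ is symmetric) is an assumption not stated on the slide.

```python
# Hypothetical toy inputs (not taken from the case study).
genes = ["g1", "g2", "g3", "g4"]
pathways = {"P1": ["g1", "g2"], "P2": ["g2", "g3", "g4"]}
links = [("g1", "g2"), ("g3", "g4")]  # assumed undirected gene-gene links


def membership_matrix(genes, pathways):
    """Return S (p x K) with S[j][k] = 1 if gene j belongs to pathway k."""
    names = list(pathways)
    return [[1 if g in pathways[k] else 0 for k in names] for g in genes]


def network_matrix(genes, links):
    """Return R (p x p) with R[i][j] = 1 if genes i and j are directly linked."""
    index = {g: j for j, g in enumerate(genes)}
    p = len(genes)
    R = [[0] * p for _ in range(p)]
    for a, b in links:
        i, j = index[a], index[b]
        R[i][j] = R[j][i] = 1  # symmetric by assumption
    return R


S = membership_matrix(genes, pathways)
R = network_matrix(genes, links)
```

Here `g2` belongs to both pathways, so its row of `S` has two ones, which is how a gene shared between pathways would appear.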

  6. Pathway analyses. Gene-set enrichment analysis (Subramanian et al., 2005). Other pathway-based analyses: Supergene (Park et al., 2007): cluster genes using GO, then filter by cluster size and PCs; perform Lasso for the selection of clusters; selection only on clusters, not on genes. Markov random field model (Wei & Li, 2007 & 2008): gene selection; identify differentially expressed genes between two experimental conditions utilizing the pathway structure information. Bayes models: Telesca et al. (2008) for gene selection and Li & Zhang (2010) for “motifs” selection. We select both genes and pathways.

  7. Proposed Method. Pathway information is used (1) in the likelihood, (2) to elicit the prior, (3) to structure the MCMC moves.

  8. Model - Pathway Scores and Priors. The regression model is

     $$Y = 1\alpha + T\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),$$

     where $T$ is $n \times K$ and summarizes the group behavior of genes as PCA components obtained from the expression data of genes belonging to individual pathways. Pathway selection is via a latent $K$-vector $\theta$:

     $$\theta_k = \begin{cases} 1 & \text{if pathway } k \text{ is included} \\ 0 & \text{otherwise} \end{cases} \qquad k = 1, \ldots, K.$$

     Mixture prior on the regression coefficient $\beta_k$, indexed by $\theta_k$:

     $$\beta_k \mid \theta_k, \sigma^2 \sim \theta_k \cdot N(\beta_0, h\sigma^2) + (1 - \theta_k) \cdot \delta_0(\beta_k).$$

     Independent Bernoulli priors for the $\theta_k$'s and conjugate priors on $\alpha$, $\sigma^2$.
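
One way to picture the pathway score matrix $T$ is sketched below. It takes, for each pathway, the scores of the samples on the first principal component of that pathway's genes. Using only the first component, using all member genes, and computing it by power iteration are assumptions made for illustration; the function names are hypothetical and this is not the authors' implementation.

```python
import math


def first_pc_scores(X, n_iter=200):
    """Scores of the n samples on the first principal component of X (n x q)."""
    n, q = len(X), len(X[0])
    means = [sum(row[c] for row in X) / n for c in range(q)]
    Xc = [[row[c] - means[c] for c in range(q)] for row in X]
    # Power iteration on Xc^T Xc for the leading loading vector.
    # The fixed start below can fail if it is orthogonal to that vector.
    v = [float(c + 1) for c in range(q)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        scores = [sum(r[c] * v[c] for c in range(q)) for r in Xc]
        w = [sum(Xc[i][c] * scores[i] for i in range(n)) for c in range(q)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(r[c] * v[c] for c in range(q)) for r in Xc]


def pathway_scores(X, S):
    """Return T (n x K): column k holds the first-PC scores of pathway k."""
    n, K = len(X), len(S[0])
    T = [[0.0] * K for _ in range(n)]
    for k in range(K):
        members = [j for j in range(len(S)) if S[j][k] == 1]
        if not members:
            continue  # empty pathway: leave its column at zero
        sub = [[row[j] for j in members] for row in X]
        for i, score in enumerate(first_pc_scores(sub)):
            T[i][k] = score
    return T
```

The sign of a principal component is arbitrary, so a column of `T` may be flipped without changing the fit of the regression.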

  9. Gene selection via a latent $p$-vector $\gamma$:

     $$\gamma_j = \begin{cases} 1 & \text{if gene } j \text{ is included} \\ 0 & \text{otherwise} \end{cases} \qquad j = 1, \ldots, p.$$

     Markov random field prior on $\gamma$:

     $$P(\gamma_j \mid \theta, \gamma_i, i \in N_j) = \frac{\exp(\gamma_j F(\gamma_j))}{1 + \exp(F(\gamma_j))}, \qquad F(\gamma_j) = \mu + \eta \sum_{i \in N_j} (2\gamma_i - 1),$$

     with $N_j$ the set of neighbors of gene $j$ from included pathways. $\mu$ controls sparsity. Higher values of $\eta$ induce more neighbors to take on the same values. We use a hyperprior for $\eta$, $\eta \sim \text{Gamma}(\alpha_\eta, \beta_\eta)$. See also Wei & Li (2008, Ann. Appl. Stat.), Telesca et al. (2008), Li & Zhang (2010, JASA).
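
To make the Markov random field prior concrete, the sketch below evaluates the conditional probability that $\gamma_j = 1$, which by the formula on this slide equals $\exp(F)/(1 + \exp(F))$ with $F = \mu + \eta \sum_{i \in N_j} (2\gamma_i - 1)$. The function name, the dictionary of neighbors, and the numeric values are hypothetical.

```python
import math


def mrf_inclusion_prob(j, gamma, neighbors, mu, eta):
    """P(gamma_j = 1 | neighbors) under the MRF prior on the slide."""
    # Each included neighbor contributes +1 and each excluded one -1.
    F = mu + eta * sum(2 * gamma[i] - 1 for i in neighbors[j])
    return 1.0 / (1.0 + math.exp(-F))  # same as exp(F) / (1 + exp(F))


# Hypothetical example: gene 0 has two neighbors, both currently included.
gamma = [0, 1, 1]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
p_in = mrf_inclusion_prob(0, gamma, neighbors, mu=0.0, eta=1.0)
```

With $\mu = 0$ and no neighbors the probability is 0.5. Included neighbors push it up and excluded neighbors push it down, more strongly as $\eta$ grows.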

  10. Model Fitting and Posterior Inference. Integrate out $\alpha$, $\beta$, $\sigma^2$ to get the marginal posterior

      $$f(\theta, \gamma, \eta \mid Y, T) \propto f(Y \mid T, \theta, \gamma) \cdot p(\theta, \gamma \mid \eta) \cdot p(\eta).$$

      We use a 2-stage Metropolis to update $(\theta, \gamma)$: pick a pathway $k$; pick a gene $j$ from pathway $k$; apply an add/delete set of moves (with constraints). We update the parameter $\eta$ of the MRF by employing the general method proposed by Moller et al. (2006), which uses auxiliary variables. Inference for pathways and genes can be based on: the $(\theta, \gamma)$ with largest joint posterior probability; the $\theta_k$'s and $\gamma_j$'s with largest marginal posterior probabilities. Prediction of future samples can be made via Bayesian model averaging.
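
The two posterior summaries mentioned on this slide can be approximated from MCMC output roughly as follows. The sketch assumes the sampler's visited indicator vectors are stored as a list of 0/1 tuples. Using visit frequency as a stand-in for the joint posterior probability is a simplification, and the function names are hypothetical.

```python
from collections import Counter


def marginal_inclusion(samples):
    """Fraction of MCMC samples in which each indicator equals 1."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]


def most_visited(samples):
    """Most frequently visited configuration (proxy for the joint mode)."""
    return Counter(map(tuple, samples)).most_common(1)[0][0]


# Hypothetical draws of a 3-element indicator vector.
draws = [(1, 0, 1), (1, 0, 0), (1, 1, 1), (1, 0, 1)]
probs = marginal_inclusion(draws)
mode = most_visited(draws)
```

The same two functions apply to the pathway indicators $\theta$ and the gene indicators $\gamma$, or to their concatenation when the joint configuration is of interest.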

  11. Case Study on Breast Cancer - Van’t Veer et al. (2002, Nature). Microarray data for 76 breast cancer patients, of whom 33 developed distant metastases within 5 years. X: gene expression. Y: log(time to distant metastasis). Matrices S and R (gene-pathway and gene-gene relationships): link the probes to the Gene IDs (LocusLink) and link the Gene IDs to the pathways (KEGG); use the R package KEGGgraph to download the gene network. A total of 3,592 probes, mapped to 196 pathways, were included in the study. Training and validation sets. A priori, we expect about 10% of the pathways and 3% of the genes to be relevant. Vague priors on model parameters. Two MCMC chains with 600,000 iterations (r = 0.9996).

  12. Figure: Trace plots of the number of included pathways and the number of included genes against iteration (axis in units of 10^5).

  13. Prediction: MSE = 1.57 (7 pathways and 12 genes); MSE = 1.93 (11 genes, Sha et al. 2006, Bioinfo.). Selection: Figure: marginal posterior probability (0 to 1) by pathway index (0 to 200). Labeled pathways: Purine metabolism, MAPK signaling pathway, Cytokine-cytokine receptor interaction, Neuroactive ligand-receptor interaction, Cell cycle, Axon guidance, Cell adhesion molecules (CAMs), Complement and coagulation cascades, Regulation of actin cytoskeleton, Insulin signaling pathway, Pathways in cancer.

  14. Selection (cont’d). Singleton genes (no direct neighbor selected): ACACB (10), C4A (8,12), CALM1 (10), CCNB2 (5), CD4 (7), CDC2 (5), CLDN11 (7), FZD9 (11), GYS2 (10), HIST1H2BN (12), IFNA7 (3), NFASC (7), NRCAM (7), PCK1 (10), PFKP (10), PPARGC1A (10), PXN (9). Island 1: ACTB (9), ACTG1 (9), ITGA1 (9), ITGA7 (9), ITGB3 (9), ITGB4 (9), ITGB6 (9), ITGB8 (7,10), MYL5 (9), MYL9 (9), PDPK1 (10), PIK3CD (9,10,11), PLA2G4A (2), PLCG1 (11), PRKCA (2,11), PRKY (2,10), PRKY (2,10), PTGS2 (11), SOCS3 (10). Island 2: ACVR1B (2,3,11), ACVR1B (2,3,11), TGFB3 (2,3,5,11). Island 3: ENTPD3 (1), GMPS (1). Table: The 41 selected genes divided by islands, with associated pathway indices in parentheses. The pathway indices correspond to: 1-Purine metabolism, 2-MAPK signaling pathway, 3-Cytokine-cytokine receptor interaction, 4-Neuroactive ligand-receptor interaction, 5-Cell cycle, 6-Axon guidance, 7-Cell adhesion molecules (CAMs), 8-Complement and coagulation cascades, 9-Regulation of actin cytoskeleton, 10-Insulin signaling pathway, 11-Pathways in cancer, 12-Systemic lupus erythematosus.
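
The split into singletons and islands can be reproduced in principle by finding connected components among the selected genes in the gene-gene network $R$. The sketch below assumes $R$ is a symmetric 0/1 matrix and that an island is a connected component of size two or more; the function name and the toy network are hypothetical.

```python
def split_islands(selected, R):
    """Split selected gene indices into singletons and connected islands."""
    selected = list(selected)
    seen = set()
    singletons, islands = [], []
    for start in selected:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search restricted to selected genes
            g = stack.pop()
            component.append(g)
            for h in selected:
                if h not in seen and R[g][h] == 1:
                    seen.add(h)
                    stack.append(h)
        if len(component) == 1:
            singletons.append(component[0])
        else:
            islands.append(sorted(component))
    return singletons, islands


# Hypothetical network on 5 genes with links 0-1, 1-2 and 3-4.
R_toy = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
singles, groups = split_islands([0, 1, 2, 3], R_toy)
```

Gene 3 is a singleton here because its only neighbor, gene 4, was not selected, which mirrors the "no direct neighbor selected" definition in the table.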

  15. Island 8: DUSP3, DUSP4, MAPK10. Figure: Some selected pathways and islands (sets of connected genes). Stingo et al. (Ann. Appl. Stat., 2011).
