feature selection in high dimension for precision medicine

Feature selection in high dimension for precision medicine - PowerPoint PPT Presentation

Feature selection in high dimension for precision medicine. Chloé-Agathe Azencott, Center for Computational Biology (CBIO), Mines ParisTech – Institut Curie – INSERM U900, PSL Research University, Paris, France. March 21, 2017 – MACARON Workshop


  1. Feature selection in high dimension for precision medicine Chloé-Agathe Azencott Center for Computational Biology (CBIO) Mines ParisTech – Institut Curie – INSERM U900 PSL Research University, Paris, France March 21, 2017 – MACARON Workshop http://cazencott.info chloe-agathe.azencott@mines-paristech.fr @cazencott

  2. Precision Medicine ◮ Treatment adapted to the (genetic) specificities of the patient, e.g. Trastuzumab for HER2+ breast cancer. ◮ Data-driven biology/medicine: identify similarities between patients that exhibit similar susceptibilities / prognoses / responses to treatment.

  3. Sequencing costs

  4. Big data!

  5. Big data!

  6.

  7. GWAS: Genome-Wide Association Studies. Which genomic features explain the phenotype? $p = 10^5$ to $10^7$ Single Nucleotide Polymorphisms (SNPs), $n = 10^2$ to $10^4$ samples. ◮ High-dimensional (large $p$) ◮ Low sample size (small $n$)

  8. Google Flu Trends. D. Lazer, R. Kennedy, G. King and A. Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science 2014. ◮ $p$ = 50 million search terms ◮ $n$ = 1152 data points ◮ Predictive search terms include keywords related to high school basketball.

  9. Is extracting information from this data doomed from the start?

  10. GWAS successes: multiple sclerosis, HaemGen consortium, ankylosing spondylitis. P. Visscher, M. Brown, M. McCarthy, J. Yang. Five years of GWAS discovery. AJHG 2012.

  11. Missing heritability. GWAS fail to explain most of the heritable variability of complex traits. Many possible reasons: – non-genetic / non-SNP factors – heterogeneity of the phenotype – rare SNPs – weak effect sizes – few samples in high dimension ($p \gg n$) – joint effects of multiple SNPs.

  12. Integrating prior knowledge. Use additional data and prior knowledge to constrain the feature selection procedure, so that the selected features are: – consistent with previously established knowledge – more easily interpretable – found with greater statistical power. Prior knowledge can be represented as structure: – linear structure of DNA – groups, e.g. pathways – networks (molecular, 3D structure).

  13. Regularized relevance. Set $V$ of $p$ variables.
      ◮ Relevance score $R : 2^V \to \mathbb{R}$. Quantifies the importance of any subset of variables for the question under consideration. Ex: correlation, HSIC, statistical test of association.
      ◮ Structured regularizer $\Omega : 2^V \to \mathbb{R}$. Promotes a sparsity pattern that is compatible with the constraint on the feature space. Ex: cardinality $\Omega : S \mapsto |S|$.
      ◮ Regularized relevance:
        $$\arg\max_{S \subseteq V} \; R(S) - \lambda\, \Omega(S)$$
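
To make the regularized relevance concrete, here is a minimal sketch for the simplest case named on the slide: an additive relevance score with the cardinality regularizer. Under that assumption the problem decouples, and a variable belongs to the optimum exactly when its own score exceeds $\lambda$. The scores and the value of `lam` are made up for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def regularized_relevance(subset, scores, lam):
    """R(S) - lam * |S| for an additive relevance R(S) = sum of per-variable scores."""
    return sum(scores[i] for i in subset) - lam * len(subset)

def brute_force_argmax(scores, lam):
    """Enumerate all 2^p subsets; only feasible for very small p."""
    p = len(scores)
    best, best_val = frozenset(), 0.0  # the empty set scores 0
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            val = regularized_relevance(subset, scores, lam)
            if val > best_val:
                best, best_val = frozenset(subset), val
    return best, best_val

def threshold_argmax(scores, lam):
    """Closed form for the additive case: keep variables whose score exceeds lam."""
    return frozenset(i for i, c in enumerate(scores) if c > lam)

# Illustrative (made-up) per-variable relevance scores.
scores = [0.9, 0.1, 0.5, 0.05, 0.7]
lam = 0.3
selected, value = brute_force_argmax(scores, lam)
```

Both routes pick the same subset here. With a structured regularizer the objective no longer decouples, which is why a dedicated solver is needed.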

  14. Network-guided multi-locus GWAS. Goal: find a set of explanatory SNPs compatible with a given network structure.

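  15. Network-guided GWAS
      ◮ Additive test of association, SKAT [Wu et al. 2011]:
        $$c_i = \left(G_i^\top (y - \mu)\right)^2, \qquad R(S) = \sum_{i \in S} c_i$$
      ◮ Sparse Laplacian regularization:
        $$\Omega : S \mapsto \sum_{i \in S} \sum_{j \notin S} W_{ij} + \alpha |S|$$
      ◮ Regularized maximization of $R$ (writing $\eta = \lambda\alpha$):
        $$\arg\max_{S \subseteq V} \; \underbrace{\sum_{i \in S} c_i}_{\text{association}} - \underbrace{\eta\, |S|}_{\text{sparsity}} - \underbrace{\lambda \sum_{i \in S} \sum_{j \notin S} W_{ij}}_{\text{connectivity}}$$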
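
The three terms of this objective can be evaluated directly for any candidate subset. The sketch below is illustrative only: the genotypes, phenotype, chain network and parameter values are invented, genotypes are stored as one list per SNP, and the function names are hypothetical rather than taken from the SConES code.

```python
def association_scores(G, y):
    """c_i = (G_i^T (y - mu))^2, with G given as a list of per-SNP genotype vectors."""
    mu = sum(y) / len(y)
    resid = [yk - mu for yk in y]
    return [sum(g * r for g, r in zip(snp, resid)) ** 2 for snp in G]

def objective(S, c, W, eta, lam):
    """association - sparsity - connectivity for a candidate SNP set S."""
    S = set(S)
    association = sum(c[i] for i in S)
    sparsity = eta * len(S)
    # Total weight of network edges that leave the selected set.
    connectivity = lam * sum(W[i][j] for i in S for j in range(len(c)) if j not in S)
    return association - sparsity - connectivity

# Toy data (made up): 4 SNPs observed on 4 samples, binary phenotype.
G = [[1, 0, 1, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1],
     [1, 1, 0, 0]]
y = [1, 0, 1, 0]
# Chain network 0 - 1 - 2 - 3 with unit edge weights.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

c = association_scores(G, y)
connected = objective({0, 1, 2}, c, W, eta=0.1, lam=0.2)
scattered = objective({0, 2}, c, W, eta=0.1, lam=0.2)
```

In this toy case the connected set `{0, 1, 2}` scores higher than `{0, 2}` even though SNP 1 is only weakly associated, because dropping it cuts two more edges.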

  16. Minimum cut reformulation. The graph-regularized maximization of score $Q$ (∗) is equivalent to an $s/t$-min-cut for a graph with adjacency matrix $A$ and two additional nodes $s$ and $t$, where $A_{ij} = \lambda W_{ij}$ for $1 \le i, j \le p$ and the weights of the edges adjacent to nodes $s$ and $t$ are defined as
        $$A_{si} = \begin{cases} c_i - \eta & \text{if } c_i > \eta \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad A_{it} = \begin{cases} \eta - c_i & \text{if } c_i < \eta \\ 0 & \text{otherwise.} \end{cases}$$
      SConES: Selecting Connected Explanatory SNPs.
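
The construction above can be sketched with a small max-flow routine. This is not the SConES implementation: it uses a plain Edmonds-Karp max flow on a dense capacity matrix, reads the selected SNPs off the source side of the resulting minimum cut, and runs on invented scores and a toy chain network. All function names are hypothetical.

```python
from collections import deque

def max_flow_source_side(cap, s, t):
    """Edmonds-Karp max flow. Returns (flow value, nodes on the source side of a min cut)."""
    n = len(cap)
    res = [row[:] for row in cap]  # residual capacities
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and res[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break  # no augmenting path left: the flow is maximal
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        total += bottleneck
    # The last search failed to reach t, so it marked exactly the nodes reachable from s.
    return total, {v for v in range(n) if parent[v] != -1}

def select_by_min_cut(c, W, eta, lam):
    """Build the s/t graph described on the slide and return the selected SNP indices."""
    p = len(c)
    s, t = p, p + 1
    cap = [[0.0] * (p + 2) for _ in range(p + 2)]
    for i in range(p):
        for j in range(p):
            if i != j:
                cap[i][j] = lam * W[i][j]   # A_ij = lambda * W_ij
        if c[i] > eta:
            cap[s][i] = c[i] - eta          # A_si
        elif c[i] < eta:
            cap[i][t] = eta - c[i]          # A_it
    _, source_side = max_flow_source_side(cap, s, t)
    return source_side - {s}

# Toy example (made up): chain network 0 - 1 - 2 - 3, unit weights.
c = [1.0, 0.25, 1.0, 0.0]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
selected = select_by_min_cut(c, W, eta=0.3, lam=0.2)
```

On this toy input the cut keeps SNP 1, whose score is below $\eta$, because it connects SNPs 0 and 2. A dense matrix is only reasonable for tiny examples; a real network with $10^5$ or more SNPs needs a sparse max-flow solver.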

  17. Comparison partners
      ◮ Univariate linear regression: $y_k = \alpha_0 + \beta\, G_{ik}$
      ◮ Lasso:
        $$\arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{\tfrac{1}{2} \|y - G\beta\|_2^2}_{\text{loss}} + \underbrace{\eta\, \|\beta\|_1}_{\text{sparsity}}$$
      ◮ Feature selection with sparsity and connectivity constraints:
        $$\arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{L(y, G\beta)}_{\text{loss}} + \underbrace{\eta\, \|\beta\|_1}_{\text{sparsity}} + \underbrace{\lambda\, \Omega(\beta)}_{\text{connectivity}}$$
        – ncLasso: network connected Lasso [Li and Li, Bioinformatics 2008]
        – Overlapping group Lasso [Jacob et al., ICML 2009]
          – groupLasso: e.g. SNPs near the same gene grouped together
          – graphLasso: 1 edge = 1 group.
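
As a reference point for the Lasso baseline, the sketch below minimizes $\tfrac{1}{2}\|y - G\beta\|_2^2 + \eta\|\beta\|_1$ by cyclic coordinate descent with soft-thresholding. It is a generic textbook solver, not the one used for the comparison on these slides, and the design matrix, response and function names are illustrative assumptions.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def lasso_coordinate_descent(G, y, eta, n_iter=200):
    """Minimize 0.5 * ||y - G beta||^2 + eta * ||beta||_1; G is a list of n rows of length p."""
    n, p = len(G), len(G[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - G beta (beta starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            col = [G[k][j] for k in range(n)]
            z = sum(g * g for g in col)
            if z == 0.0:
                continue  # an all-zero feature carries no information
            # Correlation of feature j with the residual that excludes its own contribution.
            rho = sum(g * (r + g * beta[j]) for g, r in zip(col, resid))
            new = soft_threshold(rho, eta) / z
            delta = new - beta[j]
            if delta:
                resid = [r - g * delta for r, g in zip(resid, col)]
                beta[j] = new
    return beta

# Made-up orthonormal design, so the solution is soft_threshold(y, eta) coordinate-wise.
G = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [3.0, 0.5, -2.0]
beta = lasso_coordinate_descent(G, y, eta=1.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

The selected features are the coordinates with non-zero coefficients. The network-aware variants on the slide add a connectivity penalty $\lambda\,\Omega(\beta)$ to this objective.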

  18. Runtime. [Figure: CPU runtime in seconds (log scale) versus number of SNPs (log scale) for graphLasso, ncLasso, ncLasso (accelerated), SConES and linear regression. $n = 200$, exponential random network (2% density).]

  19. Experiments: performance on simulated data ◮ Arabidopsis thaliana genotypes, $n = 500$ samples, $p = 1\,000$ SNPs; TAIR Protein-Protein Interaction data, $\sim 50 \cdot 10^6$ edges ◮ Higher power and lower FDR than comparison partners, except for groupLasso when groups = causal structure ◮ Fairly robust to missing edges ◮ Fails if network is random.

  20. Arabidopsis thaliana flowering time. 17 flowering time phenotypes [Atwell et al., Nature, 2010]; $p \sim 170\,000$ SNPs (after MAF filtering); $n \sim 150$ samples; 165 candidate genes [Segura et al., Nat Genet 2012]. Correction for population structure: regress out PCs. doi:10.1242/jcs.096941

  21. Arabidopsis thaliana flowering time. [Figure: bar charts of the number of candidate genes hit and the number of selected SNPs for each method.] ◮ SConES selects about as many SNPs as other network-guided approaches but detects more candidates.

  22. Arabidopsis thaliana flowering time: predictivity of selected SNPs. [Figure: $R^2$ of Lasso, ncLasso, groupLasso and SConES on the phenotypes 0W and LN22.]

  23. SConES: Selecting Connected Explanatory SNPs ◮ selects connected, explanatory SNPs; ◮ incorporates large networks into GWAS; ◮ is efficient, effective and robust. C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara and K. Borgwardt (2013) Efficient network-guided multi-locus association mapping with graph cuts, Bioinformatics 29 (13), i171–i179 doi:10.1093/bioinformatics/btt238 https://github.com/chagaz/scones https://github.com/chagaz/sfan https://github.com/dominikgrimm/easyGWASCore

  24. Multi-trait GWAS. Increase sample size by jointly performing GWAS for multiple related phenotypes.

  25. Toxicogenetics / Pharmacogenomics. Tasks (phenotypes) = chemical compounds. F. Eduati, L. Mangravite, et al. (2015) Prediction of human population responses to toxic compounds by a collaborative competition. Nature Biotechnology, 33 (9), 933–940 doi:10.1038/nbt.3299

  26. Multi-SConES. $T$ related phenotypes.
      ◮ Goal: obtain similar sets of features on related tasks.
        $$\arg\max_{S_1, \dots, S_T \subseteq V} \; \sum_{t=1}^{T} \left( \sum_{i \in S_t} c_i - \eta\, |S_t| - \lambda \sum_{i \in S_t} \sum_{j \notin S_t} W_{ij} - \underbrace{\mu\, |S_{t-1} \,\Delta\, S_t|}_{\text{task sharing}} \right)$$
        where $S \,\Delta\, S' = (S \cup S') \setminus (S \cap S')$ (symmetric difference).
      ◮ Can be reduced to single-task by building a meta-network.
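
The task-sharing penalty is easy to evaluate with set operations, since Python's `^` operator on sets is the symmetric difference. The sketch below scores a given choice of per-task feature sets. It assumes one score vector per task and sums the sharing term over consecutive task pairs; both are reading assumptions about the slide. All data values and function names are made up for illustration.

```python
def single_task_score(S, c, W, eta, lam):
    """sum_{i in S} c_i - eta * |S| - lam * (weight of edges leaving S)."""
    cut = sum(W[i][j] for i in S for j in range(len(c)) if j not in S)
    return sum(c[i] for i in S) - eta * len(S) - lam * cut

def multi_task_objective(subsets, scores, W, eta, lam, mu):
    """Per-task scores minus mu * |S_{t-1} symmetric-difference S_t| over consecutive tasks."""
    per_task = sum(single_task_score(S, c, W, eta, lam)
                   for S, c in zip(subsets, scores))
    sharing = sum(len(prev ^ cur) for prev, cur in zip(subsets, subsets[1:]))
    return per_task - mu * sharing

# Toy setup (made up): 3 features on a chain 0 - 1 - 2, two related tasks.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
scores = [[1.0, 0.2, 0.9],   # task 1
          [0.8, 0.9, 0.1]]   # task 2
subsets = [{0, 2}, {0, 1}]

difference = subsets[0] ^ subsets[1]
value = multi_task_objective(subsets, scores, W, eta=0.3, lam=0.1, mu=0.2)
```

Larger values of `mu` push the per-task selections toward a common set. `mu = 0` recovers independent single-task problems.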

  27. Multi-SConES: multiple related tasks. Simulations: retrieving causal features. [Figure: MCC of the methods CR, LA, EN, GL, GR, AG and SC under Models 1 to 4, for a single task and for two, three and four tasks.] M. Sugiyama, C.-A. Azencott, D. Grimm, Y. Kawahara and K. Borgwardt (2014) Multi-task feature selection on multiple networks via maximum flows, SIAM ICDM, 199–207 doi:10.1137/1.9781611973440.23 https://github.com/mahito-sugiyama/Multi-SConES https://github.com/chagaz/sfan
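
The simulations on this slide are scored with MCC, the Matthews correlation coefficient between the selected and the truly causal features. A minimal sketch of that metric follows; the feature sets and the function name are invented for illustration, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
from math import sqrt

def mcc(selected, causal, p):
    """Matthews correlation coefficient for feature recovery among p features."""
    tp = len(selected & causal)   # causal features that were selected
    fp = len(selected - causal)   # selected but not causal
    fn = len(causal - selected)   # causal but missed
    tn = p - tp - fp - fn         # correctly left out
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom

# Made-up example: 10 features, 3 causal, 3 selected, 2 in common.
partial = mcc({0, 1, 2}, {0, 1, 3}, p=10)
perfect = mcc({0, 1}, {0, 1}, p=10)
```

MCC equals 1 for perfect recovery and is near 0 for a selection no better than chance, which makes it suitable when causal features are a small minority.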

  28. Leveraging similarity between tasks. Use prior knowledge about the relationship between the tasks: $\Omega \in \mathbb{R}^{T \times T}$.
        $$\arg\max_{S_1, \dots, S_T \subseteq V} \; \sum_{t=1}^{T} \left( \sum_{i \in S_t} c_i - \eta\, |S_t| - \lambda \sum_{i \in S_t} \sum_{j \notin S_t} W_{ij} - \underbrace{\mu \sum_{u=1}^{T} \sum_{i \in S_t \cap S_u} \Omega^{-1}_{tu}}_{\text{task sharing}} \right)$$
      Can also be mapped to a meta-network. Code: http://github.com/chagaz/sfan
