Using structure to select features in high dimension – Chloé-Agathe Azencott – PowerPoint presentation


SLIDE 1

Using structure to select features in high dimension

Chloé-Agathe Azencott

Center for Computational Biology (CBIO) Mines ParisTech – Institut Curie – INSERM U900 PSL Research University, Paris, France

April 2, 2019 – IHP

http://cazencott.info chloe-agathe.azencott@mines-paristech.fr @cazencott

SLIDE 2

Precision Medicine

◮ The top highest-grossing drugs in the US only help between 1 in 25 and 1 in 4 patients.

◮ Differences in drug response are partially due to genetic differences.

◮ Adapt treatment to the (genetic) specificities of the patient. E.g. Trastuzumab for HER2+ breast cancer.

SLIDE 3

From genotype to phenotype

Which genomic features explain the phenotype?

SLIDE 4

From genotype to phenotype

Which genomic features explain the phenotype? – 80 000 proteins; – 200 000 mRNA; – 10 million SNPs; – 28 million CpG islands.

SLIDE 5

From genotype to phenotype

Which genomic features explain the phenotype? p = 10⁵–10⁷ genomic features, n = 10³–10⁵ samples. – 80 000 proteins; – 200 000 mRNA; – 10 million SNPs; – 28 million CpG islands.

SLIDE 6

From genotype to phenotype

Which genomic features explain the phenotype? p = 10⁵–10⁷ genomic features, n = 10³–10⁵ samples. – 80 000 proteins; – 200 000 mRNA; – 10 million SNPs; – 28 million CpG islands. High-dimensional (large p), low sample size (small n) data.

SLIDE 7

From genotype to phenotype

Which genomic features explain the phenotype? p = 10⁵–10⁷ genomic features, n = 10³–10⁵ samples. – 10 million Single Nucleotide Polymorphisms. Genome-Wide Association Studies.

SLIDE 8

Missing heritability

GWAS fail to explain most of the inheritable variability of complex traits. Many possible reasons: – non-genetic / non-SNP factors; – heterogeneity of the phenotype; – rare SNPs; – weak effect sizes; – few samples in high dimension (p ≫ n); – joint effects of multiple SNPs.

SLIDE 9

Integrating prior knowledge: Network-guided GWAS

Joint work with Dominik Grimm, Yoshinobu Kawahara, Karsten Borgwardt, and Héctor Climente González.

SLIDE 10

Integrating prior knowledge

Use additional data and prior knowledge to constrain the feature selection procedure.

– Consistent with previously established knowledge; – More easily interpretable; – Increased statistical power.

Prior knowledge can be represented as structure:

– Linear structure of the genome; – Groups: e.g. pathways; – Networks (molecular, 3D structure).

SLIDE 11

Network-guided biomarker discovery

◮ Biological networks help in understanding disease.

◮ Goal: Find a set of explanatory features compatible with a given network structure.

C.-A. Azencott (2016). Network-guided biomarker discovery, LNCS.

SLIDE 12

Integrating prior network knowledge

◮ Network-constrained lasso:

$$\arg\min_{\beta \in \mathbb{R}^p} \underbrace{\frac{1}{2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{loss}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{sparsity}} + \underbrace{\eta \sum_{j=1}^{p} \sum_{k=1}^{p} \beta_j L_{jk} \beta_k}_{\text{connectivity}}.$$

◮ Graph Laplacian L → β varies smoothly on the network.

$$L_{jk} = \begin{cases} 1 & \text{if } j = k \\ -W_{jk}/\sqrt{d_j d_k} & \text{if } j \sim k \\ 0 & \text{otherwise.} \end{cases}$$

C. Li and H. Li (2008). Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics, 24, 1175–1182.
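As a concrete illustration, here is a minimal numpy sketch of the normalized graph Laplacian and of the connectivity penalty β⊤Lβ it defines; the three-node path graph and the function names are hypothetical, chosen only to show that a coefficient vector that varies smoothly along the network is penalized less than one that jumps across edges.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}
    from a symmetric adjacency matrix W with zero diagonal."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def connectivity_penalty(beta, L):
    """Quadratic smoothness term sum_{j,k} beta_j L_jk beta_k = beta^T L beta."""
    return beta @ L @ beta

# Toy 3-node path graph: 0 -- 1 -- 2
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = normalized_laplacian(W)
# A beta that is smooth on the network incurs a small penalty...
smooth = connectivity_penalty(np.array([1.0, 1.0, 1.0]), L)
# ...while a beta that changes sign across edges incurs a larger one.
rough = connectivity_penalty(np.array([1.0, -1.0, 1.0]), L)
assert rough > smooth
```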


SLIDE 13

Regularized relevance

Set V of p variables.

◮ Relevance score $R : 2^V \to \mathbb{R}$: quantifies the importance of any subset of variables for the question under consideration. E.g. correlation, HSIC, statistical test of association.

◮ Structured regularizer $\Omega : 2^V \to \mathbb{R}$: promotes a sparsity pattern that is compatible with the constraint on the feature space. E.g. cardinality $\Omega : S \mapsto |S|$.

◮ Regularized relevance:

$$\arg\max_{S \subseteq V} \; R(S) - \lambda\, \Omega(S)$$

SLIDE 14

Network-guided GWAS

◮ Additive test of association SKAT [Wu et al. 2011]:

$$R(S) = \sum_{j \in S} c_j, \qquad c_j = \big( X^\top (y - \mu) \big)_j^2.$$

◮ Sparse Laplacian regularization:

$$\Omega : S \mapsto \sum_{j \in S} \sum_{k \notin S} W_{jk} + \alpha |S|.$$

◮ Regularized maximization of R:

$$\arg\max_{S \subseteq V} \underbrace{\sum_{j \in S} c_j}_{\text{association}} - \underbrace{\eta\, |S|}_{\text{sparsity}} - \underbrace{\lambda \sum_{j \in S} \sum_{k \notin S} W_{jk}}_{\text{connectivity}}.$$
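To make the objective concrete, here is a small numpy sketch that evaluates the SConES score of a candidate SNP set; the toy association scores c and network W are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def scones_score(S, c, W, eta, lam):
    """SConES objective for a candidate SNP set S (boolean mask):
    association sum_{j in S} c_j
    minus sparsity eta * |S|
    minus connectivity lam * sum_{j in S, k not in S} W_jk."""
    S = np.asarray(S, dtype=bool)
    association = c[S].sum()
    sparsity = eta * S.sum()
    # Weight of the edges cut between selected and unselected SNPs
    boundary = W[np.ix_(S, ~S)].sum()
    return association - sparsity - lam * boundary

# Toy example: 4 SNPs on a path network; SNPs 0 and 1 are strongly associated
c = np.array([3.0, 2.5, 0.1, 0.2])
W = np.zeros((4, 4))
for j, k in [(0, 1), (1, 2), (2, 3)]:
    W[j, k] = W[k, j] = 1.0
# Selecting the connected, associated pair beats also adding the isolated SNP 3
assert scones_score([1, 1, 0, 0], c, W, eta=0.5, lam=1.0) > \
       scones_score([1, 1, 0, 1], c, W, eta=0.5, lam=1.0)
```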


SLIDE 15

Minimum cut reformulation

The graph-regularized maximization of score Q (∗) is equivalent to an s/t-min-cut for a graph with adjacency matrix A and two additional nodes s and t, where $A_{ij} = \lambda W_{ij}$ for $1 \le i, j \le p$ and the weights of the edges adjacent to nodes s and t are defined as

$$A_{si} = \begin{cases} c_i - \eta & \text{if } c_i > \eta \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad A_{it} = \begin{cases} \eta - c_i & \text{if } c_i < \eta \\ 0 & \text{otherwise.} \end{cases}$$

SConES: Selecting Connected Explanatory SNPs.
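The reformulation can be sketched with networkx's generic max-flow/min-cut routines; the helper name scones_min_cut and the toy data are hypothetical, and this is only an illustration of the construction, not the optimized graph-cut implementation used in SConES.

```python
import networkx as nx
import numpy as np

def scones_min_cut(c, W, eta, lam):
    """Select SNPs by solving the s/t-min-cut reformulation:
    SNPs on the source side of the minimum cut are selected."""
    p = len(c)
    G = nx.Graph()
    G.add_nodes_from(["s", "t"])
    # Edges between SNPs, with capacity lam * W_jk
    for j in range(p):
        for k in range(j + 1, p):
            if W[j, k] > 0:
                G.add_edge(j, k, capacity=lam * W[j, k])
    # Source s connects to SNPs with c_j > eta, sink t to the others
    for j in range(p):
        if c[j] > eta:
            G.add_edge("s", j, capacity=c[j] - eta)   # A_sj = c_j - eta
        elif c[j] < eta:
            G.add_edge(j, "t", capacity=eta - c[j])   # A_jt = eta - c_j
    _, (s_side, _) = nx.minimum_cut(G, "s", "t")
    return sorted(n for n in s_side if n != "s")

# Toy example: SNPs 0 and 1 are both associated (c_j > eta) and connected
c = np.array([3.0, 2.5, 0.1, 0.2])
W = np.zeros((4, 4))
for j, k in [(0, 1), (1, 2), (2, 3)]:
    W[j, k] = W[k, j] = 1.0
print(scones_min_cut(c, W, eta=1.5, lam=1.0))
```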


SLIDE 16

Comparison partners

◮ Univariate linear regression:

$$\arg\min_{\beta_j \in \mathbb{R}} \frac{1}{2} \| y - \beta_j x_j \|_2^2.$$

◮ Lasso:

$$\arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \| y - X\beta \|_2^2 + \eta \| \beta \|_1.$$

◮ Feature selection with sparsity and connectivity constraints:

$$\arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \| y - X\beta \|_2^2 + \eta \| \beta \|_1 + \lambda\, \Omega(\beta).$$

– ncLasso: network-connected Lasso [Li and Li, Bioinformatics 2008]. – Overlapping group Lasso [Jacob et al., ICML 2009]: groupLasso (e.g. SNPs near the same gene grouped together); graphLasso (1 edge = 1 group).
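For reference, the first two baselines can be sketched in a few lines with numpy and scikit-learn; the toy data and parameter values below are made up, and the univariate score shown here is a simple marginal correlation rather than a full per-SNP regression with p-values.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)

# Hypothetical GWAS-like toy data: only features 0 and 1 carry signal
n, p = 100, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

# Univariate baseline: score each feature by its marginal correlation with y
univariate = np.abs(X.T @ y) / n

# Lasso baseline: a single sparse multivariate fit
lasso = Lasso(alpha=0.2).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

print(np.argsort(univariate)[-2:], selected)
```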


SLIDE 17

Runtime

[Figure: CPU runtime in seconds (log scale) as a function of the number of SNPs, from 10² to 10⁶ (log scale), for graphLasso, ncLasso, ncLasso (accelerated), SConES, and linear regression; n = 200, exponential random network (2% density).]

SLIDE 18

Experiments: Performance on simulated data

◮ Arabidopsis thaliana genotypes: n = 500 samples, p = 1 000 SNPs, TAIR protein–protein interaction data, ≈ 50 × 10⁶ edges.

◮ Higher power and lower FDR than comparison partners, except for groupLasso when groups = causal structure.

◮ Systematically better than relaxed version (ncLasso). ◮ Fairly robust to missing edges. ◮ Fails if network is random.

Image source: Jean Weber / INRA via Flickr.

SLIDE 19

Experiments: Performance on real data

◮ Arabidopsis thaliana genotypes: n ≈ 150 samples, p ≈ 170 000 SNPs, 165 candidate genes [Segura et al., Nat Genet 2012].

◮ SConES selects about as many SNPs as other network-guided approaches, but they tag more candidate genes.

◮ Predictivity of the selected SNPs: ◮ In half the cases, lasso outperforms all other approaches; ◮ In the remaining cases, SConES outperforms all other approaches.

SLIDE 20

SConES: Selecting Connected Explanatory SNPs

◮ selects connected, explanatory SNPs; ◮ incorporates large networks into GWAS; ◮ is efficient, effective and robust.

– C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara and K. Borgwardt (2013). Efficient network-guided multi-locus association mapping with graph cuts, Bioinformatics 29 (13), i171–i179, doi:10.1093/bioinformatics/btt238. https://github.com/chagaz/sfan
– H. Climente, C.-A. Azencott (2017). martini: GWAS incorporating networks in R, doi:10.18129/B9.bioc.martini. Bioconductor/martini

SLIDE 21

Finding interactions between a target SNP and the rest of the genome.

Joint work with Lotfi Slim, Jean-Philippe Vert, and Clément Chatelain.

SLIDE 22

◮ p variables X1, X2, . . . , Xp ∈ {0, 1, 2}; ◮ one target variable A ∈ {−1, 1}; ◮ outcome Y.

Which of the p variables interact with A towards Y?

SLIDE 23

◮ p variables X1, X2, . . . , Xp ∈ {0, 1, 2}; ◮ one target variable A ∈ {−1, 1}; ◮ outcome Y.

Which of the p variables interact with A towards Y?

◮ GBOOST: For each j = 1, . . . , p, LRT between – a full logistic regression model on (Xj, A, A·Xj); – a main-effect logistic regression model on (Xj, A).

SLIDE 24

◮ p variables X1, X2, . . . , Xp ∈ {0, 1, 2}; ◮ one target variable A ∈ {−1, 1}; ◮ outcome Y.

Which of the p variables interact with A towards Y?

◮ GBOOST: For each j = 1, . . . , p, LRT between – a full logistic regression model on (Xj, A, A·Xj); – a main-effect logistic regression model on (Xj, A).

◮ product Lasso: Lasso on (X1, X2, . . . , Xp, A, A·X1, A·X2, . . . , A·Xp).

SLIDE 25

Modeling epistasis

◮ p variables X1, X2, . . . , Xp ∈ {0, 1, 2}; ◮ one target variable A ∈ {−1, 1}; ◮ outcome Y.

Which of the p variables interact with A towards Y?

◮ $Y = \mu(X) + A\,\delta(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$

SLIDE 26

Modeling epistasis

◮ p variables X1, X2, . . . , Xp ∈ {0, 1, 2}; ◮ one target variable A ∈ {−1, 1}; ◮ outcome Y.

Which of the p variables interact with A towards Y?

◮ $Y = \mu(X) + A\,\delta(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$

◮ SNPs in epistasis with A = support of δ(X).

SLIDE 27

Clinical trials

$Y = \mu(X) + A\,\delta(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$

◮ Which of the SNPs in X interact with target SNP A towards phenotype Y?

◮ Which of the clinical covariates X interact with treatment A towards outcome Y?

L. Tian et al. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. JASA 109, 1517–1532.

SLIDE 28

Clinical trials

$Y = \mu(X) + A\,\delta(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$

◮ Which of the SNPs in X interact with target SNP A towards phenotype Y?

◮ Which of the clinical covariates X interact with treatment A towards outcome Y?

L. Tian et al. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. JASA 109, 1517–1532.

Modified outcome method to model δ: $Y' = 2\,YA, \qquad \delta(X) = \tfrac{1}{2}\,\mathbb{E}[Y'|X].$

SLIDE 29

Clinical trials

$Y = \mu(X) + A\,\delta(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$

◮ Which of the SNPs in X interact with target SNP A towards phenotype Y?

◮ Which of the clinical covariates X interact with treatment A towards outcome Y?

L. Tian et al. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. JASA 109, 1517–1532.

Modified outcome method to model δ: $Y' = 2\,YA, \qquad \delta(X) = \tfrac{1}{2}\,\mathbb{E}[Y'|X].$

No need to model the main effects µ!

SLIDE 30

Modified outcome

$Y = \mu(X) + A\,\delta(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),$ i.e. $Y = \mathbb{E}[Y | A = a, X] + \epsilon.$

– $\mu(X) = \tfrac{1}{2}\big(\mathbb{E}[Y|A=1,X] + \mathbb{E}[Y|A=-1,X]\big)$
– $\delta(X) = \tfrac{1}{2}\big(\mathbb{E}[Y|A=1,X] - \mathbb{E}[Y|A=-1,X]\big).$

◮ Introduce $\tilde{A} = \tfrac{1}{2}(A + 1) \in \{0, 1\}$:

$$\delta(X) = \frac{1}{2}\,\mathbb{E}\left[ Y \left( \frac{\tilde{A}}{\pi(\tilde{A}=1|X)} - \frac{1-\tilde{A}}{\pi(\tilde{A}=0|X)} \right) \,\middle|\, X \right].$$

◮ Modified outcome:

$$Y' = Y \left( \frac{\tilde{A}}{\pi(\tilde{A}=1|X)} - \frac{1-\tilde{A}}{\pi(\tilde{A}=0|X)} \right).$$
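A small simulation illustrates why the modified outcome works: regressing Y′/2 on X recovers δ without ever modeling the main effect µ. All data and coefficients below are hypothetical; the propensity is the constant 1/2 of a randomized setting, in which Y′ reduces to 2YA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): n samples, binary target variable A
n = 10_000
A = rng.choice([-1, 1], size=n)     # target SNP allele / treatment arm
A01 = (A + 1) // 2                  # recoded to {0, 1} (the A-tilde of the slide)
X = rng.normal(size=n)
mu = 2.0 * X                        # main effect, never modeled below
delta = 1.5 * X                     # interaction effect we want to recover
Y = mu + A * delta + rng.normal(size=n)

# Modified outcome with known propensity pi(A=1|X) = 0.5 (randomized setting)
pi1 = np.full(n, 0.5)
Y_mod = Y * (A01 / pi1 - (1 - A01) / (1 - pi1))   # equals 2*Y*A here

# delta(X) = 1/2 E[Y'|X]: a linear regression of Y_mod/2 on X recovers ~1.5
coef = np.polyfit(X, Y_mod / 2.0, 1)[0]
print(coef)   # close to the true interaction slope 1.5
```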


SLIDE 31

Propensity scores

◮ In GWAS, the target SNP is not independent from the rest of the genome because of linkage disequilibrium.

◮ Estimate propensity scores $\pi(\tilde{A}|X)$.

◮ Use genomic structure ⇒ Hidden Markov Model. – Hidden states: contiguous clusters of phased haplotypes; – Emission states: SNPs.

◮ Typically used for: ◮ imputing missing values; ◮ constructing knockoffs for FDR control.

– P. Scheet and M. Stephens (2006). A fast and flexible statistical model for large-scale population genotype data, AJHG 78, 629–644.
– M. Sesia, C. Sabatti and E. J. Candès (2018). Gene hunting with hidden Markov model knockoffs, Biometrika.

SLIDE 32

Modified outcome variants

$$Y' = Y \left( \frac{\tilde{A}}{\pi(\tilde{A}=1|X)} - \frac{1-\tilde{A}}{\pi(\tilde{A}=0|X)} \right).$$

◮ Propensity scores tend to be close to 0. ◮ Shifted modified outcome: $\pi(\tilde{A}|X) \leftarrow \pi(\tilde{A}|X) + \xi$. ◮ Robust modified outcome.

J. M. Robins, A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed, J. Am. Stat. Ass., 427 (89), 846–866.

SLIDE 33

Evaluating the support of δ

◮ $\delta(X) = \tfrac{1}{2}\,\mathbb{E}[Y'|X].$

◮ Use an elastic net regression to relate Y′ and X:

$$\arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \big( Y'_i - \beta^\top X_i \big)^2 + \lambda \big( (1-\alpha) \|\beta\|_1 + \alpha \|\beta\|_2^2 \big).$$

α small → sparsity.

◮ Add stability selection: ◮ B bootstrap samples; ◮ rank features based on the area under the stability path.

A.-C. Haury et al. (2012), TIGRESS: Trustful Inference of Gene REgulation using Stability Selection, BMC Sys. Bio. 6.
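A minimal scikit-learn sketch of this step, on simulated modified outcomes; all data and parameter values are hypothetical. Note that sklearn's ElasticNet parametrizes the penalty as alpha·l1_ratio·‖β‖₁ + ½·alpha·(1−l1_ratio)·‖β‖₂², which differs from the slide's (λ, α) convention, and that plain selection frequencies across bootstraps stand in here for the full stability-path area used in TIGRESS.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)

# Hypothetical modified-outcome data: only features 0 and 1 are in supp(delta)
n, p = 200, 50
X = rng.normal(size=(n, p))
y_prime = X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)

# Stability selection: refit a sparse elastic net on B bootstrap samples
# and count how often each feature enters the support.
B = 50
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap resample
    model = ElasticNet(alpha=0.1, l1_ratio=0.9)      # mostly-L1 -> sparse
    model.fit(X[idx], y_prime[idx])
    counts += model.coef_ != 0
frequencies = counts / B

# The truly interacting features should be selected in (almost) every resample
print(np.argsort(frequencies)[-2:])
```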


SLIDE 34

Simulations

$$\pi_Y = \underbrace{\beta_{i,V}^\top X_V}_{\text{synergy with } A} + \underbrace{\beta_W^\top X_W}_{\text{marginal effects}} + \underbrace{X_{Z_1}^\top \mathrm{diag}(\beta_{Z_1,Z_2})\, X_{Z_2}}_{\text{quadratic effects}}, \qquad \pi_Y = \mathrm{logit}\big(P(Y = 1 | \tilde{A} = i, X)\big).$$

– p = 5 000, n = 500. – |V| = |W| = |Z1| = |Z2| = 8. – |V ∩ W| = 2, |V ∩ Z1| = 2.
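The simulation model can be sketched as follows; the sizes, index sets and effect values are hypothetical stand-ins (the actual design uses p = 5 000 and the overlap pattern listed on this slide, only part of which is reproduced here).

```python
import numpy as np

rng = np.random.default_rng(1)

# Scaled-down sketch of the simulation design
n, p = 500, 100
X = rng.integers(0, 3, size=(n, p)).astype(float)  # SNPs coded 0/1/2
A = rng.choice([0, 1], size=n)                     # target variable, as {0, 1}

V = [0, 1]              # SNPs in synergy with A (effect depends on A)
W_set = [1, 2]          # SNPs with marginal effects (overlapping V)
Z1, Z2 = [3, 4], [5, 6] # SNP pairs with quadratic effects

beta_V = np.array([[0.5, -0.5],    # row i = synergy coefficients when A = i
                   [-0.5, 0.5]])
beta_W = np.array([0.3, 0.3])
beta_Q = np.array([0.2, 0.2])      # one coefficient per (Z1[j], Z2[j]) pair

# pi_Y = logit P(Y = 1 | A = i, X): synergy + marginal + quadratic terms
logits = ((beta_V[A] * X[:, V]).sum(axis=1)
          + X[:, W_set] @ beta_W
          + (X[:, Z1] * X[:, Z2]) @ beta_Q)
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
```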


SLIDE 35

Simulations

$$\pi_Y = \underbrace{\beta_{i,V}^\top X_V}_{\text{synergy with } A} + \underbrace{\beta_W^\top X_W}_{\text{marginal effects}} + \underbrace{X_{Z_1}^\top \mathrm{diag}(\beta_{Z_1,Z_2})\, X_{Z_2}}_{\text{quadratic effects}}.$$

[Figure: ROC (sensitivity vs 1 − specificity) and precision–recall curves for outcome weighted learning, modified outcome, normalized modified outcome, shifted modified outcome, robust modified outcome, product LASSO, and GBOOST.]

SLIDE 36

epiGWAS: Detecting epistasis with a target SNP.

◮ searches for a sum of quadratic effects with the target SNP; ◮ accounts for main effects; ◮ models linkage disequilibrium.

L. Slim, C. Chatelain, C.-A. Azencott, J.-P. Vert (2018). Novel methods for epistasis detection in genome-wide association studies, bioRxiv. CRAN/epiGWAS

SLIDE 37

Looking ahead

◮ Robustness/stability: stability selection is time-consuming.

◮ Complex interaction patterns: epiGWAS is limited to a sum of quadratic interactions between one target SNP and the rest of the genome.

◮ Statistical significance: – Significant pattern mining [Llinares-López et al., Bioinformatics 2018]. – Post-selection inference: for the lasso [Lee et al., AoS 2016]; for higher-order interactions [Suzumura et al., ICML 2017]; ongoing work with L. Slim on kernel PSI. – Controlling FDR with knockoffs [Sesia et al., Biometrika 2018].

SLIDE 38

source: http://www.flickr.com/photos/wwworks/

CBIO: Héctor Climente González, Lotfi Slim, Jean-Philippe Vert (Google Brain). Formerly MLCB Tübingen: Karsten Borgwardt (ETH Zürich, Switzerland), Dominik Grimm (Weihenstephan, Germany), Mahito Sugiyama (National Institute of Informatics, Japan). Osaka University & RIKEN AIP: Yoshinobu Kawahara. Sanofi: Clément Chatelain.

SLIDE 39

WiMLDS Paris

Paris Women in Machine Learning and Data Science.

◮ March 12, 19:30

Human body extraction from images – Gül Varol (INRIA Willow). Data is beautiful, please don't ruin it – Anne-Marie Tousch (Criteo Lab). Salary negotiation workshop – Natalie Cernecka.

◮ March 28, 19:00 – Women, science and society

Women, probability and finance – Nicole El Karoui. The feminist, the economist and the city – Hélène Périvier. Open discussion.