Feature selection in high dimension for precision medicine

SLIDE 1

Feature selection in high dimension for precision medicine

Chloé-Agathe Azencott

Center for Computational Biology (CBIO) Mines ParisTech – Institut Curie – INSERM U900 PSL Research University, Paris, France

March 21, 2017 – MACARON Workshop

http://cazencott.info chloe-agathe.azencott@mines-paristech.fr @cazencott

SLIDE 2

Precision Medicine

◮ Treatment adapted to the (genetic) specificities of the patient. E.g. Trastuzumab for HER2+ breast cancer.

◮ Data-driven biology/medicine: identify similarities between patients that exhibit similar susceptibilities / prognoses / responses to treatment.

SLIDE 3

Sequencing costs

SLIDE 4

Big data!

SLIDE 5

Big data!

SLIDE 6

SLIDE 7

GWAS: Genome-Wide Association Studies

Which genomic features explain the phenotype?

p = 10⁵ – 10⁷ Single Nucleotide Polymorphisms (SNPs)
n = 10² – 10⁴ samples

◮ High-dimensional (large p)
◮ Low sample size (small n)

SLIDE 8

Google Flu Trends

  • D. Lazer, R. Kennedy, G. King and A. Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science 2014.

◮ p = 50 million search terms
◮ n = 1152 data points
◮ Predictive search terms include keywords related to high school basketball.

SLIDE 9

Is extracting information from this data doomed from the start?

SLIDE 10

GWAS successes

[Figure: examples of GWAS findings, including multiple sclerosis, the HaemGen consortium, and ankylosing spondylitis.]

  • P. Visscher, M. Brown, M. McCarthy, J. Yang. Five years of GWAS discovery. AJHG 2012.

SLIDE 11

Missing heritability

GWAS fail to explain most of the heritable variability of complex traits. Many possible reasons:
– non-genetic / non-SNP factors
– heterogeneity of the phenotype
– rare SNPs
– weak effect sizes
– few samples in high dimension (p ≫ n)
– joint effects of multiple SNPs.

SLIDE 12

Integrating prior knowledge

Use additional data and prior knowledge to constrain the feature selection procedure.

– Consistent with previously established knowledge
– More easily interpretable
– Increased statistical power.

Prior knowledge can be represented as structure:

– Linear structure of DNA
– Groups: e.g. pathways
– Networks (molecular, 3D structure).

SLIDE 13

Regularized relevance

Set V of p variables.

◮ Relevance score $R : 2^V \to \mathbb{R}$

Quantifies the importance of any subset of variables for the question under consideration. E.g. correlation, HSIC, statistical test of association.

◮ Structured regularizer $\Omega : 2^V \to \mathbb{R}$

Promotes a sparsity pattern that is compatible with the constraint on the feature space. E.g. cardinality $\Omega : S \mapsto |S|$.

◮ Regularized relevance

$$\arg\max_{S \subseteq V} \; R(S) - \lambda\,\Omega(S)$$
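To make the framework concrete, here is a minimal sketch (assumed, not from the talk) of greedy forward selection for this objective, with a toy correlation-based relevance score and the cardinality regularizer; all names are illustrative.

```python
import numpy as np

def relevance(X, y, S):
    """R(S): sum of squared Pearson correlations of the selected features with y."""
    if not S:
        return 0.0
    Xc = X[:, sorted(S)] - X[:, sorted(S)].mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return float(np.sum(corr ** 2))

def greedy_select(X, y, lam=0.05, max_size=20):
    """Greedily add the feature that most increases R(S) - lam * |S|."""
    S, best = set(), 0.0
    for _ in range(max_size):
        gains = [(relevance(X, y, S | {j}) - lam * (len(S) + 1), j)
                 for j in range(X.shape[1]) if j not in S]
        score, j = max(gains)
        if score <= best:            # no candidate improves the objective
            break
        S.add(j)
        best = score
    return S

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(200, 50)).astype(float)   # toy SNP matrix
y = X[:, 3] - X[:, 17] + rng.normal(size=200)             # two causal SNPs
print(greedy_select(X, y))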

SLIDE 14

Network-guided multi-locus GWAS

Goal: Find a set of explanatory SNPs compatible with a given network structure.

SLIDE 15

Network-guided GWAS

◮ Additive test of association: SKAT [Wu et al. 2011]

$$R(S) = \sum_{i \in S} c_i, \qquad c_i = \left(G_i^\top (y - \mu)\right)^2$$

◮ Sparse Laplacian regularization

$$\Omega : S \mapsto \sum_{i \in S} \sum_{j \notin S} W_{ij} + \alpha\,|S|$$

◮ Regularized maximization of R

$$\arg\max_{S \subseteq V} \; \underbrace{\sum_{i \in S} c_i}_{\text{association}} \;-\; \underbrace{\eta\,|S|}_{\text{sparsity}} \;-\; \underbrace{\lambda \sum_{i \in S} \sum_{j \notin S} W_{ij}}_{\text{connectivity}}$$
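A minimal sketch (assumed, not the authors' code) of these two ingredients: the per-SNP scores $c_i$, with $\mu$ taken as the phenotype mean (the covariate-free case), and the value of the sparse Laplacian regularizer for a candidate set S.

```python
import numpy as np

def skat_scores(G, y):
    """Per-SNP association scores c_i = (G_i^T (y - mu))^2, with mu the mean phenotype."""
    return (G.T @ (y - y.mean())) ** 2          # shape (p,)

def laplacian_penalty(W, S, alpha):
    """Omega(S) = sum_{i in S, j not in S} W_ij + alpha * |S| for the SNP network W."""
    in_S = np.zeros(W.shape[0], dtype=bool)
    in_S[list(S)] = True
    return W[np.ix_(in_S, ~in_S)].sum() + alpha * in_S.sum()
```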

SLIDE 16

Minimum cut reformulation

The graph-regularized maximization of the score above is equivalent to an s/t-min-cut on a graph with adjacency matrix A and two additional nodes s and t, where $A_{ij} = \lambda W_{ij}$ for $1 \le i, j \le p$ and the weights of the edges adjacent to s and t are defined as

$$A_{si} = \begin{cases} c_i - \eta & \text{if } c_i > \eta \\ 0 & \text{otherwise} \end{cases} \qquad A_{it} = \begin{cases} \eta - c_i & \text{if } c_i < \eta \\ 0 & \text{otherwise.} \end{cases}$$

SConES: Selecting Connected Explanatory SNPs.
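A hedged sketch of this reformulation using networkx min-cut (not the authors' implementation, which relies on a dedicated max-flow solver); the SNPs left on the source side of the minimum cut form the selected set S.

```python
import networkx as nx
import numpy as np

def scones_min_cut(c, W, eta, lam):
    """Select S maximizing sum_{i in S} c_i - eta*|S| - lam*sum_{i in S, j not in S} W_ij."""
    Gr = nx.DiGraph()
    Gr.add_node("s")
    Gr.add_node("t")
    # SNP-SNP edges: capacity lam * W_ij in both directions (undirected cut)
    for i, j in zip(*np.nonzero(W)):
        if i != j:
            Gr.add_edge(int(i), int(j), capacity=lam * float(W[i, j]))
    # Terminal edges encode the per-SNP net gain c_i - eta
    for i in range(len(c)):
        if c[i] > eta:
            Gr.add_edge("s", i, capacity=float(c[i] - eta))
        elif c[i] < eta:
            Gr.add_edge(i, "t", capacity=float(eta - c[i]))
    _, (source_side, _) = nx.minimum_cut(Gr, "s", "t")
    return sorted(i for i in source_side if i != "s")
```

With the previous sketch, `scones_min_cut(skat_scores(G, y), W, eta, lam)` returns the selected SNP indices.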

SLIDE 17

Comparison partners

◮ Univariate linear regression (one SNP i at a time)

$$y_k = \alpha_0 + \beta\, G_{ik}$$

◮ Lasso

$$\arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{\tfrac{1}{2}\,\lVert y - G\beta \rVert_2^2}_{\text{loss}} \;+\; \underbrace{\eta\,\lVert \beta \rVert_1}_{\text{sparsity}}$$

◮ Feature selection with sparsity and connectivity constraints

$$\arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{L(y, G\beta)}_{\text{loss}} \;+\; \underbrace{\eta\,\lVert \beta \rVert_1}_{\text{sparsity}} \;+\; \underbrace{\lambda\,\Omega(\beta)}_{\text{connectivity}}$$

– ncLasso: network-connected Lasso [Li and Li, Bioinformatics 2008]
– Overlapping group Lasso [Jacob et al., ICML 2009]
  – groupLasso: e.g. SNPs near the same gene grouped together
  – graphLasso: 1 edge = 1 group.
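For the plain Lasso baseline, a minimal scikit-learn sketch (toy data and an illustrative regularization strength); the network-regularized variants above require dedicated solvers and are not shown.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(150, 1000)).astype(float)   # toy genotype matrix
y = G[:, 10] - 0.8 * G[:, 250] + rng.normal(size=150)      # toy phenotype, 2 causal SNPs

lasso = Lasso(alpha=0.1).fit(G, y)
selected = np.flatnonzero(lasso.coef_)    # indices of SNPs with non-zero weights
print(selected)
```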

SLIDE 18

Runtime

[Figure: CPU runtime in seconds vs. number of SNPs, both on log scales, for graphLasso, ncLasso, ncLasso (accelerated), SConES and linear regression; n = 200, exponential random network (2% density).]

SLIDE 19

Experiments: Performance on simulated data

◮ Arabidopsis thaliana genotypes

n = 500 samples, p = 1,000 SNPs
TAIR protein–protein interaction data, ∼50×10⁶ edges

◮ Higher power and lower FDR than comparison partners, except for groupLasso when groups = causal structure
◮ Fairly robust to missing edges
◮ Fails if the network is random.

SLIDE 20

Arabidopsis thaliana flowering time

17 flowering time phenotypes

[Atwell et al., Nature, 2010]

p ∼ 170,000 SNPs (after MAF filtering)
n ∼ 150 samples
165 candidate genes

[Segura et al., Nat Genet 2012]

Correction for population structure: regress out PCs.
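A hedged sketch of this correction (the choice of 10 PCs is illustrative, not from the talk): compute the top genotype principal components and regress them out of the phenotype before running the selection.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def regress_out_pcs(G, y, n_pcs=10):
    """Return the phenotype residuals after regressing out the top genotype PCs."""
    pcs = PCA(n_components=n_pcs).fit_transform(G)      # population-structure axes
    y_hat = LinearRegression().fit(pcs, y).predict(pcs)
    return y - y_hat
```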

doi:10.1242/jcs.096941

SLIDE 21

Arabidopsis thaliana flowering time

[Figure: number of selected SNPs (150–600) and number of candidate genes hit, for Univariate, Lasso, groupLasso, ncLasso and SConES.]

◮ SConES selects about as many SNPs as other network-guided approaches but detects more candidates.

SLIDE 22

Arabidopsis thaliana flowering time

Predictivity of selected SNPs

[Figure: predictive R² (0 to 1) of the SNPs selected by Lasso, groupLasso, ncLasso and SConES, for the 0W and LN22 phenotypes.]

SLIDE 23

SConES: Selecting Connected Explanatory SNPs

◮ selects connected, explanatory SNPs;
◮ incorporates large networks into GWAS;
◮ is efficient, effective and robust.

C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara and K. Borgwardt (2013). Efficient network-guided multi-locus association mapping with graph cuts. Bioinformatics 29 (13), i171–i179. doi:10.1093/bioinformatics/btt238

https://github.com/chagaz/scones
https://github.com/chagaz/sfan
https://github.com/dominikgrimm/easyGWASCore

SLIDE 24

Multi-trait GWAS

Increase sample size by jointly performing GWAS for multiple related phenotypes

SLIDE 25

Toxicogenetics / Pharmacogenomics

Tasks (phenotypes) = chemical compounds

  • F. Eduati, L. Mangravite, et al. (2015). Prediction of human population responses to toxic compounds by a collaborative competition. Nature Biotechnology, 33 (9), 933–940. doi:10.1038/nbt.3299

SLIDE 26

Multi-SConES

T related phenotypes.

◮ Goal: obtain similar sets of features on related tasks.

$$\arg\max_{S_1, \dots, S_T \subseteq V} \; \sum_{t=1}^{T} \Bigg[ \sum_{i \in S_t} c_i \;-\; \eta\,|S_t| \;-\; \lambda \sum_{i \in S_t} \sum_{j \notin S_t} W_{ij} \;-\; \underbrace{\mu\,|S_{t-1} \,\Delta\, S_t|}_{\text{task sharing}} \Bigg]$$

where $S \,\Delta\, S' = (S \cup S') \setminus (S \cap S')$ is the symmetric difference.

◮ Can be reduced to the single-task problem by building a meta-network (see the sketch below).
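A hedged sketch of one way to build such a meta-network (assumed from the description, not the authors' code): stack T copies of the SNP network and connect the copies of each SNP across tasks with weight μ/λ, so that the single-task connectivity term λ·cut charges exactly μ for every SNP in a symmetric difference.

```python
import numpy as np

def build_meta_network(W, T, mu, lam):
    """T diagonal copies of W, plus mu/lam edges between copies of the same SNP."""
    p = W.shape[0]
    meta = np.zeros((T * p, T * p))
    for t in range(T):
        meta[t * p:(t + 1) * p, t * p:(t + 1) * p] = W    # within-task SNP network
    for t in range(T):
        for u in range(t + 1, T):
            idx_t = np.arange(t * p, (t + 1) * p)
            idx_u = np.arange(u * p, (u + 1) * p)
            meta[idx_t, idx_u] = mu / lam                  # same SNP in tasks t and u
            meta[idx_u, idx_t] = mu / lam
    return meta
```

Which task pairs are connected (here: all pairs) depends on which pairs the symmetric-difference penalty couples.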

SLIDE 27

Multi-SConES: Multiple related tasks

Simulations: retrieving causal features

[Figure: MCC for retrieving causal features under simulation Models 1–4, comparing methods (LA, CR, EN, GL, GR, AG, SC) in single-task, two-task, three-task and four-task settings.]

  • M. Sugiyama, C.-A. Azencott, D. Grimm, Y. Kawahara and K. Borgwardt (2014). Multi-task feature selection on multiple networks via maximum flows. SIAM ICDM, 199–207. doi:10.1137/1.9781611973440.23

https://github.com/mahito-sugiyama/Multi-SConES
https://github.com/chagaz/sfan

SLIDE 28

Leveraging similarity between tasks

Use prior knowledge about the relationship between the tasks: $\Omega \in \mathbb{R}^{T \times T}$

$$\arg\max_{S_1, \dots, S_T \subseteq V} \; \sum_{t=1}^{T} \Bigg[ \sum_{i \in S_t} c_i \;-\; \eta\,|S_t| \;-\; \lambda \sum_{i \in S_t} \sum_{j \notin S_t} W_{ij} \;-\; \underbrace{\mu \sum_{u=1}^{T} \sum_{i \in S_t \cap S_u} \Omega^{-1}_{tu}}_{\text{task sharing}} \Bigg]$$

Can also be mapped to a meta-network. Code: http://github.com/chagaz/sfan

SLIDE 29

Multiplicative Multitask Lasso with Task Descriptors

◮ Multitask Lasso [Obozinski et al. 2006]

$$\arg\min_{\beta \in \mathbb{R}^{T \times p}} \; \underbrace{L\!\left(y^t_m,\; \sum_{i=1}^{p} \beta^t_i\, g^t_{mi}\right)}_{\text{loss}} \;+\; \lambda \underbrace{\sum_{i=1}^{p} \lVert \beta_i \rVert_2}_{\text{task sharing}}$$

◮ Multilevel Multitask Lasso [Lozano and Swirszcz, 2012]

$$\arg\min_{\theta \in \mathbb{R}^p_+,\; \gamma \in \mathbb{R}^{T \times p}} \; \underbrace{L\!\left(y^t_m,\; \sum_{i=1}^{p} \theta_i\,\gamma^t_i\, g^t_{mi}\right)}_{\text{loss}} \;+\; \lambda_1 \underbrace{\lVert \theta \rVert_1}_{\text{sparsity}} \;+\; \lambda_2 \underbrace{\sum_{i=1}^{p}\sum_{t=1}^{T} |\gamma^t_i|}_{\text{task sharing}}$$

◮ Multiplicative Multitask Lasso with Task Descriptors

$$\arg\min_{\theta \in \mathbb{R}^p_+,\; \alpha \in \mathbb{R}^{p \times L}} \; \underbrace{L\!\left(y^t_m,\; \sum_{i=1}^{p} \theta_i \Big(\sum_{l=1}^{L} \alpha_{il}\, d^t_l\Big) g^t_{mi}\right)}_{\text{loss}} \;+\; \lambda_1 \underbrace{\lVert \theta \rVert_1}_{\text{sparsity}} \;+\; \lambda_2 \underbrace{\sum_{i=1}^{p}\sum_{l=1}^{L} |\alpha_{il}|}_{\text{task sharing}}$$
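A hedged sketch of the model structure (the fitting procedure itself is not shown): the effective weight of SNP i in task t factorizes as $\theta_i \sum_l \alpha_{il}\, d^t_l$, where $d^t$ is the descriptor vector of task t; $\theta$ carries the sparsity shared across tasks and $\alpha$ couples features to task descriptors.

```python
import numpy as np

def effective_weights(theta, alpha, D):
    """theta: (p,), alpha: (p, L), D: (T, L) descriptors -> beta: (T, p) weights."""
    return (alpha @ D.T).T * theta     # beta[t, i] = theta_i * sum_l alpha_il * d^t_l

def predict_task(G_t, theta, alpha, d_t):
    """Predictions for a (possibly unseen) task with genotypes G_t and descriptor d_t."""
    return G_t @ (theta * (alpha @ d_t))
```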

SLIDE 30

Multiplicative Multitask Lasso with Task Descriptors

$$\arg\min_{\theta \in \mathbb{R}^p_+,\; \alpha \in \mathbb{R}^{p \times L}} \; \underbrace{L\!\left(y^t_m,\; \sum_{i=1}^{p} \theta_i \Big(\sum_{l=1}^{L} \alpha_{il}\, d^t_l\Big) g^t_{mi}\right)}_{\text{loss}} \;+\; \lambda_1 \underbrace{\lVert \theta \rVert_1}_{\text{sparsity}} \;+\; \lambda_2 \underbrace{\sum_{i=1}^{p}\sum_{l=1}^{L} |\alpha_{il}|}_{\text{task sharing}}$$

◮ On simulations:

  ◮ Sparser solutions
  ◮ Better recovery of the true features (higher PPV)
  ◮ Improved stability
  ◮ Better predictivity (lower RMSE).

SLIDE 31

Multiplicative Multitask Lasso with Task Descriptors

◮ Making predictions for tasks for which you have no data.

  • V. Bellón, V. Stoven, and C.-A. Azencott (2016). Multitask feature selection with task descriptors. PSB.

https://github.com/vmolina/MultitaskDescriptor

SLIDE 32

Limitations of current approaches

◮ Robustness/stability

Recovering the same SNPs when the data changes slightly.

◮ Complex epistasis patterns

– Limited to additive or quadratic effects
– Some work on e.g. random forests + importance scores.

◮ Statistical significance

– Computing p-values (see the sketch below for the univariate baseline)
– Correcting for multiple hypotheses.
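A minimal sketch (assumed, not from the talk) of per-SNP p-values with a Bonferroni correction for the univariate baseline; obtaining valid p-values for multivariate selectors such as SConES is precisely the open issue listed here.

```python
import numpy as np
from scipy import stats

def univariate_pvalues(G, y):
    """Two-sided p-value of a Pearson correlation test for each SNP."""
    return np.array([stats.pearsonr(G[:, i], y)[1] for i in range(G.shape[1])])

def bonferroni_hits(pvalues, alpha=0.05):
    """Indices of SNPs that stay significant after Bonferroni correction."""
    return np.flatnonzero(pvalues < alpha / len(pvalues))
```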

SLIDE 33

https://github.com/chagaz/

source: http://www.flickr.com/photos/wwworks/

CBIO: Víctor Bellón, Yunlong Jiao, Véronique Stoven, Athénaïs Vaginay, Nelle Varoquaux, Jean-Philippe Vert, Thomas Walter. MLCB Tübingen: Karsten Borgwardt, Aasa Feragen, Dominik Grimm, Theofanis Karaletsos, Niklas Kasenburg, Christoph Lippert, Barbara Rakitsch, Damian Roqueiro, Nino Shervashidze, Oliver Stegle, Mahito Sugiyama. MPI for Intelligent Systems: Lawrence Cayton, Bernhard Schölkopf. MPI for Developmental Biology: Detlef Weigel. MPI for Psychiatry: André Altmann, Tony Kam-Thong, Bertram Müller-Myhsok, Benno Pütz. Osaka University: Yoshinobu Kawahara.
