Data Mining in Bioinformatics Day 8: Feature Selection - Epistasis

Data Mining in Bioinformatics
Day 8: Feature Selection - Epistasis Detection
Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen
Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Classic and new questions
Genetics
- How does genotypic variation lead to phenotypic variation?
- Can we predict phenotypes based on the genotype of an individual?
Recent progress
- Genotypes can be determined at an unprecedented level of detail.
- Phenotypes can be recorded in an automated manner.

Genome-wide association mapping
[Figure, by courtesy of D. Weigel]

Phenotype prediction
Arabidopsis phenotypes (99-199 plants, 250k SNPs, Atwell et al., 2010)
Phenotype               AUC (SVM)
Chlorosis at 22 °C      0.629 ± 0.003
Anthocyanin at 16 °C    0.569 ± 0.003
Anthocyanin at 22 °C    0.609 ± 0.003
Leaf Roll at 10 °C      0.696 ± 0.002
Leaf Roll at 22 °C      0.587 ± 0.004
Why is there room for improvement?
- We assume additive effects of SNPs and ignore gene-gene and gene-environment interactions.
- We ignore population structure, that is, systematic ancestry differences between cases and controls.

Epistasis - what it means I (Cordell, 2002)
Bateson's masking effect model
Bateson defines epistasis as a masking effect, whereby a variant or allele at one locus prevents the variant at another locus from manifesting its effect.
Genotype at locus B / G   gg      gG      GG
bb                        White   Grey    Grey
bB                        Black   Grey    Grey
BB                        Black   Grey    Grey
Example of phenotypes (e.g. hair colour) obtained from different genotypes at two loci interacting epistatically under Bateson's (1909) definition of epistasis.

Epistasis - what it means II (Cordell, 2002)
Epistasis in a general sense
Genotype at locus A / B   bb   bB   BB
aa                        0    0    0
aA                        0    1    1
AA                        0    1    1
Example of a penetrance table for two loci interacting epistatically in a general sense.

Epistasis - what it means III (Cordell, 2002)
Genetic heterogeneity model
Genotype at locus A / B   bb   bB   BB
aa                        0    0    1
aA                        0    0    1
AA                        1    1    1
Example of a penetrance table for two loci acting together in a heterogeneity model.
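
The two 0/1 penetrance tables from Cordell (2002) can be read as simple logical rules. The sketch below is illustrative only: it assumes genotypes are coded as the number of capital-letter alleles (0, 1 or 2), and the function names are made up for this example.

```python
# Genotypes are coded as the number of capital-letter alleles: 0, 1 or 2.
# This coding is an assumption made for the example.
GENOTYPES = (0, 1, 2)

def penetrance_general(a, b):
    """Epistasis in a general sense: affected only if BOTH loci carry
    at least one capital-letter allele."""
    return int(a >= 1 and b >= 1)

def penetrance_heterogeneity(a, b):
    """Heterogeneity model: affected if EITHER locus is homozygous
    for the capital-letter allele."""
    return int(a == 2 or b == 2)

def as_table(penetrance):
    """Rows: locus A (aa, aA, AA); columns: locus B (bb, bB, BB)."""
    return [[penetrance(a, b) for b in GENOTYPES] for a in GENOTYPES]

general = as_table(penetrance_general)
heterogeneity = as_table(penetrance_heterogeneity)

for name, table in (("general epistasis", general), ("heterogeneity", heterogeneity)):
    print(name)
    for row in table:
        print("  ", row)
```

Written this way, the difference between the two models is the difference between a logical AND across loci and a logical OR across loci.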

Epistasis - what it means IV
Regression model
Most popular statistical definition:
$y = \theta_i x_i + \theta_j x_j + \theta_{(i,j)} \, x_i \odot x_j + \epsilon$   (1)
Test whether $\theta_{(i,j)}$ is significantly different from zero; rank pairs by the resulting p-value.
Other common measures of association include e.g. the F-statistic and Pearson's correlation coefficient.
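
A minimal sketch of the test in Eq. (1) for a single SNP pair, using only the Python standard library. Several things here are assumptions rather than part of the slide: an intercept is added to the model, the data are simulated, and the p-value uses a normal approximation to the t distribution (reasonable only for large samples). The helper names `invert` and `interaction_test` are hypothetical.

```python
import math
import random
from statistics import NormalDist

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    a = [row[:] + [1.0 if r == c else 0.0 for c in range(n)]
         for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def interaction_test(xi, xj, y):
    """Least-squares fit of y = b0 + t_i*x_i + t_j*x_j + t_ij*x_i*x_j + noise.
    Returns the estimate of t_ij, its z-score and a two-sided p-value."""
    X = [[1.0, a, b, a * b] for a, b in zip(xi, xj)]
    n, p = len(X), 4
    xtx = [[sum(row[r] * row[c] for row in X) for c in range(p)] for r in range(p)]
    xty = [sum(row[r] * yv for row, yv in zip(X, y)) for r in range(p)]
    inv = invert(xtx)
    theta = [sum(inv[r][c] * xty[c] for c in range(p)) for r in range(p)]
    resid = [yv - sum(t * v for t, v in zip(theta, row)) for row, yv in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - p)      # residual variance
    se = math.sqrt(sigma2 * inv[3][3])                # standard error of t_ij
    z = theta[3] / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # normal approximation
    return theta[3], z, p_value

# Simulated data (assumption): genotypes in {0, 1, 2}, true interaction effect 1.0.
random.seed(0)
n = 500
xi = [random.choice((0, 1, 2)) for _ in range(n)]
xj = [random.choice((0, 1, 2)) for _ in range(n)]
y = [0.3 * a + 0.2 * b + 1.0 * a * b + random.gauss(0.0, 1.0) for a, b in zip(xi, xj)]

theta_ij, z, p_value = interaction_test(xi, xj, y)
print(f"theta_ij = {theta_ij:.3f}, z = {z:.2f}, p = {p_value:.2e}")
```

In a genome-wide scan this test would be repeated for every SNP pair and the pairs ranked by p-value, which is what makes the problem computationally demanding.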

Epistasis - what it means V (Marchini et al., 2005)
Model 1: Multiplicative interaction within and between loci
Locus A / B   bb                       bB                                   BB
aa            $\alpha$                 $\alpha(1+\theta_2)$                 $\alpha(1+\theta_2)^2$
aA            $\alpha(1+\theta_1)$     $\alpha(1+\theta_1)(1+\theta_2)$     $\alpha(1+\theta_1)(1+\theta_2)^2$
AA            $\alpha(1+\theta_1)^2$   $\alpha(1+\theta_1)^2(1+\theta_2)$   $\alpha(1+\theta_1)^2(1+\theta_2)^2$
The odds increase multiplicatively with genotype both within and between loci.

Epistasis - what it means VI (Marchini et al., 2005)
Model 2: Two-locus interaction with multiplicative effects
Locus A / B   bb         bB                     BB
aa            $\alpha$   $\alpha$               $\alpha$
aA            $\alpha$   $\alpha(1+\theta)$     $\alpha(1+\theta)^2$
AA            $\alpha$   $\alpha(1+\theta)^2$   $\alpha(1+\theta)^4$
In this model, the odds have a baseline value ($\alpha$) unless both loci have at least one disease-associated allele. After that, the odds increase multiplicatively within and between genotypes.

Epistasis - what it means VII (Marchini et al., 2005)
Model 3: Two-locus interaction with threshold effects
Locus A / B   bb         bB                   BB
aa            $\alpha$   $\alpha$             $\alpha$
aA            $\alpha$   $\alpha(1+\theta)$   $\alpha(1+\theta)$
AA            $\alpha$   $\alpha(1+\theta)$   $\alpha(1+\theta)$
In this model, the odds have a baseline value ($\alpha$) unless both loci have at least one disease-associated allele. In this case, the odds are $\alpha(1+\theta)$.
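
The three odds tables of Marchini et al. (2005) shown on the preceding slides can each be written as a one-line function of the allele counts. This is a sketch under the assumption that `i` and `j` count the disease-associated (capital-letter) alleles at loci A and B; the function names and the parameter values in the demo are illustrative only.

```python
def odds_model1(i, j, alpha, theta1, theta2):
    """Model 1: multiplicative within and between loci."""
    return alpha * (1 + theta1) ** i * (1 + theta2) ** j

def odds_model2(i, j, alpha, theta):
    """Model 2: baseline odds unless both loci carry a risk allele,
    then multiplicative; the exponent i*j reproduces 1, 2, 2, 4."""
    return alpha * (1 + theta) ** (i * j)

def odds_model3(i, j, alpha, theta):
    """Model 3: threshold effect; odds jump once both loci carry a risk allele."""
    return alpha * (1 + theta) if i >= 1 and j >= 1 else alpha

def odds_to_probability(odds):
    """Convert odds of disease to a disease probability (penetrance)."""
    return odds / (1 + odds)

alpha, theta = 0.1, 0.5  # illustrative values, not taken from the slides
models = {
    "Model 1": lambda i, j: odds_model1(i, j, alpha, theta, theta),
    "Model 2": lambda i, j: odds_model2(i, j, alpha, theta),
    "Model 3": lambda i, j: odds_model3(i, j, alpha, theta),
}
for name, odds in models.items():
    print(name)
    for i in range(3):  # rows: aa, aA, AA
        print("  ", [round(odds(i, j), 4) for j in range(3)])  # cols: bb, bB, BB
```

Dividing each entry by the baseline $\alpha$ gives the odds ratio relative to the aa/bb genotype, for example $1+\theta$ in the affected cells of Model 3.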

Epistasis - what it means VIII (Marchini et al., 2005)
[Figure, by courtesy of J. Marchini]

Impact of Epistasis
Examples of Epistasis
- Epistasis is conjectured to be one source of missing heritability (Manolio et al., 2009).
- Genetic interactions are one indicator that epistasis is a major factor in the genotype-phenotype relationship (e.g. Boone et al., 2007).
- Pairs of genes have been reported to affect complex diseases such as breast cancer (Ashworth et al., 2011):
  - Loss of either BRCA1 or BRCA2 tumor suppressor gene function in cells triggers a cell-cycle arrest at the G2/M checkpoint that can be suppressed by the inactivation of P53 (Connor et al., 1997 and Liu et al., 2007).
  - Loss of VHL (Von Hippel-Lindau tumor suppressor) function normally causes cellular senescence, but inactivation of a second tumor suppressor, RB (Retinoblastoma), can suppress this process (Young et al., 2008).

Bottlenecks in two-locus mapping
Scale of the problem
- Typical datasets include on the order of $10^5$ to $10^7$ SNPs.
- Hence we have to consider on the order of $10^{10}$ to $10^{14}$ SNP pairs.
- Enormous multiple hypothesis testing problem.
- Enormous computational runtime problem.
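
The arithmetic behind these numbers is short enough to check directly. The sketch below counts unordered SNP pairs, n(n-1)/2, and, as an assumption not stated on the slide, applies a Bonferroni correction at a family-wise level of 0.05 to show what the multiple testing burden means for a per-test threshold.

```python
import math

def n_pairs(n_snps):
    """Number of unordered SNP pairs: n * (n - 1) / 2."""
    return math.comb(n_snps, 2)

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction
    (the choice of correction and of alpha = 0.05 is an assumption)."""
    return alpha / n_tests

for n_snps in (10**5, 10**6, 10**7):
    pairs = n_pairs(n_snps)
    print(f"{n_snps:>10,d} SNPs -> {pairs:.2e} pairs, "
          f"per-test threshold {bonferroni_threshold(pairs):.1e}")
```

Since the pair count grows as roughly n²/2, a tenfold increase in the number of SNPs costs a hundredfold increase in both tests and runtime.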

Common approaches in the literature
Exhaustive enumeration
- Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)
Filtering approaches
- Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
- Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches
- fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
- TEAM, efficient updates of contingency tables (Zhang et al., 2010)

Family I: Exhaustive search
Exhaustive enumeration
- Concept: Run through all pairs of SNPs exhaustively.
- Setback: On standard PCs, such searches are limited to hundreds of SNPs.
- Workaround: Use special hardware, such as
  - Computing clusters
  - Graphics processing units (GPUs)
  - Cloud computing
- Current limitation: These solutions tend to work for datasets that are currently available, but they may not be able to cope with future increases in sample size or in the number of SNP markers.

Engineering approach to epistasis detection
- GPUs are heavily optimized for basic matrix operations.
- Computational power is much cheaper on GPUs than on CPUs.
- We exploit the power of GPUs for rapid exhaustive SNP-SNP interaction detection:
  - EPIBLASTER: Difference in correlation for binary phenotypes (Kam-Thong et al., EJHG 2010)
  - EPIGPUHSIC: Kernel-based test for arbitrary phenotypes (Kam-Thong et al., ISMB 2011)
  - CUDALIN: Regression model with main effects (Kam-Thong et al., submitted)
Available from http://agkb.is.tuebingen.mpg.de/Forschung/epistasistools/
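
To make "difference in correlation for binary phenotypes" concrete, here is a CPU-only sketch of the general idea for one SNP pair: compute the Pearson correlation between the two SNPs separately in cases and in controls, and take the difference. This is not the EPIBLASTER implementation; the function names, the 0/1 phenotype coding and the toy data are assumptions, and no significance test is included.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.
    Undefined (division by zero) if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_difference(snp_i, snp_j, phenotype):
    """Correlation of the SNP pair in cases (phenotype 1) minus the
    correlation in controls (phenotype 0)."""
    cases = [(a, b) for a, b, p in zip(snp_i, snp_j, phenotype) if p == 1]
    controls = [(a, b) for a, b, p in zip(snp_i, snp_j, phenotype) if p == 0]
    r_cases = pearson(*zip(*cases))
    r_controls = pearson(*zip(*controls))
    return r_cases - r_controls

# Toy data: the two SNPs are perfectly correlated in cases, uncorrelated in controls.
snp_i = [0, 1, 2, 0, 1, 2] + [0, 1, 2, 0, 1, 2]
snp_j = [0, 1, 2, 0, 1, 2] + [0, 1, 2, 2, 1, 0]
phenotype = [1] * 6 + [0] * 6

diff = correlation_difference(snp_i, snp_j, phenotype)
print(f"difference in correlation: {diff:.2f}")
```

Computing all pairwise correlations within a group amounts to a matrix product of standardized genotype matrices, which is the kind of basic matrix operation GPUs handle well.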

First finding
- 567 subjects
- 1,075,163 SNPs
- Phenotype: hippocampus volume
- Genome-wide significant results ($p < 10^{-12}$) near genes involved in cell-cell signaling
[Image by P. Sämann]

Multifactor-Dimensionality Reduction
Properties of MDR (Ritchie et al., 2001)
- A model-free and non-parametric approach to epistasis detection
- Was proposed to overcome the problem that the type of encoding of SNPs affects the results in generalized linear models; does not assume a specific genetic model
- Measures the association between SNPs and disease risk using prediction accuracy of selected multifactor models
Limitations:
- Runs exhaustively through all SNP combinations and detects the best model
- Resulting models may be difficult to interpret
- Original variant only considers balanced case-control datasets

Multifactor-Dimensionality Reduction
Algorithm of MDR
1. A set of n genetic and/or discrete environmental factors is selected from the pool of all factors.
2. The n factors and their possible multifactor classes or cells are represented in n-dimensional space. Then, the ratio of the number of cases to the number of controls is estimated within each multifactor class.
3. Each multifactor cell in n-dimensional space is labeled either as high-risk if the cases:controls ratio meets or exceeds some threshold or as low-risk if that threshold is not exceeded. This reduces the n-dimensional model to a one-dimensional model.
4. The prediction error of each model is estimated by 10 repetitions of 10-fold cross-validation.
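
Steps 1 to 3 of the MDR algorithm can be sketched in a few lines for pairs of factors (n = 2). This is a simplified illustration, not the reference implementation: it reports the plain training error instead of the repeated 10-fold cross-validation of step 4, it uses a threshold of 1.0 (which assumes a balanced dataset), and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mdr_fit(genotypes, labels, factors, threshold=1.0):
    """Steps 2-3: count cases and controls per multifactor cell and return
    the set of cells labeled high-risk (cases >= threshold * controls)."""
    cases, controls = Counter(), Counter()
    for row, label in zip(genotypes, labels):
        cell = tuple(row[f] for f in factors)
        (cases if label == 1 else controls)[cell] += 1
    observed = set(cases) | set(controls)
    return {cell for cell in observed if cases[cell] >= threshold * controls[cell]}

def mdr_error(genotypes, labels, factors, high_risk):
    """Classification error of the one-dimensional high-risk/low-risk model."""
    wrong = 0
    for row, label in zip(genotypes, labels):
        predicted = 1 if tuple(row[f] for f in factors) in high_risk else 0
        wrong += predicted != label
    return wrong / len(labels)

def mdr_best_pair(genotypes, labels, threshold=1.0):
    """Step 1, exhaustively over all pairs: return (error, factor pair)
    for the pair with the lowest training error."""
    n_factors = len(genotypes[0])
    scored = []
    for factors in combinations(range(n_factors), 2):
        high_risk = mdr_fit(genotypes, labels, factors, threshold)
        scored.append((mdr_error(genotypes, labels, factors, high_risk), factors))
    return min(scored)

# Toy data: 3 SNPs, every genotype combination once; an individual is a case
# only if SNP 0 and SNP 2 both carry at least one risk allele.
genotypes = [(a, b, c) for a in range(3) for b in range(3) for c in range(3)]
labels = [int(a >= 1 and c >= 1) for a, b, c in genotypes]

best_error, best_factors = mdr_best_pair(genotypes, labels)
print(f"best pair: {best_factors}, training error: {best_error:.3f}")
```

Replacing the training error with a cross-validated error would follow step 4 more closely; the exhaustive loop over factor combinations is what makes MDR expensive on genome-wide data.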

Family II: Filtering approaches
Two-stage procedure (popular reference: Marchini et al., 2005):
- First, reduce the set of SNPs.
- Second, compute all remaining pairs exhaustively.
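
A minimal sketch of the two-stage procedure, under stated assumptions: stage 1 ranks SNPs by the absolute Pearson correlation with the phenotype (one possible main-effect criterion), and stage 2 scores each remaining pair by the correlation between the phenotype and the mean-centered product of the two SNPs. Both scores are simple stand-ins chosen for illustration, not the statistics used by Marchini et al. (2005), and all names and the simulated data are hypothetical.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def centered_product(xi, xj):
    """Element-wise product of two SNPs after subtracting their means."""
    mi, mj = sum(xi) / len(xi), sum(xj) / len(xj)
    return [(a - mi) * (b - mj) for a, b in zip(xi, xj)]

def two_stage(snps, y, keep):
    """Stage 1: keep the `keep` SNPs with the largest main-effect score.
    Stage 2: score all pairs among the kept SNPs exhaustively."""
    ranked = sorted(range(len(snps)),
                    key=lambda k: abs(pearson(snps[k], y)), reverse=True)
    kept = sorted(ranked[:keep])
    scores = {(i, j): abs(pearson(centered_product(snps[i], snps[j]), y))
              for i, j in combinations(kept, 2)}
    return kept, scores

# Simulated data (assumption): 20 SNPs, 1000 individuals; SNPs 0 and 1 have
# main effects and interact, all other SNPs are noise.
random.seed(1)
n, m = 1000, 20
snps = [[random.choice((0, 1, 2)) for _ in range(n)] for _ in range(m)]
y = [a + b + a * b + random.gauss(0.0, 1.0) for a, b in zip(snps[0], snps[1])]

kept, scores = two_stage(snps, y, keep=5)
best = max(scores, key=scores.get)
print(f"kept SNPs: {kept}")
print(f"pairs tested: {len(scores)} instead of {math.comb(m, 2)}")
print(f"top pair: {best}")
```

The saving comes from testing only the pairs among the kept SNPs. The price is that a pair whose SNPs show no main effect on their own is discarded in stage 1 and can never be found in stage 2.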

Filtering in practice
[Figure, by courtesy of J. Marchini]
