Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 8: Feature Selection - Epistasis Detection
Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls
Classic and new questions
Genetics
- How does genotypic variation lead to phenotypic variation?
- Can we predict phenotypes based on the genotype of an individual?
Recent progress
- Genotypes can be determined at an unprecedented level of detail.
- Phenotypes can be recorded in an automated manner.
Genome-wide association mapping
by courtesy of D. Weigel
Phenotype prediction
Arabidopsis phenotypes (99-199 plants, 250k SNPs, Atwell et al., 2010)

Phenotype              AUC (SVM)
Chlorosis at 22°C      0.629 ± 0.003
Anthocyanin at 16°C    0.569 ± 0.003
Anthocyanin at 22°C    0.609 ± 0.003
Leaf Roll at 10°C      0.696 ± 0.002
Leaf Roll at 22°C      0.587 ± 0.004

Why is there room for improvement?
- We assume additive effects of SNPs and ignore gene-gene and gene-environment interactions.
- We ignore population structure, that is, systematic ancestry differences between cases and controls.
Epistasis - what it means I (Cordell, 2002)
Bateson’s masking effect model
Bateson defines epistasis as a masking effect, whereby a variant or allele at one locus prevents the variant at another locus from manifesting its effect.

Genotype at locus B/G    gg       gG      GG
bb                       White    Grey    Grey
bB                       Black    Grey    Grey
BB                       Black    Grey    Grey

Example of phenotypes (e.g. hair colour) obtained from different genotypes at two loci interacting epistatically under Bateson’s (1909) definition of epistasis.
Epistasis - what it means II (Cordell, 2002)
Epistasis in a general sense

Genotype at locus A/B    bb    bB    BB
aa                       0     0     0
aA                       0     1     1
AA                       0     1     1

Example of a penetrance table for two loci interacting epistatically in a general sense
Epistasis - what it means III (Cordell, 2002)
Genetic heterogeneity model

Genotype at locus A/B    bb    bB    BB
aa                       0     0     1
aA                       0     0     1
AA                       1     1     1

Example of a penetrance table for two loci acting together in a heterogeneity model
Epistasis - what it means IV
Regression model
Most popular statistical definition:

y = θ_i x_i + θ_j x_j + θ_{(i,j)} (x_i ⊙ x_j) + ε    (1)

where ⊙ denotes the element-wise product.
Test whether θ_{(i,j)} is significantly different from zero; rank pairs by the resulting p-value.
Other common measures of association include the F-statistic and Pearson’s correlation coefficient.
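The fit behind equation (1) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the lecture: the function names `solve` and `fit_interaction` are hypothetical, genotypes are assumed to be coded numerically (e.g. 0/1/2), and the significance test itself is left out because it needs a t-distribution from a statistics library.

```python
def solve(a, b):
    """Solve the small linear system a * x = b by Gaussian elimination
    with partial pivoting (assumes a is square and non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interaction(xi, xj, y):
    """Least-squares estimates (theta_i, theta_j, theta_ij) for
    y = theta_i*x_i + theta_j*x_j + theta_ij*(x_i*x_j) + noise.
    No intercept is fitted, mirroring equation (1)."""
    cols = [xi, xj, [a * b for a, b in zip(xi, xj)]]
    # Normal equations: (X^T X) theta = X^T y
    xtx = [[sum(p * q for p, q in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(p * v for p, v in zip(c, y)) for c in cols]
    return solve(xtx, xty)
```

Ranking pairs as the slide describes would additionally require the standard error of the interaction estimate and the corresponding p-value.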
Epistasis - what it means V (Marchini et al., 2005)
Model 1: Multiplicative interaction within and between loci

Locus A/B    bb            bB                   BB
aa           α             α(1+θ_2)             α(1+θ_2)^2
aA           α(1+θ_1)      α(1+θ_1)(1+θ_2)      α(1+θ_1)(1+θ_2)^2
AA           α(1+θ_1)^2    α(1+θ_1)^2(1+θ_2)    α(1+θ_1)^2(1+θ_2)^2

The odds increase multiplicatively with genotype both within and between loci.
Epistasis - what it means VI (Marchini et al., 2005)
Model 2: Two-locus interaction with multiplicative effects

Locus A/B    bb    bB          BB
aa           α     α           α
aA           α     α(1+θ)      α(1+θ)^2
AA           α     α(1+θ)^2    α(1+θ)^4

In this model, the odds have a baseline value (α) unless both loci have at least one disease-associated allele. After that, the odds increase multiplicatively within and between genotypes.
Epistasis - what it means VII (Marchini et al., 2005)
Model 3: Two-locus interaction with threshold effects

Locus A/B    bb    bB        BB
aa           α     α         α
aA           α     α(1+θ)    α(1+θ)
AA           α     α(1+θ)    α(1+θ)

In this model, the odds have a baseline value (α) unless both loci have at least one disease-associated allele. In that case, the odds are α(1+θ).
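The three odds tables (Models 1 to 3) can be generated from one small function. This is an illustrative sketch: the name `odds_table` and its signature are hypothetical, and it assumes that rows and columns are indexed by the number of disease-associated alleles (0, 1, 2) at locus A and locus B.

```python
def odds_table(model, alpha, theta1, theta2=None):
    """Return a 3x3 table odds[a][b], where a and b are the numbers of
    disease-associated alleles at locus A and locus B (assumed coding)."""
    if theta2 is None:
        theta2 = theta1
    table = []
    for a in range(3):
        row = []
        for b in range(3):
            if model == 1:
                # Multiplicative within and between loci
                odds = alpha * (1 + theta1) ** a * (1 + theta2) ** b
            elif model == 2:
                # Baseline unless both loci carry a risk allele;
                # the exponent a*b gives 1, 2, 2, 4 in the lower-right block
                odds = alpha * (1 + theta1) ** (a * b)
            elif model == 3:
                # Threshold effect: a single elevated level
                odds = alpha * (1 + theta1) if a > 0 and b > 0 else alpha
            else:
                raise ValueError("model must be 1, 2 or 3")
            row.append(odds)
        table.append(row)
    return table
```

For example, `odds_table(2, 1.0, 1.0)` reproduces the exponent pattern of Model 2 with α = 1 and θ = 1.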
Epistasis - what it means VIII (Marchini et al., 2005)
by courtesy of J. Marchini
Impact of Epistasis
Examples of epistasis
- Epistasis is conjectured to be one source of missing heritability (Manolio et al., 2009).
- Genetic interactions are one indicator that epistasis is a major factor in the genotype-phenotype relationship (e.g. Boone et al., 2007).
- Pairs of genes have been reported to affect complex diseases such as breast cancer (Ashworth et al., 2011):
  - Loss of either BRCA1 or BRCA2 tumor suppressor gene function in cells triggers a cell-cycle arrest at the G2/M checkpoint that can be suppressed by the inactivation of P53 (Connor et al., 1997; Liu et al., 2007).
  - Loss of VHL (Von Hippel-Lindau tumor suppressor) function normally causes cellular senescence, but inactivation of a second tumor suppressor, RB (Retinoblastoma), can suppress this process (Young et al., 2008).
Bottlenecks in two-locus mapping
Scale of the problem
- Typical datasets include on the order of 10^5 to 10^7 SNPs.
- Hence we have to consider on the order of 10^10 to 10^14 SNP pairs.
- Enormous multiple hypothesis testing problem.
- Enormous computational runtime problem.
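The orders of magnitude above follow from the number of unordered pairs, n(n − 1)/2. The short sketch below makes the arithmetic explicit. The Bonferroni correction is used here only as one common way to illustrate the multiple-testing burden; it is an assumption of this example and not a method named on the slide, and the function names are hypothetical.

```python
from math import comb


def num_pairs(num_snps):
    """Number of unordered SNP pairs, n * (n - 1) / 2."""
    return comb(num_snps, 2)


def bonferroni_threshold(num_snps, alpha=0.05):
    """Per-test significance level if every SNP pair is tested once
    (Bonferroni correction, chosen here for illustration)."""
    return alpha / num_pairs(num_snps)


for n in (10**5, 10**6, 10**7):
    print(n, num_pairs(n), bonferroni_threshold(n))
```

With 10^5 SNPs there are about 5 × 10^9 pairs, and with 10^7 SNPs about 5 × 10^13.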
Common approaches in the literature
Exhaustive enumeration
- Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)
Filtering approaches
- Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
- Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches
- fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
- TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Family I: Exhaustive search
Exhaustive enumeration
- Concept: Run through all pairs of SNPs exhaustively.
- Setback: On standard PCs, such searches are limited to hundreds of SNPs.
- Workaround: Use special hardware, such as
  - Computing clusters
  - Graphical processing units
  - Cloud computing
- Current limitation: These solutions tend to work for datasets that are currently available, but they may not be able to cope with an increase in sample size or SNP marker number in the future.
Engineering approach to epistasis detection
- GPUs are heavily optimized for basic matrix operations.
- Computational power in terms of GPUs is much cheaper than in terms of CPUs.
- We exploit the power of GPUs for rapid exhaustive SNP-SNP interaction detection:
  - EPIBLASTER: Difference in correlation for binary phenotypes (Kam-Thong et al., EJHG 2010)
  - EPIGPUHSIC: Kernel-based test for arbitrary phenotypes (Kam-Thong et al., ISMB 2011)
  - CUDALIN: Regression model with main effects (Kam-Thong et al., submitted)
- Available from http://agkb.is.tuebingen.mpg.de/Forschung/epistasistools/
First finding
by P. Sämann
- 567 subjects
- 1,075,163 SNPs
- Phenotype: hippocampus volume
- Genome-wide significant results (p < 10^-12) near genes involved in cell-cell signaling
Multifactor-Dimensionality Reduction
Properties of MDR (Ritchie et al., 2001)
- A model-free and non-parametric approach to epistasis detection.
- Proposed to overcome the problem that the type of encoding of SNPs affects the results in generalized linear models; it does not assume a specific genetic model.
- Measures the association between SNPs and disease risk using the prediction accuracy of selected multifactor models.
Limitations:
- Runs exhaustively through all SNP combinations and detects the best model.
- Resulting models may be difficult to interpret.
- The original variant only considers balanced case-control datasets.
Multifactor-Dimensionality Reduction
Algorithm of MDR
1. A set of n genetic and/or discrete environmental factors is selected from the pool of all factors.
2. The n factors and their possible multifactor classes or cells are represented in n-dimensional space. Then, the ratio of the number of cases to the number of controls is estimated within each multifactor class.
3. Each multifactor cell in n-dimensional space is labeled either as high-risk if the cases:controls ratio meets or exceeds some threshold, or as low-risk if that threshold is not exceeded. This reduces the n-dimensional model to a one-dimensional model.
4. The prediction error of each model is estimated by 10 repetitions of 10-fold cross-validation.
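Steps 2 and 3 of the MDR algorithm can be sketched as follows for one fixed set of factors. This is a simplified illustration under stated assumptions: the function names are hypothetical, labels are coded 1 for cases and 0 for controls, the default threshold of 1.0 suits balanced data, and the cross-validation of step 4 is omitted (the error below is computed on the data passed in).

```python
from collections import defaultdict


def mdr_label_cells(genotypes, labels, threshold=1.0):
    """genotypes: one tuple of factor values per individual (the cell).
    labels: 1 for case, 0 for control.
    Returns a dict mapping each observed cell to 'high' or 'low'."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for cell, label in zip(genotypes, labels):
        if label == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    risk = {}
    for cell in set(cases) | set(controls):
        # cases/controls >= threshold, rewritten to avoid division by zero
        high = cases[cell] >= threshold * controls[cell]
        risk[cell] = "high" if high else "low"
    return risk


def mdr_error(risk, genotypes, labels):
    """Fraction of individuals misclassified when high-risk cells predict
    'case'; cells not seen during labeling predict 'control'."""
    wrong = 0
    for cell, label in zip(genotypes, labels):
        predicted = 1 if risk.get(cell) == "high" else 0
        wrong += predicted != label
    return wrong / len(labels)
```

In a full MDR run, `mdr_label_cells` would be fitted on the training folds and `mdr_error` evaluated on the held-out fold, for every candidate factor set.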
Family II: Filtering approaches
Two-stage procedure (popular reference: Marchini et al., 2005):
1. Reduce the set of SNPs.
2. Compute all remaining pairs exhaustively.
Filtering in practice
by courtesy of J. Marchini
SNP Harvester (Yang et al., 2008)
Strategy
- Identify groups of k SNPs associated with the phenotype by a Metropolis-Hastings-like search procedure.
- Run L2-regularized logistic regression on all remaining sets of k SNPs to identify sets of SNPs that significantly affect the phenotype.
Family III: Index-structure approaches
FASTANOVA (Zhang et al., KDD 2008) [a]:
- assumes binary SNPs (inbred strains);
- allows for quantitative phenotypes;
- retrieves the pairs of SNPs most associated with the phenotype; the solution is exact.

[a] Figures and parts of text by courtesy of Zhang et al.
FASTANOVA: Problem formalization
Dataset: N individuals, S SNPs {x_1, x_2, ..., x_S}, a quantitative phenotype y, and its K permutations y_1, y_2, ..., y_K.
Maximum ANOVA test (F-statistic) value of permutation y_k:

F_{y_k} = max{ F(x_i x_j, y_k) | 1 ≤ i < j ≤ S }    (2)

Problem 1: Given a Type I error threshold α, find the critical value F_α, which is the αK-th largest value among {F_{y_k} | 1 ≤ k ≤ K}.
Problem 2: Given the threshold F_α, find all significant SNP pairs, i.e. all pairs with F(x_i x_j, y) ≥ F_α.
Analysis of variance test
- The ANOVA (analysis of variance) test is one of the standard statistical methods for measuring the association between SNPs and the phenotypes of interest.
- The goal of the ANOVA test is to determine whether the group means are significantly different after accounting for the variances within groups.
- It accomplishes the comparison by decomposing the total variance in the data into within-group variance and between-group variance.
- If the between-group variance is sufficiently larger than the within-group variance, the test concludes that there is a significant (phenotypic) difference between the groups.
Analysis of variance test
The basic idea of the ANOVA test is to partition the total sum of squared deviations SST into the between-group sum of squared deviations SSB and the within-group sum of squared deviations SSW:

SST = SSB + SSW    (3)

The F-statistics for ANOVA tests on x_i and (x_i x_j) are:

F(x_i, y) = ((M − 2) / (2 − 1)) · SSB(x_i, y) / (SST(x_i, y) − SSB(x_i, y))    (4)

F(x_i x_j, y) = ((M − g) / (g − 1)) · SSB(x_i x_j, y) / (SST(x_i x_j, y) − SSB(x_i x_j, y))    (5)

Here M is the number of individuals and g is the number of groups induced by the SNP pair; a single binary SNP induces 2 groups.
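Equations (3) to (5) translate directly into code. The sketch below is a plain reference computation, not the FastANOVA algorithm; the names `anova_f` and `pair_f` are hypothetical, and the groups of a SNP pair are taken to be the observed genotype combinations.

```python
def anova_f(groups, y):
    """One-way ANOVA F-statistic.
    groups: one group label per individual; y: phenotype values.
    Implements F = (M - g)/(g - 1) * SSB / (SST - SSB).
    Raises ZeroDivisionError if there is a single group or if the
    within-group variation SST - SSB is zero."""
    m = len(y)
    grand_mean = sum(y) / m
    by_group = {}
    for label, value in zip(groups, y):
        by_group.setdefault(label, []).append(value)
    g = len(by_group)
    sst = sum((value - grand_mean) ** 2 for value in y)
    ssb = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in by_group.values()
    )
    return (m - g) / (g - 1) * ssb / (sst - ssb)


def pair_f(xi, xj, y):
    """F-statistic of the SNP pair (x_i x_j): individuals are grouped
    by their joint genotype at the two SNPs."""
    return anova_f(list(zip(xi, xj)), y)
```

Calling `anova_f(xi, y)` with a single binary SNP gives equation (4); `pair_f` gives equation (5).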
Brute force approach
Problem 1: Permutation test to find the critical value
- For permutation y_k, test all SNP pairs to find the maximum test value F_{y_k}.
- Repeat for all permutations.
- Report the αK-th largest value in {F_{y_k} | 1 ≤ k ≤ K}.
Problem 2: Finding significant SNP pairs
- For phenotype y, test all SNP pairs and report the SNP pairs whose test values are above F_α.
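The brute-force procedure for Problem 1 can be sketched as below. The function names are hypothetical, the test statistic is passed in as a callable so that any pairwise test can be plugged in, and rounding αK up to the next integer is an assumption of this sketch (the slide does not say how non-integer ranks are handled).

```python
import math
import random


def brute_force_maxima(snps, y, statistic, num_perms, seed=0):
    """For each of num_perms random permutations of y, evaluate the
    statistic on every SNP pair and record the maximum value."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(num_perms):
        yk = list(y)
        rng.shuffle(yk)
        best = max(
            statistic(snps[i], snps[j], yk)
            for i in range(len(snps))
            for j in range(i + 1, len(snps))
        )
        maxima.append(best)
    return maxima


def critical_value(max_stats, alpha):
    """Return the (alpha*K)-th largest of the K per-permutation maxima,
    with alpha*K rounded up (assumption) and at least 1."""
    k = len(max_stats)
    rank = max(1, math.ceil(alpha * k))
    return sorted(max_stats, reverse=True)[rank - 1]
```

The nested loop over all pairs inside the loop over all permutations is the cost that the brute-force approach pays in full.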
Overview of FASTANOVA
Goal: Scale large permutation tests to the genome-wide setting.
Question: Do we have to perform ANOVA tests for every SNP pair and repeat them for all permutations?
Idea:
- Develop an upper bound to filter out SNP pairs that have no chance of becoming significant (all nodes on the same level of the search tree, no sub-tree pruning).
- Efficiently compute the upper bound: calculate the upper bound for a group of SNP pairs together.
- Identify redundant computations in the permutation tests (reuse computations).
The upper bound I
For any SNP pair (x_i, x_j):

F(x_i x_j, y) ≥ F_α  ⇔  SSB(x_i x_j, y) ≥ β    (6)

(β is fixed for a given F_α.)
Bound on SSB:

SSB(x_i x_j, y) ≤ SSB(x_i, y) + R_1 + R_2    (7)

The right-hand side must be at least β for (x_i x_j) to be significant.
The upper bound II
Let

n_a = min{ #(x_j = 1), #(x_j = 0) | x_i = 0 }    (8)
n_b = min{ #(x_j = 1), #(x_j = 0) | x_i = 1 }    (9)

For any SNP pair (x_i, x_j):

SSB(x_i x_j, y) ≤ SSB(x_i, y) + R_1 + R_2    (10)

- SSB(x_i, y) is constant for a given x_i.
- R_1 is a function of n_a.
- R_2 is a function of n_b.

For a fixed x_i, R_1 and R_2 depend only on x_j (through n_a and n_b).
Schema of FastANOVA
- For each x_i, index the SNP pairs {(x_i x_j) | i + 1 ≤ j ≤ S} in the 2D space of (n_a, n_b).
- For each permutation, find the candidate SNP pairs by accessing the indexing structure.
- Candidates are SNP pairs whose upper bounds are above the threshold.
- The dynamic threshold is the maximum test value found so far.
Complexity of FastANOVA
Runtime complexity
- FastANOVA: O(S^2 N + K S N^2 + C N)
- Brute force: O(K N S^2)
Space complexity
- O((S + K) N)
FASTANOVA: Runtime versus number of SNPs
FASTANOVA: Runtime versus type I error
FASTANOVA: Pruning power versus type I error
Convex Optimization based Epistasis Detection
COE approach (Zhang et al., RECOMB 2009)
Goal: Efficient epistasis detection across different test statistics
Assumptions:
- Binary phenotypes
- Binary genotypes
Strategy:
- Show that many commonly used statistics are convex functions.
- Develop an upper bound to filter out SNP pairs that have no chance of becoming significant.
- Efficiently compute the upper bound: calculate the upper bound for a group of SNP pairs together.
- Identify redundant computations in the permutation tests.
Generalizes the FASTANOVA result, but is restricted to binary phenotypes.
Tree-based Epistasis Association Mapping (TEAM)
TEAM properties (Zhang et al., ISMB 2010)
- Supports homozygous and heterozygous data
- Applicable to all tests based on the contingency table
- More efficient for large sample studies
TEAM: Problem formalization
Dataset: N individuals {I_1, ..., I_N}, S heterozygous SNPs {x_1, x_2, ..., x_S} with state 0, 1 or 2, a binary phenotype y, and its K permutations y_1, y_2, ..., y_K.
Goal: For each phenotype permutation y_k, compute the corresponding contingency table.
TEAM: Contingency tables II
           x_i = 0    x_i = 1    x_i = 2    Total
x_j = 0    Event S    Event T    Event R
x_j = 1    Event P    Event Q    Event U
x_j = 2    Event V    Event W    Event Z
Total                                       M

Contingency table for the genotype relation between two SNPs x_i and x_j
TEAM: Contingency tables III
x_i = 0:    x_j = 0     x_j = 1     x_j = 2
y_k = 0     Event a1    Event a2    Event a3
y_k = 1     Event c1    Event c2    Event c3

x_i = 1:    x_j = 0     x_j = 1     x_j = 2
y_k = 0     Event b1    Event b2    Event b3
y_k = 1     Event d1    Event d2    Event d3

x_i = 2:    x_j = 0     x_j = 1     x_j = 2
y_k = 0     Event e1    Event e2    Event e3
y_k = 1     Event f1    Event f2    Event f3

Total: M

Contingency table for the two-locus test T(x_i x_j, y_k)
TEAM: Key observation
Structure in contingency tables
If the contingency tables I and II are known, all entries of contingency table III can be inferred if d2, d3, f2 and f3 are known.

x_i = 0:    x_j = 0     x_j = 1     x_j = 2
y_k = 0     Event a1    Event a2    Event a3
y_k = 1     Event c1    Event c2    Event c3

x_i = 1:    x_j = 0     x_j = 1     x_j = 2
y_k = 0     Event b1    Event b2    Event b3
y_k = 1     Event d1    Event d2    Event d3

x_i = 2:    x_j = 0     x_j = 1     x_j = 2
y_k = 0     Event e1    Event e2    Event e3
y_k = 1     Event f1    Event f2    Event f3

Total: M
TEAM: Algorithm
- Build a minimum spanning tree (MST) on the SNPs.
- For a leaf x_i, update the tables for the pairs associated with x_i (by a DFS of the MST).
- Remove x_i and repeat the updating.
Example of a minimum spanning tree:
TEAM: Minimum Spanning Tree
- The nodes V(T) of the tree T are SNPs.
- The weight of an edge in E(T) is the number of individuals having different genotypes at the two SNPs.
- A spanning tree is a tree that connects all SNPs.
- A minimum spanning tree is a spanning tree whose weight is not larger than that of any other spanning tree. It can be computed via Kruskal’s algorithm in O(E log E).
- A computational bottleneck in TEAM, however, is that one has to compute all pairwise distances between SNPs first.
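The tree construction described above can be sketched with Kruskal's algorithm and a union-find structure. This is a generic illustration, not the TEAM implementation; the function names are hypothetical, and each SNP is assumed to be given as a genotype vector over the same individuals.

```python
def hamming(a, b):
    """Number of individuals whose genotypes differ between two SNPs."""
    return sum(p != q for p, q in zip(a, b))


def minimum_spanning_tree(snps):
    """snps: list of genotype vectors, one per SNP.
    Returns (tree_edges, total_weight), with edges as (weight, u, v)."""
    n = len(snps)
    # All pairwise distances: the quadratic step noted as the bottleneck.
    edges = sorted(
        (hamming(snps[u], snps[v]), u, v)
        for u in range(n)
        for v in range(u + 1, n)
    )
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree, sum(weight for weight, _, _ in tree)
```

The total weight returned here corresponds to the tree weight that enters the complexity of TEAM.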
TEAM: Updating rule
Let (x_j x_j'){k→l} denote the set of individuals for which x_j = k and x_j' = l.
From the contingency tables I, it is easy to see that

O_{d2}(x_i x_j, y_k) = |D(x_i, y_k) ∩ Q(x_i, x_j)|

and

O_{d2}(x_i x_j', y_k) = |D(x_i, y_k) ∩ Q(x_i, x_j')|.

Theorem: For any SNP pair (x_i x_j) and an edge (x_j x_j') ∈ E(T), we have

O_{d2}(x_i x_j', y_k) = O_{d2}(x_i x_j, y_k)
    + |D(x_i, y_k) ∩ (x_j x_j'){0→1}∪{2→1}|
    − |D(x_i, y_k) ∩ (x_j x_j'){1→0}∪{1→2}|.
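The updating rule can be checked on a toy example. The sketch below assumes, as the contingency table suggests, that d2 counts individuals with x_i = 1, x_j = 1 and y_k = 1, and that D(x_i, y_k) is the set of individuals with x_i = 1 and y_k = 1. The function names are hypothetical, and for clarity the update scans all individuals, whereas an efficient implementation would only visit those whose genotype differs along the tree edge.

```python
def d2_direct(xi, xj, y):
    """Count individuals with x_i = 1, x_j = 1 and y = 1 from scratch."""
    return sum(1 for a, b, c in zip(xi, xj, y) if a == 1 and b == 1 and c == 1)


def d2_update(d2_old, xi, xj, xj_new, y):
    """Derive the d2 count for (x_i, x_j') from the count for (x_i, x_j)."""
    # D(x_i, y): individuals with x_i = 1 and y = 1
    d = {n for n, (a, c) in enumerate(zip(xi, y)) if a == 1 and c == 1}
    # Individuals moving into genotype 1 along the edge: {0→1} ∪ {2→1}
    gained = {n for n, (b, b2) in enumerate(zip(xj, xj_new)) if b != 1 and b2 == 1}
    # Individuals moving out of genotype 1 along the edge: {1→0} ∪ {1→2}
    lost = {n for n, (b, b2) in enumerate(zip(xj, xj_new)) if b == 1 and b2 != 1}
    return d2_old + len(d & gained) - len(d & lost)


xi = [1, 1, 1, 0, 1, 1]
y = [1, 1, 0, 1, 1, 1]
xj = [1, 0, 1, 1, 2, 1]
xj_new = [1, 1, 1, 0, 1, 0]
old = d2_direct(xi, xj, y)
print(old, d2_update(old, xi, xj, xj_new, y), d2_direct(xi, xj_new, y))
```

The update touches only individuals whose genotype changes between x_j and x_j', which is why a low-weight spanning tree keeps the total update cost small.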
Complexity of TEAM
Runtime complexity
- TEAM: O(N S K + N S^2 + W_tree S K)
- Brute force: O(N S^2 K)
Space complexity
- O((S + K) N + K (S + N) + W_tree)
TEAM: Empirical runtime analysis
Comparison between TEAM and the brute force approach on human datasets under various experimental settings: varying the number of SNPs (a), individuals (b) and permutations (c), and varying the case/control ratio (d).
TEAM: Empirical runtime comparison
Comparison between TEAM, COE and the brute force approach on mouse datasets under various experimental settings: (a) varying the number of SNPs and (b) varying the number of individuals.
Taxonomy of space pruning methods (Wei Wang Lab)
http://www.csbio.unc.edu/epistasis/
Family IV: Sampling approaches to epistasis detection
Overview
- Bayesian inference of epistatic interactions in case-control studies (Zhang et al., 2007) partitions the SNPs into SNPs with 'no effect', 'main effect' and 'interaction effect' via MCMC sampling.
- Random forests (Lunetta, 2004) filter SNPs by their importance in constructing decision trees based on sampled subsets of SNPs.
- The Epistasis Lightbulb Algorithm (Achlioptas et al., 2011) detects pairs of interacting SNPs in a runtime that is subquadratic in the number of SNPs.
Filtering via random forests
Random Forests (Breiman, 2001; Lunetta et al., 2004)
from Jiang et al., 2009
A subquadratic runtime approach to epistasis detection
The epistasis lightbulb algorithm
- We assume binary phenotypes (cases and controls).
- Genotypes may be homozygous or heterozygous.
- We assume m individuals with n SNPs each.
- We define an algorithm that rapidly detects epistatic interactions in a runtime subquadratic in n (Achlioptas et al., KDD 2011).
Difference in correlation for epistasis detection
We phrase epistasis detection as a difference-in-correlation problem:

argmax_{i,j} |ρ_cases(x_i, x_j) − ρ_controls(x_i, x_j)|    (11)

Different degree of linkage disequilibrium of two loci in cases and controls
The lightbulb approach (Paturi et al., COLT 1989)
Maximum correlation
The lightbulb algorithm tackles the maximum correlation problem on an m × n matrix A with binary entries:

argmax_{i,j} |ρ_A(x_i, x_j)|    (12)

Quadratic runtime algorithm
As in epistasis detection, the problem can be solved by naive enumeration of all n^2 possible solutions.
The lightbulb approach
Lightbulb algorithm
1. Given a binary matrix A with m rows and n columns.
2. Repeat l times:
   - Sample k rows.
   - Increase a counter for all pairs of columns that match on these k rows.
3. The counters divided by l give an estimate of the correlation P(x_i = x_j).
Subquadratic runtime
With probability 1 − n^{−α}, the lightbulb algorithm retrieves the most correlated pair in

O(α n^{1 + ln p_1 / ln q_2} ln^2 n) = O(n · α n^{ln p_1 / ln q_2} ln^2 n).
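The sampling loop above can be sketched as follows. This is a simplified illustration, not the analysed algorithm: the function name is hypothetical, rows are sampled without replacement (an assumption), and columns are bucketed by their bit pattern on the sampled rows so that only columns agreeing on all k rows are paired. In this sketch a counter divided by l estimates the probability that a pair agrees on all k sampled rows, which is used only to rank pairs.

```python
import random
from collections import defaultdict
from itertools import combinations


def lightbulb(matrix, k, l, seed=0):
    """matrix: list of m rows, each a list of n bits.
    Returns (best_pair, counters), where counters[(i, j)] is the number
    of rounds in which columns i < j agreed on all k sampled rows."""
    rng = random.Random(seed)
    m, n = len(matrix), len(matrix[0])
    counters = defaultdict(int)
    for _ in range(l):
        rows = rng.sample(range(m), k)
        # Columns with identical bits on the sampled rows share a bucket.
        buckets = defaultdict(list)
        for col in range(n):
            key = tuple(matrix[r][col] for r in rows)
            buckets[key].append(col)
        for cols in buckets.values():
            for pair in combinations(cols, 2):
                counters[pair] += 1
    best = max(counters, key=counters.get) if counters else None
    return best, counters
```

Bucketing avoids comparing every pair of columns in every round; only columns that collide are paired, which is where the savings over naive enumeration come from when k is chosen large enough.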
Difference between the two settings
Discrepancies
- Difference in correlation
- SNPs are non-binary in general
- Pearson’s correlation coefficient
Step 1: Difference in correlation
Theorem: Given a matrix of cases A and a matrix of controls B of identical size, finding the maximally correlated pair on

[ A    A   ]
[ B  1 − B ]    (13)

and on

[ A  1 − A ]
[ B    B   ]    (14)

is identical to

argmax_{i,j} |ρ_A(x_i, x_j) − ρ_B(x_i, x_j)|    (15)
Step 2: Locality sensitive hashing (Charikar, 2002)
Given a collection of vectors in R^m, we choose a random vector r from the m-dimensional Gaussian distribution. Corresponding to this vector r, we define a hash function h_r as follows:

h_r(u) = 1 if r · u ≥ 0
h_r(u) = 0 if r · u < 0    (16)

Theorem: For vectors v, u, Pr[h_r(u) = h_r(v)] = 1 − θ(u, v)/π, where θ(u, v) is the angle between the two vectors.
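The collision probability in the theorem can be checked empirically. The sketch below draws random Gaussian vectors r, applies the hash function h_r of equation (16), and compares the observed collision rate with 1 − θ/π. The function names and the number of trials are choices made for this illustration.

```python
import math
import random


def hash_bit(r, u):
    """h_r(u): 1 if the dot product r . u is non-negative, else 0."""
    return 1 if sum(a * b for a, b in zip(r, u)) >= 0 else 0


def collision_rate(u, v, trials=20000, seed=0):
    """Fraction of random Gaussian vectors r with h_r(u) == h_r(v)."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        same += hash_bit(r, u) == hash_bit(r, v)
    return same / trials


def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


u, v = [1.0, 0.0], [0.0, 1.0]
print(collision_rate(u, v), 1 - angle(u, v) / math.pi)
```

For orthogonal vectors the theorem predicts a collision probability of 0.5; the empirical rate fluctuates around that value.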
Step 3: Pearson’s correlation coefficient
Link between correlation and cosine
Karl Pearson defined the correlation of two vectors v, u in R^m as

ρ = cov(v, u) / (σ_v σ_u),    (17)

that is, the covariance of the two vectors divided by the product of their standard deviations. An equivalent geometric way to define it is:

ρ = cos(v − v̄, u − ū),    (18)

where v̄ and ū are the mean values of v and u, respectively.
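The equivalence of (17) and (18) is easy to verify numerically. The sketch below computes both quantities independently; the function names are hypothetical, and population (divide by m) variances are used, which does not affect the ratio.

```python
import math


def pearson(v, u):
    """Equation (17): covariance divided by the standard deviations."""
    m = len(v)
    mean_v, mean_u = sum(v) / m, sum(u) / m
    cov = sum((a - mean_v) * (b - mean_u) for a, b in zip(v, u)) / m
    sd_v = math.sqrt(sum((a - mean_v) ** 2 for a in v) / m)
    sd_u = math.sqrt(sum((b - mean_u) ** 2 for b in u) / m)
    return cov / (sd_v * sd_u)


def centered_cosine(v, u):
    """Equation (18): cosine of the angle between the mean-centered vectors."""
    m = len(v)
    mean_v, mean_u = sum(v) / m, sum(u) / m
    cv = [a - mean_v for a in v]
    cu = [b - mean_u for b in u]
    dot = sum(a * b for a, b in zip(cv, cu))
    return dot / (math.sqrt(sum(a * a for a in cv)) * math.sqrt(sum(b * b for b in cu)))


v = [0, 1, 2, 1, 0, 2]
u = [1, 1, 2, 0, 0, 2]
print(pearson(v, u), centered_cosine(v, u))
```

This link is what allows a hash that preserves angles, such as the one in Step 2, to be applied to mean-centered genotype vectors when the target is Pearson's correlation.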
The lightbulb epistasis algorithm
Algorithm
1. Binarize the original matrices A_0 and B_0 into A and B by locality sensitive hashing.
2. Compute the maximally correlated pair P_1 on

   [ A    A   ]
   [ B  1 − B ]

   via lightbulb.
3. Compute the maximally correlated pair P_2 on

   [ A  1 − A ]
   [ B    B   ]

   via lightbulb.
4. Report the maximum of P_1 and P_2.
Experiments: Nordborg lab SNP dataset
Results on Nordborg SNP dataset

# SNPs     Measurements    Pairs         Exponent    Speedup    Top 10    Top 100    Top 500    Top 1K
100,000    8,255,645       8,186,657     1.38        611        1.00      0.86       0.82       0.80
100,000    52,762,001      51,732,700    1.54        97         1.00      1.00       0.99       0.98

Runtime
- Runtime is empirically O(n^1.5).
- Epistasis detection on the human genome would require 1 day of computation on a typical desktop PC.
Summary: Epistasis detection via sampling