

SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 8: Feature Selection - Epistasis Detection

Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen

SLIDE 2

Classic and new questions


Genetics
  • How does genotypic variation lead to phenotypic variation?
  • Can we predict phenotypes based on the genotype of an individual?

Recent progress
  • Genotypes can be determined at an unprecedented level of detail.
  • Phenotypes can be recorded in an automated manner.

SLIDE 3

Genome-wide association mapping


by courtesy of D. Weigel

SLIDE 4

Phenotype prediction


Arabidopsis phenotypes (99-199 plants, 250k SNPs, Atwell et al., 2010)

Phenotype              AUC (SVM)
Chlorosis at 22°C      0.629 ± 0.003
Anthocyanin at 16°C    0.569 ± 0.003
Anthocyanin at 22°C    0.609 ± 0.003
Leaf Roll at 10°C      0.696 ± 0.002
Leaf Roll at 22°C      0.587 ± 0.004

Why is there room for improvement?
  • We assume additive effects of SNPs and ignore gene-gene interactions and gene-environment interactions.
  • We ignore population structure, that is, systematic ancestry differences between cases and controls.

SLIDE 5

Epistasis - what it means I (Cordell, 2002)


Bateson’s masking effect model
Bateson defines epistasis as a masking effect, whereby a variant or allele at one locus prevents the variant at another locus from manifesting its effect.

Genotype at locus B / G    gg       gG      GG
bb                         White    Grey    Grey
bB                         Black    Grey    Grey
BB                         Black    Grey    Grey

Example of phenotypes (e.g. hair colour) obtained from different genotypes at two loci interacting epistatically under Bateson’s (1909) definition of epistasis.

SLIDE 6

Epistasis - what it means II (Cordell, 2002)


Epistasis in a general sense

Genotype at locus A / B    bb    bB    BB
aa                         0     0     0
aA                         0     1     1
AA                         0     1     1

Example of a penetrance table for two loci interacting epistatically in a general sense.

SLIDE 7

Epistasis - what it means III (Cordell, 2002)


Genetic heterogeneity model

Genotype at locus A / B    bb    bB    BB
aa                         0     0     1
aA                         0     0     1
AA                         1     1     1

Example of a penetrance table for two loci acting together in a heterogeneity model.

SLIDE 8

Epistasis - what it means IV


Regression model Most popular statistical definition:

$$ y = \theta_i x_i + \theta_j x_j + \theta_{(i,j)} \, x_i \odot x_j + \epsilon \tag{1} $$

Test whether $\theta_{(i,j)}$ differs significantly from zero, and rank the SNP pairs by the resulting p-value. Other common measures of association include the F-statistic and Pearson’s correlation coefficient.
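The interaction coefficient of equation (1) can be estimated by ordinary least squares. The following is a minimal sketch under stated assumptions: genotypes are coded 0/1/2, the model has no intercept (as in the equation), the toy phenotype is noise-free, and the helper names `solve_linear_system` and `fit_interaction_model` are hypothetical. The significance test on the interaction coefficient is left out.

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so that the inputs stay untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_interaction_model(xi, xj, y):
    """Least-squares estimates (theta_i, theta_j, theta_ij) for equation (1)."""
    # Design matrix rows: x_i, x_j and their elementwise product.
    rows = [(a, b, a * b) for a, b in zip(xi, xj)]
    # Normal equations (X^T X) theta = X^T y.
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    xty = [sum(r[p] * t for r, t in zip(rows, y)) for p in range(3)]
    return solve_linear_system(xtx, xty)


# Toy data (hypothetical): every genotype combination in {0, 1, 2}^2, twice.
genotypes = list(product(range(3), repeat=2)) * 2
xi = [g[0] for g in genotypes]
xj = [g[1] for g in genotypes]
y = [0.5 * a - 1.0 * b + 2.0 * a * b for a, b in genotypes]

theta_i, theta_j, theta_ij = fit_interaction_model(xi, xj, y)
print(theta_i, theta_j, theta_ij)
```

In a genome-wide scan this fit would be repeated for every SNP pair, which is exactly what makes the runtime problematic.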

SLIDE 9

Epistasis - what it means V (Marchini et al., 2005)


Model 1: Multiplicative interaction within and between loci

Locus A / B    bb              bB                       BB
aa             α               α(1 + θ1)^0 (1 + θ2)     α(1 + θ2)^2
aA             α(1 + θ1)       α(1 + θ1)(1 + θ2)        α(1 + θ1)(1 + θ2)^2
AA             α(1 + θ1)^2     α(1 + θ1)^2 (1 + θ2)     α(1 + θ1)^2 (1 + θ2)^2

The odds increase multiplicatively with genotype both within and between loci.

SLIDE 10

Epistasis - what it means VI (Marchini et al., 2005)


Model 2: Two-locus interaction with multiplicative effects

Locus A / B    bb    bB              BB
aa             α     α               α
aA             α     α(1 + θ)        α(1 + θ)^2
AA             α     α(1 + θ)^2      α(1 + θ)^4

In this model, the odds have a baseline value (α) unless both loci have at least one disease-associated allele. After that, the odds increase multiplicatively within and between genotypes.

SLIDE 11

Epistasis - what it means VII (Marchini et al., 2005)


Model 3: Two-locus interaction with threshold effects

Locus A / B    bb    bB            BB
aa             α     α             α
aA             α     α(1 + θ)      α(1 + θ)
AA             α     α(1 + θ)      α(1 + θ)

In this model, the odds have a baseline value (α) unless both loci have at least one disease-associated allele. In that case, the odds are α(1 + θ).
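The three odds tables of Models 1 to 3 can be generated from the number of disease-associated alleles at each locus. The sketch below is illustrative only: the function name `odds_table` and its argument layout are hypothetical, with rows indexed by the number of A alleles (aa, aA, AA) and columns by the number of B alleles (bb, bB, BB).

```python
def odds_table(model, alpha, theta1, theta2=None):
    """Return the 3x3 odds table of the chosen two-locus model.

    Rows are the genotypes aa, aA, AA (0, 1, 2 copies of A);
    columns are bb, bB, BB (0, 1, 2 copies of B).
    Models 2 and 3 use a single effect size, passed as theta1.
    """
    table = []
    for a in range(3):
        row = []
        for b in range(3):
            if model == 1:
                # Multiplicative within and between loci.
                odds = alpha * (1 + theta1) ** a * (1 + theta2) ** b
            elif model == 2:
                # Baseline unless both loci carry a risk allele,
                # then multiplicative: exponent a * b gives 1, 2, 2, 4.
                odds = alpha * (1 + theta1) ** (a * b)
            elif model == 3:
                # Threshold effect: one fixed increase once both loci are hit.
                odds = alpha * (1 + theta1) if a > 0 and b > 0 else alpha
            else:
                raise ValueError("model must be 1, 2 or 3")
            row.append(odds)
        table.append(row)
    return table


for model in (1, 2, 3):
    print(model, odds_table(model, alpha=1.0, theta1=1.0, theta2=2.0))
```

Writing the exponent of Model 2 as the product of the two allele counts reproduces the exponents 1, 2, 2 and 4 of the table.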

SLIDE 12

Epistasis - what it means VIII (Marchini et al., 2005)


by courtesy of J. Marchini

SLIDE 13

Impact of Epistasis


Examples of Epistasis
  • Epistasis is conjectured to be one source of missing heritability (Manolio et al., 2009).
  • Genetic interactions are one indicator that epistasis is a major factor in the genotype-phenotype relationship (e.g. Boone et al., 2007).
  • Pairs of genes have been reported to affect complex diseases such as breast cancer (Ashworth et al., 2011):
      • Loss of either BRCA1 or BRCA2 tumor suppressor gene function in cells triggers a cell-cycle arrest at the G2/M checkpoint that can be suppressed by the inactivation of P53 (Connor et al., 1997 and Liu et al., 2007).
      • Loss of VHL (Von Hippel-Lindau tumor suppressor) function normally causes cellular senescence, but inactivation of a second tumor suppressor, RB (Retinoblastoma), can suppress this process (Young et al., 2008).

SLIDE 14

Bottlenecks in two-locus mapping


Scale of the problem
  • Typical datasets include on the order of 10^5 to 10^7 SNPs.
  • Hence we have to consider on the order of 10^10 to 10^14 SNP pairs.
  • Enormous multiple hypothesis testing problem.
  • Enormous computational runtime problem.
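The orders of magnitude above follow from counting unordered SNP pairs. The short sketch below does the arithmetic; the Bonferroni correction is used only as one familiar illustration of the multiple testing burden and is an assumption of this example, as are the function names.

```python
from math import comb


def number_of_pairs(num_snps):
    """Number of unordered SNP pairs, n * (n - 1) / 2."""
    return comb(num_snps, 2)


def bonferroni_threshold(alpha, num_snps):
    """Per-test significance level if every SNP pair is tested once."""
    return alpha / number_of_pairs(num_snps)


for exponent in (5, 6, 7):
    n = 10 ** exponent
    print(f"10^{exponent} SNPs: {number_of_pairs(n):.2e} pairs, "
          f"per-pair threshold {bonferroni_threshold(0.05, n):.2e}")
```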

SLIDE 15

Common approaches in the literature


Exhaustive enumeration
  • Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)

Filtering approaches
  • Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
  • Biological criterion, e.g. underlying PPI (Emily et al., 2009)

Index structure approaches
  • fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
  • TEAM, efficient updates of contingency tables (Zhang et al., 2010)

SLIDE 16

Family I: Exhaustive search


Exhaustive enumeration
  • Concept: Run through all pairs of SNPs exhaustively.
  • Setback: On standard PCs, such searches are limited to hundreds of SNPs.
  • Workaround: Use special hardware, such as
      • computing clusters
      • graphics processing units (GPUs)
      • cloud computing
  • Current limitation: These solutions tend to work for datasets that are currently available, but they may not be able to cope with future increases in sample size or in the number of SNP markers.

SLIDE 17

Engineering approach to epistasis detection


  • GPUs are heavily optimized for basic matrix operations.
  • Computational power is much cheaper on GPUs than on CPUs.
  • We exploit the power of GPUs for rapid exhaustive SNP-SNP interaction detection:
      • EPIBLASTER: Difference in correlation for binary phenotypes (Kam-Thong et al., EJHG 2010)
      • EPIGPUHSIC: Kernel-based test for arbitrary phenotypes (Kam-Thong et al., ISMB 2011)
      • CUDALIN: Regression model with main effects (Kam-Thong et al., submitted)
  • Available from http://agkb.is.tuebingen.mpg.de/Forschung/epistasistools/

SLIDE 18

First finding


by P. Sämann
  • 567 subjects
  • 1,075,163 SNPs
  • Phenotype: hippocampus volume
  • Genome-wide significant results (p < 10^-12) near genes involved in cell-cell signaling

SLIDE 19

Multifactor-Dimensionality Reduction


Properties of MDR (Ritchie et al., 2001)
  • A model-free and non-parametric approach to epistasis detection.
  • Was proposed to overcome the problem that the type of encoding of SNPs affects the results in generalized linear models; does not assume a specific genetic model.
  • Measures the association between SNPs and disease risk using the prediction accuracy of selected multifactor models.

Limitations:
  • Runs exhaustively through all SNP combinations and detects the best model.
  • Resulting models may be difficult to interpret.
  • Original variant only considers balanced case-control datasets.

SLIDE 20

Multifactor-Dimensionality Reduction


Algorithm of MDR

  1. A set of n genetic and/or discrete environmental factors is selected from the pool of all factors.
  2. The n factors and their possible multifactor classes or cells are represented in n-dimensional space. Then, the ratio of the number of cases to the number of controls is estimated within each multifactor class.
  3. Each multifactor cell in n-dimensional space is labeled either as high-risk if the cases:controls ratio meets or exceeds some threshold or as low-risk if that threshold is not exceeded. This reduces the n-dimensional model to a one-dimensional model.
  4. The prediction error of each model is estimated by 10 repetitions of 10-fold cross-validation.
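Steps 2 and 3 of the algorithm can be sketched in a few lines for one fixed set of factors. This is a simplified illustration, not the original MDR implementation: the function names `mdr_fit` and `mdr_error` are hypothetical, the threshold of 1.0 assumes a balanced dataset, and the cross-validation of step 4 is replaced by the error on the data that was passed in.

```python
from collections import Counter


def mdr_fit(genotypes, labels, threshold=1.0):
    """Label each observed multifactor cell as high-risk (True) or low-risk (False).

    genotypes: one tuple of factor values per individual.
    labels: 1 for cases, 0 for controls.
    """
    cases, controls = Counter(), Counter()
    for cell, label in zip(genotypes, labels):
        (cases if label == 1 else controls)[tuple(cell)] += 1
    high_risk = {}
    for cell in set(cases) | set(controls):
        # Equivalent to cases / controls >= threshold, but written without
        # a division so that cells without controls need no special case.
        high_risk[cell] = cases[cell] >= threshold * controls[cell]
    return high_risk


def mdr_error(high_risk, genotypes, labels):
    """Fraction of individuals whose label disagrees with their cell's risk class."""
    wrong = 0
    for cell, label in zip(genotypes, labels):
        predicted = 1 if high_risk.get(tuple(cell), False) else 0
        wrong += predicted != label
    return wrong / len(labels)


# Toy data (hypothetical): disease status is the XOR of two binary factors.
toy_genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
toy_labels = [0, 1, 1, 0] * 5
model = mdr_fit(toy_genotypes, toy_labels)
print(model, mdr_error(model, toy_genotypes, toy_labels))
```

In the full procedure, `mdr_fit` would run on the training folds and `mdr_error` on the held-out fold, for every candidate set of factors.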

SLIDE 21

Family II: Filtering approaches


Two-stage procedure (popular reference: Marchini et al., 2005):
  1. First, reduce the set of SNPs.
  2. Second, test all remaining pairs exhaustively.

SLIDE 22

Filtering in practice


by courtesy of J. Marchini

SLIDE 23

SNP Harvester (Yang et al., 2008)


Strategy
  • Identify groups of k SNPs associated with the phenotype by a Metropolis-Hastings-like search procedure.
  • Run L2-regularized logistic regression on all remaining sets of k SNPs to identify sets of SNPs that significantly affect the phenotype.

SLIDE 24

Family III: Index-structure approaches


FASTANOVA (Zhang et al., KDD 2008) [a]:
  • assumes binary SNPs (inbred strains)
  • allows for quantitative phenotypes
  • retrieves the pairs of SNPs most associated with the phenotype; the solution is exact

[a] Figures and parts of the text by courtesy of Zhang et al.

SLIDE 25

FASTANOVA: Problem formalization


Dataset: $N$ individuals, $S$ SNPs $\{x_1, x_2, \ldots, x_S\}$, a quantitative phenotype $y$, and its $K$ permutations $y_1, y_2, \ldots, y_K$.

Maximum ANOVA test (F-statistic) value of permutation $y_k$:

$$ F_{y_k} = \max\{F(x_i x_j, y_k) \mid 1 \le i < j \le S\} \tag{2} $$

Problem 1: Given the Type I error threshold $\alpha$, find the critical value $F_\alpha$, which is the $\alpha K$-th largest value among $\{F_{y_k} \mid 1 \le k \le K\}$.

Problem 2: Given the threshold $F_\alpha$, find all significant SNP pairs, that is, all pairs with $F(x_i x_j, y) \ge F_\alpha$.

SLIDE 26

Analysis of variance test


  • The ANOVA (analysis of variance) test is one of the standard statistical methods to measure the association between SNPs and the phenotypes of interest.
  • The goal of the ANOVA test is to determine whether the group means are significantly different after accounting for the variances within groups.
  • It accomplishes the comparison by decomposing the total variance in the data into within-group variance and between-group variance.
  • If the between-group variance is sufficiently larger than the within-group variance, then the test concludes that there is a significant (phenotypic) difference between the groups.

SLIDE 27

Analysis of variance test


The basic idea of the ANOVA test is to partition the total sum of squared deviations SST into between-group sum of squared deviations SSB and within-group sum of squared deviations SSW:

SST = SSB + SSW. (3)

The F-statistics for ANOVA tests on xi and (xixj) are:

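$$ F(x_i, y) = \frac{M - 2}{2 - 1} \cdot \frac{\mathrm{SSB}(x_i, y)}{\mathrm{SST}(x_i, y) - \mathrm{SSB}(x_i, y)} \tag{4} $$

$$ F(x_i x_j, y) = \frac{M - g}{g - 1} \cdot \frac{\mathrm{SSB}(x_i x_j, y)}{\mathrm{SST}(x_i x_j, y) - \mathrm{SSB}(x_i x_j, y)} \tag{5} $$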
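The F-statistics can be computed directly from the sums of squares. The sketch below reads M as the number of individuals and g as the number of groups induced by the SNP or SNP pair; this reading, the function name `anova_f`, and the toy data are assumptions of the example. It also assumes at least two groups and a non-zero within-group sum of squares.

```python
from collections import defaultdict


def anova_f(groups, y):
    """F = (M - g) / (g - 1) * SSB / (SST - SSB) for the given group labels."""
    m = len(y)
    mean = sum(y) / m
    # Total sum of squared deviations.
    sst = sum((v - mean) ** 2 for v in y)
    members = defaultdict(list)
    for key, v in zip(groups, y):
        members[key].append(v)
    g = len(members)
    # Between-group sum of squared deviations.
    ssb = sum(len(vals) * (sum(vals) / len(vals) - mean) ** 2
              for vals in members.values())
    return (m - g) / (g - 1) * ssb / (sst - ssb)


# Toy data (hypothetical): binary SNPs, phenotype elevated only if both are 1.
xi = [0, 0, 0, 0, 1, 1, 1, 1]
xj = [0, 0, 1, 1, 0, 0, 1, 1]
y = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 3.0, 3.2]

f_single = anova_f(xi, y)               # groups defined by x_i alone (g = 2)
f_pair = anova_f(list(zip(xi, xj)), y)  # groups defined by the pair (g = 4)
print(f_single, f_pair)
```

On this toy phenotype the pair statistic is far larger than the single-SNP statistic, which is the kind of signal a two-locus scan is after.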

SLIDE 28

Brute force approach


Problem 1: Permutation test to find the critical value
  • For permutation $y_k$, test all SNP pairs to find the maximum test value $F_{y_k}$.
  • Repeat for all permutations.
  • Report the $\alpha K$-th largest value in $\{F_{y_k} \mid 1 \le k \le K\}$.

Problem 2: Finding significant SNP pairs
  • For phenotype $y$, test all SNP pairs and report the SNP pairs whose test values are above $F_\alpha$.
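The brute force procedure for both problems can be sketched as follows. This is a toy illustration under stated assumptions: all function names are hypothetical, permutations are drawn with a seeded pseudo-random generator, and the data set is small enough to enumerate every pair.

```python
import random
from collections import defaultdict
from itertools import combinations


def anova_f(groups, y):
    """One-way ANOVA F-statistic of phenotype y for the given group labels."""
    mean = sum(y) / len(y)
    sst = sum((v - mean) ** 2 for v in y)
    members = defaultdict(list)
    for key, v in zip(groups, y):
        members[key].append(v)
    g = len(members)
    if g < 2:
        return 0.0  # a single group carries no between-group signal
    ssb = sum(len(vals) * (sum(vals) / len(vals) - mean) ** 2
              for vals in members.values())
    ssw = sst - ssb
    if ssw <= 0:
        return float("inf")  # the groups explain the phenotype perfectly
    return (len(y) - g) / (g - 1) * ssb / ssw


def pair_statistic(snps, i, j, y):
    return anova_f(list(zip(snps[i], snps[j])), y)


def critical_value(snps, y, num_permutations, alpha, seed=0):
    """Problem 1: alpha*K-th largest of the per-permutation maxima."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(snps)), 2))
    maxima = []
    for _ in range(num_permutations):
        permuted = y[:]
        rng.shuffle(permuted)
        maxima.append(max(pair_statistic(snps, i, j, permuted) for i, j in pairs))
    maxima.sort(reverse=True)
    rank = max(1, round(alpha * num_permutations))
    return maxima[rank - 1]


def significant_pairs(snps, y, f_alpha):
    """Problem 2: all pairs whose statistic reaches the threshold."""
    return [(i, j) for i, j in combinations(range(len(snps)), 2)
            if pair_statistic(snps, i, j, y) >= f_alpha]


# Toy data (hypothetical): the phenotype is high when SNP 0 and SNP 1 differ.
snps = [
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]
y = [(1.0 if a != b else 0.0) + 0.01 * k
     for k, (a, b) in enumerate(zip(snps[0], snps[1]))]

f_alpha = critical_value(snps, y, num_permutations=20, alpha=0.05)
print(f_alpha, significant_pairs(snps, y, f_alpha))
```

Every permutation rescans all pairs, which is the cost of order K N S^2 that the index-structure approaches try to avoid.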

SLIDE 29

Overview of FASTANOVA


Goal: Scale permutation tests to genome-wide data.

Question: Do we have to perform ANOVA tests for every SNP pair and repeat them for all permutations?

Idea:
  • Develop an upper bound to filter out SNP pairs that have no chance of becoming significant (all nodes are on the same level of the search tree, so there is no sub-tree pruning).
  • Compute the upper bound efficiently: calculate the upper bound for a group of SNP pairs together.
  • Identify redundant computations in the permutation tests (reuse computations).

SLIDE 30

The upper bound I


For any SNP pair $(x_i, x_j)$:

$$ F(x_i x_j, y) \ge F_\alpha \iff \mathrm{SSB}(x_i x_j, y) \ge \beta \tag{6} $$

($\beta$ is fixed for a given $F_\alpha$.) Bound on SSB:

$$ \mathrm{SSB}(x_i x_j, y) \le \mathrm{SSB}(x_i, y) + R_1 + R_2 \tag{7} $$

The right-hand side needs to be at least $\beta$ for $(x_i x_j)$ to be significant.

SLIDE 31

The upper bound II


Let

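$$ n_a = \min\{\#(x_j = 1), \#(x_j = 0) \mid x_i = 0\} \tag{8} $$

$$ n_b = \min\{\#(x_j = 1), \#(x_j = 0) \mid x_i = 1\} \tag{9} $$

For any SNP pair $(x_i, x_j)$:

$$ \mathrm{SSB}(x_i x_j, y) \le \mathrm{SSB}(x_i, y) + R_1 + R_2 \tag{10} $$

  • $\mathrm{SSB}(x_i, y)$ is constant for a given $x_i$.
  • $R_1$ is a function of $n_a$.
  • $R_2$ is a function of $n_b$.

For a fixed $x_i$, $R_1$ and $R_2$ depend only on $x_j$.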

SLIDE 32

Schema of FastANOVA


  • For each $x_i$, index the SNP pairs $\{(x_i x_j) \mid i + 1 \le j \le S\}$ in the 2D space of $(n_a, n_b)$.
  • For each permutation, find the candidate SNP pairs by accessing the indexing structure.
  • Candidates are SNP pairs whose upper bounds are above the threshold.
  • The dynamic threshold is the maximum test value found so far.

SLIDE 33

Complexity of FastANOVA


Runtime complexity
  • FastANOVA: $O(S^2 N + K S N^2 + C N)$
  • Brute force: $O(K N S^2)$

Space complexity
  • $O((S + K) N)$

SLIDE 34

FASTANOVA: Runtime versus number of SNPs


SLIDE 35

FASTANOVA: Runtime versus type I error


SLIDE 36

FASTANOVA: Pruning power versus type I error


SLIDE 37

Convex Optimization based Epistasis Detection


COE approach (Zhang et al., RECOMB 2009)

Goal: Efficient epistasis detection across different test statistics

Assumptions:
  • Binary phenotypes
  • Binary genotypes

Strategy:
  • Show that many commonly used statistics are convex functions.
  • Develop an upper bound to filter out SNP pairs that have no chance of becoming significant.
  • Compute the upper bound efficiently: calculate the upper bound for a group of SNP pairs together.
  • Identify redundant computations in the permutation tests.

Generalizes the FASTANOVA result, but is restricted to binary phenotypes.

SLIDE 38

Tree-based Epistasis Association Mapping (TEAM)


TEAM Properties (Zhang et al., ISMB 2010)
  • Supports homozygous and heterozygous data
  • Applicable to all tests based on the contingency table
  • More efficient for large sample studies

SLIDE 39

TEAM: Problem formalization


Dataset: $N$ individuals $\{I_1, \ldots, I_N\}$, $S$ heterozygous SNPs $\{x_1, x_2, \ldots, x_S\}$ with state 0, 1 or 2, a binary phenotype $y$, and its $K$ permutations $y_1, y_2, \ldots, y_K$.

Goal: For each phenotype permutation $y_k$, compute the corresponding contingency table.

SLIDE 40

TEAM: Contingency tables II


            xi = 0     xi = 1     xi = 2     Total
xj = 0      Event S    Event T    Event R
xj = 1      Event P    Event Q    Event U
xj = 2      Event V    Event W    Event Z
Total                                        M

Contingency table for the genotype relation between two SNPs xi and xj

SLIDE 41

TEAM: Contingency tables III


xi = 0:     xj = 0      xj = 1      xj = 2
yk = 0      Event a1    Event a2    Event a3
yk = 1      Event c1    Event c2    Event c3

xi = 1:     xj = 0      xj = 1      xj = 2
yk = 0      Event b1    Event b2    Event b3
yk = 1      Event d1    Event d2    Event d3

xi = 2:     xj = 0      xj = 1      xj = 2
yk = 0      Event e1    Event e2    Event e3
yk = 1      Event f1    Event f2    Event f3

Total: M

Contingency table for the two-locus test T(xixj, yk)

SLIDE 42

TEAM: Key observation


Structure in contingency tables
If the contingency tables I and II are known, all entries of contingency table III can be inferred if d2, d3, f2 and f3 are known.

xi = 0:     xj = 0      xj = 1      xj = 2
yk = 0      Event a1    Event a2    Event a3
yk = 1      Event c1    Event c2    Event c3

xi = 1:     xj = 0      xj = 1      xj = 2
yk = 0      Event b1    Event b2    Event b3
yk = 1      Event d1    Event d2    Event d3

xi = 2:     xj = 0      xj = 1      xj = 2
yk = 0      Event e1    Event e2    Event e3
yk = 1      Event f1    Event f2    Event f3

Total: M

SLIDE 43

TEAM: Algorithm


  • Build a minimum spanning tree (MST) on the SNPs.
  • For a leaf xi, update the tables for the pairs associated with xi (by a DFS of the MST).
  • Remove xi and repeat the updating.

Example of a minimum spanning tree:

SLIDE 44

TEAM: Minimum Spanning Tree


  • The nodes V(T) of the tree T are SNPs.
  • The weight of an edge in E(T) is the number of individuals that have different genotypes at the two SNPs.
  • A spanning tree is a tree that connects all SNPs.
  • A minimum spanning tree is a spanning tree whose weight is not larger than that of any other spanning tree. It can be computed via Kruskal’s algorithm in O(E log E).
  • A computational bottleneck in TEAM, however, is that one has to compute all pairwise distances between SNPs first.
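The tree construction described above can be sketched with Kruskal's algorithm and a union-find structure. The code is an illustration only, not the TEAM implementation; the function names and the toy genotypes are hypothetical. It also makes the stated bottleneck visible: all pairwise distances are computed before the tree is built.

```python
from itertools import combinations


def genotype_distance(a, b):
    """Number of individuals whose genotypes differ between two SNPs."""
    return sum(u != v for u, v in zip(a, b))


def minimum_spanning_tree(snps):
    """Kruskal's algorithm on the complete graph over the SNPs.

    Returns the tree as a list of (i, j, weight) edges.
    """
    # All pairwise distances are needed up front.
    edges = sorted((genotype_distance(snps[i], snps[j]), i, j)
                   for i, j in combinations(range(len(snps)), 2))
    parent = list(range(len(snps)))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree


# Toy data (hypothetical): 4 SNPs genotyped in 5 individuals.
snps = [
    [0, 0, 1, 2, 0],
    [0, 0, 1, 2, 1],
    [0, 1, 1, 2, 1],
    [2, 1, 1, 0, 1],
]
tree = minimum_spanning_tree(snps)
print(tree, sum(w for _, _, w in tree))
```

A small total tree weight means that few individuals change genotype along each edge, which is what keeps the contingency table updates cheap.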

SLIDE 45

TEAM: Updating rule


Let $(x_j x_{j'})_{\{k \to l\}}$ denote the set of individuals for which $x_j = k$ and $x_{j'} = l$. From the contingency tables I, it is easy to see that

$$ O_{d_2}(x_i x_j, y_k) = |D(x_i, y_k) \cap Q(x_i, x_j)| $$

and

$$ O_{d_2}(x_i x_{j'}, y_k) = |D(x_i, y_k) \cap Q(x_i, x_{j'})|. $$

Theorem: For any SNP pair $(x_i x_j)$ and an edge $(x_j x_{j'}) \in E(T)$, we have

$$ O_{d_2}(x_i x_{j'}, y_k) = O_{d_2}(x_i x_j, y_k) + |D(x_i, y_k) \cap (x_j x_{j'})_{\{0 \to 1\} \cup \{2 \to 1\}}| - |D(x_i, y_k) \cap (x_j x_{j'})_{\{1 \to 0\} \cup \{1 \to 2\}}|. $$
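The updating rule can be checked numerically on a toy example. The sketch assumes that D(xi, yk) is the set of individuals with xi = 1 and yk = 1 (matching the row of cells d1 to d3 in contingency table III) and that Q is the event xi = 1 and xj = 1 from contingency table II; both readings, the function names, and the data are assumptions of this example. For simplicity the update loops over all individuals, whereas the point of the rule is that only individuals whose genotype changes along the tree edge need to be visited.

```python
def count_d2_direct(xi, xj, y):
    """Observed count of cell d2: x_i = 1, x_j = 1 and y_k = 1."""
    return sum(1 for a, b, c in zip(xi, xj, y) if a == 1 and b == 1 and c == 1)


def count_d2_updated(d2_for_xj, xi, xj, xj_next, y):
    """Derive the d2 count for (x_i, x_j') from the count for (x_i, x_j)."""
    gained = lost = 0
    for a, b, b_next, c in zip(xi, xj, xj_next, y):
        if a == 1 and c == 1:             # individual belongs to D(x_i, y_k)
            if b != 1 and b_next == 1:    # transitions 0 -> 1 or 2 -> 1
                gained += 1
            elif b == 1 and b_next != 1:  # transitions 1 -> 0 or 1 -> 2
                lost += 1
    return d2_for_xj + gained - lost


# Toy data (hypothetical): 8 individuals, genotypes coded 0/1/2.
xi = [1, 1, 1, 1, 0, 2, 1, 1]
xj = [1, 0, 2, 1, 1, 1, 1, 0]
xj_next = [1, 1, 1, 0, 1, 1, 1, 0]
y = [1, 1, 1, 1, 1, 0, 1, 0]

d2_old = count_d2_direct(xi, xj, y)
d2_new = count_d2_updated(d2_old, xi, xj, xj_next, y)
print(d2_old, d2_new, count_d2_direct(xi, xj_next, y))
```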

SLIDE 46

Complexity of TEAM


Runtime complexity
  • TEAM: $O(NSK + NS^2 + W_{\mathrm{tree}} S K)$
  • Brute force: $O(NS^2 K)$

Space complexity
  • $O((S + K)N + K(S + N) + W_{\mathrm{tree}})$

SLIDE 47

TEAM: Empirical runtime analysis


Comparison between TEAM and the brute force approach on human datasets under various experimental settings: varying the number of SNPs (a), individuals (b), permutations (c), and the case/control ratio (d).

SLIDE 48

TEAM: Empirical runtime comparison


Comparison between TEAM, COE and the brute force approach on mouse datasets under various experimental settings: (a) varying the number of SNPs and (b) varying the number of individuals.

SLIDE 49

Taxonomy of space pruning methods (Wei Wang Lab)


http://www.csbio.unc.edu/epistasis/

SLIDE 50

Family IV: Sampling approaches to epistasis detection


Overview
  • Bayesian inference of epistatic interactions in case-control studies (Zhang et al., 2007), which partitions the SNPs into SNPs with ’no effect’, ’main effect’ and ’interaction effect’ via MCMC sampling.
  • Random forests (Lunetta et al., 2004) filter SNPs by their importance in constructing decision trees based on sampled subsets of SNPs.
  • The Epistasis Lightbulb Algorithm (Achlioptas et al., 2011) detects pairs of interacting SNPs in a runtime that is subquadratic in the number of SNPs.

SLIDE 51

Filtering via random forests


Random Forests (Breiman, 2001; Lunetta et al., 2004)

from Jiang et al., 2009

SLIDE 52

A subquadratic runtime approach to epistasis detection


The epistasis lightbulb algorithm
  • We assume binary phenotypes (cases and controls).
  • Genotypes may be homozygous or heterozygous.
  • We assume m individuals with n SNPs each.
  • We define an algorithm that rapidly detects epistatic interactions in a runtime subquadratic in n (Achlioptas et al., KDD 2011).

SLIDE 53

Difference in correlation for epistasis detection


We phrase epistasis detection as a difference in correlation problem:

$$ \arg\max_{i,j} \; |\rho_{\mathrm{cases}}(x_i, x_j) - \rho_{\mathrm{controls}}(x_i, x_j)|. \tag{11} $$

Different degree of linkage disequilibrium of two loci in cases and controls
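The objective of equation (11) can be evaluated by naive enumeration on a small example. The sketch below is the quadratic baseline, not the subquadratic algorithm; the function names and toy genotypes are hypothetical, and it assumes that no SNP is constant within the cases or within the controls (otherwise the correlation is undefined).

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def correlation_difference(cases, controls, i, j):
    return abs(pearson(cases[i], cases[j]) - pearson(controls[i], controls[j]))


def best_pair(cases, controls):
    """SNP pair maximising the difference in correlation, by full enumeration."""
    pairs = combinations(range(len(cases)), 2)
    return max(pairs, key=lambda p: correlation_difference(cases, controls, *p))


# Toy data (hypothetical): one list per SNP, one entry per individual.
cases = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # perfectly correlated with SNP 0 in cases
    [0, 2, 1, 1, 0, 2],
]
controls = [
    [0, 1, 2, 0, 1, 2],
    [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with SNP 0 in controls
    [0, 2, 1, 1, 0, 2],
]
pair = best_pair(cases, controls)
print(pair, correlation_difference(cases, controls, *pair))
```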

SLIDE 54

The lightbulb approach (Paturi et al., COLT 1989)


Maximum correlation
The lightbulb algorithm tackles the maximum correlation problem on an $m \times n$ matrix $A$ with binary entries:

$$ \arg\max_{i,j} \; |\rho_A(x_i, x_j)|. \tag{12} $$

Quadratic runtime algorithm
As in epistasis detection, the problem can be solved by naive enumeration of all $n^2$ possible solutions.

SLIDE 55

The lightbulb approach


Lightbulb algorithm

  1. Given a binary matrix $A$ with $m$ rows and $n$ columns.
  2. Repeat $l$ times:
       • Sample $k$ rows.
       • Increase a counter for all pairs of columns that match on these $k$ rows.
  3. The counters divided by $l$ give an estimate of the correlation $P(x_i = x_j)$.

Subquadratic runtime
With probability $1 - n^{-\alpha}$, the lightbulb algorithm retrieves the most correlated pair in

$$ O\left(\alpha \, n^{1 + \frac{\ln p_1}{\ln q_2}} \ln^2 n\right) = O\left(n \left(\alpha \, n^{\frac{\ln p_1}{\ln q_2}} \ln^2 n\right)\right). $$
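The sampling loop of the lightbulb algorithm can be sketched as follows. This is a simplified illustration, not the published implementation: the function name, the bucketing of columns by their pattern on the sampled rows, the seeded random generator, and the toy matrix are all assumptions of the example.

```python
import random
from collections import Counter
from itertools import combinations


def lightbulb(matrix, num_rounds, rows_per_round, seed=0):
    """Count, per column pair, the rounds in which the pair matched on all sampled rows."""
    rng = random.Random(seed)
    num_columns = len(matrix[0])
    counters = Counter()
    for _ in range(num_rounds):
        sample = rng.sample(matrix, rows_per_round)
        # Columns with the same pattern on the sampled rows land in one bucket,
        # so only pairs inside a bucket have to be touched.
        buckets = {}
        for col in range(num_columns):
            pattern = tuple(row[col] for row in sample)
            buckets.setdefault(pattern, []).append(col)
        for cols in buckets.values():
            for pair in combinations(cols, 2):
                counters[pair] += 1
    return counters


# Toy matrix (hypothetical): 16 rows, 6 binary columns; column 4 copies column 1.
matrix = []
for value in range(16):
    bits = [(value >> shift) & 1 for shift in range(4)]
    matrix.append(bits + [bits[1], bits[0] ^ bits[1]])

counters = lightbulb(matrix, num_rounds=50, rows_per_round=4)
best_pair, best_count = counters.most_common(1)[0]
print(best_pair, best_count / 50)
```

Bucketing the columns by their sampled pattern is what avoids touching every pair in every round: only columns that already agree on the sample are compared.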

SLIDE 56

Difference between the two settings


Discrepancies
  • Difference in correlation
  • SNPs are non-binary in general
  • Pearson’s correlation coefficient

SLIDE 57

Step 1: Difference in correlation


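Theorem: Given a matrix of cases $A$ and a matrix of controls $B$ of identical size, finding the maximally correlated pair on

$$ \begin{pmatrix} A & A \\ B & 1 - B \end{pmatrix} \tag{13} $$

and on

$$ \begin{pmatrix} A & 1 - A \\ B & B \end{pmatrix} \tag{14} $$

is identical to

$$ \arg\max_{i,j} \; |\rho_A(x_i, x_j) - \rho_B(x_i, x_j)|. \tag{15} $$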

SLIDE 58

Step 2: Locality sensitive hashing (Charikar, 2002)


Given a collection of vectors in $\mathbb{R}^m$, we choose a random vector $r$ from the $m$-dimensional Gaussian distribution. Corresponding to this vector $r$, we define a hash function $h_r$ as follows:

$$ h_r(u) = \begin{cases} 1 & \text{if } r \cdot u \ge 0 \\ 0 & \text{if } r \cdot u < 0 \end{cases} \tag{16} $$

Theorem: For vectors $v, u$, $\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$, where $\theta$ is the angle between the two vectors.
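The collision probability stated in the theorem can be checked empirically by drawing many random Gaussian vectors. The sketch below is a simulation under stated assumptions: the function names, the seeded generator, and the two example vectors are hypothetical.

```python
import math
import random


def hash_bit(r, u):
    """Random hyperplane hash: 1 if r . u >= 0, else 0."""
    return 1 if sum(a * b for a, b in zip(r, u)) >= 0 else 0


def collision_rate(u, v, num_hashes, seed=0):
    """Fraction of random Gaussian vectors r for which h_r(u) == h_r(v)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_hashes):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        hits += hash_bit(r, u) == hash_bit(r, v)
    return hits / num_hashes


def angle(u, v):
    """Angle between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norms)))


u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]  # 45 degrees away from u
empirical = collision_rate(u, v, num_hashes=20000)
predicted = 1 - angle(u, v) / math.pi
print(empirical, predicted)
```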

SLIDE 59

Step 3: Pearson’s correlation coefficient


Link between correlation and cosine
Karl Pearson defined the correlation of two vectors $v, u$ in $\mathbb{R}^m$ as

$$ \rho = \frac{\mathrm{cov}(v, u)}{\sigma_v \sigma_u}, \tag{17} $$

that is, the covariance of the two vectors divided by the product of their standard deviations. An equivalent geometric way to define it is:

$$ \rho = \cos(v - \bar{v}, u - \bar{u}), \tag{18} $$

where $\bar{v}$ and $\bar{u}$ are the mean values of $v$ and $u$, respectively.
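The equivalence of the two definitions can be verified numerically. The sketch below computes the correlation once from covariance and standard deviations and once as the cosine of the angle between the mean-centered vectors; the function names and the toy genotype vectors are hypothetical.

```python
from math import sqrt


def pearson_from_covariance(v, u):
    """Covariance divided by the product of the standard deviations."""
    n = len(v)
    mean_v, mean_u = sum(v) / n, sum(u) / n
    cov = sum((a - mean_v) * (b - mean_u) for a, b in zip(v, u)) / n
    std_v = sqrt(sum((a - mean_v) ** 2 for a in v) / n)
    std_u = sqrt(sum((b - mean_u) ** 2 for b in u) / n)
    return cov / (std_v * std_u)


def cosine_of_centered(v, u):
    """Cosine of the angle between the mean-centered vectors."""
    mean_v, mean_u = sum(v) / len(v), sum(u) / len(u)
    cv = [a - mean_v for a in v]
    cu = [b - mean_u for b in u]
    dot = sum(a * b for a, b in zip(cv, cu))
    return dot / (sqrt(sum(a * a for a in cv)) * sqrt(sum(b * b for b in cu)))


v = [0, 1, 2, 2, 1, 0, 2]
u = [1, 1, 2, 0, 1, 0, 2]
print(pearson_from_covariance(v, u), cosine_of_centered(v, u))
```

The factor 1/n cancels between numerator and denominator, which is why the two functions agree up to rounding.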

SLIDE 60

The lightbulb epistasis algorithm


Algorithm

  1. Binarize the original matrices $A_0$ and $B_0$ into $A$ and $B$ by locality sensitive hashing.
  2. Compute the maximally correlated pair $P_1$ on $\begin{pmatrix} A & A \\ B & 1 - B \end{pmatrix}$ via lightbulb.
  3. Compute the maximally correlated pair $P_2$ on $\begin{pmatrix} A & 1 - A \\ B & B \end{pmatrix}$ via lightbulb.
  4. Report the maximum of $P_1$ and $P_2$.
SLIDE 61

Experiments: Nordborg lab SNP dataset


Results on Nordborg SNP dataset

# SNPs     Measurements    Pairs         Exponent    Speedup    Top 10    Top 100    Top 500    Top 1K
100,000    8,255,645       8,186,657     1.38        611        1.00      0.86       0.82       0.80
100,000    52,762,001      51,732,700    1.54        97         1.00      1.00       0.99       0.98

Runtime
  • Runtime is empirically $O(n^{1.5})$.
  • Epistasis detection on the human genome would require 1 day of computation on a typical desktop PC.

SLIDE 62

Summary: Epistasis detection via sampling


  • We phrase epistasis detection as a difference in correlation problem:

$$ \arg\max_{i,j} \; |\rho_{\mathrm{cases}}(x_i, x_j) - \rho_{\mathrm{controls}}(x_i, x_j)|. \tag{19} $$

  • SNPs have three discrete states. We binarize them by locality sensitive hashing, using a random vector $r$ and checking whether $\langle r, x_i \rangle > 0$.
  • Correlation between binary random variables can be approximated by sampling rows (of patients) via Paturi’s lightbulb algorithm (Paturi et al., COLT 1989).
  • Empirically, the runtime is $O(n^{1.5})$, where $n$ is the number of SNPs. We can search the whole human genome on a single PC in a day.