Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 8: Feature Selection in Bioinformatics
Epistasis detection
Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Group
Max Planck Institutes Tübingen, Germany
Classic and new questions
Genetics:
  How does genotypic variation lead to phenotypic variation?
  Can we predict phenotypes based on the genotype of an individual?
Recent progress:
  Genotypes can be determined at an unprecedented level of detail.
  Phenotypes can be recorded in an automated manner.
Phenotype prediction
Phenotype              AUC (SVM)
Chlorosis at 22°C      0.629 ± 0.003
Anthocyanin at 16°C    0.569 ± 0.003
Anthocyanin at 22°C    0.609 ± 0.003
Leaf Roll at 10°C      0.696 ± 0.002
Leaf Roll at 22°C      0.587 ± 0.004
99-199 plants, 250k SNPs, Atwell et al., Nature 2010
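As a reminder of what the AUC column measures, here is a minimal sketch of the area under the ROC curve in its pairwise (Mann-Whitney) form: the fraction of case/control pairs in which the case receives the higher score. The function name `auc` and the toy scores are illustrative assumptions, not the evaluation code behind the table.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: classifier outputs, higher means "more likely a case".
    labels: 1 for cases, anything else for controls.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # case ranked above control
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 5 of the 6 case/control pairs are ordered correctly.
example_auc = auc([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random ranking, so values around 0.6 leave considerable room for improvement.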
Why is there room for improvement?
  We assume additive effects of SNPs and ignore gene-gene interactions.
  We ignore population structure, that is, systematic ancestry differences between cases and controls.
  We ignore gene-environment interactions.
Gene-gene interactions
Scale of the problem:
  Typical datasets include on the order of $10^5$ to $10^6$ SNPs.
  Hence we have to consider on the order of $10^{10}$ to $10^{12}$ SNP pairs.
  Enormous multiple hypothesis testing problem.
  Enormous computational runtime problem.
Our contribution:
  We assume binary phenotypes (cases and controls).
  Genotypes may be homozygous or heterozygous.
  We assume $m$ individuals with $n$ SNPs each.
  We define an algorithm called epiSVM that rapidly detects epistatic interactions underlying the phenotypes in $O(mn)$.
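The scale argument can be checked with a few lines of arithmetic. The sketch below counts unordered SNP pairs, $n(n-1)/2$, which is within a factor of two of the $n^2$ figure quoted on the slide. The Bonferroni threshold is added only to illustrate the multiple testing burden; the significance level 0.05 and the helper names are assumptions, not part of the lecture.

```python
from math import comb


def snp_pairs(n_snps):
    """Number of unordered SNP pairs among n_snps SNPs: n * (n - 1) / 2."""
    return comb(n_snps, 2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction (illustrative)."""
    return alpha / n_tests


for n in (10**5, 10**6):
    pairs = snp_pairs(n)
    # alpha = 0.05 is an assumed, conventional significance level.
    print(f"n = {n:,} SNPs: {pairs:,} pairs, "
          f"per-test threshold {bonferroni_threshold(0.05, pairs):.1e}")
```

For $n = 10^5$ this gives 4,999,950,000 pairs and for $n = 10^6$ it gives 499,999,500,000 pairs.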
Epistasis detection approaches
Filtering approaches:
  Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
  Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches:
  fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
  TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Exhaustive enumeration:
  Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong, EJHG 2010)
Support Vector Machines
Separating hyperplane classifier (Vapnik & Chervonenkis, 1974)
[Figure: separating hyperplane in a two-dimensional feature space; axes Feature 1 and Feature 2, weight vector w = (1,1).]
SVM: naive approach
Classify case/control by means of pairs of features.
SVM classifier:
$$f(x) = \operatorname{sgn}(\langle w, \phi(x) \rangle + b) = \operatorname{sgn}\Big(\sum_{\gamma} w_\gamma \phi_\gamma(x) + b\Big)$$
Mapping $\phi : X \to X \otimes X$
We have to compute all $n^2$ entries of $w$ to detect the feature pairs with maximum weight. These may be up to $10^{12}$ entries.
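To make the cost of the naive approach concrete, here is a minimal sketch of the pairwise mapping and the resulting classifier. It assumes that $\phi$ consists of all products $x_i x_j$, which is one natural reading of $X \otimes X$; the names `phi_pairs` and `decision` are hypothetical.

```python
from itertools import product


def phi_pairs(x):
    """Map n feature values to all n * n pairwise products (assumed form of phi)."""
    return [xi * xj for xi, xj in product(x, repeat=2)]


def decision(w, b, x):
    """Evaluate f(x) = sgn(<w, phi(x)> + b); w needs one entry per feature pair."""
    score = sum(wg * fg for wg, fg in zip(w, phi_pairs(x))) + b
    return 1 if score >= 0 else -1


x = [0, 1, 2]                 # three SNPs coded 0/1/2
features = phi_pairs(x)       # already 3 * 3 = 9 entries
```

The weight vector has to match the length of `phi_pairs(x)`, so its size grows quadratically with the number of SNPs.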
Can we avoid this?
Three types of epistasis
Based on HJ Cordell, 2002:
[Figure: six two-locus models, D&D, D&R, R&R, R|R, R|D and D|D, each shown as a grid over the genotypes aa(0), Aa(1), AA(2) and bb(0), Bb(1), BB(2).]
Two-state feature space: φepi
[Figure: the same six models, D&D, D&R, R&R, R|R, R|D and D|D, each shown as a grid over the two-state genotypes aa(0), A*(1) and bb(0), B*(1).]
Mapping $\phi_{epi}$ results in $2n$ features that represent $n$ SNPs (not $n^2$!).
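The slide does not spell out how the $2n$ features are built. One plausible reading, sketched below, collapses each 0/1/2 genotype to the two states aa (0) and A* (1) and emits one indicator feature per state. This construction and the names `two_state` and `phi_epi` are assumptions for illustration only; the actual definition of $\phi_{epi}$ may differ.

```python
def two_state(genotype):
    """Collapse a genotype coded 0/1/2 to two states: aa -> 0, Aa or AA (A*) -> 1."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return 0 if genotype == 0 else 1


def phi_epi(genotypes):
    """Map n SNPs to 2n binary features.

    ASSUMPTION: two indicator features per SNP, one for state aa and one for
    state A*. The lecture only states that phi_epi yields 2n features.
    """
    features = []
    for g in genotypes:
        s = two_state(g)
        features.extend([1 - s, s])
    return features


encoded = phi_epi([0, 1, 2])   # 3 SNPs -> 6 features
```

Under this reading, a model restricted to two non-zero weights picks out one state at each of two SNPs, and the feature count grows linearly rather than quadratically in $n$.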
epiSVM
Optimization problem (Rakitsch, Li, B., 2011)
$$\min_{w \in \mathbb{R}^n} \|w\|^2 \qquad (1)$$
subject to $y_i (w \cdot \phi_{epi}(x_i) + b) \geq 1$ and $\|w\|_0 \leq 2$.
ℓ0-Support Vector Machine (Weston et al., 2003)
  Approximation of (1) via repeated application of $\ell_2$-SVM
  Rescale $x$ by pointwise multiplication with $w$
  Empirically, this procedure converges within $h = 20$ iterations
Runtime:
  In each iteration, one has to solve a linear SVM on $m$ individuals and $n$ SNPs, which can be done in $O(mn)$.
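The iterative scheme can be sketched as follows: train an $\ell_2$-SVM, multiply each feature by its learned weight, and repeat. This is a toy illustration under stated assumptions, not the epiSVM implementation. The inner solver is a plain batch subgradient descent on a soft-margin objective, swapped in to keep the example self-contained, and the hard constraint $\|w\|_0 \leq 2$ is not enforced. All function names, the learning rate and the regularization constant are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Toy linear SVM: batch subgradient descent on lam/2 * |w|^2 + mean hinge loss."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                      # hinge loss is active for this sample
                for j in range(n):
                    gw[j] -= yi * xi[j]
                gb -= yi
        w = [wj - lr * (lam * wj + gj / m) for wj, gj in zip(w, gw)]
        b -= lr * gb / m
    return w, b


def l0_svm_scaling(X, y, iterations=20):
    """Repeat: fit an l2-SVM on rescaled data, then multiply the scaling by w."""
    z = [1.0] * len(X[0])                       # per-feature scaling factors
    for _ in range(iterations):
        Xz = [[xj * zj for xj, zj in zip(xi, z)] for xi in X]
        w, _ = train_linear_svm(Xz, y)
        z = [zj * wj for zj, wj in zip(z, w)]   # pointwise multiplication with w
    return z


# Toy data: feature 0 determines the label, feature 1 is uninformative.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, -1, -1]
z = l0_svm_scaling(X, y)
selected = [j for j, zj in enumerate(z) if abs(zj) > 1e-6]
```

Features whose weights stay small are scaled down further in every round, so the scaling vector `z` concentrates on a few features. Here only feature 0 survives.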
Runtime
Power
[Figure: power in % (axis from 20 to 100) of epiSVM and TEAM for the models D&D, D&R, R&R, R|R, R|D and D|D; two panels, $r^2 = 0.7$ and $r^2 = 1.0$.]
Sample size: m = 400, n = 10,000
Confounder-robust SVMs
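Genotypes $x_i$, phenotypes $y_i$ and kinship matrix $\tilde{L}$
Confounder-robust SVM (Li and B., 2011):
$$\min_{w \in \mathbb{R}^n} \|w\|^2 + \lambda \operatorname{tr}(K \tilde{L}) \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1$$
$$K_{ij} = \langle w \odot x_i, w \odot x_j \rangle$$
Equivalent to:
$$\min_{\tilde{w} \in \mathbb{R}^n} \|\tilde{w}\|^2 \quad \text{subject to} \quad y_i (\tilde{w} \cdot \tilde{x}_i + b) \geq 1$$
where
$$\tilde{x}_i(\gamma) = \frac{x_i(\gamma)}{\sqrt{1 + \lambda \sum_{k,l} x_k(\gamma) x_l(\gamma) \tilde{L}_{kl}}}$$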
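The equivalence follows because $\operatorname{tr}(K \tilde{L}) = \sum_\gamma w_\gamma^2 \sum_{k,l} x_k(\gamma) x_l(\gamma) \tilde{L}_{kl}$ for symmetric $\tilde{L}$, so the penalty can be absorbed into a per-feature rescaling, after which any standard linear SVM applies. A minimal sketch of that rescaling, assuming genotypes as a list of rows and $\tilde{L}$ as a nested list (the function name `confounder_rescale` is hypothetical):

```python
from math import sqrt


def confounder_rescale(X, L, lam):
    """Divide each feature gamma by sqrt(1 + lam * sum_{k,l} x_k(gamma) x_l(gamma) L[k][l]).

    X: m rows of n feature values, L: m x m kinship matrix, lam: trade-off lambda.
    """
    m, n = len(X), len(X[0])
    X_tilde = [[0.0] * n for _ in range(m)]
    for g in range(n):
        col = [X[k][g] for k in range(m)]
        # How strongly feature g aligns with the kinship structure.
        s = sum(col[k] * col[l] * L[k][l] for k in range(m) for l in range(m))
        denom = 1.0 + lam * s
        if denom <= 0:
            raise ValueError("1 + lam * s must be positive for the rescaling")
        scale = 1.0 / sqrt(denom)
        for k in range(m):
            X_tilde[k][g] = col[k] * scale
    return X_tilde


# Two individuals from the same subpopulation, two features.
X = [[1, 0], [1, 1]]
L = [[1, 1], [1, 1]]
X_tilde = confounder_rescale(X, L, lam=1.0)
```

Features that vary along the kinship structure receive a large sum and are shrunk most, which is how the method downweights confounded SNPs. With `lam=0` the data are returned unchanged.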
Confounder-robust SVM
Binary Arabidopsis phenotype prediction
$x_i$ = 250k SNPs of plant $i$
$y_i$ = phenotype of plant $i$
$L_{ij} = 1$ if $i$ and $j$ are in the same subpopulation, $0$ otherwise
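The definition of $L$ translates directly into code. The sketch below builds the matrix from one subpopulation label per plant; the labels and the function name `subpopulation_matrix` are made up for illustration.

```python
def subpopulation_matrix(labels):
    """L[i][j] = 1 if individuals i and j share a subpopulation label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Hypothetical labels for three plants.
L = subpopulation_matrix(["north", "south", "north"])
```

The resulting matrix is symmetric with ones on the diagonal, since every plant shares a subpopulation with itself.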
Prediction results:
  5-fold cross-validation
  Linear SVM versus confounder-robust linear SVM
Phenotype              AUC (crSVM)      AUC (SVM)        p (vs. SVM)
Chlorosis at 22°C      0.662 ± 0.003    0.629 ± 0.003    1.7e-16
Anthocyanin at 16°C    0.598 ± 0.002    0.569 ± 0.003    3.0e-16
Anthocyanin at 22°C    0.618 ± 0.003    0.609 ± 0.003    1.0e-02
Leaf Roll at 10°C      0.711 ± 0.002    0.696 ± 0.002    3.0e-06
Leaf Roll at 22°C      0.594 ± 0.004    0.587 ± 0.004    4.0e-03
Summary and Outlook