Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics


SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics

Epistasis detection

Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Group
Max Planck Institutes Tübingen, Germany

SLIDE 2

Classic and new questions

Genetics
- How does genotypic variation lead to phenotypic variation?
- Can we predict phenotypes based on the genotype of an individual?

Recent progress
- Genotypes can be determined at an unprecedented level of detail.
- Phenotypes can be recorded in an automated manner.

SLIDE 3

Phenotype prediction

| Phenotype           | AUC_SVM       |
|---------------------|---------------|
| Chlorosis at 22°C   | 0.629 ± 0.003 |
| Anthocyanin at 16°C | 0.569 ± 0.003 |
| Anthocyanin at 22°C | 0.609 ± 0.003 |
| Leaf Roll at 10°C   | 0.696 ± 0.002 |
| Leaf Roll at 22°C   | 0.587 ± 0.004 |

99-199 plants, 250k SNPs, Atwell et al., Nature 2010

Why is there room for improvement?
- We assume additive effects of SNPs and ignore gene-gene interactions.
- We ignore population structure, that is, systematic ancestry differences between cases and controls.
- We ignore gene-environment interactions.

SLIDE 4

Gene-gene interactions

Scale of the problem
- Typical datasets include on the order of $10^5$ to $10^6$ SNPs.
- Hence we have to consider on the order of $10^{10}$ to $10^{12}$ SNP pairs.
- Enormous multiple hypothesis testing problem.
- Enormous computational runtime problem.

Our contribution
- We assume binary phenotypes (cases and controls).
- Genotypes may be homozygous or heterozygous.
- We assume m individuals with n SNPs each.
- We define an algorithm called epiSVM that rapidly detects epistatic interactions underlying the phenotypes in $O(mn)$.
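To make the scale concrete, the sketch below counts unordered SNP pairs, n(n - 1)/2 (about half of the n² ordered pairs quoted above), and derives a per-test significance threshold. The Bonferroni correction and the 0.05 family-wise error rate are illustrative assumptions, not something the slides specify.

```python
def num_snp_pairs(n_snps: int) -> int:
    """Number of unordered pairs among n_snps SNPs: n * (n - 1) / 2."""
    return n_snps * (n_snps - 1) // 2


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold under Bonferroni correction (assumed here for illustration)."""
    return alpha / n_tests


for n in (10**5, 10**6):
    pairs = num_snp_pairs(n)
    threshold = bonferroni_threshold(pairs)
    print(f"n = {n}: {pairs:.2e} pairs, per-test threshold {threshold:.1e}")
```

Even for the smaller dataset size, every pairwise test has to clear a threshold many orders of magnitude below 0.05, which is the multiple-testing side of the problem; the pair count itself is the runtime side.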

SLIDE 5

Epistasis detection approaches

Filtering approaches
- Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
- Biological criterion, e.g. underlying PPI (Emily et al., 2009)

Index structure approaches
- fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
- TEAM, efficient updates of contingency tables (Zhang et al., 2010)

Exhaustive enumeration
- Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong, EJHG 2010)

SLIDE 6

Support Vector Machines

Separating hyperplane classifier (Vapnik & Chervonenkis, 1974)

[Figure: separating hyperplane with weight vector w = (1,1); axes: Feature 1, Feature 2]
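A separating hyperplane classifier assigns a label by the sign of w · x + b. The minimal sketch below uses the weight vector w = (1,1) from the figure; the offset b = -1 and the function name are hypothetical choices made only to obtain a runnable example.

```python
def hyperplane_classify(x, w=(1.0, 1.0), b=-1.0):
    """Return +1 or -1 depending on which side of the hyperplane w . x + b = 0 the point x lies.

    w = (1, 1) follows the figure; b = -1 is an assumed offset for illustration.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


print(hyperplane_classify((2.0, 2.0)))
print(hyperplane_classify((0.0, 0.0)))
```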

SLIDE 7

SVM: naive approach

Classify case/control by means of pairs of features.

SVM classifier:

$$f(x) = \operatorname{sgn}(\langle w, \phi(x) \rangle + b) = \operatorname{sgn}\Bigl(\sum_{\gamma} w_\gamma \phi_\gamma(x) + b\Bigr)$$

Mapping $\phi : X \to X \otimes X$

We have to compute all $n^2$ entries of w to detect the feature pairs with maximum weight. These may be up to $10^{12}$ entries.

Can we avoid this?
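The quadratic blow-up of the naive mapping can be seen in a few lines. This sketch flattens the tensor product x ⊗ x into one feature per ordered SNP pair; the function name and the toy genotype vector are made up for illustration.

```python
def pairwise_feature_map(x):
    """phi(x) = x (tensor) x, flattened: one feature x[j] * x[k] per ordered pair (j, k)."""
    return [xj * xk for xj in x for xk in x]


genotype = [0, 1, 2, 1]          # toy genotype with n = 4 SNPs
phi = pairwise_feature_map(genotype)
print(len(genotype), "SNPs ->", len(phi), "pair features")
```

A weight vector over these features has n² entries, so for n around 10⁶ SNPs it cannot be enumerated explicitly, which is what motivates the question on this slide.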

SLIDE 8

Three types of epistasis

Based on HJ Cordell, 2002:

[Figure: six 3 × 3 genotype grids, one per interaction model: D&D, D&R, R&R, R|R, R|D, D|D. Axes: aa(0), Aa(1), AA(2) for the first SNP and bb(0), Bb(1), BB(2) for the second.]

SLIDE 9

Two-state feature space: φepi

[Figure: six 2 × 2 grids over the two-state genotypes, one per interaction model: D&D, D&R, R&R, R|R, R|D, D|D. Axes: aa(0), A*(1) for the first SNP and bb(0), B*(1) for the second.]

Mapping $\phi_{epi}$ results in 2n features that represent n SNPs (not in $n^2$ features!).
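The slides do not spell out the two features per SNP. One plausible reading, assumed here, is a dominant indicator (at least one copy of the variant allele) and a recessive indicator (two copies), with "&" and "|" in the model names read as logical AND and OR. Under those assumptions, each of the six models becomes a simple combination of two binary features; all function names below are hypothetical.

```python
def phi_epi(genotypes):
    """Map n genotypes coded 0/1/2 to 2n binary features (assumed encoding)."""
    features = []
    for g in genotypes:
        features.append(1 if g >= 1 else 0)  # dominant state: at least one variant allele
        features.append(1 if g == 2 else 0)  # recessive state: two variant alleles
    return features


def epistasis_model(model, g_a, g_b):
    """Evaluate a model such as 'D&R' or 'R|D' on the genotypes of two SNPs.

    Assumed reading: first letter applies to SNP a, last letter to SNP b,
    '&' means both states are required, '|' means either suffices.
    """
    state = {"D": lambda g: g >= 1, "R": lambda g: g == 2}
    a = state[model[0]](g_a)
    b = state[model[2]](g_b)
    return int(a and b) if model[1] == "&" else int(a or b)


print(phi_epi([0, 1, 2]))
for model in ("D&D", "D&R", "R&R", "R|R", "R|D", "D|D"):
    print(model, epistasis_model(model, 1, 2))
```

With this encoding, an interaction between two SNPs involves only two of the 2n features, which is why a sparsity constraint on the weight vector can stand in for enumerating all pairs.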

SLIDE 10

epiSVM

Optimization problem (Rakitsch, Li, B., 2011)

$$\min_{w \in \mathbb{R}^n} \|w\|^2 \qquad (1)$$

subject to $y_i(w \cdot \phi_{epi}(x_i) + b) \ge 1$ and $\|w\|_0 \le 2$.

ℓ0-Support Vector Machine (Weston et al., 2003)
- Approximation of (1) via repeated application of ℓ2-SVM
- Rescale x by pointwise multiplication with w
- Empirically, this procedure converges within h = 20 iterations

Runtime
- In each iteration, one has to solve a linear SVM on m individuals and n SNPs, which can be done in $O(mn)$.
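The iterative rescaling can be sketched in plain Python. This is an illustration under several assumptions, not the epiSVM implementation: batch subgradient descent on the regularised hinge loss stands in for a proper linear SVM solver, the bias term is omitted, the scaling vector is renormalised after every round for numerical stability, and the toy data are coded ±1. All names and hyperparameters are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stand-in linear SVM: batch subgradient descent on the
    L2-regularised hinge loss, without a bias term. Each epoch costs O(mn)."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        grad = [lam * wj for wj in w]            # gradient of the regulariser
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:                       # sample violates the margin
                for j in range(n):
                    grad[j] -= yi * xi[j] / m
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def l0_svm_scaling(X, y, rounds=20):
    """Approximate an l0-constrained SVM by repeatedly training an l2-SVM
    on rescaled data and multiplying the scaling vector z pointwise by w."""
    z = [1.0] * len(X[0])
    for _ in range(rounds):
        X_scaled = [[xj * zj for xj, zj in zip(xi, z)] for xi in X]
        w = train_linear_svm(X_scaled, y)
        z = [zj * wj for zj, wj in zip(z, w)]    # pointwise rescaling by w
        largest = max(abs(zj) for zj in z)
        if largest == 0.0:
            break
        z = [zj / largest for zj in z]           # renormalise (added assumption)
    return z


# Toy data: feature 0 equals the label, features 1 and 2 are weaker or uninformative.
X_toy = [[1, 1, 1], [1, 1, -1], [-1, 1, 1], [-1, -1, -1]]
y_toy = [1, 1, -1, -1]
z = l0_svm_scaling(X_toy, y_toy)
ranking = sorted(range(len(z)), key=lambda j: -abs(z[j]))
print("feature ranking by |z|:", ranking)
```

Features whose weights are small in one round are scaled down before the next, so their influence shrinks multiplicatively while the informative feature stays at the top of the ranking. The default of 20 rounds mirrors the h = 20 iterations mentioned above.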

SLIDE 11

Runtime

SLIDE 12

Power

[Figure: power in % (axis ticks 20 to 100) of epiSVM and TEAM for the interaction models D&D, D&R, R&R, R|R, R|D, D|D; two panels, $r^2 = 0.7$ and $r^2 = 1.0$]

Sample size: m = 400, n = 10,000

SLIDE 13

Confounder-robust SVMs

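Genotypes $x_i$, phenotypes $y_i$ and kinship matrix $\tilde{L}$

Confounder-robust SVM (Li and B., 2011):

$$\min_{w \in \mathbb{R}^n} \|w\|^2 + \lambda \operatorname{tr}(K\tilde{L}) \quad \text{subject to } y_i(w \cdot x_i + b) \ge 1$$

where $K_{ij} = \langle w \odot x_i, w \odot x_j \rangle$.

Equivalent to:

$$\min_{\tilde{w} \in \mathbb{R}^n} \|\tilde{w}\|^2 \quad \text{subject to } y_i(\tilde{w} \cdot \tilde{x}_i + b) \ge 1$$

where

$$\tilde{x}_i(\gamma) = \frac{x_i(\gamma)}{\sqrt{1 + \lambda \sum_{j,k} x_j(\gamma)\, x_k(\gamma)\, \tilde{L}_{jk}}}$$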
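The equivalence follows from expanding the trace term. A short derivation, assuming that $\tilde{L}$ is symmetric and that the bracketed factor below is positive for every feature $\gamma$, so that the square root is real and the substitution is invertible:

```latex
\begin{aligned}
\operatorname{tr}(K\tilde{L})
  &= \sum_{j,k} K_{jk}\,\tilde{L}_{jk}
   = \sum_{\gamma} w_\gamma^2 \sum_{j,k} x_j(\gamma)\,x_k(\gamma)\,\tilde{L}_{jk}, \\
\|w\|^2 + \lambda \operatorname{tr}(K\tilde{L})
  &= \sum_{\gamma} w_\gamma^2 \Bigl(1 + \lambda \sum_{j,k} x_j(\gamma)\,x_k(\gamma)\,\tilde{L}_{jk}\Bigr)
   = \|\tilde{w}\|^2, \\
\tilde{w}_\gamma
  &= w_\gamma \sqrt{1 + \lambda \textstyle\sum_{j,k} x_j(\gamma)\,x_k(\gamma)\,\tilde{L}_{jk}},
  \qquad \tilde{w} \cdot \tilde{x}_i = w \cdot x_i .
\end{aligned}
```

The objective therefore becomes a plain squared norm in $\tilde{w}$, and the constraints are unchanged because the scaling of $\tilde{w}$ cancels against the inverse scaling of $\tilde{x}_i$. The confounder correction thus reduces to a per-feature rescaling followed by a standard linear SVM.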

SLIDE 14

Confounder-robust SVM

Binary Arabidopsis phenotype prediction

$x_i$ = 250k SNPs of plant i

$y_i$ = phenotype of plant i

$L_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are in the same subpopulation} \\ 0 & \text{otherwise} \end{cases}$
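A minimal sketch of how these definitions feed into the confounder-robust SVM: it builds L from subpopulation labels and divides each feature γ by sqrt(1 + λ Σ_{j,k} x_j(γ) x_k(γ) L_jk), the per-feature rescaling used by the equivalent formulation on the previous slide. Using L directly in place of the kinship matrix, the function names, the toy data and λ = 1 are all assumptions made for illustration.

```python
import math


def same_subpopulation_matrix(labels):
    """L[i][j] = 1 if individuals i and j share a subpopulation label, else 0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def rescale_features(X, L, lam):
    """Divide feature g of every individual by sqrt(1 + lam * sum_jk x_j(g) x_k(g) L[j][k])."""
    m, n = len(X), len(X[0])
    X_tilde = [list(row) for row in X]
    for g in range(n):
        column = [X[i][g] for i in range(m)]
        s = sum(column[j] * column[k] * L[j][k] for j in range(m) for k in range(m))
        scale = math.sqrt(1.0 + lam * s)
        for i in range(m):
            X_tilde[i][g] = X[i][g] / scale
    return X_tilde


# Toy data: 3 plants, 2 SNPs; plants 0 and 1 belong to subpopulation "a".
X_toy = [[1, 0], [1, 1], [0, 1]]
L_toy = same_subpopulation_matrix(["a", "a", "b"])
for row in rescale_features(X_toy, L_toy, lam=1.0):
    print([round(v, 3) for v in row])
```

In the toy data the first SNP is carried only by the two plants from subpopulation "a", so it is shrunk more strongly than the second SNP, which is spread across both subpopulations. A standard linear SVM trained on the rescaled features then penalises population-aligned SNPs more heavily.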

Prediction results
- 5-fold cross-validation
- Linear SVM versus confounder-robust linear SVM

| Phenotype           | AUC_crSVM     | AUC_SVM       | p_SVM   |
|---------------------|---------------|---------------|---------|
| Chlorosis at 22°C   | 0.662 ± 0.003 | 0.629 ± 0.003 | 1.7e-16 |
| Anthocyanin at 16°C | 0.598 ± 0.002 | 0.569 ± 0.003 | 3.0e-16 |
| Anthocyanin at 22°C | 0.618 ± 0.003 | 0.609 ± 0.003 | 1.0e-02 |
| Leaf Roll at 10°C   | 0.711 ± 0.002 | 0.696 ± 0.002 | 3.0e-06 |
| Leaf Roll at 22°C   | 0.594 ± 0.004 | 0.587 ± 0.004 | 4.0e-03 |

SLIDE 15

Summary and Outlook

Computational challenges in genetics
- Gene-gene interactions
- Correction for population structure
- Analysing structured phenotypes such as images or time series

Further current topics in the group
- Confounder(-gene interaction) correction
- Active learning for optimized phenotyping
- Feature extraction from image phenotypes
- Feature selection in structured spaces
- Comparing networks efficiently
- Soon: Functional annotation of SNPs
- Soon: Webtool for association studies