Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics Day 6: Feature Selection in - - PowerPoint PPT Presentation
Data Mining in Bioinformatics Day 6: Feature Selection in - - PowerPoint PPT Presentation
Data Mining in Bioinformatics Day 6: Feature Selection in Bioinformatics Karsten Borgwardt February 6 to February 17, 2012 Machine Learning & Computational Biology Research Group MPIs Tbingen Karsten Borgwardt: Data Mining in
Gene Selection via the BAHSIC Family
- f Algorithms
Le Song
NICTA Statistical Machine Learning Program, Australia University of Sydney Joint work with Justin Bedo, Karsten Borgwardt, Arthur Gretton and Alex Smola 25th July 2007
Le Song Gene Selection via the BAHSIC Family of Algorithms
Gene Selection
Reasons Biological: identify disease related genes. Statistical: avoid model overfitting.
Le Song Gene Selection via the BAHSIC Family of Algorithms
Gene Selection
Current State Small sample size (100+), large number of genes (10,000+) Lack of robustness: gene lists are not reproducible. Plethora of feature selectors: which to choose?
Le Song Gene Selection via the BAHSIC Family of Algorithms
Gene Selection
Two components Selection criterion (eg. Pearson’s correlation, t-statistic, mutual information) Selection algorithm (eg. forward greedy method, backward elimination, feature weighting)
Le Song Gene Selection via the BAHSIC Family of Algorithms
Gene Selection
BAHSIC: BAckward elimination via Hilbert-Schmidt Independence Criterion (BAHSIC) Key Idea: Select genes whose expression levels are most relevant/dependent on the phenotype as measured by HSIC.
Le Song Gene Selection via the BAHSIC Family of Algorithms
HSIC
Hilbert-Schmidt Independence Criterion (HSIC)
tr(KHLH)
K: kernel or similarity matrix on gene expression data L: kernel or similarity matrix on phenotype information H: centering matrix
Le Song Gene Selection via the BAHSIC Family of Algorithms
BAHSIC
BAckward elimination via HSIC(BAHSIC) Start with full set of genes. Find the gene that is the least relevant to phenotype information. Remove this gene. Repeat until a few genes are left.
Le Song Gene Selection via the BAHSIC Family of Algorithms
BAHSIC Family: tr(KHLH)
Examples Pearson’s correlation Mean difference Kernel mean difference t-statistic Signal-to-noise ratio (SNR) Moderated-t Shrunken centroid Ridge regression Quadratic mutual information ...
Le Song Gene Selection via the BAHSIC Family of Algorithms
BAHSIC Family: tr(KHLH)
Pearson’s correlation rxy =
m
i=1(xi−¯
x)(yi−¯ y) sxsy
Normalize data and labels by std. sx and sy. Linear kernels on both domain
Le Song Gene Selection via the BAHSIC Family of Algorithms
BAHSIC Family: tr(KHLH)
Mean difference and variants (¯ x+ − ¯ x−)2 Use
1 m+ and −1 m− as labels.
Linear kernels on both domain
- Eg. signal-to-noise ratio: normalize by (s+ + s−)
Le Song Gene Selection via the BAHSIC Family of Algorithms
BAHSIC Family: tr(KHLH)
L Linear Polynomial Gaussian · · · K Linear
- ?
? ? Polynomial
- ?
? ? Gaussian
- ?
? ? . . .
- ?
? ?
Le Song Gene Selection via the BAHSIC Family of Algorithms
Experiments
Linear vs nonlinear features Linear Nonlinear Insert 10 artificial genes into dataset 9 and 10
BAHSIC family Others Ref.# pc snr pam t m-t lods lin RBF dis rfe l1 mi Linear 9
- 10
- Nonlinear
9
10
Le Song Gene Selection via the BAHSIC Family of Algorithms
Experiments
Subtype discovery linear (first row) vs nonlinear kernel (second row) Dataset 18 Dataset 27 Dataset 28
Le Song Gene Selection via the BAHSIC Family of Algorithms
Experiments
Select top 10 genes for classification:
BAHSIC family Others Ref.# pc snr pam t m-t lods lin RBF dis rfe l1 mi ℓ2 16.9 20.9 17.3 43.5 50.5 50.3 13.2 22.9 35.4 26.3 19.7 23.5 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- (pc=Pearson’s correlation, snr=signal-to-noise ratio,
pam=shrunken centroid, t=t-statistics, m-t=moderated
Le Song Gene Selection via the BAHSIC Family of Algorithms
Experiments
Robustness of the top 10 genes.
BAHSIC family Others Ref.# pc snr pam t m-t lods lin RBF dis rfe l1 mi best 2 1 1 6 10 9 2 0 0 0 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- (pc=Pearson’s correlation, snr=signal-to-noise ratio,
Le Song Gene Selection via the BAHSIC Family of Algorithms
Experiments
Rule of thumb: Apply a linear kernel in general. Apply a RBF kernel if nonlinear effects are sought.
Le Song Gene Selection via the BAHSIC Family of Algorithms
Summary
BAHSIC: BAckward elimination using HSIC BAHSIC provides a unifying framework for various feature selectors. BAHSIC provides guidelines for practical gene selection.
Le Song Gene Selection via the BAHSIC Family of Algorithms
The End
Acknowledgement US National Science Foundation For more information http://www.cs.usyd.edu.au/∼lesong/
Le Song Gene Selection via the BAHSIC Family of Algorithms
Two-locus association mapping in subquadratic time
Karsten Borgwardt
Machine Learning and Computational Biology Research Group Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology, T¨ ubingen Eberhard Karls Universit¨ at T¨ ubingen
EMBL Heidelberg October 18, 2011
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 1
Classic and new questions
◮ Genetics
◮ How does genotypic variation
lead to phenotypic variation?
◮ Can we predict phenotypes
based on the genotype of an individual?
◮ Recent progress
◮ Genotypes can be determined
at an unprecedented level of detail
◮ Phenotypes can be recorded
in an automated manner
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 2
Phenotype prediction Arabidopsis phenotypes (99-199 plants, 250k SNPs, Atwell et al., 2010)
Phenotype AUCSVM Chlorosis at 22◦C 0.629 ± 0.003 Anthocyanin at 16◦C 0.569 ± 0.003 Anthocyanin at 22◦C 0.609 ± 0.003 Leaf Roll at 10◦C 0.696 ± 0.002 Leaf Roll at 22◦C 0.587 ± 0.004
Why is there room for improvement?
◮ We assume additive effects of SNPs, ignore gene-gene interactions
and gene-environment interactions.
◮ We ignore population structure, that is systematic ancestry
differences of cases and controls.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 3
Overview
Machine Learning in Statistical Genetics
◮ Accurate genotyping and characterization of individuals:
◮ Population-wide genome sequencing in A. thaliana (Cao et al., Nature
Genetics 2011)
◮ Methylome of A. thaliana (Becker et al., Nature 2011) ◮ Detecting structural variants using NGS
◮ Accurate mapping:
◮ Taking confounders into account (Stegle et al., NIPS 2011, Li et al.,
ISMB 2011)
◮ Taking gene-gene interactions into account (Kam-Thong et al., ISMB
2011; Achlioptas et al., KDD 2011)
◮ Automated phenotyping:
- 1. Guppy colour pattern and shape extraction (Karaletsos et al., under
review)
- 2. Silique number estimation in Arabidopsis
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 4
Genome-wide association mapping
by courtesy of D. Weigel Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 5
Challenges in two-locus mapping Scale of the problem
◮ Typical datasets include order 105 − 107 SNPs. ◮ Hence we have to consider order 1010 − 1014 SNP pairs. ◮ Enormous multiple hypothesis testing problem. ◮ Enormous computational runtime problem.
Our contribution
◮ We assume binary phenotypes (cases and controls). ◮ Genotypes may be homozygous or heterozygous. ◮ We assume m individuals with n SNPs each. ◮ We define an algorithm that rapidly detects epistatic interactions in a
runtime subquadratic in n (Achlioptas et al., KDD 2011).
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 6
Common approaches in the literature Exhaustive enumeration
◮ Only with special hardware such as GPU implementations:
EPIBLASTER (Kam-Thong et al., EJHG 2010)
Filtering approaches
◮ Statistical criterion, e.g. SNPs with large main effect (Zhang et al.,
2007)
◮ Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches
◮ fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008) ◮ TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 7
Difference in correlation for epistasis detection
◮ We phrase epistasis detection as a difference in correlation problem:
argmax
i,j
|ρcases(xi, xj) − ρcontrols(xi, xj)|. (1)
◮ Different degree of linkage disequilibrium of two loci in cases and
controls
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 8
The lightbulb approach (Paturi et al., COLT 1989) Maximum correlation
◮ The lightbulb algorithm tackles the maximum correlation problem on
an m × n matrix A with binary entries: argmax
i,j
|ρA(xi, xj)|. (2)
Quadratic runtime algorithm
◮ As in epistasis detection, the problem can be solved by naive
enumeration of all n2 possible solutions.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 9
The lightbulb approach Lightbulb algorithm
- 1. Given a binary matrix A with m rows and n columns.
- 2. Repeat l times:
◮ Sample k rows ◮ Increase a counter for all pairs of columns that match on these k rows.
- 3. The counters divided by l give an estimate of the correlation
P(xi = xj).
Subquadratic runtime
◮ With probability 1 − n−α, the lightbulb algorithm retrieves the most
correlated pair in O(α n1+ ln p1
ln q2 ln2 n) = O(n(α n ln p1 ln q2 ln2 n)). Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 10
Difference between the epistasis and lightbulb problem setting Discrepancies
◮ Difference in correlation ◮ SNPs are non-binary in general ◮ Pearson’s correlation coefficient
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 11
Step 1: Difference in correlation Theorem
◮ Given a matrix of cases A and a matrix of controls B of identical size. ◮ Finding the maximally correlated pair on
A A B 1 − B
- (3)
◮ and on
A 1 − A B B
- (4)
◮ is identical to
argmax
i,j
|ρA(xi, xj) − ρB(xi, xj)|. (5)
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 12
Step 2: Locality sensitive hashing (Charikar, 2002)
Given a collection of vectors in Rm we choose a random vector r from the m-dimensional Gaussian distribution. Corresponding to this vector r, we define a hash function h
r as follows:
h
r(
u) =
- 1
if r · u ≥ 0 if r · u < 0 (6)
Theorem
For vectors v, u, Pr[h
r(
u) = h
r(
v)] = 1 − θ( u, v) π , where θ is the angle between the two vectors.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 13
Step 3: Pearson’s correlation coefficient Link between correlation and cosine
Karl Pearson defined the correlation of 2 vectors v, u in Rm as ρ = cov( v, u) σvσu , (7) that is the covariance of the two vectors divided by their standard
- deviations. An equivalent geometric way to define it is:
ρ = cos( v − ¯ v, u − ¯ u), (8) where ¯ v and ¯ u are the mean value of u and v, respectively.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 14
The lightbulb epistasis algorithm Algorithm
- 1. Binarize original matrices A0 and B0 into A and B by locality
sensitive hashing.
- 2. Compute maximally correlated pair P1 on
A A B 1 − B
- via
lightbulb.
- 3. Compute maximally correlated pair P2 on
A 1 − A B B
- via
lightbulb.
- 4. Report the maximum of P1 and P2.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 15
Experiments: Arabidopsis SNP dataset Results on Arabidopsis SNP dataset
# SNPs Measurements Pairs Exponent Speedup Top 10 Top 100 Top 500 Top 1K 100,000 8,255,645 8,186,657 1.38 611 1.00 0.86 0.82 0.80 100,000 52,762,001 51,732,700 1.54 97 1.00 1.00 0.99 0.98
Runtime
◮ Runtime is empirically O(n1.5). ◮ Epistasis detection on the human genome would require 1 day of
computation on a typical desktop PC.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 16
Experiments: Runtime versus Recall
0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 1.3 1.35 1.4 1.45 1.5 1.55 1.6
Recall among top 1000 SNP pairs (in %) Exponent of runtime (base n) dbgap Schizophrenia dataset Hapsample simulated dataset Arabidopsis thaliana dataset
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 17
Common approaches in the literature Exhaustive enumeration
◮ Only with special hardware such as GPU implementations:
EPIBLASTER/EPIGPUHSIC/CUDALIN (with T. Kam-Thong and B. M¨ uller-Myhsok)
Filtering approaches
◮ Statistical criterion (Zhang et al., 2007) ◮ Biological criterion (Emily et al., 2009)
Index structure approaches
◮ fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008) ◮ TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 18
Engineering approach to epistasis detection
Epistasis detection via GPU
◮ GPUs are heavily optimized for basic matrix operations ◮ Computational power in terms of GPUs much cheaper than CPUs ◮ We exploit the power of GPUs for rapid exhaustive SNP-SNP
interaction detection
◮ EPIBLASTER: Difference in correlation for binary phenotypes
(Kam-Thong et al., EJHG 2010)
◮ EPIGPUHSIC: Kernel-based test for arbitrary phenotypes (Kam-Thong
et al., ISMB 2011)
◮ CUDALIN: Regression model with main effects (Kam-Thong et al.,
submitted)
◮ Available from
http://mlcb.is.tuebingen.mpg.de/Forschung/epistasistools/
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 19
First finding
by P. S¨ amann
◮ 567 subjects ◮ 1,075,163 SNPs ◮ phenotype: Hippocampus volume ◮ genome-wide significant results (p < 10−12) ◮ near genes involved in cell-cell signaling
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 20
Algorithmic approach to epistasis detection
Epistasis detection in subquadratic runtime
◮ We phrase epistasis detection as a difference in correlation problem:
argmax
i,j
|ρcases(xi, xj) − ρcontrols(xi, xj)|. (9)
◮ SNPs have three discrete states. We binarize them by locality
sensitive hashing, using a random vector r for checking r, xi > 0.
◮ Correlation between binary random variables can be approximated by
sampling rows (of patients) via Paturi’s lightbulb algorithm (Paturi et al., COLT 1989).
◮ Empirically, the runtime is n1.5, where n is the number of SNPs. ◮ We can search the whole human genome on a single PC in a day.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 21
Summary
Machine Learning in Statistical Genetics
◮ Accurate genotyping and characterization of individuals:
◮ Population-wide genome sequencing in A. thaliana (Cao et al., Nature
Genetics 2011)
◮ Methylome of A. thaliana (Becker et al., Nature 2011) ◮ Detecting structural variants using NGS
◮ Accurate mapping:
◮ Taking confounders into account (Stegle et al., NIPS 2011, Li et al.,
ISMB 2011)
◮ Taking gene-gene interactions into account (Kam-Thong et al., ISMB
2011; Achlioptas et al., KDD 2011)
◮ Automated phenotyping:
- 1. Guppy colour pattern and shape extraction (Karaletsos et al., under
review)
- 2. Silique number estimation in Arabidopsis
◮ Questions or comments? Contact karsten@tue.mpg.de
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 22
Thank you
Group members:
◮ Panagiotis Achlioptas ◮ Tony Kam-Thong ◮ Chlo´
e-Agathe Azencott
◮ Barbara Rakitsch ◮ Limin Li ◮ Dominik Grimm ◮ Theofanis Karaletsos ◮ Christoph Lippert ◮ Nino Shervashidze ◮ Oliver Stegle ◮ Hyokun Yun
Collaborators:
◮ B. M¨
uller-Myhsok, MPI Psychiatry
◮ L. Cayton, MPI-IS ◮ P. S¨
amann, MPI Psychiatry
◮ B. Sch¨
- lkopf, MPI-IS
◮ D. Weigel, MPI Dev. Biology ◮ F. Holsboer, MPI Psychiatry
Sponsors:
◮ A.-v.-Humboldt (Chlo´
e)
◮ DFG ◮ Microsoft Research Cambridge ◮ VW (Oliver)
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 23
Main references
◮ Panagiotis Achlioptas et al. Two-locus association mapping in
subquadratic runtime, KDD 2011.
◮ Tony Kam-Thong et al. Epistasis detection on quantitative
phenotypes by exhaustive enumeration using GPUs, ISMB 2011.
◮ Tony Kam-Thong et al. EPIBLASTER-Fast exhaustive two-locus
epistasis detection strategy using graphical processing units, European Journal of Human Genetics, 2011.
Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 24