Data Mining in Bioinformatics Day 6: Feature Selection in - - PowerPoint PPT Presentation

data mining in bioinformatics day 6 feature selection in
SMART_READER_LITE
LIVE PREVIEW

Data Mining in Bioinformatics Day 6: Feature Selection in - - PowerPoint PPT Presentation

Data Mining in Bioinformatics Day 6: Feature Selection in Bioinformatics Karsten Borgwardt February 6 to February 17, 2012 Machine Learning & Computational Biology Research Group MPIs Tbingen Karsten Borgwardt: Data Mining in


slide-1
SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 6: Feature Selection in Bioinformatics

Karsten Borgwardt February 6 to February 17, 2012 Machine Learning & Computational Biology Research Group MPIs Tübingen

slide-2
SLIDE 2

Gene Selection via the BAHSIC Family

  • f Algorithms

Le Song

NICTA Statistical Machine Learning Program, Australia University of Sydney Joint work with Justin Bedo, Karsten Borgwardt, Arthur Gretton and Alex Smola 25th July 2007

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-3
SLIDE 3

Gene Selection

Reasons Biological: identify disease related genes. Statistical: avoid model overfitting.

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-4
SLIDE 4

Gene Selection

Current State Small sample size (100+), large number of genes (10,000+) Lack of robustness: gene lists are not reproducible. Plethora of feature selectors: which to choose?

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-5
SLIDE 5

Gene Selection

Two components Selection criterion (eg. Pearson’s correlation, t-statistic, mutual information) Selection algorithm (eg. forward greedy method, backward elimination, feature weighting)

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-6
SLIDE 6

Gene Selection

BAHSIC: BAckward elimination via Hilbert-Schmidt Independence Criterion (BAHSIC) Key Idea: Select genes whose expression levels are most relevant/dependent on the phenotype as measured by HSIC.

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-7
SLIDE 7

HSIC

Hilbert-Schmidt Independence Criterion (HSIC)

tr(KHLH)

K: kernel or similarity matrix on gene expression data L: kernel or similarity matrix on phenotype information H: centering matrix

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-8
SLIDE 8

BAHSIC

BAckward elimination via HSIC(BAHSIC) Start with full set of genes. Find the gene that is the least relevant to phenotype information. Remove this gene. Repeat until a few genes are left.

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-9
SLIDE 9

BAHSIC Family: tr(KHLH)

Examples Pearson’s correlation Mean difference Kernel mean difference t-statistic Signal-to-noise ratio (SNR) Moderated-t Shrunken centroid Ridge regression Quadratic mutual information ...

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-10
SLIDE 10

BAHSIC Family: tr(KHLH)

Pearson’s correlation rxy =

m

i=1(xi−¯

x)(yi−¯ y) sxsy

Normalize data and labels by std. sx and sy. Linear kernels on both domain

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-11
SLIDE 11

BAHSIC Family: tr(KHLH)

Mean difference and variants (¯ x+ − ¯ x−)2 Use

1 m+ and −1 m− as labels.

Linear kernels on both domain

  • Eg. signal-to-noise ratio: normalize by (s+ + s−)

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-12
SLIDE 12

BAHSIC Family: tr(KHLH)

L Linear Polynomial Gaussian · · · K Linear

  • ?

? ? Polynomial

  • ?

? ? Gaussian

  • ?

? ? . . .

  • ?

? ?

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-13
SLIDE 13

Experiments

Linear vs nonlinear features Linear Nonlinear Insert 10 artificial genes into dataset 9 and 10

BAHSIC family Others Ref.# pc snr pam t m-t lods lin RBF dis rfe l1 mi Linear 9

  • 10
  • Nonlinear

9

10

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-14
SLIDE 14

Experiments

Subtype discovery linear (first row) vs nonlinear kernel (second row) Dataset 18 Dataset 27 Dataset 28

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-15
SLIDE 15

Experiments

Select top 10 genes for classification:

BAHSIC family Others Ref.# pc snr pam t m-t lods lin RBF dis rfe l1 mi ℓ2 16.9 20.9 17.3 43.5 50.5 50.3 13.2 22.9 35.4 26.3 19.7 23.5 1

  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • (pc=Pearson’s correlation, snr=signal-to-noise ratio,

pam=shrunken centroid, t=t-statistics, m-t=moderated

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-16
SLIDE 16

Experiments

Robustness of the top 10 genes.

BAHSIC family Others Ref.# pc snr pam t m-t lods lin RBF dis rfe l1 mi best 2 1 1 6 10 9 2 0 0 0 1

  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • (pc=Pearson’s correlation, snr=signal-to-noise ratio,

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-17
SLIDE 17

Experiments

Rule of thumb: Apply a linear kernel in general. Apply a RBF kernel if nonlinear effects are sought.

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-18
SLIDE 18

Summary

BAHSIC: BAckward elimination using HSIC BAHSIC provides a unifying framework for various feature selectors. BAHSIC provides guidelines for practical gene selection.

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-19
SLIDE 19

The End

Acknowledgement US National Science Foundation For more information http://www.cs.usyd.edu.au/∼lesong/

Le Song Gene Selection via the BAHSIC Family of Algorithms

slide-20
SLIDE 20

Two-locus association mapping in subquadratic time

Karsten Borgwardt

Machine Learning and Computational Biology Research Group Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology, T¨ ubingen Eberhard Karls Universit¨ at T¨ ubingen

EMBL Heidelberg October 18, 2011

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 1

slide-21
SLIDE 21

Classic and new questions

◮ Genetics

◮ How does genotypic variation

lead to phenotypic variation?

◮ Can we predict phenotypes

based on the genotype of an individual?

◮ Recent progress

◮ Genotypes can be determined

at an unprecedented level of detail

◮ Phenotypes can be recorded

in an automated manner

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 2

slide-22
SLIDE 22

Phenotype prediction Arabidopsis phenotypes (99-199 plants, 250k SNPs, Atwell et al., 2010)

Phenotype AUCSVM Chlorosis at 22◦C 0.629 ± 0.003 Anthocyanin at 16◦C 0.569 ± 0.003 Anthocyanin at 22◦C 0.609 ± 0.003 Leaf Roll at 10◦C 0.696 ± 0.002 Leaf Roll at 22◦C 0.587 ± 0.004

Why is there room for improvement?

◮ We assume additive effects of SNPs, ignore gene-gene interactions

and gene-environment interactions.

◮ We ignore population structure, that is systematic ancestry

differences of cases and controls.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 3

slide-23
SLIDE 23

Overview

Machine Learning in Statistical Genetics

◮ Accurate genotyping and characterization of individuals:

◮ Population-wide genome sequencing in A. thaliana (Cao et al., Nature

Genetics 2011)

◮ Methylome of A. thaliana (Becker et al., Nature 2011) ◮ Detecting structural variants using NGS

◮ Accurate mapping:

◮ Taking confounders into account (Stegle et al., NIPS 2011, Li et al.,

ISMB 2011)

◮ Taking gene-gene interactions into account (Kam-Thong et al., ISMB

2011; Achlioptas et al., KDD 2011)

◮ Automated phenotyping:

  • 1. Guppy colour pattern and shape extraction (Karaletsos et al., under

review)

  • 2. Silique number estimation in Arabidopsis

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 4

slide-24
SLIDE 24

Genome-wide association mapping

by courtesy of D. Weigel Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 5

slide-25
SLIDE 25

Challenges in two-locus mapping Scale of the problem

◮ Typical datasets include order 105 − 107 SNPs. ◮ Hence we have to consider order 1010 − 1014 SNP pairs. ◮ Enormous multiple hypothesis testing problem. ◮ Enormous computational runtime problem.

Our contribution

◮ We assume binary phenotypes (cases and controls). ◮ Genotypes may be homozygous or heterozygous. ◮ We assume m individuals with n SNPs each. ◮ We define an algorithm that rapidly detects epistatic interactions in a

runtime subquadratic in n (Achlioptas et al., KDD 2011).

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 6

slide-26
SLIDE 26

Common approaches in the literature Exhaustive enumeration

◮ Only with special hardware such as GPU implementations:

EPIBLASTER (Kam-Thong et al., EJHG 2010)

Filtering approaches

◮ Statistical criterion, e.g. SNPs with large main effect (Zhang et al.,

2007)

◮ Biological criterion, e.g. underlying PPI (Emily et al., 2009)

Index structure approaches

◮ fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008) ◮ TEAM, efficient updates of contingency tables (Zhang et al., 2010)

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 7

slide-27
SLIDE 27

Difference in correlation for epistasis detection

◮ We phrase epistasis detection as a difference in correlation problem:

argmax

i,j

|ρcases(xi, xj) − ρcontrols(xi, xj)|. (1)

◮ Different degree of linkage disequilibrium of two loci in cases and

controls

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 8

slide-28
SLIDE 28

The lightbulb approach (Paturi et al., COLT 1989) Maximum correlation

◮ The lightbulb algorithm tackles the maximum correlation problem on

an m × n matrix A with binary entries: argmax

i,j

|ρA(xi, xj)|. (2)

Quadratic runtime algorithm

◮ As in epistasis detection, the problem can be solved by naive

enumeration of all n2 possible solutions.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 9

slide-29
SLIDE 29

The lightbulb approach Lightbulb algorithm

  • 1. Given a binary matrix A with m rows and n columns.
  • 2. Repeat l times:

◮ Sample k rows ◮ Increase a counter for all pairs of columns that match on these k rows.

  • 3. The counters divided by l give an estimate of the correlation

P(xi = xj).

Subquadratic runtime

◮ With probability 1 − n−α, the lightbulb algorithm retrieves the most

correlated pair in O(α n1+ ln p1

ln q2 ln2 n) = O(n(α n ln p1 ln q2 ln2 n)). Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 10

slide-30
SLIDE 30

Difference between the epistasis and lightbulb problem setting Discrepancies

◮ Difference in correlation ◮ SNPs are non-binary in general ◮ Pearson’s correlation coefficient

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 11

slide-31
SLIDE 31

Step 1: Difference in correlation Theorem

◮ Given a matrix of cases A and a matrix of controls B of identical size. ◮ Finding the maximally correlated pair on

A A B 1 − B

  • (3)

◮ and on

A 1 − A B B

  • (4)

◮ is identical to

argmax

i,j

|ρA(xi, xj) − ρB(xi, xj)|. (5)

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 12

slide-32
SLIDE 32

Step 2: Locality sensitive hashing (Charikar, 2002)

Given a collection of vectors in Rm we choose a random vector r from the m-dimensional Gaussian distribution. Corresponding to this vector r, we define a hash function h

r as follows:

h

r(

u) =

  • 1

if r · u ≥ 0 if r · u < 0 (6)

Theorem

For vectors v, u, Pr[h

r(

u) = h

r(

v)] = 1 − θ( u, v) π , where θ is the angle between the two vectors.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 13

slide-33
SLIDE 33

Step 3: Pearson’s correlation coefficient Link between correlation and cosine

Karl Pearson defined the correlation of 2 vectors v, u in Rm as ρ = cov( v, u) σvσu , (7) that is the covariance of the two vectors divided by their standard

  • deviations. An equivalent geometric way to define it is:

ρ = cos( v − ¯ v, u − ¯ u), (8) where ¯ v and ¯ u are the mean value of u and v, respectively.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 14

slide-34
SLIDE 34

The lightbulb epistasis algorithm Algorithm

  • 1. Binarize original matrices A0 and B0 into A and B by locality

sensitive hashing.

  • 2. Compute maximally correlated pair P1 on

A A B 1 − B

  • via

lightbulb.

  • 3. Compute maximally correlated pair P2 on

A 1 − A B B

  • via

lightbulb.

  • 4. Report the maximum of P1 and P2.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 15

slide-35
SLIDE 35

Experiments: Arabidopsis SNP dataset Results on Arabidopsis SNP dataset

# SNPs Measurements Pairs Exponent Speedup Top 10 Top 100 Top 500 Top 1K 100,000 8,255,645 8,186,657 1.38 611 1.00 0.86 0.82 0.80 100,000 52,762,001 51,732,700 1.54 97 1.00 1.00 0.99 0.98

Runtime

◮ Runtime is empirically O(n1.5). ◮ Epistasis detection on the human genome would require 1 day of

computation on a typical desktop PC.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 16

slide-36
SLIDE 36

Experiments: Runtime versus Recall

0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 1.3 1.35 1.4 1.45 1.5 1.55 1.6

Recall among top 1000 SNP pairs (in %) Exponent of runtime (base n) dbgap Schizophrenia dataset Hapsample simulated dataset Arabidopsis thaliana dataset

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 17

slide-37
SLIDE 37

Common approaches in the literature Exhaustive enumeration

◮ Only with special hardware such as GPU implementations:

EPIBLASTER/EPIGPUHSIC/CUDALIN (with T. Kam-Thong and B. M¨ uller-Myhsok)

Filtering approaches

◮ Statistical criterion (Zhang et al., 2007) ◮ Biological criterion (Emily et al., 2009)

Index structure approaches

◮ fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008) ◮ TEAM, efficient updates of contingency tables (Zhang et al., 2010)

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 18

slide-38
SLIDE 38

Engineering approach to epistasis detection

Epistasis detection via GPU

◮ GPUs are heavily optimized for basic matrix operations ◮ Computational power in terms of GPUs much cheaper than CPUs ◮ We exploit the power of GPUs for rapid exhaustive SNP-SNP

interaction detection

◮ EPIBLASTER: Difference in correlation for binary phenotypes

(Kam-Thong et al., EJHG 2010)

◮ EPIGPUHSIC: Kernel-based test for arbitrary phenotypes (Kam-Thong

et al., ISMB 2011)

◮ CUDALIN: Regression model with main effects (Kam-Thong et al.,

submitted)

◮ Available from

http://mlcb.is.tuebingen.mpg.de/Forschung/epistasistools/

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 19

slide-39
SLIDE 39

First finding

by P. S¨ amann

◮ 567 subjects ◮ 1,075,163 SNPs ◮ phenotype: Hippocampus volume ◮ genome-wide significant results (p < 10−12) ◮ near genes involved in cell-cell signaling

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 20

slide-40
SLIDE 40

Algorithmic approach to epistasis detection

Epistasis detection in subquadratic runtime

◮ We phrase epistasis detection as a difference in correlation problem:

argmax

i,j

|ρcases(xi, xj) − ρcontrols(xi, xj)|. (9)

◮ SNPs have three discrete states. We binarize them by locality

sensitive hashing, using a random vector r for checking r, xi > 0.

◮ Correlation between binary random variables can be approximated by

sampling rows (of patients) via Paturi’s lightbulb algorithm (Paturi et al., COLT 1989).

◮ Empirically, the runtime is n1.5, where n is the number of SNPs. ◮ We can search the whole human genome on a single PC in a day.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 21

slide-41
SLIDE 41

Summary

Machine Learning in Statistical Genetics

◮ Accurate genotyping and characterization of individuals:

◮ Population-wide genome sequencing in A. thaliana (Cao et al., Nature

Genetics 2011)

◮ Methylome of A. thaliana (Becker et al., Nature 2011) ◮ Detecting structural variants using NGS

◮ Accurate mapping:

◮ Taking confounders into account (Stegle et al., NIPS 2011, Li et al.,

ISMB 2011)

◮ Taking gene-gene interactions into account (Kam-Thong et al., ISMB

2011; Achlioptas et al., KDD 2011)

◮ Automated phenotyping:

  • 1. Guppy colour pattern and shape extraction (Karaletsos et al., under

review)

  • 2. Silique number estimation in Arabidopsis

◮ Questions or comments? Contact karsten@tue.mpg.de

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 22

slide-42
SLIDE 42

Thank you

Group members:

◮ Panagiotis Achlioptas ◮ Tony Kam-Thong ◮ Chlo´

e-Agathe Azencott

◮ Barbara Rakitsch ◮ Limin Li ◮ Dominik Grimm ◮ Theofanis Karaletsos ◮ Christoph Lippert ◮ Nino Shervashidze ◮ Oliver Stegle ◮ Hyokun Yun

Collaborators:

◮ B. M¨

uller-Myhsok, MPI Psychiatry

◮ L. Cayton, MPI-IS ◮ P. S¨

amann, MPI Psychiatry

◮ B. Sch¨

  • lkopf, MPI-IS

◮ D. Weigel, MPI Dev. Biology ◮ F. Holsboer, MPI Psychiatry

Sponsors:

◮ A.-v.-Humboldt (Chlo´

e)

◮ DFG ◮ Microsoft Research Cambridge ◮ VW (Oliver)

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 23

slide-43
SLIDE 43

Main references

◮ Panagiotis Achlioptas et al. Two-locus association mapping in

subquadratic runtime, KDD 2011.

◮ Tony Kam-Thong et al. Epistasis detection on quantitative

phenotypes by exhaustive enumeration using GPUs, ISMB 2011.

◮ Tony Kam-Thong et al. EPIBLASTER-Fast exhaustive two-locus

epistasis detection strategy using graphical processing units, European Journal of Human Genetics, 2011.

Karsten Borgwardt Two-locus association mapping in subquadratic time October 18, 2011 24