Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics



SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics

Karsten Borgwardt
March 1 to March 12, 2010
Machine Learning & Computational Biology Research Group
MPIs Tübingen

SLIDE 2

Gene Selection via the BAHSIC Family of Algorithms

Le Song

NICTA Statistical Machine Learning Program, Australia
University of Sydney
Joint work with Justin Bedo, Karsten Borgwardt, Arthur Gretton and Alex Smola
25th July 2007

Le Song Gene Selection via the BAHSIC Family of Algorithms

SLIDE 3

Gene Selection

Reasons
  • Biological: identify disease-related genes.
  • Statistical: avoid model overfitting.

SLIDE 4

Gene Selection

Current State
  • Small sample size (100+), large number of genes (10,000+)
  • Lack of robustness: gene lists are not reproducible.
  • Plethora of feature selectors: which to choose?

SLIDE 5

Gene Selection

Two components
  • Selection criterion (e.g. Pearson’s correlation, t-statistic, mutual information)
  • Selection algorithm (e.g. forward greedy method, backward elimination, feature weighting)

SLIDE 6

Gene Selection

BAHSIC: BAckward elimination via Hilbert-Schmidt Independence Criterion

Key idea: select genes whose expression levels are most relevant to (that is, most dependent on) the phenotype, as measured by HSIC.

SLIDE 7

HSIC

Hilbert-Schmidt Independence Criterion (HSIC)

tr(KHLH)

  • K: kernel or similarity matrix on gene expression data
  • L: kernel or similarity matrix on phenotype information
  • H: centering matrix
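As a rough illustration of the quantity above, the following sketch computes tr(KHLH) with NumPy. The toy data, the choice of linear kernels, and the helper names `centering_matrix` and `hsic` are assumptions made for this example and do not come from the slides. The centering matrix is taken to be the standard H = I - (1/m) 1 1^T, and any constant normalization factor depending on m is left out, as on the slide.

```python
import numpy as np

def centering_matrix(m):
    """Return H = I - (1/m) * ones, which removes the mean when applied."""
    return np.eye(m) - np.ones((m, m)) / m

def hsic(K, L):
    """Unnormalized statistic tr(KHLH) for two m x m kernel matrices."""
    m = K.shape[0]
    H = centering_matrix(m)
    return np.trace(K @ H @ L @ H)

# Toy data (made up): 6 samples, 3 genes, binary phenotype coded as +1/-1.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

K = X @ X.T          # linear kernel on the expression data
L = np.outer(y, y)   # linear kernel on the phenotype labels
score = hsic(K, L)
print(score)
```

With linear kernels the score reduces to the squared norm of the centered cross-product between X and y, so larger values indicate stronger linear dependence between expression and phenotype.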

SLIDE 8

BAHSIC

BAckward elimination via HSIC (BAHSIC)
  • Start with the full set of genes.
  • Find the gene that is least relevant to the phenotype information.
  • Remove this gene.
  • Repeat until only a few genes are left.
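The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses linear kernels, removes exactly one gene per round as the slide describes, and treats "least relevant" as the gene whose removal leaves tr(KHLH) highest. The function names and the toy data are hypothetical.

```python
import numpy as np

def hsic_linear(X, y):
    """tr(KHLH) with linear kernels K = X X^T and L = y y^T."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    K = X @ X.T
    L = np.outer(y, y)
    return np.trace(K @ H @ L @ H)

def backward_elimination(X, y, n_keep):
    """Drop one gene per round: the one whose removal leaves HSIC highest."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = []
        for g in remaining:
            subset = [j for j in remaining if j != g]
            scores.append(hsic_linear(X[:, subset], y))
        # Removing the least relevant gene reduces the score the least.
        least_relevant = remaining[int(np.argmax(scores))]
        remaining.remove(least_relevant)
    return remaining

# Toy data (made up): gene 0 tracks the phenotype, the others are noise.
rng = np.random.default_rng(0)
y = np.array([1.0] * 10 + [-1.0] * 10)
X = rng.normal(size=(20, 5))
X[:, 0] = 3.0 * y + 0.1 * rng.normal(size=20)

selected = backward_elimination(X, y, n_keep=1)
print(selected)
```

Each round recomputes the criterion once per remaining gene, so the cost grows quadratically with the number of genes; on real expression data one would likely remove several genes per round.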

SLIDE 9

BAHSIC Family: tr(KHLH)

Examples
  • Pearson’s correlation
  • Mean difference
  • Kernel mean difference
  • t-statistic
  • Signal-to-noise ratio (SNR)
  • Moderated-t
  • Shrunken centroid
  • Ridge regression
  • Quadratic mutual information
  • ...

SLIDE 10

BAHSIC Family: tr(KHLH)

Pearson’s correlation

$$r_{xy} = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$

  • Normalize data and labels by their standard deviations $s_x$ and $s_y$.
  • Use linear kernels on both domains.
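A quick numerical check of this correspondence, as a sketch on made-up data: after dividing a single gene x and the labels y by their standard deviations, tr(KHLH) with linear kernels equals the square of the expression above. Using the sample standard deviation (`ddof=1`) is an assumption of this example; under that choice the expression differs from the conventional Pearson coefficient only by the constant factor m - 1, which does not affect how genes are ranked.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
x = rng.normal(size=m)             # expression of one gene (toy data)
y = 0.5 * x + rng.normal(size=m)   # continuous label (toy data)

# Normalize by the sample standard deviations (ddof=1 is a choice made here).
s_x, s_y = x.std(ddof=1), y.std(ddof=1)
xs, ys = x / s_x, y / s_y

H = np.eye(m) - np.ones((m, m)) / m
K = np.outer(xs, xs)   # linear kernel on the normalized gene
L = np.outer(ys, ys)   # linear kernel on the normalized labels
hsic_value = np.trace(K @ H @ L @ H)

# Sum of centered products divided by s_x * s_y, as in the formula above.
r_slide = np.sum((x - x.mean()) * (y - y.mean())) / (s_x * s_y)

# Conventional Pearson coefficient, which includes a 1 / (m - 1) factor.
r_conventional = np.corrcoef(x, y)[0, 1]

print(hsic_value, r_slide ** 2, r_conventional)
```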

SLIDE 11

BAHSIC Family: tr(KHLH)

Mean difference and variants

$$(\bar{x}_+ - \bar{x}_-)^2$$

  • Use $\frac{1}{m_+}$ and $-\frac{1}{m_-}$ as labels.
  • Linear kernels on both domains
  • E.g. signal-to-noise ratio: normalize by $(s_+ + s_-)$
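The label coding can be verified numerically. The sketch below, on made-up data with hypothetical variable names, assigns 1/m+ to positive samples and -1/m- to negative samples and checks that tr(KHLH) with linear kernels equals the squared difference of the class means. The labels sum to zero, so centering leaves them unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
x_pos = rng.normal(loc=1.0, size=8)    # one gene, positive class (toy data)
x_neg = rng.normal(loc=0.0, size=12)   # same gene, negative class (toy data)
x = np.concatenate([x_pos, x_neg])

m_pos, m_neg = len(x_pos), len(x_neg)
# Labels 1/m+ for the positive class and -1/m- for the negative class.
y = np.concatenate([np.full(m_pos, 1.0 / m_pos),
                    np.full(m_neg, -1.0 / m_neg)])

m = m_pos + m_neg
H = np.eye(m) - np.ones((m, m)) / m
hsic_value = np.trace(np.outer(x, x) @ H @ np.outer(y, y) @ H)

mean_diff_sq = (x_pos.mean() - x_neg.mean()) ** 2
print(hsic_value, mean_diff_sq)
```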

SLIDE 12

BAHSIC Family: tr(KHLH)

Rows: kernel K on the data. Columns: kernel L on the labels.

K \ L         Linear   Polynomial   Gaussian   ...
Linear          •          ?           ?        ?
Polynomial      •          ?           ?        ?
Gaussian        •          ?           ?        ?
...             •          ?           ?        ?

SLIDE 13

Experiments

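Linear vs nonlinear features: insert 10 artificial genes (linear or nonlinear) into datasets 9 and 10.

[Table: rows are Linear (datasets 9 and 10) and Nonlinear (datasets 9 and 10); columns are pc, snr, pam, t, m-t, lods, lin, RBF, dis, rfe, l1, mi, grouped under "BAHSIC family" and "Others". Cell entries not recoverable.]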

SLIDE 14

Experiments

Subtype discovery: linear kernel (first row) vs nonlinear kernel (second row)

[Figure: panels for Dataset 18, Dataset 27 and Dataset 28]

SLIDE 15

Experiments

Select top 10 genes for classification:

Column groups: BAHSIC family, Others

Ref.#   pc     snr    pam    t      m-t    lods   lin    RBF    dis    rfe    l1     mi
ℓ2      16.9   20.9   17.3   43.5   50.5   50.3   13.2   22.9   35.4   26.3   19.7   23.5

[Rows for Ref.# 1 to 15 not recoverable.]

(pc=Pearson’s correlation, snr=signal-to-noise ratio, pam=shrunken centroid, t=t-statistics, m-t=moderated ...)

SLIDE 16

Experiments

Robustness of the top 10 genes.

Column groups: BAHSIC family, Others

Ref.#: pc snr pam t m-t lods lin RBF dis rfe l1 mi
best: 2 1 1 6 10 9 2 0 0 0 1

[Rows for Ref.# 1 to 15 not recoverable.]

(pc=Pearson’s correlation, snr=signal-to-noise ratio, ...)

SLIDE 17

Experiments

Rule of thumb:
  • Apply a linear kernel in general.
  • Apply an RBF kernel if nonlinear effects are sought.
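To see why the kernel choice matters, consider a purely nonlinear relationship. In the sketch below (toy data; the helper names and the bandwidth sigma = 1 are assumptions of this example), y = x^2 on a symmetric grid has zero linear covariance with x, so tr(KHLH) with linear kernels is essentially zero, while the same statistic with RBF kernels is clearly positive.

```python
import numpy as np

def rbf_kernel(v, sigma=1.0):
    """Gaussian (RBF) kernel matrix for a 1-D array v."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(K, L):
    """Unnormalized statistic tr(KHLH)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H)

# Symmetric grid, so x and x**2 are uncorrelated but fully dependent.
x = np.linspace(-1.0, 1.0, 41)
y = x ** 2

linear_value = hsic(np.outer(x, x), np.outer(y, y))
rbf_value = hsic(rbf_kernel(x), rbf_kernel(y))
print(linear_value, rbf_value)
```

In practice the RBF bandwidth would need to be chosen for the data at hand; the value used here is only a placeholder.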

SLIDE 18

Summary

  • BAHSIC: BAckward elimination using HSIC
  • BAHSIC provides a unifying framework for various feature selectors.
  • BAHSIC provides guidelines for practical gene selection.

SLIDE 19

The End

Acknowledgement
  • US National Science Foundation

For more information
  • http://www.cs.usyd.edu.au/~lesong/

SLIDE 20

Genome-Wide Association


Single Nucleotide Polymorphisms
  • Sites in the genome where the DNA sequences of many individuals differ by a single base are called single nucleotide polymorphisms (SNPs).
  • These genetic variants might be related to various phenotypes such as disease susceptibility.

Projects
  • International HapMap Project: detect human SNPs
  • 1001 Genomes: detect Arabidopsis SNPs

SLIDE 21

Genome-Wide Association


Instance of Feature Selection
  • Given individuals, their single nucleotide polymorphisms (SNPs) and their phenotype:
  • Find the SNPs (X) that correlate most with a particular phenotype (Y) among hundreds of thousands of SNPs for hundreds of individuals.
  • Genome-wide association studies are a large-scale feature selection problem!
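Read as a feature ranking task, the problem above can be sketched in a few lines. Everything in this example is an assumption made for illustration: the SNPs are simulated and coded as 0/1/2 allele counts, the phenotype is continuous, a single SNP (index 42) is made causal, and squared Pearson correlation is used as the ranking criterion.

```python
import numpy as np

rng = np.random.default_rng(3)
n_individuals, n_snps = 200, 1000

# Simulated genotypes, coded as allele counts 0, 1 or 2 (an assumption).
X = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)

# Hypothetical phenotype driven by one SNP plus noise.
causal = 42
y = X[:, causal] + 0.5 * rng.normal(size=n_individuals)

# Pearson correlation of every SNP with the phenotype, computed at once.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Rank SNPs from most to least correlated (by squared correlation).
ranking = np.argsort(-r ** 2)
print(ranking[:5])
```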

SLIDE 22

Genome-Wide Association


Challenges in GWA
  • Large number of SNPs
  • Number of SNPs >> number of individuals
  • Multiple hypothesis testing problem

Strategies
  • Analysis of variance (ANOVA) for feature ranking
  • Logistic regression as a wrapper approach to SNP selection
  • Bonferroni correction for multiple hypothesis testing

Challenges
  • Correlations between groups of SNPs and phenotypes
  • Complex interactions between SNPs
  • Relevant to distinguish classes of SNPs?
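The Bonferroni correction mentioned above divides the significance level by the number of tests. A minimal sketch, with hypothetical p-values and a hypothetical helper name:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of tests with p < alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values for five SNPs; the threshold is 0.05 / 5 = 0.01.
p_values = [0.001, 0.04, 0.0099, 0.2, 0.011]
hits = bonferroni_significant(p_values)
print(hits)

# At GWA scale the threshold becomes very strict, e.g. for 500,000 SNPs:
gwa_threshold = 0.05 / 500_000
print(gwa_threshold)
```

The second calculation shows why multiple testing is listed as a challenge: with hundreds of thousands of SNPs, a single SNP needs a p-value on the order of 1e-7 to survive the correction.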

SLIDE 23

References and further reading


References

[1] Nordborg M and Weigel D. Next-generation genetics in plants. Nature 456(7223):720-3, 2008 Dec 11

[2] International HapMap Consortium. A haplotype map of the human genome. Nature 437(7063):1299-320, 2005 Oct 27