
Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics Karsten Borgwardt February 25 to March 10 Bioinformatics Group MPIs Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Gene Selection via the BAHSIC Family of Algorithms Le Song NICTA Statistical Machine Learning Program, Australia University of Sydney Joint work with Justin Bedo, Karsten Borgwardt, Arthur Gretton and Alex Smola 25th July 2007 Le Song Gene Selection via the BAHSIC Family of Algorithms

Gene Selection: Reasons
Biological: identify disease-related genes.
Statistical: avoid model overfitting.

Gene Selection: Current State
Small sample size (100+), large number of genes (10,000+).
Lack of robustness: gene lists are not reproducible.
Plethora of feature selectors: which to choose?

Gene Selection: Two components
Selection criterion (e.g. Pearson's correlation, t-statistic, mutual information).
Selection algorithm (e.g. forward greedy method, backward elimination, feature weighting).

Gene Selection: BAHSIC
BAHSIC: BAckward elimination via the Hilbert-Schmidt Independence Criterion.
Key idea: select the genes whose expression levels are most dependent on the phenotype, as measured by HSIC.

HSIC: Hilbert-Schmidt Independence Criterion
HSIC = tr(KHLH), where
K: kernel or similarity matrix on gene expression data
L: kernel or similarity matrix on phenotype information
H: centering matrix, H = I - (1/m) 1 1^T for m samples
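The criterion above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `linear_kernel` and `hsic` are made up for this example, both kernels are taken to be linear, L is assumed symmetric, and the trace is left unnormalized, as on the slide.

```python
def linear_kernel(rows):
    """Gram matrix of inner products between samples (one sample per row)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def hsic(K, L):
    """Unnormalized HSIC, tr(KHLH), for symmetric m x m kernel matrices.

    Uses tr(KHLH) = tr((HKH) L) = sum_ij (HKH)_ij * L_ij, where HKH is K
    with its row means and column means removed (double centering).
    """
    m = len(K)
    row_mean = [sum(r) / m for r in K]
    col_mean = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    total_mean = sum(row_mean) / m
    return sum(
        (K[i][j] - row_mean[i] - col_mean[j] + total_mean) * L[i][j]
        for i in range(m)
        for j in range(m)
    )


# Toy data (hypothetical): one gene measured on four samples.
expression = [[1.0], [2.0], [3.0], [4.0]]
dependent_phenotype = [[1.0], [2.0], [3.0], [4.0]]
unrelated_phenotype = [[1.0], [-1.0], [-1.0], [1.0]]

K = linear_kernel(expression)
print(hsic(K, linear_kernel(dependent_phenotype)))
print(hsic(K, linear_kernel(unrelated_phenotype)))
```

A larger value indicates stronger dependence between expression and phenotype; with linear kernels the criterion only picks up linear dependence.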

BAHSIC: BAckward elimination via HSIC
1. Start with the full set of genes.
2. Find the gene that is least relevant to the phenotype information.
3. Remove this gene.
4. Repeat steps 2 and 3 until only a few genes are left.
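The elimination loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes linear kernels, removes exactly one gene per iteration, and treats the "least relevant" gene as the one whose removal leaves the highest HSIC on the remaining genes. The names `gram`, `hsic` and `bahsic` are hypothetical.

```python
def gram(rows):
    """Linear-kernel Gram matrix between samples (one sample per row)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def hsic(K, L):
    """Unnormalized HSIC tr(KHLH), computed as sum_ij (HKH)_ij * L_ij."""
    m = len(K)
    row = [sum(r) / m for r in K]
    col = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    tot = sum(row) / m
    return sum((K[i][j] - row[i] - col[j] + tot) * L[i][j]
               for i in range(m) for j in range(m))


def bahsic(X, y, n_keep):
    """Backward elimination: X is samples x genes, y is one label per sample.

    Returns (kept gene indices, removed gene indices in removal order).
    """
    L = gram([[label] for label in y])
    kept = list(range(len(X[0])))
    removed = []
    while len(kept) > n_keep:
        def score_without(gene):
            # HSIC between the phenotype and all kept genes except `gene`.
            rest = [j for j in kept if j != gene]
            return hsic(gram([[sample[j] for j in rest] for sample in X]), L)

        # The least relevant gene is the one we lose the least by dropping.
        worst = max(kept, key=score_without)
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


# Toy data (hypothetical): gene 0 tracks the label, gene 1 is unrelated,
# gene 2 is weakly related.
X = [[2.0, 1.0, 1.0],
     [2.0, -1.0, 0.0],
     [-2.0, 1.0, 0.0],
     [-2.0, -1.0, 0.0]]
y = [1.0, 1.0, -1.0, -1.0]
print(bahsic(X, y, n_keep=1))
```

Each iteration recomputes HSIC once per remaining gene, so this naive version scales poorly to 10,000+ genes; removing a fraction of the genes per iteration is one common way to cut the cost.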

BAHSIC Family: tr(KHLH)
Examples: Pearson's correlation, mean difference, kernel mean difference, t-statistic, signal-to-noise ratio (SNR), moderated-t, shrunken centroid, ridge regression, quadratic mutual information, ...

BAHSIC Family: tr(KHLH)
Pearson's correlation:
$r_{xy} = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$
Normalize data and labels by their standard deviations $s_x$ and $s_y$.
Use linear kernels on both domains.
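A short derivation may help to see why linear kernels recover Pearson's correlation. It assumes a single gene with expression vector $x$ and label vector $y$, so that $K = xx^\top$ and $L = yy^\top$; these rank-one forms are an assumption made for the illustration.

```latex
\begin{align*}
\operatorname{tr}(KHLH)
  &= \operatorname{tr}\!\left(x x^\top H\, y y^\top H\right)
   = \left(x^\top H y\right)\left(y^\top H x\right)
   = \left(x^\top H y\right)^2, \\
x^\top H y
  &= \sum_{i=1}^{m} (x_i - \bar{x})\, y_i
   = \sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y}),
  \quad\text{since } \sum_{i=1}^{m} (x_i - \bar{x}) = 0 .
\end{align*}
```

After replacing $x$ by $x / s_x$ and $y$ by $y / s_y$, the right-hand side becomes $r_{xy}^2$ with $r_{xy}$ as defined on the slide, so ranking genes by tr(KHLH) is the same as ranking them by squared correlation.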

BAHSIC Family: tr(KHLH)
Mean difference and variants: $(\bar{x}_+ - \bar{x}_-)^2$
Use $\frac{1}{m_+}$ and $-\frac{1}{m_-}$ as labels.
Use linear kernels on both domains.
E.g. signal-to-noise ratio: normalize by $(s_+ + s_-)$.
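The identity on this slide can be checked numerically. The sketch below builds K, L and H explicitly for one gene and evaluates tr(KHLH) literally; the function name `trace_khlh` and the toy values are hypothetical.

```python
def trace_khlh(x, y):
    """Literal tr(KHLH) for one feature x and labels y with linear kernels."""
    m = len(x)
    # Centering matrix H = I - (1/m) 1 1^T.
    H = [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)] for i in range(m)]
    K = [[a * b for b in x] for a in x]  # K = x x^T
    L = [[a * b for b in y] for a in y]  # L = y y^T

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
                for i in range(m)]

    P = matmul(matmul(K, H), matmul(L, H))
    return sum(P[i][i] for i in range(m))


# Toy expression values for one gene: two positive and three negative samples.
x = [3.0, 5.0, 1.0, 2.0, 0.0]
is_positive = [True, True, False, False, False]

m_pos = sum(is_positive)
m_neg = len(x) - m_pos
# Labels from the slide: 1/m+ for positive samples, -1/m- for negative ones.
y = [1.0 / m_pos if p else -1.0 / m_neg for p in is_positive]

mean_pos = sum(v for v, p in zip(x, is_positive) if p) / m_pos
mean_neg = sum(v for v, p in zip(x, is_positive) if not p) / m_neg

print(trace_khlh(x, y), (mean_pos - mean_neg) ** 2)
```

Both printed values agree up to floating-point rounding, because these labels sum to zero and centering therefore leaves them unchanged.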

BAHSIC Family: tr(KHLH)
[Table: kernel K on gene expression data (rows: Linear, Polynomial, Gaussian, ...) against kernel L on phenotype information (columns: Linear, Polynomial, Gaussian, ...). Most combinations are marked "?".]

Experiments: Linear vs nonlinear features
Insert 10 artificial genes into datasets 9 and 10. Methods: BAHSIC family and others.

Features   Ref.#  pc  snr  pam  t  m-t  lods  lin  RBF  dis  rfe  l1  mi
Linear     9      ✓   ✓    ✓    ✓  ✓    ✓     ✓    ✓    ✓    ✓    ✓   ✓
Linear     10     ✓   ✓    ✓    ✓  ✓    ✓     ✓    ✓    ✓    ✓    ✓   ✓
Nonlinear  9      -   -    -    -  -    -     -    ✓    ✓    -    -   ✓
Nonlinear  10     -   -    -    -  -    -     -    ✓    ✓    -    -   ✓

Experiments: Subtype discovery
[Figure: linear kernel (first row) vs nonlinear kernel (second row) on Dataset 18, Dataset 27 and Dataset 28.]

Experiments: Select top 10 genes for classification
Methods: BAHSIC family and others.

Ref.#  pc    snr   pam   t     m-t   lods  lin   RBF   dis   rfe   l1    mi
ℓ2     16.9  20.9  17.3  43.5  50.5  50.3  13.2  22.9  35.4  26.3  19.7  23.5

[Rows for Ref.# 1 to 15 mark individual methods per dataset; the column positions of the marks are not recoverable.]
(pc=Pearson's correlation, snr=signal-to-noise ratio, pam=shrunken centroid, t=t-statistics, m-t=moderated-t, ...)

Experiments: Robustness of the top 10 genes
Methods: BAHSIC family and others.

Ref.#  pc  snr  pam  t  m-t  lods  lin  RBF  dis  rfe  l1  mi
best   2   1    1    6  10   9     0    2    0    0    0   0

[Rows for Ref.# 1 to 15 mark individual methods per dataset; the column positions of the marks are not recoverable.]
(pc=Pearson's correlation, snr=signal-to-noise ratio, ...)

Experiments: Rule of thumb
Apply a linear kernel in general.
Apply an RBF kernel if nonlinear effects are sought.

Summary
BAHSIC: BAckward elimination using HSIC.
BAHSIC provides a unifying framework for various feature selectors.
BAHSIC provides guidelines for practical gene selection.

The End
Acknowledgement: US National Science Foundation
For more information: http://www.cs.usyd.edu.au/~lesong/

Genome-Wide Association: Single Nucleotide Polymorphisms
Sites in the genome where the DNA sequences of many individuals differ by a single base are called single nucleotide polymorphisms (SNPs).
These genetic variants might be related to various phenotypes, such as disease susceptibility.
Projects:
International HapMap Project: detect human SNPs
1001 Genomes: detect Arabidopsis SNPs

Genome-Wide Association: Instance of Feature Selection
Given individuals, their single nucleotide polymorphisms (SNPs) and their phenotype:
Find the SNPs (X) that correlate most with a particular phenotype (Y), among hundreds of thousands of SNPs for hundreds of individuals.
Genome-wide association studies are a large-scale feature selection problem!

Genome-Wide Association: Challenges in GWA
Challenges:
Large number of SNPs
Number of SNPs >> number of individuals
Multiple hypothesis testing problem
Strategies:
Analysis of variance (ANOVA) for feature ranking
Logistic regression as a wrapper approach to SNP selection
Bonferroni correction for multiple hypothesis testing
Further challenges:
Correlations between groups of SNPs and phenotypes
Complex interactions between SNPs
Relevant to distinguish classes of SNPs?

References and further reading
[1] Nordborg M, Weigel D. Next-generation genetics in plants. Nature 456(7223):720-723, 2008.
[2] International HapMap Consortium. A haplotype map of the human genome. Nature 437(7063):1299-1320, 2005.
