Data Mining in Bioinformatics
Day 6: Feature Selection in Bioinformatics
Karsten Borgwardt
February 6 to February 17, 2012
Machine Learning & Computational Biology Research Group, MPIs Tübingen

Gene Selection via the BAHSIC Family of Algorithms
Le Song
NICTA Statistical Machine Learning Program, Australia
University of Sydney
Joint work with Justin Bedo, Karsten Borgwardt, Arthur Gretton and Alex Smola
25th July 2007

Gene Selection: Reasons
Biological: identify disease-related genes.
Statistical: avoid model overfitting.

Gene Selection: Current State
Small sample size (100+), large number of genes (10,000+).
Lack of robustness: gene lists are not reproducible.
Plethora of feature selectors: which to choose?

Gene Selection: Two components
Selection criterion (e.g. Pearson's correlation, t-statistic, mutual information).
Selection algorithm (e.g. forward greedy method, backward elimination, feature weighting).

Gene Selection: BAHSIC
BAckward elimination via Hilbert-Schmidt Independence Criterion (BAHSIC).
Key idea: select the genes whose expression levels are most relevant to (most dependent on) the phenotype, as measured by HSIC.

HSIC
Hilbert-Schmidt Independence Criterion (HSIC): tr(KHLH)
K: kernel or similarity matrix on gene expression data
L: kernel or similarity matrix on phenotype information
H: centering matrix
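The statistic tr(KHLH) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the usual centering matrix H = I - (1/m) 11^T for m samples (the slide does not spell H out), omits any normalization constant, and uses a made-up toy dataset. The function name `hsic` is hypothetical.

```python
import numpy as np

def hsic(K, L):
    """Unnormalized HSIC statistic tr(KHLH) for two m x m kernel matrices."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m  # centering matrix (assumed form)
    return float(np.trace(K @ H @ L @ H))

# Toy data (invented for illustration): 6 samples, 3 genes, binary phenotype.
X = np.array([[1.0, 0.2, 3.1],
              [0.9, 0.1, 2.9],
              [1.1, 0.3, 3.0],
              [2.0, 0.2, 3.2],
              [2.1, 0.1, 2.8],
              [1.9, 0.3, 3.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

K = X @ X.T         # linear kernel on gene expression
L = np.outer(y, y)  # linear kernel on phenotype
value = hsic(K, L)
print(value)
```

In this toy example only the first gene differs between the two phenotype groups, so it alone contributes to the statistic.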

BAHSIC
BAckward elimination via HSIC (BAHSIC):
1. Start with the full set of genes.
2. Find the gene that is least relevant to the phenotype information.
3. Remove this gene.
4. Repeat until only a few genes are left.
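The elimination loop described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: it uses linear kernels, removes exactly one gene per pass, and treats "least relevant" as "the gene whose removal leaves tr(KHLH) highest". The slide gives no further details of the published algorithm, so the function names `hsic` and `backward_elimination` and the toy data are hypothetical.

```python
import numpy as np

def hsic(K, L):
    """Unnormalized HSIC statistic tr(KHLH); H is the assumed centering matrix."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(K @ H @ L @ H))

def backward_elimination(X, y, n_keep):
    """Greedy backward elimination; returns the indices of the retained genes."""
    L = np.outer(y, y)                  # linear kernel on the phenotype
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = []
        for g in remaining:
            rest = [j for j in remaining if j != g]
            Xr = X[:, rest]
            scores.append(hsic(Xr @ Xr.T, L))  # dependence left without gene g
        # The least relevant gene is the one whose removal keeps HSIC highest.
        drop = remaining[int(np.argmax(scores))]
        remaining.remove(drop)
    return remaining

# Toy data (invented): gene 0 tracks the phenotype, gene 1 weakly, gene 2 not at all.
X = np.array([[0.0, 1.0, 0.5],
              [0.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 0.0, 0.5],
              [1.0, 1.0, 0.5],
              [1.0, 0.0, 0.5]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
selected = backward_elimination(X, y, n_keep=1)
print(selected)
```

Each pass recomputes the statistic once per remaining gene, so this naive version is only practical for small gene sets.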

BAHSIC Family: tr(KHLH)
Examples:
Pearson's correlation
Mean difference
Kernel mean difference
t-statistic
Signal-to-noise ratio (SNR)
Moderated-t
Shrunken centroid
Ridge regression
Quadratic mutual information
...

BAHSIC Family: tr(KHLH)
Pearson's correlation:
$$ r_{xy} = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} $$
Normalize data and labels by their standard deviations $s_x$ and $s_y$.
Linear kernels on both domains.
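A small numerical check of this correspondence, as a sketch: with linear kernels on data and labels scaled by their standard deviations, tr(KHLH) reduces to the square of the sum in the formula above. The example assumes the population standard deviation (NumPy's default `std`), in which case the sum equals m times the usual Pearson coefficient. The data values are invented.

```python
import numpy as np

# Invented expression values for one gene and a continuous label.
x = np.array([2.1, 3.4, 1.9, 5.0, 4.2])
y = np.array([1.0, 2.0, 1.5, 3.5, 3.0])
m = len(x)

# Normalize by the (population) standard deviations s_x and s_y.
xs = x / x.std()
ys = y / y.std()

H = np.eye(m) - np.ones((m, m)) / m  # centering matrix (assumed form)
K = np.outer(xs, xs)                 # linear kernel on the gene
L = np.outer(ys, ys)                 # linear kernel on the label
hsic_value = float(np.trace(K @ H @ L @ H))

r = float(np.corrcoef(x, y)[0, 1])   # standard Pearson coefficient
print(hsic_value, (m * r) ** 2)      # the two numbers agree
```

Ranking genes by this quantity is therefore equivalent to ranking them by squared Pearson correlation.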

BAHSIC Family: tr(KHLH)
Mean difference and variants:
$$ (\bar{x}_+ - \bar{x}_-)^2 $$
Use $\frac{1}{m_+}$ and $-\frac{1}{m_-}$ as labels.
Linear kernels on both domains.
E.g. signal-to-noise ratio: normalize by $(s_+ + s_-)$.
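The label trick on this slide can be verified numerically. In this sketch, samples in the positive class receive the label 1/m+ and samples in the negative class receive -1/m-; with linear kernels, tr(KHLH) then equals the squared difference of the class means. The data values and variable names are invented for illustration.

```python
import numpy as np

# Invented expression values of one gene; last two samples are the positive class.
x = np.array([1.0, 2.0, 3.0, 6.0, 8.0])
is_pos = np.array([False, False, False, True, True])

m_pos = int(is_pos.sum())
m_neg = int((~is_pos).sum())
y = np.where(is_pos, 1.0 / m_pos, -1.0 / m_neg)  # labels 1/m+ and -1/m-

m = len(x)
H = np.eye(m) - np.ones((m, m)) / m              # centering matrix (assumed form)
hsic_value = float(np.trace(np.outer(x, x) @ H @ np.outer(y, y) @ H))

mean_diff_sq = float((x[is_pos].mean() - x[~is_pos].mean()) ** 2)
print(hsic_value, mean_diff_sq)                  # the two numbers agree
```

The labels sum to zero, so centering leaves them unchanged and the inner product of x with y is exactly the difference of the two class means.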

BAHSIC Family: tr(KHLH)
[Table: kernel K on gene expression (rows) against kernel L on phenotype (columns), each ranging over linear, polynomial, Gaussian, ...; the cell entries did not survive extraction.]

Experiments: Linear vs nonlinear features
Insert 10 artificial genes into datasets 9 and 10.
BAHSIC family: pc, snr, pam, t, m-t, lods, lin, RBF, dis. Others: rfe, l1, mi.

Features   Ref.#  pc   snr  pam  t    m-t  lods lin  RBF  dis  rfe  l1   mi
Linear     9      ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
Linear     10     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
Nonlinear  9      -    -    -    -    -    -    -    ✓    ✓    -    -    ✓
Nonlinear  10     -    -    -    -    -    -    -    ✓    ✓    -    -    ✓

Experiments: Subtype discovery
[Figure: linear kernel (first row) vs nonlinear kernel (second row) for Dataset 18, Dataset 27 and Dataset 28]

Experiments: Select top 10 genes for classification
BAHSIC family: pc, snr, pam, t, m-t, lods, lin, RBF, dis. Others: rfe, l1, mi.

Method  pc    snr   pam   t     m-t   lods  lin   RBF   dis   rfe   l1    mi
ℓ2      16.9  20.9  17.3  43.5  50.5  50.3  13.2  22.9  35.4  26.3  19.7  23.5

[Rows for reference datasets 1 to 15 mark individual methods; the column positions of these marks did not survive extraction.]
(pc=Pearson's correlation, snr=signal-to-noise ratio, pam=shrunken centroid, t=t-statistics, m-t=moderated ...

Experiments: Robustness of the top 10 genes
BAHSIC family: pc, snr, pam, t, m-t, lods, lin, RBF, dis. Others: rfe, l1, mi.

Method  pc    snr   pam   t     m-t   lods  lin   RBF   dis   rfe   l1    mi
best    2     1     1     6     10    9     0     2     0     0     0     0

[Rows for reference datasets 1 to 15 mark individual methods; the column positions of these marks did not survive extraction.]
(pc=Pearson's correlation, snr=signal-to-noise ratio, ...

Experiments: Rule of thumb
Apply a linear kernel in general.
Apply an RBF kernel if nonlinear effects are sought.
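A toy illustration of why the kernel choice matters, offered as a sketch rather than a reproduction of the experiments: for a purely quadratic relation y = x^2 on a symmetric grid, a linear kernel on x yields an HSIC statistic of essentially zero, while an RBF kernel picks up the dependence. The bandwidth `sigma=0.5`, the grid, and the helper names `hsic` and `rbf_kernel` are arbitrary assumptions.

```python
import numpy as np

def hsic(K, L):
    """Unnormalized HSIC statistic tr(KHLH); H is the assumed centering matrix."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(K @ H @ L @ H))

def rbf_kernel(x, sigma):
    """Gaussian (RBF) kernel matrix for a 1-D feature vector."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

x = np.linspace(-1.0, 1.0, 21)  # symmetric grid, so x and x**2 are uncorrelated
y = x ** 2                      # purely nonlinear (quadratic) effect
L = np.outer(y, y)              # linear kernel on the label

hsic_linear = hsic(np.outer(x, x), L)
hsic_rbf = hsic(rbf_kernel(x, sigma=0.5), L)
print(hsic_linear, hsic_rbf)
```

The linear value is zero up to floating-point error because the covariance between x and x^2 vanishes on a symmetric grid; the RBF value is strictly positive.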

Summary
BAHSIC: BAckward elimination using HSIC.
BAHSIC provides a unifying framework for various feature selectors.
BAHSIC provides guidelines for practical gene selection.

The End
Acknowledgement: US National Science Foundation
For more information: http://www.cs.usyd.edu.au/~lesong/

Two-locus association mapping in subquadratic time
Karsten Borgwardt
Machine Learning and Computational Biology Research Group
Max Planck Institute for Intelligent Systems & Max Planck Institute for Developmental Biology, Tübingen
Eberhard Karls Universität Tübingen
EMBL Heidelberg, October 18, 2011

Classic and new questions
◮ Genetics
  ◮ How does genotypic variation lead to phenotypic variation?
  ◮ Can we predict phenotypes based on the genotype of an individual?
◮ Recent progress
  ◮ Genotypes can be determined at an unprecedented level of detail
  ◮ Phenotypes can be recorded in an automated manner

Phenotype prediction
Arabidopsis phenotypes (99-199 plants, 250k SNPs, Atwell et al., 2010)

Phenotype              AUC (SVM)
Chlorosis at 22 °C     0.629 ± 0.003
Anthocyanin at 16 °C   0.569 ± 0.003
Anthocyanin at 22 °C   0.609 ± 0.003
Leaf Roll at 10 °C     0.696 ± 0.002
Leaf Roll at 22 °C     0.587 ± 0.004

Why is there room for improvement?
◮ We assume additive effects of SNPs, ignore gene-gene interactions and gene-environment interactions.
◮ We ignore population structure, that is, systematic ancestry differences of cases and controls.

Overview: Machine Learning in Statistical Genetics
◮ Accurate genotyping and characterization of individuals:
  ◮ Population-wide genome sequencing in A. thaliana (Cao et al., Nature Genetics 2011)
  ◮ Methylome of A. thaliana (Becker et al., Nature 2011)
  ◮ Detecting structural variants using NGS
◮ Accurate mapping:
  ◮ Taking confounders into account (Stegle et al., NIPS 2011; Li et al., ISMB 2011)
  ◮ Taking gene-gene interactions into account (Kam-Thong et al., ISMB 2011; Achlioptas et al., KDD 2011)
◮ Automated phenotyping:
  1. Guppy colour pattern and shape extraction (Karaletsos et al., under review)
  2. Silique number estimation in Arabidopsis

Genome-wide association mapping
[Figure, by courtesy of D. Weigel]

Challenges in two-locus mapping
Scale of the problem
◮ Typical datasets include on the order of $10^5$ to $10^7$ SNPs.
◮ Hence we have to consider on the order of $10^{10}$ to $10^{14}$ SNP pairs.
◮ Enormous multiple hypothesis testing problem.
◮ Enormous computational runtime problem.
Our contribution
◮ We assume binary phenotypes (cases and controls).
◮ Genotypes may be homozygous or heterozygous.
◮ We assume $m$ individuals with $n$ SNPs each.
◮ We define an algorithm that rapidly detects epistatic interactions in a runtime subquadratic in $n$ (Achlioptas et al., KDD 2011).
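The scale quoted on this slide follows from simple counting: n SNPs give n(n-1)/2 unordered pairs, which is roughly n^2/2. A short sketch of the arithmetic (the helper name `n_pairs` is hypothetical):

```python
from math import comb

def n_pairs(n_snps):
    """Number of unordered SNP pairs among n_snps SNPs, i.e. n(n-1)/2."""
    return comb(n_snps, 2)

for n in (10**5, 10**7):
    print(f"{n:>10,d} SNPs -> {n_pairs(n):.2e} pairs")
```

This yields about 5 x 10^9 pairs for 10^5 SNPs and about 5 x 10^13 pairs for 10^7 SNPs, within a factor of two of the n^2 orders of magnitude given on the slide.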

Common approaches in the literature
Exhaustive enumeration
◮ Only with special hardware such as GPU implementations: EPIBLASTER (Kam-Thong et al., EJHG 2010)
Filtering approaches
◮ Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
◮ Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches
◮ fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
◮ TEAM, efficient updates of contingency tables (Zhang et al., 2010)

Difference in correlation for epistasis detection
◮ We phrase epistasis detection as a difference in correlation problem:
$$ \operatorname*{argmax}_{i,j} \; \left| \rho_{\text{cases}}(x_i, x_j) - \rho_{\text{controls}}(x_i, x_j) \right| \qquad (1) $$
◮ Different degree of linkage disequilibrium of two loci in cases and controls
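For reference, objective (1) can be evaluated by brute force over all SNP pairs. The sketch below is the naive quadratic baseline, not the subquadratic algorithm of Achlioptas et al.; it assumes genotypes coded numerically as 0/1/2 and SNPs that vary within both groups (otherwise the correlation is undefined). The function name `max_correlation_difference` and the toy genotypes are invented.

```python
import numpy as np

def max_correlation_difference(X, is_case):
    """Return (i, j, value) maximizing |rho_cases - rho_controls| over SNP pairs.

    X is an m x n genotype matrix; is_case is a boolean vector of length m.
    """
    rho_cases = np.corrcoef(X[is_case], rowvar=False)
    rho_controls = np.corrcoef(X[~is_case], rowvar=False)
    diff = np.abs(rho_cases - rho_controls)
    iu = np.triu_indices(X.shape[1], k=1)  # all pairs with i < j
    values = diff[iu]
    best = int(np.argmax(values))
    return int(iu[0][best]), int(iu[1][best]), float(values[best])

# Toy genotypes (invented): SNPs 0 and 1 agree in cases but oppose in controls.
X = np.array([[0, 0, 1],
              [1, 1, 0],
              [2, 2, 1],
              [1, 1, 2],   # rows 0-3: cases
              [0, 2, 1],
              [1, 1, 0],
              [2, 0, 1],
              [1, 1, 2]],  # rows 4-7: controls
             dtype=float)
is_case = np.array([True] * 4 + [False] * 4)

i, j, value = max_correlation_difference(X, is_case)
print(i, j, value)
```

Building both full correlation matrices costs time and memory quadratic in the number of SNPs, which is exactly what becomes infeasible at the scales discussed above.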
