Data Mining in Bioinformatics, Day 5: Classification in Bioinformatics


1. Data Mining in Bioinformatics, Day 5: Classification in Bioinformatics. Karsten Borgwardt, February 6 to February 17, 2012. Machine Learning & Computational Biology Research Group, MPIs Tübingen.

2. Karsten M. Borgwardt: Protein function prediction via graph kernels (ISMB 2005). Joint work with Cheng Soon Ong, S.V.N. Vishwanathan, Stefan Schönauer, Hans-Peter Kriegel and Alex Smola. Ludwig-Maximilians-Universität Munich, Germany, and National ICT Australia, Canberra.

3. Content
Introduction
• The problem: protein function prediction
• The method: Support Vector Machines (SVM)
Our approach to function prediction
• Protein graph model
• Protein graph kernel
• Experimental evaluation
Technique to analyze our graph model
• Hyperkernels
Discussion

4. Current approaches to protein function prediction infer similar function from: similar sequences, similar structures, similar motifs, similar phylogenetic profiles, similar chemical properties, similar interaction partners, and similar surface clefts.

5. Current approaches to protein function prediction: similar sequences, similar structures, similar motifs, similar phylogenetic profiles, similar chemical properties, similar interaction partners, and similar surface clefts all point towards similar function.

6. Support Vector Machines. Are new data points (x) red or black? The blue decision boundary makes it possible to predict the class membership of new data points.
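As a concrete illustration (not part of the slides), here is a minimal scikit-learn sketch that fits a linear SVM on toy two-dimensional data and uses the learned decision boundary to classify a new point; the two Gaussian blobs and all parameter values are arbitrary stand-ins for the red and black points.

```python
# Minimal sketch: linear SVM on toy 2D data, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two Gaussian blobs standing in for the "red" and "black" points
X = np.vstack([rng.normal(loc=-2, scale=1, size=(20, 2)),
               rng.normal(loc=+2, scale=1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

x_new = np.array([[0.5, 1.0]])          # a new, unlabelled data point
print(clf.predict(x_new))               # predicted class membership
print(clf.decision_function(x_new))     # signed distance to the hyperplane
```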

7. Kernel trick: a mapping Φ takes points from input space to feature space, and a kernel function measures similarity there. The kernel trick makes it possible to introduce a separating hyperplane in feature space without computing Φ explicitly.
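A small sketch of the kernel trick in practice: instead of mapping points into feature space explicitly, the SVM is given a matrix of pairwise kernel values. The RBF kernel and the label rule below are illustrative choices, not taken from the slides.

```python
# Sketch: an SVM with a precomputed kernel matrix, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)  # non-linearly separable labels
X_test = rng.normal(size=(10, 5))

K_train = rbf_kernel(X_train, X_train, gamma=0.5)   # k(x_i, x_j) for all training pairs
K_test = rbf_kernel(X_test, X_train, gamma=0.5)     # k(x_new, x_i) against training points

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```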

8. Feature vectors for function prediction, derived from protein structure and/or protein sequence, e.g. Cai et al. (2004), Dobson and Doig (2003):
• hydrophobicity
• polarity
• polarizability
• van der Waals volume
• fraction of amino acid types
• fraction of surface area
• disulphide bonds
• size of largest surface pocket
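To make such feature vectors concrete, here is a hedged sketch of sequence-derived features of the kind listed above. The grouping of hydrophobic residues and the cysteine-based proxy for disulphide bonds are assumptions for illustration, not the exact feature sets of the cited papers.

```python
# Sketch: simple sequence-derived feature vector (composition + two summary features).
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVLIMFWC")   # illustrative grouping, an assumption

def sequence_features(seq: str) -> list[float]:
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    composition = [counts[aa] / n for aa in AMINO_ACIDS]   # fraction of each amino acid type
    hydrophobic_fraction = sum(counts[aa] for aa in HYDROPHOBIC) / n
    cysteine_fraction = counts["C"] / n                    # crude proxy related to disulphide bonds
    return composition + [hydrophobic_fraction, cysteine_fraction]

print(len(sequence_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))  # 22 features
```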

9. Our approach: sequence, structure and chemical properties are combined into a graph model; SVMs on these graph models then predict protein function.

10. Protein graph model (figure: protein structure, secondary structure and protein sequence are combined into a single graph).

11. Protein graph model
Node attributes:
• hydrophobicity
• polarity
• polarizability
• van der Waals volume
• length
• helix, sheet, loop
Edge attributes:
• type (sequence, structure)
• length
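The sketch below shows one way such an attributed graph could be represented in code, using networkx; the node and edge attribute values are placeholders, not real physico-chemical measurements.

```python
# Sketch: a protein graph with attributed SSE nodes and typed edges (assuming networkx).
import networkx as nx

G = nx.Graph()
G.add_node("SSE1", sse_type="helix", hydrophobicity=0.4, polarity=0.2,
           polarizability=0.1, vdw_volume=3.0, length=12)
G.add_node("SSE2", sse_type="sheet", hydrophobicity=0.6, polarity=0.3,
           polarizability=0.2, vdw_volume=2.5, length=7)
G.add_node("SSE3", sse_type="helix", hydrophobicity=0.5, polarity=0.1,
           polarizability=0.3, vdw_volume=2.8, length=10)

# edge type distinguishes neighbours along the sequence from spatial neighbours
G.add_edge("SSE1", "SSE2", etype="sequence", length=1.0)
G.add_edge("SSE1", "SSE3", etype="structure", length=9.5)

print(G.nodes(data=True))
print(G.edges(data=True))
```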

12. Protein graph kernel (Kashima et al. (2003) and Gärtner et al. (2003)): compares walks of identical length l,

k_{walk}((v_1, ..., v_l), (w_1, ..., w_l)) = \sum_{i=1}^{l-1} k_{step}((v_i, v_{i+1}), (w_i, w_{i+1}))

Walks are similar if, along both walks,
• the types of secondary structure elements (SSEs) are the same,
• the distances between SSEs are similar, and
• the chemical properties of the SSEs are similar.
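A hedged sketch of this walk comparison: walks of identical length are compared step by step and the step kernels are summed, as in the formula above. The concrete step kernel below (SSE-type match combined with a capped distance-similarity term) is an illustrative choice, not necessarily the exact kernel used in the paper.

```python
# Sketch: walk kernel as a sum of step kernels over two walks of identical length.
def step_kernel(step_a, step_b, c=3.0):
    """Compare one step (SSE type, edge length, next SSE type) of two walks."""
    (type_a1, dist_a, type_a2), (type_b1, dist_b, type_b2) = step_a, step_b
    type_match = float(type_a1 == type_b1 and type_a2 == type_b2)
    dist_sim = max(0.0, c - abs(dist_a - dist_b))   # illustrative distance similarity
    return type_match * dist_sim

def walk_kernel(walk_a, walk_b):
    """k_walk: sum of step kernels along two walks given as alternating tuples
    (node, edge length, node, ...), e.g. ('H', 10, 'S', 1, 'S', 3, 'H')."""
    assert len(walk_a) == len(walk_b)
    total = 0.0
    for i in range(0, len(walk_a) - 2, 2):
        step_a = (walk_a[i], walk_a[i + 1], walk_a[i + 2])
        step_b = (walk_b[i], walk_b[i + 1], walk_b[i + 2])
        total += step_kernel(step_a, step_b)
    return total

# the two "similar" walks from the example slide below
print(walk_kernel(('H', 10, 'S', 1, 'S', 3, 'H'), ('H', 9, 'S', 1, 'S', 3, 'H')))
```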

13. Example: protein kernel. The walks (H,10,S,1,S,3,H) and (H,9,S,1,S,3,H) in proteins A and B are similar.

14. Example: protein kernel. The walks (H,10,S,1,S) and (S,3,H,5,S) in proteins A and B are dissimilar.

15. Evaluation: enzymes vs. non-enzymes. 10-fold cross-validation on 1128 proteins from the dataset of Dobson and Doig (2003); 59% are enzymes.

Kernel type                      Accuracy (%)   SD
Vector kernel                    76.86          1.23
Optimized vector kernel          80.17          1.24
Graph kernel                     77.30          1.20
Graph kernel without structure   72.33          5.32
Graph kernel with global info    84.04          3.33
DALI classifier                  75.07          4.58
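The sketch below shows the 10-fold cross-validation protocol for an SVM with a precomputed kernel matrix; the kernel, features and labels are random stand-ins so that the code runs, not the actual graph kernel or enzyme data.

```python
# Sketch: stratified 10-fold CV with a precomputed kernel, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))            # stand-in for per-protein features
y = rng.integers(0, 2, size=120)         # stand-in for enzyme / non-enzyme labels
K = rbf_kernel(X, X)                     # stand-in for a precomputed graph kernel matrix

accuracies = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    K_train = K[np.ix_(train_idx, train_idx)]   # kernel among training proteins
    K_test = K[np.ix_(test_idx, train_idx)]     # kernel between test and training proteins
    clf = SVC(kernel="precomputed").fit(K_train, y[train_idx])
    accuracies.append(clf.score(K_test, y[test_idx]))

print(f"accuracy: {np.mean(accuracies):.2%} +/- {np.std(accuracies):.2%}")
```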

16. Attribute selection. Which structural or chemical attribute is most important for correct classification? For this purpose, we employ hyperkernels (Ong et al., 2003). Hyperkernels find an optimal linear combination of input kernel matrices, \sum_{i=1}^{m} \beta_i K_i, that minimizes the training error and fulfils the regularization constraints.

17. Attribute selection, our approach:
• calculate the kernel matrix for 600 proteins on a graph model with only ONE single attribute,
• repeat this for all attributes,
• normalize these kernel matrices,
• determine the hyperkernel combination;
• the weights then reflect the contribution of individual attributes to correct classification.
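A hedged sketch of the normalization and combination step only: per-attribute kernel matrices are normalized and summed with weights β_i. The hyperkernel optimization that actually determines the weights is an optimization problem in the cited work and is replaced here by fixed example weights.

```python
# Sketch: normalising kernel matrices and forming the weighted sum sum_i beta_i K_i.
import numpy as np

def normalize_kernel(K):
    """Cosine-normalise a kernel matrix: K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
n = 50
# stand-ins for per-attribute kernel matrices (one matrix per attribute)
kernels = []
for _ in range(3):
    A = rng.normal(size=(n, n))
    kernels.append(A @ A.T + n * np.eye(n))   # symmetric positive-definite stub

betas = np.array([0.7, 0.3, 0.0])             # example weights, NOT learned here
K_combined = sum(b * normalize_kernel(K) for b, K in zip(betas, kernels))
print(K_combined.shape)
```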

18. Attribute selection: hyperkernel weights per attribute and EC class.

Attribute              EC 1   EC 2   EC 3   EC 4   EC 5   EC 6
Amino acid length      1.00   0.31   1.00   1.00   0.73   0.00
3-bin van der Waals    0.00   0.00   0.00   0.00   0.00   0.00
3-bin Hydrophobicity   0.00   0.00   0.00   0.00   0.00   0.00
3-bin Polarity         0.00   0.01   0.00   0.00   0.00   1.00
3-bin Polarizability   0.00   0.00   0.00   0.00   0.12   0.00
3D length              0.00   0.40   0.00   0.00   0.00   0.00
Total van der Waals    0.00   0.00   0.00   0.00   0.00   0.00
Total Hydrophobicity   0.00   0.13   0.00   0.00   0.01   0.00
Total Polarity         0.00   0.14   0.00   0.00   0.01   0.00
Total Polarizability   0.00   0.01   0.00   0.00   0.13   0.00

19. Discussion
• A novel combined approach to protein function prediction that integrates sequence, structure and chemical information.
• It reaches state-of-the-art classification accuracy with less information, and higher accuracy with the same amount of information.
• Hyperkernels serve to find the most interesting protein characteristics.

20. Discussion
• More detailed graph models (amino acids, atoms) might be more interesting, yet raise computational difficulties (the graphs become too large).
Two directions of future research:
• efficient yet expressive graph kernels for structure,
• integrating more proteomic information, e.g. surface pockets, into our graph model.

21. The End. Thank you! Questions?

22. ARTS: Accurate Recognition of Transcription Starts in human. Sören Sonnenburg, Alexander Zien, Gunnar Rätsch. Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany; Friedrich Miescher Laboratory of the Max Planck Society and Max Planck Institute for Biological Cybernetics, Spemannstr. 37-39, 72076 Tübingen, Germany. Soeren.Sonnenburg@first.fraunhofer.de, {Alexander.Zien,Gunnar.Raetsch}@tuebingen.mpg.de

23. Promoter Detection. Overview:
• Transcription Start Site (TSS)
• Features to describe the TSS
• Our approach
• Evaluation against current methods
• Example: Protocadherin-α
• Summary

24. Promoter Detection. Transcription Start Site: properties
• POL II binds to a rather vague region of ≈ [−20, +20] bp.
• Upstream of the TSS: the promoter, containing transcription factor binding sites.
• Downstream of the TSS: the 5' UTR, and further downstream coding regions and introns (with different statistics).
• The 3D structure of the promoter must allow the transcription factors to bind.
⇒ Promoter prediction is non-trivial.

25. Promoter Detection. Features to describe the TSS:
• TFBS in the promoter region (condition: the DNA should not be too twisted)
• CpG islands (often over the TSS/first exon; in most, but not all, promoters)
• TSS with TATA box (≈ −30 bp, i.e. upstream)
• Exon content in the 5' UTR region
• Distance to the first donor splice site
Idea: combine weak features to build a strong promoter predictor.
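As an illustration of one such weak feature, here is a hedged sketch of a CpG-island score for a window of DNA, using the common GC-content and observed/expected-CpG heuristic; the conventional thresholds (GC > 0.5, obs/exp > 0.6) are assumptions and not the ARTS features themselves.

```python
# Sketch: GC content and observed/expected CpG ratio for a DNA window.
def cpg_island_score(window: str):
    window = window.upper()
    n = len(window)
    g, c = window.count("G"), window.count("C")
    cpg = window.count("CG")                       # observed CpG dinucleotides
    gc_content = (g + c) / n
    expected = (g * c) / n if g and c else 0.0     # expected CpG count
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp

print(cpg_island_score("CGCGGGCCGCTAGCGCGCATCGCG"))
```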

26. Promoter Detection. The ARTS approach: use an SVM classifier

f(x) = \mathrm{sign}\left( \sum_{i=1}^{N_s} y_i \alpha_i \, k(x, x_i) + b \right)

• The key ingredient is the kernel k(x, x'), the similarity of two sequences.
• Use 5 sub-kernels suited to model the aforementioned features:

k(x, x') = k_{TSS}(x, x') + k_{CpG}(x, x') + k_{coding}(x, x') + k_{energy}(x, x') + k_{twist}(x, x')
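A minimal sketch of both ideas on this slide: the combined kernel is a plain sum of sub-kernel matrices, and the decision values f(x) can be recomputed by hand from a fitted SVC via sum_i y_i α_i k(x, x_i) + b. The two sub-kernels below are linear-kernel stubs on random features, not the real ARTS sub-kernels.

```python
# Sketch: summed sub-kernels and the SVM decision function, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F1 = rng.normal(size=(60, 10))        # stand-in features behind sub-kernel 1
F2 = rng.normal(size=(60, 4))         # stand-in features behind sub-kernel 2
y = rng.integers(0, 2, size=60)

K = F1 @ F1.T + F2 @ F2.T             # k(x, x') = k_1(x, x') + k_2(x, x')
clf = SVC(kernel="precomputed").fit(K, y)

# dual_coef_ holds y_i * alpha_i for the support vectors, intercept_ holds b
manual = K[:, clf.support_] @ clf.dual_coef_.ravel() + clf.intercept_[0]
assert np.allclose(manual, clf.decision_function(K))
print("decision function reproduced for all 60 points")
```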

27. Promoter Detection. The 5 sub-kernels:
1. TSS signal (including parts of the core promoter with the TATA box): use the Weighted Degree Shift kernel.
2. CpG islands, distant enhancers and TFBS upstream of the TSS: use a Spectrum kernel (large window upstream of the TSS).
3. Model coding sequence and TFBS downstream of the TSS: use another Spectrum kernel (small window downstream of the TSS).
4. Stacking energy of the DNA: use the stacking energy of dinucleotides with a Linear kernel.
5. Twistedness of the DNA: use the twist angle of dinucleotides with a Linear kernel.
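A hedged sketch of a spectrum kernel of the kind used for sub-kernels 2 and 3: two sequences are compared via the dot product of their k-mer count vectors. The window contents and the choice k = 4 are illustrative, not the ARTS settings.

```python
# Sketch: spectrum kernel as a dot product of k-mer count vectors.
from collections import Counter

def spectrum_kernel(s1: str, s2: str, k: int = 4) -> int:
    kmers1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    kmers2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    # dot product over all k-mers occurring in both sequences
    return sum(count * kmers2[kmer] for kmer, count in kmers1.items())

a = "ACGTACGTGGGCGCGTATAACGT"
b = "TTACGTACGTGGCGCGTATA"
print(spectrum_kernel(a, b, k=4))
```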

28. Promoter Detection. Weighted Degree Shift kernel
Example (figure): k(x_1, x_2) = w_{6,3} + w_{6,-3} + w_{3,4}
• Count matching substrings of length 1 ... d.
• Weight according to the length of the match: β_1 ... β_d.
• Position dependent, but tolerates "shifts" of up to S.

k(x, x') = \sum_{k=1}^{d} \beta_k \sum_{l=1}^{L-k+1} \sum_{s=0,\; s+l \le L}^{S} \delta_s \left( I(x[k:l+s] = x'[k:l]) + I(x[k:l] = x'[k:l+s]) \right)

where x[k:l] denotes the subsequence of x of length k starting at position l.
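A hedged sketch of the Weighted Degree Shift kernel as reconstructed above: matching substrings of length 1..d at (almost) the same position are counted, tolerating shifts of up to S. The weights β_k = 2(d−k+1)/(d(d+1)) and δ_s = 1/(2(s+1)) used here are common choices, stated as assumptions rather than the exact ARTS parametrisation.

```python
# Sketch: Weighted Degree Shift kernel for two equal-length sequences.
def wd_shift_kernel(x: str, xp: str, d: int = 6, S: int = 3) -> float:
    assert len(x) == len(xp)
    L = len(x)
    total = 0.0
    for k in range(1, d + 1):
        beta_k = 2.0 * (d - k + 1) / (d * (d + 1))     # assumed length weighting
        for l in range(L - k + 1):                      # 0-based start positions
            for s in range(0, S + 1):
                if l + s + k > L:                       # shifted substring must fit
                    break
                delta_s = 1.0 / (2 * (s + 1))           # assumed shift penalty
                match = (x[l + s:l + s + k] == xp[l:l + k]) + (x[l:l + k] == xp[l + s:l + s + k])
                total += beta_k * delta_s * match
    return total

print(wd_shift_kernel("ACGTACGTAC", "ACGAACGTAC"))
print(wd_shift_kernel("ACGTACGTAC", "CACGTACGTA"))   # same content, shifted by one position
```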
