
Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics Karsten Borgwardt February 25 to March 10 Bioinformatics Group MPIs Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Why compare sequences?
Protein sequences:
Proteins are chains of amino acids.
20 different types of amino acids can be found in protein sequences.
A protein sequence changes over time through mutations, deletions, and insertions.
Different protein sequences may diverge from one common ancestor.
Their sequences may differ slightly, yet their function is often conserved.

Why compare sequences?
Biological question: Biologists are interested in the reverse direction: given two protein sequences, is it likely that they originate from the same common ancestor?
Computational challenge: How to measure similarity between two protein sequences, or equivalently, how to measure similarity between two strings.
Kernel challenge: How to measure similarity between two strings via a kernel function. In short: how to define a string kernel.

History of sequence comparison
First phase: Smith-Waterman, BLAST
Second phase: Profiles, Hidden Markov Models
Third phase: PSI-BLAST, SAM-T98
Fourth phase: Kernels

Sequence comparison: Phase 1
Idea: Measure pairwise similarities between sequences with gaps.
Methods:
Smith-Waterman: dynamic programming; high accuracy; slow ($O(n^2)$).
BLAST: faster heuristic alternative with sufficient accuracy; searches for common substrings of fixed length, extends these in both directions, and performs gapped alignment.
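The dynamic programming behind Smith-Waterman can be sketched in a few lines. This is a minimal illustration, assuming a linear gap penalty and made-up scores (`match`, `mismatch`, `gap` are hypothetical parameters; real protein alignment would use a substitution matrix and usually affine gaps). It returns only the best local alignment score, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops visit every cell of a `len(a)` by `len(b)` table once, which is the quadratic cost noted on the slide.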

Sequence comparison: Phase 2
Idea: Collect aggregate statistics from a family of sequences. Compare these statistics to a single unlabeled protein.
Methods:
Hidden Markov Models (HMMs): a Markov process with hidden and observable parameters. The forward algorithm determines the probability that a given sequence is the output of a particular HMM.
Profiles: profiles of sequence families are derived by multiple sequence alignment. A given sequence is compared to this profile.
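As a rough illustration of the forward algorithm mentioned above, the sketch below computes the probability that an HMM emits a given sequence. The two-state model (`TOY_STATES`, `TOY_START`, `TOY_TRANS`, `TOY_EMIT`) is entirely hypothetical and exists only to make the example runnable; it is not a model of any protein family.

```python
# Hypothetical toy HMM: two hidden states, two observable symbols.
TOY_STATES = ("H", "P")
TOY_START = {"H": 0.6, "P": 0.4}
TOY_TRANS = {"H": {"H": 0.7, "P": 0.3}, "P": {"H": 0.4, "P": 0.6}}
TOY_EMIT = {"H": {"h": 0.9, "p": 0.1}, "P": {"h": 0.2, "p": 0.8}}


def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs | HMM), summed over all hidden state paths."""
    if not obs:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())
```

For long sequences the raw probabilities underflow, so practical implementations typically work in log space or rescale `alpha` at each step; that is omitted here for brevity.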

Sequence comparison: Phase 3
Idea: Create single models from database collections of homologous sequences.
Methods:
PSI-BLAST (Position-Specific Iterative BLAST): build a profile from the highest-scoring hits in initial BLAST runs; weight positions according to their degree of conservation; iterate these steps.
SAM-T98 (now SAM-T02): database search with an HMM built from a multiple sequence alignment.

Phase 4: Kernels and SVMs
General idea: Model differences between classes of sequences. Use an SVM classifier to distinguish classes. Use a kernel to measure similarity between strings.
Kernels for protein sequences: SVM-Fisher kernel, composition kernel, motif kernel, string kernel.

SVM-Fisher method
General idea: Combine HMMs and SVMs for sequence classification. Won the best-paper award at ISMB 1999.
Sequence representation: a fixed-length vector whose components are transition and emission probabilities, transformed into a Fisher score.

SVM-Fisher method
Algorithm:
1. Model protein family F as an HMM.
2. Transform query protein X into a fixed-length vector via the HMM.
3. Compute the kernel between X and positive and negative examples of the protein family.
Advantages: allows incorporating prior knowledge; can deal with missing data; is interpretable; outperforms competing methods.

Composition kernels
General idea: Model a sequence by its amino acid content. Bin amino acids with respect to physico-chemical properties.
Sequence representation: a feature vector of amino acid frequencies. Physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, and polarizability.
Useful database: AAindex.
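A composition representation can be sketched as follows. The three-class hydrophobicity grouping in `HYDROPHOBICITY_BINS` is an illustrative assumption and is not taken from the slides or from AAindex; any other binning could be passed in. A plain inner product serves as the kernel here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative grouping only (an assumption, not from the slides).
HYDROPHOBICITY_BINS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def composition_vector(seq):
    """Relative frequency of each of the 20 amino acids in seq."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def binned_composition_vector(seq, bins=HYDROPHOBICITY_BINS):
    """Relative frequency of each property class in seq."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    return [sum(counts[a] for a in members) / len(seq) for members in bins.values()]


def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(x * y for x, y in zip(u, v))
```

For example, `linear_kernel(binned_composition_vector("ILVFMC"), binned_composition_vector("WLVFQC"))` compares the two example proteins used later in the wildcard slide purely by their class frequencies, ignoring residue order.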

Motif kernels
General idea: Conserved motifs in amino acid sequences indicate structural and functional relationships. Model a sequence $s$ as a feature vector $f$ representing motifs: the $i$-th component of $f$ is 1 if and only if $s$ contains the $i$-th motif.
Motif databases: PROSITE, eMOTIFs, BLOCKS+ (combines several databases).
Motifs are generated by manual construction or by multiple sequence alignment.
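The binary motif representation could look like the sketch below. The patterns in `TOY_MOTIFS` are hypothetical regular expressions chosen for illustration; in practice the motifs would come from a motif database, and how they are encoded depends on that database's format.

```python
import re

# Hypothetical toy motifs written as regular expressions.
TOY_MOTIFS = [r"LVF", r"C..C", r"N[^P][ST]"]


def motif_vector(seq, motifs=TOY_MOTIFS):
    """i-th component is 1 if seq contains the i-th motif, else 0."""
    return [1 if re.search(m, seq) else 0 for m in motifs]


def motif_kernel(x, y, motifs=TOY_MOTIFS):
    """Number of motifs that occur in both sequences."""
    fx = motif_vector(x, motifs)
    fy = motif_vector(y, motifs)
    return sum(a * b for a, b in zip(fx, fy))
```

Because the features are binary, the inner product simply counts the motifs shared by the two sequences.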

Pairwise comparison kernels
General idea: Employ the empirical kernel map on Smith-Waterman/BLAST scores.
Advantage: Utilizes decades of practical experience with BLAST.
Disadvantage: High computational cost ($O(m^3)$).
Alleviation: Employ BLAST instead of Smith-Waterman. Use a vectorization set for the empirical map only.
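The empirical kernel map represents each sequence by its vector of similarity scores against a fixed set of reference sequences (the vectorization set) and then takes an inner product of those vectors. The sketch below assumes a simple local alignment score with made-up parameters as a stand-in for Smith-Waterman or BLAST scores; the function names are hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Stand-in pairwise score: best local alignment score, linear gaps."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def empirical_map(x, vectorization_set, score=local_alignment_score):
    """Represent x by its scores against every reference sequence."""
    return [score(x, t) for t in vectorization_set]


def pairwise_kernel(x, y, vectorization_set, score=local_alignment_score):
    """Inner product of the empirical feature vectors of x and y."""
    fx = empirical_map(x, vectorization_set, score)
    fy = empirical_map(y, vectorization_set, score)
    return sum(a * b for a, b in zip(fx, fy))
```

If the vectorization set is the full training set of m sequences, every feature vector has length m and the kernel matrix needs m by m inner products, which is where the cubic cost comes from. A smaller vectorization set shortens the vectors accordingly.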

Phase 4: String Kernels
General idea: Count common substrings in two strings. A substring of length k is a k-mer.
Variations: assign weights to k-mers; allow for mismatches; allow for gaps; include substitutions; include wildcards.

Spectrum Kernel
General idea: For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ is the number of times $\alpha$ occurs in sequence $x$. The l-spectrum feature map is then
$$\Phi^{\mathrm{Spectrum}}_{l}(x) = \left(\phi_\alpha(x)\right)_{\alpha \in \Sigma^l},$$
where $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
The spectrum kernel is the inner product in the feature space defined by this map:
$$k^{\mathrm{Spectrum}}_{l}(x, x') = \left\langle \Phi^{\mathrm{Spectrum}}_{l}(x), \Phi^{\mathrm{Spectrum}}_{l}(x') \right\rangle$$
The more common substrings two sequences contain, the more similar they are deemed to be.

Spectrum Kernel
Principle: The spectrum kernel counts k-mers that occur, with exact matches, in both sequences.
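A direct way to compute the spectrum kernel is to count the l-mers of each sequence and take the inner product of the sparse count vectors. This is a minimal sketch; the function names are hypothetical, and faster implementations based on suffix trees or sorting exist but are not shown.

```python
from collections import Counter


def spectrum(x, l):
    """Sparse l-spectrum of x: maps each l-mer to its number of occurrences."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the l-spectra of x and y."""
    fx, fy = spectrum(x, l), spectrum(y, l)
    return sum(count * fy[alpha] for alpha, count in fx.items())
```

Only l-mers that actually occur are stored, so the $|\Sigma|^l$-dimensional feature vector is never built explicitly.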

Mismatch Kernel
General idea: Do not enforce strictly exact matches. Define the mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:
$$\Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l},$$
where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $\phi_\beta(\alpha) = 0$ otherwise.
For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha)$$
The mismatch kernel is the inner product in the feature space defined by this map:
$$k^{\mathrm{Mismatch}}_{(l,m)}(x, x') = \left\langle \Phi^{\mathrm{Mismatch}}_{(l,m)}(x), \Phi^{\mathrm{Mismatch}}_{(l,m)}(x') \right\rangle$$

Mismatch Kernel
Principle: The mismatch kernel counts common k-mers with at most m mismatches.
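The mismatch feature map can be illustrated by enumerating each l-mer's mismatch neighborhood explicitly. This brute-force sketch is only practical for small l and m, and the function names are hypothetical; it assumes each l-mer contributes 1 to every string within m mismatches of it.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighborhood(alpha, m, alphabet=AMINO_ACIDS):
    """All strings of len(alpha) that differ from alpha in at most m positions."""
    neighbors = {alpha}
    for d in range(1, m + 1):
        for positions in combinations(range(len(alpha)), d):
            for letters in product(alphabet, repeat=d):
                beta = list(alpha)
                for pos, letter in zip(positions, letters):
                    beta[pos] = letter
                neighbors.add("".join(beta))  # the set removes duplicates
    return neighbors


def mismatch_features(x, l, m, alphabet=AMINO_ACIDS):
    """Sum of the neighborhood indicator vectors of all l-mers in x."""
    phi = Counter()
    for i in range(len(x) - l + 1):
        phi.update(mismatch_neighborhood(x[i:i + l], m, alphabet))
    return phi


def mismatch_kernel(x, y, l, m, alphabet=AMINO_ACIDS):
    """Inner product of the (l, m)-mismatch feature vectors."""
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())
```

With `m=0` the neighborhood of an l-mer is just the l-mer itself, so the function reduces to the spectrum kernel.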

Gappy Kernel
General idea: Allow for gaps in common substrings, giving "subsequences". A g-mer $\alpha$ then contributes to all its l-mer subsequences:
$$\Phi^{\mathrm{Gap}}_{(g,l)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$
For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Gap}}_{(g,l)}(\alpha)$$
The gappy kernel is the inner product in the feature space defined by this map:
$$k^{\mathrm{Gap}}_{(g,l)}(x, x') = \left\langle \Phi^{\mathrm{Gap}}_{(g,l)}(x), \Phi^{\mathrm{Gap}}_{(g,l)}(x') \right\rangle$$

Gappy Kernel
Principle: The gappy kernel counts common l-subsequences of g-mers.
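A small sketch of the gappy feature map follows. It assumes that each g-mer contributes 1 to every distinct length-l subsequence it contains (a binary indicator per g-mer); the slides do not spell out this detail, so treat it as an assumption. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Each g-mer of x adds 1 to every distinct l-subsequence it contains."""
    phi = Counter()
    for i in range(len(x) - g + 1):
        gmer = x[i:i + g]
        # combinations keeps the original order, so each tuple is a subsequence
        subsequences = {"".join(chars) for chars in combinations(gmer, l)}
        phi.update(subsequences)
    return phi


def gappy_kernel(x, y, g, l):
    """Inner product of the (g, l)-gappy feature vectors."""
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(count * fy[beta] for beta, count in fx.items())
```

With `g == l` every g-mer has exactly one l-subsequence, itself, so the result matches the spectrum kernel on sequences without repeated l-mers.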

Substitution Kernel
General idea: Replace the mismatch neighborhood by a substitution neighborhood. An l-mer $\alpha = a_1 a_2 \ldots a_l$ then contributes to all l-mers in its substitution neighborhood
$$M_{(l,\sigma)}(\alpha) = \left\{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \right\}$$
For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$$
The substitution kernel is now:
$$k^{\mathrm{Sub}}_{(l,\sigma)}(x, x') = \left\langle \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x), \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x') \right\rangle$$

Substitution Kernel
Principle: The substitution kernel counts common l-mers within the substitution neighborhood.
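The substitution neighborhood can be illustrated on a toy alphabet. The three-letter alphabet `TOY_ALPHABET` and the conditional probabilities in `TOY_P` are invented for this example; real values would be derived from a substitution model. The sketch enumerates all candidate l-mers, which is only feasible for tiny alphabets and small l.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy model: TOY_P[(a, b)] = P(a | b); each column b sums to 1.
TOY_ALPHABET = "ILW"
TOY_P = {
    ("I", "I"): 0.70, ("L", "I"): 0.25, ("W", "I"): 0.05,
    ("I", "L"): 0.25, ("L", "L"): 0.70, ("W", "L"): 0.05,
    ("I", "W"): 0.05, ("L", "W"): 0.05, ("W", "W"): 0.90,
}


def substitution_neighborhood(alpha, sigma, p=TOY_P, alphabet=TOY_ALPHABET):
    """All beta with -sum_i log P(a_i | b_i) < sigma."""
    neighborhood = set()
    for beta in product(alphabet, repeat=len(alpha)):
        cost = -sum(math.log(p[(a, b)]) for a, b in zip(alpha, beta))
        if cost < sigma:
            neighborhood.add("".join(beta))
    return neighborhood


def substitution_features(x, l, sigma, p=TOY_P, alphabet=TOY_ALPHABET):
    """Sum of the neighborhood indicator vectors of all l-mers in x."""
    phi = Counter()
    for i in range(len(x) - l + 1):
        phi.update(substitution_neighborhood(x[i:i + l], sigma, p, alphabet))
    return phi


def substitution_kernel(x, y, l, sigma, p=TOY_P, alphabet=TOY_ALPHABET):
    """Inner product of the (l, sigma)-substitution feature vectors."""
    fx = substitution_features(x, l, sigma, p, alphabet)
    fy = substitution_features(y, l, sigma, p, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())
```

Unlike the mismatch neighborhood, which treats all replacements alike, the threshold on the summed negative log-probabilities admits likely substitutions (here I and L) while excluding unlikely ones (here anything involving W).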

Wildcard Kernels
General idea: Augment the alphabet $\Sigma$ by a wildcard character $*$, giving $\Sigma \cup \{*\}$. Given $\alpha$ from $\Sigma^l$ and $\beta$ from $(\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$, the l-mer $\alpha$ contributes to the l-mer $\beta$ if their non-wildcard characters match.
For a sequence $x$ of any length, the map is then given by
$$\Phi^{\mathrm{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \left(\phi_\beta(\alpha)\right)_{\beta \in W},$$
where $W$ is the set of such patterns $\beta$, $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.

Wildcard Kernel
Principle: The wildcard kernel counts l-mers that match except for wildcards.
Example with l=3, m=1:
Protein A: ILVFMC
ILV: IL*, *LV, I*V
LVF: LV*, *VF, L*F
VFM: VF*, *FM, V*M
FMC: FM*, *MC, F*C
Protein B: WLVFQC
WLV: WL*, *LV, W*V
LVF: LV*, *VF, L*F
VFQ: VF*, *FQ, V*Q
FQC: FQ*, *QC, F*C
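The wildcard feature map for the two example proteins can be sketched as below. Each l-mer contributes weight $\lambda^j$ to every pattern obtained by replacing $j \le m$ of its positions with `*`, including the pattern with no wildcards. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_patterns(alpha, m):
    """Yield (pattern, j) for every pattern matching alpha with j <= m wildcards."""
    for j in range(m + 1):
        for positions in combinations(range(len(alpha)), j):
            beta = list(alpha)
            for pos in positions:
                beta[pos] = "*"
            yield "".join(beta), j


def wildcard_features(x, l, m, lam):
    """Each l-mer of x adds lam**j to every matching pattern with j wildcards."""
    phi = defaultdict(float)
    for i in range(len(x) - l + 1):
        for beta, j in wildcard_patterns(x[i:i + l], m):
            phi[beta] += lam ** j
    return phi


def wildcard_kernel(x, y, l, m, lam):
    """Inner product of the (l, m, lam)-wildcard feature vectors."""
    fx = wildcard_features(x, l, m, lam)
    fy = wildcard_features(y, l, m, lam)
    return sum(value * fy.get(beta, 0.0) for beta, value in fx.items())
```

For ILVFMC and WLVFQC with l=3 and m=1, the shared patterns are the exact 3-mer LVF plus the six one-wildcard patterns *LV, LV*, *VF, L*F, VF*, and F*C, so the kernel value is $1 + 6\lambda^2$.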
