Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 9: String & Text Mining in Bioinformatics
Karsten Borgwardt
February 25 to March 10
Bioinformatics Group, MPIs Tübingen
Why compare sequences?
Protein sequences
  Proteins are chains of amino acids.
  20 different types of amino acids can be found in protein sequences.
  Protein sequences change over time through mutations, deletions, and insertions.
  Different protein sequences may diverge from one common ancestor.
  Their sequences may differ slightly, yet their function is often conserved.
Why compare sequences?
Biological Question
  Biologists are interested in the reverse direction:
  Given two protein sequences, is it likely that they originate from the same common ancestor?
Computational Challenge
  How to measure similarity between two protein sequences, or equivalently:
  How to measure similarity between two strings
Kernel Challenge
  How to measure similarity between two strings via a kernel function
  In short: How to define a string kernel
History of sequence comparison
First phase
  Smith-Waterman
  BLAST
Second phase
  Profiles
  Hidden Markov Models
Third phase
  PSI-BLAST
  SAM-T98
Fourth phase
  Kernels
Sequence comparison: Phase 1
Idea
  Measure pairwise similarities between sequences with gaps
Methods
  Smith-Waterman
    dynamic programming
    high accuracy
    slow (O(n^2))
  BLAST
    faster heuristic alternative with sufficient accuracy
    searches common substrings of fixed length
    extends these in both directions
    performs gapped alignment
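The dynamic programming behind Smith-Waterman can be sketched in a few lines. This is a minimal illustration only: the scoring values (match = 2, mismatch = -1, linear gap = -1) are assumptions made for the example, whereas practical tools use amino acid substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring parameters are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman("ILVFMC", "WLVFQC"))
```

The two nested loops over both sequences are what give the quadratic running time noted on the slide.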
Sequence comparison: Phase 2
Idea
  Collect aggregate statistics from a family of sequences
  Compare these statistics to a single unlabeled protein
Methods
  Hidden Markov Models (HMMs)
    Markov process with hidden and observable parameters
    Forward algorithm determines the probability that a given sequence is the output of a particular HMM
  Profiles
    Profiles of sequence families are derived by multiple sequence alignment
    A given sequence is compared to this profile
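As a rough sketch of the forward algorithm, the function below computes the probability that a non-empty observation sequence is emitted by a small discrete HMM. The two-state model and all of its probabilities are hypothetical toy values chosen for illustration; a profile HMM for a protein family would be far larger.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs | HMM) via the forward recursion; obs must be non-empty."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model with two hidden states and a two-letter alphabet
STATES = ("H", "L")
START = {"H": 0.6, "L": 0.4}
TRANS = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {"H": {"A": 0.2, "G": 0.8}, "L": {"A": 0.9, "G": 0.1}}

print(forward_probability("AGGA", STATES, START, TRANS, EMIT))
```

Summing over hidden states at every step avoids enumerating all state paths, so the cost grows linearly with sequence length.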
Sequence comparison: Phase 3
Idea
  Create single models from database collections of homologous sequences
Methods
  PSI-BLAST
    Position-specific iterative BLAST
    Profile from highest-scoring hits in initial BLAST runs
    Position weighting according to degree of conservation
    Iteration of these steps
  SAM-T98, now SAM-T02
    database search with HMM from multiple sequence alignment
Phase 4: Kernels and SVMs
General idea
  Model differences between classes of sequences
  Use SVM classifier to distinguish classes
  Use kernel to measure similarity between strings
Kernels for Protein Sequences
  SVM-Fisher kernel
  Composition kernel
  Motif kernel
  String kernel
SVM-Fisher method
General idea
  Combine HMMs and SVMs for sequence classification
  Won best-paper award at ISMB 1999
Sequence representation
  fixed-length vector
  components are transition and emission probabilities
  transformation into Fisher score
SVM-Fisher method
Algorithm
  Model protein family F as HMM
  Transform query protein X into fixed-length vector via HMM
  Compute kernel between X and positive and negative examples of the protein family
Advantages
  allows prior knowledge to be incorporated
  can deal with missing data
  is interpretable
  outperforms competing methods
Composition kernels
General idea
  Model sequence by amino acid content
  Bin amino acids w.r.t. physico-chemical properties
Sequence representation
  feature vector of amino acid frequencies
  physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
  useful database: AAindex
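The simplest form of this representation, a vector of the 20 amino acid frequencies compared by an inner product, might look as follows. This sketch leaves out the binning by physico-chemical properties; the function names are illustrative, not taken from any library.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_vector(seq):
    """Relative frequency of each amino acid in seq (unknown letters ignored)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for aa in seq:
        if aa in counts:
            counts[aa] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def composition_kernel(x, y):
    """Inner product of the two frequency vectors."""
    return sum(a * b for a, b in zip(composition_vector(x), composition_vector(y)))


print(composition_kernel("ILVFMC", "WLVFQC"))
```

Because only content is counted, two sequences with the same letters in a different order receive the same feature vector.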
Motif kernels
General idea
  Conserved motifs in amino acid sequences indicate structural and functional relationships
  Model sequence s as a feature vector f representing motifs
  i-th component of f is 1 ⇔ s contains i-th motif
Motif databases
  PROSITE
  eMOTIFs
  BLOCKS+
    combines several databases
Generated by
  manual construction
  multiple sequence alignment
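The binary motif feature vector can be sketched with regular expressions standing in for motif database entries. The three patterns below are placeholders chosen for illustration and are not presented as real database records; in practice the motif list would come from a resource such as PROSITE.

```python
import re

# Illustrative placeholder patterns (assumptions), written as regular expressions
MOTIFS = [r"N[^P][ST][^P]", r"C..C", r"LVF"]


def motif_vector(seq, motifs=MOTIFS):
    """i-th component is 1 if seq contains the i-th motif, else 0."""
    return [1 if re.search(m, seq) else 0 for m in motifs]


def motif_kernel(x, y, motifs=MOTIFS):
    """Number of motifs shared by x and y (inner product of binary vectors)."""
    return sum(a * b for a, b in zip(motif_vector(x, motifs), motif_vector(y, motifs)))


print(motif_vector("ACAACNGSA"))
print(motif_kernel("ILVFMC", "WLVFQC"))
```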
Pairwise comparison kernels
General idea
  Employ empirical kernel map on Smith-Waterman/BLAST scores
Advantage
  Utilizes decades of practical experience with BLAST
Disadvantage
  High computational cost (O(m^3))
Alleviation
  Employ BLAST instead of Smith-Waterman
  Use vectorization set for empirical map only
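One way to read the empirical kernel map: each sequence is represented by its vector of alignment scores against a fixed vectorization set, and the kernel is the inner product of two such vectors. The sketch below uses a simplified local alignment score with assumed parameters (match = 2, mismatch = -1, gap = -1) as a stand-in for real Smith-Waterman or BLAST scores.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-1):
    """Simplified local alignment score (assumed scoring, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def empirical_map(x, vectorization_set):
    """Represent x by its scores against every sequence in the vectorization set."""
    return [local_score(x, r) for r in vectorization_set]


def pairwise_kernel(x, y, vectorization_set):
    """Inner product of the two score vectors."""
    fx = empirical_map(x, vectorization_set)
    fy = empirical_map(y, vectorization_set)
    return sum(a * b for a, b in zip(fx, fy))


refs = ["ILVFMC", "WLVFQC"]
print(empirical_map("ILVFMC", refs))
print(pairwise_kernel("ILVFMC", "WLVFQC", refs))
```

Every kernel evaluation requires one alignment per reference sequence, which is why a small vectorization set or a faster scorer reduces the cost.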
Phase 4: String Kernels
General idea
  Count common substrings in two strings
  A substring of length k is a k-mer
Variations
  Assign weights to k-mers
  Allow for mismatches
  Allow for gaps
  Include substitutions
  Include wildcards
Spectrum Kernel
General idea
  For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ is the number of times $\alpha$ occurs in sequence $x$.
  The l-spectrum feature map is then

  $$\Phi^{\mathrm{Spectrum}}_{l}(x) = \left(\phi_\alpha(x)\right)_{\alpha \in \Sigma^l}$$

  Here $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
  The spectrum kernel is the inner product in the feature space defined by this map:

  $$k^{\mathrm{Spectrum}}_{l}(x, x') = \left\langle \Phi^{\mathrm{Spectrum}}_{l}(x), \Phi^{\mathrm{Spectrum}}_{l}(x') \right\rangle$$

  The more common substrings two sequences contain, the more similar they are deemed.
Spectrum Kernel
Principle
  Spectrum kernel: count k-mers that occur exactly in both sequences
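A direct, unoptimized sketch of the spectrum kernel stores only the l-mers that actually occur instead of the full feature vector over all of $\Sigma^l$. The function names are illustrative.

```python
from collections import Counter


def spectrum(x, l):
    """Sparse l-spectrum feature map: l-mer -> number of occurrences in x."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the two l-spectrum feature vectors."""
    sx, sy = spectrum(x, l), spectrum(y, l)
    return sum(count * sy[kmer] for kmer, count in sx.items())


print(spectrum_kernel("ILVFMC", "WLVFQC", 3))  # only LVF is shared
print(spectrum_kernel("ILVFMC", "WLVFQC", 2))  # LV and VF are shared
```

Since an l-mer absent from either sequence contributes zero to the inner product, the sparse representation gives the same value as the full vector.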
Mismatch Kernel
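General idea
  Do not enforce strictly exact matches.
  Define the mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:

  $$\Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$

  where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $\phi_\beta(\alpha) = 0$ otherwise.
  For a sequence $x$ of any length, the map is then extended as

  $$\Phi^{\mathrm{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha)$$

  The mismatch kernel is the inner product in the feature space defined by this map:

  $$k^{\mathrm{Mismatch}}_{(l,m)}(x, x') = \left\langle \Phi^{\mathrm{Mismatch}}_{(l,m)}(x), \Phi^{\mathrm{Mismatch}}_{(l,m)}(x') \right\rangle$$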
Mismatch Kernel
Principle
  Mismatch kernel: count common k-mers with at most m mismatches
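The mismatch feature map can be sketched by brute force: every l-mer of the sequence adds 1 to each l-mer within m mismatches of it. This enumeration is meant only to make the definition concrete and is practical only for small l and m; efficient implementations use specialized data structures instead. The 20-letter amino acid alphabet is assumed.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """All strings that differ from kmer in at most m positions."""
    neighbors = set()
    for d in range(m + 1):
        for positions in combinations(range(len(kmer)), d):
            # At each chosen position, substitute a different letter
            choices = [[c for c in alphabet if c != kmer[p]] for p in positions]
            for letters in product(*choices):
                chars = list(kmer)
                for p, c in zip(positions, letters):
                    chars[p] = c
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(x, l, m, alphabet=AMINO_ACIDS):
    """Sparse (l, m)-mismatch feature map of sequence x."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(mismatch_neighborhood(x[i:i + l], m, alphabet))
    return feats


def mismatch_kernel(x, y, l, m, alphabet=AMINO_ACIDS):
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(mismatch_kernel("ILVFMC", "WLVFQC", 3, 0))  # m = 0 reduces to the spectrum kernel
print(mismatch_kernel("ILVFMC", "WLVFQC", 3, 1))
```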
Gappy Kernel
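General idea
  Allow for gaps in common substrings → "subsequences"
  A g-mer then contributes to all its l-mer subsequences:

  $$\Phi^{\mathrm{Gap}}_{(g,l)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$

  where $\phi_\beta(\alpha) = 1$ if $\beta$ occurs as a (not necessarily contiguous) subsequence of the g-mer $\alpha$, and $\phi_\beta(\alpha) = 0$ otherwise.
  For a sequence $x$ of any length, the map is then extended as

  $$\Phi^{\mathrm{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Gap}}_{(g,l)}(\alpha)$$

  The gappy kernel is the inner product in the feature space defined by this map:

  $$k^{\mathrm{Gap}}_{(g,l)}(x, x') = \left\langle \Phi^{\mathrm{Gap}}_{(g,l)}(x), \Phi^{\mathrm{Gap}}_{(g,l)}(x') \right\rangle$$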
Gappy Kernel
Principle
  Gappy kernel: count common l-subsequences of g-mers
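A small sketch of the gappy feature map, assuming g ≥ l: each g-mer contributes 1 to every distinct length-l subsequence it contains. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Sparse (g, l)-gappy feature map of sequence x."""
    feats = Counter()
    for i in range(len(x) - g + 1):
        gmer = x[i:i + g]
        # Distinct length-l subsequences of this g-mer (order preserved)
        subseqs = {"".join(gmer[p] for p in pos) for pos in combinations(range(g), l)}
        feats.update(subseqs)
    return feats


def gappy_kernel(x, y, g, l):
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(count * fy[sub] for sub, count in fx.items())


print(sorted(gappy_features("ILVF", 3, 2).items()))
print(gappy_kernel("ILVFMC", "WLVFQC", 3, 2))
```

With g = l there is no room for gaps and the value coincides with the spectrum kernel.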
Substitution Kernel
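General idea
  mismatch neighborhood → substitution neighborhood
  An l-mer $\alpha = a_1 a_2 \ldots a_l$ then contributes to all l-mers in its substitution neighborhood:

  $$M_{(l,\sigma)}(\alpha) = \left\{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \right\}$$

  For a sequence $x$ of any length, the map is then extended as

  $$\Phi^{\mathrm{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$$

  where $\Phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$ has entry 1 for every $\beta \in M_{(l,\sigma)}(\alpha)$ and 0 elsewhere.
  The substitution kernel is now:

  $$k^{\mathrm{Sub}}_{(l,\sigma)}(x, x') = \left\langle \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x), \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x') \right\rangle$$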
Substitution Kernel
Principle
  Substitution kernel: count common l-mers within the substitution neighborhood
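To make the substitution neighborhood concrete, the sketch below enumerates all candidate l-mers over a reduced four-letter alphabet and keeps those whose summed negative log-probability stays below σ. The alphabet and the probabilities P(a | b) are hypothetical toy values; a real application would derive them from an amino acid substitution matrix.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ILVF"  # reduced alphabet, an assumption to keep enumeration small


def toy_prob(a, b):
    """Hypothetical P(a | b): 0.7 for identity, 0.1 for each other letter."""
    return 0.7 if a == b else 0.1


def substitution_neighborhood(kmer, sigma, alphabet=ALPHABET, prob=toy_prob):
    """All beta with -sum_i log P(a_i | b_i) < sigma."""
    return {
        "".join(beta)
        for beta in product(alphabet, repeat=len(kmer))
        if -sum(math.log(prob(a, b)) for a, b in zip(kmer, beta)) < sigma
    }


def substitution_features(x, l, sigma):
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(substitution_neighborhood(x[i:i + l], sigma))
    return feats


def substitution_kernel(x, y, l, sigma):
    fx = substitution_features(x, l, sigma)
    fy = substitution_features(y, l, sigma)
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(len(substitution_neighborhood("ILV", 3.5)))
print(substitution_kernel("ILVF", "LLVF", 3, 3.5))
```

With these toy probabilities every substitution costs the same, so the neighborhood behaves like a mismatch neighborhood; non-uniform probabilities would admit likely substitutions and exclude unlikely ones.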
Wildcard Kernels
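General idea
  Augment the alphabet $\Sigma$ by a wildcard character $*$ → $\Sigma \cup \{*\}$
  Let $W$ be the set of patterns $\beta \in (\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$
  An l-mer $\alpha \in \Sigma^l$ contributes to a pattern $\beta \in W$ if their non-wildcard characters match
  For a sequence $x$ of any length, the map is then given by

  $$\Phi^{\mathrm{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \left(\phi_\beta(\alpha)\right)_{\beta \in W}$$

  where $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.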
Wildcard Kernel
Principle
  Wildcard kernel: count l-mers that match except for wildcards

Example: l = 3, m = 1
  Protein A: ILVFMC
    ILV: IL*, *LV, I*V
    LVF: LV*, *VF, L*F
    VFM: VF*, *FM, V*M
    FMC: FM*, *MC, F*C
  Protein B: WLVFQC
    WLV: WL*, *LV, W*V
    LVF: LV*, *VF, L*F
    VFQ: VF*, *FQ, V*Q
    FQC: FQ*, *QC, F*C
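The wildcard feature map can be sketched by generating, for every l-mer, all patterns with up to m positions replaced by `*` and weighting each by λ^j, where j is the number of wildcards. Note that "up to m" includes j = 0, so the unmodified l-mer is also a feature. Function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_features(x, l, m, lam):
    """Sparse (l, m, lambda)-wildcard feature map of sequence x."""
    feats = defaultdict(float)
    for i in range(len(x) - l + 1):
        kmer = x[i:i + l]
        for j in range(m + 1):
            for positions in combinations(range(l), j):
                pattern = "".join(
                    "*" if p in positions else c for p, c in enumerate(kmer)
                )
                feats[pattern] += lam ** j
    return feats


def wildcard_kernel(x, y, l, m, lam):
    fx = wildcard_features(x, l, m, lam)
    fy = wildcard_features(y, l, m, lam)
    return sum(v * fy[p] for p, v in fx.items() if p in fy)


# Proteins A and B from the example: shared patterns are
# LVF, *LV, LV*, L*F, *VF, VF*, F*C
print(wildcard_kernel("ILVFMC", "WLVFQC", 3, 1, 1.0))
print(wildcard_kernel("ILVFMC", "WLVFQC", 3, 1, 0.5))
```

Choosing λ below 1 down-weights matches that rely on wildcards relative to exact matches.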
References and further reading
References
[1] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In PSB, pages 564–575, 2002.
[2] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In NIPS, 2002. MIT Press.
[3] C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT, 2003.