

slide-1
SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics

Karsten Borgwardt February 25 to March 10 Bioinformatics Group MPIs Tübingen

slide-2
SLIDE 2

Why compare sequences?

Protein sequences
  • Proteins are chains of amino acids.
  • 20 different types of amino acids can be found in protein sequences.
  • Protein sequences change over time through mutations, deletions, and insertions.
  • Different protein sequences may diverge from one common ancestor.
  • Their sequences may differ slightly, yet their function is often conserved.
slide-3
SLIDE 3

Why compare sequences?

Biological question
  • Biologists are interested in the reverse direction: given two protein sequences, is it likely that they originate from the same common ancestor?

Computational challenge
  • How to measure similarity between two protein sequences, or equivalently: how to measure similarity between two strings?

Kernel challenge
  • How to measure similarity between two strings via a kernel function?
  • In short: how to define a string kernel?

slide-4
SLIDE 4

History of sequence comparison

First phase
  • Smith-Waterman
  • BLAST

Second phase
  • Profiles
  • Hidden Markov Models

Third phase
  • PSI-BLAST
  • SAM-T98

Fourth phase
  • Kernels

slide-5
SLIDE 5

Sequence comparison: Phase 1

Idea
  • Measure pairwise similarities between sequences, allowing for gaps

Methods
  • Smith-Waterman
      - dynamic programming
      - high accuracy
      - slow (O(n^2))
  • BLAST
      - faster heuristic alternative with sufficient accuracy
      - searches for common substrings of fixed length
      - extends these in both directions
      - performs gapped alignment
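The dynamic-programming idea behind Smith-Waterman can be sketched in a few lines. This is a minimal sketch, not the original implementation: the function name and the scoring parameters (match = 2, mismatch = -1, linear gap penalty = -1) are assumptions chosen for illustration, whereas real protein alignment would use a substitution matrix and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (illustrative scoring scheme)."""
    # prev[j] = best score of a local alignment ending at the previous
    # character of a and at b[j-1]; row 0 and column 0 stay at 0.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops over both sequences are the source of the quadratic running time noted on the slide.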

slide-6
SLIDE 6

Sequence comparison: Phase 2

Idea
  • Collect aggregate statistics from a family of sequences
  • Compare these statistics to a single unlabeled protein

Methods
  • Hidden Markov Models (HMMs)
      - Markov process with hidden and observable parameters
      - The forward algorithm determines the probability that a given sequence is the output of a particular HMM
  • Profiles
      - Profiles of sequence families are derived by multiple sequence alignment
      - A given sequence is compared to this profile
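The forward algorithm mentioned above can be illustrated on a toy model. This is a sketch under stated assumptions: the two states, the two-letter alphabet, and all probabilities are made up for illustration and do not come from a real protein family model, and the observation sequence is assumed to be non-empty.

```python
from itertools import product

def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs | HMM), summed over all hidden state paths (obs non-empty)."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Toy two-state HMM; every number below is an illustrative assumption.
states = ("conserved", "variable")
start_p = {"conserved": 0.6, "variable": 0.4}
trans_p = {
    "conserved": {"conserved": 0.7, "variable": 0.3},
    "variable": {"conserved": 0.4, "variable": 0.6},
}
emit_p = {
    "conserved": {"A": 0.8, "G": 0.2},
    "variable": {"A": 0.5, "G": 0.5},
}

# Sanity check: the probabilities of all length-3 sequences sum to 1.
total = sum(
    forward_probability("".join(seq), states, start_p, trans_p, emit_p)
    for seq in product("AG", repeat=3)
)
```

For long sequences a practical implementation would work in log space or rescale alpha at each step to avoid numerical underflow.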

slide-7
SLIDE 7

Sequence comparison: Phase 3

Idea
  • Create single models from database collections of homologous sequences

Methods
  • PSI-BLAST
      - Position-specific iterative BLAST
      - Profile from highest-scoring hits in initial BLAST runs
      - Position weighting according to degree of conservation
      - Iteration of these steps
  • SAM-T98, now SAM-T02
      - Database search with an HMM from multiple sequence alignment

slide-8
SLIDE 8

Phase 4: Kernels and SVMs

General idea
  • Model differences between classes of sequences
  • Use an SVM classifier to distinguish classes
  • Use a kernel to measure similarity between strings

Kernels for protein sequences
  • SVM-Fisher kernel
  • Composition kernel
  • Motif kernel
  • String kernel

slide-9
SLIDE 9

SVM-Fisher method

General idea
  • Combine HMMs and SVMs for sequence classification
  • Won best-paper award at ISMB 1999

Sequence representation
  • Fixed-length vector
  • Components are transition and emission probabilities
  • Transformation into Fisher score

slide-10
SLIDE 10

SVM-Fisher method

Algorithm
  • Model protein family F as an HMM
  • Transform query protein X into a fixed-length vector via the HMM
  • Compute the kernel between X and positive and negative examples of the protein family

Advantages
  • Allows prior knowledge to be incorporated
  • Can deal with missing data
  • Is interpretable
  • Outperforms competing methods
slide-11
SLIDE 11

Composition kernels

General idea
  • Model a sequence by its amino acid content
  • Bin amino acids w.r.t. physico-chemical properties

Sequence representation
  • Feature vector of amino acid frequencies
  • Physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
  • Useful database: AAindex
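A composition feature vector and the resulting linear kernel might look as follows. This is a minimal sketch: the function names are hypothetical, only the plain 20-letter amino acid frequencies are used, and the physico-chemical binning is left out because the concrete bins are not given in the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition_vector(seq):
    """Relative frequency of each amino acid in a non-empty sequence."""
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

def composition_kernel(x, y):
    """Inner product of the two composition vectors."""
    fx, fy = composition_vector(x), composition_vector(y)
    return sum(p * q for p, q in zip(fx, fy))
```

To bin by a physico-chemical property, one could map each residue to its class label first and count class frequencies instead of residue frequencies.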

slide-12
SLIDE 12

Motif kernels

General idea
  • Conserved motifs in amino acid sequences indicate structural and functional relationships
  • Model sequence s as a feature vector f representing motifs
  • The i-th component of f is 1 ⇔ s contains the i-th motif

Motif databases
  • PROSITE
  • eMOTIFs
  • BLOCKS+ combines several databases

Generated by
  • manual construction
  • multiple sequence alignment
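The binary motif representation can be sketched with regular expressions. The motif list below is an assumption for illustration only: the first pattern follows the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}, while the other two are made-up toy motifs; a real system would load its motifs from one of the databases above.

```python
import re

# Illustrative motif list (hypothetical, not taken from the slides).
MOTIFS = [
    re.compile(r"N[^P][ST][^P]"),  # PROSITE-style N-glycosylation pattern
    re.compile(r"C..C"),           # made-up toy motif
    re.compile(r"GG.G"),           # made-up toy motif
]

def motif_vector(seq):
    """i-th component is 1 if seq contains the i-th motif, else 0."""
    return [1 if motif.search(seq) else 0 for motif in MOTIFS]

def motif_kernel(s, t):
    """Number of motifs that occur in both sequences."""
    return sum(a * b for a, b in zip(motif_vector(s), motif_vector(t)))
```

With binary features the inner product simply counts the motifs the two sequences share.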

slide-13
SLIDE 13

Pairwise comparison kernels

General idea
  • Employ an empirical kernel map on Smith-Waterman/BLAST scores

Advantage
  • Utilizes decades of practical experience with BLAST

Disadvantage
  • High computational cost (O(m^3))

Alleviation
  • Employ BLAST instead of Smith-Waterman
  • Use a vectorization set for the empirical map only
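An empirical kernel map represents each sequence by its vector of scores against a fixed set of reference sequences. The sketch below assumes a toy score (the number of distinct shared 2-mers) as a stand-in for Smith-Waterman or BLAST scores; `toy_score`, `empirical_map`, and `pairwise_kernel` are hypothetical names.

```python
def toy_score(a, b, k=2):
    """Stand-in for an alignment score: number of distinct shared k-mers."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))

def empirical_map(x, reference_set, score=toy_score):
    """Vector of similarity scores of x against every reference sequence."""
    return [score(x, r) for r in reference_set]

def pairwise_kernel(x, y, reference_set, score=toy_score):
    """Inner product of the two empirical feature vectors."""
    fx = empirical_map(x, reference_set, score)
    fy = empirical_map(y, reference_set, score)
    return sum(a * b for a, b in zip(fx, fy))
```

Restricting `reference_set` to a small vectorization set, as the slide suggests, reduces the number of score computations per sequence.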

slide-14
SLIDE 14

Phase 4: String Kernels

General idea
  • Count common substrings in two strings
  • A substring of length k is a k-mer

Variations
  • Assign weights to k-mers
  • Allow for mismatches
  • Allow for gaps
  • Include substitutions
  • Include wildcards

slide-15
SLIDE 15

Spectrum Kernel

General idea
  • For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ is the number of times $\alpha$ occurs in sequence $x$.
  • The l-spectrum feature map is then

    $$\Phi^{\mathrm{Spectrum}}_{l}(x) = \left(\phi_\alpha(x)\right)_{\alpha \in \Sigma^l}$$

    where $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
  • The spectrum kernel is the inner product in the feature space defined by this map:

    $$k^{\mathrm{Spectrum}}(x, x') = \left\langle \Phi^{\mathrm{Spectrum}}_{l}(x), \Phi^{\mathrm{Spectrum}}_{l}(x') \right\rangle$$

  • The more substrings two sequences have in common, the more similar they are deemed to be.
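The feature map and kernel above translate almost directly into code. This is a minimal sketch with hypothetical function names; it stores only the l-mers that actually occur, rather than the full vector over $\Sigma^l$.

```python
from collections import Counter

def spectrum_features(x, l):
    """Sparse l-spectrum: maps each l-mer in x to its number of occurrences."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))

def spectrum_kernel(x, y, l):
    """Inner product of the l-spectrum feature vectors of x and y."""
    fx, fy = spectrum_features(x, l), spectrum_features(y, l)
    # Only l-mers present in both sequences contribute to the sum.
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

For example, "ABAB" and "BABA" with l = 2 share the 2-mers AB (2 and 1 occurrences) and BA (1 and 2 occurrences), so the kernel value is 2·1 + 1·2 = 4.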

slide-16
SLIDE 16

Spectrum Kernel

Principle
  • Spectrum kernel: count k-mers that occur, exactly matching, in both sequences

slide-17
SLIDE 17

Mismatch Kernel

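General idea
  • Do not enforce strictly exact matches
  • Define the mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:

    $$\Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$

    where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $0$ otherwise.
  • For a sequence $x$ of any length, the map is then extended as

    $$\Phi^{\mathrm{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha)$$

  • The mismatch kernel is the inner product in the feature space defined by this map:

    $$k^{\mathrm{Mismatch}}_{(l,m)}(x, x') = \left\langle \Phi^{\mathrm{Mismatch}}_{(l,m)}(x), \Phi^{\mathrm{Mismatch}}_{(l,m)}(x') \right\rangle$$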
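A direct, unoptimized reading of these definitions is sketched below. The function names are hypothetical, and the neighborhood is enumerated explicitly, which is only feasible for small l, m, and alphabets; efficient implementations use a mismatch tree instead.

```python
from collections import Counter
from itertools import combinations, product

def mismatch_neighborhood(alpha, m, alphabet):
    """All l-mers over alphabet that differ from alpha in at most m positions."""
    m = min(m, len(alpha))
    neighbors = set()
    # Pick m positions and fill them with any letters (possibly the original
    # ones), which covers every l-mer with 0..m mismatches.
    for positions in combinations(range(len(alpha)), m):
        for letters in product(alphabet, repeat=m):
            beta = list(alpha)
            for pos, letter in zip(positions, letters):
                beta[pos] = letter
            neighbors.add("".join(beta))
    return neighbors

def mismatch_features(x, l, m, alphabet):
    """Sum of the neighborhood indicator vectors of all l-mers in x."""
    features = Counter()
    for i in range(len(x) - l + 1):
        features.update(mismatch_neighborhood(x[i:i + l], m, alphabet))
    return features

def mismatch_kernel(x, y, l, m, alphabet):
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())
```

With m = 0 the neighborhood of each l-mer is the l-mer itself, so the function reduces to the spectrum kernel.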

slide-18
SLIDE 18

Mismatch Kernel

Principle
  • Mismatch kernel: count common k-mers with at most m mismatches

slide-19
SLIDE 19

Gappy Kernel

General idea
  • Allow for gaps in common substrings → "subsequences"
  • A g-mer $\alpha$ then contributes to all its l-mer subsequences:

    $$\Phi^{\mathrm{Gap}}_{(g,l)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$

  • For a sequence $x$ of any length, the map is then extended as

    $$\Phi^{\mathrm{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Gap}}_{(g,l)}(\alpha)$$

  • The gappy kernel is the inner product in the feature space defined by this map:

    $$k^{\mathrm{Gap}}_{(g,l)}(x, x') = \left\langle \Phi^{\mathrm{Gap}}_{(g,l)}(x), \Phi^{\mathrm{Gap}}_{(g,l)}(x') \right\rangle$$
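The gappy feature map can be sketched by enumerating the length-l subsequences of every g-mer. The function names are hypothetical, and one detail is an assumption not spelled out on the slide: each distinct subsequence is counted once per g-mer (a binary indicator $\phi_\beta(\alpha)$).

```python
from collections import Counter
from itertools import combinations

def gappy_features(x, g, l):
    """Sum over all g-mers in x of the indicator of their l-mer subsequences."""
    features = Counter()
    for i in range(len(x) - g + 1):
        window = x[i:i + g]
        # combinations() keeps the original order, so each tuple is a
        # (possibly gapped) subsequence of the g-mer.
        subsequences = {"".join(c) for c in combinations(window, l)}
        features.update(subsequences)
    return features

def gappy_kernel(x, y, g, l):
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(count * fy[beta] for beta, count in fx.items())
```

With g = l there is no room for gaps, and the function reduces to the spectrum kernel.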

slide-20
SLIDE 20

Gappy Kernel

Principle
  • Gappy kernel: count common l-subsequences of g-mers

slide-21
SLIDE 21

Substitution Kernel

General idea
  • Mismatch neighborhood → substitution neighborhood
  • An l-mer $\alpha = a_1 a_2 \ldots a_l$ then contributes to all l-mers in its substitution neighborhood

    $$M_{(l,\sigma)}(\alpha) = \left\{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \right\}$$

  • For a sequence $x$ of any length, the map is then extended as

    $$\Phi^{\mathrm{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$$

  • The substitution kernel is now:

    $$k^{\mathrm{Sub}}_{(l,\sigma)}(x, x') = \left\langle \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x), \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x') \right\rangle$$
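The substitution neighborhood can be sketched by brute force on a toy alphabet. Everything concrete here is an assumption for illustration: the two-letter alphabet, the conditional probabilities in `P` (keyed as `(a, b)` for $P(a \mid b)$ and assumed strictly positive), and the function names. Real applications would derive the probabilities from a substitution matrix.

```python
import math
from collections import Counter
from itertools import product

def substitution_neighborhood(alpha, sigma, cond_prob, alphabet):
    """All l-mers beta with -sum_i log P(a_i | b_i) < sigma."""
    neighbors = set()
    for beta in product(alphabet, repeat=len(alpha)):
        cost = -sum(math.log(cond_prob[(a, b)]) for a, b in zip(alpha, beta))
        if cost < sigma:
            neighbors.add("".join(beta))
    return neighbors

def substitution_features(x, l, sigma, cond_prob, alphabet):
    features = Counter()
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        features.update(substitution_neighborhood(alpha, sigma, cond_prob, alphabet))
    return features

def substitution_kernel(x, y, l, sigma, cond_prob, alphabet):
    fx = substitution_features(x, l, sigma, cond_prob, alphabet)
    fy = substitution_features(y, l, sigma, cond_prob, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())

# Toy conditional probabilities P(a | b); made-up values.
P = {("A", "A"): 0.9, ("B", "A"): 0.1, ("A", "B"): 0.2, ("B", "B"): 0.8}
```

Enumerating all of $\Sigma^l$ is exponential in l, so this only serves to make the definition concrete.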

slide-22
SLIDE 22

Substitution Kernel

Principle
  • Substitution kernel: count common l-mers in the substitution neighborhood

slide-23
SLIDE 23

Wildcard Kernels

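General idea
  • Augment the alphabet $\Sigma$ by a wildcard character $*$ → $\Sigma \cup \{*\}$
  • Let $\alpha \in \Sigma^l$ and $\beta \in (\Sigma \cup \{*\})^l$, where $\beta$ contains at most $m$ occurrences of $*$
  • The l-mer $\alpha$ contributes to the l-mer $\beta$ if their non-wildcard characters match
  • For a sequence $x$ of any length, the map is then given by

    $$\Phi^{\mathrm{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \left(\phi_\beta(\alpha)\right)_{\beta \in W}$$

    where $W$ is the set of patterns $\beta$ with at most $m$ wildcards, $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.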

slide-24
SLIDE 24

Wildcard Kernel

Principle
  • Wildcard kernel: count l-mers that match except for wildcards

Example (l = 3, m = 1)

Protein A: ILVFMC
  ILV: IL*, *LV, I*V
  LVF: LV*, *VF, L*F
  VFM: VF*, *FM, V*M
  FMC: FM*, *MC, F*C

Protein B: WLVFQC
  WLV: WL*, *LV, W*V
  LVF: LV*, *VF, L*F
  VFQ: VF*, *FQ, V*Q
  FQC: FQ*, *QC, F*C
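The example can be reproduced with a short sketch of the wildcard feature map. The function names are hypothetical; the code assumes that a pattern with j wildcards receives weight $\lambda^j$, so the wildcard-free pattern (the l-mer itself) has weight 1.

```python
from collections import defaultdict
from itertools import combinations

def wildcard_features(x, l, m, lam):
    """Weighted counts of all patterns with at most m wildcards matched in x."""
    features = defaultdict(float)
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for j in range(min(m, l) + 1):
            # Replace every choice of j positions by the wildcard character.
            for positions in combinations(range(l), j):
                beta = "".join(
                    "*" if p in positions else c for p, c in enumerate(alpha)
                )
                features[beta] += lam ** j
    return features

def wildcard_kernel(x, y, l, m, lam):
    fx = wildcard_features(x, l, m, lam)
    fy = wildcard_features(y, l, m, lam)
    return sum(value * fy.get(beta, 0.0) for beta, value in fx.items())
```

For proteins A and B above, the shared patterns are LVF (no wildcard) and *LV, LV*, *VF, L*F, VF*, F*C (one wildcard each), so the kernel value is 1 + 6λ².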

slide-25
SLIDE 25

References and further reading

References

[1] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In PSB, pages 564–575, 2002.

[2] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In NIPS, 2002. MIT Press.

[3] C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT, 2003.

[4] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, Chapters 3 and 4. MIT Press, Cambridge, MA, 2004.