Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and - - PowerPoint PPT Presentation

what s behind blast
SMART_READER_LITE
LIVE PREVIEW

Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and - - PowerPoint PPT Presentation

Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and Genetics Dresden, DE Approximate String Search Given a string A of length n, a query Q of length p n, an alignment scoring function , and a threshold d: Find all


slide-1
SLIDE 1

What’s Behind BLAST

Gene Myers, Director MPI for Cell Biology and Genetics Dresden, DE

slide-2
SLIDE 2

δ here = Simple Levenstein (unit cost mismatch, insert, & delete) ...xxxxxxxxaacgt-gcattacxxxxxxxx... aatgtggc-ttac

Approximate String Search

A 3-match (absolute) Given a string A of length n, a query Q of length p ⪻ n, an alignment scoring function δ, and a threshold d: Find all substrings of A, say M, s.t. δ(Q,M) ≤ d? A 25%-match (relative)

slide-3
SLIDE 3
  • March ’88: The Lister Hill Meeting & Galil’s 2 questions

The Story

slide-4
SLIDE 4
  • S. Altschul W. Fitch Z. Galil
  • W. Goad T. Hunkapillar S. Karlin
  • G. Landau E. Lander D. Lipman
  • J. Maisel H. Martinez C. Sanders
  • T. Smith R. Staden J. Turner
  • M. Zuker A. Mukherjee M. Waterman
  • D. Sankoff P. Sellers E. Ukkonen
  • W. Miller G. Myers

The Beginning

“Workshop for Algorithms in Molecular Genetics” March 26-28, 1988

slide-5
SLIDE 5

Zvi gave a talk about suffix trees: Q1: Can one get rid of the annoying dependence on alphabet size Σ? ⇒ Manber & Myers, “Suffix Arrays” 1990 Q2: Can one use an index to get faster approximate search?

Galil’s 2 Questions

“Workshop for Algorithms in Molecular Genetics” March 26-28, 1988

slide-6
SLIDE 6
  • March ’88: The Lister Hill Meeting & Galil’s 2 questions

The Story

slide-7
SLIDE 7
  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ’88: Seed & Extend

The Story

slide-8
SLIDE 8

APM Filters

100% Sn < 100% Sn 100% Sp Exact < 100% Sp Filter Heuristic

A filter is an algorithm that eliminates a lot of that which isn’t desired.

Filter Exact

If fast & specific then can improve speed of an exact algorithm. Approximate match filter ideas:

|Nd(k)| ≲ ( )(2Σ)d

k d

  • Look for exact matches to k-mers of the query (in an index)

( Pearson & Lipman FASTA, Chang & Lawler, O(dn/lg p) )

  • Instead look for k-mers that are a small distance away, e.g. 1 or 2 diff’s,

from a k-mer of the query, i.e. the neighborhood Nd(w) = { v : v and w are ≤ d differences apart }

slide-9
SLIDE 9

The Power of Neighborhoods

Consider looking for a 9%-match of 40 symbols (⇒ ≤ 3 differences or 3-match): If divide “query” into 4 10-mers then at least one must match exactly: ⇒ Get a hit every Σ10 / 4 symbols (e.g. 2.5 ⋅ 105 for DNA) If divide into 2 20-mers then at least one of the N1 strings must match exactly: ⇒ Get a hit every Σ20 / 2N1(20) symbols (e.g. 1012 / 2 ⋅ 160 = 3.12 ⋅ 109 for DNA) 10,000 times more specific ! (but 80x more lookups)

slide-10
SLIDE 10

“Seed & Extend”

The “seed” matches (either exact or from a neighborhood) are in effect defining areas within the edit graph of Q vs A where the alignment of an ε-match could be:

Q A

s1 s2 s3 s4

1 3 2 2 4

Q = s1s2s3s4

slide-11
SLIDE 11

“Seed & Extend”

Q A

s1 s2 s3 s4

1 3 2 2 4

±d/4

The “seed” matches (either exact or from a neighborhood) are in effect defining areas within the edit graph of Q vs A where the alignment of an ε-match could be:

Q = s1s2s3s4

slide-12
SLIDE 12

Q A

s1 s2 s3 s4

1 3 2 2 4

“Seed & Extend”

±d

Spend O(pdh + pz) time where h(k) = the number of seed “k-hits” vs. z(k) = neighborhood size “k-words” Both z and h are functions of k and the optimal k is “slightly bigger” than logΣ n The “seed” matches (either exact or from a neighborhood) are in effect defining areas within the edit graph of Q vs A where the alignment of an ε-match could be:

slide-13
SLIDE 13
  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ’88: Seed & Extend

The Story

slide-14
SLIDE 14
  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ’88: Seed & Extend

The Story

  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
slide-15
SLIDE 15

Blast = Seed & Extend Seeds are neighborhoods of all k-mers of query under weighted Levenstein (e.g. PAM120) Find seeds with a deterministic finite automaton accepting all neighborhood words (⇒O(n)) Extend is just weighted Hamming but stop when score drops too much A heuristic “blast” was inspired by “slam” = sublinear approximate match

slide-16
SLIDE 16

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ’88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
slide-17
SLIDE 17

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ’88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
slide-18
SLIDE 18

The Splitting Lemma

Lemma: If w ε-matches v then either (a) w0 has an ε-match to a prefix (call it v0) of v, or (b) w1 has an ε-match to a suffix (call it v1) of v.

k errors

w v

≤⎣k/2⎦ errors? ≤⎣k/2⎦ errors?

w0 v w1

≤⎣k/2⎦ errors

w0 v1 w1

Proof: By Pigeon Hole Principle

k =⎣ε|w|⎦

slide-19
SLIDE 19

⎣k/4⎦?

w0 v10 w1

⎣k/4⎦?

w10 w11 w

≤⎣k/2⎦ errors

w0 v1 w1

⎣k/8⎦

w0 v101 w1 w10 w11 w w100 w101

slide-20
SLIDE 20

The Splitting Lemma

Lemma: If w ε-matches v then ∃ α s.t. ∀ prefixes β of α, (1) wβ has an ε-match to a substring (call it vβ) of v, and (2) vβ0 is a prefix of vβ (if β0 is a prefix of α), and (3) vβ1 is a suffix of vβ (if β1 is a prefix of α). Let wε = w

wβa = wβ[1..|wβ|/2] if a = 0

wβ[|wβ|/2+1.. |wβ|] if a = 1

⎣k/8⎦

w0 v101 w1 w10 w11 w w100 w101 e.g. α=101...

slide-21
SLIDE 21

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ’88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
slide-22
SLIDE 22

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ‘88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
  • Fall ’89: Seed & Extend by Doubling
slide-23
SLIDE 23

Doubling Extension

✘ ✘ ✘ ✘

Time for each extension telescopes hyper-geometrically and so is dominated by the first term: O(P/logΣn · h · logΣn · εlogΣn) = O(dhlogΣn)

✓ ✘

Use logΣ n as the seed size ! Lemma: Any ε-match of Q has an ε-match to at least one seed segment of size logΣ n Use the splitting lemma to split Q to seeds of size logΣ n, and instead of extending all at once, extend by doubling using the splitting lemma.

slide-24
SLIDE 24

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ‘88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
  • Fall ’89: Seed & Extend by Doubling
slide-25
SLIDE 25

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ‘88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
  • Fall ’89: Seed & Extend by Doubling
  • Spr ’90: Generating Condensed Neighborhoods
slide-26
SLIDE 26

N1(abbaa) = { aabaa, aabbaa, abaa, abaaa, ababaa,

abba, abbaa, abbab, abbaaa, abbaba, abbba, abbbaa, babbaa, bbaa, bbbaa }

Generating (Condensed) Neighborhoods

Nd(w) = { v : v and w are ≤ d differences apart and

v is not a proper prefix of another word in Nd(w) } It suffices to find the words in the condensed neighborhood. But how do you do that efficiently, including finding them in the index? ... ... Compute rows of dynamic programming matrix as one traverses the trie of all strings over Σ

N1(abbaa) = { aabaa, aabbaa, abaa, abaaa, ababaa,

abba, abbaa, abbab, abbaaa, abbaba, abbba, abbbaa, babbaa, bbaa, bbbaa }

slide-27
SLIDE 27

3 2 1 1 2 3

b

0 1 2 3 4 5 0 1 2 3 4 5

a b b a a

1 0 1 2 3 4

a

2 1 1 2 2 3

a

3 2 1 1 2 3

b

4 3 2 2 1 2

a

5 4 3 3 2 1

a

w v?

5 4 3 3 2 1

a Done: a 1 in the right corner

4 3 2 2 1 2

a

2 1 1 2 2 3

a

1 0 1 2 3 4

a

Condensed Neighborhoods

slide-28
SLIDE 28

0 1 2 3 4 5 5 4 3 3 2 1 4 3 2 2 1 2

a a

3 2 1 1 2 3

b

2 1 1 2 2 3

a

1 0 1 2 3 4

a

0 1 2 3 4 5

a b b a a

2 1 1 2 2 3 1 0 1 2 3 4

a a

3 2 1 1 2 3

b

4 3 2 2 1 2

a

5 4 3 3 2 1

a

w v

Condensed Neighborhoods

Only need ±1 band !

slide-29
SLIDE 29

0 1 _ _ _ _ _ _ _ _ 2 1 _ _ _ 2 1 2

a a

_ _ 1 1 2 _

b

_ 1 1 2 _ _

a

1 0 1 _ _ _

a

0 1

a b b a a

1 1 2 1 0 1

a a

1 1 2

b

2 1 2

a

2 1

a

w v

Condensed Neighborhoods

slide-30
SLIDE 30

0 1 _ _ _ _ 1 0 1 _ _ _ _ 1 1 2 _ _ _ _ 1 1 2 _ _ _ _ _ 2 1 _ _ _ 2 1 2

a a a a b

_ _ _ _ 1 2

a

_ _ _ _ _ 1

a

1 1 1 _ _ _

b

_ _ _ _ 1 2

a

_ _ _ _ _ 1

a

_ _ _ _ 1 1

a

_ _ 1 2 3 _

b

_ _ _ 1 2 3

b

_ _ _ _ 1 2

a

_ _ _ _ _1

a

_ _ _ 3 2 1

a

_ _ _ 2 1 2

a

_ _ _ _ 2 1

a

Condensed Neighborhoods

_ 1 0 1 _ _

b

_ _ _ 1 2 3

b

_ _ _ 2 1 1

a

_ _ _ 1 2 2

b

_ _ _ 1 0 1

a

_ _ _ 1 1 2

b

_ _ 2 2 1 _

a

_ _ 2 1 2 _

b

_ 1 2 2 _ _

a

_ 2 1 1 _ _

b

_ _ 1 1 1 _

a

_ _ 1 0 1 _

b If all entries are ≥ d then wasting time on D.P.

slide-31
SLIDE 31

0 1 _ _ _ _ 1 0 1 _ _ _ _ 1 1 2 _ _ _ _ 1 1 2 _ _ _ _ _ 2 1 _ _ _ 2 1 2

a a a a b

_ _ _ _ 1 2

a

_ _ _ _ _ 1

a

1 1 1 _ _ _

b

_ _ _ _ 1 2

a

_ _ _ _ _ 1

a

_ _ _ _ 1 1

a

_ _ 1 2 3 _

b

_ _ _ 1 2 3

b

_ _ _ _ 1 2

a

_ _ _ _ _1

a

_ _ _ 3 2 1

a

_ _ _ 2 1 2

a

_ _ _ _ 2 1

a

Condensed Neighborhoods

_ 1 0 1 _ _

b

_ _ _ 1 2 3

b

_ _ _ 2 1 1

a

_ _ _ 1 2 2

b

_ _ _ 1 0 1

a

_ _ _ 1 1 2

b

_ _ 2 2 1 _

a

_ _ 2 1 2 _

b

_ 1 2 2 _ _

a

_ 2 1 1 _ _

b

_ _ 1 1 1 _

a

_ _ 1 0 1 _

b If all entries are ≥ d then wasting time on D.P.

slide-32
SLIDE 32

0 1 _ _ _ _ 1 0 1 _ _ _ 1 1 1 _ _ _ _ 1 0 1 _ _ _ 1 1 2 _ _ _ _ _ 1 0 1 _ _ _ 1 1 2 _ _ 1 0 1 _ _ _ 1 1 1 _

a a b b b b a a

a b b a a b b a a b a a a a a a a a b a a b b a a b a a

a a

0 1 1 0 1 1 1 2 b b b a a a

a b b a a

a

Condensed Neighborhoods

slide-33
SLIDE 33

0 1 _ _ _ _ 1 0 1 _ _ _ 1 1 1 _ _ _ _ 1 0 1 _ _ _ 1 1 2 _ _ _ _ _ 1 0 1 _ _ _ 1 1 2 _ _ 1 0 1 _ _ _ 1 1 1 _

a a b b b b a a

a b b a a b b a a b a a a a a a a a b a a b b a a b a a

Condensed Neighborhoods

Use “KMP” on reverse of w to efficiently discover these. A shorter suffix of w that is a prefix of the extension is also possible

slide-34
SLIDE 34

0 1 _ _ _ _ 1 0 1 _ _ _ 1 1 1 _ _ _ _ 1 0 1 _ _ _ 1 1 2 _ _ _ _ _ 1 0 1 _ _ _ 1 1 2 _ _ 1 0 1 _ _ _ 1 1 1 _

a a b b b b a a

a b b a a b b a a b a a a a b a a b b a a b a a

Condensed Neighborhoods

Lemma: Neighborhoods and their hits in A can be generated in O(zd+h) time where z = |Nd(w)|

slide-35
SLIDE 35

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ‘88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
  • Fall ’89: Seed & Extend by Doubling
  • Spr ’90: Generating Condensed Neighborhoods
slide-36
SLIDE 36

The Story

  • March ’88: The Lister Hill Meeting & Galil’s 2 questions
  • June ‘88: Seed & Extend
  • May ’89: The TRW Chip & The Cigarette Break
  • Fall ’89: Blast is Born
  • Fall ’89: The Splitting Lemma
  • Fall ’89: Seed & Extend by Doubling
  • Spr ’90: Generating Condensed Neighborhoods
  • Fall ’90: Finale: Complexity
slide-37
SLIDE 37

Complexity

How big is Nd(k)?

Developed recurrence for non-redundant edit scripts: (a) DI = S (b) DS = SD (c) IS = SI (d) ID = Φ S(k,d) = S(k-1,d) + (Σ-1)S(k-1,d-1) + (Σ-1) Σj S(k-1,d-1)

d-1

Σ

j=0

+ (Σ-1)2 Σj S(k-2,d-2-j)

d-2

Σ

j=0

+ S(k-2-j,d-1-j)

d-1

Σ

j=0

≤ S(k,d) + Σj S(k-1,d-j)

d

Σ

j=1

Nd(k)

Lemma:

slide-38
SLIDE 38

Complexity

So how big is it?

where α(ε) = Σpow(ε) and pow(ε) = logΣ ( ) + ε logΣ c(ε) + ε c(ε)+1 c(ε)−1 and c(ε) = ε-1 + (1 + ε-2) .5 where β(ε) = Σ1-pow(ε) Also Pr(w in Nε(k)) = O( 1 / β(ε)k ) ≤ 1.708 α(ε)k Nε(k)

Lemma:

slide-39
SLIDE 39

pow(ε)

slide-40
SLIDE 40

Complexity

So how big is it? And when k = logΣ n?

where α = Σpow(ε) = O( αk ) Nε(k) where β = Σ1-pow(ε) Pr(w in Nε(k)) = O( 1 / βk ) = O(npow(ε)) and Pr(w in Nε(k)) = O(npow(ε)-1) Nε(k)

Lemma:

Starts at 1 (ε=0) and grows “Flex factor” Starts at Σ (ε=0) and shrinks “Effective alphabet size”

slide-41
SLIDE 41

The Result

Theorem: Given (a) A is effectively Bernouilli, (b) a simple O(n) space, precomputed index of A, and (c) there are h d-matches of a query Q to A then they can be found in O(d·npow(ε)·log n + pd·h) expected-time.

slide-42
SLIDE 42

To my knowledge no one has improved

  • n this in the last 20 years ! ?

Algorithmica 12, 4-5 (1994) (submitted 1991 ! )

slide-43
SLIDE 43

Les Treilles, May 1990