Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and - PowerPoint PPT Presentation

What’s Behind BLAST Gene Myers, Director MPI for Cell Biology and Genetics Dresden, DE

Approximate String Search Given a string A of length n, a query Q of length p ⪻ n, an alignment scoring function δ , and a threshold d: Find all substrings of A, say M, s.t. δ (Q,M) ≤ d? δ here = Simple Levenstein (unit cost mismatch, insert, & delete) ...xxxxxxxxaacgt-gcattacxxxxxxxx... aatgtggc-ttac A 3-match (absolute) A 25%-match (relative)

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions

The Beginning “Workshop for Algorithms in Molecular Genetics” March 26-28, 1988 S. Altschul W. Fitch Z. Galil W. Goad T. Hunkapillar S. Karlin G. Landau E. Lander D. Lipman J. Maisel H. Martinez C. Sanders T. Smith R. Staden J. Turner M. Zuker A. Mukherjee M. Waterman D. Sankoff P. Sellers E. Ukkonen W. Miller G. Myers

Galil’s 2 Questions “Workshop for Algorithms in Molecular Genetics” March 26-28, 1988 Zvi gave a talk about suffix trees: Q1: Can one get rid of the annoying dependence on alphabet size Σ ? ⇒ Manber & Myers, “Suffix Arrays” 1990 Q2: Can one use an index to get faster approximate search?

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ’88: Seed & Extend

APM Filters A filter is an algorithm that eliminates a lot of that which isn’t desired. 100% Sn < 100% Sn Filter 100% Sp Exact Exact < 100% Sp Filter Heuristic If fast & specific then can improve speed of an exact algorithm. Approximate match filter ideas: • Look for exact matches to k-mers of the query (in an index) ( Pearson & Lipman FASTA, Chang & Lawler, O(dn/lg p) ) • Instead look for k-mers that are a small distance away, e.g. 1 or 2 diff’s, from a k-mer of the query, i.e. the neighborhood N d (w) = { v : v and w are ≤ d differences apart } | N d (k)| ≲ ( )(2 Σ ) d k d

The Power of Neighborhoods Consider looking for a 9%-match of 40 symbols ( ⇒ ≤ 3 differences or 3-match): If divide “query” into 4 10-mers then at least one must match exactly: ⇒ Get a hit every Σ 10 / 4 symbols (e.g. 2.5 ⋅ 10 5 for DNA) If divide into 2 20-mers then at least one of the N 1 strings must match exactly: ⇒ Get a hit every Σ 20 / 2 N 1 (20) symbols (e.g. 10 12 / 2 ⋅ 160 = 3.12 ⋅ 10 9 for DNA) 10,000 times more specific ! (but 80x more lookups)

“Seed & Extend” The “seed” matches (either exact or from a neighborhood) are in effect defining areas within the edit graph of Q vs A where the alignment of an ε -match could be: A 3 2 1 2 4 s 1 s 2 Q s 3 s 4 Q = s 1 s 2 s 3 s 4

“Seed & Extend” The “seed” matches (either exact or from a neighborhood) are in effect defining areas within the edit graph of Q vs A where the alignment of an ε -match could be: A 3 2 1 2 4 s 1 s 2 Q s 3 s 4 ±d/4 Q = s 1 s 2 s 3 s 4

“Seed & Extend” The “seed” matches (either exact or from a neighborhood) are in effect defining areas within the edit graph of Q vs A where the alignment of an ε -match could be: A 3 2 1 2 4 s 1 s 2 Q s 3 s 4 ±d Spend O(pdh + pz) time where h(k) = the number of seed “k-hits” vs. z(k) = neighborhood size “k-words” Both z and h are functions of k and the optimal k is “slightly bigger” than log Σ n

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ’88: Seed & Extend

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ’88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born

Blast = Seed & Extend Seeds are neighborhoods of all k-mers of query under weighted Levenstein (e.g. PAM120) Find seeds with a deterministic finite automaton accepting all neighborhood words ( ⇒ O(n)) Extend is just weighted Hamming but stop when score drops too much A heuristic “blast” was inspired by “slam” = sublinear approximate match

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ’88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ’88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born • Fall ’89: The Splitting Lemma

The Splitting Lemma Lemma: If w ε -matches v then either (a) w 0 has an ε -match to a prefix (call it v 0 ) of v, or (b) w 1 has an ε -match to a suffix (call it v 1 ) of v. w Proof: k = ⎣ ε |w| ⎦ k errors v w 0 w 1 ≤ ⎣ k/2 ⎦ errors? ≤ ⎣ k/2 ⎦ errors? By Pigeon v Hole w 0 w 1 Principle ≤ ⎣ k/2 ⎦ errors v 1

w 0 w 1 ≤ ⎣ k/2 ⎦ errors v 1 w w 0 w 1 w 10 w 11 ⎣ k/4 ⎦ ? ⎣ k/4 ⎦ ? w v 10 w 0 w 1 w 10 w 11 w 100 w 101 ⎣ k/8 ⎦ v 101

The Splitting Lemma w w 0 w 1 Let w ε = w w 10 w 11 w β a = w β [1..|w β |/2] if a = 0 w β [|w β |/2+1.. |w β |] if a = 1 w 100 w 101 e.g. α =101... ⎣ k/8 ⎦ v 101 Lemma: If w ε -matches v then ∃ α s.t. ∀ prefixes β of α , (1) w β has an ε -match to a substring (call it v β ) of v, and (2) v β 0 is a prefix of v β (if β 0 is a prefix of α ), and (3) v β 1 is a suffix of v β (if β 1 is a prefix of α ).

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ’88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born • Fall ’89: The Splitting Lemma

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ‘88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born • Fall ’89: The Splitting Lemma • Fall ’89: Seed & Extend by Doubling

Doubling Extension Use log Σ n as the seed size ! Lemma: Any ε -match of Q has an ε -match to at least one seed segment of size log Σ n Use the splitting lemma to split Q to seeds of size log Σ n, and instead of extending all at once, extend by doubling using the splitting lemma. ✘ ✘ ✘ ✘ ✓ ✘ Time for each extension telescopes hyper-geometrically and so is dominated by the first term: O(P/log Σ n · h · log Σ n · ε log Σ n) = O(dhlog Σ n)

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ‘88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born • Fall ’89: The Splitting Lemma • Fall ’89: Seed & Extend by Doubling

The Story • March ’88: The Lister Hill Meeting & Galil’s 2 questions • June ‘88: Seed & Extend • May ’89: The TRW Chip & The Cigarette Break • Fall ’89: Blast is Born • Fall ’89: The Splitting Lemma • Fall ’89: Seed & Extend by Doubling • Spr ’90: Generating Condensed Neighborhoods

Generating (Condensed) Neighborhoods N d (w) = { v : v and w are ≤ d differences apart and v is not a proper prefix of another word in N d (w) } N 1 (abbaa) = { aabaa, aabbaa, abaa, abaaa, ababaa, N 1 (abbaa) = { aabaa, aabbaa, abaa, abaaa, ababaa, abba, abbaa, abbab, abbaaa, abbaba, abba, abbaa, abbab, abbaaa, abbaba, abbba, abbbaa, babbaa, bbaa, bbbaa } abbba, abbbaa, babbaa, bbaa, bbbaa } It suffices to find the words in the condensed neighborhood. But how do you do that efficiently, including finding them in the index? ... ... Compute rows of dynamic programming matrix as one traverses the trie of all strings over Σ

Condensed Neighborhoods 0 1 2 3 4 5 a 1 0 1 2 3 4 a w 2 1 1 2 2 3 b a b b a a 3 2 1 1 2 3 0 1 2 3 4 5 a a 1 0 1 2 3 4 a 2 1 1 2 2 3 v? 4 3 2 2 1 2 b 3 2 1 1 2 3 a a 4 3 2 2 1 2 a 5 4 3 3 2 1 5 4 3 3 2 1 Done: a 1 in the right corner

Condensed Neighborhoods 0 1 2 3 4 5 a 1 0 1 2 3 4 a w 2 1 1 2 2 3 b a b b a a 3 2 1 1 2 3 0 1 2 3 4 5 a a 1 0 1 2 3 4 a 2 1 1 2 2 3 v 4 3 2 2 1 2 b 3 2 1 1 2 3 a a 4 3 2 2 1 2 a 5 4 3 3 2 1 5 4 3 3 2 1 Only need ±1 band !

Condensed Neighborhoods 0 1 _ _ _ _ a 1 0 1 _ _ _ a w _ 1 1 2 _ _ b a b b a a _ _ 1 1 2 _ 0 1 a a 1 0 1 a 1 1 2 v _ _ _ 2 1 2 b 1 1 2 a a 2 1 2 a _ _ _ _ 2 1 2 1

Condensed Neighborhoods If all entries are ≥ d then wasting time on D.P. 0 1 _ _ _ _ a b 1 1 1 _ _ _ 1 0 1 _ _ _ a a b b _ 1 2 2 _ _ _ 1 1 2 _ _ _ 2 1 1 _ _ _ 1 0 1 _ _ b b b b a a _ _ 2 1 2 _ _ _ 1 2 3 _ _ _ 1 0 1 _ _ _ 1 1 1 _ _ _ 1 1 2 _ _ _ 2 2 1 _ a a a b a b a b b _ _ _ 2 1 2 _ _ _ 3 2 1 _ _ _ 1 1 2 _ _ _ 1 0 1 _ _ _ 1 2 2 _ _ _ 2 1 1 _ _ _ 1 2 3 _ _ _ 2 1 2 _ _ _ 1 2 3 a a a a a a _ _ _ _ 1 2 _ _ _ _ 1 1 _ _ _ _ 1 2 _ _ _ _ 1 2 _ _ _ _ 2 1 _ _ _ _ 2 1 a a a _ _ _ _ _1 _ _ _ _ _ 1 _ _ _ _ _ 1

Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and - PowerPoint PPT Presentation

Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and Genetics Dresden, DE Approximate String Search Given a string A of length n, a query Q of length p n, an alignment scoring function , and a threshold d: Find all

BLAST Business License/ Web Update Business License/ Web Update BLAST BLAST BLAST BLAST (

A few BLAST details Julin Maloof April 16, 2019 Slides courtesy of Venkatsean Sundaresan BLAST

L4: Blast: Alignment Scores etc. L4: Blast: Alignment Scores etc. Why is Blast Fast? Why is

Blast Injuries and Landmines Travelling positive pressure wave C. Giannou Hat Yai July 2012

Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem l Search strategies

Alignments in Practice BLAST and CLUSTAL Introduction to Bioinformatics Dortmund, 16.-20.07.2007

Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p

Software Verification with BLAST Model Checking Blast Motivation Rigorous Sofware Development

Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio Interlude: PCR & Sequencing

Community Cricket Forum Item Cricket Blast Junior Blaster Cricket Blast Master Blaster

1 BLAST and BLAST-like programs Nucleotide Words NCBI FieldGuide NCBI FieldGuide Query

Blast summary Blast summary Basic ideas: Basic ideas: Alignment (global/local/affine

Searching Sequence databases 1: Searching Sequence databases 1: Blast Blast The Central dogma

Searching Sequence databases 1: Searching Sequence databases 1: Blast Blast Query: Query:

Counterexample Guided Abstraction Refinement in Blast Reading: Checking Memory Safety with Blast

CSE P 527 Computational Biology 3: BLAST, Alignment score significance; PCR and DNA sequencing

A vision of hope for a world in transition Jonas Salks wish was Jonas Salks wish was that

Australian Drosophila Ecology and Evolution Resource curating life science research data

Evaluation of text data mining for Evaluation of text data mining for database curation: lessons

SciForum MOL2NET Phytochemical prospection of ethanolic extract of Azadirachta indica stem bark

Provenance and Linked Data in Biological Data Webs Jun Zhao Image Bioinformatics Research Group

Assessing Human Error Against a Benchmark of Perfection Ashton Anderson University of Toronto

Foundations of Artificial Intelligence 40. Board Games: Introduction and State of the Art Malte

Annotation and down-stream analysis Martin Morgan 1 June 20-23, 2011 1 mtmorgan@fhcrc.org

Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and - PowerPoint PPT Presentation

Whats Behind BLAST Gene Myers, Director MPI for Cell Biology and Genetics Dresden, DE Approximate String Search Given a string A of length n, a query Q of length p n, an alignment scoring function , and a threshold d: Find all

BLAST Business License/ Web Update Business License/ Web Update BLAST BLAST BLAST BLAST (

A few BLAST details Julin Maloof April 16, 2019 Slides courtesy of Venkatsean Sundaresan BLAST

L4: Blast: Alignment Scores etc. L4: Blast: Alignment Scores etc. Why is Blast Fast? Why is

Blast Injuries and Landmines Travelling positive pressure wave C. Giannou Hat Yai July 2012

Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem l Search strategies

Alignments in Practice BLAST and CLUSTAL Introduction to Bioinformatics Dortmund, 16.-20.07.2007

Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p

Software Verification with BLAST Model Checking Blast Motivation Rigorous Sofware Development

Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio Interlude: PCR &amp; Sequencing

Community Cricket Forum Item Cricket Blast Junior Blaster Cricket Blast Master Blaster

1 BLAST and BLAST-like programs Nucleotide Words NCBI FieldGuide NCBI FieldGuide Query

Blast summary Blast summary Basic ideas: Basic ideas: Alignment (global/local/affine

Searching Sequence databases 1: Searching Sequence databases 1: Blast Blast The Central dogma

Searching Sequence databases 1: Searching Sequence databases 1: Blast Blast Query: Query:

Counterexample Guided Abstraction Refinement in Blast Reading: Checking Memory Safety with Blast

CSE P 527 Computational Biology 3: BLAST, Alignment score significance; PCR and DNA sequencing

A vision of hope for a world in transition Jonas Salks wish was Jonas Salks wish was that

Australian Drosophila Ecology and Evolution Resource curating life science research data

Evaluation of text data mining for Evaluation of text data mining for database curation: lessons

SciForum MOL2NET Phytochemical prospection of ethanolic extract of Azadirachta indica stem bark

Provenance and Linked Data in Biological Data Webs Jun Zhao Image Bioinformatics Research Group

Assessing Human Error Against a Benchmark of Perfection Ashton Anderson University of Toronto

Foundations of Artificial Intelligence 40. Board Games: Introduction and State of the Art Malte

Annotation and down-stream analysis Martin Morgan 1 June 20-23, 2011 1 mtmorgan@fhcrc.org

Outline BLAST CSE 527 Scoring Computational Biology Weekly Bio Interlude: PCR & Sequencing