L4: Blast: Alignment Scores etc.

Why is Blast Fast?

Silly Question
• Prove or Disprove:
  • There are two people in New York City with exactly the same number of hairs.

Large database search
• Database (n), Query (m)
• Database size n = 10M, query size m = 300.
• O(nm) = $3 \times 10^9$ computations

Observations
• Much of the database is random from the query's perspective.
• Consider a random DNA string of length n.
  • Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
  • What is the probability that an exact match to the query can be found?

Basic probability
• Probability that there is a match starting at a fixed position i = $0.25^m$
• What is the probability that some position i has a match?
• Dependencies confound probability estimates.

Basic Probability: Expectation
• Q: Toss a coin: each time it comes up heads, you get a dollar.
  • What is the money you expect to get after n tosses?
  • Let $X_i$ be the amount earned in the i-th toss, and let p be the probability of heads.
    $E(X_i) = 1 \cdot p + 0 \cdot (1 - p) = p$
  • Total money you expect to earn:
    $E\left(\sum_i X_i\right) = \sum_i E(X_i) = np$
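The coin-toss expectation above can be checked numerically. The following is a minimal sketch, assuming a biased coin with heads probability p; the function names and the seed are illustrative and not part of the slides.

```python
import random


def expected_winnings(n, p):
    """Linearity of expectation: n tosses, each worth p dollars on average."""
    return n * p


def simulate_winnings(n, p, trials, seed=0):
    """Average total winnings over `trials` simulated games of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # One dollar for every toss that comes up heads.
        total += sum(1 for _ in range(n) if rng.random() < p)
    return total / trials


print(expected_winnings(100, 0.5))        # closed form n * p
print(simulate_winnings(100, 0.5, 2000))  # should land close to n * p
```

The simulation never uses independence between tosses to obtain the mean, which is the point of the slide: expectations add even when the events are dependent.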

Expected number of matches
• Expected number of matches can still be computed.
• Let $X_i = 1$ if there is a match starting at position i, $X_i = 0$ otherwise.
  $\Pr(\text{match at position } i) = p_i = 0.25^m$
  $E(X_i) = p_i = 0.25^m$
• Expected number of matches $= E\left(\sum_i X_i\right) = \sum_i E(X_i) = n \left(\tfrac{1}{4}\right)^m$

Expected number of exact matches is small!
• Expected number of matches = $n \times 0.25^m$
  • If $n = 10^7$, m = 10:
    • then expected number of matches = 9.537
  • If $n = 10^7$, m = 11:
    • expected number of hits = 2.38
  • If $n = 10^7$, m = 12:
    • expected number of hits ≈ 0.6 < 1
• Bottom line: an exact match to a substring of the query is unlikely just by chance.
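The figures on this slide follow directly from the formula n × 0.25^m. Here is a small sketch that recomputes them; the function name is an assumption, and the number of start positions is approximated by n rather than n − m + 1, as on the slide.

```python
def expected_exact_matches(n, m, p_base=0.25):
    """Expected number of exact matches of a length-m word in a random string of length n.

    Uses n start positions (the slide's approximation) and equal base frequencies.
    """
    return n * p_base ** m


for m in (10, 11, 12):
    print(m, round(expected_exact_matches(10**7, m), 3))
# 10 9.537
# 11 2.384
# 12 0.596
```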

Observation 2
• What is the pigeonhole principle?
  • Suppose we are looking for a database string with greater than 90% identity to the query (length 100).
  • Partition the query into size 10 substrings. At least one must match the database string exactly: there are fewer than 10 mismatches to spread over 10 substrings, so some substring receives none.
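The pigeonhole argument can be exercised on random data. This sketch assumes an ungapped comparison (substitutions only, no insertions or deletions), which the slide does not state explicitly; the helper names are illustrative.

```python
import random


def blocks_without_mismatch(query, subject, block=10):
    """Start offsets of non-overlapping query blocks that match the subject exactly."""
    assert len(query) == len(subject)
    return [i for i in range(0, len(query) - block + 1, block)
            if query[i:i + block] == subject[i:i + block]]


def mutate(seq, n_mismatches, rng):
    """Copy of seq with exactly n_mismatches substituted positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mismatches):
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)


rng = random.Random(1)
query = "".join(rng.choice("ACGT") for _ in range(100))
# More than 90% identity over length 100 means at most 9 mismatches.
for _ in range(1000):
    subject = mutate(query, 9, rng)
    assert blocks_without_mismatch(query, subject)  # never empty
```

Nine mismatches can spoil at most nine of the ten blocks, so the assertion holds however the mismatches are placed.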

Why is this important?
• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.
• Assume that the mismatches are randomly distributed.
• What is the probability that there is no stretch of 10 bp where the query and the subject match exactly?
  $\left(1 - \left(\tfrac{8}{10}\right)^{10}\right)^{90} \approx 0.000036$
• Rough calculations show that it is very low. The probability of an exact match of a short query substring to a truly similar subject is very high.
• The above equation does not take dependencies into account: it treats the overlapping 10 bp windows as independent.
• Reality is better because the matches are not randomly distributed.
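The rough estimate on this slide, and the effect of the dependencies it ignores, can both be explored with a short script. This is a sketch under the slide's model (each position matches independently with probability 0.8); the function names and trial count are assumptions. Because overlapping windows are positively correlated, the product formula is a lower bound on the true probability of finding no exact 10 bp stretch, and the simulated value tends to come out well above it.

```python
import random


def prob_no_exact_window(identity, w, n_windows):
    """Slide's estimate: every window fails independently with probability 1 - identity**w."""
    return (1.0 - identity ** w) ** n_windows


def simulate_no_exact_window(identity, length, w, trials, seed=0):
    """Fraction of simulated alignments whose longest run of matches is shorter than w."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        run = best = 0
        for _ in range(length):
            # Extend the current run on a match, reset it on a mismatch.
            run = run + 1 if rng.random() < identity else 0
            best = max(best, run)
        if best < w:
            misses += 1
    return misses / trials


print(prob_no_exact_window(0.8, 10, 90))           # roughly 3.6e-05
print(simulate_no_exact_window(0.8, 100, 10, 5000))
```

Even with the dependencies included, most simulated alignments still contain an exact 10 bp seed, so the qualitative conclusion of the slide is unchanged.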

Just the Facts
• Consider the set of all substrings of the query string of fixed length W.
  • Prob. of exact match to a random database string is very low.
  • Prob. of exact match to a true homolog is very high.
  • Keyword search (exact matches) is MUCH faster than sequence alignment.

BLAST
• Consider all (m − W + 1) query words of size W (default = 11).
• Scan the database for exact matches to all such words.
• For all regions that hit, extend using a dynamic programming alignment.
• Can be many orders of magnitude faster than SW over the entire string.
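The first two steps above (collect query words, scan the database for exact matches) can be sketched as follows. This is an illustration of the seeding idea only, not the actual BLAST implementation; `query_words` and `find_seed_hits` are hypothetical names, and the extension step is left out.

```python
from collections import defaultdict


def query_words(query, w):
    """Map each length-w word of the query to the offsets where it occurs."""
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    return words


def find_seed_hits(query, database, w=11):
    """Scan the database once and report (query_offset, db_offset) for every exact word match."""
    words = query_words(query, w)
    hits = []
    for j in range(len(database) - w + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for i in words.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


print(find_seed_hits("GATTACA", "CCGATTACACC", w=7))  # [(0, 2)]
```

Each hit would then be handed to an alignment routine that extends around the seed, so the expensive dynamic programming runs only on a small fraction of the database.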

Why is BLAST fast?
• Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
• Query m = 1000, random Db n = $10^7$, no true positives (TP)
• SW = O(nm) = $1000 \times 10^7 = 10^{10}$ computations
• BLAST, W = 11
  • E(#11-mer hits) = $1000 \times (1/4)^{11} \times 10^7 \approx 2384$
  • Number of computations = $2384 \times 100 \times 10 = 2.384 \times 10^6$
  • Ratio = $10^{10} / (2.384 \times 10^6) \approx 4200$
• Further speed improvements are possible.
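The arithmetic on this slide can be reproduced in a few lines. The per-hit cost of 100 × 10 is taken from the slide as given; the function names are assumptions made for this sketch.

```python
def expected_seed_hits(m, n, w):
    """Expected number of chance w-mer hits between a length-m query and a random length-n database."""
    return m * n * 0.25 ** w


def blast_speedup(m, n, w, cost_per_hit):
    """Ratio of full Smith-Waterman cost (n * m) to seed-extension cost."""
    sw_cost = n * m
    blast_cost = expected_seed_hits(m, n, w) * cost_per_hit
    return sw_cost / blast_cost


hits = expected_seed_hits(1000, 10**7, 11)
ratio = blast_speedup(1000, 10**7, 11, cost_per_hit=100 * 10)
print(round(hits), round(ratio))  # 2384 4194
```

The exact ratio is about 4194, which the slide rounds to 4200.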

Keyword Matching
• How fast can we match keywords?
• Hash table/Db index? (Example index entry from the figure: AATCA, 567.) What is the size of the hash table, for m = 11?
• Suffix trees? What is the size of the suffix trees?
• Trie-based search. We will do this in class.
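One way to think about the hash table question is a direct-address index in which every DNA word is packed into an integer at 2 bits per base. The sketch below assumes that encoding (it is an illustration, not necessarily the scheme used in class); with words of length 11 there are 4^11 possible keys.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word):
    """Pack a DNA word into an integer, reading it as a base-4 number."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value


def build_index(database, w):
    """Map each encoded length-w word to the list of its start positions in the database."""
    index = defaultdict(list)
    for j in range(len(database) - w + 1):
        index[encode(database[j:j + w])].append(j)
    return index


print(4 ** 11)                                   # 4194304 possible 11-mers
index = build_index("AATCAATCA", 5)
print(index[encode("AATCA")])                    # [0, 4]
```

A table with about 4.2 million slots is small by current standards, while the position lists grow linearly with the database size n.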

Related notes
• How to choose the alignment region?
  • Extend greedily until the score falls below a certain threshold.
• What about protein sequences?
  • Default word size = 3, and mismatches are allowed.
• Like sequences, BLAST has been evolving continuously.
  • Banded alignment
  • Seed selection
  • Scanning for exact matches, keyword search versus database indexing
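The greedy extension mentioned above can be sketched as an ungapped, rightward extension that stops once the running score has dropped too far. The stopping rule here (quit when the score falls more than `drop` below the best score seen) and the scoring values are assumptions chosen for illustration; the slides only say that extension stops when the score falls below a threshold.

```python
def extend_right(query, subject, qi, sj, match=1, mismatch=-3, drop=5):
    """Extend an ungapped alignment rightwards from query[qi] / subject[sj].

    Returns (best_score, length_of_best_extension). Extension stops at the end
    of either sequence or when the running score is more than `drop` below the best.
    """
    score = best = 0
    best_len = 0
    k = 0
    while qi + k < len(query) and sj + k < len(subject):
        score += match if query[qi + k] == subject[sj + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > drop:
            break
    return best, best_len


print(extend_right("ACGTACGTTT", "ACGTACGTAA", 0, 0))  # (8, 8)
```

A full implementation would also extend leftwards from the seed and would typically allow gaps, for example with a banded dynamic programming step.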

P-value computation
• How significant is a score? What happens to significance when you change the score function?
• A simple empirical method:
  • Compute a distribution of scores against a random database.
  • Use an estimate of the area under the curve to get the probability.
  • OR, fit the distribution to one of the standard distributions.
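The empirical method above can be sketched as: score the query against many random sequences, then report the fraction of those null scores that reach the observed score. The scoring scheme (match 1, mismatch −1, linear gap −2), the uniform random-sequence model, and all function names are assumptions for this illustration.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def sample_null_scores(query, subject_len, trials, seed=0):
    """Scores of the query against `trials` uniformly random DNA strings."""
    rng = random.Random(seed)
    return [local_score(query, "".join(rng.choice("ACGT") for _ in range(subject_len)))
            for _ in range(trials)]


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score."""
    return sum(s >= observed for s in null_scores) / len(null_scores)


null = sample_null_scores("ACGTTGCAACGTTGCAACGT", subject_len=100, trials=100)
print(empirical_p_value(15, null))
```

The resolution of this estimate is limited by the number of trials, which is one reason to fit a standard distribution to the null scores instead.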

Z-scores for alignment
• Initial assumption was that the scores followed a normal distribution.
• Z-score computation:
  • For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean $\mu$ and standard deviation $\sigma$.
    $Z_S = \frac{S - \mu}{\sigma}$
  • Look up a table to get a P-value.
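The shuffle-based Z-score can be sketched as below. To keep the example self-contained, a toy ungapped identity count stands in for the alignment score; any scoring function with the same signature could be passed instead. The use of the sample standard deviation and the normal-tail conversion via `math.erfc` are assumptions that mirror the "initial assumption" of normality on the slide.

```python
import math
import random
import statistics


def identity_score(a, b):
    """Toy stand-in for an alignment score: number of matching positions, ungapped."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_scores(score_fn, a, b, trials, seed=0):
    """Scores of a against `trials` random shuffles of b."""
    rng = random.Random(seed)
    chars = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(chars)
        scores.append(score_fn(a, "".join(chars)))
    return scores


def z_score(observed, null_scores):
    """Z_S = (S - mu) / sigma, with mu and sigma estimated from the shuffled scores."""
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    return (observed - mu) / sigma


def p_from_z(z):
    """Upper-tail probability of a standard normal, replacing the table lookup."""
    return 0.5 * math.erfc(z / math.sqrt(2))


a = "ACGTACGTACGTACGTACGT"
b = "ACGTACGAACGTACGTTCGT"
null = shuffled_scores(identity_score, a, b, trials=500)
z = z_score(identity_score(a, b), null)
print(z, p_from_z(z))
```

Shuffling preserves the base composition of the sequence, so the null distribution reflects chance similarity between sequences of the same composition.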
