 
L4: BLAST: Alignment Scores etc.
Why is BLAST Fast?
Silly Question
• Prove or disprove: there are two people in New York City with exactly the same number of hairs.
Large database search
[Figure labels: Database (n), Query (m)]
• Database size n = 10M, query size m = 300.
• O(nm) = $3 \cdot 10^9$ computations
Observations
• Much of the database is random from the query's perspective.
• Consider a random DNA string of length n.
  • Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
  • What is the probability that an exact match to the query can be found?
Basic probability
• Probability that there is a match starting at a fixed position i = $0.25^m$
• What is the probability that some position i has a match?
• Dependencies confound probability estimates.
Basic Probability: Expectation
• Q: Toss a coin; each time it comes up heads (probability p), you get a dollar.
• How much money do you expect to get after n tosses?
• Let $X_i$ be the amount earned in the i-th toss.
  $E(X_i) = 1 \cdot p + 0 \cdot (1 - p) = p$
• Total money you expect to earn:
  $E\left(\sum_i X_i\right) = \sum_i E(X_i) = np$
Expected number of matches
• The expected number of matches can still be computed.
• Let $X_i = 1$ if there is a match starting at position i, and $X_i = 0$ otherwise.
  $\Pr(\text{match at position } i) = p_i = 0.25^m$
  $E(X_i) = p_i = 0.25^m$
• Expected number of matches:
  $E\left(\sum_i X_i\right) = \sum_i E(X_i) = n \left(\frac{1}{4}\right)^m$
Expected number of exact matches is small!
• Expected number of matches = $n \cdot 0.25^m$
  • If $n = 10^7$, m = 10: expected number of matches = 9.537
  • If $n = 10^7$, m = 11: expected number of hits = 2.38
  • If $n = 10^7$, m = 12: expected number of hits ≈ 0.6 < 1
• Bottom line: an exact match to a substring of the query is unlikely just by chance.
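The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the formula $n \cdot 0.25^m$; the function name `expected_exact_matches` is made up for illustration.

```python
def expected_exact_matches(n, m, p=0.25):
    """Expected number of start positions in a random database of length n
    at which a fixed word of length m matches exactly.
    Uses linearity of expectation, so dependencies between positions do not matter."""
    return n * p ** m


# Reproduce the slide's three cases (n = 10^7, m = 10, 11, 12).
for m in (10, 11, 12):
    print(m, expected_exact_matches(10**7, m))
```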
Observation 2
• What is the pigeonhole principle?
• Suppose we are looking for a database string with greater than 90% identity to the query (length 100).
• Partition the query into size-10 substrings. With fewer than 10 mismatches spread over 10 substrings, at least one must match the database string exactly.
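The pigeonhole argument can be checked empirically. The sketch below assumes an ungapped, position-by-position comparison of two length-100 strings; `has_exact_block` and `mutate` are hypothetical helper names, not part of BLAST.

```python
import random


def has_exact_block(query, subject, block=10):
    """True if some aligned, non-overlapping block of the query
    is identical to the corresponding block of the subject."""
    return any(
        query[i:i + block] == subject[i:i + block]
        for i in range(0, len(query) - block + 1, block)
    )


def mutate(seq, n_mismatches, rng):
    """Return a copy of seq with exactly n_mismatches substituted positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mismatches):
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)


rng = random.Random(0)
query = "".join(rng.choice("ACGT") for _ in range(100))
# 9 mismatches cannot touch all 10 blocks, so one block always survives intact.
assert all(has_exact_block(query, mutate(query, 9, rng)) for _ in range(1000))
```

With exactly 10 mismatches (90% identity) the guarantee can fail, since one mismatch can land in each block; this is why the slide asks for greater than 90% identity.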
Why is this important?
• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.
• Assume that the mismatches are randomly distributed.
• What is the probability that there is no stretch of 10 bp where the query and the subject match exactly?
  $\left(1 - \left(\frac{8}{10}\right)^{10}\right)^{90} \cong 0.000036$
• Rough calculations show that it is very low: the probability of an exact match of a short query substring to a truly similar subject is very high.
• The above equation does not take dependencies into account.
• Reality is better because the matches are not randomly distributed.
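The rough calculation can be evaluated directly, and a small simulation gives a way to see how much the independence assumption matters. Both functions below are illustrative sketches with made-up names. The simulation assumes each base matches independently with probability 0.8, so overlapping windows are dependent and its estimate can differ noticeably from the closed-form value.

```python
import random


def prob_no_exact_stretch(identity=0.8, w=10, n_windows=90):
    """The slide's approximation: treat the n_windows windows as independent,
    each matching exactly with probability identity**w."""
    return (1.0 - identity ** w) ** n_windows


def simulate_no_exact_stretch(identity=0.8, w=10, length=100, trials=20000, seed=0):
    """Monte Carlo estimate of the chance that no run of w consecutive matches
    occurs, assuming i.i.d. per-base matches (an assumption of this sketch)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        run = best = 0
        for _ in range(length):
            run = run + 1 if rng.random() < identity else 0
            best = max(best, run)
        if best < w:
            misses += 1
    return misses / trials


print(prob_no_exact_stretch())
print(simulate_no_exact_stretch())
```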
Just the Facts
• Consider the set of all substrings of the query string of fixed length W.
  • The probability of an exact match to a random database string is very low.
  • The probability of an exact match to a true homolog is very high.
  • Keyword search (exact matches) is MUCH faster than sequence alignment.
BLAST
[Figure label: Database (n)]
• Consider all (m − W + 1) query words of size W (default = 11).
• Scan the database for exact matches to all such words.
• For all regions that hit, extend using a dynamic programming alignment.
• Can be many orders of magnitude faster than Smith-Waterman (SW) over the entire string.
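The first two steps (enumerate query words, scan the database for exact matches) can be sketched as follows. This is a toy illustration, not BLAST's actual implementation; the function names are hypothetical, and the extension step is left out.

```python
from collections import defaultdict


def query_words(query, w=11):
    """Map every length-w word of the query to the positions where it starts."""
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    return words


def find_seed_hits(query, database, w=11):
    """Slide a length-w window over the database and report
    (query_pos, db_pos) pairs where the window equals a query word."""
    words = query_words(query, w)
    hits = []
    for j in range(len(database) - w + 1):
        for i in words.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


print(find_seed_hits("ACGT", "TTACGTT", w=3))
```

Each reported pair is a candidate region; only those regions would be handed to the expensive alignment step.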
Why is BLAST fast?
• Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
• Query m = 1000, random database n = $10^7$, no true positives (TP)
• SW = O(nm) = $1000 \cdot 10^7 = 10^{10}$ computations
• BLAST, W = 11
  • E(#11-mer hits) = $1000 \cdot (1/4)^{11} \cdot 10^7 = 2384$
  • Number of computations = $2384 \cdot 100 \cdot 10 = 2.384 \cdot 10^6$
  • Ratio = $10^{10} / (2.384 \cdot 10^6) \approx 4200$
• Further speed improvements are possible.
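The estimate above can be reproduced numerically. In this sketch the slide's factor of 100 × 10 is read as the cost of extending one hit; that reading, and the parameter names `extend_length` and `extend_width`, are assumptions made here for illustration.

```python
def blast_speedup(m=1000, n=10**7, w=11, extend_length=100, extend_width=10):
    """Return (sw_cost, expected_hits, blast_cost, ratio) for a random database.
    Keyword search is assumed to be free, as on the slide."""
    sw_cost = m * n                          # full dynamic programming table
    expected_hits = m * n * 0.25 ** w        # chance w-mer hits
    blast_cost = expected_hits * extend_length * extend_width
    return sw_cost, expected_hits, blast_cost, sw_cost / blast_cost


sw_cost, hits, blast_cost, ratio = blast_speedup()
print(sw_cost, hits, blast_cost, ratio)
```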
Keyword Matching
• How fast can we match keywords?
• Hash table/database index? What is the size of the hash table for m = 11?
  [Figure: example index entry, AATCA → 567]
• Suffix trees? What is the size of the suffix trees?
• Trie-based search. We will do this in class.
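One way to picture the hash table option is a dictionary from each database word to its start positions. The sketch below is a toy illustration with hypothetical function names. For DNA words of length 11, a direct-address table would need up to 4^11 = 4,194,304 slots.

```python
from collections import defaultdict


def build_kmer_index(database, w):
    """Index every length-w word of the database by its start positions."""
    index = defaultdict(list)
    for pos in range(len(database) - w + 1):
        index[database[pos:pos + w]].append(pos)
    return index


def max_table_entries(w, alphabet_size=4):
    """Upper bound on the number of distinct words of length w."""
    return alphabet_size ** w


index = build_kmer_index("AATCAATCA", 5)
print(index["AATCA"])          # start positions of the word AATCA
print(max_table_entries(11))   # number of possible DNA 11-mers
```

Indexing the database costs memory proportional to its length but makes each word lookup constant time on average; indexing the query instead keeps the table small and requires one pass over the database per search.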
Related notes
• How to choose the alignment region?
  • Extend greedily until the score falls below a certain threshold.
• What about protein sequences?
  • Default word size = 3, and mismatches are allowed.
• Like sequences, BLAST has been evolving continuously:
  • Banded alignment
  • Seed selection
  • Scanning for exact matches, keyword search versus database indexing
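The greedy extension rule can be sketched as below. This is one possible reading, not necessarily the rule BLAST uses: the extension is ungapped, it stops once the running score drops more than `max_drop` below the best score seen, and the match/mismatch scores are arbitrary values chosen for illustration.

```python
def extend_right(query, subject, qi, sj, match=1, mismatch=-3, max_drop=5):
    """Extend an ungapped alignment to the right starting at (qi, sj).
    Returns (best_score, length_of_best_extension)."""
    score = best = 0
    best_len = 0
    k = 0
    while qi + k < len(query) and sj + k < len(subject):
        score += match if query[qi + k] == subject[sj + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > max_drop:
            break  # score fell too far below the best; stop extending
    return best, best_len


print(extend_right("AAAATTTT", "AAAACCCC", 0, 0))
```

A symmetric `extend_left` would complete the region around a seed hit.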
P-value computation
• How significant is a score? What happens to significance when you change the score function?
• A simple empirical method:
  • Compute a distribution of scores against a random database.
  • Use an estimate of the area under the curve to get the probability.
  • OR, fit the distribution to one of the standard distributions.
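The empirical method amounts to counting how often a random score reaches the observed one. The sketch below assumes the null scores have already been collected from searches against a random database. The +1 pseudocount, which keeps the estimate from being exactly zero, is a choice made here and does not come from the slide.

```python
def empirical_p_value(observed, null_scores):
    """Estimate P(score >= observed) from scores obtained against random data.
    Uses a +1 pseudocount in numerator and denominator."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


null_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(empirical_p_value(5, null_scores))
```

The resolution of this estimate is limited by the number of null scores, which is one reason to fit a standard distribution instead.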
Z-scores for alignment
• Initial assumption was that the scores followed a normal distribution.
• Z-score computation:
  • For any alignment with score S, shuffle one of the sequences many times, and recompute the alignment each time.
  • Get the mean $\mu$ and standard deviation $\sigma$ of the shuffled scores.
    $Z_S = \frac{S - \mu}{\sigma}$
• Look up a table to get a P-value.
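The shuffling procedure can be sketched as follows. The scoring function here, `identity_score`, is a toy stand-in that counts matching positions; a real use would plug in an alignment score. All names are hypothetical, and the P-value is computed from the normal tail instead of a lookup table.

```python
import random
import statistics
from math import erfc, sqrt


def identity_score(a, b):
    """Toy stand-in for an alignment score: number of matching positions."""
    return sum(x == y for x, y in zip(a, b))


def z_score(seq_a, seq_b, score_fn, n_shuffles=200, seed=0):
    """Z = (S - mu) / sigma, with mu and sigma estimated by shuffling seq_b.
    Raises ZeroDivisionError if every shuffle gives the same score."""
    rng = random.Random(seed)
    s = score_fn(seq_a, seq_b)
    chars = list(seq_b)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        null.append(score_fn(seq_a, "".join(chars)))
    return (s - statistics.mean(null)) / statistics.stdev(null)


def normal_p_value(z):
    """Upper-tail probability of a standard normal, P(Z >= z)."""
    return 0.5 * erfc(z / sqrt(2))


z = z_score("ACGT" * 25, "ACGT" * 25, identity_score)
print(z, normal_p_value(z))
```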