cse 182 l2 blast variants i
play

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. - PowerPoint PPT Presentation

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182 Searching Sequence


  1. CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu edu/classes/fa05/cse182 /classes/fa05/cse182 www. www.cse cse. .ucsd ucsd. .edu edu/~ /~vbafna vbafna www. FA05 CSE182

  2. Searching Sequence databases http://www.ncbi.nlm.nih.gov/BLAST/ FA05 CSE182

  3. Query: >gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus] MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGYIIVFVVA LIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSLCKVIPYLQTV SVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVMECSSMLPGLANKT TLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQIPGTSSVVQRKWKQQQPV SQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAICYLPISILNVLKRVFGMFTHTEDRE TVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAFSCCLGVHHRQGDRLARGRTSTESRKSLTT QISNFDNVSKLSEHVVLTSISTLPAANGAGPLQNWYLQQGVPSSLLSTWLEV • What is the function of this sequence? • Is there a human homolog? • Which organelle does it work in? (Secreted/membrane bound) • Idea: Search a database of known proteins to see if you can find similar sequences which have a known function FA05 CSE182

  4. Querying with Blast FA05 CSE182

  5. Blast Output • The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query • Each database hit is aligned to a subsequence of the query FA05 CSE182

  6. Blast Output Schematic 26 422 query db 19 405 FA05 CSE182

  7. Blast Output S Id Q beg S beg Q end S end FA05 CSE182

  8. Interpreting Blast results • How do we interpret these results? – Similar sequence in the 3 species implies that the common ancestor of the 3 had that sequence. – The sequence accumulates mutations over time. These mutations may be indels, or substitutions. – Hum and mus diverged more recently and so the sequences are more likely to be similar. hum hummus? mus dros FA05 CSE182

  9. How do we measure similarity between sequences? • Percent identity? – Hard to compute without indels. A T C A A T C G A T C A A T C G T C A A T G G T T C A A T G G T Number of sequence edit operations?  Number of sequence edit operations?  Implies a notion of alignment.  Implies a notion of alignment.  FA05 CSE182

  10. Computing alignments • What is an alignment? 2Xm table. • Each sequence is a row, with interspersed gaps • Columns describe the edit operations • A A - T C G G A A C T C G - A What is the score of an alignment? • What is the score of an alignment? • Score every column, and sum up the scores. Let C be the score • Score every column, and sum up the scores. Let C be the score • function for the column function for the column How do we compute the alignment with the best score? • How do we compute the alignment with the best score? • FA05 CSE182

  11. Optimum scoring alignments, and score of optimum alignment • Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment. • Later, we will show that the two are equivalent FA05 CSE182

  12. Computing Optimal Alignment score 1 1 2 k t s • Observations: The optimum alignment has nice recursive properties: – The alignment score is the sum of the scores of columns. – If we break off at cell k, the left part and right part must be optimal sub-alignments. – The left part contains prefixes s[1..i], and t[1..j] FA05 CSE182

  13. Optimum prefix alignments 1 k s t • Consider an optimum alignment of the prefix s[1..i], and t[1..j] • Look at the last cell. It can only have 3 possibilities. FA05 CSE182

  14. 3 possibilities for rightmost cell Optimum alignment of s[1..i-1], and t[1..j-1] s[i] 1. s[i] is aligned to t[j] t[j] Optimum alignment of s[1..i-1], and t[1..j] 2. s[i] is aligned to ‘-’ s[i] 3. t[j] is aligned to ‘-’ t[j] Optimum alignment of s[1..i], and t[1..j-1] FA05 CSE182

  15. Optimal score of an alignment Optimum alignment of s[1..i-1], and t[1..j-1] s[i] S[i,j] = C(s i ,t j )+S(i-1,j-1) t[j] Optimum alignment of s[1..i-1], and t[1..j] s[i] S[i,j] = C(s i ,-)+S(i-1,j) - Optimum alignment of s[1..i], and t[1..j-1] - S[i,j] = C(-,t j )+S(i,j-1) t[j] • Let S[i,j] be the score of an optimal alignment of the prefix s[1..i], and t[1..j]. It must be one of 3 possibilities. FA05 CSE182

  16. Optimal alignment score  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  • Which prefix pairs (i,j) should we use? For now, simply use all. • If the strings are of length m, and n, respectively, what is the score of the optimal alignment? FA05 CSE182

  17. An O(nm) algorithm for score computation For i = 1 to n For j = 1 to m  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  • The iteration ensures that all values on the right are computed in earlier steps. FA05 CSE182

  18. Initialization S [0,0] = 0 S [ i ,0] = C ( s i , − ) + S [ i − 1,0] ∀ i S [0, j ] = C ( − , s j ) + S [0, j − 1] ∀ j FA05 CSE182

  19. A tableaux approach t 1 j n 1 s M S[i-1,j] S[i-1,j-1]  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  i S[i,j-1] S[i,j] n Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells FA05 CSE182

  20. An Example T C A T - T C A T T G C A A T G C A A A1 A2 • Align s=TCAT with t=TGCAA • Match Score = 1 • Mismatch score = -1, Indel Score = -1 • Score A1?, Score A2? FA05 CSE182

  21. Alignment Table T G C A A 0 -1 -2 -3 -4 -5 -1 1 0 -1 -2 -3 T -2 0 0 1 0 -1 C -3 -1 -1 0 2 1 A -4 -2 -2 -1 1 1 T FA05 CSE182

  22. Alignment Table • S[4,5] = 1 is the score of an optimum T G C A A alignment 0 -1 -2 -3 -4 -5 • Therefore, A2 is an -1 1 0 -1 -2 -3 optimum alignment T • We know how to -2 0 0 1 0 -1 C obtain the optimum -3 -1 -1 0 2 1 A Score. How do we get -4 -2 -2 -1 1 1 the best alignment? T FA05 CSE182

  23. Computing Optimum Alignment • At each cell, we have 3 choices • We maintain additional information to record the choice at each step. For i = 1 to n For j = 1 to m  S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max S [ i − 1, j ] + C ( s i , − )   S [ i , j − 1] + C ( − , t j )  j-1 j If (S[i,j]= S[i-1,j-1] + C(s i ,t j )) M[i,j] = i-1 If (S[i,j]= S[i-1,j] + C(s i ,-)) M[i,j] = If (S[i,j]= S[i,j-1] + C(-,t j ) ) M[i,j] = i FA05 CSE182

  24. Computing Optimal Alignments T G C A A 0 -1 -2 -3 -4 -5 -1 1 0 -1 -2 -3 T -2 0 0 1 0 -1 C -3 -1 -1 0 2 1 A -4 -2 -2 -1 1 1 T FA05 CSE182

  25. Retrieving Opt.Alignment • M[4,5]= 1 2 3 4 5 Implies that T G C A A S[4,5]=S[3,4]+C( A,T ) 0 -1 -2 -3 -4 -5 or A -1 1 0 -1 -2 -3 1 T T -2 0 0 1 0 -1 2 C M[3,4]= -3 -1 -1 0 2 1 3 A Implies that S[3,4]=S[2,3] +C( A,A ) -4 -2 -2 -1 1 1 4 T or A A A T FA05 CSE182

  26. Retrieving Opt.Alignment • M[2,3]= Implies that T G C A A S[2,3]=S[1,2]+C( C,C ) 0 -1 -2 -3 -4 -5 or C A A -1 1 0 -1 -2 -3 T C A T -2 0 0 1 0 -1 C M[1,2]= -3 -1 -1 0 2 1 A Implies that S[1,2]=S[1,1] +C (-,G ) -4 -2 -2 -1 1 1 T or T - C A A T G C A T FA05 CSE182

  27. Algorithm to retrieve optimal alignment RetrieveAl(i,j) if (M[i,j] == `\’) s i t j return (RetrieveAl (i-1,j-1) . ) else if (M[i,j] == `|’) s i - return (RetrieveAl (i-1,j) . ) else if (M[i,j] == `--’ ’) ) else if (M[i,j] == `-- - return (RetrieveAl RetrieveAl (i,j-1) . (i,j-1) . ) ) return ( t j FA05 CSE182

  28. Summary • An optimal alignment of strings of length n and m can be computed in O(nm) time • There is a tight connection between computation of optimal score, and computation of opt. Alignment – True for all DP based solutions FA05 CSE182

  29. Global versus Local Alignment Consider s = ACCACCCCTT t = ATCCCCACAT ACCACCCCTT ACCACCCCT T ACCACCCCTT ACCACCCCT T ATCCCCACAT A TCCCCACAT ATCCCCACAT A TCCCCACAT FA05 CSE182

  30. Blast Outputs Local Alignment Schematic 26 422 query db 19 405 FA05 CSE182

  31. Local Alignment • Compute maximum scoring interval over all sub-intervals (a,b), and (a’,b’) a • How can we compute this efficiently? b a’ b’ FA05 CSE182

  32. Local Alignment Trick  0 S [ i − 1, j − 1] + C ( s i , t j )  S [ i , j ] = max  S [ i − 1, j ] + C ( s i , − )  S [ i , j − 1] + C ( − , t j )  How can we compute the local alignment itself? FA05 CSE182

  33. Generalizing Gap Cost • It is more likely for gaps to be contiguous • The penalty for a gap of length l should be go + ge ∗ l FA05 CSE182

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend