csci 490
play

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment Outline Whats the biological problem Algorithm issues Intro to dynamic programming Global sequence alignment Local sequence alignment More efficient


  1. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 A -3 3

  2. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3

  3. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  4. Example x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A j = 0 0 -1 -2 -3 -4 Optimal Alignment: F(4,3) = 2 1 A -1 1 0 -1 -2 This only tells us the 2 T -2 0 0 1 0 best score 3 A -3 -1 -1 0 2

  5. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 A 2 T -2 0 0 1 0 A 3 A -3 -1 -1 0 2

  6. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 T A 2 T -2 0 0 1 0 T A 3 A -3 -1 -1 0 2

  7. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 G T A 2 T -2 0 0 1 0 - T A 3 A -3 -1 -1 0 2

  8. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 A G T A 2 T -2 0 0 1 0 A - T A 3 A -3 -1 -1 0 2

  9. Trace-back x = AGTA F(i-1, j-1) +  (Xi,Yj) m = 1 F(i,j) = max F(i-1, j) – d y = ATA s = 1 F(i, j-1) – d d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA 2 T -2 0 0 1 0 A − TA 3 A -3 -1 -1 0 2

  10. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 2 T -2 3 A -3

  11. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 3 A -3

  12. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3

  13. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  14. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  15. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  16. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 j = 0 A -1 1 0 -1 -2 1 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2

  17. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 Optimal Alignment: j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA 2 T -2 0 0 1 0 A − TA 3 A -3 -1 -1 0 2

  18. The Needleman-Wunsch Algorithm 1. Initialization. a. F(0, 0) = 0 = - j  d b. F(0, j) = - i  d c. F(i, 0) 2. Main Iteration. Filling in scores For each i = 1……M a. For each j = 1……N F(i-1,j) – d [case 1] F(i, j-1) – d F(i, j) = max [case 2] F(i-1, j-1) + σ (x i , y j ) [case 3] UP, if [case 1] Ptr(i,j) = LEFT if [case 2] DIAG if [case 3] 3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

  19. Complexity • Time: O(NM) • Space: O(NM) • Linear-space algorithms do exist (with the same time complexity)

  20. Equivalent graph problem S1 = G A T A (0,0) → : a gap in the 2 nd sequence 1 1 S2 = A  : a gap in the 1 st sequence 1 : match / mismatch T Value on vertical/horizontal line: -d 1 1 Value on diagonal: m or -s A (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.

  21. Question • If we change the scoring scheme, will the optimal alignment be changed? – Old: Match = 1, mismatch = gap = -1 – New: match = 2, mismatch = gap = 0 – New: Match = 2, mismatch = gap = -2?

  22. Question • What kind of alignment is represented by these paths? A A A A A B B B B B C C C C C A- A-- --A -A- -A BC -BC BC- B-C BC Alternating gaps are impossible if – s > -2d

  23. A variant of the basic algorithm Scoring scheme: m = s = d: 1 Seq1: CAGCA-CTTGGATTCTCGG || |:||| Score = -7 Seq2: ---CAGCGTGG-------- Seq1: CAGCACTTGGATTCTCGG |||| | | || Seq2: CAGC-----G-T----GG Score = -2 The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1)

  24. A variant of the basic algorithm • Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- • Then, we don ’ t want to penalize gaps in the ends

  25. The Overlap Detection variant Changes: x 1 ……………………………… x M y N ……………………………… y 1 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2. Termination max i F(i, N) F OPT = max max j F(M, j)

  26. Different types of overlaps x x y y

  27. The local alignment problem X = x 1 ……x M , Given two strings Y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c -de x y

  28. Why local alignment • Conserved regions may be a small part of the whole – Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes C D B A D B A C

  29. Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n 2 ) choices for A, O(m 2 ) for B, O(nm) for DP, so O(n 3 m 3 ) total.

  30. Reminder • The overlap detection algorithm – We do not give penalty to gaps at either end Free gap Free gap

  31. The local alignment idea • Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment – By resetting alignment score to zero

  32. The Smith-Waterman algorithm Initialization : F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d Iteration : F(i, j) = max F(i, j – 1) – d F(i – 1, j – 1) +  (x i , y j )

  33. The Smith-Waterman algorithm Termination : 1. If we want the best local alignment… F OPT = max i,j F(i, j) 2. If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

  34. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 Gap: -1 b 0 c 0 x 0 d 0 e 0 x 0

  35. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 x 0 d 0 e 0 x 0

  36. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 d 0 e 0 x 0

  37. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 e 0 x 0

  38. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 x 0

  39. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0

  40. x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0 2 2 2 1 1 4

  41. Trace back x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 c 0 0 0 0 2 1 0 x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 e 0 0 0 0 0 2 5 x 0 2 2 2 1 1 4

  42. Trace back x x x c d e Match: 2 0 0 0 0 0 0 0 Mismatch: -1 a 0 0 0 0 0 0 0 Gap: -1 b 0 0 0 0 0 0 0 cxde | || c 0 0 0 0 2 1 0 c-de x 0 2 2 2 1 1 0 d 0 1 1 1 1 3 2 x-de | || e 0 0 0 0 0 2 5 xcde x 0 2 2 2 1 1 4

  43. • No negative values in local alignment DP array • Optimal local alignment will never have a gap on either end • Local alignment: “ Smith-Waterman ” • Global alignment: “ Needleman-Wunsch ”

  44. Analysis • Time: – O(MN) for finding the best alignment – Time to report all alignments depends on the number of sub-opt alignments • Memory: – O(MN) – O(M+N) possible

  45. More efficient alignment algorithms

  46. • Given two sequences of length M, N • Time: O(MN) – Ok, but still slow for long sequences • Space: O(MN) – bad – 1Mb seq x 1Mb seq = 1TB memory • Can we do better?

  47. Bounded alignment Good alignment should appear near the diagonal

  48. Bounded Dynamic Programming If we know that x and y are very similar Assumption: # gaps(x, y) < k x i | i – j | < k Then, | implies y j

  49. Bounded Dynamic Programming x 1 ………………………… x M Initialization: y N ………………………… y 1 F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+  (x i , y j ) F(i, j – 1) – d, if j > i – k F(i, j) = max F(i – 1, j) – d, if j < i + k Termination: same k

  50. Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M M 2k 2k

  51. • Given two sequences of length M, N • Time: O(MN) – ok • Space: O(MN) – bad – 1mb seq x 1mb seq = 1TB memory • Can we do better?

  52. Linear space algorithm • If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?

  53. Linear space algorithm • When we finish, we know how we have aligned the ends of the sequences X M Y N Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))

  54. (0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

  55. (0,0) (3,2) (3,0) (3,4) (3,6) (6,6) • Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k))

  56. Hirschberg’s idea • Divide and conquer! Y X Forward algorithm Align x 1 x 2 …x M/2 with Y M/2 F(M/2, k) represents the best alignment between x 1 x 2 …x M/2 and y 1 y 2 …y k

  57. Backward Algorithm Y X Backward algorithm Align reverse(x M/2+1 …x M ) with reverse(Y) M/2 B(M/2, k) represents the best alignment between reverse(x M/2+1 …x M ) and reverse(y k y k+1 …y N )

  58. Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2

  59. Linear-space alignment Now, we can find k * maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k * Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend