sequence alignment
play

Sequence alignment Correspondence between bases of two DNA - PowerPoint PPT Presentation

Sequence alignment Correspondence between bases of two DNA sequences, or between amino acids of two protein sequences Alignment":""2"x"k"matrix"("k" m,"n") n"="10


  1. Sequence alignment Correspondence between bases of two DNA sequences, or between amino acids of two protein sequences Alignment":""2"x"k"matrix"("k" ≥ m,"n") n"="10 V""="ACCTGGTAAA matches 8 mismatches 1 m"="10 W"="ACTGCGTATA deletions 1 insertions 1 V A C C T G G T A A A W" A C T G C G T A T A

  2. “Goodness” of alignments Given two sequences, there are many possible alignments ATTTTCCC distance=2 ATTTACGC ATTT-TCCC distance=3 ATTTA-CGC ATTTTCCC———————— distance=16 ————————ATTTACGC Edit distance : the total number of substitutions, insertions and deletions needed to transform one sequence to another

  3. Manhattan tourist problem Imagine seeking a Source * * path (from source to sink) to travel * (only eastward and * * southward) with the * most number of * * attractions (*) in * the Manhattan grid * * * Sink

  4. Recursive algorithm -> Dynamic programming Function MT ( n,m ) 1. x = MT(n-1,m)+ weight of the edge from (n-1,m) to (n,m) 2. y = MT(n,m-1)+ weight of the edge from (n,m-1) to (n,m) 3. return max{x,y} MT (x, y) returns the “most weighted” path from point (x, y) to the “sink”.

  5. How to find the optimal path 0 1 2 3 source 1 2 5 0 1 3 8 • Start from Sink. i 5 3 10 $5 • Find which of the two $5 1 $5 1 5 4 13 8 edges gave the “max”. Take it. 3 5 $3 2 3 3 $5 2 • Repeat. 8 9 12 15 0 0 $5 1 0 0 0 3 8 9 9 16 S 3,3$ =/16

  6. Recipe 1. Identify subproblems 2. Write down recursions 3. Make it dynamic-programming!

  7. The edit distance problem A F G C D E Match A G Insertion_X C Insertion_Y D E F A-GCDEF AFGCDE-

  8. Minimum Edit Distance For sequence X and Y

  9. Optimal alignment match match

  10. Complexity

  11. Is the edit distance the best way? For sequence X and Y

  12. Amino acids can share similar properties

  13. Weighted edit distance • To generalize scoring for DNA/RNA, consider a 4x4 scoring matrix S . • In the case of an amino acid sequence alignment, the scoring matrix would be a 20x20 size. • The addition of d is to include the score for comparison of a gap character “-”. • Two questions: • (a) What should S be? • (b) How do we find optimal scoring alignment?

  14. Weighted edit distance • To generalize scoring for DNA/RNA, consider a (4+1) x(4+1) scoring matrix S . • In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. • The addition of d is to include the score for comparison of a gap character “-”. • Two questions: • (a) What should S be? • (b) How do we find optimal scoring alignment? Traditionally, people tend to maximize the alignment score with a negative gap penalty score

  15. BLOcks SUbstitution Matrix (BLOSUM) amino acids

  16. BLOcks SUbstitution Matrix (BLOSUM)

  17. Recursion for generalized edit distance Complexity?

  18. Gap score/penalty

  19. Affine gap penalty Question: How to develop an efficient dynamic programming algorithm for affine gap penalties?

  20. Categories of pairwise alignments

  21. Semi-global alignment

  22. Semi-global alignment

  23. Local alignment: naive algorithm • Long run time O(n 4 ): - In the grid of size n x n there are n 2 vertices (i,j) that may serve as a source. - For each such vertex computing alignments from (i,j) to (i’,j’) takes O(n 2 ) time. • This can be remedied by allowing every point to be the starting point

  24. Local alignment: Smith-Waterman algorithm Idea: start over from any entry!

  25. Local alignment

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend