
Sequence Comparison: Dynamic Programming

Genome 373 Genomic Informatics, Elhanan Borenstein. Mission: Find the best alignment between two sequences (for example GAATC and CATAC).


  1. Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics. Elhanan Borenstein.

  2. Mission: Find the best alignment between two sequences (example: GAATC and CATAC). Two ingredients are needed:
     • A method for scoring alignments: substitution matrix, gap penalties
     • A "search" algorithm for finding the alignment with the best score: dynamic programming

  3. Scoring Aligned Bases
     • Substitution matrix:
            A    C    G    T
       A   10   -5    0   -5
       C   -5   10   -5    0
       G    0   -5   10   -5
       T   -5    0   -5   10
     • Gap penalty: linear gap penalty or affine gap penalty
     Example with gap penalty d = -4:
       GAAT-C
       CA-TAC
       -5 + 10 + -4 + 10 + -4 + 10 = 17
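The scoring on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the linear gap penalty d = -4 and the substitution matrix shown on the slide; the names `SUBST`, `GAP`, and `score_alignment` are hypothetical and not part of the lecture.

```python
BASES = "ACGT"
ROWS = [
    [10, -5, 0, -5],   # A
    [-5, 10, -5, 0],   # C
    [0, -5, 10, -5],   # G
    [-5, 0, -5, 10],   # T
]
# SUBST["A"]["C"] is the score for aligning A with C (values from the slide).
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
GAP = -4  # linear gap penalty d from the slide


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair a gap with a gap")
        # A column with one gap costs GAP; otherwise look up the base pair.
        total += GAP if "-" in (a, b) else SUBST[a][b]
    return total


print(score_alignment("GAAT-C", "CA-TAC"))  # 17, matching the slide
```

An affine gap penalty would need separate open and extend costs, which this sketch does not model.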

  4. Exhaustive search. Some of the candidate alignments of GAATC and CATAC:
       GAATC    GAAT-C   -GAAT-C
       CATAC    C-ATAC   C-A-TAC

       GAATC-   GAAT-C   GA-ATC
       CA-TAC   CA-TAC   CATA-C

  5. How many possibilities?
     • How many different possible alignments of two sequences of length n exist?

  6. How many possibilities?
     • How many different possible alignments of two sequences of length n exist?
          n    number of alignments
          5    2.5 x 10^2
         10    1.8 x 10^5
         20    1.4 x 10^11
         30    1.2 x 10^17
         40    1.1 x 10^23
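The slide gives the counts without a formula. The values agree with the central binomial coefficient C(2n, n), so the sketch below uses it; treat that identification as an assumption, since other counting conventions (for example, counting every distinct path through the DP matrix) give different numbers. The function name `count_alignments` is hypothetical.

```python
from math import comb


def count_alignments(n: int) -> int:
    """Central binomial coefficient C(2n, n); assumed to be the slide's count."""
    return comb(2 * n, n)


for n in (5, 10, 20, 30, 40):
    # Print in scientific notation with two significant digits, as on the slide.
    print(f"n = {n:2d}: {count_alignments(n):.1e}")
```

Whatever the exact convention, the growth is exponential in n, which is why exhaustive search is not practical.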

  7. The Needleman–Wunsch Algorithm
     • An algorithm for global alignment of two sequences
     • A Dynamic Programming (DP) approach
       – Yes, it's a weird name.
       – DP is closely related to recursion and to mathematical induction
     • We can prove that the resulting score is optimal.
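To illustrate the link between DP and recursion, here is a minimal top-down sketch: the best score for two prefixes is defined in terms of the best scores for shorter prefixes, and memoization keeps each subproblem from being solved twice. It assumes the substitution matrix and linear gap penalty d = -4 from slide 3; `best_score` and the other names are hypothetical, and this is not presented as the lecture's own implementation.

```python
from functools import lru_cache

BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
GAP = -4  # linear gap penalty d


def best_score(top: str, left: str) -> int:
    """Optimal global alignment score, computed by memoized recursion."""

    @lru_cache(maxsize=None)
    def f(i: int, j: int) -> int:
        # Best score for the first i characters of `left`
        # versus the first j characters of `top`.
        if i == 0:
            return j * GAP  # only gaps remain
        if j == 0:
            return i * GAP
        return max(
            f(i - 1, j - 1) + SUBST[left[i - 1]][top[j - 1]],  # align two residues
            f(i - 1, j) + GAP,  # gap in the top sequence
            f(i, j - 1) + GAP,  # gap in the left sequence
        )

    return f(len(left), len(top))


print(best_score("GAATC", "CATAC"))  # 17
```

The bottom-up matrix used in the slides computes the same values, only in a fixed order and without recursion.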

  8. DP matrix. GAATC runs along the top edge (column index j = 0, 1, 2, 3, etc.) and CATAC along the left edge (row index i = 0 to 5). The value at (i, j) is the score of the best alignment of the first i characters of one sequence versus the first j characters of the other sequence. Row 0 and column 0 are the initial row and column. Example: the entry at i = 2, j = 2 is 5, the score of
       GA
       CA

  9. DP matrix. Moving horizontally in the matrix introduces a gap in the sequence along the left edge. Example: one step right of the entry 5 is the entry 1 (5 + -4), the score of
       GAA
       CA-

  10. DP matrix. Moving vertically in the matrix introduces a gap in the sequence along the top edge. Example: one step below the entry 5 is the entry 1 (5 + -4), the score of
       GA-
       CAT

  11. DP matrix. Moving diagonally in the matrix aligns two residues. Example: one diagonal step from the entry 5 is the entry 0 (5 + -5 for A aligned with T), the score of
       GAA
       CAT

  12. Initialization. Start at top left and move progressively. The top-left entry (i = 0, j = 0) is 0.

  13. Introducing a gap. One step right of the top-left entry aligns G with a gap, so the entry at i = 0, j = 1 is 0 + -4 = -4.
       G
       -

  14. Introducing a gap. One step below the top-left entry aligns a gap with C, so the entry at i = 1, j = 0 is 0 + -4 = -4.
       -
       C

  15. Complete first row and column. Each further step adds another gap at -4; the last entry of the first column, -20, is the score of ----- over CATAC.
                 G    A    A    T    C
            0   -4   -8  -12  -16  -20
       C   -4
       A   -8
       T  -12
       A  -16
       C  -20
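The initialization step can be sketched as follows, assuming the linear gap penalty d = -4 from slide 3. The function name `init_matrix` and the use of `None` for entries not yet computed are illustrative choices, not taken from the slides.

```python
GAP = -4  # linear gap penalty d


def init_matrix(top: str, left: str) -> list:
    """Build the DP matrix with only the first row and column filled in."""
    rows, cols = len(left) + 1, len(top) + 1
    matrix = [[None] * cols for _ in range(rows)]
    for j in range(cols):
        matrix[0][j] = j * GAP  # j characters of `top` against gaps
    for i in range(rows):
        matrix[i][0] = i * GAP  # i characters of `left` against gaps
    return matrix


F = init_matrix("GAATC", "CATAC")
print(F[0])                    # first row
print([row[0] for row in F])   # first column
```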

  16. Three ways to get to i=1, j=1. First way: step down from the entry at i=0, j=1 (score -4), adding a gap in the top sequence: -4 + -4 = -8.
       G-
       -C

  17. Three ways to get to i=1, j=1. Second way: step right from the entry at i=1, j=0 (score -4), adding a gap in the left sequence: -4 + -4 = -8.
       -G
       C-

  18. Three ways to get to i=1, j=1. Third way: step diagonally from the entry at i=0, j=0 (score 0), aligning G with C: 0 + -5 = -5.
       G
       C

  19. Accept the highest scoring of the three. Here that is -5, so the entry at i=1, j=1 is -5. Then simply repeat the same rule progressively across the matrix.
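The rule on this slide can be written as a recurrence. The symbols are notational assumptions rather than anything shown on the slides: F(i, j) is the matrix entry, x is the sequence along the left edge (CATAC), y is the sequence along the top edge (GAATC), s is the substitution score, and d = -4 is the linear gap penalty.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(diagonal: align } x_i \text{ with } y_j\text{)} \\
F(i-1,\,j) + d & \text{(vertical: gap in the top sequence)} \\
F(i,\,j-1) + d & \text{(horizontal: gap in the left sequence)}
\end{cases}
\end{aligned}
```

For i = 1, j = 1 this gives max(0 + -5, -4 + -4, -4 + -4) = -5.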

  20. DP matrix. With the first row, the first column, and the entry -5 at i=1, j=1 filled in, the next entry to compute is i=2, j=1 (row A, column G), marked ?.

  21. DP matrix. Three candidates for the entry at i=2, j=1:
       -G        G-        --G
       CA        CA        CA-
       -4+0=-4   -5+-4=-9  -8+-4=-12

  22. DP matrix. Of the three candidates for the entry at i=2, j=1 (-4+0=-4, -5+-4=-9, -8+-4=-12), the highest is -4, so -4 is entered in the matrix.
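The three candidates for a single entry can be computed as in this sketch. It assumes the substitution matrix and d = -4 from slide 3 and a partially filled matrix containing only the entries needed here; `cell_candidates` is a hypothetical helper name.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
GAP = -4  # linear gap penalty d


def cell_candidates(matrix, i, j, top, left):
    """Return the (diagonal, from_above, from_left) scores for entry (i, j)."""
    diagonal = matrix[i - 1][j - 1] + SUBST[left[i - 1]][top[j - 1]]
    from_above = matrix[i - 1][j] + GAP  # gap in the top sequence
    from_left = matrix[i][j - 1] + GAP   # gap in the left sequence
    return diagonal, from_above, from_left


# Columns j = 0 and j = 1 of rows i = 0..2; entry (2, 1) is not yet known.
partial = [
    [0, -4],
    [-4, -5],
    [-8, None],
]
candidates = cell_candidates(partial, 2, 1, "GAATC", "CATAC")
print(candidates, "->", max(candidates))
```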

  23. DP matrix. With -4 entered at i=2, j=1, the remaining entries of column j=1 (rows T, A, C) are next, each marked ?.

  24. DP matrix. Column j=1 is now complete: -5, -4, -8, -12, -16 for rows C, A, T, A, C.

  25. DP matrix. Column j=2 is next: all five entries (rows C, A, T, A, C) are marked ?.

  26. Traceback. What is the alignment associated with this entry? Just follow the arrows back; this is called the traceback. For the entry 2 at i=4, j=2 the traceback gives
       -G-A
       CATA
     The matrix so far:
                 G    A    A    T    C
            0   -4   -8  -12  -16  -20
       C   -4   -5   -9
       A   -8   -4    5
       T  -12   -8    1
       A  -16  -12    2
       C  -20  -16   -2

  27. Full Alignment. Continue and find the optimal global alignment, and its score (the bottom-right entry, marked ?).

  28. Full Alignment. The completed DP matrix:
                 G    A    A    T    C
            0   -4   -8  -12  -16  -20
       C   -4   -5   -9  -13  -12   -6
       A   -8   -4    5    1   -3   -7
       T  -12   -8    1    0   11    7
       A  -16  -12    2   11    7    6
       C  -20  -16   -2    7   11   17
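Filling the whole matrix is the same three-way rule applied to every entry. The sketch below assumes the substitution matrix and linear gap penalty d = -4 from slide 3, with GAATC along the top and CATAC along the left; `fill_matrix` is a hypothetical name.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
GAP = -4  # linear gap penalty d


def fill_matrix(top: str, left: str) -> list:
    """Return the full Needleman-Wunsch score matrix (rows: left, cols: top)."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * GAP
    for i in range(1, rows):
        F[i][0] = i * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + SUBST[left[i - 1]][top[j - 1]],  # diagonal
                F[i - 1][j] + GAP,                                  # from above
                F[i][j - 1] + GAP,                                  # from the left
            )
    return F


matrix = fill_matrix("GAATC", "CATAC")
for row in matrix:
    print(" ".join(f"{value:4d}" for value in row))
```

The bottom-right entry is the optimal global alignment score. The work is proportional to the product of the two sequence lengths.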

  29. Full Alignment. Best alignment starts at bottom right and follows traceback arrows to top left. The bottom-right entry of the completed matrix is 17.
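A traceback can be sketched without storing arrows: at each entry, check which of the three moves reproduces the stored score and step back along it. This assumes the substitution matrix and d = -4 from slide 3. The tie-breaking order (diagonal, then vertical, then horizontal) is an arbitrary choice made for this sketch, and it determines which of several equally good alignments is returned. The names `fill_matrix` and `traceback` are hypothetical.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
GAP = -4  # linear gap penalty d


def fill_matrix(top, left):
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        F[0][j] = j * GAP
    for i in range(1, len(left) + 1):
        F[i][0] = i * GAP
        for j in range(1, len(top) + 1):
            F[i][j] = max(F[i - 1][j - 1] + SUBST[left[i - 1]][top[j - 1]],
                          F[i - 1][j] + GAP, F[i][j - 1] + GAP)
    return F


def traceback(top, left):
    """Return (score, aligned_top, aligned_left) for one best alignment."""
    F = fill_matrix(top, left)
    i, j = len(left), len(top)
    out_top, out_left = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + SUBST[left[i - 1]][top[j - 1]]:
            out_top.append(top[j - 1]); out_left.append(left[i - 1])
            i, j = i - 1, j - 1          # diagonal: two residues aligned
        elif i > 0 and F[i][j] == F[i - 1][j] + GAP:
            out_top.append("-"); out_left.append(left[i - 1])
            i -= 1                       # vertical: gap in the top sequence
        else:
            out_top.append(top[j - 1]); out_left.append("-")
            j -= 1                       # horizontal: gap in the left sequence
    return F[-1][-1], "".join(reversed(out_top)), "".join(reversed(out_left))


score, aligned_top, aligned_left = traceback("GAATC", "CATAC")
print(score)
print(aligned_top)
print(aligned_left)
```

A different tie-breaking order would follow a different path and could return another alignment with the same score.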

  30. One best traceback (score 17):
       GA-ATC
       CATA-C

  31. Another best traceback (score 17):
       GAAT-C
       -CATAC

  32. Both best tracebacks shown together in the matrix:
       GAAT-C    GA-ATC
       -CATAC    CATA-C

  33. Multiple solutions
     • When a program returns a single sequence alignment, it may not be the only best alignment, but it is guaranteed to be one of them.
     • In our example, all of these alignments have equal scores:
       GA-ATC   GAAT-C   GAAT-C   GAAT-C
       CATA-C   CA-TAC   C-ATAC   -CATAC
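To see all equally good alignments rather than just one, the traceback can branch wherever more than one move reproduces the stored score. The sketch below assumes the substitution matrix and d = -4 from slide 3; `all_best_alignments` is a hypothetical name, and the recursive enumeration is only one way to do this.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
GAP = -4  # linear gap penalty d


def fill_matrix(top, left):
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        F[0][j] = j * GAP
    for i in range(1, len(left) + 1):
        F[i][0] = i * GAP
        for j in range(1, len(top) + 1):
            F[i][j] = max(F[i - 1][j - 1] + SUBST[left[i - 1]][top[j - 1]],
                          F[i - 1][j] + GAP, F[i][j - 1] + GAP)
    return F


def all_best_alignments(top, left):
    """Return every (aligned_top, aligned_left) pair with the optimal score."""
    F = fill_matrix(top, left)
    found = []

    def walk(i, j, rev_top, rev_left):
        # rev_top / rev_left hold the alignment built so far, in reverse order.
        if i == 0 and j == 0:
            found.append((rev_top[::-1], rev_left[::-1]))
            return
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + SUBST[left[i - 1]][top[j - 1]]:
            walk(i - 1, j - 1, rev_top + top[j - 1], rev_left + left[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + GAP:
            walk(i - 1, j, rev_top + "-", rev_left + left[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] + GAP:
            walk(i, j - 1, rev_top + top[j - 1], rev_left + "-")

    walk(len(left), len(top), "", "")
    return found


best = all_best_alignments("GAATC", "CATAC")
for aligned_top, aligned_left in best:
    print(aligned_top)
    print(aligned_left)
    print()
```

For longer sequences the number of tied alignments can grow very quickly, which is one reason programs usually report a single one.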

  34. Practice problem: Find a best pairwise alignment of GAATC and AATTC, using the substitution matrix from slide 3 and d = -4. Starting matrix:
                 G    A    A    T    C
            0
       A
       A
       T
       T
       C
