
CSCI 490 Bioinformatics Multiple Sequence Alignment



  1. CSCI 490 Bioinformatics Multiple Sequence Alignment

  2. Multiple Sequence Alignment
     • Motivation:
       – A faint similarity between two sequences becomes very significant if present in many sequences
     • Definition
       – Given N sequences $x_1, x_2, \ldots, x_N$: insert gaps (-) in each sequence $x_i$ such that
         • All sequences have the same length L
         • Score of the alignment is maximum
     • Two issues
       – How to score an alignment?
       – How to find a (nearly) optimal alignment?

  3. Scoring Function: Sum Of Pairs
     Definition: Induced pairwise alignment
     A pairwise alignment induced by the multiple alignment
     Example:
       x: ACGCGGC
       y: ACGCGAG
       z: GCCGCGAG
     Induces:
       x: ACGCGG-C;    x: AC-GCGG-C;    y: AC-GCGAG
       y: ACGC-GAG;    z: GCCGC-GAG;    z: GCCGCGAG
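
The projection from a multiple alignment to an induced pairwise alignment can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `induced_pair` is hypothetical, and the gapped rows for x, y, z are the ones shown in the sum-of-pairs example slide. The only rule applied is that a column is dropped when both selected rows hold a gap.

```python
def induced_pair(msa, k, l):
    """Project rows k and l of a multiple alignment.

    Columns in which both rows contain a gap are removed, since they
    carry no information about the pair.
    """
    kept = [(a, b) for a, b in zip(msa[k], msa[l])
            if not (a == "-" and b == "-")]
    row_k = "".join(a for a, _ in kept)
    row_l = "".join(b for _, b in kept)
    return row_k, row_l


# Gapped rows for x, y, z (assumed from the sum-of-pairs example slide).
msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]

for k, l in [(0, 1), (0, 2), (1, 2)]:
    print(induced_pair(msa, k, l))
# ('ACGCGG-C', 'ACGC-GAG')
# ('AC-GCGG-C', 'GCCGC-GAG')
# ('AC-GCGAG', 'GCCGCGAG')
```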

  4. Sum Of Pairs (cont’d)
     • The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments
       $S(m) = \sum_{k<l} s(m_k, m_l)$
       $s(m_k, m_l)$: score of induced alignment (k, l)

  5. Example:
       x: AC-GCGG-C
       y: AC-GC-GAG
       z: GCCGC-GAG

     Scoring matrix:
             A    C    G    T    -
        A    1   -1   -1   -1   -1
        C   -1    1   -1   -1   -1
        G   -1   -1    1   -1   -1
        T   -1   -1   -1    1   -1
        -   -1   -1   -1   -1    0

     Sample column scores:
       (A,A) + (A,G) x 2 = -1
       (G,G) x 3 = 3
       (-,A) x 2 + (A,A) = -1

     Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
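
The arithmetic in this example can be checked with a short Python sketch. It assumes the scoring matrix from the slide (match +1, mismatch -1, residue against gap -1, gap against gap 0); the function names `pair_score`, `column_scores`, and `sp_score` are illustrative, not part of any tool.

```python
from itertools import combinations


def pair_score(a, b):
    """Score one pair of symbols with the slide's matrix."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def column_scores(msa):
    """Sum-of-pairs score of every column of the alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*msa)]


def sp_score(msa):
    """Sum-of-pairs score of the whole alignment."""
    return sum(column_scores(msa))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]  # rows x, y, z
print(column_scores(msa))  # [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sp_score(msa))       # 5
```

Scoring column by column gives the same total as summing the three induced pairwise alignments, because gap-gap columns score 0.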

  6. Multiple Sequence Alignment Algorithms
     • Can also be global or local
       – We only talk about global for now
     • A simple method
       – Do pairwise alignment between all pairs
       – Combine the pairwise alignments into a single multiple alignment
       – Is this going to work?

  7. Compatible pairwise alignments
     Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG
     Pairwise alignments:
       AAAATTTT----      AAAATTTT----      ----TTTTGGGG
       ----TTTTGGGG      AAAA----GGGG      AAAA----GGGG
     Combined multiple alignment:
       AAAATTTT----
       ----TTTTGGGG
       AAAA----GGGG

  8. Incompatible pairwise alignments
     Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA
     Pairwise alignments:
       AAAATTTT----      ----AAAATTTT      TTTTGGGG----
       ----TTTTGGGG      GGGGAAAA----      ----GGGGAAAA
     Combined multiple alignment: ?

  9. Multidimensional Dynamic Programming (MDP)
     Generalization of Needleman-Wunsch:
     • Find the longest path in a high-dimensional cube
       – As opposed to a two-dimensional grid
     • Uses an N-dimensional matrix
       – As opposed to a two-dimensional array
     • Entry $F(i_1, \ldots, i_N)$ represents the score of the optimal alignment for $s_1[1..i_1], \ldots, s_N[1..i_N]$
       $F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr} \,\in\, \text{neighbors of the cell}} \big( F(\text{nbr}) + S(\text{current}) \big)$

  10. Multidimensional Dynamic Programming (MDP)
      • Example: in 3D (three sequences)
      • $2^3 - 1 = 7$ neighbors/cell: (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)

      $$F(i,j,k) = \max \begin{cases}
        F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
        F(i-1,j-1,k) + S(x_i, y_j, -) \\
        F(i-1,j,k-1) + S(x_i, -, z_k) \\
        F(i,j-1,k-1) + S(-, y_j, z_k) \\
        F(i-1,j,k) + S(x_i, -, -) \\
        F(i,j-1,k) + S(-, y_j, -) \\
        F(i,j,k-1) + S(-, -, z_k)
      \end{cases}$$
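
A minimal Python sketch of the three-sequence recurrence follows. It is an illustration under stated assumptions rather than the course's implementation: the column score S is taken to be the sum-of-pairs score with match +1, mismatch -1, residue against gap -1, and gap against gap 0, and the names `msa3_score` and `column_score` are hypothetical. Only the optimal score is computed, with no traceback.

```python
from itertools import combinations, product


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def column_score(col):
    """Sum-of-pairs score of one alignment column (assumed form of S)."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def msa3_score(x, y, z):
    """Optimal SP score of a global alignment of three sequences."""
    neg_inf = float("-inf")
    F = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 moves: each sequence either advances (1) or gaps (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # neighbor lies outside the cube
                    col = (x[pi] if di else "-",
                           y[pj] if dj else "-",
                           z[pk] if dk else "-")
                    cand = F[pi][pj][pk] + column_score(col)
                    if cand > F[i][j][k]:
                        F[i][j][k] = cand
    return F[len(x)][len(y)][len(z)]


print(msa3_score("AC", "AC", "AC"))  # 6: two columns of three matches
```

The same loop structure extends to N sequences by iterating over all non-zero 0/1 vectors of length N, which is where the $2^N - 1$ neighbor count comes from.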

  11. Multidimensional Dynamic Programming (MDP)
      Running Time:
      1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
      2. Neighbors/cell: $2^N - 1$
      Therefore: $O(2^N L^N)$
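
To get a feel for how quickly this bound grows, the product of cell count and neighbor count can be evaluated directly. The helper `mdp_work` below is a hypothetical name, and the result is a rough count of cell updates, not a measured running time.

```python
def mdp_work(L, N):
    """Cells in the N-dimensional matrix times neighbors per cell."""
    cells = L ** N
    neighbors = 2 ** N - 1
    return cells * neighbors


print(mdp_work(10, 3))  # 7000: 10^3 cells, 7 neighbors each

# Growth with the number of sequences for a fixed length of 250.
for n in range(2, 7):
    print(n, mdp_work(250, n))
```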

  12. Faster MDP • Carrillo & Lipman, 1988 – Branch and bound – Other heuristics • Implemented in a tool called MSA • Practical for about 6 sequences of length about 200-300.

  13. Faster MDP
      • Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed
        – Upper bound: sum of optimal pairwise alignment scores, i.e.
          $S(m) = \sum_{k<l} s(m_k, m_l) \le \sum_{k<l} s(k, l)$
          where $s(m_k, m_l)$ is the score of the alignment between k and l induced by the multiple alignment m, and $s(k, l)$ is the score of the optimal alignment between k and l
        – Lower bound: score computed by any approximate algorithm
        – For any partial path, if $S_{\text{current}} + S_{\text{prospective}} <$ lower bound, can give up that path ($S_{\text{prospective}}$ bounds the score still attainable on the rest of the path)
        – Guarantees optimality
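
The upper bound can be illustrated with a small Python sketch that sums optimal pairwise Needleman-Wunsch scores. It covers only the bound, not the branch-and-bound search of the MSA tool. The scoring (match +1, mismatch -1, linear gap -1) is assumed to match the earlier sum-of-pairs example, and the names `nw_score` and `sp_upper_bound` are hypothetical.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global pairwise alignment score (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sp_upper_bound(seqs):
    """Sum of optimal pairwise scores: no multiple alignment can score higher."""
    return sum(nw_score(a, b) for a, b in combinations(seqs, 2))


seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]  # x, y, z from the SP example
bound = sp_upper_bound(seqs)
print(bound)
```

The alignment in the sum-of-pairs example scores 5, so the printed bound is at least 5. Any score obtained from a heuristic alignment plays the role of the lower bound.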

  14. Progressive Alignment
      • Multiple Alignment is NP-hard
      • Most used heuristic: Progressive Alignment
      Algorithm:
      1. Align two of the sequences $x_i$, $x_j$
      2. Fix that alignment
      3. Align a third sequence $x_k$ to the alignment $x_i, x_j$
      4. Repeat until all sequences are aligned
      Running Time: $O(NL^2)$
        Each alignment takes $O(L^2)$
        Repeat N times

  15. Progressive Alignment
      (Figure: evolutionary tree with leaves x, y, z, w)
      • When evolutionary tree is known:
        – Align closest first, in the order of the tree
      Example: Order of alignments:
      1. (x, y)
      2. (z, w)
      3. (xy, zw)

  16. Progressive Alignment: CLUSTALW
      CLUSTALW: most popular multiple protein alignment tool
      Algorithm:
      1. Find all $d_{ij}$: alignment dist($x_i$, $x_j$)
         • High alignment score => short distance
      2. Construct a tree (similar to hierarchical clustering)
      3. Align nodes in order of decreasing similarity
         + a large number of heuristics

  17. CLUSTALW example
      • S1 ALSK
      • S2 TNSD
      • S3 NASK
      • S4 NTSD

  18. CLUSTALW example: distance matrix
             s1   s2   s3   s4
        s1    0    9    4    7
        s2         0    8    3
        s3              0    7
        s4                   0

  19. CLUSTALW example: guide tree
      (Figure: tree with leaves in the order s1, s3, s2, s4; s1 pairs with s3, s2 pairs with s4)

  20. CLUSTALW example: align s1 with s3
        -ALSK
        NA-SK

  21. CLUSTALW example: align s2 with s4
        -TNSD
        NT-SD

  22. CLUSTALW example: combine the two pairwise alignments
        s1  -ALSK
        s2  -TNSD
        s3  NA-SK
        s4  NT-SD
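
The order in which sequences are joined can be derived from the distance matrix with a simple agglomerative clustering sketch. This is an assumption-laden illustration: it uses average linkage, which matches the slides' description "similar to hierarchical clustering" but is not claimed to be CLUSTALW's actual tree-building procedure, and `merge_order` is a hypothetical name.

```python
from itertools import combinations


def merge_order(names, dist):
    """Repeatedly merge the two closest clusters (average linkage)."""
    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def cluster_dist(c1, c2):
        # Mean of all pairwise distances between members of the clusters.
        return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_dist(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Distance matrix from the CLUSTALW example.
dist = {("s1", "s2"): 9, ("s1", "s3"): 4, ("s1", "s4"): 7,
        ("s2", "s3"): 8, ("s2", "s4"): 3, ("s3", "s4"): 7}

for step in merge_order(["s1", "s2", "s3", "s4"], dist):
    print(step)
# (('s2',), ('s4',))
# (('s1',), ('s3',))
# (('s2', 's4'), ('s1', 's3'))
```

With these distances the closest pair is s2 and s4 (distance 3), followed by s1 and s3 (distance 4). The resulting tree pairs the same sequences as the alignments shown in the example.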

  23. Iterative Refinement
      Problems with progressive alignment:
      • Depends on pairwise alignments
      • If sequences are very distantly related, much higher likelihood of errors
      • Initial alignments are “frozen” even when new evidence comes
      Example:
        x: GAAGTT
        y: GAC-TT    (frozen!)
        z: GAACTG
        w: GTACTG
      Now clear: correct y should be GA-CTT

  24. Iterative Refinement
      Algorithm (Barton-Sternberg):
      1. Align most similar $x_i$, $x_j$
      2. Align $x_k$ most similar to $(x_i x_j)$
      3. Repeat 2 until $(x_1 \ldots x_N)$ are aligned
         (steps 1 to 3: progressive alignment)
      4. For j = 1 to N, remove $x_j$ and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
      5. Repeat 4 until convergence

  25. Iterative Refinement (cont’d)
      For each sequence y:
      1. Remove y
      2. Realign y (while rest fixed)
      (Figure: projection onto axes x, y, z; allow y to vary, x and z fixed)

  26. Iterative Refinement
      Example: align (x,y), (z,w), (xy, zw):
        x: GAAGTTA
        y: GAC-TTA
        z: GAACTGA
        w: GTACTGA
      After realigning y:
        x: GAAGTTA
        y: G-ACTTA    (+ 3 matches)
        z: GAACTGA
        w: GTACTGA
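
One realignment step from this example can be reproduced with a sequence-to-alignment dynamic program. The sketch below makes several assumptions: the removed sequence is scored against each fixed row with match +1, mismatch -1, residue against gap -1, and gap against gap 0, gaps are linear, and the function name `realign` is hypothetical. It is not the Barton-Sternberg implementation, only one "remove and realign" pass.

```python
def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def realign(seq, fixed_rows):
    """Globally align an ungapped sequence to a fixed alignment."""
    cols = list(zip(*fixed_rows))
    gap_col = ("-",) * len(fixed_rows)
    n, m = len(seq), len(cols)

    def col_score(ch, col):
        return sum(pair_score(ch, c) for c in col)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + col_score(seq[i - 1], gap_col)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_score("-", cols[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]),
                F[i][j - 1] + col_score("-", cols[j - 1]),   # gap in seq
                F[i - 1][j] + col_score(seq[i - 1], gap_col),  # new column
            )

    # Trace back from the corner to recover the alignment.
    new_seq, new_cols = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1])):
            new_seq.append(seq[i - 1])
            new_cols.append(cols[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + col_score("-", cols[j - 1]):
            new_seq.append("-")
            new_cols.append(cols[j - 1])
            j -= 1
        else:
            new_seq.append(seq[i - 1])
            new_cols.append(gap_col)
            i -= 1
    new_seq.reverse()
    new_cols.reverse()
    rows = ["".join(r) for r in zip(*new_cols)]
    return "".join(new_seq), rows, F[n][m]


fixed = ["GAAGTTA", "GAACTGA", "GTACTGA"]  # x, z, w stay fixed
new_y, rows, score = realign("GACTTA", fixed)
print(new_y)  # G-ACTTA
```

Under this scoring the frozen placement GAC-TTA scores 3 against the three fixed rows, while G-ACTTA scores 9, so the realignment moves the gap as the slide shows.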

  27. Iterative Refinement
      • Example not handled well:
        x:  GAAGTTA
        y1: GAC-TTA
        y2: GAC-TTA
        y3: GAC-TTA
        z:  GAACTGA
        w:  GTACTGA
      Realigning any single $y_i$ changes nothing

  28. Other approaches
      • Statistical learning methods
        – Profile Hidden Markov Models
      • Consistency-based methods
        – Still rely on pairwise alignment
          • But consider a third sequence when aligning two sequences
          • If block A in sequence x aligns to block B in sequence y, and both align to block C in sequence z, we have higher confidence that the alignment between A and B is reliable
          • Essentially: change the scoring system according to consistency
          • Then apply DP as in other approaches
        – Pioneered by a tool called T-Coffee

  29. Multiple alignment tools
      • Clustal W (Thompson, 1994)
        – Most popular
      • T-Coffee (Notredame, 2000)
        – Another popular tool
        – Consistency-based
        – Slower than Clustal W, but generally more accurate for more distantly related sequences
      • MUSCLE (Edgar, 2004)
        – Iterative refinement
        – More efficient than most others
      • DIALIGN (Morgenstern, 1998, 1999, 2005)
        – “local”
      • Align-m (Walle, 2004)
        – “local”
      • PROBCONS (Do, 2004)
        – Probabilistic consistency-based
        – Best accuracy on benchmarks
      • ProDA (Phuong, 2006)
        – Allows repeated and shuffled regions

  30. In summary
      • Multiple alignment scoring functions
        – Sum of pairs
        – Other functions exist, but are less used
      • Multiple alignment algorithms:
        – MDP
          • Optimal
          • Too slow
          • Branch & Bound doesn’t solve the problem entirely
        – Heuristic:
          • Progressive alignment: Clustal W
          • Iterative refinement
          • Consistency-based
