Multiple Sequence Alignments



  1. Multiple Sequence Alignments

  2. Multiple alignment • Pairwise alignment – Infer biological relationships from string similarity • Multiple alignment – Infer string similarity from biological relationships

  3. Biological Motivations • One of the most essential tools in molecular biology – Finding highly conserved sub-regions or embedded patterns in a set of biological sequences – Production of a consensus sequence – Estimation of evolutionary distance between sequences – Prediction of protein secondary/tertiary structure • Local multiple alignment reveals conserved regions • Conserved regions are usually key functional regions and prime targets for drug development • Practically useful methods have existed only since D. Sankoff (1987), based on phylogenetics – Before 1987, multiple alignments were constructed by hand – The basic problem: exact dynamic programming is too expensive for more than a few sequences

  4. Alignment between globins (human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin, leghaemoglobin) produced by Clustal. Boxes mark the seven alpha helices composing each globin.

  5. Definition • Given strings $x_1, x_2, \dots, x_k$, a multiple (global) alignment is a matrix of k rows and A columns in which each row represents one sequence and each column contains one symbol or gap symbol from each sequence, with at least one non-gap symbol per column

  6. Multiple Sequence Alignment Matrix 3 rows 8 columns

  7. Family representations • Outcome of multiple alignment • Three kinds – Profile representation • Frequencies of symbols in each column • Weight vector • Alignment to a profile – Consensus sequence representation • Steiner string – Signature representation • PROSITE, BLOCKS databases • Regular expression
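
The profile representation above (frequencies of symbols in each column) can be illustrated with a minimal sketch. It assumes the alignment is given as equal-length strings with "-" as the gap symbol, and it counts the gap as an ordinary symbol; the function name `column_profile` is hypothetical and not taken from the slides.

```python
from collections import Counter


def column_profile(alignment):
    """Return one dict per column mapping symbol -> relative frequency.

    Assumption: rows are equal-length strings and '-' marks a gap,
    which is counted like any other symbol.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same length")
    profile = []
    for column in zip(*alignment):  # iterate over columns, not rows
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


if __name__ == "__main__":
    for freqs in column_profile(["ACG", "A-G", "TCG"]):
        print(freqs)
```

Whether gaps should contribute to the frequencies, and whether sequences should be weighted, are design choices that real profile methods handle differently.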

  8. Scoring Function • Ideally: – Find the alignment that maximizes the probability that the sequences evolved from a common ancestor (Figure: sequences x, y, z, w, v descending from an unknown ancestor "?")

  9. Multiple Sequence Alignment • Multiple sequence alignment can detect similarities that pairwise sequence alignment methods cannot. • Detection of family characteristics. Three questions: 1. Scoring 2. Computation of the multiple sequence alignment 3. Family representation.

  10. A fragment of multiple alignment of 7 kinases. ClustalW program from SRS server.

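  11. Scoring: SP (sum of pairs) • SP score of a column: the sum of the pairwise scores of all pairs of symbols in the column, with score(-,-) = 0 • Example with three rows: $SP_3(-,A,A) = \text{score}(-,A) + \text{score}(-,A) + \text{score}(A,A)$ • Total SP score = sum of the column scores over all columns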
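
A short sketch of column-wise SP scoring follows. The numeric values (match = 1, mismatch = -1, gap = -2) are assumptions chosen only for illustration; the slides fix only score(-,-) = 0. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols; a gap paired with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column, score=pair_score):
    """Sum the pairwise scores over all pairs of symbols in one column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


def sp_total(alignment, score=pair_score):
    """Total SP score: sum of the column scores over all columns."""
    return sum(sp_column(col, score) for col in zip(*alignment))


if __name__ == "__main__":
    # (-,A,A) = score(-,A) + score(-,A) + score(A,A) = -2 - 2 + 1
    print(sp_column("-AA"))
```

In practice `pair_score` would be replaced by a substitution matrix lookup plus a gap model.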

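  12. Induced pairwise alignment • The induced pairwise alignment $a(S_i, S_j)$ is the projection of a multiple alignment onto the two rows $S_i$ and $S_j$ • Total SP score $= \sum_{i<j} \text{score}[a(S_i, S_j)]$, with score(-,-) = 0 (Figure: the three induced alignments $a(S_1, S_2)$, $a(S_2, S_3)$, $a(S_1, S_3)$ of a three-row alignment)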

  13. Consensus
       -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
       TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
       CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
       CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
      • Find the optimal consensus string $m^*$ that maximizes $S(m^*) = \sum_i s(m^*, m_i)$, where $s(m_k, m_l)$ is the score of the pairwise alignment $(k, l)$

  14. Optimal solution • Multidimensional Dynamic Programming • Generalization of pairwise alignment • For simplicity, assume k sequences of length n • The dynamic programming array is a k-dimensional hyperlattice of side length n+1 (including initial gaps) • The entry $F(i_1, \dots, i_k)$ holds the score of the optimal alignment of the prefixes $s_1[1..i_1], \dots, s_k[1..i_k]$ • Initialize values on the faces of the hyperlattice

  15. Scoring Function: Sum Of Pairs • Definition: an induced pairwise alignment is the pairwise alignment obtained from a multiple alignment by keeping two rows and dropping the columns in which both rows have a gap • Example:
       x: AC-GCGG-C
       y: AC-GC-GAG
       z: GCCGC-GAG
      Induces:
       x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
       y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
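
The projection in this example can be reproduced with a few lines of code. This is a sketch assuming "-" is the gap symbol and rows are equal-length strings; `induced_pairwise` is a hypothetical helper name.

```python
def induced_pairwise(alignment, i, j):
    """Project a multiple alignment onto rows i and j.

    Columns in which both rows hold a gap carry no information
    about the pair, so they are dropped.
    """
    kept = [
        (a, b)
        for a, b in zip(alignment[i], alignment[j])
        if not (a == "-" and b == "-")
    ]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j


if __name__ == "__main__":
    msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]  # rows x, y, z
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        print(induced_pairwise(msa, i, j))
```

For rows x and y the third column is dropped, for y and z the sixth column is dropped, and for x and z no column is dropped.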

  16. Sum Of Pairs • The sum-of-pairs (SP) score of a multiple alignment A is the sum of the scores of all induced pairwise alignments: $S(A) = \sum_{i<j} S(A_{ij})$, where $A_{ij}$ is the induced alignment of $x_i$ and $x_j$

  17. Dynamic Programming Solution

  18. (Figure: recurrence for k = 3 sequences. Each cell has $2^k - 1 = 7$ predecessor cells; its score s is the maximum, over the seven predecessors, of the predecessor's score plus the score δ of the newly aligned column, in which each row contributes either its next symbol or a gap.)

  19. Multidimensional Dynamic Programming • Example: in 3D (three sequences) • 7 neighbors per cell • With $x_i$, $x_j$, $x_k$ denoting the symbols at positions i, j, k of the first, second, and third sequence:
      $$F(i,j,k) = \max \begin{cases}
      F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
      F(i-1,j-1,k) + S(x_i, x_j, -) \\
      F(i-1,j,k-1) + S(x_i, -, x_k) \\
      F(i-1,j,k) + S(x_i, -, -) \\
      F(i,j-1,k-1) + S(-, x_j, x_k) \\
      F(i,j-1,k) + S(-, x_j, -) \\
      F(i,j,k-1) + S(-, -, x_k)
      \end{cases}$$
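
The three-sequence recurrence can be sketched directly in code. The version below computes only the optimal score (no traceback), uses the SP score of each column as S, and assumes illustrative values match = 1, mismatch = -1, gap = -2 with score(-,-) = 0. All function names are hypothetical.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed pairwise scores; a gap paired with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(column):
    """SP score of one column of three symbols."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def align3_score(x, y, z):
    """Optimal SP score of a global alignment of three sequences."""
    seqs = (x, y, z)
    # The 7 moves: each bit says whether that sequence consumes a symbol.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    ranges = [range(len(s) + 1) for s in seqs]
    # product() yields cells in lexicographic order, so every
    # predecessor of a cell is filled in before the cell itself.
    for cell in product(*ranges):
        if cell == (0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # this move would step outside the lattice
            column = tuple(
                s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)
            )
            cand = F[prev] + column_score(column)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[(len(x), len(y), len(z))]


if __name__ == "__main__":
    print(align3_score("AC", "AC", "AC"))
```

The dictionary holds one entry per lattice cell, so memory grows with the product of the three sequence lengths plus one; this sketch is only practical for short sequences.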

  20. Complexity • Space complexity: $O(n^k)$ for k sequences, each of length n • Computing one cell: $O(2^k)$ × cost of computing δ • Time complexity: $O(2^k n^k)$ × cost of computing δ • Finding the optimal solution is exponential in k • Proven to be NP-complete for a number of cost functions

  21. Algorithms • Faster Dynamic Programming (SP) – Carrillo and Lipman 88 (MSA) – Pruning of hyperlattice in DP – Practical for about 6 sequences of length about 200. • Star alignment (SP) • Progressive methods – CLUSTALW – PILEUP • Iterative algorithms • Sampling (Gibbs) based methods • Hidden Markov Model (HMM) based methods • Expectation Maximization Algorithm

  22. Idea behind MSA algorithm • Find pairwise alignments • A trial multiple alignment is produced by a tree, cost = d • This provides a limit on the volume within which optimal alignments are found • Specifics – Sequences $x_1, \dots, x_r$ – Alignment A, cost = c(A) – Optimal alignment $A^*$ – $A_{ij}$ = alignment of $x_i$, $x_j$ induced by A – $D(x_i, x_j)$ = cost of the optimal pairwise alignment of $x_i$, $x_j$, so $D(x_i, x_j) \le c(A_{ij})$
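
The limit mentioned above can be written out as a short derivation. It is a sketch under two assumptions not spelled out on the slide: costs are minimized, and the cost of a multiple alignment is the sum of the costs of its induced pairwise alignments (SP cost).

```latex
\begin{align*}
c(A^*) &= \sum_{k<l} c(A^*_{kl}) \;\le\; d
  && \text{the optimum costs no more than the trial alignment} \\
c(A^*_{kl}) &\ge D(x_k, x_l)
  && \text{for every pair } (k,l) \\
\Rightarrow\quad c(A^*_{ij}) &\le d - \sum_{\substack{k<l \\ (k,l) \ne (i,j)}} D(x_k, x_l)
\end{align*}
```

Under these assumptions, only cells of the hyperlattice whose projection onto each pair (i, j) can lie on a pairwise alignment within this bound need to be explored.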

  23. Progressive Alignment • Multiple alignment is NP-complete • Most used heuristic: progressive alignment • Algorithm: – Align two of the sequences $x_i$, $x_j$ – Fix that alignment – Align a third sequence $x_k$ to the alignment of $x_i$, $x_j$ – Repeat until all sequences are aligned • Running time: $O(N L^2)$

  24. Star Alignments • Heuristic method for multiple sequence alignments • Select a sequence $x_c$ as the center of the star • For each sequence $x_i$ among $x_1, \dots, x_k$ with $i \ne c$, perform a Needleman-Wunsch global alignment of $x_i$ with $x_c$ • Aggregate the alignments with the principle “once a gap, always a gap.”

  25. Star Alignments Example • Sequences: $x_1$: MPE, $x_2$: MKE, $x_3$: MSKE, $x_4$: SKE (Figure: star with MKE at the center, aligned pairwise with MPE, MSKE, and SKE) • Merging the pairwise alignments one at a time gives:
       -MPE
       -MKE
       MSKE
       -SKE

  26. Choosing a center • Try them all and pick the one with the best score • Calculate all $O(k^2)$ pairwise alignments, and pick the sequence $x_c$ that minimizes $\sum_{i \ne c} D(x_c, x_i)$ • $D(x_c, x_i) = c(A_{ci})$, where A is the multiple alignment
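
A minimal sketch of the center selection follows. It assumes unit-cost edit distance as a stand-in for the pairwise alignment cost D; the slides do not specify the cost function, and the names `edit_distance` and `choose_center` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, used here as an assumed stand-in for D."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # match or substitute
                )
            )
        prev = cur
    return prev[-1]


def choose_center(seqs, dist=edit_distance):
    """Return (index of center, list of summed distances per candidate)."""
    totals = [
        sum(dist(c, s) for j, s in enumerate(seqs) if j != i)
        for i, c in enumerate(seqs)
    ]
    center = min(range(len(seqs)), key=totals.__getitem__)
    return center, totals


if __name__ == "__main__":
    print(choose_center(["MPE", "MKE", "MSKE", "SKE"]))
```

This computes every pairwise distance twice for simplicity; caching the symmetric distance matrix would halve the work without changing the result.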

  27. Analysis • Assuming all sequences have length n • $O(k^2 n^2)$ to calculate the center • Step i takes $O((i \cdot n) \cdot n)$ time, aligning two strings of length n and i·n • $O(k^2 n^2)$ overall cost • Produces multiple sequence alignments whose SP values are at most twice that of the optimal solution, provided the triangle inequality holds.

  28. Aligning to family representations • Profile – Apply dynamic programming – Score depends on the profile • Consensus string – Apply dynamic programming • Signature representations – Align to regular expressions / CFG/ …

  29. Progressive alignment (CLUSTALW) • CLUSTALW is the most popular multiple protein alignment program • Algorithm: 1. Find all $d_{ij}$: alignment dist($x_i$, $x_j$) 2. Construct a tree (neighbor-joining hierarchical clustering) 3. Align nodes in order of decreasing similarity • sequence to sequence • sequence to profile • profile to profile + a large number of heuristics

  30. (Figure panels: All Pairwise Alignments, Similarity Matrix, Cluster Analysis, Dendrogram with a distance axis, Multiple Alignment Step)
      Similarity matrix:
            S2  S3  S4
       S1    4   9   4
       S2        4   7
       S3            4
      Multiple alignment steps: 1. Aligning $S_1$ and $S_3$ 2. Aligning $S_2$ and $S_4$ 3. Aligning ($S_1$,$S_3$) with ($S_2$,$S_4$). From Higgins (1991) and Thompson (1994).

  31. Problems with Progressive Alignments • Depends on pairwise alignments • If sequences are very distantly related, errors are much more likely • Care must be taken in choosing scoring matrices and penalties

  32. Progressive Alignment: CLUSTALW • CLUSTALW: most popular multiple protein alignment program • Algorithm: • Find all $d_{ij}$: alignment dist($x_i$, $x_j$) • Construct a tree (neighbor-joining hierarchical clustering) • Align nodes in order of decreasing similarity + a large number of heuristics

  33. Iterative Refinement • One problem of progressive alignment: initial alignments are “frozen” even when new evidence comes • Example:
       x: GAAGTT
       y: GAC-TT   (frozen!)
       z: GAACTG
       w: GTACTG
      Once z and w are added, it is clear that the correct alignment of y is GA-CTT

  34. Iterative Refinement • Algorithm (Barton-Sternberg): 1. Align the most similar $x_i$, $x_j$ 2. Align the $x_k$ most similar to ($x_i$ $x_j$) 3. Repeat 2 until ($x_1 \dots x_N$) are aligned 4. For j = 1 to N, remove $x_j$ and realign it to $x_1 \dots x_{j-1} x_{j+1} \dots x_N$ 5. Repeat 4 until convergence • Note: Guaranteed to converge
