CSCI 490 Bioinformatics: Multiple Sequence Alignment
Multiple Sequence Alignment
- Motivation:
– A faint similarity between two sequences becomes very significant if present in many sequences
- Definition
– Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that
- All sequences have the same length L
- Score of the alignment is maximum
- Two issues
– How to score an alignment?
– How to find a (nearly) optimal alignment?
Scoring Function: Sum Of Pairs
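Definition: Induced pairwise alignment
The pairwise alignment of two sequences induced by the multiple alignment: keep their two rows and delete every column in which both rows have a gap.
Example:
Sequences:
    x: ACGCGGC
    y: ACGCGAG
    z: GCCGCGAG
The multiple alignment
    x: AC-GCGG-C
    y: AC-GC-GAG
    z: GCCGC-GAG
Induces:
    x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
    y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG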
Sum Of Pairs (cont’d)
- The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments:
$S(m) = \sum_{k<l} s(m^k, m^l)$
where $s(m^k, m^l)$ is the score of the induced alignment (k, l).
Example:
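    x: AC-GCGG-C
    y: AC-GC-GAG
    z: GCCGC-GAG
Scoring matrix (the (-,-) cell is blank in the matrix; the column totals below imply it scores 0):
         A    C    G    T    -
    A    1   -1   -1   -1   -1
    C   -1    1   -1   -1   -1
    G   -1   -1    1   -1   -1
    T   -1   -1   -1    1   -1
    -   -1   -1   -1   -1
Column scores:
    Column 1 (A, A, G): (A,A) + (A,G) x 2 = -1
    Column 2 (C, C, C): same pattern as (G,G) x 3 = 3
    Column 8 (-, A, A): (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5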
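The column-by-column total above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function names `pair_score` and `sp_score` are illustrative, and scoring a gap against a gap as 0 is an assumption inferred from the column totals.

```python
from itertools import combinations

def pair_score(a, b):
    """Match +1; mismatch or residue-vs-gap -1; gap-vs-gap 0 (assumed)."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def sp_score(rows):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    # Summing over columns and over all row pairs in each column is the
    # same as summing the scores of all induced pairwise alignments.
    return sum(
        pair_score(a, b)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )

alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sp_score(alignment))  # 5, the total computed on the slide
```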
Multiple Sequence Alignments Algorithms
- Can also be global or local
– We only talk about global for now
- A simple method
– Do pairwise alignment between all pairs
– Combine the pairwise alignments into a single multiple alignment
– Is this going to work?
Compatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG
Pairwise alignments:
    AAAATTTT----
    ----TTTTGGGG

    AAAATTTT----
    AAAA----GGGG

    ----TTTTGGGG
    AAAA----GGGG
Combined multiple alignment:
    AAAATTTT----
    ----TTTTGGGG
    AAAA----GGGG
Incompatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA
Pairwise alignments:
    AAAATTTT----
    ----TTTTGGGG

    ----AAAATTTT
    GGGGAAAA----

    TTTTGGGG----
    ----GGGGAAAA
Combined multiple alignment: ?
Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
- Find the longest (highest-scoring) path in a high-dimensional cube
– As opposed to a two-dimensional grid
- Uses an N-dimensional matrix
– As opposed to a two-dimensional array
- Entry F(i1, …, iN) represents the score of the optimal alignment of s1[1..i1], …, sN[1..iN]
$$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \bigl( F(\text{nbr}) + S(\text{current}) \bigr)$$
where nbr ranges over all neighbors of the cell and S(current) is the score of the alignment column added by that move.
- Example: in 3D (three sequences):
- 2^3 - 1 = 7 neighbors/cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
F(i-1,j-1,k) + S(x_i, y_j, -) \\
F(i-1,j,k-1) + S(x_i, -, z_k) \\
F(i,j-1,k-1) + S(-, y_j, z_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k) + S(-, y_j, -) \\
F(i,j,k-1) + S(-, -, z_k)
\end{cases}$$
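The three-sequence recurrence can be sketched directly in code. The following is an illustrative implementation, not the course's own: the column score S is assumed to be the sum-of-pairs score with match +1, mismatch or gap -1, and gap-vs-gap 0, and the names `column_score` and `msa3_score` are hypothetical. It returns only the optimal score, without traceback.

```python
from itertools import combinations, product

def pair_score(a, b):
    """Match +1; mismatch or residue-vs-gap -1; gap-vs-gap 0 (assumed)."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def column_score(col):
    """S(...) for one alignment column: sum of all pairwise scores."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def msa3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 moves: each sequence either consumes a letter (1)
    # or receives a gap (0), excluding the all-gap move.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # neighbor lies outside the cube
                    col = (
                        x[i - 1] if di else "-",
                        y[j - 1] if dj else "-",
                        z[k - 1] if dk else "-",
                    )
                    cand = F[pi][pj][pk] + column_score(col)
                    if cand > F[i][j][k]:
                        F[i][j][k] = cand
    return F[nx][ny][nz]

print(msa3_score("ACGCGGC", "ACGCGAG", "GCCGCGAG"))
```

Because the table has (|x|+1)(|y|+1)(|z|+1) cells, this is only practical for short sequences.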
Multidimensional Dynamic Programming (MDP)
[Figure: cube showing cell (i,j,k) and its seven neighbors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]
Multidimensional Dynamic Programming (MDP)
Running Time:
1. Size of matrix: L^N, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: 2^N - 1
Therefore: O(2^N L^N)
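To get a feel for how quickly this bound grows, the product of matrix size and neighbors per cell can be evaluated for a few values of N. This is a rough illustration only; the function name `mdp_work` and the sequence length 250 are chosen here for the example.

```python
def mdp_work(n_seqs, length):
    """Cells times neighbors per cell: L^N * (2^N - 1)."""
    cells = length ** n_seqs
    neighbors = 2 ** n_seqs - 1
    return cells * neighbors

for n in (2, 3, 6):
    print(f"N={n}: about {mdp_work(n, 250):.2e} cell updates")
```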
Faster MDP
- Carrillo & Lipman, 1988
– Branch and bound – Other heuristics
- Implemented in a tool called MSA
- Practical for about 6 sequences of length about 200-300
Faster MDP
- Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed
– Upper bound: the sum of optimal pairwise alignment scores, i.e. $S(m) = \sum_{k<l} s(m^k, m^l) \le \sum_{k<l} s(k, l)$, where m is the optimal multiple alignment, $s(m^k, m^l)$ is the score of the alignment between k and l induced by m, and $s(k, l)$ is the score of the optimal alignment between k and l
– Lower bound: the score computed by any approximate algorithm
– For any partial path, if Scurrent + Sprospective < lower bound, that path can be abandoned (Sprospective: the best score the rest of the path could still achieve)
– Guarantees optimality
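The upper bound is cheap to compute because it only needs pairwise alignments. The sketch below shows that bound alone, not the branch-and-bound search of the MSA tool; it assumes linear gap costs with match +1 and mismatch or gap -1, and the names `nw_score` and `sp_upper_bound` are illustrative.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global (Needleman-Wunsch) alignment score, linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def sp_upper_bound(seqs):
    """Sum of optimal pairwise scores: no multiple alignment can score higher."""
    return sum(nw_score(a, b) for a, b in combinations(seqs, 2))

seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
print(sp_upper_bound(seqs))
```

Each induced pairwise alignment scores at most as much as the optimal alignment of that pair, so the bound is at least 5 for these sequences, the score of the example alignment shown earlier.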
Progressive Alignment
- Multiple Alignment is NP-hard
- Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned
Running Time: O(N L^2)
Each alignment takes O(L^2); repeat N times
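The four steps can be sketched as follows. This is a simplified illustration under stated assumptions: sequences are added in input order rather than by similarity, a new letter is scored against an existing column by summing pairwise scores (match +1, mismatch or gap -1, gap-vs-gap 0), and the names `add_sequence` and `progressive` are hypothetical.

```python
def pair_score(a, b):
    """Match +1; mismatch or residue-vs-gap -1; gap-vs-gap 0 (assumed)."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def add_sequence(rows, seq):
    """Align seq to the fixed alignment rows; existing columns never change."""
    cols = ["".join(c) for c in zip(*rows)]
    gap_col = "-" * len(rows)

    def s(col, ch):
        # Score of placing ch underneath an existing column.
        return sum(pair_score(r, ch) for r in col)

    L, m = len(cols), len(seq)
    F = [[0] * (m + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        F[i][0] = F[i - 1][0] + s(cols[i - 1], "-")
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + s(gap_col, seq[j - 1])
    for i in range(1, L + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(cols[i - 1], seq[j - 1]),
                F[i - 1][j] + s(cols[i - 1], "-"),
                F[i][j - 1] + s(gap_col, seq[j - 1]),
            )
    # Trace back from the corner to recover the columns.
    out, new, i, j = [], [], L, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(cols[i - 1], seq[j - 1]):
            out.append(cols[i - 1]); new.append(seq[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + s(cols[i - 1], "-"):
            out.append(cols[i - 1]); new.append("-"); i -= 1
        else:
            out.append(gap_col); new.append(seq[j - 1]); j -= 1
    out.reverse(); new.reverse()
    return ["".join(c[r] for c in out) for r in range(len(rows))] + ["".join(new)]

def progressive(seqs):
    """Steps 1-4: start from the first sequence and add the rest one by one."""
    rows = [seqs[0]]
    for seq in seqs[1:]:
        rows = add_sequence(rows, seq)
    return rows

for row in progressive(["ACGCGGC", "ACGCGAG", "GCCGCGAG"]):
    print(row)
```

Each call to `add_sequence` fills one two-dimensional table, which is where the quadratic cost per step comes from.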
Progressive Alignment
- When evolutionary tree is known:
– Align closest first, in the order of the tree
Example: Order of alignments:
- 1. (x,y)
- 2. (z,w)
- 3. (xy, zw)
[Figure: evolutionary tree with leaves x, y, z, w]
Progressive Alignment: CLUSTALW
CLUSTALW: the most popular multiple protein alignment tool
Algorithm:
1. Find all dij: alignment distance between xi and xj
- High alignment score => short distance
2. Construct a tree (similar to hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
CLUSTALW example
- S1 ALSK
- S2 TNSD
- S3 NASK
- S4 NTSD
Distance matrix:
          s1   s2   s3   s4
    s1          9    4    7
    s2               8    3
    s3                    7
    s4
Tree: ((s1, s3), (s2, s4))
Align s1 with s3:
    s1: -ALSK
    s3: NA-SK
Align s2 with s4:
    s2: -TNSD
    s4: NT-SD
Align the two groups:
    s1: -ALSK
    s3: NA-SK
    s2: -TNSD
    s4: NT-SD
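The tree-building step for this example can be sketched with a small agglomerative clustering routine. The slides only say the tree is built "similar to hierarchical clustering", so the average-linkage rule used here is an assumption, and the name `merge_order` is illustrative; the distances are the ones in the example matrix.

```python
from itertools import combinations

def merge_order(names, dist):
    """Repeatedly merge the two closest clusters (average linkage, assumed)."""
    clusters = [(n,) for n in names]

    def linkage(a, b):
        # Average distance over all cross-cluster pairs.
        total = sum(dist[frozenset((p, q))] for p in a for q in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

dist = {
    frozenset(("s1", "s2")): 9, frozenset(("s1", "s3")): 4,
    frozenset(("s1", "s4")): 7, frozenset(("s2", "s3")): 8,
    frozenset(("s2", "s4")): 3, frozenset(("s3", "s4")): 7,
}
for a, b in merge_order(["s1", "s2", "s3", "s4"], dist):
    print(a, "+", b)
```

With these distances the sketch pairs s2 with s4 (distance 3) and s1 with s3 (distance 4) before joining the two groups, which gives the same tree as in the example.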
Iterative Refinement
Problems with progressive alignment:
- Depends on pairwise alignments
- If sequences are very distantly related, errors are much more likely
- Initial alignments are “frozen” even when new evidence comes
Example:
    x: GAAGTT
    y: GAC-TT   (frozen!)
    z: GAACTG
    w: GTACTG
Once z and w are added, it is clear that the correct y should be GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
- 1. Align the most similar pair xi, xj
- 2. Align the xk most similar to the alignment (xi xj)
- 3. Repeat 2 until (x1…xN) are aligned (steps 1-3 are ordinary progressive alignment)
- 4. For j = 1 to N: remove xj, and realign it to x1…xj-1 xj+1…xN
- 5. Repeat 4 until convergence
Iterative Refinement (cont’d)
For each sequence y
- 1. Remove y
- 2. Realign y (while the rest stay fixed)
[Figure: three sequences x, y, z; the projection onto x, z is fixed while y is allowed to vary]
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
    x: GAAGTTA
    y: GAC-TTA
    z: GAACTGA
    w: GTACTGA
After realigning y:
    x: GAAGTTA
    y: G-ACTTA   (+ 3 matches)
    z: GAACTGA
    w: GTACTGA
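The remove-and-realign loop can be sketched on this example. This is an illustrative version, not Barton and Sternberg's implementation: it assumes sum-of-pairs scoring (match +1, mismatch or gap -1, gap-vs-gap 0), accepts a realignment only when the total score strictly improves, and uses the hypothetical names `realign` and `refine`.

```python
from itertools import combinations

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0  # gap-vs-gap scores 0 (assumed)
    return 1 if a == b else -1

def sp_score(rows):
    return sum(pair_score(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def realign(rest, seq):
    """Best placement of ungapped seq against the fixed rows in rest."""
    cols = ["".join(c) for c in zip(*rest)]
    gap_col = "-" * len(rest)
    s = lambda col, ch: sum(pair_score(r, ch) for r in col)
    L, m = len(cols), len(seq)
    F = [[0] * (m + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        F[i][0] = F[i - 1][0] + s(cols[i - 1], "-")
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + s(gap_col, seq[j - 1])
    for i in range(1, L + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(cols[i - 1], seq[j - 1]),
                          F[i - 1][j] + s(cols[i - 1], "-"),
                          F[i][j - 1] + s(gap_col, seq[j - 1]))
    out, new, i, j = [], [], L, m
    while i > 0 or j > 0:  # traceback
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(cols[i - 1], seq[j - 1]):
            out.append(cols[i - 1]); new.append(seq[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + s(cols[i - 1], "-"):
            out.append(cols[i - 1]); new.append("-"); i -= 1
        else:
            out.append(gap_col); new.append(seq[j - 1]); j -= 1
    out.reverse(); new.reverse()
    return ["".join(c[r] for c in out) for r in range(len(rest))], "".join(new)

def refine(rows):
    """Remove each row in turn, realign it, keep the result if SP improves."""
    rows, improved = list(rows), True
    while improved:
        improved = False
        for j in range(len(rows)):
            rest = rows[:j] + rows[j + 1:]
            # Drop columns that became all-gap once row j was removed.
            keep = [c for c in zip(*rest) if any(ch != "-" for ch in c)]
            rest = ["".join(col[r] for col in keep) for r in range(len(rest))]
            new_rest, new_row = realign(rest, rows[j].replace("-", ""))
            candidate = new_rest[:j] + [new_row] + new_rest[j:]
            if sp_score(candidate) > sp_score(rows):
                rows, improved = candidate, True
    return rows

start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
refined = refine(start)
print(sp_score(start), "->", sp_score(refined))
for row in refined:
    print(row)
```

Because only strict improvements are accepted, the total score rises with every accepted step and the loop terminates.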
Iterative Refinement
- Example not handled well:
    x:  GAAGTTA
    y1: GAC-TTA
    y2: GAC-TTA
    y3: GAC-TTA
    z:  GAACTGA
    w:  GTACTGA
Realigning any single yi changes nothing
Other approaches
- Statistical learning methods
– Profile Hidden Markov Models
- Consistency-based methods
– Still rely on pairwise alignment
- But consider a third sequence when aligning two sequences
- If block A in seq x aligns to block B in seq y, and both align to block C in seq z, we have higher confidence that the alignment between A and B is reliable
- Essentially: change the scoring system according to consistency
- Then apply DP as in other approaches
– Pioneered by a tool called T-Coffee
Multiple alignment tools
- Clustal W (Thompson, 1994)
– Most popular
- T-Coffee (Notredame, 2000)
– Another popular tool
– Consistency-based
– Slower than ClustalW, but generally more accurate for more distantly related sequences
- MUSCLE (Edgar, 2004)
– Iterative refinement
– More efficient than most others
- DIALIGN (Morgenstern, 1998, 1999, 2005)
– “local”
- Align-m (Van Walle, 2004)
– “local”
- PROBCONS (Do, 2004)
– Probabilistic consistency-based
– Best accuracy on benchmarks
- ProDA (Phuong, 2006)
– Allow repeated and shuffled regions
In summary
- Multiple alignment scoring functions
– Sum of pairs
– Other functions exist, but are less used
- Multiple alignment algorithms:
– MDP
- Optimal
- Too slow
- Branch & Bound doesn’t solve the problem entirely
– Heuristic:
- Progressive alignment: ClustalW
- Iterative refinement
- Consistency-based