Multiple sequence alignment
BCB410 presentation by Nirvana Nursimulu Friday 25th November 2011
MSA: definition
In MSA, k (greater than 2) sequences are aligned at the same time.
Sequences can be of DNA, RNA, or protein.
Want to write each sequence along the
2 ~Multiple Sequence Alignment
Reveal biologically important sequence
Phylogenetic reconstruction.
Secondary structure prediction by
Can’t we just do a number of pairwise alignments?
Needleman-Wunsch algorithm: uses dynamic programming
Formulation of recursion for sequences
Time complexity is O(L^2) for a pair of sequences
If we perform multiple pairwise
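The pairwise recursion can be sketched in code. This is a minimal illustration of Needleman-Wunsch global alignment, not the formulation from the original slide: the scoring values (match = 1, mismatch = -1, linear gap = -2) and the function name are assumptions chosen for the example. The two nested loops over the sequence lengths are where the O(L^2) cost comes from.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y).

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] + gap,     # x_i against a gap
                          F[i][j - 1] + gap)     # y_j against a gap
    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` against `A-GT` with score 1 under these assumed scores.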
Does this actually work!?!? NO!
Source: BCH441H fall 2011 notes.
“Better” has fewer gaps + more matches
Could use dynamic programming to get
Takes O(L^k)
This takes exponential time…
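To get a feel for the exponential growth, the size of the dynamic programming table can be counted directly. The helper below is only an illustrative calculation (the function name is an assumption): a full DP over k sequences of length L fills a k-dimensional table with (L + 1)^k cells.

```python
def dp_cells(L, k):
    """Number of cells in a k-dimensional DP table for sequences of length L."""
    return (L + 1) ** k


# How the table grows for sequences of length 100:
for k in (2, 3, 4, 5):
    print(k, dp_cells(100, k))
```

With L = 100, two sequences need 10,201 cells, while three already need more than a million.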
ClustalW, T-Coffee, MAFFT, MUSCLE
Different strategies. One objective usually:
Progressive
Consistency-based
Probabilistic
Iterated
Clustal has been in use for the longest
3 stages:
Finally, align according to guide tree
(Higgins D.G., Sharp P.M.: figure 1)
UPGMA cluster analysis
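UPGMA builds the guide tree by repeatedly merging the two closest clusters and averaging their distances to every other cluster, weighted by cluster size. The sketch below is a minimal illustration of that idea, not Clustal's own code: the input format (a dict keyed by name pairs) and the nested-tuple tree output are assumptions made for the example, and branch lengths are not tracked.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree as nested tuples.

    names: list of sequence labels.
    dist:  dict mapping (a, b) label pairs to a distance (assumed symmetric).
    """
    sizes = {name: 1 for name in names}            # cluster -> number of leaves
    d = {frozenset(pair): v for pair, v in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

With distances A-B = 1, C-D = 2 and 8 for every other pair, the result is `(('A', 'B'), ('C', 'D'))`: the closest pairs are joined first, which is the order a progressive aligner would then follow.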
Errors early on persist
Performance deteriorates for
Notredame C., Higgins D.G., Heringa J.: figure 2(a)
Tree-based Consistency Objective Function for alignment Evaluation
Two attractive features:
Data from these sources provided via a library of pairwise alignments
Technique is similar to Clustal’s
But different and better
…not just those being aligned at that stage
Notredame C., Higgins D.G., Heringa J.: figure 2(a)
Creation of a primary library
Weight of each residue pair = average identity among matched residues
If a duplicated pair is found between the 2 libraries: merge into a single entry
Weight = sum of the 2 weights
Otherwise, new entry created.
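The library construction and merge rule can be sketched as follows. This is an illustration under assumptions, not T-Coffee's implementation: residues are identified by hypothetical `(sequence_name, position)` tuples, positions are counted from 0 at the start of each aligned string, and the weight of every pair from one alignment is that alignment's percent identity over its matched columns.

```python
def library_from_alignment(name_a, aln_a, name_b, aln_b):
    """Turn one gapped pairwise alignment into {residue pair: weight}."""
    pairs = []
    i = j = 0              # residue positions in the ungapped sequences
    matched = identical = 0
    for ca, cb in zip(aln_a, aln_b):
        if ca != "-" and cb != "-":
            pairs.append(((name_a, i), (name_b, j)))
            matched += 1
            identical += ca == cb
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    weight = 100.0 * identical / matched if matched else 0.0
    return {pair: weight for pair in pairs}


def merge_libraries(lib1, lib2):
    """Duplicated pairs get the sum of both weights; others become new entries."""
    merged = dict(lib1)
    for pair, weight in lib2.items():
        merged[pair] = merged.get(pair, 0.0) + weight
    return merged
```

A pair that both alignments agree on ends up with a larger weight than a pair seen in only one of them, which is the signal the later stages rely on.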
Notredame C., Higgins D.G., Heringa J.: figure 2(b)
Extended library: triplet approach
Check alignment of (a,b) with residues from remaining sequences
More intermediate sequences supporting the alignment: higher weight
N = num. sequences; L = average seq. length
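A rough sketch of the triplet extension, assuming the rule described in the T-Coffee paper: when residues a and b are both aligned to a residue c of a third sequence, the pair (a, b) gains the smaller of the two weights W(a, c) and W(c, b), added on top of its primary weight. The data layout (`(sequence_name, position)` tuples, a dict keyed by residue pairs) is an assumption for illustration.

```python
from collections import defaultdict


def extend_library(primary):
    """primary: {(res_a, res_b): weight}, residues as (seq_name, position).

    Returns {frozenset({res_a, res_b}): extended weight}.
    """
    # For every residue, record which residues it is paired with and how strongly.
    linked = defaultdict(dict)
    for (a, b), w in primary.items():
        linked[a][b] = w
        linked[b][a] = w

    extended = defaultdict(float)
    for (a, b), w in primary.items():
        extended[frozenset((a, b))] += w          # direct evidence
    for c, partners in linked.items():            # c is the intermediate residue
        residues = list(partners)
        for x in range(len(residues)):
            for y in range(x + 1, len(residues)):
                a, b = residues[x], residues[y]
                if a[0] != b[0]:                  # only pair different sequences
                    extended[frozenset((a, b))] += min(partners[a], partners[b])
    return dict(extended)
```

With primary weights A-B = 88, A-C = 77 and C-B = 100 for one column, the extended weight of A-B becomes 88 + min(77, 100) = 165.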
Notredame C., Higgins D.G., Heringa J.: figure 2(c)
Progressive alignment
…but use the weights in the extended library to align the residues
Notredame C., Higgins D.G., Heringa J.: figure 1
Takes info from local alignments in
More accurate
Multiple Alignment using Fast Fourier Transform
Amino acid residues are converted to
Intuition:
Note:
Find correlation (of volume and
FFT trick reduces the complexity of
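$$c_v(k) = \sum_{1 \le n \le N,\; 1 \le n+k \le M} v_1(n)\, v_2(n+k), \qquad c_p(k) = \sum_{1 \le n \le N,\; 1 \le n+k \le M} p_1(n)\, p_2(n+k)$$
(v: volume, p: polarity; N, M: lengths of the two sequences; k: lag)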
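The FFT trick can be illustrated on a single numeric track. The sketch below assumes each sequence has already been converted to a list of numbers (for instance one physico-chemical value per residue) and computes the correlation c(k) = sum over n of v1(n) * v2(n + k) for every lag k at once, using 0-based positions. The small recursive radix-2 FFT is written out only to keep the example self-contained; it is an illustration, not MAFFT's code.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalised)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def correlation_fft(v1, v2):
    """Return {lag k: c(k)} for all lags where the sequences overlap."""
    size = 1
    while size < len(v1) + len(v2):     # zero-pad so lags do not wrap onto each other
        size *= 2
    f1 = fft(list(v1) + [0.0] * (size - len(v1)))
    f2 = fft(list(v2) + [0.0] * (size - len(v2)))
    prod = [a.conjugate() * b for a, b in zip(f1, f2)]
    c = [z.real / size for z in fft(prod, inverse=True)]
    # Negative lags are stored at the end of the circular result.
    return {k: c[k % size] for k in range(-(len(v1) - 1), len(v2))}


def correlation_naive(v1, v2):
    """Direct O(N*M) evaluation of the same sum, for comparison."""
    return {k: sum(v1[n] * v2[n + k] for n in range(len(v1)) if 0 <= n + k < len(v2))
            for k in range(-(len(v1) - 1), len(v2))}
```

For `v1 = [1, 2, 3]` and `v2 = [0, 1, 2, 3]` the correlation peaks at lag k = 1 with value 14, which is the shift at which the two tracks line up.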
Katoh K., Misawa K., Kuma K., Miyata T.: fig 1(A)
Having performed FFT analysis, we
Therefore, perform sliding window
Katoh K., Misawa K., Kuma K., Miyata T.: fig 1(B)
Construct homology matrix, S:
Therefore, matrix is divided into sub-
Area for DP is reduced!
Katoh K., Misawa K., Kuma K., Miyata T.: fig 2(A),(B)
But we have only been talking of 2
Eventually, the MAFFT is only a
But it uses a two-cycle progressive
But Clustal had a problem:
Two ways of dealing with this:
Correct mistakes in initial alignment
Try to avoid mistakes in advance
Both work equally well.
O(N^2 L) + O(N L^2)
But when input sequences are highly
MUltiple Sequence Comparison by Log-Expectation
Even without refinement:
Uses:
Edgar R.C.: fig 2
3 main stages:
Draft progressive alignment
Improved progressive alignment
Iterative refinement
First two stages = MUSCLE-p
Profile alignment uses the log-expectation score
Stage 1: Draft progressive
Derived from fraction of kmers in common in compressed alphabet
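A k-mer distance of this kind can be sketched without aligning anything. The example below is illustrative only: the six-group compressed alphabet is a made-up grouping (MUSCLE's actual alphabet may differ), k = 3 is an arbitrary choice, and the distance is taken as 1 minus the fraction of shared k-mers, normalised by the number of k-mers in the shorter sequence.

```python
from collections import Counter

# Hypothetical compressed alphabet: amino acids with similar properties share a symbol.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Rewrite a protein sequence in the compressed alphabet ('X' for unknowns)."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence), on compressed sequences."""
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must have at least k residues")
    cx, cy = kmer_counts(compress(x), k), kmer_counts(compress(y), k)
    shared = sum((cx & cy).values())   # multiset intersection: min count per k-mer
    return 1.0 - shared / denom
```

Because of the compression, `MKVLAT` and `MRILAS` have distance 0.0 here even though three residues differ: each substitution stays within one group.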
Stage 2: Improved Progressive
Apply Kimura correction for multiple substitutions at a single site.
Optimize by computing alignments only for subtrees whose branching orders changed relative to TREE1.
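The Kimura correction can be written as a one-line formula. The sketch below assumes the commonly quoted form d = -ln(1 - D - D^2/5), where D is the fraction of differing positions in the pairwise alignment; the function name and the handling of saturated distances are assumptions for illustration.

```python
import math


def kimura_distance(fractional_identity):
    """Estimate evolutionary distance from fractional identity (0..1).

    Corrects for multiple substitutions at the same site, which make the
    observed difference D underestimate the true number of changes.
    """
    D = 1.0 - fractional_identity
    arg = 1.0 - D - D * D / 5.0
    if arg <= 0.0:
        return float("inf")      # too divergent for the formula to apply
    return -math.log(arg)
```

For 80% identity (D = 0.2) the corrected distance is about 0.233, slightly larger than the observed difference, and the gap widens as sequences diverge.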
Stage 3: Refinement
Choose an edge e (visit in order of decreasing distance from root)
Delete e to get two subtrees: T1, T2.
Compute profiles of T1 and T2.
Realign profiles to get a new MSA.
If score is better, keep new alignment.
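The control flow of this refinement loop can be sketched as a skeleton. Everything passed in is hypothetical: `split`, `realign` and `score` stand for the profile extraction, profile-profile alignment and alignment scoring steps, which are not implemented here, and `edges` is assumed to be already ordered by decreasing distance from the root.

```python
def refine(msa, edges, split, realign, score):
    """Keep-if-better refinement loop.

    msa:     current alignment (any object the callables understand)
    edges:   tree edges, assumed pre-sorted by decreasing distance from the root
    split:   hypothetical callable (msa, edge) -> (profile_1, profile_2)
    realign: hypothetical callable (profile_1, profile_2) -> new msa
    score:   hypothetical callable msa -> number, higher is better
    """
    best, best_score = msa, score(msa)
    for edge in edges:
        # Deleting the edge divides the sequences into two groups.
        profile_1, profile_2 = split(best, edge)
        candidate = realign(profile_1, profile_2)
        candidate_score = score(candidate)
        if candidate_score > best_score:   # otherwise the old alignment is kept
            best, best_score = candidate, candidate_score
    return best
```

Because a candidate is only accepted when it scores higher, the alignment score never decreases from one iteration to the next.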
MUSCLE-p (ie, first two stages)
Refinement
MUSCLE is comparable in speed with
BCH441 Fall 2011 lecture notes on MSA (Lecture 13).
Durbin R., Eddy S., Krogh A., Mitchison G.: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 2002.
Edgar R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. (2004) 32(5), 1792-1797.
Higgins D.G., Sharp P.M.: Clustal—a package for performing multiple sequence alignment
Katoh K., Misawa K., Kuma K., Miyata T.: MAFFT: A Novel Method for Rapid Multiple Sequence Alignment based on Fast Fourier Transform. Nucleic Acids Research. (2002) 30(14), 3059-3066.
Katoh K., Toh H.: Recent developments in the MAFFT multiple sequence alignment
Notredame C., Higgins D.G., Heringa J.: T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. J. Mol. Biol. (2000) 302, 205-217.
Contact info: