
SLIDE 1

Multiple Sequence Alignment

SLIDE 2

Alignment can be easy or difficult

Easy:

GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
******** ********** *****

Difficult due to insertions or deletions (indels):

TTGACATG CCGGGG---A AACCG
TTGACATG CCGGTG--GT AAGCC
TTGACATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
******** ?????????? *****

SLIDE 3

Homology: Definition

Homology: similarity that is the result of inheritance from a common ancestor. The identification and analysis of homologies is central to phylogenetic systematics.

An alignment is a hypothesis of positional homology between bases or amino acids.

SLIDE 4

Multiple Sequence Alignment- Goals

To generate a concise, information-rich summary of sequence data.

Sometimes used to illustrate the dissimilarity within a group of sequences.

Alignments can be treated as models that can be used to test hypotheses.

Does this model of events accurately reflect known biological evidence?

SLIDE 5

SLIDE 6

SLIDE 7

Alignment of 16S rRNA can be guided by secondary structure

                 <---------------(--------------------HELIX 19---------------------)
                 <---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            ** *** * ** ** * **

Alignment of 16S rRNA sequences from different bacteria

SLIDE 8

Protein Alignment may be guided by Tertiary Structure Interactions

Homo sapiens DjlA protein Escherichia coli DjlA protein

SLIDE 9

Multiple Sequence Alignment- Methods

3 main methods of alignment:

– Manual
– Automatic
– Combined

SLIDE 10

Manual Alignment - reasons

Might be carried out because:

– Alignment is easy.
– There is some extraneous information (structural).
– Automated alignment methods have encountered the local minimum problem.
– An automated alignment method can be “improved”.

SLIDE 11

Dynamic programming

2 methods:

Dynamic programming

– Consider 2 protein sequences of 100 amino acids in length.
– If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 seconds to align 4 sequences, etc.
– Aligning 20 sequences exhaustively would take more time than the universe has existed.

Progressive alignment
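The arithmetic behind the exhaustive-alignment estimate can be checked with a short sketch. It assumes, as the slide does, that the cost grows as length^n for n sequences of length 100; the function name and the 13.8-billion-year figure for the age of the universe are illustrative assumptions, not part of the original slides.

```python
# Rough age of the universe (about 13.8 billion years) in seconds: an assumption for scale.
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600


def exhaustive_seconds(n_sequences, length=100):
    """Seconds for exhaustive alignment, assuming cost grows as length**n_sequences."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    t = exhaustive_seconds(n)
    ratio = t / AGE_OF_UNIVERSE_S
    print(f"{n:>2} sequences: {t:.3g} s = {ratio:.3g} x age of the universe")
```

Under these assumptions, 20 sequences need 100^20 = 10^40 seconds, which is many orders of magnitude longer than the age of the universe.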

SLIDE 12

Progressive Alignment

Devised by Feng and Doolittle in 1987.
Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment.

Requires (n-1)+(n-2)+...+1 = n(n-1)/2 pairwise alignments as a starting point.

Most successful implementation is Clustal (Des Higgins).

SLIDE 13

Overview of ClustalW Procedure

Sequences:

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

Distance matrix:

Hbb_Human 1 -
Hbb_Horse 2 .17 -
Hba_Human 3 .59 .60 -
Hba_Horse 4 .59 .59 .13 -
Myg_Whale 5 .77 .77 .75 .75 -

Guide tree leaves: Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale (alpha-helices are marked in the figure)

CLUSTAL W steps:

– Quick pairwise alignment: calculate distance matrix
– Neighbor-joining tree (guide tree)
– Progressive alignment following guide tree

SLIDE 14

ClustalW- Pairwise Alignments

First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+1 = n(n-1)/2 possibilities.

Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments.

Generate a distance matrix.
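As a rough illustration of this step, the sketch below counts the pairwise alignments and turns each one into a distance. The distance used here (fraction of mismatched columns, ignoring columns with a gap) is an assumed stand-in, not necessarily the measure ClustalW uses, and the three short sequences and their pairwise alignments are hypothetical.

```python
def n_pairs(n):
    """Number of pairwise alignments for n sequences: (n-1)+(n-2)+...+1."""
    return n * (n - 1) // 2


def p_distance(a, b):
    """Fraction of differing columns in one pairwise alignment, skipping gapped columns."""
    columns = [(x, y) for x, y in zip(a, b) if "-" not in (x, y)]
    return sum(x != y for x, y in columns) / len(columns)


# Hypothetical, already-computed pairwise alignments of three toy sequences.
pairwise = {
    ("s1", "s2"): ("GATTC", "GAATC"),
    ("s1", "s3"): ("GA-TTC", "GAATTC"),
    ("s2", "s3"): ("GAAT-C", "GAATTC"),
}

# Distance matrix stored as a mapping from each pair of names to its distance.
matrix = {pair: p_distance(*alignment) for pair, alignment in pairwise.items()}
for pair, d in matrix.items():
    print(pair, round(d, 2))
```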

SLIDE 15

Path Graph for aligning two sequences.

SLIDE 16

Possible alignment

Scoring Scheme:

Match: +1
Mismatch: 0
Indel: -1

Score for this path = 2

SLIDE 17

Alignment using this path:

GATTC-
GAATTC

SLIDE 18

Optimal Alignment 1

Alignment score: 4

Alignment using this path:

GA-TTC
GAATTC

SLIDE 19

Optimal Alignment 2

Alignment score: 4

Alignment using this path:

G-ATTC
GAATTC
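The path-graph slides can be reproduced with a small dynamic-programming sketch (Needleman-Wunsch global alignment). It assumes match +1, mismatch 0 and indel -1, which are the values implied by the path scores of 2 and 4 quoted on the slides; the function names are illustrative.

```python
def score_alignment(a, b, match=1, mismatch=0, indel=-1):
    """Score one given alignment column by column."""
    total = 0
    for x, y in zip(a, b):
        if "-" in (x, y):
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def global_align(a, b, match=1, mismatch=0, indel=-1):
    """Needleman-Wunsch: return (best score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        s[i][0] = i * indel
    for j in range(1, cols):
        s[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub, s[i - 1][j] + indel, s[i][j - 1] + indel)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_a, out_b, i, j = [], [], len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return s[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(score_alignment("GATTC-", "GAATTC"))  # the non-optimal path from the slides
print(global_align("GATTC", "GAATTC"))
```

The traceback returns only one of the equally scoring optimal alignments; which one depends on the tie-breaking order chosen in the sketch.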

SLIDE 20

ClustalW- Guide Tree

Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances.

This guide tree gives the order in which the progressive alignment will be carried out.

SLIDE 21

Neighbor joining method

The neighbor joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined.

It is based on the minimum evolution principle.
One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree (for example, taxa A and B connected through Node 1).

SLIDE 22

What is required for the Neighbour joining method?

Distance matrix

PAM       Spinach  Rice   Mosquito  Monkey  Human
Spinach   0.0      84.9   105.6     90.8    86.3
Rice      84.9     0.0    117.8     122.4   122.6
Mosquito  105.6    117.8  0.0       84.7    80.8
Monkey    90.8     122.4  84.7      0.0     3.3
Human     86.3     122.6  80.8      3.3     0.0

SLIDE 23

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

Mon-Hum Monkey Human Spinach Mosquito Rice

First Step

SLIDE 24

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

Mon-Hum Monkey Human Spinach

Calculation of New Distances

SLIDE 25

PAM       Spinach  Rice   Mosquito  MonHum
Spinach   0.0      84.9   105.6     88.6
Rice      84.9     0.0    117.8     122.5
Mosquito  105.6    117.8  0.0       82.8
MonHum    88.6     122.5  82.8      0.0

Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum)

Next Cycle

SLIDE 26

PAM        Spinach  Rice   MosMonHum
Spinach    0.0      84.9   97.1
Rice       84.9     0.0    120.2
MosMonHum  97.1     120.2  0.0

Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum) Spin-Rice

Penultimate Cycle

SLIDE 27

PAM        SpinRice  MosMonHum
SpinRice   0.0       108.7
MosMonHum  108.7     0.0

Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum) Spin-Rice (Spin-Rice)-(Mos-(Mon-Hum))

Last Joining
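The joining cycles on these slides can be replayed with a short sketch: pick the smallest distance, join that pair, and replace their distances to every other node with the simple average. Note that this simple-average update is the simplified scheme shown on the slides; full neighbor joining additionally corrects each distance for the taxon's total distance to all other taxa. The function name and the nested-parentheses labels are illustrative choices.

```python
def merge_closest(dist):
    """Join the two closest clusters; average their distances to every other cluster."""
    names = list(dist)
    a, b = min(
        ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    merged = f"({a},{b})"
    rest = [n for n in names if n not in (a, b)]
    new = {n: {m: dist[n][m] for m in rest} for n in rest}
    new[merged] = {merged: 0.0}
    for n in rest:
        # Simple average of the two old distances, as on the slide.
        d = (dist[n][a] + dist[n][b]) / 2
        new[n][merged] = d
        new[merged][n] = d
    return new, merged


labels = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
dist = {name: dict(zip(labels, row)) for name, row in zip(labels, rows)}

order = []
while len(dist) > 1:
    dist, merged = merge_closest(dist)
    order.append(merged)
    print(merged)
```

The sketch keeps unrounded intermediate values, so its later distances can differ in the last decimal from the slide tables, which round to one decimal at each cycle.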

SLIDE 28

Human Monkey Mosquito Rice Spinach

Unrooted Neighbor-Joining Tree

SLIDE 29

Multiple Alignment- First pair

Align the two most closely-related sequences first.

This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.
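A minimal sketch of what "fixed" means in practice: a later gap is inserted as a whole column, at the same position in every sequence of the already-aligned group, so the sequences keep their relative alignment. The function name and the toy sequences are hypothetical.

```python
def insert_gap_column(aligned_group, position, length=1):
    """Insert the same gap at one column of every sequence in a fixed alignment."""
    return [seq[:position] + "-" * length + seq[position:] for seq in aligned_group]


pair = ["GATTC", "GAATC"]          # an existing, fixed pairwise alignment
print(insert_gap_column(pair, 2))  # both rows receive the gap at column 2
```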

SLIDE 30

ClustalW- Decision time

Next, consult the guide tree to see what alignment is performed next.

– Option 1: align a third sequence to the first two, or
– Option 2: align two entirely different sequences to each other.

SLIDE 31

ClustalW- Alternative 1

If a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the existing pair and the third sequence are each treated as a single sequence.

+

SLIDE 32

ClustalW- Alternative 2

If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

+

SLIDE 33

ClustalW- Progression

The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.

SLIDE 34

ClustalW-Good points/Bad points

Advantages:

– Speed.

Disadvantages:

– No objective function.
– No way of quantifying whether or not the alignment is good.
– No way of knowing if the alignment is ‘correct’.

SLIDE 35

ClustalW-Local Minimum

Potential problems:

– Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure.
– Arbitrary alignment.

SLIDE 36

Increasing the sophistication of the alignment process.

Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?

Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

SLIDE 37

SLIDE 38

ClustalW- Caveats

Sequence weighting
Varying substitution matrices
Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.

Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.

SLIDE 39

Sequence weighting

First we must be able to categorise sequences according to whether they have close relatives or are distantly related to the other sequences (calculated directly from the guide tree).

Weights are normalised, so that the largest weight is 1.

Closely-related sequences have a large amount of the same information, so they are downweighted.

These weights are multiplication factors.
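One way such weights could be derived from a guide tree is sketched below: each branch length on the path from the root to a sequence is shared equally among the sequences below that branch, and the totals are then scaled so the largest weight is 1. The sharing rule is stated here as an assumption about the weighting scheme, and the tree, branch lengths and names are hypothetical.

```python
def tree_weights(paths):
    """paths maps each sequence to [(branch_length, n_sequences_sharing_branch), ...]."""
    raw = {
        name: sum(length / shared for length, shared in branches)
        for name, branches in paths.items()
    }
    largest = max(raw.values())
    # Normalise so that the largest weight is 1.
    return {name: weight / largest for name, weight in raw.items()}


# Hypothetical guide tree: A and B are close relatives sharing a 0.1 branch; C is distant.
paths = {
    "A": [(0.1, 2), (0.05, 1)],
    "B": [(0.1, 2), (0.05, 1)],
    "C": [(0.4, 1)],
}
weights = tree_weights(paths)
print(weights)
```

In this toy tree the two close relatives end up with a smaller weight than the distant sequence, which is the downweighting described on the slide.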

SLIDE 40

ClustalW- User-supplied values

Two penalties are set by the user (there are default values, but you should know that it is possible to change these).

GOP- Gap Opening Penalty is the cost of opening a gap in an alignment.
GEP- Gap Extension Penalty is the cost of extending this gap.

SLIDE 41

ClustalW- Manipulation of penalties

Although GOP and GEP are set by the user, the program attempts to manipulate these according to the following criteria:

– Dependence on the weight matrix.
– Dependence on the similarity of the sequences: the percent identity of the sequences is used as a scaling factor to increase the GOP for closely-related sequences and decrease it for more distantly-related sequences.

SLIDE 42

ClustalW

Dependence on the length of the sequences:

– The program uses the formula

– GOP -> {GOP + log[min(N, M)]} * (average residue mismatch score) * (percent identity scaling factor)

– N and M are the lengths of the two sequences. The logarithm of the length of the shorter sequence is used as a scaling factor to increase the GOP with increasing length.

Dependence on the difference in lengths of the two sequences:

– GEP -> GEP * (1.0 + |log(N/M)|)
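The two adjustments can be written out as a small sketch. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function and parameter names, and the example values, are illustrative only.

```python
import math


def adjusted_gop(gop, n, m, avg_mismatch_score, identity_scale):
    """GOP -> {GOP + log[min(N, M)]} * (avg mismatch score) * (identity scaling factor)."""
    # Natural log is an assumption; the slide does not give the base.
    return (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scale


def adjusted_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N/M)|); unchanged when both lengths are equal."""
    return gep * (1.0 + abs(math.log(n / m)))


# Hypothetical starting penalties and sequence lengths.
print(adjusted_gop(10.0, 100, 200, avg_mismatch_score=1.0, identity_scale=1.0))
print(adjusted_gep(0.2, 100, 100), adjusted_gep(0.2, 100, 200))
```

The extension penalty grows with the length difference in either direction, which discourages long gaps in the shorter sequence.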

SLIDE 43

Position-Specific gap penalties

Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences.

The GOP is manipulated in a position-specific manner, so that it can vary over the sequences.

If there is a gap at a position, the GOP and GEP penalties are lowered and the other rules do not apply.

This makes gaps more likely at positions where gaps already exist.

SLIDE 44

Discouraging too many gaps

If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap.

This discourages gaps that are too close together.
At any position within a run of hydrophilic residues, the GOP is decreased.

These runs usually indicate loop regions in protein structures.
A run of 5 hydrophilic residues is considered to be a hydrophilic stretch.

The default hydrophilic residues are:

– D, E, G, K, N, Q, P, R, S
– But this can be changed by the user.
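The two position-specific rules can be sketched as lookups that mark where the GOP would be raised or lowered. The amount of the change is not given on the slides, so only the positions are computed. Treating "within 8 residues" as a column distance of at most 8 is an assumption, and the function names and test sequences are hypothetical.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide


def hydrophilic_positions(seq, run_length=5, residues=HYDROPHILIC):
    """Indices lying inside a run of at least run_length hydrophilic residues."""
    positions, start = set(), None
    for i, aa in enumerate(seq + "#"):  # '#' sentinel closes a run at the end
        if aa in residues:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                positions.update(range(start, i))
            start = None
    return positions


def near_existing_gap(aligned, window=8):
    """Non-gap columns at most `window` columns away from an existing gap."""
    gaps = [i for i, c in enumerate(aligned) if c == "-"]
    return {
        i for i, c in enumerate(aligned)
        if c != "-" and any(abs(i - g) <= window for g in gaps)
    }


print(sorted(hydrophilic_positions("AAKDEGSLLKDA")))        # GOP would be decreased here
print(sorted(near_existing_gap("ACDEFGHIKLMNP-QRS", 2)))    # GOP would be increased here
```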

SLIDE 45

Divergent Sequences

The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align.

It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned).

The user has the choice of setting a cutoff (default is 40% identity).

This will delay the alignment of sequences below the cutoff until the others have been aligned.
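A sketch of the cutoff idea, assuming divergence is judged by a sequence's average percent identity to all the others (the slide's wording); the exact rule used by the program may differ. The function name and the identity values are hypothetical.

```python
def delayed_sequences(identity, cutoff=40.0):
    """Names whose mean percent identity to the other sequences is below the cutoff."""
    return [
        name for name, others in identity.items()
        if sum(others.values()) / len(others) < cutoff
    ]


# Hypothetical pairwise percent identities between three sequences.
identity = {
    "A": {"B": 80.0, "C": 30.0},
    "B": {"A": 80.0, "C": 35.0},
    "C": {"A": 30.0, "B": 35.0},
}
print(delayed_sequences(identity))  # sequences to align last
```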

SLIDE 46

Advice on progressive alignment

Progressive alignment is a mathematical process that is completely independent of biological reality.

Can be a very good estimate.
Can be an impossibly poor estimate.
Requires user input and skill.
Treat cautiously.
Can be improved by eye (usually).
Often helps to have colour-coding.
Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not.

For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable.

SLIDE 47

Alignment of protein-coding DNA sequences

It is not very sensible to align the DNA sequences of protein-coding genes.

Unaligned:

ATGCTGTTAGGG
ATGCTCGTAGGG

Aligned as DNA:

ATGCT-GTTAGGG
ATGCTCGTA-GGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
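The last step, putting the gaps back into the DNA, can be sketched as follows: every residue in the aligned protein consumes one codon and every protein gap becomes three gap characters, so gaps never break the reading frame. The function name and the toy sequences are hypothetical, and the protein alignment is assumed to have been produced separately.

```python
def thread_gaps(aligned_protein, dna):
    """Place codon-sized gaps in a DNA sequence to follow its aligned protein."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(aa != "-" for aa in aligned_protein)
    if n_residues != len(codons) or len(dna) % 3 != 0:
        raise ValueError("DNA length does not match the number of residues")
    out, k = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")      # one protein gap = one codon-sized DNA gap
        else:
            out.append(codons[k])  # next codon belongs to this residue
            k += 1
    return "".join(out)


# M L - G aligned against a 3-codon gene (ATG CTG GGG).
print(thread_gaps("ML-G", "ATGCTGGGG"))
```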

SLIDE 48

Manual Alignment- software

GDE- The Genetic Data Environment (UNIX)

CINEMA- Java applet available from:

– http://www.biochem.ucl.ac.uk

Seqapp/Seqpup- Mac/PC/UNIX available from:

– http://iubio.bio.indiana.edu

SeAl for Macintosh, available from:

– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC, available from:

– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html