VI - 2004 Page 1
Pairwise sequence alignments
Volker Flegel Vassilios Ioannidis
Outline
- Introduction
- Definitions
- Biological context of pairwise alignments
- Computing of pairwise alignments
- Some programs
THIO_EMENI  GFVVVDCFATWCGPCKAIAPTVEKFAQTY
            G ++VD +A WCGPCK IAP +++ A  Y
???         GAILVDFWAEWCGPCKMIAPILDEIADEY
[Figure: the SwissProt annotation of THIO_EMENI is extrapolated to the unknown sequence (???)]
- Same Sequence
- Same 3D Fold
- Same Origin
- Same Function
                  970       980       990      1000      1010      1020
SLIT_DROME FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
           ..:.:  :.  :.: ...:.: ..  : :.. : ::.. . :.: ::..:. :.  :. :
NOTC_DROME YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
                 740       750        760       770       780       790
[Figure: dot plot of tissue-type plasminogen activator vs. urokinase-type plasminogen activator]
URL: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Matches
True positives True negatives False positives False negatives
(for example, through a pairwise comparison with another globin). Globins
True positives, true negatives, false positives, false negatives
- Greater sensitivity, less selectivity
- Less sensitivity, greater selectivity
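The trade-off between the two measures can be made concrete with a short Python sketch. The slides do not give formulas, so the definition of selectivity used here (true positives over all reported matches) is an assumption, and the counts are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of the true family members that the search reports."""
    return tp / (tp + fn)


def selectivity(tp, fp):
    """Fraction of reported matches that are true members (assumed definition)."""
    return tp / (tp + fp)


# Hypothetical counts for the same search at two score thresholds.
permissive = {"tp": 95, "fp": 40, "fn": 5}
strict = {"tp": 70, "fp": 2, "fn": 30}

for name, c in (("permissive", permissive), ("strict", strict)):
    sens = sensitivity(c["tp"], c["fn"])
    sel = selectivity(c["tp"], c["fp"])
    print(f"{name}: sensitivity={sens:.2f} selectivity={sel:.2f}")
```

With these made-up counts, lowering the threshold finds more true members but also admits more false positives, which is the trade-off named on the slide.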
- Tolerant to errors (mismatches, insertions / deletions or indels)
- Evaluation of the alignment in a biological context (significance)
 CGATGCAGACGTCA
       ||||||||
CGATGCAAGACGTCA
more than 10^600 gapped alignments
(Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
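To get a feeling for this order of magnitude, here is a minimal Python sketch. It uses one common counting convention, the binomial coefficient C(n+m, n) for the ways to interleave two sequences of lengths n and m. Both the convention and the sequence length of 1000 are assumptions, since the slide states neither.

```python
import math


def count_alignments(n, m):
    """Ways to interleave two sequences of lengths n and m: C(n + m, n)."""
    return math.comb(n + m, n)  # math.comb needs Python 3.8 or later


n = 1000  # hypothetical sequence length, not stated on the slide
count = count_alignments(n, n)
print(f"about 10^{len(str(count)) - 1} alignments")
```

Under these assumptions the count has 601 digits, which is consistent with the figure quoted above and shows why exhaustive enumeration is not an option.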
CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA

ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG
From:
(Leu, Ile): 2 (Leu, Cys): -6 ...
more often than expected by chance during evolution
evolution
log-odds ratio
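The log-odds idea can be sketched in a few lines of Python: a substitution seen more often than expected by chance gets a positive score, one seen less often gets a negative score. The frequencies below are toy values chosen for illustration, not data from any published matrix, and the choice of base 2 is an assumption (real matrices also scale and round the values).

```python
import math


def log_odds_score(q_ab, p_a, p_b, base=2.0):
    """Log of observed pair frequency over the frequency expected by chance."""
    return math.log(q_ab / (p_a * p_b), base)


# Toy frequencies (hypothetical): both residues have background frequency 0.1,
# so a pair is expected by chance with frequency 0.1 * 0.1 = 0.01.
print(log_odds_score(0.02, 0.1, 0.1))   # observed twice as often: positive
print(log_odds_score(0.005, 0.1, 0.1))  # observed half as often: negative
```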
Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. One PAM unit is the amount of evolution that changes, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which the score for each amino acid substitution has been calculated from the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.
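In Dayhoff's approach, mutation probabilities for a larger distance x are obtained by multiplying the 1-PAM probability matrix by itself x times. The sketch below illustrates this on a hypothetical two-letter alphabet with toy probabilities; it is not Dayhoff's data and it stops before the conversion of probabilities into scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(pam1, x):
    """Mutation probabilities after x PAM units: pam1 multiplied x times."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, pam1)
    return result


# Toy 1-PAM matrix (hypothetical): each letter changes with probability 1%.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

pam250 = extrapolate(TOY_PAM1, 250)
print(f"unchanged after 250 PAM: {pam250[0][0]:.3f}")
```

Even in this toy model, 250 PAM does not mean that 250% of the positions differ: repeated and reverse changes at the same position mean the observed difference levels off.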
Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.
Score = 1 = 9 + 6 + + 2
GCATGCATGCAACTGCAT
|||||||||
GCATGCATGGGCAACTGCAT

GCATGCATG--CAACTGCAT
|||||||||  |||||||||
GCATGCATGGGCAACTGCAT
(e.g. poorly conserved loops between well-conserved helices)
CGATGCAGCAGCAGCATCG
||||||      |||||||
CGATGC------AGCATCG

CGATGCAGCAGCAGCATCG
|| || |||| ||  || |
CG-TG-AGCA-CA--AT-G

gap opening (some programs include the first extension into this penalty)
gap extension
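The effect of separate opening and extension penalties on the two gapped alignments above can be checked with a minimal Python sketch. The score values are hypothetical, and in this sketch the opening penalty is charged for the first gap position (that is, it includes the first extension).

```python
def affine_score(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped strings with an affine gap penalty.

    Simplification: a gap run in one sequence directly followed by a gap run
    in the other is counted as a single run.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


top = "CGATGCAGCAGCAGCATCG"
one_long_gap = "CGATGC------AGCATCG"
many_short_gaps = "CG-TG-AGCA-CA--AT-G"

print(affine_score(top, one_long_gap))     # 13 matches, 1 gap run
print(affine_score(top, many_short_gaps))  # 13 matches, 5 gap runs
```

Both alignments have 13 matches and 6 gap positions, so a single per-position gap penalty cannot tell them apart; the affine scheme favours the single long gap.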
short alignment.
We need a normalised score to compare alignments!
We need to evaluate the biological meaning of the score (p-value, e-value).
[Figure: random sequences (Ala, Val, ..., Trp) -> pairwise alignments (score x, score y, ...) -> score distribution]
score y: our alignment is very improbable to
sequences
score
score x: our alignment has a great probability
random sequence similarity
Threshold significant alignment
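The idea of comparing an observed score with the scores of random sequences can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a simple Smith-Waterman local score with hypothetical match, mismatch and gap values, random sequences produced by shuffling one of the inputs, and an empirical p-value instead of a fitted distribution.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] - gap, curr[j - 1] - gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0):
    """Estimate how often a shuffled b scores at least as well as the real b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys the order
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    # The +1 terms keep the estimate above zero for a finite sample.
    return observed, (hits + 1) / (n_shuffles + 1)


score, p = shuffle_p_value("CGAGGCACAACGTCA", "CGATGCAAGACGTCA")
print(score, p)
```

A small p-value means the observed score lies in the tail of the random score distribution, that is, above a plausible significance threshold.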
THEFA-TCAT
||||| ||||
THEFASTCAT
THE / THE   Score: 23
THE / HEF   Score: -5
CAT / THE   Score: -4
HEF / THE   Score: -5
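The window comparisons above are the building blocks of a dot plot. The slides do not name the substitution matrix behind the scores, so the following Python sketch simply counts identical residues per window; the window size and threshold are hypothetical parameters.

```python
def dot_plot(a, b, window=3, min_matches=3):
    """Return the (i, j) window start positions with enough identical residues."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: rows follow a, columns follow b."""
    lines = ["  " + b]
    for i, residue in enumerate(a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


a, b = "THEFATCAT", "THEFASTCAT"
print(render(a, b, dot_plot(a, b)))
```

For these two words the dots fall on two diagonal segments offset by one column, which is how the single inserted residue shows up in the plot.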
Seq B  A-CA-CA
       | || |
Seq A  ACCAAC-

Seq B  ACA--CA
       |
Seq A  A-CCAAC
most visible) diagonals.
[Figure: tissue-type plasminogen activator vs. urokinase-type plasminogen activator]
Global alignment:
[Figure: tissue-type plasminogen activator vs. urokinase-type plasminogen activator]
Local alignments:
How to optimally extend an optimal alignment
Seq A  a_1 a_2 a_3 ... a_{i-1} a_i
Seq B  b_1 b_2 b_3 ... b_{j-1} b_j

Extend with the aligned pair (a_{i+1}, b_{j+1}):
Score = Score_{i,j} + Subst_{i+1,j+1}

Extend with a gap (for example a_{i+1} aligned to no residue of Seq B):
Score = Score_{i,j} - gap
Match score:
2
Mismatch score:
Gap penalty:
been completely filled out.
20
0 - 2 0 - 2 2 + 2
20
F(i,j) = max { F(i-1,j-1) + s(x_i,y_j),  F(i-1,j) - d,  F(i,j-1) - d }

F(i,j): score at position i, j
s(x_i,y_j): match or mismatch score (or substitution matrix value) for residues x_i and y_j
d: gap penalty (positive value)
GA-TTA
|| ||
GAATTC
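The matrix filling and traceback can be sketched in Python as a basic global alignment (Needleman-Wunsch style) with a linear gap penalty d. The mismatch score of -1 and gap penalty of 2 are assumptions chosen for illustration; they are not taken from the slides.

```python
def global_align(x, y, match=2, mismatch=-1, d=2):
    """Fill the score matrix F, then trace back one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d  # leading gaps in y
    for j in range(1, m + 1):
        F[0][j] = -j * d  # leading gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:  # came from the diagonal
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:  # gap in y
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:  # gap in x
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = global_align("GATTA", "GAATTC")
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score (here the gap can sit before either A of GAATTC), so the traceback returns one of them depending on how ties are broken.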
www.ch.embnet.org/software/LALIGN_form.html
www.ch.embnet.org/software/PRSS_form.html
Do not blindly trust your alignment to be the only truth. Especially gapped regions may be quite variable.
statistically significant.
profiles)