[PPT] - Bioinformatics Sequence comparison 1 global pairwise alignment PowerPoint Presentation

SLIDE 1

Bioinformatics

David Gilbert Bioinformatics Research Centre

www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow

Sequence comparison 1

global pairwise alignment

SLIDE 2

(c) David Gilbert 2008 Sequence Comparison (1) 2

Lecture contents

Evolutionary relationships, sequence comparison and alignment
How to compare and align sequences using scoring schemes
Naïve approach to finding optimal score and alignment
Dynamic programming as an efficient method for finding optimal

scores & alignments

Variations on dynamic programming

– Gap penalties – Substitution matrices

SLIDE 3

(c) David Gilbert 2008 Sequence Comparison (1) 3

Why compare sequences?

Assume a genome has been sequenced
We can find out where “putative” genes are by gene-

finding (see http://www.ensembl.org/ )

→What do such a gene do?
We can make the protein that it encodes

– but we can’t easily find out its biological function

So, we can try to find other sequences which are

similar to it, for which we know the function...

SLIDE 4

(c) David Gilbert 2008 Sequence Comparison (1) 4

So, what is this sequence similar to?

acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc taataaaaaa catttatttt cattgc Search using BLAST http://www.ncbi.nlm.nih.gov/BLAST/

r

http://www.ebi.ac.uk/blastall/ MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAH GKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQ AAYQKVVAGVANALAHKYH

Amino-acid (protein sequence) cDNA (nucleotide sequence)

Where does the coding start on

this sequence?

SLIDE 5

(c) David Gilbert 2008 Sequence Comparison (1) 5

Evolution - basic concepts

Mutation in DNA a natural evolutionary process
DNA replication errors: (nucleotide)

– substitutions – insertions – deletions

Similarity between sequences

– clue to common evolutionary origin, or – clue to common function

This is a simplistic story: in fact the altered function of the expressed protein will determine

if the organism will survive to reproduce, and hence pass on [transmit] the altered gene

}indels

SLIDE 6

(c) David Gilbert 2008 Sequence Comparison (1) 6

Human genetic variations (Single Nucleotide Polymorphisms)

SNP’s - “genetic indivuality”
~ 1/1000 bases variable (2 humans)
Make us more/less susceptible to diseases
May influence the effect of drug treatments

TTT TAC GGC ATC Phe Tyr Asn Met TTT TAC GTC ATC Phe Tyr Ser Met

Associated with high cholersterol

SLIDE 7

(c) David Gilbert 2008 Sequence Comparison (1) 7

ACCESSION J02799

/translation="MESKVVVPAQGKKITLQNGKLNVPENPIIPYIEGDGIGVDVTPA MLKVVDAAVEKAYKGERKISWMEIYTGEKSTQVYGQDVWLPAETLDLIREYRVAIKGP LTTPVGGGIRSLNVALRQELDLYICLRPVRYYQGTPSPVKHPELTDMVIFRENSEDIY AGIEWKADSADAEKVIKFLREEMGVKKIRFPEHCGIGIKPCSEEGTKRLVRAAIEYAI ANDRDSVTLVHKGNIMKFTEGAFKDWGYQLAREEFGGELIDGGPWLKVKNPNTGKEIV IKDVIADAFLQQILLRPAEYDVIACMNLNGDYISDALAAQVGGIGIAPGANIGDECAL FEATHGTAPKYAGQDKVNPGSIILSAEMMLRHMGWTEAADLIVKGMEGAINAKTVTYD FERLMDGAKLLKCSEFGDAIIENM" ORIGIN MluI site; 25.3 min on K12 map. 1 cgcgtggcgt ggttttcagg tttacgcctg gtagaacgtt gcgagctgaa tcgcttaacc 61 tggtgatttc taaaagaagt tttttgcatg gtattttcag agattatgaa ttgccgcatt 121 atagcctaat aacgcgcatc tttcatgacg gcaaacaata gggtagtatt gacaagccaa 181 ttacaaatca ttaacaaaaa attgctctaa agcatccgta tcgcaggacg caaacgcata 241 tgcaacgtgg tggcagacga gcaaaccagt agcgctcgaa ggagaggtga atggaaagta 301 aagtagttgt tccggcacaa ggcaagaaga tcaccctgca aaacggcaaa ctcaacgttc 361 ctgaaaatcc gattatccct tacattgaag gtgatggaat cggtgtagat gtaaccccag 421 ccatgctgaa agtggtcgac gctgcagtcg agaaagccta taaaggcgag cgtaaaatct 481 cctggatgga aatttacacc ggtgaaaaat ccacacaggt ttatggtcag gacgtctggc 541 tgcctgctga aactcttgat ctgattcgtg aatatcgcgt tgccattaaa ggtccgctga 601 ccactccggt tggtggcggt attcgctctc tgaacgttgc cctgcgccag gaactggatc 661 tctacatctg cctgcgtccg gtacgttact atcagggcac tccaagcccg gttaaacacc 721 ctgaactgac cgatatggtt atcttccgtg aaaactcgga agacatttat gcgggtatcg 781 aatggaaagc agactctgcc gacgccgaga aagtgattaa attcctgcgt gaagagatgg 841 gggtgaagaa aattcgcttc ccggaacatt gtggtatcgg tattaagccg tgttcggaag 901 aaggcaccaa acgtctggtt cgtgcagcga tcgaatacgc aattgctaac gatcgtgact 961 ctgtgactct ggtgcacaaa ggcaacatca tgaagttcac cgaaggagcg tttaaagact 1021 ggggctacca gctggcgcgt gaagagtttg gcggtgaact gatcgacggt ggcccgtggc 1081 tgaaagttaa aaacccgaac actggcaaag agatcgtcat taaagacgtg attgctgatg 1141 cattcctgca acagatcctg ctgcgtccgg ctgaatatga tgttatcgcc tgtatgaacc 1201 tgaacggtga ctacatttct gacgccctgg cagcgcaggt tggcggtatc ggtatcgccc 1261 ctggtgcaaa catcggtgac gaatgcgccc tgtttgaagc cacccacggt actgcgccga 1321 aatatgccgg tcaggacaaa gtaaatcctg gctctattat tctctccgct gagatgatgc 1381 tgcgccacat gggttggacc gaagcggctg acttaattgt taaaggtatg gaaggcgcaa 1441 tcaacgcgaa aaccgtaacc tatgacttcg agcgtctgat ggatggcgct aaactgctga 1501 aatgttcaga gtttggtgac gcgatcatcg aaaacatgta atgccgtagt ttgttaaatt 1561 tattaacg //

SLIDE 8

(c) David Gilbert 2008 Sequence Comparison (1) 8

Mutations, frameshifts

at[t,c,a]at[t,c,a]ga[a,g]aa[t,c]atg taa (regex) I I E N M Ter atc atc gaa aac atg taa Compute the translation of 1. atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg 2. tcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg 3. catcgaaaacatgtaatgccgtagtttgttaaatttattaacg

SLIDE 9

(c) David Gilbert 2008 Sequence Comparison (1) 9

Frameshift mutations

1. atc atc gaa aac atg taa tgc cgt agt ttg tta aat tta tta acg 2. tca tcg aaa aca tgt aat gcc gta gtt tgt taa att tat taa cg 3. cat cga aaa cat gta atg ccg tag ttt gtt aaa ttt att aac g

SLIDE 10

(c) David Gilbert 2008 Sequence Comparison (1) 10

Evolution, DNA->Amino-acids

Triplet code, hence difference between DNA base

Substitution: (hence 1 amino-acid changes)
Insertion / Deletion: “frame shift” (all subsequent amino-

acids change)

– NB, Indels can be in multiples of 3, and hence...

Also

“Silent mutation” - DNA changes but amino-acid doesn’t

change - why?

“Nonsense mutation” - a single DNA base substitution

resulting in a stop codon.

SLIDE 11

(c) David Gilbert 2008 Sequence Comparison (1) 11

Some evolutionary relationships revealed by comparing α- haemoglobins

axolotl giant panda lesser panda moose goshawk vulture duck alligator

SLIDE 12

(c) David Gilbert 2008 Sequence Comparison (1) 12

ggcatt agcatt agcata agcatg agccta aggatt gacatt

Evolution - example

SLIDE 13

(c) David Gilbert 2008 Sequence Comparison (1) 13

Evolution - related sequences

ggcatt agcatt agcata agcatg agccta aggatt gacatt

c→g g→a

Highlight the other mutations!

What are the mutations in the following:- ggcatt agccta

Q: How many changes between 2 sequences?

“living examples”

{

“ancestral sequences”

SLIDE 14

(c) David Gilbert 2008 Sequence Comparison (1) 14

Other evolutionary issues

Convergent evolution: same sequence evolved from different ancestors
back evolution - mutate to a previous sequence

ggcatt agcatt agcata agcatg agccta aggatt gacatt aggatc aggata aggatc ggcatt

SLIDE 15

(c) David Gilbert 2008 Sequence Comparison (1) 15

Evolutionary Relationships

Evolutionary relationships between sequences

– Two sequences evolved from same ancestor – sequences are homologous h: GLVST V→I S→ GLIST GLVT →V L →I q: GLISVT d: GIVT

SLIDE 16

(c) David Gilbert 2008 Sequence Comparison (1) 16

Evolutionary Relationships

‘True’ Alignment

‘True’ evolutionary history (& h) unknown

Alignment can be interpreted as

– Two substitutions & 2 insertions OR 2 deletions OR 1 deletion & 1 insertion

Unable to obtain evolutionary history even with ‘true’ alignment

T

V

I G d: T V S I L G q: T S V L G h:

2 sub 1 ins 1 del

SLIDE 17

(c) David Gilbert 2008 Sequence Comparison (1) 17

T V I G d: T V S I L G q: T V I G h: T V S I G d: T V S I L G q: T V S I G h:

1 del, 1 ins 2 del

SLIDE 18

(c) David Gilbert 2008 Sequence Comparison (1) 18

Evolutionary Relationships

Require model to reconstruct evolutionary history
Minimise number of mutations

q: GLISVT; I↔L; V ↔I; ←S→; ←V→; d: GIVT

A number of histories can be obtained e.g.

q: GLISVT; L →I; I →V; S→; V→; d: GIVT

With ‘true’ alignment obtaining evolutionary history ill-

posed – many possible histories.

SLIDE 19

(c) David Gilbert 2008 Sequence Comparison (1) 19

Evolutionary Relationships

What if ‘true’ alignment not known?
Again require a model to construct alignment
Minimise the number of mutations
One alignment would be

GLISVT G_I_VT

Showing two insertions/deletions (indels)

SLIDE 20

(c) David Gilbert 2008 Sequence Comparison (1) 20

Evolutionary Relationships

This alignment would produce a number of possible

histories e.g. GLIVT →S L → GLISVT GIVT

GLIVT not equal to ‘true’ h GLVST

SLIDE 21

(c) David Gilbert 2008 Sequence Comparison (1) 21

Evolutionary Relationships

Unable to recover evolutionary history from constructed

alignment alone

Alignments only meaningful for sequences with common

ancestor – i.e. homologous sequences

Therefore being able to construct a ‘good’ alignment

between two sequences can indicate homology

Sequence alignment employed to identify homologous

sequences – DNA, RNA, Protein

Homologous sequences may have similar functional roles

SLIDE 22

(c) David Gilbert 2008 Sequence Comparison (1) 22

Scoring Alignments

Objective measure of ‘good’ alignment required
Require score for alignment which provides likelihood that

sequences are related

Alignment of sequences q & d, denote i’th symbol by qi &

di

Under assumption of homology sequence symbols qi & di

derived independently from common (unknown) ancestor

Joint probability P(qi ,di |H)
f co-occurrence

in alignment

SLIDE 23

(c) David Gilbert 2008 Sequence Comparison (1) 23

Scoring Alignments

Under assumption of NO homology, sequence symbols qi & di have no dependency
n each other
Joint probability P(qi ,di |N) = P(qi ) P(di)
Probability of whole alignment

– P(q ,d |H) = ∏i P(qi ,di |H) assuming homology – P(q ,d |N) = ∏i P(qi) P(di) assuming no homology

Odds Ratio Score ∏i P(qi ,di |H) / P(qi) P(di)
Log-Odds Score for alignment; S = ∑is(qi ,di ) where

s(qi ,di ) = log[P(qi ,di |H)] – log[P(qi) P(di) ]

Overall score is sum of scores for each pair in alignment

SLIDE 24

(c) David Gilbert 2008 Sequence Comparison (1) 24

Scoring Alignments

Log-Odds Score S = ∑is(qi ,di ) can be assigned to any

alignment

Require score, s(qi ,di ), for every possible symbol in

dictionary (e.g. DNA – [A,C,T,G] including gaps) - subsequent lecture

Now require to find alignments which maximise the

likelihood of sequences being homologous

How can high scoring alignments be identified?
Naïve method – score all possible alignments – select

highest scoring alignment

SLIDE 25

(c) David Gilbert 2008 Sequence Comparison (1) 25

Optimal Alignments

Typical protein sequence 300 symbols long ~ 10179 different alignments to consider. (3 times as large as the number of electrons in the Earth)

SLIDE 26

(c) David Gilbert 2008 Sequence Comparison (1) 26

Edit distance

Levenshtein 1966
Minimum number of edit operations to transform 1

string into another

– insert, delete, (substitute) (1 symbol)

Distance is zero (identical) or positive
E.g “AIMS” & “AMOS”

(distance=2 for each solution)

AIM-S A-MOS AMOS AMOS AIMS AIMS AIMS AMOS

⇒

SLIDE 27

(c) David Gilbert 2008 Sequence Comparison (1) 27

A Better Way

Note alignment score is additive i.e. S = ∑is(qi ,di )
Overall maximum score is summation of all maximum

scores for sub-alignments

q & d sequences of length M & N
qi & dj are i’th & j’th symbols of q & d
q1…i & d1…j is sequence of first i & j symbols
s(a, b) pre-defined score between symbols a & b
Si,j highest score in aligning q1…i & d1…j

SLIDE 28

(c) David Gilbert 2008 Sequence Comparison (1) 28

A Better Way

Alignment for q1…i & d1…j can only end with three possible
ptions
First case gap required for d1…j and q1…i-1 padded with qi so

Si,j = Si-1,j + s(qi, -)

Second case gap required for q1…i and d1…j-1 padded with di

so Si,j = Si,j-1 + s(-, dj)

Third case both q1…i-1 and d1…j-1 padded with qi & di so Si,j

= Si-1,j-1 + s(qi, dj) dj dj

qi
qi

SLIDE 29

(c) David Gilbert 2008 Sequence Comparison (1) 29

A Better Way

Maximum score for Si,j is then maximum of three

possibilities listed Si-1,j-1 + s(qi, dj) Si,j-1 + s(-, dj) Si,j = max Si-1,j + s(qi, -)

{

SLIDE 30

(c) David Gilbert 2008 Sequence Comparison (1) 30

A Better Way

Each Si,j defined by previous Si-1,j-1 Si-1,j Si,j-1
General recursion will repeatedly solve same sub-problems

multiple times

Memorise which sub-problems have been solved
Can be organised in tabular form
Allows both optimal score and associated alignments to be

recovered

An example of Bellman’s Dynamic Programming
How does this compare to Naïve approach?

SLIDE 31

(c) David Gilbert 2008 Sequence Comparison (1) 31

A Better Way

Naïve approach requires three loops

– For all 0 ≤ j ≤ N

For all subsequences of length j of q

– For all subsequences of length j of d » Form & Score Alignment » Save Maximum Score – End

End

– End

Of the order of 22N calls to scoring routine
DP approach requires of the order of N2

N

N N

2

2 2

SLIDE 32

(c) David Gilbert 2008 Sequence Comparison (1) 32

Example

WATER to WINE
Symbol mis-match -5; gap insertion -1; match 5

E N Si,j Si,j-1 I Si-1,j Si-1,j-1 W R E T A W i / j

SLIDE 33

(c) David Gilbert 2008 Sequence Comparison (1) 33

Example

Gap insertions before WATER & WINE
4

E

3

N

2

I

1

W

5
4
3
2
1

R E T A W i / j

SLIDE 34

(c) David Gilbert 2008 Sequence Comparison (1) 34

Example

Build up entries employing
4

E

3

N

2

I 5

1

W

5
4
3
2
1

R E T A W i / j

SLIDE 35

(c) David Gilbert 2008 Sequence Comparison (1) 35

Example

Build up entries employing
4

E

3

N

2

I 4 5

1

W

5
4
3
2
1

R E T A W i / j

SLIDE 36

(c) David Gilbert 2008 Sequence Comparison (1) 36

Example

Traceback through the matrix

horizontal edge = insert gap in vertical (column) sequence
vertical edge = insert gap in horzontal (row) sequence
diagonal edge = match or mis-match (substitution)

SLIDE 37

(c) David Gilbert 2008 Sequence Comparison (1) 37

Exercise

Find all (other) maximally scoring alignments!

SLIDE 38

(c) David Gilbert 2008 Sequence Comparison (1) 38

Exercise

Try aligning AIMS and AMOS using

Symbol mis-match -3
gap insertion -1
match 5

SLIDE 39

(c) David Gilbert 2008 Sequence Comparison (1) 39

Exercise

For the sequences q = CDAA and d = AEECA, find highest scoring alignments using the (mis)match & gap scores below. Scoring Matrix Gap score -2 2 E

2

2 D 1 C

1
2
2

2 A E D C A

SLIDE 40

(c) David Gilbert 2008 Sequence Comparison (1) 40

Percentage sequence identity

number of identical residues x 100 = ________________________________ number of residues in smallest sequence Can differ if have gaps/no_gaps: compute for these sequences:

TGCAT-A-

SLIDE 41

(c) David Gilbert 2008 Sequence Comparison (1) 41

β and α globin, without gaps

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA β KEFTPPVQAAYQKVVAGVANALAHKYH α VHASLDKFLASVSTVLTSKYR Compute the identity%

SLIDE 42

(c) David Gilbert 2008 Sequence Comparison (1) 42

β and α globin, with gaps

CLUSTAL W (1.81) multiple sequence alignment

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP β KEFTPPVQAAYQKVVAGVANALAHKYH α AEFTPAVHASLDKFLASVSTVLTSKYR Compute the identity%

SLIDE 43

(c) David Gilbert 2008 Sequence Comparison (1) 43

Human beta globin hits coyote!

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN. Length = 146 Score = 276 bits (698), Expect = 2e-74 Identities = 131/146 (89%), Positives = 137/146 (93%) Query:2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61 VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV Sbjct: 1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60 Query: 62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121 KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK Sbjct: 61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120 Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147 EFTP VQAAYQKVVAGVANALAHKYH Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Blast output

Compute the identity% What happened?