Bioinformatics Sequence comparison 1 global pairwise alignment - - PowerPoint PPT Presentation

bioinformatics sequence comparison 1
SMART_READER_LITE
LIVE PREVIEW

Bioinformatics Sequence comparison 1 global pairwise alignment - - PowerPoint PPT Presentation

Bioinformatics Sequence comparison 1 global pairwise alignment David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Lecture contents Evolutionary relationships,


slide-1
SLIDE 1

Bioinformatics

David Gilbert Bioinformatics Research Centre

www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow

Sequence comparison 1

global pairwise alignment

slide-2
SLIDE 2

(c) David Gilbert 2008 Sequence Comparison (1) 2

Lecture contents

  • Evolutionary relationships, sequence comparison and alignment
  • How to compare and align sequences using scoring schemes
  • Naïve approach to finding optimal score and alignment
  • Dynamic programming as an efficient method for finding optimal

scores & alignments

  • Variations on dynamic programming

– Gap penalties – Substitution matrices

slide-3
SLIDE 3

(c) David Gilbert 2008 Sequence Comparison (1) 3

Why compare sequences?

  • Assume a genome has been sequenced
  • We can find out where “putative” genes are by gene-

finding (see http://www.ensembl.org/ )

  • →What do such a gene do?
  • We can make the protein that it encodes

– but we can’t easily find out its biological function

  • So, we can try to find other sequences which are

similar to it, for which we know the function...

slide-4
SLIDE 4

(c) David Gilbert 2008 Sequence Comparison (1) 4

So, what is this sequence similar to?

acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc taataaaaaa catttatttt cattgc Search using BLAST http://www.ncbi.nlm.nih.gov/BLAST/

  • r

http://www.ebi.ac.uk/blastall/ MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAH GKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQ AAYQKVVAGVANALAHKYH

Amino-acid (protein sequence) cDNA (nucleotide sequence)

Where does the coding start on

this sequence?

slide-5
SLIDE 5

(c) David Gilbert 2008 Sequence Comparison (1) 5

Evolution - basic concepts

  • Mutation in DNA a natural evolutionary process
  • DNA replication errors: (nucleotide)

– substitutions – insertions – deletions

  • Similarity between sequences

– clue to common evolutionary origin, or – clue to common function

  • This is a simplistic story: in fact the altered function of the expressed protein will determine

if the organism will survive to reproduce, and hence pass on [transmit] the altered gene

}indels

slide-6
SLIDE 6

(c) David Gilbert 2008 Sequence Comparison (1) 6

Human genetic variations (Single Nucleotide Polymorphisms)

  • SNP’s - “genetic indivuality”
  • ~ 1/1000 bases variable (2 humans)
  • Make us more/less susceptible to diseases
  • May influence the effect of drug treatments

TTT TAC GGC ATC Phe Tyr Asn Met TTT TAC GTC ATC Phe Tyr Ser Met

Associated with high cholersterol

slide-7
SLIDE 7

(c) David Gilbert 2008 Sequence Comparison (1) 7

ACCESSION J02799

/translation="MESKVVVPAQGKKITLQNGKLNVPENPIIPYIEGDGIGVDVTPA MLKVVDAAVEKAYKGERKISWMEIYTGEKSTQVYGQDVWLPAETLDLIREYRVAIKGP LTTPVGGGIRSLNVALRQELDLYICLRPVRYYQGTPSPVKHPELTDMVIFRENSEDIY AGIEWKADSADAEKVIKFLREEMGVKKIRFPEHCGIGIKPCSEEGTKRLVRAAIEYAI ANDRDSVTLVHKGNIMKFTEGAFKDWGYQLAREEFGGELIDGGPWLKVKNPNTGKEIV IKDVIADAFLQQILLRPAEYDVIACMNLNGDYISDALAAQVGGIGIAPGANIGDECAL FEATHGTAPKYAGQDKVNPGSIILSAEMMLRHMGWTEAADLIVKGMEGAINAKTVTYD FERLMDGAKLLKCSEFGDAIIENM" ORIGIN MluI site; 25.3 min on K12 map. 1 cgcgtggcgt ggttttcagg tttacgcctg gtagaacgtt gcgagctgaa tcgcttaacc 61 tggtgatttc taaaagaagt tttttgcatg gtattttcag agattatgaa ttgccgcatt 121 atagcctaat aacgcgcatc tttcatgacg gcaaacaata gggtagtatt gacaagccaa 181 ttacaaatca ttaacaaaaa attgctctaa agcatccgta tcgcaggacg caaacgcata 241 tgcaacgtgg tggcagacga gcaaaccagt agcgctcgaa ggagaggtga atggaaagta 301 aagtagttgt tccggcacaa ggcaagaaga tcaccctgca aaacggcaaa ctcaacgttc 361 ctgaaaatcc gattatccct tacattgaag gtgatggaat cggtgtagat gtaaccccag 421 ccatgctgaa agtggtcgac gctgcagtcg agaaagccta taaaggcgag cgtaaaatct 481 cctggatgga aatttacacc ggtgaaaaat ccacacaggt ttatggtcag gacgtctggc 541 tgcctgctga aactcttgat ctgattcgtg aatatcgcgt tgccattaaa ggtccgctga 601 ccactccggt tggtggcggt attcgctctc tgaacgttgc cctgcgccag gaactggatc 661 tctacatctg cctgcgtccg gtacgttact atcagggcac tccaagcccg gttaaacacc 721 ctgaactgac cgatatggtt atcttccgtg aaaactcgga agacatttat gcgggtatcg 781 aatggaaagc agactctgcc gacgccgaga aagtgattaa attcctgcgt gaagagatgg 841 gggtgaagaa aattcgcttc ccggaacatt gtggtatcgg tattaagccg tgttcggaag 901 aaggcaccaa acgtctggtt cgtgcagcga tcgaatacgc aattgctaac gatcgtgact 961 ctgtgactct ggtgcacaaa ggcaacatca tgaagttcac cgaaggagcg tttaaagact 1021 ggggctacca gctggcgcgt gaagagtttg gcggtgaact gatcgacggt ggcccgtggc 1081 tgaaagttaa aaacccgaac actggcaaag agatcgtcat taaagacgtg attgctgatg 1141 cattcctgca acagatcctg ctgcgtccgg ctgaatatga tgttatcgcc tgtatgaacc 1201 tgaacggtga ctacatttct gacgccctgg cagcgcaggt tggcggtatc ggtatcgccc 1261 ctggtgcaaa catcggtgac gaatgcgccc tgtttgaagc cacccacggt actgcgccga 1321 aatatgccgg tcaggacaaa gtaaatcctg gctctattat tctctccgct gagatgatgc 1381 tgcgccacat gggttggacc gaagcggctg acttaattgt taaaggtatg gaaggcgcaa 1441 tcaacgcgaa aaccgtaacc tatgacttcg agcgtctgat ggatggcgct aaactgctga 1501 aatgttcaga gtttggtgac gcgatcatcg aaaacatgta atgccgtagt ttgttaaatt 1561 tattaacg //

slide-8
SLIDE 8

(c) David Gilbert 2008 Sequence Comparison (1) 8

Mutations, frameshifts

at[t,c,a]at[t,c,a]ga[a,g]aa[t,c]atg taa (regex) I I E N M Ter atc atc gaa aac atg taa Compute the translation of 1. atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg 2. tcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg 3. catcgaaaacatgtaatgccgtagtttgttaaatttattaacg

slide-9
SLIDE 9

(c) David Gilbert 2008 Sequence Comparison (1) 9

Frameshift mutations

1. atc atc gaa aac atg taa tgc cgt agt ttg tta aat tta tta acg 2. tca tcg aaa aca tgt aat gcc gta gtt tgt taa att tat taa cg 3. cat cga aaa cat gta atg ccg tag ttt gtt aaa ttt att aac g

slide-10
SLIDE 10

(c) David Gilbert 2008 Sequence Comparison (1) 10

Evolution, DNA->Amino-acids

Triplet code, hence difference between DNA base

  • Substitution: (hence 1 amino-acid changes)
  • Insertion / Deletion: “frame shift” (all subsequent amino-

acids change)

– NB, Indels can be in multiples of 3, and hence...

Also

  • “Silent mutation” - DNA changes but amino-acid doesn’t

change - why?

  • “Nonsense mutation” - a single DNA base substitution

resulting in a stop codon.

slide-11
SLIDE 11

(c) David Gilbert 2008 Sequence Comparison (1) 11

Some evolutionary relationships revealed by comparing α- haemoglobins

axolotl giant panda lesser panda moose goshawk vulture duck alligator

slide-12
SLIDE 12

(c) David Gilbert 2008 Sequence Comparison (1) 12

ggcatt agcatt agcata agcatg agccta aggatt gacatt

Evolution - example

slide-13
SLIDE 13

(c) David Gilbert 2008 Sequence Comparison (1) 13

Evolution - related sequences

ggcatt agcatt agcata agcatg agccta aggatt gacatt

c→g g→a

Highlight the other mutations!

What are the mutations in the following:- ggcatt agccta

Q: How many changes between 2 sequences?

“living examples”

{

“ancestral sequences”

slide-14
SLIDE 14

(c) David Gilbert 2008 Sequence Comparison (1) 14

Other evolutionary issues

  • Convergent evolution: same sequence evolved from different ancestors
  • back evolution - mutate to a previous sequence

ggcatt agcatt agcata agcatg agccta aggatt gacatt aggatc aggata aggatc ggcatt

slide-15
SLIDE 15

(c) David Gilbert 2008 Sequence Comparison (1) 15

Evolutionary Relationships

  • Evolutionary relationships between sequences

– Two sequences evolved from same ancestor – sequences are homologous h: GLVST V→I S→ GLIST GLVT →V L →I q: GLISVT d: GIVT

slide-16
SLIDE 16

(c) David Gilbert 2008 Sequence Comparison (1) 16

Evolutionary Relationships

  • ‘True’ Alignment

‘True’ evolutionary history (& h) unknown

  • Alignment can be interpreted as

– Two substitutions & 2 insertions OR 2 deletions OR 1 deletion & 1 insertion

  • Unable to obtain evolutionary history even with ‘true’ alignment

T

  • V

I G d: T V S I L G q: T S V L G h:

2 sub 1 ins 1 del

slide-17
SLIDE 17

(c) David Gilbert 2008 Sequence Comparison (1) 17

T V I G d: T V S I L G q: T V I G h: T V S I G d: T V S I L G q: T V S I G h:

1 del, 1 ins 2 del

slide-18
SLIDE 18

(c) David Gilbert 2008 Sequence Comparison (1) 18

Evolutionary Relationships

  • Require model to reconstruct evolutionary history
  • Minimise number of mutations

q: GLISVT; I↔L; V ↔I; ←S→; ←V→; d: GIVT

  • A number of histories can be obtained e.g.

q: GLISVT; L →I; I →V; S→; V→; d: GIVT

  • With ‘true’ alignment obtaining evolutionary history ill-

posed – many possible histories.

slide-19
SLIDE 19

(c) David Gilbert 2008 Sequence Comparison (1) 19

Evolutionary Relationships

  • What if ‘true’ alignment not known?
  • Again require a model to construct alignment
  • Minimise the number of mutations
  • One alignment would be

GLISVT G_I_VT

  • Showing two insertions/deletions (indels)
slide-20
SLIDE 20

(c) David Gilbert 2008 Sequence Comparison (1) 20

Evolutionary Relationships

  • This alignment would produce a number of possible

histories e.g. GLIVT →S L → GLISVT GIVT

  • GLIVT not equal to ‘true’ h GLVST
slide-21
SLIDE 21

(c) David Gilbert 2008 Sequence Comparison (1) 21

Evolutionary Relationships

  • Unable to recover evolutionary history from constructed

alignment alone

  • Alignments only meaningful for sequences with common

ancestor – i.e. homologous sequences

  • Therefore being able to construct a ‘good’ alignment

between two sequences can indicate homology

  • Sequence alignment employed to identify homologous

sequences – DNA, RNA, Protein

  • Homologous sequences may have similar functional roles
slide-22
SLIDE 22

(c) David Gilbert 2008 Sequence Comparison (1) 22

Scoring Alignments

  • Objective measure of ‘good’ alignment required
  • Require score for alignment which provides likelihood that

sequences are related

  • Alignment of sequences q & d, denote i’th symbol by qi &

di

  • Under assumption of homology sequence symbols qi & di

derived independently from common (unknown) ancestor

  • Joint probability P(qi ,di |H)
  • f co-occurrence

in alignment

slide-23
SLIDE 23

(c) David Gilbert 2008 Sequence Comparison (1) 23

Scoring Alignments

  • Under assumption of NO homology, sequence symbols qi & di have no dependency
  • n each other
  • Joint probability P(qi ,di |N) = P(qi ) P(di)
  • Probability of whole alignment

– P(q ,d |H) = ∏i P(qi ,di |H) assuming homology – P(q ,d |N) = ∏i P(qi) P(di) assuming no homology

  • Odds Ratio Score ∏i P(qi ,di |H) / P(qi) P(di)
  • Log-Odds Score for alignment; S = ∑is(qi ,di ) where

s(qi ,di ) = log[P(qi ,di |H)] – log[P(qi) P(di) ]

  • Overall score is sum of scores for each pair in alignment
slide-24
SLIDE 24

(c) David Gilbert 2008 Sequence Comparison (1) 24

Scoring Alignments

  • Log-Odds Score S = ∑is(qi ,di ) can be assigned to any

alignment

  • Require score, s(qi ,di ), for every possible symbol in

dictionary (e.g. DNA – [A,C,T,G] including gaps) - subsequent lecture

  • Now require to find alignments which maximise the

likelihood of sequences being homologous

  • How can high scoring alignments be identified?
  • Naïve method – score all possible alignments – select

highest scoring alignment

slide-25
SLIDE 25

(c) David Gilbert 2008 Sequence Comparison (1) 25

Optimal Alignments

Typical protein sequence 300 symbols long ~ 10179 different alignments to consider. (3 times as large as the number of electrons in the Earth)

slide-26
SLIDE 26

(c) David Gilbert 2008 Sequence Comparison (1) 26

Edit distance

  • Levenshtein 1966
  • Minimum number of edit operations to transform 1

string into another

– insert, delete, (substitute) (1 symbol)

  • Distance is zero (identical) or positive
  • E.g “AIMS” & “AMOS”

(distance=2 for each solution)

AIM-S A-MOS AMOS AMOS AIMS AIMS AIMS AMOS

slide-27
SLIDE 27

(c) David Gilbert 2008 Sequence Comparison (1) 27

A Better Way

  • Note alignment score is additive i.e. S = ∑is(qi ,di )
  • Overall maximum score is summation of all maximum

scores for sub-alignments

  • q & d sequences of length M & N
  • qi & dj are i’th & j’th symbols of q & d
  • q1…i & d1…j is sequence of first i & j symbols
  • s(a, b) pre-defined score between symbols a & b
  • Si,j highest score in aligning q1…i & d1…j
slide-28
SLIDE 28

(c) David Gilbert 2008 Sequence Comparison (1) 28

A Better Way

  • Alignment for q1…i & d1…j can only end with three possible
  • ptions
  • First case gap required for d1…j and q1…i-1 padded with qi so

Si,j = Si-1,j + s(qi, -)

  • Second case gap required for q1…i and d1…j-1 padded with di

so Si,j = Si,j-1 + s(-, dj)

  • Third case both q1…i-1 and d1…j-1 padded with qi & di so Si,j

= Si-1,j-1 + s(qi, dj) dj dj

  • qi
  • qi
slide-29
SLIDE 29

(c) David Gilbert 2008 Sequence Comparison (1) 29

A Better Way

  • Maximum score for Si,j is then maximum of three

possibilities listed Si-1,j-1 + s(qi, dj) Si,j-1 + s(-, dj) Si,j = max Si-1,j + s(qi, -)

{

slide-30
SLIDE 30

(c) David Gilbert 2008 Sequence Comparison (1) 30

A Better Way

  • Each Si,j defined by previous Si-1,j-1 Si-1,j Si,j-1
  • General recursion will repeatedly solve same sub-problems

multiple times

  • Memorise which sub-problems have been solved
  • Can be organised in tabular form
  • Allows both optimal score and associated alignments to be

recovered

  • An example of Bellman’s Dynamic Programming
  • How does this compare to Naïve approach?
slide-31
SLIDE 31

(c) David Gilbert 2008 Sequence Comparison (1) 31

A Better Way

  • Naïve approach requires three loops

– For all 0 ≤ j ≤ N

  • For all subsequences of length j of q

– For all subsequences of length j of d » Form & Score Alignment » Save Maximum Score – End

  • End

– End

  • Of the order of 22N calls to scoring routine
  • DP approach requires of the order of N2

N

N N

2

2 2

slide-32
SLIDE 32

(c) David Gilbert 2008 Sequence Comparison (1) 32

Example

  • WATER to WINE
  • Symbol mis-match -5; gap insertion -1; match 5

E N Si,j Si,j-1 I Si-1,j Si-1,j-1 W R E T A W i / j

slide-33
SLIDE 33

(c) David Gilbert 2008 Sequence Comparison (1) 33

Example

  • Gap insertions before WATER & WINE
  • 4

E

  • 3

N

  • 2

I

  • 1

W

  • 5
  • 4
  • 3
  • 2
  • 1

R E T A W i / j

slide-34
SLIDE 34

(c) David Gilbert 2008 Sequence Comparison (1) 34

Example

  • Build up entries employing
  • 4

E

  • 3

N

  • 2

I 5

  • 1

W

  • 5
  • 4
  • 3
  • 2
  • 1

R E T A W i / j

slide-35
SLIDE 35

(c) David Gilbert 2008 Sequence Comparison (1) 35

Example

  • Build up entries employing
  • 4

E

  • 3

N

  • 2

I 4 5

  • 1

W

  • 5
  • 4
  • 3
  • 2
  • 1

R E T A W i / j

slide-36
SLIDE 36

(c) David Gilbert 2008 Sequence Comparison (1) 36

Example

Traceback through the matrix

  • horizontal edge = insert gap in vertical (column) sequence
  • vertical edge = insert gap in horzontal (row) sequence
  • diagonal edge = match or mis-match (substitution)
slide-37
SLIDE 37

(c) David Gilbert 2008 Sequence Comparison (1) 37

Exercise

Find all (other) maximally scoring alignments!

slide-38
SLIDE 38

(c) David Gilbert 2008 Sequence Comparison (1) 38

Exercise

Try aligning AIMS and AMOS using

  • Symbol mis-match -3
  • gap insertion -1
  • match 5
slide-39
SLIDE 39

(c) David Gilbert 2008 Sequence Comparison (1) 39

Exercise

For the sequences q = CDAA and d = AEECA, find highest scoring alignments using the (mis)match & gap scores below. Scoring Matrix Gap score -2 2 E

  • 2

2 D 1 C

  • 1
  • 2
  • 2

2 A E D C A

slide-40
SLIDE 40

(c) David Gilbert 2008 Sequence Comparison (1) 40

Percentage sequence identity

number of identical residues x 100 = ________________________________ number of residues in smallest sequence Can differ if have gaps/no_gaps: compute for these sequences:

  • TGCAT-A-

| | | | AT-C-TGAT TGCATA | | ATCTGAT

slide-41
SLIDE 41

(c) David Gilbert 2008 Sequence Comparison (1) 41

β and α globin, without gaps

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA β KEFTPPVQAAYQKVVAGVANALAHKYH α VHASLDKFLASVSTVLTSKYR Compute the identity%

slide-42
SLIDE 42

(c) David Gilbert 2008 Sequence Comparison (1) 42

β and α globin, with gaps

CLUSTAL W (1.81) multiple sequence alignment

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP β KEFTPPVQAAYQKVVAGVANALAHKYH α AEFTPAVHASLDKFLASVSTVLTSKYR Compute the identity%

slide-43
SLIDE 43

(c) David Gilbert 2008 Sequence Comparison (1) 43

Human beta globin hits coyote!

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN. Length = 146 Score = 276 bits (698), Expect = 2e-74 Identities = 131/146 (89%), Positives = 137/146 (93%) Query:2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61 VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV Sbjct: 1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60 Query: 62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121 KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK Sbjct: 61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120 Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147 EFTP VQAAYQKVVAGVANALAHKYH Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Blast output

Compute the identity% What happened?