Bioinformatics
David Gilbert Bioinformatics Research Centre
www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow
Sequence comparison 1: global pairwise alignment
Lecture contents
– Evolutionary relationships
(c) David Gilbert 2008 Sequence Comparison (1) 2
scores & alignments
– Gap penalties
– Substitution matrices
– but we can’t easily find out its biological function
cDNA (nucleotide sequence):
acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc taataaaaaa catttatttt cattgc
Amino-acid (protein sequence):
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAH
GKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQ
AAYQKVVAGVANALAHKYH
Search using BLAST:
http://www.ncbi.nlm.nih.gov/BLAST/
http://www.ebi.ac.uk/blastall/
Where does the coding start on this sequence?
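One way to explore the question above is to look for a start codon and translate from there. The sketch below is illustrative only: it uses the standard genetic code, takes just the first 80 bases of the cDNA, and makes the naive assumption that the first `atg` is the start codon. The names `CODON_TABLE` and `translate_from` are made up for this example.

```python
# Standard genetic code, built from the conventional t, c, a, g codon ordering.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_from(dna, start):
    """Translate codon by codon from `start` until a stop codon or the end."""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


# First 80 bases of the cDNA above, with the spaces removed.
cdna_prefix = (
    "acatttgctt ctgacacaac tgtgttcact agcaacctca "
    "aacagacacc atggtgcacc tgactcctga ggagaagtct"
).replace(" ", "")

start = cdna_prefix.find("atg")  # naive assumption: the first atg starts the coding region
peptide = translate_from(cdna_prefix, start)
print(start, peptide)  # 50 MVHLTPEEKS
```

The translated prefix agrees with the first ten residues of the protein sequence shown above, which supports this choice of start position for this particular sequence. In general the first `atg` is not guaranteed to be the true start.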
– substitutions
– insertions
– deletions
– clue to common evolutionary origin, or
– clue to common function
if the organism will survive to reproduce, and hence pass on [transmit] the altered gene
Associated with high cholesterol
ACCESSION J02799
/translation="MESKVVVPAQGKKITLQNGKLNVPENPIIPYIEGDGIGVDVTPA MLKVVDAAVEKAYKGERKISWMEIYTGEKSTQVYGQDVWLPAETLDLIREYRVAIKGP LTTPVGGGIRSLNVALRQELDLYICLRPVRYYQGTPSPVKHPELTDMVIFRENSEDIY AGIEWKADSADAEKVIKFLREEMGVKKIRFPEHCGIGIKPCSEEGTKRLVRAAIEYAI ANDRDSVTLVHKGNIMKFTEGAFKDWGYQLAREEFGGELIDGGPWLKVKNPNTGKEIV IKDVIADAFLQQILLRPAEYDVIACMNLNGDYISDALAAQVGGIGIAPGANIGDECAL FEATHGTAPKYAGQDKVNPGSIILSAEMMLRHMGWTEAADLIVKGMEGAINAKTVTYD FERLMDGAKLLKCSEFGDAIIENM" ORIGIN MluI site; 25.3 min on K12 map. 1 cgcgtggcgt ggttttcagg tttacgcctg gtagaacgtt gcgagctgaa tcgcttaacc 61 tggtgatttc taaaagaagt tttttgcatg gtattttcag agattatgaa ttgccgcatt 121 atagcctaat aacgcgcatc tttcatgacg gcaaacaata gggtagtatt gacaagccaa 181 ttacaaatca ttaacaaaaa attgctctaa agcatccgta tcgcaggacg caaacgcata 241 tgcaacgtgg tggcagacga gcaaaccagt agcgctcgaa ggagaggtga atggaaagta 301 aagtagttgt tccggcacaa ggcaagaaga tcaccctgca aaacggcaaa ctcaacgttc 361 ctgaaaatcc gattatccct tacattgaag gtgatggaat cggtgtagat gtaaccccag 421 ccatgctgaa agtggtcgac gctgcagtcg agaaagccta taaaggcgag cgtaaaatct 481 cctggatgga aatttacacc ggtgaaaaat ccacacaggt ttatggtcag gacgtctggc 541 tgcctgctga aactcttgat ctgattcgtg aatatcgcgt tgccattaaa ggtccgctga 601 ccactccggt tggtggcggt attcgctctc tgaacgttgc cctgcgccag gaactggatc 661 tctacatctg cctgcgtccg gtacgttact atcagggcac tccaagcccg gttaaacacc 721 ctgaactgac cgatatggtt atcttccgtg aaaactcgga agacatttat gcgggtatcg 781 aatggaaagc agactctgcc gacgccgaga aagtgattaa attcctgcgt gaagagatgg 841 gggtgaagaa aattcgcttc ccggaacatt gtggtatcgg tattaagccg tgttcggaag 901 aaggcaccaa acgtctggtt cgtgcagcga tcgaatacgc aattgctaac gatcgtgact 961 ctgtgactct ggtgcacaaa ggcaacatca tgaagttcac cgaaggagcg tttaaagact 1021 ggggctacca gctggcgcgt gaagagtttg gcggtgaact gatcgacggt ggcccgtggc 1081 tgaaagttaa aaacccgaac actggcaaag agatcgtcat taaagacgtg attgctgatg 1141 cattcctgca acagatcctg ctgcgtccgg ctgaatatga tgttatcgcc tgtatgaacc 1201 tgaacggtga ctacatttct gacgccctgg cagcgcaggt tggcggtatc ggtatcgccc 1261 ctggtgcaaa catcggtgac gaatgcgccc tgtttgaagc 
cacccacggt actgcgccga 1321 aatatgccgg tcaggacaaa gtaaatcctg gctctattat tctctccgct gagatgatgc 1381 tgcgccacat gggttggacc gaagcggctg acttaattgt taaaggtatg gaaggcgcaa 1441 tcaacgcgaa aaccgtaacc tatgacttcg agcgtctgat ggatggcgct aaactgctga 1501 aatgttcaga gtttggtgac gcgatcatcg aaaacatgta atgccgtagt ttgttaaatt 1561 tattaacg //
at[t,c,a]at[t,c,a]ga[a,g]aa[t,c]atg taa (regex)
I   I   E   N   M   Ter
atc atc gaa aac atg taa
Compute the translation of:
1. atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg
2. tcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg
3. catcgaaaacatgtaatgccgtagtttgttaaatttattaacg
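The back-translation pattern on this slide can be tried out with Python's `re` module. This is a small sketch, not part of the lecture: the commas inside the brackets are dropped, because in a real regular expression a comma inside a character class would itself be matched, and the variable `tail` is an illustrative fragment taken from the end of the DNA sequence shown above.

```python
import re

# The slide's pattern for I I E N M Ter, written as a real regular expression.
pattern = re.compile(r"at[tca]at[tca]ga[ag]aa[tc]atgtaa")

# A fragment from the end of the DNA sequence shown above (spaces removed).
tail = "gacgcgatcatcgaaaacatgtaatgccgtag"

match = pattern.search(tail)
print(match.start(), match.group())  # 6 atcatcgaaaacatgtaa
```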
1. atc atc gaa aac atg taa tgc cgt agt ttg tta aat tta tta acg
2. tca tcg aaa aca tgt aat gcc gta gtt tgt taa att tat taa cg
3. cat cga aaa cat gta atg ccg tag ttt gtt aaa ttt att aac g
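The three reading frames above can be translated mechanically. The following is a minimal sketch assuming the standard genetic code; `translate` is an illustrative helper name, `*` marks a stop codon (Ter), and trailing bases that do not fill a codon are ignored.

```python
# Standard genetic code, built from the conventional t, c, a, g codon ordering.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate every complete codon of `dna`; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


seq = "atcatcgaaaacatgtaatgccgtagtttgttaaatttattaacg"

# Frame k starts at offset k - 1, matching sequences 1, 2 and 3 on the slide.
frames = [translate(seq[offset:]) for offset in range(3)]
for number, protein in enumerate(frames, start=1):
    print(number, protein)
# 1 IIENM*CRSLLNLLT
# 2 SSKTCNAVVC*IY*
# 3 HRKHVMP*FVKFIN
```

Only frame 1 reproduces the end of the annotated translation (I I E N M Ter).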
Triplet code, hence difference between DNA base
acids change)
– NB, Indels can be in multiples of 3, and hence...
Also
change - why?
resulting in a stop codon.
Some evolutionary relationships revealed by comparing α-haemoglobins
axolotl giant panda lesser panda moose goshawk vulture duck alligator
ggcatt agcatt agcata agcatg agccta aggatt gacatt
ggcatt agcatt agcata agcatg agccta aggatt gacatt
c→g g→a
What are the mutations in the following:- ggcatt agccta
Q: How many changes between 2 sequences?
“living examples”
“ancestral sequences”
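For two sequences of equal length, the positions at which they differ can be listed directly. This is a sketch with an illustrative function name, `point_differences`; it only counts observed differences, which is a lower bound on the number of substitutions that actually occurred, since intermediate changes leave no trace.

```python
def point_differences(a, b):
    """Return (position, base_in_a, base_in_b) for each differing position (1-based)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


diffs = point_differences("ggcatt", "agccta")
print(len(diffs), diffs)  # 3 [(1, 'g', 'a'), (4, 'a', 'c'), (6, 't', 'a')]
```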
ggcatt agcatt agcata agcatg agccta aggatt gacatt aggatc aggata aggatc ggcatt
– Two sequences evolved from the same ancestor
– The sequences are homologous
h: GLVST
h → GLIST (V→I) → q: GLISVT (→V)
h → GLVT (S→) → d: GIVT (L→I)
‘True’ evolutionary history (& h) unknown
– Two substitutions & 2 insertions OR 2 deletions OR 1 deletion & 1 insertion
h: GLVST
q: GLISVT
d: GIVT
h: GISVT   q: GLISVT   d: GISVT
h: GIVT   q: GLISVT   d: GIVT
q: GLISVT; I↔L; V ↔I; ←S→; ←V→; d: GIVT
q: GLISVT; L →I; I →V; S→; V→; d: GIVT
posed – many possible histories.
GLISVT G_I_VT
histories, e.g. ancestor GLIVT: inserting S (→S) gives GLISVT; deleting L (L→) gives GIVT
alignment alone
ancestor – i.e. homologous sequences
between two sequences can indicate homology
sequences – DNA, RNA, Protein
sequences are related
di
derived independently from common (unknown) ancestor
in alignment
– $P(q, d \mid H) = \prod_i P(q_i, d_i \mid H)$, assuming homology
– $P(q, d \mid N) = \prod_i P(q_i)\,P(d_i)$, assuming no homology
$s(q_i, d_i) = \log P(q_i, d_i \mid H) - \log\left[P(q_i)\,P(d_i)\right]$
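The log-odds score can be illustrated numerically. The probabilities below are invented for the example and are not taken from the lecture: a uniform background of 0.25 per base, a joint probability of 0.15 for each identical pair under homology, and the remaining probability mass spread evenly over the 12 ordered mismatched pairs.

```python
import math

BACKGROUND = 0.25                     # assumed P(a) for every base
P_MATCH = 0.15                        # assumed P(a, a | H) for each identical pair
P_MISMATCH = (1 - 4 * P_MATCH) / 12   # assumed P(a, b | H) for each mismatched pair


def s(qi, di):
    """Log-odds score of one aligned pair: log P(qi, di | H) - log[P(qi) P(di)]."""
    p_joint = P_MATCH if qi == di else P_MISMATCH
    return math.log(p_joint) - math.log(BACKGROUND * BACKGROUND)


def alignment_score(q, d):
    """Sum of per-position scores, i.e. the log of P(q, d | H) / P(q, d | N)."""
    return sum(s(qi, di) for qi, di in zip(q, d))


print(round(s("a", "a"), 3), round(s("a", "c"), 3))
print(round(alignment_score("ggcatt", "agcatt"), 3))
```

With these assumed values an identical pair scores positive and a mismatched pair scores negative, so the total is positive when the homology model explains the aligned pair of sequences better than the independent model.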
alignment
dictionary (e.g. DNA – [A,C,T,G] including gaps) - subsequent lecture
likelihood of sequences being homologous
highest scoring alignment
Typical protein sequence 300 symbols long: ~ 10^179 different alignments to consider (far more than the number of electrons in the Earth).
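The slide does not say how this figure is obtained. One common estimate, assumed here, counts the global alignments of two sequences of length n as the binomial coefficient C(2n, n); for n = 300 this gives the order of magnitude quoted above.

```python
import math

n = 300                            # length of each sequence
count = math.comb(2 * n, n)        # assumed counting formula: C(2n, n)
exponent = len(str(count)) - 1     # power of ten of the leading digit

print(f"about 10^{exponent} alignments")  # about 10^179 alignments
```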
– insert, delete, (substitute) (1 symbol)
(distance=2 for each solution)
AIM-S A-MOS AMOS AMOS AIMS AIMS AIMS AMOS
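The distance of 2 between AIMS and AMOS can be checked with a standard dynamic-programming edit distance. This is a sketch assuming unit cost for each insertion, deletion and substitution; the function name is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of single-symbol insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string needs i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if the symbols match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AIMS", "AMOS"))  # 2
```

Both solutions on the slide cost 2: one deletion plus one insertion, or two substitutions.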
scores for sub-alignments
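– $q_i$ aligned with a gap: $S_{i,j} = S_{i-1,j} + s(q_i, -)$
– $d_j$ aligned with a gap: $S_{i,j} = S_{i,j-1} + s(-, d_j)$
– $q_i$ aligned with $d_j$: $S_{i,j} = S_{i-1,j-1} + s(q_i, d_j)$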
possibilities listed: $S_{i,j} = \max\{\, S_{i-1,j-1} + s(q_i, d_j),\; S_{i,j-1} + s(-, d_j),\; S_{i-1,j} + s(q_i, -) \,\}$
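The recurrence can be turned into a short matrix-filling routine. This is a minimal sketch, not the lecture's own code: the scores (match +1, mismatch -1, gap -2) are placeholder assumptions standing in for s, and the first row and column are initialised with accumulated gap scores.

```python
def global_alignment_score(q, d, match=1, mismatch=-1, gap=-2):
    """Fill the matrix S using the three-way maximum and return S[len(q)][len(d)]."""
    rows, cols = len(q) + 1, len(d) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + gap   # prefix of q aligned only with gaps
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + gap   # prefix of d aligned only with gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if q[i - 1] == d[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # q_i aligned with d_j
                S[i][j - 1] + gap,       # d_j aligned with a gap
                S[i - 1][j] + gap,       # q_i aligned with a gap
            )
    return S[-1][-1]


print(global_alignment_score("GLISVT", "GIVT"))  # 0
```

With these assumed scores, the best alignment of GLISVT and GIVT has four matches and two gaps, giving 4 - 4 = 0.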
multiple times
recovered
– For all 0 ≤ j ≤ N
– For all subsequences of length j of d
    » Form & Score Alignment
    » Save Maximum Score
– End
– End
[Figure: dynamic-programming matrix with rows W, I, N, E (index i) and columns W, A, T, E, R (index j). Each cell S(i,j) is computed from its neighbours S(i-1,j), S(i,j-1) and S(i-1,j-1). Successive slides fill in the first cells of row I, with values 5 and 4.]
Traceback through the matrix
Find all (other) maximally scoring alignments!
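A traceback can be sketched as follows. The scores (match +1, mismatch -1, gap -2) and the function name `global_align` are assumptions made for illustration. The routine recovers a single optimal alignment, preferring the diagonal move when several predecessors tie; listing every maximally scoring alignment would mean following all tied predecessors instead of just one.

```python
def global_align(q, d, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_q, aligned_d) for one optimal global alignment."""
    n, m = len(q), len(d)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if q[i - 1] == d[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, S[i][j - 1] + gap, S[i - 1][j] + gap)

    # Walk back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if q[i - 1] == d[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:   # came from the diagonal
                top.append(q[i - 1])
                bottom.append(d[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:  # q_i aligned with a gap
            top.append(q[i - 1])
            bottom.append("-")
            i -= 1
        else:                                       # d_j aligned with a gap
            top.append("-")
            bottom.append(d[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_q, aligned_d = global_align("GLISVT", "GIVT")
print(score)      # 0
print(aligned_q)  # GLISVT
print(aligned_d)  # G-I-VT
```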
For the sequences q = CDAA and d = AEECA, find the highest-scoring alignments using the (mis)match and gap scores below.
Gap score: -2
Scoring matrix (rows and columns A, C, D, E; values given per row):
A: 2
C: 1
D: 2
E: 2
| | | | AT-C-TGAT TGCATA | | ATCTGAT
β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
β KEFTPPVQAAYQKVVAGVANALAHKYH
α VHASLDKFLASVSTVLTSKYR
Compute the identity %.
CLUSTAL W (1.81) multiple sequence alignment
β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ
β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
β KEFTPPVQAAYQKVVAGVANALAHKYH
α AEFTPAVHASLDKFLASVSTVLTSKYR
Compute the identity %.
>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN.
Length = 146
Score = 276 bits (698), Expect = 2e-74
Identities = 131/146 (89%), Positives = 137/146 (93%)
Query: 2   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
           VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV
Sbjct: 1   VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60
Query: 62  KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
           KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK
Sbjct: 61  KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120
Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147
           EFTP VQAAYQKVVAGVANALAHKYH
Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146
Compute the identity %. What happened?
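Percent identity can be computed from any pair of aligned rows. The sketch below uses an illustrative function name and assumes the alignment length as the denominator; tools differ on this choice, so the value is only comparable between alignments scored the same way.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Identical, non-gap columns as a percentage of the alignment length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return 100.0 * identical / len(row_a)


print(round(percent_identity("GLISVT", "G-I-VT"), 1))  # 66.7
print(round(100.0 * 131 / 146, 1))                     # 89.7
```

The second line reproduces the ratio behind the BLAST report above, Identities = 131/146, which BLAST prints as 89%.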