Bioinformatics Sequence comparison 2 local pairwise alignment - - PowerPoint PPT Presentation

bioinformatics sequence comparison 2
SMART_READER_LITE
LIVE PREVIEW

Bioinformatics Sequence comparison 2 local pairwise alignment - - PowerPoint PPT Presentation

Bioinformatics Sequence comparison 2 local pairwise alignment David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Lecture contents Variations on dynamic programming


slide-1
SLIDE 1

Bioinformatics

David Gilbert Bioinformatics Research Centre

www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow

Sequence comparison 2

local pairwise alignment

slide-2
SLIDE 2

(c) David Gilbert 2008 Sequence Comparison (2) 2

Lecture contents

  • Variations on dynamic programming

– Gap penalties – Substitution matrices

  • To explain the reason that local alignments may be more

appropriate than global ones.

  • To describe the use of Dot-Plots in visualising an

alignment

  • To describe the Smith-Waterman method of finding and

scoring an optimal local pairwise alignment

  • To describe in outline the BLAST algorithm for database

search

slide-3
SLIDE 3

(c) David Gilbert 2008 Sequence Comparison (2) 3

Solution to Week 1 Exercise

A A D C A C E E A

slide-4
SLIDE 4

(c) David Gilbert 2008 Sequence Comparison (2) 4

Percentage sequence identity

number of identical residues x 100 = ________________________________ number of residues in smallest sequence Can differ if have gaps/no_gaps: compute for these sequences:

  • TGCAT-A-

| | | | AT-C-TGAT TGCATA | | ATCTGAT

Sequence similarity - change at amino-acid residue or nucleotide that preserves the physico- chemical properties of the residue

slide-5
SLIDE 5

(c) David Gilbert 2008 Sequence Comparison (2) 5

β and α globin, without gaps

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA β KEFTPPVQAAYQKVVAGVANALAHKYH α VHASLDKFLASVSTVLTSKYR Compute the identity%

slide-6
SLIDE 6

(c) David Gilbert 2008 Sequence Comparison (2) 6

β and α globin, with gaps

CLUSTAL W (1.81) multiple sequence alignment

β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP β KEFTPPVQAAYQKVVAGVANALAHKYH α AEFTPAVHASLDKFLASVSTVLTSKYR Compute the identity%

slide-7
SLIDE 7

(c) David Gilbert 2008 Sequence Comparison (2) 7

Human beta globin hits coyote!

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN. Length = 146 Score = 276 bits (698), Expect = 2e-74 Identities = 131/146 (89%), Positives = 137/146 (93%) Query:2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61 VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV Sbjct: 1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60 Query: 62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121 KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK Sbjct: 61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120 Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147 EFTP VQAAYQKVVAGVANALAHKYH Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Blast output

Compute the identity% What happened?

slide-8
SLIDE 8

(c) David Gilbert 2008 Sequence Comparison (2) 8

Penalising gaps

  • Gap = maximal consecutive run of spaces in an alignment (1 or more

spaces)

  • Simple penalty - each gap contributes a constant weight
  • More complex - gap penalty proportional to gap length
  • Large gap penalty → few gaps (less substrings in alignment). Small

penalty → fragmented alignments.

  • FASTA:

– GAPOPEN: Penalty for the first residue in a gap (-12 for proteins, -16 for DNA). – GAPEXT: Penalty for additional residues in a gap (-2 for proteins, -4 for DNA).

slide-9
SLIDE 9

(c) David Gilbert 2008 Sequence Comparison (2) 9

Substitution matrices

  • Unitary matrix: match=1, mismatch=0

– sparse matrix (most elements are 0)

  • Poor diagnostic power

– all identical matches carry identical weighting

  • We can enhance scoring potential of weak but

biologically significant signals

  • Scoring matrices - weight matches for non-identical

residues according to observed substitution rates.

  • More on this later!

A C G T A C G T

slide-10
SLIDE 10

(c) David Gilbert 2008 Sequence Comparison (2) 10

Global and local alignment

  • Global alignment - as per dynamic programming

solution as explained

– Needleman & Wunsch algorithm (1970)

  • Local alignment - find local regions from each

string which are similar:

– Corresponds to shorter, localised paths in the matrix. – Justification - biological functional sites localised to short conserved regions (no indels/mutations). – Smith-Waterman algorithm (1981)

slide-11
SLIDE 11

(c) David Gilbert 2008 Sequence Comparison (2) 11

Local alignment

  • Start & end dynamic programming computation at any cells instead of

(0,0) and (i,j)

  • The matrix contains a maximum value that may not be at (i,j)

[the end of the input sequences]

– represents the endpoint of an alignment s.t. no other pair of segments with greater similarity exists between the 2 sequences

slide-12
SLIDE 12

(c) David Gilbert 2008 Sequence Comparison (2) 12

Global vs local alignment

Global, Needleman & Wunsch Local, Smith & Waterman

slide-13
SLIDE 13

(c) David Gilbert 2008 Sequence Comparison (2) 13

Local Pairwise Alignment

  • Distantly related sequences i.e. proteins

– Uneven accumulation of mutations along sequence

  • Similar segments in overall dissimilar sequences

– Rearrangement of gene segments in genome

  • Related sub-sequences in unrelated genes
  • Local similarity corresponds to

– Shared structural or functional motif

  • Robust to mutations
  • Evolutionarily important
  • Global alignment may fail in such cases

– Island of similarity lost in random symbol matches

slide-14
SLIDE 14

(c) David Gilbert 2008 Sequence Comparison (2) 14

Local Pairwise Alignment

  • Require to find similar segments in sequences
  • Database search task : Find homologous sequences {d} to query q in

database D

– In a reasonable time – Present only homologous sequences (True Positives) – Do not present non-homologous sequences (False Positives)

  • First – how to find local alignments?
slide-15
SLIDE 15

(c) David Gilbert 2008 Sequence Comparison (2) 15

Dot Matrices

  • First technique to discover local similarities

– M by N matrix created – symbols of q (length M) on one axis, symbols of d (length N) on the other – Matrix populated with dots and spaces – Dot in cell (i,j) indicates that q(i) = d(j)

  • Easy to understand visualisation
  • Common substrings found easily – contiguous diagonal dots
slide-16
SLIDE 16

(c) David Gilbert 2008 Sequence Comparison (2) 16

Dot plots

  • A convenient way of comparing 2 sequences visually
  • Use matrix, put 1 sequence on X-axis, 1 on Y-axis
  • Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme -

could have weights)

  • Identical sequences will look like WHAT?
  • Similar sequences will have a broken diagonal, plus some other lines
  • Distantly related sequences - much noisier.
  • Can generate an alignment
  • Best path through dotplot given by dynamic programming algorithms:

– global alignment = Needleman & Wunsch – local alignment = Smith & Waterman

slide-17
SLIDE 17

(c) David Gilbert 2008 Sequence Comparison (2) 17

H L T P E E K S V H T H A K P E E K S A V T

Dot plots

slide-18
SLIDE 18

(c) David Gilbert 2008 Sequence Comparison (2) 18

H L T P E E K S V H T H x x A K x P x E x x E x x K x S x A V x T x x

Dot plots

Alignment HLTPEEKSVHT | ||||| | HAKPEEKSAVT

slide-19
SLIDE 19

(c) David Gilbert 2008 Sequence Comparison (2) 19

A dotplot

slide-20
SLIDE 20

(c) David Gilbert 2008 Sequence Comparison (2) 20

Try a dotplot and alignment

M T F R D L L S V S F E G P R P M T F R D L L S V S F E G P R P D S S A G G

slide-21
SLIDE 21

(c) David Gilbert 2008 Sequence Comparison (2) 21

Try a dotplot and alignment

  • Two sequences

q = ANTGDSCTAWCDEFGHIKPQWERTY d = TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT

slide-22
SLIDE 22

(c) David Gilbert 2008 Sequence Comparison (2) 22

Dot Matrices

  • Easy to identify common recurring substring

CDEFGHIK

  • Anti-diagonal identifies reversed substring

TRE

  • Matrix image can be ‘noisy’

– Most of dots not associated with a common substring

  • Matrices can be very large & unwieldy for typical protein

sequences ~ 500 ~ 1000 aa’s

slide-23
SLIDE 23

(c) David Gilbert 2008 Sequence Comparison (2) 23

Smith-Waterman Method

  • Require objective score of alignment
  • Can employ dynamic programming method (Lecture 1) though

requires some changes

  • Consider following two sequences

– q = ACEDECADE – d = REDCEDKL

  • Unsure at what symbols (residues) highest scoring local alignments

end – all pairs should be considered

  • Consider q8 and d6
slide-24
SLIDE 24

(c) David Gilbert 2008 Sequence Comparison (2) 24

Smith-Waterman Method

  • Consider q8 and d6 i.e q = ACEDECAD & d = REDCED
  • Scoring 0.5 equality, -0.3 inequality, -0.5 gap
  • Removing first two pairs in alignment will improve alignment score – negative scores

0.4

  • 0.1

0.2

  • 0.3

0.2

  • 0.3
  • 0.8
  • 0.3

a.s 0.5

  • 0.3

0.5

  • 0.5

0.5 0.5

  • 0.5
  • 0.3

c.s D E C

  • D

E

  • R

d6 D A C E D E C A q8

slide-25
SLIDE 25

(c) David Gilbert 2008 Sequence Comparison (2) 25

Smith-Waterman Method

1.2 0.7 1.0 0.5 1.0 0.5 a.s 0.5

  • 0.3

0.5

  • 0.5

0.5 0.5 c.s D E C

  • D

E d2…6 D A C E D E q3…8

slide-26
SLIDE 26

(c) David Gilbert 2008 Sequence Comparison (2) 26

Smith-Waterman Method

  • Removing prefixes with negative scores improves overall score
  • To find best local alignment ending in qi & dj must find best

starting point

  • Fortunately a DP method can be employed
  • In this case all negative values are replaced with zero
  • Simple change to the global alignment DP
slide-27
SLIDE 27

(c) David Gilbert 2008 Sequence Comparison (2) 27

Smith-Waterman Method

Si-1,j-1 + s(qi, dj) Si,j-1 + s(-, dj) Si,j = max Si-1,j + s(qi, -)

{

slide-28
SLIDE 28

(c) David Gilbert 2008 Sequence Comparison (2) 28

Smith-Waterman Method

  • Now initialise first row & columns with 0
  • In this example: Score as 0.5 equality, -0.3 inequality, -0.5 gap, Si,j ≥ 0

0.4 0.9 0.7 0.5 0.2 0.5 E 0.2 0.7 1.2 0.2 0.5 D 0.4 0.4 0.2 0.7 0.5 A 0.9 0.7 0.7 0.5 1 0.2 C 0.7 1.2 1.0 1.0 0.7 0.5 0.5 E 0.5 1 1.5 0.5 0.5 1 D 0.5 1 0.5 E 0.5 C A L K D E C D E R I/J

slide-29
SLIDE 29

(c) David Gilbert 2008 Sequence Comparison (2) 29

Smith-Waterman Method

  • Initialise first row & columns with 0
  • Si,j ≥ 0
  • Find maximum partial alignment score Sk,l
  • Trace backwards, constructing alignment, until reach a

cell with value 0

slide-30
SLIDE 30

(c) David Gilbert 2008 Sequence Comparison (2) 30

Smith-Waterman Method

0.4 0.9 0.7 0.5 0.2 0.5 E 0.2 0.7 1.2 0.2 0.5 D 0.4 0.4 0.2 0.7 0.5 A 0.9 0.7 0.7 0.5 1 0.2 C 0.7 1.2 1.0 1.0 0.7 0.5 0.5 E 0.5 1 1.5 0.5 0.5 1 D 0.5 1 0.5 E 0.5 C A L K D E C D E R I/J

  • Find maximum partial alignment score Sk,l
  • Trace backwards, constructing alignment, until reach a cell with value 0
slide-31
SLIDE 31

(c) David Gilbert 2008 Sequence Comparison (2) 31

Smith-Waterman Method

0.4 0.9 0.7 0.5 0.2 0.5 E 0.2 0.7 1.2 0.2 0.5 D 0.4 0.4 0.2 0.7 0.5 A 0.9 0.7 0.7 0.5 1 0.2 C 0.7 1.2 1.0 1.0 0.7 0.5 0.5 E 0.5 1 1.5 0.5 0.5 1 D 0.5 1 0.5 E 0.5 C A L K D E C D E R I/J

  • Find maximum partial alignment score Sk,l
  • Trace backwards, constructing alignment, until reach a cell with value 0
slide-32
SLIDE 32

(c) David Gilbert 2008 Sequence Comparison (2) 32

FASTA (Lipman & Pearson 1985)

  • Fast approximation to the Smith-Waterman algorithm
  • Local alignments - tries to find paths of regional similarity, rather than trying to find the

best alignment between 2 sequences.

  • Alignments can contain gaps.
  • Rapid
  • Heuristic - not guaranteed to find the best alignment between 2 sequences; it may miss

matches.

– uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in

  • rder to gain speed.
  • A substitution matrix is used during all phases of protein searches
slide-33
SLIDE 33

(c) David Gilbert 2008 Sequence Comparison (2) 33

FASTA

  • Identifies short words (k-tuples) common to both sequences

(nucleotides k=1 or 2, amino-acids, k up to 6)

  • Similar to focussing on diagonal matches in dynamic

programming algorithm

  • Uses heuristic to join k-tuples close on same diagonal
  • If significant number of matches found, then uses dynamic

programming to compute gapped alignment

slide-34
SLIDE 34

(c) David Gilbert 2008 Sequence Comparison (2) 34

FASTA algorithm

www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure9.pdf

slide-35
SLIDE 35

(c) David Gilbert 2008 Sequence Comparison (2) 35

FASTA output

HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN. (146 aa) initn: 886 init1: 886 opt: 886 Z-score: 1095.3 bits: 208.6 E(): 2.9e-54 Smith-Waterman score: 886; 89.726% identity (89.726% ungapped) in 146 aa overlap (2-147:1-146) 10 20 30 40 50 60 EMBOSS MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK :::: :::: :..:::::::::::::::::::.:::::::::.::::::::::::.: : SW:HBB VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAK 10 20 30 40 50 70 80 90 100 110 120 EMBOSS VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG :::::::::..::::: .::::::::: ::::::::::::::::.::::::::::::::: SW:HBB VKAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFG 60 70 80 90 100 110 130 140 EMBOSS KEFTPPVQAAYQKVVAGVANALAHKYH ::::: ::::::::::::::::::::: SW:HBB KEFTPQVQAAYQKVVAGVANALAHKYH 120 130 140

slide-36
SLIDE 36

(c) David Gilbert 2008 Sequence Comparison (2) 36

Database Search - BLAST

  • SW complexity similar to DP for global alignment
  • Not realistic for database search in terms of time
  • Trade-off guarantee of finding best alignment with time expense
  • Basic Local Alignment Search Tool – BLAST (Alschul et al 1990)
  • Employs fast search to find small segments with similar score in both sequences
  • Extend small segments (local alignments)
  • Returns maximal scoring pairs (MSP) & MSP score & statistical significance of

scores

slide-37
SLIDE 37

(c) David Gilbert 2008 Sequence Comparison (2) 37

Database Search - BLAST

  • Query sequence split into words of defined length

– Query string q = AFGTULL with word length of L = 3 – AFG, FGT, GTU, TUL, ULL

  • Define a threshold alignment score T
  • Find all word-pairs of length L with score ≥ T

– For amino acids there are 203 = 8000 distinct words w of length L=3 – e.g Find all w such that S(w, AFG) ≥ T

slide-38
SLIDE 38

(c) David Gilbert 2008 Sequence Comparison (2) 38

Database Search - BLAST

  • Search database for all ‘hits’ - sequences with exact matches to each w

– Indexing of sequences to create ‘inverted file’ by employing hash table to index sequences – fast

  • Extend alignment of ‘hits’ while score increases – producing High Scoring Pair’s
  • Return sequences with HSP’s which have significantly (statistically) higher

scores than a threshold Smax

  • Smax obtained empirically from random sequences
slide-39
SLIDE 39

(c) David Gilbert 2008 Sequence Comparison (2) 39

Database Search - BLAST

  • Varying the threshold alignment score T

– Search time decreases as T is increased, fewer word pairs are found – Sensitivity of search decreases as T is increased, word pairs overlooked (homologous sequences may be discarded)

  • The score of the alignment Smax AND the associated statistical significance are

required to assess whether homology is suggested

slide-40
SLIDE 40

(c) David Gilbert 2008 Sequence Comparison (2) 40

BLAST - Basic Local Alignment Tool

Altschul et al 1990

  • Given 2 sequences:

– Segment pair - pair of subsequences of the same length forming an ungapped alignment – Computes all segment pairs – If there is a MSP maximal segment pair (highest score of all pairs for 1 comparison) above some cutoff score C and C is “significant” then report hit – Also reports those sequences where the score of MSP < C, but several segment pairs in combination which are significant. – Reports score from highest scoring pairs & probability scores [E values] (expected by chance).

  • Only produces ungapped alignments
slide-41
SLIDE 41

(c) David Gilbert 2008 Sequence Comparison (2) 41

BLAST Algorithm

www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure10.pdf

slide-42
SLIDE 42

(c) David Gilbert 2008 Sequence Comparison (2) 42

Gapped BLAST (Altschul et al 1997)

  • Seeks only 1 (not all) ungapped alignments with

significant match

  • Uses dynamic programming to extend residues

in both directions to give gapped alignment

  • 3x faster than ungapped BLAST
slide-43
SLIDE 43

(c) David Gilbert 2008 Sequence Comparison (2) 43

Edited results (EMBL)

Database: embl: 958,670 sequences; 2,466,994,978 total letters Score E Sequences producing significant alignments: (bits) Value EM_HUM:HSBGL1 V00497 Human messenger RNA for beta-globin. 1241 0.0 EM_HUM:AF181989 AF181989 Homo sapiens hemoglobin beta subuni... 1116 0.0 EM_HUM:HSHEMOB M25113 Human sickle beta-hemoglobin mRNA. 1100 0.0 EM_PAT:I32884 I32884 Sequence 9 from patent US 5589367. 910 0.0 EM_HUM:HS202231 U20223 Human thalassemia beta globin gene, c... 416 e-114 EM_OM:AGHBD M19061 Spider monkey (A.geoffroyi) delta-globin ... 369 1e-99 EM_OM:CPHBB5CP J00330 monkey (c.polykomos) beta-globin gene;... 367 4e-99 EM_OM:PPHBD M21825 Orangutan delta globin gene, complete cds. 347 4e-93 EM_OM:CPHBDPSC J00335 Monkey (colobus) delta-globin pseudoge... 297 3e-78 EM_OM:LMHBB M15734 Lemur (brown) beta-globin gene, complete ... 270 7e-70 EM_OM:TSHBD J04428 T.syrichta delta-globin gene, complete cds. 266 1e-68 EM_OM:OCU60902 U60902 Otolemur crassicaudatus epsilon-, gamm... 266 1e-68 EM_OM:LEBGLOB Y00347 Lepus europaeus adult beta-globin gene 266 1e-68 EM_OM:GCDELGLB M61740 G.crassicaudatus beta globin gene, com... 266 1e-68 EM_OM:MOHBDPS J00332 monkey (anubis) silent delta-globin gene. 262 2e-67 EM_OM:TSHBB J04429 T.syrichta beta globin gene, complete cds. 260 7e-67 EM_PAT:A34698 A34698 Synthetic pSXBeta+ sequence 258 3e-66 EM_OM:OCBGLO V00882 Rabbit (O. cuniculus) gene for beta-globin. 250 7e-64 EM_OM:BTBG M63453 Bovine Beta globin gene and globin (PSI-3)... 220 6e-55

slide-44
SLIDE 44

(c) David Gilbert 2008 Sequence Comparison (2) 44

Edited results (Swiss-prot)

Database: swissprot: 86,593 sequences; 31,411,157 total letters Score E Sequences producing significant alignments: (bits) Value SW:HBB_HUMAN P02023 HEMOGLOBIN BETA CHAIN. (human) 306 2e-83 SW:HBB_GORGO P02024 HEMOGLOBIN BETA CHAIN. (gorilla) 305 4e-83 SW:HBB2_PANLE P18988 HEMOGLOBIN BETA-2 CHAIN. (lion) 302 3e-82 SW:HBB_HYLLA P02025 HEMOGLOBIN BETA CHAIN. (gibbon) 300 8e-82 SW:HBB_PREEN P02032 HEMOGLOBIN BETA CHAIN. (Hanumam langur) 298 5e-81 SW:HBB_COLPO P19885 HEMOGLOBIN BETA CHAIN. (Colobus) 295 3e-80 SW:HBB_CERAE P02028 HEMOGLOBIN BETA CHAIN. (Green monkey) 295 3e-80 SW:HBB_MACFU P02027 HEMOGLOBIN BETA CHAIN. (Japanese macaque) 293 2e-79 SW:HBB_CALAR P18985 HEMOGLOBIN BETA CHAIN. (Marmoset) 292 2e-79 SW:HBB_ATEGE P02034 HEMOGLOBIN BETA CHAIN. (Spider monkey) 292 2e-79 SW:HBB_MANSP P08259 HEMOGLOBIN BETA CHAIN. (Mandrill) 291 4e-79 … SW:HBB1_RAT P02091 HEMOGLOBIN BETA CHAIN, (Rat) 255 4e-68 SW:HBB_ERIEU P02059 HEMOGLOBIN BETA CHAIN. (Hedgehog) 252 2e-67 SW:HBB_PANPO P04244 HEMOGLOBIN BETA CHAIN. (Bison) 251 5e-67 SW:HBB_BISBO P09422 HEMOGLOBIN BETA CHAIN. (Leopard) 251 5e-67 Database: swissprot: 86,593 sequences; 31,411,157 total letters Score E Sequences producing significant alignments: (bits) Value SW:HBB_HUMAN P02023 HEMOGLOBIN BETA CHAIN. (human) 306 2e-83 SW:HBB_GORGO P02024 HEMOGLOBIN BETA CHAIN. (gorilla) 305 4e-83 SW:HBB2_PANLE P18988 HEMOGLOBIN BETA-2 CHAIN. (lion) 302 3e-82 SW:HBB_HYLLA P02025 HEMOGLOBIN BETA CHAIN. (gibbon) 300 8e-82 SW:HBB_PREEN P02032 HEMOGLOBIN BETA CHAIN. (Hanumam langur) 298 5e-81 SW:HBB_COLPO P19885 HEMOGLOBIN BETA CHAIN. (Colobus) 295 3e-80 SW:HBB_CERAE P02028 HEMOGLOBIN BETA CHAIN. (Green monkey) 295 3e-80 SW:HBB_MACFU P02027 HEMOGLOBIN BETA CHAIN. (Japanese macaque) 293 2e-79 SW:HBB_CALAR P18985 HEMOGLOBIN BETA CHAIN. (Marmoset) 292 2e-79 SW:HBB_ATEGE P02034 HEMOGLOBIN BETA CHAIN. (Spider monkey) 292 2e-79 SW:HBB_MANSP P08259 HEMOGLOBIN BETA CHAIN. (Mandrill) 291 4e-79 … SW:HBB1_RAT P02091 HEMOGLOBIN BETA CHAIN, (Rat) 255 4e-68 SW:HBB_ERIEU P02059 HEMOGLOBIN BETA CHAIN. (Hedgehog) 252 2e-67 SW:HBB_PANPO P04244 HEMOGLOBIN BETA CHAIN. (Bison) 251 5e-67 SW:HBB_BISBO P09422 HEMOGLOBIN BETA CHAIN. (Leopard) 251 5e-67

slide-45
SLIDE 45

(c) David Gilbert 2008 Sequence Comparison (2) 45

Blast alignment

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN. Length = 146 Score = 276 bits (698), Expect = 2e-74 Identities = 131/146 (89%), Positives = 137/146 (93%) Query: 2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61 VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV Sbjct: 1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60 Query: 62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121 KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK Sbjct: 61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120 Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147 EFTP VQAAYQKVVAGVANALAHKYH Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Blast output

slide-46
SLIDE 46

(c) David Gilbert 2008 Sequence Comparison (2) 46

FASTA vs BLAST

  • BLAST faster than FASTA without significant loss of ability to

find the similar database sequences.

  • BLAST & FAST equivalent for highly similar sequences
  • FASTA may be better for less similar sequences
  • But can always make a full local alignment

(Smith-Waterman algorithm) - highest potential of finding less similar sequences in a database search since the entire sequence lengths are compared.

slide-47
SLIDE 47

(c) David Gilbert 2008 Sequence Comparison (2) 47

EMBL - some stats

31/01/06

Total nucleotides (current 118,263,140,052) Number of entries (current 65,933,089)

slide-48
SLIDE 48

(c) David Gilbert 2008 Sequence Comparison (2) 48

Smith-Waterman programs

  • MPsrch Edinburgh University

– http://www.ebi.ac.uk/MPsrch/

  • Scanps2.3 Geoff Barton (EBI; University of Dundee)

– http://www.ebi.ac.uk/scanps/

  • Blitz (bic_sw) Compugen -- uses MPsrch & Scanps

– http://www.ebi.ac.uk/bic_sw/ (email only)

slide-49
SLIDE 49

(c) David Gilbert 2008 Sequence Comparison (2) 49

Multiple alignments

  • Analyse gene families

– reveal (subtle) conserved family characteristics

characters 1 2 3 4 5 6 7 8 9 10

S1 Y D G G A V - E A L S2 Y D G G - - - E A L S3 F E G G I L V E A L S4 F D - G I L V Q A V S5 Y E G G A V V Q A L

consensus y d G G AI VL V e A l

sequences

slide-50
SLIDE 50

(c) David Gilbert 2008 Sequence Comparison (2) 50

  • Simultaneous: N-wise alignment (adapted from pairwise approach)

– uses N-dimension matrix. – Complexity is

  • O(m1m2) [2 sequences length m1 & m2 ]
  • O(mn) [n sequences of length m]

– Thus only good for short sequences.

  • Manua1 (!)
  • Progessive (heuristic) e.g. ClustalW:

– compute pairwise sequence identities – construct binary tree (can output phylogenetic tree) – align similar sequences in pairs, add distantly related ones later.

Multiple aligment - methods

s1 s2 s3 s4 s5 a1 a2 a3 a4

slide-51
SLIDE 51

(c) David Gilbert 2008 Sequence Comparison (2) 51

Multiple sequence alignment (globins)

CLUSTAL W (1.81) multiple sequence alignment Human VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60 Gorilla VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60 Rabbit VHLSSEEKSAVTALWGKVNVEEVGGEALGRLLVVYPWTQRFFESFGDLSSANAVMNNPKV 60 Pig VHLSAEEKEAVLGLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPKV 60 ***:.***.** .*******:****************************..:***.**** Human KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 120 Gorilla KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120 Rabbit KAHGKKVLAAFSEGLSHLDNLKGTFAKLSELHCDKLHVDPENFRLLGNVLVIVLSHHFGK 120 Pig KAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVVLARRLGH 120 ******** :**:** **********.*******:********:*****:* **::::*: Human EFTPPVQAAYQKVVAGVANALAHKYH 146 Gorilla EFTPPVQAAYQKVVAGVANALAHKYH 146 Rabbit EFTPQVQAAYQKVVAGVANALAHKYH 146 Pig DFNPNVQAAFQKVVAGVANALAHKYH 146 :*.* ****:****************

slide-52
SLIDE 52

(c) David Gilbert 2008 Sequence Comparison (2) 52

Multiple sequence alignments & phylogenetic trees

((Human:0.00000, Gorilla:0.00685) :0.04110, Rabbit:0.05479, Pig:0.10959); Pair Score Human-Gorilla 99 Human-Rabbit 90 Gorilla-Rabbit 89 Human-Pig 84 Gorilla-Pig 84 Rabbit-Pig 83

slide-53
SLIDE 53

(c) David Gilbert 2008 Sequence Comparison (2) 53

What can we do with multiple alignments?

  • Create (databases of) profiles derived from multiple alignments for protein

families – profile = multiple alignment + observed character frequencies at each position

  • Search with a sequence against a database of profiles

(e.g. PROSITE database) – faster than sequence against sequence – gives a more general result (“the input sequence matches globin profile”)

  • Search with a profile against a database of sequences

– PSI-BLAST : can identify more distant relationships than by normal BLAST search

slide-54
SLIDE 54

(c) David Gilbert 2008 Sequence Comparison (2) 54

PSI-BLAST (position specific iterated BLAST)

Single protein sequence Search database(BLAST) Multiple alignment Profile Estimate statistical significance of local alignments ?iterate until convergence

slide-55
SLIDE 55

(c) David Gilbert 2008 Sequence Comparison (2) 55

PSI-BLAST (Altschul et al 1997)

(1) Start with 1 sequence (or profile) = ‘probe’ (2) Search with BLAST and select top hits manually or automatically (3) Make multiple alignment & profile (4) Estimate statistical significance of local alignments.

If significance ok & you want to continue, then go to (1) using the profile, else exit

slide-56
SLIDE 56

(c) David Gilbert 2008 Sequence Comparison (2) 56

Dates & programs

FASTA BLAST Gapped BLAST & PSI BLAST