 
Bioinformatics
Sequence comparison 2: local pairwise alignment
David Gilbert
Bioinformatics Research Centre, www.brc.dcs.gla.ac.uk
Department of Computing Science, University of Glasgow

Lecture contents
• Variations on dynamic programming
  – Gap penalties
  – Substitution matrices
• To explain why local alignments may be more appropriate than global ones
• To describe the use of dot plots in visualising an alignment
• To describe the Smith-Waterman method of finding and scoring an optimal local pairwise alignment
• To describe in outline the BLAST algorithm for database search
(c) David Gilbert 2008 Sequence Comparison (2)

Solution to Week 1 Exercise
[Figure: dynamic programming matrix with axis labels "0 A E E C A" and "0 C D A A"; cell values not recovered]

Percentage sequence identity
  % identity = (number of identical residues / number of residues in the shorter sequence) x 100
• The value can differ depending on whether gaps are allowed. Compute it for these sequences:

  Without gaps:      With gaps:
  TGCATA             -TGCAT-A-
    |  |              | | | |
  ATCTGAT            AT-C-TGAT

• Sequence similarity: a change at an amino-acid residue or nucleotide that preserves the physico-chemical properties of the residue

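The identity formula above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function name `percent_identity` and the use of "-" as the gap symbol are assumptions, and the ungapped case simply overlays the two sequences from their first residues.

```python
def percent_identity(a, b, gap="-"):
    """Percentage identity of two aligned strings (hypothetical helper).

    Columns are compared position by position; a column counts as
    identical only when both symbols are the same residue (not a gap).
    The denominator is the number of residues in the shorter sequence.
    """
    identical = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    residues_a = len(a.replace(gap, ""))
    residues_b = len(b.replace(gap, ""))
    return 100.0 * identical / min(residues_a, residues_b)


# The two versions of the slide's example pair
print(percent_identity("TGCATA", "ATCTGAT"))       # overlaid without gaps
print(percent_identity("-TGCAT-A-", "AT-C-TGAT"))  # with gaps inserted
```

Running both calls shows how much the reported identity depends on the alignment, not only on the sequences.
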
β and α globin, without gaps

β  MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

β  VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α  KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

β  KEFTPPVQAAYQKVVAGVANALAHKYH
α  VHASLDKFLASVSTVLTSKYR

Compute the identity%

β and α globin, with gaps
CLUSTAL W (1.81) multiple sequence alignment

β  MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α  --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ

β  VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α  VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP

β  KEFTPPVQAAYQKVVAGVANALAHKYH
α  AEFTPAVHASLDKFLASVSTVLTSKYR

Compute the identity%

Human beta globin hits coyote!
Blast output. What happened?

>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN.
  Length = 146
  Score = 276 bits (698), Expect = 2e-74
  Identities = 131/146 (89%), Positives = 137/146 (93%)

Query:   2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
           VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV
Sbjct:   1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60

Query:  62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
           KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK
Sbjct:  61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120

Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147
           EFTP VQAAYQKVVAGVANALAHKYH
Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

Compute the identity%

Penalising gaps
• Gap = maximal consecutive run of spaces in an alignment (1 or more spaces)
• Simple penalty: each gap contributes a constant weight
• More complex: gap penalty proportional to gap length
• Large gap penalty → few gaps (fewer substrings in the alignment). Small penalty → fragmented alignments.
• FASTA:
  – GAPOPEN: Penalty for the first residue in a gap (-12 for proteins, -16 for DNA).
  – GAPEXT: Penalty for additional residues in a gap (-2 for proteins, -4 for DNA).

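As a rough sketch of the FASTA-style scheme above, the penalty for a single gap can be written as one opening charge plus an extension charge for every further residue. The function name `gap_penalty` is an assumption made for illustration; the default values are the protein figures quoted on the slide.

```python
def gap_penalty(length, gap_open=-12, gap_ext=-2):
    """Penalty for one gap of `length` spaces, FASTA-style.

    gap_open is charged for the first residue in the gap and gap_ext
    for each additional residue. Defaults are the protein values
    quoted on the slide; for DNA use gap_open=-16, gap_ext=-4.
    """
    if length < 1:
        raise ValueError("a gap contains at least one space")
    return gap_open + (length - 1) * gap_ext


# Protein and DNA penalties for gaps of length 1, 2 and 5
for k in (1, 2, 5):
    print(k, gap_penalty(k), gap_penalty(k, gap_open=-16, gap_ext=-4))
```

Because extending a gap is cheaper than opening a new one, this kind of scheme favours a few long gaps over many short ones.
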
Substitution matrices
• Unitary matrix: match=1, mismatch=0
  – sparse matrix (most elements are 0)

      A C G T
    A 1 0 0 0
    C 0 1 0 0
    G 0 0 1 0
    T 0 0 0 1

• Poor diagnostic power: all identical matches carry identical weighting
• We can enhance the scoring potential of weak but biologically significant signals
• Scoring matrices: weight matches for non-identical residues according to observed substitution rates.
• More on this later!

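A unitary matrix is easy to represent as a nested dictionary, which also shows why it carries so little information. The sketch below is illustrative only; the names `unitary` and `ungapped_score` are assumptions, and any other substitution matrix with the same shape could be passed to the scoring function.

```python
ALPHABET = "ACGT"

# Unitary (identity) matrix: 1 on the diagonal, 0 elsewhere
unitary = {a: {b: 1 if a == b else 0 for b in ALPHABET} for a in ALPHABET}


def ungapped_score(s, t, matrix):
    """Sum the matrix entries over the columns of an ungapped alignment."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(matrix[x][y] for x, y in zip(s, t))


print(ungapped_score("ACGT", "ACGA", unitary))
```

With this matrix the score is just the number of identical columns; every mismatch is treated alike, whatever the residues involved.
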
Global and local alignment
• Global alignment: as per the dynamic programming solution already explained
  – Needleman & Wunsch algorithm (1970)
• Local alignment: find local regions from each string which are similar
  – Corresponds to shorter, localised paths in the matrix.
  – Justification: biological functional sites are localised to short conserved regions (no indels/mutations).
  – Smith-Waterman algorithm (1981)

Local alignment
• Start and end the dynamic programming computation at any cells instead of (0,0) and (i,j)
• The matrix contains a maximum value that may not be at (i,j) [the end of the input sequences]
  – It represents the endpoint of an alignment such that no other pair of segments with greater similarity exists between the 2 sequences

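The idea of letting an alignment start and end at any cell can be sketched as follows. This is a minimal Smith-Waterman style illustration under stated assumptions: a linear gap penalty, made-up integer scores (match=2, mismatch=-1, gap=-1), and a hypothetical function name. It is not taken from the slides. Clamping each cell at 0 lets an alignment start anywhere, and tracing back from the maximum cell lets it end anywhere.

```python
def smith_waterman(q, d, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned q segment, aligned d segment)."""
    rows, cols = len(q) + 1, len(d) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell is the best score of a local alignment ending there
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if q[i - 1] == d[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align q[i-1] with d[j-1]
                          H[i - 1][j] + gap,     # gap in d
                          H[i][j - 1] + gap)     # gap in q
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell until a cell with score 0
    i, j = best_pos
    aligned_q, aligned_d = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if q[i - 1] == d[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            aligned_q.append(q[i - 1]); aligned_d.append(d[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_q.append(q[i - 1]); aligned_d.append("-")
            i -= 1
        else:
            aligned_q.append("-"); aligned_d.append(d[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_d))


# Made-up sequences: a shared segment flanked by unrelated symbols
print(smith_waterman("XXXCDEFGHIKYYY", "ZZCDEFGHIKWW"))
```

The unrelated flanking symbols are ignored because extending the alignment into them would only lower the score.
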
Global vs local alignment
[Figure: two panels, "Global, Needleman & Wunsch" and "Local, Smith & Waterman"]

Local Pairwise Alignment
• Distantly related sequences, e.g. proteins
  – Uneven accumulation of mutations along the sequence
    • Similar segments in overall dissimilar sequences
  – Rearrangement of gene segments in the genome
    • Related sub-sequences in unrelated genes
• Local similarity corresponds to
  – Shared structural or functional motif
    • Robust to mutations
    • Evolutionarily important
• Global alignment may fail in such cases
  – Island of similarity lost in random symbol matches

Local Pairwise Alignment
• We need to find similar segments in sequences
• Database search task: find sequences { d } in database D that are homologous to query q
  – In a reasonable time
  – Present only homologous sequences (True Positives)
  – Do not present non-homologous sequences (False Positives)
• First: how do we find local alignments?

Dot Matrices
• First technique to discover local similarities
  – M by N matrix created
  – symbols of q (length M) on one axis, symbols of d (length N) on the other
  – Matrix populated with dots and spaces
  – Dot in cell (i,j) indicates that q(i) = d(j)
• Easy to understand visualisation
• Common substrings found easily
  – contiguous diagonal dots

Dot plots
• A convenient way of comparing 2 sequences visually
• Use a matrix, put 1 sequence on the X-axis, 1 on the Y-axis
• Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme; could have weights)
• Identical sequences will look like WHAT?
• Similar sequences will have a broken diagonal, plus some other lines
• Distantly related sequences: much noisier.
• Can generate an alignment
• Best path through the dotplot given by dynamic programming algorithms:
  – global alignment = Needleman & Wunsch
  – local alignment = Smith & Waterman

Dot plots

  H L T P E E K S V H T
H
A
K
P
E
E
K
S
A
V
T

Dot plots

  H L T P E E K S V H T
H x                 x
A
K             x
P       x
E         x x
E         x x
K             x
S               x
A
V                 x
T     x               x

Alignment
  HLTPEEKSVHT
  |  |||||  |
  HAKPEEKSAVT

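The dot plot above can be regenerated with a short script. This is an illustrative sketch; the function names `dot_matrix` and `render` are assumptions. It follows the rule stated earlier that a dot in cell (i, j) means q(i) = d(j), with one sequence on the rows and the other on the columns.

```python
def dot_matrix(q, d):
    """Cell (i, j) is True when q[i] == d[j] (q on rows, d on columns)."""
    return [[x == y for y in d] for x in q]


def render(q, d):
    """Return the dot matrix as text, marking matches with 'x'."""
    lines = ["  " + " ".join(d)]
    for x, row in zip(q, dot_matrix(q, d)):
        cells = " ".join("x" if hit else " " for hit in row)
        lines.append((x + " " + cells).rstrip())
    return "\n".join(lines)


# Rows: HAKPEEKSAVT, columns: HLTPEEKSVHT (the pair used on the slide)
print(render("HAKPEEKSAVT", "HLTPEEKSVHT"))
```

The run of dots along the main diagonal corresponds to the shared segment PEEKS in the alignment.
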
A dotplot
[Figure: example dotplot]

Try a dotplot and alignment
  Sequence 1: MTFRDLLSVSFEGPRP
  Sequence 2: MTFRDLLSVSFEGPRPDSSAGG

Try a dotplot and alignment
• Two sequences
  q = ANTGDSCTAWCDEFGHIKPQWERTY
  d = TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT

Dot Matrices
• Easy to identify the common recurring substring CDEFGHIK
• Anti-diagonal identifies the reversed substring TRE
• Matrix image can be ‘noisy’
  – Most of the dots are not associated with a common substring
• Matrices can be very large and unwieldy for typical protein sequences (roughly 500 to 1000 amino acids)

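Reading common substrings off the diagonals can also be done programmatically. The sketch below is one possible approach, not the method used in the lecture: it records the length of each run of consecutive diagonal dots and reports the maximal runs of at least `min_len` symbols. The function name and the `min_len` threshold are assumptions, and reversed substrings on anti-diagonals are not handled.

```python
def diagonal_runs(q, d, min_len=4):
    """List (start in q, start in d, substring) for maximal diagonal runs."""
    n, m = len(q), len(d)
    # run[i][j] = length of the matching run ending at q[i-1], d[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if q[i - 1] == d[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    found = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run[i][j]
            # A run is maximal if the next diagonal cell is not a match
            extends = i < n and j < m and q[i] == d[j]
            if length >= min_len and not extends:
                found.append((i - length, j - length, q[i - length:i]))
    return found


q = "ANTGDSCTAWCDEFGHIKPQWERTY"
d = "TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT"
for start_q, start_d, segment in diagonal_runs(q, d):
    print(start_q, start_d, segment)
```

Raising `min_len` is a simple way of filtering out the isolated dots that make the matrix image noisy. Start positions are reported 0-based.
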
Smith-Waterman Method
• We require an objective score for an alignment
• Can employ the dynamic programming method (Lecture 1), though it requires some changes
• Consider the following two sequences
  – q = ACEDECADE
  – d = REDCEDKL
• We do not know at which symbols (residues) the highest scoring local alignments end, so all pairs should be considered
• Consider q8 and d6

Smith-Waterman Method
• Consider q8 and d6, i.e. q = ACEDECAD and d = REDCED
• Scoring: 0.5 equality, -0.3 inequality, -0.5 gap

q8      A     C     E     D     E     C     A     D
d6      R     -     E     D     -     C     E     D
c.s    -0.3  -0.5   0.5   0.5  -0.5   0.5  -0.3   0.5
a.s    -0.3  -0.8  -0.3   0.2  -0.3   0.2  -0.1   0.4

  (c.s is the score of each column; a.s is the running total)
• Removing the first two pairs from the alignment improves the alignment score, because both contribute negative scores

Smith-Waterman Method

q3…8      E     D     E     C     A     D
d2…6      E     D     -     C     E     D
c.s       0.5   0.5  -0.5   0.5  -0.3   0.5
a.s       0.5   1.0   0.5   1.0   0.7   1.2

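The column scores and running totals in this worked example can be checked with a short script. It is a sketch for illustration only: the helper name `column_scores` is an assumption, the scoring values are those given on the slide (0.5 equality, -0.3 inequality, -0.5 gap), and totals are rounded to one decimal place to hide floating-point noise.

```python
from itertools import accumulate


def column_scores(a, b, match=0.5, mismatch=-0.3, gap=-0.5):
    """Score each column of a gapped alignment given as two equal-length strings."""
    scores = []
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            scores.append(gap)
        elif x == y:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores


def running_totals(scores):
    """Cumulative alignment score after each column, rounded to 0.1."""
    return [round(total, 1) for total in accumulate(scores)]


full = running_totals(column_scores("ACEDECAD", "R-ED-CED"))
print(full)

# Dropping everything up to the lowest running total removes the
# negative-scoring prefix (here the first two columns)
start = full.index(min(full)) + 1 if min(full) < 0 else 0
trimmed = running_totals(column_scores("ACEDECAD"[start:], "R-ED-CED"[start:]))
print(start, trimmed)
```

The trimmed alignment's final score equals the full alignment's final score minus the lowest running total, which is why removing a negative prefix always helps.
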