
Bioinformatics Sequence comparison 2: local pairwise alignment



  1. Bioinformatics Sequence comparison 2: local pairwise alignment
     David Gilbert
     Bioinformatics Research Centre, www.brc.dcs.gla.ac.uk
     Department of Computing Science, University of Glasgow

  2. Lecture contents
     • Variations on dynamic programming
       – Gap penalties
       – Substitution matrices
     • To explain the reason that local alignments may be more appropriate than global ones
     • To describe the use of Dot-Plots in visualising an alignment
     • To describe the Smith-Waterman method of finding and scoring an optimal local pairwise alignment
     • To describe in outline the BLAST algorithm for database search
     (c) David Gilbert 2008 Sequence Comparison (2)

  3. Solution to Week 1 Exercise
     [Figure: dynamic programming matrix for the sequences AEECA and CDAA; only the axis labels survive in this transcript]

  4. Percentage sequence identity
     Percentage sequence identity = (number of identical residues / number of residues in smallest sequence) x 100
     The value can differ depending on whether the alignment has gaps. Compute it for these sequences:

       TGCATA       -TGCAT-A-
         |  |        | | | |
       ATCTGAT      AT-C-TGAT

     Sequence similarity - a change at an amino-acid residue or nucleotide that preserves the physico-chemical properties of the residue
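
The formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `percent_identity` and the choice to pad the shorter ungapped sequence with a trailing `-` are assumptions made here so that both rows have equal length.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage identity of two equal-length alignment rows.

    Identical residues are counted column by column; a gap never counts
    as a match. The denominator is the length of the shorter sequence
    with gaps removed, following the slide's formula.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    shortest = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return 100.0 * identical / shortest


# Ungapped comparison: the shorter sequence is padded so the rows line up
# (an assumption of this sketch).
ungapped = percent_identity("TGCATA-", "ATCTGAT")
# Gapped alignment as shown on the slide.
gapped = percent_identity("-TGCAT-A-", "AT-C-TGAT")
print(f"ungapped: {ungapped:.1f}%  gapped: {gapped:.1f}%")
```

With the slide's two alignments the ungapped comparison has 2 identical columns and the gapped one has 4, both over a shortest length of 6, which is why the two percentages differ.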

  5. β and α globin, without gaps

     β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
     α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

     β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
     α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

     β KEFTPPVQAAYQKVVAGVANALAHKYH
     α VHASLDKFLASVSTVLTSKYR

     Compute the identity%

  6. β and α globin, with gaps
     CLUSTAL W (1.81) multiple sequence alignment

     β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
     α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ

     β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
     α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP

     β KEFTPPVQAAYQKVVAGVANALAHKYH
     α AEFTPAVHASLDKFLASVSTVLTSKYR

     Compute the identity%

  7. Human beta globin hits coyote!
     Blast output:

     >SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN.
     Length = 146

     Score = 276 bits (698), Expect = 2e-74
     Identities = 131/146 (89%), Positives = 137/146 (93%)

     Query:   2 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
                VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV
     Sbjct:   1 VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60

     Query:  62 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
                KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK
     Sbjct:  61 KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120

     Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147
                EFTP VQAAYQKVVAGVANALAHKYH
     Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146

     What happened?
     Compute the identity%

  8. Penalising gaps
     • Gap = maximal consecutive run of spaces in an alignment (1 or more spaces)
     • Simple penalty - each gap contributes a constant weight
     • More complex - gap penalty proportional to gap length
     • Large gap penalty → few gaps (fewer substrings in alignment). Small penalty → fragmented alignments.
     • FASTA:
       – GAPOPEN: Penalty for the first residue in a gap (-12 for proteins, -16 for DNA).
       – GAPEXT: Penalty for additional residues in a gap (-2 for proteins, -4 for DNA).
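
As a rough illustration of how the two FASTA parameters combine, the sketch below charges GAPOPEN for the first residue of a gap and GAPEXT for each further residue, which is one reading of the descriptions on this slide. The function name `gap_penalty` and the dictionaries `PROTEIN` and `DNA` are names invented here; this is not FASTA's own code.

```python
def gap_penalty(length: int, gap_open: int, gap_ext: int) -> int:
    """Penalty for a single gap of `length` consecutive spaces.

    The first space costs `gap_open`; every additional space costs
    `gap_ext`. A gap of length 0 costs nothing.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_ext


# Default values quoted on the slide.
PROTEIN = {"gap_open": -12, "gap_ext": -2}
DNA = {"gap_open": -16, "gap_ext": -4}

for k in (1, 2, 5):
    print(k, gap_penalty(k, **PROTEIN), gap_penalty(k, **DNA))
```

Setting `gap_ext` to 0 gives the simple constant-weight penalty from the slide, and setting `gap_open` equal to `gap_ext` gives a penalty proportional to gap length.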

  9. Substitution matrices
     • Unitary matrix: match=1, mismatch=0
       – sparse matrix (most elements are 0)

           A C G T
         A 1 0 0 0
         C 0 1 0 0
         G 0 0 1 0
         T 0 0 0 1

     • Poor diagnostic power – all identical matches carry identical weighting
     • We can enhance scoring potential of weak but biologically significant signals
     • Scoring matrices - weight matches for non-identical residues according to observed substitution rates.
     • More on this later!
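
A unitary matrix is easy to build and apply programmatically. The sketch below is illustrative only: the names `ALPHABET`, `UNITARY` and `score_ungapped` are chosen here, and the example sequences are made up.

```python
ALPHABET = "ACGT"

# Unitary matrix: 1 for a match, 0 for any mismatch.
UNITARY = {(a, b): 1 if a == b else 0 for a in ALPHABET for b in ALPHABET}


def score_ungapped(seq_a: str, seq_b: str, matrix: dict) -> int:
    """Sum the matrix score of each aligned pair (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# Print the matrix row by row, then score a small made-up example.
for a in ALPHABET:
    print(a, " ".join(str(UNITARY[(a, b)]) for b in ALPHABET))
example_score = score_ungapped("ACGT", "ACGA", UNITARY)
print(example_score)  # three of the four positions match
```

Replacing `UNITARY` with a dictionary of weighted scores changes the result without touching `score_ungapped`, which is the point of keeping the scoring scheme in a matrix.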

  10. Global and local alignment
      • Global alignment - as per the dynamic programming solution explained earlier
        – Needleman & Wunsch algorithm (1970)
      • Local alignment - find local regions from each string which are similar:
        – Corresponds to shorter, localised paths in the matrix.
        – Justification - biological functional sites localised to short conserved regions (no indels/mutations).
        – Smith-Waterman algorithm (1981)

  11. Local alignment
      • Start & end the dynamic programming computation at any cells, instead of at (0,0) and (i,j)
      • The matrix contains a maximum value that may not be at (i,j) [the end of the input sequences]
        – it represents the endpoint of an alignment such that no other pair of segments with greater similarity exists between the 2 sequences
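
The two bullets above correspond to the usual Smith-Waterman recurrence, written out here as a sketch. The notation is chosen for this note rather than taken from the slides: H is the score matrix, s(a, b) is the substitution score, g > 0 is an assumed linear cost per space, M and N are the sequence lengths, and i and j are running indices (not the final cell, as in the slide's (i,j)).

```latex
\begin{aligned}
H_{i,0} &= 0, \qquad H_{0,j} = 0,\\
H_{i,j} &= \max\bigl\{\, 0,\; H_{i-1,j-1} + s(q_i, d_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g \,\bigr\},\\
\text{score} &= \max_{1 \le i \le M,\; 1 \le j \le N} H_{i,j}.
\end{aligned}
```

The 0 inside the maximum lets an alignment start at any cell, and taking the maximum over the whole matrix lets it end at any cell.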

  12. Global vs local alignment
      [Figure with two panels: "Global, Needleman & Wunsch" and "Local, Smith & Waterman"]

  13. Local Pairwise Alignment
      • Distantly related sequences i.e. proteins
        – Uneven accumulation of mutations along sequence
      • Similar segments in overall dissimilar sequences
        – Rearrangement of gene segments in genome
      • Related sub-sequences in unrelated genes
      • Local similarity corresponds to
        – Shared structural or functional motif
      • Robust to mutations
      • Evolutionarily important
      • Global alignment may fail in such cases
        – Island of similarity lost in random symbol matches

  14. Local Pairwise Alignment
      • We need to find similar segments in sequences
      • Database search task: find homologous sequences {d} to query q in database D
        – In a reasonable time
        – Present only homologous sequences (True Positives)
        – Do not present non-homologous sequences (False Positives)
      • First: how to find local alignments?

  15. Dot Matrices
      • First technique to discover local similarities
        – M by N matrix created
        – symbols of q (length M) on one axis, symbols of d (length N) on the other
        – Matrix populated with dots and spaces
        – Dot in cell (i,j) indicates that q(i) = d(j)
      • Easy to understand visualisation
      • Common substrings found easily
        – contiguous diagonal dots

  16. Dot plots
      • A convenient way of comparing 2 sequences visually
      • Use matrix, put 1 sequence on X-axis, 1 on Y-axis
      • Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)
      • Identical sequences will look like WHAT?
      • Similar sequences will have a broken diagonal, plus some other lines
      • Distantly related sequences - much noisier.
      • Can generate an alignment
      • Best path through dotplot given by dynamic programming algorithms:
        – global alignment = Needleman & Wunsch
        – local alignment = Smith & Waterman

  17. Dot plots
      [Empty dot-plot grid]
      Across the top: H L T P E E K S V H T
      Down the side:  H A K P E E K S A V T

  18. Dot plots

        H L T P E E K S V H T
      H x                 x
      A
      K             x
      P       x
      E         x x
      E         x x
      K             x
      S               x
      A
      V                 x
      T     x               x

      Alignment
        HLTPEEKSVHT
        |  |||||  |
        HAKPEEKSAVT
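
A dot plot like this one can be generated with a couple of short functions. This is a minimal sketch using the simplest scheme (a dot for identical characters); the names `dot_matrix` and `render` are chosen here and are not from the lecture.

```python
def dot_matrix(q: str, d: str) -> list:
    """Return a matrix where cell (i, j) is True when q[i] == d[j]."""
    return [[qc == dc for dc in d] for qc in q]


def render(q: str, d: str) -> str:
    """Draw the dot matrix as text: d across the top, q down the side."""
    lines = ["  " + " ".join(d)]
    for qc, row in zip(q, dot_matrix(q, d)):
        cells = " ".join("x" if hit else " " for hit in row)
        lines.append((qc + " " + cells).rstrip())
    return "\n".join(lines)


# The two sequences from the slide: rows HAKPEEKSAVT, columns HLTPEEKSVHT.
rows_seq, cols_seq = "HAKPEEKSAVT", "HLTPEEKSVHT"
print(render(rows_seq, cols_seq))

# Dots on the main diagonal are the matched columns of the ungapped alignment.
matrix = dot_matrix(rows_seq, cols_seq)
print(sum(matrix[i][i] for i in range(len(rows_seq))))
```

The run of dots on the main diagonal (P E E K S) is the contiguous common substring that the alignment on the slide picks up.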

  19. A dotplot

  20. Try a dotplot and alignment
      M T F R D L L S V S F E G P R P
      M T F R D L L S V S F E G P R P D S S A G G

  21. Try a dotplot and alignment
      • Two sequences
        q = ANTGDSCTAWCDEFGHIKPQWERTY
        d = TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT

  22. Dot Matrices
      • Easy to identify the common recurring substring CDEFGHIK
      • Anti-diagonal identifies the reversed substring TRE
      • Matrix image can be ‘noisy’
        – Most dots are not associated with a common substring
      • Matrices can be very large & unwieldy for typical protein sequences of ~500 to ~1000 aa’s
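
Reading diagonals by eye can be mimicked in code by scanning each diagonal of the dot matrix for runs of consecutive matches. The sketch below is an illustration under assumptions made here: the function name `diagonal_runs`, the 0-based positions it reports, and the minimum run length of 3 used to suppress noise are all choices of this note. Reversed substrings are found by running the same scan against the reversed sequence.

```python
def diagonal_runs(q: str, d: str, min_len: int = 3) -> list:
    """Find maximal runs of matches along each diagonal of the dot matrix.

    Returns (start_in_q, start_in_d, substring) for every run of at least
    `min_len` consecutive dots. Positions are 0-based.
    """
    runs = []
    for offset in range(-(len(q) - 1), len(d)):  # offset = j - i
        i, j = max(0, -offset), max(0, offset)
        run_start = None
        # Walk one step past the end of the diagonal to flush an open run.
        while i <= len(q) and j <= len(d):
            matched = i < len(q) and j < len(d) and q[i] == d[j]
            if matched and run_start is None:
                run_start = i
            elif not matched and run_start is not None:
                if i - run_start >= min_len:
                    runs.append((run_start, run_start + offset, q[run_start:i]))
                run_start = None
            i += 1
            j += 1
    return runs


q = "ANTGDSCTAWCDEFGHIKPQWERTY"
d = "TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT"

forward = diagonal_runs(q, d)
print(forward)  # runs on ordinary diagonals

# Anti-diagonal runs: compare q with d reversed.
reverse = diagonal_runs(q, d[::-1])
print(sorted({substring for _, _, substring in reverse}))
```

The forward scan reports CDEFGHIK once for each of its two occurrences in d, and the reversed scan includes ERT, which is TRE read backwards.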

  23. Smith-Waterman Method
      • We require an objective score for an alignment
      • Can employ the dynamic programming method (Lecture 1), though it requires some changes
      • Consider the following two sequences
        – q = ACEDECADE
        – d = REDCEDKL
      • We do not know at which symbols (residues) the highest scoring local alignments end, so all pairs should be considered
      • Consider q_8 and d_6

  24. Smith-Waterman Method
      • Consider q_8 and d_6, i.e. q = ACEDECAD & d = REDCED
      • Scoring: 0.5 equality, -0.3 inequality, -0.5 gap

        q_8    A     C     E     D     E     C     A     D
        d_6    R     -     E     D     -     C     E     D
        c.s  -0.3  -0.5   0.5   0.5  -0.5   0.5  -0.3   0.5
        a.s  -0.3  -0.8  -0.3   0.2  -0.3   0.2  -0.1   0.4

        (c.s is the score of each column; a.s is the running total)
      • Removing the first two pairs, which have negative scores, improves the alignment score
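
The c.s and a.s rows can be reproduced directly from the aligned rows. This is a small sketch, not the lecturer's code: the function name `column_scores` is invented here, and the scores are held as integer tenths (5, -3, -5) purely to avoid floating-point rounding in the running totals.

```python
from itertools import accumulate

# Slide scores 0.5 / -0.3 / -0.5, stored in tenths as integers.
MATCH, MISMATCH, GAP = 5, -3, -5


def column_scores(row_q: str, row_d: str) -> list:
    """Score each column of an alignment given as two equal-length rows."""
    scores = []
    for a, b in zip(row_q, row_d):
        if a == "-" or b == "-":
            scores.append(GAP)
        elif a == b:
            scores.append(MATCH)
        else:
            scores.append(MISMATCH)
    return scores


full = column_scores("ACEDECAD", "R-ED-CED")
trimmed = full[2:]  # drop the first two (negative) columns

full_totals = [t / 10 for t in accumulate(full)]
trimmed_totals = [t / 10 for t in accumulate(trimmed)]
print(full_totals)     # running total over all eight columns
print(trimmed_totals)  # running total after trimming
```

The final running total rises from 0.4 to 1.2 once the two leading negative columns are dropped, which is the improvement the last bullet describes.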

  25. Smith-Waterman Method

        q_3…8   E     D     E     C     A     D
        d_2…6   E     D     -     C     E     D
        c.s    0.5   0.5  -0.5   0.5  -0.3   0.5
        a.s    0.5   1.0   0.5   1.0   0.7   1.2
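
The trimming done by hand above is what the Smith-Waterman matrix does automatically by never letting a cell drop below zero. The sketch below fills that matrix for the lecture's q and d using the slide's scores (as integer tenths) and a linear penalty per space. The function names `smith_waterman` and `traceback` and the tie-breaking order in the traceback are choices made here, not details from the slides.

```python
MATCH, MISMATCH, GAP = 5, -3, -5  # slide scores 0.5 / -0.3 / -0.5 in tenths


def smith_waterman(q: str, d: str):
    """Fill the local-alignment matrix; return it with the best score and cell."""
    H = [[0] * (len(d) + 1) for _ in range(len(q) + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, len(q) + 1):
        for j in range(1, len(d) + 1):
            s = MATCH if q[i - 1] == d[j - 1] else MISMATCH
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + GAP, H[i][j - 1] + GAP)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return H, best, best_cell


def traceback(H, q: str, d: str, cell):
    """Walk back from `cell` until a zero is reached; return the aligned rows."""
    i, j = cell
    row_q, row_d = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = MATCH if q[i - 1] == d[j - 1] else MISMATCH
        if H[i][j] == H[i - 1][j - 1] + s:    # aligned pair
            row_q.append(q[i - 1]); row_d.append(d[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + GAP:    # space in d
            row_q.append(q[i - 1]); row_d.append("-"); i -= 1
        else:                                 # space in q
            row_q.append("-"); row_d.append(d[j - 1]); j -= 1
    return "".join(reversed(row_q)), "".join(reversed(row_d))


q, d = "ACEDECADE", "REDCEDKL"
H, best, best_cell = smith_waterman(q, d)

# Best local alignment ending at q_8, d_6, the pair considered on the slides.
print(H[8][6] / 10, traceback(H, q, d, (8, 6)))
# Best local alignment over the whole matrix.
print(best / 10, best_cell, traceback(H, q, d, best_cell))
```

Under these assumptions the cell for q_8 and d_6 holds 1.2 and traces back to the EDECAD / ED-CED alignment shown on the slide, while the overall maximum of the matrix lies at a different cell, which illustrates why every end pair has to be considered.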
