Bioinformatics Sequence comparison 2 local pairwise alignment

Bioinformatics Sequence comparison 2 local pairwise alignment David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow

Lecture contents
• Variations on dynamic programming
  – Gap penalties
  – Substitution matrices
• To explain why local alignments may be more appropriate than global ones
• To describe the use of dot plots in visualising an alignment
• To describe the Smith-Waterman method of finding and scoring an optimal local pairwise alignment
• To describe in outline the BLAST algorithm for database search
(c) David Gilbert 2008 Sequence Comparison (2)

Solution to Week 1 Exercise 0 A E E C A 0 C D A A

Percentage sequence identity
  % identity = (number of identical residues x 100) / (number of residues in the smallest sequence)
The value can differ depending on whether gaps are allowed. Compute it for these sequences, without gaps (left) and with gaps (right):
  TGCATA       -TGCAT-A-
    |  |        | | | |
  ATCTGAT      AT-C-TGAT
Sequence similarity: a substitution at an amino-acid residue or nucleotide that preserves the physico-chemical properties of the residue counts as similar.
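The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name percent_identity is made up here, "-" is assumed to be the gap symbol, and two aligned rows of unequal length are simply compared up to the end of the shorter row.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percentage identity of two aligned rows.

    Identical residues are counted column by column; the denominator is the
    number of residues (gap symbols excluded) in the smaller sequence.
    """
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    residues_a = sum(1 for c in row_a if c != gap)
    residues_b = sum(1 for c in row_b if c != gap)
    return 100.0 * identical / min(residues_a, residues_b)


# The two alignments from the slide: without gaps and with gaps.
ungapped = percent_identity("TGCATA", "ATCTGAT")
gapped = percent_identity("-TGCAT-A-", "AT-C-TGAT")
print(f"ungapped: {ungapped:.1f}%  gapped: {gapped:.1f}%")
```

Running it on both alignments shows directly how much the figure depends on whether gaps are allowed.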

β and α globin, without gaps
β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
β KEFTPPVQAAYQKVVAGVANALAHKYH
α VHASLDKFLASVSTVLTSKYR
Compute the identity%

β and α globin, with gaps
CLUSTAL W (1.81) multiple sequence alignment
β MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
α --VLSPADKTNVKAAWGKVGAHAG----EYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ
β VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
α VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
β KEFTPPVQAAYQKVVAGVANALAHKYH
α AEFTPAVHASLDKFLASVSTVLTSKYR
Compute the identity%

Human beta globin hits coyote!
What happened?
Blast output
>SW:HBB_CANFA P02056 HEMOGLOBIN BETA CHAIN.
  Length = 146
  Score = 276 bits (698), Expect = 2e-74
  Identities = 131/146 (89%), Positives = 137/146 (93%)
Query: 2   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
           VHLT EEKS V+ LWGKVNVDEVGGEALGRLL+VYPWTQRFF+SFGDLSTPDAVM N KV
Sbjct: 1   VHLTAEEKSLVSGLWGKVNVDEVGGEALGRLLIVYPWTQRFFDSFGDLSTPDAVMSNAKV 60
Query: 62  KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
           KAHGKKVL +FSDGL +LDNLKGTFA LSELHCDKLHVDPENF+LLGNVLVCVLAHHFGK
Sbjct: 61  KAHGKKVLNSFSDGLKNLDNLKGTFAKLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120
Query: 122 EFTPPVQAAYQKVVAGVANALAHKYH 147
           EFTP VQAAYQKVVAGVANALAHKYH
Sbjct: 121 EFTPQVQAAYQKVVAGVANALAHKYH 146
Compute the identity%

Penalising gaps
• Gap = maximal consecutive run of spaces in an alignment (1 or more spaces)
• Simple penalty: each gap contributes a constant weight
• More complex: gap penalty proportional to gap length
• Large gap penalty → few gaps (fewer substrings in the alignment). Small penalty → fragmented alignments.
• FASTA:
  – GAPOPEN: penalty for the first residue in a gap (-12 for proteins, -16 for DNA)
  – GAPEXT: penalty for additional residues in a gap (-2 for proteins, -4 for DNA)
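The three penalty schemes on this slide can be compared with a short Python sketch. The function names are illustrative only, and the default values are the FASTA protein figures quoted above. The per-space weight used for the proportional scheme is an arbitrary choice made for the example.

```python
def constant_gap_penalty(length, weight=-12):
    """Simple scheme: every gap costs the same, whatever its length."""
    return weight if length > 0 else 0


def proportional_gap_penalty(length, per_space=-2):
    """Penalty proportional to gap length (per_space is an assumed value)."""
    return per_space * length


def open_extend_gap_penalty(length, gap_open=-12, gap_ext=-2):
    """FASTA-style scheme as described on the slide: gap_open for the first
    residue in the gap, gap_ext for each additional residue."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_ext


# Compare the schemes for gaps of length 1 to 4 (protein defaults).
for n in range(1, 5):
    print(n, constant_gap_penalty(n), proportional_gap_penalty(n),
          open_extend_gap_penalty(n))
```

With the open/extend scheme, one long gap is cheaper than several short ones of the same total length, which is what discourages fragmented alignments.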

Substitution matrices
• Unitary matrix: match=1, mismatch=0
  – sparse matrix (most elements are 0)
      A C G T
    A 1 0 0 0
    C 0 1 0 0
    G 0 0 1 0
    T 0 0 0 1
• Poor diagnostic power: all identical matches carry identical weighting
• We can enhance the scoring potential of weak but biologically significant signals
• Scoring matrices: weight matches for non-identical residues according to observed substitution rates
• More on this later!
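As a small illustration, the unitary matrix can be held as a lookup table keyed by residue pairs. This is a sketch under assumptions: the names unitary_matrix and ungapped_score are invented for the example, and the DNA alphabet ACGT is taken from the slide. A matrix based on observed substitution rates would replace only the table, not the scoring loop.

```python
def unitary_matrix(alphabet="ACGT"):
    """Build the unitary scoring matrix: 1 on a match, 0 on a mismatch."""
    return {(a, b): 1 if a == b else 0 for a in alphabet for b in alphabet}


def ungapped_score(matrix, s, t):
    """Sum the matrix entries over the columns of an ungapped alignment
    (comparison stops at the end of the shorter sequence)."""
    return sum(matrix[(a, b)] for a, b in zip(s, t))


matrix = unitary_matrix()
print(ungapped_score(matrix, "TGCATA", "ATCTGAT"))
```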

Global and local alignment
• Global alignment: as per the dynamic programming solution already explained
  – Needleman & Wunsch algorithm (1970)
• Local alignment: find local regions from each string which are similar
  – Corresponds to shorter, localised paths in the matrix
  – Justification: biological functional sites are localised to short conserved regions (no indels/mutations)
  – Smith-Waterman algorithm (1981)

Local alignment
• Start and end the dynamic programming computation at any cells instead of (0,0) and (i,j)
• The matrix contains a maximum value that may not be at (i,j) [the end of the input sequences]
  – it represents the endpoint of an alignment such that no other pair of segments with greater similarity exists between the 2 sequences
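The idea of starting and ending anywhere can be sketched as a Smith-Waterman style matrix fill in Python. This is a minimal sketch, not the lecture's own code: the function name and the scores (match +1, mismatch -1, gap -1 per space) are assumptions chosen for the example, and gaps are penalised per space rather than with separate open and extend costs. Each cell is clamped at 0 so an alignment can start at any cell, and the traceback begins at the maximum cell so it can end at any cell.

```python
def smith_waterman(q, d, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned q segment, aligned d segment)."""
    rows, cols = len(q) + 1, len(d) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    # Fill: each cell is the best score of a local alignment ending there.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if q[i - 1] == d[j - 1] else mismatch
            H[i][j] = max(0,                     # start a new alignment here
                          H[i - 1][j - 1] + pair,  # align q[i-1] with d[j-1]
                          H[i - 1][j] + gap,       # gap in d
                          H[i][j - 1] + gap)       # gap in q
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Traceback from the maximum cell until a zero cell is reached.
    i, j = best_cell
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if q[i - 1] == d[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(q[i - 1]); bottom.append(d[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(q[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(d[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("XXACGTYY", "ZZACGTWW")
print(score, top, bottom)
```

Integer scores are used so that the equality checks in the traceback are exact. With fractional scores the traceback would be better driven by stored pointers than by comparing sums.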

Global vs local alignment
• Global: Needleman & Wunsch
• Local: Smith & Waterman

Local Pairwise Alignment
• Distantly related sequences, e.g. proteins
  – Uneven accumulation of mutations along sequence
• Similar segments in overall dissimilar sequences
  – Rearrangement of gene segments in genome
• Related sub-sequences in unrelated genes
• Local similarity corresponds to
  – Shared structural or functional motif
    • Robust to mutations
    • Evolutionarily important
• Global alignment may fail in such cases
  – Island of similarity lost in random symbol matches

Local Pairwise Alignment
• Need to find similar segments in sequences
• Database search task: find homologous sequences {d} to query q in database D
  – In a reasonable time
  – Present only homologous sequences (True Positives)
  – Do not present non-homologous sequences (False Positives)
• First: how to find local alignments?

Dot Matrices
• First technique to discover local similarities
  – M by N matrix created
  – symbols of q (length M) on one axis, symbols of d (length N) on the other
  – Matrix populated with dots and spaces
  – Dot in cell (i,j) indicates that q(i) = d(j)
• Easy to understand visualisation
• Common substrings found easily
  – contiguous diagonal dots
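The construction described above can be sketched in Python as follows. The function names dot_matrix and render are invented for this example, and the two short sequences are used only as sample input. Dots are drawn as "x" and empty cells as "." so the grid stays readable as plain text.

```python
def dot_matrix(q, d):
    """M by N boolean matrix: cell (i, j) is True when q[i] == d[j]."""
    return [[q_sym == d_sym for d_sym in d] for q_sym in q]


def render(q, d):
    """Text rendering with d along the top and q down the side."""
    lines = ["  " + " ".join(d)]
    for q_sym, row in zip(q, dot_matrix(q, d)):
        cells = " ".join("x" if dot else "." for dot in row)
        lines.append(q_sym + " " + cells)
    return "\n".join(lines)


print(render("HAKPEEKSAVT", "HLTPEEKSVHT"))
```

A run of "x" marks along a diagonal in the output corresponds to a common substring of the two sequences.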

Dot plots
• A convenient way of comparing 2 sequences visually
• Use a matrix: put 1 sequence on the X-axis, 1 on the Y-axis
• Cells with identical characters are filled with a ‘1’, non-identical with ‘0’ (simplest scheme; could have weights)
• Identical sequences will look like WHAT?
• Similar sequences will have a broken diagonal, plus some other lines
• Distantly related sequences: much noisier
• Can generate an alignment
• Best path through dotplot given by dynamic programming algorithms:
  – global alignment = Needleman & Wunsch
  – local alignment = Smith & Waterman

Dot plots
    H L T P E E K S V H T
  H
  A
  K
  P
  E
  E
  K
  S
  A
  V
  T

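Dot plots
    H L T P E E K S V H T
  H x . . . . . . . . x .
  A . . . . . . . . . . .
  K . . . . . . x . . . .
  P . . . x . . . . . . .
  E . . . . x x . . . . .
  E . . . . x x . . . . .
  K . . . . . . x . . . .
  S . . . . . . . x . . .
  A . . . . . . . . . . .
  V . . . . . . . . x . .
  T . . x . . . . . . . x
Alignment
  HLTPEEKSVHT
  |  |||||  |
  HAKPEEKSAVT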

A dotplot

Try a dotplot and alignment M T F R D L L S V S F E G P R P M T F R D L L S V S F E G P R P D S S A G G

Try a dotplot and alignment
• Two sequences
  q = ANTGDSCTAWCDEFGHIKPQWERTY
  d = TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT

Dot Matrices
• Easy to identify common recurring substring CDEFGHIK
• Anti-diagonal identifies reversed substring TRE
• Matrix image can be ‘noisy’
  – Most of the dots are not associated with a common substring
• Matrices can be very large and unwieldy for typical protein sequences (~500 to ~1000 aa)
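Reading the longest run of diagonal dots off the matrix can also be done programmatically. The sketch below is one possible way and is not taken from the lecture: the function name longest_diagonal_run is invented, and it tracks, for every cell, the length of the unbroken diagonal of matches ending there, keeping only the previous row in memory so the full matrix is never stored.

```python
def longest_diagonal_run(q, d):
    """Longest common substring of q and d, i.e. the longest unbroken
    diagonal of dots in their dot matrix (first one found if tied)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(d) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                # Extend the diagonal run that ended at cell (i-1, j-1).
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return q[best_end - best_len:best_end]


q = "ANTGDSCTAWCDEFGHIKPQWERTY"
d = "TREDFGAACDEFGHIKLHYTYTRTRERAECDEFGHIKHYGT"
print(longest_diagonal_run(q, d))
```

Calling the same function with d reversed (d[::-1]) scans the anti-diagonals instead, which is how reversed substrings show up.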

Smith-Waterman Method
• Require an objective score of alignment
• Can employ the dynamic programming method (Lecture 1), though it requires some changes
• Consider the following two sequences
  – q = ACEDECADE
  – d = REDCEDKL
• Unsure at which symbols (residues) the highest scoring local alignments end, so all pairs should be considered
• Consider q8 and d6 (the first 8 symbols of q and the first 6 symbols of d)

Smith-Waterman Method
• Consider q8 and d6, i.e. q = ACEDECAD and d = REDCED
• Scoring: 0.5 equality, -0.3 inequality, -0.5 gap
  q8      A     C     E     D     E     C     A     D
  d6      R     -     E     D     -     C     E     D
  c.s.  -0.3  -0.5   0.5   0.5  -0.5   0.5  -0.3   0.5
  a.s.  -0.3  -0.8  -0.3   0.2  -0.3   0.2  -0.1   0.4
  (each a.s. entry is the running total of the c.s. entries up to that column)
• Removing the first two pairs from the alignment will improve the alignment score, since both have negative scores

Smith-Waterman Method
  q3…8    E     D     E     C     A     D
  d2…6    E     D     -     C     E     D
  c.s.   0.5   0.5  -0.5   0.5  -0.3   0.5
  a.s.   0.5   1.0   0.5   1.0   0.7   1.2
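The effect of dropping the negative-scoring leading pairs can be checked with a short Python sketch. It is an illustration under assumptions rather than the Smith-Waterman method itself: it takes one fixed alignment (the q8/d6 rows from the previous slide), scores each column with the 0.5 / -0.3 / -0.5 scheme, and then picks the contiguous run of columns with the highest total. The function names are invented for the example.

```python
from itertools import accumulate


def column_scores(q_row, d_row, match=0.5, mismatch=-0.3, gap=-0.5):
    """Score each column of a fixed alignment ('-' marks a gap)."""
    scores = []
    for a, b in zip(q_row, d_row):
        if a == "-" or b == "-":
            scores.append(gap)
        elif a == b:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores


def best_segment(scores):
    """Return ((start, end), total) for the contiguous run of columns with
    the highest total; end is exclusive. The running total restarts whenever
    it is no longer positive, so negative leading columns are dropped."""
    best_total, best_span = 0.0, (0, 0)
    run_total, run_start = 0.0, 0
    for k, s in enumerate(scores):
        if run_total <= 0:
            run_total, run_start = 0.0, k
        run_total += s
        if run_total > best_total:
            best_total, best_span = run_total, (run_start, k + 1)
    return best_span, best_total


cs = column_scores("ACEDECAD", "R-ED-CED")
print([round(x, 1) for x in accumulate(cs)])  # running totals for all 8 columns
span, total = best_segment(cs)
print(span, round(total, 1))                  # columns kept and their total
```

Restarting the running total when it is no longer positive is the same idea as the zero floor in the local-alignment matrix: a segment that has accumulated a negative score is never worth carrying forward.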
