Similarity searches in biological sequence databases, Volker Flegel

Similarity searches in biological sequence databases Volker Flegel september 2004 Page 1

Outline

Keyword search in databases
• General concept
Examples
• SRS http://srs.ebi.ac.uk
• Entrez http://www.ncbi.nih.gov/Entrez/index.html
• Expasy http://www.expasy.uniprot.org/search/textSearch.shtml

Similarity searches in databases
• Goal
• Definitions
• Alignment visualisation
• Alignment algorithms
Examples
• FASTA
• BLAST and its gory details

Keyword search

Accessing database entries
• Each database uses its own specific access methods
• Several kinds of search are possible, depending on the data stored:
  – Identification number (unique)
  – Authors
  – Keywords, ...

Biological sequence databases
• Use a unique identification number to retrieve a specific sequence
• This identification number must remain constant across database releases
• Genbank / EMBL / DDBJ: accession.version
• Swiss-Prot: accession and id (Note: the id may change)

Genbank entry example

LOCUS       AF455746_1     80 aa     PRI     08-JAN-2002
DEFINITION  ubiquitin-conjugating enzyme [Homo sapiens].
ACCESSION   AAL58874
PID         g18087414
VERSION     AAL58874.1  GI:18087414
DBSOURCE    locus AF455746 accession AF455746.1
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 80)
  AUTHORS   Poloumienko,A.
  TITLE     Exon-intron structure of the mammalian ubiquitin-conjugating
            enzyme (HR6A) genes
  JOURNAL   Unpublished
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..80
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="X"
                     /cell_line="MCF-7"
     Protein         1..80
                     /product="ubiquitin-conjugating enzyme"
     CDS             1..80
                     /gene="HR6A"
                     /coded_by="join(AF455746.1:<1..64,AF455746.1:1057..1145,
                     AF455746.1:1594..>1680)"
ORIGIN
        1 teeypnkppt vrfvskmfhp nvyadgsicl dilqnrwspt ydvssiltsi qslldepnpn
       61 spansqaaql yqenkreyek
//

SwissProt entry example

ID   UBCA_HUMAN     STANDARD;      PRT;   152 AA.
AC   P49459;
DT   01-FEB-1996 (Rel. 33, Created)
DT   01-FEB-1996 (Rel. 33, Last sequence update)
DT   16-OCT-2001 (Rel. 40, Last annotation update)
DE   Ubiquitin-conjugating enzyme E2-17 kDa (EC 6.3.2.19)
DE   (Ubiquitin-protein ligase) (Ubiquitin carrier protein) (HR6A).
GN   UBE2A.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=92020951; PubMed=1717990;
RA   Koken M.H.M., Reynolds P., Jaspers-Dekker I., Prakash L., Prakash S.,
RA   Bootsma D., Hoeijmakers J.H.J.;
RT   "Structural and functional conservation of two human homologs of the
RT   yeast DNA repair gene RAD6.";
RL   Proc. Natl. Acad. Sci. U.S.A. 88:8865-8869(1991).
(...)
DR   EMBL; M74524; AAA35981.1; -.
DR   HSSP; P25865; 2AAK.
DR   MIM; 312180; -.
DR   InterPro; IPR000608; UBQ_conjugat.
DR   Pfam; PF00179; UQ_con; 1.
DR   SMART; SM00212; UBCc; 1.
DR   PROSITE; PS00183; UBIQUITIN_CONJUGAT_1; 1.
DR   PROSITE; PS50127; UBIQUITIN_CONJUGAT_2; 1.
KW   Ubiquitin conjugation; Ligase; Multigene family.
FT   BINDING      88     88       UBIQUITIN (BY SIMILARITY).
SQ   SEQUENCE   152 AA;  17243 MW;  7A86173D5FAE6DE1 CRC64;
     MSTPARRRLM RDFKRLQEDP PAGVSGAPSE NNIMVWNAVI FGPEGTPFGD GTFKLTIEFT
     EEYPNKPPTV RFVSKMFHPN VYADGSICLD ILQNRWSPTY DVSSILTSIQ SLLDEPNPNS
     PANSQAAQLY QENKREYEKR VSAIVEQSWR DC
//

Similarity searches

Concept
• Generalisation (asymmetric) of a pairwise comparison

                        Query       Subject
Pairwise alignment      sequence    sequence
Similarity searches     sequence    database
Database vs. database   database    database

Theoretical considerations

Similar to those of pairwise comparison
• Sequence divergence is due to evolutionary mechanisms
• Sequence similarity allows information to be extrapolated:
  – Sequence history and origin
  – Biological function
  – 3D structure

Alignment types
• Global: alignment between the complete sequence A and the complete sequence B
• Local: alignment between a sub-sequence of A and a sub-sequence of B

Computer implementation (algorithms)
• Dynamic programming
• Global: Needleman-Wunsch
• Local: Smith-Waterman

Problems to solve

Similarity search mechanism
• A pairwise comparison is done successively between the query and every sequence of the database

Obstacles
• The complexity of the task is proportional to the size of the database
  – Extremely long running time of the search
  – Difficult biological interpretation of the results

Solutions
• Reduce search time by using more powerful computers
• Reduce search time by using newer and faster algorithms (heuristics)
• Sort and analyse the resulting alignments using statistical methods
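The search mechanism described above can be sketched as a loop over the database. This is only an illustration: `shared_kmers` is a hypothetical toy score (number of shared 3-letter words) standing in for a real pairwise alignment score, and the database is a plain dictionary invented for the example. What it shows is that the query is compared once with every entry, so the running time grows with the size of the database, and that the hits are then sorted for interpretation.

```python
from typing import Callable, Dict, List, Tuple


def shared_kmers(a: str, b: str, k: int = 3) -> int:
    """Toy similarity (assumption): number of distinct k-mers both sequences contain."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def search(
    query: str,
    database: Dict[str, str],
    score: Callable[[str, str], int] = shared_kmers,
) -> List[Tuple[str, int]]:
    """Compare the query with every subject, then rank subjects by score."""
    hits = [(name, score(query, subject)) for name, subject in database.items()]
    hits.sort(key=lambda hit: hit[1], reverse=True)  # best hit first
    return hits


if __name__ == "__main__":
    toy_db = {
        "s1": "GARFIELDTHELASTCAT",
        "s2": "MSTPARRRLM",
        "s3": "GARFIELD",
    }
    for name, value in search("GARFIELDTHECAT", toy_db):
        print(name, value)
```

Replacing `shared_kmers` with an exact alignment score gives the slow but exhaustive search; heuristic tools aim to avoid running the full comparison against every entry.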

Definitions

Query
Sequence that is being compared against the database.

Subject
Sequence of the database that matches the query.

Exact algorithm
An exact algorithm is guaranteed to find the best alignment, or at least one of the best in case of a tie.

Heuristic algorithm
A heuristic algorithm is not guaranteed to find the best alignment, but good ones often do, and much more quickly than exact ones.

Some more definitions

Identity
Proportion of pairs of identical residues between two aligned sequences, generally expressed as a percentage. This value depends strongly on how the two sequences are aligned.

Similarity
Proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix. This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.

Homology
Two sequences are homologous if and only if they have a common ancestor. There is no such thing as a level of homology! (It is either yes or no.)
• Homologous sequences do not necessarily serve the same function...
• ... nor are they always highly similar: structure may be conserved while sequence is not.
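A minimal sketch of how identity and similarity could be computed for an already aligned pair. Several choices here are assumptions, not part of the definitions: the denominator is the alignment length (other conventions divide by the shorter sequence or by the aligned positions only), a pair counts as "similar" when its substitution score is positive, and gap columns count as neither. The tiny matrix `SLIDE_SCORES` holds only the four pair scores needed for the illustration.

```python
from typing import Dict, Tuple

Matrix = Dict[Tuple[str, str], int]

# Only the pairs needed for the example; a real matrix covers all residue pairs.
SLIDE_SCORES: Matrix = {("T", "A"): 1, ("P", "P"): 6, ("E", "G"): 0, ("A", "A"): 2}


def pair_score(matrix: Matrix, x: str, y: str) -> int:
    """Look up a symmetric substitution score (raises KeyError if the pair is absent)."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def identity_and_similarity(aln_a: str, aln_b: str, matrix: Matrix) -> Tuple[float, float]:
    """Return (percent identity, percent similarity) over the alignment length."""
    if len(aln_a) != len(aln_b) or not aln_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = 0
    similar = 0
    for x, y in zip(aln_a, aln_b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
        if pair_score(matrix, x, y) > 0:  # assumption: positive score means "similar"
            similar += 1
    length = len(aln_a)
    return 100.0 * identical / length, 100.0 * similar / length


if __name__ == "__main__":
    print(identity_and_similarity("TPEA", "APGA", SLIDE_SCORES))
```

Changing either the alignment or the matrix changes both numbers, which is the dependence the definitions point out.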

Alignment score

Amino acid substitution matrices
• Example: PAM250
• Most used: Blosum62

Raw score of an alignment

TPEA
¦| |
APGA

Score = 1 + 6 + 0 + 2 = 9
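The raw score of the ungapped TPEA / APGA example is simply the sum of the substitution scores of each aligned pair. The sketch below reproduces that sum; the dictionary `EXAMPLE_SCORES` contains only the four pair scores used in the example (T/A = 1, P/P = 6, E/G = 0, A/A = 2), and the function name `raw_score` is chosen for illustration.

```python
from typing import Dict, Tuple

# Pair scores read off the example; a full matrix would list every residue pair.
EXAMPLE_SCORES: Dict[Tuple[str, str], int] = {
    ("T", "A"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}


def raw_score(seq_a: str, seq_b: str, matrix: Dict[Tuple[str, str], int]) -> int:
    """Sum the substitution scores of an ungapped alignment, column by column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    total = 0
    for x, y in zip(seq_a, seq_b):
        # The matrix is symmetric, so try both orientations of the pair.
        total += matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]
    return total


if __name__ == "__main__":
    print(raw_score("TPEA", "APGA", EXAMPLE_SCORES))  # 1 + 6 + 0 + 2
```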

Insertions and deletions

Gap penalties

Seq A  GARFIELDTHE----CAT
       |||||||||||    |||
Seq B  GARFIELDTHELASTCAT

(Figure labels: gap opening, gap extension, gap.)

• Opening a gap penalizes an alignment score
• Each extension of a gap penalizes the alignment's score
• The gap opening penalty is in general higher than the gap extension penalty (simulating evolutionary behavior)
• The raw score of a gapped alignment is the sum of all amino acid substitution scores, from which we subtract the gap opening and extension penalties.
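As a rough sketch of the gapped raw score, the function below walks through an aligned pair and subtracts an opening penalty for the first position of each gap and an extension penalty for every further position. All numbers are assumptions made for the illustration: a simple match/mismatch score of +1/-1 replaces a real substitution matrix, and `gap_open=5`, `gap_extend=1` are arbitrary. Conventions also differ between programs on whether the first gap position pays the opening penalty alone or opening plus extension; the first reading is used here.

```python
def gapped_raw_score(
    aln_a: str,
    aln_b: str,
    match: int = 1,
    mismatch: int = -1,
    gap_open: int = 5,
    gap_extend: int = 1,
) -> int:
    """Score an existing gapped alignment with affine gap penalties (toy parameters)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in_a = False  # True while a gap in sequence A is being extended
    gap_in_b = False  # True while a gap in sequence B is being extended
    for x, y in zip(aln_a, aln_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-":
            score -= gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            score -= gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


if __name__ == "__main__":
    # 14 matches, one gap of length 4: 14 - (5 + 3 * 1)
    print(gapped_raw_score("GARFIELDTHE----CAT", "GARFIELDTHELASTCAT"))
```

With these toy parameters one gap of length 4 costs 8, whereas four separate gaps of length 1 would cost 20, which is the behavior the higher opening penalty is meant to produce.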

Alignment visualisation

Matrix - Text - Dotplot
– An alignment is a path through a graph
– DotPlot: graphical view in 2 dimensions; a visual aid to identify regions of similarity

[Figure: dotplot of Tissue-Type plasminogen Activator against Urokinase-Type plasminogen Activator]
Address: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Text view of two alignments of Seq A and Seq B:

Seq B  A-CA-CA
       | || |
Seq A  ACCAAC-

Seq B  ACA--CA
       |
Seq A  A-CCAAC
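A dotplot can be sketched in a few lines: put one sequence on each axis and mark every cell where the two residues are identical. This is a simplified illustration, not how Dotlet works internally; real dotplot tools usually score a sliding window with a substitution matrix and apply a threshold, whereas this sketch uses a window of one residue and exact identity. The function names are chosen for the example, and the sequences are the ungapped Seq A (ACCAAC) and Seq B (ACACA) from the text alignments.

```python
from typing import List


def dotplot(seq_a: str, seq_b: str) -> List[List[bool]]:
    """Return a matrix with one row per residue of seq_a and one column per residue of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(grid: List[List[bool]]) -> str:
    """Draw the matrix as text: '*' for an identical pair, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    print(render(dotplot("ACCAAC", "ACACA")))
```

Diagonal runs of marks correspond to ungapped stretches of an alignment; a shift from one diagonal to another corresponds to a gap.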

Optimal alignment extension

How to optimally extend an optimal alignment
• An optimal alignment up to positions i and j can be extended in 3 ways.
• Keeping the best of the 3 guarantees an extended optimal alignment.

Optimal alignment up to positions i and j:

Seq A  a_1 a_2 a_3 ... a_{i-1} a_i
Seq B  b_1 b_2 b_3 ... b_{j-1} b_j

1. Align a_{i+1} with b_{j+1}:

Seq A  a_1 a_2 a_3 ... a_{i-1} a_i  a_{i+1}
Seq B  b_1 b_2 b_3 ... b_{j-1} b_j  b_{j+1}
Score = Score_{i,j} + Subst_{i+1,j+1}

2. Align a_{i+1} with a gap:

Seq A  a_1 a_2 a_3 ... a_{i-1} a_i  a_{i+1}
Seq B  b_1 b_2 b_3 ... b_{j-1} b_j  -
Score = Score_{i,j} - gap

3. Align a gap with b_{j+1}:

Seq A  a_1 a_2 a_3 ... a_{i-1} a_i  -
Seq B  b_1 b_2 b_3 ... b_{j-1} b_j  b_{j+1}
Score = Score_{i,j} - gap

• We have the optimal alignment extended from i and j by one residue.
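The three-way extension rule is the core of global dynamic programming alignment (Needleman-Wunsch). The sketch below fills a score matrix in which each cell keeps the best of the three extensions. The scoring is an assumption made to keep the example short: +1 for a match, -1 for a mismatch and a single linear gap penalty of 2 per gap position, instead of a substitution matrix with separate opening and extension penalties. Only the optimal score is returned; recovering the alignment itself would need a traceback, which is omitted.

```python
def needleman_wunsch_score(
    seq_a: str,
    seq_b: str,
    match: int = 1,
    mismatch: int = -1,
    gap: int = 2,
) -> int:
    """Optimal global alignment score with a linear gap penalty (toy parameters)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score for aligning the first i residues of A with the first j of B
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = -i * gap  # prefix of A aligned against gaps only
    for j in range(1, cols):
        score[0][j] = -j * gap  # prefix of B aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            subst = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + subst,  # residue of A aligned with residue of B
                score[i - 1][j] - gap,        # residue of A aligned with a gap
                score[i][j - 1] - gap,        # gap aligned with residue of B
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GARFIELDTHECAT", "GARFIELDTHELASTCAT"))
```

Taking the maximum with 0 in each cell and reporting the highest cell anywhere in the matrix would turn the same recursion into a local (Smith-Waterman style) alignment.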
