Similarity searches in biological sequence databases Volker Flegel - - PowerPoint PPT Presentation

similarity searches in biological sequence databases
SMART_READER_LITE
LIVE PREVIEW

Similarity searches in biological sequence databases Volker Flegel - - PowerPoint PPT Presentation

Similarity searches in biological sequence databases Volker Flegel september 2004 Page 1 Outline Keyword search in databases General concept Examples SRS http://srs.ebi.ac.uk Entrez http://www.ncbi.nih.gov/Entrez/index.html


slide-1
SLIDE 1

september 2004 Page 1

Similarity searches in biological sequence databases

Volker Flegel

slide-2
SLIDE 2

september 2004 Page 2

Outline

Keyword search in databases

  • General concept

Examples

  • SRS

http://srs.ebi.ac.uk

  • Entrez

http://www.ncbi.nih.gov/Entrez/index.html

  • Expasy

http://www.expasy.uniprot.org/search/textSearch.shtml

Similarity searches in databases

  • Goal
  • Definitions
  • Alignment visualisation
  • Alignment algorithms

Examples

  • FASTA
  • BLAST and its gory details
slide-3
SLIDE 3

september 2004 Page 3

Keyword search

Accessing database entries

  • Each database uses its own specific access methods

Several kinds of search possibilities according to the data stored

– Identification number (unique) – Authors – Keywords, ...

Biological sequence databases

  • Use a unique identification number to retrieve a specific sequence
  • This identification number must remain constant accross the database

releases

  • Genbank / EMBL / DDBJ

accession.version

  • Swiss-Prot

accession and id (Note: id may change)

slide-4
SLIDE 4

september 2004 Page 4

Genbank entry example

LOCUS AF455746_1 80 aa PRI 08-JAN-2002 DEFINITION ubiquitin-conjugating enzyme [Homo sapiens]. ACCESSION AAL58874 PID g18087414 VERSION AAL58874.1 GI:18087414 DBSOURCE locus AF455746 accession AF455746.1 KEYWORDS . SOURCE human. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (residues 1 to 80) AUTHORS Poloumienko,A. TITLE Exon-intron structure of the mammalian ubiquitin-conjugating enzyme (HR6A) genes JOURNAL Unpublished COMMENT Method: conceptual translation supplied by author. FEATURES Location/Qualifiers source 1..80 /organism="Homo sapiens" /db_xref="taxon:9606" /chromosome="X" /cell_line="MCF-7" Protein 1..80 /product="ubiquitin-conjugating enzyme" CDS 1..80 /gene="HR6A" /coded_by="join(AF455746.1:<1..64,AF455746.1:1057..1145, AF455746.1:1594..>1680)" ORIGIN 1 teeypnkppt vrfvskmfhp nvyadgsicl dilqnrwspt ydvssiltsi qslldepnpn 61 spansqaaql yqenkreyek //

slide-5
SLIDE 5

september 2004 Page 5

SwissProt entry example

ID UBCA_HUMAN STANDARD; PRT; 152 AA. AC P49459; DT 01-FEB-1996 (Rel. 33, Created) DT 01-FEB-1996 (Rel. 33, Last sequence update) DT 16-OCT-2001 (Rel. 40, Last annotation update) DE Ubiquitin-conjugating enzyme E2-17 kDa (EC 6.3.2.19) DE (Ubiquitin-protein ligase) (Ubiquitin carrier protein) (HR6A). GN UBE2A. OS Homo sapiens (Human). OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. OX NCBI_TaxID=9606; RN [1] RP SEQUENCE FROM N.A. RX MEDLINE=92020951; PubMed=1717990; RA Koken M.H.M., Reynolds P., Jaspers-Dekker I., Prakash L., Prakash S., RA Bootsma D., Hoeijmakers J.H.J.; RT "Structural and functional conservation of two human homologs of the RT yeast DNA repair gene RAD6."; RL Proc. Natl. Acad. Sci. U.S.A. 88:8865-8869(1991). (...) DR EMBL; M74524; AAA35981.1; -. DR HSSP; P25865; 2AAK. DR MIM; 312180; -. DR InterPro; IPR000608; UBQ_conjugat. DR Pfam; PF00179; UQ_con; 1. DR SMART; SM00212; UBCc; 1. DR PROSITE; PS00183; UBIQUITIN_CONJUGAT_1; 1. DR PROSITE; PS50127; UBIQUITIN_CONJUGAT_2; 1. KW Ubiquitin conjugation; Ligase; Multigene family. FT BINDING 88 88 UBIQUITIN (BY SIMILARITY). SQ SEQUENCE 152 AA; 17243 MW; 7A86173D5FAE6DE1 CRC64; MSTPARRRLM RDFKRLQEDP PAGVSGAPSE NNIMVWNAVI FGPEGTPFGD GTFKLTIEFT EEYPNKPPTV RFVSKMFHPN VYADGSICLD ILQNRWSPTY DVSSILTSIQ SLLDEPNPNS PANSQAAQLY QENKREYEKR VSAIVEQSWR DC //

slide-6
SLIDE 6

september 2004 Page 6

Similarity searches

Concept

  • Generalisation (asymmetric) of a pairwise comparison

sequence sequence sequence database

Query Subject Pairwise alignment Similarity searches

database database

Database vs. database

slide-7
SLIDE 7

september 2004 Page 7

Theoretical considerations

Similar to those of pairwise comparison

  • Sequence divergence is due to evolutionary mechanisms
  • Sequence similarity allows information extrapolation:

Sequence history and origin Biological function 3D structure Alignement types

  • Global

Alignment between the complete sequence A and the complete sequence B

  • Local

Alignment between a sub-sequence of A and a sub- sequence of B

Computer implementation (Algorithms)

  • Dynamic programing
  • Global

Needleman-Wunsch

  • Local

Smith-Waterman

slide-8
SLIDE 8

september 2004 Page 8

Problems to solve

Similarity search mechanism

  • A pairwise comparison is done successively between the query and every

sequence of the database

Obstacles

  • The complexity of the task is proportional to the size of the database

Extremely long running time of the search Difficult biological interpretation of the results Solutions

  • Reduce search time by using more powerful computers
  • Reduce search time by using newer and faster algorithms (heuristics)
  • Sort and analyse the resulting alignments using statistical methods
slide-9
SLIDE 9

september 2004 Page 9

Definitions

Query

Sequence that is being compared against the database.

Subject

Sequence of the database that matches the query.

Exact algorithm

An exact algorithm is guaranteed to find the best alignment, or at least

  • ne of the best in case of a tie.

Heuristic algorithm

A heuristic algorithm is not guaranteed to find the best alignment. But good ones often do, and much quicker than exact ones.

slide-10
SLIDE 10

september 2004 Page 10

Some more definitions

Identity

Proportion of pairs of identical residues between two aligned sequences. Generally expressed as a percentage. This value strongly depends on how the two sequences are aligned.

Similarity

Proportion of pairs of similar residues between two aligned sequences. If two residues are similar is determined by a substitution matrix. This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.

Homology

Two sequences are homologous if and only if they have a common ancestor. There is no such thing as a level of homology ! (It's either yes or no)

  • Homologous sequences do not necessarily serve the same function...
  • ... Nor are they always highly similar: structure may be conserved while sequence is not.
slide-11
SLIDE 11

september 2004 Page 11

Alignment score

Amino acid substitution matrices

  • Example:

PAM250

  • Most used:

Blosum62

Raw score of an alignment

TPEA ¦| | APGA TPEA ¦| | APGA

Score = 1 = 9 + 6 + + 2

slide-12
SLIDE 12

september 2004 Page 12

Insertions and deletions

Gap penalties

  • Opening a gap penalizes an alignment score
  • Each extension of a gap penalizes the alignment's score
  • The gap opening penalty is in general higher than the gap extension

penalties (simulating evolutionary behavior)

  • The raw score of a gapped alignment is the sum of all amino acid

substitutions from which we subtract the gap opening and extension penalties.

Seq A GARFIELDTHE----CAT ||||||||||| ||| Seq B GARFIELDTHELASTCAT Seq A GARFIELDTHE----CAT ||||||||||| ||| Seq B GARFIELDTHELASTCAT

gap

gap opening gap extension

slide-13
SLIDE 13

september 2004 Page 13

Alignment visualisation

Matrix - Text - Dotplot An alignment is a path through a graph DotPlot: Graphical view in 2 dimensions

Visual aid to identify regions of similarity

Address: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html Tissue-Type plasminogen Activator Urokinase-Type plasminogen Activator

Seq B A-CA-CA | || | Seq A ACCAAC- Seq B A-CA-CA | || | Seq A ACCAAC- Seq B ACA--CA | Seq A A-CCAAC Seq B ACA--CA | Seq A A-CCAAC

slide-14
SLIDE 14

september 2004 Page 14

Optimal alignment extension

How to extend optimaly an optimal alignment

  • An optimal alignment up to positions i and j can be extended in 3 ways.
  • Keeping the best of the 3 guarantees an extended optimal alignment.

Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj

  • We have the optimal alignment extended from i and j by one residue.

Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj ai+1 bj+1 ai+1 bj+1

Score = Scoreij + Substi+1 j+1

Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj ai+1

  • ai+1
  • Score = Scoreij - gap

Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj Seq A a1 a2 a3 ... ai-1 ai Seq B b1 b2 b3 ... bj-1 bj

  • bj+1
  • bj+1

Score = Scoreij - gap

slide-15
SLIDE 15

september 2004 Page 15

Exact algorithms (Needleman-Wunsch / Smith - Waterman)

Simple example (Needleman-Wunsch)

  • Scoring system:

Match score:

2

Mismatch score:

  • 1

Gap penalty:

  • 2

Note

  • We have to keep track of the origin of the score for each element in the

matrix. This allows to build the alignment by traceback when the matrix has been

completely filled out.

  • Computation time is proportional to the size of sequences (n x m).

G A T T A

  • 2
  • 4
  • 6
  • 8
  • 10

G

  • 2

A

  • 4

A

  • 6

T

  • 8

T

  • 10

C

  • 12

G A T T A

  • 2
  • 4
  • 6
  • 8
  • 10

G

  • 2

2

  • 2
  • 4
  • 6

A

  • 4

4

A

  • 6

T

  • 8

T

  • 10

C

  • 12

0 - 2 0 - 2 2 + 2

G A T T A

  • 2
  • 4
  • 6
  • 8
  • 10

G

  • 2

2

  • 2
  • 4
  • 6

A

  • 4

4 2

  • 2

A

  • 6
  • 2

2 3 1 2

T

  • 8
  • 4

4 5 3

T

  • 10
  • 6
  • 2

2 6 4

C

  • 12
  • 8
  • 4

4 5

F(i-1,j) F(i,j)

s(xi,yj)

F(i-1,j-1)

  • d

F(i,j-1)

  • d

F(i,j): score at position i, j s(xi,yj): match or mismatch score (or substitution matrix value) for residues xi and yj d: gap penalty (positive value)

GA-TTA || || GAATTC GA-TTA || || GAATTC

slide-16
SLIDE 16

september 2004 Page 16

Heuristic algorithms

Faster but less sensitive

  • They use the dynamic programming approach like exact algorithms
  • They try to limit its use to sequences which seem interesting

The heuristic part of the algorithm tries to make a clever guess at

which sequences would produce an interesting alignment.

FASTA

  • Developped by Lipman and Pearson in 1985
  • Tries to find sequences having identical words (or k-tuples = k

consecutive residues) in common on a same diagonal.

  • Compares the query sequentially to all those sequences in the database.

Blast

  • Developped by Altschul et al. in 1990
  • The most used and cited bioinformatics tool in biology
  • Online tutorial: www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
slide-17
SLIDE 17

september 2004 Page 17

A Blast for each query

Different programs are available according to the type of query

Program Query Database

blastp protein protein blastn nucleotide nucleotide blastx protein nucleotide protein tblastn protein protein nucleotide tblastx protein nucleotide protein nucleotide

VS VS VS VS VS

slide-18
SLIDE 18

september 2004 Page 18

Access to Blast

Web access

  • Numerous web sites offer access to Blast servers

NCBI (USA) where the Blast program was created

– Provide access to all Blast options and numerous databases – User interface not very intuitive

  • URL: www.ncbi.nlm.nih.gov/BLAST

EMBnet (i.e. Swiss node located in Lausanne at the SIB)

– Several servers across the world – Provide access to all Blast options – Provide a simplified and an advanced user interface – Wide choice of databases

  • URL: www.ch.embnet.org/software/bBLAST.html

(Simple user interface) www.ch.embnet.org/software/aBLAST.html (Advanced user interface)

slide-19
SLIDE 19

september 2004 Page 19

Blast: the gory details

Blast algorithm: creating a list of similar words

REL Query RSL RSL AAA AAC AAD YYY AAA AAC AAD YYY List of all possible words with 3 amino acid residues ... ACT RSL TVF ACT RSL TVF List of words matching the query with a score > T score > T ... ... LKP LKP L K P L K P score < T

A substitution matrix is used to compute the word scores A substitution matrix is used to compute the word scores

slide-20
SLIDE 20

september 2004 Page 20

Blast: the gory details

ACT RSL TVF ACT RSL TVF List of words matching the query with a score > T ... ...

Blast algorithm: eliminating sequences without word hits

ACT ACT ACT RSL RSL TVF RSL RSL RSL RSL TVF TVF Database sequences

List of sequences containing words

similar to the query (hits)

List of sequences containing words

similar to the query (hits) Search for exact matches

slide-21
SLIDE 21

september 2004 Page 21

Blast: the gory details (The End)

Blast algorithm: extension of hits

Database sequence Query A Ungapped extension if:

  • 2 "Hits" are on the same diagonal

but at a distance less than A Database sequence Query A Extension using dynamic programming

  • limited to a restricted region
slide-22
SLIDE 22

september 2004 Page 22

Statistical evaluation of results

Alignments are evaluated according to their score

  • Raw score

It's the sum of the amino acid substitution scores and gap

penalties (gap opening and gap extension)

Depends on the scoring system (substitution matrix, etc.) Different alignments should not be compared based only on the

raw score

  • Normalised score

Is independent of the scoring system Allows the comparison of different alignments Units: expressed in bits

slide-23
SLIDE 23

september 2004 Page 23

Statistical evaluation of results

Statistics derived from the scores

  • p-value

Probability that an alignment with this score occurs by chance in

a database of this size

The closer the p-value is towards 0, the better the alignment

  • e-value

Number of matches with this score one can expect to find by

chance in a database of this size

The closer the e-value is towards 0, the better the alignment

  • Relationship between e-value and p-value:

In a database containing N sequences

e = p x N

100% 0% N

slide-24
SLIDE 24

september 2004 Page 24

Low complexity regions

  • Regions with a high frequency of only a few type of residues (= low

complexity regions) may produce high scoring but biological uninteresting alignments, e.g. polyserine

  • Such regions are, by default, filtered out by Blast. They appear masked

with 'X' in the alignment. They are not taken into account for score computation

slide-25
SLIDE 25

september 2004 Page 25

Basic Blast on EMBnet

www.ch.embnet.org/software/bBLAST.html Select the type of query Select the substitution matrix to use Select your input type: Either a raw sequence or an accession or id number, as well as the database from which blast should retrieve your query Select the protein database to search with either blastp, blastx Select the nucleotide database to search with either blastn, tblastn, tblastx

slide-26
SLIDE 26

september 2004 Page 26

Advanced Blast on EMBnet

www.ch.embnet.org/software/aBLAST.html

  • Greater choice of databases to search
  • Advanced Blast parameter modification
slide-27
SLIDE 27

september 2004 Page 27

Search results

Graphical visualisation and description of alignment scores

slide-28
SLIDE 28

september 2004 Page 28

Search results

Alignment example

  • Normalised score, raw score and e-value
  • Percentage of identical aligned residues, percentage of aligned residues

having a positive score in the substitution matrix

  • Alignment (local) between the query and the database sequence. The

middle line shows if a residue is conserved or not

  • Low complexity region is masked with a series of 'X'
slide-29
SLIDE 29

september 2004 Page 29

Search results

Search details (at the bottom of the results)

  • Size of the database searched
  • Scoring system parameters
  • Details about the number of hits

found

slide-30
SLIDE 30

september 2004 Page 30

Conclusions

Blast: the most used database search tool

  • Fast and very reliable even for a heuristic algorithm
  • Does not necessarily find the best alignment, but most of the

time it finds the best matching sequences in the database

  • Easy to use with default parameters
  • Solid statistical framework for the evaluation of scores

but...

  • The biologist's expertise is still essential to the analysis of the

results !

Tips and tricks

  • For coding sequences always search at the protein level
  • Mask low complexity regions
  • Use a substitution matrix adapted to the expected divergence of

the searched sequences (nevertheless most of the time BLOSUM62 works well)

  • If there are only matches to a limited region of your query, cut
  • ut that region and rerun the search with the remaining part of

your query