EMBL Database Growth (growth statistics as of 05/30/02)
EMBL Divisions (05/30/02)
Computational Genomics
Sequence determination
- Signal processing
- Assembly of shotgun sequence fragments and determination of a consensus sequence
Sequence analysis
- Search for genes, regulatory patterns, repeats, etc., in the inferred genomic sequence
- Conceptual translation of the predicted genes into protein sequences
- Inferences about the structural, functional, evolutionary, etc., properties of the putative proteins
Looking for similarities in biological databases
Biological databases:
- Primary data: sequences (nucleic acids or proteins), structures; curated or not, complete or not, redundant or not, general or specialized (organisms...)
- Derived data: patterns, motifs, profiles, etc.
Query:
- Sequence: nucleic acid or protein, "finished" or not, etc.
- Structural data
- Motif, profile...
Similarity:
- Needs to be better defined and formalized.
- Global vs local, and other variations
- How to measure it?
- What is its biological interpretation? Its significance?
Interpretation of biological similarity
- HOMOLOGY: similarity resulting from a divergent evolutionary process (mutations and selection)
- ANALOGY: similarity resulting from convergent evolution
HOMOLOGIES
- ORTHOLOGY: speciation
- PARALOGY: gene duplication
- XENOLOGY: lateral transfer
[Figure: trees illustrating homology, analogy, and the three kinds of homologies.]
From W. M. Fitch, 2000: "Homology; a personal view on some of the problems", Trends in Genetics, 16, p227-231
Introduction to computational sequence analysis
1. Pairwise sequence comparison
1.1 Algorithms for sequence alignments
1.2 Scoring schemes
2. Similarity searching in sequence databases
2.1 Sequence databases
2.2 The Fasta and Blast programs
2.3 Scores and statistics
3. Multiple alignments and motifs
3.1 Algorithms: how to compute global or local multiple alignments?
3.2 Representation of the information contained in a multiple alignment
3.3 Examples of resources and applications of these concepts in the context of similarity searches
3.4 Back to the algorithms: motif inference
Other important aspects of computational sequence analysis:
- Protein structure prediction and homology modeling
- Phylogenetic inference
- Gene prediction
DOT-PLOT
[Figure: dot-plot for the sequence C A S R K E H Y D D E K E H A R C T K E H Y E K E H.]
Word matches
[Figure: dot-plot restricted to word matches.]
Approximate word matches
Parameters:
- w = size of the sliding words or windows
- t = threshold which is applied to the score of the comparison between two windows (or words). The score itself may be computed as the number of identities between the two words, or using a scoring matrix for amino acids.
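The two parameters w and t can be illustrated with a minimal sketch, assuming the score of a window pair is simply its number of identities. The function names `dot_plot` and `render` are hypothetical, and the example compares the sequence from the dot-plot slide with itself.

```python
def dot_plot(seq_a, seq_b, w=3, t=3):
    """Return the set of (i, j) window starts whose identity count reaches t.

    w is the window (word) size, t the threshold applied to the score.
    The score used here is the number of identities (an assumption; a
    scoring matrix could be used instead for amino acids).
    """
    dots = set()
    for i in range(len(seq_a) - w + 1):
        for j in range(len(seq_b) - w + 1):
            identities = sum(1 for k in range(w) if seq_a[i + k] == seq_b[j + k])
            if identities >= t:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: one row per position of seq_a, one column per position of seq_b."""
    lines = ["  " + " ".join(seq_b)]
    for i, a in enumerate(seq_a):
        row = ["." if (i, j) in dots else " " for j in range(len(seq_b))]
        lines.append(a + " " + " ".join(row))
    return "\n".join(lines)


seq = "CASRKEHYDDEKEHARCTKEHYEKEH"
exact = dot_plot(seq, seq, w=3, t=3)    # exact word matches: all 3 positions identical
approx = dot_plot(seq, seq, w=3, t=2)   # approximate word matches: at least 2 of 3 identical
print(render(seq, seq, exact))
```

Lowering t below w turns exact word matches into approximate ones, so every dot of the first plot is also present in the second.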
Global alignment

seq A MAKSKSSQGASGARRKPAPSLYQHISSFKPQFSTRVDDVLHFSKTLTWRSEIIPDKSKGT
      | |.        . .. .|| |.||||..|.| .|||.|||  |.||. . ..|||...|
seq B MPKK--------VWKSSTPSTYEHISSLRPKFVSRVDNVLHQRKSLTFSNVVVPDKKNNT

seq A LTTSLLYSQGSDIYEIDTTLPLKTFYDDDDDDDNDDDDEEGNGKTKSAATPNPEYGDAFQ
      ||.|..||||||||||| ..||.                      ..|. |  .|||||.
seq B LTSSVIYSQGSDIYEIDFAVPLQ----------------------EAASEPVKDYGDAFE

seq A DVEGKPLRPKWIYQGETVAKMQYLESSDDSTAIAMSKNGSLAWFRDEIKVPVHIVQEMMG
       .|.  | ||..||||||.|| ||... ..| ..||||||||||.. ||||.|||||.||
seq B GIENTSLSPKFVYQGETVSKMAYLDKTGETTLLSMSKNGSLAWFKEGIKVPIHIVQELMG

seq A PATRYSSIHSLTRPG-------SLAVSDFDVSTNMDTVVKSQSNGYEEDSILKIIDNSDR
      ||| |.|||||||||       |||.||| .|.. .|.||||||| |||||||||||. .
seq B PATSYASIHSLTRPGDLPEKDFSLAISDFGISNDTETIVKSQSNGDEEDSILKIIDNAGK

seq A PGDILRTVHVPGTNVAHSVRFFNNHLFASCSDDNILRFWDTRTADKPLWTLSEPKNGRLT
      ||.||||||||||.|.|.||||.||.|||||||||||||||||.|||.|.|.|||||.||
seq B PGEILRTVHVPGTTVTHTVRFFDNHIFASCSDDNILRFWDTRTSDKPIWVLGEPKNGKLT

seq A SFDSSQVTENLFVTGFSTGVIKLWDARAVQLATTDLTHRQNGEEPIQNEIAKLFHSGGDS
      ||| |||..||||||||||.||||||||.. ||||||.|||||.|||||||...|.||||
seq B SFDCSQVSNNLFVTGFSTGIIKLWDARAAEAATTDLTYRQNGEDPIQNEIANFYHAGGDS

seq A VVDILFSQTSATEFVTVGGTGNVYHWDMEYSFSRNDDDNEDEVRVAAPEELQGQCLKFFH
      |||. || ||..|| |||||||.|||. .||.|. . |.       | || | . |.|.|
seq B VVDVQFSATSSSEFFTVGGTGNIYHWNTDYSLSKYNPDDTIAPPQDATEESQTKSLRFLH

seq A TGGTRRSSNQFGKRNTVALHPVINDFVGTVDSDSLVTAYKPFLASDFIGRGYDD
       ||.||| .|.|.|||.| ||||...|||||.||||. |||.   .       .
seq B KGGSRRSPKQIGRRNTAAWHPVIENLVGTVDDDSLVSIYKPYTEES-------E
Alignment as a path in a graph
[Figure: graph whose paths represent alignments, shown together with the alignment corresponding to the colored path.]
Scoring Schemes
Similarity measures:
- the scores may be either positive or negative
- the greater the similarity between two compared symbols, the greater the score of their comparison
- when using a similarity scoring scheme, we want to maximize the score of the alignment
Distance measures:
- scores are never negative and increase when the similarity decreases
- a simple distance scoring scheme: if x = y, then d(x,y) = 0, else d(x,y) = 1; d(x,-) = d(-,y) = 1
- using a distance measure, we want to minimize the score of the alignment
Alignments
Given two sequences, an alignment is defined as a series of paired symbols that are either letters from the alphabet of the sequences, or the symbol for a gap.
One alignment between the two sequences is represented as a series of pairs $(x_1, y_1), (x_2, y_2), \dots, (x_L, y_L)$.
The score of an alignment is defined as the sum of the scores of all the paired symbols, which can be written as:
$$S = \sum_{k=1}^{L} s(x_k, y_k)$$
with $s(x_k, y_k)$ the score given by the scoring scheme to the pair $(x_k, y_k)$.
The optimal alignment is defined as the one having the best score ("best" meaning "lowest" or "highest", depending on the kind of scoring scheme that is used).
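The definition of the score as a sum over paired symbols can be sketched as follows. This is a minimal illustration: the names `distance`, `similarity` and `alignment_score`, the gap symbol "-", the similarity values (+1 / -1 / -1) and the example alignment are all assumptions made for the example.

```python
GAP = "-"


def distance(x, y):
    """Simple distance scheme: 0 for two identical letters, 1 for a mismatch or a gap."""
    return 0 if (x == y and x != GAP) else 1


def similarity(x, y, match=1, mismatch=-1, gap=-1):
    """Simple similarity scheme (hypothetical values): reward identities, penalize the rest."""
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b, pair_score):
    """Sum the scores of all paired symbols of an alignment given as two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(pair_score(x, y) for x, y in zip(row_a, row_b))


row_a = "AATCTAC"
row_b = "A-TC-AC"
print(alignment_score(row_a, row_b, distance))    # to be minimized
print(alignment_score(row_a, row_b, similarity))  # to be maximized
```

The same alignment gets a low distance score and a high similarity score, which is why "best" means "lowest" in one scheme and "highest" in the other.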
Dynamic Programming
Dynamic Programming - Global alignment
[Figure: dynamic programming matrix for the global alignment of two short nucleotide sequences.]
Scoring parameters (distance scheme): Matches = 0; Mismatches and insertions/deletions = 1
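A global alignment under the distance scheme above (matches = 0, mismatches and insertions/deletions = 1) can be sketched as follows. The function name `global_alignment` and the two example sequences are assumptions for illustration, not necessarily those of the slide; the traceback returns one optimal alignment among possibly several.

```python
def global_alignment(a, b):
    """Return (distance, gapped_a, gapped_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # D[i][j] = minimal distance between the prefixes a[:i] and b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j                      # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = D[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            D[i][j] = min(substitution, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # Traceback from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_alignment("AATCTAC", "ATCAC")
print(score)
print(top)
print(bottom)
```

Filling the matrix takes time and memory proportional to the product of the two sequence lengths.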
Dynamic Programming - Local alignment
[Figure: dynamic programming matrix for the local alignment of two short nucleotide sequences.]
Scoring parameters (similarity scheme): Matches = 1; Mismatches and insertions/deletions = -1
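A local alignment under a similarity scheme can be sketched in the same way, assuming matches score +1 and mismatches and insertions/deletions score -1. The function name `local_alignment` and the example sequences are hypothetical. Compared with the global case, cell scores are never allowed to drop below zero and the traceback starts at the best-scoring cell.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, gapped_a, gapped_b) for one best-scoring local alignment."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_alignment("GGTTACGTAA", "CCACGTCC"))
```

Only the best-scoring region is reported; the flanking residues that do not align well are left out of the result.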
Dynamic Programming alignment of two sequences
Global alignment:
- S. Needleman and C. Wunsch, 1970.
- P. Sellers, 1974.
Local alignment:
- T. Smith and M. Waterman, 1981.
Improvements to the algorithms:
- O. Gotoh, 1982
- E. W. Myers and W. Miller, 1988
The PAM250 similarity matrix
(M. O. Dayhoff et al., 1978)
[Table: triangular matrix of PAM250 scores for all pairs of the 20 amino acids.]
(From M−F. Sagot, PhD dissertation, 1996)
Different viewpoints
Venn Diagram for amino acids
Proposed by W. R. Taylor, 1986
[Figure: Venn diagram grouping the amino acids into overlapping property sets: positive, charged, polar, tiny, small, aliphatic, aromatic, hydrophobic.]
PAM Matrices
(M. Dayhoff et al., 1968, 1972, 1978)
PAM = Accepted Point Mutation
1. Raw PAM Matrix:
Observed frequencies of occurrence of the substitutions.
2. Mutation Data Matrix (MDM):
From the observed frequencies of change occurrence and the individual frequencies of the amino acids, for each pair of amino acids i and j, compute the probability of i mutating into j during a given amount of evolution. Amounts of evolution are expressed in PAM units: one PAM is the amount of evolution during which one expects on average 1% of change.
Derive the probabilities for one PAM and extrapolate to other PAM distances.
3. Relatedness Odds Matrix:
For each pair i, j of amino acids, $R_{i,j}$ represents the ratio of the probability of i and j being aligned because corresponding positions in the sequences are homologous, by the probability of i and j being aligned by chance:
$$R_{i,j} = \frac{q_{i,j}}{p_i \, p_j}$$
4. Scoring Matrix:
$$S_{i,j} = \log R_{i,j}$$
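The extrapolation from one PAM to larger distances, followed by the odds and log steps, can be sketched on a toy example. Everything below is an assumption made for illustration: a two-letter alphabet, made-up frequencies and mutation probabilities chosen so that one step changes 1% of the positions on average, repeated matrix multiplication as the extrapolation, and a 10 x log10 scaling of the scores.

```python
import math


def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(M, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Toy alphabet with two letters; all numbers are made up for the example.
freqs = [0.6, 0.4]                        # individual frequencies p_i
M1 = [[1 - 0.005 / 0.6, 0.005 / 0.6],     # M1[i][j] = P(i mutates into j) over 1 PAM
      [0.0125, 0.9875]]


def pam_scores(n):
    """Scores at n PAM: extrapolate M1, divide by p_j (odds), then take the log."""
    Mn = mat_pow(M1, n)
    size = len(freqs)
    # R[i][j] = q_ij / (p_i p_j) with q_ij = p_i * Mn[i][j], hence Mn[i][j] / p_j
    R = [[Mn[i][j] / freqs[j] for j in range(size)] for i in range(size)]
    return [[10 * math.log10(R[i][j]) for j in range(size)] for i in range(size)]


expected_change = sum(freqs[i] * (1 - M1[i][i]) for i in range(len(freqs)))
print(expected_change)   # fraction of positions expected to change over 1 PAM
print(pam_scores(50))
```

In this toy model the scores shrink toward zero as the PAM distance grows, because after a long evolution aligned letters carry less and less information about homology.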
Old PAM series = Dayhoff et al.
New PAM series = Jones et al., Gonnet et al.
PAMx: x stands for the evolutionary distance represented by the matrix.
BLOSUM series = Henikoff & Henikoff
BLOSUMy: y is the minimum percent of identity in the set of sequences from which the matrix is derived.
Log-Odds Matrices
$$S_{i,j} = \log \frac{q_{i,j}}{p_i \, p_j}$$
$q_{i,j}$: target frequencies
$p_i$, $p_j$: background frequencies
Any matrix used for scoring local alignments is implicitly a log-odds matrix, best suited for distinguishing local alignments in which i and j are aligned with frequency $q_{i,j}$. (Stephen Altschul)
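The log-odds formula can be applied directly once target and background frequencies are available. The sketch below assumes a hypothetical two-letter alphabet with made-up frequencies and a base-2 logarithm (scores in bits); the function name `log_odds_matrix` is also an assumption.

```python
import math


def log_odds_matrix(target, background, base=2.0):
    """Compute S[i, j] = log(q_ij / (p_i * p_j)) for every pair in `target`.

    target[(i, j)] holds the target frequency q_ij,
    background[i] holds the background frequency p_i.
    """
    return {
        (i, j): math.log(q / (background[i] * background[j]), base)
        for (i, j), q in target.items()
    }


# Made-up frequencies for a two-letter alphabet {X, Y}.
background = {"X": 0.5, "Y": 0.5}
target = {("X", "X"): 0.4, ("Y", "Y"): 0.4, ("X", "Y"): 0.1, ("Y", "X"): 0.1}

scores = log_odds_matrix(target, background)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Pairs that are aligned more often than expected by chance get a positive score, and pairs aligned less often than expected get a negative one.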
Scoring gaps
Simplest model:
The same penalty for each inserted or deleted element: the score of a gap will be proportional to its length.
Affine gap costs:
The score for a gap of length l is of the form a + b l:
"a" is the cost for the presence of a gap per se
"b" is the cost (per residue) for extending the gap
Concave functions:
Empirical studies: Benner, Cohen and Gonnet, 1993 (and previous studies): From the observed distribution of gap lengths, they infer that the score of a gap should be of the form: a + b log l
Theoretical work: Waterman, 1984; Miller and Myers, 1988. Development of algorithms for sequence alignment with concave gap costs.
Different treatments for different kinds of gaps:
Distinguishing gaps inside alignments from gaps between aligned regions: implicitly done by combining several local alignments in WU-Blast2.
Considering regions with gaps as "unaligned" regions: generalized affine gap costs (Altschul, 1998).
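The three gap models can be compared with a short sketch. The function names and the parameter values (a = 10, b = 1) are assumptions chosen only to make the difference visible, and the natural logarithm is assumed for the concave form.

```python
import math


def linear_gap(length, b=1.0):
    """Simplest model: the same penalty b for each inserted or deleted element."""
    return b * length


def affine_gap(length, a=10.0, b=1.0):
    """Affine model: a for the presence of the gap, b per residue of the gap."""
    return a + b * length


def log_gap(length, a=10.0, b=1.0):
    """Concave model of the form a + b log l."""
    return a + b * math.log(length)


for length in (1, 2, 5, 10, 50):
    print(length, linear_gap(length), affine_gap(length), round(log_gap(length), 2))
```

With affine costs each additional residue costs the same amount b, whereas with the logarithmic form each additional residue costs less than the previous one, so long gaps are penalized comparatively less.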