Searching Sequence databases 1: Blast
The Central dogma of Biology
Protein translation
[Figure: a gene on DNA is transcribed into mRNA, which is translated into protein. Example: the codons ATG CGT AAT TCA TGA translate to M R N S *.]
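The translation step in the figure can be reproduced with a short sketch. It assumes the standard genetic code and reads the coding strand directly (T instead of U); the names `CODON_TABLE` and `translate` are illustrative and not part of the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon.

    '*' marks a stop codon, 'X' marks a codon with an unknown base.
    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGCGTAATTCATGA"))  # MRNS*
```

The example sequence is the one from the figure: ATG CGT AAT TCA TGA gives M R N S followed by a stop.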
Blast Variants
- blastn: nucleotide versus nucleotide
- blastp: protein versus protein
- blastx: nucleotide query, protein database
- tblastn: protein query, nt database
- tblastx: nucleotide query, nucleotide database, alignments at a protein level.
What is a 6 frame translation of a nucleotide sequence?
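One way to answer the question: a nucleotide sequence can be read in three reading frames on the forward strand and three on the reverse complement, and a 6 frame translation is the set of all six resulting protein sequences. The sketch below assumes the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are illustrative choices, not taken from the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; 'X' marks a codon with an unknown base."""
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frame_translation(dna):
    """Translate all three offsets on both strands."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGCGTAATTCATGA").items():
    print(frame, protein)
```

Translating the query (blastx), the database (tblastn), or both (tblastx) in all six frames is what allows a nucleotide sequence to be compared at the protein level.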
Homology versus similarity
- Sequence alignment tools compute similarity
- Two sequences are homologous if they shared a common evolutionary ancestor.
  - Orthologous if the pair was created in a speciation event
  - Paralogous if the pair was created by a gene duplication
- Homology does not always imply sequence similarity. Ancient protein families might be homologous, but greatly diverged. The score function must reflect this.
Scoring Function
- For DNA, we worked with a simple match/mismatch criterion.
- Purines (AG) & Pyrimidines (CT)
- Transitions (nucleotide substitutions within a group) are more likely than transversions.
- When aligning cDNA, a substitution at the third base of a codon is more likely than at the other positions.
- Q: Can you devise an algorithm to score appropriately?
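One possible answer to the question, as a minimal sketch: penalize transitions less than transversions, and soften the penalty at the third codon position. All numeric score values below are hypothetical placeholders chosen for illustration, not values from the slides, and the alignment scorer assumes two gap-free, in-frame cDNA sequences of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Hypothetical score values, chosen only for illustration.
MATCH = 2.0
TRANSITION = -1.0          # substitution within a group (A<->G or C<->T)
TRANSVERSION = -2.0        # substitution between the two groups
THIRD_POSITION_FACTOR = 0.5  # third-base changes are penalized less


def substitution_score(a, b, codon_position=None):
    """Score one aligned pair of bases.

    codon_position is 1, 2 or 3 for coding sequence, or None if unknown.
    """
    a, b = a.upper(), b.upper()
    if a == b:
        return MATCH
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    penalty = TRANSITION if same_group else TRANSVERSION
    if codon_position == 3:
        penalty *= THIRD_POSITION_FACTOR
    return penalty


def score_cdna_alignment(x, y):
    """Sum per-base scores for two equal-length, gap-free, in-frame cDNAs."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(
        substitution_score(a, b, i % 3 + 1)
        for i, (a, b) in enumerate(zip(x, y))
    )


print(score_cdna_alignment("ATGCGT", "ATGCGC"))  # 9.5
```

In the example, five bases match (5 × 2.0) and the final T/C pair is a transition at a third codon position (-1.0 × 0.5), giving 9.5. In practice the weights would be estimated from observed substitution frequencies rather than fixed by hand.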
Scoring proteins
- Scoring protein sequence alignments is a much more complex task than scoring DNA
- Not all substitutions are equal
- Problem was first worked on by Pauling and collaborators
- In the 1970s, Dayhoff created the first similarity matrices.
- “One size does not fit all”
- Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
- Different proteins might evolve at different rates and we need to normalize for that
PAM1 matrix
- Align many proteins that are very similar
- Is this a problem?
- PAM1 distance is the probability of a substitution when 1% of the residues have changed
- Estimate the frequency $P_{ab}$ of every pair of substitutions a,b
- $S(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{P_{b|a}}{P_b}\right)$
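The log-odds score can be illustrated with a small sketch. The two forms of the formula agree because $P_{b|a} = P_{ab} / P_a$. The code below only estimates $P_{ab}$ and $P_a$ from a list of aligned residue pairs and applies the formula; it is not the full Dayhoff procedure, and the toy pair counts are invented purely for illustration. Pairs are counted in both orders as a simplifying assumption, so that the scores are symmetric.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Compute S(a,b) = log10(P_ab / (P_a * P_b)) from aligned residue pairs."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count each pair in both orders so that S(a,b) == S(b,a).
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

    total = sum(pair_counts.values())
    p_pair = {pair: n / total for pair, n in pair_counts.items()}

    # Marginal residue frequency: P_a = sum over b of P_ab.
    p_residue = Counter()
    for (a, _), p in p_pair.items():
        p_residue[a] += p

    return {
        (a, b): math.log10(p / (p_residue[a] * p_residue[b]))
        for (a, b), p in p_pair.items()
    }


# Toy data (invented): mostly conserved columns, a few A/S substitutions.
toy_pairs = [("A", "A")] * 8 + [("S", "S")] * 6 + [("A", "S")] * 2
scores = log_odds_scores(toy_pairs)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

A positive score means the pair is seen more often than expected by chance given the residue frequencies, and a negative score means it is seen less often. Only pairs that actually occur in the data receive a score here; a real matrix would need a value for every pair.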