Searching Sequence databases 1

SLIDE 1

Searching Sequence databases 1: Blast

SLIDE 2

The Central dogma of Biology

SLIDE 3

Protein translation

[Figure: DNA (gene) -> Transcription -> mRNA -> Translation -> protein]

DNA:     ATG CGT AAT TCA TGA
Protein: M   R   N   S   *

SLIDE 4

Blast Variants

  • blastn: nucleotide versus nucleotide
  • blastp: protein versus protein
  • blastx: nucleotide query, protein database
  • tblastn: protein query, nt database
  • tblastx: nucleotide query, nucleotide database, alignments at a protein level.

What is a 6 frame translation of a nucleotide sequence?
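
One way to make the question concrete: a six-frame translation reads the sequence in codons starting at offsets 0, 1 and 2 on the given strand, and again at the same three offsets on the reverse complement. The sketch below is a minimal illustration, assuming the standard genetic code and an uppercase/lowercase ACGT alphabet. The function names and the frame labels ("+1" to "-3") are choices made for this example, not part of any BLAST interface.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGCGTAATTCATGA").items():
        print(label, protein)
```

This is what blastx and tblastx rely on conceptually: a nucleotide sequence is compared at the protein level in all six possible reading frames, because the correct frame is not known in advance.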

SLIDE 5

Homology versus similarity

  • Sequence alignment tools compute similarity
  • Two sequences are homologous if they shared a common evolutionary ancestor.
  • Orthologous if the pair was created in a speciation event
  • Paralogous if the pair was created by a gene duplication
  • Homology does not always imply sequence similarity. Ancient protein families might be homologous, but greatly diverged. The score function must reflect this.

SLIDE 6

Scoring Function

  • For DNA, we worked with a simple match/mismatch criterion.
  • Purines (AG) & Pyrimidines (CT)
  • Transitions (nucleotide substitution within a group) are more likely than transversions.
  • When aligning cDNA, a substitution at the third base of a codon is more likely than at the other positions.
  • Q: Can you devise an algorithm to score appropriately?
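
One possible starting point for the question, sketched under stated assumptions: penalize transitions less than transversions. The numeric weights (+1, -1, -2) are hypothetical placeholders chosen for illustration, not values from the slides. The sketch also ignores the third-base effect in cDNA, which would require position-dependent weights.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides.

    The default weights are illustrative assumptions only.
    """
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    # A transition stays within one group (purine<->purine or
    # pyrimidine<->pyrimidine); anything else is a transversion.
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion


def score_ungapped(x, y, **weights):
    """Sum pair scores over two equal-length, gap-free sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(a, b, **weights) for a, b in zip(x, y))


if __name__ == "__main__":
    print(score_ungapped("ACGT", "GCTT"))
```

In the usage line, A/G is a transition, C/C and T/T are matches, and G/T is a transversion, so the mismatches contribute unequally to the total.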

SLIDE 7

Scoring proteins

  • Scoring protein sequence alignments is a much more complex task than scoring DNA
  • Not all substitutions are equal
  • Problem was first worked on by Pauling and collaborators
  • In the 1970s, Dayhoff created the first similarity matrices.
  • “One size does not fit all”
  • Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
  • Different proteins might evolve at different rates and we need to normalize for that

SLIDE 8

PAM1 matrix

  • Align many proteins that are very similar
  • Is this a problem?
  • PAM1 distance is the probability of a substitution when 1% of the residues have changed
  • Estimate the frequency $P_{ab}$ of every pair of substitutions a,b
  • $S(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{P_{b|a}}{P_b}\right)$
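
The log-odds score on this slide can be checked numerically with a small sketch. The function names and the frequencies in the usage lines are hypothetical, made up only to show the arithmetic; they are not taken from any real PAM table.

```python
import math


def log_odds_score(p_ab, p_a, p_b):
    """S(a,b) = log10(P_ab / (P_a * P_b)).

    p_ab is the observed frequency of the aligned pair (a, b);
    p_a and p_b are the background frequencies of a and b.
    """
    return math.log10(p_ab / (p_a * p_b))


def log_odds_from_conditional(p_b_given_a, p_b):
    """Equivalent form: S(a,b) = log10(P_{b|a} / P_b)."""
    return math.log10(p_b_given_a / p_b)


if __name__ == "__main__":
    # Hypothetical numbers: the pair occurs twice as often as chance predicts.
    p_a, p_b, p_ab = 0.1, 0.1, 0.02
    print(log_odds_score(p_ab, p_a, p_b))
    print(log_odds_from_conditional(p_ab / p_a, p_b))
```

A pair seen exactly as often as chance predicts scores 0, a pair seen more often scores positive, and a pair seen less often scores negative. Both forms agree because $P_{b|a} = P_{ab} / P_a$.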

SLIDE 9

PAM 1