ORF Calling
Why?
Need to know protein sequence
Protein sequence is usually what does the work
Functional studies
Crystallography
Proteomics
Similarity studies
Proteins are better for remote similarities than DNA sequences
Protein sequences change more slowly than DNA sequences
ORF Calling
Intrinsic gene calling
- Only use information in your DNA sequences. Does not use other information.
Extrinsic gene calling
- Compare your DNA sequences to known sequences. Needs other sequences that are known!
ORF Calling
- Start with DNA sequence
- Translate in all 6 reading frames
Extrinsic gene calling
Forward strand (5'→3'): AGT AAA ACT TTA ATT GTT GGT TAA
Frame 1: AGT AAA ACT TTA ATT GTT GGT TAA
Frame 2: A GTA AAA CTT TAA TTG TTG GTT AA
Frame 3: AG TAA AAC TTT AAT TGT TGG TTA A
Complementary strand (written 3'→5', so read right to left): TCA TTT TGA AAT TAA CAA CCA ATT
Frame -1: TCA TTT TGA AAT TAA CAA CCA ATT
Frame -2: TC ATT TTG AAA TTA ACA ACC AAT T
Frame -3: T CAT TTT GAA ATT AAC AAC CAA TT
Why are there 6 reading frames?
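The six-frame translation can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes the standard genetic code and an input made only of A, C, G and T, and the function names (`reverse_complement`, `translate`, `six_frames`) are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the other strand, written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frames("AGTAAAACTTTAATTGTTGGTTAA")
for name in (1, 2, 3, -1, -2, -3):
    print(name, frames[name])
```

For the example sequence on the slide, frame 1 translates to SKTLIVG* and frame -1 to LTNN*SFT. There are six frames because a codon can start at any of three offsets, and the gene can be on either strand.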
- Start with DNA sequence
- Translate in all 6 reading frames
- Compare your sequence to known protein sequences
- Find the ends of each, and call those genes!
Extrinsic gene calling
[Figure: a DNA sequence with similar protein sequences (e.g. from BLAST) aligned along it; the region they cover is called as a protein-encoding gene]
For example
- This is how (most) metagenome ORF calling is done
- Eukaryotic ORF calling, especially using EST sequences
Uses of extrinsic calling
- Very slow (depending on search algorithm)
- Dependent on your database
- Only finds known genes
Problems with extrinsic calling
- Intrinsic gene calling
- Ab initio gene calling
- What are the start codons? ATG
- What are the stop codons? TAA, TAG, TGA
Alternatives to extrinsic gene calling
Approximately once every 20 amino acids at random! A stretch of 100 amino acids is likely to have a stop codon!
How frequently do stop codons appear?
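The "once every 20 amino acids" figure follows from 3 of the 64 codons being stops. The short calculation below is a sketch under the assumption that all four bases are equally likely and independent, which real genomes only approximate; `prob_no_stop` is a name invented for this example.

```python
# 3 of the 64 possible codons are stops (TAA, TAG, TGA).
P_STOP = 3 / 64                      # about 0.047 per codon


def prob_no_stop(n_codons, p_stop=P_STOP):
    """Chance that n_codons random codons in a row contain no stop."""
    return (1 - p_stop) ** n_codons


print(f"stop codon expected every {1 / P_STOP:.1f} codons")
print(f"P(no stop in 100 random codons) = {prob_no_stop(100):.4f}")
```

Under these assumptions a stop turns up about every 21 codons, and fewer than 1 in 100 random stretches of 100 codons is free of stops. That is why a long open stretch is unlikely to be chance.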
[Figure: the DNA with its six reading frames: 3, 2, 1 above and -1, -2, -3 below]
How to call ORFs (the easy way)
Find all the stop codons
x is often 100 amino acids
Find all the ORFs > x amino acids
Trim to those ORFs that have a start
Short ORFs that overlap others
Remove “shadow” ORFs
Trim the start sites to first ATG
These are the ORFs
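The steps of the easy method can be sketched as follows. This is an illustrative toy, not a production gene caller: it uses only ATG as a start, keeps stretches of at least `min_aa` codons between stops, and treats an ORF as a "shadow" when more than half of it is covered by a longer ORF. That 50% threshold (`max_overlap`) and all function names are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def candidate_orfs(dna, strand, min_aa):
    """Return (start, end, strand) in forward-strand coordinates, end exclusive."""
    seq = dna if strand == "+" else reverse_complement(dna)
    n = len(seq)
    orfs = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, n - 2, 3)]
        open_from = 0                        # first codon after the previous stop
        for idx, codon in enumerate(codons):
            if codon not in STOP_CODONS:     # step 1: find the stop codons
                continue
            stretch = codons[open_from:idx]
            # steps 2-3: long enough between stops, and contains a start
            if len(stretch) >= min_aa and START_CODON in stretch:
                # last step: trim the start site to the first ATG
                first_atg = open_from + stretch.index(START_CODON)
                start = frame + 3 * first_atg
                end = frame + 3 * (idx + 1)  # include the stop codon
                if strand == "-":
                    start, end = n - end, n - start
                orfs.append((start, end, strand))
            open_from = idx + 1
    return orfs


def remove_shadows(orfs, max_overlap=0.5):
    """Keep longer ORFs first; drop ORFs mostly covered by one already kept."""
    kept = []
    for orf in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        length = orf[1] - orf[0]
        shadowed = any(
            min(orf[1], k[1]) - max(orf[0], k[0]) > max_overlap * length
            for k in kept
        )
        if not shadowed:
            kept.append(orf)
    return sorted(kept)


def call_orfs(dna, min_aa=100):
    dna = dna.upper()
    orfs = candidate_orfs(dna, "+", min_aa) + candidate_orfs(dna, "-", min_aa)
    return remove_shadows(orfs)


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(call_orfs(toy, min_aa=5))
```

On the toy sequence the only ORF reported is the ATG...TAA stretch on the forward strand. Stretches that run off the end of the sequence without reaching a stop are ignored in this sketch.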
Intrinsic ORF calling using Markov Models
- Based on language processing
- Common for gene and protein finding, alignments, and so on
Markov Models
English: the
Spanish: el (la)
Portuguese: que
What is the most common word?
Scrabble
In Scrabble, how do they score the letters?
The most abundant letters (easiest to place on the board) are given the lowest score
Scrabble
1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z
Scrabble
Frequency of letters
If I want to make up a sentence, I could choose some letters at random, based on how often they occur (i.e. their Scrabble score):
rla bsht es stsfa ohhofsd
Making up sentences
What follows a period (“.”)? Usually a space “ ”
What follows a t? Usually an “i” (-tion, -tize, ...)
Let's get clever!
When the first letter is “t” (from 3,269 words):
ti 51%
te 20%
ta 15%
th 8%
Frequency of two letters
Choose a letter based on the probability that it follows the letter before:
s h a n d t uc h t i n e y m e l e o l l d
Level 1 analysis
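Choosing each letter from the one before it is a first-order Markov chain. The sketch below shows the idea on any training text; the function names and the tiny training string are invented for this example, and the output is random, so it will differ from run to run.

```python
import random
from collections import Counter, defaultdict


def train_first_order(text):
    """Count how often each letter follows each other letter."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, first, length, rng=random):
    """Build a string by repeatedly sampling the next letter given the last one."""
    out = [first]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:                # dead end: letter never seen with a follower
            break
        letters = list(followers)
        weights = list(followers.values())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


counts = train_first_order("the cat sat on the mat and then the rat sat too")
print(generate(counts, "t", 30))
```

Counting the two letters before each position instead of one gives a second-order model, and so on up the levels.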
1 letter (a, e, o …): zero order model
2 letters (th, ti, sh …): first order model
3 letters (the, and, …): second order model
4 letters (that, …): third order model
Levels of analysis
With about 10th order Markov models of English you get complete words and sentences!
Markov models
Codons have three letters (ATG, CAC, GGG, ...)
Use a 2nd order Markov model for ORF calling
The frequency of a letter is predicted based on the frequency of the two letters before
Markov Models and ORF calling
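A second-order model for DNA can be sketched by counting which base follows each pair of bases in sequences believed to be coding, then scoring a candidate by its log-likelihood. This is a simplified illustration, not how any specific gene finder is implemented: the pseudocount, the function names, and the restriction to A, C, G and T are all assumptions of this example.

```python
import math
from collections import Counter, defaultdict


def train_second_order(sequences, pseudocount=1.0):
    """Estimate P(next base | previous two bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in "ACGT":
        for b in "ACGT":
            ctx = a + b
            total = sum(counts[ctx].values()) + 4 * pseudocount
            model[ctx] = {
                n: (counts[ctx][n] + pseudocount) / total for n in "ACGT"
            }
    return model


def log_likelihood(model, seq):
    """Sum of log P(base | two bases before); higher means more model-like."""
    return sum(
        math.log(model[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq))
    )


model = train_second_order(["ATGATGATGATG"])
print(log_likelihood(model, "ATGATG"), log_likelihood(model, "ACCACC"))
```

A sequence that resembles the training data scores higher than one that does not. In practice the score is usually compared against a background model trained on non-coding sequence.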
Scrabble
Do English and Spanish use the same letters?
Scrabble (México)
1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z
Scrabble (US)
Based on the front page of the NY Times!
1 point: A, E, O, I, S, N, L, R, U, T
2 points: D, G
3 points: C, B, M, P
4 points: H, F, V, Y
5 points: CH, Q
8 points: J, LL, Ñ, RR, X
10 points: Z
Scrabble (Spanish)
Will vary with the composition of the organism!
Remember, some organisms have high G+C compared to A+T
What about scrabble scores for DNA?
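The DNA equivalent of Scrabble letter frequencies is a zero-order model: just the fraction of each base. A minimal sketch follows; the function names are invented here, and characters other than A, C, G and T are simply ignored.

```python
from collections import Counter


def base_frequencies(dna):
    """Fraction of A, C, G and T in a non-empty DNA string."""
    counts = Counter(dna.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(dna):
    freqs = base_frequencies(dna)
    return freqs["G"] + freqs["C"]


print(base_frequencies("AGTAAAACTTTAATTGTTGGTTAA"))
print(gc_content("AGTAAAACTTTAATTGTTGGTTAA"))
```

Running this on genomes from different organisms gives different "scores", which is why a model trained on one organism may not suit another.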
Need to train the Markov model: not all organisms are the same
Can use phylogenetically close organisms
Can use “long ORFs”: likely to be correct because long random stretches without a stop codon are unlikely!
Problems!
Markov models of order 1-8 (word size 2-9)
Discard (or ↓ weight) rare words
Promote (or ↑ weight) common words
Probability is the sum of the probabilities from all orders 1-8 (word sizes 2-9)
Interpolated Markov Model
(The IMM in GLIMMER)
As with proteins, two main methods:
- Ab initio (intrinsic)
- Homology based (extrinsic)
RNA genes
Ribosomes are made of proteins and RNA
Ribosomes
[Figure: 30S subunit from Thermus aquaticus. Blue: protein; orange: rRNA]
[Figure: E. coli 16S rRNA secondary structure, with variable and conserved regions marked]
Variable regions in the 16S rRNA. Vn: the 9 variable regions; (n): variable loop(s); forward/reverse primers
V1 (6), V2 (8-11), V3 (18), V4 (P23-1, 24), V5 (28, 29), V6 (37), V7 (43), V8 (45, 46), V9 (49)
Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:3381-3391
Ribosomes are made of proteins and RNA
Prokaryotic ribosome:
- Large subunit: 50S (5S and 23S rRNA genes)
- Small subunit: 30S (16S rRNA gene)
Ribosomes
Easiest way is iterative:
- BLAST
- ALIGN
- TRIM
Problem: secondary structure makes identification of the ends difficult
Finding 16S genes
- Not as easy as rRNA
- Much shorter
- Varied sequence
- Only conservation is 2° structure
Finding tRNA genes
tRNAscan-SE
Sean Eddy
Use it!
How does this relate to tRNA?
tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons https://commons.wikimedia.org/wiki/File:TRNA-Phe_yeast_en.svg
tRNA structure
- Start of acceptor stem (7-9 bp)
- D-loop (4-6 bp) stem plus loop
- anticodon arm (6 bp) stem plus loop with anticodon
- T-loop (4-5 bp) stem plus loop
- End of acceptor stem (7-9 bp)
- CCA to attach amino acid (may not be in sequence ...