ORF Calling ORF Calling Why? Need to know protein sequence - - PowerPoint PPT Presentation



slide-1
SLIDE 1

ORF Calling

slide-2
SLIDE 2

 Why?

Need to know protein sequence

Protein sequence is usually what does the work

Functional studies

Crystallography

Proteomics

Similarity studies

Proteins are better for remote similarities than DNA sequences

Protein sequences change slower than DNA sequences

ORF Calling

slide-3
SLIDE 3

Intrinsic gene calling

Only use information in your DNA sequences. Does not use other information.

Extrinsic gene calling

Compare your DNA sequences to known sequences. Needs other sequences that are known!

ORF Calling

slide-4
SLIDE 4
  • Start with DNA sequence
  • Translate in all 6 reading frames

Extrinsic gene calling

slide-5
SLIDE 5

DNA sequence:

AGT AAA ACT TTA ATT GTT GGT TAA

Frame 1: AGT AAA ACT TTA ATT GTT GGT TAA
Frame 2: A GTA AAA CTT TAA TTG TTG GTT AA
Frame 3: AG TAA AAC TTT AAT TGT TGG TTA A

Complementary strand (base-paired to the sequence above, so read right to left):

TCA TTT TGA AAT TAA CAA CCA ATT

Frame -1: TCA TTT TGA AAT TAA CAA CCA ATT
Frame -2: TC ATT TTG AAA TTA ACA ACC AAT T
Frame -3: T CAT TTT GAA ATT AAC AAC CAA TT

Why are there 6 reading frames?
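
One way to see the six frames is to translate the slide's 24-base sequence in each of them. This is a minimal sketch, not any particular tool's method: the function names are made up for illustration, it assumes the standard genetic code, and it assumes the input contains only A, C, G and T.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame labels 1, 2, 3, -1, -2, -3 to their translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frames("AGTAAAACTTTAATTGTTGGTTAA")
for label in (1, 2, 3, -1, -2, -3):
    print(label, frames[label])
```

Three frames come from the three possible starting positions on one strand, and three more from the same offsets on the opposite strand.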

slide-6
SLIDE 6
  • Start with DNA sequence
  • Translate in all 6 reading frames
  • Compare your sequence to known protein sequences

  • Find the ends of each, and call those genes!

Extrinsic gene calling

slide-7
SLIDE 7

DNA sequence

Similar protein sequences (e.g. from BLAST)

Protein-encoding gene

For example

slide-8
SLIDE 8
  • This is how (most) metagenome ORF calling is done
  • Eukaryotic ORF calling, especially using EST sequences

Uses of extrinsic calling

slide-9
SLIDE 9
  • Very slow (depending on search algorithm)
  • Dependent on your database
  • Only finds known genes

Problems with extrinsic calling

slide-10
SLIDE 10
  • Intrinsic gene calling
  • Ab initio gene calling
  • What are the start codons?
  • What are the stop codons?

ATG TAA TAG TGA

Alternatives to extrinsic gene calling

slide-11
SLIDE 11

Approximately once every 20 amino acids at random (3 of the 64 codons are stops)!

A stretch of 100 amino acids is likely to have a stop codon!

How frequently do stop codons appear?
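
The arithmetic behind that estimate can be checked in a few lines. This sketch assumes all four bases are equally likely and independent, which real genomes only approximate; the variable names are illustrative.

```python
# 3 of the 64 codons are stops, assuming all four bases are equally likely.
p_stop = 3 / 64
expected_gap = 1 / p_stop            # mean spacing between stops, in codons
p_open_100 = (1 - p_stop) ** 100     # chance that 100 random codons have no stop

print(f"P(stop) per codon: {p_stop:.3f}")
print(f"One stop every {expected_gap:.1f} codons on average")
print(f"P(no stop in 100 codons): {p_open_100:.4f}")
```

Under these assumptions a random stretch of 100 codons stays open less than 1% of the time, which is why long open reading frames are good gene candidates.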

slide-12
SLIDE 12

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

How to call ORFs (the easy way)

slide-13
SLIDE 13

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

Find all the stop codons

slide-14
SLIDE 14

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

x is often 100 amino acids

Find all the ORFs > x amino acids

slide-15
SLIDE 15

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

Trim to those ORFs that have a start

slide-16
SLIDE 16

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

Short ORFs that overlap others

Remove “shadow” ORFs

slide-17
SLIDE 17

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

Trim the start sites to the first ATG

slide-18
SLIDE 18

[Figure: DNA with reading frames 3, 2, 1, -1, -2, -3]

These are the ORFs
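
The "easy way" steps (find the stops, keep stretches longer than x, trim to the first ATG) can be sketched roughly as below. This is an illustration only: the function names are hypothetical, ATG is the only start codon considered, "shadow" ORF removal is left out, and ORFs that run off the end of the sequence without a stop are ignored.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def orfs_on_strand(dna, min_codons):
    """Return (start, end) pairs for ORFs in the three frames of one strand."""
    found = []
    for offset in range(3):
        region_start = offset                  # first codon after the previous stop
        for i in range(offset, len(dna) - 2, 3):
            if dna[i:i + 3] not in STOPS:
                continue
            # Keep stop-free stretches longer than min_codons ...
            if (i - region_start) // 3 > min_codons:
                # ... and trim each to its first in-frame ATG, if it has one.
                for j in range(region_start, i, 3):
                    if dna[j:j + 3] == "ATG":
                        found.append((j, i + 3))   # end includes the stop codon
                        break
            region_start = i + 3
    return found


def call_orfs(dna, min_codons=100):
    """Naive ORF calls on both strands as (strand, start, end, sequence).

    Coordinates on the '-' strand refer to the reverse complement.
    """
    dna = dna.upper()
    rc = reverse_complement(dna)
    calls = [("+", s, e, dna[s:e]) for s, e in orfs_on_strand(dna, min_codons)]
    calls += [("-", s, e, rc[s:e]) for s, e in orfs_on_strand(rc, min_codons)]
    return calls


toy = "ATG" + "GCT" * 5 + "TAA"
print(call_orfs(toy, min_codons=3))
```

The default of 100 codons follows the slide's "x is often 100 amino acids"; the toy example uses a much smaller cutoff so that a short sequence yields a call.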

slide-19
SLIDE 19

Intrinsic ORF calling using Markov Models

slide-20
SLIDE 20
  • Based on language processing
  • Common for gene and protein finding, alignments, and so on

Markov Models

slide-21
SLIDE 21

English: the
Spanish: el (la)
Portuguese: que

What is the most common word?

slide-22
SLIDE 22

Scrabble

slide-23
SLIDE 23

In Scrabble, how do they score the letters?

The most abundant letters (easiest to place on the board) are given the lowest score

Scrabble

slide-24
SLIDE 24

1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z

Scrabble

slide-25
SLIDE 25

Frequency of letters

slide-26
SLIDE 26

If I want to make up a sentence, I could choose some letters at random, based on how often they occur in the language (i.e. their Scrabble score):

rla bsht es stsfa ohhofsd

Making up sentences

slide-27
SLIDE 27

What follows a period (“.”)? Usually a space “ ”

What follows a t? Usually an “i” (-tion, -tize, ...)

Let's get clever!

slide-28
SLIDE 28

When the first letter is “t” (from 3,269 words):

ti 51%
te 20%
ta 15%
th 8%

Frequency of two letters

slide-29
SLIDE 29

Choose a letter based on the probability that it follows the letter before:

s h a n d t uc h t i n e y m e l e o l l d

Level 1 analysis
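
A first order model like this can be sketched in a few lines: count which letter follows which, then sample each new letter from the counts for the letter before it. The training sentence below is a made-up toy, not the 3,269-word list the slide's percentages come from, and the function names are illustrative.

```python
import random
from collections import Counter, defaultdict


def train_first_order(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Sample characters one at a time, each conditioned on the one before."""
    out = [start]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:          # dead end: this character was never followed
            break
        letters = list(followers)
        weights = [followers[c] for c in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


# Toy training text; a real model would use a large body of text.
corpus = "the station in the nation is the destination"
model = train_first_order(corpus)
total = sum(model["t"].values())
for letter, n in model["t"].most_common():
    print(f"t{letter}: {n / total:.0%}")
print(generate(model, "t", 30, random.Random(0)))
```

The output is still nonsense, but it already has more word-like letter pairs than picking each letter independently.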

slide-30
SLIDE 30

1 letter (a, e, o …): Zero order model
2 letters (th, ti, sh …): First order model
3 letters (the, and, …): Second order model
4 letters (that, …): Third order model

Levels of analysis

slide-31
SLIDE 31

With about 10th order Markov models of English you get complete words and sentences!

Markov models

slide-33
SLIDE 33

Codons have three letters (ATG, CAC, GGG, ...)

Use a 2nd order Markov model for ORF calling

The frequency of a letter is predicted based on the frequency of the two letters before

Markov Models and ORF calling
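
A 2nd order model over DNA can be sketched as a table of counts: for every pair of bases, how often each base comes next. The code below is an illustration under stated assumptions, not how any specific gene caller works: the function names are hypothetical, add-one smoothing is a choice made here, and scoring by summed log probability is one common way to use such a table.

```python
import math
from collections import Counter, defaultdict


def train_second_order(sequences):
    """Count each base given the two bases before it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    return counts


def prob(counts, context, base, pseudocount=1.0):
    """P(base | two preceding bases), with add-one smoothing over A, C, G, T."""
    seen = counts.get(context, Counter())
    return (seen[base] + pseudocount) / (sum(seen.values()) + 4 * pseudocount)


def log_likelihood(counts, seq):
    """Sum of log P(base | previous two bases) along the sequence."""
    return sum(math.log(prob(counts, seq[i - 2:i], seq[i])) for i in range(2, len(seq)))


# Toy training set; in practice this would be known or likely genes.
model = train_second_order(["ATGATGATG"])
print(log_likelihood(model, "ATGATG"), log_likelihood(model, "ATCATC"))
```

A sequence that resembles the training data scores higher (closer to zero) than one that does not, which is the basis for telling coding from non-coding stretches.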

slide-34
SLIDE 34

Scrabble

slide-35
SLIDE 35

Do English and Spanish use the same letters?

Scrabble (México)

slide-36
SLIDE 36

Scrabble (México)

slide-37
SLIDE 37

1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z

Scrabble (US)

Based on the front page of the NY Times!

slide-38
SLIDE 38

1 point: A, E, O, I, S, N, L, R, U, T
2 points: D, G
3 points: C, B, M, P
4 points: H, F, V, Y
5 points: CH, Q
8 points: J, LL, Ñ, RR, X
10 points: Z

Scrabble (Spanish)

slide-39
SLIDE 39

Will vary with the composition of the organism!

Remember, some organisms have high G+C compared to A+T

What about scrabble scores for DNA?
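
The DNA equivalent of Scrabble letter frequencies is simply the base composition, which is a zero order model. A small sketch, with illustrative function names and the assumption that only A, C, G and T should be counted:

```python
from collections import Counter


def base_frequencies(dna):
    """Zero order model: the fraction of each base in the sequence."""
    counts = Counter(dna.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(dna):
    """Fraction of the sequence that is G or C."""
    freqs = base_frequencies(dna)
    return freqs["G"] + freqs["C"]


print(base_frequencies("AGTAAAACTTTAATTGTTGGTTAA"))
print(gc_content("AGTAAAACTTTAATTGTTGGTTAA"))
```

Running this on sequences from different organisms shows why one set of "scores" cannot serve them all.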

slide-40
SLIDE 40

Use a 2nd order Markov model for ORF calling

The frequency of a letter is predicted based on the frequency of the two letters before

Markov Models and ORF calling

slide-41
SLIDE 41

Need to train the Markov model: not all organisms are the same

Can use phylogenetically close organisms

Can use “long ORFs”: likely to be correct, because a long random stretch is unlikely to lack a stop codon!

Problems!

slide-42
SLIDE 42

Markov models of order 1-8 (word size 2-9)

Discard (or ↓ weight) rare words

Promote (or ↑ weight) common words

Probability is the sum of the probabilities from all orders 1-8 (word sizes 2-9)

Interpolated Markov Model

(The imm in GLIMMER)
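
The idea of blending several orders can be sketched as follows. This is only an illustration of the principle on the slide: the weighting used here (each order weighted by how often its context was seen in training) is an assumption made for the example and is not GLIMMER's actual scheme, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_contexts(sequences, max_order):
    """Count next-base occurrences for every context of length 1..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq)):
            for order in range(1, max_order + 1):
                if i - order >= 0:
                    counts[seq[i - order:i]][seq[i]] += 1
    return counts


def interpolated_prob(counts, history, base, max_order, pseudocount=1.0):
    """Blend the estimates of all orders, weighting each by how often its
    context was seen in training (an illustrative weighting, not GLIMMER's)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for order in range(1, min(max_order, len(history)) + 1):
        context = history[-order:]
        seen = counts.get(context)
        if not seen:
            continue                      # rare word never observed: weight 0
        n = sum(seen.values())
        estimate = (seen[base] + pseudocount) / (n + 4 * pseudocount)
        weighted_sum += n * estimate      # common contexts count for more
        total_weight += n
    if total_weight == 0:
        return 0.25                       # no usable context: uniform guess
    return weighted_sum / total_weight


# Toy example with max_order=3; the slide uses orders up to 8.
imm = train_contexts(["ATGATGATG"], max_order=3)
print(interpolated_prob(imm, "GAT", "G", max_order=3))
```

Long contexts are informative but rarely seen, short ones are common but vague; weighting by count lets whichever has enough data dominate.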

slide-43
SLIDE 43

As with proteins, two main methods:

Ab initio (intrinsic)

Homology based (extrinsic)

RNA genes

slide-44
SLIDE 44

Ribosomes are made of proteins and RNA

Ribosomes

slide-45
SLIDE 45

30S subunit from Thermus aquaticus

Blue: protein
Orange: rRNA

slide-46
SLIDE 46
  • E. coli

16S rRNA secondary structure

slide-47
SLIDE 47

Variable region Conserved region

slide-48
SLIDE 48

Variable regions in the 16S rRNA

Vn: the 9 variable regions; (n): variable loop(s); forward/rev primers

V1 (6), V2 (8-11), V3 (18), V4 (P23-1, 24), V5 (28, 29), V6 (37), V7 (43), V8 (45, 46), V9 (49)

Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:3381-3391
slide-49
SLIDE 49

Ribosomes are made of proteins and RNA

Prokaryotic ribosome:

Large subunit: 50S (5S and 23S rRNA genes)

Small subunit: 30S (16S rRNA gene)

Ribosomes

slide-50
SLIDE 50

Easiest way is iterative:

  • BLAST
  • ALIGN
  • TRIM

Problem: secondary structure makes identification of the ends difficult

Finding 16S genes

slide-51
SLIDE 51

Not as easy as rRNA

Much shorter

Varied sequence

Only conservation is 2° structure

Finding tRNA genes

slide-52
SLIDE 52

tRNAscan-SE

Sean Eddy

Use it!

slide-53
SLIDE 53

How does this relate to tRNA?

tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons https://commons.wikimedia.org/wiki/File:TRNA-Phe_yeast_en.svg

slide-54
SLIDE 54

tRNA structure

  • Start of acceptor stem (7-9 bp)
  • D-loop (4-6 bp stem plus loop)
  • Anticodon arm (6 bp stem plus loop with anticodon)
  • T-loop (4-5 bp stem plus loop)
  • End of acceptor stem (7-9 bp)
  • CCA to attach amino acid (may not be in sequence ... added during processing)