Eukaryotic Gene Prediction - PowerPoint PPT Presentation

SLIDE 1

Eukaryotic Gene Prediction

SLIDE 2

Eukaryotic gene structure

SLIDE 3

Translation

SLIDE 4

Gene Finding: The 1st generation

  • Given genomic DNA, does it contain a gene (or not)?
  • Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions.
  • Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

SLIDE 5

Coding versus non-coding

  • Fickett and Tung (1992) compared various measures
  • Measures that preserve the triplet frame are the most successful.
  • Genscan: 5th order Markov Model
  • Assignment 2 (Conservation implies a protein coding measure)

SLIDE 6

Coding vs. non-coding regions

Given: three 5th order transition matrices $C^{(1)}, C^{(2)}, C^{(3)}$ trained on coding exons.

$$P_h(X_{a,b}) = \prod_{i=0}^{b-a} C^{((h+i) \bmod 3 + 1)}[X_{a+i}]$$

Coding ratio: $r = \dfrac{P_h(X_{a,b})}{P_D(X_{a,b})}$

Coding score: $s = \log_2(r)$

Compute the average coding score (per base) of exons and introns, and take the difference. If the measure is good, the difference must be biased away from 0.
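The coding score above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $P_D$ is taken to be a frame-independent background model trained on non-coding sequence (the slide does not define it), add-one smoothing is an added choice, and the names `train_markov`, `coding_score` and the `order` parameter are hypothetical. Frames are 0-based in the code, so `coding_models[0]` plays the role of $C^{(1)}$.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, order, frame=None):
    """Estimate P(base | previous `order` bases) with add-one smoothing.

    If `frame` is 0, 1 or 2, only positions i with i % 3 == frame are
    counted, giving one of the three frame-specific matrices.
    """
    counts = defaultdict(lambda: {b: 1.0 for b in ALPHABET})  # pseudocounts
    for s in seqs:
        for i in range(order, len(s)):
            if frame is None or i % 3 == frame:
                counts[s[i - order:i]][s[i]] += 1.0
    model = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        model[ctx] = {b: c / total for b, c in row.items()}
    return model

def log2_prob(model, ctx, base):
    # Contexts never seen in training fall back to a uniform distribution.
    return math.log2(model.get(ctx, {}).get(base, 0.25))

def coding_score(seq, coding_models, background, order, h=0):
    """Average per-base coding score log2(P_h / P_D) for `seq` read in frame h."""
    total, n = 0.0, 0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        total += log2_prob(coding_models[(h + i) % 3], ctx, base)
        total -= log2_prob(background, ctx, base)
        n += 1
    return total / n if n else 0.0
```

With `order=5` this corresponds to the 5th order setting on the slide; a smaller order is convenient for toy data. Averaging `coding_score` over annotated exons and over introns, then subtracting, gives the coding differential.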

SLIDE 7

Coding differential for 380 genes

SLIDE 8

Other Signals

[Figure labels: GT, ATG, AG; Coding]

SLIDE 9

Coding region can be detected

[Figure label: Coding]

  • Plot the coding score using a sliding window of fixed length.
  • The (large) exons will show up reliably.
  • Not enough to predict gene boundaries reliably
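A minimal sketch of the sliding-window idea, assuming a list of per-base coding scores is already available. The function names and the threshold are hypothetical, chosen only for illustration.

```python
def sliding_window_scores(per_base_scores, window):
    """Mean score of every length-`window` stretch, using a running sum."""
    if window <= 0 or window > len(per_base_scores):
        return []
    running = sum(per_base_scores[:window])
    means = [running / window]
    for i in range(window, len(per_base_scores)):
        running += per_base_scores[i] - per_base_scores[i - window]
        means.append(running / window)
    return means

def call_regions(window_means, threshold):
    """Half-open (start, end) ranges of window indices whose mean exceeds threshold."""
    regions, start = [], None
    for i, m in enumerate(window_means):
        if m > threshold and start is None:
            start = i
        elif m <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(window_means)))
    return regions
```

On a toy profile with a high-scoring stretch at positions 10 to 29 and a window of 5, the called region starts at window 8 rather than 10. Smoothing finds the exon but blurs its edges, which is the imprecision the last bullet refers to.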

SLIDE 10

Other Signals

[Figure labels: GT, ATG, AG; Coding]

  • Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise.
  • When combined they can be effective

SLIDE 11

The second generation of Gene finding

  • Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure.
  • It was not easy to train on many parameters.
  • Guigo & Burset test revealed that accuracy was still very low.
  • Problem with multiple genes in a genomic region

SLIDE 12

HMMs and gene finding

  • HMMs allow for a systematic approach to merging many signals.
  • They can model multiple genes and partial genes in a genomic region, as well as genes on both strands.

SLIDE 13

The Viterbi Algorithm

Let $v_k(i)$ be the probability of the most likely path that ends in state $k$ and emits symbols $x_1 \ldots x_i$. Then,

$$v_k(i+1) = e_k(x_{i+1}) \max_l \left( v_l(i)\, a_{lk} \right)$$
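The recurrence can be sketched as follows, working in log space to avoid underflow and keeping back-pointers so the best path can be recovered. The dictionary-based parameters (`start_p`, `trans_p`, `emit_p`) are an assumed representation, not something the slides specify.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for `obs` and its log-probability."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    if not obs:
        return [], 0.0
    # v[i][k]: log-probability of the best path emitting obs[:i+1] and ending in k
    v = [{k: log(start_p[k]) + log(emit_p[k][obs[0]]) for k in states}]
    back = [{}]
    for i in range(1, len(obs)):
        v.append({})
        back.append({})
        for k in states:
            # max over predecessor states l of v_l(i-1) * a_lk
            best_l = max(states, key=lambda l: v[i - 1][l] + log(trans_p[l][k]))
            v[i][k] = (log(emit_p[k][obs[i]])
                       + v[i - 1][best_l] + log(trans_p[best_l][k]))
            back[i][k] = best_l
    # Backtrack from the best final state.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, v[-1][last]
```

For example, with a toy two-state model in which state `C` prefers G/C and state `N` prefers A/T, the sequence `AAAAGGGGGG` is parsed as four `N` positions followed by six `C` positions.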

SLIDE 14

HMMs and gene finding

  • The Viterbi algorithm (and backtracking) allows us to parse a string through the states of an HMM
  • Can we describe Eukaryotic gene structure by the states of an HMM?
  • This could be a solution to the GF problem.

SLIDE 15

An HMM for Gene structure

SLIDE 16

Generalized HMMs, and other refinements

  • A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described
  • In standard HMMs, there is an exponential (in discrete time, geometric) distribution on the duration of time spent in a state.
  • This is violated by many states of the gene structure HMM. Solution is to model these using generalized HMMs.
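To make the duration claim concrete: a standard HMM state with self-transition probability p stays for exactly d steps with probability p^(d-1) (1 - p). The short sketch below, with hypothetical function names, tabulates that distribution.

```python
def geometric_duration_pmf(stay_p, max_len):
    """P(duration = d) for d = 1..max_len, given self-transition stay_p."""
    return [stay_p ** (d - 1) * (1.0 - stay_p) for d in range(1, max_len + 1)]

def mean_duration(stay_p):
    """Expected number of consecutive steps spent in the state."""
    return 1.0 / (1.0 - stay_p)
```

Whatever value of `stay_p` is chosen, the most probable duration is always d = 1 and the probabilities decay monotonically. A length distribution that peaks at some longer length cannot be represented this way, which is the motivation for explicit duration models.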

SLIDE 17

Length distributions of Introns & Exons

SLIDE 18

Generalized HMM for gene finding

  • Each state also emits a ‘duration’ for which it will cycle in the same state. The time is generated according to a random process that depends on the state.

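SLIDE 19

Forward algorithm for gene finding

[Figure labels: $j$, $i$, $q_k$]

$$F_k(i) = \sum_{j<i} P_{q_k}(X_{j,i})\, f_{q_k}(i-j+1) \sum_{l \in Q} a_{lk}\, F_l(j)$$

Here $P_{q_k}(X_{j,i})$ is the probability that state $q_k$ emits the segment $X_{j,i}$, $f_{q_k}$ is the duration distribution of $q_k$, and $a_{lk}$ is the transition probability from state $l$ to state $k$.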
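A minimal sketch of a generalized-HMM forward pass. The indexing is an assumption that differs slightly from the slide: segments are 0-based half-open slices `seq[j:i]`, so their duration is `i - j`. The callables `duration_p` and `segment_p`, the `start_p` table and the `max_dur` cap are hypothetical interfaces introduced for illustration.

```python
def ghmm_forward(seq, states, start_p, trans_p, duration_p, segment_p, max_dur):
    """Total probability of `seq`, summed over all segmentations.

    F[i][k] is the probability of emitting seq[:i] with a segment of
    state k ending exactly at position i.
    """
    n = len(seq)
    F = [{k: 0.0 for k in states} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in states:
            total = 0.0
            for j in range(max(0, i - max_dur), i):
                # Probability that state k emits seq[j:i] with this duration.
                seg = segment_p(k, seq[j:i]) * duration_p(k, i - j)
                if j == 0:
                    total += start_p[k] * seg  # first segment of the parse
                else:
                    total += seg * sum(trans_p[l][k] * F[j][l] for l in states)
            F[i][k] = total
    return sum(F[n].values()), F
```

The triple loop costs on the order of n × max_dur × |Q|² operations, which is why practical gene finders bound segment lengths or restrict the durations they consider. Replacing the sums with maxima gives the corresponding Viterbi-style parse.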

SLIDE 20

HMMs and Gene finding

  • Generalized HMMs are an attractive model for computational gene finding
  • Allow incorporation of various signals
  • Quality of gene finding depends upon quality of signals.

SLIDE 21

Signals

  • Coding versus non-coding
  • Splice Signals
  • Translation start

SLIDE 22

Splice signals

  • GT is a Donor signal, and AG is the acceptor signal

[Figure labels: GT, AG]

SLIDE 23

PWMs

  • Fixed length for the splice signal.
  • Each position is generated independently according to a distribution
  • Figure shows data from more than 1200 donor sites

Positions: -3 -2 -1 +1 +2 +3 +4 +5 +6

AAGGTGAGT
CCGGTAAGT
GAGGTGAGG
TAGGTAAGG
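A position weight matrix can be sketched as below. The four 9-mers from the slide stand in for the full donor-site set, and the pseudocount and uniform background of 0.25 are assumptions added for illustration.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """One probability distribution per position, estimated independently."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in alphabet}
        for s in sites:
            counts[s[pos]] += 1.0
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

def pwm_log_odds(pwm, candidate, background=0.25):
    """Sum over positions of log2(P(base at position) / background)."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, candidate))

donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
donor_pwm = build_pwm(donor_sites)
```

A candidate that keeps the invariant GT at positions +1 and +2 scores higher than one that does not. Because the score is a sum over positions, it cannot reward or penalize combinations of bases at different positions.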

SLIDE 24

MDD

  • PWMs do not capture correlations between positions
  • Many position pairs in the Donor signal are correlated

SLIDE 25

  • Choose the position which has the highest correlation score.
  • Split sequences into two: those which have the consensus at position i, and the remaining.
  • Recurse until the terminating conditions are met.
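The three steps above can be sketched as a recursive split. The slide leaves both the correlation score and the terminating conditions open, so this sketch makes assumptions: the score for position i is the sum of chi-square statistics between "has the consensus at i" and the base at every other position, and recursion stops when a subset has fewer than `min_size` sequences or no dependence remains. All names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"

def chi_square(sites, i, j, consensus_i):
    """Chi-square statistic between 'consensus at i' (yes/no) and the base at j."""
    n = len(sites)
    table = Counter((s[i] == consensus_i, s[j]) for s in sites)
    row = Counter(s[i] == consensus_i for s in sites)
    col = Counter(s[j] for s in sites)
    stat = 0.0
    for r in (True, False):
        for b in ALPHABET:
            expected = row[r] * col[b] / n
            if expected > 0:
                stat += (table[(r, b)] - expected) ** 2 / expected
    return stat

def mdd(sites, positions=None, min_size=4):
    """Return the split tree as nested dicts; leaves hold the site subsets."""
    length = len(sites[0])
    positions = list(range(length)) if positions is None else positions
    if len(sites) < min_size or not positions:
        return {"leaf": sites}
    consensus = {i: Counter(s[i] for s in sites).most_common(1)[0][0]
                 for i in positions}
    scores = {i: sum(chi_square(sites, i, j, consensus[i])
                     for j in range(length) if j != i)
              for i in positions}
    best = max(scores, key=scores.get)
    if scores[best] == 0:  # no dependence left to exploit
        return {"leaf": sites}
    match = [s for s in sites if s[best] == consensus[best]]
    rest = [s for s in sites if s[best] != consensus[best]]
    remaining = [p for p in positions if p != best]
    return {"position": best, "consensus": consensus[best],
            "match": mdd(match, remaining, min_size),
            "rest": mdd(rest, remaining, min_size)}
```

In a full model, a separate PWM would then be estimated from the sequences in each leaf, so that the distribution at one position can depend on the bases at the split positions.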

SLIDE 26

MDD for Donor sites

SLIDE 27

De novo Gene prediction: Summary

  • Various signals distinguish coding regions from non-coding
  • HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
  • Further improvement may come from improved signal detection

SLIDE 28

How many genes do we have?

[Figure labels: Nature, Science]

SLIDE 29

Alternative splicing

SLIDE 30

Comparative methods

  • Gene prediction is harder with alternative splicing.
  • One approach might be to use comparative methods to detect genes
  • Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
  • Yes, with a variant on alignment algorithms that penalize separately for introns, versus other gaps.
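One way to sketch such an alignment variant is a global dynamic program with two tables: one for ordinary columns and one for "inside an intron", where an intron pays a single flat penalty regardless of its length. This is an illustrative simplification, not the algorithm of any particular tool. The scoring values are arbitrary assumptions, and splice signals (GT/AG) are ignored.

```python
NEG_INF = float("-inf")

def spliced_alignment_score(mrna, genomic, match=1, mismatch=-1, gap=-2, intron=-3):
    """Best global score aligning `mrna` to `genomic` with flat-cost introns."""
    n, m = len(mrna), len(genomic)
    # S: best score ending in a match, mismatch or ordinary gap column.
    # N: best score ending inside an intron (a run of skipped genomic bases).
    S = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    N = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if j > 0:
                # Open an intron (flat penalty) or extend one for free.
                N[i][j] = max(S[i][j - 1] + intron, N[i][j - 1])
            best = NEG_INF
            if i > 0 and j > 0:
                sub = match if mrna[i - 1] == genomic[j - 1] else mismatch
                best = max(best, max(S[i - 1][j - 1], N[i - 1][j - 1]) + sub)
            if i > 0:  # mRNA base aligned to a gap
                best = max(best, max(S[i - 1][j], N[i - 1][j]) + gap)
            if j > 0:  # genomic base aligned to an ordinary gap
                best = max(best, max(S[i][j - 1], N[i][j - 1]) + gap)
            S[i][j] = best
    return max(S[n][m], N[n][m])
```

With two 6-base exons separated by an 18-base intron, the spliced mRNA aligns with 12 matches and one intron penalty. Treating the same 18 bases as ordinary gaps would cost 18 gap penalties instead, which is why the two kinds of gap are scored separately.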

SLIDE 31

Comparative gene finding tools

  • Procrustes/Sim4: mRNA vs. genomic
  • Genewise: proteins versus genomic
  • CEM: genomic versus genomic
  • Twinscan: Combines comparative and de novo approach.