Eukaryotic Gene Prediction
Eukaryotic gene structure
Translation
Gene Finding: The 1st generation
- Given genomic DNA, does it contain a gene (or not)?
- Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions.
- Therefore, a statistical test can be used to discriminate between coding and non-coding regions.
Coding versus non-coding
- Fickett and Tung (1992) compared various measures.
- Measures that preserve the triplet frame are the most successful.
- Genscan: 5th order Markov model
- Assignment 2 (Conservation implies a protein coding measure)
Coding vs. non-coding regions
Given: three 5th order transition matrices $C^{(1)}, C^{(2)}, C^{(3)}$ trained on coding exons.
$$P_h(X_{a,b}) = \prod_{i=0}^{b-a} C^{((h+i) \bmod 3 + 1)}[X_{a+i}]$$
Coding ratio: $r = P_h(X_{a,b}) / P_D(X_{a,b})$
Coding score: $s = \log_2 r$
Compute average coding score (per base) of exons and introns, and take the difference. If the measure is good, the difference must be biased away from 0.
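The per-base coding score can be sketched as follows. This is a minimal illustration under several assumptions, not the method used in the lecture: the slide does not define $P_D$, so it is taken here to be a single frame-independent Markov model trained on non-coding sequence; the first `order` bases of each sequence are skipped because they lack a full context; and the function names, the pseudocount, and the toy training data are all hypothetical.

```python
import math
from collections import defaultdict

def train_tables(seqs, order, periodic):
    """Count (context -> next base) transitions.

    With periodic=True there is one table per codon position (3 tables);
    otherwise a single table is shared by all positions.
    Sequences are assumed to start at codon position 0.
    """
    n_tables = 3 if periodic else 1
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(n_tables)]
    for seq in seqs:
        for pos in range(order, len(seq)):
            tables[pos % n_tables][seq[pos - order:pos]][seq[pos]] += 1
    return tables

def transition_prob(table, context, base, pseudo=1.0):
    """Smoothed P(base | context); unseen contexts fall back to uniform."""
    row = table.get(context, {})
    return (row.get(base, 0) + pseudo) / (sum(row.values()) + 4 * pseudo)

def coding_score_per_base(seq, coding, background, h=0, order=5):
    """Average of log2(P_h / P_D) over the bases that have a full context."""
    total, n = 0.0, 0
    for pos in range(order, len(seq)):
        ctx, base = seq[pos - order:pos], seq[pos]
        p_h = transition_prob(coding[(h + pos) % 3], ctx, base)
        p_d = transition_prob(background[0], ctx, base)  # assumed P_D
        total += math.log2(p_h / p_d)
        n += 1
    return total / n if n else 0.0

# Toy data (hypothetical), order 2 so the tiny training set is enough.
coding = train_tables(["ATGGCCATGGCC" * 5], order=2, periodic=True)
background = train_tables(["ACGTTGCAAGCTTAGC" * 5], order=2, periodic=False)
exon_score = coding_score_per_base("ATGGCCATGGCC", coding, background, order=2)
intron_score = coding_score_per_base("ACGTTGCAAGCTTAGC", coding, background, order=2)
```

The coding differential described above would then be `exon_score - intron_score`, averaged over the exons and introns of each gene.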
Coding differential for 380 genes
Other Signals
[Figure labels: GT, ATG, AG, Coding]
Coding region can be detected
[Figure label: Coding]
- Plot the coding score using a sliding window of fixed length.
- The (large) exons will show up reliably.
- Not enough to predict gene boundaries reliably.
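A sliding-window plot of this kind needs only a running sum over per-base scores. The sketch below assumes the per-base coding scores have already been computed and are passed in as a list; the function name is hypothetical.

```python
def sliding_window_means(per_base_scores, window):
    """Mean score of every full-length window, computed with a running sum."""
    if window <= 0 or window > len(per_base_scores):
        return []
    total = sum(per_base_scores[:window])
    means = [total / window]
    for end in range(window, len(per_base_scores)):
        # Slide one base to the right: add the new base, drop the oldest one.
        total += per_base_scores[end] - per_base_scores[end - window]
        means.append(total / window)
    return means
```

Windows that straddle an exon boundary give intermediate values, which is one way to see why this signal alone does not locate boundaries precisely.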
Other Signals
[Figure labels: GT, ATG, AG, Coding]
- Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise.
- When combined they can be effective.
The second generation of Gene finding
- Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure.
- It was not easy to train on many parameters.
- The Guigo & Burset test revealed that accuracy was still very low.
- Problem with multiple genes in a genomic region.
HMMs and gene finding
- HMMs allow for a systematic approach to merging many signals.
- They can model multiple genes, partial genes in a genomic region, as also genes on both strands.
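The Viterbi Algorithm
Let $v_k(i)$ be the probability of the most likely path that ends in state $k$ and emits symbols $x_1 \ldots x_i$. Then,
$$v_k(i+1) = e_k(x_{i+1}) \max_l \big( v_l(i)\, a_{lk} \big)$$
where $e_k(x)$ is the probability that state $k$ emits symbol $x$ and $a_{lk}$ is the transition probability from state $l$ to state $k$.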
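The recurrence, together with backtracking, can be sketched in a few lines. The version below works in log space to avoid underflow on long sequences and assumes a non-empty observation string; the argument names (`start_p`, `trans_p`, `emit_p`) are hypothetical and are not taken from the slides.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability, state path) of the most likely parse of obs."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # v[i][k]: best log-probability of a path emitting obs[:i+1] and ending in k
    v = [{k: log(start_p[k]) + log(emit_p[k][obs[0]]) for k in states}]
    back = []
    for x in obs[1:]:
        scores, pointers = {}, {}
        for k in states:
            best = max(states, key=lambda l: v[-1][l] + log(trans_p[l][k]))
            scores[k] = log(emit_p[k][x]) + v[-1][best] + log(trans_p[best][k])
            pointers[k] = best
        v.append(scores)
        back.append(pointers)

    # Backtrack from the best final state.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return v[-1][last], path
```

With a toy two-state model (an AT-rich state and a GC-rich state, both made up for illustration), the returned path labels each base with the state that best explains it.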
HMMs and gene finding
- The Viterbi algorithm (and backtracking) allows us to parse a string through the states of an HMM.
- Can we describe Eukaryotic gene structure by the states of an HMM?
- This could be a solution to the gene finding (GF) problem.
An HMM for Gene structure
Generalized HMMs, and other refinements
- A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described.
- In standard HMMs, the time spent in a state follows a geometric (exponentially decaying) distribution.
- This is violated by many states of the gene structure HMM. The solution is to model these using generalized HMMs.
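The shape of the duration distribution in a standard HMM follows from one line of arithmetic. As a sketch, take a state $k$ with self-transition probability $a_{kk}$ (the name $D$ for the number of consecutive positions spent in the state is introduced here only for illustration). Staying for exactly $d$ positions means taking the self-transition $d-1$ times and then leaving:

$$P(D = d) = a_{kk}^{\,d-1}\,(1 - a_{kk}), \qquad E[D] = \frac{1}{1 - a_{kk}}$$

The probability shrinks by the constant factor $a_{kk}$ with every extra position, so the most likely duration is always 1. A state whose observed lengths peak at some larger value cannot be matched by any choice of $a_{kk}$.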
Length distributions of Introns & Exons
Generalized HMM for gene finding
- Each state also emits a 'duration' for which it will cycle in the same state. The time is generated according to a random process that depends on the state.
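Forward algorithm for gene finding
$$F_k(i) = \sum_{j<i} P_{q_k}(X_{j,i})\, f_{q_k}(i - j + 1) \sum_{l \in Q} a_{lk}\, F_l(j)$$
Here $P_{q_k}(X_{j,i})$ is the probability that state $q_k$ emits the segment $X_{j,i}$, $f_{q_k}(d)$ is the probability that state $q_k$ lasts for $d$ positions, and $a_{lk}$ is the transition probability from state $q_l$ to state $q_k$.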
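The generalized forward recursion can be sketched as below. Several choices here are assumptions rather than anything stated on the slide: segments are half-open slices `x[j:i]` of length `i - j` (which sidesteps the question of whether position `j` belongs to the previous segment), the first segment is weighted by a start distribution `start_p`, segment lengths are capped at `max_len` to bound the running time, and `duration_p` and `segment_p` are hypothetical callables supplied by the caller.

```python
def ghmm_forward(x, states, start_p, trans_p, duration_p, segment_p, max_len):
    """Total probability of x, summed over all segmentations and state labelings.

    F[k][i] is the probability of emitting x[:i] with a segment of state k
    ending exactly at position i.
    """
    n = len(x)
    F = {k: [0.0] * (n + 1) for k in states}
    for i in range(1, n + 1):
        for k in states:
            total = 0.0
            for j in range(max(0, i - max_len), i):  # current segment is x[j:i]
                if j == 0:
                    incoming = start_p[k]
                else:
                    incoming = sum(F[l][j] * trans_p[l][k] for l in states)
                total += segment_p(k, x[j:i]) * duration_p(k, i - j) * incoming
            F[k][i] = total
    return sum(F[k][n] for k in states)
```

Replacing the sums over `j` and `l` with maxima (and recording the argmax) gives the corresponding Viterbi-style parse. The cap `max_len` keeps the cost at roughly `len(x) * max_len * len(states) ** 2` segment evaluations.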
HMMs and Gene finding
- Generalized HMMs are an attractive model for computational gene finding.
- Allow incorporation of various signals.
- Quality of gene finding depends upon quality of signals.
Signals
- Coding versus non-coding
- Splice Signals
- Translation start
Splice signals
- GT is the donor signal, and AG is the acceptor signal.
[Figure labels: GT, AG]
PWMs
- Fixed length for the splice signal.
- Each position is generated independently according to a distribution.
- Figure shows data from > 1200 donor sites.
Positions: -3 -2 -1 +1 +2 +3 +4 +5 +6
AAGGTGAGT
CCGGTAAGT
GAGGTGAGG
TAGGTAAGG
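A position weight matrix can be built and applied in a few lines. The sketch below uses only the four 9-base sequences listed above as toy training data; the pseudocount and the uniform background of 0.25 per base are assumptions added for illustration, and the function names are hypothetical.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudo=1.0):
    """One base distribution per position, estimated independently per column."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        total = len(column) + pseudo * len(alphabet)
        pwm.append({b: (column.count(b) + pseudo) / total for b in alphabet})
    return pwm

def log_odds(pwm, candidate, background=0.25):
    """Sum of per-position log2 ratios against a uniform background (assumed)."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(candidate))

donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(donor_sites)
```

Because the score is a plain sum over positions, it cannot reward or penalize particular combinations of bases at different positions.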
MDD (Maximal Dependence Decomposition)
- PWMs do not capture correlations between positions.
- Many position pairs in the Donor signal are correlated.
- Choose the position which has the highest correlation score.
- Split sequences into two: those which have the consensus at position i, and the remaining.
- Recurse until <Terminating conditions>
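One split step of this procedure can be sketched as follows. The details are assumptions layered on the bullets above: the correlation score of position i is taken to be the sum, over all other positions j, of the Pearson chi-square statistic between "position i matches the consensus" and the base at position j; the consensus is a single base per position; and the recursion and its terminating conditions are left out. The function names are hypothetical.

```python
from collections import Counter

def chi_square(pairs):
    """Pearson chi-square statistic for a list of (row_label, col_label) pairs."""
    n = len(pairs)
    rows = Counter(r for r, _ in pairs)
    cols = Counter(c for _, c in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def mdd_split(sites, consensus):
    """Pick the position with the highest summed chi-square and split on it."""
    length = len(consensus)

    def score(i):
        return sum(
            chi_square([(s[i] == consensus[i], s[j]) for s in sites])
            for j in range(length)
            if j != i
        )

    best = max(range(length), key=score)
    match = [s for s in sites if s[best] == consensus[best]]
    rest = [s for s in sites if s[best] != consensus[best]]
    return best, match, rest
```

A full decomposition would call `mdd_split` again on `match` and `rest` and fit a separate PWM to each leaf subset.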
MDD for Donor sites
De novo Gene prediction: Summary
- Various signals distinguish coding regions from non-coding.
- HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
- Further improvement may come from improved signal detection.
How many genes do we have?
Nature Science
Alternative splicing
Comparative methods
- Gene prediction is harder with alternative splicing.
- One approach might be to use comparative methods to detect genes.
- Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
- Yes, with a variant on alignment algorithms that penalize separately for introns, versus other gaps.
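The idea of charging introns differently from ordinary gaps can be sketched with a small dynamic program. This is an illustrative global alignment only, not the algorithm of any particular tool: an intron costs a single fixed penalty regardless of its length, ordinary gaps cost a penalty per base, splice-site signals (GT/AG) are ignored, and all scoring values and names are made up.

```python
def spliced_alignment_score(mrna, genome, match=1, mismatch=-1, gap=-2, intron=-3):
    """Best global score aligning mrna to genome, with length-independent introns."""
    NEG = float("-inf")
    m, n = len(mrna), len(genome)
    # aligned[i][j]: best score for mrna[:i] vs genome[:j], not inside an intron
    # skipped[i][j]: best score when genome[j-1] lies inside an intron
    aligned = [[NEG] * (n + 1) for _ in range(m + 1)]
    skipped = [[NEG] * (n + 1) for _ in range(m + 1)]
    aligned[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sub = match if mrna[i - 1] == genome[j - 1] else mismatch
                options.append(max(aligned[i - 1][j - 1], skipped[i - 1][j - 1]) + sub)
            if i > 0:
                options.append(aligned[i - 1][j] + gap)   # mRNA base left unaligned
            if j > 0:
                options.append(aligned[i][j - 1] + gap)   # ordinary one-base gap
                # Open an intron (one-time cost) or extend one for free.
                skipped[i][j] = max(aligned[i][j - 1] + intron, skipped[i][j - 1])
            aligned[i][j] = max(options)
    return max(aligned[m][n], skipped[m][n])
```

With these toy parameters, an mRNA whose two halves flank a 12-base insertion in the genome scores 8 matches minus one intron penalty, whereas treating the insertion as ordinary gaps would cost 2 per base.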
Comparative gene finding tools
- Procrustes/Sim4: mRNA vs. genomic