CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 - - PowerPoint PPT Presentation

cse182 l10
SMART_READER_LITE
LIVE PREVIEW

CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 - - PowerPoint PPT Presentation

CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 0.4 0.4 E F (H)=0.5 E L (H)=0.1 November 09 0.6 0.6 1 0.4 0.4 E F (H)=0.5 E L (H)=0.1 H H T T T is the observed sequence 0.6 1.5e-1 4.5e-2 1.3e-2 5.8e-3 0.5 1 0 0.4


slide-1
SLIDE 1

CSE182-L10

Gene Finding

November 09

slide-2
SLIDE 2

HMM ‘fair-coin’ example

November 09

EF(H)=0.5 EL(H)=0.1 0.6 0.6 0.4 0.4 1

slide-3
SLIDE 3
  • H H T T T is the observed sequence

November 09

EF(H)=0.5 EL(H)=0.1 0.6 0.6 0.4 0.4

1 0.6 0.4 0.5 1.5e-1 4.5e-2 1.3e-2 5.8e-3 2e-2 5.4e-2 2.9e-2 1.6e-2

1

slide-4
SLIDE 4

Syllabus for midterm

  • Sequence alignment using Blast

– Global, local, space saving, affine gap costs

  • P-value, e-value computation
  • Pigeonhole principle, keyword matching
  • Column specific scoring (profiles)
  • Pattern matching (regular expressions)
  • HMMs

November 09

slide-5
SLIDE 5

Translation

  • The ribosomal machinery

reads mRNA.

  • Each triplet is

translated into a unique amino-acid until the STOP codon is encountered.

  • There is also a special

signal where translation starts, usually at the ATG (M) codon.

November 09

slide-6
SLIDE 6

Translation

  • The ribosomal machinery

reads mRNA.

  • Each triplet is translated

into a unique amino-acid until the STOP codon is encountered.

  • There is also a special signal

where translation starts, usually at the ATG (M) codon.

  • Given a DNA sequence, how

many ways can you translate it?

November 09

slide-7
SLIDE 7

Eukaryotic gene structure

  • The coding regions of a gene are

discontiguous regions (exons), separated by non-coding regions (introns).

  • Transcription initially copies the

entire region into RNA

  • The introns are ‘spliced out’ to

form the mature mRNA (message)

  • Translation starts from an

intitiating ATG somewhere in the message.

November 09

slide-8
SLIDE 8

Gene Features

ATG

5’ UTR intron exon 3’ UTR Acceptor Donor splice site Transcription start Translation start

November 09

slide-9
SLIDE 9

Gene Features

  • The gene can lie on any strand

(relative to the reference genome)

  • The code can be in one of 3

frames. AGTAGAGTATAGTGGACG S R V * W R V Q Y S G * S I V D

Frame 1 Frame 2 Frame 3

  • ve strand

TCATCTCATATCACCTGC

November 09

slide-10
SLIDE 10

Gene identification

  • Eukaryotic gene definitions:

– Location that codes for a protein – The transcript sequence(s) that encodes the protein – The protein sequence(s)

  • Suppose you want to know all of the genes in an organism.
  • This was a major problem in the 70s. PhDs, and careers were

spent isolating a single gene sequence.

  • All of that changed with better reagents and the development
  • f high throughput methods like EST sequencing
  • With genome sequencing, the initial problem became

computational.

November 09

slide-11
SLIDE 11

Computational Gene Finding

  • Given Genomic DNA, identify all the coordinates
  • f the gene
  • TRIVIA QUIZ! What is the name of the FIRST

gene finding program? (google testcode)

ATG

5’ UTR intron exon 3’ UTR Acceptor Donor splice site Transcription start Translation start

November 09

slide-12
SLIDE 12

Gene Finding: The 1st generation

  • Given genomic DNA, does it contain a gene (or

not)?

  • Key idea: The distributions of nucleotides is

different in coding (translated exons) and non- coding regions.

  • Therefore, a statistical test can be used to

discriminate between coding and non-coding regions.

November 09

slide-13
SLIDE 13

Coding versus non-coding

  • You are given a collection of exons, and a

collection of intergenic sequence.

  • Count the number of occurrences of ATGATG in

Introns and Exons. – Suppose 1% of the hexamers in Exons are ATGATG – Only 0.01% of the hexamers in Intergenic are ATGATG

  • How can you use this idea to find genes?

November 09

slide-14
SLIDE 14

Generalizing

AAAAAA AAAAAC AAAAAG AAAAAT

I E

  • Compute a frequency count for all hexamers.
  • Exons, Intergenic and the sequence X are all vectors

in a multi-dimensional space

  • Use this to decide whether a sequence X is exonic/

intergenic.

10 5 20 10

X

10 5

Frequencies (X10-5)

November 09

slide-15
SLIDE 15

A geometric approach (2 hexamers)

  • Plot the following vectors

– E= [10, 20] – I = [10, 5] – V3 = [6, 10] – V4 = [9, 15]

  • Is V3 more like E or more

like I?

5 20 15 10 15 10 5

E I V3

November 09

slide-16
SLIDE 16

Choosing between Intergenic and Exonic

  • Normalize V’ = V/||V||
  • All vectors have the

same length (lie on the unit circle)

  • Next, compute the

angle to E, and I.

  • Choose the feature

that is ‘closer’ (smaller angle. E I V3

β

α

E - score(V3) = α α + β

November 09

slide-17
SLIDE 17

Coding versus non-coding signals

  • Fickett and Tung (1992) compared various

measures

  • Measures that preserve the triplet frame are

the most successful.

  • Genscan uses a 5th order Markov Model

November 09

slide-18
SLIDE 18

5th order markov chain

  • PrEXON[AAAAAACGAGAC..]

=T[AAAAA,A] T[AAAAA,C] T[AAAAC,G] T[AAACG,A]……

= (20/78) (50/78)………. AAAAAA 20 1 AAAAAC 50 10 AAAAAG 5 30 AAAAAT 3 .. Tot AAAAA

A G C

AAAAG AAAAC

Exon Intron

November 09

slide-19
SLIDE 19

Scoring for coding regions

CodingDifferential[x] = log PrExon[x] PrIntron[x]      

  • The coding differential can be computed as the

log odds of the probability that a sequence is an exon vs. and intron.

  • In Genscan, separate transition matrices are

trained for each frame, as different frames have different hexamer distributions

November 09

slide-20
SLIDE 20

Coding differential for 380 genes

November 09

slide-21
SLIDE 21

Coding region can be detected

Coding

  • Plot the coding score using a sliding window of fixed length.
  • The (large) exons will show up reliably.
  • Not enough to predict gene boundaries reliably

November 09

slide-22
SLIDE 22

Other Signals

GT ATG AG

Coding

  • Signals at exon boundaries are precise but not specific. Coding

signals are specific but not precise.

  • When combined they can be effective

November 09

slide-23
SLIDE 23

Combining Signals

  • We can compute the following:

– E-score[i,j] – I-score[i,j] – D-score[i] – A-score[i] – Goal is to find coordinates that maximize the total score

i j

November 09

slide-24
SLIDE 24

The second generation of Gene finding

  • Ex: Grail II. Used statistical techniques to

combine various signals into a coherent gene structure.

  • It was not easy to train on many
  • parameters. Guigo & Bursett test revealed

that accuracy was still very low.

  • Problem with multiple genes in a genomic

region

November 09

slide-25
SLIDE 25

Combining signals using D.P.

  • An HMM is the best way to model and
  • ptimize the combination of signals
  • Here, we will use a simpler approach which

is essentially the same as the Viterbi algorithm for HMMs, but without the formalism.

November 09

slide-26
SLIDE 26

Hidden states & gene structure

  • Identifying a gene is equivalent to labeling each

nucleotide as E/I/intergenic etc.

  • These ‘labels’ are the hidden states
  • For simplicity, consider only two states E and I

IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEE IIIII

i1 i2 i3 i4

November 09

slide-27
SLIDE 27

Gene finding reformulated

  • Given a labeling L, we can score it as
  • I-score[0..i1-1] + E-score[i1..i2] + D-score[i2+1] + I-score[i2+1..i3-1] +

A-score[i3-1] + E-score[i3..i4] + …….

  • Goal is to compute a labeling with maximum score.

IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEE IIIII

i1 i2 i3 i4

November 09

slide-28
SLIDE 28

Optimum labeling using D.P. (Viterbi)

  • Define VE(i) = Best score of a labeling of the

prefix 1..i such that the i-th position is labeled E

  • Define VI(i) = Best score of a labeling of the

prefix 1..i such that the i-th position is labeled I

  • Why is it enough to compute VE(i) & VI(i) ?

November 09

slide-29
SLIDE 29

Optimum parse of the gene

VE(i) = max j<i E_score[ j…i]+ VI ( j −1) +A_score[ j −1]}   

VI (i) = max j<i I_score[ j..i]+ VE( j −1) +D_score[ j]}   

j i j i

November 09

slide-30
SLIDE 30

Generalizing

  • Note that we deal with two states, and consider all

paths that move between the two states. E I i

November 09

slide-31
SLIDE 31

Generalizing

  • We did not deal with the boundary cases in the

recurrence.

  • Instead of labeling with two states, we can label

with multiple states,

– Einit, Efin, Emid, – I, IG (intergenic)

Einit I Efin Emid IG Note: all links are not shown here

November 09

slide-32
SLIDE 32

An HMM for Gene structure

November 09

slide-33
SLIDE 33

Gene Finding via HMMs

  • Gene finding can be interpreted as a d.p. approach

that threads genomic sequence through the states

  • f a ‘gene’ HMM.

– Einit, Efin, Emid, – I, IG (intergenic)

Einit I Efin Emid IG Note: all links are not shown here i

November 09

slide-34
SLIDE 34

Generalized HMMs, and other refinements

  • A probabilistic model for each of the states (ex:

Exon, Splice site) needs to be described

  • In standard HMMs, there is an exponential

distribution on the duration of time spent in a state.

  • This is violated by many states of the gene structure
  • HMM. Solution is to model these using generalized

HMMs.

November 09

slide-35
SLIDE 35

Length distributions of Introns & Exons

November 09

slide-36
SLIDE 36

Generalized HMM for gene finding

  • Each state also emits a ‘duration’ for which

it will cycle in the same state. The time is generated according to a random process that depends on the state.

November 09

slide-37
SLIDE 37

Forward algorithm for gene finding

j i qk Fk(i) = P qk

j<i

(X j,i) fqk ( j − i +1) alk

l ∈Q

Fl( j)

Emission Prob.: Probability that you emitted Xi..Xj in state qk (given by the 5th order markov model) Forward Prob: Probability that you emitted i symbols and ended up in state qk Duration Prob.: Probability that you stayed in state qk for j-i+1 steps

November 09

slide-38
SLIDE 38

De novo Gene prediction: Summary

  • Various signals distinguish coding regions

from non-coding

  • HMMs are a reasonable model for Gene

structures, and provide a uniform method for combining various signals.

  • Further improvement may come from

improved signal detection

November 09

slide-39
SLIDE 39

DNA Signals

  • Coding versus non-coding
  • Splice Signals
  • Translation start

ATG

5’ UTR intron exon 3’ UTR Acceptor Donor splice site Transcription start Translation start

November 09

slide-40
SLIDE 40

DNA signal example:

  • The donor site marks the junction where an

exon ends, and an intron begins.

  • For gene finding, we are interested in

computing a probability

– D[i] = Prob[Donor site at position i]

  • Approach: Collect a large number of donor

sites, align, and look for a signal.

November 09

slide-41
SLIDE 41

PWMs

  • Fixed length for the splice signal.
  • Each position is generated independently

according to a distribution

  • Figure shows data from > 1200 donor sites

November 09

slide-42
SLIDE 42

Improvements to signal detection

November 09

slide-43
SLIDE 43

MDD

  • PWMs do not capture correlations between positions
  • Many position pairs in the Donor signal are correlated

November 09

slide-44
SLIDE 44

Maximal Dependence Decomposition

  • Choose the position i which has the highest

correlation score.

  • Split sequences into two: those which have

the consensus at position i, and the remaining.

  • Recurse until <Terminating conditions>

– Stop if #sequences is ‘small enough’

November 09

slide-45
SLIDE 45

MDD for Donor sites

November 09

slide-46
SLIDE 46

Gene prediction: Summary

  • Various signals distinguish coding regions

from non-coding

  • HMMs are a reasonable model for Gene

structures, and provide a uniform method for combining various signals.

  • Further improvement may come from

improved signal detection

November 09

slide-47
SLIDE 47

How many genes do we have?

Nature Science

November 09

slide-48
SLIDE 48

Alternative splicing

November 09

slide-49
SLIDE 49

Comparative methods

  • Gene prediction is harder with alternative splicing.
  • One approach might be to use comparative methods to detect

genes

  • Given a similar mRNA/protein (from another species, perhaps?),

can you find the best parse of a genomic sequence that matches that target sequence

  • Yes, with a variant on alignment algorithms that penalize separately

for introns, versus other gaps.

November 09

slide-50
SLIDE 50

Comparative gene finding tools

  • Procrustes/Sim4: mRNA vs. genomic
  • Genewise: proteins versus genomic
  • CEM: genomic versus genomic
  • Twinscan: Combines comparative and de novo

approach.

  • Mass Spec related?

– Later in the class we will consider mass spectrometry data. – Can we use this data to identify genes in eukaryotic genomes? (Research project)

November 09

slide-51
SLIDE 51

Databases

  • RefSeq and other

databases maintain sequences of full- length transcripts/ genes.

  • We can query using

sequence.

November 09

slide-52
SLIDE 52

Course

  • Sequence Comparison

(BLAST & other tools)

  • Protein Motifs:

– Profiles/Regular Expression/HMMs

  • Discovering protein

coding genes

– Gene finding HMMs – DNA signals (splice signals)

  • How is the genomic

sequence itself obtained?

Protein sequence analysis ESTs Gene finding

November 09