Eukaryotic Gene Prediction

Eukaryotic gene structure

Translation

Gene Finding: The 1st generation
• Given genomic DNA, does it contain a gene (or not)?
• Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions.
• Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

Coding versus non-coding
• Fickett and Tung (1992) compared various measures.
• Measures that preserve the triplet frame are the most successful.
• Genscan: 5th-order Markov model
• Assignment 2 (Conservation implies a protein coding measure)

Coding vs. non-coding regions
Given: three 5th-order transition matrices $C^{(1)}, C^{(2)}, C^{(3)}$ trained on coding exons.

$$P_h(X_{a,b}) = \prod_{i=0}^{b-a} C^{((h+i) \bmod 3 + 1)}[X_{a+i}]$$

Coding ratio: $r = \dfrac{P_h(X_{a,b})}{P_D(X_{a,b})}$
Coding score: $s = \log_2 r$

Compute the average coding score (per base) of exons and of introns, and take the difference. If the measure is good, the difference must be biased away from 0.
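The coding score above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `train_markov` and `coding_score` are hypothetical, $P_D$ is assumed to be a frame-independent background model, frames are 0-indexed (the slide uses 1 to 3), add-one pseudocounts stand in for whatever smoothing a real gene finder uses, and the test below uses order 2 on toy sequences rather than order 5.

```python
import math

def train_markov(seqs, order, periodic):
    """Count (frame, context) -> next-base occurrences.

    If periodic is True, one table per codon position is kept (frame = i % 3,
    assuming each training exon starts at codon position 0); otherwise a
    single frame-independent table is kept under frame 0.
    """
    counts = {}
    for seq in seqs:
        for i in range(order, len(seq)):
            frame = i % 3 if periodic else 0
            row = counts.setdefault((frame, seq[i - order:i]), {})
            row[seq[i]] = row.get(seq[i], 0) + 1
    return counts

def log2_prob(counts, frame, context, base, pseudo=1.0):
    """log2 P(base | context, frame) with add-one smoothing over A, C, G, T."""
    row = counts.get((frame, context), {})
    total = sum(row.values()) + 4 * pseudo
    return math.log2((row.get(base, 0) + pseudo) / total)

def coding_score(seq, coding, background, order, h=0):
    """Average per-base log2(P_h / P_D), with seq[0] at codon position h."""
    total, n = 0.0, 0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        total += log2_prob(coding, (h + i) % 3, ctx, base)
        total -= log2_prob(background, 0, ctx, base)
        n += 1
    return total / n if n else 0.0
```

Scoring the same sequence with h = 0, 1, 2 and keeping the maximum is one way to handle an unknown reading frame.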

Coding differential for 380 genes

Other Signals
[Figure: gene diagram labeled ATG, AG, GT, and Coding]

Coding region can be detected
• Plot the coding score using a sliding window of fixed length.
• The (large) exons will show up reliably.
• Not enough to predict gene boundaries reliably.
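A minimal sketch of the sliding-window step, assuming per-base coding scores have already been computed. The function name `sliding_window_means` is hypothetical, and prefix sums are just one convenient way to get each window average in constant time.

```python
def sliding_window_means(scores, window):
    """Mean of scores[i:i + window] for every full window, left to right."""
    if window <= 0 or window > len(scores):
        return []
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)  # prefix[k] = sum of the first k scores
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(len(scores) - window + 1)]
```

Long exons show up as sustained runs of positive window means, but the transition between positive and negative windows is smeared over roughly one window length, which is why the boundaries stay imprecise.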

Other Signals
• Signals at exon boundaries are precise but not specific.
• Coding signals are specific but not precise.
• When combined they can be effective.
[Figure: gene diagram labeled ATG, AG, GT, and Coding]

The second generation of Gene finding
• Example: Grail II used statistical techniques to combine various signals into a coherent gene structure.
• It was not easy to train on many parameters. The Guigó and Burset test revealed that accuracy was still very low.
• Problem with multiple genes in a genomic region.

HMMs and gene finding
• HMMs allow for a systematic approach to merging many signals.
• They can model multiple genes and partial genes in a genomic region, as well as genes on both strands.

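The Viterbi Algorithm
Let $v_k(i)$ be the probability of the most likely path that ends in state $k$ at position $i$ (that is, $\pi_i = k$) and emits symbols $x_1 \ldots x_i$. Then

$$v_k(i+1) = e_k(x_{i+1}) \, \max_l \big( v_l(i) \, a_{lk} \big)$$

where $e_k(x)$ is the probability of emitting $x$ in state $k$ and $a_{lk}$ is the transition probability from state $l$ to state $k$.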
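The recurrence can be sketched as follows. This is a minimal illustration, not a gene finder: it works in log space to avoid underflow, assumes a non-empty observation sequence and strictly positive probabilities, and the dictionary layout of `start`, `trans`, and `emit` is an assumption made for the example.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return (most likely state path, its log probability)."""
    # v[i][k]: log probability of the best path emitting obs[:i + 1], ending in k
    v = [{k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for k in states:
            # best predecessor l maximizes v_l(i) * a_lk
            best = max(states, key=lambda l: v[-1][l] + math.log(trans[l][k]))
            col[k] = math.log(emit[k][x]) + v[-1][best] + math.log(trans[best][k])
            ptr[k] = best
        v.append(col)
        back.append(ptr)
    # backtrack from the best final state
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[-1][last]
```

With a GC-rich state, an AT-rich state, and sticky transitions, the sequence GGGGAAAA is parsed into four positions of the first state followed by four of the second.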

HMMs and gene finding
• The Viterbi algorithm (and backtracking) allows us to parse a string through the states of an HMM.
• Can we describe eukaryotic gene structure by the states of an HMM?
• This could be a solution to the gene-finding problem.

An HMM for Gene structure

Generalized HMMs, and other refinements
• A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described.
• In standard HMMs, the time spent in a state follows a geometric (discrete exponential) distribution.
• This is violated by many states of the gene structure HMM. The solution is to model these using generalized HMMs.
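To see where the duration distribution of a standard HMM comes from, write $a_{kk}$ for the self-transition probability of state $k$ and $D_k$ for the number of consecutive positions spent in it (both symbols are introduced here for illustration). Staying for exactly $d$ positions means taking the self-loop $d-1$ times and then leaving:

```latex
P(D_k = d) = a_{kk}^{\,d-1}\,(1 - a_{kk}), \qquad d = 1, 2, \ldots
\qquad\Longrightarrow\qquad
\mathbb{E}[D_k] = \frac{1}{1 - a_{kk}}
```

This distribution always decreases with $d$, so it cannot represent a length distribution that peaks away from the shortest lengths.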

Length distributions of Introns & Exons

Generalized HMM for gene finding
• Each state also emits a 'duration' for which it will cycle in the same state. The time is generated according to a random process that depends on the state.
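The generative process can be sketched as a sampler. This is a toy illustration under assumptions: the function name `sample_ghmm` and the dictionary layout are hypothetical, durations come from an explicit per-state table, and bases within a segment are drawn independently (a real gene finder would use a richer per-state sequence model).

```python
import random

def sample_ghmm(n_segments, start, trans, dur, emit, rng):
    """Sample n_segments (state, emitted string) pairs from a generalized HMM."""
    def draw(dist):
        # dist maps outcome -> probability
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    state = draw(start)
    segments = []
    for _ in range(n_segments):
        d = draw(dur[state])                    # state-specific duration
        bases = list(emit[state])
        weights = [emit[state][b] for b in bases]
        seq = "".join(rng.choices(bases, weights=weights, k=d))
        segments.append((state, seq))
        state = draw(trans[state])              # move on to the next state
    return segments
```

Passing a seeded `random.Random` instance as `rng` makes the sampled parse reproducible.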

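Forward algorithm for gene finding
[Figure: a segment of state $q_k$ between positions $j$ and $i$]

$$F_k(i) = \sum_{j<i} \sum_{l \in Q} P_{q_k}(X_{j+1,i}) \, f_{q_k}(i-j) \, a_{lk} \, F_l(j)$$

Here $F_k(i)$ is the probability of emitting $x_1 \ldots x_i$ with a segment of state $q_k$ ending at position $i$, $P_{q_k}(X_{j+1,i})$ is the probability that state $q_k$ emits the segment $x_{j+1} \ldots x_i$, $f_{q_k}(i-j)$ is the probability that $q_k$ has duration $i-j$, and $a_{lk}$ is the transition probability from $q_l$ to $q_k$.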
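A minimal sketch of this forward recursion, under assumptions: the last segment covers positions $j+1$ to $i$ and so has duration $i-j$, each state emits the bases of its segment independently (a stand-in for the per-state sequence model), and the first segment is entered through a start distribution. The function name `ghmm_forward` and the dictionary layout are hypothetical.

```python
def ghmm_forward(x, states, start, trans, dur, emit):
    """F[i][k]: probability of x[:i] with a segment of state k ending at i."""
    n = len(x)
    F = [dict.fromkeys(states, 0.0) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in states:
            total, seg = 0.0, 1.0
            for j in range(i - 1, -1, -1):
                seg *= emit[k][x[j]]             # P_k of the segment x[j:i]
                d = dur[k].get(i - j, 0.0)       # f_k(duration = i - j)
                if d == 0.0:
                    continue
                if j == 0:
                    enter = start[k]             # first segment of the parse
                else:
                    enter = sum(F[j][l] * trans[l][k] for l in states)
                total += seg * d * enter
            F[i][k] = total
    return F
```

The triple loop costs time quadratic in the sequence length for each pair of states, which is why practical implementations bound the maximum duration.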

HMMs and Gene finding
• Generalized HMMs are an attractive model for computational gene finding.
• Allow incorporation of various signals.
• Quality of gene finding depends upon quality of signals.

Signals
• Coding versus non-coding
• Splice signals
• Translation start

Splice signals
• GT is the donor signal, and AG is the acceptor signal.
[Figure: intron bounded by GT and AG]
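A small sketch shows why these dinucleotides are precise but not specific: every true donor site starts with GT, but so do many positions that are not splice sites. The helper name `find_sites` is hypothetical.

```python
def find_sites(seq, motif):
    """0-based start positions of every (possibly overlapping) motif occurrence."""
    m = len(motif)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == motif]
```

In the 9-base string AAGGTAAGT, GT occurs at positions 3 and 7 and AG at positions 1 and 6, so even a short sequence yields several candidates per signal.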

PWMs
• Fixed length for the splice signal.
• Each position is generated independently according to a distribution.
• Figure shows data from > 1200 donor sites.

Example donor sites (positions 3, 2, 1 before the exon/intron boundary and 1 to 6 after it):
321123456
AAGGTGAGT
CCGGTAAGT
GAGGTGAGG
TAGGTAAGG
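A position weight matrix can be sketched from the four example donor sites on this slide. The function names, the add-one pseudocount, and the uniform background of 0.25 are assumptions for illustration; a real matrix would be estimated from the full set of more than 1200 sites.

```python
import math

def build_pwm(sites, pseudo=1.0, alphabet="ACGT"):
    """One base distribution per position, estimated independently per column."""
    total = len(sites) + pseudo * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({b: (column.count(b) + pseudo) / total for b in alphabet})
    return pwm

def pwm_log_odds(pwm, seq, background=0.25):
    """Sum over positions of log2(P(base at position) / background)."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, seq))

donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
donor_pwm = build_pwm(donor_sites)
```

Because the score is a sum of independent per-position terms, a candidate site is scored in time linear in the signal length.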

MDD
• PWMs do not capture correlations between positions.
• Many position pairs in the donor signal are correlated.

• Choose the position i which has the highest correlation score.
• Split sequences into two: those which have the consensus at position i, and the remaining.
• Recurse until <Terminating conditions>.
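One split step can be sketched as below. The slide does not define the correlation score, so this sketch assumes a common choice: for each position i, sum over the other positions j the chi-square statistic between "has the consensus at i" and "base at j". The function names are hypothetical, and the terminating conditions are left out as in the slide.

```python
from collections import Counter

def consensus(sites, pos):
    """Most frequent base at a position (first seen wins ties)."""
    return Counter(s[pos] for s in sites).most_common(1)[0][0]

def chi_square(sites, i, j, alphabet="ACGT"):
    """Chi-square for (consensus at i: yes/no) versus (base at j)."""
    cons = consensus(sites, i)
    n = len(sites)
    observed = Counter((s[i] == cons, s[j]) for s in sites)
    row = Counter(s[i] == cons for s in sites)
    col = Counter(s[j] for s in sites)
    stat = 0.0
    for r in (True, False):
        for b in alphabet:
            expected = row[r] * col[b] / n
            if expected > 0:
                stat += (observed[(r, b)] - expected) ** 2 / expected
    return stat

def mdd_split(sites):
    """Pick the position with the highest summed score and split on its consensus."""
    length = len(sites[0])
    scores = [sum(chi_square(sites, i, j) for j in range(length) if j != i)
              for i in range(length)]
    best = max(range(length), key=scores.__getitem__)
    cons = consensus(sites, best)
    with_cons = [s for s in sites if s[best] == cons]
    rest = [s for s in sites if s[best] != cons]
    return best, with_cons, rest
```

Each of the two subsets would then get its own split or, once splitting stops, its own PWM.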

MDD for Donor sites

De novo Gene prediction: Summary
• Various signals distinguish coding regions from non-coding.
• HMMs are a reasonable model for gene structures, and provide a uniform method for combining various signals.
• Further improvement may come from improved signal detection.

How many genes do we have?
[Figure labels: Nature, Science]

Alternative splicing

Comparative methods
• Gene prediction is harder with alternative splicing.
• One approach might be to use comparative methods to detect genes.
• Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
• Yes, with a variant on alignment algorithms that penalize separately for introns, versus other gaps.
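The idea of penalizing introns separately from other gaps can be sketched as a global alignment with an extra "inside an intron" table. This is a simplified illustration, not a specific published algorithm: the function name and all scores are hypothetical, an intron costs a flat penalty regardless of length, and GT/AG boundaries are not checked.

```python
def spliced_alignment_score(genome, cdna, match=1, mismatch=-1, gap=-2, intron=-3):
    """Best global alignment score of cdna against genome, allowing introns."""
    neg = float("-inf")
    n, m = len(genome), len(cdna)
    # S[i][j]: best score for genome[:i] vs cdna[:j], not inside an intron
    # I[i][j]: same prefixes, with genome[i - 1] inside an open intron
    S = [[neg] * (m + 1) for _ in range(n + 1)]
    I = [[neg] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(m + 1):
            # open an intron (flat cost) or extend one (free)
            I[i][j] = max(S[i - 1][j] + intron, I[i - 1][j])
            # ordinary genomic gap, or close the intron here
            best = max(S[i - 1][j] + gap, I[i][j])
            if j > 0:
                sub = match if genome[i - 1] == cdna[j - 1] else mismatch
                best = max(best, S[i - 1][j - 1] + sub, S[i][j - 1] + gap)
            S[i][j] = best
    return S[n][m]
```

Because extending an intron is free, a long genomic insertion is explained by one intron rather than by many per-base gaps.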
