CSE 427 Computational Biology: Genes and Gene Prediction 1

Some notes on HW #2: How do we evaluate and compare classifiers?

Quantifying “Accuracy”: https://en.wikipedia.org/wiki/Sensitivity_and_specificity

“A diagnostic test with sensitivity 67% and specificity 91% is applied to 2030 people to look for a disorder with a population prevalence of 1.48%” [Table: the patient’s “true” status vs. blood test outcome.] https://en.wikipedia.org/wiki/Sensitivity_and_specificity
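The quoted example can be worked through numerically. The following is a minimal sketch, assuming counts are rounded to whole people; the function name `confusion_from_rates` is made up for illustration and is not from the slides.

```python
def confusion_from_rates(n, prevalence, sensitivity, specificity):
    """Return (TP, FN, FP, TN) counts implied by the given rates.

    Assumption: counts are rounded to the nearest whole person.
    """
    diseased = round(n * prevalence)      # people who truly have the disorder
    healthy = n - diseased                # people who truly do not
    tp = round(diseased * sensitivity)    # diseased and test positive
    fn = diseased - tp                    # diseased but test negative
    tn = round(healthy * specificity)     # healthy and test negative
    fp = healthy - tn                     # healthy but test positive
    return tp, fn, fp, tn


tp, fn, fp, tn = confusion_from_rates(2030, 0.0148, 0.67, 0.91)
ppv = tp / (tp + fp)   # positive predictive value
npv = tn / (tn + fn)   # negative predictive value
print(tp, fn, fp, tn)  # 20 10 180 1820
print(ppv, npv)
```

Even with 91% specificity, only 20 of the 200 positive results are true positives, because the disorder is rare.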

ROC Curves: A way to think about 2-parameter trade-offs (true positives and false positives). [Figure: True Positive Rate (0.0 to 1.0) vs. False Positive Rate (0.0 to 1.0); the diagonal is “no better than chance”; a curve slightly above it is “a bit better than chance”.]
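One way to trace such a curve is to sweep a threshold over the classifier's scores and record the false and true positive rates at each setting. This is a minimal sketch under assumed conventions (higher score means "predict positive", labels are 0/1); `roc_points` and `auc` are illustrative names and the toy data is invented.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Invented toy data: four examples, two positive and two negative.
pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(pts)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(pts))  # 0.75
```

A classifier that is no better than chance would hug the diagonal and give an area near 0.5.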

[Figure: scatter plot of Markov Model Score (-50 to 50) vs. ORF Length (0 to 2000).]

[Figure: ROC curves, TPR vs. FPR. Blue = ORF length threshold; Green = Markov Model threshold. Points on the ORF length-based curve are labeled with the threshold used: 51, 171, 291, 411, 537, 669, 807, 957, 1098, 1269, 1506, 1971, 8703. An inset zooms in on FPR 0.0000 to 0.0020 and TPR 0.5 to 1.0, comparing the ORF length-based and MM-based thresholds.]

Gene Finding: Motivation. Sequence data is flooding in. What does it mean? Protein genes, RNA genes, mitochondria, chloroplast, regulation, replication, structure, repeats, transposons, unknown stuff, … More generally, how do you learn from complex data in an unknown language, and leverage what’s known to help discover what’s not?

Protein Coding Nuclear DNA: the focus of these slides. Goal: automated annotation of new sequence data. State of the art: in eukaryotes, predictions are ~60% similar to real proteins, ~80% if database similarity is used; prokaryotes are better, but still imperfect. Lab verification is still needed and still expensive; largely done for human, unlikely for most others.

Biological Basics. Central Dogma: DNA → (transcription) → RNA → (translation) → Protein. Codons: 3 bases code one amino acid; start codon; stop codons. 3', 5' Untranslated Regions (UTRs).

RNA Transcription (This gene is heavily transcribed, but many are not.)

Translation: mRNA → Protein. Watson, Gilman, Witkowski, & Zoller, 1992

DNA (thin lines), RNA Pol (arrow), mRNA with attached ribosomes (dark circles). Darnell, p120

Ribosomes. Watson, Gilman, Witkowski, & Zoller, 1992

Codons & The Genetic Code
First      Second Base                     Third
Base       U          C      A      G      Base
U          Phe        Ser    Tyr    Cys    U
U          Phe        Ser    Tyr    Cys    C
U          Leu        Ser    Stop   Stop   A
U          Leu        Ser    Stop   Trp    G
C          Leu        Pro    His    Arg    U
C          Leu        Pro    His    Arg    C
C          Leu        Pro    Gln    Arg    A
C          Leu        Pro    Gln    Arg    G
A          Ile        Thr    Asn    Ser    U
A          Ile        Thr    Asn    Ser    C
A          Ile        Thr    Lys    Arg    A
A          Met/Start  Thr    Lys    Arg    G
G          Val        Ala    Asp    Gly    U
G          Val        Ala    Asp    Gly    C
G          Val        Ala    Glu    Gly    A
G          Val        Ala    Glu    Gly    G
Ala: Alanine; Arg: Arginine; Asn: Asparagine; Asp: Aspartic acid; Cys: Cysteine; Gln: Glutamine; Glu: Glutamic acid; Gly: Glycine; His: Histidine; Ile: Isoleucine; Leu: Leucine; Lys: Lysine; Met: Methionine; Phe: Phenylalanine; Pro: Proline; Ser: Serine; Thr: Threonine; Trp: Tryptophan; Tyr: Tyrosine; Val: Valine
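The genetic code is small enough to hold in a lookup table. The sketch below builds the standard table from a 64-letter string of one-letter amino acid codes (ordered with bases U, C, A, G and the third base varying fastest) and translates an mRNA up to the first stop codon. The names `CODON_TABLE` and `translate` are illustrative, and starting translation at position 0 rather than at the first AUG is a simplifying assumption.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG; "*" marks a stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate from position 0, stopping at the first stop codon (accepts T for U)."""
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AUGUUUUAA"))     # MF
print(translate("ATGGCCTGGTGA"))  # MAW
```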

Idea #1: Find Long ORFs. Reading frame: which of the 3 possible sequences of triples does the ribosome read? Open Reading Frame (ORF): no internal stop codons. In random DNA, the average ORF is ~64/3 ≈ 21 triplets (3 of the 64 codons are stops); a 300 bp ORF occurs about once per 36 kbp per strand. But the average protein is ~1000 bp.
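The 64/3 figure is the mean of a geometric distribution. The sketch below assumes uniform, independent random bases, so that each codon is a stop with probability 3/64; the function names are illustrative. It does not try to reproduce the 36 kbp figure, which depends on counting conventions the slide does not spell out.

```python
STOP_FRACTION = 3 / 64  # 3 stop codons among 64, assuming uniform random bases


def mean_codons_to_stop(p_stop=STOP_FRACTION):
    """Expected number of codons read up to and including the first stop."""
    return 1 / p_stop


def prob_no_stop(n_codons, p_stop=STOP_FRACTION):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - p_stop) ** n_codons


print(mean_codons_to_stop())  # about 21.3 triplets
print(prob_no_stop(100))      # chance that a 300 bp stretch in one frame has no stop
```

Long stop-free stretches are rare in random sequence, which is why a long ORF is evidence of a real gene.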

A Simple ORF finder: start at the left end; scan triplet-by-non-overlapping-triplet for AUG; then continue scanning for a STOP; repeat until the right end; repeat all starting at offset 1; repeat all starting at offset 2; then do it again on the other strand.
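The scan described above can be sketched as follows. This is one possible reading of the slide, not the course's reference solution: it works on RNA letters (converting T to U), reports each ORF with its stop codon included, and treats "the other strand" as the reverse complement. All names are illustrative.

```python
STOPS = {"UAA", "UAG", "UGA"}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    return rna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) for each AUG...STOP run in one reading frame (end exclusive)."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "AUG":
                start = i          # first AUG opens an ORF
        elif codon in STOPS:
            yield (start, i + 3)   # stop codon closes it
            start = None           # resume scanning for the next AUG


def find_orfs(seq):
    """Scan all 3 offsets on both strands; return (strand, offset, start, end, orf)."""
    seq = seq.upper().replace("T", "U")
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset):
                results.append((strand, offset, start, end, s[start:end]))
    return results


print(find_orfs("CCAUGAAAUAGCC"))  # [('+', 2, 2, 11, 'AUGAAAUAG')]
```

Coordinates on the "-" strand are relative to the reverse-complemented sequence; an ORF that runs off the end without a stop is not reported.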

Scanning for ORFs*: UUAAUGUGUCAUUGAUUAAGAAUUACACAGUAACUAAUAC [Figure: reading frames 1, 2, 3 marked on one strand and frames 4, 5, 6 on the other.] * In bacteria, GUG is sometimes a start codon…

Idea #2: Codon Frequency. In random DNA, Leucine : Alanine : Tryptophan = 6 : 4 : 1. But in real protein, the ratios are ~6.9 : 6.5 : 1. So, coding DNA is not random. Even more: synonym usage is biased (in a species-dependent way); examples are known with 90% AT in the 3rd base. Why? E.g. efficiency, histone, enhancer, splice interactions.
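The 6 : 4 : 1 ratio for random DNA is just the number of codons assigned to each amino acid, assuming all 64 codons are equally likely. A small sketch that counts them from the standard genetic code (table built the same way as any U, C, A, G ordered codon chart; names are illustrative):

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG; "*" marks a stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid: L = Leucine, A = Alanine, W = Tryptophan.
codons_per_aa = Counter(CODON_TABLE.values())
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])  # 6 4 1
```

Observed ratios that depart from these counts are one signal that a stretch of DNA is coding rather than random.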

Idea #3: Non-Independence. Not only is codon usage biased, but residues (aa or nt) in one position are not independent of their neighbors. How to model this? Markov models.

CpG Islands: more CpG than elsewhere (say, CpG/GpC > 50%); more C & G than elsewhere, too (say, C+G > 50%); typical length: few 100 to few 1000 bp. Questions: Is a short sequence (say, 200 bp) a CpG island or not? Given a long sequence (say, 10-100 kb), find the CpG islands?

Markov Chains: A sequence of random variables $x_1, x_2, \ldots$ is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values: $P(x_i \mid x_1, \ldots, x_{i-1}) = P(x_i \mid x_{i-k}, \ldots, x_{i-1})$ (i-1 values on the left, k on the right; typically k ≪ i-1). Example 1: Uniform random ACGT (0th order). Example 2: Weight matrix model (0th order). Example 3: ACGT, but ↓ Pr(G following C) (1st order).

A Markov Model (1st order). States: A, C, G, T. Emissions: corresponding letter. Transitions: $a_{st} = P(x_i = t \mid x_{i-1} = s)$. Optional Begin/End states.

Pr of emitting sequence x: $P(x) = P(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$ (with a Begin state B, $P(x_1) = a_{B x_1}$).
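For a first-order chain, the probability of a sequence is the probability of its first letter times the product of the transition probabilities along it. A minimal sketch, working in log space to avoid underflow on long sequences; the dictionary layout (`init[s]`, `trans[s][t]`) and the uniform toy model are assumptions made for illustration.

```python
import math


def log_prob(seq, trans, init):
    """Natural-log probability of seq under a 1st-order Markov chain.

    init[s] is P(x_1 = s); trans[s][t] is P(x_i = t | x_{i-1} = s).
    """
    lp = math.log(init[seq[0]])
    for s, t in zip(seq, seq[1:]):
        lp += math.log(trans[s][t])
    return lp


# Toy model: every start letter and every transition has probability 1/4.
uniform_init = {s: 0.25 for s in "ACGT"}
uniform_trans = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
print(math.exp(log_prob("ACGT", uniform_trans, uniform_init)))  # 0.25 ** 4
```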

Training: Max likelihood estimates for transition probabilities are just the frequencies of transitions when emitting the training sequences. E.g., from 48 CpG islands in 60k bp: [Table: estimated transition probabilities. From DEKM]
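Counting transitions and normalizing each row gives the maximum likelihood estimates. The sketch below assumes sequences over the alphabet ACGT only; the `pseudocount` parameter is an addition for illustration (not mentioned in the slide) and defaults to 0, which gives the plain frequencies.

```python
def train_transitions(seqs, alphabet="ACGT", pseudocount=0.0):
    """Estimate a[s][t] = P(next = t | current = s) from training sequences."""
    counts = {s: {t: pseudocount for t in alphabet} for s in alphabet}
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1          # one observed s -> t transition
    probs = {}
    for s in alphabet:
        total = sum(counts[s].values())
        # A state never seen in training gets an all-zero row here.
        probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in alphabet}
    return probs


a = train_transitions(["ACGCG"])  # transitions: A->C, C->G, G->C, C->G
print(a["C"]["G"], a["A"]["C"], a["G"]["C"])  # 1.0 1.0 1.0
```

With real training data a small pseudocount avoids zero probabilities for transitions that happen not to occur in the sample.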

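Discrimination/Classification: Log likelihood ratio of CpG model vs background model, $S(x) = \log \frac{P(x \mid \text{CpG model})}{P(x \mid \text{background model})}$. From DEKM

CpG Island Scores. Figure 3.2: Histogram of length-normalized scores for CpG islands and non-CpG sequences. From DEKM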
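With two trained first-order models, the log likelihood ratio reduces to a sum of per-transition log ratios. In this sketch the two transition tables are invented toy numbers (not the DEKM estimates), log base 2 is a choice made here so that scores are in bits, and the first-letter term is ignored.

```python
import math


def log_odds(seq, plus, minus):
    """Sum over transitions of log2(a_plus[s][t] / a_minus[s][t])."""
    return sum(math.log2(plus[s][t] / minus[s][t]) for s, t in zip(seq, seq[1:]))


def length_normalized(seq, plus, minus):
    """Score per letter, so sequences of different lengths are comparable."""
    return log_odds(seq, plus, minus) / len(seq)


# Invented toy models: uniform, except that the "plus" model favors G after C
# and the "minus" model disfavors it.
plus = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
minus = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
plus["C"] = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
minus["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}

print(log_odds("CGCG", plus, minus))  # positive: looks like the "plus" model
print(log_odds("ATAT", plus, minus))  # 0.0: no evidence either way
```

A positive score favors the CpG model and a negative score favors the background.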

GENES, PART II

Promoters, etc. In prokaryotes, most DNA is coding, e.g. ~70% in H. influenzae. Long ORFs + codon stats do well, but obviously won't be perfect: short genes; 5' & 3' UTRs. Can improve by modeling promoters, etc., e.g. via WMM or higher-order Markov models.

Eukaryotes. As in prokaryotes (but maybe more variable): promoters; start/stop transcription; start/stop translation.

And then… Nobel Prize of the week: P. Sharp, 1993, Splicing

Mechanical Devices of the Spliceosome: Motors, Clocks, Springs, and Things. Jonathan P. Staley and Christine Guthrie. Cell, Volume 92, Issue 3, 6 February 1998, Pages 315-326

Figure 2. Spliceosome Assembly, Rearrangement, and Disassembly Requires ATP, Numerous DExD/H box Proteins, and Prp24. The snRNPs are depicted as circles. The pathway for S. cerevisiae is shown.

Hints to Origins? Tetrahymena thermophila

Genes in Eukaryotes. As in prokaryotes (but maybe more variable): promoters; start/stop transcription; start/stop translation. New features: introns, exons, splicing; branch point signal; alternative splicing; polyA site/tail. [Figure: a gene laid out 5' to 3' as exon, intron, exon, intron, with splice signals marked: donor AG/GT, acceptor yyy..AG/G, donor AG/GT.]

Characteristics of human genes (Nature, 2/2001, Table 21)
Feature          Median     Mean       Sample (size)
Internal exon    122 bp     145 bp     RefSeq alignments to draft genome sequence, with confirmed intron boundaries (43,317 exons)
Exon number      7          8.8        RefSeq alignments to finished seq (3,501 genes)
Introns          1,023 bp   3,365 bp   RefSeq alignments to finished seq (27,238 introns)
3' UTR           400 bp     770 bp     Confirmed by mRNA or EST on chromo 22 (689)
5' UTR           240 bp     300 bp     Confirmed by mRNA or EST on chromo 22 (463)
Coding seq       1,100 bp   1,340 bp   Selected RefSeq entries (1,804)*
(CDS)            367 aa     447 aa
Genomic span     14 kb      27 kb      Selected RefSeq entries (1,804)*
* 1,804 selected RefSeq entries were those with full-length unambiguous alignment to finished sequence

Big Genes. Many genes are over 100 kb long. Max known: dystrophin gene (DMD), 2.4 Mb. The variation in the size distribution of coding sequences and exons is less extreme, although there are remarkable outliers. The titin gene has the longest currently known coding sequence at 80,780 bp; it also has the largest number of exons (178) and longest single exon (17,106 bp). RNA polymerase rate: 1.2-2.5 kb/min, so at least 16 hours to transcribe DMD.
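The 16-hour figure follows from dividing the gene length by the polymerase rate. A small check, assuming a constant elongation rate over the whole 2.4 Mb gene (the function name is illustrative):

```python
def transcription_hours(gene_length_kb, rate_kb_per_min):
    """Hours needed to transcribe a gene at a constant elongation rate."""
    return gene_length_kb / rate_kb_per_min / 60


dmd_kb = 2400  # dystrophin gene, 2.4 Mb
print(transcription_hours(dmd_kb, 2.5))  # 16.0 hours at the fast end of the range
print(transcription_hours(dmd_kb, 1.2))  # about 33 hours at the slow end
```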

[Figure: panels titled Exons, Introns, Introns. From Nature 2/2001.]

Figure 36: GC content (Nature 2/2001). Panel titles: Genes vs Genome; Gene Density; Intron, Exon.
a: Distribution of GC content in genes and in the genome. For 9,315 known genes mapped to the draft genome sequence, the local GC content was calculated in a window covering either the whole alignment or 20,000 bp centered on the midpoint of the alignment, whichever was larger. Ns in the sequence were not counted. GC content for the genome was calculated for adjacent nonoverlapping 20,000-bp windows across the sequence. Both distributions normalized to sum to one.
b: Gene density as a function of GC content (= ratios of data in a. Less accurate at high GC because the denominator is small).
c: Dependence of mean exon and intron lengths on GC content. The local GC content, based on alignments to finished sequence only, calculated from windows covering the larger of feature size or 10,000 bp centered on it.
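The genome-wide calculation in the caption (adjacent nonoverlapping windows, Ns not counted) can be sketched as follows. This is an illustration of the windowing idea only, not the paper's pipeline; the function names are made up, and including a final partial window is a choice made here.

```python
def gc_content(window):
    """Fraction of G or C among the A/C/G/T letters in window; None if there are none."""
    counted = [b for b in window.upper() if b in "ACGT"]  # skips N and other symbols
    if not counted:
        return None
    return sum(b in "GC" for b in counted) / len(counted)


def windowed_gc(seq, size=20000):
    """GC content of adjacent nonoverlapping windows (last window may be shorter)."""
    return [gc_content(seq[i:i + size]) for i in range(0, len(seq), size)]


print(windowed_gc("GGCCNNAATT", size=5))  # [1.0, 0.0]
```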
