CSEP 590 A Computational Biology: Genes and Gene Prediction

A Note on HW #3
[Figure: three fits to the same data points (x axis 10 to 50), each annotated with its log likelihood (LL)]
- Fit 1: µ = 14.8, 23.1, 47.7, 50.6; σ = 1 for all four components; LL = -178.5
- Fit 2: µ = 14.8, 21.1, 25.2, 49.4; σ = 1 for all four components; LL = -139.6
- Fit 3: µ = 14.8, 21.1, 25.2 with σ = 1, and µ = 49.4 with σ = 2; LL = -135.3
- A 3% change in LL may look small, but exp(4.3) = 73.7 times more likely
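The arithmetic behind the last remark can be checked with a few lines of Python. This is only a sketch; the function name `likelihood_ratio` is an illustrative choice, and the two log likelihoods are the values quoted on the slide.

```python
import math

def likelihood_ratio(ll_old, ll_new):
    """Return how many times more likely the data are under the new fit.

    Both arguments are natural-log likelihoods, so the ratio of the
    likelihoods is exp(difference of the logs).
    """
    return math.exp(ll_new - ll_old)

ll_sigma1 = -139.6   # LL with sigma = 1 for every component (from the slide)
ll_sigma2 = -135.3   # LL with sigma = 2 for the last component (from the slide)

relative_change = (ll_sigma2 - ll_sigma1) / abs(ll_sigma1)
ratio = likelihood_ratio(ll_sigma1, ll_sigma2)

print(f"relative change in LL: {relative_change:.1%}")
print(f"likelihood ratio: {ratio:.1f}")
```

A small relative change in a log likelihood can therefore hide a large multiplicative change in the likelihood itself.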

Gene Finding: Motivation
- Sequence data flooding in
- What does it mean?
  - protein genes, RNA genes, mitochondria, chloroplast, regulation, replication, structure, repeats, transposons, unknown stuff, …
- More generally, how do you:
  - learn from complex data in an unknown language
  - leverage what's known to help discover what's not

Protein Coding Nuclear DNA
- Focus of this lecture
- Goal: automated annotation of new sequence data
- State of the art:
  - In eukaryotes: predictions ~60% similar to real proteins; ~80% if database similarity is used
  - In prokaryotes: better, but still imperfect
- Lab verification still needed, still expensive
- Largely done for human; unlikely for most others

Biological Basics
- Central Dogma: DNA → (transcription) → RNA → (translation) → Protein
- Codons: 3 bases code one amino acid
  - Start codon
  - Stop codons
- 3', 5' Untranslated Regions (UTRs)

RNA Transcription
(This gene is heavily transcribed, but many are not.)

Darnell, p. 120

Translation: mRNA → Protein
Watson, Gilman, Witkowski, & Zoller, 1992

Ribosomes
Watson, Gilman, Witkowski, & Zoller, 1992

Codons & The Genetic Code

First                Second Base                  Third
Base     U          C       A       G             Base
U        Phe        Ser     Tyr     Cys           U
         Phe        Ser     Tyr     Cys           C
         Leu        Ser     Stop    Stop          A
         Leu        Ser     Stop    Trp           G
C        Leu        Pro     His     Arg           U
         Leu        Pro     His     Arg           C
         Leu        Pro     Gln     Arg           A
         Leu        Pro     Gln     Arg           G
A        Ile        Thr     Asn     Ser           U
         Ile        Thr     Asn     Ser           C
         Ile        Thr     Lys     Arg           A
         Met/Start  Thr     Lys     Arg           G
G        Val        Ala     Asp     Gly           U
         Val        Ala     Asp     Gly           C
         Val        Ala     Glu     Gly           A
         Val        Ala     Glu     Gly           G

Ala: Alanine, Arg: Arginine, Asn: Asparagine, Asp: Aspartic acid, Cys: Cysteine, Gln: Glutamine, Glu: Glutamic acid, Gly: Glycine, His: Histidine, Ile: Isoleucine, Leu: Leucine, Lys: Lysine, Met: Methionine, Phe: Phenylalanine, Pro: Proline, Ser: Serine, Thr: Threonine, Trp: Tryptophan, Tyr: Tyrosine, Val: Valine
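As a rough illustration of how the table is used, the sketch below encodes the standard genetic code as a 64-entry lookup and translates an mRNA string codon by codon. The names `CODON_TABLE` and `translate` are illustrative, one-letter amino acid codes are used instead of the three-letter ones above, and the function assumes the input is already in the correct reading frame.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes in the same order as the table:
# first base, then second base, then third base, each running U, C, A, G.
# '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(rna):
    """Translate an in-frame mRNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":          # stop codon: translation ends here
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGUGUCAUUGA"))  # AUG UGU CAU UGA -> Met Cys His Stop
```

Any trailing one or two bases that do not form a full codon are ignored by this sketch.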

Idea #1: Find Long ORFs
- Reading frame: which of the 3 possible sequences of triples does the ribosome read?
- Open Reading Frame: no stop codons
- In random DNA
  - average ORF ~ 64/3 = 21 triplets
  - 300 bp ORF once per 36 kbp per strand
- But average protein ~ 1000 bp
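The "64/3" figure follows from treating each triplet as an independent draw in which 3 of the 64 codons are stops. The sketch below works under that assumption (uniform, independent bases) and uses illustrative function names; it reproduces the mean and shows how quickly the chance of a stop-free run decays, without attempting to rederive the per-kbp figure on the slide.

```python
def mean_codons_between_stops(n_stop=3, n_codons=64):
    """Expected number of triplets per stop codon under uniform random codons."""
    return n_codons / n_stop

def prob_no_stop(codons, n_stop=3, n_codons=64):
    """Probability that a run of `codons` random triplets contains no stop."""
    return (1 - n_stop / n_codons) ** codons

print(f"mean triplets per stop: {mean_codons_between_stops():.1f}")
for length_bp in (60, 300, 1000):
    p = prob_no_stop(length_bp // 3)
    print(f"P(no stop in {length_bp} bp) = {p:.2e}")
```

The geometric decay is why a long ORF is evidence, though not proof, of a real gene.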

A Simple ORF finder
- start at left end
- scan triplet-by-non-overlapping triplet for AUG
- then continue scan for STOP
- repeat until right end
- repeat all starting at offset 1
- repeat all starting at offset 2
- then do it again on the other strand
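The steps above can be sketched in Python as follows. This is a minimal reading of the slide, not a production gene finder: the names `orfs_in_frame` and `find_orfs` are illustrative, only AUG is treated as a start, an ORF is reported only when a stop is actually reached, and the "other strand" is taken to be the reverse complement read left to right.

```python
STOPS = {"UAA", "UAG", "UGA"}
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Return the opposite strand, written in its own 5' to 3' direction."""
    return rna.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset):
    """Yield (start, end) for each AUG...STOP run in one reading frame.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "AUG":        # scan for a start codon
                start = i
        elif codon in STOPS:          # then continue to the next stop
            yield start, i + 3
            start = None              # resume scanning for the next AUG

def find_orfs(seq):
    """Scan all three offsets on both strands; return a list of hits."""
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset):
                hits.append((strand, offset, start, end, s[start:end]))
    return hits

# Top strand of the 20-base example from the "Scanning for ORFs" slide.
for hit in find_orfs("UUAAUGUGUCAUUGAUUAAG"):
    print(hit)
```

Coordinates for the "-" strand are relative to the reverse-complemented string; mapping them back to the original strand is left out to keep the sketch short.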

Scanning for ORFs
[Figure: a double-stranded example with reading frames 1, 2, 3 marked on the top strand and 4, 5, 6 on the bottom strand]
  U U A A U G U G U C A U U G A U U A A G
  A A U U A C A C A G U A A C U A A U A C
* In bacteria, GUG is sometimes a start codon…

Idea #2: Codon Frequency
- In random DNA, Leucine : Alanine : Tryptophan = 6 : 4 : 1
- But in real protein, ratios ~ 6.9 : 6.5 : 1
- So, coding DNA is not random
- Even more: synonym usage is biased (in a species-dependent way)
  - examples known with 90% AT in the 3rd base
- Why? E.g. efficiency, histone, enhancer, splice interactions
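The 6 : 4 : 1 ratio for random DNA is just the number of codons that encode each amino acid in the standard genetic code, assuming all 64 codons are equally likely. A short sketch that counts them (the variable names are illustrative):

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, '*' for stop,
# ordered by first, second, then third base, each running U, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Number of synonymous codons per amino acid.
codons_per_aa = Counter(CODON_TABLE.values())

leu, ala, trp = codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"]
print(f"Leu : Ala : Trp = {leu} : {ala} : {trp}")
```

Comparing these counts with observed amino acid frequencies in real proteins is what exposes the non-random composition of coding DNA.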

Recognizing Codon Bias
- Assume
  - Codon usage is i.i.d.; codon $abc$ occurs with frequency $f(abc)$
  - $a_1 a_2 a_3 a_4 \ldots a_{3n+2}$ is coding, in an unknown frame
- Calculate
  $p_1 = f(a_1 a_2 a_3)\, f(a_4 a_5 a_6) \cdots f(a_{3n-2} a_{3n-1} a_{3n})$
  $p_2 = f(a_2 a_3 a_4)\, f(a_5 a_6 a_7) \cdots f(a_{3n-1} a_{3n} a_{3n+1})$
  $p_3 = f(a_3 a_4 a_5)\, f(a_6 a_7 a_8) \cdots f(a_{3n} a_{3n+1} a_{3n+2})$
  $P_i = p_i / (p_1 + p_2 + p_3)$
- More generally: k-th order Markov model
  - k = 5 or 6 is typical
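A minimal sketch of this calculation is given below. The function name `frame_posteriors` and the toy frequency table are assumptions for illustration only; a real table would be estimated from known genes. The products are accumulated as sums of logs, since multiplying many small frequencies underflows quickly, and every codon is assumed to have a strictly positive frequency.

```python
import math

def frame_posteriors(seq, freq):
    """Return [P_1, P_2, P_3] for a sequence of length 3n + 2.

    `freq` maps each codon to its (positive) usage frequency f(abc).
    """
    n = (len(seq) - 2) // 3
    log_p = []
    for offset in range(3):                       # frames 1, 2, 3
        total = 0.0
        for k in range(n):
            codon = seq[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(freq[codon])        # log of the product p_i
        log_p.append(total)
    biggest = max(log_p)                          # subtract max for stability
    weights = [math.exp(lp - biggest) for lp in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]            # P_i = p_i / (p_1 + p_2 + p_3)

# Toy, unnormalized frequencies (hypothetical): AAA and CCC are "preferred".
toy_freq = {a + b + c: 0.01 for a in "ACGU" for b in "ACGU" for c in "ACGU"}
toy_freq["AAA"] = 0.2
toy_freq["CCC"] = 0.2

print(frame_posteriors("AAACCCAA", toy_freq))
```

With the toy table, nearly all of the posterior mass lands on frame 1, the only frame that reads the sequence as the two preferred codons.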

Codon Usage in ΦX174
Staden & McLachlan, NAR 10(1), 1982, 141-156

Promoters, etc.
- In prokaryotes, most DNA is coding
  - E.g. ~70% in H. influenzae
- Long ORFs + codon stats do well
- But obviously won't be perfect
  - short genes
  - 5' & 3' UTRs
- Can improve by modeling promoters, etc.
  - e.g. via WMM (weight matrix model) or higher-order Markov models
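To make the WMM idea concrete, here is a small sketch of a log-odds weight matrix built from aligned example sites and slid along a sequence. Everything specific in it is an assumption for illustration: the function names, the pseudocount of 1, the uniform background, and the four toy sites (variants of the TATAAT motif). It uses the DNA alphabet (T rather than U).

```python
import math

def build_wmm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build per-position log2-odds scores against a uniform background."""
    background = 1 / len(alphabet)
    wmm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        wmm.append({b: math.log2(counts[b] / total / background)
                    for b in alphabet})
    return wmm

def score(wmm, window):
    """Sum the per-position scores for a window of the same width."""
    return sum(column[base] for column, base in zip(wmm, window))

def best_hit(wmm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(wmm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(wmm, seq[i:i + width]))

toy_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]   # hypothetical training set
wmm = build_wmm(toy_sites)
sequence = "GGCGCTATAATGCGC"
start = best_hit(wmm, sequence)
print(start, sequence[start:start + len(wmm)])
```

In practice a score threshold would be chosen from known sites rather than simply taking the best window.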

Eukaryotes
- As in prokaryotes (but maybe more variable)
  - promoters
  - start/stop transcription
  - start/stop translation

And then…
Nobel Prize of the week: P. Sharp, 1993, Splicing

Mechanical Devices of the Spliceosome: Motors, Clocks, Springs, and Things
Jonathan P. Staley and Christine Guthrie
Cell, Volume 92, Issue 3, 6 February 1998, Pages 315-326

Figure 2. Spliceosome Assembly, Rearrangement, and Disassembly Requires ATP, Numerous DExD/H box Proteins, and Prp24. The snRNPs are depicted as circles. The pathway for S. cerevisiae is shown.

Figure 3. Splicing Requires Numerous Rearrangements
E.g.: exchange of U1 for U6

Figure 6. A Paradigm for Unwindase Specificity and Timing? The DExD/H box protein UAP56 (orange) binds U2AF65 (pink) through its linker region (L). U2 binds the branch point. Y's indicate the polypyrimidine stretch; RS, RRM as in Figure 5A. Sequences are from mammals.

Hints to Origins?
Tetrahymena thermophila

Genes in Eukaryotes
- As in prokaryotes (but maybe more variable)
  - promoters
  - start/stop transcription
  - start/stop translation
- New features:
  - introns, exons, splicing
  - branch point signal
  - alternative splicing
  - polyA site/tail
[Figure: 5' exon, intron, exon, intron 3', with splice signals AG/GT (donor), yyy..AG/G (acceptor), AG/GT (donor)]

Characteristics of human genes (Nature, 2/2001, Table 21)

                     Median     Mean       Sample (size)
Internal exon        122 bp     145 bp     RefSeq alignments to draft genome sequence, with confirmed intron boundaries (43,317 exons)
Exon number          7          8.8        RefSeq alignments to finished seq (3,501 genes)
Introns              1,023 bp   3,365 bp   RefSeq alignments to finished seq (27,238 introns)
3' UTR               400 bp     770 bp     Confirmed by mRNA or EST on chromo 22 (689)
5' UTR               240 bp     300 bp     Confirmed by mRNA or EST on chromo 22 (463)
Coding seq (CDS)     1,100 bp   1,340 bp   Selected RefSeq entries (1,804)*
                     367 aa     447 aa
Genomic span         14 kb      27 kb      Selected RefSeq entries (1,804)*

* 1,804 selected RefSeq entries were those with full-length unambiguous alignment to finished sequence

Big Genes
- Many genes are over 100 kb long
- Max known: dystrophin gene (DMD), 2.4 Mb
- The variation in the size distribution of coding sequences and exons is less extreme, although there are remarkable outliers.
- The titin gene has the longest currently known coding sequence at 80,780 bp; it also has the largest number of exons (178) and longest single exon (17,106 bp).
- RNA polymerase rate: 1.2-2.5 kb/min, so more than 16 hours to transcribe DMD
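The transcription-time estimate is a one-line division; the sketch below spells it out for both ends of the quoted rate range. The function name is illustrative, and the calculation assumes a constant elongation rate over the whole 2.4 Mb gene.

```python
def transcription_hours(gene_kb, rate_kb_per_min):
    """Hours needed to transcribe a gene at a constant elongation rate."""
    return gene_kb / rate_kb_per_min / 60

DMD_KB = 2400  # dystrophin gene, 2.4 Mb

fastest = transcription_hours(DMD_KB, 2.5)   # upper end of the rate range
slowest = transcription_hours(DMD_KB, 1.2)   # lower end of the rate range
print(f"DMD: {fastest:.1f} to {slowest:.1f} hours")
```

Even at the fastest quoted rate the gene takes 16 hours, which is where the "more than 16 hours" on the slide comes from.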

[Figure: three panels labeled Exons, Introns, Introns. Nature 2/2001]

Figure 36: GC content (Nature 2/2001)
[Figure: three panels labeled "Genes vs Genome", "Gene Density", and "Intron / Exon"]
- a: Distribution of GC content in genes and in the genome. For 9,315 known genes mapped to the draft genome sequence, the local GC content was calculated in a window covering either the whole alignment or 20,000 bp centered on the midpoint of the alignment, whichever was larger. Ns in the sequence were not counted. GC content for the genome was calculated for adjacent nonoverlapping 20,000-bp windows across the sequence. Both distributions normalized to sum to one.
- b: Gene density as a function of GC content (= ratios of data in a; less accurate at high GC because the denominator is small).
- c: Dependence of mean exon and intron lengths on GC content. The local GC content, based on alignments to finished sequence only, calculated from windows covering the larger of feature size or 10,000 bp centered on it.
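The genome-wide part of panel a (GC content in adjacent nonoverlapping windows, with Ns not counted) can be sketched as follows. The function names are illustrative, the last window is allowed to be shorter than the rest, and a window consisting only of Ns is reported as `None`; the caption does not say how the original analysis handled either case.

```python
def gc_content(seq):
    """Fraction of G or C among the A, C, G, T bases; Ns are not counted."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    counted = gc + at
    return gc / counted if counted else None

def windowed_gc(seq, window=20000):
    """GC content for adjacent nonoverlapping windows across the sequence."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]

# Tiny example with an 8-base window instead of 20,000 bp.
print(windowed_gc("GGCCNNATATAT", window=8))
```

A histogram of these per-window values, normalized to sum to one, would correspond to the genome curve described in panel a.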
