CSEP 590A Summer 2006 Lecture 8 RNA Secondary Structure Prediction

Outline Biological roles for RNA What is “secondary structure”? How is it represented? Why is it important? Examples Approaches

RNA Structure Primary Structure: Sequence Secondary Structure: Pairing Tertiary Structure: 3D shape

RNA Pairing Watson-Crick Pairing C - G ~ 3 kcal/mole A - U ~ 2 kcal/mole “Wobble Pair” G - U ~1 kcal/mole Non-canonical Pairs (esp. if modified)

Ribosomes Watson, Gilman, Witkowski, & Zoller, 1992

tRNA 3d Structure

tRNA - Alt. Representations [Figure: alternative drawings of the tRNA secondary structure; labels: 5’, 3’, anticodon loop]

“Classical” RNAs tRNA - transfer RNA (~61 kinds, ~ 75 nt) rRNA - ribosomal RNA (~4 kinds, 120-5k nt) snRNA - small nuclear RNA (splicing: U1, etc, 60-300nt) RNaseP - tRNA processing (~300 nt) RNase MRP - rRNA processing; mito. rep. (~225 nt) SRP - signal recognition particle; membrane targeting (~100-300 nt) SECIS - selenocysteine insertion element (~65nt) 6S - ? (~175 nt)

Semi-classical RNAs (discovery in mid 90’s) tmRNA - resetting stalled ribosomes Telomerase - (200-400nt) snoRNA - small nucleolar RNA (many varieties; 80-200nt)

Recent discoveries microRNAs riboswitches many ribozymes regulatory elements … Hundreds of families Rfam release 1, 1/2003: 25 families, 55k instances Rfam release 7, 3/2005: 503 families, 300k instances

Why? RNAs fold, and function Nature uses what works

Noncoding RNAs Breakthrough of the Year

Example: Glycine Regulation How is glycine level regulated? Plausible answer: transcription factors (proteins) bind to DNA to turn nearby genes on or off [Figure labels: glycine (g), transcription factor (TF), DNA, glycine cleavage enzyme (gce) gene, gce protein]

The Glycine Riboswitch Actual answer (in many bacteria): [Figure labels: glycine (g), gce mRNA (5′ to 3′), DNA, glycine cleavage enzyme (gce) gene, gce protein] Mandal et al. Science 2004

Gene Regulation: The Met Repressor SAM DNA Protein Alberts, et al, 3e.

Two SAM Riboswitches Corbino et al., Genome Biol. 2005 Alberts, et al, 3e.

6S mimics an open promoter [Figure panels: Bacillus/Clostridium, Actinobacteria, E. coli] Barrick et al. RNA 2005 Trotochaud et al. NSMB 2005 Willkomm et al. NAR 2005

The Hammerhead Ribozyme Involved in “rolling circle replication” of viruses.

Wanted Good structure prediction tools Good motif descriptions/models Good, fast search tools (“RNA BLAST”, etc.) Good, fast motif discovery tools (“RNA MEME”, etc.) Importance of structure makes last 3 hard

Why is RNA hard to deal with? [Figure: an RNA secondary structure drawing with its nucleotide sequence laid out along the fold] A: Structure often more important than sequence

Task 1: Structure Prediction

RNA Pairing Watson-Crick Pairing C - G ~ 3 kcal/mole A - U ~ 2 kcal/mole “Wobble Pair” G - U ~ 1 kcal/mole Non-canonical Pairs (esp. if modified)

Definitions Sequence: 5’ $r_1 r_2 r_3 \ldots r_n$ 3’, with each $r_i \in \{A, C, G, U\}$. A Secondary Structure is a set of pairs i•j such that: (1) $i < j-4$ (no sharp turns); and (2) if i•j and i’•j’ are two different pairs with $i \le i'$, then either $j < i'$ (the 2nd pair follows the 1st) or $i < i' < j' < j$ (the 2nd pair is nested within the 1st); that is, no “pseudoknots.”
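The definition can be checked mechanically. The following is a minimal Python sketch, assuming pairs are given as 1-based (i, j) tuples with i < j; the function name `is_secondary_structure` and the `min_gap` parameter are illustrative, not from the slides.

```python
def is_secondary_structure(pairs, min_gap=4):
    """Check the slide's definition on a collection of 1-based (i, j) pairs."""
    ordered = sorted(set(pairs))  # sorting guarantees i <= i' for later pairs
    # Condition 1: no sharp turns, i.e. i < j - 4 for every pair.
    for i, j in ordered:
        if not i < j - min_gap:
            return False
    # Condition 2: any two different pairs either follow or nest.
    for a in range(len(ordered)):
        i, j = ordered[a]
        for b in range(a + 1, len(ordered)):
            i2, j2 = ordered[b]
            follows = j < i2
            nested = i < i2 < j2 < j
            if not (follows or nested):
                return False  # crossing pairs (a pseudoknot) or a shared base
    return True
```

For example, `[(1, 10), (2, 9)]` is nested and accepted, while `[(1, 10), (5, 15)]` crosses and is rejected. The pairwise test also rejects two pairs that share a base, since such pairs neither follow nor nest.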

Nested Precedes Pseudoknot

A Pseudoknot A-C / \ 3’ - A-G-G-C-U U U-C-C-G-A-G-G-G | C-C-C - 5’ \ / U-C-U-C

Approaches to Structure Prediction Maximum Pairing + works on single sequences + simple - too inaccurate Minimum Energy + works on single sequences - ignores pseudoknots - only finds “optimal” fold Partition Function + finds all folds - ignores pseudoknots

Approaches, II Comparative sequence analysis + handles all pairings (incl. pseudoknots) - requires several (many?) aligned, appropriately diverged sequences Stochastic Context-free Grammars Roughly combines min energy & comparative, but no pseudoknots Physical experiments (x-ray crystallography, NMR)

Nussinov: Max Pairing
$B(i,j)$ = # pairs in optimal pairing of $r_i \ldots r_j$
$B(i,j) = 0$ for all $i, j$ with $i \ge j-4$; otherwise
$B(i,j) = \max\{\, B(i,j-1),\ \max\{ B(i,k-1) + 1 + B(k+1,j-1) \mid i \le k < j-4 \text{ and } r_k\text{-}r_j \text{ may pair} \} \,\}$
Time: $O(n^3)$

“Optimal pairing of $r_i \ldots r_j$” Two possibilities. j Unpaired: find best pairing of $r_i \ldots r_{j-1}$. j Paired (with some k): find best pairing of $r_i \ldots r_{k-1}$ + best pairing of $r_{k+1} \ldots r_{j-1}$, plus 1. Why is it slow? Why do pseudoknots matter?
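A minimal Python sketch of the max-pairing recurrence, assuming only C-G, A-U, and G-U may pair and using 1-based indices to match the slide. The names `can_pair` and `nussinov` are illustrative.

```python
def can_pair(a, b):
    """Watson-Crick or wobble pair (assumed pairing rule for this sketch)."""
    return {a, b} in ({"C", "G"}, {"A", "U"}, {"G", "U"})


def nussinov(seq):
    """Return B(1, n): the maximum number of nested pairs with i < j - 4."""
    n = len(seq)
    r = " " + seq  # pad so that r[1] is the first base
    # B[i][j] stays 0 whenever i >= j - 4, including the empty span j = i - 1.
    B = [[0] * (n + 2) for _ in range(n + 2)]
    for span in range(5, n):              # span = j - i, smallest pairable span is 5
        for i in range(1, n - span + 1):
            j = i + span
            best = B[i][j - 1]            # case 1: j unpaired
            for k in range(i, j - 4):     # case 2: j paired with k, i <= k < j - 4
                if can_pair(r[k], r[j]):
                    best = max(best, B[i][k - 1] + 1 + B[k + 1][j - 1])
            B[i][j] = best
    return B[1][n]
```

The three nested loops over span, i, and k are where the cubic running time comes from: there are O(n^2) table cells and each tries O(n) partners k.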

Pair-based Energy Minimization
$E(i,j)$ = energy of pairs in optimal pairing of $r_i \ldots r_j$
$E(i,j) = 0$ for all $i, j$ with $i \ge j-4$ (no pairs are possible, so there is no pairing energy); otherwise
$E(i,j) = \min\{\, E(i,j-1),\ \min\{ E(i,k-1) + e(r_k, r_j) + E(k+1,j-1) \mid i \le k < j-4 \} \,\}$, where $e(r_k, r_j)$ is the energy of the j-k pair
Time: $O(n^3)$
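A minimal Python sketch of pair-based minimization with a traceback that recovers the pairs. It rests on several assumptions: pair energies are the rough magnitudes quoted earlier in the lecture, written as negative (stabilizing) values; bases that cannot pair get energy +infinity; and spans too short to hold a pair contribute energy 0. The names `PAIR_ENERGY`, `e`, and `min_energy_fold` are illustrative.

```python
# Assumed sign convention: more negative means more stable (kcal/mole).
PAIR_ENERGY = {
    frozenset("CG"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def e(a, b):
    """Energy of pairing bases a and b; infinite if they cannot pair."""
    return PAIR_ENERGY.get(frozenset((a, b)), float("inf"))


def min_energy_fold(seq):
    """Return (E(1, n), sorted list of 1-based pairs) for the recurrence above."""
    n = len(seq)
    r = " " + seq
    E = [[0.0] * (n + 2) for _ in range(n + 2)]       # 0 where no pair fits
    choice = [[None] * (n + 2) for _ in range(n + 2)]  # partner k of j, if any
    for span in range(5, n):
        for i in range(1, n - span + 1):
            j = i + span
            best, arg = E[i][j - 1], None              # j unpaired
            for k in range(i, j - 4):                  # j paired with k
                cand = E[i][k - 1] + e(r[k], r[j]) + E[k + 1][j - 1]
                if cand < best:
                    best, arg = cand, k
            E[i][j], choice[i][j] = best, arg
    # Traceback: replay the recorded choices to list the pairs.
    pairs, stack = [], [(1, n)]
    while stack:
        i, j = stack.pop()
        if i >= j - 4:
            continue
        k = choice[i][j]
        if k is None:
            stack.append((i, j - 1))
        else:
            pairs.append((k, j))
            stack.append((i, k - 1))
            stack.append((k + 1, j - 1))
    return E[1][n], sorted(pairs)
```

On `"GGGAAAACCC"` this yields energy -9.0 with the three stacked G-C pairs (1,10), (2,9), (3,8).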

Loop-based Energy Minimization Detailed experiments show it’s more accurate to model based on loops, rather than just pairs. Loop types: Hairpin loop, Stack, Bulge, Interior loop, Multiloop [Figure: example structure with loops numbered 1-5]

Base Pairs and Stacking cytosine uracil thymine guanine adenine

The Double Helix

Loop Examples

Zuker: Loop-based Energy, I
$W(i,j)$ = energy of optimal pairing of $r_i \ldots r_j$
$V(i,j)$ = as above, but forcing pair i•j
$W(i,j) = V(i,j) = \infty$ for all $i, j$ with $i \ge j-4$
$W(i,j) = \min(\, W(i,j-1),\ \min\{ W(i,k-1) + V(k,j) \mid i \le k < j-4 \} \,)$

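Zuker: Loop-based Energy, II
$V(i,j) = \min(\, \mathrm{eh}(i,j),\ \mathrm{es}(i,j) + V(i+1,j-1),\ \mathrm{VBI}(i,j),\ \mathrm{VM}(i,j) \,)$, where the four terms are the hairpin, stack, bulge/interior, and multiloop cases, respectively
$\mathrm{VM}(i,j) = \min\{ W(i+1,k) + W(k+1,j-1) \mid i < k < j-1 \}$
$\mathrm{VBI}(i,j) = \min\{ \mathrm{ebi}(i,j,i',j') + V(i',j') \mid i < i' < j' < j \text{ and } i'-i+j-j' > 2 \}$
Time: $O(n^4)$, dominated by the bulge/interior term; $O(n^3)$ possible if $\mathrm{ebi}(\cdot)$ is “nice”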
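A rough Python sketch of the loop-based recurrences, under loudly stated assumptions. The energy functions `eh`, `es`, and `ebi` below are toy constants invented for illustration, not measured parameters. The sketch also departs from the slide's bookkeeping in two ways so that it runs: the exterior table `W` starts at 0 for spans too short to pair, and a separate table `WM` (minimum energy of a span containing at least one pair) feeds the multiloop term, which splits the interior i+1..j-1 so every multiloop has at least two branches. All names are hypothetical.

```python
import math

INF = math.inf
PAIRS = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def can_pair(a, b):
    return frozenset((a, b)) in PAIRS


# Toy loop energies (assumptions for illustration only).
def eh(i, j):            # hairpin closed by i.j
    return 4.0

def es(i, j):            # i.j stacked on (i+1).(j-1)
    return -3.0

def ebi(i, j, i2, j2):   # bulge/interior loop between i.j and i2.j2
    return 2.0 + 0.5 * ((i2 - i - 1) + (j - j2 - 1))


def zuker_sketch(seq):
    """Return the minimum toy energy of seq (1-based tables, O(n^4) time)."""
    n = len(seq)
    r = " " + seq
    V = [[INF] * (n + 2) for _ in range(n + 2)]    # i.j forced to pair
    WM = [[INF] * (n + 2) for _ in range(n + 2)]   # at least one pair in i..j
    W = [[0.0] * (n + 2) for _ in range(n + 2)]    # exterior; may be unpaired
    for span in range(5, n):
        for i in range(1, n - span + 1):
            j = i + span
            if can_pair(r[i], r[j]):
                vbi = min((ebi(i, j, i2, j2) + V[i2][j2]
                           for i2 in range(i + 1, j)
                           for j2 in range(i2 + 1, j)
                           if (i2 - i) + (j - j2) > 2), default=INF)
                vm = min((WM[i + 1][k] + WM[k + 1][j - 1]
                          for k in range(i + 1, j - 1)), default=INF)
                V[i][j] = min(eh(i, j), es(i, j) + V[i + 1][j - 1], vbi, vm)
            split = min((WM[i][k] + WM[k + 1][j] for k in range(i, j)),
                        default=INF)
            WM[i][j] = min(V[i][j], WM[i][j - 1], WM[i + 1][j], split)
            paired = min((W[i][k - 1] + V[k][j] for k in range(i, j - 4)),
                         default=INF)
            W[i][j] = min(W[i][j - 1], paired)
    return W[1][n]
```

With these toy values, `"GGGAAAACCC"` folds into a hairpin (+4.0) extended by two stacks (-3.0 each) for a total of -2.0. The double loop over (i2, j2) inside each (i, j) cell is the source of the O(n^4) bound.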

Suboptimal Energy There are always alternate folds with near-optimal energies. Thermodynamics: populations of identical molecules will exist in different folds; individual molecules even flicker among different folds Mod to Zuker’s algorithm finds subopt folds McCaskill: more elaborate dyn. prog. algorithm calculates the “partition function,” which defines the probability distribution over all these states.
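To illustrate what the partition function provides, here is a small Python sketch that converts the energies of an explicitly listed set of folds into Boltzmann probabilities. It is not McCaskill's dynamic program, which sums over all folds without enumerating them. The fold energies in the usage note are hypothetical, and the temperature default (310.15 K, about 37 °C) is an assumption.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def fold_probabilities(energies, temp_k=310.15):
    """Boltzmann probability of each fold, given energies in kcal/mole."""
    rt = R_KCAL * temp_k
    e_min = min(energies)
    # Subtract the minimum before exponentiating for numerical stability;
    # the shift cancels in the ratio weight / Z.
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    z = sum(weights)  # partition function (up to the common shift factor)
    return [w / z for w in weights]
```

For instance, `fold_probabilities([-9.0, -8.5, -6.0])` gives the optimal fold the largest share, while the fold only 0.5 kcal/mole worse still receives a substantial fraction of the population, which is the point of the “flicker among different folds” remark.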

Two competing secondary structures for the Leptomonas collosoma spliced leader mRNA.

Example of suboptimal folding Black dots: pairs in opt fold Colored dots: pairs in folds 2-5% worse than optimal fold

Accuracy Latest estimates suggest ~50-75% of base pairs predicted correctly in sequences of up to ~300nt Definitely useful, but obviously imperfect

Task 2: Motif Description

How to model an RNA “Motif”? Conceptually, start with a profile HMM: from a multiple alignment, estimate nucleotide/insert/delete preferences for each position; given a new seq, estimate likelihood that it could be generated by the model, & align it to the model [Figure: profile HMM with example columns labeled “mostly G” and “all G”, plus insert (ins) and delete (del) states]

How to model an RNA “Motif”? Add “column pairs” and pair emission probabilities for base-paired regions [Figure: alignment with paired columns marked <<<<<<< … >>>>>>>]
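A minimal Python sketch of estimating pair emission probabilities for one pair of alignment columns. The pseudocount smoothing, the 0-based column indices, and the function name `pair_emission_probs` are assumptions for illustration; real covariance-model training is more involved.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"


def pair_emission_probs(alignment, col_i, col_j, pseudocount=1.0):
    """Estimate P(a, b) for the bases seen together in two paired columns."""
    counts = Counter(
        (row[col_i], row[col_j])
        for row in alignment
        # Skip rows with a gap or ambiguous symbol in either column.
        if row[col_i] in ALPHABET and row[col_j] in ALPHABET
    )
    total = sum(counts.values()) + pseudocount * len(ALPHABET) ** 2
    return {
        (a, b): (counts[(a, b)] + pseudocount) / total
        for a, b in product(ALPHABET, repeat=2)
    }
```

With the toy alignment `["GAAAC", "CAAAG", "GAAAC", "AAAAU"]` and columns 0 and 4, the pair (G, C) gets probability 3/20 while an unseen pair such as (A, A) gets 1/20. Modeling the two columns jointly captures compensating changes (G-C in one sequence, C-G or A-U in another) that two independent single-column profiles would miss.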

RNA Motif Models “Covariance Models” (Eddy & Durbin 1994) aka profile stochastic context-free grammars aka hidden Markov models on steroids Model position-specific nucleotide preferences and base-pair preferences Pro: accurate Con: model building hard, search sloooow

Summary RNA has important roles beyond mRNA Many unexpected recent discoveries Structure is critical to function True of proteins, too, but they’re easier to find, due, e.g., to codon structure, which RNAs lack RNA secondary structure can be predicted (to useful accuracy) by dynamic programming RNA “motifs” (seq + 2-ary struct) well-captured by “covariance models”
