Fall 2008 RNA Function, Secondary Structure Prediction, Search, Discovery

Ribosomes
1974 Nobel prize to Romanian biologist George Palade for their discovery in the mid-1950s
50-80 proteins
3-4 RNAs (half the mass)
Catalytic core is RNA
Of course, mRNAs and tRNAs (messenger & transfer RNAs) are critical too
[Figure: Atomic structure of the 50S subunit from Haloarcula marismortui. Proteins are shown in blue and the two RNA strands in orange and yellow. The small patch of green in the center of the subunit is the active site. Source: Wikipedia]

tRNA 3D Structure

tRNA - Alt. Representations
[Figure: alternative drawings of the same tRNA, each labeled with its 3’ end, 5’ end, and anticodon loop]

RNA Pairing
Watson-Crick Pairing
  C - G  ~3 kcal/mole
  A - U  ~2 kcal/mole
“Wobble Pair”
  G - U  ~1 kcal/mole
Non-canonical Pairs (esp. if modified)

Definitions
Sequence 5’ $r_1 r_2 r_3 \ldots r_n$ 3’ in {A, C, G, U}
A Secondary Structure is a set of pairs i•j s.t.
  $i < j-4$ (no sharp turns), and
  if i•j & i’•j’ are two different pairs with $i \le i'$, then
    $j < i'$ (2nd pair follows 1st), or
    $i < i' < j' < j$ (2nd pair is nested within 1st)
  (no “pseudoknots”)

RNA Secondary Structure: Examples
[Figure: example foldings of short sequences, with labels marking a base pair, an acceptable hairpin (≥ 4 unpaired bases: ok), a sharp turn (not allowed), and a crossing pair (not allowed)]

[Figure: the three ways two pairs can be arranged: Nested, Precedes, Pseudoknot]

Approaches to Structure Prediction
Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate
Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold
Partition Function
  + finds all folds
  - ignores pseudoknots

Nussinov: Max Pairing
$B(i,j)$ = # pairs in optimal pairing of $r_i \ldots r_j$
$B(i,j) = 0$ for all $i, j$ with $i \ge j-4$; otherwise
$$B(i,j) = \max \begin{cases} B(i,j-1) \\ \max \{\, B(i,k-1) + 1 + B(k+1,j-1) \mid i \le k < j-4 \text{ and } r_k\text{-}r_j \text{ may pair} \,\} \end{cases}$$
Time: $O(n^3)$

“Optimal pairing of $r_i \ldots r_j$”
Two possibilities
  j Unpaired: find best pairing of $r_i \ldots r_{j-1}$
  j Paired (with some k): find best $r_i \ldots r_{k-1}$ + best $r_{k+1} \ldots r_{j-1}$, plus 1
Why is it slow?
Why do pseudoknots matter?
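The two cases above (j unpaired, or j paired with some k) translate almost line for line into a table-filling routine. The following is a minimal sketch, not the course's reference code: the function names `can_pair` and `nussinov`, the 0-based indexing, and the choice to allow G-U wobble pairs alongside Watson-Crick pairs are all assumptions made for illustration.

```python
def can_pair(a, b):
    """Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair."""
    return (a, b) in {("A", "U"), ("U", "A"), ("C", "G"),
                      ("G", "C"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_gap=4):
    """Return the maximum number of nested pairs k-j with k < j - min_gap."""
    n = len(seq)
    if n == 0:
        return 0
    # B[i][j] = max pairs in seq[i..j]; stays 0 whenever i >= j - min_gap.
    B = [[0] * n for _ in range(n)]
    for span in range(min_gap + 1, n):        # shortest intervals first
        for i in range(n - span):
            j = i + span
            best = B[i][j - 1]                # case 1: j is unpaired
            for k in range(i, j - min_gap):   # case 2: j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = B[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + B[k + 1][j - 1])
            B[i][j] = best
    return B[0][n - 1]
```

The three nested loops (interval length, left end i, split point k) are where the O(n^3) running time comes from.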

Pair-based Energy Minimization
$E(i,j)$ = energy of pairs in optimal pairing of $r_i \ldots r_j$
$E(i,j) = \infty$ for all $i, j$ with $i \ge j-4$; otherwise
$$E(i,j) = \min \begin{cases} E(i,j-1) \\ \min \{\, E(i,k-1) + e(r_k, r_j) + E(k+1,j-1) \mid i \le k < j-4 \,\} \end{cases}$$
where $e(r_k, r_j)$ is the energy of the j-k pair
Time: $O(n^3)$
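A minimal sketch of this recurrence follows, under several assumptions that are not in the slides: the name `min_energy`, the sign convention (stabilizing pairs get negative energy), and the use of the rough per-pair values from the RNA Pairing slide as $e(r_k, r_j)$. One further detail is an assumption of this sketch: intervals too short to hold a pair are scored 0 (the empty structure costs nothing), so that the recurrence has a finite starting point.

```python
import math

# Assumed pair energies in kcal/mole (negative = stabilizing), taken from the
# approximate magnitudes ~3 (C-G), ~2 (A-U), ~1 (G-U) quoted earlier.
PAIR_ENERGY = {
    ("C", "G"): -3.0, ("G", "C"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}


def min_energy(seq, min_gap=4):
    """Return the minimum total pair energy over nested structures of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    # E[i][j] for short intervals stays 0.0: no pair fits, nothing to pay.
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            best = E[i][j - 1]                       # j unpaired
            for k in range(i, j - min_gap):          # j paired with k
                # Letters that cannot pair get infinite energy, so that
                # candidate can never win the minimum.
                e = PAIR_ENERGY.get((seq[k], seq[j]), math.inf)
                left = E[i][k - 1] if k > i else 0.0
                best = min(best, left + e + E[k + 1][j - 1])
            E[i][j] = best
    return E[0][n - 1]
```

Compared with maximum pairing, the only changes are min in place of max and a per-pair energy in place of the constant 1, which is why the running time is unchanged.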

Loop-based Energy Minimization
Detailed experiments show it’s more accurate to model based on loops, rather than just pairs
Loop types
  1. Hairpin loop
  2. Stack
  3. Bulge
  4. Interior loop
  5. Multiloop
[Figure: a folded RNA with each of the five loop types numbered 1-5]

Zuker: Loop-based Energy, I
$W(i,j)$ = energy of optimal pairing of $r_i \ldots r_j$
$V(i,j)$ = as above, but forcing pair i•j
$W(i,j) = V(i,j) = \infty$ for all $i, j$ with $i \ge j-4$
$$W(i,j) = \min\bigl( W(i,j-1),\; \min \{\, W(i,k-1) + V(k,j) \mid i \le k < j-4 \,\} \bigr)$$

Zuker: Loop-based Energy, II
$$V(i,j) = \min\bigl( \underbrace{eh(i,j)}_{\text{hairpin}},\; \underbrace{es(i,j) + V(i+1,j-1)}_{\text{stack}},\; \underbrace{VBI(i,j)}_{\text{bulge/interior}},\; \underbrace{VM(i,j)}_{\text{multi-loop}} \bigr)$$
$$VM(i,j) = \min \{\, W(i,k) + W(k+1,j) \mid i < k < j \,\}$$
$$VBI(i,j) = \min \{\, ebi(i,j,i',j') + V(i',j') \mid i < i' < j' < j \;\&\; i'-i+j-j' > 2 \,\}$$
Time: $O(n^4)$, dominated by the bulge/interior term
$O(n^3)$ possible if $ebi(\cdot)$ is “nice”

Energy Parameters
Q. Where do they come from?
A1. Experiments with carefully selected synthetic RNAs
A2. Learned algorithmically from trusted alignments/structures

Accuracy
Latest estimates suggest ~50-75% of base pairs predicted correctly in sequences of up to ~300 nt
Definitely useful, but obviously imperfect

Approaches to Structure Prediction
Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate
Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold
Partition Function
  + finds all folds
  - ignores pseudoknots

Approaches, II
Comparative sequence analysis
  + handles all pairings (incl. pseudoknots)
  - requires several (many?) aligned, appropriately diverged sequences
Stochastic Context-free Grammars
  Roughly combines min energy & comparative, but no pseudoknots
Physical experiments (x-ray crystallography, NMR)

Summary
RNA has important roles beyond mRNA
  Many unexpected recent discoveries
Structure is critical to function
  True of proteins, too, but protein-coding genes are easier to find, due, e.g., to codon structure, which RNAs lack
RNA secondary structure can be predicted (to useful accuracy) by dynamic programming
Next: RNA “motifs” (sequence + secondary structure) are well captured by “covariance models”

“RNA sequence analysis using covariance models”
Eddy & Durbin
Nucleic Acids Research, 1994, vol 22 #11, 2079-2088
(see also Ch 10 of Durbin et al.)

What
A probabilistic model for RNA families
  The “Covariance Model”
  A Stochastic Context-Free Grammar
  A generalization of a profile HMM
Algorithms for Training
  From aligned or unaligned sequences
  Automates “comparative analysis”
  Complements Nussinov/Zuker RNA folding
Algorithms for searching

Main Results
Very accurate search for tRNA
  (Precursor to tRNAscan-SE, the current favorite)
Given sufficient data, model construction comparable to, but not quite as good as, human experts
Some quantitative info on importance of pseudoknots and other tertiary features

Probabilistic Model Search
As with HMMs, given a sequence, you calculate the likelihood ratio that the model could generate the sequence, vs a background model
You set a score threshold
  Anything above threshold → a “hit”
Scoring:
  “Forward” / “Inside” algorithm - sum over all paths
  Viterbi approximation - find single best path
  (Bonus: alignment & structure prediction)
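The score-and-threshold idea can be illustrated with a deliberately tiny stand-in for a real model. The sketch below is an assumption-laden toy, not an HMM or CM: it uses an ungapped, position-specific letter distribution (so there is only one path and Forward and Viterbi coincide), a uniform background, and hypothetical names `log_odds` and `scan`.

```python
import math

# Assumed background model: all four RNA letters equally likely.
BACKGROUND = {b: 0.25 for b in "ACGU"}


def log_odds(window, model):
    """Log2 likelihood ratio of `window` under the model vs the background.

    `model` is a list with one letter-probability dict per position.
    """
    return sum(math.log2(model[p][c] / BACKGROUND[c])
               for p, c in enumerate(window))


def scan(seq, model, threshold):
    """Slide the model along seq; report (start, score) above threshold."""
    width = len(model)
    hits = []
    for start in range(len(seq) - width + 1):
        score = log_odds(seq[start:start + width], model)
        if score > threshold:
            hits.append((start, score))
    return hits


def favoring(letter):
    """Toy emission distribution: 0.7 on one letter, 0.1 on each other one."""
    return {b: (0.7 if b == letter else 0.1) for b in "ACGU"}


# Hypothetical three-position model that prefers the word GAC.
TOY_MODEL = [favoring("G"), favoring("A"), favoring("C")]
hits = scan("UUGACUU", TOY_MODEL, threshold=3.0)
```

With a real profile HMM or CM the per-window score would come from the Forward/Inside or Viterbi computation instead of a simple sum, but the thresholding step is the same.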

Example: searching for tRNAs

Profile HMM Structure
$M_j$: Match states (20 emission probabilities)
$I_j$: Insert states (Background emission probabilities)
$D_j$: Delete states (silent - no emission)

CM Structure
A: Sequence + structure
B: the CM “guide tree”
C: probabilities of letters/pairs & of indels
Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order)

Overall CM Architecture
One box (“node”) per node of guide tree
BEG/MATL/INS/DEL just like an HMM
MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base pairs; BIF allows multiple helices

CM Viterbi Alignment
$x_i$ = $i$th letter of input
$x_{ij}$ = substring $i, \ldots, j$ of input
$T_{yz}$ = P(transition $y \to z$)
$E^y_{x_i,x_j}$ = P(emission of $x_i, x_j$ from state $y$)
$S^y_{ij} = \max_\pi \log P(x_{ij}$ generated starting in state $y$ via path $\pi)$

$$S^y_{ij} = \begin{cases}
\max_z \bigl[ S^z_{i+1,j-1} + \log T_{yz} + \log E^y_{x_i,x_j} \bigr] & \text{match pair} \\
\max_z \bigl[ S^z_{i+1,j} + \log T_{yz} + \log E^y_{x_i} \bigr] & \text{match/insert left} \\
\max_z \bigl[ S^z_{i,j-1} + \log T_{yz} + \log E^y_{x_j} \bigr] & \text{match/insert right} \\
\max_z \bigl[ S^z_{i,j} + \log T_{yz} \bigr] & \text{delete} \\
\max_{i < k \le j} \bigl[ S^{y_{\text{left}}}_{i,k} + S^{y_{\text{right}}}_{k+1,j} \bigr] & \text{bifurcation}
\end{cases}$$
Time $O(qn^3)$, $q$ states, seq len $n$

Model Training

[Figure: mRNA leader; mRNA leader switch?]

Mutual Information
$$M_{ij} = \sum_{x_i, x_j} f_{x_i,x_j} \log_2 \frac{f_{x_i,x_j}}{f_{x_i} f_{x_j}}; \qquad 0 \le M_{ij} \le 2$$
Max when no seq conservation but perfect pairing
MI = expected score gain from using a pair state
Finding optimal MI (i.e. opt pairing of cols) is hard(?)
Finding optimal MI without pseudoknots can be done by dynamic programming

M.I. Example (Artificial)

Alignment (16 sequences, columns 1-9):
  1 2 3 4 5 6 7 8 9
  A G A U A A U C U
  A G A U C A U C U
  A G A C G U U C U
  A G A U U U U C U
  A G C C A G G C U
  A G C G C G G C U
  A G C U G C G C U
  A G C A U C G C U
  A G G U A G C C U
  A G G G C G C C U
  A G G U G U C C U
  A G G C U U C C U
  A G U A A A A C U
  A G U C C A A C U
  A G U U G C A C U
  A G U U U C A C U

Letter counts per column:
      1   2   3   4   5   6   7   8   9
  A  16   0   4   2   4   4   4   0   0
  C   0   0   4   4   4   4   4  16   0
  G   0  16   4   2   4   4   4   0   0
  U   0   0   4   8   4   4   4   0  16

MI (bits) between column pairs (row column vs header column):
  MI:  1     2     3     4     5     6     7     8
  9    0     0     0     0     0     0     0     0
  8    0     0     0     0     0     0     0
  7    0     0     2     0.30  0     1
  6    0     0     1     0.55  1
  5    0     0     0     0.42
  4    0     0     0.30
  3    0     0
  2    0

Cols 1 & 9, 2 & 8: perfect conservation & might be base-paired, but unclear whether they are. M.I. = 0
Cols 3 & 7: No conservation, but always W-C pairs, so seems likely they do base-pair. M.I. = 2 bits.
Cols 7->6: unconserved, but each letter in 7 has only 2 possible mates in 6. M.I. = 1 bit.
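The mutual information formula can be checked directly against this artificial alignment. The sketch below is illustrative only; the names `mutual_information`, `ALIGNMENT`, and `column` are assumptions, and frequencies are plain observed counts with no pseudocounts or gap handling.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """MI in bits between two alignment columns given as equal-length lists."""
    n = len(col_i)
    f_i = Counter(col_i)                 # single-column letter counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint counts of letter pairs
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# The 16 sequences of the artificial example, one string per row.
ALIGNMENT = [
    "AGAUAAUCU", "AGAUCAUCU", "AGACGUUCU", "AGAUUUUCU",
    "AGCCAGGCU", "AGCGCGGCU", "AGCUGCGCU", "AGCAUCGCU",
    "AGGUAGCCU", "AGGGCGCCU", "AGGUGUCCU", "AGGCUUCCU",
    "AGUAAAACU", "AGUCCAACU", "AGUUGCACU", "AGUUUCACU",
]


def column(k):
    """Return column k of ALIGNMENT, numbered from 1 as in the example."""
    return [row[k - 1] for row in ALIGNMENT]


mi_3_7 = mutual_information(column(3), column(7))   # always W-C pairs
mi_7_6 = mutual_information(column(7), column(6))   # two mates per letter
mi_1_9 = mutual_information(column(1), column(9))   # both fully conserved
```

These three values correspond to the 2 bits, 1 bit, and 0 discussed for columns 3 & 7, 7 & 6, and 1 & 9. Only terms with a nonzero joint count are summed, which avoids taking the log of zero.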

MI-Based Structure-Learning
Find best (max total MI) subset of column pairs among i…j, subject to absence of pseudoknots
$$S_{i,j} = \max \begin{cases} S_{i,j-1} \\ \max_{i \le k < j-4} \; S_{i,k-1} + M_{k,j} + S_{k+1,j-1} \end{cases}$$
“Just like Nussinov/Zuker folding”
BUT, need enough data: enough sequences at right phylogenetic distance
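A sketch of this recurrence with a traceback is given below, so that the chosen column pairs are returned along with the total score. Everything about the packaging is an assumption for illustration: the name `best_pairing`, the 0-based square matrix `M` (only entries `M[k][j]` with k < j are read), and the made-up 12-column example matrix.

```python
def best_pairing(M, min_gap=4):
    """Max-total-MI set of non-crossing column pairs.

    M[k][j] is the mutual information of columns k < j (0-based).
    Returns (score, sorted list of (k, j) pairs).
    """
    n = len(M)
    if n == 0:
        return 0.0, []
    S = [[0.0] * n for _ in range(n)]
    # choice[i][j] = column paired with j in the best solution, else None.
    choice = [[None] * n for _ in range(n)]
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            S[i][j] = S[i][j - 1]                     # column j unpaired
            for k in range(i, j - min_gap):           # column j paired with k
                left = S[i][k - 1] if k > i else 0.0
                cand = left + M[k][j] + S[k + 1][j - 1]
                if cand > S[i][j]:
                    S[i][j] = cand
                    choice[i][j] = k
    # Traceback: replay the recorded decisions from the full interval down.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_gap:
            continue                                  # too short to hold a pair
        k = choice[i][j]
        if k is None:
            stack.append((i, j - 1))
        else:
            pairs.append((k, j))
            stack.append((i, k - 1))
            stack.append((k + 1, j - 1))
    return S[0][n - 1], sorted(pairs)


# Hypothetical MI values: three nested pairs plus one competitor (3, 11)
# that would cross (1, 10) and reuse column 11.
EXAMPLE_M = [[0.0] * 12 for _ in range(12)]
EXAMPLE_M[0][11] = 2.0
EXAMPLE_M[1][10] = 1.5
EXAMPLE_M[2][8] = 1.0
EXAMPLE_M[3][11] = 1.8
score, pairs = best_pairing(EXAMPLE_M)
```

In this example the competitor pair is rejected because taking it would forfeit the larger total from the nested pairs.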

Pseudoknots
[Figure: results with pseudoknots disallowed vs. allowed]
$$\Bigl( \sum_{i=1}^{n} \max_j M_{i,j} \Bigr) / 2$$

Rfam – an RNA family DB
Griffiths-Jones, et al., NAR ’03, ’05
Biggest scientific computing user in Europe - 1000 cpu cluster for a month per release
Rapidly growing:
  Rel 1.0, 1/03: 25 families, 55k instances
  Rel 7.0, 3/05: 503 families, >300k instances

Rfam
Input (hand-curated):
  MSA “seed alignment”
  SS_cons
  Score Thresh T
  Window Len W
Output:
  CM
  scan results & “full alignment”
IRE (partial seed alignment):
  Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC
  Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC
  Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC
  Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA
  Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC
  Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU
  Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG
  Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG
  Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU
  Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC
  Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG
  Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC
  Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC
  Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG
  Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC
  SS_cons  <<<<<...<<<<<......>>>>>.>>>>>

Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy
Zasha Weinberg & W.L. Ruzzo
Recomb ’04, ISMB ’04, Bioinfo ’06

Covariance Model
Key difference of CM vs HMM: Pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here.

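CMs are good, but slow
  Pipeline       Stages                             Cost
  Rfam Reality   EMBL → BLAST → CM → hits           1 month, 1000 computers
  Our Work       EMBL → Ravenna → CM → hits         ~2 months, 1000 computers
  Rfam Goal      EMBL → CM → hits                   10 years, 1000 computers
(In the first two pipelines the BLAST or Ravenna filter discards “junk” before the CM is run.)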

Results: New ncRNA’s?
  Name                     # found       # found                 # new
                           BLAST + CM    rigorous filter + CM
  Pyrococcus snoRNA            57            180                  123
  Iron response element       201            322                  121
  Histone 3’ element         1004           1106                  102
  Purine riboswitch            69            123                   54
  Retron msr                   11             59                   48
  Hammerhead I                167            193                   26
  Hammerhead III              251            264                   13
  U4 snRNA                    283            290                    7
  S-box                       128            131                    3
  U6 snRNA                   1462           1464                    2
  U5 snRNA                    199            200                    1
  U7 snRNA                    312            313                    1

CMfinder: A Covariance Model Based RNA Motif Finding Algorithm
Bioinformatics, 2006, 22(4): 445-452
Zizhen Yao, Zasha Weinberg, Walter L. Ruzzo
University of Washington, Seattle

CMfinder Accuracy
(on Rfam families with flanking sequence)

[Figure labels]
Chloroflexi: Chloroflexus aurantiacus
δ-Proteobacteria: Geobacter metallireducens, Geobacter sulfurreducens
Symbiobacterium thermophilum
CMfinder: 9 instances
Found by Scan: 447 hits

boxed = confirmed riboswitch (+2 more)
Weinberg, et al. Nucl. Acids Res., July 2007 35: 4809-4819.

Search in Vertebrates
Extract ENCODE Multiz alignments
  Remove exons, most conserved elements.
  56017 blocks, 8.7M bps.
  (Trust 17-way alignment for orthology, not for detailed alignment)
Apply CMfinder to both strands.
  10,106 predictions, 6,587 clusters.
  High false positive rate, but still suggests 1000’s of RNAs.
(We’ve applied CMfinder to whole human genome: O(1000) CPU years. Analysis in progress.)

10 of 11 top expressed, usually differentially

Summary
ncRNA - apparently widespread, much interest
Covariance Models - powerful but expensive tool for ncRNA motif representation, search, discovery
Rigorous/Heuristic filtering - typically 100x speedup in search with no/little loss in accuracy
CMfinder - CM-based motif discovery in unaligned sequences

Course Wrap Up

“High-Throughput BioTech”
Sensors
  DNA sequencing
  Microarrays/Gene expression
  Mass Spectrometry/Proteomics
  Protein/protein & DNA/protein interaction
Controls
  Cloning
  Gene knock out/knock in
  RNAi
Floods of data
“Grand Challenge” problems

CS Points of Contact
Scientific visualization
  Gene expression patterns
Databases
  Integration of disparate, overlapping data sources
  Distributed genome annotation in face of shifting underlying coordinates
AI/NLP/Text Mining
  Information extraction from journal texts with inconsistent nomenclature, indirect interactions, incomplete/inaccurate models, …
Machine learning
  System level synthesis of cell behavior from low-level heterogeneous data (DNA sequence, gene expression, protein interaction, mass spec, …)
Algorithms
  …

Frontiers & Opportunities
New data:
  Proteomics, SNP, array CGH, comparative sequence information, methylation, chromatin structure, ncRNA, interactome
New methods:
  graphical models? rigorous filtering?
Data integration
  many, complex, noisy sources

CSE P 590A Fall 2008 RNA Function, Secondary Structure Prediction, Search, Discovery The Message noncoding RNA Cells make lots of RNA Functionally important, functionally diverse Structurally complex New
