day 1
play

Day 1 " Last lecture: ! RNA Search and ! many biologically - PowerPoint PPT Presentation

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA " Motif Discovery " Today: " Covariance Models (CMs) represent CSE 527 ! conserved RNA sequence/structure motifs " Computational Biology


  1. Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA " Motif Discovery " Today: " Covariance Models (CMs) represent CSE 527 ! conserved RNA sequence/structure motifs " Computational Biology " 2 Computational Problems " How to predict secondary structure " How to model an RNA “motif” ! (I.e., sequence/structure pattern) " Given a motif, how to search for instances " Given (unaligned) sequences, find motifs " How to score discovered motifs " How to leverage prior knowledge " 3 4

  2. RNA Motif Models " “Covariance Models” (Eddy & Durbin 1994) " aka profile stochastic context-free grammars " Motif Description " aka hidden Markov models on steroids " Model position-specific nucleotide preferences and base-pair preferences " Pro: accurate " Con: model building hard, search sloooow " 10 What " Main Results " A probabilistic model for RNA families " Very accurate search for tRNA " The “Covariance Model” " (Precursor to tRNAscanSE - current favorite) " ! A Stochastic Context-Free Grammar " Given sufficient data, model construction A generalization of a profile HMM " comparable to, but not quite as good as, ! Algorithms for Training " human experts " From aligned or unaligned sequences " Some quantitative info on importance of Automates “comparative analysis” " pseudoknots and other tertiary features " Complements Nusinov/Zucker RNA folding " Algorithms for searching " 11 12

  3. Example: searching for Probabilistic Model Search " tRNAs ! As with HMMs, given a sequence, you calculate likelihood ratio that the model could generate the sequence, vs a background model " You set a score threshold " Anything above threshold " a “hit” " Scoring: " “Forward” / “Inside” algorithm - sum over all paths " Viterbi approximation - find single best path ! (Bonus: alignment & structure prediction) " 13 14 How to model an RNA “Motif”? " How to model an RNA “Motif”? " Add “column pairs” and pair emission probabilities Conceptually, start with a profile HMM: " for base-paired regions " from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position " given a new seq, estimate likelihood that it could be generated by the model, & align it to the model " <<<<<<< >>>>>>> paired columns … … mostly G ins all G del 16 17

  4. Profile Hmm Structure " CM Structure " A: Sequence + structure " B: the CM “guide tree” " C: probabilities of letters/ pairs & of indels " Think of each branch being an HMM emitting both sides of a helix (but M j : " Match states (20 emission probabilities) " 3’ side emitted in I j : " Insert states (Background emission probabilities) " reverse order) " D j : " Delete states (silent - no emission) " 18 24 Overall CM CM Viterbi Alignment ! Architecture " (the “inside” algorithm) " One box (“node”) per node = i th letter of input of guide tree " x i BEG/MATL/INS/DEL just x ij = substring i ,..., j of input like an HMM " T yz = P (transition y " z ) MATP & BIF are the key y E x i , x j = P (emission of x i , x j from state y ) additions: MATP emits pairs of symbols, modeling base- y S ij = max # log P ( x ij gen'd starting in state y via path # ) pairs; BIF allows multiple helices " 25 26

  5. CM Viterbi Alignment ! (the “inside” algorithm) " y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] z S ij & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y left + S k + 1, j y right ] ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 27 29 Covariation is strong evidence for base pairing mRNA leader Mutual Information " f xi , xj " M ij = f xi , xj log 2 ; 0 # M ij # 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing " MI = expected score gain from using a pair state " mRNA leader switch? Finding optimal MI, (i.e. opt pairing of cols) is hard(?) " Finding optimal MI without pseudoknots can be done by dynamic programming " 30 31

  6. M.I. Example (Artificial) " * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U A G G C U U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 only 2 possible mates in 6. M.I. = 1 bit. " C 0 0 4 4 4 4 4 16 0 G 0 16 4 2 4 4 4 0 0 32 35 U 0 0 4 8 4 4 4 0 16 MI-Based Structure-Learning ! Primary vs Secondary Info " Find best (max total MI) subset of column pairs among i…j, subject to absence of pseudo-knots ! $ S i , j = max S i , j " 1 j unpaired % max i # k < j " 4 S i , k " 1 + M k , j + S k + 1, j " 1 j paired & “Just like Nussinov/Zucker folding” " disallowing allowing BUT, need enough data---enough sequences at right pseudoknots phylogenetic distance " # n & % " ( /2 max j M i , j $ ' i = 1 36 37

  7. Comparison to TRNASCAN " tRNAScanSE " Fichant & Burks - best heuristic then " Uses 3 older heuristic tRNA finders as evaluation criteria Slightly different 97.5% true positive " prefilter " 0.37 false positives per MB " Uses CM built as described for final scoring " CM A1415 (trained on trusted alignment) " Actually 3(?) different CMs " > 99.98% true positives " " eukaryotic nuclear " < 0.2 false positives per MB " Current method-of-choice is “tRNAscanSE”, a CM- " prokaryotic " based scan with heuristic pre-filtering (including " organellar " TRNASCAN?) for performance reasons. " Used in all genome annotation projects " 40 41 Rfam – an RNA family DB ! Griffiths-Jones, et al., NAR ’03, ’05, ’08 " Biggest scientific computing user in Europe - An Important Application: ! 1000 cpu cluster for a month per release " Rfam " Rapidly growing: " Rel 1.0, 1/03: 25 families, 55k instances " Rel 7.0, 3/05: 503 families, >300k instances " Rel 9.0, 7/08: 603 families, 896k instances ! Rel 9.1, 1/09: 1372 families, ??? instances " 43

  8. Rfam database ! Example Rfam Family " http://www.sanger.ac.uk/Software/Rfam/ ! (Release 7.0, 3/2005) " IRE (partial seed alignment): ! Input (hand-curated): " 503 ncRNA families � Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC 280,000 annotated ncRNAs � Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU 8 riboswitches, 235 small nucleolar RNAs, Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG 8 spliceosomal RNAs, 10 bacterial Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU antisense RNAs, 46 microRNAs, 9 Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG ribozymes, 122 cis RNA regulatory Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC elements, … � alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 44 46 Day 2 ! Rfam – key issues "" " 5 slide synopsis of last lecture Overly narrow families " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " Variant structures/unstructured RNAs " Spliced RNAs " They allow accurate search " RNA pseudogenes " But " Human ALU is SRP related w/ 1.1m copies " " a) search is slow " Mouse B2 repeat (350k copies) tRNA related " " b) model construction is laborious " Speed & sensitivity " Motif discovery " 48 49

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend