day 1 rna search and motif discovery
play

Day 1 " RNA Search and ! Motif Discovery " Many - PDF document

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA " Genome 541 ! RNA secondary structure prediction " Intro to Computational ! Molecular Biology " 3 Approaches to Structure Prediction


  1. Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA " Genome 541 ! RNA secondary structure prediction " Intro to Computational ! Molecular Biology " 3 Approaches to Structure Prediction " Maximum Pairing ! " + works on single sequences ! " + simple ! " - too inaccurate " Minimum Energy ! " + works on single sequences ! " - ignores pseudoknots ! " - only finds “optimal” fold " Partition Function ! " + finds all folds ! " - ignores pseudoknots " 4 Nussinov: ! Approaches, II " A Computation Order " Comparative sequence analysis ! Or energy " + handles all pairings (potentially incl. pseudoknots) ! B(i,j) = # pairs in optimal pairing of r i ... r j " " - requires several (many?) aligned, ! B(i,j) = 0 for all i, j with i � j-4; otherwise " " appropriately diverged sequences " Stochastic Context-free Grammars ! B(i,j) = max of: " Roughly combines min energy & comparative, but K=2 B(i,j-1) " 3 no pseudoknots " 4 5 Physical experiments (x-ray crystalography, NMR) " max { B(i,k-1)+1+B(k+1,j-1) | ! i ! k < j-4 and r k -r j may pair} " Time: O(n 3 )

  2. Day 2 " Computational Problems " How to predict secondary structure " Day 1: ! Many biologically interesting roles for RNA " How to model an RNA “motif” ! (I.e., sequence/structure pattern) " RNA secondary structure prediction " Given a motif, how to search for instances " Today: " Given (unaligned) sequences, find motifs " Covariance Models (CMs) represent ! How to score discovered motifs " RNA sequence/structure motifs " How to leverage prior knowledge " Fast CM search " 8 9 RNA Motif Models " “Covariance Models” (Eddy & Durbin 1994) " Motif Description " aka profile stochastic context-free grammars " aka hidden Markov models on steroids " Model position-specific nucleotide preferences and base-pair preferences " Pro: accurate " Con: model building hard, search slow " 15 What " Main Results " A probabilistic model for RNA families " Very accurate search for tRNA " The “Covariance Model” " (Precursor to tRNAscanSE - current favorite) " " A Stochastic Context-Free Grammar " Given sufficient data, model construction A generalization of a profile HMM " comparable to, but not quite as good as, ! Algorithms for Training " human experts " From aligned or unaligned sequences " Some quantitative info on importance of Automates “comparative analysis” " pseudoknots and other tertiary features " Complements Nusinov/Zucker RNA folding " Algorithms for searching " 16 17

  3. Example: searching for Probabilistic Model Search " tRNAs ! As with HMMs, given a sequence, you calculate likelihood ratio that the model could generate the sequence, vs a background model " You set a score threshold " Anything above threshold # a “hit” " Scoring: " “Forward” / “Inside” algorithm - sum over all paths " Viterbi approximation - find single best path ! (Bonus: alignment & structure prediction) " 18 19 How to model an RNA “Motif”? " How to model an RNA “Motif”? " Add “column pairs” and pair emission probabilities Conceptually, start with a profile HMM: " for base-paired regions " from a multiple alignment, estimate nucleotide/ insert/delete preferences for each position " given a new seq, estimate likelihood that it could be generated by the model, & align it to the model " <<<<<<< >>>>>>> paired columns … … mostly G del ins all G 21 22 Profile Hmm Structure " CM Structure " A: Sequence + structure " B: the CM “guide tree” " C: probabilities of letters/ pairs & of indels " Think of each branch being an HMM emitting both sides of a helix (but M j : " Match states (20 emission probabilities) " 3’ side emitted in I j : " Insert states (Background emission probabilities) " reverse order) " D j : " Delete states (silent - no emission) " 23 29

  4. Overall CM CM Viterbi Alignment ! Architecture " (the “inside” algorithm) " One box (“node”) per node = i th letter of input of guide tree " x i BEG/MATL/INS/DEL just x ij = substring i ,..., j of input like an HMM " T yz = P (transition y " z ) MATP & BIF are the key y E x i , x j = P (emission of x i , x j from state y ) additions: MATP emits pairs of symbols, modeling base- y S ij = max # log P ( x ij gen'd starting in state y via path # ) pairs; BIF allows multiple helices " 30 31 CM Viterbi Alignment ! (the “inside” algorithm) " y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] S ij z & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 32 34 Covariation is strong evidence for base pairing mRNA leader Mutual Information " f xi , xj " M ij = f xi , xj log 2 ; 0 # M ij # 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing " MI = expected score gain from using a pair state " mRNA leader switch? Finding optimal MI, (i.e. opt pairing of cols) is hard(?) " Finding optimal MI without pseudoknots can be done by dynamic programming " 35 36

  5. M.I. Example (Artificial) " * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U A G G C U U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 only 2 possible mates in 6. M.I. = 1 bit. " C 0 0 4 4 4 4 4 16 0 G 0 16 4 2 4 4 4 0 0 37 40 U 0 0 4 8 4 4 4 0 16 Primary vs Secondary Info " Comparison to TRNASCAN " Fichant & Burks - best heuristic then " evaluation criteria Slightly different 97.5% true positive " 0.37 false positives per MB " CM A1415 (trained on trusted alignment) " > 99.98% true positives " < 0.2 false positives per MB " Current method-of-choice is “tRNAscanSE”, a CM- based scan with heuristic pre-filtering (including disallowing / allowing TRNASCAN?) for performance reasons. " pseudoknots # n & % " ( /2 max j M i , j $ ' i = 1 42 45 tRNAScanSE " Uses 3 older heuristic tRNA finders as An Important Application: ! prefilter " Rfam " Uses CM built as described for final scoring " Actually 3(?) different CMs " " eukaryotic nuclear " " prokaryotic " " organellar " Used in all genome annotation projects " 46

  6. Rfam – an RNA family DB ! Example Rfam Family " Griffiths-Jones, et al., NAR ’03, ’05, ’08 " IRE (partial seed alignment): ! Biggest scientific computing user in Europe - Input (hand-curated): " Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC 1000 cpu cluster for a month per release " MSA “seed alignment” " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Rapidly growing: " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC DB size: Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Rel 1.0, 1/03: 25 families, 55k instances " Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG ~8GB Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Rel 7.0, 3/05: 503 families, 363k instances " Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC Rel 9.0, 7/08: 603 families, 636k instances ! CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Rel 9.1, 1/09: 1372 families, 1148k instances " Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rel 10.0, 1/10: 1446 families, " 3193k instances " ~160GB Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC phylogeny, etc. " SS_cons <<<<<...<<<<<......>>>>>.>>>>> 48 53

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend