rfam faster genome annotation of
play

Rfam Faster Genome Annotation of Input (hand-tuned): IRE (partial - PowerPoint PPT Presentation

Rfam Faster Genome Annotation of Input (hand-tuned): IRE (partial seed alignment): Non-coding RNAs Without MSA Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC SS_cons Loss of Accuracy Hom.sap.


  1. Rfam Faster Genome Annotation of • Input (hand-tuned): IRE (partial seed alignment): Non-coding RNAs Without – MSA Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC – SS_cons Loss of Accuracy Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC – Score Thresh T Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC – Window Len W Zasha Weinberg Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG • Output: & W.L. Ruzzo Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU – CM Recomb ‘04 Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC – scan results Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> Covariance CM’s are good, but slow Model Rfam Reality Our Work Rfam Goal EMBL EMBL EMBL Key difference of CM vs HMM: Pair states emit paired symbols, Z Blast corresponding to base-paired nucleotides; 16 emission probabilities here. CM CM CM junk junk hits hits hits 1 month, ~2 months, 10 years, 1000 computers 1000 computers 1000 computers

  2. Oversimplified CM CM to HMM (for pedagogical purposes only) A A A A C C C C A A G G G G C C U U U U G G – – – – U U – – CM HMM A A A A C C C C A A G G G G C C U U U U G G – – – – U U – – 25 emisions per state 5 emissions per state, 2x states Key Issue: 25 scores � 10 Viterbi/Forward Scoring A A A A • Path � defines transitions/emissions C C C C CM HMM G G G G • Score( � ) = product of “probabilities” on � U U U U – – – – • NB: ok if “probs” aren’t, e.g. !"# • E.g. in CM, emissions are odds ratios vs 0th- order background • For any nucleotide sequence x: • Need: log Viterbi scores CM � HMM – Viterbi-score(x) = max{ score( � ) | � emits x} – Forward-score(x) = !$ score( � ) | � emits x}

  3. P AA � L A + R A P AC � L A + R C Key Issue: 25 scores � 10 Rigorous Filtering P AG � L A + R G P AU � L A + R U P A– � L A + R – … • Any scores satisfying the linear A A A A C C C C CM HMM inequalities give rigorous filtering G G G G U U U U – – – – NB:HMM not a prob. model Proof: • Need: log Viterbi scores CM � HMM CM Viterbi path score � “corresponding” HMM path score P AA � L A + R A P CA � L C + R A … P AC � L A + R C P CC � L C + R C … � Viterbi HMM path score P AG � L A + R G P CG � L C + R G … (even if it does not correspond to any CM path) P AU � L A + R U P CU � L C + R U … P A– � L A + R – P C– � L C + R – … Some scores filter better Optimizing filtering P UA = 1 � L U + R A • For any nucleotide sequence x: P UG = 4 � L U + R G Viterbi-score(x) = max{ score( � ) | � emits x } Forward-score(x) = !$ score( � ) | � emits x } Assuming ACGU � 25% • Expected Forward Score Option 1: Opt 1: E(L i , R i ) = ! x Forward-score(x)*Pr(x) L U = R A = R G = 2 L U + (R A + R G )/2 = 4 – NB: E is a function of L i , R i only Under 0th-order background model • Optimization: Option 2: Opt 2: Minimize E(L i , R i ) subject to score L.I.s L U = 0, R A = 1, R G = 4 L U + (R A + R G )/2 = 2.5 – This is heuristic (“forward � � Viterbi � � filter � ”) – But still rigorous because “subject to score L.I.s”

  4. Calculating E(L i , R i ) Minimizing E(L i , R i ) E(L i , R i ) = ! x Forward-score(x)*Pr(x) • Calculate E(L i , R i ) symbolically , in terms of emission scores, so we can do • Forward-like: for every state, calculate partial derivatives for numerical convex expected score for all paths ending optimization algorithm there, easily calculated from expected � E ( L 1 , L 2 ,...) scores of predecessors & transition/ emission probabilities/scores � L i Results: buried treasures Estimated Filtering Efficiency # found # found # new (139 Rfam 4.0 families) rigorous Name BLAST filter + CM + CM Filtering # families # families Pyrococcus snoRNA 57 180 123 fraction (compact) (expanded) Iron response element 201 322 121 < 10 -4 105 110 Histone 3’ element 1004 1106 102 Purine riboswitch 69 123 54 10 -4 - 10 -2 8 17 Retron msr 11 59 48 .01 - .10 11 3 Hammerhead I 167 193 26 Hammerhead III 251 264 13 .10 - .25 2 2 U4 snRNA 283 290 7 S-box 128 131 3 .25 - .99 6 4 U6 snRNA 1462 1464 2 U5 snRNA 199 200 1 .99 - 1.0 7 3 U7 snRNA 312 313 1

  5. Results: With additional work # with # with rigorous # new BLAST+CM filter series + CM Rfam tRNA 58609 63767 5158 Group II 5708 6039 331 intron tRNAscan- 608 729 121 SE (human) tmRNA 226 247 21 Lysine 60 71 11 riboswitch And more…

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend