rna search and
play

RNA Search and Whirlwind tour of ncRNA search & discovery Motif - PowerPoint PPT Presentation

Outline RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif description (Covariance Model Review) Algorithms for searching Rigorous & heuristic filtering Motif discovery Lecture


  1. Outline � RNA Search and � Whirlwind tour of ncRNA search & discovery � Motif Discovery � RNA motif description (Covariance Model Review) � Algorithms for searching � Rigorous & heuristic filtering � Motif discovery � Lecture 9 � Applications � CSEP 590A � Autumn 2008 � 2 RNA Motif Models � “Covariance Models” (Eddy & Durbin 1994) � aka profile stochastic context-free grammars � Motif Description � aka hidden Markov models on steroids � Model position-specific nucleotide preferences and base-pair preferences � Pro: accurate � Con: model building hard, search sloooow � 7 8

  2. Profile Hmm Structure � Example: searching for tRNAs M j : � Match states (20 emission probabilities) � I j : � Insert states (Background emission probabilities) � D j : � Delete states (silent - no emission) � 13 17 CM Structure � CM Viterbi Alignment � A: Sequence + structure � = i th letter of input x i B: the CM “guide tree” � x ij = substring i ,..., j of input C: probabilities of T yz = P (transition y � z ) letters/ pairs & of indels � y E x i , x j = P (emission of x i , x j from state y ) Think of each branch being an HMM emitting y S ij = max � log P ( x ij gen'd starting in state y via path � ) both sides of a helix (but 3’ side emitted in reverse order) � 18 20

  3. mRNA leader y = max � log P ( x ij generated starting in state y via path � ) S ij � z y max z [ S i + 1, j � 1 + log T yz + log E x i , x j ] match pair � y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left � � y = y ] z S ij � max z [ S i , j � 1 + log T yz + log E x j match/insert right mRNA leader switch? � z max z [ S i , j + log T yz ] delete � y left + S k + 1, j y right ] � max i < k � j [ S i , k bifurcation � Time O(qn 3 ), q states, seq len n 21 23 Mutual Information � f xi , xj � M ij = f xi , xj log 2 ; 0 � M ij � 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing � MI = expected score gain from using a pair state � Finding optimal MI, (i.e. opt pairing of cols) is hard(?) � Finding optimal MI without pseudoknots can be done by dynamic programming � 24 26

  4. Pseudoknots � � n disallowed allowed � � � /2 max j M i , j � � i = 1 29 31 Rfam – an RNA family DB � Rfam � Griffiths-Jones, et al., NAR ‘03,’05 � IRE (partial seed alignment): � Biggest scientific computing user in Europe - Input (hand-curated): � Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC 1000 cpu cluster for a month per release � MSA “seed alignment” � Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons � Rapidly growing: � Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T � Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Rel 1.0, 1/03: 25 families, 55k instances � Window Len W � Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: � Rel 7.0, 3/05: 503 families, >300k instances � Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM � Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG scan results & “full Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” � Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 33 34

  5. Faster Genome Annotation � Faster Search � of Non-coding RNAs � Without Loss of Accuracy � Zasha Weinberg � & W.L. Ruzzo � Recomb ‘04, ISMB ‘04, Bioinfo ‘06 � 37 38 RaveNnA: Genome Scale � CM’s are good, but slow � RNA Search � Rfam Reality Our Work Rfam Goal Typically 100x speedup over raw CM, w/ no loss in accuracy: � EMBL EMBL EMBL � � drop structure from CM to create a (faster) HMM � � use that to pre-filter sequence; � BLAST Ravenna � discard parts where, provably, CM score < threshold; � � actually run CM on the rest (the promising parts) � � assignment of HMM transition/emission scores is key � CM CM CM �� (large convex optimization problem) � junk junk hits hits hits 1 month, ~2 months, 10 years, Weinberg & Ruzzo, Bioinformatics , 2004, 2006 1000 computers 39 1000 computers 1000 computers 41

  6. CM to HMM � Key Issue: 25 scores � 10 � CM HMM CM HMM A A A A A A A A C C C C C C C C P L R G G G G G G G G U U U U U U U U – – – – – – – – A A A A C C C C G G G G Need: log Viterbi scores CM � HMM � U U U U – – – – 25 emisions per state 5 emissions per state, 2x states 43 44 Viterbi/Forward Scoring � Key Issue: 25 scores � 10 � CM HMM Path � defines transitions/emissions � A A A A C C C C Score( � ) = product of “probabilities” on � � P L R G G G G U U U U NB: ok if “probs” aren’t, e.g. !" 1 � – – – – NB: HMM not a prob. model (e.g. in CM, emissions are odds ratios vs � 0th-order background) � Need: log Viterbi scores CM � HMM � For any nucleotide sequence x: � P AA � L A + R A P CA � L C + R A … Viterbi-score(x) = max{ score( � ) | � emits x} � P AC � L A + R C P CC � L C + R C … P CG � L C + R G … Forward-score(x) = ! { score( � ) | � emits x} � P AG � L A + R G P AU � L A + R U P CU � L C + R U … P A– � L A + R – P C– � L C + R – … 45 46

  7. P AA � L A + R A P AC � L A + R C Rigorous Filtering � Some scores filter better � P AG � L A + R G P AU � L A + R U P A– � L A + R – … P UA = 1 � L U + R A � Any scores satisfying the linear inequalities P UG = 4 � L U + R G � give rigorous filtering � � � � � � � Assuming ACGU � 25% � Proof: � Option 1: � � � � Opt 1: � CM Viterbi path score � L U = R A = R G = 2 � � L U + (R A + R G )/2 = 4 � � “corresponding” HMM path score � � Viterbi HMM path score � Option 2: � � � � Opt 2: � (even if it does not correspond to any CM path) � L U = 0, R A = 1, R G = 4 � L U + (R A + R G )/2 = 2.5 � 47 48 Optimizing filtering � Calculating E(L i , R i ) � For any nucleotide sequence x: � E(L i , R i ) = ! x Forward-score(x)*Pr(x) � Viterbi-score(x) = max{ score( � ) | � emits x } � Forward-score(x) = ! { score( � ) | � emits x } � Forward-like: for every state, calculate Expected Forward Score � expected score for all paths ending there; E(L i , R i ) = ! all sequences x Forward-score(x)*Pr(x) � NB: E is a function of L i , R i only � easily calculated from expected scores of Under 0th-order background model Optimization: � predecessors & transition/emission Minimize E(L i , R i ) subject to score Lin.Ineq.s � probabilities/scores � This is heuristic (“forward � � Viterbi � � filter � ”) � But still rigorous because “subject to score Lin.Ineq.s” � 49 50

  8. Minimizing E(L i , R i ) � “Convex” Optimization � Calculate E(L i , R i ) Convex: � Nonconvex: � Forward: local max = global max; � can be many local maxima, symbolically , in terms of << global max; � simple “hill climbing” works � emission scores, so we “hill-climbing” fails � can do partial derivatives Viterbi: for numerical convex optimization algorithm � � E ( L 1 , L 2 ,...) � L i 51 52 Results: New ncRNA’s? � Estimated Filtering Efficiency � # found � # found # new � (139 Rfam 4.0 families) � rigorous filter Name � BLAST � + CM � + CM � Filtering # families # families fraction � (compact) � (expanded) � Pyrococcus snoRNA � 57 � 180 � 123 � Iron response element � 201 � 322 � 121 � < 10 -4 � 105 � 110 � ~100x Histone 3’ element � 1004 � 1106 � 102 � speedup 10 -4 - 10 -2 � 8 � 17 � Purine riboswitch � 69 � 123 � 54 � Retron msr � 11 � 59 � 48 � .01 - .10 � 11 � 3 � Hammerhead I � 167 � 193 � 26 � .10 - .25 � 2 � 2 � Hammerhead III � 251 � 264 � 13 � U4 snRNA � 283 � 290 � 7 � .25 - .99 � 6 � 4 � S-box � 128 � 131 � 3 � U6 snRNA � 1462 � 1464 � 2 � .99 - 1.0 � 7 � 3 � U5 snRNA � 199 � 200 � 1 � U7 snRNA � 312 � 313 � 1 � 53 54

  9. RNA Motif Discovery � Typical problem: given a ~10-20 unaligned sequences of ~1kb, most of which contain Motif Discovery � instances of one RNA motif of, say, 150bp -- find it. � Example: 5’ UTRs of orthologous glycine cleavage genes from � -proteobacteria � 60 61 Searching for noncoding RNAs � Cmfinder--A Covariance � Model Based RNA Motif � CM’s are great, but where do they come from? � Finding Algorithm An approach: comparative genomics � Bioinformatics , 2006, 22(4): 445-452 Search for motifs with common secondary structure in a set of functionally related sequences. � Challenges � Zizhen Yao � Three related tasks � Zasha Weinberg � Locate the motif regions. � Walter L. Ruzzo � Align the motif instances. � University of Washington, Seattle � Predict the consensus secondary structure. � Motif search space is huge! � Motif location space, alignment space, structure space. � 62 63

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend