RNA sequence analysis using covariance models


  1. “RNA sequence analysis using covariance models”, Eddy & Durbin, Nucleic Acids Research, 1994, 22(11):2079–2088

  2. What
  • A probabilistic model for RNA families
    – The “Covariance Model”
    – ≈ A Stochastic Context-Free Grammar
    – A generalization of a profile HMM
  • Algorithms for training
    – From aligned or unaligned sequences
    – Automates “comparative analysis”
    – Complements Nussinov/Zuker RNA folding
  • Algorithms for searching

  3. Main Results
  • Very accurate search for tRNA
    – (Precursor to tRNAscan-SE, the current favorite)
  • Given sufficient data, model construction is comparable to, but not quite as good as, human experts
  • Some quantitative info on the importance of pseudoknots and other tertiary features

  4. Probabilistic Model Search
  • As with HMMs, given a sequence, you calculate the likelihood ratio that the model could generate the sequence, vs. a background model
  • You set a score threshold
  • Anything above threshold → a “hit”
  • Scoring:
    – “Forward” / “Inside” algorithm: sum over all paths
    – Viterbi approximation: find the single best path (bonus: alignment & structure prediction)

  5. Example: searching for tRNAs

  6. Alignment Quality

  7. Comparison to TRNASCAN
  • Fichant & Burks – best heuristic then
    – 97.5% true positives
    – 0.37 false positives per MB
  • CM A1415 (trained on trusted alignment)
    – > 99.98% true positives
    – < 0.2 false positives per MB
    – (slightly different evaluation criteria)
  • Current method of choice is “tRNAscan-SE”, a CM-based scan with heuristic pre-filtering (including TRNASCAN?) for performance reasons

  8. Profile HMM Structure
  • M_j: match states (20 emission probabilities)
  • I_j: insert states (background emission probabilities)
  • D_j: delete states (silent – no emission)
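  As a rough sketch of this three-state node layout in Python (the uniform placeholder probabilities and the node count are my own illustration, not values from the paper):

    from dataclasses import dataclass, field

    ALPHABET = "ACGU"  # profile HMMs were introduced for proteins, hence "20" above

    @dataclass
    class ProfileHMMNode:
        # M_j: one emission probability per letter (placeholder: uniform)
        match_emit: dict = field(
            default_factory=lambda: {c: 1 / len(ALPHABET) for c in ALPHABET})
        # I_j: background emission probabilities (placeholder: uniform background)
        insert_emit: dict = field(
            default_factory=lambda: {c: 1 / len(ALPHABET) for c in ALPHABET})
        # D_j is silent: no emissions, only transitions.
        transitions: dict = field(
            default_factory=lambda: {(s, t): 1 / 3 for s in "MID" for t in "MID"})

    profile = [ProfileHMMNode() for _ in range(8)]  # an 8-node profile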

  9. CM Structure
  • A: sequence + structure
  • B: the CM “guide tree”
  • C: probabilities of letters/pairs & of indels
  • Think of each branch as an HMM emitting both sides of a helix (but the 3’ side emitted in reverse order)

  10. Overall CM Architecture
  • One box (“node”) per node of the guide tree
  • BEG/MATL/INS/DEL work just like in an HMM
  • MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base pairs; BIF allows multiple helices
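  A toy construction of a guide tree from a bracket-annotated consensus structure (my own illustration of the node types above; the paper builds the tree from the seed alignment and its structure annotation):

    def pair_table(ss):
        """Bracket string -> dict mapping each paired position to its partner."""
        stack, pt = [], {}
        for i, c in enumerate(ss):
            if c in "<(":
                stack.append(i)
            elif c in ">)":
                j = stack.pop()
                pt[i], pt[j] = j, i
        return pt

    def guide_tree(ss, i=0, j=None, pt=None):
        """Return a nested (node_type, children) guide tree for columns i..j."""
        if j is None:
            j, pt = len(ss) - 1, pair_table(ss)
        if i > j:
            return ("END", [])
        if i not in pt:                      # unpaired left column -> MATL
            return ("MATL", [guide_tree(ss, i + 1, j, pt)])
        if j not in pt:                      # unpaired right column -> MATR
            return ("MATR", [guide_tree(ss, i, j - 1, pt)])
        if pt[i] == j:                       # i and j base-paired -> MATP
            return ("MATP", [guide_tree(ss, i + 1, j - 1, pt)])
        k = pt[i]                            # two helices side by side -> BIF
        return ("BIF", [guide_tree(ss, i, k, pt), guide_tree(ss, k + 1, j, pt)])

    print(guide_tree("<<..>>.<<..>>"))  # a BIF over two MATP/MATL stems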

  11. CM Viterbi Alignment
  • $x_i$ = $i$-th letter of the input
  • $x_{ij}$ = substring $i,\dots,j$ of the input
  • $T_{yz} = P(\text{transition } y \to z)$
  • $E^y_{x_i,x_j} = P(\text{emission of } x_i, x_j \text{ from state } y)$
  • $S_{ij}^y = \max_\pi \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)$

  12. CM Viterbi recursion:

  $$S_{ij}^y = \max \begin{cases}
  \max_z \left[ S_{i+1,j-1}^z + \log T_{yz} + \log E^y_{x_i,x_j} \right] & \text{match pair} \\
  \max_z \left[ S_{i+1,j}^z + \log T_{yz} + \log E^y_{x_i} \right] & \text{match/insert left} \\
  \max_z \left[ S_{i,j-1}^z + \log T_{yz} + \log E^y_{x_j} \right] & \text{match/insert right} \\
  \max_z \left[ S_{i,j}^z + \log T_{yz} \right] & \text{delete} \\
  \max_{i \le k \le j} \left[ S_{i,k}^{y_\text{left}} + S_{k+1,j}^{y_\text{right}} \right] & \text{bifurcation}
  \end{cases}$$
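  To make the recursion concrete, here is a much-simplified sketch in Python: toy state objects and a memoized top-down version of the recursion, all my own illustration rather than the paper's implementation (a real CM scanner works bottom-up over the model's node ordering):

    # State types: "P" = pair-emitting, "L" = emit left, "R" = emit right,
    # "D" = silent/delete, "B" = bifurcation, "E" = end. Emissions and
    # transitions hold log scores. Assumes no cycles of silent states.
    from functools import lru_cache

    NEG_INF = float("-inf")

    class State:
        def __init__(self, stype, children=(), emit=None, left=None, right=None):
            self.stype = stype
            self.children = children          # list of (child state index, log T_yz)
            self.emit = emit or {}            # "P": keys (xi, xj); "L"/"R": keys xi
            self.left, self.right = left, right  # for "B": the two subtree roots

    def viterbi(states, x):
        n = len(x)

        @lru_cache(maxsize=None)
        def S(y, i, j):  # best log score for states[y] generating x[i:j]
            st = states[y]
            if st.stype == "E":
                return 0.0 if i == j else NEG_INF
            if st.stype == "B":  # bifurcation: try every split point k
                return max((S(st.left, i, k) + S(st.right, k, j)
                            for k in range(i, j + 1)), default=NEG_INF)
            best = NEG_INF
            for z, log_t in st.children:
                if st.stype == "P" and j - i >= 2:      # match pair
                    best = max(best, S(z, i + 1, j - 1) + log_t
                               + st.emit.get((x[i], x[j - 1]), NEG_INF))
                elif st.stype == "L" and j - i >= 1:    # match/insert left
                    best = max(best, S(z, i + 1, j) + log_t
                               + st.emit.get(x[i], NEG_INF))
                elif st.stype == "R" and j - i >= 1:    # match/insert right
                    best = max(best, S(z, i, j - 1) + log_t
                               + st.emit.get(x[j - 1], NEG_INF))
                elif st.stype == "D":                   # delete (silent)
                    best = max(best, S(z, i, j) + log_t)
            return best

        return S(0, 0, n)  # state 0 is the root

    # Toy model: a single pair state, then end; scores one Watson-Crick pair.
    states = [State("P", children=[(1, 0.0)],
                    emit={("A", "U"): 1.0, ("U", "A"): 1.0,
                          ("G", "C"): 1.0, ("C", "G"): 1.0}),
              State("E")]
    print(viterbi(states, "AU"))  # 1.0
    print(viterbi(states, "AA"))  # -inf: no score for an A-A "pair" in this toy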

  13. Model Training

  14. Mutual Information

  $$M_{ij} = \sum_{x_i, x_j} f_{x_i, x_j} \log_2 \frac{f_{x_i, x_j}}{f_{x_i}\, f_{x_j}}, \qquad 0 \le M_{ij} \le 2$$

  • Max when there is no sequence conservation but perfect pairing
  • MI = expected score gain from using a pair state
  • Finding optimal MI (i.e., the optimal pairing of columns) is NP-hard(?)
  • Finding optimal MI without pseudoknots can be done by dynamic programming (see the sketch after the example below)

  15. M.I. Example

  Alignment (16 sequences, columns 1–9):

    A G A U A A U C U
    A G A U C A U C U
    A G A C G U U C U
    A G A U U U U C U
    A G C C A G G C U
    A G C G C G G C U
    A G C U G C G C U
    A G C A U C G C U
    A G G U A G C C U
    A G G G C G C C U
    A G G U G U C C U
    A G G C U U C C U
    A G U A A A A C U
    A G U C C A A C U
    A G U U G C A C U
    A G U U U C A C U

  Per-column letter counts:

         1    2    3    4    5    6    7    8    9
    A   16    0    4    2    4    4    4    0    0
    C    0    0    4    4    4    4    4   16    0
    G    0   16    4    2    4    4    4    0    0
    U    0    0    4    8    4    4    4    0   16

  Pairwise MI values (bits), M_ij for columns i < j:

         1     2     3     4     5     6     7     8
    2    0
    3    0     0
    4    0     0   0.30
    5    0     0     0   0.42
    6    0     0     1   0.55     1
    7    0     0     2   0.30     0     1
    8    0     0     0     0     0     0     0
    9    0     0     0     0     0     0     0     0

  • Cols 1 & 9, 2 & 8: perfect conservation; they might be base-paired, but it is unclear whether they are. M.I. = 0
  • Cols 3 & 7: completely unconserved, but always W-C pairs, so it seems likely that they do base-pair. M.I. = 2 bits
  • Cols 6 & 7: unconserved, but each letter in 7 has only 2 possible mates in 6. M.I. = 1 bit
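  These numbers are easy to verify; a minimal sketch (the function name and the hard-coded toy alignment are mine):

    import math
    from collections import Counter

    seqs = ["AGAUAAUCU", "AGAUCAUCU", "AGACGUUCU", "AGAUUUUCU",
            "AGCCAGGCU", "AGCGCGGCU", "AGCUGCGCU", "AGCAUCGCU",
            "AGGUAGCCU", "AGGGCGCCU", "AGGUGUCCU", "AGGCUUCCU",
            "AGUAAAACU", "AGUCCAACU", "AGUUGCACU", "AGUUUCACU"]

    def mutual_information(seqs, i, j):
        """M_ij in bits for 1-based alignment columns i and j."""
        n = len(seqs)
        ci = [s[i - 1] for s in seqs]
        cj = [s[j - 1] for s in seqs]
        fi, fj, fij = Counter(ci), Counter(cj), Counter(zip(ci, cj))
        return sum((c / n) * math.log2((c / n) / ((fi[a] / n) * (fj[b] / n)))
                   for (a, b), c in fij.items())

    print(mutual_information(seqs, 3, 7))  # 2.0: unconserved but perfectly paired
    print(mutual_information(seqs, 6, 7))  # 1.0: each letter has two possible mates
    print(mutual_information(seqs, 1, 9))  # 0.0: perfect conservation, no covariation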

  16. MI-Based Structure Learning
  • Find the best (max total MI) subset of column pairs among $i \dots j$, subject to the absence of pseudoknots:

  $$S_{i,j} = \max \begin{cases}
  S_{i+1,j} \\
  S_{i,j-1} \\
  S_{i+1,j-1} + M_{i,j} \\
  \max_{i \le k < j} \left[ S_{i,k} + S_{k+1,j} \right]
  \end{cases}$$

  • “Just like Nussinov/Zuker folding”
  • BUT: need enough data, i.e. enough sequences at the right phylogenetic distance
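  A compact sketch of this recurrence, reusing mutual_information and seqs from the previous block (the min_mi threshold is my own addition, so that near-zero MI values don't force spurious pairs):

    def max_mi_structure(seqs, min_mi=0.5):
        """Nussinov-style DP: max total MI over nested (pseudoknot-free) pairings."""
        n = len(seqs[0])
        M = [[mutual_information(seqs, i, j) if 0 < i < j else 0.0
              for j in range(n + 1)] for i in range(n + 1)]
        S = [[0.0] * (n + 1) for _ in range(n + 1)]
        for span in range(2, n + 1):
            for i in range(1, n - span + 2):
                j = i + span - 1
                best = max(S[i + 1][j], S[i][j - 1])    # leave i or j unpaired
                if M[i][j] >= min_mi:                   # pair columns i and j
                    best = max(best, S[i + 1][j - 1] + M[i][j])
                for k in range(i, j):                   # bifurcation
                    best = max(best, S[i][k] + S[k + 1][j])
                S[i][j] = best
        return S[1][n]

    print(max_mi_structure(seqs))  # 3.0: pairs (3,7), with (5,6) nested inside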

  17. Pseudoknots
  • Crossing pairs (pseudoknots) are disallowed by the DP above; nested pairs are allowed
  • An upper bound on total MI, even if pseudoknots were allowed (each column pairs with its best partner; each pair is counted twice, hence the 1/2):

  $$\frac{1}{2} \sum_{i=1}^{n} \max_j M_{i,j}$$
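  In code, the bound is a one-liner (again reusing the helpers above):

    def mi_upper_bound(seqs):
        """(1/2) * sum_i max_j M_ij: no pairing, pseudoknotted or not, can beat
        this, since each column contributes to at most one pair."""
        n = len(seqs[0])
        return 0.5 * sum(max(mutual_information(seqs, min(i, j), max(i, j))
                             for j in range(1, n + 1) if j != i)
                         for i in range(1, n + 1))

    print(mi_upper_bound(seqs))  # 3.275, not far above the nested-only optimum of 3.0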

  18. Rfam – an RNA family DB (Griffiths-Jones et al., NAR ’03, ’05)
  • Biggest scientific computing user in Europe: a 1000-CPU cluster for a month per release
  • Rapidly growing:
    – Rel 1.0, 1/03: 25 families, 55k instances
    – Rel 7.0, 3/05: 503 families, >300k instances

  19. Rfam
  • Input (hand-curated):
    – MSA “seed alignment”
    – SS_cons (consensus secondary structure)
    – Score threshold T
    – Window length W
  • Output:
    – CM
    – scan results & “full alignment”

  IRE (partial seed alignment):

    Hom.sap.  GUUCCUGCUUCAACAGUGUUUGGAUGGAAC
    Hom.sap.  UUUCUUC.UUCAACAGUGUUUGGAUGGAAC
    Hom.sap.  UUUCCUGUUUCAACAGUGCUUGGA.GGAAC
    Hom.sap.  UUUAUC..AGUGACAGAGUUCACU.AUAAA
    Hom.sap.  UCUCUUGCUUCAACAGUGUUUGGAUGGAAC
    Hom.sap.  AUUAUC..GGGAACAGUGUUUCCC.AUAAU
    Hom.sap.  UCUUGC..UUCAACAGUGUUUGGACGGAAG
    Hom.sap.  UGUAUC..GGAGACAGUGAUCUCC.AUAUG
    Hom.sap.  AUUAUC..GGAAGCAGUGCCUUCC.AUAAU
    Cav.por.  UCUCCUGCUUCAACAGUGCUUGGACGGAGC
    Mus.mus.  UAUAUC..GGAGACAGUGAUCUCC.AUAUG
    Mus.mus.  UUUCCUGCUUCAACAGUGCUUGAACGGAAC
    Mus.mus.  GUACUUGCUUCAACAGUGUUUGAACGGAAC
    Rat.nor.  UAUAUC..GGAGACAGUGACCUCC.AUAUG
    Rat.nor.  UAUCUUGCUUCAACAGUGUUUGGACGGAAC
    SS_cons   <<<<<...<<<<<......>>>>>.>>>>>
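  The SS_cons line uses bracket notation; the paired columns can be recovered with a simple stack (a sketch; the names are mine):

    def parse_ss_cons(ss):
        """Map '<' / '>' bracket notation to 0-based (open, close) column pairs."""
        stack, pairs = [], []
        for i, c in enumerate(ss):
            if c == "<":
                stack.append(i)
            elif c == ">":
                pairs.append((stack.pop(), i))
        return sorted(pairs)

    ss_cons = "<<<<<...<<<<<......>>>>>.>>>>>"
    print(parse_ss_cons(ss_cons))
    # [(0, 29), (1, 28), (2, 27), (3, 26), (4, 25),
    #  (8, 23), (9, 22), (10, 21), (11, 20), (12, 19)]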

  20. Rfam – key issues
  • Overly narrow families
  • Variant structures / unstructured RNAs
  • Spliced RNAs
  • RNA pseudogenes
    – Human Alu is SRP-related, with 1.1M copies
    – Mouse B2 repeat (350k copies) is tRNA-related
  • Speed & sensitivity
  • Motif discovery

  21. Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy. Zasha Weinberg & W.L. Ruzzo, RECOMB ’04, ISMB ’04

  22. Covariance Model
  • Key difference of a CM vs. an HMM: pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here

  23. CMs are good, but slow
  • Rfam goal: EMBL → CM → hits (10 years, 1000 computers)
  • Rfam reality: EMBL → BLAST → (junk discarded) → CM → hits (1 month, 1000 computers)
  • Our work: EMBL → rigorous filter → (junk discarded) → CM → hits (~2 months, 1000 computers)

  24. Oversimplified CM (for pedagogical purposes only)
  [Figure: a chain of states, each emitting a left and a right symbol from the alphabet {A, C, G, U, –}]

  25. CM to HMM
  [Figure: each CM pair state, emitting over the 5×5 alphabet {A, C, G, U, –}², is split into a left and a right HMM state]
  • CM: 25 emissions per state
  • HMM: 5 emissions per state, 2× the states

  26. Key Issue: 25 scores → 10
  [Figure: one CM pair state vs. the corresponding HMM left and right states]
  • Need: log Viterbi scores, CM ≤ HMM

  27. Viterbi/Forward Scoring
  • A path π defines transitions/emissions
  • Score(π) = product of the “probabilities” on π
  • NB: it is OK if the “probs” aren’t probabilities, e.g. ∑ ≠ 1 (e.g. in a CM, emissions are odds ratios vs. a 0th-order background)
  • For any nucleotide sequence x:
    – Viterbi-score(x) = max{ score(π) | π emits x }
    – Forward-score(x) = ∑{ score(π) | π emits x }
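  For intuition, a brute-force check of the two definitions on a two-position toy model (all numbers invented):

    import itertools, math

    # Two positions; at each, a path picks one of two states, each contributing
    # an odds-ratio factor for the letter emitted. score(pi) = product of factors.
    odds = {"s1": {"A": 2.0, "C": 0.5}, "s2": {"A": 1.0, "C": 1.5}}

    def path_scores(x):
        for pi in itertools.product(odds, repeat=len(x)):      # every state path
            yield math.prod(odds[s][c] for s, c in zip(pi, x))

    x = "AC"
    print(max(path_scores(x)))  # Viterbi-score: best single path = 3.0 (s1 then s2)
    print(sum(path_scores(x)))  # Forward-score: 1.0 + 3.0 + 0.5 + 1.5 = 6.0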

  28. Key Issue: 25 scores → 10 (continued)
  [Figure: CM pair state vs. HMM left (L) and right (R) states]
  • NB: the HMM is not a probabilistic model
  • Need: log Viterbi scores, CM ≤ HMM; that is, for every pair score:

    P_AA ≤ L_A + R_A    P_CA ≤ L_C + R_A    …
    P_AC ≤ L_A + R_C    P_CC ≤ L_C + R_C    …
    P_AG ≤ L_A + R_G    P_CG ≤ L_C + R_G    …
    P_AU ≤ L_A + R_U    P_CU ≤ L_C + R_U    …
    P_A– ≤ L_A + R_–    P_C– ≤ L_C + R_–    …

  29. Rigorous Filtering
  • Any scores satisfying the linear inequalities (P_xy ≤ L_x + R_y, as above) give rigorous filtering
  • Proof:
    CM Viterbi path score
      ≤ “corresponding” HMM path score
      ≤ Viterbi HMM path score (even if the latter does not correspond to any CM path)
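  The validity check is then just the 25 inequalities (a sketch; P is a pair state's 5×5 log-score matrix over {A, C, G, U, –}, and L, R are the candidate HMM scores):

    def is_rigorous(P, L, R):
        """True iff P[x][y] <= L[x] + R[y] for every symbol pair, so the HMM
        Viterbi score provably upper-bounds the CM Viterbi score."""
        k = len(L)  # 5 symbols: A, C, G, U, gap
        return all(P[x][y] <= L[x] + R[y] for x in range(k) for y in range(k))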

  30. Some scores filter better
  • Example constraints: P_UA = 1 ≤ L_U + R_A and P_UG = 4 ≤ L_U + R_G (assuming ACGU ≈ 25%)
  • Option 1: L_U = R_A = R_G = 2, giving L_U + (R_A + R_G)/2 = 4
  • Option 2: L_U = 0, R_A = 1, R_G = 4, giving L_U + (R_A + R_G)/2 = 2.5
  • Both satisfy the inequalities, but option 2 gives lower expected HMM scores, so it filters better

  31. Optimizing filtering
  • For any nucleotide sequence x:
    – Viterbi-score(x) = max{ score(π) | π emits x }
    – Forward-score(x) = ∑{ score(π) | π emits x }
  • Expected Forward score, under the 0th-order background model:
    – E(L_i, R_i) = ∑ over all sequences x of Forward-score(x) · Pr(x)
    – NB: E is a function of the L_i, R_i only
  • Optimization: minimize E(L_i, R_i) subject to the score linear inequalities
    – This objective is heuristic (“forward ↓ ⇒ Viterbi ↓ ⇒ filter ↓”)
    – But filtering is still rigorous, because of the “subject to score L.I.s”
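  A toy version of this optimization, assuming a single pair state split into one left and one right HMM state, so the expected forward score factorizes; the P matrix is invented, and scipy's general-purpose solver stands in for whatever the authors actually used:

    import numpy as np
    from scipy.optimize import minimize

    # Invented log2-odds pair scores over {A, C, G, U}; uniform 0th-order background.
    P = np.array([[-2., -2., -2.,  3.],
                  [-2., -2.,  3., -2.],
                  [-2.,  3., -2., -1.],
                  [ 3., -2.,  1., -2.]])
    q = np.full(4, 0.25)

    def expected_forward(v):
        L, R = v[:4], v[4:]
        return (q @ 2.0 ** L) * (q @ 2.0 ** R)   # E(L, R) for this two-state toy

    # One linear inequality per pair: L_x + R_y - P_xy >= 0.
    cons = [{"type": "ineq", "fun": lambda v, x=x, y=y: v[x] + v[4 + y] - P[x, y]}
            for x in range(4) for y in range(4)]

    res = minimize(expected_forward, x0=np.full(8, 2.0), constraints=cons)
    print(np.round(res.x, 2), round(expected_forward(res.x), 3))

  Minimizing the expected forward score pushes slack onto the rarely-binding inequalities, which is exactly the effect illustrated on the previous slide.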
