RNA sequence analysis using covariance models Eddy & Durbin - PowerPoint PPT Presentation

“RNA sequence analysis using covariance models” Eddy & Durbin Nucleic Acids Research, 1994 vol 22 #11, 2079-2088

What • A probabilistic model for RNA families – The “Covariance Model” – ≈ A Stochastic Context-Free Grammar – A generalization of a profile HMM • Algorithms for Training – From aligned or unaligned sequences – Automates “comparative analysis” – Complements Nusinov/Zucker RNA folding • Algorithms for searching

Main Results • Very accurate search for tRNA – (Precursor to tRNAscanSE - current favorite) • Given sufficient data, model construction comparable to, but not quite as good as, human experts • Some quantitative info on importance of pseudoknots and other tertiary features

Probabilistic Model Search • As with HMMs, given a sequence, you calculate llikelihood ratio that the model could generate the sequence, vs a background model • You set a score threshold • Anything above threshold --> a “hit” • Scoring: – “Forward” / “Inside” algorithm - sum over all paths – Viterbi approximation - find single best path (Bonus: alignment & structure prediction)

Example: searching for tRNAs

Alignment Quality

Comparison to TRNASCAN • Fichant & Burks - best heuristic then evaluation criteria – 97.5% true positive Slightly different – 0.37 false positives per MB • CM A1415 (trained on trusted alignment) – > 99.98% true positives – <0.2 false positives per MB • Current method-of-choice is “tRNAscanSE”, a CM-based scan with heuristic pre-filtering (including TRNASCAN?) for performance reasons.

Profile Hmm Structure M j : Match states (20 emission probabilities) I j : Insert states (Background emission probabilities) D j : Delete states (silent - no emission)

CM Structure • A: Sequence + structure • B: the CM “guide tree” • C: probabilities of letters/ pairs & of indels • Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order)

Overall CM Architecture • One box (“node”) per node of guide tree • BEG/MATL/INS/DEL just like an HMM • MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base-pairs; BIF allows multiple helices

CM Viterbi Alignment x i = i th letter of input x ij = substring i ,..., j of input T yz = P (transition y � z ) y E x i , x j = P (emission of x i , x j from state y ) y = max � log P ( x ij generated starting in state y via path � ) S ij

y = max � log P ( x ij generated starting in state y via path � ) S ij � z y max z [ S i + 1, j � 1 + log T yz + log E x i , x j ] match pair � y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left � � y = y ] z S ij � max z [ S i , j � 1 + log T yz + log E x j match/insert right � z max z [ S i , j + log T yz ] delete � y right ] y left + S k + 1, j � max i < k � j [ S i , k bifurcation �

Model Training

Mutual Information f xi , xj � M ij = f xi , xj log 2 ; 0 � M ij � 2 f xi f xj xi , xj • Max when no sequence conservation but perfect pairing • MI = expected score gain from using a pair state • Finding optimal MI, (i.e. optimal pairing of columns) is NP-hard(?) • Finding optimal MI without pseudoknots can be done by dynamic programming

M.I. Example * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U Cols 1 & 9, 2 & 8: perfect conservation and A G G C U U C C U A G U A A A A C U might be base-paired, but unclear whether A G U C C A A C U they are. M.I. = 0 A G U U G C A C U A G U U U C A C U Cols 3 & 7: completely unconserved, but always W-C pairs, so seems likely that they A 16 0 4 2 4 4 4 0 0 do base-pair. M.I. = 2 bits. C 0 0 4 4 4 4 4 16 0 G 0 16 4 2 4 4 4 0 0 Cols 7->6: unconserved, but each letter in 7 U 0 0 4 8 4 4 4 0 16 has only 2 possible mates in 6. M.I. = 1 bit.

MI-Based Structure-Learning • find best (max total MI) subset of column pairs among i…j, subject to absence of pseudo-knots � S i + 1, j � � S i , j � 1 S i , j = max � S i + 1, j � 1 + M i , j � � max i < j < k S i , k + S k + 1, j � • “just like Nussinov/Zucker folding” • BUT, need enough data---enough sequences at right phylogenetic distance

Pseudoknots � � n � disallowed allowed � � /2 max j M i , j � � i = 1

Rfam – an RNA family DB Griffiths-Jones, et al., NAR ‘03,’05 • Biggest scientific computing user in Europe - 1000 cpu cluster for a month per release • Rapidly growing: – Rel 1.0, 1/03: 25 families, 55k instances – Rel 7.0, 3/05: 503 families, >300k instances

Rfam • Input (hand-curated): IRE (partial seed alignment): – MSA “seed alignment” Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC – SS_cons Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC – Score Thresh T Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC – Window Len W Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG • Output: Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU – CM Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC – scan results & “full Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC alignment” Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>>

Rfam – key issues • Overly narrow families • Variant structures/unstructured RNAs • Spliced RNAs • RNA pseudogenes – Human ALU is SRP related w/ 1.1m copies – Mouse B2 repeat (350k copies) tRNA related • Speed & sensitivity • Motif discovery

Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy Zasha Weinberg & W.L. Ruzzo Recomb ‘04, ISMB ‘04

Covariance Model Key difference of CM vs HMM: Pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here.

CM’s are good, but slow Rfam Reality Our Work Rfam Goal EMBL EMBL EMBL Z Blast CM CM CM junk junk hits hits hits 1 month, ~2 months, 10 years, 1000 computers 1000 computers 1000 computers

Oversimplified CM (for pedagogical purposes only) A A C C G G U U – – A A C C G G U U – –

CM to HMM A A A A C C C C G G G G U U U U – – – – CM HMM A A A A C C C C G G G G U U U U – – – – 25 emisions per state 5 emissions per state, 2x states

Key Issue: 25 scores  10 A A A A C C C C CM HMM G G G G U U U U – – – – • Need: log Viterbi scores CM ≤ HMM

Viterbi/Forward Scoring • Path π defines transitions/emissions • Score( π ) = product of “probabilities” on π • NB: ok if “probs” aren’t, e.g. ∑≠1 (e.g. in CM, emissions are odds ratios vs 0th-order background) • For any nucleotide sequence x: – Viterbi-score(x) = max{ score( π ) | π emits x} – Forward-score(x) = ∑{ score( π ) | π emits x}

Key Issue: 25 scores  10 A A A A C C C C CM HMM L R G G G G U U U U – – – – NB:HMM not a prob. model • Need: log Viterbi scores CM ≤ HMM P AA ≤ L A + R A P CA ≤ L C + R A … P AC ≤ L A + R C P CC ≤ L C + R C … P AG ≤ L A + R G P CG ≤ L C + R G … P AU ≤ L A + R U P CU ≤ L C + R U … P C– ≤ L C + R – P A– ≤ L A + R – …

P AA ≤ L A + R A P AC ≤ L A + R C Rigorous Filtering P AG ≤ L A + R G P AU ≤ L A + R U P A– ≤ L A + R – … • Any scores satisfying the linear inequalities give rigorous filtering Proof: CM Viterbi path score ≤ “corresponding” HMM path score ≤ Viterbi HMM path score (even if it does not correspond to any CM path)

Some scores filter better P UA = 1 ≤ L U + R A P UG = 4 ≤ L U + R G Assuming ACGU ≈ 25% Option 1: Opt 1: L U = R A = R G = 2 L U + (R A + R G )/2 = 4 Option 2: Opt 2: L U = 0, R A = 1, R G = 4 L U + (R A + R G )/2 = 2.5

Optimizing filtering • For any nucleotide sequence x: Viterbi-score(x) = max{ score( π ) | π emits x } Forward-score(x) = ∑{ score( π ) | π emits x } • Expected Forward Score E(L i , R i ) = ∑ all sequences x Forward-score(x)*Pr(x) – NB: E is a function of L i , R i only Under 0th-order background model • Optimization: Minimize E(L i , R i ) subject to score L.I.s – This is heuristic (“forward ↓ ⇒ Viterbi ↓ ⇒ filter ↓ ”) – But still rigorous because “subject to score L.I.s”

RNA sequence analysis using covariance models Eddy & Durbin - PowerPoint PPT Presentation

RNA sequence analysis using covariance models Eddy & Durbin Nucleic Acids Research, 1994 vol 22 #11, 2079-2088 What A probabilistic model for RNA families The Covariance Model A Stochastic Context-Free Grammar

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly Protein Sequence Sequence

Lecture 14 Covariance Functions 3/08/2018 1 More on Covariance Functions 2 Nugget Covariance

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Archaeologists 1 WALT identify how and why people in the Shang Dynasty wrote. What might

Results Presentation For the Half Year Ended 31 December 2016 COMMONWEALTH BANK OF AUSTRALIA |

F. Malcolm Cunningham, Sr. Bar Association COVID-19 Reopening Checklist for Employers presented

FY19 PRELIMINARY FINAL REPORT INVESTOR PRESENTATION Rod Bishop, Managing Director Peter

GROUP EVENT PRESENTATION vegassuperkarts.com/group-events 001 (702) 802-5600

15 Premier Projects In the Worlds Highest -Grade Lithium District TSX-V: LITH Disclaimer The

Coonamessett Farm Foundation: Electronic catch reporting for the US scallop fleet - RSA:

15 Premier Projects Disclaimer Inve vestment Highlights High-Grade Lithium, Premier Locations,

RNA sequence analysis using covariance models Eddy & Durbin - PowerPoint PPT Presentation

RNA sequence analysis using covariance models Eddy & Durbin Nucleic Acids Research, 1994 vol 22 #11, 2079-2088 What A probabilistic model for RNA families The Covariance Model A Stochastic Context-Free Grammar

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly Protein Sequence Sequence

Lecture 14 Covariance Functions 3/08/2018 1 More on Covariance Functions 2 Nugget Covariance

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Day 1 &quot; Last lecture: ! RNA Search and ! many biologically interesting roles for RNA &quot;

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

DNA AND RNA ATI TEAS SCIENCE DNA &amp; RNA Questions related to DNA and RNA cover topics

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Archaeologists 1 WALT identify how and why people in the Shang Dynasty wrote. What might

Results Presentation For the Half Year Ended 31 December 2016 COMMONWEALTH BANK OF AUSTRALIA |

F. Malcolm Cunningham, Sr. Bar Association COVID-19 Reopening Checklist for Employers presented

FY19 PRELIMINARY FINAL REPORT INVESTOR PRESENTATION Rod Bishop, Managing Director Peter

GROUP EVENT PRESENTATION vegassuperkarts.com/group-events 001 (702) 802-5600

15 Premier Projects In the Worlds Highest -Grade Lithium District TSX-V: LITH Disclaimer The

Coonamessett Farm Foundation: Electronic catch reporting for the US scallop fleet - RSA:

15 Premier Projects Disclaimer Inve vestment Highlights High-Grade Lithium, Premier Locations,

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics