RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca - PowerPoint PPT Presentation

The Human Parts List, circa 2001 1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca agaggcggcg ggagccggtg 3 billion nucleotides, containing: 121 gcggctcggc atcatgcgtc gagggcgtct gctggagatc gccctgggat ttaccgtgct 181 tttagcgtcc tacacgagcc atggggcgga cgccaatttg gaggctggga acgtgaagga Motif Discovery 241 aaccagagcc agtcgggcca agagaagagg cggtggagga cacgacgcgc ttaaaggacc •25,000 protein-coding genes 301 caatgtctgt ggatcacgtt ataatgctta ctgttgccct ggatggaaaa ccttacctgg 361 cggaaatcag tgtattgtcc ccatttgccg gcattcctgt ggggatggat tttgttcgag (only ~1% of the DNA) 421 gccaaatatg tgcacttgcc catctggtca gatagctcct tcctgtggct ccagatccat 481 acaacactgc aatattcgct gtatgaatgg aggtagctgc agtgacgatc actgtctatg •Messenger RNAs made from each 541 ccagaaagga tacataggga ctcactgtgg acaacctgtt tgtgaaagtg gctgtctcaa Lectures 18-19 601 tggaggaagg tgtgtggccc caaatcgatg tgcatgcact tacggattta ctggacccca 661 gtgtgaaaga gattacagga caggcccatg ttttactgtg atcagcaacc agatgtgcca •Plus a double-handful of other RNA genes 721 gggacaactc agcgggattg tctgcacaaa acagctctgc tgtgccacag tcggccgagc CSE 527 781 ctggggccac ccctgtgaga tgtgtcctgc ccagcctcac ccctgccgcc gtggcttcat Autumn 2007 841 tccaaatatc cgcacgggag cttgtcaaga tgtggatgaa tgccaggcca tccccgggct 901 ctgtcaggga ggaaattgca ttaatactgt tgggtctttt gagtgcaaat gccctgctgg 961 acacaaactt aatgaagtgt cacaaaaatg tgaagatatt gatgaatgca gcaccattcc 1021 ... Noncoding Outline RNAs Dramatic discoveries in Task 1: RNA 2 ary Structure Prediction (last time) last 5 years Task 2: RNA Motif Models 100s of new families Many roles: Regulation, Covariance Models transport, stability, catalysis, … Training & “Mutual Information” 1% of DNA codes for Task 3: Search protein, but 30% of it is Breakthrough of the Year copied into RNA, i.e. Rigorous & heuristic filtering ncRNA >> mRNA Task 4: Motif discovery 1

How to model an RNA “Motif”? Conceptually, start with a profile HMM: from a multiple alignment, estimate nucleotide/ insert/delete Task 2: Motif Description preferences for each position given a new seq, estimate likelihood that it could be generated by the model, & align it to the model mostly G del ins all G How to model an RNA “Motif”? RNA Motif Models Add “column pairs” and pair emission probabilities “Covariance Models” (Eddy & Durbin 1994) for base-paired regions aka profile stochastic context-free grammars aka hidden Markov models on steroids Model position-specific nucleotide preferences and base-pair preferences Pro: accurate Con: model building hard, search sloooow <<<<<<< >>>>>>> paired columns … … 2

What A probabilistic model for RNA families “RNA sequence analysis using The “Covariance Model” ≈ A Stochastic Context-Free Grammar covariance models” A generalization of a profile HMM Algorithms for Training Eddy & Durbin From aligned or unaligned sequences Nucleic Acids Research, 1994 Automates “comparative analysis” Complements Nusinov/Zucker RNA folding vol 22 #11, 2079-2088 Algorithms for searching (see also, Ch 10 of Durbin et al .) Main Results Probabilistic Model Search As with HMMs, given a sequence, you calculate Very accurate search for tRNA likelihood ratio that the model could generate the (Precursor to tRNAscanSE - current favorite) sequence, vs a background model Given sufficient data, model construction You set a score threshold comparable to, but not quite as good as, Anything above threshold → a “hit” human experts Scoring: Some quantitative info on importance of “Forward” / “Inside” algorithm - sum over all paths pseudoknots and other tertiary features Viterbi approximation - find single best path (Bonus: alignment & structure prediction) 3

Example: Alignment Quality searching for tRNAs Profile Hmm Structure Comparison to TRNASCAN Fichant & Burks - best heuristic then evaluation criteria Slightly different 97.5% true positive 0.37 false positives per MB CM A1415 (trained on trusted alignment) > 99.98% true positives <0.2 false positives per MB Current method-of-choice is “tRNAscanSE”, a CM- based scan with heuristic pre-filtering (including TRNASCAN?) for performance reasons. M j : Match states (20 emission probabilities) I j : Insert states (Background emission probabilities) D j : Delete states (silent - no emission) 4

Overall CM CM Structure Architecture A: Sequence + structure One box (“node”) per node of guide tree B: the CM “guide tree” BEG/MATL/INS/DEL just C: probabilities of like an HMM letters/ pairs & of indels MATP & BIF are the key Think of each branch additions: MATP emits pairs being an HMM emitting of symbols, modeling base- both sides of a helix (but pairs; BIF allows multiple 3’ side emitted in helices reverse order) CM Viterbi Alignment CM Viterbi Alignment (the “inside” algorithm) (the “inside” algorithm) y = max � log P ( x ij generated starting in state y via path � ) = i th letter of input S ij x i � z y max z [ S i + 1, j � 1 + log T yz + log E x i , x j ] match pair x ij = substring i ,..., j of input � y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left T yz = P (transition y � z ) � � y = y ] z S ij � max z [ S i , j � 1 + log T yz + log E x j match/insert right y E x i , x j = P (emission of x i , x j from state y ) � z max z [ S i , j + log T yz ] delete � y S ij = max � log P ( x ij gen'd starting in state y via path � ) y left + S k + 1, j y right ] � max i < k � j [ S i , k bifurcation � Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 5

Model Training Mutual Information f xi , xj � M ij = f xi , xj log 2 ; 0 � M ij � 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing MI = expected score gain from using a pair state Finding optimal MI, (i.e. opt pairing of cols) is hard(?) Finding optimal MI without pseudoknots can be done by dynamic programming M.I. Example (Artificial) 3’ * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 5’ A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G G C U U C C U A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 C 0 0 4 4 4 4 4 16 0 only 2 possible mates in 6. M.I. = 1 bit. G 0 16 4 2 4 4 4 0 0 U 0 0 4 8 4 4 4 0 16 6

MI-Based Structure-Learning Find best (max total MI) subset of column pairs among i…j, subject to absence of pseudo-knots � S i , j = max S i , j � 1 � max i � k < j � 4 S i , k � 1 + M k , j + S k + 1, j � 1 � “Just like Nussinov/Zucker folding” BUT, need enough data---enough sequences at right phylogenetic distance Pseudoknots � n � disallowed allowed � � max j M i , j � /2 � � i = 1 7

Rfam – an RNA family DB tRNAScanSE Griffiths-Jones, et al., NAR ‘03, ’05 uses 3 older heuristic tRNA finders as Biggest scientific computing user in Europe - prefilter 1000 cpu cluster for a month per release uses CM built as described for final scoring Rapidly growing: Actually 3(?) different CMs Rel 1.0, 1/03: 25 families, 55k instances eukaryotic nuclear Rel 7.0, 3/05: 503 families, >300k instances prokaryotic organellar used in all genome annotation projects Rfam IRE (partial seed alignment): Input (hand-curated): Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Window Len W Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 8

RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca - PowerPoint PPT Presentation

The Human Parts List, circa 2001 1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca agaggcggcg ggagccggtg 3 billion nucleotides, containing: 121 gcggctcggc

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Long Noncoding RNA The Dark Matter of the Genome Megan McSweeney BMS 265 Long Noncoding RNA

RNA Secondary Structure CSE 417 W.L. Ruzzo The Double Helix Los Alamos Science The Central

RNA-RNA Interaction Prediction with Stochastic Grammars Sebastian Wild Markus Nebel Anika

A partition function algorithm for RNA-RNA interaction Hamidreza Chitsaz Raheleh Salari, Cenk

The Double Helix RNA Secondary Structure CSE 417 W.L. Ruzzo Los Alamos Science The Central

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA

RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif

Graph-theore*c algorithms to improve phylogenomic analyses Tandy Warnow and Pranjal Vachaspa3

GENE TREE CORRECTION GUIDED BY ORTHOLOGY Manuel Lafond 1 , Magali Semeria 2 , Krister M. Swenson

Assessing annotation Assessing annotation consistency in the Gene consistency in the Gene

KnowOS Goals of a Knowledge Operating System Provide persistent object store (interconnected

The Birth of HPC Cuba How supercomputing is being made available to all Cuban researchers using

CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182 Expectation? Some

Comparing cancer models using gene expression of genetic pathways and other gene lists Tauno

Optimally Solving Hard Combinatorial Problems in Computational Biology Falk Hffner Institut

RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca - PowerPoint PPT Presentation

The Human Parts List, circa 2001 1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca agaggcggcg ggagccggtg 3 billion nucleotides, containing: 121 gcggctcggc

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

DNA AND RNA ATI TEAS SCIENCE DNA &amp; RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Long Noncoding RNA The Dark Matter of the Genome Megan McSweeney BMS 265 Long Noncoding RNA

RNA Secondary Structure CSE 417 W.L. Ruzzo The Double Helix Los Alamos Science The Central

RNA-RNA Interaction Prediction with Stochastic Grammars Sebastian Wild Markus Nebel Anika

A partition function algorithm for RNA-RNA interaction Hamidreza Chitsaz Raheleh Salari, Cenk

The Double Helix RNA Secondary Structure CSE 417 W.L. Ruzzo Los Alamos Science The Central

Day 1 &quot; Last lecture: ! RNA Search and ! many biologically interesting roles for RNA &quot;

Day 1 &quot; RNA Search and ! Motif Discovery &quot; Many biologically interesting roles for RNA

RNA Search and Whirlwind tour of ncRNA search &amp; discovery Motif Discovery RNA motif

Graph-theore*c algorithms to improve phylogenomic analyses Tandy Warnow and Pranjal Vachaspa3

GENE TREE CORRECTION GUIDED BY ORTHOLOGY Manuel Lafond 1 , Magali Semeria 2 , Krister M. Swenson

Assessing annotation Assessing annotation consistency in the Gene consistency in the Gene

KnowOS Goals of a Knowledge Operating System Provide persistent object store (interconnected

The Birth of HPC Cuba How supercomputing is being made available to all Cuban researchers using

CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182 Expectation? Some

Comparing cancer models using gene expression of genetic pathways and other gene lists Tauno

Optimally Solving Hard Combinatorial Problems in Computational Biology Falk Hffner Institut

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA

RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif