 
              The Human Parts List, circa 2001 1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca RNA Search and 61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca agaggcggcg ggagccggtg 3 billion nucleotides, containing: 121 gcggctcggc atcatgcgtc gagggcgtct gctggagatc gccctgggat ttaccgtgct 181 tttagcgtcc tacacgagcc atggggcgga cgccaatttg gaggctggga acgtgaagga Motif Discovery 241 aaccagagcc agtcgggcca agagaagagg cggtggagga cacgacgcgc ttaaaggacc •25,000 protein-coding genes 301 caatgtctgt ggatcacgtt ataatgctta ctgttgccct ggatggaaaa ccttacctgg 361 cggaaatcag tgtattgtcc ccatttgccg gcattcctgt ggggatggat tttgttcgag (only ~1% of the DNA) 421 gccaaatatg tgcacttgcc catctggtca gatagctcct tcctgtggct ccagatccat 481 acaacactgc aatattcgct gtatgaatgg aggtagctgc agtgacgatc actgtctatg •Messenger RNAs made from each 541 ccagaaagga tacataggga ctcactgtgg acaacctgtt tgtgaaagtg gctgtctcaa Lectures 18-19 601 tggaggaagg tgtgtggccc caaatcgatg tgcatgcact tacggattta ctggacccca 661 gtgtgaaaga gattacagga caggcccatg ttttactgtg atcagcaacc agatgtgcca •Plus a double-handful of other RNA genes 721 gggacaactc agcgggattg tctgcacaaa acagctctgc tgtgccacag tcggccgagc CSE 527 781 ctggggccac ccctgtgaga tgtgtcctgc ccagcctcac ccctgccgcc gtggcttcat Autumn 2007 841 tccaaatatc cgcacgggag cttgtcaaga tgtggatgaa tgccaggcca tccccgggct 901 ctgtcaggga ggaaattgca ttaatactgt tgggtctttt gagtgcaaat gccctgctgg 961 acacaaactt aatgaagtgt cacaaaaatg tgaagatatt gatgaatgca gcaccattcc 1021 ... Noncoding Outline RNAs Dramatic discoveries in Task 1: RNA 2 ary Structure Prediction (last time) last 5 years Task 2: RNA Motif Models 100s of new families Many roles: Regulation, Covariance Models transport, stability, catalysis, … Training & “Mutual Information” 1% of DNA codes for Task 3: Search protein, but 30% of it is Breakthrough of the Year copied into RNA, i.e. Rigorous & heuristic filtering ncRNA >> mRNA Task 4: Motif discovery 1
How to model an RNA “Motif”? Conceptually, start with a profile HMM: from a multiple alignment, estimate nucleotide/ insert/delete Task 2: Motif Description preferences for each position given a new seq, estimate likelihood that it could be generated by the model, & align it to the model mostly G del ins all G How to model an RNA “Motif”? RNA Motif Models Add “column pairs” and pair emission probabilities “Covariance Models” (Eddy & Durbin 1994) for base-paired regions aka profile stochastic context-free grammars aka hidden Markov models on steroids Model position-specific nucleotide preferences and base-pair preferences Pro: accurate Con: model building hard, search sloooow <<<<<<< >>>>>>> paired columns … … 2
What A probabilistic model for RNA families “RNA sequence analysis using The “Covariance Model” ≈ A Stochastic Context-Free Grammar covariance models” A generalization of a profile HMM Algorithms for Training Eddy & Durbin From aligned or unaligned sequences Nucleic Acids Research, 1994 Automates “comparative analysis” Complements Nusinov/Zucker RNA folding vol 22 #11, 2079-2088 Algorithms for searching (see also, Ch 10 of Durbin et al .) Main Results Probabilistic Model Search As with HMMs, given a sequence, you calculate Very accurate search for tRNA likelihood ratio that the model could generate the (Precursor to tRNAscanSE - current favorite) sequence, vs a background model Given sufficient data, model construction You set a score threshold comparable to, but not quite as good as, Anything above threshold → a “hit” human experts Scoring: Some quantitative info on importance of “Forward” / “Inside” algorithm - sum over all paths pseudoknots and other tertiary features Viterbi approximation - find single best path (Bonus: alignment & structure prediction) 3
Example: Alignment Quality searching for tRNAs Profile Hmm Structure Comparison to TRNASCAN Fichant & Burks - best heuristic then evaluation criteria Slightly different 97.5% true positive 0.37 false positives per MB CM A1415 (trained on trusted alignment) > 99.98% true positives <0.2 false positives per MB Current method-of-choice is “tRNAscanSE”, a CM- based scan with heuristic pre-filtering (including TRNASCAN?) for performance reasons. M j : Match states (20 emission probabilities) I j : Insert states (Background emission probabilities) D j : Delete states (silent - no emission) 4
Overall CM CM Structure Architecture A: Sequence + structure One box (“node”) per node of guide tree B: the CM “guide tree” BEG/MATL/INS/DEL just C: probabilities of like an HMM letters/ pairs & of indels MATP & BIF are the key Think of each branch additions: MATP emits pairs being an HMM emitting of symbols, modeling base- both sides of a helix (but pairs; BIF allows multiple 3’ side emitted in helices reverse order) CM Viterbi Alignment CM Viterbi Alignment (the “inside” algorithm) (the “inside” algorithm) y = max � log P ( x ij generated starting in state y via path � ) = i th letter of input S ij x i � z y max z [ S i + 1, j � 1 + log T yz + log E x i , x j ] match pair x ij = substring i ,..., j of input � y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left T yz = P (transition y � z ) � � y = y ] z S ij � max z [ S i , j � 1 + log T yz + log E x j match/insert right y E x i , x j = P (emission of x i , x j from state y ) � z max z [ S i , j + log T yz ] delete � y S ij = max � log P ( x ij gen'd starting in state y via path � ) y left + S k + 1, j y right ] � max i < k � j [ S i , k bifurcation � Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 5
Model Training Mutual Information f xi , xj � M ij = f xi , xj log 2 ; 0 � M ij � 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing MI = expected score gain from using a pair state Finding optimal MI, (i.e. opt pairing of cols) is hard(?) Finding optimal MI without pseudoknots can be done by dynamic programming M.I. Example (Artificial) 3’ * 1 2 3 4 5 6 7 8 9 * MI: 1 2 3 4 5 6 7 8 9 A G A U A A U C U 9 0 0 0 0 0 0 0 0 A G A U C A U C U 8 0 0 0 0 0 0 0 5’ A G A C G U U C U 7 0 0 2 0.30 0 1 A G A U U U U C U 6 0 0 1 0.55 1 A G C C A G G C U 5 0 0 0 0.42 A G C G C G G C U 4 0 0 0.30 A G C U G C G C U 3 0 0 A G C A U C G C U 2 0 A G G U A G C C U 1 A G G G C G C C U A G G U G U C C U Cols 1 & 9, 2 & 8: perfect conservation & might be A G G C U U C C U A G U A A A A C U base-paired, but unclear whether they are. M.I. = 0 A G U C C A A C U A G U U G C A C U Cols 3 & 7: No conservation, but always W-C pairs, A G U U U C A C U so seems likely they do base-pair. M.I. = 2 bits. Cols 7->6: unconserved, but each letter in 7 has A 16 0 4 2 4 4 4 0 0 C 0 0 4 4 4 4 4 16 0 only 2 possible mates in 6. M.I. = 1 bit. G 0 16 4 2 4 4 4 0 0 U 0 0 4 8 4 4 4 0 16 6
MI-Based Structure-Learning Find best (max total MI) subset of column pairs among i…j, subject to absence of pseudo-knots � S i , j = max S i , j � 1 � max i � k < j � 4 S i , k � 1 + M k , j + S k + 1, j � 1 � “Just like Nussinov/Zucker folding” BUT, need enough data---enough sequences at right phylogenetic distance Pseudoknots � n � disallowed allowed � � max j M i , j � /2 � � i = 1 7
Rfam – an RNA family DB tRNAScanSE Griffiths-Jones, et al., NAR ‘03, ’05 uses 3 older heuristic tRNA finders as Biggest scientific computing user in Europe - prefilter 1000 cpu cluster for a month per release uses CM built as described for final scoring Rapidly growing: Actually 3(?) different CMs Rel 1.0, 1/03: 25 families, 55k instances eukaryotic nuclear Rel 7.0, 3/05: 503 families, >300k instances prokaryotic organellar used in all genome annotation projects Rfam IRE (partial seed alignment): Input (hand-curated): Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Window Len W Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 8
Recommend
More recommend