RNA Search and Whirlwind tour of ncRNA search & discovery Motif - PowerPoint PPT Presentation

Outline � RNA Search and � Whirlwind tour of ncRNA search & discovery � Motif Discovery � RNA motif description (Covariance Model Review) � Algorithms for searching � Rigorous & heuristic filtering � Motif discovery � Lecture 9 � Applications � CSEP 590A � Autumn 2008 � 2 RNA Motif Models � “Covariance Models” (Eddy & Durbin 1994) � aka profile stochastic context-free grammars � Motif Description � aka hidden Markov models on steroids � Model position-specific nucleotide preferences and base-pair preferences � Pro: accurate � Con: model building hard, search sloooow � 7 8

Profile Hmm Structure � Example: searching for tRNAs M j : � Match states (20 emission probabilities) � I j : � Insert states (Background emission probabilities) � D j : � Delete states (silent - no emission) � 13 17 CM Structure � CM Viterbi Alignment � A: Sequence + structure � = i th letter of input x i B: the CM “guide tree” � x ij = substring i ,..., j of input C: probabilities of T yz = P (transition y � z ) letters/ pairs & of indels � y E x i , x j = P (emission of x i , x j from state y ) Think of each branch being an HMM emitting y S ij = max � log P ( x ij gen'd starting in state y via path � ) both sides of a helix (but 3’ side emitted in reverse order) � 18 20

mRNA leader y = max � log P ( x ij generated starting in state y via path � ) S ij � z y max z [ S i + 1, j � 1 + log T yz + log E x i , x j ] match pair � y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left � � y = y ] z S ij � max z [ S i , j � 1 + log T yz + log E x j match/insert right mRNA leader switch? � z max z [ S i , j + log T yz ] delete � y left + S k + 1, j y right ] � max i < k � j [ S i , k bifurcation � Time O(qn 3 ), q states, seq len n 21 23 Mutual Information � f xi , xj � M ij = f xi , xj log 2 ; 0 � M ij � 2 f xi f xj xi , xj Max when no seq conservation but perfect pairing � MI = expected score gain from using a pair state � Finding optimal MI, (i.e. opt pairing of cols) is hard(?) � Finding optimal MI without pseudoknots can be done by dynamic programming � 24 26

Pseudoknots � � n disallowed allowed � � � /2 max j M i , j � � i = 1 29 31 Rfam – an RNA family DB � Rfam � Griffiths-Jones, et al., NAR ‘03,’05 � IRE (partial seed alignment): � Biggest scientific computing user in Europe - Input (hand-curated): � Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC 1000 cpu cluster for a month per release � MSA “seed alignment” � Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons � Rapidly growing: � Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T � Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Rel 1.0, 1/03: 25 families, 55k instances � Window Len W � Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: � Rel 7.0, 3/05: 503 families, >300k instances � Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM � Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG scan results & “full Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” � Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 33 34

Faster Genome Annotation � Faster Search � of Non-coding RNAs � Without Loss of Accuracy � Zasha Weinberg � & W.L. Ruzzo � Recomb ‘04, ISMB ‘04, Bioinfo ‘06 � 37 38 RaveNnA: Genome Scale � CM’s are good, but slow � RNA Search � Rfam Reality Our Work Rfam Goal Typically 100x speedup over raw CM, w/ no loss in accuracy: � EMBL EMBL EMBL � � drop structure from CM to create a (faster) HMM � � use that to pre-filter sequence; � BLAST Ravenna � discard parts where, provably, CM score < threshold; � � actually run CM on the rest (the promising parts) � � assignment of HMM transition/emission scores is key � CM CM CM �� (large convex optimization problem) � junk junk hits hits hits 1 month, ~2 months, 10 years, Weinberg & Ruzzo, Bioinformatics , 2004, 2006 1000 computers 39 1000 computers 1000 computers 41

CM to HMM � Key Issue: 25 scores � 10 � CM HMM CM HMM A A A A A A A A C C C C C C C C P L R G G G G G G G G U U U U U U U U – – – – – – – – A A A A C C C C G G G G Need: log Viterbi scores CM � HMM � U U U U – – – – 25 emisions per state 5 emissions per state, 2x states 43 44 Viterbi/Forward Scoring � Key Issue: 25 scores � 10 � CM HMM Path � defines transitions/emissions � A A A A C C C C Score( � ) = product of “probabilities” on � � P L R G G G G U U U U NB: ok if “probs” aren’t, e.g. !" 1 � – – – – NB: HMM not a prob. model (e.g. in CM, emissions are odds ratios vs � 0th-order background) � Need: log Viterbi scores CM � HMM � For any nucleotide sequence x: � P AA � L A + R A P CA � L C + R A … Viterbi-score(x) = max{ score( � ) | � emits x} � P AC � L A + R C P CC � L C + R C … P CG � L C + R G … Forward-score(x) = ! { score( � ) | � emits x} � P AG � L A + R G P AU � L A + R U P CU � L C + R U … P A– � L A + R – P C– � L C + R – … 45 46

P AA � L A + R A P AC � L A + R C Rigorous Filtering � Some scores filter better � P AG � L A + R G P AU � L A + R U P A– � L A + R – … P UA = 1 � L U + R A � Any scores satisfying the linear inequalities P UG = 4 � L U + R G � give rigorous filtering � � � � � � � Assuming ACGU � 25% � Proof: � Option 1: � � � � Opt 1: � CM Viterbi path score � L U = R A = R G = 2 � � L U + (R A + R G )/2 = 4 � � “corresponding” HMM path score � � Viterbi HMM path score � Option 2: � � � � Opt 2: � (even if it does not correspond to any CM path) � L U = 0, R A = 1, R G = 4 � L U + (R A + R G )/2 = 2.5 � 47 48 Optimizing filtering � Calculating E(L i , R i ) � For any nucleotide sequence x: � E(L i , R i ) = ! x Forward-score(x)*Pr(x) � Viterbi-score(x) = max{ score( � ) | � emits x } � Forward-score(x) = ! { score( � ) | � emits x } � Forward-like: for every state, calculate Expected Forward Score � expected score for all paths ending there; E(L i , R i ) = ! all sequences x Forward-score(x)*Pr(x) � NB: E is a function of L i , R i only � easily calculated from expected scores of Under 0th-order background model Optimization: � predecessors & transition/emission Minimize E(L i , R i ) subject to score Lin.Ineq.s � probabilities/scores � This is heuristic (“forward � � Viterbi � � filter � ”) � But still rigorous because “subject to score Lin.Ineq.s” � 49 50

Minimizing E(L i , R i ) � “Convex” Optimization � Calculate E(L i , R i ) Convex: � Nonconvex: � Forward: local max = global max; � can be many local maxima, symbolically , in terms of << global max; � simple “hill climbing” works � emission scores, so we “hill-climbing” fails � can do partial derivatives Viterbi: for numerical convex optimization algorithm � � E ( L 1 , L 2 ,...) � L i 51 52 Results: New ncRNA’s? � Estimated Filtering Efficiency � # found � # found # new � (139 Rfam 4.0 families) � rigorous filter Name � BLAST � + CM � + CM � Filtering # families # families fraction � (compact) � (expanded) � Pyrococcus snoRNA � 57 � 180 � 123 � Iron response element � 201 � 322 � 121 � < 10 -4 � 105 � 110 � ~100x Histone 3’ element � 1004 � 1106 � 102 � speedup 10 -4 - 10 -2 � 8 � 17 � Purine riboswitch � 69 � 123 � 54 � Retron msr � 11 � 59 � 48 � .01 - .10 � 11 � 3 � Hammerhead I � 167 � 193 � 26 � .10 - .25 � 2 � 2 � Hammerhead III � 251 � 264 � 13 � U4 snRNA � 283 � 290 � 7 � .25 - .99 � 6 � 4 � S-box � 128 � 131 � 3 � U6 snRNA � 1462 � 1464 � 2 � .99 - 1.0 � 7 � 3 � U5 snRNA � 199 � 200 � 1 � U7 snRNA � 312 � 313 � 1 � 53 54

RNA Motif Discovery � Typical problem: given a ~10-20 unaligned sequences of ~1kb, most of which contain Motif Discovery � instances of one RNA motif of, say, 150bp -- find it. � Example: 5’ UTRs of orthologous glycine cleavage genes from � -proteobacteria � 60 61 Searching for noncoding RNAs � Cmfinder--A Covariance � Model Based RNA Motif � CM’s are great, but where do they come from? � Finding Algorithm An approach: comparative genomics � Bioinformatics , 2006, 22(4): 445-452 Search for motifs with common secondary structure in a set of functionally related sequences. � Challenges � Zizhen Yao � Three related tasks � Zasha Weinberg � Locate the motif regions. � Walter L. Ruzzo � Align the motif instances. � University of Washington, Seattle � Predict the consensus secondary structure. � Motif search space is huge! � Motif location space, alignment space, structure space. � 62 63

RNA Search and Whirlwind tour of ncRNA search & discovery Motif - PowerPoint PPT Presentation

Outline RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif description (Covariance Model Review) Algorithms for searching Rigorous & heuristic filtering Motif discovery Lecture

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Long Noncoding RNA The Dark Matter of the Genome Megan McSweeney BMS 265 Long Noncoding RNA

RNA Secondary Structure CSE 417 W.L. Ruzzo The Double Helix Los Alamos Science The Central

RNA-RNA Interaction Prediction with Stochastic Grammars Sebastian Wild Markus Nebel Anika

A partition function algorithm for RNA-RNA interaction Hamidreza Chitsaz Raheleh Salari, Cenk

The Double Helix RNA Secondary Structure CSE 417 W.L. Ruzzo Los Alamos Science The Central

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA

The Message " CSE 527 ! noncoding RNA " Cells make lots of RNA " Computational

Modern Information Retrieval Boolean information retrieval and document preprocessing 1 Hamid

Paul: Surprises along David B. Capes Houston Graduate the Way School of Theology Rediscovering

R e c e n t P r o g r e s s i n S i l i c o n M o d u l e s a n d

Art and Design Making Plants in Paper Success Criteria Aim I can make a paper plant. I

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale

0. Model problem: diffusion in a 1. Dynamics on the moduli space periodic billiard 2.

Designing Pop-Up Cards Carleton Algorithms Seminar Zachary Abel, Erik D. Demaine, Martin L.

Entire functions arising from trees Weiwei Cui Mathematisches Seminar, CAU Kiel Topics in

Sambuz

Useful Links

Newsletter

Mail Us

RNA Search and Whirlwind tour of ncRNA search & discovery Motif - PowerPoint PPT Presentation

Outline RNA Search and Whirlwind tour of ncRNA search & discovery Motif Discovery RNA motif description (Covariance Model Review) Algorithms for searching Rigorous & heuristic filtering Motif discovery Lecture

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

DNA AND RNA ATI TEAS SCIENCE DNA &amp; RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

Long Noncoding RNA The Dark Matter of the Genome Megan McSweeney BMS 265 Long Noncoding RNA

RNA Secondary Structure CSE 417 W.L. Ruzzo The Double Helix Los Alamos Science The Central

RNA-RNA Interaction Prediction with Stochastic Grammars Sebastian Wild Markus Nebel Anika

A partition function algorithm for RNA-RNA interaction Hamidreza Chitsaz Raheleh Salari, Cenk

The Double Helix RNA Secondary Structure CSE 417 W.L. Ruzzo Los Alamos Science The Central

Day 1 &quot; Last lecture: ! RNA Search and ! many biologically interesting roles for RNA &quot;

Day 1 &quot; RNA Search and ! Motif Discovery &quot; Many biologically interesting roles for RNA

The Message &quot; CSE 527 ! noncoding RNA &quot; Cells make lots of RNA &quot; Computational

Modern Information Retrieval Boolean information retrieval and document preprocessing 1 Hamid

Paul: Surprises along David B. Capes Houston Graduate the Way School of Theology Rediscovering

R e c e n t P r o g r e s s i n S i l i c o n M o d u l e s a n d

Art and Design Making Plants in Paper Success Criteria Aim I can make a paper plant. I

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale

0. Model problem: diffusion in a 1. Dynamics on the moduli space periodic billiard 2.

Designing Pop-Up Cards Carleton Algorithms Seminar Zachary Abel, Erik D. Demaine, Martin L.

Entire functions arising from trees Weiwei Cui Mathematisches Seminar, CAU Kiel Topics in

Sambuz

Useful Links

Newsletter

Mail Us

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Day 1 " Last lecture: ! RNA Search and ! many biologically interesting roles for RNA "

Day 1 " RNA Search and ! Motif Discovery " Many biologically interesting roles for RNA

The Message " CSE 527 ! noncoding RNA " Cells make lots of RNA " Computational