De novo prediction of structural noncoding RNAs
Stefan Washietl
18.417 - Fall 2011
Outline
◮ Motivation: Biological importance of (noncoding) RNAs
◮ Algorithms to predict structural noncoding RNAs
  ◮ RNAz: thermodynamic folding + phylogenetic information
  ◮ EvoFold: phylogenetic stochastic context-free grammars
Serganov A, Patel DJ, Nat Rev Genet. 2007 8:(10)776-90
[Figure: gene structure schematic with regions labeled Intergenic, 5'-UTR, CDS exon, Intron, 3'-UTR, Intron, Intergenic]
◮ No common strong statistical features in primary sequence
◮ ncRNAs are highly diverse (short, long, spliced, unspliced,
RNAfold < trna.fa
>AF041468
GGGGGUAUAGCUCAGUUGGUAGAGCGCUGCCUUUGCACGGCAGAUGUCAGGGGUUCGAGUCCCCUUACCUCCA
(((((((..((((........)))).(((((.......))))).....(((((.......)))))))))))). (-31.10)
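The dot-bracket line in the RNAfold output above encodes base pairs as matching parentheses and unpaired bases as dots. The following is a minimal sketch of how that notation can be decoded into a list of pairs; the function name `parse_dot_bracket` is an illustrative assumption and is not part of RNAfold.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of 0-based (i, j) base pairs from dot-bracket notation."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Sequence and structure copied from the RNAfold output above.
seq = "GGGGGUAUAGCUCAGUUGGUAGAGCGCUGCCUUUGCACGGCAGAUGUCAGGGGUUCGAGUCCCCUUACCUCCA"
struct = "(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))."

pairs = parse_dot_bracket(struct)
for i, j in pairs[:3]:
    print(i, j, seq[i] + "-" + seq[j])
```

The stack mirrors the nesting of helices: each closing bracket pairs with the most recent unmatched opening bracket, which is why dot-bracket notation can only describe structures without pseudoknots.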
[Figure: histogram of z-scores; axes: z-score vs. Frequency; annotation: 2%]
Washietl & Hofacker, J. Mol. Biol. (2004) 342:19
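A z-score of this kind is commonly obtained by comparing the folding energy of the native sequence with the energies of shuffled versions of it. The sketch below assumes that definition, z = (E_native - mean) / standard deviation. The energy function `toy_energy` is a deliberately crude, hypothetical stand-in (it only counts complementary bases when the sequence is folded back on itself) and is not a thermodynamic model; a real analysis would plug in minimum free energies from a folding program.

```python
import random
import statistics

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def mononucleotide_shuffle(seq, rng):
    """Return a random permutation of seq (base composition is preserved)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def toy_energy(seq):
    """Hypothetical stand-in for an MFE: minus the number of complementary
    pairs formed when the sequence is folded back on itself as one hairpin."""
    half = len(seq) // 2
    return -sum(1 for i in range(half) if seq[i] + seq[-1 - i] in PAIRS)


def z_score(seq, energy_fn, n_shuffles=200, seed=0):
    """Standard deviations by which the native energy differs from the
    mean energy of shuffled sequences (negative = more stable than random)."""
    rng = random.Random(seed)
    background = [energy_fn(mononucleotide_shuffle(seq, rng))
                  for _ in range(n_shuffles)]
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        raise ValueError("shuffled energies have zero variance")
    return (energy_fn(seq) - mu) / sigma


hairpin = "GGGGGGAAAACCCCCC"
z = z_score(hairpin, toy_energy)
print(f"z-score of toy hairpin: {z:.2f}")
```

Passing the energy function as a parameter keeps the shuffling and normalization logic separate from the folding model, so the same routine works with any scoring function that maps a sequence to a number.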
Hofacker, Fekete & Stadler, J. Mol. Biol. (2002) 319:1059
[Figure: sampled z-scores vs. calculated z-scores]
Washietl, Hofacker & Stadler, Proc. Natl. Acad. Sci. USA (2005) 102:2454
A context-free grammar is a tuple (V, T, P, S), where
◮ V is a finite set of nonterminal symbols (“states”),
◮ T is a finite set of terminal symbols,
◮ P is a finite set of production rules and
◮ S is the initial (start) nonterminal (S ∈ V).
◮ Efficiently describes the set of all palindromes over the
◮ Example production:
◮ S → aSu | uSa | gSc | cSg | uSg | gSu
◮ S → aS | uS | gS | cS
◮ S → Sa | Su | Sg | Sc
◮ S → SS
◮ S → ε
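To see how these productions generate folded RNAs, the sketch below expands S at random and records both the emitted sequence and the dot-bracket structure implied by the derivation. The uniform choice among the five rule families and the depth cap `max_depth` are assumptions made for illustration; a stochastic grammar would attach trained probabilities to each production.

```python
import random

PAIRS = ["au", "ua", "gc", "cg", "ug", "gu"]   # terminals of S -> xSy
BASES = "augc"                                  # terminals of S -> xS and S -> Sx


def sample(rng, depth=0, max_depth=12):
    """Randomly derive a string from S; returns (sequence, structure)."""
    if depth >= max_depth:
        return "", ""                           # force S -> epsilon to terminate
    rule = rng.choice(["pair", "left", "right", "split", "end"])
    if rule == "pair":                          # S -> x S y, x and y base-paired
        x, y = rng.choice(PAIRS)
        seq, struct = sample(rng, depth + 1, max_depth)
        return x + seq + y, "(" + struct + ")"
    if rule == "left":                          # S -> x S, x unpaired
        seq, struct = sample(rng, depth + 1, max_depth)
        return rng.choice(BASES) + seq, "." + struct
    if rule == "right":                         # S -> S x, x unpaired
        seq, struct = sample(rng, depth + 1, max_depth)
        return seq + rng.choice(BASES), struct + "."
    if rule == "split":                         # S -> S S, bifurcation
        seq1, struct1 = sample(rng, depth + 1, max_depth)
        seq2, struct2 = sample(rng, depth + 1, max_depth)
        return seq1 + seq2, struct1 + struct2
    return "", ""                               # S -> epsilon


rng = random.Random(42)
for _ in range(3):
    seq, struct = sample(rng)
    print(seq or "(empty)")
    print(struct)
```

Every derivation yields a properly nested structure, because a pair is only ever emitted around a complete sub-derivation of S.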
◮ CYK ↔ Minimum free energy (Nussinov/Zuker)
◮ Inside/outside algorithm ↔ Partition functions (McCaskill)

◮ CYK ↔ Viterbi’s algorithm
◮ Inside/outside algorithm ↔ Forward/backward algorithm
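The correspondence between CYK and Nussinov-style folding can be illustrated with a small dynamic program that fills a table over all subsequences (i, j), just as CYK does over spans. This is a simplified sketch: it maximizes the number of base pairs rather than a probability or a free energy, and the minimum hairpin loop size `min_loop` is an assumed parameter.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max number of base pairs, one optimal dot-bracket structure)."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = max pairs in seq[i..j]; short spans stay at 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                    # case: j is unpaired
            for k in range(i, j - min_loop):          # case: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one structure that achieves best[0][n-1].
    structure = ["."] * n
    todo = [(0, n - 1)]
    while todo:
        i, j = todo.pop()
        if j - i <= min_loop:                         # too short to hold a pair
            continue
        if best[i][j] == best[i][j - 1]:
            todo.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        todo.append((i, k - 1))
                    todo.append((k + 1, j - 1))
                    break
    return best[0][n - 1], "".join(structure)


print(nussinov("GGGAAACCC"))
```

Replacing the pair count with log-probabilities of grammar productions turns the same recursion into CYK for an SCFG, and replacing it with loop energies leads toward Zuker-style minimum free energy folding.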
[Figure: an RNA structure, its SCFG parse tree, and a phylogenetic tree over aligned sequences with consensus structure ((((....))))]
Single sequence:
◮ Terminal symbols are bases or base-pairs
◮ Emission probabilities are base frequencies in loops and paired regions
Phylo-SCFG:
◮ Terminal symbols are single or paired alignment columns
◮ Emission probabilities calculated from phylogenetic model and tree using Felsenstein's algorithm
◮ 4x4 matrix for single columns
◮ 16x16 matrix for paired columns
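For a single alignment column, the emission probability in a phylo-SCFG is the likelihood of the observed bases given the tree, which Felsenstein's pruning algorithm computes bottom-up. The sketch below assumes a Jukes-Cantor substitution model (the 4x4 case) with uniform base frequencies, and the tree shape and branch lengths are made up for illustration; paired columns would use a 16x16 matrix over base pairs in the same way.

```python
import math

BASES = "ACGU"


def jc69(t):
    """4x4 Jukes-Cantor transition matrix P[x][y] for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if x == y else diff for y in range(4)] for x in range(4)]


def partial_likelihoods(node, column):
    """L[x] = P(observed leaves below node | node carries base x).

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) tuples.
    """
    if isinstance(node, str):
        base = column[node]
        if base == "-":                      # gap treated as missing data
            return [1.0] * 4
        return [1.0 if b == base else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        P = jc69(t)
        child_L = partial_likelihoods(child, column)
        for x in range(4):
            result[x] *= sum(P[x][y] * child_L[y] for y in range(4))
    return result


def column_likelihood(tree, column):
    """Likelihood of one alignment column, with uniform root frequencies."""
    return sum(0.25 * l for l in partial_likelihoods(tree, column))


# Hypothetical three-species tree; branch lengths are illustrative only.
tree = [([("human", 0.1), ("mouse", 0.1)], 0.05), ("chicken", 0.3)]

conserved = {"human": "G", "mouse": "G", "chicken": "G"}
variable = {"human": "G", "mouse": "A", "chicken": "C"}
print(column_likelihood(tree, conserved))
print(column_likelihood(tree, variable))
```

Each internal node combines the partial likelihoods of its children once, so the cost grows linearly with the number of species rather than exponentially with the number of possible ancestral assignments.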
◮ Uses simple RNA grammar
◮ Two competing models:
  ◮ Non-structural model with all columns treated as evolving independently
  ◮ Structural model with dependent and independent columns
◮ Sophisticated parametrization
[Figure: genome browser view, 92.0M to 98.0M. Tracks: Most conserved noncoding regions (present in at least human/mouse/rat/dog); RNAz structural RNAs (P>0.5); RNAz structural RNAs (P>0.9); RefSeq Genes]
[Figure: genome browser view, 90801000 to 90801500. Tracks: RNAz structural RNAs (P>0.9); miRNAs (mir-17, mir-19a, mir-19b-1, mir-18, mir-20, mir-92-1)]

          (((((..((((((..((((((((.((.(((((...(((........)))...))))).)).))))))))...))))))....)))))
Human     GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATGT-GCATCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-GTGAC
Mouse     GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATGTGT-GCATCTACTGCAGTGAGGGCACTTGTAGCATTA-TG-CTGAC
Rat       GTCAGGATAATGTCAAAGTGCTTACAGTGCAGGTAGTGGTGTGT-GCATCTACTGCAGTGAAGGCACTTGTGGCATTG-TG-CTGAC
Chicken   GTCAGAGTAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATATAGAACCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-TTGAC
Zebrafish GTCAATGTATTGTCAAAGTGCTTACAGTGCAGGTAGTATTATGGAATATCTACTGCAGTGGAGGCACTTCTAGCAATA-CACTTGAC
Fugu      GTCTGTGTATTGCCAAAGTGCTTACAGTGCAGGTAGTTCTATGTGACACCTACTGCAATGGAGGCACTTACAGCAGTACTC-TTGAC

[Figure: secondary structure drawing for the alignment above]
[Figure: genome browser view, 93104k to 93108k. Tracks: RNAz structural RNAs (P>0.5); RNAz structural RNAs (P>0.9); H/ACA snoRNAs; C/D-box snoRNAs. Labeled snoRNAs: ACA25, ACA32, ACA1, ACA8, ACA18, ACA40, mgh28S-2412, mgh28S-2410]
Washietl, Hofacker & Stadler, Nat. Biotech. (2005) 23:1383
Sandmann & Cohen, PLoS One (2007) 2:e1265
[Figure: genome browser view, chr14:53427000-53427500. Tracks: RNAz; EvoFold; RACE primer; RACEfrags (Testis); TARs/Transfrags; Constrained elements; Vertebrate Multiz Alignment & Conservation; Gencode Reference Genes. Inset: predicted secondary structure drawing]
Washietl, Pedersen, Korbel et al., Genome Res. (2007) 17:852
[Figure: genome browser view, chr5:56176500-56178000. Tracks: RNAz; EvoFold; RACE primer; RACEfrags (Testis); TARs/Transfrags; Constrained elements; Human ESTs Including Unspliced (DR006352, BM148300, AI476562, BE782001, AW505258); Vertebrate Multiz Alignment & Conservation; Gencode Reference Genes (MAP3K1). Inset: predicted secondary structure drawing]
Washietl, Pedersen, Korbel et al., Genome Res. (2007) 17:852
Mourier et al. Genome Res. 2008
Parker et al. Genome Res. 2011