De novo prediction of structural noncoding RNAs, Stefan Washietl



slide-1
SLIDE 1

De novo prediction of structural noncoding RNAs

Stefan Washietl

18.417 - Fall 2011

1/ 38

slide-2
SLIDE 2

Outline

◮ Motivation: Biological importance of (noncoding) RNAs
◮ Algorithms to predict structural noncoding RNAs
  ◮ RNAz: thermodynamic folding + phylogenetic information
  ◮ EvoFold: phylogenetic stochastic context-free grammars
◮ A few applications of RNAz and EvoFold

2/ 38

slide-3
SLIDE 3

Essential biochemical functions of life

◮ Information storage and replication
◮ Enzymatic activity: catalyze biochemical reactions
◮ Regulator: sense and react to environment

3/ 38

slide-4
SLIDE 4

Enzymatic activity: Ribozymes

◮ Self-splicing introns and RNase P were the first examples of RNAs with catalytic activity. They were first discovered by Sidney Altman and Thomas Cech.

4/ 38

slide-5
SLIDE 5

Self duplication

◮ Ribozyme acting as RNA-dependent RNA polymerase
◮ A chimeric construct of a natural ligase ribozyme with an in vitro selected template binding domain can replicate at least one turn of an RNA helix.

5/ 38

slide-6
SLIDE 6

Regulation: Riboswitches

◮ Environmental stimuli directly change the conformation of an RNA (without a protein intermediary), which affects gene activity.

Serganov A, Patel DJ, Nat Rev Genet. 2007 8(10):776-90

6/ 38

slide-7
SLIDE 7

Putting things together: RNA world hypothesis

◮ RNA or RNA-like molecules could have formed a pre-protein world.

7/ 38

slide-8
SLIDE 8

Overview of RNA functions

8/ 38

slide-9
SLIDE 9

Examples of structured RNAs and their genomic context

[Figure: examples of structured RNAs placed along a gene model (intergenic, 5'-UTR, CDS exon, intron, 3'-UTR, intron, intergenic): IRES, miRNA, snoRNA, IRE, snRNA, tRNA, SECIS]

9/ 38

slide-10
SLIDE 10

Prediction of noncoding RNAs

◮ Compared to the prediction of protein-coding RNAs, this is an extremely difficult problem:
  ◮ No common strong statistical features in primary sequence such as start/stop codons, codon bias, open reading frame
  ◮ ncRNAs are highly diverse (short, long, spliced, unspliced, processed, intron encoded, intergenic, antisense, ...)
◮ Good progress in prediction for a subset of ncRNAs: structured ncRNAs

10/ 38

slide-11
SLIDE 11

Prediction of RNA secondary structure

◮ The standard energy model expresses the free energy of a secondary structure S as the sum of the energies of its components (loops) L:

  E(S) = \sum_{L \in S} E(L)

◮ The minimum free energy structure can be calculated by dynamic programming, e.g. by using RNAfold:

  RNAfold < trna.fa
  >AF041468
  GGGGGUAUAGCUCAGUUGGUAGAGCGCUGCCUUUGCACGGCAGAUGUCAGGGGUUCGAGUCCCCUUACCUCCA
  (((((((..((((........)))).(((((.......))))).....(((((.......)))))))))))). (-31.10)

11/ 38
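The second output line of RNAfold is a dot-bracket string: matching parentheses mark base pairs and dots mark unpaired positions. As a small illustration of how to read that notation, the sketch below turns a dot-bracket string into a list of paired positions. The function name `dot_bracket_pairs` is a hypothetical helper for this example and is not part of the ViennaRNA package.

```python
def dot_bracket_pairs(structure):
    """Return sorted 0-based (i, j) base pairs encoded by a dot-bracket string.

    Hypothetical helper for illustration; it only understands '(', ')' and '.'.
    """
    open_positions = []  # stack of positions of unmatched '('
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % pos)
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if open_positions:
        raise ValueError("unbalanced '(' at position %d" % open_positions[-1])
    return sorted(pairs)


if __name__ == "__main__":
    # A 9-nt hairpin: three stacked pairs closing a 3-nt loop.
    print(dot_bracket_pairs("(((...)))"))
```

Because pairs are recovered with a stack, the helper only handles nested (pseudoknot-free) structures, which is also the class of structures the dynamic programming recursion considers.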

slide-12
SLIDE 12

Significance of predicted RNA secondary structures: z-score statistics

◮ Does a naturally occurring RNA sequence have a lower minimum free energy (MFE) than random sequences of the same size and base composition?

  1. Calculate the native MFE m.
  2. Calculate the mean µ and standard deviation σ of the MFEs of a large number of shuffled random sequences.
  3. Express the significance in standard deviations from the mean as the z-score

     z = \frac{m - \mu}{\sigma}

◮ Negative z-scores indicate that the native RNA is more stable than the random RNAs.
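The three steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the folding energy is supplied by the caller as `energy_fn`, and the `toy_energy` function used here is a made-up stand-in (it merely counts mirrored Watson-Crick pairs) so that the example runs without a folding library. In practice `energy_fn` would return a real MFE, for instance from RNAfold. Mononucleotide shuffling is used, which preserves length and base composition.

```python
import random
import statistics


def shuffle_zscore(seq, energy_fn, n_shuffles=1000, seed=0):
    """z = (m - mu) / sigma, with mu and sigma sampled from shuffled sequences."""
    rng = random.Random(seed)
    native = energy_fn(seq)                      # step 1: native energy m
    bases = list(seq)
    sampled = []
    for _ in range(n_shuffles):                  # step 2: shuffled controls
        rng.shuffle(bases)                       # keeps length and composition
        sampled.append(energy_fn("".join(bases)))
    mu = statistics.mean(sampled)
    sigma = statistics.stdev(sampled)
    if sigma == 0:
        raise ValueError("all shuffled sequences have the same energy")
    return (native - mu) / sigma                 # step 3: z-score


def toy_energy(seq):
    """Made-up stand-in for an MFE: minus the number of mirrored Watson-Crick pairs."""
    watson_crick = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in watson_crick for i in range(n // 2)))


if __name__ == "__main__":
    z = shuffle_zscore("GGGGGGAAAACCCCCC", toy_energy, n_shuffles=200, seed=1)
    print("z-score: %.2f" % z)
```

The perfect hairpin in the usage example scores lower than its shuffled versions, so its z-score is negative, in line with the interpretation given above.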

12/ 38

slide-13
SLIDE 13

z-scores of structured RNAs

[Figure: histogram of z-scores (x-axis: z-score, about −6 to 4; y-axis: frequency, up to about 0.4); annotation: 2%]

ncRNA type                   No. of seqs.   Mean z-score
tRNA                         579            −1.84
5S rRNA                      606            −1.62
Hammerhead ribozyme III      251            −3.08
Group II catalytic intron    116            −3.88
SRP RNA                      73             −3.37
U5 spliceosomal RNA          199            −2.73

Washietl & Hofacker, J. Mol. Biol. (2004) 342:19 13/ 38

slide-14
SLIDE 14

Comparative genomics at our fingertips

◮ 30+ vertebrate genomes
◮ 12+ Drosophila genomes
◮ 20+ yeast genomes
◮ and many more...

14/ 38

slide-15
SLIDE 15

Consensus folding using RNAalifold

◮ RNAalifold uses the same algorithms and energy parameters as RNAfold
◮ Energy contributions of the single sequences are averaged
◮ Covariance information (e.g. compensatory mutations) is incorporated in the energy model.
◮ It calculates a consensus MFE consisting of an energy term and a covariance term.

Hofacker, Fekete & Stadler, J. Mol. Biol. (2002) 319:1059 15/ 38

slide-16
SLIDE 16

The structure conservation index

◮ The SCI is an efficient and convenient measure for secondary structure conservation.
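As a rough illustration, the sketch below assumes the definition used in the RNAz publication: the SCI is the consensus MFE of the alignment (as computed by RNAalifold) divided by the mean MFE of the individual sequences (as computed by RNAfold). The function name and the convention of returning 0 when the mean single-sequence energy is 0 are choices made for this example only.

```python
def structure_conservation_index(consensus_mfe, single_mfes):
    """SCI = consensus MFE / mean of the single-sequence MFEs (assumed definition).

    consensus_mfe: consensus energy of the alignment, e.g. from RNAalifold.
    single_mfes:   MFEs of the individual sequences, e.g. from RNAfold.
    """
    if not single_mfes:
        raise ValueError("need at least one single-sequence MFE")
    mean_single = sum(single_mfes) / len(single_mfes)
    if mean_single == 0:
        # No sequence folds at all; treated as "no conserved structure" here.
        return 0.0
    return consensus_mfe / mean_single


if __name__ == "__main__":
    # Consensus fold as stable as the average single fold: SCI = 1.
    print(structure_conservation_index(-20.0, [-25.0, -15.0, -20.0]))
```

Under this definition an SCI near 0 means that the sequences share no common fold, a value near 1 means that the consensus structure is about as stable as the individual folds, and values above 1 can occur when the covariance term rewards compensatory mutations.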

16/ 38

slide-17
SLIDE 17

Efficient calculation of stability z-scores

◮ The significance of a predicted MFE structure can be expressed as a z-score, which is normalized w.r.t. sequence length and base composition.
◮ Traditionally, z-scores are sampled by time-consuming random shuffling.
◮ The shuffling can be replaced by a regression calculation which is of the same accuracy.

[Figure: scatter plots of sampled z-scores against sampled z-scores and against calculated z-scores (axes from about −8 to 3)]

17/ 38

slide-19
SLIDE 19

SVM classification based on both scores

◮ Both scores separate native ncRNAs from controls in two dimensions.
◮ A support vector machine is used for classification: RNAz.

Washietl, Hofacker & Stadler, Proc. Natl. Acad. Sci. USA (2005) 33:2433 18/ 38

slide-20
SLIDE 20

Probabilistic approaches to fold RNA

◮ Hidden Markov Models are commonly used in computational biology to assign “states” to a sequence, e.g. exons in DNA sequence, conserved regions in alignments, ...
◮ Can we use a similar approach to parse an RNA sequence into structural states?

  AGCUCUGAGGUGAUUUUCAUAUUGAAUUGCAAAUUCGAAGAAGCAGCUUCAAACCUGCCGGGGCUU
  (((((((..((((...)))).(((((((...)))))))....((((........))))))))))).

◮ The HMM framework needs to be extended to allow for nested correlations

19/ 38

slide-21
SLIDE 21

Context free grammars

◮ A context-free grammar can be defined by G = (V, T, P, S) where:
  ◮ V is a finite set of nonterminal symbols (“states”),
  ◮ T is a finite set of terminal symbols,
  ◮ P is a finite set of production rules and
  ◮ S is the initial (start) nonterminal (S ∈ V).
◮ A simple palindrome grammar: V = {S}, T = {a, b}, P = {S → aSa, S → bSb, S → ε}
◮ Efficiently describes the set of all even-length palindromes over the alphabet {a, b}.
◮ Example production: S → aSa → abSba → abbSbba → abbbba
◮ Given the CFG G = (V, T, P, S), we get a stochastic CFG (SCFG) by assigning each production rule α ∈ P a probability Prob(α) such that:

  \sum_{\alpha \in P} \mathrm{Prob}(\alpha) = 1

  (With more than one nonterminal, the probabilities of the rules sharing a left-hand side sum to 1.)
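To make the palindrome grammar concrete, the following sketch derives a word step by step and scores it under a stochastic version of the grammar. The rule probabilities (0.4, 0.4, 0.2) are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
import math

# Illustrative probabilities for S -> aSa, S -> bSb, S -> epsilon (assumed values).
RULE_PROB = {"a": 0.4, "b": 0.4, "": 0.2}


def rules_normalized():
    """Check that the rule probabilities of the single nonterminal S sum to 1."""
    return math.isclose(sum(RULE_PROB.values()), 1.0)


def derivation(word):
    """Return the sentential forms deriving `word`, or None if it is not in the language."""
    steps = ["S"]
    prefix, suffix, rest = "", "", word
    while rest:
        # Each step must peel off the same terminal at both ends (S -> xSx).
        if len(rest) < 2 or rest[0] != rest[-1] or rest[0] not in "ab":
            return None
        prefix += rest[0]
        suffix = rest[-1] + suffix
        rest = rest[1:-1]
        steps.append(prefix + "S" + suffix)
    steps.append(prefix + suffix)  # final step: S -> epsilon
    return steps


def word_probability(word):
    """Probability of `word`: product of the probabilities of the rules used."""
    if derivation(word) is None:
        return 0.0
    prob = RULE_PROB[""]
    for symbol in word[: len(word) // 2]:
        prob *= RULE_PROB[symbol]
    return prob


if __name__ == "__main__":
    print(" -> ".join(derivation("abbbba")))
    print(word_probability("abbbba"))
```

Since this grammar is unambiguous, every word has exactly one parse tree, so the probability of a word equals the probability of its single derivation. For the RNA grammars discussed next, a sequence usually has many parse trees.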

20/ 38

slide-22
SLIDE 22

A simple RNA grammar

◮ V = {S}, T = {a, c, g, u}, P =
  ◮ S → aSu | uSa | gSc | cSg | uSg | gSu
  ◮ S → aS | uS | gS | cS
  ◮ S → Sa | Su | Sg | Sc
  ◮ S → SS
  ◮ S → ε
◮ Shorthand: S → aSâ | aS | Sa | SS | ε, where a stands for any base and â for a base that can pair with it.

21/ 38

slide-23
SLIDE 23

Parse tree

◮ One possible parse tree Π of the string x = ACAGGAAACUGUACGGUGCAACCG and its correspondence to an RNA secondary structure (nonterminals: red, terminals: black)

22/ 38

slide-24
SLIDE 24

RNA folding using SCFG

◮ Find the parse tree of maximum probability using a Nussinov-style recursion.
◮ γ(i, j) is the maximum log(Prob) for subsequence (i, j)
◮ Initialization: γ(i, i − 1) = log Prob(S → ε)
◮ Recursion:

  \gamma(i,j) = \max \begin{cases}
    \gamma(i+1, j-1) + \log \mathrm{Prob}(S \to x_i S x_j) \\
    \gamma(i+1, j) + \log \mathrm{Prob}(S \to x_i S) \\
    \gamma(i, j-1) + \log \mathrm{Prob}(S \to S x_j) \\
    \max_{i<k<j} \{ \gamma(i,k) + \gamma(k+1,j) + \log \mathrm{Prob}(S \to SS) \}
  \end{cases}
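The recursion can be turned into a short dynamic programming routine. The sketch below is a minimal illustration for the simple RNA grammar, not a trained model: all rule probabilities are made-up values chosen so that the 16 rules sum to 1, and the function name `cyk_fold` is hypothetical. It uses half-open intervals, so `gamma[i][j]` scores the substring `x[i:j]` and the empty substring plays the role of γ(i, i − 1).

```python
import math

# Made-up rule probabilities; together the 16 rules sum to 1.
PAIR = {("g", "c"): 0.12, ("c", "g"): 0.12, ("a", "u"): 0.08,
        ("u", "a"): 0.08, ("g", "u"): 0.03, ("u", "g"): 0.03}
LEFT = 0.05         # each of the four rules S -> xS
RIGHT = 0.05        # each of the four rules S -> Sx
BIFURCATION = 0.04  # S -> SS
END = 0.10          # S -> epsilon


def cyk_fold(seq):
    """Return (max log-probability, dot-bracket structure) for `seq`."""
    x = seq.lower()
    n = len(x)
    gamma = [[float("-inf")] * (n + 1) for _ in range(n + 1)]
    back = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        gamma[i][i] = math.log(END)            # empty substring: S -> epsilon
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            candidates = [
                (gamma[i + 1][j] + math.log(LEFT), ("left",)),
                (gamma[i][j - 1] + math.log(RIGHT), ("right",)),
            ]
            pair_prob = PAIR.get((x[i], x[j - 1]))
            if span >= 2 and pair_prob is not None:
                candidates.append(
                    (gamma[i + 1][j - 1] + math.log(pair_prob), ("pair",)))
            for k in range(i + 1, j):          # both parts non-empty
                candidates.append(
                    (gamma[i][k] + gamma[k][j] + math.log(BIFURCATION), ("split", k)))
            gamma[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback: follow the stored decisions to recover the structure.
    structure = ["."] * n
    todo = [(0, n)]
    while todo:
        i, j = todo.pop()
        if i == j:
            continue
        move = back[i][j]
        if move[0] == "left":
            todo.append((i + 1, j))
        elif move[0] == "right":
            todo.append((i, j - 1))
        elif move[0] == "pair":
            structure[i], structure[j - 1] = "(", ")"
            todo.append((i + 1, j - 1))
        else:
            todo.extend([(i, move[1]), (move[1], j)])
    return gamma[0][n], "".join(structure)


if __name__ == "__main__":
    print(cyk_fold("gggaaaccc"))
```

Restricting the bifurcation to non-empty parts does not change the maximum, because a parse that uses S → SS with an empty child only multiplies in extra factors below 1. The routine needs O(n²) memory and O(n³) time, like the Nussinov algorithm it mirrors.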

23/ 38

slide-25
SLIDE 25

Standard algorithms for SCFG

◮ Given a parameterized SCFG (G, Ω) and a sequence x, the Cocke-Younger-Kasami (CYK) dynamic programming algorithm finds an optimal (maximum probability) parse tree \hat{\pi}:

  \hat{\pi} = \arg\max_{\pi} \mathrm{Prob}(\pi, x \mid G, \Omega)

◮ The Inside algorithm is used to obtain the total probability of the sequence given the model, summed over all parse trees:

  \mathrm{Prob}(x \mid G, \Omega) = \sum_{\pi} \mathrm{Prob}(x, \pi \mid G, \Omega)

◮ Analogies to thermodynamic folding:
  ◮ CYK ↔ Minimum free energy (Nussinov/Zuker)
  ◮ Inside/outside algorithm ↔ Partition functions (McCaskill)
◮ Analogies to Hidden Markov models:
  ◮ CYK ↔ Viterbi algorithm
  ◮ Inside/outside algorithm ↔ Forward/backward algorithm

24/ 38

slide-26
SLIDE 26

Evofold: Phylo SCFGs

[Figure: structure ((((....))))((((....)))), parse tree and phylogenetic tree for an alignment of the sequence ACAGGAGACUGUACGGUGCAACCG]

Single sequence:
◮ Terminal symbols are bases or base pairs
◮ Emission probabilities are base frequencies in loops and paired regions

Phylo-SCFG:
◮ Terminal symbols are single or paired alignment columns
◮ Emission probabilities are calculated from a phylogenetic model and tree using Felsenstein's algorithm
◮ 4x4 matrix for single columns, 16x16 matrix for paired columns
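To illustrate how a single alignment column gets its emission probability, here is a minimal sketch of Felsenstein's pruning algorithm. It rests on several assumptions that are not from the slides: the Jukes-Cantor substitution model with uniform base frequencies stands in for the 4x4 model, the tree is encoded as nested tuples `(left, t_left, right, t_right)` with leaves given as observed bases, and all function names are hypothetical. EvoFold itself uses its own parametrized models.

```python
import math

BASES = "ACGU"


def jc_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t (substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}


def partial_likelihoods(node):
    """Probability of the observed leaves below `node`, for each base at `node`."""
    if isinstance(node, str):                      # leaf: observed base
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    like_left = partial_likelihoods(left)
    like_right = partial_likelihoods(right)
    p_left, p_right = jc_matrix(t_left), jc_matrix(t_right)
    return {
        a: sum(p_left[a][b] * like_left[b] for b in BASES)
           * sum(p_right[a][b] * like_right[b] for b in BASES)
        for a in BASES
    }


def column_likelihood(tree):
    """Likelihood of one alignment column, with uniform base frequencies at the root."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[a] for a in BASES)


if __name__ == "__main__":
    conserved = (("A", 0.1, "A", 0.1), 0.05, "A", 0.2)
    variable = (("A", 0.1, "C", 0.1), 0.05, "G", 0.2)
    print(column_likelihood(conserved), column_likelihood(variable))
```

For paired columns the same recursion applies with 16 states (one per ordered base pair) instead of 4, which is where the 16x16 matrix comes in.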

25/ 38

slide-27
SLIDE 27

EvoFold

◮ Structural RNA gene finding: EvoFold
  ◮ Uses simple RNA grammar
  ◮ Two competing models:
    ◮ Non-structural model with all columns treated as evolving independently
    ◮ Structural model with dependent and independent columns
  ◮ Sophisticated parametrization

26/ 38

slide-28
SLIDE 28

Screening the human genome with RNAz

[Figure, panels a-d: genome browser views on chr. 13 and chr. 11. Tracks: most conserved noncoding regions (present in at least human/mouse/rat/dog), RNAz structural RNAs (P>0.5), RNAz structural RNAs (P>0.9), RefSeq genes. Zoomed views show the miRNA cluster mir-17, mir-19a, mir-19b-1, mir-18, mir-20, mir-92-1 and a snoRNA region with H/ACA snoRNAs and C/D-box snoRNAs (ACA25, ACA32, ACA1, ACA8, ACA18, ACA40, mgh28S-2412, mgh28S-2410).]

Alignment and consensus structure of one miRNA precursor:

  (((((..((((((..((((((((.((.(((((...(((........)))...))))).)).))))))))...))))))....)))))
  GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATGT-GCATCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-GTGAC  Human
  GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATGTGT-GCATCTACTGCAGTGAGGGCACTTGTAGCATTA-TG-CTGAC  Mouse
  GTCAGGATAATGTCAAAGTGCTTACAGTGCAGGTAGTGGTGTGT-GCATCTACTGCAGTGAAGGCACTTGTGGCATTG-TG-CTGAC  Rat
  GTCAGAGTAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATATAGAACCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-TTGAC  Chicken
  GTCAATGTATTGTCAAAGTGCTTACAGTGCAGGTAGTATTATGGAATATCTACTGCAGTGGAGGCACTTCTAGCAATA-CACTTGAC  Zebrafish
  GTCTGTGTATTGCCAAAGTGCTTACAGTGCAGGTAGTTCTATGTGACACCTACTGCAATGGAGGCACTTACAGCAGTACTC-TTGAC  Fugu

◮ Large scale comparative screen of mammals/vertebrates
◮ ≈ 5% of the best conserved non-coding regions
◮ → 438,788 alignments covering 82.64 MB (2.88% of the genome)

Washietl, Hofacker & Stadler, Nat. Biotech. (2005) 23:1383 27/ 38

slide-29
SLIDE 29

Detection performance of well-known small ncRNAs

Washietl, Hofacker & Stadler, Nat. Biotech. (2005) 23:1383 28/ 38

slide-30
SLIDE 30

Searching for H/ACA snoRNAs

◮ Two stems of at least 15 pairs
◮ Unpaired hinge
◮ ACA in last 20 nucleotides
◮ → 137 candidates (28 known), 30-40 show typical structure upon visual inspection, 15 have canonical H-box motif ANANNA
◮ Five candidates were tested, 3 found on Northern blots in HeLa cells

Washietl, Hofacker & Stadler, Nat. Biotech. (2005) 23:1383 29/ 38

slide-31
SLIDE 31

Searching for miRNA precursors

◮ Stem with at least 20 pairs
◮ Mean z-score < −3.5
◮ 22 nt window with more than 95% identity
◮ → 312 candidates (109 known miRNAs)
◮ Automated in RNAmicro (Hertel & Stadler, Bioinformatics 22:e197, 2006)

Washietl, Hofacker & Stadler, Nat. Biotech. (2005) 23:1383 30/ 38

slide-32
SLIDE 32

miRNA precursors in Drosophila (Sandmann & Cohen)

◮ 56 miRNAs predicted using RNAz and evolutionary patterns.
◮ 22 (39%) verified (16 Northern, 19 small RNA libraries, 13 both)

Sandmann & Cohen, PLoS One (2007) 2:e1265 31/ 38

slide-33
SLIDE 33

Intergenic RNAs

[Figure: genome browser view of an intergenic locus on chr14 (around 53427000-53427500) with tracks RNAz, EvoFold, RACE primer, RACEfrags, TARs/Transfrags, constrained elements, vertebrate Multiz alignment and conservation, Gencode reference genes; label: Testis; with a drawing of the predicted secondary structure]

Washietl, Pedersen, Korbel et al., Genome Res. (2007) 17:852 32/ 38

slide-34
SLIDE 34

Intronic RNAs

[Figure: genome browser view of an intronic locus in MAP3K1 on chr5 (around 56176500-56178000) with tracks RNAz, EvoFold, RACE primer, RACEfrags, TARs/Transfrags, constrained elements, human ESTs including unspliced (DR006352, BM148300, AI476562, BE782001, AW505258), vertebrate Multiz alignment and conservation, Gencode reference genes; label: Testis; with a drawing of the predicted secondary structure]

Washietl, Pedersen, Korbel et al., Genome Res. (2007) 17:852 33/ 38

slide-35
SLIDE 35

RNAz screen in other genomes

◮ Drosophila melanogaster: Rose et al.: BMC Genomics 2007, 8:406.
◮ Ciona intestinalis: Missal, Rose & Stadler: Bioinformatics 2005, 21 Suppl 2:77-78.
◮ Caenorhabditis elegans: Missal et al.: J Exp Zoolog B Mol Dev Evol 2006, 306(4):379-392.
◮ Saccharomyces cerevisiae: Steigele et al.: BMC Biol 2007, 5:25.
◮ Plasmodium falciparum: Mourier et al.: Genome Res. 2008.

34/ 38

slide-36
SLIDE 36

An RNAz screen in Plasmodium (Mourier et al.)

◮ 22 of 78 tested high-scoring RNAz candidates (28%) were verified by Northern blot analysis.

Mourier et al. Genome Res. 2008 35/ 38

slide-37
SLIDE 37

Structure family identification using EvoFold+EvoFam

Parker et al. Genome Res. 2011 36/ 38

slide-38
SLIDE 38

Family of hairpins in 3'-UTR of MAT2A

Parker et al. Genome Res. 2011 37/ 38

slide-39
SLIDE 39

tRNA-like structures in intron of POP1

Parker et al. Genome Res. 2011 38/ 38