
SLIDE 1

1

Modeling and Searching for Non-Coding RNA

W.L. Ruzzo

http://www.cs.washington.edu/homes/ruzzo

ncRNA: Interest

extensive noncoding sequence conservation
even more extensive transcription
“invisible” structural conservation?
many RNA binding proteins
examples: microRNAs, riboswitches
Bottom line: important regulatory roles

Outline

Why RNA?
Examples of RNA biology
Computational Challenges
– Modeling
– Search
– Inference

Fig. 2. The arrows show the situation as it seemed in 1958. Solid arrows represent probable transfers, dotted arrows possible transfers. The absent arrows (compare Fig. 1) represent the impossible transfers postulated by the central dogma. They are the three possible arrows starting from protein.

SLIDE 2

2

The “Central Dogma”

DNA → RNA → Protein

[Figure labels: DNA (chromosome), RNA (messenger), Protein; gene; cell]

RNA Secondary Structure:

RNA makes helices too

[Figure: an RNA helix with its base pairs labeled (A-U, C-G)]

“Classical” RNAs

mRNA
tRNA
rRNA
snRNA (small nuclear - splicing)
snoRNA (small nucleolar - guides for t/rRNA modifications)
RNAseP (tRNA maturation; ribozyme in bacteria)
SRP (signal recognition particle; co-translational targeting of proteins to membranes)
telomerases

Non-coding RNA

Messenger RNA - codes for proteins
Non-coding RNA - all the rest

Before, say, the mid-1990s, 1-2 dozen known (critically important, but narrow roles: e.g. tRNA)

Since the mid-90s, dramatic discoveries

Regulation, transport, stability/degradation
E.g. “microRNA”: hundreds in humans

By some estimates, ncRNA >> mRNA

SLIDE 3

3

Bacteria

Triumph of proteins
80% of genome is coding DNA
Functionally diverse: receptors, motors, catalysts, regulators (Jacob & Monod, Nobel prize 1965), …

Alberts, et al, 3e.

Gene Regulation: The MET Repressor

SAM DNA Protein

Alberts, et al, 3e.

The protein way Riboswitch alternative

SAM Grundy & Henkin, Mol. Microbiol 1998 Epshtein, et al., PNAS 2003 Winkler et al., Nat. Struct. Biol. 2003

SLIDE 4

4

Alberts, et al, 3e.

The protein way Riboswitch alternatives

SAM-II

SAM-I Grundy, Epshtein, Winkler et al., 1998, 2003

Corbino et al., Genome Biol. 2005

Alberts, et al, 3e. Corbino et al., Genome Biol. 2005

The protein way Riboswitch alternatives

SAM-III

SAM-II SAM-I

Fuchs et al., NSMB 2006

Grundy, Epshtein, Winkler et al., 1998, 2003 Alberts, et al, 3e. Corbino et al., Genome Biol. 2005

The protein way Riboswitch alternatives

Weinberg et al., RNA 2008 SAM-III SAM-II SAM-I Fuchs et al., NSMB 2006 Grundy, Epshtein, Winkler et al., 1998, 2003 SAM-IV Weinberg, et al. Nucl. Acids Res., July 2007 35: 4809-4819.

boxed = confirmed riboswitch (+2 more)

Widespread, deeply conserved, structurally sophisticated, functionally diverse, biologically important uses for ncRNA throughout prokaryotic world.

SLIDE 5

5

RNA on the Rise

In humans

more RNA- than DNA-binding proteins?
much more conserved DNA than coding
MUCH more transcribed DNA than coding

In bacteria

regulation of MANY genes involves RNA
dozens of classes & thousands of new examples in just the last 5 years

Human Predictions

Evofold

S Pedersen, G Bejerano, A Siepel, K Rosenbloom, K Lindblad-Toh, ES Lander, J Kent, W Miller, D Haussler, "Identification and classification of conserved RNA secondary structures in the human genome." PLoS Comput. Biol., 2, #4 (2006) e33.
48,479 candidates (~70% FDR?)

RNAz

S Washietl, IL Hofacker, M Lukasser, A Hüttenhofer, PF Stadler, "Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome." Nat. Biotechnol., 23, #11 (2005) 1383-90.
30,000 structured RNA elements
1,000 conserved across all vertebrates
~1/3 in introns of known genes, ~1/6 in UTRs
~1/2 located far from any known gene

FOLDALIGN

E Torarinsson, M Sawera, JH Havgaard, M Fredholm, J Gorodkin, "Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure." Genome Res., 16, #7 (2006) 885-9.
1800 candidates from 36970 (of 100,000) pairs

SLIDE 6

6

CMfinder

Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and Gorodkin. Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions. Genome Research, Feb 2008, 18(2):242-251 PMID: 18096747
6500 candidates in ENCODE alone (better FDR, but still high)

Fastest Human Gene?

Origin of Life?

Life needs
– information carrier: DNA
– molecular machines, like enzymes: Protein
making proteins needs DNA + RNA + proteins
making (duplicating) DNA needs proteins
Horrible circularities! How could it have arisen in an abiotic environment?

Origin of Life?

RNA can carry information too (RNA double helix)
RNA can form complex structures
RNA enzymes exist (ribozymes)
The “RNA world” hypothesis: 1st life was RNA-based
Some extant RNAs are relicts of that origin; some are “modern” inventions

SLIDE 7

7

ncRNA Example: Xist

large (12kb?)
largely unstructured RNA
required for X-inactivation in mammals

ncRNA Example: 6S

medium size (175nt)
structured
highly expressed in E. coli in certain growth conditions
sequenced in 1971; function unknown for 30 years

6S mimics an open promoter

Barrick et al. RNA 2005 Trotochaud et al. NSMB 2005 Willkomm et al. NAR 2005

E.coli

ncRNA Example: IRE

Iron Response Element: a short conserved stem-loop, bound by iron response proteins (IRPs). Found in UTRs of various mRNAs whose products are involved in iron metabolism. E.g., the mRNA of ferritin (an iron storage protein) contains one IRE in its 5' UTR. When iron concentration is low, IRPs bind the ferritin mRNA IRE, repressing translation. Binding of multiple IREs in the 3' and 5' UTRs of the transferrin receptor (involved in iron acquisition) leads to increased mRNA stability. These two activities form the basis of iron homeostasis in the vertebrate cell.

SLIDE 8

8

IRE (partial seed alignment):

Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC
Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC
Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC
Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA
Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC
Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU
Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG
Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG
Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU
Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC
Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG
Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC
Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC
Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG
Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC
SS_cons  <<<<<...<<<<<......>>>>>.>>>>>
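The SS_cons line of the seed alignment above encodes the consensus secondary structure: each "<" pairs with a matching ">". As a minimal sketch (the function names are illustrative and do not come from Rfam or any RNA library), the following Python reads the paired columns off that line and counts how many of them hold a canonical pair (Watson-Crick or G-U wobble) in one of the mouse sequences.

```python
# Illustrative helpers only; the names are assumptions, not a library API.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G-U wobble


def parse_pairs(ss_cons):
    """Return sorted (i, j) column pairs for matching '<' and '>' symbols."""
    stack, pairs = [], []
    for j, symbol in enumerate(ss_cons):
        if symbol == "<":
            stack.append(j)
        elif symbol == ">":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def count_canonical(seq, pairs):
    """Count paired columns whose two letters form a canonical base pair."""
    return sum(1 for i, j in pairs if seq[i] + seq[j] in CANONICAL)


SS_CONS = "<<<<<...<<<<<......>>>>>.>>>>>"
MOUSE_IRE = "UAUAUC..GGAGACAGUGAUCUCC.AUAUG"  # a Mus.mus. row of the alignment

pairs = parse_pairs(SS_CONS)
print(len(pairs), count_canonical(MOUSE_IRE, pairs))  # prints: 10 10
```

The two stems give ten paired columns, and in this sequence all ten are canonical even though the letters differ from the human rows; that is the kind of structural conservation a sequence-only comparison does not see.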

Iron Response Element

ncRNA Example: MicroRNAs

short (~22 nt) unstructured RNAs
excised from ~75nt precursor hairpin
approx antisense to mRNA targets, often in 3’ UTR
regulate gene activity, e.g. by destabilizing (plants) or otherwise suppressing (animals) message
hundreds and growing, each w/ perhaps 10x targets
Some conserved human to worm; others evolving rapidly

ncRNA Example: T-boxes

ncRNA Example: Riboswitches

UTR structure that directly senses/binds small molecules & regulates mRNA
widespread in prokaryotes
some in eukaryotes

SLIDE 9

9

Example: Glycine Regulation

How is glycine level regulated?
Plausible answer:

[Figure: the glycine cleavage enzyme (gce) gene on DNA, with glycine (g), transcription factor (TF), and gce protein labeled]

transcription factors (proteins) bind to DNA to turn nearby genes on or off

The Glycine Riboswitch

Actual answer (in many bacteria):

[Figure: the glycine cleavage enzyme (gce) gene on DNA, with glycine (g), gce mRNA (5′ to 3′), and gce protein labeled]

Mandal et al. Science 2004

More examples means better alignment
Understand phylogenetic distribution
Find riboswitch in front of new gene

(Mandal, Lee, Barrick, Weinberg, Emilsson, Ruzzo, Breaker, Science 2004)

And…

gcvT ORF

5’ 3’

Fig. 3. Cooperative binding of two glycine molecules by the VC I-II RNA. Plot depicts the fraction of VC II (open) and VC I-II (solid) bound to ligand versus the concentration of glycine. The constant, n, is the Hill coefficient for the lines as indicated that best fit the aggregate data from four different regions (fig. S3). Shaded boxes demark the dynamic range (DR) of glycine concentrations needed by the RNAs to progress from 10%- to 90%-bound states.

SLIDE 10

10

Riboswitches

~20 ligands known; multiple nonhomologous solutions for some
dozens to hundreds of instances of each
TPP known in archaea & eukaryotes
on/off; transcription/translation; splicing; combinatorial control
In some bacteria, more riboregulators identified than protein TFs
all found since ~2003

Why?

RNAs fold, and function

Nature uses what works

Outline

ncRNA: what/why?
What does computation bring?
How to model and search for ncRNA?
Faster search
Better model inference

Homology search

Sequence-based

– Smith-Waterman
– FASTA
– BLAST

Sharp decline in sensitivity at ~60-70% identity
So, use structure, too

SLIDE 11

11

Structure Prediction

X-ray crystallography, NMR, etc
Comparative modeling

– Alignment & compensatory substitutions

Single-sequence Folding

– Mfold, McCaskill, Vienna…
– est 50-70% accurate up to 200-300 nt

Multiple sequence alignment/folding

Impact of RNA homology search

B. subtilis
L. innocua
A. tumefaciens
V. cholera
M. tuberculosis

(Barrick, et al., 2004)

(and 19 more species)

operon

glycine riboswitch

Impact of RNA homology search

B. subtilis
L. innocua
A. tumefaciens
V. cholera
M. tuberculosis

(Barrick, et al., 2004)

(and 19 more species)

operon

glycine riboswitch (and 42 more species)

Using our techniques, we found…

RNA Informatics

RNA: Not just a messenger anymore

– Dramatic discoveries
– Hundreds of families (besides classics like tRNA, rRNA, snRNA…)
– Widespread, important roles

Computational tools important

– Discovery, characterization, annotation
– BUT: slow, inaccurate, demanding

SLIDE 12

12

Q: What’s so hard?

A C U G C A G G G A G C A A G C G A G G C C U C U G C A A U G A C G G U G C A U G A G A G C G U C U U U U C A A C A C U G U U A U G G A A G U U U G G C U A G C G U U C U A G A G C U G U G A C A C U G C C G C G A C G G G A A A G U A A C G G G C G G C G A G U A A A C C C G A U C C C G G U G A A U A G C C U G A A A A A C A A A G U A C A C G G G A U A C G

A: Structure often more important than sequence

Computational Challenges

Search - given related RNAs, find more
Modeling - describe a related family
Meta-modeling - what’s a good modeling framework?

Covariance Models

Hand-curated alignments -> CMs
CM-based search

Predict Structure from Multiple Sequences

… GA … UC …
… GA … UC …
… GA … UC …
… CA … UG …
… CC … GG …
… UA … UA …

Compensatory mutations reveal structure, but in usual alignment algorithms they are doubly penalized.

“RNA sequence analysis using covariance models”

Eddy & Durbin Nucleic Acids Research, 1994 vol 22 #11, 2079-2088

SLIDE 13

13

What

A probabilistic model for RNA families

– The “Covariance Model”
– ≈ A Stochastic Context-Free Grammar
– A generalization of a profile HMM

Algorithms for Training

– From aligned or unaligned sequences
– Automates “comparative analysis”
– Complements Nussinov/Zuker RNA folding

Algorithms for searching

Main Results

Very accurate search for tRNA

– (Precursor to tRNAscanSE - current favorite)

Given sufficient data, model construction comparable to, but not quite as good as, human experts

Some quantitative info on importance of pseudoknots and other tertiary features

Probabilistic Model Search

As with HMMs, given a sequence, you calculate the likelihood ratio that the model could generate the sequence, vs a background model
You set a score threshold
Anything above threshold => a “hit”
Scoring:
– “Forward” / “Inside” algorithm - sum over all paths
– Viterbi approximation - find single best path (Bonus: alignment & structure prediction)

Example: searching for tRNAs

SLIDE 14

14

How to model an RNA “Motif”?

Conceptually, start with a profile HMM:

– from a multiple alignment, estimate nucleotide/insert/delete preferences for each position
– given a new seq, estimate likelihood that it could be generated by the model, & align it to the model

[Figure labels: all G, mostly G, del, ins]

Mj: Match states (20 emission probabilities)
Ij: Insert states (Background emission probabilities)
Dj: Delete states (silent - no emission)
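To make the "estimate preferences per position, then score a new sequence" idea concrete, here is a deliberately simplified Python sketch. It keeps only the match states (no insert or delete states, so it is not a full profile HMM), estimates per-column nucleotide frequencies from a toy ungapped alignment, and scores windows by log-odds against a uniform background. The toy alignment, the pseudocount of 1, the threshold, and all function names are assumptions made for illustration.

```python
import math


def column_log_odds(alignment, alphabet="ACGU", pseudocount=1.0, background=0.25):
    """Per-column log2 odds scores from an ungapped alignment (match states only)."""
    model = []
    for col in range(len(alignment[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in alignment:
            counts[seq[col]] += 1
        total = sum(counts.values())
        model.append({a: math.log2(counts[a] / total / background) for a in alphabet})
    return model


def score(seq, model):
    """Log-odds that the model, rather than the background, generated seq."""
    return sum(column[ch] for column, ch in zip(model, seq))


def scan(text, model, threshold):
    """Return (position, score) for every window scoring above the threshold."""
    width = len(model)
    hits = []
    for i in range(len(text) - width + 1):
        s = score(text[i:i + width], model)
        if s > threshold:
            hits.append((i, s))
    return hits


toy_alignment = ["GGAC", "GGAU", "GGCC", "GGAC"]  # hypothetical training data
model = column_log_odds(toy_alignment)
print(scan("AAGGACUU", model, threshold=2.0))
```

With this toy data only the window starting at position 2 ("GGAC") clears the threshold. A real profile HMM adds insert and delete states and uses Viterbi or Forward to handle gaps.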

Profile HMM Structure

How to model an RNA “Motif”?

Covariance Models (aka “profile SCFG”)

– Probabilistic models, like profile HMMs, but adding “column pairs” and pair emission probabilities for base-paired regions

paired columns

<<<<<<< >>>>>>> … …

mRNA leader mRNA leader switch?

SLIDE 15

15

mRNA leader mRNA leader switch?

CM Structure

A: Sequence + structure
B: the CM “guide tree”
C: probabilities of letters/pairs & of indels
Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order)

Overall CM Architecture

One box (“node”) per node of guide tree
BEG/MATL/INS/DEL just like an HMM
MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base-pairs; BIF allows multiple helices

SLIDE 16

16

CM Viterbi Alignment

$x_i$ = ith letter of input
$x_{ij}$ = substring i, ..., j of input
$T_{yz}$ = P(transition y → z)
$E^y_{x_i,x_j}$ = P(emission of $x_i, x_j$ from state y)
$S^y_{ij} = \max_\pi \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)$

$$
S^y_{ij} =
\begin{cases}
\max_z \left[ S^z_{i+1,j-1} + \log T_{yz} + \log E^y_{x_i,x_j} \right] & \text{match pair} \\
\max_z \left[ S^z_{i+1,j} + \log T_{yz} + \log E^y_{x_i} \right] & \text{match/insert left} \\
\max_z \left[ S^z_{i,j-1} + \log T_{yz} + \log E^y_{x_j} \right] & \text{match/insert right} \\
\max_z \left[ S^z_{i,j} + \log T_{yz} \right] & \text{delete} \\
\max_{i < k \le j} \left[ S^{y_{left}}_{i,k} + S^{y_{right}}_{k+1,j} \right] & \text{bifurcation}
\end{cases}
$$

Viterbi, cont.

Time O(q n^3), q states, seq len n

compare: O(qn) for profile HMM

Mutual Information

Max when no seq conservation but perfect pairing
MI = expected score gain from using a pair state
Finding optimal MI (i.e. opt pairing of cols) is hard(?)
Finding optimal MI without pseudoknots can be done by dynamic programming

$$
M_{ij} = \sum_{x_i, x_j} f_{x_i,x_j} \log_2 \frac{f_{x_i,x_j}}{f_{x_i} f_{x_j}}; \qquad 0 \le M_{ij} \le 2
$$

Alignment (columns 1-9):

1 2 3 4 5 6 7 8 9
A G A U A A U C U
A G A U C A U C U
A G A C G U U C U
A G A U U U U C U
A G C C A G G C U
A G C G C G G C U
A G C U G C G C U
A G C A U C G C U
A G G U A G C C U
A G G G C G C C U
A G G U G U C C U
A G G C U U C C U
A G U A A A A C U
A G U C C A A C U
A G U U G C A C U
A G U U U C A C U

Letter counts per column:

     1   2   3   4   5   6   7   8   9
A   16   0   4   2   4   4   4   0   0
C    0   0   4   4   4   4   4  16   0
G    0  16   4   2   4   4   4   0   0
U    0   0   4   8   4   4   4   0  16

MI (bits) for the column pairs shown: M(3,4) = 0.30, M(4,5) = 0.42, M(3,6) = 1, M(4,6) = 0.55, M(5,6) = 1, M(3,7) = 2, M(4,7) = 0.30, M(6,7) = 1

M.I. Example (Artificial)

Cols 1 & 9, 2 & 8: perfect conservation & might be base-paired, but unclear whether they are. M.I. = 0

Cols 3 & 7: No conservation, but always W-C pairs, so seems likely they do base-pair. M.I. = 2 bits.

Cols 7->6: unconserved, but each letter in 7 has only 2 possible mates in 6. M.I. = 1 bit.
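The three cases above can be checked numerically. The sketch below transcribes the 16 sequences of the artificial example and evaluates the mutual information formula from the previous slide on chosen column pairs. The function name and the 0-based indexing are choices made here for illustration.

```python
import math
from collections import Counter

# The 16 sequences of the artificial M.I. example, one string per row.
ALIGNMENT = [
    "AGAUAAUCU", "AGAUCAUCU", "AGACGUUCU", "AGAUUUUCU",
    "AGCCAGGCU", "AGCGCGGCU", "AGCUGCGCU", "AGCAUCGCU",
    "AGGUAGCCU", "AGGGCGCCU", "AGGUGUCCU", "AGGCUUCCU",
    "AGUAAAACU", "AGUCCAACU", "AGUUGCACU", "AGUUUCACU",
]


def mutual_information(alignment, i, j):
    """M_ij in bits between columns i and j (0-based indices)."""
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)
    f_j = Counter(seq[j] for seq in alignment)
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)
    return sum(
        (count / n) * math.log2((count / n) / ((f_i[a] / n) * (f_j[b] / n)))
        for (a, b), count in f_ij.items()
    )


# Slide columns are 1-based, so columns 3 & 7 are indices 2 and 6 here.
for cols in [(1, 9), (3, 7), (7, 6), (4, 6)]:
    i, j = cols[0] - 1, cols[1] - 1
    print(cols, round(mutual_information(ALIGNMENT, i, j), 2))
```

Only letter pairs that actually occur contribute to the sum, which sidesteps the 0 log 0 terms. Perfectly conserved columns give M.I. = 0 because the joint frequency equals the product of the marginals.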

SLIDE 17

17

“just like Nussinov/Zuker folding”
BUT, need enough data---enough sequences at right phylogenetic distance

MI-Based Structure-Learning

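Pseudoknots disallowed:

$$
S_{i,j} = \max
\begin{cases}
S_{i+1,j} \\
S_{i,j-1} \\
S_{i+1,j-1} + M_{i,j} \\
\max_{i < k < j} \left[ S_{i,k} + S_{k+1,j} \right]
\end{cases}
$$

Pseudoknots allowed:

$$
\left( \sum_{i=1}^{n} \max_j M_{i,j} \right) / 2
$$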
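The pseudoknot-free recurrence has the same shape as Nussinov-style folding, with the mutual information M(i,j) taking the place of a base-pair count. The following Python is a minimal sketch of that dynamic program, together with the simpler expression for the case where pseudoknots are allowed. The 4-column matrix is a made-up toy input, and the function names are assumptions.

```python
def best_nested_score(M):
    """Max total MI over non-crossing column pairings (M symmetric, n >= 1)."""
    n = len(M)
    S = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
            # Skip column i, skip column j, or pair i with j.
            best = max(S[i + 1][j], S[i][j - 1], inner + M[i][j])
            # Bifurcation: split the interval into two independent parts.
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


def unconstrained_score(M):
    """The pseudoknots-allowed expression: sum of each column's best MI, halved."""
    return sum(max(row) for row in M) / 2


# Toy symmetric MI matrix: pairs (0,2) and (1,3) are strong but cross each other.
M = [
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [3, 2, 0, 0],
    [2, 3, 0, 0],
]
print(best_nested_score(M), unconstrained_score(M))  # prints: 4 6.0
```

In the toy matrix the crossing pairs (0,2) and (1,3) would total 6, but the nested program can only combine (0,3) with (1,2) for a total of 4. The three nested loops also show where the cubic running time comes from.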

Accelerating CM search

Zasha Weinberg

& W.L. Ruzzo

Recomb ‘04, Bioinformatics ‘04, ‘06

SLIDE 18

18

Rfam database

(Release 7.0, 3/2005)

503 ncRNA families 8 riboswitches, 235 small nucleolar RNAs, 8 spliceosomal RNAs, 10 bacterial antisense RNAs, 46 microRNAs, 9 ribozymes, 122 cis RNA regulatory elements, … 280,000 annotated ncRNAs

Covariance Model

Key difference of CM vs HMM: Pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here. EMBL CM hits Z Our Work ~2 months, 1000 computers

CM’s are good, but slow

EMBL CM hits junk Rfam Goal 10 years, 1000 computers Rfam Reality EMBL CM hits junk Blast 1 month, 1000 computers

Oversimplified CM

(for pedagogical purposes only)

A C G U – A C G U – A C G U – A C G U –

SLIDE 19

19

CM to HMM

CM: 25 emissions per state
HMM: 5 emissions per state, 2x states

A C G U – A C G U – A C G U – A C G U – A C G U – A C G U – A C G U – A C G U –

CM HMM

A C G U – A C G U – A C G U – A C G U –

Key Issue: 25 scores → 10

Need: log Viterbi scores CM ≤ HMM

CM HMM

Viterbi/Forward Scoring

Path π defines transitions/emissions
Score(π) = product of “probabilities” on π
NB: ok if “probabilities” aren’t, e.g. ∑≠1
E.g. in CM, emissions are odds ratios vs 0th-order background
For any nucleotide sequence x:

– Viterbi-score(x) = max{ score(π) | π emits x }
– Forward-score(x) = ∑{ score(π) | π emits x }

Key Issue: 25 scores → 10

Need: log Viterbi scores CM ≤ HMM

A C G U – A C G U – A C G U – A C G U –

CM HMM

PAA ≤ LA + RA   PAC ≤ LA + RC   PAG ≤ LA + RG   PAU ≤ LA + RU   PA– ≤ LA + R–
PCA ≤ LC + RA   PCC ≤ LC + RC   PCG ≤ LC + RG   PCU ≤ LC + RU   PC– ≤ LC + R–
…

NB:HMM not a prob. model

L R

SLIDE 20

20

Rigorous Filtering

Any scores satisfying the linear inequalities give rigorous filtering

Proof: CM Viterbi path score ≤ “corresponding” HMM path score ≤ Viterbi HMM path score (even if it does not correspond to any CM path)

PAA ≤ LA + RA   PAC ≤ LA + RC   PAG ≤ LA + RG   PAU ≤ LA + RU   PA– ≤ LA + R–   …

Some scores filter better

PUA = 1 ≤ LU + RA
PUG = 4 ≤ LU + RG

Assuming ACGU ≈ 25%

Option 1: LU = RA = RG = 2, so LU + (RA + RG)/2 = 4
Option 2: LU = 0, RA = 1, RG = 4, so LU + (RA + RG)/2 = 2.5
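A short Python check of the two options, under the slide's simplification that the right-hand letter is A or G with equal probability. It confirms that both score assignments satisfy the linear inequalities (so both are rigorous) and that Option 2 has the lower average HMM score, meaning it filters more tightly. The helper names are illustrative only.

```python
# CM pair scores from the slide: P_UA = 1, P_UG = 4.
PAIR_SCORES = {("U", "A"): 1, ("U", "G"): 4}


def is_rigorous(pair_scores, left, right):
    """True if P_ab <= L_a + R_b holds for every pair score."""
    return all(p <= left[a] + right[b] for (a, b), p in pair_scores.items())


def average_hmm_score(left, right):
    """L_U + (R_A + R_G) / 2, the slide's average over the two right letters."""
    return left["U"] + (right["A"] + right["G"]) / 2


option1 = ({"U": 2}, {"A": 2, "G": 2})
option2 = ({"U": 0}, {"A": 1, "G": 4})

for name, (left, right) in [("Option 1", option1), ("Option 2", option2)]:
    print(name, is_rigorous(PAIR_SCORES, left, right), average_hmm_score(left, right))
# prints: Option 1 True 4.0
#         Option 2 True 2.5
```

Both options upper-bound every CM score, so neither can miss a true hit. The lower average of Option 2 means fewer background windows pass the filter.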

Optimizing filtering

For any nucleotide sequence x:

Viterbi-score(x) = max{ score(π) | π emits x }
Forward-score(x) = ∑{ score(π) | π emits x }

Expected Forward Score

E(Li, Ri) = ∑x Forward-score(x)*Pr(x)
– NB: E is a function of Li, Ri only

Optimization:

Minimize E(Li, Ri) subject to score L.I.s

– This is heuristic (“forward↓ ⇒ Viterbi↓ ⇒ filter↓”)
– But still rigorous because “subject to score L.I.s”

Under 0th-order background model

Calculating E(Li, Ri)

E(Li, Ri) = ∑x Forward-score(x)*Pr(x)

Forward-like: for every state, calculate expected score for all paths ending there, easily calculated from expected scores of predecessors & transition/emission probabilities/scores

SLIDE 21

21

Minimizing E(Li, Ri)

Calculate E(Li, Ri) symbolically, in terms of emission scores, so we can do partial derivatives for numerical convex optimization algorithm

$$
\frac{\partial E(L_1, L_2, \ldots)}{\partial L_i}
$$

Forward: Viterbi:

“Convex” Optimization

Convex: local max = global max; simple “hill climbing” works
Nonconvex: can be many local maxima, << global max; “hill-climbing” fails

What should the probabilities be?

Convex optimization problem

– Constraints: enforce rigorous property
– Objective function: filter as aggressively as possible

Problem sizes:

– 1000-10000 variables
– 10000-100000 inequality constraints

Estimated Filtering Efficiency

(139 Rfam 4.0 families)

Filtering fraction | # families (compact) | # families (expanded)
< 10^-4            | 105                  | 110
10^-4 - 10^-2      | 8                    | 17
.01 - .10          | 11                   | 3
.10 - .25          | 2                    | 2
.25 - .99          | 6                    | 4
.99 - 1.0          | 7                    | 3

Averages 283 times faster than CM

≈ break even

SLIDE 22

22

Results: buried treasures

Name                   | # found BLAST + CM | # found rigorous filter + CM | # new
Pyrococcus snoRNA      | 57                 | 180                          | 123
Iron response element  | 201                | 322                          | 121
Histone 3’ element     | 1004               | 1106                         | 102
Retron msr             | 11                 | 59                           | 48
Hammerhead I           | 167                | 193                          | 26
Hammerhead III         | 251                | 264                          | 13
U6 snRNA               | 1462               | 1464                         | 2
U7 snRNA               | 312                | 313                          | 1
Purine riboswitch      | 69                 | 123                          | 54
S-box                  | 128                | 131                          | 3
U5 snRNA               | 199                | 200                          | 1
U4 snRNA               | 283                | 290                          | 7

Building CM’s

Hand-curated alignments + structure as in Rfam are great, but the approach doesn’t scale

Example Application:

Given 5-20 upstream regions (~500 nt) of orthologous bacterial genes, some (but not all) plausibly regulated by a common riboswitch, could we find it?

Importance of Alignment

Blue boxes, e.g., should be lined up.
Structure is invisible otherwise.

CMFinder

Harder: Finding CMs without alignment

Yao, Weinberg & Ruzzo, Bioinformatics, 2006

[Flowchart labels: Folding predictions, Smart heuristics, Candidate alignment, CM, Realign, Search]

SLIDE 23

23

CMfinder Accuracy

(on Rfam families with flanking sequence)


Summary of Rfam test families and results

Li = column i; σ = (α, β) the 2ary struct, α = unpaired, β = paired cols With MLE params, Iij is the mutual information between cols i and j Can find it via a simple dynamic programming alg.

SLIDE 24

24

A Computational Pipeline for High Throughput Discovery of cis–Regulatory Noncoding RNA in Prokaryotes

PLoS Comp Biol, 2007

Zizhen Yao, Jeffrey Barrick, Zasha Weinberg, Shane Neph, Ronald Breaker, Martin Tompa and Walter L. Ruzzo

An approach for cis-regulatory RNA discovery in bacteria

1. Get all sequenced bacterial genomes
2. Group upstream sequences per CDD
3. Find most promising genes, based on sequence motifs conserved in group
4. From those, find most promising candidates, incorporating structure in the motifs
5. From those, genome-wide searches for more instances

6. Expert analyses (Breaker Lab, Yale)

2946 CDD groups 35975 motifs 1740 motifs 1466 motifs

Retrieve upstream sequences Motif postprocessing Identify CDD group members

< 10 CPU days

Motif postprocessing Footprinter ranking

< 10 CPU days

CMfinder

1 ~ 2 CPU months

RaveNnA

10 CPU months

CMfinder refinement

< 1 CPU month

Genome Scale Search: Why

Most riboswitches, e.g., are present in ~5 copies per genome
Throughout (most of) clade
More examples give better model, hence even more examples, fewer errors
More examples give more clues to function

SLIDE 25

25

Genome Scale Search: How

CMfinder is directly usable for/with search

[Flowchart labels: Folding predictions, Smart heuristics, Candidate alignment, CM, Realign, Search]

Results

Process largely complete in

– bacillus/clostridia
– gamma proteobacteria
– cyanobacteria
– actinobacteria

Analysis ongoing

Actino Results: finding known RNAs

Rfam Family | Type (metabolite)        | Rank
THI         | riboswitch (thiamine)    | 4
ydaO-yuaA   | riboswitch (unknown)     | 19
Cobalamin   | riboswitch (cobalamin)   | 21
SRP_bact    | gene                     | 28
RFN         | riboswitch (FMN)         | 39
yybP-ykoY   | riboswitch (unknown)     | 48
gcvT        | riboswitch (glycine)     | 53
S_box       | riboswitch (SAM)         | 401
tmRNA       | gene                     | Not found
RNaseP      | gene                     | Not found

(Rows of type “gene” are not cis-regulatory.)

Rank | # | CDD | Gene: Description | Annotation
6 | 69 | 28178 | DHOase IIa: Dihydroorotase | PyrR attenuator [22]
15 | 33 | 10097 | RplL: Ribosomal protein L7/L1 | L10 r-protein leader; see Supp
19 | 36 | 10234 | RpsF: Ribosomal protein S6 | S6 r-protein leader
22 | 32 | 10897 | COG1179: Dinucleotide-utilizing enzymes | 6S RNA [25]
27 | 27 | 9926 | RpsJ: Ribosomal protein S10 | S10 r-protein leader; see Supp
29 | 11 | 15150 | Resolvase: N terminal domain |
31 | 31 | 10164 | InfC: Translation initiation factor 3 | IF-3 r-protein leader; see Supp
41 | 26 | 10393 | RpsD: Ribosomal protein S4 and related proteins | S4 r-protein leader; see Supp [30]
44 | 30 | 10332 | GroL: Chaperonin GroEL | HrcA DNA binding site [46]
46 | 33 | 25629 | Ribosomal L21p: Ribosomal prokaryotic L21 protein | L21 r-protein leader; see Supp
50 | 11 | 5638 | Cad: Cadmium resistance transporter | [47]
51 | 19 | 9965 | RplB: Ribosomal protein L2 | S10 r-protein leader
55 | 7 | 26270 | RNA pol Rpb2 1: RNA polymerase beta subunit |
69 | 9 | 13148 | COG3830: ACT domain-containing protein |
72 | 28 | 4174 | Ribosomal S2: Ribosomal protein S2 | S2 r-protein leader
74 | 9 | 9924 | RpsG: Ribosomal protein S7 | S12 r-protein leader
86 | 6 | 12328 | COG2984: ABC-type uncharacterized transport system |
88 | 19 | 24072 | CtsR: Firmicutes transcriptional repressor of class III | CtsR DNA binding site [48]
100 | 21 | 23019 | Formyl trans N: Formyl transferase |
103 | 8 | 9916 | PurE: Phosphoribosylcarboxyaminoimidazole |
117 | 5 | 13411 | COG4129: Predicted membrane protein |
120 | 10 | 10075 | RplO: Ribosomal protein L15 | L15 r-protein leader
121 | 9 | 10132 | RpmJ: Ribosomal protein L36 | IF-1 r-protein leader
129 | 4 | 23962 | Cna B: Cna protein B-type domain |
130 | 9 | 25424 | Ribosomal S12: Ribosomal protein S12 | S12 r-protein leader
131 | 9 | 16769 | Ribosomal L4: Ribosomal protein L4/L1 family | L3 r-protein leader
136 | 7 | 10610 | COG0742: N6-adenine-specific methylase | ylbH putative RNA motif [4]
140 | 12 | 8892 | Pencillinase R: Penicillinase repressor | BlaI, MecI DNA binding site [49]
157 | 25 | 24415 | Ribosomal S9: Ribosomal protein S9/S16 | L13 r-protein leader; Fig 3
160 | 27 | 1790 | Ribosomal L19: Ribosomal protein L19 | L19 r-protein leader; Fig 2
164 | 6 | 9932 | GapA: Glyceraldehyde-3-phosphate dehydrogenase/erythrose |
174 | 8 | 13849 | COG4708: Predicted membrane protein |
176 | 7 | 10199 | COG0325: Predicted enzyme with a TIM-barrel fold |
182 | 9 | 10207 | RpmF: Ribosomal protein L32 | L32 r-protein leader
187 | 11 | 27850 | LDH: L-lactate dehydrogenases |
190 | 11 | 10094 | CspR: Predicted rRNA methylase |
194 | 9 | 10353 | FusA: Translation elongation factors | EF-G r-protein leader

Table 3: High ranking motifs not found in Rfam

SLIDE 26

26

mRNA leader mRNA leader switch?

                          Membership          Overlap            Structure
Rfam                      #    Sn     Sp      nt   Sn    Sp      bp  Sn    Sp
RF00174 Cobalamin         183  0.74¹  0.97    152  0.75  0.85    20  0.60  0.77
RF00504 Glycine           92   0.56¹  0.96    94   0.94  0.68    17  0.84  0.82
RF00234 glmS              34   0.92   1.00    100  0.54  1.00    27  0.96  0.97
RF00168 Lysine            80   0.82   0.98    111  0.61  0.68    26  0.76  0.87
RF00167 Purine            86   0.86   0.93    83   0.83  0.55    17  0.90  0.95
RF00050 RFN               133  0.98   0.99    139  0.96  1.00    12  0.66  0.65
RF00011 RNaseP_bact_b     144  0.99   0.99    194  0.53  1.00    38  0.72  0.78
RF00162 S_box             208  0.95   0.97    110  1.00  0.69    23  0.91  0.78
RF00169 SRP_bact          177  0.92   0.95    99   1.00  0.65    25  0.89  0.81
RF00230 T-box             453  0.96   0.61    187  0.77  1.00    5   0.32  0.38
RF00059 THI               326  0.89   1.00    99   0.91  0.69    13  0.56  0.74
RF00442 ykkC-yxkD         19   0.90   0.53    99   0.94  0.81    18  0.94  0.68
RF00380 ykoK              49   0.92   1.00    125  0.75  1.00    27  0.80  0.95
RF00080 yybP-ykoY         41   0.32   0.89    100  0.78  0.90    18  0.63  0.66
mean                      145  0.84   0.91    121  0.81  0.82    21  0.75  0.77
median                    113  0.91   0.97    105  0.81  0.83    19  0.78  0.78

Tbl 2: Prediction accuracy compared to prokaryotic subset of Rfam full alignments.

Membership: # of seqs in overlap between our predictions and Rfam’s, the sensitivity (Sn) and specificity (Sp) of our membership predictions. Overlap: the avg len of overlap between our predictions and Rfam’s (nt), the fractional lengths of the overlapped region in Rfam’s predictions (Sn) and in ours (Sp). Structure: the avg # of correctly predicted canonical base pairs (in overlapped regions) in the secondary structure (bp), and sensitivity and specificity of our predictions.

¹ After 2nd RaveNnA scan, membership Sn of Glycine and Cobalamin increased to 76% and 98% resp., Glycine Sp unchanged, but Cobalamin Sp dropped to 84%.

Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline

NAR 2007

Zasha Weinberg, Jeffrey E. Barrick, Zizhen Yao, Adam Roth, Jane N. Kim, Jeremy Gore, Joy Xin Wang, Elaine R. Lee, Kirsten F. Block, Narasimhan Sudarsan, Shane Neph, Martin Tompa, Walter L. Ruzzo and Ronald R. Breaker

SLIDE 27

27

Columns: Motif, RNA?, Cis?, Switch?, Phylum/class, M,V, Cov., #, Non cis (blank cells are omitted)

GEMM Y Y y Widespread V 21 322 12/309
Moco Y Y Y Widespread M,V 15 105 3/81
SAH Y Y Y Proteobacteria M,V 22 42 0/41
SAM-IV Y Y Y Actinobacteria V 28 54 2/54
COG4708 Y Y y Firmicutes M,V 8 23 0/23
sucA Y Y y proteobacteria 9 40 0/40
23S-methyl Y Y n Firmicutes 12 38 1/37
hemB Y ? ? proteobacteria V 12 50 2/50
(anti-hemB) (n) (n) (37) (31/37)
MAEB ? Y n proteobacteria 3 662 15/646
mini-ykkC Y Y ? Widespread V 17 208 1/205
purD y Y ? proteobacteria M 16 21 0/20
6C y ? n Actinobacteria 21 27 1/27
alpha-transposases ? N N proteobacteria 16 102 39/99
excisionase ? ? n Actinobacteria 7 27 0/27
ATPC y ? ? Cyanobacteria 11 29 0/23
cyano-30S Y Y n Cyanobacteria 7 26 0/23
lacto-1 ? ? n Firmicutes 10 97 18/95
lacto-2 y N n Firmicutes 14 357 67/355
TD-1 y ? n Spirochaetes M,V 25 29 2/29
TD-2 y N n Spirochaetes V 11 36 17/36
coccus-1 ? N N Firmicutes 6 246 112/189
gamma-150 ? N N proteobacteria 9 27 6/27

Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and Gorodkin.

Genome Research, Feb 2008, 18(2):242-251 PMID: 18096747

Finding vertebrate ncRNAs

Natural approach: Align, Fold, Score
UCSC Browser tracks for Evofold, RNAz
Thousands of candidates

Alignment Matters

SLIDE 28

28

Comparison with Evofold, RNAz

[Venn diagram of CMfinder, Evofold, and RNAz candidate sets; region counts: 4799, 3134, 1781, 548, 44, 169, 230]

Small overlap (w/ highly significant p-values) emphasizes complementarity
Strong association with known genes
Strong association with “Indel purified segments” - i.e., apparently under selection

10 of 11 top expressed, usually differentially

Assoc w/ coding genes

Many known human ncRNAs lie in introns
Several of our candidates do, too, including some of the tested ones

#6: SYN3 (Synapsin 3)
#10: TIMP3, antisense within SYN3 intron
#9: GRM8 (glutamate receptor metabotropic 8)

SLIDE 29

29

Estimated FDR

Software

Infernal - (Eddy et al.) most of Eddy & Durbin
RaveNnA - (Weinberg) fast filtering
CMfinder - (Yao) Motif discovery (local alignment)

Open Problems - Better CM’s

Optional- and variable-length stems
Riboswitches & other regulatory RNAs often switch between conformations; better search & alignment exploiting both alternatives?
“Augmented” CM handling pseudoknots probably too slow for scan, but plausibly could be used for alignment
Better use of prior knowledge? (GNRA tetraloops, single-stranded A’s…)

Open Problems - Better algorithms & scoring

incorporating phylogeny in model construction & scoring
– e.g. “mutual information” ignores it
improve scoring by “shuffling”
other ideas for scan filtering
comparing & clustering RNA structures
search/alignment/inference with splicing

SLIDE 30

30

Open Problems - Applications & Biology

clustering intergenic sequences, esp prokaryotic
systematic look at eukaryotic UTRs

how to cluster? how to score?

“swiss-cheese phylogenies”
evidence for selection (no dN/dS)

Summary

ncRNA is a “hot” topic
For family homology modeling: CMs
Training & search like HMM (but slower)
Dramatic acceleration possible
Automated model construction possible
New computational methods lead to new discoveries