Modeling and Searching for Non-Coding RNA, W.L. Ruzzo



slide-1
SLIDE 1

Modeling and Searching for Non-Coding RNA

W.L. Ruzzo

http://www.cs.washington.edu/homes/ruzzo
http://www.cs.washington.edu/homes/ruzzo/courses/gs541/10sp

slide-2
SLIDE 2

GENOME 541 Syllabus

“… protein and DNA sequence analysis … to determine the "periodic table of biology," i.e., the list of proteins …, which can be regarded as the first stage in…”

No mention of RNA…

slide-3
SLIDE 3

The Message

  • Cells make lots of RNA (noncoding RNA)
  • Functionally important, functionally diverse
  • Structurally complex
  • New tools required: alignment, discovery, search, scoring, etc.

slide-4
SLIDE 4

Rough Outline

Today
  • Noncoding RNA examples
  • RNA structure prediction

Lecture 2
  • RNA “motif” models
  • Search

Lecture 3
  • Motif discovery
  • Applications

slide-5
SLIDE 5

RNA

DNA: DeoxyriboNucleic Acid
RNA: RiboNucleic Acid

Like DNA, except:
  • Has an OH on the ribose (backbone sugar) that DNA lacks
  • Uracil (U) in place of thymine (T)
  • A, G, C as before

[Figure: uracil next to thymine. Thymine carries an extra CH3 group; both pair with A.]

slide-6
SLIDE 6
Fig. 2. The arrows show the situation as it seemed in 1958. Solid arrows represent probable transfers, dotted arrows possible transfers. The absent arrows (compare Fig. 1) represent the impossible transfers postulated by the central dogma. They are the three possible arrows starting from protein.

slide-7
SLIDE 7

“Classical” RNAs

  • rRNA - ribosomal RNA (~4 kinds, 120-5k nt)
  • tRNA - transfer RNA (~61 kinds, ~75 nt)
  • RNaseP - tRNA processing (~300 nt)
  • snRNA - small nuclear RNA (splicing: U1, etc., 60-300 nt)
  • a handful of others

slide-8
SLIDE 8

RNA Secondary Structure: RNA makes helices too

Usually single stranded

[Figure: a single RNA strand, 5´ to 3´, folded back on itself into a hairpin, with its base pairs marked.]

slide-9
SLIDE 9

Bacteria

Triumph of proteins
  • ~80% of genome is coding DNA
  • Functionally diverse:
      receptors
      motors
      catalysts
      regulators (Jacob & Monod, Nobel prize 1965)
      …

slide-10
SLIDE 10

Proteins catalyze & regulate biochemistry

slide-11
SLIDE 11

Not the only way!

[Figure: the “protein way” (Alberts et al., 3e) next to the riboswitch alternative.]

  • SAM riboswitch: Grundy & Henkin, Mol. Microbiol. 1998; Epshtein et al., PNAS 2003; Winkler et al., Nat. Struct. Biol. 2003

slide-12
SLIDE 12

Not the only way!

[Figure: the “protein way” (Alberts et al., 3e) next to the riboswitch alternatives.]

  • SAM-I: Grundy, Epshtein, Winkler et al., 1998, 2003
  • SAM-II: Corbino et al., Genome Biol. 2005

slide-13
SLIDE 13

Not the only way!

[Figure: the “protein way” (Alberts et al., 3e) next to the riboswitch alternatives.]

  • SAM-I: Grundy, Epshtein, Winkler et al., 1998, 2003
  • SAM-II: Corbino et al., Genome Biol. 2005
  • SAM-III: Fuchs et al., NSMB 2006

slide-14
SLIDE 14

Not the only way!

[Figure: the “protein way” (Alberts et al., 3e) next to the riboswitch alternatives.]

  • SAM-I: Grundy, Epshtein, Winkler et al., 1998, 2003
  • SAM-II: Corbino et al., Genome Biol. 2005
  • SAM-III: Fuchs et al., NSMB 2006
  • SAM-IV: Weinberg et al., RNA 2008

slide-15
SLIDE 15

Not the only way!

[Figure: the “protein way” (Alberts et al., 3e) next to the riboswitch alternatives.]

  • SAM-I: Grundy, Epshtein, Winkler et al., 1998, 2003
  • SAM-II: Corbino et al., Genome Biol. 2005
  • SAM-III: Fuchs et al., NSMB 2006
  • SAM-IV: Weinberg et al., RNA 2008
  • Meyer et al., BMC Genomics 2009

slide-16
SLIDE 16


slide-17
SLIDE 17


slide-18
SLIDE 18

Riboswitches

  • ~20 ligands known; multiple nonhomologous solutions for some
  • dozens to hundreds of instances of each
  • TPP known in archaea & eukaryotes
  • one known in bacteriophage
  • on/off; transcription/translation; splicing; combinatorial control
  • In some bacteria, more riboregulators identified than protein TFs
  • all found since ~2003

slide-19
SLIDE 19


slide-20
SLIDE 20

ncRNA Example: T-boxes

slide-21
SLIDE 21

ncRNA Example: 6S

  • medium size (175 nt)
  • structured
  • highly expressed in E. coli in certain growth conditions
  • sequenced in 1971; function unknown for 30 years

slide-22
SLIDE 22

6S mimics an open promoter

Barrick et al., RNA 2005; Trotochaud et al., NSMB 2005; Willkomm et al., NAR 2005

[Figure: 6S RNA structures from E. coli, Bacillus/Clostridium, and Actinobacteria.]

slide-23
SLIDE 23

Zasha Weinberg, Jonathan Perreault, Michelle M. Meyer & Ronald R. Breaker, "Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis." Nature (Letters), Vol 462, 3 December 2009, doi:10.1038/nature08586

[Figure: average size (nucleotides, 1 to 1,000) plotted against multistem junctions plus pseudoknots for 23S rRNA, 16S rRNA, Group II intron, tmRNA, OLE, Group I intron, RNase P, AdoCbl riboswitch, glmS ribozyme, Lysine riboswitch, IMES-1, IMES-2, GOLLD, and HEARO. Legend: ribozyme, not ribozyme, unknown function.]

slide-24
SLIDE 24

RNAs of unusual size and complexity

[Figure: HEARO RNA. Panel a: consensus secondary structure with a pseudoknot, variable-length regions, 5′ and 3′ integration sites, and an embedded ORF; one stem usually has an A bulge or A-C mismatch. Panel b: aligned sequences around HEARO insertions in A. variabilis and Nostoc sp.]

slide-25
SLIDE 25

[Figure: GOLLD RNA. Panel a: consensus secondary structure with several pseudoknots and E-loops; one variable region of 0–129 nt can contain a tRNA. Legend: nucleotide identity and nucleotide presence levels (50%, 75%, 90%, 97%); base pair annotations (covarying mutations, compatible mutations, no mutations observed); variable-length hairpin, loop, and region; zero-length connector; modular sub-structure; R: A or G, Y: C or U; nt: nucleotides. Panel b: bacterial cell density, GOLLD RNA, and GOLLD phage genomic DNA as a fraction of maximum over 2 to 22 hours, with mitomycin C versus no treatment.]

slide-26
SLIDE 26

RNAs of unusual abundance

  • More abundant than 5S rRNA
  • From unknown marine organisms

slide-27
SLIDE 27

Summary: RNA in Bacteria

  • Widespread, deeply conserved, structurally sophisticated, functionally diverse, biologically important uses for ncRNA throughout the prokaryotic world.
  • Regulation of MANY genes involves RNA
  • In some species, we know identities of more riboregulators than protein regulators
  • Dozens of classes & thousands of new examples in just the last 5 years

slide-28
SLIDE 28

Vertebrates

  • Bigger, more complex genomes
  • <2% coding
  • But >5% conserved in sequence?
  • And 50-90% transcribed?
  • And structural conservation, if any, invisible (without proper alignments, etc.)

What’s going on?

slide-29
SLIDE 29

Vertebrate ncRNAs

mRNA, tRNA, rRNA, … of course

PLUS: snRNA, spliceosome, snoRNA, telomerase, microRNA, RNAi, SECIS, IRE, piwi-RNA, XIST (X-inactivation), ribozymes, …

slide-30
SLIDE 30

MicroRNA

  • 1st discovered 1992 in C. elegans
  • 2nd discovered 2000, also C. elegans (and human, fly, everything between)
  • 21-23 nucleotides; literally fell off ends of gels
  • Hundreds now known in human
      may regulate 1/3-1/2 of all genes
      development, stem cells, cancer, infectious diseases, …

slide-31
SLIDE 31

siRNA

  • “Short Interfering RNA”
  • Also discovered in C. elegans
  • Possibly an antiviral defense, shares machinery with miRNA pathways
  • Allows artificial repression of most genes in most higher organisms
  • Huge tool for biology & biotech

2006 Nobel Prize: Fire & Mello

slide-32
SLIDE 32

Human Predictions

Evofold
  S Pedersen, G Bejerano, A Siepel, K Rosenbloom, K Lindblad-Toh, ES Lander, J Kent, W Miller, D Haussler, "Identification and classification of conserved RNA secondary structures in the human genome." PLoS Comput. Biol., 2, #4 (2006) e33.
  • 48,479 candidates (~70% FDR?)

RNAz
  S Washietl, IL Hofacker, M Lukasser, A Hüttenhofer, PF Stadler, "Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome." Nat. Biotechnol., 23, #11 (2005) 1383-90.
  • 30,000 structured RNA elements
  • 1,000 conserved across all vertebrates
  • ~1/3 in introns of known genes, ~1/6 in UTRs
  • ~1/2 located far from any known gene

FOLDALIGN
  E Torarinsson, M Sawera, JH Havgaard, M Fredholm, J Gorodkin, "Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure." Genome Res., 16, #7 (2006) 885-9.
  • 1800 candidates from 36970 (of 100,000) pairs

CMfinder
  Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and Gorodkin, "Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions." Genome Research, Feb 2008, 18(2):242-251. PMID: 18096747
  • 6500 candidates in ENCODE alone (better FDR, but still high)

slide-33
SLIDE 33

Bottom line?

  • A significant number of “one-off” examples
  • Extremely widespread ncRNA expression
  • At a minimum, a vast evolutionary substrate
  • New technology (e.g. RNAseq) exposing more
  • How do you recognize an interesting one?
  • Conserved secondary structure

slide-34
SLIDE 34

RNA Secondary Structure: RNA makes helices too

Usually single stranded

[Figure: a single RNA strand, 5´ to 3´, folded back on itself into a hairpin, with its base pairs marked.]

slide-35
SLIDE 35

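RNA Secondary Structure: can be fixed while sequence evolves

[Figure: two hairpins with the same pairing pattern but different sequences. Substitutions on one side of the stem are compensated on the other side, so the pairs are preserved; one of the new pairs is a G-U pair.]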
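The point of this slide can be checked mechanically. The following is a minimal Python sketch, assuming 1-based pair positions and treating Watson-Crick and G-U pairs as allowed. The function name `supports_structure`, the toy sequences, and the pair list are illustrative assumptions and are not the sequences drawn in the figure.

```python
# Allowed pairs: Watson-Crick (C-G, A-U) plus the G-U wobble pair.
ALLOWED = {frozenset("CG"), frozenset("AU"), frozenset("GU")}

def supports_structure(seq, pairs):
    """Return True if every 1-based pair (i, j) joins two bases that can pair."""
    return all(frozenset((seq[i - 1], seq[j - 1])) in ALLOWED for i, j in pairs)

# Hypothetical hairpin: three stacked pairs closing a 4-base loop.
stem = [(1, 10), (2, 9), (3, 8)]
ancestor = "GGGAAAACCC"
variant = "GAGAAAACUU"   # three substitutions; pairs become G-U, A-U, G-C
broken = "GGGAAAACAC"    # one substitution; position 2-9 is now G-A
```

Here `ancestor` and `variant` differ at three positions yet both support the same stem, while the single uncompensated change in `broken` does not.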

slide-36
SLIDE 36

Why is RNA hard to deal with?

[Figure: a large RNA secondary structure drawn base by base.]

A: Structure often more important than sequence

slide-37
SLIDE 37

Structure Prediction

slide-38
SLIDE 38

RNA Structure

  • Primary Structure: Sequence
  • Secondary Structure: Pairing
  • Tertiary Structure: 3D shape

slide-39
SLIDE 39

RNA Pairing

Watson-Crick Pairing
  • C - G   ~3 kcal/mole
  • A - U   ~2 kcal/mole

“Wobble Pair”
  • G - U   ~1 kcal/mole

Non-canonical Pairs (esp. if modified)
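As a small illustration, the pairing rules above can be written as a lookup table. This is a sketch only: the names `PAIR_STRENGTH`, `pair_strength`, and `can_pair` are assumptions, the numbers are the approximate magnitudes from the slide, and non-canonical pairs are ignored.

```python
# Approximate pair strengths from the slide, in kcal/mole (magnitudes only).
PAIR_STRENGTH = {
    frozenset("CG"): 3.0,  # Watson-Crick
    frozenset("AU"): 2.0,  # Watson-Crick
    frozenset("GU"): 1.0,  # wobble
}

def pair_strength(a, b):
    """Approximate strength of an a-b pair, or 0.0 if the bases do not pair."""
    return PAIR_STRENGTH.get(frozenset((a, b)), 0.0)

def can_pair(a, b):
    """True for Watson-Crick and G-U wobble pairs, in either order."""
    return pair_strength(a, b) > 0.0
```

Using `frozenset` keys makes the lookup symmetric, so `pair_strength("G", "C")` and `pair_strength("C", "G")` agree.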

slide-40
SLIDE 40

tRNA 3D Structure

slide-41
SLIDE 41

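tRNA - Alt. Representations

[Figure: two drawings of the same tRNA, each with the anticodon loop labeled; the 5’ and 3’ ends and the amino acid (a.a.) attachment site are marked.]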

slide-42
SLIDE 42

tRNA - Alt. Representations

[Figure: two further drawings of the same tRNA, each with the anticodon loop and the 5’ and 3’ ends labeled.]

slide-43
SLIDE 43

Definitions

Sequence: 5’ $r_1 r_2 r_3 \ldots r_n$ 3’, with each $r_i$ in {A, C, G, U}

A Secondary Structure is a set of pairs i•j such that:
  • $i < j-4$ (no sharp turns), and
  • if i•j and i’•j’ are two different pairs with $i \le i'$, then either $j < i'$ (the 2nd pair follows the 1st) or $i < i' < j' < j$ (the 2nd pair is nested within the 1st); no “pseudoknots.”
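The definition translates directly into a checker. The sketch below assumes pairs are given as 1-based `(i, j)` tuples with `i < j`; the function name `is_valid_structure` and the `min_loop` parameter are illustrative choices, not part of the lecture.

```python
def is_valid_structure(pairs, n, min_loop=4):
    """Check the slide's definition for a set of 1-based pairs on a length-n sequence."""
    pairs = sorted(set(pairs))          # sorting guarantees i <= i2 below
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False
        if not i < j - min_loop:        # sharp turn
            return False
    for a in range(len(pairs)):
        i, j = pairs[a]
        for b in range(a + 1, len(pairs)):
            i2, j2 = pairs[b]
            follows = j < i2            # 2nd pair follows the 1st
            nested = i < i2 and j2 < j  # 2nd pair is nested within the 1st
            if not (follows or nested):
                return False            # crossing (pseudoknot) or shared base
    return True
```

Note that the follows-or-nested condition also rules out a base taking part in two pairs, so no separate check is needed for that.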

slide-44
SLIDE 44

RNA Secondary Structure: Examples

[Figure: several candidate foldings of short sequences, labeled "base pair", "ok", "sharp turn" (loop too short), and "crossing" (pseudoknot).]

slide-45
SLIDE 45

[Figure: three arrangements of two pairs, labeled Nested, Pseudoknot, and Precedes.]

slide-46
SLIDE 46

Approaches to Structure Prediction

Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate

Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold

Partition Function
  + finds all folds
  - ignores pseudoknots

slide-47
SLIDE 47

Nussinov: Max Pairing

$B(i,j)$ = # pairs in optimal pairing of $r_i \ldots r_j$
$B(i,j) = 0$ for all $i, j$ with $i \ge j-4$; otherwise
$B(i,j) = \max$ of:
  $B(i,j-1)$
  $\max \{\, B(i,k-1) + 1 + B(k+1,j-1) \mid i \le k < j-4 \text{ and } r_k\text{-}r_j \text{ may pair} \,\}$

R Nussinov, AB Jacobson, "Fast algorithm for predicting the secondary structure of single-stranded RNA." PNAS 1980.
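The recurrence can be transcribed almost literally as a memoized recursion. This is a sketch under stated assumptions: Watson-Crick and G-U pairs count as "may pair", the function name `max_pairs` is illustrative, and the recursive form is only suitable for short sequences because of Python's recursion limit.

```python
from functools import lru_cache

# Ordered base pairs that "may pair": Watson-Crick plus G-U wobble (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def max_pairs(seq):
    """Return B(1, n), the maximum number of pairs in a nested structure."""
    n = len(seq)
    r = " " + seq                      # pad so that r[1] is the first base

    @lru_cache(maxsize=None)
    def B(i, j):
        if i >= j - 4:                 # too short to hold a pair
            return 0
        best = B(i, j - 1)             # case 1: j is unpaired
        for k in range(i, j - 4):      # case 2: j pairs with some k, i <= k < j-4
            if (r[k], r[j]) in PAIRS:
                best = max(best, B(i, k - 1) + 1 + B(k + 1, j - 1))
        return best

    return B(1, n)
```

For example, `max_pairs("GGGAAAACCC")` finds the three nested G-C pairs around the four-base loop.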

slide-48
SLIDE 48

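“Optimal pairing of $r_i \ldots r_j$”

Two possibilities
  • j Unpaired: find best pairing of $r_i \ldots r_{j-1}$
  • j Paired (with some k): find best pairing of $r_i \ldots r_{k-1}$ + best pairing of $r_{k+1} \ldots r_{j-1}$, plus 1

Why is it slow?
Why do pseudoknots matter?

[Figure: the two cases drawn on the interval i to j: j unpaired leaves the subproblem i to j-1; j paired with k leaves the subproblems i to k-1 and k+1 to j-1.]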

slide-49
SLIDE 49

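Nussinov: A Computation Order

$B(i,j)$ = # pairs in optimal pairing of $r_i \ldots r_j$
$B(i,j) = 0$ for all $i, j$ with $i \ge j-4$; otherwise
$B(i,j) = \max$ of:
  $B(i,j-1)$
  $\max \{\, B(i,k-1) + 1 + B(k+1,j-1) \mid i \le k < j-4 \text{ and } r_k\text{-}r_j \text{ may pair} \,\}$

Time: $O(n^3)$

[Figure: the table of $B(i,j)$ filled one diagonal at a time, with successive diagonals labeled K = 2, 3, 4, 5.]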
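One valid computation order fills the table by increasing span $j - i$, so every entry the recurrence needs is already present. The sketch below assumes Watson-Crick and G-U pairs may pair; the name `nussinov_table` and the padded 1-based table layout are illustrative choices.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def nussinov_table(seq):
    """Fill B bottom-up; B[i][j] uses 1-based indices, B[1][n] is the answer."""
    n = len(seq)
    r = " " + seq
    # Extra row and column so that lookups such as B[i][i-1] stay in range.
    B = [[0] * (n + 2) for _ in range(n + 2)]
    for span in range(5, n):                  # spans below 5 keep B = 0
        for i in range(1, n - span + 1):
            j = i + span
            best = B[i][j - 1]                # j unpaired
            for k in range(i, j - 4):         # j paired with k
                if (r[k], r[j]) in PAIRS:
                    best = max(best, B[i][k - 1] + 1 + B[k + 1][j - 1])
            B[i][j] = best
    return B
```

The three nested loops over `span`, `i`, and `k` are where the $O(n^3)$ time comes from; the table itself takes $O(n^2)$ space.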

slide-50
SLIDE 50

Which Pairs?

Usual dynamic programming “trace-back” tells you which base pairs are in the optimal solution, not just how many
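A trace-back can be sketched by re-checking, for each interval, which case of the max-pairing recurrence produced the stored value. The code below is one possible version, assuming Watson-Crick and G-U pairs may pair; the name `fold` and the explicit stack are illustrative choices, and when several optimal structures exist it returns just one of them.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def fold(seq):
    """Return one optimal set of 1-based pairs under the max-pairing model."""
    n = len(seq)
    r = " " + seq
    B = [[0] * (n + 2) for _ in range(n + 2)]
    for span in range(5, n):                       # fill, shortest spans first
        for i in range(1, n - span + 1):
            j = i + span
            best = B[i][j - 1]
            for k in range(i, j - 4):
                if (r[k], r[j]) in PAIRS:
                    best = max(best, B[i][k - 1] + 1 + B[k + 1][j - 1])
            B[i][j] = best

    pairs, stack = [], [(1, n)]
    while stack:                                   # trace-back
        i, j = stack.pop()
        if i >= j - 4:
            continue
        if B[i][j] == B[i][j - 1]:                 # j unpaired explains the value
            stack.append((i, j - 1))
            continue
        for k in range(i, j - 4):                  # otherwise j pairs with some k
            if (r[k], r[j]) in PAIRS and B[i][j] == B[i][k - 1] + 1 + B[k + 1][j - 1]:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return sorted(pairs)
```

For `"GGGAAAACCC"` this recovers the stem `[(1, 10), (2, 9), (3, 8)]`.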

slide-51
SLIDE 51

Approaches to Structure Prediction

Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate

Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold

Partition Function
  + finds all folds
  - ignores pseudoknots

slide-52
SLIDE 52

Pair-based Energy Minimization

$E(i,j)$ = energy of pairs in optimal pairing of $r_i \ldots r_j$
$E(i,j) = 0$ for all $i, j$ with $i \ge j-4$ (no pair fits); otherwise
$E(i,j) = \min$ of:
  $E(i,j-1)$
  $\min \{\, E(i,k-1) + e(r_k, r_j) + E(k+1,j-1) \mid i \le k < j-4 \,\}$

where $e(r_k, r_j)$ is the energy of the k-j pair.

Time: $O(n^3)$
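A minimal sketch of the pair-based recurrence follows. It rests on several assumptions that are not on the slide: pair energies are taken as the negated approximate values from the RNA Pairing slide (stabilizing pairs lower the energy), bases that cannot pair get infinite energy, and an interval too short to hold a pair has energy 0. The names `e` and `min_energy` are illustrative.

```python
import math

# Assumed pair energies in kcal/mole; negative means stabilizing.
PAIR_ENERGY = {frozenset("CG"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}

def e(a, b):
    """Energy of an a-b pair; infinite if the bases cannot pair."""
    return PAIR_ENERGY.get(frozenset((a, b)), math.inf)

def min_energy(seq):
    """Return E(1, n), the minimum total pair energy over nested structures."""
    n = len(seq)
    r = " " + seq
    E = [[0.0] * (n + 2) for _ in range(n + 2)]   # short intervals: energy 0
    for span in range(5, n):
        for i in range(1, n - span + 1):
            j = i + span
            best = E[i][j - 1]                     # j unpaired
            for k in range(i, j - 4):              # j paired with k
                best = min(best, E[i][k - 1] + e(r[k], r[j]) + E[k + 1][j - 1])
            E[i][j] = best
    return E[1][n]
```

The structure is the same as the max-pairing table; only the score per pair and the direction of optimization change.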

slide-53
SLIDE 53

Loop-based Energy Minimization

Detailed experiments show it’s more accurate to model based on loops, rather than just pairs

Loop types
  • 1. Hairpin loop
  • 2. Stack
  • 3. Bulge
  • 4. Interior loop
  • 5. Multiloop

[Figure: an RNA structure with one example of each loop type, labeled 1 to 5.]

slide-54
SLIDE 54

Zuker: Loop-based Energy, I

$W(i,j)$ = energy of optimal pairing of $r_i \ldots r_j$
$V(i,j)$ = as above, but forcing pair i•j
$W(i,j) = 0$ and $V(i,j) = \infty$ for all $i, j$ with $i \ge j-4$
$W(i,j) = \min\bigl( W(i,j-1),\ \min \{\, W(i,k-1) + V(k,j) \mid i \le k < j-4 \,\} \bigr)$

slide-55
SLIDE 55

Zuker: Loop-based Energy, II

$V(i,j) = \min\bigl( eh(i,j),\ es(i,j) + V(i+1,j-1),\ VBI(i,j),\ VM(i,j) \bigr)$
$VM(i,j) = \min \{\, W(i+1,k) + W(k+1,j-1) \mid i < k < j-1 \,\}$
$VBI(i,j) = \min \{\, ebi(i,j,i',j') + V(i',j') \mid i < i' < j' < j \text{ and } i'-i+j-j' > 2 \,\}$

The four terms of $V(i,j)$ correspond to: hairpin ($eh$), stack ($es$), bulge/interior loop ($VBI$), and multiloop ($VM$).

Time: $O(n^4)$; $O(n^3)$ possible if $ebi(\cdot)$ is “nice”
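To make the shape of the W/V recursion concrete, here is a toy sketch. Everything numeric in it is an assumption chosen for illustration, not a real thermodynamic parameter: a flat hairpin cost, a flat stacking bonus, a bulge/interior cost that grows with the number of unpaired bases, and a flat multiloop penalty added to the VM term. The multiloop split is taken over the interior of the closing pair, $W(i+1,k) + W(k+1,j-1)$, and short intervals use $W = 0$, $V = \infty$. Real programs such as Mfold use far more detailed energy tables.

```python
import math

CAN_PAIR = {frozenset("CG"), frozenset("AU"), frozenset("GU")}
# Toy energies (assumptions): hairpin, stack, loop base cost, per-base cost, multiloop.
HAIRPIN, STACK, LOOP_BASE, LOOP_PER_NT, MULTI = 4.0, -3.0, 2.0, 0.5, 6.0

def zuker_toy(seq):
    """Return W(1, n) under the toy loop-energy model above."""
    n = len(seq)
    r = " " + seq
    W = [[0.0] * (n + 2) for _ in range(n + 2)]
    V = [[math.inf] * (n + 2) for _ in range(n + 2)]
    for span in range(5, n):
        for i in range(1, n - span + 1):
            j = i + span
            if frozenset((r[i], r[j])) in CAN_PAIR:
                best = HAIRPIN                               # eh(i,j)
                best = min(best, STACK + V[i + 1][j - 1])    # es(i,j) + V(i+1,j-1)
                for i2 in range(i + 1, j - 5):               # VBI(i,j)
                    for j2 in range(i2 + 5, j):
                        if (i2 - i) + (j - j2) > 2:
                            unpaired = (i2 - i - 1) + (j - j2 - 1)
                            cost = LOOP_BASE + LOOP_PER_NT * unpaired
                            best = min(best, cost + V[i2][j2])
                for k in range(i + 1, j - 1):                # VM(i,j), i < k < j-1
                    best = min(best, MULTI + W[i + 1][k] + W[k + 1][j - 1])
                V[i][j] = best
            w = W[i][j - 1]                                  # j unpaired
            for k in range(i, j - 4):                        # j paired with k
                w = min(w, W[i][k - 1] + V[k][j])
            W[i][j] = w
    return W[1][n]
```

With these toy numbers a two-pair stem does not pay for its hairpin (`zuker_toy("GGAAAACC")` stays at the unfolded energy), while a three-pair stem does (`zuker_toy("GGGAAAACCC")` is negative), which is the kind of behavior a pair-counting model cannot express. The double loop for the bulge/interior term is what makes this version $O(n^4)$.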

slide-56
SLIDE 56

Energy Parameters

  • Q. Where do they come from?
  • A1. Experiments with carefully selected synthetic RNAs
  • A2. Learned algorithmically from trusted alignments/structures [Andronescu et al., 2007]

slide-57
SLIDE 57

Single Seq Prediction Accuracy

  • Mfold, Vienna, ... [Nussinov, Zuker, Hofacker, McCaskill]
  • Estimates suggest ~50-75% of base pairs predicted correctly in sequences of up to ~300 nt
  • Definitely useful, but obviously imperfect
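"Base pairs predicted correctly" can be made precise in more than one way. The sketch below shows two common measures, assuming both structures are given as sets of 1-based `(i, j)` pairs and that a predicted pair counts only if it matches a reference pair exactly. The function name `pair_accuracy` and the example pairs are hypothetical.

```python
def pair_accuracy(predicted, reference):
    """Return (sensitivity, ppv) for a predicted set of base pairs.

    sensitivity: fraction of reference pairs that were predicted
    ppv:         fraction of predicted pairs that are in the reference
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    sensitivity = correct / len(reference) if reference else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    return sensitivity, ppv

# Hypothetical example: 2 of 4 reference pairs found, 2 of 3 predictions correct.
reference = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted = [(1, 20), (2, 19), (3, 17)]
sens, ppv = pair_accuracy(predicted, reference)
```

Which of the two measures a reported accuracy figure refers to depends on the study, so the numbers are not always directly comparable.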

slide-58
SLIDE 58

Approaches to Structure Prediction

Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate

Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold

Partition Function
  + finds all folds
  - ignores pseudoknots

slide-59
SLIDE 59

Approaches, II

Comparative sequence analysis
  + handles all pairings (potentially incl. pseudoknots)
  - requires several (many?) aligned, appropriately diverged sequences

Stochastic Context-free Grammars
  Roughly combines min energy & comparative, but no pseudoknots

Physical experiments (x-ray crystallography, NMR)

slide-60
SLIDE 60

Summary

  • RNA has important roles beyond mRNA
      Many unexpected recent discoveries
  • Structure is critical to function
      True of proteins, too, but they’re easier to find from sequence alone due, e.g., to codon structure, which RNAs lack
  • RNA secondary structure can be predicted (to useful accuracy) by dynamic programming
  • Next: RNA “motifs” (sequence + secondary structure) well captured by “covariance models”