GENOME 541 Syllabus ! protein and DNA sequence analysis to - - PDF document



slide-1
SLIDE 1

Modeling and Searching for Non-Coding RNA

W.L. Ruzzo

http://www.cs.washington.edu/homes/ruzzo
http://www.cs.washington.edu/homes/ruzzo/courses/gs541/10sp

GENOME 541 Syllabus

“… protein and DNA sequence analysis … to determine the "periodic table of biology," i.e., the list of proteins …, which can be regarded as the first stage in…”

No mention of RNA…

The Message

Cells make lots of RNA (noncoding RNA)
Functionally important, functionally diverse
Structurally complex
New tools required
  • alignment, discovery, search, scoring, etc.

Rough Outline

Today
  • Noncoding RNA examples
  • RNA structure prediction
Lecture 2
  • RNA “motif” models
  • Search
Lecture 3
  • Motif discovery
  • Applications

RNA

DNA: DeoxyriboNucleic Acid
RNA: RiboNucleic Acid

Like DNA, except:
  • Has an extra OH on the ribose (backbone sugar), which DNA's deoxyribose lacks
  • Uracil (U) in place of thymine (T)
  • A, G, C as before

[Figure: uracil vs. thymine; thymine carries an extra CH3 group; uracil pairs with A]

Fig. 2. The arrows show the situation as it seemed in 1958. Solid arrows represent probable transfers, dotted arrows possible transfers. The absent arrows (compare Fig. 1) represent the impossible transfers postulated by the central dogma. They are the three possible arrows starting from protein.

slide-2
SLIDE 2

“Classical” RNAs

  • rRNA - ribosomal RNA (~4 kinds, 120-5k nt)
  • tRNA - transfer RNA (~61 kinds, ~75 nt)
  • RNaseP - tRNA processing (~300 nt)
  • snRNA - small nuclear RNA (splicing: U1, etc., 60-300 nt)
  • a handful of others

RNA Secondary Structure: RNA makes helices too

Usually single stranded

[Figure: a single RNA strand (5´ to 3´) folded back on itself, with its base pairs marked]

Bacteria

Triumph of proteins
~80% of genome is coding DNA
Functionally diverse
  • receptors
  • motors
  • catalysts
  • regulators (Monod & Jacob, Nobel prize 1965)
  • …

Proteins catalyze & regulate biochemistry

Not the only way!

Protein way vs. riboswitch alternatives (figure: Alberts et al., 3e.)

  • SAM-I: Grundy & Henkin, Mol. Microbiol. 1998; Epshtein et al., PNAS 2003; Winkler et al., Nat. Struct. Biol. 2003
  • SAM-II: Corbino et al., Genome Biol. 2005

slide-3
SLIDE 3

Not the only way!

Protein way vs. riboswitch alternatives (figure: Alberts et al., 3e.)

  • SAM-I: Grundy, Epshtein, Winkler et al., 1998, 2003
  • SAM-II: Corbino et al., Genome Biol. 2005
  • SAM-III: Fuchs et al., NSMB 2006
  • SAM-IV: Weinberg et al., RNA 2008
  • Meyer et al., BMC Genomics 2009

Riboswitches

  • ~20 ligands known; multiple nonhomologous solutions for some
  • Dozens to hundreds of instances of each
  • TPP known in archaea & eukaryotes
  • One known in bacteriophage
  • On/off; transcription/translation; splicing; combinatorial control
  • In some bacteria, more riboregulators identified than protein TFs
  • All found since ~2003

slide-4
SLIDE 4

ncRNA Example: T-boxes

ncRNA Example: 6S

  • Medium size (175 nt)
  • Structured
  • Highly expressed in E. coli in certain growth conditions
  • Sequenced in 1971; function unknown for 30 years

6S mimics an open promoter

Barrick et al., RNA 2005; Trotochaud et al., NSMB 2005; Willkomm et al., NAR 2005

[Figure panels: E. coli; Bacillus/Clostridium; Actinobacteria]

RNAs of unusual size and complexity

Zasha Weinberg, Jonathan Perreault, Michelle M. Meyer & Ronald R. Breaker, "Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis." Nature, Letters, Vol 462, 3 December 2009, doi:10.1038/nature08586

[Figure: structured RNA classes (23S rRNA, 16S rRNA, Group II intron, tmRNA, OLE, Group I intron, RNase P, AdoCbl riboswitch, glmS ribozyme, Lysine riboswitch, IMES-1, IMES-2, GOLLD, HEARO) plotted by average size (nucleotides) and by number of multistem junctions plus pseudoknots; classes marked as ribozyme, not ribozyme, or unknown function]

[Figure: (a) HEARO consensus secondary structure, including a pseudoknot and a stem that usually has an A bulge or A-C mismatch, with 5′ and 3′ integration sites and an ORF of 0–1490 nt; (b) HEARO insertion sites in A. variabilis and Nostoc sp.]

slide-5
SLIDE 5

[Figure: (a) GOLLD consensus secondary structure, with several pseudoknots, E-loops, and a variable-length region of 0–129 nt that can contain a tRNA. Legend: nucleotide identity and nucleotide presence levels (50% to 97%); covarying mutations, compatible mutations, no mutations observed; modular sub-structure; R: A or G, Y: C or U; nt: nucleotides. (b) Time course in hours, mitomycin C vs. no treatment: bacterial cell density, GOLLD RNA, and GOLLD phage genomic DNA, each as a fraction of maximum]

RNAs of unusual abundance

More abundant than 5S rRNA
From unknown marine organisms

Summary: RNA in Bacteria

Widespread, deeply conserved, structurally sophisticated, functionally diverse, biologically important uses for ncRNA throughout prokaryotic world.
Regulation of MANY genes involves RNA
  • In some species, we know identities of more riboregulators than protein regulators
  • Dozens of classes & thousands of new examples in just last 5 years

Vertebrates

Bigger, more complex genomes
<2% coding
But >5% conserved in sequence?
And 50-90% transcribed?
And structural conservation, if any, invisible (without proper alignments, etc.)

What’s going on?

Vertebrate ncRNAs

mRNA, tRNA, rRNA, … of course
PLUS: snRNA, spliceosome, snoRNA, telomerase, microRNA, RNAi, SECIS, IRE, piwi-RNA, XIST (X-inactivation), ribozymes, …

MicroRNA

1st discovered 1992 in C. elegans
2nd discovered 2000, also C. elegans
  • and human, fly, everything between
21-23 nucleotides
  • literally fell off ends of gels
Hundreds now known in human
  • may regulate 1/3-1/2 of all genes
  • development, stem cells, cancer, infectious diseases, …

slide-6
SLIDE 6

siRNA

“Short Interfering RNA”
Also discovered in C. elegans
Possibly an antiviral defense, shares machinery with miRNA pathways
Allows artificial repression of most genes in most higher organisms
Huge tool for biology & biotech

2006 Nobel Prize: Fire & Mello

Human Predictions

Evofold
  S Pedersen, G Bejerano, A Siepel, K Rosenbloom, K Lindblad-Toh, ES Lander, J Kent, W Miller, D Haussler, "Identification and classification of conserved RNA secondary structures in the human genome." PLoS Comput. Biol., 2, #4 (2006) e33.
  • 48,479 candidates (~70% FDR?)
RNAz
  S Washietl, IL Hofacker, M Lukasser, A Hüttenhofer, PF Stadler, "Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome." Nat. Biotechnol., 23, #11 (2005) 1383-90.
  • 30,000 structured RNA elements
  • 1,000 conserved across all vertebrates
  • ~1/3 in introns of known genes, ~1/6 in UTRs
  • ~1/2 located far from any known gene
FOLDALIGN
  E Torarinsson, M Sawera, JH Havgaard, M Fredholm, J Gorodkin, "Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure." Genome Res., 16, #7 (2006) 885-9.
  • 1800 candidates from 36970 (of 100,000) pairs
CMfinder
  Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and Gorodkin, "Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions." Genome Research, Feb 2008, 18(2):242-251. PMID: 18096747
  • 6500 candidates in ENCODE alone (better FDR, but still high)

Bottom line?

  • A significant number of “one-off” examples
  • Extremely widespread ncRNA expression
  • At a minimum, a vast evolutionary substrate
  • New technology (e.g. RNAseq) exposing more
  • How do you recognize an interesting one? Conserved secondary structure

RNA Secondary Structure: RNA makes helices too

Usually single stranded

[Figure: a single RNA strand (5´ to 3´) folded back on itself, with its base pairs marked]

RNA Secondary Structure: can be fixed while sequence evolves

[Figure: two different sequences folding into the same hairpin; paired positions change together so the pairs are preserved, including a G-U pair]

Why is RNA hard to deal with?

[Figure: a long RNA sequence drawn in its folded secondary structure]

A: Structure often more important than sequence

slide-7
SLIDE 7

Structure Prediction

RNA Structure

Primary Structure: Sequence
Secondary Structure: Pairing
Tertiary Structure: 3D shape

RNA Pairing

Watson-Crick Pairing
  • C - G: ~3 kcal/mole
  • A - U: ~2 kcal/mole
“Wobble Pair”
  • G - U: ~1 kcal/mole
Non-canonical Pairs (esp. if modified)
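
The pairing rules above can be captured in a small lookup table. This is only an illustrative sketch: the names `PAIR_ENERGY`, `pair_energy`, and `can_pair` are hypothetical, the numbers are the rough per-pair values quoted on the slide, and writing them as negative (stabilizing) energies is an assumed sign convention. Non-canonical pairs are ignored.

```python
# Rough per-pair stabilities from the slide, in kcal/mole.
# Assumption: stored as negative numbers, so "more negative" means "more stable".
PAIR_ENERGY = {
    frozenset("CG"): -3.0,  # Watson-Crick
    frozenset("AU"): -2.0,  # Watson-Crick
    frozenset("GU"): -1.0,  # wobble pair
}


def pair_energy(a, b):
    """Return the rough energy of pairing bases a and b, or None if they do not pair."""
    return PAIR_ENERGY.get(frozenset((a, b)))


def can_pair(a, b):
    """True for C-G, A-U and G-U (in either order); non-canonical pairs are not modeled."""
    return pair_energy(a, b) is not None


if __name__ == "__main__":
    for a, b in [("G", "C"), ("U", "A"), ("U", "G"), ("A", "G")]:
        print(a, b, pair_energy(a, b))
```

Using an unordered `frozenset` key means G-C and C-G share one entry, which matches the symmetric way the slide lists the pairs.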

tRNA 3D Structure

tRNA - Alt. Representations

[Figure: alternative drawings of the same tRNA, each labeling the anticodon loop, the 5’ and 3’ ends, and the amino acid (a.a.) attachment site]

slide-8
SLIDE 8

Definitions

Sequence: $5'\ r_1 r_2 r_3 \ldots r_n\ 3'$, with each $r_i \in \{A, C, G, U\}$
A Secondary Structure is a set of pairs $i \cdot j$ such that:

  • $i < j-4$ (no sharp turns), and
  • if $i \cdot j$ and $i' \cdot j'$ are two different pairs with $i \le i'$, then either
      $j < i'$ (the 2nd pair follows the 1st), or
      $i < i' < j' < j$ (the 2nd pair is nested within the 1st);
    that is, no “pseudoknots.”
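
As a concrete reading of this definition, the sketch below checks whether a set of pairs is a valid secondary structure. The function name `is_secondary_structure` and its parameters are hypothetical; positions are 1-based as in the definition, and `min_gap=4` encodes the "no sharp turns" rule.

```python
def is_secondary_structure(pairs, n, min_gap=4):
    """Check the two conditions of the definition for 1-based pairs (i, j) on a length-n sequence."""
    pairs = sorted(set(pairs))  # sorting guarantees i <= i' for any earlier/later pair

    # Condition 1: every pair lies inside the sequence and has i < j - 4 (no sharp turns).
    for i, j in pairs:
        if not (1 <= i and j <= n and i < j - min_gap):
            return False

    # Condition 2: any two different pairs either follow one another or are nested.
    for a in range(len(pairs)):
        i, j = pairs[a]
        for b in range(a + 1, len(pairs)):
            i2, j2 = pairs[b]
            follows = j < i2
            nested = i < i2 < j2 < j
            if not (follows or nested):
                return False  # crossing pairs (a pseudoknot) or a shared position
    return True


if __name__ == "__main__":
    print(is_secondary_structure([(1, 20), (2, 19)], 20))   # nested
    print(is_secondary_structure([(1, 10), (11, 20)], 20))  # one pair precedes the other
    print(is_secondary_structure([(1, 10), (5, 15)], 20))   # crossing: pseudoknot
    print(is_secondary_structure([(1, 5)], 20))             # sharp turn
```

Because both inequalities in condition 2 are strict, the same check also rejects two pairs that share a position, so each base ends up in at most one pair.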

RNA Secondary Structure: Examples

[Figure: example foldings of a short sequence. Labeled cases: a base pair; a hairpin loop of at least 4 unpaired bases (ok); a sharp turn (not allowed); a crossing pair, i.e. a pseudoknot (not allowed); and the two allowed arrangements of two pairs: nested, or one pair preceding the other]

Approaches to Structure Prediction

Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate
Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold
Partition Function
  + finds all folds
  - ignores pseudoknots

Nussinov: Max Pairing

$B(i,j)$ = number of pairs in an optimal pairing of $r_i \ldots r_j$
$B(i,j) = 0$ for all $i, j$ with $i \ge j-4$; otherwise
$$B(i,j) = \max \begin{cases} B(i,j-1) \\ \max\{\, B(i,k-1) + 1 + B(k+1,j-1) \mid i \le k < j-4 \text{ and } r_k, r_j \text{ may pair} \,\} \end{cases}$$

R Nussinov, AB Jacobson, "Fast algorithm for predicting the secondary structure of single-stranded RNA." PNAS 1980.

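“Optimal pairing of $r_i \ldots r_j$”

Two possibilities:
  • j Unpaired: find best pairing of $r_i \ldots r_{j-1}$
  • j Paired (with some k): find best $r_i \ldots r_{k-1}$ + best $r_{k+1} \ldots r_{j-1}$, plus 1

Why is it slow?
Why do pseudoknots matter?

[Figure: the two cases, showing j unpaired next to the interval i ... j-1, and j paired with k, which splits the problem into i ... k-1 and k+1 ... j-1]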
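
A minimal sketch of the maximum-pairing recurrence, assuming 0-based indices and the pairing rule C-G, A-U, G-U. The names `nussinov_max_pairs` and `PAIRS` are illustrative, not taken from the original paper or any library.

```python
# Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_gap=4):
    """Return B(0, n-1): the maximum number of nested pairs with i < j - min_gap."""
    n = len(seq)
    if n == 0:
        return 0
    # B[i][j] stays 0 whenever i >= j - min_gap (the base case).
    B = [[0] * n for _ in range(n)]

    def b(i, j):
        # Empty intervals (i > j) contribute no pairs.
        return B[i][j] if i <= j else 0

    # Fill by increasing interval length so that sub-intervals are already solved.
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            best = B[i][j - 1]                      # case 1: j unpaired
            for k in range(i, j - min_gap):         # case 2: j paired with k, i <= k < j-4
                if (seq[k], seq[j]) in PAIRS:
                    best = max(best, b(i, k - 1) + 1 + b(k + 1, j - 1))
            B[i][j] = best
    return B[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAAACCC"))
```

The three nested loops (over `span`, `i`, and `k`) are where the cubic running time comes from. The split into `i ... k-1` and `k+1 ... j-1` is valid only because pairs never cross, which is why pseudoknots fall outside this recurrence.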

slide-9
SLIDE 9

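Nussinov: A Computation Order

$B(i,j)$ = number of pairs in an optimal pairing of $r_i \ldots r_j$
$B(i,j) = 0$ for all $i, j$ with $i \ge j-4$; otherwise
$$B(i,j) = \max \begin{cases} B(i,j-1) \\ \max\{\, B(i,k-1) + 1 + B(k+1,j-1) \mid i \le k < j-4 \text{ and } r_k, r_j \text{ may pair} \,\} \end{cases}$$

Time: $O(n^3)$

[Figure: the table is filled one diagonal at a time; diagonals labeled K = 2, 3, 4, 5, ...]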

Which Pairs?

Usual dynamic programming “trace-back” tells you which base pairs are in the optimal solution, not just how many
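
The trace-back can be sketched as follows: fill the max-pairing table, then walk back from the full interval, at each step asking which case of the recurrence produced the stored value. All names here (`fill_table`, `traceback`, `fold`) are hypothetical, indices are 0-based, and when several optimal structures exist this sketch simply returns one of them.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def fill_table(seq, min_gap=4):
    """Max-pairing table B, with B[i][j] = 0 whenever i >= j - min_gap."""
    n = len(seq)
    B = [[0] * n for _ in range(n)]
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            best = B[i][j - 1]
            for k in range(i, j - min_gap):
                if (seq[k], seq[j]) in PAIRS:
                    left = B[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + B[k + 1][j - 1])
            B[i][j] = best
    return B


def traceback(seq, B, min_gap=4):
    """Recover one optimal set of pairs by re-deriving which case gave each B[i][j]."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_gap:
            continue                                # base case: nothing can pair here
        if B[i][j] == B[i][j - 1]:
            stack.append((i, j - 1))                # j unpaired explains the value
            continue
        for k in range(i, j - min_gap):             # otherwise j pairs with some k
            left = B[i][k - 1] if k > i else 0
            if (seq[k], seq[j]) in PAIRS and left + 1 + B[k + 1][j - 1] == B[i][j]:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return sorted(pairs)


def fold(seq):
    """Return one optimal structure in dot-bracket notation."""
    out = ["."] * len(seq)
    for i, j in traceback(seq, fill_table(seq)):
        out[i], out[j] = "(", ")"
    return "".join(out)


if __name__ == "__main__":
    print(fold("GGGAAAACCC"))
```

An explicit stack is used instead of recursion so that long sequences do not run into Python's recursion limit.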

Approaches to Structure Prediction

Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate
Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold
Partition Function
  + finds all folds
  - ignores pseudoknots

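Pair-based Energy Minimization

$E(i,j)$ = energy of pairs in optimal pairing of $r_i \ldots r_j$
$E(i,j) = 0$ for all $i, j$ with $i \ge j-4$ (no pair fits, so the only structure is the unpaired one); otherwise
$$E(i,j) = \min \begin{cases} E(i,j-1) \\ \min\{\, E(i,k-1) + e(r_k, r_j) + E(k+1,j-1) \mid i \le k < j-4 \,\} \end{cases}$$
where $e(r_k, r_j)$ is the energy of the $k$-$j$ pair (taken as $\infty$ when $r_k$ and $r_j$ cannot pair)

Time: $O(n^3)$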
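
The same table-filling scheme works with energies in place of pair counts. The sketch below assumes the rough per-pair values from the RNA Pairing slide, written as negative (stabilizing) energies, and gives the unpaired base case energy 0. The names `PAIR_ENERGY` and `min_pair_energy` are illustrative only.

```python
# Assumed energies in kcal/mole (negative = stabilizing), per the rough values quoted earlier.
PAIR_ENERGY = {
    frozenset("CG"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def min_pair_energy(seq, min_gap=4):
    """Return E(0, n-1): the minimum total pair energy over nested structures."""
    n = len(seq)
    if n == 0:
        return 0.0
    # E[i][j] stays 0.0 whenever i >= j - min_gap: the unpaired structure.
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            best = E[i][j - 1]                          # j unpaired
            for k in range(i, j - min_gap):             # j paired with k
                e = PAIR_ENERGY.get(frozenset((seq[k], seq[j])))
                if e is None:
                    continue                            # not pairable: treated as infinite energy
                left = E[i][k - 1] if k > i else 0.0
                best = min(best, left + e + E[k + 1][j - 1])
            E[i][j] = best
    return E[0][n - 1]


if __name__ == "__main__":
    print(min_pair_energy("GGGAAAACCC"))
```

Compared with maximum pairing, the only changes are `min` for `max` and a per-pair energy for the constant `+1`, so the running time is still cubic.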

Loop-based Energy Minimization

Detailed experiments show it’s more accurate to model based on loops, rather than just pairs

Loop types
  • 1. Hairpin loop
  • 2. Stack
  • 3. Bulge
  • 4. Interior loop
  • 5. Multiloop

[Figure: an example structure with the five loop types labeled 1 to 5]

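Zuker: Loop-based Energy, I

$W(i,j)$ = energy of optimal pairing of $r_i \ldots r_j$
$V(i,j)$ = as above, but forcing pair $i \cdot j$
$W(i,j) = 0$ and $V(i,j) = \infty$ for all $i, j$ with $i \ge j-4$ (the unpaired structure has energy 0; $i$ and $j$ are too close to pair)
$$W(i,j) = \min\big(\, W(i,j-1),\ \min\{\, W(i,k-1) + V(k,j) \mid i \le k < j-4 \,\} \,\big)$$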

slide-10
SLIDE 10

Zuker: Loop-based Energy, II

$$V(i,j) = \min\big(\, eh(i,j),\ es(i,j) + V(i+1,j-1),\ \mathit{VBI}(i,j),\ \mathit{VM}(i,j) \,\big)$$
$$\mathit{VM}(i,j) = \min\{\, W(i,k) + W(k+1,j) \mid i < k < j \,\}$$
$$\mathit{VBI}(i,j) = \min\{\, ebi(i,j,i',j') + V(i',j') \mid i < i' < j' < j \text{ and } i'-i+j-j' > 2 \,\}$$

The four terms of $V(i,j)$ are, in order: hairpin, stack, bulge/interior loop, multi-loop.

Time: $O(n^4)$; $O(n^3)$ possible if $ebi(\cdot)$ is “nice”

Energy Parameters

Q. Where do they come from?
A1. Experiments with carefully selected synthetic RNAs
A2. Learned algorithmically from trusted alignments/structures [Andronescu et al., 2007]

Single Seq Prediction Accuracy

Mfold, Vienna, ... [Nussinov, Zuker, Hofacker, McCaskill]
Estimates suggest ~50-75% of base pairs predicted correctly in sequences of up to ~300 nt
Definitely useful, but obviously imperfect

Approaches to Structure Prediction

Maximum Pairing
  + works on single sequences
  + simple
  - too inaccurate
Minimum Energy
  + works on single sequences
  - ignores pseudoknots
  - only finds “optimal” fold
Partition Function
  + finds all folds
  - ignores pseudoknots
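
To illustrate what "finds all folds" means, here is a heavily simplified partition-function sketch. It sums Boltzmann weights over every pseudoknot-free structure instead of keeping only the best one. It is not the loop-based algorithm used in real packages: it assumes the rough pair-only energies from the RNA Pairing slide, an assumed value of RT of about 0.616 kcal/mole (roughly body temperature), and the hypothetical names `partition_function` and `PAIR_ENERGY`.

```python
import math

# Assumed pair-only energies in kcal/mole (negative = stabilizing).
PAIR_ENERGY = {
    frozenset("CG"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def partition_function(seq, rt=0.616, min_gap=4):
    """Sum of exp(-energy / RT) over all nested structures of seq (pair-based toy model)."""
    n = len(seq)
    if n == 0:
        return 1.0
    # Z[i][j] = 1.0 when no pair fits: only the unpaired structure, with weight exp(0) = 1.
    Z = [[1.0] * n for _ in range(n)]
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            total = Z[i][j - 1]                         # structures with j unpaired
            for k in range(i, j - min_gap):             # structures with j paired to k
                e = PAIR_ENERGY.get(frozenset((seq[k], seq[j])))
                if e is None:
                    continue
                left = Z[i][k - 1] if k > i else 1.0
                total += math.exp(-e / rt) * left * Z[k + 1][j - 1]
            Z[i][j] = total
    return Z[0][n - 1]


if __name__ == "__main__":
    z = partition_function("GAAAAC")
    # Probability of the single possible pair (G at position 0 with C at position 5).
    print((z - 1.0) / z)
```

The recurrence has the same shape as the minimization above, with `min` replaced by a sum and addition of energies replaced by multiplication of weights. Each structure is counted exactly once because position j is either unpaired or paired with exactly one k.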

Approaches, II

Comparative sequence analysis
  + handles all pairings (potentially incl. pseudoknots)
  - requires several (many?) aligned, appropriately diverged sequences
Stochastic Context-free Grammars
  Roughly combines min energy & comparative, but no pseudoknots
Physical experiments (x-ray crystallography, NMR)
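
One common way to make comparative analysis concrete is to look for alignment columns that covary, for example by computing the mutual information between two columns. The sketch below is an illustration under stated assumptions, not the method of any particular tool: the function name `mutual_information` is hypothetical, the alignment is a toy example invented here, and gaps and small-sample corrections are ignored.

```python
import math
from collections import Counter


def mutual_information(alignment, c1, c2):
    """Mutual information (in bits) between columns c1 and c2 of equal-length aligned sequences."""
    n = len(alignment)
    f1 = Counter(s[c1] for s in alignment)               # base frequencies in column c1
    f2 = Counter(s[c2] for s in alignment)               # base frequencies in column c2
    f12 = Counter((s[c1], s[c2]) for s in alignment)     # joint frequencies
    mi = 0.0
    for (a, b), count in f12.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f1[a] / n) * (f2[b] / n)))
    return mi


if __name__ == "__main__":
    # Toy alignment (made up): columns 0 and 4 change together and always form a pair.
    toy = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
    print(mutual_information(toy, 0, 4))  # covarying columns: high
    print(mutual_information(toy, 0, 1))  # column 1 never changes: zero
```

High mutual information needs sequences that have actually diverged: a perfectly conserved pair scores zero, which is one reason the method wants appropriately diverged sequences.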

Summary

RNA has important roles beyond mRNA
  • Many unexpected recent discoveries
Structure is critical to function
  • True of proteins, too, but they’re easier to find from sequence alone due, e.g., to codon structure, which RNAs lack
RNA secondary structure can be predicted (to useful accuracy) by dynamic programming
Next: RNA “motifs” (sequence + secondary structure) well-captured by “covariance models”