
LinearFold: Linear-Time RNA Folding - PowerPoint PPT Presentation



  1. LinearFold: Linear-Time RNA Folding
     Liang Huang, Baidu Research USA & Oregon State University
     Joint work with Dezhong Deng (Oregon State / Baidu), Kai Zhao (Oregon State / Google), David Hendrix (Oregon State), and David Mathews (Rochester)
     Stanford University School of Medicine, July 2018
     x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
     y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
     [Figure: secondary-structure drawing of the example RNA, positions 1-76]

  2. A Bit About Myself…
     Ph.D., 2008; Research Scientist, 2009; Assistant Professor, 2015-; Principal Scientist, 2018-
     • my main area is computational linguistics (aka natural language processing)
     • where I develop faster (linear-time) algorithms to understand/translate languages
     • but I also apply these algorithms to computational structural biology…

  3. RNA Structure Prediction and Design
     RNA sequence: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
     [Diagram labels: RNA sequence, structure prediction, design, RNA secondary structure, RNA 3D structure. Illustrations: CRISPR/Cas9 (gene editing), M. tuberculosis.]

  4. RNA Structure Prediction (Folding)
     • allowed pairs: G-C, A-U, G-U
     • example: transfer RNA (tRNA)
     • assume no crossing pairs
     x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
     y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
     • challenge: existing structure prediction algorithms are way too slow: O(n^3)
     • solution: borrow linear-time algorithms from natural language parsing
     [Figures: tRNA secondary structure (5' to 3', positions 1-76) and the corresponding parse tree]
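
The dot-bracket notation used for y can be checked mechanically against the three constraints on this slide (allowed pairs, balanced brackets, no crossing pairs). The following is a minimal Python sketch, not taken from the talk; the names `ALLOWED` and `parse_pairs` are illustrative assumptions, and positions are 0-based. It is shown on the five-nucleotide toy sequence CCAGG from the recap slide.

```python
# Hypothetical helper, for illustration only (not LinearFold code).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def parse_pairs(x, y):
    """Return the (i, j) base pairs encoded by dot-bracket string y over sequence x.

    A single stack suffices because the notation cannot express crossing pairs.
    Raises ValueError if y is unbalanced or pairs two incompatible nucleotides.
    """
    if len(x) != len(y):
        raise ValueError("sequence and structure must have the same length")
    stack, pairs = [], []
    for j, tag in enumerate(y):
        if tag == "(":
            stack.append(j)                      # remember the opening position
        elif tag == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            i = stack.pop()                      # most recent unpaired opening
            if (x[i], x[j]) not in ALLOWED:
                raise ValueError(f"{x[i]}-{x[j]} is not an allowed pair")
            pairs.append((i, j))
        elif tag != ".":
            raise ValueError(f"unexpected symbol {tag!r}")
    if stack:
        raise ValueError("unmatched '(' remains on the stack")
    return pairs

print(parse_pairs("CCAGG", "((.))"))
```

The same function can be applied to the tRNA sequence x and structure y shown on the slide.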

  5. Our Linear-Time Prediction is Much Faster…
     • with even slightly better prediction accuracy!!
     [Figure: running time per sequence (sec) vs. sequence length, in two panels (0-3000 nt, and 10^3-10^5 nt). Fitted curves: Vienna RNAfold ~n^2.6; CONTRAfold MFE ~n^2.6; LinearFold b=100 ~n^1.0; LinearFold b=50 ~n^1.0. Marked lengths: 10,000 nt (~HIV) and 244,296 nt (longest in RNAcentral). Timing annotations: 4min, 7s, ~200hrs, 120s, 2 hrs.]

  6. Computational Linguistics => Computational Biology
     (timeline across three columns: linguistics, computer science, biology)
     • 1953 Watson & Crick: DNA double-helix
     • 1955 Chomsky: context-free grammars
     • 1958 Backus & Naur: CFGs in programming lang.
     • 1964 Cocke / 1965 Kasami / 1967 Younger: CKY Parsing: O(n^3)
     • 1965 Knuth: LR Parsing: O(n)
     • 1970 Joshi: tree-adjoining grammars
     • 1980s: O(n^3) CKY for RNA structures
     • 1985: CKY-style TAG parsing in O(n^6)
     • 1985 Shieber: non-CF languages
     • 1986 Tomita: Generalized LR Parsing
     • ~1990: linear-time greedy parsing
     • 1999: TAGs for RNA pseudoknots
     • 2010: linear-time DP parsing (Huang & Sagae)
     • 2018: LinearFold: O(n) RNA structure prediction

  7. Current Structure Prediction Method: O(n^3)
     • Dynamic Programming: O(n^3)
     • bottom-up CKY parsing
     • example: maximize # of pairs (A-U, G-C, G-U)
     [Figure: CKY chart over the example sequence A C A G U; a span (i, j) is built either by pairing i with j around (i+1, j-1) or by splitting at k]
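
The bottom-up dynamic program on this slide can be sketched as follows. This is an illustrative Python version of the toy objective only (maximize the number of pairs), not the energy models used by real folding tools; `max_pairs_cky` is a hypothetical name, and the sketch assumes, as the toy chart appears to, that adjacent nucleotides may pair (no minimum hairpin-loop length).

```python
# Toy bottom-up CKY for "maximize the number of pairs" (illustration only).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def max_pairs_cky(x):
    """Maximum number of non-crossing allowed pairs in x, in O(n^3) time."""
    n = len(x)
    # best[i][j] = max number of pairs inside the half-open span x[i:j]
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for span in range(2, n + 1):                 # shorter spans first (bottom-up)
        for i in range(0, n - span + 1):
            j = i + span
            score = best[i + 1][j]               # case 1: x[i] stays unpaired
            for k in range(i + 1, j):            # case 2: x[i] pairs with x[k]
                if (x[i], x[k]) in ALLOWED:
                    inside = best[i + 1][k]      # pairs enclosed by (i, k)
                    after = best[k + 1][j]       # pairs to the right of k
                    score = max(score, 1 + inside + after)
            best[i][j] = score
    return best[0][n]

print(max_pairs_cky("ACAGU"))
```

The three nested loops over span, i, and k are what give the cubic running time that the rest of the talk sets out to avoid.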

  8. How to Fold RNAs in Linear-Time?
     5' GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3'
        (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
     • idea 0: tag each nucleotide from left to right
     • maintain a stack: push "(", pop ")", skip "."
     • exhaustive: O(3^n)
     (Huang and Sagae, 2010)
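
Idea 0 can be written down directly: at each nucleotide try skip, push, and pop, which gives up to three branches per position and hence O(3^n) paths. The sketch below is an illustrative Python version under the toy objective (count pairs); the function name `exhaustive` is an assumption, not something from the talk.

```python
# Exhaustive left-to-right search with a stack (illustration only).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def exhaustive(x):
    """Yield every complete dot-bracket structure reachable by push/skip/pop."""
    n = len(x)

    def step(j, stack, tags):
        if j == n:
            if not stack:                        # complete only if nothing is left open
                yield "".join(tags)
            return
        yield from step(j + 1, stack, tags + ["."])            # skip
        yield from step(j + 1, stack + [j], tags + ["("])      # push
        if stack and (x[stack[-1]], x[j]) in ALLOWED:          # pop
            yield from step(j + 1, stack[:-1], tags + [")"])

    yield from step(0, [], [])

# Pick the structure with the most pairs for the toy sequence.
best = max(exhaustive("CCAGG"), key=lambda s: s.count("("))
print(best)
```

This enumerates every path explicitly, so it is only usable on very short sequences; the later versions merge states to avoid the blow-up.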

  9. How to Fold RNAs in Linear-Time?
     5' GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3'
        (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
     • idea 1: DP by merging "equivalent states"
     • maintain graph-structured stacks
     • DP: O(n^3)
     (Huang and Sagae, 2010)

  10. How to Fold RNAs in Linear-Time?
      5' GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3'
         (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
      • idea 1: DP by merging "equivalent states"
      • maintain graph-structured stacks
      • DP: O(n^3)
      (Huang and Sagae, 2010)

  11. How to Fold RNAs in Linear-Time?
      5' GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3'
         (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
      • idea 2: approximate search: beam pruning
      • keep only top b states per step
      • DP+beam: O(n)
      • each DP state corresponds to exponentially many non-DP states
      • graph-structured stack (GSS) (Tomita, 1986)
      (Huang and Sagae, 2010)

  12. Another View: Left-to-Right CKY
      • many variants of CKY ~ various topological orderings
      • all O(n^3), but the incremental ones can apply beam search to run in O(n)
      [Figure: three CKY charts, each with goal item (S, 0, n), filled bottom-up, left-to-right, and right-to-left]

  13. Our Linear-Time Prediction is Much Faster…
      • with even slightly better prediction accuracy!!
      [Figure: running time per sequence (sec) vs. sequence length, in two panels (0-3000 nt, and 10^3-10^5 nt). Fitted curves: Vienna RNAfold ~n^2.6; CONTRAfold MFE ~n^2.6; LinearFold b=100 ~n^1.0; LinearFold b=50 ~n^1.0. Marked lengths: 10,000 nt (~HIV) and 244,296 nt (longest in RNAcentral). Timing annotations: 4min, 7s, ~200hrs, 120s, 2 hrs.]

  14. On to details...

  15. An Example Path: push, push, skip, pop, pop

  16. Version 1: Exhaustive Search, O(3^n)

  17. Version 1: Exhaustive Search, O(3^n)

  18. Version 1: Exhaustive Search, O(3^n)

  19. Version 1: Exhaustive Search, O(3^n)

  20. Version 1: Exhaustive Search, O(3^n)

  21. Version 1: Exhaustive Search, O(3^n)

  22. Idea 1a: Merge Identical Stacks
      Merge states with the same full stack (unpaired openings): "Equivalent States"

  23. Version 2: Merge by Full Stack, O(2^n)
      [Figure panels: exhaustive; full-stack merge]

  24. Version 2: Merge by Full Stack, O(2^n)
      • merge states with identical stacks
      [Figure panels: exhaustive; full-stack merge]

  25. Version 2: Merge by Full Stack, O(2^n)
      [Figure panels: exhaustive; full-stack merge, O(2^n)]
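
Merging by full stack can be sketched as a left-to-right DP whose state is the entire tuple of unpaired opening positions. The Python below is an illustrative toy (maximize the number of pairs), with hypothetical names; it keeps only the best path per distinct stack at each step, so the number of states is bounded by the number of distinct stacks, which is still exponential.

```python
# Left-to-right DP that merges paths sharing the same full stack (illustration only).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def relax(table, stack, score, tags):
    """Keep the higher-scoring path among those that end with the same stack."""
    if stack not in table or score > table[stack][0]:
        table[stack] = (score, tags)

def fold_full_stack_merge(x):
    """Return (max pairs, one best structure) for sequence x."""
    # states maps a full stack (tuple of open positions) to (pairs so far, tags so far)
    states = {(): (0, "")}
    for j, nuc in enumerate(x):
        nxt = {}
        for stack, (score, tags) in states.items():
            relax(nxt, stack, score, tags + ".")                  # skip
            relax(nxt, stack + (j,), score, tags + "(")           # push
            if stack and (x[stack[-1]], nuc) in ALLOWED:          # pop
                relax(nxt, stack[:-1], score + 1, tags + ")")
        states = nxt
    return states[()]                                             # empty stack = complete

print(fold_full_stack_merge("CCAGG"))
```

Two paths with the same stack have identical futures, which is why discarding the lower-scoring one loses nothing.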

  26. Idea 1b: Merge "Temporary Equivalents"
      Merge states with the same top of the stack (last unpaired opening): "Temporarily Equivalent States"
      [Figure label: O(2^n)]

  27. Version 3: Merge by Stack Top, O(n^3)
      • packing temporarily equivalent states

  28. Version 3: Merge by Stack Top, O(n^3)

  29. Version 3: Merge by Stack Top, O(n^3)

  30. Version 3: Merge by Stack Top, O(n^3)
      [Figure labels: packing; unpacking]

  31. Version 3: Merge by Stack Top, O(n^3)
      [Figure labels: O(2^n); packing]
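
Merging by stack top can be sketched by keying each state on the position of the last unpaired opening only, and recovering the rest of the stack through predecessor states when a pop happens (the role the graph-structured stack plays). The Python below is an illustrative toy for the pair-counting objective; the names and the -1 sentinel for an empty stack are assumptions of this sketch.

```python
# Left-to-right DP that merges paths sharing the same stack top (illustration only).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def relax(table, top, score, tags):
    """Keep the higher-scoring item among those with the same stack top."""
    if top not in table or score > table[top][0]:
        table[top] = (score, tags)

def fold_stack_top_merge(x):
    """Return (max pairs, one best structure) for x with O(n^3) state updates."""
    # steps[j][i] = (pairs, tags) for the best sub-structure of x[i+1:j], where i is
    # the position of the last unpaired "(" and -1 stands for the empty stack.
    steps = [{-1: (0, "")}]
    for j, nuc in enumerate(x):
        nxt = {}
        for top, (score, tags) in steps[j].items():
            relax(nxt, top, score, tags + ".")       # skip: extend the current span
            relax(nxt, j, 0, "")                     # push: j becomes the new top
            if top >= 0 and (x[top], nuc) in ALLOWED:
                # pop (unpacking): every state alive when `top` was pushed is a
                # possible continuation below it, so combine with each of them.
                for prev_top, (prev_score, prev_tags) in steps[top].items():
                    relax(nxt, prev_top, prev_score + score + 1,
                          prev_tags + "(" + tags + ")")
        steps.append(nxt)
    return steps[-1][-1]

print(fold_stack_top_merge("CCAGG"))
```

There are at most n steps, O(n) stack tops per step, and O(n) predecessors per pop, which accounts for the cubic bound. Carrying the structure string in each state adds copying cost and is done here only for readability.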

  32. Close-Up Look at Two Paths

  33. Close-Up Look at Two Paths

  34. Idea 3: Beam Pruning
      [Figure panels: full-stack merge, O(2^n); stack-top merge]

  35. Version 4: DP with Beam Search, O(n)
      • stack-top merge + beam pruning
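
Adding beam pruning to the stack-top DP can be sketched as follows. This is an illustrative Python toy for the pair-counting objective, not the LinearFold implementation: the name `fold_beam`, the use of the prefix score as the pruning key, and the choice to report the best prefix score in the final beam are assumptions of this sketch. Each state stores two numbers, the best score of the whole prefix (used to rank states) and the best score inside the current span (used when popping).

```python
# Stack-top DP with beam pruning: at most `beam` states survive per step
# (illustration only; approximate when the beam is small).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def relax(table, top, prefix, inside):
    """Keep the best prefix score and best inside score seen for this stack top."""
    old_prefix, old_inside = table.get(top, (-1, -1))
    table[top] = (max(old_prefix, prefix), max(old_inside, inside))

def fold_beam(x, beam=100):
    """Approximate max number of pairs; work per step depends on beam, not on len(x)."""
    # beams[j][i] = (prefix, inside): best pairs in x[:j], and best pairs in x[i+1:j],
    # over surviving paths whose last unpaired "(" is at i (-1 = empty stack).
    beams = [{-1: (0, 0)}]
    for j, nuc in enumerate(x):
        nxt = {}
        for top, (prefix, inside) in beams[j].items():
            relax(nxt, top, prefix, inside)                   # skip
            relax(nxt, j, prefix, 0)                          # push
            if top >= 0 and (x[top], nuc) in ALLOWED:         # pop
                for prev_top, (prev_prefix, prev_inside) in beams[top].items():
                    relax(nxt, prev_top,
                          prev_prefix + inside + 1,
                          prev_inside + inside + 1)
        # beam pruning: keep only the `beam` states with the highest prefix score
        kept = sorted(nxt.items(), key=lambda kv: kv[1][0], reverse=True)[:beam]
        beams.append(dict(kept))
    # Openings still on the stack can be read as unpaired, so any surviving
    # prefix score corresponds to a valid structure.
    return max(prefix for prefix, _ in beams[-1].values())

print(fold_beam("CCAGG", beam=100))
```

With a beam at least as large as the number of reachable states the result matches the exact DP; with a small beam it can be lower, which is the accuracy-for-speed trade-off of approximate search.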

  36. Recap: O(3^n) to O(n^3) to O(n)
      • 5 search algorithms
      • DP: bottom-up CKY: O(n^3)
      • left-to-right (exhaustive): O(3^n)
      • DP: merge by full stack: O(2^n)
      • DP: merge by stack top: O(n^3)
      • approx. DP via beam search: O(n)
      • this is a simple illustration in which we just maximize the number of pairs
      • our real systems work with complicated feature templates
      [Figure: search graphs over the example sequence CCAGG (prefixes C, CC, CCA, CCAG, CCAGG) for four settings: no DP, O(3^n); DP + full stack merge, O(2^n); DP + GSS, O(n^3); LinearFold (+beam, approx. DP), O(n)]
