LinearFold: Linear-Time RNA Folding
x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
[Figure: secondary structure drawing of the example sequence, nucleotides numbered 1 to 76]
Liang Huang, Baidu Research USA & Oregon State University
Joint work with Dezhong Deng (Oregon State / Baidu), Kai Zhao (Oregon State / Google), David Hendrix (Oregon State), and David Mathews (Rochester)
Stanford University School of Medicine, July 2018

A Bit About Myself…
Ph.D., 2008; Research Scientist, 2009; Assistant Professor, 2015-; Principal Scientist, 2018-
• my main area is computational linguistics (aka natural language processing)
• where I develop faster (linear-time) algorithms to understand/translate languages
• but I also apply these algorithms to computational structural biology…

RNA Structure Prediction and Design
RNA sequence: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
[Figure: the RNA sequence, its RNA secondary structure, and its RNA 3D structure, linked by arrows labeled "structure prediction" and "design"; other labels: CRISPR/Cas9: gene editing; M. tuberculosis]

RNA Structure Prediction (Folding)
• allowed pairs: G-C A-U G-U
• example: transfer RNA (tRNA)
• assume no crossing pairs
x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
[Figure: two drawings of the tRNA secondary structure (5’ to 3’, nucleotides numbered 1 to 76) and the corresponding parse tree]
• challenge: existing structure prediction algorithms are way too slow: O(n^3)
• solution: borrow linear-time algorithms from natural language parsing

Our Linear-Time Prediction is Much Faster…
[Figure: running time per sequence (sec) vs. sequence length, with x-axis ranges 0 to 3000 nt and 10^3 to 10^5 nt. Fitted curves: Vienna RNAfold ~n^2.6; CONTRAfold MFE ~n^2.6; LinearFold b=100 ~n^1.0; LinearFold b=50 ~n^1.0. Annotated lengths: 10,000 nt (~HIV) and 244,296 nt (longest in RNAcentral). Annotated times: 4min, 7s, ~200hrs, 120s, 2 hrs.]
with even slightly better prediction accuracy!!

Computational Linguistics => Computational Biology
Timeline across three fields: linguistics, computer science, biology
• 1953 Watson & Crick: DNA double-helix
• 1955 Chomsky: context-free grammars
• 1958 Backus & Naur: CFGs in programming lang.
• 1964 Cocke / 1965 Kasami / 1967 Younger: CKY Parsing: O(n^3)
• 1965 Knuth: LR Parsing: O(n)
• 1970 Joshi: tree-adjoining grammars
• 1980s: O(n^3) CKY for RNA structures
• 1985 CKY-style TAG parsing in O(n^6)
• 1985 Shieber: non-CF languages
• 1986 Tomita: Generalized LR Parsing
• ~1990: linear-time greedy parsing
• 1999: TAGs for RNA pseudoknots
• 2010: linear-time DP parsing (Huang & Sagae)
• 2018: LinearFold: O(n) RNA structure prediction

Current Structure Prediction Method: O(n^3)
• Dynamic Programming: O(n^3)
• bottom-up CKY parsing
• example: maximize # of pairs (A-U G-C G-U)
[Figure: CKY chart over the example sequence ACAGU, with spans indexed by i, k, j and partial structures such as (.) and ((.))]
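The bottom-up dynamic program on this slide can be sketched as follows. This is only a minimal illustration of the "maximize # of pairs" example, not the energy model of any real folding tool; the names `ALLOWED` and `max_pairs_cky` are made up for this sketch, and it assumes that any two positions may pair (no minimum hairpin length).

```python
# Allowed base pairs from the slide: G-C, A-U, G-U (in either order).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def max_pairs_cky(seq):
    """Bottom-up CKY-style DP; best[i][j] is the max number of pairs in seq[i:j]."""
    n = len(seq)
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for span in range(2, n + 1):            # shorter spans are filled first
        for i in range(0, n - span + 1):
            j = i + span
            # Case 1: the last nucleotide seq[j-1] stays unpaired.
            score = best[i][j - 1]
            # Case 2: seq[j-1] pairs with seq[k]; combine the span left of k
            # with the span enclosed by the new pair.
            for k in range(i, j - 1):
                if (seq[k], seq[j - 1]) in ALLOWED:
                    score = max(score, best[i][k] + best[k + 1][j - 1] + 1)
            best[i][j] = score
    return best[0][n]


print(max_pairs_cky("ACAGU"))
```

The three nested loops over span, start position, and split point k are what give the O(n^3) running time.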

How to Fold RNAs in Linear-Time?
5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
• idea 0: tag each nucleotide from left to right
• maintain a stack: push “(”, pop “)”, skip “.”
• exhaustive: O(3^n)
(Huang and Sagae, 2010)
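The push/pop/skip reading of a tag sequence can be sketched as below. The function name `pairs_from_tags` is hypothetical; the sketch only shows how a single left-to-right pass with a stack recovers the base pairs from a dot-bracket string.

```python
def pairs_from_tags(tags):
    """Read a dot-bracket string left to right and return its base pairs.

    "(" pushes the current position, ")" pops the most recent unpaired
    opening and pairs it with the current position, "." skips.
    """
    stack = []   # positions of unpaired openings
    pairs = []
    for j, tag in enumerate(tags):
        if tag == "(":
            stack.append(j)
        elif tag == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif tag != ".":
            raise ValueError(f"unexpected tag {tag!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at positions {stack}")
    return pairs


print(pairs_from_tags("((.))"))
```

Each nucleotide is touched once, so reading one tag sequence is linear; the O(3^n) cost comes from trying all three tags at every position.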

How to Fold RNAs in Linear-Time?
5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
• idea 1: DP by merging “equivalent states”
• maintain graph-structured stacks
• DP: O(n^3)
(Huang and Sagae, 2010)

How to Fold RNAs in Linear-Time?
5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
• idea 2: approximate search: beam pruning
• keep only top b states per step
• DP+beam: O(n)
each DP state corresponds to exponentially many non-DP states
graph-structured stack (GSS) (Tomita, 1986)
(Huang and Sagae, 2010)

Another View: Left-to-Right CKY
• many variants of CKY ~ various topological orderings
[Figure: three CKY charts with goal item (S, 0, n), filled bottom-up, left-to-right, and right-to-left]
all O(n^3), but the incremental ones can apply beam search to run in O(n)

On to details...

An Example Path
[Figure: one search path: push, push, skip, pop, pop]

Version 1: Exhaustive Search, O(3^n)
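A minimal sketch of the exhaustive left-to-right search, assuming the simplified objective of maximizing the number of pairs. The names `ALLOWED` and `exhaustive_best` are invented for this illustration.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def exhaustive_best(seq):
    """Try every skip/push/pop sequence; return (max pairs, dot-bracket tags)."""
    best = (-1, "")

    def extend(j, stack, tags, score):
        nonlocal best
        if j == len(seq):
            # Only complete structures (empty stack) are valid answers.
            if not stack and score > best[0]:
                best = (score, tags)
            return
        extend(j + 1, stack, tags + ".", score)            # skip
        extend(j + 1, stack + (j,), tags + "(", score)     # push
        if stack and (seq[stack[-1]], seq[j]) in ALLOWED:  # pop, if the pair is allowed
            extend(j + 1, stack[:-1], tags + ")", score + 1)

    extend(0, (), "", 0)
    return best


print(exhaustive_best("CCAGG"))
```

With up to three choices at each of the n positions, the search tree has up to 3^n leaves, which is why this version is only usable on very short sequences.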

Idea 1a: Merge Identical Stacks
Merge states with the same full stack (unpaired openings): “Equivalent States”

Version 2: Merge by Full Stack, O(2^n)
• merge states with identical stacks
[Figure: search graphs for exhaustive search and for full-stack merge, O(2^n)]
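Merging by full stack can be sketched as a left-to-right DP whose states are keyed by the whole stack of unpaired openings. This is an illustrative sketch for the max-pairs objective only; `ALLOWED`, `relax`, and `full_stack_merge` are names assumed here.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def relax(table, key, score, tags):
    """Keep only the best-scoring path among those that reach the same state."""
    if key not in table or score > table[key][0]:
        table[key] = (score, tags)


def full_stack_merge(seq):
    """Left-to-right DP; states at each step are merged when their stacks are identical."""
    states = {(): (0, "")}   # stack of unpaired openings -> (pairs so far, tags)
    for j, nuc in enumerate(seq):
        nxt = {}
        for stack, (score, tags) in states.items():
            relax(nxt, stack, score, tags + ".")              # skip
            relax(nxt, stack + (j,), score, tags + "(")       # push
            if stack and (seq[stack[-1]], nuc) in ALLOWED:    # pop
                relax(nxt, stack[:-1], score + 1, tags + ")")
        states = nxt
    return states[()]   # the empty stack is always reachable by skipping


print(full_stack_merge("CCAGG"))
```

A stack is a subset of the positions read so far, so there can be up to 2^j distinct states at step j: fewer than the 3^j paths of exhaustive search, but still exponential.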

Idea 1b: Merge “Temporary Equivalents”
Merge states with the same top of the stack (last unpaired opening): “Temporarily Equivalent States”
[Figure: search graph for full-stack merge, O(2^n)]

Version 3: Merge by Stack Top, O(n^3)
• packing temporarily equivalent states
• unpacking them again on pop
[Figure: search graph for stack-top merge, O(n^3), compared with full-stack merge, O(2^n)]
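One way to sketch the stack-top merge for the max-pairs objective is shown below. It is an assumption-laden illustration rather than the LinearFold implementation: states are keyed only by the last unpaired opening i (with -1 standing for an empty stack), and a pop "unpacks" by combining with every state that was current when i was pushed. It returns only the score; recovering the tags would need back-pointers, omitted here.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def relax(table, key, score):
    """Keep the best score among temporarily equivalent states."""
    if score > table.get(key, -1):
        table[key] = score


def stack_top_merge(seq):
    """Left-to-right DP with states merged by stack top; returns the max number of pairs."""
    n = len(seq)
    # best[j][i]: max pairs strictly between i and j after reading seq[:j],
    # where i is the last unpaired opening (i == -1 means the stack is empty).
    best = [dict() for _ in range(n + 1)]
    best[0][-1] = 0
    for j in range(n):
        for i, score in best[j].items():
            relax(best[j + 1], i, score)   # skip: stack top unchanged
            relax(best[j + 1], j, 0)       # push: j becomes the stack top (packing)
            if i >= 0 and (seq[i], seq[j]) in ALLOWED:
                # pop: unpack every state that preceded the push of i
                for k, left in best[i].items():
                    relax(best[j + 1], k, left + score + 1)
    return best[n][-1]


print(stack_top_merge("CCAGG"))
```

There are n steps, at most n + 1 states per step, and at most n + 1 predecessors per pop, which gives the O(n^3) bound on the slide.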

Close Up Look at Two Paths

Idea 3: Beam Pruning
[Figure: search graphs for full-stack merge, O(2^n), and stack-top merge]

Version 4: DP with Beam Search, O(n)
[Figure: search graph for stack-top merge + beam pruning]
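Adding beam pruning to a stack-top-merged DP might look like the sketch below, again for the max-pairs toy objective. Everything here is an assumption made for illustration (the names `beam_fold`, `offset`, `relax`, the ranking by prefix score, and the choice to treat leftover openings as unpaired at the end); the real system scores states with a much richer model.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def relax(table, key, score):
    if score > table.get(key, -1):
        table[key] = score


def beam_fold(seq, beam=100):
    """Stack-top-merged DP that keeps at most `beam` states per step."""
    n = len(seq)
    best = [dict() for _ in range(n + 1)]   # best[j][i]: pairs strictly between i and j
    best[0][-1] = 0
    offset = {-1: 0}   # offset[i]: best number of pairs formed before opening i was pushed
    for j in range(n):
        if len(best[j]) > beam:
            # Beam pruning: rank states by prefix score and keep the top `beam`.
            ranked = sorted(best[j].items(),
                            key=lambda kv: offset[kv[0]] + kv[1], reverse=True)
            best[j] = dict(ranked[:beam])
        offset[j] = max(offset[i] + s for i, s in best[j].items())
        for i, score in best[j].items():
            relax(best[j + 1], i, score)   # skip
            relax(best[j + 1], j, 0)       # push
            if i >= 0 and (seq[i], seq[j]) in ALLOWED:
                for k, left in best[i].items():   # pop; best[i] holds at most `beam` states
                    relax(best[j + 1], k, left + score + 1)
    # Any opening still on the stack at the end is treated as unpaired.
    return max(offset[i] + s for i, s in best[n].items())


print(beam_fold("CCAGG", beam=100))
```

For a fixed beam size, the number of states expanded and the number of predecessors unpacked at each step no longer depend on n, so the total work grows linearly with sequence length. The price is that the search is approximate: with a small beam the result can fall below the exact optimum.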

Recap: O(3^n) to O(n^3) to O(n)
• 5 search algorithms
• DP: bottom-up CKY: O(n^3)
• left-to-right (exhaustive): O(3^n)
• DP: merge by full stack: O(2^n)
• DP: merge by stack top: O(n^3)
• approx. DP via beam search: O(n)
• this is a simple illustration in which we just maximize the number of pairs
• our real systems work with complicated feature templates
[Figure: four search graphs over the example sequence CCAGG (prefixes C, CC, CCA, CCAG, CCAGG at steps 0 to 5): no DP, O(3^n); DP with full-stack merge, O(2^n); DP+GSS, O(n^3); LinearFold (approx. DP with beam), O(n)]
