
slide-1
SLIDE 1

LinearFold: Linear-Time Approximate RNA Folding


by 5’-to-3’ dynamic programming and beam search

Liang Huang *

Oregon State University & Baidu Research USA

Joint work with He Zhang **, Dezhong Deng **, Kai Zhao, Kaibo Liu, David Hendrix and David Mathews

* corresponding author ** equal contribution

[Figure: the example sequence x drawn with position ticks 1-76]

x: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y: (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

ISMB 2019 Proceedings Talk

slide-3
SLIDE 3

LinearFold: Linear-Time Approximate RNA Folding

first O(n) (approx.) RNA folding algorithm & server (linearfold.org), with even higher accuracy than O(n^3) algorithms

slide-4
SLIDE 4

RNA Secondary Structure Prediction

input example: transfer RNA (tRNA)

x: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA

allowed pairs: G-C, A-U, G-U
assume no crossing pairs (no pseudoknots)

slide-5
SLIDE 5

RNA Secondary Structure Prediction

input x:  GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
output y: (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

allowed pairs: G-C, A-U, G-U
assume no crossing pairs (no pseudoknots)

example: transfer RNA (tRNA)
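
The output y is a dot-bracket string: “(” and “)” mark the two ends of a base pair and “.” marks an unpaired nucleotide. A minimal Python sketch of how such a string could be decoded and checked against the allowed pairs follows. The function name `pairs_from_dot_bracket` and the toy sequence are illustrative assumptions, not part of the talk.

```python
# Allowed base pairs from the slide: G-C, A-U, G-U (in either order).
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def pairs_from_dot_bracket(x, y):
    """Return the (i, j) pairs (0-based) that structure y encodes on sequence x.

    Hypothetical helper for illustration; raises ValueError on invalid input.
    """
    if len(x) != len(y):
        raise ValueError("sequence and structure must have the same length")
    stack, pairs = [], []
    for j, tag in enumerate(y):
        if tag == "(":
            stack.append(j)                 # open a pair; partner not seen yet
        elif tag == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            i = stack.pop()                 # closes the most recent open bracket
            if (x[i], x[j]) not in ALLOWED:
                raise ValueError(f"{x[i]}-{x[j]} at ({i}, {j}) is not an allowed pair")
            pairs.append((i, j))
        elif tag != ".":
            raise ValueError(f"unexpected symbol {tag!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


# Toy example (not the tRNA from the slide): three nested pairs around a loop.
print(pairs_from_dot_bracket("GGGAAAUCC", "(((...)))"))  # [(2, 6), (1, 7), (0, 8)]
```

Because a single bracket type can only express nested or side-by-side pairs, any string accepted by this check is automatically free of crossing pairs.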

slide-9
SLIDE 9

RNA Secondary Structure Prediction

example: transfer RNA (tRNA), with input x, output y, allowed pairs G-C, A-U, G-U, and no crossing pairs (no pseudoknots)

[Figure: the sequence drawn with position ticks 1-76 and the label “parse tree”, next to the parse tree of the sentence “the man bit the dog”: (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog)))); both are marked O(n^3)]

problem: standard structure prediction algorithms are way too slow: O(n^3)
solution: adapt my linear-time dynamic programming algorithms from parsing
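
To see where the cubic cost comes from, here is a minimal sketch of a classic Nussinov-style dynamic program that maximizes the number of allowed base pairs. It is an illustration only, not the scoring model used by the folding engines in this talk; the function name and the `min_loop` parameter (minimum number of unpaired bases inside a hairpin) are assumptions made for the example.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(x, min_loop=3):
    """Exact bottom-up DP: maximum number of non-crossing allowed pairs in x."""
    n = len(x)
    if n == 0:
        return 0
    # table[i][j] = max pairs within x[i..j] (inclusive); cells with i > j stay 0
    table = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # O(n) span lengths
        for i in range(n - span):                # O(n) start positions
            j = i + span
            best = table[i][j - 1]               # case 1: x[j] is unpaired
            for k in range(i, j - min_loop):     # O(n) split points: x[k] pairs with x[j]
                if (x[k], x[j]) in ALLOWED:
                    left = table[i][k - 1] if k > i else 0
                    best = max(best, left + table[k + 1][j - 1] + 1)
            table[i][j] = best
    return table[0][n - 1]


print(nussinov_max_pairs("GGGAAACCC"))  # 3
```

The three nested loops over span length, start position and split point are what make this kind of search O(n^3) in time and O(n^2) in space.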

slide-10
SLIDE 10

Results: LinearFold is Much Faster and More Accurate

[Figure, panel A: running time (seconds, 2-10) vs. sequence length (1000nt-3000nt)]

Vienna RNAfold: ~n^2.4
CONTRAfold MFE: ~n^2.2
LinearFold-V: ~n^1.2
LinearFold-C: ~n^1.1

[Figure, panel C: Precision and Recall (axis 40-80) per family (tRNA, 5S rRNA, SRP, RNaseP, tmRNA, Group I Intron, telomerase RNA, 16S rRNA, 23S rRNA), comparing existing ones (Standard O(n^3) search) with our work (LinearFold: O(n) search); some families are marked with asterisks]

slide-11
SLIDE 11

From Linguistics to Biology

x: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y: (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

slide-17
SLIDE 17

Computational Linguistics => Computational Biology

fields: linguistics, computational linguistics, compiler theory, computational biology

1955 Chomsky: context-free grammars (CFGs)
1958 Backus & Naur: CFGs for programming languages
1964 Cocke, 1965 Kasami, 1967 Younger: CKY, bottom-up O(n^3) for all CFGs
1965 Knuth: LR parsing for restricted CFGs: O(n)
1978 Nussinov: O(n^3) RNA folding
1981 Zuker & Stiegler
1986 Tomita: Generalized LR for all CFGs: O(n^3)
2010 Huang & Sagae: O(n) (approx.) DP for all CFGs
2019 LinearFold: O(n) (approx.) RNA folding

[Figure: parse tree of “the man bit the dog”, (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog)))), marked O(n^3); parse tree of “x = y + 3;”, (:= (id x) (+ (id y) (const 3))), marked O(n)]

dynamic programming: O(n^3), O(n)

slide-18
SLIDE 18

How to Fold RNAs in Linear-Time?

5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’

  • idea 0: tag each nucleotide from left to right as “(”, “.” or “)”
  • maintain a stack: push “(”, pop “)”, skip “.”
  • naive: O(3^n)
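
Idea 0 can be sketched as a recursive left-to-right enumeration that keeps an explicit stack of open brackets. This is a hypothetical illustration (the function name `enumerate_structures` is an assumption); it tries up to three tags per nucleotide, which is where the 3^n bound comes from.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def enumerate_structures(x):
    """Yield every pseudoknot-free dot-bracket structure for sequence x."""
    n = len(x)

    def extend(j, tags, stack):
        if j == n:
            if not stack:                      # every "(" must have been popped
                yield "".join(tags)
            return
        # skip: leave nucleotide j unpaired
        yield from extend(j + 1, tags + ["."], stack)
        # push: open a bracket at j and remember its position
        yield from extend(j + 1, tags + ["("], stack + [j])
        # pop: close the bracket on top of the stack if the two bases can pair
        if stack and (x[stack[-1]], x[j]) in ALLOWED:
            yield from extend(j + 1, tags + [")"], stack[:-1])

    yield from extend(0, [], [])


print(sorted(enumerate_structures("GC")))        # ['()', '..']
print(len(list(enumerate_structures("GCGC"))))   # 7
```

Even with the pairing constraint, the number of explored tag sequences grows exponentially with the sequence length, so this is only usable on toy inputs.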

slide-20
SLIDE 20

How to Fold RNAs in Linear-Time?

5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

  • idea 1: DP by packing “equivalent states”
  • maintain graph-structured stacks
  • DP: O(n^3)

slide-25
SLIDE 25

How to Fold RNAs in Linear-Time?

5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

  • idea 2: approximate search: beam pruning
  • keep only top b states per step
  • DP+beam: O(n)

beam search
each DP state corresponds to exponentially many non-DP states
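
The combination of left-to-right DP and beam pruning can be sketched on a simplified scoring model that just counts base pairs. This is an assumption-laden illustration, not the released LinearFold code: the function name, the pair-counting score, the `min_loop` parameter and the ranking rule (best prefix score plus span score) are choices made for this example.

```python
ALLOWED = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def linear_nussinov(x, beam=100, min_loop=3):
    """Left-to-right DP with beam pruning; returns the best pair count found."""
    n = len(x)
    # best[j][i] = most pairs found for the span x[i:j]. The key i stands for
    # the stack-top position, so all prefixes sharing it are packed into one state.
    best = [{0: 0}]
    for j in range(n):
        cur = {}
        for i, score in best[j].items():
            # skip: x[j] stays unpaired, so span x[i:j] grows to x[i:j+1]
            if score > cur.get(i, -1):
                cur[i] = score
            # pop: pair x[i-1] with x[j], then attach every span that ends at i-1
            if i >= 1 and j - i >= min_loop and (x[i - 1], x[j]) in ALLOWED:
                for k, left in best[i - 1].items():
                    total = left + score + 1
                    if total > cur.get(k, -1):
                        cur[k] = total
        if len(cur) > beam:
            # rank spans by (best score of the prefix x[0:i]) + (span score);
            # ties favour smaller i, which keeps the full-prefix span i = 0
            ranked = sorted(cur, key=lambda i: (-(best[i][0] + cur[i]), i))
            cur = {i: cur[i] for i in ranked[:beam]}
        cur[j + 1] = 0              # push: an empty span starting at j + 1
        best.append(cur)
    return best[n][0]


print(linear_nussinov("GGGAAACCC"))          # 3
print(linear_nussinov("GGGAAACCC", beam=1))  # a tiny beam can miss pairs
```

For a fixed beam size each step touches a bounded number of states, so the running time of this sketch grows linearly with the sequence length; a larger beam trades speed for a smaller search error.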

slide-28
SLIDE 28

Details: Packing “Temporarily Equivalent” States

  • two states are “temporarily equivalent” if they share the same stack top index
  • because they are looking for the same nucleotide(s) to match this stack-top nucleotide
  • we pack them as a single state until a match is found, at which point they unpack
  • this is how we can explore exponentially many structures in linear time

[Figure: beam search over partial structures built with the actions “(”, “.” and “)”.
 Without packing, the states are: ( (( ((. ((.) ((.)) . .( .(. .(.) .(.).
 With packing and unpacking, the states are: ( ?( ?(. ((.) ((.)) . .(.)]
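
The effect of packing can be illustrated by grouping all partial structures of a given length by their stack-top index. The sketch below ignores base identity (every “(” may later be closed), which is a simplifying assumption, and the function name `pack_prefixes` is hypothetical.

```python
from collections import defaultdict


def pack_prefixes(length):
    """Group all partial dot-bracket strings of this length by stack-top index."""
    # each state: (partial structure, tuple of unmatched "(" positions)
    states = [("", ())]
    for j in range(length):
        nxt = []
        for tags, stack in states:
            nxt.append((tags + ".", stack))              # skip
            nxt.append((tags + "(", stack + (j,)))       # push
            if stack:
                nxt.append((tags + ")", stack[:-1]))     # pop
        states = nxt
    packed = defaultdict(list)
    for tags, stack in states:
        top = stack[-1] if stack else None   # None: nothing waits for a partner
        packed[top].append(tags)
    return packed


for length in (3, 6, 9):
    groups = pack_prefixes(length)
    total = sum(len(members) for members in groups.values())
    print(length, "prefixes:", total, "packed states:", len(groups))
```

The number of partial structures grows exponentially with the prefix length, while the number of distinct stack-top indices is at most the prefix length plus one, which is what makes the packed search tractable.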

slide-29
SLIDE 29

Results

slide-30
SLIDE 30

LinearFold with SOTA Prediction Models

  • models from two widely-used folding engines
      • CONTRAfold MFE (machine-learned)
      • Vienna RNAfold (thermodynamic)
  • we linearized both systems from O(n^3) to O(n)

             efficiency           systems
             time      space      machine-learned   thermodynamic
baselines    O(n^3)    O(n^2)     CONTRAfold        Vienna RNAfold
our work     O(n)      O(n)       LinearFold-C      LinearFold-V

slide-32
SLIDE 32

Efficiency & Scalability: O(n) time, O(n) memory

Archive II data set (~3,000 seqs, max len: ~3,000 nt)

[Figure: running time (seconds, 2-10) vs. sequence length (1000nt-3000nt)]

Vienna RNAfold: ~n^2.4
CONTRAfold MFE: ~n^2.2
LinearFold-V: ~n^1.2
LinearFold-C: ~n^1.1

RNAcentral data set (sampled, max len: ~250,000 nt)

[Figure: Vienna RNAfold, CONTRAfold, LinearFold-C and LinearFold-V compared on the sampled RNAcentral sequences]

10,000nt (~HIV): 4min vs. 7s
244,296nt (longest in RNAcentral): ~200hrs vs. 120s
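
The slide reports empirical growth rates such as ~n^2.4 without saying how they were obtained. One common way to estimate such an exponent is a least-squares line fit of log(time) against log(length); the sketch below shows that approach on synthetic timings. The function name and the numbers are illustrative assumptions, not measurements from the talk.

```python
import math


def fit_exponent(lengths, seconds):
    """Slope of the least-squares line through (log n, log t) points."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    den = sum((a - mean_x) ** 2 for a in xs)
    return num / den


# Synthetic timings that follow t = c * n^2.4 exactly (made up for the example).
lengths = [500, 1000, 2000, 3000]
timings = [1e-6 * n ** 2.4 for n in lengths]
print(round(fit_exponent(lengths, timings), 3))
```

On real timings the points do not fall exactly on a line, so the fitted slope is only an empirical summary over the measured length range.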

slide-33
SLIDE 33

Accuracy

  • Tested on Archive II dataset (on a family-by-family basis)
  • significantly better on 3 long families
  • biggest boost on the longest families: 16S/23S rRNAs
  • LinearFold-V vs. Vienna is similar

[Figure: Precision and Recall (axis 40-80) per family (tRNA, 5S rRNA, SRP, RNaseP, tmRNA, Group I Intron, telomerase RNA, 16S rRNA, 23S rRNA), Standard O(n^3) search vs. LinearFold: O(n) search; some families are marked with asterisks]
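
Precision and recall for a predicted structure can be computed from the sets of base pairs in the prediction and in the reference. The sketch below uses exact pair matching, which is an assumption: the slide does not spell out the evaluation protocol, and the helper names are hypothetical.

```python
def pair_set(y):
    """Set of (i, j) pairs encoded by a balanced dot-bracket string."""
    stack, pairs = [], set()
    for j, tag in enumerate(y):
        if tag == "(":
            stack.append(j)
        elif tag == ")":
            pairs.add((stack.pop(), j))
    return pairs


def precision_recall(predicted, gold):
    """Precision = correct / predicted pairs; recall = correct / gold pairs."""
    p, g = pair_set(predicted), pair_set(gold)
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return precision, recall


# The prediction finds one of the two reference pairs and nothing else.
print(precision_recall("(....)", "((..))"))  # (1.0, 0.5)
```

Averaging these two numbers within each family first, as the family-by-family evaluation suggests, keeps families with many sequences from dominating the overall result.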

slide-37
SLIDE 37

Improvements on Long-Range Base Pairs

  • long-distance pairs are well-known to be hard to predict
  • LinearFold outputs fewer long-range base pairs, but more correct ones

[Figure: PPV and Sensitivity (axis 10-70) by base pair distance (1-200, 201-500, >500), CONTRAfold MFE vs. LinearFold-C b=100]

[Figure: predicted structures of E. coli 23S rRNA (2,904 nt) drawn from 5’ to 3’ over positions 200-2800, one panel each for LinearFold-C, LinearFold-V, CONTRAfold and Vienna RNAfold; the LinearFold panels label positions 582 and 2901]

slide-38
SLIDE 38

Search (Approximation) Quality

  • search error = energy (or model score) gap between exact search and our search
  • search error quickly shrinks to 0 as beam size increases
  • search error grows linearly with sequence length (constant search error per nucleotide)
  • both PPV & Sensitivity peak around beam=120, but we choose beam=100 for simplicity

[Figure: free energy gap (Vienna RNAfold vs. LinearFold-V) and model score gap (CONTRAfold MFE vs. LinearFold-C) against beam size (b), from 20 to 300]

[Figure: free energy gap and model score gap against sequence length (n), from 500 to 3000, by family (tRNA, 5S rRNA, SRP, RNaseP, tmRNA, Group I Intron, telomerase RNA, 16S rRNA, 23S rRNA); CONTRAfold MFE vs. LinearFold-C: ~n^1.0]

[Figure: PPV against Sensitivity (both axes 48-57) for CONTRAfold MFE and for LinearFold-C at b=20, 50, 75, 100, 120, 150, 200]

slide-39
SLIDE 39

Connections to Cotranscriptional Folding

  • RNAs & proteins fold while being assembled
  • RNA & protein sequences evolve to be incrementally foldable
  • these might explain why LinearFold performs better than exact search
slide-40
SLIDE 40

LinearFold Server & GitHub

http://linearfold.org

  • fastest RNA folding server (by a very large margin)
      • thanks to O(n) time
  • longest sequence limit
      • 100,000 nt (10x Vienna)
      • thanks to O(n) space
  • source code on GitHub
      • unified implementation of LinearFold-C & LinearFold-V

https://github.com/LinearFold/LinearFold

slide-41
SLIDE 41

Extensions of LinearFold

  • LinearFold: single strand, single structure, no pseudoknots
  • LinearCoFold: two strand folding => multistrand also
  • Learning to Fold: learning folding parameters in linear time
  • LinearPseudoFold: (restricted subset of) pseudoknots
  • LinearPartition: partition function and base pair probabilities

slide-42
SLIDE 42

first linear-time (approximate) RNA folding algorithm
better in accuracy (esp. long seqs and long-range pairs)
http://linearfold.org

slide-43
SLIDE 43

Thank you very much!

slide-44
SLIDE 44

Backup Slide: Local Folding

23S rRNA E. coli (2904 nt, 830 pairs)

[Figure: three predicted structures drawn from 5’ to 3’ over positions 200-2800; the LinearFold-V panel labels positions 582 and 2901]

method                          PPV     Sens.   pairs   time
Vienna RNAfold                  52.02   57.47   917     7.6s
Vienna RNAfold Local (L=150)    51.63   55.18   887     6.3s
LinearFold-V b = 100            55.60   62.17   928     2.2s