LinearFold: Linear-Time RNA Folding



SLIDE 1

LinearFold

Linear-Time RNA Folding

Liang Huang

Baidu Research USA & Oregon State University

Joint work with Dezhong Deng (Oregon State / Baidu), Kai Zhao (Oregon State / Google), David Hendrix (Oregon State), and David Mathews (Rochester)

x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

Stanford University School of Medicine, July 2018

SLIDE 2

A Bit About Myself…

Ph.D., 2008; Research Scientist, 2009; Assistant Professor, 2015-; Principal Scientist, 2018-

  • my main area is computational linguistics (aka natural language processing)
  • where I develop faster (linear-time) algorithms to understand/translate languages
  • but I also apply these algorithms to computational structural biology…
SLIDE 3

RNA Structure Prediction and Design

[Figure: RNA sequence, RNA secondary structure, and RNA 3D structure, linked by structure prediction and design; illustrated with CRISPR/Cas9 (gene editing) and M. tuberculosis.]

Example sequence: GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
SLIDE 4

RNA Structure Prediction (Folding)

example: transfer RNA (tRNA), 76 nt

x = GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
y = (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

[Figure: the tRNA secondary structure drawn 5’ to 3’ and the same structure drawn as a parse tree over positions 1 to 76.]

  • allowed pairs: G-C, A-U, G-U
  • assume no crossing pairs
  • challenge: existing structure prediction algorithms are way too slow: O(n^3)
  • solution: borrow linear-time algorithms from natural language parsing
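A minimal sketch of how the dot-bracket string y encodes base pairs of the sequence x on this slide, assuming the usual stack-based reading (each “(” opens a pair and the matching “)” closes it). The function names are hypothetical and are not taken from the LinearFold code:

```python
# Allowed pairs from the slide: G-C, A-U, G-U (in either orientation).
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def parse_pairs(structure):
    """Return the sorted list of 0-based (i, j) pairs in a dot-bracket string."""
    stack, pairs = [], []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r} at position {j}")
    if stack:
        raise ValueError("unmatched '('")
    return sorted(pairs)


def all_pairs_allowed(seq, structure):
    """Check that every pair in the structure is one of the allowed pairs."""
    return all(seq[i] + seq[j] in ALLOWED for i, j in parse_pairs(structure))


x = "GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA"
y = "(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))...."

print(len(parse_pairs(y)), all_pairs_allowed(x, y))
```

Because the brackets are matched with a single stack, any string accepted by `parse_pairs` is free of crossing pairs by construction.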

SLIDE 5

Our Linear-Time Prediction is Much Faster…

[Figure: running time per sequence (sec) vs. sequence length, on a linear scale (1000 nt to 3000 nt) and on a log-log scale (10^3 nt to 10^5 nt); one plot annotation reads “2 hrs”.]

  • Vienna RNAfold: ~n^2.6
  • CONTRAfold MFE: ~n^2.6
  • LinearFold b=100: ~n^1.0
  • LinearFold b=50: ~n^1.0
  • 10,000 nt (~HIV): 4 min (baselines) vs. 7 s (LinearFold)
  • 244,296 nt (longest in RNAcentral): ~200 hrs (baselines) vs. 120 s (LinearFold)

with even slightly better prediction accuracy!!
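The exponents quoted on this slide (~n^2.6 vs. ~n^1.0) are empirical slopes on a log-log plot. A minimal sketch of such a fit is below; it uses synthetic timings, not the measurements behind the slide, and the function name and constants are assumptions:

```python
import math


def scaling_exponent(lengths, seconds):
    """Least-squares slope of log(time) against log(length)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    den = sum((a - mean_x) ** 2 for a in xs)
    return num / den


# Synthetic timings (assumed constants), generated from exact power laws.
lengths = [1000, 2000, 4000, 8000]
superquadratic = [1e-7 * n ** 2.6 for n in lengths]
linear = [1e-3 * n for n in lengths]

print(scaling_exponent(lengths, superquadratic))
print(scaling_exponent(lengths, linear))
```

On real measurements the points will not lie exactly on a line, so the fitted slope is only a summary of the trend over the measured length range.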

SLIDE 6

Computational Linguistics => Computational Biology

Timeline across linguistics, biology, and computer science:

  • 1953 Watson & Crick: DNA double-helix
  • 1955 Chomsky: context-free grammars
  • 1958 Backus & Naur: CFGs in programming languages
  • 1964 Cocke / 1965 Kasami / 1967 Younger: CKY parsing, O(n^3)
  • 1965 Knuth: LR parsing, O(n)
  • 1970 Joshi: tree-adjoining grammars (TAGs)
  • 1980s: O(n^3) CKY for RNA structures
  • 1985 Shieber: non-CF languages
  • 1985: CKY-style TAG parsing in O(n^6)
  • 1986 Tomita: generalized LR parsing
  • ~1990: linear-time greedy parsing
  • 1999: TAGs for RNA pseudoknots
  • 2010: linear-time DP parsing (Huang & Sagae)
  • 2018: LinearFold: O(n) RNA structure prediction

SLIDE 7

Current Structure Prediction Method: O(n^3)

[Figure: bottom-up CKY chart for the example sequence CAAGU; a span (i, j) is built either by pairing i with j around the inner span (i+1, j-1), or by combining two sub-spans split at k.]

  • Dynamic Programming: O(n^3)
  • bottom-up CKY parsing
  • example: maximize # of pairs (A-U, G-C, G-U)
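A minimal sketch of the bottom-up O(n^3) dynamic program in its simplest form, maximizing the number of pairs as in the example above. It assumes no minimum hairpin-loop length and no energy model, and `max_pairs` is a hypothetical name:

```python
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def max_pairs(seq):
    """Maximum number of non-crossing allowed pairs, bottom-up over span length."""
    n = len(seq)
    # best[i][j] = max number of pairs inside the half-open span seq[i:j]
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Case 1: the last nucleotide of the span (j-1) is unpaired.
            score = best[i][j - 1]
            # Case 2: nucleotide j-1 pairs with some k in the span.
            for k in range(i, j - 1):
                if seq[k] + seq[j - 1] in PAIRS:
                    score = max(score, best[i][k] + best[k + 1][j - 1] + 1)
            best[i][j] = score
    return best[0][n]


print(max_pairs("CCAGG"))
```

The three nested loops (span length, start i, split point k) are where the cubic running time comes from.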

SLIDE 8

How to Fold RNAs in Linear-Time?

5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

(Huang and Sagae, 2010)

  • idea 0: tag each nucleotide from left to right
  • maintain a stack: push “(”, pop “)”, skip “.”
  • exhaustive: O(3^n)
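Idea 0 can be sketched as a recursive enumeration over the three actions, keeping an explicit stack of unpaired openings. This is only an illustration of the exhaustive search that maximizes the number of pairs; the function names are hypothetical:

```python
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def enumerate_structures(seq):
    """Yield every complete dot-bracket structure reachable by push/skip/pop."""
    n = len(seq)

    def extend(j, stack, prefix):
        if j == n:
            if not stack:  # every opening must have been closed
                yield "".join(prefix)
            return
        # skip: leave nucleotide j unpaired
        yield from extend(j + 1, stack, prefix + ["."])
        # push: open a bracket at j
        yield from extend(j + 1, stack + [j], prefix + ["("])
        # pop: close the most recent opening, if the two nucleotides can pair
        if stack and seq[stack[-1]] + seq[j] in PAIRS:
            yield from extend(j + 1, stack[:-1], prefix + [")"])

    yield from extend(0, [], [])


def best_structure(seq):
    """Pick a structure with the most pairs by brute force."""
    return max(enumerate_structures(seq), key=lambda s: s.count("("))


print(best_structure("CCAGG"))
```

With up to three actions per nucleotide, the number of explored paths grows as O(3^n), so this is usable only on very short sequences.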

SLIDE 9

How to Fold RNAs in Linear-Time?

5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

(Huang and Sagae, 2010)

  • idea 1: DP by merging “equivalent states”
  • maintain graph-structured stacks
  • DP: O(n^3)
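A minimal sketch of idea 1 for the pair-counting objective, assuming a state is identified by its stack top (the last unpaired opening) and that each state remembers only the best score inside its own span. The indexing scheme and names are assumptions for illustration, not the LinearFold implementation:

```python
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def fold_left_to_right(seq):
    """Exact left-to-right DP; states at step j are keyed by stack top i."""
    n = len(seq)
    # best[j][i] = max pairs inside seq[i+1:j], where i is the last unpaired
    # opening; i = -1 is a sentinel meaning "empty stack".
    best = [dict() for _ in range(n + 1)]
    best[0][-1] = 0
    for j in range(n):
        nxt = best[j + 1]
        for i, score in best[j].items():
            # skip: nucleotide j stays unpaired
            nxt[i] = max(nxt.get(i, 0), score)
            # push: j becomes the new stack top with an empty span
            nxt.setdefault(j, 0)
            # pop: pair i with j, then reconnect to every state that pushed i
            if i >= 0 and seq[i] + seq[j] in PAIRS:
                for k, left in best[i].items():
                    nxt[k] = max(nxt.get(k, 0), left + score + 1)
    return best[n][-1]


print(fold_left_to_right("CCAGG"))
```

There are O(n) steps, O(n) stack tops per step, and O(n) predecessors to reconnect on a pop, which gives the O(n^3) bound on the slide.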

SLIDE 11

How to Fold RNAs in Linear-Time?

5’ GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA 3’
   (((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....

(Huang and Sagae, 2010)

[Figure labels: each DP state corresponds to exponentially many non-DP states; graph-structured stack (GSS) (Tomita, 1986).]

  • idea 2: approximate search: beam pruning
  • keep only top b states per step
  • DP+beam: O(n)
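Beam pruning can be sketched on top of a stack-top-merged DP for the pair-counting objective. In this sketch the states are ranked by the score inside their own span, and the empty-stack state is always kept so that a result exists; both choices are simplifying assumptions and may differ from the ranking used in LinearFold:

```python
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def beam_fold(seq, beam=100):
    """Left-to-right DP keyed by stack top, keeping about `beam` states per step."""
    n = len(seq)
    # best[j][i] = best pair count inside seq[i+1:j]; i = -1 means empty stack.
    best = [dict() for _ in range(n + 1)]
    best[0][-1] = 0
    for j in range(n):
        states = best[j]
        if len(states) > beam:
            ranked = sorted(states.items(), key=lambda kv: kv[1], reverse=True)
            kept = dict(ranked[:beam])
            kept[-1] = states[-1]  # assumption: never prune the empty-stack state
            best[j] = states = kept
        nxt = best[j + 1]
        for i, score in states.items():
            nxt[i] = max(nxt.get(i, 0), score)   # skip
            nxt.setdefault(j, 0)                 # push
            if i >= 0 and seq[i] + seq[j] in PAIRS:
                for k, left in best[i].items():  # pop
                    nxt[k] = max(nxt.get(k, 0), left + score + 1)
    return best[n][-1]


trna = "GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA"
print(beam_fold(trna, beam=100), beam_fold(trna, beam=5))
```

For a fixed beam size the work per step no longer depends on n, so the total running time is linear in the sequence length; the price is that a small beam can miss the best-scoring structure.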

SLIDE 12

Another View: Left-to-Right CKY

  • many variants of CKY ~ various topological orderings
  • all O(n^3), but the incremental ones can apply beam search to run in O(n)

[Figure: three chart-filling orders (bottom-up, left-to-right, right-to-left), each ending at the goal item (S, 0, n).]

SLIDE 14

On to details...

SLIDE 15

An Example Path

push, push, skip, pop, pop (i.e., the structure “((.))”)

SLIDES 16-21

Version 1: Exhaustive Search O(3^n)

SLIDE 22

Idea 1a: Merge Identical Stacks

Merge states with the same full stack (unpaired openings): “Equivalent States”

SLIDES 23-25

Version 2: Merge by Full Stack O(2^n)

  • merge states with identical stacks

[Figure labels: exhaustive; full-stack merge; O(2^n).]

SLIDE 26

Idea 1b: Merge “Temporary Equivalents”

Merge states with the same top of the stack (last unpaired opening): “Temporarily Equivalent States”
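To make the effect of the two merging ideas concrete, the sketch below counts, after each step, how many search paths exist, how many distinct full stacks they reduce to, and how many distinct stack tops remain. It is an illustration only, and `count_states` is a hypothetical helper:

```python
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def count_states(seq):
    """Per step: (number of paths, distinct full stacks, distinct stack tops)."""
    paths = [()]  # one entry per path; each entry is that path's stack
    counts = []
    for j, base in enumerate(seq):
        nxt = []
        for stack in paths:
            nxt.append(stack)                # skip
            nxt.append(stack + (j,))         # push
            if stack and seq[stack[-1]] + base in PAIRS:
                nxt.append(stack[:-1])       # pop
        paths = nxt
        full_stacks = set(paths)
        tops = {s[-1] if s else -1 for s in paths}  # -1 marks the empty stack
        counts.append((len(paths), len(full_stacks), len(tops)))
    return counts


for step, row in enumerate(count_states("CCAGG"), start=1):
    print(step, row)
```

The number of distinct stack tops after step j can never exceed j + 1, which is why merging by stack top brings the state space down to polynomial size.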

SLIDES 27-31

Version 3: Merge by Stack Top O(n^3)

[Figure labels: packing (of temporarily equivalent states); unpacking; O(2^n).]

SLIDES 32-33

Close Up Look at Two Paths

SLIDE 34

Idea 2: Beam Pruning

[Figure labels: stack-top merge; full-stack merge; O(2^n).]

SLIDE 35

Version 4: DP with Beam Search O(n)

[Figure labels: stack-top merge; + beam pruning.]

SLIDE 36

Recap: O(3^n) to O(n^3) to O(n)

  • 5 search algorithms
  • DP: bottom-up CKY: O(n^3)
  • left-to-right (exhaustive): O(3^n)
  • DP: merge by full stack: O(2^n)
  • DP: merge by stack top: O(n^3)
  • approx. DP via beam search: O(n)
  • this is a simple illustration in which we just maximize the number of pairs
  • our real systems work with complicated feature templates

[Figure: search graphs for the example sequence CCAGG (prefixes C, CC, CCA, CCAG, CCAGG) under four settings: no DP, O(3^n); DP with full-stack merge, O(2^n); DP + GSS, O(n^3); LinearFold (approx. DP, + beam), O(n). Each panel ends at the structure ((.)).]

SLIDE 37

Connections to Incremental Parsing

  • shared key observation: local ambiguity packing
  • pack non-crucial local ambiguities along the way
  • unpack (in a reduce action) only when needed

(Huang and Sagae, 2010)

psycholinguistic evidence (eye-tracking experiments): delayed disambiguation (Frazier and Rayner, 1990; Frazier, 1999)

  • John and Mary had 2 papers each
  • John and Mary had 2 papers together

SLIDE 38

Experiments

SLIDE 39

Applying Prediction Models on LinearFold

  • models from the two most widely used systems
  • CONTRAfold MFE (machine-learned)
  • Vienna RNAfold (thermodynamic)
  • we linearized both systems from O(n^3) to O(n)

             efficiency           systems
             time      space      machine-learned   thermodynamic
baselines    O(n^3)    O(n^2)     CONTRAfold        Vienna RNAfold
our work     O(n)      O(n)       LinearFold-C      LinearFold-V

SLIDE 40

Efficiency & Scalability

  • tested on two datasets
  • Archive II dataset (labeled, like Penn Treebank)
  • #: 2,889, avglen: 222 nt; up to 2,927 nt
  • used for both efficiency and accuracy evaluations
  • RNAcentral dataset (unlabeled, like Gigaword)
  • over 1.3M unlabeled sequences; up to 244,296 nt
  • only used for efficiency/scalability evaluations

[Figure: memory used (10 MB to 10 GB) vs. sequence length (10^3 nt to 10^5 nt): CONTRAfold MFE ~n^2.0; Vienna RNAfold ~n^2.0; LinearFold b=100 ~n^1.0; LinearFold b=50 ~n^1.0.]

[Figure: running time per sequence (sec) vs. sequence length, on a linear scale (1000 nt to 3000 nt) and on a log-log scale (10^3 nt to 10^5 nt): Vienna RNAfold ~n^2.6; CONTRAfold MFE ~n^2.6; LinearFold b=100 ~n^1.0; LinearFold b=50 ~n^1.0; one plot annotation reads “(2 hrs)”.]

SLIDE 41

Accuracy

  • use beam=100 for LinearFold
  • tested on the Archive II dataset (on a family-by-family basis)
  • significant improvements on 3 longer families
  • biggest improvements on the longest families: 16S/23S rRNAs

[Figure: PPV and Sensitivity (axis 40 to 80) by family (tRNA, 5S rRNA, SRP, RNaseP, tmRNA, Group I Intron, telomerase RNA, 16S rRNA, 23S rRNA) for CONTRAfold MFE vs. LinearFold-C b=100 and Vienna RNAfold vs. LinearFold-V b=100, with asterisks (* / **) on some families.]

%                       CONTRAfold   LinearFold-C   Vienna   LinearFold-V
PPV (precision)         54.8         56.9 (+2.1)    50.7     50.9 (+0.2)
Sensitivity (recall)    55.7         57.1 (+1.4)    59.3     59.5 (+0.2)
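PPV and Sensitivity compare the set of predicted base pairs with the set of annotated ones. A minimal sketch is below, assuming exact-match comparison of pairs read from dot-bracket strings; published evaluations may use a more lenient matching rule, and the function names are hypothetical:

```python
def pairs_of(structure):
    """Set of 0-based (i, j) pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.add((stack.pop(), j))
    return pairs


def ppv_sensitivity(predicted, gold):
    """PPV = correct / predicted pairs; Sensitivity = correct / gold pairs."""
    pred, ref = pairs_of(predicted), pairs_of(gold)
    correct = len(pred & ref)
    ppv = correct / len(pred) if pred else 0.0
    sensitivity = correct / len(ref) if ref else 0.0
    return ppv, sensitivity


print(ppv_sensitivity("(....)", "((..))"))
```

A prediction with fewer pairs than the annotation can reach a high PPV while its Sensitivity stays low, which is the trade-off reported in the two rows of the table.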

SLIDE 42

Beam Size and Search Quality

  • beam search achieves good search quality starting at b=100
  • slightly under-predicts compared to exact search, but close

[Figure: number of pairs predicted (axis 57 to 72) vs. beam size (20 to 300) for Vienna RNAfold, LinearFold-V, CONTRAfold MFE, LinearFold-C, and the ground truth.]

[Figure: average free energy and average model cost (axis ticks -83 to -77 and -24 to -21) vs. beam size (20 to 300) for Vienna RNAfold, LinearFold-V, CONTRAfold MFE, and LinearFold-C.]

SLIDE 43

Beam Size and PPV/Sensitivity

  • PPV/Sensitivity trade-off as a function of beam size
  • stable PPV/Sensitivity around b=100-150

[Figure: PPV and Sensitivity (axis 49 to 58) vs. beam size (20 to 300), and a PPV vs. Sensitivity curve with points at b=20, 50, 75, 100, 120, 150, and 300, for CONTRAfold MFE and LinearFold-C.]

SLIDE 44

Improvements on Long-Range Pairs

  • long-distance pairs are well-known to be hard to predict
  • but our beam search systems are even better on those

[Figure: PPV and Sensitivity (axis 10 to 70) binned by base pair distance, CONTRAfold MFE vs. LinearFold-C b=100.]

SLIDE 45

Incremental Folding

  • left-to-right vs. right-to-left folding (cotranscriptional folding?)

[Figure: PPV / Sensitivity (axis 49 to 58) vs. negative free energy (axis 21 to 23.5); series: linear, right-to-left, and contrafold, each with PPV and Sensitivity.]

SLIDE 46

Example Predictions: CONTRAfold

SLIDE 47

Example Predictions: Vienna RNAfold

SLIDE 48

Web Demo 1 (Beam Sizes)

http://web.engr.oregonstate.edu/~liukaib/demo_json+canvas.html

SLIDE 49

Web Demo 2 (Live Demo)

http://linearfold.eecs.oregonstate.edu

SLIDE 50

Bonus Track: Deep Learning for RNA Folding

SLIDE 51

Existing Models

  • Physics-based energy model
  • hundreds of carefully designed and experimentally measured thermodynamic parameters
  • Machine-learning-based model
  • claimed “without physics”, but still using physics-based feature templates
  • just learn the feature weights (parameters) from data
  • Can we really do it “without physics”?

SLIDE 52

Deep Learning

  • Deep Learning for RNA structure prediction
  • RNNs automatically extract features and learn their weights
  • Completely without physics
  • First deep learning RNA secondary structure prediction algorithm
  • We borrow techniques from natural language parsing
  • where deep learning has been very successful since ~2016
  • our group’s “span-based” parsing framework is now dominant

[Figure: three training pipelines from input x to output y through model w: “idealized” ML (no feature map); “actual” ML (with a feature map ϕ); deep learning ≈ representation learning (feature map ϕ).]

SLIDE 53

Our Approach: from NLP Span Parsing

  • Span differences are taken from an encoder (in this case: a bi-LSTM)
  • A span is scored and labeled by a feed-forward network
  • The score of a tree is the sum of all the labeled span scores

$$ s_{\mathrm{tree}}(t) = \sum_{(i,j,X) \in t} s(i, j, X) $$

[Figure: a bi-LSTM over “⟨s⟩ You should eat ice cream ⟨/s⟩” with forward states f_0 ... f_5 and backward states b_0 ... b_5; span (i, j) is represented by (f_j - f_i, b_i - b_j) and scored as s(i, j, X).]

Cross + Huang 2016; Stern et al. 2017; Wang + Chang 2016
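The span representation (f_j - f_i, b_i - b_j) and the additive tree score can be sketched in plain Python. The encoder outputs and the scorer below are toy stand-ins: a real system would use bi-LSTM states and a feed-forward network, and all names and numbers here are assumptions:

```python
def span_features(f, b, i, j):
    """Concatenate the forward difference f_j - f_i and backward difference b_i - b_j."""
    forward = [fj - fi for fj, fi in zip(f[j], f[i])]
    backward = [bi - bj for bi, bj in zip(b[i], b[j])]
    return forward + backward


def span_score(features, weights, label):
    """Toy linear scorer standing in for the feed-forward network."""
    return sum(w * x for w, x in zip(weights[label], features))


def tree_score(tree, f, b, weights):
    """Sum of labeled span scores over all (i, j, X) in the tree."""
    return sum(span_score(span_features(f, b, i, j), weights, label)
               for i, j, label in tree)


# Toy 1-dimensional encoder states at fence-post positions 0, 1, 2.
f = [[0.0], [1.0], [3.0]]
b = [[5.0], [2.0], [0.0]]
weights = {"X": [1.0, 1.0], "Y": [2.0, 0.0]}
tree = [(0, 2, "X"), (0, 1, "Y")]

print(tree_score(tree, f, b, weights))
```

Because the tree score decomposes over spans, the best tree can be searched for with the same chart-style dynamic programs used elsewhere in the talk.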

SLIDE 54

Our Approach: RNA Structure Prediction

  • Replace words with nucleotides (A, C, G, U) as inputs
  • 3 labels for a span:
  • Left-unpair: .
  • Left-pair: ( )
  • Pair: ( )

[Figure: the same span encoder over “⟨s⟩ C A G A C ⟨/s⟩” with forward states f_0 ... f_5 and backward states b_0 ... b_5; span (i, j) is represented by (f_j - f_i, b_i - b_j) and scored as s(i, j, X).]

SLIDE 55

Experiments

  • database: S-Full (Andronescu et al. 2008; 2010)
  • preprocessing (cap at 700 nt; removing pseudoknots and non-canonical base pairs); 3245 distinct structures; 80% / 20% split for training / test
  • randomly split into training set (2586) and testing set (659)
  • can be found at https://www.cs.bgu.ac.il/~negevcb/contextfold/

                   system                       precision   recall   F-score
Physics-based      Vienna RNAfold               53.35       61.68    57.21
Machine learning   CONTRAfold (off the shelf)   55.75       53.63    55.75
Machine learning   CONTRAfold (retrained)       61.13       65.80    63.38
Deep learning      Our model                    77.06       59.55    67.18
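The F-score column is assumed here to be the harmonic mean of precision and recall (the balanced F1 measure). A short sketch that recomputes it for three of the systems in the table, with a hypothetical helper name:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) values copied from the table on this slide.
rows = {
    "Vienna RNAfold": (53.35, 61.68),
    "CONTRAfold (retrained)": (61.13, 65.80),
    "Our model": (77.06, 59.55),
}

for name, (p, r) in rows.items():
    print(f"{name}: F = {f_score(p, r):.2f}")
```

The harmonic mean rewards balance: a large gap between precision and recall pulls the F-score toward the lower of the two.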

SLIDE 56

Thank you very much

非常感谢! (fēi cháng gǎn xiè!)


Just google “LinearFold”