SLIDE 1

COMPUTING OPTIMAL FLOW DECOMPOSITIONS FOR ASSEMBLY

Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel 2018/03/27 North Carolina State University RWTH Aachen University

SLIDE 2

MOTIVATION

SLIDE 3

The Problem

Shared segments between DNA/RNA strands create ambiguity in the assembly problem.

SLIDE 4

The Problem

Connecting overlapping segments and counting their frequencies yields a splice-graph.

SLIDE 5

The Problem

SLIDE 6

The Problem

The problem is to split the flow into s-t paths in order to recover the original DNA/RNA strands.

SLIDE 7

The Problem

SLIDE 8

The Problem

k-Flow Decomposition (k-FD)

Input: (G, f, k) with an s-t DAG G, a flow f on G, and a positive integer k.
Problem: Find an integral flow decomposition of (G, f) using at most k paths.
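To make the k-FD definition concrete, here is a minimal sketch of a checker for candidate solutions. The function name, the dictionary encoding of the flow, and the toy instance are assumptions made for illustration (they are not taken from the Toboggan code), and the sketch assumes a simple DAG without parallel edges.

```python
from collections import defaultdict


def is_flow_decomposition(flow, paths, s, t, k):
    """Check whether weighted s-t paths decompose the flow using at most k paths.

    flow  -- dict mapping each edge (u, v) to its positive integer flow value
    paths -- list of (weight, [s, ..., t]) pairs
    """
    if len(paths) > k:
        return False
    routed = defaultdict(int)
    for weight, nodes in paths:
        # Weights must be positive integers (integral decomposition).
        if not isinstance(weight, int) or weight <= 0:
            return False
        # Every path must run from s to t.
        if len(nodes) < 2 or nodes[0] != s or nodes[-1] != t:
            return False
        for edge in zip(nodes, nodes[1:]):
            if edge not in flow:
                return False
            routed[edge] += weight
    # The path weights crossing each edge must add up to exactly its flow.
    return all(routed[edge] == value for edge, value in flow.items())


# Hypothetical toy instance (not from the slides).
flow = {("s", "a"): 5, ("a", "t"): 3, ("a", "b"): 2, ("s", "b"): 1, ("b", "t"): 3}
paths = [(3, ["s", "a", "t"]), (2, ["s", "a", "b", "t"]), (1, ["s", "b", "t"])]
print(is_flow_decomposition(flow, paths, "s", "t", k=3))
print(is_flow_decomposition(flow, paths, "s", "t", k=2))
```

The first call accepts the three paths; the second rejects them only because the budget k = 2 is exceeded.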

SLIDE 12

Related Work

k-Flow Decomposition (k-FD)

Input: (G, f, k) with an s-t DAG G, a flow f on G, and a positive integer k.
Problem: Find an integral flow decomposition of (G, f) using at most k paths.

How do we choose k? → minimization

A novel min-cost flow method for estimating transcript expression with RNA-Seq. A. I. Tomescu et al.
Efficient Heuristic for Decomposing a Flow with Minimum Number of Paths. M. Shao & C. Kingsford

Problem is NP-hard even for weights {1, 2, 4}

How to split a flow? T. Hartman et al.
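Since minimizing k is NP-hard, a common baseline is a greedy heuristic that repeatedly peels off an s-t path with the largest bottleneck. The sketch below illustrates that generic idea only; it is neither Toboggan nor the Catfish heuristic, and the function names and toy instance are assumptions. Its output is always a valid decomposition, so its length is an upper bound on the minimum k, with no guarantee of optimality.

```python
from collections import defaultdict


def widest_path(residual, s, t):
    """Return (bottleneck, nodes) of an s-t path maximizing the minimum edge flow."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    for u, v in residual:
        succ[u].append(v)
        indegree[v] += 1
    width = {s: float("inf")}
    pred = {}
    # Kahn's algorithm: visit nodes in topological order of the DAG.
    ready = [u for u in list(succ) if indegree[u] == 0]
    while ready:
        u = ready.pop()
        for v in succ[u]:
            candidate = min(width.get(u, 0), residual[(u, v)])
            if candidate > width.get(v, 0):
                width[v], pred[v] = candidate, u
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if t not in pred:
        raise ValueError("no s-t path left; input is not a valid s-t flow")
    nodes = [t]
    while nodes[-1] != s:
        nodes.append(pred[nodes[-1]])
    return width[t], nodes[::-1]


def greedy_decomposition(flow, s, t):
    """Peel off widest paths until no flow remains."""
    residual = {edge: value for edge, value in flow.items() if value > 0}
    paths = []
    while residual:
        weight, nodes = widest_path(residual, s, t)
        for edge in zip(nodes, nodes[1:]):
            residual[edge] -= weight
            if residual[edge] == 0:
                del residual[edge]  # each round saturates at least one edge
        paths.append((weight, nodes))
    return paths


# Hypothetical toy instance (not from the slides).
flow = {("s", "a"): 5, ("a", "t"): 3, ("a", "b"): 2, ("s", "b"): 1, ("b", "t"): 3}
result = greedy_decomposition(flow, "s", "t")
print(len(result), result)
```

Each round removes at least one edge from the residual flow, so the loop terminates after at most as many rounds as there are edges.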

SLIDE 13

Computer Scientists...

About ten years ago, some computer scientists came by and said they heard we have some really cool problems. They showed that the problems are NP-complete and went away!

Joseph Felsenstein (Biologist)

SLIDE 15

Linear FPT

$c^n$ vs. $d^k \cdot n$

Linear FPT: exponential only in the parameter and linear in n!
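A quick numerical comparison shows why the second bound is preferable when the parameter stays small. The constants c = 2, d = 2 and k = 8 below are illustrative assumptions, not values from the slides.

```python
def exponential_in_n(c, n):
    """Running-time bound of the form c^n."""
    return c ** n


def linear_fpt(d, k, n):
    """Running-time bound of the form d^k * n."""
    return d ** k * n


# Illustrative constants only: c = d = 2, parameter k = 8.
for n in (10, 100, 1000):
    print(n, exponential_in_n(2, n), linear_fpt(2, 8, n))
```

For fixed k the second column grows without bound far faster than the third, which grows only proportionally to n.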

SLIDE 16

Observations

Data used by Shao and Kingsford:

1. 99% of instances decompose into ≤ 8 paths.
   → exploit small natural parameter.
2. ∼4 million mostly small instances.
   → handle large throughput.
3. Output decompositions.
   → reliably recover domain-specific solution.

SLIDE 22

Toboggan

Theorem

Toboggan solves k-FD in time $2^{O(k^2)}(n + \lambda)$, where $\lambda$ is the logarithm of the largest flow value.

Worst-case run-time is linear in n.
Guarantees optimal solution.

  • Gives opportunity to validate the model

Run-time competitive with current state-of-the-art heuristic.
Usable in practice.

SLIDE 23

IMPLEMENTATION & EXPERIMENTS

SLIDE 24

Repository

https://github.com/theoryinpractice/toboggan

SLIDE 27

Setup

Dataset: Available from Shao and Kingsford. Simulated sequencing data for human, mouse and zebrafish, containing ground truth.
Deviation from original setup: Trivial instances omitted. Removes around 64% of the 4M graphs.
Dedicated system with Intel i7-3770: 3.40 GHz, 8 MB cache and 32 GB RAM.

SLIDE 28

Execution Time

Median: Toboggan: 1.24 ms, Catfish: 3.47 ms

SLIDE 29

Ground Truth Validation

dataset     instances   minimal   non-minimal
zebrafish     445,880   99.907%   0.053%
mouse         473,185   99.401%   0.074%
human         529,523   99.490%   0.043%
all         1,448,588   99.589%   0.056%

SLIDE 30

Exact Recovery

k     instances   Catfish   Toboggan
2     63.2791%    0.992     0.995
3     22.0775%    0.967     0.969
4      8.5237%    0.931     0.930
5      3.4920%    0.886     0.886
6      1.5375%    0.830     0.828
7      0.6698%    0.788     0.780
8      0.2889%    0.767     0.766
9      0.1241%    0.740     0.743
10     0.0070%    0.752     0.802
11     0.0004%    0.500     0.500
all    100%       0.973     0.975
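The "all" row can be cross-checked against the per-k rows. The sketch below assumes that the Catfish and Toboggan columns are per-k fractions of instances recovered exactly and that the "all" row is their average weighted by the instance share; the slides do not state this explicitly, so it is an interpretation.

```python
# (k, share of instances in %, Catfish, Toboggan), copied from the table.
rows = [
    (2, 63.2791, 0.992, 0.995),
    (3, 22.0775, 0.967, 0.969),
    (4, 8.5237, 0.931, 0.930),
    (5, 3.4920, 0.886, 0.886),
    (6, 1.5375, 0.830, 0.828),
    (7, 0.6698, 0.788, 0.780),
    (8, 0.2889, 0.767, 0.766),
    (9, 0.1241, 0.740, 0.743),
    (10, 0.0070, 0.752, 0.802),
    (11, 0.0004, 0.500, 0.500),
]


def weighted_average(rows, column):
    """Average of rows[i][column], weighted by the instance share rows[i][1]."""
    total_share = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_share


catfish_all = round(weighted_average(rows, 2), 3)
toboggan_all = round(weighted_average(rows, 3), 3)
print(catfish_all, toboggan_all)
```

Under that assumption the weighted averages round to the values in the "all" row.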

SLIDE 31

Solutions vs. Ground Truth

SLIDE 32

ALGORITHM IDEA

SLIDE 38

The Idea

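[Figure: s-t flow network with edge flows 3, 3, 2, 1, 5, 1, 3, 3, 4. Each edge contributes one linear constraint on the path weights w; together the constraints form the linear system A w = f.]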
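The relation A w = f can be illustrated on a small example: once the k paths are fixed, each edge requires the weights of the paths crossing it to sum to its flow. The instance, the path choice, and the function names below are hypothetical, and the brute-force search over weights is used only for illustration; it is not how Toboggan solves the system.

```python
from itertools import product


def incidence_matrix(edges, paths):
    """A[i][j] = 1 if path j uses edge i, else 0."""
    path_edges = [set(zip(p, p[1:])) for p in paths]
    return [[1 if edge in used else 0 for used in path_edges] for edge in edges]


def solve_weights(A, f):
    """All positive integer vectors w with A w = f (exhaustive, tiny inputs only)."""
    k = len(A[0])
    candidates = product(range(1, max(f) + 1), repeat=k)
    return [
        w for w in candidates
        if all(sum(a * x for a, x in zip(row, w)) == value for row, value in zip(A, f))
    ]


# Hypothetical toy instance (not the network drawn on the slide).
edges = [("s", "a"), ("a", "t"), ("a", "b"), ("s", "b"), ("b", "t")]
f = [5, 3, 2, 1, 3]
paths = [["s", "a", "t"], ["s", "a", "b", "t"], ["s", "b", "t"]]

A = incidence_matrix(edges, paths)
for row, value in zip(A, f):
    print(row, "=", value)
print(solve_weights(A, f))
```

Here the rows encode w1 + w2 = 5, w1 = 3, w2 = 2, w3 = 1 and w2 + w3 = 3, so the only solution is w = (3, 2, 1).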

SLIDE 40

Dynamic Programming

[Figure: dynamic programming over consecutive sets S_{i−1}, S_i, S_{i+1}, with table entries (g_1, L_1), (g_2, L_2), ..., (g_11, L_11).]

SLIDE 41

CONCLUSION

SLIDE 46

Conclusion

Theoretical worst-case runtime linear in n.
Competitive runtime with heuristics.
Guarantees optimal k.
Python already fast, C++ even faster?

paper: https://arxiv.org/abs/1706.07851
github: https://github.com/theoryinpractice/toboggan

SLIDE 47

Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel

Thank you!

Supported in part by the Gordon & Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4560 to Blair D. Sullivan.