COMPUTING OPTIMAL FLOW DECOMPOSITIONS FOR ASSEMBLY

Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel. 2018/03/27. North Carolina State University, RWTH Aachen University


  1. COMPUTING OPTIMAL FLOW DECOMPOSITIONS FOR ASSEMBLY. Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel. 2018/03/27. North Carolina State University, RWTH Aachen University

  2. MOTIVATION

  3. The Problem: Shared segments between DNA/RNA strands create ambiguity in the assembly problem.

  4. The Problem: Connecting overlapping segments and counting their frequencies yields a splice graph.

  5. The Problem [figure]

  6. The Problem: The task is to split the flow into s-t paths in order to recover the original DNA/RNA strands.

  7. The Problem [figure]

  8. The Problem: k-Flow Decomposition (k-FD). Input: (G, f, k) with an s-t DAG G, a flow f on G, and a positive integer k. Problem: Find an integral flow decomposition of (G, f) using at most k paths.
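
To make the k-FD definition concrete, the following minimal Python sketch checks whether a proposed set of weighted paths is a valid integral decomposition. The toy instance, the function name `is_flow_decomposition`, and the dictionary-based encoding of the flow are illustrative assumptions; none of them come from the talk or from the Toboggan code base.

```python
from collections import defaultdict

def is_flow_decomposition(flow, paths, k, source="s", sink="t"):
    """Check a candidate answer to k-FD (hypothetical helper, not from Toboggan).

    flow:  dict mapping an edge (u, v) to its positive integer flow value
    paths: list of (weight, [v0, v1, ..., vm]) pairs
    """
    if len(paths) > k:
        return False
    total = defaultdict(int)
    for weight, nodes in paths:
        # Every path needs a positive integer weight and must run from source to sink.
        if not isinstance(weight, int) or weight <= 0:
            return False
        if len(nodes) < 2 or nodes[0] != source or nodes[-1] != sink:
            return False
        for edge in zip(nodes, nodes[1:]):
            if edge not in flow:
                return False
            total[edge] += weight
    # The path weights must add up to exactly the flow on every edge.
    return all(total[edge] == value for edge, value in flow.items())

# Hypothetical toy instance: an s-t DAG with integer flow values on its edges.
flow = {("s", "a"): 5, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 4, ("b", "t"): 3}
paths = [(4, ["s", "a", "t"]), (1, ["s", "a", "b", "t"]), (2, ["s", "b", "t"])]

print(is_flow_decomposition(flow, paths, k=3))
print(is_flow_decomposition(flow, paths, k=2))
```

The first call accepts the three paths; the second rejects them only because the budget k = 2 is exceeded.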

  12. Related Work: k-Flow Decomposition (k-FD, defined above). How do we choose k? → Minimization, as in "A novel min-cost flow method for estimating transcript expression with RNA-Seq" (A.I. Tomescu et al.) and "Efficient Heuristic for Decomposing a Flow with Minimum Number of Paths" (M. Shao & C. Kingsford). The problem is NP-hard even for weights {1, 2, 4}: "How to split a flow?" (T. Hartman et al.).
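
The minimization view (find the smallest k for which k-FD has a solution) can be illustrated with a naive exhaustive search. This is only a sketch for tiny instances: it takes exponential time and is neither the Toboggan algorithm nor any of the cited heuristics. The helper names and the toy instance are assumptions made for illustration, and the input is assumed to be a valid flow on an s-t DAG.

```python
def st_paths(flow, node, sink):
    """Yield every path (as a list of edges) from node to sink over edges with positive flow."""
    if node == sink:
        yield []
        return
    for (u, v), value in flow.items():
        if u == node and value > 0:
            for rest in st_paths(flow, v, sink):
                yield [(u, v)] + rest

def decompose(flow, k, source="s", sink="t"):
    """Return a list of (weight, path_edges) using at most k paths, or None if impossible."""
    if all(value == 0 for value in flow.values()):
        return []
    if k == 0:
        return None
    for path in list(st_paths(flow, source, sink)):
        bottleneck = min(flow[edge] for edge in path)
        # Try every integer weight the path could carry, then recurse on the rest.
        for weight in range(bottleneck, 0, -1):
            residual = dict(flow)
            for edge in path:
                residual[edge] -= weight
            rest = decompose(residual, k - 1, source, sink)
            if rest is not None:
                return [(weight, path)] + rest
    return None

def min_decomposition(flow, source="s", sink="t"):
    """Try k = 1, 2, ... up to the number of edges; return (k, decomposition) or None."""
    for k in range(1, len(flow) + 1):
        result = decompose(flow, k, source, sink)
        if result is not None:
            return k, result
    return None

# Hypothetical toy instance (same encoding as an edge -> flow dictionary).
flow = {("s", "a"): 5, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 4, ("b", "t"): 3}
k, solution = min_decomposition(flow)
print(k, solution)
```

Capping k at the number of edges keeps the loop finite, since a flow on a DAG never needs more paths than it has edges.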

  13. Computer Scientists... "About ten years ago, some computer scientists came by and said they heard we have some really cool problems. They showed that the problems are NP-complete and went away!" (Joseph Felsenstein, biologist)

  15. Linear FPT: an FPT algorithm runs in time $d^k \cdot n^c$. Linear FPT means exponential only in the parameter $k$ and linear in $n$ (that is, $c = 1$)!

  16. Observations. Data used by Shao and Kingsford:
      1. 99% of instances decompose into ≤ 8 paths. → Exploit small natural parameter.
      2. ∼4 million mostly small instances. → Handle large throughput.
      3. Output decompositions. → Reliably recover domain-specific solution.

  22. Toboggan. Theorem: Toboggan solves k-FD in time $2^{O(k^2)} (n + \lambda)$, where $\lambda$ is the logarithm of the largest flow value.
      - Worst-case run-time is linear in n
      - Guarantees optimal solution
        ◦ Gives opportunity to validate the model
      - Run-time competitive with current state-of-the-art heuristic
      - Usable in practice

  23. IMPLEMENTATION & EXPERIMENTS

  24. Repository: https://github.com/theoryinpractice/toboggan

  27. Setup. Dataset: Available from Shao and Kingsford. Simulated sequencing data for human, mouse and zebrafish, containing ground truth. Deviation from original setup: Trivial instances omitted, which removes around 64% of the 4M graphs. Hardware: Dedicated system with Intel i7-3770: 3.40 GHz, 8 MB cache and 32 GB RAM.

  28. Execution Time. Median: Toboggan: 1.24 ms, Catfish: 3.47 ms

  29. Ground Truth Validation

      dataset     instances   minimal   non-minimal
      zebrafish     445,880   99.907%   0.053%
      mouse         473,185   99.401%   0.074%
      human         529,523   99.490%   0.043%
      all         1,448,588   99.589%   0.056%

  30. Exact Recovery

      k     instances   Catfish   Toboggan
      2     63.2791%    0.992     0.995
      3     22.0775%    0.967     0.969
      4      8.5237%    0.931     0.930
      5      3.4920%    0.886     0.886
      6      1.5375%    0.830     0.828
      7      0.6698%    0.788     0.780
      8      0.2889%    0.767     0.766
      9      0.1241%    0.740     0.743
      10     0.0070%    0.752     0.802
      11     0.0004%    0.500     0.500
      all    100%       0.973     0.975

  31. Solutions vs. Ground Truth [figure]

  32. ALGORITHM IDEA

  38. The Idea [figure: an example s-t flow network with integer edge flows. Each edge is annotated with the sum of the unknown weights w_i of the paths routed over it, which must equal that edge's flow. Collecting these constraints gives the linear system Aw = f.]
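
One way to read the "Aw = f" slide: once it is fixed which paths use which edges, the path weights w are the solution of a linear system whose matrix A is the edge-by-path incidence matrix and whose right-hand side f lists the edge flows. The sketch below builds A for a hypothetical toy instance and solves the system exactly with rational arithmetic. It illustrates that reading under stated assumptions; it is not taken from the Toboggan implementation, and all function names are made up for this example.

```python
from fractions import Fraction

def incidence_matrix(edges, paths):
    """A[i][j] = 1 if edge i lies on path j, else 0 (paths are lists of nodes)."""
    path_edges = [set(zip(p, p[1:])) for p in paths]
    return [[1 if edge in used else 0 for used in path_edges] for edge in edges]

def solve_weights(A, f):
    """Solve A w = f exactly by Gauss-Jordan elimination.

    Returns the unique solution as a list of Fractions, or None if the system
    is inconsistent or does not determine w uniquely.
    """
    rows, cols = len(A), len(A[0])
    M = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(A, f)]
    pivot_row = 0
    for col in range(cols):
        pivot = next((r for r in range(pivot_row, rows) if M[r][col] != 0), None)
        if pivot is None:
            return None  # free variable: the weights are not uniquely determined
        M[pivot_row], M[pivot] = M[pivot], M[pivot_row]
        lead = M[pivot_row][col]
        M[pivot_row] = [x / lead for x in M[pivot_row]]
        for r in range(rows):
            if r != pivot_row and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[pivot_row])]
        pivot_row += 1
    # Rows below the pivots are zero on the left, so they must be zero on the right.
    if any(M[r][cols] != 0 for r in range(pivot_row, rows)):
        return None
    return [M[j][cols] for j in range(cols)]

# Hypothetical toy instance: edge flows f and a guessed routing of three paths.
flow = {("s", "a"): 5, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 4, ("b", "t"): 3}
paths = [["s", "a", "t"], ["s", "a", "b", "t"], ["s", "b", "t"]]

edges = list(flow)
A = incidence_matrix(edges, paths)
w = solve_weights(A, [flow[edge] for edge in edges])
print(w)
```

A routing yields a valid integral decomposition only if the returned weights are positive integers, so a caller would still need to check that after solving.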

  40. Dynamic Programming [figure: dynamic programming over a sequence ..., S_{i−1}, S_i, S_{i+1}, ..., with table entries (g_1, L_1), ..., (g_11, L_11) attached to the steps]

  41. CONCLUSION

  46. Conclusion
      - Theoretical worst-case runtime linear in n.
      - Competitive runtime with heuristics.
      - Guarantees optimal k.
      - Python already fast, C++ even faster?
      paper: https://arxiv.org/abs/1706.07851
      github: https://github.com/theoryinpractice/toboggan

  47. Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel. Thank you! Supported in part by the Gordon & Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4560 to Blair D. Sullivan.
