COMPUTING OPTIMAL FLOW DECOMPOSITIONS FOR ASSEMBLY
Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel
2018/03/27
North Carolina State University, RWTH Aachen University

MOTIVATION

The Problem: Shared segments between DNA/RNA strands create ambiguity in the assembly problem.

The Problem: Connecting overlapping segments and counting their frequencies yields a splice graph.

The Problem: Split the flow into s-t paths in order to recover the original DNA/RNA strands.

The Problem: k-Flow Decomposition (k-FD)
Input: (G, f, k) with an s-t DAG G, a flow f on G, and a positive integer k.
Problem: Find an integral flow decomposition of (G, f) using at most k paths.
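To make the definition concrete, here is a minimal sketch of a checker for a proposed k-FD solution. The function name, the edge-dictionary representation, and the toy network are assumptions made for illustration; they are not taken from the talk or from the Toboggan code.

```python
from collections import defaultdict


def is_flow_decomposition(flow, paths, s, t, k):
    """Check whether `paths` is an integral decomposition of `flow` using at most k paths.

    flow:  dict mapping each edge (u, v) of the DAG to its positive integer flow value.
    paths: list of (weight, [v0, v1, ..., vm]) pairs.
    """
    if len(paths) > k:
        return False
    covered = defaultdict(int)
    for weight, nodes in paths:
        # Every path needs a positive integer weight and must run from s to t.
        if not isinstance(weight, int) or weight <= 0:
            return False
        if nodes[0] != s or nodes[-1] != t:
            return False
        for edge in zip(nodes, nodes[1:]):
            if edge not in flow:
                return False
            covered[edge] += weight
    # The path weights must add up to exactly the flow on every edge.
    return all(covered[edge] == value for edge, value in flow.items())


# Hypothetical toy instance: 3 units leave s, split at a, and rejoin at t.
toy_flow = {("s", "a"): 3, ("a", "t"): 1, ("a", "b"): 2, ("b", "t"): 2}
toy_paths = [(1, ["s", "a", "t"]), (2, ["s", "a", "b", "t"])]
print(is_flow_decomposition(toy_flow, toy_paths, "s", "t", k=2))
```

On this toy instance the two paths cover every edge exactly, so the check succeeds for k = 2 and fails for k = 1.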

Related Work
How do we choose k? → Minimization:
A.I. Tomescu et al., "A novel min-cost flow method for estimating transcript expression with RNA-Seq."
M. Shao & C. Kingsford, "Efficient Heuristic for Decomposing a Flow with Minimum Number of Paths."
The problem is NP-hard even for weights {1, 2, 4}: T. Hartman et al., "How to split a flow?"
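As a point of comparison for the minimization question, the sketch below peels off one widest s-t path at a time. This is a generic greedy baseline, assumed here purely for illustration: it is not the method of either cited paper, and it is not guaranteed to use the minimum number of paths. The function name and the toy network are hypothetical.

```python
def greedy_width_decomposition(flow, s, t):
    """Decompose a flow on a DAG by repeatedly removing a widest s-t path.

    flow: dict mapping each edge (u, v) to its positive integer flow value.
    Returns a list of (weight, path) pairs; the count is only an upper bound on the optimal k.
    """
    residual = dict(flow)
    out = {}
    for u, v in flow:
        out.setdefault(u, []).append(v)

    paths = []
    while any(residual[(s, v)] > 0 for v in out.get(s, [])):
        memo = {}

        def widest(u):
            # Best (bottleneck, path) from u to t over edges with positive residual flow.
            if u == t:
                return (float("inf"), [t])
            if u not in memo:
                best = (0, None)
                for v in out.get(u, []):
                    r = residual[(u, v)]
                    if r > 0:
                        width, tail = widest(v)
                        width = min(width, r)
                        if width > best[0]:
                            best = (width, [u] + tail)
                memo[u] = best
            return memo[u]

        width, nodes = widest(s)
        if nodes is None:
            raise ValueError("input is not a valid s-t flow")
        for edge in zip(nodes, nodes[1:]):
            residual[edge] -= width
        paths.append((width, nodes))
    return paths


# Hypothetical toy instance, not taken from the slides.
toy_flow = {("s", "a"): 3, ("a", "t"): 1, ("a", "b"): 2, ("b", "t"): 2}
greedy_paths = greedy_width_decomposition(toy_flow, "s", "t")
print(greedy_paths)
```

A baseline like this only bounds k from above; certifying that no smaller decomposition exists is the hard part of the problem.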

Computer Scientists...
"About ten years ago, some computer scientists came by and said they heard we have some really cool problems. They showed that the problems are NP-complete and went away!"
- Joseph Felsenstein (Biologist)

Linear FPT: $d^k \cdot n^c$ (fpt) vs. $d^k \cdot n$ (linear fpt): exponential only in the parameter and linear in n!

Observations: data used by Shao and Kingsford
1. 99% of instances decompose into ≤ 8 paths. → Exploit the small natural parameter.
2. ∼4 million mostly small instances. → Handle large throughput.
3. Output decompositions. → Reliably recover the domain-specific solution.

Toboggan
Theorem: Toboggan solves k-FD in time $2^{O(k^2)}(n + \lambda)$, where $\lambda$ is the logarithm of the largest flow value.
• Worst-case run-time is linear in n
• Guarantees optimal solution
  ◦ Gives opportunity to validate the model
• Run-time competitive with the current state-of-the-art heuristic
• Usable in practice

IMPLEMENTATION & EXPERIMENTS

Repository: https://github.com/theoryinpractice/toboggan

Setup
Dataset: Available from Shao and Kingsford. Simulated sequencing data for human, mouse and zebrafish, containing ground truth.
Deviation from original setup: Trivial instances omitted. This removes around 64% of the 4M graphs.
System: Dedicated machine with Intel i7-3770, 3.40 GHz, 8 MB cache and 32 GB RAM.

Execution Time
Median: Toboggan 1.24 ms, Catfish 3.47 ms

Ground Truth Validation
dataset     instances   minimal   non-minimal
zebrafish     445,880   99.907%   0.053%
mouse         473,185   99.401%   0.074%
human         529,523   99.490%   0.043%
all         1,448,588   99.589%   0.056%

Exact Recovery
k     instances   Catfish   Toboggan
2     63.2791%    0.992     0.995
3     22.0775%    0.967     0.969
4      8.5237%    0.931     0.930
5      3.4920%    0.886     0.886
6      1.5375%    0.830     0.828
7      0.6698%    0.788     0.780
8      0.2889%    0.767     0.766
9      0.1241%    0.740     0.743
10     0.0070%    0.752     0.802
11     0.0004%    0.500     0.500
all   100%        0.973     0.975

Solutions vs. Ground Truth [figure]

ALGORITHM IDEA

The Idea
[Figure: an example s-t flow network with integer edge flows. Once the routes of the paths are fixed, every edge yields a linear constraint on the unknown path weights w (the weights of the paths through an edge must sum to its flow), and together these constraints form the linear system Aw = f.]
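The final build of this slide appears to show a linear system Aw = f: with the path routes fixed, the weights of the paths crossing an edge must add up to that edge's flow. The sketch below builds such a matrix for a small network and checks candidate weight vectors. The function names and the toy network are assumptions for illustration; the example graph on the slide could not be recovered from the extracted text.

```python
def path_edge_matrix(edges, paths):
    """Build A with one row per edge and one column per path.

    A[i][j] is 1 if path j uses edge i, and 0 otherwise.
    """
    path_edges = [set(zip(p, p[1:])) for p in paths]
    return [[1 if edge in used else 0 for used in path_edges] for edge in edges]


def satisfies_system(A, w, f):
    """Return True if A w = f holds row by row."""
    return all(
        sum(a * x for a, x in zip(row, w)) == target
        for row, target in zip(A, f)
    )


# Hypothetical toy instance: four edges with flows f and two fixed s-t routes.
edges = [("s", "a"), ("a", "t"), ("a", "b"), ("b", "t")]
f = [3, 1, 2, 2]
routes = [["s", "a", "t"], ["s", "a", "b", "t"]]

A = path_edge_matrix(edges, routes)
print(A)                                # one row of 0/1 entries per edge
print(satisfies_system(A, [1, 2], f))   # weights 1 and 2 reproduce every edge flow
print(satisfies_system(A, [2, 1], f))   # swapping the weights violates edge (a, t)
```

Here the first row of A encodes w1 + w2 = 3 for the edge (s, a), and the remaining rows pin down w1 = 1 and w2 = 2.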

Dynamic Programming
[Figure: the dynamic program moves through successive sets S_{i-1} → S_i → S_{i+1}. Each set carries a table of entries (g_j, L_j), and the entries of one table lead to the entries of the next.]

CONCLUSION

Conclusion
• Theoretical worst-case runtime linear in n.
• Competitive runtime with heuristics.
• Guarantees optimal k.
• Python already fast, C++ even faster?
paper: https://arxiv.org/abs/1706.07851
github: https://github.com/theoryinpractice/toboggan

Thank you!
Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel
Supported in part by the Gordon & Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4560 to Blair D. Sullivan.
