Phylogenetic Trees and Networks Konstantinos Mampentzidis PhD - PowerPoint PPT Presentation

  1. Comparison and Construction of Phylogenetic Trees and Networks. Konstantinos Mampentzidis, PhD Defense, Aarhus University, Aarhus, Denmark, 24 October 2019.

  2. Publications
     ▪ Gerth Stølting Brodal and Konstantinos Mampentzidis. Cache Oblivious Algorithms for Computing the Triplet Distance between Trees. In ESA 2017, Vienna, Austria.
     ▪ Jesper Jansson, Konstantinos Mampentzidis, Ramesh Rajaby, and Wing-Kin Sung. Computing the Rooted Triplet Distance Between Phylogenetic Networks. In IWOCA 2019, Pisa, Italy.
     ▪ Jesper Jansson, Konstantinos Mampentzidis, and Sandhya Thekkumpadan Puthiyaveedu. Building a Small and Informative Phylogenetic Supertree. In WABI 2019, Niagara Falls, USA.

  3. Algorithmic Theory and Practice
     ▪ Algorithm: a sequence of steps for solving a computational problem
     ▪ Theory: algorithms are first designed and analyzed in a model of computation
       o RAM model (John von Neumann, 1945): a CPU and an unbounded memory
       o I/O model (Aggarwal and Vitter, 1988): a CPU, a cache of size M, and an unbounded memory, with I/Os in blocks of size B
       o Cache Oblivious model (Frigo, Leiserson, Prokop, Ramachandran, 1999): the same two-level picture with cache size M and block size B
     ▪ Practice: algorithms are then implemented in a programming language (C, C++, Python, …)
     ▪ Gap between theory and practice: computer architecture keeps becoming more complicated
     ▪ Algorithm Engineering
       o Term first used by G. F. Italiano, who organized the "Workshop on Algorithm Engineering" in Venice, Italy, 1997
       o Bridges the gap between theory and practice
       o Cycle: Design, Analysis, Implementation, Experiments

  4. Problems in Phylogenetics
     Figure: a rooted phylogenetic tree and a rooted phylogenetic network (a DAG with reticulation vertices).
     ▪ Different available data and construction algorithms can lead to trees/networks that look different
     ▪ Quantifying this difference can improve evolutionary inferences
       o ESA 2017: Given two rooted phylogenetic trees $T_1$ and $T_2$ over $n$ species, how different are they?
       o IWOCA 2019: Given two rooted phylogenetic networks $N_1$ and $N_2$ over $n$ species, how different are they?
     ▪ How are the trees and networks created to begin with?
       o WABI 2019: Given an input set of biological data, build a rooted phylogenetic tree that best represents it

  6. Comparing Phylogenetic Trees
     Figure: two rooted phylogenetic trees, $T_1$ and $T_2$.
     QUESTION: Given two rooted phylogenetic trees $T_1$ and $T_2$ over $n$ species, how different are they?
     ▪ Tree types: rooted / unrooted, binary / arbitrary degree $d$
     ▪ Distance measures: rooted triplet distance, unrooted quartet distance, Robinson-Foulds, …

  7. Rooted Triplet Distance (Trees)
     ▪ A rooted triplet is defined by 3 leaf labels and their induced tree topology
     ▪ A triplet is induced by a tree $T'$ if it appears as an embedded subtree in $T'$
     ▪ Two topologies: a resolved triplet, written $xy|z$, and a fan triplet, written $x|z|w$
     Rooted Triplet Distance (Trees), Dobson [Combinatorial Mathematics III 1975]
     ▪ Let $T_1$ and $T_2$ be two rooted trees built on the same leaf label set $\Lambda$ of size $n$
     ▪ Shared triplets = triplets that are induced by both $T_1$ and $T_2$
     ▪ $S(T_1, T_2)$ = # shared triplets $\le \binom{n}{3}$
     ▪ Rooted triplet distance: $D(T_1, T_2) = \binom{n}{3} - S(T_1, T_2)$ = # non-shared triplets

  8. Rooted Triplet Distance (Trees)
     Example, with $D(T_1, T_2) = \binom{n}{3} - S(T_1, T_2)$ as defined on the previous slide, for two trees $T_1$ and $T_2$ over the leaves $a_1, \dots, a_5$:
     ▪ Shared triplets: $a_3 a_4 | a_5$, $a_3 a_4 | a_1$, $a_1 | a_2 | a_5$
     ▪ Leaf sets of the non-shared triplets: $\{a_1, a_2, a_3\}$, $\{a_2, a_3, a_5\}$, $\{a_1, a_3, a_5\}$, $\{a_2, a_4, a_3\}$, $\{a_1, a_2, a_4\}$, $\{a_2, a_4, a_5\}$, $\{a_1, a_4, a_5\}$
     ▪ Here $n = 5$, so $\binom{5}{3} = 10$ and $D(T_1, T_2) = 10 - 3 = 7$
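
The definition can be checked by brute force on small inputs. The following is a minimal Python sketch, not the algorithm from the talk: it enumerates all $\binom{n}{3}$ leaf triples and compares the topology each tree induces. The nested-tuple tree encoding and all function names are assumptions made for illustration, and the example trees are hypothetical (they are not the trees drawn on the slide).

```python
from itertools import combinations

def leaf_paths(tree, path=()):
    """Map each leaf label to the tuple of child indices on its root-to-leaf path.

    Assumed encoding: an internal node is a tuple of children, a leaf is any
    non-tuple label.
    """
    if not isinstance(tree, tuple):
        return {tree: path}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (i,)))
    return paths

def lca_depth(p, q):
    """Depth of the lowest common ancestor = length of the common path prefix."""
    depth = 0
    while depth < min(len(p), len(q)) and p[depth] == q[depth]:
        depth += 1
    return depth

def induced_triplet(paths, x, y, z):
    """Return the pair grouped together (resolved triplet) or "fan"."""
    dxy = lca_depth(paths[x], paths[y])
    dxz = lca_depth(paths[x], paths[z])
    dyz = lca_depth(paths[y], paths[z])
    if dxy > dxz:
        return frozenset((x, y))   # xy|z
    if dxz > dxy:
        return frozenset((x, z))   # xz|y
    if dyz > dxy:
        return frozenset((y, z))   # yz|x
    return "fan"                   # x|y|z

def triplet_distance(t1, t2):
    """Number of leaf triples whose induced triplets differ in t1 and t2."""
    p1, p2 = leaf_paths(t1), leaf_paths(t2)
    assert p1.keys() == p2.keys(), "trees must share one leaf label set"
    return sum(
        induced_triplet(p1, *xyz) != induced_triplet(p2, *xyz)
        for xyz in combinations(sorted(p1), 3)
    )

# Hypothetical example trees on four leaves.
t1 = ((("a1", "a2"), "a3"), "a4")
t2 = (("a1", "a2"), ("a3", "a4"))
print(triplet_distance(t1, t2))
```

This runs in cubic time and is only meant as a reference against which faster algorithms can be tested on small trees.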

  9. Previous and New Results
     Reference | Time | I/Os | Space | Non-binary trees
     Critchlow et al. [Sys. Biology 1996] | $O(n^2)$ | $O(n^2)$ | $O(n^2)$ | no
     Bansal et al. [TCS 2011] | $O(n^2)$ | $O(n^2)$ | $O(n^2)$ | yes
     Sand et al. [BMC Bioinform. 2013] | $O(n \log^2 n)$ | $O(n \log^2 n)$ | $O(n)$ | no
     Brodal et al. [SODA 2013] | $O(n \log n)$ | $O(n \log n)$ | $O(n \log n)$ | yes
     Jansson & Rajaby [JCB 2017] | $O(n \log^3 n)$ | $O(n \log^3 n)$ | $O(n \log n)$ | yes
     new [ESA 2017] | $O(n \log n)$ | $O(\frac{n}{B} \log_2 \frac{n}{M})$ | $O(n)$ | yes
     Some of these results have an implementation available.
     ▪ All previous solutions rely heavily on random memory access
       o Penalized by cache performance
       o Do not scale to external memory
     ▪ The new algorithms rely on scanning contiguous chunks of memory
       o Scanning $s$ elements requires $O(s/B)$ I/Os in the cache oblivious model
       o Scale to external memory
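
The contrast between scanning and random access can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a measurement of any of the algorithms above: the function `count_ios`, the LRU replacement policy, and the parameter values are all chosen for illustration. It counts how many blocks of size B must be loaded when a cache holds M // B blocks.

```python
from collections import OrderedDict
import random

def count_ios(accesses, B, M):
    """Count block loads for a sequence of array indices.

    Assumptions: the array is split into blocks of B consecutive elements,
    the cache holds M // B blocks, and eviction is least-recently-used.
    """
    cache = OrderedDict()          # block id -> True, ordered by recency
    capacity = M // B
    ios = 0
    for i in accesses:
        block = i // B
        if block in cache:
            cache.move_to_end(block)       # cache hit: refresh recency
        else:
            ios += 1                       # cache miss: one block transfer
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used block
    return ios

s, B, M = 1000, 10, 100
scan_ios = count_ios(range(s), B, M)       # sequential scan: ceil(s / B) loads

order = list(range(s))
random.Random(0).shuffle(order)            # same elements, random order
random_ios = count_ios(order, B, M)

print(scan_ios, random_ios)
```

A scan loads every block exactly once, so it costs ceil(s / B) transfers, while the shuffled order can reload the same block many times once the data no longer fits in the cache.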

  10. Previous Approaches: Quadratic Algorithm
     ▪ Basis for all $O(n \cdot \mathrm{polylog}\, n)$ results: the $O(n^2)$ algorithm for binary trees in [BMC Bioinform. 2013]
     Figure: $T_1$ and $T_2$ (arbitrary height) over the leaves $1, 2, 3, \dots, n-1, n$, which appear in a different order in $T_2$; a node $u$ of $T_1$ is the anchor of $s(u) = \{xy|z, \dots\}$.
     ▪ Every triplet with leaves $x$, $y$, and $z$ is anchored in $LCA(x, y, z)$ (the anchor node)
     ▪ $s(u)$: set containing all triplets anchored in $u$
     ▪ $S(T_1, T_2) = \sum_{u \in T_1} \sum_{v \in T_2} |s(u) \cap s(v)|$
     ▪ For a fixed $u \in T_1$, color the leaves in the left subtree of $u$ red and those in the right subtree blue. For $v \in T_2$ with children $l$ and $r$, let $l_{red}$, $l_{blue}$, $r_{red}$, $r_{blue}$ be the number of red and blue leaves below $l$ and $r$. Then
       $|s(u) \cap s(v)| = \binom{l_{red}}{2} r_{blue} + \binom{l_{blue}}{2} r_{red} + \binom{r_{red}}{2} l_{blue} + \binom{r_{blue}}{2} l_{red}$
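
The counting scheme on this slide can be sketched in a few lines of Python for binary trees. This is an illustrative reconstruction, not the code from [BMC Bioinform. 2013]: the nested-tuple encoding (an internal node is a pair of children, a leaf is any non-tuple label) and the function names are assumptions. For each internal node u of the first tree it colors the leaves, then makes one pass over the second tree, applying the four-term formula at every internal node v.

```python
from math import comb

def leaves(t):
    """Set of leaf labels below node t."""
    if not isinstance(t, tuple):
        return {t}
    return leaves(t[0]) | leaves(t[1])

def internal_nodes(t):
    """Yield every internal node (a pair of children) of a binary tree."""
    if isinstance(t, tuple):
        yield t
        yield from internal_nodes(t[0])
        yield from internal_nodes(t[1])

def shared_anchored_at(u, t2):
    """Sum over v in t2 of |s(u) ∩ s(v)|, using the red/blue coloring."""
    red, blue = leaves(u[0]), leaves(u[1])   # left subtree red, right blue
    total = 0

    def count(v):
        """Return (#red, #blue) leaves below v, accumulating the formula."""
        nonlocal total
        if not isinstance(v, tuple):
            return int(v in red), int(v in blue)
        (l_red, l_blue), (r_red, r_blue) = count(v[0]), count(v[1])
        total += (comb(l_red, 2) * r_blue + comb(l_blue, 2) * r_red
                  + comb(r_red, 2) * l_blue + comb(r_blue, 2) * l_red)
        return l_red + r_red, l_blue + r_blue

    count(t2)
    return total

def shared_triplets(t1, t2):
    """S(T1, T2): quadratic time, one pass over t2 per internal node of t1."""
    return sum(shared_anchored_at(u, t2) for u in internal_nodes(t1))

def triplet_distance(t1, t2):
    n = len(leaves(t1))
    return comb(n, 3) - shared_triplets(t1, t2)

# Hypothetical example trees on four leaves.
t1 = (((1, 2), 3), 4)
t2 = ((1, 2), (3, 4))
print(shared_triplets(t1, t2), triplet_distance(t1, t2))
```

Recomputing the leaf sets for every u keeps the sketch short; the subquadratic algorithms avoid exactly this repeated work.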

  11. Previous Approaches: Subquadratic Algorithms
     Figure: $T_1$ and $T_2$ (arbitrary height) and the hierarchical decomposition tree $HDT(T_2)$, of height $O(\log n)$.
     ▪ For $u \in T_1$, $HDT(T_2)$ maintains $\sum_{v \in T_2} |s(u) \cap s(v)|$
     ▪ Each leaf color change in $T_1$ yields an update to $HDT(T_2)$
     ▪ $\Theta(n \log n)$ updates, each corresponding to a leaf-to-root path traversal of $HDT(T_2)$: bad I/O performance
     Reference | Time | $HDT(T_2)$
     Sand et al. [BMC Bioinform. 2013] | $O(n \log^2 n)$ | Static
     Brodal et al. [SODA 2013] | $O(n \log n)$ | Dynamic/Contraction
     Jansson & Rajaby [JCB 2017] | $O(n \log^3 n)$ | Static (heavy-light decomposition)

  12. The New Algorithm for Binary Trees (ESA 2017)
     ▪ New order of visiting the nodes of $T_1$, based on a DFS traversal of $HDT(T_1)$
     ▪ $HDT(T_1)$ = modified centroid decomposition
     Figure: a component of $T_1$ of size $s$ is split at a node $c$ into components of size $\le s/2$ (other labels: $c'$, $x$, $LCA(x, c')$).
     ▪ Lemma 2: $\mathrm{height}(HDT(T_1)) \le 2 + 2 \cdot \log s = O(\log n)$
     ▪ Order to visit the nodes in $T_1$: DFS traversal of $HDT(T_1)$, where the children of a node $u$ are visited from left to right
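
To see why a centroid-based decomposition has logarithmic height, here is a small Python sketch of the standard, unmodified centroid decomposition. It is not the modified construction from the ESA 2017 paper: the undirected adjacency-list input and the function name are assumptions for illustration. It repeatedly removes a node whose removal leaves components of at most half the size and reports the height of the resulting recursion.

```python
def centroid_decomposition_height(adj):
    """Height of a standard centroid decomposition of a tree.

    adj: adjacency lists of an undirected tree on nodes 0 .. len(adj) - 1
    (an assumed input format).
    """
    removed = [False] * len(adj)

    def centroid(start):
        # Breadth-first search over the component that contains start.
        order, parent = [start], {start: None}
        for v in order:                      # order grows while we iterate
            for w in adj[v]:
                if not removed[w] and w != parent[v]:
                    parent[w] = v
                    order.append(w)
        # Subtree sizes, accumulated from the last visited node upward.
        size = {v: 1 for v in order}
        for v in reversed(order):
            if parent[v] is not None:
                size[parent[v]] += size[v]
        total = len(order)
        for v in order:
            # Largest piece that remains once v is removed.
            largest = total - size[v]
            for w in adj[v]:
                if not removed[w] and parent[w] == v:
                    largest = max(largest, size[w])
            if largest <= total // 2:
                return v

    def height(start):
        c = centroid(start)
        removed[c] = True
        below = [height(w) for w in adj[c] if not removed[w]]
        return 1 + max(below, default=0)

    return height(0)

# A path on 15 nodes: linear height as a tree, but the decomposition is shallow.
n = 15
path = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
print(centroid_decomposition_height(path))
```

Each level at least halves the component size, so the height is logarithmic in the number of nodes even when the tree itself is a long path.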

  13. The New Algorithm for Binary Trees (ESA 2017)
     Figure: a node $u$ of $HDT(T_1)$ (height $O(\log n)$) corresponds to a component $C_u$ of $T_1$; $T_2$ is contracted into $T_2(u)$, of size $O(|C_u|)$.
     ▪ For every node $u$ in $HDT(T_1)$ we scan $T_2(u)$ to count $\sum_{v \in T_2} |s(u) \cap s(v)|$
     ▪ RAM model: $O(n)$ time per level of $HDT(T_1)$ → $O(n \log n)$
     ▪ To scale to external memory: store every component/contracted tree in memory following a proper layout, such that scanning a component/contracted tree of size $s$ takes $O(s/B)$ I/Os

  14. The New Algorithm for General Trees (ESA 2017)
     1. Anchor triplets in edges instead of nodes
     2. Capture triplets with 4 colors
     3. Transform $T_1$ into a binary tree $b(T_1)$
     Figure labels: $O(n^2)$ for $T_1$, $O(n \log n)$ for $b(T_1)$.

  15. RAM Experiments: Time Performance
     Figure: two plots (binary trees, general trees) comparing [JCB 2017], [SODA 2013], and the new algorithm; axis labels: seconds/$n$ and $\log_2 n$.
     Source code: https://github.com/kmampent/CacheTD

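  16. I/O Experiments: Time Performance
     The columns [JCB 2017] and [SODA 2013] are labeled "Previous best" on the slide.
     Binary trees:
     n | [JCB 2017] | [SODA 2013] | New
     $2^{15}$ | 1s | 1s | 1s
     $2^{16}$ | 1s | 2s | 1s
     $2^{17}$ | 1s | 4s | 1s
     $2^{18}$ | 2s | 1m:03s | 1s
     $2^{19}$ | 4s | 1h:21m | 1s
     $2^{20}$ | 9s | ≥ 10h | 1s
     $2^{21}$ | 13m:12s | | 3s
     $2^{22}$ | ≥ 10h | | 9s
     $2^{23}$ | | | 3m:37s
     $2^{24}$ | | | 10m:35s
     General trees:
     n | [JCB 2017] | [SODA 2013] | New
     $2^{15}$ | 1s | 1s | 1s
     $2^{16}$ | 1s | 1s | 1s
     $2^{17}$ | 1s | 3s | 1s
     $2^{18}$ | 3s | 7s | 1s
     $2^{19}$ | 7s | 5m:20s | 1s
     $2^{20}$ | 3m:43s | ≥ 10h | 2s
     $2^{21}$ | ≥ 10h | | 20s
     $2^{22}$ | | | 2m:02s
     $2^{23}$ | | | 10m:42s
     $2^{24}$ | | | 42m:06s
     Source code: https://github.com/kmampent/CacheTD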

  18. Rooted Phylogenetic Networks
     Figure: a rooted phylogenetic tree and a rooted phylogenetic network (a DAG with reticulation vertices), alongside an "example" of a hybrid animal.
