Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters
Erin Molloy, University of Illinois at Urbana-Champaign
General Allocation (PI: Tandy Warnow)
Exploratory Allocation (PI: Bill Gropp)
NCSA Blue Waters Symposium, June 5, 2018
Molloy (PI: Warnow; PI: Gropp) TERADACTAL 1/56

Phylogeny

Tree of Life and Downstream Applications
“Nothing in biology makes sense except in the light of evolution” (Dobzhansky, 1973)
- Protein structure and function prediction
- Population genetics
- Human migrations
- Metagenomics
- Infectious disease
- Biodiversity

Outline
- Phylogeny estimation pipeline
- Some standard approaches and their limitations
- Our new approach to ultra-large phylogeny estimation on Blue Waters
- Comparison of our approach to existing methods
- Future work

DNA Sequence Evolution

Phylogeny Estimation Pipelines

Distance Methods
Input: A matrix D such that D[i, j] indicates the distance between sequence i and sequence j
- Use a multiple sequence alignment to compute pairwise distances
- Use an alignment-free method to compute pairwise distances in an embarrassingly parallel fashion
Output: A tree with branch lengths
Distance methods run in polynomial time.
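As a rough illustration of the input matrix D, the sketch below computes pairwise distances from an alignment. The slides do not say which distance is used; the p-distance (fraction of differing sites, skipping gapped sites) is an assumption chosen for simplicity, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of compared sites at which two aligned sequences differ.

    Assumption: p-distance with gapped sites skipped; the slides do not
    specify the distance actually used.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip sites where either sequence has a gap
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(alignment):
    """Build a symmetric matrix D with D[i][j] = p_distance(row i, row j)."""
    n = len(alignment)
    D = [[0.0] * n for _ in range(n)]
    # Each (i, j) pair is independent of every other pair, which is why
    # this step can be distributed across workers.
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = p_distance(alignment[i], alignment[j])
    return D
```

Because no pair depends on any other, the loop over pairs could be split across processors without communication.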

Maximum Likelihood (ML) Tree Estimation
Input: A multiple sequence alignment
Output: A model tree (topology and other numerical parameters) that maximizes likelihood, that is, the probability of observing the multiple sequence alignment given a model tree
The ML tree estimation problem is NP-hard. [Felsenstein, 1981; Roch, 2006]
ML heuristics are typically more accurate than distance methods, especially under some challenging model conditions.

ML Tree Estimation: N versus L
A multiple sequence alignment is an N × L matrix.
Number of sites or alignment length, L
- Thousands of sites for single-gene analysis
- Millions of sites for multi-gene analysis
- Many analyses can be parallelized across sites, because likelihood is computed on each site independently
Number of species or sequences, N
- Many datasets have thousands of species
- The Tree of Life will have millions
- Number of tree topologies on N leaves is (2N − 5)!!
- Parallelism is more complicated
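To give a feel for how fast (2N − 5)!! grows, here is a small sketch that evaluates the double factorial directly. It assumes the count refers to unrooted binary trees on N labeled leaves (N ≥ 3); the function name is hypothetical.

```python
def num_unrooted_topologies(n_leaves):
    """Return (2N - 5)!!, the number of unrooted binary tree topologies
    on N labeled leaves (assumes N >= 3)."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Double factorial: product of the odd numbers 3, 5, ..., 2N - 5.
    for k in range(3, 2 * n_leaves - 5 + 1, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_topologies(n))
```

Even at N = 10 the count already exceeds two million, so exhaustive search over topologies is out of reach long before N reaches thousands.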

ML Tree Estimation: N versus L [Stamatakis, 2006; Price et al., 2010; Kozlov et al., 2015]

ML Tree Estimation: N versus L
ExaML [Kozlov et al., 2015] can “be used for analyzing datasets with 10-20 genes and up to 55,000 taxa, but scalability will be limited to at most 100 cores”.

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010]

Our Design Goals
If we want to build the Tree of Life (millions of sequences!) using Blue Waters, then we need to design an algorithm that can utilize a very large number of (not high-memory) processors.
Based on our observations, it would be good to avoid estimating
1. Multiple sequence alignment on the full set of sequences
2. ML tree on the full multiple sequence alignment
3. Supertree

TERADACTAL Algorithm

TreeMerge: Step 1
Create a minimum spanning tree on the disjoint subsets. [Kruskal, 1956]
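A minimal sketch of Kruskal's algorithm for Step 1 follows. The slides do not say how the weight between two subsets is defined, so the edge weights here are hypothetical inputs, and the subsets are simply numbered 0..n-1.

```python
def kruskal_mst(num_nodes, weighted_edges):
    """Kruskal's algorithm with a union-find structure.

    weighted_edges: iterable of (weight, u, v) tuples. How the weight
    between two subsets is derived is an assumption left to the caller.
    Returns the spanning-tree edges as (u, v, weight), in the order added.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            mst.append((u, v, weight))
    return mst


# Four subsets with made-up pairwise weights.
edges = [(1.0, 0, 1), (4.0, 0, 2), (2.0, 1, 2),
         (5.0, 1, 3), (3.0, 2, 3), (6.0, 0, 3)]
print(kruskal_mst(4, edges))
```

Each edge of the resulting tree identifies one pair of subsets whose trees are merged in Step 2.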

TreeMerge: Step 2
Merge trees T_i and T_j into tree T_ij such that T_ij | L_i = T_i and T_ij | L_j = T_j (multiple solutions).
1. Merge two alignments A_i and A_j into A_ij using an existing technique (e.g., OPAL [Wheeler and Kececioglu, 2007]).
2. Compute distance matrix D_ij from merged alignment A_ij.
3. Run NJMerge, our variant of Neighbor-Joining [Saitou and Nei, 1987] that takes both a distance matrix and a set of constraint trees as input.
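For orientation, the sketch below shows the pair-selection step of standard, unconstrained Neighbor-Joining. It is not NJMerge: the constraint-tree handling that distinguishes NJMerge is not described on these slides and is not reproduced here. The function name is hypothetical.

```python
def nj_pick_pair(D):
    """One selection step of standard Neighbor-Joining (not NJMerge).

    D is a symmetric distance matrix (list of lists) on n >= 3 taxa.
    Returns (i, j, len_i, len_j): the pair minimizing the Q criterion
    and the branch lengths from i and j to their new parent node.
    """
    n = len(D)
    if n < 3:
        raise ValueError("need at least 3 taxa")
    row_sums = [sum(row) for row in D]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    len_i = 0.5 * D[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = D[i][j] - len_i
    return i, j, len_i, len_j


D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
print(nj_pick_pair(D))
```

A full run would replace the chosen pair with its parent node, update the distances, and repeat until three nodes remain.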

NJMerge: Constrained Neighbor-Joining

TreeMerge: Step 3
Combine pairs of merged trees using branch lengths, e.g., trees T_ij and T_jk are combined through T_j (blue).
NOTE: Unlabeled branches have length one for simplicity.

Advantages
TERADACTAL
- No multiple sequence alignment estimation on the full dataset
- No Maximum Likelihood tree estimation on the full dataset
- No supertree estimation
TreeMerge
- Polynomial time
- Parallel
  - (Step 2) Pairs of alignments and trees can be merged in an embarrassingly parallel fashion.
  - (Step 3) Merged tree pairs can be combined in parallel, as long as they do not share edges in the minimum spanning tree.
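The embarrassingly parallel structure of Step 2 could be sketched as one independent task per minimum spanning tree edge. Everything below is illustrative: merge_pair is a hypothetical placeholder that only returns a label, and a thread pool stands in for whatever scheduler a real run on Blue Waters would use.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_pair(edge):
    """Hypothetical stand-in for the Step 2 merge of one pair of subsets.

    A real implementation would merge the two alignments, compute the
    distance matrix, and run NJMerge; this placeholder returns a label.
    """
    i, j = edge
    return f"T_{i}_{j}"


def merge_all_pairs(mst_edges, max_workers=4):
    """Run one independent merge task per spanning-tree edge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input edges.
        return list(pool.map(merge_pair, mst_edges))


print(merge_all_pairs([(0, 1), (1, 2), (2, 3)]))
```

No task reads another task's output, so the tasks need no communication until Step 3.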

Simulation Studies

Quantifying Tree Error

Simulation Study
Compare TERADACTAL to
- 2 alignment-free methods
- 3 multiple sequence alignment methods (2 shown)
- 2 distance methods (1 shown)
- 2 Maximum Likelihood methods (1 shown)
on simulated datasets from Mirarab et al., 2015.
- 10,000 sequences
- 4 model conditions, each with 10 replicate datasets (1 shown)

TERADACTAL Iterations
[Figure: INDELible M2. Error rate (y-axis) by iteration 0 to 5 (x-axis); labeled values: .71, .14, .13, .12, .12, .11.]
