Scaling methods for phylogeny estimation to large datasets using divide-and-conquer
Tandy Warnow
University of Illinois at Urbana-Champaign
Joint work with Erin Molloy
Phylogeny (evolutionary tree)
[Figure: evolutionary tree on Orangutan, Gorilla, Chimpanzee, and Human]
From the Tree of Life Website, University of Arizona
– The pipeline: Statistical estimation and NP-hard
– Incomplete lineage sorting and species tree estimation under the Multi-Species Coalescent model (MSC)
– Statistically consistent methods (ASTRAL and ASTRID)
– NJMerge and TreeMerge: scaling species tree methods to large datasets
[Figure: DNA sequences (AAGACTT, TGGACTT, AAGGCCT, ...) evolving down a tree to the sequences observed today]
[Figure: the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and AGGGCAT, and a tree with leaves U, V, W, X, Y]
The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).
Simplest site evolution model (Jukes-Cantor, 1969):
Each edge e of the model tree has a substitution probability p(e), with 0<p(e)<3/4.
More complex models (such as the General Markov model) are also considered.
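The Jukes-Cantor description above can be made concrete with a small simulation. The following is a minimal sketch, not code from the talk: the tree encoding (a leaf is a name, an internal node is a tuple of (child, p) pairs), the function names, and the example probabilities are all assumptions made for illustration. It treats every site identically and leaves out the gamma-distributed rates.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve_along_edge(seq, p, rng):
    """Evolve seq down one edge: each site changes independently with
    probability p, moving to one of the three other states uniformly."""
    if not 0 < p < 0.75:
        raise ValueError("Jukes-Cantor requires 0 < p(e) < 3/4")
    out = []
    for state in seq:
        if rng.random() < p:
            out.append(rng.choice([x for x in NUCLEOTIDES if x != state]))
        else:
            out.append(state)
    return "".join(out)

def random_root_sequence(length, rng):
    """Draw the root state of every site uniformly from {A, C, G, T}."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

def simulate_down_tree(node, seq, rng):
    """Return {leaf name: sequence} for the subtree rooted at node.

    Hypothetical encoding: a leaf is a string, an internal node is a
    tuple of (child, p) pairs, where p is p(e) for the edge to that child.
    """
    if isinstance(node, str):
        return {node: seq}
    leaves = {}
    for child, p in node:
        child_seq = evolve_along_edge(seq, p, rng)
        leaves.update(simulate_down_tree(child, child_seq, rng))
    return leaves

# Illustrative model tree on three leaves; the probabilities are made up.
model_tree = (((("U", 0.10), ("V", 0.10)), 0.05), ("W", 0.20))

rng = random.Random(0)
root = random_root_sequence(7, rng)
for name, seq in sorted(simulate_down_tree(model_tree, root, rng).items()):
    print(name, seq)
```

Because each site is processed independently with the same per-edge probability, the sites are i.i.d. by construction, which is the assumption stated above.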
[Figure: true tree and estimated tree on the leaves U, V, W, X, Y, with FN and FP edges marked; 50% error rate]
FN: false negative (missing edge)
FP: false positive (incorrect edge)
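The FN and FP definitions can be computed by comparing the internal edges (bipartitions of the leaf set) of the two trees. The sketch below is an illustration under assumptions: trees are written as nested tuples of leaf names, the two example topologies are invented (the slide's actual trees are not recoverable from the text), and the function names are hypothetical.

```python
def clades(tree):
    """Return (all leaves of tree, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found

def bipartitions(tree):
    """Nontrivial bipartitions (internal edges) of the tree viewed as unrooted."""
    all_leaves, found = clades(tree)
    anchor = min(all_leaves)
    result = set()
    for clade in found:
        if 2 <= len(clade) <= len(all_leaves) - 2:
            # Store each split by the side that excludes the anchor leaf,
            # so the same edge always gets the same representation.
            result.add(clade if anchor not in clade else all_leaves - clade)
    return result

def fn_fp_rates(true_tree, est_tree):
    """FN rate: share of true edges missing from the estimate.
    FP rate: share of estimated edges absent from the true tree.
    Assumes both trees have at least one internal edge."""
    true_edges, est_edges = bipartitions(true_tree), bipartitions(est_tree)
    fn = len(true_edges - est_edges) / len(true_edges)
    fp = len(est_edges - true_edges) / len(est_edges)
    return fn, fp

# Invented example: the two trees share the edge UV|WXY and differ on the other.
true_tree = (("U", "V"), "W", ("X", "Y"))
est_tree = (("U", "V"), "X", ("W", "Y"))
print(fn_fp_rates(true_tree, est_tree))  # (0.5, 0.5)
```

With five leaves each binary tree has two internal edges, so one missing edge and one incorrect edge give a 50% rate on both measures.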
identifiable, and which methods are statistically consistent.
for standard methods.
estimation) are very computationally intensive.
[Figure: gene trees on Orangutan, Gorilla, Chimp, and Human for gene 1 through gene 1000]
Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity
[Figure: a gene tree evolving within a species tree, with time running from Past to Present. Courtesy James Degnan]
Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
Deep coalescence = incomplete lineage sorting (ILS): the gene tree can be different from the species tree.
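How often a gene tree disagrees with the species tree can be illustrated for the smallest case. The sketch below assumes a rooted three-species tree ((A,B),C) whose internal branch has length t in coalescent units. Under the multi-species coalescent, the gene tree then matches the species tree with probability 1 - (2/3)e^{-t}. The species names and function names are hypothetical, and the simulation tracks only which pair of lineages coalesces first.

```python
import math
import random

def simulate_first_coalescence(t, rng):
    """Return the pair of lineages that coalesces first for one gene.

    The A and B lineages share an ancestral population for t coalescent
    units; two lineages coalesce after an Exp(1) waiting time. If they
    fail to coalesce there (deep coalescence), all three lineages enter
    the root population and each pair is equally likely to join first.
    """
    if rng.expovariate(1.0) < t:
        return ("A", "B")
    return rng.choice([("A", "B"), ("A", "C"), ("B", "C")])

def match_probability(t):
    """Probability that the gene tree topology equals the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)

def estimate_match_frequency(t, n_genes, rng):
    """Fraction of simulated genes whose tree matches the species tree."""
    hits = sum(simulate_first_coalescence(t, rng) == ("A", "B")
               for _ in range(n_genes))
    return hits / n_genes

rng = random.Random(0)
for t in (0.1, 0.5, 2.0):
    simulated = estimate_match_frequency(t, 20000, rng)
    print(f"t={t}: simulated {simulated:.3f}, formula {match_probability(t):.3f}")
```

Short internal branches push the match probability toward 1/3, which is why high-ILS datasets show so much gene tree heterogeneity.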
– 103 plant transcriptomes, 400-800 single copy “genes”
– Next phase will be much bigger
– Wickett, Mirarab et al., PNAS 2014
[Photos: collaborators from U Alberta, Northwestern, U Georgia, iPlant, and UT-Austin]
Major Challenge:
Erich Jarvis, HHMI; Guojie Zhang, BGI
Biggest computational challenges:
and 1 TB of distributed memory, at supercomputers around the world)
different gene trees
MTP Gilbert, Copenhagen; Siavash Mirarab, Texas; Tandy Warnow, Texas and UIUC
Major challenge:
[Figure: a species tree on Orangutan, Gorilla, Chimp, and Human generates gene trees under a gene evolution model; each gene tree generates sequence data (alignments) under a sequence evolution model]
[Figure: two ways to estimate the species tree from the alignments for gene 1, gene 2, ..., gene k. Analyze separately: estimate a gene tree from each alignment, then combine the gene trees with a summary method. Concatenation: combine the alignments and estimate the species tree from the concatenated data]
[Mirarab, et al., ECCB/Bioinformatics, 2014]
ASTRAL is statistically consistent under the multi-species coalescent model when solved exactly
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
Score(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|
where Q(T) is the set of quartet trees induced by T, t is a gene tree, and \mathcal{T} is the set of all input gene trees.
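The quartet score can be evaluated directly for small trees. The sketch below enumerates every set of four leaves, which is far too slow for large datasets and is not how ASTRAL searches for its optimum. It only illustrates the objective. The nested-tuple tree encoding, the function names, and the example trees are assumptions.

```python
from itertools import combinations

def clades(tree):
    """Return (all leaves of tree, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found

def quartets(tree):
    """Q(tree): the quartet trees induced by tree, each stored as a
    frozenset holding the two sibling pairs, e.g. {{U, V}, {W, X}}."""
    all_leaves, found = clades(tree)
    result = set()
    for four in combinations(sorted(all_leaves), 4):
        four = frozenset(four)
        for clade in found:
            inside = clade & four
            if len(inside) == 2:
                # The edge above this clade separates two of the four
                # leaves from the other two, which fixes the quartet tree.
                result.add(frozenset([inside, four - inside]))
                break
    return result

def quartet_score(candidate, gene_trees):
    """Score(T): sum over gene trees t of |Q(T) & Q(t)|."""
    q_candidate = quartets(candidate)
    return sum(len(q_candidate & quartets(t)) for t in gene_trees)

# Invented example: one gene tree equals the candidate, one differs.
candidate = (("U", "V"), "W", ("X", "Y"))
gene_trees = [
    (("U", "V"), "W", ("X", "Y")),
    (("U", "V"), "X", ("W", "Y")),
]
print(quartet_score(candidate, gene_trees))  # 8
```

The first gene tree shares all five quartet trees with the candidate and the second shares the three that keep U and V together, giving 5 + 3 = 8.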
But ASTRAL can fail to return a tree within 24 hrs
datasets with high ILS
Decompose the full species set into pairwise disjoint subsets.
Build a tree on each subset.
Compute a tree on the entire set of species using a “Disjoint Tree Merger” method, with auxiliary info (e.g., a distance matrix) as additional input.
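Two properties of this pipeline are easy to check in code: the subsets must be pairwise disjoint and cover the species set, and the merged tree should agree with each subset tree when restricted to that subset's leaves. The sketch below checks both. It is not NJMerge or TreeMerge: the naive chunking in `decompose`, the nested-tuple tree encoding, and all function names are placeholders chosen for illustration, and the agreement test assumes the subset trees are binary.

```python
def decompose(species, max_size):
    """Placeholder decomposition: sort the names and cut them into chunks.
    A real pipeline would choose subsets far more carefully."""
    ordered = sorted(species)
    return [set(ordered[i:i + max_size]) for i in range(0, len(ordered), max_size)]

def clades(tree):
    """Return (all leaves of tree, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found

def splits(tree, keep=None):
    """Nontrivial bipartitions of tree after restricting it to the leaves in keep."""
    all_leaves, found = clades(tree)
    keep = all_leaves if keep is None else all_leaves & frozenset(keep)
    anchor = min(keep)
    result = set()
    for clade in found:
        inside = clade & keep
        if 2 <= len(inside) <= len(keep) - 2:
            result.add(inside if anchor not in inside else keep - inside)
    return result

def respects_subset_trees(merged_tree, subset_trees):
    """True if restricting merged_tree to each subset's leaves gives back
    that subset tree (equality test, appropriate for binary subset trees)."""
    for sub in subset_trees:
        sub_leaves, _ = clades(sub)
        if splits(merged_tree, sub_leaves) != splits(sub):
            return False
    return True

subsets = decompose("ABCDEFGH", 4)
subset_trees = [(("A", "B"), ("C", "D")), (("E", "F"), ("G", "H"))]
good_merge = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
bad_merge = ((("A", "C"), ("B", "D")), (("E", "F"), ("G", "H")))
print(subsets)
print(respects_subset_trees(good_merge, subset_trees))  # True
print(respects_subset_trees(bad_merge, subset_trees))   # False
```

Because the subsets are disjoint, the subset trees never conflict with one another, so a tree agreeing with all of them always exists. The auxiliary information is what guides the choice among the many such trees.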
Algorithm design: Necessary to explore the design space to determine best strategies
[Figure: running time (m) by level of ILS (Moderate, Very High) for ASTRAL-III, NJMerge+ASTRAL-III (in serial), and NJMerge+ASTRAL-III (in parallel); panels: 100 taxa with 25 introns, 100 taxa with 100 introns, 100 taxa with 1000 introns, 1000 taxa with 1000 introns]
[Figure: species tree error by level of ILS (Moderate, Very High) for ASTRAL-III and NJMerge+ASTRAL-III; same four panels]
NJMerge + ASTRAL vs. ASTRAL: Comparable accuracy and can analyze larger datasets
NJMerge + RAxML vs. RAxML: Better accuracy and faster!
[Figure: species tree error by level of ILS (Moderate, Very High) for RAxML and NJMerge+RAxML; panels: 100 taxa with 25 introns, 100 taxa with 100 introns, 100 taxa with 1000 introns, 1000 taxa with 1000 introns]
[Figure: running time (m) by level of ILS (Moderate, Very High) for RAxML, NJMerge+RAxML (in serial), and NJMerge+RAxML (in parallel); same four panels]
accurate and faster on large datasets than ASTRAL, and also statistically consistent under the Multi-Species Coalescent model
maximum likelihood (CA-ML): more accurate and much faster, greater scalability than CA-ML
Approaches:
Genomic data are:
Mirarab and Warnow, Bioinformatics 2015 (ASTRAL-II)
Molloy and Warnow, Systematic Biology 2017
Molloy and Warnow, RECOMB-CG 2018 (and Algorithms for Molecular Biology)
Molloy and Warnow, ISMB 2019 (and Bioinformatics, to appear)
Papers available at http://tandy.cs.illinois.edu/papers.html
Presentations available at http://tandy.cs.illinois.edu/talks.html
Funding: NSF (CCF 1535977 and also NSF Graduate Fellowship to Erin Molloy)
Supercomputers: TACC (for ASTRAL) and BlueWaters (for NJMerge and TreeMerge)
200 Estimated Gene Trees
Data: fixed, moderate ILS rate; 50 replicates per HGT rate (1)-(6); 1 model species tree per replicate on 51 taxa; 1000 true gene trees; 1000 bp gene sequences simulated using INDELible [8]; 1000 gene trees estimated from GTR simulated sequences using FastTree-2 [7]
[7] Price, Dehal, Arkin 2015
[8] Fletcher, Yang 2009
Davidson et al., RECOMB-CG, BMC Genomics 2015