Comparison of commonly used methods for combining multiple phylogenetic data sets
Anne Kupczok, Heiko A. Schmidt and Arndt von Haeseler
Center for Integrative
Motivation
Multi-Locus Datasets
[Figure: data collection produces a matrix of taxa and genes (labels a to t and A to T), which is the input for phylogeny reconstruction.]
Approaches: late level, medium level, early level
Methods
Early-level combination: Superalignment
= Supermatrix or 'Total Evidence'
Combination by concatenating data sets: any tree reconstruction method can be applied to the data matrix.
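The concatenation step can be illustrated with a minimal sketch. It assumes each gene alignment is a plain dictionary from taxon name to aligned sequence, and that taxa missing from a gene are padded with gap characters; the function name `concatenate` and the toy data are hypothetical, not taken from the talk.

```python
def concatenate(alignments, missing="-"):
    """Concatenate per-gene alignments (dict: taxon -> sequence) into one superalignment."""
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa not sampled for this gene are padded with the missing-data symbol.
            parts[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data: gene 2 lacks taxon b, gene 1 lacks taxon d.
gene1 = {"a": "ACGT", "b": "ACGA", "c": "ACTT"}
gene2 = {"a": "GG", "c": "GC", "d": "CC"}
superalignment = concatenate([gene1, gene2])
```

The resulting dictionary can be written out in whatever format the chosen tree reconstruction program expects.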
Late-level combination: Supertree
Construct separate trees for each gene and combine them into a supertree.
Supertree methods combine special kinds of information:
Split information → Matrix Representation:
MR with Parsimony (MRP; Baum, 1992; Ragan, 1992)
MR with Flipping (MRF; e.g. Chen et al., 2003)
Triplet information → Rooted triplets:
MinCut (Semple and Steel, 2000)
Modified MinCut (Page, 2002)
MaxCut (Snir and Rao, 2006)
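To make the matrix representation idea concrete, here is a small sketch of the standard Baum-Ragan style coding that MRP builds on: every clade of every source tree becomes one binary column, with `1` for taxa inside the clade, `0` for other taxa of that tree and `?` for taxa absent from the tree. Trees are assumed to be rooted and given as nested tuples of taxon names; the helper names `clades` and `mrp_matrix` are hypothetical. The parsimony search on the resulting matrix is not shown.

```python
def clades(tree):
    """Return (leaf set, leaf sets of all internal nodes) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def mrp_matrix(trees):
    """Code each non-root clade of each source tree as one column (1 / 0 / ?)."""
    all_taxa = sorted(set().union(*(clades(t)[0] for t in trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        tree_taxa, found = clades(tree)
        for clade in found:
            if clade == tree_taxa:
                continue  # the root clade contains every taxon and is uninformative
            for taxon in all_taxa:
                if taxon in clade:
                    rows[taxon] += "1"
                elif taxon in tree_taxa:
                    rows[taxon] += "0"
                else:
                    rows[taxon] += "?"  # taxon not sampled in this source tree
    return rows


tree1 = (("a", "b"), ("c", "d"))
tree2 = (("a", "c"), "e")
matrix = mrp_matrix([tree1, tree2])
```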
Medium-level combination
Intermediate data (not final trees) are computed from every source alignment and subsequently combined into a tree.
SuperQP: Combination of quartet likelihoods (Schmidt, 2003)
Average Consensus: Average over distance matrix for each gene (Lapointe and Cucumel, 1997)
SDM: Additional weights estimated (Criscuolo et al., 2006)
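The averaging idea behind the distance-based approaches can be sketched as follows. This is only an illustration under simplifying assumptions: each gene contributes a dictionary of pairwise distances keyed by unordered taxon pairs, and a pair is averaged over the genes in which it occurs. The additional weights that SDM estimates are not modelled, and a tree would still have to be fitted to the averaged matrix with a distance method. The names `pair` and `average_distances` are hypothetical.

```python
def pair(x, y):
    """Unordered taxon pair used as a dictionary key."""
    return frozenset((x, y))


def average_distances(matrices):
    """Average each pairwise distance over the genes in which the pair is present."""
    totals, counts = {}, {}
    for matrix in matrices:
        for key, dist in matrix.items():
            totals[key] = totals.get(key, 0.0) + dist
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy data: taxon c is missing from gene 2, taxon d from gene 1.
gene1 = {pair("a", "b"): 0.2, pair("a", "c"): 0.4, pair("b", "c"): 0.4}
gene2 = {pair("a", "b"): 0.4, pair("a", "d"): 0.6, pair("b", "d"): 0.6}
mean_matrix = average_distances([gene1, gene2])
```

Pairs that never occur together in any gene (here c and d) stay undefined in this sketch.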
Simulation
Simulation setting
1 Estimate an ML tree with branch lengths and model parameters from a data superalignment → species tree
2 Generate gene trees
3 Simulate alignments along the gene trees
4 Apply the reconstruction methods to each data set and compare the result with the model tree
[Figure: workflow in four steps, from the (true) species tree via gene trees and (simulated) alignments to the reconstructed tree.]
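Step 3 can be illustrated with a minimal sketch that evolves a random root sequence down a gene tree. The substitution model actually used in the talk is not stated here, so the Jukes-Cantor model below is only a stand-in assumption, as are the tree encoding (a leaf name, or a list of `(child, branch_length)` pairs with lengths in expected substitutions per site) and the names `evolve` and `simulate`.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under Jukes-Cantor (stand-in model)."""
    # Probability that a site ends in a different base after the branch.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def simulate(node, seq, rng, alignment):
    """Fill alignment (dict: leaf -> sequence) by recursing from the root to the leaves."""
    if isinstance(node, str):
        alignment[node] = seq
        return alignment
    for child, branch_length in node:
        simulate(child, evolve(seq, branch_length, rng), rng, alignment)
    return alignment


rng = random.Random(1)
# Hypothetical gene tree ((a:0.1, b:0.1):0.05, c:0.2).
gene_tree = [([("a", 0.1), ("b", 0.1)], 0.05), ("c", 0.2)]
root_seq = "".join(rng.choice(BASES) for _ in range(500))
alignment = simulate(gene_tree, root_seq, rng, {})
```

A realistic run would replace the model with the one whose parameters were estimated in step 1.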
Species tree
10 genes of 25 Crocodylia species (Gatesy et al., 2004)
[Figure: overview of the data sets (axes: taxa, 5 to 25; length, 500 to 2000) and the species tree estimated from them (scale: .10). Tip labels: C_moreletii_14, C_acutus_12, C_intermediu_13, C_rhombifer_11, C_niloticus_21, C_novaeguineae_18, C_mindorensis_17, C_johnstoni_16, C_palustris_20, C_siamensis_15, C_porosus_19, T_schlegelii_24, G_gangeticus_25, C_latirostris_5, C_crocodilus_4, M_niger_6, P_palpebrosus_7, P_trigonatus_8, A_mississippiensis_9, A_sinensis_10, Paleognathae_1, Neognathae_2, Testudines_3, C_cataphractus_22, O_tetraspis_23.]
Results
Complete and missing data
Step 2: Gene trees are the complete model tree (complete data) or the pruned model tree (missing data)
Step 3: Simulation with the parameters estimated with the superalignment
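Pruning the model tree for the missing-data setting can be sketched as below. The sketch assumes a nested-tuple tree of taxon names and handles topology only; in a real simulation the branch lengths of suppressed nodes would have to be added up, which is not shown. The function name `prune` and the toy tree are hypothetical.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the taxa in keep, suppressing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None  # no sampled taxon below this node
    if len(children) == 1:
        return children[0]  # a node with one remaining child is suppressed
    return tuple(children)


model_tree = ((("a", "b"), "c"), ("d", "e"))
gene_tree = prune(model_tree, {"a", "c", "d"})
```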
[Figure: Robinson-Foulds distance (axis 5 to 25) to the model tree for each method. Panel "Complete data": Gene Trees, MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA. Panel "Missing data": MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA.]
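The Robinson-Foulds distance used as the accuracy measure in these plots counts the bipartitions (splits) that occur in exactly one of two trees. A minimal sketch follows, assuming both trees are nested tuples over the same taxon set and are compared as unrooted trees; the helper names `leaf_sets`, `splits` and `robinson_foulds` are hypothetical.

```python
def leaf_sets(tree, out):
    """Collect the leaf set of every internal node into out; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions, each represented by the side without the anchor taxon."""
    found = []
    taxa = leaf_sets(tree, found)
    anchor = min(taxa)
    result = set()
    for clade in found:
        side = taxa - clade if anchor in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            result.add(side)
    return result


def robinson_foulds(tree1, tree2):
    """Number of splits found in exactly one of the two trees (same taxa assumed)."""
    return len(splits(tree1) ^ splits(tree2))


model_tree = ((("a", "b"), "c"), ("d", "e"))
reconstructed = ((("a", "c"), "b"), ("d", "e"))
distance = robinson_foulds(model_tree, reconstructed)
```

In the toy example the two trees share the split separating d and e from the rest and differ in one split each.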
Incomplete lineage sorting
Step 2: For every simulation, a gene tree is generated from the species tree with a coalescent process (θ = 0.005)
Step 3: Simulation with the parameters estimated with the superalignment
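How a coalescent process can turn a species tree into a possibly discordant gene tree is sketched below. Several points are assumptions rather than details from the talk: one sampled lineage per species, θ interpreted as 4Nμ with branch lengths in substitutions per site (so each pair of lineages coalesces at rate 2/θ), and a species tree encoded as a leaf name or a list of `(child, branch_length)` pairs. Only the gene tree topology is returned, and the names `coalesce_branch`, `lineages_at` and `simulate_gene_tree` are hypothetical.

```python
import random


def coalesce_branch(lineages, length, theta, rng):
    """Coalesce lineages inside one branch; length=None is the unbounded root branch."""
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Assumed scaling: k*(k-1)/2 pairs, each coalescing at rate 2/theta.
        elapsed += rng.expovariate(k * (k - 1) / theta)
        if length is not None and elapsed > length:
            break  # remaining lineages leave the branch without coalescing
        i, j = rng.sample(range(k), 2)
        merged = (lineages[i], lineages[j])
        lineages = [x for n, x in enumerate(lineages) if n not in (i, j)] + [merged]
    return lineages


def lineages_at(node, theta, rng):
    """Gene lineages present at a species-tree node, pooled from its child branches."""
    if isinstance(node, str):
        return [node]
    pooled = []
    for child, length in node:
        pooled.extend(coalesce_branch(lineages_at(child, theta, rng), length, theta, rng))
    return pooled


def simulate_gene_tree(species_tree, theta, rng):
    """Return one gene tree topology (nested tuples) drawn within the species tree."""
    return coalesce_branch(lineages_at(species_tree, theta, rng), None, theta, rng)[0]


rng = random.Random(42)
# Hypothetical species tree ((a:0.01, b:0.01):0.002, (c:0.01, d:0.01):0.002).
species_tree = [([("a", 0.01), ("b", 0.01)], 0.002), ([("c", 0.01), ("d", 0.01)], 0.002)]
gene_tree = simulate_gene_tree(species_tree, 0.005, rng)
```

Short internal branches relative to θ make it more likely that lineages fail to coalesce within a branch, which is what produces gene trees that disagree with the species tree.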
[Figure: Robinson-Foulds distance (axis 5 to 25) to the model tree for each method under incomplete lineage sorting. Panel "Complete data": Gene Trees, MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA. Panel "Missing data": MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA.]
Summary
Simulation of sequence-based phylogenetic analysis for multiple data sets
Under the assumption of tree-like evolution for most genes, superalignment yields the highest accuracy
In case of high incongruence among gene trees, other methods may outperform superalignment
Matrix Representation methods are the best choice for supertree reconstruction