SLIDE 1

Comparison of commonly used methods for combining multiple phylogenetic data sets

Anne Kupczok, Heiko A. Schmidt and Arndt von Haeseler

Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories

June 12th, 2008

SLIDE 4

Motivation

Multi-Locus Datasets

[Figure: workflow from data collection to phylogeny reconstruction, shown with a Taxa/Genes data matrix (labels a to t and A to T)]

Approaches: early level, medium level, late level

SLIDE 5

Methods

Early-level combination: Superalignment

Also called Supermatrix or 'Total Evidence'
Combination by concatenating the data sets into one data matrix
Any tree reconstruction method can be applied to the data matrix
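The concatenation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `concatenate_alignments`, the dict-based alignment format, and the choice of `-` as padding for taxa missing from a gene are all assumptions made for the example.

```python
def concatenate_alignments(alignments, missing="-"):
    """Concatenate per-gene alignments into one superalignment.

    `alignments` is a list of dicts mapping taxon name -> aligned sequence.
    Taxa absent from a gene are padded with the `missing` character
    (an assumption of this sketch).
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the real sequence if the taxon was sampled for this gene.
            rows[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Toy data: taxon B is missing from the second gene.
gene1 = {"A": "ACGT", "B": "ACGA", "C": "ACTT"}
gene2 = {"A": "GG", "C": "GA"}
supermatrix = concatenate_alignments([gene1, gene2])
```

The resulting dictionary can be written out in any alignment format and passed to a tree reconstruction program.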

SLIDE 7

Methods

Late-level combination: Supertree

Construct separate trees for each gene and combine them into a supertree.
Supertree methods combine special kinds of information:

Split information → Matrix Representation:
MR with Parsimony (MRP; Baum, 1992; Ragan, 1992)
MR with Flipping (MRF; e.g. Chen et al., 2003)
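To make the matrix representation concrete, the sketch below builds the standard Baum/Ragan coding from a set of rooted source trees: each clade of a source tree becomes one column, with `1` for taxa inside the clade, `0` for other taxa of that tree, and `?` for taxa the tree does not contain. The nested-tuple tree format and the function names are assumptions of this example, and only the encoding is shown; the parsimony (MRP) or flipping (MRF) analysis of the matrix needs a separate program.

```python
def leaves(tree):
    """Leaf names of a nested-tuple tree such as (("A", "B"), "C")."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield leaves(child)
            yield from clades(child)


def mrp_matrix(trees):
    """Return taxon -> character string, one column per source-tree clade."""
    all_taxa = sorted(frozenset().union(*(leaves(t) for t in trees)))
    columns = []
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            columns.append({
                taxon: "1" if taxon in clade else "0" if taxon in present else "?"
                for taxon in all_taxa
            })
    return {taxon: "".join(col[taxon] for col in columns) for taxon in all_taxa}


# Two toy gene trees with partially overlapping taxon sets.
source_trees = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
matrix = mrp_matrix(source_trees)
```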

SLIDE 8

Methods

Late-level combination: Supertree

Construct separate trees for each gene and combine them into a supertree.
Supertree methods combine special kinds of information:

Triplet information → Rooted triplets:
MinCut (Semple and Steel, 2000)
Modified MinCut (Page, 2002)
MaxCut (Snir and Rao, 2006)
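A rooted triplet ab|c states that a and b are more closely related to each other than either is to c. As a rough illustration of the input these methods work on, the following sketch lists the rooted triplets displayed by one rooted tree. The nested-tuple tree format and the function names are assumptions of the example; it does not implement MinCut, Modified MinCut, or MaxCut themselves.

```python
from itertools import combinations


def leaves(tree):
    """Leaf names of a nested-tuple tree such as (("A", "B"), "C")."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def rooted_triplets(tree):
    """Return all triplets ab|c of a rooted tree as (frozenset({a, b}), c)."""
    triplets = set()
    if isinstance(tree, str):
        return triplets
    child_leaves = [leaves(child) for child in tree]
    for i, inside in enumerate(child_leaves):
        # Taxa hanging off the other children of this node.
        outside = frozenset().union(
            *(other for j, other in enumerate(child_leaves) if j != i)
        )
        # Two taxa from the same child group together relative to any outsider.
        for a, b in combinations(sorted(inside), 2):
            for c in outside:
                triplets.add((frozenset((a, b)), c))
    for child in tree:
        triplets |= rooted_triplets(child)
    return triplets


triplets = rooted_triplets((("A", "B"), ("C", "D")))
```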

SLIDE 9

Methods

Medium-level combination

Intermediate data (not final trees) are computed from every source alignment and subsequently combined into a tree.

SuperQP: Combination of quartet likelihoods (Schmidt, 2003)
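One simple way to combine quartet likelihoods across genes is sketched below. It is only an illustration of the general idea and not necessarily the scheme SuperQP uses: the function name `combine_quartet_loglik`, the assumption that genes are independent (so log-likelihoods add), and the example numbers are all hypothetical.

```python
import math


def combine_quartet_loglik(per_gene):
    """Combine per-gene log-likelihoods of the three topologies of one quartet.

    `per_gene` holds one (lnL1, lnL2, lnL3) tuple per gene containing all
    four taxa. Treating genes as independent (an assumption of this sketch),
    the log-likelihoods are summed and then normalised to weights.
    """
    totals = [sum(gene[i] for gene in per_gene) for i in range(3)]
    best = max(totals)
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(total - best) for total in totals]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical log-likelihoods for quartet {A, B, C, D} in two genes:
# topologies AB|CD, AC|BD, AD|BC.
support = combine_quartet_loglik([(-100.0, -102.0, -103.0), (-50.0, -49.5, -52.0)])
```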

SLIDE 10

Methods

Medium-level combination

Intermediate data (not final trees) are computed from every source alignment and subsequently combined into a tree.

Average Consensus: Average over the distance matrices of the genes (Lapointe and Cucumel, 1997)
SDM: Additional weights estimated (Criscuolo et al., 2006)
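The averaging idea can be sketched as follows, assuming each gene supplies pairwise distances only for the taxa it contains. The data layout (a dict keyed by taxon pairs) and the function name are assumptions of this example; the additional weights that SDM estimates are not modelled, and a distance-based tree method would still have to be run on the averaged matrix.

```python
def average_distances(matrices):
    """Average pairwise distances over the genes in which both taxa occur.

    Each matrix maps frozenset({taxon_a, taxon_b}) -> distance.
    """
    sums, counts = {}, {}
    for matrix in matrices:
        for pair, distance in matrix.items():
            sums[pair] = sums.get(pair, 0.0) + distance
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


AB, AC, BC = frozenset("AB"), frozenset("AC"), frozenset("BC")
gene1 = {AB: 0.1, AC: 0.3, BC: 0.4}
gene2 = {AB: 0.3}  # taxon C was not sampled for this gene
averaged = average_distances([gene1, gene2])
```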

SLIDE 11

Simulation

Simulation setting

1 Estimate an ML tree with branch lengths and model parameters from a data superalignment → species tree
2 Generate gene trees
3 Simulate alignments along the gene trees
4 Apply the reconstruction methods to each data set and compare the result with the model tree

[Figure: workflow diagram of steps 1 to 4 linking the alignments, the (true) species tree, the gene trees, the simulated alignments, and the reconstructed tree]
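Step 3, simulating an alignment along a gene tree, can be illustrated with a small sketch. The slides do not state which substitution model was used, so the Jukes-Cantor (JC69) model here is purely an assumption, as are the tree format (`(name, branch_length)` for a leaf, `((child, ...), branch_length)` for a clade) and the function names.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under Jukes-Cantor (JC69)."""
    # Probability that a site ends in a different base than it started with.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(tree, seq, rng, alignment=None):
    """Pass `seq` down the tree and collect the leaf sequences."""
    if alignment is None:
        alignment = {}
    node, length = tree
    seq = evolve(seq, length, rng)
    if isinstance(node, str):
        alignment[node] = seq
    else:
        for child in node:
            simulate(child, seq, rng, alignment)
    return alignment


rng = random.Random(1)
root_seq = "".join(rng.choice(BASES) for _ in range(200))
inner = ((("A", 0.05), ("B", 0.05)), 0.1)  # clade (A, B) on a branch of 0.1
tree = ((inner, ("C", 0.2)), 0.0)          # root with branch length 0
alignment = simulate(tree, root_seq, rng)
```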

SLIDE 12

Simulation

Species tree

10 genes of 25 Crocodylia species (Gatesy et al., 2004)

[Figure: taxa (5 to 25) and length (500 to 2000) of the data sets, followed by the species tree (scale bar .10) with the tips C_moreletii_14, C_acutus_12, C_intermediu_13, C_rhombifer_11, C_niloticus_21, C_novaeguineae_18, C_mindorensis_17, C_johnstoni_16, C_palustris_20, C_siamensis_15, C_porosus_19, T_schlegelii_24, G_gangeticus_25, C_latirostris_5, C_crocodilus_4, M_niger_6, P_palpebrosus_7, P_trigonatus_8, A_mississippiensis_9, A_sinensis_10, Paleognathae_1, Neognathae_2, Testudines_3, C_cataphractus_22, O_tetraspis_23]

SLIDE 13

Results

Complete and missing data

Step 2: Gene trees are the complete model tree (complete data) or the pruned model tree (missing data)
Step 3: Simulation with the parameters estimated from the superalignment

[Figure: workflow diagram showing steps 1 to 3 and the estimated parameters]
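Pruning the model tree to the taxa available for one gene can be sketched as below. The nested-tuple tree format and the function name `prune` are assumptions of this example, and only the topology is handled; a full implementation would also add up branch lengths when a node with a single remaining child is removed.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the taxa in `keep`.

    Returns None if no kept taxon is left; nodes with a single remaining
    child are suppressed.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


model_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree_missing = prune(model_tree, {"A", "C", "D"})
```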

SLIDE 15

Results: Complete and missing data

[Figure: Robinson-Foulds distance (axis ticks 5 to 25), one panel per method. Complete data: Gene Trees, MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA. Missing data: MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA]
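The Robinson-Foulds distance counts the bipartitions (splits) that occur in exactly one of two trees on the same taxa. A minimal sketch, assuming nested-tuple trees that are compared as unrooted trees and reporting the raw count without any normalisation (some authors halve it); the function names are assumptions of the example:

```python
def leaves(tree):
    """Leaf names of a nested-tuple tree such as (("A", "B"), "C")."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the taxa induced by the internal edges."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            side = leaves(child)
            other = all_taxa - side
            # Skip trivial splits that separate a single taxon from the rest.
            if len(side) > 1 and len(other) > 1:
                found.add(frozenset((side, other)))
            visit(child)

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


model = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(model, estimate)
```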

SLIDE 16

Results

Incomplete lineage sorting

Step 2: For every simulation, a gene tree is generated from the species tree with a coalescent process (θ = 0.005)
Step 3: Simulation with the parameters estimated from the superalignment

[Figure: workflow diagram showing steps 1 to 3 and the estimated parameters]
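How a coalescent process turns a species tree into a gene tree can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulation used for the slides: one sampled lineage per species, only the gene tree topology is returned, and a branch of length t (substitutions per site) is converted to 2t/θ coalescent units, which assumes θ = 4Nμ with time measured in units of 2N generations. The tree format and function names are likewise hypothetical.

```python
import random


def coalesce(lineages, duration, rng):
    """Merge random pairs of lineages for `duration` coalescent time units."""
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # With k lineages, the waiting time to the next merger is
        # exponential with rate k * (k - 1) / 2.
        elapsed += rng.expovariate(k * (k - 1) / 2.0)
        if elapsed > duration:
            break
        i, j = sorted(rng.sample(range(k), 2))
        second = lineages.pop(j)
        first = lineages.pop(i)
        lineages.append((first, second))
    return lineages


def gene_tree(species_tree, theta, rng):
    """Species tree nodes are (name, length) or ((child, ...), length)."""
    def ascend(node):
        content, length = node
        if isinstance(content, str):
            lineages = [content]
        else:
            lineages = [lin for child in content for lin in ascend(child)]
        # Assumed scaling from substitutions per site to coalescent units.
        return coalesce(lineages, 2.0 * length / theta, rng)

    remaining = ascend(species_tree)
    # Above the root all remaining lineages eventually coalesce.
    return coalesce(remaining, float("inf"), rng)[0]


def leaf_names(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_names(child) for child in tree))


species = ((((("A", 0.05), ("B", 0.05)), 0.5), ("C", 0.55)), 0.0)
sampled = gene_tree(species, 0.005, random.Random(0))
```

With an internal branch this long relative to θ, the gene tree almost always matches the species tree; shorter internal branches leave more room for incomplete lineage sorting.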

SLIDE 18

Results: Incomplete lineage sorting

[Figure: Robinson-Foulds distance (axis ticks 5 to 25), one panel per method. Complete data: Gene Trees, MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA. Missing data: MRP, MRF, ModMinCut, MaxCut, SuperQP, SDM, SA]

SLIDE 20

Summary

Simulation of sequence-based phylogenetic analysis for multiple data sets
With the assumption of tree-like evolution for most genes, superalignment yields the highest accuracy
In case of high incongruence among gene trees, other methods may outperform superalignment
Among the supertree methods compared, Matrix Representation methods (MRP, MRF) are the best choice for supertree reconstruction

Acknowledgements:

Gregory Ewing (CIBIV)
WWTF for funding