Methodological Challenges in the Pursuit of the Tree of Life

Reviews in Computational Biology. Christophe Dessimoz, February 13th, 2013.


SLIDE 1

Methodological Challenges in the Pursuit of the Tree of Life

Christophe Dessimoz

Reviews in Computational Biology

February 13th, 2013

SLIDE 2

Outline

  • Introduction
  • Mature methods: supermatrix, supertree
  • Emerging methods: species-tree
  • Outlook
SLIDE 3

Augustin Augier, Arbre Botanique (1801)

SLIDE 4

Lamarck, Philosophie Zoologique, 1809

SLIDE 5

Darwin, Notebook B, 1837

Wikipedia

SLIDE 6
SLIDE 7

16S rRNA was used by Woese (1987) to group early life forms into three kingdoms

SLIDE 8

Snel et al. Genome trees and the nature of genome evolution. Annu Rev Microbiol (2005) vol. 59 pp. 191-209

Genomic Era

SLIDE 9
SLIDE 10

PART I

Established Methods:

Supermatrix and Supertree
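
As a rough illustration of the supermatrix idea (per-gene alignments concatenated into one matrix, with species that lack a gene padded by missing-data characters), here is a minimal Python sketch. The function name `build_supermatrix`, the toy sequences, and the use of "?" for missing data are assumptions made for this example, not something taken from the talk.

```python
def build_supermatrix(alignments, missing="?"):
    """Concatenate per-gene alignments given as dicts: species -> aligned sequence."""
    species = sorted({sp for aln in alignments for sp in aln})
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            # Species without this gene get a block of missing-data symbols.
            parts[sp].append(aln.get(sp, missing * width))
    return {sp: "".join(blocks) for sp, blocks in parts.items()}


# Hypothetical toy data: two genes, three species.
gene_a = {"human": "ACGT", "mouse": "ACGA", "yeast": "TCGA"}
gene_b = {"human": "GGC", "yeast": "GAC"}  # mouse lacks this gene

supermatrix = build_supermatrix([gene_a, gene_b])
for sp, seq in supermatrix.items():
    print(sp, seq)
```

A supertree approach would instead infer one tree per gene and combine the trees; that step is not sketched here.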

SLIDE 11

Gene trees, Homology, Orthology & Paralogy

[Figure: gene tree with duplication, gene loss, and speciation events, marking an example pair of orthologs and an example pair of paralogs]

Altenhoff and Dessimoz, Methods in Molecular Biology 2012
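
The standard definition behind this slide is that two homologous genes are orthologs if their last common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The sketch below applies that rule to a small labelled gene tree. The tree, the gene names, and the nested-tuple representation are hypothetical and only serve the illustration.

```python
# Internal nodes are (event, left, right); leaves are gene names (strings).
GENE_TREE = ("duplication",
             ("speciation", "human_a", "mouse_a"),
             ("speciation", "human_b", "mouse_b"))


def leaves(node):
    """List the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "orthologs" if event == "speciation" else "paralogs"
    # Pairs split across the two children have this node as their LCA.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(GENE_TREE)
for pair, relation in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), relation)
```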

SLIDE 12
SLIDE 13
SLIDE 14

[Figure: fraction of marker genes used vs. # of genomes. Labels shown: 100, 1000, 30 genes]

SLIDE 15

[Figure: fraction of marker genes used (up to full genome) vs. # of genomes. Labels shown: 100, 1,000, 13 genes]

SLIDE 16

[Figure: fraction of marker genes used (up to full genome) vs. # of genomes. Labels shown: 578, 1000, 31 genes]

SLIDE 17

[Figure: fraction of marker genes used (up to full genome) vs. # of genomes. Labels shown: 1000, 2684, 8 genes]

SLIDE 18

i.e. 50% bootstrap support! i.e. 95% bootstrap support!

SLIDE 19

But!

SLIDE 20

Actually, only a small fraction of the data is used.

SLIDE 21

[Figure: # of species vs. # of marker genes in published studies]

Studies labelled: Ciccarelli 2006, Pisani 2007, Wu & Eisen 2008, Dunn et al. 2008, Goloboff et al. 2009, Hejnol et al. 2009, Edwards et al. 2010, Smith et al. 2011

Values shown: 73,060; 2684; 578; 191; 150; 77; 31; 8; axis marks at 1000 and >1000

Since then... ?

SLIDE 22

Gene tree ≠ Species tree

SLIDE 23

Gene tree ≠ Species tree

  • Gene duplication (paralogs)
  • Lateral gene transfer (xenologs)
  • Endosymbiosis (e.g. Delsuc et al. 2005)
  • Hybridization (Hallström & Janke 2008)
  • Incomplete lineage sorting (aka deep coalescence)

Jeffroy et al. 2006 McInerney et al. 2008 Edwards 2009 Philippe et al. 2011

SLIDE 24

Systematic Errors

  • Branch-length heterogeneity (Matsen & Steel 2007, Edwards 2009)
  • Nucleotide composition heterogeneity across species (Hasegawa & Hashimoto 1993, Jeffroy et al. 2006)
  • Missing data (Hartmann & Vision 2008)
  • In general: model violations
SLIDE 25
SLIDE 26

Systematic error can result in overconfidence

e.g. three studies placing the same four groups differently, each with its own support values (taxa listed in the order they appear on the slide):

  • Dunn et al. Nature 2008: Sponges (Porifera), Bilateria, Cnidaria (Corals, jellyfish), Comb Jellies (Ctenophora); support 80-90%, 0-70%
  • Schierwater et al. PLoS Biol 2009: Cnidaria (Corals, jellyfish), Comb Jellies (Ctenophora), Sponges (Porifera), Bilateria; support 53%, 27%
  • Philippe et al. Current Biol 2009: Sponges (Porifera), Bilateria, Cnidaria (Corals, jellyfish), Comb Jellies (Ctenophora); support 62-96%, 78-99%

Same argument in Philippe et al. 2011

All photos from Wikipedia

SLIDE 27

PART II

Emerging Methods:

Species-Tree Inference Methods

Most relevant review: Anderson et al. Methods in Molecular Biology 2012

SLIDE 28

Two main classes

  • Methods modelling specific processes (“mechanistic”)
      a) Rate variation within/among markers
      b) Gene duplication
      c) Deep coalescence
      d) Lateral gene transfer
  • Process agnostic (“empirical”)
SLIDE 29

a) Rate heterogeneity within/between genes

  • Within genes:
      • Among-site rate heterogeneity (Gamma-rate): Yang 1993, Yang 1994
      • Among-site model heterogeneity (CAT model): Lartillot & Philippe 2004
      • Heterotachy (change over time, i.e. across branches): Galtier 2001, Penny 2001
  • Among genes:
      • Proportional model: Pupko 2002, Dessimoz et al. 2008
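
To make the Gamma-rate idea concrete: site rates are drawn from a gamma distribution with mean 1 and shape α, which in practice is usually discretised into a few equal-probability categories. The sketch below approximates the category mean rates by Monte Carlo sampling with the standard library only. This is an assumption-laden stand-in for illustration; Yang (1994) computes the category rates analytically, and the function name `discrete_gamma_rates` is made up here.

```python
import random


def discrete_gamma_rates(alpha, k=4, n_samples=100_000, seed=1):
    """Approximate mean rates of k equal-probability gamma categories (mean rate 1)."""
    rng = random.Random(seed)
    # Gamma with shape alpha and scale 1/alpha has mean 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // k  # any leftover samples at the top are ignored
    means = [sum(draws[i * size:(i + 1) * size]) / size for i in range(k)]
    # Rescale so the category rates average exactly 1.
    overall = sum(means) / k
    return [m / overall for m in means]


rates = discrete_gamma_rates(alpha=0.5)
print([round(r, 3) for r in rates])
```

A small α gives strongly unequal categories (most sites nearly invariant, a few fast), while a large α makes all categories approach 1.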

SLIDE 30

b) Duplication events

Intro: gene/species tree reconciliation

Dufayard et al., Bioinformatics, 2005

G (gene tree): G1 Homo sapiens, G4 Pan troglodytes, G3 Rattus norvegicus, G2 Mus musculus

S (species tree): Homo sapiens, Pan troglodytes, Rattus norvegicus, Mus musculus

R (reconciled tree), below a duplication node:

  • Loss Homo sapiens, G4 Pan troglodytes, G3 Rattus norvegicus, Loss Mus musculus
  • G1 Homo sapiens, Loss Pan troglodytes, G2 Mus musculus, Loss Rattus norvegicus

Reviewed in Altenhoff & Dessimoz, Methods in Molecular Biology 2012

SLIDE 31

Reconciliation: Parsimony & Likelihood

[Figure: reconciled tree R with one duplication node; leaves Loss Homo sapiens, G4 Pan troglodytes, G3 Rattus norvegicus, Loss Mus musculus and G1 Homo sapiens, Loss Pan troglodytes, G2 Mus musculus, Loss Rattus norvegicus]

Parsimony: minimise the number of duplications and losses

Likelihood: pick the reconciliation(s) that maximise the probability of observing the data (i.e. gene/species trees) under a particular model

Reviewed in Altenhoff & Dessimoz, Methods in Molecular Biology 2012
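
One common way to obtain a parsimonious reconciliation is LCA mapping: each gene-tree node is mapped to the last common ancestor, in the species tree, of the species below it; a node mapped to the same place as one of its children is a duplication, and losses are counted from the species-tree levels that are skipped. The sketch below is a minimal version for binary trees. The topologies are assumptions read off the leaf labels of the slide's example (primates and rodents as clades in S; G4 with G3 and G1 with G2 in G), and the function names are made up for this illustration.

```python
SPECIES_TREE = (("Homo sapiens", "Pan troglodytes"),
                ("Rattus norvegicus", "Mus musculus"))
GENE_TREE = (("G4", "G3"), ("G1", "G2"))
GENE_SPECIES = {"G1": "Homo sapiens", "G2": "Mus musculus",
                "G3": "Rattus norvegicus", "G4": "Pan troglodytes"}


def index_species_tree(tree, depth=0, out=None):
    """Return (clade, depths), where depths maps each clade (frozenset) to its depth."""
    out = {} if out is None else out
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        left, right = tree
        clade = (index_species_tree(left, depth + 1, out)[0]
                 | index_species_tree(right, depth + 1, out)[0])
    out[clade] = depth
    return clade, out


def reconcile(gene_tree, gene_species, depths):
    """Return (root mapping, duplications, losses) under LCA reconciliation."""
    def lca(species):
        # Smallest species-tree clade containing all the given species.
        return min((c for c in depths if species <= c), key=len)

    def visit(node):
        if isinstance(node, str):
            return lca(frozenset([gene_species[node]])), 0, 0
        (m1, d1, l1), (m2, d2, l2) = visit(node[0]), visit(node[1])
        m = lca(m1 | m2)
        dups, losses = d1 + d2, l1 + l2
        if m in (m1, m2):  # duplication: both copies start at m
            dups += 1
            losses += (depths[m1] - depths[m]) + (depths[m2] - depths[m])
        else:              # speciation: each child starts one level below m
            losses += (depths[m1] - depths[m] - 1) + (depths[m2] - depths[m] - 1)
        return m, dups, losses

    return visit(gene_tree)


_, depths = index_species_tree(SPECIES_TREE)
_, duplications, losses = reconcile(GENE_TREE, GENE_SPECIES, depths)
print(duplications, "duplication(s),", losses, "loss(es)")
```

On these assumed topologies the count is one duplication and four losses, matching the one duplication node and four "Loss" leaves in R. A gene tree congruent with S, such as `(("G1", "G4"), ("G3", "G2"))`, gives zero of each.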

SLIDE 32

IDEA: treat species tree as unknown (or at least somewhat uncertain) quantity

SLIDE 33

c) Modelling Coalescent

Rannala & Yang, Annu Rev Genomics Hum Genet 2008

[Figure: sequence alignments, model parameters, and gene trees for each locus; annotations: time of speciation, time to most recent common ancestor]

IDEA: instead of fixing the species tree, treat it as a parameter!
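
A small worked case shows why the gap between the time of speciation and the time to the most recent common ancestor matters. For three species, with an internal species-tree branch of length T in coalescent units, the textbook result under the coalescent is that the gene-tree topology matches the species tree with probability 1 − (2/3)·exp(−T). The sketch below computes this and checks it with a simple simulation. The function names and the chosen T are assumptions for the example; this is not the inference method of any particular program.

```python
import math
import random


def p_concordant(T):
    """Probability that a 3-taxon gene tree matches the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-T)


def simulate_concordance(T, n=100_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Two lineages in the internal branch coalesce at rate 1.
        if rng.expovariate(1.0) < T:
            hits += 1  # coalesced before the deeper speciation: concordant
        elif rng.randrange(3) == 0:
            hits += 1  # deep coalescence: the matching pair joins first 1 time in 3
    return hits / n


T = 1.0
print(round(p_concordant(T), 4), round(simulate_concordance(T), 4))
```

As T approaches 0 the three topologies become equally likely (probability 1/3 each), which is the regime where incomplete lineage sorting is most severe.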

SLIDE 34

Methods

also see review of Liu et al 2009

[Table of methods not recovered; surviving annotations: (summary statistics), (parsimony), (parsimony)]

SLIDE 35

d) Lateral gene transfer

SLIDE 36

Process agnostic

  • Independent tree inference for each gene (relatively efficient!)
  • Number of different trees modeled as Dirichlet process

[Figure: all sequence alignments, tree of gene i, gene-to-tree map; annotation: assumption of independence among genes]

SLIDE 37

Dirichlet Process a.k.a. Chinese Restaurant Process

e.g. http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf
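
The Chinese Restaurant Process can be read as a prior on how genes share trees: each new gene joins an existing tree with probability proportional to the number of genes already assigned to it, or starts a new tree with probability proportional to α. The sketch below simulates that assignment. It is a generic illustration of the process, not the implementation of any specific software, and the function name and parameter values are assumptions.

```python
import random


def chinese_restaurant_process(n_genes, alpha, seed=0):
    """Assign genes to tree clusters; return (assignment, cluster sizes)."""
    rng = random.Random(seed)
    sizes = []       # number of genes currently sharing each distinct tree
    assignment = []  # cluster index for each gene, in order
    for _ in range(n_genes):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignment.append(k)
    return assignment, sizes


assignment, sizes = chinese_restaurant_process(n_genes=30, alpha=1.0)
print(len(sizes), "distinct trees with sizes", sizes)
```

Larger α values favour more distinct gene trees (more incongruence a priori), and smaller values favour a single shared tree.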

SLIDE 38

Evaluation with simulated data

SLIDE 39

Leaché & Rannala, Syst Biol 2010

[Figure: axes are population size × mutation rate and tree length; baseline shown is the difference between gene and species tree]

SLIDE 40

Chung & Ané 2011

[Figure: panels "Incomplete Lineage Sorting only" and "Horizontal Gene Transfers + ILS", comparing mechanistic (ILS) and empirical methods; an arrow marks the "Better" direction]

SLIDE 41

[Figure: results plot; an arrow marks the "Better" direction]

SLIDE 42

Evaluation with empirical data

SLIDE 43
  • “Note that the concordance factors in the [BUCKy] tree are much more conservative than the posterior probabilities in the topology estimated from the concatenated alignment”
  • “Taking into account the incongruence between gene trees does not drastically change our overall view of rice phylogeny, but it does give a more varied picture of the support across the tree.”
  • “[BUCKy] is robust to the prior probability on gene tree incongruence (the α parameter)”
  • “[The 6-species, 162 genes Bayesian analysis] had not yet reached stationarity after 1.6 billion iterations.” (2 months on 96 CPU cores)
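
To give a feel for what a concordance factor measures, the sketch below counts the fraction of gene trees that contain a given clade. This is a deliberate simplification: it uses one fixed rooted tree per gene, whereas a Bayesian concordance analysis also accounts for uncertainty in each gene tree. The toy trees, taxon names, and function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set of this subtree, set of all clades within it)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, {leaf}
    members = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        members |= child_leaves
        found |= child_clades
    found.add(members)
    return members, found


def concordance_factor(gene_trees, clade):
    """Fraction of gene trees in which the given set of taxa forms a clade."""
    clade = frozenset(clade)
    return sum(clade in clades(t)[1] for t in gene_trees) / len(gene_trees)


# Four hypothetical gene trees on taxa A-D, as nested tuples.
gene_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
    ((("A", "C"), "B"), "D"),
]
cf_ab = concordance_factor(gene_trees, {"A", "B"})
cf_abc = concordance_factor(gene_trees, {"A", "B", "C"})
print(cf_ab, cf_abc)
```

A clade can have a modest concordance factor even when a concatenated analysis assigns it a very high posterior probability, which is the contrast drawn in the first quotation.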

SLIDE 44

Outlook

  • Bottleneck is methods, not data
  • Need methods able to deal with different gene histories
  • Very difficult to say which approach yields better results solely from first principles
  • => need for sound simulation/empirical tests
  • Efficiency needs to be improved (“The largest data set yet tested with these species tree methods is yeast, with 106 loci in 8 species” Cranston 2009)