Statistical binning enables an accurate coalescent-based estimation of the avian tree
Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. Science (2014)
Avian whole-genome phylogenies [Jarvis, Mirarab, et al., Science, 2014]
48 representative birds
[Figure: species tree error against amount of data (i.e., number of genes), annotated "Hope!"]
gene: recombination-free orthologous regions in genomes
[Figure: gene trees for gene 1, gene 2, ..., gene 999, gene 1000 on Eagle, Owl, Falcon, Finch, with one panel labeled "A gene tree" and another labeled "The species tree".]
Causes of gene tree discordance:
branches) such as the bird radiation; 60 mya
tree distribution [Degnan and Salter, 2005]
Concatenation
[Figure: the alignments of gene 1, gene 2, ..., gene 999, gene 1000 are concatenated into a single alignment, which is analyzed with ML to produce one tree on Eagle, Owl, Falcon, Finch (a branch is labeled 81%).]
[Roch and Steel, Theo. Pop. Gen., 2014]
[Kubatko and Degnan, Systematic Biology, 2007]
[Mirarab, et al., Systematic Biology, 2014]
[Figure: error against amount of data.]
Summary method
[Figure: a gene tree on Eagle, Owl, Falcon, Finch is estimated from each gene alignment (gene 1, gene 2, ..., gene 999, gene 1000), and a summary method combines the gene trees into one species tree (a branch is labeled 78%). A second panel plots error against amount of data.]
Can be statistically consistent
[Liu, Yu, Edwards, BMC Evol. Bio., 2010]
[Figure annotation: True gene trees]
14,000 “genes”: 8,000 exons, 2,500 introns, and 3,500 Ultra-Conserved Elements
[Figure: histogram of branch bootstrap support (x-axis, 0% to 100%) against percentage of branches (y-axis, 5% to 20%), with the median and mean marked.]
Bootstrap support: a measure of confidence in estimated gene tree branches
[Figure: species trees from binned MP-EST and unbinned MP-EST with branch support values, for two datasets, A) Avian and B) Metazoan. Branches in conflict with strong evidence are marked.]
Error metric: percentage of branches in true tree that are missing from the estimated tree
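The error metric above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: trees are written as nested tuples, the root position is ignored, and the two example topologies and the names `bipartitions` and `fn_rate` are hypothetical.

```python
def leaf_set(tree):
    """Leaves of a tree written as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial branches (splits) of the tree, ignoring the root position."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # each split is stored as the side that lacks this taxon
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if ref in below else below
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches that are missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


# Hypothetical topologies on the five taxa named in the slides.
true_tree = ((("Eagle", "Owl"), "Falcon"), ("Finch", "Pigeon"))
estimated_tree = ((("Eagle", "Falcon"), "Owl"), ("Finch", "Pigeon"))
print(fn_rate(true_tree, estimated_tree))
```

Here one of the two internal branches of the true tree is missing from the estimate, so the sketch reports an FN rate of one half.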
[Figure: simulation pipeline. True (model) species tree → true gene trees → sequence data → estimated gene trees → estimated species tree, illustrated on Finch, Falcon, Owl, Eagle, Pigeon.]
Avian-like simulations (1000 genes) [Mirarab, et al., Science, 2014]
[Figure: species tree topological error (FN; y-axis, 5% to 20%) against gene sequence length (x-axis: 1,500, 1,000, 500, 250), with an arrow toward shorter sequences marking more gene tree error. Curves: MP-EST (a statistically consistent summary method) and concatenation (ML).]
[Ané, et al., MBE, 2007] [Patel, et al., MBE, 2013] [Gatesy, Springer, MPE, 2014] [Mirarab, et al., Systematic Biology, 2014]
Summary methods: All “genes” independent
Concatenation: All “genes” put together
Binning
increase the phylogenetic signal
[Figure: two gene trees, g1 and g2, on taxa A to G. The branches of g1 have support 40%, 70%, 85%, 20%; the branches of g2 have support 65%, 25%, 90%, 70%. Branches with support <50% are collapsed, leaving 70% and 85% in g1 and 65%, 90%, 70% in g2. The collapsed trees fit on a single tree on A to G: Compatible.]
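The compatibility test pictured above can be sketched as follows. This is a hedged illustration: each gene tree is reduced to its internal branches (splits) with support values, branches below the threshold are dropped, and two trees on the same taxa count as compatible when every remaining pair of splits is compatible. The split sets for `g1` and `g2` are hypothetical (the slide's topologies are not recoverable); only the support values come from the slide.

```python
from itertools import product

TAXA = frozenset("ABCDEFG")


def splits_compatible(a, b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    sides_a, sides_b = (a, taxa - a), (b, taxa - b)
    return any(not (x & y) for x, y in product(sides_a, sides_b))


def collapse(tree, threshold):
    """Keep only branches whose support reaches the threshold."""
    return {split for split, support in tree.items() if support >= threshold}


def trees_compatible(t1, t2, taxa, threshold=50):
    """Compare two gene trees after collapsing low-support branches."""
    s1, s2 = collapse(t1, threshold), collapse(t2, threshold)
    return all(splits_compatible(a, b, taxa) for a in s1 for b in s2)


# Hypothetical topologies; each key is one side of an internal branch.
g1 = {frozenset("AB"): 40, frozenset("ABC"): 70,
      frozenset("FG"): 85, frozenset("EFG"): 20}
g2 = {frozenset("ABC"): 65, frozenset("AC"): 25,
      frozenset("DE"): 90, frozenset("FG"): 70}

print(trees_compatible(g1, g2, TAXA, threshold=50))
print(trees_compatible(g1, g2, TAXA, threshold=0))
```

With the 50% threshold the conflicting low-support branches disappear and the two trees are compatible; with no threshold they conflict.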
Incompatibility graph
[Figure legend: a node is a gene tree; an edge is an incompatibility between two gene trees.]
edges between any pairs of nodes
pairwise compatible
balanced bins where possible
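One way to turn an incompatibility graph into bins is a greedy vertex coloring that prefers the smallest admissible bin. The sketch below is an assumption made for illustration and is not claimed to be the heuristic used by the authors; the function name `bin_genes` and the example graph are hypothetical.

```python
def bin_genes(genes, incompatible_pairs):
    """Greedy binning: no bin may contain two genes joined by an edge.

    genes: list of gene names (graph nodes).
    incompatible_pairs: iterable of (gene, gene) edges.
    """
    neighbors = {g: set() for g in genes}
    for a, b in incompatible_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    bins = []
    # Place the most constrained genes first (highest degree).
    for g in sorted(genes, key=lambda x: (-len(neighbors[x]), x)):
        allowed = [b for b in bins if not (neighbors[g] & b)]
        if allowed:
            # Choosing the smallest allowed bin keeps bin sizes balanced.
            min(allowed, key=len).add(g)
        else:
            bins.append({g})
    return bins


# Hypothetical graph: g1 conflicts with g2 and g3; g4 and g5 conflict with nothing.
genes = ["g1", "g2", "g3", "g4", "g5"]
edges = [("g1", "g2"), ("g1", "g3")]
bins = bin_genes(genes, edges)
print(bins)
```

Each resulting bin is a set of pairwise compatible genes, which is the property the supergene alignments rely on.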
Original version: unweighted [Mirarab, et al., Science, 2014]
New version: weighted [Bayzid, Mirarab, Warnow, arXiv, 2015]
[Figure: statistical binning pipeline. Gene sequence data (g1, g2, g3, ..., gk) → estimated initial gene trees → incompatibility graph → binned supergene alignments → supergene trees (weighted) → species tree. Step labels: support threshold, (partitioned), MP-EST.]
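The difference between the unweighted and weighted versions can be sketched under one assumption that the slide does not spell out: that "weighted" means each supergene tree is passed to the summary method once per gene in its bin, while "unweighted" passes it once per bin. The function name and the Newick strings are hypothetical placeholders.

```python
def summary_method_input(bins, supergene_trees, weighted=True):
    """Build the list of trees handed to the summary method (e.g. MP-EST).

    bins: list of sets of gene names.
    supergene_trees: one tree per bin, in the same order (Newick strings here).
    """
    trees = []
    for genes, tree in zip(bins, supergene_trees):
        copies = len(genes) if weighted else 1  # assumption: weight = bin size
        trees.extend([tree] * copies)
    return trees


bins = [{"g1", "g4", "g5"}, {"g2", "g3"}]
supergene_trees = ["((Eagle,Owl),(Falcon,Finch));", "((Eagle,Falcon),(Owl,Finch));"]

print(len(summary_method_input(bins, supergene_trees, weighted=False)))
print(len(summary_method_input(bins, supergene_trees, weighted=True)))
```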
48 avian-like species, 1000 genes
[Figure: species tree topological error (FN; y-axis, 5% to 20%) against gene sequence length (x-axis: 1,500, 1,000, 500, 250), with an arrow toward shorter sequences marking more gene tree error. Curves: MP-EST, binned MP-EST, and CA-ML.]
highly supported false positives)
The binned tree was highly supported and was largely congruent with concatenation
RESEARCH ARTICLE: Whole-genome analyses resolve early branches in the tree of life
[Jarvis, Mirarab, et al., Science, 2014]
coalescent-based analyses of the avian dataset
statistical measures of combinability
both unbinned summary methods and concatenation
analyses of the avian dataset; results were largely congruent with concatenation
challenging
and a need for method development
Acknowledgments
Jim Leebens-Mack (UGA), Norman Wickett (U Chicago), Gane Wong (U of Alberta), Keshav Pingali, S.M. Bayzid, Nam Nguyen (now at UIUC), Tandy Warnow, Théo Zimmermann, Bastien Boussau (Université Lyon), Erich Jarvis (Duke, HHMI), Tom Gilbert (U Copenhagen), Guojie Zhang (BGI, China), Ed Braun (U Florida)
… …
HHMI international student fellowship
[Figure: average branch bootstrap support (0% to 100%) against sequence length (log scale, 100 to 100,000).] [Mirarab, et al., Science, 2014]
Tracing alleles through generations
inheritance and maintenance of alleles
times between speciation events and/or large population size
(multi-species coalescent)
probability distribution on the gene trees, and is identifiable from the distribution on gene trees
[Degnan and Salter, Int. J. Org. Evolution, 2005]
Branch length accuracy
[Figure: two MP-EST panels, each with an arrow marking more information per gene.] [Mirarab, et al., Science, 2014]
triplet frequency:
gene trees and from the estimated gene trees (for all triplets of taxa)
[Figure: for one triplet A, B, C, the frequencies of the three topologies (labeled "A B C", "B A C", "C A B") are compared between the true distribution (70%, 15%, 15%) and the estimated distribution (65%, 25%, 15%).]
[Figure: empirical cumulative distribution plot, with an arrow marking more information per gene.]
Supergene trees represent the true gene tree distribution much better than the estimated gene trees without binning.
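The triplet comparison can be sketched as below. This is an illustration under stated assumptions: gene trees are rooted and written as nested tuples, the tree counts are hypothetical, and total variation distance is used only because the slides do not name the measure used to compare the two distributions.

```python
from collections import Counter


def clusters(tree):
    """Leaf sets below every internal node of a rooted nested-tuple tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.append(below)
        return below

    visit(tree)
    return found


def triplet_topology(tree, triplet):
    """Pair of taxa grouped together in the tree for this triplet (None if unresolved)."""
    wanted = frozenset(triplet)
    for cluster in clusters(tree):
        inside = cluster & wanted
        if len(inside) == 2:
            return inside
    return None


def triplet_distribution(trees, triplet):
    """Relative frequency of each triplet topology across a set of gene trees."""
    counts = Counter(triplet_topology(t, triplet) for t in trees)
    total = sum(counts.values())
    return {topology: n / total for topology, n in counts.items()}


def total_variation(p, q):
    """Assumed comparison measure: half the L1 distance between distributions."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


ab = ((("A", "B"), "C"), "D")
ac = ((("A", "C"), "B"), "D")
bc = ((("B", "C"), "A"), "D")

true_trees = [ab] * 14 + [ac] * 3 + [bc] * 3       # hypothetical counts
estimated_trees = [ab] * 13 + [ac] * 5 + [bc] * 2  # hypothetical counts

p = triplet_distribution(true_trees, ("A", "B", "C"))
q = triplet_distribution(estimated_trees, ("A", "B", "C"))
print(total_variation(p, q))
```

Repeating this for all triplets of taxa gives one distance per triplet, and those values are what an empirical cumulative distribution plot would summarize.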