Is the best model good enough? Assessing the absolute fit of phylogenetic models via posterior predictive sampling
Gerhard Jäger
Tübingen University
Workshop Computational and phylogenetic historical linguistics
ICHL24, Canberra, July 4, 2019
[Diagram: model, data, posterior, posterior predictive simulation]
[Figure: posterior probability of P(R)]
1. Draw a sample of parameter values from the posterior.
2. Simulate mock data according to the generative model, using the parameters from the previous step.
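The two steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the phylogenetic setup of the talk: the Beta-Bernoulli model, the data in `observed`, and the helper names `posterior_predictive` and `simulate` are all assumptions made for the example. The point is the loop structure: draw parameters from the posterior, simulate mock data, and compare a summary statistic of the mock data with the same statistic of the empirical data.

```python
import random

def posterior_predictive(posterior_samples, simulate, statistic, n_draws, rng):
    """Return the statistic computed on n_draws simulated data sets."""
    simulated_stats = []
    for _ in range(n_draws):
        theta = rng.choice(posterior_samples)  # step 1: draw parameters from the posterior
        mock_data = simulate(theta, rng)       # step 2: simulate mock data with them
        simulated_stats.append(statistic(mock_data))
    return simulated_stats

# Toy stand-in model (assumption): Bernoulli data with a uniform Beta(1, 1) prior.
rng = random.Random(0)
observed = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
a = 1 + sum(observed)                  # posterior is Beta(a, b)
b = 1 + len(observed) - sum(observed)
posterior = [rng.betavariate(a, b) for _ in range(2000)]

def simulate(p, rng):
    """Generate a mock data set of the same size as the observed one."""
    return [1 if rng.random() < p else 0 for _ in range(len(observed))]

sims = posterior_predictive(posterior, simulate, sum, 1000, rng)

# Share of simulations whose statistic is at least as large as the empirical one.
p_value = sum(s >= sum(observed) for s in sims) / len(sims)
print(p_value)
```

If the empirical statistic falls far in the tail of the simulated distribution, the model fails to reproduce that aspect of the data, however well it scores in relative model comparison.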
[Diagram: two-state model with initial-state probabilities π0, π1 and transition rates α, β]
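The labels π0, π1, α, β suggest a two-state continuous-time Markov chain. As a hedged sketch, the code below computes its transition probabilities over a branch of length t. It assumes that α is the rate of change from state 0 to state 1, that β is the reverse rate, and that π0, π1 are the equilibrium frequencies; none of these assignments can be read off the slide, and the function name `transition_matrix` is hypothetical.

```python
import math

def transition_matrix(alpha, beta, t):
    """Transition probabilities P[i][j] of a two-state CTMC after time t.

    Assumption: alpha is the 0 -> 1 rate and beta is the 1 -> 0 rate.
    """
    total = alpha + beta
    pi1 = alpha / total   # equilibrium probability of state 1
    pi0 = beta / total    # equilibrium probability of state 0
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]

P_short = transition_matrix(0.3, 0.7, 0.1)
P_long = transition_matrix(0.3, 0.7, 100.0)
print(P_short)
print(P_long)
```

On a long branch, both rows approach the equilibrium distribution (π0, π1), so the state at the start of the branch no longer matters.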
[Figure: population and number of phonemes (log scales), by area: Africa, Eurasia, North America, Oceania, South America]
uncertainty
too short
language families
rake-shaped “proto-world” root with unknown depth
estimated from data
directly to the root
[Figure: phylogenetic tree with tip labels of the form FAMILY.SUBGROUP.LANGUAGE (e.g. NC.BANTOID.F21_SUKUMA, IE.IRANIAN.PERSIAN, An.YAPESE.YAPESE) and one parameter labeled τ]
[Figure: mean sound inventory size (in standard deviations)]
[Figure: correlation; legend: model (naive, phylogenetic); label: log-population size]
[Figure: phylogeny with tips Latvian, Russian, Ukrainian, Sorbian_Lower, Macedonian, Faroese, Swedish, Danish, Flemish, Afrikaans, Luxembourgish, Dolomite_Ladino, Romansh, French, Spanish, Portuguese, Romanian, Welsh, Armenian_Eastern, Bihari, Marathi, Gujarati, Marwari, Nepali, Hindi, Urdu, Persian, Tajik, Shughni, Pashto]
23 / 35
mutations, given a phylogeny
[Figure: tip states C C A A B B, shown with internal-node labelings that require 2, 3, and 4 mutations]
$$\mathrm{RI}(T) = \frac{\text{number of mutations avoided in } T}{\text{number of avoidable mutations}}$$
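As a worked illustration, the retention index of a single character can be computed with Fitch parsimony. The sketch below uses the standard definition RI = (g − s) / (g − m), where s is the minimal number of mutations on the tree T, g is the number needed on a completely unresolved (star) tree, and m is the smallest number possible on any tree (number of states minus one). The nested-tuple tree encoding and the function names are assumptions made for the example; the talk does not say which software computed the statistic.

```python
from collections import Counter

def fitch(tree):
    """Return (state set, mutation count) for a binary tree of nested tuples."""
    if isinstance(tree, str):          # a tip carries its observed state
        return {tree}, 0
    left, right = tree
    left_states, left_steps = fitch(left)
    right_states, right_steps = fitch(right)
    common = left_states & right_states
    if common:                         # children can agree: no extra mutation
        return common, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1

def tip_states(tree):
    """List the states at the tips, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [state for child in tree for state in tip_states(child)]

def retention_index(tree):
    """RI = (g - s) / (g - m); returns None if no mutation is avoidable."""
    counts = Counter(tip_states(tree))
    s = fitch(tree)[1]                            # mutations on this tree
    g = sum(counts.values()) - max(counts.values())  # mutations on a star tree
    m = len(counts) - 1                           # minimum on any tree
    if g == m:
        return None
    return (g - s) / (g - m)

# Tip states C C A A B B arranged on three different (hypothetical) trees.
best = (("C", "C"), (("A", "A"), ("B", "B")))     # 2 mutations
middle = (("C", "A"), (("C", "A"), ("B", "B")))   # 3 mutations
worst = (("C", "A"), (("C", "B"), ("A", "B")))    # 4 mutations
print(retention_index(best), retention_index(middle), retention_index(worst))
```

With six tips in states C C A A B B, g = 4 and m = 2, so two mutations are avoidable; trees needing 2, 3, and 4 mutations get retention indices of 1, 0.5, and 0.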
[Figure: Retention Index, simulated vs. empirical; axis range 0.45 to 0.75]
[Figure: Retention Index, simulated vs. empirical; axis range 0.75 to 0.85]
[Diagram: parameters α, β, γ, δ]
[Figure: Retention Index, simulated vs. empirical; axis range 0.68 to 0.74]
[Figure: three phylogenies of the same 30 Indo-European languages; panels: CTMC, CTMC+ABC, covarion]
Jennifer Hay and Laurie Bauer. Phoneme inventory size and population size. Language, 83(2):388–400, 2007.
Sebastian Höhna, Michael J. Landis, Tracy A. Heath, Bastien Boussau, Nicolas Lartillot, Brian R. Moore, John P. Huelsenbeck, and Frederik Ronquist. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4):726–736, 2016.
Paul O. Lewis. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913–925, 2001.
Steven Moran, Daniel McCloy, and Richard Wright. Revisiting the population vs phoneme-inventory correlation. Language, 88(4):877–893, 2012.
David Penny, Bennet J. McComish, Michael A. Charleston, and Michael D. Hendy. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. Journal of Molecular Evolution, 53(6):711–723, 2001.