SLIDE 1

Is the best model good enough? Assessing the absolute fit of phylogenetic models via posterior predictive sampling

Gerhard Jäger

Tübingen University

Workshop Computational and phylogenetic historical linguistics

ICHL24, Canberra, July 4, 2019

slide-2
SLIDE 2

Introduction

SLIDE 3

“What I cannot create, I do not understand” (Feynman)

SLIDE 4

Motivation

  • Bayesian model comparison (BF, WAIC, LOOIC, DIC, ...) compares models
  • tells us which model, out of a pre-defined collection, is best at explaining the data
  • does not tell us how plausible it is that the data were generated by a process akin to the one specified by the model
  • Posterior predictive sampling simulates possible data from the posterior distribution, after fitting the model to the observed data
  • tells us what the data might have been, provided the model accurately represents our state of knowledge

SLIDE 5

Workflow

model + data → posterior → posterior predictive simulation

SLIDE 6

Example: regression

SLIDE 7

Toy example

SLIDE 8

An example

Suppose you observe a roulette wheel 20 times, and it comes up with the following sequence of colors: BBBBBBBBBRRRRRRRRBBB How can you model the wheel’s behavior?

SLIDE 9

An example

  • 8 times red and 12 times black
  • maximum likelihood estimation: P(R) = 0.4
  • straightforward Bayesian analysis:
  • prior distribution over P(R): uniform
  • posterior distribution: Beta(9, 13)

[Figure: posterior probability density of P(R)]

Model assumes that the wheel has no memory! Should we believe this? BBBBBBBBBRRRRRRRRBBB
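The posterior follows from conjugacy: a uniform prior is Beta(1, 1), and observing 8 red and 12 black updates it to Beta(9, 13). A minimal sketch in Python (variable names are mine, not from the slides):

```python
from collections import Counter

# Observed sequence from the slide
seq = "BBBBBBBBBRRRRRRRRBBB"
counts = Counter(seq)
n_red, n_black = counts["R"], counts["B"]

# Maximum likelihood estimate of P(R)
p_ml = n_red / len(seq)

# Conjugate update: uniform prior = Beta(1, 1), binomial likelihood,
# hence posterior = Beta(1 + n_red, 1 + n_black) = Beta(9, 13)
alpha_post, beta_post = 1 + n_red, 1 + n_black
post_mean = alpha_post / (alpha_post + beta_post)

print(n_red, n_black, p_ml, (alpha_post, beta_post), round(post_mean, 3))
```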

SLIDE 10

An example

  • observed sequence contains only 2 changes of color between subsequent draws
  • how many such changes should we expect within 20 draws if
  • the wheel is memory-less,
  • our prior belief over P(R) is uniform, and
  • we have observed the sequence above?

SLIDE 11

An example

posterior predictive distribution

SLIDE 12

An example

  • posterior distribution of model parameters sampled via MCMC
  • posterior predictive distribution sampled by repeating the following two steps many times:
    1. draw a sample of parameter values from the posterior
    2. simulate mock data according to the generative model, using the parameters from the previous step
  • mock data can be used to sample from the posterior distribution of some summary statistic (such as the number of color changes)
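The two steps can be sketched for the roulette example; this is my reconstruction, not the author's code. Each iteration draws p from the Beta(9, 13) posterior, simulates 20 memoryless spins, and records the number of color changes:

```python
import random

random.seed(1)

def n_changes(seq):
    """Number of color changes between subsequent draws."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

def posterior_predictive(n_sims=10000, n_draws=20):
    """Step 1: draw p from the Beta(9, 13) posterior.
    Step 2: simulate mock data (n_draws memoryless spins) with that p
    and record the summary statistic."""
    sims = []
    for _ in range(n_sims):
        p_red = random.betavariate(9, 13)
        mock = ["R" if random.random() < p_red else "B" for _ in range(n_draws)]
        sims.append(n_changes(mock))
    return sims

sims = posterior_predictive()
observed = n_changes("BBBBBBBBBRRRRRRRRBBB")
# The observed 2 changes lies far in the left tail of the predictive distribution
tail_prob = sum(s <= observed for s in sims) / len(sims)
print(observed, sum(sims) / len(sims), tail_prob)
```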

SLIDE 13

An example

[Figure: posterior predictive distribution of the number of changes]

SLIDE 14

Better model

[Figure: two-state Markov model over red and black, with initial-state probabilities π1/π0 and transition rates α, β; posterior distribution of π1]
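One way to read the better model: the wheel is a two-state Markov chain whose next color depends on the current one. The sketch below simulates such a "sticky" wheel; the parameterization (switching probabilities for α and β, initial red probability π1) and the values used are my assumptions, not the slide's fitted estimates:

```python
import random

random.seed(2)

def simulate_sticky_wheel(n_draws, pi1, alpha, beta):
    """Two-state Markov chain over colors.
    pi1: probability that the first draw is red;
    alpha: P(switch) when the current color is red;
    beta:  P(switch) when the current color is black."""
    seq = ["R" if random.random() < pi1 else "B"]
    for _ in range(n_draws - 1):
        cur = seq[-1]
        p_switch = alpha if cur == "R" else beta
        seq.append(("B" if cur == "R" else "R") if random.random() < p_switch else cur)
    return "".join(seq)

def n_changes(seq):
    return sum(a != b for a, b in zip(seq, seq[1:]))

# With small switching probabilities the chain produces long runs of one color,
# matching the low number of changes in the observed sequence.
runs = [n_changes(simulate_sticky_wheel(20, 0.4, 0.1, 0.1)) for _ in range(5000)]
print(sum(runs) / len(runs))
```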

SLIDE 15

Better model

[Figure: posterior predictive distribution of the number of changes under the better model]

SLIDE 16

Sound inventories and population size

SLIDE 17

Sound inventories and population size

[Figure: number of phonemes (Phoible) vs. number of speakers]

SLIDE 18

Sound inventories and population size

[Figure: number of phonemes vs. population size, colored by macro-area (Africa, Eurasia, North America, Oceania, South America)]

SLIDE 19

Sound inventories and population size

  • Hay and Bauer (2007) (a.o.): positive correlation of population size with sound inventory size
  • Phoible data: correlation = 0.37
  • debunked by Moran et al. (2012)

SLIDE 20

Phylogenetically controlled regression

  • tree estimation above the level of established language families is unreliable
  • very high phylogenetic uncertainty
  • branches tend to be much too short
  • compromise used here:
  • infer trees for individual language families
  • connect them with a rake-shaped “proto-world” root with unknown depth
  • total tree depth τ is estimated from the data
  • isolates are connected directly to the root

[Figure: global tree with a rake-shaped root of depth τ; tips are individual languages]

SLIDE 21

[Figure: mean sound inventory size (in standard deviations), without phylogenetic control (DIC: 4132) vs. with phylogenetic control (DIC: 3475)]

SLIDE 22

Posterior predictive check: correlations under null models

[Figure: posterior predictive distributions of the correlation between log-population size and inventory size, under the naive and the phylogenetic null model]

SLIDE 23

Applying PPS to phylogenetic inference

SLIDE 24

Case study

  • data: IELex
  • 30 randomly sampled languages
  • binarized cognate classes

[Figure: the 30 sampled languages: Latvian, Russian, Ukrainian, Sorbian_Lower, Macedonian, Faroese, Swedish, Danish, Flemish, Afrikaans, Luxembourgish, Dolomite_Ladino, Romansh, French, Spanish, Portuguese, Romanian, Welsh, Armenian_Eastern, Bihari, Marathi, Gujarati, Marwari, Nepali, Hindi, Urdu, Persian, Tajik, Shughni, Pashto]

SLIDE 25

Summary statistics of interest: Retention Index

  • instead of comparing y with ŷ, I will compare the distribution of (y, θ) with that of (ŷ, θ)
  • if the model is credible, we expect strong overlap between the distributions
  • summary statistic to be used: Retention Index
  • most parsimonious reconstruction gives the minimal number of mutations, given a phylogeny

[Figure: six taxa with character states C C A A B B]

SLIDE 26

Summary statistics of interest: Retention Index

[Figure: a parsimony reconstruction of the tip states C C A A B B requiring 2 mutations]

SLIDE 27

Summary statistics of interest: Retention Index

[Figure: a parsimony reconstruction of the tip states C C A A B B requiring 3 mutations]

SLIDE 28

Summary statistics of interest: Retention Index

[Figure: a parsimony reconstruction of the tip states C C A A B B requiring 4 mutations]

SLIDE 29

Retention Index

  • minimal number of mutations: number of states − 1
  • maximal number of mutations: number of taxa − number of occurrences of the most frequent state
  • number of avoidable mutations: maximal number of mutations − minimal number of mutations
  • number of mutations avoided in T: maximal number of mutations − (minimal) number of mutations in T
  • Retention Index (RI) of a tree T:

    RI(T) = (number of mutations avoided in T) / (number of avoidable mutations)
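These definitions reduce to a few lines of arithmetic. A sketch (function name is mine), which reproduces the RI values of the three reconstructions shown on the following slides:

```python
def retention_index(n_taxa, state_counts, n_mutations_in_tree):
    """Retention Index from the definitions above.
    state_counts: number of occurrences of each character state."""
    min_mutations = len(state_counts) - 1          # number of states - 1
    max_mutations = n_taxa - max(state_counts)     # taxa - most frequent state
    avoidable = max_mutations - min_mutations
    avoided = max_mutations - n_mutations_in_tree
    return avoided / avoidable

# Character C C A A B B: 6 taxa, three states with 2 occurrences each,
# so min = 2 mutations, max = 6 - 2 = 4, avoidable = 2.
print(retention_index(6, [2, 2, 2], 2))  # -> 1.0
print(retention_index(6, [2, 2, 2], 3))  # -> 0.5
print(retention_index(6, [2, 2, 2], 4))  # -> 0.0
```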

SLIDE 30

Retention Index

RI = 1

[Figure: the 2-mutation reconstruction of C C A A B B]

SLIDE 31

Retention Index

RI = 1/2

[Figure: the 3-mutation reconstruction of C C A A B B]

SLIDE 32

Retention Index

RI = 0

[Figure: the 4-mutation reconstruction of C C A A B B]

SLIDE 33

Model 1: CTMC

[Figure: binary character with states 0 and 1; gain rate α, loss rate β]

  • Γ-distributed rates
  • relaxed clock
  • uniform tree prior
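A binary-character CTMC of this kind can be simulated with exponential waiting times. The sketch below evolves one character along a single branch; rates and branch length are illustrative values of mine, not fitted estimates:

```python
import random

random.seed(3)

def evolve_ctmc(state, t, alpha, beta):
    """Evolve a binary character for time t under a 2-state CTMC:
    0 -> 1 at rate alpha (gain), 1 -> 0 at rate beta (loss)."""
    while True:
        rate = alpha if state == 0 else beta
        t -= random.expovariate(rate)  # exponential waiting time to next event
        if t <= 0:
            return state
        state = 1 - state

# Over a long branch the chain forgets its starting state and approaches
# the stationary distribution P(1) = alpha / (alpha + beta).
alpha, beta = 1.0, 2.0
samples = [evolve_ctmc(0, 50.0, alpha, beta) for _ in range(20000)]
print(sum(samples) / len(samples))
```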

SLIDE 34

Model 1: CTMC

[Figure: simulated vs. empirical distribution of the Retention Index]

marginal log-density: −14391

SLIDE 35

Model 2: CTMC + ascertainment bias correction

[Figure: simulated vs. empirical distribution of the Retention Index]

marginal log-density: −12862

SLIDE 36

Model 3: Covarion + ascertainment bias correction

[Figure: covarion model: the binary character evolves with rates α, β while the process is switched on; switching between on and off occurs with rates γ, δ]
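In the covarion model (Penny et al., 2001) the character's evolutionary rate itself switches on and off: mutations with rates α, β happen only while the process is on, and γ, δ govern the on/off switching. A minimal joint simulation, with illustrative parameter values of my choosing:

```python
import random

random.seed(4)

def evolve_covarion(char, on, t, alpha, beta, gamma, delta):
    """Evolve one binary character for time t under a covarion process.
    The character mutates (0 -> 1 at rate alpha, 1 -> 0 at rate beta)
    only while the switch is on; the switch toggles off -> on at rate
    gamma and on -> off at rate delta."""
    while True:
        mut_rate = (alpha if char == 0 else beta) if on else 0.0
        switch_rate = delta if on else gamma
        total = mut_rate + switch_rate
        t -= random.expovariate(total)  # exponential waiting time
        if t <= 0:
            return char, on
        if random.random() < mut_rate / total:
            char = 1 - char  # character mutates
        else:
            on = not on      # evolutionary rate switches on/off

# Over a long branch the character's marginal distribution still approaches
# the base chain's stationary distribution P(1) = alpha / (alpha + beta).
samples = [evolve_covarion(0, True, 50.0, 1.0, 2.0, 0.5, 0.5) for _ in range(5000)]
frac_one = sum(char for char, _ in samples) / len(samples)
print(frac_one)
```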

SLIDE 37

Model 3: Covarion + ascertainment bias correction

[Figure: simulated vs. empirical distribution of the Retention Index]

marginal log-density: −12983

SLIDE 38

[Figure: posterior trees of the 30 languages under the three models: CTMC, CTMC+ABC, and covarion]

SLIDE 39

Summary

SLIDE 40
  • PPS is a useful addition to Bayes factors etc.
  • helps to understand qualitatively what a model is doing
  • future work: explicitly modeling unattested characters

SLIDE 41

References

Jennifer Hay and Laurie Bauer. Phoneme inventory size and population size. Language, 83(2):388–400, 2007.

Sebastian Höhna, Michael J. Landis, Tracy A. Heath, Bastien Boussau, Nicolas Lartillot, Brian R. Moore, John P. Huelsenbeck, and Frederik Ronquist. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4):726–736, 2016.

Paul O. Lewis. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913–925, 2001.

Steven Moran, Daniel McCloy, and Richard Wright. Revisiting the population vs phoneme-inventory correlation. Language, 88(4):877–893, 2012.

David Penny, Bennet J. McComish, Michael A. Charleston, and Michael D. Hendy. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. Journal of Molecular Evolution, 53(6):711–723, 2001.