Phylogenetic trees IV: Maximum Likelihood. Gerhard Jäger, ESSLLI 2016

SLIDE 1

Phylogenetic trees IV Maximum Likelihood

Gerhard Jäger ESSLLI 2016

Gerhard Jäger Maximum Likelihood ESSLLI 2016 1 / 50

SLIDE 2

Theory

SLIDE 3

Recap: Continuous time Markov model

$$P(t) = \begin{pmatrix} s + r e^{-t} & r - r e^{-t} \\ s - s e^{-t} & r + s e^{-t} \end{pmatrix}, \qquad \pi = (s, r)$$

[Figure: tree with branch lengths l1, ..., l8]
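A minimal Python sketch of the transition matrix, assuming s + r = 1 as in the recap. The function name `transition_matrix` and the example values are illustrative and do not come from the slides.

```python
import math

def transition_matrix(t, s, r):
    """Two-state transition probabilities P(t) with equilibrium pi = (s, r), s + r = 1."""
    e = math.exp(-t)
    return [[s + r * e, r - r * e],   # from state 0: stay in 0, move to 1
            [s - s * e, r + s * e]]   # from state 1: move to 0, stay in 1

# Each row is a probability distribution; for long branches it approaches (s, r).
for t in (0.0, 0.5, 50.0):
    print(t, transition_matrix(t, s=0.7, r=0.3))
```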

SLIDE 4

Likelihood of a tree

  • background reading: Ewens and Grant (2005), 15.7
  • simplifying assumption: evolution at different branches is independent
  • suppose we know probability distributions $v_t$ and $v_b$ over states at top and bottom of branch $l_k$
  • $L(l_k) = v_t^T P(l_k) v_b$

[Figure: tree with branch lengths l1, ..., l8]
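The branch likelihood $v_t^T P(l_k) v_b$ can be sketched in Python as below. This is an illustration under the two-state model of the recap; `branch_likelihood` and the example vectors are hypothetical names and values.

```python
import math

def transition_matrix(t, s, r):
    e = math.exp(-t)
    return [[s + r * e, r - r * e],
            [s - s * e, r + s * e]]

def branch_likelihood(v_top, v_bottom, length, s, r):
    """v_top^T P(length) v_bottom for vectors over the states (0, 1)."""
    P = transition_matrix(length, s, r)
    return sum(v_top[i] * P[i][j] * v_bottom[j]
               for i in range(2) for j in range(2))

# Top of the branch certainly in state 0, bottom certainly in state 1:
print(branch_likelihood([1.0, 0.0], [0.0, 1.0], length=0.5, s=0.7, r=0.3))
```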

SLIDE 5

Likelihood of a tree

  • likelihoods of states (0, 1) at root are the componentwise product $(P(l_1) v_1) \odot (P(l_2) v_2)$
  • log-likelihoods: $\log(P(l_1) v_1) + \log(P(l_2) v_2)$ (componentwise)
  • log-likelihood of larger tree: recursively apply this method from tips to root

[Figure: root with two daughter branches of lengths l1 and l2, leading to tips with state vectors v1 and v2]

SLIDE 6

(Log-)Likelihood of a tree

$$\log L(\text{tips below} \mid \text{mother} = s) = \sum_{d \in \text{daughters}} \log \sum_{s' \in \text{states}} \exp\bigl[\log P(s \to s' \mid \text{branch length}) + \log L(\text{tips below } d \mid d = s')\bigr]$$
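The recursion from tips to root can be sketched in Python as follows. The tree encoding (a tip is an observed state 0 or 1, an internal node is a list of `(child, branch_length)` pairs) and all function names are assumptions made for this illustration.

```python
import math

NEG_INF = float("-inf")

def log_p(i, j, t, s, r):
    """log P(t)[i][j] for the two-state model with equilibrium (s, r)."""
    e = math.exp(-t)
    P = [[s + r * e, r - r * e], [s - s * e, r + s * e]]
    return math.log(P[i][j]) if P[i][j] > 0 else NEG_INF

def logsumexp(xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_lik(node, state, s, r):
    """log L(tips below node | node = state)."""
    if isinstance(node, int):                  # tip with an observed state
        return 0.0 if node == state else NEG_INF
    total = 0.0
    for child, length in node:                 # sum over daughters
        total += logsumexp([log_p(state, s2, length, s, r)
                            + log_lik(child, s2, s, r)
                            for s2 in (0, 1)])  # sum over daughter states
    return total

tree = [([(0, 0.1), (1, 0.2)], 0.3), (1, 0.4)]
print([log_lik(tree, state, s=0.7, r=0.3) for state in (0, 1)])
```

Working in log space avoids numerical underflow on large trees; the inner sum over daughter states is evaluated with a log-sum-exp.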

SLIDE 7

(Log-)Likelihood of a tree

  • this is essentially identical to the Sankoff algorithm for parsimony:
      • $\mathrm{weight}(i, j) = \log P(l_k)_{ij}$
      • weight matrix depends on branch length → needs to be recomputed for each branch
  • overall likelihood for entire tree depends on probability distribution on root
  • if we assume that root node is in equilibrium: $L(\text{tree}) = (s, r)^T L(\text{root})$
  • does not depend on location of the root (→ time reversibility)
  • this is for one character: likelihood for all data is product of likelihoods for each character
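A small numerical check of the last two points, sketched for the simplest case of a root with two tips. The function names and the toy characters are hypothetical; the root is assumed to be in equilibrium.

```python
import math

def transition_matrix(t, s, r):
    e = math.exp(-t)
    return [[s + r * e, r - r * e],
            [s - s * e, r + s * e]]

def tree_likelihood(x, y, a, b, s, r):
    """Likelihood of tip states x and y at distances a and b below an equilibrium root."""
    pi = (s, r)
    Pa, Pb = transition_matrix(a, s, r), transition_matrix(b, s, r)
    return sum(pi[i] * Pa[i][x] * Pb[i][y] for i in (0, 1))

def data_log_likelihood(characters, a, b, s, r):
    """Characters are independent, so their log-likelihoods add up."""
    return sum(math.log(tree_likelihood(x, y, a, b, s, r)) for x, y in characters)

characters = [(0, 0), (0, 1), (1, 1)]
# Same total path length 1.0 between the tips, different root positions:
ll_a = data_log_likelihood(characters, 0.1, 0.9, s=0.7, r=0.3)
ll_b = data_log_likelihood(characters, 0.5, 0.5, s=0.7, r=0.3)
print(ll_a, ll_b)
```

Both calls give the same value up to rounding, which is the time-reversibility property stated on the slide.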

SLIDE 8

(Log-)Likelihood of a tree

  • likelihood of tree depends on
      • branch lengths
      • rates for each character
  • likelihood for tree topology:

$$L(\text{topology}) = \max_{l_k :\; k \text{ is a branch}} L(\text{tree} \mid \vec{l_k})$$
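To illustrate maximizing over branch lengths, here is a sketch for the smallest possible case: two tips joined by a single path of unknown length. It uses a crude grid search, not the numerical optimizer a real ML program would use; all names and the toy data are assumptions.

```python
import math

def log_likelihood(characters, total_length, s, r):
    """Log-likelihood of a two-tip tree; characters are (state at tip 1, state at tip 2)."""
    e = math.exp(-total_length)
    P = [[s + r * e, r - r * e], [s - s * e, r + s * e]]
    pi = (s, r)
    return sum(math.log(pi[x] * P[x][y]) for x, y in characters)

def best_branch_length(characters, s, r, step=0.001, max_length=5.0):
    """Grid search for the path length that maximizes the log-likelihood."""
    grid = [k * step for k in range(1, int(max_length / step) + 1)]
    return max(grid, key=lambda t: log_likelihood(characters, t, s, r))

# 8 characters agree between the two tips, 2 disagree.
characters = [(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1), (1, 0)]
t_hat = best_branch_length(characters, s=0.5, r=0.5)
print(t_hat)
```

For s = r = 0.5 the optimum has the closed form -log((n_same - n_diff) / n), which the grid search reproduces to within the grid step.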

SLIDE 9

(Log-)Likelihood of a tree

Where do we get the rates from? Different options, in increasing order of complexity:

  1. s = r = 0.5 for all characters
  2. r = empirical relative frequency of state 1 in the data (identical for all characters)
  3. a certain proportion pinv (value to be estimated) of characters are invariant
  4. rates are gamma distributed

SLIDE 10

Gamma-distributed rates

  • we want to allow rates to vary, but not too much
  • common method (no real justification except for mathematical convenience)
      • equilibrium distribution is identical for all characters
      • rate matrix is multiplied with coefficient $\lambda_i$ for character i
      • $\lambda_i$ is a random variable drawn from a Gamma distribution

$$L(\lambda_i = x) = \frac{\beta^{\beta} x^{\beta - 1} e^{-\beta x}}{\Gamma(\beta)}$$
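The density can be written down directly with the Python standard library. In this sketch, `gamma_rate_density` is an illustrative name; shape and rate are both β, so the mean rate is 1.

```python
import math

def gamma_rate_density(x, beta):
    """Gamma density with shape beta and rate beta at x > 0 (mean 1)."""
    return beta ** beta * x ** (beta - 1) * math.exp(-beta * x) / math.gamma(beta)

# Small beta: rates vary a lot; large beta: rates concentrate around 1.
for beta in (0.5, 2.0, 20.0):
    print(beta, [round(gamma_rate_density(x, beta), 4) for x in (0.5, 1.0, 2.0)])
```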

SLIDE 11

Gamma-distributed rates

  • overall likelihood of tree topology: integrate over all $\lambda_i$, weighted by Gamma likelihood
  • computationally impractical
  • in practice: split Gamma distribution into n discrete bins (usually n = 4) and approximate integration via Hidden Markov Model
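A rough sketch of the discretization idea. It differs from the slide in two labelled ways: the bin representatives are estimated by Monte Carlo sampling (the standard library has no Gamma quantile function), and the categories are combined by a plain equal-weight average, not by a Hidden Markov Model. All names are hypothetical.

```python
import math
import random

def gamma_category_rates(beta, n_categories=4, n_samples=100_000, seed=0):
    """Mean rate within each of n equal-probability bins of Gamma(shape=beta, scale=1/beta)."""
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(beta, 1.0 / beta) for _ in range(n_samples))
    size = n_samples // n_categories
    return [sum(draws[k * size:(k + 1) * size]) / size for k in range(n_categories)]

def site_likelihood(lik_given_rate, rates):
    """Approximate the integral over the rate by averaging over the rate categories."""
    return sum(lik_given_rate(lam) for lam in rates) / len(rates)

def change_probability(lam, t=0.5, s=0.5, r=0.5):
    """P(0 -> 1) on a branch of length t; scaling the rate matrix by lam scales time."""
    return r - r * math.exp(-lam * t)

rates = gamma_category_rates(beta=0.5)
print(rates)
print(site_likelihood(change_probability, rates))
```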

SLIDE 12

Modeling decisions to make

aspect of model             possible choices     number of parameters to estimate
branch lengths              unconstrained        2n − 3 (n is number of taxa)
                            ultrametric          n − 1
equilibrium probabilities   uniform              0
                            empirical            1
                            ML estimate          1
rate variation              none                 0
                            Gamma distributed    1
invariant characters        none                 0
                            pinv                 1

This could be continued: you can build in rate variation across branches, you can fit the number of Gamma categories . . .

SLIDE 13

Model selection

  • tradeoff
      • rich models are better at detecting patterns in the data, but are prone to over-fitting
      • parsimonious models are less vulnerable to over-fitting but may miss important information
  • standard issue in statistical inference
  • one possible heuristic: Akaike Information Criterion (AIC)

$$\mathrm{AIC} = -2 \times \text{log-likelihood} + 2 \times \text{number of free parameters}$$

  • the model minimizing AIC is to be preferred
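As a sketch, AIC and the parameter counts from the table of modeling decisions (slide 12) can be combined as below. The function names, argument names and the log-likelihood value in the example are made up for illustration.

```python
def n_free_parameters(n_taxa, ultrametric, eq_probs, gamma_rates, invariant_sites):
    """Parameter count following the table of modeling decisions."""
    k = (n_taxa - 1) if ultrametric else (2 * n_taxa - 3)   # branch lengths
    k += 0 if eq_probs == "uniform" else 1                  # "empirical" or "ML": 1 each
    k += 1 if gamma_rates else 0                            # Gamma shape parameter
    k += 1 if invariant_sites else 0                        # pinv
    return k

def aic(log_likelihood, k):
    return -2.0 * log_likelihood + 2.0 * k

# Hypothetical log-likelihood for the full model on 25 taxa:
k_full = n_free_parameters(25, ultrametric=False, eq_probs="ML",
                           gamma_rates=True, invariant_sites=True)
print(k_full, aic(-7950.0, k_full))
```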

SLIDE 14

Example: Model selection for cognacy data / UPGMA tree

model no.  branch lengths  eq. probs.  rate variation  inv. char.  AIC
 1         ultrametric     uniform     none            none        17515.95
 2         ultrametric     uniform     none            pinv        17518.39
 3         ultrametric     uniform     Gamma           none        17517.89
 4         ultrametric     uniform     Gamma           pinv        17519.75
 5         ultrametric     empirical   none            none        16114.66
 6         ultrametric     empirical   none            pinv        16056.85
 7         ultrametric     empirical   Gamma           none        15997.16
 8         ultrametric     empirical   Gamma           pinv        16022.21
 9         ultrametric     ML          none            none        16034.96
10         ultrametric     ML          none            pinv        16058.83
11         ultrametric     ML          Gamma           none        15981.94
12         ultrametric     ML          Gamma           pinv        16009.90
13         unconstrained   uniform     none            none        17492.73
14         unconstrained   uniform     none            pinv        17494.73
15         unconstrained   uniform     Gamma           none        17494.73
16         unconstrained   uniform     Gamma           pinv        17496.73
17         unconstrained   empirical   none            none        16106.52
18         unconstrained   empirical   none            pinv        16049.28
19         unconstrained   empirical   Gamma           none        16033.21
20         unconstrained   empirical   Gamma           pinv        16011.38
21         unconstrained   ML          none            none        16102.04
22         unconstrained   ML          none            pinv        16051.27
23         unconstrained   ML          Gamma           none        16025.99
24         unconstrained   ML          Gamma           pinv        16001.00
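Picking the preferred model from the table is a one-line minimization. The sketch below copies the AIC values from the slide; the variable names are illustrative.

```python
aic_by_model = {
    1: 17515.95, 2: 17518.39, 3: 17517.89, 4: 17519.75,
    5: 16114.66, 6: 16056.85, 7: 15997.16, 8: 16022.21,
    9: 16034.96, 10: 16058.83, 11: 15981.94, 12: 16009.90,
    13: 17492.73, 14: 17494.73, 15: 17494.73, 16: 17496.73,
    17: 16106.52, 18: 16049.28, 19: 16033.21, 20: 16011.38,
    21: 16102.04, 22: 16051.27, 23: 16025.99, 24: 16001.00,
}

best = min(aic_by_model, key=aic_by_model.get)
print(best)  # → 11 (ultrametric, ML equilibrium probabilities, Gamma rates, no pinv)

# AIC differences to the best model, three closest competitors:
deltas = sorted((value - aic_by_model[best], model)
                for model, value in aic_by_model.items() if model != best)
print(deltas[:3])
```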

SLIDE 15

Tree search

  • ML computation gives us likelihood of a tree topology, given data and a model
  • ML tree:
      • heuristic search to find the topology maximizing likelihood
      • optimize branch lengths to maximize likelihood for that topology
  • computationally very demanding!
  • for the 25 taxa in our running example, ML tree search for the full model requires several hours on a single processor; parallelization helps
  • ideally, one would want to do 24 heuristic tree searches, one for each model specification, and pick the tree+model with lowest AIC
  • in practice one has to make compromises
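One reason the search has to be heuristic is the size of the search space. The sketch below uses the standard count of unrooted binary topologies on n labelled tips, (2n − 5)!!; this formula is background knowledge, not taken from the slides, and the function name is illustrative.

```python
def n_unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies on n_taxa >= 3 labelled tips: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in (4, 5, 10, 25):
    print(n, n_unrooted_topologies(n))
```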

SLIDE 16

Running example

SLIDE 17

Running example: cognacy data

  • unconstrained branch lengths: AIC = 7929
  • ultrametric: AIC = 7972

[Figures: the two resulting trees over the 25 languages of the running example]

SLIDE 18

Running example: WALS data

  • unconstrained branch lengths: AIC = 2752
  • ultrametric: AIC = 2828

[Figures: the two resulting trees over the 25 languages of the running example]

SLIDE 19

Running example: phonetic data

  • unconstrained branch lengths: AIC = 89871
  • ultrametric: AIC = 90575

[Figures: the two resulting trees over the 25 languages of the running example]

SLIDE 20

Wrapping up

  • ML is conceptually superior to MP (let alone distance methods)
      • different mutation rates for different characters are inferred from the data
      • possibility of multiple mutations is taken into account, depending on branch lengths
      • side effect of likelihood computation: probability distribution over character states at each internal node can be read off
  • disadvantages:
      • computationally demanding
      • many parameter settings make model selection difficult (note that the ultrametric trees in our example are sometimes better even though they have higher AIC)
      • ultrametric constraint makes branch length optimization computationally more expensive ⇒ not feasible for larger data sets (more than 100–200 taxa)
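For the root node, reading off the distribution over character states amounts to weighting the root likelihoods with the equilibrium probabilities and normalizing. A minimal sketch, with hypothetical names and made-up likelihood values; other internal nodes would need an additional pass over the tree that is not shown here.

```python
def root_state_posterior(root_likelihoods, s, r):
    """P(root = i | data), proportional to pi_i * L(data | root = i), for states (0, 1)."""
    pi = (s, r)
    joint = [pi[i] * root_likelihoods[i] for i in (0, 1)]
    total = sum(joint)
    return [j / total for j in joint]

# Made-up likelihoods of the tip data given root state 0 and 1:
posterior = root_state_posterior([0.02, 0.06], s=0.5, r=0.5)
print(posterior)
```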

SLIDE 21

Cleaning up from yesterday

SLIDE 22

Using all data and the most sophisticated model...

  • using both cognacy characters and phonetic characters
  • Bayesian phylogenetic inference (related to Maximum Likelihood, but quite a bit more complex)
  • 10 Gamma categories
  • relaxed molecular clock ⇒ rates are allowed to vary between branches, but only to a limited degree

SLIDE 23

Using all data and the most sophisticated model...

[Figure: tree over the 25 languages with support values at the internal nodes, ranging from 0.28 to 1]

SLIDE 24

Using all data and the most sophisticated model...

[Figure: resulting tree; tips in order: Welsh, Breton, Irish, Spanish, Portuguese, Catalan, French, Italian, Romanian, Greek, Bengali, Nepali, Hindi, Czech, Polish, Russian, Ukrainian, Bulgarian, Lithuanian, Dutch, German, English, Icelandic, Swedish, Danish]

SLIDE 25

Application: Ancestral State Reconstruction

SLIDE 26

joint work with Johann-Mattis List

SLIDE 27

What is Ancestral State Reconstruction?

  • While tree-building methods seek to find branching diagrams which explain how a language family has evolved, ASR methods use the branching diagrams in order to explain what has evolved concretely.
  • Ancestral state reconstruction is very common in evolutionary biology but only sporadically practiced in computational historical linguistics (Bouchard-Côté et al., 2013).
  • In classical historical linguistics, on the other hand, linguistic reconstruction of proto-forms and proto-meanings is very common and one of the main goals of the classical comparative method (Fox 1995).

SLIDE 28

ASR of Lexical Replacement Patterns

  • If we look for words corresponding to one meaning in a wordlist and know which of the words are cognate or not, we may ask which of the word forms was the most likely candidate to be used in the proto-language of all descendant languages.
  • This question resembles the task of “semantic reconstruction”, but in contrast to classical semantic reconstruction, we are only operating within one concept slot here, disregarding all words with a different meaning which may also be cognate with the words in our sample.
  • As a result of this restriction, it is quite likely that we cannot recover the original form from our data.
  • It is, however, very interesting to see to which degree we can propose a good candidate word form (cognate set) for the proto-language.

SLIDE 29

Data

SLIDE 30

Data

IELex
  • 153 Indo-European doculects
  • 207 concepts
  • entries for Proto-Indo-European for 135 concepts → used as gold standard
  • arbitrarily split into training set and test set:
      • training set: 67 concepts, 1127 cognate classes (83 occur in PIE)
      • test set: 68 concepts, 957 cognate classes (79 from PIE)

ABVD
  • 743 Austronesian doculects → 100 were selected at random
  • 210 concepts; for 154 of them entries for Proto-Austronesian
  • split into training set and test set:
      • training set: 81 concepts, 1695 cognate classes (88 occur in PAn)
      • test set: 74 concepts, 1584 cognate classes (79 occur in PAn)

SLIDE 31

Prerequisites: Trees

  • trees were inferred with full data set (training + test data) via Bayesian inference
      • IELex outgroup: Anatolian
      • ABVD outgroup: Malayo-Polynesian
  • random samples of 1000 trees from posterior distributions
  • maximum clade credibility trees

[Figures: maximum clade credibility trees over the IELex doculects and over the selected ABVD doculects]

SLIDE 32

Phylogenetic uncertainty

  • proper way to deal with it: work with posterior sample rather than with a single tree
  • poor man’s method:
      • remove all short branches (shorter than some threshold)
      • do ASR with resulting multifurcating tree

[Figure: multifurcating tree over the IELex doculects after removal of short branches]
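The poor man’s method can be sketched as a recursive tree transformation. The node encoding `(name, branch_length_to_parent, children)` and the choice to add a removed branch’s length to its daughters (so that root-to-tip distances stay the same) are assumptions of this illustration, not details given on the slide.

```python
def collapse_short_branches(node, threshold):
    """Return a copy of the tree in which internal branches shorter than threshold are removed."""
    name, length, children = node
    new_children = []
    for child in children:
        c_name, c_length, c_children = collapse_short_branches(child, threshold)
        if c_children and c_length < threshold:
            # Short internal branch: attach its daughters directly to the current node.
            new_children.extend((g_name, g_length + c_length, g_children)
                                for g_name, g_length, g_children in c_children)
        else:
            new_children.append((c_name, c_length, c_children))
    return (name, length, new_children)

tree = (None, 0.0, [
    ("A", 1.0, []),
    (None, 0.01, [("B", 1.0, []), ("C", 1.0, [])]),   # weakly supported split
])
collapsed = collapse_short_branches(tree, threshold=0.05)
print([child[0] for child in collapsed[2]])
```

Branches leading to tips are never removed, so the set of languages stays the same; only the resolution of the tree decreases.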

SLIDE 33

Summary on Indo-European ASR

Error type               GS     ASR    Number
Missing forms            A      ∅      7
Different forms          A      B      9
Additional forms in ASR  A      A, B   5
Missing root in ASR      A, B   A      4
Summary                                25

SLIDE 34

Evaluating the Differences

We evaluate the differences qualitatively by checking

  • the reflection of the proposed root in the branches, especially with semantically shifted word forms,
  • the likelihood of semantic shift of the given root with help of the Database of Cross-Linguistic Colexifications (CLICS, List et al. 2013 and 2014),
  • thoroughly whether the cognate sets in the data are really reflexes of the proposed PIE root.

Based on this check, we distinguish four grades of root quality: erroneous, problematic, possible, good.

SLIDE 35

Indo-European ASR: Missing forms

Concept, Form (meaning in reflexes): Comment

  • SEE, *derḱ- (to see): Only reflected in Indo-Iranian, cognates also problematic.
  • SEE, *weid- (to see or to know): Safe root for Indo-European.
  • SING, *kan- (to sing or the rooster): Root is proposed for PIE on the basis of Germanic reflexes meaning “rooster”, which is a highly unlikely semantic change.
  • SMELL, *h₃ed- (to smell): Potential root for PIE, but only reflected in Greek and Romance.
  • SMALL, *mei- (small): Wrong cognate judgments in the database, since neither Russian malenkij nor English small go back to this root.
  • THINK, *teng- (to think or to feel): Root only reflected in Germanic languages, with sporadic reflexes in semantically shifted form in other branches. A better candidate for PIE would be *men- “the mind or to think”.
  • WASH, *leh₂w- (to wash or to pour): Wrong cognate assignment in the source since Romance and Albanian reflexes are not annotated.
  • WASH, *neigʷ- (to wash or water monster): Very unlikely cognate assignment, due to the extreme shift from “to wash” to “water monster” (cf. English nix) in the Germanic languages.
  • WET, *wed- (water or wet): Semantic change from “water” to “wet” is likely according to CLICS, but it is not clear why this should have already happened in PIE times.

Legend: erroneous, problematic, possible, good

SLIDE 37

Indo-European ASR: Different Forms

Concept, GS form, ASR form: Comment

  • RIVER, *h₂ekʷeh₂, *h₂ep-: Form in GS meant “water” in PIE. Although a shift from “water” to “river” is likely according to CLICS, this meaning is an innovation in Germanic. The ASR form is reflected across multiple branches and a much better candidate.
  • RUB, *melh₁-, *terh₁-: Form in GS is not reflected in the standard literature (LIV and LIN), form in ASR is reflected in the meaning “to rub, to bore”.
  • SCRATCH, *gerbʰ-, *kes-: Form in GS is only reflected in few Germanic languages, probably with a wrong cognate assignment. Following Derksen (2008), the ASR form is a much better candidate for the PIE word for “scratch”.
  • SKIN, *pel, *(s)kewH-: Form in GS is a good PIE root, but not necessarily with the meaning “skin”, as the meaning of the reflexes differs greatly. The ASR form derives from a PIE verb meaning “to cover”, but the cognate set should not contain Slavic words (Derksen 2008).
  • WALK, *ǵʰeh₁, *h₁ei-: The GS form is only reflected in Germanic. The ASR form is a clear PIE root, but the meaning may also have been “to go”.
  • WATER, *h₂ekʷeh₂, *wódr̥: The ASR form is a much better candidate for “water” in PIE, due to its high number of reflexes in all branches.
  • WHITE, *h₂elbʰós, *h₂erǵó-: The GS form is only reflected in Romance in this meaning and as meaning “cloud” in Hittite. The ASR form is a much better candidate, with a much more plausible connection between reflexes meaning “shine” and “white”, as also confirmed by CLICS.
  • WORM, *wr̥mi-, *kʷr̥mis: The ASR form is reflected in more different branches of PIE, while the GS form is only reflected in Germanic and Romance.

Legend: erroneous, problematic, possible, good

SLIDE 39

Indo-European ASR: Additional Forms

Concept, form in ASR: Comment

  • MOON, *lewk-s-nh₂: This form would go back to a PIE root meaning “to shine” and is often said to have independently turned to mean “moon” in Romance and Slavic and other branches. The shift from “shine” to “moon” is however not very likely (no evidence in CLICS), so it is also possible that the word meant already “moon” in PIE as an epithet (Vaan 2008).
  • SNOW, *ǵʰéi-mn̥-: The form has probably independently shifted from the original meaning “frost, cold”, which is a very likely shift according to CLICS.
  • SUCK, *suḱ-: The root is present in this meaning in many subbranches and a good candidate for PIE in this meaning.
  • THIS, *so / *to: The root is a clear PIE demonstrative (Meier-Brügger 2010), but the reflexes in the daughter languages vary greatly, due to analogical levelling.
  • WITH, *sm̥: A very good candidate for the meaning with reflexes in Greek, Indo-Iranian and Slavic.

Legend: erroneous, problematic, possible, good

SLIDE 41

Indo-European ASR: Missing Forms in ASR

Concept, form in GS: Comment

  • NOT, *meh₁: This form is reflected in Old Greek as a prohibitive negation and also reconstructed as such. Whether it was the normal negation in PIE is less clear.
  • SLEEP, *drem: This form is mainly reflected in Latin and sporadically in Indian and Greek. It is much more likely that it meant something else in PIE and then shifted into this meaning.
  • VOMIT, *h₁rewg-: No need to reconstruct this form back to PIE, since it is only reflected in two languages of Romance.
  • YEAR, *ieHr-: This form has only reflexes in Germanic languages. Generally, the meaning “year” is difficult to reconstruct, due to the high potential for shift from “summer”, “winter”, “time”, etc. as shown in CLICS.

Legend: erroneous, problematic, possible, good

SLIDE 43

Evaluation against our manually created gold standard

  • precision: 0.986 (1 false positive)
  • recall: 0.895 (8 false negatives)
  • F-score: 0.938 ¹

¹ The IELex PIE entries have an F-score of 0.854.
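The three figures are linked by the usual definitions. In the sketch below, the number of true positives (68) is not stated on the slide; it is inferred here as the count that reproduces the reported precision and recall given 1 false positive and 8 false negatives.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean (F-score) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# tp=68 is an inferred assumption; fp and fn are taken from the slide.
p, r, f = precision_recall_f(tp=68, fp=1, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.986 0.895 0.938
```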

SLIDE 44

False positive

[Figure: tree over the IELex doculects for cognate class snow:D]

slide-45
SLIDE 45

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for river:O]


slide-46
SLIDE 46

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for smell:W]


slide-47
SLIDE 47

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for wet:I]


slide-48
SLIDE 48

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for skin:B]


slide-49
SLIDE 49

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for sleep:E]


slide-50
SLIDE 50

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for white:E]


slide-51
SLIDE 51

Application: Ancestral State Reconstruction

False negatives

[Figure: Indo-European language tree for worm:B]


slide-52
SLIDE 52

Application: Ancestral State Reconstruction

Summary on Indo-European

As the qualitative evaluation shows, the proto-forms that our best ASR method reconstructs back to PIE are mostly as good as, if not better than, the candidates we found in the gold standard. Given the general and well-known uncertainties of semantic reconstruction in classical historical linguistics, it seems that ASR methods could provide actual help in semantic reconstruction by supplying objective scenarios for word evolution along a given tree that follow a specific evolutionary model.


slide-53
SLIDE 53

Application: Ancestral State Reconstruction

Hands-on

How to run Maximum-Likelihood tree estimation in Paup*

Load your nexus file into Paup*:
  > paup4 soundConcept.bin.nex
set optimality criterion to likelihood:
  paup> set criterion=likelihood
choose model:
  optimized rate parameter:
    paup> lset basefreq=estimate
  ultrametric tree:
    paup> lset clock=yes
  gamma-distributed rates:
    paup> lset rates=gamma shape=estimate
  assume invariant sites:
    paup> lset pinvar=estimate


slide-54
SLIDE 54

Application: Ancestral State Reconstruction

Hands-on

How to run Maximum-Likelihood tree estimation in Paup* (cont.)

perform heuristic search:
  paup> hsearch
display tree:
  paup> describetree /plot=phylo
show log-likelihood and AIC:
  paup> lscores /aic=yes
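The AIC requested in the last step follows the standard definition AIC = 2k − 2 ln L, where k is the number of free model parameters and ln L is the maximized log-likelihood; lower values are better. The following is a minimal sketch of how two model variants could be compared by hand. The function name and all numbers are hypothetical illustrations, not Paup* output.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# HYPOTHETICAL log-likelihoods and parameter counts for two models
# fitted to the same data (illustration only).
simple = aic(log_likelihood=-1050.0, n_params=3)
richer = aic(log_likelihood=-1040.0, n_params=5)

print(simple)  # 2106.0
print(richer)  # 2090.0
print("prefer richer model" if richer < simple else "prefer simple model")
```

In this made-up case the two extra parameters improve the log-likelihood by more than they are penalized, so the richer model has the lower AIC.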


slide-55
SLIDE 55

Application: Ancestral State Reconstruction

Bouchard-Côté, A., D. Hall, T. L. Griffiths, and D. Klein (2013). Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11):4224–4229.

Ewens, W. and G. Grant (2005). Statistical Methods in Bioinformatics: An Introduction. Springer, New York.
