[PPT] - LexStat: Automatic Detection of Cognates in Multilingual Wordlists PowerPoint Presentation

SLIDE 1

. . Keys to the Past . . . . . . Identification of Cognates . . . . . . . LexStat . . . . . Evaluation

LexStat: Automatic Detection of Cognates in Multilingual Wordlists

Johann-Mattis List∗

∗Institute for Romance Languages and Literature

Heinrich Heine University Düsseldorf

April 24, 2012

1 / 28

SLIDE 2

Structure of the Talk

1 Keys to the Past
2 Identification of Cognates
3 LexStat
4 Evaluation

2 / 28

SLIDE 3

Keys to the Past

3 / 28

SLIDE 5

Charles Lyell on Languages

The Geological Evidences of the Antiquity of Man, with Remarks on Theories of the Origin of Species by Variation. By Sir Charles Lyell. London: John Murray, Albemarle Street, 1863.

4 / 28

SLIDE 6

Charles Lyell on Languages

“If we knew nothing of the existence of Latin, - if all historical documents previous to the fifteenth century had been lost, - if tradition even was silent as to the former existence of a Roman empire, a mere comparison of the Italian, Spanish, Portuguese, French, Wallachian, and Rhaetian dialects would enable us to say that at some time there must have been a language, from which these six modern dialects derive their origin in common.”

4 / 28

SLIDE 7

Uniformitarianism and Abduction

Uniformitarianism
“Universality of Change” – Change is independent of time and space
“Graduality of Change” – Change is neither abrupt nor chaotic
“Uniformity of Change” – Change is not heterogeneous

Abduction
Present Events or Patterns + Known Laws => Abduction of Historical Facts
Similarities Between Languages + Language Change => Inference of Proto-Languages

5 / 28

SLIDE 15

[Figure: hjärta, herz, heart, cordis]

Identification of Cognates

6 / 28

SLIDE 16

The Comparative Method

Basic Procedure
1. Compile an initial list of putative cognate sets.
2. Extract an initial list of putative sets of sound correspondences from the initial cognate list.
3. Refine the cognate list and the correspondence list by
   a. adding and deleting cognate sets from the cognate list, depending on whether they are consistent with the correspondence list or not, and
   b. adding and deleting correspondence sets from the correspondence list, depending on whether they are consistent with the cognate list or not.
4. Finish when the results are satisfactory.

7 / 28

SLIDE 27

The Comparative Method

Language-Specific Similarity Measure
Sequence similarity is determined on the basis of systematic sound correspondences as opposed to similarity based on surface resemblances of phonetic segments. Lass (1997) calls this notion of similarity genotypic as opposed to a phenotypic notion of similarity.

The most crucial aspect of correspondence-based similarity is that it is language-specific: Genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared.

Meaning     German           Dutch          English
“tooth”     Zahn [ʦaːn]      tand [tɑnt]    tooth [tʊːθ]
“ten”       zehn [ʦeːn]      tien [tiːn]    ten [tɛn]
“tongue”    Zunge [ʦʊŋə]     tong [tɔŋ]     tongue [tʌŋ]
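The idea of a language-specific correspondence can be made concrete by counting which word-initial sounds co-occur in the table above. The following is only a minimal sketch: the segmentation of the forms and the function name `initial_correspondences` are assumptions made for illustration, and three words are of course far too few to establish a regular correspondence.

```python
from collections import Counter

# Forms from the table above, segmented by hand (the segmentation is an assumption).
WORDS = {
    "tooth": {"German": ["ʦ", "aː", "n"], "Dutch": ["t", "ɑ", "n", "t"], "English": ["t", "ʊː", "θ"]},
    "ten": {"German": ["ʦ", "eː", "n"], "Dutch": ["t", "iː", "n"], "English": ["t", "ɛ", "n"]},
    "tongue": {"German": ["ʦ", "ʊ", "ŋ", "ə"], "Dutch": ["t", "ɔ", "ŋ"], "English": ["t", "ʌ", "ŋ"]},
}


def initial_correspondences(words, lang_a, lang_b):
    """Count how often each pair of word-initial segments co-occurs."""
    return Counter(
        (forms[lang_a][0], forms[lang_b][0]) for forms in words.values()
    )


print(initial_correspondences(WORDS, "German", "English"))
print(initial_correspondences(WORDS, "Dutch", "English"))
```

In this toy data German [ʦ] pairs with English [t] in every word, so the two sounds count as similar for this language pair because the pairing recurs, not because they resemble each other on the surface.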

8 / 28

SLIDE 28

The Comparative Method

Language-Specific Similarity Measure

Meaning      Shanghai          Beijing          Guangzhou
“nine”       [ʨiɤ³⁵]           [ʨiou²¹⁴]        [kɐu³⁵]
“today”      [ʨiŋ⁵⁵ʦɔ²¹]       [ʨiɚ⁵⁵]          [kɐm⁵³jɐt²]
“rooster”    [koŋ⁵⁵ʨi²¹]       [kuŋ⁵⁵ʨi⁵⁵]      [kɐi⁵⁵koŋ⁵⁵]

8 / 28

SLIDE 35

Automatic Approaches

Alignment Analyses
In alignment analyses, sequences are arranged in a matrix in such a way that corresponding elements occur in the same column, while empty cells resulting from non-corresponding elements are filled with gap symbols.

t ɔ x t ə r
d ɔː - t ə r

Cognate identification is usually based on a similarity or distance score (e.g., edit distance) calculated from the number of matches and mismatches in the alignment.
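As an illustration of such a distance score, the edit distance between the two example sequences can be computed with the standard dynamic-programming recurrence. This is a minimal sketch, not the scoring used by any particular tool: the unit costs, the normalization by the longer sequence, and the names `edit_distance` and `normalized_distance` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two lists of segments (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # segment of a aligned with a gap
                curr[j - 1] + 1,         # segment of b aligned with a gap
                prev[j - 1] + (x != y),  # match (cost 0) or mismatch (cost 1)
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence (assumed normalization)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


word_a = ["t", "ɔ", "x", "t", "ə", "r"]
word_b = ["d", "ɔː", "t", "ə", "r"]
print(edit_distance(word_a, word_b), normalized_distance(word_a, word_b))
```

Because every non-identical pair of segments costs the same, [t] versus [d] is penalized exactly as much as [t] versus any other sound, whichever languages are being compared.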

9 / 28

SLIDE 40

Automatic Approaches

Sound Classes
Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986: 35).

[Figure: the sounds k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z grouped into the sound classes K, T, P, S]

Cognate identification is usually based on comparing the first two consonants of two words: If they match regarding their sound classes, the words are judged to be cognate, otherwise not.
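The first-two-consonants rule can be sketched as follows. The slide only shows the sixteen sounds and the four class labels, so the assignment of sounds to classes below is an assumption made for illustration, as are the extra classes for [x], [n], [ŋ], and [r] and the function names.

```python
# Assumed class assignments; the slide does not spell out which sound goes where.
SOUND_CLASSES = {
    "k": "K", "g": "K", "ʧ": "K", "ʤ": "K",
    "t": "T", "d": "T", "θ": "T", "ð": "T",
    "p": "P", "b": "P", "f": "P", "v": "P",
    "s": "S", "z": "S", "ʃ": "S", "ʒ": "S",
    # Extra entries needed for the examples below (also assumptions).
    "x": "K", "n": "N", "ŋ": "N", "r": "R",
}


def first_two_classes(segments):
    """Classes of the first two consonants; segments without an entry (here: vowels) are skipped."""
    classes = [SOUND_CLASSES[s] for s in segments if s in SOUND_CLASSES]
    return classes[:2]


def judged_cognate(a, b):
    """True if both words have two consonants and their classes match."""
    ca, cb = first_two_classes(a), first_two_classes(b)
    return len(ca) == 2 and ca == cb


print(judged_cognate(["t", "ɛ", "n"], ["t", "iː", "n"]))
print(judged_cognate(["t", "ɔ", "x", "t", "ə", "r"], ["d", "ɔː", "t", "ə", "r"]))
```

Under these assumed classes the first pair matches (T N against T N), while the second does not (T K against T T), which shows how sensitive the rule is to a single consonant that has no counterpart in the other word.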

10 / 28

SLIDE 46

Automatic Approaches

Sound-Class-Based Alignment (SCA)
Sound classes and alignment analyses can be easily combined by representing phonetic sequences internally as sound classes and comparing the sound classes with traditional alignment algorithms.

INPUT
tɔxtər
dɔːtər

TOKENIZATION
t, ɔ, x, t, ə, r
d, ɔː, t, ə, r

CONVERSION
t ɔ x … → T O G …
d ɔː t … → T O T …

ALIGNMENT
T O G T E R
T O - T E R

CONVERSION
T O G … → t ɔ x …
T O - … → d ɔː - …

OUTPUT
t ɔ x t ə r
d ɔː - t ə r

Cognate identification may be based on a certain threshold and distance scores derived from the similarity scores yielded by the alignment algorithm.
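The pipeline above (tokenize, convert to sound classes, align, convert back) can be sketched in a few lines. This is a simplified illustration and not the SCA implementation itself: the class symbols are taken from the slide, but the Needleman-Wunsch scores (match 1, mismatch -1, gap -1), the distance formula, and all function names are assumptions.

```python
# Sound-class symbols as shown in the example (t, d -> T; x -> G; ...).
SCA_CLASSES = {"t": "T", "d": "T", "ɔ": "O", "x": "G", "ə": "E", "r": "R"}


def tokenize(word):
    """Split an IPA string into segments, attaching the length mark to the preceding segment."""
    segments = []
    for ch in word:
        if ch == "ː" and segments:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


def to_class(segment):
    """Look up the sound class, ignoring the length mark."""
    return SCA_CLASSES[segment[0]]


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment that compares sound classes
    but returns the original segments, with '-' as the gap symbol."""
    def sub(x, y):
        return match if to_class(x) == to_class(y) else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], score[n][m]


def distance(a, b):
    """Derive a distance from similarity scores (assumed formula; 0 for identical class strings)."""
    return 1 - 2 * align(a, b)[2] / (align(a, a)[2] + align(b, b)[2])


a, b = tokenize("tɔxtər"), tokenize("dɔːtər")
row_a, row_b, _ = align(a, b)
print(" ".join(row_a))
print(" ".join(row_b))
print(distance(a, b))
```

Since [t] and [d] share the class T, the two initial consonants count as a match here, and the only penalty comes from the gap opposite [x]. A pair would then be judged cognate if its distance falls below a chosen threshold.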

11 / 28

SLIDE 47

Traditional vs. Automatic Approaches

Similarity
Almost all current automatic approaches are based on a language-independent similarity measure, while the comparative method applies a language-specific one. These automatic approaches will therefore yield the same scores for phenotypically identical sequences, regardless of the language systems they belong to.

12 / 28

SLIDE 50

LexStat

13 / 28

SLIDE 51

Working Procedure

Sequence Input: sequences are read from specifically formatted input files
1 Sequence Conversion: sequences are converted to sound classes and prosodic profiles
2 Scoring-Scheme Creation: using a permutation method, language-specific scoring schemes are determined
3 Distance Calculation: based on the language-specific scoring scheme, pairwise distances between sequences are calculated
4 Sequence Clustering: sequences are clustered into cognate sets whose average distance is below a certain threshold
Sequence Output: information regarding sequence clustering is written to file using a specific format
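The clustering step can be illustrated with a flat average-linkage procedure that keeps merging the two closest clusters until their average distance reaches the threshold. This is one possible reading of "average distance" and is not confirmed by the slides as the actual implementation; the function name `flat_cluster`, the labels, the distances, and the threshold value are all hypothetical.

```python
from itertools import combinations


def flat_cluster(items, dist, threshold):
    """Average-linkage agglomerative clustering that stops merging once
    the closest pair of clusters is at or above the threshold."""
    clusters = [[x] for x in items]

    def avg(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        if avg(clusters[i], clusters[j]) >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Invented pairwise distances between four words with the same meaning.
DIST = {
    frozenset(("a", "b")): 0.20, frozenset(("a", "c")): 0.30,
    frozenset(("b", "c")): 0.25, frozenset(("a", "d")): 0.90,
    frozenset(("b", "d")): 0.80, frozenset(("c", "d")): 0.85,
}
print(flat_cluster(["a", "b", "c", "d"], DIST, threshold=0.6))
```

With these invented numbers the words a, b, and c end up in one putative cognate set, while d stays on its own because its average distance to the others exceeds the threshold.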

14 / 28

SLIDE 58

. . Keys to the Past . . . . . . Identification of Cognates . . . . . . . LexStat . . . . . Evaluation

Implementation

LexStat is implemented as part of the LingPy Python library (see http://lingulist.de/lingpy) for automatic tasks in historical linguistics. The current release of LingPy (lingpy-1.0) provides methods for pairwise and multiple sequence alignment (SCA), automatic cognate detection (LexStat), and plotting routines (see the online documentation for details). LexStat can be invoked from the Python shell or inside Python scripts (examples are given in the online documentation).

SLIDE 62

Input and Output

ID  Items  German  English  Swedish
1   hand   hant    hænd     hand
2   woman  fraʊ    wʊmən    kvina
3   know   kɛnən   nəʊ      çɛna
3   know   vɪsən            veːta
…   …      …       …        …

SLIDE 63

Input and Output

ID  Items  German  COG  English  COG  Swedish  COG
1   hand   hant    1    hænd     1    hand     1
2   woman  fraʊ    2    wʊmən    3    kvina    4
3   know   kɛnən   5    nəʊ      5    çɛna     5
3   know   vɪsən   6                  veːta    6
…   …      …       …    …        …    …        …
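As a rough illustration of what the COG columns encode, the sketch below stores the rows of the output table as Python tuples and groups the word forms by cognate ID. The flat tuple layout and the helper name `cognate_sets` are assumptions made for this example; they are not LingPy's file format or API.

```python
from collections import defaultdict

# Rows of the output table as (ID, item, language, form, cognate ID).
# This flat layout is an assumption of the sketch, not LingPy's format.
ROWS = [
    (1, "hand", "German", "hant", 1),
    (1, "hand", "English", "hænd", 1),
    (1, "hand", "Swedish", "hand", 1),
    (2, "woman", "German", "fraʊ", 2),
    (2, "woman", "English", "wʊmən", 3),
    (2, "woman", "Swedish", "kvina", 4),
    (3, "know", "German", "kɛnən", 5),
    (3, "know", "English", "nəʊ", 5),
    (3, "know", "Swedish", "çɛna", 5),
    (3, "know", "German", "vɪsən", 6),
    (3, "know", "Swedish", "veːta", 6),
]


def cognate_sets(rows):
    """Group (language, form) pairs by their cognate ID."""
    sets = defaultdict(list)
    for _id, _item, language, form, cog in rows:
        sets[cog].append((language, form))
    return dict(sets)


if __name__ == "__main__":
    for cog, members in sorted(cognate_sets(ROWS).items()):
        print(cog, members)
```

Words that share a cognate ID are judged cognate; words for the same item with different IDs (as with the three words for 'woman') are not.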

SLIDE 65

Internal Representation of Sequences

Sound Classes and Prosodic Context

All sequences are internally represented as sound classes, the default model being the one proposed in List (forthcoming). All sequences are also represented by prosodic strings which indicate the prosodic environment (initial, ascending, maximum, descending, final) of each phonetic segment (List 2012). The information regarding sound classes and prosodic context is combined, and each input sequence is further represented as a sequence of tuples, consisting of the sound class and the prosodic environment of the respective phonetic segment.
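The tuple representation can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the sound-class table and the sonority ranks are toy values (not the model of List, forthcoming), the symbols `<`, `^`, `>` for ascending, maximum, and descending are invented for this example, and the rule for deriving the environment from sonority is a simplification.

```python
# Toy sound-class model (hypothetical; NOT the default model used by LexStat).
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K",
    "s": "S", "z": "S", "h": "H",
    "m": "M", "n": "N", "r": "R", "l": "R",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}

# Toy sonority rank per sound class (also an assumption of this sketch).
SONORITY = {"P": 1, "T": 1, "K": 1, "S": 2, "H": 2,
            "M": 3, "N": 3, "R": 4, "V": 5}


def to_sound_classes(segments):
    """Map each phonetic segment to its (toy) sound class."""
    return [SOUND_CLASSES[s] for s in segments]


def prosodic_string(classes):
    """Mark each position as initial (#), ascending (<), maximum (^),
    descending (>), or final ($), based on the toy sonority ranks."""
    ranks = [SONORITY[c] for c in classes]
    last = len(ranks) - 1
    marks = []
    for i, rank in enumerate(ranks):
        if i == 0:
            marks.append("#")
        elif i == last:
            marks.append("$")
        elif rank >= ranks[i - 1] and rank >= ranks[i + 1]:
            marks.append("^")  # local sonority peak
        elif ranks[i + 1] > rank:
            marks.append("<")  # sonority rises towards the next segment
        else:
            marks.append(">")  # sonority falls
    return marks


def represent(segments):
    """Combine sound class and prosodic environment into tuples."""
    classes = to_sound_classes(segments)
    return list(zip(classes, prosodic_string(classes)))


if __name__ == "__main__":
    print(represent(list("kvina")))
    print(represent(list("hant")))
```

Because each tuple carries the environment along with the sound class, the same sound class in initial and in final position counts as two different symbols.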

SLIDE 70

Scoring-Scheme Creation

Attested Distribution

carry out global and pairwise alignment analyses of all sequence pairs occurring in the same semantic slot
store all corresponding segments that occur in sequences whose distance is below a certain threshold

Creation of the Expected Distribution

shuffle the wordlists repeatedly and carry out global and pairwise alignment analyses of all sequence pairs in the randomly shuffled wordlists
store all corresponding segments
average the results

Calculation of Similarity Scores

calculate log-odds scores from the two distributions
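The general idea of the permutation method can be sketched as below. This is a heavily simplified illustration under stated assumptions: real alignment is replaced by position-wise pairing of segments, no distance threshold is applied, and the score is a plain smoothed log-odds ratio. The function names, the smoothing constant, and the number of runs are hypothetical, and the sketch does not attempt to reproduce LexStat's actual scoring function.

```python
import math
import random
from collections import Counter


def count_pairs(words_a, words_b):
    """Count position-wise segment pairs of words sharing a meaning slot.
    (A stand-in for alignment; zip() ignores overhanging segments.)"""
    counts = Counter()
    for a, b in zip(words_a, words_b):
        counts.update(zip(a, b))
    return counts


def expected_counts(words_a, words_b, runs=100, seed=0):
    """Shuffle one wordlist repeatedly, count pairs, and average."""
    rng = random.Random(seed)
    shuffled = list(words_b)
    total = Counter()
    for _ in range(runs):
        rng.shuffle(shuffled)
        total.update(count_pairs(words_a, shuffled))
    return {pair: n / runs for pair, n in total.items()}


def log_odds(attested, expected, smoothing=0.01):
    """Plain log-odds of attested vs. expected frequency per segment pair."""
    pairs = set(attested) | set(expected)
    return {
        pair: math.log2((attested.get(pair, 0) + smoothing)
                        / (expected.get(pair, 0) + smoothing))
        for pair in pairs
    }


if __name__ == "__main__":
    lang_a = ["tan", "tak", "dom", "hat"]
    lang_b = ["zan", "zak", "tom", "has"]
    attested = count_pairs(lang_a, lang_b)
    expected = expected_counts(lang_a, lang_b)
    for pair, score in sorted(log_odds(attested, expected).items()):
        print(pair, round(score, 2))
```

Segment pairs that co-occur more often in the real wordlists than in the shuffled ones receive positive scores, and pairs that co-occur less often than expected receive negative ones.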

SLIDE 77

Scoring-Scheme Creation

English   German      Att.   Exp.   Score
#[t,d]    #[t,d]       3.0   1.24     6.3
#[t,d]    #[ʦ]         3.0   0.38     6.0
#[t,d]    #[ʃ,s,z]     1.0   1.99    -1.5
#[θ,ð]    #[t,d]       7.0   0.72     6.3
#[θ,ð]    #[ʦ]         0.0   0.25    -1.5
#[θ,ð]    #[s,z]       0.0   1.33     0.5
[t,d]$    [t,d]$      21.0   8.86     6.3
[t,d]$    [ʦ]$         3.0   1.62     3.9
[t,d]$    [ʃ,s]$       6.0   5.30     1.5
[θ,ð]$    [t,d]$       4.0   1.14     4.8
[θ,ð]$    [ʦ]$         0.0   0.20    -1.5
[θ,ð]$    [ʃ,s]$       0.0   0.80     0.5

SLIDE 79

Scoring-Scheme Creation

          Initial         Final
English   town [taʊn]     hot [hɔt]
German    Zaun [ʦaun]     heiß [haɪs]
English   thorn [θɔːn]    mouth [maʊθ]
German    Dorn [dɔrn]     Mund [mʊnt]
English   dale [deɪl]     head [hɛd]
German    Tal [taːl]      Hut [huːt]

SLIDE 80

Sequence Clustering

               Ger.   Eng.   Dan.   Swe.   Dut.   Nor.
Ger. [frau]    0.00   0.95   0.81   0.70   0.34   1.00
Eng. [wʊmən]   0.95   0.00   0.78   0.90   0.80   0.80
Dan. [kvenə]   0.81   0.78   0.00   0.17   0.96   0.13
Swe. [kvinːa]  0.70   0.90   0.17   0.00   0.86   0.10
Dut. [vrɑuʋ]   0.34   0.80   0.96   0.86   0.00   0.89
Nor. [kʋinə]   1.00   0.80   0.13   0.10   0.89   0.00
Clusters       1      2      3      3      1      3
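A minimal sketch of how such a distance matrix can be partitioned: clusters are merged by average linkage until the smallest average distance between two clusters reaches a threshold. The threshold of 0.6 and the function name `flat_cluster` are assumptions made for this example, and the slides do not state the exact clustering variant, so this should be read as one plausible reading of "average distance" rather than the LexStat implementation.

```python
LANGUAGES = ["Ger.", "Eng.", "Dan.", "Swe.", "Dut.", "Nor."]

# Pairwise distances for the words meaning 'woman', as given in the matrix.
MATRIX = [
    [0.00, 0.95, 0.81, 0.70, 0.34, 1.00],
    [0.95, 0.00, 0.78, 0.90, 0.80, 0.80],
    [0.81, 0.78, 0.00, 0.17, 0.96, 0.13],
    [0.70, 0.90, 0.17, 0.00, 0.86, 0.10],
    [0.34, 0.80, 0.96, 0.86, 0.00, 0.89],
    [1.00, 0.80, 0.13, 0.10, 0.89, 0.00],
]


def flat_cluster(matrix, threshold):
    """Average-linkage agglomeration that stops at the given threshold.
    Returns one cluster label per sequence, numbered by first appearance."""
    clusters = [[i] for i in range(len(matrix))]

    def average(a, b):
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the two clusters with the smallest average distance.
        dist, x, y = min(
            (average(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if dist >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(matrix)
    for label, members in enumerate(sorted(clusters, key=min), start=1):
        for member in members:
            labels[member] = label
    return labels


if __name__ == "__main__":
    for language, label in zip(LANGUAGES, flat_cluster(MATRIX, 0.6)):
        print(language, label)
```

With the assumed threshold, the sketch groups German with Dutch and Danish with Swedish and Norwegian, and leaves English on its own, which matches the cluster row of the matrix.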

SLIDE 82

Evaluation

SLIDE 84

Gold Standard

File  Family     Lng.  Itm.  Entr.  Source
GER   Germanic      7   110    814  Starostin (2008)
ROM   Romance       5   110    589  Starostin (2008)
SLV   Slavic        4   110    454  Starostin (2008)
PIE   Indo-Eur.    18   110   2057  Starostin (2008)
OUG   Uralic       21   110   2055  Starostin (2008)
BAI   Bai           9   110   1028  Wang (2006)
SIN   Sinitic       9   180   1614  Hóu (2004)
KSL   varia         8   200   1600  Kessler (2001)
JAP   Japonic      10   200   1986  Shirō (1973)

SLIDE 85

Evaluation Measures

Set Comparison

Precision, Recall, and F-Score are calculated by comparing the cognate sets proposed by the method with the cognate sets in the gold standard (see Bergsma & Kondrak 2007).

Pair Comparison

Pair comparison is based on a pairwise comparison of all decisions present in the test set and the gold standard.
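Both measures can be sketched in a few lines. The sketch assumes that each analysis is given as a list of cognate IDs (one per word of a meaning slot), that a proposed set counts as correct only if it matches a gold-standard set exactly, and that pair comparison reports the proportion of word pairs receiving the same cognate or non-cognate decision. These definitions and the function names are assumptions made for illustration; the slides do not spell out the exact formulas.

```python
from itertools import combinations


def pair_agreement(test, gold):
    """Proportion of word pairs with the same cognate/non-cognate decision."""
    pairs = list(combinations(range(len(gold)), 2))
    if not pairs:
        return 1.0
    same = sum((test[i] == test[j]) == (gold[i] == gold[j]) for i, j in pairs)
    return same / len(pairs)


def as_sets(labels):
    """Turn a list of cognate IDs into a set of frozensets of word indices."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return {frozenset(group) for group in groups.values()}


def set_scores(test, gold):
    """Precision, recall, and F-score over exactly matching cognate sets."""
    test_sets, gold_sets = as_sets(test), as_sets(gold)
    common = len(test_sets & gold_sets)
    precision = common / len(test_sets)
    recall = common / len(gold_sets)
    f_score = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    gold = [1, 2, 3, 3, 1, 3]   # gold-standard cognate IDs
    test = [1, 2, 3, 3, 4, 3]   # hypothetical output missing one cognate
    print(pair_agreement(test, gold))
    print(set_scores(test, gold))
```

In the hypothetical example a single missed cognate pair costs only one of fifteen pairwise decisions, but it breaks one gold-standard set into two wrong sets, so the set-based scores drop much more sharply than the pair-based score.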

SLIDE 90

Tests

Sound Classes – matching sound classes without alignment (based on Turchin et al. 2010)
Simple Alignment – normalized edit distance (Levenshtein 1966)
SCA – language-independent distance scores derived from sound-class-based alignment analyses (List 2012)
LexStat – language-specific distance scores
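For reference, the Simple Alignment baseline can be sketched as a normalized edit distance. The sketch assumes normalization by the length of the longer sequence and treats every Unicode character as one segment; both are assumptions of this example, since the slides do not specify the normalization or the segmentation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    print(normalized_edit_distance("hant", "hænd"))
    print(normalized_edit_distance("fraʊ", "wʊmən"))
```

Transcriptions that use length marks or diacritics would need to be split into segments first (for example, passed as lists of segments), because the functions compare sequence elements one by one.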

SLIDE 96

General Results

Score            LexStat  SCA   Simple Alm.  Sound Cl.
Identical Pairs  0.85     0.82  0.76         0.74
Precision        0.59     0.51  0.39         0.39
Recall           0.68     0.57  0.47         0.55
F-Score          0.63     0.55  0.42         0.46

SLIDE 97

General Results

[Figure: scores per dataset (SLV, KSL, GER, BAI, SIN, PIE, ROM, JAP, OUG; axis from 0.6 to 1.0) for LexStat, SCA, NED, and Turchin]

SLIDE 98

Specific Results

Pairwise decisions were extracted from the KSL dataset and compared with the Gold Standard. 72 borrowings were explicitly marked along with their source by Kessler (2001). 83 chance resemblances were determined automatically by taking non-cognate word pairs with an NED score less than 0.6.

                     LexStat  SCA  Simple Alm.  Sound Cl.
Borrowings           50%      61%  49%          53%
Chance Resemblances  17%      42%  89%          31%

SLIDE 103

What's next?

SLIDE 104

Special thanks to:

The German Federal Ministry of Education and Research (BMBF) for funding our research project.

Hans Geisler for his helpful, critical, and inspiring support.

James Kilbury for all the time he spent on helping me to refine the manuscript.

SLIDE 105

THANK YOU FOR LISTENING!