SLIDE 1

Automatic Detection of Borrowings in Lexicostatistic Datasets

Johann-Mattis List1, Steven Moran1,2 & Jelena Prokić1

1Research Unit Quantitative Language Comparison

Philipps-University Marburg

2Linguistics Department, University of Zürich

December 13, 2012

SLIDE 2

Structure of the Talk

1. Modelling Language History: Trees, Waves, Networks
2. Borrowing: Complexity of Borrowing Processes, Phyletic Patterns, Borrowing Detection
3. Application: Material, Methods, Results
4. Discussion: Natural Findings or Artifacts?, Limits, Examples

SLIDE 3

Modelling Language History

SLIDE 5

Modelling Language History Trees

Dendrophilia

August Schleicher (1821-1868)

These assumptions that logically follow from the results of our research can best be illustrated with the help of a branching tree. (Schleicher 1853: 787, translation JML)

SLIDE 6

Modelling Language History Trees

Dendrophilia

Schleicher (1853)

SLIDE 7

Modelling Language History Waves

Dendrophobia

Johannes Schmidt (1843-1901)

No matter how we look at it, as long as we stick to the assumption that today’s languages originated from their common proto-language via multiple furcation, we will never be able to explain all facts in a scientifically adequate way. (Schmidt 1872: 17, translation JML)

SLIDE 9

Modelling Language History Waves

Dendrophobia

Johannes Schmidt (1843-1901)

I want to replace [the tree] by the image of a wave that spreads out from the center in concentric circles, becoming weaker and weaker the farther they get away from the center. (Schmidt 1872: 27, translation JML)

SLIDE 10

Modelling Language History Waves

Dendrophobia

Schmidt (1875)

SLIDE 11

Modelling Language History Waves

Dendrophobia

Meillet (1908) Hirt (1905) Bloomfield (1933) Bonfante (1931)

SLIDE 12

Modelling Language History Networks

Phylogenetic Networks

Trees are bad because
- they are difficult to reconstruct
- languages do not separate in split processes
- they are boring, since they only capture certain aspects of language history, namely the vertical relations

Waves are bad because
- nobody knows how to reconstruct them
- languages still separate, even if not in split processes
- they are boring, since they only capture certain aspects of language history, namely the horizontal relations

SLIDE 19

Modelling Language History Networks

Phylogenetic Networks

Hugo Schuchardt (1842-1927)

We connect the branches and twigs of the tree with countless horizontal lines and it ceases to be a tree. (Schuchardt 1870 [1900]: 11, translation JML)

SLIDE 21

Modelling Language History Networks

Phylogenetic Networks

Illustration by Ovidiu Popa (Heinrich Heine University Düsseldorf).

SLIDE 22

Borrowing

SLIDE 23

Borrowing Complexity of Borrowing Processes

Complexity of Borrowing Processes

expected:     Mandarin [ma₅₅po₂₁lou]
attested:     Mandarin [wan₅₁paw₂₁lu₅₁]
explanation:  Cantonese [maːn₂₂pow₃₅low₃₂]

SLIDE 27

Borrowing Complexity of Borrowing Processes

Complexity of Borrowing Processes

          English        Cantonese                   Mandarin
form      maːlboʁo       maːn₂₂pow₃₅low₃₂            wan₅₁paw₂₁lu₅₁
meaning   proper name    “Road of 1000 Treasures”    “Road of 1000 Treasures” (万宝路)

SLIDE 28

Borrowing Phyletic Patterns

Patchy Distributions in Phyletic Patterns

Borrowing processes can be incredibly complex. Nevertheless, they always leave direct traces, insofar as the borrowed word is usually phonetically quite similar to the donor word. Furthermore, since the borrowing process is not tree-like, borrowings may, if they are mistaken for cognates, show up as “patchy distributions” in phyletic patterns of genetically related languages.

SLIDE 29

Borrowing Phyletic Patterns

Patchy Distributions in Phyletic Patterns

[Tree figure: mountain, Berg, montagne, monte; the cognate sets are coded as presence-absence patterns (1/?) on the tree]

SLIDE 38

Borrowing Phyletic Patterns

Gain Loss Mapping

Patchy distributions in phyletic patterns can serve as a heuristic for borrowing detection. Patchily distributed cognates can be identified with the help of gain-loss mapping approaches (Mirkin et al. 2003, Dagan & Martin 2007, Cohen et al. 2008), by which phyletic patterns are mapped onto a reference tree.
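The mapping step can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: a Sankoff-style dynamic program scores a binary presence-absence pattern on a fixed reference tree under a given gain/loss cost ratio; the tree shape, leaf names, pattern, and costs are all invented for the example.

```python
# Minimal sketch of parsimony-based gain-loss mapping (illustrative only):
# score a binary presence/absence pattern on a fixed reference tree under
# a given gain/loss cost ratio via a Sankoff-style dynamic program.

def sankoff(tree, pattern, gain_cost=1.0, loss_cost=1.0):
    """Return the minimal total cost of explaining `pattern` on `tree`.

    `tree` is a nested tuple of leaf names; `pattern` maps leaf -> 0/1.
    State 0 = absent, 1 = present; a 0->1 change on a branch is a gain,
    a 1->0 change is a loss.
    """
    def cost(node):
        if isinstance(node, str):                  # leaf: observed state
            s = pattern[node]
            return {s: 0.0, 1 - s: float("inf")}
        left, right = (cost(child) for child in node)
        table = {}
        for state in (0, 1):
            total = 0.0
            for child in (left, right):
                # cheapest child assignment given the parent state
                total += min(
                    child[s] + (0.0 if s == state
                                else gain_cost if s == 1 else loss_cost)
                    for s in (0, 1))
            table[state] = total
        return table
    # the root may be in either state; take the cheaper assignment
    return min(cost(tree).values())

# hypothetical tree and phyletic pattern
tree = ((("En", "De"), ("Fr", "It")), "Ru")
pattern = {"En": 1, "De": 0, "Fr": 1, "It": 1, "Ru": 0}
print(sankoff(tree, pattern))                   # equal gain/loss costs
print(sankoff(tree, pattern, gain_cost=2.0))    # penalize multiple gains
```

Varying the gain/loss cost ratio is what distinguishes the different models compared later: expensive gains push the reconstruction towards a single origin plus losses, expensive losses towards multiple independent gains.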

SLIDE 40

Borrowing Phyletic Patterns

Gain Loss Mapping

5 gains 0 losses

SLIDE 42

Borrowing Phyletic Patterns

Gain Loss Mapping

1 gain 6 losses

SLIDE 44

Borrowing Phyletic Patterns

Gain Loss Mapping

3 gains 0 losses

SLIDE 45

Borrowing Phyletic Patterns

Ancestral Vocabulary Distributions

Gain-loss mapping is useful to test possible scenarios of character evolution. However, as long as there is no direct criterion that helps to choose the “best” of many different solutions, the method hardly gives us any new insights. Nelson-Sathi et al. (2011) use ancestral vocabulary sizes as a criterion to determine the right model. Here, we introduce ancestral vocabulary distributions, i.e. the form-meaning ratio of ancestral taxa, as a new criterion.
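The criterion can be illustrated with a small, entirely hypothetical example: compare the form-meaning ratios inferred for ancestral nodes under a model with those of the contemporary languages, and prefer the model whose ancestral distributions are statistically indistinguishable from the attested ones (high p-value). The Mann-Whitney U test below uses the plain normal approximation without tie correction, and all ratios are invented; this is a sketch, not the authors' code.

```python
# Toy model selection via a hand-rolled two-sided Mann-Whitney U test
# (normal approximation, untied ranks): a model whose ancestral
# form-meaning ratios look like the contemporary ones is preferred.

import math

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation, no ties)."""
    pooled = sorted((v, 0 if i < len(xs) else 1)
                    for i, v in enumerate(list(xs) + list(ys)))
    # rank sum of the first sample (ranks start at 1)
    r1 = sum(rank for rank, (_, grp) in enumerate(pooled, start=1)
             if grp == 0)
    n1, n2 = len(xs), len(ys)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# hypothetical form-meaning ratios (forms per meaning)
contemporary = [1.02, 1.05, 0.98, 1.10, 1.03, 1.07]
model_a = [1.04, 0.99, 1.08, 1.01, 1.06]   # plausible ancestral ratios
model_b = [2.50, 2.80, 2.40, 2.95, 2.60]   # inflated: too many synonyms

_, p_a = mann_whitney_u(model_a, contemporary)
_, p_b = mann_whitney_u(model_b, contemporary)
print(p_a > p_b)  # model A's distribution is closer to the attested one
```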

SLIDE 48

Borrowing Phyletic Patterns

Ancestral Vocabulary Distributions

Vocabulary Size

[Tree figure after Dagan & Martin (2007) and Nelson-Sathi et al. (2011): vocabulary sizes of 50 at the leaves, with inferred ancestral sizes of 75 up to 125 at inner nodes]

SLIDE 51

Borrowing Phyletic Patterns

Ancestral Vocabulary Distributions

Vocabulary Distribution

[Tree figure: vocabulary distributions (forms/meanings) at each node, 50/50 at the leaves and 75/50 up to 125/50 at inner nodes]

SLIDE 52

Borrowing Phyletic Patterns

Ancestral Vocabulary Distributions

Favoring ancestral vocabulary distributions over ancestral vocabulary sizes comes much closer to linguistic needs: we know that languages cannot be measured in terms of their “size”, while it is reasonable to assume that languages do not allow for an unlimited amount of synonyms. Furthermore, ancestral vocabulary distributions help to avoid problems resulting from semantic shift.

SLIDE 53

Borrowing Phyletic Patterns

Differential Loss and Semantic Shift

monte Berg montagne mountain

SLIDE 54

Borrowing Phyletic Patterns

Differential Loss and Semantic Shift

Differential Loss

SLIDE 55

Borrowing Phyletic Patterns

Differential Loss and Semantic Shift

Semantic Shift

SLIDE 57

Borrowing Phyletic Patterns

Differential Loss and Semantic Shift

Borrowing

SLIDE 58

Borrowing Phyletic Patterns

Differential Loss and Semantic Shift

Parallel semantic shift is not improbable per se. However, parallel semantic shift involving the same source forms in independent branches of a language family is rather unlikely.

SLIDE 59

Borrowing: Borrowing Detection

Gain Loss Mapping Approach to Borrowing Detection

Input: (a) lexicostatistic dataset (cognate sets), (b) presence-absence matrix (phyletic patterns), (c) reference tree

1. Gain Loss Mapping: Apply a parsimony-based gain-loss mapping analysis using different models with varying ratios of weights for gains and losses.

2. Model Selection: Choose the most probable model by comparing the ancestral vocabulary distributions with the contemporary ones using the Mann-Whitney U test.

3. Patchy Cognate Detection: Split all cognate sets for which more than one origin was inferred by the best model into subsets of common origin.

4. Network Reconstruction: Connect the separate origins of all patchy cognate sets by calculating a weighted minimum spanning tree and add all links as edges to the reference tree, whereby the edge weight reflects the number of inferred links.
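Step 4 can be sketched as follows: for one patchy cognate set, the nodes at which separate origins were inferred are connected by a minimum spanning tree (Kruskal's algorithm here), and summing such links over all patchy cognate sets yields the edge weights of the network. The variety names and distances below are hypothetical, and this is a sketch under those assumptions, not the QLC-LingPy code.

```python
# Sketch of the network-reconstruction step: connect the inferred origins
# of one patchy cognate set with a minimum spanning tree (Kruskal).

def minimum_spanning_tree(nodes, dist):
    """Kruskal's algorithm over a complete graph given by `dist`."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path compression
            n = parent[n]
        return n
    edges = sorted((dist[a][b], a, b)
                   for a in nodes for b in nodes if a < b)
    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                        # edge joins two components
            parent[ra] = rb
            tree.append((a, b))
    return tree

# hypothetical distances between the inferred origins of one cognate set
dist = {
    "TommoSo":  {"TebulUre": 2, "YandaDom": 3, "DogulDom": 4},
    "TebulUre": {"TommoSo": 2, "YandaDom": 2, "DogulDom": 5},
    "YandaDom": {"TommoSo": 3, "TebulUre": 2, "DogulDom": 3},
    "DogulDom": {"TommoSo": 4, "TebulUre": 5, "YandaDom": 3},
}
links = minimum_spanning_tree(sorted(dist), dist)
print(links)
```

Accumulating the returned links over all patchy cognate sets gives the counts of inferred links between varieties.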

SLIDE 65

Application

SLIDE 67

Application Material

Dogon Languages

The Dogon language family consists of about 20 distinct (mutually unintelligible) languages. The internal structure of the family is largely unknown; some scholars propose a split into an Eastern and a Western branch. The Dogon Languages Project (DLP, http://dogonlanguages.org) provides a lexical spreadsheet covering 23 language varieties, submitted by 5 authors. The spreadsheet contains 9000 semantic items translated into the respective varieties, but only a small number of the items (fewer than 200) is translated into all languages.

SLIDE 68

Application Material

Dogon Data

From the Dogon spreadsheet, we extracted 325 semantic items (“concepts”), translated into 18 varieties (“doculects”), yielding a total of 4883 words (“counterparts”). The main criterion for the data selection was to maximize the number of semantically aligned words across the given varieties in order to avoid large amounts of gaps in the data.

SLIDE 69

Application Methods

QLC-LingPy

All analyses were conducted using the development version of QLC-LingPy. QLC-LingPy is a Python library currently being developed in Michael Cysouw’s research unit “Quantitative Language Comparison” (Philipps-University Marburg). QLC-LingPy supersedes the independently developed QLC and LingPy libraries by merging their specific features into a common framework while extending their functionality. Our goal is to provide a Python toolkit that is easy to use for non-experts in programming, while at the same time offering up-to-date proposals for common tasks in quantitative historical linguistics.

SLIDE 70

Application Methods

Workflow

Input: (a) Dogon spreadsheet, (b) reference trees (DLP, MrBayes, Neighbor-Joining)

1. Preprocessing: Orthographic parsing (IPA conversion) and tokenization using the Orthography Profile Approach (Moran & Cysouw in prep.).

2. Cognate Detection: Identification of etymologically related words (cognates and borrowings, i.e. “homologs”) using the LexStat method (List 2012) with a low threshold (0.4) in order to minimize the number of false positives.

3. Borrowing Detection: Identification of patchy phyletic patterns using the improved gain-loss mapping approach (ten different gain-loss models, favoring varying numbers of origins).

Output: (a) cognate sets, (b) patchy cognate sets, (c) phylogenetic network
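To make the cognate-detection step concrete: LexStat derives pairwise word distances from sound-correspondence statistics and then flat-clusters them with a user-defined threshold. The sketch below fakes the distances and shows only the clustering step (single-linkage, chosen here for brevity); the words and distance values are invented, so this is an illustration of the idea, not the LexStat implementation.

```python
# Illustrative sketch only: turn fake pairwise word distances into cognate
# ("homolog") sets by single-linkage flat clustering with a threshold.

def flat_cluster(words, dist, threshold):
    """Link any pair of words whose distance is at most `threshold`."""
    cluster_of = {w: {w} for w in words}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if (dist[frozenset((a, b))] <= threshold
                    and cluster_of[a] is not cluster_of[b]):
                merged = cluster_of[a] | cluster_of[b]
                for w in merged:
                    cluster_of[w] = merged
    # deduplicate the merged sets
    return {frozenset(c) for c in cluster_of.values()}

words = ["mountain", "Berg", "montagne", "monte"]
dist = {  # invented distances in [0, 1]; low = similar
    frozenset(("mountain", "Berg")): 0.90,
    frozenset(("mountain", "montagne")): 0.20,
    frozenset(("mountain", "monte")): 0.30,
    frozenset(("Berg", "montagne")): 0.80,
    frozenset(("Berg", "monte")): 0.85,
    frozenset(("montagne", "monte")): 0.25,
}
clusters = flat_cluster(words, dist, threshold=0.4)
print(clusters)
```

Lowering the threshold splits clusters and so trades recall for precision, which is why a low value (0.4) was chosen to keep false positives rare.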

SLIDE 78
Application Results

Models

[Bar chart: ancestral vs. contemporary vocabulary distributions per gain-loss model, with Mann-Whitney p-values]

M_5_1  p<0.00
M_4_1  p<0.00
M_3_1  p=0.06
M_2_1  p=0.15
M_1_1  p<0.00
M_1_2  p<0.00
M_1_3  p<0.00
M_1_4  p<0.00
M_1_5  p<0.00

Best Model: 2_1

SLIDE 79

Application Results

Numbers

Tree             Model  Origins (Ø)  MaxO  p-value
DLP              2_1    1.68         5     0.15
MrBayes          2_1    1.67         5     0.50
NeighborJoining  2_1    1.69         5     0.16

SLIDE 80

Application Results

Phylogenetic Network

[Phylogenetic network: links added to the reference tree over the varieties Gourou, TommoSo, TogoKan, JamsayMondoro, Jamsay, Nanga, TomoKanDiangassagou, YornoSo, ToroTegu, PergeTegu, Bunoge, Tiranige, Mombo, TebulUre, YandaDom, DogulDom, BenTey, BankanTey]

SLIDE 83

Application Results

Areal Perspective

[Map of the Dogon area: Eastern vs. Western varieties, numbered 1-18; link shading scales with the number of inferred links (1-39)]

1 BankanTey, 2 BenTey, 3 Bunoge, 4 DogulDom, 5 Gourou, 6 Jamsay, 7 JamsayMondoro, 8 Mombo, 9 Nanga, 10 PergeTegu, 11 TebulUre, 12 Tiranige, 13 TogoKan, 14 TommoSo, 15 TomoKanDiangassagou, 16 ToroTegu, 17 YandaDom, 18 YornoSo

SLIDE 85

Application Results

Areal Perspective: Tebul Ure

[Map: links inferred for Tebul Ure; shading scales with the number of inferred links (1-32); varieties numbered as before]

Heath (2011a: 3) notes that Tommo So is the main contact language of Tebul Ure.

SLIDE 87

Application Results

Areal Perspective: Yanda Dom

[Map: links inferred for Yanda Dom; shading scales with the number of inferred links (1-39); varieties numbered as before]

Heath (2011b: 3) notes that the use of Tommo So as a second language is common among Yanda Dom speakers.

SLIDE 89

Application Results

Areal Perspective: Dogul Dom

[Map: links inferred for Dogul Dom; shading scales with the number of inferred links (1-32); varieties numbered as before]

Cansler (2012: 2) notes that most speakers of Dogul Dom use Tommo So as a second language.

SLIDE 90

?!? !?! !!! ???

Discussion

SLIDE 91

Discussion Natural Findings or Artifacts?

Natural Findings or Artifacts?

On the large scale, the results seem to confirm the method. However, given the multitude of possible errors that may have influenced our results, how can we be sure that these findings are “natural” and not artifacts of our methods?

SLIDE 92

Discussion Natural Findings or Artifacts?

Natural Findings or Artifacts?

Well, we can’t! At least not for sure. But we can say that our results are consistent across a range of varying parameters, which makes us rather confident that it is worth pursuing our work with these methods...

SLIDE 93

Discussion Natural Findings or Artifacts?

Natural Findings or Artifacts?

TreeA    TreeB             B-Cubed F-Score
DLP      MrBayes           0.9539
DLP      Neighbor-Joining  0.9401
MrBayes  Neighbor-Joining  0.9464

Comparing the Impact of Varying Reference Trees

SLIDE 94

Discussion Natural Findings or Artifacts?

Natural Findings or Artifacts?

Varying the reference trees only marginally changes the concrete predictions of the method. Although the trees created from the data (MrBayes & Neighbor-Joining) do not reflect the East-West distinction of the DLP tree, the dominating role of Tommo So can still be inferred.

SLIDE 95

Discussion Natural Findings or Artifacts?

Natural Findings or Artifacts?

Threshold  Best Model  Origins (Ø)  MaxO  p-value
0.2        3_1         1.43         4     0.35
0.3        2_1         1.64         5     0.31
0.4        2_1         1.68         5     0.15
0.5        2_1         1.65         5     0.42
0.6        1_1         2.35         7     0.45

Varying the Threshold for Cognate Detection

SLIDE 96

Discussion Natural Findings or Artifacts?

Natural Findings or Artifacts?

Varying the thresholds for cognate (homolog) detection clearly changes the results. The higher the threshold, the higher the number of false positives proposed by the LexStat method. False positives, however, also often show up as patchy distributions.

SLIDE 97

Discussion Limits

Limits of the Method

Patchily distributed cognate sets do not necessarily result from borrowings but may likewise result from
(a) missing data,
(b) false positives, or
(c) coincidence.

Borrowing processes do not necessarily result in patchily distributed cognate sets, especially if they occur
(a) outside the group of languages being compared,
(b) so frequently that they are “masked” as non-patchy distributions, or
(c) between languages that are genetically close on the reference tree.

SLIDE 100

Discussion Examples

Examples

SLIDE 102


Thank You for Listening!
