Recovering dialect geography from an unaligned comparable corpus



SLIDE 1

Introduction The Archimob corpus Cognate and identical word pairs Recovering dialect geography Conclusion

Recovering dialect geography from an unaligned comparable corpus

Yves Scherrer

LATL, Department of Linguistics, University of Geneva, Switzerland

LINGVIS & UNCLH Workshop, EACL 2012, Avignon

1 / 24 Yves Scherrer: Recovering dialect geography from an unaligned comparable corpus

SLIDE 2

Overview

1. Introduction
2. The Archimob corpus
3. Cognate and identical word pairs
4. Recovering dialect geography
5. Conclusion

SLIDE 3

Introduction

1. Cognate identification
   - Find cognate word pairs in texts from multiple dialects
   - Use these word pairs to determine dialect distance
2. Dialectometric analysis
   - Use statistical and mathematical methods to discover the geographical distribution of dialect similarities
   - Typical data source: dialectological surveys
   - Our data source: transcribed texts from multiple Swiss German dialects

SLIDE 6

Introduction

Typical dialectological data:

                  hier    Leute    (sie) machen
Köniz (BE)        hie     Lüt      mache
Niederwald (VS)   hie     Lit      machend
Horgen (ZH)       daa     Lüüt     mached
Flawil (SG)       doo     Lüüt     mached

A data matrix lists the realizations of different linguistic phenomena (columns) at different inquiry points (rows). All realizations of a given phenomenon can be retrieved and compared easily: they are in the same column.

Our data set:

- A comparable multidialectal corpus: 16 Swiss German texts
- Unaligned: we do not know in advance which phenomena occur in the texts, nor what their respective realizations are
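The contrast between the two kinds of data can be sketched in a few lines of Python. The forms are those of the table above; the dictionary layout and the helper names `realizations` and `as_bag_of_words` are illustrative assumptions, not part of the presented work.

```python
# Survey-style data matrix: rows are inquiry points, columns are phenomena.
# The nested-dict layout is an illustrative assumption.
data_matrix = {
    "Köniz (BE)":      {"hier": "hie", "Leute": "Lüt",  "(sie) machen": "mache"},
    "Niederwald (VS)": {"hier": "hie", "Leute": "Lit",  "(sie) machen": "machend"},
    "Horgen (ZH)":     {"hier": "daa", "Leute": "Lüüt", "(sie) machen": "mached"},
    "Flawil (SG)":     {"hier": "doo", "Leute": "Lüüt", "(sie) machen": "mached"},
}


def realizations(matrix, phenomenon):
    """Return the realization of one phenomenon at every inquiry point."""
    return {place: row[phenomenon] for place, row in matrix.items()}


def as_bag_of_words(row):
    """Drop the column labels, as in an unaligned text: only the forms remain."""
    return sorted(row.values())


# Aligned data: one lookup retrieves all comparable forms.
print(realizations(data_matrix, "Leute"))

# Unaligned data: nothing says which form realizes which phenomenon.
print(as_bag_of_words(data_matrix["Horgen (ZH)"]))
```

With the matrix, comparable forms are found by column name. With bags of words, the correspondence between forms has to be inferred, which is what cognate identification is for.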

SLIDE 8

The idea

Text in Dialect A (bag of words): vatter het vom dienscht autersheim gsìì

Text in Dialect B (bag of words): hät schlosser vatter gsii vom altershaim

1. Determine cognate word pairs (and discard non-cognate words)
2. Partition cognate word pairs into identical and non-identical ones

The proportion of identical pairs among the cognate pairs is used as a measure of dialect similarity.

SLIDE 12

Questions

The proportion of identical pairs among the cognate pairs is used as a measure of dialect similarity.

- What is a cognate word pair?
- What is an identical word pair?

Dialect similarity can then be computed between every text pair.

How do these dialect similarity values compare with the geographical proximity of the dialects?

- Numerically: correlation with geographical distance
- Visually: cluster analysis, multidimensional scaling

SLIDE 15

The Archimob corpus

Archimob is an oral history project about the Second World War period in Switzerland.

- 555 interviews in all Swiss language regions
- 16 Swiss German interviews transcribed (University of Zurich)

Transcription:

- A single transcriber for all texts
- Dieth spelling guidelines
- 6 500 to 16 700 words per interview

SLIDE 16

The Archimob corpus

The geographic location of the 16 transcribed texts:

[Map of the text locations; text identifiers: AG1063, AG1147, BE1142, BE1170, BL1073, BS1057, GL1048, GL1207, LU1195, LU1261, NW1007, SG1198, SZ1209, VS1212, ZH1143, ZH1270]

SLIDE 17

The Archimob corpus

Example: BE1142

de vatter ìsch lokomitiiffüerer gsìì / de ìsch dispensiert gsìì vom dienscht nattürlech / und / zwo schwöschtere / hani ghaa / wobii ei gsch / eini gschtoorben ìsch u di ander ìsch ìsch ime autersheim / u soo bini ufgwachse ir lenggass / mit em / pruefsleer / mit wiiterbiudig nächheer / ( ? )

‘the father has been a train driver / he has been dispensed from military service of course / and / two sisters / I have had / where one / one has died and the other is is in a home for the elderly / this is how I have grown up in the Lenggass / with a / apprenticeship / with further education afterwards / ( ? )’

SLIDE 19

Cognate identification

Most recently proposed cognate identification algorithms are based on variants of Levenshtein distance.

We normalize Levenshtein distance by the length of the alignment (Heeringa et al. 2006).

Two words w1, w2 form a cognate word pair iff the normalized Levenshtein distance between w1 and w2 is lower than or equal to the threshold tC.

We experiment with different threshold values:

- tC ∈ {0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40}
- Manual evaluation with a random sample of 100 word pairs per value
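The cognate criterion can be sketched as follows. This is a minimal sketch under stated assumptions: unit costs for insertion, deletion and substitution, and "length of the alignment" read as the length of the longest minimum-cost alignment. The function names and the default threshold are illustrative and not taken from the slides.

```python
import unicodedata


def normalized_levenshtein(w1: str, w2: str) -> float:
    """Levenshtein distance divided by the length of the longest
    minimum-cost alignment (assumed unit costs)."""
    a = unicodedata.normalize("NFC", w1)
    b = unicodedata.normalize("NFC", w2)
    if not a and not b:
        return 0.0
    # Each cell holds (cost, -alignment_length). Taking the minimum prefers
    # the lowest cost and, among equal costs, the longest alignment.
    prev = [(j, -j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [(i, -i)]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            curr.append(min(
                (prev[j][0] + 1, prev[j][1] - 1),            # deletion
                (curr[j - 1][0] + 1, curr[j - 1][1] - 1),    # insertion
                (prev[j - 1][0] + sub, prev[j - 1][1] - 1),  # match or substitution
            ))
        prev = curr
    cost, neg_length = prev[-1]
    return cost / -neg_length


def is_cognate_pair(w1: str, w2: str, t_c: float = 0.35) -> bool:
    """Cognate test: normalized distance lower than or equal to t_c."""
    return normalized_levenshtein(w1, w2) <= t_c


print(normalized_levenshtein("het", "hät"))                # one substitution in three positions
print(normalized_levenshtein("autersheim", "altershaim"))  # two substitutions in ten positions
print(is_cognate_pair("vatter", "schlosser"))
```

Normalizing by alignment length rather than by word length keeps the value between 0 and 1 even when the alignment contains insertions and deletions.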


SLIDE 21

Manual evaluation of cognate thresholds

[Figure: number of inferred word pairs (y-axis, 10 000 to 70 000) by distance threshold (x-axis, 0.05 to 0.4), broken down into non-words, non-cognates, lemma cognates and form cognates]

- The total number of inferred word pairs increases with the threshold.
- Form cognates: same lemma and same inflected form. Lemma cognates: same lemma, but different inflected form.
- The number of cognates levels off from 0.3 onwards.
- Non-cognates outnumber cognates from 0.35 onwards.
- These figures are about precision, not about recall.

SLIDE 27

Identical word pairs

Two words w1, w2 form an identical word pair iff the normalized Levenshtein distance between w1 and w2 is lower than or equal to the threshold tI.

- Intuitive value: tI = 0.0
  - String identity
- Relaxed value: tI = 0.1
  - Neglect minor transcription inconsistencies
  - Neglect smallest dialect differences
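Putting the two thresholds together, the similarity of two texts can be sketched as below. The sketch makes several assumptions that the slides do not spell out: texts are reduced to sets of word types, every cross-text pair within the cognate threshold is counted, and the similarity is the share of identical pairs among the cognate pairs. The distance function repeats the assumed unit-cost, alignment-length-normalized variant so that the block is self-contained.

```python
import unicodedata


def normalized_levenshtein(w1: str, w2: str) -> float:
    """Levenshtein distance over the longest minimum-cost alignment length."""
    a = unicodedata.normalize("NFC", w1)
    b = unicodedata.normalize("NFC", w2)
    if not a and not b:
        return 0.0
    prev = [(j, -j) for j in range(len(b) + 1)]  # cells are (cost, -length)
    for i in range(1, len(a) + 1):
        curr = [(i, -i)]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            curr.append(min(
                (prev[j][0] + 1, prev[j][1] - 1),
                (curr[j - 1][0] + 1, curr[j - 1][1] - 1),
                (prev[j - 1][0] + sub, prev[j - 1][1] - 1),
            ))
        prev = curr
    cost, neg_length = prev[-1]
    return cost / -neg_length


def dialect_similarity(text_a: str, text_b: str,
                       t_c: float = 0.35, t_i: float = 0.1) -> float:
    """Share of identical pairs (distance <= t_i) among cognate pairs
    (distance <= t_c), over all cross-text pairs of word types."""
    types_a, types_b = set(text_a.split()), set(text_b.split())
    cognates = identical = 0
    for w1 in types_a:
        for w2 in types_b:
            d = normalized_levenshtein(w1, w2)
            if d <= t_c:
                cognates += 1
                if d <= t_i:
                    identical += 1
    return identical / cognates if cognates else 0.0


text_a = "vatter het vom dienscht autersheim gsìì"
text_b = "hät schlosser vatter gsii vom altershaim"
print(dialect_similarity(text_a, text_b))
```

On the two example bags from the slide titled "The idea", this sketch finds four cognate pairs at tC = 0.35 (vatter/vatter, vom/vom, het/hät, autersheim/altershaim), two of which are identical at tI = 0.1, giving a similarity of 0.5. A distance for use in clustering could then be taken as one minus this value, which is again an assumption.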

SLIDE 29

Correlation measures

Linguistic distances correlate with geographical distances.

Goal: find the linguistic distance measure (the threshold combination) that correlates best with geographical distance.

We compute two correlation values:

1. Local incoherence (Nerbonne & Kleiweg 2005)
   - Correlation is a local phenomenon that does not need to hold over larger geographical distances.
2. Mantel test (Sokal & Rohlf 1995, 813-819)
   - A general statistical test that applies to data expressed as dissimilarities, often used in evolutionary biology and ecology.
   - Statistical significance of the correlation is obtained by randomization.
   - All our tests: 999 permutations, corresponding to a simulated p-value of 0.001.
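The randomization behind the Mantel test can be sketched in plain Python. This is a generic one-sided permutation version, not necessarily the exact procedure used for the results reported here. The function names, the seed and the toy positions are assumptions made for illustration only.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mantel_test(ling, geo, permutations=999, seed=0):
    """Correlate two symmetric distance matrices and estimate a one-sided
    p-value by jointly permuting the rows and columns of one of them."""
    n = len(ling)
    pairs = list(combinations(range(n), 2))  # upper triangle only
    geo_vec = [geo[i][j] for i, j in pairs]
    observed = pearson([ling[i][j] for i, j in pairs], geo_vec)
    rng = random.Random(seed)
    at_least_as_high = 0
    for _ in range(permutations):
        p = list(range(n))
        rng.shuffle(p)
        permuted = [ling[p[i]][p[j]] for i, j in pairs]
        if pearson(permuted, geo_vec) >= observed:
            at_least_as_high += 1
    # The observed arrangement counts as one of the permutations+1 outcomes.
    p_value = (at_least_as_high + 1) / (permutations + 1)
    return observed, p_value


# Toy check with made-up positions on a line: identical matrices correlate perfectly.
positions = [0, 1, 2, 4, 7, 11]
geo = [[abs(x - y) for y in positions] for x in positions]
r, p = mantel_test(geo, geo)
print(r, p)
```

With 999 permutations the smallest p-value this procedure can return is 1 / (999 + 1) = 0.001, which is the value quoted above.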


SLIDE 32

Correlation measures

[Figure: correlation values (y-axis ticks 0.1 to 0.8) by cognate threshold (x-axis, 0.2 to 0.4); four curves: Mantel test and inverse local incoherence ("Inv local inc"), each for identical thresholds 0.0 and 0.1]

- Mantel test values peak at 0.25. Local incoherence values peak at 0.35.
- The 0.1 threshold for identical pairs (red) performs slightly better than the 0.0 threshold (blue).
- The following visualizations are drawn with this setting: cognate threshold 0.35, identical threshold 0.1.

SLIDE 36

Visualizations

1. Hierarchical cluster analysis
   - Geographically close dialect texts should be reliably clustered together
   - Noisy clustering (Nerbonne et al. 2008):
     - Repeat clustering 100 times
     - Each time, add random amounts of noise to the distance values
     - Alternate between two clustering algorithms: Weighted Average and Group Average
2. Multidimensional scaling
   - The Swiss German dialect landscape features major East-West and North-South divisions.
   - Can this organization be recovered from the dialect texts?
   - Reduce the distance matrix to two dimensions (Kruskal’s multidimensional scaling), and plot the results
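The noisy clustering procedure (not the multidimensional scaling step) can be sketched as follows. The noise model is an assumption: uniform noise of up to a fraction of the standard deviation of the distances, with a hypothetical `noise` parameter. The labels and distances are invented toy values, not results from the corpus.

```python
import random
from collections import Counter
from itertools import combinations
from math import sqrt


def agglomerate(dist, labels, weighted):
    """Average-linkage clustering. weighted=True gives Weighted Average
    (WPGMA), weighted=False gives Group Average (UPGMA).
    Returns the set of clusters formed by the successive merges."""
    clusters = {i: frozenset([lab]) for i, lab in enumerate(labels)}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(labels)), 2)}
    formed, next_id = set(), len(labels)
    while len(clusters) > 1:
        i, j = tuple(min(d, key=d.get))  # closest pair of clusters
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            dik, djk = d[frozenset((i, k))], d[frozenset((j, k))]
            d[frozenset((next_id, k))] = (
                (dik + djk) / 2 if weighted
                else (ni * dik + nj * djk) / (ni + nj))
        # Drop all distances that involve the two merged clusters.
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        clusters[next_id] = clusters.pop(i) | clusters.pop(j)
        formed.add(clusters[next_id])
        next_id += 1
    return formed


def noisy_clustering(dist, labels, runs=100, noise=0.2, seed=0):
    """Percentage of runs in which each cluster appears."""
    rng = random.Random(seed)
    n = len(labels)
    values = [dist[i][j] for i, j in combinations(range(n), 2)]
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    counts = Counter()
    for run in range(runs):
        noisy = [[0.0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            noisy[i][j] = noisy[j][i] = dist[i][j] + rng.uniform(0, noise * sd)
        # Alternate between the two clustering algorithms.
        counts.update(agglomerate(noisy, labels, weighted=(run % 2 == 0)))
    return {cluster: 100 * c / runs for cluster, c in counts.items()}


labels = ["A1", "A2", "B1", "B2"]  # invented toy texts
dist = [[0.0, 0.1, 0.6, 0.6],
        [0.1, 0.0, 0.6, 0.6],
        [0.6, 0.6, 0.0, 0.1],
        [0.6, 0.6, 0.1, 0.0]]
support = noisy_clustering(dist, labels)
print(support[frozenset({"A1", "A2"})])
```

A cluster that survives in nearly all runs is considered reliable; the percentage plays the role of the support values attached to dendrogram branches.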

SLIDE 39

Clustering

[Figure: dendrogram of the 16 texts with values of 92 or 100 on the branches and a distance axis from 0.0 to 0.6; leaf order: AG1147, LU1195, AG1063, LU1261, BE1142, BE1170, BL1073, GL1048, GL1207, ZH1143, ZH1270, SZ1209, NW1007, BS1057, SG1198, VS1212]

- Texts from the same canton (same two-letter prefix) are clustered together with high reliability.
- The outliers at the bottom are in line with geographical and dialectological knowledge.

SLIDE 40

Clustering

[Figure: the 16 texts coloured by cluster; labels: BS1057, BL1073, BE1142, BE1170, VS1212, SG1198, GL1207, GL1048, SZ1209, NW1007, LU1195, LU1261, ZH1143, ZH1270, AG1147, AG1063]

- Three-fold East-West stratification
- Outliers in red and grey

SLIDE 41

Multidimensional scaling

[Figure: multidimensional scaling results for the 16 texts, shown together with the text locations; labels: BS1057, BL1073, BE1142, BE1170, VS1212, NW1007, GL1207, GL1048, ZH1143, SZ1209, SG1198, ZH1270, LU1261, LU1195, AG1147, AG1063]

Three major areas: Central-Northwest, East, Southwest

SLIDE 45

Conclusion

Summary:

- A simple method for approximating the linguistic distance of two unaligned texts:
  - Ratio of identical words among the cognate word pairs
  - Operationalized with fixed thresholds of normalized Levenshtein distance
- Application to a corpus of 16 Swiss German dialect texts:
  - Visualization with clustering and multidimensional scaling yields dialect landscapes that are compatible with geographic and dialectological knowledge

SLIDE 46

Conclusion

Limitations:

- Potential scalability issues
  - Each word of each text is compared with each word of every other text
- More sophisticated variants of Levenshtein distance could be used
  - Vowels vs. consonants
  - Diacritics
- Alignment technique relies on graphemic similarity alone
  - Distributional word alignment techniques could add a semantic similarity criterion
