Introduction to Dialectometry
Wilbert Heeringa
Språkbanken, University of Gothenburg, 30 January 2019
What is dialectometry?
The measure of dialect (Jean Séguy).
Why dialectometry?
peoples, and internal cultural divisions.
planners.
Isogloss method
Nucleus in ripe: [rip(@)] (west) [rE;p] (central) [rip(@)] (east)
Coda in cold: [kO;ut] (west) [kO;lt] (east)
Nucleus in ripe & coda in cold
Isogloss method
Overlay the isogloss maps of 14 phenomena:
1 [VErk] vs. [VEr@k]
2 [splInt@r] vs. [splInt@ö]
3 [kni] vs. [kne:] vs. [knE:i] vs. [knIb@l]
4 [zi;n] vs. [@zi;n] vs. [G@zi;n] vs. [j@zi;n]
5 [ste;n"] vs. [ste;n@] vs. [stI;@s]
6 [me:st@r] vs. [mi;@st@r] vs. [mE;st@r]
7 [rip] vs. [rE;ip]
8 [zEs] vs. [sEs] vs. [sEz]
9 [kO;ut] vs. [kO;lt]
10 [ro:zn"] vs. [ro:z@n] vs. [ro:z@]
11 [lAd@r] vs. [li;@r(@)]
12 [bru:r] vs. [brœ:ij@r] vs. [bru;r@]
13 [brYx] vs. [brYG(@)] vs. [brYg]
14 [blO;w] vs. [blA:t]
Isoglosses of 14 phenomena. Isogloss bundles represent dialect boundaries.
Isogloss method
Dialectometry
We need methodology that:
Use dialectometry?
Relative difference value
Séguy.
number of items on which two dialects differ, expressed in a percentage.
Relative difference value
         Middelstum   Ommen       differs
friend   kAm@rU;tˇ    kAm@rO:t
ship     sxIp         sxIp
far      v˚E:r        Vitˇ        1
are      bIn"         bInt
still    nOx          nOx
push     stø;tn"      drYkˇN"     1
total                             2
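As a minimal sketch of the arithmetic, assuming the relative difference value is simply the share of items realized with different lexemes, the Middelstum/Ommen comparison gives 2 differences out of 6 items. The function name and the 0/1 encoding are illustrative and not taken from the slides.

```python
def relative_difference_value(differs):
    """Percentage of items on which two dialects differ."""
    return 100.0 * sum(differs) / len(differs)


# 1 = the two dialects use different lexemes, 0 = the same lexeme.
# Order: friend, ship, far, are, still, push (as marked in the table).
middelstum_vs_ommen = [0, 0, 1, 0, 0, 1]

rdv = relative_difference_value(middelstum_vs_ommen)
print(f"{rdv:.1f}%")
```

Under this reading, Middelstum and Ommen differ on one third of the items.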
Relative difference value
Value (RIV).
Weighted difference value
between two local dialects than similarity in common lexemes.
value’.
Weighted difference value
(353), boot (2), lager (1), schuit (4).
In terms of distances:
schip vs. schip: 353/360 = 0.981
schuit vs. schuit: 4/360 = 0.011
boot vs. boot: 2/360 = 0.006
Weighted difference value
calculate the lexical weighted difference value between Middelstum and Ommen on the basis of 6 words:

         Middelstum   Ommen       frequency   weight
friend   kAm@rU;tˇ    kAm@rO:t    140/354     0.40
ship     sxIp         sxIp        353/360     0.98
far      v˚E:r        Vitˇ                    1
are      bIn"         bInt        176/360     0.49
still    nOx          nOx         354/355     1.00
push     stø;tn"      drYkˇN"                 1
total                                         4.87
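The sum for Middelstum and Ommen can be retraced with a short sketch. It assumes that an item with a shared lexeme is weighted by that lexeme's frequency fraction and that an item with different lexemes gets weight 1, as the figures suggest. The names `lexeme_weight` and `items` are illustrative.

```python
def lexeme_weight(shared_count, total_count):
    """Weight for an item on which both dialects use the same lexeme."""
    return shared_count / total_count


# (frequency of the shared lexeme, denominator) as given for each word;
# None marks an item on which the two dialects use different lexemes.
items = {
    "friend": (140, 354),
    "ship": (353, 360),
    "far": None,
    "are": (176, 360),
    "still": (354, 355),
    "push": None,
}

weights = {
    word: 1.0 if freq is None else lexeme_weight(*freq)
    for word, freq in items.items()
}

exact_sum = sum(weights.values())
rounded_sum = sum(round(w, 2) for w in weights.values())
print(f"{exact_sum:.2f} {rounded_sum:.2f}")
```

Summing the unrounded weights gives about 4.86. The total of 4.87 results from adding the weights after rounding each to two decimals.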
Levenshtein distance
Grouw        mlk
Groningen    mlk
Haarlem      mlk
Almelo       mlk
Alveringem   mæk
Renesse      mælk
Polsbroek    mlk
Mechelen     mlk
Venray       mlk
Kerkrade     mlx
How to quantify differences between the dialect pronunciations?
Levenshtein distance
Bulgarian and Bantu dialect/language varieties by others.
Levenshtein distance
in the dialect of Grouw.
mEl@k    substitute E by O    1
mOl@k    delete @             1
mOlk     insert @             1
mOlk@
total                         3
cheapest mapping.
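The cost of the cheapest mapping can be computed with the standard dynamic-programming formulation. The sketch below assumes unit costs for insertion, deletion and substitution and treats every character of the transcription as one segment. It is an illustration, not the implementation used for the maps in this presentation.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions (cost 1 each)."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute s by t (free if equal)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("mEl@k", "mOlk@")
print(distance)
```

For mEl@k and mOlk@ this yields 3, matching the substitution, deletion and insertion counted above.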
Levenshtein distance
1   2   3   4   5   6
m   E   l   @   k
m   O   l       k   @
    1       1       1
alignment.
plausible.
Alignment
which:
Levenshtein distance
a set of words.
distance?
Calculating the aggregate
Ommen on the basis of 6 words:

         Middelstum   Ommen     sum of weights   length of alignment
ship     sxIp         sxIp                       4
cap      pEt          pEt@      1                4
called   rOupm        @rupm     2                6
jump     sprIN        sprINkt   2                7
cellar   kEl@r        kEld@r    1                6
house    hus          hys       1                3
total                           7                30
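A small sketch of the aggregation, assuming the aggregate distance is the total of the operation weights divided by the total alignment length, as the two column totals suggest. The per-word figures are copied from the Middelstum/Ommen comparison; the variable names are illustrative.

```python
# word: (sum of operation weights, length of alignment)
alignments = {
    "ship": (0, 4),
    "cap": (1, 4),
    "called": (2, 6),
    "jump": (2, 7),
    "cellar": (1, 6),
    "house": (1, 3),
}

total_cost = sum(cost for cost, _ in alignments.values())
total_length = sum(length for _, length in alignments.values())
aggregate = total_cost / total_length

print(f"{total_cost}/{total_length} = {aggregate:.1%}")
```

Dividing by the alignment length keeps long words from dominating the aggregate merely because they offer more positions at which to differ.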
Operation weights
PMI-based Levenshtein distance
Prokić and John Nerbonne in 2009.
repeat
with binary weights, later times with newly calculated weights).
in an alignment, the smaller the distance weight.
until weights do not change any more.
Prokić and Nerbonne (2009).
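The weight update inside the loop can be illustrated with pointwise mutual information computed over aligned segment pairs. The sketch assumes PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from pair counts collected from the current alignments. The toy counts are invented, pairs are treated as ordered for simplicity, and the conversion of PMI values into distance weights is left out.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """PMI for every segment pair observed in a list of aligned (x, y) pairs."""
    pair_counts = Counter(aligned_pairs)
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    return {
        (x, y): math.log2((n / total) / ((left[x] / total) * (right[y] / total)))
        for (x, y), n in pair_counts.items()
    }


# Hypothetical counts: E aligns mostly with O, i aligns mostly with a.
pairs = [("E", "O")] * 6 + [("E", "a")] + [("i", "O")] + [("i", "a")] * 6
pmi = pmi_table(pairs)

print(round(pmi[("E", "O")], 3), round(pmi[("E", "a")], 3))
```

Pairs that are aligned more often than chance receive a higher PMI, which corresponds to a smaller distance weight in the next round of alignment.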
Application
ee.
Distribution of the 361 dialects in the Dutch dialect area.
Beam maps
Beam maps: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right).
Honeycomb maps
by darker lines.
Honeycomb maps: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right).
Cluster analysis
group (called a cluster) are more similar to each other than to those in other groups (clusters) (definition Wikipedia).
is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy (S.C. Johnson 1967)
and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets (Wikipedia).
then the distance between the set that contains A and B, and K, is (2 + 3)/2 = 2.5.
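The averaging step in this example can be written out as a short sketch. It assumes an average-linkage criterion, with the distances A to K = 2 and B to K = 3 inferred from the calculation above. The A to B distance and the function name are hypothetical.

```python
def average_linkage(cluster_a, cluster_b, distance):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)


distance = {
    frozenset(("A", "B")): 1.0,  # hypothetical: A and B were merged first
    frozenset(("A", "K")): 2.0,
    frozenset(("B", "K")): 3.0,
}

d = average_linkage({"A", "B"}, {"K"}, distance)
print(d)
```

Other linkage criteria replace the mean with a different function of the same pairwise distances.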
Linkage criteria
Dendrograms obtained by this method reflect the original distances in the distance matrix most closely (cophenetic correlation coefficient). Best choice in general.
Is recommended in case of an irregular sampling distribution.
Minimizes the variance in the clusters; results in a well-balanced dendrogram in which all clusters have about the same size.
Dendrograms obtained by clustering: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right). The most significant groups are distinguished by different colors. The labels are omitted.
Dialect areas obtained by clustering: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right).
Multidimensional scaling
So n dimensions are reduced to two or three.
Using MDS the 361 dimensions are reduced to 2. They explain 84.3% (left: lexical relative difference values), 45.0% (middle: lexical weighted difference values) and 51.9% (right: pronunciation Levenshtein distances) of the variance in the original distances. Labels are
Using MDS the 361 dimensions are reduced to 3. They explain 89.5% (left: lexical relative difference values), 50.7% (middle: lexical weighted difference values) and 88.4% (right: pronunciation Levenshtein distances) of the variance in the original distances. Labels are
Multidimensional scaling
value for x, y and z.
blue.
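Turning three MDS coordinates into a color can be sketched as a min-max rescaling of each dimension to the range 0 to 255. The assignment of x, y and z to red, green and blue, the optional inversion and the sample coordinates are assumptions made for illustration.

```python
def scale_to_byte(values, invert=False):
    """Rescale one dimension linearly to integers in 0..255."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for a constant dimension
    scaled = [round(255 * (v - low) / span) for v in values]
    return [255 - s for s in scaled] if invert else scaled


def mds_to_rgb(coords, invert=(False, False, False)):
    """Map (x, y, z) coordinates to (red, green, blue) triples."""
    channels = [scale_to_byte(dim, inv) for dim, inv in zip(zip(*coords), invert)]
    return list(zip(*channels))


# Hypothetical three-dimensional MDS coordinates for three locations.
coords = [(-1.0, 0.0, 2.0), (0.0, 1.0, 0.0), (1.0, 2.0, -2.0)]
colors = mds_to_rgb(coords)
print(colors)
```

Locations that lie close together in the MDS space receive similar colors, so gradual transitions on the map reflect a dialect continuum.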
Three color dimensions for the pronunciation Levenshtein distances. Red represents the x-axis, green the inverted z-axis, and blue the inverted y-axis.
Now we overlay the three maps.
Map 3 major MDS dimensions to red, green and blue. Dialect islands (mainly town Frisian varieties) are marked with a diamond, and only these diamonds are colored.
Dialect continua obtained by using multidimensional scaling: lex. relative difference values (left), lex. weighted difference values (middle) and pron. Levenshtein distances (right).
Fuzzy clustering
large changes in the clustering results.
(2004): add noise (i.e. small random values, one or two standard deviations) to the measurements and perform cluster analysis.
stable ones.
Fuzzy clustering
each cluster was encountered in the repeated clustering with noise (see also Nerbonne et al. 2008).
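Repeated clustering with noise can be sketched in a few lines. The example assumes average-linkage clustering, uniform noise of at most `noise_sd` standard deviations of the distances, and a toy distance table. All names and parameter values are illustrative rather than taken from the cited work.

```python
import random
from collections import Counter
from itertools import combinations


def upgma_clusters(dist, labels):
    """Average-linkage clustering; returns every cluster formed by a merge."""
    clusters = [frozenset([label]) for label in labels]
    formed = []

    def between(a, b):
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        formed.append(a | b)
    return formed


def cluster_support(dist, labels, runs=100, noise_sd=1.0, seed=0):
    """Percentage of noisy clustering runs in which each cluster appears."""
    rng = random.Random(seed)
    values = list(dist.values())
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    counts = Counter()
    for _ in range(runs):
        noisy = {pair: v + rng.uniform(0, noise_sd * sd) for pair, v in dist.items()}
        counts.update(upgma_clusters(noisy, labels))
    return {cluster: 100.0 * n / runs for cluster, n in counts.items()}


# Toy data: A and B are close, C and D are close, the two pairs are far apart.
labels = ["A", "B", "C", "D"]
dist = {frozenset(pair): 10.0 for pair in combinations(labels, 2)}
dist[frozenset(("A", "B"))] = 1.0
dist[frozenset(("C", "D"))] = 1.0

support = cluster_support(dist, labels)
print(support[frozenset("AB")], support[frozenset("CD")])
```

In this toy case both pairs are recovered in every run. With real dialect distances, clusters supported by only a small share of the runs are the unstable ones.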
Figure: dendrogram of a sample of dialects, with at each cluster the percentage of repeated clusterings in which it was encountered.
What is Gabmap?
Doing dialect analysis on the web
freely distributed since 2004 (the maps shown in this presentation were also made with RuG/L04).
2011.
What is Gabmap?
http://www.let.rug.nl/~kleiweg/L04/webapp
Çağrı Çöltekin
http://www.gabmap.nl/ and maintained by Martijn Wieling.
https://github.com/pebbe/Gabmap-docker
Input
1) a map, 2) dialect data, 3) a feature definition file
Input: map
names should be spelled exactly as in your data file!
Input: map
http://www.let.rug.nl/~kleiweg/L04/kml/manual.html and with Google Maps: http://coltekin.net/cagri/courses/leuven/
Input: dialect data
https://westonruter.github.io/ipa-chart/keyboard/ for finding the Unicode characters.
Input: dialect data
encoded as Unicode (UTF-8 or UTF-16).
UTF-16) and the tab as separator.
categorical data.
Help in Gabmap.
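Before uploading, it can help to check that a data file parses as tab-separated Unicode text. The sketch below assumes a layout with one row per place and one column per item, with the item names in the first row. That layout and the sample content are assumptions for illustration; the Help in Gabmap describes the actual requirements.

```python
import csv
import io

# Hypothetical sample; a real file would be opened with
# open(path, encoding="utf-8", newline="") instead of io.StringIO.
sample = "\tship\thouse\nMiddelstum\tsxIp\thus\nOmmen\tsxIp\thys\n"


def read_dialect_table(handle):
    """Read a tab-separated table into {place: {item: transcription}}."""
    # QUOTE_NONE keeps quote characters in transcriptions as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    items = next(reader)[1:]
    return {row[0]: dict(zip(items, row[1:])) for row in reader if row}


table = read_dialect_table(io.StringIO(sample))
print(table["Ommen"]["house"])
```

The place names read this way can then be compared with the names in the map file, since the two must be spelled identically.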
Input: feature definition file
http://www.wjheeringa.nl/courses/dialectometry/datasets/IPA.def.
and allows that:
Input: feature definition file
the weight is 0.3.
primary stress, secondary stress, minor (foot) group, major (intonation) group, syllable break, linking (absence of a break).
However, be careful when changing IPA.def.
Running Gabmap