Introduction to Dialectometry II
Wilbert Heeringa
German Academic Exchange Service – DAAD
University of Bielefeld, Faculty of Linguistics and Literary Studies
Frisian Academy
Abidjan, December 19–23, 2016
Topics
Validation of distance measures
Consistency of distance measures
Quality of classifications
Cluster algorithms
Fuzzy clustering
Cophenetic multidimensional scaling maps
Reference point maps
2
3
Experiment
dialects.
http://www.ling.hf.ntnu.no/nos/
4
Experiment
texts.
5
Bergen Bjugn Bodø Bø Borre Fræna Halden Herøy Larvik Lesja Lillehammer Stjørdal Time Trondheim Verdal
The geographic distribution of the 15 Norwegian dialects.
6
Experiment
to his own dialect.
7
Experiment
             Be   Bj   Bod  Bø   Bor  Fr   Ha   He   La   Le   Li   St   Ti   Tr   Ve
Bergen       1.7  9.0  8.2  8.0  7.7  7.7  8.2  6.9  8.0  8.9  8.5  8.4  4.8  8.5  8.0
Bjugn        9.1  3.4  6.4  8.2  9.2  5.8  8.3  8.0  8.4  7.3  9.1  2.2  8.0  3.3  2.8
Bodø         8.7  7.9  1.5  8.3  8.3  6.6  7.9  7.8  7.3  8.0  8.7  6.6  8.1  6.2  6.3
Bø           8.1  7.8  7.5  1.0  7.7  8.1  4.9  7.8  5.3  6.0  5.1  7.1  6.3  8.2  8.6
Borre        6.1  8.8  7.8  6.5  1.7  8.5  1.8  7.5  1.6  7.5  2.0  7.2  7.5  8.5  9.1
Fræna        9.0  7.5  7.1  8.4  8.8  3.1  8.1  7.8  8.5  7.2  9.0  6.6  7.4  6.1  7.6
Halden       7.0  8.2  8.0  6.8  4.0  8.1  2.8  7.9  2.8  6.6  3.0  7.4  7.0  8.0  8.3
Herøy        8.6  9.3  8.4  8.5  9.1  7.0  8.6  1.2  9.3  9.3  9.4  8.5  7.5  7.5  8.2
Larvik       7.4  8.7  7.6  4.0  4.0  7.7  3.2  5.6  3.4  7.1  4.6  8.2  6.8  8.3  7.5
Lesja        8.5  7.6  7.8  7.4  8.2  7.3  7.6  7.7  7.6  1.0  7.1  6.9  7.2  7.7  8.2
Lillehammer  6.7  8.3  8.1  6.2  4.4  8.0  3.1  7.5  4.1  7.3  2.7  7.6  6.8  8.7  8.1
Stjørdal     8.7  3.7  6.8  7.7  8.1  6.0  7.5  7.7  8.3  7.1  8.3  2.0  7.7  3.8  3.4
Time         7.0  9.3  8.4  8.1  8.4  8.3  8.0  7.2  8.2  9.1  8.8  8.8  1.8  8.8  9.0
Trondheim    7.8  5.8  6.7  7.5  6.4  7.3  6.0  7.1  5.9  7.9  6.3  4.4  7.6  3.3  6.8
Verdal       8.8  3.4  6.4  8.2  8.4  5.7  7.2  7.9  7.9  7.4  8.4  1.8  7.9  3.1  2.6
Perceptual distances among 15 Norwegian dialect varieties. Row names represent listener groups, column names represent dialect speakers.
8
Average perceptual distances between 15 Norwegian dialects. Darker lines connect closer points, lighter lines more remote ones.
Distance pairs A–B / B–A are averaged.
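The averaging of the two directions can be sketched as follows. This is a minimal illustration only: the function name `symmetrize` and the nested-dictionary layout are assumptions made for the example. The two values come from the table of perceptual distances (Bergen listeners judging the Bjugn speaker: 9.0; Bjugn listeners judging the Bergen speaker: 9.1).

```python
def symmetrize(ratings):
    """Average the A-B and B-A judgments of an asymmetric rating table.

    `ratings[listener][speaker]` holds the mean judgment of one listener
    group about one speaker (hypothetical layout for this sketch).
    """
    names = sorted(ratings)
    averaged = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            averaged[(a, b)] = (ratings[a][b] + ratings[b][a]) / 2
    return averaged


ratings = {
    "Bergen": {"Bergen": 1.7, "Bjugn": 9.0},
    "Bjugn": {"Bergen": 9.1, "Bjugn": 3.4},
}
distances = symmetrize(ratings)
print(distances[("Bergen", "Bjugn")])
```

The diagonal values (a group judging its own dialect) are not used here, since only distances between different varieties are averaged.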
9
Experiment
among the 15 local dialect varieties.
How well do the dialectometric distances correlate with the perceptual distances?
10
Correlations (1)
lexical                      r      expl. var.
relative difference value    0.27    7%
weighted difference value    0.37   14%

pronunciation aggregate      r      expl. var.
Levenshtein (1)              0.71   50%
Levenshtein (2)              0.70   49%
Levenshtein (3)              0.67   45%
Levenshtein PMI (1)          0.71   50%
Levenshtein PMI (3)          0.67   45%
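How such a correlation and the explained variance could be computed is sketched below. It assumes both distance sets are stored as symmetric square matrices, so that only the values above the diagonal are compared. The helper names and the toy matrices are invented for the example and are not the Norwegian data.

```python
import math


def upper_triangle(matrix):
    """Return the values above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def explained_variance(r):
    """Percentage of variance explained: r squared times 100."""
    return r * r * 100


# Toy 4 x 4 matrices (invented numbers for illustration).
perceptual = [[0, 2, 6, 7], [2, 0, 5, 8], [6, 5, 0, 3], [7, 8, 3, 0]]
measured = [[0, 1, 5, 6], [1, 0, 4, 7], [5, 4, 0, 2], [6, 7, 2, 0]]
r = pearson_r(upper_triangle(perceptual), upper_triangle(measured))
print(round(r, 2), round(explained_variance(0.71)))
```

Applied to the r values in the table, `explained_variance` gives the listed percentages after rounding, for example 0.71 squared times 100 is about 50.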
11
Correlations (2)
Suprasegmentals and diacritics are ignored.
by PMI Levenshtein are better, see Wieling, Prokić and Nerbonne (2009).
12
Left: perceptual distances. Right: lexical weighted difference value distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.37
13
Left: perceptual distances. Right: non-normalized Levenshtein distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.71.
14
15
Consistency
words in the data set give the same signal of linguistic relationships between the dialects: measure Cronbach’s Alpha.
this example we normalize Levenshtein distances per word pair.
16
Levenshtein distances between pronunciations of the word seen in Grouw, Haarlem and Almelo.
17
Levenshtein distances between pronunciations of the word hart in Grouw, Haarlem and Almelo.
18
Levenshtein distances between pronunciations of the word son in Grouw, Haarlem and Almelo.
19
Levenshtein distances between pronunciations of the word house in Grouw, Haarlem and Almelo.
20
Consistency
and relatively distant to Grouw.
                      seen   hart   son   house
Grouw vs. Haarlem      71     25    100     75
Grouw vs. Almelo       83     25     75     33
Haarlem vs. Almelo     60     20     50     50
21
Consistency
                  r      n
seen vs. hart     0.85   3
seen vs. son      0.48   3
seen vs. house           3
hart vs. son      0.87   3
hart vs. house    0.11   3
son vs. house     0.59   3
22
Consistency
inter-correlation among the words: α = (nw × r̄) / (1 + (nw − 1) × r̄), where nw is the number of words, which in our example is 4.
α = (4 × 0.41) / (1 + (4 − 1) × 0.41) = 0.74
alpha is 1; if there is no consistency between the words in the data set, the value is 0.
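The worked example can be retraced with a short script. It is a sketch under the assumption that the inter-correlations are plain Pearson correlations over the three dialect pairs; the input values are the normalized distances from the table for seen, hart, son and house, and the function names are chosen for this example.

```python
import math
from itertools import combinations

# Normalized Levenshtein distances (percentages) per word.
# Each list holds: Grouw vs. Haarlem, Grouw vs. Almelo, Haarlem vs. Almelo.
word_distances = {
    "seen": [71, 83, 60],
    "hart": [25, 25, 20],
    "son": [100, 75, 50],
    "house": [75, 33, 50],
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cronbach_alpha(columns):
    """Return (alpha, mean inter-correlation) for a list of word columns."""
    pairs = list(combinations(columns, 2))
    mean_r = sum(pearson_r(a, b) for a, b in pairs) / len(pairs)
    n_w = len(columns)
    alpha = n_w * mean_r / (1 + (n_w - 1) * mean_r)
    return alpha, mean_r


alpha, mean_r = cronbach_alpha(list(word_distances.values()))
print(round(mean_r, 2), round(alpha, 2))  # 0.41 0.74
```

With these inputs the script arrives at the same mean inter-correlation (0.41) and the same α (0.74) as the calculation above. The seen vs. house correlation computed along the way is negative for these numbers.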
23
Consistency
24
Consistency
[Figure: two plots of Cronbach's alpha (vertical axis) against the number of words (horizontal axis).]
Left: Cronbach’s α values for random subsets of 2 through 107 words (lexical weighted difference values) and 360 local dialects. From 86 words on α is always higher than 0.70. For 107 words α is equal to 0.75. Right: Cronbach’s α values for random subsets of 2 through 125 words (Levenshtein distance) and 360 local
25
26
Quality of classifications
to original distances.
27
Cophenetic distances
branches.
[Figure: dendrogram of Grouw, Delft, Haarlem, Hattem and Lochem; distance scale 10 to 40.]
direction within the shortest path.
28
Cophenetic distances
           Grouw   Haarlem   Delft   Hattem   Lochem
Grouw              44        44      44       44
Haarlem    44                16      36.25    36.25
Delft      44      16                36.25    36.25
Hattem     44      36.25     36.25            20
Lochem     44      36.25     36.25   20
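The table can be derived mechanically from the dendrogram. The sketch below assumes the dendrogram is given as a list of merges with their heights, read off from the table (16, 20, 36.25 and 44), and takes the cophenetic distance of two varieties to be the height at which they first end up in the same cluster. The data layout and function name are assumptions for this example.

```python
def cophenetic_distances(merges):
    """Return cophenetic distances for a dendrogram given as a list of merges.

    Each merge is (left_members, right_members, height). Two varieties get
    the height of the merge that first joins them.
    """
    result = {}
    for left, right, height in merges:
        for a in left:
            for b in right:
                result[frozenset((a, b))] = height
    return result


merges = [
    (["Haarlem"], ["Delft"], 16),
    (["Hattem"], ["Lochem"], 20),
    (["Haarlem", "Delft"], ["Hattem", "Lochem"], 36.25),
    (["Grouw"], ["Haarlem", "Delft", "Hattem", "Lochem"], 44),
]
coph = cophenetic_distances(merges)
print(coph[frozenset(("Delft", "Lochem"))])  # 36.25
```

Correlating these values with the original pairwise distances gives the cophenetic correlation coefficient.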
29
Cophenetic distances
between local dialects as suggested by the dendrogram preserve the original pairwise distances.
0.99
is r² × 100 = 97.6%.
30
Interpoint multidimensional scaling distances
space so that the distances are preserved as well as possible:
[Figure: two-dimensional multidimensional scaling plot of Grouw, Haarlem, Delft, Hattem and Lochem; axes: first dimension, second dimension.]
31
Interpoint multidimensional scaling distances
           Grouw   Haarlem   Delft   Hattem   Lochem
Grouw              41.9      42.9    44.6     45.4
Haarlem    41.9              2.5     33.1     35.6
Delft      42.9    2.5               35.6     38.1
Hattem     44.6    33.1      35.6             2.5
Lochem     45.4    35.6      38.1    2.5
0.99
is r² × 100 = 98.0%.
34
35
Data source
ee.
36
Analysis
Levenshtein distances divided by the sum of the alignment lengths times 100.
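One possible reading of this normalization is sketched below: the Levenshtein distances of all word pairs are summed, divided by the summed alignment lengths, and multiplied by 100. The sketch assumes unit costs for insertions, deletions and substitutions, and it assumes that the longest of the cheapest alignments defines the alignment length; neither detail is stated above. The function names and the toy strings are invented for the example.

```python
def levenshtein_with_length(a, b):
    """Return (distance, alignment length) for two segment sequences.

    Insertions, deletions and substitutions cost 1. Among all cheapest
    alignments the longest one is used (an assumption of this sketch).
    """
    # Each cell stores (cost, -length), so min() prefers the lowest cost
    # and, on ties, the longest alignment.
    prev = [(j, -j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [(i, -i)]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            candidates = [
                (prev[j - 1][0] + sub, prev[j - 1][1] - 1),  # match or substitution
                (prev[j][0] + 1, prev[j][1] - 1),            # deletion
                (curr[j - 1][0] + 1, curr[j - 1][1] - 1),    # insertion
            ]
            curr.append(min(candidates))
        prev = curr
    cost, neg_length = prev[-1]
    return cost, -neg_length


def aggregate_distance(word_pairs):
    """Summed distances divided by summed alignment lengths, times 100."""
    total_cost = 0
    total_length = 0
    for a, b in word_pairs:
        cost, length = levenshtein_with_length(a, b)
        total_cost += cost
        total_length += length
    return total_cost / total_length * 100


# Toy strings standing in for phonetic transcriptions of two varieties.
pairs = [("abc", "abd"), ("ab", "ba")]
print(aggregate_distance(pairs))
```

In practice the inputs would be sequences of phonetic segments rather than plain strings; any sequence type with indexing works with this function.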
37
Cluster algorithms
This is done by a matrix updating algorithm.
38
Cluster algorithms
between cluster ij and a cluster k we need (partially) the following data:
ni: number of varieties in cluster i;
dki: the distance between k and i;
nj: number of varieties in cluster j;
dkj: the distance between k and j;
nk: number of varieties in cluster k;
dij: the distance between i and j.
39
Single link
Nearest neighbor
dk[ij] = minimum(dki, dkj)
40
Complete link
Furthest neighbor
dk[ij] = maximum(dki, dkj)
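The matrix updating procedure with the single link and complete link rules can be sketched as below. The dictionary-based distance matrix, the function names and the four toy varieties A to D are assumptions for this illustration. The `update` argument receives only dki and dkj, which is all these two rules need.

```python
def single_link(d_ki, d_kj):
    """Nearest neighbor: keep the smaller of the two distances."""
    return min(d_ki, d_kj)


def complete_link(d_ki, d_kj):
    """Furthest neighbor: keep the larger of the two distances."""
    return max(d_ki, d_kj)


def cluster(dist, update):
    """Agglomerative clustering by repeated matrix updating.

    `dist` maps (a, b) name pairs to distances. Returns the merges as
    (cluster_i, cluster_j, distance) in the order they were made.
    """
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items()}
    clusters = sorted({c for pair in d for c in pair})
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        i, j = sorted(pair)
        merged = i + j
        merges.append((i, j, d[pair]))
        # Replace the rows of i and j by one row for the fused cluster [ij].
        for k in clusters:
            if k not in (i, j):
                d[frozenset((k, merged))] = update(
                    d.pop(frozenset((k, i))), d.pop(frozenset((k, j)))
                )
        del d[pair]
        clusters = [k for k in clusters if k not in (i, j)] + [merged]
    return merges


toy = {("A", "B"): 1, ("C", "D"): 2, ("A", "C"): 5,
       ("A", "D"): 6, ("B", "C"): 4, ("B", "D"): 7}
print([h for _, _, h in cluster(toy, single_link)])    # [1, 2, 4]
print([h for _, _, h in cluster(toy, complete_link)])  # [1, 2, 7]
```

Both rules fuse the same pairs here, but the final fusion happens at the smallest between-group distance under single link and at the largest under complete link.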
41
[Figure: dendrograms obtained with single link (distance scale 5 to 25) and complete link (distance scale 10 to 40).]
42
Group average
Unweighted Pair Group Method using Arithmetic averages (UPGMA)
dk[ij] = (ni / (ni + nj)) × dki + (nj / (ni + nj)) × dkj
larger cluster will influence the distances more strongly than the smaller cluster.
matrix most closely (cophenetic correlation coefficient). Best choice in general.
43
Weighted average
Weighted Pair Group Method using Arithmetic averages (WPGMA)
dk[ij] = (1/2 × dki) + (1/2 × dkj)
clusters influence the distances to the same extent, regardless of their sizes.
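The difference between the two averaging rules shows up as soon as the fused clusters differ in size. The following sketch implements both update formulas as plain functions; the function names and the numbers (a cluster i of three varieties, a cluster j of one variety) are invented for illustration.

```python
def group_average(n_i, n_j, d_ki, d_kj):
    """UPGMA update: each cluster is weighted by its number of varieties."""
    return (n_i / (n_i + n_j)) * d_ki + (n_j / (n_i + n_j)) * d_kj


def weighted_average(d_ki, d_kj):
    """WPGMA update: both clusters count equally, whatever their size."""
    return 0.5 * d_ki + 0.5 * d_kj


# Cluster i has 3 varieties at distance 10 from k; cluster j has 1 at distance 20.
print(group_average(3, 1, 10.0, 20.0))  # 12.5
print(weighted_average(10.0, 20.0))     # 15.0
```

With group average the result is pulled toward the distance of the larger cluster; with weighted average it lies exactly halfway.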
44
[Figure: dendrograms obtained with group average and weighted average; both distance scales run from 10 to 40.]
45
Ward’s method
dk[ij] = ((nk + ni) / (nk + ni + nj)) × dki + ((nk + nj) / (nk + ni + nj)) × dkj − (nk / (nk + ni + nj)) × dij
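As a sketch, the update formula translates directly into a function. The function name and the example values are assumptions for illustration. Many descriptions of Ward's method apply this update to squared distances; whether that is done here is not stated, so the sketch simply takes the values as given.

```python
def ward_update(n_k, n_i, n_j, d_ki, d_kj, d_ij):
    """Distance between cluster k and the fused cluster [ij] (Ward's update)."""
    total = n_k + n_i + n_j
    return (
        ((n_k + n_i) / total) * d_ki
        + ((n_k + n_j) / total) * d_kj
        - (n_k / total) * d_ij
    )


# Three single varieties: d_ki = 4, d_kj = 6, d_ij = 2.
print(round(ward_update(1, 1, 1, 4.0, 6.0, 2.0), 6))
```

Unlike the previous rules, this update also uses nk and dij, so the distance between the two fused clusters lowers the new distance to k.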
46
Ward’s method
[Figure: dendrogram obtained with Ward's method; distance scale 50 to 150.]
47
Validation
                    coph. cor.
Single link         0.74
Complete link       0.81
Group average       0.83
Weighted average    0.82
Ward's method       0.67
48
49
Fuzzy clustering
large changes in the clustering results.
(one or two standard deviations).
stable ones.
50
Fuzzy clustering
each cluster was encountered in the repeated clustering with noise.
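The idea of repeated clustering with noise can be sketched as follows. Several details are assumptions of this example and not taken from the slides: the noise is Gaussian with a standard deviation of `noise_sd` times the standard deviation of the distances, group average is used as the cluster algorithm, and the function names and toy distances are invented.

```python
import random
import statistics
from collections import Counter


def group_average_clusters(names, dist):
    """Cluster with group average and return every cluster that was formed."""
    clusters = [frozenset([name]) for name in names]
    found = set()

    def between(c1, c2):
        # Group average distance: mean of all pairwise distances between
        # the members of the two clusters.
        return statistics.mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]]
        c1, c2 = min(pairs, key=lambda pair: between(*pair))
        merged = c1 | c2
        found.add(merged)
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return found


def cluster_support(names, dist, runs=100, noise_sd=1.0, seed=0):
    """Percentage of noisy runs in which each cluster was encountered."""
    rng = random.Random(seed)
    sd = statistics.pstdev(list(dist.values()))
    counts = Counter()
    for _ in range(runs):
        noisy = {pair: v + rng.gauss(0, noise_sd * sd) for pair, v in dist.items()}
        counts.update(group_average_clusters(names, noisy))
    return {cluster: 100 * n / runs for cluster, n in counts.items()}


names = ["A", "B", "C", "D"]
raw = {("A", "B"): 1.0, ("C", "D"): 1.0, ("A", "C"): 100.0,
       ("A", "D"): 100.0, ("B", "C"): 100.0, ("B", "D"): 100.0}
dist = {frozenset(pair): value for pair, value in raw.items()}
support = cluster_support(names, dist, runs=50)
for members, pct in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(members), pct)
```

Clusters that survive most of the noisy runs receive a high percentage; clusters that depend on small distance differences receive a low one.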
51
[Figure: dendrogram in which each cluster is labeled with the percentage of repeated clusterings in which it was encountered; distance scale 10 to 40.]
52
53
Cophenetic multidimensional scaling maps
the branch lengths of the dendrogram.
54
Left: ‘original’ pronunciation distances measured with Levenshtein distance between 361 local Dutch dialects. Right: cophenetic distances. Correlation: r = 0.84. Percentage of variance explained by the cophenetic distances: 0.84² × 100 = 71.2%
55
Cophenetic multidimensional scaling maps
dialects.
color continuum map.
a mix of a cluster map and a multidimensional scaling map.
56
Left: six most significant groups found by cluster analysis. Middle: cophenetic multidimensional scaling map. Right: multidimensional scaling map obtained on the basis of original distances.
57
58
Reference point maps
etc.
to the reference point are red, and most distant dialects are blue. The following maps show pronunciation distances measured with Levenshtein distance.
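The coloring scheme can be illustrated with a small sketch. It assumes a linear interpolation between pure red (closest) and pure blue (most distant); only the end points are given above, so the interpolation, the function name and the example distances are assumptions.

```python
def reference_colors(distances):
    """Map distances to a reference point onto a red-to-blue scale.

    Returns {variety: (red, green, blue)} with components from 0 to 255:
    the closest variety is pure red, the most distant one pure blue.
    """
    lo, hi = min(distances.values()), max(distances.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all distances are equal
    colors = {}
    for name, d in distances.items():
        t = (d - lo) / span
        colors[name] = (round(255 * (1 - t)), 0, round(255 * t))
    return colors


# Invented distances of three varieties to some reference point.
colors = reference_colors({"Haarlem": 10.0, "Delft": 25.0, "Lochem": 30.0})
print(colors)
```

Each polygon on the map would then be filled with the color of its variety.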
59
Comparison to Standard Dutch
the dialect transcriptions.
60
Dutch dialects compared to Standard Dutch. Red polygons represent strongly related dialects, blue polygons more remote ones.
61
Comparison to Afrikaans
the way to the Indies.
South-Holland.
Afrikaans language.
62
Dutch dialects compared to Afrikaans. Red polygons represent strongly related dialects, blue polygons more remote ones.
63
Comparison to Proto-Germanic
W¨
dictionary.
64
Dutch dialects compared to Proto-Germanic. Red polygons represent strongly related dialects, blue polygons more remote ones.
65
Final remarks
http://www.let.rug.nl/kleiweg/L04/.
http://www.gabmap.nl/.
66
67