Introduction to Dialectometry II Wilbert Heeringa German Academic - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Dialectometry II

Wilbert Heeringa

German Academic Exchange Service – DAAD University of Bielefeld, Faculty of Linguistics and Literary Studies Frisian Academy

Abidjan, December, 19–23, 2016

1

slide-2
SLIDE 2

Topics

Validation of distance measures Consistency of distance measures Quality of classifications Cluster algorithms Fuzzy clustering Cophenetic multidimensional scaling maps Reference point maps

2

slide-3
SLIDE 3

Validation of distance measures

3

slide-4
SLIDE 4

Experiment

  • In Norway everybody speaks a dialect; there is no standard language.
  • In the period 1999–2002 Jørn Almberg and Kristian Skarbø recorded about 50 Norwegian dialects.

  • The fable ‘The North Wind and the Sun’ was taken as a basis.
  • This text was also used in IPA handbooks published in 1949 and 1999.
  • Speakers were asked to translate the text and to read it aloud.
  • Audio files and transcriptions available at:

http://www.ling.hf.ntnu.no/nos/

4

slide-5
SLIDE 5

Experiment

  • Perception experiment carried out in the Spring of 2000 by Charlotte Gooskens.
  • 15 recordings of 15 dialects were used.
  • In each of the 15 locations, a group of 16 to 27 high school pupils listened to all 15 texts.

  • The texts were presented in a randomized order.

5

slide-6
SLIDE 6

Map: the geographic distribution of the 15 Norwegian dialects (Bergen, Bjugn, Bodø, Bø, Borre, Fræna, Halden, Herøy, Larvik, Lesja, Lillehammer, Stjørdal, Time, Trondheim, Verdal).

6

slide-7
SLIDE 7

Experiment

  • Task: for each text, each pupil rates the distance between the corresponding dialect and his or her own dialect.

  • Scale from 1 (similar to own dialect) to 10 (not similar to own dialect).
  • Final result: a 15 × 15 perceptual distance matrix.

7

slide-8
SLIDE 8

Experiment

             Be   Bj   Bo   Bø   Bo   Fr   Ha   He   La   Le   Li   St   Ti   Tr   Ve
Bergen       1.7  9.0  8.2  8.0  7.7  7.7  8.2  6.9  8.0  8.9  8.5  8.4  4.8  8.5  8.0
Bjugn        9.1  3.4  6.4  8.2  9.2  5.8  8.3  8.0  8.4  7.3  9.1  2.2  8.0  3.3  2.8
Bodø         8.7  7.9  1.5  8.3  8.3  6.6  7.9  7.8  7.3  8.0  8.7  6.6  8.1  6.2  6.3
Bø           8.1  7.8  7.5  1.0  7.7  8.1  4.9  7.8  5.3  6.0  5.1  7.1  6.3  8.2  8.6
Borre        6.1  8.8  7.8  6.5  1.7  8.5  1.8  7.5  1.6  7.5  2.0  7.2  7.5  8.5  9.1
Fræna        9.0  7.5  7.1  8.4  8.8  3.1  8.1  7.8  8.5  7.2  9.0  6.6  7.4  6.1  7.6
Halden       7.0  8.2  8.0  6.8  4.0  8.1  2.8  7.9  2.8  6.6  3.0  7.4  7.0  8.0  8.3
Herøy        8.6  9.3  8.4  8.5  9.1  7.0  8.6  1.2  9.3  9.3  9.4  8.5  7.5  7.5  8.2
Larvik       7.4  8.7  7.6  4.0  4.0  7.7  3.2  5.6  3.4  7.1  4.6  8.2  6.8  8.3  7.5
Lesja        8.5  7.6  7.8  7.4  8.2  7.3  7.6  7.7  7.6  1.0  7.1  6.9  7.2  7.7  8.2
Lillehammer  6.7  8.3  8.1  6.2  4.4  8.0  3.1  7.5  4.1  7.3  2.7  7.6  6.8  8.7  8.1
Stjørdal     8.7  3.7  6.8  7.7  8.1  6.0  7.5  7.7  8.3  7.1  8.3  2.0  7.7  3.8  3.4
Time         7.0  9.3  8.4  8.1  8.4  8.3  8.0  7.2  8.2  9.1  8.8  8.8  1.8  8.8  9.0
Trondheim    7.8  5.8  6.7  7.5  6.4  7.3  6.0  7.1  5.9  7.9  6.3  4.4  7.6  3.3  6.8
Verdal       8.8  3.4  6.4  8.2  8.4  5.7  7.2  7.9  7.9  7.4  8.4  1.8  7.9  3.1  2.6

Perceptual distances among 15 Norwegian dialect varieties. Row names represent listener groups, column names represent dialect speakers.

8

slide-9
SLIDE 9

Average perceptual distances between 15 Norwegian dialects. Darker lines connect closer points, lighter lines more remote ones.

Distance pairs A–B / B–A are averaged.
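The averaging of A–B / B–A pairs can be sketched as follows. The three-dialect sub-matrix is copied from the table of perceptual distances (rows are listener groups, columns are speakers); the function name `symmetrize` is an assumption made for this illustration only.

```python
# Perceptual ratings for three of the 15 dialects, copied from the table
# (rows: listener groups, columns: dialect speakers).
names = ["Bergen", "Bjugn", "Bodø"]
ratings = [
    [1.7, 9.0, 8.2],
    [9.1, 3.4, 6.4],
    [8.7, 7.9, 1.5],
]


def symmetrize(matrix):
    """Average each pair A-B / B-A so that the matrix becomes symmetric."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
            for i in range(n)]


averaged = symmetrize(ratings)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]}-{names[j]}: {averaged[i][j]:.2f}")
```

For Bergen and Bjugn, for instance, the two ratings 9.0 and 9.1 are averaged to 9.05.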

9

slide-10
SLIDE 10

Experiment

  • Using the transcriptions, we measure lexical distances and pronunciation distances among the 15 local dialect varieties.

  • Each dialect text usually consists of 58 different words.
  • Validation:

How well do the dialectometric distances correlate with the perceptual distances?
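A minimal sketch of this validation step: the Pearson correlation is computed over all dialect pairs (the upper triangle of each matrix), and the explained variance is r² × 100. Both 4 × 4 matrices below are hypothetical illustration values, not the Norwegian data, and the helper names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix):
    """All pairwise values above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Hypothetical symmetric distance matrices for four dialects.
perceptual = [
    [0.0, 9.0, 8.4, 7.0],
    [9.0, 0.0, 7.2, 8.6],
    [8.4, 7.2, 0.0, 6.1],
    [7.0, 8.6, 6.1, 0.0],
]
computed = [
    [0, 41, 37, 30],
    [41, 0, 33, 35],
    [37, 33, 0, 28],
    [30, 35, 28, 0],
]

r = pearson(upper_triangle(perceptual), upper_triangle(computed))
print(f"r = {r:.2f}, explained variance = {r * r * 100:.0f}%")
```

Because the entries of a distance matrix are not independent of each other, testing such a correlation for significance usually requires a permutation procedure such as the Mantel test rather than the ordinary test for r.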

10

slide-11
SLIDE 11

Correlations (1)

lexical                        r      expl. var.
relative difference value      0.27    7%
weighted difference value      0.37   14%

pronunciation aggregate        r      expl. var.
Levenshtein (1)                0.71   50%
Levenshtein (2)                0.70   49%
Levenshtein (3)                0.67   45%
Levenshtein PMI (1)            0.71   50%
Levenshtein PMI (3)            0.67   45%

11

slide-12
SLIDE 12

Correlations (2)

  • In the measurements binary weighting is used. Suprasegmentals and diacritics are ignored.
  • There is no difference in correlation between ‘classic’ Levenshtein and PMI Levenshtein, but the alignments made by PMI Levenshtein are better; see Wieling, Prokić and Nerbonne (2009).

12

slide-13
SLIDE 13

Left: perceptual distances. Right: lexical weighted difference value distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.37

13

slide-14
SLIDE 14

Left: perceptual distances. Right: non-normalized Levenshtein distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.71.

14

slide-15
SLIDE 15

Consistency of distance measures

15

slide-16
SLIDE 16

Consistency

  • How many items do we need for dialect comparison? Rule of thumb: 100 items (Goebl).
  • In order to answer this question more precisely, measure the degree to which different words in the data set give the same signal of linguistic relationships between the dialects: measure Cronbach’s Alpha.
  • Example: measure Levenshtein distance between three dialects using four words. In this example we normalize Levenshtein distances per word pair.

16

slide-17
SLIDE 17

Figure: Levenshtein distances between the pronunciations of the word ‘seen’ in Grouw, Haarlem and Almelo.

17

slide-18
SLIDE 18

Figure: Levenshtein distances between the pronunciations of the word ‘hart’ in Grouw, Haarlem and Almelo.

18

slide-19
SLIDE 19

Figure: Levenshtein distances between the pronunciations of the word ‘son’ in Grouw, Haarlem and Almelo.

19

slide-20
SLIDE 20

Figure: Levenshtein distances between the pronunciations of the word ‘house’ in Grouw, Haarlem and Almelo.

20

slide-21
SLIDE 21

Consistency

  • General pattern: Haarlem and Almelo are linguistically relatively close to each other and relatively distant from Grouw.
  • Levenshtein distances between the three local dialects:

                      seen   hart   son   house
Grouw vs. Haarlem      71     25    100    75
Grouw vs. Almelo       83     25     75    33
Haarlem vs. Almelo     60     20     50    50

  • The words are correlated with each other using the values in the columns.

21

slide-22
SLIDE 22

Consistency

  • Correlations between words:

                     r     n
seen vs. hart       0.85   3
seen vs. son        0.48   3
seen vs. house     −0.43   3
hart vs. son        0.87   3
hart vs. house      0.11   3
son vs. house       0.59   3

  • The average inter-correlation r is 0.41.
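The correlations can be recomputed from the three distances per word given on the previous slide. The sketch below is only an illustration; the helper `pearson` and the dictionary layout are assumptions, the numbers are those of the slide.

```python
import math
from itertools import combinations

# Levenshtein distances per word, in the order
# Grouw-Haarlem, Grouw-Almelo, Haarlem-Almelo (values from the slide).
distances = {
    "seen": [71, 83, 60],
    "hart": [25, 25, 20],
    "son": [100, 75, 50],
    "house": [75, 33, 50],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Correlate every pair of words over the three dialect pairs.
correlations = {
    (a, b): pearson(distances[a], distances[b])
    for a, b in combinations(distances, 2)
}
for (a, b), r in correlations.items():
    print(f"{a} vs. {b}: {r:.2f}")

mean_r = sum(correlations.values()) / len(correlations)
print(f"average inter-correlation: {mean_r:.2f}")
```

With only three dialect pairs per word (n = 3) each individual correlation is very unstable, which is why the average over all word pairs is used.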

22

slide-23
SLIDE 23

Consistency

  • Cronbach’s α can be written as a function of the number of words and the average inter-correlation among the words:

$$\alpha = \frac{n_w \times \bar{r}}{1 + (n_w - 1) \times \bar{r}}$$

where $n_w$ is the number of words, which in our example is 4.

  • Calculation:

$$\alpha = \frac{4 \times 0.41}{1 + (4 - 1) \times 0.41} = 0.74$$

  • If all words have the same geographic distribution of variants, the value of Cronbach’s alpha is 1; if there is no consistency between the words in the data set, the value is 0.

  • A generally accepted threshold for consistency of the data is 0.70.
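The calculation above can be written as a small function. This is a sketch of the standardized form of Cronbach’s α used on this slide; the function and parameter names are assumptions.

```python
def cronbach_alpha(n_words, mean_r):
    """Cronbach's alpha from the number of words and the average inter-correlation."""
    return (n_words * mean_r) / (1 + (n_words - 1) * mean_r)


alpha = cronbach_alpha(4, 0.41)
print(f"alpha = {alpha:.2f}")
print("consistent" if alpha >= 0.70 else "add more items")
```

For the four example words this gives 1.64 / 2.23 ≈ 0.74, just above the 0.70 threshold.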

23

slide-24
SLIDE 24

Consistency

  • In general: the more items are included, the higher Cronbach’s Alpha.
  • If the Cronbach’s Alpha value is very low, add more items!
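Assuming the average inter-correlation stays roughly the same when items are added, the formula for α can be used to estimate how many items are needed to reach a given threshold. The sketch below simply increases the number of items until the threshold is met; `items_needed` is a hypothetical helper name.

```python
def cronbach_alpha(n_words, mean_r):
    """Cronbach's alpha from the number of words and the average inter-correlation."""
    return (n_words * mean_r) / (1 + (n_words - 1) * mean_r)


def items_needed(mean_r, threshold=0.70):
    """Smallest number of items for which alpha reaches the threshold."""
    if not 0 < mean_r <= 1 or not 0 < threshold < 1:
        raise ValueError("mean_r must be in (0, 1] and threshold in (0, 1)")
    n = 1
    while cronbach_alpha(n, mean_r) < threshold:
        n += 1
    return n


for mean_r in (0.05, 0.20, 0.41):
    print(f"average r = {mean_r:.2f}: {items_needed(mean_r)} items")
```

With the average inter-correlation of 0.41 from the example, four words are just enough; with a much lower average inter-correlation the required number of items grows quickly.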

24

slide-25
SLIDE 25

Consistency

Two plots of Cronbach’s alpha (vertical axis) against the number of words (horizontal axis).

Left: Cronbach’s α values for random subsets of 2 through 107 words (lexical weighted difference values) and 360 local dialects. From 86 words on α is always higher than 0.70. For 107 words α is equal to 0.75. Right: Cronbach’s α values for random subsets of 2 through 125 words (Levenshtein distance) and 360 local dialects. From 13 words on α is always higher than 0.70. For 125 words α is equal to 0.97.

25

slide-26
SLIDE 26

Quality of classifications

26

slide-27
SLIDE 27

Quality of classifications

  • For clustering compare cophenetic distances to original distances.
  • For multidimensional scaling compare interpoint multidimensional scaling distances to original distances.

27

slide-28
SLIDE 28

Cophenetic distances

  • In a dendrogram the distances between clusters are represented by the length of the branches.

Dendrogram of Grouw, Delft, Haarlem, Hattem and Lochem (distance axis from 10 to 40).

  • Cophenetic distance: distance between two local dialects as found in the dendrogram.
  • Find the shortest path between the two local dialects in the dendrogram; the cophenetic distance is the longest distance covered in one direction along that path, i.e. the level at which the two dialects are first joined in one cluster.

28

slide-29
SLIDE 29

Cophenetic distances

           Grouw   Haarlem   Delft   Hattem   Lochem
Grouw        –      44       44      44       44
Haarlem     44       –       16      36.25    36.25
Delft       44      16        –      36.25    36.25
Hattem      44      36.25    36.25    –       20
Lochem      44      36.25    36.25   20        –
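The matrix above follows directly from the fusion levels of the dendrogram: every pair of dialects gets the level at which the two are first joined. In the sketch below the list of fusions is read off from the cophenetic values on this slide; the data layout and the function name are assumptions.

```python
names = ["Grouw", "Haarlem", "Delft", "Hattem", "Lochem"]

# Each fusion: (members of the first cluster, members of the second cluster,
# fusion level), in the order in which the clusters are fused.
merges = [
    (["Haarlem"], ["Delft"], 16.0),
    (["Hattem"], ["Lochem"], 20.0),
    (["Haarlem", "Delft"], ["Hattem", "Lochem"], 36.25),
    (["Grouw"], ["Haarlem", "Delft", "Hattem", "Lochem"], 44.0),
]


def cophenetic_matrix(names, merges):
    """Build the cophenetic distance matrix from a list of fusions."""
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for left, right, level in merges:
        # Dialects from different sides of a fusion are first joined here.
        for a in left:
            for b in right:
                i, j = index[a], index[b]
                matrix[i][j] = matrix[j][i] = level
    return matrix


coph = cophenetic_matrix(names, merges)
for name, row in zip(names, coph):
    print(f"{name:8s}", "  ".join(f"{value:6.2f}" for value in row))
```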

29

slide-30
SLIDE 30

Cophenetic distances

  • Cophenetic correlation coefficient: measure of how faithfully the pairwise distances between local dialects as suggested by the dendrogram preserve the original pairwise distances.
  • Correlate the pairwise cophenetic distances with the original pairwise distances: r = 0.99
  • The amount of variance in the original distances explained by the cophenetic distances is r² × 100 = 97.6%.

30

slide-31
SLIDE 31

Interpoint multidimensional scaling distances

  • With multidimensional scaling the five local dialects are plotted in two-dimensional space so that the distances are preserved as well as possible:

Scatter plot of Grouw, Haarlem, Delft, Hattem and Lochem (axes: first dimension, second dimension).

31

slide-32
SLIDE 32

Interpoint multidimensional scaling distances

  • We can calculate interpoint distances between the local dialects:

Scatter plot of Grouw, Haarlem, Delft, Hattem and Lochem (axes: first dimension, second dimension).

  • Distance between Grouw (−24, 21) and Hattem (18, 6):

$$\sqrt{(-24 - 18)^2 + (21 - 6)^2} = 44.6$$
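The same calculation as a short sketch, using the two coordinates given on the slide; the function name `interpoint_distance` is an assumption.

```python
import math


def interpoint_distance(p, q):
    """Euclidean distance between two points in the two-dimensional MDS plot."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


grouw = (-24, 21)
hattem = (18, 6)

distance = interpoint_distance(grouw, hattem)
print(f"Grouw-Hattem: {distance:.1f}")
```

The squared differences are 1764 and 225, so the distance is the square root of 1989, about 44.6.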

32

slide-33
SLIDE 33

Interpoint multidimensional scaling distances

  • We calculate interpoint distances for all dialect pairs:

Scatter plot of Grouw, Haarlem, Delft, Hattem and Lochem (axes: first dimension, second dimension).

33

slide-34
SLIDE 34

Interpoint multidimensional scaling distances

  • Interpoint distances:

           Grouw   Haarlem   Delft   Hattem   Lochem
Grouw        –      41.9     42.9    44.6     45.4
Haarlem     41.9     –        2.5    33.1     35.6
Delft       42.9     2.5      –      35.6     38.1
Hattem      44.6    33.1     35.6     –        2.5
Lochem      45.4    35.6     38.1     2.5      –

  • Correlate the pairwise interpoint distances with the original pairwise distances: r = 0.99
  • The amount of variance in the original distances explained by the interpoint distances is r² × 100 = 98.0%.

34

slide-35
SLIDE 35

Cluster algorithms

35

slide-36
SLIDE 36

Data source

  • Reeks Nederlandse Dialectatlassen, compiled by E. Blancquaert and W. Pée.

  • Texts from 1922–1975, 1956 dialects, 139 sentences each.
  • We selected 40 dialects, 125 words

36

slide-37
SLIDE 37

Analysis

  • We use ‘classic’ Levenshtein distance.
  • The aggregated distance between two local dialects is calculated as the sum of the Levenshtein distances divided by the sum of the alignment lengths, times 100.
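A sketch of this aggregation, assuming binary weights (insertion, deletion and substitution cost 1). When several alignments have the same minimal cost, the longest one is taken here; that choice, the function names and the example word pairs (plain strings standing in for phonetic transcriptions) are assumptions made for the illustration.

```python
def levenshtein_with_length(a, b):
    """Return (Levenshtein distance, alignment length) for two sequences.

    Each cell holds (cost, -length), so that min() prefers the lowest cost
    and, among equal costs, the longest alignment.
    """
    prev = [(j, -j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [(i, -i)]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(
                (prev[j][0] + 1, prev[j][1] - 1),            # deletion
                (cur[j - 1][0] + 1, cur[j - 1][1] - 1),      # insertion
                (prev[j - 1][0] + sub, prev[j - 1][1] - 1),  # match or substitution
            ))
        prev = cur
    cost, negative_length = prev[-1]
    return cost, -negative_length


def aggregate_distance(word_pairs):
    """Sum of Levenshtein distances divided by sum of alignment lengths, times 100."""
    total_cost = total_length = 0
    for a, b in word_pairs:
        cost, length = levenshtein_with_length(a, b)
        total_cost += cost
        total_length += length
    return total_cost / total_length * 100


# Hypothetical word pairs for two dialects (non-empty strings).
pairs = [("hart", "hert"), ("zon", "sun"), ("huis", "hus")]
print(f"aggregated distance: {aggregate_distance(pairs):.1f}")
```

Dividing the summed distances by the summed alignment lengths gives longer words more weight than averaging the per-word normalized distances would.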

37

slide-38
SLIDE 38

Cluster algorithms

  • When two clusters are fused, one larger cluster arises.
  • The distances between this larger cluster and the other clusters need to be calculated. This is done by a matrix updating algorithm.

  • We discuss five algorithms.

38

slide-39
SLIDE 39

Cluster algorithms

  • Assume clusters i and j are fused to one cluster ij. In order to calculate the distance between cluster ij and a cluster k we need (part of) the following data:

ni: number of varieties in cluster i;
nj: number of varieties in cluster j;
nk: number of varieties in cluster k;
dki: the distance between k and i;
dkj: the distance between k and j;
dij: the distance between i and j.

39

slide-40
SLIDE 40

Single link

Nearest neighbor

  • Choose the smallest distance:

dk[ij] = minimum(dki, dkj)

  • Outliers are clearly recognizable, but the method suffers from the “chaining effect”.

40

slide-41
SLIDE 41

Complete link

Furthest neighbor

  • Choose the largest distance:

dk[ij] = maximum(dki, dkj)

  • Gives a well-balanced dendrogram, clusters have about the same size, sensitive to outliers.
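The matrix updating scheme can be sketched as a small agglomerative clustering loop in which the update rule is passed in as a function. The two rules below are the single link and complete link formulas from the slides; the 4 × 4 distance matrix, the variety names A to D and all function names are hypothetical.

```python
def single_link(ni, nj, nk, dki, dkj, dij):
    return min(dki, dkj)


def complete_link(ni, nj, nk, dki, dkj, dij):
    return max(dki, dkj)


def cluster(names, dist, update):
    """Agglomerative clustering; returns the fusions as (members_i, members_j, level)."""
    clusters = {i: (name,) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    fusions = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)              # the two closest clusters
        i, j = sorted(pair)
        dij = d.pop(pair)
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:                    # matrix update for every other cluster k
            if k in (i, j):
                continue
            dki = d.pop(frozenset((k, i)))
            dkj = d.pop(frozenset((k, j)))
            updated[k] = update(ni, nj, len(clusters[k]), dki, dkj, dij)
        fusions.append((clusters[i], clusters[j], dij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for k, value in updated.items():
            d[frozenset((k, next_id))] = value
        next_id += 1
    return fusions


names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
for rule in (single_link, complete_link):
    print(rule.__name__, cluster(names, dist, rule))
```

Both rules first fuse A with B and then C with D; they differ in the level of the final fusion, which is the smallest cross-cluster distance (5) for single link and the largest (10) for complete link.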

41

slide-42
SLIDE 42

Dendrograms of the selected dialects. Left: single link. Right: complete link.

42

slide-43
SLIDE 43

Group average

Unweighted Pair Group Method using Arithmetic averages (UPGMA)

  • Choose the average distance between all varieties in the two clusters:

dk[ij] = (ni / (ni + nj)) × dki + (nj / (ni + nj)) × dkj

  • When two clusters are merged and distances to the other clusters are calculated, the larger cluster will influence the distances more strongly than the smaller cluster.
  • Dendrograms obtained by this method reflect the original distances in the distance matrix most closely (cophenetic correlation coefficient). Best choice in general.

43

slide-44
SLIDE 44

Weighted average

Weighted Pair Group Method using Arithmetic averages (WPGMA)

  • Choose the average distance between the two clusters:

dk[ij] = (1/2 × dki) + (1/2 × dkj)

  • When two clusters are merged and distances to the other clusters are calculated, the clusters influence the distances to the same extent, regardless of their sizes.
  • Recommended in case of an irregular sampling distribution.
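The difference between group average and weighted average shows up as soon as the two fused clusters differ in size. A small sketch with hypothetical numbers (cluster i contains three varieties, cluster j one); the function names are assumptions.

```python
def group_average(ni, nj, dki, dkj):
    """UPGMA update: each cluster is weighted by its number of varieties."""
    return (ni / (ni + nj)) * dki + (nj / (ni + nj)) * dkj


def weighted_average(dki, dkj):
    """WPGMA update: both clusters count equally, regardless of size."""
    return 0.5 * dki + 0.5 * dkj


ni, nj = 3, 1            # hypothetical cluster sizes
dki, dkj = 20.0, 40.0    # hypothetical distances from cluster k

print("group average:   ", group_average(ni, nj, dki, dkj))
print("weighted average:", weighted_average(dki, dkj))
```

With these values group average gives 25.0, closer to the distance of the larger cluster, while weighted average gives 30.0.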

44

slide-45
SLIDE 45

Dendrograms of the selected dialects. Left: group average. Right: weighted average.

45

slide-46
SLIDE 46

Ward’s method

  • Minimize the variance in the clusters:

dk[ij] = ((nk + ni) / (nk + ni + nj)) × dki + ((nk + nj) / (nk + ni + nj)) × dkj − (nk / (nk + ni + nj)) × dij

  • Results in a well-balanced dendrogram, all clusters have about the same size.
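The update formula for Ward’s method as a function, with the same symbols as on the slide. The example values are hypothetical. Note that this form of the update is normally stated for squared Euclidean distances; whether the distances were squared in the analysis presented here is not stated on the slide.

```python
def ward(ni, nj, nk, dki, dkj, dij):
    """Ward's update for the distance between cluster k and the fused cluster ij."""
    total = nk + ni + nj
    return (((nk + ni) / total) * dki
            + ((nk + nj) / total) * dkj
            - (nk / total) * dij)


# Hypothetical example: three single varieties, i and j are fused.
print(f"{ward(ni=1, nj=1, nk=1, dki=4.0, dkj=4.0, dij=2.0):.3f}")
```

Unlike the other four rules, the distance dij between the two fused clusters enters the update, and the size of cluster k matters as well.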

46

slide-47
SLIDE 47

Ward’s method

Dendrogram of the selected dialects obtained with Ward’s method (distance axis from 50 to 150).

47

slide-48
SLIDE 48

Validation

                    coph. cor.
Single link         0.74
Complete link       0.81
Group average       0.83
Weighted average    0.82
Ward’s method       0.67

48

slide-49
SLIDE 49

Fuzzy clustering

49

slide-50
SLIDE 50

Fuzzy clustering

  • Cluster analysis is relatively unstable: small changes in the distance matrix can lead to large changes in the clustering results.
  • Solution: fuzzy clustering.
  • Contaminate the original distance matrix with (varying) small amounts of random noise (one or two standard deviations).
  • Repeat this several times.
  • After that count how many times each cluster has appeared.
  • Clusters that appear in many runs of the analysis with added noise are particularly stable ones.
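The procedure can be sketched as follows, using group average clustering. The exact noise model is an assumption: here every distance gets a random amount between 0 and a ceiling that is a fraction (`noise`) of the standard deviation of the distances. The distance matrix and the variety names A to E are hypothetical.

```python
import random
import statistics
from collections import Counter


def upgma_clusters(dist):
    """Group average clustering; returns all clusters formed (frozensets of indices)."""
    n = len(dist)
    clusters = {i: frozenset([i]) for i in range(n)}
    d = {frozenset((i, j)): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    found = set()
    next_id = n
    while len(clusters) > 1:
        pair = min(d, key=d.get)                  # the two closest clusters
        i, j = sorted(pair)
        del d[pair]
        ci, cj = clusters.pop(i), clusters.pop(j)
        for k in clusters:                        # group average matrix update
            dki = d.pop(frozenset((k, i)))
            dkj = d.pop(frozenset((k, j)))
            d[frozenset((k, next_id))] = (
                (len(ci) * dki + len(cj) * dkj) / (len(ci) + len(cj)))
        clusters[next_id] = ci | cj
        found.add(ci | cj)
        next_id += 1
    return found


def noisy_cluster_counts(dist, runs=100, noise=0.5, seed=0):
    """Count how often each cluster appears when random noise is added."""
    rng = random.Random(seed)
    n = len(dist)
    values = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    ceiling = noise * statistics.pstdev(values)   # assumed noise ceiling
    counts = Counter()
    for _ in range(runs):
        noisy = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                noisy[i][j] = noisy[j][i] = dist[i][j] + rng.uniform(0, ceiling)
        counts.update(upgma_clusters(noisy))
    return counts


names = ["A", "B", "C", "D", "E"]
dist = [
    [0, 10, 12, 30, 32],
    [10, 0, 11, 29, 31],
    [12, 11, 0, 28, 30],
    [30, 29, 28, 0, 9],
    [32, 31, 30, 9, 0],
]
runs = 200
counts = noisy_cluster_counts(dist, runs=runs)
for members, count in counts.most_common():
    label = " ".join(names[i] for i in sorted(members))
    print(f"{label}: {100 * count / runs:.0f}%")
```

In this example the groups A B C and D E are far apart and show up in every run, whereas the pairs within A B C are nearly equidistant, so which pair is fused first depends on the noise.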

50

slide-51
SLIDE 51

Fuzzy clustering

  • Display the results in a probabilistic dendrogram.
  • For each cluster in the dendrogram show a percentage that indicates how many times each cluster was encountered in the repeated clustering with noise.

  • Example on next slide obtained on the basis of group average.

51

slide-52
SLIDE 52

Probabilistic dendrogram of the selected dialects, based on group average (distance axis from 10 to 40). The percentages shown at the clusters range from 50 to 100.

52

slide-53
SLIDE 53

Cophenetic multidimensional scaling maps

53

slide-54
SLIDE 54

Cophenetic multidimensional scaling maps

  • With clustering we can obtain cophenetic distances, i.e. the branch lengths of the dendrogram.

54

slide-55
SLIDE 55

Left: ‘original’ pronunciation distances measured with Levenshtein distance between 361 local Dutch dialects. Right: cophenetic distances. Correlation: r = 0.84. Percentage of variance explained by the cophenetic distances: 0.84² × 100 = 71.2%

55

slide-56
SLIDE 56

Cophenetic multidimensional scaling maps

  • Multidimensional scaling is usually applied to the ‘original’ distances between the local dialects.
  • We can also apply multidimensional scaling to the cophenetic distances, and create a color continuum map.
  • Thus we obtain a cophenetic multidimensional scaling map, the result of which is a mix of a cluster map and a multidimensional scaling map.

56

slide-57
SLIDE 57

Left: six most significant groups found by cluster analysis. Middle: cophenetic multidimensional scaling map. Right: multidimensional scaling map obtained on the basis of original distances.

57

slide-58
SLIDE 58

Reference point maps

58

slide-59
SLIDE 59

Reference point maps

  • Introduced by Goebl (± 1982).
  • Compare local dialects to a reference point.
  • Reference point can be one of the local dialects, a standard language, a proto-language, etc.
  • Goebl used a rainbow scheme: red-orange-yellow-green-blue, where dialects most similar to the reference point are red, and most distant dialects are blue. The following maps show pronunciation distances measured with Levenshtein distance.
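A rainbow scheme of this kind can be approximated by mapping the distance to the reference point onto the hue circle from red (0°) to blue (240°). This is only an illustrative sketch and not the colour scale of any particular mapping program; the function name and the example distances are assumptions.

```python
import colorsys


def rainbow_color(distance, min_distance, max_distance):
    """Map a distance to a hex colour: red for the closest, blue for the most distant."""
    if max_distance == min_distance:
        fraction = 0.0
    else:
        fraction = (distance - min_distance) / (max_distance - min_distance)
    hue = fraction * 240 / 360          # 0 = red, 1/3 = green, 2/3 = blue
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# Hypothetical distances of four local dialects to the reference point.
distances = {"A": 12.0, "B": 20.5, "C": 31.0, "D": 44.0}
low, high = min(distances.values()), max(distances.values())
for name, value in distances.items():
    print(name, rainbow_color(value, low, high))
```

The closest dialect is drawn in pure red and the most distant one in pure blue; the dialects in between pass through orange, yellow and green.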

59

slide-60
SLIDE 60

Comparison to Standard Dutch

  • Standard Dutch is the overarching standard language in the Netherlands and Flanders.
  • Transcriptions based on Tekstboekje of Blancquaert (1939) to ensure consistency with the dialect transcriptions.

60

slide-61
SLIDE 61

Dutch dialects compared to Standard Dutch. Red polygons represent strongly related dialects, blue polygons more remote ones.

61

slide-62
SLIDE 62

Comparison to Afrikaans

  • In 1652 Jan van Riebeeck founded a refreshment station at the Cape of Good Hope on the way to the Indies.
  • He and the group around him came from the southern part of the Dutch province of South-Holland.
  • Kloeke (1950): Jan van Riebeeck’s group is the most important source of today’s Afrikaans language.

62

slide-63
SLIDE 63

Dutch dialects compared to Afrikaans. Red polygons represent strongly related dialects, blue polygons more remote ones.

63

slide-64
SLIDE 64

Comparison to Proto-Germanic

  • Gerhard Köbler compiled the Germanisches Wörterbuch, 3rd edition, 2003.
  • On the basis of this dictionary he also compiled the neuhochdeutsch-germanisches Wörterbuch.
  • For 86 words in our list of 125 words Proto-Germanic translations were found in this dictionary.

  • The dictionaries are available at: http://www.koeblergerhard.de/.

64

slide-65
SLIDE 65

Dutch dialects compared to Proto-Germanic. Red polygons represent strongly related dialects, blue polygons more remote ones.

65

slide-66
SLIDE 66

Final remarks

  • The maps were produced with RuG/L04, developed by Peter Kleiweg, and available at:

http://www.let.rug.nl/kleiweg/L04/.

  • Gabmap is a web application made for dialectologists and students, and available at:

http://www.gabmap.nl/.

66
