slide-1
SLIDE 1

Introduction to Dialectometry

Wilbert Heeringa

Språkbanken, University of Gothenburg, 30 January 2019

1

slide-2
SLIDE 2

Introduction

2

slide-3
SLIDE 3

What is dialectometry?

  • ‘The measure of dialect’ (Jean Séguy).

  • Measures the degree of difference or similarity between dialects.
  • Thus patterns in the dialect landscape can be revealed.

3

slide-4
SLIDE 4

Why dialectometry?

  • For the record of cultural history: to reveal migrations, contacts with other peoples, and internal cultural divisions.
  • May be of use to language learners, publishers, broadcasters, educators and language planners.

4

slide-5
SLIDE 5

Isogloss method

  • Primary tool of traditional dialectology has been the isogloss.
  • Greek isos means ‘equal’, Greek glōssa means ‘language’.

5

slide-6
SLIDE 6

Nucleus in ripe: [rip(@)] (west) [rE;p] (central) [rip(@)] (east)

6

slide-7
SLIDE 7

Coda in cold: [kO;ut] (west) [kO;lt] (east)

7

slide-8
SLIDE 8

Nucleus in ripe & coda in cold

8

slide-9
SLIDE 9

Isogloss method

Overlay the isogloss maps of 14 phenomena:

 1  [VErk] vs. [VEr@k]
 2  [splInt@r] vs. [splInt@ö]
 3  [kni] vs. [kne:] vs. [knE:i] vs. [knIb@l]
 4  [zi;n] vs. [@zi;n] vs. [G@zi;n] vs. [j@zi;n]
 5  [ste;n"] vs. [ste;n@] vs. [stI;@s]
 6  [me:st@r] vs. [mi;@st@r] vs. [mE;st@r]
 7  [rip] vs. [rE;ip]
 8  [zEs] vs. [sEs] vs. [sEz]
 9  [kO;ut] vs. [kO;lt]
10  [ro:zn"] vs. [ro:z@n] vs. [ro:z@]
11  [lAd@r] vs. [li;@r(@)]
12  [bru:r] vs. [brœ:ij@r] vs. [bru;r@]
13  [brYx] vs. [brYG(@)] vs. [brYg]
14  [blO;w] vs. [blA:t]

9

slide-10
SLIDE 10

Isoglosses of 14 phenomena. Isogloss bundles represent dialect boundaries.

10

slide-11
SLIDE 11

Isogloss method

  • Not easy to decide about dialect borders, unless by selecting coinciding isoglosses.

11

slide-12
SLIDE 12

Dialectometry

We need methodology that:

  • is purely linguistic;
  • includes all linguistic levels;
  • uses a representative data set of contemporary spoken dialect;
  • includes all data without making subjective selections;
  • utilizes the data maximally;
  • allows comparisons regardless of whether varieties are geographically close or not;
  • produces results that are unambiguous.

Use dialectometry?

12

slide-13
SLIDE 13

Relative difference value

  • The term ‘dialectometry’ was coined by Jean Séguy.
  • He was director of the Atlas linguistique de la Gascogne.
  • Assisted and inspired by Henri Guiter.
  • Dialect distance: the number of items on which two dialects differ, expressed as a percentage.

13

slide-14
SLIDE 14

Relative difference value

  • Example: calculate the lexical relative difference value between Middelstum and Ommen on the basis of six items:

Item     Middelstum   Ommen      Difference
friend   kAm@rU;tˇ    kAm@rO:t
ship     sxIp         sxIp
far      v˚E:r        Vitˇ       1
are      bIn"         bInt
still    nOx          nOx
push     stø;tn"      drYkˇN"    1
Total                            2

  • Distance: 2/6 = 0.33. Percentage: 33%.
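
The calculation can be sketched in Python. This is a minimal sketch, assuming every item has already been assigned a lexeme label per dialect; the labels "L1" and "L2" and the function name `relative_difference` are hypothetical and only encode which items the slide counts as different.

```python
def relative_difference(dialect_a, dialect_b):
    """Fraction of shared items on which two dialects use different lexemes."""
    items = [item for item in dialect_a if item in dialect_b]
    if not items:
        raise ValueError("the two dialects share no items")
    differing = sum(1 for item in items if dialect_a[item] != dialect_b[item])
    return differing / len(items)


# Hypothetical lexeme labels: equal labels mean "same lexeme".
middelstum = {"friend": "L1", "ship": "L1", "far": "L1",
              "are": "L1", "still": "L1", "push": "L1"}
ommen = {"friend": "L1", "ship": "L1", "far": "L2",
         "are": "L1", "still": "L1", "push": "L2"}

distance = relative_difference(middelstum, ommen)
print(f"{distance:.2f} ({distance:.0%})")  # 0.33 (33%)
```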

14

slide-15
SLIDE 15

Relative difference value

  • We call this the ‘relative difference value’.
  • Can be used for all linguistic levels.
  • No gradual distances between items.
  • Goebl (1982 and later) measured dialect similarity and called this the Relative Identity Value (RIV).

15

slide-16
SLIDE 16

Weighted difference value

  • Goebl (1984) introduced the Weighted Identity Value (WIV).
  • Basic idea: similarity in rare lexemes contributes more strongly to the overall similarity between two local dialects than similarity in common lexemes.
  • Since we focus on distances rather than on similarity, we present the ‘weighted difference value’.

16

slide-17
SLIDE 17

Weighted difference value

  • Example: in a set of 360 dialects we find the following lexemes for schip ‘ship’: schip (353), boot (2), lager (1), schuit (4). In terms of distances:

schip vs. schip:    353/360 = 0.981
schuit vs. schuit:    4/360 = 0.011
boot vs. boot:        2/360 = 0.006

  • The distance between different lexemes (for example schip versus boot) is always 1.

17

slide-18
SLIDE 18

Weighted difference value

  • Example: calculate the lexical weighted difference value between Middelstum and Ommen on the basis of 6 words:

Item     Middelstum   Ommen      Frequency   Distance
friend   kAm@rU;tˇ    kAm@rO:t   140/354     0.40
ship     sxIp         sxIp       353/360     0.98
far      v˚E:r        Vitˇ                   1
are      bIn"         bInt       176/360     0.49
still    nOx          nOx        354/355     1.00
push     stø;tn"      drYkˇN"                1
Total                                        4.87

  • Distance: 4.87/6 = 0.81. Percentage: 81%.
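
The same arithmetic as a short Python sketch. It assumes that, for an item on which both dialects share a lexeme, the distance is the relative frequency of that lexeme in the data set, and 1 otherwise, as in the example. The function name `weighted_difference` and the lexeme labels are hypothetical.

```python
def weighted_difference(items):
    """Mean per-item distance between two dialects.

    `items` holds one (lexeme_a, lexeme_b, relative_frequency) tuple per item;
    relative_frequency is the share of dialects using lexeme_a for that item.
    """
    total = 0.0
    for lexeme_a, lexeme_b, relative_frequency in items:
        if lexeme_a == lexeme_b:
            total += relative_frequency  # shared rare lexeme: small distance
        else:
            total += 1.0                 # different lexemes: distance is 1
    return total / len(items)


# Frequencies copied from the example; labels "L1"/"L2" are placeholders.
middelstum_vs_ommen = [
    ("L1", "L1", 140 / 354),  # friend
    ("L1", "L1", 353 / 360),  # ship
    ("L1", "L2", 0.0),        # far (frequency unused: lexemes differ)
    ("L1", "L1", 176 / 360),  # are
    ("L1", "L1", 354 / 355),  # still
    ("L1", "L2", 0.0),        # push (frequency unused: lexemes differ)
]

print(f"{weighted_difference(middelstum_vs_ommen):.2f}")  # 0.81
```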

18

slide-19
SLIDE 19

Levenshtein distance

Grouw        mlk
Groningen    mlk
Haarlem      mlk
Almelo       mlk
Alveringem   mæk
Renesse      mælk
Polsbroek    mlk
Mechelen     mlk
Venray       mlk
Kerkrade     mlx

How to quantify differences between the dialect pronunciations?

19

slide-20
SLIDE 20

Levenshtein distance

  • Levenshtein distance was introduced in dialectology by Brett Kessler.
  • In 1995 he measured linguistic distances between Irish Gaelic dialects.
  • Later it was applied to Dutch, Sardinian, Norwegian, American English, German, Bulgarian and Bantu dialect/language varieties by others.

  • Calculate the cost of changing one string into another.

20

slide-21
SLIDE 21

Levenshtein distance

  • Example: milk may be pronounced as [mEl@k] in the dialect of Haarlem and as [mOlk@] in the dialect of Grouw.
  • Change the first pronunciation into the other:

mEl@k → mOl@k    subst. E/O   1
mOl@k → mOlk     delete @     1
mOlk  → mOlk@    insert @     1
total                         3

  • Many sequences of operations map [mEl@k] → [mOlk@]. The Levenshtein distance is the cost of the cheapest mapping.
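
A minimal Python sketch of the cheapest-mapping computation with binary weights, using the standard dynamic-programming formulation. It treats every character of the transcription string as one segment, which is an assumption that holds for the example strings but not for transcriptions with multi-character segments.

```python
def levenshtein(source, target):
    """Cost of the cheapest sequence of substitutions, insertions and deletions."""
    previous = list(range(len(target) + 1))  # empty source: insertions only
    for i, s in enumerate(source, start=1):
        current = [i]                        # empty target: deletions only
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # match (0) or substitution (1)
            ))
        previous = current
    return previous[-1]


print(levenshtein("mEl@k", "mOlk@"))  # 3
```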

21

slide-22
SLIDE 22

Levenshtein distance

  • Alignment:

1   2   3   4   5   6
m   E   l   @   k
m   O   l       k   @
    1       1       1

  • We keep track of the alignment length.
  • If multiple alignments all have the minimum cost, we calculate the length of the longest alignment.
  • The longest alignment has the greatest number of matches and is linguistically most plausible.
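
One way to keep track of the alignment length is to store a (cost, length) pair in every cell of the dynamic-programming table and, among equally cheap options, prefer the longer alignment. The sketch below assumes binary weights and one character per segment; the function name `align` is hypothetical and this is not necessarily how RuG/L04 or Gabmap implement it.

```python
def align(source, target):
    """Return (cost, length) of the cheapest alignment; ties go to the longest."""
    def key(cell):
        cost, length = cell
        return (cost, -length)  # lower cost first, then greater length

    previous = [(j, j) for j in range(len(target) + 1)]  # insertions only
    for i, s in enumerate(source, start=1):
        current = [(i, i)]                               # deletions only
        for j, t in enumerate(target, start=1):
            up, left, diagonal = previous[j], current[j - 1], previous[j - 1]
            current.append(min(
                (up[0] + 1, up[1] + 1),                     # deletion
                (left[0] + 1, left[1] + 1),                 # insertion
                (diagonal[0] + (s != t), diagonal[1] + 1),  # match/substitution
                key=key,
            ))
        previous = current
    return previous[-1]


print(align("mEl@k", "mOlk@"))  # (3, 6)
```

Every step adds one alignment slot, so comparing cells by cost first and by length second is enough to recover the longest of the cheapest alignments.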

22

slide-23
SLIDE 23

Alignment

  • In a linguistic alignment we ensure that the minimum cost is based on an alignment in which:
      • a vowel matches with a vowel
      • a consonant matches with a consonant
      • the [j] or [w] matches with a vowel
      • the [i] or [u] matches with a consonant
      • the schwa matches with a sonorant
  • A pair of pronunciations to be compared with Levenshtein distance preferably consists of cognates, as in all of the examples.

23

slide-24
SLIDE 24

Levenshtein distance

  • Variation among dialects is usually not measured on the basis of a single word, but on a set of words.
  • Assume for two dialects we calculate the Levenshtein distance for n word pairs.
  • How do we combine them to one distance, i.e. how do we calculate the aggregated distance?

24

slide-25
SLIDE 25

Calculating the aggregate

  • Example: calculate the distance in the sound components between Middelstum and Ommen on the basis of 6 words:

Word     Middelstum   Ommen     Sum of weights   Length of alignment
ship     sxIp         sxIp      0                4
cap      pEt          pEt@      1                4
called   rOupm        @rupm     2                6
jump     sprIN        sprINkt   2                7
cellar   kEl@r        kEld@r    1                6
house    hus          hys       1                3
Total                           7                30

  • ‘Raw distance’ is 7/6 = 1.17; normalized distance is 7/30 = 0.233 = 23.3%.
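
The aggregation itself is simple arithmetic. The sketch below takes the per-word weights and alignment lengths from the table as given; the function name `aggregate` is hypothetical.

```python
def aggregate(word_pairs):
    """Return (raw, normalized) distance from (word, cost, alignment_length) rows."""
    total_cost = sum(cost for _, cost, _ in word_pairs)
    total_length = sum(length for _, _, length in word_pairs)
    raw = total_cost / len(word_pairs)      # mean cost per word pair
    normalized = total_cost / total_length  # cost per alignment slot
    return raw, normalized


# (word, sum of weights, length of alignment), copied from the table
middelstum_vs_ommen = [
    ("ship", 0, 4), ("cap", 1, 4), ("called", 2, 6),
    ("jump", 2, 7), ("cellar", 1, 6), ("house", 1, 3),
]

raw, normalized = aggregate(middelstum_vs_ommen)
print(f"raw {raw:.2f}, normalized {normalized:.3f}")  # raw 1.17, normalized 0.233
```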

25

slide-26
SLIDE 26

Operation weights

  • In the examples above we used binary weights:
  • weight is 0 (match of two sounds) or 1 (substitution of one sound by another);
  • when a sound is inserted or deleted, the weight also is 1.
  • Refinement by using gradual PMI distances as operation weights.

26

slide-27
SLIDE 27

PMI-based Levenshtein distance

  • Introduced in dialectology by Martijn Wieling, Jelena Prokić and John Nerbonne in 2009.
  • Pointwise Mutual Information (PMI) assesses the degree of dependence between aligned segments.
  • Procedure:

repeat
      • compare each dialect to each other dialect by using Levenshtein distance (the first time with binary weights, later times with the newly calculated weights);
      • find new weights by analyzing the alignments: the more frequently segments co-occur in an alignment, the smaller the distance weight;
until the weights do not change any more.

  • Alignments made by PMI Levenshtein are better, see Wieling, Prokić and Nerbonne (2009).
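
The weight-update step can be illustrated with the textbook definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The sketch below is a simplification: it estimates the probabilities from a flat list of aligned segment pairs, and it turns PMI into a distance weight by subtracting it from the maximum PMI, which is an assumption made here for illustration. The toy pairs and the function name `pmi_table` are hypothetical.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """PMI for every segment pair (x, y) that occurs in the alignments."""
    n = len(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    left_counts = Counter(x for x, _ in aligned_pairs)
    right_counts = Counter(y for _, y in aligned_pairs)
    return {
        (x, y): math.log2((count / n) /
                          ((left_counts[x] / n) * (right_counts[y] / n)))
        for (x, y), count in pair_counts.items()
    }


# Toy data: segment pairs read off some alignments (hypothetical).
pairs = [("a", "a"), ("a", "a"), ("a", "b"), ("b", "b")]
table = pmi_table(pairs)

# Assumed conversion: frequent co-occurrence (high PMI) gives a small weight.
max_pmi = max(table.values())
weights = {pair: max_pmi - value for pair, value in table.items()}
for pair in sorted(weights):
    print(pair, round(weights[pair], 3))
```

In the full procedure these weights would replace the binary weights in the next round of Levenshtein alignments, and the loop would stop once the weights no longer change.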

27

slide-28
SLIDE 28

Application

  • Reeks Nederlandse Dialectatlassen, compiled by E. Blancquaert and W. Pée.

  • Texts from 1922–1975, 1956 local dialects, 139 sentences each.
  • We selected 361 dialects, 125 words.

28

slide-29
SLIDE 29

Distribution of the 361 dialects in the Dutch dialect area.

29

slide-30
SLIDE 30

Beam maps

  • Introduced by Goebl (± 1983).
  • Distances between dialects represented by lines among local dialects in a map.
  • Each local dialect is connected by a straight line with every other dialect.
  • Darker lines represent smaller distances, lighter lines represent larger distances.

30

slide-31
SLIDE 31

Beam maps: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right).

31

slide-32
SLIDE 32

Honeycomb maps

  • Exist since Haag (1898), and ‘reintroduced’ by Goebl (± 1983).
  • Shows distances between geographically neighboring dialects.
  • Related dialects are separated by lighter lines, and more remote dialects are separated by darker lines.

  • Cartographic inversion of beam maps.

32

slide-33
SLIDE 33

Honeycomb maps: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right).

33

slide-34
SLIDE 34

Cluster analysis

  • Given the distances between the objects, group the objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters) (definition from Wikipedia).
  • Hierarchical agglomerative clustering is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy (S.C. Johnson 1967).
  • In order to decide which clusters should be combined, we need distances between observations and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets (Wikipedia).
  • Example: if the distance between A and K is 2, and the distance between B and K is 3, then the distance between the set that contains A and B and K is (2 + 3)/2 = 2.5.

  • Introduced by Goebl (± 1982) in dialectometry.
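
A minimal pure-Python sketch of agglomerative clustering with group-average linkage, reproducing the A/B/K example. The distance of 1 between A and B is a hypothetical value added so that A and B merge first; the function name `group_average_clustering` is also hypothetical.

```python
from itertools import combinations


def group_average_clustering(labels, distance):
    """Merge the two closest clusters until one remains; return the merges.

    `distance` maps frozenset({a, b}) to the distance between items a and b.
    """
    def average(c1, c2):
        # Linkage criterion: mean of all pairwise distances between the sets.
        total = sum(distance[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((c1, c2, average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


distance = {
    frozenset(("A", "B")): 1.0,  # hypothetical
    frozenset(("A", "K")): 2.0,
    frozenset(("B", "K")): 3.0,
}
for merge in group_average_clustering(["A", "B", "K"], distance):
    print(merge)
```

The second merge joins K with the cluster (A, B) at height (2 + 3)/2 = 2.5.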

34

slide-35
SLIDE 35

Linkage criteria

  • UPGMA (Unweighted Pair Group Method using Arithmetic averages): dendrograms obtained by this method reflect the original distances in the distance matrix most closely (cophenetic correlation coefficient). Best choice in general.
  • WPGMA (Weighted Pair Group Method using Arithmetic averages): recommended in case of an irregular sampling distribution.
  • Ward’s method: minimizes the variance in the clusters; results in a well-balanced dendrogram in which all clusters have about the same size.

35

slide-36
SLIDE 36

Dendrograms obtained by clustering: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right). The most significant groups are distinguished by different colors. The labels are omitted.

36

slide-37
SLIDE 37

Dialect areas obtained by clustering: lexical relative difference values (left), lexical weighted difference values (middle) and pronunciation Levenshtein distances (right).

37

slide-38
SLIDE 38

Multidimensional scaling

  • Visualize dialect continuum.
  • Introduced by Embleton (1993) in dialectometry.
  • Given a geographic map, distances between locations can be measured.
  • Multidimensional scaling: given distances, locations on a map can be inferred.
  • In our case: from n × n distances we infer coordinates in 2- or 3-dimensional space. So n dimensions are reduced to two or three.

38

slide-39
SLIDE 39

Using MDS the 361 dimensions are reduced to 2. They explain 84.3% (left: lexical relative difference values), 45.0% (middle: lexical weighted difference values) and 51.9% (right: pronunciation Levenshtein distances) of the variance in the original distances. Labels are omitted.

39

slide-40
SLIDE 40

Using MDS the 361 dimensions are reduced to 3. They explain 89.5% (left: lexical relative difference values), 50.7% (middle: lexical weighted difference values) and 88.4% (right: pronunciation Levenshtein distances) of the variance in the original distances. Labels are omitted.

40

slide-41
SLIDE 41

Multidimensional scaling

  • When scaling to three dimensions, each dialect is represented by three values, i.e. a value for x, y and z.
  • Now let x be the intensity of red, y be the intensity of green and z be the intensity of blue.

  • This way each dialect gets its unique color!
  • Thus the dialect landscape is visualized as a dialect continuum.
  • Introduced by Nerbonne, Heeringa & Kleiweg (1999).
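
The colour mapping can be sketched as a rescaling of each MDS axis to the 0–255 range. The coordinates below are hypothetical, and the linear rescaling is an assumption made here; the maps in this presentation may use a different scaling or invert some axes.

```python
def mds_to_rgb(coordinates):
    """Rescale each of the three axes to 0..255 and read the result as (R, G, B)."""
    columns = list(zip(*coordinates.values()))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    colors = {}
    for name, point in coordinates.items():
        colors[name] = tuple(
            round(255 * (value - low) / (high - low)) if high > low else 0
            for value, low, high in zip(point, lows, highs)
        )
    return colors


# Hypothetical three-dimensional MDS coordinates for three dialects.
coordinates = {
    "A": (-1.0, 0.0, 2.0),
    "B": (1.0, 0.5, 0.0),
    "C": (0.0, 1.0, 1.0),
}
for name, rgb in mds_to_rgb(coordinates).items():
    print(name, rgb)
```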

41

slide-42
SLIDE 42

Three color dimensions for the pronunciation Levenshtein distances. Red represents the x-axis, green represents inversely the z-axis, and blue represents inversely the y-axis.

42

slide-43
SLIDE 43

Now we overlay the three maps.

43

slide-44
SLIDE 44

Map 3 major MDS dimensions to red, green and blue. Dialect islands (mainly town Frisian varieties) are marked with a diamond, and only these diamonds are colored.

44

slide-45
SLIDE 45

Dialect continua obtained by using multidimensional scaling: lex. relative difference values (left), lex. weighted difference values (middle) and pron. Levenshtein distances (right).

45

slide-46
SLIDE 46

Fuzzy clustering

  • Cluster analysis is relatively unstable: small changes in the distance matrix can lead to large changes in the clustering results.
  • Kleiweg et al. (2004): add noise (i.e. small random values, one or two standard deviations) to the measurements and perform cluster analysis.

  • Repeat this e.g. 50 times.
  • After that count how many times each cluster has appeared.
  • Clusters that appear in many runs of the analysis with added noise are particularly

stable ones.
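
A compact sketch of the noise procedure, using group-average clustering on a toy distance table. The parameters `runs`, `noise_sd` and `seed`, the Gaussian noise model, and the toy distances are all assumptions made for illustration, not the settings used by Kleiweg et al.

```python
import random
from collections import Counter
from itertools import combinations


def clusters_formed(labels, distance):
    """Set of clusters (frozensets) created during group-average clustering."""
    def average(c1, c2):
        total = sum(distance[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([label]) for label in labels]
    formed = set()
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        formed.add(c1 | c2)
    return formed


def cluster_stability(labels, distance, runs=50, noise_sd=0.5, seed=0):
    """Percentage of noisy runs in which each cluster appeared."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        noisy = {pair: max(0.0, d + rng.gauss(0.0, noise_sd))
                 for pair, d in distance.items()}
        counts.update(clusters_formed(labels, noisy))
    return {cluster: 100 * n / runs for cluster, n in counts.items()}


# Toy data: two tight pairs (A, B) and (C, D) that are far from each other.
labels = ["A", "B", "C", "D"]
distance = {frozenset(pair): 10.0 for pair in combinations(labels, 2)}
distance[frozenset("AB")] = 1.0
distance[frozenset("CD")] = 1.0

stability = cluster_stability(labels, distance)
for cluster, percentage in sorted(stability.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(cluster)), percentage)
```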

46

slide-47
SLIDE 47

Fuzzy clustering

  • Display the results in a probabilistic dendrogram.
  • For each cluster in the dendrogram show a percentage that indicates how many times each cluster was encountered in the repeated clustering with noise (see also Nerbonne et al. 2008).

  • Example on next slide obtained on the basis of group average.

47

slide-48
SLIDE 48

[Figure: probabilistic dendrogram of local dialects (Putten, Spankeren, Aalten, ..., Kerkrade, Tienen); each cluster is annotated with the percentage of noisy clustering runs in which it appeared.]

48

slide-49
SLIDE 49

What is Gabmap?

  • A web application that visualizes dialect variation:

Doing dialect analysis on the web

  • Developed by Peter Kleiweg under supervision of John Nerbonne.
  • Based on functions in the RuG/L04 package, which has existed since 2001 and has been freely distributed since 2004 (the maps shown in this presentation were also made with RuG/L04).
  • Gabmap has been developed since the end of 2010 and was first published on GitHub on June 4, 2011.

49

slide-50
SLIDE 50

What is Gabmap?

  • Original version available at:

http://www.let.rug.nl/~kleiweg/L04/webapp

  • Version forked and maintained by Çagri Çöltekin:

http://www.gabmap.nl/ and maintained by Martijn Wieling.

  • Peter Kleiweg developed a Docker image of Gabmap installed in Lubuntu 16.04, see:

https://github.com/pebbe/Gabmap-docker

50

slide-51
SLIDE 51

Input

  • Gabmap needs three input files:

1) a map, 2) dialect data, 3) a feature definition file

51

slide-52
SLIDE 52

Input: map

  • A map consists of at least:
      • an outline of the area;
      • placemarks for the locations where the data was collected. NB: place names should be spelled exactly as in your data file!

  • Optionally, more details can be added to the map, for example internal borders, rivers.
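
Since the place names in the map must match those in the data file exactly, a quick check before uploading can save time. The sketch below reads the Placemark names from a KML document with Python's standard library and reports data-file places that have no placemark. The KML snippet, its rough coordinates and the function names are hypothetical; the sketch assumes a plain .kml file in the KML 2.2 namespace (a .kmz file would have to be unzipped first).

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"


def placemark_names(kml_text):
    """Names of all Placemark elements in a KML document."""
    root = ET.fromstring(kml_text)
    return {
        (placemark.findtext(f"{KML_NS}name") or "").strip()
        for placemark in root.iter(f"{KML_NS}Placemark")
    }


def unmatched_places(kml_text, data_places):
    """Places from the data file that have no identically spelled placemark."""
    return sorted(set(data_places) - placemark_names(kml_text))


# Hypothetical map with two placemarks (coordinates are only approximate).
kml = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><name>Middelstum</name>
  <Point><coordinates>6.64,53.35,0</coordinates></Point></Placemark>
<Placemark><name>Ommen</name>
  <Point><coordinates>6.42,52.52,0</coordinates></Point></Placemark>
</Document></kml>"""

print(unmatched_places(kml, ["Middelstum", "Omen"]))  # ['Omen']
```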

52

slide-53
SLIDE 53

Input: map

  • The maps can be created with Google Earth or Google Maps.
  • For a manual about creating maps with Google Earth see: http://www.let.rug.nl/~kleiweg/L04/kml/manual.html
    and with Google Maps: http://coltekin.net/cagri/courses/leuven/
  • Save the map as a .kml or .kmz file.

53

slide-54
SLIDE 54

Input: dialect data

  • The dialect data should be in a table where:
      • the rows represent the locations where the data was collected;
      • the columns represent the data items.
  • Prepare the data file using LibreOffice Calc or Microsoft Excel.
  • Use the IPA chart Unicode keyboard at https://westonruter.github.io/ipa-chart/keyboard/ for finding the Unicode characters.
  • The chart covers the International Phonetic Alphabet revised to 2005.

54

slide-55
SLIDE 55

55

slide-56
SLIDE 56

56

slide-57
SLIDE 57

Input: dialect data

  • For uploading the data file in Gabmap it has to be a tab-separated plain text file encoded as Unicode (UTF-8 or UTF-16).
  • When loading an existing file in LibreOffice Calc, load the file as Unicode (UTF-8 or UTF-16) with the tab as separator.
  • Other types of data than transcriptions can be analyzed in Gabmap, too, especially categorical data.
  • See also the manual about preparing dialect data for Gabmap, which is found under Help in Gabmap.
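
If the data is prepared in a script rather than a spreadsheet, a tab-separated UTF-8 file can be written with Python's `csv` module. This is a sketch under assumptions: the function name `write_dialect_table` is hypothetical, and leaving the top-left header cell empty is a guess, so check the Gabmap manual for the exact header conventions. The transcriptions are taken from the Middelstum/Ommen examples earlier in the presentation.

```python
import csv
import os
import tempfile


def write_dialect_table(path, items, data):
    """Write one row per location and one column per item, tab-separated, UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([""] + items)  # header row; empty first cell is an assumption
        for place, forms in data.items():
            writer.writerow([place] + [forms.get(item, "") for item in items])


items = ["ship", "house"]
data = {
    "Middelstum": {"ship": "sxIp", "house": "hus"},
    "Ommen": {"ship": "sxIp", "house": "hys"},
}

# Written to the temporary directory here; use any path you like.
path = os.path.join(tempfile.gettempdir(), "dialect_data.txt")
write_dialect_table(path, items, data)
```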

57

slide-58
SLIDE 58

Input: feature definition file

  • The file IPA.def is found at: http://www.wjheeringa.nl/courses/dialectometry/datasets/IPA.def
  • Covers the Unicode characters of the IPA revised until 2005.
  • Using this file ensures that in an alignment of two pronunciations:
      • a vowel matches with a vowel
      • a consonant matches with a consonant
    and allows that:
      • the [j] or [w] matches with a vowel
      • the [i] or [u] matches with a consonant
      • the schwa matches with a sonorant
  • Substitutions, insertions and deletions have a weight of 1.

58

slide-59
SLIDE 59

Input: feature definition file

  • If two segments are the same, but they have different suprasegmentals and diacritics, the weight is 0.3.
  • Not processed are: primary stress, secondary stress, minor (foot) group, major (intonation) group, syllable break, linking (absence of a break).
  • NB: language-specific adjustments may be necessary! However, be careful when changing IPA.def.

59

slide-60
SLIDE 60

Running Gabmap

  • Now that we have a map, a table and a feature definition file, we can run Gabmap.

60

slide-61
SLIDE 61

61

slide-62
SLIDE 62

62

slide-63
SLIDE 63

63

slide-64
SLIDE 64

Demo

64

slide-65
SLIDE 65

Tack så mycket! (Thank you very much!)

65