Introduction to Dialectometry II Wilbert Heeringa German Academic - PowerPoint PPT Presentation

Introduction to Dialectometry II
Wilbert Heeringa
German Academic Exchange Service – DAAD
University of Bielefeld, Faculty of Linguistics and Literary Studies
Frisian Academy
Abidjan, December 19–23, 2016

Topics
• Validation of distance measures
• Consistency of distance measures
• Quality of classifications
• Cluster algorithms
• Fuzzy clustering
• Cophenetic multidimensional scaling maps
• Reference point maps

Validation of distance measures

Experiment
• In Norway everybody speaks dialect; there is no standard language.
• In the period 1999–2002, Jørn Almberg and Kristian Skarbø recorded about 50 Norwegian dialects.
• The fable ‘The North Wind and the Sun’ was taken as a basis.
• This text was also used in the IPA handbooks published in 1949 and 1999.
• Speakers were asked to translate the text into their own dialect and to read it aloud.
• Audio files and transcriptions are available at: http://www.ling.hf.ntnu.no/nos/

Experiment
• Perception experiment carried out in the spring of 2000 by Charlotte Gooskens.
• 15 recordings of 15 dialects were used.
• In each of the 15 locations, a group of 16 to 27 high school pupils listened to all 15 texts.
• The texts were presented in a randomized order.

Map: The geographic distribution of the 15 Norwegian dialects (Bodø, Verdal, Bjugn, Stjørdal, Fræna, Trondheim, Herøy, Lesja, Lillehammer, Bergen, Bø, Borre, Halden, Larvik, Time).

Experiment
• Task: for each text, each pupil rates the distance between the corresponding dialect and his or her own dialect.
• Scale from 1 (similar to own dialect) to 10 (not similar to own dialect).
• Final result: a 15 × 15 perceptual distance matrix.

Experiment

             Be   Bj   Bod  Bø   Bor  Fr   Ha   He   La   Le   Li   St   Ti   Tr   Ve
Bergen       1.7  9.0  8.2  8.0  7.7  7.7  8.2  6.9  8.0  8.9  8.5  8.4  4.8  8.5  8.0
Bjugn        9.1  3.4  6.4  8.2  9.2  5.8  8.3  8.0  8.4  7.3  9.1  2.2  8.0  3.3  2.8
Bodø         8.7  7.9  1.5  8.3  8.3  6.6  7.9  7.8  7.3  8.0  8.7  6.6  8.1  6.2  6.3
Bø           8.1  7.8  7.5  1.0  7.7  8.1  4.9  7.8  5.3  6.0  5.1  7.1  6.3  8.2  8.6
Borre        6.1  8.8  7.8  6.5  1.7  8.5  1.8  7.5  1.6  7.5  2.0  7.2  7.5  8.5  9.1
Fræna        9.0  7.5  7.1  8.4  8.8  3.1  8.1  7.8  8.5  7.2  9.0  6.6  7.4  6.1  7.6
Halden       7.0  8.2  8.0  6.8  4.0  8.1  2.8  7.9  2.8  6.6  3.0  7.4  7.0  8.0  8.3
Herøy        8.6  9.3  8.4  8.5  9.1  7.0  8.6  1.2  9.3  9.3  9.4  8.5  7.5  7.5  8.2
Larvik       7.4  8.7  7.6  4.0  4.0  7.7  3.2  5.6  3.4  7.1  4.6  8.2  6.8  8.3  7.5
Lesja        8.5  7.6  7.8  7.4  8.2  7.3  7.6  7.7  7.6  1.0  7.1  6.9  7.2  7.7  8.2
Lillehammer  6.7  8.3  8.1  6.2  4.4  8.0  3.1  7.5  4.1  7.3  2.7  7.6  6.8  8.7  8.1
Stjørdal     8.7  3.7  6.8  7.7  8.1  6.0  7.5  7.7  8.3  7.1  8.3  2.0  7.7  3.8  3.4
Time         7.0  9.3  8.4  8.1  8.4  8.3  8.0  7.2  8.2  9.1  8.8  8.8  1.8  8.8  9.0
Trondheim    7.8  5.8  6.7  7.5  6.4  7.3  6.0  7.1  5.9  7.9  6.3  4.4  7.6  3.3  6.8
Verdal       8.8  3.4  6.4  8.2  8.4  5.7  7.2  7.9  7.9  7.4  8.4  1.8  7.9  3.1  2.6

Perceptual distances among 15 Norwegian dialect varieties. Row names represent listener groups, column names represent dialect speakers. Columns follow the same order as the rows (Bod = Bodø, Bor = Borre).

Average perceptual distances between 15 Norwegian dialects. Darker lines connect closer points, lighter lines more remote ones. Distance pairs A – B / B – A are averaged.
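The averaging of the A – B / B – A pairs can be sketched in a few lines of Python. This is only an illustration: it uses the 3 × 3 corner of the perceptual matrix for Bergen, Bjugn and Bodø, and the function name `symmetrize` is an assumption, not something taken from the slides.

```python
# Rows are listener groups, columns are speakers.
# Values are the Bergen/Bjugn/Bodø corner of the perceptual distance matrix.
names = ["Bergen", "Bjugn", "Bodø"]
perceptual = [
    [1.7, 9.0, 8.2],  # Bergen listeners
    [9.1, 3.4, 6.4],  # Bjugn listeners
    [8.7, 7.9, 1.5],  # Bodø listeners
]


def symmetrize(matrix):
    """Average each pair (A, B) / (B, A); diagonal values stay as they are."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
            for i in range(n)]


averaged = symmetrize(perceptual)
for name, row in zip(names, averaged):
    print(f"{name:8s}", [round(value, 2) for value in row])
```

After averaging, the matrix is symmetric, so each pair of dialects has a single perceptual distance that can be drawn as one line on the map.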

Experiment
• Using the transcriptions, we measure lexical distances and pronunciation distances among the 15 local dialect varieties.
• Each dialect text usually consists of 58 different words.
• Validation: How well do the dialectometric distances correlate with the perceptual distances?
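A minimal sketch of such a validation, assuming both sets of distances are available as symmetric matrices over the same dialects. The two small matrices below are hypothetical values chosen for illustration, not data from the experiment, and the helper names `pearson` and `upper_triangle` are likewise assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def upper_triangle(matrix):
    """Each dialect pair once: the entries above the diagonal."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Hypothetical symmetric distance matrices for four dialects.
perceptual = [
    [0.0, 3.0, 8.0, 9.0],
    [3.0, 0.0, 7.5, 8.5],
    [8.0, 7.5, 0.0, 4.0],
    [9.0, 8.5, 4.0, 0.0],
]
measured = [
    [0, 20, 45, 50],
    [20, 0, 48, 44],
    [45, 48, 0, 25],
    [50, 44, 25, 0],
]

r = pearson(upper_triangle(perceptual), upper_triangle(measured))
print(f"r = {r:.2f}")
```

Only the pairs above the diagonal are used, so that each dialect pair contributes once and the self-distances on the diagonal do not inflate the correlation.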

Correlations (1)

lexical                      r     expl. var.
relative difference value    0.27   7%
weighted difference value    0.37  14%

pronunciation aggregate      r     expl. var.
Levenshtein (1)              0.71  50%
Levenshtein (2)              0.70  49%
Levenshtein (3)              0.67  45%
Levenshtein PMI (1)          0.71  50%
Levenshtein PMI (3)          0.67  45%
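The explained variance column is the squared correlation expressed as a percentage. A short sketch that recomputes it from the r values listed on this slide (the function name `explained_variance` is an assumption):

```python
# r values as listed on the slide.
correlations = {
    "lexical, relative difference value": 0.27,
    "lexical, weighted difference value": 0.37,
    "Levenshtein (1)": 0.71,
    "Levenshtein (2)": 0.70,
    "Levenshtein (3)": 0.67,
    "Levenshtein PMI (1)": 0.71,
    "Levenshtein PMI (3)": 0.67,
}


def explained_variance(r):
    """Percentage of variance explained: r squared times 100."""
    return 100 * r * r


for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}, explained variance = {explained_variance(r):.0f}%")
```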

Correlations (2)
• In the measurements binary weighting is used. Suprasegmentals and diacritics are ignored.
• No difference between ‘classic’ Levenshtein and PMI Levenshtein, but alignments made by PMI Levenshtein are better, see Wieling, Prokić and Nerbonne (2009).

Left: perceptual distances. Right: lexical weighted difference value distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.37.

Left: perceptual distances. Right: non-normalized Levenshtein distances. Darker lines connect closer points, lighter lines more remote ones. r = 0.71.

Consistency of distance measures

Consistency
• How many items do we need for dialect comparison? Rule of thumb: 100 items (Goebl).
• In order to answer this question more precisely, measure the degree to which different words in the data set give the same signal of linguistic relationships between the dialects: measure Cronbach’s Alpha.
• Example: measure Levenshtein distance between three dialects using four words. In this example we normalize Levenshtein distances per word pair.
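A minimal sketch of a Levenshtein distance normalized per word pair. Two things here are assumptions rather than details from the slides: the operations are weighted 0 or 1 (binary weighting), and the normalization divides by the length of the longer string and reports a percentage. Other normalizations, for example by alignment length, are possible and would give different values. The two transcriptions at the end are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1  # binary weighting
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance as a percentage of the longer string's length (an assumption)."""
    longest = max(len(a), len(b))
    return 100 * levenshtein(a, b) / longest if longest else 0.0


# Hypothetical transcriptions of one word in two dialects.
print(normalized_levenshtein("hart", "hɛrt"))
```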

Figure: Levenshtein distances between the Grouw, Haarlem and Almelo pronunciations of the word seen.

Figure: Levenshtein distances between the Grouw, Haarlem and Almelo pronunciations of the word hart.

Figure: Levenshtein distances between the Grouw, Haarlem and Almelo pronunciations of the word son.

Figure: Levenshtein distances between the Grouw, Haarlem and Almelo pronunciations of the word house.

Consistency
• General pattern: Haarlem and Almelo are linguistically relatively close to each other and relatively distant from Grouw.
• Levenshtein distances between the three local dialects:

                     seen  hart  son  house
Grouw vs. Haarlem      71    25  100     75
Grouw vs. Almelo       83    25   75     33
Haarlem vs. Almelo     60    20   50     50

• Using the values in the columns, the words are correlated with each other.

Consistency
• Correlations between words:

                  r      n
seen vs. hart     0.85   3
seen vs. son      0.48   3
seen vs. house   -0.43   3
hart vs. son      0.87   3
hart vs. house    0.11   3
son vs. house     0.59   3

• The average inter-correlation r̄ is 0.41.

Consistency
• Cronbach’s α can be written as a function of the number of words and the average inter-correlation among the words:

$$\alpha = \frac{n_w \times \bar{r}}{1 + (n_w - 1) \times \bar{r}}$$

where $n_w$ is the number of words, which is 4 in our example.
• Calculation:

$$\alpha = \frac{4 \times 0.41}{1 + (4 - 1) \times 0.41} = 0.74$$

• If all words have the same geographic distribution of variants, the value of Cronbach’s alpha is 1; if there is no consistency between the words in the data set, the value is 0.
• A generally accepted threshold for consistency of the data is 0.70.
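The worked example can be reproduced with a short Python sketch. It takes the Levenshtein distances for the four words from the three dialect pairs, correlates every pair of words, and applies the formula for α. The function names `pearson` and `cronbach_alpha` are assumptions made for this illustration.

```python
import math
from itertools import combinations

# Distances per word for Grouw vs. Haarlem, Grouw vs. Almelo, Haarlem vs. Almelo.
distances = {
    "seen": [71, 83, 60],
    "hart": [25, 25, 20],
    "son": [100, 75, 50],
    "house": [75, 33, 50],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def cronbach_alpha(columns):
    """Return (alpha, mean inter-correlation) for a list of word columns."""
    pairs = list(combinations(columns, 2))
    mean_r = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    n_w = len(columns)
    alpha = n_w * mean_r / (1 + (n_w - 1) * mean_r)
    return alpha, mean_r


alpha, mean_r = cronbach_alpha(list(distances.values()))
print(f"mean r = {mean_r:.2f}, alpha = {alpha:.2f}")  # mean r = 0.41, alpha = 0.74
```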

Consistency
• In general: the more items are included, the higher Cronbach’s Alpha.
• If the Cronbach’s Alpha value is very low, add more items!
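The effect of adding items can be illustrated with the formula for α itself. The sketch below assumes, as a simplification, that the average inter-correlation stays fixed at the 0.41 of the worked example while words are added; in real data it will change with the word set. The function names `alpha_for` and `words_needed` are hypothetical.

```python
def alpha_for(n_words, mean_r):
    """Cronbach's alpha for n_words items with average inter-correlation mean_r."""
    return n_words * mean_r / (1 + (n_words - 1) * mean_r)


def words_needed(mean_r, threshold=0.70):
    """Smallest number of words (at least 2) for which alpha reaches the threshold."""
    if not 0 < mean_r <= 1 or not 0 < threshold < 1:
        raise ValueError("mean_r must be in (0, 1] and threshold in (0, 1)")
    n_words = 2
    while alpha_for(n_words, mean_r) < threshold:
        n_words += 1
    return n_words


mean_r = 0.41  # assumed constant; taken from the four-word example
for n_words in (2, 4, 10, 50, 100):
    print(n_words, round(alpha_for(n_words, mean_r), 2))
print("words needed for alpha >= 0.70:", words_needed(mean_r))
```

For a positive average inter-correlation, α increases with every added word and approaches 1, which is why adding items is the remedy for a low value.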

Consistency

Figure: two plots of Cronbach’s alpha (vertical axis) against the number of words (horizontal axis).

Left: Cronbach’s α values for random subsets of 2 through 107 words (lexical weighted difference values) and 360 local dialects. From 86 words on, α is always higher than 0.70. For 107 words α is equal to 0.75.

Right: Cronbach’s α values for random subsets of 2 through 125 words (Levenshtein distance) and 360 local dialects. From 13 words on, α is always higher than 0.70. For 125 words α is equal to 0.97.

Quality of classifications

Quality of classifications
• For clustering, compare cophenetic distances to original distances.
• For multidimensional scaling, compare interpoint multidimensional scaling distances to original distances.

Cophenetic distances
• In a dendrogram the distances between clusters are represented by the length of the branches.

Figure: dendrogram of Grouw, Delft, Haarlem, Hattem and Lochem on a distance scale from 0 to 40.

• Cophenetic distance: distance between two local dialects as found in the dendrogram.
• Find the shortest path between the two local dialects in the dendrogram; the cophenetic distance is the longest distance covered in one direction along that path.

Cophenetic distances

          Grouw  Haarlem  Delft  Hattem  Lochem
Grouw         0       44     44      44      44
Haarlem      44        0     16   36.25   36.25
Delft        44       16      0   36.25   36.25
Hattem       44    36.25  36.25       0      20
Lochem       44    36.25  36.25      20       0
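The table can be derived from the dendrogram by taking, for each pair of dialects, the height at which they are first joined. The sketch below encodes the dendrogram as nested tuples; this representation and the function names are assumptions, and the merge heights are read off from the table rather than from the original figure.

```python
# Dendrogram as nested (height, left, right) tuples; leaves are dialect names.
tree = (44.0,
        "Grouw",
        (36.25,
         (16.0, "Haarlem", "Delft"),
         (20.0, "Hattem", "Lochem")))


def leaves(node):
    """All dialect names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cophenetic(node, table=None):
    """Map each ordered pair of dialects to the height of the node joining them."""
    if table is None:
        table = {}
    if isinstance(node, str):
        return table
    height, left, right = node
    # Dialects on opposite sides of this node are first joined at this height.
    for a in leaves(left):
        for b in leaves(right):
            table[(a, b)] = table[(b, a)] = height
    cophenetic(left, table)
    cophenetic(right, table)
    return table


coph = cophenetic(tree)
print(coph[("Haarlem", "Hattem")])  # 36.25
```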

Cophenetic distances
• Cophenetic correlation coefficient: measure of how faithfully the pairwise distances between local dialects as suggested by the dendrogram preserve the original pairwise distances.
• Correlate the pairwise cophenetic distances with the original pairwise distances: r = 0.99
• The amount of variance in the original distances explained by the cophenetic distances is r² × 100 = 97.6%.

Interpoint multidimensional scaling distances
• With multidimensional scaling the five local dialects are plotted in two-dimensional space so that the distances are preserved as well as possible.

Figure: Grouw, Lochem, Hattem, Haarlem and Delft plotted by first dimension (horizontal axis) and second dimension (vertical axis).

Interpoint multidimensional scaling distances
• We can calculate interpoint distances between the local dialects.

Figure: Grouw, Lochem, Hattem, Haarlem and Delft plotted by first dimension (horizontal axis) and second dimension (vertical axis).

• Distance between Grouw (−24, 21) and Hattem (18, 6):

$$\sqrt{(-24 - 18)^2 + (21 - 6)^2} = 44.6$$
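The same interpoint distance can be computed in Python. Only the two coordinates given on the slide are used; the dictionary and function names are assumptions made for this sketch.

```python
import math

# Coordinates (first dimension, second dimension) as given on the slide.
coordinates = {
    "Grouw": (-24, 21),
    "Hattem": (18, 6),
}


def interpoint_distance(a, b):
    """Euclidean distance between two dialects in the two-dimensional plot."""
    (x1, y1), (x2, y2) = coordinates[a], coordinates[b]
    return math.hypot(x1 - x2, y1 - y2)


print(round(interpoint_distance("Grouw", "Hattem"), 1))  # 44.6
```

Computing this for every pair of dialects gives the interpoint distances that are compared to the original distances.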
