Data in computational historical linguistics
Gerhard Jäger
ESSLLI 2016
Data sources

Background
The comparative method strongly focuses on two types of data: regular sound correspondences and morphological paradigms.
Both are not very suitable for computational approaches, because:
  morphological categories are not easily comparable across languages, especially if we look beyond individual language families; also, isolating languages have no morphology
  identifying regular sound correspondences automatically is a surprisingly hard problem, due to data sparseness; it is currently one of the hot topics and far from resolved (List, 2014; Hruschka et al., 2015; Bouchard-Côté et al., 2013)

Background
What we need (especially if we apply statistical methods):
  data types which are applicable to all natural languages
  ideally lots of data
Current practice:
  word lists + expert annotations about cognacy (currently the dominant paradigm)
  unannotated word lists in phonetic transcriptions
  discrete grammatical categorizations (compiled by human experts)

Cognate-coded Swadesh lists

Cognate-coded Swadesh lists: Swadesh lists
Collections of 100–200 concepts (there are different versions).
Core vocabulary:
  not culture dependent
  diachronically stable, i.e. resistant both against semantic change and against borrowing
Proposed by Morris Swadesh (Swadesh, 1955, 1971) to facilitate an early attempt to automatize certain tasks in historical linguistics.
Popular among computational historical linguists because it is a standard.
See List (2016) for a thoughtful discussion of the notion of cognacy.

Cognate-coded Swadesh lists: Cognates
Cognates are words that have the same origin:
  Latin filius ⇒ French fils, Italian figlio
Traditionally, cognacy excludes loanwords, but terminology among computationalists is sometimes less strict:
  Latin persona ⇒ English person would also qualify as a cognate pair
On average, the closer two languages are related, the more cognate pairs they share.

Cognate-coded Swadesh lists: Cognates
During language change, the word for a given concept is sometimes replaced by a non-cognate one.
Causes: semantic change, borrowing, morphological word formation.
  'bone': Old High German Bein (cognate to Engl. bone) ⇒ New High German Knochen
  Bein is still part of the German lexicon, but it now means 'leg'
Cognate replacement is comparable to a mutation in biological evolution.

Cognate-coded Swadesh lists: Cognates
Caveats
Cognacy is not binary, but a matter of degree:
  English woman ⇐ Old English wiff-man
  the first component is cognate to wife, German Weib etc., and the second component to man, German Mann etc.
  Are woman and Weib cognate or not?
For distantly related languages, experts often disagree about cognacy:
  Ancient Greek ὕλη / Latin silva 'woods'

Cognate-coded Swadesh lists: IELex
Indo-European Lexical Cognacy Database
  freely available online at http://ielex.mpi.nl/
  based on Dyen et al. (1992)
  current version curated by a group at MPI Nijmegen
  recently migrated to the new MPI in Jena; the new version is not public yet

Cognate-coded Swadesh lists: IELex
207-item Swadesh lists for 135 Indo-European languages.
Words in orthographic and partially in phonetic transcription (IPA).
Entries are assigned to cognate classes.
Sample entries:

language       iso_code  gloss  global_id  local_id  transcription  cognate_class
ELFDALIAN      qov       woman  962        woman     kɛ̀lɪŋg         woman:Ag
DUTCH          nld       woman  962        woman     vrɑu           woman:B
GERMAN         deu       woman  962        woman     fraŭ           woman:B
DANISH         dan       woman  962        woman     g̥ʰvenə         woman:D
DANISH_FJOLDE  -         woman  962        woman     kvinʲ          woman:D
GUTNISH_LAU    -         woman  962        woman     kvɪnːˌfolk     woman:D
LATIN          lat       woman  962        woman     mulier         woman:E
LATIN          lat       woman  962        woman     feːmina        woman:G
ENGLISH        eng       woman  962        woman     wʊmən          woman:H
GERMAN         deu       woman  962        woman     vaĭp           woman:H
DANISH         dan       woman  962        woman     d̥ɛːmə          woman:K
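To illustrate how such cognate-coded entries can be used, here is a minimal Python sketch. It assumes each entry has been reduced to a (language, gloss, cognate_class) triple; the row subset, the function names and the "shared class" criterion are illustrative assumptions, not part of IELex itself.

```python
from collections import defaultdict

# A subset of the sample entries, reduced to (language, gloss, cognate_class).
# Assumption: this pairing of languages and classes is read off the slide.
ROWS = [
    ("ELFDALIAN", "woman", "woman:Ag"),
    ("DUTCH", "woman", "woman:B"),
    ("GERMAN", "woman", "woman:B"),
    ("DANISH", "woman", "woman:D"),
    ("ENGLISH", "woman", "woman:H"),
    ("GERMAN", "woman", "woman:H"),
    ("DANISH", "woman", "woman:K"),
]


def cognate_sets(rows):
    """Map each cognate class to the sorted list of languages attesting it."""
    sets = defaultdict(set)
    for language, _gloss, cognate_class in rows:
        sets[cognate_class].add(language)
    return {cls: sorted(langs) for cls, langs in sets.items()}


def classes_by_language(rows):
    """Map language -> gloss -> set of cognate classes (synonyms allowed)."""
    table = defaultdict(lambda: defaultdict(set))
    for language, gloss, cognate_class in rows:
        table[language][gloss].add(cognate_class)
    return {lang: dict(glosses) for lang, glosses in table.items()}


def shared_cognate_fraction(table, lang_a, lang_b):
    """Fraction of jointly attested concepts with at least one shared class."""
    concepts_a = table.get(lang_a, {})
    concepts_b = table.get(lang_b, {})
    common = set(concepts_a) & set(concepts_b)
    if not common:
        return 0.0
    shared = sum(1 for c in common if concepts_a[c] & concepts_b[c])
    return shared / len(common)


if __name__ == "__main__":
    table = classes_by_language(ROWS)
    for cls, langs in sorted(cognate_sets(ROWS).items()):
        print(cls, langs)
    print(shared_cognate_fraction(table, "GERMAN", "ENGLISH"))
```

With a single concept the fraction is trivially 0 or 1; over a full Swadesh list it gives a rough measure of how many cognate pairs two languages share.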

Cognate-coded Swadesh lists: Other publicly available cognacy data sources
Austronesian Basic Vocabulary Database: http://language.psy.auckland.ac.nz/austronesian/
Ten collections of cognate-coded Swadesh lists from various language families, collected by Johann-Mattis List [1].
Ten collections of short (40-100 items) cognate-coded Swadesh lists from various language families, collected by Søren Wichmann and Eric Holman [2].
88 cognate-coded Swadesh lists from Central-Asian languages [3].

[1] List, J.-M. (2014): Data from: Sequence comparison in historical linguistics. GitHub Repository. http://github.com/SequenceComparison/SupplementaryMaterial. Release: 1.0.
[2] Supplementary material to Wichmann and Holman (2013)
[3] Supplementary material to Mennecier et al. (2016)

Phonetically transcribed Swadesh lists

Phonetically transcribed Swadesh lists: The Automated Similarity Judgment Program (ASJP)
Project originally hosted at MPI EVA in Leipzig around Søren Wichmann.
Since 2009; currently version 17 (2016).
Covers more than 7,000 languages and dialects (4,574 languages with ISO code).
Basic vocabulary of 40 words for each language, in uniform phonetic transcription.
Freely available at http://asjp.clld.org/
Used concepts: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name

Phonetically transcribed Swadesh lists: The Automated Similarity Judgment Program (ASJP)
Phonetic transcription
  41 sound classes, all coded as ASCII characters
  various diacritics to capture finer phonetic distinctions, e.g.
    ph~ : aspirated p
    a* : nasalized a
    hkw$ : pre-aspirated labialized k
Metadata
  language family, language genus, classification according to Ethnologue and Glottolog
  geographic location
  population size
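The diacritic examples suggest a simple way to split an ASJP string into segments. The sketch below assumes, based only on the three examples given, that `~` binds the two preceding symbols into one segment, `$` binds the three preceding symbols, and `*` attaches to the preceding symbol. The function name is hypothetical and other ASJP modifiers are not handled.

```python
def tokenize_asjp(word):
    """Split an ASJP-coded word into a list of segments.

    Assumed conventions (inferred from the examples ph~, hkw$ and a*):
      ~  joins the two preceding symbols into one segment
      $  joins the three preceding symbols into one segment
      *  marks the preceding symbol (nasalization)
    """
    segments = []
    for ch in word:
        if ch == "~":
            if len(segments) < 2:
                raise ValueError(f"'~' needs two preceding symbols: {word!r}")
            segments[-2:] = ["".join(segments[-2:]) + ch]
        elif ch == "$":
            if len(segments) < 3:
                raise ValueError(f"'$' needs three preceding symbols: {word!r}")
            segments[-3:] = ["".join(segments[-3:]) + ch]
        elif ch == "*":
            if not segments:
                raise ValueError(f"'*' needs a preceding symbol: {word!r}")
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


if __name__ == "__main__":
    for example in ["ph~", "a*", "hkw$", "akw~a"]:
        print(example, tokenize_asjp(example))
```

Segmenting first matters for any alignment or distance computation, since otherwise a modifier character would be counted as a sound of its own.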

Phonetically transcribed Swadesh lists: The Automated Similarity Judgment Program (ASJP)
ASJP sound classes (from Brown et al. 2013)

ASJP code  IPA symbols           Description
p          p, ɸ                  voiceless bilabial stop and fricative
b          b, β                  voiced bilabial stop and fricative
m          m                     bilabial nasal
f          f                     voiceless labiodental fricative
v          v                     voiced labiodental fricative
8          θ, ð                  voiceless and voiced dental fricative
4          n̪                     dental nasal
t          t                     voiceless alveolar stop
d          d                     voiced alveolar stop
s          s                     voiceless alveolar fricative
z          z                     voiced alveolar fricative
c          ts, dz                voiceless and voiced alveolar affricate
n          n                     alveolar nasal
S          ʃ                     voiceless post-alveolar fricative
Z          ʒ                     voiced post-alveolar fricative
C          ʧ                     voiceless palato-alveolar affricate
j          ʤ                     voiced palato-alveolar affricate
T          c, ɟ                  voiceless and voiced palatal stop
5          ɲ                     palatal nasal
k          k                     voiceless velar stop
g          g                     voiced velar stop
x          x, ɣ                  voiceless and voiced velar fricative
N          ŋ                     velar nasal
q          q                     voiceless uvular stop
G          ɢ                     voiced uvular stop
X          χ, ʁ, ħ, ʕ            voiceless and voiced uvular fricative, voiceless and voiced pharyngeal fricative
7          ʔ                     voiceless glottal stop
h          h, ɦ                  voiceless and voiced glottal fricative
l          l                     voiced alveolar lateral approximant
L          ʟ, ɭ, ʎ               all other laterals
w          w                     voiced bilabial-velar approximant
y          j                     palatal approximant
r          ɾ, r, ʀ, ɽ            voiced apico-alveolar flap and all other varieties of "r-sounds"
!          !, ǀ, ǁ, ǂ            all varieties of "click-sounds"
i          i, ɪ, y, ʏ            high front vowel, rounded and unrounded
e          e, ø                  mid front vowel, rounded and unrounded
E          æ, ɛ, œ, ɶ            low front vowel, rounded and unrounded
3          ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ   high and mid central vowel, rounded and unrounded
a          a, ɐ                  low central vowel, unrounded
u          ɯ, u                  high back vowel, rounded and unrounded
o          ɤ, ʌ, ɑ, o, ɔ, ɒ      mid and low back vowel, rounded and unrounded
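The sound-class table amounts to a many-to-one mapping from IPA symbols to ASCII codes. A minimal sketch of such a conversion follows; it covers only a subset of the table, handles single-character IPA symbols only (no affricate digraphs, no combining diacritics), and the names `IPA_TO_ASJP` and `ipa_to_asjp` are hypothetical.

```python
# Partial IPA -> ASJP mapping taken from the sound-class table.
# Assumption: one Unicode character per IPA symbol; anything else is unknown.
IPA_TO_ASJP = {
    "p": "p", "ɸ": "p", "b": "b", "β": "b", "m": "m",
    "f": "f", "v": "v", "θ": "8", "ð": "8",
    "t": "t", "d": "d", "s": "s", "z": "z", "n": "n",
    "ʃ": "S", "ʒ": "Z", "ʧ": "C", "ʤ": "j", "ɲ": "5",
    "k": "k", "g": "g", "ɡ": "g", "x": "x", "ŋ": "N",
    "q": "q", "ɢ": "G", "χ": "X", "ʁ": "X", "ħ": "X", "ʕ": "X",
    "ʔ": "7", "h": "h", "ɦ": "h", "l": "l", "w": "w", "j": "y",
    "r": "r", "ɾ": "r", "ʀ": "r", "ɽ": "r",
    "i": "i", "ɪ": "i", "y": "i", "ʏ": "i", "e": "e", "ø": "e",
    "æ": "E", "ɛ": "E", "œ": "E", "ə": "3", "ɨ": "3",
    "a": "a", "ɐ": "a", "u": "u", "ɯ": "u",
    "o": "o", "ɔ": "o", "ɑ": "o", "ʌ": "o", "ɒ": "o",
}


def ipa_to_asjp(ipa, unknown="?"):
    """Convert an IPA string to ASJP codes, one character at a time.

    Characters missing from the (partial) mapping are replaced by `unknown`
    so that gaps stay visible instead of being silently dropped.
    """
    return "".join(IPA_TO_ASJP.get(ch, unknown) for ch in ipa)


if __name__ == "__main__":
    print(ipa_to_asjp("fɪʃ"))
    print(ipa_to_asjp("θɪŋ"))
```

Because several IPA symbols collapse into one class (for example θ and ð both become 8), the conversion is lossy by design and cannot be inverted.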

Phonetically transcribed Swadesh lists: The Automated Similarity Judgment Program (ASJP)

concept    Latin           English
I          ego             Ei
you        tu              yu
we         nos             wi
one        unus            w3n
two        duo             tu
person     persona, homo   %pers3n
fish       piskis          fiS
dog        kanis           dag
louse      pedikulus       laus
tree       arbor           tri
leaf       foly~u*         lif
skin       kutis           %skin
blood      saNgw~is        bl3d
bone       os              bon
horn       kornu           horn
ear        auris           ir
eye        okulus          Ei
nose       nasus           nos
tooth      dens            tu8
tongue     liNgw~E         t3N
knee       genu            ni
hand       manus           hEnd
breast     pektus, mama    brest
liver      yekur           liv3r
drink      bibere          drink
see        widere          si
hear       audire          hir
die        mori            dEi
come       wenire          k3m
sun        sol             s3n
star       stela           star
water      akw~a           wat3r
stone      lapis           ston
fire       iNnis           fEir
path       viya            pE8
mountain   mons            %maunt3n
night      noks            nEit
full       plenus          ful
new        nowus           nu
name       nomen           nem
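Unannotated word lists like these are typically compared with string distances. The sketch below uses a length-normalized Levenshtein distance as one commonly used option; the slides do not prescribe this measure, and the function names are hypothetical. It works on raw characters, so it is only applied here to Latin/English pairs from the table that contain no diacritics.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_distance(pairs):
    """Average normalized distance over (word_a, word_b) pairs."""
    if not pairs:
        raise ValueError("need at least one word pair")
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


# (Latin, English) pairs copied from the table; diacritic-free entries only.
LATIN_ENGLISH = {
    "you": ("tu", "yu"),
    "we": ("nos", "wi"),
    "two": ("duo", "tu"),
    "star": ("stela", "star"),
    "new": ("nowus", "nu"),
    "name": ("nomen", "nem"),
}

if __name__ == "__main__":
    for concept, (latin, english) in LATIN_ENGLISH.items():
        print(concept, latin, english, round(normalized_distance(latin, english), 2))
    print("mean:", round(mean_distance(list(LATIN_ENGLISH.values())), 2))
```

For entries with `~`, `$` or `*`, the words would first have to be split into segments and the distance computed over segment lists rather than characters.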
