  1. Language change as a random walk in vector space Gerhard Jäger Tübingen University, Department of Linguistics Cluster Colloquium Machine Learning in Science Cluster of Excellence Machine Learning , Tübingen, July 23, 2019

  2. Introduction

  3. Language change and evolution
     German:  Vater Unser im Himmel, geheiligt werde Dein Name
     Dutch:   Onze Vader in de Hemel, laat Uw Naam geheiligd worden
     English: Our Father in heaven, hallowed be your name
     Danish:  Fader Vor, du som er i himlene! Helliget vorde dit navn

  4. Language change and evolution

  5. Language change and evolution
     Middle High German: Got vater unser, dâ du bist in dem himelrîche gewaltic alles des dir ist, geheiliget sô werde dîn nam
     Old High German: Fater unser thû thâr bist in himile, si giheilagôt thîn namo
     Gothic: Atta unsar þu in himinam, weihnai namo þein

  6. Convergent evolution
     • Old English docga > English dog
     • Proto-Paman *gudaga > Mbabaram dog (‘dog’)
     Two unrelated lineages arrived at the same form and meaning by chance: look-alikes are not always cognates.

  7. Language phylogeny
     Comparative method
     1. identifying cognates, i.e. obviously related morphemes in different languages, such as new/nowy, two/dwa, or water/voda
     2. reconstructing the common ancestor and the sound laws that explain the change from reconstructed to observed forms
     3. applying this iteratively leads to phylogenetic language trees

  8. Language phylogeny
     Scope of the method
     • reconstructed vocabulary shrinks with growing time depth
     • the maximal time horizon seems to be about 8,000 years
     • grammatical morphemes and categories are arguably more stable and less prone to borrowing
     • problems here: limited number of features, cross-linguistic variation constrained by language universals, frequent convergent evolution
     • the comparative method is hard to apply in regions with high linguistic diversity and without written documents (Paleo-America, Papua)
     • a tree structure might be inappropriate if there is a significant effect of language contact (cf. Australia)

  9. Computational methods
     • both cognate detection and tree construction lend themselves to algorithmic implementation
     • advantages:
       • easy to scale up
       • comparability of results
       • affords statistical evaluation
     • disadvantages:
       • cognacy judgments require lots of linguistic insight and experience
       • tree construction should be subject to historical (including archeological) and geographical plausibility

  10. From words to trees
      Swadesh lists → (training pair-Hidden Markov Model) → sound similarities → (applying pair-Hidden Markov Model) → word alignments → (classification/clustering) → cognate classes → (feature extraction) → character matrix → (Bayesian phylogenetic inference) → phylogenetic tree

  16. From words to trees
      [World map of the languages in the sample, labeled by family: Khoisan, Nilo-Saharan, Niger-Congo, Afro-Asiatic, Indo-European, Uralic, Dravidian, Altaic, Sino-Tibetan, Tai-Kadai, Hmong-Mien, Austro-Asiatic, Austronesian, Trans-New Guinea, Torricelli, Sepik, Timor-Alor-Pantar, Australia/Papua, and the American families (Na-Dene, Algic, Salish, Penutian, Uto-Aztecan, Otomanguean, Mayan, Chibchan, Arawakan, Cariban, Panoan, Tucanoan, Tupian, Quechuan), among others.]

  17. From word lists to distances

  18. The Automated Similarity Judgment Program (ASJP)
      • project at MPI EVA in Leipzig, centered around Søren Wichmann
      • covers more than 6,000 languages and dialects
      • basic vocabulary of 40 words for each language, in uniform phonetic transcription
      • freely available
      The 40 concepts: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name

  19. Automated Similarity Judgment Program
      concept   Latin           English  |  concept   Latin          English
      I         ego             Ei       |  nose      nasus          nos
      you       tu              yu       |  tooth     dens           tu8
      we        nos             wi       |  tongue    liNgw~E        t3N
      one       unus            w3n      |  knee      genu           ni
      two       duo             tu       |  hand      manus          hEnd
      person    persona, homo   pers3n   |  breast    pektus, mama   brest
      fish      piskis          fiS      |  liver     yekur          liv3r
      dog       kanis           dag      |  drink     bibere         drink
      louse     pedikulus       laus     |  see       widere         si
      tree      arbor           tri      |  hear      audire         hir
      leaf      foly~u*         lif      |  die       mori           dEi
      skin      kutis           skin     |  come      wenire         k3m
      blood     saNgw~is        bl3d     |  sun       sol            s3n
      bone      os              bon      |  star      stela          star
      horn      kornu           horn     |  water     akw~a          wat3r
      ear       auris           ir       |  stone     lapis          ston
      eye       okulus          Ei       |  fire      iNnis          fEir

  20. Word distances
      • based on string alignment
      • baseline: Levenshtein alignment ⇒ count matches and mismatches
      • too crude, as it totally ignores sound correspondences
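The baseline can be sketched in a few lines of plain Python (the function names are mine, not from the talk): the Levenshtein distance counts the insertions, deletions, and substitutions needed to turn one word into the other, and dividing by the longer word's length gives the normalized distance (LDN).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # match / mismatch
        prev = cur
    return prev[-1]

def ldn(a: str, b: str) -> float:
    """Levenshtein distance normalized by the longer word's length."""
    return levenshtein(a, b) / max(len(a), len(b))

# ASJP transcriptions of English 'fish' vs. Latin 'piskis' (see the table
# above): f~p and S~s are regular correspondences, yet each still costs a
# full mismatch, so the distance comes out high.
print(ldn("fiS", "piskis"))  # 5/6 ≈ 0.83
```

This illustrates the crudeness noted above: every substitution costs the same, regardless of how plausible the sound correspondence is.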

  21. How well does normalized Levenshtein distance predict cognacy?
      [Figure: distributions of LDN for cognate vs. non-cognate word pairs (left), and the empirical probability of cognacy as a function of LDN (right).]

  22. Problems
      • binary distinction: match vs. non-match
      • genuine sound correspondences in cognates are frequently missed:

            c v a i     n a z 3     f i S - - -
            t u - -     n o s -     p i s k i s

      • corresponding sounds count as mismatches even if they are aligned correctly:

            h a n t     h a n t
            h E n d     m a n o

      • substantial amount of chance similarities

  23. Capturing sound correspondences
      • weighted alignment using Pointwise Mutual Information (PMI, a.k.a. log-odds):

            s(a, b) = log [ p(a, b) / (q(a) · q(b)) ]

      • p(a, b): probability of sound a being etymologically related to sound b in a pair of cognates
      • q(a): relative frequency of sound a
      • Needleman-Wunsch algorithm: given a matrix of pairwise PMI scores between individual symbols and two strings, it returns the alignment that maximizes the aggregate PMI score
      • but first we need to estimate p(a, b) and q(a), q(b) for all sound classes a and b
      • q(a): relative frequency of occurrence of segment a in all words in ASJP
      • p(a, b): that's a bit more complicated...
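A minimal sketch of this scoring scheme (the probabilities and the gap penalty below are made-up illustrations, not values from the talk): PMI turns the estimated probabilities into an additive score, and Needleman-Wunsch maximizes the total score of a global alignment by dynamic programming.

```python
import math

def pmi(p_ab: float, q_a: float, q_b: float) -> float:
    """Pointwise mutual information (log-odds) score for a sound pair."""
    return math.log(p_ab / (q_a * q_b))

def needleman_wunsch(x: str, y: str, score, gap: float = -2.5) -> float:
    """Score of the global alignment of x and y that maximizes the
    aggregate score.  `score(a, b)` would be the PMI score of aligning
    sound a with sound b; `gap` is an assumed gap penalty (the slide
    does not specify one)."""
    n, m = len(x), len(y)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap          # x-prefix aligned against gaps
    for j in range(1, m + 1):
        S[0][j] = j * gap          # y-prefix aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    return S[n][m]

# A pair that co-occurs more often than chance gets a positive score:
print(pmi(0.01, 0.05, 0.05))  # log(4) ≈ 1.386
```

The same recursion with a traceback yields the alignment itself; only the optimal score is computed here to keep the sketch short.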

  24. Substitution matrix for the ASJP data
      1. identify a large sample of pairs of closely related languages (using expert information or heuristics based on aggregated Levenshtein distance), e.g.:
         An.NORTHERN_PHILIPPINES.CENTRAL_BONTOC / An.MESO-PHILIPPINE.NORTHERN_SORSOGON
         An.SOUTHERN_PHILIPPINES.KAGAYANEN / An.NORTHERN_PHILIPPINES.LIMOS_KALINGA
         WF.WESTERN_FLY.IAMEGA / WF.WESTERN_FLY.GAMAEWE
         An.MESO-PHILIPPINE.CANIPAAN_PALAWAN / An.NORTHWEST_MALAYO-POLYNESIAN.LAHANAN
         Pan.PANOAN.KASHIBO_BAJO_AGUAYTIA / Pan.PANOAN.KASHIBO_SAN_ALEJANDRO
         NC.BANTOID.LIFONGA / NC.BANTOID.BOMBOMA_2
         AA.EASTERN_CUSHITIC.KAMBAATA_2 / AA.EASTERN_CUSHITIC.HADIYYA_2
         IE.INDIC.WAD_PAGGA / IE.INDIC.TALAGANG_HINDKO
         ST.BAI.QILIQIAO_BAI_2 / ST.BAI.YUNLONG_BAI
         NC.BANTOID.LINGALA / NC.BANTOID.LIFONGA
         An.SULAWESI.MANDAR / An.SULAWESI.TANETE
         An.CENTRAL_MALAYO-POLYNESIAN.BALILEDO / An.CENTRAL_MALAYO-POLYNESIAN.PALUE
         An.OCEANIC.RAGA / An.SAMA-BAJAW.BOEPINANG_BAJAU
         AuA.MUNDA.HO / AuA.MUNDA.KORKU

  25. Substitution matrix for the ASJP data
      2. pick a concept and a pair of related languages at random
         • languages: Pen.MAIDUAN.MAIDU_KONKAU, Pen.MAIDUAN.NE_MAIDU
         • concept: one
      3. find the corresponding words from the two languages:
         • nisam, niSem
      4. do a Levenshtein alignment:

            n i s a m
            n i S e m

      5. for each sound pair, count the number of correspondences:
         • nn: 1; ii: 1; sS: 1; ae: 1; mm: 1
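Steps 4 and 5 can be sketched in Python as a Levenshtein DP with a traceback, followed by a counter over the aligned sound pairs (a simplification: how gaps and ties between equally good alignments are handled is my assumption, not specified on the slide).

```python
from collections import Counter

def levenshtein_align(a: str, b: str):
    """Levenshtein DP plus traceback; returns the aligned symbol pairs,
    with None marking a gap."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:          # walk back along one optimal path
        if (i > 0 and j > 0 and
                D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]

# Count sound correspondences over the (here: single) aligned word pair,
# skipping gap positions.
counts = Counter(p for p in levenshtein_align("nisam", "niSem")
                 if None not in p)
print(counts)  # nn, ii, sS, ae, mm: one each
```

Aggregated over many random concept/language-pair draws, these counts give the raw material for estimating p(a, b).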
