SLIDE 1

Factoring lexical and phonetic phylogenetic characters from word lists

Gerhard Jäger & Johann-Mattis List

Tübingen University & CRLAO / Team AIRE, Paris

QITL-6

November 5, 2015

Jäger & List (Tübingen/Paris) Factoring Phylogenetic Characters Tübingen 1 / 43

SLIDE 2

Introduction

Introduction

computational historical linguistics is a young and thrilling field

one of its major challenges is the collection of suitable data resources

there are many methods from computational biology which can be used to make very fine-grained inferences about language history

BUT: these methods require data organized in character matrices, and high-quality data of this type is currently available only for a very small number of languages

SLIDE 3

Introduction

Introduction

an alternative approach uses pairwise sequence alignment and distance-based phylogenetic inference methods

the disadvantage of these techniques is that they have a strong black-box character

BUT: they can be used with raw, unprocessed data, such as the small word lists for over 6000 languages collected by the ASJP project

SLIDE 4

Introduction

Introduction

[Figure: many copies of the ASJP word list data (HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk], WOOD [wud]). Labels: HUGE DATA, SMALL DATA, SMART ALGORITHM (illustrated by Bayes' rule, P(A|B) = P(B|A) P(A) / P(B)), BLACK BOX, and "? ? ?".]

SLIDE 5

Introduction

Introduction

Can we come up with a workflow that brings the ASJP data into the character-matrix format, so that we can test the best algorithms for phylogenetic inference along with the largest collection of language data?

SLIDE 6

Introduction

Introduction

[Figure: workflow diagram with the stages ASJP word list data, PMI analysis, PMI guide tree, cognate detection, cognate alignment, parsimony filtering, ML phylogenetic reconstruction, and phylogenetic tree.]

SLIDE 7

Materials Data

ASJP Data

[Figure: the workflow diagram from slide 6.]

SLIDE 8

Materials Data

The Automated Similarity Judgment Program

largest collection of cross-linguistic word lists

covers more than 6,000 languages and dialects

basic vocabulary of 40 words for each language, in uniform phonetic transcription

freely available

concepts used: I, you, we, one, two, person, fish, dog, louse, tree, leaf, skin, blood, bone, horn, ear, eye, nose, tooth, tongue, knee, hand, breast, liver, drink, see, hear, die, come, sun, star, water, stone, fire, path, mountain, night, full, new, name

SLIDE 9

Materials Data

Automated Similarity Judgment Program

concept   Dutch           English
I         ik              Ei
you       yEi, y3         yu
we        vEi, v3         wi
one       en              8is
two       tve             8Et
person    %pErson, mEns   %pers3n
fish      vis             fiS
dog       hont            dag
louse     l3is            laus
tree      bom             tri
leaf      blat            lif
skin      h3id, vEl       skin
blood     blut            bl3d
bone      ben             bon
horn      horn            horn
ear       or              ir
eye       oX              Ei
nose      nes             nos
tooth     tant            tu8
tongue    toN             t3N
knee      kni             ni
hand      hant            hEnd
breast    borst           brest
liver     lev3r           liv3r
drink     driNk3n         drink
see       zin             si
hear      hor3n           hir
die       stErv3n         dEi
come      kom3n           k3m
sun       zon             s3n
star      ster            star
water     vat3r           wat3r
stone     sten            ston
fire      vir             fEir

SLIDE 10

Materials Pre-Processing

PMI-Preprocessing

[Figure: the workflow diagram from slide 6.]

SLIDE 11

Materials Pre-Processing

Determining distances between word lists

two steps:

compute similarity/distance between individual word forms

aggregate word distances to doculect distances

SLIDE 12

Materials Pre-Processing

Word distances

based on string alignment

baseline: Levenshtein alignment ⇒ count matches and mismatches
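The baseline can be sketched as follows. This is a minimal illustration, assuming unit costs for substitutions and gaps and a hypothetical gap symbol "-"; the function names are not taken from the talk or from any library.

```python
def levenshtein_alignment(a, b, gap="-"):
    """Align two strings with unit edit costs and return the aligned symbol pairs."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitution, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def count_matches(pairs):
    """Return (matches, mismatches); gapped positions count as mismatches."""
    matches = sum(x == y for x, y in pairs)
    return matches, len(pairs) - matches


alignment = levenshtein_alignment("hant", "hEnd")
counts = count_matches(alignment)
```

For the Dutch and English words for "hand", the alignment pairs each sound with its counterpart, and the counts give the raw material for a simple word distance.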

SLIDE 13

Materials Pre-Processing

Word distances

weighted alignment using Pointwise Mutual Information (a.k.a. log-odds):

$$s(a, b) = \log \frac{p(a, b)}{q(a)\, q(b)}$$

p(a, b): probability of sound a being etymologically related to sound b in a pair of cognates

q(a): relative frequency of sound a

[Figure: two example alignments scored position by position. hant / hEnd: Σ = 4.80; hant / mano: Σ = −11.85.]
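The scoring scheme can be sketched in a few lines. The probabilities below are hypothetical toy values, not the trained ASJP parameters, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def pmi(p_ab, q_a, q_b):
    """s(a, b) = log( p(a, b) / (q(a) * q(b)) ); natural log assumed."""
    return math.log(p_ab / (q_a * q_b))


def alignment_score(pairs, p, q):
    """Sum the PMI scores over the aligned sound pairs (gap-free alignments only)."""
    return sum(pmi(p[(a, b)], q[a], q[b]) for a, b in pairs)


# Hypothetical toy values for illustration only.
q = {"h": 0.25, "a": 0.25, "E": 0.25}
p = {("h", "h"): 0.125, ("a", "E"): 0.03125}

score = alignment_score([("h", "h"), ("a", "E")], p, q)
```

A pair that co-occurs in cognates more often than chance predicts gets a positive score, a pair that co-occurs less often gets a negative one, and the word-level score is the sum over aligned positions.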

SLIDE 14

Materials Pre-Processing

Word distances

parameters are trained by extracting a large number of likely cognate pairs from ASJP (cf. Jäger 2013, 2015)

[Figure: heat map of the trained PMI scores for all pairs of ASJP sound symbols (o u i E e 3 a y 8 z d r Z j L l S h s t T C c w f b v m p G q X x 7 g k N n ! 4 5 on both axes); colour scale from −12 to 10.]

SLIDE 15

Materials Pre-Processing

Aggregating word distances

English (columns) / Swedish (rows)

         Ei       yu       wi       w3n      tu       fiS      ...
yog     −7.77     0.75    −7.68    −7.90    −8.57   −10.50
du      −7.62     0.33    −5.71    −7.41     2.66    −8.57
vi      −2.72    −2.83     4.04    −1.34    −6.45     0.70
et      −5.47    −7.87    −5.47    −6.43    −1.83    −4.70
tvo     −7.91    −4.27    −3.64    −4.57     0.39    −6.98
fisk    −7.45   −11.2     −3.07    −9.97    −8.66     7.58
...

SLIDE 16

Materials Pre-Processing

Aggregating word distances

[Figure: two panels, English/Swedish and English/Swahili, each comparing the PMI scores (y-axis) at diagonal and off-diagonal positions (x-axis).]

distance between two word lists is a measure for how much the distribution along the diagonal differs from the distribution off the diagonal
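One simple way to turn this idea into a number is sketched below. It is an assumption for illustration, not the authors' formula: for every diagonal score (same concept in both lists), take the share of off-diagonal scores that are at least as high, and average those shares. The matrix is the English/Swedish excerpt shown on the previous slide.

```python
def list_distance(pmi_matrix):
    """Average share of off-diagonal scores that reach or exceed each diagonal score.

    Values near 0 mean the diagonal stands out (similar lists);
    values near 1 mean it does not (dissimilar lists).
    """
    n = len(pmi_matrix)
    diagonal = [pmi_matrix[i][i] for i in range(n)]
    off_diagonal = [pmi_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    shares = [sum(o >= d for o in off_diagonal) / len(off_diagonal) for d in diagonal]
    return sum(shares) / len(shares)


# Rows: Swedish yog, du, vi, et, tvo, fisk; columns: English Ei, yu, wi, w3n, tu, fiS.
english_swedish = [
    [-7.77, 0.75, -7.68, -7.90, -8.57, -10.50],
    [-7.62, 0.33, -5.71, -7.41, 2.66, -8.57],
    [-2.72, -2.83, 4.04, -1.34, -6.45, 0.70],
    [-5.47, -7.87, -5.47, -6.43, -1.83, -4.70],
    [-7.91, -4.27, -3.64, -4.57, 0.39, -6.98],
    [-7.45, -11.2, -3.07, -9.97, -8.66, 7.58],
]

d = list_distance(english_swedish)
```

A rank-based comparison like this does not depend on the absolute scale of the PMI scores, which is one reason to prefer it over a plain mean of the diagonal.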

SLIDE 17

Materials Pre-Processing

Aggregating word distances

some examples

A          B            d(A, B)
English    English      0.0078
English    Scots        0.2139
Danish     Swedish      0.2773
English    Swedish      0.3981
English    Frisian      0.4215
English    Dutch        0.4040
Hindi      Farsi        0.6231
English    French       0.7720
English    Hindi        0.7735
Amharic    Vietnamese   0.8566
Swahili    Warlpiri     0.8573
Navajo     Dyirbal      0.8436
Japanese   Haida        0.8504
English    Swahili      0.8901

SLIDE 18

Materials Pre-Processing

Phylogenetic inference

dissimilarities are fed into a distance-based phylogenetic inference algorithm (a variant of Neighbor Joining, as implemented in the FastME software)

[Figure: trees inferred from the PMI distances. Labels include language families (Uralic, Hmong-Mien, Chukotko-Kamchatkan, Japonic, Dravidian, Austronesian, Nakh-Daghestanian, Tai-Kadai, Tungusic, Sino-Tibetan, Mongolic, Yeniseian, Ainu, Austroasiatic, Nivkh, Indo-European, Turkic, Yukaghir, Niger-Congo, Uto-Aztecan, Mayan, Quechuan, Altaic, Khoisan, Nilo-Saharan, Kadugli, Timor-Alor-Pantar, Afro-Asiatic, Australian), support values between 96.8% and 100%, and areas (Africa, Eurasia, Papunesia, Australia, America; Subsaharan Africa, NW Eurasia, Australia/Papua, SE Asia, Papua).]

this talk:

adopt the Glottolog classification into language families

compute a PMI tree for each family of size between 10 and 70
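To make the distance-based step concrete, here is a sketch of plain Neighbor Joining. It is neither FastME nor the variant used in the talk; it returns only the topology as nested tuples, and the four-taxon distance matrix is a hypothetical toy example.

```python
def neighbor_joining(names, dist):
    """Return the Neighbor Joining topology as nested tuples.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    Branch lengths are not computed in this sketch.
    """
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the nodes that are still active.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        # Join the pair that minimises the Q criterion.
        a, b = min(pairs, key=lambda ab: (n - 2) * d[ab[0]][ab[1]] - r[ab[0]] - r[ab[1]])
        new = (a, b)
        d[new] = {}
        for c in nodes:
            if c != a and c != b:
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
    return tuple(nodes)


# Hypothetical toy distances consistent with the tree ((A, B), (C, D)).
toy = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
tree = neighbor_joining(["A", "B", "C", "D"], toy)
```

In the setting of the talk, one such tree would be computed per language family from the aggregated word-list distances.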

SLIDE 19

Methods Cognate Detection

Cognate Detection

[Figure: the workflow diagram from slide 6.]

SLIDE 20

Methods Cognate Detection

Cognate Detection with LexStat

LexStat is an algorithm for automatic cognate detection, freely available as part of the LingPy software package (http://lingpy.org). The algorithm proceeds in two stages:

A distance calculation: calculate pairwise distances between all words in a given meaning slot, using the Sound-Class-Based Phonetic Alignment algorithm (SCA).

B flat clustering: partition the words using the distance scores with the help of an agglomerative clustering algorithm that terminates when a user-defined threshold is reached.

SLIDE 21

Methods Cognate Detection

Cognate Detection with LexStat

          word    GREEK   GERMAN   ENGLISH   RUSSIAN   POLISH
GREEK     xeri    0.00    0.73     0.90      0.87      0.99
GERMAN    hant    0.73    0.00     0.25      0.93      1.00
ENGLISH   hEnd    0.90    0.25     0.00      1.00      1.00
RUSSIAN   ruka    0.87    0.93     1.00      0.00      0.08
POLISH    rE*ka   0.99    1.06     1.00      0.08      0.00

[Figure: panel A shows the distance matrix above; panel B shows the clustering of xeri, hant, hEnd, ruka, rE*ka, with cluster labels 1, 2, 3.]
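Stage B can be sketched as average-linkage clustering that stops at a threshold. This is an illustration only, not LingPy's implementation: the threshold of 0.6 is a hypothetical value, and the GERMAN/POLISH distance is taken as 1.00 in both directions to keep the matrix symmetric.

```python
def flat_cluster(labels, dist, threshold):
    """Merge the closest clusters (average linkage) until none is closer than threshold."""
    clusters = [[i] for i in range(len(labels))]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        candidates = [
            (average_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        best, x, y = min(candidates)
        if best >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[labels[i] for i in cluster] for cluster in clusters]


labels = ["GREEK", "GERMAN", "ENGLISH", "RUSSIAN", "POLISH"]
distances = [
    [0.00, 0.73, 0.90, 0.87, 0.99],
    [0.73, 0.00, 0.25, 0.93, 1.00],
    [0.90, 0.25, 0.00, 1.00, 1.00],
    [0.87, 0.93, 1.00, 0.00, 0.08],
    [0.99, 1.00, 1.00, 0.08, 0.00],  # GERMAN/POLISH symmetrised to 1.00 (assumption)
]

cognate_sets = flat_cluster(labels, distances, threshold=0.6)
```

With these values the words for "hand" fall into three sets: Greek on its own, German with English, and Russian with Polish.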

SLIDE 22

Methods Cognate Detection

Evaluation against IELex

Comparison of automatic and manual cognate classification

[Figure: histogram of the corrected Rand index for individual concepts; x-axis: Rand index from 0.0 to 1.0, y-axis: frequency.]

SLIDE 23

Methods Cognate Detection

Sometimes it works well ...

noc B nat B nax B noc B not B nat B nos B noapte B naktis B noC B nat B niC B not B noc B nat3 B nEit B nat B noC B nu3t B noC B nat B nat3 B nEt B noc B not B noc B noz B noSt B noapte B note B nita B noCe B noC B nit B nox B noC B nout B nEt B not B nakc B noyt3 B nui B notte B noC B nat B ratri D rat D rat D rat D rat D rat D rat D rat D radur D rati D rat D ratri7 D Sab A Sap A Sev A Sab A SEw A Spa A Spa A EXsEvE A EXSEv A ei E

oxE E 5ot B giSer C

SLIDE 24

Methods Cognate Detection

... and sometimes not

yog A yo A ya A Zo A yas A ya A yes A yEy A yo A yo A yaz A j3 A ya A ya A ya A yEi A yeg A ya A ya A yEu A ya A ya A ya A man A man A moe A ami A mi C min A mai A man A mEn A me A mi A mi A mE A mama A ik A Ek A Eg A ig A ik A ik A ik A eu A Ei A eu A io A az A Ez A ez A woz A iX A yEx A Ex A es A aS A un3 A un3 A z3 A exo A EZ A ew A

SLIDE 25

Methods Cognate Detection

Cognate Detection with LexStat

Once cognate sets are identified, the data can be converted into a binary character matrix.
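The conversion can be sketched as follows: every cognate class becomes one binary character that is 1 for the doculects whose word belongs to the class and 0 otherwise. The class labels below are illustrative and follow the "hand" example; the function name is hypothetical.

```python
def binary_matrix(cognate_ids):
    """Turn a doculect -> cognate-class mapping into presence/absence rows.

    Returns the sorted class labels and one 0/1 row per doculect.
    """
    classes = sorted(set(cognate_ids.values()))
    rows = {
        doculect: [1 if class_id == c else 0 for c in classes]
        for doculect, class_id in cognate_ids.items()
    }
    return classes, rows


# Illustrative cognate classes for the concept "hand".
hand = {"GREEK": 1, "GERMAN": 2, "ENGLISH": 2, "RUSSIAN": 3, "POLISH": 3}
classes, rows = binary_matrix(hand)
```

Repeating this for every concept and concatenating the columns yields the character matrix that phylogenetic software expects.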

SLIDE 26

Methods Cognate Detection

Cognate Detection with LexStat

DOCULECT   WORD (concept: hand)   COGNATE ID
GREEK      xeri                   1
GERMAN     hant                   2
ENGLISH    hEnd                   2
RUSSIAN    ruka                   3
POLISH     rE*ka                  3

DOCULECT   1   2   3
GREEK      1   0   0
GERMAN     0   1   0
ENGLISH    0   1   0
RUSSIAN    0   0   1
POLISH     0   0   1

SLIDE 27

Methods Filtering characters via the Retention Index

Retention Index

[Figure: the workflow diagram from slide 6.]

SLIDE 28

Methods Filtering characters via the Retention Index

Retention Index

most parsimonious reconstruction gives the minimal number of mutations, given a phylogeny

SLIDE 32

Methods Filtering characters via the Retention Index

Retention Index

minimal number of mutations: number of states − 1

maximal number of mutations: number of taxa − number of occurrences of the most frequent state

number of avoidable mutations: maximal number of mutations − minimal number of mutations

number of mutations avoided in T: maximal number of mutations − (minimal) number of mutations in T

Retention Index (RI) of a tree T:

$$\mathrm{RI}(T) = \frac{\text{number of mutations avoided in } T}{\text{number of avoidable mutations}}$$
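The definitions above can be sketched in code. The sketch assumes a strictly binary tree written as nested 2-tuples and uses Fitch's algorithm to count the minimal number of mutations on that tree; the function names and the toy trees are hypothetical.

```python
def fitch(tree, states):
    """Return (possible root states, minimal number of mutations) for a binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_mutations = fitch(left, states)
    right_states, right_mutations = fitch(right, states)
    common = left_states & right_states
    if common:
        return common, left_mutations + right_mutations
    return left_states | right_states, left_mutations + right_mutations + 1


def retention_index(tree, states):
    """RI(T) = (max - observed) / (max - min); None if no mutation is avoidable."""
    values = list(states.values())
    minimal = len(set(values)) - 1
    maximal = len(values) - max(values.count(v) for v in set(values))
    observed = fitch(tree, states)[1]
    avoidable = maximal - minimal
    if avoidable == 0:
        return None
    return (maximal - observed) / avoidable


four_taxa = {"A": 1, "B": 1, "C": 0, "D": 0}
ri_good = retention_index((("A", "B"), ("C", "D")), four_taxa)
ri_bad = retention_index((("A", "C"), ("B", "D")), four_taxa)

six_taxa = {"A": 1, "B": 1, "C": 0, "D": 1, "E": 0, "F": 0}
ri_half = retention_index(((("A", "B"), "C"), (("D", "E"), "F")), six_taxa)
```

A character whose states follow the tree needs only the minimal number of mutations and gets RI = 1; a character that conflicts with the tree as much as possible gets RI = 0.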

SLIDE 33

Methods Filtering characters via the Retention Index

Retention Index

RI = 1

SLIDE 34

Methods Filtering characters via the Retention Index

Retention Index

RI = 1/2

SLIDE 35

Methods Filtering characters via the Retention Index

Retention Index

RI = 0

SLIDE 36

Methods Filtering characters via the Retention Index

Retention Index

we considered all (binary) characters, i.e. automatically obtained cognate classes, with an RI < 0.4 relative to the PMI tree as “bad” characters

[Figure: density plot of the RI (x-axis, 0.0 to 1.0) for automatically obtained cognate classes vs. manually obtained cognate classes from IELex.]

SLIDE 37

Methods Multiple Alignment

Multiple Alignment

[Figure: the workflow diagram from slide 6.]

SLIDE 38

Methods Multiple Alignment

Sound-Class-Based Phonetic Alignment

SCA is a “traditional” algorithm for multiple phonetic alignment: it relies on standard components such as a guide tree, but employs two specific linguistic models which account for syntagmatic and paradigmatic aspects of phonetic sequences:

A it assigns sounds to classes within which transitions are judged to be more probable than across classes (following an idea of Dolgopolsky 1964)

B it uses prosodic profiles (List 2014) to account for the fact that different positions of a phonetic sequence are differently prone to change

SLIDE 39

Methods Multiple Alignment

SCA: Sound Classes

SLIDE 40

Methods Multiple Alignment

SCA: Sound Classes

Sound Classes

Sounds which frequently occur in correspondence relations in genetically related languages can be clustered into classes (types), assuming that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35).

SLIDE 41

Methods Multiple Alignment

SCA: Sound Classes

[Figure: sixteen consonants to be grouped into sound classes: k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z.]

SLIDE 44

Methods Multiple Alignment

SCA: Sound Classes

Sound Classes Sounds which frequently occur in correspondence relation in genetically related languages can be clustered into classes (types), assuming that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35).

Resulting sound classes: K, T, P, S

slide-45
SLIDE 45

Methods Multiple Alignment

SCA: Prosodic Profiles

Prosodic Strings: Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as a prosodic string in which contexts are encoded by specific symbols.

Example: sonority profile of jabəlka

j a b ə l k a
↑ △ ↑ △ ↓ ↑ △

↑ ascending, △ maximum, ↓ descending
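The step from a sound sequence to its sonority profile can be sketched in a few lines of Python. This is only an illustration: the numeric sonority scale in `SONORITY` and the function name `prosodic_profile` are assumptions made for this example, not values or names taken from SCA itself. A segment counts as a maximum when it is more sonorous than both neighbours, as ascending when the next segment is more sonorous, and as descending otherwise.

```python
# Hypothetical sonority scale (higher = more sonorous); the numbers are an
# assumption for this sketch, not the values used by SCA.
SONORITY = {"p": 1, "b": 1, "k": 1, "l": 5, "j": 6, "a": 7, "ə": 7}


def prosodic_profile(segments):
    """Label each segment as ascending (↑), maximum (△) or descending (↓)."""
    values = [SONORITY[s] for s in segments]
    profile = []
    for i, v in enumerate(values):
        # Word edges are treated as having sonority 0.
        prev = values[i - 1] if i > 0 else 0
        nxt = values[i + 1] if i + 1 < len(values) else 0
        if v > prev and v > nxt:
            profile.append("△")   # local sonority peak
        elif nxt > v:
            profile.append("↑")   # sonority rises towards the next segment
        else:
            profile.append("↓")   # sonority falls (or stays level)
    return profile


print(" ".join(prosodic_profile(list("jabəlka"))))
```

With the toy scale above, the word jabəlka receives the same profile as on the slide. Sequences of equally sonorous segments (for example two adjacent vowels) are not handled in any principled way by this sketch.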

[Figure: sonority profile of jabəlka (↑ △ ↑ △ ↓ ↑ △), with each position marked as prosodically strong or weak]

Prosodic string for jabəlka:

j a b ə l k a
# v C v c C >

slide-52
SLIDE 52

Methods Multiple Alignment

SCA: Multi-tiered Sequence Representation

External Representation
  IPA                          j    a    b    ə    l    k    a

Internal Representation
  Dolgopolsky Sound Classes    J    V    P    V    L    K    V
  SCA Sound Classes            J    A    P    E    L    K    A
  ASJP Sound Classes           y    a    b    I    l    k    a
  Prosodic String              #    V    C    V    c    C    >
  Relative Gap Weight          2.0  1.5  1.5  1.3  1.1  1.5  0.7
  ...                          ...  ...  ...  ...  ...  ...  ...

slide-53
SLIDE 53

Methods Multiple Alignment

SCA: Workflow

Input sequences: jabl̩ko, jabəlka, jabləkə, japkɔ

Stage 1: Sound-class conversion

  jabl̩ko  → JAPLKU
  jabəlka → JAPELKA
  jabləkə → JAPLEKE
  japkɔ   → JAPKU

Stage 2: Library creation (pairwise alignments)

  JAP-LKU     JAPL-KU     JAPLKU     JAPEL-KA    ...
  JAPELKA     JAPLEKE     JAP-KU     JAP-LEKE    ...

Stage 3: Distance calculation

             JAPLKU  JAPELKA  JAPLEKE  JAPKU
  JAPLKU      0.00    0.14     0.34    0.12
  JAPELKA     0.14    0.00     0.46    0.28
  JAPLEKE     0.34    0.46     0.00    0.44
  JAPKU       0.12    0.28     0.44    0.00

Stage 4: Cluster analysis

  [Figure: guide tree over JAPLKU, JAPELKA, JAPLEKE, JAPKU]

Stage 5: Progressive alignment

  J A P - L K U
  J A P E L K A
  still to be added: JAPLEKE, JAPKU

  More sequences? (yes / no)

Stage 6: Iterative refinement

  J A P - L - K U
  J A P E L - K A
  J A P - L E K E
  still to be added: JAPKU

Stage 7: Swap check

  J A P - L - K U
  J A P E L - K A
  J A P - L E K E
  J A P - - - K U

Stage 8: IPA conversion

  J A P … → j a b …
  J A P … → j a b …
  J A P … → j a b …
  J A P … → j a p …

Output MSA

  j a b - l̩ - k o
  j a b ə l - k a
  j a b - l ə k ə
  j a p - - - k ɔ
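Stage 1 of the workflow, the conversion of IPA strings into sound-class strings, amounts to a table lookup. The sketch below is a toy version under stated assumptions: the table `SOUND_CLASSES` contains only the correspondences that can be read off the four example words in the workflow, and the function name `to_sound_classes` is hypothetical. Diacritics such as the syllabicity mark are simply dropped, which is a simplification.

```python
import unicodedata

# Toy lookup table: only the correspondences visible in the workflow example
# (e.g. b and p both map to P, o and ɔ both map to U). Not the full SCA model.
SOUND_CLASSES = {
    "j": "J", "a": "A", "b": "P", "p": "P", "l": "L",
    "k": "K", "o": "U", "ɔ": "U", "ə": "E",
}


def to_sound_classes(word):
    """Convert an IPA string into a string of sound-class symbols."""
    classes = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            continue  # skip combining diacritics (simplifying assumption)
        classes.append(SOUND_CLASSES[ch])
    return "".join(classes)


for word in ["jabl\u0329ko", "jabəlka", "jabləkə", "japkɔ"]:
    print(word, "→", to_sound_classes(word))
```

Because voiced and voiceless stops fall into the same class, the four words become much more similar after conversion, which is what makes the later alignment stages easier.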

slide-65
SLIDE 65

Methods Multiple Alignment

T-Coffee

Progressive alignment

• start with a guide tree (here: PMI tree)
• working bottom-up, at each internal node, do pairwise alignment of the block alignments at the daughter nodes
• complexity is O(n^2 k^3) ⇒ computationally feasible
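The pairwise step that progressive alignment repeats at every internal node can be illustrated with a plain Needleman-Wunsch alignment of two sequences. This is a simplified sketch, not the procedure used in the talk: it aligns two single sequences rather than two blocks, and it uses uniform match, mismatch and gap scores (an assumption) instead of sound-class or PMI scores. The function name `align` is hypothetical.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two strings; return both gapped rows and the score."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


print(align("JAPLKU", "JAPELKA"))
```

For the two sound-class strings from the SCA example, the sketch places the single gap opposite E. Aligning whole blocks works the same way, except that each cell compares two alignment columns instead of two symbols.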

slide-66
SLIDE 66

Methods Multiple Alignment

T-Coffee

• progressive alignment only uses (phylogenetically) local information
• erroneous decisions cannot be corrected later

Example (words are added one at a time along the guide tree):

  dendron + 8enro      adding dru      adding tri

  dendron              dendron         dendron
  8en-ro-              8en-ro-         8en-ro-
                       d---ru-         d---ru-
                                       ---tri-

slide-70
SLIDE 70

Methods Multiple Alignment

T-Coffee

Tree-based Consistency Objective Function for alignment Evaluation (Notredame et al., 2000), slightly adapted for linguistic application:

1. pairwise alignment for all word pairs, using PMI scores
2. ternary alignments via relation composition
3. indirect alignment scores between sound occurrences
4. progressive alignment using those scores

[Figure: pairwise alignments of dendron, 8enro, dru and tri, and the ternary alignments composed from them]

Resulting alignment:

  dendron
  8en-ro-
  d---ru-
  t---ri-
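The idea behind step 2, composing two pairwise alignments through a shared third word, can be sketched as follows. This is a minimal illustration under assumptions: alignments are given as two gapped rows, support is counted without weights (T-Coffee proper weights each pair), and the function names `matched_pairs` and `compose` are hypothetical.

```python
def matched_pairs(row_a, row_b):
    """Return the set of (i, j) index pairs of sounds aligned to each other."""
    pairs = set()
    i = j = 0  # positions in the ungapped words
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def compose(pairs_ac, pairs_cb):
    """Link A to B through C: (i, k) and (k, j) together support (i, j)."""
    return {(i, j) for (i, k) in pairs_ac for (k2, j) in pairs_cb if k == k2}


# dendron ~ dru and dru ~ tri, taken from the example alignments.
dendron_dru = matched_pairs("dendron", "d---ru-")
dru_tri = matched_pairs("dru", "tri")
print(sorted(compose(dendron_dru, dru_tri)))
```

Going through dru, the initial t of tri is linked to the initial d of dendron rather than to its second d, which is the correspondence that plain progressive alignment missed in the earlier example.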

slide-71
SLIDE 71

Methods Multiple Alignment

Phonetic characters

language        ASJP entries   multiple sequence alignment
Italian         kane           k----a-ne---..........
Spanish         -              ......................
French          Sia            S-----ia----..........
Portuguese      kau            k----a--u---..........
Romanian        k3ne           k----3-ne---..........
Greek           -              ......................
English         -              ......................
German          hunt           h----u-n-t--..........
Danish          hun7           h----u-n-7--..........
Icelandic       hintir         h----i-n-tir..........
Norwegian       hund           h----u-n-d--..........
Bulgarian       -              ......................
Serbocroatian   pas            ..................p-as
Polish          piEs           ..................piEs
Russian         sobak3,pos     sobak3------......p-os
Irish           ku             k----u------..........
Marathi         k3tra7         ............k3tra7....
Hindi           kutta          ............kutta-....

Binarization of the first alignment column:

language        h  k  s  S
Italian         0  1  0  0
Spanish         -  -  -  -
French          0  0  0  1
Portuguese      0  1  0  0
Romanian        0  1  0  0
Greek           -  -  -  -
English         -  -  -  -
German          1  0  0  0
Danish          1  0  0  0
Icelandic       1  0  0  0
Norwegian       1  0  0  0
Bulgarian       -  -  -  -
Serbocroatian   -  -  -  -
Polish          -  -  -  -
Russian         0  0  1  0
Irish           0  1  0  0
Marathi         -  -  -  -
Hindi           -  -  -  -

• alignment columns are binarized
• binary characters with RI < 0.4 are considered “bad” characters
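Binarizing an alignment column can be sketched as below. The code is an illustration under assumptions: a dot means that the language has no entry in the cognate class covered by the column and is coded as "-", a gap symbol inside a class is coded as absence (0), and the function name `binarize_column` is hypothetical. Only a subset of the rows from the table is used.

```python
def binarize_column(rows, col):
    """Turn one alignment column into one binary character per sound symbol."""
    # Symbols that actually occur in this column (dots and gaps excluded).
    symbols = sorted({r[col] for r in rows.values() if r[col] not in ".-"})
    table = {}
    for language, r in rows.items():
        if r[col] == ".":
            # Language not covered by this column: undefined for every character.
            table[language] = {s: "-" for s in symbols}
        else:
            # 1 if the language shows the symbol, 0 otherwise (gaps count as 0).
            table[language] = {s: int(r[col] == s) for s in symbols}
    return symbols, table


rows = {
    "Italian": "k----a-ne---..........",
    "French":  "S-----ia----..........",
    "German":  "h----u-n-t--..........",
    "Polish":  "..................piEs",
    "Russian": "sobak3------......p-os",
}
symbols, table = binarize_column(rows, 0)
print(symbols)
for language, values in table.items():
    print(language, values)
```

Each symbol of the column becomes its own presence/absence character, so one alignment column with four distinct sounds yields four binary characters.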

slide-72
SLIDE 72

Evaluation

Phylogenetic Inference

[Figure: workflow diagram. Input: ASJP wordlist data (e.g. HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk], WOOD [wud]). Steps shown: PMI analysis, PMI guide tree, cognate detection, cognate alignment, parsimony filtering, ML phylogenetic reconstruction. Output: phylogenetic tree.]

slide-73
SLIDE 73

Evaluation

Phylogenetic Inference

The following selections of binary characters were considered for phylogenetic inference:

1. all cognate classes
2. cognate classes with RI ≥ 0.4
3. all phonetic characters obtained via SCA alignment
4. all phonetic characters obtained via SCA alignment with RI ≥ 0.4
5. all phonetic characters obtained via T-Coffee alignment
6. all phonetic characters obtained via T-Coffee alignment with RI ≥ 0.4
7. all cognate classes + all SCA characters
8. all cognate classes with RI ≥ 0.4 + SCA characters with RI ≥ 0.4
9. all cognate classes + all T-Coffee characters
10. all cognate classes with RI ≥ 0.4 + T-Coffee characters with RI ≥ 0.4

slide-74
SLIDE 74

Evaluation

Phylogenetic Inference and evaluation

Inference

• Maximum Likelihood inference (using PAUP*), separately for each Glottolog family
• settings:
    molecular clock
    constant rates within lexical and within phonetic characters (but possibly different rates for different character types)
    SPR heuristic search, starting from PMI tree

Evaluation

• Generalized Quartet Distance (GQD) between inferred tree and Glottolog tree
• ranges from 0 (perfectly compatible) to 1 (totally incompatible)
• expected GQD between two randomly chosen trees is 2/3
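A rough sketch of how a quartet-based distance of this kind can be computed is given below. It rests on several assumptions: trees are written as nested tuples of leaf names, only quartets that are resolved in the gold-standard tree are counted, and a quartet that is unresolved in the inferred tree counts as a difference. The function names (`clades`, `quartet_split`, `gqd`) and the five-leaf trees are hypothetical, and details of the evaluation in the talk may differ.

```python
from itertools import combinations


def clades(tree):
    """Return (all leaves, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_split(clade_list, quartet):
    """Return the split ab|cd induced on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        side = clade & q
        if len(side) == 2:  # the clade separates two of the leaves from the rest
            return frozenset([side, q - side])
    return None


def gqd(inferred, gold):
    """Share of quartets resolved in gold that the inferred tree gets wrong."""
    leaves, gold_clades = clades(gold)
    _, inferred_clades = clades(inferred)
    resolved = different = 0
    for quartet in combinations(sorted(leaves), 4):
        gold_split = quartet_split(gold_clades, quartet)
        if gold_split is None:
            continue  # polytomy in the gold standard: quartet is ignored
        resolved += 1
        if quartet_split(inferred_clades, quartet) != gold_split:
            different += 1
    return different / resolved if resolved else 0.0


gold = (("A", "B"), ("C", "D", "E"))          # C, D, E form a polytomy
inferred = (("A", "C"), ("B", ("D", "E")))
print(gqd(inferred, gold))
```

Ignoring quartets that the gold standard leaves unresolved is what makes the measure suitable for comparison with a classification like Glottolog, which contains many polytomies.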

slide-75
SLIDE 75

Evaluation

Comparison to Glottolog

Workflow                                    SCA           T-Coffee
PMI tree                                    15.28% (= x)
cognacy characters                          x + 3.82%
filtered cognacy characters                 x + 3.42%
phonetic characters                         x + 20.26%    x + 14.87%
filtered phonetic characters                x + 19.66%    x + 19.93%
cognacy + phonetic characters               x + 3.00%     x + 2.66%
filtered (cognacy + phonetic) characters    x + 1.79%     x + 1.65%

slide-76
SLIDE 76

Conclusions

Conclusions

Did we get any further?

• yes, since we implemented an initial workflow that factors phonetic and lexical characters from unprocessed wordlist data
• no, since the evaluation shows that the phylogenetic trees inferred with this new approach perform worse than the black box methods

slide-77
SLIDE 77

Conclusions

Conclusions

Nevertheless...

• our workflow is transparent and gets rid of the black boxes
• we can track and trace every single automatic decision that led to the classificatory outcome
• we can also exploit the inferences made in the different steps of the workflow and use them as a starting point for further investigations, such as the calculation of tendencies and rates of language change
