slide-1
SLIDE 1

Sequence Alignment

Gerhard Jäger ESSLLI 2016

Gerhard Jäger Sequence Alignment ESSLLI 2016 1 / 62

slide-2
SLIDE 2

Sequence alignment: Motivation


slide-3
SLIDE 3

Sequence alignment: Motivation

Motivation

suppose we have no information except word lists
goals:

  • estimate distances between languages
  • estimate cognate classes
  • track individual sound changes

Example

Meaning   Italian      English   cognate
few       ’pɔko        fju:      1
rub       fre’gare     rʌb
dull      ot’tuzo      dʌl
hunt      kat’tʃare    hʊnt
year      ’anno        jɪə
this      ’kwesto      ðɪs
fish      ’peʃʃe       fɪʃ       1
rotten    ’martʃo      ’rɒtən
right     ’dʒusto      raɪt
when      ’kwando      wɛn       1
drink     ’bere        drɪŋk
heavy     pe’sante     ’hɛvɪ
heavy     ’grɛve       ’hɛvɪ
egg       ’wɔvo        ɛg        1
earth     ’tɛrra       ɜ:θ
dust      ’polvere     dʌst
laugh     ’ridere      lɑ:f
grass     ’ɛrba        grɑ:s
sharp     taʎ’ʎɛnte    ʃɑ:p
wash      la’vare      wɒʃ

slide-6
SLIDE 6

Sequence alignment: Motivation

Preprocessing

  • IPA is open-ended: 107 letters, 52 diacritics, 4 prosodic marks → 200,000 combinations
  • good practice: map IPA strings to a uniform representation with fewer symbols
  • common choices:
      10 Dolgopolsky sound classes (Dolgopolsky 1986; used i.a. in List 2014)
      41 ASJP sound classes
  • this course: ASJP

Meaning   Italian    English
few       poko       fyu
rub       fregare    rob
dull      ottuzo     dol
hunt      kattSare   hunt
year      anno       yi3
this      kwesto     8is
fish      peSSe      fiS
rotten    martSo     rot3n
right     dZusto     rait
when      kwando     wEn
drink     bere       driNk
heavy     pesante    hEvi
heavy     grEve      hEvi
egg       wovo       Eg
earth     tErra      38
dust      polvere    dost
laugh     ridere     lof
grass     Erba       gros
sharp     tallEnte   Sop
wash      lavare     woS

slide-7
SLIDE 7

Pairwise alignment


slide-8
SLIDE 8

Pairwise alignment

Levenshtein alignment

  • also known as edit distance
  • defines the distance between two strings as the minimal number of edit operations needed to transform one string into the other
  • edit operations:
      deletion
      insertion
      replacement

example: German mEnS vs. Cimbrian menEs

  1. mEnS → menS (replace)
  2. menS → menES (insert)
  3. menES → menEs (replace)

dL(mEnS, menEs) = 3

slide-9
SLIDE 9

Pairwise alignment

Levenshtein alignment

alternative presentation: alignment

    m E n − S
    | | | | |
    m e n E s

  • the distance for a particular alignment is the number of non-identities
  • the Levenshtein distance is the number of mismatches for the optimal alignment

slide-10
SLIDE 10

Pairwise alignment

Computing the Levenshtein Distance

recursive definition:

  1. $d_L(\epsilon, \alpha) = d_L(\alpha, \epsilon) = l(\alpha)$

  2. $$d_L(\alpha x, \beta y) = \min \left\{ \begin{array}{l} d_L(\alpha, \beta) + \delta(x, y) \\ d_L(\alpha x, \beta) + 1 \\ d_L(\alpha, \beta y) + 1 \end{array} \right.$$

     where $\delta(x, y) = 0$ if $x = y$ and $1$ otherwise

apparently requires an exponentially growing number of comparisons ⇒ computationally not feasible

but:

  • if l(α) = n and l(β) = m, there are n + 1 prefixes of α and m + 1 prefixes of β
  • hence there are only (n + 1)(m + 1) different comparisons to be performed
  • computational complexity is polynomial (quadratic in l(α) + l(β))
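The recursive definition can be turned into code almost literally once intermediate results are cached. The following is a minimal sketch, not part of the original slides; the function name `levenshtein` and the use of `functools.lru_cache` for caching are choices made here for illustration.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the recursive definition, with memoization."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0 or j == 0:
            return i + j  # one prefix is empty: the distance is the other's length
        delta = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            d(i - 1, j - 1) + delta,  # identity or replacement
            d(i, j - 1) + 1,          # insertion
            d(i - 1, j) + 1,          # deletion
        )

    return d(len(a), len(b))


print(levenshtein("mEnS", "menEs"))  # 3
```

Because only (n + 1)(m + 1) distinct prefix pairs exist, the cache keeps the number of evaluated calls quadratic rather than exponential.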

slide-35
SLIDE 35

Pairwise alignment

Computing the Levenshtein distance

Dynamic Programming

         −   m   E   n   S
     −   0   1   2   3   4
     m   1   0   1   2   3
     e   2   1   1   2   3
     n   3   2   2   1   2
     E   4   3   2   2   2
     s   5   4   3   3   3

recording in each step which of the three cells to the left and above gave rise to the current entry lets us recover the corresponding optimal alignment

slide-41
SLIDE 41

Pairwise alignment

Computing the Levenshtein distance

tracing back through the completed table yields two optimal alignments, each with three non-identities:

    m E n − S        m E n S −
    m e n E s        m e n E s
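The table filling and the traceback can be sketched as follows. This is an illustrative implementation under assumptions made here: the function name `levenshtein_alignment`, the ASCII gap symbol "-", and the tie-breaking order (diagonal first, then deletion, then insertion) are not taken from the slides. With a different tie-breaking order the other optimal alignment would be returned.

```python
def levenshtein_alignment(x: str, y: str, gap: str = "-"):
    """Fill the dynamic-programming table and trace back one optimal alignment."""
    n, m = len(x), len(y)
    # table[i][j] holds the distance between the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i
    for j in range(1, m + 1):
        table[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j - 1] + delta,  # identity or replacement
                table[i - 1][j] + 1,          # symbol of x aligned to a gap
                table[i][j - 1] + 1,          # symbol of y aligned to a gap
            )

    # Walk back from the bottom-right cell, re-deriving which neighbour
    # gave rise to each entry.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            delta = 0 if x[i - 1] == y[j - 1] else 1
            if table[i][j] == table[i - 1][j - 1] + delta:
                top.append(x[i - 1])
                bottom.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + 1:
            top.append(x[i - 1])
            bottom.append(gap)
            i -= 1
        else:
            top.append(gap)
            bottom.append(y[j - 1])
            j -= 1
    return table[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, top, bottom = levenshtein_alignment("mEnS", "menEs")
print(distance)
print(top)
print(bottom)
```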

slide-42
SLIDE 42

Pairwise alignment

Normalization for length

  • German mEnS (Mensch, ’person’) and Hindi manuSya are (partially) cognate
  • German ze3n (sehen, ’see’) and Hindi deg are not cognate
  • still, the raw distance rates the unrelated pair as closer:
      dL(mEnS, manuSya) = 4
      dL(ze3n, deg) = 3
  • normalization: divide the Levenshtein distance by the length of the longer string:
      dLD(mEnS, manuSya) = 4/7 ≈ 0.57
      dLD(ze3n, deg) = 3/4 = 0.75
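A minimal sketch of the length normalization, assuming a plain two-row Levenshtein routine. The function names and the convention of returning 0.0 for two empty strings are choices made here, not taken from the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance, keeping only the previous table row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # identity or replacement
                           prev[j] + 1,               # deletion
                           cur[j - 1] + 1))           # insertion
        prev = cur
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalized_levenshtein("mEnS", "manuSya"))  # 4/7
print(normalized_levenshtein("ze3n", "deg"))      # 0.75
```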

slide-43
SLIDE 43

Pairwise alignment

How well does normalized Levenshtein distance predict cognacy?

[Figure: empirical probability of cognacy as a function of LDN (normalized Levenshtein distance), and distribution of LDN for cognate ("yes") vs. non-cognate ("no") word pairs]

slide-44
SLIDE 44

Pairwise alignment

Problems

  • binary distinction: match vs. non-match
  • frequently, genuine sound correspondences in cognates are missed:
      c v a i n a z 3 • f i S • t u n • s p i s k i s
  • corresponding sounds count as mismatches even if they are aligned correctly:
      h a n t h a n t h E n d m a n
  • substantial amount of chance similarities

slide-45
SLIDE 45

Pairwise alignment

Background: probability theory

Given two sequences: How likely is it that they are aligned?
More general question: Given some data, and two competing hypotheses, how likely is it that the first hypothesis is correct?

Bayesian inference!

given:

  • data: d
  • hypotheses: h1, h0
  • model: P(d|h1), P(d|h0)

wanted: P(h1|d) : P(h0|d)

slide-46
SLIDE 46

Pairwise alignment

Bayesian inference

Bayes' theorem:

$$P(h|d) = \frac{P(d|h)\,P(h)}{\sum_{h'} P(d|h')\,P(h')}$$

ergo:

$$P(h_1|d) : P(h_0|d) = P(d|h_1)P(h_1) : P(d|h_0)P(h_0)$$

$$P(h_1|d) : P(h_0|d) = \frac{P(d|h_1)}{P(d|h_0)} \cdot \frac{P(h_1)}{P(h_0)}$$

$$\log\bigl(P(h_1|d) : P(h_0|d)\bigr) = \log\frac{P(d|h_1)}{P(d|h_0)} + \log\frac{P(h_1)}{P(h_0)}$$

slide-47
SLIDE 47

Pairwise alignment

Bayesian inference

suppose we have many independent data: $\vec{d} = d_1, \ldots, d_n$

$$P(\vec{d}|h) = \prod_{i=1}^{n} P(d_i|h)$$

$$\log P(\vec{d}|h) = \sum_{i=1}^{n} \log P(d_i|h)$$

$$\log\frac{P(\vec{d}|h_1)}{P(\vec{d}|h_0)} = \sum_{i=1}^{n} \log\frac{P(d_i|h_1)}{P(d_i|h_0)}$$

$$\log\bigl(P(h_1|\vec{d}) : P(h_0|\vec{d})\bigr) = \sum_{i=1}^{n} \log\frac{P(d_i|h_1)}{P(d_i|h_0)} + \log\frac{P(h_1)}{P(h_0)}$$

slide-48
SLIDE 48

Pairwise alignment

Bayesian inference

  • main argument against using Bayes’ rule: the prior probabilities P(h1), P(h0) are not known
  • there are various heuristics, but no generally accepted way to obtain them
  • if n is large, though, log P(h1)/P(h0) doesn’t matter very much [1]:

$$\log\bigl(P(h_1|\vec{d}) : P(h_0|\vec{d})\bigr) \approx \sum_{i=1}^{n} \log\frac{P(d_i|h_1)}{P(d_i|h_0)} = \log\bigl(P(\vec{d}|h_1) : P(\vec{d}|h_0)\bigr)$$

  • the quantity $\log(P(\vec{d}|h_1) : P(\vec{d}|h_0))$ is called log-odds

[1] Also, if we choose an uninformative prior with P(h1) = P(h0), we have log P(h1)/P(h0) = 0 anyway.

slide-49
SLIDE 49

Pairwise alignment

Log-odds

  • log-odds can take any real value
  • a positive value indicates evidence for h1, a negative value evidence for h0
  • the higher the absolute value, the stronger the evidence
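To make the sign convention concrete, here is a small sketch that sums per-datum log-likelihood ratios for independent data. Everything in it is an assumption made for illustration: the toy "coin" hypotheses `h1` and `h0`, the function name `log_odds`, and the choice of base-2 logarithms.

```python
import math


def log_odds(data, p1, p0):
    """Sum of log2 P(d_i|h1) / P(d_i|h0) over independent data points."""
    return sum(math.log2(p1[d] / p0[d]) for d in data)


# Hypothetical models: h1 says the coin is biased towards heads, h0 says it is fair.
h1 = {"H": 0.8, "T": 0.2}
h0 = {"H": 0.5, "T": 0.5}

print(log_odds("HHTH", h1, h0))  # positive: evidence for h1
print(log_odds("TTT", h1, h0))   # negative: evidence for h0
```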

slide-50
SLIDE 50

Pairwise alignment

Weighted alignment

  • suppose our data are two aligned sequences x, y
  • for the time being, we assume there are no gaps in the alignment
  • h1: they developed from a common ancestor via substitutions
  • h0: they are unrelated
  • additional assumption (rough approximation in biology, pretty much off the mark in linguistics): substitutions in different positions occur independently

slide-51
SLIDE 51

Pairwise alignment

The null model

  • if x and y are unrelated, their joint probability equals the product of their individual probabilities
  • as a start (quite wrong both in biology and in linguistics): let us assume the strings have no “grammar”; each position is independent of all other positions
  • then

$$P(\vec{x}, \vec{y}|h_0) = P(\vec{x}|h_0)\,P(\vec{y}|h_0) = \prod_i P(x_i|h_0)\,P(y_i|h_0)$$

$$\log P(\vec{x}, \vec{y}|h_0) = \sum_i \bigl(\log P(x_i|h_0) + \log P(y_i|h_0)\bigr)$$

slide-52
SLIDE 52

Pairwise alignment

The null model

  • suppose x and y are generated by the same process (reasonable for DNA and protein comparison, false for cross-linguistic word comparison)
  • then $P(x_i|h_0)$, $P(y_i|h_0)$ are simply the probabilities of occurrence
  • $q_a$: probability that symbol a occurs in a sequence

$$\log P(\vec{x}, \vec{y}|h_0) = \sum_i \log q_{x_i} + \sum_j \log q_{y_j}$$

  • q can be estimated from relative frequencies

slide-53
SLIDE 53

Pairwise alignment

The alignment model

  • suppose x and y evolved from a common ancestor via independent substitution mutations
  • independence between positions:

$$P(\vec{x}, \vec{y}|h_1) = \prod_i P(x_i, y_i|h_1)$$

  • $p_{a,b}$: probability that a position in the latest common ancestor of x and y evolved into an a in sequence x and into a b in sequence y

$$P(\vec{x}, \vec{y}|h_1) = \prod_i p_{x_i, y_i}$$

$$\log P(\vec{x}, \vec{y}|h_1) = \sum_i \log p_{x_i, y_i}$$

slide-54
SLIDE 54

Pairwise alignment

The log-odds score

taking things together, we have

$$\log\bigl(P(\vec{x}, \vec{y}|h_1) : P(\vec{x}, \vec{y}|h_0)\bigr) = \sum_i \log\frac{p_{x_i, y_i}}{q_{x_i} q_{y_i}}$$

  • $\log\frac{p_{ab}}{q_a q_b}$: score of the alignment of a with b
  • also goes by the name of Pointwise Mutual Information (PMI)
  • assembled in a PMI substitution matrix
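As a sketch of how the log-odds score of a gap-free alignment is accumulated, the snippet below sums the per-position scores. The two-symbol alphabet and the values in `p` and `q` are hypothetical toy numbers, not estimates from any data set; the function name and the base-2 logarithm are likewise assumptions.

```python
import math


def alignment_log_odds(x, y, p, q):
    """Log-odds score of a gap-free alignment: sum of log2 p[a,b] / (q[a] q[b])."""
    if len(x) != len(y):
        raise ValueError("a gap-free alignment requires sequences of equal length")
    return sum(math.log2(p[(a, b)] / (q[a] * q[b])) for a, b in zip(x, y))


# Hypothetical toy parameters over the alphabet {a, b}.
q = {"a": 0.5, "b": 0.5}
p = {("a", "a"): 0.4, ("b", "b"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1}

print(alignment_log_odds("ab", "ab", p, q))  # positive: identical symbols are favoured
print(alignment_log_odds("ab", "ba", p, q))  # negative: mismatches are penalized
```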

slide-55
SLIDE 55

Pairwise alignment

Substitution matrices

  • in bioinformatics, there are several commonly used substitution matrices for nucleotides and proteins
  • based on explicit models of evolution and careful empirical testing
  • for nucleotides:

          A    G    T    C
     A    2   −5   −7   −7
     G   −5    2   −7   −7
     T   −7   −7    2   −5
     C   −7   −7   −5    2

slide-56
SLIDE 56

Pairwise alignment

Substitution matrices

  • for proteins: different matrices for different evolutionary distances
  • for instance: BLOSUM50

slide-57
SLIDE 57

Pairwise alignment

Substitution matrix for the ASJP data

  • 1. identify large sample of pairs of closely related languages (using expert information or heuristics based on aggregated Levenshtein distance)

An.NORTHERN_PHILIPPINES.CENTRAL_BONTOC          An.MESO-PHILIPPINE.NORTHERN_SORSOGON
WF.WESTERN_FLY.IAMEGA                           WF.WESTERN_FLY.GAMAEWE
Pan.PANOAN.KASHIBO_BAJO_AGUAYTIA                Pan.PANOAN.KASHIBO_SAN_ALEJANDRO
AA.EASTERN_CUSHITIC.KAMBAATA_2                  AA.EASTERN_CUSHITIC.HADIYYA_2
ST.BAI.QILIQIAO_BAI_2                           ST.BAI.YUNLONG_BAI
An.SULAWESI.MANDAR                              An.OCEANIC.RAGA
An.SULAWESI.TANETE                              An.SAMA-BAJAW.BOEPINANG_BAJAU
UA.AZTECAN.NAHUATL_HUEYAPAN_TETELA_DEL_VOLCAN   UA.AZTECAN.NAHUATL_CUENTEPEC_TEMIXCO
An.SOUTHERN_PHILIPPINES.KAGAYANEN               An.NORTHERN_PHILIPPINES.LIMOS_KALINGA
An.MESO-PHILIPPINE.CANIPAAN_PALAWAN             An.NORTHWEST_MALAYO-POLYNESIAN.LAHANAN
NC.BANTOID.LIFONGA                              NC.BANTOID.BOMBOMA_2
IE.INDIC.WAD_PAGGA                              IE.INDIC.TALAGANG_HINDKO
NC.BANTOID.LINGALA                              NC.BANTOID.LIFONGA
An.CENTRAL_MALAYO-POLYNESIAN.BALILEDO           An.CENTRAL_MALAYO-POLYNESIAN.PALUE
AuA.MUNDA.HO                                    AuA.MUNDA.KORKU
MGe.GE-KAINGANG.KAYAPO                          MGe.GE-KAINGANG.APINAYE

slide-58
SLIDE 58

Pairwise alignment

Substitution matrix for the ASJP data

  • 2. pick a concept and a pair of related languages at random
      languages: Pen.MAIDUAN.MAIDU_KONKAU, Pen.MAIDUAN.NE_MAIDU
      concept: one
  • 3. find corresponding words from the two languages:
      nisam, niSem
  • 4. do Levenshtein alignment
      n i s a m
      n i S e m
  • 5. for each sound pair, count number of correspondences
      nn: 1; ii: 1; sS: 1; ae: 1; mm: 1

slide-59
SLIDE 59

Pairwise alignment

Substitution matrix for the ASJP data

steps 2-5 are repeated 100,000 times

    klem   S3--v   ligini   kulox   Naltir---i   …
    klom   S37on   ji---p   Gulox   Naltirtiri   …

sound pair   count        sound pair   count
a a          56,047       . . .
i i          33,955       4 8          2
u u          23,731       4 a          2
n n          21,363       G t          2
o o          19,619       i !          2
m m          18,263       G y          2
t t          16,975       d !          2
k k          16,773       s G          2
e e          12,745       Z 5          2
r r          11,601       G s          2
l l          11,377       X z          2
b b           8,965       ! k          2
s s           8,245       q 8          2
d d           6,829       a !          2
p p           6,681       a !          2
w w           6,613       ! y          2
N N           6,275       ! E          2
h h           5,331       j G          2
y y           5,321       G i          2
3 3           5,255       E !          2
. . .                     v S          2

slide-60
SLIDE 60

Pairwise alignment

Substitution matrix for the ASJP data

  • 6. determine relative frequency of occurrence of each sound within the entire database

a 0.1479   i 0.0969   u 0.0696   o 0.0626   n 0.0614   e 0.0478
k 0.0478   m 0.0465   t 0.0449   r 0.0346   l 0.0331   b 0.0248
s 0.0243   w 0.0232   3 0.0228   y 0.0222   d 0.0214   h 0.0213
p 0.0202   N 0.0201   g 0.0178   E 0.0134   7 0.0124   C 0.0073
S 0.0064   x 0.0062   c 0.0056   f 0.0052   5 0.0049   v 0.0045
q 0.0041   z 0.0035   j 0.0035   T 0.0029   L 0.0027   X 0.0022
8 0.0014   Z 0.0011   ! 0.0009   4 0.0002   G 0.0001

slide-61
SLIDE 61

Pairwise alignment

Substitution matrix for the ASJP data

  • 7. estimate $p_{ab}$ as relative frequency of co-occurrence of a with b, $q_a$, $q_b$ as individual relative frequencies, and determine PMI scores $\log_2 \frac{p_{ab}}{q_a q_b}$

G G 11.2348    ! ! 10.0202    4 4  9.1480    8 8  8.0650
Z Z  7.9575    X X  7.9375    L L  7.6276    z z  7.2624
q q  7.2542    f f  6.9117    v v  6.8418    5 5  6.7731
j j  6.7587    T T  6.6580    S S  6.6054    c c  6.5989
C C  6.2439    4 G  6.1943    x x  6.1210    G X  5.3342
G q  5.3017    7 7  5.2111    p p  5.0693    N N  4.9821
Z j  4.9386    d d  4.9263    g g  4.8958    b b  4.8906
s s  4.8277    4 5  4.7508    E E  4.7143    w w  4.6512
h h  4.5819    G x  4.5573    Z z  4.4943    y y  4.4637
l l  4.4037    ! G  4.3760    3 3  4.3692    r r  4.3061
X q  4.1200    m m  4.1087    t t  4.1021    G Z  4.0429
k k  3.9046    X x  3.8116    T Z  3.7380    8 G  3.6993
· · ·
o q −3.2842    C a −3.2893    j   −3.2914    a m −3.2915
E v −3.3035    ! w −3.3079    ! u −3.3087    5 q −3.3116
T   −3.3158    ! k −3.3526    e z −3.3763    ! s −3.3788
f q −3.3942    N S −3.3954    ! b −3.4077    L b −3.4558
T u −3.4690    4 i −3.5529    5 a −3.8294    C N −3.8451
! t −4.2625    ! e −4.3534    ! i −4.3712    ! a −4.9817
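Steps 5 to 7 can be sketched in a few lines. This is an illustration under assumptions made here: the function name `pmi_scores`, treating sound pairs as ordered, and estimating the sound frequencies from just the two example words rather than from the entire database are all simplifications, and no smoothing is applied for unseen pairs.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs, sounds):
    """PMI log2(p_ab / (q_a * q_b)) for every sound pair observed in the alignments.

    aligned_pairs: iterable of (a, b) tuples taken from aligned word pairs
    sounds: iterable of all sound tokens used to estimate q
    """
    pair_counts = Counter(aligned_pairs)
    sound_counts = Counter(sounds)
    n_pairs = sum(pair_counts.values())
    n_sounds = sum(sound_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs              # relative frequency of the pair
        q_a = sound_counts[a] / n_sounds    # relative frequency of each sound
        q_b = sound_counts[b] / n_sounds
        scores[(a, b)] = math.log2(p_ab / (q_a * q_b))
    return scores


# The aligned word pair from step 4: nisam / niSem.
pairs = list(zip("nisam", "niSem"))
scores = pmi_scores(pairs, "nisam" + "niSem")
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 4))
```

With real data the pair counts would come from the 100,000 sampled alignments, and rare sounds such as G or ! would receive high identity scores because their individual frequencies are so low.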

slide-62
SLIDE 62

Pairwise alignment

Evaluation

[Figure: heat map of the PMI substitution matrix over the ASJP sound classes; color scale for PMI from −10 to 10]

slide-63
SLIDE 63

Pairwise alignment

Evaluation

[Figure: plot of the ASJP sound classes; axis ticks at −5, 0, 5, 10]

slide-64
SLIDE 64

Pairwise alignment

Gap penalties

  • gaps in an alignment correspond either to an insertion or a deletion
  • simplifying assumption: insertions and deletions are equally likely at all positions; symbols are inserted according to their general frequency of occurrence
  • suppose an item $x_i$ is aligned to a gap; let α be the probability that an insertion occurred since the latest common ancestor, and β the probability of a deletion

$$P(x_i, -|h_1) = \alpha q_{x_i} + \beta q_{x_i}$$

$$P(x_i, -|h_0) = q_{x_i}$$

$$\log\bigl(P(x_i, -|h_1) : P(x_i, -|h_0)\bigr) = \log(\alpha + \beta) = -d$$

  • i.e., there is a constant term for each gap
  • as α + β < 1, this term is negative, i.e. there is a constant penalty for each gap

slide-65
SLIDE 65

Pairwise alignment

Affine gap penalties

  • deletions/insertions frequently apply to entire blocks of symbols (both in biology and linguistics)
  • the probability of a gap of length n is higher than the product of the probabilities of n individual gaps
  • penalty e for extending a gap is lower than penalty d for opening a gap
  • g: length of a gap

$$\gamma(g) = -d - (g - 1)e$$

  • no principled way to derive the values of d and e; they have to be fixed via trial and error
  • d = 2.5 and e = 1.6 work quite well for the ASJP data
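The affine gap score is a one-liner; the sketch below only spells out γ(g) = −d − (g − 1)e with the values quoted for the ASJP data as defaults. The function name `gap_score` and the error for g < 1 are choices made here.

```python
def gap_score(g: int, d: float = 2.5, e: float = 1.6) -> float:
    """Affine gap score: opening penalty d once, extension penalty e for each further position."""
    if g < 1:
        raise ValueError("a gap has length at least 1")
    return -d - (g - 1) * e


for g in (1, 2, 3, 4):
    print(g, round(gap_score(g), 1))
```

A gap of length 3 thus costs 2.5 + 2 × 1.6 = 5.7, less than three separately opened gaps at 7.5.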

slide-66
SLIDE 66

Pairwise alignment

Weighted alignment

  • so far, we assumed that the alignment between x and y is known
  • to assess the strength of evidence for h1 given x, y, we need to consider all alignments between x and y
  • enumeration is infeasible, because the number of alignments between two sequences of length n is

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

  • computation is nonetheless possible using Pair Hidden Markov Models
  • simpler task: find the most likely alignment and determine its log-odds!

slide-67
SLIDE 67

Pairwise alignment

The Needleman-Wunsch algorithm

almost identical to the Levenshtein algorithm, except:

  • matches/mismatches are counted not as 0 and 1, but as the log-odds score of the corresponding symbol pair
  • insertions/deletions are counted as gap penalties
  • by convention, the similarity rather than the distance is computed, i.e. we try to find the alignment that maximizes the score

let x have length n and y length m, let $s_{ab}$ be the log-odds score of a and b, and d/e the gap penalties

slide-68
SLIDE 68

Pairwise alignment

The Needleman-Wunsch algorithm

d and e are positive penalties, so they are subtracted from the score; G records whether a cell was reached via a gap

$$F(0, 0) = G(0, 0) = 0$$

$$\forall i,\ 0 < i \le n: \quad F(i, 0) = F(i-1, 0) - G(i-1, 0)\,e - (1 - G(i-1, 0))\,d, \qquad G(i, 0) = 1$$

$$\forall j,\ 0 < j \le m: \quad F(0, j) = F(0, j-1) - G(0, j-1)\,e - (1 - G(0, j-1))\,d, \qquad G(0, j) = 1$$

$$\forall i, j,\ 0 < i \le n,\ 0 < j \le m: \quad F(i, j) = \max \left\{ \begin{array}{l} F(i-1, j) - G(i-1, j)\,e - (1 - G(i-1, j))\,d \\ F(i, j-1) - G(i, j-1)\,e - (1 - G(i, j-1))\,d \\ F(i-1, j-1) + s_{x_i y_j} \end{array} \right.$$

$$G(i, j) = \begin{cases} 0 & \text{if the maximum is attained by the third term, } F(i-1, j-1) + s_{x_i y_j} \\ 1 & \text{else} \end{cases}$$
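The recursion can be sketched as follows, assuming d and e are positive penalties that are subtracted from the score, in line with γ(g) = −d − (g − 1)e. The function name, the `score` callback, the tie-breaking order and the toy scoring function in the usage example are assumptions made here; in practice `score` would look up the PMI substitution matrix.

```python
def needleman_wunsch(x, y, score, d=2.5, e=1.6, gap="-"):
    """Best-scoring global alignment with a gap-open/gap-extend flag per cell."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]     # best score per cell
    G = [[0] * (m + 1) for _ in range(n + 1)]       # 1 if the cell was reached via a gap
    move = [[None] * (m + 1) for _ in range(n + 1)]  # remembered predecessor

    def penalty(i, j):
        # Extending an existing gap costs e, opening a new one costs d.
        return e if G[i][j] else d

    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - penalty(i - 1, 0)
        G[i][0], move[i][0] = 1, "up"
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - penalty(0, j - 1)
        G[0][j], move[0][j] = 1, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - penalty(i - 1, j), "up"),
                (F[i][j - 1] - penalty(i, j - 1), "left"),
            ]
            # max() returns the first maximal candidate, so ties prefer "diag".
            F[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            G[i][j] = 0 if move[i][j] == "diag" else 1

    # Follow the remembered moves back from the bottom-right cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif step == "up":
            top.append(x[i - 1]); bottom.append(gap); i -= 1
        else:
            top.append(gap); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def toy_score(a, b):
    """Hypothetical stand-in for a PMI matrix: +4 for identical symbols, -3 otherwise."""
    return 4.0 if a == b else -3.0


print(needleman_wunsch("ab", "acb", toy_score))
```

With the toy scores, the best alignment of "ab" and "acb" pairs a with a and b with b and opens a single gap opposite c, for a total of 4 − 2.5 + 4 = 5.5.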

slide-94
SLIDE 94

Pairwise alignment

Finding the best alignment

Dynamic Programming

       −       m       E       n       S
−              −2.5    −4.1    −5.7    −7.3
m      −2.5    4.13    1.53    0.03    −1.47
e      −4.1    1.53    5.65    3.05    1.55
n      −5.7    0.03    3.05    9.2     6.6
E      −7.3    −1.47   4.75    6.6     7.62
s      −8.9    −2.97   2.15    5.1     8.84

memorizing in each step which of the three neighboring cells (to the left, above, or diagonally above-left) gave rise to the current entry lets us recover the corresponding optimal alignment
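The table-filling and backtracking idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the slides: `needleman_wunsch` and `toy_similarity` are hypothetical names, the similarity is a toy match/mismatch score rather than PMI scores, and a single linear gap penalty is used, whereas the numbers in the matrix above suggest different penalties for opening and extending a gap.

```python
def toy_similarity(x, y):
    """Hypothetical stand-in for a PMI score: +1 for identical sounds, -1 otherwise."""
    return 1.0 if x == y else -1.0


def needleman_wunsch(a, b, sim=toy_similarity, gap=-1.0):
    """Global alignment with a linear gap penalty (a simplifying assumption).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # back[i][j] remembers which neighboring cell produced score[i][j].
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Follow the remembered pointers from the bottom-right corner back to the origin.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("ac", "abc"))
```

Replacing `toy_similarity` with a lookup into a table of PMI scores would bring the sketch closer to the setting of the slides; the backtracking logic stays the same.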


slide-99
SLIDE 99

Pairwise alignment

Evaluation

left: Levenshtein alignment; right: Needleman-Wunsch alignment

Levenshtein             Needleman-Wunsch
German      Latin       German      Latin
-iX         ego         iX-         ego
du          tu          du          tu
vir         nos         vir         nos
ains        unus        ain-s       -unus
cvai        -duo        cvai        duo-
---mEnS     persona     mEnS---     persona
---fiS      piskis      fiS---      piskis
hun-t       kanis       hun-t       kanis
-----laus   pedikulus   ------laus  pedikul-us
-blat       folyu       b-lat       folyu
haut--      -kutis      haut--      k-utis
---blut     saNgwis     ---blut     saNgwis
knoX3n      --o--s      knoX3n      --os--
horn-       kornu       horn-       kornu
-au-g3      okulus      a-ug3-      okulus
na-z3       nasus       naz3-       nasus
chan        dens        chan-       d-ens
-chuN3      liNgwE      chuN--3     -liNgwE
han-t       manus       han-t       manus
--brust     pektus-     b--rust     pektus-
leb3r       yekur       leb3r       yekur
triNk3n     -bibere     triNk3n-    -bi-bere
--ze3n      widere      --ze3n      widere
-her3n      audire      --her3n     audire-
Sterb3n     -mor--i     Sterb3n     -mor-i-
khom3n      wenire      khom3n---   w---enire
zon3        so-l        zon3        sol-

slide-100
SLIDE 100

Pairwise alignment

Evaluation

Levenshtein             Needleman-Wunsch
German      Latin       German      Latin
vas3r       -akwa       --vas3r     akwa---
Stain       lapis       Sta-in      -lapis
-foia       iNnis       fo-ia       iNnis
pfat        viya        p-fat       viya-
bErk        mons        bErk        mons
n-at        noks        na-t        noks
---fol      plenus      fol----     p-lenus
no--i       nowus       no-i-       nowus
nam-3       nomen       nam3-       nomen

slide-101
SLIDE 101

Pairwise alignment

German — Swabian

meaning     score    German    Swabian
'I'         0.3      iX        i
'you'       8.26     du        du
'we'        -1.09    vir       mia
'one'       4.63     ains      ois
'two'       16.0     cvai      cvoi
'person'    12.61    mEnS      mEnZE
'fish'      16.35    fiS       fiS
'louse'     15.01    laus      laus
'tree'      6.57     baum      bom
'leaf'      11.92    blat      blad
'skin'      14.42    haut      haut
'blood'     12.88    blut      blud
'bone'      16.88    knoX3n    knoXE
'horn'      8.75     horn      hoan
'tongue'    9.8      chuN3     cuN
'knee'      7.77     kni       knui
'hand'      8.6      hant      hEnd
'breast'    14.81    brust     bXuSt
'liver'     10.01    leb3r     leba
'drink'     4.99     triNk3n   dXiNg
'see'       0.63     ze3n      se
'die'       10.16    Sterb3n   StEab
'come'      11.84    khom3n    khom
'sun'       8.79     zon3      sonE
'star'      16.16    StErn     StEan
'water'     7.8      vas3r     vaza
'stone'     10.36    Stain     Stoi
'fire'      12.43    foia      fuia

slide-102
SLIDE 102

Pairwise alignment

German — English

meaning     score    German    English
'I'         -2.3     iX        Ei
'you'       2.34     du        yu
'we'        2.21     vir       wi
'one'       -2.3     ains      w3n
'two'       -5.25    cvai      tu
'fish'      16.35    fiS       fiS
'dog'       -7.46    hunt      dag
'tree'      -7.83    baum      tri
'leaf'      -0.47    blat      lif
'blood'     9.46     blut      bl3d
'bone'      -1.36    knoX3n    bon
'horn'      15.73    horn      horn
'eye'       -4.1     aug3      Ei
'nose'      1.63     naz3      nos
'tongue'    -0.63    chuN3     t3N
'knee'      3.86     kni       ni
'hand'      8.6      hant      hEnd
'breast'    16.93    brust     brest
'liver'     14.65    leb3r     liv3r
'drink'     7.48     triNk3n   drink
'see'       -3.04    ze3n      si
'die'       -7.7     Sterb3n   dEi
'come'      1.22     khom3n    k3m
'sun'       1.95     zon3      s3n
'star'      8.2      StErn     star
'water'     12.06    vas3r     wat3r
'stone'     6.75     Stain     ston
'fire'      6.79     foia      fEir

slide-103
SLIDE 103

Pairwise alignment

German — Latin

meaning     score    German    Latin
'I'         -3.87    iX        ego
'you'       3.62     du        tu
'we'        -5.06    vir       nos
'one'       2.39     ains      unus
'two'       -5.51    cvai      duo
'person'    -4.66    mEnS      persona
'fish'      0.29     fiS       piskis
'louse'     -0.08    laus      pedikulus
'tree'      -3.85    baum      arbor
'leaf'      -3.57    blat      folyu
'skin'      -0.25    haut      kutis
'blood'     -9.18    blut      saNgwis
'bone'      -5.72    knoX3n    os
'horn'      7.55     horn      kornu
'nose'      4.49     naz3      nasus
'tooth'     -2.78    chan      dens
'tongue'    -3.4     chuN3     liNgwE
'knee'      0.8      kni       genu
'hand'      0.73     hant      manus
'breast'    1.39     brust     pektus
'liver'     5.37     leb3r     yekur
'see'       -4.15    ze3n      widere
'hear'      -4.24    her3n     audire
'die'       -6.12    Sterb3n   mori
'come'      -9.25    khom3n    wenire
'sun'       0.97     zon3      sol
'star'      5.72     StErn     stela
'water'     -5.4     vas3r     akwa

slide-104
SLIDE 104

Pairwise alignment

How well does PMI similarity predict cognacy?

expert cognacy judgments used as gold standard

[Figure, four panels: empirical probability of cognacy as a function of LDN and as a function of PMI; distributions of LDN and of PMI for cognate (yes) and non-cognate (no) word pairs]

slide-105
SLIDE 105

Pairwise alignment

How well does PMI similarity predict cognacy?

Average Precision
  • LDN: 0.847
  • PMI: 0.864

[Figure: precision-recall curve for LDN and PMI; axes: recall, precision]
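For readers unfamiliar with the measure, average precision can be computed from a ranked list of word pairs roughly as follows. This is a hedged sketch: `average_precision` is a hypothetical helper, the scores and labels are toy values rather than data from the slides, and ties between scores are not treated specially. For a distance such as LDN, the negated distance would be passed as the score so that likely cognates rank first.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the ranks where a true cognate pair occurs.

    scores: similarity per word pair (higher = more likely cognate)
    labels: 1 if the expert judgment says cognate, else 0
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_cognate) in enumerate(ranked, start=1):
        if is_cognate:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` pairs
    if hits == 0:
        raise ValueError("no positive labels")
    return precision_sum / hits


# Toy example: cognates at ranks 1 and 3 give (1/1 + 2/3) / 2.
print(average_precision([3.0, 2.0, 1.0, 0.0], [1, 0, 1, 0]))
```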

slide-106
SLIDE 106

Estimating distances from pairwise alignments

Estimating distances from pairwise alignments


slide-107
SLIDE 107

Estimating distances from pairwise alignments

Probability of cognacy

logistic regression to predict probability of cognacy from PMI similarity

lowest predicted probabilities:

concept   Italian       English   predicted prob.   expert judgment
sharp     tallEnte      Sop       0.004
float     galleddZare   fl3ut     0.004
kill      ammattsare    kil       0.007
bark      skordza       bok       0.009
husband   marito        hozb3nd   0.010
walk      kamminare     wok       0.011
eat       mandZare      it        0.011
bark      kortettSa     bok       0.013
know      sapere        n3u       0.015
come      venire        kom       0.016             1
swim      nwotare       swim      0.016
back      dosso         bEk       0.018
burn      ardere        b3n       0.018
think     pensare       8iNk      0.019
dust      polvere       dost      0.019
wife      molle         waif      0.020
swell     gonfyare      swEl      0.021
sing      kantare       siN       0.022
knee      rotElla       ni        0.022
dry       aSSutto       drai      0.022
five      tSinkwe       faiv      0.023             1
skin      pElle         skin      0.024
hand      mano          hEnd      0.025
blood     sangwe        blod      0.025
flow      skorrere      fl3u      0.026
wipe      aSSugare      waip      0.026
turn      dZirare       t3n       0.026

highest predicted probabilities:

concept   Italian       English   predicted prob.   expert judgment
father    padre         fo83      0.480             1
when      kwando        wEn       0.483             1
night     notte         nait      0.508             1
and       eed           End       0.518
name      nome          neim      0.519             1
worm      vErme         w3m       0.521             1
round     tondo         raund     0.526             1
many      molti         mEni      0.569
wind      vEnto         wind      0.573             1
two       due           tu        0.600             1
mother    madre         mo83      0.624             1
thou      tu            8au       0.629             1
child     fantSullo     tSaild    0.638
long      lungo         loN       0.651             1
fish      peSSe         fiS       0.659             1
count     kontare       kaunt     0.660             1
star      stella        sto       0.664             1
belly     vEntre        bEli      0.679
sun       sole          son       0.692             1
fly       volare        flai      0.742
three     tre           8ri       0.744             1
flow      fluire        fl3u      0.759
heavy     grEve         hEvi      0.769
person    persona       p3s3n     0.799             1
animal    animale       Enim3l    0.947             1
vomit     vomitare      vomit     0.960             1
fruit     frutto        frut      0.966             1
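The slides do not report the fitted regression coefficients, so the following is only a sketch of the general technique: a one-predictor logistic regression fitted by plain gradient descent. The function names, learning rate, and toy PMI scores and labels are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.05, epochs=5000):
    """Fit P(cognate | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_cognacy(pmi, w, b):
    return sigmoid(w * pmi + b)


# Hypothetical training data: PMI similarity and expert judgment (1 = cognate).
toy_pmi = [-8, -5, -3, -1, 0, 2, 4, 6, 9, 12]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
w, b = fit_logistic(toy_pmi, toy_labels)
print(predict_cognacy(-8, w, b), predict_cognacy(12, w, b))
```

In practice a statistics package would be used for the fit; the point here is only that a single PMI score is mapped to a probability between 0 and 1.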

slide-108
SLIDE 108

Estimating distances from pairwise alignments

Estimating distances

  • average or maximal predicted probability of cognacy per concept = expected relative frequency of cognate pairs
  • expected relative frequency of cognate pairs = e^(−t)
  • ⇒ distance estimation from raw data
  • ⇒ applicable across language families

[Figure: scatter plot comparing distances based on expert cognacy judgments (axis "expert") with predicted distances (axis "prediction"), both ranging from 0.0 to 2.5; each point is labeled with a language pair, covering all pairs of Greek, Bulgarian, Russian, Polish, Ukrainian, Czech, Icelandic, Swedish, Danish, English, Dutch, German, Catalan, Portuguese, Spanish, French, Italian, Breton, Romanian, Lithuanian, Irish, Hindi, Bengali, Welsh, and Nepali]
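One reading of the relation above is that the distance t follows from the estimated cognate frequency as t = −log(frequency). The sketch below assumes that reading; `estimated_distance` and the toy probabilities are hypothetical, and the per-concept aggregation (maximum or mean over the word pairs of a concept) is passed in as a parameter.

```python
import math


def estimated_distance(probs_by_concept, aggregate=max):
    """Estimate t from predicted cognacy probabilities.

    probs_by_concept: dict mapping a concept to the predicted probabilities
                      of its word pairs (several if there are synonyms)
    aggregate:        how to summarise one concept, e.g. max or a mean function
    """
    per_concept = [aggregate(p) for p in probs_by_concept.values()]
    expected_cognate_freq = sum(per_concept) / len(per_concept)
    # expected relative frequency of cognate pairs = e^(-t)  =>  t = -log(frequency)
    return -math.log(expected_cognate_freq)


def mean(values):
    return sum(values) / len(values)


toy = {"fish": [0.5], "heavy": [0.25, 0.5]}  # made-up probabilities
print(estimated_distance(toy, aggregate=max))   # -log(0.5)
print(estimated_distance(toy, aggregate=mean))  # -log(0.4375)
```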

slide-109
SLIDE 109

Estimating distances from pairwise alignments

Estimating distances

Neighbor Joining tree

[Figure: Neighbor Joining tree with the leaves Greek, Bulgarian, Russian, Polish, Ukrainian, Czech, Icelandic, Swedish, Danish, English, Dutch, German, Catalan, Portuguese, Spanish, French, Italian, Breton, Romanian, Lithuanian, Irish, Hindi, Bengali, Welsh, and Nepali]

slide-110
SLIDE 110

Estimating distances from pairwise alignments

Languages of Eurasia / ASJP data

  • cf. Jäger (2015); full tree can be inspected here

[Figure: tree of the languages of Eurasia; labeled families: Uralic, Hmong-Mien, Chukotko-Kamchatkan, Japonic, Dravidian, Austronesian, Nakh-Daghestanian, Tai-Kadai, Tungusic, Sino-Tibetan, Mongolic, Yeniseian, Ainu, Austroasiatic, Nivkh, Indo-European, Turkic, Yukaghir; percentages shown in the figure: 99.4%, 100%, 96.8%, 99.9%, 100%, 96.9%, 100%]

slide-111
SLIDE 111

Estimating distances from pairwise alignments

Languages of the World / ASJP data

  • cf. Jäger and Wichmann (2016); full tree can be inspected here.

[Figure: world tree of languages; color legend: Africa, Eurasia, Papunesia, Australia, America; region labels: Subsaharan Africa, NW Eurasia, Australia/Papua, SE Asia, America, Papua; labeled families: Austronesian, Niger-Congo, Tai-Kadai, Austro-Asiatic, Sino-Tibetan, Uto-Aztecan, Mayan, Quechuan, Altaic, Khoisan, Nilo-Saharan, Kadugli, Dravidian, Timor-Alor-Pantar, Indo-European, Uralic, Afro-Asiatic, Australian]

slide-112
SLIDE 112

Multiple sequence alignment

Multiple sequence alignment


slide-113
SLIDE 113

Multiple sequence alignment

Multiple sequence alignment

  • Needleman-Wunsch only does pairwise alignment
  • desirable: aligning all sequences of a taxon into one matrix
      • necessary for character-based phylogenetic inference
      • improves the quality of the alignment

slide-114
SLIDE 114

Multiple sequence alignment

Multiple sequence alignment

example: ’one’

PIE: oinos
Bosnian: yedan
Kashubian: yEdEn

optimal pairwise alignments:

o i n o s     o i n o s     y e d a n
y e d a n     y E d E n     y E d E n

optimal multiple alignment (maximizing sum of pairwise similarities per column):

y E d E n
o i n o s
y e d a n

[gap symbols of the original figure are not reproduced]

  • alignment of all ’n’s is etymologically correct
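The objective named on this slide, the sum of pairwise similarities per column, can be sketched as follows. The sketch assumes a toy similarity (+1 for identical symbols, −1 otherwise, 0 for two gaps) in place of PMI scores; `sum_of_pairs` and `toy_similarity` are hypothetical names, and the example rows are the three words as they appear above, without gaps.

```python
from itertools import combinations


def toy_similarity(x, y, gap="-"):
    """Hypothetical stand-in for PMI scores."""
    if x == gap and y == gap:
        return 0.0
    if x == gap or y == gap:
        return -1.0
    return 1.0 if x == y else -1.0


def sum_of_pairs(rows, sim=toy_similarity):
    """Sum the similarity of every pair of symbols within each alignment column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0.0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            total += sim(x, y)
    return total


print(sum_of_pairs(["yEdEn", "oinos", "yedan"]))  # -9.0 under the toy scores
```

A multiple-alignment procedure would compare this value across candidate alignments and prefer the one with the highest total.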

slide-115
SLIDE 115

Multiple sequence alignment

Multiple sequence alignment

  • in principle, the Needleman-Wunsch algorithm can be generalized to aligning k sequences
  • however, the complexity of aligning k sequences of length n grows exponentially with k (the dynamic-programming table alone has on the order of n^k cells) ⇒ computationally intractable
  • two strategies
      • heuristic search
      • progressive alignment

slide-116
SLIDE 116

Multiple sequence alignment

Progressive sequence alignment

  • start with a guide tree (using some heuristics like pairwise alignment + Neighbor Joining)
  • working bottom-up, at each internal node, do pairwise alignment of the block alignments at the daughter nodes
  • complexity is O(n^2 k^3) ⇒ computationally feasible

slide-117
SLIDE 117

Multiple sequence alignment

T-Coffee

  • progressive alignment only uses (phylogenetically) local information
  • erroneous decisions cannot be corrected later

words: dendron, 8enro, tri, dru

successive block alignments:

dendron     dendron     dendron
8en-ro-     8en-ro-     8en-ro-
            d---ru-     d---ru-
                        ---tri-

slide-121
SLIDE 121

Multiple sequence alignment

T-Coffee

Tree-based Consistency Objective Function for alignment Evaluation (Notredame et al., 2000), slightly adapted for linguistic application

1. pairwise alignment for all word pairs, using PMI scores
2. ternary alignments via relation composition
3. indirect alignment scores between sound occurrences
4. progressive alignment using those scores

pairwise alignments:

dendron   dendron   dendron   dendron   dendron   8enro   8enro   dru
8en-ro-   8---ru-   ---dru-   t---ri-   ---tri-   d--ru   t--ri   tri

ternary alignments:

t---ri-   t--ri   t---ri-   ---tri-   ---tri-
dendron   8enro   dendron   dendron   dendron   ...
---dru-   d--ru   d---ru-   ---dru-   t---ru-

words: dendron, 8enro, tri, dru

resulting progressive alignment:

dendron     dendron     dendron
8en-ro-     8en-ro-     8en-ro-
            d---ru-     d---ru-
                        t---ri-
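Step 2, relation composition, can be illustrated with a small sketch. It assumes that a pairwise alignment is represented as the set of index pairs of sounds placed in the same column; `matched_positions` and `compose` are hypothetical helper names, and how the composed links are turned into scores (step 3) is not shown.

```python
def matched_positions(row_a, row_b, gap="-"):
    """Index pairs (i, j): sound i of word A shares a column with sound j of word B."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return pairs


def compose(ab, bc):
    """Link sounds of A and C that are both aligned to the same sound of B."""
    by_b = {}
    for j, k in bc:
        by_b.setdefault(j, set()).add(k)
    return {(i, k) for i, j in ab for k in by_b.get(j, ())}


# tri and dru, linked indirectly through dendron
tri_dendron = matched_positions("t---ri-", "dendron")
dendron_dru = matched_positions("dendron", "d---ru-")
print(sorted(compose(tri_dendron, dendron_dru)))  # [(0, 0), (1, 1), (2, 2)]
```

With the alternative pairwise alignment `---dru-` for dru, the composition links only r and the final vowels, which shows how the indirect evidence depends on the pairwise alignments that feed into it.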

slide-122
SLIDE 122

Multiple sequence alignment

Examples

cognate class   language     word
one:A           German       -a-i--n-
one:A           Dutch        ---e--n-
one:A           English      -w-o--n-
one:A           Danish       ---e--n-
one:A           Swedish      ---E--n-
one:A           Icelandic    ---eidn-
one:A           Irish        ---e--n-
one:A           Breton       ---i--n-
one:A           French       ---E----
one:A           Catalan      ---u--n-
one:A           Spanish      ---u--no
one:A           Portuguese   ---u----
one:A           Italian      ---u--no
one:A           Romanian     ---u--nu
one:A           Bengali      ---E--k-
one:A           Nepali       ---e--k-
one:A           Czech        yEdE--n-
one:A           Polish       yEdE--n-
one:A           Ukrainian    -odi--n-
one:A           Russian      -adi--n-
one:A           Bulgarian    -3di--n-

cognate class   language     word
heart:J         German       h-Er-t--s-
heart:J         Dutch        h-or-t----
heart:J         English      h-o--t----
heart:J         Danish       y-Ea-d--3-
heart:J         Swedish      y-E--t--a-
heart:J         Icelandic    S-ar-t--a-
heart:J         French       k-Er------
heart:J         Catalan      k-or------
heart:J         Spanish      k-ora8--on
heart:J         Portuguese   k-uras--aw
heart:J         Italian      kwor----e-
heart:J         Hindi        h--r-d--ai
heart:J         Lithuanian   S-ir-dis--
heart:J         Czech        s--r-t-sE-
heart:J         Polish       s-Er-t-sE-
heart:J         Ukrainian    s-Er-t-sE-
heart:J         Russian      s-Erdt-sE-
heart:J         Bulgarian    s-3r-t-sE-
heart:J         Greek        k-ar-8-Sa-

slide-123
SLIDE 123

Multiple sequence alignment

Examples

cognate class   language     word
two:A           German       tsvai-
two:A           Dutch        t-we--
two:A           English      t--u--
two:A           Danish       d--o--
two:A           Swedish      t-vo--
two:A           Icelandic    t-veir
two:A           French       d--e--
two:A           Catalan      d--o-s
two:A           Spanish      d--o-s
two:A           Portuguese   d--oiS
two:A           Italian      d--ue-
two:A           Romanian     d--o-y
two:A           Nepali       d--ui-
two:A           Czech        d-va--
two:A           Polish       d-va--
two:A           Ukrainian    d-wa--
two:A           Russian      d-va--
two:A           Bulgarian    d-va--
two:A           Greek        8-io--

cognate class   language     word
mother:A        German       mu-t--a-
mother:A        Dutch        mu-d--3r
mother:A        English      mo-8--3-
mother:A        Danish       mo----a-
mother:A        Swedish      mu-d--3r
mother:A        Icelandic    mou8--ir
mother:A        French       mE---r--
mother:A        Catalan      ma---r3-
mother:A        Spanish      ma-8-re-
mother:A        Portuguese   ma----i-
mother:A        Italian      ma-d-re-
mother:A        Czech        ma-t-ka-
mother:A        Polish       ma-t-ka-
mother:A        Ukrainian    ma-t--i-
mother:A        Russian      ma-t----
mother:A        Bulgarian    ma-y-k3-
mother:A        Greek        mi-tera-

slide-124
SLIDE 124

Multiple sequence alignment

Examples

cognate class   language     word
tongue:W        German       ---tsuN--3
tongue:W        Dutch        ---t-oN---
tongue:W        English      ---t-oN---
tongue:W        Danish       ---d-oN--3
tongue:W        Swedish      ---t-3N--a
tongue:W        Icelandic    ---t-uNg-a
tongue:W        French       ---l-o-g--
tongue:W        Catalan      ---l-ENgw3
tongue:W        Spanish      ---l-eNgwa
tongue:W        Portuguese   ---l-i-gua
tongue:W        Italian      ---l-ingwa
tongue:W        Romanian     ---l-im-b3
tongue:W        Hindi        ---dZi--b-
tongue:W        Czech        ya-z-ik---
tongue:W        Polish       yEwz-3k---
tongue:W        Ukrainian    ya-z-ik---
tongue:W        Russian      yi-z-3k---
tongue:W        Bulgarian    -3-z-ik---

cognate class   language     word
tooth:B         Greek        8-ondi
tooth:B         German       tsan--
tooth:B         Dutch        t-ont-
tooth:B         English      t-u-8-
tooth:B         Danish       d-an--
tooth:B         Swedish      t-and-
tooth:B         Icelandic    t-En--
tooth:B         French       d-o---
tooth:B         Catalan      d-en--
tooth:B         Spanish      dyente
tooth:B         Portuguese   d-e-t3
tooth:B         Italian      d-Ente
tooth:B         Romanian     d-inte
tooth:B         Bengali      d-o-t-
tooth:B         Hindi        d-a-t-

slide-125
SLIDE 125

Multiple sequence alignment

Examples

cognate class   language     word
dog:A           Lithuanian   S-u---o-
dog:A           Ukrainian    s-obaka-
dog:A           Russian      s-abaka-
dog:A           Danish       h-u-n---
dog:A           Swedish      h-3-nd--
dog:A           Icelandic    h-i-ndir
dog:A           German       h-u-nt--
dog:A           Dutch        h-o-nt--
dog:A           Welsh        k-----i-
dog:A           Breton       k-----i-
dog:A           Irish        k-----u-
dog:A           French       S-i---E-
dog:A           Italian      k-a-n-e-
dog:A           Portuguese   k-a---u-
dog:A           Romanian     k-3yn-e-
dog:A           Greek        Tio-n---

cognate class   language     word
tree:C          Danish       d---G-E--
tree:C          Swedish      t---r-Ed-
tree:C          Icelandic    t---ryE--
tree:C          English      t---r-i--
tree:C          Ukrainian    dE--r-Ewo
tree:C          Russian      dE--r-Evo
tree:C          Polish       d---Z-Evo
tree:C          Bulgarian    d3--r--vo
tree:C          Greek        8endr---o

slide-126
SLIDE 126

Wrapping up

Wrapping up


slide-127
SLIDE 127

Wrapping up

Important topics not covered
  • Bayesian tree estimation ⇒ no good introductory texts so far (that I am aware of); the best starting point might be Chen et al. (2014), esp. chapters 1 and 7
  • estimation of time depths in years (rather than in “expected number of mutations”) ⇒ Chang et al. (2015)
  • automatic cognate detection ⇒ first part of the Jäger/List manuscript on the course homepage

Hot research topics
  • automatic discovery of regular sound correspondences and sound laws
  • automatic reconstruction of proto-forms
  • factoring vertical descent from language contact
  • integrated probabilistic inference of sequence alignment and phylogenies

slide-128
SLIDE 128

References

Chang, W., C. Cathcart, D. Hall, and A. Garrett (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91(1):194–244.

Chen, M.-H., L. Kuo, and P. O. Lewis (2014). Bayesian Phylogenetics. Methods, Algorithms and Applications. CRC Press, Abingdon.

Dolgopolsky, A. B. (1986). A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia. In V. V. Shevoroshkin, ed., Typology, Relationship and Time: A collection of papers on language change and relationship by Soviet linguists, pp. 27–50. Karoma Publisher, Ann Arbor.

Jäger, G. (2015). Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences, 112(41):12752–12757. doi: 10.1073/pnas.1500331112.

Jäger, G. and S. Wichmann (2016). Inferring the world tree of languages from word lists. In S. G. Roberts, C. Cuskley, L. McCrohon, L. Barceló-Coblijn, O. Feher, and T. Verhoef, eds., The Evolution of Language: Proceedings of the 11th International Conference (EVOLANG11). Available online: http://evolang.org/neworleans/papers/147.html.

List, J.-M. (2014). Sequence Comparison in Historical Linguistics. Düsseldorf University Press, Düsseldorf.