
SLIDE 1


Improving Phonetic Alignment by Handling Secondary Sequence Structures

Johann-Mattis List∗

∗Institute for Romance Languages and Literature

Heinrich Heine University Düsseldorf

2012/08/10

1 / 40

SLIDE 2

Structure of the Talk

1. Historical Linguistics: Keys to the Past, Comparative Method, Sound Correspondences
2. Sequence Comparison: Sequences, Alignment Analyses, Alignment Modes
3. Secondary Alignment: Secondary Sequence Structures, Secondary Alignment Problem, Secondary Alignment Algorithm
4. Phonetic Alignment: SCA, Paradigmatic Aspects, Syntagmatic Aspects
5. Evaluation: Evaluation Measures, Gold Standard, Results

2 / 40

SLIDE 3

Historical Linguistics

3 / 40

SLIDE 5

Historical Linguistics Keys to the Past

Charles Lyell on Languages

The Geological Evidences of the Antiquity of Man, with Remarks on Theories of the Origin of Species by Variation

By Sir Charles Lyell. London: John Murray, Albemarle Street, 1863.

4 / 40

SLIDE 6

Historical Linguistics Keys to the Past

Charles Lyell on Languages

If we knew nothing of the existence of Latin, - if all historical documents previous to the fifteenth century had been lost, - if tradition even was silent as to the former existence of a Roman empire, a mere comparison of the Italian, Spanish, Portuguese, French, Wallachian, and Rhaetian dialects would enable us to say that at some time there must have been a language, from which these six modern dialects derive their origin in common.

4 / 40

SLIDE 14

Historical Linguistics Keys to the Past

Historical Scenarios

Proto-Indo-European          d e n t
    Proto-Germanic           t a n θ
        German               ʦ aː n
        English              t ʊː θ
    Proto-Romance            d e n t e
        Italian              d ɛ n t e
        French               d ɑ̃

5 / 40

SLIDE 15

Historical Linguistics Comparative Method

The Comparative Method

1. Compile an initial list of putative cognate sets.
2. Extract an initial list of putative sets of sound correspondences from the initial cognate list.
3. Refine the cognate list and the correspondence list by
   (a) adding and deleting cognate sets from the cognate list, depending on whether they are consistent with the correspondence list or not, and
   (b) adding and deleting correspondence sets from the correspondence list, depending on whether they are consistent with the cognate list or not.
4. Finish when the results are satisfying enough.

6 / 40

SLIDE 17

Historical Linguistics Sound Correspondences

Sound Correspondences

Sequence similarity is determined on the basis of systematic sound correspondences, as opposed to similarity based on surface resemblances of phonetic segments. Lass (1997) calls this notion of similarity genotypic, as opposed to a phenotypic notion of similarity. The most crucial aspect of correspondence-based similarity is that it is language-specific: genotypic similarity is never defined in general terms but always with respect to the language systems being compared.

Meaning     German           Dutch          English
“tooth”     Zahn [ʦaːn]      tand [tɑnt]    tooth [tʊːθ]
“ten”       zehn [ʦeːn]      tien [tiːn]    ten [tɛn]
“tongue”    Zunge [ʦʊŋə]     tong [tɔŋ]     tongue [tʌŋ]
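
The idea of a recurring correspondence can be made concrete with a small count. The sketch below is only an illustration, not part of the talk or of LingPy: it uses the three cognate sets from the table, segmented by hand, and the helper name `initial_correspondences` is hypothetical.

```python
from collections import Counter

# Hand-segmented cognate sets from the table (German, Dutch, English).
cognates = {
    "tooth": (["ʦ", "aː", "n"], ["t", "ɑ", "n", "t"], ["t", "ʊː", "θ"]),
    "ten": (["ʦ", "eː", "n"], ["t", "iː", "n"], ["t", "ɛ", "n"]),
    "tongue": (["ʦ", "ʊ", "ŋ", "ə"], ["t", "ɔ", "ŋ"], ["t", "ʌ", "ŋ"]),
}


def initial_correspondences(sets):
    """Count how often each triple of word-initial segments recurs."""
    return Counter(
        tuple(word[0] for word in words) for words in sets.values()
    )


counts = initial_correspondences(cognates)
# The pattern German ʦ : Dutch t : English t recurs in every cognate set.
print(counts.most_common(1))
```

A pattern that recurs across many cognate sets is evidence for a systematic correspondence, even where the sounds involved (here ʦ and t) differ on the surface.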

7 / 40

SLIDE 18

Historical Linguistics Sound Correspondences

Sound Correspondences

Meaning      Shanghai         Beijing          Guangzhou
“nine”       [ʨiɤ³⁵]          [ʨiou²¹⁴]        [kɐu³⁵]
“today”      [ʨiŋ⁵⁵ʦɔ²¹]      [ʨiɚ⁵⁵]          [kɐm⁵³jɐt²]
“rooster”    [koŋ⁵⁵ʨi²¹]      [kuŋ⁵⁵ʨi⁵⁵]      [kɐi⁵⁵koŋ⁵⁵]

7 / 40

SLIDE 19

Sequence Comparison

8 / 40

SLIDE 20

Sequence Comparison Sequences

Sequences

Definition 1: Given an alphabet (a non-empty finite set whose elements are called characters), a sequence is an ordered list of characters drawn from the alphabet. The elements of a sequence are called segments (cf. Böckenhauer & Bongartz 2003: 30f).

9 / 40

SLIDE 25

Sequence Comparison Sequences

Sequences

Baked Rabbit

1 rabbit
1 1/2 tsp. salt
1/8 tsp. pepper
1 1/2 c. onion slices

Rub salt and pepper on rabbit pieces.
Place on large sheet of aluminium foil.
Place onion slices on rabbit.
Bake at 350 degrees.
Eat when done and tender.

11 / 40

SLIDE 26

Sequence Comparison Alignment Analyses

Alignment Analyses

Definition 2: An alignment of two sequences s and t is a two-row matrix in which both sequences are arranged in such a way that all matching and mismatching segments occur in the same column, while empty cells, resulting from empty matches, are filled with gap symbols (cf. Kruskal 1983).

12 / 40

SLIDE 30

Sequence Comparison Alignment Modes

Global Alignment

Global alignment analyses are the most basic way to compare sequences. The traditional Needleman-Wunsch algorithm (Needleman and Wunsch 1970) conducts global alignment analyses, and the Levenshtein distance (edit distance, Levenshtein 1965) is defined for global alignments.

Mode      Alignment
global    G R E E N C A T F I S H H U N T E R
          A F A T C A T H U N T E R

14 / 40
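
As a minimal sketch of global alignment, the following function fills a Needleman-Wunsch matrix and traces back one optimal alignment. The scoring values (match 1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen for illustration; they are not taken from the talk.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: return (score, row_s, row_t) for one optimal alignment."""
    m, n = len(s), len(t)
    # M[i][j] is the best score for aligning s[:i] with t[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,  # (mis)match
                          M[i - 1][j] + gap,      # gap in t
                          M[i][j - 1] + gap)      # gap in s
    # Trace back from the bottom-right cell to build the two-row matrix.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return M[m][n], row_s[::-1], row_t[::-1]


score, row_s, row_t = global_align("GREENCATFISHHUNTER", "AFATCATHUNTER")
print(" ".join(row_s))
print(" ".join(row_t))
```

With `match=0`, the negated score equals the Levenshtein distance, which is why the edit distance is said to be defined for global alignments.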

SLIDE 32

Sequence Comparison Alignment Modes

Semi-Global Alignment

Semi-global alignment analyses do not necessarily compare two sequences as a whole but allow prefixes and suffixes to be ignored in an alignment analysis if these would otherwise increase the cost of the optimal alignment. Computationally, this is done by setting the costs for gaps inserted at the beginning and at the end of an alignment to zero.

Mode           Alignment
global         G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R
semi-global    G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R

15 / 40
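
The zero-cost end gaps can be sketched as two small changes to the global recurrence: the first row and column stay at zero, and the final score is read from the best cell in the last row or last column. The function name `semiglobal_score` and the scoring values are assumptions for illustration only.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, gap=-1):
    """Score of the best semi-global alignment (leading and trailing gaps are free)."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay 0: gaps at the beginning cost nothing.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
    # Gaps at the end cost nothing: take the best cell in the last row or column.
    return max(max(M[m]), max(row[n] for row in M))


# The suffix FISH of the first word overlaps the prefix FISH of the second.
print(semiglobal_score("CATFISH", "FISHHUNTER"))
```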

SLIDE 34

Sequence Comparison Alignment Modes

Local Alignment

While semi-global alignment analyses allow prefixes and suffixes to be ignored only if one sequence contains a prefix or suffix while the other does not, local alignment analyses (Smith-Waterman algorithm, Smith and Waterman 1981) only align the best-scoring subsequences of two sequences, while leaving the rest of the sequences completely unaligned. Computationally, this is done by not allowing the score of an alignment to drop below zero.

Mode           Alignment
global         G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R
semi-global    G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R
local          GREEN CATFISH H U N T E R
               A FAT CAT H U N T E R

16 / 40
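
A minimal Smith-Waterman sketch shows the zero floor in practice: every cell is at least 0, and the traceback starts at the highest-scoring cell and stops as soon as a cell with score 0 is reached. The function name `local_align` and the scoring values are illustrative assumptions.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: return (score, row_s, row_t) for the best local alignment."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 keeps scores from dropping below zero.
            M[i][j] = max(0,
                          M[i - 1][j - 1] + sub,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, row_s[::-1], row_t[::-1]


score, row_s, row_t = local_align("GREENCATFISHHUNTER", "AFATCATHUNTER")
print(score, "".join(row_s), "".join(row_t))
```

Under these scores, only the shared stretch HUNTER is aligned for the example from the table, and the remaining material is left out of the alignment.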

SLIDE 36

Sequence Comparison Alignment Modes

Diagonal Alignment

While local alignment analyses leave unalignable parts of sequences unaligned, diagonal alignment analyses (DIALIGN algorithm, Morgenstern 1996) align sequences globally, but search for local similarities at the same time. Local similarities are defined as “diagonals”, i.e. ungapped alignments. Diagonal alignment analyses maximize the score of diagonals in an alignment.

Mode           Alignment
global         G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R
semi-global    G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R
local          GREEN CATFISH H U N T E R
               A FAT CAT H U N T E R
diagonal       G R E E N C A T F I S H H U N T E R
               A F A T C A T H U N T E R

17 / 40

SLIDE 38

Secondary Alignment

secondarysequencestructures
secondary sequence structures
se co nda ry se que nce stru ctu re s
se con da ry se quence struc tures
s e c o n d a r y s e q u e n c e s t r u c t u r e s
S E C O N D A R Y S E Q U E N C E S T R U C T U R E S
sec ond ary seq uen ces tru ctu res
seco ndar yseq uenc estr uctu res

18 / 40

SLIDE 42

Secondary Alignment Secondary Sequence Structures

Secondary Sequence Structures

Apart from a primary structure, sequences can also have a secondary structure. Primary structure refers to the order of segments. Secondary structure refers to the order of secondary segments, i.e. segments that result from the grouping of primary segments into higher units.

"ABCEFGIJK" → "ABC.EFG.IJK"
"THECATFISHHUNTER" → "THE.CATFISH.HUNTER"
"KARAOKE" → "KA.RA.O.KE"
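
The two levels of structure can be read off a boundary-marked string directly. The sketch below assumes "." as the boundary character, as in the examples above; the helper names are hypothetical.

```python
def primary_segments(sequence, boundary="."):
    """Return the primary structure: all segments, boundary markers dropped."""
    return [seg for seg in sequence if seg != boundary]


def secondary_segments(sequence, boundary="."):
    """Return the secondary structure: primary segments grouped into higher units."""
    return [list(chunk) for chunk in sequence.split(boundary)]


print(primary_segments("KA.RA.O.KE"))
print(secondary_segments("KA.RA.O.KE"))
```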

19 / 40

SLIDE 43

Secondary Alignment Secondary Alignment Problem

The Secondary Alignment Problem

Secondary Alignment Problem: Given two sequences $s$ and $t$ of length $m$ and $n$, which have the primary structures $s_1, \ldots, s_m$ and $t_1, \ldots, t_n$ and the secondary structures $s_{0 \to i}, \ldots, s_{j \to m}$ and $t_{0 \to k}, \ldots, t_{l \to n}$, find an alignment of maximal score in which segments belonging to the same secondary segment in $s$ only correspond to segments belonging to the same secondary segment in $t$, and vice versa.

20 / 40

SLIDE 44

Secondary Alignment Secondary Alignment Problem

The Secondary Alignment Problem

Mode           Alignment
global         T H E C A T F I S H H U N T S
               T H E C A T F I S H E S
semi-global    T H E C A T F I S H H U N T S
               T H E C A T F I S H E S
local          T H E C A T F I S H HUNTS
               T H E C A T F I S H ES
diagonal       T H E C A T F I S H H U N T S
               T H E C A T F I S H E S
secondary      T H E C A T F I S H H U N T S
               T H E C A T F I S H E S

21 / 40

SLIDE 45

Secondary Alignment Secondary Alignment Algorithm

A Secondary Alignment Algorithm

Algorithm 1: Secondary(x, y, g, r, score)

comment: matrix construction and initialization
. . .
comment: main loop
for i ← 1 to length(x) do
    for j ← 1 to length(y) do
        M[i][j] ← max(
            M[i − 1][j − 1] + score(x_{i−1}, y_{j−1}),
            comment: check for restriction 2
            if x_{i−1} = r and y_{j−1} ≠ r and j ≠ length(y) then −∞ else M[i − 1][j] + g,
            if y_{j−1} = r and x_{i−1} ≠ r and i ≠ length(x) then −∞ else M[i][j − 1] + g
        )
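
The main loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the LingPy implementation: the initialization (plain cumulative gap scores) and the scoring values (match 1, mismatch -1, gap -1) are assumed, and the rule that a boundary character may only be matched with another boundary character is taken to be the restriction enforced inside `score` (restriction 1), since only restriction 2 is spelled out in the pseudocode.

```python
NEG_INF = float("-inf")


def secondary_score(x, y, g=-1, r=".", match=1, mismatch=-1):
    """Global alignment score that respects secondary structure marked by r."""

    def score(a, b):
        # Assumed restriction 1: a boundary never matches a non-boundary segment.
        if (a == r) != (b == r):
            return NEG_INF
        return match if a == b else mismatch

    m, n = len(x), len(y)
    # Assumed initialization: cumulative gap scores, as in plain global alignment.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * g
    for j in range(1, n + 1):
        M[0][j] = j * g

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            xi, yj = x[i - 1], y[j - 1]
            diag = M[i - 1][j - 1] + score(xi, yj)
            # Restriction 2: a boundary may only be gapped next to a boundary
            # in the other sequence, or at the end of the other sequence.
            up = NEG_INF if (xi == r and yj != r and j != n) else M[i - 1][j] + g
            left = NEG_INF if (yj == r and xi != r and i != m) else M[i][j - 1] + g
            M[i][j] = max(diag, up, left)
    return M[m][n], M


total, matrix = secondary_score("AABCDE.E", "A.BC.DE")
print(total)
```

Only the score is computed here; a traceback as in ordinary global alignment would recover the alignment itself.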

22 / 40

SLIDE 46

Secondary Alignment Secondary Alignment Algorithm

A Secondary Alignment Algorithm

1. Scoring matrix for "AABCDE.E" (rows) and "A.BC.DE" (columns) without the secondary restrictions:

        A   .   B   C   .   D   E
    0  -1  -2  -3  -4  -5  -6  -7
A  -1   1   0  -1  -2  -3  -4  -5
A  -2   0   0  -1  -2  -3  -4  -5
B  -3  -1  -1   1   0  -1  -2  -3
C  -4  -2  -2   0   2   1   0  -1
D  -5  -3  -3  -1   1   1   2   1
E  -6  -4  -4  -2   0   0   1   3
.  -7  -5  -3  -3  -1   1   0   2
E  -8  -6  -4  -4  -2   0   0   1

2. Scoring matrix for the same sequences with the secondary restrictions:

        A   .   B   C   .   D   E
    0  -1  -2  -3  -4  -5  -6  -7
A  -1   1  -3  -3  -4  -6  -6  -7
A  -2   0  -4  -4  -4  -7  -7  -7
B  -3  -1  -5  -3  -4  -8  -8  -8
C  -4  -2  -6  -4  -2  -9  -9  -9
D  -5  -3  -7  -5  -3 -10  -8  -9
E  -6  -4  -8  -6  -4 -11  -9  -7
.  -7  -8  -3  -4  -5  -3  -4  -5
E  -8  -8  -4  -4  -5  -4  -4  -3

23 / 40

SLIDE 47

Secondary Alignment Secondary Alignment Algorithm

A Secondary Alignment Algorithm

The extension for secondary alignment is independent of the underlying alignment mode: global, semi-global, local, and diagonal alignment analyses that are sensitive to secondary sequence structures can all be carried out. The only additional requirement, in contrast to traditional alignment algorithms, is a boundary character, which has to be specified by the user.

24 / 40

SLIDE 48

Phonetic Alignment

hjärta    h j ä r t a
herz      h e r z
heart     h e a r t
cordis    c o r d i s

25 / 40

SLIDE 49

Phonetic Alignment SCA

Sound-Class-Based Phonetic Alignment (SCA)

SCA (List 2012) is a new method for pairwise and multiple phonetic alignment, implemented as part of LingPy (http://lingulist.de/lingpy), a Python library for quantitative tasks in historical linguistics. SCA is based on a novel framework for phonetic alignment that combines recent developments in computational biology with new approaches to sequence modelling in historical linguistics and dialectology. In this framework, sound sequences are internally represented in different layers, which capture important paradigmatic and syntagmatic aspects of linguistic sequences.

26 / 40

SLIDE 51

Phonetic Alignment Paradigmatic Aspects

Sound Classes

Sound Classes: Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986: 35).

k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z

27 / 40

SLIDE 54

Phonetic Alignment Paradigmatic Aspects

Sound Classes

Sound classes: K T P S

27 / 40

SLIDE 55

Phonetic Alignment Paradigmatic Aspects

Scoring Functions for Sound Classes

LingPy offers default scoring functions for three standard sound-class models (ASJP, SCA, DOLGO). The standard models vary regarding the granularity with which the continuum of sounds is split into discrete classes. The scoring functions are based on empirical data on sound correspondence frequencies (ASJP model, Brown et al. 2011) and on general theoretical models of the directionality and probability of sound change processes that are converted into non-directional similarity matrices (SCA, DOLGO; see List 2012 for details).

28 / 40

SLIDE 56

Phonetic Alignment Syntagmatic Aspects

Prosodic Strings

Sound change occurs more frequently in prosodically weak positions of phonetic sequences (Geisler 1992). Given the sonority profile of a phonetic sequence, one can distinguish positions that differ regarding their prosodic context. Prosodic context can be modelled by representing a sequence by a prosodic string, indicating the different prosodic contexts of each segment. Based on the relative strength of all sites in a phonetic sequence, substitution scores and gap penalties can be modified when carrying out alignment analyses. Prosodic strings are an alternative to n-gram approaches, since they also handle context, their specific advantage being that they are more abstract and less data-dependent.

29 / 40

SLIDE 59

Phonetic Alignment Syntagmatic Aspects

Prosodic Strings

j a b ə l k a
↑ △ ↑ △ ↓ ↑ △

↑ ascending    △ maximum    ↓ descending

30 / 40

SLIDE 60

Phonetic Alignment Syntagmatic Aspects

Prosodic Strings

j a b ə l k a
↑ △ ↑ △ ↓ ↑ △

[Figure: prosodic positions ranked from strong to weak]

30 / 40

SLIDE 61

Phonetic Alignment Syntagmatic Aspects

Prosodic Strings

phonetic sequence    j    a    b    ə    l    k    a
SCA model            J    A    P    E    L    K    A
ASJP model           y    a    b    I    l    k    a
DOLGO model          J    V    P    V    R    K    V
sonority profile     6    7    1    7    5    1    7
prosodic string      #    v    C    v    c    C    >
relative weight      2.0  1.5  1.5  1.3  1.1  1.5  0.7
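
The layers in this table can be derived mechanically from the segments. The sketch below is a simplified illustration and not LingPy's implementation: the class and sonority dictionaries contain only the values attested in the table, and the rule set in `prosodic_string` (first segment "#", final vowel ">", other vowels "v", consonants "C" before a rise in sonority and "c" otherwise) is an assumption that happens to reproduce the row shown; the actual SCA rules may differ. Relative weights are not modelled.

```python
# Values copied from the table for the word "jabəlka"; other sounds are not covered.
SCA = {"j": "J", "a": "A", "b": "P", "ə": "E", "l": "L", "k": "K"}
DOLGO = {"j": "J", "a": "V", "b": "P", "ə": "V", "l": "R", "k": "K"}
SONORITY = {"j": 6, "a": 7, "b": 1, "ə": 7, "l": 5, "k": 1}
VOWEL = 7  # assumption: sonority 7 marks vowels


def to_classes(segments, model):
    """Replace each segment by its sound class in the given model."""
    return [model[seg] for seg in segments]


def prosodic_string(segments):
    """Derive a prosodic string from the sonority profile (simplified rules)."""
    profile = [SONORITY[seg] for seg in segments]
    last = len(profile) - 1
    out = []
    for i, son in enumerate(profile):
        if i == 0:
            out.append("#")        # word-initial position
        elif i == last and son == VOWEL:
            out.append(">")        # word-final vowel
        elif son == VOWEL:
            out.append("v")        # sonority peak
        elif i < last and profile[i + 1] > son:
            out.append("C")        # consonant in ascending sonority
        else:
            out.append("c")        # consonant in descending sonority (assumed fallback)
    return "".join(out)


word = list("jabəlka")
print("".join(to_classes(word, SCA)))
print("".join(to_classes(word, DOLGO)))
print(prosodic_string(word))
```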

30 / 40

SLIDE 62

Phonetic Alignment Syntagmatic Aspects

Secondary Alignment

While secondary alignment was never an issue in computational biology, it is a desideratum in historical linguistics and dialectology. Secondary structures are especially important when

(1) aligning whole sentences, where the alignment of one word from one sentence with two words from another sentence should be avoided,
(2) aligning language data for which morphological information is also available, or
(3) aligning words from South-East Asian tone languages, which generally show a structure in which one syllable corresponds to one morpheme.

31 / 40

SLIDE 63

Phonetic Alignment Syntagmatic Aspects

Secondary Alignment

Primary Alignment
Haikou     z i t ³
Beijing    ʐ ʅ ⁵¹ tʰ u ¹

Secondary Alignment
Haikou     z i t ³
Beijing    ʐ ʅ ⁵¹ tʰ u ¹

32 / 40

SLIDE 64

Evaluation

v o l d e m o r t
v l a d i m i r
v a l d e m a r

33 / 40

SLIDE 65

Evaluation Evaluation Measures

Evaluation Measures

PAS: Perfect Alignment Score
CS: Column Score
SPS: Sum-of-Pairs Score

34 / 40

SLIDE 66

Evaluation Evaluation Measures

Evaluation Measures

Column Score (CS):

\[ CS = 100 \cdot \frac{2 \cdot |C_t \cap C_r|}{|C_r| + |C_t|}, \]

where $C_t$ is the set of columns in the test alignment and $C_r$ is the set of columns in the reference alignment (Rosenberg and Ogden 2009).

Sum-of-Pairs Score (SPS):

\[ SPS = 100 \cdot \frac{2 \cdot |P_t \cap P_r|}{|P_r| + |P_t|}, \]

where $P_t$ is the set of all aligned residue pairs in the test alignment and $P_r$ is the set of all aligned residue pairs in the reference alignment (ibid.).
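
Both scores can be computed by turning each alignment into a set and comparing sets. The sketch below is one possible reading of the definitions, not the evaluation code used for the talk: columns are identified by the positions of their residues in the unaligned sequences (with `None` for a gap), which is an assumption about how columns are compared, and all function names are hypothetical.

```python
def columns(alignment, gap="-"):
    """Represent each column by the residue indices it contains (None = gap)."""
    counters = [0] * len(alignment)
    cols = set()
    for col in zip(*alignment):
        indexed = []
        for k, seg in enumerate(col):
            if seg == gap:
                indexed.append(None)
            else:
                indexed.append(counters[k])
                counters[k] += 1
        cols.add(tuple(indexed))
    return cols


def pairs(alignment, gap="-"):
    """Collect all aligned residue pairs as (row_a, index_a, row_b, index_b)."""
    result = set()
    for col in columns(alignment, gap):
        for a in range(len(col)):
            for b in range(a + 1, len(col)):
                if col[a] is not None and col[b] is not None:
                    result.add((a, col[a], b, col[b]))
    return result


def overlap(test_set, ref_set):
    """100 * 2 * |T ∩ R| / (|R| + |T|), the shared form of CS and SPS."""
    return 100 * 2 * len(test_set & ref_set) / (len(ref_set) + len(test_set))


def column_score(test, ref):
    return overlap(columns(test), columns(ref))


def sum_of_pairs_score(test, ref):
    return overlap(pairs(test), pairs(ref))


reference = ["ab-c", "a-bc"]
test = ["abc", "abc"]
print(column_score(test, reference), sum_of_pairs_score(test, reference))
```

Rows may be strings, as here, or lists of phonetic segments when a segment spans more than one character.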

35 / 40

SLIDE 67

Evaluation Gold Standard

Gold Standard

1,089 manually aligned sequence pairs.
Words taken from the Bai dialects (Wang 2006, Allen 2007) and Chinese dialects (Hou 2004).
Both Bai and Chinese are tone languages.
All data is available at http://lingulist.de/supp/secondary.zip

36 / 40

SLIDE 68

Evaluation Results

Results

Score    Primary    Secondary
PAS      83.47      88.89
CS       88.54      92.70
SPS      92.78      95.52

37 / 40

SLIDE 69

Concluding Remarks

As the results show, the modified algorithm, which is sensitive to secondary sequence structures, clearly improves on the traditional algorithm, which aligns sequences only with respect to their primary structure (PAS rises from 83.47 to 88.89). The improvement is significant with p < 0.01 using the Wilcoxon signed-rank test, as suggested by Notredame (2000). The algorithm for secondary alignment proves very useful for the alignment of tone languages, yet it may also be employed for the analysis of other kinds of sequential data and may, for example, help to carry out phonetic alignment analyses of whole sentences.

38 / 40

SLIDE 70

*deh3- ?

What’s next?

39 / 40

SLIDE 71

Special thanks to:

The German Federal Ministry of Education and Research (BMBF) for funding our research project.

Hans Geisler for his helpful, critical, and inspiring support.

40 / 40

SLIDE 72

THANK YOU FOR LISTENING!

40 / 40