SLIDE 1

Multiple Word Alignment with Profile Hidden Markov Models

Aditya Bhargava and Grzegorz Kondrak Department of Computing Science University of Alberta {abhargava,kondrak}@cs.ualberta.ca

slide-2
SLIDE 2

Multiple word alignment

  • Given multiple words, align them all to each other
  • Our approach: Profile HMMs, used in biological sequence analysis
  • Use match, insert, and delete states to model changes

  • Evaluate on cognate set matching

▫ Beat baselines of average and minimum edit distance

SLIDE 3

What you can expect

  • Introduction: word alignment
  • Profile hidden Markov models

▫ For bioinformatics
▫ For words?

  • Experiments
  • Conclusions & future work

SLIDE 4

Introduction

  • Multiple word alignment:

▫ Take a set of words
▫ Generate some alignment of these words
▫ Similar and equivalent characters should be aligned together

  • Pairwise alignment gets us:

▫ String similarity and word distances
▫ Cognate identification
▫ Comparative reconstruction
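The pairwise string distances mentioned above can be made concrete with minimum edit distance, which is also one of the baselines used later. This is a standard textbook routine, not code from the talk:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (Levenshtein distance)."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]
```

For example, `edit_distance("zen", "dzien")` is 2 (insert `d`, insert `i`), which is the kind of score a pairwise baseline would assign to a cognate pair.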

SLIDE 5

Introduction

  • Extending to multiple words gets us:

▫ String similarity with multiple words
▫ Better-informed cognate identification
▫ Better-informed comparative reconstruction

  • We propose Profile HMMs for multiple alignment

▫ Test on cognate set matching

SLIDE 6

Profile hidden Markov models

SLIDE 7

Profile hidden Markov models

  • Match states are “defaults”
  • Insert states are used to represent inserted symbols
  • Delete states are used to represent the absence of symbols
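The three state types form a fixed left-to-right topology. As an illustrative sketch (the function and state names are mine, not from the talk), the allowed transitions of a length-L Profile HMM can be enumerated like this:

```python
def profile_hmm_topology(L: int):
    """List the allowed transitions of a length-L Profile HMM.

    Match (M) and delete (D) states advance one position per step;
    insert (I) states loop on themselves, allowing any number of
    extra symbols between consecutive match columns."""
    def targets(i):
        advance = [f"M{i+1}", f"D{i+1}"] if i < L else ["end"]
        return advance + [f"I{i}"]  # staying at position i means inserting
    edges = [("start", t) for t in targets(0)]
    edges += [("I0", t) for t in targets(0)]  # inserts before column 1
    for i in range(1, L + 1):
        for state in (f"M{i}", f"D{i}", f"I{i}"):
            edges += [(state, t) for t in targets(i)]
    return edges
```

The self-loop on each insert state is what lets the model emit arbitrarily many symbols between two match columns.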

SLIDE 8

Profile hidden Markov models

MMIIIM
AG...C
A-AG.C
AG.AA-
--AAAC
AG...C

  • In this sample DNA alignment, dashes represent deletes and periods represent skipped inserts
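Given the column annotation (M = match column, I = insert column), each row's state path can be recovered mechanically. A hedged sketch (the function name is my own):

```python
def state_path(columns: str, row: str):
    """Map an aligned row to its Profile HMM state sequence.

    '-' in a match column is a delete state; '.' in an insert column
    means the insert state was skipped (nothing was emitted there)."""
    path, pos = [], 0  # pos counts match columns seen so far
    for label, ch in zip(columns, row):
        if label == "M":
            pos += 1
            path.append(f"D{pos}" if ch == "-" else f"M{pos}")
        elif ch != ".":           # an emitted symbol in an insert column
            path.append(f"I{pos}")
    return path
```

On the alignment above, `AG...C` passes through only match states, while `A-AG.C` takes a delete at position 2 and two insertions afterwards.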

SLIDE 16

Profile hidden Markov models

  • To construct a Profile HMM from aligned sequences:

▫ Determine which columns are match columns and which are insert columns, then estimate transition and emission probabilities directly from counts

  • To construct a Profile HMM from unaligned sequences:

▫ Choose a model length, initialize the model, then train it to the sequences using Baum-Welch
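The counts-based estimation for the aligned case can be sketched as follows; this minimal version estimates only match-state emissions and omits pseudocounts for brevity (names are illustrative, not from the talk):

```python
from collections import Counter

def match_emissions(columns, rows):
    """Estimate each match state's emission distribution by relative
    symbol frequency in its column, skipping deletes ('-')."""
    probs = {}
    match_idx = [i for i, c in enumerate(columns) if c == "M"]
    for k, i in enumerate(match_idx, 1):
        counts = Counter(r[i] for r in rows if r[i] != "-")
        total = sum(counts.values())
        probs[f"M{k}"] = {s: n / total for s, n in counts.items()}
    return probs
```

Applied to the DNA alignment from the earlier slides, every match column is perfectly conserved, so each match state's distribution puts all its mass on one symbol.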

SLIDE 17

Profile hidden Markov models

  • Evaluating a sequence for membership in a family:

▫ Use the forward algorithm to get the probability
▫ Use Viterbi to align the sequence

  • Multiple alignment of unaligned sequences

▫ Construct & train a Profile HMM
▫ Use Viterbi to align the sequences
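A minimal forward-algorithm scorer, as a sketch: this handles a plain HMM whose every state emits a symbol, whereas a real Profile HMM additionally has silent delete states and needs a small extension of the recursion. All names here are my own:

```python
import math

def forward_logprob(seq, states, start, trans, emit):
    """Log-probability of seq under a simple (fully emitting) HMM,
    computed with the forward algorithm."""
    # alpha[s] = P(observed prefix, current state = s)
    alpha = {s: start.get(s, 0.0) * emit[s].get(seq[0], 0.0) for s in states}
    for x in seq[1:]:
        alpha = {s: emit[s].get(x, 0.0) *
                    sum(alpha[p] * trans[p].get(s, 0.0) for p in states)
                 for s in states}
    return math.log(sum(alpha.values()))
```

This log-probability is the membership score the slide refers to: a sequence that fits the family's model well gets a higher score.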

SLIDE 18

Profile hidden Markov models

  • Profile HMMs are generalizations of Pair HMMs

▫ Word similarity and cognate identification

  • Unlike Pair HMMs, Profile HMMs are position-specific

▫ Each model is constructed from a specific family of sequences
▫ Pair HMMs are trained over many pairs of words

SLIDE 19

Profile HMMs for words

  • Words are also sequences!
  • Similar to their use for biological sequences, we apply Profile HMMs to multiple word alignment
  • We also test Profile HMMs on matching words to cognate sets
  • We made our own implementation and investigated several parameters

SLIDE 20

Profile HMMs: parameters

  • Favour match states?
  • Pseudocount methods

▫ Constant-value, background frequency, substitution matrix

  • Pseudocount weight
  • Pseudocounts added during Baum-Welch
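The simplest of the three listed pseudocount methods, constant-value, can be sketched like this (the weight and function names are illustrative, not from the talk):

```python
def with_pseudocounts(counts, alphabet, weight=0.5):
    """Constant-value pseudocounts: add `weight` to every symbol's
    count before normalizing, so symbols unseen in a column still
    keep a nonzero emission probability."""
    total = sum(counts.values()) + weight * len(alphabet)
    return {s: (counts.get(s, 0) + weight) / total for s in alphabet}
```

The background-frequency and substitution-matrix methods replace the constant `weight` with quantities derived from corpus statistics or a symbol-similarity matrix.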

SLIDE 21

Experiments: Data

  • Comparative Indoeuropean Data Corpus

▫ Cognation data for words in 95 languages corresponding to 200 meanings

  • Each meaning reorganized into disjoint cognate sets

SLIDE 22

Experiments: Multiple cognate alignment

MIIMIIMI
D--E--N-
D--E--NY
Z--E--N-
DZ-E--N-
DZIE--N-
D--A--N-
DI-E--NA
D--E--IZ
D--E----
D--Y--DD
D--I--A-
D--I--E-
D-----I-
Z-----I-
Z--U--E-
Z-----U-
J--O--UR
DJ-O--U-
J--O--UR
G--IORNO

  • Parameters determined from cognate set matching experiments (later)
  • Pseudocount weight set to 100 to bias the model using a substitution matrix
  • Highly-conserved columns are aligned correctly
  • Similar-sounding characters are also aligned correctly, thanks to the substitution matrix method
  • Insert columns should not be considered aligned
  • Problems with multi-character phonemes

▫ An expected problem when using the English alphabet instead of e.g. IPA

SLIDE 23

Experiments: Cognate set matching

  • How can we evaluate the alignments in a principled way? There is no gold standard!
  • We emulate the biological sequence analysis task of matching a sequence to a family; we match a word to a cognate set
  • The task is to correctly identify the cognate set to which a word belongs, given a number of cognate sets having the same meaning as the word; we choose the model yielding the highest score
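The matching procedure reduces to scoring the word under each candidate set's model and taking the argmax. A sketch, with a stand-in `score` callable where the forward-algorithm log-probability would go (all names are mine):

```python
def match_cognate_set(word, models, score):
    """Return the id of the cognate set whose model assigns `word`
    the highest score; `score(model, word)` stands in for the
    forward-algorithm log-probability under that set's Profile HMM."""
    return max(models, key=lambda set_id: score(models[set_id], word))
```

Accuracy is then the fraction of held-out words assigned to their true cognate set.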

SLIDE 24

Experiments: Cognate set matching

  • Development set of 10 meanings (~5% of the data)
  • Substitution matrix derived from Pair HMM method
  • Best parameters:

▫ Favour match states
▫ Use substitution matrix pseudocount
▫ Use 0.5 for pseudocount weight
▫ Add pseudocounts during Baum-Welch

SLIDE 25

Experiments: Cognate set matching

Average Edit Distance: 77.0%
Minimum Edit Distance: 91.0%
Profile HMM: 93.2%

SLIDE 26

Experiments: Cognate set matching

  • Accuracy better than both average and minimum edit distance
  • Why so close to MED?

▫ Many sets had duplicate words (same orthographic representation for different languages)

SLIDE 27

Conclusions

  • Profile HMMs can work for word-related tasks
  • Multiple alignments are reasonable
  • Cognate set matching performance exceeds minimum and average edit distance
  • If multiple words need to be considered, Profile HMMs present a viable method

SLIDE 28

Future work

  • Better model construction from aligned sequences

  • Better initial models for unaligned sequences
  • Better pseudocount methods
  • N-gram output symbols
