NUKTI: English-Inuktitut Word Alignment System Description



SLIDE 1

NUKTI: English-Inuktitut Word Alignment System Description

Philippe Langlais, Fabrizio Gotti and Guihong Cao

RALI Département d’informatique et de recherche opérationnelle Université de Montréal

WPT— June 2005

Felipe & Fabrizio & Guihong @RALI, UdeM ( RALI Département d’informatique et de recherche opérationnelle Univ NUKTI WPT— June 2005 1 / 16

SLIDE 2

Context

We found the task intriguing enough to spend 2 weeks testing 2 approaches.

  • no (textual) resources other than the ones provided
  • ∼ 4 000 lines of C++ code (bug inside)

We corrected a few bugs after the deadline.

SLIDE 3

Word Alignment as a Sentence Alignment Task

Observation on the DEV corpus : monotonicity

in regards to   →  pijjutigillugu  (3-1)
elders          →  innatuqait      (1-1)
and             →  amma            (1-1)
youth           →  makkuttu        (1-1)

monotonicity ≡ a perfect setting for sentence alignment
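The monotonicity observation can be checked mechanically. The following is a minimal sketch, assuming alignment links are given as (English index, Inuktitut index) pairs; the function name `is_monotone` and the data layout are illustrative and not part of the system described in the talk.

```python
def is_monotone(links):
    """Return True if no two alignment links cross.

    `links` is an iterable of (english_index, inuktitut_index) pairs.
    """
    ordered = sorted(links)
    # Once sorted by English position, Inuktitut positions must never decrease.
    return all(a[1] <= b[1] for a, b in zip(ordered, ordered[1:]))


# "in regards to elders and youth" / "pijjutigillugu innatuqait amma makkuttu"
example = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (5, 3)]
print(is_monotone(example))
```

A pair of links such as (0, 1) and (1, 0) crosses, so the same function rejects it.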

SLIDE 4

Our in-house Sentence Alignment Program : JAPA

developed for the Arcade evaluation campaign (Langlais et al., 1998) :

  • step 1 : (roughly) word-align in order to delimit the search space
  • step 2 : sentence-align by a mix of (Gale & Church, 1993) and (Simard et al., 1992)

available at rali.iro.umontreal.ca/Japa

SLIDE 5

Word Alignment as a Sentence Alignment Task

Documents ≡ sentences
Sentences ≡ words
JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2])

  • Exp. 1 : seeding JAPA with the empirical pattern distribution

Pattern  Prob.    Pattern  Prob.    Pattern  Prob.
1-1      0.406    4-1      0.092    4-2      0.015
2-1      0.172    5-1      0.038    5-2      0.011
3-1      0.123    7-1      0.027    3-2      0.011
. . .
(24 patterns observed on the DEV corpus)

We generated the cartesian product for each pattern where n, m > 1

Run        Prec.   Rec.    F-meas.  AER
official   26.17   74.49   38.73    71.27
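The cartesian-product step turns an n-m group into word-level links. Here is a minimal sketch, assuming a group is represented by two lists of word positions; `expand_pattern` is a hypothetical helper and not part of JAPA.

```python
from itertools import product


def expand_pattern(english_positions, inuktitut_positions):
    """Link every English word of a group to every Inuktitut word of the group."""
    return list(product(english_positions, inuktitut_positions))


# A 3-2 pattern: English words 0..2 grouped with Inuktitut words 0..1.
links = expand_pattern([0, 1, 2], [0, 1])
print(len(links))  # 6
```

The number of links produced for a group grows as n × m, so large groups contribute many links at once.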

SLIDE 6

Word Alignment as a Sentence Alignment Task

Document ≡ sentence
Sentence ≡ word
JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2])

  • Exp. 2 : JAPA in its default mode

Pattern  Prob.
1-1      0.89
1-2      0.089
2-1      0.089
0-1      0.009
1-0      0.009
2-2      0.011

Run                  Prec.   Rec.    F-meas.  AER
Exp. 1 (official)    26.17   74.49   38.73    71.27
Exp. 2 (unofficial)  53.04   37.12   43.68    45.13

SLIDE 7

Word Alignment as a Sentence Alignment Task

Document ≡ sentence
Sentence ≡ word
JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2])

  • Exp. 3 : seeding JAPA with this pattern distribution

Pattern  Prob.    Pattern  Prob.    Pattern  Prob.    Pattern  Prob.
1-1      0.406    4-1      0.092    7-1      0.027    7-2      0.011
2-1      0.172    5-1      0.04     4-2      0.015    3-2      0.011
3-1      0.123    6-1      0.04     5-2      0.011    2-2      0.000

Run                  Prec.   Rec.    F-meas.  AER
Exp. 1 (official)    26.17   74.49   38.73    71.27
Exp. 2 (unofficial)  53.04   37.12   43.68    45.13
Exp. 3 (unofficial)  55.41   60.55   57.86    42.48

SLIDE 8

NUKTI : Principle

Finding a monotonic split of the English sentence

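in regards to |c1 elders |c2 and |c3 youth
pijjutigillugu innatuqait amma makkuttu

Let $I_1^K$ be an Inuktitut sentence of $K$ words and $E_1^N$ an English sentence of $N$ words.

We seek the split $\{c_k \mid k \in [1, K-1],\ c_k \in [1, N-1],\ c_k > c_{k-1}\}$ which maximizes :

$$A = \operatorname*{argmax}_{c_1^K} \bigodot_{k=1}^{K} \Big[ \lambda\, \underbrace{p(I_k \mid E_{c_{k-1}+1}^{c_k})}_{\text{word-sequence score}} + (1-\lambda)\, \underbrace{p(d_k \equiv c_k - c_{k-1})}_{\text{fertility}} \Big]$$

where $\bigodot$ stands for the aggregation over $k$, whose operator symbol did not survive in the slide text.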
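The objective above can be turned into a small scoring routine. This is only a sketch under stated assumptions: the per-word terms are summed (the aggregation operator is not legible on the slide), the boundary conventions are c_0 = 0 and c_K = N, and `word_seq_prob` and `fertility_prob` are hypothetical callables standing in for the two models.

```python
def split_score(cuts, inuktitut, english, word_seq_prob, fertility_prob, lam=0.5):
    """Score a monotonic split of `english` into len(inuktitut) segments.

    `cuts` holds the K-1 interior cut points c_1 < ... < c_{K-1}.
    Assumptions (not stated on the slide): c_0 = 0, c_K = N, and the
    per-word terms are summed.
    """
    bounds = [0, *cuts, len(english)]
    if len(bounds) != len(inuktitut) + 1:
        raise ValueError("expected K-1 cut points")
    if any(a >= b for a, b in zip(bounds, bounds[1:])):
        raise ValueError("cut points must be strictly increasing within [1, N-1]")
    total = 0.0
    for k, word in enumerate(inuktitut):
        segment = english[bounds[k]:bounds[k + 1]]
        total += lam * word_seq_prob(word, segment)
        total += (1.0 - lam) * fertility_prob(len(segment))
    return total


english = "in regards to elders and youth".split()
inuktitut = "pijjutigillugu innatuqait amma makkuttu".split()


def toy_word_seq(word, segment):
    # Toy stand-in for the word-sequence model, for illustration only.
    return 1.0 / len(segment)


def toy_fertility(length):
    # Toy stand-in for the fertility model, for illustration only.
    return 0.0


score = split_score([3, 4, 5], inuktitut, english, toy_word_seq, toy_fertility, lam=1.0)
```

With these toy models the split shown on the slide (cuts after words 3, 4 and 5) scores 1/3 + 1 + 1 + 1.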

SLIDE 9

NUKTI : dirty hands

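Word-Word distribution :

$$p(I_k \mid E_{c_{k-1}+1}^{c_k}) \simeq \begin{cases} \max_{j=c_{k-1}+1}^{c_k} p(I_k \mid E_j) \\ \text{or} \\ \bigodot_{j=c_{k-1}+1}^{c_k} p(I_k \mid E_j) \end{cases}$$

⇐= Word-Substring distribution :

$$p(I \mid E) \simeq \bigodot_{i \in I} \big[ \lambda\, p_{llr}(i \mid E) + (1-\lambda)\, p_{ibm2}(i \mid E) \big]$$

where $\bigodot$ stands for an aggregation operator whose symbol did not survive in the slide text.

Fertility distribution $p(d_k)$ found useless in practice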
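The two approximations can be sketched as follows. The code assumes that the illegible aggregation operators are sums, that the two tables are plain dictionaries keyed by (substring, English word), and that `substrings` is a caller-supplied function; all names here are hypothetical.

```python
def word_word_prob(inuk_word, eng_word, p_llr, p_ibm2, substrings, lam=0.5):
    """Interpolate two substring-level tables for one word pair.

    `p_llr` and `p_ibm2` map (substring, english_word) to a probability.
    `substrings(word)` yields the substrings considered for `word`.
    Assumption: the contributions of the substrings are summed.
    """
    return sum(
        lam * p_llr.get((s, eng_word), 0.0)
        + (1.0 - lam) * p_ibm2.get((s, eng_word), 0.0)
        for s in substrings(inuk_word)
    )


def word_seq_prob(inuk_word, segment, pair_prob, mode="max"):
    """Approximate p(I_k | segment) from word-word scores, by max or by sum."""
    scores = [pair_prob(inuk_word, eng_word) for eng_word in segment]
    if not scores:
        return 0.0
    return max(scores) if mode == "max" else sum(scores)
```

The `mode` switch mirrors the two alternatives listed on the slide; which one the authors retained is not stated there.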

SLIDE 10

NUKTI : Log-likelihood ratio score pllr(i|E)

Martin et al. (2003)

We computed a likelihood ratio score (Dunning, 1993) for all pairs of English tokens (E) and Inuktitut substrings (i) of length ranging from 3 to 10 characters.

  • a maximum of 25 000 associations were kept for each English word (the top-ranked ones) (probably too many)
  • cooccurrence ≡ presence in the same pair of sentences (suboptimal)
  • normalized so that $\forall E,\ \sum_i p_{llr}(i \mid E) = 1$
  • we used a suffix tree structure (1 hour for 100 English words)
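For reference, the log-likelihood ratio statistic of Dunning (1993) for a 2x2 contingency table can be computed as below, together with an enumeration of substrings of 3 to 10 characters. This is a sketch only: the function names are hypothetical, the counts are assumed to be numbers of sentence pairs, and the pruning, normalization and suffix-tree machinery mentioned on the slide are not reproduced.

```python
import math


def llr_score(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: sentence pairs containing both the English token and the substring
    k12: pairs with the token only; k21: pairs with the substring only
    k22: pairs with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # a zero cell contributes nothing
                g2 += o * math.log(o * total / (rows[r] * cols[c]))
    return 2.0 * g2


def substrings(word, min_len=3, max_len=10):
    """All distinct substrings of `word` whose length lies in [min_len, max_len]."""
    return {
        word[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(word) - n + 1)
    }
```

A table whose cells match the independence expectation scores 0, and the score grows as the token and the substring cooccur more often than chance predicts.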

SLIDE 11

NUKTI : IBM model pibm2(i|E)

Brown et al. 1993

we segmented the Inuktitut material by a recursive process and trained an IBM model 2 (we used only the transfer table)

SLIDE 12

NUKTI Greedy Search Strategy

Step1 : Seed NUKTI with a given split

[Figure: grid of Inuktitut words I1 to I4 against English words E1 to E6]

SLIDE 13

NUKTI Greedy Search Strategy

Step1 : Seed NUKTI with a given split

[Figure: grid of Inuktitut words I1 to I4 against English words E1 to E6]

SLIDE 14

NUKTI Greedy Search Strategy

Step1 : Seed NUKTI with a given split

[Figure: grid of Inuktitut words I1 to I4 against English words E1 to E6]

in |c1 regards to |c2 elders |c3 and youth
pijjutigillugu innatuqait amma makkuttu

We tried 2 seed splits : diagonal and JAPA
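A diagonal seed can be sketched as an even spread of cut points. The formula below (cut k placed at floor(k·N/K)) is an assumption, since the slides do not spell out how the diagonal is discretized; `diagonal_split` is a hypothetical name.

```python
def diagonal_split(n_english, k_inuktitut):
    """Spread K-1 cut points evenly along the diagonal of the N x K grid.

    Assumption: cut k is floor(k * N / K). Requires N >= K so that every
    Inuktitut word receives at least one English word.
    """
    if n_english < k_inuktitut:
        raise ValueError("need at least as many English as Inuktitut words")
    return [k * n_english // k_inuktitut for k in range(1, k_inuktitut)]


print(diagonal_split(6, 4))  # [1, 3, 4]
```

For the six-word English example this gives cuts after words 1, 3 and 4, which coincides with the split displayed on this slide.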

SLIDE 15

NUKTI Greedy Search Strategy

Step2 : Perturbation of the seed split

From left to right :

in ≻c1 regards to |c2 elders |c3 and youth
pijjutigillugu innatuqait amma makkuttu

SLIDE 16

NUKTI Greedy Search Strategy

Step2 : Perturbation of the seed split

From left to right :

in regards to |c1 ≻c2 elders |c3 and youth
pijjutigillugu innatuqait amma makkuttu

SLIDE 17

NUKTI Greedy Search Strategy

Step2 : Perturbation of the seed split

From left to right :

in regards to |c1 elders |c2 ≻c3 and youth
pijjutigillugu innatuqait amma makkuttu

SLIDE 18

NUKTI Greedy Search Strategy

Step2 : Perturbation of the seed split

From left to right :

in regards to |c1 elders |c2 and |c3 youth
pijjutigillugu innatuqait amma makkuttu
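The perturbation step can be sketched as a hill-climbing loop over the cut points. The details are assumptions: each cut may move anywhere strictly between its neighbours, cuts are visited from left to right, and sweeps repeat until one full pass brings no improvement. `score` is any callable that rates a list of cut points; `greedy_refine` is a hypothetical name.

```python
def greedy_refine(cuts, n_english, score):
    """Hill-climb over cut points, visiting them from left to right.

    Assumptions: each cut may move anywhere strictly between its neighbours,
    and sweeps repeat until a full pass brings no improvement.
    """
    cuts = list(cuts)
    best = score(cuts)
    improved = True
    while improved:
        improved = False
        for k in range(len(cuts)):
            low = cuts[k - 1] + 1 if k > 0 else 1
            high = cuts[k + 1] - 1 if k + 1 < len(cuts) else n_english - 1
            for position in range(low, high + 1):
                candidate = cuts[:k] + [position] + cuts[k + 1:]
                value = score(candidate)
                if value > best:
                    cuts, best, improved = candidate, value, True
    return cuts, best


def toy_score(candidate):
    # Toy scorer for illustration: prefers cuts close to [3, 4, 5].
    return -sum(abs(a - b) for a, b in zip(candidate, [3, 4, 5]))


refined, value = greedy_refine([1, 3, 4], 6, toy_score)
```

With this toy scorer the seed with cuts [1, 3, 4] is moved in a few sweeps to [3, 4, 5], the split "in regards to | elders | and | youth".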

SLIDE 19

NUKTI : results

Configuration                   Prec.   Rec.    F-m.    AER
seed diagonal                   51.7    53.66   52.66   49.54
  + greedy                      65.4    68.31   66.83   32.10
seed JAPA                       55.4    60.55   57.86   42.48
  + greedy                      65.47   68.36   66.88   31.93
Best submitted : NUKTI (diago)  63.09   65.87   64.45   34.06
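The F-m. column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A minimal sketch, with `f_measure` as an illustrative name:

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


print(round(f_measure(63.09, 65.87), 2))  # 64.45
```

AER is not reproduced here, since it also depends on the distinction between sure and probable reference links.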

SLIDE 20

Conclusion & Future Work

Word alignment as a sentence alignment task : AER ∼ 42

  • a dictionary (transfer parameters) could be used to ease JAPA
  • transliteration for improving cognateness

JAPA + NUKTI : AER ∼ 32

  • no 1-0 cept allowed
  • log-likelihood ratio distributions too noisy

If we were to do it again : http://www.inuktitutcomputing.ca/Uqailaut/

See the next talk ! (Schafer and Drábek, 2005)

SLIDE 21

thank you
