NUKTI: English-Inuktitut Word Alignment System Description

Philippe Langlais, Fabrizio Gotti and Guihong Cao. RALI, Département d’informatique et de recherche opérationnelle, Université de Montréal. WPT, June 2005.


  1. NUKTI: English-Inuktitut Word Alignment System Description
     Philippe Langlais, Fabrizio Gotti and Guihong Cao
     RALI, Département d’informatique et de recherche opérationnelle, Université de Montréal
     WPT, June 2005
     Felipe & Fabrizio & Guihong @RALI, UdeM

  2. Context
     We found the task intriguing enough that we spent 2 weeks testing 2 approaches.
     No (textual) resources other than the ones provided
     ∼4,000 lines of C++ code (bug inside)
     We corrected a few bugs after the deadline

  3. Word Alignment as a Sentence Alignment Task
     Observation on the DEV corpus: monotonicity

       in regards to    pijjutigillugu    (3-1)
       elders           innatuqait        (1-1)
       and              amma              (1-1)
       youth            makkuttu          (1-1)

     monotonicity ≡ a perfect setting for sentence alignment

  4. Our in-house Sentence Alignment Program: JAPA
     Developed for the Arcade evaluation campaign (Langlais et al., 1998):
       step 1: (roughly) word-align in order to delimit the search space
       step 2: sentence-align by a mix of (Gale & Church, 1993) and (Simard et al., 1992)
     Available at rali.iro.umontreal.ca/Japa

  5. Word Alignment as a Sentence Alignment Task
     Documents ≡ sentences, sentences ≡ words
     JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2])
     Exp. 1: seeding JAPA with the empirical pattern distribution

       1-1  0.406    4-1  0.092    4-2  0.015
       2-1  0.172    5-1  0.038    5-2  0.011    ...
       3-1  0.123    7-1  0.027    3-2  0.011

     (24 patterns observed on the DEV corpus)
     We generated the Cartesian product for each pattern where n, m > 1

       Prec.    Rec.     F-meas.  AER
       26.17    74.49    38.73    71.27    official run
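The Cartesian-product step can be illustrated with a minimal sketch. The function name `expand_patterns` and the input layout (lists of 1-based word positions) are assumptions made for this example; they are not taken from the JAPA code.

```python
from itertools import product


def expand_patterns(segments):
    """Turn n-m segment pairs into word-level links.

    `segments` is a list of (english_positions, inuktitut_positions) pairs,
    one pair per aligned n-m pattern (hypothetical input layout).
    """
    links = []
    for english_positions, inuktitut_positions in segments:
        # Every English word of the segment is linked to every Inuktitut
        # word of the segment: a 3-1 pattern gives 3 links, a 4-2 gives 8.
        links.extend(product(english_positions, inuktitut_positions))
    return links


# DEV example: "in regards to elders and youth"
# aligned with "pijjutigillugu innatuqait amma makkuttu".
segments = [([1, 2, 3], [1]), ([4], [2]), ([5], [3]), ([6], [4])]
links = expand_patterns(segments)
```

On the example, the 3-1 pattern contributes three links and each 1-1 pattern contributes one.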

  6. Word Alignment as a Sentence Alignment Task
     Document ≡ sentence, sentence ≡ word
     JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2])
     Exp. 2: JAPA in its default mode

       1-1  0.89     1-2  0.089    2-1  0.089
       0-1  0.009    1-0  0.009    2-2  0.011

       Prec.    Rec.     F-meas.  AER
       26.17    74.49    38.73    71.27
       53.04    37.12    43.68    45.13    unofficial run

  7. Word Alignment as a Sentence Alignment Task
     Document ≡ sentence, sentence ≡ word
     JAPA handles n-m patterns of arbitrary size (default n, m ∈ [0, 2])
     Exp. 3: seeding JAPA with this pattern distribution

       1-1  0.406    4-1  0.092    7-1  0.027    7-2  0.011
       2-1  0.172    5-1  0.04     4-2  0.015    3-2  0.011
       3-1  0.123    6-1  0.04     5-2  0.011    2-2  0.000

       Prec.    Rec.     F-meas.  AER
       26.17    74.49    38.73    71.27
       53.04    37.12    43.68    45.13
       55.41    60.55    57.86    42.48    unofficial run
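As a quick sanity check on the three rows above, the F-measure column can be recomputed from the precision and recall columns, assuming it is the usual harmonic mean of the two. The helper name `f_measure` is chosen for this sketch; the AER column is not recomputed here because it needs the sure and probable reference links, which the slides do not give.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for the three JAPA experiments.
runs = {
    "Exp. 1": (26.17, 74.49, 38.73),
    "Exp. 2": (53.04, 37.12, 43.68),
    "Exp. 3": (55.41, 60.55, 57.86),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in runs.items()}
```

The recomputed values agree with the reported ones up to rounding of the inputs.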

  8. NUKTI: Principle
     Finding a monotonic split of the English sentence

       in regards to |c_1 elders |c_2 and |c_3 youth
       pijjutigillugu innatuqait amma makkuttu

     Let $I_1^K$ be an Inuktitut sentence of K words and let $E_1^N$ be an English sentence of N words.
     We seek the split $\{c_k \mid k \in [1, K-1],\ c_k \in [1, N-1],\ c_k > c_{k-1}\}$ which maximizes:

       $$A = \operatorname*{argmax}_{c_1^K} \sum_{k=1}^{K} \Bigl[ \lambda\, \underbrace{p(I_k \mid E_{c_{k-1}+1}^{c_k})}_{\text{word-sequence score}} + (1-\lambda)\, \underbrace{p(d_k \equiv c_k - c_{k-1})}_{\text{fertility}} \Bigr]$$
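A minimal sketch of how one candidate split could be scored under this objective. Several things here are assumptions of the sketch rather than facts from the slides: the per-word scores are summed over k, the conventions c_0 = 0 and c_K = N close the first and last segments, and the two callables `word_seq_prob` and `fertility_prob` are hypothetical stand-ins for the two models.

```python
def split_score(boundaries, n_english, word_seq_prob, fertility_prob, lam=0.5):
    """Score a monotonic split of an English sentence of `n_english` words.

    boundaries: [c_1, ..., c_{K-1}], strictly increasing, each in [1, N-1].
    word_seq_prob(k, start, end): hypothetical p(I_k | E_start..E_end),
        with 1-based inclusive English positions.
    fertility_prob(d): hypothetical p(d_k = d).
    """
    cuts = [0] + list(boundaries) + [n_english]  # assumed c_0 = 0, c_K = N
    if any(right <= left for left, right in zip(cuts, cuts[1:])):
        raise ValueError("boundaries must be strictly increasing in [1, N-1]")
    total = 0.0
    for k in range(1, len(cuts)):
        start, end = cuts[k - 1] + 1, cuts[k]
        d_k = end - start + 1  # equals c_k - c_{k-1}
        total += lam * word_seq_prob(k, start, end) + (1 - lam) * fertility_prob(d_k)
    return total


# Toy call: 6 English words, 4 Inuktitut words, constant dummy models.
toy = split_score([3, 4, 5], 6, lambda k, s, e: 1.0, lambda d: 0.0, lam=0.5)
```

Setting `lam=1.0` switches the fertility term off.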

  9. NUKTI: dirty hands
     Word-Word distribution:

       $$p(I_k \mid E_{c_{k-1}+1}^{c_k}) \simeq \begin{cases} \max_{j = c_{k-1}+1}^{c_k} p(I_k \mid E_j) \\ \text{or} \\ \sum_{j = c_{k-1}+1}^{c_k} p(I_k \mid E_j) \quad \Longleftarrow \end{cases}$$

     Word-Substring distribution:

       $$p(I \mid E) \simeq \sum_{i \in I} \lambda\, p_{llr}(i \mid E) + (1-\lambda)\, p_{ibm2}(i \mid E)$$

     Fertility distribution $p(d_k)$ found useless in practice
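The two distributions can be combined as in the following sketch. It rests on assumptions that go beyond the slide: both tables are plain dictionaries keyed by (substring, English word), the same substring inventory (3 to 10 characters) is used for both models, and the contributions of the substrings are summed. All function names are hypothetical.

```python
def substrings(word, min_len=3, max_len=10):
    """All distinct character substrings of `word` with length in [min_len, max_len]."""
    return {
        word[start:start + length]
        for length in range(min_len, max_len + 1)
        for start in range(len(word) - length + 1)
    }


def word_prob(inuktitut_word, english_word, p_llr, p_ibm2, lam=0.5):
    """Interpolated word-substring score p(I | E); unseen pairs count as 0."""
    return sum(
        lam * p_llr.get((sub, english_word), 0.0)
        + (1 - lam) * p_ibm2.get((sub, english_word), 0.0)
        for sub in substrings(inuktitut_word)
    )


def segment_prob(inuktitut_word, english_segment, p_llr, p_ibm2, lam=0.5, combine=sum):
    """Word-word score over an English segment; `combine` is `sum` or `max`."""
    return combine(
        word_prob(inuktitut_word, english_word, p_llr, p_ibm2, lam)
        for english_word in english_segment
    )


# Toy tables with made-up probabilities, for illustration only.
p_llr = {("amm", "and"): 0.4, ("amma", "and"): 0.2, ("amm", "youth"): 0.1}
p_ibm2 = {("amma", "and"): 0.6}
by_sum = segment_prob("amma", ["and", "youth"], p_llr, p_ibm2)
by_max = segment_prob("amma", ["and", "youth"], p_llr, p_ibm2, combine=max)
```

Passing `combine=max` gives the first variant of the word-word distribution, and the default `combine=sum` gives the second.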

  10. NUKTI: Log-likelihood ratio score $p_{llr}(i \mid E)$ (Martin et al., 2003)
      We computed a likelihood ratio score (Dunning, 1993) for all pairs of English tokens (E) and Inuktitut substrings (i) of length ranging from 3 to 10 characters.
        a maximum of 25,000 associations were kept for each English word (the top-ranked ones) (probably too many)
        cooccurrence ≡ presence in the same pair of sentences (suboptimal)
        normalized so that $\forall E,\ \sum_i p_{llr}(i \mid E) = 1$
        we used a suffix tree structure (1 hour for 100 English words)
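For readers unfamiliar with the likelihood ratio score, here is a minimal sketch of its usual G² form on a 2x2 table of sentence-pair counts, followed by the pruning and normalization step. The function names and the dictionary layout are assumptions of this sketch, and it ignores the suffix tree used to collect the counts.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of sentence-pair counts.

    k11: pairs containing both E and i     k12: pairs with E but not i
    k21: pairs with i but not E            k22: pairs with neither
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            count = observed[r][c]
            if count > 0:  # empty cells contribute nothing
                expected = row_sums[r] * col_sums[c] / total
                g2 += count * math.log(count / expected)
    return 2.0 * g2


def normalize_top(scores, max_kept=25000):
    """Keep the top-ranked substrings of one English word and rescale them to sum to 1."""
    kept = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:max_kept]
    mass = sum(score for _, score in kept)
    return {sub: score / mass for sub, score in kept} if mass > 0 else {}


independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(10, 0, 0, 10)
p_for_one_word = normalize_top({"amm": 3.0, "mma": 1.0, "amma": 0.5}, max_kept=2)
```

A table whose cells match their expected counts scores zero, and the score grows as E and i cooccur more often than chance would predict.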

  11. NUKTI: IBM model $p_{ibm2}(i \mid E)$ (Brown et al., 1993)
      We segmented the Inuktitut material by a recursive process and trained an IBM model 2 (we used only the transfer table).

  14. NUKTI Greedy Search Strategy
      Step 1: Seed NUKTI with a given split

        [Figure: alignment grid, Inuktitut words I1 to I4 against English words E1 to E6]

        in |c_1 regards to |c_2 elders |c_3 and youth
        pijjutigillugu innatuqait amma makkuttu

      We tried 2 seed splits: diagonal and JAPA
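The slides do not define the diagonal seed precisely. One plausible reading, sketched below as an assumption, places the k-th boundary proportionally along the English sentence and keeps at least one English word per Inuktitut word. The name `diagonal_seed` and the floor-based rounding are choices made for this example.

```python
def diagonal_seed(n_english, k_inuktitut):
    """Evenly spaced boundaries c_1 < ... < c_{K-1} within [1, N-1].

    Assumes 1 <= K <= N so that every Inuktitut word gets at least
    one English word.
    """
    if not 1 <= k_inuktitut <= n_english:
        raise ValueError("this sketch needs 1 <= K <= N")
    boundaries = []
    previous = 0
    for k in range(1, k_inuktitut):
        ideal = (k * n_english) // k_inuktitut  # position on the diagonal
        # Leave room for the Inuktitut words that still need a segment.
        latest = n_english - (k_inuktitut - k)
        cut = min(max(ideal, previous + 1), latest)
        boundaries.append(cut)
        previous = cut
    return boundaries


# 6 English words against 4 Inuktitut words.
seed = diagonal_seed(6, 4)
```

For the running example this gives boundaries after words 1, 3 and 4, that is "in | regards to | elders | and youth".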

  15. NUKTI Greedy Search Strategy
      Step 2: Perturbation of the seed split
      From left to right:

        in ≻c_1 regards to |c_2 elders |c_3 and youth
        in regards to |c_1 ≻c_2 elders |c_3 and youth
        in regards to |c_1 elders |c_2 ≻c_3 and youth
        in regards to |c_1 elders |c_2 and |c_3 youth

      each aligned with: pijjutigillugu innatuqait amma makkuttu
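A greedy refinement of this kind can be sketched as a simple hill climb. This is a simplified variant written for illustration, not the NUKTI implementation: each boundary is visited from left to right and moved only between its two neighbours, a move is kept when it improves the score, and the sweep is repeated until nothing improves. The function name, the `score` callable, and `max_passes` are assumptions.

```python
def greedy_refine(boundaries, n_english, score, max_passes=10):
    """Hill-climb over boundary positions, sweeping left to right.

    score(boundaries) -> float is any split-scoring function (hypothetical).
    Returns the best split found and its score.
    """
    best = list(boundaries)
    best_score = score(best)
    for _ in range(max_passes):
        improved = False
        for idx in range(len(best)):
            # A boundary may move only between its neighbours, so the
            # split stays strictly increasing and within [1, N-1].
            low = best[idx - 1] + 1 if idx > 0 else 1
            high = best[idx + 1] - 1 if idx + 1 < len(best) else n_english - 1
            for position in range(low, high + 1):
                if position == best[idx]:
                    continue
                candidate = best[:idx] + [position] + best[idx + 1:]
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break
    return best, best_score


# Toy score that prefers the split "in regards to | elders | and | youth".
target = [3, 4, 5]
toy_score = lambda split: -sum(abs(a - b) for a, b in zip(split, target))
refined, refined_score = greedy_refine([1, 3, 4], 6, toy_score)
```

With the toy score, the seed [1, 3, 4] is moved to [3, 4, 5] over a few sweeps. A real run would plug in a model-based score instead.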

  19. NUKTI: results

        Configuration                     Prec.    Rec.     F-m.     AER
        seed diagonal                     51.7     53.66    52.66    49.54
          + greedy                        65.4     68.31    66.83    32.10
        seed JAPA                         55.4     60.55    57.86    42.48
          + greedy                        65.47    68.36    66.88    31.93
        Best submitted: NUKTI (diago)     63.09    65.87    64.45    34.06

  20. Conclusion & Future Work
      Word alignment as a sentence alignment task: AER ∼ 42
        a dictionary (transfer parameters) could be used to ease JAPA
        transliteration for improving cognateness
      JAPA + NUKTI: AER ∼ 32
        no 1-0 cept allowed
        log-likelihood ratio distributions too noisy
      If we were to do it again: http://www.inuktitutcomputing.ca/Uqailaut/
      See the next talk! (Schafer and Drábek, 2005)

  21. Thank you
