Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora


  1. Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora. Takehito Utsuro, Graduate School of Informatics, Kyoto University, Japan (utsuro@pine.kuee.kyoto-u.ac.jp). July 4-5, 2003, German-Japan WS on NLP.

  2. Background: Translation Knowledge Acquisition from Parallel/Comparable Corpora
     From Parallel Corpora
     - translation knowledge acquisition: relatively easier
     - resource: less available
     From Comparable Corpora
     - translation knowledge acquisition: relatively harder
     - resource: more available

  3. Translation Knowledge Acquisition: Our Approach
     Source of Translation Knowledge Acquisition: Cross-Lingually Relevant News Articles (on WWW news sites)
     - Updated every day → enabling efficient acquisition of up-to-date translation knowledge
     Techniques: Integration of CLIR and Translation Knowledge Acquisition from Parallel/Comparable Corpora

  6. Translation Knowledge Acquisition from WWW News Sites: Overview
     [Figure: English and Japanese news articles from WWW news sites are stored in article DBs. Retrieval of bilingual article pairs (with a bilingual lexicon and an MT system) yields relevant English/Japanese article pairs. Translation knowledge acquisition from those pairs fills a translation knowledge DB.]

  7. Cross-Language Retrieval of Relevant News Articles
     [Figure: English and Japanese news articles from WWW news sites are stored in article DBs and filtered by dates. Each English article is passed through an MT system to obtain a Japanese translation. Similarity calculation (cosine of frequency vectors) between the translation and the Japanese article yields bilingual article pairs (relevant articles).]
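
The retrieval step on this slide can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the `translate` callable stands in for the MT system, and the function names, the `max_days` date window, and the use of 0.4 as a cutoff are all assumptions made for the example.

```python
import math
from collections import Counter
from datetime import date


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of two term-frequency vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_pairs(english, japanese, translate, max_days=2, min_sim=0.4):
    """Pair English query articles with Japanese articles.

    english, japanese: lists of (article_id, date, tokens).
    translate: hypothetical stand-in for the MT system; maps a list of
               English tokens to a list of Japanese tokens.
    max_days, min_sim: assumed values, chosen only for illustration.
    """
    pairs = []
    for e_id, e_date, e_tokens in english:
        e_vec = Counter(translate(e_tokens))
        for j_id, j_date, j_tokens in japanese:
            # Filtering by dates: skip articles published too far apart.
            if abs((j_date - e_date).days) > max_days:
                continue
            sim = cosine(e_vec, Counter(j_tokens))
            if sim >= min_sim:
                pairs.append((e_id, j_id, sim))
    return pairs
```

Filtering by date before computing similarities keeps the number of cosine computations small, since only articles published close together are compared.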

  8. Related Research Issues: Translation Knowledge Acquisition
     Acquisition from Parallel Corpora
     - statistical MT models: e.g., [Brown 90, 93]
     - term correspondences estimation based on contingency tables of cross-language co-occurrence frequencies: e.g., [Gale 91, Kumano 94, Haruno 96, Smadja 96, Kitamura 96, Melamed 00]
     Acquisition from Comparable Corpora: contextual similarities of words across languages
     - without the help of existing bilingual lexicons: earlier works [Fung 95]
     - exploiting existing bilingual lexicons as initial seed: later works [Rapp 95, 99, Kaji 96, K.Tanaka 96, Fung 98, T.Tanaka 02]
     Collecting Partially Bilingual Texts from WWW with Internet Search Engines: [Nagata 01]

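  10. Estimating Bilingual Term Correspondences from Parallel Sentences
     Each pair of parallel sentences (English, Japanese) falls into one of four cases for an English term x and a Japanese term y:
     - x ∧ y: the English sentence contains term x and the Japanese sentence contains term y
     - x ∧ ¬y: the English sentence contains term x, the Japanese sentence does not contain term y
     - ¬x ∧ y: the English sentence does not contain term x, the Japanese sentence contains term y
     - ¬x ∧ ¬y: neither sentence contains its term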
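
The four cases on this slide can be counted with a short routine like the one below. It is a sketch under simple assumptions: each parallel sentence pair is represented as a pair of term sets, and the function name `contingency` is made up for the example.

```python
def contingency(pairs, x, y):
    """Count co-occurrence cases of English term x and Japanese term y.

    pairs: iterable of (english_terms, japanese_terms), each a set of terms
           extracted from one side of a parallel sentence pair.
    Returns (a, b, c, d) = freq(x, y), freq(x, not y),
                           freq(not x, y), freq(not x, not y).
    """
    a = b = c = d = 0
    for en_terms, ja_terms in pairs:
        has_x = x in en_terms
        has_y = y in ja_terms
        if has_x and has_y:
            a += 1
        elif has_x:
            b += 1
        elif has_y:
            c += 1
        else:
            d += 1
    return a, b, c, d
```

The four counts always sum to the number of sentence pairs, which gives a quick sanity check on the input.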

  11. Measures for Estimating Bilingual Term Correspondences from Contingency Table
     Contingency table:
     - freq(x, y) = a
     - freq(x, ¬y) = b
     - freq(¬x, y) = c
     - freq(¬x, ¬y) = d
     with $N = a + b + c + d$.
     - mutual information (MI): $I(x; y) = \log_2 \frac{aN}{(a+b)(a+c)}$
     - φ² statistic: $\phi^2(x, y) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$
     - Dice coefficient: $\mathrm{Dice}(x, y) = \frac{2a}{2a + b + c}$
     - log-likelihood: $\text{Log-like} = f(a) + f(b) + f(c) + f(d) - f(a+b) - f(a+c) - f(b+d) - f(c+d) + f(a+b+c+d)$, where $f(x) = x \log x$
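
As a rough illustration, the four measures can be computed from the counts a, b, c, d as below. The function names are chosen for this example only. Two points are assumptions on my part: f(0) is taken to be 0, and the log-likelihood uses the usual form in which the f(a+b+c+d) term is added. The MI function needs a > 0, and the MI and φ² functions need non-zero marginals.

```python
import math


def _f(x):
    """f(x) = x log x, with f(0) taken as 0 (an assumption for empty cells)."""
    return x * math.log(x) if x > 0 else 0.0


def mutual_information(a, b, c, d):
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


def phi_squared(a, b, c, d):
    return (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def log_likelihood(a, b, c, d):
    # Cell terms, minus row and column marginal terms, plus the grand total term.
    return (_f(a) + _f(b) + _f(c) + _f(d)
            - _f(a + b) - _f(a + c) - _f(b + d) - _f(c + d)
            + _f(a + b + c + d))
```

With independent terms (for example a = b = c = d) MI, φ² and the log-likelihood all come out at zero, while a perfectly associated pair (b = c = 0) gives φ² = 1 and Dice = 1.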

  13. Term Correspondence Acquisition from Comparable Corpora
     [Figure: for a term in the whole English corpus and a term in the whole Japanese corpus, a context frequency vector is built on each side; the two vectors feed term correspondence estimation.]
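
One common way to compare context frequency vectors across languages is to project the English vector through a seed bilingual lexicon and take a cosine, in the spirit of the seed-lexicon approaches cited in the related work. The sketch below assumes that setup. The window size, the one-to-one seed lexicon, and all function names are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter


def context_vector(sentences, term, window=2):
    """Count words appearing within `window` tokens of `term` over a corpus."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo = max(0, i - window)
                for ctx in tokens[lo:i] + tokens[i + 1:i + 1 + window]:
                    vec[ctx] += 1
    return vec


def project(vec, seed_lexicon):
    """Map English context words to Japanese via a seed lexicon (assumed 1:1)."""
    out = Counter()
    for word, count in vec.items():
        if word in seed_lexicon:
            out[seed_lexicon[word]] += count
    return out


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_candidates(en_sents, ja_sents, en_term, ja_candidates, seed, window=2):
    """Rank Japanese candidate terms by contextual similarity to en_term."""
    en_vec = project(context_vector(en_sents, en_term, window), seed)
    scored = [(cand, cosine(en_vec, context_vector(ja_sents, cand, window)))
              for cand in ja_candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

Here the context vectors are accumulated over every sentence in each corpus, which matches the "whole corpus" setting shown on this slide.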

  15. Translation Knowledge Acquisition: Our Approach
     - Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques

  16. Term Correspondence Acquisition from Cross-Lingually Relevant Article Pairs Collected by CLIR Techniques
     [Figure: whole English corpus and whole Japanese corpus, each containing a term and its context within articles; article pairs across the two corpora are labeled cross-lingually relevant or cross-lingually non-relevant.]

  17. Term Correspondence Acquisition from Cross-Lingually Relevant Article Pairs Collected by CLIR Techniques
     [Figure: whole English corpus and whole Japanese corpus, each containing a term and its context within articles; for cross-lingually relevant article pairs, a context frequency vector is built on each side and the two vectors feed term correspondence estimation.]

  19. Translation Knowledge Acquisition: Our Approach
     - Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques
     - Techniques for parallel corpora become applicable to translation knowledge acquisition from comparable corpora

  20. Translation Knowledge Acquisition: Our Approach
     - Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques
     - Techniques for parallel corpora become applicable to translation knowledge acquisition from comparable corpora
     Related Work: Translation Knowledge Acquisition from Comparable Corpora
     - Estimating term correspondences based on contextual similarities across languages
     - Contextual vectors: averaged over the whole corpus
     - No use of CLIR techniques for restricting relevant documents across languages

  21. Cross-Language Retrieval of Relevant News Articles: Evaluation Issues
     Availability of Cross-Lingually Relevant Articles
     - Query articles should be English rather than Japanese
     - Cross-lingually relevant articles are available for more than 60% of English query articles
     Recall/Precision of Cross-Language Retrieval of Relevant News Articles
     - precision: 50% or more when article similarity ≥ 0.4

  22. Term Correspondence Acquisition: Evaluation Issues
     - Source of Translation Knowledge Acquisition: Cross-Lingually Relevant News Articles on WWW News Sites
     - Effect of CLIR in Reducing Bilingual Term Pair Candidates
     - Accuracy of Bilingual Term Correspondence Estimation
