Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora


  1. Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora
     Takehito Utsuro, Graduate School of Informatics, Kyoto University, Japan (utsuro@pine.kuee.kyoto-u.ac.jp)
     July 4-5, 2003, German-Japan WS on NLP

  2. Background: Translation Knowledge Acquisition from Parallel/Comparable Corpora
     - From parallel corpora
       - translation knowledge acquisition: relatively easier
       - resource: less available
     - From comparable corpora
       - translation knowledge acquisition: relatively harder
       - resource: more available

  3. Translation Knowledge Acquisition: Our Approach
     - Source of translation knowledge acquisition: cross-lingually relevant news articles (on WWW news sites)
       - updated every day, enabling efficient acquisition of up-to-date translation knowledge
     - Techniques: integration of CLIR and translation knowledge acquisition from parallel/comparable corpora

  6. Translation Knowledge Acquisition from WWW News Sites: Overview
     [Figure: articles from WWW news sites are stored in English and Japanese news article DBs; retrieval of bilingual article pairs (supported by a bilingual lexicon and an MT system) fills a relevant article pair DB; translation knowledge acquisition over those pairs yields translation knowledge.]

  7. Cross-Language Retrieval of Relevant News Articles
     [Figure: English and Japanese news articles from WWW news sites are stored in DBs and filtered by dates; the English article goes through an MT system to give a Japanese translation; similarity calculation (cosine of frequency vectors) against the Japanese article yields bilingual article pairs (relevant articles).]
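The two retrieval steps named on this slide, filtering by dates and the cosine of frequency vectors, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `(article_id, publication_date, tokens)` tuple layout, and the `max_days` window are hypothetical, the English articles are assumed to be already machine-translated and tokenized, and the default cutoff of 0.4 is borrowed from the evaluation slide purely as an example value.

```python
from collections import Counter
from datetime import date
from math import sqrt


def cosine(freq_a: Counter, freq_b: Counter) -> float:
    """Cosine of two sparse term-frequency vectors."""
    dot = sum(count * freq_b[term] for term, count in freq_a.items())
    norm_a = sqrt(sum(c * c for c in freq_a.values()))
    norm_b = sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_pairs(translated_queries, japanese_articles, max_days=2, threshold=0.4):
    """Pair translated English articles with date-compatible Japanese articles.

    Both inputs are lists of (article_id, publication_date, tokens).
    max_days is a hypothetical date window, not a value from the slides.
    """
    pairs = []
    for en_id, en_date, en_tokens in translated_queries:
        en_vec = Counter(en_tokens)
        for ja_id, ja_date, ja_tokens in japanese_articles:
            # Filtering by dates: skip articles published too far apart.
            if abs((ja_date - en_date).days) > max_days:
                continue
            sim = cosine(en_vec, Counter(ja_tokens))
            if sim >= threshold:
                pairs.append((en_id, ja_id, sim))
    return pairs


# Toy data (romanized tokens, invented for illustration).
queries = [("en1", date(2003, 7, 1), ["shusho", "homon", "beikoku"])]
articles = [
    ("ja1", date(2003, 7, 2), ["shusho", "homon", "chugoku"]),
    ("ja2", date(2003, 8, 1), ["shusho", "homon", "beikoku"]),
]
pairs = retrieve_pairs(queries, articles)
```

In the toy data, `ja2` matches the query exactly but is dropped by the date filter, which is the point of applying the filter before the similarity calculation.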

  8. Related Research Issues: Translation Knowledge Acquisition
     - Acquisition from parallel corpora
       - statistical MT models: e.g., [Brown 90, 93]
       - term correspondence estimation based on contingency tables of cross-language co-occurrence frequencies: e.g., [Gale 91, Kumano 94, Haruno 96, Smadja 96, Kitamura 96, Melamed 00]
     - Acquisition from comparable corpora: contextual similarities of words across languages
       - without the help of existing bilingual lexicons: earlier works [Fung 95]
       - exploiting existing bilingual lexicons as initial seed: later works [Rapp 95, 99, Kaji 96, K. Tanaka 96, Fung 98, T. Tanaka 02]
     - Collecting partially bilingual texts from WWW with Internet search engines: [Nagata 01]

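  10. Estimating Bilingual Term Correspondences from Parallel Sentences
      For an English term x and a Japanese term y, each pair of parallel sentences falls into one of four cases:
      - x ∧ y: the English sentence contains term x and the Japanese sentence contains term y
      - x ∧ ¬y: the English sentence contains term x, the Japanese sentence does not contain term y
      - ¬x ∧ y: the English sentence does not contain term x, the Japanese sentence contains term y
      - ¬x ∧ ¬y: neither sentence contains its term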
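As a rough illustration of the four cases on this slide, the sketch below counts how many parallel sentence pairs fall into each case for one candidate term pair. The function name, the token-list representation, and the toy sentences are assumptions made for the example; the counts are returned as `(a, b, c, d)` in the order x ∧ y, x ∧ ¬y, ¬x ∧ y, ¬x ∧ ¬y.

```python
def contingency_counts(sentence_pairs, term_x, term_y):
    """Return (a, b, c, d) for an English term x and a Japanese term y.

    sentence_pairs is an iterable of (english_tokens, japanese_tokens).
    """
    a = b = c = d = 0
    for en_tokens, ja_tokens in sentence_pairs:
        has_x = term_x in en_tokens
        has_y = term_y in ja_tokens
        if has_x and has_y:
            a += 1  # x and y
        elif has_x:
            b += 1  # x and not y
        elif has_y:
            c += 1  # not x and y
        else:
            d += 1  # neither
    return a, b, c, d


# Toy parallel sentences (romanized, invented for illustration).
toy_pairs = [
    (["prime", "minister"], ["shusho"]),
    (["prime"], ["daijin"]),
    (["tax"], ["shusho"]),
    (["tax"], ["zei"]),
    (["prime"], ["shusho"]),
]
counts = contingency_counts(toy_pairs, "prime", "shusho")
```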

  11. Measures for Estimating Bilingual Term Correspondences from Contingency Table
      Contingency table of co-occurrence frequencies, with N = a + b + c + d:
      - freq(x, y) = a
      - freq(x, ¬y) = b
      - freq(¬x, y) = c
      - freq(¬x, ¬y) = d
      Measures:
      - mutual information (MI): $I(x; y) = \log_2 \frac{aN}{(a+b)(a+c)}$
      - $\phi^2$ statistic: $\phi^2(x, y) = \frac{(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}$
      - Dice coefficient: $\mathrm{Dice}(x, y) = \frac{2a}{2a+b+c}$
      - log-likelihood: $\text{Log-like} = f(a)+f(b)+f(c)+f(d)-f(a+b)-f(a+c)-f(b+d)-f(c+d)+f(a+b+c+d)$, where $f(x) = x \log x$
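The four measures can be computed directly from the cell counts a, b, c, d. The sketch below is a minimal illustration with hypothetical function names. It assumes a > 0 and non-zero marginals for MI and the φ² statistic, uses the natural logarithm in f, treats f(0) as 0, and takes the total-count term f(a+b+c+d) with a positive sign so that the log-likelihood score is zero when x and y are independent.

```python
from math import log, log2


def f(x: float) -> float:
    """f(x) = x log x, with f(0) taken as 0."""
    return x * log(x) if x > 0 else 0.0


def mutual_information(a, b, c, d):
    n = a + b + c + d
    return log2(a * n / ((a + b) * (a + c)))


def phi_squared(a, b, c, d):
    return (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def log_likelihood(a, b, c, d):
    return (
        f(a) + f(b) + f(c) + f(d)
        - f(a + b) - f(a + c) - f(b + d) - f(c + d)
        + f(a + b + c + d)
    )


# Example counts (invented): a=2, b=1, c=1, d=1.
scores = {
    "mi": mutual_information(2, 1, 1, 1),
    "phi2": phi_squared(2, 1, 1, 1),
    "dice": dice(2, 1, 1, 1),
    "loglike": log_likelihood(2, 1, 1, 1),
}
```

All four functions take the same arguments so that candidate term pairs can be ranked by any of the measures interchangeably.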

  13. Term Correspondence Acquisition from Comparable Corpora
      [Figure: for a term in the whole English corpus and a term in the whole Japanese corpus, the contexts of each term are collected into frequency vectors, which are compared for term correspondence estimation.]
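One common way to compare contexts across languages, in the spirit of the seed-lexicon work cited under related research, is to build a context frequency vector for each term, map the English vector into Japanese words through a seed bilingual lexicon, and take the cosine. The sketch below is an assumption-laden illustration rather than the method of any cited paper: the window size, the function names, the seed lexicon, and the toy corpora are all invented.

```python
from collections import Counter
from math import sqrt


def context_vector(corpus, term, window=2):
    """Count words within `window` tokens of every occurrence of `term`."""
    vec = Counter()
    for sentence in corpus:
        for i, token in enumerate(sentence):
            if token != term:
                continue
            lo, hi = max(0, i - window), i + window + 1
            vec.update(t for j, t in enumerate(sentence[lo:hi], start=lo) if j != i)
    return vec


def project(vec, seed_lexicon):
    """Map an English context vector into Japanese words via a seed lexicon."""
    projected = Counter()
    for word, count in vec.items():
        if word in seed_lexicon:
            projected[seed_lexicon[word]] += count
    return projected


def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items())
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy corpora and seed lexicon (romanized, invented for illustration).
en_corpus = [["the", "bank", "raised", "rates"], ["bank", "rates", "fell"]]
ja_corpus = [["ginko", "kinri", "ageta"], ["kabu", "sagatta"]]
seed = {"rates": "kinri", "raised": "ageta"}

en_vec = project(context_vector(en_corpus, "bank"), seed)
sim_ginko = cosine(en_vec, context_vector(ja_corpus, "ginko"))
sim_kabu = cosine(en_vec, context_vector(ja_corpus, "kabu"))
```

Here the context vectors are accumulated over the whole corpus of each language, which is the setting this slide depicts.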

  15. Translation Knowledge Acquisition: Our Approach
      - Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques

  16. Term Correspondence Acquisition from Cross-Lingually Relevant Article Pairs Collected by CLIR Techniques
      [Figure: the whole English corpus and the whole Japanese corpus shown as sets of articles, marked as cross-lingually relevant or cross-lingually non-relevant, with a term and its context.]

  17. Term Correspondence Acquisition from Cross-Lingually Relevant Article Pairs Collected by CLIR Techniques
      [Figure: within the cross-lingually relevant articles of the whole English corpus and the whole Japanese corpus, the context of each term is collected into a frequency vector, and the vectors are used for term correspondence estimation.]

  19. Translation Knowledge Acquisition: Our Approach
      - Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques
      - Techniques for parallel corpora become applicable to translation knowledge acquisition from comparable corpora

  20. Translation Knowledge Acquisition: Our Approach
      - Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques
      - Techniques for parallel corpora become applicable to translation knowledge acquisition from comparable corpora
      - Related work: translation knowledge acquisition from comparable corpora
        - estimating term correspondences based on contextual similarities across languages
        - contextual vectors: averaged over the whole corpus
        - no use of CLIR techniques for restricting relevant documents across languages

  21. Cross-Language Retrieval of Relevant News Articles: Evaluation Issues
      - Availability of cross-lingually relevant articles
        - query articles should be English rather than Japanese
        - cross-lingually relevant articles are available for more than 60% of English query articles
      - Recall/precision of cross-language retrieval of relevant news articles
        - precision: 50% or more when article similarity ≥ 0.4
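The precision figure on this slide is tied to a similarity cutoff. As a small illustration of how such a figure is computed, the sketch below keeps only article pairs whose similarity reaches a threshold and reports the fraction judged relevant. The function name and the `(similarity, is_relevant)` input format are assumptions, and the judged pairs are invented toy data, not results from the presentation.

```python
def precision_at_threshold(scored_pairs, threshold):
    """Precision among pairs with similarity >= threshold.

    scored_pairs is an iterable of (similarity, is_relevant).
    Returns None when no pair reaches the threshold.
    """
    kept = [is_relevant for sim, is_relevant in scored_pairs if sim >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Invented relevance judgments for illustration only.
judged = [(0.55, True), (0.45, False), (0.41, True), (0.30, False), (0.20, False)]
p_at_04 = precision_at_threshold(judged, 0.4)
p_all = precision_at_threshold(judged, 0.0)
```

Raising the threshold keeps fewer pairs, so precision typically rises while recall falls.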

  22. Term Correspondence Acquisition: Evaluation Issues
      - Source of translation knowledge acquisition: cross-lingually relevant news articles on WWW news sites
      - Effect of CLIR in reducing bilingual term pair candidates
      - Accuracy of bilingual term correspondence estimation
