3GTM: A Third-Generation Translation Memory


  1. 3GTM: A Third-Generation Translation Memory
     Fabrizio Gotti†, Philippe Langlais†, Elliott Macklovitch†, Didier Bourigault⋆, Benoit Robichaud‡ and Claude Coulombe‡
     † RALI, Département d'informatique et de recherche opérationnelle, Université de Montréal
     ‡ Lingua Technologies Inc., Montréal
     ⋆ ERSS-CNRS, Toulouse
     CLiNE, August 26th 2005

  2. Outline
     1 Overview of the 3GTM project
     2 Experimental Setting
     3 Experiments: Sentence Coverage, Random Substring Coverage, Chunk-Based Coverage, Tree-Phrase Coverage
     4 Discussion

  3. Overview of the 3GTM project
     Translation Memory: a computer-assisted tool that eases the recycling of past translations.
     - 1st-generation TM: never translates again a sentence that has already been translated. Full-sentence repetition is a rather marginal phenomenon.
     - 2nd-generation TM: two source sentences may be considered identical if they differ only slightly (named entities, edit distance, etc.). This is fuzzy matching.
     - 3rd-generation TM (3GTM): recycles at a sub-sentential level.
     A project currently funded by PRECARN (Lingua Technologies Inc., RALI, Transetix Inc.).

  4. 3GTM in a Screenshot (image not reproduced)

  6. Experimental Setting
     - English-French language pair: querying the French side, proposing English material.
     - TM populated with Canadian Hansard texts.
     - Coverage statistics computed over a test corpus:
       - they help gauge the number of useful units that can be queried and found;
       - they are the easiest thing to implement at an early stage of a project;
       - ultimately, we target human evaluation runs (or simulations).

  7. Training Material
     Number of sentences, tokens and types in the training corpus
     Language           English       French
     Nb. sentences      1 753 443     1 753 443
     Nb. tokens         31 637 775    34 150 039
     Nb. types          85 810        106 987
     Avg. word/sent.    17.5          19.3

  8. Test Material
     - 1000 sentences (Hansard corpus), chronologically distinct from the training material.
     - French = query or source language.
     - English = output or target language.

  9. Tools used
     - JAPA: an in-house sentence aligner, http://rali.iro.umontreal.ca/Japa/
     - LUCENE: a freely available full-featured text search engine, http://lucene.apache.org
     - SIMAC: an in-house implementation of a word aligner (Simard and Langlais, 2003)
     - GIZA++: a tool to train translation models (Och and Ney, 2000)
     - GRAMMATICUM: a constituent-based parser (Coulombe, 1991)
     - SYNTEX: a dependency-based parser (Bourigault and Fabre, 2000)

  11. Full Sentence Coverage Using Verbatim Match
      Nb. of sentences                     1000
      Nb. of sent. found verbatim          148
      Avg. size of sent. in test corpus    19.2
      Avg. size of sent. found verbatim    11.1
      - 14.8% of the test sentences are found verbatim, because of Hansard idioms such as "I don't know" or "Mr. Speaker: Order, please."
      - Within a TM ≡ TSRALI.com (6.6 M pairs of phrases), we found only 11 out of 1000 sentences of the EuroParl corpus.

  13. Random Substring Coverage: Protocol
      1 Query the TM with any sequence of the source (French) material (length ≥ 2). A query found at least once is a valid one.
      2 Compute a source (French) optimal coverage: maximize the source coverage while minimizing the number of queries.
      3 Consider the associated target (English) material, by following the word alignment.
      4 Compute a target (English) optimal coverage (details to follow).
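Steps 1 and 2 of this protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 3GTM implementation: the slides do not say how the optimal coverage is computed, so the sketch assumes non-overlapping units and uses a simple dynamic program that first maximizes the number of covered tokens and then minimizes the number of units. The names `candidate_spans`, `optimal_coverage`, `TOY_TM` and `in_toy_tm` are hypothetical, and the toy memory stands in for the LUCENE index.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)

def candidate_spans(n: int, min_len: int = 2) -> List[Span]:
    """All contiguous spans of at least min_len tokens in an n-token sentence."""
    return [(i, j) for i in range(n) for j in range(i + min_len, n + 1)]

def optimal_coverage(tokens: List[str],
                     in_tm: Callable[[List[str]], bool]) -> Tuple[int, List[Span]]:
    """Cover as many tokens as possible with as few valid spans as possible.

    Assumption (not from the slides): coverage units do not overlap.
    """
    n = len(tokens)
    # Step 1: a span is valid if the TM contains it at least once.
    valid = [s for s in candidate_spans(n) if in_tm(tokens[s[0]:s[1]])]
    # best[k] = (covered tokens, -units, chosen spans) for the prefix tokens[:k]
    best = [(0, 0, [])]
    for k in range(1, n + 1):
        cand = best[k - 1]  # option: leave token k-1 uncovered
        for (i, j) in valid:
            if j == k:
                covered, neg_units, spans = best[i]
                option = (covered + (j - i), neg_units - 1, spans + [(i, j)])
                if option[:2] > cand[:2]:
                    cand = option
        best.append(cand)
    covered, _, spans = best[n]
    return covered, spans

# Toy translation memory (source side only), invented for illustration.
TOY_TM = [
    "charlie et la chocolaterie".split(),
    "il travaille dans un bureau".split(),
]

def in_toy_tm(query: List[str]) -> bool:
    """True if the query occurs contiguously in some TM sentence."""
    k = len(query)
    return any(sent[i:i + k] == query
               for sent in TOY_TM
               for i in range(len(sent) - k + 1))

sentence = "il travaille dans la chocolaterie".split()
covered, spans = optimal_coverage(sentence, in_toy_tm)
print(covered, spans)
```

On this toy input the whole sentence is covered by two units, "il travaille dans" and "la chocolaterie". Comparing `(covered, -units)` tuples is one convenient way to encode "maximize coverage, then minimize queries"; other tie-breaking rules are possible.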

  14. Random Substring Coverage
      S: Il travaille dans la chocolaterie
      T: He works in a chocolate factory
      q: la chocolaterie
      Match (subscripts give the aligned target word positions):
      S: Charlie_1 et_2 [la_3 chocolaterie_4,5]
      T: Charlie and [the chocolate factory]

  16. Random Substring Coverage
      S: Il travaille dans la chocolaterie
      T: He works in a chocolate factory
      q: la chocolaterie
      Match (subscripts give the aligned target word positions):
      S: Charlie_1 et_2 [la_3 chocolaterie_4,5]
      T: Charlie and [the chocolate factory]
      m: the chocolate factory
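The extraction of the target material m from a match can be sketched as follows. This is a minimal illustration, assuming the word alignment is available as a list giving, for each source position, the aligned target positions (0-based here, whereas the slide counts from 1), and assuming the target material is taken as the contiguous stretch between the leftmost and rightmost aligned target words. The function name `project_span` and the data layout are hypothetical, not taken from the 3GTM system.

```python
from typing import List, Tuple

def project_span(alignment: List[List[int]],
                 src_span: Tuple[int, int],
                 tgt_tokens: List[str]) -> List[str]:
    """Return the target tokens aligned with source tokens in [start, end).

    Assumption: the projected material is the contiguous target stretch
    from the leftmost to the rightmost aligned position.
    """
    start, end = src_span
    positions = sorted({t for s in range(start, end) for t in alignment[s]})
    if not positions:
        return []  # nothing aligned to the queried source words
    return tgt_tokens[positions[0]:positions[-1] + 1]

# The TM entry shown on the slide, with 0-based alignment links.
src = "Charlie et la chocolaterie".split()
tgt = "Charlie and the chocolate factory".split()
alignment = [[0], [1], [2], [3, 4]]  # chocolaterie -> chocolate factory

# The query "la chocolaterie" matches source positions 2..3.
m = project_span(alignment, (2, 4), tgt)
print(" ".join(m))  # the chocolate factory
```

Taking the minimal enclosing stretch keeps the proposed material contiguous even when the alignment has gaps, at the cost of occasionally including unaligned target words.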

  18. Random Substring Coverage: Coverage statistics
      Metric                    Source    Target
      Optimal coverage          98.8%     55.8%
      Cov. unit size (words)    4.09      2.98
      Number of cov. units      4.65      3.23
      Avg. nb. of LUCENE queries per sentence: 226

  19. Random Substring Coverage: The unsustainable Truth
      S: m. mcinnes : je m' excuse
      T: mr . mcinnes : i apologize
      Number of TM matches per substring (row = first word, column = last word):
                 m.    mcinnes   :     je      m'     excuse
      excuse     –     –         –     –       –      –
      m'         –     –         –     –       –      446
      je         –     –         –     –       3719   347
      :          –     –         –     12330   185    107
      mcinnes    –     –         43    4       0      0
      m.         –     69        43    4       0      0
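A table of this kind can be built by querying every substring of at least two words, as in the sketch below. It is an illustration only: the toy memory and its counts are invented and unrelated to the Hansard figures above, the count used here is the number of TM sentences containing the substring, and the names `contains` and `span_counts` are hypothetical. A sentence of n tokens yields n(n-1)/2 such queries, 15 for the six-token example, which helps explain why the number of LUCENE queries per sentence grows quickly with sentence length.

```python
from typing import Dict, List, Tuple

def contains(sentence: List[str], query: List[str]) -> bool:
    """True if query occurs as a contiguous token sequence in sentence."""
    k = len(query)
    return any(sentence[i:i + k] == query
               for i in range(len(sentence) - k + 1))

def span_counts(tokens: List[str], tm: List[List[str]],
                min_len: int = 2) -> Dict[Tuple[int, int], int]:
    """Map each span [i, j) of at least min_len tokens to its TM match count."""
    table = {}
    n = len(tokens)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            query = tokens[i:j]
            table[(i, j)] = sum(contains(sent, query) for sent in tm)
    return table

# Toy memory, invented for illustration (source side only).
toy_tm = [
    "m. mcinnes : merci".split(),
    "m. tremblay : je m' excuse".split(),
    "je m' excuse".split(),
]

tokens = "m. mcinnes : je m' excuse".split()
table = span_counts(tokens, toy_tm)
print(len(table))       # one query per span of two or more tokens
print(table[(3, 6)])    # matches for "je m' excuse" in the toy memory
```

Since a longer substring can never match more often than the substrings it contains, a zero count for one span implies zero for every span that extends it, which is visible in the rows for "mcinnes" and "m." above.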
