
Post-editing Effort for English to Arabic Hybrid Machine Translation



  1. An Empirical Study: Post-editing Effort for English to Arabic Hybrid Machine Translation. Hassan Sajjad, Francisco Guzman, Stephan Vogel. Qatar Computing Research Institute, HBKU

  2. Introduction • Old Arabic documents • Translation of metadata from English to Arabic

  3. Traditional Translation Process [Diagram labels: TM, Translation Company, British Library, Translators]

  4. Problem • Various small documents • Little overlap at sentence/segment level • Few translation memory matches – A lot needs to be translated from scratch • Inefficient in both time and cost

  5. Solution: Hybrid Machine Translation • TM: high precision, readily available translations • CMT: 100% recall translations • Hybrid MT: combines the benefits of both Translation Memory and Customized MT!

  6. Hybrid MT System • Translation Memory (TM) – First pass: use strict matching to translate known words and phrases • Customized Machine Translation (CMT) – Second pass: translate the remaining text using a machine translation system
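
The two-pass idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' system: it matches whole segments only (the slide also mentions known words and phrases), and `hybrid_translate`, `toy_tm`, and `toy_cmt` are hypothetical names, with `toy_cmt` standing in for a real MT engine such as a trained Moses model.

```python
from typing import Callable, Dict, List, Tuple


def hybrid_translate(
    segments: List[str],
    tm: Dict[str, str],
    cmt: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (translation, source_system) for every input segment."""
    output = []
    for segment in segments:
        if segment in tm:
            # First pass: strict (exact) translation memory match.
            output.append((tm[segment], "TM"))
        else:
            # Second pass: fall back to the customized MT system.
            output.append((cmt(segment), "CMT"))
    return output


# Toy data (assumed for illustration only).
toy_tm = {"Letter": "رسالة"}


def toy_cmt(segment: str) -> str:
    # Placeholder for a real MT call; it only marks the segment as MT output.
    return f"<MT:{segment}>"


result = hybrid_translate(
    ["Letter", "Letter from the Political Resident"], toy_tm, toy_cmt
)
for translation, system in result:
    print(system, translation)
```

Tagging each output with the system that produced it lets a post-editor see which segments came from trusted TM matches and which came from MT.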

  7. Aiming Higher: Post-editing for Quality [Diagram labels: TM, CMT, Hybrid MT, Post-editors] • High quality • High consistency • Cost and time effective

  8. Customized Machine Translation (CMT) • A statistical machine translation system – Trained specifically on the domain of the text that needs to be translated • General practice – Use Moses – Train on the data of the translation memory – Follow the recipe of a competition-grade system to ensure high quality

  9. English to Arabic CMT • The best competition-grade pipeline involves – Arabic (de-)tokenization • Splitting morphologically rich words into smaller segments and vice versa • +1.5 BLEU points improvement – Arabic (de-)normalization • Mapping different forms of a letter to one form and vice versa • +0.5 BLEU point improvement. This ensures high quality but does not guarantee less frustration for post-editors

  10. Why? CMT translation output requires: • De-tokenization and de-normalization • De-normalization introduces character-level errors – Frustrating for the post-editor to correct – Time inefficient
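
A small sketch of why de-normalization is error-prone: normalization is a many-to-one mapping, so the original letter form cannot always be recovered from the normalized text. The mapping table below is an assumption (a set of rules commonly used in Arabic NLP), not the table used by the authors.

```python
# Assumed normalization rules; the slides do not list the actual mapping.
NORMALIZATION_MAP = {
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
}
TABLE = str.maketrans(NORMALIZATION_MAP)


def normalize(text: str) -> str:
    """Map variant letter forms to a single canonical form."""
    return text.translate(TABLE)


ala = "\u0639\u0644\u0649"  # the preposition "on", ending in alef maqsura
ali = "\u0639\u0644\u064A"  # the name "Ali", ending in ya

# Both words collapse to the same normalized string, so a de-normalizer
# has to guess which original spelling was intended.
print(normalize(ala) == normalize(ali))
```

Any wrong guess at this step shows up as a single-character error in the output, which is the kind of error the slide describes as frustrating to correct.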

  11. Recommended Practices for CMT of English-Arabic • Don’t normalize • But always tokenize – Improves coverage of words – Better translations

  12. Let’s Talk about BL Case Numbers! We compare: • Translation Memory (TM) only • Hybrid MT (TM + CMT) Also: • Translator • Hybrid MT + Post-editing (PE) Looking at: • Effectiveness • Quality • Consistency

  13. Data • 1000 documents – 90k parallel sentences/segments – 953 documents for training • 489k tokens – Rest for tuning and testing

  14. Effectiveness of TM [Figure: coverage of words and segments by exact match and by fuzzy match; values shown: 7%, 84%, 13.5%, 50%] More than 85% of words still need to be translated! * Based on an assessment over X documents
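
One way to compute segment-level and word-level TM coverage is sketched below. The function name `tm_coverage` and the toy data are assumptions for illustration; they do not reproduce the figures on the slide.

```python
from typing import List, Set, Tuple


def tm_coverage(segments: List[str], tm_sources: Set[str]) -> Tuple[float, float]:
    """Return (segment coverage, word coverage) for exact TM matches."""
    matched = [s for s in segments if s in tm_sources]
    total_words = sum(len(s.split()) for s in segments)
    matched_words = sum(len(s.split()) for s in matched)
    return len(matched) / len(segments), matched_words / total_words


# Toy data: short segments repeat often, long ones rarely do.
toy_segments = ["Date", "Date", "Letter from A to B regarding trade", "Date"]
toy_tm_sources = {"Date"}

segment_cov, word_cov = tm_coverage(toy_segments, toy_tm_sources)
print(f"segments covered: {segment_cov:.0%}, words covered: {word_cov:.0%}")
```

In this toy case most segments match, yet most words remain untranslated, because the matching segments are the short ones.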

  15. Effectiveness of CMT • 100% of segments and 99.9% of words translated!

  16. Effectiveness of Hybrid MT • High precision – TM exact matches • High recall – CMT to produce high quality translations

  17. Assessing Quality • BLEU – Compare output to ‘reference’ translation • BLEU scores (Strict / Partial): TM 7.07 / 21.01; TM + CMT 54.60 / 48.54 • CMT alone: BLEU score is 53.90
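
For readers unfamiliar with BLEU, here is a simplified single-sentence, single-reference sketch: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is an illustration only; the scores on the slide were presumably computed at corpus level with standard tooling, and this unsmoothed version returns 0 whenever any n-gram order has no match.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale (no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


score = bleu("the cat sat on the mat", "the cat sat on a mat", max_n=2)
print(round(score, 2))
```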

  18. Assessing Quality • TER: Translation Error Rate – How much effort is needed to get a perfect translation? – Compare to ‘reference’ translation [Bar chart: percentage of effort required (0% to 100%) for TM and Hybrid MT] Hybrid MT can improve beyond that!
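
The effort estimate behind TER can be approximated as the number of word edits divided by the reference length. The sketch below uses plain word-level edit distance (insertions, deletions, substitutions) and deliberately omits the block-shift operation that full TER also allows, so it gives an upper bound on TER rather than the metric itself. The name `approx_ter` is hypothetical.

```python
from typing import List


def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance, computed row by row."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, 1):
        curr = [i]
        for j, ref_word in enumerate(ref, 1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def approx_ter(hypothesis: str, reference: str) -> float:
    """Edits per reference word; shifts are not modeled."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(hyp, ref) / len(ref)


rate = approx_ter("the cat sat on the mat", "the cat sat on a mat")
print(f"{rate:.1%} of the reference words need editing")
```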

  19. Assessing Quality • TER vs. post-editing effort – Similar effort estimation using post-editing of Hybrid MT [Bar chart: percentage of effort required (0% to 100%) for TM, Hybrid MT, and PE on Hybrid MT] * PE is based on an assessment over 4 documents, using a junior translator

  20. Consistency of Hybrid MT • We compared Hybrid MT versus a junior translator • We measured consistency with reference translations [Bar chart: overlap with reference translation (0% to 70%) for Hybrid MT and Translator] Hybrid MT is more consistent with reference translations * Based on an assessment over 4 documents

  21. Speedup of Hybrid MT • We compared Hybrid MT versus a junior translator [Bar chart: time taken to translate (mins, 0 to 120) for Translator and Hybrid MT + PE] Hybrid MT + PE is 30% more efficient * Based on an assessment over 4 documents

  22. Conclusion • Hybrid MT – High precision and high recall • Hybrid MT plus post-editing – Efficient in terms of both time and cost – Improves consistency • Customized MT for English-Arabic – Don’t normalize but always tokenize

  23. References • Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. Farasa: A Fast and Furious Segmenter for Arabic. In NAACL-2016, San Diego, US. • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In ACL-2007, Prague, Czech Republic. • Hassan Sajjad, Francisco Guzman, Preslav Nakov, Ahmed Abdelali, Kenton Murray, Fahad Al Obaidli, and Stephan Vogel. QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic Spoken Language Translation. In IWSLT-2013, Heidelberg, Germany.

  24. Thank you
