Quality Estimation and Evaluation of Machine Translation into Arabic


  1. QUALITY ESTIMATION AND EVALUATION OF MACHINE TRANSLATION INTO ARABIC. Houda Bouamor, Carnegie Mellon University-Qatar. Email: hbouamor@qatar.cmu.edu


  3. Outline
     1. SuMT: A Framework of Summarization and MT
     2. AL-BLEU: Metric and a dataset for Arabic MT evaluation
     3. Conclusions

  4. Part 1: SuMT: A Framework of Summarization and Machine Translation

  5. Motivation
     - MT quality is far from ideal for many languages and text genres
       - Provides incorrect context and confuses readers
     - Some sentences are not as informative as others
       - Could be summarized away to make a more cohesive document
     - Goal: keep informative sentences with decent MT quality

  6. Questions
     1. How can we estimate the MT quality of a sentence without human references?
     2. How can we find the most informative part of a document?
     3. How can we find a middle ground between informativeness and MT quality?
     4. How can we evaluate the quality of our system?

  7. Part 1: Outline
     1. MT quality estimation
     2. MT-aware summarization system
     3. Experiments and results
     4. Conclusion

  8. SuMT [Bouamor et al., 2013]

  9. SuMT: Translation

  10. SuMT: MT quality estimation

  11. SuMT: MT quality estimation, data labeling procedure
      - Each source sentence Sent_EN_i is paired with its automatically obtained translation Sent_AR_i and labeled with an MT quality score Q_i
      - The labeled pairs (Sent_EN_i, Sent_AR_i : Q_i) are then used to train the classifier

  12. Quality Estimation: MT quality classifier
      - Use an SVM classifier
      - Adapt the QuEst framework [Specia et al., 2013] to our EN-AR translation setup
      - Each sentence is characterized with:
        - General features: length, source/target length ratio, source and target punctuation
        - 5-gram LM scores
        - MT-based scores
        - Morphosyntactic features
        - ...
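
To make the classifier setup concrete, here is a minimal sketch of a sentence-level quality classifier in the spirit of slides 11 and 12. It hand-rolls a few of the listed general features (lengths, source/target length ratio, punctuation counts) rather than using the actual QuEst pipeline, and trains a scikit-learn SVM; the toy sentence pairs and binary labels are illustrative stand-ins for the automatically labeled data of slide 11.

```python
# Minimal sketch, not the paper's QuEst-based setup: a few general features
# plus an SVM, trained on (source, MT output) pairs with quality labels.
import string
from sklearn.svm import SVC

def features(src: str, tgt: str) -> list:
    """General features from slide 12: lengths, length ratio, punctuation.
    The real system adds 5-gram LM scores, MT-based scores, and
    morphosyntactic features."""
    src_toks, tgt_toks = src.split(), tgt.split()

    def punct(tokens):
        return sum(tok in string.punctuation for tok in tokens)

    return [
        len(src_toks),                          # source length
        len(tgt_toks),                          # target length
        len(src_toks) / max(len(tgt_toks), 1),  # source/target length ratio
        punct(src_toks),                        # source punctuation count
        punct(tgt_toks),                        # target punctuation count
    ]

# Toy labeled data standing in for slide 11's (Sent_EN_i, Sent_AR_i : Q_i)
# pairs; here 1 = acceptable translation quality, 0 = poor.
pairs = [
    ("the meeting was held today .", "عقد الاجتماع اليوم ."),
    ("the report was published last week .", "تقرير"),
]
labels = [1, 0]

clf = SVC(kernel="rbf")
clf.fit([features(s, t) for s, t in pairs], labels)
print(clf.predict([features("a new deal was signed .", "تم توقيع اتفاق جديد .")]))
```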

  13. SuMT: MT-aware summarization

  14. SuMT: MT-aware summarization: MEAD as a ranker

  15. SuMT: MT-aware summarization: our adaptation of MEAD

  16. Evaluating quality estimation
      - How do we evaluate the quality of the estimation?
        - Intrinsically: very hard to trust; needs references → MT evaluation (next...)
        - Extrinsically: in an application, here the MT of Wikipedia; compare using QE vs. a simple baseline

  17. SuMT: experimental settings
      - MT setup
        - Baseline MT system: Moses trained on a standard English-Arabic corpus
        - Standard preprocessing and tokenization for both English and Arabic
        - Word alignment using GIZA++
      - Summarization and test data
        - English-Arabic NIST corpora
        - Train: NIST 2008 and 2009 for training and development (259 documents)
        - Test: NIST 2005 (100 documents)

  18. SuMT: experimental settings
      - Summarization setup
        - Bilingual summarization of the test data
        - Two native speakers chose half of the sentences
        - Guidelines for sentence selection:
          - Being informative with respect to the main story
          - Preserving key information (named entities, dates, etc.)
        - Moderate inter-annotator agreement of K = 0.61

  19. SuMT: experimental settings
      - Producing summaries for each document using:
        - Length-based: choose the shortest sentences (Length)
        - State-of-the-art MEAD summarizer (MEAD)
        - MT quality estimation classifier (Classifier)
        - MT-aware summarizer (SuMT)
        - Oracle classifier: choose the sentences with the highest translation quality (Oracle)

  20. SuMT: Results
      [Bar chart: MT quality in BLEU for the Baseline, Length, MEAD, Classifier, Interpol, SuMT, and Oracle summaries; the plotted values are 26.33%, 27.52%, 28.42%, 28.45%, 31.36%, 32.12%, and 34.75%]

  21. SuMT: Results
      [Bar chart: Arabic summary quality in ROUGE-SU4 for the Length, MEAD, Classifier, Interpol, and SuMT summaries; the plotted values are 15.81%, 20.33%, 23.09%, 23.56%, and 24.07%]

  22. Conclusions
      - Presented a framework for pairing MT with summarization
      - Extended a classification framework for reference-free, sentence-level prediction of translation quality
      - Incorporated MT knowledge into a summarization system, which yields high-quality translation summaries
      - Quality estimation is shown to be useful in the context of text summarization

  23. Automatic MT quality evaluation

  24. The BLEU metric
      - De facto standard metric, proposed by IBM [Papineni et al., 2002]
      - Main ideas:
        - Exact matching of words
        - Match against a set of reference translations to allow a greater variety of expressions
        - Account for adequacy by looking at word precision
        - Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
        - No recall (difficult with multiple references); a "brevity penalty" is used instead
        - Final score: a weighted geometric average of the n-gram scores
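
Putting those ideas together gives the standard BLEU definition (the textbook form, not specific to these slides):

```latex
% Standard BLEU: brevity penalty times the geometric mean of the modified
% n-gram precisions p_n, usually with uniform weights w_n = 1/N and N = 4.
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right)
% r = reference length, c = candidate (MT output) length
```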

  25. The BLEU metric: Example (adequacy)
      - Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
      - MT output: "in two weeks Iraq's weapons will give army"
      - BLEU: 1-gram precision 4/8, 2-gram precision 1/7, 3-gram precision 0/6, 4-gram precision 0/5
      - BLEU score = 0 (weighted geometric average)
      (Example taken from Alon Lavie's AMTA 2010 MT evaluation tutorial)

  26. The BLEU metric: Example (fluency)
      - Same example as slide 25, now illustrating fluency; the n-gram precisions and the BLEU score of 0 are identical.

  27. The BLEU metric: Example
      - Same reference and MT output as the previous two slides; the n-gram precisions are again 4/8, 1/7, 0/6, and 0/5, giving a BLEU score of 0.
      - The pieces combine as $\mathrm{BLEU} \approx \mathrm{smoothing}\big( (\prod_i p_i)^{1/n} \cdot \mathrm{BP} \big)$, where BP is the brevity penalty.
      - Short story: the score is in [0, 1]; higher is better!
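
The slide's numbers are easy to verify; the short script below recomputes the modified n-gram precisions for the example pair (a self-contained sketch using the standard clipped-count definition):

```python
# Self-contained check of the slide's numbers, using the standard clipped
# (modified) n-gram precision: clip each hypothesis n-gram count by its count
# in the reference, then divide by the total number of hypothesis n-grams.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, hyp, n):
    ref_counts = Counter(ngrams(ref, n))
    hyp_counts = Counter(ngrams(hyp, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "in two weeks Iraq's weapons will give army".split()

for n in range(1, 5):
    matched, total = modified_precision(ref, hyp, n)
    print(f"{n}-gram precision: {matched}/{total}")
# Prints 4/8, 1/7, 0/6, 0/5; with a zero term, the unsmoothed geometric
# average (and hence the BLEU score) is 0.
```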

  28. BLEU & Arabic
      - BLEU heavily penalizes Arabic
      - Question: how can we adapt BLEU to support Arabic morphology?

  29. CMU-Q's AL-BLEU [Bouamor et al., 2014]
      - For our experiments:
        1. The AL-BLEU metric
        2. Data and systems

  30. AL-BLEU: Arabic Language BLEU
      - Extend BLEU to deal with Arabic's rich morphology
      - Update the n-gram scores with partial credit for partial matches:
        - Morphological: POS, gender, number, person, definiteness
        - Stem and lexical matches
      - Compute a new matching score, sketched below
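
The matching score on the original slide was an equation image, so what follows is a hedged reconstruction from the description above, not the verbatim formula: exact matches keep full credit, and partial matches earn a weighted sum over stem and morphological-feature agreement, with the weights w being the ones tuned on slide 32.

```latex
% Reconstruction (assumption, not the verbatim slide formula): partial credit
% for a hypothesis token t_h against a reference token t_r.
m(t_h, t_r) =
\begin{cases}
  1, & \text{if } t_h = t_r \text{ (exact lexical match)} \\[4pt]
  w_s \, \mathbf{1}[\text{stem match}]
    + \sum\limits_{f \in \{\text{POS, gender, number, person, definiteness}\}}
      w_f \, \mathbf{1}[f \text{ matches}], & \text{otherwise}
\end{cases}
```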

  31. AL-BLEU: Arabic Language BLEU

  32. AL-BLEU: Arabic Language BLEU
      - MADA [Habash et al., 2009] provides the stem and morphological features
      - Weights are optimized to improve correlation with human judgments, using hill climbing on a development set
      - AL-BLEU is a geometric mean over the different matched n-grams
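
As an illustration of the tuning loop, here is a minimal greedy hill-climbing sketch. The weight names, step size, and the pluggable dev-set objective are assumptions; the slide only states that weights are optimized for correlation with human judgments on a development set.

```python
# Minimal sketch of the weight tuning described above: greedy hill climbing
# on a development set. The weight names, step size, and the pluggable
# objective are illustrative assumptions, not the paper's exact procedure.
import random

WEIGHT_NAMES = ["stem", "pos", "gender", "number", "person", "definiteness"]

def hill_climb(dev_correlation, steps=1000, step_size=0.05, seed=0):
    """`dev_correlation` maps a weight dict to the correlation (e.g. Kendall's
    tau) between AL-BLEU under those weights and the human judgments on the
    development set."""
    rng = random.Random(seed)
    weights = {name: 0.5 for name in WEIGHT_NAMES}
    best = dev_correlation(weights)
    for _ in range(steps):
        # Perturb one randomly chosen weight, clamped to [0, 1].
        name = rng.choice(WEIGHT_NAMES)
        candidate = dict(weights)
        candidate[name] = min(1.0, max(0.0,
                              candidate[name] + rng.choice((-step_size, step_size))))
        score = dev_correlation(candidate)
        if score > best:  # keep the move only if dev correlation improves
            weights, best = candidate, score
    return weights, best

# Toy objective standing in for the real dev-set correlation: it just rewards
# a high stem weight, so the climber should push "stem" toward 1.0.
tuned, tau = hill_climb(lambda w: w["stem"])
print(tuned["stem"], tau)
```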

  33. AL-BLEU: Evaluation and Results
      - A good MT metric should correlate well with human judgments
      - Measure the correlation between BLEU, AL-BLEU, and human judgments at the sentence level

  34. AL-BLEU: Data and Systems
      - Problem: no human judgment dataset exists for Arabic
      - Data: annotate a corpus spanning different text genres: news, climate change, Wikipedia
      - Systems: six state-of-the-art EN-AR MT systems (4 research-oriented, 2 commercial off-the-shelf)

  35. Data: Judgment collection
      - Rank the sentences relative to each other, from best to worst

  36. Data: Judgment collection
      - Rank the sentences relative to each other, from best to worst
      - Adapt a framework commonly used for evaluating MT into European languages [Callison-Burch et al., 2011]
      - 10 bilingual annotators were hired to assess the quality of each system
      - Annotator agreement (kappa):

                          K_inter   K_intra
        EN-AR              0.57      0.62
        EN-EU (average)    0.41      0.57
        EN-CZ              0.40      0.54

  37. AL-BLEU: Evaluation and Results
      - Use 900 sentences extracted from the dataset: 600 dev and 300 test
      - AL-BLEU correlates better with human judgments (Kendall's $\tau$):

                   Dev      Test
        BLEU      0.3361   0.3162
        AL-BLEU   0.3759   0.3521

      - $\tau = (\#\,\text{concordant pairs} - \#\,\text{discordant pairs}) \,/\, \text{total pairs}$
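
The tau on the slide can be computed directly from pairwise comparisons; below is a minimal sketch. Ties between systems are simply skipped, which is one common convention; the slide does not specify tie handling.

```python
# Sentence-level Kendall's tau, as defined on the slide: the normalized
# difference between concordant and discordant pairs of system rankings.
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Both lists score the same MT systems on one sentence; a pair of systems
    is concordant when the metric and the humans order it the same way."""
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        product = (metric_scores[i] - metric_scores[j]) * \
                  (human_scores[i] - human_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# Three systems on one sentence: 2 concordant pairs, 1 discordant -> tau = 1/3.
print(kendall_tau([0.31, 0.24, 0.28], [3, 2, 1]))
```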

  38. AL-BLEU: Conclusion
      - We provide an annotated corpus of human judgments for the evaluation of Arabic MT
      - We adapt BLEU and introduce AL-BLEU
      - AL-BLEU uses morphological, syntactic, and lexical matching
      - AL-BLEU correlates better with human judgments
      - Resources: http://nlp.qatar.cmu.edu/resources/AL-BLEU

  39. Thank you for your attention

  40. Collaborators: Prof. Kemal Oflazer, Dr. Behrang Mohit, Hanan Mohammed
