QUALITY ESTIMATION AND EVALUATION OF MACHINE TRANSLATION INTO ARABIC
Houda Bouamor, Carnegie Mellon University-Qatar email: hbouamor@qatar.cmu.edu
Outline
- SUMT: A Framework of Summarization and MT
- AL-BLEU: Metric and a dataset

SUMT: A Framework of Summarization and MT
- MT quality is far from ideal for many languages
  - Incorrect translations provide the wrong context and confuse readers
- Some sentences are not as informative as others
  - These could be summarized away to make a more cohesive document
MT Quality Classifier
- Input: sentence pairs (SentEN_1, SentAR_1), (SentEN_2, SentAR_2), ..., (SentEN_n, SentAR_n)
- Output: a quality score per pair: QScore_1, QScore_2, ..., QScore_n
- SentEN_n: a source sentence; SentAR_n: its automatically obtained translation
- Use an SVM classifier
- Adapt the QuEst framework [Specia et al., 2013] to our task
- Each sentence is characterized with:
  - General features: length, source-target length ratio, source-target punctuation counts
  - 5-gram LM scores
  - MT-based scores
  - Morphosyntactic features
  - …
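The "general" features can be computed directly from the sentence pair. A minimal sketch in Python (the function name `extract_features` and the exact feature dictionary are illustrative assumptions; the LM, MT-based and morphosyntactic features come from external tools and are omitted):

```python
import string

def extract_features(src: str, tgt: str) -> dict:
    """Toy sketch of the 'general' QE features from the slide: lengths,
    source/target length ratio and punctuation counts.  LM scores,
    MT-based scores and morphosyntactic features would be produced by
    external tools and are omitted here."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    return {
        "src_len": len(src_tokens),
        "tgt_len": len(tgt_tokens),
        "len_ratio": len(src_tokens) / max(len(tgt_tokens), 1),
        "src_punct": sum(ch in string.punctuation for ch in src),
        "tgt_punct": sum(ch in string.punctuation for ch in tgt),
    }

feats = extract_features("The army seized the weapons .",
                         "استولى الجيش على الأسلحة .")
```

Feature vectors of this shape would then be fed to the SVM classifier.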
- How do we evaluate the quality of the estimation?
  - Intrinsically: very hard to trust
    - Needs references -> MT evaluation (next...)
  - Extrinsically: in an application
    - In the context of MT of Wikipedia
    - Compare using QE vs. a simple baseline
- MT setup
  - Baseline MT system: Moses trained on a standard English-Arabic corpus
  - Standard preprocessing and tokenization for both English and Arabic
  - Word alignment using GIZA++
- Summarization and test data
  - English-Arabic NIST corpora
    - Train: NIST 2008 and 2009 for training and development (259 documents)
    - Test: NIST 2005 (100 documents)
- Summarization setup
  - Bilingual summarization of the test data
  - Two native speakers chose half of the sentences
  - Guidelines for sentence selection:
    - Being informative with respect to the main story
    - Preserving key information (named entities, dates, etc.)
  - Moderate inter-annotator agreement (K = 0.61)
- Producing summaries for each document using:
  - Length-based: choose the shortest sentences (Length)
  - The state-of-the-art MEAD summarizer (MEAD)
  - The MT quality estimation classifier (Classifier)
  - The MT-aware summarizer (SuMT)
  - An oracle classifier: choose the sentences with the highest translation quality (Oracle)
- MT results (BLEU):

  System      BLEU
  Baseline    27.52
  Length      26.33
  MEAD        28.42
  Classifier  31.36
  Interpol    28.45
  SuMT        32.12
  Oracle      34.75
- Arabic summary quality (ROUGE-SU4):

  System      ROUGE-SU4
  Length      15.81
  MEAD        23.56
  Classifier  23.09
  Interpol    20.33
  SuMT        24.07
- Presented SuMT, a framework for pairing MT with summarization
- Extended a classification framework for MT quality estimation
- Incorporated MT knowledge into a summarization system
- Quality estimation is shown to be useful in an extrinsic MT application
BLEU
- De facto metric, proposed by IBM [Papineni et al., 2002]
- Main ideas
  - Exact matching of words
  - Match against a set of reference translations for a greater variety of expressions
  - Account for adequacy by looking at word precision
  - Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
  - No recall (difficult with multiple references); a "brevity penalty" is used instead
  - Final score: a weighted geometric average of the n-gram scores
Example taken from Alon Lavie's AMTA 2010 MT evaluation tutorial:
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
- 1-gram precision: 4/8
- 2-gram precision: 1/7
- 3-gram precision: 0/6
- 4-gram precision: 0/5
- BLEU score = 0 (weighted geometric average)
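These counts can be reproduced with a few lines of Python implementing BLEU's clipped (modified) n-gram precision; a sketch, with `ngram_precision` a helper written for this example rather than a library call:

```python
from collections import Counter

def ngram_precision(reference: str, hypothesis: str, n: int):
    """Clipped n-gram precision as used by BLEU: each hypothesis n-gram
    is credited at most as many times as it occurs in the reference."""
    ref_counts = Counter(zip(*[reference.split()[i:] for i in range(n)]))
    hyp_ngrams = list(zip(*[hypothesis.split()[i:] for i in range(n)]))
    matches = sum(min(c, ref_counts[g])
                  for g, c in Counter(hyp_ngrams).items())
    return matches, len(hyp_ngrams)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
hyp = "in two weeks Iraq's weapons will give army"
precisions = [ngram_precision(ref, hyp, n) for n in range(1, 5)]
# precisions == [(4, 8), (1, 7), (0, 6), (0, 5)]
```

Because the 3- and 4-gram precisions are zero, the unsmoothed geometric average, and hence the BLEU score, is 0.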
BLEU = smoothing( ( Π_{i=1..n} p_i )^(1/n) × brevity penalty )
- BLEU heavily penalizes Arabic
- Question: How can we adapt BLEU to support a morphologically rich language like Arabic?
AL-BLEU
- Extend BLEU to deal with Arabic's rich morphology
- Update the n-gram scores with partial credits for partial matches:
  - Morphological: POS, gender, number, person, definiteness
  - Stem and lexical matches
- Compute a new matching score as a weighted combination of these matches
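The partial-credit idea can be sketched in Python. The weights (`w_stem`, `w_feat`) and the token representation below are illustrative assumptions, not the metric's tuned values:

```python
def token_credit(hyp_tok, ref_tok, w_stem=0.6, w_feat=0.1):
    """Give full credit for an exact surface match, otherwise partial
    credit for a stem match plus each matching morphological feature.
    Weights here are illustrative, not the paper's tuned values."""
    if hyp_tok["surface"] == ref_tok["surface"]:
        return 1.0
    credit = w_stem if hyp_tok["stem"] == ref_tok["stem"] else 0.0
    for feat in ("pos", "gender", "number", "person", "definiteness"):
        if hyp_tok[feat] == ref_tok[feat]:
            credit += w_feat
    return min(credit, 1.0)

# Two toy analyses sharing the stem كتب but differing in surface form,
# gender and number:
hyp = {"surface": "كتبت", "stem": "كتب", "pos": "V",
       "gender": "f", "number": "sg", "person": "3", "definiteness": "na"}
ref = {"surface": "كتبوا", "stem": "كتب", "pos": "V",
       "gender": "m", "number": "pl", "person": "3", "definiteness": "na"}
credit = token_credit(hyp, ref)  # stem + POS + person + definiteness match
```

A plain-BLEU matcher would give this pair zero credit; the partial-credit score rewards the shared stem and agreeing morphological features.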
- MADA [Habash et al., 2009] provides the stem and morphological features
- Weights are optimized towards improving correlation with human judgments
- Hill climbing used on the development set
- AL-BLEU is a geometric mean of the different n-gram scores
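The weight tuning can be sketched as simple hill climbing. This is a toy version: the stand-in objective below has a known optimum, whereas in the slides' setting the objective would be correlation with human judgments on the development set:

```python
import random

def hill_climb(objective, weights, step=0.05, iters=500, seed=0):
    """Greedy hill climbing: perturb one weight at a time and keep the
    change only if the objective improves."""
    rng = random.Random(seed)
    best = objective(weights)
    for _ in range(iters):
        i = rng.randrange(len(weights))
        cand = list(weights)
        cand[i] = min(max(cand[i] + rng.choice([-step, step]), 0.0), 1.0)
        score = objective(cand)
        if score > best:
            weights, best = cand, score
    return weights, best

# Toy stand-in objective with a known optimum at w = (0.7, 0.3):
objective = lambda w: -((w[0] - 0.7) ** 2 + (w[1] - 0.3) ** 2)
w, score = hill_climb(objective, [0.5, 0.5])
```

Hill climbing makes no global-optimality guarantee, but it is a cheap fit for a small number of bounded weights like these.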
- A good MT metric should correlate well with human judgments
- Measure the correlation between BLEU, AL-BLEU and human judgments
- Problem: no human judgment dataset exists for Arabic
- Data
  - Annotate a corpus composed of different text genres:
    - News, climate change, Wikipedia
- Systems
  - Six state-of-the-art EN-AR MT systems
    - 4 research-oriented systems
    - 2 commercial off-the-shelf systems
- Annotators rank the candidate translations relative to each other, from best to worst
- Adapt a commonly used framework for evaluating MT output
- 10 bilingual annotators were hired to assess the translations
- Inter- and intra-annotator agreement (Kappa):

                 K_inter  K_intra
  EN-AR            0.57     0.62
  EN-EU (average)  0.41     0.57
  EN-CZ            0.40     0.54
- Use 900 sentences extracted from the dataset, 600 of them for tuning the weights
- AL-BLEU correlates better with human judgments than BLEU
τ = (# of concordant pairs − # of discordant pairs) / total pairs
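The formula above in a few lines of Python (a sketch; tied pairs count toward the denominator but are neither concordant nor discordant here):

```python
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Kendall's tau as defined above:
    (# concordant pairs − # discordant pairs) / total pairs."""
    pairs = list(combinations(zip(metric_scores, human_scores), 2))
    concordant = sum((m1 - m2) * (h1 - h2) > 0
                     for (m1, h1), (m2, h2) in pairs)
    discordant = sum((m1 - m2) * (h1 - h2) < 0
                     for (m1, h1), (m2, h2) in pairs)
    return (concordant - discordant) / len(pairs)

# A metric that fully agrees with the human ranking gets tau = 1.0:
tau = kendall_tau([0.9, 0.7, 0.4, 0.2], [4, 3, 2, 1])  # → 1.0
```

A metric whose ranking is the exact reverse of the human ranking would score −1.0; uncorrelated rankings land near 0.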
Conclusion
- We provide an annotated corpus of human judgments for English-Arabic MT
- We adapt BLEU and introduce AL-BLEU
- AL-BLEU uses morphological, syntactic and lexical matching
- AL-BLEU correlates better with human judgments than BLEU
- Resources: http://nlp.qatar.cmu.edu/resources/AL-BLEU
Hanan Mohammed