
Evaluation of Machine Translation Quality

Marco Turchi

FBK Trento, Italy (turchi@fbk.eu)
Slides from the presentation by Matteo Negri… and myself

Disclaimer

“More has been written about MT evaluation over the past 50 years than about MT itself”

Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)

“It is impossible to write a comprehensive overview of the MT evaluation literature”

Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.


Outline

  • Importance of MT evaluation
  • Difficulty of MT evaluation
  • Human evaluation: fluency/adequacy
  • Automatic evaluation:

– Reference-based: BLEU, TER, HTER (chosen among MANY others)
– Reference-free: quality estimation (estimating post-editing effort)

The importance of MT evaluation

  • Answering “How good is an MT system?” as a way to:

– Decide which system to use for a given task
– Assess and compare systems’ performance
– Define the state of the art
– Drive system development and measure improvements
– Decide whether to apply MT at all

  • …Necessary (yes, not sufficient) conditions for progress in any research field
  • Difficult task!


Difficulty of MT evaluation

  • No formal definition of “translation” → no definition of “good translation”
  • The notion of quality is inherently subjective
  • Exact quantification is difficult (especially for long sentences)
  • MT errors are very varied in nature
  • Perfect or very poor translations are easy to score, but what happens in between?


Difficulty of MT evaluation

  • Many different acceptable translations for the same sentence:

– I am [experiencing|suffering from|feeling] a throbbing pain .
– I [feel|can feel|have] a [throbbing pain|painful throbbing] .
– [It is a|It’s in|I’ve got a] throbbing pain .
– It’s throbbing [and it really hurts|with pain] .
– [It’s painful and|It hurts so much] it’s throbbing .


Difficulty of MT evaluation

  • How would you translate:

– It’s raining cats and dogs
– Ace in the hole
– Beat around the bush
– Chew the fat
– Wild goose chase
– Tie one on
– Sunny smile

  • Literally, by its meaning, or with the corresponding idiom (if any)?

Difficulty of MT evaluation

  • Classification of errors: a rather rich taxonomy

Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)

Human vs. automatic evaluation

  • Human MT evaluation:

– criteria: adequacy (fidelity) and fluency (intelligibility)
– pros: very accurate, high quality
– cons: expensive, slow, subjective

  • Automatic MT evaluation:

– criteria: “similarity” to professional human translation
– pros: inexpensive, quick, objective
– cons: quality is “slightly” lower than a human check


Human evaluation

Human evaluation

  • Given: the MT output, and the source and/or a reference translation
  • Task: assess the quality of the MT output
  • Metrics:

– Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? …requires bilingual judges or a reference translation
– Fluency: is the output fluent English? This involves both grammatical correctness and idiomatic word choices. …monolingual judges are sufficient, no reference needed


Human evaluation: adequacy and fluency

  • Source sentence: Le chat entre dans la chambre.

(a) Adequate, fluent translation: The cat enters the room.
(b) Adequate, disfluent translation: The cat enter in the room.
(c) Fluent, inadequate translation: The cats enter the bedroom.
(d) Disfluent, inadequate translation: Bedroom the dogs enters the

Human evaluation: Likert scales

Adequacy:
5 = all meaning
4 = most meaning
3 = much meaning
2 = little meaning
1 = none

Fluency:
5 = flawless English
4 = good English
3 = non-native English
2 = disfluent English
1 = incomprehensible

Human evaluation: subjectivity

[Figure: fluency/adequacy judgments for translations (a)–(d) by JUDGE1, JUDGE2 and JUDGE3]

  • Perfect or very poor translations are easy to score… but what happens in between?

(a) Adequate, fluent translation: The cat enters the room.
(b) Adequate, disfluent translation: The cat enter in the room.
(c) Fluent, inadequate translation: The cats enter the bedroom.
(d) Disfluent, inadequate translation: Bedroom the dogs enters the

Human evaluation: subjectivity

Evaluators disagree!

  • …look at this histogram of adequacy judgments by different human evaluators

[Figure: histogram of adequacy judgments per evaluator]

Human evaluation: measuring agreement

  • Kappa coefficient:

K = (p(A) − p(E)) / (1 − p(E))

– p(A): proportion of times that the evaluators agree
– p(E): proportion of times that they would agree by chance (5-point scale → p(E) = 1/5)
– Complete agreement: K = 1
– No agreement beyond chance: K = 0

  • Example: inter-evaluator agreement in WMT 2007

            p(A)    p(E)    K
Fluency     .400    .2      .250
Adequacy    .380    .2      .226
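As a concrete illustration, here is a minimal sketch of this kappa computation for two judges on a 5-point scale (plain Python; the function name and the toy judgments are ours, not from the slides):

def kappa(judge_a, judge_b, num_categories=5):
    # K = (p(A) - p(E)) / (1 - p(E)); chance agreement on an
    # n-point scale is taken as p(E) = 1/n, as on the slide.
    p_a = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
    p_e = 1.0 / num_categories
    return (p_a - p_e) / (1.0 - p_e)

# Toy data with 4/10 agreements: p(A) = .400, p(E) = .2 -> K = .250,
# matching the WMT 2007 fluency row above.
print(round(kappa([5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
                  [5, 3, 2, 2, 5, 1, 4, 3, 5, 2]), 3))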

Human evaluation: alternatives

  • Ranking translations: is translation X better than translation Y?

– Evaluators are more consistent

  • Informativeness: answer comprehension questions using the translation (who? where? when? names, numbers, dates, etc.)

– Very hard to devise questions

                    p(A)    p(E)    K
Fluency             .400    .2      .250
Adequacy            .380    .2      .226
Sentence ranking    .582    .333    .373


Human evaluation: alternatives

  • Reading time

– People read a well-formed text more quickly

  • Post-editing effort (time / HTER)

– Time required to turn MT output into a good translation
– HTER (Human-targeted Translation Error Rate): the number of editing operations required to turn MT output into an acceptable translation


Automatic metrics for MT evaluation

Requirements for automatic metrics

  • Low cost (compared to human evaluation)
  • Objective (unbiased)
  • Meaningful: the score should give an intuitive interpretation of translation quality
  • Efficient: can be computed quickly and often
  • Consistent: repeated use of the metric should give the same results
  • Correct: the metric must rank better systems higher

Reference-based metrics

  • Idea: compute a similarity score between a candidate translation and one or more high-quality reference translations

– References are created by human experts (e.g. professional translators)
– Several references allow us to account for the variability of good translations

  • Criterion for validating automatic metrics: automatic scores must correlate with human ones on test data

Reference-based metrics

  • Typically:

Score(cand) = (1/k) · Σ_{i=1..k} Sim(ref_i, cand)   (typically 1 ≤ k ≤ 4 references)

– Sim is a similarity metric between sentences
– Sim can use a variety of properties: string distance, word precision/recall, syntactic similarity, semantic distance, etc.

Examples: WER (ratio of smallest edit distance to output length), BLEU (weighted sum of precision of n-grams), TER (normalized number of edits to match the closest reference), METEOR (harmonic mean of unigram precision/recall); also NIST, PER, GTM, HTER, TERp, CDER, BLANC, ULC, MT-NCD, ATEC, TESLA, SEPIA, IQTM, BEWT-E, MEANT, etc.

“Candidate”, “reference”, “n-grams”

Candidate (or “target” or “hypothesis”):

the gunman was shot dead by police .

Reference translation:

the gunman was shot to death by the police .

N-grams:

1-grams: the, gunman, was, shot, by, police, .
2-grams: the gunman, gunman was, was shot, police .
3-grams: the gunman was, gunman was shot
4-grams: the gunman was shot
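A quick sketch of this n-gram extraction in Python (the helper name ngrams is ours, for illustration):

def ngrams(tokens, n):
    # all contiguous n-word sequences in the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

cand = "the gunman was shot dead by police .".split()
for n in range(1, 5):
    print(f"{n}-grams:", ngrams(cand, n))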

The BLEU metric (BiLingual Evaluation Understudy)

  • Proposed by IBM [Papineni et al., 2001] (the name alludes to IBM’s color, “Big Blue”)
  • A numerical measure of closeness between texts
  • Rationale: the closer MT output is to human translation, the better
  • Idea: check matches of words (unigrams) and phrases (n-grams) between:

– one hypothesis (the translation produced by MT)
– a set of references (professional human translations)

  • Criterion: the more matches, the better the hypothesis
  • Needs good-quality references to cover linguistic variety

Important: only the target language is taken into account!


The BLEU metric (BiLingual Evaluation Understudy)

[Figure: a reference (REF) and three hypotheses (HYP1, HYP2, HYP3), rated VERY GOOD, BAD and VERY BAD]

The BLEU metric: modified n-gram precision

  • n-gram precision: percentage of n-grams in the hypothesis that also occur in (any of) the references (0 ≤ p ≤ 1)

– matches of shorter n-grams (n = 1, 2) capture adequacy
– matches of longer n-grams (n = 3, 4, ...) capture fluency

  • Modified: a reference word is considered exhausted after a matching word is identified in the hypothesis.

– Example:
Hyp: the the the the the the the
Ref: the cat is on the mat


p1 (standard) = 7/7 = 1.0        p1 (modified) = 2/7 ≈ 0.29
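A minimal sketch of modified (clipped) unigram precision, reproducing the example above (pure Python; function names are ours):

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n=1):
    hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
    # clip each hypothesis n-gram count at its count in the reference
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

hyp = "the the the the the the the".split()
ref = "the cat is on the mat".split()
print(modified_precision(hyp, ref))  # 2/7 ≈ 0.29 rather than 7/7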

The BLEU metric: brevity penalty

  • Brevity penalty (BP): penalizes hypotheses that are too short

– Example:
Hyp: the
Ref: the cat is on the mat
…You can’t just type out the single word “the” (precision 1.0!)
– c = length of the MT hypothesis, r = length of the closest reference
– BP = 1 if c > r, else exp(1 − r/c)
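A one-function sketch of the brevity penalty (c = 8, r = 9 reproduces the 0.8825 used in the worked example below):

import math

def brevity_penalty(c, r):
    # no penalty if the hypothesis is longer than the closest reference
    return 1.0 if c > r else math.exp(1.0 - r / c)

print(round(brevity_penalty(8, 9), 4))  # 0.8825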

The BLEU metric: computation

BLEU = Brevity Penalty × geometric mean of p1, p2, ..., pn
(where pn is the modified n-gram precision, for 1 ≤ n ≤ 4)

Hypothesis: The gunman was shot dead by police .
– Ref 1: The gunman was shot to death by the police .
– Ref 2: The hit man was killed by the police forces .
– Ref 3: Police killed the gunman .
– Ref 4: The gunman was shot dead by the police .

  • Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
  • Brevity penalty: c = 8, r = 9, BP = exp(1 − 9/8) = 0.8825
  • Final score: (1 × 0.86 × 0.67 × 0.6)^(1/4) × 0.8825 = 0.68

NOTE: this is a product! If one of the factors is 0 (e.g. no 4-gram matches), the final score will be 0! For this reason the final score is usually calculated on the entire evaluation corpus, not on single sentences.

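Putting the pieces together, a self-contained sketch of this sentence-level computation (illustration only; real toolkits such as sacrebleu add smoothing and corpus-level aggregation):

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        hyp_c = ngram_counts(hyp, n)
        # clip against the maximum n-gram count over all references
        max_ref = Counter()
        for ref in refs:
            for g, c in ngram_counts(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_c.items())
        precisions.append(clipped / max(sum(hyp_c.values()), 1))
    if min(precisions) == 0:  # the NOTE above: one zero factor -> score 0
        return 0.0
    c = len(hyp)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "The gunman was shot dead by police .".split()
refs = [r.split() for r in [
    "The gunman was shot to death by the police .",
    "The hit man was killed by the police forces .",
    "Police killed the gunman .",
    "The gunman was shot dead by the police .",
]]
print(round(bleu(hyp, refs), 2))  # ≈ 0.68, as in the slide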

The BLEU metric: correlation with training set size

[Figure: BLEU score vs. number of sentence pairs used in training; experiments by Philipp Koehn, from George Doddington, NIST, 2002]

The BLEU metric: correlation with human judgments

[Figure: correlation between BLEU scores and human judgments]


The BLEU metric limitations: examples

  • Reference:  a b c d e f g h i j k l m n o p q r s
  • Hyp 1:      a b c d f e g i h j l k m o n p r q s
  • Hyp 2:      a b c d e f g x x x x x x x x x x x x

             Hyp 1    Hyp 2
1-gram       1.0000   0.3684
2-gram       0.1666   0.3333
3-gram       0.1176   0.2941
4-gram       0.0625   0.2500
BLEU score   0.1871   0.3083

Longer n-grams dominate shorter n-grams!

The BLEU metric limitations: examples

  • Reference: George Bush will often take a holiday in Crawford Texas

HYPOTHESES                                                     BLEU
George Bush will often take a holiday in Crawford Texas        1.0000
Bush will often holiday in Texas                               0.4611
Bush will often holiday in Crawford Texas                      0.6363
George Bush will often holiday in Crawford Texas               0.7490
George Bush will not often vacation in Texas                   0.4491
George Bush will not often take a holiday in Crawford Texas    0.9129 (!)

Small changes in the text may cause big meaning changes!

The BLEU metric limitations: examples

  • Reference: The President frequently makes his vacation in Crawford Texas

HYPOTHESES                                            BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas   0.2627
holiday often Bush a takes George in Crawford Texas   0.2627

WHY? …The “invisible region” [Hovy & Ravichandran 2003]

The BLEU metric limitations: improvements

  • Reference: The President frequently makes his vacation in Crawford Texas
    POS tags:  DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #1: matches at the POS level [Hovy & Ravichandran 2003]

HYPOTHESES (as POS)                BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP    0.5411
NN RB NNP DT VBZ NNP IN NNP NNP    0.3117

(word-level scores for the same hypotheses: 0.2627 and 0.2627)

The BLEU metric limitations: improvements

  • Reference: The President frequently makes his vacation in Crawford Texas
    POS tags:  DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #2: (words + POS) / 2 [Hovy & Ravichandran 2003]

HYPOTHESES (as POS)                BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP    0.4020
NN RB NNP DT VBZ NNP IN NNP NNP    0.2966

(word-level scores for the same hypotheses: 0.2627 and 0.2627)

The BLEU metric: pros and cons

  • BLEU ranges from 0 to 1 (translation quality as a “percentage”)
  • The more references, the higher the score
  • High correlation with human-assigned scores, especially on fluency
  • Ranking of “similar” MT systems is equivalent to human ranking

  • Collecting references has a high cost
  • Longer n-grams dominate shorter n-grams
  • Small changes in the text (e.g. “not”) may cause big meaning changes
  • Scores are not straightforward to interpret (BLEU = 30… so what?)
  • Syntax is poorly modeled
  • Ignores word relevance and semantic equivalence (string-level comparisons)
  • Can fail in ranking systems based on different approaches

The TER metric (Translation Edit Rate)

  • Idea: simulate post-editing [Snover et al. 2006]

– Given a translation hypothesis (H) and a reference translation (R)
– Calculate the minimal number of edits to transform H into R (normalized by the average length of the references)
– Possible edits: insertions/deletions/substitutions of single words, shifts of word sequences

  • Criterion: the fewer the edits, the better the hypothesis


The TER metric: example

REF: Saudi Arabia denied this week information published in the American NYT
HYP: this week the Saudis denied information published in the NYT

  • HYP: fluent, same meaning as the reference (except “American”)
  • but not an exact match:

– this week is shifted
– Saudi Arabia in the REF appears as the Saudis in the HYP
– American appears only in the REF

  • Number of edits = 4 (1 shift, 2 substitutions, and 1 deletion):

TER% = 4/11 × 100 = 36.36%
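For intuition, a sketch of the word-level edit distance underlying TER (insertions, deletions and substitutions only; real TER also allows block shifts, which is what brings the example above down to 4 edits, so this version only gives an upper bound):

def edit_distance(hyp, ref):
    # classic dynamic-programming Levenshtein distance over words
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

hyp = "this week the Saudis denied information published in the NYT".split()
ref = "Saudi Arabia denied this week information published in the American NYT".split()
print(edit_distance(hyp, ref) / len(ref))  # TER upper bound, without shifts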

The TER metric: discussion

  • Evaluation close to a real task (post-editing)
  • Results are more interpretable than for other metrics
  • Can be computed even for a single sentence
  • Insensitive to semantic closeness (e.g. synonyms, paraphrases)
  • Complexity of computation (optimal calculation of edit distance with move operations: NP-complete)

– approximate search via dynamic programming (decomposition into sub-problems)

The HTER metric (Human-targeted TER)

  • TER ignores semantic equivalence and heavily depends on the reference translation
  • Idea: use references that are human post-editions

– Perform human post-editing to transform the hypothesis into the closest acceptable translation
– HTER measures TER between the hypothesis and the resulting reference translation

  • Criterion: the fewer the edits, the better the hypothesis (same as TER)

TER/HTER: pros and cons

  • TER

– intuitive measure of MT quality
– adequate for fast development
– reasonably correlates with human judgments (better than BLEU, worse than others, e.g. METEOR)
– ignores semantic equivalence

  • HTER

– intuitive measure of MT quality
– highest correlation with human judgments
– a possible substitute for human evaluations, because it is less subjective
– expensive: 3 to 7 minutes per sentence for a human to annotate
– not suitable for use in the development cycle of an MT system

Application-oriented MT evaluation: Quality Estimation (QE)

  • From controlled lab tests and evaluation campaigns…
  • …to MT evaluation in real-life conditions (e.g. the CAT framework)

– As a support to human translators
– At run time
– Without reference translations

(One) scenario: the CAT framework

[Figure: CAT tool workflow]

The CAT tool:

1. Segments the input document
2. Provides, for each segment:
   • suggestions from a translation memory (TM)
   • suggestions from an MT engine

The translator, for each segment:

1. Selects the best suggestion
2. Post-edits it (if necessary) to reach publication quality

(One) scenario: the CAT framework

  • Questions:

– Is this suggestion good enough to be published?
– Can I trust it?
– Can a reader get the gist?
– Is it publishable “as is”?
– If not, what is better: post-editing or rewriting?

  • Huge market interest

– Increased translators’ productivity
– No manual intervention on reliable MT suggestions


Predicting MT output quality

  • Task: automatically estimate MT output quality at run-time and without reference translations
  • Approach: supervised learning. First (training step), a model is learned from human-labelled data. Then (prediction step), the model is used to label new, unseen data.

[Illustration: positive/negative examples; possible features: hasWings, hasFeathers, sound, moves, hasPalmateFeet, etc.]

Predicting MT output quality

  • What is a good indicator of translation quality?
  • It should take into account:

– Correctness and usefulness of the translation
– Cognitive effort needed by a human for the correction

  • All these aspects can be summarized in the:

– Post-editing effort


Predicting MT output quality

  • What is post-editing?

– A process of modification rather than revision (Loffler-Laurian 1985)
– The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
– Repairing texts (Krings, 2001)
– “…the process of improving a machine-generated translation with a minimum of manual labor” (TAUS report, 2010)

Predicting MT output quality

  • What is post-editing effort?

– The effort made by a post-editor to manually improve a machine-generated translation

  • Measures of post-editing effort:

– Quality score (as estimated by humans on a 1–5 Likert scale)
– Number of edit operations (HTER)
– Post-editing time (total seconds, or seconds per word)
– Number of keystrokes
– …

Quality scores

  • Arbitrary choice of the levels of quality:

1 = requires complete retranslation
2 = requires some retranslation
3 = very little post-editing needed
4 = fit for purpose

  • Labeling requires human intervention
  • A precise measure
  • Subjective/expensive/time-consuming task

Quality scores

  • Workshop on SMT scoring schema:

1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, and needs to be translated from scratch.
2. About 50–70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
3. About 25–50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
4. About 10–25% of the MT output needs to be edited. It is generally clear and intelligible.
5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

Post-editing time

  • Seconds needed to post-edit a sentence
  • Normalized version: seconds per word

– little time = good translation
– much time = bad translation

  • Usually includes:

– reading time
– searching for information in external resources
– typing time
– extra time for secondary activity (e.g. correction)

  • High variability across sentences and translators

HTER (again!)

  • Human-targeted TER is the standard edit distance between the original machine translation and its minimally post-edited version

– edits: insertion, deletion, substitution, shift

  • Lower variability (compared to time) across sentences/translators

HTER = #edits / #words in the post-edited version

Post-editing time vs. HTER

  • Time: pros/cons

– Accounts for the different effort of translating different words
– Variability among post-editors

  • HTER: pros/cons

– Objective, easy-to-compute measure
– Less variance across post-editors (bad = bad for all)
– Ignores the different effort of translating different words


Predicting MT output quality

  • Tasks:

– Automatic labeling
   • real values → regression
   • integers → classification
– Automatic ranking

  • Granularity:

– Word level (e.g. “The cat enter in the room”)
– Sentence level (e.g. “The cat enter in the room”: 2.27)
– Document level

Evaluation Metrics - Regression

  • Regression (predictions as real values):

– Mean Absolute Error (MAE)
– Root Mean Squared Error (RMSE)

  • Given a set of predicted scores H and a set of human scores V over N sentences:

MAE = (1/N) · Σ_{i=1..N} |H(s_i) − V(s_i)|
RMSE = sqrt( (1/N) · Σ_{i=1..N} (H(s_i) − V(s_i))² )
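A direct sketch of both metrics (function names and toy scores are ours):

import math

def mae(H, V):
    return sum(abs(h - v) for h, v in zip(H, V)) / len(H)

def rmse(H, V):
    return math.sqrt(sum((h - v) ** 2 for h, v in zip(H, V)) / len(H))

H = [3.1, 4.0, 2.2, 4.8]   # predicted quality scores (toy data)
V = [3.0, 4.5, 2.0, 5.0]   # human scores (toy data)
print(round(mae(H, V), 3), round(rmse(H, V), 3))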

Evaluation Metrics - Classification

  • Classification (predictions as integers):

– Precision (Pr)
– Recall (Re)
– F-score (F1)

  • Given a set of predicted scores H and a set of human scores V
  • An example for binary classification (labels 1 / −1):

           V = 1             V = −1
H = 1      True Positive     False Positive
H = −1     False Negative    True Negative

Pr = tp / (tp + fp)    Re = tp / (tp + fn)    F1 = 2 · Pr · Re / (Pr + Re)
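The same, as a runnable sketch for labels in {1, −1} (toy predictions, names are ours):

def prf1(H, V):
    tp = sum(h == 1 and v == 1 for h, v in zip(H, V))
    fp = sum(h == 1 and v == -1 for h, v in zip(H, V))
    fn = sum(h == -1 and v == 1 for h, v in zip(H, V))
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return pr, re, 2 * pr * re / (pr + re)

print(prf1([1, 1, -1, 1, -1], [1, -1, -1, 1, 1]))  # (0.667, 0.667, 0.667)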

Evaluation Metrics - Ranking

  • Rank similarity metrics:

– Spearman’s rank coefficient
– Delta Average (introduced at WMT 2012)

      System              Human
      Score   Ranking     Judgment   Ranking
s1    3.2     3           5          1
s2    1       5           1          5
s3    5       1           4          2
s4    2.7     4           2          4
s5    4       2           3          3
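A sketch of Spearman's rank coefficient applied to the table above (assumes distinct scores; tied ranks would need averaging):

def ranks(scores):
    # rank 1 = highest score
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(system, human):
    rs, rh = ranks(system), ranks(human)
    n = len(rs)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, rh))
    return 1 - 6 * d2 / (n * (n * n - 1))

system = [3.2, 1, 5, 2.7, 4]   # s1..s5 system scores from the table
human = [5, 1, 4, 2, 3]        # s1..s5 human judgments
print(spearman(system, human))  # 0.7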


Quality indicators

  • Features can be extracted from:

– The source sentence (“complexity” indicators)
– The translated sentence (“fluency” indicators)
– Source and target sentences (“adequacy” and other indicators)
– The MT system during the translation process (“confidence” indicators)

[Diagram: source sentence → MT system → translated sentence]

Quality indicators - Complexity

  • Capture the difficulty of translating the source sentence
  • Complex sentences are harder to translate:

– source sentence length
– n-gram language model probability
– number of punctuation marks
– source sentence type/token ratio (e.g. #nouns/#tokens)
– avg. # of translations per word (as given by probabilistic dictionaries)
– % of content/non-content words
– …
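A few of these complexity indicators computed naively, as a sketch (real QE systems extract many more, e.g. from language models and probabilistic dictionaries):

def complexity_features(source_sentence):
    toks = source_sentence.split()
    return {
        "length": len(toks),
        "punctuation_marks": sum(t in {",", ".", ";", ":", "!", "?"} for t in toks),
        "type_token_ratio": len(set(toks)) / len(toks),
    }

print(complexity_features("the cat is on the mat ."))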

Quality indicators - Fluency

  • Capture the naturalness of the translation in the target language
  • The translation should conform to the target language in terms of grammar, with lexical choices appropriate to the genre of the source text:

– n-gram language model probability
– POS-tag target language model
– …

Quality indicators - Adequacy

  • Capture the level of semantic equivalence between source and translation
  • Source and target sentences should convey the same meaning; meaning drifts/losses from the source to the target sentence indicate a bad translation:

– % of aligned words in source and target
– % of alignments between words with the same part of speech
– % of aligned nouns/verbs/adjectives
– aligned IDF mass (IDF as an indicator of term relevance)
– …


Quality indicators - Confidence

  • Capture the level of confidence of the SMT system
  • Sentences for which the translation process is complex are more likely to be bad translations:

– length N of the N-best list
– number of pruned hypotheses
– log-likelihood score
– avg. edit distance of the 1-best from the first k-bests
– …

Open Issues

  • Lack of an objective quality score able to capture cognitive effort

– A new score that contains the main features of HTER and correlates well with PE time

  • Lack of a technique able to threshold the quality score (bad vs. good translations)

– Is HTER = 0.3/0.5/0.7 a bad or a good translation?
– Useful in the CAT tool scenario, where it is necessary to discard bad translations

Open Issues

  • More than 1,000 quality indicators have been developed in recent years.

– Do we need all of them in a real application?
– Which are the most reliable in each group?
– Which is the best combination?

  • Subjectivity in the post-editor’s work and in the task

– A single quality estimator for very different post-editor behaviors and tasks
– Adaptability/personalization

MT Evaluation Dilemma

Summary

  • MT evaluation: a hot topic…

– Shared evaluation methods/routines are a key asset in any field

  • …but a difficult task

– We talked about error variability, costs, speed, replicability, subjectivity, correlation with human judgments, etc.


Summary

  • Human evaluation

– Accurate, high quality, meaningful; expensive, slow, subjective

  • Automatic evaluation

– Cheap, quick, repeatable, objective; approximate, less accurate
– Fluency, adequacy
– Reference-based: BLEU, TER, HTER (pros and cons)
– Reference-free: quality estimation (goal, methods, open issues)

Summary

  • Key concepts:

Adequacy, Reference translation, Agreement, Correlation, Post-editing effort, CAT tool, Feature, Cognitive effort, HTER, Mean Absolute Error

Evaluation of Machine Translation Quality

Marco Turchi

FBK Trento, Italy (turchi@fbk.eu)