Translation Quality Estimation: Past, Present, and Future
André Martins, MT Marathon, Lisbon, August 31st, 2017


SLIDE 1

Translation Quality Estimation: Past, Present, and Future

André Martins, MT Marathon, Lisbon, August 31st, 2017

André Martins (Unbabel), Quality Estimation, MTM, 31/8/17

SLIDE 2

This Talk

First part: largely based on Lucia Specia’s MTM16 slides
Second part: joint work with Marcin, Fabio, Ramon, Chris, Roman
Third part: my thoughts on the future of QE

SLIDE 3

Outline

1. MT Evaluation & Quality Estimation
2. Pushing the Limits of Quality Estimation
3. The Future

SLIDE 4

Why Do We Care About Evaluation?

In the business of developing MT, we need to:

• measure progress over new/alternative versions
• compare different MT systems
• decide whether a translation is good enough for a given purpose
• optimize parameters of MT systems
• understand where systems go wrong (diagnosis)
• ...

... remember Yvette’s lecture on Monday:

SLIDE 5

Why Do We Care About Evaluation?

One should optimize a system using the same metric that will be used to evaluate it
Issue: how to choose a metric?
The choice should be related to the system’s purpose (not the case in practice)
Other aspects matter for tuning (sentence/corpus-level, fast, cheap, differentiable, ...)

SLIDE 6

Complex Problem

What does quality mean? Fluent? Adequate? Both? Easy to post-edit? System A better than system B? ...

SLIDE 7

Complex Problem

What does quality mean? Fluent? Adequate? Both? Easy to post-edit? System A better than system B? ...
Quality for whom/what?
• End-user (gisting vs dissemination)
• Post-editor (light vs heavy post-editing)
• Other applications (e.g. CLIR)
• MT system (tuning or diagnosis for improvement)
• ...

SLIDE 8

Complex Problem

MT Do buy this product, it’s their craziest invention!

SLIDE 9

Complex Problem

MT Do buy this product, it’s their craziest invention! HT Do not buy this product, it’s their craziest invention!

SLIDE 10

Complex Problem

MT Do buy this product, it’s their craziest invention!
HT Do not buy this product, it’s their craziest invention!
Severe if the end-user does not speak the source language
Trivial for translators to post-edit

SLIDE 11

Complex Problem

MT Six-hours battery, 30 minutes to full charge last.

SLIDE 12

Complex Problem

MT Six-hours battery, 30 minutes to full charge last. HT The battery lasts 6 hours and it can be fully recharged in 30 minutes.

SLIDE 13

Complex Problem

MT Six-hours battery, 30 minutes to full charge last.
HT The battery lasts 6 hours and it can be fully recharged in 30 minutes.
OK for gisting: the meaning is preserved
Very costly to post-edit if the style is to be preserved

SLIDE 14

A Taxonomy of MT Evaluation Methods

Manual Automatic

SLIDE 15

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Scoring

SLIDE 16

Manual Assessment: Scoring

Is this translation correct?

SLIDE 17

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Scoring Ranking

SLIDE 18

Manual Assessment: Ranking

SLIDE 19

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Scoring Ranking Error annotation

SLIDE 20

MQM (Multidimensional Quality Metrics)

SLIDE 21

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Task-based Scoring Ranking Error annotation Post-editing

SLIDE 22

Amount of Post-Editing

HTER
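HTER (human-targeted TER) measures how many word edits are needed to turn the MT output into its human post-edit, normalized by the post-edit length. A minimal sketch that approximates it with plain word-level Levenshtein distance (real HTER, as computed by TERCOM, also allows block shifts; the function names here are mine):

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def hter(mt, post_edit):
    """Approximate HTER: edits to turn MT into its post-edit, per PE word."""
    mt_toks, pe_toks = mt.split(), post_edit.split()
    return edit_distance(mt_toks, pe_toks) / max(len(pe_toks), 1)
```

An unedited translation thus scores 0.0, and scores grow with the amount of post-editing.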

SLIDE 23

Amount of Post-Editing

SLIDE 24

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Task-based Scoring Ranking Error annotation Post-editing Reading comprehension

SLIDE 25

Reading Comprehension

SLIDE 26

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Task-based Scoring Ranking Error annotation Post-editing Reading comprehension Eye-tracking

SLIDE 27

Eye-Tracking

SLIDE 28

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Task-based Scoring Ranking Error annotation Post-editing Reading comprehension Reference-based Eye-tracking

SLIDE 29

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Task-based Scoring Ranking Error annotation Post-editing Reading comprehension Reference-based Quality estimation

BLEU, Meteor, NIST, TER, WER, PER, CDER, BEER, CiDER, Cobalt, RATATOUILLE, RED, AMBER, PARMESAN, ...

Eye-tracking

SLIDE 30

Reference-Based Evaluation

Reference(s): a subset of the good translations, usually just one

Some metrics expand matching, e.g. synonyms in Meteor

Huge variation in reference translations, e.g.:

Source  不过这一切都由不得你 (However these all totally beyond the control of you.)
MT      But all this is beyond the control of you.

                                                      Human score  BLEU score
HT1  But all this is beyond your control.                     3.4       0.427
HT2  However, you cannot choose yourself.                     2         0.049
HT3  However, not everything is up to you to decide.          2         0.050
HT4  But you can’t choose that.                               2.8       0.055

SLIDE 31

Reference-Based Evaluation

Reference(s): a subset of the good translations, usually just one

Some metrics expand matching, e.g. synonyms in Meteor

Huge variation in reference translations, e.g.:

Source  不过这一切都由不得你 (However these all totally beyond the control of you.)
MT      But all this is beyond the control of you.

                                                      Human score  BLEU score
HT1  But all this is beyond your control.                     3.4       0.427
HT2  However, you cannot choose yourself.                     2         0.049
HT3  However, not everything is up to you to decide.          2         0.050
HT4  But you can’t choose that.                               2.8       0.055

Metrics completely disregard the source segment

SLIDE 32

Reference-Based Evaluation

Reference(s): a subset of the good translations, usually just one

Some metrics expand matching, e.g. synonyms in Meteor

Huge variation in reference translations, e.g.:

Source  不过这一切都由不得你 (However these all totally beyond the control of you.)
MT      But all this is beyond the control of you.

                                                      Human score  BLEU score
HT1  But all this is beyond your control.                     3.4       0.427
HT2  However, you cannot choose yourself.                     2         0.049
HT3  However, not everything is up to you to decide.          2         0.050
HT4  But you can’t choose that.                               2.8       0.055

Metrics completely disregard the source segment
Main problem: they cannot be applied to MT systems in use

SLIDE 33

A Taxonomy of MT Evaluation Methods

Manual Automatic

Direct asses. Task-based Scoring Ranking Error annotation Post-editing Reading comprehension Reference-based Quality estimation

BLEU, Meteor, NIST, TER, WER, PER, CDER, BEER, CiDER, Cobalt, RATATOUILLE, RED, AMBER, PARMESAN, ...

Eye-tracking

SLIDE 34

Quality Estimation (Specia et al., 2013)

Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly

SLIDE 35

Quality Estimation (Specia et al., 2013)

Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly Quality defined by the data: purpose is clear, no comparison to references, source considered

SLIDE 36

Quality Estimation (Specia et al., 2013)

Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly Quality defined by the data: purpose is clear, no comparison to references, source considered Quality = Can we publish it as is?

SLIDE 37

Quality Estimation (Specia et al., 2013)

Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly Quality defined by the data: purpose is clear, no comparison to references, source considered Quality = Can we publish it as is? Quality = Can a reader get the gist?

SLIDE 38

Quality Estimation (Specia et al., 2013)

Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly Quality defined by the data: purpose is clear, no comparison to references, source considered Quality = Can we publish it as is? Quality = Can a reader get the gist? Quality = Is it worth post-editing it?

SLIDE 39

Quality Estimation (Specia et al., 2013)

Quality Estimation (QE): metrics that provide an estimate on the quality of translations on the fly Quality defined by the data: purpose is clear, no comparison to references, source considered Quality = Can we publish it as is? Quality = Can a reader get the gist? Quality = Is it worth post-editing it? Quality = How much effort to fix it?

SLIDE 40

Related: Confidence in MT (Blatz et al., 2004; Ueffing and Ney, 2007)

Goal: augment the MT system to produce a confidence score
Quality Estimation is slightly more general and advantageous:
• it does not require access to the internals of the MT system (i.e. the MT system can be treated as a black box)
• it makes it possible to use several MT systems, several domains, etc.

SLIDE 41

Quality Estimation: Framework

Building a model:

X (examples of source texts & translations) and Y (quality scores for the examples in X) go through feature extraction; the resulting features are fed to machine learning, which produces the QE model. (Slide credit: Lucia Specia)

SLIDE 42

Quality Estimation: Framework

Applying the model:

A source text x_s′ is translated by the MT system into x_t′; feature extraction turns the pair into features, and the QE model outputs a quality score y′. (Slide credit: Lucia Specia)

SLIDE 43

Data and Levels of Granularity

Sentence level: 1–5 subjective scores, PE time, PE edits
Word level: OK/BAD, good/delete/replace, MQM
Phrase level: good/bad
Document level: PE effort

SLIDE 44

Features and Algorithms

Confidence features are extracted from the MT system, complexity features from the source text, fluency features from the translation, and adequacy features from the source/translation pair.

Algorithms can be used off-the-shelf

SLIDE 45

Sentence-Level QE: Baseline Setting

Features:

• number of tokens in the source and target sentences
• average source token length
• average number of occurrences of words in the target
• number of punctuation marks in source and target sentences
• LM probability of source and target sentences
• average number of translations per source word
• % of seen source n-grams
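To show how shallow these baseline features are, here is a minimal sketch computing a few of them (function name and dict layout are mine; the LM probabilities are assumed to be computed elsewhere and passed in):

```python
import string

def baseline_features(source, target, src_lm_logprob=None, tgt_lm_logprob=None):
    """A few of the sentence-level baseline QE features (illustrative sketch)."""
    src, tgt = source.split(), target.split()
    feats = {
        "src_len": len(src),                                    # tokens in source
        "tgt_len": len(tgt),                                    # tokens in target
        "avg_src_tok_len": sum(map(len, src)) / max(len(src), 1),
        "avg_tgt_occ": len(tgt) / max(len(set(tgt)), 1),        # avg occurrences per target type
        "src_punct": sum(t in string.punctuation for t in src),
        "tgt_punct": sum(t in string.punctuation for t in tgt),
    }
    if src_lm_logprob is not None:
        feats["src_lm"] = src_lm_logprob
    if tgt_lm_logprob is not None:
        feats["tgt_lm"] = tgt_lm_logprob
    return feats
```

These feature vectors are what the SVM regressor mentioned on the next slides is trained on.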

SLIDE 46

Sentence-Level QE: Baseline Setting

Features:

• number of tokens in the source and target sentences
• average source token length
• average number of occurrences of words in the target
• number of punctuation marks in source and target sentences
• LM probability of source and target sentences
• average number of translations per source word
• % of seen source n-grams

SVM regression with RBF kernel

SLIDE 47

Sentence-Level QE: Baseline Setting

Features:

• number of tokens in the source and target sentences
• average source token length
• average number of occurrences of words in the target
• number of punctuation marks in source and target sentences
• LM probability of source and target sentences
• average number of translations per source word
• % of seen source n-grams

SVM regression with RBF kernel

QuEst: http://www.quest.dcs.shef.ac.uk/

SLIDE 48

Sentence-Level QE: SOTA

Predicting HTER (WMT16), English-German:

System ID                    Pearson ↑  Spearman ↑
YSDA/SNTX+BLEU+SVM           0.525      –
POSTECH/SENT-RNN-QV2         0.460      0.483
SHEF-LIUM/SVM-NN-emb-QuEst   0.451      0.474
POSTECH/SENT-RNN-QV3         0.447      0.466
SHEF-LIUM/SVM-NN-both-emb    0.430      0.452
UGENT-LT3/SCATE-SVM2         0.412      0.418
UFAL/MULTIVEC                0.377      0.410
RTM/RTM-FS-SVR               0.376      0.400
UU/UU-SVM                    0.370      0.405
UGENT-LT3/SCATE-SVM1         0.363      0.375
RTM/RTM-SVR                  0.358      0.384
Baseline SVM                 0.351      0.390
SHEF/SimpleNets-SRC          0.182      –
SHEF/SimpleNets-TGT          0.182      –
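Systems in this shared task are ranked by Pearson correlation between predicted and gold HTER scores; for reference, a minimal implementation of the metric:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and gold scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```

In practice one would use a library routine (e.g. scipy.stats.pearsonr), but the definition is this simple.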

SLIDE 49

Sentence-Level QE: SOTA

Predicting HTER (WMT17)

SLIDE 50

Word-Level QE: SOTA

Word-level ok/bad labels (WMT16)

SLIDE 51

Word-Level QE: SOTA

Word-level ok/bad labels (WMT17)

SLIDE 52

Outline

1. MT Evaluation & Quality Estimation
2. Pushing the Limits of Quality Estimation
3. The Future

SLIDE 53

Pushing the Limits of Quality Estimation

Recent TACL paper:

André F. T. Martins, Marcin Junczys-Dowmunt, Fabio Kepler, Ramón Astudillo, Chris Hokamp, Roman Grundkiewicz. “Pushing the Limits of Quality Estimation.” TACL, 5:205–218, 2017.

SLIDE 54

In a Nutshell

Quality estimation: evaluate a translation “on the fly” with no reference
Useful for predicting (or sidestepping) human post-editing effort
Until now: not really accurate enough for practical use

SLIDE 55

In a Nutshell

Quality estimation: evaluate a translation “on the fly” with no reference
Useful for predicting (or sidestepping) human post-editing effort
Until now: not really accurate enough for practical use
This paper: considerable improvements by:
• stacking a neural system and a linear sequential model
• using automatic post-editing as an auxiliary task

SLIDE 56

In a Nutshell

Quality estimation: evaluate a translation “on the fly” with no reference
Useful for predicting (or sidestepping) human post-editing effort
Until now: not really accurate enough for practical use
This paper: considerable improvements by:
• stacking a neural system and a linear sequential model
• using automatic post-editing as an auxiliary task
Overall (on the WMT16 En-De dataset):
• 57.47% (+7.95%) word-level F1-MULT
• 65.56% (+13.26%) sentence-level Pearson score

SLIDE 57

“But isn’t MT nearly indistinguishable from humans now?”

SLIDE 58

SLIDE 59

Wrong translation!

SLIDE 60

Wrong translation!

We can fix it with a human post-editor!

SLIDE 61

Le travail de La traduction automatique fonctionne-t-elle?

We can fix it with a human post-editor!

SLIDE 62

Le travail de La traduction automatique fonctionne-t-elle?

BAD OK OK OK BAD BAD

quality estimation

We can fix it with a human post-editor!

SLIDE 63

Quality Estimation (Blatz et al., 2004; Specia et al., 2013)

Le travail de La traduction automatique fonctionne-t-elle?

BAD OK OK OK BAD BAD

Sentence-level QE: predict edit distance (HTER)
Word-level QE: predict an OK/BAD label for each translated word

SLIDE 64

Quality Estimation (Blatz et al., 2004; Specia et al., 2013)

Le travail de La traduction automatique fonctionne-t-elle?

BAD OK OK OK BAD BAD

Sentence-level QE: predict edit distance (HTER)
Word-level QE: predict an OK/BAD label for each translated word
This paper: we first engineer a strong word-level QE system, then convert it to make sentence-level predictions too

SLIDE 65

Why Quality Estimation?

1. It informs an end user about the reliability of translated content
2. It decides if a translation is good to go or requires a human to fix it
3. It highlights to a human post-editor the words that need to be revised

SLIDE 66

Unbabel Translation Pipeline

(Credit: João Graça’s EAMT presentation)

SLIDE 67

Dataset of Post-Edited Translations

WMT 2016: English-German, IT domain
12,000 training sentences, 1,000 dev sentences, 2,000 test sentences
Quality labels obtained from post-edited translations via TERCOM (Snover et al., 2006)
In the paper: also experiments with WMT 2015 (English-Spanish)
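The labels come from aligning each MT sentence to its post-edit: MT words that survive into the post-edit are OK, the rest BAD. A rough sketch of this idea (the real pipeline uses TERCOM, which also handles block shifts; this approximation uses a plain longest-common-subsequence alignment):

```python
import difflib

def word_labels(mt_tokens, pe_tokens):
    """Approximate TERCOM-style word labels: MT words kept in the
    post-edit are OK, everything else BAD."""
    labels = ["BAD"] * len(mt_tokens)
    sm = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "OK"
    return labels
```

For example, an MT output with one mistranslated word gets a single BAD in the corresponding position.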

SLIDE 68

Model #1: Linear Sequential Model

First shot: a discriminative, shallow, CRF-like model

BAD OK OK OK BAD BAD

The model predicts a label sequence ŷ_{1:N} ∈ {OK, BAD}^N:

    ŷ_{1:N} = argmax_{y_{1:N}}  Σ_{i=1}^{N} w · φ_u(s, t, A, y_i)            [unigram features]
                              + Σ_{i=1}^{N+1} w · φ_b(s, t, A, y_i, y_{i−1})   [bigram features]
SLIDE 69

Features

Unigram features (y_i ∧ ..., referenced by the i-th target word): Bias; Word, LeftWord, RightWord; SourceWord, SourceLeftWord, SourceRightWord; LargestNGramLeft/Right; SourceLargestNGramLeft/Right; PosTag, SourcePosTag; Word+LeftWord, Word+RightWord; Word+SourceWord, PosTag+SourcePosTag
Simple bigram features (y_i ∧ y_{i−1} ∧ ...): Bias
Rich bigram features (y_i ∧ y_{i−1} ∧ ... and y_{i+1} ∧ y_i ∧ ...): all of the above; Word+SourceWord, PosTag+SourcePosTag
Syntactic features (y_i ∧ ...): DepRel, Word+DepRel; HeadWord/PosTag+Word/PosTag; LeftSibWord/PosTag+Word/PosTag; RightSibWord/PosTag+Word/PosTag; GrandWord/PosTag+HeadWord/PosTag+Word/PosTag

SLIDE 70

Performance of Linear Model (Dev Set)

F1-MULT (WMT16, dev):
  unigrams only      40.05
  +simple bigrams    40.63
  +rich bigrams      43.65
  +syntactic (full)  46.11

Large impact of rich bigram features (3 points) and syntactic features (another 2.5 points); the net improvement exceeds 6 points over the unigram model
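F1-MULT, the word-level metric reported throughout these slides (and used in the WMT16 word-level task), is the product of the F1 scores of the OK and the BAD class. A minimal implementation:

```python
def f1_mult(gold, pred):
    """F1-MULT: product of per-class F1 for OK and BAD word labels."""
    def f1(cls):
        tp = sum(g == p == cls for g, p in zip(gold, pred))
        gold_pos = sum(g == cls for g in gold)
        pred_pos = sum(p == cls for p in pred)
        prec = tp / pred_pos if pred_pos else 0.0
        rec = tp / gold_pos if gold_pos else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1("OK") * f1("BAD")
```

Taking the product penalizes systems that do well on the majority OK class but poorly on BAD, or vice versa.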

SLIDE 71

Model #2: Neural Model

Second shot: a model inspired by QUETCH (Kreutzer et al., 2015), a feedforward neural net whose inputs are target words and their aligned source words
We depart from QUETCH in a few aspects:
• we add recurrent layers
• more depth
• embeddings for the POS tags (not just the words)
• dropout regularization
• layer normalization
In the paper: ablation experiments to validate the architecture

SLIDE 72

Architecture: embeddings for the source word, source POS, target word and target POS (3 × 64 and 3 × 50, i.e. a window of 3 positions), stacks of feedforward layers (100 + 50, 2 × 200, 2 × 400) interleaved with two BiGRU layers (100 and 200 units), and a final softmax producing OK/BAD.

SLIDE 73

Linear vs Neural Models (Test Set)

F1-MULT (WMT16, test):
  UGENT system  41.10
  LinearQE      46.16
  NeuralQE      47.29

UGENT is Tezcan et al. (2016), the best system at WMT16 after ours; both our linear and neural systems outperform it by a big margin

SLIDE 74

Model #3: Stacked Architecture

Stacked learning is a simple and effective way of ensembling structured models (Cohen and de Carvalho, 2005; Martins et al., 2008)
Key idea: include the neural model’s prediction as an additional feature in the linear model
Better performance if we use several neural models with different random initializations and data shuffles
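The stacking step itself is simple: each word's feature vector in the linear model is extended with the neural model's output. A minimal sketch (names are mine; in practice one column per ensemble member would be appended):

```python
def stack_features(linear_feats, neural_prob_ok):
    """Append the neural model's per-word OK probability as one extra
    feature column for the linear (stacked) model."""
    return [feats + [p] for feats, p in zip(linear_feats, neural_prob_ok)]
```

The linear model is then retrained on these augmented vectors, learning how much to trust the neural predictions in each context.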

SLIDE 75

Stacking Linear and Neural Models

F1-MULT (WMT16):
  UGENT system  41.10
  LinearQE      46.16
  NeuralQE      47.29
  StackedQE     50.27

Large improvement (3 points) just by combining the two systems! This shows that they are highly complementary. We won the WMT16 shared task with this approach ... but can we still do better?

SLIDE 76

Stacking Linear and Neural Models

F1-MULT (WMT16):
  UGENT system  41.10
  LinearQE      46.16
  NeuralQE      47.29
  StackedQE     50.27

Large improvement (3 points) just by combining the two systems! This shows that they are highly complementary. We won the WMT16 shared task with this approach ... but can we still do better? Yes.

SLIDE 77

Automatic Post-Editing (Simard et al., 2007)

... remember Marcin’s lecture on Wednesday:

SLIDE 78

Automatic Post-Editing (Simard et al., 2007)

Le travail de La traduction automatique fonctionne-t-elle?

BAD OK OK OK BAD BAD

Goal: automatically correct the output of MT
While word-level QE detects mistakes, APE seeks to correct them
Still, the two tasks are pretty similar.

SLIDE 79

Roundtrip and Log-Linear Combination

Current best APE system (Junczys-Dowmunt and Grundkiewicz, 2016):
• generate a large amount of artificial data (“roundtrip translations”)
• train two NMT systems: s → p and t → p
• combine the two systems with a log-linear model
Our strategy:
• train an APE system tuned for QE
• at test time, project the predicted post-edited text p onto quality labels

SLIDE 80

Roundtrip and Log-Linear Combination

Current best APE system (Junczys-Dowmunt and Grundkiewicz, 2016):
• generate a large amount of artificial data (“roundtrip translations”)
• train two NMT systems: s → p and t → p
• combine the two systems with a log-linear model
Our strategy:
• train an APE system tuned for QE
• at test time, project the predicted post-edited text p onto quality labels
Two key differences with respect to the other “pure QE” systems:
• learned from finer-grained information (the post-edited text)
• a lot more data (500,000 roundtrip translations)

SLIDE 81

Model #4: APE-Based Quality Estimation

F1-MULT (WMT16):
  UGENT system  41.10
  LinearQE      46.16
  NeuralQE      47.29
  StackedQE     50.27
  APE-QE        55.68

This strategy outperforms the “pure” QE systems by 5 points!

SLIDE 82

Model #5: Combining Linear, Neural, and APE

F1-MULT (WMT16):
  UGENT system   41.10
  LinearQE       46.16
  NeuralQE       47.29
  StackedQE      50.27
  APE-QE         55.68
  FullStackedQE  57.47

... combining all the models together, we get another improvement of 2 points!

SLIDE 83

Examples

Source          Combines the hue value of the blend color with the luminance and saturation of the base color to create the result color.
MT              Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
PE (Reference)  Kombiniert den Farbtonwert der Mischfarbe mit der Luminanz und Sättigung der Grundfarbe.
APE             Kombiniert den Farbton der Mischfarbe mit der Luminanz und die Sättigung der Grundfarbe, um die Ergebnisfarbe zu erstellen.
StackedQE       Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
ApeQE           Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.
FullStackedQE   Kombiniert den Farbton Wert der Angleichungsfarbe mit der Luminanz und Sättigung der Grundfarbe zu erstellen.

SLIDE 84

Performance over Sentence Length

Figure: Average number of words predicted as BAD by the different systems on the WMT16 gold dev set, for different bins of sentence length.

SLIDE 85

Sentence-Level Quality Estimation

Now that we have a strong word-level system, we apply a very simple procedure to obtain sentence-level predictions:
• pure QE: convert the word-level predictions into a sentence-level HTER prediction by counting the fraction of BAD words
• APE: directly use the HTER between t and p
• combined system: average of the two above
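This word-to-sentence conversion is nearly a one-liner; a sketch covering the cases above (function name is mine):

```python
def sentence_hter(word_labels, ape_hter=None):
    """Word-to-sentence conversion: fraction of BAD word labels,
    optionally averaged with an APE-derived HTER (combined system)."""
    frac_bad = sum(label == "BAD" for label in word_labels) / max(len(word_labels), 1)
    if ape_hter is None:
        return frac_bad
    return (frac_bad + ape_hter) / 2
```

No further training or tuning is involved, which makes the sentence-level numbers on the next slide all the more striking.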

SLIDE 86

Sentence-Level Quality Estimation

Pearson correlation (WMT16):
  YANDEX system  52.5
  StackedQE      54.93
  APE-QE         61.27
  FullStackedQE  65.56

YANDEX was the best sentence-level system at WMT16 (Kozlova et al., 2016). Our simple conversion led to impressive results, well above the SOTA!

SLIDE 87

This Year’s WMT17: Other Competitive Methods

SLIDE 88

This Year’s WMT17: Other Competitive Methods

SLIDE 89

Conclusions

New SOTA systems for word-level and sentence-level QE, considerably more accurate than previously existing systems:
• First, we proposed a new pure QE system which stacks a linear and a neural system
• Then, by relating the tasks of APE and word-level QE, we derived a new APE-based QE system
• Finally, we combined the two systems via a full stacking architecture
• The full system was extended to sentence-level QE by virtue of a simple word-to-sentence conversion (no further training or tuning)

SLIDE 90

Outline

1. MT Evaluation & Quality Estimation
2. Pushing the Limits of Quality Estimation
3. The Future

SLIDE 91

The Future of QE

Word level: go beyond OK/BAD labels
• look at missing words relative to the source (particularly relevant for NMT)
• mistakes of NMT systems have different patterns than those of PBMT
• how to estimate adequacy?
Sentence level:
• multi-task learning with the word level
• predict MQM scores (useful for quality assurance of crowd-sourced translation)
Document level:
• take inter-sentential context into account (very relevant for chat)
• evaluate the global coherence of the translation

SLIDE 92

The Future of QE

Beyond MT: human translation quality estimation
• much harder: humans have higher variability
• hard to distinguish good non-literal translations from complete rubbish

SLIDE 93

We’re Hiring!

Excited about MT, crowdsourcing and Lisbon? ⇒ jobs@unbabel.com.

SLIDE 94

Acknowledgments

EXPERT project (EU Marie Curie ITN No. 317471)
Fundação para a Ciência e a Tecnologia (FCT), through contracts UID/EEA/50008/2013 and UID/CEC/50021/2013
LearnBig project (PTDC/EEI-SII/7092/2014)
GoLocal project (grant CMUPERI/TIC/0046/2014)
Amazon Academic Research Awards program

SLIDE 95

References I

Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and Ueffing, N. (2004). Confidence estimation for machine translation. In Proc. of the International Conference on Computational Linguistics, page 315.

Cohen, W. W. and de Carvalho, V. R. (2005). Stacked sequential learning. In IJCAI.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Kozlova, A., Shmatova, M., and Frolov, A. (2016). YSDA participation in the WMT’16 quality estimation shared task. In Proceedings of the First Conference on Machine Translation, pages 793–799.

Kreutzer, J., Schamoni, S., and Riezler, S. (2015). Quality estimation from scratch (QUETCH): Deep learning for word-level translation quality estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 316–322.

Martins, A. F. T., Das, D., Smith, N. A., and Xing, E. P. (2008). Stacking dependency parsers. In Proc. of Empirical Methods in Natural Language Processing.

Simard, M., Ueffing, N., Isabelle, P., and Kuhn, R. (2007). Rule-based translation with statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 203–206.

SLIDE 96

References II

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 223–231.

Specia, L., Shah, K., de Souza, J. G., and Cohn, T. (2013). QuEst - a translation quality estimation framework. In Proc. of the Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 79–84.

Tezcan, A., Hoste, V., and Macken, L. (2016). UGENT-LT3 SCATE submission for WMT16 shared task on quality estimation. In Proceedings of the First Conference on Machine Translation, pages 843–850, Berlin, Germany. Association for Computational Linguistics.

Ueffing, N. and Ney, H. (2007). Word-level confidence estimation for machine translation. Computational Linguistics, 33(1):9–40.
