

SLIDE 1

Online Versus Offline NMT Quality

An In-depth Analysis on English-German and German-English

Maha Elbayad (1,2), Michael Ustaszewski (3), Emmanuelle Esperança-Rodier (1), Francis Brunet-Manquat (1), Jakob Verbeek (4), Laurent Besacier (1)


SLIDE 2

Outline

1. Introduction to online translation
2. Neural architectures for online NMT
   a. Transformer (Vaswani et al. 2017)
   b. Pervasive Attention (Elbayad et al. 2018)
3. Automatic evaluation
4. Human evaluation
5. Conclusion

Elbayad et al. Online vs. Offline NMT Quality 1 / 16

SLIDE 3

Online Neural Machine Translation

[Figure: source/target token grids (x1…x7 </s> against <s> y1…y8 </s>) contrasting offline translation, where the full source is read before decoding begins, with online translation, where decoding starts from a source prefix.]

Elbayad et al. Online vs. Offline NMT Quality 2 / 16

SLIDE 4

Wait-k Decoders for Online Translation

∀t ∈ [1..|y|],  z_t^{wait-k} = min(k + t − 1, |x|)

[Figure: source/target grids showing the wait-1, wait-3 and wait-∞ decoding paths over x1…x5 and y1…y5.]

Wait-k, or prefix-to-prefix decoding (Dalvi et al. 2018; Ma et al. 2019; Elbayad et al. 2020).

Elbayad et al. Online vs. Offline NMT Quality 3 / 16
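The wait-k schedule above is simple enough to sketch directly; a minimal Python illustration (the function name is ours):

```python
def waitk_context(k: int, t: int, src_len: int) -> int:
    """Number of source tokens available when emitting target token t
    (1-indexed) under the wait-k schedule z_t = min(k + t - 1, |x|)."""
    return min(k + t - 1, src_len)

# With k = 3 and a 5-token source, the decoder sees 3, 4, 5, 5, 5 source
# tokens at target steps t = 1..5; a very large k reduces to offline decoding.
schedule = [waitk_context(3, t, 5) for t in range(1, 6)]
```

Note that the schedule saturates at |x|: once the full source has been read, decoding proceeds as in the offline case.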


SLIDE 6

Online Transformer

- Unidirectional encoder (Elbayad et al. 2020): each encoder state attends only to already-read source tokens, so the states over x1…x6 need not be recomputed as the available context grows from z_t = 4 to z_{t+1} = 5.
- Masked decoder: the cross-attention energies over the encoder states s1…s6 are masked with respect to z_t, so the decoder state h_{t−1} attends only to the first z_t encoder states.

Elbayad et al. Online vs. Offline NMT Quality 4 / 16
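The decoder-side masking can be sketched as a boolean cross-attention mask; a minimal NumPy illustration assuming the wait-k schedule (names are ours):

```python
import numpy as np

def waitk_cross_attention_mask(tgt_len: int, src_len: int, k: int) -> np.ndarray:
    """True where target step t (0-indexed) may attend to source position j,
    i.e. j < z_t with z_t = min(k + t, src_len)."""
    mask = np.zeros((tgt_len, src_len), dtype=bool)
    for t in range(tgt_len):
        mask[t, : min(k + t, src_len)] = True
    return mask

# Positions where the mask is False get -inf attention energy, so they
# receive zero weight after the softmax.
```

This is the wait-k special case; in general z_t can follow any monotone read/write schedule.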

SLIDE 7

The Pervasive Attention Architecture

[Figure: a 2D source (x) × target (y) grid. The concatenated source-target embeddings H0 pass through convolutional feature maps H1, H2, …, HN to produce Hconv, which is aggregated into Hout, yielding p(y_1 | y_<1, x) … p(y_|y| | y_<|y|, x).]

(Elbayad et al. 2018)

Elbayad et al. Online vs. Offline NMT Quality 5 / 16
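The 2D grid the convolutions operate on is built from pairwise concatenations; a minimal NumPy sketch of constructing H0 (shapes and names are our assumptions):

```python
import numpy as np

def build_h0(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """H0[t, j] concatenates the embedding of target token t with that of
    source token j, giving a (|y|, |x|, 2d) grid for the 2D convolutions."""
    tgt_len, d = tgt_emb.shape
    src_len, _ = src_emb.shape
    h0 = np.empty((tgt_len, src_len, 2 * d))
    for t in range(tgt_len):
        for j in range(src_len):
            h0[t, j] = np.concatenate([tgt_emb[t], src_emb[j]])
    return h0
```

Every target position thus sees every source position from the first layer on, which is what replaces the usual encoder-decoder attention.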

SLIDE 8

Online Pervasive Attention

[Figure: source × target grid with a 2D causal convolution window W ending at source position x_{z_t} and target tokens y_{t−1}, y_t.]

- Masking the future source positions yields unidirectional encoding.
- The appropriate source context size z_t is controlled during feature aggregation.

Elbayad et al. Online vs. Offline NMT Quality 6 / 16
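Controlling the context during aggregation means pooling each target row only over its allowed source prefix; a minimal NumPy sketch, using max-pooling as the aggregation (one of the choices explored by Elbayad et al. 2018; names are ours):

```python
import numpy as np

def aggregate(features: np.ndarray, z: list[int]) -> np.ndarray:
    """Pool the source axis of a (|y|, |x|, d) feature grid, restricting
    row t to its first z[t] source positions (max-pooling shown)."""
    return np.stack([features[t, : z[t]].max(axis=0)
                     for t in range(features.shape[0])])
```

With z[t] = |x| for all t this recovers the offline model's aggregation.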

SLIDE 10

Training and Evaluation Setup

Data

- IWSLT’14 De-En and En-De (Cettolo et al. 2014).
- Sentences longer than 175 words and pairs with a length ratio above 1.5 are removed.
- The data is tokenized but not lowercased.
- Sequences are BPE-segmented (Sennrich et al. 2016) into a 32K vocabulary.
- Training = 160K, development = 7.3K, test = 6.7K sentence pairs.

Elbayad et al. Online vs. Offline NMT Quality 7 / 16
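The filtering rule is easy to make concrete; a sketch of the preprocessing filter described above (function name and whitespace word-splitting are our assumptions):

```python
def keep_pair(src: str, tgt: str, max_len: int = 175, max_ratio: float = 1.5) -> bool:
    """Drop sentences longer than 175 words and pairs whose
    length ratio exceeds 1.5."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src > max_len or n_tgt > max_len:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio
```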

Models

- For each direction and each architecture, one online and one offline model.
- Pervasive Attention (PA) with 14 layers and 7 × 7 filters (effectively 4 × 4).
- Transformer (TF) small.
- Online models are trained with k_train = 7 and evaluated with k_eval = 3.
- Greedy decoding for all models.

SLIDE 21

Metrics and Analysis Factors

- Translation quality: BLEU (Papineni et al. 2002), METEOR (Lavie et al. 2007) and TER (Snover et al. 2006); ROUGE-L (Lin 2004) and BERTScore (Zhang et al. 2020) in the paper.
- Translation delay: Average Lagging (AL) (Ma et al. 2019).
- Source length |x|; target-side and source-side relative positions in the paper.
- Lagging difficulty LD(x, y): from fast-align word alignments, estimate the ideal decoding path, then take its average distance from the wait-0 path.

[Figure: source × target grids showing the fast-align alignments, the estimated ideal decoding path, and its average distance from wait-0.]

Elbayad et al. Online vs. Offline NMT Quality 8 / 16
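AL, the delay metric above, has a compact definition; a sketch following Ma et al. (2019), where g(t) is the number of source tokens read before emitting target token t:

```python
def average_lagging(g: list[int], src_len: int, tgt_len: int) -> float:
    """AL = (1/tau) * sum_{t=1..tau} [ g(t) - (t-1)/r ], with r = |y|/|x|
    and tau the first step at which the full source has been read."""
    r = tgt_len / src_len
    tau = next(t for t, g_t in enumerate(g, start=1) if g_t == src_len)
    return sum(g[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau
```

For example, a wait-1 path over a length-5 pair lags by exactly one source token, while reading the whole source before writing gives AL equal to |x|.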

SLIDE 27

Automatic Evaluation

Bold = the better-scoring system. Underlined = better than its competitor with at least 95% statistical significance.

            ---------- De→En ----------    ---------- En→De ----------
            PA Off   PA On   TF Off  TF On    PA Off  PA On   TF Off  TF On
 ↑BLEU       31.24   26.44   31.13   26.57    26.03   23.04   26.60   22.98
 ↑METEOR     28.95   25.97   29.25   25.65    38.81   35.72   39.37   35.35
 ↓TER         0.56    0.62    0.56    0.64     0.63    0.68    0.62    0.69
 AL          21.10    2.59   21.10    3.16    20.71    3.33   20.71    3.49

Elbayad et al. Online vs. Offline NMT Quality 9 / 16
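The 95% significance testing mentioned in the legend is typically done by paired bootstrap resampling; a minimal sketch in the style of Koehn (2004) over per-sentence scores (names and score granularity are our assumptions, not the authors' exact procedure):

```python
import random

def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A's total score
    beats system B's; values near 1.0 suggest A is significantly better."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        wins += sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx)
    return wins / n_resamples
```

A win rate of at least 0.95 corresponds to the "at least 95%" threshold used in the table.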

SLIDE 30

Automatic Evaluation

[Figure: BLEU as a function of binned source length and of binned lagging difficulty, for De→En and En→De, comparing PA-offline, PA-online, TF-offline and TF-online.]

Lagging difficulty (LD) is highly correlated with translation quality (BLEU) in both the online and offline settings.

Elbayad et al. Online vs. Offline NMT Quality 10 / 16
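The correlation claim above can be checked numerically per bin; a minimal sketch (using Pearson's r over bin-level BLEU is our assumption about how one would quantify it):

```python
def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

A strongly negative r between LD bins and BLEU would match the slide's observation that harder lagging goes with lower quality.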

SLIDE 32

Human Evaluation

Elbayad et al. Online vs. Offline NMT Quality 11 / 16

SLIDE 33

Annotation Setup

- 200 segments per language pair: Q1 ≤ |x| ≤ Q3, sampled equally over lagging-difficulty bins.
- ACCOLÉ (Esperança-Rodier et al. 2019): a web interface for annotating error spans and types in source and target.
- Annotators: two native translation experts per pair, with annotation training on a calibration set.
- Inter-annotator agreement compatible with other MQM-based studies: Cohen’s κ = 0.33 (De→En) and 0.40 (En→De).

Elbayad et al. Online vs. Offline NMT Quality 12 / 16
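Cohen's κ, used above for inter-annotator agreement, corrects raw agreement for chance; a minimal sketch for two annotators' label sequences:

```python
def cohens_kappa(ann_a: list[str], ann_b: list[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement p_o, against the
    chance agreement p_e implied by each annotator's label frequencies."""
    n = len(ann_a)
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    labels = set(ann_a) | set(ann_b)
    p_e = sum((ann_a.count(l) / n) * (ann_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

Values of 0.33 and 0.40 sit in the "fair to moderate" agreement range, which is common for fine-grained MQM-style error annotation.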


SLIDE 37

Error Typology

A pilot annotation was carried out to select the subset of the MQM error typology (Lommel et al. 2014) relevant to this study. Errors fall under Accuracy (ac), Fluency (fl) and Other.

- Duplication (du).
  Reference: In high school, a classmate once said [. . .]
  Hypothesis: In high school, once in high school, a fellow told us [. . .]
- Typography (ty).
  Reference: [. . .] meaning of the word "educate" comes from the root word "educe."
  Hypothesis: [. . .] meaning of the word "educate" is rooted in the word "educe.
- Grammar (gr).
  Reference: [. . .] you want to design things as intuitively as possible.
  Hypothesis: [. . .] you want to make things so much intuitively possible.
- Word order (wo).
  Reference: I was given another gift, which was to be able to see into the future [. . .]
  Hypothesis: I got another gift, which is in the future to see [. . .].
- Unintelligible (un).
  Reference: And for them language had inferior importance.
  Hypothesis: What the language undergeals from subordination in the other time was.
- Addition (ad).
  Reference: A couple months went by, and I had just forgotten all about it.
  Hypothesis: A few months ago, I was going to go over, and I just forgot everything.
- Omission (om).
  Reference: And if we can do this for raw data, why not do it for content as well?
  Hypothesis: And if we do this for raw data, why not content itself?
- Mistranslation (mt).
  Reference: [. . .] I immediately went to look up the 2009 online edition [. . .]
  Hypothesis: [. . .] I immediately started to call up the online copy of 2009 [. . .]
- Overly literal (ol).
  Reference: Three months later I had relocated [. . .]
  Hypothesis: Three months later, I was moved [. . .]
- Non-existing word form (ne).
  Reference: the Tanzanian giraffe.
  Hypothesis: the tansanic giraffe

Elbayad et al. Online vs. Offline NMT Quality 13 / 16
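Once spans are annotated with these type codes, per-system error profiles are simple tallies; a sketch over annotation records (the dictionary field name is hypothetical, not the actual ACCOLÉ export format):

```python
from collections import Counter

def error_profile(annotations: list[dict]) -> Counter:
    """Tally annotated error spans by their MQM type code
    (e.g. 'mt', 'du', 'om')."""
    return Counter(a["type"] for a in annotations)
```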

SLIDE 48

Human Evaluation

Bold = the system (PA or TF) with fewer errors, online or offline.

Error counts per system (Offline, Online, relative change ∆%):

De→En
 System                 PA Off  PA On    ∆%   TF Off  TF On    ∆%
 (ad) Addition              76    143   +88       95    160   +68
 (mt) Mistranslation       433    587   +36      457    572   +25
 (ne) Non-existing WF       26     17   -35       14     16   +14
 (om) Omission              67    113   +69       96    127   +32
 (ol) Overly literal        78     95   +22       52     81   +56
 (ac+) Total accuracy      682    956   +40      716    959   +34
 (fl) Fluency               17     20   +18       14     20   +43
 (du) Duplication           11     32  +191       22    144  +555
 (gr) Grammar               57     65   +14       36     34    -6
 (ty) Typography            41     42    +2       33     59   +79
 (wo) Word order            65    105   +62       66     78   +18
 (fl+) Total fluency       193    267   +38      173    337   +95
 (ac+fl) Total             875   1223   +40      889   1296   +46

En→De
 System                 PA Off  PA On    ∆%   TF Off  TF On    ∆%
 (ad) Addition              30     66  +120       35     97  +177
 (mt) Mistranslation       245    260    +6      202    260   +29
 (ne) Non-existing WF       39     58   +49       43     54   +26
 (om) Omission              99     74   -25      126    114   -10
 (ol) Overly literal       150    179   +19      113    125   +11
 (ac+) Total accuracy      563    637   +13      519    651   +25
 (fl) Fluency               26     21   -19       20     24   +20
 (du) Duplication            5     15  +200       13     71  +446
 (gr) Grammar              198    260   +31      142    222   +56
 (ty) Typography            52     92   +77       49     78   +59
 (wo) Word order            46     85   +85       37     74  +100
 (fl+) Total fluency       331    481   +45      272    480   +76
 (ac+fl) Total             894   1118   +25      791   1131   +43

- Overall, mistranslation is the largest contributor to accuracy errors.
- Addition errors increase drastically in En→De online.
- De→En is more prone to omission errors in the online setup.
- NMT systems are more prone to accuracy errors than fluency errors.
- More grammar errors are found in En→De than in De→En.
- Addition, word order and duplication increase the most in the online setup.
- Duplication is extremely problematic for TF.
- PA is more prone to grammar and overly-literal errors.

Elbayad et al. Online vs. Offline NMT Quality 14 / 16

slide-57
SLIDE 57


Human Evaluation

Source length, De→En

[Figure: per-category error distribution (%) by source-length bin (<13, [13,16), [16,18), [18,20), [20,23), ≥23), with accuracy categories (ad, mt, ne, om, ol) and fluency categories (du, gr, ty, un, wo); one panel each for PA-offline, PA-online, TF-offline and TF-online.]

Mistranslation errors peak in short segments, since accuracy errors are easier to spot in short, fluent segments.

Elbayad et al. Online vs. Offline NMT Quality 15 / 16

slide-58
SLIDE 58


Human Evaluation


Lagging difficulty, De→En

[Figure: per-category error distribution (%) by lagging-difficulty bin (<0.9, [0.9,1.5), [1.5,2.0), [2.0,2.7), [2.7,4.1), ≥4.1); one panel each for PA-offline, PA-online, TF-offline and TF-online.]

Addition and omission errors are particularly correlated with lagging difficulty (LD).
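For illustration, a segment's lagging-difficulty score can be assigned to one of the six bins shown on the x-axis as follows (bin edges read off the plot; the helper is ours, not part of the paper's tooling):

```python
import bisect

# Lagging-difficulty bin edges and labels, as shown on the plot's x-axis.
# Bins are left-inclusive: [0.9, 1.5) contains 0.9 but not 1.5.
LD_EDGES = [0.9, 1.5, 2.0, 2.7, 4.1]
LD_LABELS = ["<0.9", "[0.9,1.5)", "[1.5,2.0)", "[2.0,2.7)", "[2.7,4.1)", ">=4.1"]

def ld_bin(score):
    """Map a lagging-difficulty score to its bin label."""
    return LD_LABELS[bisect.bisect_right(LD_EDGES, score)]

print(ld_bin(0.5), ld_bin(2.0), ld_bin(5.0))
```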

Elbayad et al. Online vs. Offline NMT Quality 15 / 16

slide-59
SLIDE 59


Conclusion

◮ We ran the first human evaluation of offline and online NMT systems for spoken language translation.
◮ We highlighted the weaknesses of the wait-k Transformer (duplication) and of Pervasive Attention (grammar, overly literal translations).
◮ We introduced lagging difficulty, a strong indicator that is highly correlated with translation quality, particularly in online translation.
◮ Our annotated data is available at https://github.com/elbayadm/OnlineMT-Evaluation
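The wait-k Transformer evaluated here follows the prefix-to-prefix schedule z_t = min(k + t − 1, |x|) shown earlier in the talk; a minimal sketch of that read schedule (the function name is ours):

```python
# Wait-k read schedule: before writing target token t (1-indexed),
# the decoder has read z_t = min(k + t - 1, |x|) source tokens.
def wait_k_schedule(k, src_len, tgt_len):
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]

print(wait_k_schedule(1, 5, 6))  # wait-1: [1, 2, 3, 4, 5, 5]
print(wait_k_schedule(3, 5, 6))  # wait-3: [3, 4, 5, 5, 5, 5]
```

With k ≥ |x| the schedule reads the full source before writing anything, i.e. wait-∞ (offline decoding).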

Elbayad et al. Online vs. Offline NMT Quality 16 / 16

slide-60
SLIDE 60


References I

Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L. & Federico, M. Report on the 11th IWSLT Evaluation Campaign. In Proc. of IWSLT (2014).

Dalvi, F., Durrani, N., Sajjad, H. & Vogel, S. Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation. In Proc. of NAACL-HLT (2018).

Elbayad, M., Besacier, L. & Verbeek, J. Efficient Wait-k Models for Simultaneous Machine Translation. In Proc. of INTERSPEECH (2020).

Elbayad, M., Besacier, L. & Verbeek, J. Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction. In Proc. of CoNLL (2018).

Elbayad et al. Online vs. Offline NMT Quality 1 / 4

slide-61
SLIDE 61


References II

Esperança-Rodier, E., Brunet-Manquat, F. & Eady, S. ACCOLÉ: A Collaborative Platform of Error Annotation for Aligned Corpora. In Translating and the Computer 41 (2019).

Lavie, A. & Agarwal, A. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proc. of the Second Workshop on Statistical Machine Translation (2007).

Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out (2004).

Lommel, A., Uszkoreit, H. & Burchardt, A. A Framework for Declaring and Describing Translation Quality Metrics. Revista Tradumàtica (2014).

Elbayad et al. Online vs. Offline NMT Quality 2 / 4

slide-62
SLIDE 62


References III

Ma, M., Huang, L., Xiong, H., Zheng, R., Liu, K., Zheng, B., Zhang, C., He, Z., Liu, H., Li, X., Wu, H. & Wang, H. STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework. In Proc. of ACL (2019).

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of ACL (2002).

Sennrich, R., Haddow, B. & Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proc. of ACL (2016).

Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proc. of AMTA (2006).

Elbayad et al. Online vs. Offline NMT Quality 3 / 4

slide-63
SLIDE 63


References IV

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. Attention Is All You Need. In Proc. of NeurIPS (2017).

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proc. of ICLR (2020).

Elbayad et al. Online vs. Offline NMT Quality 4 / 4