Neural Machine Translation: directions for improvement (CMSC 470)



SLIDE 1

Neural Machine Translation: directions for improvement

CMSC 470 Marine Carpuat

SLIDE 2

How can we improve on state-of-the-art machine translation approaches?

  • Model
  • Training
  • Data
  • Objective
  • Algorithm
SLIDE 3

Addressing domain mismatch

Slides adapted from Kevin Duh [Domain Adaptation in Machine Translation, MTMA 2019]

SLIDE 4

Supervised training data is not always in the domain we want to translate!

SLIDE 7

Domain adaptation is an important practical problem in machine translation

  • It may be expensive to obtain training sets that are both large and relevant to the test domain
  • So we often have to work with whatever we can!
SLIDE 8

Possible strategies: “Continued Training” or “fine-tuning”

  • Train on out-of-domain data first, then continue training on a small in-domain parallel corpus [Luong and Manning 2016] (sketched below)
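A minimal sketch of continued training in PyTorch; the checkpoint name, batch iterator, and hyperparameters are assumptions for illustration, not a specific toolkit's API:

```python
import torch

# Assumed: a seq2seq NMT model pre-trained on large out-of-domain parallel
# data, saved to a hypothetical checkpoint, plus a small in-domain batch
# iterator. All names here are illustrative.
model = torch.load("nmt_out_of_domain.pt")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR helps
                                                           # limit forgetting
model.train()
for epoch in range(3):                      # a few passes usually suffice
    for src, tgt in in_domain_batches:      # assumed iterator of tensor pairs
        optimizer.zero_grad()
        loss = model(src, tgt)              # assumed: returns the NLL loss
        loss.backward()
        optimizer.step()
```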

SLIDE 9

Possible strategies: back-translation
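Back-translation creates synthetic in-domain parallel data: train a reverse (target-to-source) model, translate in-domain monolingual target-language text back into the source language, and add the synthetic pairs to training. A sketch of the pipeline, where every function name is a hypothetical placeholder:

```python
# Sketch of the back-translation pipeline. `train_nmt` and the data
# variables are hypothetical placeholders, not a real library API.

def back_translate(mono_target_sentences, reverse_model):
    """Translate target-language monolingual text into the source language."""
    return [reverse_model.translate(t) for t in mono_target_sentences]

# 1. Train a target->source model on the existing parallel data.
reverse_model = train_nmt(parallel_data, direction="tgt->src")

# 2. Generate synthetic source sides for in-domain monolingual target text.
synthetic_src = back_translate(mono_in_domain, reverse_model)
synthetic_pairs = list(zip(synthetic_src, mono_in_domain))

# 3. Train the forward model on real + synthetic parallel data.
forward_model = train_nmt(parallel_data + synthetic_pairs, direction="src->tgt")
```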

SLIDE 10

Possible strategies: data selection

  • Train a language model on data representative of the test domain
  • N-gram count based model [Moore & Lewis 2010]
  • Neural model [Duh et al. 2013]
  • Neural MT model [Junczys-Dowmunt 2018]
  • Use the perplexity of this LM on candidate training data to measure its distance from the test domain (see the sketch below)
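A minimal sketch of the cross-entropy-difference selection of Moore & Lewis (2010), assuming two language models that expose a hypothetical `logprob` method:

```python
def cross_entropy(lm, sentence):
    # Per-token negative log-probability; assumes a hypothetical
    # lm.logprob(tokens) returning the sentence's total log-probability.
    tokens = sentence.split()
    return -lm.logprob(tokens) / max(len(tokens), 1)

def moore_lewis_score(in_domain_lm, general_lm, sentence):
    # Cross-entropy difference [Moore & Lewis 2010]: lower means the
    # sentence looks more like the in-domain data.
    return cross_entropy(in_domain_lm, sentence) - cross_entropy(general_lm, sentence)

def select_in_domain(corpus, in_domain_lm, general_lm, keep_fraction=0.2):
    # Keep the slice of the general corpus closest to the test domain.
    ranked = sorted(corpus, key=lambda s: moore_lewis_score(in_domain_lm, general_lm, s))
    return ranked[: int(len(ranked) * keep_fraction)]
```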

SLIDE 11

Possible strategies: different weights for different training samples

  • Corpus-level weights
  • Instance-level weights, based on a classifier that measures the similarity of training samples to in-domain data (see the sketch below)
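A minimal sketch of instance-level weighting, assuming per-sentence weights from a hypothetical domain-similarity classifier and an assumed `model.nll` interface:

```python
import torch

def weighted_nll(model, src, tgt, weights):
    # `weights`: (batch,) tensor of domain-similarity scores in [0, 1]
    # from an assumed classifier; `model.nll` (assumed) returns one
    # negative log-likelihood per sentence in the batch.
    per_sentence = model.nll(src, tgt)          # shape: (batch,)
    return (weights * per_sentence).sum() / weights.sum()
```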

SLIDE 12

How can we improve on state-of-the-art machine translation approaches?

  • Model
  • Training
  • Data
  • Objective
  • Algorithm
SLIDE 13

Beyond Maximum Likelihood Training

SLIDE 14

How can we improve NMT training?

  • Assumption: References can substitute for predicted translations during training
  • Our hypothesis: Modeling divergences between references and predictions improves NMT

Based on a paper by Weijia Xu et al. [NAACL 2019]

SLIDE 15

Exposure Bias: Gap Between Training and Inference

[Figure: two decoders over the source 我们 做了 晚餐 ("We made dinner"). In maximum likelihood training (left), each step is conditioned on the previous reference word; at inference (right), each step is conditioned on the model's own previous prediction, e.g. after predicting "We will" the next word is uncertain.]

Maximum likelihood training scores each reference word given the reference prefix, and the model probability factors accordingly:

$$\text{Objective} = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x), \qquad P(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)$$
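For concreteness, a sketch of this teacher-forced loss for one sentence; `decoder.init_state`, `decoder.step`, and integer token ids in `reference` are assumed interfaces for illustration:

```python
import torch.nn.functional as F

def mle_loss(decoder, src_encoding, reference):
    # Teacher forcing: every step is conditioned on the *reference* prefix.
    loss = 0.0
    state = decoder.init_state(src_encoding)
    for t in range(1, len(reference)):
        logits, state = decoder.step(reference[t - 1], state)  # feed gold word
        loss = loss - F.log_softmax(logits, dim=-1)[reference[t]]
    return loss
```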

SLIDE 16

How to Address Exposure Bias?

  • Because of exposure bias:
  • Models don't learn to recover from their errors
  • Cascading errors at test time
  • Solution:
  • Expose models to their own predictions during training
  • But how to compute the loss when the partial translation diverges from the reference?

SLIDE 17

Existing Method: Scheduled Sampling

[Figure: decoder over the source 我们 做了 晚餐, reference <s> We made dinner </s>. At step 1 the model predicts "We"; with probability P a coin flip decides whether the next decoder input is the reference word or the model's own prediction.]
[Bengio et al., NeurIPS 2015]

SLIDE 18

Existing Method: Scheduled Sampling

[Figure continued: at step 2 the reference word is "made" but the model predicts "will"; the coin flip selects the model's prediction, so "will" is fed as the next decoder input.]

[Bengio et al., NeurIPS 2015]

SLIDE 19

Existing Method: Scheduled Sampling

[Figure continued: decoding proceeds from the sampled prefix "<s> We will". Reference: <s> We made dinner </s>.]

[Bengio et al., NeurIPS 2015]

SLIDE 20

Existing Method: Scheduled Sampling

[Bengio et al., NeurIPS 2015]

[Figure continued.] The per-step loss scores the reference word given the fed-in prefix:

J = log p(“We” | “<s>”, source)

SLIDE 21

Existing Method: Scheduled Sampling

[Figure continued.]

J = log p(“made” | “<s> We”, source)

[Bengio et al., NeurIPS 2015]

SLIDE 22

Existing Method: Scheduled Sampling

[Figure continued: with the sampled prefix “<s> We will”, the loss still targets the reference word at each position, effectively training toward the incorrect synthetic reference “We will dinner”.]

J = log p(“dinner” | “<s> We will”, source)

[Bengio et al., NeurIPS 2015]
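A minimal sketch of a scheduled sampling loss in the spirit of Bengio et al. (2015); the decoder interface and integer token ids are assumptions for illustration:

```python
import random
import torch.nn.functional as F

def scheduled_sampling_loss(decoder, src_encoding, reference, p_reference):
    # With probability `p_reference`, feed the gold word as the next input;
    # otherwise feed the model's own greedy prediction. `p_reference` is
    # annealed from 1.0 toward 0 over training.
    loss = 0.0
    state = decoder.init_state(src_encoding)
    prev_word = reference[0]                        # <s>
    for t in range(1, len(reference)):
        logits, state = decoder.step(prev_word, state)
        loss = loss - F.log_softmax(logits, dim=-1)[reference[t]]
        if random.random() < p_reference:
            prev_word = reference[t]                # teacher forcing
        else:
            prev_word = logits.argmax().item()      # model's own prediction
    return loss
```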

SLIDE 23

Our Solution: Align Reference with Partial Translations

[Figure: decoder states h1 ... h4 over the source 我们 做了 晚餐, with sampled translation “We will make dinner” and reference <s> We made dinner </s>. The reference word “dinner” is softly aligned (weights a_j) to every sampled prefix.]

a_2 · log p(“dinner” | “<s>”, source)

SLIDE 24

Our Solution: Align Reference with Partial Translations

[Figure continued.]

a_2 · log p(“dinner” | “<s>”, source) + a_3 · log p(“dinner” | “<s> We”, source)

SLIDE 25

Our Solution: Align Reference with Partial Translations

[Figure continued.]

a_2 · log p(“dinner” | “<s>”, source) + a_3 · log p(“dinner” | “<s> We”, source) + a_4 · log p(“dinner” | “<s> We will”, source)

SLIDE 26

Our Solution: Align Reference with Partial Translations

[Figure continued.]

a_2 · log p(“dinner” | “<s>”, source) + a_3 · log p(“dinner” | “<s> We”, source) + a_4 · log p(“dinner” | “<s> We will”, source) + a_5 · log p(“dinner” | “<s> We will make”, source)

SLIDE 27

Our Solution: Align Reference with Partial Translations

[Figure continued.] The soft alignment weights come from the decoder hidden states:

a_j ∝ exp(Embed(“dinner”) · h_j)

a_2 · log p(“dinner” | “<s>”, source) + a_3 · log p(“dinner” | “<s> We”, source) + a_4 · log p(“dinner” | “<s> We will”, source) + a_5 · log p(“dinner” | “<s> We will make”, source)
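A minimal sketch of these weights, assuming the decoder states are stacked in a (T', d) tensor and the reference word embedding has dimension d:

```python
import torch

def soft_alignment_weights(ref_word_embedding, decoder_states):
    # a_j ∝ exp(Embed(reference word) · h_j), normalized over all sampled
    # prefix positions. Shapes: (d,) and (T', d) -> (T',).
    scores = decoder_states @ ref_word_embedding
    return torch.softmax(scores, dim=0)
```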


SLIDE 29

Training Objective

Ours: soft alignment between reference word $y_t$ and sampled prefixes $\tilde{y}_{<j}$:

$$J_{SA} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log \sum_{j=1}^{T'} a_{t,j} \, p(y_t \mid \tilde{y}_{<j}, x)$$

Scheduled Sampling: hard alignment by time index $t$:

$$J_{SS} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log p(y_t \mid \tilde{y}_{<t}, x)$$

(Here $x$ is the source, $y$ the reference of length $T$, $\tilde{y}$ the sampled translation of length $T'$, and $a_{t,j}$ the soft alignment weight between $y_t$ and prefix $\tilde{y}_{<j}$.)

SLIDE 31

Training Objective

Combined with maximum likelihood:

$$J = J_{SA} + J_{ML}$$

(As before, $J_{SA}$ softly aligns each reference word with all sampled prefixes, while scheduled sampling's $J_{SS}$ hard-aligns by time index $t$.)
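A minimal sketch of the soft-aligned term $J_{SA}$ for one sentence, assuming precomputed log-probabilities and alignment weights (this is an illustration, not the authors' released implementation):

```python
import torch

def soft_aligned_loss(log_probs, alignment):
    # J_SA for one sentence, returned as a loss to minimize.
    # log_probs[t, j] = log p(y_t | sampled prefix of length j, x)  (T, T')
    # alignment[t, j] = a_{t,j}, with rows summing to 1             (T, T')
    # log sum_j a_{t,j} * p(...) is computed stably in log space:
    per_token = torch.logsumexp(torch.log(alignment + 1e-9) + log_probs, dim=1)
    return -per_token.sum()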

SLIDE 32

Experiments

  • Data
  • IWSLT14 de-en
  • IWSLT15 vi-en
  • Model
  • Bi-LSTM encoder, LSTM decoder, multilayer perceptron attention
  • Differentiable sampling with Straight-Through Gumbel Softmax (sketched below)
  • Based on AWS Sockeye
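A minimal sketch of Straight-Through Gumbel-Softmax sampling, which keeps the forward pass discrete while letting gradients flow through the soft relaxation:

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, temperature=1.0):
    # Straight-through Gumbel-Softmax: the forward pass emits a hard one-hot
    # sample of the next word, while gradients flow through the soft sample.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    soft = F.softmax((logits + gumbel) / temperature, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.size(-1)).float()
    return hard + soft - soft.detach()   # hard forward, soft gradient
```

PyTorch also ships an equivalent built-in, torch.nn.functional.gumbel_softmax(logits, tau=temperature, hard=True), which a real implementation would likely call instead.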
SLIDE 33

Our Method Outperforms Maximum Likelihood and Scheduled Sampling

[Bar chart: BLEU (22 to 28) on de-en, en-de, and vi-en for Baseline, Scheduled Sampling, Differentiable Scheduled Sampling, and Our Method; Our Method achieves the highest BLEU on each language pair.]

SLIDE 34

Our Method Needs No Annealing

[Bar chart: BLEU (17 to 27) on de-en, en-de, and vi-en for Baseline, Scheduled Sampling w/ annealing, Scheduled Sampling w/o annealing, and Our Method (no annealing).]

Scheduled sampling: BLEU drops when used without annealing!

SLIDE 35

Summary

Introduced a new training objective:

  • 1. Generate translation prefixes via differentiable sampling
  • 2. Learn to align reference words with the sampled prefixes

Better BLEU than maximum likelihood and scheduled sampling (de-en, en-de, vi-en). Simple to train: no annealing schedule required.

SLIDE 36

What you should know

  • Lots of things can be done to improve neural MT even without changing the model architecture
  • The domain of training data matters
  • Simple techniques can be used to measure distance from the test domain
  • And to adapt the model to the domain of interest
  • The standard maximum likelihood objective is suboptimal
  • It does not directly measure translation quality
  • It is based on reference translations only, so the model is not exposed to its own errors during training
  • Developing reliable alternatives is an active area of research