

SLIDE 1

SwitchOut: An Efficient Data Augmentation for Neural Machine Translation

Xinyi Wang∗, Hieu Pham∗, Zihang Dai, Graham Neubig
November 2, 2018

∗: equal contribution

SLIDE 2

Data Augmentation

Neural models are data hungry, while collecting data is expensive

¹ image source: Medium

SLIDE 3

Data Augmentation

Neural models are data hungry, while collecting data is expensive
Prevalent in computer vision¹

¹ image source: Medium

SLIDE 4

Data Augmentation

Neural models are data hungry, while collecting data is expensive
Prevalent in computer vision¹
More difficult for natural language

◮ Discrete vocabulary
◮ NMT sensitive to arbitrary noise
¹ image source: Medium

SLIDE 5

Existing Strategies

Word replacement


SLIDE 6

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]


SLIDE 7

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]


SLIDE 8

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]
◮ Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]


SLIDE 9

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]
◮ Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]
→ Can we characterize all of these related approaches in a single framework?


SLIDE 10

Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets


SLIDE 11

Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets
Solution: sample corrupted targets during training


SLIDE 12

Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets
Solution: sample corrupted targets during training
Gold target y, corrupted target ŷ, similarity measure ry:

q∗(ŷ | y; τ) = exp{ry(ŷ, y)/τ} / ∑_{y′} exp{ry(y′, y)/τ}
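To make the exponentiated-payoff distribution concrete, here is a toy Python sketch (my illustration, not from the talk) that evaluates q∗ over a small enumerated candidate set, using negative Hamming distance as ry; in practice one samples from q∗ rather than enumerating candidates:

import math

def raml_q(candidates, y, tau):
    # r_y: negative Hamming distance between equal-length token lists
    def r(a, b):
        return -sum(ai != bi for ai, bi in zip(a, b))
    # unnormalized exponentiated payoffs, then normalize
    weights = [math.exp(r(c, y) / tau) for c in candidates]
    z = sum(weights)
    return [w / z for w in weights]

y = "the cat sat".split()
candidates = [y, "the dog sat".split(), "a dog ran".split()]
print(raml_q(candidates, y, tau=1.0))
# the gold target gets the most mass; mass decays with Hamming distance,
# and a larger tau flattens the distribution (more diversity)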

SLIDE 13

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)


SLIDE 14

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)
Observed data distribution: x, y ∼ p̃(X, Y)


SLIDE 15

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)
Observed data distribution: x, y ∼ p̃(X, Y)
→ Problem: p(X, Y) and p̃(X, Y) might have a large discrepancy


SLIDE 16

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)
Observed data distribution: x, y ∼ p̃(X, Y)
→ Problem: p(X, Y) and p̃(X, Y) might have a large discrepancy
Data augmentation: x̂, ŷ ∼ q(X̂, Ŷ)

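For context on how q is then used in training (paraphrased from the paper, with pθ denoting the NMT model): maximum likelihood on the observed pairs is replaced by the likelihood of pairs drawn from q,

\max_\theta \; \mathbb{E}_{(x,y)\sim \tilde{p}(X,Y)} \, \mathbb{E}_{(\hat{x},\hat{y})\sim q(\cdot \mid x,y)} \big[ \log p_\theta(\hat{y} \mid \hat{x}) \big]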

SLIDE 17

Design a good q(X̂, Ŷ)

q: function of observed (x, y)


SLIDE 18

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?


SLIDE 19

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?
◮ Diversity: larger support, covering all valid data pairs (x̂, ŷ)
  ⋆ Entropy H[q(x̂, ŷ | x, y)] is large

SLIDE 20

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?
◮ Diversity: larger support, covering all valid data pairs (x̂, ŷ)
  ⋆ Entropy H[q(x̂, ŷ | x, y)] is large
◮ Smoothness: similar data pairs have similar probabilities
  ⋆ q maximizes the similarity measures rx(x, x̂) and ry(y, ŷ)

SLIDE 21

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?
◮ Diversity: larger support, covering all valid data pairs (x̂, ŷ)
  ⋆ Entropy H[q(x̂, ŷ | x, y)] is large
◮ Smoothness: similar data pairs have similar probabilities
  ⋆ q maximizes the similarity measures rx(x, x̂) and ry(y, ŷ)
τ: controls the effect of diversity; q should maximize

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]
SLIDE 22

Mathematically Optimal q

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]

Solve for the best q (with s(x̂, ŷ; x, y) = rx(x, x̂) + ry(y, ŷ)):

q∗(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / ∑_{x̂′,ŷ′} exp{s(x̂′, ŷ′; x, y)/τ}
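The slides state this solution without proof; a short derivation (added here, the standard entropy-regularization argument) shows why the optimum takes this Gibbs/softmax form. For fixed (x, y), abbreviate s(x̂, ŷ) = s(x̂, ŷ; x, y) and introduce a Lagrange multiplier λ for the constraint that q sums to one:

\begin{aligned}
\mathcal{L}(q) &= \tau H(q) + \mathbb{E}_{q}[s] + \lambda \Big( \textstyle\sum_{\hat{x},\hat{y}} q(\hat{x},\hat{y}) - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial q(\hat{x},\hat{y})} &= -\tau \big( \log q(\hat{x},\hat{y}) + 1 \big) + s(\hat{x},\hat{y}) + \lambda = 0 \\
\Rightarrow \quad q^{*}(\hat{x},\hat{y}) &\propto \exp\{ s(\hat{x},\hat{y}) / \tau \}
\end{aligned}

Normalizing over all (x̂′, ŷ′) gives exactly the softmax on this slide.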

SLIDE 23

Mathematically Optimal q

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]

Solve for the best q (with s(x̂, ŷ; x, y) = rx(x, x̂) + ry(y, ŷ)):

q∗(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / ∑_{x̂′,ŷ′} exp{s(x̂′, ŷ′; x, y)/τ}

Decompose x and y:

q∗(x̂, ŷ | x, y) = [exp{rx(x̂, x)/τx} / ∑_{x̂′} exp{rx(x̂′, x)/τx}] × [exp{ry(ŷ, y)/τy} / ∑_{ŷ′} exp{ry(ŷ′, y)/τy}]
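A one-line step the slides leave implicit: because s is the sum of a source-only and a target-only term, the exponential factorizes,

\exp\{ s(\hat{x}, \hat{y}; x, y) / \tau \} = \exp\{ r_x(\hat{x}, x) / \tau \} \cdot \exp\{ r_y(\hat{y}, y) / \tau \}

so q∗ is a product of an independent source distribution and target distribution; the version with separate temperatures τx and τy is the natural generalization, and it is what licenses sampling x̂ and ŷ independently on the next slides.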

SLIDE 24

Mathematically Optimal q

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]

Solve for the best q (with s(x̂, ŷ; x, y) = rx(x, x̂) + ry(y, ŷ)):

q∗(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / ∑_{x̂′,ŷ′} exp{s(x̂′, ŷ′; x, y)/τ}

Decompose x and y:

q∗(x̂, ŷ | x, y) = [exp{rx(x̂, x)/τx} / ∑_{x̂′} exp{rx(x̂′, x)/τx}] × [exp{ry(ŷ, y)/τy} / ∑_{ŷ′} exp{ry(ŷ′, y)/τy}]

Formulate existing methods:
◮ Dictionary: jointly on x and y, but deterministic and not diverse
◮ Word dropout: only the x side, replacing words with a null token
◮ RAML: only the y side

SLIDE 25

Formulate SwitchOut

Augment both x and y!


SLIDE 26

Formulate SwitchOut

Augment both x and y!
Sample x̂ and ŷ independently


SLIDE 27

Formulate SwitchOut

Augment both x and y!
Sample x̂ and ŷ independently
Define rx(x̂, x) and ry(ŷ, y)
◮ Negative Hamming distance, following RAML

SLIDE 28

SwitchOut: Sample efficiently

Given a sentence s = {s1, s2, …, s|s|}

1. How many words to corrupt?
   Assumption: each position is swapped at most once. Sample P(n) ∝ exp(−n/τ)

SLIDE 29

SwitchOut: Sample efficiently

Given a sentence s = {s1, s2, …, s|s|}

1. How many words to corrupt?
   Assumption: each position is swapped at most once. Sample P(n) ∝ exp(−n/τ)

2. What is the corrupted sentence?
   P(randomly swap si with another word) = n/|s|

See Appendix: efficient batch implementation in PyTorch and TensorFlow
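Below is a minimal batched PyTorch sketch of this two-step sampler (my reconstruction under the stated assumptions, not the authors' appendix code; vocab_size and the special-token ids pad_id/bos_id/eos_id are placeholder parameters):

import torch

def switchout(sents, vocab_size, tau=1.0, pad_id=0, bos_id=1, eos_id=2):
    # sents: LongTensor [batch, max_len]; never corrupt pad/BOS/EOS positions
    protected = (sents == pad_id) | (sents == bos_id) | (sents == eos_id)
    lengths = (~protected).sum(dim=1).float().clamp(min=1.0)  # guard empty rows
    batch, max_len = sents.size()
    # step 1: sample how many tokens to corrupt, P(n) ∝ exp(-n / tau), 0 <= n <= |s|
    logits = -torch.arange(max_len, dtype=torch.float).repeat(batch, 1) / tau
    logits[torch.arange(max_len).repeat(batch, 1) > lengths.unsqueeze(1)] = -float("inf")
    num_swaps = torch.distributions.Categorical(logits=logits).sample().float()
    # step 2: corrupt each real position independently with probability n / |s|
    # (matches n swaps only in expectation, which keeps the sampler cheap and batched)
    swap_probs = (num_swaps / lengths).unsqueeze(1).expand(-1, max_len)
    swap = torch.bernoulli(swap_probs).bool() & ~protected
    # replacement words drawn uniformly from the non-special vocabulary
    replacements = torch.randint(eos_id + 1, vocab_size, sents.size())
    return torch.where(swap, replacements, sents)

Applying the same function to source batches (with τx) and to target batches (with τy) gives the full SwitchOut augmentation; the temperatures are tuned on a dev set.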

SLIDE 30

Experiments

Datasets

◮ en-vi: IWSLT 2015
◮ de-en: IWSLT 2016
◮ en-de: WMT 2015

Models

◮ Transformer model
◮ Word-based, standard preprocessing

SLIDE 31

Results: RAML and word dropout

Method (src)    Method (trg)    en-de    de-en    en-vi
N/A             N/A             21.73    29.81    27.97
WordDropout     N/A             20.63    29.97    28.56
SwitchOut       N/A             22.78†   29.94    28.67†
N/A             RAML            22.83    30.66    28.88
WordDropout     RAML            20.69    30.79    28.86
SwitchOut       RAML            23.13†   30.98†   29.09

SLIDE 32

Results: RAML and word dropout

SwitchOut on source > word dropout

Method (src)    Method (trg)    en-de    de-en    en-vi
N/A             N/A             21.73    29.81    27.97
WordDropout     N/A             20.63    29.97    28.56
SwitchOut       N/A             22.78†   29.94    28.67†
N/A             RAML            22.83    30.66    28.88
WordDropout     RAML            20.69    30.79    28.86
SwitchOut       RAML            23.13†   30.98†   29.09

SLIDE 33

Results: RAML and word dropout

SwitchOut on source > word dropout
SwitchOut on source and target > RAML

Method (src)    Method (trg)    en-de    de-en    en-vi
N/A             N/A             21.73    29.81    27.97
WordDropout     N/A             20.63    29.97    28.56
SwitchOut       N/A             22.78†   29.94    28.67†
N/A             RAML            22.83    30.66    28.88
WordDropout     RAML            20.69    30.79    28.86
SwitchOut       RAML            23.13†   30.98†   29.09

SLIDE 34

Where does SwitchOut help?

Larger gains on sentences that are more different from the training data

[Figure: gain in BLEU on the top-K test sentences most different from the training data (x-axis: top K sentences; y-axis: gain in BLEU). Left: IWSLT 16 de-en. Right: IWSLT 15 en-vi.]

SLIDE 35

Final Thoughts

SwitchOut sampling is efficient and easy to use


SLIDE 36

Final Thoughts

SwitchOut sampling is efficient and easy to use
Works with any NMT architecture


SLIDE 37

Final Thoughts

SwitchOut sampling is efficient and easy to use
Works with any NMT architecture
Our formulation of data augmentation encompasses existing methods and suggests future directions


SLIDE 38

Final Thoughts

SwitchOut sampling is efficient and easy to use
Works with any NMT architecture
Our formulation of data augmentation encompasses existing methods and suggests future directions

Thanks a lot for listening! Questions?


SLIDE 39

References

Norouzi et al. (2016). Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS.
Sennrich et al. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In WMT.
Sennrich et al. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In ACL.
Currey et al. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In WMT.
Fadaee et al. (2017). Data Augmentation for Low-Resource Neural Machine Translation. In ACL.

SLIDE 40

Results: Back Translation (WMT en-de 2015)

SwitchOut > Back Translation (BT)

Method        en-de
Transformer   21.73
+SwitchOut    22.78
+BT           21.82

SLIDE 41

Results: Back Translation (WMT en-de 2015)

SwitchOut > Back Translation (BT)
SwitchOut + RAML + Back Translation wins

Method                 en-de
Transformer            21.73
+SwitchOut             22.78
+BT                    21.82
+BT +RAML              21.53
+BT +SwitchOut         22.93
+BT +RAML +SwitchOut   23.76