SLIDE 1

Target Conditioned Sampling: Optimizing Data Selection for Multilingual NMT

Xinyi Wang, Graham Neubig

Language Technologies Institute, Carnegie Mellon University

SLIDE 2

Multilingual NMT

  • Particularly useful for low-resource languages (LRLs), such as Galician (glg)

eng: A morning that I will never forget .
spa: Una mañana que nunca olvidaré .
ita: Una mattina che non dimenticherò mai .
por: Uma manhã que nunca vou esquecer .
jpn: その日の朝のことは決して忘れることはないでしょう
glg: A mañá que eu nunca vou

SLIDE 3

Multilingual Training Paradigms

  • Multilingual training (Dong et al. 2015, Firat et al. 2016)
  • Train on a related high-resource language (HRL), then tune towards the LRL (Zoph et al. 2016)
  • Train on multilingual data, then tune towards the LRL (Neubig and Hu 2018, Gu et al. 2018)
  • Our proposal: can we select data more intelligently, in a less heuristic way?

SLIDE 4

Multilingual Objective for LRL NMT

  • How to construct the distribution Q(X, Y)?
  • Given multilingual data in source languages S_1, ..., S_n (with the LRL S among them), each paired with the target language T, choose Q so that it approximates the true data distribution of S:

    Q(X, Y) ≈ P_s(X, Y)
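A plausible way to write the underlying training objective (a standard formulation, assumed rather than copied from the slide): the model is trained on samples from Q, which stands in for the unavailable LRL distribution P_s:

```latex
% Assumed rendering of the objective: expected NMT loss, with Q(X, Y)
% substituting for the true LRL distribution P_s(X, Y) during sampling.
\mathcal{L}(\theta)
  = \mathbb{E}_{(x,y) \sim P_s(X,Y)}\!\left[-\log p_\theta(y \mid x)\right]
  \approx \mathbb{E}_{(x,y) \sim Q(X,Y)}\!\left[-\log p_\theta(y \mid x)\right]
```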

SLIDE 5

Target Conditioned Sampling

  • Decompose Q(X, Y) = Q(Y) · Q(X|Y)
  • Step 1: sample a target sentence y from Q(Y), defined over the union of all target (English) sentences, e.g. { "A morning that I will never forget.", "When I was 11, I usually stay with ....", ... }
  • Step 2: sample a source sentence x from Q(X|y) among the multilingual sources paired with y, e.g. { spa: Una mañana ...., ita: Una mattina ..., por: Uma manhã .., jpn: その日の朝... }
  • Sampled pair: (por: Uma manhã .., A morning that I will never forget.)

SLIDE 6

Choosing the Distributions

  • Q(Y): assume each language's data comes from the same domain, so a uniform sample over all targets y can match P_s(Y)
  • Q(X|y): P_s(X = x | y) measures how likely x is to be in language s
  • Approximate it with a heuristic similarity measure sim(x, s), normalized over all multilingual sources x_i for a given target y (a sketch follows below)
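A minimal sketch of how these two distributions might be materialized, assuming the data is a list of (source sentence, source language, target sentence) triples; the function name `build_q` and the all-zero guard are illustrative, not from the paper:

```python
# Illustrative sketch of Q(Y) and Q(X|y) for TCS; `sim` is any similarity
# between a source sentence and the LRL data (see the next slide).
from collections import defaultdict

def build_q(parallel_data, sim, lrl_data):
    """parallel_data: list of (src_sentence, src_lang, tgt_sentence) triples.

    Returns:
      targets: support of the uniform Q(Y) (union of all target sentences)
      q_x_given_y: dict mapping y to a list of (src_sentence, prob) pairs,
                   i.e. Q(X|y) with probabilities proportional to sim(x, s).
    """
    by_target = defaultdict(list)
    for x, lang, y in parallel_data:
        by_target[y].append(x)

    q_x_given_y = {}
    for y, sources in by_target.items():
        scores = [sim(x, lrl_data) for x in sources]
        z = sum(scores) or 1.0  # guard against division by zero
        q_x_given_y[y] = [(x, sc / z) for x, sc in zip(sources, scores)]
    return list(by_target), q_x_given_y
```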

SLIDE 7

Estimating sim(x, s)

  • Language level
  • Vocab overlap: character n-gram overlap between S and each language's data (sketched below)
  • Language model: use an LM trained on S to score a document of each language
  • Sentence level
  • Vocab overlap: character n-gram overlap between S and each sentence
  • Language model: use an LM trained on S to score each sentence
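A hedged sketch of the vocab-overlap heuristic at either granularity: score a candidate (one sentence, or a whole language's concatenated data) by the fraction of its character n-grams that also occur in the LRL data. The n-gram order and function names are assumptions, not the paper's exact setup:

```python
# Character n-gram overlap similarity; `candidate` may be one sentence
# (sentence level) or a whole language's corpus (language level).
def char_ngrams(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def vocab_overlap_sim(candidate, lrl_data, n=4):
    cand = char_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & char_ngrams(lrl_data, n)) / len(cand)
```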

SLIDE 8

Algorithms

  • First sample y based on Q(Y), then sample x based on Q(X|y)
  • Stochastic (TCS-S): dynamically sample (x_i, y) for each mini-batch
  • Deterministic (TCS-D): select x′ = argmax_x Q(x|y) once, fixed during training (both variants are sketched below)
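A minimal sketch of the two variants, reusing `targets` and `q_x_given_y` from the earlier `build_q` sketch (names are illustrative):

```python
import random

def tcs_deterministic(targets, q_x_given_y):
    # TCS-D: pick x' = argmax_x Q(x|y) once per target, before training.
    return [(max(q_x_given_y[y], key=lambda xp: xp[1])[0], y) for y in targets]

def tcs_stochastic_batch(targets, q_x_given_y, batch_size):
    # TCS-S: re-draw a fresh mini-batch of (x, y) pairs at every step.
    batch = []
    for y in random.choices(targets, k=batch_size):  # y ~ Q(Y), uniform
        xs, probs = zip(*q_x_given_y[y])
        batch.append((random.choices(xs, weights=probs)[0], y))  # x ~ Q(X|y)
    return batch
```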

SLIDE 9

Experiment

  • Dataset
  • 58-language-to-English TED dataset (Qi et al., 2018)
  • 4 test languages: Azerbaijani (aze), Belarusian (bel), Galician (glg), Slovak (slk)
  • Baselines
  • Bi: each LRL paired with one related HRL (Neubig & Hu 2018)
  • All: train on all 59 languages
  • Copied: use the union of English sentences as monolingual data by copying them to the source (Currey et al. 2017)

SLIDE 10

TCS vs. Baselines

[Bar chart: relative BLEU difference from the Bi baseline (y-axis, 0.75 to 3) on aze, bel, glg, slk, comparing All, Copied, and TCS-S.]

SLIDE 11

TCS-D vs. TCS-S

[Bar chart: relative difference from Bi on aze, bel, glg, slk, comparing TCS-D and TCS-S.]

  • TCS-D already brings gains; TCS-S generally performs better
SLIDE 12

LM vs. Vocab

[Bar chart: relative difference from Bi on aze, bel, glg, slk, comparing the LM and vocab-overlap heuristics.]

  • The simple vocab-overlap heuristic is already competitive
  • LM performs better for slk, which has the largest amount of data

SLIDE 13

Sent vs. Lang

[Bar chart: relative difference from Bi on aze, bel, glg, slk, comparing sentence-level and language-level heuristics.]

  • The language-level heuristic is better in general

SLIDE 14

Conclusion

  • TCS is a simple method for better multilingual data selection
  • Brings significant improvements with little training overhead
  • Simple heuristics work well for estimating language similarity for LRLs

Thank You! Questions?

https://github.com/cindyxinyiwang/TCS

SLIDE 15

Extra Slides

SLIDE 16

Relationship with Back-Translation

[Bar chart: relative difference from Bi (y-axis, 1.25 to 5) on aze, bel, glg, slk, comparing back-translation and TCS-S.]

  • TCS approximates the back-translation probability P_s(X|y)
  • For LRLs, the heuristics perform better than a back-translation model

SLIDE 17

Effect on SDE

  • SDE: a better word encoding designed for multilingual data (Wang et al. 2019)
  • TCS still brings significant gains on top of SDE

[Bar chart: relative difference from Bi (y-axis, 1 to 4) on aze, bel, glg, slk, comparing All, Copied, and TCS-S, all with SDE.]