
  1. Target Conditioned Sampling: Optimizing Data Selection for Multilingual NMT
  Xinyi Wang, Graham Neubig, Language Technologies Institute, Carnegie Mellon University

  2. Multilingual NMT
  • glg: A mañá que eu nunca vou esquecer .
  • spa: Una mañana que nunca olvidaré .
  • por: Uma manhã que nunca vou esquecer .
  • eng: A morning that I will never forget .
  • ita: Una mattina che non dimenticherò mai .
  • jpn: その日の朝のことは決して忘れることはないでしょう
  • Particularly useful for low-resource languages (LRLs), such as Galician (glg)

  3. Multilingual Training Paradigms
  • Multilingual training (Dong et al. 2015, Firat et al. 2016)
  • Train on a related high-resource language (HRL), then tune towards the LRL (Zoph et al. 2016)
  • Train on multilingual data, then tune towards the LRL (Neubig and Hu 2018, Gu et al. 2018)
  • Our proposal: can we select data more intelligently, in a less heuristic way?

  4. Multilingual Objective for LRL NMT
  • Setup: an LRL S paired with target language T, plus multilingual datasets S_1, ..., S_n
  • Goal: a sampling distribution Q(X, Y) over the multilingual data that approximates the true LRL data distribution: Q(X, Y) ≈ P_s(X, Y)
  • How do we construct Q(X, Y)? (the objective is sketched below)
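To make the objective concrete, here is a sketch in standard notation; the expected-loss form and the factorization of Q are inferred from the slides' Q(Y) / Q(X | y) decomposition, not quoted from the paper:

```latex
% Sketch of the TCS training objective (notation assumed, not quoted from the paper).
% \theta: NMT model parameters; Q approximates the true LRL data distribution P_s.
\max_{\theta}\; \mathbb{E}_{(x,y)\sim Q(X,Y)}\big[\log P(y \mid x;\, \theta)\big],
\qquad Q(X,Y) \;=\; Q(Y)\, Q(X \mid Y) \;\approx\; P_s(X,Y)
```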

  5. Target Conditioned Sampling
  • Figure: from the union of target sentences ("A morning that I will never forget.", "When I was 11, I usually stay with ...", ...), sample a target y from Q(Y)
  • Then, from the multilingual sources paired with y (spa: "Una mañana ...", por: "Uma manhã ...", ita: "Una mattina ...", jpn: "その日の朝 ..."), sample a source x from Q(X | y) to form the sampled data

  6. Choosing the Distributions
  • Q(Y)
  • assume the data of each language comes from the same domain
  • uniform sampling from the union of all targets y then matches P_s(Y)
  • Q(X | y)
  • P_s(X = x | y) measures how likely x is in language s
  • approximate it with a heuristic similarity measure sim(x, s), normalized over all multilingual sources x_i for a given target y (see the sketch below)
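As a minimal sketch of that normalization step (helper names are my own; `sim` is whichever heuristic from the next slide gets plugged in):

```python
def conditional_source_dist(candidates, lrl_corpus, sim):
    """Q(X | y): normalize sim(x, S) over all multilingual source
    sentences x that share a given target sentence y."""
    scores = [sim(x, lrl_corpus) for x in candidates]
    total = sum(scores)
    if total == 0:
        # No candidate resembles the LRL at all: fall back to uniform.
        return [1.0 / len(candidates)] * len(candidates)
    return [s / total for s in scores]
```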

  7. Estimating sim(x, s)

                    Vocab Overlap                           Language Model
  Language level    character n-gram overlap between        use an LM trained on S to score
                    S and each language                     the document of each language
  Sentence level    character n-gram overlap between        use an LM trained on S to score
                    S and each sentence                     each sentence
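A minimal sketch of the vocabulary-overlap heuristic, assuming sim is the fraction of shared character n-grams (the paper's exact formula may differ; n=4 is an arbitrary choice here):

```python
def char_ngrams(text, n=4):
    """All character n-grams occurring in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def vocab_overlap_sim(x, lrl_ngrams, n=4):
    """sim(x, S): fraction of x's character n-grams that also occur
    in the LRL corpus S (lrl_ngrams is precomputed with char_ngrams)."""
    grams = char_ngrams(x, n)
    if not grams:
        return 0.0
    return len(grams & lrl_ngrams) / len(grams)
```

At the sentence level this is applied per source sentence; at the language level the n-gram set of a whole language's corpus plays the role of x.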

  8. Algorithms
  • First sample y based on Q(Y), then sample (x_i, y) based on Q(X | y)
  • Stochastic (TCS-S): dynamically re-sample each mini-batch
  • Deterministic (TCS-D): select x′ = argmax_x Q(x | y), fixed during training (both variants are sketched below)
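A sketch of both variants (hypothetical helper names; `q_cond` stands for the Q(X | y) normalization above):

```python
import random

def tcs_deterministic(targets, sources_by_target, q_cond):
    """TCS-D: for each target y, pick x' = argmax_x Q(x | y) once;
    the selected pairs then stay fixed for all of training."""
    data = []
    for y in targets:
        xs = sources_by_target[y]
        probs = q_cond(xs)
        best = max(range(len(xs)), key=lambda i: probs[i])
        data.append((xs[best], y))
    return data

def tcs_stochastic_batch(targets, sources_by_target, q_cond, batch_size):
    """TCS-S: re-sample source/target pairs for every mini-batch.
    Q(Y) is uniform over the union of target sentences."""
    batch = []
    for y in random.choices(targets, k=batch_size):         # y ~ Q(Y)
        xs = sources_by_target[y]
        x = random.choices(xs, weights=q_cond(xs), k=1)[0]  # x ~ Q(X | y)
        batch.append((x, y))
    return batch
```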

  9. Experiments
  • Dataset
  • 58-language-to-English TED dataset (Qi et al., 2018)
  • 4 test languages: Azerbaijani (aze), Belarusian (bel), Galician (glg), Slovak (slk)
  • Baselines
  • Bi: each LRL paired with one related HRL (Neubig & Hu 2018)
  • All: train on all 59 languages
  • Copied: use the union of English sentences as monolingual data by copying them to the source side (Currey et al. 2017)

  10. TCS vs. Baselines
  • Chart: relative difference from Bi for All, Copied, and TCS-S on aze, bel, glg, slk

  11. TCS-D vs. TCS-S
  • Chart: TCS-D vs. TCS-S on aze, bel, glg, slk
  • TCS-D already brings gains; TCS-S generally performs better

  12. LM vs. Vocab
  • Chart: relative difference from Bi for the LM and Vocab similarity measures on aze, bel, glg, slk
  • The simple vocab overlap heuristic is already competitive
  • LM performs better for slk, the language with the largest amount of data

  13. Sent vs. Lang
  • Chart: relative difference from Bi for sentence-level (Sent) and language-level (Lang) heuristics on aze, bel, glg, slk
  • The language-level heuristic is generally better

  14. Conclusion
  • TCS is a simple method for better multilingual data selection
  • It brings significant improvements with little training overhead
  • Simple heuristics work well for estimating language similarity for LRLs
  https://github.com/cindyxinyiwang/TCS
  Thank You! Questions?

  15. Extra Slides

  16. Relationship with Back-Translation
  • Chart: relative difference for back-translate vs. TCS-S on aze, bel, glg, slk
  • TCS approximates the back-translation probability P_s(X | y)
  • For LRLs, the heuristics perform better than a back-translation model

  17. Effect on SDE
  • Chart: relative difference for All, Copied, and TCS-S (on top of SDE) on aze, bel, glg, slk
  • SDE: a better word encoding designed for multilingual data (Wang et al. 2019)
  • TCS still brings significant gains on top of SDE
