

SLIDE 1

SwitchOut: An Efficient Data Augmentation for Neural Machine Translation

Xinyi Wang∗, Hieu Pham∗, Zihang Dai, Graham Neubig
November 2, 2018

∗: equal contribution

SLIDE 2

Data Augmentation

Neural models are data hungry, while collecting data is expensive

¹ image source: Medium

SLIDE 3

Data Augmentation

Neural models are data hungry, while collecting data is expensive
Prevalent in computer vision¹

¹ image source: Medium

SLIDE 4

Data Augmentation

Neural models are data hungry, while collecting data is expensive
Prevalent in computer vision¹
More difficult for natural language

◮ Discrete vocabulary
◮ NMT sensitive to arbitrary noise
¹ image source: Medium

SLIDE 5

Existing Strategies

Word replacement


SLIDE 6

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]


SLIDE 7

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]


SLIDE 8

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]
◮ Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]


SLIDE 9

Existing Strategies

Word replacement
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]
◮ Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]
→ Can we characterize all of these related approaches in a single framework?


SLIDE 10

Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets


SLIDE 11

Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets
Solution: sample corrupted targets during training


SLIDE 12

Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets
Solution: sample corrupted targets during training
Gold target y, corrupted target ŷ, similarity measure ry:

q∗(ŷ | y; τ) = exp{ry(ŷ, y)/τ} / ∑_{y′} exp{ry(y′, y)/τ}
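To make the exponentiated-payoff distribution concrete, here is a toy Python sketch (my illustration, not from the talk) that evaluates q∗ over a small enumerated candidate set, using negative Hamming distance as ry; in practice one samples from q∗ rather than enumerating candidates:

import math

def raml_q(candidates, y, tau):
    # r_y: negative Hamming distance between equal-length token lists
    def r(a, b):
        return -sum(ai != bi for ai, bi in zip(a, b))
    # unnormalized exponentiated payoffs, then normalize
    weights = [math.exp(r(c, y) / tau) for c in candidates]
    z = sum(weights)
    return [w / z for w in weights]

y = "the cat sat".split()
candidates = [y, "the dog sat".split(), "a dog ran".split()]
print(raml_q(candidates, y, tau=1.0))
# the gold target gets the most mass; mass decays with Hamming distance,
# and a larger tau flattens the distribution (more diversity)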

SLIDE 13

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)


SLIDE 14

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)
Observed data distribution: x, y ∼ p̃(X, Y)


SLIDE 15

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)
Observed data distribution: x, y ∼ p̃(X, Y)
→ Problem: p(X, Y) and p̃(X, Y) might have a large discrepancy


SLIDE 16

Formalize Data Augmentation

Real data distribution: x, y ∼ p(X, Y)
Observed data distribution: x, y ∼ p̃(X, Y)
→ Problem: p(X, Y) and p̃(X, Y) might have a large discrepancy
Data augmentation: x̂, ŷ ∼ q(X̂, Ŷ)

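For context on how q is then used in training (paraphrased from the paper, with pθ denoting the NMT model): maximum likelihood on the observed pairs is replaced by the likelihood of pairs drawn from q,

\max_\theta \; \mathbb{E}_{(x,y)\sim \tilde{p}(X,Y)} \, \mathbb{E}_{(\hat{x},\hat{y})\sim q(\cdot \mid x,y)} \big[ \log p_\theta(\hat{y} \mid \hat{x}) \big]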

SLIDE 17

Design a good q(X̂, Ŷ)

q: function of observed (x, y)


SLIDE 18

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?


SLIDE 19

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?
◮ Diversity: larger support, covering all valid data pairs (x̂, ŷ)
  ⋆ Entropy H[q(x̂, ŷ | x, y)] is large

SLIDE 20

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?
◮ Diversity: larger support, covering all valid data pairs (x̂, ŷ)
  ⋆ Entropy H[q(x̂, ŷ | x, y)] is large
◮ Smoothness: similar data pairs have similar probabilities
  ⋆ q maximizes the similarity measures rx(x, x̂) and ry(y, ŷ)

SLIDE 21

Design a good q(X̂, Ŷ)

q: function of observed (x, y)
How should q approximate p?
◮ Diversity: larger support, covering all valid data pairs (x̂, ŷ)
  ⋆ Entropy H[q(x̂, ŷ | x, y)] is large
◮ Smoothness: similar data pairs have similar probabilities
  ⋆ q maximizes the similarity measures rx(x, x̂) and ry(y, ŷ)
τ: controls the effect of diversity; q should maximize

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]
SLIDE 22

Mathematically Optimal q

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]

Solve for the best q (with s(x̂, ŷ; x, y) = rx(x, x̂) + ry(y, ŷ)):

q∗(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / ∑_{x̂′,ŷ′} exp{s(x̂′, ŷ′; x, y)/τ}
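The slides state this solution without proof; a short derivation (added here, the standard entropy-regularization argument) shows why the optimum takes this Gibbs/softmax form. For fixed (x, y), abbreviate s(x̂, ŷ) = s(x̂, ŷ; x, y) and introduce a Lagrange multiplier λ for the constraint that q sums to one:

\begin{aligned}
\mathcal{L}(q) &= \tau H(q) + \mathbb{E}_{q}[s] + \lambda \Big( \textstyle\sum_{\hat{x},\hat{y}} q(\hat{x},\hat{y}) - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial q(\hat{x},\hat{y})} &= -\tau \big( \log q(\hat{x},\hat{y}) + 1 \big) + s(\hat{x},\hat{y}) + \lambda = 0 \\
\Rightarrow \quad q^{*}(\hat{x},\hat{y}) &\propto \exp\{ s(\hat{x},\hat{y}) / \tau \}
\end{aligned}

Normalizing over all (x̂′, ŷ′) gives exactly the softmax on this slide.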

SLIDE 23

Mathematically Optimal q

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]

Solve for the best q (with s(x̂, ŷ; x, y) = rx(x, x̂) + ry(y, ŷ)):

q∗(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / ∑_{x̂′,ŷ′} exp{s(x̂′, ŷ′; x, y)/τ}

Decompose x and y:

q∗(x̂, ŷ | x, y) = [exp{rx(x̂, x)/τx} / ∑_{x̂′} exp{rx(x̂′, x)/τx}] × [exp{ry(ŷ, y)/τy} / ∑_{ŷ′} exp{ry(ŷ′, y)/τy}]
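A one-line step the slides leave implicit: because s is the sum of a source-only and a target-only term, the exponential factorizes,

\exp\{ s(\hat{x}, \hat{y}; x, y) / \tau \} = \exp\{ r_x(\hat{x}, x) / \tau \} \cdot \exp\{ r_y(\hat{y}, y) / \tau \}

so q∗ is a product of an independent source distribution and target distribution; the version with separate temperatures τx and τy is the natural generalization, and it is what licenses sampling x̂ and ŷ independently on the next slides.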

SLIDE 24

Mathematically Optimal q

J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ∼q}[rx(x, x̂) + ry(y, ŷ)]

Solve for the best q (with s(x̂, ŷ; x, y) = rx(x, x̂) + ry(y, ŷ)):

q∗(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / ∑_{x̂′,ŷ′} exp{s(x̂′, ŷ′; x, y)/τ}

Decompose x and y:

q∗(x̂, ŷ | x, y) = [exp{rx(x̂, x)/τx} / ∑_{x̂′} exp{rx(x̂′, x)/τx}] × [exp{ry(ŷ, y)/τy} / ∑_{ŷ′} exp{ry(ŷ′, y)/τy}]

Formulate existing methods:
◮ Dictionary: jointly on x and y, but deterministic and not diverse
◮ Word dropout: only the x side, replacing words with a null token
◮ RAML: only the y side

SLIDE 25

Formulate SwitchOut

Augment both x and y!


SLIDE 26

Formulate SwitchOut

Augment both x and y!
Sample x̂ and ŷ independently


SLIDE 27

Formulate SwitchOut

Augment both x and y!
Sample x̂ and ŷ independently
Define rx(x̂, x) and ry(ŷ, y)
◮ Negative Hamming distance, following RAML

SLIDE 28

SwitchOut: Sample efficiently

Given a sentence s = {s1, s2, …, s|s|}

1. How many words to corrupt?
   Assumption: each position is swapped at most once. Sample P(n) ∝ exp(−n/τ)

SLIDE 29

SwitchOut: Sample efficiently

Given a sentence s = {s1, s2, …, s|s|}

1. How many words to corrupt?
   Assumption: each position is swapped at most once. Sample P(n) ∝ exp(−n/τ)

2. What is the corrupted sentence?
   P(randomly swap si with another word) = n/|s|

See Appendix: efficient batch implementation in PyTorch and TensorFlow
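Below is a minimal batched PyTorch sketch of this two-step sampler (my reconstruction under the stated assumptions, not the authors' appendix code; vocab_size and the special-token ids pad_id/bos_id/eos_id are placeholder parameters):

import torch

def switchout(sents, vocab_size, tau=1.0, pad_id=0, bos_id=1, eos_id=2):
    # sents: LongTensor [batch, max_len]; never corrupt pad/BOS/EOS positions
    protected = (sents == pad_id) | (sents == bos_id) | (sents == eos_id)
    lengths = (~protected).sum(dim=1).float().clamp(min=1.0)  # guard empty rows
    batch, max_len = sents.size()
    # step 1: sample how many tokens to corrupt, P(n) ∝ exp(-n / tau), 0 <= n <= |s|
    logits = -torch.arange(max_len, dtype=torch.float).repeat(batch, 1) / tau
    logits[torch.arange(max_len).repeat(batch, 1) > lengths.unsqueeze(1)] = -float("inf")
    num_swaps = torch.distributions.Categorical(logits=logits).sample().float()
    # step 2: corrupt each real position independently with probability n / |s|
    # (matches n swaps only in expectation, which keeps the sampler cheap and batched)
    swap_probs = (num_swaps / lengths).unsqueeze(1).expand(-1, max_len)
    swap = torch.bernoulli(swap_probs).bool() & ~protected
    # replacement words drawn uniformly from the non-special vocabulary
    replacements = torch.randint(eos_id + 1, vocab_size, sents.size())
    return torch.where(swap, replacements, sents)

Applying the same function to source batches (with τx) and to target batches (with τy) gives the full SwitchOut augmentation; the temperatures are tuned on a dev set.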

SLIDE 30

Experiments

Datasets

◮ en-vi: IWSLT 2015
◮ de-en: IWSLT 2016
◮ en-de: WMT 2015

Models

◮ Transformer model
◮ Word-based, standard preprocessing

SLIDE 31

Results: RAML and word dropout

Method (src)    Method (trg)    en-de    de-en    en-vi
N/A             N/A             21.73    29.81    27.97
WordDropout     N/A             20.63    29.97    28.56
SwitchOut       N/A             22.78†   29.94    28.67†
N/A             RAML            22.83    30.66    28.88
WordDropout     RAML            20.69    30.79    28.86
SwitchOut       RAML            23.13†   30.98†   29.09

SLIDE 32

Results: RAML and word dropout

SwitchOut on source > word dropout

Method (src)    Method (trg)    en-de    de-en    en-vi
N/A             N/A             21.73    29.81    27.97
WordDropout     N/A             20.63    29.97    28.56
SwitchOut       N/A             22.78†   29.94    28.67†
N/A             RAML            22.83    30.66    28.88
WordDropout     RAML            20.69    30.79    28.86
SwitchOut       RAML            23.13†   30.98†   29.09

SLIDE 33

Results: RAML and word dropout

SwitchOut on source > word dropout
SwitchOut on source and target > RAML

Method (src)    Method (trg)    en-de    de-en    en-vi
N/A             N/A             21.73    29.81    27.97
WordDropout     N/A             20.63    29.97    28.56
SwitchOut       N/A             22.78†   29.94    28.67†
N/A             RAML            22.83    30.66    28.88
WordDropout     RAML            20.69    30.79    28.86
SwitchOut       RAML            23.13†   30.98†   29.09

SLIDE 34

Where does SwitchOut help?

Larger gains on sentences that are more different from the training data

[Figure: gain in BLEU on the top-K test sentences most different from the training data (x-axis: top K sentences; y-axis: gain in BLEU). Left: IWSLT 16 de-en. Right: IWSLT 15 en-vi.]

SLIDE 35

Final Thoughts

SwitchOut sampling is efficient and easy to use


SLIDE 36

Final Thoughts

SwitchOut sampling is efficient and easy to use
Works with any NMT architecture


SLIDE 37

Final Thoughts

SwitchOut sampling is efficient and easy to use
Works with any NMT architecture
Our formulation of data augmentation encompasses existing methods and suggests future directions


SLIDE 38

Final Thoughts

SwitchOut sampling is efficient and easy to use
Works with any NMT architecture
Our formulation of data augmentation encompasses existing methods and suggests future directions

Thanks a lot for listening! Questions?


SLIDE 39

References

Norouzi et al. (2016). Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS.
Sennrich et al. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In WMT.
Sennrich et al. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In ACL.
Currey et al. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In WMT.
Fadaee et al. (2017). Data Augmentation for Low-Resource Neural Machine Translation. In ACL.

SLIDE 40

Results: Back Translation (WMT en-de 2015)

SwitchOut > Back Translation (BT)

Method        en-de
Transformer   21.73
+SwitchOut    22.78
+BT           21.82

SLIDE 41

Results: Back Translation (WMT en-de 2015)

SwitchOut > Back Translation (BT)
SwitchOut + RAML + Back Translation wins

Method                 en-de
Transformer            21.73
+SwitchOut             22.78
+BT                    21.82
+BT +RAML              21.53
+BT +SwitchOut         22.93
+BT +RAML +SwitchOut   23.76