CS11-737 Multilingual NLP
Data-based Strategies to Low-resource MT
Graham Neubig
Site http://demo.clab.cs.cmu.edu/11737fa20/
Many slides from: Xia, Mengzhou, et al. "Generalized data augmentation for low-resource translation." ACL 2019.
Why is Low-resource MT Hard?
○ High-resource language pairs: large parallel corpora → relatively good translations
○ Low-resource language pairs: little parallel data → nonsense!
Example: a system trained on only 5,000 Azerbaijani–English sentence pairs does not convey the correct meaning at all:
source: Atam balaca boz radiosunda BBC Xəbərlərinə qulaq asırdı.
translation: So I'm going to became a lot of people.
reference: My father was listening to BBC News on his small, gray radio.
Transfer Learning (Zoph et al., 2016; Nguyen and Chiang, 2017)
○ Train an MT system on the large HRL–TRG parallel data
○ Adapt (fine-tune) the trained system on the small LRL–TRG parallel data
Multilingual Training (Johnson et al., 2017; Neubig and Hu, 2018)
○ Concatenate the LRL–TRG and HRL–TRG parallel data and train a single MT system on the combined corpus
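The concatenation step can be sketched in a few lines. The `<2xx>` target-language tag follows the Johnson et al. (2017) convention; the corpus dictionary below is an illustrative toy, not real training data.

```python
def make_multilingual_corpus(corpora):
    """Concatenate parallel corpora into one training set, prepending
    a target-language tag to each source sentence (Johnson et al. style).

    corpora: dict mapping (src_lang, tgt_lang) -> list of (src, tgt) pairs.
    """
    merged = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            # The tag tells the shared model which language to produce.
            merged.append((f"<2{tgt_lang}> {src}", tgt))
    return merged

# Toy LRL (Azerbaijani) and HRL (Turkish) corpora, both into English.
corpora = {
    ("aze", "eng"): [("Çox sağ olun.", "Thank you so much.")],
    ("tur", "eng"): [("Çok teşekkür ederim.", "Thank you very much.")],
}
mixed = make_multilingual_corpus(corpora)
```

A single model trained on `mixed` shares parameters across both language pairs, which is what lets the LRL benefit from the HRL data.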
Available Resources and Augmented Data
○ Available: LRL–TRG-L parallel data, HRL–TRG-H parallel data, and TRG-M monolingual data
○ Augmented: translate TRG-M with a TRG→LRL back-translation system to create pseudo-parallel LRL-M–TRG-M pairs
Back-Translation
Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." ACL 2016.
○ Train a TRG→LRL system on the LRL–TRG-L parallel data
○ Use it to translate TRG-M monolingual sentences into pseudo-sources LRL-M
○ Train the final LRL→TRG system on the real parallel data plus the pseudo-parallel LRL-M–TRG-M data
○ Caveat: the machine-generated source side introduces data bias
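As a data-pipeline sketch: the `trg2lrl` callable stands in for a trained TRG→LRL model, and the toy dictionary below is only an illustration.

```python
def back_translate(trg_mono, trg2lrl):
    """Pair each target-side monolingual sentence with a
    machine-generated pseudo-source from a TRG->LRL model."""
    return [(trg2lrl(t), t) for t in trg_mono]

# Stand-in for a trained TRG->LRL model; a real system would be an NMT model.
toy_trg2lrl = {"Thank you so much.": "Çox sağ olun."}.get

real_pairs = [("Çox sağ olun.", "Thank you so much.")]
trg_mono = ["Thank you so much."]
# Final LRL->TRG training data: real pairs plus pseudo-parallel pairs.
train_data = real_pairs + back_translate(trg_mono, toy_trg2lrl)
```

Note that only the source side is synthetic; the target side stays clean human text, which is why back-translation helps the target-side language model.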
Understanding Back-Translation at Scale. Sergey Edunov, Myle Ott, Michael Auli, David Grangier. EMNLP 2018.
Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, Trevor Cohn. "Iterative Back-Translation for Neural Machine Translation" WNGT 2018.
Iterative Back-Translation
○ Train LRL→TRG and TRG→LRL systems in alternation
○ In each round, the latest system in one direction back-translates monolingual data to produce fresh pseudo-parallel data for the other direction
○ Better systems yield better pseudo-parallel data, which in turn yields better systems
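The alternating loop might look like the following sketch. `train(pairs)` is an assumed helper that fits a model on (source, target) pairs and returns a translation function; the identity "trainer" below is a placeholder, since each real `train` call is a full NMT training run.

```python
def iterative_back_translation(parallel, lrl_mono, trg_mono, train, rounds=2):
    """Alternately retrain both directions, refreshing the pseudo-parallel
    data with the newest model each round.

    train(pairs): assumed helper that fits a model on (src, tgt) pairs
    and returns a translate(sentence) function.
    """
    bwd_data = [(t, s) for s, t in parallel]              # TRG -> LRL pairs
    fwd = bwd = None
    for _ in range(rounds):
        bwd = train(bwd_data)                             # TRG -> LRL model
        # Back-translate TRG monolingual data into pseudo-LRL sources.
        fwd_data = list(parallel) + [(bwd(t), t) for t in trg_mono]
        fwd = train(fwd_data)                             # LRL -> TRG model
        # Back-translate LRL monolingual data for the reverse direction.
        bwd_data = [(t, s) for s, t in parallel] + [(fwd(s), s) for s in lrl_mono]
    return fwd, bwd

# Placeholder trainer: returns an identity "translator" (illustration only).
identity_trainer = lambda pairs: (lambda sent: sent)
fwd, bwd = iterative_back_translation(
    [("Çox sağ olun.", "Thank you so much.")],
    lrl_mono=["Hə."], trg_mono=["Yes."], train=identity_trainer)
```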
Generalized Data Augmentation
Xia, Mengzhou, et al. "Generalized data augmentation for low-resource translation." ACL 2019.
Problems with back-translating directly into the LRL:
○ Monolingual data may come from different domains
○ The TRG→LRL system is trained on very little data, so its outputs are low quality
Example: back-translating into the HRL gives fluent Turkish, but into the LRL gives nonsense Azerbaijani:
TRG: Thank you very much. TUR: Çok teşekkür ederim. AZE: Hə Hə Hə.
Instead, translate TRG-M into the HRL (TRG→HRL), creating pseudo-parallel HRL-M–TRG-M pairs:
○ more sentence pairs to train the back-translation system
○ vocabulary sharing with the LRL on the source side
○ syntactic similarity with the LRL on the source side
○ still improves the target-side LM
Available Resources + TRG→LRL and TRG→HRL Back-translation: LRL–TRG-L and HRL–TRG-H parallel data, plus pseudo-parallel LRL-M–TRG-M (TRG→LRL) and HRL-M–TRG-M (TRG→HRL)
Pivoting
○ Translate the HRL side of the HRL–TRG-H parallel data into the LRL with an HRL→LRL system
○ This creates pseudo-parallel LRL-H–TRG-H data
Example: TUR: Çok teşekkür ederim. / TRG: Thank you so much. → AZE: Çox sağ olun. / TRG: Thank you so much.
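The pivoting step reuses the back-translation pattern, applied to the HRL side of real parallel data. The toy dictionary below stands in for a trained Turkish→Azerbaijani system.

```python
def pivot(hrl_trg_pairs, hrl2lrl):
    """Translate the HRL side of HRL-TRG parallel data into the LRL,
    yielding pseudo-parallel LRL-H / TRG-H pairs."""
    return [(hrl2lrl(h), t) for h, t in hrl_trg_pairs]

# Toy stand-in for a trained Turkish -> Azerbaijani system.
toy_tur2aze = {"Çok teşekkür ederim.": "Çox sağ olun."}.get
pivoted = pivot([("Çok teşekkür ederim.", "Thank you so much.")], toy_tur2aze)
```

Unlike back-translation, here the target side is real human text and only the source side is machine-generated in a closely related language, so quality is typically higher than direct TRG→LRL back-translation.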
Available Resources + TRG→LRL and TRG→HRL Back-translation + Pivoting: LRL–TRG-L and HRL–TRG-H parallel data; pseudo-parallel LRL-M–TRG-M (TRG→LRL), HRL-M–TRG-M (TRG→HRL), and LRL-H–TRG-H (HRL→LRL pivoting)
Back-Translation + Pivoting
○ Back-translated HRL data also suffers from lexical or syntactic mismatch with the LRL
○ Translate the back-translated HRL-M further into the LRL (HRL→LRL), creating pseudo-parallel LRL-MH–TRG-M pairs
○ Large amounts of English monolingual data can be utilized
Example chain: TRG: Thank you so much. → TUR: Çok teşekkür ederim. → AZE: Çox sağ olun. (paired with TRG: Thank you so much.)
All resources: LRL–TRG-L and HRL–TRG-H parallel data; pseudo-parallel LRL-M–TRG-M, HRL-M–TRG-M, LRL-H–TRG-H, and LRL-MH–TRG-M
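The two-step chain composes the previous operations. The toy dictionaries below stand in for the two trained systems in the slide's example.

```python
def backtranslate_then_pivot(trg_mono, trg2hrl, hrl2lrl):
    """Chain the two steps: TRG-M -> HRL-M (back-translation), then
    HRL-M -> LRL-MH (pivoting); each pseudo-LRL sentence is paired
    with its original TRG sentence."""
    return [(hrl2lrl(trg2hrl(t)), t) for t in trg_mono]

# Toy stand-ins for the two trained systems (English->Turkish->Azerbaijani).
toy_eng2tur = {"Thank you so much.": "Çok teşekkür ederim."}.get
toy_tur2aze = {"Çok teşekkür ederim.": "Çox sağ olun."}.get
chained = backtranslate_then_pivot(["Thank you so much."], toy_eng2tur, toy_tur2aze)
```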
Copied Monolingual Data
○ MT systems are often fluent but fail at terminology
○ Copy TRG monolingual data to the source side, creating TRG–TRG "copy" pairs
○ Helps encourage the model to not drop words
○ Helps translation of terms that are identical across languages
Example: SRC: Thank you so much. TRG: Thank you so much.
Anna Currey, Antonio Valerio Miceli Barone, Kenneth Heafield. Copied Monolingual Data Improves Low-Resource Neural Machine Translation. WMT 2017.
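The copy operation itself is trivial, which is part of the method's appeal:

```python
def copied_data(trg_mono):
    """Pair each target-language sentence with itself as pseudo-source,
    as in Currey et al.'s copied-data setup."""
    return [(t, t) for t in trg_mono]

copies = copied_data(["Thank you so much."])
```

The copy pairs are simply concatenated with the real and back-translated training data; the model learns that some inputs should pass through mostly unchanged.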
Word Substitution Augmentation
Marzieh Fadaee, Arianna Bisazza, Christof Monz. Data Augmentation for Low-Resource Neural Machine Translation. ACL 2017.
○ Use a language model to find rare words that could appear in that context, and substitute them into the source sentence
○ Alternatively, find substitution candidates as words with similar embeddings
○ Word alignments (e.g. from alignment models trained using the EM algorithm, also useful for dictionary learning, analysis, supervised attention, etc.) identify the corresponding word position
○ Replace the aligned word in the target sentence using a bilingual dictionary
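A minimal sketch of the substitution step, assuming the language model has already proposed a rare replacement word and an alignment is available; the example words and dictionary are hypothetical.

```python
def substitute_word(src_tokens, tgt_tokens, align, pos, rare_word, bi_dict):
    """Swap a rare word (proposed by a language model) into the source
    sentence and update the aligned target word via a bilingual
    dictionary. align maps source positions to target positions."""
    src, tgt = list(src_tokens), list(tgt_tokens)
    src[pos] = rare_word
    # Only update the target if we know both the alignment and translation.
    if pos in align and rare_word in bi_dict:
        tgt[align[pos]] = bi_dict[rare_word]
    return src, tgt

# Hypothetical example: replace "car" with the rarer "motorcycle".
new_src, new_tgt = substitute_word(
    ["I", "bought", "a", "new", "car"],
    ["J'ai", "acheté", "une", "nouvelle", "voiture"],
    align={4: 4}, pos=4, rare_word="motorcycle",
    bi_dict={"motorcycle": "moto"})
```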
Handling Syntactic Divergence
○ What if the source and target languages have divergent word order?
FRA: J'ai acheté une nouvelle voiture / ENG: I bought a new car / JPN: 私 は 新しい 車 を 買った (gloss: I [topic] new car [obj] bought)
○ Related: unsupervised MT methods that use monolingual pre-training (Lample et al., 2017)
○ Zhou et al. (2019): reorder sentences toward LRL word order, then translate word-by-word to create pseudo-parallel data for augmentation
Lample, Guillaume, et al. "Unsupervised machine translation using monolingual corpora only." arXiv preprint arXiv:1711.00043 (2017).
Zhou, Chunting, et al. "Handling Syntactic Divergence in Low-resource Machine Translation." EMNLP 2019.
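The reorder-then-translate idea can be sketched as below; the SVO→SOV reorder rule and the tiny dictionary are toy stand-ins for the learned components in Zhou et al.'s method.

```python
def word_by_word_pseudo(sentences, bi_dict, reorder=lambda toks: toks):
    """Reorder each high-resource-side sentence toward the LRL's word
    order, then translate word by word with a bilingual dictionary,
    keeping out-of-dictionary words unchanged. The reorder function is
    a stand-in for a learned or rule-based reordering model."""
    pairs = []
    for sent in sentences:
        toks = reorder(sent.split())
        pseudo_src = " ".join(bi_dict.get(w, w) for w in toks)
        pairs.append((pseudo_src, sent))
    return pairs

# Toy SVO -> SOV reordering: move the verb (second token) to the end.
to_sov = lambda toks: [toks[0]] + toks[2:] + [toks[1]]
toy_dict = {"I": "私", "bought": "買った", "new": "新しい", "car": "車"}
pairs = word_by_word_pseudo(["I bought a new car"], toy_dict, to_sov)
```

Words missing from the dictionary (here "a") pass through untranslated, so real pipelines pair this with larger induced dictionaries.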
Discussion
○ Think of a low-resource language pair you're familiar with. Which data augmentation strategy would you apply, and why that method? Are there any improvements you can think of?