Multilingual Training and Cross-lingual Transfer
Xinyi Wang
CMU CS11-737: Multilingual NLP
Many languages are left behind: there is not enough monolingual data for many languages, and even less annotated data for tasks like NMT or sequence labeling.
Data Source: Wikipedia articles from different languages
[Figure: transfer learning for NMT. A parent model θ is first trained on a high-resource pair (English-French); the trained parameters then initialize a child model θ that is trained on a low-resource pair (Uzbek-English).]
Transfer Learning for Low-Resource Neural Machine Translation. Zoph et al. 2016
The problem: separate bilingual models do not scale. With just four languages (Eng, Tur, Aze, Kor), every directed pair needs its own system, i.e. 4*3 = 12 NMT models, so this approach only works for a small number of languages (e.g., ~5 languages in the paper).
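The pair count grows quadratically with the number of languages; a quick sanity check (plain Python, the language counts are just examples):

```python
# Separate bilingual systems need one model per directed pair:
# n languages -> n * (n - 1) models.
for n in [4, 10, 100]:
    print(f"{n} languages -> {n * (n - 1)} models")
# 4 languages -> 12 models
# 10 languages -> 90 models
# 100 languages -> 9900 models
```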
[Figure: multilingual joint training. A single model θ is trained on the concatenated parallel data of many language pairs: English-French, English-Hindi, …, English-Turkish.]
Google's Multilingual Neural Machine Translation System. Johnson et al. 2016
How does one model know which language to produce? Add a target language label: prepend a token marking the desired output language to each source sentence.
[Figure: the same model θ maps '<2fr> How are you?' → 'Comment ça va?', '<2es> How are you?' → 'cómo estás?', …, '<2tr> How are you?' → 'nasılsın?'.]
Google's Multilingual Neural Machine Translation System. Johnson et al. 2016
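A minimal sketch of this preprocessing step (the '<2xx>' tag format follows the slide; the function and corpus names are illustrative):

```python
def add_target_tag(src: str, tgt_lang: str) -> str:
    """Prepend a target-language token such as '<2fr>' to the source."""
    return f"<2{tgt_lang}> {src}"

# Merge several parallel corpora into one tagged training set.
corpora = {
    "fr": [("How are you?", "Comment ça va?")],
    "es": [("How are you?", "cómo estás?")],
    "tr": [("How are you?", "nasılsın?")],
}
train_pairs = [(add_target_tag(src, lang), tgt)
               for lang, pairs in corpora.items()
               for src, tgt in pairs]
print(train_pairs[0])  # ('<2fr> How are you?', 'Comment ça va?')
```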
Which of these methods should we reach for when we suddenly need a model for a new language?
This happens in practice: during the COVID-19 pandemic, public-health information urgently had to be translated into various languages (https://www.wired.com/story/covid-language-translation-problem/). The TICO-19 initiative collected COVID-19 translation data for various languages (https://tico-19.github.io/).
[Figure: rapid adaptation. A multilingual model θ is pretrained on many language pairs ({French, Hindi, …, Turkish}-English), then used to initialize a model fine-tuned on the new low-resource pair, Belarusian-English.]
Rapid Adaptation of Neural Machine Translation to New Languages. Neubig et al. 2018
To avoid overfitting to the tiny new-language corpus, fine-tune on both the new language and its related high-resource language.
[Figure: the pretrained multilingual model initializes a model that is fine-tuned jointly on Belarusian-English and the related high-resource pair, Russian-English.]
Rapid Adaptation of Neural Machine Translation to New Languages. Neubig et al. 2018
Adapting a pretrained multilingual model in this way also reaches convergence faster than training from scratch.
Rapid Adaptation of Neural Machine Translation to New Languages. Neubig et al. 2018
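A minimal sketch of this similar-language regularization (the corpus variables and the 1:1 mixing ratio are illustrative assumptions, not the paper's exact recipe):

```python
import random

def mixed_finetune_corpus(new_lang_pairs, related_lang_pairs, ratio=1.0):
    """Mix the new language's tiny corpus (e.g. Belarusian-English)
    with data from a related high-resource language (e.g. Russian-
    English) so fine-tuning does not overfit the tiny corpus."""
    n_related = min(int(ratio * len(new_lang_pairs)), len(related_lang_pairs))
    corpus = new_lang_pairs + random.sample(related_lang_pairs, n_related)
    random.shuffle(corpus)
    return corpus
```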
Meta-Learning for Low-Resource Neural Machine Translation. Gu et al. 2018
What if there is no parallel training data in a language pair at all? Often we only have monolingual data in that language, or annotated data for other languages.
Example: Zulu-Italian translation.
Zulu-English: probably some Bible data.
Italian-English: news, European Parliament documents, ….
Zulu-Italian: unfortunately not much data available.
Can we still translate Zulu to Italian without Zulu-Italian parallel data?
[Figure: zero-shot translation. Training: '<2en> Zulu-English src' → Zulu-English trg; '<2it> English-Italian src' → English-Italian trg; '<2en> Italian-English src' → Italian-English trg. Testing: '<2it> Sawubona' → 'Ciao', a Zulu-Italian direction never seen in training.]
Google's Multilingual Neural Machine Translation System. Johnson et al. 2016
We can also leverage monolingual data with self-supervision: train the model to reconstruct the original sentence from a noisy version of the monolingual data.
[Figure: training mixes the supervised objective ('<2en> Zulu-English src' → Zulu-English trg) with a denoising objective on monolingual data ('<2it> noised(Italian)' → Italian). Testing: '<2it> Sawubona' → 'Ciao'.]
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation. Siddhant et al. 2019
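The denoising objective needs a noising function. A minimal sketch using random token masking (illustrative; the paper's exact noising scheme differs):

```python
import random

def noised(tokens, mask_prob=0.15, mask_token="<mask>"):
    """Randomly mask tokens; the model learns to reconstruct the
    original sentence from this corrupted input."""
    return [mask_token if random.random() < mask_prob else t
            for t in tokens]

sent = "il gatto è sul tavolo".split()
# Training pair: '<2it> noised(Italian)' -> original Italian sentence.
print(" ".join(["<2it>"] + noised(sent)), "->", " ".join(sent))
```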
Another ingredient for zero-shot translation: push the encoder toward a language-invariant representation by adding a similarity loss between the representations of a sentence and its translation.
The Missing Ingredient in Zero-Shot Neural Machine Translation. Arivazhagan et al. 2019
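One concrete form such a regularizer can take (a sketch assuming mean-pooled encoder states and cosine similarity; the paper studies several alignment variants):

```python
import torch.nn.functional as F

def alignment_loss(enc_src, enc_tgt):
    """Similarity loss between mean-pooled encoder states of a sentence
    and its translation; minimizing it pushes both languages toward a
    shared, language-invariant representation.
    enc_src, enc_tgt: tensors of shape (batch, seq_len, hidden)."""
    src_vec = enc_src.mean(dim=1)  # pool over source tokens
    tgt_vec = enc_tgt.mean(dim=1)  # pool over target tokens
    return 1.0 - F.cosine_similarity(src_vec, tgt_vec, dim=-1).mean()
```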
The same effect shows up in pretrained encoders. Multilingual BERT (mBERT) is pretrained on Wikipedia text from many different languages. Fine-tune it on annotated data in one language (e.g., English), then test on data from a language different from the fine-tuned language (e.g., French): it works surprisingly well, so mBERT must learn a cross-lingual representation!
How Multilingual is Multilingual BERT? Pires et al. 2019
Transfer even works between languages with little vocabulary overlap.
[Figure: zero-shot transfer performance plotted against vocabulary overlap between language pairs.]
How Multilingual is Multilingual BERT? Pires et al. 2019
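A sketch of the zero-shot recipe with the public mBERT checkpoint (the French example sentence and the two-label task are assumptions; the fine-tuning loop itself is elided):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Zero-shot recipe: fine-tune mBERT on English labels only,
# then run the very same model on another language.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# ... fine-tune `model` on English annotated data here ...

# Evaluate on French without ever seeing French labels:
batch = tokenizer(["Ce film était excellent."], return_tensors="pt")
logits = model(**batch).logits  # class scores for the French input
```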
Scaling up: massively multilingual NMT trains one model on over a hundred language pairs. Training data is extremely imbalanced between high-resource and low-resource target languages, so the sampling strategy over languages matters; the paper uses temperature-based sampling rather than sampling each pair in proportion to its data size.
[Figure: training-data sizes across language pairs.]
Massively Multilingual Neural Machine Translation in the Wild. Arivazhagan et al. 2019
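A minimal sketch of temperature-based sampling (the corpus sizes below are made up; the paper reports using a temperature around T=5):

```python
def sampling_probs(sizes, T=5.0):
    """Temperature-based sampling: draw language i with probability
    proportional to (n_i / N) ** (1 / T). T=1 follows the raw data
    distribution; larger T moves toward uniform sampling."""
    N = sum(sizes.values())
    weights = {lang: (n / N) ** (1.0 / T) for lang, n in sizes.items()}
    Z = sum(weights.values())
    return {lang: w / Z for lang, w in weights.items()}

# Hypothetical corpus sizes: French dwarfs Zulu, but with T=5 the
# sampling ratio is far less skewed than the raw 400:1.
print(sampling_probs({"fr": 40_000_000, "zu": 100_000}))
```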
A learned alternative to heuristic sampling: Balancing Training for Multilingual Neural Machine Translation. Wang et al. 2020
[Figure: differentiable data selection. A scorer with parameters ψt defines a distribution PD(i; ψt) over the training sets D1^train, …, Dn^train. At each step, the model θt is updated with ∇θJ(Di^train; θt) on a sampled language, and the scorer is rewarded by how well that update aligns with the dev gradient ∇θJdev(θ′t+1, Ddev) computed over D1^dev, …, Dn^dev.]
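A rough sketch of the scorer update under these definitions (gradient vectors are assumed pre-flattened; the real algorithm interleaves this with model updates):

```python
import numpy as np

def update_scorer(psi, train_grads, dev_grad, lr=0.1):
    """One scorer step: P_D(i; psi) is a softmax over languages; each
    language's reward is the cosine alignment between its training
    gradient and the dev gradient, and psi moves up the exact gradient
    of the expected reward (REINFORCE with a baseline).
    train_grads: list of flattened per-language gradient vectors."""
    p = np.exp(psi - psi.max())
    p /= p.sum()                              # P_D(i; psi)
    rewards = np.array([
        g @ dev_grad / (np.linalg.norm(g) * np.linalg.norm(dev_grad) + 1e-9)
        for g in train_grads
    ])
    baseline = rewards @ p                    # expected reward under P_D
    return psi + lr * p * (rewards - baseline)
```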
[Figure: a single massively multilingual model improves low-resource languages through transfer but degrades high-resource languages through interference, a capacity trade-off.]
Massively Multilingual Neural Machine Translation in the Wild. Arivazhagan et al. 2019
One way to ease this capacity bottleneck: group similar languages and train a separate model for each language cluster.
[Figure: Model 1: English-French; Model 2: English-Chinese; …; Model N: English-Zulu.]
Another: knowledge distillation, where individual bilingual models (Model 1, …, Model N) act as teachers for a single multilingual model that translates French, Chinese, …, Zulu to English.
Multilingual Neural Machine Translation with Knowledge Distillation. Tan et al. 2019
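A token-level sketch of such a distillation loss (lam and T are illustrative hyper-parameters, not the paper's values):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      lam=0.5, T=1.0):
    """Token-level distillation: mix cross-entropy against the
    reference translation with a KL term matching the bilingual
    teacher's output distribution.
    student_logits, teacher_logits: (num_tokens, vocab);
    labels: (num_tokens,)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1.0 - lam) * ce + lam * kd
```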
A third option: keep one shared model and add small language-specific adapter layers. Simple, Scalable Adaptation for Neural Machine Translation. Bapna et al. 2019
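A sketch of the residual bottleneck adapter idea (the dimensions are illustrative):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: layer norm, down-projection, ReLU,
    up-projection, plus a residual connection. The shared model stays
    frozen; only these small per-language modules are trained."""
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x):
        return x + self.up(nn.functional.relu(self.down(self.norm(x))))
```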
[Figure: one-to-many vs. many-to-one translation quality on high-resource and low-resource languages.]
Many-to-one training tends to gain more: the decoder of a many-to-one model benefits more from always generating the same target language.
Massively Multilingual Neural Machine Translation in the Wild. Arivazhagan et al. 2019
Discussion: should the vocabularies of different languages be shared or kept separate? Handling vocabulary for many languages is non-trivial; a common recipe is to concatenate the corpora and do joint BPE on all the data, but is that a good choice for morphologically rich languages?
Discussion: how do we compare multilingual models, e.g. between (en-fr: 40, en-zh: 15) vs. (en-fr: 35, en-zh: 20)? Does +5 BLEU on en-zh have the same "benefit" as +5 BLEU on en-fr?
Reading: "Massively Multilingual Neural Machine Translation in the Wild" (https://arxiv.org/pdf/1907.05019.pdf)
Discussion question: what do you think is the biggest problem of multilingual NMT, and what is the experiment or analysis from the paper that explains this problem? Can you think of any potential solutions?