Cross-lingual language model pretraining
Alexis Conneau and Guillaume Lample, Facebook AI Research
Why learn cross-lingual representations?

[Figure: the same sentence in three languages: "This is great." / "C'est super." / "Das ist toll."]
… multilingual representations emerge from a single MLM trained on many languages.
Multilingual masked language modeling (MLM) pretraining
Similar to BERT, we pretrain a Transformer model with MLM, but on many languages:
Devlin et al. – BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019) (+ mBERT)
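As an illustration, here is a minimal sketch of the masking step, assuming BERT's 15% masking rate and 80/10/10 replacement split; the toy vocabulary, sentences, and helper names are illustrative, not the paper's code:

```python
import random

MASK = "[MASK]"
# Toy stand-in for the shared multilingual (BPE) vocabulary.
VOCAB = ["this", "is", "great", ".", "c'", "est", "super", "das", "ist", "toll"]

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style masking: return (corrupted inputs, prediction targets)."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                      # model must recover this token
            r = random.random()
            if r < 0.8:
                inputs.append(MASK)                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            targets.append(None)                     # no loss on unmasked positions
    return inputs, targets

# A single model and vocabulary see monolingual streams from every language.
for sentence in [["this", "is", "great", "."], ["c'", "est", "super", "."]]:
    print(mask_tokens(sentence))
```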
Multilingual MLM is unsupervised, but we leverage parallel data with TLM:
Translation language modeling (TLM) pretraining
… to encourage the model to leverage cross-lingual context when making predictions.
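Building on the mask_tokens() helper from the MLM sketch above, a TLM training example can be sketched as follows; the separator token and language tags are assumptions about the input layout, not the exact XLM format:

```python
def tlm_example(src_tokens, tgt_tokens, src_lang="en", tgt_lang="fr"):
    """Build one TLM example from a parallel sentence pair."""
    # Concatenate the pair so masked words in one language can be
    # predicted from their translation in the other.
    pair = src_tokens + ["</s>"] + tgt_tokens
    inputs, targets = mask_tokens(pair)  # mask tokens on *both* sides
    # Language tags mark which language each position belongs to
    # (fed to the model as language embeddings).
    langs = [src_lang] * (len(src_tokens) + 1) + [tgt_lang] * len(tgt_tokens)
    return inputs, targets, langs

print(tlm_example(["this", "is", "great", "."], ["c'", "est", "super", "."]))
```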
The pretrained encoder (XLM) is fine-tuned on the English XNLI(*) training data and then tested on all 15 languages.
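A minimal PyTorch sketch of this zero-shot recipe, with a tiny randomly initialized encoder standing in for the pretrained XLM weights (sizes, names, and the toy batches are assumptions for illustration):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, DIM = 95000, 128  # toy sizes; stand-in for the pretrained encoder

class NLIClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # load pretrained weights here in practice
        self.head = nn.Linear(DIM, 3)  # entailment / neutral / contradiction

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, dim)
        return self.head(h[:, 0])                # classify from the first position

model = NLIClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

# Fine-tune on English XNLI only (toy random batch stands in for real data).
en_tokens = torch.randint(0, VOCAB_SIZE, (8, 16))
en_labels = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(en_tokens), en_labels)
loss.backward()
opt.step()

# Zero-shot transfer: the same model is evaluated on the other languages
# with no further training, relying on the shared cross-lingual encoder.
fr_tokens = torch.randint(0, VOCAB_SIZE, (8, 16))
predictions = model(fr_tokens).argmax(dim=-1)
```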
Average XNLI accuracy over the 15 languages (zero-shot cross-lingual classification):

  XLM (MLM+TLM)   75.1
  XLM (MLM)       71.5
  LASER           70.2
  mBERT           66.3
  XNLI baseline   65.6
(*) Conneau et al. – XNLI: Evaluating Cross-lingual Sentence Representations (EMNLP 2018)
Initialization is key in unsupervised MT to bootstrap the iterative back-translation (BT) process.
Embedding layer initialization is essential for neural unsupervised MT (*)
Initializing the full Transformer model significantly improves performance (+7 BLEU)
(*) Lample et al. – Phrase-based and neural unsupervised machine translation (EMNLP 2018)
Unsupervised MT results (BLEU):

  Supervised 2016 SOTA (Edinburgh)   36.2
  Full model pretrained (MLM)        34.3
  Full model pretrained (CLM)        30.5
  Embeddings pretrained              27.3
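To make the bootstrapping concrete, here is a schematic of the iterative back-translation loop that the pretrained initialization kick-starts; translate and train_on are hypothetical stand-ins for decoding and a supervised training step, not the authors' code:

```python
def unsupervised_mt(src2tgt, tgt2src, mono_src, mono_tgt, rounds=3):
    """Iterative back-translation between two translation models.

    Both models start from the same pretrained cross-lingual LM weights:
    with a poor initialization, the first synthetic translations are too
    noisy for the loop to take off.
    """
    for _ in range(rounds):
        # Back-translate target monolingual text into synthetic sources,
        # yielding (noisy source, clean target) pairs to train src -> tgt.
        synthetic_src = [tgt2src.translate(t) for t in mono_tgt]
        src2tgt.train_on(list(zip(synthetic_src, mono_tgt)))
        # Symmetric update for the opposite direction.
        synthetic_tgt = [src2tgt.translate(s) for s in mono_src]
        tgt2src.train_on(list(zip(mono_src, synthetic_tgt)))
    return src2tgt, tgt2src
```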
We also show the importance of pretraining for generation:
Pretraining the decoder improves the BLEU score.
MLM pretraining leads to the best BLEU score, especially when supervised data is small.

[Chart: BLEU (20-40) for "No pretraining", "Full model pretrained (CLM)", and "Full model pretrained (MLM)", each with and without back-translation.]
Code and models available at github.com/facebookresearch/XLM
Lample & Conneau – Cross-lingual Language Model Pretraining (NeurIPS 2019)