Cross-lingual Language Model Pretraining



SLIDE 1

Cross-lingual language model pretraining

Alexis Conneau and Guillaume Lample Facebook AI Research

SLIDE 2

Why learn cross-lingual representations?

[Figure: the same sentence in English, French, and German ("This is great." / "C'est super." / "Das ist toll.") mapped to nearby points in a shared embedding space]

SLIDE 3

Cross-lingual language models

SLIDE 4
Multilingual Masked Language Modeling (MLM)

Similar to BERT, we pretrain a Transformer model with MLM, but on many languages at once: multilingual representations emerge from a single MLM trained on many languages.

Devlin et al. – BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (+ mBERT)
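To make the objective concrete, here is a minimal, self-contained sketch of BERT-style MLM input corruption applied to a mixed-language token stream. The 15%/80-10-10 masking scheme follows BERT; the tiny vocabulary and whitespace tokens are illustrative assumptions, not the exact XLM preprocessing:

```python
import random

MASK = "[MASK]"
# Toy shared vocabulary covering several languages (illustrative only).
VOCAB = ["this", "is", "great", "c'est", "super", "das", "ist", "toll"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: each token is selected with prob. mask_prob;
    a selected token becomes [MASK] (80%), a random vocabulary token (10%),
    or stays unchanged (10%). Labels keep the originals at selected positions."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the model is trained to predict this
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)
            # else: keep the original token (10% of selected positions)
    return inputs, labels

# A single model sees sentences from every language in one stream:
for sent in [["this", "is", "great"], ["c'est", "super"], ["das", "ist", "toll"]]:
    print(mask_tokens(sent, mask_prob=0.5, seed=1))
```

Because the same parameters, vocabulary, and objective are shared across all languages, aligned representations can emerge without any parallel supervision.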

SLIDE 5

Translation Language Modeling (TLM)

Multilingual MLM is fully unsupervised, but when parallel data is available we can exploit it with TLM: a sentence and its translation are concatenated into a single input and words are masked on both sides, which encourages the model to leverage cross-lingual context when making predictions.
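A TLM training example can be assembled as in the sketch below. The conventions (separator tokens, language IDs, and position indices that restart at zero on the target side) follow the XLM paper, but the exact symbols here are assumptions:

```python
def make_tlm_example(src_tokens, tgt_tokens, src_lang, tgt_lang, sep="</s>"):
    """Concatenate a sentence and its translation into one TLM input.
    Language IDs mark which side each token belongs to, and position
    indices restart at 0 for the target sentence."""
    tokens = [sep] + src_tokens + [sep] + tgt_tokens + [sep]
    src_len = len(src_tokens) + 2          # leading and middle separators
    tgt_len = len(tgt_tokens) + 1          # trailing separator
    langs = [src_lang] * src_len + [tgt_lang] * tgt_len
    positions = list(range(src_len)) + list(range(tgt_len))
    return tokens, langs, positions

tokens, langs, positions = make_tlm_example(
    ["this", "is", "great"], ["c'est", "super"], "en", "fr")
print(tokens)     # ['</s>', 'this', 'is', 'great', '</s>', "c'est", 'super', '</s>']
print(positions)  # position indices restart at 0 on the French side
```

Masking then proceeds exactly as in MLM, but over the concatenated pair: to recover a masked English word, the model can attend to the French translation, and vice versa.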

SLIDE 6

Results on XLU benchmarks

SLIDE 7

Results on Cross-lingual Classification (XNLI)

The pretrained encoder is fine-tuned on the English XNLI(*) training data only and then evaluated on all 15 XNLI languages (zero-shot cross-lingual transfer).

Average XNLI accuracy for zero-shot cross-lingual classification over the 15 XNLI languages:

  Model            Avg. accuracy (%)
  XLM (MLM+TLM)    75.1
  XLM (MLM)        71.5
  LASER            70.2
  mBERT            66.3
  XNLI baseline    65.6

(*) Conneau et al. – XNLI: Evaluating Cross-lingual Sentence Representations (EMNLP 2018)
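The zero-shot recipe can be illustrated with a deliberately tiny toy: assume a shared encoder already maps translations to the same point in a common space (which is what the pretraining above aims for); then a classifier fit on English labels alone transfers to other languages unchanged. Everything below (the lookup-table "encoder", the threshold classifier) is an illustrative assumption, not the actual XLM fine-tuning code:

```python
# Toy "shared encoder": translations map to the same scalar representation.
ENCODER = {
    "this is great": 1.0, "c'est super": 1.0, "das ist toll": 1.0,
    "this is bad": -1.0, "c'est nul": -1.0, "das ist schlecht": -1.0,
}

def fit_english_classifier(english_examples):
    """'Fine-tune' on English only: pick a threshold separating the classes."""
    pos = [ENCODER[s] for s, label in english_examples if label == "positive"]
    neg = [ENCODER[s] for s, label in english_examples if label == "negative"]
    return (min(pos) + max(neg)) / 2.0

def predict(threshold, sentence):
    return "positive" if ENCODER[sentence] > threshold else "negative"

threshold = fit_english_classifier(
    [("this is great", "positive"), ("this is bad", "negative")])
# Zero-shot: the same classifier, evaluated on French and German.
print(predict(threshold, "c'est super"))       # positive
print(predict(threshold, "das ist schlecht"))  # negative
```

The quality of the transfer hinges entirely on how well the encoder aligns languages, which is what the table above measures.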

SLIDE 8

Results on Unsupervised Machine Translation


Initialization is key in unsupervised MT: it bootstraps the iterative back-translation (BT) process.

  • Embedding-layer initialization is essential for neural unsupervised MT (*)
  • Initializing the full Transformer model significantly improves performance (+7 BLEU)

(*) Lample et al. – Phrase-based and neural unsupervised machine translation (EMNLP 2018)

Unsupervised MT results (BLEU):

  Supervised 2016 SOTA (Edinburgh)   36.2
  Full model pretrained (MLM)        34.3
  Full model pretrained (CLM)        30.5
  Embeddings pretrained              27.3
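The iterative back-translation loop that this initialization bootstraps can be sketched as follows. The dictionary "translators" stand in for the two translation directions; a real system would retrain neural models on the synthetic pairs after each round:

```python
def back_translation_round(fwd, bwd, mono_src, mono_tgt):
    """One round of iterative back-translation: each direction translates
    monolingual text to produce synthetic parallel data for the other.
    fwd: src->tgt translator, bwd: tgt->src translator."""
    # (noisy tgt, clean src) pairs are training data for the tgt->src model ...
    synth_for_bwd = [(fwd(s), s) for s in mono_src]
    # ... and (noisy src, clean tgt) pairs are training data for src->tgt.
    synth_for_fwd = [(bwd(t), t) for t in mono_tgt]
    return synth_for_fwd, synth_for_bwd

# Toy translators: whatever the (pretrained) initial models currently produce.
en_fr = {"great": "super", "bad": "nul"}.get
fr_en = {"super": "great", "nul": "bad"}.get

for_fwd, for_bwd = back_translation_round(en_fr, fr_en, ["great"], ["nul"])
print(for_fwd)  # [('bad', 'nul')]: synthetic (en, fr) pair for the en->fr model
print(for_bwd)  # [('super', 'great')]: synthetic (fr, en) pair for fr->en
```

If the initial models produce nothing useful, the loop has nothing to improve on; that is why pretrained initialization is decisive for the final BLEU.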

SLIDE 9

Results on Supervised Machine Translation


We also show the importance of pretraining for (supervised) generation:

  • Pretraining both the encoder and the decoder improves the BLEU score
  • MLM pretraining works better than causal LM (CLM) pretraining
  • Back-translation combined with pretraining leads to the best BLEU score
  • Pretraining matters most when the supervised data is small

[Chart: BLEU with and without back-translation for no pretraining, CLM full-model pretraining, and MLM full-model pretraining]

SLIDE 10

Conclusion

  • Cross-lingual language model pretraining is very effective for XLU
  • New state of the art for cross-lingual classification on XNLI
  • Reduces the gap between unsupervised and supervised MT
  • Recent developments have improved XLM/mBERT models


SLIDE 11

Thank you!

Code and models available at github.com/facebookresearch/XLM


Lample & Conneau – Cross-lingual Language Model Pretraining (NeurIPS 2019)