Pseudo-Masked Language Models for Unified Language Model Pre-Training


SLIDE 1

Pseudo-Masked Language Models for Unified Language Model Pre-Training

ICML-2020

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon

SLIDE 2

Unified Pre-Training Framework

BERT, RoBERTa (bidirectional LM) · GPT (unidirectional LM) · T5, BART (sequence-to-sequence LM)

Downstream tasks paired with pre-training tasks:

• Language generation (text generation): story/news generation, …
• Language understanding: intent classification, entity recognition, question answering, …
• Language generation (sequence-to-sequence): summary generation, question generation, response generation, machine translation, …

• Bidirectional LM: all tokens can see each other.
• Unidirectional (left-to-right) LM: a token can only see its left context.
• Sequence-to-sequence LM: 1) the given input is bidirectionally encoded; 2) the output is unidirectionally decoded.

[Diagrams: a bidirectional encoder (x1 … x5 → Transformer blocks 1 … L → h1 … h5); an encoder-decoder (an encoder over x1 … x5 feeding a decoder over x0, y1 … y4); a left-to-right decoder; and UniLM, a single Transformer stack that realizes all three LMs with different self-attention masks.]
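
These three patterns differ only in the self-attention mask applied to one shared Transformer. As a minimal sketch (PyTorch code of mine, not the authors' implementation; the packed source/target layout and the function name are assumptions), the masks could be built like this:

```python
import torch

def lm_attention_masks(src_len: int, tgt_len: int):
    """Boolean masks (True = may attend) for the three LM flavors."""
    n = src_len + tgt_len  # one packed sequence: source segment, then target

    # Bidirectional LM (BERT-style): every token sees every token.
    bidirectional = torch.ones(n, n, dtype=torch.bool)

    # Unidirectional (left-to-right) LM (GPT-style): token i sees positions <= i.
    unidirectional = torch.tril(torch.ones(n, n, dtype=torch.bool))

    # Sequence-to-sequence LM: all tokens see the whole source;
    # target tokens additionally see their own left context in the target.
    seq2seq = torch.zeros(n, n, dtype=torch.bool)
    seq2seq[:, :src_len] = True
    seq2seq[src_len:, src_len:] = torch.tril(
        torch.ones(tgt_len, tgt_len, dtype=torch.bool)
    )
    return bidirectional, unidirectional, seq2seq
```

In practice such masks are applied as additive -inf biases on the attention logits, which is how one set of parameters can emulate all three model families.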

SLIDE 3

• Bidirectional encoder → NLU: text classification, entity recognition, question answering, …
• Unidirectional decoder → NLG: synthetic text generation, …
• Encoder-decoder → NLG (sequence-to-sequence): text summarization, question generation, …

1) Unified Modeling · 2) Multitask-Style Pre-Training

Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS 2019.

UniLM v1

SLIDE 4

[Diagram: UniLM, one Transformer stack (x1 … x5 → Transformer blocks 1 … L → h1 … h5) shared across the bidirectional, sequence-to-sequence, and unidirectional LM objectives.]

Motivation of UniLM v2

(v1) One training example for each type of LM

  • Three types of LMs
  • Three forward passes with different self-attention masks

[Diagram: a training batch made up of separate training examples, one per LM type.]

How to train multiple LMs in one forward pass?

SLIDE 5

Pseudo-Masked Language Model

[Diagram: input y1 [M] y3 [M] [M] y6 with y2, y4, y5 masked; at step t=1 the model predicts y4 and y5, at t=2 it predicts y2.]

Bidirectional LM Task (for NLU)

1. Bidirectionally encode the context tokens.
2. Predict all masked spans at the same time.

Sequence-to-Sequence LM Task (for NLG)

1. Bidirectionally encode the context tokens.
2. Predict the masked spans one by one (e.g., y4, y5 → y2), as in the factorization below:
   1. Predict y4 and y5.
   2. Encode y4 and y5 (i.e., fill in what we have predicted).
   3. Predict y2.
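
Written out for the slide's example (notation mine; c = (y1, y3, y6) is the observed context), the two tasks correspond to two factorizations of the same masked set:

```latex
% Bidirectional LM (autoencoding): all masked tokens predicted at once,
% conditionally independent given the context.
p(y_2, y_4, y_5 \mid c) \approx p(y_2 \mid c)\, p(y_4 \mid c)\, p(y_5 \mid c)

% Sequence-to-sequence LM (partially autoregressive): span by span,
% here in the factorization order \{y_4, y_5\} \to \{y_2\}.
p(y_2, y_4, y_5 \mid c) = p(y_4, y_5 \mid c)\, p(y_2 \mid c, y_4, y_5)
```

Tokens inside one span (here y4, y5) are predicted at the same time, so each span is a single factorization step.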

SLIDE 6

Pseudo-Masked Language Model

(Same example and task descriptions as Slide 5.)

Observation 1: context encoding can be reused.

SLIDE 7

[Diagram: the Slide 5 example, now with three kinds of tokens at the masked positions: context masks [M], appended pseudo masks [P], and the original tokens. The bidirectional and sequence-to-sequence LM tasks are as described on Slide 5.]

Observation 1: context encoding can be reused.
Observation 2: masked positions have three roles:
(1) context masks [M]
(2) pseudo masks [P]
(3) original tokens

Pseudo-Masked Language Model
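
Below is a minimal sketch of how such a pseudo-masked example can be laid out, using the running example (y2, y4, y5 masked, order {y4, y5} → {y2}). This is my simplified reading of the mechanism, not the released UniLM code: the role tags and the visibility rules are assumptions that follow the slide's three-roles description.

```python
import torch

M, P = "[M]", "[P]"
tokens = ["y1", "y2", "y3", "y4", "y5", "y6"]
blocks = [[3, 4], [1]]            # 0-based masked spans, in prediction order

seq, pos, kind = [], [], []       # token string, position id, role tag
masked = {i for b in blocks for i in b}
for i, t in enumerate(tokens):    # context: originals, with [M] at masked slots
    seq.append(M if i in masked else t)
    pos.append(i)
    kind.append("ctx")
for step, block in enumerate(blocks):
    for i in block:               # pseudo masks reuse the masked position ids
        seq.append(P); pos.append(i); kind.append(f"p{step}")
    for i in block:               # originals appended so later steps can reuse them
        seq.append(tokens[i]); pos.append(i); kind.append(f"o{step}")

# Visibility matrix (True = may attend), under simplified rules:
#   - every row sees the context (the AE part is fully bidirectional over it);
#   - pseudo masks of step t additionally see appended originals of steps < t;
#   - appended originals of step t see appended originals of steps <= t.
n = len(seq)
vis = torch.zeros(n, n, dtype=torch.bool)
for q in range(n):
    for k in range(n):
        if kind[k] == "ctx":
            vis[q, k] = True
        elif kind[q][0] == "p" and kind[k][0] == "o":
            vis[q, k] = int(kind[k][1:]) < int(kind[q][1:])
        elif kind[q][0] == "o" and kind[k][0] == "o":
            vis[q, k] = int(kind[k][1:]) <= int(kind[q][1:])

print(list(zip(seq, pos, kind)))  # one sequence, one forward pass, both LM tasks
```

Because every token row attends to the context block, the context is encoded once and reused by both tasks (Observation 1), while the [M]/[P]/original triple at each masked position realizes the three roles (Observation 2).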

SLIDE 8

[Diagram: each input token embedding is the sum of a token embedding and a position embedding; mask and pseudo-mask tokens reuse the position ids of the masked positions they stand in for.]

Bidirectional LM (autoencoding) · Sequence-to-Sequence LM (partially autoregressive)

(TL;DR) UniLM v2: unified pre-training of a bidirectional LM (via autoencoding) and a sequence-to-sequence LM (via partially autoregressive modeling) with a pseudo-masked language model, for both language understanding and generation.

• Transformer self-attention treats tokens with the same position embeddings as the same "token" at that position.
• The pseudo-masked LM can be used to efficiently realize different pre-training objectives, such as AE (autoencoding), AR (autoregressive), PAR (partially autoregressive), AE + AR, and AE + PAR, among which AE + PAR performs the best.
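
The first bullet can be made concrete with a toy embedding layer (sizes and names mine, for illustration only): [M], [P], and the original token at a masked position receive the same position embedding, so self-attention treats all three as stand-ins for that slot.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"y2": 0, "[M]": 1, "[P]": 2}
tok_emb = nn.Embedding(len(vocab), 8)   # toy sizes, for illustration only
pos_emb = nn.Embedding(16, 8)

slot = torch.tensor([1])                # the masked position (position id 1)
for t in ("y2", "[M]", "[P]"):
    x = tok_emb(torch.tensor([vocab[t]])) + pos_emb(slot)
    # the positional component pos_emb(slot) is identical across all three,
    # so the vectors differ only in their token identity
    print(t, x[0, :3])
```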

SLIDE 9

Pre-Training Objectives

• Autoencoding (AE)
• Autoregressive (AR)
• Partially autoregressive (PAR)

Goal: encourage the pre-trained model to learn and use global context (long-distance dependencies).
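
In formulas (notation mine: M is the set of masked positions, x_{\setminus M} the corrupted context, and M_1, …, M_T the masked spans in factorization order):

```latex
\mathcal{L}_{\mathrm{AE}}  = -\sum_{i \in M} \log p\!\left(x_i \mid x_{\setminus M}\right)

\mathcal{L}_{\mathrm{PAR}} = -\sum_{t=1}^{T} \sum_{i \in M_t}
  \log p\!\left(x_i \mid x_{\setminus M_{\ge t}}\right)
```

AR is the special case of PAR in which every factorization step predicts a single token; UniLM v2 pre-trains with the sum of the AE and PAR losses (the AE + PAR setting noted above).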

SLIDE 10

Takeaway Message of UniLM v2

• Pseudo-masked language model efficiently realizes unified pre-training.
• Two types of LM tasks within one forward pass:
  • Bidirectional LM (for NLU)
  • Sequence-to-sequence LM (for NLG)
• Learns different word dependencies:
  • Between context and mask predictions
  • Between mask predictions

SLIDE 11

Benchmark Datasets

• Natural language understanding (bidirectional encoding)
  • Question answering (SQuAD)
  • GLUE: General Language Understanding Evaluation
• Natural language generation (sequence-to-sequence modeling)
  • Abstractive summarization: CNN/DailyMail, Gigaword, XSum
  • Question generation (SQuAD)

SLIDE 12

UniLMv2-Base for NLU Tasks

Results of BASE-size pre-trained models on the SQuAD v1.1/v2.0 development sets. We report F1 scores and exact match (EM) scores. Results of UniLMv2 are averaged over five runs.

Results of BASE-size models on the development set of the GLUE benchmark. We report Matthews correlation coefficient (MCC) for CoLA, Pearson correlation coefficient (PCC) for STS-B, and accuracy (Acc) for the rest. Metrics of UniLMv2 are averaged over five runs.

[Tables not reproduced; the slide annotates per-task deltas over the strongest baseline, roughly from −0.2 to +2.8 points.]

SLIDE 13

UniLMv2-Base for NLG Tasks (Abstractive Summarization)

Abstractive summarization results on CNN/DailyMail and XSum. The evaluation metric is the F1 version of ROUGE (RG) scores. We also present the number of parameters (#Param) and the corpus size (#Corpus) for the methods using pre-trained models.

SLIDE 14

UniLMv2-Base for NLG Tasks (Question Generation)

MTR is short for METEOR, and RG for ROUGE. The official split is from (Du & Cardie, 2018), while the reversed split is the same as in (Zhao et al., 2018).

SLIDE 15

Effect of Pre-Training Objectives

Comparisons between the pre-training objectives. All models are pre-trained over Wikipedia and BookCorpus for one million steps with a batch size of 256. Results in the second block are averaged over five runs for each task. We report F1 and exact match (EM) scores for SQuAD, and accuracy (Acc) for MNLI and SST-2.

• AE: autoencoding
• AR: autoregressive
• PAR: partially autoregressive

SLIDE 16

Thanks!

https://github.com/microsoft/unilm