Pseudo-Masked Language Models for Unified Language Model Pre-Training
ICML-2020
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
Unified Pre-Training Framework
Downstream tasks and the pre-trained models typically used for them:
- Language Understanding (intent classification, entity recognition, question answering, …) → BERT, RoBERTa
- Language Generation (text generation: story/news generation, …) → GPT
- Language Generation (sequence-to-sequence: summary generation, question generation, response generation, machine translation, …) → T5, BART
Three LM pre-training objectives:
- Bidirectional LM: all tokens can see each other.
- Unidirectional (left-to-right) LM: a token can only see its left context.
- Sequence-to-sequence LM: 1) the given input is bidirectionally encoded; 2) the output is unidirectionally decoded.
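A minimal sketch (not the authors' code) of these three visibility patterns written as self-attention masks; the function name and the split into source/target lengths are illustrative.

import torch

def attention_mask(kind, src_len, tgt_len=0):
    """Illustrative self-attention masks: 1 = may attend, 0 = blocked."""
    n = src_len + tgt_len
    if kind == "bidirectional":            # all tokens see each other
        return torch.ones(n, n)
    if kind == "left_to_right":            # a token sees only its left context
        return torch.tril(torch.ones(n, n))
    if kind == "seq2seq":                  # source encoded bidirectionally,
        mask = torch.zeros(n, n)           # target decoded left to right
        mask[:, :src_len] = 1              # the source is visible to everyone
        mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))
        return mask
    raise ValueError(kind)

# e.g. a 3-token source followed by a 2-token target
print(attention_mask("seq2seq", src_len=3, tgt_len=2))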
[Figure: three architectures, each a stack of L Transformer blocks mapping input tokens (x1 … x5 or y1 … y4) to hidden states (h1 … h5): a bidirectional encoder, a unidirectional decoder, and an encoder-decoder.]

Which architecture suits which downstream tasks:
- Bidirectional encoder → NLU: text classification, entity recognition, question answering, …
- Unidirectional decoder → NLG: synthetic text generation, …
- Encoder-decoder → NLG (sequence-to-sequence): text summarization, question generation, …
Two ideas: 1) unified modeling, and 2) multitask-style pre-training.
Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS 2019.
[Figure: a single shared Transformer (L blocks, x1 … x5 → h1 … h5) is pre-trained with three LM objectives: bidirectional LM, sequence-to-sequence LM, and unidirectional LM. Each training batch mixes training examples for the different objectives.]
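A minimal sketch of multitask-style mixing, reusing the attention_mask() helper from the earlier sketch; the uniform sampling ratio and the dictionary layout are placeholders rather than the paper's exact recipe.

import random

# The same shared Transformer handles every objective; only the
# self-attention mask changes per training example.
OBJECTIVE_TO_MASK = {
    "bidirectional": "bidirectional",
    "unidirectional": "left_to_right",
    "seq2seq": "seq2seq",
}

def multitask_batch(examples):
    """Assign each example one of the LM objectives for this batch."""
    batch = []
    for tokens in examples:
        objective = random.choice(list(OBJECTIVE_TO_MASK))  # placeholder: uniform mixing
        batch.append({"tokens": tokens, "mask_kind": OBJECTIVE_TO_MASK[objective]})
    return batch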
[Figure: masked input y1 [M] y3 [M] [M] y6, with y2, y4, y5 masked; the bidirectional LM predicts all masked tokens at once, while the sequence-to-sequence LM predicts them block by block (t=1: y4, y5; t=2: y2).]

Bidirectional LM task (for NLU):
1. Bidirectionally encode the context tokens.
2. Predict all masked spans at the same time.

Sequence-to-Sequence LM task (for NLG):
1. Bidirectionally encode the context tokens.
2. Predict the masked spans one by one (e.g., y4, y5 → y2), as in the sketch below:
   1. Predict y4, y5.
   2. Encode y4, y5 (i.e., fill in what we have predicted).
   3. Predict y2.
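A minimal sketch of the block-by-block procedure just described, assuming a hypothetical model object with encode() and predict() methods; note that without pseudo masks, every step re-encodes the whole sequence.

def blockwise_predict(model, tokens, blocks):
    """Naive partially autoregressive decoding: predict one masked block at a
    time, fill the predictions in, and re-encode before the next block.
    blocks lists masked positions, e.g. [[3, 4], [1]] for y4, y5 -> y2."""
    tokens = list(tokens)
    for block in blocks:                     # t=1: {y4, y5}, t=2: {y2}
        hidden = model.encode(tokens)        # full bidirectional re-encoding
        for pos in block:                    # tokens within a block are predicted
            tokens[pos] = model.predict(hidden, pos)   # from the same encoding
    return tokens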
Observation 1: the context encoding can be reused.
[Figure: the same example, but the positions to be predicted by the sequence-to-sequence LM are now marked with pseudo masks [P] rather than [M], so the two tasks can share one encoding of the context.]
Observation 2: a masked position plays three roles: (1) context mask [M], (2) pseudo mask [P], and (3) the original token.
[Figure: each input embedding is the sum of a token embedding and a position embedding. The [M] context masks, the [P] pseudo masks, and the appended original tokens all reuse the position embeddings of the masked positions they stand for (positions 2, 4, 5 in the running example).]
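A minimal sketch of how such an input could be assembled; the function name and id handling are illustrative, and this is not the released UniLMv2 code. The companion self-attention mask that decides which of the [M]/[P]/original copies each query may attend to is omitted here.

def build_pmlm_input(tokens, blocks, mask_id, pseudo_id):
    """Build one pseudo-masked LM input.
    tokens: original token ids, e.g. [y1, y2, y3, y4, y5, y6]
    blocks: masked positions grouped into blocks, e.g. [[3, 4], [1]]
    Returns input ids and position ids: the original sequence with [M] at
    masked positions, followed, per block, by [P] pseudo masks and the
    original tokens, all reusing the original position ids."""
    masked = {p for b in blocks for p in b}
    input_ids = [mask_id if i in masked else t for i, t in enumerate(tokens)]
    position_ids = list(range(len(tokens)))
    for block in blocks:
        # pseudo masks [P] share the position ids of the positions they stand for
        input_ids += [pseudo_id] * len(block)
        position_ids += list(block)
        # the original tokens are appended too, so later blocks can condition on them
        input_ids += [tokens[p] for p in block]
        position_ids += list(block)
    return input_ids, position_ids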
Bidirectional LM → autoencoding (AE); sequence-to-sequence LM → partially autoregressive (PAR).
The ablations compare AE (autoencoding), AE + AR (autoregressive), and AE + PAR (partially autoregressive), among which AE + PAR performs the best.
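To make the three factorizations concrete, here is a sketch for the running example with masked tokens y2, y4, y5 and blocks {y4, y5} and {y2}; x_{\setminus M} denotes the unmasked context, and the formulas are reconstructed from the standard definitions rather than copied from the paper.

\begin{align*}
\text{AE:}  \quad & p(y_4 \mid x_{\setminus M})\, p(y_5 \mid x_{\setminus M})\, p(y_2 \mid x_{\setminus M}) \\
\text{AR:}  \quad & p(y_4 \mid x_{\setminus M})\, p(y_5 \mid x_{\setminus M}, y_4)\, p(y_2 \mid x_{\setminus M}, y_4, y_5) \\
\text{PAR:} \quad & p(y_4 \mid x_{\setminus M})\, p(y_5 \mid x_{\setminus M})\, p(y_2 \mid x_{\setminus M}, y_4, y_5)
\end{align*}

In PAR, tokens inside a block are predicted simultaneously given the context, while the blocks themselves are predicted autoregressively.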
The two objectives are complementary: the bidirectional LM provides bidirectional encoding, and the sequence-to-sequence LM provides sequence-to-sequence modeling. Both encourage the pre-trained model to learn and use global context (long-distance dependencies).
Results

Results of BASE-size pre-trained models on the SQuAD v1.1/v2.0 development sets. We report F1 scores and exact match (EM) scores.
Results of BASE-size models on the development set of the GLUE benchmark. We report Matthews correlation coefficient (MCC) for CoLA, Pearson correlation coefficient (PCC) for STS-B, and accuracy (Acc) for the rest. UniLMv2 metrics are averaged over five runs per task.
[Table annotation: per-task improvements ranging from +0.3 to +2.8.]
Abstractive summarization results on CNN/DailyMail and XSum. The evaluation metric is the F1 version of ROUGE (RG) scores. We also present the number of parameters (#Param) and the corpus size (#Corpus) for the methods using pre-trained models.
Question generation results. MTR is short for METEOR, and RG for ROUGE. The official data split is from Du & Cardie (2018), while the reversed split is the same as in Zhao et al. (2018).
Comparisons between the pre-training objectives. All models are pre-trained on Wikipedia and BookCorpus for one million steps with a batch size of 256. Results in the second block are averaged over five runs for each task. We report F1 and exact match (EM) scores for SQuAD, and accuracy (Acc) for MNLI and SST-2.