Pseudo-Masked Language Models for Unified Language Model Pre-Training
ICML-2020
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
Unified Pre-Training Framework
Downstream tasks and the pre-trained models typically used for them:
- Language Understanding (intent classification, entity recognition, question answering, …) → BERT, RoBERTa
- Language Generation (text generation: story/news generation, …) → GPT
- Language Generation (sequence-to-sequence: summary generation, question generation, response generation, machine translation, …) → T5, BART
Three LM pre-training objectives:
- Bidirectional LM: all tokens can see each other.
- Unidirectional (left-to-right) LM: a token can only see its left context.
- Sequence-to-sequence LM: 1) the given input is bidirectionally encoded; 2) the output is unidirectionally decoded.
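A minimal sketch (not the authors' code) of these three visibility patterns written as self-attention masks; the function name and the split into source/target lengths are illustrative.

import torch

def attention_mask(kind, src_len, tgt_len=0):
    """Illustrative self-attention masks: 1 = may attend, 0 = blocked."""
    n = src_len + tgt_len
    if kind == "bidirectional":            # all tokens see each other
        return torch.ones(n, n)
    if kind == "left_to_right":            # a token sees only its left context
        return torch.tril(torch.ones(n, n))
    if kind == "seq2seq":                  # source encoded bidirectionally,
        mask = torch.zeros(n, n)           # target decoded left to right
        mask[:, :src_len] = 1              # the source is visible to everyone
        mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))
        return mask
    raise ValueError(kind)

# e.g. a 3-token source followed by a 2-token target
print(attention_mask("seq2seq", src_len=3, tgt_len=2))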
[Figure: three architectures, each a stack of L Transformer blocks mapping input tokens (x1 … x5 or y1 … y4) to hidden states (h1 … h5): a bidirectional encoder, a unidirectional decoder, and an encoder-decoder.]

Which architecture suits which downstream tasks:
- Bidirectional encoder → NLU: text classification, entity recognition, question answering, …
- Unidirectional decoder → NLG: synthetic text generation, …
- Encoder-decoder → NLG (sequence-to-sequence): text summarization, question generation, …
Two ideas: 1) unified modeling, and 2) multitask-style pre-training.
Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS 2019.
[Figure: a single shared Transformer (L blocks, x1 … x5 → h1 … h5) is pre-trained with three LM objectives: bidirectional LM, sequence-to-sequence LM, and unidirectional LM. Each training batch mixes training examples for the different objectives.]
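A minimal sketch of multitask-style mixing, reusing the attention_mask() helper from the earlier sketch; the uniform sampling ratio and the dictionary layout are placeholders rather than the paper's exact recipe.

import random

# The same shared Transformer handles every objective; only the
# self-attention mask changes per training example.
OBJECTIVE_TO_MASK = {
    "bidirectional": "bidirectional",
    "unidirectional": "left_to_right",
    "seq2seq": "seq2seq",
}

def multitask_batch(examples):
    """Assign each example one of the LM objectives for this batch."""
    batch = []
    for tokens in examples:
        objective = random.choice(list(OBJECTIVE_TO_MASK))  # placeholder: uniform mixing
        batch.append({"tokens": tokens, "mask_kind": OBJECTIVE_TO_MASK[objective]})
    return batch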
[Figure: masked input y1 [M] y3 [M] [M] y6, with y2, y4, y5 masked; the bidirectional LM predicts all masked tokens at once, while the sequence-to-sequence LM predicts them block by block (t=1: y4, y5; t=2: y2).]

Bidirectional LM task (for NLU):
1. Bidirectionally encode the context tokens.
2. Predict all masked spans at the same time.

Sequence-to-Sequence LM task (for NLG):
1. Bidirectionally encode the context tokens.
2. Predict the masked spans one by one (e.g., y4, y5 → y2), as in the sketch below:
   1. Predict y4, y5.
   2. Encode y4, y5 (i.e., fill in what we have predicted).
   3. Predict y2.
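A minimal sketch of the block-by-block procedure just described, assuming a hypothetical model object with encode() and predict() methods; note that without pseudo masks, every step re-encodes the whole sequence.

def blockwise_predict(model, tokens, blocks):
    """Naive partially autoregressive decoding: predict one masked block at a
    time, fill the predictions in, and re-encode before the next block.
    blocks lists masked positions, e.g. [[3, 4], [1]] for y4, y5 -> y2."""
    tokens = list(tokens)
    for block in blocks:                     # t=1: {y4, y5}, t=2: {y2}
        hidden = model.encode(tokens)        # full bidirectional re-encoding
        for pos in block:                    # tokens within a block are predicted
            tokens[pos] = model.predict(hidden, pos)   # from the same encoding
    return tokens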
Observation 1: the context encoding can be reused.
[Figure: the same example, but the positions to be predicted by the sequence-to-sequence LM are now marked with pseudo masks [P] rather than [M], so the two tasks can share one encoding of the context.]
Observation 2: a masked position plays three roles: (1) context mask [M], (2) pseudo mask [P], and (3) the original token.
[Figure: each input embedding is the sum of a token embedding and a position embedding. The [M] context masks, the [P] pseudo masks, and the appended original tokens all reuse the position embeddings of the masked positions they stand for (positions 2, 4, 5 in the running example).]
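A minimal sketch of how such an input could be assembled; the function name and id handling are illustrative, and this is not the released UniLMv2 code. The companion self-attention mask that decides which of the [M]/[P]/original copies each query may attend to is omitted here.

def build_pmlm_input(tokens, blocks, mask_id, pseudo_id):
    """Build one pseudo-masked LM input.
    tokens: original token ids, e.g. [y1, y2, y3, y4, y5, y6]
    blocks: masked positions grouped into blocks, e.g. [[3, 4], [1]]
    Returns input ids and position ids: the original sequence with [M] at
    masked positions, followed, per block, by [P] pseudo masks and the
    original tokens, all reusing the original position ids."""
    masked = {p for b in blocks for p in b}
    input_ids = [mask_id if i in masked else t for i, t in enumerate(tokens)]
    position_ids = list(range(len(tokens)))
    for block in blocks:
        # pseudo masks [P] share the position ids of the positions they stand for
        input_ids += [pseudo_id] * len(block)
        position_ids += list(block)
        # the original tokens are appended too, so later blocks can condition on them
        input_ids += [tokens[p] for p in block]
        position_ids += list(block)
    return input_ids, position_ids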
Bidirectional LM → autoencoding (AE); sequence-to-sequence LM → partially autoregressive (PAR).
The ablations compare AE (autoencoding), AE + AR (autoregressive), and AE + PAR (partially autoregressive), among which AE + PAR performs the best.
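To make the three factorizations concrete, here is a sketch for the running example with masked tokens y2, y4, y5 and blocks {y4, y5} and {y2}; x_{\setminus M} denotes the unmasked context, and the formulas are reconstructed from the standard definitions rather than copied from the paper.

\begin{align*}
\text{AE:}  \quad & p(y_4 \mid x_{\setminus M})\, p(y_5 \mid x_{\setminus M})\, p(y_2 \mid x_{\setminus M}) \\
\text{AR:}  \quad & p(y_4 \mid x_{\setminus M})\, p(y_5 \mid x_{\setminus M}, y_4)\, p(y_2 \mid x_{\setminus M}, y_4, y_5) \\
\text{PAR:} \quad & p(y_4 \mid x_{\setminus M})\, p(y_5 \mid x_{\setminus M})\, p(y_2 \mid x_{\setminus M}, y_4, y_5)
\end{align*}

In PAR, tokens inside a block are predicted simultaneously given the context, while the blocks themselves are predicted autoregressively.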
The two objectives are complementary: the bidirectional LM provides bidirectional encoding, and the sequence-to-sequence LM provides sequence-to-sequence modeling. Both encourage the pre-trained model to learn and use global context (long-distance dependencies).
Results

Results of BASE-size pre-trained models on the SQuAD v1.1/v2.0 development sets. We report F1 scores and exact match (EM) scores.
Results of BASE-size models on the development set of the GLUE benchmark. We report Matthews correlation coefficient (MCC) for CoLA, Pearson correlation coefficient (PCC) for STS-B, and accuracy (Acc) for the rest. UniLMv2 metrics are averaged over five runs per task.
[Table annotation: per-task improvements ranging from +0.3 to +2.8.]
Abstractive summarization results on CNN/DailyMail and XSum. The evaluation metric is the F1 version of ROUGE (RG) scores. We also present the number of parameters (#Param) and the corpus size (#Corpus) for the methods using pre-trained models.
Question generation results. MTR is short for METEOR, and RG for ROUGE. The official data split is from Du & Cardie (2018), while the reversed split is the same as in Zhao et al. (2018).
Comparisons between the pre-training objectives. All models are pre-trained on Wikipedia and BookCorpus for one million steps with a batch size of 256. Results in the second block are averaged over five runs for each task. We report F1 and exact match (EM) scores for SQuAD, and accuracy (Acc) for MNLI and SST-2.