Language Models with Transformers - Chenguang Wang, Mu Li, Alexander J. Smola - PowerPoint PPT Presentation



SLIDE 1

Language Models with Transformers

Chenguang Wang, Mu Li, Alexander J. Smola, Amazon Web Services

SLIDE 2

Background

SLIDE 3

Language Model (LM)

  • Predict what word comes next

Start to learn English

SLIDE 4

Language Model (LM)

  • Predict what word comes next
  • Useful in many NLP applications

Start to learn English

SLIDE 5

Language Model (LM)

  • Predict what word comes next
  • Useful in many NLP applications
  • Many NLP problems share a similar definition

Example: "Start to learn English" vs. "Learn to start business" - word order matters!
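
A minimal toy sketch of what "predict what word comes next" means: count word pairs in a tiny made-up corpus and turn the counts into next-word probabilities. The corpus and vocabulary are invented for illustration; a real LM such as an RNN or BERT replaces the counting with a learned model.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count bigrams in a tiny made-up corpus,
# then turn them into probabilities for "what word comes next".
corpus = "start to learn english . learn to start business .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_probs(prev):
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("to"))     # {'learn': 0.5, 'start': 0.5}
print(next_word_probs("start"))  # {'to': 0.5, 'business': 0.5}
```

Swapping the word order ("Start to learn ..." vs. "Learn to start ...") changes which pairs occur, which is exactly why word order matters to a language model.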

SLIDE 6

Language Model with RNNs

  • RNN uses one-hot encoding

[Diagram: the input word "Start" is one-hot encoded before being fed to the RNN]
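
A minimal sketch of the one-hot encoding mentioned above, with an invented four-word vocabulary: each word becomes a vector that is all zeros except for a single 1 at that word's index.

```python
import numpy as np

# Toy vocabulary; real models use tens of thousands of (sub)words.
vocab = ["Start", "to", "learn", "English"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0   # single 1 at the word's index
    return vec

print(one_hot("Start"))  # [1. 0. 0. 0.]
```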

SLIDE 7

Language Model with RNNs

  • RNN models the word order in hidden state

[Diagram: input "Start" → RNN hidden state → output "to"]
SLIDE 8

Language Model with RNNs

  • RNN models the word order in hidden state

[Diagram: "Start" → RNN → "to", then "to" → RNN → "learn"; labels: input, hidden state, output]
SLIDE 9

Language Model with RNNs

  • RNN models the word order in hidden state

[Diagram: "Start" → RNN → "to", "to" → RNN → "learn", "learn" → RNN → "English"; labels: input, hidden state, output]
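
A minimal NumPy sketch of how an RNN carries word order: the hidden state is updated from the previous hidden state and the current one-hot input, and the next word is predicted from it. Sizes, random weights, and the tiny vocabulary are illustrative only, not the presenters' model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Start", "to", "learn", "English"]
V, H = len(vocab), 8                        # vocabulary / hidden sizes (illustrative)
W_xh = rng.normal(scale=0.1, size=(H, V))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden: carries the word order
W_hy = rng.normal(scale=0.1, size=(V, H))   # hidden -> next-word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                             # hidden state
for word in ["Start", "to", "learn"]:
    x = np.zeros(V); x[vocab.index(word)] = 1.0   # one-hot input
    h = np.tanh(W_xh @ x + W_hh @ h)              # summarizes everything seen so far
    p = softmax(W_hy @ h)                         # distribution over the next word
    print(word, "->", vocab[int(p.argmax())])     # untrained, so predictions are arbitrary
```
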
SLIDE 10

SOTA NLP with Transformers

[Diagram: Transformer with positional encoding; other components are omitted for simplicity (Devlin et al., 2018)]

  • With less word order

SLIDE 11

SOTA NLP with Transformers

[Diagram: Transformer with self-attention and positional encoding; other components are omitted for simplicity (Devlin et al., 2018)]

  • Parallelizable
  • Efficient
SLIDE 12

SOTA NLP with Transformers

[Diagram: Transformer with self-attention and positional encoding; other components are omitted for simplicity (Devlin et al., 2018)]

Transformer:
  • With less word order
  • Parallelizable
  • Efficient

RNN:
  • With word order
  • Sequential
  • Less efficient
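
A minimal NumPy sketch of the two ingredients named on the slide: sinusoidal positional encoding (the Transformer's only word-order signal) and single-head scaled dot-product self-attention, where every position attends to every other position in parallel. Shapes and random weights are illustrative, and real blocks add multiple heads, feed-forward layers, residuals, and layer norm.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 16                         # sequence length and model width (illustrative)
x = rng.normal(size=(T, d))          # embeddings for a 4-word sentence

# Sinusoidal positional encoding: injects (some) word-order information.
pos = np.arange(T)[:, None]
i = np.arange(d // 2)[None, :]
angles = pos / (10000 ** (2 * i / d))
pe = np.zeros((T, d))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
x = x + pe

# Single-head scaled dot-product self-attention: all positions are
# processed in parallel (no sequential recurrence as in an RNN).
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
out = weights @ V                                # (T, d) contextualized vectors
print(out.shape)
```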

SLIDE 13

SOTA NLP with Transformers

  • BERT: a stack of 12 (or 24) Transformer blocks

[Diagram: Transformer 0 → Transformer 1 → ... → Transformer 11]

SLIDE 14

SOTA NLP with Transformers

  • BERT: a stack of 12 (or 24) Transformer blocks
  • Trained on large language model datasets
  • Full training cost in excess of $10,000 (16 TPU, 4 days)
  • Achieved SOTA results on 11 NLP applications
  • Sentence-level tasks: less sensitive to word order

[Diagram: Transformer 0 → Transformer 1 → ... → Transformer 11]
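
The "12 (or 24) blocks" can be checked directly from the public checkpoints. A small sketch using the Hugging Face transformers library (an assumption about tooling, not the presenters' setup):

```python
from transformers import BertConfig

# bert-base stacks 12 Transformer blocks, bert-large stacks 24.
base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")
print(base.num_hidden_layers, large.num_hidden_layers)  # 12 24
```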

SLIDE 15

Approach: Make Best Use of BERT for Language Modeling

SLIDE 16

LM: Adapted BERT

[Diagram: BERT with a linear output layer - embedding → Transformer 0 → ... → Transformer 11 → Linear; legend marks fixed vs. tunable weights]
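
A hedged PyTorch / Hugging Face sketch of the general recipe in the diagram: a pre-trained BERT encoder with a linear output layer mapping hidden states to vocabulary logits. This follows the picture on the slide, not the authors' exact implementation, and the model name is the standard public checkpoint.

```python
from torch import nn
from transformers import BertModel

class BertLM(nn.Module):
    """BERT encoder plus a linear output layer for next-word prediction."""
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        vocab = self.bert.config.vocab_size
        self.decoder = nn.Linear(hidden, vocab)   # the added linear layer

    def forward(self, input_ids, attention_mask=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.decoder(h)                    # (batch, seq_len, vocab) logits
```

Freezing every parameter of `self.bert` (e.g. `p.requires_grad_(False)`) gives the fixed-weights variant on the next slide; leaving them all trainable gives the all-weights variant after that.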

SLIDE 17

LM 1: Adapted BERT with Fixed Weights

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → Linear; the pre-trained BERT weights are fixed, only the linear layer is tuned]

Model   Test PPL (lower is better)
BERT    69.32
RNN     42.25

Only moderate results.

SLIDE 18

LM 2: Adapted BERT with All Weights

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → Linear; all weights are tuned]

Model      Test PPL (lower is better)
BERT       69.32
BERT-All   67.43
RNN        42.25

Overfitting.

SLIDE 19

LM 3: Adapted BERT with Partial Weights

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → Linear; a subset of the pre-trained weights is fixed]

Model         Test PPL (lower is better)
BERT          69.32
BERT-All      67.43
BERT-Subset   40.56
RNN           42.25

Fixing a subset of weights is promising. However, enumerating all subsets is not feasible.

SLIDE 20

LM 4: Adapted BERT with RNN

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → RNN → Linear; fixed vs. tunable weights indicated]

Model      Test PPL (lower is better)
BERT       69.32
BERT-RNN   41.64
RNN        42.25

Adding an RNN to capture word order is promising. However, enumerating all options is not feasible:
  • Where to add RNN layers
  • How many RNN layers to add
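
A sketch, in the same hedged PyTorch / Hugging Face style as the slide 16 example, of the BERT-RNN idea: fix a subset of the pre-trained weights and add an LSTM before the linear decoder to reintroduce word order. The number of frozen blocks and the placement of the RNN are arbitrary placeholders here; choosing them is exactly the search problem on the next slides.

```python
from torch import nn
from transformers import BertModel

class BertRnnLM(nn.Module):
    """Partially frozen BERT encoder + LSTM (word order) + linear decoder."""
    def __init__(self, name="bert-base-uncased", n_frozen=6):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        # Freeze the embeddings and the first n_frozen Transformer blocks
        # (n_frozen=6 is an arbitrary illustration, not a recommended choice).
        for p in self.bert.embeddings.parameters():
            p.requires_grad_(False)
        for block in self.bert.encoder.layer[:n_frozen]:
            for p in block.parameters():
                p.requires_grad_(False)
        hidden = self.bert.config.hidden_size
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)   # captures word order
        self.decoder = nn.Linear(hidden, self.bert.config.vocab_size)

    def forward(self, input_ids, attention_mask=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.rnn(h)
        return self.decoder(h)
```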

SLIDE 21

Where to add the RNN layers?

SLIDE 22

Which layer’s pre-trained weights should be fixed?
Where to add the RNN layers?

SLIDE 23

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or Terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best

[Diagram: embedding → Transformer 0 → Transformer 1, before and after Step 1: Transformer 0’s weights are fixed; legend marks fixed vs. tunable weights]

SLIDE 24

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or Terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best

[Diagram: successive search states - the original stack, the stack with Transformer 0 fixed, and the stack with an RNN layer added; legend marks fixed vs. tunable weights]

SLIDE 25

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or Terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best

[Diagram: successive search states, now ending with embedding → Transformer 0 → Transformer 1 → RNN → Linear after a linear layer is added; legend marks fixed vs. tunable weights]

SLIDE 26

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best (sketched in code below)

[Diagram: the full sequence of search states, ending with embedding → Transformer 0 → Transformer 1 → RNN → Linear; legend marks fixed vs. tunable weights]
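
A schematic of the greedy loop on slides 23-26. `candidate_modifications`, `apply_modification`, and `finetune_and_eval_ppl` are hypothetical helpers standing in for "fix one more layer's weights", "insert an RNN layer at one position", and the brief fine-tuning plus validation-perplexity measurement; they are not functions from the paper's code, and this is one plausible reading of the greedy strategy rather than the authors' exact procedure.

```python
def coordinate_architecture_search(net, max_steps=8):
    """Greedy CAS sketch: at each step try every single-move modification
    (fix one more layer's weights, or add an RNN layer at some position),
    briefly fine-tune each candidate, and keep the best by validation PPL."""
    best_ppl = finetune_and_eval_ppl(net)            # hypothetical helper
    for _ in range(max_steps):
        improved = False
        for mod in candidate_modifications(net):     # hypothetical: freeze-layer / add-RNN moves
            candidate = apply_modification(net, mod) # hypothetical helper
            ppl = finetune_and_eval_ppl(candidate)
            if ppl < best_ppl:
                best_ppl, best_net, improved = ppl, candidate, True
        if not improved:                             # Step 3: terminate when no move helps
            break
        net = best_net                               # Step 3: otherwise go back to Step 1
    return net, best_ppl
```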

SLIDE 27

Best LM: Adapted BERT with CAS

[Bar chart: test perplexity (lower is better, axis 20-120) on PTB and WT-103 for AWD-LSTM-MoS-BERTVocab, BERT, BERT-CAS-Subset, BERT-CAS-LSTM, BERT-CAS, and BERT-Large-CAS]

SLIDE 28

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best

SLIDE 29

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best
  • Captures word order

SLIDE 30

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best
  • Captures word order
  • Achieves SOTA: 31.34 PPL with 0.5 GPU days

SLIDE 31

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best
  • Captures word order
  • Achieves SOTA: 31.34 PPL with 0.5 GPU days
  • Achieves 20.42 PPL with 1B tokens

SLIDE 32

Take-aways

  • BERT needs to be adapted for language modeling
  • Adding RNN layers via neural architecture search works
  • Fixing a subset of pre-trained weights via neural architecture search works