Language Models with Transformers
Chenguang Wang, Mu Li, Alexander J. Smola Amazon Web Services
Background

Language Model (LM)
Predict what word comes next.
Example: "Start to learn English"
Compare with: "Learn to start business"
Word order matters: reordering the words changes the meaning.
An RNN language model feeds each input word into the RNN, which updates its hidden state and predicts the next word.
[Diagram: Start -> RNN -> to -> RNN -> learn -> RNN -> English; inputs at the bottom, hidden states passed between steps]
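A minimal sketch of this RNN language model in PyTorch (an assumed framework; layer sizes are illustrative, not from the talk):

    import torch
    import torch.nn as nn

    class RNNLanguageModel(nn.Module):
        """Each input word updates a hidden state that predicts the next word."""
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):
            h = self.embed(tokens)    # (batch, seq_len, embed_dim)
            out, _ = self.rnn(h)      # hidden state at every position
            return self.decoder(out)  # logits over the next word

    # "Start to learn" -> predict "to", "learn", "English"
    model = RNNLanguageModel(vocab_size=10000)
    logits = model(torch.randint(0, 10000, (1, 3)))  # shape (1, 3, 10000)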
Transformer [Devlin, Jacob, et al. 2018]
Key components: self-attention and positional encoding (other components are omitted for simplicity).
Compared with an RNN, the Transformer captures less word order: order information enters only through the positional encoding.
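Both components can be sketched in a few lines of NumPy. This shows single-head attention only, with the sinusoidal positional encoding of the original Transformer (BERT learns its position embeddings, but the role is the same); both simplifications are mine, not the talk's:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Scaled dot-product self-attention; X has shape (seq_len, d)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # every word attends to every word
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)  # softmax over positions
        return weights @ V                         # order-invariant on its own

    def positional_encoding(seq_len, d):
        """Sinusoidal encoding: the only place word order enters the model."""
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d)[None, :]
        angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
        return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))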
BERT stacks 12 Transformer layers:
[Diagram: Transformer 0 -> Transformer 1 -> ... -> Transformer 11]
BERT with Linear Layer
[Diagram: embedding -> Transformer 0 -> ... -> Transformer 11 -> Linear; Transformer and embedding weights fixed, linear weights tunable]

Model   Test PPL
BERT    69.32
RNN     42.25

Only moderate results (the lower, the better).
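A sketch of this setup with the Hugging Face transformers library (an assumed dependency, not the authors' implementation; the dummy batch is only to show shapes):

    import torch
    import torch.nn as nn
    from transformers import BertModel

    bert = BertModel.from_pretrained("bert-base-uncased")
    for p in bert.parameters():
        p.requires_grad = False                 # fixed: embedding + all 12 layers

    decoder = nn.Linear(bert.config.hidden_size, bert.config.vocab_size)  # tunable

    input_ids = torch.randint(0, bert.config.vocab_size, (1, 16))  # dummy word ids
    hidden = bert(input_ids=input_ids).last_hidden_state
    logits = decoder(hidden)                    # next-word logits at each position

    # test perplexity = exp(mean cross-entropy); lower is better
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)), input_ids[:, 1:].reshape(-1))
    ppl = loss.exp().item()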
Tuning all weights instead (BERT-All) overfits:

Model     Test PPL
BERT      69.32
BERT-All  67.43
RNN       42.25
Fixing a subset of the pre-trained weights (BERT-Subset) is promising:

Model        Test PPL
BERT         69.32
BERT-All     67.43
BERT-Subset  40.56
RNN          42.25

However, enumerating all subsets is not feasible.
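One concrete way to fix a subset, continuing the sketch above: freeze the embedding and the first k Transformer layers, leave the rest tunable. The cut point k is illustrative; the full search space is every subset of the 12 layers:

    def freeze_subset(bert, k=6):
        """Fix the embedding and the first k of 12 Transformer layers."""
        for p in bert.embeddings.parameters():
            p.requires_grad = False
        for layer in bert.encoder.layer[:k]:
            for p in layer.parameters():
                p.requires_grad = False
        # layers k..11 stay tunable; with 2**12 possible subsets,
        # exhaustive enumeration is not feasible

    freeze_subset(bert, k=6)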
Adding an RNN to capture word order (BERT-RNN) is also promising:
[Diagram: embedding -> Transformer 0 -> ... -> Transformer 11 -> RNN -> Linear]

Model     Test PPL
BERT      69.32
BERT-RNN  41.64
RNN       42.25

However, enumerating all placements is not feasible.
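A sketch of BERT-RNN under the same assumptions: an LSTM between the fixed BERT stack and the output layer reintroduces the word order that self-attention alone underuses:

    import torch.nn as nn

    class BertRNN(nn.Module):
        def __init__(self, bert, hidden_dim=768):   # hidden_dim is illustrative
            super().__init__()
            self.bert = bert                        # weights frozen as above
            self.lstm = nn.LSTM(bert.config.hidden_size, hidden_dim,
                                batch_first=True)
            self.decoder = nn.Linear(hidden_dim, bert.config.vocab_size)

        def forward(self, input_ids):
            hidden = self.bert(input_ids=input_ids).last_hidden_state
            out, _ = self.lstm(hidden)              # sequential pass restores order
            return self.decoder(out)                # next-word logits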
This leaves two search questions:
Where to add the RNN layers?
Which layers' pre-trained weights should be fixed?
Search by modifying the network one step at a time (shown on a two-layer stack; fixed vs. tunable weights marked in the diagrams):

Fix Transformer 0's weights.
[Diagram: embedding -> Transformer 0 (fixed) -> Transformer 1]

Add an RNN layer.
[Diagram: embedding -> Transformer 0 (fixed) -> Transformer 1 -> RNN]

Add a linear layer.
[Diagram: embedding -> Transformer 0 (fixed) -> Transformer 1 -> RNN -> Linear]
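The search over these modifications can be sketched as a greedy loop (all names here are hypothetical; val_ppl stands in for fine-tuning a candidate and measuring its validation perplexity):

    import random

    def coordinate_search(base_net, modifications, val_ppl, steps=10):
        """Greedy search: try one modification per step, keep it if it helps.

        `modifications` are callables such as fix_layer(i), add_lstm(),
        add_linear(), each returning a modified copy of the network.
        """
        best_net, best_ppl = base_net, val_ppl(base_net)
        for _ in range(steps):
            candidate = random.choice(modifications)(best_net)
            ppl = val_ppl(candidate)
            if ppl < best_ppl:          # lower perplexity is better
                best_net, best_ppl = candidate, ppl
        return best_net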
[Chart: test perplexity (axis 20-120; lower is better) on PTB and WT-103 for AWD-LSTM-MoS-BERTVocab, BERT, BERT-CAS-Subset, BERT-CAS-LSTM, BERT-CAS, and BERT-Large-CAS]

BERT-Large-CAS is best: it captures word order.
It achieves SOTA: 31.34 test PPL with 0.5 GPU days.
It achieves 20.42 test PPL with 1B tokens.