Language Models with Transformers - Chenguang Wang, Mu Li, Alexander J. Smola - PowerPoint PPT Presentation



SLIDE 1

Language Models with Transformers

Chenguang Wang, Mu Li, Alexander J. Smola, Amazon Web Services

SLIDE 2

Background

SLIDE 3

Language Model (LM)

  • Predict what word comes next

Start to learn English

SLIDE 4

Language Model (LM)

  • Predict what word comes next
  • Useful in many NLP applications

Start to learn English

SLIDE 5

Language Model (LM)

  • Predict what word comes next
  • Useful in many NLP applications
  • Many NLP problems share a similar definition

Example: "Start to learn English" vs. "Learn to start business" - word order matters!
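
A minimal toy sketch of what "predict what word comes next" means: count word pairs in a tiny made-up corpus and turn the counts into next-word probabilities. The corpus and vocabulary are invented for illustration; a real LM such as an RNN or BERT replaces the counting with a learned model.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count bigrams in a tiny made-up corpus,
# then turn them into probabilities for "what word comes next".
corpus = "start to learn english . learn to start business .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_probs(prev):
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("to"))     # {'learn': 0.5, 'start': 0.5}
print(next_word_probs("start"))  # {'to': 0.5, 'business': 0.5}
```

Swapping the word order ("Start to learn ..." vs. "Learn to start ...") changes which pairs occur, which is exactly why word order matters to a language model.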

SLIDE 6

Language Model with RNNs

  • RNN uses one-hot encoding

[Diagram: the input word "Start" is one-hot encoded before being fed to the RNN]
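
A minimal sketch of the one-hot encoding mentioned above, with an invented four-word vocabulary: each word becomes a vector that is all zeros except for a single 1 at that word's index.

```python
import numpy as np

# Toy vocabulary; real models use tens of thousands of (sub)words.
vocab = ["Start", "to", "learn", "English"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0   # single 1 at the word's index
    return vec

print(one_hot("Start"))  # [1. 0. 0. 0.]
```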

SLIDE 7

Language Model with RNNs

  • RNN models the word order in hidden state

[Diagram: input "Start" → RNN hidden state → output "to"]
SLIDE 8

Language Model with RNNs

  • RNN models the word order in hidden state

[Diagram: "Start" → RNN → "to", then "to" → RNN → "learn"; labels: input, hidden state, output]
SLIDE 9

Language Model with RNNs

  • RNN models the word order in hidden state

[Diagram: "Start" → RNN → "to", "to" → RNN → "learn", "learn" → RNN → "English"; labels: input, hidden state, output]
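
A minimal NumPy sketch of how an RNN carries word order: the hidden state is updated from the previous hidden state and the current one-hot input, and the next word is predicted from it. Sizes, random weights, and the tiny vocabulary are illustrative only, not the presenters' model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Start", "to", "learn", "English"]
V, H = len(vocab), 8                        # vocabulary / hidden sizes (illustrative)
W_xh = rng.normal(scale=0.1, size=(H, V))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden: carries the word order
W_hy = rng.normal(scale=0.1, size=(V, H))   # hidden -> next-word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                             # hidden state
for word in ["Start", "to", "learn"]:
    x = np.zeros(V); x[vocab.index(word)] = 1.0   # one-hot input
    h = np.tanh(W_xh @ x + W_hh @ h)              # summarizes everything seen so far
    p = softmax(W_hy @ h)                         # distribution over the next word
    print(word, "->", vocab[int(p.argmax())])     # untrained, so predictions are arbitrary
```
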
SLIDE 10

SOTA NLP with Transformers

[Diagram: Transformer with positional encoding; other components are omitted for simplicity (Devlin et al., 2018)]

  • With less word order

SLIDE 11

SOTA NLP with Transformers

[Diagram: Transformer with self-attention and positional encoding; other components are omitted for simplicity (Devlin et al., 2018)]

  • Parallelizable
  • Efficient
SLIDE 12

SOTA NLP with Transformers

[Diagram: Transformer with self-attention and positional encoding; other components are omitted for simplicity (Devlin et al., 2018)]

Transformer:
  • With less word order
  • Parallelizable
  • Efficient

RNN:
  • With word order
  • Sequential
  • Less efficient
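
A minimal NumPy sketch of the two ingredients named on the slide: sinusoidal positional encoding (the Transformer's only word-order signal) and single-head scaled dot-product self-attention, where every position attends to every other position in parallel. Shapes and random weights are illustrative, and real blocks add multiple heads, feed-forward layers, residuals, and layer norm.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 16                         # sequence length and model width (illustrative)
x = rng.normal(size=(T, d))          # embeddings for a 4-word sentence

# Sinusoidal positional encoding: injects (some) word-order information.
pos = np.arange(T)[:, None]
i = np.arange(d // 2)[None, :]
angles = pos / (10000 ** (2 * i / d))
pe = np.zeros((T, d))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
x = x + pe

# Single-head scaled dot-product self-attention: all positions are
# processed in parallel (no sequential recurrence as in an RNN).
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
out = weights @ V                                # (T, d) contextualized vectors
print(out.shape)
```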

SLIDE 13

SOTA NLP with Transformers

  • BERT: a stack of 12 (or 24) Transformer blocks

[Diagram: Transformer 0 → Transformer 1 → ... → Transformer 11]

SLIDE 14

SOTA NLP with Transformers

  • BERT: a stack of 12 (or 24) Transformer blocks
  • Trained on large language model datasets
  • Full training cost in excess of $10,000 (16 TPU, 4 days)
  • Achieved SOTA results on 11 NLP applications
  • Sentence-level tasks: less sensitive to word order

[Diagram: Transformer 0 → Transformer 1 → ... → Transformer 11]
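
The "12 (or 24) blocks" can be checked directly from the public checkpoints. A small sketch using the Hugging Face transformers library (an assumption about tooling, not the presenters' setup):

```python
from transformers import BertConfig

# bert-base stacks 12 Transformer blocks, bert-large stacks 24.
base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")
print(base.num_hidden_layers, large.num_hidden_layers)  # 12 24
```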

SLIDE 15

Approach: Make Best Use of BERT for Language Modeling

SLIDE 16

LM: Adapted BERT

[Diagram: BERT with a linear output layer - embedding → Transformer 0 → ... → Transformer 11 → Linear; legend marks fixed vs. tunable weights]
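
A hedged PyTorch / Hugging Face sketch of the general recipe in the diagram: a pre-trained BERT encoder with a linear output layer mapping hidden states to vocabulary logits. This follows the picture on the slide, not the authors' exact implementation, and the model name is the standard public checkpoint.

```python
from torch import nn
from transformers import BertModel

class BertLM(nn.Module):
    """BERT encoder plus a linear output layer for next-word prediction."""
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        vocab = self.bert.config.vocab_size
        self.decoder = nn.Linear(hidden, vocab)   # the added linear layer

    def forward(self, input_ids, attention_mask=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.decoder(h)                    # (batch, seq_len, vocab) logits
```

Freezing every parameter of `self.bert` (e.g. `p.requires_grad_(False)`) gives the fixed-weights variant on the next slide; leaving them all trainable gives the all-weights variant after that.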

SLIDE 17

LM 1: Adapted BERT with Fixed Weights

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → Linear; the pre-trained BERT weights are fixed, only the linear layer is tuned]

Model   Test PPL (lower is better)
BERT    69.32
RNN     42.25

Only moderate results.

SLIDE 18

LM 2: Adapted BERT with All Weights

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → Linear; all weights are tuned]

Model      Test PPL (lower is better)
BERT       69.32
BERT-All   67.43
RNN        42.25

Overfitting.

SLIDE 19

LM 3: Adapted BERT with Partial Weights

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → Linear; a subset of the pre-trained weights is fixed]

Model         Test PPL (lower is better)
BERT          69.32
BERT-All      67.43
BERT-Subset   40.56
RNN           42.25

Fixing a subset of weights is promising. However, enumerating all subsets is not feasible.

SLIDE 20

LM 4: Adapted BERT with RNN

[Diagram: embedding → Transformer 0 → ... → Transformer 11 → RNN → Linear; fixed vs. tunable weights indicated]

Model      Test PPL (lower is better)
BERT       69.32
BERT-RNN   41.64
RNN        42.25

Adding an RNN to capture word order is promising. However, enumerating all options is not feasible:
  • Where to add RNN layers
  • How many RNN layers to add
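
A sketch, in the same hedged PyTorch / Hugging Face style as the slide 16 example, of the BERT-RNN idea: fix a subset of the pre-trained weights and add an LSTM before the linear decoder to reintroduce word order. The number of frozen blocks and the placement of the RNN are arbitrary placeholders here; choosing them is exactly the search problem on the next slides.

```python
from torch import nn
from transformers import BertModel

class BertRnnLM(nn.Module):
    """Partially frozen BERT encoder + LSTM (word order) + linear decoder."""
    def __init__(self, name="bert-base-uncased", n_frozen=6):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        # Freeze the embeddings and the first n_frozen Transformer blocks
        # (n_frozen=6 is an arbitrary illustration, not a recommended choice).
        for p in self.bert.embeddings.parameters():
            p.requires_grad_(False)
        for block in self.bert.encoder.layer[:n_frozen]:
            for p in block.parameters():
                p.requires_grad_(False)
        hidden = self.bert.config.hidden_size
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)   # captures word order
        self.decoder = nn.Linear(hidden, self.bert.config.vocab_size)

    def forward(self, input_ids, attention_mask=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.rnn(h)
        return self.decoder(h)
```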

SLIDE 21

Where to add the RNN layers?

SLIDE 22

Which layer’s pre-trained weights should be fixed?
Where to add the RNN layers?

SLIDE 23

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or Terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best

[Diagram: embedding → Transformer 0 → Transformer 1, before and after Step 1: Transformer 0’s weights are fixed; legend marks fixed vs. tunable weights]

SLIDE 24

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or Terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best

[Diagram: successive search states - the original stack, the stack with Transformer 0 fixed, and the stack with an RNN layer added; legend marks fixed vs. tunable weights]

SLIDE 25

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or Terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best

[Diagram: successive search states, now ending with embedding → Transformer 0 → Transformer 1 → RNN → Linear after a linear layer is added; legend marks fixed vs. tunable weights]

SLIDE 26

Coordinate Architecture Search (CAS)

  • Step 1: Choose a layer’s weights to fix
  • Step 2: Choose a position to add an RNN layer
  • Step 3: Go to Step 1 or terminate
  • Greedy strategy: fine-tune the resulting BERT and keep the best (sketched in code below)

[Diagram: the full sequence of search states, ending with embedding → Transformer 0 → Transformer 1 → RNN → Linear; legend marks fixed vs. tunable weights]
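
A schematic of the greedy loop on slides 23-26. `candidate_modifications`, `apply_modification`, and `finetune_and_eval_ppl` are hypothetical helpers standing in for "fix one more layer's weights", "insert an RNN layer at one position", and the brief fine-tuning plus validation-perplexity measurement; they are not functions from the paper's code, and this is one plausible reading of the greedy strategy rather than the authors' exact procedure.

```python
def coordinate_architecture_search(net, max_steps=8):
    """Greedy CAS sketch: at each step try every single-move modification
    (fix one more layer's weights, or add an RNN layer at some position),
    briefly fine-tune each candidate, and keep the best by validation PPL."""
    best_ppl = finetune_and_eval_ppl(net)            # hypothetical helper
    for _ in range(max_steps):
        improved = False
        for mod in candidate_modifications(net):     # hypothetical: freeze-layer / add-RNN moves
            candidate = apply_modification(net, mod) # hypothetical helper
            ppl = finetune_and_eval_ppl(candidate)
            if ppl < best_ppl:
                best_ppl, best_net, improved = ppl, candidate, True
        if not improved:                             # Step 3: terminate when no move helps
            break
        net = best_net                               # Step 3: otherwise go back to Step 1
    return net, best_ppl
```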

SLIDE 27

Best LM: Adapted BERT with CAS

[Bar chart: test perplexity (lower is better, axis 20-120) on PTB and WT-103 for AWD-LSTM-MoS-BERTVocab, BERT, BERT-CAS-Subset, BERT-CAS-LSTM, BERT-CAS, and BERT-Large-CAS]

SLIDE 28

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best

SLIDE 29

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best
  • Captures word order

SLIDE 30

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best
  • Captures word order
  • Achieves SOTA: 31.34 PPL with 0.5 GPU days

SLIDE 31

Best LM: Adapted BERT with CAS

[Bar chart as on Slide 27]

  • BERT-Large+CAS is best
  • Captures word order
  • Achieves SOTA: 31.34 PPL with 0.5 GPU days
  • Achieves 20.42 PPL with 1B tokens

SLIDE 32

Take-aways

  • BERT needs to be adapted for language modeling
  • Adding RNN layers via neural architecture search works
  • Fixing a subset of pre-trained weights via neural architecture search works