  1. Language Models with Transformers. Chenguang Wang, Mu Li, Alexander J. Smola (Amazon Web Services)

  2. Background

  3. Language Model (LM) • Predict what word comes next, e.g. "Start to learn ___" → "English"

  4. Language Model (LM) • Predict what word comes next ("Start to learn ___" → "English") • Useful in many NLP applications

  5. Language Model (LM) • Predict what word comes next • Useful in many NLP applications • Word order matters: "Start to learn English" vs. "Learn to start business" • Many NLP problems share a similar definition
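
For concreteness, the "predict what word comes next" task is the usual autoregressive factorization of a sentence's probability (notation mine, not from the slides); word order matters because every factor conditions on the ordered prefix:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```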

  6. Language Model with RNNs • The RNN takes a one-hot encoding of the input word (e.g. "Start")

  7. Language Model with RNNs • The RNN models word order in its hidden state: input "Start" → output "to"

  8. Language Model with RNNs • The RNN models word order in its hidden state: input "Start to" → output "to learn"

  9. Language Model with RNNs • The RNN models word order in its hidden state: input "Start to learn" → output "to learn English"
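
A minimal sketch of the RNN language model built up on slides 6–9, assuming PyTorch; the layer sizes, class name, and toy token ids are illustrative, not the authors' exact model:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal RNN LM: embed the current word, update the hidden state,
    and predict a distribution over the next word."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # dense stand-in for a one-hot input
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embed(tokens)                 # tokens: (batch, seq_len) word ids
        out, hidden = self.rnn(emb, hidden)      # the hidden state carries the word order
        logits = self.decoder(out)               # (batch, seq_len, vocab) next-word scores
        return logits, hidden

# Example: score the next word after "Start to learn"
model = RNNLanguageModel(vocab_size=10000)
tokens = torch.tensor([[1, 2, 3]])               # hypothetical ids for "Start to learn"
logits, _ = model(tokens)
next_word_probs = logits[0, -1].softmax(dim=-1)  # distribution over the next word ("English", ...)
```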

  10. SOTA NLP with Transformers • Transformer block with positional encoding (other components omitted for simplicity) • Captures less word order [Devlin et al., 2018]

  11. SOTA NLP with Transformers • Transformer: self-attention + positional encoding (other components omitted for simplicity) • Parallelizable • Efficient [Devlin et al., 2018]

  12. SOTA NLP with Transformers • Transformer (self-attention + positional encoding): less word order, parallelizable, efficient • RNN: models word order, sequential, less efficient (other components omitted for simplicity) [Devlin et al., 2018]
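
To make the Transformer-vs-RNN contrast concrete, here is a small sketch of my own (PyTorch assumed; query/key/value projections and multi-head machinery are omitted) of scaled dot-product self-attention plus a sinusoidal positional encoding. All positions are processed in parallel, and without the positional term the operation is permutation-invariant, which is the "less word order" point above:

```python
import math
import torch

def positional_encoding(seq_len, dim):
    """Sinusoidal positional encoding as in the original Transformer (dim assumed even)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    i = torch.arange(0, dim, 2, dtype=torch.float)
    angles = pos / torch.pow(10000.0, i / dim)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def self_attention(x):
    """Scaled dot-product self-attention over all positions at once
    (Q/K/V projections omitted for brevity)."""
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    return scores.softmax(dim=-1) @ x

seq_len, dim = 4, 8
x = torch.randn(seq_len, dim)                               # token embeddings
y = self_attention(x + positional_encoding(seq_len, dim))   # order enters only via the positional term
```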

  13. SOTA NLP with Transformers • BERT: a stack of 12 (or 24) Transformer blocks (Transformer 0 … Transformer 11)

  14. SOTA NLP with Transformers
      • BERT: a stack of 12 (or 24) Transformer blocks
      • Trained on large language-model datasets
      • Full training cost in excess of $10,000 (16 TPUs, 4 days)
      • Achieved SOTA results on 11 NLP applications
      • Sentence-level tasks: care less about word order

  15. Approach: Make the Best Use of BERT for Language Modeling

  16. LM: Adapted BERT • BERT with a linear layer on top: embedding → Transformer 0 → … → Transformer 11 → Linear • Legend for the figures that follow: fixed weights vs. tunable weights
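
A sketch of the adapted BERT LM on slide 16: a pre-trained BERT encoder with a linear decoder on top, where the encoder weights can be kept fixed or left tunable. The Hugging Face transformers package and the class/argument names here are my illustrative stand-ins, not the authors' implementation, and the masking details needed for strict left-to-right prediction are omitted:

```python
import torch.nn as nn
from transformers import BertModel

class BertLM(nn.Module):
    """Pre-trained BERT encoder with a linear next-word decoder on top."""
    def __init__(self, vocab_size, freeze_encoder=True):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.decoder = nn.Linear(self.bert.config.hidden_size, vocab_size)
        if freeze_encoder:                       # "fixed weights": only the linear layer is trained
            for p in self.bert.parameters():
                p.requires_grad = False

    def forward(self, input_ids, attention_mask=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.decoder(hidden)              # (batch, seq_len, vocab) next-word logits
```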

  17. LM 1: Adapted BERT with Fixed Weights
      • Only moderate results (the lower the test PPL, the better)
      • Architecture as on slide 16: encoder weights fixed, only the linear layer tunable

        Model   Test PPL
        BERT       69.32
        RNN        42.25
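
For reference, the test perplexity (PPL) reported in these tables is the exponentiated average negative log-likelihood over the N test tokens, so lower is better:

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{t=1}^{N} \log P(w_t \mid w_{<t})\right)
```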

  18. LM 2: Adapted BERT with All Weights Tunable
      • Overfitting

        Model      Test PPL
        BERT          69.32
        BERT-All      67.43
        RNN           42.25

  19. LM 3: Adapted BERT with Partial Weights
      • Fixing a subset of the pre-trained weights is promising
      • However, enumerating all subsets is not feasible

        Model         Test PPL
        BERT             69.32
        BERT-All         67.43
        BERT-Subset      40.56
        RNN              42.25
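
The "fix a subset of weights" variant amounts to freezing only some of the Transformer blocks; which subset to freeze is exactly the choice that is infeasible to enumerate by hand. A sketch, assuming the Hugging Face BertModel layout from the earlier example (the choice of the first six layers is arbitrary):

```python
def freeze_layers(bert, layer_indices):
    """Freeze the pre-trained weights of the selected Transformer blocks only."""
    for i in layer_indices:
        for p in bert.encoder.layer[i].parameters():
            p.requires_grad = False

# e.g. keep the lower half of a 12-layer BERT fixed and fine-tune the rest:
# freeze_layers(model.bert, layer_indices=range(6))
```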

  20. LM 4: Adapted BERT with RNN
      • Adding RNN layers to capture word order is promising
      • However, enumerating the options (where to add them, how many) is not feasible

        Model       Test PPL
        BERT           69.32
        BERT-RNN       41.64
        RNN            42.25
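
Likewise, the BERT-RNN variant inserts a recurrent layer between the encoder and the linear decoder to re-introduce word order. Continuing the BertLM sketch above (the single LSTM layer and its size are my arbitrary choices):

```python
import torch.nn as nn

class BertRNNLM(BertLM):
    """BERT encoder, followed by an LSTM that models word order, then the linear decoder."""
    def __init__(self, vocab_size, freeze_encoder=True):
        super().__init__(vocab_size, freeze_encoder)
        hidden = self.bert.config.hidden_size
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, input_ids, attention_mask=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.rnn(h)                       # recurrent pass restores sequential order information
        return self.decoder(h)
```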

  21. Where to add the RNN layers?

  22. Which layer’s pre-trained weights should be fixed? Where to add the RNN layers?

  23. Coordinate Architecture Search (CAS)
      • Step 1: Choose a layer’s weights to fix
      • Step 2: Choose a position to add an RNN layer
      • Step 3: Go to Step 1 or terminate
      • Illustrated step: fix Transformer 0’s weights
      • Greedy strategy: fine-tune the resulting BERT and keep the best

  24. Coordinate Architecture Search (CAS)
      • Step 1: Choose a layer’s weights to fix
      • Step 2: Choose a position to add an RNN layer
      • Step 3: Go to Step 1 or terminate
      • Illustrated step: add an RNN layer (on top of Transformer 1 in the figure)
      • Greedy strategy: fine-tune the resulting BERT and keep the best

  25. Coordinate Architecture Search (CAS)
      • Step 1: Choose a layer’s weights to fix
      • Step 2: Choose a position to add an RNN layer
      • Step 3: Go to Step 1 or terminate
      • Illustrated step: add a linear layer on top
      • Greedy strategy: fine-tune the resulting BERT and keep the best

  26. Coordinate Architecture Search (CAS)
      • Step 1: Choose a layer’s weights to fix
      • Step 2: Choose a position to add an RNN layer
      • Step 3: Go to Step 1 or terminate
      • Resulting architecture in the figure: embedding → Transformer 0 → Transformer 1 → RNN → Linear
      • Greedy strategy: fine-tune the resulting BERT and keep the best (a sketch of this greedy loop follows)
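
The greedy loop on slides 23–26 can be written out schematically as follows. The search logic is what the slides describe; the three callbacks (candidates_fn, apply_fn, eval_ppl_fn) are hypothetical placeholders for "enumerate single modifications", "apply one modification", and "briefly fine-tune and return validation perplexity":

```python
def coordinate_architecture_search(base_model, candidates_fn, apply_fn, eval_ppl_fn, max_steps=10):
    """Greedy CAS sketch: at each step try every single modification
    (fix one more layer's weights, or add an RNN/linear layer at one position),
    keep the one with the lowest validation perplexity, and stop when nothing improves."""
    best_model, best_ppl = base_model, eval_ppl_fn(base_model)
    for _ in range(max_steps):
        step_best = None
        for mod in candidates_fn(best_model):        # hypothetical: candidate modifications
            model = apply_fn(best_model, mod)        # hypothetical: apply one modification
            ppl = eval_ppl_fn(model)                 # hypothetical: fine-tune briefly, return PPL
            if step_best is None or ppl < step_best[1]:
                step_best = (model, ppl)
        if step_best is None or step_best[1] >= best_ppl:
            break                                    # terminate: no candidate improves
        best_model, best_ppl = step_best             # keep the best and continue from it
    return best_model
```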

  27. Best LM: Adapted BERT with CAS • [Bar chart: test perplexity on PTB and WT-103 for AWD-LSTM-MoS-BERTVocab, BERT, BERT-CAS-Subset, BERT-CAS-LSTM, BERT-CAS, BERT-Large-CAS]

  28. Best LM: Adapted BERT with CAS • BERT-Large + CAS is best [same chart as slide 27]

  29. Best LM: Adapted BERT with CAS • BERT-Large + CAS is best • Captures word order [same chart as slide 27]

  30. Best LM: Adapted BERT with CAS • BERT-Large + CAS is best • Captures word order • Achieves SOTA: 31.34 PPL with 0.5 GPU days [same chart as slide 27]

  31. Best LM: Adapted BERT with CAS • BERT-Large + CAS is best • Captures word order • Achieves SOTA: 31.34 PPL with 0.5 GPU days • Achieves 20.42 PPL with 1B tokens [same chart as slide 27]

  32. Take-aways • BERT needs to be adapted for language modeling • Adding RNN layers via neural architecture search works • Fixing pre-trained weights via neural architecture search works
