BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Source : NAACL-HLT 2019 Speaker : Ya-Fang, Hsiao Advisor : Jia-Ling, Koh Date : 2019/09/02
CONTENTS
1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion
BERT: Bidirectional Encoder Representations from Transformers

Language Model: assigns a probability to a token sequence by factorizing it into left-to-right conditionals:
$$Q(x_1, x_2, \ldots, x_U) = \prod_{u=1}^{U} Q(x_u \mid x_1, x_2, \ldots, x_{u-1})$$
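As a quick illustration of this factorization (not from the slides), the sequence probability is just the product of the per-token conditionals; the probabilities below are made up:

```python
# A minimal sketch of the chain-rule factorization of a left-to-right language
# model, using toy per-token conditional probabilities.
import math

# Hypothetical conditionals Q(x_u | x_1..x_{u-1}) for a 3-token sentence.
conditionals = [0.2, 0.5, 0.9]

log_prob = sum(math.log(p) for p in conditionals)
print(math.exp(log_prob))  # Q(x_1, x_2, x_3) = 0.2 * 0.5 * 0.9 = 0.09
```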
Pre-trained Language Model
Related Work
Pre-trained language models:
- Feature-based approach: ELMo
- Fine-tuning approach: OpenAI GPT
BERT: Bidirectional Encoder Representations from Transformers
Pre-training tasks:
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
Encoder-decoder (sequence-to-sequence) architecture; RNN-based encoder-decoders are hard to parallelize.
The encoder and decoder each stack 6 identical layers; unlike an RNN, the self-attention layers can be computed in parallel.
Self-Attention
- query: used to match against other positions
- key: what each position is matched by
- value: the information to be extracted
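A minimal sketch of scaled dot-product self-attention along these lines; the dimensions, random inputs, and projection matrices are illustrative, not taken from the paper:

```python
# Scaled dot-product self-attention: each position attends to all positions.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q = x @ w_q          # queries: used to match against other positions
    k = x @ w_k          # keys: what each position is matched by
    v = x @ w_v          # values: the information to be extracted
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (seq, seq) attention scores
    weights = F.softmax(scores, dim=-1)
    return weights @ v   # each position becomes a weighted sum of all values

x = torch.randn(4, 8)                          # 4 tokens, model dim 8
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```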
Multi-Head Attention
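A short sketch of multi-head attention using PyTorch's built-in module; the dimensions (d_model = 512, 8 heads) follow the Transformer paper's base configuration, but the input is made up:

```python
# Multi-head attention: several attention heads run in parallel and their
# outputs are concatenated and projected back to d_model.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)           # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)  # torch.Size([1, 10, 512]) torch.Size([1, 10, 10])
```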
BERT_BASE:  L = 12, H = 768,  A = 12, 110M parameters
BERT_LARGE: L = 24, H = 1024, A = 16, 340M parameters
(L = number of Transformer layers, H = hidden size, A = self-attention heads; feed-forward size = 4H)
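A rough, back-of-the-envelope parameter count for BERT_BASE (my own estimate, ignoring biases and LayerNorm) lands close to the reported 110M:

```python
# Approximate parameter count for BERT_BASE from its hyperparameters.
V, L, H = 30522, 12, 768                  # WordPiece vocab, layers, hidden size
embeddings = (V + 512 + 2) * H            # token + position + segment embeddings
per_layer = 4 * H * H + 2 * H * (4 * H)   # self-attention (Q, K, V, O) + feed-forward (4H)
total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M parameters")   # ≈ 109M, close to the reported 110M
```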
Bidirectional Encoder Representations from Transformers
Framework
- Pre-training: the model is trained on unlabeled data over different pre-training tasks.
- Fine-tuning: all parameters are fine-tuned using labeled data from the downstream tasks.
- Token embeddings: WordPiece embeddings with a 30,000-token vocabulary; [CLS] is the classification token, [SEP] the separator token.
- Segment embeddings: learned embeddings marking whether a token belongs to sentence A or sentence B.
- Position embeddings: learned positional embeddings.
Pre-training corpus: BooksCorpus and English Wikipedia.
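As a concrete illustration (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, which are not part of the original slides), a two-sentence input with [CLS], [SEP], and segment ids can be built like this:

```python
# Building BERT's input representation for a sentence pair.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # ~30k WordPiece vocab
enc = tokenizer("the man went to the store", "he bought a gallon of milk")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
```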
Pre-training
Two unsupervised tasks:
Masked Language Model (MLM)
Hung-Yi Lee - BERT ppt
15% of all WordPiece tokens in each sequence are chosen at random for prediction. Each chosen token is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) left unchanged 10% of the time.
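A sketch of this 80/10/10 masking rule; the token ids, [MASK] id, and vocabulary size below are placeholders for illustration:

```python
# Randomly select 15% of tokens and apply the 80/10/10 replacement rule.
import random

def mask_tokens(token_ids, mask_id=103, vocab_size=30522, mask_prob=0.15):
    labels = [-100] * len(token_ids)          # -100 marks positions not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:       # choose ~15% of positions
            labels[i] = tok                   # the model must predict the original token
            r = random.random()
            if r < 0.8:
                token_ids[i] = mask_id        # 80%: replace with [MASK]
            elif r < 0.9:
                token_ids[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the token unchanged
    return token_ids, labels

print(mask_tokens([7592, 2088, 2003, 2307]))
```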
Next Sentence Prediction
Hung-Yi Lee - BERT ppt
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
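A sketch of how IsNext / NotNext pairs could be sampled from an ordered corpus; the corpus and the sampling details are invented for illustration:

```python
# Build NSP training pairs: 50% actual next sentence, 50% random sentence.
import random

corpus = ["the man went to the store", "he bought a gallon of milk",
          "penguins are flightless birds", "they live in the southern hemisphere"]

def nsp_pair(i):
    # i must not be the index of the last sentence in the corpus
    if random.random() < 0.5:                       # 50%: the actual next sentence
        return corpus[i], corpus[i + 1], "IsNext"
    other = random.choice([s for s in corpus if s != corpus[i + 1]])
    return corpus[i], other, "NotNext"              # 50%: a random sentence

print(nsp_pair(0))
```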
Fine-Tuning
All parameters are fine-tuned using labeled data from the downstream tasks.
Single Sentence Classification Tasks
Input: a single sentence, [CLS] w1 w2 w3. A linear classifier on top of the [CLS] output vector predicts the class; the classifier is trained from scratch while BERT is fine-tuned.
Examples: sentiment analysis, document classification.
Hung-Yi Lee - BERT ppt
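A minimal fine-tuning sketch for single-sentence classification, assuming the Hugging Face transformers library (not used in the original slides): a linear classifier over the [CLS] output is trained from scratch while BERT's weights are updated as well.

```python
# Single-sentence classification: linear classifier over the [CLS] vector.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(768, 2)                 # 2 classes, e.g. positive / negative

enc = tokenizer("this movie is great", return_tensors="pt")
cls_vec = bert(**enc).last_hidden_state[:, 0]  # output vector of the [CLS] token
logits = classifier(cls_vec)
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()                                # gradients flow into BERT too (fine-tuning)
```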
Single Sentence Tagging Tasks
Input: a single sentence, [CLS] w1 w2 w3. A linear classifier is applied to every token's output vector, predicting one class per token.
Example: slot filling.
Hung-Yi Lee - BERT ppt
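A corresponding sketch for tagging: the same setup, but the linear classifier is applied to every token's output vector. The label set and sentence are invented:

```python
# Single-sentence tagging (e.g. slot filling): one label per token.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
tagger = nn.Linear(768, 5)                       # e.g. 5 slot labels

enc = tokenizer("book a flight to taipei", return_tensors="pt")
token_vecs = bert(**enc).last_hidden_state       # (1, seq_len, 768)
logits = tagger(token_vecs)                      # (1, seq_len, 5): one label per token
print(logits.argmax(-1))
```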
Sentence Pair Classification Tasks
Input: two sentences, packed as [CLS] sentence 1 [SEP] sentence 2. A linear classifier on the [CLS] output vector predicts the class.
Example: Natural Language Inference.
Hung-Yi Lee - BERT ppt
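A sketch of sentence-pair classification (e.g. NLI): both sentences go into one sequence and the [CLS] vector is classified. This again assumes the Hugging Face transformers library; the sentences and label set are illustrative:

```python
# Sentence-pair classification: pack both sentences, classify the [CLS] vector.
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(768, 3)                   # entailment / neutral / contradiction

enc = tokenizer("a man is playing a guitar", "a person plays an instrument",
                return_tensors="pt")
cls_vec = bert(**enc).last_hidden_state[:, 0]
print(classifier(cls_vec).softmax(-1))
```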
Question Answering Tasks (extraction-based QA)
Document: $D = d_1, d_2, \ldots, d_N$; Query: $Q = q_1, q_2, \ldots, q_M$
The QA model outputs two integers $s$ and $e$; the answer is the span $A = d_s, \ldots, d_e$.
Examples: $s = 17, e = 17$; $s = 77, e = 79$
Hung-Yi Lee - BERT ppt
Question Answering Tasks (with BERT)
The question and document are packed into one sequence: [CLS] q1 q2 [SEP] d1 d2 d3.
Two vectors, one for the answer start and one for the answer end, are learned from scratch. Each is dot-producted with every document token's output vector and passed through a softmax over the document positions; the highest-scoring positions give s and e.
Example: s = 2, e = 3, so the answer is "d2 d3".
Hung-Yi Lee - BERT ppt
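A toy sketch of the start/end mechanism, with random tensors standing in for BERT's outputs; only the dot-product-and-softmax logic is meant to be faithful:

```python
# Extractive QA: dot-product each document token's output with a learned
# start vector and end vector, then take the argmax of each softmax.
import torch
import torch.nn.functional as F

seq_out = torch.randn(3, 768)        # stand-in for BERT outputs of d1, d2, d3
start_vec = torch.randn(768)         # learned from scratch
end_vec = torch.randn(768)           # learned from scratch

start_probs = F.softmax(seq_out @ start_vec, dim=0)
end_probs = F.softmax(seq_out @ end_vec, dim=0)
s = start_probs.argmax().item() + 1  # 1-based token index
e = end_probs.argmax().item() + 1
print(f"answer span: d{s} ... d{e}")
```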
Experiments
Fine-tuning results on 11 NLP tasks
Implementation
LeeMeng - 進擊的BERT (PyTorch)
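In the spirit of the referenced PyTorch tutorial (fake-news pair classification), a minimal fine-tuning step with BertForSequenceClassification; the example texts, labels, and learning rate are placeholders:

```python
# One fine-tuning step for a 2-class sequence classifier built on BERT.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["fake headline example", "real headline example"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

model.train()
out = model(**batch, labels=labels)   # out.loss is the cross-entropy over the 2 classes
out.loss.backward()
optimizer.step()
```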
References
- Development of language models: http://bit.ly/nGram2NNLM
- Language model pre-training methods: http://bit.ly/ELMo_OpenAIGPT_BERT
- Attention Is All You Need: http://bit.ly/AttIsAllUNeed
- BERT paper: http://bit.ly/BERTpaper
- Hung-Yi Lee - Transformer (YouTube): http://bit.ly/HungYiLee_Transformer
- The Illustrated Transformer: http://bit.ly/illustratedTransformer
- Transformer explained in detail: http://bit.ly/explainTransformer
- github/codertimo - BERT (PyTorch): http://bit.ly/BERT_pytorch
- Implementing fake-news classification: http://bit.ly/implementpaircls
- Pytorch.org BERT: http://bit.ly/pytorchorgBERT