Advanced Pre-training Language Models

Advanced Pre-training Language Models - PowerPoint PPT Presentation



  1. Advanced Pre-training Language Models: A Brief Introduction. Xiachong Feng

  2. Outline: 1. Encoder-Decoder 2. Attention 3. Transformer: "Attention is all you need" 4. Word embedding and pre-trained model 5. ELMo: "Deep contextualized word representations" 6. OpenAI GPT: "Improving Language Understanding by Generative Pre-Training" 7. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" 8. Conclusion

  3. Outline: 1. Encoder-Decoder 2. Attention 3. Transformer: "Attention is all you need" 4. Word embedding and pre-trained model 5. ELMo: "Deep contextualized word representations" 6. OpenAI GPT: "Improving Language Understanding by Generative Pre-Training" 7. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" 8. Conclusion

  4. Encoder-Decoder Framework. Source = <x_1, x_2, ..., x_m>, Target = <y_1, y_2, ..., y_n>. The encoder compresses the source into a context vector C = F(x_1, x_2, ..., x_m); the decoder then generates y_1 = g(C), y_2 = g(C, y_1), y_3 = g(C, y_1, y_2), and in general y_i = g(C, y_1, y_2, ..., y_{i-1}). When the sentence is short, the context vector may retain the important information; when the sentence is long, the context vector will lose some information, such as semantics.
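A minimal sketch of this abstraction in plain Python (hypothetical toy code, not from the slides; `encode` plays the role of C = F(x_1..x_m) and `decode_step` the role of y_i = g(C, y_1..y_{i-1})):

```python
# Toy encoder-decoder skeleton (illustrative only).

def encode(source_tokens):
    # A real encoder would be an RNN/CNN/Transformer; here we just average fake
    # 1-d "embeddings" to get a single fixed-size context vector C.
    embeddings = [hash(tok) % 100 / 100.0 for tok in source_tokens]
    return sum(embeddings) / len(embeddings)

def decode_step(context, previous_tokens):
    # Each decoder step conditions on C and on everything generated so far.
    return f"y{len(previous_tokens) + 1}(C={context:.2f})"

source = ["x1", "x2", "x3"]
C = encode(source)
target = []
for _ in range(4):              # generate a fixed-length toy target sequence
    target.append(decode_step(C, target))
print(target)
```

The key limitation the slide points out is visible in the sketch: every decoder step sees only the single vector C, no matter how long the source is.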

  5. Outline: 1. Encoder-Decoder 2. Attention 3. Transformer: "Attention is all you need" 4. Word embedding and pre-trained model 5. ELMo: "Deep contextualized word representations" 6. OpenAI GPT: "Improving Language Understanding by Generative Pre-Training" 7. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" 8. Conclusion

  6. Soft-Attention. Instead of a single context vector, each decoder step gets its own context vector C_i = Σ_j a_ij h_j, a weighted sum of the encoder hidden states, so that y_1 = g(C_1), y_2 = g(C_2, y_1), y_3 = g(C_3, y_1, y_2).

  7. Core idea of Attention. Attention(Query, Source) = Σ_i Similarity(Query, Key_i) * Value_i, summing over the source positions i. Dot: Similarity(Query, Key_i) = Query · Key_i. Cosine: Similarity(Query, Key_i) = (Query · Key_i) / (||Query|| · ||Key_i||). MLP: Similarity(Query, Key_i) = MLP(Query, Key_i).
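A short NumPy sketch of this formula (illustrative; the vector sizes and function names are hypothetical, and the similarity scores are softmax-normalized before the weighted sum, as is usual in practice):

```python
import numpy as np

# Attention(Query, Source) = sum_i softmax(Similarity(Query, Key_i)) * Value_i

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_similarity(query, keys):
    return keys @ query                                   # Query . Key_i for every i

def cosine_similarity(query, keys):
    return (keys @ query) / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query))

def attention(query, keys, values, similarity=dot_similarity):
    weights = softmax(similarity(query, keys))            # normalize the similarities
    return weights @ values                               # weighted sum of Value_i

rng = np.random.default_rng(0)
query = rng.normal(size=8)
keys = rng.normal(size=(5, 8))      # 5 key vectors
values = rng.normal(size=(5, 8))    # 5 value vectors
print(attention(query, keys, values))
print(attention(query, keys, values, similarity=cosine_similarity))
```

The MLP variant would simply replace `dot_similarity` with a small learned network over the concatenated query and key.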

  8. Attention Timeline. 2014: Recurrent Models of Visual Attention. 2014-2015: Attention in neural machine translation. 2015-2016: Attention-based RNN/CNN models in NLP. 2017: Self-Attention (Transformer).

  9. Outline: 1. Encoder-Decoder 2. Attention 3. Transformer: "Attention is all you need" 4. Word embedding and pre-trained model 5. ELMo: "Deep contextualized word representations" 6. OpenAI GPT: "Improving Language Understanding by Generative Pre-Training" 7. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" 8. Conclusion

  10. Attention is all you need. Key words: • Transformer • Faster • Encoder-Decoder • Scaled Dot-Product Attention • Multi-Head Attention • Position encoding • Residual connections

  11. A High-Level Look

  12. Encoder-Decoder. 1. The encoders are all identical in structure (yet they do not share weights). 2. The encoder's inputs first flow through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. 3. The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position. 4. The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.

  13. Encoder Detail. 1. Word embedding 2. Self-attention (positions depend on each other) 3. FFNN (applied to each position independently)
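A sketch of point 3, the position-wise feed-forward sub-layer (illustrative NumPy code with hypothetical dimensions; the same weights are applied to every position, i.e. to every row of the input matrix):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model). Each row (position) is transformed independently
    # with the same weights: ReLU(x W1 + b1) W2 + b2.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (5, 8): one output per position
```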

  14. Self-Attention at a High Level. As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

  15. Self-Attention in Detail. The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (of size 512): a Query vector, a Key vector, and a Value vector (each of size 64).

  16. Self-Attention in Detail. • The second step in calculating self-attention is to calculate a score. • The third and fourth steps are to divide the scores by 8 (the square root of the key dimension, 64), then pass the result through a softmax operation. • The fifth step is to multiply each value vector by its softmax score. • The sixth step is to sum up the weighted value vectors. (Figure: Scaled Dot-Product Attention.)

  17. Self-Attention in Detail. The self-attention calculation in matrix form.
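The six steps above collapse into the matrix form softmax(Q K^T / sqrt(d_k)) V. A minimal NumPy sketch (hypothetical weight matrices; d_model = 512 and d_k = 64, so sqrt(d_k) = 8 as on slide 16):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # step 1: query/key/value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # steps 2-3: scores scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)             # step 4: softmax over each row
    return weights @ V                             # steps 5-6: weighted sum of values

d_model, d_k, seq_len = 512, 64, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))            # input embeddings, size 512 per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)      # (4, 64)
```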

  18. Multi-Head Attention

  19. Multi-Head Attention

  20. Multi-Head Attention

  21. Multi-Head Attention
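Slides 18-21 illustrate multi-head attention with figures. A minimal sketch of the idea (illustrative; dimensions follow the paper, h = 8 heads of size d_k = 64 for d_model = 512): each head has its own projection matrices, the head outputs are concatenated and projected back with W_O.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, heads, W_O):
    # Run scaled dot-product attention once per head, concatenate, then project.
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

d_model, h = 512, 8
d_k = d_model // h                                   # 64 per head
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model)) * 0.01
print(multi_head_attention(X, heads, W_O).shape)     # (4, 512)
```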

  22. Positional Encoding
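The slide shows the sinusoidal positional encoding from "Attention is all you need": PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), added to the word embeddings. A small NumPy sketch (hypothetical helper name and sizes):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions use cosine
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512), added to embeddings
```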

  23. The Residuals
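The slide illustrates that each sub-layer (self-attention or FFNN) is wrapped in a residual connection followed by layer normalization, output = LayerNorm(x + Sublayer(x)). A compact sketch (illustrative; the learnable gain/bias of layer norm is omitted, and the sub-layer is a stand-in):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # Add the sub-layer's output back onto its input, then normalize.
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(4, 8))
toy_sublayer = lambda h: np.tanh(h)     # stand-in for self-attention or the FFNN
print(residual_block(x, toy_sublayer).shape)
```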

  24. Encoder-Decoder

  25. Decoder

  26. Linear and Softmax Layer

  27. Transformer (overview). Inputs are word embeddings plus position embeddings. Encoder and decoder self-attention use K = V = Q; the encoder-decoder attention uses K = V ≠ Q (Q comes from the decoder, K and V from the encoder). Each block ends with an FFNN that produces the output.

  28. Outline: 1. Encoder-Decoder 2. Attention 3. Transformer: "Attention is all you need" 4. Word embedding and pre-trained model 5. ELMo: "Deep contextualized word representations" 6. OpenAI GPT: "Improving Language Understanding by Generative Pre-Training" 7. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" 8. Conclusion

  29. Language Model. • A language model is a probability distribution over sequences of words: P(x_1, x_2, ..., x_n) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ... • N-gram models: uni-gram, bi-gram, tri-gram. • Neural network language models (NNLM).
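A toy bigram model makes the chain-rule factorization concrete (illustrative Python; the tiny corpus and the simple count-based estimates are my own example):

```python
from collections import Counter

# P(x_1, ..., x_n) ~= p(x_1) * prod_i p(x_i | x_{i-1}), estimated by counting.

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]   # count(prev, word) / count(prev)

sentence = ["the", "cat", "sat"]
prob = unigrams[sentence[0]] / len(corpus)          # p(x_1)
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_bigram(word, prev)                    # p(x_i | x_{i-1})
print(prob)
```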

  30. NNLM. Hidden layer: a_t = tanh(W x_t + b). Output scores: y_t = U a_t + c. Word probabilities: softmax(y_t).
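A sketch of that forward pass in NumPy (illustrative; the vocabulary size, context size, and weight names are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_forward(x, W, b, U, c):
    a = np.tanh(W @ x + b)        # hidden layer: a = tanh(W x + b)
    y = U @ a + c                 # output scores: y = U a + c
    return softmax(y)             # probability distribution over the vocabulary

vocab_size, context_dim, hidden = 1000, 3 * 50, 128
rng = np.random.default_rng(0)
x = rng.normal(size=context_dim)                      # concatenated context-word embeddings
W, b = rng.normal(size=(hidden, context_dim)) * 0.01, np.zeros(hidden)
U, c = rng.normal(size=(vocab_size, hidden)) * 0.01, np.zeros(vocab_size)
print(nnlm_forward(x, W, b, U, c).sum())              # ~ 1.0
```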

  31. NNLM and Word2Vec. Neural probabilistic language model (2003); Word2vec (2013).

  32. Pre-training. • Word embedding: Word2vec, GloVe, FastText, … • Transfer learning.

  33. Outline: 1. Encoder-Decoder 2. Attention 3. Transformer: "Attention is all you need" 4. Word embedding and pre-trained model 5. ELMo: "Deep contextualized word representations" 6. OpenAI GPT: "Improving Language Understanding by Generative Pre-Training" 7. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" 8. Conclusion

  34. Overview

  35. ELMo (Embeddings from Language Models). • Models complex characteristics of word use (syntax and semantics) and how these vary across linguistic contexts (polysemy). • Feature-based approach. • ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM. • The higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.

  36. Bidirectional Language Models. • Forward language model. • Backward language model. • Jointly maximizes the log likelihood of the forward and backward directions. • The token representation and the softmax layer share some weights between directions instead of using completely independent parameters.
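For reference, the joint objective (as stated in the ELMo paper, to the best of my recollection) sums the forward and backward log likelihoods over all N tokens, with the token representation Θ_x and the softmax parameters Θ_s tied across the two directions:

```latex
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
             + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big)
```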

  37. Embeddings from Language Models. • ELMo is a task-specific combination of the intermediate layer representations in the biLM. • For the k-th token, an L-layer bidirectional language model computes 2L+1 representations. • For a specific downstream task, ELMo learns a weight for each layer to combine these representations (in the simplest case it just selects the top layer).
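A small sketch of that combination, ELMo_k = γ · Σ_j s_j · h_{k,j}, where s is a softmax over learned layer scalars and h_{k,0..L} are the biLM layer representations of token k (illustrative NumPy code; the dimensions and variable names are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def elmo_embedding(layer_reps, scalars, gamma):
    # layer_reps: (L + 1, dim) stack of biLM layer outputs for one token
    # (layer 0 = token representation, layers 1..L = biLSTM states).
    s = softmax(scalars)                        # learned, softmax-normalized layer weights
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

L, dim = 2, 1024
rng = np.random.default_rng(0)
layer_reps = rng.normal(size=(L + 1, dim))      # token layer + 2 biLSTM layers
scalars = np.zeros(L + 1)                       # task-specific scalars (learned in practice)
print(elmo_embedding(layer_reps, scalars, gamma=1.0).shape)   # (1024,)
```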

  38. Embeddings from Language Models

  39. Using biLMs for supervised NLP tasks. • Concatenate the ELMo vector with the initial word embedding and pass the representation into the task RNN. • Include ELMo at the output of the task RNN by introducing another set of output-specific linear weights. • Add a moderate amount of dropout to ELMo, and in some cases regularize the ELMo weights by adding λ‖w‖² to the loss.
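The first usage pattern is just a concatenation before the task RNN, e.g. [x_k; ELMo_k] (illustrative snippet with hypothetical dimensions, reusing the ELMo vector from the previous sketch):

```python
import numpy as np

word_dim, elmo_dim = 300, 1024
rng = np.random.default_rng(0)
x_k = rng.normal(size=word_dim)            # e.g. a GloVe embedding of token k
elmo_k = rng.normal(size=elmo_dim)         # ELMo vector for token k
enhanced = np.concatenate([x_k, elmo_k])   # representation passed to the task RNN
print(enhanced.shape)                      # (1324,)
```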

  40. Experiments. 1. Question answering 2. Textual entailment 3. Semantic role labeling 4. Coreference resolution 5. Named entity extraction 6. Sentiment analysis
