
CS5242 Neural Networks and Deep Learning Lecture 09: RNN Applications II



  1. CS5242 Neural Networks and Deep Learning, Lecture 09: RNN Applications II. Wei WANG. TAs: Yao SHU, Juncheng LIU, Qingpeng Cai. cs5242@comp.nus.edu.sg

  2. Recap
  • Language modelling
    • Training
      • Model the joint probability
      • Model the conditional probability
      • Train an RNN to predict the next word
    • Inference
      • Greedy search vs. beam search
  • Image caption generation
    • CNN --- image features
    • RNN --- word generation
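As a quick reminder of the two inference strategies recapped here, the sketch below contrasts greedy search with beam search over a toy next-word distribution. The vocabulary, the step() helper and the beam width are illustrative assumptions, not lecture code.

```python
# Minimal sketch: greedy search vs. beam search over a toy next-word distribution.
import numpy as np

VOCAB = ["<eos>", "the", "mrt", "is", "stable"]

def step(prefix):
    """Toy stand-in for P(next word | prefix): random but fixed per prefix."""
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    p = rng.random(len(VOCAB))
    return p / p.sum()

def greedy(max_len=5):
    prefix = []
    for _ in range(max_len):
        w = int(np.argmax(step(prefix)))      # pick the single most likely word
        prefix.append(w)
        if VOCAB[w] == "<eos>":
            break
    return [VOCAB[w] for w in prefix]

def beam_search(beam_width=3, max_len=5):
    beams = [([], 0.0)]                       # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and VOCAB[prefix[-1]] == "<eos>":
                candidates.append((prefix, score))   # finished beam, keep as is
                continue
            logp = np.log(step(prefix))
            for w in range(len(VOCAB)):
                candidates.append((prefix + [w], score + logp[w]))
        # keep only the beam_width highest-scoring prefixes
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    best = max(beams, key=lambda b: b[1])[0]
    return [VOCAB[w] for w in best]

print("greedy:", greedy())
print("beam  :", beam_search())
```

A larger beam width keeps more partial hypotheses per step and usually finds higher-probability outputs, at a higher compute cost.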

  3. Agenda
  • Machine translation
  • Attention modelling
  • Transformer *
  • Question answering
  • Colab tutorial

  4. RNN Architectures
  • Example tasks for the different RNN architectures: sentiment analysis, language modelling, image caption generation, machine translation, question answering
  • Image source: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

  5. Machine translation [19]
  • Given a sentence in one language, e.g. English:
    • Singapore MRT is not stable
  • Return a sentence in another language, e.g. Chinese:
    • 新加坡地铁不稳定
  • Training
    • max_Θ Σ_{<x, y>} log P(y_1, y_2, ..., y_m | x_1, x_2, ..., x_n; Θ)
  • Inference
    • argmax_{<y_1, y_2, ..., y_m>} log P(y_1, y_2, ..., y_m | x_1, x_2, ..., x_n)
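The objective above factorizes into per-word conditional probabilities. A minimal sketch of that decomposition follows; model_step() is a placeholder assumption standing in for the decoder's softmax output, not part of the lecture.

```python
# The sentence-level log-probability log P(y_1..y_m | x_1..x_n) is the sum of
# per-step log-probabilities of each target word given the source and the
# previously generated target words.
import numpy as np

def model_step(source_ids, target_prefix, vocab_size=1000):
    """Placeholder for P(next target word | source, previous target words)."""
    rng = np.random.default_rng(len(target_prefix) + sum(source_ids))
    p = rng.random(vocab_size)
    return p / p.sum()

def sentence_log_prob(source_ids, target_ids):
    total = 0.0
    for t, y_t in enumerate(target_ids):
        probs = model_step(source_ids, target_ids[:t])
        total += np.log(probs[y_t])       # log P(y_t | x, y_1..y_{t-1})
    return total                          # = log P(y_1..y_m | x_1..x_n)

# Training maximizes the sum of sentence_log_prob over all <x, y> pairs;
# inference searches for the target sequence that maximizes it.
print(sentence_log_prob([4, 7, 9], [2, 5, 1]))
```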

  6. Sequence to sequence model
  • Seq2Seq
    • P(y_1, y_2, ..., y_m | x_1, x_2, ..., x_n) = P(y_1, y_2, ..., y_m | S)
    • S is a summary of the input
  • Diagram: the encoder RNN reads A B C <END> and produces the summary S; the decoder RNN starts from S (and <START>) and emits W X Y Z <END>.
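A minimal NumPy sketch of the encoder-decoder idea on this slide, assuming plain tanh RNN cells and greedy decoding; all shapes, parameter names and the decoding loop are illustrative, not the lecture's reference implementation.

```python
# The encoder compresses the input sequence into one summary vector S;
# the decoder is initialized from S and generates the output sequence.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab_out = 8, 16, 20

# Encoder parameters (U: input->hidden, W: hidden->hidden)
U_enc, W_enc = rng.normal(size=(d_in, d_h)) * 0.1, rng.normal(size=(d_h, d_h)) * 0.1
# Decoder parameters (E: output-word embedding, V: hidden->vocab)
E_dec = rng.normal(size=(vocab_out, d_in)) * 0.1
U_dec, W_dec = rng.normal(size=(d_in, d_h)) * 0.1, rng.normal(size=(d_h, d_h)) * 0.1
V_out = rng.normal(size=(d_h, vocab_out)) * 0.1

def encode(xs):
    h = np.zeros(d_h)
    for x in xs:                       # consume each input token embedding (A, B, C, <END>)
        h = np.tanh(x @ U_enc + h @ W_enc)
    return h                           # S: summary of the whole input

def decode(S, start_id=0, end_id=1, max_len=10):
    h, y, out = S, start_id, []
    for _ in range(max_len):           # generate output words until <END>
        h = np.tanh(E_dec[y] @ U_dec + h @ W_dec)
        logits = h @ V_out
        y = int(np.argmax(logits))     # greedy choice of the next word
        if y == end_id:
            break
        out.append(y)
    return out

S = encode([rng.normal(size=d_in) for _ in range(4)])
print(decode(S))
```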

  7. Sequence to Sequence [19]
  • Seq2Seq
    • P(y_1, y_2, ..., y_m | x_1, x_2, ..., x_n) = P(y_1, y_2, ..., y_m | S)
    • S is a summary of the input
  • Encoder and decoder are two RNN networks
    • They have their own parameters
    • End-to-end training
  • Reverse the input sequence
    • x_1, x_2, ..., x_n → x_n, x_{n-1}, ..., x_1
    • Overall better output sequence?
    • x_1 is near y_1, x_2 is near y_2, ...
  • Multiple stacks of RNN → better output sequence

  8. Attention modelling [20]
  • Problem of the Seq2Seq model
    • Each output word depends on the output of the encoder
    • The contribution of some words should be larger than others
      • Singapore MRT is not stable --- 新加坡
      • Singapore MRT is not stable --- 地铁
  • Diagram: alignment between the output words 新加坡 地铁 不 稳定 and the input words Singapore MRT is not stable.

  9. Attention modelling
  • Differentiate the words from the encoder
  • Let some words have more contribution
  • Diagram: a decoder RNN (producing 新加坡 地铁 不) attending over the encoder RNN (reading Singapore MRT is not). Image from: https://distill.pub/2016/augmented-rnns/

  10. Attention modelling (demo)
  • Diagram: attention weights between the output words 新加坡 地铁 不 and the input words Singapore MRT is not. Image from: https://distill.pub/2016/augmented-rnns/

  11. Attention modelling [20]
  • Example implementation
    • Extended GRU for the decoder
    • Consider the related words from the encoder during decoding
    • Weighted combination of the encoder hidden states → context vector c_t
  • Decoder update (GRU with context)
    • s_t = (1 - z_t) ∘ s_{t-1} + z_t ∘ s̃_t
    • s̃_t = tanh([r_t ∘ s_{t-1}, y_{t-1} W_e, c_t] W)
    • r_t = σ([s_{t-1}, y_{t-1} W_e, c_t] W_r)
    • z_t = σ([s_{t-1}, y_{t-1} W_e, c_t] W_z)
  • Attention
    • c_t = Σ_{j=1..n} α_{tj} h_j
    • α_{tj} is the attention weight: α_{tj} = softmax(e_{tj}), j = 1..n
    • e_{tj} = v_a^T tanh(s_{t-1} W_a + h_j U_a)
    • Larger weight for more related words from the encoder
  • Diagram: the encoder RNN produces hidden states h_1, h_2, ...; at each decoding step the decoder combines s_{t-1}, the previous output word y_{t-1} and the context c_t to compute s_t.
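A minimal NumPy sketch of one decoder step under the equations above: score each encoder state against s_{t-1}, softmax the scores into α_{tj}, form the context c_t, then apply the GRU-style update. The dimensions, initialization and the attention_step() name are assumptions; the slide only specifies the equations.

```python
# One decoder step of additive attention plus a GRU-style update.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_s, d_y = 4, 4, 3          # encoder state, decoder state, word-embedding sizes

W_a, U_a, v_a = rng.normal(size=(d_s, d_h)), rng.normal(size=(d_h, d_h)), rng.normal(size=d_h)
W, W_r, W_z = (rng.normal(size=(d_s + d_y + d_h, d_s)) * 0.1 for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_step(s_prev, y_prev_emb, H):
    """One decoder step. H has one encoder hidden state h_j per row."""
    # e_tj = v_a^T tanh(s_{t-1} W_a + h_j U_a)
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a
    alpha = np.exp(e) / np.exp(e).sum()          # softmax over source positions
    c = alpha @ H                                # c_t = sum_j alpha_tj h_j
    concat = np.concatenate([s_prev, y_prev_emb, c])
    r = sigmoid(concat @ W_r)                    # reset gate r_t
    z = sigmoid(concat @ W_z)                    # update gate z_t
    s_tilde = np.tanh(np.concatenate([r * s_prev, y_prev_emb, c]) @ W)
    s = (1 - z) * s_prev + z * s_tilde           # new decoder state s_t
    return s, alpha

H = rng.normal(size=(5, d_h))                    # 5 encoder hidden states
s1, alpha = attention_step(np.zeros(d_s), rng.normal(size=d_y), H)
print(alpha, s1)
```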

  12. Attention modelling
  • Encoder RNN
    • Input word vectors: a = [0.1, 0, 1], b = [1, 0.1, 0], c = [0.2, 0.3, 1]
    • Hidden representations (vectors): h_1, h_2, h_3
      • h_1 = tanh(a U + h_0 W)
      • h_2 = tanh(b U + h_1 W)
      • h_3 = tanh(c U + h_2 W)
    • Parameters: {U, W}

  13. Attention modelling
  • Decoder RNN
    • Given the hidden state vector s_0
    • To compute the weights of h_1, h_2, h_3 for computing s_1:
      • e_11 = a(s_0, h_1), e_12 = a(s_0, h_2), e_13 = a(s_0, h_3)
      • a(s_0, h_1) = v^T tanh(s_0 W_a + h_1 U_a)
      • a(s_0, h_2) = v^T tanh(s_0 W_a + h_2 U_a)
      • a(s_0, h_3) = v^T tanh(s_0 W_a + h_3 U_a)
      • α_11 = exp(e_11) / (exp(e_11) + exp(e_12) + exp(e_13))
      • α_12 = exp(e_12) / (exp(e_11) + exp(e_12) + exp(e_13))
      • α_13 = exp(e_13) / (exp(e_11) + exp(e_12) + exp(e_13))
      • c_1 = α_11 h_1 + α_12 h_2 + α_13 h_3
    • GRU update with the context vector c_t (as on the previous slide):
      • s_t = (1 - z_t) ∘ s_{t-1} + z_t ∘ s̃_t
      • s̃_t = tanh([r_t ∘ s_{t-1}, y_{t-1} W_e, c_t] W)
      • r_t = σ([s_{t-1}, y_{t-1} W_e, c_t] W_r)
      • z_t = σ([s_{t-1}, y_{t-1} W_e, c_t] W_z)
    • Parameters: {v, W_a, U_a, W, W_r, W_z, W_e}
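The worked example on the last two slides can be checked numerically. The sketch below plugs in the given word vectors a, b, c and computes h_1..h_3, then the weights α_11..α_13 and the context c_1. The parameter values (U, W, W_a, U_a, v, s_0) are not given on the slides, so random ones are used purely to make it runnable.

```python
# Numeric check of the encoder states and the attention weights for step 1.
import numpy as np

rng = np.random.default_rng(0)
a, b, c = np.array([0.1, 0, 1.0]), np.array([1.0, 0.1, 0]), np.array([0.2, 0.3, 1.0])

U, W = rng.normal(size=(3, 3)) * 0.5, rng.normal(size=(3, 3)) * 0.5       # encoder params
W_a, U_a, v = rng.normal(size=(3, 3)), rng.normal(size=(3, 3)), rng.normal(size=3)
s0, h0 = rng.normal(size=3), np.zeros(3)

# Encoder: h_i = tanh(x_i U + h_{i-1} W)
h1 = np.tanh(a @ U + h0 @ W)
h2 = np.tanh(b @ U + h1 @ W)
h3 = np.tanh(c @ U + h2 @ W)

# Attention scores e_1j = a(s_0, h_j) = v^T tanh(s_0 W_a + h_j U_a)
e = np.array([v @ np.tanh(s0 @ W_a + h @ U_a) for h in (h1, h2, h3)])
alpha = np.exp(e) / np.exp(e).sum()                 # softmax over e_11, e_12, e_13
c1 = alpha[0] * h1 + alpha[1] * h2 + alpha[2] * h3  # context vector for step 1

print("alpha:", alpha)
print("c1   :", c1)
```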

  14. Transformer
  • Repeat self-attention modelling to get better word embeddings
  • Image source: https://jalammar.github.io/illustrated-transformer/

  15. Encoder and Decoder. Image source: https://jalammar.github.io/illustrated-transformer/

  16. Encoder. Image source: https://jalammar.github.io/illustrated-transformer/

  17. Self-attention: represent a word by considering the words in the context. Image source: https://jalammar.github.io/illustrated-transformer/

  18. Self-attention
  • Attention modelling
    • Find the attention weights?
      • query vs. key
    • Do the weighted summation
      • value
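A minimal NumPy sketch of the query/key/value view named on this slide, in the scaled dot-product form used by the Transformer: the attention weights come from comparing queries with keys, and the output is the weighted sum of values. Sizes and projection matrices are illustrative assumptions.

```python
# Scaled dot-product self-attention over one sentence (one row per word).
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 8, 8

X = rng.normal(size=(n_words, d_model))          # one row per word
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)                  # compare every query with every key
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
Z = weights @ V                                  # weighted sum of values, one row per word

print(weights.round(2))
print(Z.shape)                                   # (n_words, d_k)
```

Each output row is a new representation of the corresponding word, built from a weighted mix of all the words in the sentence.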

  19. Self-attention. Image source: https://jalammar.github.io/illustrated-transformer/

  20. Multi-headed self-attention (one row per word). Image source: https://jalammar.github.io/illustrated-transformer/
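A minimal NumPy sketch of multi-headed self-attention as pictured here: run several independent attention heads, concatenate their per-word outputs, and mix them with an output projection. The head count, sizes and W_o are illustrative assumptions.

```python
# Multi-head self-attention: several heads, concatenated, then projected.
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(n_words, d_model))          # one row per word

def softmax_rows(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def one_head(X, seed):
    rng_h = np.random.default_rng(seed)
    W_q, W_k, W_v = (rng_h.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax_rows(Q @ K.T / np.sqrt(d_head)) @ V   # (n_words, d_head)

heads = [one_head(X, seed) for seed in range(n_heads)]
concat = np.concatenate(heads, axis=1)                   # (n_words, d_model)
W_o = rng.normal(size=(d_model, d_model)) * 0.1
Z = concat @ W_o                                         # final per-word representation

print(Z.shape)                                           # (4, 8)
```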

  21. Encoder. Image source: https://jalammar.github.io/illustrated-transformer/

  22. Transformer. Image source: https://jalammar.github.io/illustrated-transformer/

  23. Transformer. Image source: https://jalammar.github.io/illustrated-transformer/

  24. Question answering
  • Given a context and a question
  • Output the answer, which can be
    • A word
    • A substring of the context
    • A new sentence

  25. Example. Source from [21]

  26. Example. Source from [22]

  27. Example. Source from [23]

  28. Solution [24]
  • Find the answer word from the context
    • max P(a = w | c, q)
  • Steps
    1. Extract representations of the question and the passage (context)
    2. Combine the question and context representations
      • Concatenation
      • Addition
    3. Generate the prediction
      • Match the combined feature with each candidate word feature, e.g. by inner product
      • Use the similarity as input to a softmax
  • Diagram: two RNN encoders, one over the question and one over the passage/context, feed a softmax over the candidate answer words.
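A minimal NumPy sketch of the three steps above, assuming mean-pooled features as a stand-in for the RNN encoders and concatenation as the combination step; the word features, the encode() helper and the projection matrix are placeholder assumptions.

```python
# Score every candidate word in the passage and softmax into P(a = w | c, q).
import numpy as np

rng = np.random.default_rng(0)
d = 16
passage_words = ["singapore", "mrt", "is", "not", "stable"]
word_feats = {w: rng.normal(size=d) for w in passage_words}    # candidate word features

def encode(words):
    """Placeholder sequence encoder: mean of word features (an RNN in the lecture)."""
    return np.mean([word_feats.get(w, rng.normal(size=d)) for w in words], axis=0)

q_feat = encode(["what", "is", "not", "stable"])               # question representation
c_feat = encode(passage_words)                                 # passage/context representation

combined = np.concatenate([q_feat, c_feat])                    # step 2: concatenation
proj = rng.normal(size=(2 * d, d)) * 0.1                       # map back to word-feature space

scores = np.array([word_feats[w] @ (combined @ proj) for w in passage_words])  # inner products
probs = np.exp(scores) / np.exp(scores).sum()                  # softmax over candidates

print(dict(zip(passage_words, probs.round(3))))
```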

  29. Summary
  • Seq2seq model for machine translation
  • Attention modelling
  • Transformer model
  • Question answering

