SLIDE 1

CS5242 Neural Networks and Deep Learning

Lecture 09: RNN Applications II

Wei WANG
TA: Yao SHU, Juncheng LIU, Qingpeng Cai
cs5242@comp.nus.edu.sg

SLIDE 2

Recap

  • Language modelling
  • Training
  • Model the joint probability
  • Model the conditional probability
  • Train RNN for predicting the next word
  • Inference
  • Greedy search vs. beam search
  • Image caption generation
  • CNN: image features
  • RNN: word generation

SLIDE 3

Agenda

  • Machine translation
  • Attention modelling
  • Transformer *
  • Question answering
  • Colab tutorial

SLIDE 4

RNN Architectures

[Figure: RNN input/output architectures with example applications: language modelling, sentiment analysis, image captioning, machine translation / question answering]

Image source: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

SLIDE 5

Machine translation [19]

  • Given a sentence in one language, e.g. English
  • Singapore MRT is not stable
  • Return a sentence in another language, e.g. Chinese
  • 新加坡地铁不稳定
  • Training
  • maxΘ Σ<x,y> log P(y1, y2, …, ym | x1, x2, …, xn)
  • Inference
  • max<y1, y2, …, ym> log P(y1, y2, …, ym | x1, x2, …, xn)

SLIDE 6

Sequence to sequence model

  • Seq2Seq
  • P(y1, y2, …, ym | x1, x2, …, xn) = P(y1, y2, …, ym | S)
  • S is a summary of the input

[Figure: encoder RNN reads the input A B C <END>; its summary S initialises the decoder RNN, which generates the output W X Y Z <END> starting from <START>]

SLIDE 7

Sequence to Sequence [19]

  • Seq2Seq
  • P(y1, y2, …, ym | x1, x2, …, xn) = P(y1, y2, …, ym | S)
  • S is a summary of the input
  • Encoder and Decoder are two RNN networks (a minimal sketch follows below)
  • They have their own parameters
  • End-to-end training
  • Reverse the input sequence
  • x1, x2, …, xn → xn, xn−1, …, x1
  • Overall better output sequence?
  • x1 is near y1, x2 is near y2, …
  • Multiple stacks of RNN → better output sequence
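
A minimal NumPy sketch of the encoder-decoder computation described above (the toy dimensions, randomly initialised weights, and greedy decoding are assumptions for illustration only; a real system trains all parameters end-to-end on sentence pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 5                                   # toy hidden size and output vocabulary size
E = rng.normal(size=(vocab, d))                   # output-word embedding table
U, W = rng.normal(size=(d, d)), rng.normal(size=(d, d))          # encoder RNN parameters
Ud, Wd, V = (rng.normal(size=(d, d)), rng.normal(size=(d, d)),
             rng.normal(size=(d, vocab)))                        # decoder RNN parameters

def encode(xs):
    """Encoder RNN over the reversed input; the final state is the summary S."""
    h = np.zeros(d)
    for x in reversed(xs):                        # reverse the source sequence, as in [19]
        h = np.tanh(x @ U + h @ W)
    return h

def greedy_decode(S, start_id=0, steps=4):
    """Decoder RNN initialised with S; picks the arg-max word at every step."""
    s, y, out = S, E[start_id], []
    for _ in range(steps):
        s = np.tanh(y @ Ud + s @ Wd)
        w = int(np.argmax(s @ V))                 # greedy choice; beam search would keep k candidates
        out.append(w)
        y = E[w]                                  # feed the chosen word back in
    return out

src = [rng.normal(size=d) for _ in range(3)]      # three source word vectors
print(greedy_decode(encode(src)))                 # four predicted word ids
```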

SLIDE 8
Attention modelling [20]

  • Problem of the Seq2Seq model
  • Every output word depends on the same single encoder output (the summary S)
  • The contribution of some input words should be larger than others
  • Singapore MRT is not stable --- 新加坡
  • Singapore MRT is not stable --- 地铁

[Figure: word-level alignment between the target words 新加坡 地铁 不 稳定 and the source words Singapore MRT is not stable]

SLIDE 9

Attention modelling

  • Differentiate the words from the encoder
  • Let some words contribute more than others

Image from: https://distill.pub/2016/augmented-rnns/
[Figure: attention links between the encoder RNN over "Singapore MRT is not ..." and the decoder RNN over "新加坡 地铁 不 ..."]

SLIDE 10

Attention modelling

Image from: https://distill.pub/2016/augmented-rnns/ (demo)
[Figure: attention weights between the source words "Singapore MRT is not ..." and the generated words "新加坡 地铁 不 ..."]

SLIDE 11
Attention modelling [20]

  • Example implementation
  • Extended GRU for the decoder
  • Consider the related words from the encoder during decoding
  • Weighted combination of the hidden states from the encoder → ct
  • st = (1 − zt) ∘ st−1 + zt ∘ s̃t
  • s̃t = tanh([rt ∘ st−1, yt−1We, ct] W)
  • rt = σ([st−1, yt−1We, ct] Wr)
  • zt = σ([st−1, yt−1We, ct] Wz)
  • ct = Σj=1…n αtj hj
  • αtj: attention weight
  • αtj = softmax(etj), j = 1…n
  • etj = vT tanh(st−1Wa + hjUa)
  • Larger weight for more related words from the encoder (one decoder step is sketched below)

[Figure: encoder RNN states h1, h2 feed the attention module, which produces ct; the decoder RNN computes st from st−1, yt−1We and ct, and emits yt]
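
A minimal NumPy sketch of one decoder step under the equations above (the shapes and random parameter values are assumptions; treating yt−1We as a precomputed embedding vector is also an assumption made for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                             # toy hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_decoder_step(s_prev, y_prev_emb, H, v, Wa, Ua, W, Wr, Wz):
    """One step of the attention decoder: compute ct, then the GRU update for st."""
    # etj = vT tanh(s_{t-1} Wa + hj Ua), one score per encoder state hj
    e = np.array([v @ np.tanh(s_prev @ Wa + h @ Ua) for h in H])
    alpha = np.exp(e) / np.exp(e).sum()           # attention weights (softmax over j)
    c = alpha @ H                                 # context vector ct = sum_j alpha_tj hj
    x = np.concatenate([s_prev, y_prev_emb, c])   # [s_{t-1}, y_{t-1}We, ct]
    r, z = sigmoid(x @ Wr), sigmoid(x @ Wz)       # reset gate rt and update gate zt
    s_tilde = np.tanh(np.concatenate([r * s_prev, y_prev_emb, c]) @ W)
    return (1 - z) * s_prev + z * s_tilde, alpha  # st and the weights that built ct

H = rng.normal(size=(3, d))                       # encoder hidden states h1, h2, h3
v = rng.normal(size=d)
Wa, Ua = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W, Wr, Wz = (rng.normal(size=(3 * d, d)) for _ in range(3))
s1, alpha = attention_decoder_step(np.zeros(d), rng.normal(size=d), H, v, Wa, Ua, W, Wr, Wz)
print(alpha, s1)
```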

SLIDE 12

Attention modelling

  • Encoder RNN
  • Input word vectors: a=[0.1,0,1], b=[1,0.1,0], c=[0.2,0.3,1]
  • Hidden representation (vector)
  • h1, h2, h3
  • h1=tanh(aU+h0W)
  • h2=tanh(bU+h1W)
  • h3=tanh(cU+h2W)

Parameters: {U, W}

SLIDE 13
  • Decoder RNN
  • Given hidden state vector s0
  • To compute the weights of h1, h2, h3 for computing s1
  • e11 = a(s0, h1), e12 = a(s0, h2), e13 = a(s0, h3)
  • a(s0, h1) = vT tanh(s0Wa + h1Ua)
  • a(s0, h2) = vT tanh(s0Wa + h2Ua)
  • a(s0, h3) = vT tanh(s0Wa + h3Ua)
  • α11 = exp(e11) / (exp(e11) + exp(e12) + exp(e13))
  • α12 = exp(e12) / (exp(e11) + exp(e12) + exp(e13))
  • α13 = exp(e13) / (exp(e11) + exp(e12) + exp(e13))
  • c1 = α11h1 + α12h2 + α13h3
  • st = (1 − zt) ∘ st−1 + zt ∘ s̃t
  • s̃t = tanh([rt ∘ st−1, yt−1We, ct] W)
  • rt = σ([st−1, yt−1We, ct] Wr)
  • zt = σ([st−1, yt−1We, ct] Wz)
  • Parameters: {v, Wa, Ua, W, Wr, Wz, We} (a numeric sketch of this worked example follows below)

[Figure: encoder states h1, h2, h3 and decoder states s0, s1, connected through the attention module that produces c1]
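
The encoder states from Slide 12 and the attention weights above can be computed directly; a NumPy sketch using the word vectors given on Slide 12 (the parameter matrices U, W, Wa, Ua and the vector v are not given on the slides, so they are randomly initialised here as an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, c = np.array([0.1, 0.0, 1.0]), np.array([1.0, 0.1, 0.0]), np.array([0.2, 0.3, 1.0])
d = 3                                              # the word vectors are 3-dimensional
U, W = rng.normal(size=(d, d)), rng.normal(size=(d, d))      # encoder parameters {U, W}
Wa, Ua, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

# Encoder: h1 = tanh(aU + h0W), h2 = tanh(bU + h1W), h3 = tanh(cU + h2W)
h0 = np.zeros(d)
h1 = np.tanh(a @ U + h0 @ W)
h2 = np.tanh(b @ U + h1 @ W)
h3 = np.tanh(c @ U + h2 @ W)

# Attention against the initial decoder state s0: e1j = vT tanh(s0Wa + hjUa)
s0 = np.zeros(d)
e = np.array([v @ np.tanh(s0 @ Wa + h @ Ua) for h in (h1, h2, h3)])
alpha = np.exp(e) / np.exp(e).sum()                # alpha11, alpha12, alpha13
c1 = alpha[0] * h1 + alpha[1] * h2 + alpha[2] * h3 # context vector c1
print(alpha, c1)
```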

SLIDE 14

Transformer

Repeat self-attention modelling to get better word embeddings
Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 15

Encoder and Decoder

Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 16

Encoder

Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 17

Self-attention

Represent a word by considering the words in the context
Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 18

Self-attention

  • Attention modelling
  • How to find the attention weights?
  • Compare each query with every key
  • How to do the weighted summation?
  • Over the values (a minimal sketch follows below)
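
A minimal single-head sketch of this query/key/value computation in NumPy (the projection matrices are randomly initialised assumptions, and the 1/√dk scaling follows the illustrated-transformer post cited on these slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_model, d_k = 4, 8, 8                          # 4 words; toy model and key dimensions
X = rng.normal(size=(n, d_model))                  # one row per word (input embeddings)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                    # query-vs-key compatibility, one row per word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over the context
Z = weights @ V                                    # weighted summation of the values
print(Z.shape)                                     # (4, 8): one context-aware vector per word
```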

SLIDE 19

Self-attention

Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 20

Multi-headed self-attention

Image source: https://jalammar.github.io/illustrated-transformer/
One row per word

SLIDE 21

Encoder

Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 22

Transformer

Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 23

Transformer

Image source: https://jalammar.github.io/illustrated-transformer/

SLIDE 24

Question answering

  • Given a context and a question
  • Output the answer, which can be
  • A word
  • A substring of the context
  • A new sentence

SLIDE 25

Example

Source from [21]

SLIDE 26

Example

Source from [22]

SLIDE 27

Example

Source from [23]

SLIDE 28

Solution [24]

  • Find the answer word from the context
  • max P(a = w | c, q)
  • Steps (a minimal sketch follows after this list)

1. Extract representations of the question and the passage (context)
2. Combine the question and context representations
  • Concatenation
  • Addition
3. Generate the prediction
  • Match the combined feature with each candidate word feature
  • E.g. inner product
  • Use the similarities as input to a softmax
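
A minimal NumPy sketch of the three steps above (simple tanh RNN encoders, combination by concatenation, matching by inner product; the random weights and the use of the passage hidden states as candidate word features are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6                                              # toy hidden size

def rnn_states(xs, U, W):
    """Simple tanh RNN; returns the hidden state after each word."""
    h, states = np.zeros(d), []
    for x in xs:
        h = np.tanh(x @ U + h @ W)
        states.append(h)
    return np.stack(states)

Uq, Wq = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # question encoder
Up, Wp = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # passage encoder
Wc = rng.normal(size=(2 * d, d))                   # projects the combined feature back to d

question = [rng.normal(size=d) for _ in range(3)]  # step 1: encode question and passage
passage  = [rng.normal(size=d) for _ in range(7)]
q = rnn_states(question, Uq, Wq)[-1]               # question representation (last state)
P = rnn_states(passage, Up, Wp)                    # one candidate feature per passage word
p = P[-1]                                          # passage representation (last state)

combined = np.concatenate([q, p]) @ Wc             # step 2: combine by concatenation (+ projection)

scores = P @ combined                              # step 3: inner product with each candidate word
probs = np.exp(scores) / np.exp(scores).sum()      # softmax over candidates: P(a = w | c, q)
print(int(probs.argmax()))                         # index of the predicted answer word in the passage
```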

[Figure: RNN encoders over the question and the passage/context, whose combined features feed a softmax over candidate answer words]

SLIDE 29

Summary

  • Seq2seq model for machine translation
  • Attention modelling
  • Transformer model
  • Question answering

SLIDE 30

Reference

  • [1] https://www.quora.com/What-are-differences-between-recurrent-neural-network-language-model-hidden-markov-model-and-n-gram-language-model
  • [2] https://code.google.com/archive/p/word2vec/
  • [3] https://nlp.stanford.edu/projects/glove/
  • [4] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber. LSTM: A Search Space Odyssey. https://arxiv.org/abs/1503.04069
  • [5] http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture8.pdf
  • [6] http://www.deeplearningbook.org/contents/applications.html (12.4.3)
  • [7] Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. 2015. arxiv.org/abs/1504.00941v2
  • [8] Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. Layer Normalization. https://arxiv.org/abs/1607.06450
  • [9] Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth. Recurrent Dropout without Memory Loss. https://arxiv.org/abs/1603.05118
  • [10] https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell
  • [11] https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html
  • [12] LSTM: A Search Space Odyssey. Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber. https://arxiv.org/abs/1503.04069
  • [13] https://github.com/karpathy/char-rnn/issues/138#issuecomment-162763435
  • https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
  • https://danijar.com/tips-for-training-recurrent-neural-networks/

SLIDE 31
  • [14] https://github.com/kjw0612/awesome-rnn#image-captioning
  • [15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, Show and Tell: A Neural Image Caption Generator, arXiv:1411.4555 / CVPR 2015
  • [16] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
  • [17] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan L. Yuille, Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), arXiv:1412.6632 / ICLR 2015

  • [18] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell, Long-term Recurrent Convolutional Networks for Visual Recognition and Description, arXiv:1411.4389 / CVPR 2015

  • [19] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, Sequence to Sequence Learning with Neural Networks, arXiv:1409.3215 / NIPS 2014
  • [20] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473 / ICLR 2015
  • [21] Karl M. Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom, Teaching Machines to Read and Comprehend, arXiv:1506.03340 / NIPS 2015

  • [22] SQuAD: 100,000+ Questions for Machine Comprehension of Text. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. https://arxiv.org/abs/1606.05250
  • [23] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Mohit Iyyer, Ishaan Gulrajani, and Richard Socher, Ask Me Anything: Dynamic Memory Networks for Natural Language Processing, arXiv:1506.07285

  • [24] Question Answering Using Deep Learning. https://cs224d.stanford.edu/reports/StrohMathur.pdf
  • [25] Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905, 2016
  • https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks
  • https://sites.google.com/site/deeplearningdialogue/references
  • https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

SLIDE 32

[Figure: attention model recap: encoder RNN states h1, h2, h3, h4; attention module producing ct; decoder RNN computing st from st−1, yt−1We and ct, and emitting yt]