CSEP 517: Natural Language Processing - Recurrent Neural Networks

  1. CSEP 517: Natural Language Processing - Recurrent Neural Networks. Autumn 2018. Luke Zettlemoyer, University of Washington. [Most slides from Yejin Choi]

  2. RECURRENT NEURAL NETWORKS

  3. Recurrent Neural Networks (RNNs)
     • Each input "word" is a vector.
     • Each RNN unit computes a new hidden state from the previous state and a new input: h_t = f(x_t, h_{t-1})
     • Each RNN unit (optionally) produces an output from the current hidden state: y_t = softmax(V h_t)
     • Hidden states are continuous vectors h_t ∈ R^D; they can represent very rich information, as a function of the entire history.
     • Parameters are shared (tied) across all RNN units (unlike feedforward NNs).
     [Figure: RNN unrolled over time; hidden states h_1…h_4 over inputs x_1…x_4]

  4. Softmax
     • Turns a vector of real numbers x into a probability distribution: softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
     • We have seen this trick before: log-linear models.
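A minimal NumPy sketch of the softmax described above; subtracting the max is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(x):
    # Softmax is invariant to adding a constant to every entry,
    # so subtract the max to avoid overflow in exp().
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # roughly [0.659, 0.242, 0.099]; sums to 1
```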

  5. Recurrent Neural Networks (RNNs)
     • Generic RNN: h_t = f(x_t, h_{t-1}),  y_t = softmax(V h_t)
     • Vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b),  y_t = softmax(V h_t)
     [Figure: RNN unrolled over time; hidden states h_1…h_4 over inputs x_1…x_4]
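A sketch of the vanilla RNN equations in NumPy. The weight shapes are assumptions, and the `softmax` helper from the previous sketch is assumed to be in scope; the point is that the same U, W, b, V are reused at every time step.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    # One vanilla RNN unit: h_t = tanh(U x_t + W h_{t-1} + b)
    return np.tanh(U @ x_t + W @ h_prev + b)

def rnn_forward(xs, U, W, b, V):
    # Unrolled RNN: the same parameters are applied at every step.
    h = np.zeros(W.shape[0])          # h_0 = 0
    ys = []
    for x_t in xs:
        h = rnn_step(x_t, h, U, W, b)
        ys.append(softmax(V @ h))     # y_t = softmax(V h_t)
    return ys, h
```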

  6. Sigmoid
     • σ(x) = 1 / (1 + e^{-x}),  σ'(x) = σ(x)(1 − σ(x))
     • Often used for gates.
     • Pro: neuron-like, differentiable.
     • Con: gradients saturate to zero almost everywhere except near x = 0, leading to vanishing gradients.
     • Batch normalization helps.

  7. Tanh
     • tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}),  tanh'(x) = 1 − tanh²(x),  tanh(x) = 2σ(2x) − 1
     • Often used for hidden states and cells in RNNs and LSTMs.
     • Pro: differentiable; often converges faster than sigmoid.
     • Con: gradients easily saturate to zero, leading to vanishing gradients.
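A small sketch of the two activations and their derivatives, illustrating the saturation behind the vanishing-gradient remarks above (the probe points are chosen only for illustration).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # sigma'(x) = sigma(x)(1 - sigma(x))

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh^2(x)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, dsigmoid(x), dtanh(x))
# Both derivatives shrink rapidly away from 0 (e.g. dsigmoid(5) ~ 0.007,
# dtanh(5) ~ 2e-4): the saturation that causes vanishing gradients.
```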

  8. Many uses of RNNs: 1. Classification (seq to one)
     • Input: a sequence. Output: one label (classification).
     • Example: sentiment classification.
     • h_t = f(x_t, h_{t-1}),  y = softmax(V h_n)  (the label is predicted from the final hidden state)
     [Figure: RNN unrolled over inputs x_1…x_4 with a single output at the end]
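A sequence-to-one sketch under the same assumed vanilla-RNN parameterization (and the `softmax` helper from the slide-4 sketch): only the final hidden state h_n feeds the output softmax.

```python
import numpy as np

def classify_sequence(xs, U, W, b, V):
    # Read the whole input, then predict one label: y = softmax(V h_n).
    h = np.zeros(W.shape[0])
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h + b)
    return softmax(V @ h)   # distribution over labels, e.g. sentiment classes
```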

  9. Many uses of RNNs: 2. One to seq
     • Input: one item. Output: a sequence.
     • h_t = f(x_t, h_{t-1}),  y_t = softmax(V h_t)
     • Example: image captioning ("Cat sitting on top of …")
     [Figure: a single input expanded into a generated output sequence]

  10. Many uses of RNNs: 3. Sequence tagging
     • Input: a sequence. Output: a sequence (of the same length).
     • Examples: POS tagging, Named Entity Recognition.
     • How about language models? Yes! RNNs can be used as LMs!
     • Do RNNs make the Markov assumption: T/F?
     • h_t = f(x_t, h_{t-1}),  y_t = softmax(V h_t)
     [Figure: RNN unrolled over inputs x_1…x_4 with one output per position]

  11. Many uses of RNNs: 4. Language models
     • Input: a sequence of words. Output: the next word (or a sequence of next words, if repeated).
     • h_t = f(x_t, h_{t-1}),  y_t = softmax(V h_t)
     • During training, x_t and y_{t-1} are the same word (the gold next word is fed back as the next input).
     • During testing, x_t is sampled from the softmax distribution y_{t-1}.
     • Do RNN LMs make the Markov assumption, i.e., does the next word depend only on the previous N words?
     [Figure: RNN unrolled over inputs x_1…x_4 with one output per position]
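A sketch of test-time generation from an RNN LM, matching the bullet above: the word sampled at step t−1 becomes the input at step t, and no fixed-N window is imposed since h_t summarizes the whole history. The embedding table `embed` and the `bos_id`/`eos_id` symbols are hypothetical, and the `softmax` helper from the slide-4 sketch is assumed.

```python
import numpy as np

def sample_from_rnn_lm(U, W, b, V, embed, bos_id, eos_id, max_len=20):
    h = np.zeros(W.shape[0])
    word, out = bos_id, []
    for _ in range(max_len):
        x_t = embed[word]                        # embedding of the previous word
        h = np.tanh(U @ x_t + W @ h + b)         # vanilla-RNN update
        probs = softmax(V @ h)                   # distribution over the vocabulary
        word = np.random.choice(len(probs), p=probs)
        if word == eos_id:
            break
        out.append(word)
    return out
```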

  12. Many uses of RNNs: 5. seq2seq (aka "encoder-decoder")
     • Input: a sequence. Output: a sequence (of a different length).
     • Examples?
     • h_t = f(x_t, h_{t-1}),  y_t = softmax(V h_t)
     [Figure: encoder RNN over the input followed by a decoder RNN generating the output]
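A greedy encoder-decoder sketch under the same assumptions (separate encoder and decoder weights, a hypothetical `embed` table and `bos_id`/`eos_id` symbols): the encoder's final hidden state initializes the decoder, so the output length can differ from the input length.

```python
import numpy as np

def seq2seq_greedy(src, enc, dec, V, embed, bos_id, eos_id, max_len=30):
    # enc and dec are (U, W, b) triples for the encoder and decoder RNNs.
    U_e, W_e, b_e = enc
    U_d, W_d, b_d = dec
    # Encoder: read the whole source sequence into one hidden state.
    h = np.zeros(W_e.shape[0])
    for x_t in src:
        h = np.tanh(U_e @ x_t + W_e @ h + b_e)
    # Decoder: start from the encoder state, feed each prediction back in.
    word, out = bos_id, []
    for _ in range(max_len):
        h = np.tanh(U_d @ embed[word] + W_d @ h + b_d)
        word = int(np.argmax(V @ h))   # greedy: argmax of logits = argmax of softmax
        if word == eos_id:
            break
        out.append(word)
    return out
```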

  13. Many uses of RNNs: 5. seq2seq (aka "encoder-decoder")
     • Parsing! "Grammar as a Foreign Language" (Vinyals et al., 2015)
     • Example input: "John has a dog"
     [Figure: encoder-decoder RNN producing a linearized parse of the input sentence]

  14. Recurrent Neural Networks (RNNs)
     • Generic RNN: h_t = f(x_t, h_{t-1}),  y_t = softmax(V h_t)
     • Vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b),  y_t = softmax(V h_t)
     [Figure: RNN unrolled over time; hidden states h_1…h_4 over inputs x_1…x_4]

  15. Vanishing gradient problem for RNNs
     • The shading of the nodes in the unfolded network indicates their sensitivity to the input at time one (the darker the shade, the greater the sensitivity).
     • The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.
     • Example from Graves (2012).

  16. Recurrent Neural Networks (RNNs)
     • Generic RNN: h_t = f(x_t, h_{t-1})
     • Vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b)
     • LSTMs (Long Short-Term Memory networks):
         i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
         f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
         o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
         c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
         c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
         h_t = o_t ⊙ tanh(c_t)
       (⊙ denotes element-wise multiplication; c_t is the cell state, h_t the hidden state.)
     • There are many known variations of this set of equations!
     [Figure: LSTM unrolled over time, showing cell states c_t and hidden states h_t]
     (A code sketch of one LSTM step follows the recap on slide 22.)

  17. LSTMs (Long Short-Term Memory Networks)
     [Figure: LSTM cell diagram with inputs x_{t-1}, x_t and hidden states h_{t-1}, h_t. Figure by Christopher Olah (colah.github.io)]

  18. LSTMs (Long Short-Term Memory Networks)
     • Forget gate: forget the past or not (sigmoid: [0,1])
         f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
     [Figure by Christopher Olah (colah.github.io)]

  19. LSTMs (Long Short-Term Memory Networks)
     • Forget gate: forget the past or not (sigmoid: [0,1])
         f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
     • Input gate: use the input or not (sigmoid: [0,1])
         i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
     • New cell content (temp) (tanh: [-1,1]):
         c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
     [Figure by Christopher Olah (colah.github.io)]

  20. LSTMs (Long Short-Term Memory Networks)
     • Forget gate: forget the past or not (sigmoid: [0,1])
         f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
     • Input gate: use the input or not (sigmoid: [0,1])
         i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
     • New cell content (temp) (tanh: [-1,1]):
         c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
     • New cell state: mix the old cell with the new temp cell
         c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
     [Figure by Christopher Olah (colah.github.io)]

  21. LSTMs (Long Short-Term Memory Networks)
     • Forget gate: forget the past or not
         f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
     • Input gate: use the input or not
         i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
     • New cell content (temp):
         c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
     • New cell state: mix the old cell with the new temp cell
         c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
     • Output gate: output from the new cell or not
         o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
     • Hidden state:
         h_t = o_t ⊙ tanh(c_t)
     [Figure by Christopher Olah (colah.github.io)]

  22. LSTMs (Long Short-Term Memory Networks)
     • Forget gate: forget the past or not
         f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
     • Input gate: use the input or not
         i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
     • Output gate: output from the new cell or not
         o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
     • New cell content (temp):
         c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
     • New cell state: mix the old cell with the new temp cell
         c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
     • Hidden state:
         h_t = o_t ⊙ tanh(c_t)
     [Figure: LSTM cell with inputs x_{t-1}, x_t and hidden states h_{t-1}, h_t]
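A sketch of one LSTM step in NumPy, transcribing the equations recapped above. The dict-of-parameters layout is an assumption for readability, and the `sigmoid` helper from the slide-6 sketch is assumed to be in scope.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, params):
    # params maps "i", "f", "o", "c" to (U, W, b) triples.
    U_i, W_i, b_i = params["i"]
    U_f, W_f, b_f = params["f"]
    U_o, W_o, b_o = params["o"]
    U_c, W_c, b_c = params["c"]

    i_t = sigmoid(U_i @ x_t + W_i @ h_prev + b_i)      # input gate
    f_t = sigmoid(U_f @ x_t + W_f @ h_prev + b_f)      # forget gate
    o_t = sigmoid(U_o @ x_t + W_o @ h_prev + b_o)      # output gate
    c_tilde = np.tanh(U_c @ x_t + W_c @ h_prev + b_c)  # new cell content (temp)

    c_t = f_t * c_prev + i_t * c_tilde                 # mix old cell with new content
    h_t = o_t * np.tanh(c_t)                           # hidden state
    return h_t, c_t
```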

  23. Preservation of gradient information by LSTM
     • For simplicity, all gates are either entirely open ('O') or closed ('-').
     • The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
     • The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.
     • Example from Graves (2012). [Figure: unrolled LSTM showing output, forget, and input gate states over time]

  24. Gates
     • Gates contextually control information flow.
     • They open and close via a sigmoid.
     • In LSTMs, they are used to (contextually) maintain longer-term history.

  25. RNN Learning: Backprop Through Time (BPTT)
     • Similar to backprop with non-recurrent NNs.
     • But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters.
     • Backprop gradients to the parameters of each unit as if they were different parameters.
     • When updating the parameters using the gradients, use the average of the gradients throughout the entire chain of units.
     [Figure: RNN unrolled over time; hidden states h_1…h_4 over inputs x_1…x_4]
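A toy scalar BPTT sketch illustrating the bullets above: each unrolled copy of the unit contributes its own gradient for the shared weight w, and the contributions are then combined. The slide combines them by averaging; the sketch returns the sum, which differs only by a 1/T factor that the learning rate can absorb.

```python
import numpy as np

def bptt_scalar(xs, u, w, target):
    # Toy scalar RNN: h_t = tanh(u*x_t + w*h_{t-1}), loss = 0.5*(h_T - target)^2.
    hs = [0.0]                                # h_0 = 0
    for x in xs:                              # forward pass
        hs.append(np.tanh(u * x + w * hs[-1]))
    dL_dh = hs[-1] - target                   # dL/dh_T
    grads_w = []
    for t in range(len(xs), 0, -1):           # backward pass through the unrolled copies
        dpre = dL_dh * (1.0 - hs[t] ** 2)     # gradient through tanh at step t
        grads_w.append(dpre * hs[t - 1])      # this copy's contribution to dL/dw
        dL_dh = dpre * w                      # pass the gradient back to h_{t-1}
    return sum(grads_w), grads_w              # combined gradient, per-copy contributions
```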

  26. Vanishing / Exploding Gradients
     • Deep networks are hard to train.
     • Gradients go through multiple layers.
     • The multiplicative effect tends to lead to exploding or vanishing gradients.
     • Practical solutions exist w.r.t. network architecture and numerical operations.

  27. Vanishing / Exploding Gradients
     • Practical solutions w.r.t. numerical operations:
       – Gradient clipping: bound gradients by a max value.
       – Gradient normalization: renormalize gradients when they exceed a fixed norm.
       – Careful initialization, smaller learning rates.
       – Avoid saturating nonlinearities (like tanh, sigmoid); use ReLU or hard-tanh instead.
       – Batch normalization: add intermediate input-normalization layers.
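A sketch of the first two numerical tricks listed above, with illustrative thresholds. In practice one usually picks either value clipping or norm rescaling rather than applying both.

```python
import numpy as np

def clip_and_rescale(grads, clip_value=5.0, max_norm=1.0):
    # Gradient clipping: bound each entry by a maximum absolute value.
    clipped = [np.clip(g, -clip_value, clip_value) for g in grads]
    # Gradient normalization: rescale the whole gradient if its norm is too large.
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
    if total_norm > max_norm:
        clipped = [g * (max_norm / total_norm) for g in clipped]
    return clipped
```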

  28. Sneak peek: Bi-directional RNNs
     • Can incorporate context from both directions.
     • Generally improve over uni-directional RNNs.
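A bi-directional RNN sketch under the same assumed vanilla-RNN parameterization: one pass runs left-to-right, another right-to-left, and the two hidden states are concatenated at each position so every position sees context from both directions.

```python
import numpy as np

def birnn(xs, fwd_params, bwd_params):
    # fwd_params and bwd_params are (U, W, b) triples for the two directions.
    def run(seq, params):
        U, W, b = params
        h, hs = np.zeros(W.shape[0]), []
        for x_t in seq:
            h = np.tanh(U @ x_t + W @ h + b)
            hs.append(h)
        return hs

    h_fwd = run(xs, fwd_params)                 # left-to-right pass
    h_bwd = run(xs[::-1], bwd_params)[::-1]     # right-to-left pass, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```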

  29. RNNs make great LMs! https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/
