Natural Language Processing with Deep Learning: LSTM, GRU




  1. Natural Language Processing with Deep Learning: LSTM, GRU, and applications in summarization and contextualized word embeddings. Navid Rekab-Saz (navid.rekabsaz@jku.at), Institute of Computational Perception

  2. Agenda • Vanishing/Exploding gradient • RNNs with Gates: LSTM, GRU • Contextualized word embeddings with RNNs • Extractive summarization with RNNs. Some slides are adapted from http://web.stanford.edu/class/cs224n/

  3. Element-wise Multiplication
§ Vectors: b ⊙ c has dimensions 1 × d ⊙ 1 × d = 1 × d, e.g. (1, 2, 3) ⊙ (3, 0, −2) = (3, 0, −6)
§ Matrices: B ⊙ C has dimensions l × m ⊙ l × m = l × m; each entry of the result is the product of the corresponding entries of B and C
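In NumPy the element-wise product is simply the * operator. A minimal sketch (the matrix values below are illustrative, not the slide's):

```python
import numpy as np

# Element-wise (Hadamard) product of two vectors of the same length
b = np.array([1.0, 2.0, 3.0])
c = np.array([3.0, 0.0, -2.0])
print(b * c)                      # [ 3.  0. -6.]  -> (1 x d) ⊙ (1 x d) = (1 x d)

# The same operation for two matrices of shape (l, m)
B = np.array([[2.0, 3.0, -1.0],
              [0.0, 1.0,  0.5]])
C = np.array([[0.0, -2.0, 2.0],
              [1.0,  2.0, -1.0]])
print(B * C)                      # (l x m) ⊙ (l x m) = (l x m)
```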

  4. Agenda • Vanishing/Exploding gradient • RNNs with Gates: LSTM, GRU • Contextualized word embeddings with RNNs • Extractive summarization with RNNs

  5. Recap: Backpropagation Through Time (BPTT) § Unrolling the RNN (simplified): the hidden states h^(1), …, h^(t) are computed one after another, all using the same weight matrix W_h, and the loss L^(t) is computed at the last step. Question: what is ∂L^(t)/∂W_h?

  6. Recap: Backpropagation Through Time (BPTT) § Concrete example: an RNN unrolled over four time steps, with hidden states h^(1), …, h^(4) sharing the same weight matrix W_h and the loss L^(4) computed at step 4. Question: what is ∂L^(4)/∂W_h?

  7. Recap: Backpropagation Through Time (BPTT)
§ The gradient of the loss at step 4 with respect to the shared weights W_h is the sum of its contributions through every time step:
∂L^(4)/∂W_h = Σ_{j=1..4} ∂L^(4)/∂W_h |_(j)
∂L^(4)/∂W_h |_(4) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂W_h
∂L^(4)/∂W_h |_(3) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂W_h
∂L^(4)/∂W_h |_(2) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂W_h
∂L^(4)/∂W_h |_(1) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂W_h
§ In general, the contribution through time step j is: ∂L^(t)/∂W_h |_(j) = ∂L^(t)/∂h^(t) · ∂h^(t)/∂h^(t−1) · … · ∂h^(j+1)/∂h^(j) · ∂h^(j)/∂W_h
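A minimal sketch of this computation with PyTorch autograd (a toy setup of my own, not part of the slides): the same W_h is reused at every step, and calling backward() on the last-step loss accumulates exactly this sum of per-step contributions into W_h.grad.

```python
import torch

torch.manual_seed(0)
d, T = 4, 5                               # hidden size, sequence length
W_h = torch.randn(d, d, requires_grad=True)
W_x = torch.randn(d, d, requires_grad=True)
b   = torch.zeros(d,   requires_grad=True)
xs  = [torch.randn(d) for _ in range(T)]  # dummy input vectors

h = torch.zeros(d)
for x in xs:                              # unroll the RNN over time
    h = torch.sigmoid(h @ W_h + x @ W_x + b)

loss = h.sum()                            # stand-in for the loss L^(T)
loss.backward()                           # backpropagation through time
print(W_h.grad.shape)                     # dL^(T)/dW_h, summed over all time steps
```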

  8. Vanishing/Exploding gradient § The gradient that reaches an early time step is a product of many Jacobians ∂h^(k)/∂h^(k−1). In practice, the gradient with respect to each time step becomes smaller and smaller as it goes back in time → Vanishing gradient § Less often, the same can also happen the other way around: the gradient with respect to further time steps becomes larger and larger → Exploding gradient

  9. Vanishing/Exploding gradient – why? § Look at the contribution through the first time step:
∂L^(4)/∂W_h |_(1) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂W_h
§ If the gradients ∂h^(k)/∂h^(k−1) are small, their product gets smaller and smaller. The further back we go, the more of these factors the final gradient contains!

  10. Vanishing/Exploding gradient – why? § What is ∂h^(t)/∂h^(t−1)? § Recall the definition of the RNN: h^(t) = σ(h^(t−1) W_h + x^(t) W_x + b) § Let's replace the sigmoid σ with a simple linear activation, σ(z) = z: h^(t) = h^(t−1) W_h + x^(t) W_x + b § In this case: ∂h^(t)/∂h^(t−1) = W_h
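A quick autograd check of this claim (my own sketch; note that with the row-vector convention used here the Jacobian comes out as the transpose of W_h):

```python
import torch

d = 4
W_h, W_x, b = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
x_t = torch.randn(d)

def step(h_prev):                     # linear-activation RNN step: sigma(z) = z
    return h_prev @ W_h + x_t @ W_x + b

h_prev = torch.randn(d)
J = torch.autograd.functional.jacobian(step, h_prev)   # J[i, j] = dh_t[i] / dh_prev[j]
print(torch.allclose(J, W_h.T))                         # True: the Jacobian is just W_h
```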

  11. Vanishing/Exploding gradient – why? § Recall the BPTT formula (for the simplified case):
∂L^(t)/∂W_h |_(j) = ∂L^(t)/∂h^(t) · ∂h^(t)/∂h^(t−1) · … · ∂h^(j+1)/∂h^(j) · ∂h^(j)/∂W_h
§ With m = t − j, the BPTT formula can be rewritten as:
∂L^(t)/∂W_h |_(j) = ∂L^(t)/∂h^(t) · (W_h)^m · ∂h^(j)/∂W_h
§ If the weights in W_h are small (i.e. the eigenvalues of W_h are smaller than 1), this term gets exponentially smaller as m grows
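A minimal numerical sketch of this effect (my own illustration, assuming the linear-activation simplification above): repeated multiplication by W_h shrinks or blows up depending on whether its largest eigenvalue magnitude is below or above 1.

```python
import numpy as np

def power_norms(W, steps=30):
    """Norms of W, W^2, ..., W^steps: the factor picked up per extra time step."""
    P = np.eye(W.shape[0])
    norms = []
    for _ in range(steps):
        P = P @ W
        norms.append(np.linalg.norm(P))
    return norms

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
radius = np.abs(np.linalg.eigvals(W)).max()     # spectral radius of W
W_small = 0.8 * W / radius                      # all eigenvalues < 1 in magnitude
W_large = 1.25 * W / radius                     # largest eigenvalue > 1 in magnitude

print(power_norms(W_small)[-1])   # small  -> gradient contributions vanish
print(power_norms(W_large)[-1])   # large  -> gradient contributions explode
```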

  12. Why is vanishing/exploding gradient a problem? § Vanishing gradient - The gradient signal from faraway steps "fades away" and becomes insignificant compared with the gradient signal from nearby steps - Long-term dependencies are not captured, since the model weights are updated only with respect to near effects → one approach to address it: RNNs with gates (LSTM, GRU) § Exploding gradient - Gradients become too big → SGD update steps become too large - This causes (large loss values and) large updates to the parameters, and eventually unstable training → main approach to address it: gradient clipping

  13. Gradient clipping § Gradient clipping: if the norm of the gradient is greater than some threshold, scale the gradient down before applying the update § Intuition: take a step in the same direction, but a smaller step
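A minimal sketch of clipping by global gradient norm (assuming a PyTorch training loop; the manual helper below mirrors the idea behind the built-in torch.nn.utils.clip_grad_norm_):

```python
import torch

def clip_gradients_(parameters, max_norm):
    """Scale all gradients down in place if their global norm exceeds max_norm."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)            # same direction, smaller step

# Typical usage inside a training loop (model/optimizer are assumed to exist):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # built-in variant
#   optimizer.step()
```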

  14. Problem with vanilla RNN – summary § It is too difficult for the hidden state of a vanilla RNN to learn and preserve information over many time steps - In particular because new content is constantly added to the hidden state in every step: h^(t) = σ(h^(t−1) W_h + x^(t) W_x + b) - In every step, the input vector "adds" new content to the hidden state

  15. Agenda • Vanishing & Exploding gradient • RNNs with Gates: LSTM, GRU • Contextualized word embeddings with LSTM • Extractive summarization with LSTM

  16. Gate vector § Gate vector: - A vector with values between 0 and 1 - A gate vector acts as a "gate-keeper": it controls the content flow of another vector § Gate vectors are typically defined using a sigmoid: g = σ(some vector) … and are applied to a vector v with element-wise multiplication to control its contents: g ⊙ v § For each element (feature) j of the vectors: - If g_j is 1 → v_j remains the same; everything passes; open gate! - If g_j is 0 → v_j becomes 0; nothing passes; closed gate!
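A minimal NumPy sketch of a gate vector (the numbers are hypothetical): a sigmoid squashes some pre-activation into values between 0 and 1, and the resulting gate scales another vector element-wise, deciding how much of each feature passes through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gate_logits = np.array([10.0, -10.0, 0.0])   # hypothetical "some vector"
g = sigmoid(gate_logits)                     # ~[1.0, 0.0, 0.5]
v = np.array([2.0, 5.0, 4.0])
print(g * v)                                 # ~[2.0, 0.0, 2.0]: open, closed, half-open gate
```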

  17. Long Short-Term Memory (LSTM) § Proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem § LSTM uses a new vector, the cell state c^(t), to carry the memory of previous states - The cell state stores long-term information - As in the vanilla RNN, the hidden state h^(t) is used as the output vector § LSTM controls the process of reading, writing, and erasing information in/from the memory state - These controls are done using gate vectors - Gates are dynamic and defined based on the input vector and the hidden state
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation (1997)

  18. LSTM – unrolled [Figure: the LSTM cell unrolled over time; at each step t the cell receives the input x^(t), the previous hidden state h^(t−1), and the previous cell state c^(t−1), and produces the new hidden state h^(t) and cell state c^(t)]
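In practice this unrolling is rarely written by hand. A minimal sketch with PyTorch's built-in torch.nn.LSTM, which consumes a whole sequence and returns the hidden state at every step plus the final hidden and cell state (the sizes below are arbitrary):

```python
import torch

d_x, d_h, T = 16, 32, 5
lstm = torch.nn.LSTM(input_size=d_x, hidden_size=d_h, batch_first=True)

x = torch.randn(1, T, d_x)             # one sequence of T input vectors
outputs, (h_T, c_T) = lstm(x)          # outputs: hidden state h^(t) at every step
print(outputs.shape)                   # (1, T, d_h)
print(h_T.shape, c_T.shape)            # (1, 1, d_h) each: final hidden and cell state
```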

  19. LSTM definition – gates § Gates are functions of the input vector x^(t) and the previous hidden state h^(t−1):
- input gate: i^(t) = σ(h^(t−1) W_hi + x^(t) W_xi + b_i); controls what parts of the new cell content are written to the cell
- forget gate: f^(t) = σ(h^(t−1) W_hf + x^(t) W_xf + b_f); controls what is kept vs. forgotten from the previous cell state
- output gate: o^(t) = σ(h^(t−1) W_ho + x^(t) W_xo + b_o); controls what parts of the cell are output to the hidden state
§ The weight matrices W and biases b are the model parameters

  20. LSTM definition – states
- new cell content: c̃^(t) = tanh(h^(t−1) W_hc + x^(t) W_xc + b_c); the new content to be used for the cell and hidden (output) state
- cell state: c^(t) = f^(t) ⊙ c^(t−1) + i^(t) ⊙ c̃^(t); erases ("forgets") some content from the last cell state and writes ("inputs") some new cell content
- hidden state: h^(t) = o^(t) ⊙ tanh(c^(t)); reads ("outputs") some content from the current cell state
§ The weight matrices W and biases b are the model parameters

  21. LSTM definition – all together
- input gate: i^(t) = σ(h^(t−1) W_hi + x^(t) W_xi + b_i); controls what parts of the new cell content are written to the cell
- forget gate: f^(t) = σ(h^(t−1) W_hf + x^(t) W_xf + b_f); controls what is kept vs. forgotten from the previous cell state
- output gate: o^(t) = σ(h^(t−1) W_ho + x^(t) W_xo + b_o); controls what parts of the cell are output to the hidden state
- new cell content: c̃^(t) = tanh(h^(t−1) W_hc + x^(t) W_xc + b_c); the new content to be used for the cell and hidden (output) state
- cell state: c^(t) = f^(t) ⊙ c^(t−1) + i^(t) ⊙ c̃^(t); erases ("forgets") some content from the last cell state and writes ("inputs") some new cell content
- hidden state: h^(t) = o^(t) ⊙ tanh(c^(t)); reads ("outputs") some content from the current cell state
§ The weight matrices W and biases b are the model parameters
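Putting the equations together, a minimal NumPy sketch of one LSTM step (row-vector convention as above; the parameter names, shapes, and initialisation are my own choices, not the lecture's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, p):
    i = sigmoid(h_prev @ p["W_hi"] + x @ p["W_xi"] + p["b_i"])       # input gate
    f = sigmoid(h_prev @ p["W_hf"] + x @ p["W_xf"] + p["b_f"])       # forget gate
    o = sigmoid(h_prev @ p["W_ho"] + x @ p["W_xo"] + p["b_o"])       # output gate
    c_tilde = np.tanh(h_prev @ p["W_hc"] + x @ p["W_xc"] + p["b_c"]) # new cell content
    c = f * c_prev + i * c_tilde      # erase some old content, write some new content
    h = o * np.tanh(c)                # read some content out of the cell state
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 4, 8
p = {}
for k in "ifoc":
    p[f"W_h{k}"] = 0.1 * rng.standard_normal((d_h, d_h))
    p[f"W_x{k}"] = 0.1 * rng.standard_normal((d_x, d_h))
    p[f"b_{k}"] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((6, d_x)):   # run the cell over a short input sequence
    h, c = lstm_step(h, c, x, p)
print(h.shape, c.shape)                   # (8,) (8,)
```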

  22. LSTM definition – visually! http://colah.github.io/posts/2015-08-Understanding-LSTMs/
