

  1. Recurrent Neural Networks LING572 Advanced Statistical Methods for NLP March 5 2020 1

  2. Outline ● Word representations and MLPs for NLP tasks ● Recurrent neural networks for sequences ● Fancier RNNs ● Vanishing/exploding gradients ● LSTMs (Long Short-Term Memory) ● Variants ● Seq2seq architecture ● Attention 2

  3. MLPs for text classification 3

  4. Word Representations ● Traditionally: words are discrete features ● e.g. curWord=“class” ● As vectors: one-hot encoding ● Each vector is |V|-dimensional, where V is the vocabulary ● Each dimension corresponds to one word of the vocabulary ● A 1 for the current word; 0 everywhere else ● w_1 = [1 0 0 ⋯ 0], w_3 = [0 0 1 ⋯ 0] 4
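A minimal sketch of one-hot encoding (not from the slides; the toy vocabulary and the `one_hot` helper are illustrative):

```python
import numpy as np

# Toy vocabulary (illustrative; not from the slides).
vocab = ["class", "is", "interesting", "the", "keys"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector: 1 at the word's index, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("class"))   # [1. 0. 0. 0. 0.]
```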

  5. Word Embeddings ● Problem 1: every word is equally different from every other ● All words are orthogonal to each other ● Problem 2: very high dimensionality ● Solution: move words into a dense, lower-dimensional space ● Grouping similar words close to each other ● These denser representations are called embeddings 5

  6. Word Embeddings ● Formally, a d-dimensional embedding is a matrix E with shape (|V|, d) ● Each row is the vector for one word in the vocabulary ● Matrix-multiplying a one-hot vector by E returns the corresponding row, i.e. the right word vector ● Trained on prediction tasks (see LING571 slides) ● Continuous bag of words ● Skip-gram ● … ● Can be trained on a specific task, or downloaded pre-trained (e.g. GloVe, fastText) ● Fancier versions now deal with OOV: sub-word units (e.g. BPE), character CNN/LSTM 6
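A small NumPy sketch of the lookup fact above (all sizes are toy values, not from the slides): multiplying a one-hot vector by E is the same as selecting the corresponding row of E.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))      # embedding matrix: one row per word

one_hot = np.zeros(V)
one_hot[2] = 1.0                 # one-hot vector for word index 2

via_matmul = one_hot @ E         # matrix multiplication by the one-hot vector...
via_lookup = E[2]                # ...is just a row lookup in practice

assert np.allclose(via_matmul, via_lookup)
```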

  7. Relationships via Offsets [figure: vector offsets connect MAN→WOMAN, UNCLE→AUNT, KING→QUEEN] Mikolov et al 2013b 7

  8. Relationships via Offsets [figure: the same offsets also connect the plurals, KINGS→QUEENS] Mikolov et al 2013b 7
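The offset idea can be made concrete with a short sketch (not from the slides): assuming a dictionary `vecs` mapping words to pretrained vectors (e.g. loaded from GloVe), the analogy a : b :: c : ? is answered by the nearest neighbor of b − a + c under cosine similarity. The names `vecs`, `cosine`, and `analogy` are illustrative.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vecs, a, b, c, topn=1):
    """Return word(s) d such that a : b :: c : d, via the offset b - a + c."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: -cosine(vecs[w], target))[:topn]

# e.g. analogy(vecs, "man", "king", "woman") is expected to rank "queen" highly
# when `vecs` holds pretrained embeddings such as GloVe vectors.
```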

  9. One More Example Mikolov et al 2013c 8

  10. One More Example 9

  11. Caveat Emptor Linzen 2016, a.o. 10

  12. Example MLP for Language Modeling (Bengio et al 2003) ● w_t: one-hot vector ● embeddings = concat(C w_{t−1}, C w_{t−2}, …, C w_{t−(n+1)}) ● hidden = tanh(W_1 embeddings + b_1) ● probabilities = softmax(W_2 hidden + b_2) 11
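A minimal NumPy sketch of that forward pass (not from the slides): parameter names follow the equations above, while the sizes (|V| = 10, d = 4, a two-word context, hidden size 8) and the `lm_probs` helper are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, context, hidden_dim = 10, 4, 2, 8          # toy sizes
C  = rng.normal(size=(V, d))                     # embedding matrix
W1 = rng.normal(size=(hidden_dim, context * d)); b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(V, hidden_dim));           b2 = np.zeros(V)

def lm_probs(prev_word_ids):
    """P(w_t | previous context words), following the slide's equations."""
    embeddings = np.concatenate([C[i] for i in prev_word_ids])   # concat(C w_{t-1}, ...)
    hidden = np.tanh(W1 @ embeddings + b1)
    return softmax(W2 @ hidden + b2)

print(lm_probs([3, 7]).shape)   # (10,): a distribution over the vocabulary
```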

  17. Example MLP for sentiment classification ● Issue: texts have different lengths ● One solution: average (or sum, or …) all the embeddings, which all have the same dimension ● IMDB accuracy: Deep Averaging Network (Iyyer et al 2015) 89.4; NB-SVM (Wang and Manning 2012) 91.2 12
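For intuition, a minimal sketch of the averaging idea with a single logistic layer on top (the actual Deep Averaging Network has additional hidden layers and is trained end to end; sizes and names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))          # toy word embeddings: |V| = 10, d = 4
W = rng.normal(size=4); b = 0.0       # logistic layer on top of the average

def classify(word_ids):
    """Average the word embeddings, then classify the fixed-size vector."""
    avg = E[word_ids].mean(axis=0)    # same dimension regardless of text length
    return sigmoid(W @ avg + b)       # P(positive sentiment)

print(classify([1, 5, 2, 2, 7]))      # works for any sequence length
```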

  18. Recurrent Neural Networks 13

  19. RNNs: high-level ● Feed-forward networks: fixed-size input, fixed-size output ● Previous classifier: average embeddings of words ● Other solutions: n-gram assumption (i.e. fixed-size context of word embeddings) ● RNNs process sequences of vectors ● Maintaining “hidden” state ● Applying the same operation at each step ● Different RNNs: ● Different operations at each step ● Operation also called “recurrent cell” ● Other architectural considerations (e.g. depth; bidirectionality) 14

  23. RNNs ● h_t = f(x_t, h_{t−1}) ● Simple/“Vanilla” RNN: h_t = tanh(W_x x_t + W_h h_{t−1} + b) ● [figure: an RNN unrolled over the input “This class … interesting”, with a Linear + softmax output at each step] Steinert-Threlkeld and Szymanik 2019; Olah 2015 15
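A minimal NumPy sketch of the vanilla RNN update, applied step by step over a toy sequence (the dimensions, input 4 and hidden 3, and the `rnn_step` name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 4))
Wh = rng.normal(size=(3, 3))
b  = np.zeros(3)

def rnn_step(x_t, h_prev):
    """One step of a vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

xs = rng.normal(size=(5, 4))           # a sequence of 5 input vectors
h = np.zeros(3)                        # initial hidden state
hidden_states = []
for x_t in xs:                         # the same cell is applied at every step
    h = rnn_step(x_t, h)
    hidden_states.append(h)
```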

  28. Using RNNs ● e.g. POS tagging: use the hidden state at every step ● e.g. text classification: feed the final hidden state into an MLP ● seq2seq (later) 16
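A sketch of those two output patterns on top of already-computed hidden states (the shapes, random stand-in states, and head names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_tags, num_classes, seq_len = 3, 5, 2, 7
hidden_states = rng.normal(size=(seq_len, hidden_dim))   # stand-in for RNN states

# Tagging: apply the same linear head to every hidden state.
W_tag = rng.normal(size=(num_tags, hidden_dim))
tag_scores = hidden_states @ W_tag.T                     # one score vector per token

# Classification: feed only the final hidden state into a classifier head.
W_cls = rng.normal(size=(num_classes, hidden_dim))
class_scores = W_cls @ hidden_states[-1]                 # one score vector per text
```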

  29. Training: BPTT ● “Unroll” the network across time-steps ● Apply backprop to the “wide” network ● Each cell has the same parameters ● When updating parameters using the gradients, take the average across the time steps 17
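The unrolling can be seen directly with autograd; a hedged PyTorch sketch (toy dimensions, invented names) that runs a vanilla RNN in a Python loop and backpropagates through the whole unrolled graph. Autograd accumulates each time step's contribution to the shared parameters' gradients, which can then be scaled as the slide describes.

```python
import torch

torch.manual_seed(0)
Wx = torch.randn(3, 4, requires_grad=True)   # same parameters reused at every step
Wh = torch.randn(3, 3, requires_grad=True)
b  = torch.zeros(3,  requires_grad=True)

xs = torch.randn(5, 4)        # a sequence of 5 inputs
h  = torch.zeros(3)
for x_t in xs:                # "unrolling" the network across time steps
    h = torch.tanh(Wx @ x_t + Wh @ h + b)

loss = h.sum()                # stand-in for a real loss on the final state
loss.backward()               # gradients accumulate across all time steps
print(Wx.grad.shape)          # torch.Size([3, 4])
```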

  30. Fancier RNNs 18

  31. Vanishing/Exploding Gradients Problem ● BPTT with vanilla RNNs faces a major problem: ● The gradients can vanish (approach 0) across time ● This makes it hard/impossible to learn long-distance dependencies, which are rampant in natural language 19

  32. Vanishing Gradients ● [figure (source): the gradient at t=4 reaches t=1 through a chain of per-step factors] ● If these factors are small (depends on W), the gradient signal from t=4 back to t=1 will be very small 20
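A toy illustration (not from the slides) of that shrinking chain: pushing a gradient back through repeated multiplication by a recurrent weight matrix whose norm is below 1. Real recurrent matrices are dense, and the tanh derivative only shrinks the signal further, but the geometric decay is the same.

```python
import numpy as np

Wh = 0.5 * np.eye(3)                 # toy recurrent weights with norm 0.5

grad = np.ones(3)                    # gradient arriving at the last time step
for step in range(1, 21):            # push it back 20 time steps
    grad = Wh.T @ grad               # one factor multiplied in per step
    if step % 5 == 0:
        print(step, np.linalg.norm(grad))
# norms: ~5.4e-2, 1.7e-3, 5.3e-5, 1.7e-6 -> early steps receive almost no signal
```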

  33. Vanishing Gradient Problem source 21

  34. Vanishing Gradient Problem Graves 2012 22

  35. Vanishing Gradient Problem ● Gradient measures the effect of the past on the future ● If it vanishes between t and t+n, can’t tell if: ● There’s no dependency in fact ● The weights in our network just haven’t yet captured the dependency 23

  36. The need for long-distance dependencies ● Language modeling (fill-in-the-blank) ● The keys ____ ● The keys on the table ____ ● The keys next to the book on top of the table ____ ● To get the number on the verb, need to look at the subject, which can be very far away ● And number can disagree with linearly-close nouns ● Need models that can capture long-range dependencies like this. Vanishing gradients means vanilla RNNs will have difficulty. 24

  37. Long Short-Term Memory (LSTM) 25

  38. LSTMs ● Long Short-Term Memory (Hochreiter and Schmidhuber 1997) ● The gold standard / default RNN ● If someone says “RNN” now, they almost always mean “LSTM” ● Originally: to solve the vanishing/exploding gradient problem for RNNs ● Vanilla: re-writes the entire hidden state at every time-step ● LSTM: separate hidden state and memory ● Read, write to/from memory; can preserve long-term information 26

  39. LSTMs ● Key innovation: c_t, h_t = f(x_t, c_{t−1}, h_{t−1}) ● c_t: a memory cell ● Reading/writing (smooth) controlled by gates: f_t (forget gate), i_t (input gate), o_t (output gate) ● f_t = σ(W_f · [h_{t−1}, x_t] + b_f) ● i_t = σ(W_i · [h_{t−1}, x_t] + b_i) ● ĉ_t = tanh(W_c · [h_{t−1}, x_t] + b_c) ● c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t ● o_t = σ(W_o · [h_{t−1}, x_t] + b_o) ● h_t = o_t ⊙ tanh(c_t) 27
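A minimal NumPy sketch of one LSTM step following those equations, with the usual concatenation [h_{t−1}, x_t]; the dimensions (input 4, hidden 3) and the `lstm_step` name are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the slide's equations."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)                    # forget gate
    i = sigmoid(Wi @ z + bi)                    # input gate
    c_hat = np.tanh(Wc @ z + bc)                # candidate values
    c = f * c_prev + i * c_hat                  # erase, then write to memory
    o = sigmoid(Wo @ z + bo)                    # output gate
    h = o * np.tanh(c)                          # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                # toy sizes
params = []
for _ in range(4):                              # weights/biases for f, i, c, o
    params += [rng.normal(size=(d_h, d_h + d_in)), np.zeros(d_h)]

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):          # run over a toy sequence
    h, c = lstm_step(x_t, h, c, params)
```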

  42. LSTMs ● f_t ∈ [0,1]^m: which cells to forget ● i_t ∈ [0,1]^m: which cells to write to ● o_t ∈ [0,1]^m: which cells to output ● ĉ_t: “candidate” / new values ● Element-wise multiplication (⊙): 0 = erase, 1 = retain ● Adding the new values to memory: c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t 28 Steinert-Threlkeld and Szymanik 2019; Olah 2015
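In practice one rarely writes the cell by hand; a short PyTorch sketch (sizes are illustrative) using the built-in LSTM, whose outputs line up with the two usage patterns from the “Using RNNs” slide: per-step hidden states for tagging, and the final (h, c) for classification.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)                 # batch of 2 sequences, 5 steps, 4-dim inputs
outputs, (h_n, c_n) = lstm(x)            # hidden state at every step, plus final (h, c)

print(outputs.shape)                     # torch.Size([2, 5, 3]): per-step h_t (e.g. tagging)
print(h_n.shape, c_n.shape)              # torch.Size([1, 2, 3]) each: final h and memory c
```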
