

1. CS546: Machine Learning in NLP (Spring 2020)
http://courses.engr.illinois.edu/cs546/
Lecture 6: RNN wrap-up
Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center
Office hours: Monday, 11am–12:30pm

2. Today’s class: RNN architectures
RNNs are among the workhorses of neural NLP:
— Basic RNNs are rarely used
— LSTMs and GRUs are commonly used
What’s the difference between these variants?
RNN odds and ends:
— Character RNNs
— Attention mechanisms (LSTMs/GRUs)

3. Character RNNs and BPE
Character RNNs:
— Each input element is one character: ‘t’, ‘h’, ‘e’, …
— Can be used to replace word embeddings, or to compute embeddings for rare/unknown words (in languages with an alphabet, like English…)
— See e.g. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
— (In Chinese, RNNs can be used directly on characters without word segmentation; the equivalent of “character RNNs” might be models that decompose characters into radicals/strokes.)
Byte Pair Encoding (BPE):
— Learn which character sequences are common in the language (‘ing’, ‘pre’, ‘at’, …)
— Split the input into these sequences and learn embeddings for them (see the sketch below)
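
A minimal sketch of the BPE merge-learning loop, assuming a toy word-frequency dictionary; the words and the number of merges here are only illustrative, and real implementations follow Sennrich et al.'s reference code:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with the merged symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words are space-separated characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair becomes a new symbol
    vocab = merge_pair(best, vocab)
    print('merged', best)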

4. Attention mechanisms
Compute a probability distribution α^(t) = (α_{t1}, …, α_{tS}) over the encoder’s hidden states h^(s) that depends on the decoder’s current hidden state h^(t):
α_{ts} = exp(s(h^(t), h^(s))) / Σ_{s′} exp(s(h^(t), h^(s′)))
Compute a weighted avg. of the encoder’s hidden states, c^(t) = Σ_{s=1..S} α_{ts} h^(s), that then gets used together with h^(t), e.g. in o^(t) = tanh(W_1 h^(t) + W_2 c^(t))
— Hard attention (degenerate case, non-differentiable): α is a one-hot vector
— Soft attention (general case): α is not a one-hot vector
Scoring functions s(h^(t), h^(s)):
— Dot product (no learned parameters): s(h^(t), h^(s)) = h^(t) · h^(s)
— Bilinear (learn a matrix W): s(h^(t), h^(s)) = (h^(t))ᵀ W h^(s)
— Concatenation of the hidden states: s(h^(t), h^(s)) = vᵀ tanh(W_1 h^(t) + W_2 h^(s))
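
A minimal NumPy sketch of this computation using the dot-product score; the shapes, the weight matrices W1/W2, and the random inputs are purely illustrative:

import numpy as np

def attention(h_dec, H_enc, W1, W2):
    # h_dec: (h,) current decoder state; H_enc: (S, h) encoder hidden states.
    scores = H_enc @ h_dec                    # s(h^(t), h^(s)) = dot product
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()               # softmax over encoder positions
    c = alpha @ H_enc                         # context: weighted avg. of encoder states
    o = np.tanh(W1 @ h_dec + W2 @ c)          # combine context with decoder state
    return o, alpha

S, h = 6, 4
rng = np.random.default_rng(0)
o, alpha = attention(rng.normal(size=h), rng.normal(size=(S, h)),
                     rng.normal(size=(h, h)), rng.normal(size=(h, h)))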

5. Activation functions

6. Recap: Activation functions
[Plot: 1/(1+exp(−x)), tanh(x), and max(0,x) for x ∈ [−3, 3]]
Sigmoid (logistic function): σ(x) = 1/(1 + e^(−x))
— Returns values bounded above and below, in the range [0, 1]
Hyperbolic tangent: tanh(x) = (e^(2x) − 1)/(e^(2x) + 1)
— Returns values bounded above and below, in the range [−1, +1]
Rectified Linear Unit: ReLU(x) = max(0, x)
— Returns values bounded below, in the range [0, +∞)
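
The three functions written out with NumPy, as a plain transcription of the formulas above (the explicit tanh definition is only for illustration; np.tanh is the usual choice):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                      # values between 0 and 1

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)     # values between -1 and +1

def relu(x):
    return np.maximum(0.0, x)                            # values bounded below by 0

x = np.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x), sep='\n')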

7. From RNNs to LSTMs

8. From RNNs to LSTMs
In vanilla (Elman) RNNs, the current hidden state h^(t) is a nonlinear function of the previous hidden state h^(t−1) and the current input x^(t):
h^(t) = g(W_h [h^(t−1), x^(t)] + b_h)
With g = tanh (the original definition):
⇒ Models suffer from the vanishing gradient problem: they can’t be trained effectively on long sequences.
With g = ReLU:
⇒ Models suffer from the exploding gradient problem: they can’t be trained effectively on long sequences.

9. From RNNs to LSTMs
LSTMs (Long Short-Term Memory networks) were introduced by Hochreiter and Schmidhuber to overcome this problem.
— They introduce an additional cell state that also gets passed through the network and updated at each time step.
— LSTMs define three different gates that read in the previous hidden state and current input to decide how much of the past hidden and cell states to keep.
— This gating mechanism mitigates the vanishing/exploding gradient problems of traditional RNNs.

10. Gating mechanisms
Gates are trainable layers with a sigmoid activation function, often determined by the current input x^(t) and the (last) hidden state h^(t−1), e.g.:
g_k^(t) = σ(W_k x^(t) + U_k h^(t−1) + b_k)   (one set of parameters W_k, U_k, b_k per gate k)
g is a vector of (Bernoulli) probabilities (∀i: 0 ≤ g_i ≤ 1).
Unlike traditional (0,1) gates, neural gates are differentiable (we can train them).
A gate g is combined with another vector u (of the same dimensionality) by element-wise multiplication (Hadamard product): v = g ⊗ u
— If g_i ≈ 0, then v_i ≈ 0, and if g_i ≈ 1, then v_i ≈ u_i
— Each gate is associated with its own set of trainable parameters, and each g_i determines how much of u_i to keep or forget
Gates are used to form linear combinations of vectors u, v:
— Linear interpolation (coupled gates): w = g ⊗ u + (1 − g) ⊗ v
— Addition of two gates: w = g_1 ⊗ u + g_2 ⊗ v
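
A small sketch of the gating idea in NumPy: a sigmoid layer produces a vector of values between 0 and 1 that is then used element-wise to interpolate between two vectors (all shapes and random values are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(x, h_prev, W_k, U_k, b_k):
    # g_k^(t) = sigma(W_k x^(t) + U_k h^(t-1) + b_k), one parameter set per gate k
    return sigmoid(W_k @ x + U_k @ h_prev + b_k)

h_dim, x_dim = 3, 4
rng = np.random.default_rng(0)
g = gate(rng.normal(size=x_dim), rng.normal(size=h_dim),
         rng.normal(size=(h_dim, x_dim)), rng.normal(size=(h_dim, h_dim)), np.zeros(h_dim))

u, v = rng.normal(size=h_dim), rng.normal(size=h_dim)
w = g * u + (1 - g) * v    # linear interpolation with coupled gates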

11. Long Short Term Memory Networks (LSTMs)
[Figure: LSTM cell with inputs c^(t−1), h^(t−1), x^(t) and outputs c^(t), h^(t); see https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
At time t, the LSTM cell reads in:
— a c-dimensional previous cell state vector c^(t−1)
— an h-dimensional previous hidden state vector h^(t−1)
— a d-dimensional current input vector x^(t)
At time t, the LSTM cell returns:
— a c-dimensional new cell state vector c^(t)
— an h-dimensional new hidden state vector h^(t) (which may also be passed to an output layer)

12. LSTM operations
Based on the previous cell state c^(t−1) and hidden state h^(t−1) and the current input x^(t), the LSTM computes:
1) A new intermediate cell state c̃^(t) that depends on h^(t−1) and x^(t):
c̃^(t) = tanh(W_c [h^(t−1), x^(t)] + b_c)
2) Three gates (which each depend on h^(t−1) and x^(t)):
a) The forget gate f^(t) = σ(W_f [h^(t−1), x^(t)] + b_f) decides how much of the last cell state c^(t−1) to remember in the new cell state: f^(t) ⊗ c^(t−1)
b) The input gate i^(t) = σ(W_i [h^(t−1), x^(t)] + b_i) decides how much of the intermediate cell state c̃^(t) to use in the new cell state: i^(t) ⊗ c̃^(t)
c) The output gate o^(t) = σ(W_o [h^(t−1), x^(t)] + b_o) decides how much of the new cell state c^(t) to use in the new hidden state h^(t) = o^(t) ⊗ tanh(c^(t))
3) The new cell state c^(t) = f^(t) ⊗ c^(t−1) + i^(t) ⊗ c̃^(t) is a linear combination of the cell states c^(t−1) and c̃^(t) that depends on the forget gate f^(t) and the input gate i^(t)
4) The new hidden state h^(t) = o^(t) ⊗ tanh(c^(t))
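
A single LSTM step in NumPy, transcribing steps 1)–4) above; each gate has its own weights acting on the concatenation [h^(t−1), x^(t)], and all dimensions and random parameters are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c_prev, h_prev, x, p):
    hx = np.concatenate([h_prev, x])                # [h^(t-1), x^(t)]
    c_tilde = np.tanh(p['W_c'] @ hx + p['b_c'])     # 1) intermediate cell state
    f = sigmoid(p['W_f'] @ hx + p['b_f'])           # 2a) forget gate
    i = sigmoid(p['W_i'] @ hx + p['b_i'])           # 2b) input gate
    o = sigmoid(p['W_o'] @ hx + p['b_o'])           # 2c) output gate
    c = f * c_prev + i * c_tilde                    # 3) new cell state
    h = o * np.tanh(c)                              # 4) new hidden state
    return c, h

h_dim, x_dim = 3, 5
rng = np.random.default_rng(0)
p = {f'W_{k}': rng.normal(size=(h_dim, h_dim + x_dim)) for k in 'cfio'}
p.update({f'b_{k}': np.zeros(h_dim) for k in 'cfio'})
c, h = lstm_step(np.zeros(h_dim), np.zeros(h_dim), rng.normal(size=x_dim), p)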

13. LSTM summary
Based on c^(t−1), h^(t−1), and x^(t), the LSTM computes:
— Intermediate cell state: c̃^(t) = tanh(W_c [h^(t−1), x^(t)] + b_c)
— Forget gate: f^(t) = σ(W_f [h^(t−1), x^(t)] + b_f)
— Input gate: i^(t) = σ(W_i [h^(t−1), x^(t)] + b_i)
— New (final) cell state: c^(t) = f^(t) ⊗ c^(t−1) + i^(t) ⊗ c̃^(t)
— Output gate: o^(t) = σ(W_o [h^(t−1), x^(t)] + b_o)
— New hidden state: h^(t) = o^(t) ⊗ tanh(c^(t))
c^(t) and h^(t) are passed on to the next time step.

14. Gated Recurrent Units (GRUs)
Cho et al. (2014), Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
https://arxiv.org/pdf/1406.1078.pdf

15. GRU definition
Based on h^(t−1) and x^(t), the GRU computes:
— a reset gate r^(t) to determine how much of h^(t−1) to keep in h̃^(t):
r^(t) = σ(W_r x^(t) + U_r h^(t−1) + b_r)
— an intermediate hidden state h̃^(t) that depends on r^(t) ⊗ h^(t−1) and x^(t):
h̃^(t) = φ(W_h x^(t) + U_h (r^(t) ⊗ h^(t−1)) + b_h)
— an update gate z^(t) to determine how much of h^(t−1) to keep in h^(t):
z^(t) = σ(W_z x^(t) + U_z h^(t−1) + b_z)
— a new hidden state h^(t) as a linear interpolation of h^(t−1) and h̃^(t), with weights determined by the update gate z^(t):
h^(t) = z^(t) ⊗ h^(t−1) + (1 − z^(t)) ⊗ h̃^(t)
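
A single GRU step in NumPy, following the four equations above; parameter shapes and random values are illustrative, and φ is taken to be tanh:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, p):
    r = sigmoid(p['W_r'] @ x + p['U_r'] @ h_prev + p['b_r'])              # reset gate
    z = sigmoid(p['W_z'] @ x + p['U_z'] @ h_prev + p['b_z'])              # update gate
    h_tilde = np.tanh(p['W_h'] @ x + p['U_h'] @ (r * h_prev) + p['b_h'])  # candidate state
    return z * h_prev + (1 - z) * h_tilde                                 # interpolate old and new

h_dim, x_dim = 3, 5
rng = np.random.default_rng(0)
p = {}
for k in 'rzh':
    p[f'W_{k}'] = rng.normal(size=(h_dim, x_dim))
    p[f'U_{k}'] = rng.normal(size=(h_dim, h_dim))
    p[f'b_{k}'] = np.zeros(h_dim)
h = gru_step(np.zeros(h_dim), rng.normal(size=x_dim), p)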

16. Expressive power of RNN, LSTM, GRU
Weiss, Goldberg, Yahav (2018), On the Practical Computational Power of Finite Precision RNNs for Language Recognition
https://www.aclweb.org/anthology/P18-2117.pdf

17. Models
Basic RNNs:
— Simple (Elman) SRNN: h^(t) = tanh(W x^(t) + U h^(t−1) + b)
— IRNN: h^(t) = ReLU(W x^(t) + U h^(t−1) + b)
Gated RNNs (GRUs and LSTMs):
— Gates g_k^(t) = σ(W_k x^(t) + U_k h^(t−1) + b_k): each element is a probability
NB: a gate can return 0 or 1 by setting its matrices to 0 and b = 0 or b = 1
— GRU, with gates r^(t), z^(t):
intermediate hidden state h̃^(t) = tanh(W_h x^(t) + U_h (r^(t) ⊗ h^(t−1)) + b_h)
hidden state h^(t) = z^(t) ⊗ h^(t−1) + (1 − z^(t)) ⊗ h̃^(t)
NB: GRU reduces to SRNN with r = 1, z = 0
— LSTM, with gates f^(t), i^(t), o^(t):
memory cell c̃^(t) = tanh(W_c x^(t) + U_c h^(t−1) + b_c), c^(t) = f^(t) ⊗ c^(t−1) + i^(t) ⊗ c̃^(t)
hidden state h^(t) = o^(t) ⊗ φ(c^(t)) for φ = identity or tanh
NB: LSTM reduces to SRNN with f = 0, i = 1, o = 1
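
A quick NumPy check of the two reduction claims, with the gates forced to the stated constants and the candidate-state weights shared across the three models (dimensions and random values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3
W, U, b = rng.normal(size=(n, d)), rng.normal(size=(n, n)), rng.normal(size=n)
x, h_prev, c_prev = rng.normal(size=d), rng.normal(size=n), rng.normal(size=n)

srnn = np.tanh(W @ x + U @ h_prev + b)            # SRNN step

r, z = np.ones(n), np.zeros(n)                    # GRU with r = 1, z = 0
h_tilde = np.tanh(W @ x + U @ (r * h_prev) + b)
gru = z * h_prev + (1 - z) * h_tilde

f, i, o = np.zeros(n), np.ones(n), np.ones(n)     # LSTM with f = 0, i = 1, o = 1
c_tilde = np.tanh(W @ x + U @ h_prev + b)
c = f * c_prev + i * c_tilde
lstm = o * c                                      # phi = identity

assert np.allclose(srnn, gru) and np.allclose(srnn, lstm)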
