CS546: Machine Learning in NLP (Spring 2020)
http://courses.engr.illinois.edu/cs546/
Julia Hockenmaier
juliahmr@illinois.edu, 3324 Siebel Center
Office hours: Monday, 11am-12:30pm
Lecture 6: RNN wrap-up

Today's class: RNN architectures
CS546 Machine Learning in NLP
RNNs are among the workhorses of neural NLP:
— Basic RNNs are rarely used
— LSTMs and GRUs are commonly used
What's the difference between these variants?
RNN odds and ends:
— Character RNNs
— Attention mechanisms (LSTMs/GRUs)
Character RNNs:
— Each input element is one character: 't', 'h', 'e', …
— Can be used to replace word embeddings (in languages with an alphabet, like English); see e.g. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
— In Chinese, RNNs can be used directly on characters without word segmentation; the equivalent of "character RNNs" might be models that decompose characters into radicals/strokes
Byte Pair Encoding (BPE):
— Learn which character sequences are common in the language ('ing', 'pre', 'at', …)
— Split the input into these sequences and learn embeddings for these sequences
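The BPE learning step can be sketched as follows. This is a toy version (the word list and merge count are made up for illustration): repeatedly find the most frequent adjacent symbol pair across the vocabulary and merge it into a new symbol.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: treat each word as a tuple of symbols, repeatedly merge
    the most frequent adjacent symbol pair into one new symbol."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # apply merge
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges(["low", "lower", "lowest", "newest", "widest"], 3)
```

Real BPE implementations operate on corpus word frequencies and also keep the learned merge list for segmenting new text.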
Attention:
Compute a probability distribution α(t) = (α_t1, …, α_tS) over the encoder's hidden states h(s) that depends on the decoder's current hidden state h(t):
α_ts = exp(s(h(t), h(s))) / Σ_s′ exp(s(h(t), h(s′)))
Compute a weighted average of the encoder's hidden states, c(t) = Σ_{s=1..S} α_ts h(s), that then gets used together with h(t), e.g. in the output layer.
— Hard attention (degenerate case, non-differentiable): α is a one-hot vector
— Soft attention (general case): α is not a one-hot vector
Scoring functions s(h(t), h(s)):
— Dot product (no learned parameters): s(h(t), h(s)) = h(t) · h(s)
— Bilinear (learn a bilinear matrix W): s(h(t), h(s)) = (h(t))ᵀ W h(s)
— Additive (learn W1, W2, v): s(h(t), h(s)) = vᵀ tanh(W1 h(t) + W2 h(s))
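The soft-attention computation above, with the dot-product score, can be sketched in a few lines (the toy matrix sizes are made up for illustration):

```python
import numpy as np

def soft_attention(enc_states, dec_state):
    """Dot-product soft attention: score each encoder state h(s) against
    the decoder state h(t), softmax the scores into weights alpha, and
    return the weighted average (context vector) c(t)."""
    scores = enc_states @ dec_state        # s(h(t), h(s)) for every s
    scores = scores - scores.max()         # shift for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()                   # attention distribution over s
    context = alpha @ enc_states           # weighted avg of the h(s)
    return alpha, context

# Toy example: 4 encoder states of dimension 3.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
alpha, c = soft_attention(H, rng.normal(size=3))
```

Hard attention would instead replace `alpha` with a one-hot vector (e.g. argmax), which is why it is not differentiable.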
Activation functions:
— Sigmoid (logistic function): σ(x) = 1/(1 + e^(−x)). Returns values bound above and below, in the range [0, 1].
— Hyperbolic tangent: tanh(x) = (e^(2x) − 1)/(e^(2x) + 1). Returns values bound above and below, in the range [−1, +1].
— Rectified Linear Unit: ReLU(x) = max(0, x). Returns values bound below, in the range [0, +∞).
[Figure: plots of 1/(1+exp(-x)), tanh(x), and max(0,x)]
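The stated ranges can be checked numerically; a quick sketch:

```python
import numpy as np

def sigmoid(x):
    # logistic function: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectified linear unit: output in [0, inf)
    return np.maximum(0.0, x)

x = np.linspace(-10.0, 10.0, 2001)
s, t, r = sigmoid(x), np.tanh(x), relu(x)
```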
In vanilla (Elman) RNNs, the current hidden state h(t) is a nonlinear function of the previous hidden state h(t−1) and the current input x(t):
h(t) = g(W_h [h(t−1), x(t)] + b_h)
With g = tanh (the original definition), models suffer from the vanishing gradient problem: they can't be trained effectively on long sequences.
With g = ReLU, models suffer from the exploding gradient problem: they can't be trained effectively on long sequences.
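One Elman step is a single matrix-vector product over the concatenated [h(t−1), x(t)]; a minimal sketch (the toy dimensions and random weights are made up):

```python
import numpy as np

def elman_step(h_prev, x, W_h, b_h, g=np.tanh):
    """One vanilla (Elman) RNN step: h(t) = g(W_h [h(t-1), x(t)] + b_h),
    where [.,.] denotes vector concatenation."""
    return g(W_h @ np.concatenate([h_prev, x]) + b_h)

# Toy dimensions: hidden size 4, input size 3.
rng = np.random.default_rng(1)
W_h = rng.normal(size=(4, 7)) * 0.1   # maps concat(h, x) -> new hidden
b_h = np.zeros(4)
h = np.zeros(4)
for x in rng.normal(size=(5, 3)):     # run five time steps
    h = elman_step(h, x, W_h, b_h)
```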
LSTMs (Long Short-Term Memory networks) were introduced by Hochreiter and Schmidhuber (1997) to overcome this problem.
— They introduce an additional cell state that also gets passed through the network and updated at each time step
— LSTMs define three different gates that read in the previous hidden state and current input to decide how much of the past hidden and cell states to keep
— This gating mechanism mitigates the vanishing/exploding gradient problems of traditional RNNs
Gates are trainable layers with a sigmoid activation function that read in the current input x(t) and the (last) hidden state h(t−1), e.g.:
g(t) = σ(W_g x(t) + U_g h(t−1) + b_g)
g is a vector of (Bernoulli) probabilities (∀i: 0 ≤ g_i ≤ 1). Unlike traditional (0,1) gates, neural gates are differentiable (we can train them).
A gate g is combined with another vector u (of the same dimensionality) by element-wise multiplication (Hadamard product): v = g ⊗ u
— If g_i ≈ 0, v_i ≈ 0, and if g_i ≈ 1, v_i ≈ u_i
— Each gate is associated with its own set of trainable parameters and determines how much of u to keep or forget
Gates are used to form linear combinations of vectors u, v:
— Linear interpolation (coupled gates): w = g ⊗ u + (1 − g) ⊗ v
— Addition of two gates: w = g1 ⊗ u + g2 ⊗ v
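The gate mechanics above fit in a few lines of numpy; the shapes and random parameters below are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A gate is a sigmoid layer reading x(t) and h(t-1).
rng = np.random.default_rng(2)
W, U, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
x, h_prev = rng.normal(size=3), rng.normal(size=4)
g = sigmoid(W @ x + U @ h_prev + b)   # every g_i lies strictly in (0, 1)

u, v = rng.normal(size=4), rng.normal(size=4)
kept = g * u                          # Hadamard product g (x) u
interp = g * u + (1 - g) * v          # linear interpolation (coupled gates)
```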
At time t, the LSTM cell reads in:
— a c-dimensional previous cell state vector c(t−1)
— an h-dimensional previous hidden state vector h(t−1)
— a d-dimensional current input vector x(t)
At time t, the LSTM cell returns:
— a c-dimensional new cell state vector c(t)
— an h-dimensional new hidden state vector h(t) (which may also be passed to an output layer)
[Figure: LSTM cell, showing c(t−1) → c(t) and h(t−1) → h(t); from https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Based on the previous cell state c(t−1), the previous hidden state h(t−1), and the current input x(t), the LSTM computes:
1) A new intermediate cell state c̃(t) that depends on h(t−1) and x(t):
c̃(t) = tanh(W_c [h(t−1), x(t)] + b_c)
2) Three gates (which each depend on h(t−1) and x(t)):
a) The forget gate decides how much of the last cell state c(t−1) to remember in the new cell state: f(t) = σ(W_f [h(t−1), x(t)] + b_f), giving f(t) ⊗ c(t−1)
b) The input gate decides how much of the intermediate c̃(t) to use in the new cell state: i(t) = σ(W_i [h(t−1), x(t)] + b_i), giving i(t) ⊗ c̃(t)
c) The output gate o(t) decides how much of the new cell state c(t) to use in h(t)
3) The new cell state is a linear combination of c(t−1) and c̃(t), weighted by the forget gate f(t) and input gate i(t):
c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ c̃(t)
4) The new hidden state: h(t) = o(t) ⊗ tanh(c(t))
Based on c(t−1), h(t−1), and x(t), the LSTM computes:
— Intermediate cell state: c̃(t) = tanh(W_c [h(t−1), x(t)] + b_c)
— Forget gate: f(t) = σ(W_f [h(t−1), x(t)] + b_f)
— Input gate: i(t) = σ(W_i [h(t−1), x(t)] + b_i)
— New (final) cell state: c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ c̃(t)
— Output gate: o(t) = σ(W_o [h(t−1), x(t)] + b_o)
— New hidden state: h(t) = o(t) ⊗ tanh(c(t))
c(t) and h(t) are passed on to the next time step.
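The equations above translate directly into one function per time step; a minimal sketch (the parameter dictionary layout, toy sizes, and random weights are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c_prev, h_prev, x, p):
    """One LSTM step following the equations above. p maps names to
    weight matrices W_* over [h(t-1), x(t)] and bias vectors b_*."""
    hx = np.concatenate([h_prev, x])
    c_tilde = np.tanh(p["Wc"] @ hx + p["bc"])  # intermediate cell state
    f = sigmoid(p["Wf"] @ hx + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ hx + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ hx + p["bo"])        # output gate
    c = f * c_prev + i * c_tilde               # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return c, h

# Toy sizes: hidden/cell size 4, input size 3.
rng = np.random.default_rng(3)
p = {f"W{k}": rng.normal(size=(4, 7)) * 0.1 for k in "cfio"}
p.update({f"b{k}": np.zeros(4) for k in "cfio"})
c, h = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(6, 3)):
    c, h = lstm_step(c, h, x, p)
```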
Cho et al. (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation https://arxiv.org/pdf/1406.1078.pdf
Based on h(t−1) and x(t), the GRU computes:
— A reset gate r(t) = σ(W_r x(t) + U_r h(t−1) + b_r) to determine how much of h(t−1) to keep in h̃(t)
— An intermediate hidden state h̃(t) = ϕ(W_h x(t) + U_h (r(t) ⊗ h(t−1)) + b_h) that depends on x(t) and r(t) ⊗ h(t−1)
— An update gate z(t) = σ(W_z x(t) + U_z h(t−1) + b_z) to determine how much of h(t−1) to keep in h(t)
— A new hidden state h(t) = z(t) ⊗ h(t−1) + (1 − z(t)) ⊗ h̃(t), a linear interpolation of h(t−1) and h̃(t) with weights determined by the update gate z(t)
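For comparison with the LSTM step, a GRU step under the same toy setup (parameter names, sizes, and random weights are illustrative; ϕ = tanh here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, p):
    """One GRU step following the equations above (phi = tanh)."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])  # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])  # update gate
    return z * h_prev + (1 - z) * h_tilde                  # interpolation

# Toy sizes: hidden size 4, input size 3.
rng = np.random.default_rng(4)
p = {}
for k in "rhz":
    p[f"W{k}"] = rng.normal(size=(4, 3)) * 0.1
    p[f"U{k}"] = rng.normal(size=(4, 4)) * 0.1
    p[f"b{k}"] = np.zeros(4)
h = np.zeros(4)
for x in rng.normal(size=(6, 3)):
    h = gru_step(h, x, p)
```

Note the design difference: the GRU has no separate cell state, and its single interpolation z ⊗ h(t−1) + (1 − z) ⊗ h̃(t) plays the role of the LSTM's forget/input gate pair.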
Weiss, Goldberg, Yahav (2018) On the Practical Computational Power of Finite Precision RNNs for Language Recognition
https://www.aclweb.org/anthology/P18-2117.pdf
Basic RNNs:
— Simple (Elman) RNN (SRNN): h(t) = tanh(Wx(t) + Uh(t−1) + b)
— IRNN: h(t) = ReLU(Wx(t) + Uh(t−1) + b)
Gated RNNs (GRUs and LSTMs):
— Gates g(t) = σ(W_g x(t) + U_g h(t−1) + b_g): each element is a probability
— NB: a gate can output (approximately) all 0s or all 1s by setting its weight matrices to 0 and its bias to a large negative or large positive value (saturating the sigmoid)
— GRU, with gates r(t), z(t):
   h̃(t) = tanh(W_h x(t) + U_h (r(t) ⊗ h(t−1)) + b_h)
   h(t) = z(t) ⊗ h(t−1) + (1 − z(t)) ⊗ h̃(t)
   NB: the GRU reduces to an SRNN with r = 1, z = 0
— LSTM, with gates f(t), i(t), o(t) and memory cell c(t):
   c̃(t) = tanh(W_c x(t) + U_c h(t−1) + b_c)
   c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ c̃(t)
   h(t) = o(t) ⊗ ϕ(c(t)) for ϕ = identity or tanh
   NB: the LSTM reduces to an SRNN with f = 0, i = 1, o = 1
A k-counter machine (SKCM) is a finite-state automaton with k counters. Depending on the input, in each step, each counter can be:
— incremented (INC) by a fixed amount
— decremented (DEC) by a fixed amount
— or left as is
State transitions and accept/reject decisions can compare each counter to 0 (COMP0).
SKCMs can recognize a^n b^n (context-free) and a^n b^n c^n (context-sensitive), but not palindromes (S → x | aSa | bSb), which are also context-free.
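As a concrete illustration, one counter plus a two-state control suffices for a^n b^n; a minimal hand-coded sketch (the function name is ours):

```python
def accepts_anbn(s):
    """A 1-counter machine for a^n b^n: INC on 'a', DEC on 'b', with a
    two-state control that rejects any 'a' occurring after a 'b';
    accept iff the counter never goes negative and ends at 0 (COMP0)."""
    counter, seen_b = 0, False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False      # 'a' after 'b': control rejects
            counter += 1          # INC
        elif ch == "b":
            seen_b = True
            counter -= 1          # DEC
            if counter < 0:
                return False      # more b's than a's so far
        else:
            return False          # symbol outside the alphabet
    return counter == 0           # COMP0 at the end of the input
```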
LSTMs can be used to implement an SKCM:
— k dimensions of the memory cell c(t) are counters
— Non-counting steps: set i_j(t) = 0, f_j(t) = 1 to leave the counter unmodified:
   c_j(t) = 1 · c_j(t−1) + 0 · c̃_j(t) = c_j(t−1)
— Counting steps: set i_j(t) = 1, f_j(t) = 1 to increment/decrement the cell:
   c_j(t) = 1 · c_j(t−1) + 1 · c̃_j(t) = c_j(t−1) + c̃_j(t)
— Resetting a counter to 0: set i_j(t) = 0, f_j(t) = 0:
   c_j(t) = 0 · c_j(t−1) + 0 · c̃_j(t) = 0
— Comparing counters to 0: h_j(t) = o_j(t) · c_j(t) and h_j(t) = o_j(t) · tanh(c_j(t)) are both 0 iff c_j(t) = 0
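The counting-step case of this construction can be simulated directly with idealized (fully saturated) gate values; a toy sketch of one cell dimension used as a counter (the function name is ours):

```python
def lstm_counter(s):
    """Simulate one LSTM cell dimension j as a counter of 'a's minus
    'b's: every step is a counting step with f_j = 1, i_j = 1, and
    c~_j = +1 on 'a', -1 on 'b' (idealized, saturated gate values)."""
    c = 0.0
    for ch in s:
        c_tilde = 1.0 if ch == "a" else -1.0
        f, i = 1.0, 1.0               # counting step: keep old cell, add c~
        c = f * c + i * c_tilde       # c_j(t) = c_j(t-1) + c~_j(t)
    return c
```

In a trained LSTM the gates only approach 0 and 1, which is why counting works in practice but is not perfectly precise.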
SRNN update: h(t) = tanh(Wx(t) + Uh(t−1) + b)
The tanh() activation function means the activations lie within [−1, +1].
With finite precision, counting can only be achieved within a narrow range (and will be unstable).
Simple RNNs have poor generalization capabilities for counting.
IRNN update: h(t) = ReLU(Wx(t) + Uh(t−1) + b) = max(0, Wx(t) + Uh(t−1) + b)
The ReLU maps all negative numbers to 0, but leaves positive numbers unchanged.
Finite-precision IRNNs can perform unbounded counting by representing each counter as two dimensions:
— INC increments one dimension
— DEC increments the other dimension
— COMP0 compares their difference to 0
But: IRNNs are difficult to train because they have exploding gradients, so they don't work well in practice.
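The two-dimension trick can be sketched directly (the function name is hypothetical); since both dimensions only ever grow, the ReLU never clips them:

```python
def irnn_counter(s):
    """Represent one counter as two non-negative (ReLU-style) dimensions:
    INC on 'a' increments dim 0, DEC on 'b' increments dim 1; COMP0
    then compares their difference to 0."""
    inc, dec = 0.0, 0.0
    for ch in s:
        if ch == "a":
            inc = max(0.0, inc + 1.0)   # ReLU update of dimension 0
        elif ch == "b":
            dec = max(0.0, dec + 1.0)   # ReLU update of dimension 1
    return inc - dec                     # counter value for COMP0
```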
GRU updates:
h̃(t) = tanh(W_h x(t) + U_h (r(t) ⊗ h(t−1)) + b_h)
h(t) = z(t) ⊗ h(t−1) + (1 − z(t)) ⊗ h̃(t)
Finite-precision GRUs cannot implement unbounded counting, because the tanh squashing and the linear interpolation restrict hidden state values to [−1, +1].
GRUs can learn counting up to a finite bound seen in training, but won't generalize beyond that: counting requires setting gates and hidden states to precise non-saturated values that are difficult to find.
— Simple RNNs and GRUs cannot represent unbounded counting (mostly because they use tanh and linear interpolation)
— IRNNs and LSTMs can represent unbounded counting
Claims about other LSTM variants:
— Coupling the input and forget gates by setting i = (1 − f) also removes their counting ability
— "Peephole connections" (feeding the cell states into the gates) 'essentially' use the identity as activation function, and allow comparing counters in a stable way:
f(t) = σ(W_f x(t) + U_f h(t−1) + V_f c(t−1) + b_f)
i(t) = σ(W_i x(t) + U_i h(t−1) + V_i c(t−1) + b_i)
Setup:
— Train models to recognize strings in a language (binary classification: accept if the input string is in the language, reject otherwise)
— Each model has one layer and a hidden size of 10
— Training on a^n b^n up to n = 100, on a^n b^n c^n up to n = 50
Results:
— Counting mechanisms are not precise; they fail for very large n
— But LSTMs can be trained to recognize a^n b^n and a^n b^n c^n for much greater n than seen during training
— These trained LSTMs do use per-dimension counters
— GRUs can also be trained to recognize a^n b^n and a^n b^n c^n, but without counting dimensions, and with much poorer generalization (they fail even on some training examples)
[Figure: LSTM vs. GRU activations: a^n b^n models on a^1000 b^1000; a^n b^n c^n models on a^100 b^100 c^100]