
SLIDE 1

Recurrent Neural Networks

LING572: Advanced Statistical Methods for NLP
March 5, 2020

SLIDE 2

Outline

  • Word representations and MLPs for NLP tasks
  • Recurrent neural networks for sequences
  • Fancier RNNs
    • Vanishing/exploding gradients
    • LSTMs (Long Short-Term Memory)
    • Variants
  • Seq2seq architecture
  • Attention

SLIDE 3

MLPs for text classification

SLIDE 4

Word Representations

  • Traditionally: words are discrete features
    • e.g. curWord=“class”
  • As vectors: one-hot encoding
    • Each vector is |V|-dimensional, where V is the vocabulary
    • Each dimension corresponds to one word of the vocabulary
    • A 1 for the current word; 0 everywhere else

w1 = [1 0 0 ⋯ 0]
w3 = [0 0 1 ⋯ 0]

SLIDE 5

Word Embeddings

  • Problem 1: every word is equally different from every other
    • All words are orthogonal to each other
  • Problem 2: very high dimensionality
  • Solution: move words into a dense, lower-dimensional space
    • Grouping similar words close to each other
    • These denser representations are called embeddings

SLIDE 6

Word Embeddings

  • Formally, a d-dimensional embedding is a matrix E with shape (|V|, d)
  • Each row is the vector for one word in the vocabulary
  • Matrix-multiplying by a one-hot vector returns the corresponding row, i.e. the right word vector
  • Trained on prediction tasks (see LING571 slides)
    • Continuous bag of words
    • Skip-gram
  • Can be trained on a specific task, or downloaded pre-trained (e.g. GloVe, fastText)
  • Fancier versions now deal with OOV: sub-word units (e.g. BPE), character CNN/LSTM
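A minimal numpy sketch of the point above (not from the slides; the sizes and names are illustrative): multiplying a one-hot vector by the embedding matrix E simply selects the corresponding row.

    import numpy as np

    V, d = 5, 3                        # toy vocabulary and embedding sizes
    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, d))        # embedding matrix: one row per word

    w3 = np.zeros(V)
    w3[2] = 1.0                        # one-hot vector for the third word

    assert np.allclose(w3 @ E, E[2])   # the matmul picks out row 2 of E

In practice, embedding layers implement this as a direct row lookup (E[i]) rather than an explicit matrix multiplication.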

SLIDES 7-8

Relationships via Offsets

[Figure: vector offsets capture lexical relations, e.g. MAN → WOMAN, UNCLE → AUNT, KING → QUEEN, and singular/plural pairs KING → KINGS, QUEEN → QUEENS]

Mikolov et al 2013b

SLIDES 9-10

One More Example

Mikolov et al 2013c

SLIDE 11

Caveat Emptor

Linzen 2016, a.o.

SLIDES 12-16

Example MLP for Language Modeling

Bengio et al 2003

wt : one-hot vector for the word at position t; C : the embedding matrix

embeddings = concat(C wt−1, C wt−2, …, C wt−(n−1))
hidden = tanh(W1 embeddings + b1)
probabilities = softmax(W2 hidden + b2)
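A minimal numpy sketch of the forward pass in the equations above; the toy sizes and the example context indices are illustrative assumptions, not from the slides.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # toy sizes (hypothetical): |V| = 10, embedding dim d = 4, n = 3 (two context words), hidden dim 8
    V, d, n, h = 10, 4, 3, 8
    rng = np.random.default_rng(0)
    C  = rng.normal(size=(V, d))                      # embedding matrix
    W1 = rng.normal(size=(h, (n - 1) * d)); b1 = np.zeros(h)
    W2 = rng.normal(size=(V, h));           b2 = np.zeros(V)

    context = [7, 2]                                  # indices of w_{t-1}, w_{t-2}
    embeddings = np.concatenate([C[w] for w in context])
    hidden = np.tanh(W1 @ embeddings + b1)
    probabilities = softmax(W2 @ hidden + b2)         # distribution over the next word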

SLIDE 17

Example MLP for sentiment classification

  • Issue: texts of different length
  • One solution: average (or sum, or …) all the embeddings, which are all of the same dimension

Iyyer et al 2015

Model                             IMDB accuracy
Deep averaging network            89.4
NB-SVM (Wang and Manning 2012)    91.2
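To make the averaging idea concrete, here is a stripped-down sketch with assumed names and sizes (the actual deep averaging network of Iyyer et al 2015 adds hidden layers and word dropout): average the embeddings of however many tokens the text has, then classify the fixed-size result.

    import numpy as np

    def classify(token_ids, E, W, b):
        """Average the word embeddings, then apply a linear layer + softmax."""
        avg = E[token_ids].mean(axis=0)           # fixed-size vector, regardless of text length
        logits = W @ avg + b
        e = np.exp(logits - logits.max())
        return e / e.sum()                        # class probabilities

    rng = np.random.default_rng(0)
    E = rng.normal(size=(1000, 50))               # hypothetical vocab of 1000, dim 50
    W = rng.normal(size=(2, 50)); b = np.zeros(2) # binary sentiment
    probs = classify([5, 42, 7, 300], E, W, b)    # works for any text length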

SLIDE 18

Recurrent Neural Networks

SLIDES 19-22

RNNs: high-level

  • Feed-forward networks: fixed-size input, fixed-size output
    • Previous classifier: average embeddings of words
    • Other solutions: n-gram assumption (i.e. fixed-size context of word embeddings)
  • RNNs process sequences of vectors
    • Maintaining “hidden” state
    • Applying the same operation at each step
  • Different RNNs:
    • Different operations at each step
    • The operation is also called a “recurrent cell”
    • Other architectural considerations (e.g. depth, bidirectionality)

SLIDES 23-27

RNNs

[Figure: an RNN unrolled over the input “This class … interesting”, with a Linear + softmax output layer at each step]

ht = f(xt, ht−1)

Simple/“Vanilla” RNN:

ht = tanh(Wx xt + Wh ht−1 + b)

Steinert-Threlkeld and Szymanik 2019; Olah 2015
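A minimal numpy sketch of the vanilla RNN recurrence above (sizes and names are illustrative assumptions): the same cell, with the same parameters, is applied at every position.

    import numpy as np

    def vanilla_rnn(xs, Wx, Wh, b, h0):
        """h_t = tanh(Wx x_t + Wh h_{t-1} + b), applied left to right."""
        h, states = h0, []
        for x in xs:                              # xs: a sequence of input vectors
            h = np.tanh(Wx @ x + Wh @ h + b)
            states.append(h)
        return states                             # one hidden state per position

    rng = np.random.default_rng(0)
    Wx, Wh, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
    states = vanilla_rnn([rng.normal(size=4) for _ in range(5)], Wx, Wh, b, np.zeros(3))

Each per-position hidden state can then be fed to a Linear + softmax layer, as in the figure.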

SLIDE 28

Using RNNs

[Figure: ways of wiring up RNN inputs and outputs, labeled: MLP; e.g. text classification; e.g. POS tagging; seq2seq (later)]

SLIDE 29

Training: BPTT

  • “Unroll” the network across time-steps
  • Apply backprop to the “wide” network
  • Each cell has the same parameters
  • When updating parameters using the gradients, take the average across the time steps
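A small PyTorch sketch of the idea (an assumed setup, not the course's code): unroll one shared cell over the sequence, accumulate a per-step loss, and let backprop flow through the whole unrolled computation, so the shared parameters receive gradient contributions from every time step.

    import torch

    torch.manual_seed(0)
    cell = torch.nn.RNNCell(input_size=4, hidden_size=3)   # the same parameters at every step
    xs = torch.randn(5, 1, 4)                               # 5 time steps, batch of 1
    targets = torch.randn(5, 1, 3)

    h = torch.zeros(1, 3)
    loss = 0.0
    for t in range(5):                                      # the "unrolled", wide network
        h = cell(xs[t], h)
        loss = loss + ((h - targets[t]) ** 2).mean()

    loss.backward()   # gradients from all steps accumulate in cell.weight_hh, cell.weight_ih, ...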

SLIDE 30

Fancier RNNs

SLIDE 31

Vanishing/Exploding Gradients Problem

  • BPTT with vanilla RNNs faces a major problem:
    • The gradients can vanish (approach 0) across time
    • This makes it hard/impossible to learn long-distance dependencies, which are rampant in natural language
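A back-of-the-envelope illustration (not from the slides): the gradient through T steps is roughly a product of T per-step factors, so factors consistently below 1 drive it toward 0 (vanishing), while factors above 1 blow it up (exploding).

    T = 50
    print(0.9 ** T)   # ≈ 0.005: the signal from 50 steps back has all but disappeared
    print(1.1 ** T)   # ≈ 117: or, with slightly larger factors, it explodes instead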

SLIDE 32

Vanishing Gradients

[Figure (source): the chain of per-step gradients; if these are small (depends on W), the effect from t=4 on t=1 will be very small]

SLIDE 33

Vanishing Gradient Problem

source

SLIDE 34

Vanishing Gradient Problem

Graves 2012

SLIDE 35

Vanishing Gradient Problem

  • Gradient measures the effect of the past on the future
  • If it vanishes between t and t+n, can’t tell if:
    • There’s no dependency in fact
    • The weights in our network just haven’t yet captured the dependency

SLIDE 36

The need for long-distance dependencies

  • Language modeling (fill-in-the-blank):
    • The keys ____
    • The keys on the table ____
    • The keys next to the book on top of the table ____
  • To get the number on the verb, need to look at the subject, which can be very far away
    • And number can disagree with linearly-close nouns
  • Need models that can capture long-range dependencies like this. Vanishing gradients mean vanilla RNNs will have difficulty.

SLIDE 37

Long Short-Term Memory (LSTM)

SLIDE 38

LSTMs

  • Long Short-Term Memory (Hochreiter and Schmidhuber 1997)
  • The gold standard / default RNN
    • If someone says “RNN” now, they almost always mean “LSTM”
  • Originally: to solve the vanishing/exploding gradient problem for RNNs
    • Vanilla RNN: re-writes the entire hidden state at every time-step
    • LSTM: separate hidden state and memory
      • Read from and write to memory; can preserve long-term information

SLIDES 39-41

LSTMs

ft = σ(Wf · [ht−1, xt] + bf)
it = σ(Wi · [ht−1, xt] + bi)
ĉt = tanh(Wc · [ht−1, xt] + bc)
ct = ft ⊙ ct−1 + it ⊙ ĉt
ot = σ(Wo · [ht−1, xt] + bo)
ht = ot ⊙ tanh(ct)

🤕🤕🤸🤯

  • Key innovation: ct, ht = f(xt, ct−1, ht−1)
    • ct : a memory cell
    • Reading/writing (smooth), controlled by gates:
      • ft : forget gate
      • it : input gate
      • ot : output gate
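A minimal numpy sketch of one step of the equations above (assumed sizes and names): the gates are vectors of values in (0, 1) that control what is kept in, written to, and read out of the memory cell.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_cell(x, h_prev, c_prev, W, b):
        """One LSTM step; W and b are dicts holding the four gate parameters."""
        hx = np.concatenate([h_prev, x])            # [h_{t-1}, x_t]
        f = sigmoid(W["f"] @ hx + b["f"])           # forget gate
        i = sigmoid(W["i"] @ hx + b["i"])           # input gate
        c_hat = np.tanh(W["c"] @ hx + b["c"])       # candidate values
        c = f * c_prev + i * c_hat                  # update the memory cell
        o = sigmoid(W["o"] @ hx + b["o"])           # output gate
        h = o * np.tanh(c)                          # new hidden state
        return h, c

    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(3, 7)) for k in "fico"}   # hidden dim 3, input dim 4
    b = {k: np.zeros(3) for k in "fico"}
    h, c = lstm_cell(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)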
SLIDES 42-49

LSTMs

[Figure: the LSTM cell diagram, annotated step by step]

  • ft ∈ [0,1]^m : which cells to forget
    • Element-wise multiplication: 0 = erase, 1 = retain
  • it ∈ [0,1]^m : which cells to write to
  • ĉt : “candidate” / new values
  • Add the new values to memory: ct = ft ⊙ ct−1 + it ⊙ ĉt
  • ot ∈ [0,1]^m : which cells to output

Steinert-Threlkeld and Szymanik 2019; Olah 2015

SLIDE 50

LSTMs solve vanishing gradients

Graves 2012

SLIDE 51

Gated Recurrent Unit (GRU)

  • Cho et al 2014: gated like the LSTM, but with no separate memory cell
    • “Collapses” execution/control and memory
  • Fewer gates = fewer parameters, higher speed
    • Update gate ut
    • Reset gate rt

ut = σ(Wu ht−1 + Uu xt + bu)
rt = σ(Wr ht−1 + Ur xt + br)
h̃t = tanh(Wh (rt ⊙ ht−1) + Uh xt + bh)
ht = (1 − ut) ⊙ ht−1 + ut ⊙ h̃t
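The corresponding numpy sketch for one GRU step, following the equations above (sizes and names are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def gru_cell(x, h_prev, W, U, b):
        u = sigmoid(W["u"] @ h_prev + U["u"] @ x + b["u"])          # update gate
        r = sigmoid(W["r"] @ h_prev + U["r"] @ x + b["r"])          # reset gate
        h_tilde = np.tanh(W["h"] @ (r * h_prev) + U["h"] @ x + b["h"])
        return (1 - u) * h_prev + u * h_tilde                       # no separate memory cell

    rng = np.random.default_rng(1)
    W = {k: rng.normal(size=(3, 3)) for k in "urh"}                 # hidden dim 3
    U = {k: rng.normal(size=(3, 4)) for k in "urh"}                 # input dim 4
    b = {k: np.zeros(3) for k in "urh"}
    h = gru_cell(rng.normal(size=4), np.zeros(3), W, U, b)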

SLIDE 52

LSTM vs GRU

  • Generally: LSTM a good default choice
  • GRU can be used if speed and fewer parameters are important
  • Full differences between them not fully understood
  • Performance often comparable, but: LSTMs can store unboundedly large values in memory, and seem to e.g. count better

source

SLIDES 53-57

Two Extensions

  • Deep RNNs: stack several RNN layers, each layer reading the hidden states of the layer below
  • Bidirectional RNNs: a Forward RNN and a Backward RNN read the input in opposite directions; concatenate their states at each position

Source: RNN cheat sheet
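Both extensions are a one-line change in most toolkits. A small PyTorch sketch (assumed sizes, not the course's code): num_layers stacks RNN layers, and bidirectional=True runs a forward and a backward RNN and concatenates their states.

    import torch

    torch.manual_seed(0)
    rnn = torch.nn.LSTM(input_size=8, hidden_size=16, num_layers=2,
                        bidirectional=True, batch_first=True)
    x = torch.randn(1, 5, 8)                 # (batch, sequence length, input dim)
    outputs, (h_n, c_n) = rnn(x)
    print(outputs.shape)                     # torch.Size([1, 5, 32]): forward + backward states concatenated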

SLIDE 58

“The BiLSTM Hegemony”

  • Chris Manning, in 2017:

source

SLIDE 59

Seq2Seq + attention

SLIDE 60

Sequence to sequence problems

  • Many NLP tasks can be construed as sequence-to-sequence problems
    • Machine translation: sequence of source-language tokens to sequence of target-language tokens
    • Parsing: “Shane talks.” → “(S (NP (N Shane)) (VP V talks))”
      • Incl. semantic parsing
    • Summarization
  • NB: not the same as tagging, which assigns a label to each position in a given sequence

SLIDES 61-63

seq2seq architecture [e.g. NMT]

[Figure: an encoder RNN reads the source sequence; a decoder RNN then generates the target sequence]

Sutskever et al 2013

SLIDE 64

seq2seq results

SLIDES 65-69

seq2seq architecture: problem

[Figure: encoder and decoder, with the decoder conditioned only on the encoder’s final state]

  • The decoder can only see the info in this one vector: all info about the source must be “crammed” into it
  • Mooney 2014: “You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”

Sutskever et al 2013


SLIDES 72-80

Adding Attention

[Figure: encoder states h1 h2 h3 over source tokens w1 w2 w3; the decoder starts from ⟨s⟩ with state d1, attends to the encoder states, predicts w′1 through a Linear + softmax layer, then continues with d2 and predicts w′i at each step]

αij = a(hj, di)   (a is usually a dot product)
eij = softmax(α)j
ci = Σj eij hj

Bahdanau et al 2014
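A minimal numpy sketch of one attention step from the figure (assumed names; scoring by dot product, which requires the encoder and decoder states to have the same dimension):

    import numpy as np

    def attention_context(d_i, H):
        """Context vector for one decoder state d_i over encoder states H (one per row)."""
        alpha = H @ d_i                      # α_ij = h_j · d_i
        e = np.exp(alpha - alpha.max())
        e = e / e.sum()                      # e_ij = softmax(α)_j
        return e @ H                         # c_i = Σ_j e_ij h_j

    rng = np.random.default_rng(0)
    H = rng.normal(size=(3, 4))              # encoder states h1, h2, h3
    c1 = attention_context(rng.normal(size=4), H)   # fed, with d1, into the Linear + softmax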

SLIDES 81-84

Attention, Generally

  • A query q pays attention to some values {vj}, based on similarity with some keys {kj}
  • Dot-product attention:

αj = q ⋅ kj
ej = e^αj / Σj e^αj
c = Σj ej vj

  • In the previous example: the encoder hidden states played both the keys and the values roles
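The same computation, written as a generic function with separate keys and values (a sketch under the slide's notation; in the seq2seq example above, the keys and values are both the matrix of encoder states):

    import numpy as np

    def attention(q, K, V):
        """c = Σ_j e_j v_j, with e = softmax over scores α_j = q · k_j."""
        scores = K @ q
        e = np.exp(scores - scores.max())
        weights = e / e.sum()
        return weights @ V

    rng = np.random.default_rng(0)
    H = rng.normal(size=(3, 4))
    c = attention(rng.normal(size=4), H, H)   # keys = values = encoder states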

SLIDES 85-90

Why attention?

  • Incredibly useful (for performance)
    • By “solving” the bottleneck issue
  • Aids interpretability (maybe)
  • A general technique for combining representations, with applications in:
    • NMT, parsing, image/video captioning, …, everything

Bahdanau et al 2014; Vinyals et al 2015

SLIDE 91

Next Time

  • We will introduce a new type of large neural model: the Transformer
    • Hint: “Attention is All You Need” is the original paper
  • Introduce the idea of transfer learning and pre-training language models
  • Canvass recent developments and trends in that approach
    • What we might call “The Transformer Hegemony” or “The Muppet Hegemony”