SLIDE 1

CSEP 517 Natural Language Processing

Recurrent Neural Networks

Luke Zettlemoyer

(Slides adapted from Danqi Chen, Chris Manning, Abigail See, Andrej Karpathy)

SLIDE 2

Overview

  • What is a recurrent neural network (RNN)?
  • Simple RNNs
  • Backpropagation through time
  • Long short-term memory networks (LSTMs)
  • Applications
  • Variants: Stacked RNNs, Bidirectional RNNs
SLIDE 3

Recurrent neural networks (RNNs)

A class of neural networks that can handle variable-length inputs.

A function: y = RNN(x1, x2, …, xn) ∈ ℝd, where x1, …, xn ∈ ℝdin

SLIDE 4

Recurrent neural networks (RNNs)

RNNs have proven to be a highly effective approach to language modeling, sequence tagging, and text classification:

[Figure: example RNN architectures for language modeling, sequence tagging, and text classification, illustrated on the sentence "The movie sucks ."]

SLIDE 5

Recurrent neural networks (RNNs)

RNNs form the basis for modern approaches to machine translation, question answering, and dialogue:

SLIDE 6

Why variable-length?

Recall the feedforward neural LMs we learned, with a fixed window size of 3: the input is the concatenation of the three previous word embeddings, e.g.

x = [ethe, edogs, eare] ∈ ℝ3d

Example: "The dogs are barking" vs. "the dogs in the neighborhood are ___"; a fixed window cannot cover the longer context.

SLIDE 7

Simple RNNs

h0 ∈ ℝd is an initial state.

ht = f(ht−1, xt) ∈ ℝd

Simple RNNs:

ht = g(Wht−1 + Uxt + b) ∈ ℝd

W ∈ ℝd×d, U ∈ ℝd×din, b ∈ ℝd

g: nonlinearity (e.g. tanh)

ht: hidden state, which stores information from x1 to xt

SLIDE 8

Simple RNNs

Key idea: apply the same weights W repeatedly:

ht = g(Wht−1 + Uxt + b) ∈ ℝd
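A minimal sketch of this recurrence in NumPy; the shapes follow the slide (W ∈ ℝd×d, U ∈ ℝd×din, b ∈ ℝd), the same W, U, b are reused at every step, and the sizes and random initialization are only illustrative:

```python
import numpy as np

d, d_in, n = 4, 3, 5             # hidden size, input size, sequence length (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))      # hidden-to-hidden weights, shared across all time steps
U = rng.normal(size=(d, d_in))   # input-to-hidden weights
b = np.zeros(d)

xs = rng.normal(size=(n, d_in))  # x_1 ... x_n
h = np.zeros(d)                  # h_0: initial state

for x_t in xs:
    # h_t = g(W h_{t-1} + U x_t + b), with g = tanh
    h = np.tanh(W @ h + U @ x_t + b)

print(h)  # h_n summarizes the whole variable-length input
```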

SLIDE 9

RNNs vs Feedforward NNs

SLIDE 10

Recurrent Neural Language Models (RNNLMs)

P(w1, w2, …, wn) = P(w1) × P(w2 ∣ w1) × P(w3 ∣ w1, w2) × … × P(wn ∣ w1, w2, …, wn−1)

= P(w1 ∣ h0) × P(w2 ∣ h1) × P(w3 ∣ h2) × … × P(wn ∣ hn−1)

  • Denote ŷt = softmax(Woht), Wo ∈ ℝ|V|×d
  • Cross-entropy loss:

L(θ) = − (1/n) Σt=1..n log ŷt−1(wt),   θ = {W, U, b, Wo, E}

[Figure: the RNNLM unrolled over the example "the students opened their … exams …"]
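A sketch of how this model and loss might look in PyTorch; the class name, layer sizes, and toy batch are illustrative, not from the slides. The parameters correspond to θ = {W, U, b, Wo, E}: the embedding table E, the recurrent weights inside nn.RNN, and the output projection Wo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLM(nn.Module):
    """Minimal RNN language model: embeddings E, simple (tanh) RNN, output layer W_o."""
    def __init__(self, vocab_size, d_in=64, d=128):
        super().__init__()
        self.E = nn.Embedding(vocab_size, d_in)       # word embeddings
        self.rnn = nn.RNN(d_in, d, batch_first=True)  # holds W, U, b
        self.W_o = nn.Linear(d, vocab_size)           # projects h_t to vocabulary logits

    def forward(self, words):                  # words: (batch, n) token ids
        h, _ = self.rnn(self.E(words))         # h: (batch, n, d), one hidden state per position
        return self.W_o(h)                     # logits for \hat{y}_t at every position

model = RNNLM(vocab_size=10_000)
words = torch.randint(0, 10_000, (2, 12))      # toy batch of word ids
logits = model(words[:, :-1])                  # \hat{y}_0 ... \hat{y}_{n-1}
# L(theta) = -(1/n) * sum_t log \hat{y}_{t-1}(w_t): cross-entropy against the next word
loss = F.cross_entropy(logits.reshape(-1, 10_000), words[:, 1:].reshape(-1))
```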

SLIDE 11

Training RNNLMs

  • Backpropagation? Yes, but not that simple!
  • The algorithm is called Backpropagation Through Time (BPTT).
SLIDE 12

Backpropagation through time

h1 = g(Wh0 + Ux1 + b)
h2 = g(Wh1 + Ux2 + b)
h3 = g(Wh2 + Ux3 + b)
L3 = − log ŷ3(w4)

You should know how to compute ∂L3/∂h3. Then:

∂L3/∂W = (∂L3/∂h3)(∂h3/∂W) + (∂L3/∂h3)(∂h3/∂h2)(∂h2/∂W) + (∂L3/∂h3)(∂h3/∂h2)(∂h2/∂h1)(∂h1/∂W)

In general:

∂L/∂W = − (1/n) Σt=1..n Σk=1..t (∂Lt/∂ht) ( ∏j=k+1..t ∂hj/∂hj−1 ) (∂hk/∂W)

SLIDE 13

Truncated backpropagation through time

  • Backpropagation is very expensive if you handle long sequences
  • Run forward and backward through chunks of the sequence instead of the whole sequence
  • Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps (see the sketch below)
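A minimal sketch of truncated BPTT in PyTorch, assuming a hypothetical `model(inputs, h)` interface that returns (logits, new hidden state); the chunk length of 35 is arbitrary:

```python
import torch
import torch.nn.functional as F

def train_truncated_bptt(model, optimizer, token_ids, chunk_len=35):
    """One pass over a long token sequence of shape (batch, length) with truncated BPTT."""
    h = None
    for start in range(0, token_ids.size(1) - 1, chunk_len):
        T = min(chunk_len, token_ids.size(1) - 1 - start)
        inputs = token_ids[:, start:start + T]
        targets = token_ids[:, start + 1:start + 1 + T]

        logits, h = model(inputs, h)   # hypothetical interface: (logits, new hidden state)
        # Carry the hidden state forward in time, but detach it so gradients
        # only flow backward through this chunk (the truncation).
        h = tuple(s.detach() for s in h) if isinstance(h, tuple) else h.detach()

        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```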

SLIDE 14

Progress on language models

On the Penn Treebank (PTB) dataset

Metric: perplexity

(Mikolov and Zweig, 2012): Context dependent recurrent neural network language model

KN5: Kneser-Ney 5-gram

SLIDE 15

Progress on language models

(Yang et al, 2018): Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

On the Penn Treebank (PTB) dataset

Metric: perplexity

SLIDE 16

Vanishing/exploding gradients

  • Consider the gradient of Lt at step t, with respect to the hidden state hk at some previous step k (k < t):

∂Lt/∂hk = (∂Lt/∂ht) ∏t≥j>k ∂hj/∂hj−1 = (∂Lt/∂ht) ∏t≥j>k ( diag(g′(Whj−1 + Uxj + b)) W )

(advanced)

  • (Pascanu et al, 2013) showed that if the largest eigenvalue of W is less than 1 for g = tanh, then the gradient will shrink exponentially. This problem is called vanishing gradients.
  • In contrast, if the gradients are getting too large, it is called exploding gradients.

SLIDE 17

Why is exploding gradient a problem?

  • Gradients become too big and we take a very large step in SGD.
  • Solution: gradient clipping. If the norm of the gradient is greater than some threshold, scale it down before applying the SGD update (a sketch follows below).
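A minimal sketch of gradient clipping with PyTorch's built-in `clip_grad_norm_`; the toy model, dummy data, and threshold of 5.0 are illustrative:

```python
import torch
import torch.nn as nn

model = nn.RNN(8, 16, batch_first=True)               # toy model; sizes are illustrative
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

out, _ = model(torch.randn(4, 20, 8))                 # forward pass on dummy data
loss = out.pow(2).mean()                              # dummy loss, just to produce gradients
loss.backward()

# If the global norm of all gradients exceeds max_norm, rescale them so the
# norm equals max_norm; otherwise leave them unchanged. Then apply the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```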

SLIDE 18

Why is vanishing gradient a problem?

  • If the gradients become vanishingly small over long distances (step k to step t), then we can't tell whether:
    • We don't need long-term dependencies, or
    • We have the wrong parameters to capture the true dependency

Example: in "the dogs in the neighborhood are ___", it is still difficult to predict "barking".

  • How to fix the vanishing gradient problem?
    • LSTMs: Long short-term memory networks
    • GRUs: Gated recurrent units
SLIDE 19

Long Short-term Memory (LSTM)

  • A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem

ht = f(ht−1, xt) ∈ ℝd

  • Works extremely well in practice
  • Basic idea: turning multiplication into addition
  • Use “gates” to control how much information to add/erase
  • At each timestep, there is a hidden state ht ∈ ℝd and also a cell state ct ∈ ℝd
    • ct stores long-term information
    • We write/erase ct after each step
    • We read ht from ct

SLIDE 20

Long Short-term Memory (LSTM)

There are 4 gates:

  • Input gate (how much to write): it = σ(W(i)ht−1 + U(i)xt + b(i)) ∈ ℝd
  • Forget gate (how much to erase): ft = σ(W(f)ht−1 + U(f)xt + b(f)) ∈ ℝd
  • Output gate (how much to reveal): ot = σ(W(o)ht−1 + U(o)xt + b(o)) ∈ ℝd
  • New memory cell (what to write): c̃t = tanh(W(c)ht−1 + U(c)xt + b(c)) ∈ ℝd
  • Final memory cell: ct = ft ⊙ ct−1 + it ⊙ c̃t
  • Final hidden state: ht = ot ⊙ ct

(⊙ denotes element-wise product.)

How many parameters in total?
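A one-step sketch of these updates in NumPy, keeping the slide's notation; the sizes and random initialization are illustrative. It also answers the parameter-count question above: each of the four gates has its own W ∈ ℝd×d, U ∈ ℝd×din, and b ∈ ℝd, giving 4(d·d + d·din + d) parameters in total.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d_in = 4, 3                                  # illustrative sizes
rng = np.random.default_rng(0)
# One (W, U, b) triple per gate/candidate: 4 * (d*d + d*d_in + d) parameters in total.
params = {g: (rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d))
          for g in ("i", "f", "o", "c")}

def lstm_step(h_prev, c_prev, x_t):
    pre = lambda g: params[g][0] @ h_prev + params[g][1] @ x_t + params[g][2]
    i_t = sigmoid(pre("i"))              # input gate: how much of the new memory to write
    f_t = sigmoid(pre("f"))              # forget gate: how much of c_{t-1} to keep/erase
    o_t = sigmoid(pre("o"))              # output gate: how much of the cell to reveal
    c_tilde = np.tanh(pre("c"))          # new memory cell candidate
    c_t = f_t * c_prev + i_t * c_tilde   # element-wise (⊙): addition instead of repeated multiplication
    h_t = o_t * c_t                      # as on the slide; many formulations use o_t * tanh(c_t) here
    return h_t, c_t

h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(h, c, rng.normal(size=d_in))
```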

SLIDE 21

Long Short-term Memory (LSTM)

  • The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
  • LSTMs were invented in 1997 but only started to work well in practice around 2013-2015.
SLIDE 22

Is the LSTM architecture optimal?

(Jozefowicz et al, 2015): An Empirical Exploration of Recurrent Network Architectures

SLIDE 23

Overview

  • What is a recurrent neural network (RNN)?
  • Simple RNNs
  • Backpropagation through time
  • Long short-term memory networks (LSTMs)
  • Applications
  • Variants: Stacked RNNs, Bidirectional RNNs
SLIDE 24

Application: Text Generation

You can generate text by repeated sampling: the sampled output at each step becomes the next step's input.
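A sketch of such a sampling loop in PyTorch; the `model(word_id, h)` interface, `id2word`, and `bos_id` are hypothetical names for illustration, not from the slides:

```python
import torch

def sample_text(model, id2word, bos_id, max_len=50):
    """Generate by repeated sampling: each sampled word is fed back as the next input.

    Assumes a hypothetical interface model(word_id, h) -> (vocabulary logits, new hidden state).
    """
    words, h = [], None
    word_id = torch.tensor([bos_id])
    for _ in range(max_len):
        logits, h = model(word_id, h)
        probs = torch.softmax(logits.squeeze(), dim=-1)    # \hat{y}_t over the vocabulary
        word_id = torch.multinomial(probs, num_samples=1)  # sample the next word
        words.append(id2word[word_id.item()])
    return " ".join(words)
```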

SLIDE 25

Fun with RNNs

Andrej Karpathy “The Unreasonable Effectiveness of Recurrent Neural Networks”

Examples: Obama speeches, LaTeX generation

SLIDE 26

Application: Sequence Tagging

Input: a sentence of n words: x1, …, xn
Output: y1, …, yn, where yi ∈ {1, …, C}

P(yi = k) = softmaxk(Wohi), Wo ∈ ℝC×d

L = − (1/n) Σi=1..n log P(yi)   (the probability assigned to the gold tag yi)
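A sketch of an RNN tagger in PyTorch that predicts one of C tags from each hidden state hi; the class name, sizes, and toy batch are illustrative, and an LSTM stands in for the simple RNN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNTagger(nn.Module):
    """Tagger sketch: one label per word, predicted from that word's hidden state."""
    def __init__(self, vocab_size, num_tags, d_in=64, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_in)
        self.rnn = nn.LSTM(d_in, d, batch_first=True)
        self.W_o = nn.Linear(d, num_tags)          # W_o in R^{C x d}

    def forward(self, words):                      # (batch, n)
        h, _ = self.rnn(self.embed(words))         # (batch, n, d)
        return self.W_o(h)                         # per-token logits; softmax gives P(y_i = k)

tagger = RNNTagger(vocab_size=5000, num_tags=10)
words = torch.randint(0, 5000, (2, 7))
tags = torch.randint(0, 10, (2, 7))
logits = tagger(words)
# L = -(1/n) * sum_i log P(y_i): average cross-entropy over the n tag decisions
loss = F.cross_entropy(logits.reshape(-1, 10), tags.reshape(-1))
```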

SLIDE 27

Application: Text Classification

[Figure: an RNN reads "the movie was terribly exciting !" and the final hidden state hn is used for classification]

P(y = k) = softmaxk(Wohn), Wo ∈ ℝC×d

Input: a sentence of n words
Output: y ∈ {1, 2, …, C}
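A matching sketch for classification; it differs from the tagger above only in that a single label is predicted from the final hidden state hn. Names and sizes are again illustrative:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Classifier sketch: one label for the whole sentence, read off h_n."""
    def __init__(self, vocab_size, num_classes, d_in=64, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_in)
        self.rnn = nn.LSTM(d_in, d, batch_first=True)
        self.W_o = nn.Linear(d, num_classes)       # W_o in R^{C x d}

    def forward(self, words):
        _, (h_n, _) = self.rnn(self.embed(words))  # h_n: final hidden state, shape (1, batch, d)
        return self.W_o(h_n.squeeze(0))            # softmax over C classes gives P(y = k)

clf = RNNClassifier(vocab_size=5000, num_classes=2)
logits = clf(torch.randint(0, 5000, (2, 9)))       # (batch, num_classes)
```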

SLIDE 28

Multi-layer RNNs

  • RNNs are already “deep” in one dimension (unrolled over time steps)
  • We can also make them “deep” in another dimension by applying multiple RNNs
  • Multi-layer RNNs are also called stacked RNNs.
SLIDE 29

Multi-layer RNNs

The hidden states from RNN layer i are the inputs to RNN layer i + 1 (a sketch follows below).

  • In practice, using 2 to 4 layers is common (usually better than 1 layer)
  • Transformer-based networks can be up to 24 layers with lots of skip-connections.
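A minimal sketch of a stacked RNN using PyTorch's `num_layers` argument; the sizes are illustrative:

```python
import torch
import torch.nn as nn

# A 3-layer (stacked) LSTM: the hidden states of layer i are the inputs to layer i+1.
stacked = nn.LSTM(input_size=64, hidden_size=128, num_layers=3, batch_first=True)
x = torch.randn(2, 10, 64)                  # (batch, time, input_size), illustrative
top_layer_states, (h_n, c_n) = stacked(x)   # top_layer_states: top layer's h_t at every step
                                            # h_n: final hidden state of every layer, shape (3, 2, 128)
```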

SLIDE 30

Bidirectional RNNs

  • Bidirectionality is important in language representations. For example, the representation of “terribly” in “the movie was terribly exciting !” should depend on both:
    • its left context: “the movie was”
    • its right context: “exciting !”
SLIDE 31

Bidirectional RNNs

A unidirectional RNN computes ht = f(ht−1, xt) ∈ ℝd. A bidirectional RNN runs two RNNs:

Forward:  →ht = f1(→ht−1, xt) ∈ ℝd,   t = 1, 2, …, n
Backward: ←ht = f2(←ht+1, xt) ∈ ℝd,   t = n, n − 1, …, 1
Combined: ht = [→ht ; ←ht] ∈ ℝ2d
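A minimal sketch with PyTorch's `bidirectional=True`, which runs the forward and backward RNNs and concatenates their states into ℝ2d at every position; the sizes are illustrative:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: f1 runs left-to-right, f2 right-to-left; each position's
# representation is the concatenation [forward h_t ; backward h_t].
birnn = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(2, 10, 64)          # (batch, time, input_size), illustrative
h, _ = birnn(x)                     # h: (2, 10, 256), forward and backward states concatenated
```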

SLIDE 32

Bidirectional RNNs

  • Sequence tagging: Yes!
  • Text classification: Yes! With slight modifications: build a sentence encoding by taking an element-wise mean/max over the hidden states (e.g. over "the movie was terribly exciting !").
  • Text generation: No. Why?