

SLIDE 1

Introduction to RNNs

Arun Mallya

Best viewed with Computer Modern fonts installed

SLIDE 2

Outline

  • Why Recurrent Neural Networks (RNNs)?
  • The Vanilla RNN unit
  • The RNN forward pass
  • Backpropagation refresher
  • The RNN backward pass
  • Issues with the Vanilla RNN
  • The Long Short-Term Memory (LSTM) unit
  • The LSTM Forward & Backward pass
  • LSTM variants and tips
    – Peephole LSTM
    – GRU

SLIDE 3

Motivation

  • Not all problems can be converted into one with fixed-length inputs and outputs
  • Problems such as Speech Recognition or Time-series Prediction require a system to store and use context information
    – Simple case: Output YES if the number of 1s is even, else NO
      1000010101 – YES, 100011 – NO, …
  • Hard/Impossible to choose a fixed context window
    – There can always be a new sample longer than anything seen

SLIDE 4

Recurrent Neural Networks (RNNs)

  • Recurrent Neural Networks take the previous output or hidden states as inputs. The composite input at time t has some historical information about the happenings at times T < t
  • RNNs are useful as their intermediate values (state) can store information about past inputs for a time that is not fixed a priori

SLIDE 5

Sample Feed-forward Network

[Diagram: a feed-forward network at t = 1; input x1 goes to hidden state h1, which produces output y1]

SLIDE 6

Sample RNN

[Diagram: an RNN over t = 1, 2, 3; inputs x1, x2, x3 produce hidden states h1, h2, h3 and outputs y1, y2, y3, with each hidden state feeding the next]

SLIDE 7

Sample RNN

[Diagram: the same RNN, now with an initial hidden state h0 feeding into h1]

SLIDE 8

The Vanilla RNN Cell

[Diagram: the vanilla RNN cell; xt and ht-1 enter the cell through weights W, producing ht]

h_t = tanh( W [x_t ; h_{t−1}] )
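A minimal NumPy sketch of this cell (the dimensions and initialization are illustrative assumptions; the slide's equation has no bias term, so none is added here):

```python
import numpy as np

def rnn_cell_step(x_t, h_prev, W):
    """One vanilla RNN step: h_t = tanh(W [x_t; h_{t-1}])."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]))

# Illustrative sizes: input dim 3, hidden dim 4, so W is (4, 3 + 4)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 7))
h0 = np.zeros(4)
h1 = rnn_cell_step(rng.normal(size=3), h0, W)
```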

SLIDE 9

The Vanilla RNN Forward

[Diagram: the RNN over three steps; (x1, h0) → h1 → y1 → C1, (x2, h1) → h2 → y2 → C2, (x3, h2) → h3 → y3 → C3]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

SLIDE 10

The Vanilla RNN Forward

[Diagram: the same unrolled RNN; shading indicates shared weights W across all time steps]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

SLIDE 11

Recurrent Neural Networks (RNNs)

  • Note that the weights are shared over time
  • Essentially, copies of the RNN cell are made over time (unrolling/unfolding), with different inputs at different time steps (see the sketch below)
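A hedged sketch of this unrolling in code, under the same assumptions as the cell sketch above: the same W is applied at every step; only the input and the carried hidden state change.

```python
import numpy as np

def rnn_forward(xs, h0, W):
    """Unroll the vanilla RNN over a sequence, sharing W across time steps."""
    h, hs = h0, []
    for x_t in xs:                                     # xs: inputs x_1 .. x_T
        h = np.tanh(W @ np.concatenate([x_t, h]))      # h_t = tanh(W [x_t; h_{t-1}])
        hs.append(h)
    return hs                                          # hidden states h_1 .. h_T
```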

SLIDE 12

Sentiment Classification

  • Classify a restaurant review from Yelp! OR movie review from IMDB OR … as positive or negative
  • Inputs: Multiple words, one or more sentences
  • Outputs: Positive / Negative classification
  • “The food was really good”
  • “The chicken crossed the road because it was uncooked”
SLIDE 13

Sentiment Classification

[Diagram: the first word, “The”, is fed into the RNN, producing h1]

SLIDE 14

Sentiment Classification

[Diagram: “The” → RNN → h1, then “food” → RNN → h2]

SLIDE 15

Sentiment Classification

[Diagram: the whole review is fed in word by word; “The” → h1, “food” → h2, …, “good” → hn]

SLIDE 16

Sentiment Classification

[Diagram: the final hidden state hn is fed into a Linear Classifier]

SLIDE 17

Sentiment Classification

[Diagram: only the final state hn is passed to the Linear Classifier; the intermediate states h1, h2, … are ignored]

SLIDE 18

Sentiment Classification

[Diagram: alternatively, all hidden states are combined: h = Sum(h1, h2, …, hn)]

http://deeplearning.net/tutorial/lstm.html

SLIDE 19

Sentiment Classification

[Diagram: h = Sum(h1, h2, …, hn) is fed into the Linear Classifier]

http://deeplearning.net/tutorial/lstm.html
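A minimal sketch of both read-outs (last hidden state vs. Sum of all states), reusing rnn_forward from the earlier sketch; the classifier weights W_clf and the two-class set-up are illustrative assumptions:

```python
import numpy as np

def classify_review(xs, h0, W, W_clf, use_sum=True):
    """Score a review as positive/negative from the RNN's hidden states."""
    hs = rnn_forward(xs, h0, W)                        # h_1 .. h_n for the word sequence
    feat = np.sum(hs, axis=0) if use_sum else hs[-1]   # h = Sum(h_1 .. h_n), or just h_n
    logits = W_clf @ feat                              # linear classifier over 2 classes
    return "positive" if logits.argmax() == 1 else "negative"
```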

SLIDE 20

Image Captioning

  • Given an image, produce a sentence describing its contents
  • Inputs: Image feature (from a CNN)
  • Outputs: Multiple words (let’s consider one sentence)

Example output: “The dog is hiding”

SLIDE 21

Image Captioning

[Diagram: the image is passed through a CNN; the CNN feature is fed to the RNN]

SLIDE 22

Image Captioning

[Diagram: the CNN feature is fed to the RNN; the hidden state goes through a Linear Classifier, which outputs the first word, “The”]

SLIDE 23

Image Captioning

[Diagram: the RNN is unrolled further; at each step the hidden state goes through a Linear Classifier, producing the next word (“The”, “dog”, …)]
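A hedged sketch of this decoding loop, following the note on a later slide that the previous word can be fed back as input: greedy word-by-word generation, where the embedding table, vocabulary indices, and end token are illustrative assumptions, and the CNN feature is assumed to already have the word-embedding size so the same W applies at every step.

```python
import numpy as np

def caption_image(cnn_feat, h0, W, W_clf, embed, end_id=0, max_len=20):
    """Greedy captioning: start from the CNN feature, then feed back predicted words."""
    h, inp, words = h0, cnn_feat, []
    for _ in range(max_len):
        h = np.tanh(W @ np.concatenate([inp, h]))   # vanilla RNN step
        word_id = int((W_clf @ h).argmax())         # linear classifier over the vocabulary
        if word_id == end_id:                       # stop at the (assumed) end token
            break
        words.append(word_id)
        inp = embed[word_id]                        # previous word becomes the next input
    return words
```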

SLIDE 24

RNN Outputs: Image Captions

Show and Tell: A Neural Image Caption Generator, CVPR 15

SLIDE 25

RNN Outputs: Language Modeling

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

SLIDE 26

Input – Output Scenarios

  • Single – Single: Feed-forward Network
  • Single – Multiple: Image Captioning
  • Multiple – Single: Sentiment Classification
  • Multiple – Multiple: Translation, Image Captioning

SLIDE 27

Input – Output Scenarios

Note: We might deliberately choose to frame our problem as a particular input-output scenario for ease of training or better performance. For example, for image captioning, provide the previous word as input at each time step (Single-Multiple to Multiple-Multiple).

SLIDE 28

The Vanilla RNN Forward

[Diagram: the RNN over three steps; (x1, h0) → h1 → y1 → C1, (x2, h1) → h2 → y2 → C2, (x3, h2) → h3 → y3 → C3]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

“Unfold” network through time by making copies at each time-step

SLIDE 29

BackPropagation Refresher

[Diagram: x → f(x; W) → y → Loss → C]

y = f(x; W)
C = Loss(y, y_GT)

SGD Update:  W ← W − η ∂C/∂W
∂C/∂W = (∂C/∂y)(∂y/∂W)
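As a concrete worked instance of this update, here is a hedged sketch with f taken to be a linear layer and the loss taken to be squared error (both assumptions; the slide keeps f and Loss abstract):

```python
import numpy as np

def sgd_step(W, x, y_gt, eta=0.1):
    """One SGD update W <- W - eta * dC/dW for y = W x, C = 0.5 ||y - y_gt||^2."""
    y = W @ x                        # forward: y = f(x; W)
    dC_dy = y - y_gt                 # dC/dy for the squared-error loss
    dC_dW = np.outer(dC_dy, x)       # chain rule: (dC/dy)(dy/dW)
    return W - eta * dC_dW           # SGD update
```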

SLIDE 30

Multiple Layers

[Diagram: x → f1(x; W1) → y1 → f2(y1; W2) → y2 → Loss → C]

y1 = f1(x; W1)
y2 = f2(y1; W2)
C = Loss(y2, y_GT)

SGD Update:
W2 ← W2 − η ∂C/∂W2
W1 ← W1 − η ∂C/∂W1

SLIDE 31

Chain Rule for Gradient Computation

[Diagram: x → f1(x; W1) → y1 → f2(y1; W2) → y2 → Loss → C]

y1 = f1(x; W1)
y2 = f2(y1; W2)
C = Loss(y2, y_GT)

Find ∂C/∂W1 and ∂C/∂W2.

Application of the Chain Rule:
∂C/∂W2 = (∂C/∂y2)(∂y2/∂W2)
∂C/∂W1 = (∂C/∂y1)(∂y1/∂W1) = (∂C/∂y2)(∂y2/∂y1)(∂y1/∂W1)
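A hedged worked example of the two products above, with both layers taken to be linear and the loss squared error (assumed forms; the slide leaves f1, f2, and Loss abstract):

```python
import numpy as np

def two_layer_grads(x, y_gt, W1, W2):
    """Gradients dC/dW1 and dC/dW2 via the chain rule."""
    y1 = W1 @ x                      # y1 = f1(x; W1)
    y2 = W2 @ y1                     # y2 = f2(y1; W2)
    dC_dy2 = y2 - y_gt               # from C = 0.5 ||y2 - y_gt||^2
    dC_dW2 = np.outer(dC_dy2, y1)    # (dC/dy2)(dy2/dW2)
    dC_dy1 = W2.T @ dC_dy2           # (dC/dy2)(dy2/dy1)
    dC_dW1 = np.outer(dC_dy1, x)     # (dC/dy1)(dy1/dW1)
    return dC_dW1, dC_dW2
```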

SLIDE 32

Chain Rule for Gradient Computation

[Diagram: a single layer x → f(x; W) → y]

We are interested in computing: ∂C/∂W and ∂C/∂x

Intrinsic to the layer are:
∂y/∂W – How does the output change due to the params
∂y/∂x – How does the output change due to the inputs

Given: ∂C/∂y

∂C/∂W = (∂C/∂y)(∂y/∂W)
∂C/∂x = (∂C/∂y)(∂y/∂x)

SLIDE 33

Chain Rule for Gradient Computation

[Diagram: the layer f(x; W) receives ∂C/∂y from above and passes ∂C/∂x to the layer below]

We are interested in computing: ∂C/∂W and ∂C/∂x

Intrinsic to the layer are:
∂y/∂W – How does the output change due to the params
∂y/∂x – How does the output change due to the inputs

Given: ∂C/∂y

∂C/∂W = (∂C/∂y)(∂y/∂W)
∂C/∂x = (∂C/∂y)(∂y/∂x)

Equations for common layers: http://arunmallya.github.io/writeups/nn/backprop.html

SLIDE 34

Extension to Computational Graphs

[Diagram: left, a simple chain x → f(x; W) → y; right, the output y of f(x; W) branches into two layers, f1(y; W1) → y1 and f2(y; W2) → y2]

SLIDE 35

Extension to Computational Graphs

[Diagram: in the branched graph, f1 sends back ∂C1/∂y and f2 sends back ∂C2/∂y; the layer f(x; W) receives their sum (Σ) and passes ∂C/∂x downstream]

SLIDE 36

Extension to Computational Graphs

[Diagram: the same branched graph; the gradients ∂C1/∂y and ∂C2/∂y arriving from the two branches are summed (Σ) before being propagated through f(x; W)]

Gradient Accumulation
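A hedged sketch of this accumulation, again with linear layers and squared-error losses as assumed forms: the gradients arriving from the two branches are added before back-propagating through the shared layer.

```python
import numpy as np

def branched_grads(x, W, W1, W2, gt1, gt2):
    """Gradient of C = C1 + C2 w.r.t. the shared layer's weights W."""
    y = W @ x                         # shared layer f(x; W)
    y1, y2 = W1 @ y, W2 @ y           # branches f1(y; W1) and f2(y; W2)
    dC1_dy = W1.T @ (y1 - gt1)        # gradient from branch 1 w.r.t. y
    dC2_dy = W2.T @ (y2 - gt2)        # gradient from branch 2 w.r.t. y
    dC_dy = dC1_dy + dC2_dy           # gradient accumulation (the Σ node)
    return np.outer(dC_dy, x)         # dC/dW through the shared layer
```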

SLIDE 37

BackPropagation Through Time (BPTT)

  • One of the methods used to train RNNs
  • The unfolded network (used during the forward pass) is treated as one big feed-forward network
  • This unfolded network accepts the whole time series as input
  • The weight updates are computed for each copy in the unfolded network, then summed (or averaged), and then applied to the RNN weights (see the sketch below)
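A hedged sketch of that summation for the vanilla RNN above; for brevity it assumes a squared-error loss applied directly to each hidden state h_t and no bias term, neither of which is prescribed by the slides.

```python
import numpy as np

def bptt_grads(xs, gts, h0, W):
    """BPTT: sum each unrolled copy's gradient into one gradient for the shared W."""
    hs, zs, h = [], [], h0
    for x_t in xs:                                # forward pass, caching what backward needs
        z = np.concatenate([x_t, h])
        h = np.tanh(W @ z)
        zs.append(z)
        hs.append(h)
    dW, dh_next, n_h = np.zeros_like(W), np.zeros_like(h0), h0.shape[0]
    for t in reversed(range(len(xs))):            # backward pass through the copies
        dh = (hs[t] - gts[t]) + dh_next           # local loss gradient + gradient from t+1
        dz = (1.0 - hs[t] ** 2) * dh              # back through tanh
        dW += np.outer(dz, zs[t])                 # accumulate the shared-weight gradient
        dh_next = (W.T @ dz)[-n_h:]               # part of dz that flows to h_{t-1}
    return dW
```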

SLIDE 38

The Unfolded Vanilla RNN

[Diagram: the unrolled RNN; (x1, h0) → h1 → y1 → C1, (x2, h1) → h2 → y2 → C2, (x3, h2) → h3 → y3 → C3]

  • Treat the unfolded network as one big feed-forward network!
  • This big network takes in the entire sequence as an input
  • Compute gradients through the usual backpropagation
  • Update shared weights
SLIDE 39

The Unfolded Vanilla RNN Forward

[Diagram: the forward pass through the unrolled network, from x1, x2, x3 and h0 to the losses C1, C2, C3]

SLIDE 40

The Unfolded Vanilla RNN Backward

[Diagram: the backward pass through the unrolled network; gradients flow from the losses C1, C2, C3 back through h3, h2, h1 to the shared weights]

SLIDE 41

The Vanilla RNN Backward

[Diagram: the unrolled RNN with gradients flowing from C_t back toward h_1]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_1)
          = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)

SLIDE 42

Issues with the Vanilla RNN

  • In the same way a product of k real numbers can shrink to zero or explode to infinity, so can a product of matrices
  • It is sufficient for λ1 < 1/γ, where λ1 is the largest singular value of W, for the vanishing gradients problem to occur, and it is necessary for exploding gradients that λ1 > 1/γ, where γ = 1 for the tanh non-linearity and γ = 1/4 for the sigmoid non-linearity¹
  • Exploding gradients are often controlled with gradient element-wise or norm clipping

¹ On the difficulty of training recurrent neural networks, Pascanu et al., 2013
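A minimal sketch of gradient norm clipping (the threshold is an illustrative choice):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale gradient arrays if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```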

SLIDE 43

The Identity Relationship

  • Recall:

    h_t = tanh( W [x_t ; h_{t−1}] ),  y_t = F(h_t),  C_t = Loss(y_t, GT_t)

    ∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_1)
              = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)

  • Suppose that instead of a matrix multiplication, we had an identity relationship between the hidden states:

    h_t = h_{t−1} + F(x_t)   ⇒   ∂h_t/∂h_{t−1} = 1

  • The gradient does not decay as the error is propagated all the way back, aka “Constant Error Flow”

SLIDE 44

The Identity Relationship

  • Recall:

    ∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)

  • Suppose that instead of a matrix multiplication, we had an identity relationship between the hidden states:

    h_t = h_{t−1} + F(x_t)   ⇒   ∂h_t/∂h_{t−1} = 1

  • The gradient does not decay as the error is propagated all the way back, aka “Constant Error Flow”

Remember ResNets?

SLIDE 45

Disclaimer

  • The explanations in the previous few slides are handwavy
  • For rigorous proofs and derivations, please refer to:
    – On the difficulty of training recurrent neural networks, Pascanu et al., 2013
    – Long Short-Term Memory, Hochreiter et al., 1997
    – and other sources

SLIDE 46

Long Short-Term Memory (LSTM)¹

  • The LSTM uses this idea of “Constant Error Flow” for RNNs to create a “Constant Error Carousel” (CEC), which ensures that gradients don’t decay
  • The key component is a memory cell that acts like an accumulator (contains the identity relationship) over time
  • Instead of computing the new state as a matrix product with the old state, it rather computes the difference between them. Expressivity is the same, but gradients are better behaved

¹ Long Short-Term Memory, Hochreiter et al., 1997
SLIDE 47

The LSTM Idea

[Diagram: the LSTM idea; xt and ht-1 enter through W, the result is added into the cell state ct, and ht is read out from the cell]

c_t = c_{t−1} + tanh( W [x_t ; h_{t−1}] )
h_t = tanh( c_t )

* Dashed line indicates time-lag

SLIDE 48

The Original LSTM Cell

[Diagram: the cell from the previous slide, now with an Input Gate it and an Output Gate ot, each computed from xt and ht-1 with weights Wi and Wo]

c_t = c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )
h_t = o_t ⊗ tanh( c_t )

i_t = σ( W_i [x_t ; h_{t−1}] + b_i )
Similarly for o_t

SLIDE 49

The Popular LSTM Cell

[Diagram: the cell from the previous slide with an added Forget Gate ft on the recurrent cell-state connection; the gates are computed from xt and ht-1 with weights Wi, Wo, Wf]

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )

f_t = σ( W_f [x_t ; h_{t−1}] + b_f )
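A minimal NumPy sketch of one step of this popular LSTM cell; the h_t read-out through the output gate follows the previous slide, and the separate weight matrices, biases, and the sigmoid helper are written out explicitly as illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, Wi, Wf, Wo, bi, bf, bo):
    """One LSTM step with input, forget, and output gates."""
    z = np.concatenate([x_t, h_prev])          # [x_t; h_{t-1}]
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)  # cell state: gated accumulator
    h_t = o_t * np.tanh(c_t)                   # hidden state read-out
    return h_t, c_t
```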

SLIDE 50

LSTM – Forward/Backward

Go To: Illustrated LSTM Forward and Backward Pass

SLIDE 51

Summary

  • RNNs allow for processing of variable-length inputs and outputs by maintaining state information across time steps
  • Various Input-Output scenarios are possible (Single/Multiple)
  • Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the CEC
  • Exploding gradients are handled by gradient clipping
  • More complex architectures are listed in the course materials for you to read, understand, and present

SLIDE 52

Other Useful Resources / References

  • http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
  • http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
  • R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, ICML 2013
  • S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997, 9(8), pp. 1735–1780
  • F.A. Gers and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000
  • K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems, 2016
  • K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, ACL 2014
  • R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, JMLR 2015