SLIDE 1

Recurrent Neural Networks

CS60010: Deep Learning

Abir Das

IIT Kharagpur

Mar 11, 2020

SLIDE 2

Agenda

§ Get introduced to different recurrent neural architectures, e.g., RNNs, LSTMs, GRUs, etc.
§ Get introduced to tasks involving sequential inputs and/or sequential outputs.

SLIDE 3

Resources

§ Deep Learning by I. Goodfellow, Y. Bengio and A. Courville. [Link] [Chapter 10]
§ CS231n by Stanford University [Link]
§ Understanding LSTM Networks by Chris Olah [Link]

SLIDE 4

Why do we Need another NN Model?

§ So far, we have focused mainly on prediction problems with fixed-size inputs and outputs.
§ In image classification, the input is a fixed-size image and the output is its class; in video classification, the input is a fixed-size video and the output is its class; in bounding-box regression, the input is a fixed-size region proposal (resized/RoI pooled) and the output is the bounding-box coordinates.

SLIDE 5

Why do we Need another NN Model?

§ Suppose we want our model to write down the caption of this image.

Figure: Several people with umbrellas walk down a side walk on a rainy day.

Image source: COCO Dataset, ICCV 2015

SLIDE 6

Why do we Need another NN Model?

§ Will this work?

Several people with umbrellas

SLIDE 7

Why do we Need another NN Model?

§ Will this work?

Several people with umbrellas

§ When the model generates ‘people’, we need a way to tell the model that ‘several’ has already been generated and similarly for the other words.

SLIDE 8

Why do we Need another NN Model?

Several people with umbrellas

§ e.g., Image Captioning: image -> sequence of words

Image source: CS231n, Stanford (Fei-Fei Li, Justin Johnson & Serena Yeung), Lecture 10, May 4, 2017

SLIDE 9

Recurrent Neural Networks: Process Sequences

§ e.g., Image Captioning: image -> sequence of words
§ e.g., Sentiment Classification: sequence of words -> sentiment
§ e.g., Machine Translation: sequence of words -> sequence of words
§ e.g., Frame-level Video Classification: sequence of frames -> sequence of labels

Image source: CS231n from Stanford

SLIDE 10

Recurrent Neural Network

§ The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains at least one feedback connection, so that activations can flow around in a loop.
§ The feedback connection allows information to persist. Recall that generating 'people' requires remembering that 'several' has already been generated.
§ The simplest form of RNN has the previous set of hidden unit activations feeding back into the network along with the inputs.

[Figure: RNN block diagram - the inputs feed into the hidden units, whose activations h_t are fed back through a delay unit (as h_{t-1}) along with the next input, producing outputs y_t]

SLIDE 11

Recurrent Neural Network

[Figure: RNN block diagram with delayed hidden-state feedback, as on the previous slide]

§ Note that the concept of 'time' or sequential processing comes into the picture.
§ The activations are updated one time-step at a time.
§ The task of the delay unit is simply to delay the hidden layer activation until the next time-step.

SLIDE 12

Recurrent Neural Network

ℎ" = 𝑔 𝑦", ℎ"'(

New state Some function Old state Input vector

§ f, in particular, can be a layer of a neural network.

SLIDE 13

Recurrent Neural Network

ℎ" = 𝑔 𝑦", ℎ"'(

New state Some function Old state Input vector

§ f, in particular, can be a layer of a neural network. § Lets unroll the recurrent connection.

𝒊"#$ 𝑿& 𝑿' 𝑿( 𝒊" 𝑿& 𝑿& 𝒊")$ 𝑿' 𝑿( 𝑿' 𝑿( 𝒊")𝟑 𝒊+#$ 𝑿& 𝑿' 𝑿( 𝒚" 𝒚")$ 𝒚")- 𝒚+ 𝒛" 𝒛")$ 𝒛")- 𝒛+

§ Note that the weight matrices are the same across timesteps, i.e., the weights are shared for all the timesteps.

SLIDE 14

Recurrent Neural Network: Forward Pass

a_t = W_h h_{t-1} + W_i x_t    (1)
h_t = g(a_t)    (2)
y_t = W_o h_t    (3)

§ Note that we can have biases too; for simplicity these are omitted.
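A minimal NumPy sketch of this forward pass, assuming g = tanh (the function name, dimensions and random weights below are illustrative, not from the slides):

import numpy as np

def rnn_forward(xs, h0, Wh, Wi, Wo):
    # xs: (T, input_dim) sequence of input vectors; h0: (hidden_dim,) initial state.
    # Wh, Wi, Wo are the shared weight matrices (biases omitted, as on the slide).
    h = h0
    hs, ys = [], []
    for x in xs:                       # the same weights are reused at every timestep
        a = Wh @ h + Wi @ x            # eqn (1)
        h = np.tanh(a)                 # eqn (2), with g = tanh
        y = Wo @ h                     # eqn (3)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Example with arbitrary sizes:
rng = np.random.default_rng(0)
T, d, m, p = 5, 3, 4, 2
xs = rng.normal(size=(T, d))
Wh, Wi, Wo = (0.1 * rng.normal(size=s) for s in [(m, m), (m, d), (p, m)])
hs, ys = rnn_forward(xs, np.zeros(m), Wh, Wi, Wo)
print(ys.shape)    # (5, 2)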

SLIDE 15

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ BPTT: Backpropagation through time

SLIDE 16

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ BPTT: Backpropagation through time
§ Total loss L = Σ_{t=1}^{T} L_t, and we are after ∂L/∂W_o, ∂L/∂W_h and ∂L/∂W_i.

SLIDE 17

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ BPTT: Backpropagation through time
§ Total loss L = Σ_{t=1}^{T} L_t, and we are after ∂L/∂W_o, ∂L/∂W_h and ∂L/∂W_i.
§ Let's compute ∂L/∂y_t.

∂L/∂y_t = (∂L/∂L_t) (∂L_t/∂y_t) = 1 · ∂L_t/∂y_t    (4)

§ ∂L_t/∂y_t is computable depending on the particular form of the loss function.
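As a concrete case (not from the slides): with a squared-error loss against a target z_t, L_t = ½‖y_t − z_t‖², this term is simply ∂L_t/∂y_t = (y_t − z_t)ᵀ, a row vector in the convention used below where ∂L/∂h_T = (∂L/∂y_T) W_o.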

SLIDE 18

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let's compute ∂L/∂h_t. The subtlety here is that all L_t after timestep t are functions of h_t. So, let us first consider ∂L/∂h_T, where T is the last timestep.

SLIDE 19

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let's compute ∂L/∂h_t. The subtlety here is that all L_t after timestep t are functions of h_t. So, let us first consider ∂L/∂h_T, where T is the last timestep.

∂L/∂h_T = (∂L/∂y_T) (∂y_T/∂h_T) = (∂L/∂y_T) W_o    (5)

SLIDE 20

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let's compute ∂L/∂h_t. The subtlety here is that all L_t after timestep t are functions of h_t. So, let us first consider ∂L/∂h_T, where T is the last timestep.

∂L/∂h_T = (∂L/∂y_T) (∂y_T/∂h_T) = (∂L/∂y_T) W_o    (5)

§ ∂L/∂y_T we just computed on the last slide (eqn. (4)).
§ For a generic t, we need to compute ∂L/∂h_t. Note that h_t affects y_t and also h_{t+1}. For this we will use something that we used while studying backpropagation for feedforward networks.

SLIDE 21

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂L/∂h_t = (∂L/∂y_t)(∂y_t/∂h_t) + (∂L/∂h_{t+1})(∂h_{t+1}/∂h_t)    (6)

SLIDE 22

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂L/∂h_t = (∂L/∂y_t)(∂y_t/∂h_t) + (∂L/∂h_{t+1})(∂h_{t+1}/∂h_t)    (6)

§ ∂L/∂y_t we computed in eqn. (4).

SLIDE 23

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂L/∂h_t = (∂L/∂y_t)(∂y_t/∂h_t) + (∂L/∂h_{t+1})(∂h_{t+1}/∂h_t)    (6)

§ ∂y_t/∂h_t = W_o.

SLIDE 24

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂L/∂h_t = (∂L/∂y_t)(∂y_t/∂h_t) + (∂L/∂h_{t+1})(∂h_{t+1}/∂h_t)    (6)

§ ∂L/∂h_{t+1} is almost the same as ∂L/∂h_t; it is just for the next timestep.

SLIDE 25

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ If u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂L/∂h_t = (∂L/∂y_t)(∂y_t/∂h_t) + (∂L/∂h_{t+1})(∂h_{t+1}/∂h_t)    (6)

§ ∂h_{t+1}/∂h_t = (∂h_{t+1}/∂a_{t+1})(∂a_{t+1}/∂h_t) = g′ · W_h.
§ Since g is an elementwise operation, g′ will be a diagonal matrix.
§ In particular, if g is tanh, then ∂h_{t+1}/∂h_t = diag(1 − (h_{t+1,1})², 1 − (h_{t+1,2})², ..., 1 − (h_{t+1,m})²) · W_h

SLIDE 26

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ ∂L/∂h_t = (∂L/∂y_t) W_o + (∂L/∂h_{t+1}) diag(1 − (h_{t+1,1})², 1 − (h_{t+1,2})², ..., 1 − (h_{t+1,m})²) W_h
§ All the other things we can compute, but to compute ∂L/∂h_t we need ∂L/∂h_{t+1}.
§ From eqn. (5), we get ∂L/∂h_T, which gives ∂L/∂h_{T-1} and so on.

SLIDE 27

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ ∂L/∂h_t = (∂L/∂y_t) W_o + (∂L/∂h_{t+1}) diag(1 − (h_{t+1,1})², 1 − (h_{t+1,2})², ..., 1 − (h_{t+1,m})²) W_h
§ All the other things we can compute, but to compute ∂L/∂h_t we need ∂L/∂h_{t+1}.
§ From eqn. (5), we get ∂L/∂h_T, which gives ∂L/∂h_{T-1} and so on.
§ Now, ∂L/∂W_o = Σ_t ∂L_t/∂W_o = Σ_t (∂L_t/∂y_t)(∂y_t/∂W_o) = Σ_t (∂L_t/∂y_t) h_t
§ ∂L_t/∂y_t is computable depending on the form of the loss function.

SLIDE 28

Recurrent Neural Network: BPTT

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ ∂L/∂h_t = (∂L/∂y_t) W_o + (∂L/∂h_{t+1}) diag(1 − (h_{t+1,1})², 1 − (h_{t+1,2})², ..., 1 − (h_{t+1,m})²) W_h
§ All the other things we can compute, but to compute ∂L/∂h_t we need ∂L/∂h_{t+1}.
§ From eqn. (5), we get ∂L/∂h_T, which gives ∂L/∂h_{T-1} and so on.
§ Now, ∂L/∂W_o = Σ_t ∂L_t/∂W_o = Σ_t (∂L_t/∂y_t)(∂y_t/∂W_o) = Σ_t (∂L_t/∂y_t) h_t
§ ∂L_t/∂y_t is computable depending on the form of the loss function.
§ (Do it yourself) Similarly for ∂L/∂W_h and ∂L/∂W_i.
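A hedged NumPy sketch of the full BPTT recursion, following the same conventions as the forward-pass sketch above (g = tanh) and assuming a squared-error loss L_t = ½‖y_t − z_t‖²; the names and shapes are illustrative, and the W_h / W_i gradients are the standard "do it yourself" result rather than something stated on the slides:

import numpy as np

def bptt(xs, zs, h0, Wh, Wi, Wo):
    # xs: (T, d) inputs, zs: (T, p) targets, h0: (m,) initial hidden state.
    T = len(xs)
    hs, ys = [h0], []
    for t in range(T):                                   # forward pass, storing states
        hs.append(np.tanh(Wh @ hs[-1] + Wi @ xs[t]))
        ys.append(Wo @ hs[-1])

    dWo = np.zeros_like(Wo)
    dWh = np.zeros_like(Wh)
    dWi = np.zeros_like(Wi)
    da_next = np.zeros_like(h0)      # dL/da_{t+1}; zero beyond the last timestep

    for t in reversed(range(T)):                         # backward pass
        h, h_prev = hs[t + 1], hs[t]
        dy = ys[t] - zs[t]                               # dL_t/dy_t for squared error
        dWo += np.outer(dy, h)                           # dL/dWo = sum_t (dL_t/dy_t) h_t
        dh = dy @ Wo + da_next @ Wh                      # eqn (6): local term + term from t+1
        da = dh * (1.0 - h ** 2)                         # through g = tanh: dL/da_t
        dWh += np.outer(da, h_prev)                      # a_t = Wh h_{t-1} + Wi x_t
        dWi += np.outer(da, xs[t])
        da_next = da
    return dWo, dWh, dWi

Comparing these against finite-difference gradients is a quick sanity check for the recursion.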

SLIDE 29

Exploding or Vanishing Gradients

§ In recurrent nets (also in very deep nets), the final output is the composition of a large number of non-linear transformations.
§ The derivatives through these compositions will tend to be either very small or very large.
§ If h = (f ∘ g)(x) = f(g(x)), then h′(x) = f′(g(x)) g′(x).
§ If the gradients are small, the product is small.
§ If the gradients are large, the product is large.
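For intuition, a made-up numeric example (not from the slides): if each of 100 composed steps contributes a local derivative of 0.9, the product is 0.9¹⁰⁰ ≈ 2.7 × 10⁻⁵; if each contributes 1.1, it is 1.1¹⁰⁰ ≈ 1.4 × 10⁴. The gradient shrinks or grows exponentially with the number of timesteps.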

SLIDE 30

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let us see what happens with one learnable weight matrix θ = Wh

SLIDE 31

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let us see what happens with one learnable weight matrix, θ = W_h.

∂L/∂θ = Σ_{t=1}^{T} ∂L_t/∂θ

SLIDE 32

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let us see what happens with one learnable weight matrix, θ = W_h.

∂L/∂θ = Σ_{t=1}^{T} ∂L_t/∂θ,    ∂L_t/∂θ = (∂L_t/∂y_t)(∂y_t/∂θ)

SLIDE 33

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let us see what happens with one learnable weight matrix, θ = W_h.

∂L/∂θ = Σ_{t=1}^{T} ∂L_t/∂θ,    ∂L_t/∂θ = (∂L_t/∂y_t)(∂y_t/∂θ),    ∂y_t/∂θ = (∂y_t/∂h_t)(∂h_t/∂θ)

SLIDE 34

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Let us see what happens with one learnable weight matrix, θ = W_h.

∂L/∂θ = Σ_{t=1}^{T} ∂L_t/∂θ,    ∂L_t/∂θ = (∂L_t/∂y_t)(∂y_t/∂θ),    ∂y_t/∂θ = (∂y_t/∂h_t)(∂h_t/∂θ)

§ But h_t is a function of h_{t-1}, h_{t-2}, ..., h_2, h_1, and each of these is a function of θ.

SLIDE 35

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Now we will resort to our friend again: if u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

SLIDE 36

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Now we will resort to our friend again: if u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂h_t/∂θ = Σ_{k=1}^{t-1} (∂h_t/∂h_k)(∂h_k/∂θ)

SLIDE 37

Exploding or Vanishing Gradients

a_t = W_h h_{t-1} + W_i x_t,   h_t = g(a_t),   y_t = W_o h_t

§ Now we will resort to our friend again: if u = f(x, y), where x = φ(t), y = ψ(t), then ∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t)

∂h_t/∂θ = Σ_{k=1}^{t-1} (∂h_t/∂h_k)(∂h_k/∂θ)

§ And ∂h_t/∂h_k = (∂h_t/∂h_{t-1})(∂h_{t-1}/∂h_{t-2}) ··· (∂h_{k+1}/∂h_k), a product of (t − k) Jacobians; with g = tanh, each factor ∂h_j/∂h_{j-1} is of the form diag(1 − (h_{j,1})², 1 − (h_{j,2})², ..., 1 − (h_{j,m})²) · W_h.
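A small numeric sketch (my own illustration, not from the slides) of how this product of Jacobians behaves; its size is roughly governed by the largest singular value of W_h raised to the power (t − k):

import numpy as np

def jacobian_product_norm(Wh, steps, rng):
    # Norm of prod_j diag(1 - h_j^2) Wh for stand-in tanh hidden states h_j.
    J = np.eye(Wh.shape[0])
    for _ in range(steps):
        h = np.tanh(rng.normal(size=Wh.shape[0]))
        J = np.diag(1.0 - h ** 2) @ Wh @ J
    return np.linalg.norm(J)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))
for scale in (0.3, 1.0, 3.0):                         # small vs large recurrent weights
    Wh = scale * base / np.linalg.norm(base, 2)       # largest singular value = scale
    print(scale, jacobian_product_norm(Wh, steps=50, rng=rng))
# Typically: the norm collapses towards 0 for scale 0.3 (vanishing)
# and blows up for scale 3.0 (exploding).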

SLIDE 38

Exploding or Vanishing Gradients

§ Exploding Gradients
  ◮ Easy to detect
  ◮ Clip the gradient at a threshold
§ Vanishing Gradients
  ◮ More difficult to detect
  ◮ Architectures designed to combat the problem of vanishing gradients. Example: LSTMs by Hochreiter & Schmidhuber.
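A minimal sketch of clipping by global norm, one common way to implement the "clip the gradient at a threshold" remedy (the threshold value and function name below are my own, not from the slides):

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale a list of gradient arrays so their combined L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# e.g., applied to the BPTT gradients from the earlier sketch:
# dWo, dWh, dWi = clip_gradients([dWo, dWh, dWi], max_norm=5.0)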

SLIDE 39

Long Short Term Memory (LSTM)

§ Hochreiter & Schmidhuber (1997) solved the problem of getting an RNN to remember things for a long time (e.g., hundreds of time steps).
§ They designed a memory cell using logistic and linear units with multiplicative interactions.
§ Information is handled using three gates, namely forget, input and output.

SLIDE 40

Recall: Vanilla RNNs

§ In a standard RNN the repeating module has a simple structure.

Source: Chris Olah's blog

SLIDE 41

LSTMs

§ LSTMs also have this chain-like structure, but the repeating module has a different structure.
§ Instead of having a single neural network layer, there are four, interacting in a very special way.
§ The notations are as shown in the figure.

Source: Chris Olah's blog

SLIDE 42

LSTM Memory/Cell State

§ The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
§ The cell state is kind of like a conveyor belt. It's very easy for information to just flow along it unchanged.
§ The LSTM does have the ability to remove or add information to the cell state, carefully regulated by gates.

Source: Chris Olah's blog

SLIDE 43

Gate

§ Composed of a sigmoid neural net layer and a pointwise multiplication operation.
§ Sigmoid: outputs numbers between zero and one
  ◮ Zero: "let nothing through"
  ◮ One: "let everything through"
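A minimal sketch of what one gate computes, a sigmoid layer plus pointwise multiplication (the names and the concatenated [h_{t-1}, x_t] input are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(W, b, h_prev, x, signal):
    # The sigmoid output lies in (0, 1): 0 blocks `signal`, 1 lets it through unchanged.
    g = sigmoid(W @ np.concatenate([h_prev, x]) + b)
    return g * signal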

Source: Chris Olah's blog

SLIDE 44

Gate

§ An LSTM has three such gates.

Source: Chris Olah's blog

SLIDE 45

Forget Gate

§ The first step is to decide what information is going to be thrown away from the cell state. This decision is made by the "forget gate layer".
§ It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents "completely keep this" while a 0 represents "completely get rid of this".

Source: Chris Olah's blog

SLIDE 46

Input Gate

§ The next step is to decide what new information is going to be stored in the cell state. This has two parts.
§ First, a sigmoid layer called the "input gate layer" decides which values to update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state.
§ The next step combines these two to create an update to the state.

Source: Chris Olah's blog

SLIDE 47

Input Gate

§ It's now time to update the old cell state, C_{t-1}, into the new cell state C_t.
§ This is done by multiplying the old state by f_t, forgetting the things that were decided to be forgotten earlier, and then adding i_t ∗ C̃_t.

Source: Chris Olah's blog

SLIDE 48

Output Gate

§ Finally, we need to decide what is going to be the output. This output will be based on the cell state.
§ First, a sigmoid layer is run which decides what parts of the cell state are going to be output.
§ Then, the cell state is put through tanh (to push the values to be between −1 and 1) and this is multiplied by the output of the sigmoid gate.

Source: Chris Olah's blog
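Putting the three gates together, a sketch of one LSTM step in NumPy along the lines of Chris Olah's formulation (the weight names and the concatenated-input convention are assumptions for illustration; peephole and other variants are ignored):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, Wf, bf, Wi_, bi, Wc, bc, Wo_, bo):
    v = np.concatenate([h_prev, x])      # [h_{t-1}, x_t]

    f = sigmoid(Wf @ v + bf)             # forget gate: what to keep of C_{t-1}
    i = sigmoid(Wi_ @ v + bi)            # input gate: which values to update
    C_tilde = np.tanh(Wc @ v + bc)       # candidate values

    C = f * C_prev + i * C_tilde         # new cell state

    o = sigmoid(Wo_ @ v + bo)            # output gate: what parts of the state to expose
    h = o * np.tanh(C)                   # new hidden state / output
    return h, C

Each gate is exactly the sigmoid-plus-pointwise-multiplication pattern sketched earlier; the cell state C_t is updated additively, which is what lets information (and gradients) flow across many timesteps.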