

SLIDE 1

CS7015 (Deep Learning) : Lecture 15
Long Short Term Memory Cells (LSTMs), Gated Recurrent Units (GRUs)

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras

SLIDE 2

Module 15.1: Selective Read, Selective Write, Selective Forget - The Whiteboard Analogy

SLIDES 3-10

[Figure: an unrolled RNN; at each time step i the input xi and the previous state produce the state si (through weights U and W) and the output yi (through V).]

The state (si) of an RNN records information from all previous time steps.

At each new timestep the old information gets morphed by the current input.

One could imagine that after t steps the information stored at time step t − k (for some k < t) gets morphed so much that it would be impossible to extract the original information stored at time step t − k.
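This morphing can be seen in a tiny numerical sketch (not from the lecture; the tanh nonlinearity, the random weights, and the toy dimensions are all choices made only for this illustration):

```python
import numpy as np

# Toy illustration of state morphing: a distinctive input is written at t = 1,
# then 50 further inputs keep rewriting the fixed-size state.
rng = np.random.default_rng(0)
n = 4                                  # state size (arbitrary)
W = 0.9 * rng.normal(size=(n, n))      # recurrent weights
U = rng.normal(size=(n, n))            # input weights (input size = n here)

def step(s, x):
    # one RNN update: the whole previous state is morphed by the new input
    return np.tanh(W @ s + U @ x)

s = step(np.zeros(n), np.ones(n))      # distinctive information at t = 1
s_after_first = s.copy()
for _ in range(50):                    # later inputs keep overwriting the state
    s = step(s, rng.normal(size=n))
# s now bears little resemblance to s_after_first: what was written at the
# first time step has been morphed beyond recognition
```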

SLIDES 11-14

[Figure: the same unrolled RNN, with gradients flowing backwards from time step t.]

A similar problem occurs when the information flows backwards (backpropagation).

It is very hard to assign the responsibility of the error caused at time step t to the events that occurred at time step t − k.

This responsibility is of course in the form of gradients, and we studied this problem of backward gradient flow earlier.

We saw a formal argument for this while discussing vanishing gradients.

SLIDES 15-20

Let us see an analogy for this.

We can think of the state as a fixed-size memory.

Compare this to a fixed-size whiteboard that you use to record information.

At each time step (periodic intervals) we keep writing something to the board.

Effectively, at each time step we morph the information recorded till that time point.

After many timesteps it would be impossible to see how the information at time step t − k contributed to the state at timestep t.

SLIDES 21-26

Continuing our whiteboard analogy, suppose we are interested in deriving an expression on the whiteboard. We follow this strategy at each time step:

  • Selectively write on the board
  • Selectively read the already written content
  • Selectively forget (erase) some content

Let us look at each of these in detail.

SLIDES 27-42

Example: a = 1, b = 3, c = 5, d = 11. Compute ac(bd + a) + ad. Say the "board" can hold only 3 statements at a time. The full derivation has 6 steps:

Step 1: ac
Step 2: bd
Step 3: bd + a
Step 4: ac(bd + a)
Step 5: ad
Step 6: ac(bd + a) + ad

Selective write: There may be many steps in the derivation but we may just skip a few. In other words, we select what to write. (Board: ac = 5, bd = 33.)

Selective read: While writing one step we typically read some of the previous steps we have already written and then decide what to write next. For example, at Step 3 the information from Step 2 is important. In other words, we select what to read. (Board: ac = 5, bd = 33, bd + a = 34.)

Selective forget: Once the board is full, we need to delete some obsolete information. But how do we decide what to delete? We typically delete the least useful information. In other words, we select what to forget. (The board evolves: bd is erased to write ac(bd + a) = 170, then bd + a is erased to write ad = 11, and finally ac = 5 is erased to write the result ad + ac(bd + a) = 181.)
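The 3-statement board can be simulated in a few lines of Python (a toy sketch; the `write` helper with its `forget` argument is our own construction for illustration, not something from the lecture):

```python
# Toy simulation of the 3-statement whiteboard used to compute
# ac(bd + a) + ad with selective write, read, and forget.
a, b, c, d = 1, 3, 5, 11

board = {}  # the fixed-size "whiteboard": at most 3 statements

def write(name, value, forget=None):
    """Selectively write a statement, optionally forgetting one to make room."""
    if forget is not None:
        del board[forget]                      # selective forget
    assert len(board) < 3, "board full: must forget something first"
    board[name] = value

write("ac", a * c)                             # board: ac = 5
write("bd", b * d)                             # board: ac, bd = 33
write("bd+a", board["bd"] + a)                 # selective read of bd; board now full
write("ac(bd+a)", board["ac"] * board["bd+a"], forget="bd")    # erase bd
write("ad", a * d, forget="bd+a")              # erase bd + a
write("result", board["ad"] + board["ac(bd+a)"], forget="ac")  # erase ac
print(board["result"])                         # 181
```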

SLIDES 43-46

There are various other scenarios where we can motivate the need for selective write, read, and forget.

For example, you could think of our brain as something which can store only a finite number of facts.

At different time steps we selectively read, write, and forget some of these facts.

Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write, and forget.

SLIDE 47

Module 15.2: Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs)

SLIDES 48-50

Questions:

  • Can we give a concrete example where RNNs also need to selectively read, write, and forget?
  • How do we convert this intuition into mathematical equations?

We will see this over the next few slides.

SLIDES 51-57

Review: "The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance."

Consider the task of predicting the sentiment (positive/negative) of a review.

The RNN reads the document from left to right and updates the state after every word.

By the time we reach the end of the document, the information obtained from the first few words is completely lost.

Ideally, we want to:

  • forget the information added by stop words (a, the, etc.)
  • selectively read the information added by previous sentiment-bearing words (awesome, amazing, etc.)
  • selectively write new information from the current word to the state

SLIDE 58

Questions: Can we give a concrete example where RNNs also need to selectively read, write, and forget? How do we convert this intuition into mathematical equations?

SLIDES 59-62

Recall that the blue colored vector (st) is called the state of the RNN.

It has a finite size (st ∈ R^n) and is used to store all the information up to timestep t.

This state is analogous to the whiteboard: sooner or later it will get overloaded and the information from the initial states will get morphed beyond recognition.

Wishlist: selective write, selective read, and selective forget, to ensure that this finite-sized state vector is used effectively.

SLIDES 63-65

[Figure: example vectors st−1, xt, and st; st−1 is combined with the new input xt through selective read, selective write, and selective forget to produce st.]

Just to be clear, we have computed a state st−1 at timestep t − 1 and now we want to overload it with new information (xt) and compute a new state (st).

While doing so we want to make sure that we use selective write, selective read, and selective forget so that only important information is retained in st.

We will now see how to implement these items from our wishlist.

SLIDES 66-70

[Figure: st−1 and xt feed into st through the weights W and U and the nonlinearity σ.]

Selective Write: Recall that in RNNs we use st−1 to compute st:

st = σ(W st−1 + U xt)   (ignoring the bias)

But now, instead of passing st−1 as it is to st, we want to pass (write) only some portions of it to the next state.

In the strictest case our decisions could be binary (for example, retain the 1st and 3rd entries and delete the rest of the entries).

But a more sensible way of doing this would be to assign a value between 0 and 1 which determines what fraction of the current state to pass on to the next state.
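The plain update st = σ(W st−1 + U xt) can be written out directly (a minimal NumPy sketch; the random weights and toy sizes are assumptions made only for illustration):

```python
import numpy as np

# Minimal sketch of the plain RNN update s_t = sigma(W s_{t-1} + U x_t),
# bias ignored as on the slide; weights and sizes are arbitrary toy choices.
rng = np.random.default_rng(0)
n, m = 4, 3                        # state size n, input size m
W = rng.normal(size=(n, n))
U = rng.normal(size=(n, m))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(s_prev, x):
    # the whole of s_{t-1} is passed on: no selectivity yet
    return sigmoid(W @ s_prev + U @ x)

s = np.zeros(n)
for _ in range(5):
    s = rnn_step(s, rng.normal(size=m))
```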

SLIDES 71-74

[Figure: ht−1 = ot−1 ⊙ σ(st−1); each entry of st−1 is scaled by the corresponding entry of the gate vector ot−1 (selective write).]

Selective Write: We introduce a vector ot−1 which decides what fraction of each element of st−1 should be passed to the next state.

Each element of ot−1 gets multiplied with the corresponding element of st−1.

Each element of ot−1 is restricted to be between 0 and 1.

But how do we compute ot−1? How does the RNN know what fraction of the state to pass on?

slide-75
SLIDES 75–81

19/43

[Figure: selective write — st−1 ⊙ ot−1 = ht−1; ht−1 and the current input xt then feed the next state st through W and U]

Selective Write: Well, the RNN has to learn ot−1 along with the other parameters (W, U, V). We compute ot−1 and ht−1 as

ot−1 = σ(Woht−2 + Uoxt−1 + bo)

ht−1 = ot−1 ⊙ σ(st−1)

The parameters Wo, Uo, bo need to be learned along with the existing parameters W, U, V. The sigmoid (logistic) function ensures that the values are between 0 and 1. ot is called the output gate, as it decides how much to pass (write) to the next time step.
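The selective-write step above can be sketched in numpy. The vector sizes and random parameter values are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, n = 4, 3  # state size d and input size n (illustrative)

# parameters of the output gate, learned along with W, U, V
Wo = rng.standard_normal((d, d))
Uo = rng.standard_normal((d, n))
bo = np.zeros(d)

h_prev2 = rng.standard_normal(d)   # h_{t-2}
x_prev = rng.standard_normal(n)    # x_{t-1}
s_prev = rng.standard_normal(d)    # s_{t-1}

# o_{t-1} = sigma(Wo h_{t-2} + Uo x_{t-1} + bo): every entry lies in (0, 1)
o_prev = sigmoid(Wo @ h_prev2 + Uo @ x_prev + bo)

# selective write: h_{t-1} = o_{t-1} ⊙ sigma(s_{t-1})
h_prev = o_prev * sigmoid(s_prev)
```

Because both factors lie in (0, 1), each entry of ht−1 is a fraction of the squashed state, which is exactly the "write only a fraction" behaviour the slide describes.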

slide-82
SLIDES 82–84

20/43

[Figure: selective read — ht−1 and xt feed σ(Wht−1 + Uxt) to produce the candidate state s̃t]

Selective Read: We will now use ht−1 to compute the new state at the next time step. We will also use xt, which is the new input at time step t:

s̃t = σ(Wht−1 + Uxt + b)

Note that W, U and b are similar to the parameters that we used in the RNN (for simplicity we have not shown the bias b in the figure).

slide-85
SLIDES 85–89

21/43

[Figure: selective read — the input gate it is applied elementwise to the candidate state s̃t]

Selective Read: s̃t thus captures all the information from the previous state (ht−1) and the current input xt. However, we may not want to use all this new information, and only selectively read from it before constructing the new cell state st. To do this we introduce another gate called the input gate:

it = σ(Wiht−1 + Uixt + bi)

and use it ⊙ s̃t as the selectively read state information.
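The candidate state and input gate can be sketched the same way; again the sizes and random parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, n = 4, 3  # illustrative sizes

W, U, b = rng.standard_normal((d, d)), rng.standard_normal((d, n)), np.zeros(d)
Wi, Ui, bi = rng.standard_normal((d, d)), rng.standard_normal((d, n)), np.zeros(d)

h_prev = rng.standard_normal(d)  # h_{t-1}, the selectively written state
x_t = rng.standard_normal(n)     # current input

# candidate state: captures all information from h_{t-1} and x_t
s_tilde = sigmoid(W @ h_prev + U @ x_t + b)

# input gate decides what fraction of the candidate to read
i_t = sigmoid(Wi @ h_prev + Ui @ x_t + bi)

read = i_t * s_tilde  # the selectively read state information
```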

slide-90
SLIDES 90–96

22/43

[Figure: the selective write, selective read and candidate-state computations so far]

So far we have the following:

Previous state: st−1
Output gate: ot−1 = σ(Woht−2 + Uoxt−1 + bo)
Selectively write: ht−1 = ot−1 ⊙ σ(st−1)
Current (temporary) state: s̃t = σ(Wht−1 + Uxt + b)
Input gate: it = σ(Wiht−1 + Uixt + bi)
Selectively read: it ⊙ s̃t

slide-97
SLIDES 97–103

23/43

[Figure: selective forget — the forget gate ft is applied elementwise to st−1 before adding it ⊙ s̃t to form st]

Selective Forget: How do we combine st−1 and s̃t to get the new state? Here is one simple (but effective) way of doing this:

st = st−1 + it ⊙ s̃t

But we may not want to use the whole of st−1; we may want to forget some parts of it. To do this we introduce the forget gate:

ft = σ(Wfht−1 + Ufxt + bf)

st = ft ⊙ st−1 + it ⊙ s̃t
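The two update rules can be contrasted numerically. The gate and candidate values below are random stand-ins, only the combination step follows the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d = 4  # illustrative state size

s_prev = rng.standard_normal(d)            # s_{t-1}
s_tilde = sigmoid(rng.standard_normal(d))  # candidate state (stand-in values)
i_t = sigmoid(rng.standard_normal(d))      # input gate (stand-in values)
f_t = sigmoid(rng.standard_normal(d))      # forget gate (stand-in values)

# without a forget gate, the whole of s_{t-1} is carried over
s_simple = s_prev + i_t * s_tilde

# with the forget gate, parts of s_{t-1} can be dropped
s_t = f_t * s_prev + i_t * s_tilde
```

Since every entry of ft lies in (0, 1), the gated update can only shrink the contribution of st−1, never amplify it.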

slide-104
SLIDES 104–112

24/43

[Figure: the full LSTM computation at timestep t — selective write, selective read and selective forget producing the new state st and output ht]

We now have the full set of equations for LSTMs. The green box, together with the selective write operations following it, shows all the computations which happen at timestep t.

Gates:

ot = σ(Woht−1 + Uoxt + bo)
it = σ(Wiht−1 + Uixt + bi)
ft = σ(Wfht−1 + Ufxt + bf)

States:

s̃t = σ(Wht−1 + Uxt + b)
st = ft ⊙ st−1 + it ⊙ s̃t
ht = ot ⊙ σ(st), and rnnout = ht
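A minimal numpy sketch of one timestep using exactly these equations. Note the lecture writes σ for both the candidate state and the output squashing, where many standard LSTM presentations use tanh; sizes and parameter values here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM timestep following the lecture's equations
    (sigma everywhere; standard LSTMs often use tanh for the state)."""
    o_t = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])   # output gate
    i_t = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])   # input gate
    f_t = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])   # forget gate
    s_tilde = sigmoid(p["W"] @ h_prev + p["U"] @ x_t + p["b"])  # candidate state
    s_t = f_t * s_prev + i_t * s_tilde   # selective forget + selective read
    h_t = o_t * sigmoid(s_t)             # selective write; rnn_out = h_t
    return h_t, s_t

rng = np.random.default_rng(3)
d, n = 4, 3  # illustrative sizes
p = {k: rng.standard_normal((d, d)) for k in ("Wo", "Wi", "Wf", "W")}
p.update({k: rng.standard_normal((d, n)) for k in ("Uo", "Ui", "Uf", "U")})
p.update({k: np.zeros(d) for k in ("bo", "bi", "bf", "b")})

h, s = np.zeros(d), np.zeros(d)
for t in range(5):  # run a few timesteps on random inputs
    h, s = lstm_step(rng.standard_normal(n), h, s, p)
```

Each timestep takes (xt, ht−1, st−1) and returns (ht, st), so unrolling the cell over a sequence is just this loop.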

slide-113
SLIDE 113

25/43

Note: LSTM has many variants, which include different numbers of gates and also different arrangements of gates. The one we just saw is one of the most popular variants of LSTM. Another equally popular variant is the Gated Recurrent Unit (GRU), which we will see next.

slide-114
SLIDES 114–121

26/43

[Figure: the GRU computation — the gates read st−1 directly, and (1 − it) plays the role of the forget gate]

The full set of equations for GRUs.

Gates:

ot = σ(Wost−1 + Uoxt + bo)
it = σ(Wist−1 + Uixt + bi)

States:

s̃t = σ(W(ot ⊙ st−1) + Uxt + b)
st = (1 − it) ⊙ st−1 + it ⊙ s̃t

There is no explicit forget gate (the forget gate and input gate are tied). The gates depend directly on st−1, and not on the intermediate ht−1 as in the case of LSTMs.
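The same sketch for a GRU timestep; the tied forget gate means the state update is a convex combination of st−1 and s̃t. Sizes and parameter values are again illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, s_prev, p):
    """One GRU timestep as in the lecture: the gates read s_{t-1} directly,
    and the forget gate is tied to the input gate as (1 - i_t)."""
    o_t = sigmoid(p["Wo"] @ s_prev + p["Uo"] @ x_t + p["bo"])
    i_t = sigmoid(p["Wi"] @ s_prev + p["Ui"] @ x_t + p["bi"])
    s_tilde = sigmoid(p["W"] @ (o_t * s_prev) + p["U"] @ x_t + p["b"])
    return (1 - i_t) * s_prev + i_t * s_tilde

rng = np.random.default_rng(4)
d, n = 4, 3  # illustrative sizes
p = {k: rng.standard_normal((d, d)) for k in ("Wo", "Wi", "W")}
p.update({k: rng.standard_normal((d, n)) for k in ("Uo", "Ui", "U")})
p.update({k: np.zeros(d) for k in ("bo", "bi", "b")})

s = np.zeros(d)
for t in range(5):  # run a few timesteps on random inputs
    s = gru_step(rng.standard_normal(n), s, p)
```

Because st is always an elementwise convex combination of st−1 and s̃t ∈ (0, 1), the state stays bounded, one fewer gate and one fewer state than the LSTM above.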

slide-122
SLIDE 122

27/43

Module 15.3: How LSTMs avoid the problem of vanishing gradients


slide-123
SLIDES 123–126

28/43

[Figure: the full LSTM computation at timestep t, repeated for reference]

Intuition: During forward propagation the gates control the flow of information. They prevent any irrelevant information from being written to the state. Similarly, during backward propagation they control the flow of gradients. It is easy to see that during the backward pass the gradients will get multiplied by the gate.

slide-127
SLIDES 127–129

29/43

If the state at time t − 1 did not contribute much to the state at time t (i.e., if ft → 0 and ot−1 → 0), then during backpropagation the gradients flowing into st−1 will vanish. But this kind of vanishing gradient is fine: since st−1 did not contribute to st, we do not want to hold it responsible for the crimes of st. The key difference from vanilla RNNs is that the flow of information and gradients is controlled by the gates, which ensure that the gradients vanish only when they should (i.e., when st−1 did not contribute much to st).
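This can be made concrete for the direct cell-state path. Holding the gates and the candidate fixed (they also depend on ht−1, so this is only the direct path's contribution), the Jacobian of st = ft ⊙ st−1 + it ⊙ s̃t with respect to st−1 is diag(ft), so the forget gate directly scales the backward-flowing gradient. A small numeric sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4  # illustrative state size

# along the direct path s_t = f_t ⊙ s_{t-1} + i_t ⊙ s̃_t, with gates and
# candidate treated as constants, ∂s_t/∂s_{t-1} = diag(f_t)
f_t = rng.uniform(0.0, 1.0, d)      # illustrative forget-gate values in [0, 1)
grad_s_t = rng.standard_normal(d)   # gradient of the loss w.r.t. s_t

# gradient flowing into s_{t-1}: scaled elementwise by the forget gate
grad_s_prev = f_t * grad_s_t
```

If ft → 0 the gradient into st−1 vanishes, and that is exactly the desired behaviour, because st−1 did not contribute to st in the first place.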

slide-130
SLIDE 130

30/43

We will now see an illustrative proof of how the gates control the flow of gradients


slide-131
SLIDE 131

31/43

s1 W

V U

x1 L1(θ) s2 x2 L2(θ)

W V U

s3 x3 L3(θ)

W V U

s4 x4 L4(θ)

W V U

. . . s4 L4(θ) W s3 s2 s1 s0

Recall that RNNs had this multiplicative term which caused the gradients to vanish ∂Lt(θ) ∂W = ∂Lt(θ) ∂st

t

  • k=1

t−1

  • j=k

∂sj+1 ∂sj ∂+sk ∂W

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 15

slide-132
SLIDE 132

31/43

[Figure: unrolled RNN as above]

Recall that RNNs had this multiplicative term which caused the gradients to vanish:

∂Lt(θ)/∂W = ∂Lt(θ)/∂st · Σ(k=1 to t) [ ∏(j=k to t−1) ∂sj+1/∂sj ] · ∂⁺sk/∂W

In particular, if the loss L4(θ) was high because W was not good enough to compute s1 correctly, then this information will not be propagated back to W, as the gradient ∂Lt(θ)/∂W along this long path will vanish.

SLIDE 133

32/43

[Figure: unrolled RNN as above]

In general, the gradient of Lt(θ) w.r.t. θi vanishes when the gradients flowing through each and every path from Lt(θ) to θi vanish.

SLIDE 134

32/43

[Figure: unrolled RNN as above]

In general, the gradient of Lt(θ) w.r.t. θi vanishes when the gradients flowing through each and every path from Lt(θ) to θi vanish. On the other hand, the gradient of Lt(θ) w.r.t. θi explodes when the gradient flowing through at least one path explodes.

SLIDE 135

32/43

[Figure: unrolled RNN as above]

In general, the gradient of Lt(θ) w.r.t. θi vanishes when the gradients flowing through each and every path from Lt(θ) to θi vanish. On the other hand, the gradient of Lt(θ) w.r.t. θi explodes when the gradient flowing through at least one path explodes.

We will first argue that in the case of LSTMs there exists at least one path through which the gradients can flow effectively (and hence no vanishing gradients).

SLIDE 136

33/43

We will start with the dependency graph involving different variables in LSTMs

SLIDE 137

33/43

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1.

SLIDE 138

33/43

[Figure: dependency graph with nodes sk−1 and hk−1]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1.

SLIDE 139

33/43

[Figure: sk−1 and hk−1 feed the output gate ok via the parameters Wo, Uo, bo]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1:

ok = σ(Wo hk−1 + Uo xk + bo)

SLIDE 140

33/43

[Figure: sk−1 and hk−1 feed the output gate ok]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1:

ok = σ(Wo hk−1 + Uo xk + bo)

For simplicity we will omit the parameters for now and return to them later.

SLIDE 141

33/43

[Figure: sk−1 and hk−1 feed the gates ok, ik, fk and the candidate s̃k]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1. For simplicity we will omit the parameters for now and return to them later.

ok = σ(Wo hk−1 + Uo xk + bo)
ik = σ(Wi hk−1 + Ui xk + bi)
fk = σ(Wf hk−1 + Uf xk + bf)
s̃k = σ(W hk−1 + U xk + b)

SLIDE 142

33/43

[Figure: sk−1 and hk−1 feed the gates ok, ik, fk, the candidate s̃k, and the new state sk]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1. For simplicity we will omit the parameters for now and return to them later.

ok = σ(Wo hk−1 + Uo xk + bo)
ik = σ(Wi hk−1 + Ui xk + bi)
fk = σ(Wf hk−1 + Uf xk + bf)
s̃k = σ(W hk−1 + U xk + b)
sk = fk ⊙ sk−1 + ik ⊙ s̃k

SLIDE 143

33/43

[Figure: sk−1 and hk−1 feed the gates ok, ik, fk and the candidate s̃k, producing sk and hk]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1. For simplicity we will omit the parameters for now and return to them later.

ok = σ(Wo hk−1 + Uo xk + bo)
ik = σ(Wi hk−1 + Ui xk + bi)
fk = σ(Wf hk−1 + Uf xk + bf)
s̃k = σ(W hk−1 + U xk + b)
sk = fk ⊙ sk−1 + ik ⊙ s̃k
hk = ok ⊙ σ(sk)

SLIDE 144

34/43

[Figure: dependency graph — sk−1, hk−1 through the gates to sk and hk]

Starting from hk−1 and sk−1 we have reached hk and sk

SLIDE 145

34/43

[Figure: dependency graph extended from timestep k to st, ht and the loss Lt(θ)]

Starting from hk−1 and sk−1 we have reached hk and sk And the recursion will now continue till the last timestep

SLIDE 146

34/43

[Figure: dependency graph with the parameters Wo, Wi, Wf, W placed on the corresponding edges]

Starting from hk−1 and sk−1 we have reached hk and sk, and the recursion will now continue till the last timestep. For simplicity and ease of illustration, instead of considering the parameters (W, Wo, Wi, Wf, U, Uo, Ui, Uf) as separate nodes in the graph, we will just put them on the appropriate edges. (We show only a few parameters and not all.)

SLIDE 147

34/43

[Figure: LSTM dependency graph as above]

Starting from hk−1 and sk−1 we have reached hk and sk, and the recursion will now continue till the last timestep. For simplicity and ease of illustration, instead of considering the parameters (W, Wo, Wi, Wf, U, Uo, Ui, Uf) as separate nodes in the graph, we will just put them on the appropriate edges. (We show only a few parameters and not all.) We are now interested in knowing whether the gradient from Lt(θ) flows back to an arbitrary timestep k.

SLIDE 148

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk

SLIDE 149

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk. In other words, if Lt(θ) was high because Wf failed to compute an appropriate value for sk, then this information should flow back to Wf through the gradients.

SLIDE 150

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk. In other words, if Lt(θ) was high because Wf failed to compute an appropriate value for sk, then this information should flow back to Wf through the gradients. We can ask a similar question about the other parameters (for example, Wi, Wo, W, etc.).

SLIDE 151

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk. In other words, if Lt(θ) was high because Wf failed to compute an appropriate value for sk, then this information should flow back to Wf through the gradients. We can ask a similar question about the other parameters (for example, Wi, Wo, W, etc.). How does the LSTM ensure that this gradient does not vanish even at arbitrary time steps? Let us see.

SLIDE 152

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk).

SLIDE 153

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation).

SLIDE 154

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation). For example, there is one path through sk+1 and another through hk.

SLIDE 155

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation). For example, there is one path through sk+1 and another through hk. Further, there are multiple paths to reach hk itself (as should be obvious from the number of outgoing arrows from hk).

SLIDE 156

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation). For example, there is one path through sk+1 and another through hk. Further, there are multiple paths to reach hk itself (as should be obvious from the number of outgoing arrows from hk). So at this point just convince yourself that there are many paths from Lt(θ) to sk.

SLIDE 157

37/43

[Figure: LSTM dependency graph with one path from Lt(θ) back to sk highlighted]

Consider one such path (highlighted) which will contribute to the gradient

SLIDE 158

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0.

SLIDE 159

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

SLIDE 160

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

The first term ∂Lt(θ)/∂ht is fine and does not vanish (ht is directly connected to Lt(θ) and there are no intermediate nodes which can cause the gradient to vanish).

SLIDE 161

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

The first term ∂Lt(θ)/∂ht is fine and does not vanish (ht is directly connected to Lt(θ) and there are no intermediate nodes which can cause the gradient to vanish). We will now look at the other terms ∂ht/∂st and ∂st/∂st−1 (∀t).

SLIDE 162

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st.

SLIDE 163

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st).

SLIDE 164

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st). Note that hti only depends on oti and sti and not on any other elements of ot and st.

SLIDE 165

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st). Note that hti only depends on oti and sti and not on any other elements of ot and st.

∂ht/∂st will thus be a square diagonal matrix ∈ Rd×d whose diagonal is ot ⊙ σ′(st) ∈ Rd (see slide 35 of Lecture 14).

SLIDE 166

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st). Note that hti only depends on oti and sti and not on any other elements of ot and st.

∂ht/∂st will thus be a square diagonal matrix ∈ Rd×d whose diagonal is ot ⊙ σ′(st) ∈ Rd (see slide 35 of Lecture 14). We will represent this diagonal matrix by D(ot ⊙ σ′(st)).
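This diagonal structure can be verified with a finite-difference Jacobian (a sketch with made-up values for ot and st):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

o = np.array([0.8, 0.3, 0.6])   # made-up output gate values
s = np.array([0.5, -1.0, 2.0])  # made-up cell state values
analytic = o * sigmoid(s) * (1 - sigmoid(s))   # the diagonal o ⊙ σ'(s)

# Finite-difference Jacobian of h = o ⊙ σ(s) w.r.t. s
eps = 1e-6
J = np.zeros((3, 3))
for k in range(3):
    e = np.zeros(3); e[k] = eps
    J[:, k] = (o * sigmoid(s + e) - o * sigmoid(s - e)) / (2 * eps)

print(np.allclose(np.diag(J), analytic, atol=1e-6))   # diagonal matches o ⊙ σ'(s)
print(np.allclose(J - np.diag(np.diag(J)), 0.0))      # off-diagonal entries are zero
```

Because hti depends only on sti, perturbing component k of s changes only component k of h, so the numeric Jacobian comes out exactly diagonal.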

SLIDE 167

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1.

SLIDE 168

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t.

SLIDE 169

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant.

SLIDE 170

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant. So once again we are dealing with an ordered network, and thus ∂st/∂st−1 will be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14).

SLIDE 171

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant. So once again we are dealing with an ordered network, and thus ∂st/∂st−1 will be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14). For simplicity, let us assume that the gradient from the implicit term vanishes (we are assuming a worst-case scenario).

SLIDE 172

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant. So once again we are dealing with an ordered network, and thus ∂st/∂st−1 will be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14). For simplicity, let us assume that the gradient from the implicit term vanishes (we are assuming a worst-case scenario). The gradient from the explicit term (treating s̃t as a constant) is given by D(ft).

SLIDE 173

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

SLIDE 174

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

SLIDE 175

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)

SLIDE 176

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ ⋯ ⊙ fk+1)

SLIDE 177

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ ⋯ ⊙ fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(⊙(i=k+1 to t) fi)

SLIDE 178

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ ⋯ ⊙ fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(⊙(i=k+1 to t) fi)

The first two factors do not vanish, and the last factor contains a product of the forget gates. The forget gates thus regulate the gradient flow depending on the explicit contribution of a state st to the next state st+1.
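Since a product of diagonal matrices D(ft)⋯D(fk+1) is just D of the elementwise product of the gates, the surviving gradient is governed directly by the forget-gate values. A numeric sketch with made-up gate values:

```python
import numpy as np

T = 100
f_open = np.full(T, 0.99)   # forget gates mostly open across 100 timesteps
f_half = np.full(T, 0.5)    # forget gates half-closed across 100 timesteps

print(np.prod(f_open))   # the gradient largely survives
print(np.prod(f_half))   # the gradient vanishes -- exactly when it should
```

With gates near 1 the product over 100 steps stays a sizeable fraction of the original gradient, while half-closed gates drive it to numerical zero, which matches the regulation argument above.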

SLIDE 179

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st.

SLIDE 180

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation.

SLIDE 181

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.)

SLIDE 182

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.) Thus there exists this one path along which the gradient does not vanish when it shouldn't.

SLIDE 183

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.) Thus there exists this one path along which the gradient does not vanish when it shouldn't. And, as argued, as long as the gradient flows back to Wf through one of the paths (t0) through sk, we are fine!

SLIDE 184

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.) Thus there exists this one path along which the gradient does not vanish when it shouldn't. And, as argued, as long as the gradient flows back to Wf through one of the paths (t0) through sk, we are fine! Of course the gradient flows back only when required, as regulated by the fi's (but let me just say it one last time that this is fair).

SLIDE 185

42/43

[Figure: LSTM dependency graph]

Now we will see why LSTMs do not solve the problem of exploding gradients.

SLIDE 186

42/43

[Figure: LSTM dependency graph]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode.

SLIDE 187

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path.

SLIDE 188

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1

SLIDE 189

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

SLIDE 190

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

‖t1‖ ≤ ‖L′t(ht)‖ · (K‖Wo‖)^(t−k+1)

where K bounds the norm of each diagonal factor D(σ(si) ⊙ o′i).

SLIDE 191

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

‖t1‖ ≤ ‖L′t(ht)‖ · (K‖Wo‖)^(t−k+1)

where K bounds the norm of each diagonal factor D(σ(si) ⊙ o′i). Depending on the norm of the matrix Wo, the gradient ∂Lt(θ)/∂hk−1 may explode.
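A numeric sketch of this bound (the diagonal factor is replaced by 0.5·I and Wo by 3 times an orthogonal matrix — both made-up stand-ins — so every factor along the path has spectral norm 1.5):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Wo = 3.0 * np.linalg.qr(rng.standard_normal((d, d)))[0]  # spectral norm 3
D = 0.5 * np.eye(d)              # stand-in for D(σ(s) ⊙ o'), norm 0.5

prod = np.eye(d)
for _ in range(30):              # 30 timesteps along the highlighted path
    prod = (D @ Wo) @ prod
print(np.linalg.norm(prod, 2))   # grows like 1.5**30 -- the gradient explodes
```

Each factor D @ Wo has norm 1.5 > 1, so the product's norm grows geometrically with the path length, exactly the (K‖Wo‖)^(t−k+1) behaviour in the bound.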

SLIDE 192

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

‖t1‖ ≤ ‖L′t(ht)‖ · (K‖Wo‖)^(t−k+1)

where K bounds the norm of each diagonal factor D(σ(si) ⊙ o′i). Depending on the norm of the matrix Wo, the gradient ∂Lt(θ)/∂hk−1 may explode. Similarly, Wi, Wf and W can also cause the gradients to explode.

SLIDE 193

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients?

∗ Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." ICML (3) 28 (2013): 1310–1318.

SLIDE 194

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients? One popular trick is to use gradient clipping.

SLIDE 195

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients? One popular trick is to use gradient clipping. While backpropagating, if the norm of the gradient exceeds a certain value, it is scaled to keep its norm within an acceptable threshold∗.

SLIDE 196

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients? One popular trick is to use gradient clipping. While backpropagating, if the norm of the gradient exceeds a certain value, it is scaled to keep its norm within an acceptable threshold∗. Essentially we retain the direction of the gradient but scale down the norm.
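Gradient clipping by norm can be sketched in a few lines (the threshold value here is an arbitrary illustration):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; direction is kept."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])        # norm 50
print(clip_by_norm(g, 5.0))       # same direction, norm reduced to 5
```

Gradients whose norm is already below the threshold pass through unchanged, so only the explosive updates are tamed.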
