

SLIDE 1

CS7015 (Deep Learning) : Lecture 15
Long Short Term Memory Cells (LSTMs), Gated Recurrent Units (GRUs)

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras

SLIDE 2

Module 15.1: Selective Read, Selective Write, Selective Forget - The Whiteboard Analogy

SLIDES 3-10

[Figure: an unrolled RNN; at each time step i the input xi and the previous state produce the state si (through weights U and W) and the output yi (through V).]

The state (si) of an RNN records information from all previous time steps.

At each new timestep the old information gets morphed by the current input.

One could imagine that after t steps the information stored at time step t − k (for some k < t) gets morphed so much that it would be impossible to extract the original information stored at time step t − k.
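This morphing can be seen in a tiny numerical sketch (not from the lecture; the tanh nonlinearity, the random weights, and the toy dimensions are all choices made only for this illustration):

```python
import numpy as np

# Toy illustration of state morphing: a distinctive input is written at t = 1,
# then 50 further inputs keep rewriting the fixed-size state.
rng = np.random.default_rng(0)
n = 4                                  # state size (arbitrary)
W = 0.9 * rng.normal(size=(n, n))      # recurrent weights
U = rng.normal(size=(n, n))            # input weights (input size = n here)

def step(s, x):
    # one RNN update: the whole previous state is morphed by the new input
    return np.tanh(W @ s + U @ x)

s = step(np.zeros(n), np.ones(n))      # distinctive information at t = 1
s_after_first = s.copy()
for _ in range(50):                    # later inputs keep overwriting the state
    s = step(s, rng.normal(size=n))
# s now bears little resemblance to s_after_first: what was written at the
# first time step has been morphed beyond recognition
```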

SLIDES 11-14

[Figure: the same unrolled RNN, with gradients flowing backwards from time step t.]

A similar problem occurs when the information flows backwards (backpropagation).

It is very hard to assign the responsibility of the error caused at time step t to the events that occurred at time step t − k.

This responsibility is of course in the form of gradients, and we studied this problem of backward gradient flow earlier.

We saw a formal argument for this while discussing vanishing gradients.

SLIDES 15-20

Let us see an analogy for this.

We can think of the state as a fixed-size memory.

Compare this to a fixed-size whiteboard that you use to record information.

At each time step (periodic intervals) we keep writing something to the board.

Effectively, at each time step we morph the information recorded till that time point.

After many timesteps it would be impossible to see how the information at time step t − k contributed to the state at timestep t.

SLIDES 21-26

Continuing our whiteboard analogy, suppose we are interested in deriving an expression on the whiteboard. We follow this strategy at each time step:

  • Selectively write on the board
  • Selectively read the already written content
  • Selectively forget (erase) some content

Let us look at each of these in detail.

SLIDES 27-42

Example: a = 1, b = 3, c = 5, d = 11. Compute ac(bd + a) + ad. Say the "board" can hold only 3 statements at a time. The full derivation has 6 steps:

Step 1: ac
Step 2: bd
Step 3: bd + a
Step 4: ac(bd + a)
Step 5: ad
Step 6: ac(bd + a) + ad

Selective write: There may be many steps in the derivation but we may just skip a few. In other words, we select what to write. (Board: ac = 5, bd = 33.)

Selective read: While writing one step we typically read some of the previous steps we have already written and then decide what to write next. For example, at Step 3 the information from Step 2 is important. In other words, we select what to read. (Board: ac = 5, bd = 33, bd + a = 34.)

Selective forget: Once the board is full, we need to delete some obsolete information. But how do we decide what to delete? We typically delete the least useful information. In other words, we select what to forget. (The board evolves: bd is erased to write ac(bd + a) = 170, then bd + a is erased to write ad = 11, and finally ac = 5 is erased to write the result ad + ac(bd + a) = 181.)
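The 3-statement board can be simulated in a few lines of Python (a toy sketch; the `write` helper with its `forget` argument is our own construction for illustration, not something from the lecture):

```python
# Toy simulation of the 3-statement whiteboard used to compute
# ac(bd + a) + ad with selective write, read, and forget.
a, b, c, d = 1, 3, 5, 11

board = {}  # the fixed-size "whiteboard": at most 3 statements

def write(name, value, forget=None):
    """Selectively write a statement, optionally forgetting one to make room."""
    if forget is not None:
        del board[forget]                      # selective forget
    assert len(board) < 3, "board full: must forget something first"
    board[name] = value

write("ac", a * c)                             # board: ac = 5
write("bd", b * d)                             # board: ac, bd = 33
write("bd+a", board["bd"] + a)                 # selective read of bd; board now full
write("ac(bd+a)", board["ac"] * board["bd+a"], forget="bd")    # erase bd
write("ad", a * d, forget="bd+a")              # erase bd + a
write("result", board["ad"] + board["ac(bd+a)"], forget="ac")  # erase ac
print(board["result"])                         # 181
```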

SLIDES 43-46

There are various other scenarios where we can motivate the need for selective write, read, and forget.

For example, you could think of our brain as something which can store only a finite number of facts.

At different time steps we selectively read, write, and forget some of these facts.

Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write, and forget.

SLIDE 47

Module 15.2: Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs)

SLIDES 48-50

Questions:

  • Can we give a concrete example where RNNs also need to selectively read, write, and forget?
  • How do we convert this intuition into mathematical equations?

We will see this over the next few slides.

SLIDES 51-57

Review: "The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance."

Consider the task of predicting the sentiment (positive/negative) of a review.

The RNN reads the document from left to right and updates the state after every word.

By the time we reach the end of the document, the information obtained from the first few words is completely lost.

Ideally, we want to:

  • forget the information added by stop words (a, the, etc.)
  • selectively read the information added by previous sentiment-bearing words (awesome, amazing, etc.)
  • selectively write new information from the current word to the state

SLIDE 58

Questions: Can we give a concrete example where RNNs also need to selectively read, write, and forget? How do we convert this intuition into mathematical equations?

SLIDES 59-62

Recall that the blue colored vector (st) is called the state of the RNN.

It has a finite size (st ∈ R^n) and is used to store all the information up to timestep t.

This state is analogous to the whiteboard: sooner or later it will get overloaded and the information from the initial states will get morphed beyond recognition.

Wishlist: selective write, selective read, and selective forget, to ensure that this finite-sized state vector is used effectively.

SLIDES 63-65

[Figure: example vectors st−1, xt, and st; st−1 is combined with the new input xt through selective read, selective write, and selective forget to produce st.]

Just to be clear, we have computed a state st−1 at timestep t − 1 and now we want to overload it with new information (xt) and compute a new state (st).

While doing so we want to make sure that we use selective write, selective read, and selective forget so that only important information is retained in st.

We will now see how to implement these items from our wishlist.

SLIDES 66-70

[Figure: st−1 and xt feed into st through the weights W and U and the nonlinearity σ.]

Selective Write: Recall that in RNNs we use st−1 to compute st:

st = σ(W st−1 + U xt)   (ignoring the bias)

But now, instead of passing st−1 as it is to st, we want to pass (write) only some portions of it to the next state.

In the strictest case our decisions could be binary (for example, retain the 1st and 3rd entries and delete the rest of the entries).

But a more sensible way of doing this would be to assign a value between 0 and 1 which determines what fraction of the current state to pass on to the next state.
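The plain update st = σ(W st−1 + U xt) can be written out directly (a minimal NumPy sketch; the random weights and toy sizes are assumptions made only for illustration):

```python
import numpy as np

# Minimal sketch of the plain RNN update s_t = sigma(W s_{t-1} + U x_t),
# bias ignored as on the slide; weights and sizes are arbitrary toy choices.
rng = np.random.default_rng(0)
n, m = 4, 3                        # state size n, input size m
W = rng.normal(size=(n, n))
U = rng.normal(size=(n, m))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(s_prev, x):
    # the whole of s_{t-1} is passed on: no selectivity yet
    return sigmoid(W @ s_prev + U @ x)

s = np.zeros(n)
for _ in range(5):
    s = rnn_step(s, rng.normal(size=m))
```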

SLIDES 71-74

[Figure: ht−1 = ot−1 ⊙ σ(st−1); each entry of st−1 is scaled by the corresponding entry of the gate vector ot−1 (selective write).]

Selective Write: We introduce a vector ot−1 which decides what fraction of each element of st−1 should be passed to the next state.

Each element of ot−1 gets multiplied with the corresponding element of st−1.

Each element of ot−1 is restricted to be between 0 and 1.

But how do we compute ot−1? How does the RNN know what fraction of the state to pass on?

slide-75
SLIDES 75–81

19/43

[Figure: selective write — st−1 ⊙ ot−1 = ht−1; ht−1 and the current input xt then feed the next state st through W and U]

Selective Write: Well, the RNN has to learn ot−1 along with the other parameters (W, U, V). We compute ot−1 and ht−1 as

ot−1 = σ(Woht−2 + Uoxt−1 + bo)

ht−1 = ot−1 ⊙ σ(st−1)

The parameters Wo, Uo, bo need to be learned along with the existing parameters W, U, V. The sigmoid (logistic) function ensures that the values are between 0 and 1. ot is called the output gate, as it decides how much to pass (write) to the next time step.
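The selective-write step above can be sketched in numpy. The vector sizes and random parameter values are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, n = 4, 3  # state size d and input size n (illustrative)

# parameters of the output gate, learned along with W, U, V
Wo = rng.standard_normal((d, d))
Uo = rng.standard_normal((d, n))
bo = np.zeros(d)

h_prev2 = rng.standard_normal(d)   # h_{t-2}
x_prev = rng.standard_normal(n)    # x_{t-1}
s_prev = rng.standard_normal(d)    # s_{t-1}

# o_{t-1} = sigma(Wo h_{t-2} + Uo x_{t-1} + bo): every entry lies in (0, 1)
o_prev = sigmoid(Wo @ h_prev2 + Uo @ x_prev + bo)

# selective write: h_{t-1} = o_{t-1} ⊙ sigma(s_{t-1})
h_prev = o_prev * sigmoid(s_prev)
```

Because both factors lie in (0, 1), each entry of ht−1 is a fraction of the squashed state, which is exactly the "write only a fraction" behaviour the slide describes.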

slide-82
SLIDES 82–84

20/43

[Figure: selective read — ht−1 and xt feed σ(Wht−1 + Uxt) to produce the candidate state s̃t]

Selective Read: We will now use ht−1 to compute the new state at the next time step. We will also use xt, which is the new input at time step t:

s̃t = σ(Wht−1 + Uxt + b)

Note that W, U and b are similar to the parameters that we used in the RNN (for simplicity we have not shown the bias b in the figure).

slide-85
SLIDES 85–89

21/43

[Figure: selective read — the input gate it is applied elementwise to the candidate state s̃t]

Selective Read: s̃t thus captures all the information from the previous state (ht−1) and the current input xt. However, we may not want to use all this new information, and only selectively read from it before constructing the new cell state st. To do this we introduce another gate called the input gate:

it = σ(Wiht−1 + Uixt + bi)

and use it ⊙ s̃t as the selectively read state information.
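The candidate state and input gate can be sketched the same way; again the sizes and random parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, n = 4, 3  # illustrative sizes

W, U, b = rng.standard_normal((d, d)), rng.standard_normal((d, n)), np.zeros(d)
Wi, Ui, bi = rng.standard_normal((d, d)), rng.standard_normal((d, n)), np.zeros(d)

h_prev = rng.standard_normal(d)  # h_{t-1}, the selectively written state
x_t = rng.standard_normal(n)     # current input

# candidate state: captures all information from h_{t-1} and x_t
s_tilde = sigmoid(W @ h_prev + U @ x_t + b)

# input gate decides what fraction of the candidate to read
i_t = sigmoid(Wi @ h_prev + Ui @ x_t + bi)

read = i_t * s_tilde  # the selectively read state information
```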

slide-90
SLIDES 90–96

22/43

[Figure: the selective write, selective read and candidate-state computations so far]

So far we have the following:

Previous state: st−1
Output gate: ot−1 = σ(Woht−2 + Uoxt−1 + bo)
Selectively write: ht−1 = ot−1 ⊙ σ(st−1)
Current (temporary) state: s̃t = σ(Wht−1 + Uxt + b)
Input gate: it = σ(Wiht−1 + Uixt + bi)
Selectively read: it ⊙ s̃t

slide-97
SLIDES 97–103

23/43

[Figure: selective forget — the forget gate ft is applied elementwise to st−1 before adding it ⊙ s̃t to form st]

Selective Forget: How do we combine st−1 and s̃t to get the new state? Here is one simple (but effective) way of doing this:

st = st−1 + it ⊙ s̃t

But we may not want to use the whole of st−1; we may want to forget some parts of it. To do this we introduce the forget gate:

ft = σ(Wfht−1 + Ufxt + bf)

st = ft ⊙ st−1 + it ⊙ s̃t
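The two update rules can be contrasted numerically. The gate and candidate values below are random stand-ins, only the combination step follows the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d = 4  # illustrative state size

s_prev = rng.standard_normal(d)            # s_{t-1}
s_tilde = sigmoid(rng.standard_normal(d))  # candidate state (stand-in values)
i_t = sigmoid(rng.standard_normal(d))      # input gate (stand-in values)
f_t = sigmoid(rng.standard_normal(d))      # forget gate (stand-in values)

# without a forget gate, the whole of s_{t-1} is carried over
s_simple = s_prev + i_t * s_tilde

# with the forget gate, parts of s_{t-1} can be dropped
s_t = f_t * s_prev + i_t * s_tilde
```

Since every entry of ft lies in (0, 1), the gated update can only shrink the contribution of st−1, never amplify it.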

slide-104
SLIDES 104–112

24/43

[Figure: the full LSTM computation at timestep t — selective write, selective read and selective forget producing the new state st and output ht]

We now have the full set of equations for LSTMs. The green box, together with the selective write operations following it, shows all the computations which happen at timestep t.

Gates:

ot = σ(Woht−1 + Uoxt + bo)
it = σ(Wiht−1 + Uixt + bi)
ft = σ(Wfht−1 + Ufxt + bf)

States:

s̃t = σ(Wht−1 + Uxt + b)
st = ft ⊙ st−1 + it ⊙ s̃t
ht = ot ⊙ σ(st), and rnnout = ht
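A minimal numpy sketch of one timestep using exactly these equations. Note the lecture writes σ for both the candidate state and the output squashing, where many standard LSTM presentations use tanh; sizes and parameter values here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM timestep following the lecture's equations
    (sigma everywhere; standard LSTMs often use tanh for the state)."""
    o_t = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])   # output gate
    i_t = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])   # input gate
    f_t = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])   # forget gate
    s_tilde = sigmoid(p["W"] @ h_prev + p["U"] @ x_t + p["b"])  # candidate state
    s_t = f_t * s_prev + i_t * s_tilde   # selective forget + selective read
    h_t = o_t * sigmoid(s_t)             # selective write; rnn_out = h_t
    return h_t, s_t

rng = np.random.default_rng(3)
d, n = 4, 3  # illustrative sizes
p = {k: rng.standard_normal((d, d)) for k in ("Wo", "Wi", "Wf", "W")}
p.update({k: rng.standard_normal((d, n)) for k in ("Uo", "Ui", "Uf", "U")})
p.update({k: np.zeros(d) for k in ("bo", "bi", "bf", "b")})

h, s = np.zeros(d), np.zeros(d)
for t in range(5):  # run a few timesteps on random inputs
    h, s = lstm_step(rng.standard_normal(n), h, s, p)
```

Each timestep takes (xt, ht−1, st−1) and returns (ht, st), so unrolling the cell over a sequence is just this loop.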

slide-113
SLIDE 113

25/43

Note: LSTM has many variants, which include different numbers of gates and also different arrangements of gates. The one we just saw is one of the most popular variants of LSTM. Another equally popular variant is the Gated Recurrent Unit (GRU), which we will see next.

slide-114
SLIDES 114–121

26/43

[Figure: the GRU computation — the gates read st−1 directly, and (1 − it) plays the role of the forget gate]

The full set of equations for GRUs.

Gates:

ot = σ(Wost−1 + Uoxt + bo)
it = σ(Wist−1 + Uixt + bi)

States:

s̃t = σ(W(ot ⊙ st−1) + Uxt + b)
st = (1 − it) ⊙ st−1 + it ⊙ s̃t

There is no explicit forget gate (the forget gate and input gate are tied). The gates depend directly on st−1, and not on the intermediate ht−1 as in the case of LSTMs.
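The same sketch for a GRU timestep; the tied forget gate means the state update is a convex combination of st−1 and s̃t. Sizes and parameter values are again illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, s_prev, p):
    """One GRU timestep as in the lecture: the gates read s_{t-1} directly,
    and the forget gate is tied to the input gate as (1 - i_t)."""
    o_t = sigmoid(p["Wo"] @ s_prev + p["Uo"] @ x_t + p["bo"])
    i_t = sigmoid(p["Wi"] @ s_prev + p["Ui"] @ x_t + p["bi"])
    s_tilde = sigmoid(p["W"] @ (o_t * s_prev) + p["U"] @ x_t + p["b"])
    return (1 - i_t) * s_prev + i_t * s_tilde

rng = np.random.default_rng(4)
d, n = 4, 3  # illustrative sizes
p = {k: rng.standard_normal((d, d)) for k in ("Wo", "Wi", "W")}
p.update({k: rng.standard_normal((d, n)) for k in ("Uo", "Ui", "U")})
p.update({k: np.zeros(d) for k in ("bo", "bi", "b")})

s = np.zeros(d)
for t in range(5):  # run a few timesteps on random inputs
    s = gru_step(rng.standard_normal(n), s, p)
```

Because st is always an elementwise convex combination of st−1 and s̃t ∈ (0, 1), the state stays bounded, one fewer gate and one fewer state than the LSTM above.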

slide-122
SLIDE 122

27/43

Module 15.3: How LSTMs avoid the problem of vanishing gradients


slide-123
SLIDES 123–126

28/43

[Figure: the full LSTM computation at timestep t, repeated for reference]

Intuition: During forward propagation the gates control the flow of information. They prevent any irrelevant information from being written to the state. Similarly, during backward propagation they control the flow of gradients. It is easy to see that during the backward pass the gradients will get multiplied by the gate.

slide-127
SLIDES 127–129

29/43

If the state at time t − 1 did not contribute much to the state at time t (i.e., if ft → 0 and ot−1 → 0), then during backpropagation the gradients flowing into st−1 will vanish. But this kind of vanishing gradient is fine: since st−1 did not contribute to st, we do not want to hold it responsible for the crimes of st. The key difference from vanilla RNNs is that the flow of information and gradients is controlled by the gates, which ensure that the gradients vanish only when they should (i.e., when st−1 did not contribute much to st).
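This can be made concrete for the direct cell-state path. Holding the gates and the candidate fixed (they also depend on ht−1, so this is only the direct path's contribution), the Jacobian of st = ft ⊙ st−1 + it ⊙ s̃t with respect to st−1 is diag(ft), so the forget gate directly scales the backward-flowing gradient. A small numeric sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4  # illustrative state size

# along the direct path s_t = f_t ⊙ s_{t-1} + i_t ⊙ s̃_t, with gates and
# candidate treated as constants, ∂s_t/∂s_{t-1} = diag(f_t)
f_t = rng.uniform(0.0, 1.0, d)      # illustrative forget-gate values in [0, 1)
grad_s_t = rng.standard_normal(d)   # gradient of the loss w.r.t. s_t

# gradient flowing into s_{t-1}: scaled elementwise by the forget gate
grad_s_prev = f_t * grad_s_t
```

If ft → 0 the gradient into st−1 vanishes, and that is exactly the desired behaviour, because st−1 did not contribute to st in the first place.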

slide-130
SLIDE 130

30/43

We will now see an illustrative proof of how the gates control the flow of gradients


slide-131
SLIDE 131

31/43

s1 W

V U

x1 L1(θ) s2 x2 L2(θ)

W V U

s3 x3 L3(θ)

W V U

s4 x4 L4(θ)

W V U

. . . s4 L4(θ) W s3 s2 s1 s0

Recall that RNNs had this multiplicative term which caused the gradients to vanish ∂Lt(θ) ∂W = ∂Lt(θ) ∂st

t

  • k=1

t−1

  • j=k

∂sj+1 ∂sj ∂+sk ∂W

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 15

slide-132
SLIDE 132

31/43

[Figure: unrolled RNN as above]

Recall that RNNs had this multiplicative term which caused the gradients to vanish:

∂Lt(θ)/∂W = ∂Lt(θ)/∂st · Σ(k=1 to t) [ ∏(j=k to t−1) ∂sj+1/∂sj ] · ∂⁺sk/∂W

In particular, if the loss L4(θ) was high because W was not good enough to compute s1 correctly, then this information will not be propagated back to W, as the gradient ∂Lt(θ)/∂W along this long path will vanish.

SLIDE 133

32/43

[Figure: unrolled RNN as above]

In general, the gradient of Lt(θ) w.r.t. θi vanishes when the gradients flowing through each and every path from Lt(θ) to θi vanish.

SLIDE 134

32/43

[Figure: unrolled RNN as above]

In general, the gradient of Lt(θ) w.r.t. θi vanishes when the gradients flowing through each and every path from Lt(θ) to θi vanish. On the other hand, the gradient of Lt(θ) w.r.t. θi explodes when the gradient flowing through at least one path explodes.

SLIDE 135

32/43

[Figure: unrolled RNN as above]

In general, the gradient of Lt(θ) w.r.t. θi vanishes when the gradients flowing through each and every path from Lt(θ) to θi vanish. On the other hand, the gradient of Lt(θ) w.r.t. θi explodes when the gradient flowing through at least one path explodes.

We will first argue that in the case of LSTMs there exists at least one path through which the gradients can flow effectively (and hence no vanishing gradients).

SLIDE 136

33/43

We will start with the dependency graph involving different variables in LSTMs

SLIDE 137

33/43

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1.

SLIDE 138

33/43

[Figure: dependency graph with nodes sk−1 and hk−1]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1.

SLIDE 139

33/43

[Figure: sk−1 and hk−1 feed the output gate ok via the parameters Wo, Uo, bo]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1:

ok = σ(Wo hk−1 + Uo xk + bo)

SLIDE 140

33/43

[Figure: sk−1 and hk−1 feed the output gate ok]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1:

ok = σ(Wo hk−1 + Uo xk + bo)

For simplicity we will omit the parameters for now and return to them later.

SLIDE 141

33/43

[Figure: sk−1 and hk−1 feed the gates ok, ik, fk and the candidate s̃k]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1. For simplicity we will omit the parameters for now and return to them later.

ok = σ(Wo hk−1 + Uo xk + bo)
ik = σ(Wi hk−1 + Ui xk + bi)
fk = σ(Wf hk−1 + Uf xk + bf)
s̃k = σ(W hk−1 + U xk + b)

SLIDE 142

33/43

[Figure: sk−1 and hk−1 feed the gates ok, ik, fk, the candidate s̃k, and the new state sk]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1. For simplicity we will omit the parameters for now and return to them later.

ok = σ(Wo hk−1 + Uo xk + bo)
ik = σ(Wi hk−1 + Ui xk + bi)
fk = σ(Wf hk−1 + Uf xk + bf)
s̃k = σ(W hk−1 + U xk + b)
sk = fk ⊙ sk−1 + ik ⊙ s̃k

SLIDE 143

33/43

[Figure: sk−1 and hk−1 feed the gates ok, ik, fk and the candidate s̃k, producing sk and hk]

We will start with the dependency graph involving the different variables in an LSTM, starting with the states at timestep k − 1. For simplicity we will omit the parameters for now and return to them later.

ok = σ(Wo hk−1 + Uo xk + bo)
ik = σ(Wi hk−1 + Ui xk + bi)
fk = σ(Wf hk−1 + Uf xk + bf)
s̃k = σ(W hk−1 + U xk + b)
sk = fk ⊙ sk−1 + ik ⊙ s̃k
hk = ok ⊙ σ(sk)

SLIDE 144

34/43

[Figure: dependency graph — sk−1, hk−1 through the gates to sk and hk]

Starting from hk−1 and sk−1 we have reached hk and sk

SLIDE 145

34/43

[Figure: dependency graph extended from timestep k to st, ht and the loss Lt(θ)]

Starting from hk−1 and sk−1 we have reached hk and sk And the recursion will now continue till the last timestep

SLIDE 146

34/43

[Figure: dependency graph with the parameters Wo, Wi, Wf, W placed on the corresponding edges]

Starting from hk−1 and sk−1 we have reached hk and sk, and the recursion will now continue till the last timestep. For simplicity and ease of illustration, instead of considering the parameters (W, Wo, Wi, Wf, U, Uo, Ui, Uf) as separate nodes in the graph, we will just put them on the appropriate edges. (We show only a few parameters and not all.)

SLIDE 147

34/43

[Figure: LSTM dependency graph as above]

Starting from hk−1 and sk−1 we have reached hk and sk, and the recursion will now continue till the last timestep. For simplicity and ease of illustration, instead of considering the parameters (W, Wo, Wi, Wf, U, Uo, Ui, Uf) as separate nodes in the graph, we will just put them on the appropriate edges. (We show only a few parameters and not all.) We are now interested in knowing whether the gradient from Lt(θ) flows back to an arbitrary timestep k.

SLIDE 148

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk

SLIDE 149

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk. In other words, if Lt(θ) was high because Wf failed to compute an appropriate value for sk, then this information should flow back to Wf through the gradients.

SLIDE 150

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk. In other words, if Lt(θ) was high because Wf failed to compute an appropriate value for sk, then this information should flow back to Wf through the gradients. We can ask a similar question about the other parameters (for example, Wi, Wo, W, etc.).

SLIDE 151

35/43

[Figure: LSTM dependency graph as above]

For example, we are interested in knowing if the gradient flows to Wf through sk. In other words, if Lt(θ) was high because Wf failed to compute an appropriate value for sk, then this information should flow back to Wf through the gradients. We can ask a similar question about the other parameters (for example, Wi, Wo, W, etc.). How does the LSTM ensure that this gradient does not vanish even at arbitrary time steps? Let us see.

SLIDE 152

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk).

SLIDE 153

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation).

SLIDE 154

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation). For example, there is one path through sk+1 and another through hk.

SLIDE 155

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation). For example, there is one path through sk+1 and another through hk. Further, there are multiple paths to reach hk itself (as should be obvious from the number of outgoing arrows from hk).

SLIDE 156

36/43

[Figure: LSTM dependency graph as above]

It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if it does not vanish we can reach Wf through sk). First, we observe that there are multiple paths from Lt(θ) to sk (you just need to reverse the direction of the arrows for backpropagation). For example, there is one path through sk+1 and another through hk. Further, there are multiple paths to reach hk itself (as should be obvious from the number of outgoing arrows from hk). So at this point just convince yourself that there are many paths from Lt(θ) to sk.

SLIDE 157

37/43

[Figure: LSTM dependency graph with one path from Lt(θ) back to sk highlighted]

Consider one such path (highlighted) which will contribute to the gradient

SLIDE 158

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0.

SLIDE 159

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

SLIDE 160

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

The first term ∂Lt(θ)/∂ht is fine and does not vanish (ht is directly connected to Lt(θ) and there are no intermediate nodes which can cause the gradient to vanish).

SLIDE 161

37/43

[Figure: LSTM dependency graph with the highlighted path]

Consider one such path (highlighted) which will contribute to the gradient. Let us denote the gradient along this path as t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

The first term ∂Lt(θ)/∂ht is fine and does not vanish (ht is directly connected to Lt(θ) and there are no intermediate nodes which can cause the gradient to vanish). We will now look at the other terms ∂ht/∂st and ∂st/∂st−1 (∀t).

SLIDE 162

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st.

SLIDE 163

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st).

SLIDE 164

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st). Note that hti only depends on oti and sti and not on any other elements of ot and st.

SLIDE 165

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st). Note that hti only depends on oti and sti and not on any other elements of ot and st.

∂ht/∂st will thus be a square diagonal matrix ∈ Rd×d whose diagonal is ot ⊙ σ′(st) ∈ Rd (see slide 35 of Lecture 14).

SLIDE 166

38/43

[Figure: LSTM dependency graph with the highlighted path]

Let us first look at ∂ht/∂st. Recall that ht = ot ⊙ σ(st). Note that hti only depends on oti and sti and not on any other elements of ot and st.

∂ht/∂st will thus be a square diagonal matrix ∈ Rd×d whose diagonal is ot ⊙ σ′(st) ∈ Rd (see slide 35 of Lecture 14). We will represent this diagonal matrix by D(ot ⊙ σ′(st)).
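This diagonal structure can be verified with a finite-difference Jacobian (a sketch with made-up values for ot and st):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

o = np.array([0.8, 0.3, 0.6])   # made-up output gate values
s = np.array([0.5, -1.0, 2.0])  # made-up cell state values
analytic = o * sigmoid(s) * (1 - sigmoid(s))   # the diagonal o ⊙ σ'(s)

# Finite-difference Jacobian of h = o ⊙ σ(s) w.r.t. s
eps = 1e-6
J = np.zeros((3, 3))
for k in range(3):
    e = np.zeros(3); e[k] = eps
    J[:, k] = (o * sigmoid(s + e) - o * sigmoid(s - e)) / (2 * eps)

print(np.allclose(np.diag(J), analytic, atol=1e-6))   # diagonal matches o ⊙ σ'(s)
print(np.allclose(J - np.diag(np.diag(J)), 0.0))      # off-diagonal entries are zero
```

Because hti depends only on sti, perturbing component k of s changes only component k of h, so the numeric Jacobian comes out exactly diagonal.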

SLIDE 167

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1.

SLIDE 168

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t.

SLIDE 169

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant.

SLIDE 170

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant. So once again we are dealing with an ordered network, and thus ∂st/∂st−1 will be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14).

SLIDE 171

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant. So once again we are dealing with an ordered network, and thus ∂st/∂st−1 will be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14). For simplicity, let us assume that the gradient from the implicit term vanishes (we are assuming a worst-case scenario).

SLIDE 172

39/43

[Figure: LSTM dependency graph with the highlighted path]

Now let us consider ∂st/∂st−1. Recall that st = ft ⊙ st−1 + it ⊙ s̃t. Notice that s̃t also depends on st−1, so we cannot treat it as a constant. So once again we are dealing with an ordered network, and thus ∂st/∂st−1 will be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14). For simplicity, let us assume that the gradient from the implicit term vanishes (we are assuming a worst-case scenario). The gradient from the explicit term (treating s̃t as a constant) is given by D(ft).

SLIDE 173

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

SLIDE 174

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk

SLIDE 175

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)

SLIDE 176

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ ⋯ ⊙ fk+1)

SLIDE 177

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ ⋯ ⊙ fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(⊙(i=k+1 to t) fi)

SLIDE 178

40/43

[Figure: LSTM dependency graph with the highlighted path]

We now return to our full expression for t0:

t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 ⋯ ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) ⋯ D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ ⋯ ⊙ fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(⊙(i=k+1 to t) fi)

The first two factors do not vanish, and the last factor contains a product of the forget gates. The forget gates thus regulate the gradient flow depending on the explicit contribution of a state st to the next state st+1.
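Since a product of diagonal matrices D(ft)⋯D(fk+1) is just D of the elementwise product of the gates, the surviving gradient is governed directly by the forget-gate values. A numeric sketch with made-up gate values:

```python
import numpy as np

T = 100
f_open = np.full(T, 0.99)   # forget gates mostly open across 100 timesteps
f_half = np.full(T, 0.5)    # forget gates half-closed across 100 timesteps

print(np.prod(f_open))   # the gradient largely survives
print(np.prod(f_half))   # the gradient vanishes -- exactly when it should
```

With gates near 1 the product over 100 steps stays a sizeable fraction of the original gradient, while half-closed gates drive it to numerical zero, which matches the regulation argument above.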

SLIDE 179

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st.

SLIDE 180

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation.

SLIDE 181

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.)

SLIDE 182

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.) Thus there exists this one path along which the gradient does not vanish when it shouldn't.

SLIDE 183

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.) Thus there exists this one path along which the gradient does not vanish when it shouldn't. And, as argued, as long as the gradient flows back to Wf through one of the paths (t0) through sk, we are fine!

SLIDE 184

41/43

[Figure: LSTM dependency graph with the highlighted path]

If during the forward pass st did not contribute much to st+1 (because ft → 0), then during backpropagation the gradient will also not reach st. This is fine, because if st did not contribute much to st+1 then there is no reason to hold it responsible during backpropagation. (ft does the same regulation during the forward pass and the backward pass, which is fair.) Thus there exists this one path along which the gradient does not vanish when it shouldn't. And, as argued, as long as the gradient flows back to Wf through one of the paths (t0) through sk, we are fine! Of course the gradient flows back only when required, as regulated by the fi's (but let me just say it one last time that this is fair).

SLIDE 185

42/43

[Figure: LSTM dependency graph]

Now we will see why LSTMs do not solve the problem of exploding gradients.

SLIDE 186

42/43

[Figure: LSTM dependency graph]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode.

SLIDE 187

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path.

SLIDE 188

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1

SLIDE 189

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

SLIDE 190

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

‖t1‖ ≤ ‖L′t(ht)‖ · (K‖Wo‖)^(t−k+1)

where K bounds the norm of each diagonal factor D(σ(si) ⊙ o′i).

SLIDE 191

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

‖t1‖ ≤ ‖L′t(ht)‖ · (K‖Wo‖)^(t−k+1)

where K bounds the norm of each diagonal factor D(σ(si) ⊙ o′i). Depending on the norm of the matrix Wo, the gradient ∂Lt(θ)/∂hk−1 may explode.
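A numeric sketch of this bound (the diagonal factor is replaced by 0.5·I and Wo by 3 times an orthogonal matrix — both made-up stand-ins — so every factor along the path has spectral norm 1.5):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Wo = 3.0 * np.linalg.qr(rng.standard_normal((d, d)))[0]  # spectral norm 3
D = 0.5 * np.eye(d)              # stand-in for D(σ(s) ⊙ o'), norm 0.5

prod = np.eye(d)
for _ in range(30):              # 30 timesteps along the highlighted path
    prod = (D @ Wo) @ prod
print(np.linalg.norm(prod, 2))   # grows like 1.5**30 -- the gradient explodes
```

Each factor D @ Wo has norm 1.5 > 1, so the product's norm grows geometrically with the path length, exactly the (K‖Wo‖)^(t−k+1) behaviour in the bound.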

SLIDE 192

42/43

[Figure: LSTM dependency graph with the path through the output gates highlighted]

Now we will see why LSTMs do not solve the problem of exploding gradients. We will show a path through which the gradient can explode. Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the highlighted path:

t1 = ∂Lt(θ)/∂ht · ∂ht/∂ot · ∂ot/∂ht−1 ⋯ ∂hk/∂ok · ∂ok/∂hk−1
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) ⋯ (D(σ(sk) ⊙ o′k) · Wo)

‖t1‖ ≤ ‖L′t(ht)‖ · (K‖Wo‖)^(t−k+1)

where K bounds the norm of each diagonal factor D(σ(si) ⊙ o′i). Depending on the norm of the matrix Wo, the gradient ∂Lt(θ)/∂hk−1 may explode. Similarly, Wi, Wf and W can also cause the gradients to explode.

SLIDE 193

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients?

∗ Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." ICML (3) 28 (2013): 1310–1318.

SLIDE 194

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients? One popular trick is to use gradient clipping.

SLIDE 195

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients? One popular trick is to use gradient clipping. While backpropagating, if the norm of the gradient exceeds a certain value, it is scaled to keep its norm within an acceptable threshold∗.

SLIDE 196

43/43

[Figure: LSTM dependency graph]

So how do we deal with the problem of exploding gradients? One popular trick is to use gradient clipping. While backpropagating, if the norm of the gradient exceeds a certain value, it is scaled to keep its norm within an acceptable threshold∗. Essentially we retain the direction of the gradient but scale down the norm.
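Gradient clipping by norm can be sketched in a few lines (the threshold value here is an arbitrary illustration):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; direction is kept."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])        # norm 50
print(clip_by_norm(g, 5.0))       # same direction, norm reduced to 5
```

Gradients whose norm is already below the threshold pass through unchanged, so only the explosive updates are tamed.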
