
CS7015 (Deep Learning), Lecture 15: Long Short Term Memory Cells (LSTMs) and Gated Recurrent Units (GRUs). Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.


  Selective write
  Let a = 1, b = 3, c = 5, d = 11. Compute ac(bd + a) + ad. Say the "board" can hold only 3 statements at a time.
  There may be many steps in the derivation but we may just skip a few. In other words, we select what to write.
  Derivation: 1. ac   2. bd   3. bd + a   4. ac(bd + a)   5. ad   6. ac(bd + a) + ad
  Board so far: ac = 5, bd = 33

  Selective read
  While writing one step we typically read some of the previous steps we have already written and then decide what to write next. For example, at Step 3 the information from Step 2 is important. In other words, we select what to read.
  Board so far: ac = 5, bd = 33, bd + a = 34

  Selective forget
  Once the board is full, we need to delete some obsolete information. But how do we decide what to delete? We will typically delete the least useful information. In other words, we select what to forget.
  Board evolution: {ac = 5, bd = 33, bd + a = 34} → forget bd, write ac(bd + a) = 170 → forget bd + a, write ad = 11 → forget ac, write ad + ac(bd + a) = 181
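The arithmetic in this running example is easy to verify. Below is a minimal Python sketch (not from the lecture; the helper write and the dictionary board are invented for illustration) that replays the derivation with the board capped at 3 statements, making each selective write, read and forget explicit.

```python
# Replay the derivation of ac(bd + a) + ad while keeping at most 3 statements on the board.
a, b, c, d = 1, 3, 5, 11
CAPACITY = 3
board = {}  # the "whiteboard": statement name -> value

def write(name, value, forget=None):
    """Selective write, with an optional selective forget to make room."""
    if forget is not None:
        board.pop(forget)                      # selective forget: drop obsolete info
    assert len(board) < CAPACITY, "board full: must forget something first"
    board[name] = value

write("ac", a * c)                                               # board: ac = 5
write("bd", b * d)                                               # board: ac = 5, bd = 33
write("bd+a", board["bd"] + a)                                   # selective read of bd; bd + a = 34
write("ac(bd+a)", board["ac"] * board["bd+a"], forget="bd")      # 5 * 34 = 170; bd is obsolete
write("ad", a * d, forget="bd+a")                                # ad = 11
write("result", board["ad"] + board["ac(bd+a)"], forget="ac")    # 11 + 170 = 181

assert board["result"] == a * c * (b * d + a) + a * d == 181
```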

  There are various other scenarios where we can motivate the need for selective write, read and forget. For example, you could think of our brain as something which can store only a finite number of facts; at different time steps we selectively read, write and forget some of these facts. Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write and forget.

  Module 15.2: Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs)

  Questions
  Can we give a concrete example where RNNs also need to selectively read, write and forget?
  How do we convert this intuition into mathematical equations?
  We will see this over the next few slides.

  Consider the task of predicting the sentiment (+ / −, positive/negative) of a review. An RNN reads the document from left to right and after every word updates its state. By the time we reach the end of the document, the information obtained from the first few words is completely lost.
  Review: The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance.
  (Figure: an RNN unrolled over the review, reading one word per time step and predicting + / − at the end.)
  Ideally we want to: forget the information added by stop words (a, the, etc.); selectively read the information added by previous sentiment-bearing words (awesome, amazing, etc.); and selectively write new information from the current word to the state.
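To make "reads the document from left to right and updates the state" concrete, here is a small illustrative numpy sketch (all sizes, embeddings and weights are invented placeholders, not the lecture's model) of a vanilla RNN consuming the review one word at a time. Note that every update rewrites the whole state, which is exactly why information from the first few words gets lost.

```python
import numpy as np

rng = np.random.default_rng(0)
review = ("The first half of the movie was dry but the second half "
          "really picked up pace").lower().split()
vocab = {w: i for i, w in enumerate(sorted(set(review)))}

n, d = 8, 4                                   # state size and embedding size (made up)
E = rng.normal(size=(len(vocab), d))          # word embeddings
W = 0.1 * rng.normal(size=(n, n))             # recurrent weights
U = 0.1 * rng.normal(size=(n, d))             # input weights
V = rng.normal(size=(1, n))                   # output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s = np.zeros(n)                               # the state s_t: the RNN's "whiteboard"
for word in review:                           # read left to right, one word per time step
    x = E[vocab[word]]
    s = sigmoid(W @ s + U @ x)                # every word rewrites the entire state

p_positive = sigmoid(V @ s).item()            # sentiment read off the final state
print(p_positive)
```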

  Questions (recap)
  Can we give a concrete example where RNNs also need to selectively read, write and forget?
  How do we convert this intuition into mathematical equations?

  Recall that the blue colored vector (s_t) is called the state of the RNN. It has a finite size (s_t ∈ R^n) and is used to store all the information up to time step t. This state is analogous to the whiteboard: sooner or later it will get overloaded, and the information from the initial states will get morphed beyond recognition.
  Wishlist: selective write, selective read and selective forget, to ensure that this finite-sized state vector is used effectively.

  Just to be clear: we have computed a state s_{t-1} at time step t − 1 and now we want to overload it with new information (x_t) and compute a new state (s_t). While doing so we want to make sure that we use selective write, selective read and selective forget so that only important information is retained in s_t. We will now see how to implement these items from our wishlist.
  (Figure: the vectors s_{t-1}, x_t and s_t, annotated with selective write, selective read and selective forget.)

  Selective Write
  Recall that in RNNs we use s_{t-1} to compute s_t:
  s_t = σ(W s_{t-1} + U x_t)   (ignoring the bias)
  But now, instead of passing s_{t-1} as it is to s_t, we want to pass (write) only some portions of it to the next state. In the strictest case our decisions could be binary (for example, retain the 1st and 3rd entries and delete the rest). But a more sensible way of doing this is to assign each entry a value between 0 and 1 which determines what fraction of the current state to pass on to the next state.
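The two options on this slide, a hard binary choice versus a soft fraction in [0, 1], amount to an elementwise product with a mask. A tiny numpy illustration (toy numbers, not lecture code):

```python
import numpy as np

s_prev = np.array([-1.4, 0.2, 1.0, -2.0])     # a toy s_{t-1}

hard_mask = np.array([1.0, 0.0, 1.0, 0.0])    # binary: retain only the 1st and 3rd entries
soft_mask = np.array([0.9, 0.1, 0.8, 0.3])    # fractions in [0, 1]; in the RNN these are learned

print(s_prev * hard_mask)                     # [-1.4  0.   1.  -0. ]
print(s_prev * soft_mask)                     # each entry scaled by how much we pass on
```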

  Selective Write (continued)
  We introduce a vector o_{t-1} which decides what fraction of each element of s_{t-1} should be passed to the next state. Each element of o_{t-1} gets multiplied with the corresponding element of s_{t-1}, and each element of o_{t-1} is restricted to be between 0 and 1. But how do we compute o_{t-1}? How does the RNN know what fraction of the state to pass on?

  Well, the RNN has to learn o_{t-1} along with the other parameters (W, U, V). We compute o_{t-1} and h_{t-1} as
  o_{t-1} = σ(W_o h_{t-2} + U_o x_{t-1} + b_o)
  h_{t-1} = o_{t-1} ⊙ σ(s_{t-1})
  The parameters W_o, U_o, b_o need to be learned along with the existing parameters W, U, V. The sigmoid (logistic) function ensures that the values are between 0 and 1. o_t is called the output gate, as it decides how much to pass (write) to the next time step.
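Translating these two equations directly into code, here is a minimal numpy sketch of the output gate and the selectively written h_{t-1} (the sizes and the random placeholder weights are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 4, 3                                   # hypothetical state and input sizes
rng = np.random.default_rng(1)
W_o, U_o, b_o = rng.normal(size=(n, n)), rng.normal(size=(n, d)), np.zeros(n)

s_prev  = np.array([-1.4, 0.2, 1.0, -2.0])    # s_{t-1}
h_prev2 = rng.normal(size=n)                  # h_{t-2}
x_prev  = rng.normal(size=d)                  # x_{t-1}

o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)   # output gate, entries in (0, 1)
h_prev = o_prev * sigmoid(s_prev)                      # selective write: h_{t-1} = o_{t-1} ⊙ σ(s_{t-1})
print(o_prev, h_prev)
```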

  Selective Read
  We will now use h_{t-1} to compute the new state at the next time step. We will also use x_t, the new input at time step t:
  s̃_t = σ(W h_{t-1} + U x_t + b)
  Note that W, U and b are similar to the parameters we used in the RNN (for simplicity the bias b is not shown in the figure).

  s̃_t thus captures all the information from the previous state (h_{t-1}) and the current input x_t. However, we may not want to use all of this new information and instead only selectively read from it before constructing the new cell state s_t. To do this we introduce another gate called the input gate:
  i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
  and use i_t ⊙ s̃_t as the selectively read state information.
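Similarly, the candidate state s̃_t, the input gate i_t and the selective read i_t ⊙ s̃_t can be sketched as follows (again with assumed sizes and placeholder weights, not lecture code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 4, 3                                   # hypothetical state and input sizes
rng = np.random.default_rng(2)
W,   U,   b   = rng.normal(size=(n, n)), rng.normal(size=(n, d)), np.zeros(n)
W_i, U_i, b_i = rng.normal(size=(n, n)), rng.normal(size=(n, d)), np.zeros(n)

h_prev = rng.normal(size=n)                   # h_{t-1}, the selectively written state
x_t = rng.normal(size=d)                      # current input

s_tilde = sigmoid(W @ h_prev + U @ x_t + b)       # candidate state from h_{t-1} and x_t
i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)     # input gate, entries in (0, 1)
read = i_t * s_tilde                              # selective read: i_t ⊙ s̃_t
print(read)
```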

  So far we have the following:
  Previous state: s_{t-1}
  Output gate: o_{t-1} = σ(W_o h_{t-2} + U_o x_{t-1} + b_o)
  Selectively write: h_{t-1} = o_{t-1} ⊙ σ(s_{t-1})
  Current (temporary) state: s̃_t = σ(W h_{t-1} + U x_t + b)
  Input gate: i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
  Selectively read: i_t ⊙ s̃_t

  Selective Forget
  How do we combine s_{t-1} and s̃_t to get the new state? Here is one simple (but effective) way of doing this:
  s_t = s_{t-1} + i_t ⊙ s̃_t
  But we may not want to use the whole of s_{t-1}; we may want to forget some parts of it.
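Collecting everything from this module into one step, here is a self-contained numpy sketch (placeholder weights; the function name step is hypothetical) that computes the output gate, the selective write, the candidate state, the input gate and finally s_t = s_{t-1} + i_t ⊙ s̃_t. As the slide points out, this version keeps all of s_{t-1}; the forget gate that addresses this comes next.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 4, 3                                   # hypothetical state and input sizes
rng = np.random.default_rng(3)
def p(*shape):                                # random placeholder parameters
    return 0.5 * rng.normal(size=shape)

W_o, U_o, b_o = p(n, n), p(n, d), np.zeros(n)  # output gate parameters
W,   U,   b   = p(n, n), p(n, d), np.zeros(n)  # candidate-state parameters
W_i, U_i, b_i = p(n, n), p(n, d), np.zeros(n)  # input gate parameters

def step(s_prev, h_prev2, x_prev, x_t):
    """One state update using the pieces introduced so far."""
    o_prev  = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)   # output gate
    h_prev  = o_prev * sigmoid(s_prev)                       # selective write
    s_tilde = sigmoid(W @ h_prev + U @ x_t + b)              # candidate (temporary) state
    i_t     = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)        # input gate
    return s_prev + i_t * s_tilde                            # s_t: keeps all of s_{t-1} (no forget gate yet)

s_t = step(rng.normal(size=n), rng.normal(size=n), rng.normal(size=d), rng.normal(size=d))
print(s_t)
```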
