Recurrent Neural Networks (CS 6956: Deep Learning for NLP)

  1. Recurrent Neural Networks CS 6956: Deep Learning for NLP

  2. Overview 1. Modeling sequences 2. Recurrent neural networks: An abstraction 3. Usage patterns for RNNs 4. Bidirectional RNNs 5. A concrete example: The Elman RNN 6. The vanishing gradient problem 7. Gating and long short-term memory units

  4. A simple RNN 1. How to generate the current state using the previous state and the current input? Next state: s_t = h(s_{t-1} W_s + x_t W_x + b) 2. How to generate the current output using the current state? The output is the state. That is, z_t = s_t
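
A minimal NumPy sketch of this state update, assuming tanh for the activation h; the names W_s, W_x, and b follow the equation above, and the shapes are illustrative rather than anything prescribed by the slides.

```python
import numpy as np

def rnn_step(s_prev, x_t, W_s, W_x, b):
    """One step of the simple RNN: s_t = tanh(s_prev @ W_s + x_t @ W_x + b).
    The output z_t is just the new state."""
    s_t = np.tanh(s_prev @ W_s + x_t @ W_x + b)
    return s_t, s_t  # (new state, output)

# Illustrative sizes: 4-dimensional inputs, 3-dimensional state
rng = np.random.default_rng(0)
W_x, W_s, b = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), np.zeros(3)
s = np.zeros(3)                      # initial state s_0
for x in rng.normal(size=(5, 4)):    # a sequence of 5 inputs
    s, z = rnn_step(s, x, W_s, W_x, b)
```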

  5. How do we train a recurrent network? We need to specify a problem first. Let's take an example. - Inputs are sequences (say, of words) [Figure: the RNN unrolled over the input "I like cake", starting from an initial state]

  6. How do we train a recurrent network? We need to specify a problem first. Let's take an example. - Inputs are sequences (say, of words) - The outputs are labels associated with each word [Figure: the same unrolled RNN over "I like cake", now with a label (Pronoun, Verb, Noun) predicted at each word]

  7. How do we train a recurrent network? We need to specify a problem first. Let's take an example. - Inputs are sequences (say, of words) - The outputs are labels associated with each word - Losses for each word are added up [Figure: the unrolled RNN over "I like cake", with per-word losses loss1, loss2, loss3 summed into the total loss]
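
A sketch of how those per-word losses might be accumulated over a sequence. This is not course code: tanh is assumed for the activation, and a squared-error stand-in is used in place of whatever per-word loss g the task actually defines.

```python
import numpy as np

def sequence_loss(xs, golds, W_s, W_x, b, g):
    """Unroll the simple RNN over the inputs xs and add up the per-step losses.
    g(state, gold) is whatever loss is attached to each position."""
    s = np.zeros(W_s.shape[0])                      # initial state s_0
    total = 0.0
    for x_t, gold_t in zip(xs, golds):
        s = np.tanh(s @ W_s + x_t @ W_x + b)        # state update (tanh assumed for h)
        total += g(s, gold_t)                       # loss_t for this word
    return total

# Tiny illustrative run: 3 words, 4-dim inputs, 3-dim state
rng = np.random.default_rng(0)
W_x, W_s, b = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), np.zeros(3)
xs, golds = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
total = sequence_loss(xs, golds, W_s, W_x, b, lambda s, y: np.sum((s - y) ** 2))
```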

  8. Gradients to the rescue - We have a computation graph - Use backpropagation to compute gradients of the loss with respect to the parameters (W_s, W_x, b); this is sometimes called Backpropagation Through Time (BPTT) - Use the gradients to update the parameters with SGD or a variant (Adam, for example)
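
As a concrete illustration of BPTT with Adam on the word-labeling example, here is a hypothetical PyTorch sketch; the dimensions, the made-up word ids, and the extra Linear layer that maps each state to label scores are illustrative choices, not something taken from the course.

```python
import torch
import torch.nn as nn

vocab, n_labels, d_in, d_state = 100, 3, 16, 32
emb = nn.Embedding(vocab, d_in)
rnn = nn.RNN(input_size=d_in, hidden_size=d_state, batch_first=True)  # Elman-style RNN
clf = nn.Linear(d_state, n_labels)                 # maps each state to label scores
loss_fn = nn.CrossEntropyLoss(reduction="sum")     # per-word losses are added up
opt = torch.optim.Adam(list(emb.parameters()) + list(rnn.parameters()) + list(clf.parameters()))

words = torch.tensor([[1, 5, 7]])                  # "I like cake" as made-up word ids
gold = torch.tensor([[0, 1, 2]])                   # one label per word

opt.zero_grad()
states, _ = rnn(emb(words))                        # unroll over the whole sequence
loss = loss_fn(clf(states).reshape(-1, n_labels), gold.reshape(-1))
loss.backward()                                    # backpropagation through time
opt.step()                                         # Adam update of the parameters
```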

  9. A simple RNN 1. How to generate the current state using the previous state and the current input? Next state: s_t = h(s_{t-1} W_s + x_t W_x + b) 2. How to generate the current output using the current state? The output is the state. That is, z_t = s_t

  10. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function g of the state at that step

  11. Does this work? Let's see a simple example, under the same scalar setup, with the loss at each step a function g of the state at that step.
      First input: x_1
      Transform: r_1 = s_0 W_s + x_1 W_x + b
      State: s_1 = h(r_1)
      Loss: L_1 = g(s_1)
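
To make the transform concrete, a tiny numeric run of this forward pass under made-up values (s_0 = 0, x_1 = 1, W_s = 0.5, W_x = 2, b = 0.1), assuming h = tanh and taking g(s) = s^2 purely as an illustrative per-step loss:

```python
import numpy as np

# Made-up scalar values for the one-step example
s0, x1 = 0.0, 1.0
W_s, W_x, b = 0.5, 2.0, 0.1
h, g = np.tanh, lambda s: s ** 2        # g is just an illustrative per-step loss

r1 = s0 * W_s + x1 * W_x + b            # transform: 2.1
s1 = h(r1)                              # state: tanh(2.1) is about 0.9705
L1 = g(s1)                              # loss: about 0.9418
```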

  12. Continuing the example: let's compute the derivative of the loss with respect to the parameter W_x.

  13. The derivative follows from the chain rule: ∂L_1/∂W_x = (∂L_1/∂s_1) · (∂s_1/∂r_1) · (∂r_1/∂W_x)
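
Under the same made-up values as above (and again with g(s) = s^2 as an illustrative loss, not from the slides), the three factors can be multiplied out and checked against a finite-difference estimate:

```python
import numpy as np

s0, x1, W_s, W_x, b = 0.0, 1.0, 0.5, 2.0, 0.1
h, g = np.tanh, lambda s: s ** 2                  # illustrative choices

def loss(W_x_val):
    r1 = s0 * W_s + x1 * W_x_val + b
    return g(h(r1))

r1 = s0 * W_s + x1 * W_x + b
s1 = h(r1)
dL_ds1 = 2 * s1                                   # ∂L_1/∂s_1 for g(s) = s^2
ds1_dr1 = 1 - np.tanh(r1) ** 2                    # ∂s_1/∂r_1
dr1_dWx = x1                                      # ∂r_1/∂W_x
analytic = dL_ds1 * ds1_dr1 * dr1_dWx

eps = 1e-6
numeric = (loss(W_x + eps) - loss(W_x - eps)) / (2 * eps)
print(analytic, numeric)                          # the two estimates agree closely
```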

  17. Let us examine the non-linearity in this system due to the activation function h.

  18. Suppose h(z) = tanh(z).

  19. Then dh/dz = 1 - tanh^2(z).
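
For completeness, a short derivation of that derivative from the definition of tanh (standard calculus, not taken from the slides):

```latex
\frac{d}{dz}\tanh z
  = \frac{d}{dz}\,\frac{\sinh z}{\cosh z}
  = \frac{\cosh^2 z - \sinh^2 z}{\cosh^2 z}
  = 1 - \tanh^2 z .
```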

  20. The derivative 1 - tanh^2(z) is always between zero and one.

  21. That is, ∂s_1/∂r_1 = 1 - tanh^2(r_1).

  22. So ∂s_1/∂r_1 = 1 - tanh^2(r_1) is a number between zero and one.
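
A quick numeric look at that factor, with illustrative values only: it equals 1 at z = 0 and shrinks rapidly as |z| grows.

```python
import numpy as np

for z in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(z, 1 - np.tanh(z) ** 2)   # 1.0, 0.786..., 0.420..., 0.071..., 0.0013...
```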
