Long/Short-Term Memory



  1. Long/Short-Term Memory
Mark Hasegawa-Johnson
All content CC-SA 4.0 unless otherwise specified.
University of Illinois ECE 417: Multimedia Signal Processing, Fall 2020

  2. Outline: (1) Review: Recurrent Neural Networks, (2) Vanishing/Exploding Gradient, (3) Running Example: a Pocket Calculator, (4) Regular RNN, (5) Forget Gate, (6) Long Short-Term Memory (LSTM), (7) Backprop for an LSTM, (8) Conclusion

  3. Outline: (1) Review: Recurrent Neural Networks, (2) Vanishing/Exploding Gradient, (3) Running Example: a Pocket Calculator, (4) Regular RNN, (5) Forget Gate, (6) Long Short-Term Memory (LSTM), (7) Backprop for an LSTM, (8) Conclusion

  4. Recurrent Neural Net (RNN) = Nonlinear (IIR)
[Figure: a recurrent neural network and its unfolding in time. Image CC-SA-4.0 by Ixnay, https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg]

  5. Back-Propagation and Causal Graphs
[Figure: causal graph with input $x$, hidden nodes $h_0, h_1, \ldots, h_{N-1}$, and output $\hat{y}$.]
$$\frac{d\hat{y}}{dx} = \sum_{i=0}^{N-1} \frac{d\hat{y}}{dh_i}\,\frac{\partial h_i}{\partial x}$$
For each $h_i$, we find the total derivative of $\hat{y}$ w.r.t. $h_i$, multiplied by the partial derivative of $h_i$ w.r.t. $x$.
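As a quick illustration (mine, not from the slides), the sketch below checks this decomposition numerically on a hypothetical two-node causal graph, h0 = a*x, h1 = b*x + c*h0, yhat = d*h0 + e*h1, with arbitrary coefficients:

```python
# Hypothetical causal graph: h0 = a*x, h1 = b*x + c*h0, yhat = d*h0 + e*h1.
a, b, c, d, e = 2.0, -1.0, 0.5, 3.0, 4.0

# Partial derivative of each h_i w.r.t. x (holding the other node fixed):
dh0_dx, dh1_dx = a, b

# Total derivative of yhat w.r.t. each h_i (following all downstream paths):
dy_dh1 = e
dy_dh0 = d + e * c                     # direct path, plus the path through h1

# Decomposition from the slide: dyhat/dx = sum_i (dyhat/dh_i)(dh_i/dx)
dy_dx = dy_dh0 * dh0_dx + dy_dh1 * dh1_dx

# Finite-difference check of the same derivative
def yhat(x):
    h0 = a * x
    h1 = b * x + c * h0
    return d * h0 + e * h1

eps = 1e-6
print(dy_dx, (yhat(1.0 + eps) - yhat(1.0)) / eps)   # both are approximately 6.0
```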

  6. Back-Propagation Through Time
Back-propagation through time computes the error gradient at each time step based on the error gradients at future time steps. If the forward-prop equation is
$$\hat{y}[n] = g(e[n]), \qquad e[n] = x[n] + \sum_{m=1}^{M-1} w[m]\,\hat{y}[n-m],$$
then the BPTT equation is
$$\delta[n] = \frac{dE}{de[n]} = \frac{\partial E}{\partial e[n]} + \left(\sum_{m=1}^{M-1}\delta[n+m]\,w[m]\right)\dot{g}(e[n]).$$
The weight update, for an RNN, multiplies the back-prop times the forward-prop:
$$\frac{dE}{dw[m]} = \sum_n \delta[n]\,\hat{y}[n-m]$$
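To make the two passes concrete, here is a minimal BPTT sketch (mine, not from the slides) for this nonlinear-IIR model, assuming g = tanh and a squared-error loss E = 0.5 * sum_n (yhat[n] - y[n])^2, with a finite-difference check of dE/dw[1]:

```python
import numpy as np

def forward(x, w):
    """Forward prop: e[n] = x[n] + sum_{m>=1} w[m]*yhat[n-m], yhat[n] = tanh(e[n])."""
    N, M = len(x), len(w)
    e, yhat = np.zeros(N), np.zeros(N)
    for n in range(N):
        e[n] = x[n] + sum(w[m] * yhat[n - m] for m in range(1, M) if n - m >= 0)
        yhat[n] = np.tanh(e[n])
    return e, yhat

def bptt(x, y, w):
    """delta[n] = dE/de[n], accumulated from future steps; then dE/dw[m] = sum_n delta[n]*yhat[n-m]."""
    N, M = len(x), len(w)
    e, yhat = forward(x, w)
    gdot = 1.0 - yhat ** 2                      # g'(e[n]) for g = tanh
    delta = np.zeros(N)
    for n in reversed(range(N)):
        future = sum(w[m] * delta[n + m] for m in range(1, M) if n + m < N)
        delta[n] = (yhat[n] - y[n]) * gdot[n] + gdot[n] * future
    grad = np.zeros(M)                          # w[0] is unused by this model
    for m in range(1, M):
        grad[m] = sum(delta[n] * yhat[n - m] for n in range(m, N))
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
w = np.array([0.0, 0.5, -0.3])                  # taps w[1], w[2]; w[0] is a placeholder
E = lambda w: 0.5 * np.sum((forward(x, w)[1] - y) ** 2)
w2 = w.copy(); w2[1] += 1e-6
print(bptt(x, y, w)[1], (E(w2) - E(w)) / 1e-6)  # should agree closely
```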

  7. Outline: (1) Review: Recurrent Neural Networks, (2) Vanishing/Exploding Gradient, (3) Running Example: a Pocket Calculator, (4) Regular RNN, (5) Forget Gate, (6) Long Short-Term Memory (LSTM), (7) Backprop for an LSTM, (8) Conclusion

  8. Vanishing/Exploding Gradient
The "vanishing gradient" problem refers to the tendency of $d\hat{y}[n+m]/de[n]$ to disappear, exponentially, when $m$ is large.
The "exploding gradient" problem refers to the tendency of $d\hat{y}[n+m]/de[n]$ to explode toward infinity, exponentially, when $m$ is large.
If the largest feedback coefficient is $|w[m]| > 1$, then you get an exploding gradient; if $|w[m]| < 1$, you get a vanishing gradient.

  9. Example: A Memorizer Network
Suppose that we have a very simple RNN:
$$\hat{y}[n] = w\,x[n] + u\,\hat{y}[n-1]$$
Suppose that $x[n]$ is only nonzero at time 0:
$$x[n] = \begin{cases} x_0 & n = 0 \\ 0 & n \ne 0 \end{cases}$$
Suppose that, instead of measuring $x[0]$ directly, we are only allowed to measure the output of the RNN $m$ time-steps later. Our goal is to learn $w$ and $u$ so that $\hat{y}[m]$ remembers $x_0$, thus:
$$E = \frac{1}{2}\left(\hat{y}[m] - x_0\right)^2$$
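As a sanity check (not from the slides), the short sketch below simulates this memorizer RNN on the impulse input and confirms that yhat[m] = w*x0*u^m, so the "memory" of x0 decays (or grows) geometrically with the delay m. The parameter values are arbitrary.

```python
# Memorizer RNN from the slide: yhat[n] = w*x[n] + u*yhat[n-1], driven by an
# impulse x[0] = x0.  The values of w, u, x0, m below are arbitrary choices.
w, u, x0, m = 1.0, 0.9, 2.0, 10

yhat, x = 0.0, [x0] + [0.0] * m
for n in range(m + 1):
    yhat = w * x[n] + u * yhat          # yhat now holds yhat[n]

print(yhat, w * x0 * u ** m)            # identical: yhat[m] = w * x0 * u^m
print(0.5 * (yhat - x0) ** 2)           # the error E = (1/2)(yhat[m] - x0)^2
```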

  10. Example: A Memorizer Network
Now, how do we perform gradient update of the weights? If
$$\hat{y}[n] = w\,x[n] + u\,\hat{y}[n-1]$$
then
$$\frac{dE}{dw} = \sum_n \frac{dE}{d\hat{y}[n]}\frac{\partial\hat{y}[n]}{\partial w} = \sum_n \frac{dE}{d\hat{y}[n]}\,x[n] = \frac{dE}{d\hat{y}[0]}\,x_0$$
But the error is defined as
$$E = \frac{1}{2}\left(\hat{y}[m] - x_0\right)^2$$
so
$$\frac{dE}{d\hat{y}[0]} = u\,\frac{dE}{d\hat{y}[1]} = u^2\,\frac{dE}{d\hat{y}[2]} = \cdots = u^m\left(\hat{y}[m] - x_0\right)$$

  11. Example: Vanishing Gradient
So we find out that the gradient, w.r.t. the coefficient $w$, is either exponentially small or exponentially large, depending on whether $|u| < 1$ or $|u| > 1$:
$$\frac{dE}{dw} = x_0\left(\hat{y}[m] - x_0\right)u^m$$
In other words, if our application requires the neural net to wait $m$ time steps before generating its output, then the gradient is exponentially smaller, and therefore training the neural net is exponentially harder.
[Figure: exponential decay. Image CC-SA-4.0, PeterQ, Wikipedia]
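A minimal numerical illustration (mine): the sketch below evaluates the closed-form gradient dE/dw = x0*(yhat[m] - x0)*u^m for increasing delays m, and checks one value against a finite-difference estimate. The parameter values are arbitrary; the point is the u^m factor, which shrinks toward zero when |u| < 1 and blows up when |u| > 1.

```python
def yhat_m(w, u, x0, m):
    """Memorizer output m steps after the impulse: yhat[m] = w*x0*u^m."""
    y = 0.0
    for n in range(m + 1):
        y = w * (x0 if n == 0 else 0.0) + u * y
    return y

def E(w, u, x0, m):
    return 0.5 * (yhat_m(w, u, x0, m) - x0) ** 2

def grad_w(w, u, x0, m):
    """Closed form from the slide: dE/dw = x0*(yhat[m] - x0)*u^m."""
    return x0 * (yhat_m(w, u, x0, m) - x0) * u ** m

w, x0 = 1.0, 1.0
for u in (0.5, 1.1):                      # |u| < 1 vanishes, |u| > 1 explodes
    print([f"{grad_w(w, u, x0, m):.2e}" for m in (1, 5, 10, 20, 40)])

# Finite-difference check at one setting
u, m, eps = 0.5, 10, 1e-6
print(grad_w(w, u, x0, m), (E(w + eps, u, x0, m) - E(w, u, x0, m)) / eps)
```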

  12. Outline: (1) Review: Recurrent Neural Networks, (2) Vanishing/Exploding Gradient, (3) Running Example: a Pocket Calculator, (4) Regular RNN, (5) Forget Gate, (6) Long Short-Term Memory (LSTM), (7) Backprop for an LSTM, (8) Conclusion

  13. Notation
Today's lecture will try to use notation similar to the Wikipedia page for LSTM.
x[t] = input at time t
y[t] = target/desired output
c[t] = excitation at time t, OR the LSTM cell
h[t] = activation at time t, OR the LSTM output
u = feedback coefficient
w = feedforward coefficient
b = bias

  14. Running Example: a Pocket Calculator
The rest of this lecture will refer to a toy application called "pocket calculator."
Pocket Calculator:
When x[t] > 0, add it to the current tally: c[t] = c[t-1] + x[t].
When x[t] = 0, (1) print out the current tally, h[t] = c[t-1], and then (2) reset the tally to zero, c[t] = 0.
Example Signals:
Input: x[t] = 1, 2, 1, 0, 1, 1, 1, 0
Target Output: y[t] = 0, 0, 0, 4, 0, 0, 0, 3
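To make the running example concrete, here is a small reference implementation of the target behavior (the function is my own sketch, not part of the lecture):

```python
def pocket_calculator(x):
    """Target behavior: accumulate positive inputs into a tally c; when x[t] == 0,
    output the tally (h[t] = c[t-1]) and reset it to zero.  Other outputs are 0."""
    c_prev = 0              # tally carried over from the previous step, c[t-1]
    h = []                  # outputs h[t]
    for xt in x:
        if xt > 0:
            c = c_prev + xt         # add to the running tally
            h.append(0)             # nothing is printed on these steps
        else:
            h.append(c_prev)        # print the tally
            c = 0                   # reset
        c_prev = c
    return h

print(pocket_calculator([1, 2, 1, 0, 1, 1, 1, 0]))   # [0, 0, 0, 4, 0, 0, 0, 3]
```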

  15. Pocket Calculator
When x[t] > 0, add it to the current tally: c[t] = c[t-1] + x[t].
When x[t] = 0, (1) print out the current tally, h[t] = c[t-1], and then (2) reset the tally to zero, c[t] = 0.

  16. Outline: (1) Review: Recurrent Neural Networks, (2) Vanishing/Exploding Gradient, (3) Running Example: a Pocket Calculator, (4) Regular RNN, (5) Forget Gate, (6) Long Short-Term Memory (LSTM), (7) Backprop for an LSTM, (8) Conclusion

  17. One-Node One-Tap Linear RNN
Suppose that we have a very simple RNN:
Excitation: c[t] = x[t] + u h[t-1]
Activation: h[t] = σ_h(c[t])
where σ_h() is some feedback nonlinearity. In this simple example, let's just use σ_h(c[t]) = c[t], i.e., no nonlinearity.
GOAL: Find u so that h[t] ≈ y[t]. In order to make the problem easier, we will only score an "error" when y[t] ≠ 0:
$$E = \frac{1}{2}\sum_{t:\,y[t]>0}\left(h[t]-y[t]\right)^2$$
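A minimal sketch of this one-node RNN and its masked error (helper names are mine; σ_h is the identity, as on the slide):

```python
def rnn_forward(x, u):
    """One-node, one-tap linear RNN: c[t] = x[t] + u*h[t-1], h[t] = sigma_h(c[t]) = c[t]."""
    h, h_prev = [], 0.0
    for xt in x:
        h_prev = xt + u * h_prev
        h.append(h_prev)
    return h

def error(h, y):
    """E = 0.5 * sum over steps with y[t] > 0 of (h[t] - y[t])^2."""
    return 0.5 * sum((ht - yt) ** 2 for ht, yt in zip(h, y) if yt > 0)

x = [1, 2, 1, 0, 1, 1, 1, 0]
y = [0, 0, 0, 4, 0, 0, 0, 3]
print(error(rnn_forward(x, 0.5), y))    # E with u = 0.5, the initial guess used below
```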

  18. RNN with u = 1
Obviously, if we want to just add numbers, we should just set u = 1. Then the RNN is computing
Excitation: c[t] = x[t] + h[t-1]
Activation: h[t] = σ_h(c[t])
That works until the first zero-valued input. But then it just keeps on adding.

  19. RNN with u = 0.5
Can we get decent results using u = 0.5?
Advantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (c[t] ≈ 0), so a hard reset is not necessary.
Disadvantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (h[t] ≈ 0).
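For a concrete comparison of the two cases on slides 18 and 19, the short self-contained sketch below (mine) runs the linear RNN on the running example with u = 1 and u = 0.5:

```python
x = [1, 2, 1, 0, 1, 1, 1, 0]
y = [0, 0, 0, 4, 0, 0, 0, 3]
for u in (1.0, 0.5):
    h, h_prev = [], 0.0
    for xt in x:
        h_prev = xt + u * h_prev        # c[t] = x[t] + u*h[t-1];  h[t] = c[t]
        h.append(round(h_prev, 3))
    print(f"u = {u}: h = {h}   (target y = {y})")
```

With u = 1 the tally never resets (h[7] = 7 instead of 3); with u = 0.5 the tally has mostly leaked away by the time it should be read out (h[3] ≈ 1.1 instead of 4).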

  20. Gradient Descent
c[t] = x[t] + u h[t-1], h[t] = σ_h(c[t])
Let's try initializing u = 0.5, and then performing gradient descent to improve it. Gradient descent has five steps:
1. Forward Propagation: c[t] = x[t] + u h[t-1], h[t] = c[t].
2. Synchronous Backprop: ε[t] = ∂E/∂c[t].
3. Back-Prop Through Time: δ[t] = dE/dc[t].
4. Weight Gradient: dE/du = Σ_t δ[t] h[t-1].
5. Gradient Descent: u ← u - η dE/du.

  21. Gradient Descent
Excitation: c[t] = x[t] + u h[t-1]
Activation: h[t] = σ_h(c[t])
Error:
$$E = \frac{1}{2}\sum_{t:\,y[t]>0}\left(h[t]-y[t]\right)^2$$
So the back-prop stages are:
Synchronous Backprop:
$$\epsilon[t] = \frac{\partial E}{\partial c[t]} = \begin{cases} h[t]-y[t] & y[t]>0 \\ 0 & \text{otherwise} \end{cases}$$
BPTT:
$$\delta[t] = \frac{dE}{dc[t]} = \epsilon[t] + u\,\delta[t+1]$$
Weight Gradient:
$$\frac{dE}{du} = \sum_t \delta[t]\,h[t-1]$$
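Putting the five steps together, here is a minimal training sketch on the running example (mine, not from the slides; the identity activation matches slide 17, while the learning rate η = 0.002 and the iteration count are arbitrary choices):

```python
x = [1, 2, 1, 0, 1, 1, 1, 0]
y = [0, 0, 0, 4, 0, 0, 0, 3]
u, eta, T = 0.5, 0.002, len(x)          # u initialized as on slide 20; eta is arbitrary

for step in range(200):
    # 1. Forward propagation: c[t] = x[t] + u*h[t-1], h[t] = c[t]
    h, h_prev = [], 0.0
    for t in range(T):
        h_prev = x[t] + u * h_prev
        h.append(h_prev)

    # 2. Synchronous backprop: eps[t] = dE/dc[t] holding other steps fixed (nonzero only where y[t] > 0)
    eps = [(h[t] - y[t]) if y[t] > 0 else 0.0 for t in range(T)]

    # 3. Back-prop through time: delta[t] = eps[t] + u*delta[t+1]
    delta = [0.0] * T
    for t in reversed(range(T)):
        delta[t] = eps[t] + (u * delta[t + 1] if t + 1 < T else 0.0)

    # 4. Weight gradient: dE/du = sum_t delta[t]*h[t-1]  (h[-1] = 0, so skip t = 0)
    dE_du = sum(delta[t] * h[t - 1] for t in range(1, T))

    # 5. Gradient descent: u <- u - eta*dE/du
    u -= eta * dE_du

E = 0.5 * sum((h[t] - y[t]) ** 2 for t in range(T) if y[t] > 0)
print(u, E)     # error of the most recent forward pass
```

With these settings u climbs from 0.5 to roughly 0.8 and the error drops but does not reach zero: a single fixed feedback weight cannot both accumulate the tally and reset it.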

  22. Backprop Stages, u = 0.5
$$\epsilon[t] = \begin{cases} h[t]-y[t] & y[t]>0 \\ 0 & \text{otherwise} \end{cases}$$
$$\delta[t] = \epsilon[t] + u\,\delta[t+1]$$
$$\frac{dE}{du} = \sum_t \delta[t]\,h[t-1]$$

  23. Vanishing Gradient and Exploding Gradient
Notice that, with |u| < 1, δ[t] tends to vanish exponentially fast as we go backward in time. This is called the vanishing gradient problem. It is a big problem for RNNs with long time-dependency, and for deep neural nets with many layers. If we set |u| > 1, we get an even worse problem, sometimes called the exploding gradient problem.
