Recurrent Neural Nets, ECE 417: Multimedia Signal Processing, Mark Hasegawa-Johnson (PowerPoint presentation)



  1. Recurrent Neural Nets. ECE 417: Multimedia Signal Processing. Mark Hasegawa-Johnson, University of Illinois, November 19, 2019.

  2. Outline:
     1. Linear Time Invariant Filtering: FIR & IIR
     2. Nonlinear Time Invariant Filtering: CNN & RNN
     3. Back-Propagation Training for CNN and RNN
     4. Back-Prop Through Time
     5. Vanishing/Exploding Gradient
     6. Gated Recurrent Units
     7. Long Short-Term Memory (LSTM)
     8. Conclusion

  3. Outline (the table of contents above, repeated).

  4. Basics of DSP: Filtering
$$y[n] = \sum_{m=-\infty}^{\infty} h[m]\, x[n-m]$$
$$Y(z) = H(z)\, X(z)$$
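
A minimal NumPy sketch of the convolution sum above; the filter h and signal x are arbitrary made-up examples, used only to check the formula against numpy.convolve:

```python
import numpy as np

# Made-up example: a 3-tap filter h and a short input signal x.
h = np.array([0.5, 0.3, 0.2])
x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])

# y[n] = sum_m h[m] x[n-m], computed directly from the definition ...
y_direct = np.array([sum(h[m] * x[n - m]
                         for m in range(len(h)) if 0 <= n - m < len(x))
                     for n in range(len(x) + len(h) - 1)])

# ... and with NumPy's built-in convolution.
y_np = np.convolve(x, h)

assert np.allclose(y_direct, y_np)
print(y_np)
```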

  5. Finite Impulse Response (FIR)
$$y[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m]$$
The coefficients, h[m], are chosen in order to optimally position the N-1 zeros of the transfer function, $r_k$, defined according to:
$$H(z) = \sum_{m=0}^{N-1} h[m] z^{-m} = h[0] \prod_{k=1}^{N-1} \left(1 - r_k z^{-1}\right)$$
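
A small sketch of the zero-placement view, assuming arbitrary example coefficients: the zeros $r_k$ of H(z) are the roots of the polynomial $h[0] z^{N-1} + \dots + h[N-1]$, so numpy.roots recovers them.

```python
import numpy as np

# Arbitrary FIR coefficients h[0], ..., h[N-1].
h = np.array([1.0, -1.2, 0.72])

# H(z) = h[0] + h[1] z^{-1} + h[2] z^{-2}; its zeros are the roots of
# the polynomial h[0] z^2 + h[1] z + h[2] (same coefficients).
zeros = np.roots(h)
print(zeros)   # the r_k in H(z) = h[0] * prod_k (1 - r_k z^{-1})

# Sanity check: rebuild the coefficients from h[0] and the zeros.
rebuilt = h[0] * np.poly(zeros)
assert np.allclose(rebuilt, h)
```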

  6. Infinite Impulse Response (IIR)
$$y[n] = \sum_{m=0}^{N-1} b_m x[n-m] + \sum_{m=1}^{M-1} a_m y[n-m]$$
The coefficients, $b_m$ and $a_m$, are chosen in order to optimally position the N-1 zeros and M-1 poles of the transfer function, $r_k$ and $p_k$, defined according to:
$$H(z) = \frac{\sum_{m=0}^{N-1} b_m z^{-m}}{1 - \sum_{m=1}^{M-1} a_m z^{-m}} = b_0 \frac{\prod_{k=1}^{N-1}\left(1 - r_k z^{-1}\right)}{\prod_{k=1}^{M-1}\left(1 - p_k z^{-1}\right)}$$
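
A sketch of the IIR difference equation with made-up coefficients $b_0$, $b_1$, and $a_1$, checked against scipy.signal.lfilter; note that lfilter's denominator convention puts the feedback terms on the left-hand side, so the slide's $+a_1$ becomes $-a_1$ there.

```python
import numpy as np
from scipy.signal import lfilter

# Arbitrary example in the slide's convention:
#   y[n] = b_0 x[n] + b_1 x[n-1] + a_1 y[n-1]
b = np.array([1.0, 0.5])       # feed-forward coefficients (set the zeros)
a1 = 0.9                       # feedback coefficient (pole at z = 0.9)

x = np.zeros(10)
x[0] = 1.0                     # unit impulse, so y is the impulse response

# Direct evaluation of the difference equation.
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = sum(b[m] * x[n - m] for m in range(len(b)) if n - m >= 0)
    if n >= 1:
        y[n] += a1 * y[n - 1]

# lfilter writes the recursion as a[0]y[n] + a[1]y[n-1] = b[0]x[n] + b[1]x[n-1],
# so the slide's +a_1 feedback corresponds to a denominator of [1, -a_1].
y_scipy = lfilter(b, [1.0, -a1], x)
assert np.allclose(y, y_scipy)
print(y)
```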

  7. Outline (the table of contents above, repeated).

  8. Convolutional Neural Net = Nonlinear(FIR)
$$y[n] = g\left(\sum_{m=0}^{N-1} h[m]\, x[n-m]\right)$$
The coefficients, h[m], are chosen to minimize some kind of error. For example, suppose that the goal is to make y[n] resemble a target signal t[n]; then we might use
$$E = \frac{1}{2} \sum_{n=0}^{N} \left(y[n] - t[n]\right)^2$$
and choose
$$h[n] \leftarrow h[n] - \eta \frac{dE}{dh[n]}$$
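
A minimal sketch of this training idea, with made-up data and tanh as the nonlinearity g; the gradient is estimated here by central finite differences rather than by back-propagation (which the later slides derive):

```python
import numpy as np

def forward(h, x, g=np.tanh):
    """y[n] = g(sum_m h[m] x[n-m]): a nonlinear FIR ("CNN") layer."""
    e = np.convolve(x, h)[:len(x)]
    return g(e)

def error(h, x, t):
    """E = 0.5 * sum_n (y[n] - t[n])**2."""
    return 0.5 * np.sum((forward(h, x) - t) ** 2)

# Made-up data: the target is produced by a "true" 3-tap filter.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
t = np.tanh(np.convolve(x, [0.8, -0.3, 0.1])[:50])

h, eta, eps = np.zeros(3), 0.01, 1e-6
for step in range(1000):
    # dE/dh[m] by central finite differences, then h[m] <- h[m] - eta * dE/dh[m].
    grad = np.array([(error(h + eps * np.eye(3)[m], x, t)
                      - error(h - eps * np.eye(3)[m], x, t)) / (2 * eps)
                     for m in range(3)])
    h -= eta * grad
print(h)   # should move toward the "true" filter [0.8, -0.3, 0.1]
```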

  9. Recurrent Neural Net (RNN) = Nonlinear(IIR)
$$y[n] = g\left(x[n] + \sum_{m=1}^{M-1} a_m y[n-m]\right)$$
The coefficients, $a_m$, are chosen to minimize the error. For example, suppose that the goal is to make y[n] resemble a target signal t[n]; then we might use
$$E = \frac{1}{2} \sum_{n=0}^{N} \left(y[n] - t[n]\right)^2$$
and choose
$$a_m \leftarrow a_m - \eta \frac{dE}{da_m}$$
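
A small forward-pass sketch of this recursion, assuming a single feedback coefficient $a_1$ and g = tanh (the input x is a made-up impulse):

```python
import numpy as np

def rnn_forward(x, a, g=np.tanh):
    """y[n] = g(x[n] + sum_{m=1}^{M-1} a[m] y[n-m]), with a[0] unused."""
    M = len(a)
    y = np.zeros(len(x))
    for n in range(len(x)):
        e = x[n] + sum(a[m] * y[n - m] for m in range(1, M) if n - m >= 0)
        y[n] = g(e)
    return y

# Made-up input and a single feedback coefficient a_1 = 0.5.
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = rnn_forward(x, a=[0.0, 0.5])
print(y)   # the response decays because |a_1| < 1 and |tanh| < 1
```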

  10. Outline (the table of contents above, repeated).

  11. Review: Excitation and Activation
The activation of a hidden node is the output of the nonlinearity (for this reason, the nonlinearity is sometimes called the activation function). For example, in a fully-connected network with outputs $z_l$, weights $\vec{v}$, bias $v_0$, nonlinearity g(), and hidden node activations $\vec{y}$, the activation of the $l$-th output node is
$$z_l = g\left(v_{l0} + \sum_{k=1}^{p} v_{lk} y_k\right)$$
The excitation of a hidden node is the input of the nonlinearity. For example, the excitation of the node above is
$$e_l = v_{l0} + \sum_{k=1}^{p} v_{lk} y_k$$

  12. Backprop = Derivative w.r.t. Excitation
The excitation of a hidden node is the input of the nonlinearity. For example, the excitation of the node above is
$$e_l = v_{l0} + \sum_{k=1}^{p} v_{lk} y_k$$
The gradient of the error w.r.t. the weight is
$$\frac{dE}{dv_{lk}} = \epsilon_l\, y_k$$
where $\epsilon_l$ is the derivative of the error w.r.t. the $l$-th excitation:
$$\epsilon_l = \frac{dE}{de_l}$$

  13. Backprop for Fully-Connected Network
Suppose we have a fully-connected network, with inputs $\vec{x}$, weight matrices U and V, nonlinearities g() and h(), and output z:
$$e_k = u_{k0} + \sum_j u_{kj} x_j$$
$$y_k = g(e_k)$$
$$e_l = v_{l0} + \sum_k v_{lk} y_k$$
$$z_l = h(e_l)$$
Then the back-prop gradients are the derivatives of E with respect to the excitations at each node:
$$\epsilon_l = \frac{dE}{de_l}, \qquad \delta_k = \frac{dE}{de_k}$$
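
A compact numerical sketch of the definitions on slides 12 and 13 for a tiny two-layer network, assuming arbitrary random weights, g = tanh, an identity output nonlinearity h(), and the squared error $E = \frac{1}{2}\|z - t\|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4)                            # input vector
U, u0 = rng.standard_normal((3, 4)), np.zeros(3)      # first-layer weights, bias
V, v0 = rng.standard_normal((2, 3)), np.zeros(2)      # second-layer weights, bias
t = rng.standard_normal(2)                            # target for the output z

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2                # derivative of g

# Forward pass: excitations (inputs to the nonlinearities) and activations.
e_hidden = u0 + U @ x          # e_k
y = g(e_hidden)                # y_k = g(e_k)
e_out = v0 + V @ y             # e_l
z = e_out                      # assumption: output nonlinearity h() = identity

# Backward pass with E = 0.5 * ||z - t||^2:
epsilon = z - t                              # epsilon_l = dE/de_l
delta = (V.T @ epsilon) * gdot(e_hidden)     # delta_k  = dE/de_k

# Weight gradients follow dE/dv_lk = epsilon_l * y_k and dE/du_kj = delta_k * x_j
# (the bias gradients are epsilon_l and delta_k themselves).
dE_dV, dE_dU = np.outer(epsilon, y), np.outer(delta, x)
```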

  14. Back-Prop in a CNN
Suppose we have a convolutional neural net, defined by
$$e[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m]$$
$$y[n] = g(e[n])$$
then
$$\frac{dE}{dh[m]} = \sum_n \delta[n]\, x[n-m]$$
where $\delta[n]$ is the back-prop gradient, defined by
$$\delta[n] = \frac{dE}{de[n]}$$
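
The same formulas as a NumPy sketch, with made-up x, h, and t, g = tanh, and $E = \frac{1}{2}\sum_n (y[n]-t[n])^2$; a finite-difference check confirms the gradient formula:

```python
import numpy as np

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2

rng = np.random.default_rng(2)
x, t = rng.standard_normal(30), rng.standard_normal(30)
h = rng.standard_normal(4)
N = len(h)

# Forward pass: e[n] = sum_m h[m] x[n-m], y[n] = g(e[n]).
e = np.convolve(x, h)[:len(x)]
y = g(e)

# With no feedback, E depends on e[n] only through y[n]:
# delta[n] = dE/de[n] = (y[n] - t[n]) * gdot(e[n]).
delta = (y - t) * gdot(e)

# dE/dh[m] = sum_n delta[n] x[n-m]  (a correlation of delta with x).
dE_dh = np.array([sum(delta[n] * x[n - m] for n in range(len(x)) if n - m >= 0)
                  for m in range(N)])

# Quick finite-difference check of one coefficient.
def E_of(hh):
    return 0.5 * np.sum((g(np.convolve(x, hh)[:len(x)]) - t) ** 2)

eps = 1e-6
num = (E_of(h + eps * np.eye(N)[0]) - E_of(h - eps * np.eye(N)[0])) / (2 * eps)
assert np.isclose(dE_dh[0], num, rtol=1e-4)
```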

  15. Back-Prop in an RNN
Suppose we have a recurrent neural net, defined by
$$e[n] = x[n] + \sum_{m=1}^{M-1} a_m y[n-m]$$
$$y[n] = g(e[n])$$
then
$$\frac{dE}{da_m} = \sum_n \delta[n]\, y[n-m]$$
where y[n-m] is calculated by forward-propagation, and then $\delta[n]$ is calculated by back-propagation as
$$\delta[n] = \frac{dE}{de[n]}$$
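
The weight-gradient sum on this slide, sketched as a small helper; the $\delta[n]$ are assumed to have been computed already by back-prop through time, which the next sections cover:

```python
import numpy as np

def rnn_weight_gradient(delta, y, M):
    """dE/da_m = sum_n delta[n] * y[n-m], for m = 1, ..., M-1.

    Assumes delta[n] = dE/de[n] has already been obtained by
    back-prop through time, and y is the forward-pass output.
    """
    N = len(y)
    return np.array([sum(delta[n] * y[n - m] for n in range(N) if n - m >= 0)
                     for m in range(1, M)])
```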

  16. Outline (the table of contents above, repeated).

  17. Partial vs. Full Derivatives
For example, suppose we want y[n] to be as close as possible to some target signal t[n]:
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$
Notice that E depends on y[n] in many different ways:
$$\frac{dE}{dy[n]} = \frac{\partial E}{\partial y[n]} + \frac{dE}{dy[n+1]} \frac{\partial y[n+1]}{\partial y[n]} + \frac{dE}{dy[n+2]} \frac{\partial y[n+2]}{\partial y[n]} + \cdots$$

  18. Partial vs. Full Derivatives
In general,
$$\frac{dE}{dy[n]} = \frac{\partial E}{\partial y[n]} + \sum_{m=1}^{\infty} \frac{dE}{dy[n+m]} \frac{\partial y[n+m]}{\partial y[n]}$$
where $\frac{dE}{dy[n]}$ is the total derivative, and includes all of the different ways in which E depends on y[n], and $\frac{\partial y[n+m]}{\partial y[n]}$ is the partial derivative, i.e., the change in y[n+m] per unit change in y[n] if all of the other variables (all other values of y[n+k]) are held constant.

  19. Partial vs. Full Derivatives
So for example, if
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$
then the partial derivative of E w.r.t. y[n] is
$$\frac{\partial E}{\partial y[n]} = y[n] - t[n]$$
and the total derivative of E w.r.t. y[n] is
$$\frac{dE}{dy[n]} = \left(y[n] - t[n]\right) + \sum_{m=1}^{\infty} \frac{dE}{dy[n+m]} \frac{\partial y[n+m]}{\partial y[n]}$$

  20. Partial vs. Full Derivatives
So for example, if
$$y[n] = g\left(x[n] + \sum_{m=1}^{M-1} a_m y[n-m]\right)$$
then the partial derivative of y[n+k] w.r.t. y[n] is
$$\frac{\partial y[n+k]}{\partial y[n]} = a_k\, \dot{g}\left(x[n+k] + \sum_{m=1}^{M-1} a_m y[n+k-m]\right)$$
where $\dot{g}(x) = \frac{dg}{dx}$ is the derivative of the nonlinearity. The total derivative of y[n+k] w.r.t. y[n] is
$$\frac{dy[n+k]}{dy[n]} = \frac{\partial y[n+k]}{\partial y[n]} + \sum_{j=1}^{k-1} \frac{dy[n+k]}{dy[n+j]} \frac{\partial y[n+j]}{\partial y[n]}$$
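
A dynamic-programming sketch of this recursion; the function assumes the excitations e[n] from the forward pass, a coefficient list a with a[0] unused, the derivative gdot of the nonlinearity, and k >= 1 (all names are illustrative):

```python
import numpy as np

def total_derivative(e, a, gdot, n, k):
    """dy[n+k]/dy[n] for y[i] = g(e[i]), e[i] = x[i] + sum_{m>=1} a[m] y[i-m].

    e    : excitations from the forward pass
    a    : feedback coefficients, a[0] unused
    gdot : derivative of the nonlinearity g
    k    : must be >= 1
    """
    M = len(a)

    def partial(i, j):
        # partial y[i] / partial y[j] with all other y's held fixed (slide 20):
        # a_{i-j} * gdot(e[i]) when 1 <= i-j <= M-1, and 0 otherwise.
        return a[i - j] * gdot(e[i]) if 1 <= i - j < M else 0.0

    # total[j] holds dy[n+k] / dy[n+j]; fill it backward from j = k-1 down to
    # j = 0 using the recursion on this slide.
    total = {}
    for j in range(k - 1, -1, -1):
        total[j] = partial(n + k, n + j) + sum(
            total[jj] * partial(n + jj, n + j) for jj in range(j + 1, k))
    return total[0]
```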

  21. Synchronous Backprop vs. BPTT
The basic idea of back-prop-through-time is divide-and-conquer.
1. Synchronous backprop: first, calculate the partial derivative of E w.r.t. the excitation e[n] at time n, assuming that all other time steps are held constant:
$$\epsilon[n] = \frac{\partial E}{\partial e[n]}$$
2. Back-prop through time: second, iterate backward through time to calculate the total derivative:
$$\delta[n] = \frac{dE}{de[n]}$$

  22. Synchronous Backprop in an RNN
Suppose we have a recurrent neural net, defined by
$$e[n] = x[n] + \sum_{m=1}^{M-1} a_m y[n-m]$$
$$y[n] = g(e[n])$$
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$
then
$$\epsilon[n] = \frac{\partial E}{\partial e[n]} = \left(y[n] - t[n]\right) \dot{g}(e[n])$$
where $\dot{g}(x) = \frac{dg}{dx}$ is the derivative of the nonlinearity.
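
A sketch putting the pieces together for this RNN, with made-up data and g = tanh. The backward recursion for delta[n] in the code is not stated on this slide; it is assumed here as the standard chain-rule (BPTT) consequence of the model above, and the resulting gradient uses dE/da_m = sum_n delta[n] y[n-m] from slide 15:

```python
import numpy as np

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2

def forward(x, a):
    """e[n] = x[n] + sum_{m>=1} a[m] y[n-m];  y[n] = g(e[n])."""
    M, N = len(a), len(x)
    e, y = np.zeros(N), np.zeros(N)
    for n in range(N):
        e[n] = x[n] + sum(a[m] * y[n - m] for m in range(1, M) if n - m >= 0)
        y[n] = g(e[n])
    return e, y

def bptt(x, t, a):
    e, y = forward(x, a)
    M, N = len(a), len(x)
    # Synchronous backprop: eps[n] = partial E / partial e[n], other steps fixed.
    eps = (y - t) * gdot(e)
    # Back-prop through time (assumed chain-rule recursion, standard BPTT):
    # delta[n] = eps[n] + gdot(e[n]) * sum_m a[m] * delta[n+m].
    delta = np.zeros(N)
    for n in range(N - 1, -1, -1):
        delta[n] = eps[n] + gdot(e[n]) * sum(
            a[m] * delta[n + m] for m in range(1, M) if n + m < N)
    # dE/da_m = sum_n delta[n] * y[n-m], for m = 1, ..., M-1.
    return np.array([sum(delta[n] * y[n - m] for n in range(N) if n - m >= 0)
                     for m in range(1, M)])

# Made-up data: input x, target t, and one feedback coefficient a_1 = 0.4.
rng = np.random.default_rng(3)
x, t, a = rng.standard_normal(20), rng.standard_normal(20), [0.0, 0.4]
print(bptt(x, t, a))   # gradient dE/da_1
```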
