Recurrent Neural Nets, ECE 417: Multimedia Signal Processing, Mark Hasegawa-Johnson (PowerPoint presentation)



  1. Recurrent Neural Nets. ECE 417: Multimedia Signal Processing. Mark Hasegawa-Johnson, University of Illinois, November 19, 2019.

  2. Outline:
     1. Linear Time Invariant Filtering: FIR & IIR
     2. Nonlinear Time Invariant Filtering: CNN & RNN
     3. Back-Propagation Training for CNN and RNN
     4. Back-Prop Through Time
     5. Vanishing/Exploding Gradient
     6. Gated Recurrent Units
     7. Long Short-Term Memory (LSTM)
     8. Conclusion

  3. Outline (the table of contents above, repeated).

  4. Basics of DSP: Filtering
$$y[n] = \sum_{m=-\infty}^{\infty} h[m]\, x[n-m]$$
$$Y(z) = H(z)\, X(z)$$
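
A minimal NumPy sketch of the convolution sum above; the filter h and signal x are arbitrary made-up examples, used only to check the formula against numpy.convolve:

```python
import numpy as np

# Made-up example: a 3-tap filter h and a short input signal x.
h = np.array([0.5, 0.3, 0.2])
x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])

# y[n] = sum_m h[m] x[n-m], computed directly from the definition ...
y_direct = np.array([sum(h[m] * x[n - m]
                         for m in range(len(h)) if 0 <= n - m < len(x))
                     for n in range(len(x) + len(h) - 1)])

# ... and with NumPy's built-in convolution.
y_np = np.convolve(x, h)

assert np.allclose(y_direct, y_np)
print(y_np)
```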

  5. Finite Impulse Response (FIR)
$$y[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m]$$
The coefficients, h[m], are chosen in order to optimally position the N-1 zeros of the transfer function, $r_k$, defined according to:
$$H(z) = \sum_{m=0}^{N-1} h[m] z^{-m} = h[0] \prod_{k=1}^{N-1} \left(1 - r_k z^{-1}\right)$$
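
A small sketch of the zero-placement view, assuming arbitrary example coefficients: the zeros $r_k$ of H(z) are the roots of the polynomial $h[0] z^{N-1} + \dots + h[N-1]$, so numpy.roots recovers them.

```python
import numpy as np

# Arbitrary FIR coefficients h[0], ..., h[N-1].
h = np.array([1.0, -1.2, 0.72])

# H(z) = h[0] + h[1] z^{-1} + h[2] z^{-2}; its zeros are the roots of
# the polynomial h[0] z^2 + h[1] z + h[2] (same coefficients).
zeros = np.roots(h)
print(zeros)   # the r_k in H(z) = h[0] * prod_k (1 - r_k z^{-1})

# Sanity check: rebuild the coefficients from h[0] and the zeros.
rebuilt = h[0] * np.poly(zeros)
assert np.allclose(rebuilt, h)
```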

  6. Infinite Impulse Response (IIR)
$$y[n] = \sum_{m=0}^{N-1} b_m x[n-m] + \sum_{m=1}^{M-1} a_m y[n-m]$$
The coefficients, $b_m$ and $a_m$, are chosen in order to optimally position the N-1 zeros and M-1 poles of the transfer function, $r_k$ and $p_k$, defined according to:
$$H(z) = \frac{\sum_{m=0}^{N-1} b_m z^{-m}}{1 - \sum_{m=1}^{M-1} a_m z^{-m}} = b_0 \frac{\prod_{k=1}^{N-1}\left(1 - r_k z^{-1}\right)}{\prod_{k=1}^{M-1}\left(1 - p_k z^{-1}\right)}$$
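
A sketch of the IIR difference equation with made-up coefficients $b_0$, $b_1$, and $a_1$, checked against scipy.signal.lfilter; note that lfilter's denominator convention puts the feedback terms on the left-hand side, so the slide's $+a_1$ becomes $-a_1$ there.

```python
import numpy as np
from scipy.signal import lfilter

# Arbitrary example in the slide's convention:
#   y[n] = b_0 x[n] + b_1 x[n-1] + a_1 y[n-1]
b = np.array([1.0, 0.5])       # feed-forward coefficients (set the zeros)
a1 = 0.9                       # feedback coefficient (pole at z = 0.9)

x = np.zeros(10)
x[0] = 1.0                     # unit impulse, so y is the impulse response

# Direct evaluation of the difference equation.
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = sum(b[m] * x[n - m] for m in range(len(b)) if n - m >= 0)
    if n >= 1:
        y[n] += a1 * y[n - 1]

# lfilter writes the recursion as a[0]y[n] + a[1]y[n-1] = b[0]x[n] + b[1]x[n-1],
# so the slide's +a_1 feedback corresponds to a denominator of [1, -a_1].
y_scipy = lfilter(b, [1.0, -a1], x)
assert np.allclose(y, y_scipy)
print(y)
```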

  7. Outline (the table of contents above, repeated).

  8. Convolutional Neural Net = Nonlinear(FIR)
$$y[n] = g\left(\sum_{m=0}^{N-1} h[m]\, x[n-m]\right)$$
The coefficients, h[m], are chosen to minimize some kind of error. For example, suppose that the goal is to make y[n] resemble a target signal t[n]; then we might use
$$E = \frac{1}{2} \sum_{n=0}^{N} \left(y[n] - t[n]\right)^2$$
and choose
$$h[n] \leftarrow h[n] - \eta \frac{dE}{dh[n]}$$
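
A minimal sketch of this training idea, with made-up data and tanh as the nonlinearity g; the gradient is estimated here by central finite differences rather than by back-propagation (which the later slides derive):

```python
import numpy as np

def forward(h, x, g=np.tanh):
    """y[n] = g(sum_m h[m] x[n-m]): a nonlinear FIR ("CNN") layer."""
    e = np.convolve(x, h)[:len(x)]
    return g(e)

def error(h, x, t):
    """E = 0.5 * sum_n (y[n] - t[n])**2."""
    return 0.5 * np.sum((forward(h, x) - t) ** 2)

# Made-up data: the target is produced by a "true" 3-tap filter.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
t = np.tanh(np.convolve(x, [0.8, -0.3, 0.1])[:50])

h, eta, eps = np.zeros(3), 0.01, 1e-6
for step in range(1000):
    # dE/dh[m] by central finite differences, then h[m] <- h[m] - eta * dE/dh[m].
    grad = np.array([(error(h + eps * np.eye(3)[m], x, t)
                      - error(h - eps * np.eye(3)[m], x, t)) / (2 * eps)
                     for m in range(3)])
    h -= eta * grad
print(h)   # should move toward the "true" filter [0.8, -0.3, 0.1]
```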

  9. Recurrent Neural Net (RNN) = Nonlinear(IIR)
$$y[n] = g\left(x[n] + \sum_{m=1}^{M-1} a_m y[n-m]\right)$$
The coefficients, $a_m$, are chosen to minimize the error. For example, suppose that the goal is to make y[n] resemble a target signal t[n]; then we might use
$$E = \frac{1}{2} \sum_{n=0}^{N} \left(y[n] - t[n]\right)^2$$
and choose
$$a_m \leftarrow a_m - \eta \frac{dE}{da_m}$$
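
A small forward-pass sketch of this recursion, assuming a single feedback coefficient $a_1$ and g = tanh (the input x is a made-up impulse):

```python
import numpy as np

def rnn_forward(x, a, g=np.tanh):
    """y[n] = g(x[n] + sum_{m=1}^{M-1} a[m] y[n-m]), with a[0] unused."""
    M = len(a)
    y = np.zeros(len(x))
    for n in range(len(x)):
        e = x[n] + sum(a[m] * y[n - m] for m in range(1, M) if n - m >= 0)
        y[n] = g(e)
    return y

# Made-up input and a single feedback coefficient a_1 = 0.5.
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = rnn_forward(x, a=[0.0, 0.5])
print(y)   # the response decays because |a_1| < 1 and |tanh| < 1
```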

  10. Outline (the table of contents above, repeated).

  11. Review: Excitation and Activation
The activation of a hidden node is the output of the nonlinearity (for this reason, the nonlinearity is sometimes called the activation function). For example, in a fully-connected network with outputs $z_l$, weights $\vec{v}$, bias $v_0$, nonlinearity g(), and hidden node activations $\vec{y}$, the activation of the $l$-th output node is
$$z_l = g\left(v_{l0} + \sum_{k=1}^{p} v_{lk} y_k\right)$$
The excitation of a hidden node is the input of the nonlinearity. For example, the excitation of the node above is
$$e_l = v_{l0} + \sum_{k=1}^{p} v_{lk} y_k$$

  12. Backprop = Derivative w.r.t. Excitation
The excitation of a hidden node is the input of the nonlinearity. For example, the excitation of the node above is
$$e_l = v_{l0} + \sum_{k=1}^{p} v_{lk} y_k$$
The gradient of the error w.r.t. the weight is
$$\frac{dE}{dv_{lk}} = \epsilon_l\, y_k$$
where $\epsilon_l$ is the derivative of the error w.r.t. the $l$-th excitation:
$$\epsilon_l = \frac{dE}{de_l}$$

  13. Backprop for Fully-Connected Network
Suppose we have a fully-connected network, with inputs $\vec{x}$, weight matrices U and V, nonlinearities g() and h(), and output z:
$$e_k = u_{k0} + \sum_j u_{kj} x_j$$
$$y_k = g(e_k)$$
$$e_l = v_{l0} + \sum_k v_{lk} y_k$$
$$z_l = h(e_l)$$
Then the back-prop gradients are the derivatives of E with respect to the excitations at each node:
$$\epsilon_l = \frac{dE}{de_l}, \qquad \delta_k = \frac{dE}{de_k}$$
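
A compact numerical sketch of the definitions on slides 12 and 13 for a tiny two-layer network, assuming arbitrary random weights, g = tanh, an identity output nonlinearity h(), and the squared error $E = \frac{1}{2}\|z - t\|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4)                            # input vector
U, u0 = rng.standard_normal((3, 4)), np.zeros(3)      # first-layer weights, bias
V, v0 = rng.standard_normal((2, 3)), np.zeros(2)      # second-layer weights, bias
t = rng.standard_normal(2)                            # target for the output z

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2                # derivative of g

# Forward pass: excitations (inputs to the nonlinearities) and activations.
e_hidden = u0 + U @ x          # e_k
y = g(e_hidden)                # y_k = g(e_k)
e_out = v0 + V @ y             # e_l
z = e_out                      # assumption: output nonlinearity h() = identity

# Backward pass with E = 0.5 * ||z - t||^2:
epsilon = z - t                              # epsilon_l = dE/de_l
delta = (V.T @ epsilon) * gdot(e_hidden)     # delta_k  = dE/de_k

# Weight gradients follow dE/dv_lk = epsilon_l * y_k and dE/du_kj = delta_k * x_j
# (the bias gradients are epsilon_l and delta_k themselves).
dE_dV, dE_dU = np.outer(epsilon, y), np.outer(delta, x)
```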

  14. Back-Prop in a CNN
Suppose we have a convolutional neural net, defined by
$$e[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m]$$
$$y[n] = g(e[n])$$
then
$$\frac{dE}{dh[m]} = \sum_n \delta[n]\, x[n-m]$$
where $\delta[n]$ is the back-prop gradient, defined by
$$\delta[n] = \frac{dE}{de[n]}$$
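
The same formulas as a NumPy sketch, with made-up x, h, and t, g = tanh, and $E = \frac{1}{2}\sum_n (y[n]-t[n])^2$; a finite-difference check confirms the gradient formula:

```python
import numpy as np

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2

rng = np.random.default_rng(2)
x, t = rng.standard_normal(30), rng.standard_normal(30)
h = rng.standard_normal(4)
N = len(h)

# Forward pass: e[n] = sum_m h[m] x[n-m], y[n] = g(e[n]).
e = np.convolve(x, h)[:len(x)]
y = g(e)

# With no feedback, E depends on e[n] only through y[n]:
# delta[n] = dE/de[n] = (y[n] - t[n]) * gdot(e[n]).
delta = (y - t) * gdot(e)

# dE/dh[m] = sum_n delta[n] x[n-m]  (a correlation of delta with x).
dE_dh = np.array([sum(delta[n] * x[n - m] for n in range(len(x)) if n - m >= 0)
                  for m in range(N)])

# Quick finite-difference check of one coefficient.
def E_of(hh):
    return 0.5 * np.sum((g(np.convolve(x, hh)[:len(x)]) - t) ** 2)

eps = 1e-6
num = (E_of(h + eps * np.eye(N)[0]) - E_of(h - eps * np.eye(N)[0])) / (2 * eps)
assert np.isclose(dE_dh[0], num, rtol=1e-4)
```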

  15. Back-Prop in an RNN
Suppose we have a recurrent neural net, defined by
$$e[n] = x[n] + \sum_{m=1}^{M-1} a_m y[n-m]$$
$$y[n] = g(e[n])$$
then
$$\frac{dE}{da_m} = \sum_n \delta[n]\, y[n-m]$$
where y[n-m] is calculated by forward-propagation, and then $\delta[n]$ is calculated by back-propagation as
$$\delta[n] = \frac{dE}{de[n]}$$
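
The weight-gradient sum on this slide, sketched as a small helper; the $\delta[n]$ are assumed to have been computed already by back-prop through time, which the next sections cover:

```python
import numpy as np

def rnn_weight_gradient(delta, y, M):
    """dE/da_m = sum_n delta[n] * y[n-m], for m = 1, ..., M-1.

    Assumes delta[n] = dE/de[n] has already been obtained by
    back-prop through time, and y is the forward-pass output.
    """
    N = len(y)
    return np.array([sum(delta[n] * y[n - m] for n in range(N) if n - m >= 0)
                     for m in range(1, M)])
```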

  16. Outline (the table of contents above, repeated).

  17. Partial vs. Full Derivatives
For example, suppose we want y[n] to be as close as possible to some target signal t[n]:
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$
Notice that E depends on y[n] in many different ways:
$$\frac{dE}{dy[n]} = \frac{\partial E}{\partial y[n]} + \frac{dE}{dy[n+1]} \frac{\partial y[n+1]}{\partial y[n]} + \frac{dE}{dy[n+2]} \frac{\partial y[n+2]}{\partial y[n]} + \cdots$$

  18. Partial vs. Full Derivatives
In general,
$$\frac{dE}{dy[n]} = \frac{\partial E}{\partial y[n]} + \sum_{m=1}^{\infty} \frac{dE}{dy[n+m]} \frac{\partial y[n+m]}{\partial y[n]}$$
where $\frac{dE}{dy[n]}$ is the total derivative, and includes all of the different ways in which E depends on y[n], and $\frac{\partial y[n+m]}{\partial y[n]}$ is the partial derivative, i.e., the change in y[n+m] per unit change in y[n] if all of the other variables (all other values of y[n+k]) are held constant.

  19. Partial vs. Full Derivatives
So for example, if
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$
then the partial derivative of E w.r.t. y[n] is
$$\frac{\partial E}{\partial y[n]} = y[n] - t[n]$$
and the total derivative of E w.r.t. y[n] is
$$\frac{dE}{dy[n]} = \left(y[n] - t[n]\right) + \sum_{m=1}^{\infty} \frac{dE}{dy[n+m]} \frac{\partial y[n+m]}{\partial y[n]}$$

  20. Partial vs. Full Derivatives
So for example, if
$$y[n] = g\left(x[n] + \sum_{m=1}^{M-1} a_m y[n-m]\right)$$
then the partial derivative of y[n+k] w.r.t. y[n] is
$$\frac{\partial y[n+k]}{\partial y[n]} = a_k\, \dot{g}\left(x[n+k] + \sum_{m=1}^{M-1} a_m y[n+k-m]\right)$$
where $\dot{g}(x) = \frac{dg}{dx}$ is the derivative of the nonlinearity. The total derivative of y[n+k] w.r.t. y[n] is
$$\frac{dy[n+k]}{dy[n]} = \frac{\partial y[n+k]}{\partial y[n]} + \sum_{j=1}^{k-1} \frac{dy[n+k]}{dy[n+j]} \frac{\partial y[n+j]}{\partial y[n]}$$
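
A dynamic-programming sketch of this recursion; the function assumes the excitations e[n] from the forward pass, a coefficient list a with a[0] unused, the derivative gdot of the nonlinearity, and k >= 1 (all names are illustrative):

```python
import numpy as np

def total_derivative(e, a, gdot, n, k):
    """dy[n+k]/dy[n] for y[i] = g(e[i]), e[i] = x[i] + sum_{m>=1} a[m] y[i-m].

    e    : excitations from the forward pass
    a    : feedback coefficients, a[0] unused
    gdot : derivative of the nonlinearity g
    k    : must be >= 1
    """
    M = len(a)

    def partial(i, j):
        # partial y[i] / partial y[j] with all other y's held fixed (slide 20):
        # a_{i-j} * gdot(e[i]) when 1 <= i-j <= M-1, and 0 otherwise.
        return a[i - j] * gdot(e[i]) if 1 <= i - j < M else 0.0

    # total[j] holds dy[n+k] / dy[n+j]; fill it backward from j = k-1 down to
    # j = 0 using the recursion on this slide.
    total = {}
    for j in range(k - 1, -1, -1):
        total[j] = partial(n + k, n + j) + sum(
            total[jj] * partial(n + jj, n + j) for jj in range(j + 1, k))
    return total[0]
```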

  21. Synchronous Backprop vs. BPTT
The basic idea of back-prop-through-time is divide-and-conquer.
1. Synchronous backprop: first, calculate the partial derivative of E w.r.t. the excitation e[n] at time n, assuming that all other time steps are held constant:
$$\epsilon[n] = \frac{\partial E}{\partial e[n]}$$
2. Back-prop through time: second, iterate backward through time to calculate the total derivative:
$$\delta[n] = \frac{dE}{de[n]}$$

  22. Synchronous Backprop in an RNN
Suppose we have a recurrent neural net, defined by
$$e[n] = x[n] + \sum_{m=1}^{M-1} a_m y[n-m]$$
$$y[n] = g(e[n])$$
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$
then
$$\epsilon[n] = \frac{\partial E}{\partial e[n]} = \left(y[n] - t[n]\right) \dot{g}(e[n])$$
where $\dot{g}(x) = \frac{dg}{dx}$ is the derivative of the nonlinearity.
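
A sketch putting the pieces together for this RNN, with made-up data and g = tanh. The backward recursion for delta[n] in the code is not stated on this slide; it is assumed here as the standard chain-rule (BPTT) consequence of the model above, and the resulting gradient uses dE/da_m = sum_n delta[n] y[n-m] from slide 15:

```python
import numpy as np

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2

def forward(x, a):
    """e[n] = x[n] + sum_{m>=1} a[m] y[n-m];  y[n] = g(e[n])."""
    M, N = len(a), len(x)
    e, y = np.zeros(N), np.zeros(N)
    for n in range(N):
        e[n] = x[n] + sum(a[m] * y[n - m] for m in range(1, M) if n - m >= 0)
        y[n] = g(e[n])
    return e, y

def bptt(x, t, a):
    e, y = forward(x, a)
    M, N = len(a), len(x)
    # Synchronous backprop: eps[n] = partial E / partial e[n], other steps fixed.
    eps = (y - t) * gdot(e)
    # Back-prop through time (assumed chain-rule recursion, standard BPTT):
    # delta[n] = eps[n] + gdot(e[n]) * sum_m a[m] * delta[n+m].
    delta = np.zeros(N)
    for n in range(N - 1, -1, -1):
        delta[n] = eps[n] + gdot(e[n]) * sum(
            a[m] * delta[n + m] for m in range(1, M) if n + m < N)
    # dE/da_m = sum_n delta[n] * y[n-m], for m = 1, ..., M-1.
    return np.array([sum(delta[n] * y[n - m] for n in range(N) if n - m >= 0)
                     for m in range(1, M)])

# Made-up data: input x, target t, and one feedback coefficient a_1 = 0.4.
rng = np.random.default_rng(3)
x, t, a = rng.standard_normal(20), rng.standard_normal(20), [0.0, 0.4]
print(bptt(x, t, a))   # gradient dE/da_1
```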
