SLIDE 1

Long/Short-Term Memory

Mark Hasegawa-Johnson

All content CC-SA 4.0 unless otherwise specified.

University of Illinois

ECE 417: Multimedia Signal Processing, Fall 2020

SLIDE 2

Outline

1. Review: Recurrent Neural Networks
2. Vanishing/Exploding Gradient
3. Running Example: a Pocket Calculator
4. Regular RNN
5. Forget Gate
6. Long Short-Term Memory (LSTM)
7. Backprop for an LSTM
8. Conclusion


SLIDE 4

Recurrent Neural Net (RNN) = Nonlinear(IIR)

Image CC-SA-4.0 by Ixnay, https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg
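To make the "nonlinear IIR filter" analogy concrete, here is a minimal sketch of a one-node RNN (my own illustration, not code from the lecture, using the notation introduced later: feedforward weight w, feedback weight u, bias b, and nonlinearity σ_h):

```python
import numpy as np

def rnn_forward(x, w, u, b, sigma_h=np.tanh):
    """One-node RNN: h[t] = sigma_h(w*x[t] + u*h[t-1] + b), a nonlinear IIR filter."""
    h, h_prev = np.zeros(len(x)), 0.0
    for t in range(len(x)):
        h[t] = sigma_h(w * x[t] + u * h_prev + b)
        h_prev = h[t]
    return h

# With a linear sigma_h this is exactly a first-order IIR filter (impulse response u**t):
print(rnn_forward(np.array([1., 0., 0., 0.]), w=1.0, u=0.5, b=0.0, sigma_h=lambda c: c))
```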

SLIDE 5

Back-Propagation and Causal Graphs

[Figure: causal graph with input x, hidden nodes h_0 and h_1, and output ŷ]

\[ \frac{d\hat{y}}{dx} = \sum_{i=0}^{N-1} \frac{d\hat{y}}{dh_i}\,\frac{\partial h_i}{\partial x} \]

For each h_i, we find the total derivative of ŷ w.r.t. h_i, multiplied by the partial derivative of h_i w.r.t. x.

SLIDE 6

Back-Propagation Through Time

Back-propagation through time computes the error gradient at each time step based on the error gradients at future time steps. If the forward-prop equations are

\[ \hat{y}[n] = g(e[n]), \qquad e[n] = x[n] + \sum_{m=1}^{M-1} w[m]\,\hat{y}[n-m], \]

then the BPTT equation is

\[ \delta[n] = \frac{dE}{de[n]} = \frac{\partial E}{\partial e[n]} + \left( \sum_{m=1}^{M-1} \delta[n+m]\, w[m] \right) \dot{g}(e[n]) \]

The weight update, for an RNN, multiplies the back-prop times the forward-prop:

\[ \frac{dE}{dw[m]} = \sum_n \delta[n]\,\hat{y}[n-m] \]
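The three equations above translate directly into code. Here is a minimal numpy sketch (my own illustration, not code from the lecture), assuming a squared-error loss E = ½ Σ_n (ŷ[n] − y[n])²; the function and variable names are mine:

```python
import numpy as np

def bptt_scalar_rnn(x, y, w, g=lambda e: e, gdot=lambda e: 1.0):
    """Forward prop, BPTT, and weight gradient for
    y_hat[n] = g(e[n]),  e[n] = x[n] + sum_{m=1}^{M-1} w[m]*y_hat[n-m],
    with E = 0.5 * sum_n (y_hat[n] - y[n])**2.  w[0] is ignored."""
    N, M = len(x), len(w)
    e, y_hat = np.zeros(N), np.zeros(N)
    for n in range(N):                               # 1. forward propagation
        e[n] = x[n] + sum(w[m] * y_hat[n - m] for m in range(1, M) if n >= m)
        y_hat[n] = g(e[n])
    delta = np.zeros(N)
    for n in reversed(range(N)):                     # 2. back-prop through time
        future = sum(w[m] * delta[n + m] for m in range(1, M) if n + m < N)
        delta[n] = ((y_hat[n] - y[n]) + future) * gdot(e[n])
    dE_dw = np.zeros(M)                              # 3. weight gradient
    for m in range(1, M):
        dE_dw[m] = sum(delta[n] * y_hat[n - m] for n in range(m, N))
    return y_hat, delta, dE_dw

# Example with a single feedback tap w[1] = 0.5:
x = np.array([1., 0., 0., 0.]); y = np.array([1., 1., 1., 1.])
y_hat, delta, dE_dw = bptt_scalar_rnn(x, y, w=np.array([0.0, 0.5]))
```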


SLIDE 8

Vanishing/Exploding Gradient

The "vanishing gradient" problem refers to the tendency of dŷ[n + m]/de[n] to disappear, exponentially, when m is large.

The "exploding gradient" problem refers to the tendency of dŷ[n + m]/de[n] to explode toward infinity, exponentially, when m is large.

If the largest feedback coefficient is |w[m]| > 1, then you get exploding gradient. If |w[m]| < 1, you get vanishing gradient.
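A quick numeric sketch of why this happens (my own illustration): with a single linear feedback tap w, the gradient dŷ[n + m]/de[n] is proportional to w^m, which either collapses toward zero or blows up as m grows.

```python
# Gradient scale after m time steps, for a single linear feedback tap w
for w in (0.9, 1.1):
    print(f"w = {w}:", [f"{w**m:.3g}" for m in (1, 10, 50, 100)])
# w = 0.9 -> 0.9, 0.349, 0.00515, 2.66e-05   (vanishing)
# w = 1.1 -> 1.1, 2.59, 117, 1.38e+04        (exploding)
```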

SLIDE 9

Example: A Memorizer Network

Suppose that we have a very simple RNN:

ŷ[n] = w x[n] + u ŷ[n − 1]

Suppose that x[n] is only nonzero at time 0:

\[ x[n] = \begin{cases} x_0 & n = 0 \\ 0 & n \neq 0 \end{cases} \]

Suppose that, instead of measuring x[0] directly, we are only allowed to measure the output of the RNN m time steps later. Our goal is to learn w and u so that ŷ[m] remembers x_0, thus:

\[ E = \frac{1}{2}\left(\hat{y}[m] - x_0\right)^2 \]

SLIDE 10

Example: A Memorizer Network

Now, how do we perform gradient update of the weights? If ŷ[n] = w x[n] + u ŷ[n − 1], then

\[ \frac{dE}{dw} = \sum_n \frac{dE}{d\hat{y}[n]} \frac{\partial \hat{y}[n]}{\partial w} = \sum_n \frac{dE}{d\hat{y}[n]}\, x[n] = \frac{dE}{d\hat{y}[0]}\, x_0 \]

But the error is defined as E = ½ (ŷ[m] − x_0)², so

\[ \frac{dE}{d\hat{y}[0]} = u\,\frac{dE}{d\hat{y}[1]} = u^2\,\frac{dE}{d\hat{y}[2]} = \cdots = u^m\,(\hat{y}[m] - x_0) \]

SLIDE 11

Example: Vanishing Gradient

So we find that the gradient w.r.t. the coefficient w is either exponentially small or exponentially large, depending on whether |u| < 1 or |u| > 1:

\[ \frac{dE}{dw} = x_0\, (\hat{y}[m] - x_0)\, u^m \]

In other words, if our application requires the neural net to wait m time steps before generating its output, then the gradient is exponentially smaller, and therefore training the neural net is exponentially harder.

[Figure: Exponential Decay]

Image CC-SA-4.0, PeterQ, Wikipedia
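As a sanity check (my own sketch, not from the lecture), the closed-form gradient above can be compared against a finite-difference estimate of dE/dw for the memorizer network; the particular values of w, u, x_0, and m are arbitrary illustrative choices:

```python
import numpy as np

def memorizer_error(w, u, x0, m):
    """E = 0.5*(y_hat[m] - x0)**2 for y_hat[n] = w*x[n] + u*y_hat[n-1], x = [x0, 0, 0, ...]."""
    y_hat = w * x0                     # y_hat[0]
    for _ in range(m):
        y_hat = u * y_hat              # input is zero for n > 0
    return 0.5 * (y_hat - x0) ** 2

w, u, x0, m = 0.8, 0.9, 2.0, 20
closed_form = x0 * (w * x0 * u**m - x0) * u**m          # dE/dw = x0*(y_hat[m] - x0)*u**m
eps = 1e-6
finite_diff = (memorizer_error(w + eps, u, x0, m) - memorizer_error(w - eps, u, x0, m)) / (2 * eps)
print(closed_form, finite_diff)        # the two agree; both shrink like u**m as m grows
```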


SLIDE 13

Notation

Today's lecture will try to use notation similar to the Wikipedia page for LSTM.

x[t] = input at time t
y[t] = target/desired output
c[t] = excitation at time t, OR the LSTM cell
h[t] = activation at time t, OR the LSTM output
u = feedback coefficient
w = feedforward coefficient
b = bias

SLIDE 14

Running Example: a Pocket Calculator

The rest of this lecture will refer to a toy application called "pocket calculator."

Pocket Calculator
When x[t] > 0, add it to the current tally: c[t] = c[t − 1] + x[t].
When x[t] = 0,
1. Print out the current tally, h[t] = c[t − 1], and then
2. Reset the tally to zero, c[t] = 0.

Example Signals
Input: x[t] = 1, 2, 1, 0, 1, 1, 1, 0
Target Output: y[t] = 0, 0, 0, 4, 0, 0, 0, 3
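For reference, a short sketch of the task itself (my own transcription of the rules above into code; the function name is mine):

```python
def pocket_calculator(x):
    """Return the target outputs y[t] for the pocket-calculator task."""
    y, tally = [], 0
    for xt in x:
        if xt > 0:
            tally += xt          # accumulate the input, print nothing (output 0)
            y.append(0)
        else:
            y.append(tally)      # print the current tally ...
            tally = 0            # ... then reset it
    return y

print(pocket_calculator([1, 2, 1, 0, 1, 1, 1, 0]))   # [0, 0, 0, 4, 0, 0, 0, 3]
```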



SLIDE 17

One-Node One-Tap Linear RNN

Suppose that we have a very simple RNN:

Excitation: c[t] = x[t] + u h[t − 1]
Activation: h[t] = σ_h(c[t])

where σ_h() is some feedback nonlinearity. In this simple example, let's just use σ_h(c[t]) = c[t], i.e., no nonlinearity.

GOAL: Find u so that h[t] ≈ y[t]. In order to make the problem easier, we will only score an "error" when y[t] > 0:

\[ E = \frac{1}{2} \sum_{t: y[t] > 0} (h[t] - y[t])^2 \]

SLIDE 18

RNN: u = 1?

Obviously, if we want to just add numbers, we should just set u = 1. Then the RNN is computing

Excitation: c[t] = x[t] + h[t − 1]
Activation: h[t] = σ_h(c[t])

That works until the first zero-valued input. But then it just keeps on adding.

[Figure: RNN with u = 1]

SLIDE 19

RNN: u = 0.5?

Can we get decent results using u = 0.5?

Advantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (c[t] ≈ 0), so a hard reset is not necessary.
Disadvantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (h[t] ≈ 0).

[Figure: RNN with u = 0.5]

SLIDE 20

Gradient Descent

c[t] = x[t] + u h[t − 1]
h[t] = σ_h(c[t])

Let's try initializing u = 0.5, and then performing gradient descent to improve it. Gradient descent has five steps:

1. Forward Propagation: c[t] = x[t] + u h[t − 1], h[t] = c[t].
2. Synchronous Backprop: ε[t] = ∂E/∂c[t].
3. Back-Prop Through Time: δ[t] = dE/dc[t].
4. Weight Gradient: dE/du = Σ_t δ[t] h[t − 1].
5. Gradient Descent: u ← u − η dE/du.

SLIDE 21

Gradient Descent

Excitation: c[t] = x[t] + u h[t − 1]
Activation: h[t] = σ_h(c[t])

\[ \text{Error: } E = \frac{1}{2} \sum_{t: y[t] > 0} (h[t] - y[t])^2 \]

So the back-prop stages are:

\[ \text{Synchronous Backprop: } \epsilon[t] = \frac{\partial E}{\partial c[t]} = \begin{cases} h[t] - y[t] & y[t] > 0 \\ 0 & \text{otherwise} \end{cases} \]

\[ \text{BPTT: } \delta[t] = \frac{dE}{dc[t]} = \epsilon[t] + u\,\delta[t+1] \]

\[ \text{Weight Gradient: } \frac{dE}{du} = \sum_t \delta[t]\, h[t-1] \]
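A minimal sketch of the five steps for this one-node RNN, using the backprop equations above (my own illustration; the learning rate η = 0.01 and the number of epochs are arbitrary choices, and the data are the example signals from the pocket-calculator slide):

```python
import numpy as np

x = np.array([1., 2., 1., 0., 1., 1., 1., 0.])
y = np.array([0., 0., 0., 4., 0., 0., 0., 3.])
u, eta = 0.5, 0.01

for epoch in range(3):
    # 1. Forward propagation: c[t] = x[t] + u*h[t-1], h[t] = c[t]
    h = np.zeros(len(x))
    for t in range(len(x)):
        h[t] = x[t] + u * (h[t - 1] if t > 0 else 0.0)
    # 2. Synchronous backprop: eps[t] = h[t] - y[t] where y[t] > 0, else 0
    eps = np.where(y > 0, h - y, 0.0)
    # 3. Back-prop through time: delta[t] = eps[t] + u*delta[t+1]
    delta = np.zeros(len(x))
    for t in reversed(range(len(x))):
        delta[t] = eps[t] + (u * delta[t + 1] if t + 1 < len(x) else 0.0)
    # 4. Weight gradient: dE/du = sum_t delta[t]*h[t-1]
    dE_du = sum(delta[t] * h[t - 1] for t in range(1, len(x)))
    # 5. Gradient descent
    u -= eta * dE_du
    print(f"epoch {epoch}: E = {0.5 * np.sum(eps**2):.3f}, u = {u:.3f}")
```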

SLIDE 22

Backprop Stages

\[ \epsilon[t] = \begin{cases} h[t] - y[t] & y[t] > 0 \\ 0 & \text{otherwise} \end{cases} \]

\[ \delta[t] = \epsilon[t] + u\,\delta[t+1] \]

\[ \frac{dE}{du} = \sum_t \delta[t]\, h[t-1] \]

[Figure: Backprop Stages, u = 0.5]

SLIDE 23

Vanishing Gradient and Exploding Gradient

Notice that, with |u| < 1, δ[t] tends to vanish exponentially fast as we go backward in time. This is called the vanishing gradient problem. It is a big problem for RNNs with long time-dependency, and for deep neural nets with many layers. If we set |u| > 1, we get an even worse problem, sometimes called the exploding gradient problem.

SLIDE 24

RNN, u = 1.7

c[t] = x[t] + u h[t − 1]
δ[t] = ε[t] + u δ[t + 1]

[Figure: RNN with u = 1.7]


SLIDE 26

Hochreiter and Schmidhuber's Solution: The Forget Gate

Instead of multiplying by the same weight, u, at each time step, Hochreiter and Schmidhuber proposed: let's make the feedback coefficient a function of the input!

Excitation: c[t] = x[t] + f[t] h[t − 1]
Activation: h[t] = σ_h(c[t])
Forget Gate: f[t] = σ_g(w_f x[t] + u_f h[t − 1] + b_f)

where σ_h() and σ_g() might be different nonlinearities. In particular, it's OK for σ_h() to be linear (σ_h(c) = c), but σ_g() should be clipped so that 0 ≤ f[t] ≤ 1, in order to avoid gradient explosion.
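A sketch of this forget-gate recursion (my own illustration; σ_h is the identity, as the slide allows, and σ_g is passed in as a parameter because the next two slides compare two choices for it):

```python
import numpy as np

def forget_gate_rnn(x, wf, uf, bf, sigma_g):
    """c[t] = x[t] + f[t]*h[t-1],  h[t] = c[t],  f[t] = sigma_g(wf*x[t] + uf*h[t-1] + bf)."""
    h_prev, h, f = 0.0, [], []
    for xt in x:
        ft = sigma_g(wf * xt + uf * h_prev + bf)
        ct = xt + ft * h_prev
        h_prev = ct                  # sigma_h is the identity here
        h.append(ct); f.append(ft)
    return np.array(h), np.array(f)

# Example with a clipped (0..1) gate nonlinearity:
h, f = forget_gate_rnn([1, 2, 1, 0], wf=1.0, uf=0.0, bf=0.0,
                       sigma_g=lambda z: min(1.0, max(0.0, z)))
```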

SLIDE 27

The Forget-Gate Nonlinearity

The forget gate is f[t] = σ_g(w_f x[t] + u_f h[t − 1] + b_f), where σ_g() is some nonlinearity such that 0 ≤ σ_g() ≤ 1. Two such nonlinearities are worth knowing about.

SLIDE 28

Forget-Gate Nonlinearity #1: CReLU

The first useful nonlinearity is the CReLU (clipped rectified linear unit), defined as

\[ \sigma_g(w_f x + u_f h + b_f) = \min\left(1, \max\left(0, w_f x + u_f h + b_f\right)\right) \]

The CReLU is particularly useful for knowledge-based design. That's because σ(1) = 1 and σ(0) = 0, so it is relatively easy to design the weights w_f, u_f, and b_f to get the results you want.

The CReLU is not very useful, though, if you want to choose your weights using gradient descent. What usually happens is that w_f grows larger and larger for the first 2–3 epochs of training, and then suddenly w_f is so large that σ̇(w_f x + u_f h + b_f) = 0 for all training tokens. At that point, the gradient is dE/dw = 0, so further gradient-descent training is useless.
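A sketch of the CReLU and its derivative (my own illustration); the derivative is 1 only for 0 < z < 1 and 0 once the unit saturates, which is exactly the gradient-descent hazard described above:

```python
import numpy as np

def crelu(z):
    """Clipped ReLU: min(1, max(0, z))."""
    return np.minimum(1.0, np.maximum(0.0, z))

def crelu_dot(z):
    """Derivative of the CReLU: 1 for 0 < z < 1, 0 once the unit saturates."""
    return ((z > 0) & (z < 1)).astype(float)

print(crelu(np.array([-0.5, 0.3, 2.0])))      # [0.  0.3 1. ]
print(crelu_dot(np.array([-0.5, 0.3, 2.0])))  # [0. 1. 0.]
```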

SLIDE 29

Forget-Gate Nonlinearity #2: Logistic Sigmoid

The second useful nonlinearity is the logistic sigmoid, defined as:

\[ \sigma_g(w_f x + u_f h + b_f) = \frac{1}{1 + e^{-(w_f x + u_f h + b_f)}} \]

The logistic sigmoid is particularly useful for gradient descent. That's because its gradient is defined for all values of w_f. In fact, it has a really simple form that can be written in terms of the output: σ̇ = σ(1 − σ).

The logistic sigmoid is not as useful for knowledge-based design. That's because 0 < σ < 1: as x → −∞, σ(x) → 0, but it never quite reaches it. Likewise, as x → ∞, σ(x) → 1, but it never quite reaches it.
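A matching sketch of the logistic sigmoid (my own illustration), with the derivative written in terms of the output, σ̇ = σ(1 − σ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_dot(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # the derivative, written in terms of the output

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))       # approaches 0 and 1 but never quite reaches them
print(sigmoid_dot(z))   # nonzero everywhere, so gradient descent never stalls completely
```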

SLIDE 30

Pocket Calculator

When x[t] > 0, accumulate the input, and print out nothing.
When x[t] = 0, print out the accumulator, then reset.

. . . but the "print out nothing" part is not scored, only the accumulation. Furthermore, nonzero input is always x[t] ≥ 1.

SLIDE 31

Pocket Calculator

With zero error, we can approximate the pocket calculator as:

When x[t] ≥ 1, accumulate the input.
When x[t] = 0, print out the accumulator, then reset.

\[ E = \frac{1}{2} \sum_{t: y[t] > 0} (h[t] - y[t])^2 = 0 \]
SLIDE 32

Forget-Gate Implementation of the Pocket Calculator

It seems like we can approximate the pocket calculator as:

When x[t] ≥ 1, accumulate the input: c[t] = x[t] + h[t − 1].
When x[t] = 0, print out the accumulator, then reset: c[t] = x[t].

So it seems that we just want the forget gate set to

\[ f[t] = \begin{cases} 1 & x[t] \geq 1 \\ 0 & x[t] = 0 \end{cases} \]

This can be accomplished as f[t] = CReLU(x[t]) = max(0, min(1, x[t])).
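Running this design on the example signals (my own sketch) shows the problem discussed in "What Went Wrong?" below: the forget gate resets the tally on the same step that x[t] = 0, so the tally is erased exactly when it should be printed.

```python
import numpy as np

def crelu(z):
    return np.minimum(1.0, np.maximum(0.0, z))

x = np.array([1., 2., 1., 0., 1., 1., 1., 0.])
h_prev, h = 0.0, []
for xt in x:
    ft = crelu(xt)               # forget gate f[t] = CReLU(x[t])
    ct = xt + ft * h_prev        # c[t] = x[t] + f[t]*h[t-1]
    h_prev = ct                  # h[t] = c[t]
    h.append(ct)
print(h)   # [1, 3, 4, 0, 1, 2, 3, 0] -- the tally is erased exactly when it should be printed
```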

SLIDE 33

Forget-Gate Implementation of the Pocket Calculator

c[t] = x[t] + f[t] h[t − 1]
h[t] = c[t]
f[t] = CReLU(x[t])

[Figure: Forward Prop]

SLIDE 34

Forget-Gate Implementation of the Pocket Calculator

c[t] = x[t] + f[t] h[t − 1]
h[t] = c[t]
f[t] = CReLU(x[t])

[Figure: Back Prop]

SLIDE 35

What Went Wrong?

The forget gate correctly turned itself on (remember the past) when x[t] > 0, and turned itself off (forget the past) when x[t] = 0. Unfortunately, we don’t want to forget the past when x[t] = 0. We want to forget the past on the next time step after x[t] = 0. Coincidentally, we also don’t want any output when x[t] > 0. The error criterion doesn’t score those samples, but maybe it should.


SLIDE 37

Long Short-Term Memory (LSTM)

The LSTM solves those problems by defining two types of memory, and three types of gates.

The two types of memory are:
1. The "cell," c[t], corresponds to the excitation in an RNN.
2. The "output" or "prediction," h[t], corresponds to the activation in an RNN.

The three gates are:
1. The cell remembers the past only when the forget gate is on, f[t] = 1.
2. The cell accepts input only when the input gate is on, i[t] = 1.
3. The cell is output only when the output gate is on, o[t] = 1.

SLIDE 38

Long Short-Term Memory (LSTM)

The three gates are:

1. The cell remembers the past only when the forget gate is on, f[t] = 1.
2. The cell accepts input only when the input gate is on, i[t] = 1.

c[t] = f[t] c[t − 1] + i[t] σ_h(w_c x[t] + u_c h[t − 1] + b_c)

3. The cell is output only when the output gate is on, o[t] = 1.

h[t] = o[t] c[t]

SLIDE 39

Characterizing Human Memory

[Diagram: LONG TERM and SHORT TERM memory, INPUT GATE, OUTPUT GATE, PERCEPTION, ACTION]

\[ \Pr\{\text{remember}\} = p_{\mathrm{LTM}}\, e^{-t/T_{\mathrm{LTM}}} + (1 - p_{\mathrm{LTM}})\, e^{-t/T_{\mathrm{STM}}} \]

SLIDE 40

When Should You Remember?

c[t] = f[t] c[t − 1] + i[t] σ_h(w_c x[t] + u_c h[t − 1] + b_c)
h[t] = o[t] c[t]

1. The forget gate is a function of current input and past output: f[t] = σ_g(w_f x[t] + u_f h[t − 1] + b_f)
2. The input gate is a function of current input and past output: i[t] = σ_g(w_i x[t] + u_i h[t − 1] + b_i)
3. The output gate is a function of current input and past output: o[t] = σ_g(w_o x[t] + u_o h[t − 1] + b_o)
SLIDE 41

Neural Network Model: LSTM

i[t] = input gate = σ_g(w_i x[t] + u_i h[t − 1] + b_i)
o[t] = output gate = σ_g(w_o x[t] + u_o h[t − 1] + b_o)
f[t] = forget gate = σ_g(w_f x[t] + u_f h[t − 1] + b_f)
c[t] = memory cell = f[t] c[t − 1] + i[t] σ_h(w_c x[t] + u_c h[t − 1] + b_c)
h[t] = output = o[t] c[t]
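A scalar (single-cell) sketch of these five equations (my own illustration; σ_g is the logistic sigmoid and σ_h = tanh here, though the lecture also allows a linear σ_h; the parameter names in the dictionary are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, p, sigma_g=sigmoid, sigma_h=np.tanh):
    """Scalar LSTM. p holds the weights wi, ui, bi, wo, uo, bo, wf, uf, bf, wc, uc, bc."""
    c_prev, h_prev, h_seq = 0.0, 0.0, []
    for xt in x:
        i = sigma_g(p['wi'] * xt + p['ui'] * h_prev + p['bi'])   # input gate
        o = sigma_g(p['wo'] * xt + p['uo'] * h_prev + p['bo'])   # output gate
        f = sigma_g(p['wf'] * xt + p['uf'] * h_prev + p['bf'])   # forget gate
        c = f * c_prev + i * sigma_h(p['wc'] * xt + p['uc'] * h_prev + p['bc'])  # memory cell
        h = o * c                                                # output
        c_prev, h_prev = c, h
        h_seq.append(h)
    return np.array(h_seq)

p = dict(wi=1, ui=0, bi=0, wo=1, uo=0, bo=0, wf=1, uf=0, bf=0, wc=1, uc=0, bc=0)
print(lstm_forward([1.0, 0.0, 1.0], p))
```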

SLIDE 42

Example: Pocket Calculator

i[t] = CReLU(1)
o[t] = CReLU(1 − x[t])
f[t] = CReLU(1 − h[t − 1])
c[t] = f[t] c[t − 1] + i[t] x[t]
h[t] = o[t] c[t]

[Figure: Forward Prop]
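Forward-propagating these hand-designed gates over the example input (my own sketch, reusing the CReLU and the signals from earlier slides) reproduces the pocket-calculator targets exactly:

```python
def crelu(z):
    return min(1.0, max(0.0, z))

x = [1, 2, 1, 0, 1, 1, 1, 0]
c_prev, h_prev, h_seq = 0.0, 0.0, []
for xt in x:
    i = crelu(1)               # input gate: always 1
    o = crelu(1 - xt)          # output gate: 1 only when x[t] = 0
    f = crelu(1 - h_prev)      # forget gate: 0 on the step after an output
    c = f * c_prev + i * xt    # memory cell
    h = o * c                  # output
    c_prev, h_prev = c, h
    h_seq.append(h)
print(h_seq)   # [0, 0, 0, 4, 0, 0, 0, 3] -- matches the target y[t]
```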


SLIDE 44

Backprop for a normal RNN

In a normal RNN, each epoch of gradient descent has five steps:

1. Forward-prop: find the node excitation and activation, moving forward through time.
2. Synchronous backprop: find the partial derivative of error w.r.t. node excitation at each time, assuming all other time steps are constant.
3. Back-prop through time: find the total derivative of error w.r.t. node excitation at each time.
4. Weight gradient: find the total derivative of error w.r.t. each weight and each bias.
5. Gradient descent: adjust each weight and bias in the direction of the negative gradient.

SLIDE 45

Backprop for an LSTM

An LSTM differs from a normal RNN in that, instead of just one memory unit at each time step, we now have two memory units and three gates. Each of them depends on the previous time step. Since there are so many variables, let's stop back-propagating to excitations. Instead, we'll just back-prop to compute the derivative of the error w.r.t. each of the variables:

\[ \delta_h[t] = \frac{dE}{dh[t]}, \quad \delta_c[t] = \frac{dE}{dc[t]}, \quad \delta_i[t] = \frac{dE}{di[t]}, \quad \delta_o[t] = \frac{dE}{do[t]}, \quad \delta_f[t] = \frac{dE}{df[t]} \]

The partial derivatives are easy, though. Error can't depend directly on any of the internal variables; it can only depend directly on the output, h[t]:

\[ \epsilon_h[t] = \frac{\partial E}{\partial h[t]} \]

SLIDE 46

Backprop for an LSTM

In an LSTM, we'll implement each epoch of gradient descent with five steps:

1. Forward-prop: find all five of the variables at each time step, moving forward through time.
2. Synchronous backprop: find the partial derivative of error w.r.t. h[t].
3. Back-prop through time: find the total derivative of error w.r.t. each of the five variables at each time, starting with h[t].
4. Weight gradient: find the total derivative of error w.r.t. each weight and each bias.
5. Gradient descent: adjust each weight and bias in the direction of the negative gradient.

SLIDE 47

Synchronous Back-Prop: the Output

Suppose the error term is

\[ E = \frac{1}{2} \sum_{t=-\infty}^{\infty} (h[t] - y[t])^2 \]

Then the first step in back-propagation is to calculate the partial derivative w.r.t. the prediction term h[t]:

\[ \epsilon_h[t] = \frac{\partial E}{\partial h[t]} = h[t] - y[t] \]

SLIDE 48

Synchronous Back-Prop: the other variables

Remember that the error is defined only in terms of the output, h[t]. So, actually, partial derivatives with respect to the other variables are all zero!

\[ \epsilon_i[t] = \frac{\partial E}{\partial i[t]} = 0, \quad \epsilon_o[t] = \frac{\partial E}{\partial o[t]} = 0, \quad \epsilon_f[t] = \frac{\partial E}{\partial f[t]} = 0, \quad \epsilon_c[t] = \frac{\partial E}{\partial c[t]} = 0 \]

SLIDE 49

Back-Prop Through Time

Back-prop through time is really tricky in an LSTM, because four of the five variables depend on the previous time step, either on h[t − 1] or on c[t − 1]:

i[t] = σ_g(w_i x[t] + u_i h[t − 1] + b_i)
o[t] = σ_g(w_o x[t] + u_o h[t − 1] + b_o)
f[t] = σ_g(w_f x[t] + u_f h[t − 1] + b_f)
c[t] = f[t] c[t − 1] + i[t] σ_h(w_c x[t] + u_c h[t − 1] + b_c)
h[t] = o[t] c[t]

SLIDE 50

Back-Prop Through Time

Taking the partial derivative of each variable at time t w.r.t. the variables at time t − 1, we get

\[ \frac{\partial i[t]}{\partial h[t-1]} = \dot\sigma_g(w_i x[t] + u_i h[t-1] + b_i)\, u_i \]
\[ \frac{\partial o[t]}{\partial h[t-1]} = \dot\sigma_g(w_o x[t] + u_o h[t-1] + b_o)\, u_o \]
\[ \frac{\partial f[t]}{\partial h[t-1]} = \dot\sigma_g(w_f x[t] + u_f h[t-1] + b_f)\, u_f \]
\[ \frac{\partial c[t]}{\partial h[t-1]} = i[t]\, \dot\sigma_h(w_c x[t] + u_c h[t-1] + b_c)\, u_c \]
\[ \frac{\partial c[t]}{\partial c[t-1]} = f[t] \]

SLIDE 51

Back-Prop Through Time

Using the standard rule for partial and total derivatives, we get a really complicated rule for h[t]:

\[ \frac{dE}{dh[t]} = \frac{\partial E}{\partial h[t]} + \frac{dE}{di[t+1]} \frac{\partial i[t+1]}{\partial h[t]} + \frac{dE}{do[t+1]} \frac{\partial o[t+1]}{\partial h[t]} + \frac{dE}{df[t+1]} \frac{\partial f[t+1]}{\partial h[t]} + \frac{dE}{dc[t+1]} \frac{\partial c[t+1]}{\partial h[t]} \]

The rule for c[t] is a bit simpler, because ∂E/∂c[t] = 0, so we don't need to include it:

\[ \frac{dE}{dc[t]} = \frac{dE}{dh[t]} \frac{\partial h[t]}{\partial c[t]} + \frac{dE}{dc[t+1]} \frac{\partial c[t+1]}{\partial c[t]} \]

SLIDE 52

Back-Prop Through Time

If we define δ_h[t] = dE/dh[t], and so on, then we have

\[ \begin{aligned} \delta_h[t] = \epsilon_h[t] &+ \delta_i[t+1]\, \dot\sigma_g(w_i x[t+1] + u_i h[t] + b_i)\, u_i \\ &+ \delta_o[t+1]\, \dot\sigma_g(w_o x[t+1] + u_o h[t] + b_o)\, u_o \\ &+ \delta_f[t+1]\, \dot\sigma_g(w_f x[t+1] + u_f h[t] + b_f)\, u_f \\ &+ i[t+1]\, \delta_c[t+1]\, \dot\sigma_h(w_c x[t+1] + u_c h[t] + b_c)\, u_c \end{aligned} \]

The rule for c[t] is a bit simpler:

\[ \delta_c[t] = \delta_h[t]\, o[t] + \delta_c[t+1]\, f[t+1] \]

SLIDE 53

Back-Prop Through Time

BPTT for the gates is easy, because nothing at time t + 1 depends directly on o[t], i[t], or f[t]. The only dependence is indirect, by way of h[t] and c[t]:

\[ \delta_o[t] = \frac{dE}{do[t]} = \frac{dE}{dh[t]} \frac{\partial h[t]}{\partial o[t]} = \delta_h[t]\, c[t] \]
\[ \delta_i[t] = \frac{dE}{di[t]} = \frac{dE}{dc[t]} \frac{\partial c[t]}{\partial i[t]} = \delta_c[t]\, \sigma_h(w_c x[t] + u_c h[t-1] + b_c) \]
\[ \delta_f[t] = \frac{dE}{df[t]} = \frac{dE}{dc[t]} \frac{\partial c[t]}{\partial f[t]} = \delta_c[t]\, c[t-1] \]
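A sketch of steps 1–3 (forward prop, synchronous backprop, and BPTT) for a scalar LSTM, transcribing the δ recursions above (my own illustration; σ_g is the logistic sigmoid, σ_h = tanh, and the parameter dictionary and example values are mine):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))
dtanh = lambda z: 1 - np.tanh(z) ** 2

def lstm_bptt(x, y, p):
    """Steps 1-3 for a scalar LSTM: forward prop, synchronous backprop, BPTT.
    p holds wi, ui, bi, wo, uo, bo, wf, uf, bf, wc, uc, bc. Returns the delta sequences."""
    T = len(x)
    ai, ao, af, ac = np.zeros(T), np.zeros(T), np.zeros(T), np.zeros(T)   # gate pre-activations
    i, o, f, c, h = np.zeros(T), np.zeros(T), np.zeros(T), np.zeros(T), np.zeros(T)
    c_prev, h_prev = 0.0, 0.0
    for t in range(T):                                   # 1. forward prop
        ai[t] = p['wi'] * x[t] + p['ui'] * h_prev + p['bi']; i[t] = sigmoid(ai[t])
        ao[t] = p['wo'] * x[t] + p['uo'] * h_prev + p['bo']; o[t] = sigmoid(ao[t])
        af[t] = p['wf'] * x[t] + p['uf'] * h_prev + p['bf']; f[t] = sigmoid(af[t])
        ac[t] = p['wc'] * x[t] + p['uc'] * h_prev + p['bc']
        c[t] = f[t] * c_prev + i[t] * np.tanh(ac[t])
        h[t] = o[t] * c[t]
        c_prev, h_prev = c[t], h[t]
    eps_h = h - y                                        # 2. synchronous backprop
    dh, dc, di, do, df = (np.zeros(T) for _ in range(5))
    for t in reversed(range(T)):                         # 3. back-prop through time
        dh[t] = eps_h[t]
        if t + 1 < T:
            dh[t] += (di[t + 1] * dsigmoid(ai[t + 1]) * p['ui']
                      + do[t + 1] * dsigmoid(ao[t + 1]) * p['uo']
                      + df[t + 1] * dsigmoid(af[t + 1]) * p['uf']
                      + i[t + 1] * dc[t + 1] * dtanh(ac[t + 1]) * p['uc'])
        dc[t] = dh[t] * o[t] + (dc[t + 1] * f[t + 1] if t + 1 < T else 0.0)
        do[t] = dh[t] * c[t]
        di[t] = dc[t] * np.tanh(ac[t])
        df[t] = dc[t] * (c[t - 1] if t > 0 else 0.0)
    return dh, dc, di, do, df

p = dict(wi=.5, ui=.1, bi=0, wo=.5, uo=.1, bo=0, wf=.5, uf=.1, bf=0, wc=.5, uc=.1, bc=0)
dh, dc, di, do, df = lstm_bptt(np.array([1., 2., 0.]), np.array([0., 0., 3.]), p)
```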


SLIDE 55

Conclusion

RNNs suffer from either exponentially decreasing memory (if |w| < 1) or exponentially increasing memory (if |w| > 1). This is one version of a more general problem sometimes called the vanishing gradient problem.

The forget gate solves that problem by making the feedback coefficient a function of the input.

LSTM defines two types of memory (cell = excitation = "long-term memory," and output = activation = "short-term memory") and three types of gates (input, output, forget).

Each epoch of LSTM training has the same steps as in a regular RNN:

1. Forward propagation: find h[t].
2. Synchronous backprop: find the time-synchronous partial derivatives ε[t].
3. BPTT: find the total derivatives δ[t].
4. Weight gradients.
5. Gradient descent.