Long short-term memory (LSTM)

Long short-term memory (LSTM), Jeong Min Lee, CS3750 - PDF document



  1. Recurrent neural networks and Long short-term memory (LSTM). Jeong Min Lee, CS3750, University of Pittsburgh, 3/3/2020. Outline: • RNN • Unfolding Computational Graph • Backpropagation and weight update • Exploding / vanishing gradient problem • LSTM • GRU • Tasks with RNN • Software Packages

  2. So far • We have been modeling sequences (time series) and predicting future values with probabilistic models (AR, HMM, LDS, particle filtering, Hawkes process, etc.) • E.g., in an LDS the observation x_t is modeled with an emission matrix C applied to the hidden state z_t, plus Gaussian noise w_t: x_t = C z_t + w_t, w_t ~ N(w | 0, Σ) • The hidden state is also computed probabilistically, with a transition matrix A and Gaussian noise v_t: z_t = A z_{t-1} + v_t, v_t ~ N(v | 0, Γ). Paradigm shift to RNN • We are moving into a new world where no probabilistic component exists in the model • That is, we may not need inference as in LDS and HMM • In an RNN, hidden states carry no probabilistic form or assumption • Given fixed inputs and targets from the data, the RNN learns the intermediate association between them, as well as a real-valued vector representation
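As a concrete illustration of the LDS generative process above, here is a minimal NumPy sketch (my own, not from the slides; the matrices A and C and the noise scales are made-up values) that simulates z_t = A z_{t-1} + v_t and x_t = C z_t + w_t:

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0.9, 0.1],            # transition matrix (hypothetical values)
                  [0.0, 0.8]])
    C = np.array([[1.0, 0.5]])           # emission matrix (hypothetical values)
    Gamma, Sigma = 0.05, 0.1             # std. devs. of transition / emission noise

    z = np.zeros(2)                      # hidden state z_t
    xs = []
    for t in range(50):
        z = A @ z + rng.normal(0.0, Gamma, size=2)   # z_t = A z_{t-1} + v_t
        x = C @ z + rng.normal(0.0, Sigma, size=1)   # x_t = C z_t + w_t
        xs.append(x.item())              # observed sequence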

  3. RNN • An RNN's input, output, and internal representation (hidden states) are all real-valued vectors • h_t: hidden state, a real-valued vector, computed as h_t = tanh(U x_t + W h_{t-1}) • x_t: input vector (real-valued) • ŷ_t = λ(V h_t): output vector (real-valued), where V h_t is itself a real-valued vector • An RNN consists of three parameter matrices with activation functions: U (input-hidden matrix), W (hidden-hidden matrix), V (hidden-output matrix)
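To make the update equations concrete, here is a minimal sketch of a single RNN step in NumPy (not the slides' code; the sizes and the helper name rnn_step are made up for illustration):

    import numpy as np

    def rnn_step(x_t, h_prev, U, W, V, out_fn=lambda a: a):
        """One RNN step: h_t = tanh(U x_t + W h_{t-1}), y_hat = lambda(V h_t)."""
        h_t = np.tanh(U @ x_t + W @ h_prev)   # new hidden state (real-valued vector)
        y_hat = out_fn(V @ h_t)               # output through transformation lambda
        return h_t, y_hat

    # Hypothetical sizes: 3-dim input, 5-dim hidden state, 2-dim output.
    rng = np.random.default_rng(0)
    U = rng.normal(size=(5, 3))   # input-hidden
    W = rng.normal(size=(5, 5))   # hidden-hidden
    V = rng.normal(size=(2, 5))   # hidden-output
    h1, y1_hat = rnn_step(rng.normal(size=3), np.zeros(5), U, W, V)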

  4. RNN • tanh(·) is the hyperbolic tangent function; it supplies the non-linearity in h_t = tanh(U x_t + W h_{t-1}) • λ(·) is the output transformation function in ŷ_t = λ(V h_t) • It can be any function, selected for the task and the type of target in the data • It can even be another feed-forward neural network, which lets the RNN model almost anything without restriction • Sigmoid: binary probability distribution • Softmax: categorical probability distribution • ReLU: positive real-valued output • Identity function: real-valued output
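For reference, the listed choices of λ can be written out directly; a small sketch of my own, not from the slides:

    import numpy as np

    def sigmoid(a):                 # binary probability distribution
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):                 # categorical probability distribution
        e = np.exp(a - np.max(a))   # subtract max for numerical stability
        return e / e.sum()

    def relu(a):                    # positive real-valued output
        return np.maximum(0.0, a)

    def identity(a):                # real-valued output
        return a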

  5. Make a prediction • Let's see how the model makes a prediction • In the beginning, the initial hidden state h_0 is filled with zeros or random values • We also assume the model is already trained (we will see how it is trained soon) • Assume we currently have the observation x_1 and want to predict x_2 • We compute the hidden state h_1 first: h_1 = tanh(U x_1 + W h_0)

  6. Make a prediction • Then we generate the prediction: ŷ_2 = λ(V h_1) • V h_1 is a real-valued vector or scalar value (depending on the size of the output matrix V). Make a prediction multiple steps ahead • When predicting multiple steps ahead, the predicted value ŷ_2 from the previous step is used as the input x_2 at time step 2: h_2 = tanh(U ŷ_2 + W h_1), ŷ_3 = λ(V h_2)

  7. Make a prediction multiple steps ahead • The same mechanism applies forward in time: h_3 = tanh(U ŷ_3 + W h_2), ŷ_4 = λ(V h_3). RNN characteristics • You might have observed that the parameters U, W, and V are shared across all time steps • No probabilistic component (random number generation) is involved • So everything is deterministic
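A sketch of the multi-step rollout described above (illustrative only; it assumes the output ŷ has the same dimensionality as the input x so the prediction can be fed back in):

    import numpy as np

    def rollout(x1, h0, U, W, V, steps, out_fn=lambda a: a):
        """Predict `steps` values ahead, feeding each prediction back as the next input."""
        preds, h, x = [], h0, x1
        for _ in range(steps):
            h = np.tanh(U @ x + W @ h)   # h_t = tanh(U x_t + W h_{t-1})
            y_hat = out_fn(V @ h)        # prediction for the next time step
            preds.append(y_hat)
            x = y_hat                    # predicted value becomes the next input
        return preds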

  8. Another way to see RNN • An RNN is a type of neural network. Neural network • A neural network cascades several linear weight matrices with nonlinear activation functions in between them • ŷ: output • V: hidden-output matrix • h: hidden units (states) • U: input-hidden matrix • x: input

  9. Neural network • In a traditional NN, every input is assumed to be independent of the others • But with sequential data, the input at the current time step very likely depends on the input at the previous time step • We need an additional structure that can model dependencies of inputs over time. Recurrent neural network • A type of neural network that has a recurrence structure • The recurrence structure allows us to operate over a sequence of vectors

  10. RNN as an unfolding computational graph • The folded graph (input x, hidden state h with recurrent weight W, output ŷ) unfolds over time into a chain … h_{t-1} → h_t → h_{t+1} …, where each step applies the same U, W, and V to map x_{t-1}, x_t, x_{t+1} to ŷ_{t-1}, ŷ_t, ŷ_{t+1} • An RNN can therefore be converted into a feed-forward neural network by unfolding it over time
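The unfolded view corresponds directly to a loop that applies the same U, W, V at every time step; a minimal sketch (the helper name forward_unrolled is my own):

    import numpy as np

    def forward_unrolled(xs, h0, U, W, V):
        """Unroll the RNN over an input sequence, keeping every hidden state and output."""
        hs, y_hats = [h0], []
        for x_t in xs:                               # same parameters at every step
            hs.append(np.tanh(U @ x_t + W @ hs[-1]))
            y_hats.append(V @ hs[-1])
        return hs, y_hats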

  11. How to train an RNN? • Before training can happen, we need to define: • y_t: the true target • ŷ_t: the output of the RNN (the prediction of the true target) • E_t: the error (loss), the difference between the true target and the output • Just as the output transformation function λ is selected by the task and data, so is the loss: • Binary classification: binary cross-entropy • Categorical classification: cross-entropy • Regression: mean squared error • With the loss attached, the unfolded RNN gains a loss node E_t at each time step that compares the output ŷ_t with the true target y_t
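The loss choices listed above can be written as per-step losses E_t; a small sketch of my own (assuming ŷ_t has already passed through the matching output transformation):

    import numpy as np

    def mse(y_hat, y):                  # regression (identity output)
        return 0.5 * np.sum((y_hat - y) ** 2)

    def binary_cross_entropy(p, y):     # binary classification (sigmoid output)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def cross_entropy(p, y_onehot):     # categorical classification (softmax output)
        return -np.sum(y_onehot * np.log(p))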

  12. Back-propagation through time (BPTT) • An extension of standard backpropagation that performs gradient descent on the unfolded network • The goal is to calculate the gradients of the error with respect to the parameters U, V, and W, and to learn the desired parameters using stochastic gradient descent • To update on one training example (sequence), we sum up the gradients at each time step of the sequence: ∂E/∂W = Σ_t ∂E_t/∂W
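A self-contained BPTT sketch for the special case of an identity output and squared-error loss (my own illustration, not the slides' code); note how the gradients from every time step are summed before a single parameter update:

    import numpy as np

    def bptt(xs, ys, h0, U, W, V):
        # Forward pass: unroll over the sequence, keep every hidden state and output.
        hs, y_hats = [h0], []
        for x_t in xs:
            hs.append(np.tanh(U @ x_t + W @ hs[-1]))
            y_hats.append(V @ hs[-1])
        # Backward pass: accumulate gradients over all time steps.
        dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
        dh_next = np.zeros_like(h0)                 # gradient arriving from step t+1
        for t in reversed(range(len(xs))):
            dy = y_hats[t] - ys[t]                  # dE_t/dy_hat_t for squared error
            dV += np.outer(dy, hs[t + 1])
            dh = V.T @ dy + dh_next                 # total gradient w.r.t. h_t
            dz = dh * (1.0 - hs[t + 1] ** 2)        # back through tanh
            dW += np.outer(dz, hs[t])
            dU += np.outer(dz, xs[t])
            dh_next = W.T @ dz                      # carry gradient to step t-1
        return dU, dW, dV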

  13. Learning parameters • Write the hidden-state update as z_t = U x_t + W h_{t-1}, so that h_t = tanh(z_t) • Let β_t = ∂h_t/∂z_t = 1 − h_t², and γ_t = ∂E_t/∂h_t = (ŷ_t − y_t) V • The gradient with respect to the hidden-hidden matrix is ∂E_t/∂W = (∂E_t/∂h_t)(∂h_t/∂W) = γ_t μ_t, where μ_t = ∂h_t/∂W = (∂h_t/∂z_t)(∂z_t/∂W) = β_t (h_{t-1} + W μ_{t-1}) • Similarly, for the input-hidden matrix, ∂E_t/∂U = γ_t ω_t, where ω_t = ∂h_t/∂U = β_t (x_t + W ω_{t-1})
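For a 1-dimensional toy RNN with identity output and squared-error loss, the recursions for β_t, γ_t, μ_t, and ω_t can be run forward in time as a sanity check; this scalar sketch is my own illustration of the formulas above, not the slides' code:

    import numpy as np

    def scalar_gradients(xs, ys, U, W, V, h0=0.0):
        """Accumulate dE/dU and dE/dW for a scalar RNN using the forward recursions."""
        dE_dU = dE_dW = 0.0
        h_prev, mu_prev, omega_prev = h0, 0.0, 0.0    # mu_0 = omega_0 = 0
        for x_t, y_t in zip(xs, ys):
            h_t = np.tanh(U * x_t + W * h_prev)
            beta = 1.0 - h_t ** 2                     # beta_t = dh_t/dz_t
            gamma = (V * h_t - y_t) * V               # gamma_t = dE_t/dh_t
            mu = beta * (h_prev + W * mu_prev)        # mu_t = dh_t/dW
            omega = beta * (x_t + W * omega_prev)     # omega_t = dh_t/dU
            dE_dW += gamma * mu                       # sum gradients over time
            dE_dU += gamma * omega
            h_prev, mu_prev, omega_prev = h_t, mu, omega
        return dE_dU, dE_dW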

