Understanding LSTM Networks - PowerPoint PPT Presentation


SLIDE 1

Understanding LSTM Networks

SLIDE 2

Recurrent Neural Networks

SLIDE 3

An unrolled recurrent neural network

SLIDE 4

The Problem of Long-Term Dependencies

SLIDE 5

RNN short-term dependencies

[Diagram: an unrolled RNN; a repeating cell A maps inputs x_0 … x_4 to hidden states h_0 … h_4. The relevant context sits only a few steps before the prediction.]

A language model trying to predict the next word based on the previous ones:

the clouds are in the sky.

SLIDE 6

RNN long-term dependencies

[Diagram: an unrolled RNN over x_0, x_1, x_2, …, x_{t−1}, x_t; the gap between the relevant context and the point where it is needed can grow very large.]

A language model trying to predict the next word based on the previous ones:

I grew up in India… I speak fluent Hindi.

SLIDE 7

Standard RNN

SLIDE 8

Backpropagation Through Time (BPTT)

SLIDE 9

RNN forward pass

$$s_t = \tanh(U x_t + W s_{t-1})$$

$$\hat{y}_t = \mathrm{softmax}(V s_t)$$

$$E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t) = -\sum_t y_t \log \hat{y}_t$$

[Diagram: the unrolled network, with the same parameters U, V, W shared at every time step.]
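To make the shared-parameter structure concrete, here is a minimal NumPy sketch of this forward pass. The toy dimensions, the one-hot inputs, and the initialization scale are assumptions for illustration, not from the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, U, V, W):
    """Vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), y_hat_t = softmax(V s_t).
    The same parameters U, V, W are reused at every time step."""
    s = np.zeros(W.shape[0])     # s_{-1} = 0
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
        outputs.append(softmax(V @ s))
    return states, outputs

# Toy dimensions (assumed): vocabulary size 5, hidden size 4
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 5))
W = rng.normal(scale=0.1, size=(4, 4))
V = rng.normal(scale=0.1, size=(5, 4))
xs = [np.eye(5)[i] for i in (0, 2, 1)]   # a three-step one-hot sequence
states, y_hats = rnn_forward(xs, U, V, W)
```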

SLIDE 10

Backpropagation Through Time

$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$

For example, at step 3:

$$\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\frac{\partial s_3}{\partial W}, \qquad s_3 = \tanh(U x_3 + W s_2)$$

But s_3 depends on s_2, which in turn depends on W and s_1, and so on, so the chain rule has to sum over every earlier step:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\frac{\partial s_3}{\partial s_k}\,\frac{\partial s_k}{\partial W}$$
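The sum over k can be computed by walking the chain backwards, accumulating one outer-product contribution to the gradient per step. The sketch below assumes the softmax/cross-entropy loss from the previous slide and reuses states, y_hats, V, W from the forward-pass sketch above; gradients for U and V are omitted to keep the focus on W.

```python
import numpy as np

def bptt_grad_W(states, y_hats, y_true, V, W, t):
    """dE_t/dW = sum_{k=0}^{t} (dE_t/dy_hat_t)(dy_hat_t/ds_t)(ds_t/ds_k)(ds_k/dW).

    For softmax + cross-entropy, dE_t/ds_t collapses to V^T (y_hat_t - y_t).
    """
    delta = V.T @ (y_hats[t] - y_true)        # dE_t/ds_t
    dW = np.zeros_like(W)
    for k in range(t, -1, -1):
        dz = (1 - states[k] ** 2) * delta     # through tanh: dE_t/dz_k
        s_prev = states[k - 1] if k > 0 else np.zeros_like(states[0])
        dW += np.outer(dz, s_prev)            # the ds_k/dW term
        delta = W.T @ dz                      # one factor of ds_k/ds_{k-1}
    return dW

# For the three-step toy sequence above, with an assumed target word 3:
# dW = bptt_grad_W(states, y_hats, np.eye(5)[3], V, W, t=2)
```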

SLIDE 11

The Vanishing Gradient Problem

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3}\,\frac{\partial s_3}{\partial s_k}\,\frac{\partial s_k}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial s_3} \left( \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}$$

  • The derivative of a vector with respect to a vector is a matrix called the Jacobian.
  • The 2-norm of this Jacobian matrix has an upper bound of 1: tanh maps all values into the range (−1, 1), and its derivative is bounded by 1.
  • With repeated matrix multiplications, gradient values shrink exponentially (see the numeric sketch after this list).
  • Gradient contributions from "far away" steps become zero.
  • Depending on the activation functions and network parameters, gradients could explode instead of vanishing.
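A small numeric illustration of the shrinking product of Jacobians; the hidden size, weight scale, and random inputs are all assumptions chosen to make the decay visible.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4
W = rng.normal(scale=0.1, size=(hidden, hidden))

# Multiply the per-step Jacobians ds_j/ds_{j-1} = diag(1 - s_j^2) W and
# watch the 2-norm of the running product decay over 20 steps.
s = np.zeros(hidden)
J = np.eye(hidden)
for j in range(1, 21):
    s = np.tanh(W @ s + rng.normal(size=hidden))   # stand-in for U x_j
    J = np.diag(1 - s ** 2) @ W @ J
    print(j, np.linalg.norm(J, 2))   # shrinks roughly geometrically
```

With a larger weight scale the same product can instead grow without bound, which is the exploding-gradient case mentioned above.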

SLIDE 12

Activation function

SLIDE 13

Basic LSTM

SLIDE 14

Unrolling the LSTM through time

SLIDE 15

Constant error carousel

[Diagram: the LSTM memory cell. σ marks sigmoid gate layers, Π marks pointwise products; the cell's recurrent edge to the next time step has its weight fixed at 1, forming the constant error carousel.]

The simple RNN state update

$$s_t = \tanh(U x_t + W s_{t-1})$$

is replaced by the gated cell update

$$C_t = \tilde{C}_t \cdot i_t + C_{t-1}$$

with output $C_t \cdot o_t$.
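A minimal sketch of this update (names are illustrative). Because the recurrent edge carries a fixed weight of 1, ∂C_t/∂C_{t−1} = 1, so error flowing along this edge neither shrinks nor grows:

```python
import numpy as np

def cec_step(c_prev, c_tilde, i_t):
    """Cell update without a forget gate: C_t = C~_t * i_t + C_{t-1}.
    dC_t/dC_{t-1} = 1, hence the constant error carousel."""
    return c_tilde * i_t + c_prev
```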

SLIDE 16

Input gate

[Diagram: the same LSTM cell, with the input gate $i_t$ highlighted in the update $C_t = \tilde{C}_t \cdot i_t + C_{t-1}$.]

  • Uses contextual information to decide
  • Stores the input into memory
  • Protects the memory from being overwritten by other, irrelevant inputs

SLIDE 17

Output gate

Edge to next time step

Π Π σ σ σ

Edge from previous time step (and current input) Weight fixed at 1

i t

  • t

~ Ct Ct= ~ Ct⋅ic

(t)+Ct−1

Ct⋅ot

  • Use contextual information to decide
  • Access information in memory
  • Block irrelevant information
SLIDE 18

Forget or reset gate

[Diagram: the LSTM cell extended with a forget (reset) gate $f_t$ on the recurrent cell edge.]

$$C_t = \tilde{C}_t \cdot i_t + C_{t-1} \cdot f_t$$

with output $C_t \cdot o_t$; a sketch of the full step follows.
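Putting the three gates together, here is a sketch of one full step in NumPy. The parameter names (Wf, Uf, bf, ...) and shapes are assumptions; note that the slides' output is C_t · o_t, whereas the common modern variant uses tanh(C_t) · o_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with the slides' cell update:
    C_t = C~_t * i_t + C_{t-1} * f_t, output h_t = C_t * o_t."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate
    c = c_tilde * i + c_prev * f     # gated cell update
    h = c * o                        # output gate applied to the cell state
    return h, c

# Toy shapes (assumed): input size 3, hidden size 4
rng = np.random.default_rng(0)
p = {}
for g in "fioc":
    p[f"W{g}"] = rng.normal(scale=0.1, size=(4, 3))
    p[f"U{g}"] = rng.normal(scale=0.1, size=(4, 4))
    p[f"b{g}"] = np.zeros(4)
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), p)
```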

SLIDE 19

LSTM with four interacting layers

SLIDE 20

The cell state

SLIDE 21

Gates

A gate is a sigmoid layer followed by a pointwise multiplication; its output, a value between 0 and 1, describes how much of each component should be let through.
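A two-line illustration of such gating (the values are assumed):

```python
import numpy as np

gate = 1 / (1 + np.exp(-np.array([-2.0, 0.0, 2.0])))   # ≈ (0.12, 0.50, 0.88)
signal = np.array([5.0, 5.0, 5.0])
print(gate * signal)   # the gate scales how much of each component passes
```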

SLIDE 22

Step-by-Step LSTM Walk Through

SLIDE 23

Forget gate layer

SLIDE 24

Input gate layer

SLIDE 25

The current state

SLIDE 26

Output layer

SLIDE 27

References

  • http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  • http://www.wildml.com/
  • http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
  • http://deeplearning.net/tutorial/lstm.html
  • https://theclevermachine.files.wordpress.com/2014/09/act-funs.png
  • http://blog.terminal.com/demistifying-long-short-term-memory-lstm-recurrent-neural-networks/
  • Lipton, Z. C. & Berkowitz, J. (2015), 'A Critical Review of Recurrent Neural Networks for Sequence Learning'.
  • Hochreiter, S. & Schmidhuber, J. (1997), 'Long Short-Term Memory', Neural Computation 9 (8), 1735-1780.
  • Gers, F. A.; Schmidhuber, J. & Cummins, F. A. (2000), 'Learning to Forget: Continual Prediction with LSTM', Neural Computation 12 (10), 2451-2471.