

slide-1
SLIDE 1

Another view

  • CEC is a constant error carrousel
  • No vanishing gradients
  • But, it is not always on
  • Introducing gates:
      – Allow or disallow input
      – Allow or disallow output
      – Remember or forget

[Figure: memory cell with state s, input squashing g, and multiplicative (×) gates with activation f on the input, output, and state-recurrence paths, between the input and hidden layers.]

slide-2
SLIDE 2

A few words about the LSTM

  • CEC: With the forget gate, the forward influence of the state can be modulated so that it is remembered for a long time, until the state or the input changes in a way that makes the LSTM forget it. This ability, or the path that passes the past state unaltered to the future state (and the gradient backward), is called the constant error carrousel (CEC). It gives the LSTM the ability to remember over the long term (hence, long short-term memory).
  • Blocks: Since there are just too many weights to be learnt for a single state bit, several state bits can be combined into a single block such that the state bits in a block share gates.
  • Peepholes: The state itself can be an input for the gates, via peephole connections.
  • GRU: In a variant of LSTM called the gated recurrent unit (GRU), the input gate can simply be one minus the forget gate. That is, if the state is being forgotten, then it is replaced by the input, and if it is being remembered, then the input is blocked (see the sketch below).
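To make that coupling concrete, here is a minimal numeric sketch (my own illustration, assuming a scalar state and a precomputed candidate input; not code from the slides):

```python
def coupled_update(s_prev, g_new, f):
    """One update where the input gate is tied to the forget gate as i = 1 - f."""
    i = 1.0 - f                      # input gate reuses the forget gate
    return f * s_prev + i * g_new    # remember the old state vs. overwrite it

print(coupled_update(s_prev=0.8, g_new=-0.5, f=0.9))   # ≈ 0.67: mostly remembered
print(coupled_update(s_prev=0.8, g_new=-0.5, f=0.1))   # ≈ -0.37: mostly replaced
```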

slide-3
SLIDE 3

LSTM block

[Figure: LSTM block diagram. Inputs x_1^t … x_I^t and delayed hidden outputs b_h^{t-1} … b_H^{t-1} feed a hidden layer of blocks (Block 1 … Block H/C). Each block holds a CEC state s_c^t with input squashing g(a_c^t), output squashing h(s_c^t), and gate activations f, connected through weights w_{iι}, w_{hι}, w_{iφ}, w_{hφ}, w_{iω}, w_{hω}, w_{ic}, w_{hc}; the block outputs b_c^t feed the output layer y_1^t … y_K^t. Legend: current, delayed, and peephole connections.]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-4
SLIDE 4

Adding peep-holes

[Figure: the same LSTM block diagram as on Slide 3, now with peephole connections from the delayed cell state s_c^{t-1} (and the current state s_c^t) to the gates.]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-5
SLIDE 5

Forward Pass

[Figure: LSTM block diagram as on Slide 4, with the input gate highlighted.]

Input gate:

$$a_\iota^t = \sum_{i=1}^{I} w_{i\iota}\, x_i^t + \sum_{h=1}^{H} w_{h\iota}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota}\, s_c^{t-1}$$

$$b_\iota^t = f(a_\iota^t)$$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-6
SLIDE 6

Forward Pass

[Figure: LSTM block diagram as on Slide 4, with the forget gate and cell input highlighted.]

Forget gate:

$$a_\phi^t = \sum_{i=1}^{I} w_{i\phi}\, x_i^t + \sum_{h=1}^{H} w_{h\phi}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi}\, s_c^{t-1}$$

$$b_\phi^t = f(a_\phi^t)$$

Cell input:

$$a_c^t = \sum_{i=1}^{I} w_{ic}\, x_i^t + \sum_{h=1}^{H} w_{hc}\, b_h^{t-1}$$

$$s_c^t = b_\phi^t\, s_c^{t-1} + b_\iota^t\, g(a_c^t)$$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-7
SLIDE 7

Forward Pass

[Figure: LSTM block diagram as on Slide 4, with the output gate and cell output highlighted.]

Output gate:

$$a_\omega^t = \sum_{i=1}^{I} w_{i\omega}\, x_i^t + \sum_{h=1}^{H} w_{h\omega}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega}\, s_c^{t}$$

$$b_\omega^t = f(a_\omega^t)$$

Cell output:

$$b_c^t = b_\omega^t\, h(s_c^t)$$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.
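As a runnable sketch of the input-gate, forget-gate, cell, and output-gate equations of the last three slides for a single block (my own minimal NumPy implementation with made-up sizes and names; f is the logistic sigmoid and g, h are tanh, as is conventional; not code from the slides or from Graves):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward_step(x, b_prev, s_prev, W, w_peep):
    """One forward step for a block of cells; w_peep holds the peephole weights."""
    a_iota  = W["xi"] @ x + W["hi"] @ b_prev + w_peep["i"] * s_prev   # input gate
    b_iota  = sigmoid(a_iota)
    a_phi   = W["xf"] @ x + W["hf"] @ b_prev + w_peep["f"] * s_prev   # forget gate
    b_phi   = sigmoid(a_phi)
    a_c     = W["xc"] @ x + W["hc"] @ b_prev                          # cell input
    s       = b_phi * s_prev + b_iota * np.tanh(a_c)                  # CEC state update
    a_omega = W["xo"] @ x + W["ho"] @ b_prev + w_peep["o"] * s        # output gate peeps at the new state
    b_omega = sigmoid(a_omega)
    b_c     = b_omega * np.tanh(s)                                    # cell output
    return b_c, s

n_in, n_cells = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_cells, n_in if k[0] == "x" else n_cells))
     for k in ["xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho"]}
w_peep = {k: rng.normal(size=n_cells) for k in ["i", "f", "o"]}
b_c, s = lstm_forward_step(rng.normal(size=n_in), np.zeros(n_cells), np.zeros(n_cells), W, w_peep)
```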

slide-8
SLIDE 8

Revisiting backpropagation through b-diagrams

  • An efficient way to perform gradient descent in NNs
  • Efficiency comes from local computations
  • This can be visualized using b-diagrams (see the sketch below):
      – Propagate x (actually w) forward
      – Propagate 1 backward

Source: "Neural Networks - A Systematic Introduction," by Raul Rojas, Springer-Verlag, Berlin, New York, 1996.
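A minimal sketch of that local bookkeeping (my own example, not from Rojas' book): each node stores its local derivative on the forward pass, and the backward pass feeds in 1 and multiplies the stored derivatives along the path:

```python
import math

def forward(x):
    a = x * x                   # node 1: square
    da_dx = 2 * x               # stored local derivative
    b = math.sin(a)             # node 2: sine
    db_da = math.cos(a)         # stored local derivative
    return b, (da_dx, db_da)

def backward(locals_):
    da_dx, db_da = locals_
    grad = 1.0                  # propagate 1 backward from the output
    grad *= db_da               # traverse node 2
    grad *= da_dx               # traverse node 1
    return grad                 # = d/dx sin(x^2) = 2x cos(x^2)

y, locs = forward(1.5)
print(backward(locs))           # matches 2 * 1.5 * cos(2.25)
```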

slide-9
SLIDE 9

Chain rule using b-diagram

Source: "Neural Networks - A Systematic Introduction," by Raul Rojas, Springer-Verlag, Berlin, New York, 1996.

slide-10
SLIDE 10

Addition of functions using b-diagram

Source: "Neural Networks - A Systematic Introduction," by Raul Rojas, Springer-Verlag, Berlin, New York, 1996.

slide-11
SLIDE 11

Weighted edge on a b-diagram

Source: "Neural Networks - A Systematic Introduction," by Raul Rojas, Springer-Verlag, Berlin, New York, 1996.

slide-12
SLIDE 12

Product in a b-diagram

[Figure: b-diagram of a product node. Forward pass: the stored function values f and g combine into the product f·g. Backward pass: propagating 1 backward yields f′g along one branch and g′f along the other, recovering the product rule (fg)′ = f′g + g′f.]
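A corresponding sketch (my own example): the gradient flowing back through a product node reaches each factor scaled by the other factor, which gives exactly the product rule once both factors depend on the same variable:

```python
def product_backward(f_val, g_val, upstream=1.0):
    """Route the upstream gradient to each factor of a product node."""
    grad_f = upstream * g_val   # d(fg)/df = g
    grad_g = upstream * f_val   # d(fg)/dg = f
    return grad_f, grad_g

print(product_backward(3.0, 4.0))   # (4.0, 3.0)
```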

slide-13
SLIDE 13

Backpropagation

[Figure: LSTM block diagram as on Slide 4.]

Definitions:

$$\epsilon_c^t \stackrel{\mathrm{def}}{=} \frac{\partial \mathcal{L}}{\partial b_c^t}, \qquad \epsilon_s^t \stackrel{\mathrm{def}}{=} \frac{\partial \mathcal{L}}{\partial s_c^t}$$

Cell output:

$$\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{h=1}^{H} w_{ch}\, \delta_h^{t+1}$$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-14
SLIDE 14

Backpropagation

[Figure: LSTM block diagram as on Slide 4.]

Output gate:

$$\delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t)\, \epsilon_c^t$$

State:

$$\epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\phi^{t+1}\, \epsilon_s^{t+1} + w_{c\iota}\, \delta_\iota^{t+1} + w_{c\phi}\, \delta_\phi^{t+1} + w_{c\omega}\, \delta_\omega^t$$

Cells:

$$\delta_c^t = b_\iota^t\, g'(a_c^t)\, \epsilon_s^t$$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-15
SLIDE 15

Backpropagation

[Figure: LSTM block diagram as on Slide 4.]

Forget gate:

$$\delta_\phi^t = f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1}\, \epsilon_s^t$$

Input gate:

$$\delta_\iota^t = f'(a_\iota^t) \sum_{c=1}^{C} g(a_c^t)\, \epsilon_s^t$$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.
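For completeness, a small NumPy sketch of the backward-pass equations on the last three slides, for a single cell with made-up scalar values (my own illustration, not code from the slides):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # f'
dtanh = lambda z: 1.0 - np.tanh(z) ** 2                # g' and h'

# Cached forward quantities for one cell at time t (made-up values).
a_iota, a_phi, a_omega, a_c = 0.2, 1.1, -0.3, 0.5      # gate / cell pre-activations
b_iota, b_phi, b_omega = sigmoid(a_iota), sigmoid(a_phi), sigmoid(a_omega)
s_prev = 0.4
s_t = b_phi * s_prev + b_iota * np.tanh(a_c)           # state from the forward pass

# Deltas arriving from later in the network (made-up values).
eps_c = 0.7                                  # dL/db_c^t, computed as on Slide 13
eps_s_next = 0.1                             # dL/ds_c^{t+1}
b_phi_next = 0.9                             # forget gate at t+1 (the CEC path)
delta_iota_next = delta_phi_next = 0.05      # gate deltas at t+1
w_c_iota = w_c_phi = w_c_omega = 0.3         # peephole weights

delta_omega = dsigmoid(a_omega) * np.tanh(s_t) * eps_c          # output gate (Slide 14)
eps_s = (b_omega * dtanh(s_t) * eps_c                           # state (Slide 14):
         + b_phi_next * eps_s_next                              #   CEC term plus
         + w_c_iota * delta_iota_next                           #   peephole contributions
         + w_c_phi * delta_phi_next
         + w_c_omega * delta_omega)
delta_c = b_iota * dtanh(a_c) * eps_s                           # cell input (Slide 14)
delta_phi = dsigmoid(a_phi) * s_prev * eps_s                    # forget gate (Slide 15)
delta_iota = dsigmoid(a_iota) * np.tanh(a_c) * eps_s            # input gate (Slide 15)
print(delta_omega, eps_s, delta_c, delta_phi, delta_iota)
```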

slide-16
SLIDE 16

Contents

  • Need for memory to process sequential data
  • Recurrent neural networks
  • LSTM basics
  • Some applications of LSTM in NLP
slide-17
SLIDE 17

Gated Recurrent Unit (GRU)

  • Reduces the need for an input gate by reusing the forget gate (see the sketch below)

[Figure: memory cell as on Slide 1, but with the input gate replaced by 1 − f, where f is the forget gate.]
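A minimal sketch of this coupled update for one layer (my own NumPy illustration with untrained weights; the full GRU of Cho et al. also has a reset gate, omitted here to match the slide's simplification):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_like_step(x, s_prev, W_f, U_f, W_g, U_g):
    """One step where the forget gate f also plays the role of (1 - input gate)."""
    f = sigmoid(W_f @ x + U_f @ s_prev)          # forget / update gate
    g = np.tanh(W_g @ x + U_g @ s_prev)          # candidate state from the input
    return f * s_prev + (1.0 - f) * g            # remember vs. overwrite

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W_f, U_f = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
W_g, U_g = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
s = gru_like_step(rng.normal(size=n_in), np.zeros(n_hid), W_f, U_f, W_g, U_g)
```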

slide-18
SLIDE 18

GRUs combine input and forget gates

[Figure: four unit diagrams side by side: a plain RNN unit, an LSTM unit, an LSTM unit with peepholes, and a Gated Recurrent Unit, which combines the input and forget gates as f and 1 − f.]

Source: Cho, et al. (2014), and “Understanding LSTM Networks”, by C Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/

slide-19
SLIDE 19

Sentence generation

  • Very common for image captioning
  • Input is given only at the beginning
  • This is a one-to-many task (see the sketch below)

[Figure: a CNN encoding of the image feeds the first step of an unrolled LSTM, which then emits "A boy swimming in water <END>" one word per step.]
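A sketch of the one-to-many generation loop (my own illustration: the image encoding, vocabulary, and weights are made up and untrained, and a plain tanh recurrence stands in for the LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<END>", "a", "boy", "swimming", "in", "water"]
n_hid, n_vocab = 8, len(vocab)
W_h = rng.normal(size=(n_hid, n_hid))
W_x = rng.normal(size=(n_hid, n_vocab))
W_out = rng.normal(size=(n_vocab, n_hid))

h = rng.normal(size=n_hid)          # stand-in for the CNN image encoding
x = np.zeros(n_vocab)               # no word has been emitted yet
words = []
for _ in range(10):                 # generate until <END> or a length cap
    h = np.tanh(W_h @ h + W_x @ x)  # simplified recurrent step (not a full LSTM)
    idx = int(np.argmax(W_out @ h)) # greedy choice of the next word
    if vocab[idx] == "<END>":
        break
    words.append(vocab[idx])
    x = np.eye(n_vocab)[idx]        # feed the chosen word back in
print(" ".join(words))
```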

slide-20
SLIDE 20

Video Caption Generation

Source: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Venugopalan et al., arXiv 2014

slide-21
SLIDE 21

Pre-processing for NLP

  • The most basic pre-processing is to convert words into an embedding using Word2Vec or GloVe (see the sketch below)
  • Otherwise, a one-hot input vector can be too long and sparse, and require lots of input weights
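A small sketch of why the embedding helps (made-up sizes; `embedding` stands in for a pretrained Word2Vec/GloVe matrix):

```python
import numpy as np

vocab_size, embed_dim = 50_000, 300
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))   # stand-in for pretrained rows

word_index = 1234
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0          # 50,000 inputs, almost all zero
dense = embedding[word_index]      # the same word as a 300-dimensional vector
assert np.allclose(dense, one_hot @ embedding)   # lookup == one-hot matmul
```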

slide-22
SLIDE 22

Sentiment analysis

  • Very common for customer review or news article analysis
  • Output before the end can be discarded (not used for backpropagation)
  • This is a many-to-one task (see the sketch below)

[Figure: the words "Very pleased with their service <END>" are each embedded and fed to an unrolled LSTM; only the final output, "Positive", is kept.]
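A sketch of the many-to-one pattern (my own illustration with untrained weights and a plain tanh recurrence standing in for the LSTM): the recurrence runs over the whole sequence, but only the final state feeds the classifier, so only that output contributes to the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n_embed, n_hid, n_classes = 16, 8, 2
W_h, W_x = rng.normal(size=(n_hid, n_hid)), rng.normal(size=(n_hid, n_embed))
W_out = rng.normal(size=(n_classes, n_hid))

sequence = rng.normal(size=(6, n_embed))   # six embedded words
h = np.zeros(n_hid)
for x in sequence:
    h = np.tanh(W_h @ h + W_x @ x)         # simplified recurrent step
logits = W_out @ h                         # the loss is attached here only
print(logits.argmax())                     # 0 = negative, 1 = positive (toy labels)
```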

slide-23
SLIDE 23

Machine translation using encoder-decoder

Source: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al., arXiv 2014

slide-24
SLIDE 24

Multi-layer LSTM

  • More than one hidden layer can be used (see the sketch below)

[Figure: two stacked hidden layers: a first layer of LSTM cells (LSTM1) reads x_{n-3} … x_n, and its outputs feed a second layer (LSTM2), which produces y_{n-3} … y_n.]
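A sketch of stacking (my own illustration with untrained weights and a simplified recurrent step): the hidden sequence produced by the first layer becomes the input sequence of the second:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 5, 8
W1x, W1h = rng.normal(size=(n_hid, n_in)),  rng.normal(size=(n_hid, n_hid))
W2x, W2h = rng.normal(size=(n_hid, n_hid)), rng.normal(size=(n_hid, n_hid))

xs = rng.normal(size=(4, n_in))            # x_{n-3} ... x_n
h1, h2 = np.zeros(n_hid), np.zeros(n_hid)
ys = []
for x in xs:
    h1 = np.tanh(W1x @ x + W1h @ h1)       # first hidden layer
    h2 = np.tanh(W2x @ h1 + W2h @ h2)      # second hidden layer reads layer 1
    ys.append(h2)                          # y_{n-3} ... y_n come from the top layer
```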

slide-25
SLIDE 25

Bi-directional LSTM

  • Many problems require a reverse flow of information as well
  • For example, POS tagging may require context from future words (see the sketch below)

[Figure: a forward layer (LSTM1) reads x_{n-3} … x_n left to right, a backward layer (LSTM2) reads them right to left, and both contribute to each output y_{n-3} … y_n.]
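A sketch of the bi-directional pattern (my own illustration with untrained weights and a simplified recurrent step): one pass runs left to right, another right to left, and the two states are combined at each position:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 5, 8
Wfx, Wfh = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
Wbx, Wbh = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))

xs = rng.normal(size=(4, n_in))
fwd, h = [], np.zeros(n_hid)
for x in xs:                                   # forward layer
    h = np.tanh(Wfx @ x + Wfh @ h)
    fwd.append(h)
bwd, h = [], np.zeros(n_hid)
for x in reversed(xs):                         # backward layer
    h = np.tanh(Wbx @ x + Wbh @ h)
    bwd.append(h)
bwd = bwd[::-1]                                # re-align with the forward order
features = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # per-position output
```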

slide-26
SLIDE 26

Some problems with LSTMs and their troubleshooting

  • Inappropriate model
      – Identify the problem: one-to-many, many-to-one, etc.
      – Loss only for outputs that matter
      – Separate LSTMs for separate languages
  • High training loss: model not expressive enough
      – Too few hidden nodes
      – Only one hidden layer
  • Overfitting: model has too much freedom
      – Too many hidden nodes
      – Too many blocks
      – Too many layers
      – Not bi-directional
slide-27
SLIDE 27

Multi-dimensional RNNs

Source: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

slide-28
SLIDE 28

In summary, LSTMs are powerful

  • Using recurrent connections is an old idea
  • It suffered from a lack of gradient control over the long term
  • CEC was an important innovation for remembering states long term
  • This has many applications in time-series modeling
  • Newer innovations are:
      – Forget gates
      – Peepholes
      – Combining the input and forget gates in the GRU
  • LSTM can be generalized in direction, dimension, and number of hidden layers to produce more complex models
