

slide-1
SLIDE 1

INF5820: Language technological applications Gated RNNs (3:2)

Taraka Rama

University of Oslo

30 October 2018

slide-2
SLIDE 2

Agenda

◮ Break only for 5 minutes
◮ Lecture ends at 3:45
◮ GRU
◮ LSTM
◮ Connections
◮ An analysis of why GRUs address vanishing gradients
◮ Applications of Gated RNNs

2

slide-3
SLIDE 3

Some external resources

Many online resources on Gated RNNs:

◮ Fourth part on GRU and LSTM: https://tinyurl.com/z9j4ws9
◮ Understanding Long Short-Term Memory Networks:
  http://colah.github.io/posts/2015-08-Understanding-LSTMs/

3

slide-4
SLIDE 4

Recap

Equations of basic RNN

◮ h_t = f(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

4

slide-5
SLIDE 5

Recap

Equations of basic RNN

◮ h_t = f(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

◮ x_t is the input vector at time t
◮ h_t is the hidden state or memory at time t
◮ f(·) is a non-linearity such as the tanh or sigmoid function
◮ ŷ_t is the output transformation that maps h_t to the number of output classes at each time step
◮ p_t is the output of the softmax function that turns the values in ŷ_t into probabilities

4
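Not on the original slides: a minimal NumPy sketch of one step of this basic RNN. Sizes and weight names (W_xh, W_hh, W_hy, b_h, b_y) are illustrative and follow the equations above.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of the basic RNN from the recap slide."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state / memory
    y_hat = W_hy @ h_t + b_y                           # output transformation
    p_t = np.exp(y_hat - y_hat.max())
    p_t /= p_t.sum()                                   # softmax over output classes
    return h_t, p_t

# toy sizes: 4-dimensional input, 3-dimensional hidden state, 2 output classes
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
b_h, b_y = np.zeros(3), np.zeros(2)
h, p = rnn_step(rng.normal(size=4), np.zeros(3), W_xh, W_hh, W_hy, b_h, b_y)
```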

slide-6
SLIDE 6

Recap: Vanishing/Exploding Gradient

◮ Fix k = 18 and S = 20
◮ All W matrices and inputs are of size 1 × 1. (Simplification!)

5

slide-7
SLIDE 7

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t

5

slide-8
SLIDE 8

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t
◮ dhraw_18 = A_18 + B_18 · dhnext_18

5

slide-9
SLIDE 9

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t
◮ dhraw_18 = A_18 + B_18 · dhnext_18
◮ ≈ A_18 + B_18 W_hh dhraw_19

5

slide-10
SLIDE 10

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t
◮ dhraw_18 = A_18 + B_18 · dhnext_18
◮ ≈ A_18 + B_18 W_hh dhraw_19
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 dhnext_19)

5

slide-11
SLIDE 11

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t
◮ dhraw_18 = A_18 + B_18 · dhnext_18
◮ ≈ A_18 + B_18 W_hh dhraw_19
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 dhnext_19)
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 W_hh dhraw_20)

5

slide-12
SLIDE 12

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t
◮ dhraw_18 = A_18 + B_18 · dhnext_18
◮ ≈ A_18 + B_18 W_hh dhraw_19
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 dhnext_19)
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 W_hh dhraw_20)
◮ The rightmost term, dhraw_20, is always left-multiplied by W_hh

5

slide-13
SLIDE 13

Recap: Vanishing/Exploding Gradient

Basic recursion

◮ dhraw_t = A_t + B_t · dhnext_t
◮ dhnext_{t−1} = W_hh · dhraw_t
◮ dhraw_18 = A_18 + B_18 · dhnext_18
◮ ≈ A_18 + B_18 W_hh dhraw_19
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 dhnext_19)
◮ ≈ A_18 + B_18 W_hh (A_19 + B_19 W_hh dhraw_20)
◮ The rightmost term, dhraw_20, is always left-multiplied by W_hh
◮ When k = 1, dW_hh is computed using (W_hh)^19

5
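Not from the slides: a tiny numeric illustration of the effect described above in the 1 × 1 case. Repeatedly multiplying by the same scalar W_hh either shrinks the backpropagated signal towards zero or blows it up, depending on whether |W_hh| is below or above 1 (the values below are made up).

```python
import numpy as np

def backprop_signal(w_hh, steps=19):
    """Magnitude of a gradient signal after `steps` left-multiplications by W_hh (1x1 case)."""
    signal = 1.0
    for _ in range(steps):
        signal *= w_hh
    return signal

for w_hh in (0.5, 0.9, 1.1, 1.5):
    print(f"W_hh = {w_hh}: signal after 19 steps = {backprop_signal(w_hh):.3e}")
# |W_hh| < 1 -> vanishes (0.5**19 ~ 1.9e-06); |W_hh| > 1 -> explodes (1.5**19 ~ 2.2e+03)
```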

slide-14
SLIDE 14

Problem with RNNs

◮ Vanishing/Exploding gradients
◮ When training fails, it is not clear whether:
  • there is no dependency between different time steps, or
  • the parameters were initialized badly
◮ Many solutions exist (Pascanu et al., 2012).
◮ Today: we will analyze why gates address the vanishing gradient problem.
◮ The main solution is Gated RNNs.

6

slide-15
SLIDE 15

Gated RNNs are better than RNNs at MT

Sutskever et al. (2014) Sequence to Sequence Learning with Neural Networks (4903 citations according to Google Scholar as of today)

7

slide-16
SLIDE 16

LSTM vs. GRU vs. CNN?

Yin et al. 2017: Comparative Study of CNN and RNN for Natural Language Processing https://arxiv.org/pdf/1702.01923.pdf

8

slide-17
SLIDE 17

Why the name gated?

◮ Analogous to gates in circuits

9

slide-18
SLIDE 18

Why the name gated?

◮ Analogous to gates in circuits
◮ Allow parts of the input vector or hidden state to pass over to the next state

9

slide-19
SLIDE 19

Why the name gated?

◮ Analogous to gates in circuits
◮ Allow parts of the input vector or hidden state to pass over to the next state
◮ x = g × y + (1 − g) × z where g ∈ {0, 1}

9

slide-20
SLIDE 20

Why the name gated?

◮ Analogous to gates in circuits
◮ Allow parts of the input vector or hidden state to pass over to the next state
◮ x = g × y + (1 − g) × z where g ∈ {0, 1}
◮ In vector notation: x_i = g_i × y_i + (1 − g_i) × z_i where g_i ∈ {0, 1}

9

slide-21
SLIDE 21

Why the name gated?

◮ Analogous to gates in circuits
◮ Allow parts of the input vector or hidden state to pass over to the next state
◮ x = g × y + (1 − g) × z where g ∈ {0, 1}
◮ In vector notation: x_i = g_i × y_i + (1 − g_i) × z_i where g_i ∈ {0, 1}
◮ A simpler way to write it: the Hadamard product

9

slide-22
SLIDE 22

Hadamard product

◮ A cool name for elementwise matrix multiplication
◮ Can be represented as ⊙ (in Machine Learning or Physics) or ◦ (usually in NLP)
◮ C = A ⊙ B where C_{i,j} = A_{i,j} × B_{i,j}
◮ Matrix dimensions do not change
◮ Can be performed only between matrices of identical dimensions

10
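Not on the slides: a short NumPy illustration of the Hadamard product and the gating equation above; `*` on NumPy arrays is exactly elementwise multiplication (all vectors here are made up).

```python
import numpy as np

g = np.array([1.0, 0.0, 1.0, 0.0])   # a hard 0/1 gate
y = np.array([10.0, 20.0, 30.0, 40.0])
z = np.array([-1.0, -2.0, -3.0, -4.0])

# Hadamard product: elementwise, dimensions unchanged
print(y * z)                # [-10. -40. -90. -160.]

# Gating: x_i = g_i * y_i + (1 - g_i) * z_i  -> picks y where g=1, z where g=0
x = g * y + (1 - g) * z
print(x)                    # [10. -2. 30. -4.]
```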

slide-23
SLIDE 23

0/1 problem

◮ 0/1 gate variable is not differentiable

11

slide-24
SLIDE 24

0/1 problem

◮ 0/1 gate variable is not differentiable
◮ Non-differentiable variable blocks backpropagation

11

slide-25
SLIDE 25

0/1 problem

◮ 0/1 gate variable is not differentiable
◮ Non-differentiable variable blocks backpropagation
◮ No backpropagation implies no gradient descent

11

slide-26
SLIDE 26

0/1 problem

◮ 0/1 gate variable is not differentiable
◮ Non-differentiable variable blocks backpropagation
◮ No backpropagation implies no gradient descent
◮ What is the solution?

11

slide-27
SLIDE 27

0/1 problem

◮ 0/1 gate variable is not differentiable
◮ Non-differentiable variable blocks backpropagation
◮ No backpropagation implies no gradient descent
◮ What is the solution?
◮ sigmoid or tanh function

11

slide-28
SLIDE 28

Gated Recurrent Unit (GRU)

◮ GRU has two gates when compared to RNN

12

slide-29
SLIDE 29

Gated Recurrent Unit (GRU)

◮ GRU has two gates when compared to RNN
◮ Update gate z and Reset gate r

12

slide-30
SLIDE 30

Gated Recurrent Unit (GRU)

◮ GRU has two gates when compared to RNN
◮ Update gate z and Reset gate r
◮ Both gates are computed using different weights, just like the RNN, at each time step

12

slide-31
SLIDE 31

Gated Recurrent Unit (GRU)

◮ GRU has two gates when compared to RNN
◮ Update gate z and Reset gate r
◮ Both gates are computed using different weights, just like the RNN, at each time step
◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})

12

slide-32
SLIDE 32

Gated Recurrent Unit (GRU)

◮ GRU has two gates when compared to RNN
◮ Update gate z and Reset gate r
◮ Both gates are computed using different weights, just like the RNN, at each time step
◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})

12

slide-33
SLIDE 33

GRU computations

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})

13

slide-34
SLIDE 34

GRU computations

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})

13

slide-35
SLIDE 35

GRU computations

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})

13

slide-36
SLIDE 36

GRU computations

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

13
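Not from the slides: a minimal NumPy sketch of one GRU step following the four equations above. Weight names such as W_z and U_z are illustrative; `*` plays the role of the Hadamard product ◦.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU step: update gate, reset gate, candidate memory, final memory."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))  # internal memory content
    h_t = z_t * h_prev + (1 - z_t) * h_tilde         # final memory
    return h_t

# toy sizes: 4-dimensional input, 3-dimensional hidden state
rng = np.random.default_rng(0)
Wz, Uz = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
Wr, Ur = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
W,  U  = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
h = gru_step(rng.normal(size=4), np.zeros(3), Wz, Uz, Wr, Ur, W, U)
```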

slide-37
SLIDE 37

GRU (Visualization)

(http://web.stanford.edu/class/cs224n/lectures/lecture9.pdf)

14

slide-38
SLIDE 38

Is it complex?

◮ Think of all weights, variables and inputs as 1 × 1 matrices.

15

slide-39
SLIDE 39

Is it complex?

◮ Think of all weights, variables and inputs as 1 × 1 matrices.
◮ Easy to analyze

15

slide-40
SLIDE 40

Is it complex?

◮ Think of all weights, variables and inputs as 1 × 1 matrices.
◮ Easy to analyze
◮ Boils down to single-variable mathematics.

15

slide-41
SLIDE 41

Reset Gate and GRU

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

If r_t = 0
◮ Ignore previous hidden memory h_{t−1}. Equivalently, forget the past.

16

slide-42
SLIDE 42

Reset Gate and GRU

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

If r_t = 0
◮ Ignore previous hidden memory h_{t−1}. Equivalently, forget the past.
◮ Current hidden memory h_t is partly dependent on the current input x_t

16

slide-43
SLIDE 43

Reset Gate and GRU

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

If r_t = 1
◮ Use everything in the previous hidden memory h_{t−1}. Looks like the RNN is back.

16

slide-44
SLIDE 44

Update Gate and GRU

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

If z_t = 0
◮ Ignore previous hidden memory h_{t−1}

17

slide-45
SLIDE 45

Update Gate and GRU

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

If z_t = 0
◮ Ignore previous hidden memory h_{t−1}
◮ Current hidden memory h_t is only dependent on the new memory state h̃_t

17

slide-46
SLIDE 46

Update Gate and GRU

◮ Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ Internal memory content: h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ Final memory: h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

If z_t = 1
◮ Copy the previous hidden memory h_{t−1}
◮ The current input is ignored completely

17

slide-47
SLIDE 47

What is it for NLP?

◮ Short-term dependencies mean older history should be ignored (r_t → 0)

18

slide-48
SLIDE 48

What is it for NLP?

◮ Short-term dependencies mean older history should be ignored (r_t → 0)
◮ Long-term dependencies mean older history should be retained as much as possible (z_t → 1)

18

slide-49
SLIDE 49

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

19

slide-50
SLIDE 50

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:

19

slide-51
SLIDE 51

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})

19

slide-52
SLIDE 52

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = 0 ∗ h_{t−1} + 1 ∗ h̃_t

19

slide-53
SLIDE 53

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = 0 ∗ h_{t−1} + 1 ∗ h̃_t
◮ h_t = h̃_t

19

slide-54
SLIDE 54

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = 0 ∗ h_{t−1} + 1 ∗ h̃_t
◮ h_t = h̃_t
◮ W is W_xh and U is W_hh

19

slide-55
SLIDE 55

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = 0 ∗ h_{t−1} + 1 ∗ h̃_t
◮ h_t = h̃_t
◮ W is W_xh and U is W_hh
◮ Back to Vanilla RNN... Is it a problem?

19

slide-56
SLIDE 56

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = 0 ∗ h_{t−1} + 1 ∗ h̃_t
◮ h_t = h̃_t
◮ W is W_xh and U is W_hh
◮ Back to Vanilla RNN... Is it a problem?
◮ r_t = 1 and z_t = 0 would have to happen at all the time steps. Very low chance of happening!!

19

slide-57
SLIDE 57

Is RNN a special case of GRU?

Can you think when RNN is a special case of GRU?

◮ When r_t = 1 and z_t = 0:
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = 0 ∗ h_{t−1} + 1 ∗ h̃_t
◮ h_t = h̃_t
◮ W is W_xh and U is W_hh
◮ Back to Vanilla RNN... Is it a problem?
◮ r_t = 1 and z_t = 0 would have to happen at all the time steps. Very low chance of happening!!
◮ So we are fine...

19
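Not on the slides: a quick NumPy check of this reduction. If the gates are forced to r_t = 1 and z_t = 0, the GRU step from earlier produces exactly the vanilla RNN update tanh(W x_t + U h_{t−1}) (all weights below are made-up values).

```python
import numpy as np

rng = np.random.default_rng(1)
W, U = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
x_t, h_prev = rng.normal(size=4), rng.normal(size=3)

# GRU step with gates pinned: r_t = 1, z_t = 0
r_t, z_t = np.ones(3), np.zeros(3)
h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))   # candidate memory
h_gru = z_t * h_prev + (1 - z_t) * h_tilde        # final memory

# vanilla RNN step with W = W_xh, U = W_hh
h_rnn = np.tanh(W @ x_t + U @ h_prev)

print(np.allclose(h_gru, h_rnn))   # True: the GRU collapses to the basic RNN
```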

slide-58
SLIDE 58

Does the GRU fix the vanishing gradient?

Two ways to analyze this:

◮ What does the hidden state look like in the GRU after a few time steps?

20

slide-59
SLIDE 59

Does the GRU fix the vanishing gradient?

Two ways to analyze this:

◮ What does the hidden state look like in the GRU after a few time steps?
◮ What does the gradient calculation look like?

20

slide-60
SLIDE 60

Does the GRU fix the vanishing gradient?

Two ways to analyze this:

◮ What does the hidden state look like in the GRU after a few time steps?
◮ What does the gradient calculation look like?
◮ Requires the multivariate chain rule!!!

20

slide-61
SLIDE 61

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20

21

slide-62
SLIDE 62

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20
◮ h_19 = z_19 h_18 + (1 − z_19) h̃_19

21

slide-63
SLIDE 63

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20
◮ h_19 = z_19 h_18 + (1 − z_19) h̃_19
◮ Therefore, h_20 = z_20 z_19 h_18 + z_20 (1 − z_19) h̃_19 + (1 − z_20) h̃_20

21

slide-64
SLIDE 64

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20
◮ h_19 = z_19 h_18 + (1 − z_19) h̃_19
◮ Therefore, h_20 = z_20 z_19 h_18 + z_20 (1 − z_19) h̃_19 + (1 − z_20) h̃_20
◮ If z_19 → 0 then only part of h_18 is relevant, through h̃_19

21

slide-65
SLIDE 65

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20
◮ h_19 = z_19 h_18 + (1 − z_19) h̃_19
◮ Therefore, h_20 = z_20 z_19 h_18 + z_20 (1 − z_19) h̃_19 + (1 − z_20) h̃_20
◮ If z_19 → 0 then only part of h_18 is relevant, through h̃_19
◮ If z_19 → 1 then h̃_19 is ignored completely. It is as if the input x_19 is completely ignored.

21

slide-66
SLIDE 66

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20
◮ h_19 = z_19 h_18 + (1 − z_19) h̃_19
◮ Therefore, h_20 = z_20 z_19 h_18 + z_20 (1 − z_19) h̃_19 + (1 − z_20) h̃_20
◮ If z_19 → 0 then only part of h_18 is relevant, through h̃_19
◮ If z_19 → 1 then h̃_19 is ignored completely. It is as if the input x_19 is completely ignored.
◮ There is a jump between timesteps 18 and 20.

21

slide-67
SLIDE 67

GRU: Hidden state calculation

We will expand ht from t = 20 to t = 19. Again, all our matrices are 1 × 1 to simplify.

◮ h_20 = z_20 h_19 + (1 − z_20) h̃_20
◮ h_19 = z_19 h_18 + (1 − z_19) h̃_19
◮ Therefore, h_20 = z_20 z_19 h_18 + z_20 (1 − z_19) h̃_19 + (1 − z_20) h̃_20
◮ If z_19 → 0 then only part of h_18 is relevant, through h̃_19
◮ If z_19 → 1 then h̃_19 is ignored completely. It is as if the input x_19 is completely ignored.
◮ There is a jump between timesteps 18 and 20.
◮ It is a shortcut across time.

21
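Not from the slides: a small scalar (1 × 1) check of the two-step expansion above, using made-up values for the update gates and candidate memories.

```python
# Scalar check of: h_20 = z_20*z_19*h_18 + z_20*(1 - z_19)*h_tilde_19 + (1 - z_20)*h_tilde_20
z_20, z_19 = 0.7, 0.4               # made-up update gate values
h_18, h_tilde_19, h_tilde_20 = 0.5, -0.3, 0.9

h_19 = z_19 * h_18 + (1 - z_19) * h_tilde_19          # one GRU memory update
h_20_direct = z_20 * h_19 + (1 - z_20) * h_tilde_20   # next update
h_20_expanded = z_20 * z_19 * h_18 + z_20 * (1 - z_19) * h_tilde_19 + (1 - z_20) * h_tilde_20

print(abs(h_20_direct - h_20_expanded) < 1e-12)       # True
```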

slide-68
SLIDE 68

Single variable chain rule

Question: If z is dependent on y, and y is dependent on x, what is dz/dx?

22

slide-69
SLIDE 69

Single variable chain rule

Question: If z is dependent on y, and y is dependent on x, what is dz/dx?

Answer:

dz/dx = (dz/dy)(dy/dx)

22

slide-70
SLIDE 70

Single variable chain rule

Question: If z is dependent on y, and y is dependent on x, what is dz/dx?

Answer:

dz/dx = (dz/dy)(dy/dx)

Example

◮ z = exp(y)
◮ y = x²
◮ dz/dy = exp(y)
◮ dy/dx = 2x
◮ dz/dx = exp(y) × 2x

22
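Not on the slides: a quick numeric sanity check of this example, comparing the chain-rule derivative exp(y) × 2x = exp(x²) × 2x against a finite-difference estimate at a made-up point.

```python
import math

def f(x):
    return math.exp(x ** 2)           # z = exp(y) with y = x^2

x = 0.7
analytic = math.exp(x ** 2) * 2 * x   # dz/dx from the chain rule
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # central finite difference

print(analytic, numeric)              # the two values agree to ~6 decimal places
```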

slide-71
SLIDE 71

Multivariate chain rule (Very Important!)

23

slide-72
SLIDE 72

Multivariate chain rule example

◮ z = x + y

24

slide-73
SLIDE 73

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)

24

slide-74
SLIDE 74

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)
◮ dz/dt = (∂z/∂y)(dy/dt) + (∂z/∂x)(dx/dt)

24

slide-75
SLIDE 75

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)
◮ dz/dt = (∂z/∂y)(dy/dt) + (∂z/∂x)(dx/dt)
◮ ∂z/∂y = 1

24

slide-76
SLIDE 76

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)
◮ dz/dt = (∂z/∂y)(dy/dt) + (∂z/∂x)(dx/dt)
◮ ∂z/∂y = 1
◮ ∂z/∂x = 1

24

slide-77
SLIDE 77

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)
◮ dz/dt = (∂z/∂y)(dy/dt) + (∂z/∂x)(dx/dt)
◮ ∂z/∂y = 1
◮ ∂z/∂x = 1
◮ dy/dt = exp(t)

24

slide-78
SLIDE 78

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)
◮ dz/dt = (∂z/∂y)(dy/dt) + (∂z/∂x)(dx/dt)
◮ ∂z/∂y = 1
◮ ∂z/∂x = 1
◮ dy/dt = exp(t)
◮ dx/dt = 2t

24

slide-79
SLIDE 79

Multivariate chain rule example

◮ z = x + y
◮ x = t², y = exp(t)
◮ dz/dt = (∂z/∂y)(dy/dt) + (∂z/∂x)(dx/dt)
◮ ∂z/∂y = 1
◮ ∂z/∂x = 1
◮ dy/dt = exp(t)
◮ dx/dt = 2t
◮ Finally, dz/dt = exp(t) + 2t

24
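Not from the slides: a numeric check of this result, comparing dz/dt = exp(t) + 2t against a finite-difference estimate of z(t) = t² + exp(t) at a made-up point.

```python
import math

def z(t):
    return t ** 2 + math.exp(t)       # z = x + y with x = t^2, y = exp(t)

t = 1.3
analytic = math.exp(t) + 2 * t        # from the multivariate chain rule
eps = 1e-6
numeric = (z(t + eps) - z(t - eps)) / (2 * eps)   # central finite difference

print(analytic, numeric)              # both ~6.269, so the two agree
```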

slide-80
SLIDE 80

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

25

slide-81
SLIDE 81

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

◮ Enumerate all paths from W_hh to h_3 and sum the derivatives

25

slide-82
SLIDE 82

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

◮ Enumerate all paths from W_hh to h_3 and sum the derivatives
◮ An arrow between two nodes is the derivative between the two nodes.

25

slide-83
SLIDE 83

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

◮ Enumerate all paths from W_hh to h_3 and sum the derivatives
◮ An arrow between two nodes is the derivative between the two nodes.
  • 1. W_hh → h_3 directly

25

slide-84
SLIDE 84

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

◮ Enumerate all paths from W_hh to h_3 and sum the derivatives
◮ An arrow between two nodes is the derivative between the two nodes.
  • 1. W_hh → h_3 directly
  • 2. W_hh → h_2 → h_3

25

slide-85
SLIDE 85

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

◮ Enumerate all paths from W_hh to h_3 and sum the derivatives
◮ An arrow between two nodes is the derivative between the two nodes.
  • 1. W_hh → h_3 directly
  • 2. W_hh → h_2 → h_3
  • 3. W_hh → h_1 → h_2 → h_3

25

slide-86
SLIDE 86

Multivariate chain rule in summary

What is the derivative of h_3 with respect to W_hh? (∂h_3/∂W_hh)

◮ Enumerate all paths from W_hh to h_3 and sum the derivatives
◮ An arrow between two nodes is the derivative between the two nodes.
  • 1. W_hh → h_3 directly
  • 2. W_hh → h_2 → h_3
  • 3. W_hh → h_1 → h_2 → h_3
◮ h_2 → h_3 means ∂h_3/∂h_2

25

slide-87
SLIDE 87

What does it mean?

What is the derivative of h_3 with respect to W_hh?

26

slide-88
SLIDE 88

What does it mean?

What is the derivative of h_3 with respect to W_hh?

◮ ∂h_3/∂W_hh is the sum of the following

26

slide-89
SLIDE 89

What does it mean?

What is the derivative of h_3 with respect to W_hh?

◮ ∂h_3/∂W_hh is the sum of the following:

1. ∂h_3/∂W_hh (the direct path)

26

slide-90
SLIDE 90

What does it mean?

What is the derivative of h_3 with respect to W_hh?

◮ ∂h_3/∂W_hh is the sum of the following:

1. ∂h_3/∂W_hh (the direct path)
2. (∂h_3/∂h_2)(∂h_2/∂W_hh)

26

slide-91
SLIDE 91

What does it mean?

What is the derivative of h_3 with respect to W_hh?

◮ ∂h_3/∂W_hh is the sum of the following:

1. ∂h_3/∂W_hh (the direct path)
2. (∂h_3/∂h_2)(∂h_2/∂W_hh)
3. (∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh)

26

slide-92
SLIDE 92

RNN expansion

Equations of basic RNN

◮ h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

◮ ∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh (sum the loss over each time step t)

27

slide-93
SLIDE 93

RNN expansion

Equations of basic RNN

◮ h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

◮ ∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh (sum the loss over each time step t)
◮ ∂L_t/∂W_hh = (∂L_t/∂ŷ_t)(∂ŷ_t/∂h_t) Σ_{k=1}^{t} (∂h_t/∂h_k)(∂h_k/∂W_hh) (each h_t is computed using W_hh; apply the multivariate chain rule)

27

slide-94
SLIDE 94

RNN expansion

Equations of basic RNN

◮ h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

◮ ∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh (sum the loss over each time step t)
◮ ∂L_t/∂W_hh = (∂L_t/∂ŷ_t)(∂ŷ_t/∂h_t) Σ_{k=1}^{t} (∂h_t/∂h_k)(∂h_k/∂W_hh) (each h_t is computed using W_hh; apply the multivariate chain rule)
◮ ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j−1} (each h_j is immediately dependent on h_{j−1}; apply the single-variable chain rule)

27

slide-95
SLIDE 95

RNN expansion

Equations of basic RNN

◮ h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

◮ ∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh (sum the loss over each time step t)
◮ ∂L_t/∂W_hh = (∂L_t/∂ŷ_t)(∂ŷ_t/∂h_t) Σ_{k=1}^{t} (∂h_t/∂h_k)(∂h_k/∂W_hh) (each h_t is computed using W_hh; apply the multivariate chain rule)
◮ ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j−1} (each h_j is immediately dependent on h_{j−1}; apply the single-variable chain rule)
◮ ∂h_j/∂h_{j−1} = W_hh (1 − (h_j)²) (the derivative of tanh(x) is 1 − (tanh(x))²)

27

slide-96
SLIDE 96

RNN expansion

Equations of basic RNN

◮ h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
◮ ŷ_t = W_hy h_t + b_y
◮ p_t = softmax(ŷ_t)

◮ ∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh (sum the loss over each time step t)
◮ ∂L_t/∂W_hh = (∂L_t/∂ŷ_t)(∂ŷ_t/∂h_t) Σ_{k=1}^{t} (∂h_t/∂h_k)(∂h_k/∂W_hh) (each h_t is computed using W_hh; apply the multivariate chain rule)
◮ ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j−1} (each h_j is immediately dependent on h_{j−1}; apply the single-variable chain rule)
◮ ∂h_j/∂h_{j−1} = W_hh (1 − (h_j)²) (the derivative of tanh(x) is 1 − (tanh(x))²)
◮ Repeated multiplication by W_hh when t ≫ k causes vanishing or exploding gradients

27
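Not on the slides: a small NumPy illustration of the last bullet in the 1 × 1 case. Running a tanh RNN for several steps and multiplying the per-step derivatives W_hh(1 − h_j²) shows how ∂h_t/∂h_k shrinks as t − k grows (the weights and inputs are made-up values).

```python
import numpy as np

w_xh, w_hh = 0.5, 0.9          # made-up 1x1 weights
xs = np.ones(20) * 0.1         # made-up constant input sequence
h = 0.0
hs = []
for x in xs:
    h = np.tanh(w_xh * x + w_hh * h)
    hs.append(h)

# dh_t/dh_k = product over j = k+1..t of W_hh * (1 - h_j^2)
def dh_dh(hs, k, t):
    grad = 1.0
    for j in range(k + 1, t + 1):
        grad *= w_hh * (1 - hs[j] ** 2)
    return grad

for k in (18, 15, 10, 0):      # 0-indexed time steps, t = 19
    print(f"dh_19/dh_{k} = {dh_dh(hs, k, 19):.3e}")
# the magnitude shrinks rapidly as t - k grows: the gradient vanishes
```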

slide-97
SLIDE 97

GRU expansion

How does ∂h_j/∂h_{j−1} look in the case of the GRU?

◮ It is not as simple as in the case of the RNN. Why?

28

slide-98
SLIDE 98

GRU expansion

How does ∂h_j/∂h_{j−1} look in the case of the GRU?

◮ It is not as simple as in the case of the RNN. Why?
◮ The GRU has two gates that play a role in the computation of the hidden state h_t

28

slide-99
SLIDE 99

GRU expansion

How does ∂h_j/∂h_{j−1} look in the case of the GRU?

◮ It is not as simple as in the case of the RNN. Why?
◮ The GRU has two gates that play a role in the computation of the hidden state h_t
◮ How does it look?

28

slide-100
SLIDE 100

GRU expansion

How does ∂h_j/∂h_{j−1} look in the case of the GRU?

◮ It is not as simple as in the case of the RNN. Why?
◮ The GRU has two gates that play a role in the computation of the hidden state h_t
◮ How does it look?
◮ [z((1 − z)W^(z)(1 − h) + 1)] + [r(1 − z)(1 − h²)U(1 + h_{t−1}(1 − r)W^(r))]

28

slide-101
SLIDE 101

Analysis of the partial derivative

[z((1 − z)W^(z)(1 − h) + 1)] + [r(1 − z)(1 − h²)U(1 + h_{t−1}(1 − r)W^(r))]

◮ When z → 0: the first term vanishes. Only dependent on the reset gate.

29

slide-102
SLIDE 102

Analysis of the partial derivative

[z((1 − z)W^(z)(1 − h) + 1)] + [r(1 − z)(1 − h²)U(1 + h_{t−1}(1 − r)W^(r))]

◮ When z → 0: the first term vanishes. Only dependent on the reset gate.
◮ When z → 1: the expression is ≈ 1. Repeated multiplication by 1 does not affect the gradient. (Good for long-term dependencies)

29

slide-103
SLIDE 103

Analysis of the partial derivative

[z((1 − z)W^(z)(1 − h) + 1)] + [r(1 − z)(1 − h²)U(1 + h_{t−1}(1 − r)W^(r))]

◮ When z → 0: the first term vanishes. Only dependent on the reset gate.
◮ When z → 1: the expression is ≈ 1. Repeated multiplication by 1 does not affect the gradient. (Good for long-term dependencies)
◮ When r → 0: the second term vanishes. (Good for short-term dependencies)

29

slide-104
SLIDE 104

Analysis of the partial derivative

[z((1 − z)W^(z)(1 − h) + 1)] + [r(1 − z)(1 − h²)U(1 + h_{t−1}(1 − r)W^(r))]

◮ When z → 0: the first term vanishes. Only dependent on the reset gate.
◮ When z → 1: the expression is ≈ 1. Repeated multiplication by 1 does not affect the gradient. (Good for long-term dependencies)
◮ When r → 0: the second term vanishes. (Good for short-term dependencies)
◮ There is no guarantee that the gradient is always going to be multiplied by a weight matrix.

29
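Not from the slides: a numeric sanity check of the z → 1 case using a scalar GRU step. The update-gate pre-activation is given a large positive bias so that z_t ≈ 1, and a central finite difference of h_t with respect to h_{t−1} then comes out very close to 1 (all weights are made up, and the bias term b_z is added only for this illustration).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_1x1(x, h_prev, w_z, u_z, b_z, w_r, u_r, w, u):
    """Scalar GRU step; b_z is an illustrative bias used to push z_t towards 1."""
    z = sigmoid(w_z * x + u_z * h_prev + b_z)
    r = sigmoid(w_r * x + u_r * h_prev)
    h_tilde = np.tanh(w * x + r * u * h_prev)
    return z * h_prev + (1 - z) * h_tilde

params = dict(w_z=0.3, u_z=0.2, b_z=8.0,   # large bias -> z_t ~ 1
              w_r=0.5, u_r=-0.4, w=0.7, u=0.9)
x, h_prev, eps = 0.1, 0.4, 1e-6

dh = (gru_1x1(x, h_prev + eps, **params) - gru_1x1(x, h_prev - eps, **params)) / (2 * eps)
print(dh)   # close to 1: the update gate lets the gradient pass through unchanged
```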

slide-105
SLIDE 105

Conclusion

When r = 1 and z = 0, we get back the RNN derivative leading to gradient problems. What is the chance for that to happen?

30

slide-106
SLIDE 106

Conclusion

When r = 1 and z = 0, we get back the RNN derivative leading to gradient problems. What is the chance for that to happen? No need to worry. It has to happen at each timestep. Very low chance.

30

slide-107
SLIDE 107

Conclusion

When r = 1 and z = 0, we get back the RNN derivative, leading to gradient problems. What is the chance for that to happen? No need to worry: it would have to happen at each timestep. Very low chance. Clearly, the GRU addresses the vanishing gradient problem.

30

slide-108
SLIDE 108

Long-Short Term Memory Network

◮ The LSTM is more complex than the GRU. It has one extra gate and one extra internal state.

◮ Input gate: i_t = σ(W^(i) x_t + U^(i) h_{t−1})
◮ Forget gate: f_t = σ(W^(f) x_t + U^(f) h_{t−1})
◮ Output gate: o_t = σ(W^(o) x_t + U^(o) h_{t−1})
◮ Candidate gate: g_t = tanh(W^(g) x_t + U^(g) h_{t−1})
◮ Internal memory content: c_t = f_t ◦ c_{t−1} + i_t ◦ g_t
◮ Final hidden state: h_t = o_t ◦ tanh(c_t)

31
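Not on the slides: a minimal NumPy sketch of one LSTM step following the equations above. Weight names such as W_i and U_i are illustrative; the candidate uses tanh, as in the standard formulation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_i, U_i, W_f, U_f, W_o, U_o, W_g, U_g):
    """One LSTM step: input, forget, output gates, candidate, cell state, hidden state."""
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)      # input gate
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)      # forget gate
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev)      # output gate
    g_t = np.tanh(W_g @ x_t + U_g @ h_prev)      # candidate memory
    c_t = f_t * c_prev + i_t * g_t               # internal memory content
    h_t = o_t * np.tanh(c_t)                     # final hidden state
    return h_t, c_t

# toy sizes: 4-dimensional input, 3-dimensional hidden/cell state
rng = np.random.default_rng(2)
mk = lambda m, n: rng.normal(size=(m, n))
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3),
                 mk(3, 4), mk(3, 3), mk(3, 4), mk(3, 3),
                 mk(3, 4), mk(3, 3), mk(3, 4), mk(3, 3))
```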

slide-109
SLIDE 109

Visualizations

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

32

slide-110
SLIDE 110

Visualizations

http://deeplearning.net/tutorial/lstm.html

32

slide-111
SLIDE 111

Visualizations

Chung et al. 2014: https://arxiv.org/abs/1412.3555

32

slide-112
SLIDE 112

Visualizations

Chung et al. 2014: https://arxiv.org/abs/1412.3555

32

slide-113
SLIDE 113

Visualizations

Visualization by Tim Rocktäschel

32

slide-114
SLIDE 114

Comparison between LSTM and GRU

LSTM

◮ i_t = σ(W^(i) x_t + U^(i) h_{t−1})
◮ f_t = σ(W^(f) x_t + U^(f) h_{t−1})
◮ o_t = σ(W^(o) x_t + U^(o) h_{t−1})
◮ g_t = tanh(W^(g) x_t + U^(g) h_{t−1})
◮ c_t = f_t ◦ c_{t−1} + i_t ◦ g_t
◮ h_t = o_t ◦ tanh(c_t)

GRU

◮ z_t = σ(W^(z) x_t + U^(z) h_{t−1})
◮ r_t = σ(W^(r) x_t + U^(r) h_{t−1})
◮ h̃_t = tanh(W x_t + r_t ◦ U h_{t−1})
◮ h_t = z_t ◦ h_{t−1} + (1 − z_t) ◦ h̃_t

Differences

◮ The forget gate is like the update gate
◮ The input gate is like the reset gate
◮ The input gate does not act directly on the previous hidden state
◮ The GRU has fewer parameters

33
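Not from the slides: a quick parameter count backing up the last bullet, using the weight shapes implied by the equations above (input dimension d, hidden dimension n; biases are ignored here).

```python
def rnn_params(d, n):
    return n * d + n * n              # W_xh, W_hh

def gru_params(d, n):
    return 3 * (n * d + n * n)        # (W, U) pairs for z, r and the candidate

def lstm_params(d, n):
    return 4 * (n * d + n * n)        # (W, U) pairs for i, f, o and the candidate g

d, n = 100, 200                       # made-up sizes
print(rnn_params(d, n), gru_params(d, n), lstm_params(d, n))
# 60000 180000 240000 -> the GRU has 3/4 of the LSTM's recurrent parameters
```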

slide-115
SLIDE 115

Is RNN a special case of LSTM?

◮ Not very straightforward.

34

slide-116
SLIDE 116

Is RNN a special case of LSTM?

◮ Not very straightforward.
◮ Input gate is all 1

34

slide-117
SLIDE 117

Is RNN a special case of LSTM?

◮ Not very straightforward.
◮ Input gate is all 1
◮ Forget gate is all 0

34

slide-118
SLIDE 118

Is RNN a special case of LSTM?

◮ Not very straightforward.
◮ Input gate is all 1
◮ Forget gate is all 0
◮ Output gate is all 1

34

slide-119
SLIDE 119

Is RNN a special case of LSTM?

◮ Not very straightforward.
◮ Input gate is all 1
◮ Forget gate is all 0
◮ Output gate is all 1
◮ Remove the final tanh covering the internal memory

34
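Not on the slides: a NumPy check of this reduction. Pinning i_t = 1, f_t = 0, o_t = 1 and dropping the final tanh over the cell leaves h_t = g_t = tanh(W^(g) x_t + U^(g) h_{t−1}), i.e. the basic RNN update (the weights below are made-up values).

```python
import numpy as np

rng = np.random.default_rng(3)
W_g, U_g = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
x_t, h_prev, c_prev = rng.normal(size=4), rng.normal(size=3), rng.normal(size=3)

i_t, f_t, o_t = np.ones(3), np.zeros(3), np.ones(3)   # pinned gates
g_t = np.tanh(W_g @ x_t + U_g @ h_prev)               # candidate memory
c_t = f_t * c_prev + i_t * g_t                        # cell state equals g_t
h_lstm = o_t * c_t                                    # final tanh removed

h_rnn = np.tanh(W_g @ x_t + U_g @ h_prev)             # vanilla RNN with the same weights
print(np.allclose(h_lstm, h_rnn))                     # True
```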