

SLIDE 1

CS 6956: Deep Learning for NLP

Recurrent Neural Networks

SLIDE 2

Overview

1. Modeling sequences
2. Recurrent neural networks: An abstraction
3. Usage patterns for RNNs
4. Bidirectional RNNs
5. A concrete example: The Elman RNN
6. The vanishing gradient problem
7. Gating and long short-term memory units

SLIDE 4

A simple RNN

  • 1. How to generate the current state using the previous state and the current input?

    Next state: s_t = h(s_{t-1} X^s + y_t X^y + c)

  • 2. How to generate the current output using the current state?

    The output is the state. That is, z_t = s_t
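To make the recurrence concrete, here is a minimal sketch of one step in Python/NumPy. The names (elman_step, X_s, X_y) are illustrative, not from the slides, and tanh is assumed as the activation h.

```python
import numpy as np

def elman_step(s_prev, y_t, X_s, X_y, c):
    """One Elman RNN step: s_t = h(s_{t-1} X_s + y_t X_y + c), with h = tanh."""
    s_t = np.tanh(s_prev @ X_s + y_t @ X_y + c)
    z_t = s_t          # the output is the state itself
    return s_t, z_t
```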

SLIDE 5

How do we train a recurrent network?

We need to specify a problem first. Let's take an example.

– Inputs are sequences (say, of words)
– The outputs are labels associated with each word
– Losses for each word are added up

[Figure: an RNN unrolled over the sentence "I like cake", starting from an initial state, with a label for each word (Pronoun, Verb, Noun) and per-word losses loss1, loss2, loss3 summed into a total Loss]

SLIDE 8

Gradients to the rescue

  • We have a computation graph
  • Use backpropagation to compute gradients of the loss with respect to the parameters (X^s, X^y, c)
    – Sometimes called Backpropagation Through Time (BPTT)
  • Update the parameters using SGD or a variant
    – Adam, for example
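As a sketch, the whole pipeline might look like the following in PyTorch; the sizes, data, and three-tag label set are illustrative assumptions, and calling loss.backward() performs BPTT through the unrolled graph.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=50, hidden_size=100, batch_first=True)
tagger = nn.Linear(100, 3)                      # e.g., Pronoun / Verb / Noun
loss_fn = nn.CrossEntropyLoss(reduction="sum")  # per-word losses are added up
params = list(rnn.parameters()) + list(tagger.parameters())
optimizer = torch.optim.Adam(params)

x = torch.randn(1, 3, 50)         # one sentence of three word vectors
y = torch.tensor([[0, 1, 2]])     # one label per word

states, _ = rnn(x)                # one state per position
logits = tagger(states)           # one prediction per word
loss = loss_fn(logits.view(-1, 3), y.view(-1))

optimizer.zero_grad()
loss.backward()                   # backpropagation through time
optimizer.step()
```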


SLIDE 10

Does this work? Let's see a simple example

To avoid complicating the notation more than necessary, suppose:

  • 1. The inputs, states, and outputs are all scalars
  • 2. The loss at each step is a function g of the state at that step

First input: y_1
Transform: e_1 = s_0 X^s + y_1 X^y + c
State: s_1 = h(e_1)
Loss: m_1 = g(s_1)

Let's compute the derivative of the loss with respect to the parameter X. Following the chain rule:

∂m_1/∂X = ∂m_1/∂s_1 · ∂s_1/∂e_1 · ∂e_1/∂X

Let us examine the non-linearity in this system due to the activation function. Suppose h(z) = tanh(z). Then dh/dz = 1 − tanh²(z), which is always between zero and one. That is,

∂s_1/∂e_1 = 1 − tanh²(e_1)

is a number between zero and one.

SLIDE 23

Does this work? Let's see a simple example

Let's see what happens with another input.

First input: y_1
Transform: e_1 = s_0 X^s + y_1 X^y + c
State: s_1 = h(e_1)
Loss: m_1 = g(s_1)

Second input: y_2
Transform: e_2 = s_1 X^s + y_2 X^y + c
State: s_2 = h(e_2)
Loss: m_2 = g(s_2)

Once again, the chain rule:

∂m_2/∂X = ∂m_2/∂s_2 · ∂s_2/∂e_2 · (∂e_2/∂X + ∂e_2/∂s_1 · ∂s_1/∂e_1 · ∂e_1/∂X)

There are two dependencies on X: a direct one through e_2, and an indirect one through the previous state s_1. How does the first input affect the loss at the second step? Through the second term, and that gradient is multiplied by all the other terms in the chain.

Let's focus on the impact of the activation terms. With h(z) = tanh(z), dh/dz = 1 − tanh²(z), so both ∂s_1/∂e_1 and ∂s_2/∂e_2 are numbers between zero and one. Multiplying them scales the gradient down.
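Extending the sanity check to two steps shows autograd adding up exactly these two dependency paths; as before, g(s) = s² and the constants are arbitrary, and X plays the role of the recurrent weight.

```python
import torch

X = torch.tensor(0.5, requires_grad=True)   # recurrent weight
X_y, c, s0, y1, y2 = 0.3, 0.1, 0.2, 1.0, -1.0

s1 = torch.tanh(s0 * X + y1 * X_y + c)
s2 = torch.tanh(s1 * X + y2 * X_y + c)
m2 = s2 ** 2
m2.backward()

with torch.no_grad():
    prefix = 2 * s2 * (1 - s2 ** 2)             # dm2/ds2 * ds2/de2
    direct = prefix * s1                        # de2/dX: the direct path
    indirect = prefix * X * (1 - s1 ** 2) * s0  # the path through s1
print(X.grad.item(), (direct + indirect).item())  # the two values agree
```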
SLIDE 40

Does this work? Let's see a simple example

First input: y_1, e_1 = s_0 X^s + y_1 X^y + c, s_1 = h(e_1), loss m_1 = g(s_1)
Second input: y_2, e_2 = s_1 X^s + y_2 X^y + c, s_2 = h(e_2), loss m_2 = g(s_2)
nth input: y_n, e_n = s_{n-1} X^s + y_n X^y + c, s_n = h(e_n), loss m_n = g(s_n)

With one input, the contribution of the first input towards the gradient of the loss of the first output is scaled by one term between zero and one.

With two inputs, the contribution of the first input towards the gradient of the loss of the second output is scaled by two terms between zero and one.

With n inputs, the contribution of the first input towards the gradient of the loss of the nth output is scaled by n terms between zero and one.
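A tiny simulation makes the shrinkage visible: each step contributes one factor of 1 − tanh²(e_t), so the accumulated product decays toward zero as n grows. The weights and random inputs below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X_s, X_y, c = 0.9, 0.5, 0.0
s, product = 0.0, 1.0
for n in range(1, 51):
    e = s * X_s + rng.normal() * X_y + c
    s = np.tanh(e)
    product *= 1 - s ** 2            # the activation factor at step n
    if n in (1, 5, 10, 25, 50):
        print(n, product)            # shrinks rapidly toward zero
```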

SLIDE 43

The vanishing gradient problem

  • As the length of the sequence grows, the impact of far-away inputs diminishes because the gradient vanishes
  • We saw an example where states and inputs are scalars
    – The same applies when the states and inputs are vectors/matrices, as in usual networks
  • This happens because the gradient of the non-linear activation is a number between zero and one
    – … and many such numbers are multiplied together
  • Applicable not only to recurrent networks, but to any case where we have a long chain of such activations (i.e., in a deep network): layers closer to the loss will get larger updates

[Bengio et al. 1994]

Why is this a problem? The signal needed to fill the blank can sit farther and farther away:

I have a banana and an apple. My friend ate the banana and I ate the ________?

I have a banana and an apple. My friend ate the banana. I was hungry and wanted a fruit. So I ate the ________?

I have a banana and an apple. My friend ate the banana. I was hungry and wanted a fruit. I really wished I had a banana as well, but we were all out. So I ate the ________?

Consider an RNN language model for this task. If it makes a mistake in the final word, the signal for correcting it is far away.

[Hochreiter and Schmidhuber 1997]: "Backpropagation through time is too sensitive to recent distractions."

SLIDE 52

Addressing the vanishing gradient problem

Approach 1: Change the activation
– The problem occurs because the derivatives of the activation function are small, so change it
– Commonly used: the rectified linear unit, ReLU(z) = max(0, z)
– What is its derivative? d ReLU(z)/dz = 1 if z ≥ 0, else 0

Multiplying many of these won't vanish the gradient if the pre-activation value is positive, but it can completely erase the gradient if it is negative.
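A small illustration of that derivative and its failure mode; relu_grad is an ad hoc helper, not a library function.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU as on the slide: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

all_positive = np.array([0.7, 1.2, 3.0, 0.4, 2.5])
one_negative = np.array([0.7, 1.2, -0.1, 0.4, 2.5])
print(np.prod(relu_grad(all_positive)))  # 1.0: the gradient survives intact
print(np.prod(relu_grad(one_negative)))  # 0.0: one negative value erases it
```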

SLIDE 57

Exploding gradients

If our gradients are not fractional (e.g., with ReLUs), we might end up multiplying many large numbers during gradient computation. This can quickly cause numeric overflow errors.

This is called the exploding gradient problem.
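For instance, with a recurrent weight of magnitude above one and per-step activation gradients of one, the accumulated factor grows geometrically; a toy illustration:

```python
grad = 1.0
w = 3.0                   # a recurrent weight with magnitude above one
for t in range(700):
    grad *= w             # one multiplicative factor per time step
print(grad)               # inf: 3**700 overflows float64 (max ~1.8e308)
```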

SLIDE 58

Addressing vanishing/exploding gradients

Approach 2: Don't take derivatives all the way back to the beginning
– The problem occurs because we need to compute derivatives with respect to the early inputs
– Truncate the backpropagation process instead
– Called Truncated Backpropagation Through Time (TBPTT)

Essentially, this makes a Markov-like assumption. A sketch of one common way to do this follows.
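A minimal sketch, assuming we process a long sequence in fixed-size chunks and carry the hidden state across chunk boundaries while detaching it, so gradients stop at each boundary; the sizes and the stand-in loss are illustrative.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
long_sequence = torch.randn(1, 100, 10)       # too long to backprop end to end

h = None
for chunk in long_sequence.split(25, dim=1):  # 25-step windows
    out, h = rnn(chunk, h)
    loss = out.pow(2).mean()                  # stand-in loss for the sketch
    rnn.zero_grad()
    loss.backward()                           # gradients flow only within the chunk
    h = h.detach()                            # cut the graph: truncated BPTT
```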

SLIDE 59

Addressing vanishing/exploding gradients

Approach 3: Use a ReLU activation, but explicitly avoid exploding gradients
– If a gradient is larger than a certain threshold, truncate (clip) it
– ReLUs reduce vanishing gradients, and clipping takes care of exploding gradients
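In PyTorch, clipping by global norm is built in as torch.nn.utils.clip_grad_norm_; a minimal sketch with a contrived loss:

```python
import torch

params = [torch.randn(5, requires_grad=True)]
loss = (1000.0 * params[0]).sum()   # produces a deliberately huge gradient
loss.backward()

torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # rescale to norm <= 1
print(params[0].grad.norm())        # now at most 1.0
```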

SLIDE 60

Addressing vanishing/exploding gradients

Approach 4: Change the internals of the RNN more thoroughly…

… by using a gated architecture such as an LSTM or a GRU unit
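In PyTorch, swapping the simple cell for a gated one is nearly a drop-in change; nn.LSTM and nn.GRU are the standard built-in gated architectures (the sizes below are illustrative).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=100, batch_first=True)
x = torch.randn(1, 3, 50)
states, (h_n, c_n) = lstm(x)   # an LSTM carries a gated cell state c alongside h
print(states.shape)            # torch.Size([1, 3, 100]): one state per word
```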