Recurrent Neural Networks
CS 6956: Deep Learning for NLP

Overview
1. Modeling sequences
2. Recurrent neural networks: An abstraction
3. Usage patterns for RNNs
4. Bidirectional RNNs
5. A concrete example: The Elman RNN
6. The vanishing gradient problem
7. Gating and long short-term memory units
A simple RNN

1. How to generate the current state using the previous state and the current input?
   Next state: s_t = φ(s_{t-1} W_s + x_t W_x + b)
2. How to generate the current output using the current state?
   The output is the state. That is, y_t = s_t
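The two-step recipe above can be sketched directly in code. This is a minimal scalar sketch, assuming tanh as the activation φ; the function and parameter names (rnn_step, w_s, w_x, b) mirror the weights in the update rule and are illustrative, not taken from any course code.

```python
import math

# A minimal scalar sketch of the simple RNN above, assuming phi = tanh.
# The names w_s, w_x, b mirror the weights in the state-update rule.
def rnn_step(s_prev, x, w_s, w_x, b):
    """Next state: s_t = phi(s_{t-1} * w_s + x_t * w_x + b)."""
    return math.tanh(s_prev * w_s + x * w_x + b)

# The output at each step is just the state itself: y_t = s_t.
s = 0.0  # initial state s_0
for x in [1.0, -0.5, 2.0]:
    s = rnn_step(s, x, w_s=0.5, w_x=1.0, b=0.0)
```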
How do we train a recurrent network?

We need to specify a problem first. Let's take an example.
- Inputs are sequences (say, of words)
- The outputs are labels associated with each word
- Losses for each word are added up

[Figure: the RNN unrolled over the input "I like cake", starting from an initial state; each word is tagged (Pronoun, Verb, Noun), and the per-word losses loss1, loss2, loss3 are summed into the total Loss.]
Gradients to the rescue

- We have a computation graph
- Use backpropagation to compute gradients of the loss with respect to the parameters (W_s, W_x, b)
  – Sometimes called Backpropagation Through Time (BPTT)
- Update the parameters using SGD or a variant
  – Adam, for example
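As a concrete illustration, here is BPTT worked out by hand for the scalar RNN used in the example below. This is a sketch under stated assumptions: a tanh activation, and an illustrative squared loss l_t = 0.5·(s_t − target_t)² at each step; the per-step losses are summed and the parameters updated with one step of plain SGD.

```python
import math

# A sketch of Backpropagation Through Time for a scalar RNN, assuming a tanh
# activation and an illustrative squared loss l_t = 0.5 * (s_t - target_t)^2
# at each step. The per-step losses are summed, and the parameters
# (w_s, w_x, b) are updated with one step of SGD.
def bptt_sgd_step(xs, targets, w_s, w_x, b, lr=0.1):
    # Forward pass: record each state and its predecessor for the backward pass.
    states, prevs = [], [0.0]  # prevs[t] = s_{t-1}, with s_0 = 0
    for x in xs:
        states.append(math.tanh(prevs[-1] * w_s + x * w_x + b))
        prevs.append(states[-1])
    # Backward pass: accumulate gradients of the total loss.
    g_ws = g_wx = g_b = 0.0
    g_s = 0.0  # gradient flowing into s_t from later time steps
    for t in reversed(range(len(xs))):
        g_s += states[t] - targets[t]       # dl_t/ds_t for the squared loss
        g_u = g_s * (1.0 - states[t] ** 2)  # through tanh: ds_t/du_t = 1 - s_t^2
        g_ws += g_u * prevs[t]              # du_t/dw_s = s_{t-1}
        g_wx += g_u * xs[t]                 # du_t/dw_x = x_t
        g_b += g_u                          # du_t/db = 1
        g_s = g_u * w_s                     # propagate to s_{t-1}
    # SGD update of the parameters
    return w_s - lr * g_ws, w_x - lr * g_wx, b - lr * g_b
```

Calling this repeatedly should drive the summed loss down; a variant like Adam would only replace the final update line.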
Does this work? Let's see a simple example

To avoid complicating the notation more than necessary, suppose
1. The inputs, states and outputs are all scalars
2. The loss at each step is a function L of the state at that step

First input: x_1
Transform: u_1 = s_0 w_s + x_1 w_x + b
State: s_1 = φ(u_1)
Loss: l_1 = L(s_1)

Let's compute the derivative of the loss with respect to the parameter w_s. Following the chain rule:

dl_1/dw_s = dl_1/ds_1 · ds_1/du_1 · du_1/dw_s

Let us examine the non-linearity in this system due to the activation function.
Suppose φ(z) = tanh(z). Then dφ/dz = 1 − tanh²(z), which is always between zero and one.

That is, ds_1/du_1 = 1 − tanh²(u_1): a number between zero and one.
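This claim is easy to check numerically. A small sketch comparing the closed-form derivative 1 − tanh²(z) against a central finite difference:

```python
import math

# Check the slide's claim: for phi(z) = tanh(z), dphi/dz = 1 - tanh(z)^2,
# and this value always lies in (0, 1].
def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

h = 1e-6
for z in [-2.0, -0.3, 0.0, 1.5]:
    numeric = (math.tanh(z + h) - math.tanh(z - h)) / (2 * h)
    assert abs(numeric - tanh_grad(z)) < 1e-6   # matches the closed form
    assert 0.0 < tanh_grad(z) <= 1.0            # between zero and one
```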
Let's see what happens with another input.

Second input: x_2
Transform: u_2 = s_1 w_s + x_2 w_x + b
State: s_2 = φ(u_2)
Loss: l_2 = L(s_2)

Once again, the chain rule gives the derivative of the loss with respect to the parameter w_s. There are now two dependencies on w_s: u_2 depends on it directly, and also through the previous state s_1.

dl_2/dw_s = dl_2/ds_2 · ds_2/du_2 · (du_2/dw_s + du_2/ds_1 · ds_1/du_1 · du_1/dw_s)

How does the first input affect the loss for the second term? Through the term du_2/ds_1 · ds_1/du_1 · du_1/dw_s. But this gradient is multiplied by all the other terms.

Let's focus on the impact of the activation terms. With φ(z) = tanh(z), dφ/dz = 1 − tanh²(z), so both ds_2/du_2 and ds_1/du_1 are numbers between zero and one. Multiplying them scales the gradient down.
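The two-step chain rule above can be checked numerically. This is a sketch in which the abstract loss L is instantiated as L(s) = 0.5·s², and all the numeric values (w_s, w_x, b, s_0, x_1, x_2) are illustrative assumptions:

```python
import math

# A numerical sanity check of the two-step chain rule. The lecture leaves the
# loss L abstract; here we assume L(s) = 0.5 * s^2, and all numeric values
# (w_s, w_x, b, s0, x1, x2) are illustrative.
def forward(w_s, w_x=0.7, b=0.1, s0=0.2, x1=1.0, x2=-0.5):
    s1 = math.tanh(s0 * w_s + x1 * w_x + b)
    s2 = math.tanh(s1 * w_s + x2 * w_x + b)
    return s1, s2

w_s, s0 = 0.4, 0.2
s1, s2 = forward(w_s)
# dl2/dw_s = dl2/ds2 * ds2/du2 * (du2/dw_s + du2/ds1 * ds1/du1 * du1/dw_s)
#          = s2 * (1 - s2^2) * (s1 + w_s * (1 - s1^2) * s0)
analytic = s2 * (1 - s2 ** 2) * (s1 + w_s * (1 - s1 ** 2) * s0)

# Central finite difference of l2 = 0.5 * s2^2 with respect to w_s
h = 1e-6
l2 = lambda w: 0.5 * forward(w)[1] ** 2
numeric = (l2(w_s + h) - l2(w_s - h)) / (2 * h)
assert abs(analytic - numeric) < 1e-6
```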
With one input, the contribution of the first input towards the gradient of the loss of the first output is scaled by one term between zero and one.

With two inputs, the contribution of the first input towards the gradient of the loss of the second output is scaled by two terms between zero and one.

In general, for the nth input:
Transform: u_n = s_{n-1} w_s + x_n w_x + b
State: s_n = φ(u_n)
Loss: l_n = L(s_n)

With n inputs, the contribution of the first input towards the gradient of the loss of the nth output is scaled by n terms between zero and one.
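This geometric shrinking can be made concrete. A sketch (with illustrative weights and a constant input) that tracks how much the first step's gradient contribution is scaled after n steps, via the product of the per-step factors ds_t/ds_{t-1} = (1 − tanh²(u_t)) · w_s:

```python
import math

# How much is the first step's gradient contribution scaled after n steps?
# Each backward step multiplies by ds_t/ds_{t-1} = (1 - tanh(u_t)^2) * w_s,
# where the tanh-derivative factor is between zero and one.
# The weights and the constant input here are illustrative.
def first_step_scale(n, w_s=0.8, w_x=1.0, b=0.0, x=0.5):
    s, scale = 0.0, 1.0
    for _ in range(n):
        s = math.tanh(s * w_s + x * w_x + b)
        scale *= (1.0 - s * s) * w_s
    return abs(scale)
```

The scale factor decays roughly geometrically: the more steps between the first input and the loss, the smaller its contribution.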
The vanishing gradient problem

- As the length of the sequence grows, the impact of the far away inputs diminishes because the gradient vanishes
- We saw an example where states and inputs are scalars
  – The same applies when the states and inputs are vectors/matrices, as in usual networks
- This happens because the gradient of the non-linear activation is a number between zero and one
  – … and many such numbers are multiplied together
- Applicable not only to recurrent networks, but to any case where we have a long chain of such activations (i.e. in a deep neural network)

[Bengio et al 1994]
Why is this a problem?

I have a banana and an apple. My friend ate the banana. I was hungry and wanted a fruit. I really wished I had a banana as well, but we were all out. So I ate the ________?

Consider an RNN language model for this task. If it makes a mistake in the final word, the signal for correcting it is far away.

[Hochreiter and Schmidhuber 1997]: "Backpropagation through time is too sensitive to recent distractions."
In a deep network, this effect also means that layers closer to the loss will get larger updates than layers further away.
Addressing the vanishing gradient problem

Approach 1: Change the activation
- The problem occurs because the derivatives of the activation function are small, so change it
- Commonly used: the rectified linear unit, ReLU(z) = max(0, z)
- Its derivative: dReLU/dz = 1 if z ≥ 0, else 0

Multiplying many of these won't vanish the gradient if the pre-activation value is positive. But it can completely erase the gradient if it is negative.
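A sketch of this behavior (relu_grad and grad_product are illustrative names): the product of ReLU-derivative factors along a path stays exactly 1 while all pre-activations are positive, and collapses to 0 as soon as any one of them is negative.

```python
import math

# The ReLU derivative from the slide: exactly 1 for non-negative
# pre-activations, exactly 0 for negative ones.
def relu_grad(z):
    return 1.0 if z >= 0 else 0.0

# A product of many such factors neither shrinks (all-positive path)
# nor survives (any negative pre-activation erases it).
def grad_product(pre_activations):
    return math.prod(relu_grad(z) for z in pre_activations)
```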
Exploding gradients

If our gradients are not fractional (e.g. with ReLUs), we might end up multiplying many large numbers during gradient computation. This could quickly give numeric overflow errors: the exploding gradient problem.
Addressing vanishing/exploding gradients

Approach 2: Don't take derivatives all the way to the beginning
- The problem occurs because we need to compute derivatives with respect to the early inputs
- Truncate the backpropagation process instead
- Called Truncated Backpropagation Through Time (TBPTT)
- Essentially, this makes a Markov-like assumption
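One simple variant of this idea can be sketched for the earlier scalar RNN: run the forward pass over the whole sequence, but stop the backward pass after a fixed window of steps. The function and parameter names are illustrative, the per-step loss is again assumed to be squared error, and real TBPTT implementations typically also process the sequence in chunks.

```python
import math

# Truncated BPTT, sketched for the scalar RNN: the backward pass stops after
# `window` steps instead of reaching the start of the sequence. A squared
# loss 0.5 * (s_t - target_t)^2 per step is assumed for illustration.
def tbptt_grad_ws(xs, targets, w_s, w_x, b, window=2):
    states, prevs = [], [0.0]  # prevs[t] = s_{t-1}, with s_0 = 0
    for x in xs:
        states.append(math.tanh(prevs[-1] * w_s + x * w_x + b))
        prevs.append(states[-1])
    g_ws, g_s = 0.0, 0.0
    for steps, t in enumerate(reversed(range(len(xs)))):
        g_s += states[t] - targets[t]
        g_u = g_s * (1.0 - states[t] ** 2)
        g_ws += g_u * prevs[t]
        g_s = g_u * w_s
        if steps + 1 == window:  # truncate: drop dependence on earlier states
            break
    return g_ws
```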
Addressing vanishing/exploding gradients

Approach 3: Use a ReLU activation, but explicitly avoid exploding gradients
- If a gradient is larger than a certain threshold, clip it to that threshold (gradient clipping)
- ReLUs reduce vanishing gradients, and clipping takes care of exploding gradients
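A minimal sketch of the thresholding step. This clips a single scalar by value; deep learning libraries more commonly clip the whole gradient vector by its norm, but the idea is the same.

```python
# Gradient clipping: cap the magnitude of a gradient at a threshold so that
# one huge gradient cannot blow up the parameter update.
def clip(grad, threshold=5.0):
    return max(-threshold, min(threshold, grad))
```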
Addressing vanishing/exploding gradients

Approach 4: Changing the internals of the RNN more thoroughly…
… by using a gated architecture such as an LSTM or a GRU unit