Lecture 14: Recurrent Neural Networks
CS109B Data Science 2
Pavlos Protopapas, Mark Glickman, and Chris Tanner
Online lecture guidelines
- We would prefer you have your video on, but it is OK if you have it off.
- We would prefer you use your real name.
- All lectures, labs, and a-sections will be live streamed and also available for viewing later on Canvas/Zoom.
- We will have course staff in the chat online, and during lecture you can also make use of this spreadsheet to enter your own questions or 'up vote' those of your fellow students.
- Quizzes will be available for 24 hours.
Outline
- Why Recurrent Neural Networks (RNNs)
- Main Concept of RNNs
- More Details of RNNs
- RNN training
- Gated RNN
Background
Many classification and regression tasks involve data that is assumed to be independent and identically distributed (i.i.d.). For example:
Detecting lung cancer, face recognition, risk of heart attack.
Background
Much of our data is inherently sequential.
Examples at different scales (world, humanity, people, individual): natural disasters (e.g., earthquakes), climate change, stock market, virus outbreaks, speech recognition, machine translation (e.g., English -> French), cancer treatment.
Much of our data is inherently sequential.
PREDICTING EARTHQUAKES
Background
Much of our data is inherently sequential.
STOCK MARKET PREDICTIONS
Background
Much of our data is inherently sequential.
SPEECH RECOGNITION
“What is the weather today?” “What is the weather two day?” “What is the whether too day?” “What is, the Wrether to Dae?”
Background
Sequence Modeling: Handwritten Text
- Input : Image
- Output: Text
https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5
Sequence Modeling: Text-to-Speech
- Input : Text
- Output: Audio
Sequence Modeling: Machine Translation
- Input : Text
- Output: Translated Text
Outline
- Why RNNs
- Main Concept of RNNs (part 1)
- More Details of RNNs
- RNN training
- Gated RNN
What can my NN do?
(Figure: an NN trained on labeled examples of George, Mary, Tom, and Suzie.)
Training: present examples to the NN and let it learn from them.
What can my NN do?
(Figure: the trained NN labels new examples, e.g. George or Mary.)
Prediction: given a new example, the NN outputs its label.
What can my NN NOT do?
WHO IS IT?
Learn from previous examples
(Figure: a sequence of frames ordered in time.)
Recurrent Neural Network (RNN)
(Figure: the network, shown the sequence of frames, outputs 'George'.)
Recurrent Neural Network (RNN)
RNNs recognize the data's sequential characteristics and use patterns to predict the next likely scenario.
(Figure: the NN outputs 'George': "I have seen George moving in this way before.")
Recurrent Neural Network (RNN)
Our model requires context - or contextual information - to understand the subject (he) and the direct object (it) in the sentence.
"He told me I could have it." WHO IS HE?
I do not know. I need to know who said that and what he said before. Can you tell me more?
RNN – Another Example with Text
After providing sequential information, the model recognizes the subject (Joe's brother) and the object (sweater) in the sentence.
WHO IS HE?
- Hellen: Nice sweater, Joe.
- Joe: Thanks, Hellen. It used to belong to my brother and he told me I could have it.
I see what you mean now! The pronoun "he" refers to Joe's brother, while "it" refers to the sweater.
Sequences
- We want a machine learning model to understand sequences, not isolated samples.
- Can an MLP do this?
- Assume we have a sequence of temperature measurements, and we want to take 3 sequential measurements and predict the next one.
Samples (indices):        1   2   3   4   5   6   7   …
Features (temperatures): 35  32  45  48  41  39  36   …
Sliding the window one step at a time gives overlapping samples, each pairing 3 sequential measurements with the next measurement as the target:

Samples (indices)   Features (values)
1 2 3 4             35 32 45 48
2 3 4 5             32 45 48 41
3 4 5 6             45 48 41 39
Windowed dataset
This is called an overlapping windowed dataset, since we are windowing observations to create new samples. We can easily model it using an MLP:
(Figure: an MLP with 3 inputs and 1 output, two hidden layers of 10 ReLU units each and a ReLU output unit, fed each window of 3 measurements to predict the next one.)
But re-arranging the order of the inputs, e.g. presenting the window 35, 32, 45 as 45, 35, 32 (and likewise for every sample), will produce the same results: the MLP has no built-in notion of the order of its inputs.
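A minimal sketch of building such an overlapping windowed dataset with NumPy (the variable names and the toy series are just for illustration):

```python
import numpy as np

# Toy temperature series from the slide.
temps = np.array([35, 32, 45, 48, 41, 39, 36], dtype=float)

window = 3  # use 3 sequential measurements to predict the next one
X = np.array([temps[i:i + window] for i in range(len(temps) - window)])
y = np.array([temps[i + window] for i in range(len(temps) - window)])

print(X)  # rows: [35 32 45], [32 45 48], [45 48 41], [48 41 39]
print(y)  # targets: 48, 41, 39, 36
```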
Why not CNNs or MLPs?
1. MLPs/CNNs require fixed input and output size.
2. MLPs/CNNs can't classify inputs in multiple places.
Windowed dataset
What follows after 'I got in the car and'? 'drove away'.
What follows after 'In car the and I got'? It is not obvious that it should be 'drove away'.
The order of words matters, and this is true for most sequential data. A fully connected network will not distinguish the order and therefore misses some information.
Outline
- Why RNNs
- Main Concept of RNNs
- More Details of RNNs
- RNN training
- Gated RNN
Memory
Somehow the computational unit should remember what it has seen before.
(Figure: a unit with input $Y_t$ and output $Z_t$; it should remember $Y_0, \dots, Y_{t-1}$.)
Memory
Somehow the computational unit should remember what it has seen before.
(Figure: a unit with internal memory, input $Y_t$ and output $Z_t$.)
Memory
Somehow the computational unit should remember what it has seen before. We'll call this information the unit's state.
(Figure: an RNN unit with internal memory, input $Y_t$ and output $Z_t$.)
Memory
In neural networks, once training is over, the weights do not change: the network is done learning, and when we feed in values it simply applies the operations it has learned. RNN units, however, can remember new information after training has completed. Their internal state keeps changing as new inputs arrive, even though the weights stay fixed.
Memory
Question: How can we do this? How can we build a unit that remembers the past?
The memory or state could be written to a file, but in RNNs we keep it inside the recurrent unit, in an array or a vector!
Working with an example: "Anna Sofia said her shoes are too ugly." Here 'her' means Anna Sofia. "Nikolas put his keys on the table." Here 'his' means Nikolas.
Memory
(Figure: an RNN unit with memory; input $Y_t$, e.g. 'his', and output $Z_t$, e.g. 'Nikolas'.)
Building an RNN
(Figure: building an RNN by chaining the unit through time: at each step the unit receives the input $Y_t$ and the memory passed along from the previous step, and produces the output $Z_t$; the chain continues with $Y_{t+1} \to Z_{t+1}$, $Y_{t+2} \to Z_{t+2}$, $Y_{t+3} \to Z_{t+3}$, each step updating and passing on the memory.)
Outline
- Why RNNs
- Main Concept of RNNs
- More Details of RNNs
- RNN training
- Gated RNN
Structure of an RNN cell
(Figure: an RNN cell with internal state; input $Y_t$ and output $Z_t$; three sets of weights: input weights, update weights, and output weights.)
Anatomy of an RNN unit:
- Input $y_t$ and previous hidden state $h_{t-1}$
- Pre-activation: $A_t = W y_t + V h_{t-1} + c_0$ (input weights $W$, update weights $V$)
- Hidden state: $h_t = \psi(A_t)$, where $\psi$ is the hidden-state activation
- Outputs $z_t$: an output activation applied to $X_1 h_t + c_1$ and $X_2 h_t + c_2$
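A minimal NumPy sketch of one step of such a unit, assuming tanh for the hidden-state activation and the identity for the output; the sizes, names, and initialization are illustrative only:

```python
import numpy as np

d_in, d_h, d_out = 1, 8, 1              # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))        # input weights
V = rng.normal(size=(d_h, d_h))         # update (recurrent) weights
X_out = rng.normal(size=(d_out, d_h))   # output weights
c0, c = np.zeros(d_h), np.zeros(d_out)

def rnn_step(y_t, h_prev):
    A_t = W @ y_t + V @ h_prev + c0     # pre-activation
    h_t = np.tanh(A_t)                  # hidden-state activation (psi = tanh)
    z_t = X_out @ h_t + c               # output (phi = identity)
    return h_t, z_t

h, z = rnn_step(np.array([35.0]), np.zeros(d_h))
```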
Outline
- Why RNNs
- Main Concept of RNNs
- More Details of RNNs
- RNN training
- Gated RNN
Backprop Through Time
- For each input sequence, unfold the network for the sequence length T.
- Back-propagation: apply the forward and backward pass on the unfolded network.
- Memory cost: O(T) (see the sketch below).
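A sketch of unfolding the hypothetical rnn_step from earlier over a whole sequence; keeping every hidden state for the backward pass is where the O(T) memory cost comes from:

```python
def forward_unroll(ys, h0):
    """Run the cell over a length-T sequence, storing every hidden state."""
    h, states, outputs = h0, [], []
    for y_t in ys:              # one call per time step
        h, z = rnn_step(y_t, h)
        states.append(h)        # kept for the backward pass -> O(T) memory
        outputs.append(z)
    return states, outputs
```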
Backprop Through Time
(Figure: the RNN cell with its state, input $Y_t$, output $Z_t$, and its input weights, update weights, and output weights, which are the parameters learned during training.)
Backprop Through Time
(Figure: the same cell with the hidden state $h_t$ and the weights labeled: update weights U, output weights W, input weights V.)
Backprop Through Time
There are two activation functions: $\psi$, which serves as the activation for the hidden state, and $\phi$, which is the activation of the output. In the example shown before, $\phi$ was the identity.
ℎ"&7 𝑌"&7 V U W ℎ"&' 𝑌"&' V U W ℎ" 𝑌" V U W 𝑧 Et-2 𝑧 Et-1 𝑧 Et
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Backprop Through Time
The per-step losses sum to the total loss:
$$L = \sum_t L_t, \qquad L_t = L_t(\hat z_t), \qquad \hat z_t = \phi(X h_t + c).$$
Gradient with respect to the output weights $X$:
$$\frac{\partial L}{\partial X} = \sum_t \frac{\partial L_t}{\partial X} = \sum_t \frac{\partial L_t}{\partial \hat z_t}\,\frac{\partial \hat z_t}{\partial X}, \qquad \frac{\partial \hat z_t}{\partial X} = \phi'(X h_t + c)\, h_t.$$
Backprop Through Time
For the update (recurrent) weights $V$, with
$$\hat z_t = \phi(X h_t + c), \qquad h_t = \psi(W y_t + V h_{t-1} + c_h), \qquad L = \sum_t L_t, \qquad L_t = L_t(\hat z_t),$$
so that $\hat z_t = \phi\big(X\,\psi(W y_t + V h_{t-1} + c_h) + c\big)$, the chain rule gives
$$\frac{\partial L}{\partial V} = \sum_t \frac{\partial L_t}{\partial \hat z_t}\,\frac{\partial \hat z_t}{\partial h_t}\,\frac{\partial h_t}{\partial V}.$$
Because $h_t$ depends on $V$ both directly and through all earlier hidden states,
$$\frac{\partial L_t}{\partial V} = \frac{\partial L_t}{\partial \hat z_t}\frac{\partial \hat z_t}{\partial h_t}\left(\frac{\partial h_t}{\partial V} + \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial V} + \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}}\frac{\partial h_{t-2}}{\partial V} + \cdots\right),$$
where
$$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}, \qquad \frac{\partial h_j}{\partial h_{j-1}} = \psi'(A_j)\,V.$$
This long product of factors is what makes the gradients of long-term dependencies vanish or explode.
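A toy scalar illustration of why this long product of Jacobian factors is problematic (the numbers are purely illustrative, not from the slides):

```python
# If every factor dh_j/dh_{j-1} were a scalar a, the product over 100 steps is a**100.
for a in (0.9, 1.0, 1.1):
    print(a, a ** 100)   # ~2.7e-05 (vanishes), 1.0, ~1.4e+04 (explodes)
```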
CS109B, PROTOPAPAS, GLICKMAN, TANNER
53
CS109B, PROTOPAPAS, GLICKMAN, TANNER
Gradient Clipping
Prevents exploding gradients: clip the norm of the gradient before the update. For a gradient $v$ and a threshold $u$:
$$\text{if } \lVert v \rVert > u: \quad v \leftarrow \frac{u\,v}{\lVert v \rVert}$$
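A minimal NumPy sketch of clipping by norm; many frameworks also expose this directly (e.g. the clipnorm argument of Keras optimizers):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = threshold * grad / norm
    return grad
```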
Outline
- Why RNNs
- Main Concept of RNNs
- More Details of RNNs
- RNN training
- Gated RNN
Long-term Dependencies
- Unfolded networks can be very deep.
- Long-term interactions are given exponentially smaller weights than short-term interactions.
- Gradients tend to either vanish or explode.
Long Short-Term Memory
- Handles long-term dependencies.
- Leaky units, where the weight α on the self-loop is context-dependent.
- Allows the network to decide whether to accumulate or forget past information.
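A minimal Keras sketch of an LSTM on the windowed temperature data; the layer sizes and optimizer settings are illustrative, not from the slides:

```python
from tensorflow import keras

# Windows of 3 measurements, 1 feature each, predicting the next value.
model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(3, 1)),  # gated recurrent layer
    keras.layers.Dense(1),                      # next-value prediction
])
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1.0), loss="mse")
```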
Notation
Using conventional and convenient notation
(Figure: the recurrent unit drawn with input $Y_t$ and output $Z_t$.)
Simple RNN again
(Figure: the simple RNN cell drawn with its state, the weights V, W, U, sigmoid blocks, input $Y_t$, output $Z_t$, and hidden state $h_t$.)
Input $y_t$ and previous hidden state $h_{t-1}$:
$$A_t = W y_t + V h_{t-1} + c_0, \qquad h_t = \psi(A_t),$$
with outputs obtained by applying an activation to $X_1 h_t + c_1$ and $X_2 h_t + c_2$.
Simple RNN again: Memories
(Figure: the simple RNN cell with its state, weights V, W, U, input $Y_t$, output $Z_t$, and hidden state $h_t$; the following slides annotate this same diagram.)
Simple RNN again: Memories - Forgetting
Simple RNN again: New Events
Simple RNN again: New Events Weighted
Simple RNN again: Updated memories