Deep Learning
Recurrent Networks: Part 3 Fall 2020
Story so far
Iterated structures are good for analyzing time series data with short-time dependence on the past
[Figure: a series of stock vectors X(t) … X(t+7) used to predict a future value Y(t+6)]
2
– These are "time delay" neural nets, AKA convnets
Recurrent structures are good for analyzing time series data with long-term dependence on the past
– These are recurrent neural networks
[Figure: a recurrent network unrolled over time, with input X(t), output Y(t), and initial state h(-1) at t=0]
3
4
An MLP can learn to add two binary numbers, but:
– Input is binary – will require a large number of training instances
– A network trained for N-bit numbers will not work for N+1-bit numbers
A recurrent network can add the numbers serially, one bit pair at a time
– With very little training data!
[Figure: an MLP that adds entire binary strings at once vs. an RNN unit that adds one bit pair at a time, carrying the previous carry forward as its recurrent state]
5
[Figure: the network unrolled over time with inputs X(t), outputs Y(t), and initial state h(-1); training minimizes the DIVERGENCE between Y(t) and the desired outputs Ydesired(t)]
6
How to define this divergence, and how to train the network to minimize it, is the primary topic for today
7
[Figure: typical activation functions: sigmoid, tanh, ReLU]
8
[Figure: the network architecture, from the input layer to the output layer]
9
10
11
[Figure: the network unrolled over time, with input X(t) and output Y(t)]
12
[Figure: a bidirectional recurrent network over inputs X(0) … X(T): a forward net with initial state hf(-1) and a backward net with initial state hb(inf) jointly produce outputs Y(0) … Y(T)]
13
[Figure: the unrolled network trained to minimize the DIVERGENCE between the outputs Y(t) and the desired outputs Ydesired(t)]
14
15
Images from Karpathy
16
– Sequence input, single output: e.g. isolated word/phrase recognition
– Sequence input, sequence output, where the exact location of each output is unknown a priori: e.g. speech recognition
17
– Sequence output that is not aligned with the input: e.g. language translation
– Single input, sequence output: e.g. captioning an image
Images from Karpathy
18
Images from Karpathy
19
[Figure: the network unrolled over time, with inputs X(t) and outputs Y(t)]
20
[Figure: the unrolled network with a DIVERGENCE computed between its outputs Y(t) and the desired outputs]
21
The total divergence is a weighted sum of the divergences at the individual outputs; the per-time weight is typically set to 1.0
– The derivative of the divergence with respect to each output Y(t) is further backpropagated to update weights etc.
[Figure: the DIVERGENCE feeding gradients back into each output Y(t)]
22
[Figure: the per-output divergence terms at each Y(t)]
23
Typical divergence for classification: the cross-entropy Xent(Ydesired(t), Y(t)) = −Σ_i Ydesired(t, i) log Y(t, i)
Images from Karpathy
24
Images from Karpathy
25
– The correct output at each time may depend on both past and future words in the sentence
26
[Figure: part-of-speech tagging example – the words "two roads diverged in a yellow wood" are tagged CD NNS VBD IN DT JJ NN by a recurrent network with initial state h(-1)]
27
[Figure: a bidirectional recurrent network over inputs X(0) … X(T) producing outputs Y(0) … Y(T)]
Bidirectional processing (a minimal sketch follows below):
– Process input left to right using a forward net
– Process it right to left using a backward net
– The combined outputs are time-synchronous, one per input time, and are passed up to the next layer
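A minimal numpy sketch of this (toy dimensions and random weights, purely illustrative – not the lecture's setup):

import numpy as np

rng = np.random.default_rng(0)
T, dx, dh = 6, 4, 3
X = rng.standard_normal((T, dx))

Wf_x, Wf_h = rng.standard_normal((dh, dx)), rng.standard_normal((dh, dh))   # forward net
Wb_x, Wb_h = rng.standard_normal((dh, dx)), rng.standard_normal((dh, dh))   # backward net

hf, hb = np.zeros(dh), np.zeros(dh)      # hf(-1) and hb(inf), both initialized to 0
fwd, bwd = [], [None] * T
for t in range(T):                       # process input left to right
    hf = np.tanh(Wf_x @ X[t] + Wf_h @ hf)
    fwd.append(hf)
for t in reversed(range(T)):             # process it right to left
    hb = np.tanh(Wb_x @ X[t] + Wb_h @ hb)
    bwd[t] = hb

# Combined outputs: time-synchronous, one per input time, passed up to the next layer
H = np.stack([np.concatenate([fwd[t], bwd[t]]) for t in range(T)])
print(H.shape)                           # (T, 2*dh)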
– The rest of the discussion assumes unidirectional networks, but the discussion generalizes to bidirectional ones
28
[Figure: the bidirectional network with forward states hf(-1) … hf(T) and backward states hb(0) … hb(T) over inputs X(0) … X(T)]
The simple case first: networks with input sequences and output sequences of equal length, with one-to-one correspondence
[Figure: the time-synchronous network unrolled over inputs X(0) … X(T) and outputs Y(0) … Y(T), with initial state h(-1)]
29
Training: run the network forward over the entire input sequence to generate outputs at every time, then compute the divergence against the desired outputs
[Figure: the unrolled network over X(0) … X(T) producing Y(0) … Y(T)]
30
The divergence gradients are then backpropagated through the unrolled network
– Back Propagation Through Time (BPTT); a minimal worked sketch follows below
[Figure: gradients flowing backward through the unrolled network]
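A minimal numpy sketch of BPTT for one simple recurrent layer. The tanh recurrence, linear output layer, squared-error divergence and all sizes below are illustrative assumptions, not the lecture's exact setup.

import numpy as np

rng = np.random.default_rng(0)
T, dx, dh, dy = 8, 3, 5, 2                 # sequence length and layer sizes
Wxh = rng.standard_normal((dh, dx)) * 0.1  # input -> hidden
Whh = rng.standard_normal((dh, dh)) * 0.1  # hidden -> hidden (recurrence)
Why = rng.standard_normal((dy, dh)) * 0.1  # hidden -> output
X = rng.standard_normal((T, dx))
Ydes = rng.standard_normal((T, dy))        # desired outputs, one per time

# Forward pass: unroll over time, keeping every hidden state
h = np.zeros((T + 1, dh))                  # h[0] plays the role of h(-1)
Y = np.zeros((T, dy))
for t in range(T):
    h[t + 1] = np.tanh(Wxh @ X[t] + Whh @ h[t])
    Y[t] = Why @ h[t + 1]
DIV = 0.5 * np.sum((Y - Ydes) ** 2)        # divergence summed over all times

# Backward pass (BPTT): walk the unrolled graph from t = T-1 down to 0
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros(dh)                     # gradient arriving from time t+1
for t in reversed(range(T)):
    dY = Y[t] - Ydes[t]                    # d DIV / d Y(t)
    dWhy += np.outer(dY, h[t + 1])
    dh_t = Why.T @ dY + dh_next            # contributions from the output and from the future
    dz = (1.0 - h[t + 1] ** 2) * dh_t      # back through the tanh
    dWxh += np.outer(dz, X[t])
    dWhh += np.outer(dz, h[t])
    dh_next = Whh.T @ dz                   # passed back to time t-1
# dWxh, dWhh, dWhy now hold the gradients of DIV w.r.t. the shared weights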
31
[Figure: the unrolled network maps X(0) … X(T) to Y(0) … Y(T); a total divergence DIV over times 1..T is computed from the outputs and its derivatives are passed back into the network]
35
The divergence is computed over the entire sequence of outputs; commonly it is simply the sum of the divergences at the individual instants
[Figure: the unrolled network with per-instant divergence terms feeding the total DIVERGENCE]
37
Typical divergence for classification: the cross-entropy Xent(Ydesired(t), Y(t)) = −Σ_i Ydesired(t, i) log Y(t, i)
Example: language modelling. The input is a sequence of characters
– Or, at a higher level, words
– Actually "embeddings" of one-hot vectors
The output at each time is a probability distribution over the characters
– Must ideally peak at the target character
Figure from Andrej Karpathy. Input: Sequence of characters (presented as one-hot vectors). Target output after observing “h e l l” is “o”
39
Y(t, i) = P(W(t+1) = V_i | W(1) … W(t)), where V_i is the i-th symbol in the vocabulary
Div(Ytarget(1 … T), Y(1 … T)) = Σ_t KL(Ytarget(t), Y(t))
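A small numeric illustration (toy vocabulary size and random predictions, not from the lecture): with one-hot targets, each per-time term reduces to the negative log probability assigned to the correct next word.

import numpy as np

V, T = 5, 4                                   # vocabulary size, sequence length
logits = np.random.default_rng(1).standard_normal((T, V))
Y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax outputs Y(t)
targets = np.array([2, 0, 3, 1])              # index of the correct next word at each time

# For one-hot targets, KL(Ytarget(t), Y(t)) = -log Y(t, target_t)
div = -np.sum(np.log(Y[np.arange(T), targets]))
print(div)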
[Figure: at each time the network outputs a probability distribution over the vocabulary; the divergence pushes this distribution toward assigning high probability to the correct next word]
40
41
Examples: predict the next word in "Four score and seven years ???", or the next character in "A B R A H A M L I N C O L ??"
42
Representing words as one-hot vectors (a small sketch follows below):
– Pre-specify a vocabulary of N words in a fixed (e.g. lexical) order
– Represent each word by an N-dimensional vector with N−1 zeros and a single 1 (in the position of the word in the ordered list of words)
Characters can be represented the same way
– English will require about 100 characters, to include both cases, special characters such as commas, hyphens, and apostrophes, and the space character
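A minimal sketch of the one-hot representation, using a made-up toy vocabulary (the sentence is the tagging example used earlier in the deck):

import numpy as np

vocab = sorted(["a", "diverged", "in", "roads", "two", "wood", "yellow"])   # fixed lexical order
word_index = {w: i for i, w in enumerate(vocab)}
N = len(vocab)

def one_hot(word):
    v = np.zeros(N)
    v[word_index[word]] = 1.0      # a single 1 at the word's position in the ordered list
    return v

sentence = ["two", "roads", "diverged", "in", "a", "yellow", "wood"]
X = np.stack([one_hot(w) for w in sentence])   # T x N matrix, one one-hot row per word
print(X.shape)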
43
[Figure: at each time the network input is an N×1 one-hot vector]
The one-hot vectors for all the words lie on the corners of an N-dimensional unit cube
– Actual volume of space used = 0
– Density of points: vanishingly small (only N points in an N-dimensional space)
[Figure: for N = 3, the words sit at (1,0,0), (0,1,0) and (0,0,1)]
46
The one-hot representation makes no assumptions about the relative importance of words or the relationships between them
– All word vectors are the same length
– The distance between every pair of words is the same
[Figure: the points (1,0,0), (0,1,0) and (0,0,1) are mutually equidistant]
47
Solution: project the word vectors onto a lower-dimensional plane
– Or more generally, a linear transform into a lower-dimensional subspace
– The volume used is still 0, but density can go up by many orders of magnitude
– If properly learned, the distances between projected points will capture semantic relations between the words
[Figure: the corners (1,0,0), (0,1,0), (0,0,1) projected onto a plane]
48
49
Embeddings: replace every one-hot vector W by PW (a small sketch follows below)
– P is an M × N matrix
– PW is now an M-dimensional vector
– Learn P using an appropriate objective
[Figure: the N-dimensional one-hot inputs are multiplied by P before entering the network]
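A tiny sketch of what the projection does (toy sizes, random P): multiplying a one-hot vector by P simply selects the corresponding column of P, which is the word's M-dimensional embedding.

import numpy as np

N, M = 8, 3
P = np.random.default_rng(0).standard_normal((M, N))

w = np.zeros(N); w[5] = 1.0              # one-hot vector for word number 5
embedding = P @ w                        # M-dimensional projected representation
assert np.allclose(embedding, P[:, 5])   # same as looking up column 5 of P
print(embedding)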
50
The same projection P is applied to the input word at every time, i.e. the projection layer has tied weights
[Figure: the one-hot inputs passing through the shared projection layer]
An early example of learning such representations with a neural language model (a rough sketch follows below):
– "A neural probabilistic language model", Bengio et al. 2003
– Hidden layer has tanh() activation, output is softmax
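A rough sketch of that architecture with hypothetical sizes and random, untrained weights (not the paper's actual configuration): the previous k words are embedded by a shared matrix P, concatenated, passed through a tanh hidden layer, and a softmax over the vocabulary predicts the next word.

import numpy as np

rng = np.random.default_rng(0)
N, M, H, k = 1000, 64, 128, 3                # vocabulary, embedding dim, hidden dim, context length
P = rng.standard_normal((M, N)) * 0.01       # shared (tied) embedding matrix
W1 = rng.standard_normal((H, k * M)) * 0.01  # concatenated context -> hidden
W2 = rng.standard_normal((N, H)) * 0.01      # hidden -> vocabulary logits

def next_word_distribution(context_ids):
    # context_ids: the k previous word indices, oldest first
    embeds = np.concatenate([P[:, i] for i in context_ids])
    h = np.tanh(W1 @ embeds)
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()                       # softmax over the N words

print(next_word_distribution([12, 7, 845]).shape)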
Simpler networks can also be used to learn the word representations
[Figure: word2vec-style model – several context words are each projected by the same matrix P (shared parameters), mean pooled, and used to predict the current word]
53
– "Distributed Representations of Words and Phrases and their Compositionality", Mikolov et al. 2013
54
Training a model to generate text:
– No explicit labels in the training data: at each time the next word is the label
– Inputs are one-hot vectors (or their embeddings)
– The network outputs an N-valued probability distribution rather than a one-hot vector
Generating text with the trained model (see the sketch below):
– Draw the next word from the output probability distribution
– Set it as the next word in the series, and feed it back as the next input
– Continue until a desired length is reached; in some cases, e.g. generating programs, there may be a natural termination
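A toy, self-contained sketch of this loop. The recurrent step below is randomly initialized and untrained; it only illustrates the procedure of drawing a word from the output distribution and feeding it back as the next input.

import numpy as np

rng = np.random.default_rng(0)
N, H = 50, 16                                   # toy vocabulary and state sizes
Wxh = rng.standard_normal((H, N)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((N, H)) * 0.1

def step(word_id, h):
    x = np.zeros(N); x[word_id] = 1.0           # one-hot input
    h = np.tanh(Wxh @ x + Whh @ h)
    logits = Why @ h
    p = np.exp(logits - logits.max())
    return p / p.sum(), h                       # distribution over the next word, new state

h = np.zeros(H)
word = 0                                        # seed word index
generated = [word]
for _ in range(20):                             # or stop at an end-of-sequence symbol
    p, h = step(word, h)
    word = int(rng.choice(N, p=p))              # draw the next word from the distribution
    generated.append(word)                      # ...and set it as the next word in the series
print(generated)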
Example: a character-level model (it predicts character sequences) trained on Linux source code generates plausible-looking code
60
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
61
62
– Sequence input, single output: e.g. phoneme recognition (classify a segment of speech as a phoneme)
– Sequence input, sequence output, where the exact location of each output is unknown a priori: e.g. speech recognition
63
64
Example: recognizing the phoneme spoken in a segment of speech
– The output is represented as an N-dimensional output probability vector, where N is the number of phonemes
[Figure: a segment of speech input and the network's phoneme probability output]
65
66
The simplest approach: a single output for the entire input
– Output generated when the last input vector is processed
– We only read it at the end of the sequence
67
[Figure: a short example in which the divergence Div is computed only at the final output Y(2)]
68
Shortcoming: this pretends there is no useful information in the earlier outputs
[Figure: the same example, with the intermediate outputs unused]
69
Fix: use the earlier outputs as well
– These too must ideally point to the correct phoneme /AH/
[Figure: divergence terms Div attached to the intermediate outputs as well]
70
The per-output divergences can be weighted
– Only the weight on the final output is high; other weights are 0 or low
[Figure: per-time divergence terms, each ideally pointing to the correct phoneme /AH/]
71
[Figure: the example with divergence terms at multiple outputs]
– Sequence input, single output: e.g. phoneme recognition
– Sequence input, sequence output, where the exact location of each output is unknown a priori: e.g. speech recognition
72
If we knew which part of the input corresponds to which output symbol, the problem would be easy
– This is just a simple concatenation of many copies of the simple "output at the end of the input sequence" model we just saw
[Figure: the input divided into segments, each producing one symbol of the output sequence /AH/ /T/]
74
[Figure: a table of per-frame output probabilities, one row per symbol (/AH/ /B/ /D/ /EH/ /IY/ /F/ /G/) and one column per time]
76
The simplest decode: at each time, pick the symbol with the highest probability in the table
[Figure: the probability table with the greedy, frame-by-frame selection highlighted]
– Merge adjacent repeated symbols, and place the actual emission there (a small sketch follows below)
[Figure: the merged output sequence, e.g. /F/ /IY/ /D/]
Problems with this simple decode:
– It cannot distinguish between an extended symbol and repetitions of the symbol (e.g. a long /F/ vs. /F/ /F/)
– The resulting sequence may be meaningless (what word is "GFIYD"?)
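A small sketch of this decode on a made-up probability table: pick the most probable symbol in every frame, then collapse runs of repeated symbols.

import numpy as np

symbols = ["/AH/", "/B/", "/D/", "/EH/", "/IY/", "/F/", "/G/"]
probs = np.random.default_rng(0).random((7, 10))        # rows: symbols, columns: time frames
probs /= probs.sum(axis=0, keepdims=True)

framewise = [symbols[i] for i in probs.argmax(axis=0)]  # greedy symbol choice per frame

def merge_repeats(seq):
    out = []
    for s in seq:
        if not out or s != out[-1]:     # keep a symbol only when it differs from the previous one
            out.append(s)
    return out

print(framewise)
print(merge_repeats(framewise))   # note: an extended /F/ and a genuine /F//F/ collapse identically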
External restrictions can be imposed on what output sequences are allowed:
– E.g. only allow sequences corresponding to dictionary words
– E.g. sub-symbol units (like in HW1 – what were they?)
– E.g. using special "separating" symbols to separate repetitions
[Figure: the per-frame probability table over the symbol set]
The greedy decode finds the most likely time-synchronous output sequence
– Which is not necessarily the most likely order-synchronous sequence
– We will return to this topic later
85
[Figure: the probability table for an example whose target order-synchronous output is /AH/ /T/]
86
Training requires knowing the alignment of the target symbols to the input
– Which portion of the input aligns to what symbol
– This is extra information that the training data does not provide
88
[Figure: given an alignment, every input frame is assigned its aligned symbol (/AH/ or /T/) and a per-frame divergence Div is computed against that symbol]
91
In practice we are only given the sequence of target symbols
– But no indication of which one occurs where
– So what divergence do we minimize, and how do we compute its gradient w.r.t. the outputs?
93
Solution: begin with an initial guess of the alignment
– Either randomly, based on some heuristic, or any other rationale
Then iterate:
– Train the network using the current alignment
– Re-estimate the alignment for each training instance
96
[Figure: example expansions of a symbol sequence, e.g. /B//B//B//AH//AH//AH//T/ – the target symbols repeated, in order, to span the length of the input]
Setting up the alignment problem. We are given:
– The unaligned N-length symbol sequence S (e.g. /B/ /IY/ /F/ /IY/)
– A T-length input
– And a (trained) recurrent network
We want:
– A T-length expansion of S, comprising the symbols in S in strict order
97
98
99
The unconstrained decode need not be an expansion of the target sequence
– It is conditioned only on the input sequence
– E.g. the unconstrained decode may be /AH//AH//AH//D//D//AH//F//IY//IY/
– Whereas we want an expansion of /B//IY//F//IY/
[Figure: the per-frame probability table over the symbol set]
100
[Figure: the full table of per-frame probabilities for all symbols /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/]
101
[Figure: from the full table, the rows for the target symbols /B/, /IY/, /F/, /IY/ are selected]
102
Compute the entire output (for all symbols) at every time, then copy the output values for the target symbols into the secondary reduced structure: one row per symbol of the target sequence, in order
103
104
105
Arrange the constructed table so that, from top to bottom, it has the exact sequence of symbols required
[Figure: the reduced table with rows /B/, /IY/, /F/, /IY/ from top to bottom]
# N is the number of symbols in the target output
# S(i) is the ith symbol in target output
# T = length of input
# First create output table
for i = 1:N
    s(1:T,i) = y(1:T, S(i))
107
[Figure: the reduced table with rows /B/, /IY/, /F/, /IY/, viewed as a trellis over time]
A valid alignment is a path through this block that travels down from top left to bottom right
– I.e. the symbol chosen at any time is at the same level or at the next level to the symbol at the previous time
– And the path passes through every symbol of the target sequence – /B/ /IY/ /F/ /IY/ in this case
108
Every path from the top-left node to the bottom-right sink represents a valid alignment
– Which maps on to the target symbol sequence (/B//IY//F//IY/)
The score of a path is the product of the probabilities output by the neural network along the path
[Figure: the trellis over /B/, /IY/, /F/, /IY/ with one such path highlighted]
110
There are many such paths. Challenge: find the path with the highest score (probability)
112
[Figure: the trellis over /B/, /IY/, /F/, /IY/]
Dynamic programming: the best path to any node must be an extension of the best path to one of its parent nodes, so the best overall path can be built up one time step at a time
– For each node, keep track of the best path score and the best parent edge
– Bscr(t, l) := best-path score to the node for symbol l at the t-th time (given inputs up to time t)
The recursion, evaluated column by column from left to right:
– BP(t, l) := best parent = l − 1 if Bscr(t−1, l−1) > Bscr(t−1, l); else l
– Bscr(t, l) = Bscr(t−1, BP(t, l)) · y(t, S(l))
– After the final time step, backtrace from the bottom-right node using the BP pointers to read off the best alignment
[Figure sequence: the trellis over /B/, /IY/, /F/, /IY/ being filled in column by column, followed by the backtrace]
Using 1..N and 1..T indexing, instead of 0..N-1, 0..T-1, for convenience of notation:

# N is the number of symbols in the target output
# S(i) is the ith symbol in target output
# T = length of input
# First create output table
for i = 1:N
    s(1:T,i) = y(1:T, S(i))

# Now run the Viterbi algorithm
# First, at t = 1
BP(1,1) = -1
Bscr(1,1) = s(1,1)
Bscr(1,2:N) = -infty

for t = 2:T
    BP(t,1) = 1
    Bscr(t,1) = Bscr(t-1,1)*s(t,1)
    for i = 2:min(t,N)
        BP(t,i) = Bscr(t-1,i) > Bscr(t-1,i-1) ? i : i-1
        Bscr(t,i) = Bscr(t-1,BP(t,i))*s(t,i)

# Backtrace
AlignedSymbol(T) = N
for t = T downto 2
    AlignedSymbol(t-1) = BP(t,AlignedSymbol(t))
132
Using 1..N and 1..T indexing, instead of 0..N-1, 0..T-1, for convenience of notation. We do not need the explicit construction of the output table: the information about order is already in the symbol sequence S(i), so we can use y(t,S(i)) directly instead of composing s(t,i) = y(t,S(i)) and using s(t,i).
# N is the number of symbols in the target output
# S(i) is the ith symbol in target output
# T = length of input
# First, at t = 1
BP(1,1) = -1
Bscr(1,1) = y(1,S(1))
Bscr(1,2:N) = -infty

for t = 2:T
    BP(t,1) = 1
    Bscr(t,1) = Bscr(t-1,1)*y(t,S(1))
    for i = 2:min(t,N)
        BP(t,i) = Bscr(t-1,i) > Bscr(t-1,i-1) ? i : i-1
        Bscr(t,i) = Bscr(t-1,BP(t,i))*y(t,S(i))

# Backtrace
AlignedSymbol(T) = N
for t = T downto 2
    AlignedSymbol(t-1) = BP(t,AlignedSymbol(t))
133
The same algorithm, using 1..N and 1..T indexing for convenience of notation, without explicit construction of the output table.
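For reference, a runnable Python/numpy rendering of the same algorithm (log scores are used instead of products purely for numerical convenience; the table y and the target sequence S below are toy values):

import numpy as np

def best_alignment(y, S):
    # y: T x V per-frame symbol probabilities; S: target symbol indices, in order
    T, N = y.shape[0], len(S)
    Bscr = np.full((T, N), -np.inf)            # best-path log scores
    BP = np.zeros((T, N), dtype=int)           # best-parent pointers
    logp = np.log(y[:, S])                     # log y(t, S(i)), shape T x N
    Bscr[0, 0] = logp[0, 0]
    for t in range(1, T):
        Bscr[t, 0] = Bscr[t - 1, 0] + logp[t, 0]
        for i in range(1, min(t + 1, N)):
            BP[t, i] = i if Bscr[t - 1, i] > Bscr[t - 1, i - 1] else i - 1
            Bscr[t, i] = Bscr[t - 1, BP[t, i]] + logp[t, i]
    aligned = [N - 1]                          # backtrace from the bottom-right node
    for t in range(T - 1, 0, -1):
        aligned.append(BP[t, aligned[-1]])
    return [S[i] for i in reversed(aligned)]   # aligned symbol index for every frame

y = np.random.default_rng(0).random((10, 7))
y /= y.sum(axis=1, keepdims=True)              # toy per-frame probabilities over 7 symbols
print(best_alignment(y, S=[1, 4, 5, 4]))       # e.g. the target sequence /B/ /IY/ /F/ /IY/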
134
[Figure: with the estimated alignment, each frame is assigned its aligned symbol and a per-frame divergence Div is computed against it, which is then used to train the network]
135
The overall training procedure (a schematic sketch follows below):
– Initialize alignments
– Train the model with the given alignments
– Decode to obtain new alignments, and iterate
– The "decode" and "train" steps may be combined into a single "decode, find alignment, compute derivatives" step for SGD and mini-batch updates
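A self-contained toy sketch of this loop. For brevity the "model" here is just a per-frame linear-softmax classifier rather than a recurrent network (an assumption for illustration only), and it reuses the best_alignment routine sketched above; all sizes and data are made up.

import numpy as np

rng = np.random.default_rng(0)
V, D, T = 7, 5, 12                       # number of symbols, input dim, frames per utterance
W = rng.standard_normal((V, D)) * 0.01   # toy per-frame classifier weights (stand-in for the RNN)

def forward(X):                          # X: T x D -> per-frame probabilities over the symbols
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def initial_alignment(T, S):             # initialize: uniform split of the frames across the symbols
    return [S[min(int(len(S) * t / T), len(S) - 1)] for t in range(T)]

X = rng.standard_normal((T, D))          # one toy training utterance
S = [1, 4, 5, 4]                         # its target symbol sequence, e.g. /B//IY//F//IY/

align = initial_alignment(T, S)
for epoch in range(20):
    Y = forward(X)
    # Train with the current alignment: one per-frame cross-entropy gradient step
    G = Y.copy()
    G[np.arange(T), align] -= 1.0        # d Xent / d logits for the aligned target symbols
    W -= 0.1 * (G.T @ X)
    # Re-estimate the alignment with the updated model (decode step)
    align = best_alignment(forward(X), S)
print(align)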
137
138
139