Deep Learning
Recurrent Networks : 1 Spring 2020
Instructor: Bhiksha Raj
Motivating examples (from “The Unreasonable Effectiveness of Recurrent Neural Networks”): RNN-generated source code from an open-source project, generated mathematics, and a generated Wikipedia-style page.
Sakyamuni (1575-1611) was a popular religious figure in India and around the world. This Bodhisattva Buddha was said to have passed his life peacefully and joyfully, without passion and anger. For over twenty years he lived as a lay man and dedicated himself toward the welfare, prosperity, and welfare of others. Among the many spiritual and philosophical teachings he wrote, three are most important; the first, titled the "Three Treatises of Avalokiteśvara"; the second, the teachings of the "Ten Questions;" and the third, "The Eightfold Path of Discipline."
– Entirely randomly generated
– E.g. analyze a series of spectral vectors (speech), determine what was said
“To be” or not “to be”??
– E.g. analyze a document, identify its topic
– E.g. read English, output French
“Football” or “basketball”?
The Steelers, meanwhile, continue to struggle to make stops on … shown no signs of improving anytime soon.
– Should I invest, vs. should I not invest in X?
– Decision must be taken considering how things have fared over time
[Figure: stock-price series over dates 7/03–15/03 — to invest or not to invest?]
– The output layer may also have many neurons
– Each box actually represents an entire layer with many units
[Figure: a finite-window MLP slides over the input series X(t)…X(t+7), producing Y(t+3), Y(t+4), …, Y(t+6) as the window advances in time]
– Something that happens today only affects the output of the system for a limited number of days into the future
– Predictions consider only N days of history
– To consider more, we must increase the “history” (window size) considered by the system
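A finite-window predictor of this kind can be sketched in a few lines of NumPy. This is a hypothetical toy (window length, dimensions, and weight names are illustrative, not from the lecture): an MLP that sees only the last N input vectors and produces one prediction, slid along the series.

```python
import numpy as np

# Toy finite-window MLP: predict Y(t) from the window X(t-N+1)..X(t).
# All sizes and weights are illustrative.
rng = np.random.default_rng(0)
N, d, hidden = 4, 3, 8          # window length, input dim, hidden width

W1 = rng.standard_normal((hidden, N * d)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.standard_normal((1, hidden)) * 0.1
b2 = np.zeros(1)

def predict(window):            # window: (N, d) array of X(t-N+1)..X(t)
    z = W1 @ window.reshape(-1) + b1
    return (W2 @ np.tanh(z) + b2)[0]

X = rng.standard_normal((10, d))      # a toy series of 10 input vectors
# Slide the window across the series: each prediction sees only N past inputs
preds = [predict(X[t - N + 1 : t + 1]) for t in range(N - 1, 10)]
```

Anything that happened more than N steps ago is invisible to this predictor; widening its memory requires widening the window (and the network).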
– Required: define an initial state Y(-1)
– An input X(0) at t = 0 produces Y(0)
– Y(0) and X(1) produce Y(1), which produces Y(2), and so on, even if X(t) = 0 for t > 0
– A single input influences the output for the rest of time!
– This is a “nonlinear autoregressive network with exogenous inputs” (NARX)
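The recursion can be demonstrated with a minimal NumPy sketch, assuming an output-feedback step of the form Y(t) = tanh(wc·X(t) + Wr·Y(t-1)). The weights here are made-up illustrative values, not the lecture's network:

```python
import numpy as np

# Toy output-feedback recursion: the current output depends on the current
# input and the fed-back previous output. Illustrative weights only.
wc = np.array([0.5, 0.3])                 # input -> output weights
Wr = np.array([[0.6, -0.2], [0.1, 0.7]])  # output -> output (feedback)
Y = np.zeros(2)                           # defined initial state Y(-1)

X = [1.0, 0.0, 0.0, 0.0]  # a single nonzero input at t = 0, then silence
outputs = []
for t in range(4):
    Y = np.tanh(wc * X[t] + Wr @ Y)
    outputs.append(Y.copy())

# Even though X(t) = 0 for t > 0, the effect of X(0) persists in Y(t).
```

The final output is still nonzero: the single input at t = 0 keeps echoing through the feedback loop for all later times.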
[Figure: the NARX network unrolled over time — at each step the input X(t) and the fed-back previous output Y(t-1) produce Y(t). Brown boxes show output nodes; all outgoing arrows from a box carry the same output.]
– Generally stored in a “memory” unit
– Used to “remember” the past
– “Serial order: A parallel distributed processing approach”, M. I. Jordan, 1986
– Memory has fixed structure; does not “learn” to remember
[Figure: Jordan network — the outputs Y(t), Y(t+1) are accumulated into a memory unit through fixed weights, including a self-loop of weight 1]
– “Context” units that carry historical state
– “Finding structure in time”, Jeffrey Elman, Cognitive Science, 1990
– But during training no gradient is backpropagated over the “1” link
[Figure: Elman network — the hidden state at each time is cloned into a context unit that feeds the next time step]
– Looking at a finite horizon of past inputs gives us a convolutional network
– May instead feed back a finite horizon of outputs
– Jordan networks maintain a running average of outputs in a “memory” unit
– Elman networks store hidden unit values for one time instant in a “context” unit
– “Simple” (or partially recurrent) because during learning the current error does not actually propagate to the past
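An Elman-style step is easy to sketch. In this hypothetical NumPy toy (all names and weights illustrative), the “context” unit is simply a clone of the previous hidden state, fed back in through ordinary weights; the cloning link itself carries no learning signal:

```python
import numpy as np

# Toy Elman-style partially recurrent step: the context unit is a cloned
# copy of the previous hidden state. Illustrative sizes and weights.
rng = np.random.default_rng(1)
d, H = 3, 5
Wx = rng.standard_normal((H, d)) * 0.1        # input -> hidden
Wcontext = rng.standard_normal((H, H)) * 0.1  # context -> hidden

context = np.zeros(H)        # cloned hidden state from the previous step
outputs = []
for x in rng.standard_normal((6, d)):
    h = np.tanh(Wx @ x + Wcontext @ context)
    context = h.copy()       # the clone: a fixed copy, not a learned link
    outputs.append(h)
```

During training, gradients would flow into Wx and Wcontext at each step, but not backward through the `h.copy()` link into earlier time steps — which is exactly why these networks are only “partially” recurrent.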
[Figure: a fully recurrent (state-space) network unrolled over time — the input X(t) and the previous hidden state (initialized to h(-1) at t = 0) jointly produce the hidden state at t and the output Y(t); the figure distinguishes recurrent (state-to-state) weights from current (input) weights]
– E.g. text classification
(Images from Karpathy)
– Simple recurrent networks use “context” or “memory” units to retain some information about the past
– But during learning the current error does not influence the past
– State-space models retain information about the past through recurrent hidden states
– These are “fully recurrent” networks
– The initial values of the hidden states are generally learnable parameters as well
– Given training instances: pairs of an input sequence X(0), …, X(T) and the corresponding desired output sequence
– Train the network to minimize the divergence between the actual and desired outputs
– This is the most generic setting. In other settings we just “remove” some of the inputs or outputs
[Figure: the network unrolled over time — inputs X(0), X(1), …, X(T), initial state h(-1), outputs Y(0), Y(1), …, Y(T)]
– Forward pass: process the input sequence in time order to generate the outputs
# Assuming h(-1,*) is known
# Assuming L hidden-state layers and an output layer
# Wc(*) and Wr(*) are matrices, b(*) are vectors
# Wc are weights for inputs from the current time
# Wr are recurrent weights applied to the previous time
# Wo are output-layer weights
for t = 0:T-1  # Including both ends of the index
    h(t,0) = x(t)  # Vectors. Initialize h(t,0) to the input
    for l = 1:L  # hidden layers operate at time t
        z(t,l) = Wc(l)h(t,l-1) + Wr(l)h(t-1,l) + b(l)
        h(t,l) = tanh(z(t,l))  # Assuming tanh activation
    zo(t) = Wo h(t,L) + bo
    Y(t) = softmax( zo(t) )
Subscript “c” – current Subscript “r” – recurrent
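The pseudocode can be made concrete. Below is a minimal runnable NumPy version for a single hidden layer (L = 1); the dimensions and random weights are illustrative, not from the lecture:

```python
import numpy as np

# Runnable forward pass for a one-hidden-layer RNN with a softmax output.
# Illustrative sizes and weights.
rng = np.random.default_rng(2)
d, H, K, T = 3, 4, 2, 5      # input dim, hidden dim, output dim, length

Wc = rng.standard_normal((H, d)) * 0.1   # current-input weights
Wr = rng.standard_normal((H, H)) * 0.1   # recurrent weights
b  = np.zeros(H)
Wo = rng.standard_normal((K, H)) * 0.1   # output-layer weights
bo = np.zeros(K)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(X, h_init):
    h, Y = h_init, []
    for t in range(X.shape[0]):          # forward in time
        z = Wc @ X[t] + Wr @ h + b       # current input + previous state
        h = np.tanh(z)
        Y.append(softmax(Wo @ h + bo))
    return np.array(Y)

Y = rnn_forward(rng.standard_normal((T, d)), np.zeros(H))
```

Each output row is a probability distribution over the K classes; the hidden state carries information forward across all T steps.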
– Back Propagation Through Time
[Figure: the unrolled network with outputs Y(0)…Y(T) and pre-activations Z(0)…Z(T), used for the backprop derivation]
We will only focus on one training instance. All subscripts represent vector components, not the training-instance index.
– Assuming only one hidden layer in this example
First step of backprop: compute dDIV/dY(t).
Note: DIV is a function of all outputs Y(0) … Y(T). In general we will be required to compute dDIV/dY(t) for every t, as we will see. This can be a source of significant difficulty in many scenarios.
Must compute dDIV/dY(t) for every t.
Special case, when the overall divergence is a simple sum of local divergences at each time: DIV = Σ_t Div(Y(t), D(t)), in which case dDIV/dY(t) = dDiv(Y(t), D(t))/dY(t), the derivative of the local divergence at t alone.
Backprop then proceeds backward through time via the chain rule. In terms of the quantities in the forward-pass pseudocode:
– dDIV/dZo(t) = dDIV/dY(t) · Jacobian(Y(t), Zo(t)), through the (vector) output activation, e.g. the softmax
– dDIV/dh(t) = dDIV/dZo(t) · Wo + dDIV/dZ(t+1) · Wr — the hidden state at t influences both the output at t and the hidden state at t+1
– dDIV/dZ(t) = dDIV/dh(t) · Jacobian(h(t), Z(t)), through the tanh
– The weight derivatives pool contributions over all time steps: dDIV/dWo += h(t) dDIV/dZo(t), dDIV/dWc += x(t) dDIV/dZ(t), dDIV/dWr += h(t-1) dDIV/dZ(t)
Continue computing derivatives going backward through time until t = 0, which also yields the derivative with respect to the initial state h(-1).
# Assuming the forward pass has been completed
# Jacobian(x,y) is the Jacobian of x w.r.t. y
# Assuming dY(t) = gradient(div,Y(t)) is available for all t
# Assuming all dz, dh, dW and db are initialized to 0
for t = T-1:downto:0  # Backward through time
    dzo(t) = dY(t)Jacobian(Y(t),zo(t))
    dWo += h(t,L)dzo(t)
    dbo += dzo(t)
    dh(t,L) += dzo(t)Wo
    for l = L:1  # Reverse through layers
        dz(t,l) = dh(t,l)Jacobian(h(t,l),z(t,l))
        dh(t,l-1) += dz(t,l)Wc(l)
        dh(t-1,l) += dz(t,l)Wr(l)
        dWc(l) += h(t,l-1)dz(t,l)
        dWr(l) += h(t-1,l)dz(t,l)
        db(l) += dz(t,l)
Subscript “c” – current Subscript “r” – recurrent
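As a sanity check, here is a runnable NumPy sketch of BPTT for a single-layer net with a scalar linear output and a squared-error divergence (an illustrative choice, not the lecture's softmax setup), with the recurrent-weight gradient verified against a finite difference:

```python
import numpy as np

# BPTT for a toy one-layer RNN: h(t) = tanh(Wc x(t) + Wr h(t-1)),
# y(t) = Wo h(t), DIV = 0.5 * sum_t (y(t) - d(t))^2. Illustrative weights.
rng = np.random.default_rng(3)
d, H, T = 2, 3, 4
Wc = rng.standard_normal((H, d)) * 0.5
Wr = rng.standard_normal((H, H)) * 0.5
Wo = rng.standard_normal((1, H)) * 0.5
X = rng.standard_normal((T, d))
D = rng.standard_normal(T)          # desired outputs

def forward(Wr):
    h, hs, ys = np.zeros(H), [], []
    for t in range(T):
        h = np.tanh(Wc @ X[t] + Wr @ h)
        hs.append(h); ys.append((Wo @ h)[0])
    return hs, ys

def loss(Wr):
    _, ys = forward(Wr)
    return 0.5 * sum((y - dt) ** 2 for y, dt in zip(ys, D))

# Backward through time: dh(t) collects a contribution from the output at
# time t and, via the recurrent weights, from the state at time t+1.
hs, ys = forward(Wr)
dWr = np.zeros_like(Wr)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dh = (ys[t] - D[t]) * Wo[0] + dh_next
    dz = dh * (1 - hs[t] ** 2)            # through the tanh
    h_prev = hs[t - 1] if t > 0 else np.zeros(H)
    dWr += np.outer(dz, h_prev)           # pooled over time
    dh_next = Wr.T @ dz

# Finite-difference check on one entry of Wr
eps = 1e-6
Wp = Wr.copy(); Wp[0, 0] += eps
Wm = Wr.copy(); Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
```

The analytic entry `dWr[0, 0]` should agree with `numeric` to several decimal places, confirming that the backward recursion implements the chain rule correctly.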
– Explicitly models the fact that just as the future can be predicted from the past, the past can be deduced from the future
– Proposed by Schuster and Paliwal, 1997
[Figure: bidirectional RNN — a forward net with initial state hf(-1) processes X(0)…X(T) left to right, a backward net with initial state hb(inf) processes the same inputs right to left, and both feed the outputs Y(0)…Y(T)]
– The forward net processes the input left to right, initially computing only the hidden states
– Initially only the hidden state values are computed
– Note: this is not the backward pass of backprop. The backward net simply processes the input backward, from t = T down to t = 0
– The output at each time combines the hidden states of the forward and backward nets
# Subscript f represents the forward net, b the backward net
# Assuming hf(-1,*) and hb(inf,*) are known
# forward net
for t = 0:T-1  # Going forward in time
    hf(t,0) = x(t)  # Vectors. Initialize hf(t,0) to the input
    for l = 1:Lf  # Lf is the depth of the forward network
        zf(t,l) = Wfc(l)hf(t,l-1) + Wfr(l)hf(t-1,l) + bf(l)
        hf(t,l) = tanh(zf(t,l))  # Assuming tanh activation
# backward net
hb(T,:) = hb(inf,:)  # Just the initial value
for t = T-1:downto:0  # Going backward in time
    hb(t,0) = x(t)  # Vectors. Initialize hb(t,0) to the input
    for l = 1:Lb  # Lb is the depth of the backward network
        zb(t,l) = Wbc(l)hb(t,l-1) + Wbr(l)hb(t+1,l) + bb(l)
        hb(t,l) = tanh(zb(t,l))  # Assuming tanh activation
# the output combines forward and backward states
for t = 0:T-1
    zo(t) = Wfohf(t,Lf) + Wbohb(t,Lb) + bo
    Y(t) = softmax( zo(t) )
# Inputs:
#   L : Number of hidden layers
#   Wc, Wr, b : current weights, recurrent weights, biases
#   hinit : initial value of h (representing h(-1,*))
#   x : input vector sequence
#   T : Length of input vector sequence
# Output:
#   h, z : sequences of post- and pre-activation hidden
#          representations from all layers of the RNN
function [h,z] = RNN_forward(L, Wc, Wr, b, hinit, x, T)
    h(-1,:) = hinit  # hinit is the initial value for all layers
    for t = 0:T-1  # Going forward in time
        h(t,0) = x(t)  # Vectors. Initialize h(t,0) to the input
        for l = 1:L
            z(t,l) = Wc(l)h(t,l-1) + Wr(l)h(t-1,l) + b(l)
            h(t,l) = tanh(z(t,l))  # Assuming tanh activation
    return h,z
# Subscript f represents the forward net, b the backward net
# Assuming hf(-1,*) and hb(inf,*) are known
# forward pass
[hf, zf] = RNN_forward(Lf, Wfc, Wfr, bf, hf(-1,:), x, T)
# backward pass
xrev = fliplr(x)  # Flip the input in time
[hbrev, zbrev] = RNN_forward(Lb, Wbc, Wbr, bb, hb(inf,:), xrev, T)
hb = fliplr(hbrev)  # Flip back to straighten time
zb = fliplr(zbrev)
# combine the two for the output
for t = 0:T-1  # The output combines forward and backward
    zo(t) = Wfohf(t,Lf) + Wbohb(t,Lb) + bo
    Y(t) = softmax( zo(t) )
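The flip trick is easy to demonstrate: running one and the same left-to-right recurrence on a time-reversed input, then reversing the result, yields the backward net's states. A toy NumPy sketch (untrained, illustrative weights):

```python
import numpy as np

# Bidirectional states via the time-reversal trick. Illustrative weights.
rng = np.random.default_rng(4)
d, H, T = 2, 3, 5
Wc = rng.standard_normal((H, d)) * 0.1
Wr = rng.standard_normal((H, H)) * 0.1

def run(X):                     # an ordinary left-to-right recurrence
    h, states = np.zeros(H), []
    for t in range(X.shape[0]):
        h = np.tanh(Wc @ X[t] + Wr @ h)
        states.append(h)
    return np.array(states)

X = rng.standard_normal((T, d))
hf = run(X)                     # forward net: t = 0 .. T-1
hb = run(X[::-1])[::-1]         # backward net: reverse in, reverse back out
# hf[t] summarizes the past X(0)..X(t); hb[t] the future X(t)..X(T-1)
combined = np.concatenate([hf, hb], axis=1)  # per-time features for output
```

In a real bidirectional net the two directions would have separate weights; sharing `run` here just illustrates why a single forward-pass routine suffices for both.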
– From t = T down to t = 0 for the forward net
– From t = 0 up to t = T for the backward net
[Figure: the divergence computed from the outputs feeds derivatives back into both the forward and backward nets]
# Inputs: (in addition to those used by RNN_forward)
#   dhtop : derivatives ddiv/dh*(t,L) at each time (* may be f or b)
#   h, z : h and z values returned by the forward pass
# Outputs:
#   dWc, dWr, db, dhinit : derivatives w.r.t. current and recurrent
#                          weights, biases, and the initial h
# Assuming all dz, dh, dWc, dWr and db are initialized to 0
function [dWc,dWr,db,dhinit] = RNN_bptt(L, Wc, Wr, b, hinit, x, T, dhtop, h, z)
    dh = zeros
    for t = T-1:downto:0  # Backward through time
        dh(t,L) += dhtop(t)
        for l = L:1  # Reverse through layers
            dz(t,l) = dh(t,l)Jacobian(h(t,l),z(t,l))
            dh(t,l-1) += dz(t,l)Wc(l)
            dh(t-1,l) += dz(t,l)Wr(l)
            dWc(l) += h(t,l-1)dz(t,l)
            dWr(l) += h(t-1,l)dz(t,l)
            db(l) += dz(t,l)
    return dWc, dWr, db, dh(-1)  # dh(-1) is actually dh(-1,1:L,:)
# Subscript f represents the forward net, b the backward net
# First compute the derivatives that directly relate to dY(t) for all t,
# then pass them into RNN_bptt to compute forward- and backward-net
# parameter derivatives
for t = 0:T-1  # The output combines forward and backward
    dzo(t) = dY(t)Jacobian(Y(t),zo(t))
    dhfo(t) = dzo(t)Wfo
    dhbo(t) = dzo(t)Wbo
    dbo += dzo(t)
    dWfo += hf(t,Lf)dzo(t)
    dWbo += hb(t,Lb)dzo(t)
# forward net
[dWfc,dWfr,dbf,dhf(-1)] = RNN_bptt(Lf, Wfc, Wfr, bf, hf(-1), x, T, dhfo, hf, zf)
# backward net
xrev = fliplr(x)  # Flip the input in time
# (dhbo, hb and zb should likewise be time-flipped to match xrev)
[dWbc,dWbr,dbb,dhb(inf)] = RNN_bptt(Lb, Wbc, Wbr, bb, hb(inf), xrev, dhbo, hb, zb)
– Recurrent networks maintain hidden states that recurse on themselves
– They are trained by defining a divergence between the actual and desired output sequences and backpropagating gradients over the entire chain of recursion (back propagation through time)
– Gradients with respect to individual parameters are pooled over time
– Bidirectional networks scan the input both beginning-to-end and end-to-beginning to make predictions
– In these networks, backprop must follow the chain of recursion (and gradient pooling) separately in the forward and reverse nets