CSE 490 U: Deep Learning Spring 2016
Yejin Choi
Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer
Human Neurons
– Switching time: ~0.001 second
– Number of neurons: ~10^10
– Connections per neuron: ~10^4 to 10^5
∂l/∂wi = − Σ_j [ y^j − g(w0 + Σ_i wi x^j_i) ] · ∂/∂wi g(w0 + Σ_i wi x^j_i)
Solution just depends on g’: derivative of activation function!
∂/∂x f(g(x)) = f′(g(x)) g′(x)
∂/∂wi g(w0 + Σ_i wi x^j_i) = x^j_i · g′(w0 + Σ_i wi x^j_i)
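To make the g′ term concrete, here is a small NumPy sketch (not from the slides) that computes this gradient for a single sigmoid unit on made-up data; all values and variable names are purely illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# toy data: 4 examples, 3 features (made up for illustration)
X = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.3, -0.5],
              [-0.2, 0.8,  1.0],
              [0.0, -0.7,  0.4]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(3)
w0 = 0.0

# dl/dw_i = - sum_j [y_j - g(w0 + w.x_j)] * g'(w0 + w.x_j) * x_{j,i}
a = w0 + X @ w                  # pre-activation for each example
err = y - sigmoid(a)            # residuals
grad_w = -(err * sigmoid_prime(a)) @ X
grad_w0 = -np.sum(err * sigmoid_prime(a))
print(grad_w, grad_w0)
```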
– You design the feature vector
– Feature representations are “learned” through hidden layers
– Can represent very rich information
– Possibly the entire history from the beginning
ht = f(xt, ht−1)
yt = softmax(V ht)
yt = softmax(V ht)
ht = tanh(U xt + W ht−1 + b)
ht = f(xt, ht−1)
yt = softmax(V ht)
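A minimal NumPy sketch of these two equations unrolled over a toy sequence; the layer sizes, random weights, and inputs are illustrative assumptions, not values from the slides.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

d_in, d_hid, d_out = 5, 8, 3                    # assumed sizes
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(d_out, d_hid))
b = np.zeros(d_hid)

xs = [rng.normal(size=d_in) for _ in range(4)]  # toy input sequence
h = np.zeros(d_hid)                             # h_0
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h + b)            # h_t = tanh(U x_t + W h_{t-1} + b)
    y_t = softmax(V @ h)                        # y_t = softmax(V h_t)
    print(y_t)
```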
𝑦" ℎ" ℎ# ℎ$ ℎ% ℎ$ ℎ# ℎ" Cat sitting on top of ….
– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?
yt = softmax(V ht)
𝑦" 𝑦# 𝑦$ 𝑦% ℎ" ℎ# ℎ$ ℎ% ℎ$ ℎ# ℎ"
– i.e., the next word depends only on the previous N words
𝑦" 𝑦# 𝑦$ 𝑦% ℎ" ℎ# ℎ$ ℎ% ℎ$ ℎ# ℎ"
yt = softmax(V ht)
ht = f(xt, ht−1)
yt = softmax(V ht)
𝑦" 𝑦# 𝑦$ ℎ" ℎ# ℎ$ ℎ% ℎ% ℎ' ℎ( ℎ) ℎ( ℎ'
Figure from http://www.wildml.com/category/conversational-agents/
John has a dog
𝑦" 𝑦# 𝑦$ ℎ" ℎ# ℎ$ ℎ% ℎ% ℎ' ℎ( ℎ) ℎ( ℎ'
Parsing!
yt = softmax(V ht)
ht = tanh(U xt + W ht−1 + b)
𝑦" 𝑦# 𝑦$ 𝑦% ℎ" ℎ# ℎ$ ℎ%
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))
ot = σ(U^(o) xt + W^(o) ht−1 + b^(o))
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
There are many known variations to this set of equations!
ht = tanh(Uxt + Wht−1 + b)
𝑑" 𝑑# 𝑑$ 𝑑% 𝑑+ : cell state ℎ+ : hidden state
𝑑+," ℎ+," 𝑑+ ℎ+
Figure by Christopher Olah (colah.github.io)
sigmoid: [0,1]
Forget gate: forget the past or not
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))
Figure by Christopher Olah (colah.github.io)
sigmoid: [0,1]   tanh: [-1,1]
Forget gate: forget the past or not
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))
Input gate: use the input or not
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))
New cell content (temp):
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))
Figure by Christopher Olah (colah.github.io)
sigmoid: [0,1]   tanh: [-1,1]
Forget gate: forget the past or not
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))
Input gate: use the input or not
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))
New cell content (temp):
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))
New cell content:
ct = ft ⊙ ct−1 + it ⊙ c̃t
Figure by Christopher Olah (colah.github.io)
Forget gate: forget the past or not
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))
Input gate: use the input or not
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))
New cell content (temp):
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))
New cell content:
ct = ft ⊙ ct−1 + it ⊙ c̃t
Output gate: output from the new cell or not
Hidden state:
ht = ot ⊙ tanh(ct)
Figure by Christopher Olah (colah.github.io)
Input gate: use the input or not
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))
Forget gate: forget the past or not
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))
Output gate: output from the new cell or not
ot = σ(U^(o) xt + W^(o) ht−1 + b^(o))
New cell content (temp):
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))
New cell content:
ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state:
ht = ot ⊙ tanh(ct)
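Putting the gates together, here is a one-step LSTM cell sketch in NumPy. The weight shapes, initialization, and toy sequence are assumptions for illustration, and real implementations (and the variants mentioned above) differ in details.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the slide equations (element-wise products)."""
    Uf, Wf, bf, Ui, Wi, bi, Uo, Wo, bo, Uc, Wc, bc = params
    f_t = sigmoid(Uf @ x_t + Wf @ h_prev + bf)       # forget gate
    i_t = sigmoid(Ui @ x_t + Wi @ h_prev + bi)       # input gate
    o_t = sigmoid(Uo @ x_t + Wo @ h_prev + bo)       # output gate
    c_tilde = np.tanh(Uc @ x_t + Wc @ h_prev + bc)   # new cell content (temp)
    c_t = f_t * c_prev + i_t * c_tilde               # new cell state
    h_t = o_t * np.tanh(c_t)                         # new hidden state
    return h_t, c_t

d_in, d_hid = 4, 6                                   # assumed sizes
rng = np.random.default_rng(0)
params = []
for _ in range(4):                                   # f, i, o, c blocks
    params += [rng.normal(scale=0.1, size=(d_hid, d_in)),
               rng.normal(scale=0.1, size=(d_hid, d_hid)),
               np.zeros(d_hid)]
h = np.zeros(d_hid)
c = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):               # toy sequence of length 5
    h, c = lstm_step(x_t, h, c, params)
print(h)
```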
[Figure: sensitivity of a standard RNN to the inputs at time one (the darker the shade, the greater the sensitivity); the sensitivity decays as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs. Example from Graves 2012]
[Figure: preservation of gradient information by an LSTM (forget gate, input gate, output gate); the cell keeps the first input as long as the forget gate is open and the input gate is closed, and the output gate can switch the output on and off without affecting the cell. Example from Graves 2012]
𝑦" 𝑦# 𝑦$ 𝑦% ℎ" ℎ# ℎ$ ℎ%
zt = σ(U^(z) xt + W^(z) ht−1 + b^(z))
rt = σ(U^(r) xt + W^(r) ht−1 + b^(r))
h̃t = tanh(U^(h) xt + W^(h) (rt ⊙ ht−1) + b^(h))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t
Fewer parameters than LSTMs. Easier to train, with comparable performance!
ht = tanh(Uxt + Wht−1 + b)
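A matching one-step GRU sketch under the same illustrative assumptions (sizes and random weights made up); note it needs three weight blocks where the LSTM needs four.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU step following the slide equations."""
    Uz, Wz, bz, Ur, Wr, br, Uh, Wh, bh = params
    z_t = sigmoid(Uz @ x_t + Wz @ h_prev + bz)            # update gate
    r_t = sigmoid(Ur @ x_t + Wr @ h_prev + br)            # reset gate
    h_tilde = np.tanh(Uh @ x_t + Wh @ (r_t * h_prev) + bh)
    return (1.0 - z_t) * h_prev + z_t * h_tilde           # h_t

d_in, d_hid = 4, 6                                        # assumed sizes
rng = np.random.default_rng(0)
params = []
for _ in range(3):                                        # z, r, h blocks
    params += [rng.normal(scale=0.1, size=(d_hid, d_in)),
               rng.normal(scale=0.1, size=(d_hid, d_hid)),
               np.zeros(d_hid)]
h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):                    # toy sequence
    h = gru_step(x_t, h, params)
print(h)
```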
Example from Iyyer et al., 2014
Next 10 slides on back propagation are adapted from Andrew Rosenberg
θ = { w^(1)_ij, w^(2)_jk, w^(3)_kl }
[Figure: feed-forward network with inputs x0, x1, …, xP; weights wij, wjk, wkl; pre-activations aj, ak, al; activations zi, zj, zk, zl; output f(x, θ)]
Empirical Risk Function
R(θ) = (1/N) Σ_n L(yn − f(xn))
     = (1/N) Σ_n ½ (yn − f(xn))²
     = (1/N) Σ_n ½ ( yn − g( Σ_k wkl · g( Σ_j wjk · g( Σ_i wij xn,i ) ) ) )²
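A direct NumPy transcription of this nested expression; the layer sizes, random weights, and data are made-up assumptions, purely for illustration.

```python
import numpy as np

def g(a):                                  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
P, H1, H2, N = 3, 5, 4, 8                  # assumed sizes: input dim, two hidden widths, #examples
X = rng.normal(size=(N, P))                # x_{n,i}
y = rng.normal(size=N)                     # y_n
W1 = rng.normal(scale=0.1, size=(P, H1))   # w_ij
W2 = rng.normal(scale=0.1, size=(H1, H2))  # w_jk
w3 = rng.normal(scale=0.1, size=H2)        # w_kl (single output unit)

f_x = g(g(g(X @ W1) @ W2) @ w3)            # g( Σ_k wkl g( Σ_j wjk g( Σ_i wij x_{n,i} )))
R = np.mean(0.5 * (y - f_x) ** 2)          # empirical risk R(θ)
print(R)
```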
Optimize last layer weights wkl
Ln = ½ (yn − f(xn))²
∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n) (∂al,n/∂wkl)
        = (1/N) Σ_n ∂[½ (yn − g(al,n))²]/∂al,n · ∂[zk,n wkl]/∂wkl
        = (1/N) Σ_n [−(yn − zl,n) g′(al,n)] zk,n
        = (1/N) Σ_n δl,n zk,n
Repeat for all previous layers
∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n) (∂al,n/∂wkl) = (1/N) Σ_n [−(yn − zl,n) g′(al,n)] zk,n = (1/N) Σ_n δl,n zk,n
∂R/∂wjk = (1/N) Σ_n (∂Ln/∂ak,n) (∂ak,n/∂wjk) = (1/N) Σ_n [ Σ_l δl,n wkl g′(ak,n) ] zj,n = (1/N) Σ_n δk,n zj,n
∂R/∂wij = (1/N) Σ_n (∂Ln/∂aj,n) (∂aj,n/∂wij) = (1/N) Σ_n [ Σ_k δk,n wjk g′(aj,n) ] zi,n = (1/N) Σ_n δj,n zi,n
w^{t+1}_ij = w^t_ij − η ∂R/∂wij
w^{t+1}_jk = w^t_jk − η ∂R/∂wjk
w^{t+1}_kl = w^t_kl − η ∂R/∂wkl
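As a concrete sketch of the delta recursion and these update rules, here is a NumPy implementation of one gradient step for the three-layer network with sigmoid activations and squared error; all sizes, data, and the learning rate are made-up illustrations, not values from the slides.

```python
import numpy as np

def g(a):                                  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
P, H1, H2, O, N = 3, 5, 4, 1, 8            # layer sizes and #examples (assumed)
X = rng.normal(size=(N, P))                # inputs  z_i
Y = rng.normal(size=(N, O))                # targets y_n
W1 = rng.normal(scale=0.1, size=(P, H1))   # w_ij
W2 = rng.normal(scale=0.1, size=(H1, H2))  # w_jk
W3 = rng.normal(scale=0.1, size=(H2, O))   # w_kl
eta = 0.1                                  # learning rate (assumed)

# forward pass: a = pre-activation, z = g(a)
a_j = X @ W1;   z_j = g(a_j)
a_k = z_j @ W2; z_k = g(a_k)
a_l = z_k @ W3; z_l = g(a_l)               # z_l = f(x)

# backward pass: deltas as on the slides
delta_l = -(Y - z_l) * g_prime(a_l)        # output layer
delta_k = (delta_l @ W3.T) * g_prime(a_k)  # second hidden layer
delta_j = (delta_k @ W2.T) * g_prime(a_j)  # first hidden layer

# gradients of the empirical risk, then one gradient-descent update
dR_dW3 = z_k.T @ delta_l / N
dR_dW2 = z_j.T @ delta_k / N
dR_dW1 = X.T @ delta_j / N
W3 -= eta * dR_dW3
W2 -= eta * dR_dW2
W1 -= eta * dR_dW1
```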
– Instead of considering exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results.
differentiation only through forward propagation.
∂R/∂wij
H(p, q) = Ep[−log q] = H(p) + DKL(p‖q)
H(p, q) = − Σ_y p(y) log q(y)
H(p) = − Σ_y p(y) log p(y)
DKL(p‖q) = Σ_y p(y) log [ p(y) / q(y) ]
MSE = ½ (y − f(x))²
p: true prob, q: predicted prob
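A tiny NumPy check of these definitions on made-up distributions, verifying that H(p, q) = H(p) + D_KL(p‖q).

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])     # true distribution (made up)
q = np.array([0.5, 0.3, 0.2])     # predicted distribution (made up)

H_p  = -np.sum(p * np.log(p))     # entropy H(p)
H_pq = -np.sum(p * np.log(q))     # cross-entropy H(p, q)
D_kl =  np.sum(p * np.log(p / q)) # KL divergence D_KL(p || q)

print(H_pq, H_p + D_kl)           # the two values should match
```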
– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs