CSE 481: NLP Capstone Spring 2017
Yejin Choi University of Washington
Office Hour News
– Hannah: Wed 2-3pm @ CSE 220
– Maarten: Wed 2-3pm @ CSE 220
– Yejin: Tue 2pm-3:30pm, Wed 5pm-5:30pm @ CSE 578
– All: Thu 12pm-1:25pm @ ??? for some weeks
GPU NEWS!
– A desktop with 2 GPUs can be set up for about $4,000
– Free GPU cycles for the class!
– $200 in credits
RECURRENT NEURAL NETWORKS
[Diagram: an RNN unrolled over time: inputs feed hidden states h1…h4, each emitting an output y1…y4]
Recurrent Neural Networks (RNNs)
– Each hidden state is computed from the previous state and a new input:
ht = f(xt, ht−1), ht ∈ R^D
yt = softmax(V ht)
– The state can represent very rich information
– Possibly the entire history from the beginning
Recurrent Neural Networks (RNNs)
ht = f(xt, ht−1)
yt = softmax(V ht)
e.g., the vanilla RNN: ht = tanh(U xt + W ht−1 + b)
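A minimal numpy sketch of this recurrence (a hedged illustration: all names, seeds, and dimensions below are made-up, not from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b, V):
    """One step of the vanilla RNN: ht = tanh(U xt + W ht-1 + b), yt = softmax(V ht)."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    scores = V @ h_t
    y_t = np.exp(scores - scores.max())   # numerically stable softmax
    return h_t, y_t / y_t.sum()

# Illustrative sizes: input dim 10, hidden dim 20, output vocab 50.
np.random.seed(0)
U, W, b = np.random.randn(20, 10), np.random.randn(20, 20), np.zeros(20)
V = np.random.randn(50, 20)
h = np.zeros(20)                          # h0
for x_t in np.random.randn(4, 10):        # a length-4 input sequence
    h, y = rnn_step(x_t, h, U, W, b, V)   # h carries the whole history forward
```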
Recurrent Neural Networks (RNNs)
ht = f(xt, ht−1)
ft = σ(U(f) xt + W(f) ht−1 + b(f))
it = σ(U(i) xt + W(i) ht−1 + b(i))
ot = σ(U(o) xt + W(o) ht−1 + b(o))
c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
There are many known variations to this set of equations!
(compare the vanilla RNN: ht = tanh(U xt + W ht−1 + b))
[Diagram: LSTM chain; ct : cell state, ht : hidden state]
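A numpy sketch transcribing the LSTM equations above, with ⊙ as elementwise *; the parameter dictionary P and its key names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step; P maps names like 'Uf' to the per-gate parameters U(f), W(f), b(f)."""
    f = sigmoid(P['Uf'] @ x_t + P['Wf'] @ h_prev + P['bf'])        # forget gate: forget the past or not
    i = sigmoid(P['Ui'] @ x_t + P['Wi'] @ h_prev + P['bi'])        # input gate: use the input or not
    o = sigmoid(P['Uo'] @ x_t + P['Wo'] @ h_prev + P['bo'])        # output gate: output from the cell or not
    c_tilde = np.tanh(P['Uc'] @ x_t + P['Wc'] @ h_prev + P['bc'])  # new cell content (temp)
    c_t = f * c_prev + i * c_tilde                                 # cell state: gated mix of old and new
    h_t = o * np.tanh(c_t)                                         # hidden state
    return h_t, c_t
```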
Many uses of RNNs
– ht = f(xt, ht−1)
– One output for the whole sequence: y = softmax(V hn)
– One output per step: yt = softmax(V ht)
[Diagram: e.g., generating a caption: “Cat sitting on top of ….”]
Many uses of RNNs
– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?
ht = f(xt, ht−1) yt = softmax(V ht)
Many uses of RNNs
– False: the hidden state can carry information from every previous step.
– (A Markov assumption would mean the next word depends only on the previous N words.)
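A sketch of what this buys an LM: scoring a sentence with the hypothetical rnn_step from the earlier sketch (vocab as a word list and embed as an embedding lookup are likewise assumed):

```python
import numpy as np

def sentence_logprob(words, embed, params, h0, vocab):
    """log p(w1..wT): each prediction conditions on the ENTIRE prefix via the hidden state."""
    h, y = rnn_step(embed['<s>'], h0, *params)   # params = (U, W, b, V); predict word 1 from <s>
    logp = 0.0
    for w in words:
        logp += np.log(y[vocab.index(w)])        # p(w | full history), no N-word cutoff
        h, y = rnn_step(embed[w], h, *params)    # consume w, predict the next word
    return logp
```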
Many uses of RNNs
ht = f(xt, ht−1), yt = softmax(V ht)
[Diagram: encoder-decoder: encoder states h1…h4 feed decoder states h5…h7, which emit y1…y3]
Many uses of RNNs
Figure from http://www.wildml.com/category/conversational-agents/
Many uses of RNNs
[Diagram: seq2seq over “John has a dog”, decoding a linearized parse]
Parsing!
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
[Diagram: an LSTM cell: (ct−1, ht−1) → (ct, ht)]
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate (sigmoid: [0,1]), forget the past or not:
ft = σ(U(f) xt + W(f) ht−1 + b(f))
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate (sigmoid: [0,1]), forget the past or not:
ft = σ(U(f) xt + W(f) ht−1 + b(f))
Input gate (sigmoid: [0,1]), use the input or not:
it = σ(U(i) xt + W(i) ht−1 + b(i))
New cell content, temp (tanh: [-1,1]):
c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate, forget the past or not: ft = σ(U(f) xt + W(f) ht−1 + b(f))
Input gate, use the input or not: it = σ(U(i) xt + W(i) ht−1 + b(i))
New cell content (temp): c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
New cell content: ct = ft ⊙ ct−1 + it ⊙ c̃t
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate, forget the past or not: ft = σ(U(f) xt + W(f) ht−1 + b(f))
Input gate, use the input or not: it = σ(U(i) xt + W(i) ht−1 + b(i))
Output gate, output from the new cell or not: ot = σ(U(o) xt + W(o) ht−1 + b(o))
New cell content (temp): c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
New cell content: ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state: ht = ot ⊙ tanh(ct)
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Input gate, use the input or not: it = σ(U(i) xt + W(i) ht−1 + b(i))
Forget gate, forget the past or not: ft = σ(U(f) xt + W(f) ht−1 + b(f))
Output gate, output from the new cell or not: ot = σ(U(o) xt + W(o) ht−1 + b(o))
New cell content (temp): c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
New cell content: ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state: ht = ot ⊙ tanh(ct)
[Diagram: an LSTM cell: (ct−1, ht−1) → (ct, ht)]
Illustration of the vanishing gradient problem for RNNs: the shading indicates each node's sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity); the sensitivity decays as new inputs overwrite the hidden state.
Example from Graves 2012
Preservation of gradient information by LSTM
– The gates can preserve gradient information: e.g., the sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.
[Diagram legend: forget gate, input gate, output gate] Example from Graves 2012
Recurrent Neural Networks (RNNs)
ht = f(xt, ht−1)
zt = σ(U(z) xt + W(z) ht−1 + b(z))
rt = σ(U(r) xt + W(r) ht−1 + b(r))
h̃t = tanh(U(h) xt + W(h)(rt ⊙ ht−1) + b(h))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t
– z: update gate, r: reset gate
– This is the Gated Recurrent Unit (GRU): fewer parameters than LSTMs, and easier to train for comparable performance!
(compare the vanilla RNN: ht = tanh(U xt + W ht−1 + b))
– Gates control the information flow, letting the network (contextually) maintain longer-term history.
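A numpy sketch of one GRU step matching the equations above (reusing sigmoid from the LSTM sketch; the parameter dictionary P is an illustrative assumption):

```python
def gru_step(x_t, h_prev, P):
    """One GRU step: no separate cell state, so fewer parameters than an LSTM."""
    z = sigmoid(P['Uz'] @ x_t + P['Wz'] @ h_prev + P['bz'])              # update gate
    r = sigmoid(P['Ur'] @ x_t + P['Wr'] @ h_prev + P['br'])              # reset gate
    h_tilde = np.tanh(P['Uh'] @ x_t + P['Wh'] @ (r * h_prev) + P['bh'])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                              # interpolate old and new
```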
Bi-directional RNNs
Google NMT (Oct 2016)
Recursive Neural Networks
– Compose meaning over tree structure (e.g., a parse tree) rather than sequential structure
[Figure: an example (red → conservative, blue → liberal, gray → neutral) in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree]
Example from Iyyer et al., 2014
Recursive Neural Networks
Example from Iyyer et al., 2014
Tree LSTMs
– Are tree LSTMs more expressive than sequence LSTMs?
When Are Tree Structures Necessary for Deep Learning of Representations? Jiwei Li, Minh-Thang Luong, Dan Jurafsky and Eduard Hovy. EMNLP, 2015.
Neural Probabilistic Language Model (Bengio 2003)
NN_DMLP1(x) = [tanh(xW1 + b1), x] W2 + b2
– W1 ∈ R^(din×dhid), b1 ∈ R^(1×dhid): first affine transformation
– W2 ∈ R^((dhid+din)×dout), b2 ∈ R^(1×dout): second affine transformation
– Each position is predicted by a separate feed-forward neural network: a Markovian language model
– Note the input x concatenated into the second layer: skip connections
LEARNING: BACKPROPAGATION
Error Backpropagation
for brevity: θ = {wij, wjk, wkl}
[Diagram: a feed-forward network f(x, θ) with inputs x0, x1, x2, …, xP]
Next 10 slides on back propagation are adapted from Andrew Rosenberg
θ = {w(1)ij, w(2)jk, w(3)kl}: the weights of layers 1, 2, 3
Learning: Gradient Descent
[Diagram: forward pass through activations (aj, zj), (ak, zk), (al, zl) with weights wij, wjk, wkl]
w(t+1)ij = w(t)ij − η ∂R/∂wij
w(t+1)jk = w(t)jk − η ∂R/∂wjk
w(t+1)kl = w(t)kl − η ∂R/∂wkl
Backpropagation
– Reuse intermediate values: instead of considering exponentially many paths between a weight wij and the final loss (risk), store and reuse intermediate results.
– Automatic differentiation requires specifying only forward propagation.
[Diagram: backprop quantities: activations zi, errors δj, gradients ∂R/∂wij]
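A tiny worked instance of this reuse, for a two-layer network with softmax + cross-entropy (a sketch under assumed shapes, not the slides' exact notation):

```python
import numpy as np

def forward_backward(x, y_true, W1, W2):
    """Backprop on a 2-layer net: intermediate values (z1, deltas) are computed once
    and reused, instead of enumerating every path from a weight to the loss."""
    z1 = np.tanh(x @ W1)                          # forward: hidden layer (kept for reuse)
    scores = z1 @ W2
    p = np.exp(scores - scores.max()); p /= p.sum()
    loss = -np.log(p[y_true])
    d_scores = p.copy(); d_scores[y_true] -= 1.0  # dL/dscores for softmax + cross-entropy
    dW2 = np.outer(z1, d_scores)                  # reuses the stored z1
    d_z1 = (W2 @ d_scores) * (1.0 - z1 ** 2)      # chain rule back through tanh
    dW1 = np.outer(x, d_z1)
    return loss, dW1, dW2

# Gradient descent update, as in the slides: w <- w - eta * dR/dw
# W1 -= eta * dW1; W2 -= eta * dW2
```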
[Table fragment: deep learning toolkits compared by differentiation mode (forward gradient vs. backpropagation) and primary interface language]
Cross Entropy Loss (aka log loss, logistic loss)
– Entropy
– KL divergence (the difference between two distributions p and q)
– Used for models with a probabilistic flavor (e.g., language models)
– Heavily penalizes confident incorrect predictions
H(p, q) = Ep[−log q] = H(p) + DKL(p||q)
H(p, q) = − Σy p(y) log q(y)   (p: true prob, q: predicted prob)
H(p) = − Σy p(y) log p(y)
DKL(p||q) = Σy p(y) log (p(y) / q(y))
(compare MSE = ½ (y − f(x))²)
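A small numeric sketch of these definitions (the two distributions are made-up toy values):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_y p(y) log q(y); p = true distribution, q = predicted."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])   # true label as a one-hot distribution
q = np.array([0.1, 0.7, 0.2])   # predicted probabilities
print(cross_entropy(p, q))      # = -log 0.7: confident wrong predictions cost much more
```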
RNN Learning: Backprop Through Time (BPTT)
– When unrolled, the computation graph repeats the exact same parameters…
– Backprop as if they were different parameters, then update with the average gradients throughout the entire chain of units.
[Diagram: unrolled RNN: the same parameters produce h1…h4 and y1…y4]
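Schematically, in numpy (grad_at_step is a hypothetical stand-in for the real per-step gradient from the unrolled graph; T, eta, and the shapes are toy values):

```python
import numpy as np

# BPTT sketch: treat each unrolled copy of W as if it were a separate parameter,
# backprop through the unrolled graph, then combine the per-step gradients
# for the single shared W.
T, eta = 4, 0.1
W = np.ones((3, 3))

def grad_at_step(t):
    return 0.01 * t * np.ones_like(W)   # placeholder per-step gradient

dW = np.zeros_like(W)
for t in reversed(range(T)):
    dW += grad_at_step(t)               # accumulate across time steps
W -= eta * dW / T                       # update with the average gradient over the chain
```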
LEARNING: TRAINING DEEP NETWORKS
Vanishing / Exploding Gradients
– Deep networks suffer from exploding or vanishing gradients
– Two causes: the network architecture, and the numerical operations used
Vanishing / Exploding Gradients
– Solutions via the network architecture:
– Add skip connections to reduce distance
– Add gates (and memory cells) to allow longer-term memory
Gradients of deep networks
NN_layer(x) = ReLU(xW1 + b1)
[Diagram: a deep stack hn, hn−1, …, h2, h1, x]
– Can have similar issues with vanishing gradients:
∂L/∂hn−1,jn−1 = Σjn 1(hn,jn > 0) Wjn−1,jn · ∂L/∂hn,jn
Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
NN_sl1(x) = ½ ReLU(xW1 + b1) + ½ x
[Diagram: a deep stack hn, hn−1, …, h1, x with skip connections]
∂L/∂hn−1,jn−1 = ½ (Σjn 1(hn,jn > 0) Wjn−1,jn · ∂L/∂hn,jn) + ½ ∂L/∂hn,jn−1
– The skip term passes gradient straight through, independent of the weights.
Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
NN_sl2(x) = (1 − t) ⊙ ReLU(xW1 + b1) + t ⊙ x
where t = σ(xWt + bt), W1 ∈ R^(dhid×dhid), Wt ∈ R^(dhid×1)
[Diagram: a deep stack with gated skip connections]
Diagram borrowed from Alex Rush
Highway Network (Srivastava et al., 2015)
– H is a typical affine transformation followed by a non-linear activation
– T is a “transform gate”, C is a “carry gate”; often C = 1 − T for simplicity
Plain layer: y = H(x, WH)
Highway layer: y = H(x, WH) · T(x, WT) + x · C(x, WC)
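A sketch of one highway layer with C = 1 − T (shapes assumed square; tanh standing in for the non-linear activation in H):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, WH, bH, WT, bT):
    """y = H(x, WH) * T(x, WT) + x * (1 - T(x, WT)), with carry gate C = 1 - T."""
    H = np.tanh(x @ WH + bH)       # typical affine transformation + nonlinearity
    T = sigmoid(x @ WT + bT)       # transform gate
    return H * T + x * (1.0 - T)   # carry the input through where T is small
```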
Residual Networks
– The deepest network yet successfully trained for object recognition
[Diagram: a residual block: any two stacked weight layers (with ReLU), whose output b(0) is added to the identity input to give a(0)]
Residual Networks
– Easier gradient propagation: more direct influence from the final loss on any deep layer
– Compare highway networks and LSTMs, which allow the input connection only through “gates”.
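A sketch of one residual block in the same spirit (assumed shapes; note the identity path is added back with no gate):

```python
import numpy as np

def residual_block(x, W1, b1, W2, b2):
    """Two stacked weight layers with ReLU, plus an ungated identity connection."""
    h = np.maximum(0.0, x @ W1 + b1)   # weight layer + relu
    F = h @ W2 + b2                    # weight layer
    return np.maximum(0.0, F + x)      # add the identity input, then relu
```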
Residual Networks
Revolution of Depth
[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Residual Networks
Revolution of Depth
[Architecture diagrams: ResNet, 152 layers (ILSVRC 2015), vs. VGG, 19 layers, and AlexNet, 8 layers]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Residual Networks
Revolution of Depth
ImageNet classification top-5 error (%):
– ILSVRC'10 (shallow): 28.2
– ILSVRC'11 (shallow): 25.8
– ILSVRC'12 AlexNet (8 layers): 16.4
– ILSVRC'13 (8 layers): 11.7
– ILSVRC'14 VGG (19 layers): 7.3
– ILSVRC'14 GoogleNet (22 layers): 6.7
– ILSVRC'15 ResNet (152 layers): 3.57
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Highway Network (Srivastava et al., 2015)
@Schmidhubered
Vanishing / Exploding Gradients
– Gradient Clipping: bound gradients by a max value
– Gradient Normalization: renormalize gradients when they are above a fixed norm
– Careful initialization, smaller learning rates
– Avoid saturating nonlinearities (like tanh, sigmoid)
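The first two fixes, sketched in numpy (the thresholds are illustrative):

```python
import numpy as np

def clip_and_normalize(grads, max_value=5.0, max_norm=5.0):
    """Gradient clipping (bound each entry) + gradient normalization (bound the global norm)."""
    grads = [np.clip(g, -max_value, max_value) for g in grads]
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```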
Sigmoid
– Smooth and differentiable
– Derivative is near zero almost everywhere except for x near zero => vanishing gradients
σ(x) = 1 / (1 + e^(−x)),  σ′(x) = σ(x)(1 − σ(x))
Tanh
– Used for hidden states & cells in RNNs and LSTMs
– Trains faster than sigmoid
– Gradients still saturate to zero => vanishing gradients
tanh(x) = 2σ(2x) − 1 = (e^x − e^(−x)) / (e^x + e^(−x))
tanh′(x) = 1 − tanh²(x)
Hard Tanh
hardtanh(t) = −1 if t < −1;  t if −1 ≤ t ≤ 1;  1 if t > 1
– Computationally cheaper than tanh
– Saturates to zero easily; not differentiable at 1 and −1
ReLU
– Linear for x > 0: computationally cheaper, induces sparse NNs
– Not differentiable at 0
– Widely used in deep NNs, though not as much in RNNs
– Use subgradients:
ReLU(x) = max(0, x)
d ReLU(x)/dx = 1 if x > 0;  0 if x < 0;  1 or 0 at x = 0
Vanishing / Exploding Gradients
– Gradient Clipping: bound gradients by a max value
– Gradient Normalization: renormalize gradients when they are above a fixed norm
– Careful initialization, smaller learning rates
– Avoid saturating nonlinearities (like tanh, sigmoid)
– Batch Normalization: add intermediate input normalization layers
Batch Normalization
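A training-time sketch of such a normalization layer (gamma and beta are learned rescaling parameters; at test time, running averages of the batch statistics would be used instead):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mu = X.mean(axis=0)                    # per-feature mean over the batch
    var = X.var(axis=0)                    # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalized inputs
    return gamma * X_hat + beta            # learned scale and shift
```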
Regularization
– Modify loss with L1 or L2 norms
Dropout
– Randomly delete parts of the network during training
– Each node (and its corresponding incoming and outgoing edges) is dropped with a probability p
– p is higher for internal nodes, lower for input nodes
– The full network is used for testing
– Faster training, better results
– Vs. bagging
L(θ) = Σ(i=1..n) max{0, 1 − (ŷc − ŷc′)} + λ ||θ||²
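A sketch of (inverted) dropout matching the bullets above; the 1/(1−p) rescaling is what lets the full network be used unchanged at test time:

```python
import numpy as np

def dropout(h, p, train=True):
    """Randomly zero units with probability p; scale survivors by 1/(1-p)."""
    if not train:
        return h                                        # full network at test time
    mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)  # inverted-dropout scaling
    return h * mask
```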
Convergence of backprop
– If the objective were convex:
– Gradient descent reaches the global minimum
– But NN training is not convex:
– Gradient descent gets stuck in local minima
– Selecting the number of hidden units and layers = fuzzy process
– Still, NNs have made a HUGE comeback in the last few years:
– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs
RECAP
Vanishing / Exploding Gradients
– Deep networks suffer from exploding or vanishing gradients
– Two causes: the network architecture, and the numerical operations used
Vanishing / Exploding Gradients
– Solutions via the network architecture:
– Add skip connections to reduce distance
– Add gates (and memory cells) to allow longer-term memory
seq2seq (aka “encoder-decoder”)
ht = f(xt, ht−1) yt = softmax(V ht)
Google NMT (Oct 2016)
ATTENTION!
Seq-to-Seq with Attention
Diagram from http://distill.pub/2016/augmented-rnns/
Seq-to-Seq with Attention
Diagram from http://distill.pub/2016/augmented-rnns/
Trial: Hard Attention
– Predict the next target word from the decoder state and a single attended (hard) source state:
z_j = tanh([s^t_i, s^s_j] W + b)
j = argmax_j z_j
w^t_{i+1} = argmax_w O(w, s^t_{i+1}, s^s_j)
Encoder – Decoder Architecture
Sequence-to-Sequence
[Diagram: sequence-to-sequence: encoder states s^s_1…s^s_3 read x1…x3 = “the red dog”; decoder states s^t_1…s^t_3, starting from <s>, emit ŷ1…ŷ3 = “the red dog”]
Diagram borrowed from Alex Rush
Attention: Soft Alignments
– Predict the next target word using a context vector c interpolated over all source states:
– Step 1: compute the attention weights
z_j = tanh([s^t_i, s^s_j] W + b)
α = softmax(z)
– Step 2: compute the attention (context) vector as an interpolation
c = Σ_j α_j s^s_j
w^t_{i+1} = argmax_w O(w, s^t_{i+1}, c)
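Both steps in numpy, using the concat scoring z_j = tanh([s^t_i, s^s_j] W + b) from the slides (all shapes below are illustrative assumptions):

```python
import numpy as np

def soft_attention(s_t, S_src, W, b):
    """Step 1: weights alpha = softmax(z); Step 2: context c = sum_j alpha_j * s_j."""
    z = np.array([np.tanh(np.concatenate([s_t, s_j]) @ W + b) for s_j in S_src])
    alpha = np.exp(z - z.max())
    alpha /= alpha.sum()                  # attention weights over source states
    c = alpha @ S_src                     # interpolation of the source states
    return c, alpha

# Toy usage: 5 source states and a decoder state, each of dimension 8.
np.random.seed(0)
S_src = np.random.randn(5, 8)
s_t = np.random.randn(8)
W, b = np.random.randn(16), 0.0           # maps the concatenated pair to a scalar score
c, alpha = soft_attention(s_t, S_src, W, b)
```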
Attention function parameterization
z_j = s^t_i · s^s_j   (dot product)
z_j = (s^t_i · s^s_j) / (||s^t_i|| ||s^s_j||)   (cosine)
z_j = s^t_i^T W s^s_j   (bilinear)
z_j = tanh([s^t_i; s^s_j] W + b)
z_j = tanh([s^t_i; s^s_j; s^t_i ⊙ s^s_j] W + b)
Learned Attention!
Diagram borrowed from Alex Rush
Qualitative results
[Figure 2: Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. “soft” (top row) vs. “hard” (bottom row) attention. (Note that both models generated the same captions in this example.)]
[Figure 3: Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)]
POINTER NETWORKS
Convex hull, Delaunay triangulation, Traveling Salesman
Can we model these problems using seq-to-seq?
Pointer Networks! (Vinyals et al. 2015)
Pointer Networks
[Figure: (a) Sequence-to-Sequence vs. (b) Ptr-Net]
Pointer Networks
Attention Mechanism vs Pointer Networks
– Softmax normalizes the vector e_ij to be an output distribution over the dictionary of inputs
[Diagram: attention mechanism vs. Ptr-Net, borrowed from Keon Kim]
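A sketch of one Ptr-Net output step, with the additive scoring of Vinyals et al. (parameter shapes assumed):

```python
import numpy as np

def pointer_distribution(s_t, H_enc, W1, W2, v):
    """The attention distribution itself is the output: a distribution over INPUT
    positions, rather than over a fixed output vocabulary."""
    scores = np.array([v @ np.tanh(W1 @ h_j + W2 @ s_t) for h_j in H_enc])
    p = np.exp(scores - scores.max())
    return p / p.sum()                   # p[j] = probability of pointing at input j
```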
CopyNet (Gu et al. 2016)
– I: Hello Jack, my name is Chandralekha
– R: Nice to meet you, Chandralekha
– I: This new guy doesn’t perform exactly as expected.
– R: what do you mean by “doesn’t perform exactly as expected?”
CopyNet (Gu et al. 2016)
[Diagram: CopyNet. Source (vocabulary M): “hello , my name is Tony Jebara .” with encoder states h1…h8; output: “hi , Tony Jebara <eos>” with decoder states s1…s4. (a) Attention-based encoder-decoder (RNNSearch) with attentive read; (b) generate-mode & copy-mode over the source vocabulary, so Prob(“Jebara”) = Prob(“Jebara”, g) + Prob(“Jebara”, c); (c) state update for s4 using the embedding for “Tony” and a selective read for “Tony”]
CopyNet (Gu et al. 2016)
The copy model mixes two modes:
p(yt | st, yt−1, ct, M) = p(yt, g | st, yt−1, ct, M) + p(yt, c | st, yt−1, ct, M)   (4)
Generate-Mode: the same scoring function as in the generic RNN encoder-decoder (Bahdanau et al., 2014):
ψg(yt = vi) = vi^T Wo st,  vi ∈ V ∪ {UNK}   (7)
where Wo ∈ R^((N+1)×ds) and vi is the one-hot indicator vector for vi.
Copy-Mode: the score for “copying” the word xj:
ψc(yt = xj) = σ(hj^T Wc) st,  xj ∈ X   (8)
p(yt, g | ·) = (1/Z) e^ψg(yt) if yt ∈ V;  0 if yt ∈ X \ V̄;  (1/Z) e^ψg(UNK) if yt ∉ V ∪ X   (5)
p(yt, c | ·) = (1/Z) Σ(j: xj = yt) e^ψc(xj) if yt ∈ X;  0 otherwise   (6)
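A simplified sketch of equations (4)-(6): both modes share one normalizer Z, and a word's probability sums its generate and copy scores (the UNK case and the exact scoring functions are omitted; psi_g and psi_c are assumed precomputed score vectors):

```python
import numpy as np

def copynet_mix(psi_g, psi_c, vocab, src_words):
    """p(y) = p(y, g) + p(y, c), with both modes normalized by the SAME Z."""
    Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()                # shared normalizer
    p = {w: np.exp(psi_g[i]) / Z for i, w in enumerate(vocab)}   # generate-mode
    for j, w in enumerate(src_words):                            # copy-mode: sum over copies
        p[w] = p.get(w, 0.0) + np.exp(psi_c[j]) / Z
    return p
```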
BiDAF
NEURAL CHECKLIST
Neural Checklist Models
(Kiddon et al., 2016)
Encoder-Decoder Architecture
[Diagram: an encoder-decoder decodes recipe text (“Chop the tomatoes . Add …”) from the encoded title “garlic tomato salsa”]
– Doesn’t address changing ingredients
– Want to update ingredient information as ingredients are used
Encode title - decode recipe
Cut each sandwich in halves. Sandwiches with sandwiches. Sandwiches, sandwiches, Sandwiches, sandwiches, sandwiches sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, or sandwiches or triangles, a griddle, each sandwich. Top each with a slice of cheese, tomato, and cheese. Top with remaining cheese mixture. Top with remaining cheese. Broil until tops are bubbly and cheese is melted, about 5 minutes.
title: sausage sandwiches
Recipe generation vs. machine translation
[Diagram: decode the recipe token by token from two inputs: the recipe title and ingredients 1–4]
– Two different input sources (the title and the ingredient list)
– No direct alignment between input and output
– Much of the output must come from context (and implicit knowledge about cooking)
Encoder-Decoder with Attention
Neural checklist model
Let’s make salsa!
Garlic tomato salsa
[Diagram: ingredient checklist, e.g., garlic salt]
Neural checklist model
[Diagram: an RNN LM generates the recipe token by token (“Chop” from <S>); a hidden state classifier chooses among non-ingredient / new ingredient / used ingredient; a checklist tracks which ingredients are still available and yields a new hidden state (title: garlic tomato salsa)]
Neural checklist model
[Diagram: generating “Chop the tomatoes .”; the hidden state classifier outputs a distribution (e.g., 0.85, 0.10, 0.04, 0.01) over non-ingredient vs. new ingredient, etc.]
Neural checklist model
[Diagram: generating “Dice the …”; classifier distribution 0.00, 0.94, 0.03, 0.01]
Neural checklist model
[Diagram: generating “Add tomatoes to …”; classifier distribution 0.94, 0.04, 0.01, 0.01 for “used ingredient”]
Checklist is probabilistic
[Diagram: generating “Add tomatoes to …” (classifier distribution 0.90, 0.08, 0.01, 0.01 for “used ingredient”); the checklist entries are probabilities (e.g., 0.85, 1.00, 0.02, 0.04), updated from the new-ingredient probability distribution]
Hidden state classifier is soft
[Diagram: the hidden state classifier is soft: its probabilities (e.g., 0.90, 0.08, 0.01, 0.01) weight the checklist update and the candidate output distributions rather than making a hard choice]
Interpolation
[Diagram: the candidate output distributions are interpolated, weighted by the classifier probabilities, into one probability distribution over the vocabulary]
Choose ingredient via attention
– Generates a probability distribution over a set of embeddings that corresponds to how close a target embedding is to each of them
Attention models for other NLP tasks
– MT (Balasubramanian et al. 13, Bahdanau et al. 14)
– Sentence summarization (Rush et al. 15)
– Machine reading (Cheng et al. 16)
– Image captioning (Xu et al. 15)
[Equation legend: available ingredient embeddings; content vector from the language model; temperature term]
Attention-generated embeddings
– Can generate an embedding from the attention probabilities over the ingredient embeddings
Discussion Points
… what do NNs think about this?
Hafez: Neural Sonnet Writer
(Ghazvininejad et al. 2016)
Neural Sonnets
Deep Convolution Network
Outrageous channels on the wrong connections, An empty space without an open layer, A closet full of black and blue extensions, Connections by the closure operator.
Theory
Another way to reach the wrong conclusion! A vision from a total transformation, Created by the great magnetic fusion, Lots of people need an explanation.
Discussion Points
– Less effort on feature engineering (at the cost of more hyperparameter tuning!)
– In computer vision: NN-learned representations are significantly better than human-engineered features
– In NLP: often the NN-induced representation is concatenated with additional human-engineered features
– Most success comes from massive amounts of clean (expensive) data
– Recent surge of data-creation papers (especially AI-challenge-type tasks)
– This significantly limits the domains & applications
– Need stronger models for unsupervised & distantly supervised approaches
Discussion Points
– allows for flexible, expressive, and creative modeling
– Recent breakthroughs come more from engineering advancements than from theoretical advancements
– Several NN platforms, and a code-sharing culture
Neural Recipe Example #1
Cook eggplant in boiling water , covered , for 10 min . Drain and cut in half lengthwise . scoop out insides leaving 1/2 '' shell . Mash insides with cottage cheese ,
Preheat oven to 350 ^ stuff eggplant halves , place in casserole dish and bake covered for 15 min . Add a little water to bottom of pan to keep eggplant moist . top with provolone cheese . Bake 5 more min uncovered 1 serving = In a small bowl , combine the cheese , eggplant , basil , oregano , tomato sauce and onion . Mix well . Shape mixture into 6 patties , each about 3/4-inch thick. Place on baking sheet . Bake at 350 degrees for 30 minutes or until lightly browned . Southern living magazine , sometime in 1980 . Typed for you by nancy coleman .
eggplant cheese cottage lowfat chopped onion bay ground leaf basil
tomato sauce provolone
title: oven eggplant
CONVOLUTIONAL NEURAL NETWORKS
Next several slides borrowed from Alex Rush
Models with Sliding Windows
– E.g., neural language model
– E.g., sequence tagging with CRFs or structured perceptron
[w1 w2 w3 w4 w5] w6 w7 w8
w1 [w2 w3 w4 w5 w6] w7 w8
w1 w2 [w3 w4 w5 w6 w7] w8
. . .
Sliding Windows w/ Convolution
Let our input be the embeddings of the full sentence, X ∈ R^(n×d0):
X = [v(w1), v(w2), v(w3), …, v(wn)]
Define a window model as NN_window : R^(1×(dwin·d0)) → R^(1×dhid):
NN_window(xwin) = xwin W1 + b1
The convolution is defined as NN_conv : R^(n×d0) → R^((n−dwin+1)×dhid):
NN_conv(X) = tanh [NN_window(X_(1:dwin)); NN_window(X_(2:dwin+1)); …; NN_window(X_((n−dwin+1):n))]
Pooling Operations
– Pooling “over-time” operations f : R^(n×m) → R^(1×m)
[Illustration: f collapses the n window vectors into a single m-dimensional vector, e.g., an elementwise max or sum over time]
Convolution + Pooling
ŷ = softmax(f_max(NN_conv(X)) W2 + b2)
– W2 ∈ R^(dhid×dout), b2 ∈ R^(1×dout)
– The final linear layer W2 uses the learned window features
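Putting the convolution and pooling slides together as a numpy sketch (window size and all shapes are illustrative assumptions):

```python
import numpy as np

def text_cnn_predict(X, W1, b1, W2, b2, d_win=3):
    """NN_conv over every window, max-pool over time, then a softmax classifier."""
    n = X.shape[0]
    windows = np.stack([X[i:i + d_win].reshape(-1)   # flatten each d_win x d0 window
                        for i in range(n - d_win + 1)])
    H = np.tanh(windows @ W1 + b1)                   # one d_hid vector per window
    pooled = H.max(axis=0)                           # f_max: pooling over time
    scores = pooled @ W2 + b2
    p = np.exp(scores - scores.max())
    return p / p.sum()
```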
Multiple Convolutions
ŷ = softmax([f(NN1_conv(X)), f(NN2_conv(X)), …, f(NNf_conv(X))] W2 + b2)
– Concatenate several convolutions together
– Each NN1_conv, NN2_conv, etc. uses a different dwin
– Allows for different window sizes (similar to multiple n-grams)
Convolution Diagram (Kim 2014)
– n = 9, dhid = 4, dout = 2
– red: dwin = 2; blue: dwin = 3 (ignore the back channel)
Text Classification (Kim 2014)
AlexNet (Krizhevsky et al., 2012)