Lecture #08 – Attention and Memory
Aykut Erdem // Hacettepe University // Spring 2018
CMP784
DEEP LEARNING
Sherlock Holmes’ mind palace, BBC/Masterpiece's Sherlock
Breaking news!
− Practical 2 is due April 6, 23:59
− Midterm exam: check the midterm guide for details
− Language modeling with RNNs: due Sunday, April 22, 23:59
Previously on CMP784
− Recurrent neural networks (RNNs)
− The long short-term memory (LSTM) unit and its variants
Image credits: Oleg Soroko; "Using RNNs to generate Super Mario Maker levels", Adam Geitgey
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
— Mateusz Malinowski's lecture on Attention-based Networks
— Graham Neubig's CMU CS11-747 Neural Networks for NLP class
— Chris Dyer's Oxford Deep NLP class
— Yoshua Bengio's talk on From Attention to Memory and towards Longer-Term Dependencies
— Sumit Chopra's lecture on Reasoning, Attention and Memory
— Jason Weston's tutorial on Memory Networks for Language Understanding
— Richard Socher's talk on Dynamic Memory Networks
Deep Learning for Vision
Figure credit: Xiaogang Wang
Deep Learning for Speech
Figure credit: NVidia
Deep Learning for Text
[Figure: a recurrent network maps the words x1 … x5 through hidden layers z to a prediction ŷ]
"The movie was not bad at all. I had fun." → positive
Deep Models
Input Representation → Feature Extractor (encoder) FW1 → Classifier/Regressor (decoder) GW2 → Loss Function
"The movie was not bad at all. I had fun."
− Encoder: Fully Connected Network, Convolution Network, or Recurrent Network; can be seen as a prior on the type of transformation you want
− Decoder: typically a linear projection with some non-linearity (log-soft-max)
− Learnable parametric function
− Inputs: generally considered i.i.d.
− Outputs: classification or regression
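To make the encoder/decoder picture concrete, here is a minimal PyTorch sketch for the sentiment example above; the module choices and sizes are illustrative assumptions, not taken from the lecture.

```python
import torch
import torch.nn as nn

class SentimentModel(nn.Module):
    """Encoder/decoder view of a deep model (an illustrative sketch):
    a recurrent encoder extracts features, a linear + log-softmax head decodes a label."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                 # input representation
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # feature extractor F_W1
        self.decoder = nn.Linear(hidden_dim, num_classes)                # classifier/regressor G_W2
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, tokens):                               # tokens: (B, T) word ids
        _, h = self.encoder(self.embed(tokens))              # h: (1, B, hidden_dim)
        return self.log_softmax(self.decoder(h.squeeze(0)))  # log p(label | sentence)

# Training would use nn.NLLLoss on the log-probabilities (the "loss function" box).
```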
Encoder-Decoder Framework
− c = "universal representation"
[Figure: the encoder maps x1 … xT to a fixed vector c; the decoder generates y1 … yT′ from c]
− For bitext data: French encoder → English decoder (French sentence → English sentence)
− For unilingual data: English encoder → English decoder (English sentence → English sentence)
Sentence Representations
− A single fixed-size vector must represent the whole sentence
[Figure: "this is an example" encoded as one vector for the whole sentence vs. one vector per word]
Basic Idea
− Encode each word in the source sentence into a vector
− While decoding, combine these encoder vectors into a single vector, weighted by "attention weights" (which say where to look)
[Example: the decoder query vector (after "I hate") is scored against key vectors for "kono eiga ga kirai", giving a1=2.1, a2=-0.1, a3=0.3, a4=-1.0; a softmax turns these scores into weights α1=0.76, α2=0.08, α3=0.13, α4=0.03]
Calculating Attention
− Use a "query" vector (the decoder state) and "key" vectors (all encoder states)
− For each query-key pair, calculate a weight
− Normalize the weights to add to one using softmax

Calculating Attention
− Combine the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum
[Example: the value vectors for "kono eiga ga kirai" are multiplied by α1=0.76, α2=0.08, α3=0.13, α4=0.03 and summed to form the context vector]
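A minimal NumPy sketch of these steps; the vectors and sizes are toy values, purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())           # subtract max for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: score each key against the query,
    normalize with softmax, and return the weighted sum of values."""
    scores = keys @ query             # one score per source position
    alphas = softmax(scores)          # attention weights, sum to 1
    context = alphas @ values         # weighted combination of value vectors
    return context, alphas

# Toy example: 4 source positions ("kono eiga ga kirai"), hidden size 8
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(4, 8))   # encoder states serve as keys and values
query = rng.normal(size=8)                # current decoder state
context, alphas = attend(query, keys, values)
print(alphas, alphas.sum())               # weights over the 4 source words
```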
A Graphical Example
(Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015, Jean et al. 2015)

End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
(Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015, Jean et al. 2015)
[Figure: translation quality (BLEU) from 2013 to 2016 for phrase-based SMT, syntax-based SMT, and neural MT. Figure credit: Rico Sennrich]
Attention Score Functions
− Multi-Layer Perceptron (Bahdanau et al. 2015)
    a(q, k) = w2ᵀ tanh(W1 [q; k])
  − Flexible, often very good with large data
− Bilinear (Luong et al. 2015)
    a(q, k) = qᵀ W k
− Dot Product (Luong et al. 2015)
    a(q, k) = qᵀ k
  − No parameters! But requires the sizes to be the same.
− Scaled Dot Product (Vaswani et al. 2017)
    a(q, k) = qᵀ k / √|k|
  − Problem: the scale of the dot product increases as the dimensions get larger
  − Fix: scale by the size of the vector
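The four score functions side by side, as a rough NumPy sketch; the parameter shapes and random initialization are placeholders, not from the slides.

```python
import numpy as np

d = 8                                   # hidden size (illustrative)
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
W1 = rng.normal(size=(d, 2 * d))        # MLP first layer
w2 = rng.normal(size=d)                 # MLP second layer
W  = rng.normal(size=(d, d))            # bilinear matrix

def mlp_score(q, k):          # Bahdanau et al. 2015
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))

def bilinear_score(q, k):     # Luong et al. 2015
    return q @ W @ k

def dot_score(q, k):          # Luong et al. 2015 (no parameters)
    return q @ k

def scaled_dot_score(q, k):   # Vaswani et al. 2017
    return q @ k / np.sqrt(len(k))
```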
Case Study: Show, Attend and Tell
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio. ICML 2015.
Paying Attention to Selected Parts While Uttering Words
[Figure: a recurrent decoder unrolled over time, sampling one word per step through a softmax, e.g. "<s> Akiko likes Pimm's </s>". Sutskever et al. (2014)]
[Figure: an image-conditioned recurrent decoder generating "a man is rowing" word by word. Vinyals et al. (2014), Show and Tell: A Neural Image Caption Generator]
Regions in ConvNets
− Each point in a "higher" level of a convnet defines a spatially localized feature vector (/matrix)
− Xu et al. call these "annotation vectors", ai, i ∈ {1, . . . , L}
[Figure: the L annotation vectors ai extracted from a convolutional feature map F of the image]
Extension of LSTM via the Context Vector
− Use a lower convolutional layer so that there is a correspondence between the feature vectors and portions of the 2-D image

    (it, ft, ot, gt)ᵀ = (σ, σ, σ, tanh)ᵀ TD+m+n,n (Eyt−1, ht−1, ẑt)ᵀ
    ct = ft ⊙ ct−1 + it ⊙ gt
    ht = ot ⊙ tanh(ct)

    eti = fatt(ai, ht−1),    αti = exp(eti) / Σk=1..L exp(etk)
    ẑt = φ({ai}, {αi})
    p(yt | a, y1:t−1) ∝ exp(Lo(Eyt−1 + Lhht + Lzẑt))

− E: embedding matrix; y: captions; h: previous hidden state; ẑ: context vector, a dynamic representation of the relevant part of the image
− φ is the "attention" ("focus") function: "soft" / "hard"
− fatt is an MLP conditioned on the previous hidden state
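A rough PyTorch sketch of one decoding step with soft attention over the annotation vectors; the module names and sizes are assumptions, and the way ẑt enters the LSTM is simplified relative to the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionDecoderStep(nn.Module):
    """One decoding step in the spirit of Show, Attend and Tell (soft attention).
    Sizes and module names are illustrative, not the paper's exact configuration."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, annot_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_a = nn.Linear(annot_dim, hidden_dim)    # part of f_att: project annotations
        self.att_h = nn.Linear(hidden_dim, hidden_dim)   # part of f_att: project h_{t-1}
        self.att_v = nn.Linear(hidden_dim, 1)            # f_att output score e_ti
        self.lstm = nn.LSTMCell(embed_dim + annot_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)     # plays the role of L_o

    def forward(self, prev_word, h, c, annotations):
        # prev_word: (B,) token ids; h, c: (B, hidden); annotations: (B, L, annot_dim)
        e = self.att_v(torch.tanh(self.att_a(annotations) +
                                  self.att_h(h).unsqueeze(1))).squeeze(-1)   # (B, L)
        alpha = F.softmax(e, dim=1)                         # α_t over the L regions
        z = (alpha.unsqueeze(-1) * annotations).sum(dim=1)  # soft ẑ_t = Σ_i α_ti a_i
        x = torch.cat([self.embed(prev_word), z], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c, alpha                     # logits for p(y_t | ·)
```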
Hard Attention
− We have two sequences: 'i' runs over localizations and 't' runs over words
− As before, eti = fatt(ai, ht−1),  αti = exp(eti) / Σk=1..L exp(etk),  ẑt = φ({ai}, {αi})
− The attention location st is now a discrete latent variable:
    p(st,i = 1 | sj<t, a) = αt,i,    ẑt = Σi st,i ai,    st ∼ MultinoulliL({αi})
− Stochastic decisions are discrete here, so derivatives are zero
− The loss is a variational lower bound on the marginal log-likelihood (by Jensen's inequality, E[log X] ≤ log E[X]):
    Ls = Σs p(s | a) log p(y | s, a) ≤ log Σs p(s | a) p(y | s, a) = log p(y | a)
− Its gradient is
    ∂Ls/∂W = Σs p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ]
  and is approximated by Monte Carlo sampling s̃ⁿ ∼ MultinoulliL({αi}):
    ∂Ls/∂W ≈ (1/N) Σn=1..N [ ∂ log p(y | s̃ⁿ, a)/∂W + log p(y | s̃ⁿ, a) ∂ log p(s̃ⁿ | a)/∂W ]
− To reduce the estimator variance, an entropy term H[s̃ⁿ] and a bias b are added [1, 2]:
    ∂Ls/∂W ≈ (1/N) Σn=1..N [ ∂ log p(y | s̃ⁿ, a)/∂W + λr (log p(y | s̃ⁿ, a) − b) ∂ log p(s̃ⁿ | a)/∂W + λe ∂H[s̃ⁿ]/∂W ]
[1] J. Ba et al. "Multiple object recognition with visual attention"
[2] A. Mnih et al. "Neural variational inference and learning in belief networks"
− Hard attention makes a zero-one decision about where to attend, trained with reinforcement learning
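A hedged PyTorch-style sketch of a single-sample surrogate loss for hard attention; log_p_y_given_s, the baseline, and λe are placeholders, and this simplifies the N-sample estimator above rather than reproducing the paper's exact training code.

```python
import torch
from torch.distributions import Categorical

def hard_attention_loss(alpha, annotations, log_p_y_given_s, baseline=0.0, lambda_e=0.01):
    """Single-step REINFORCE-style surrogate for hard attention (a sketch).
    alpha: (L,) attention probs; annotations: (L, D) annotation vectors;
    log_p_y_given_s: function mapping a context vector ẑ_t to a scalar tensor log p(y_t | s, a)."""
    dist = Categorical(probs=alpha)
    s = dist.sample()                          # s_t ~ Multinoulli_L({α_i})
    z = annotations[s]                         # ẑ_t = a_{s_t}: one region, not a blend
    log_py = log_p_y_given_s(z)                # reward signal: log-likelihood of the word
    reinforce = (log_py.detach() - baseline) * dist.log_prob(s)   # score-function term
    entropy = dist.entropy()                   # entropy bonus to reduce variance / explore
    # Maximize log_py + reinforce + λ_e * entropy, i.e. minimize the negative
    return -(log_py + reinforce + lambda_e * entropy)
```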
Soft Attention
− Instead of making hard decisions, we take the expected context vector:
    Ep(st|a)[ẑt] = Σi=1..L αt,i ai,    i.e.    φ({ai}, {αi}) = Σi αi ai
  This corresponds to feeding in a soft, α-weighted context, as in Bahdanau et al. (2014)
− The whole model is smooth and differentiable under the deterministic attention; learning via standard backprop
− Theoretical argument: let nt = Lo(Eyt−1 + Lhht + Lzẑt) be the softmax input, and nt,k,i its k-th component when ẑt = ai. The normalized weighted geometric mean (NWGM) of the softmax output is
    NWGM[p(yt = k | a)] = Πi exp(nt,k,i)^p(st,i=1|a) / Σj Πi exp(nt,j,i)^p(st,i=1|a)
                        = exp(Ep(st|a)[nt,k]) / Σj exp(Ep(st|a)[nt,j])
  with E[nt] = Lo(Eyt−1 + LhE[ht] + LzE[ẑt]), so the NWGM of a softmax unit is obtained by a single forward prop with the expected vectors E[ht] and E[ẑt]
− Moreover, NWGM[p(yt = k | a)] ≈ E[p(yt = k | a)] [1]
[1] P. Baldi et al. "The dropout learning algorithm"
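A tiny NumPy check (with the example weights from earlier, purely illustrative) that the soft context vector is the expectation of the hard-sampled one.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 8
a = rng.normal(size=(L, D))                   # annotation vectors a_i (toy values)
alpha = np.array([0.76, 0.08, 0.13, 0.03])    # attention weights (example values)

soft_z = alpha @ a                            # soft attention: expected context vector

# Hard attention: sample locations s_t ~ Multinoulli(alpha) and pick a_{s_t}
samples = rng.choice(L, size=100000, p=alpha)
hard_z_mean = a[samples].mean(axis=0)         # Monte Carlo average of sampled contexts

print(np.allclose(soft_z, hard_z_mean, atol=0.05))   # average of hard samples ≈ soft ẑ_t
```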
How Soft/Hard Attention Works
− Hard attention: sample regions of attention; optimize a variational lower bound of the maximum likelihood
− Soft attention: compute the expected attention
Hard Attention
Soft Attention
The Good
And the Bad
Quantitative Results
Model           | M1 (human) | M2 (human) | BLEU  | CIDEr
Human           | 0.638      | 0.675      | 0.471 | 0.910
Google          | 0.273      | 0.317      | 0.587 | 0.946
MSR             | 0.268      | 0.322      | 0.567 | 0.925
Attention-based | 0.262      | 0.272      | 0.523 | 0.878
Captivator      | 0.250      | 0.301      | 0.601 | 0.937
Berkeley LRCN   | 0.246      | 0.268      | 0.534 | 0.891
M1: humans preferred (or rated as equal) the method's caption over the human annotation; M2: Turing test
+2 BLEU
+4 BLEU
Video Description Generation
− The context set consists of per-frame context vectors; the attention mechanism selects one of those vectors for each output symbol being decoded, capturing the global temporal structure across frames
− A 3-D conv-net applies local filters across the spatio-temporal dimensions, working on motion statistics
Why attention?
− Dealing with the vanishing gradient problem
− Attending/focusing on smaller parts of the data
  § patches in images
  § words or phrases in sentences
− Different problems require different sizes of representations
  § an LSTM with longer sentences requires larger vectors
− Focusing only on parts of images
− Scalability independent of the size of images
[Figure: an attention mechanism selecting elements from a recurrent net's memory]
Attention on Memory Elements
− Recurrent networks cannot remember things for very long
− We need a "hippocampus" (a separate memory module)
− Memory networks [Weston et al. 2014] (FAIR), associative memory
Recall: Long-Term Dependencies
− The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
− Storing bits robustly requires eigenvalues < 1 (Hochreiter 1991)
− Exploding gradients can be mitigated by gradient clipping
Gated Recurrent Units & LSTM
− Create a path where gradients can flow for longer, with a self-loop on the state
− Corresponds to a Jacobian with eigenvalues slightly less than 1
− LSTM is heavily used (Hochreiter & Schmidhuber 1997); GRU (Cho et al. 2014)
[Figure: LSTM cell with input, input gate, forget gate, output gate, and a self-loop on the state]
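A minimal NumPy sketch of one LSTM step, highlighting the additive self-loop on the cell state; biases are omitted and the parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, params):
    """One LSTM step (biases omitted for brevity). The additive self-loop
    c = f * c_prev + i * g is what lets gradients flow over long spans."""
    Wi, Wf, Wo, Wg = params            # each maps [x; h_prev] to the hidden size
    xh = np.concatenate([x, h_prev])
    i = sigmoid(Wi @ xh)               # input gate
    f = sigmoid(Wf @ xh)               # forget gate (weight of the self-loop)
    o = sigmoid(Wo @ xh)               # output gate
    g = np.tanh(Wg @ xh)               # candidate update
    c = f * c_prev + i * g             # gated self-loop on the cell state
    h = o * np.tanh(c)
    return h, c
```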
Delays & Hierarchies to Reach Farther
− Delays and multiple time scales: El Hihi & Bengio NIPS 1995, Koutnik et al. ICML 2014
− Hierarchical RNNs (words / sentences): Sordoni et al. CIKM 2015, Serban et al. AAAI 2016
[Figure: an RNN unfolded in time with additional delayed (skip) connections between states further apart, alongside the usual one-step connections]
Large Memory Networks: Sparse Access Memory for Long-Term Dependencies
− A state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write
[Figure: memory holds a passive copy of the state; attention provides read/write access]
Memory Networks
− A class of models that combine a large memory with a learning component that can read and write to it
− Incorporates reasoning with attention over memory (RAM)
− Most ML has limited memory, which is more or less all that is needed for "low level" tasks, e.g. object detection
Scenario 1
Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom.
Where is the milk now? A: office
Where is Joe? A: bathroom
Where was Joe before the office? A: kitchen
Scenario 2
[Image: Baxter robot]
Scenario 2
Shaolin Soccer directed by Stephen Chow
Shaolin Soccer written by Stephen Chow
Shaolin Soccer starred actors Stephen Chow
Shaolin Soccer release year 2001
Shaolin Soccer has genre comedy
Shaolin Soccer has tags martial arts, kung fu soccer, stephen chow
Kung Fu Hustle directed by Stephen Chow
Kung Fu Hustle written by Stephen Chow
Kung Fu Hustle starred actors Stephen Chow
Kung Fu Hustle has genre comedy action
Kung Fu Hustle has imdb votes famous
Kung Fu Hustle has tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow
The God of Cookery directed by Stephen Chow
The God of Cookery written by Stephen Chow
The God of Cookery starred actors Stephen Chow
The God of Cookery has tags hong kong Stephen Chow
From Beijing with Love directed by Stephen Chow
From Beijing with Love written by Stephen Chow
From Beijing with Love starred actors Stephen Chow, Anita Yuen
... <and more> ...
1) I'm looking for a fun comedy to watch tonight, any ideas?
Scenario 3
Who wrote Kung Fu Hustle?
I'm interested in watching a Stephen Chow movie other than Kung Fu Hustle. Can you suggest something?
(Both questions are asked against the same knowledge base of movie facts shown above.)
What is required?
Not all problems can be mapped to y = f(x)
[Diagram: X → fW → Y, i.e. Y = fW(X)]
What is a Memory Network?
Original paper description of the class of models. MemNNs have four component networks (which may or may not have shared parameters):
I: (input feature map) convert incoming data to the internal feature representation.
G: (generalization) update memories given new input.
O: (output) produce new output (in feature representation space) given the memories.
R: (response) convert output O into a response seen by the outside world.
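A skeleton of the four components in Python; the embedding function, the append-only storage, and the dot-product scoring are placeholder choices for illustration, not the exact modules of Weston et al.

```python
import numpy as np

class MemoryNetwork:
    """Skeleton of the four MemNN components (I, G, O, R). Bag-of-features
    encoding and dot-product retrieval are simplifications for illustration."""
    def __init__(self, embed):
        self.embed = embed          # callable: text -> feature vector (assumed given)
        self.memory = []            # list of (text, feature) pairs

    def I(self, x):                 # input feature map
        return self.embed(x)

    def G(self, x, feat):           # generalization: here, simply store the new memory
        self.memory.append((x, feat))

    def O(self, q_feat):            # output: retrieve the best-matching memory
        scores = [feat @ q_feat for _, feat in self.memory]
        return self.memory[int(np.argmax(scores))]

    def R(self, q, best):           # response: turn the retrieved memory into an answer
        return best[0]              # here: just return the supporting sentence

    def tell(self, sentence):
        self.G(sentence, self.I(sentence))

    def ask(self, question):
        return self.R(question, self.O(self.I(question)))
```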
Memory Networks: Some Early Publications
§ J. Weston, S. Chopra, A. Bordes. Memory Networks. ICLR 2015 (and arXiv:1410.3916).
§ S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. End-To-End Memory Networks. NIPS 2015 (and arXiv:1503.08895).
§ J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698.
§ A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question Answering with Memory Networks.
§ J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. Miller, A. Szlam, J. Weston. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv:1511.06931.
§ F. Hill, A. Bordes, S. Chopra, J. Weston. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. arXiv:1511.02301.
§ J. Weston. Dialog-based Language Learning. arXiv:1604.06045.
§ A. Bordes, J. Weston. Learning End-to-End Goal-Oriented Dialog. arXiv:1605.07683.
Memory Module
[Figure: a controller module receives the input and holds an internal state vector (initially the query q); it repeatedly addresses and reads from the memory vectors m, then produces the output; supervision is direct or reward-based]
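A simplified NumPy sketch of one addressing-and-read hop in the spirit of end-to-end memory networks; sharing the same vectors for addressing and reading, and updating the state by a plain sum, are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(state, memories):
    """One addressing-and-read step over the memory vectors."""
    p = softmax(memories @ state)    # addressing: match internal state against each memory
    o = p @ memories                 # read: attention-weighted sum of memory vectors
    return state + o                 # updated internal state for the next hop

# Toy usage: 6 memory slots, dimension 16, two hops starting from the query
rng = np.random.default_rng(0)
memories = rng.normal(size=(6, 16))
state = rng.normal(size=16)          # internal state, initially the encoded query q
for _ in range(2):
    state = memory_hop(state, memories)
```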
Memory Network Models
Some implemented models…
Figure credit: Sainbayar Sukhbaatar

Variants of the class…
Some options and extensions:
− Representation of inputs and memories: bag of words, RNN-style reading at word or character level, etc.
− Response module: instead of a single word, use an RNN to output sentences
− If the memory is huge, organize it (e.g. via hashing) so that memory addressing and reading doesn't operate on all memories; see the sketch below
− If the memory is full, it could "forget" somehow, e.g. by replacing the memory deemed most useless; that would require a scoring function of the utility of each memory
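As a toy illustration of organizing memories by word hashing (a sketch of the idea, not the exact scheme from the papers above):

```python
from collections import defaultdict

# Inverted index over words: addressing only touches memories that share a
# word with the query, instead of scoring every memory.
index = defaultdict(set)
memories = ["Joe went to the kitchen", "Fred went to the kitchen",
            "Joe picked up the milk", "Joe travelled to the office"]

for i, m in enumerate(memories):
    for w in m.lower().split():
        index[w].add(i)

def candidates(query):
    ids = set()
    for w in query.lower().split():
        ids |= index.get(w, set())
    return [memories[i] for i in sorted(ids)]

print(candidates("milk"))   # only the memories mentioning "milk" are retrieved
```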