Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. ICML 2015
Presented By: Sai Krishna Bollam
Example generated captions: "A bird flying over a body of water." "A woman throwing a frisbee in a park."
Analogous to machine translation, but the encoder output is not a single vector
Attention is applied over low-level CNN feature maps instead of a joint object-text embedding
Bahdanau et al. (2014)
The encoder produces a set of annotation vectors from a convolutional layer (rather than a final FC layer):
a = {a_1, ..., a_L}, a_i ∈ R^D
Annotation vectors a_i are fed to the attention model f_att: an MLP conditioned on the previous hidden state h_{t-1}
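A minimal pure-Python sketch of such an attention model (all sizes and weights are hypothetical toy values; the paper's f_att is a learned MLP, approximated here by a single tanh unit):

```python
import math
import random

random.seed(0)

# Toy sizes: L locations, D-dimensional features, H-dimensional hidden state.
L, D, H = 4, 3, 5

def rand_row(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

# Hypothetical weights for a single-layer stand-in for the learned MLP f_att.
w_a = rand_row(D)  # scores the annotation vector a_i
w_h = rand_row(H)  # scores the previous hidden state h_{t-1}

def f_att(a_i, h_prev):
    # e_ti = f_att(a_i, h_{t-1}): one tanh unit here, an MLP in the paper.
    s = sum(x * w for x, w in zip(a_i, w_a))
    s += sum(x * w for x, w in zip(h_prev, w_h))
    return math.tanh(s)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

a = [rand_row(D) for _ in range(L)]  # annotation vectors
h_prev = rand_row(H)                 # previous decoder hidden state

e = [f_att(a_i, h_prev) for a_i in a]
alpha = softmax(e)  # attention weights over the L locations, summing to 1
print([round(w, 3) for w in alpha])
```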
Based on CS231n by Fei-Fei Li, Justin Johnson & Serena Yeung
[Diagram: the attention loop. A CNN maps the image (H x W x 3) to features (L x D). From the initial hidden state h0, the attention model produces a distribution a1 over the L locations; the weighted combination of features gives a D-dimensional context vector z1. The LSTM uses z1 to produce hidden state h1 and a distribution y1 over the first word; the loop repeats with a2, z2, y2, and so on.]
Attention scores: e_ti = f_att(a_i, h_{t-1}), computed from each annotation vector and the previous hidden state
Hard attention: sample the attended location s_t from a Multinoulli distribution parameterized by the attention weights {alpha_i}
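A sketch of that sampling step (the attention weights below are hypothetical toy values):

```python
import random

random.seed(0)

# Hypothetical attention weights over L = 4 locations.
alpha = [0.1, 0.6, 0.2, 0.1]

def sample_location(weights):
    """Draw one location index s_t from a Multinoulli distribution."""
    u = random.random()
    cum = 0.0
    for i, p in enumerate(weights):
        cum += p
        if u < cum:
            return i
    return len(weights) - 1  # guard against floating-point round-off

# Empirical check: sampled frequencies should approximate alpha.
counts = [0] * len(alpha)
for _ in range(10000):
    counts[sample_location(alpha)] += 1
print([c / 10000 for c in counts])
```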
Trained with the REINFORCE learning rule: a Monte Carlo sampling-based approximation of the gradient
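A hedged illustration of the REINFORCE idea on a toy two-outcome distribution, not the paper's full estimator: the gradient of an expectation is approximated by averaging f(s) * d/dtheta log p(s) over Monte Carlo samples.

```python
import math
import random

random.seed(0)

# Toy setup: p(s) is a two-outcome Multinoulli parameterized by theta through
# a sigmoid, and f(s) is a fixed reward per outcome (hypothetical values).
theta = 0.3
f = [1.0, 2.0]

p1 = 1.0 / (1.0 + math.exp(-theta))  # P(s = 1)

# Exact gradient for comparison: d/dtheta E[f(s)] = (f1 - f0) * p1 * (1 - p1).
exact = (f[1] - f[0]) * p1 * (1.0 - p1)

n = 200000
total = 0.0
for _ in range(n):
    s = 1 if random.random() < p1 else 0
    score = (1.0 - p1) if s == 1 else -p1  # d/dtheta log p(s)
    total += f[s] * score
estimate = total / n
print(round(exact, 4), round(estimate, 4))
```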
Expectation of the context vector:
E_{p(s_t | a)}[z_t] = sum_{i=1}^L alpha_{t,i} a_i
Deterministic (soft) attention computes this weighted sum directly:
phi({a_i}, {alpha_i}) = sum_i^L alpha_i a_i
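A minimal sketch of the soft-attention context vector, with toy annotation vectors and weights:

```python
# Soft attention: the context vector is the expectation over locations,
# z_t = sum_i alpha_{t,i} * a_i. All values below are hypothetical.
L, D = 4, 3
a = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]          # L annotation vectors, each D-dimensional
alpha = [0.1, 0.2, 0.3, 0.4]   # attention weights, summing to 1

z = [round(sum(alpha[i] * a[i][d] for i in range(L)), 6) for d in range(D)]
print(z)  # [0.5, 0.6, 0.7]
```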
The soft attention weighted vector corresponds to a Normalized Weighted Geometric Mean approximation
Gating scalar: beta_t = sigma(f_beta(h_{t-1})), which modulates how much the context vector contributes at each step
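A sketch of the gating scalar (f_beta is learned in the paper; here it is a single linear layer with hypothetical toy weights):

```python
import math

# Gating scalar beta_t = sigmoid(f_beta(h_{t-1})): scales how strongly the
# context vector is used at each decoding step.
w_beta = [0.5, -0.25, 0.1]  # hypothetical weights for the linear stand-in
h_prev = [1.0, 2.0, 3.0]    # toy previous hidden state

f_beta = sum(w * h for w, h in zip(w_beta, h_prev))
beta_t = 1.0 / (1.0 + math.exp(-f_beta))  # sigmoid keeps it in (0, 1)
print(round(beta_t, 4))
```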
Doubly stochastic regularization encourages the model to pay roughly equal attention to every part of the image over the course of generation:
L_d = -log P(y | x) + lambda * sum_i^L (1 - sum_t^C alpha_{t,i})^2
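A sketch of that penalty term with toy attention weights (lambda and the alpha values are hypothetical):

```python
# Doubly stochastic penalty: lambda * sum_i (1 - sum_t alpha_{t,i})^2,
# pushing the total attention each location receives toward 1 across steps.
lam = 1.0  # regularization strength (hypothetical value)

# alpha[t][i]: attention on location i at step t (T = 2 steps, L = 3 locations)
alpha = [[0.5, 0.3, 0.2],
         [0.1, 0.3, 0.6]]

num_locations = len(alpha[0])
penalty = lam * sum(
    (1.0 - sum(step[i] for step in alpha)) ** 2 for i in range(num_locations)
)
print(round(penalty, 4))  # column sums 0.6, 0.6, 0.8 -> 0.16 + 0.16 + 0.04
```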
CNN: Image (H x W x 3) -> grid of features a, b, c, d (each D-dimensional), with a distribution over grid locations pa + pb + pc + pd = 1
From the RNN: a context vector z (D-dimensional)
Soft attention: summarize ALL locations, z = pa*a + pb*b + pc*c + pd*d. The derivative dz/dp is nice, so train with gradient descent.
Hard attention: sample ONE location according to p; z = that location's vector. With argmax, dz/dp is zero almost everywhere, so gradient descent can't be used; reinforcement learning is needed.
Based on CS231n by Fei-Fei Li, Andrej Karpathy & Justin Johnson
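The soft-vs-hard contrast can be illustrated numerically (toy 1-D features; a finite-difference probe stands in for backprop):

```python
# Toy illustration of why soft attention is trainable by gradient descent
# and argmax-style hard attention is not: probe dz/dp with finite differences.
a = [2.0, 5.0, 3.0]   # feature value at each of three locations
p = [0.2, 0.5, 0.3]   # distribution over locations
eps = 1e-6

def soft(probs):
    return sum(pi * ai for pi, ai in zip(probs, a))  # z = sum_i p_i * a_i

def hard(probs):
    return a[max(range(len(probs)), key=lambda i: probs[i])]  # argmax pick

p_plus = [p[0] + eps] + p[1:]
soft_grad = (soft(p_plus) - soft(p)) / eps  # ~= a[0]: a useful gradient
hard_grad = (hard(p_plus) - hard(p)) / eps  # 0.0: the argmax did not move
print(round(soft_grad, 3), hard_grad)
```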
Soft attention vs. hard attention
Encoder features are taken from a convolutional layer of VGGnet before max pooling: the 14x14x512 feature map is flattened to 196 x 512 (L x D)
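The flattening step, sketched with plain Python lists (zeros stand in for real activations):

```python
# Flattening the encoder output: a 14 x 14 x 512 feature map becomes
# L = 196 annotation vectors of dimension D = 512.
H, W, D = 14, 14, 512
feature_map = [[[0.0] * D for _ in range(W)] for _ in range(H)]

# One annotation vector per (row, col) spatial location.
annotations = [feature_map[r][c] for r in range(H) for c in range(W)]
print(len(annotations), len(annotations[0]))  # 196 512
```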
Flickr8k: RMSProp; Flickr30k / MS COCO: Adam
Dropout; early stopping on BLEU score
Minibatches grouped by caption length