Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
Presented by Kathy Ge

Motivation: Attention
“Attention allows for salient features to dynamically come to the forefront as needed”
– Deterministic “soft” mechanism
– Stochastic “hard” mechanism
Attention lets the model focus on the salient parts of the information encoded in images
– ht: hidden state of the LSTM at time t
– ai: annotation vector produced by the CNN encoder, each associated with a particular input location
– φ: the function which computes the context vector ẑt given the annotation vectors and their corresponding weights
– αi: for each location i, the attention mechanism generates a positive weight, computed by an attention model fatt, a multi-layer perceptron conditioned on the previous hidden state ht−1 and followed by a softmax
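The soft attention step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact parameterization: the matrices W_a, W_h and vector w stand in for fatt's MLP, and the sizes are illustrative (196 locations matches a 14x14 conv map as in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, n = 196, 512, 256                  # locations, annotation dim, hidden size
a = rng.standard_normal((L, D))          # annotation vectors a_1..a_L from the CNN
h_prev = rng.standard_normal(n)          # previous LSTM hidden state h_{t-1}
W_a = 0.01 * rng.standard_normal((D, n)) # illustrative stand-ins for f_att's MLP
W_h = 0.01 * rng.standard_normal((n, n))
w = 0.01 * rng.standard_normal(n)

# f_att: an MLP scores each location against h_{t-1} ...
e = np.tanh(a @ W_a + h_prev @ W_h) @ w  # unnormalized scores, shape (L,)
# ... and a softmax turns the scores into positive weights summing to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# soft attention: the context vector z_t is the alpha-weighted mean annotation
z_t = alpha @ a                          # shape (D,)
```

In the soft variant this expectation is differentiable, which is what makes end-to-end backpropagation possible.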
The soft mechanism is deterministic, so the whole model can be trained end-to-end with standard backpropagation
Training uses a doubly stochastic regularization, where, in addition to Σi αti = 1, the model is encouraged to satisfy Σt αti ≈ 1, so that it pays attention to every part of the image throughout the caption generation
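A sketch of that penalty under the usual formulation λr Σi (1 − Σt αti)², assuming a (T, L) array of per-step attention weights; the value of lam_r and the array sizes here are illustrative, not the paper's settings.

```python
import numpy as np

def doubly_stochastic_penalty(alphas, lam_r=0.01):
    """Penalize image locations whose total attention over all T steps
    deviates from 1: lam_r * sum_i (1 - sum_t alpha_ti)^2."""
    per_location = alphas.sum(axis=0)        # sum over time, shape (L,)
    return lam_r * np.sum((1.0 - per_location) ** 2)

# toy alphas: T steps, L locations, each row a softmax over locations
rng = np.random.default_rng(0)
T, L = 10, 196
scores = rng.standard_normal((T, L))
alphas = np.exp(scores)
alphas /= alphas.sum(axis=1, keepdims=True)
penalty = doubly_stochastic_penalty(alphas)  # >= 0; zero only when every
                                             # location gets total mass 1
```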
This penalty was observed to improve BLEU scores quantitatively and to yield more rich and descriptive captions
st is the location variable indicating where the model decides to focus attention when generating the tth word. Training maximizes a variational lower bound on the marginal log-likelihood with a REINFORCE-style learning rule; the variance of the gradient estimator is reduced by a moving-average baseline b and by introducing an entropy term H[s],
where λr and λe are hyperparameters, and b is a moving-average baseline of the log-likelihood, updated with exponential decay
The hard attention mechanism returns a sampled annotation vector ai at every point in time, based on a multinomial distribution parameterized by α
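A minimal numpy sketch of the sampling step and the baseline update, assuming attention weights alpha over L locations. The 0.9/0.1 decay follows the paper's moving-average baseline; the array sizes and the toy log-likelihood values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 196, 512
a = rng.standard_normal((L, D))          # annotation vectors a_1..a_L
scores = rng.standard_normal(L)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                     # multinomial parameters over locations

# hard attention: sample one location and use its annotation as the context
s_t = rng.choice(L, p=alpha)
z_t = a[s_t]                             # shape (D,), a single annotation vector

# moving-average baseline of the log-likelihood with exponential decay:
# b_k = 0.9 * b_{k-1} + 0.1 * log p  (toy log-likelihoods below)
b = 0.0
for log_p in (-2.0, -1.5, -1.2):
    b = 0.9 * b + 0.1 * log_p
```

Because the sample is discrete, gradients cannot flow through it directly, which is why the variational lower bound and the variance-reduction tricks above are needed.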
Example generated captions:
– “A woman is throwing a frisbee in a park.”
– “A man and a woman playing frisbee in a field.”
– “A woman holding a clock in her hand.”
– “A woman is holding a donut in his hand.”
In summary, the model takes an image and generates a caption sequence. Two attention variants are trained: a stochastic “hard” mechanism, by maximizing a variational lower bound, and a deterministic “soft” attention mechanism, using standard backpropagation. Through qualitative analysis, the authors show that the learned alignments of words to locations in an image correspond well to human intuition.