Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. ICML 2015. Presented by: Sai Krishna Bollam


  1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. ICML 2015. Presented by: Sai Krishna Bollam

  2. Outline • Introduction • Model Overview • Model Details • Encoder • Decoder • Attention • Experiments • Results • Conclusion

  3. Introduction • Multimodal machine learning: relate information from multiple modalities (speech, image, language, etc.); scene understanding • Automatic caption generation • Task: given an image, generate a sentence describing it • Combines object detection and machine translation: image-to-language translation • Example captions: "A woman throwing a frisbee in a park." "A bird flying over a body of water."

  4. Model Overview • Encoder-decoder framework, analogous to machine translation with attention (Bahdanau et al., 2014), but: the encoder output is not a single vector; attention is applied over low-level feature maps • Learns alignments from scratch, instead of relying on a joint object-text embedding

  5. Model Details: Encoder • Model: input is a raw image; output is a sequence of C words from a vocabulary of size K, y = {y_1, …, y_C}, y_i ∈ ℝ^K • Encoder: a convolutional neural network; input is the raw image, output is a set of feature vectors (annotation vectors) from a lower convolutional layer, a = {a_1, …, a_L}, a_i ∈ ℝ^D • (Figure: image → conv layers → L × D annotation matrix)
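
As a concrete illustration, here is a minimal sketch of this encoding step, assuming torchvision's VGG-19 as a stand-in for the Oxford VGGnet; the truncation index and all tensor names are illustrative choices, not the authors' code.

```python
# A rough sketch, assuming torchvision's VGG-19 as a stand-in encoder.
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
# Keep only the convolutional layers, stopping before the final max pool
# (index 36 is the last MaxPool2d in torchvision's VGG-19 `features`),
# so a 224x224 input yields a 14x14x512 feature map.
conv_encoder = torch.nn.Sequential(*list(vgg.features.children())[:36]).eval()

image = torch.randn(1, 3, 224, 224)               # dummy preprocessed image
with torch.no_grad():
    fmap = conv_encoder(image)                    # (1, 512, 14, 14)
# Flatten the spatial grid: L = 14*14 = 196 locations, each D = 512 dims.
annotations = fmap.flatten(2).permute(0, 2, 1)    # (1, 196, 512) = (batch, L, D)
```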

  6. Model Details: Decoder • LSTM network • ẑ_t: context vector
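
A minimal sketch of one decoder step, assuming a PyTorch LSTMCell fed with the previous word embedding concatenated with the context vector ẑ_t; the layer sizes and the single linear output layer are simplifying assumptions, not the paper's exact deep output layer.

```python
# A rough sketch of a single decoding step; sizes and names are illustrative.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, feat_dim = 10000, 256, 512, 512

embed = nn.Embedding(vocab_size, embed_dim)
lstm_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)  # input: [E y_{t-1}; z_t]
to_vocab = nn.Linear(hidden_dim, vocab_size)               # scores over the vocabulary

def decode_step(prev_word, z_t, h, c):
    """prev_word: (B,) word ids, z_t: (B, feat_dim) context, h/c: (B, hidden_dim)."""
    x = torch.cat([embed(prev_word), z_t], dim=1)
    h, c = lstm_cell(x, (h, c))
    return to_vocab(h), h, c

# Usage with dummy tensors:
B = 4
logits, h, c = decode_step(torch.zeros(B, dtype=torch.long),
                           torch.randn(B, feat_dim),
                           torch.zeros(B, hidden_dim),
                           torch.zeros(B, hidden_dim))
```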

  7. Model Details: Decoder – Context Vector • Context vector ẑ_t: a dynamic representation of the relevant part of the image at time t, ẑ_t = φ({a_i}, {α_i}) • The attention weights α_i over the annotation vectors are calculated using f_att • f_att: the attention model, an MLP conditioned on the previous hidden state h_{t-1}
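
A minimal sketch of f_att, assuming the common additive (MLP) form conditioned on h_{t-1}; the projection sizes and names are illustrative.

```python
# A rough sketch of f_att as an additive (MLP) scorer; dimensions illustrative.
import torch
import torch.nn as nn

feat_dim, hidden_dim, attn_dim = 512, 512, 256

proj_a = nn.Linear(feat_dim, attn_dim)     # projects each annotation vector a_i
proj_h = nn.Linear(hidden_dim, attn_dim)   # projects the previous hidden state h_{t-1}
score = nn.Linear(attn_dim, 1)             # scalar score e_{t,i} per location

def f_att(a, h_prev):
    """a: (B, L, feat_dim), h_prev: (B, hidden_dim) -> alpha: (B, L)."""
    e = score(torch.tanh(proj_a(a) + proj_h(h_prev).unsqueeze(1))).squeeze(-1)
    return torch.softmax(e, dim=1)          # attention weights alpha_{t,i}

alpha = f_att(torch.randn(2, 196, feat_dim), torch.randn(2, hidden_dim))
```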

  8. Model Representation (Figure: the CNN maps the H × W × 3 image to L × D features; the LSTM hidden states h_t produce attention distributions over the L locations and distributions over the vocabulary; each step consumes the attention-weighted combination of features z_t and the previous word y_t.) Based on CS231n by Fei-Fei Li, Justin Johnson & Serena Yeung

  9. Attention Mechanism: Stochastic "Hard" Attention • At every time step, focus on exactly one location a_i • s_{t,i} = 1 iff the i-th location is used to extract visual features • p(s_{t,i} = 1 | s_{j<t}, a) = α_{t,i} = softmax(f_att(a_i, h_{t-1})) • ẑ_t = Σ_i s_{t,i} a_i, with the location sampled from a Multinoulli distribution parameterized by {α_i} • Lower bound on the marginal log-likelihood: L_s = Σ_s p(s | a) log p(y | s, a) ≤ log Σ_s p(s | a) p(y | s, a) = log p(y | a)
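
A minimal sketch of the sampling step, assuming torch.multinomial draws the location from the Multinoulli parameterized by α_t; names are illustrative.

```python
# A rough sketch of the hard-attention context: sample one location per step.
import torch

def hard_context(a, alpha):
    """a: (B, L, D) annotation vectors, alpha: (B, L) attention weights."""
    s_t = torch.multinomial(alpha, num_samples=1)        # (B, 1) sampled location index
    z_t = a[torch.arange(a.size(0)), s_t.squeeze(1)]     # (B, D) the chosen a_i
    return z_t, s_t

z_t, s_t = hard_context(torch.randn(2, 196, 512),
                        torch.softmax(torch.randn(2, 196), dim=1))
```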

  10. Attention Mechanism: Stochastic Attention • The gradient of L_s is approximated with Monte Carlo based sampling of the attention locations • The resulting parameter update corresponds to the REINFORCE learning rule
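
A minimal sketch of such a REINFORCE-style update, assuming the caption log-likelihood of a sampled attention path serves as the reward and a moving-average baseline reduces variance; the names (baseline, lambda_r) are illustrative and details from the paper, such as the entropy term, are omitted.

```python
# A rough sketch of a REINFORCE-style surrogate loss for hard attention.
import torch

baseline = 0.0  # running average of the reward (variance reduction)

def hard_attention_loss(log_p_caption, log_p_samples, lambda_r=1.0):
    """log_p_caption: (B,) log p(y | s, a) for a sampled attention path.
    log_p_samples: (B,) sum over t of log alpha_{t, s_t} (log-prob of the samples)."""
    global baseline
    reward = log_p_caption.detach()
    baseline = 0.9 * baseline + 0.1 * reward.mean().item()
    # First term trains the decoder on the sampled path; the second is the
    # REINFORCE term pushing the attention weights toward high-reward locations.
    return -(log_p_caption + lambda_r * (reward - baseline) * log_p_samples).mean()
```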

  11. Attention Mechanisms: Deterministic "Soft" Attention • Take the expectation of the context vector instead of sampling; the model is differentiable: E_{p(s_t | a)}[ẑ_t] = Σ_{i=1}^{L} α_{t,i} a_i • Soft attention weighted vector: φ({a_i}, {α_i}) = Σ_{i=1}^{L} α_i a_i • Normalized Weighted Geometric Mean (NWGM)
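
A minimal sketch of the soft context vector as a batched weighted sum over all locations.

```python
# A rough sketch of the soft-attention context: a weighted sum over all locations.
import torch

def soft_context(a, alpha):
    """a: (B, L, D), alpha: (B, L) -> z_t: (B, D) = sum_i alpha_{t,i} * a_i."""
    return (alpha.unsqueeze(-1) * a).sum(dim=1)

z_t = soft_context(torch.randn(2, 196, 512),
                   torch.softmax(torch.randn(2, 196), dim=1))
```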

  12. Attention Mechanisms: Deterministic Attention • Doubly stochastic attention: introduce a regularization term encouraging Σ_t α_{t,i} ≈ 1, so the model pays equal attention to every part of the image over time • Gating scalar: β_t = σ(f_β(h_{t-1})) • φ({a_i}, {α_i}) = β Σ_{i=1}^{L} α_i a_i
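
A minimal sketch of the regularizer and the gating scalar; the penalty weight λ and the size of f_β are illustrative assumptions.

```python
# A rough sketch of the doubly stochastic penalty and the gating scalar beta_t.
import torch
import torch.nn as nn

f_beta = nn.Linear(512, 1)  # hidden_dim -> scalar gate (size illustrative)

def doubly_stochastic_penalty(alphas, lam=1.0):
    """alphas: (B, T, L) attention weights over all T decoding steps."""
    # Penalize locations whose total attention over time deviates from 1.
    return lam * ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()

def gated_context(a, alpha, h_prev):
    """Rescale the soft context by beta_t = sigmoid(f_beta(h_{t-1}))."""
    beta = torch.sigmoid(f_beta(h_prev))                 # (B, 1)
    return beta * (alpha.unsqueeze(-1) * a).sum(dim=1)   # (B, D)
```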

  13. Attention Mechanisms • Soft attention: summarize ALL locations, z = p_a a + p_b b + p_c c + p_d d with p_a + p_b + p_c + p_d = 1; the derivative dz/dp is well behaved, so the model can be trained with gradient descent • Hard attention: sample ONE location according to p and set z to that vector; with argmax, dz/dp is zero almost everywhere, so gradient descent cannot be used and reinforcement learning is needed • (Figure: a CNN maps the H × W × 3 image to a grid of D-dimensional features; the RNN outputs a distribution over grid locations, which is combined into the D-dimensional context vector z.) Based on CS231n by Fei-Fei Li, Andrej Karpathy & Justin Johnson

  14. Visualizing Attention (Figure: example images with soft attention and hard attention maps.)

  15. Experiments • Encoder: Oxford VGGnet pretrained on ImageNet; feature maps from the fourth convolutional layer before max pooling, 14 × 14 × 512, flattened to 196 × 512 (L × D) • Datasets: Flickr8k (8,000 images), Flickr30k (30,000 images), MS COCO (82,783 images); vocabulary of 10,000 words; 5 reference sentences per image • Training: RMSProp on Flickr8k, Adam on Flickr30k/MS COCO; dropout; early stopping on BLEU; batching by sentence length • Metrics: BLEU-1, 2, 3, 4 (no brevity penalty) and METEOR
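
A minimal sketch of BLEU scoring with NLTK's corpus_bleu as a stand-in evaluator; note that NLTK applies a brevity penalty, unlike the setting on this slide, and the example captions are invented.

```python
# A rough sketch of BLEU scoring with NLTK; references here are made up.
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "bird", "flying", "over", "water"],
               ["a", "bird", "flies", "over", "a", "lake"]]]  # 1 image, 2 of its references
hypotheses = [["a", "bird", "flying", "over", "a", "lake"]]   # generated caption (tokenized)

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print(f"BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.3f}")
```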

  16. Results

  17. Key Points • Learns latent alignments from scratch • Provides better context to the decoder • Can attend to non-object regions • Joint representation • Attention maps can be visualized to interpret how the model works • Stochastic ("hard") attention • Deterministic ("soft") attention

  18. Thank you. Questions?
