  1. Show, Attend, and Tell Neural Image Caption Generation with Visual Attention Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio University of Montreal and University of Toronto Presented By: Hannah Li, Sivaraman K S 1

  2. Introduction We can easily segment, localize, and categorize objects in an image. However, interpreting the image as a whole is more difficult. Goal of this work: generate captions for images using an attention mechanism.

  3. Related Work - Generating image captions - Recurrent neural networks (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014) - LSTMs for videos and images (Vinyals et al., 2014; Donahue et al., 2014) - Joint CNN-RNN models with object detection (Karpathy & Li, 2014; Fang et al., 2014) - Attention (Larochelle & Hinton, 2010)

  4. Model Overview The model generates a caption y as a sequence of encoded words.

  5. Encoder: Convolutional Features Goal: take a raw image as input and produce a set of feature vectors (annotation vectors). The encoder produces L vectors, each a D-dimensional representation corresponding to a part of the image.
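The encoder step above can be sketched as follows. This is an illustrative toy, not the paper's code: the real encoder is a convolutional network, so the feature map here is a random stand-in, and the spatial size (14x14) and channel count (D=512) are assumed values.

```python
import numpy as np

# Hypothetical stand-in for a CNN feature map of spatial size HxW with
# D channels; in the paper this comes from a convolutional layer.
H, W, D = 14, 14, 512
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((H, W, D))

# Flatten the spatial grid into L = H*W annotation vectors a_1..a_L,
# each a D-dimensional descriptor of one image region.
L = H * W
annotations = feature_map.reshape(L, D)
print(annotations.shape)  # (196, 512)
```

Each row of `annotations` is one annotation vector a_i that the decoder's attention mechanism can later weight or select.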

  6. Decoder: Long Short-Term Memory Network The LSTM has input, forget, memory, output, and hidden state components. W, U, Z: weight matrices; b: biases; E: an embedding matrix; z_t: representation of the relevant part of the image at time t.
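A single LSTM step using the slide's notation can be sketched as below. This is a minimal sketch, not the paper's implementation: the gates take the previous word embedding Ey_{t-1}, the previous hidden state h_{t-1}, and the context vector z_t; all weights are random stand-ins and the dimensions are assumed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed sizes: word embedding m, hidden state n, context vector d.
m, n, d = 64, 128, 512
rng = np.random.default_rng(0)
# One (W, U, Z, b) set per gate: input i, forget f, output o, candidate g.
W = {g: rng.standard_normal((n, m)) * 0.01 for g in "ifog"}
U = {g: rng.standard_normal((n, n)) * 0.01 for g in "ifog"}
Z = {g: rng.standard_normal((n, d)) * 0.01 for g in "ifog"}
b = {g: np.zeros(n) for g in "ifog"}

def lstm_step(Ey_prev, h_prev, c_prev, z_t):
    # Pre-activations combine the previous embedding, previous hidden
    # state, and the attended image context z_t.
    pre = {g: W[g] @ Ey_prev + U[g] @ h_prev + Z[g] @ z_t + b[g] for g in "ifog"}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])
    c = f * c_prev + i * g      # new memory cell
    h = o * np.tanh(c)          # new hidden state
    return h, c

h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n),
                 rng.standard_normal(d))
print(h.shape, c.shape)  # (128,) (128,)
```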

  7. Decoder: Long Short-Term Memory Network The gates use the logistic sigmoid activation, and a deep output layer computes the output word probability. Stochastic attention: the weight for location i is the probability that i is the correct place to focus on for producing the next word. Deterministic attention: the weight is the relative importance to give to location i in blending the a_i's together.
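The attention weights described above come from scoring each annotation vector against the decoder state and normalizing with a softmax. The sketch below assumes a simple linear scoring function and random stand-in weights; the paper's f_att is a small network, so treat the names and sizes here as illustrative.

```python
import numpy as np

L, D, n = 196, 512, 128
rng = np.random.default_rng(1)
a = rng.standard_normal((L, D))           # annotation vectors a_1..a_L
h_prev = rng.standard_normal(n)           # previous decoder hidden state
Wa = rng.standard_normal((1, D)) * 0.01   # stand-in scoring weights
Wh = rng.standard_normal((1, n)) * 0.01

# Score e_{t,i} for each location, then softmax over the L locations.
scores = (a @ Wa.T + (Wh @ h_prev)).ravel()
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
print(alpha.shape)  # (196,)
```

The resulting `alpha` is interpreted either as a probability of focusing on location i (stochastic/hard attention) or as a blending weight (deterministic/soft attention).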

  8. Hard Attention Hard attention: learn to maximize the context vector z, computed from a one-hot encoded variable s_{t,i} and the extracted features a_i. Trained using a sampling method. s_t: where the model decides to focus attention when generating the t-th word. Stochastic: s_t is assigned a Multinoulli distribution.
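The sampling step can be sketched as below: s_t is a one-hot variable drawn from a Multinoulli (categorical) distribution with parameters alpha, and the context vector z_t is then the single selected annotation vector. The sizes and the uniform alpha are stand-ins for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
L, D = 196, 512
a = rng.standard_normal((L, D))          # annotation vectors
alpha = np.full(L, 1.0 / L)              # attention weights (uniform stand-in)

# Sample one location i from the Multinoulli, set s_{t,i} = 1.
i = rng.choice(L, p=alpha)
s_t = np.zeros(L)
s_t[i] = 1.0

# With a one-hot s_t, the context vector is exactly the chosen a_i.
z_t = s_t @ a
print(np.allclose(z_t, a[i]))  # True
```

Because this sampling step is non-differentiable, training relies on a sampling-based estimator rather than plain backpropagation, which is why the slide contrasts it with the end-to-end soft variant.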

  9. Soft Attention Learning by maximizing the expectation of the context vector. Trained end-to-end. Deterministic: the whole distribution is optimized rather than single choices (s_t is not sampled from a distribution).
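Taking the expectation of the context vector amounts to a weighted blend of all annotation vectors, which keeps every step differentiable. A minimal sketch, with random stand-in values:

```python
import numpy as np

rng = np.random.default_rng(3)
L, D = 196, 512
a = rng.standard_normal((L, D))          # annotation vectors
scores = rng.standard_normal(L)          # stand-in attention scores
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                     # softmax weights alpha_{t,i}

# Expected context vector: E[z_t] = sum_i alpha_{t,i} * a_i.
z_t = alpha @ a
print(z_t.shape)  # (512,)
```

Unlike the hard-attention sample, `z_t` here depends smoothly on `alpha`, so the whole model can be trained end-to-end with standard backpropagation.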

  10. Training The attention framework learns latent alignments from scratch instead of relying on explicit object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.

  11. Dataset Flickr8k and Flickr30k datasets: 5 reference captions per image. MS COCO dataset: captions in excess of 5 per image were discarded. Applied basic tokenization. Fixed vocabulary size of 10K.

  12. Results 1. Significantly improved state-of-the-art METEOR performance on MS COCO. 2. More flexibility: the model can attend to salient regions that are not objects.

  13. Source: http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.42.58-PM.png

  14. Analysis of learning to attend

  15. Mistakes

  16. References ● https://arxiv.org/pdf/1502.03044.pdf ● http://kelvinxu.github.io/projects/capgen.html ● http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/ ● https://blog.heuritech.com/2016/01/20/attention-mechanism/
