Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention - PowerPoint PPT Presentation



SLIDE 1

Show, Attend, and Tell Neural Image Caption Generation with Visual Attention

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio University of Montreal and University of Toronto Presented By: Hannah Li, Sivaraman K S


SLIDE 2

Introduction

  • We can easily segment, localize, and categorize objects in an image
  • However, interpreting the image as a whole is more difficult
  • Goal of this work: generate captions for images using an attention mechanism


SLIDE 3

Related Work - Generating Image Captions

  • Recurrent neural networks (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014)
  • LSTM for videos and images (Vinyals et al., 2014; Donahue et al., 2014)
  • Joint CNN-RNN with object detection (Karpathy & Li, 2014; Fang et al., 2014)
  • Attention (Larochelle & Hinton, 2010)


SLIDE 4

Model Overview


The model generates a caption y as a sequence of one-hot encoded words

SLIDE 5

Encoder: Convolutional Features

Goal: take a raw image as input and produce a set of feature vectors (annotation vectors). The encoder produces L vectors, each a D-dimensional representation corresponding to a part of the image.
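As a rough sketch of what the encoder produces: taking the convolutional feature map of a CNN (the paper uses a 14x14x512 layer, giving L = 196 and D = 512) and flattening its spatial grid yields one annotation vector per image region. The function and shapes below are illustrative, not the paper's exact pipeline.

```python
import numpy as np

# Illustrative sketch: flatten a conv feature map (H x W x D) into
# L = H*W annotation vectors a_i, each D-dimensional.
def to_annotation_vectors(feature_map):
    h, w, d = feature_map.shape           # spatial grid and channel depth
    return feature_map.reshape(h * w, d)  # L vectors, one per spatial location

feature_map = np.random.rand(14, 14, 512)  # stand-in for real CNN features
a = to_annotation_vectors(feature_map)
print(a.shape)  # (196, 512): L = 196 annotation vectors of dimension D = 512
```

Each row of `a` corresponds to a different region of the image, which is what lets the decoder attend to different locations at each time step.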


SLIDE 6

Decoder: Long Short-Term Memory Network


The LSTM maintains input, forget, memory (cell), output, and hidden states. Notation: W, U, Z are weight matrices; b are bias vectors; E is an embedding matrix; z_t is the context vector representing the relevant part of the image at time t.
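With that notation, the decoder's gate equations from the paper take the standard LSTM form, conditioned on the previous word embedding E y_{t-1}, the previous hidden state h_{t-1}, and the context vector z_t:

```latex
\begin{aligned}
i_t &= \sigma(W_i E y_{t-1} + U_i h_{t-1} + Z_i \hat{z}_t + b_i)\\
f_t &= \sigma(W_f E y_{t-1} + U_f h_{t-1} + Z_f \hat{z}_t + b_f)\\
o_t &= \sigma(W_o E y_{t-1} + U_o h_{t-1} + Z_o \hat{z}_t + b_o)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c E y_{t-1} + U_c h_{t-1} + Z_c \hat{z}_t + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here sigma is the logistic sigmoid and the circled dot is element-wise multiplication.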

SLIDE 7

Decoder: Long Short-Term Memory Network


  • σ is the logistic sigmoid activation
  • A deep output layer computes the output word probability
  • Stochastic attention: α_i is the probability that location i is the correct place to focus on for producing the next word
  • Deterministic attention: α_i is the relative importance to give to location i in blending the a_i's together
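The weights α_i come from a softmax over per-location scores. The sketch below is a hypothetical version of the attention network f_att (the exact parameterization of the scoring MLP is an assumption, not the paper's code): each annotation vector is scored against the previous hidden state, and the softmax turns the scores into weights that sum to 1.

```python
import numpy as np

# Hypothetical sketch of the attention network f_att: score each
# annotation vector a_i against the previous hidden state, then softmax.
def attention_weights(a, h_prev, W_a, W_h, v):
    # a: (L, D) annotation vectors; h_prev: (n,) previous LSTM hidden state
    e = np.tanh(a @ W_a + h_prev @ W_h) @ v  # unnormalized scores e_i, shape (L,)
    e = e - e.max()                          # subtract max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()      # softmax: alpha_i >= 0, sum to 1
    return alpha

rng = np.random.default_rng(0)
L, D, n, k = 196, 512, 256, 64               # illustrative dimensions
a = rng.standard_normal((L, D))
h = rng.standard_normal(n)
alpha = attention_weights(a, h, rng.standard_normal((D, k)),
                          rng.standard_normal((n, k)), rng.standard_normal(k))
print(alpha.shape, round(float(alpha.sum()), 6))  # (196,) 1.0
```

The same α_i feed both variants: hard attention samples a location from them, soft attention uses them as blending weights.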

SLIDE 8

Hard Attention

  • Hard attention: learn to maximize the context vector z_t, built from a one-hot encoded variable s_{t,i} and the extracted features a_i
  • s_t indicates where the model decides to focus attention when generating the t-th word
  • Stochastic: s_t is assigned a Multinoulli distribution, and the model is trained using a sampling method
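A minimal sketch of the sampling step (function name and uniform weights are illustrative): hard attention draws a single location s_t from a Multinoulli (categorical) distribution over the α_i and uses only that region's annotation vector as the context.

```python
import numpy as np

# Illustrative hard attention: sample one location s_t ~ Multinoulli(alpha),
# then the context is the single annotation vector at that location.
def hard_attention_context(a, alpha, rng):
    s_t = rng.choice(len(alpha), p=alpha)  # one-hot choice of a location
    return a[s_t]                          # z_t = a_{s_t}: one region only

rng = np.random.default_rng(0)
a = rng.standard_normal((196, 512))        # L = 196 annotation vectors
alpha = np.full(196, 1 / 196)              # uniform weights, for illustration
z = hard_attention_context(a, alpha, rng)
print(z.shape)  # (512,)
```

Because the sampling step is not differentiable, training relies on a sampling-based gradient estimator rather than plain backpropagation.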


SLIDE 9

Soft Attention

  • Learns by maximizing the expectation of the context vector under the attention distribution
  • Trained end-to-end with standard backpropagation
  • Deterministic: the whole distribution is optimized, not single choices (s_t is not picked from a distribution)
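The expected context vector is just a weighted average of the annotation vectors, which is fully differentiable. A toy sketch with made-up numbers:

```python
import numpy as np

# Soft attention: the context is the expectation of a_i under alpha,
# z_t = sum_i alpha_i * a_i, which is differentiable end-to-end.
def soft_attention_context(a, alpha):
    return alpha @ a  # weighted average of the rows of a, shape (D,)

a = np.arange(6, dtype=float).reshape(3, 2)  # L = 3 toy annotation vectors
alpha = np.array([0.5, 0.25, 0.25])          # attention weights, sum to 1
print(soft_attention_context(a, alpha))      # [1.5 2.5]
```

Since every α_i contributes, gradients flow to all locations at once, which is why no sampling estimator is needed.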


SLIDE 10

Training

The attention framework learns latent alignments from scratch instead of explicitly using object detectors. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.


SLIDE 11

Dataset

  • Flickr8k and Flickr30k datasets: 5 reference captions per image
  • MS COCO dataset: captions in excess of 5 per image were discarded
  • Applied basic tokenization
  • Fixed vocabulary size of 10K


SLIDE 12

Results

1. Significantly improves the state-of-the-art METEOR performance on MS COCO
2. More flexibility: the model can attend to salient non-object regions


SLIDE 13


Source: http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.42.58-PM.png

SLIDE 14

Analysis of learning to attend


SLIDE 15

Mistakes


SLIDE 16

Reference

  • https://arxiv.org/pdf/1502.03044.pdf
  • http://kelvinxu.github.io/projects/capgen.html
  • http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

  • https://blog.heuritech.com/2016/01/20/attention-mechanism/
