Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - PowerPoint PPT Presentation


SLIDE 1

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. ICML 2015

Presented By: Sai Krishna Bollam

SLIDE 2

Outline

  • Introduction
  • Model Overview
  • Model Details
  • Encoder
  • Decoder
  • Attention
  • Experiments
  • Results
  • Conclusion

SLIDE 3

Introduction

  • Multimodal Machine Learning
  • Relates information from multiple modalities: speech, images, language, etc.
  • Scene understanding
  • Automatic caption generation
  • Task: given an image, generate a sentence describing it
  • Object Detection and Machine Translation
  • Image-to-language translation

[Example captions: "A bird flying over a body of water." "A woman throwing a frisbee in a park."]

SLIDE 4

Model Overview


Encoder-Decoder framework

Analogous to machine translation, but the encoder output is not a single vector

Learn alignments from scratch

Uses attention over low-level feature maps instead of a joint object-text embedding

Bahdanau et al. (2014)

SLIDE 5

Model Details: Encoder


  • Model:
  • Input: Raw image
  • Output: Sequence of C words from vocabulary of size K

$\mathbf{y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\},\ \mathbf{y}_i \in \mathbb{R}^K$

  • Encoder: Convolutional Neural Network
  • Input: Raw image
  • Output: multiple feature vectors (annotation vectors) from lower conv layers

$\mathbf{a} = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\},\ \mathbf{a}_i \in \mathbb{R}^D$

[Figure: image -> Conv NN -> L x D feature maps -> FC layers]
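The annotation vectors above can be sketched in numpy: a convolutional feature map (the paper uses a 14x14x512 one, as noted on the Experiments slide) is flattened into L = 196 vectors of dimension D = 512. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

# Stand-in for the CNN encoder's conv-layer output (random here, for shape only)
feature_map = np.random.rand(14, 14, 512)      # H' x W' x D conv features

# Flatten spatial grid into L annotation vectors a_1 ... a_L, each in R^D
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)        # shape (196, 512), i.e. L x D
```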

SLIDE 6

Model Details: Decoder


  • LSTM Network

$\hat{\mathbf{z}}_t$: context vector

SLIDE 7

Model Details: Decoder – Context Vector

Context vector ($\hat{\mathbf{z}}_t$): a dynamic representation of the relevant part of the image at time $t$

$\hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$

  • Attention weights $\alpha_i$ are calculated using $f_{\text{att}}$ over the annotation vectors
  • $f_{\text{att}}$: the attention model, an MLP conditioned on the previous hidden state $\mathbf{h}_{t-1}$
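The attention MLP described above can be sketched as follows. This is an illustrative additive-attention shape check, not the paper's code; the parameter names `W_a`, `W_h`, and `w` are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention model f_att: an MLP scoring each annotation vector
# a_i against the previous LSTM hidden state h_prev, then normalizing.
def f_att(a, h_prev, W_a, W_h, w):
    # a: (L, D) annotation vectors; h_prev: (n,) previous hidden state
    scores = np.tanh(a @ W_a + h_prev @ W_h) @ w   # (L,) unnormalized scores
    return softmax(scores)                         # alpha_{t,i}, sums to 1

rng = np.random.default_rng(0)
L, D, n, k = 196, 512, 100, 64
a = rng.normal(size=(L, D))
alpha = f_att(a,
              rng.normal(size=n),
              rng.normal(size=(D, k)) / np.sqrt(D),
              rng.normal(size=(n, k)) / np.sqrt(n),
              rng.normal(size=k))
```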

SLIDE 8

Model Representation


Based on CS231n by Fei-Fei Li, Justin Johnson & Serena Yeung

[Figure: the model unrolled over time. A CNN maps the image (H x W x 3) to L x D features. At each step the LSTM hidden state produces a distribution over the L locations; a weighted combination of the features gives the context vector z, which together with the previous word feeds the LSTM to produce a distribution over the vocabulary for the next word.]

SLIDE 9

Attention Mechanism: Stochastic Attention

Stochastic "Hard" Attention: at every time step, focus on exactly one location. $s_{t,i} = 1$ iff the $i$-th location is used to extract visual features.

$p(s_{t,i} = 1 \mid s_{<t}, \mathbf{a}) = \alpha_{t,i} = \text{softmax}_i\big(f_{\text{att}}(\mathbf{a}_i, \mathbf{h}_{t-1})\big)$

$\hat{\mathbf{z}}_t = \sum_i s_{t,i}\, \mathbf{a}_i$

Variational lower bound on the log-likelihood:

$L_s = \sum_s p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a}) \le \log \sum_s p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a}) = \log p(\mathbf{y} \mid \mathbf{a})$

Sample $s_t$ from a multinoulli distribution parameterized by $\{\alpha_{t,i}\}$.
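The sampling step above can be sketched in a few lines of numpy: one location is drawn from a multinoulli (categorical) distribution over the attention weights, and the context vector is the single chosen annotation vector. The weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 4
a = rng.normal(size=(L, D))            # annotation vectors a_1 ... a_L
alpha = rng.dirichlet(np.ones(L))      # attention weights: non-negative, sum to 1

# Hard attention: sample exactly one location s_t ~ Multinoulli(alpha)
s_t = rng.choice(L, p=alpha)           # index of the attended location
z_hat = a[s_t]                         # context vector = that single a_i
```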

SLIDE 10

Attention Mechanism: Stochastic Attention


  • Trained with the REINFORCE learning rule
  • Monte Carlo sampling approximation of the gradient
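The Monte Carlo approximation can be illustrated with a toy score-function (REINFORCE) estimator for a categorical distribution. This is a didactic sketch with made-up rewards, not the paper's training code; `theta` and `f` are illustrative names.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])     # logits of p_theta = softmax(theta)
f = np.array([1.0, 0.0, 0.5])          # toy "reward" for each sampled location
p = softmax(theta)

# REINFORCE: d/dtheta E_{s~p}[f(s)] ~= mean of f(s) * grad_theta log p(s).
# For softmax, grad_theta log p(s) = one_hot(s) - p.
N = 100_000
samples = rng.choice(3, size=N, p=p)
grads = np.eye(3)[samples] - p
estimate = (f[samples, None] * grads).mean(axis=0)

# Closed-form gradient for comparison: sum_s p_s f_s (one_hot(s) - p)
exact = sum(p[s] * f[s] * (np.eye(3)[s] - p) for s in range(3))
```

In practice the paper adds variance-reduction tricks (a moving-average baseline), since the raw estimator is noisy.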

SLIDE 11

Attention Mechanisms: Deterministic Attention

Deterministic "Soft" Attention: take the expectation of the context vector instead of sampling. Differentiable!

$\mathbb{E}_{p(s_t \mid \mathbf{a})}[\hat{\mathbf{z}}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, \mathbf{a}_i$

$\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_{i}^{L} \alpha_i\, \mathbf{a}_i$

The soft-attention weighted vector corresponds to the Normalized Weighted Geometric Mean (NWGM) approximation.
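The expectation above is just a differentiable weighted average of the annotation vectors, as this tiny sketch (with made-up numbers) shows:

```python
import numpy as np

# L = 3 annotation vectors, each D = 2 dimensional
a = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
alpha = np.array([0.5, 0.25, 0.25])    # attention weights, sum to 1

# Soft attention: z_hat = sum_i alpha_i * a_i (the expectation over locations)
z_hat = alpha @ a                      # -> [1.0, 0.75]
```

Because `z_hat` is a smooth function of `alpha`, gradients flow through it and the whole model trains with ordinary backpropagation.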

SLIDE 12

Attention Mechanisms: Deterministic Attention

Doubly Stochastic Attention: introduce a regularization $\sum_t \alpha_{t,i} \approx 1$, i.e. add the penalty $\lambda \sum_i \big(1 - \sum_t \alpha_{t,i}\big)^2$ to the loss.

A gating scalar modulates the context vector:

$\beta_t = \sigma\big(f_\beta(\mathbf{h}_{t-1})\big)$

Encourages the model to pay equal attention to every part of the image over time.

$\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \beta \sum_{i}^{L} \alpha_i\, \mathbf{a}_i$
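The regularizer can be sketched directly from its formula. This is an illustrative helper, not the paper's code; `lam` is an assumed name for the regularization weight.

```python
import numpy as np

# Doubly stochastic penalty: push each location's total attention mass
# sum_t alpha_{t,i} toward 1, via lam * sum_i (1 - sum_t alpha_{t,i})^2.
def attention_penalty(alpha, lam=1.0):
    # alpha: (T, L) attention weights over T time steps and L locations
    return lam * np.sum((1.0 - alpha.sum(axis=0)) ** 2)

alpha = np.array([[0.7, 0.3],
                  [0.5, 0.5]])         # T = 2 time steps, L = 2 locations
penalty = attention_penalty(alpha)     # (1 - 1.2)^2 + (1 - 0.8)^2 = 0.08
```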

SLIDE 13

Attention Mechanisms


A CNN maps the image (H x W x 3) to a grid of features a, b, c, d (each D-dimensional). From the RNN, a distribution over grid locations is produced, $p_a + p_b + p_c + p_d = 1$, which yields the context vector z (D-dimensional).

  • Soft attention: summarize ALL locations, $z = p_a a + p_b b + p_c c + p_d d$. The derivative $dz/dp$ is nice! Train with gradient descent.
  • Hard attention: sample ONE location according to p; z = that vector. With argmax, $dz/dp$ is zero almost everywhere, so gradient descent cannot be used; reinforcement learning is needed.

Based on CS231n by Fei-Fei Li, Andrej Karpathy & Justin Johnson

SLIDE 14

Visualizing Attention


[Figure: attention visualizations for soft attention and hard attention]

SLIDE 15

Experiments

Encoder:

  • Oxford VGGnet, pretrained on ImageNet
  • Feature maps from the 4th conv layer before max pooling: 14x14x512, flattened to 196 x 512 (L x D)

Datasets:

  • Flickr8k: 8,000 images
  • Flickr30k: 30,000 images
  • MS COCO: 82,783 images
  • Vocabulary: 10,000 words
  • 5 reference sentences per image

Training:

  • Flickr8k: RMSProp; Flickr30k / MS COCO: Adam
  • Dropout and early stopping on BLEU
  • Batching by sentence length

Metrics: BLEU-1, 2, 3, 4 (no brevity penalty) and METEOR


SLIDE 16

Results


SLIDE 17

Key Points

  • Learns latent alignments from scratch
  • Provides better context to the decoder
  • Can attend to non-object regions
  • Joint representation
  • Visualizing attention helps interpret the model's functioning
  • Stochastic ("hard") attention
  • Deterministic ("soft") attention


SLIDE 18

Questions?


Thank you