SLIDE 1

Show, Attend and Tell:

Neural Image Caption Generation with Visual Attention

Kelvin Xu*, Jimmy Ba†, Ryan Kiros†, Kyunghyun Cho*, Aaron Courville*, Ruslan Salakhutdinov†, Richard Zemel†, Yoshua Bengio*

Université de Montréal* / University of Toronto†

(some figures from Hugo Larochelle)

July 8, 2015

SLIDE 2

Caption generation is another building block

Figure: vision tasks ordered by level and by human recognition time: low-level (features and descriptors, texture, shape, tracking, segmentation, detection) through high-level (actions, activity understanding, scene understanding, causality, goals and intentions, situation, functionality, social roles); objects are recognized in about 90 ms, scenes in about 150 ms, activities in about 1 sec. Adapted from a figure by Fei-Fei Li.

SLIDE 3

What our model does:

Figure: A bird flying over a body of water.

SLIDE 4

Overview

◮ Recent work in image caption generation
    ◮ Encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
◮ Motivating our architecture from human attention
◮ Our proposed attention model
    ◮ Model Description
    ◮ Quantitative/Qualitative Results

SLIDE 5

This talk:

◮ Recent work in image caption generation
    ◮ Encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
◮ Motivating our architecture from human attention
◮ Our proposed attention model
    ◮ Model Description
    ◮ Quantitative/Qualitative Results

SLIDE 6

Recent surge of interest in image captioning

◮ Submissions on this topic at CVPR 2015

(from groups at Google, Berkeley, Stanford, Microsoft, etc.)

◮ Inspired by some successes in machine translation

(Kalchbrenner et al. 2013, Sutskever et al. 2014, Cho et al. 2014)

SLIDE 7

Theme: use a convnet to condition the language model on the image

Figure: from Karpathy et al. (2015)

SLIDE 8

Figure: the Vinyals et al. (2015) model is quite similar

SLIDE 9

This talk:

◮ Recent work in image caption generation
    ◮ Encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
◮ Motivating our architecture from human attention
◮ Our proposed attention model
    ◮ Model Description
    ◮ Quantitative/Qualitative Results

SLIDE 10

What are some things we know about human attention?

SLIDE 11

(1) human vision is foveated & sequential

◮ Particular parts of an image come to the forefront


◮ It is a sequential decision process (“saccades”, glimpses)

SLIDE 12

(2) bottom-up input influences

Figure: from Borji and Itti. (2013) [2]

SLIDE 13

(3) top-down, task-level control

Figure: from Yarbus (1967)

SLIDE 14

Summary: useful aspects of attention

◮ foveated visual field (spatial focus)
◮ sequential decision making (temporal dynamics)
◮ bottom-up input influence
◮ top-down, task-specific modulation

SLIDE 15

This talk:

◮ Recent work in image caption generation
    ◮ Encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others)
◮ Motivating our architecture from human attention
◮ Our proposed attention model
    ◮ Model Description
    ◮ Quantitative/Qualitative Results

SLIDE 16

Our proposed attention model

◮ "Low level" convolutional feature extraction: $a = \{a_1, a_2, \dots, a_L\}$

◮ Compute the importance of each of these regions: $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_L\}$

◮ Combine $\alpha$ and $a$ to represent the image (context vector $\hat{z}_i$)
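
As a shape-level sketch in numpy: the paper takes the annotation vectors from a low convolutional layer of VGG (a 14×14×512 feature map, so L = 196 and D = 512); the uniform weights below are only a placeholder for the learned α, which comes from the attention mechanism introduced on the later slides.

```python
import numpy as np

# Stand-in for the CNN output: a 14x14 grid of 512-d feature columns.
feature_map = np.random.randn(14, 14, 512)

# Flatten the grid into L = 196 annotation vectors a_1..a_L of dimension D = 512.
L, D = 14 * 14, 512
a = feature_map.reshape(L, D)

# Placeholder attention weights (uniform), one per region, summing to 1.
alpha = np.full(L, 1.0 / L)

# Context vector z_hat: the alpha-weighted combination of the annotation vectors.
z_hat = alpha @ a          # shape (D,)
```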

SLIDE 17

A little bit more specific

Output = (a, man, is, jumping, into, a, lake, .)

SLIDE 18

Convolutional feature extraction

Figure: a convolutional neural network maps the image to annotation vectors $a_j$; target output = (a, man, is, jumping, into, a, lake, .)

SLIDE 19

Given an initial hidden state (predicted from the image)..

Figure: the annotation vectors $a_j$ feed the initial recurrent state $h_i$.

SLIDE 20

Predict the “importance” of each region

Figure: an attention mechanism reads $h_i$ and the $a_j$ and predicts weights $\alpha_j$ with $\sum_j \alpha_j = 1$.

SLIDE 21

Combine with annotation vectors..

Figure: the attention weights $\alpha_j$ are combined with the annotation vectors $a_j$ in a weighted sum.

SLIDE 22

Feed into next hidden state and predict the next word

Figure: the weighted context feeds the next recurrent state $h_i$, from which a word $y_i$ is sampled.

SLIDE 23

In the next step, we use the new hidden state

Figure: the same diagram one step later: the updated $h_i$ drives a new set of attention weights.

SLIDE 24

Continue until end of sequence

Figure: the attend / update / sample loop repeats until the full output (a, man, is, jumping, into, a, lake, .) has been generated.
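
The walkthrough above amounts to the loop below. This is a sketch, not the authors' code: `f_att`, `lstm_step`, `word_probs`, `sample`, and `embed` are hypothetical stand-ins for the components in the diagram.

```python
def generate_caption(a, h0, f_att, lstm_step, word_probs, sample, embed,
                     start_token=1, end_token=0, max_len=20):
    """One pass of the attend / update / sample loop from the diagrams."""
    h, y, words = h0, start_token, []
    for _ in range(max_len):
        alpha = f_att(a, h)                # importance of each region, sums to 1
        z_hat = alpha @ a                  # combine with the annotation vectors
        h = lstm_step(h, z_hat, embed[y])  # fold in context + previous word E y
        y = sample(word_probs(h, z_hat))   # predict and sample the next word
        words.append(y)
        if y == end_token:                 # continue until end of sequence
            break
    return words
```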

SLIDE 25

The attention is driven by the recurrent state + image

◮ At every time step, compute the importance of each region depending on the top-down + bottom-up signals:

$$e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

◮ We use a softmax to constrain these weights to sum to 1
◮ We explore two different ways to use this distribution to compute a meaningful image representation
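
In code, one plausible reading of these two equations (the paper specifies $f_{\mathrm{att}}$ only as an MLP conditioned on $h_{t-1}$; the single-hidden-layer tanh form below is an assumption):

```python
import numpy as np

def f_att(a, h_prev, Wa, Wh, w):
    """e_ti = f_att(a_i, h_{t-1}) for all i at once; a is (L, D), h_prev is (H,).
    Assumed form: a one-hidden-layer MLP, e_i = w . tanh(Wa a_i + Wh h_{t-1})."""
    return np.tanh(a @ Wa + h_prev @ Wh) @ w      # shape (L,)

def attention_weights(e):
    """alpha_ti = exp(e_ti) / sum_k exp(e_tk): a softmax over the L regions."""
    ex = np.exp(e - e.max())                      # subtract max for stability
    return ex / ex.sum()
```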

SLIDE 26

Stochastic or Deterministic?

Figure: the full model, with a choice at the attention step: stochastic or deterministic.

SLIDE 27

Quick note on our decoder: the LSTM (Hochreiter & Schmidhuber, 1997)

Figure: the LSTM cell: input gate $i$, forget gate $f$, output gate $o$, input modulator $g$, and memory cell $c$; each gate reads the previous state $h_{t-1}$, the context $\hat{z}_t$, and the previous word embedding $Ey_{t-1}$.
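
A numpy sketch of one such decoder step. The stacked-weight layout is an assumption, not the paper's exact parameterization; the gating itself is the standard LSTM update:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, z_hat, Ey_prev, W, b):
    """One decoder step: every gate reads h_{t-1}, the context z_hat_t, and the
    previous word embedding E y_{t-1}. W stacks all four gates: (4H, H + D + M)."""
    H = h_prev.shape[0]
    x = np.concatenate([h_prev, z_hat, Ey_prev])
    pre = W @ x + b                       # all gate pre-activations at once
    i = sigmoid(pre[0*H:1*H])             # input gate
    f = sigmoid(pre[1*H:2*H])             # forget gate
    o = sigmoid(pre[2*H:3*H])             # output gate
    g = np.tanh(pre[3*H:4*H])             # input modulator
    c = f * c_prev + i * g                # memory cell update
    h = o * np.tanh(c)                    # new hidden state
    return h, c
```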

SLIDE 28

Deterministic (Soft) Attention

◮ Feed in an attention-weighted image input:

$$\hat{z}_t = \sum_{i=1}^{L} \alpha_{t,i} \, a_i$$

◮ This is what A. Graves (2013) / D. Bahdanau et al. (2015) did in handwriting generation / machine translation
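
A quick numerical check of what the soft context means (a toy numpy example; the Dirichlet weights are arbitrary stand-ins for learned alphas): $\hat{z}_t$ is the expected annotation vector under the attention distribution, which is why averaging many hard-attention samples recovers it.

```python
import numpy as np
rng = np.random.default_rng(0)

L, D = 196, 512
a = rng.standard_normal((L, D))       # annotation vectors
alpha = rng.dirichlet(np.ones(L))     # some attention weights, summing to 1

# Soft attention: the expectation of a_i under alpha, computed in closed form.
z_soft = alpha @ a                    # shape (D,)

# Averaging many hard-attention samples (next slide) approaches the soft context.
samples = a[rng.choice(L, size=10_000, p=alpha)]
assert np.allclose(samples.mean(axis=0), z_soft, atol=0.1)
```

Because the weighted sum is differentiable in $\alpha$, this variant trains end-to-end with ordinary backpropagation.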

SLIDE 29

Alternatively: Stochastic (Hard) Attention

◮ Sample α stochastically at every time step
◮ In RL terms, think of the softmax α as a Boltzmann policy; we maximize the lower bound

$$L_s = \sum_s p(s \mid a) \log p(y \mid s, a) \le \log p(y \mid a)$$

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a) \, \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right]$$

◮ This estimator is by Williams (1992), re-popularized recently by Mnih et al. (2014) and Ba et al. (2015)
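
A sketch of that estimator (the function names and signatures are hypothetical; the paper additionally reduces variance with a baseline and an entropy term, omitted here):

```python
import numpy as np
rng = np.random.default_rng(0)

def hard_attention_grad(alpha, log_p, grad_log_p, grad_log_alpha, n_samples=100):
    """Monte Carlo estimate of dL_s/dW for hard attention.

    alpha          : (L,) attention distribution over regions (the policy)
    log_p          : s -> log p(y | s, a) under the current model
    grad_log_p     : s -> d/dW log p(y | s, a)
    grad_log_alpha : s -> d/dW log p(s | a), i.e. d/dW log alpha_s
    """
    grads = []
    for _ in range(n_samples):
        s = rng.choice(len(alpha), p=alpha)   # s_n ~ Multinoulli(alpha)
        # First term: ordinary likelihood gradient for the sampled location.
        # Second term: REINFORCE; log p(y | s_n, a) acts as a reward that
        # reinforces attention choices which made the caption likely.
        grads.append(grad_log_p(s) + log_p(s) * grad_log_alpha(s))
    return np.mean(grads, axis=0)             # average over the N samples
```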

SLIDE 30

Quantitative Results

SLIDE 31

A footnote on these metrics

SLIDE 32

Under automatic metrics, humans are not great :(

SLIDE 33

But human evaluation (Mechanical Turk) is quite different

SLIDE 34

Stochastic or Deterministic?

Figure: the full model again, with the stochastic vs. deterministic choice at the attention step.

SLIDE 35

Visualizing our learned attention: the good

SLIDE 36

Visualizing our learned attention: the bad

SLIDE 37

Other fun things you can do:

SLIDE 38

A soccer ball ..

SLIDE 39

Two cakes on a plate..

SLIDE 40

Important previous work

SLIDE 41

attention in machine translation


Figure: also from UdeM lab (Bahdanau et al. 2014) [1]

SLIDE 42

attention mechanism in handwritten character generation

Figure: from Graves (2013) [3]

SLIDE 43

Recently, many more..

SLIDE 44

Thanks for attending!

SLIDE 45

Thanks for attending! Code: https://github.com/kelvinxu/arctic-captions

SLIDE 46

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[2] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013.

[3] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
