Why Did You Say That? Explaining and Diversifying Captioning Models



slide-1
SLIDE 1

Why Did You Say That? Explaining and Diversifying Captioning Models

Kate Saenko

VQA Workshop, CVPR, July 26, 2017

slide-2
SLIDE 2

Explaining: Top-down saliency guided by captions

http://ai.bu.edu/caption-guided-saliency/

Vasili Ramanishka (Boston University), Abir Das (Boston University), Jianming Zhang (Adobe Research), Kate Saenko (Boston University)

slide-3
SLIDE 3

Captioning

A woman is cutting a piece of meat

slide-4
SLIDE 4

Why did the network say that?


slide-5
SLIDE 5

Captioning

Example captions: "A woman is .." and "A man is talking about…"

Topic labels: science, cooking

slide-6
SLIDE 6

?

A woman is cutting a piece of meat


slide-7
SLIDE 7


slide-8
SLIDE 8

Explaining the network’s captions

Predicted sentence: A woman is cutting a piece of meat

Can the network localize objects?

slide-9
SLIDE 9

Related: Attention layers

Show, Attend and Tell [Xu et al. ICML’15]

“Attention Layers”: Sequentially process regions in a single image. Objective: Model learns “where to look” next.

Image captioning example words: girl, teddy bear

  • Soft attention adds a special attention layer
  • Only spatial or only temporal
  • Hard to do spatio-temporal attention
  • Can we get salient regions without adding such layers?
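As a rough illustration of what a soft attention layer computes, the sketch below weights each image region by a softmax over relevance scores and returns the weighted average. The dot-product scoring, the function names, and the toy values are assumptions made for this example; Show, Attend and Tell uses a learned scoring network instead.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(region_features, hidden_state):
    """Weight each region by its dot-product relevance to the decoder state.

    region_features: list of region vectors (one per image location).
    hidden_state: current decoder state, same length as each region vector.
    Returns (attention weights, weighted-average context vector).
    Dot-product scoring is a simplifying assumption.
    """
    scores = [sum(f * h for f, h in zip(region, hidden_state))
              for region in region_features]
    weights = softmax(scores)
    dim = len(hidden_state)
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context

# Toy example: three regions with 2-D features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
state = [2.0, 0.0]
weights, context = soft_attention(regions, state)
```

The weights sum to one, so the context vector is a convex combination of the region features. This extra layer is what the probing approach on the following slides avoids.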

slide-10
SLIDE 10

Key idea: probe the network with a small part of the input

Diagram: small part of the input → Encoder-Decoder Network → P(word)

  • No need for special attention layer
  • Get spatio-temporal attention for free

slide-11
SLIDE 11

Encoder-decoder framework for video description

Diagram: CNN (8x8x2048) → Average (1x2048) → LSTM encoder

slide: Vasili Ramanishka

slide-12
SLIDE 12

Encoder-decoder framework for video description

Diagram: CNN (8x8x2048) → Average (1x2048) → LSTM encoder → LSTM decoder → "a man is … car …"

slide: Vasili Ramanishka
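The "Average" step in the diagram reduces the 8x8x2048 CNN output to a single 1x2048 descriptor. A minimal sketch of that pooling, using nested Python lists and a tiny 2x2x3 map as a stand-in (the function name and toy values are assumptions):

```python
def mean_pool(feature_map):
    """Average an H x W x C feature map (nested lists) over its H*W locations."""
    locations = [vec for row in feature_map for vec in row]
    channels = len(locations[0])
    return [sum(vec[c] for vec in locations) / len(locations)
            for c in range(channels)]

# Toy 2x2x3 map standing in for the 8x8x2048 CNN output.
toy_map = [[[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
           [[1.0, 4.0, 2.0], [3.0, 0.0, 2.0]]]
frame_descriptor = mean_pool(toy_map)  # one value per channel
```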

slide-13
SLIDE 13

Encoder-decoder framework for video description

Diagram: CNN (8x8x2048) → Average (1x2048) → LSTM encoder → LSTM decoder → "a man is … car …"

slide: Vasili Ramanishka

slide-14
SLIDE 14

Saliency Estimation

Diagram: CNN (8x8x2048), 1x2048 input → LSTM encoder → LSTM decoder → "a man is … car …"

slide: Vasili Ramanishka

slide-15
SLIDE 15

Saliency Estimation

Diagram: CNN (8x8x2048), 1x2048 input → LSTM encoder → LSTM decoder → "a man is … car …"

slide: Vasili Ramanishka

slide-16
SLIDE 16

Saliency Estimation

Diagram: CNN (8x8x2048), 1x2048 input → LSTM encoder → LSTM decoder → "a man is … car …"

slide: Vasili Ramanishka

slide-17
SLIDE 17

Saliency Estimation

“A man is driving a car” → normalization

slide: Vasili Ramanishka
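One way to read the saliency estimation slides: feed one input descriptor at a time, record how probable the network finds a given caption word, and normalize the scores across descriptors. The sketch below follows that reading. The `word_prob` callable is a hypothetical stand-in for running the encoder-decoder on a single descriptor, and the log-probability scoring with min-max normalization is an assumption, not necessarily the normalization used in the talk.

```python
import math

def probe_saliency(descriptors, word, word_prob):
    """Score each input descriptor by how well it alone predicts `word`.

    descriptors: list of input items (e.g. one per frame or spatial location).
    word_prob: callable (descriptor, word) -> probability; a hypothetical
        stand-in for running the encoder-decoder on a single descriptor.
    Returns scores rescaled to [0, 1] by min-max normalization (an assumption).
    """
    log_probs = [math.log(word_prob(d, word)) for d in descriptors]
    lo, hi = min(log_probs), max(log_probs)
    if hi == lo:  # all descriptors equally informative
        return [0.0 for _ in log_probs]
    return [(lp - lo) / (hi - lo) for lp in log_probs]

# Toy stand-in model: made-up per-region probabilities of the word "car".
toy_probs = {"road": 0.05, "car_body": 0.60, "sky": 0.01}

def toy_word_prob(descriptor, word):
    return toy_probs[descriptor]

saliency = probe_saliency(["road", "car_body", "sky"], "car", toy_word_prob)
```

Repeating this for every word in the caption and every frame or location yields one map per word, which is how a spatio-temporal map can be obtained without an attention layer.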

slide-18
SLIDE 18

Spatiotemporal saliency

Predicted sentence: A woman is cutting a piece of meat

slide-19
SLIDE 19

Spatiotemporal saliency

Words: woman, phone

slide-20
SLIDE 20

Image captioning with the same architecture

Diagram: CNN (WxHxC) → descriptors v_i (1xC) → LSTM hidden states h_i

slide-21
SLIDE 21

Image captioning with the same architecture

Input query: A man in a jacket is standing at the slot machine

slide-22
SLIDE 22


Flickr30kEntities

Plummer et al., ICCV 2015

slide-23
SLIDE 23


Pointing game in Flickr30kEntities
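The pointing game, as it is commonly defined in the saliency literature, counts a hit when the maximum of the saliency map falls inside the ground-truth region for the queried phrase. The sketch below assumes rectangular boxes with inclusive bounds and made-up saliency values; the exact protocol used for Flickr30kEntities in the talk is not given on the slide.

```python
def pointing_game_hit(saliency_map, box):
    """Return True if the saliency maximum lies inside the ground-truth box.

    saliency_map: 2-D list indexed as [row][col].
    box: (row_min, col_min, row_max, col_max), inclusive bounds (assumed format).
    """
    best_row, best_col = max(
        ((r, c) for r in range(len(saliency_map))
         for c in range(len(saliency_map[0]))),
        key=lambda rc: saliency_map[rc[0]][rc[1]])
    row_min, col_min, row_max, col_max = box
    return row_min <= best_row <= row_max and col_min <= best_col <= col_max

def pointing_accuracy(examples):
    """Fraction of (saliency_map, box) pairs counted as hits."""
    hits = sum(pointing_game_hit(m, b) for m, b in examples)
    return hits / len(examples)

toy_map = [[0.1, 0.2, 0.1],
           [0.1, 0.9, 0.3],
           [0.0, 0.2, 0.1]]
accuracy = pointing_accuracy([(toy_map, (1, 1, 2, 2)),   # maximum inside the box
                              (toy_map, (0, 0, 0, 2))])  # maximum outside the box
```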

slide-24
SLIDE 24

Comparison to Soft Attention on Flickr30kEntities

Reported measures: Attention correctness, Captioning performance, Pointing game accuracy

[14] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness in neural image captioning, 2016. Implementation of K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015.

slide-25
SLIDE 25

Video summarization: predicted sentence


slide-26
SLIDE 26

Video summarization: arbitrary query


slide-27
SLIDE 27

Diversifying: Captioning Images with Diverse Objects

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell

UT Austin, UC Berkeley, Boston Univ.

slide-28
SLIDE 28

Object Recognition


Can identify 1000’s of categories of objects. 14M images, 22K classes [Deng et al. CVPR’09]

slide: Subhashini Venugopalan

slide-29
SLIDE 29

Visual Description

Berkeley LRCN [Donahue et al. CVPR’15]: A brown bear standing on top of a lush green field.

MSR CaptionBot [http://captionbot.ai/]: A large brown bear walking through a forest.

MSCOCO: 80 classes

slide: Subhashini Venugopalan

slide-30
SLIDE 30

Novel Object Captioner (NOC)

We present Novel Object Captioner, which can compose descriptions of 100s of objects in context.

Visual classifiers (okapi) + existing captioners (MSCOCO): init + train

Existing captioners: A horse standing in the dirt.

NOC (ours): An okapi standing in the middle of a field.

NOC (ours): Describe novel objects without paired image-caption data.

slide: Subhashini Venugopalan

slide-31
SLIDE 31

Insights

  • 1. Need to recognize and describe objects outside of image-caption datasets.

Example: okapi

slide: Subhashini Venugopalan

slide-32
SLIDE 32

Insight 1: Train effectively on external sources

Image network (CNN, Embed): Image-Specific Loss. Visual features from unpaired image data.

Language network (LSTM, Embed): Text-Specific Loss. Language model from unannotated text data.

slide: Subhashini Venugopalan

slide-33
SLIDE 33

Insights

  • 2. Describe unseen objects that are similar to objects seen in image-caption datasets.

Example: okapi, zebra

slide: Subhashini Venugopalan

slide-34
SLIDE 34

Insight 2: Capture semantic similarity of words

Image network (CNN, Embed): Image-Specific Loss

Language network (LSTM, W_glove, W_glove^T, Embed): Text-Specific Loss

Semantically similar word pairs: zebra / okapi, dress / tutu, cake / scone

slide: Subhashini Venugopalan
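The slide's W_glove and W_glove^T suggest that the same word embeddings are used on both sides of the LSTM, so that words with similar vectors receive similar output scores. The sketch below illustrates that effect with made-up 3-D vectors standing in for GloVe embeddings; the vectors, the function names, and the reading of W_glove^T as an output projection are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-D vectors standing in for GloVe embeddings (rows of W_glove).
W_glove = {
    "zebra": [0.9, 0.1, 0.0],
    "okapi": [0.8, 0.2, 0.1],
    "cake":  [0.0, 0.9, 0.3],
    "scone": [0.1, 0.8, 0.4],
}

def output_scores(hidden):
    """Score every word by hidden . embedding, i.e. multiply by W_glove^T."""
    return {w: sum(h * e for h, e in zip(hidden, vec))
            for w, vec in W_glove.items()}

# A hidden state that points toward "zebra" also scores "okapi" highly.
scores = output_scores(W_glove["zebra"])
```

Under this reading, a word never seen in paired captions (okapi) can inherit plausible contexts from a similar word that was seen (zebra).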

slide-35
SLIDE 35

Insight 2: Capture semantic similarity of words

Semantically similar word pairs: zebra / okapi, dress / tutu, cake / scone

MSCOCO

Language network (LSTM, W_glove, W_glove^T, Embed): Text-Specific Loss

Image network (CNN, Embed): Image-Specific Loss

slide: Subhashini Venugopalan

slide-36
SLIDE 36

Combine to form a Caption Model

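Image network (CNN, Embed): Image-Specific Loss

Language network (LSTM, W_glove, W_glove^T, Embed): Text-Specific Loss

Caption network on MSCOCO (CNN, Embed and LSTM, W_glove, W_glove^T, Embed, combined by elementwise sum): Image-Text Loss

init parameters: the caption network is initialized from the image network and the language network.

Not different from existing caption models. Problem: Forgetting.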

slide: Subhashini Venugopalan

slide-37
SLIDE 37

Insights

  • 3. Overcome “forgetting” since pre-training alone is not sufficient.

[Overcoming Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

slide: Subhashini Venugopalan

slide-38
SLIDE 38

Insight 3: Jointly train on multiple sources

Image network (CNN, Embed): Image-Specific Loss

Language network (LSTM, W_glove, W_glove^T, Embed): Text-Specific Loss

Caption network on MSCOCO (CNN, Embed and LSTM, W_glove, W_glove^T, Embed, combined by elementwise sum): Image-Text Loss

Joint training, with shared parameters across the networks.

slide: Subhashini Venugopalan

slide-39
SLIDE 39

Novel Object Captioner (NOC) Model

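Image network (CNN, Embed): Image-Specific Loss

Language network (LSTM, W_glove, W_glove^T, Embed): Text-Specific Loss

Caption network on MSCOCO (CNN, Embed and LSTM, W_glove, W_glove^T, Embed, combined by elementwise sum): Image-Text Loss

Joint training, with shared parameters across the networks.

Joint-Objective Loss: combines the Image-Specific Loss, Image-Text Loss, and Text-Specific Loss.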

slide: Subhashini Venugopalan
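A minimal sketch of how the pieces on this slide could fit together: the caption network adds the image and language networks' vocabulary scores elementwise, and the joint objective adds the three losses so that no single data source dominates training. Using softmax cross-entropy for all three terms, weighting them equally, and the toy scores are all assumptions made for this example.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]

def caption_logits(image_logits, text_logits):
    """Combine the two networks' vocabulary scores by elementwise sum."""
    return [i + t for i, t in zip(image_logits, text_logits)]

def joint_loss(image_logits, image_target, text_logits, text_target,
               pair_image_logits, pair_text_logits, pair_target):
    """Unweighted sum of the three losses (equal weighting is an assumption)."""
    image_loss = cross_entropy(image_logits, image_target)
    text_loss = cross_entropy(text_logits, text_target)
    pair_loss = cross_entropy(
        caption_logits(pair_image_logits, pair_text_logits), pair_target)
    return image_loss + text_loss + pair_loss

# Toy 3-word vocabulary with made-up scores.
loss = joint_loss([2.0, 0.0, 0.0], 0, [0.0, 1.0, 0.0], 1,
                  [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0)
```

Because the three terms are optimized together over shared parameters, gradients from the unpaired image and text data keep flowing while the caption loss is trained, which is the slide's answer to forgetting.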

slide-40
SLIDE 40

Empirical Evaluation: COCO dataset In-Domain setting

Data sources: MSCOCO Paired Image-Sentence Data, MSCOCO Unpaired Image Data, MSCOCO Unpaired Text Data

Example sentences:

  • "An elephant galloping in the green grass"
  • "Two people playing ball in a field"
  • "A black train stopped on the tracks"
  • "Someone is about to eat some pizza"
  • "A microwave is sitting on top of a kitchen counter"
  • "A kitchen counter with a microwave on it"

Example image labels:

  • Elephant, Galloping, Green, Grass
  • People, Playing, Ball, Field
  • Black, Train, Tracks
  • Eat, Pizza
  • Kitchen, Microwave

slide: Subhashini Venugopalan

slide-41
SLIDE 41

Empirical Evaluation: COCO heldout dataset

Data sources: MSCOCO Paired Image-Sentence Data, MSCOCO Unpaired Image Data, MSCOCO Unpaired Text Data

Example sentences:

  • "An elephant galloping in the green grass"
  • "Two people playing ball in a field"
  • "A black train stopped on the tracks"
  • "Someone is about to eat some pizza"
  • "A white plate topped with cheesy pizza and toppings."
  • "A white refrigerator, stove, oven dishwasher and microwave"
  • "A kitchen counter with a microwave on it"

Example image labels:

  • Elephant, Galloping, Green, Grass
  • People, Playing, Ball, Field
  • Black, Train, Tracks
  • Pizza
  • Microwave

Held-out

slide: Subhashini Venugopalan
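A held-out setting of this kind can be sketched as filtering the paired image-sentence data so that no sentence mentions a held-out object, while the unpaired image and text sources are left untouched. The held-out word set, the tokenization, and the image identifiers below are assumptions chosen to match the slide's pizza and microwave examples.

```python
HELD_OUT = {"pizza", "microwave"}  # example held-out objects, assumed from the slide

def mentions_held_out(sentence, held_out=HELD_OUT):
    """True if any held-out word appears as a token in the sentence."""
    tokens = {tok.strip('.,"').lower() for tok in sentence.split()}
    return bool(tokens & held_out)

def split_paired_data(pairs):
    """Drop image-sentence pairs whose sentence mentions a held-out object."""
    return [(img, s) for img, s in pairs if not mentions_held_out(s)]

# Hypothetical image ids paired with sentences from the slides.
pairs = [("img1", "An elephant galloping in the green grass"),
         ("img2", "Someone is about to eat some pizza"),
         ("img3", "A kitchen counter with a microwave on it")]
kept = split_paired_data(pairs)
```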

slide-42
SLIDE 42

Empirical Evaluation: COCO

Data sources: MSCOCO Paired Image-Sentence Data, MSCOCO Unpaired Image Data, MSCOCO Unpaired Text Data

Example sentences:

  • "An elephant galloping in the green grass"
  • "Two people playing ball in a field"
  • "A black train stopped on the tracks"
  • "A small elephant standing on top of a dirt field"
  • "A hitter swinging his bat to hit the ball"
  • "A white plate topped with cheesy pizza and toppings."
  • "A white refrigerator, stove, oven dishwasher and microwave"

Example image labels:

  • Baseball, batting, boy, swinging
  • Black, Train, Tracks
  • Pizza
  • Microwave
  • Two, elephants, Path, walking

  • CNN is pre-trained on ImageNet

slide: Subhashini Venugopalan

slide-43
SLIDE 43

Empirical Evaluation: Metrics

F1 (Utility): Ability to recognize and incorporate new words. (Is the word/object mentioned in the caption?)

METEOR: Fluency and sentence quality.

slide: Subhashini Venugopalan
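The F1 (Utility) question, "is the word/object mentioned in the caption?", can be sketched as a per-word F1 over a set of images: a true positive is an image that contains the object and whose caption mentions the word. The whitespace tokenization, the function name, and the toy captions are assumptions for this example.

```python
def mention_f1(captions, has_object, word):
    """F1 for whether `word` is mentioned exactly when the object is present.

    captions: generated caption strings, one per image.
    has_object: parallel list of booleans (ground truth for each image).
    """
    mentioned = [word in c.lower().split() for c in captions]
    tp = sum(m and g for m, g in zip(mentioned, has_object))
    fp = sum(m and not g for m, g in zip(mentioned, has_object))
    fn = sum(g and not m for m, g in zip(mentioned, has_object))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy captions: one correct mention, one miss, one false mention.
f1 = mention_f1(
    ["a pizza on a plate", "a plate of food", "a pizza on a table", "a dog"],
    [True, True, False, False],
    "pizza")
```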

slide-44
SLIDE 44

Empirical Evaluation: Baselines

LRCN [1]: Does not caption novel objects.

DCC [2]: Copies parameters for the novel object from a similar object seen in training (also not end-to-end).

[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR’15

[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR’16

slide: Subhashini Venugopalan

slide-45
SLIDE 45

Empirical Evaluation: Results

Charts: F1 (Utility), METEOR (Fluency)

[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR’15

[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR’16

slide: Subhashini Venugopalan

slide-46
SLIDE 46

ImageNet: Human Evaluations

  • ImageNet: 638 object classes not mentioned in COCO
  • NOC can describe 582 object classes (60% more objects than prior work)

slide: Subhashini Venugopalan

slide-47
SLIDE 47

ImageNet: Human Evaluations

  • ImageNet: 638 object classes not mentioned in COCO
  • Word Incorporation: Which model incorporates the word (name of the object) in the sentence better?
  • Image Description: Which sentence (model) describes the image better?

slide: Subhashini Venugopalan

slide-48
SLIDE 48

ImageNet: Human Evaluations

Charts: Word Incorporation, Image Description

Chart values: 43.78, 25.74, 6.10, 24.37, 40.16, 59.84

slide: Subhashini Venugopalan

slide-49
SLIDE 49

Qualitative Evaluation: ImageNet


slide: Subhashini Venugopalan

slide-50
SLIDE 50

Qualitative Evaluation: ImageNet


slide: Subhashini Venugopalan

slide-51
SLIDE 51

Qualitative Examples: Errors

  • Sunglass (n04355933). Error: Grammar. NOC: A sunglass mirror reflection of a mirror in a mirror.
  • Gymnast (n10153594). Error: Gender, Hallucination. NOC: A man gymnast in a blue shirt doing a trick on a skateboard.
  • Balaclava (n02776825). Error: Repetition. NOC: A balaclava black and white photo of a man in a balaclava.
  • Cougar (n02125311). Error: Description. NOC: A cougar with a cougar in its mouth.

slide: Subhashini Venugopalan

slide-52
SLIDE 52

Novel Object Captioner - Take away

MSCOCO

An okapi standing in the middle of a field.

Semantic embeddings and joint training to caption 100s of objects.

slide: Subhashini Venugopalan

slide-53
SLIDE 53

Thanks!

VQA Workshop, CVPR, July 26, 2017

Abir Das, Vasili Ramanishka, Jianming Zhang, Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Trevor Darrell