Captioning Images with Diverse Objects - PowerPoint PPT Presentation



SLIDE 1

Captioning Images with Diverse Objects

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell

UT Austin, UC Berkeley, Boston Univ.

SLIDE 2

Object Recognition

Can identify hundreds of categories of objects: 14M images, 22K classes [Deng et al. CVPR'09].

SLIDE 3

Visual Description

MSCOCO: 80 classes.

Berkeley LRCN [Donahue et al. CVPR'15]: "A brown bear standing on top of a lush green field."
MSR CaptionBot [http://captionbot.ai/]: "A large brown bear walking through a forest."

SLIDE 4

Novel Object Captioner (NOC)

We present the Novel Object Captioner, which can compose descriptions of hundreds of objects in context.

[Diagram: visual classifiers and existing captioners, initialized and trained on MSCOCO. An existing captioner describes an okapi image as "A horse standing in the dirt."; NOC (ours) describes novel objects without paired image-caption data: "An okapi standing in the middle of a field."]

SLIDE 5

Insights

  • 1. Need to recognize and describe objects (e.g., okapi) outside of image-caption datasets.

SLIDE 6

Insight 1: Train effectively on external sources

[Diagram: an image branch (CNN → Embed) trained with an image-specific loss, and a text branch (Embed → LSTM) trained with a text-specific loss.]

Visual features come from unpaired image data; the language model is trained on unannotated text data.

SLIDE 7

Insights

  • 2. Describe unseen objects (e.g., okapi) that are similar to objects seen in image-caption datasets (e.g., zebra).

SLIDE 8

Insight 2: Capture semantic similarity of words

[Diagram: both branches use GloVe word embeddings, with the LSTM's output projection tied to the input embedding (W_glove and its transpose W_glove^T). Semantically similar word pairs cluster together: zebra/okapi, dress/tutu, cake/scone.]
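The weight tying above (reusing the input embedding W_glove as the output projection W_glove^T) can be sketched with toy 2-D vectors. This is an illustrative assumption: the real model uses pretrained GloVe embeddings of much higher dimension, and the `matvec` helper and example vectors are made up for the demo.

```python
def matvec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) for row in rows]

# Input embedding W_glove: one row per vocabulary word (toy 2-D vectors).
vocab = ["zebra", "okapi", "dress", "tutu"]
W_glove = [
    [0.9, 0.1],   # zebra
    [0.8, 0.2],   # okapi  (close to zebra)
    [0.1, 0.9],   # dress
    [0.2, 0.8],   # tutu   (close to dress)
]

# A hidden state standing in for the LSTM output after seeing "zebra".
h = W_glove[vocab.index("zebra")]

# Tied output projection: scoring words with W_glove itself (W_glove^T as
# the weight matrix), so words with similar embeddings get similar scores.
logits = matvec(W_glove, h)
scores = dict(zip(vocab, logits))
assert scores["okapi"] > scores["dress"]  # okapi scores close to zebra
```

Because the output weights are the embedding itself, a novel word like "okapi" inherits sensible scores from its neighbor "zebra" even without paired caption data.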

SLIDE 9

Insight 2: Capture semantic similarity of words

[Diagram: the same shared-embedding setup applied within the MSCOCO caption model, so novel words (okapi, tutu, scone) inherit behavior from similar seen words (zebra, dress, cake).]

SLIDE 10

Combine to form a Caption Model

[Diagram: the caption network (CNN → Embed for the image, LSTM with tied GloVe embeddings for the text, combined by elementwise sum) is trained on MSCOCO with an image-text loss; its parameters are initialized from the separately pre-trained visual and language networks.]

Not different from existing caption models. Problem: forgetting.

SLIDE 11

Insights

  • 3. Overcome "forgetting", since pre-training alone is not sufficient.

[Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

SLIDE 12

Insight 3: Jointly train on multiple sources

[Diagram: the visual, caption, and language networks are trained jointly. The CNN parameters are shared between the visual and caption networks, and the LSTM/embedding parameters (W_glove, W_glove^T) are shared between the caption and language networks. Each branch keeps its own loss: image-specific, image-text, and text-specific.]

SLIDE 13

Novel Object Captioner (NOC) Model

[Diagram: the jointly trained networks from Slide 12, with the image-specific, image-text, and text-specific losses combined into a single joint-objective loss over the shared parameters.]
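A minimal sketch of the joint objective: the three per-branch losses are combined into one training signal over the shared parameters. The weighted sum, the equal default weights, and the `joint_objective` name are illustrative assumptions; the paper's exact weighting may differ.

```python
def joint_objective(image_loss, image_text_loss, text_loss,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the image-specific, image-text, and text-specific losses
    into a single scalar for joint training (assumed weighted sum)."""
    w_img, w_cap, w_txt = weights
    return w_img * image_loss + w_cap * image_text_loss + w_txt * text_loss

# Toy per-branch loss values for one training step.
total = joint_objective(0.5, 1.2, 0.3)
assert abs(total - 2.0) < 1e-9
```

Optimizing this single scalar keeps all three branches in play at every step, which is what prevents the caption branch from overwriting (forgetting) what the visual and language branches learned.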

SLIDE 14

Visual Network

[Diagram: CNN → Embed, trained with an image-specific loss.]

Network: VGG-16 with a multi-label loss (sigmoid cross-entropy).
Training Data: Unpaired image data.
Output: Vector of activations corresponding to scores for words in the vocabulary, e.g. impala: 0.86, green: 0.72, ..., cut: 0.04.
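The sigmoid cross-entropy multi-label loss named above can be sketched in a few lines. This is a plain-Python illustration over toy logits, not the actual VGG-16 training code; each vocabulary word gets an independent binary target.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_sigmoid_xent(logits, targets):
    """Mean sigmoid cross-entropy over all vocabulary words: each word is
    an independent binary prediction (present / absent in the image)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Toy vocabulary [impala, green, cut]; the image shows an impala on grass.
logits  = [2.0, 1.0, -3.0]
targets = [1.0, 1.0, 0.0]
loss = multilabel_sigmoid_xent(logits, targets)

# Confidently wrong logits cost more than roughly correct ones.
assert multilabel_sigmoid_xent([-2.0, -1.0, 3.0], targets) > loss
```

The sigmoid (rather than softmax) is what makes this multi-label: several words can be "on" for one image, matching the per-word scores shown on the slide.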

SLIDE 15

Language Model

[Diagram: Embed → LSTM, with the output projection (W_glove)^T tied to the input embedding W_glove, trained with a text-specific loss.]

Network: Single LSTM layer; predicts the next word w_{t+1} given the previous words w_0..w_t.
(W_glove)^T: Weights shared with the input embedding.
Training Data: Unannotated text data (BNC, ukWaC, Wikipedia, Gigaword).
Output: Vector of activations corresponding to scores for words in the vocabulary.

SLIDE 16

Caption Network

[Diagram: the CNN → Embed image branch and the LSTM text branch are combined by elementwise sum and trained on MSCOCO with an image-text loss.]

Network: Combines the outputs of the visual and text networks (softmax + cross-entropy loss).
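The combination step described above (elementwise sum of the two branches' per-word scores, followed by softmax cross-entropy) might look like the following sketch. The toy score vectors and the `caption_word_loss` name are assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def caption_word_loss(visual_scores, text_scores, target_index):
    """Elementwise sum of the branch scores, then softmax cross-entropy
    against the ground-truth next word."""
    combined = [v + t for v, t in zip(visual_scores, text_scores)]
    probs = softmax(combined)
    return -math.log(probs[target_index])

# Toy vocabulary [bear, field, pizza]; the ground-truth next word is "bear".
visual = [1.5, 0.3, -1.0]   # image evidence favors "bear"
text   = [0.8, 0.4,  0.1]   # language context also favors "bear"
loss = caption_word_loss(visual, text, target_index=0)
assert loss < caption_word_loss(visual, text, target_index=2)
```

Because the sum is elementwise over a shared vocabulary, a strong visual score for a novel word can push that word into the caption even when the language branch alone would not choose it.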

SLIDE 17

Caption Model

[Diagram: the caption network from Slide 16.]

Training Data (images): COCO images with multiple labels, e.g. bear, brown, field, grassy, trees, walking.
Training Data (captions): Captions from COCO, e.g. "A brown bear walking on a grassy field next to trees."

SLIDE 18

NOC Model: Train simultaneously

[Diagram: the full NOC model of Slide 13 - the visual, caption, and language networks are trained simultaneously with shared parameters under the joint-objective loss.]

SLIDE 19

Evaluation

  • Empirical: COCO held-out objects
    ○ In-domain [use images from COCO]
    ○ Out-of-domain [use ImageNet images for held-out concepts]
  • Ablations
    ○ Contribution of the embedding and of joint training
  • ImageNet
    ○ Quantitative
    ○ Human evaluation - objects not in COCO
    ○ Rare objects in COCO


SLIDE 21

Empirical Evaluation: COCO dataset, In-Domain setting

MSCOCO paired image-sentence data, MSCOCO unpaired image data, MSCOCO unpaired text data.

Example captions: "An elephant galloping in the green grass", "Two people playing ball in a field", "A black train stopped on the tracks", "Someone is about to eat some pizza", "A microwave is sitting on top of a kitchen counter", "A kitchen counter with a microwave on it".
Example image labels: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Eat, Pizza; Kitchen, Microwave.

SLIDE 22

Empirical Evaluation: COCO held-out dataset

MSCOCO paired image-sentence data, MSCOCO unpaired image data, MSCOCO unpaired text data. Captions mentioning held-out objects (e.g. pizza, microwave) are removed from the paired data but remain available as unpaired text, and the held-out objects remain in the unpaired image labels.

Example captions: "An elephant galloping in the green grass", "Two people playing ball in a field", "A black train stopped on the tracks", "A white plate topped with cheesy pizza and toppings.", "A white refrigerator, stove, oven, dishwasher and microwave", "A kitchen counter with a microwave on it".
Example image labels: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Pizza; Microwave.
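One way the held-out split sketched on this slide could be constructed is shown below. The substring matching, the word list, and the `split_paired_data` helper are simplifying assumptions, not the paper's exact protocol.

```python
# Held-out objects to remove from the paired image-caption data.
HELD_OUT = {"pizza", "microwave"}

def split_paired_data(pairs):
    """pairs: list of (image_labels, caption).
    Captions mentioning a held-out object are moved to text-only data;
    the rest stay as paired training data."""
    kept, unpaired_text = [], []
    for labels, caption in pairs:
        if any(w in caption.lower() for w in HELD_OUT):
            unpaired_text.append(caption)   # still usable as unannotated text
        else:
            kept.append((labels, caption))
    return kept, unpaired_text

pairs = [
    (["elephant", "grass"], "An elephant galloping in the green grass"),
    (["pizza", "plate"], "A white plate topped with cheesy pizza"),
]
kept, text_only = split_paired_data(pairs)
assert len(kept) == 1 and len(text_only) == 1
```

The key property is that "pizza" never appears in a paired example, yet the model still sees it through the image labels and the text-only captions.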

SLIDE 23

Empirical Evaluation: COCO

MSCOCO paired image-sentence data, MSCOCO unpaired image data, MSCOCO unpaired text data.

Example captions: "A small elephant standing on top of a dirt field", "A hitter swinging his bat to hit the ball", "A black train stopped on the tracks", "A white plate topped with cheesy pizza and toppings.", "A white refrigerator, stove, oven dishwasher and microwave".
Example image labels: Two, elephants, Path, walking; Baseball, batting, boy, swinging; Black, Train, Tracks; Pizza; Microwave.

  • The CNN is pre-trained on ImageNet.

SLIDE 24

Empirical Evaluation: Metrics

F1 (Utility): Ability to recognize and incorporate new words (is the word/object mentioned in the caption?).
METEOR: Fluency and sentence quality.
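A hedged sketch of the F1 (utility) computation: treating "the word appears in the generated caption" as a prediction and scoring it against whether the image actually contains the object. The `word_f1` helper and the counting protocol here are assumptions; the paper's exact protocol may differ.

```python
def word_f1(captions, has_object, word):
    """F1 for mentioning `word`: captions is a list of generated captions,
    has_object[i] is True if image i actually contains the object."""
    mentioned = [word in c.lower() for c in captions]
    tp = sum(1 for m, y in zip(mentioned, has_object) if m and y)
    fp = sum(1 for m, y in zip(mentioned, has_object) if m and not y)
    fn = sum(1 for m, y in zip(mentioned, has_object) if not m and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

caps = ["An okapi standing in a field",
        "A horse in the dirt",
        "An okapi near trees"]
truth = [True, True, False]  # images 0 and 1 actually contain an okapi
f1 = word_f1(caps, truth, "okapi")
assert abs(f1 - 0.5) < 1e-9  # precision 1/2, recall 1/2
```

F1 alone would reward stuffing object names into every caption, which is why it is paired with METEOR to check fluency.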

SLIDE 25

Empirical Evaluation: Baselines

LRCN [1]: Does not caption novel objects.
DCC [2]: Copies parameters for the novel object from a similar object seen in training (also not end-to-end).

[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR'15
[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR'16

SLIDE 26

Empirical Evaluation: Results

[Chart: F1 (utility) and METEOR (fluency) for LRCN [1], DCC [2], and NOC.]

SLIDE 27

Ablations

Evaluated on held-out COCO (MSCOCO) objects.

SLIDE 28

Ablation: Language Embedding

[Chart: contribution of the language embedding, alongside joint training and the frozen CNN, evaluated on MSCOCO.]

SLIDE 29

Ablation: Freeze CNN after pre-training

[Chart: effect of fixing the CNN after pre-training, alongside joint training and the language embedding, evaluated on MSCOCO.]

[Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

SLIDE 30

Ablation: Joint Training

[Chart: contribution of joint training, alongside the frozen CNN and the language embedding, evaluated on MSCOCO.]

SLIDE 31

ImageNet: Human Evaluations

  • ImageNet: 638 object classes not mentioned in COCO.

NOC can describe 582 object classes (60% more objects than prior work).

SLIDE 32

ImageNet: Human Evaluations

  • ImageNet: 638 object classes not mentioned in COCO.
  • Word Incorporation: Which model incorporates the word (the name of the object) into the sentence better?
  • Image Description: Which sentence (model) describes the image better?

SLIDE 33

ImageNet: Human Evaluations

[Chart: human preference results for Word Incorporation and Image Description; reported values: 43.78, 25.74, 6.10, 24.37, 40.16, 59.84.]

SLIDE 34

Qualitative Evaluation: ImageNet

[Example captions on ImageNet images.]

SLIDE 35

Qualitative Evaluation: ImageNet

[Further example captions on ImageNet images.]

SLIDE 36

Qualitative Examples: Errors

Sunglass (n04355933). Error: grammar. NOC: "A sunglass mirror reflection of a mirror in a mirror."
Gymnast (n10153594). Error: gender, hallucination. NOC: "A man gymnast in a blue shirt doing a trick on a skateboard."
Balaclava (n02776825). Error: repetition. NOC: "A balaclava black and white photo of a man in a balaclava."
Cougar (n02125311). Error: description. NOC: "A cougar with a cougar in its mouth."

SLIDE 37

Novel Object Captioner - Take-away

Semantic embeddings and joint training let NOC caption hundreds of objects, e.g. (trained on MSCOCO): "An okapi standing in the middle of a field."

SLIDE 38

Poster 11