Why Did You Say That? Explaining and Diversifying Captioning Models
Kate Saenko
VQA Workshop, CVPR, July 26, 2017
Explaining: Top-down saliency guided by captions
http://ai.bu.edu/caption-guided-saliency/
Abir Das
Boston University
Vasili Ramanishka
Boston University
Jianming Zhang
Adobe Research
Kate Saenko
Boston University
A woman is cutting a piece of meat
A woman is ..
science cooking
A man is talking about…
A woman is cutting a piece of meat
Predicted sentence: A woman is cutting a piece of meat
Show, Attend and Tell [Xu et al. ICML’15]
“Attention Layers”: Sequentially process regions in a single image. Objective: Model learns “where to look” next.
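As a rough illustration of what an attention layer computes, here is a minimal sketch of soft attention over image regions. It assumes each region already has a relevance score for the current decoding step; how those scores are produced, and the names `softmax` and `attend`, are assumptions made for this example and are not taken from the slides.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, scores):
    """Return attention weights and the weighted average of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context

# Three toy regions with 2-dimensional features; the second region scores highest.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(regions, scores=[0.1, 2.0, 0.1])
```

The weights say "where to look" for the current word, and the context vector is what the decoder would consume at that step.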
[Figure: image captioning with an attention layer; example words: girl, teddy bear]
Attention without adding such layers?
Encoder-Decoder Network
[Figure: the input is encoded, and the decoder outputs P(word).]
[Figure: encoder-decoder captioning model. The CNN produces an 8x8x2048 feature map, which is averaged into a 1x2048 vector for the encoder LSTM; the decoder LSTMs emit the caption words “a man is … car”. Later slides repeat the diagram with the same 8x8x2048 and 1x2048 shapes but without the Average label.]
slide: Vasili Ramanishka
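The Average step in the encoder diagram reduces the 8x8x2048 CNN feature map to a single 1x2048 vector. A minimal sketch of that spatial averaging follows, using a toy 2x2x3 map in place of the real shape; the function name `average_pool` is hypothetical.

```python
def average_pool(feature_map):
    """Average an H x W grid of C-dimensional descriptors into one C-dim vector."""
    descriptors = [d for row in feature_map for d in row]
    channels = len(descriptors[0])
    return [
        sum(d[c] for d in descriptors) / len(descriptors)
        for c in range(channels)
    ]

# Toy stand-in for the 8x8x2048 map: a 2x2 grid with 3 channels.
fmap = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 4.0, 2.0]],
]
pooled = average_pool(fmap)  # one 3-dimensional vector
```

Averaging discards spatial position, which is why a separate mechanism is needed to say which regions support which word.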
“A man is driving a car” normalization
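The slide names a normalization step without giving its form. One plausible reading, assumed here purely for illustration, is min-max rescaling of the per-region scores for a word into [0, 1] so that maps for different words can be displayed on the same scale. The function name and the toy scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a flat list of per-region scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant map carries no spatial preference; return all zeros.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Toy per-region scores for one word of the caption.
saliency_for_word = min_max_normalize([2.0, 4.0, 6.0])
```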
Predicted sentence: A woman is cutting a piece of meat
woman phone
[Figure: a CNN produces a WxHxC feature map and a 1xC vector; descriptors v_i feed an LSTM with hidden states h_i.]
Input query: A man in a jacket is standing at the slot machine
Plummer et al., ICCV 2015
[14] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness in neural image captioning, 2016; implementation of … caption generation with visual attention. In ICML 2015
Attention correctness | Captioning performance | Pointing game accuracy
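Pointing game accuracy is listed as a metric without a definition. The sketch below uses the common formulation, assumed here: a sample counts as a hit when the maximum of the saliency map falls inside the ground-truth region, and accuracy is hits divided by the number of samples. The box format `(row_min, col_min, row_max, col_max)` and the function names are hypothetical.

```python
def argmax_location(saliency):
    """Return (row, col) of the largest value in a 2D saliency map."""
    best = None
    for r, row in enumerate(saliency):
        for c, value in enumerate(row):
            if best is None or value > best[0]:
                best = (value, r, c)
    return best[1], best[2]

def pointing_game_accuracy(samples):
    """samples: list of (saliency_map, box), box = inclusive grid bounds."""
    hits = 0
    for saliency, (r0, c0, r1, c1) in samples:
        r, c = argmax_location(saliency)
        if r0 <= r <= r1 and c0 <= c <= c1:
            hits += 1
    return hits / len(samples)

samples = [
    ([[0.1, 0.9], [0.2, 0.3]], (0, 1, 0, 1)),  # peak inside the box: hit
    ([[0.8, 0.1], [0.2, 0.3]], (1, 0, 1, 1)),  # peak outside the box: miss
]
accuracy = pointing_game_accuracy(samples)
```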
Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell
slide: Subhashini Venugopalan
Berkeley LRCN [Donahue et al. CVPR’15]: A brown bear standing on top of a lush green field.
MSR CaptionBot [http://captionbot.ai/]: A large brown bear walking through a forest.
MSCOCO 80 classes
Existing captioners (MSCOCO): A horse standing in the dirt.
NOC (ours): Describe novel objects without paired image-caption data. An okapi standing in the middle of a field.
[Figure: NOC = visual classifiers + existing captioners (MSCOCO), init + train]
[Figure: two separate networks. Image network: CNN, Embed, Image-Specific Loss. Language model: LSTM, Embed, Text-Specific Loss.]
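[Figure: the same two networks, with GloVe embeddings in the language model: W_glove and its transpose W_glove^T around the LSTM. Image network: CNN, Embed, Image-Specific Loss. Language model: W_glove, LSTM, W_glove^T, Embed, Text-Specific Loss.]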
Example words: zebra; dress, tutu; cake, scone
[Figure: a captioning model trained on MSCOCO. Its parameters are initialized from the image network (CNN, Embed; Image-Specific Loss) and from the language model (W_glove, LSTM, W_glove^T, Embed; Text-Specific Loss). The outputs of its CNN branch and its LSTM branch are combined by an elementwise sum and trained with an Image-Text Loss.]
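The diagram labels the fusion step "Elementwise sum". A minimal sketch of that step follows, assuming both branches emit one score per vocabulary word and that a softmax follows the sum; neither detail is stated on the slide, and the function names and toy vocabulary are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_predict(image_scores, text_scores, vocab):
    """Sum the two branches' per-word scores, then pick the most likely word."""
    if not (len(image_scores) == len(text_scores) == len(vocab)):
        raise ValueError("both branches must score every vocabulary word")
    summed = [i + t for i, t in zip(image_scores, text_scores)]
    probs = softmax(summed)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs

vocab = ["horse", "okapi", "field"]
# The image branch strongly favors "okapi"; the text branch is nearly neutral.
word, probs = fuse_and_predict([0.2, 2.5, 0.1], [0.3, 0.4, 0.2], vocab)
```

Because the sum is elementwise, a strong visual score for a word can raise its probability even when the language branch has little evidence for it.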
[Overcoming Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]
[Figure: joint training with shared parameters. The image network, the MSCOCO captioning model (elementwise sum of its CNN and LSTM branches), and the language model share parameters and are trained jointly. The Image-Specific Loss, Image-Text Loss, and Text-Specific Loss together form the Joint-Objective Loss.]
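The Joint-Objective Loss groups the Image-Specific, Image-Text, and Text-Specific losses. A minimal sketch of such a combined objective follows, assuming an unweighted sum and cross-entropy for each term; both choices are assumptions for illustration, and the function names are hypothetical.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class or word."""
    return -math.log(probs[target_index])

def joint_objective(image_loss, image_text_loss, text_loss):
    """Assumed form: an unweighted sum of the three losses."""
    return image_loss + image_text_loss + text_loss

# Toy predictions from the three components for one training step.
image_loss = cross_entropy([0.7, 0.2, 0.1], 0)       # image network
image_text_loss = cross_entropy([0.1, 0.6, 0.3], 1)  # captioning model
text_loss = cross_entropy([0.25, 0.25, 0.5], 2)      # language model
total = joint_objective(image_loss, image_text_loss, text_loss)
```

Optimizing the three terms together, rather than one after another, is what lets shared parameters keep serving all three tasks.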
MSCOCO Paired Image-Sentence Data | MSCOCO Unpaired Image Data | MSCOCO Unpaired Text Data

Sentences: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “Someone is about to eat some pizza”; “A microwave is sitting on top of a kitchen counter”; “A kitchen counter with a microwave on it”
Image labels: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Eat, Pizza; Kitchen, Microwave

Held-out
Sentences: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “Someone is about to eat some pizza”; “A white plate topped with cheesy pizza and toppings.”; “A white refrigerator, stove, oven dishwasher and microwave”; “A kitchen counter with a microwave on it”
Image labels: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Pizza; Microwave

Sentences: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “A small elephant standing on top …”; “A hitter swinging his bat to hit the ball”; “A white plate topped with cheesy pizza and toppings.”; “A white refrigerator, stove, oven dishwasher and microwave”
Image labels: Baseball, batting, boy, swinging; Black, Train, Tracks; Pizza; Microwave; Two, elephants, Path, walking
F1 (Utility): Ability to recognize and incorporate new words. (Is the word/object mentioned in the caption?)
METEOR: Fluency and sentence quality.
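The F1 metric asks whether the novel word is mentioned in the generated caption. A minimal sketch of how such a score could be computed for one word follows, assuming a true positive is a caption that mentions the word for an image that contains the object; the tokenization, function names, and toy captions are hypothetical.

```python
def mentions(caption, word):
    """True if the word appears as a token in the caption (case-insensitive)."""
    tokens = [t.strip(".,!?") for t in caption.lower().split()]
    return word.lower() in tokens

def f1_for_word(word, captions, has_object):
    """F1 of 'caption mentions word' against 'image contains the object'."""
    tp = fp = fn = 0
    for caption, present in zip(captions, has_object):
        said = mentions(caption, word)
        if said and present:
            tp += 1
        elif said and not present:
            fp += 1
        elif present and not said:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

captions = [
    "A zebra standing in a field.",  # zebra present and mentioned
    "A horse standing in the dirt.",  # zebra present but not mentioned
    "A zebra on a road.",             # zebra mentioned but not present
]
score = f1_for_word("zebra", captions, has_object=[True, True, False])
```

F1 measures only whether the word is used in the right images; sentence quality is left to METEOR.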
[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR’15
[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR’16
LRCN [1]: Does not caption novel objects.
DCC [2]: Copies parameters for the novel … in training. (also not end-to-end)
F1 (Utility) METEOR (Fluency)
[Chart values: 43.78, 25.74, 6.10, 24.37, 40.16, 59.84]
Sunglass (n04355933). Error: Grammar. NOC: A sunglass mirror reflection of a mirror in a mirror.
Gymnast (n10153594). Error: Gender, Hallucination. NOC: A man gymnast in a blue shirt doing a trick on a skateboard.
Balaclava (n02776825). Error: Repetition. NOC: A balaclava black and white photo of a man in a balaclava.
Cougar (n02125311). Error: Description. NOC: A cougar with a cougar in its mouth.
MSCOCO
A okapi standing in the middle of a field.
Abir Das, Vasili Ramanishka, Jianming Zhang, Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Trevor Darrell