Why Did You Say That? Explaining and Diversifying Captioning Models

Why Did You Say That? Explaining and Diversifying Captioning Models Kate Saenko VQA Workshop, CVPR, July 26, 2017

Explaining: Top-down saliency guided by captions
http://ai.bu.edu/caption-guided-saliency/
Vasili Ramanishka (Boston University), Abir Das (Boston University), Jianming Zhang (Adobe Research), Kate Saenko (Boston University)

Captioning: A woman is cutting a piece of meat

Why did the network say that?

Captioning: A woman is .. cooking. A man is talking about… science

? A woman is cutting a piece of meat

Explaining the network’s captions
Predicted sentence: A woman is cutting a piece of meat
Can the network localize objects?

Related: Attention layers
“Attention Layers”: Sequentially process regions in a single image. Objective: Model learns “where to look” next.
• Soft attention adds a special attention layer
• Only spatial or only temporal
• Hard to do spatio-temporal attention
• Can we get salient regions without adding such layers?
[Figure: image captioning attention maps for “girl” and “teddy bear”, from Show, Attend and Tell, Xu et al. ICML’15]

Key idea: probe the network with a small part of the input
[Diagram: encoder-decoder network producing P(word)]
• No need for a special attention layer
• Get spatio-temporal attention for free

Encoder-decoder framework for video description (slide: Vasili Ramanishka)
Encoder: CNN (8x8x2048) → Average (1x2048) → LSTM
Decoder: LSTM → LSTM → … → LSTM, emitting one word per step (a, man, is, …, car)
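The averaging step in the encoder (an 8x8x2048 CNN feature map reduced to a 1x2048 vector before the LSTM) can be sketched in NumPy. This only illustrates the shapes on the slide: the random array stands in for real CNN activations, and `pool_frame` is a hypothetical helper name, not something from the talk.

```python
import numpy as np

def pool_frame(features: np.ndarray) -> np.ndarray:
    """Average an (H, W, C) feature map over its spatial grid, giving (1, C)."""
    h, w, c = features.shape
    return features.reshape(h * w, c).mean(axis=0, keepdims=True)

# Random values stand in for the CNN output of one video frame (assumption).
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((8, 8, 2048))

pooled = pool_frame(frame_features)
print(pooled.shape)  # (1, 2048)
```

The pooled vector is what the encoder LSTM would consume at each time step; the 8x8 spatial layout is discarded at this point, which is why a separate mechanism is needed to recover where in the frame a word comes from.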

Saliency Estimation (slide: Vasili Ramanishka)
[Diagram: CNN (8x8x2048) → 1x2048 → encoder LSTMs → decoder LSTMs, emitting one word per step (a, man, is, …, car)]

Saliency Estimation (slide: Vasili Ramanishka)
[Diagram: saliency for “A man is driving a car”, followed by normalization]
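A minimal sketch of the probing idea, under stated assumptions: each spatial descriptor is fed to the model on its own, the probability of the target word is recorded, and the resulting scores are normalized into a map. The linear-plus-softmax "decoder", the use of log-probability as the raw score, and the min-max normalization are all stand-ins chosen for illustration; the slides do not spell out the exact formulation, and the real model is an LSTM encoder-decoder.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def word_saliency(features, weights, word_id):
    """Probe a toy decoder with one descriptor at a time.

    features: (H, W, C) grid of descriptors.
    weights:  (C, V) toy decoder mapping a descriptor to word scores (assumption).
    Returns an (H, W) map, min-max normalized to [0, 1].
    """
    h, w, c = features.shape
    probs = softmax(features.reshape(h * w, c) @ weights)  # (H*W, V)
    scores = np.log(probs[:, word_id])  # log P(word | single descriptor)
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)
    return norm.reshape(h, w)

# Small made-up sizes keep the example fast; the slides use 8x8x2048.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 4, 16))
weights = rng.standard_normal((16, 5))

saliency = word_saliency(features, weights, word_id=2)
print(saliency.shape)
```

Because every descriptor is scored independently, the same loop extends to video by iterating over frames as well as locations, which is the sense in which spatio-temporal attention comes without an extra layer.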

Spatiotemporal saliency
Predicted sentence: A woman is cutting a piece of meat

Spatiotemporal saliency: woman, phone

Image captioning with the same architecture
[Diagram: CNN (WxHxC) → 1xC → LSTM; symbols v_i and h_i]

Image captioning with the same architecture
Input query: A man in a jacket is standing at the slot machine

Flickr30kEntities (Plummer et al., ICCV 2015)

Pointing game in Flickr30kEntities
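The slide does not define the pointing game, so the sketch below follows the common definition: a saliency map scores a hit when its maximum falls inside the ground-truth box for the queried word, and accuracy is the fraction of hits. The `(x0, y0, x1, y1)` inclusive box format, the function names, and the two toy maps are assumptions made for illustration.

```python
import numpy as np

def pointing_hit(saliency, box):
    """True if the saliency maximum lies inside box = (x0, y0, x1, y1), inclusive."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)

def pointing_accuracy(saliency_maps, boxes):
    """Fraction of maps whose peak lands inside the matching box."""
    hits = [pointing_hit(s, b) for s, b in zip(saliency_maps, boxes)]
    return sum(hits) / len(hits)

# Two made-up 8x8 maps, each with a single peak.
a = np.zeros((8, 8))
a[2, 5] = 1.0  # peak at x=5, y=2
b = np.zeros((8, 8))
b[7, 0] = 1.0  # peak at x=0, y=7

acc = pointing_accuracy([a, b], [(4, 1, 6, 3), (3, 3, 6, 6)])
print(acc)  # 0.5
```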

Comparison to Soft Attention on Flickr30kEntities
[Tables: attention correctness, pointing game accuracy, captioning performance]
[14] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness in neural image captioning, 2016; implementation of K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015.

Video summarization: predicted sentence

Video summarization: arbitrary query

Diversifying: Captioning Images with Diverse Objects
Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell
UT Austin, UC Berkeley, Boston Univ.

Object Recognition (slide: Subhashini Venugopalan)
Can identify 1000’s of categories of objects. 14M images, 22K classes [Deng et al. CVPR’09]

Visual Description (slide: Subhashini Venugopalan)
Berkeley LRCN [Donahue et al. CVPR’15]: A brown bear standing on top of a lush green field.
MSR CaptionBot [http://captionbot.ai/]: A large brown bear walking through a forest.
MSCOCO: 80 classes

Novel Object Captioner (NOC) (slide: Subhashini Venugopalan)
We present Novel Object Captioner, which can compose descriptions of 100s of objects in context.
Existing captioners (MSCOCO): A horse standing in the dirt.
NOC (ours): Describe novel objects without paired image-caption data. (Visual classifiers + MSCOCO; init + train): An okapi standing in the middle of a field.

Insights (slide: Subhashini Venugopalan)
1. Need to recognize and describe objects outside of image-caption datasets. (Example: okapi)

Insight 1: Train effectively on external sources (slide: Subhashini Venugopalan)
Image-Specific Loss: CNN and Embed; visual features from unpaired image data.
Text-Specific Loss: Embed and LSTM; language model from unannotated text data.

Insights (slide: Subhashini Venugopalan)
2. Describe unseen objects that are similar to objects seen in image-caption datasets. (Example: okapi, zebra)

Insight 2: Capture semantic similarity of words (slide: Subhashini Venugopalan)
[Diagram: Image-Specific Loss (CNN, Embed) and Text-Specific Loss (Embed, LSTM), with W_glove at the input and W_glove^T at the output; example word pairs: zebra / okapi, cake / scone, dress / tutu; MSCOCO]
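A rough sketch of why W_glove at the input and W_glove^T at the output helps with similar words: when the output layer scores words by a dot product with their embeddings, a state that points toward "zebra" also gives a high score to a word whose embedding is close, such as "okapi". The 3-dimensional vectors below are made up for illustration and are not real GloVe values; `embed` and `output_scores` are hypothetical names.

```python
import numpy as np

vocab = ["zebra", "okapi", "cake", "scone", "dress", "tutu"]

# Made-up 3-d vectors standing in for GloVe embeddings (assumption).
W_glove = np.array([
    [0.9, 0.1, 0.0],  # zebra
    [0.8, 0.2, 0.0],  # okapi
    [0.0, 0.9, 0.1],  # cake
    [0.1, 0.8, 0.1],  # scone
    [0.0, 0.1, 0.9],  # dress
    [0.1, 0.0, 0.8],  # tutu
])

def embed(word):
    """Input side: look up a word's row in W_glove."""
    return W_glove[vocab.index(word)]

def output_scores(hidden):
    """Output side: score every word with the transposed embedding matrix."""
    return W_glove @ hidden  # same as hidden @ W_glove.T for a 1-d hidden state

hidden = embed("zebra")  # pretend the LSTM state points toward "zebra"
scores = output_scores(hidden)
ranked = [vocab[i] for i in np.argsort(-scores)]
print(ranked[:2])  # ['zebra', 'okapi']
```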

Combine to form a Caption Model (slide: Subhashini Venugopalan)
[Diagram: Image-Specific Loss, Image-Text Loss, and Text-Specific Loss networks; the image-text network takes an elementwise sum of the image and text branches; CNN, Embed, LSTM, W_glove and W_glove^T blocks labeled “init parameters”; MSCOCO]
Not different from existing caption models. Problem: Forgetting.

Insights (slide: Subhashini Venugopalan)
3. Overcome “forgetting” since pre-training alone is not sufficient. [Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

Insight 3: Jointly train on multiple sources (slide: Subhashini Venugopalan)
[Diagram: same three networks, now labeled “joint training” and “shared parameters”; MSCOCO]

Novel Object Captioner (NOC) Model (slide: Subhashini Venugopalan)
Joint-Objective Loss: Image-Specific Loss, Image-Text Loss, Text-Specific Loss
[Diagram: same three networks with elementwise sum, joint training, and shared parameters; MSCOCO]
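One way to picture the joint objective is as three losses computed from shared scores and added together, with the caption branch using the elementwise sum of the image and text branches. The sketch below is a simplification under stated assumptions: each term is a single-label cross-entropy, the three terms are summed without weights, and plain score vectors replace the CNN and LSTM. None of these details are given on the slide.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, target):
    return -float(np.log(probs[target]))

def joint_objective(image_scores, text_scores, image_label, next_word, caption_word):
    """Sum of image-specific, image-text, and text-specific losses (toy version)."""
    image_loss = cross_entropy(softmax(image_scores), image_label)
    text_loss = cross_entropy(softmax(text_scores), next_word)
    # Caption branch: elementwise sum of the two branches' word scores.
    caption_loss = cross_entropy(softmax(image_scores + text_scores), caption_word)
    return image_loss + caption_loss + text_loss

# With all-zero scores over a 4-word vocabulary, every distribution is uniform.
total = joint_objective(np.zeros(4), np.zeros(4),
                        image_label=0, next_word=1, caption_word=2)
print(total)  # equals 3 * log(4)
```

Because all three terms read the same score vectors, a gradient step on the paired caption data cannot freely overwrite what the image-only and text-only terms need, which is the motivation the slides give for joint training over pre-training alone.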

Empirical Evaluation: COCO dataset, in-domain setting (slide: Subhashini Venugopalan)
MSCOCO paired image-sentence data: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “Someone is about to eat some pizza”; “A kitchen counter with a microwave on it”
MSCOCO unpaired image data: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Eat, Pizza; Kitchen, Microwave
MSCOCO unpaired text data: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “Someone is about to eat some pizza”; “A microwave is sitting on top of a kitchen counter”

Empirical Evaluation: COCO heldout dataset (slide: Subhashini Venugopalan)
MSCOCO paired image-sentence data: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “Someone is about to eat some pizza”; “A kitchen counter with a microwave on it”
MSCOCO unpaired image data: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Pizza; Microwave
MSCOCO unpaired text data: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”; “A white plate topped with cheesy pizza and toppings.”; “A white refrigerator, stove, oven dishwasher and microwave”
Slide label: Held-out

Empirical Evaluation: COCO (slide: Subhashini Venugopalan)
MSCOCO paired image-sentence data: “An elephant galloping in the green grass”; “Two people playing ball in a field”; “A black train stopped on the tracks”
MSCOCO unpaired image data: Two, elephants, Path, walking; Baseball, batting, boy, swinging; Black, Train, Tracks; Pizza; Microwave
MSCOCO unpaired text data: “A small elephant standing on top of a dirt field”; “A hitter swinging his bat to hit the ball”; “A black train stopped on the tracks”; “A white plate topped with cheesy pizza and toppings.”; “A white refrigerator, stove, oven dishwasher and microwave”
● CNN is pre-trained on ImageNet

Empirical Evaluation: Metrics (slide: Subhashini Venugopalan)
F1 (Utility): Ability to recognize and incorporate new words. (Is the word/object mentioned in the caption?)
METEOR: Fluency and sentence quality.
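The F1 (utility) check, "is the word/object mentioned in the caption?", can be sketched as a per-word F1 over generated captions. The exact evaluation protocol is not given on the slide, so the whitespace tokenization, the `has_object` ground-truth flags, and the third caption are assumptions made for this example; the first two captions are the ones quoted earlier in the talk.

```python
def novel_word_f1(captions, has_object, word):
    """F1 for mentioning `word` in captions of images that do or do not contain it."""
    mentioned = [word in c.lower().split() for c in captions]
    tp = sum(m and h for m, h in zip(mentioned, has_object))
    fp = sum(m and not h for m, h in zip(mentioned, has_object))
    fn = sum(h and not m for m, h in zip(mentioned, has_object))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

captions = [
    "An okapi standing in the middle of a field",  # okapi image, word used
    "A horse standing in the dirt",                # okapi image, word missed
    "An okapi walking on a path",                  # made-up: no okapi in image
]
has_object = [True, True, False]

score = novel_word_f1(captions, has_object, "okapi")
print(score)  # 0.5
```

A metric of this kind only checks that the new word appears where it should, which is why the slide pairs it with METEOR for fluency and sentence quality.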
