Captioning Images with Diverse Objects - PowerPoint PPT Presentation



SLIDE 1

Captioning Images with Diverse Objects

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell

UT Austin, UC Berkeley, Boston Univ.

SLIDE 2

Object Recognition

Can identify hundreds of categories of objects: 14M images, 22K classes [Deng et al. CVPR'09].

SLIDE 3

Visual Description

MSCOCO: 80 classes.

Berkeley LRCN [Donahue et al. CVPR'15]: "A brown bear standing on top of a lush green field."
MSR CaptionBot [http://captionbot.ai/]: "A large brown bear walking through a forest."

SLIDE 4

Novel Object Captioner (NOC)

We present the Novel Object Captioner, which can compose descriptions of hundreds of objects in context.

[Diagram: visual classifiers and existing captioners, initialized and trained on MSCOCO. An existing captioner describes an okapi image as "A horse standing in the dirt."; NOC (ours) describes novel objects without paired image-caption data: "An okapi standing in the middle of a field."]

SLIDE 5

Insights

  • 1. Need to recognize and describe objects (e.g., okapi) outside of image-caption datasets.

SLIDE 6

Insight 1: Train effectively on external sources

[Diagram: an image branch (CNN → Embed) trained with an image-specific loss, and a text branch (Embed → LSTM) trained with a text-specific loss.]

Visual features come from unpaired image data; the language model is trained on unannotated text data.

SLIDE 7

Insights

  • 2. Describe unseen objects (e.g., okapi) that are similar to objects seen in image-caption datasets (e.g., zebra).

SLIDE 8

Insight 2: Capture semantic similarity of words

[Diagram: both branches use GloVe word embeddings, with the LSTM's output projection tied to the input embedding (W_glove and its transpose W_glove^T). Semantically similar word pairs cluster together: zebra/okapi, dress/tutu, cake/scone.]
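The weight tying above (reusing the input embedding W_glove as the output projection W_glove^T) can be sketched with toy 2-D vectors. This is an illustrative assumption: the real model uses pretrained GloVe embeddings of much higher dimension, and the `matvec` helper and example vectors are made up for the demo.

```python
def matvec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) for row in rows]

# Input embedding W_glove: one row per vocabulary word (toy 2-D vectors).
vocab = ["zebra", "okapi", "dress", "tutu"]
W_glove = [
    [0.9, 0.1],   # zebra
    [0.8, 0.2],   # okapi  (close to zebra)
    [0.1, 0.9],   # dress
    [0.2, 0.8],   # tutu   (close to dress)
]

# A hidden state standing in for the LSTM output after seeing "zebra".
h = W_glove[vocab.index("zebra")]

# Tied output projection: scoring words with W_glove itself (W_glove^T as
# the weight matrix), so words with similar embeddings get similar scores.
logits = matvec(W_glove, h)
scores = dict(zip(vocab, logits))
assert scores["okapi"] > scores["dress"]  # okapi scores close to zebra
```

Because the output weights are the embedding itself, a novel word like "okapi" inherits sensible scores from its neighbor "zebra" even without paired caption data.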

SLIDE 9

Insight 2: Capture semantic similarity of words

[Diagram: the same shared-embedding setup applied within the MSCOCO caption model, so novel words (okapi, tutu, scone) inherit behavior from similar seen words (zebra, dress, cake).]

SLIDE 10

Combine to form a Caption Model

[Diagram: the caption network (CNN → Embed for the image, LSTM with tied GloVe embeddings for the text, combined by elementwise sum) is trained on MSCOCO with an image-text loss; its parameters are initialized from the separately pre-trained visual and language networks.]

Not different from existing caption models. Problem: forgetting.

SLIDE 11

Insights

  • 3. Overcome "forgetting", since pre-training alone is not sufficient.

[Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

SLIDE 12

Insight 3: Jointly train on multiple sources

[Diagram: the visual, caption, and language networks are trained jointly. The CNN parameters are shared between the visual and caption networks, and the LSTM/embedding parameters (W_glove, W_glove^T) are shared between the caption and language networks. Each branch keeps its own loss: image-specific, image-text, and text-specific.]

SLIDE 13

Novel Object Captioner (NOC) Model

[Diagram: the jointly trained networks from Slide 12, with the image-specific, image-text, and text-specific losses combined into a single joint-objective loss over the shared parameters.]
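A minimal sketch of the joint objective: the three per-branch losses are combined into one training signal over the shared parameters. The weighted sum, the equal default weights, and the `joint_objective` name are illustrative assumptions; the paper's exact weighting may differ.

```python
def joint_objective(image_loss, image_text_loss, text_loss,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the image-specific, image-text, and text-specific losses
    into a single scalar for joint training (assumed weighted sum)."""
    w_img, w_cap, w_txt = weights
    return w_img * image_loss + w_cap * image_text_loss + w_txt * text_loss

# Toy per-branch loss values for one training step.
total = joint_objective(0.5, 1.2, 0.3)
assert abs(total - 2.0) < 1e-9
```

Optimizing this single scalar keeps all three branches in play at every step, which is what prevents the caption branch from overwriting (forgetting) what the visual and language branches learned.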

SLIDE 14

Visual Network

[Diagram: CNN → Embed, trained with an image-specific loss.]

Network: VGG-16 with a multi-label loss (sigmoid cross-entropy).
Training Data: Unpaired image data.
Output: Vector of activations corresponding to scores for words in the vocabulary, e.g. impala: 0.86, green: 0.72, ..., cut: 0.04.
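The sigmoid cross-entropy multi-label loss named above can be sketched in a few lines. This is a plain-Python illustration over toy logits, not the actual VGG-16 training code; each vocabulary word gets an independent binary target.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_sigmoid_xent(logits, targets):
    """Mean sigmoid cross-entropy over all vocabulary words: each word is
    an independent binary prediction (present / absent in the image)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Toy vocabulary [impala, green, cut]; the image shows an impala on grass.
logits  = [2.0, 1.0, -3.0]
targets = [1.0, 1.0, 0.0]
loss = multilabel_sigmoid_xent(logits, targets)

# Confidently wrong logits cost more than roughly correct ones.
assert multilabel_sigmoid_xent([-2.0, -1.0, 3.0], targets) > loss
```

The sigmoid (rather than softmax) is what makes this multi-label: several words can be "on" for one image, matching the per-word scores shown on the slide.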

SLIDE 15

Language Model

[Diagram: Embed → LSTM, with the output projection (W_glove)^T tied to the input embedding W_glove, trained with a text-specific loss.]

Network: Single LSTM layer; predicts the next word w_{t+1} given the previous words w_0..w_t.
(W_glove)^T: Weights shared with the input embedding.
Training Data: Unannotated text data (BNC, ukWaC, Wikipedia, Gigaword).
Output: Vector of activations corresponding to scores for words in the vocabulary.

SLIDE 16

Caption Network

[Diagram: the CNN → Embed image branch and the LSTM text branch are combined by elementwise sum and trained on MSCOCO with an image-text loss.]

Network: Combines the outputs of the visual and text networks (softmax + cross-entropy loss).
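The combination step described above (elementwise sum of the two branches' per-word scores, followed by softmax cross-entropy) might look like the following sketch. The toy score vectors and the `caption_word_loss` name are assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def caption_word_loss(visual_scores, text_scores, target_index):
    """Elementwise sum of the branch scores, then softmax cross-entropy
    against the ground-truth next word."""
    combined = [v + t for v, t in zip(visual_scores, text_scores)]
    probs = softmax(combined)
    return -math.log(probs[target_index])

# Toy vocabulary [bear, field, pizza]; the ground-truth next word is "bear".
visual = [1.5, 0.3, -1.0]   # image evidence favors "bear"
text   = [0.8, 0.4,  0.1]   # language context also favors "bear"
loss = caption_word_loss(visual, text, target_index=0)
assert loss < caption_word_loss(visual, text, target_index=2)
```

Because the sum is elementwise over a shared vocabulary, a strong visual score for a novel word can push that word into the caption even when the language branch alone would not choose it.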

SLIDE 17

Caption Model

[Diagram: the caption network from Slide 16.]

Training Data (images): COCO images with multiple labels, e.g. bear, brown, field, grassy, trees, walking.
Training Data (captions): Captions from COCO, e.g. "A brown bear walking on a grassy field next to trees."

SLIDE 18

NOC Model: Train simultaneously

[Diagram: the full NOC model of Slide 13 - the visual, caption, and language networks are trained simultaneously with shared parameters under the joint-objective loss.]

SLIDE 19

Evaluation

  • Empirical: COCO held-out objects
    ○ In-domain [use images from COCO]
    ○ Out-of-domain [use ImageNet images for held-out concepts]
  • Ablations
    ○ Contribution of the embedding and of joint training
  • ImageNet
    ○ Quantitative
    ○ Human evaluation - objects not in COCO
    ○ Rare objects in COCO


SLIDE 21

Empirical Evaluation: COCO dataset, In-Domain setting

MSCOCO paired image-sentence data, MSCOCO unpaired image data, MSCOCO unpaired text data.

Example captions: "An elephant galloping in the green grass", "Two people playing ball in a field", "A black train stopped on the tracks", "Someone is about to eat some pizza", "A microwave is sitting on top of a kitchen counter", "A kitchen counter with a microwave on it".
Example image labels: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Eat, Pizza; Kitchen, Microwave.

SLIDE 22

Empirical Evaluation: COCO held-out dataset

MSCOCO paired image-sentence data, MSCOCO unpaired image data, MSCOCO unpaired text data. Captions mentioning held-out objects (e.g. pizza, microwave) are removed from the paired data but remain available as unpaired text, and the held-out objects remain in the unpaired image labels.

Example captions: "An elephant galloping in the green grass", "Two people playing ball in a field", "A black train stopped on the tracks", "A white plate topped with cheesy pizza and toppings.", "A white refrigerator, stove, oven, dishwasher and microwave", "A kitchen counter with a microwave on it".
Example image labels: Elephant, Galloping, Green, Grass; People, Playing, Ball, Field; Black, Train, Tracks; Pizza; Microwave.
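One way the held-out split sketched on this slide could be constructed is shown below. The substring matching, the word list, and the `split_paired_data` helper are simplifying assumptions, not the paper's exact protocol.

```python
# Held-out objects to remove from the paired image-caption data.
HELD_OUT = {"pizza", "microwave"}

def split_paired_data(pairs):
    """pairs: list of (image_labels, caption).
    Captions mentioning a held-out object are moved to text-only data;
    the rest stay as paired training data."""
    kept, unpaired_text = [], []
    for labels, caption in pairs:
        if any(w in caption.lower() for w in HELD_OUT):
            unpaired_text.append(caption)   # still usable as unannotated text
        else:
            kept.append((labels, caption))
    return kept, unpaired_text

pairs = [
    (["elephant", "grass"], "An elephant galloping in the green grass"),
    (["pizza", "plate"], "A white plate topped with cheesy pizza"),
]
kept, text_only = split_paired_data(pairs)
assert len(kept) == 1 and len(text_only) == 1
```

The key property is that "pizza" never appears in a paired example, yet the model still sees it through the image labels and the text-only captions.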

SLIDE 23

Empirical Evaluation: COCO

MSCOCO paired image-sentence data, MSCOCO unpaired image data, MSCOCO unpaired text data.

Example captions: "A small elephant standing on top of a dirt field", "A hitter swinging his bat to hit the ball", "A black train stopped on the tracks", "A white plate topped with cheesy pizza and toppings.", "A white refrigerator, stove, oven dishwasher and microwave".
Example image labels: Two, elephants, Path, walking; Baseball, batting, boy, swinging; Black, Train, Tracks; Pizza; Microwave.

  • The CNN is pre-trained on ImageNet.

SLIDE 24

Empirical Evaluation: Metrics

F1 (Utility): Ability to recognize and incorporate new words (is the word/object mentioned in the caption?).
METEOR: Fluency and sentence quality.
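A hedged sketch of the F1 (utility) computation: treating "the word appears in the generated caption" as a prediction and scoring it against whether the image actually contains the object. The `word_f1` helper and the counting protocol here are assumptions; the paper's exact protocol may differ.

```python
def word_f1(captions, has_object, word):
    """F1 for mentioning `word`: captions is a list of generated captions,
    has_object[i] is True if image i actually contains the object."""
    mentioned = [word in c.lower() for c in captions]
    tp = sum(1 for m, y in zip(mentioned, has_object) if m and y)
    fp = sum(1 for m, y in zip(mentioned, has_object) if m and not y)
    fn = sum(1 for m, y in zip(mentioned, has_object) if not m and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

caps = ["An okapi standing in a field",
        "A horse in the dirt",
        "An okapi near trees"]
truth = [True, True, False]  # images 0 and 1 actually contain an okapi
f1 = word_f1(caps, truth, "okapi")
assert abs(f1 - 0.5) < 1e-9  # precision 1/2, recall 1/2
```

F1 alone would reward stuffing object names into every caption, which is why it is paired with METEOR to check fluency.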

SLIDE 25

Empirical Evaluation: Baselines

LRCN [1]: Does not caption novel objects.
DCC [2]: Copies parameters for the novel object from a similar object seen in training (also not end-to-end).

[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR'15
[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR'16

SLIDE 26

Empirical Evaluation: Results

[Chart: F1 (utility) and METEOR (fluency) for LRCN [1], DCC [2], and NOC.]

SLIDE 27

Ablations

Evaluated on held-out COCO (MSCOCO) objects.

SLIDE 28

Ablation: Language Embedding

[Chart: contribution of the language embedding, alongside joint training and the frozen CNN, evaluated on MSCOCO.]

SLIDE 29

Ablation: Freeze CNN after pre-training

[Chart: effect of fixing the CNN after pre-training, alongside joint training and the language embedding, evaluated on MSCOCO.]

[Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

SLIDE 30

Ablation: Joint Training

[Chart: contribution of joint training, alongside the frozen CNN and the language embedding, evaluated on MSCOCO.]

SLIDE 31

ImageNet: Human Evaluations

  • ImageNet: 638 object classes not mentioned in COCO.

NOC can describe 582 object classes (60% more objects than prior work).

SLIDE 32

ImageNet: Human Evaluations

  • ImageNet: 638 object classes not mentioned in COCO.
  • Word Incorporation: Which model incorporates the word (the name of the object) into the sentence better?
  • Image Description: Which sentence (model) describes the image better?

SLIDE 33

ImageNet: Human Evaluations

[Chart: human preference results for Word Incorporation and Image Description; reported values: 43.78, 25.74, 6.10, 24.37, 40.16, 59.84.]

SLIDE 34

Qualitative Evaluation: ImageNet

[Example captions on ImageNet images.]

SLIDE 35

Qualitative Evaluation: ImageNet

[Further example captions on ImageNet images.]

SLIDE 36

Qualitative Examples: Errors

Sunglass (n04355933). Error: grammar. NOC: "A sunglass mirror reflection of a mirror in a mirror."
Gymnast (n10153594). Error: gender, hallucination. NOC: "A man gymnast in a blue shirt doing a trick on a skateboard."
Balaclava (n02776825). Error: repetition. NOC: "A balaclava black and white photo of a man in a balaclava."
Cougar (n02125311). Error: description. NOC: "A cougar with a cougar in its mouth."

SLIDE 37

Novel Object Captioner - Take-away

Semantic embeddings and joint training let NOC caption hundreds of objects, e.g. (trained on MSCOCO): "An okapi standing in the middle of a field."

SLIDE 38

Poster 11