
Captioning Images with Diverse Objects - PowerPoint PPT Presentation



  1. Captioning Images with Diverse Objects. Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell. UT Austin, UC Berkeley, Boston Univ.

  2. Object Recognition. Can identify hundreds of categories of objects. ImageNet: 14M images, 22K classes [Deng et al. CVPR'09].

  3. Visual Description. Berkeley LRCN [Donahue et al. CVPR'15]: "A brown bear standing on top of a lush green field." MSR CaptionBot [http://captionbot.ai/]: "A large brown bear walking through a forest." (Both trained on MSCOCO's 80 classes.)

  4. Novel Object Captioner (NOC). We present the Novel Object Captioner, which can compose descriptions of 100s of objects in context and describe novel objects without paired image-caption data. Existing captioners (trained on MSCOCO): "A horse standing in the dirt." NOC (ours): "An okapi standing in the middle of a field." [Diagram: visual classifiers (okapi) + MSCOCO captioners; init + train.]

  5. Insights: 1. Need to recognize and describe objects outside of image-caption datasets. [Image: okapi]

  6. Insight 1: Train effectively on external sources. Image-specific loss: visual features from unpaired image data (CNN). Text-specific loss: a language model (Embed + LSTM) trained on unannotated text data.

  7. Insights: 2. Describe unseen objects that are similar to objects seen in image-caption datasets. [Images: okapi, zebra]

  8. Insight 2: Capture semantic similarity of words. Dense word embeddings (W_glove at the input, W_glove^T at the output) place related words near each other, e.g. zebra/okapi, cake/scone, dress/tutu. [Diagram: CNN with image-specific loss; Embed + LSTM with text-specific loss.]
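The intuition above can be sketched numerically: if a novel word embeds close to a word seen in the paired data, the shared embedding gives it similar scores. A minimal sketch with made-up 3-d vectors (NOC uses pretrained GloVe embeddings; the vectors and words here are illustrative only):

```python
import numpy as np

# Toy 3-d "embeddings": invented for illustration; NOC uses pretrained
# GloVe vectors for its shared input/output embedding W_glove.
vecs = {
    "zebra": np.array([0.9, 0.1, 0.3]),
    "okapi": np.array([0.8, 0.2, 0.3]),
    "cake":  np.array([0.1, 0.9, 0.1]),
}

def cosine(a, b):
    # Cosine similarity: high when two word vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "okapi" (unseen in captions) sits near "zebra" (seen), so caption
# parameters learned for "zebra" can transfer to "okapi".
sim_novel = cosine(vecs["okapi"], vecs["zebra"])
sim_far = cosine(vecs["okapi"], vecs["cake"])
```

Because the output projection is W_glove^T, words with similar embeddings receive similar scores from the language model, which is what lets learned sentence contexts carry over to unseen words.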

  9. Insight 2: Capture semantic similarity of words. (Same diagram as slide 8, with the MSCOCO image-caption data added.)

  10. Combine to form a Caption Model. The visual and language networks are merged by an elementwise sum into an image-text loss; their pre-trained parameters initialize the caption model's embedding (W_glove) and LSTM. This is not different from existing caption models. Problem: forgetting.

  11. Insights: 3. Overcome "forgetting", since pre-training alone is not sufficient. [Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

  12. Insight 3: Jointly train on multiple sources. The image-specific, image-text, and text-specific losses are trained jointly, with the embedding (W_glove) and LSTM parameters shared across the image, caption (MSCOCO), and text networks.

  13. Novel Object Captioner (NOC) Model. Joint-objective loss combining the image-specific, image-text, and text-specific losses; the embedding (W_glove) and LSTM parameters are shared across the three networks.

  14. Visual Network. Network: VGG-16 with a multi-label loss (sigmoid cross-entropy). Training data: unpaired image data. Output: a vector of activations corresponding to scores for words in the vocabulary (e.g. impala: 0.86, green: 0.72, cut: 0.04).
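A minimal sketch of the multi-label objective this slide names: each vocabulary word gets an independent sigmoid score, so one image can score high for several words at once. Values and shapes below are invented for illustration, not the paper's implementation:

```python
import numpy as np

# Sigmoid cross-entropy over independent per-word scores: a sketch of
# the multi-label loss on the slide, with invented toy values.
def sigmoid_cross_entropy(logits, labels):
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean())

logits = np.array([2.0, 1.0, -3.0])  # raw scores for [impala, green, cut]
labels = np.array([1.0, 1.0, 0.0])   # image is tagged both impala and green
loss = sigmoid_cross_entropy(logits, labels)
```

Unlike a softmax, the per-word sigmoids do not compete with each other, which is what lets one image score high for "impala" and "green" simultaneously.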

  15. Language Model. Network: a single LSTM layer; predicts the next word w_{t+1} given the previous words w_0..t, i.e. models P(w_{t+1} | w_0..t). (W_glove)^T: output weights shared with the input embedding. Training data: unannotated text data (BNC, ukWaC, Wikipedia, Gigaword). Output: a vector of activations corresponding to scores for words in the vocabulary.
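The weight sharing the slide describes, with (W_glove)^T as the output projection, can be sketched as follows; the LSTM step is replaced by a tanh stand-in and all sizes are toy values, so this only illustrates the tied input/output embedding:

```python
import numpy as np

# Tied input/output embedding: the same matrix W_glove embeds the input
# word and, transposed, maps the hidden state back to vocabulary scores.
rng = np.random.default_rng(0)
V, D = 5, 3                       # toy vocabulary and embedding sizes
W_glove = rng.normal(size=(V, D))

x_t = W_glove[2]                  # embed the input word with id 2
hidden = np.tanh(x_t)             # stand-in for one LSTM step
logits = hidden @ W_glove.T       # (W_glove)^T as the output projection
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()       # softmax: P(w_{t+1} | w_0..t)
```

Tying the output weights to the embedding means a word never seen in captions still gets sensible scores wherever its pretrained vector lands.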

  16. Caption Network. Network: combines the outputs of the visual and text networks by an elementwise sum (softmax + cross-entropy loss). [Diagram: CNN branch and Embed + LSTM branch summed; trained on MSCOCO.]
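A hedged sketch of the combination step: the two networks' per-word scores are summed elementwise and pushed through a softmax + cross-entropy loss against the next caption word. The score values below are made up:

```python
import numpy as np

# Elementwise-sum combination: visual word scores + language-model word
# scores, then softmax + cross-entropy against the next caption word.
def caption_loss(visual_scores, text_scores, target_word):
    logits = visual_scores + text_scores        # elementwise sum
    z = np.exp(logits - logits.max())
    probs = z / z.sum()                         # softmax
    return float(-np.log(probs[target_word]))   # cross-entropy

visual = np.array([0.1, 2.0, -1.0])  # strong visual evidence for word 1
text = np.array([0.3, 0.5, 0.2])     # language model is less decisive
loss_likely = caption_loss(visual, text, 1)
loss_unlikely = caption_loss(visual, text, 2)
```

Because the sum is elementwise over the shared vocabulary, strong visual evidence for a word can pull it into the caption even when the language model alone would not choose it.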

  17. Caption Model. Image-text loss. Training data: COCO images with multiple labels (bear, brown, field, grassy, trees, walking) and the corresponding COCO captions ("A brown bear walking on a grassy field next to trees").

  18. NOC Model: Train simultaneously. Joint-objective loss: the image-specific, image-text, and text-specific losses are optimized together, with the embedding (W_glove) and LSTM parameters shared across the image, caption (MSCOCO), and text networks.

  19. Evaluation
      ● Empirical: COCO held-out objects
        ○ In-domain [use images from COCO]
        ○ Out-of-domain [use ImageNet images for held-out concepts]
      ● Ablations
        ○ Embedding & joint-training contribution
      ● ImageNet
        ○ Quantitative
        ○ Human evaluation - objects not in COCO
        ○ Rare objects in COCO

  20. Evaluation (outline repeated from slide 19).

  21. Empirical Evaluation: COCO dataset, in-domain setting
      Paired image-sentence data | Unpaired image data (labels) | Unpaired text data
      "An elephant galloping in the green grass" | Elephant, Galloping, Green, Grass | "An elephant galloping in the green grass"
      "Two people playing ball in a field" | People, Playing, Ball, Field | "Two people playing ball in a field"
      "A black train stopped on the tracks" | Black, Train, Tracks | "A black train stopped on the tracks"
      "Someone is about to eat some pizza" | Eat, Pizza | "Someone is about to eat some pizza"
      "A kitchen counter with a microwave on it" | Kitchen, Microwave | "A microwave is sitting on top of a kitchen counter"

  22. Empirical Evaluation: COCO held-out dataset
      Paired image-sentence data | Unpaired image data (labels) | Unpaired text data
      "An elephant galloping in the green grass" | Elephant, Galloping, Green, Grass | "An elephant galloping in the green grass"
      "Two people playing ball in a field" | People, Playing, Ball, Field | "Two people playing ball in a field"
      "A black train stopped on the tracks" | Black, Train, Tracks | "A black train stopped on the tracks"
      (held out) | Pizza | "A white plate topped with cheesy pizza and toppings."
      (held out) | Microwave | "A white refrigerator, stove, oven dishwasher and microwave"

  23. Empirical Evaluation: COCO
      Paired image-sentence data | Unpaired image data (labels) | Unpaired text data
      "An elephant galloping in the green grass" | Two, elephants, Path, walking | "A small elephant standing on top of a dirt field"
      "Two people playing ball in a field" | Baseball, batting, boy, swinging | "A hitter swinging his bat to hit the ball"
      "A black train stopped on the tracks" | Black, Train, Tracks | "A black train stopped on the tracks"
      (held out) | Pizza | "A white plate topped with cheesy pizza and toppings."
      (held out) | Microwave | "A white refrigerator, stove, oven dishwasher and microwave"
      ● The CNN is pre-trained on ImageNet.

  24. Empirical Evaluation: Metrics. F1 (utility): ability to recognize and incorporate new words (is the word/object mentioned in the caption?). METEOR: fluency and sentence quality.
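The F1 (utility) metric can be sketched as a precision/recall computation over whether the held-out word is mentioned; this is an illustrative reimplementation with naive whitespace tokenization, not the paper's evaluation script:

```python
# Sketch of the F1 "utility" metric: did the model mention the held-out
# word when (and only when) the reference captions mention it?
def word_f1(captions, references, word):
    tp = fp = fn = 0
    for cap, refs in zip(captions, references):
        predicted = word in cap.lower().split()
        actual = any(word in r.lower().split() for r in refs)
        tp += predicted and actual
        fp += predicted and not actual
        fn += actual and not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `word_f1(["a zebra in a field"], [["the zebra grazes"]], "zebra")` scores a correct mention, while generating "a horse" for a zebra image counts as a miss.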

  25. Empirical Evaluation: Baselines. LRCN [1]: does not caption novel objects. DCC [2]: copies parameters for the novel object from a similar object seen in training (also not end-to-end).
      [1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR'15
      [2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR'16
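DCC's copy mechanism, as described above, can be sketched as taking the output-layer parameters of the closest seen word in an embedding space. The words, vectors, and 2-d parameter rows below are hypothetical toy values:

```python
import numpy as np

# Sketch of DCC-style transfer: copy the novel word's output-layer
# parameters from the closest *seen* word in an embedding space.
emb = {
    "zebra": np.array([1.0, 0.1, 0.0]),
    "cake":  np.array([0.0, 1.0, 0.2]),
    "okapi": np.array([0.9, 0.2, 0.1]),   # novel word, no caption data
}

def closest_seen(word, seen):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(seen, key=lambda s: cos(emb[word], emb[s]))

# Output-layer rows learned from paired caption data (toy values).
W_out = {"zebra": np.array([0.5, -0.2]), "cake": np.array([-0.1, 0.7])}

donor = closest_seen("okapi", ["zebra", "cake"])
W_out["okapi"] = W_out[donor].copy()      # explicit copy; not end-to-end
```

The explicit copy step is what makes DCC not end-to-end; NOC instead shares the embedding throughout training so no post-hoc transfer is needed.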

  26. Empirical Evaluation: Results. [Charts: F1 (utility) and METEOR (fluency) for LRCN [1], DCC [2], and NOC.]
      [1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR'15
      [2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR'16

  27. Ablations. Evaluated on held-out COCO objects.

  28. Ablation: Language Embedding. [Chart compares: language embedding, frozen CNN, joint training.]

  29. Ablation: Freeze the CNN after pre-training. [Chart compares: language embedding, frozen CNN, joint training.] [Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

  30. Ablation: Joint Training. [Chart compares: language embedding, frozen CNN, joint training.]

  31. ImageNet: Human Evaluations. ImageNet has 638 object classes that are not mentioned in COCO; NOC can describe 582 of these classes (60% more objects than prior work).

  32. ImageNet: Human Evaluations
      ● ImageNet: 638 object classes not mentioned in COCO
      ● Word Incorporation: which model incorporates the word (the name of the object) into the sentence better?
      ● Image Description: which sentence (model) describes the image better?

  33. ImageNet: Human Evaluations. [Charts: preference percentages for Word Incorporation (24.37 / 40.16 / 43.78) and Image Description (6.10 / 59.84 / 25.74).]

  34. Qualitative Evaluation: ImageNet [examples shown as images]

  35. Qualitative Evaluation: ImageNet [examples shown as images]

  36. Qualitative Examples: Errors
      ● Balaclava (n02776825). Error: repetition. NOC: "A balaclava black and white photo of a man in a balaclava."
      ● Sunglass (n04355933). Error: grammar. NOC: "A sunglass mirror reflection of a mirror in a mirror."
      ● Gymnast (n10153594). Error: gender, hallucination. NOC: "A man gymnast in a blue shirt doing a trick on a skateboard."
      ● Cougar (n02125311). Error: description. NOC: "A cougar with a cougar in its mouth."
