image captioning

Image Captioning: Describe an image with meaningful and sensible sentence-level captions - PowerPoint PPT Presentation

23rd International Conference on MultiMedia Modeling (MMM 2017). What Convnets Make for Image Captioning? Yu Liu*, Yanming Guo*, and Michael S. Lew, Leiden Institute of Advanced Computer Science, Leiden University. Presenter: Yanming Guo.


  1. 23rd International Conference on MultiMedia Modeling (MMM 2017)
     What Convnets Make for Image Captioning?
     Yu Liu*, Yanming Guo*, and Michael S. Lew
     Leiden Institute of Advanced Computer Science, Leiden University
     Presenter: Yanming Guo
     Discover the world at Leiden University

  2. Image Captioning
     Describe an image with meaningful and sensible sentence-level captions.
     - Objects
     - Actions
     - Descriptive words
     - Relations
     - …
     Example caption: "A large bus sitting next to a very tall building"

  3. Image Captioning
     - Retrieval approaches ---- Map images to pre-defined sentences
     - Generative approaches ---- Estimate novel sentences
     Example captions: "A white dog and a brown dog run along side each other at the beach"; "A dog running on a wet suit on the beach"

  4. Image Captioning
     - Retrieval approaches ---- Map images to pre-defined sentences
     - Generative approaches ---- Estimate novel sentences
     Advantages:
     - Caption does not have to be previously seen
     - A good language model
     - More intelligent
     - Better performance

  5. General Structure
     [Diagram: a CNN extracts high-level image features; an RNN generates a sentence of words, taking START, "White", "Cup", … as inputs and emitting "White", "Cup", …, END.]
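The CNN-to-RNN pipeline on slide 5 can be sketched as a greedy decoding loop. This is a toy illustration, not the authors' model: `cnn_features`, `rnn_step`, and the five-word vocabulary are hypothetical stand-ins chosen so the example runs without any deep-learning library, and a real RNN would also carry a hidden state between steps.

```python
# Toy sketch of CNN -> RNN caption generation with greedy decoding.
# All names below are hypothetical stand-ins, not the authors' code.

VOCAB = ["START", "white", "black", "cup", "END"]


def cnn_features(image):
    """Stand-in for a CNN: summarise the image as its mean intensity."""
    return [sum(image) / len(image)]


def rnn_step(features, prev_word):
    """Stand-in for one RNN step: score every word given the image
    features and the previously generated word."""
    scores = {word: 0.0 for word in VOCAB}
    if prev_word == "START":
        # The image features decide the first word (bright -> "white").
        scores["white" if features[0] > 0.5 else "black"] = 1.0
    elif prev_word in ("white", "black"):
        scores["cup"] = 1.0
    else:
        scores["END"] = 1.0
    return scores


def greedy_caption(image, max_len=10):
    """Feed START, then repeatedly feed back the best-scoring word
    until END is produced or max_len words have been generated."""
    features = cnn_features(image)
    word, caption = "START", []
    for _ in range(max_len):
        scores = rnn_step(features, word)
        word = max(scores, key=scores.get)
        if word == "END":
            break
        caption.append(word)
    return caption
```

Calling `greedy_caption([0.9, 0.8])` walks START, "white", "cup", END in the same order as the diagram, while a darker toy image takes the "black" branch instead.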

  6. General Structure
     [Diagram: the same CNN and RNN pipeline as on slide 5]
     What Convnets make for image captioning?

  7. Three types of Convnets
     [Diagram: Single-label → fine-tune → Multi-label, Multi-attribute]
     - Single-label Convnet: generic representation ---- Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG …
     - Multi-label Convnet: salient objects ---- Fine-tune Convnet on the 80 object categories of MS COCO
     - Multi-attribute Convnet: salient objects, actions, relations… ---- Fine-tune Convnet on attributes of MS COCO (e.g. 300 attributes)
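To make the difference between single-label and multi-label training concrete, here is a minimal sketch of a multi-label target and loss. The slide only says the Convnet is fine-tuned on the 80 MS COCO object categories; the sigmoid binary cross-entropy loss, the helper names, and the category indices below are assumptions for illustration, not details taken from the talk.

```python
import math

NUM_CATEGORIES = 80  # MS COCO object categories, as stated on the slide


def multi_hot(label_ids, num_classes=NUM_CATEGORIES):
    """Encode the set of categories present in an image as a 0/1 vector.
    A single-label target would have exactly one 1; here several may be set."""
    target = [0.0] * num_classes
    for index in label_ids:
        target[index] = 1.0
    return target


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, target):
    """Mean binary cross-entropy over independent per-category sigmoids.
    ASSUMPTION: the slides do not name the fine-tuning loss."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, target):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(target)


# Hypothetical image containing categories 3 and 17.
target = multi_hot([3, 17])
uninformed = bce_loss([0.0] * NUM_CATEGORIES, target)
confident = bce_loss([10.0 if t else -10.0 for t in target], target)
```

With all logits at zero every category gets probability 0.5, so `uninformed` equals log 2; logits that agree with the target drive `confident` close to zero. The multi-attribute case would follow the same pattern with a longer target vector (e.g. 300 attributes).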

  8. Three types of Convnets
     [Figure: visualization of the most activated feature map in conv5_3 for an input image, shown for the single-label, multi-label, and multi-attribute Convnets]

  9. Multi-Convnet Aggregation
     [Diagram: the single-label, multi-label, and multi-attribute features are combined into an aggregation feature ag(x); each LSTM step receives ag(x) together with an input y_0, y_1, …, y_{T−1} and produces an output q_1, q_2, …, q_T.]

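  10. Multi-Scale Testing
     [Diagram: the image is processed at scales 224, 256, and 320; the CNN handles the 224 scale and is transferred to FCNs for the 256 and 320 scales; the outputs are averaged into x_t and passed to the LSTM for caption generation.]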
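The averaging step in multi-scale testing can be sketched as below. This is an illustration under stated assumptions: the slide shows only that the outputs at scales 224, 256, and 320 are averaged, so the spatial mean-pooling of each FCN output, the function names, and the tiny 2-dimensional features are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / count for d in range(dim)]


def multi_scale_feature(maps_by_scale):
    """Pool each scale's output map to one vector, then average the scales.

    maps_by_scale maps an input size to a list of per-location feature
    vectors. ASSUMPTION: a fully convolutional net yields more locations
    for larger inputs, and locations are pooled by a plain spatial mean.
    """
    pooled = [mean_vector(locations) for locations in maps_by_scale.values()]
    return mean_vector(pooled)


# Hypothetical outputs: 1 location at 224, 4 at 256, 16 at 320.
maps = {
    224: [[1.0, 2.0]],
    256: [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [3.0, 4.0]],
    320: [[3.0, 4.0]] * 16,
}
x_t = multi_scale_feature(maps)
```

Because the same network weights are reused at every scale, this kind of test-time augmentation needs no extra training; only the pooling and averaging are added.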

  11. Experiments
     - BLEU: measures the precision of n-grams between the generated and reference sentences (e.g. B-1, B-2, B-3, B-4).
     - METEOR: computed from the alignment between the words of the generated and reference sentences.
     - ROUGE-L: focuses on the words that appear in the same order in both sentences (the longest common subsequence).
     - CIDEr: uses tf-idf weights for each n-gram.
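Two of these metrics are easy to illustrate by hand. The sketch below computes the clipped n-gram precision that BLEU is built on and the longest-common-subsequence length that ROUGE-L is built on, using the aggregation caption and ground truth from slide 14 (lowercased, punctuation removed). It is not a full implementation: BLEU's brevity penalty and geometric mean over n, and ROUGE-L's F-measure, are omitted, and the function names are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


generated = "a man and a dog on a small boat".split()
ground_truth = "a man and a dog on a small yellow boat".split()

p1 = clipped_precision(generated, ground_truth, 1)  # 9 of 9 unigrams match
p2 = clipped_precision(generated, ground_truth, 2)  # 7 of 8 bigrams match
lcs = lcs_length(generated, ground_truth)           # all 9 words, in order
```

Only the bigram ("small", "boat") is missing from the ground truth, so `p2` is 7/8 = 0.875 while `p1` is 1.0; every generated word appears in order in the ground truth, so the LCS covers all 9 words.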

  12. Experiments
     - Multi-scale: considerable improvement
     - SL-Net: largest dimension & worst performance
     - ML-Net: smallest dimension & considerable improvement
     - MA-Net: medium dimension & significant improvement

  13. Experiments
     - Multi-scale testing using FCN is always better
     - The aggregation of different Convnets can enhance the performance

  14. Experiments
     - Single-label Convnet: A man is sitting on the water with a surfboard.
     - Multi-label Convnet: A man sitting on a boat in front of a boat.
     - Multi-attribute Convnet: A man and a dog on a boat.
     - Multi-Convnet aggregation: A man and a dog on a small boat.
     - Ground truth: A man and a dog on a small yellow boat.

  15. Experiments

  16. Experiments
     - Ours: A man riding a wave in the ocean. GT: A man riding a wave on a surfboard in the ocean.
     - Ours: A living room with a lot of furniture. GT: Living room with furniture with garage door at one end.
     - Ours: A man riding a horse at a horse. GT: A horse that threw a man off a horse.
     - Ours: A close up of an elephant with an elephant. GT: A man getting a kiss on the neck from an elephant's trunk.

  17. Conclusion
     - Multi-attribute Convnet performs better for image captioning
     - The aggregation of different Convnets can deliver slightly better performance than each individual Convnet
     - Efficient multi-scale augmentation test using FCNs
     - Comparable results with the state-of-the-art

  18. Thanks for your attention! Questions please?
