Image Captioning


23rd International Conference on MultiMedia Modeling (MMM 2017). What Convnets Make for Image Captioning? Yu Liu*, Yanming Guo*, and Michael S. Lew. Leiden Institute of Advanced Computer Science, Leiden University. Presenter: Yanming Guo.


SLIDE 1

Discover the world at Leiden University

What Convnets Make for Image Captioning?

Yu Liu*, Yanming Guo*, and Michael S. Lew
Leiden Institute of Advanced Computer Science, Leiden University
Presenter: Yanming Guo

23rd International Conference on MultiMedia Modeling (MMM 2017)

SLIDE 2

Image Captioning

Describe an image with meaningful and sensible sentence-level captions.

  • Objects
  • Actions
  • Descriptive words
  • Relations

A large bus sitting next to a very tall building

SLIDE 3

Image Captioning

  • Retrieval approaches: map images to pre-defined sentences
  • Generative approaches: estimate novel sentences

A white dog and a brown dog run along side each other at the beach; A dog running on a wet suit on the beach

SLIDE 4

Image Captioning

  • Retrieval approaches: map images to pre-defined sentences
  • Generative approaches: estimate novel sentences

Advantages of generative approaches:

  • The caption does not have to be previously seen
  • A good language model
  • More intelligent
  • Better performance

SLIDE 5

General Structure

[Figure: a CNN extracts high-level image features and an RNN generates a sentence of words. In the example, the RNN receives START, “White”, “Cup” and emits “White”, “Cup”, END. A “?” appears in the diagram.]

SLIDE 6

General Structure

[Figure: the CNN and RNN diagram again (START, “White”, “Cup” in; “White”, “Cup”, END out), with the “?” still in place.]

What Convnets make for image captioning?
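
The CNN-to-RNN structure on this slide can be sketched as a greedy decoding loop. This is a minimal illustration only: `toy_step` is a hypothetical, hard-coded stand-in for a trained RNN, and using the image feature as the initial state is an assumption, not something the slides specify.

```python
START, END = "START", "END"


def greedy_decode(image_feature, step, max_len=20):
    """Generate words one at a time until END is produced or max_len is hit."""
    words = []
    state = image_feature  # assumption: the image feature initialises the state
    prev_word = START
    for _ in range(max_len):
        word, state = step(prev_word, state)
        if word == END:
            break
        words.append(word)
        prev_word = word
    return words


def toy_step(prev_word, state):
    """Hypothetical stand-in for one RNN step: a fixed lookup table."""
    table = {START: "White", "White": "Cup", "Cup": END}
    return table[prev_word], state


caption = greedy_decode(image_feature=[0.1, 0.7], step=toy_step)
print(caption)  # ['White', 'Cup']
```

A real decoder would replace `toy_step` with a learned recurrent cell that returns a distribution over the vocabulary; the loop itself stays the same.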

SLIDE 7

Three types of Convnets

  • Single-label Convnet: Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG …; gives a generic representation
  • Multi-label Convnet: Convnet fine-tuned on the 80 object categories of MS COCO; captures salient objects
  • Multi-attribute Convnet: Convnet fine-tuned on attributes of MS COCO (e.g. 300 attributes); captures salient objects, actions, relations…
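
The difference between single-label and multi-label (or multi-attribute) training can be sketched through the target vectors and losses. The slides do not state which losses were used, so the softmax and sigmoid cross-entropies below are assumptions, and the four-word vocabulary is a toy example.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Single-label loss: exactly one class is correct."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def sigmoid_cross_entropy(logits, targets):
    """Multi-label loss: every class is an independent yes/no decision."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of -[t*log(s(z)) + (1-t)*log(1-s(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def multi_hot(present, vocabulary):
    """Build a 0/1 target vector marking every label present in the image."""
    return [1.0 if name in present else 0.0 for name in vocabulary]


vocabulary = ["person", "dog", "boat", "bus"]  # toy vocabulary, not the full label set
targets = multi_hot({"person", "dog", "boat"}, vocabulary)
logits = [2.0, 1.5, 3.0, -2.0]  # made-up network outputs

print(targets)
print(softmax_cross_entropy(logits, target_index=2))
print(sigmoid_cross_entropy(logits, targets))
```

With a multi-hot target the network is rewarded for recognising every salient label at once, which matches the slide's contrast between a generic representation and one focused on salient objects or attributes.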

SLIDE 8

Three types of Convnets

[Figure: visualization of the most activated feature map in conv5_3. Panels: input image, single-label Convnet, multi-label Convnet, multi-attribute Convnet.]

SLIDE 9

Multi-Convnet Aggregation

[Figure: the single-label, multi-label and multi-attribute features are combined into an aggregation feature ag(x). An unrolled LSTM takes y_{t-1} together with ag(x) at each step and outputs q_t, for t = 1, …, T.]
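
One way to picture the aggregation feature is sketched below. The slides do not say how ag(x) is computed, so L2-normalising each feature and concatenating them is purely an assumption, as are the function names and the toy dimensions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def aggregate(single_label, multi_label, multi_attribute):
    """Assumed ag(x): concatenate the three normalised Convnet features."""
    aggregated = []
    for feature in (single_label, multi_label, multi_attribute):
        aggregated.extend(l2_normalize(feature))
    return aggregated


# Toy features; real ones would have hundreds or thousands of dimensions.
ag = aggregate([3.0, 4.0], [0.0, 2.0], [1.0])
print(ag)

# The same ag(x) is supplied at every time step of the unrolled LSTM.
num_steps = 4
per_step_inputs = [ag for _ in range(num_steps)]
print(len(per_step_inputs))
```

Normalising before concatenation keeps a high-dimensional feature from dominating the others purely through scale; other fusion rules (sums, learned projections) would fit the same diagram.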

SLIDE 10

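Multi-Scale Testing

[Figure: the test image is processed at scales 224 (CNN), 256 (FCN) and 320 (FCN). The FCNs are transferred from the CNN, the resulting features are averaged, and the average is fed to the LSTM (input xt) for caption generation.]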
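
The averaging step in multi-scale testing can be sketched as follows. It assumes each scale yields a spatial grid of feature vectors (larger inputs give larger grids when the network is fully convolutional), that each grid is average-pooled to one vector, and that the per-scale vectors are then averaged. The grid sizes and values are made up for illustration.

```python
def spatial_average(feature_map):
    """Average a grid (rows x cols) of feature vectors into one vector."""
    cells = [cell for row in feature_map for cell in row]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def multi_scale_feature(feature_maps):
    """Pool each scale's grid, then average the pooled vectors across scales."""
    pooled = [spatial_average(fm) for fm in feature_maps]
    dim = len(pooled[0])
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]


# Toy 2-dimensional features for three scales (values are invented).
scale_224 = [[[1.0, 0.0]]]                  # 1 x 1 grid
scale_256 = [[[1.0, 2.0], [3.0, 2.0]],
             [[1.0, 2.0], [3.0, 2.0]]]      # 2 x 2 grid
scale_320 = [[[3.0, 4.0]]]                  # kept 1 x 1 here for brevity

x = multi_scale_feature([scale_224, scale_256, scale_320])
print(x)  # [2.0, 2.0]
```

The resulting vector has the same dimension as a single-scale feature, so the LSTM needs no changes to consume it.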

SLIDE 11

Experiments

  • BLEU: measures the precision of n-grams between the generated and reference sentences (e.g. B-1, B-2, B-3, B-4).
  • METEOR: computed from the alignment between the words in the generated and reference sentences.
  • ROUGE-L: focuses on the set of words that appear in the same order in the two sentences (their longest common subsequence).
  • CIDEr: uses tf-idf weights when scoring each n-gram.
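
Two of the ingredients behind these metrics can be sketched in a few lines: the clipped n-gram precision that BLEU builds on, and the longest common subsequence that ROUGE-L builds on. This is not a full implementation of either metric (BLEU also applies a brevity penalty and combines several n-gram orders). The example sentences are a generated caption and its ground truth from the qualitative results later in the deck, lower-cased and split on spaces.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


candidate = "a man and a dog on a boat".split()
reference = "a man and a dog on a small yellow boat".split()

print(clipped_precision(candidate, reference, 1))  # 1.0
print(clipped_precision(candidate, reference, 2))  # 6 of 7 bigrams match
print(lcs_length(candidate, reference))            # 8
```

Every candidate word appears in the reference, so the unigram precision is perfect, but the bigram "a boat" does not, which is why higher-order scores such as B-4 are harder to raise.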

SLIDE 12

Experiments

  • Multi-scale: considerable improvement
  • SL-Net (single-label Convnet): largest dimension and worst performance
  • ML-Net (multi-label Convnet): smallest dimension and considerable improvement
  • MA-Net (multi-attribute Convnet): medium dimension and significant improvement

SLIDE 13

Experiments

  • Multi-scale testing using FCNs is always better than single-scale testing.
  • Aggregating different Convnets can enhance performance.

SLIDE 14

Experiments

  • Multi-Convnet aggregation: A man and a dog on a small boat.
  • Single-label Convnet: A man is sitting on the water with a surfboard.
  • Multi-label Convnet: A man sitting on a boat in front of a boat.
  • Multi-attribute Convnet: A man and a dog on a boat.
  • Ground truth: A man and a dog on a small yellow boat.

SLIDE 15

Experiments

SLIDE 16

Experiments

  • Ours: A man riding a wave in the ocean.
    GT: A man riding a wave on a surfboard in the ocean.
  • Ours: A living room with a lot of furniture.
    GT: Living room with furniture with garage door at one end.
  • Ours: A man riding a horse at a horse.
    GT: A horse that threw a man off a horse.
  • Ours: A close up of an elephant with an elephant.
    GT: A man getting a kiss on the neck from an elephant's trunk.

SLIDE 17

Conclusion

  • The multi-attribute Convnet performs better for image captioning
  • The aggregation of different Convnets can deliver slightly better performance than each individual Convnet
  • Efficient multi-scale augmentation test using FCNs
  • Comparable results with the state-of-the-art

SLIDE 18

Thanks for your attention! Questions please?