Deep Image-Text Embeddings: Learning Deep Structure-Preserving Image-Text Embeddings

SLIDE 1

CS688 Paper Presentation 1

Deep Image-Text Embeddings

Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016)

Woobin Im (임우빈)

2016-11-08

SLIDE 2

Sentence-to-image Retrieval

User → Query text ("A cat next to a blue chair and a deck") → Retrieval system → Result image

SLIDE 3

Image-to-sentence Retrieval

User → Query image → Retrieval system → Result text ("A black and white cat laying on the carrying case of a computer")

SLIDE 4

Image-to-sentence Retrieval

User → Query image → Retrieval system → Result text ("A black and white cat laying on the carrying case of a computer")

The result text is selected from among a list of candidate sentences.

SLIDE 5

Image Description Generation

User → Query image → Retrieval system → Result text ("A black and white cat laying on the carrying case of a computer")

Here the result text is generated with NLP techniques.

SLIDE 6

Image-sentence Embeddings

Source: Accounting for the Relative Importance of Objects in Image Retrieval

Image representation → Projection
Text representation → Projection

SLIDE 7

Examples of image-to-sentence retrieval

Source: Associating neural word embeddings with deep image representations using Fisher Vectors

SLIDE 8

Datasets

  • MSCOCO, Flickr 8K, Flickr 30K, Pascal 1K …
  • Each image has a few captions
  • MSCOCO has object segmentation information
  • Flickr30K has phrase localizations

Source: Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Example from the Flickr30k Entities dataset

SLIDE 9

Paper

  • Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016)

Image feature, Text feature

SLIDE 10

Paper

  • Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016)

Image → Pretrained CNN → Image feature
Sentence → Word2vec → FV-HGLMM → Text feature

SLIDE 11

Paper

  • Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016)

Image branch: Image → Pretrained CNN (VGG) → Image feature → two fc layers with B-norm
Text branch: Sentence → Word2vec → FV-HGLMM → PCA → Text feature → two fc layers with B-norm
Both branches → Loss

SLIDE 13

Image feature extraction

  • Using pretrained VGG-VD-19

Resized image → 10 crops (4 corners + center = 5 crops, plus their flips) → image features (4096D) × 10 → averaging → image feature (4096D)
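
The crop-and-average step can be sketched as follows. This is a minimal illustration, not the presenter's code: the 256×256 input size, the 224×224 crop size, the horizontal direction of the flip, and the function `extract_feature` (a hypothetical stand-in for the VGG forward pass that returns a 3-D vector instead of a 4096-D one) are all assumptions made here.

```python
import numpy as np

def ten_crops(image, crop):
    """Return 10 crops: 4 corners + center, plus a flipped copy of each."""
    h, w = image.shape[:2]
    top_left = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
                ((h - crop) // 2, (w - crop) // 2)]
    crops = [image[t:t + crop, l:l + crop] for t, l in top_left]
    # Assumption: the flip is horizontal (reverse the width axis).
    crops += [c[:, ::-1] for c in crops]
    return np.stack(crops)

def extract_feature(crop_img):
    """Hypothetical stand-in for the CNN: per-channel mean of the crop."""
    return crop_img.reshape(-1, crop_img.shape[-1]).mean(axis=0)

def image_feature(image, crop):
    """Average the per-crop features into a single image feature."""
    feats = np.stack([extract_feature(c) for c in ten_crops(image, crop)])
    return feats.mean(axis=0)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))      # assumed resized input
crops = ten_crops(image, 224)          # assumed crop size
feature = image_feature(image, 224)
print(crops.shape, feature.shape)
```

With a real network, `extract_feature` would return the 4096-D activation for each crop, and the averaged result would be the 4096-D image feature.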

SLIDE 15

Text feature extraction

  • Word2Vec – word semantic embedding

Source: Distributed representations of words and phrases and their compositionality

SLIDE 16

Text feature extraction

  • Fisher Vector of (HGLMM + GMM)

Sentence → Word2Vec → word vectors
Word vectors → Hybrid Gaussian-Laplacian Mixture Model (HGLMM) → Fisher Vector
Word vectors → Gaussian Mixture Model (GMM) → Fisher Vector
The mixture models are trained with EM.
Concatenation of the Fisher Vectors → vector (18000D) → PCA → final vector (6000D)

From the work “Associating neural word embeddings with deep image representations using Fisher Vectors”

SLIDE 18

Loss Calculation

  • Structure-preserving triplet loss

i: anchor instance, j: matching instance, k: non-matching instance, x: image, y: sentence, d(x, y): Euclidean distance, m: margin

Loss terms: image-to-sentence, sentence-to-image, image structure preserving, text structure preserving
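
The equation itself did not survive on this slide, so the following is a reconstruction from the symbol legend and the four term labels, written in the notation of the CVPR 2016 paper as best recalled. Here x denotes an image embedding, y a sentence embedding, d the Euclidean distance, and m the margin; in each sum the first index is the anchor, the second a matching instance, and the third a non-matching instance. The weight names λ1, λ2, λ3 are assumed, since the slide's own symbols are not recoverable.

```latex
\begin{aligned}
L(X, Y) = {} & \sum_{i,j,k} \max\bigl[0,\; m + d(x_i, y_j) - d(x_i, y_k)\bigr]
  && \text{(image-to-sentence)} \\
& + \lambda_1 \sum_{i',j',k'} \max\bigl[0,\; m + d(x_{j'}, y_{i'}) - d(x_{k'}, y_{i'})\bigr]
  && \text{(sentence-to-image)} \\
& + \lambda_2 \sum_{i,j,k} \max\bigl[0,\; m + d(x_i, x_j) - d(x_i, x_k)\bigr]
  && \text{(image structure preserving)} \\
& + \lambda_3 \sum_{i',j',k'} \max\bigl[0,\; m + d(y_{i'}, y_{j'}) - d(y_{i'}, y_{k'})\bigr]
  && \text{(text structure preserving)}
\end{aligned}
```

Each term is a hinge on a triplet: it is zero once the matching instance is closer to the anchor than the non-matching one by at least the margin.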

SLIDE 19

Loss Calculation

  • Triplet loss?

Source: “FaceNet: A unified embedding for face recognition and clustering” (figure annotation: margin)
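
A single triplet term can be sketched in a few lines. This is a minimal illustration using plain Euclidean distance, as in the symbol legend on the loss slide; the function name, the margin value, and the toy points are assumptions chosen for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin):
    """Hinge on one triplet: zero when the matching instance is closer
    to the anchor than the non-matching one by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])        # distance 1 from the anchor

far_negative = np.array([3.0, 4.0])    # distance 5: constraint satisfied
near_negative = np.array([1.0, 0.0])   # distance 1: constraint violated

loss_far = triplet_loss(anchor, positive, far_negative, margin=0.5)
loss_near = triplet_loss(anchor, positive, near_negative, margin=0.5)
print(loss_far, loss_near)
```

Only the violated triplet contributes to the loss, so training effort concentrates on non-matching instances that sit too close to the anchor.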

SLIDE 20

Loss Calculation

  • Structure-preserving

Square: image, Circle: sentence

SLIDE 22

Evaluation

  • Task:
    • Image-to-sentence retrieval: given an image, find the nearest K sentences
    • Sentence-to-image retrieval: given a sentence, find the nearest K images
    • L2 distance
  • Dataset
    • MSCOCO
    • Flickr30K
  • Metric
    • Recall @ 1, 5, 10 (GT: 5 captions per image)
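
Recall @ K under L2 distance can be sketched as below, assuming a query counts as a hit when at least one of its ground-truth items appears among its K nearest neighbors. The function name, the argument layout, and the toy embeddings are assumptions made for illustration.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, gt, ks=(1, 5, 10)):
    """Fraction of queries with at least one ground-truth gallery item
    among the K nearest neighbors (L2 distance).

    gt[q] is the set of gallery indices that match query q.
    """
    diff = query_emb[:, None, :] - gallery_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=2)      # (num_queries, num_gallery)
    ranking = np.argsort(dist, axis=1)       # nearest first
    result = {}
    for k in ks:
        hits = [len(gt[q] & set(ranking[q, :k].tolist())) > 0
                for q in range(len(gt))]
        result[k] = float(np.mean(hits))
    return result

# Toy image-to-sentence setup: 2 images, 4 sentences.
images = np.array([[0.0, 0.0], [10.0, 10.0]])
sentences = np.array([[0.0, 1.0], [1.0, 1.0], [0.5, 0.0], [10.0, 9.0]])
ground_truth = [{0, 1}, {2, 3}]   # sentence 2 belongs to image 1 but lies near image 0

recalls = recall_at_k(images, sentences, ground_truth, ks=(1, 2))
print(recalls)
```

Swapping the roles of the two embedding matrices gives the sentence-to-image direction, where each sentence has a single ground-truth image.
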
SLIDE 23

Evaluation setting index

  • Net models
    • Linear: just one linear projection (one fc)
    • Non-linear: the network covered in the previous slides
  • Training constraints
    • One-directional: weight of the sentence-to-image term = 0
    • Bi-directional: weight of the sentence-to-image term = 1
    • Structure: weight of the text structure-preserving term = 0.1
    • Weight of the image structure-preserving term = 0 in all cases, since no images have the same caption

Loss terms: image-to-sentence, sentence-to-image, image structure preserving, text structure preserving
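
The way these weights switch terms on and off can be sketched with a brute-force version of the four-term loss. This is an illustrative reading of the slide rather than the authors' implementation: the parameter names `w_s2i`, `w_img`, `w_txt`, the `sent_img` array (the image index each sentence describes), and the toy embeddings are all assumptions.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def hinge_sum(dist, pos_mask, neg_mask, margin):
    """Sum of max(0, margin + d(anchor, pos) - d(anchor, neg)) over all triplets."""
    total = 0.0
    for a in range(dist.shape[0]):
        pos = dist[a, pos_mask[a]]
        neg = dist[a, neg_mask[a]]
        total += np.maximum(0.0, margin + pos[:, None] - neg[None, :]).sum()
    return float(total)

def structure_preserving_loss(X, Y, sent_img, margin, w_s2i, w_img, w_txt):
    n_img, n_sent = X.shape[0], Y.shape[0]
    match = np.arange(n_img)[:, None] == sent_img[None, :]   # image i, sentence j
    same_img = sent_img[:, None] == sent_img[None, :]        # sentence pairs
    # No two images share a caption, so images have no within-view neighbors.
    img_pos = np.zeros((n_img, n_img), dtype=bool)
    img_neg = ~np.eye(n_img, dtype=bool)
    txt_pos = same_img & ~np.eye(n_sent, dtype=bool)
    i2s = hinge_sum(pairwise_dist(X, Y), match, ~match, margin)
    s2i = hinge_sum(pairwise_dist(Y, X), match.T, ~match.T, margin)
    img = hinge_sum(pairwise_dist(X, X), img_pos, img_neg, margin)
    txt = hinge_sum(pairwise_dist(Y, Y), txt_pos, ~same_img, margin)
    return i2s + w_s2i * s2i + w_img * img + w_txt * txt

X = np.array([[0.0, 0.0], [4.0, 0.0]])                           # image embeddings
Y = np.array([[0.0, 1.0], [1.0, 0.0], [4.0, 1.0], [3.0, 0.0]])   # sentence embeddings
sent_img = np.array([0, 0, 1, 1])

one_dir = structure_preserving_loss(X, Y, sent_img, 1.0, w_s2i=0.0, w_img=0.0, w_txt=0.0)
bi_dir = structure_preserving_loss(X, Y, sent_img, 1.0, w_s2i=1.0, w_img=0.0, w_txt=0.0)
with_structure = structure_preserving_loss(X, Y, sent_img, 1.0, w_s2i=1.0, w_img=0.0, w_txt=0.1)
print(one_dir, bi_dir, with_structure)
```

In this toy configuration both ranking terms are already satisfied, so only the text structure term is non-zero: sentences of different images sit too close together relative to sentences of the same image.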

SLIDE 24

Result (Flickr30K)

  • Mean vector: mean of word2vec vectors in a sentence
  • Tf-idf: what we learned
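
The mean-vector baseline can be sketched as follows. The 3-D toy vectors and the choice to skip out-of-vocabulary words are assumptions for illustration; real word2vec vectors are learned from a corpus and have far more dimensions.

```python
import numpy as np

# Toy stand-ins for word2vec vectors (assumed values, 3-D for readability).
word_vectors = {
    "a": [0.0, 0.0, 1.0],
    "cat": [1.0, 0.0, 0.0],
    "chair": [0.0, 1.0, 0.0],
}

def mean_vector(sentence, vectors):
    """Average the vectors of the words in a sentence.

    Assumption: words missing from the vocabulary are skipped.
    """
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        raise ValueError("no known words in sentence")
    return np.mean([vectors[t] for t in tokens], axis=0)

sentence_vec = mean_vector("A cat next to a chair", word_vectors)
print(sentence_vec)
```

The average discards word order and weights every word equally, which is one reason a Fisher Vector representation can carry more information about the sentence.
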
SLIDE 25

Result (MSCOCO 1K test)

  • Mean vector: mean of word2vec vectors in a sentence
  • Tf-idf: what we learned
SLIDE 26

Additional application - Phrase localization on Flickr30K

  • Region proposal + text-image Embedding
SLIDE 27

Summary

  • Image-to-text & text-to-image retrieval
    • By embedding them into one space
  • Image feature: pretrained CNN
  • Text feature: word2vec + HGLMM + FV
  • Loss: structure-preserving triplet loss
  • Test:
    • Image-to-text & text-to-image retrieval
    • Phrase localization
SLIDE 28

Q&A