
SLIDE 1

DEEP SEMANTIC-VISUAL EMBEDDING WITH LOCALIZATION

Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord

Thursday 4th October, 2018

SLIDE 2

Tasks


Visual grounding of phrases: localize any textual query in a given image.
Cross-modal retrieval: match images and captions in both directions.

Query: A cat on a sofa
SLIDE 3

Semantic visual embedding


2D semantic-visual space example:

(Figure: embedded captions "A cat on a sofa", "A dog playing", "A car" placed near their matching images.)

  • Distance in the space has a semantic interpretation.
  • Retrieval is done by finding nearest neighbors.
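The nearest-neighbor retrieval rule above can be sketched in a few lines of NumPy. The embeddings here are random stand-ins for the model's outputs, and `nearest_neighbors` is an illustrative helper name, not the authors' code:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each embedding onto the unit sphere so that
    # cosine similarity reduces to a plain dot product.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def nearest_neighbors(query, database, k=3):
    """Return indices of the k database entries closest to the query."""
    q = l2_normalize(query)
    db = l2_normalize(database)
    sims = db @ q                 # cosine similarities
    return np.argsort(-sims)[:k]  # best matches first

# Toy example: 5 "image" embeddings, one "caption" query.
rng = np.random.default_rng(0)
images = rng.normal(size=(5, 8))
caption = images[2] + 0.01 * rng.normal(size=8)  # query close to image 2
print(nearest_neighbors(caption, images, k=1))   # -> [2]
```

Because both modalities live in the same normalized space, the same function serves caption-to-image and image-to-caption retrieval.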
SLIDE 4

Approach


  • Learning a joint image and text embedding space.
  • Visual grounding relying on the modeling of spatial and textual information.
  • Cross-modal retrieval leveraging the semantic space and the visual and textual alignment.

SLIDE 5

Semantic Embedding Model


Visual pipeline:

  • Pretrained ResNet-152.
  • Weldon spatial pooling.
  • Affine projection + normalization.

Textual pipeline:

  • Pretrained word embedding (w2v).
  • Simple Recurrent Unit (SRU).
  • Normalization.

(Diagram: image → ResNet conv. → pooling → affine + norm.; caption (a, man, in, ski, gear, skiing, on, snow) → w2v → SRU + norm.; the two embeddings are compared with cosine similarity. θ and φ are the trained parameters.)

SLIDE 6

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 7

Pooling mechanisms


Weldon spatial pooling:

  • Used instead of global average/max pooling.
  • Aggregates the min and max of each map.
  • Produces activation maps with finer localization information.
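A minimal sketch of this min+max aggregation, assuming a single image's (channels, height, width) activation maps; `weldon_pool` and the `kmax`/`kmin` averaging are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def weldon_pool(feature_maps, kmax=1, kmin=1):
    """Weldon-style spatial pooling (sketch): for each channel, average
    the kmax highest and kmin lowest activations instead of taking a
    global average or a single max."""
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)
    srt = np.sort(flat, axis=1)
    top = srt[:, -kmax:].mean(axis=1)    # strongest positive evidence
    bottom = srt[:, :kmin].mean(axis=1)  # strongest negative evidence
    return top + bottom

maps = np.zeros((2, 4, 4))
maps[0, 1, 2] = 5.0   # one strong positive activation in channel 0
maps[1, 0, 0] = -3.0  # one strong negative activation in channel 1
print(weldon_pool(maps))  # per-channel scores mixing max and min evidence
```

Keeping the negative evidence is what lets the pooled score reflect localized cues rather than a smeared global average.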

SLIDE 8

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 9

Simple Recurrent Unit: SRU


Diagram by Jakub Kvita

Recurrent neural network:

  • Fixed-size representation for a variable-length sequence.
  • Able to capture long-term dependencies between words.
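The idea can be sketched with a minimal single-layer SRU cell in NumPy, following the published SRU recurrence (forget gate, internal state, highway output). The weights below are random stand-ins; the point is that the output size is fixed whatever the sentence length:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_encode(xs, W, Wf, bf, Wr, br):
    """Minimal single-layer SRU (sketch): c_t = f_t*c_{t-1} + (1-f_t)*(W x_t),
    h_t = r_t*tanh(c_t) + (1-r_t)*x_t. Returns the last hidden state as a
    fixed-size representation of the sequence."""
    d = W.shape[0]
    c = np.zeros(d)
    h = np.zeros(d)
    for x in xs:                      # one step per word embedding
        f = sigmoid(Wf @ x + bf)      # forget gate
        r = sigmoid(Wr @ x + br)      # reset/highway gate
        c = f * c + (1.0 - f) * (W @ x)
        h = r * np.tanh(c) + (1.0 - r) * x
    return h

rng = np.random.default_rng(0)
d = 4                                 # embedding and hidden size (kept equal)
W, Wf, Wr = (rng.normal(size=(d, d)) for _ in range(3))
bf = br = np.zeros(d)
sentence = rng.normal(size=(7, d))    # 7 word vectors
print(sru_encode(sentence, W, Wf, bf, Wr, br).shape)  # (4,) for any length
```

Unlike an LSTM, the heavy matrix products here depend only on x_t, which is what makes the SRU fast to parallelize over time steps.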
SLIDE 10

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 11

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 12

Dataset


  • MS-COCO 2014:
    • 110K training images.
    • 5 captions per image.
    • 2 × 5K images for validation and test.

(Example caption: "Dining room table set for a casual meal, with flowers.")

SLIDE 13

Learning strategy: triplet loss


A variant of the standard margin-based loss:

  • Triplet (x, v, v′)
  • Anchor: x (e.g. an image representation)
  • Positive: v (e.g. the associated caption representation)
  • Negative: v′ (e.g. a contrastive caption representation)
  • Margin parameter α

loss(x, v, v′) = max{0, α - ⟨x, v⟩ + ⟨x, v′⟩}
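The margin-based triplet loss is one line of code. The notation here (image anchor x, positive caption v, negative caption v′) is an assumption for illustration, with embeddings taken to be L2-normalized so the inner product is cosine similarity:

```python
import numpy as np

def triplet_loss(x, v, v_neg, alpha=0.2):
    """Margin-based triplet loss:
    loss(x, v, v') = max(0, alpha - <x, v> + <x, v'>).
    Zero once the positive beats the negative by at least alpha."""
    return max(0.0, alpha - np.dot(x, v) + np.dot(x, v_neg))

x = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])       # perfectly aligned positive
v_neg = np.array([0.0, 1.0])   # orthogonal negative
print(triplet_loss(x, v, v_neg))  # 0.0: the margin is already satisfied
```

Swapping the roles of v and v_neg makes the triplet violated, and the loss grows linearly with the similarity gap.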

SLIDE 14

Learning strategy: triplet loss


The same loss, written with a distance d:

loss(x, v, v′) = max{0, α + d(x, v) - d(x, v′)}

(Figure: anchor with its positive and negative embeddings, separated by the margin α.)

SLIDE 15

Learning strategy: triplet loss


Hard-negative margin-based loss. For a batch B = {(I_n, S_n)}_n of image/sentence pairs:

L(Θ; B) = (1/|B|) Σ_{n∈B} [ max_{m∈C_n} loss(x_n, v_n, v_m) + max_{m∈D_n} loss(v_n, x_n, x_m) ]

where:

  • C_n (resp. D_n) is the set of indices of the captions (resp. images) unrelated to the n-th element.
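A NumPy sketch of this batch loss, under the simplifying assumption that within a batch every non-matching index is a valid negative (i.e. the unrelated sets are all m ≠ n). Rows of X and V are matched, L2-normalized image/caption embeddings:

```python
import numpy as np

def hard_negative_loss(X, V, alpha=0.2):
    """Hard-negative batch loss (sketch): for each pair n, only the hardest
    contrastive caption and the hardest contrastive image contribute."""
    S = X @ V.T                     # S[n, m] = <x_n, v_m>
    pos = np.diag(S)                # similarities of matched pairs
    B = S.shape[0]
    off = ~np.eye(B, dtype=bool)    # mask keeping only unrelated pairs
    total = 0.0
    for n in range(B):
        hardest_cap = S[n, off[n]].max()  # max_m <x_n, v_m>, m != n
        hardest_img = S[off[n], n].max()  # max_m <x_m, v_n>, m != n
        total += max(0.0, alpha - pos[n] + hardest_cap)
        total += max(0.0, alpha - pos[n] + hardest_img)
    return total / B

X = np.eye(3)   # toy batch: 3 images on orthogonal axes
V = np.eye(3)   # captions perfectly aligned with their images
print(hard_negative_loss(X, V))  # 0.0: every negative is alpha-separated
```

Focusing each term on the single hardest negative in the batch is what makes this variant train much better than averaging over all negatives.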

SLIDE 16

Learning strategy: hard negative triplet loss


Mining hard negative contrastive examples:

L(Θ; B) = (1/|B|) Σ_{n∈B} [ max_{m∈C_n} loss(x_n, v_n, v_m) + max_{m∈D_n} loss(v_n, x_n, x_m) ]

(Figure: embedding space with an image x_n and its caption v_n.)

SLIDE 17

Learning strategy: hard negative triplet loss


Mining hard negative contrastive examples:

L(Θ; B) = (1/|B|) Σ_{n∈B} [ max_{m∈C_n} loss(x_n, v_n, v_m) + max_{m∈D_n} loss(v_n, x_n, x_m) ]

(Figure: embedding space with an image x_n, its caption v_n, and the hardest negative caption v_m.)

SLIDE 18

From training to testing


(Figure: the learned semantic-visual space with "A cat on a sofa", "A dog playing", "A car" and their matching images.)

Training finished:

  • Visual-semantic space constructed.
  • Parameters of the model are fixed.
  • Time for testing.
SLIDE 19

Qualitative evaluation: cross-modal retrieval


Query and closest elements:

  • Text queries "A dog playing with a frisbee" and "A plane in a cloudy sky" each retrieve a matching image.
  • An image query retrieves its three closest captions:
  1. A herd of sheep standing on top of snow covered field.
  2. There are sheep standing in the grass near a fence.
  3. some black and white sheep a fence dirt and grass

SLIDE 20

Quantitative evaluation: cross-modal retrieval


Cross-modal retrieval: evaluated on MS-COCO image/caption pairs. Recall results (caption retrieval R@1 / R@5 / R@10; image retrieval R@1 / R@5 / R@10):

  • 2-Way Net [5]: 55.8% / n/a / 75.2%; 39.7% / n/a / 63.3%
  • VSE++ [6]: 64.6% / n/a / 95.7%; 52.0% / n/a / 92.0%
  • Ours: 69.8% / 91.9% / 96.6%; 55.9% / 86.9% / 94.0%

SLIDE 21

Performance evaluation: ablation study


Performance boost coming from:

  • Architecture choice: SRU and Weldon spatial pooling.
  • Efficient learning strategy: hard negative loss.

Ablation study, cross-modal retrieval results (caption retrieval R@1 / R@5 / R@10; image retrieval R@1 / R@5 / R@10):

  • Hard Neg + WLD + SRU 4: 69.8% / 91.9% / 96.6%; 55.9% / 86.9% / 94.0%
  • Hard Neg + GAP + SRU 4: 64.5% / 90.2% / 95.5%; 51.2% / 84.0% / 92.0%
  • Hard Neg + WLD + GRU 1: 63.8% / 90.2% / 96.0%; 52.2% / 84.9% / 92.6%
  • Classic + WLD + SRU 4: 49.5% / 81.0% / 90.1%; 39.6% / 77.3% / 89.1%

SLIDE 22

Evaluation: cross-modal retrieval and limitations


Text queries retrieving images: "The plane is parked at the gate at the airport terminal." and "Multiple wooden spoons are shown on a table top."

Image queries and their three closest captions:

  1. A harbor filled with boats floating on water
  2. A small marina with boats docked there
  3. a group of boats sitting together with no one around

  1. Two elephants in the field moving along during the day.
  2. Two elephants are standing by the trees in the wild.
  3. An elephant and a rhino are grazing in an open wooded area.

SLIDE 23

Localization


Visual grounding module:

  • Weakly supervised, with no additional training.
  • Localizes a textual query in an image.
  • Uses the embedding space to select convolutional activation maps.

(Figure: source image + text query "two glasses" → visual grounding heat map.)

SLIDE 24

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 25

Localization

Generation of the heat map G from the last convolutional maps H:

H′(j, k, :) = B H(j, k, :),  for all (j, k) in [1, w] × [1, h]

G = Σ_{d ∈ L(v)} v[d] · H′[:, :, d]

where B is the affine projection of the visual pipeline and L(v) is the set of the indices of the k largest entries of the text embedding v.
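The two heat-map equations translate directly into NumPy. Shapes and names (H, B, v, k) follow the slide; the values below are random toys, and treating H as (w, h, D) with an (E, D) projection is an assumption for illustration:

```python
import numpy as np

def heat_map(H, B, v, k=5):
    """Heat-map generation (sketch): project every spatial location into the
    joint space, then combine the maps selected by the k largest entries of
    the text embedding, weighted by those entries."""
    Hp = H @ B.T                    # H'(j, k, :) = B H(j, k, :), shape (w, h, E)
    L = np.argsort(-v)[:k]          # indices of the k largest entries of v
    return sum(v[d] * Hp[:, :, d] for d in L)  # G = sum_d v[d] * H'[:, :, d]

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 7, 16))     # toy last convolutional maps
B = rng.normal(size=(12, 16))       # toy affine projection (E=12, D=16)
v = rng.normal(size=12)             # toy text embedding
G = heat_map(H, B, v, k=3)
print(G.shape)  # (7, 7): one relevance score per spatial location
```

Since B and v come from the already-trained pipelines, this grounding step needs no extra parameters or supervision.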

SLIDE 26

Qualitative evaluation: localization


Visual grounding examples:

  • Generating multiple heat maps with different textual queries.
SLIDE 27

Quantitative evaluation: localization

The pointing game: localizing phrases corresponding to subregions of the image.

Pointing game accuracy:

  • "Center" baseline: 19.5%
  • Linguistic structure [7]: 24.4%
  • Ours: 33.8%
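The pointing-game criterion is simple to state in code: a query counts as a hit when the heat map's maximum falls inside the annotated region. A small sketch, where the (x0, y0, x1, y1) box convention is an assumption:

```python
import numpy as np

def pointing_game_hit(heat_map, box):
    """Pointing game (sketch): hit when the argmax of the heat map lies
    inside the ground-truth box (x0, y0, x1, y1), bounds inclusive."""
    y, x = np.unravel_index(np.argmax(heat_map), heat_map.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)

hm = np.zeros((10, 10))
hm[4, 6] = 1.0                        # peak at row 4, column 6
print(pointing_game_hit(hm, (5, 3, 8, 6)))  # True: (x=6, y=4) is inside
```

Accuracy is then just the fraction of phrase queries whose peak lands in the right region, which is why the metric is robust to heat-map scale.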
SLIDE 28

Toward zero-shot localization:


  • Emergence of color understanding, even on artificial images:
SLIDE 29

Toward zero-shot localization:


  • Generalization to unseen elements:
SLIDE 30

Conclusion

Summary:

  • Semantic-visual embedding model.
  • Effective on the cross-modal retrieval task.
  • Visual grounding of text with no extra supervision.

(Overview figure: CNN image adaptation + pooling, and tokenisation + embedding → RNN encoding → text projection; localization and retrieval using the embedding space, with examples "A cat on a sofa", "A dog playing", "A car".)

Thank you!

Paper - Finding beans in burgers: Deep semantic-visual embedding with localization