DEEP SEMANTIC-VISUAL EMBEDDING WITH LOCALIZATION
Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord
Thursday 4th October, 2018
Tasks
Visual Grounding of phrases: Localize
Query: A cat
A cat on a sofa
A dog playing
A car
Visual path: ResNet, conv, pool, affine + norm.
Textual path: (a, man, in, ski, gear, skiing, on, snow), w2v, SRU + norm.
Output: cosine sim.
$\theta_{0:2}$ and $\phi$ are the trained parameters
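
The last step of the diagram, the cosine similarity between the two normalized embeddings, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names and the 3-dimensional toy vectors are hypothetical stand-ins for the outputs of the visual and textual paths.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (the 'norm.' step)."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in vec]


def cosine_similarity(x, v):
    """Dot product of the two unit-normalized embeddings."""
    xn, vn = l2_normalize(x), l2_normalize(v)
    return sum(a * b for a, b in zip(xn, vn))


# Hypothetical toy embeddings (assumption: real ones are much larger).
image_embedding = [0.2, 0.9, 0.1]
matching_caption = [0.1, 0.8, 0.3]
unrelated_caption = [0.9, -0.1, 0.0]

print(cosine_similarity(image_embedding, matching_caption))
print(cosine_similarity(image_embedding, unrelated_caption))
```

Because both vectors are normalized first, the score depends only on their direction, so an image and a caption can be compared directly in the shared space.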
Diagram by Jakub Kvita
$y$, $z$, $z'$
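$n \in B$
$m \in C_n \cap B$: $\mathrm{loss}(\mathbf{x}_n, \mathbf{v}_n, \mathbf{v}_m)$
$m \in D_n \cap B$: $\mathrm{loss}(\mathbf{v}_n, \mathbf{x}_n, \mathbf{x}_m)$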
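
A rough sketch of how such a batch triplet loss might be computed is given below. Several points are assumptions and not taken from the slides: the hinge form with a margin, the margin value, the choice of "every other index in the batch" as the negative set, summation of the terms over the batch, and the reading of hard-negative mining as keeping only the largest term per sample. All function and parameter names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Assumed hinge form: max(0, margin - <anchor, positive> + <anchor, negative>)."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))


def batch_loss(images, captions, margin=0.2, hard_negatives=False):
    """images[n] and captions[n] form a matching pair.

    Assumption: every other index m != n in the batch is a negative.
    """
    total = 0.0
    for n in range(len(images)):
        negatives = [m for m in range(len(images)) if m != n]
        if not negatives:
            continue
        # Image as anchor, unrelated captions as negatives.
        caption_terms = [triplet_loss(images[n], captions[n], captions[m], margin)
                         for m in negatives]
        # Caption as anchor, unrelated images as negatives.
        image_terms = [triplet_loss(captions[n], images[n], images[m], margin)
                       for m in negatives]
        if hard_negatives:
            total += max(caption_terms) + max(image_terms)
        else:
            total += sum(caption_terms) + sum(image_terms)
    return total


aligned = batch_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = batch_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

In the toy call, correctly aligned pairs incur no loss, while swapping the captions makes every triplet violate the margin.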
vn xn
vn xn vm
A cat on a sofa
A dog playing
A car
A dog playing with a frisbee
A plane in a cloudy sky
Cross-modal retrieval results (Recall)

                 Caption retrieval          Image retrieval
Model            R@1     R@5     R@10       R@1     R@5     R@10
2-Way Net [5]    55.8%   75.2%   -          39.7%   63.3%   -
VSE++ [6]        64.6%   -       95.7%      52.0%   -       92.0%
Ours             69.8%   91.9%   96.6%      55.9%   86.9%   94.0%
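
For readers unfamiliar with the R@K metric reported here, the sketch below shows one way to compute it from a query-by-candidate similarity matrix. It is a simplified illustration: it assumes that candidate i is the single correct match for query i, and the function name and toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate is ranked in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: candidate i is the only correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical scores for three queries and three candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],  # correct candidate ranked first
    [0.8, 0.7, 0.2],  # correct candidate ranked second
    [0.2, 0.6, 0.5],  # correct candidate ranked second
]

print(recall_at_k(toy_similarity, 1))
print(recall_at_k(toy_similarity, 2))
```

The same function covers both directions: rows are captions and columns are images for image retrieval, and the transposed matrix is used for caption retrieval.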
Ablation study: cross-modal retrieval results (Recall)

                          Caption retrieval          Image retrieval
Configuration             R@1     R@5     R@10       R@1     R@5     R@10
Hard Neg + WLD + SRU 4    69.8%   91.9%   96.6%      55.9%   86.9%   94.0%
Hard Neg + GAP + SRU 4    64.5%   90.2%   95.5%      51.2%   84.0%   92.0%
Hard Neg + WLD + GRU 1    63.8%   90.2%   96.0%      52.2%   84.9%   92.6%
Classic + WLD + SRU 4     49.5%   81.0%   90.1%      39.6%   77.3%   89.1%
The plane is parked at the gate at the airport terminal.
Multiple wooden spoons are shown
Text query: two glasses
Panels: Source image, Text query, Visual grounding, Heat map
G, H
$u \in K(\mathbf{v})$
$K(\mathbf{v})$ the set
entries
Pointing game results (Accuracy)

Method                      Accuracy
"Center" baseline           19.5%
Linguistic structure [7]    24.4%
Ours                        33.8%
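
The sketch below illustrates how a pointing-game accuracy of this kind is commonly computed: a sample counts as a hit when the maximum of the heat map falls inside the ground-truth box. The exact protocol used for the reported numbers is not given in the slides, so the box convention (inclusive row and column bounds), the reading of the "Center" baseline as always pointing at the middle of the map, and all names are assumptions.

```python
def argmax_2d(heat_map):
    """Return (row, col) of the largest value; ties keep the first one found."""
    best_pos, best_val = (0, 0), heat_map[0][0]
    for r, row in enumerate(heat_map):
        for c, val in enumerate(row):
            if val > best_val:
                best_pos, best_val = (r, c), val
    return best_pos


def inside(point, box):
    """box = (row_min, col_min, row_max, col_max), bounds inclusive."""
    r, c = point
    r0, c0, r1, c1 = box
    return r0 <= r <= r1 and c0 <= c <= c1


def pointing_accuracy(samples, center_baseline=False):
    """samples: list of (heat_map, ground_truth_box) pairs."""
    hits = 0
    for heat_map, box in samples:
        if center_baseline:
            # Assumed "Center" baseline: always point at the middle cell.
            point = (len(heat_map) // 2, len(heat_map[0]) // 2)
        else:
            point = argmax_2d(heat_map)
        if inside(point, box):
            hits += 1
    return hits / len(samples)


# Hypothetical 3x3 heat maps with their ground-truth boxes.
toy_samples = [
    ([[0.1, 0.2, 0.1],
      [0.0, 0.3, 0.9],
      [0.1, 0.2, 0.4]], (1, 2, 2, 2)),  # peak at (1, 2): inside the box
    ([[0.8, 0.1, 0.0],
      [0.2, 0.3, 0.1],
      [0.0, 0.1, 0.2]], (1, 1, 2, 2)),  # peak at (0, 0): outside the box
]

print(pointing_accuracy(toy_samples))
print(pointing_accuracy(toy_samples, center_baseline=True))
```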
A cat
sofa A dog playing A car
CNN image adaptation + pooling RNN encoding text projection tokenisation + embedding
Localization and retrieval using the embedding space
Paper - Finding beans in burgers: Deep semantic-visual embedding with localization