
SLIDE 1

DEEP SEMANTIC-VISUAL EMBEDDING WITH LOCALIZATION

Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord

Thursday 4th October, 2018

SLIDE 2

Tasks


Visual grounding of phrases: localize any textual query in a given image.
Cross-modal retrieval: match images and captions in both directions.

Query: A cat on a sofa
SLIDE 3

Semantic visual embedding


2D semantic-visual space example:

(Figure: embedded captions "A cat on a sofa", "A dog playing", "A car" placed near their matching images.)

  • Distance in the space has a semantic interpretation.
  • Retrieval is done by finding nearest neighbors.
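The nearest-neighbor retrieval rule above can be sketched in a few lines of NumPy. The embeddings here are random stand-ins for the model's outputs, and `nearest_neighbors` is an illustrative helper name, not the authors' code:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each embedding onto the unit sphere so that
    # cosine similarity reduces to a plain dot product.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def nearest_neighbors(query, database, k=3):
    """Return indices of the k database entries closest to the query."""
    q = l2_normalize(query)
    db = l2_normalize(database)
    sims = db @ q                 # cosine similarities
    return np.argsort(-sims)[:k]  # best matches first

# Toy example: 5 "image" embeddings, one "caption" query.
rng = np.random.default_rng(0)
images = rng.normal(size=(5, 8))
caption = images[2] + 0.01 * rng.normal(size=8)  # query close to image 2
print(nearest_neighbors(caption, images, k=1))   # -> [2]
```

Because both modalities live in the same normalized space, the same function serves caption-to-image and image-to-caption retrieval.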
SLIDE 4

Approach


  • Learning a joint image and text embedding space.
  • Visual grounding relying on the modeling of spatial and textual information.
  • Cross-modal retrieval leveraging the semantic space and the visual and textual alignment.

SLIDE 5

Semantic Embedding Model


Visual pipeline:

  • Pretrained ResNet-152.
  • Weldon spatial pooling.
  • Affine projection + normalization.

Textual pipeline:

  • Pretrained word embedding (w2v).
  • Simple Recurrent Unit (SRU).
  • Normalization.

(Diagram: image → ResNet conv. → pooling → affine + norm.; caption (a, man, in, ski, gear, skiing, on, snow) → w2v → SRU + norm.; the two embeddings are compared with cosine similarity. θ and φ are the trained parameters.)

SLIDE 6

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 7

Pooling mechanisms


Weldon spatial pooling:

  • Used instead of global average/max pooling.
  • Aggregates the min and max of each map.
  • Produces activation maps with finer localization information.
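A minimal sketch of this min+max aggregation, assuming a single image's (channels, height, width) activation maps; `weldon_pool` and the `kmax`/`kmin` averaging are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def weldon_pool(feature_maps, kmax=1, kmin=1):
    """Weldon-style spatial pooling (sketch): for each channel, average
    the kmax highest and kmin lowest activations instead of taking a
    global average or a single max."""
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)
    srt = np.sort(flat, axis=1)
    top = srt[:, -kmax:].mean(axis=1)    # strongest positive evidence
    bottom = srt[:, :kmin].mean(axis=1)  # strongest negative evidence
    return top + bottom

maps = np.zeros((2, 4, 4))
maps[0, 1, 2] = 5.0   # one strong positive activation in channel 0
maps[1, 0, 0] = -3.0  # one strong negative activation in channel 1
print(weldon_pool(maps))  # per-channel scores mixing max and min evidence
```

Keeping the negative evidence is what lets the pooled score reflect localized cues rather than a smeared global average.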

SLIDE 8

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 9

Simple Recurrent Unit: SRU


Diagram by Jakub Kvita

Recurrent neural network:

  • Fixed-size representation for a variable-length sequence.
  • Able to capture long-term dependencies between words.
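The idea can be sketched with a minimal single-layer SRU cell in NumPy, following the published SRU recurrence (forget gate, internal state, highway output). The weights below are random stand-ins; the point is that the output size is fixed whatever the sentence length:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_encode(xs, W, Wf, bf, Wr, br):
    """Minimal single-layer SRU (sketch): c_t = f_t*c_{t-1} + (1-f_t)*(W x_t),
    h_t = r_t*tanh(c_t) + (1-r_t)*x_t. Returns the last hidden state as a
    fixed-size representation of the sequence."""
    d = W.shape[0]
    c = np.zeros(d)
    h = np.zeros(d)
    for x in xs:                      # one step per word embedding
        f = sigmoid(Wf @ x + bf)      # forget gate
        r = sigmoid(Wr @ x + br)      # reset/highway gate
        c = f * c + (1.0 - f) * (W @ x)
        h = r * np.tanh(c) + (1.0 - r) * x
    return h

rng = np.random.default_rng(0)
d = 4                                 # embedding and hidden size (kept equal)
W, Wf, Wr = (rng.normal(size=(d, d)) for _ in range(3))
bf = br = np.zeros(d)
sentence = rng.normal(size=(7, d))    # 7 word vectors
print(sru_encode(sentence, W, Wf, bf, Wr, br).shape)  # (4,) for any length
```

Unlike an LSTM, the heavy matrix products here depend only on x_t, which is what makes the SRU fast to parallelize over time steps.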
SLIDE 10

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 11

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 12

Dataset


  • MS-COCO 2014:
    • 110K training images.
    • 5 captions per image.
    • 2 × 5K images for validation and test.

(Example caption: "Dining room table set for a casual meal, with flowers.")

SLIDE 13

Learning strategy: triplet loss


A variant of the standard margin-based loss:

  • Triplet (x, v, v′)
  • Anchor: x (e.g. an image representation)
  • Positive: v (e.g. the associated caption representation)
  • Negative: v′ (e.g. a contrastive caption representation)
  • Margin parameter α

loss(x, v, v′) = max{0, α - ⟨x, v⟩ + ⟨x, v′⟩}
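The margin-based triplet loss is one line of code. The notation here (image anchor x, positive caption v, negative caption v′) is an assumption for illustration, with embeddings taken to be L2-normalized so the inner product is cosine similarity:

```python
import numpy as np

def triplet_loss(x, v, v_neg, alpha=0.2):
    """Margin-based triplet loss:
    loss(x, v, v') = max(0, alpha - <x, v> + <x, v'>).
    Zero once the positive beats the negative by at least alpha."""
    return max(0.0, alpha - np.dot(x, v) + np.dot(x, v_neg))

x = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])       # perfectly aligned positive
v_neg = np.array([0.0, 1.0])   # orthogonal negative
print(triplet_loss(x, v, v_neg))  # 0.0: the margin is already satisfied
```

Swapping the roles of v and v_neg makes the triplet violated, and the loss grows linearly with the similarity gap.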

SLIDE 14

Learning strategy: triplet loss


The same loss, written with a distance d:

loss(x, v, v′) = max{0, α + d(x, v) - d(x, v′)}

(Figure: anchor with its positive and negative embeddings, separated by the margin α.)

SLIDE 15

Learning strategy: triplet loss


Hard-negative margin-based loss. For a batch B = {(I_n, S_n)}_n of image/sentence pairs:

L(Θ; B) = (1/|B|) Σ_{n∈B} [ max_{m∈C_n} loss(x_n, v_n, v_m) + max_{m∈D_n} loss(v_n, x_n, x_m) ]

where:

  • C_n (resp. D_n) is the set of indices of the captions (resp. images) unrelated to the n-th element.
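A NumPy sketch of this batch loss, under the simplifying assumption that within a batch every non-matching index is a valid negative (i.e. the unrelated sets are all m ≠ n). Rows of X and V are matched, L2-normalized image/caption embeddings:

```python
import numpy as np

def hard_negative_loss(X, V, alpha=0.2):
    """Hard-negative batch loss (sketch): for each pair n, only the hardest
    contrastive caption and the hardest contrastive image contribute."""
    S = X @ V.T                     # S[n, m] = <x_n, v_m>
    pos = np.diag(S)                # similarities of matched pairs
    B = S.shape[0]
    off = ~np.eye(B, dtype=bool)    # mask keeping only unrelated pairs
    total = 0.0
    for n in range(B):
        hardest_cap = S[n, off[n]].max()  # max_m <x_n, v_m>, m != n
        hardest_img = S[off[n], n].max()  # max_m <x_m, v_n>, m != n
        total += max(0.0, alpha - pos[n] + hardest_cap)
        total += max(0.0, alpha - pos[n] + hardest_img)
    return total / B

X = np.eye(3)   # toy batch: 3 images on orthogonal axes
V = np.eye(3)   # captions perfectly aligned with their images
print(hard_negative_loss(X, V))  # 0.0: every negative is alpha-separated
```

Focusing each term on the single hardest negative in the batch is what makes this variant train much better than averaging over all negatives.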

SLIDE 16

Learning strategy: hard negative triplet loss


Mining hard negative contrastive examples:

L(Θ; B) = (1/|B|) Σ_{n∈B} [ max_{m∈C_n} loss(x_n, v_n, v_m) + max_{m∈D_n} loss(v_n, x_n, x_m) ]

(Figure: embedding space with an image x_n and its caption v_n.)

SLIDE 17

Learning strategy: hard negative triplet loss


Mining hard negative contrastive examples:

L(Θ; B) = (1/|B|) Σ_{n∈B} [ max_{m∈C_n} loss(x_n, v_n, v_m) + max_{m∈D_n} loss(v_n, x_n, x_m) ]

(Figure: embedding space with an image x_n, its caption v_n, and the hardest negative caption v_m.)

SLIDE 18

From training to testing


(Figure: the learned semantic-visual space with "A cat on a sofa", "A dog playing", "A car" and their matching images.)

Training finished:

  • Visual-semantic space constructed.
  • Parameters of the model are fixed.
  • Time for testing.
SLIDE 19

Qualitative evaluation: cross-modal retrieval


Query and closest elements:

  • Text queries "A dog playing with a frisbee" and "A plane in a cloudy sky" each retrieve a matching image.
  • An image query retrieves its three closest captions:
  1. A herd of sheep standing on top of snow covered field.
  2. There are sheep standing in the grass near a fence.
  3. some black and white sheep a fence dirt and grass

SLIDE 20

Quantitative evaluation: cross-modal retrieval


Cross-modal retrieval: evaluated on MS-COCO image/caption pairs. Recall results (caption retrieval R@1 / R@5 / R@10; image retrieval R@1 / R@5 / R@10):

  • 2-Way Net [5]: 55.8% / n/a / 75.2%; 39.7% / n/a / 63.3%
  • VSE++ [6]: 64.6% / n/a / 95.7%; 52.0% / n/a / 92.0%
  • Ours: 69.8% / 91.9% / 96.6%; 55.9% / 86.9% / 94.0%

SLIDE 21

Performance evaluation: ablation study


Performance boost coming from:

  • Architecture choice: SRU and Weldon spatial pooling.
  • Efficient learning strategy: hard negative loss.

Ablation study, cross-modal retrieval results (caption retrieval R@1 / R@5 / R@10; image retrieval R@1 / R@5 / R@10):

  • Hard Neg + WLD + SRU 4: 69.8% / 91.9% / 96.6%; 55.9% / 86.9% / 94.0%
  • Hard Neg + GAP + SRU 4: 64.5% / 90.2% / 95.5%; 51.2% / 84.0% / 92.0%
  • Hard Neg + WLD + GRU 1: 63.8% / 90.2% / 96.0%; 52.2% / 84.9% / 92.6%
  • Classic + WLD + SRU 4: 49.5% / 81.0% / 90.1%; 39.6% / 77.3% / 89.1%

SLIDE 22

Evaluation: cross-modal retrieval and limitations


Text queries retrieving images: "The plane is parked at the gate at the airport terminal." and "Multiple wooden spoons are shown on a table top."

Image queries and their three closest captions:

  1. A harbor filled with boats floating on water
  2. A small marina with boats docked there
  3. a group of boats sitting together with no one around

  1. Two elephants in the field moving along during the day.
  2. Two elephants are standing by the trees in the wild.
  3. An elephant and a rhino are grazing in an open wooded area.

SLIDE 23

Localization


Visual grounding module:

  • Weakly supervised, with no additional training.
  • Localizes a textual query in an image.
  • Uses the embedding space to select convolutional activation maps.

(Figure: source image + text query "two glasses" → visual grounding heat map.)

SLIDE 24

Semantic Embedding Model

(Recap: same model-architecture content as Slide 5.)

SLIDE 25

Localization

Generation of the heat map G from the last convolutional maps H:

H′(j, k, :) = B H(j, k, :),  for all (j, k) in [1, w] × [1, h]

G = Σ_{d ∈ L(v)} v[d] · H′[:, :, d]

where B is the affine projection of the visual pipeline and L(v) is the set of the indices of the k largest entries of the text embedding v.
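The two heat-map equations translate directly into NumPy. Shapes and names (H, B, v, k) follow the slide; the values below are random toys, and treating H as (w, h, D) with an (E, D) projection is an assumption for illustration:

```python
import numpy as np

def heat_map(H, B, v, k=5):
    """Heat-map generation (sketch): project every spatial location into the
    joint space, then combine the maps selected by the k largest entries of
    the text embedding, weighted by those entries."""
    Hp = H @ B.T                    # H'(j, k, :) = B H(j, k, :), shape (w, h, E)
    L = np.argsort(-v)[:k]          # indices of the k largest entries of v
    return sum(v[d] * Hp[:, :, d] for d in L)  # G = sum_d v[d] * H'[:, :, d]

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 7, 16))     # toy last convolutional maps
B = rng.normal(size=(12, 16))       # toy affine projection (E=12, D=16)
v = rng.normal(size=12)             # toy text embedding
G = heat_map(H, B, v, k=3)
print(G.shape)  # (7, 7): one relevance score per spatial location
```

Since B and v come from the already-trained pipelines, this grounding step needs no extra parameters or supervision.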

SLIDE 26

Qualitative evaluation: localization


Visual grounding examples:

  • Generating multiple heat maps with different textual queries.
SLIDE 27

Quantitative evaluation: localization

The pointing game: localizing phrases corresponding to subregions of the image.

Pointing game accuracy:

  • "Center" baseline: 19.5%
  • Linguistic structure [7]: 24.4%
  • Ours: 33.8%
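The pointing-game criterion is simple to state in code: a query counts as a hit when the heat map's maximum falls inside the annotated region. A small sketch, where the (x0, y0, x1, y1) box convention is an assumption:

```python
import numpy as np

def pointing_game_hit(heat_map, box):
    """Pointing game (sketch): hit when the argmax of the heat map lies
    inside the ground-truth box (x0, y0, x1, y1), bounds inclusive."""
    y, x = np.unravel_index(np.argmax(heat_map), heat_map.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)

hm = np.zeros((10, 10))
hm[4, 6] = 1.0                        # peak at row 4, column 6
print(pointing_game_hit(hm, (5, 3, 8, 6)))  # True: (x=6, y=4) is inside
```

Accuracy is then just the fraction of phrase queries whose peak lands in the right region, which is why the metric is robust to heat-map scale.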
SLIDE 28

Toward zero-shot localization:


  • Emergence of color understanding, even on artificial images:
SLIDE 29

Toward zero-shot localization:


  • Generalization to unseen elements:
SLIDE 30

Conclusion

Summary:

  • Semantic-visual embedding model.
  • Effective on the cross-modal retrieval task.
  • Visual grounding of text with no extra supervision.

(Overview figure: CNN image adaptation + pooling, and tokenisation + embedding → RNN encoding → text projection; localization and retrieval using the embedding space, with examples "A cat on a sofa", "A dog playing", "A car".)

Thank you!

Paper - Finding beans in burgers: Deep semantic-visual embedding with localization