Spatial and Temporal Representations for Multi-Modal Visual Retrieval - PowerPoint PPT Presentation


SLIDE 1

Spatial and Temporal representations for Multi-Modal Visual Retrieval

Noa Garcia Docampo

PhD Candidate, Aston University

17th December 2018

SLIDE 2

Introduction

Millions of images created every day... Problem: How to find images in large collections?

SLIDE 3

Introduction

Millions of images created every day... Problem: How to find images in large collections? Solution: Visual Retrieval!

  • Image retrieval has existed since the 1990s
  • Many types of visual retrieval

SLIDE 4

Introduction

Millions of images created every day... Problem: How to find images in large collections? Solution: Visual Retrieval!

  • Image retrieval has existed since the 1990s
  • Many types of visual retrieval

SLIDE 5

Introduction

We classify visual retrieval into 3 main types, depending on the query object and the dataset content:

SLIDE 6

Introduction

We classify visual retrieval into 3 main types, depending on the query object and the dataset content:

SLIDE 7

Structure

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval
  • Cross-Modal Retrieval
  • Conclusions and Final Remarks

SLIDE 8

Structure

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval
  • Cross-Modal Retrieval
  • Conclusions and Final Remarks

SLIDE 9

Contributions

  • CNNs for non-metric visual similarity
  • Pushing performance on standard CBIR datasets
  • MoviesDB: image-to-video retrieval dataset
  • Binary descriptors for local aggregation of video features
  • Spatio-temporal encoders for global aggregation of video features
  • Item video retrieval application
  • SemArt: semantic art understanding dataset
  • Cross-modal retrieval for semantic art understanding

Symmetric Visual Retrieval Asymmetric Visual Retrieval Cross-Modal Retrieval

SLIDE 10

  • Introduction and Background
  • Symmetric Visual Retrieval

SLIDE 11

Symmetric Visual Retrieval

Standard CBIR system
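A standard CBIR system extracts one feature vector per image and ranks the collection by a metric similarity to the query vector. A minimal sketch, with toy 3-D descriptors standing in for CNN features:

```python
import math

def cosine_similarity(a, b):
    # Metric-style similarity used in a standard CBIR system.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_database(query, database):
    # Return database indices sorted from most to least similar.
    scores = [(i, cosine_similarity(query, f)) for i, f in enumerate(database)]
    return [i for i, _ in sorted(scores, key=lambda t: -t[1])]

# Toy 3-D descriptors standing in for real CNN image features.
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.05, 0.0]
ranking = rank_database(query, database)  # most similar image first
```

The thesis argues that this fixed metric similarity is exactly the component worth replacing with a learned one.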

SLIDE 12

Symmetric Visual Retrieval

Standard CBIR system

Drawbacks of metric distances

  • Do not consider data distribution
SLIDE 13

Drawbacks of metric distances

  • Do not consider data distribution
  • Metric distance constraints:


Symmetric Visual Retrieval

Standard CBIR system

SLIDE 14

Drawbacks of metric distances

  • Do not consider data distribution
  • Metric distance constraints:

Symmetric Visual Retrieval

Standard CBIR system

SLIDE 15

Drawbacks of metric distances

  • Do not consider data distribution
  • Metric distance constraints:

Symmetric Visual Retrieval

Standard CBIR system

SLIDE 16

Symmetric Visual Retrieval

Standard CBIR system
Proposed CBIR system

Garcia & Vogiatzis (2018). Learning Non-Metric Visual Similarity for Image Retrieval. Under review at IMAVIS journal

SLIDE 17

Similarity Networks

SLIDE 18

Symmetric Visual Retrieval

Off-the-shelf methods

Garcia & Vogiatzis (2018). Learning Non-Metric Visual Similarity for Image Retrieval. Under review at IMAVIS journal

SLIDE 19

Symmetric Visual Retrieval

Off-the-shelf methods

Garcia & Vogiatzis (2018). Learning Non-Metric Visual Similarity for Image Retrieval. Under review at IMAVIS journal

SLIDE 20

Symmetric Visual Retrieval

Fine-tuned methods
Off-the-shelf methods

Garcia & Vogiatzis (2018). Learning Non-Metric Visual Similarity for Image Retrieval. Under review at IMAVIS journal

SLIDE 21

Symmetric Visual Retrieval

Fine-tuned methods
Off-the-shelf methods

Garcia & Vogiatzis (2018). Learning Non-Metric Visual Similarity for Image Retrieval. Under review at IMAVIS journal

SLIDE 22

Contributions

  • CNNs for non-metric visual similarity
  • Pushing performance on standard CBIR datasets

Symmetric Visual Retrieval

SLIDE 23

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval

SLIDE 24

Asymmetric Visual Retrieval

Garcia & Vogiatzis (2018). Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval. In: ICMR 2018

SLIDE 25

Asymmetric Visual Retrieval

Garcia & Vogiatzis (2018). Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval. In: ICMR 2018

SLIDE 26

Asymmetric Visual Retrieval

No temporal aggregation Chapter 5 Chapter 6

Garcia & Vogiatzis (2018). Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval. In: ICMR 2018

SLIDE 27

Asymmetric Visual Retrieval

No temporal aggregation Chapter 5 Chapter 6

Garcia & Vogiatzis (2018). Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval. In: ICMR 2018

SLIDE 28

Asymmetric Visual Retrieval

Garcia & Vogiatzis (2018). Dress like a Star: Retrieving Fashion Products from Videos. In: CVF workshop ICCV 2017

Feature Indexing

Temporal Local Aggregation

SLIDE 29

Search and Retrieval

Asymmetric Visual Retrieval

Garcia & Vogiatzis (2018). Dress like a Star: Retrieving Fashion Products from Videos. In: CVF workshop ICCV 2017

Temporal Local Aggregation

SLIDE 30

Asymmetric Visual Retrieval

No temporal aggregation Chapter 5 Chapter 6

SLIDE 31

Asymmetric Visual Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Spatio-Temporal Global Aggregation

SLIDE 32

Asymmetric Visual Retrieval

Temporal Local Aggregation (Chapter 5):
  • High accuracy
  • High compression rates
  • Multiple searches per query

Spatio-Temporal Global Aggregation (Chapter 6):
  • State-of-the-art accuracy with global aggregation
  • High compression rates
  • Single search per query

SLIDE 33

Contributions

  • CNNs for non-metric visual similarity
  • Pushing performance on standard CBIR datasets
  • MoviesDB: image-to-video retrieval dataset
  • Binary descriptors for local aggregation of video features
  • Spatio-temporal encoders for global aggregation of video features
  • Item video retrieval application

Symmetric Visual Retrieval Asymmetric Visual Retrieval

SLIDE 34

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval
  • Cross-Modal Retrieval

SLIDE 35

Cross-Modal Retrieval

Retrieve paintings from artistic comments

  • Artistic Comments:

○ Not only descriptions of the content, but also information about the author, context, techniques, etc.

  • Fine-art paintings:

○ Figurative representations

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 36

Cross-Modal Retrieval

  • Visual Encoding (images):

VGG16, ResNet, RMAC

  • Text Encoding (comments and titles):

BOW, MLP, RNN

  • Cross-Modal Transformation:

CCA, Cosine Margin Loss, Augmented with Metadata

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 37

Cross-Modal Retrieval

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

Human Comparison: Difficult Set Human Comparison: Easy Set

Random images Same type images

SLIDE 38

Contributions

  • CNNs for non-metric visual similarity
  • Pushing performance on standard CBIR datasets
  • MoviesDB: image-to-video retrieval dataset
  • Binary descriptors for local aggregation of video features
  • Spatio-temporal encoders for global aggregation of video features
  • Item video retrieval application
  • SemArt: semantic art understanding dataset
  • Cross-modal retrieval for semantic art understanding

Symmetric Visual Retrieval Asymmetric Visual Retrieval Cross-Modal Retrieval

SLIDE 39

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval
  • Cross-Modal Retrieval
  • Conclusions and Final Remarks

SLIDE 40

Future Work

  • Similarity networks for other retrieval tasks
  • Temporal aggregation at the scene level
  • Asymmetric techniques for video-to-image retrieval
  • Style and content detector for cross-modal retrieval in art
  • SemArt dataset for alternative tasks

Symmetric Visual Retrieval Asymmetric Visual Retrieval Cross-Modal Retrieval

SLIDE 41

Q&A

SLIDE 42

  • Introduction and Background
  • Symmetric Visual Retrieval

SLIDE 43

Content-Based Image Retrieval

SLIDE 44

Similarity Networks

  • Input: Concatenation of feature vectors
  • Architecture: Fully connected layers with ReLU
  • Output: Similarity score

Loss Function

Network Output
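The architecture described above (concatenated features through fully connected layers with ReLU, producing one similarity score) can be sketched in plain Python. The weights below are hypothetical hand-set values purely for illustration; the real network learns them from data:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    # One fully connected layer; weights is a list of rows.
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def similarity_network(feat_a, feat_b, w1, b1, w2, b2):
    # Input: concatenation of the two feature vectors.
    x = feat_a + feat_b
    # Hidden fully connected layer with ReLU.
    h = relu(linear(x, w1, b1))
    # Output: a single similarity score.
    return linear(h, w2, b2)[0]

# Tiny hypothetical weights: 4-D input -> 2 hidden units -> 1 score.
w1 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[0.5, 0.5]]
b2 = [0.0]
score = similarity_network([1.0, 0.0], [1.0, 0.0], w1, b1, w2, b2)
```

Unlike a metric distance, nothing constrains this learned score to be symmetric or to satisfy the triangle inequality.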

SLIDE 45

Similarity Networks

  • Input: Concatenation of feature vectors
  • Architecture: Fully connected layers with ReLU
  • Output: Similarity score

Loss Function

Pair Label

SLIDE 46

Similarity Networks

  • Input: Concatenation of feature vectors
  • Architecture: Fully connected layers with ReLU
  • Output: Similarity score

Loss Function

Margin

SLIDE 47

Similarity Networks

  • Input: Concatenation of feature vectors
  • Architecture: Fully connected layers with ReLU
  • Output: Similarity score

Loss Function

Standard Similarity

SLIDE 48

Decrease score in dissimilar pairs
Increase score in similar pairs

Similarity Networks

  • Input: Concatenation of feature vectors
  • Architecture: Fully connected layers with ReLU
  • Output: Similarity score

Loss Function
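One plausible reading of the loss built up over these slides (network output, pair label, margin, standard similarity): regress the network score toward the standard metric similarity shifted up by a margin for similar pairs and down for dissimilar ones. This is a hedged sketch; the exact formulation in the paper may differ:

```python
def similarity_loss(score, standard_sim, pair_label, margin):
    # For similar pairs (label 1), push the network score above the
    # standard (metric) similarity by a margin; for dissimilar pairs
    # (label 0), push it below by the same margin.
    if pair_label == 1:
        target = standard_sim + margin   # increase score in similar pairs
    else:
        target = standard_sim - margin   # decrease score in dissimilar pairs
    return 0.5 * (score - target) ** 2

loss_pos = similarity_loss(score=0.6, standard_sim=0.7, pair_label=1, margin=0.2)
loss_neg = similarity_loss(score=0.6, standard_sim=0.7, pair_label=0, margin=0.2)
```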

SLIDE 49

Similarity Networks

Training Considerations:

  • Supervised - classification labels
  • Important to train on same domain as test
  • Emphasis on difficult pairs

○ First train the network with random pairs
○ Then re-train using pairs where the network performs worse than the standard metric
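One plausible reading of that re-training criterion, written as a pair-mining filter (the `(network_score, metric_score, label)` pair representation here is hypothetical):

```python
def mine_hard_pairs(pairs):
    # Each pair: (network_score, metric_score, label).
    # Keep pairs where the network does worse than the standard metric:
    # similar pairs (label 1) it under-scores, dissimilar pairs (label 0)
    # it over-scores.
    hard = []
    for net, metric, label in pairs:
        if (label == 1 and net < metric) or (label == 0 and net > metric):
            hard.append((net, metric, label))
    return hard

pairs = [(0.9, 0.8, 1), (0.4, 0.7, 1), (0.6, 0.3, 0), (0.1, 0.2, 0)]
hard = mine_hard_pairs(pairs)  # only the two mistakes survive
```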

SLIDE 50

Similarity Networks

Experiments

  • RMAC as feature extractor
  • Test on the Oxford and Paris datasets
  • Train on the Landmarks dataset (33k images)

SLIDE 51

Similarity Networks

Take-away: Results in CBIR can be further improved not only by improving the feature representation but also by estimating a better visual similarity score.

SLIDE 52

Similarity Networks

End-to-End CBIR

SLIDE 53

Similarity Networks

End-to-End CBIR

SLIDE 54

Similarity Networks

End-to-End CBIR

SLIDE 55

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval

SLIDE 56

Image-to-Video Retrieval

Related Work

  • Hand-crafted based:

○ SIFT + BOW (Zhu and Satoh, 2012)
○ Fisher Vector + Bloom Filter (Araujo and Girod, 2017)

  • Deep Learning based:

○ Pooling of pre-trained CNN features (Wang et al., 2017)

Zhu and Satoh, ICMR 2012

SLIDE 57

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

SLIDE 58

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Spatial Encoder Temporal Encoder

SLIDE 59

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Spatial Encoder

  • Re-implementation of RMAC features (Tolias et al., ICLR 2016)
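A heavily simplified, single-channel sketch of the R-MAC idea: max-pool over overlapping square regions of the activation map, then aggregate the regional maxima. Real R-MAC operates per channel at several region scales, with l2-normalisation and PCA-whitening, which this sketch omits:

```python
def max_pool_region(fmap, top, left, size):
    # Maximum activation inside one square region of a feature channel.
    return max(fmap[r][c]
               for r in range(top, top + size)
               for c in range(left, left + size))

def rmac_1channel(fmap, region_size):
    # Slide a square region over the map and sum the regional maxima.
    n = len(fmap)
    maxima = [max_pool_region(fmap, r, c, region_size)
              for r in range(n - region_size + 1)
              for c in range(n - region_size + 1)]
    return sum(maxima)

# Toy 3x3 activation map for a single channel.
fmap = [[0.0, 1.0, 0.0],
        [2.0, 0.0, 0.0],
        [0.0, 0.0, 3.0]]
descriptor = rmac_1channel(fmap, 2)  # four overlapping 2x2 regions
```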
SLIDE 60

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Temporal Encoder

  • Shot boundary detection

○ Distance between consecutive frames
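Shot boundary detection from distances between consecutive frames can be sketched as follows; the threshold is a hypothetical tuning parameter:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def detect_shot_boundaries(frame_features, threshold):
    # Mark a boundary wherever the distance between consecutive
    # frame descriptors exceeds the threshold.
    boundaries = []
    for i in range(1, len(frame_features)):
        if euclidean(frame_features[i - 1], frame_features[i]) > threshold:
            boundaries.append(i)
    return boundaries

# Toy 2-D frame descriptors: a big jump before frame 3 starts a new shot.
frames = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]]
boundaries = detect_shot_boundaries(frames, threshold=1.0)
```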

SLIDE 61

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Temporal Encoder

  • Shot boundary detection

○ Distance between consecutive frames

  • Aggregation with Recurrent Neural Networks
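A minimal Elman-style recurrence illustrating how an RNN can aggregate the frames of a shot into a single vector; the weights are hypothetical and the paper's encoder is more elaborate:

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def rnn_aggregate(frames, w_in, w_rec):
    # Run the recurrence over the shot's frame features; the final
    # hidden state serves as the shot descriptor.
    hidden = [0.0] * len(w_rec)
    for frame in frames:
        pre = [sum(wi * x for wi, x in zip(row_in, frame)) +
               sum(wr * h for wr, h in zip(row_rec, hidden))
               for row_in, row_rec in zip(w_in, w_rec)]
        hidden = tanh_vec(pre)
    return hidden

# Hypothetical weights: 2-D frame features, 2-D hidden state.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_rec = [[0.5, 0.0], [0.0, 0.5]]
shot_vector = rnn_aggregate([[0.2, 0.1], [0.1, 0.3]], w_in, w_rec)
```

Collapsing each shot to one vector is what makes a single search per query possible at retrieval time.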
SLIDE 62

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Training

  • Cleaned LSMDC dataset
  • Matching and non-matching video-frame pairs
  • Cosine Margin Loss
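A common cosine-margin formulation, given here as a hedged sketch of the objective named on the slide: pull matching frame/video embeddings together and push non-matching ones below a margin.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def cosine_margin_loss(img_emb, vid_emb, match, margin=0.2):
    # Matching pairs pay for any similarity below 1; non-matching
    # pairs pay only when their similarity exceeds the margin.
    sim = cosine(img_emb, vid_emb)
    if match:
        return 1.0 - sim
    return max(0.0, sim - margin)

loss_match = cosine_margin_loss([1.0, 0.0], [1.0, 0.0], match=True)
loss_mismatch = cosine_margin_loss([1.0, 0.0], [1.0, 0.1], match=False)
```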
SLIDE 63

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Evaluation

  • Videos:

○ SI2V and VB: newscast videos
○ MoviesDB: movie videos

  • Queries:

○ SI2V: images from newspapers
○ VB and MoviesDB: photos taken with an external device

SLIDE 64

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Results

SLIDE 65

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Results

  • FV methods use extremely large descriptors
SLIDE 66

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Results

  • FV methods use extremely large descriptors
  • Previous deep features methods:

○ Fully connected layers
○ No fine-tuning

SLIDE 67

Image-to-Video Retrieval

Garcia & Vogiatzis (2018). Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval. In: BMVC 2018

Results

  • FV methods use extremely large descriptors
  • Previous deep features methods:

○ Fully connected layers
○ No fine-tuning

  • Our Spatio-Temporal Encoder performs as well as the state of the art while using less memory

SLIDE 68

  • Introduction and Background
  • Symmetric Visual Retrieval
  • Asymmetric Visual Retrieval
  • Cross-Modal Retrieval

SLIDE 69

Semantic Art Understanding

SemArt is a dataset for studying semantic art understanding, in which each sample is a triplet: (painting, attributes, comment). Attributes: author, title, date, technique, type, school, timeframe. Collection: about 21,000 triplets.

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 70

Semantic Art Understanding

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

  • Project Paintings and Comments into a Common Semantic Space
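The common-semantic-space idea can be sketched as two linear projections, one per modality, so that matching paintings and comments land near each other. The matrices below are hypothetical stand-ins for learned weights:

```python
def project(vec, matrix):
    # Linear projection of a modality-specific encoding into the joint space.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

# Hypothetical projection matrices into a 2-D joint space.
w_visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 3-D image encoding -> 2-D
w_text = [[0.0, 1.0], [1.0, 0.0]]               # 2-D comment encoding -> 2-D

painting = project([0.9, 0.1, 0.4], w_visual)
comment = project([0.1, 0.9], w_text)
# A matching pair maps to the same point in the joint space,
# so nearest-neighbour search retrieves it.
```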
SLIDE 71

Semantic Art Understanding

  • Visual Encoding:

VGG16, ResNet, RMAC

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 72

Semantic Art Understanding

  • Visual Encoding:

VGG16, ResNet, RMAC

  • Text Encoding (comments and titles):

BOW, MLP, RNN

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018
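Of the text encoders listed, bag-of-words is the simplest; a minimal sketch over an illustrative vocabulary (the real vocabulary would be built from the SemArt corpus):

```python
def bow_encode(text, vocabulary):
    # Count vocabulary-word occurrences in a comment; out-of-vocabulary
    # tokens are ignored.
    counts = [0] * len(vocabulary)
    index = {word: i for i, word in enumerate(vocabulary)}
    for token in text.lower().split():
        if token in index:
            counts[index[token]] += 1
    return counts

vocabulary = ["portrait", "landscape", "oil", "canvas"]
vector = bow_encode("Oil on canvas portrait", vocabulary)
```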

SLIDE 73

Semantic Art Understanding

  • Visual Encoding:

VGG16, ResNet, RMAC

  • Text Encoding (comments and titles):

BOW, MLP, RNN

  • Cross-Modal Transformation:

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 74

Semantic Art Understanding

  • Visual Encoding:

VGG16, ResNet, RMAC

  • Text Encoding (comments and titles):

BOW, MLP, RNN

  • Cross-Modal Transformation:

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 75

Semantic Art Understanding

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 76

Semantic Art Understanding

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

SLIDE 77

Human Comparison: Easy Set / Human Comparison: Difficult Set

Semantic Art Understanding

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

Random images Same type images

SLIDE 78

Human Comparison: Easy Set / Human Comparison: Difficult Set

Semantic Art Understanding

Garcia & Vogiatzis (2018). How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In: VISART workshop ECCV 2018

Random images Same type images