SLIDE 1

Structured Query-Based Image Retrieval using Scene Graphs

Brigit Schroeder, UCSC

Subarna Tripathi, Intel Labs

SLIDE 2

Complexity of Object Interactions for Retrieval

  • Structured queries capture the complexity of object interactions, unlike single-object queries.
  • A visual relationship is a directed subgraph with the subject and object as nodes connected by a predicate edge.
  • We propose to retrieve images from such queries (NOT from RGB image features) utilizing a learned scene embedding space.

[Figure: the ⟨woman, rides, motorcycle⟩ subgraph vs. isolated woman and motorcycle objects]

SLIDE 3

Related Work

  • Image Retrieval Using Scene Graphs (Johnson et al., CVPR 2015)
    ⇒ Uses a CRF model to match the best possible bounding-box groundings from the scene graph to the image for retrieval.
  • Cross-Modal Scene Graph Matching for Relationship-Aware Image-Text Retrieval (Wang et al., WACV 2020)
    ⇒ Uses cross-modal scene graphs for image-text retrieval, relying on both word embeddings and image features.

SLIDE 4

Subgraph Query for Retrieval

  • Directed subgraphs are extracted from scene graphs (objects as nodes, predicates as edges); see the sketch below.
  • Each subgraph contains a subject and an object as nodes connected by an edge representing a predicate relationship.
  • Visual relationships, represented as such subgraphs, are posed as structured queries.
  • A similarity metric in the scene embedding space is used for retrieval.
  • The scene embedding is learned via a pretext task (described on the next slide).

[Figure: scene graph]
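Below is a minimal Python sketch (not the authors' code) of how directed subgraph queries can be read off a scene graph; the object and predicate names are illustrative.

    # Hypothetical sketch: each (subject, predicate, object) edge of a
    # scene graph becomes one directed-subgraph query.
    from typing import List, NamedTuple, Tuple

    class Triplet(NamedTuple):
        subject: str    # node, e.g. "woman"
        predicate: str  # directed edge, e.g. "rides"
        object: str     # node, e.g. "motorcycle"

    def extract_subgraphs(scene_graph: List[Tuple[str, str, str]]) -> List[Triplet]:
        """Every predicate edge yields one subject -> object subgraph."""
        return [Triplet(s, p, o) for (s, p, o) in scene_graph]

    scene_graph = [("woman", "rides", "motorcycle"),
                   ("woman", "wears", "helmet")]
    print(extract_subgraphs(scene_graph)[0])
    # Triplet(subject='woman', predicate='rides', object='motorcycle')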

SLIDE 5

Scene Graph Embeddings from Layout Prediction

  • The scene graph embedding is learned via a pretext task: layout prediction.
  • Layout prediction utilizes object localization for individual objects AND a triplet-superbox regression network and a triplet-mask prediction network (described on the next slide).
  • A visual relation, represented as a directed subgraph, is posed as a structured query, e.g., ⟨giraffe, left of, giraffe⟩.
  • Euclidean distance in this scene embedding space is used for retrieval; see the sketch below.

[Figure: Learning the scene graph embedding]
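As a rough illustration of the retrieval step, here is a numpy sketch that ranks a gallery of scene embeddings by Euclidean distance to a query embedding. The embedding dimension and the random vectors are placeholders; in the paper the embeddings come from the layout-prediction pretext task.

    import numpy as np

    def retrieve(query_emb, gallery_embs, k=5):
        """Indices of the k gallery embeddings nearest to the query
        under Euclidean distance."""
        dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
        return np.argsort(dists)[:k]

    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(1000, 128))  # placeholder: 1000 images, 128-d embeddings
    query = rng.normal(size=128)            # placeholder query-subgraph embedding
    print(retrieve(query, gallery, k=5))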

SLIDE 6

Triplet Mask Network

Triplet mask prediction: triplets containing a <subject, predicate, object> found in a scene graph are used to predict corresponding triplet masks, labelling each pixel as either subject or object. The mask prediction is used as a supervisory signal during training.
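A minimal PyTorch sketch of this supervisory signal, under the assumption that the network scores each pixel as background, subject, or object and is trained with per-pixel cross-entropy; the shapes and the random stand-ins for the network output and ground truth are illustrative, not the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    B, H, W = 4, 64, 64
    logits = torch.randn(B, 3, H, W)         # stand-in for predicted per-pixel scores
    target = torch.randint(0, 3, (B, H, W))  # ground-truth triplet mask:
                                             # 0 = background, 1 = subject, 2 = object

    mask_loss = F.cross_entropy(logits, target)  # supervisory signal during training
    print(mask_loss.item())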

SLIDE 7

Qualitative Retrieval Results

Image Retrieval Results. Retrieval for structured queries with object types of varying frequency in the COCO-Stuff dataset: (a) head classes (person, tree), (b) medium-frequency long-tail classes (zebra, truck), and (c) low-frequency long-tail classes (skateboard, skis). The query is in the left-most column, corresponding to the red boxes.

SLIDE 8

Quantitative Results

Image Retrieval Performance. Recall@k for all classes (left) and long-tail vs. head classes (right) in COCO-Stuff.
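For reference, a small sketch of how the Recall@k numbers in such plots are typically computed: a query counts as a hit if any ground-truth match appears among its top-k retrieved images. The ranked lists below are illustrative.

    import numpy as np

    def recall_at_k(ranked, relevant, k):
        """ranked: (num_queries, gallery_size) indices sorted by distance;
        relevant: per-query set of ground-truth gallery indices."""
        hits = [len(set(ranked[q, :k]) & relevant[q]) > 0
                for q in range(len(relevant))]
        return float(np.mean(hits))

    ranked = np.array([[3, 7, 1, 0],
                       [2, 0, 1, 3]])
    relevant = [{7}, {3}]
    print(recall_at_k(ranked, relevant, k=2))  # 0.5: first query hits, second misses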

SLIDE 9

Retrieval Performance (NO input RGB image features)

Adding a visual relationship-inspired (triplet) loss boosts our recall by 10% in the best case.
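One of the triplet-based losses named on Slide 5 is triplet-superbox regression; the PyTorch sketch below shows one plausible form, regressing the union box that encloses a triplet's subject and object with an L1 loss. The (x1, y1, x2, y2) box format and the loss choice are assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def union_box(subj, obj):
        """Boxes as (x1, y1, x2, y2); the superbox encloses both."""
        x1y1 = torch.minimum(subj[:, :2], obj[:, :2])
        x2y2 = torch.maximum(subj[:, 2:], obj[:, 2:])
        return torch.cat([x1y1, x2y2], dim=1)

    subj_gt = torch.tensor([[0.1, 0.2, 0.4, 0.6]])
    obj_gt  = torch.tensor([[0.3, 0.1, 0.9, 0.7]])
    superbox_gt = union_box(subj_gt, obj_gt)      # target: (0.1, 0.1, 0.9, 0.7)

    superbox_pred = torch.rand(1, 4)              # stand-in for the regression head's output
    loss = F.l1_loss(superbox_pred, superbox_gt)  # triplet-superbox regression loss
    print(loss.item())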

SLIDE 10

Conclusions

  • We have trained scene graph embeddings for layout prediction with triplet-based loss functions.
  • For the downstream application of image retrieval, we use structured queries formed using the learned embeddings instead of input image content.
  • Our approach achieves high recall even on long-tailed object classes in the COCO-Stuff dataset.

SLIDE 11

Thank You!

Brigit Schroeder, UC Santa Cruz, brschroe@ucsc.edu

http://www.cs.uml.edu/~bschroed/

Subarna Tripathi, Intel Labs, subarna.tripathi@intel.com

https://subarnatripathi.github.io/

Please check out our paper online: https://arxiv.org/pdf/2005.06653.pdf