Structured Query-Based Image Retrieval using Scene Graphs
Brigit Schroeder, UCSC
Subarna Tripathi, Intel Labs
Complexity of Object Interactions for Retrieval
- Structured queries capture the complexity of object interactions, unlike single-object queries.
- A visual relationship is a directed subgraph with the subject and object as nodes connected by a predicate.
- We propose to retrieve images from such queries (NOT from RGB image features) utilizing a learned scene embedding space; a minimal sketch of such a query follows.
(Figure: "woman rides motorcycle" vs. the unordered pair "woman, motorcycle")
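Below is a minimal sketch of a visual-relationship query as a directed subgraph: the subject and object are nodes, and the predicate is the directed edge between them. The class and field names are our illustration, not the paper's code.

    # Illustrative only: a triplet query as a directed subject->object edge.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TripletQuery:
        subject: str    # e.g. "woman"
        predicate: str  # e.g. "rides"
        obj: str        # e.g. "motorcycle"

    query = TripletQuery("woman", "rides", "motorcycle")
    # Direction matters: <woman, rides, motorcycle> is a different query
    # than <motorcycle, rides, woman>.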
Related Work
- Image Retrieval Using Scene Graphs (Johnson et al., CVPR 2015)
⇒ Uses a CRF model to match the best possible bounding-box groundings from the scene graph to the image for retrieval.
- Cross-Modal Scene Graph Matching for Relationship-Aware Image-Text Retrieval (Wang et al., WACV 2020)
⇒ Uses cross-modal scene graphs for image-text retrieval, relying on both word embeddings and image features.
Subgraph Query for Retrieval
- Directed subgraphs are extracted from scene graphs (objects as nodes, predicates as edges); see the extraction sketch after this list.
- Each subgraph contains a subject and an object as nodes connected by an edge representing a predicate relationship.
- Visual relationships, represented as such subgraphs, are posed as structured queries.
- A similarity metric in the scene embedding space is used for retrieval.
- The scene embedding is learned via a pretext task (described on the next slide).
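A minimal sketch of subgraph extraction, assuming the scene graph is stored as a node list plus directed (subject, predicate, object) edges; the function name and storage format are our assumptions, not the paper's code:

    # Illustrative: pull <subject, predicate, object> subgraphs out of a
    # scene graph stored as labelled nodes and directed edges.
    def extract_subgraphs(nodes, edges):
        """nodes: object labels; edges: (subj_idx, predicate, obj_idx) triples."""
        return [(nodes[s], p, nodes[o]) for (s, p, o) in edges]

    nodes = ["giraffe", "giraffe", "tree"]
    edges = [(0, "left of", 1), (1, "by", 2)]
    print(extract_subgraphs(nodes, edges))
    # [('giraffe', 'left of', 'giraffe'), ('giraffe', 'by', 'tree')]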
Scene Graph Embeddings from Layout Prediction
- The scene graph embedding is learned via a pretext task: layout prediction.
- Layout prediction utilizes object localization for individual objects AND a triplet-superbox regression network and a triplet-mask prediction network (described on the next slide).
- A visual relationship, as a directed subgraph, is posed as a structured query, e.g. <giraffe, left of, giraffe>.
- Euclidean distance in this scene embedding space is used for retrieval, as sketched below.
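A minimal retrieval sketch using Euclidean distance in the embedding space, assuming the database images' scene subgraphs have already been embedded into vectors; the helper names and shapes are illustrative, not the authors' code:

    # Illustrative: nearest-neighbour retrieval by Euclidean distance.
    import numpy as np

    def retrieve(query_emb, db_embs, k=5):
        """Return indices of the k database embeddings closest to the query."""
        dists = np.linalg.norm(db_embs - query_emb, axis=1)
        return np.argsort(dists)[:k]

    db_embs = np.random.randn(1000, 128)  # stand-in database embeddings (N x D)
    query_emb = np.random.randn(128)      # stand-in embedded structured query
    top_k = retrieve(query_emb, db_embs, k=5)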
Learning Scene Graph Embedding
Triplet Mask Network
Triplet mask prediction: triplets containing a <subject, predicate, object> found in a scene graph are used to predict corresponding triplet masks, labelling pixels as either subject or object. The mask prediction is used as a supervisory signal during training; a sketch follows.
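A minimal sketch of how such mask supervision could be implemented as a per-pixel cross-entropy loss over {background, subject, object}; this is our reading of the slide, not the authors' training code:

    # Illustrative: per-pixel cross-entropy on predicted triplet masks.
    import torch
    import torch.nn.functional as F

    def triplet_mask_loss(pred_logits, gt_mask):
        """pred_logits: (B, 3, H, W) scores for {background, subject, object};
        gt_mask: (B, H, W) integer labels in {0, 1, 2}."""
        return F.cross_entropy(pred_logits, gt_mask)

    pred_logits = torch.randn(2, 3, 64, 64)     # stand-in network output
    gt_mask = torch.randint(0, 3, (2, 64, 64))  # stand-in ground-truth masks
    loss = triplet_mask_loss(pred_logits, gt_mask)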
Qualitative Retrieval Results
Image Retrieval Results. Retrieval for structured queries with object types of varying frequency in the COCO-Stuff dataset: (a) head classes (person, tree), (b) medium-frequency long-tail classes (zebra, truck), and (c) low-frequency long-tail classes (skateboard, skis). The query is in the left-most column, corresponding to the red boxes.
Quantitative Results
Image Retrieval Performance. Recall@k (sketched below) for all classes (left) and long-tail vs. head classes (right) in the COCO-Stuff dataset.
Retrieval Performance (with NO input RGB image features)
Adding a visual relationship-inspired (triplet) loss boosts our recall by 10% in the best case.
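For reference, a minimal sketch of the Recall@k metric: the fraction of queries whose ground-truth image appears among the top-k retrieved results (an illustrative helper, not the evaluation code used in the paper):

    # Illustrative: Recall@k over a set of queries.
    def recall_at_k(ranked_results, ground_truth, k):
        """ranked_results: one ranked list of image ids per query;
        ground_truth: the correct image id for each query."""
        hits = sum(gt in results[:k]
                   for results, gt in zip(ranked_results, ground_truth))
        return hits / len(ground_truth)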
Conclusions
- We have trained scene graph embeddings for layout prediction with
triplet-based loss functions.
- For the downstream application of image retrieval, we use structured
queries formed using the learned embeddings instead of input image content.
- Our approach achieves high recall even on long-tailed object classes in
the COCO-Stuff dataset.
Thank You!
Brigit Schroeder UC Santa Cruz brschroe@ucsc.edu
http://www.cs.uml.edu/~bschroed/
Subarna Tripathi Intel Labs subarna.tripathi@intel.com
https://subarnatripathi.github.io/
Please check out our paper online: https://arxiv.org/pdf/2005.06653.pdf