 
Image as a single label: “king crab”. Image Source: ImageNet
Image as an object set: man, person, woman, woman, girl, coat, king crab, box.
Image as a scene graph. Nodes: man, woman, woman, woman, girl, coat, king crab, box. Relationships: “Man hold king crab”, “Man embrace woman”, “Woman wear coat”, “Woman look at box”.
Image as a scene graph, with the same relationships plus attributes: “Red king crab”, “Transparent box”, “Blue coat”, “Smiling woman”, “Smiling man”.
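The scene graph above can be held in a small data structure: object nodes, per-object attributes, and directed (subject, predicate, object) edges. The sketch below is a minimal illustration; the class and method names are hypothetical, and because the slide text does not say which woman takes part in which relationship, a single hypothetical "woman" node stands in for all of them.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Objects (with optional attributes) plus directed (subject, predicate, object) edges."""
    objects: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)     # object index -> list of attributes
    relationships: list = field(default_factory=list)  # (subject index, predicate, object index)

    def add_object(self, name, *attrs):
        """Add an object node and return its index."""
        self.objects.append(name)
        idx = len(self.objects) - 1
        if attrs:
            self.attributes[idx] = list(attrs)
        return idx

    def relate(self, subj, predicate, obj):
        """Add a directed edge from subject to object labeled with a predicate."""
        self.relationships.append((subj, predicate, obj))

    def triples(self):
        """Return relationships as readable (subject, predicate, object) name triples."""
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.relationships]


g = SceneGraph()
man = g.add_object("man", "smiling")
woman = g.add_object("woman", "smiling")  # hypothetical: one node stands in for all the women
coat = g.add_object("coat", "blue")
crab = g.add_object("king crab", "red")
box = g.add_object("box", "transparent")
g.relate(man, "hold", crab)
g.relate(man, "embrace", woman)
g.relate(woman, "wear", coat)
g.relate(woman, "look at", box)
print(g.triples())
# [('man', 'hold', 'king crab'), ('man', 'embrace', 'woman'), ('woman', 'wear', 'coat'), ('woman', 'look at', 'box')]
```

Keeping attributes on nodes and predicates on edges mirrors the progression from object set to scene graph: the object list alone is the object-set view, and the edges add the relationship layer.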
Why do we need scene graphs? To distinguish images more accurately. [Figure: two images that both contain a man, a hat and a horse; left: the man is walking with the horse; right: the man is feeding the horse.] [1] Image Retrieval using Scene Graphs. Johnson et al. CVPR 2015. Left: https://cals.ncsu.edu/wp-content/uploads/2016/08/horse-1500x931.png Right: https://www.videoblocks.com/video/the-man-in-hat-feed-a-brown-horse-with-flowers-on-the-meadow-supmox_3xj0tvkb67
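A toy illustration of why relationships help: both images have the same object set, so a query over objects alone cannot tell them apart, while a query over (subject, predicate, object) triples can. This is only exact set matching on hand-written annotations; it is not the retrieval method of Johnson et al., and the `images` dictionary and function names are hypothetical.

```python
# Both images contain the same objects; only the relationship differs.
images = {
    "left": {"objects": {"man", "hat", "horse"},
             "triples": {("man", "walking with", "horse")}},
    "right": {"objects": {"man", "hat", "horse"},
              "triples": {("man", "feeding", "horse")}},
}


def retrieve_by_objects(query_objects, images):
    """Return images whose object set contains every queried object."""
    return sorted(name for name, img in images.items() if query_objects <= img["objects"])


def retrieve_by_triples(query_triples, images):
    """Return images whose relationship set contains every queried triple."""
    return sorted(name for name, img in images.items() if query_triples <= img["triples"])


print(retrieve_by_objects({"man", "horse"}, images))               # ['left', 'right']
print(retrieve_by_triples({("man", "feeding", "horse")}, images))  # ['right']
```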
Why do we need scene graphs? To describe images in a more grounded way. Left: “a man is walking with a horse”; right: “the man is feeding a horse”. [1] Auto-Encoding Scene Graphs for Image Captioning. Yang et al. arXiv 2018. [2] Exploring Visual Relationship for Image Captioning. Yao et al. ECCV 2018.
Why do we need scene graphs? To answer questions more precisely. Left: Q: What is the man walking with? A: A horse. Right: Q: Is the man feeding a horse? A: Yes. [1] Graph-Structured Representations for Visual Question Answering. Teney et al. CVPR 2017. [2] Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. Yi et al. NeurIPS 2018.
Why do we need scene graphs? To generate more grounded questions. Left: Q: What animal is the man walking with? Right: Q: What is the man doing with the horse? [1] Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang et al. CoRL 2018. [2] Information Maximizing Visual Question Generation. Krishna et al. CVPR 2019.
Scene graphs as the interface between a visual system and a human: a scene graph generator supports communication in both directions, with visual question answering (the system answers questions) and visual question generation (the system asks questions).
Skeleton Model: input image → RPN → region proposals; ROI pooling → object features → object scores; ROI pooling → relationship features → relationship scores. [Figure: example output with objects cup, dog, person, book, TV and cat, and relationships hold, in, on, watch, left of and right of.]
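The relationship branch needs one region per candidate (subject, object) pair. A common choice, assumed here rather than stated on the slide, is to ROI-pool the union box of the two proposals. The sketch enumerates every ordered pair of distinct proposals and computes that union box; the function names and box coordinates are hypothetical.

```python
from itertools import permutations


def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) covering both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def relationship_rois(proposals):
    """One (subject index, object index, union box) entry per ordered pair i != j."""
    return [(i, j, union_box(proposals[i], proposals[j]))
            for i, j in permutations(range(len(proposals)), 2)]


# Hypothetical region proposals in (x1, y1, x2, y2) pixel coordinates.
proposals = [(10, 10, 50, 60), (40, 20, 90, 80), (0, 70, 30, 100)]
rois = relationship_rois(proposals)
print(len(rois))  # 6, i.e. n * (n - 1) for n = 3
print(rois[0])    # (0, 1, (10, 10, 90, 80))
```

Pairs are ordered because relationships are directed ("person watch TV" is not "TV watch person"), so n proposals yield n(n - 1) candidates, a number that grows quadratically with n.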
Iterative Message Passing (IMP): adds message passing between object features and relationship features, updating both before scoring (feature updating). Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017.
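The feature-updating idea can be sketched as alternating updates: each relationship absorbs information from its subject and object, and each object absorbs information from its incident relationships. This is a simplified sketch, not IMP itself: plain averaging and a fixed mixing weight `alpha` (a hypothetical parameter) replace the learned components of the original model.

```python
def _mix(a, b, alpha):
    """Convex combination (1 - alpha) * a + alpha * b of two vectors."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def message_passing(obj_feats, rel_feats, steps=2, alpha=0.5):
    """Alternately refine object and relationship features.

    obj_feats: list of object feature vectors.
    rel_feats: dict mapping (subject index, object index) to a feature vector.
    """
    obj = [list(v) for v in obj_feats]
    rel = {pair: list(v) for pair, v in rel_feats.items()}
    for _ in range(steps):
        # Edge update: each relationship receives the mean of its subject and object features.
        new_rel = {(s, o): _mix(v, _mean([obj[s], obj[o]]), alpha) for (s, o), v in rel.items()}
        # Node update: each object receives the mean of its incident relationship features.
        new_obj = []
        for i, v in enumerate(obj):
            incident = [r for (s, o), r in rel.items() if i in (s, o)]
            new_obj.append(_mix(v, _mean(incident), alpha) if incident else v)
        obj, rel = new_obj, new_rel  # synchronous: both updates read the previous step's values
    return obj, rel


objs, rels = message_passing([[0.0], [2.0], [5.0]], {(0, 1): [1.0]}, steps=1)
print(objs)  # [[0.5], [1.5], [5.0]]  (object 2 has no relationships, so it is unchanged)
print(rels)  # {(0, 1): [1.0]}
```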
Multi-level Scene Description Network (MSDN): adds region captions as a third level alongside objects and relationships, with message passing for feature updating. Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017.
Neural Motif Network: updates features and also updates relationship scores with a frequency prior. Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018.
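A frequency prior captures how often each predicate occurs for a given (subject class, object class) pair in the training data. The sketch below estimates that distribution from counts, with add-one smoothing (our assumption), and adds its log to hypothetical visual predicate scores. It illustrates the general idea only, not the exact formulation used in Neural Motifs; the training triples and scores are made up.

```python
import math
from collections import Counter, defaultdict


def build_frequency_prior(train_triples, predicates, smoothing=1.0):
    """Return a function giving log P(predicate | subject class, object class).

    Counts come from (subject, predicate, object) training triples; add-`smoothing`
    (Laplace) smoothing keeps unseen predicates at a finite log-probability.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in train_triples:
        counts[(subj, obj)][pred] += 1

    def log_prior(subj, obj):
        c = counts.get((subj, obj), Counter())
        total = sum(c.values()) + smoothing * len(predicates)
        return {p: math.log((c[p] + smoothing) / total) for p in predicates}

    return log_prior


# Hypothetical training data and visual predicate scores for one (man, horse) pair.
predicates = ["walking with", "feeding", "wearing"]
train = [("man", "walking with", "horse"), ("man", "walking with", "horse"),
         ("man", "feeding", "horse"), ("man", "wearing", "hat")]
prior = build_frequency_prior(train, predicates)
visual = {"walking with": 0.1, "feeding": 0.3, "wearing": 0.0}
combined = {p: visual[p] + prior("man", "horse")[p] for p in predicates}
print(max(visual, key=visual.get))      # feeding
print(max(combined, key=combined.get))  # walking with
```

The prior can overturn a weak visual preference in favor of the predicate most common for that class pair, which is the score-updating effect the slide refers to.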
Graph R-CNN (our work): adds a Relation Proposal Network (RePN) on top of the region proposals, and uses message passing for both feature updating and score updating. Jianwei Yang*, Jiasen Lu*, Stefan Lee, Dhruv Batra, Devi Parikh. Graph R-CNN for Scene Graph Generation. ECCV 2018.
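A relation proposal step can be sketched as scoring every ordered object pair from the two objects' class distributions and keeping only the top K pairs, turning the dense graph into a sparse one. In this sketch the relatedness of pair (i, j) is a sigmoid of the inner product of a subject projection of p_i and an object projection of p_j. The projections are fixed, hand-picked linear maps (hypothetical; a real model would learn them), the class set (person, horse, hat) is made up, and any pruning beyond top-K, such as suppressing overlapping pairs, is omitted.

```python
import math


def _matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relatedness_scores(class_probs, W_subj, W_obj):
    """S[i][j] = sigmoid(<phi(p_i), psi(p_j)>) for i != j, with linear phi and psi."""
    subj = [_matvec(W_subj, p) for p in class_probs]  # phi: subject projection
    obj = [_matvec(W_obj, p) for p in class_probs]    # psi: object projection
    n = len(class_probs)
    return [[1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(subj[i], obj[j]))))
             if i != j else 0.0 for j in range(n)] for i in range(n)]


def top_k_pairs(scores, k):
    """Keep the k highest-scoring ordered (subject, object) pairs of the dense graph."""
    n = len(scores)
    ranked = sorted(((scores[i][j], i, j) for i in range(n) for j in range(n) if i != j),
                    reverse=True)
    return [(i, j) for _, i, j in ranked[:k]]


# Hypothetical class distributions over (person, horse, hat) for three detected objects.
probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
W_subj = [[1.0, 0.0, 0.0]]  # hand-picked: "how person-like is the subject"
W_obj = [[0.0, 1.0, 1.0]]   # hand-picked: "how horse- or hat-like is the object"
S = relatedness_scores(probs, W_subj, W_obj)
print(sorted(top_k_pairs(S, 2)))  # [(0, 1), (0, 2)]
```

Because the score depends on class distributions, the pruning reflects the observation that whether two objects are related depends heavily on their categories.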
Motivations [Figure, panels (a) to (d): an example image with objects car, building, wheel, boy, fire hydrant and sweater, related by behind, next to, on, near and wear.]
1. Objects in a scene usually have relationships with other objects;
2. Not all object pairs have relationships, so the scene graph is usually sparse;
3. The existence of a relationship depends highly on the object categories, while the type of relationship depends highly on the context.
Framework [Figure: conv features and object proposals feed a Relation Proposal Network (RePN), which builds a subject/object score matrix and prunes the dense graph into a sparse graph; three layers of attentional GCNs (aGCN), with attention computed by fc and ReLU layers between source and target nodes, turn it into an attentional graph and then the final scene graph. Example objects: bird, tree, head, wings, tails, leaf, branch; example relationships: on, has, of, behind, in.]
1. A Relation Proposal Network (RePN) that learns to prune the densely connected scene graph into a sparse one;
2. Attentional graph convolutional networks (aGCN) that incorporate contextual information.
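The aGCN step can be sketched as a graph convolution in which each node aggregates its neighbors' transformed features, weighted by attention coefficients that are computed from the (source, target) feature pair with an fc layer and a ReLU, as the figure labels suggest. The sketch uses the form z_i' = ReLU(z_i + Σ_j α_ij W z_j), with α_ij a softmax over neighbors of w_h · ReLU(W_a [z_i; z_j]). This is a generic attentional graph convolution written for illustration: the weight names are hypothetical, all nodes are treated alike (object and relationship nodes are not distinguished), and the layer sizes are not taken from the slides.

```python
import math


def _matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def _relu(v):
    return [max(0.0, x) for x in v]


def agcn_layer(z, neighbors, W, W_a, w_h):
    """One attentional graph-convolution step over a (pruned) sparse graph.

    z: node feature vectors (all of length d); neighbors: dict node -> neighbor list.
    W: d x d message transform; W_a: h x 2d attention layer; w_h: length-h attention vector.
    """
    out = []
    for i, zi in enumerate(z):
        nbrs = neighbors.get(i, [])
        agg = [0.0] * len(zi)
        if nbrs:
            # u_ij = w_h . ReLU(W_a [z_i; z_j]); alpha_ij = softmax of u_ij over neighbors j.
            logits = [sum(h * a for h, a in zip(w_h, _relu(_matvec(W_a, zi + z[j]))))
                      for j in nbrs]
            m = max(logits)  # subtract the max for a numerically stable softmax
            exps = [math.exp(u - m) for u in logits]
            total = sum(exps)
            for e, j in zip(exps, nbrs):
                agg = [g + (e / total) * x for g, x in zip(agg, _matvec(W, z[j]))]
        # z_i' = ReLU(z_i + sum_j alpha_ij W z_j); isolated nodes just pass through ReLU.
        out.append(_relu([x + g for x, g in zip(zi, agg)]))
    return out


z = [[0.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[1.0, 0.0, 1.0, 0.0]]  # hypothetical attention weights
w_h = [1.0]
print(agcn_layer(z, {0: [1, 2]}, identity, W_a, w_h))
# [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
```

Stacking several such layers (three in the figure) lets context propagate beyond immediate neighbors, and restricting `neighbors` to the RePN-pruned pairs keeps each layer's cost proportional to the number of kept edges rather than to n(n - 1).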