Image as a single label
“king crab”
Image Source: ImageNet
Image as an object set
Man, Person, Woman, Woman, Girl, Coat, King crab, Box
Image Source: ImageNet
Image as a scene graph
Objects: Man, Woman, Woman, Woman, Girl, King crab, Box, Coat
Relationships: “Woman look at box” “Man hold king crab” “Woman wear coat” “Man embrace woman”
Attributes: “Red king crab” “Blue coat” “Transparent box” “Smiling woman” “Smiling man”
Image Source: ImageNet
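As a rough illustration of the scene-graph view of an image, the sketch below stores objects as nodes, relationships as labeled directed edges, and attributes per object. The class name `SceneGraph` and its fields are hypothetical, and the several women on the slide are collapsed into a single "woman" node for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Objects are nodes; relationships are directed, labeled edges."""

    objects: list[str] = field(default_factory=list)
    # Each relationship is (subject index, predicate, object index).
    relationships: list[tuple[int, str, int]] = field(default_factory=list)
    # Attributes are keyed by object index.
    attributes: dict[int, list[str]] = field(default_factory=dict)

    def triplets(self) -> list[str]:
        """Render every edge as a "subject predicate object" phrase."""
        return [
            f"{self.objects[s]} {p} {self.objects[o]}"
            for s, p, o in self.relationships
        ]


# Simplified version of the slide example (one "woman" node is an assumption).
graph = SceneGraph(
    objects=["man", "woman", "king crab", "box", "coat"],
    relationships=[
        (0, "hold", 2),
        (1, "look at", 3),
        (1, "wear", 4),
        (0, "embrace", 1),
    ],
    attributes={0: ["smiling"], 1: ["smiling"], 2: ["red"], 3: ["transparent"], 4: ["blue"]},
)
print(graph.triplets())
```

Storing edges by index rather than by name keeps two objects with the same label (for example, two women) distinguishable.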
Distinguish images more accurately
[Figure: two images with their scene graphs. Both graphs contain Man, Hat and Horse; in one the relationship is "walking with", in the other "feeding".]
[1] Image Retrieval using Scene Graphs. Johnson et al. CVPR 2015
Image sources:
Right: https://www.videoblocks.com/video/the-man-in-hat-feed-a-brown-horse-with-flowers-on-the-meadow-supmox_3xj0tvkb67
Left: https://cals.ncsu.edu/wp-content/uploads/2016/08/horse-1500x931.png
Describe images in a more grounded way
“a man is walking with a horse” “the man is feeding a horse”
[1] Auto-Encoding Scene Graphs for Image Captioning. Yang et al. arXiv 2018
[2] Exploring Visual Relationship for Image Captioning. Yao et al. ECCV 2018
Answer questions more precisely
Q: What is the man walking with? A: A horse
Q: Is the man feeding a horse? A: Yes
[1] Graph-Structured Representations for Visual Question Answering. Teney et al. CVPR 2017
[2] Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. Yi et al. NeurIPS 2018
Generate questions in a more grounded way
Q: What animal is the man walking with?
Q: What is the man doing with the horse?
[1] Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang et al. CoRL 2018
[2] Information Maximizing Visual Question Generation. Krishna et al. CVPR 2019
Communication
Scene Graph generator
Ask Questions
Answer Questions
Scene Graph generator
Input → RPN → Region Proposals → ROI Pooling → Object Features and Relationship Features → Object Scores and Relationship Scores
[Figure: example scene graph with objects Cat, Cat, TV, Dog, Person, Cup, Book and relationships Watch, Left of, Right of, In, Hold, On]
Prior work, by the stage of the pipeline it refines:
Feature Updating with Message Passing: Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017
Feature Updating with Message Passing and Region Captions: Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017
Score Updating with a Frequency Prior; Message Passing for Feature Updating and Score Updating: Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018
Relation Proposal Network (RePN) with Message Passing for Feature Updating and Score Updating: Jianwei Yang*, Jiasen Lu*, Stefan Lee, Dhruv Batra, Devi Parikh. Graph R-CNN for Scene Graph Generation. ECCV 2018.
[Figure, panels (a)-(d): objects sweater, boy, fire hydrant, car, wheel, building, car; relationships wear, behind, near, next to]
1. Objects in a scene usually have relationships with others;
2. Not all object pairs have relationships, so the scene graph is usually sparse;
3. The existence of a relationship depends highly on the object categories, and the type of relationship depends highly on the context.
[Figure: Graph R-CNN architecture. Relational Proposal Network (RePN): Subject and Object projections (fc, fc, ReLU) feeding a Score Matrix, turning the dense graph into a sparse graph. Attentional GCNs (aGCN): 1st, 2nd and 3rd layers passing Conv features from Source to Target nodes with attention weights (0.2, 0.3, 0.05) over the attentional graph. Output: Scene Graph. Example graph labels: bird, head, wings, tails, branch, tree, leaf; has, stand, in, behind.]
connected scene graph;
contextual information.
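The relation proposal step in the figure (Subject and Object projections feeding a Score Matrix) might be sketched as below. This is a minimal sketch under stated assumptions, not the released implementation: each projection is a single fc layer with ReLU, the weights `w_subj` and `w_obj` are random placeholders, and the function names are hypothetical.

```python
import numpy as np


def relatedness_scores(obj_feats, w_subj, w_obj):
    """Score every ordered (subject, object) pair with a projected dot product."""
    subj = np.maximum(obj_feats @ w_subj, 0.0)  # fc + ReLU, subject role
    obj = np.maximum(obj_feats @ w_obj, 0.0)    # fc + ReLU, object role
    logits = subj @ obj.T                       # N x N score matrix
    scores = 1.0 / (1.0 + np.exp(-logits))      # squash to (0, 1)
    np.fill_diagonal(scores, 0.0)               # an object is not paired with itself
    return scores


def top_k_pairs(scores, k):
    """Keep the k highest-scoring ordered pairs as a sparse edge list."""
    n = scores.shape[0]
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [(int(i // n), int(i % n)) for i in flat]


rng = np.random.default_rng(0)
feats = rng.random((5, 8))              # 5 objects, 8-dim features (placeholder)
w_s = rng.standard_normal((8, 4))
w_o = rng.standard_normal((8, 4))
scores = relatedness_scores(feats, w_s, w_o)
print(top_k_pairs(scores, k=4))         # 4 of the 20 candidate edges survive
```

Using separate subject and object projections makes the score asymmetric, so (boy, sweater) and (sweater, boy) can be ranked differently.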
$$P(S \mid I) = \underbrace{P(V \mid I)}_{\text{Region Proposal}} \; \underbrace{P(E \mid V, I)}_{\text{Relation Proposal}} \; \underbrace{P(R, O \mid V, E, I)}_{\text{Graph Labeling}}$$
I: Input image
S: Scene graph
V: Scene graph vertices (objects)
E: Scene graph edges (relationships)
O: Scene graph object labels
R: Scene graph relationship labels
Region Proposal Network: Binary Cross Entropy Loss
Relation Proposal Network: Binary Cross Entropy Loss
Graph Labeling Network: Two Cross Entropy Losses,
Assume there are $N$ objects extracted from an image, then there are $N \cdot (N - 1)$ edges.
Step 1: Take the maximum of the object scores and predicate scores, excluding the background class.
Step 2: Compute relationship scores: $\mathrm{Rel}(i, j) = \mathrm{Subj}(i) \cdot \mathrm{Obj}(j) \cdot \mathrm{Pred}(i, j)$
Step 3: Sort the relationship triplets in descending order.
Step 4: Compute the triplet recalls (Recall@50, Recall@100) based on the ground truth:
$$\mathrm{Recall} = \frac{C(T_{\text{pred}} \text{ and } T_{\text{GT}})}{N(T_{\text{GT}})}$$
SGGen: IoU > 0.5
PhrCls: all object locations are known
PredCls: all object locations and labels are known
[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017
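The four evaluation steps can be sketched roughly as follows. This is an illustrative sketch, not the benchmark code: it assumes class index 0 is the background, matches triplets by labels only (no box IoU check, so it is closest to the PredCls setting), and the names `rank_triplets` and `recall_at_k` are hypothetical.

```python
import numpy as np


def rank_triplets(obj_scores, pred_scores):
    """obj_scores: (N, C), pred_scores: (N, N, P); index 0 is background.
    Returns (subject label, predicate label, object label, score), best first."""
    # Step 1: best non-background class per object and per ordered pair.
    obj_label = obj_scores[:, 1:].argmax(axis=1) + 1
    obj_conf = obj_scores[:, 1:].max(axis=1)
    pred_label = pred_scores[:, :, 1:].argmax(axis=2) + 1
    pred_conf = pred_scores[:, :, 1:].max(axis=2)
    n = obj_scores.shape[0]
    triplets = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # N * (N - 1) ordered pairs in total
            # Step 2: Rel(i, j) = Subj(i) * Obj(j) * Pred(i, j)
            score = obj_conf[i] * obj_conf[j] * pred_conf[i, j]
            triplets.append(
                (int(obj_label[i]), int(pred_label[i, j]), int(obj_label[j]), float(score))
            )
    # Step 3: sort in descending order of score.
    triplets.sort(key=lambda t: t[3], reverse=True)
    return triplets


def recall_at_k(triplets, gt_triplets, k):
    """Step 4: fraction of ground-truth triplets found among the top k."""
    top = {t[:3] for t in triplets[:k]}
    hits = sum(1 for g in gt_triplets if g in top)
    return hits / len(gt_triplets)


# Toy example: 3 objects, 2 foreground classes, 2 foreground predicates.
obj_scores = np.array([[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.1, 0.6, 0.3]])
pred_scores = np.zeros((3, 3, 3))
pred_scores[:, :, 0] = 0.8
pred_scores[:, :, 1:] = 0.1
pred_scores[0, 1] = [0.1, 0.8, 0.1]
pred_scores[1, 2] = [0.1, 0.1, 0.8]
ranked = rank_triplets(obj_scores, pred_scores)
print(recall_at_k(ranked, [(1, 1, 2), (2, 2, 1)], k=2))
```

For SGGen, a match would additionally require the predicted subject and object boxes to overlap the ground-truth boxes with IoU > 0.5, which this sketch omits.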
Dataset: Visual Genome [1] (Train: 75,651; Test: 32,422)
Backbone: VGG-16 Faster R-CNN [2]
#objects: 150
#predicates: 50
Metrics: PredCls, SGCls, SGGen, SGGen+, mAP
[1] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Krishna et al.
[2] A Faster Implementation of Faster R-CNN. Yang and Lu et al.
Recall@100
Method       PredCls  PhrCls  SGGen  SGGen+
IMP [1]      45.2     22.4    8.0    27.7
MSDN [2]     57.9     29.9    9.1    28.2
NM-Freq [3]  48.8     27.2    9.1    27.8
Ours         59.1     31.6    13.7   35.9
SGGen+: proposed new metric (details in our paper)
Our model shows an improvement of over four points
[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017
[2] Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017
[3] Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018
[Figure: qualitative scene graph results. Object labels: surfboard, bear, ear, leg, flower, head, bird, tree, wing, branch, leaf, man, water, wave, arm, short. Relationship labels: has, near, in, behind, ride.]
[Figure: bar chart, axis label Recall@50, over PredCls, PhrCls, SGGen, SGGen+ and mAP for Base, Base+RePN, Base+RePN+GCN and Base+RePN+aGCN]
RePN improves SGGen, SGGen+ and mAP
GCN/aGCN improves PredCls and PhrCls
https://github.com/jwyang/graph-rcnn.pytorch
The goal of gathering all these representative methods into a single repo is to establish a fairer comparison across different methods under the same settings. Contributions are welcome!
Ask Questions Answer Questions
Scene Graph generator
Visual Question Answering is a challenging task that requires full visual understanding, language understanding and reasoning.
and relationship between objects).
chain in the question)
*Joint work with Chuang Gan et al
[1] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Hudson et al. CVPR 2019
Q: The shorts have what color?
P: Filter(shorts)->Query(color)
SG: shorts: 0.54 gray: 0.47 brown: 0.19
A: gray
Q: What color is the frisbee?
P: Filter(frisbee)->Query(color)
SG: frisbee: 0.85 yellow: 0.54
A: yellow
Q: Who wears shorts?
P: Filter(shorts)->Relate_Subject(wears)
SG: man: 0.78 shorts: 0.54 gray: 0.47 brown: 0.19 wear: 0.45
A: man
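The programs above (Filter, Query, Relate_Subject) can be read as small functions executed over a scored scene graph. The sketch below is only an illustration of that idea: the dictionary layout, the function names, and the normalization of "wears" to the stored predicate "wear" are all assumptions, not the authors' executor.

```python
# Scored scene graph for the shorts example (layout is a hypothetical choice).
scene_graph = {
    "objects": [
        {"name": "man", "score": 0.78, "attributes": {}},
        {
            "name": "shorts",
            "score": 0.54,
            "attributes": {"color": {"gray": 0.47, "brown": 0.19}},
        },
    ],
    # (subject index, predicate, object index, score)
    "relations": [(0, "wear", 1, 0.45)],
}


def filter_objects(graph, name):
    """Filter(name): indices of all objects with the given label."""
    return [i for i, o in enumerate(graph["objects"]) if o["name"] == name]


def query(graph, indices, attribute):
    """Query(attribute): highest-scoring attribute value among the objects."""
    best_value, best_score = None, 0.0
    for i in indices:
        values = graph["objects"][i]["attributes"].get(attribute, {})
        for value, score in values.items():
            if score > best_score:
                best_value, best_score = value, score
    return best_value


def relate_subject(graph, indices, predicate):
    """Relate_Subject(predicate): best subject related to the given objects."""
    targets = set(indices)
    candidates = [
        (score, s)
        for s, p, o, score in graph["relations"]
        if p == predicate and o in targets
    ]
    if not candidates:
        return None
    return graph["objects"][max(candidates)[1]]["name"]


shorts = filter_objects(scene_graph, "shorts")
print(query(scene_graph, shorts, "color"))          # Filter(shorts)->Query(color)
print(relate_subject(scene_graph, shorts, "wear"))  # Filter(shorts)->Relate_Subject(wears)
```

Because each step only reads scores already present in the graph, a wrong answer can be traced back to a specific node, attribute, or edge.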
Ask Questions Answer Questions
Scene Graph generator
Agent: What is the black object on top of the table on the left side?
Human (Oracle): That’s a coffee bottle.
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang and Lu et al. CoRL 2018
[Figure: agent architecture, built up over several slides. The agent and the Oracle/Human see the same image content. Components: Visual System, Visual Graph, Visual Memory, Graph Memory, Question Generator, Answer Digestor. The agent sends a Question and receives an Answer. Updates shown: Bottom-Up Update, Top-Down Update.]
Slide Credit: Stefan Lee
Graph Memory
[Figure: image with six numbered objects (1-6) and the attributes currently stored for each]
Color: UNK, Size: UNK, Shape: Cube, Mat: UNK
Color: Red, Size: Large, Shape: UNK, Mat: UNK
Color: UNK, Size: UNK, Shape: UNK, Mat: UNK
Color: Pink, Size: Small, Shape: Cube, Mat: UNK
Color: UNK, Size: UNK, Shape: Cube, Mat: Metal
Color: UNK, Size: Small, Shape: UNK, Mat: UNK
Target: 1, Attribute: Shape, Reference: None, Question: What is the shape of the large red object?
Target: 3, Attribute: Color, Reference: 2, Question: What is the color of the metal cube on the left side of a small object?
Target: 5, Attribute: Material, Reference: 3, Question: What is the material of the object at the left side of the metal cube?
Update the visual system after each dialog with the Oracle/Human.
Loss: cross-entropy loss between the graph memory and the visual predictions over all images, objects, and attributes
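$$\theta_v^* = \arg\min_{\theta_v} \; -\sum_{I_k \in \mathcal{I}} \sum_{j=1}^{N_k} \sum_{a} G_{k,j}^{a} \cdot \log P_{k,j}^{a}(\theta_v)$$
where $P$ denotes the visual attribute predictions and $G$ the graph memory.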
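A cross-entropy between graph memory and visual predictions, as described above, might look like the sketch below for a single attribute. Treating the graph memory as hard labels and skipping slots that are still UNK are assumptions made for illustration; the names `UNK` and `memory_cross_entropy` are hypothetical.

```python
import numpy as np

UNK = -1  # marker for attribute slots the graph memory has not filled yet


def memory_cross_entropy(pred_probs, memory_labels):
    """Cross-entropy of predictions against graph-memory labels.

    pred_probs: (num_objects, num_values) predicted distribution per object.
    memory_labels: (num_objects,) value index from graph memory, or UNK.
    """
    rows = np.flatnonzero(memory_labels != UNK)  # only supervised slots
    if rows.size == 0:
        return 0.0
    picked = pred_probs[rows, memory_labels[rows]]
    return float(-np.log(picked + 1e-12).sum())


# Three objects, one attribute with two possible values.
pred = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
memory = np.array([0, UNK, 1])  # the second object is still unknown
print(memory_cross_entropy(pred, memory))
```

Summing this quantity over all images, objects and attributes gives the kind of objective the slide describes.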
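Use A2C and update the policy after each episode based on all rounds.
Optimal Policy:
$$\theta_q^* = \arg\max_{\theta_q} \; \mathbb{E}_{I \sim \mathcal{E}} \, \mathbb{E}_{\pi} \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} r_i^t \right], \quad q_i^t \sim \pi(g_i^t; \theta_q)$$
Reward:
$$r_i^t = S(G_i^t, G_i^*) - S(G_i^{t-1}, G_i^*)$$
where $G_i^*$ is the Oracle Graph.
Can be improved by: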
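The reward is the gain in similarity to the oracle graph from one round to the next. The sketch below illustrates that difference with a stand-in similarity: the fraction of oracle attribute slots that the graph memory fills correctly. That choice of S, and the function names, are assumptions for illustration only.

```python
def graph_similarity(graph, oracle):
    """Stand-in for S: share of oracle attribute slots matched by the graph."""
    total = correct = 0
    for obj_id, attrs in oracle.items():
        for name, value in attrs.items():
            total += 1
            if graph.get(obj_id, {}).get(name) == value:
                correct += 1
    return correct / total if total else 0.0


def reward(prev_graph, new_graph, oracle):
    """r = S(G_t, G*) - S(G_{t-1}, G*)."""
    return graph_similarity(new_graph, oracle) - graph_similarity(prev_graph, oracle)


oracle = {1: {"color": "red", "shape": "cube"}}
before = {1: {"color": "red"}}                  # shape still unknown
after = {1: {"color": "red", "shape": "cube"}}  # shape learned from the answer
print(reward(before, after, oracle))
```

A question whose answer adds nothing new leaves the similarity unchanged and therefore earns zero reward.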
Synthesized Dataset
Different shapes, colors, materials and
[1] CLEVR. Johnson et al.
Realistic Dataset
Various real indoor scenes. Annotated based on the ARID dataset [2]
[2] Recognizing Objects In-the-Wild. Loghmani et al.
Graph Recovery
Method             R@10  R@20  R@50
Random             28.3  36.5  59.4
Entropy            29.5  39.1  65.5
Entropy + Context  38.0  52.5  67.1
Ours               42.1  59.1  89.3
Ours w/o v         25.8  50.6  84.1
Consistent improvements over heuristic baselines, especially over longer dialogs
Experiments: Novel Object Environments
Novel: new colors and shapes; 600 images for test (12 episodes)
Mixed: mix of novel and standard colors and shapes; 600 images for test (12 episodes)
Realistic: 51 categories, 11 colors, 6 materials; 1200 images for test (24 episodes)
Experiments: Standard Train – New Test Environments
Graph Recovery
Train-Test     R@10  R@20  R@50
Std-Std        42.1  59.1  89.3
Std-Novel      43.3  58.4  88.9
Std-Mixed      42.9  60.1  90.3
Std-Realistic  35.6  53.4  86.2
Practically no loss of performance in synthetic settings and small reductions for realistic (many more categories)
Experiments: Visual System Performance
[Figure: visual system performance in the Standard, Mixed, Novel and Realistic environments]
Experiments: Qualitative Example
Scene labels: food, potato, brown, paper, cereal, yellow, plastic, ball
Questions asked:
What is the closest thing that is in front of the yellow plastic ball?
What material is the leftmost thing?
There is a leftmost object; what is it?
The leftmost object is what color?
What is the closest thing that is in front of the yellow plastic ball made of?
…
Answers received: cereal, paper, food, potato, brown
human, especially visual question answering and generation
tasks, such as expression coreference
Leveraging external “knowledge” when interpreting images
Specifically, using richer vision, language models/data to improve vision+language models
Representing internal structure in images
Specifically, scene graphs: generating, evaluating and using them for vision+language