
SLIDE 1

Image as a single label

“king crab”

Image Source: ImageNet

SLIDE 2

Image as an object set

King crab Man

Person Box

Coat Image Source: ImageNet

Girl Woman

Woman

SLIDE 3

Image as a scene graph

King crab

Woman Woman

embrace

Box

look at

Coat

wear

Image Source: ImageNet “Woman look at box” “Man hold king crab”

Girl

Man “Woman wear coat” “Man embrace woman”

Relationships:

Woman

hold

SLIDE 4

Image as a scene graph

King crab

Woman Woman

hold

Box

look at

Coat

wear

Image Source: ImageNet

Girl

Man “Red king crab” “Blue coat” “Transparent box” “Smiling woman” “Smiling Man”

Attributes:

Woman

embrace

“Woman look at box” “Man hold king crab” “Woman wear coat” “Man embrace woman”

Relationships:
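A scene graph like the one on this slide can be held in a very small data structure. The following is a minimal sketch, assuming simplified object names (one "woman", one "man") and a hypothetical `SceneGraph` class that is not part of any library; it only illustrates how objects, attributes, and relationship triplets fit together.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container: objects are nodes, attributes hang off nodes,
    and relationships are directed (subject, predicate, object) edges."""
    objects: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def triplets_for(self, subject):
        """Return every relationship triplet whose subject matches."""
        return [t for t in self.relationships if t[0] == subject]


# Values taken from the slide; object names are simplified for the sketch.
graph = SceneGraph(
    objects=["king crab", "man", "woman", "coat", "box"],
    attributes={
        "king crab": ["red"],
        "coat": ["blue"],
        "box": ["transparent"],
        "woman": ["smiling"],
        "man": ["smiling"],
    },
    relationships=[
        ("woman", "look at", "box"),
        ("man", "hold", "king crab"),
        ("woman", "wear", "coat"),
        ("man", "embrace", "woman"),
    ],
)

print(graph.triplets_for("man"))
# [('man', 'hold', 'king crab'), ('man', 'embrace', 'woman')]
```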

SLIDE 5

Why do we need scene graphs?

Man Horse Man Horse

Distinguish images more accurately

Walking with Feeding

[1] Image Retrieval using Scene Graphs. Johnson et al. CVPR 2015

Hat

Right: https://www.videoblocks.com/video/the-man-in-hat-feed-a-brown-horse-with-flowers-on-the-meadow-supmox_3xj0tvkb67
Left: https://cals.ncsu.edu/wp-content/uploads/2016/08/horse-1500x931.png

SLIDE 6

Why do we need scene graphs?

Man Horse Man Horse

Describe images with better grounding

“a man is walking with a horse”
“the man is feeding a horse”

[1] Auto-Encoding Scene Graphs for Image Captioning. Yang et al. arXiv 2018
[2] Exploring Visual Relationship for Image Captioning. Yao et al. ECCV 2018

Hat


SLIDE 7

Why do we need scene graphs?

Man Horse Man Horse

Answer questions more precisely

Q: What is the man walking with? A: A horse
Q: Is the man feeding a horse? A: Yes

[1] Graph-Structured Representations for Visual Question Answering. Teney et al. CVPR 2017
[2] Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. Yi et al. NeurIPS 2018

Hat


SLIDE 8

Why do we need scene graphs?

Man Horse Man Horse

Generate better-grounded questions

Q: What animal is the man walking with?
Q: What is the man doing with the horse?

[1] Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang et al. CoRL 2018
[2] Information Maximizing Visual Question Generation. Krishna et al. CVPR 2019

Hat


SLIDE 9

Human

Communication

Visual System

Scene Graph generator

SLIDE 10

Human

Answer Questions

Visual Question Answering

Visual System

Scene Graph generator

SLIDE 11

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator

SLIDE 12

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator

SLIDE 13

Skeleton Model

SLIDE 14

Skeleton Model

Input

SLIDE 15

Skeleton Model

Input Region Proposals

RPN

SLIDE 16

Skeleton Model

Object Features Relationship Features

ROI Pooling ROI Pooling

Input Region Proposals

RPN

SLIDE 17

Skeleton Model

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN

SLIDE 18

Skeleton Model

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN

Cat Cat TV

Watch Watch Left of Right of Dog Person In In Hold Cup Book On

SLIDE 19

Iterative Message Passing (IMP)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating

Cat Cat TV

Watch Watch Left of Right of Dog Person In In Hold Cup Book On

Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Feature Updating

Message Passing

SLIDE 20

Multi-level Scene Description Network (MSDN)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating

Cat Cat TV

Watch Watch Left of Right of Dog Person In In Hold Cup Book On

Feature Updating

Message Passing

Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017

Region Captions

SLIDE 21

Neural Motif Network

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating

Cat Cat TV

Watch Watch Left of Right of Dog Person In In Hold Cup Book On

Score Updating

Frequency Prior

Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018

SLIDE 22

Graph R-CNN (Our work)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating

Cat Cat TV

Watch Watch Left of Right of Dog Person In In Hold Cup Book On

Score Updating

Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018

Message Passing Message Passing

Feature Updating Score Updating

SLIDE 23

Graph R-CNN (Our work)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating

Cat Cat TV

Watch Watch Left of Right of Dog Person In In Hold Cup Book On

Score Updating

Message Passing Message Passing

Feature Updating Score Updating Relation Proposal Network (RePN)

Jianwei Yang*, Jiasen Lu*, Stefan Lee, Dhruv Batra, Devi Parikh. Graph R-CNN for Scene Graph Generation. ECCV 2018.

SLIDE 24

Motivations

[Figure: four example images, (a) to (d), with labeled objects (sweater, boy, fire hydrant, car, wheel, building) and relationships (wear, behind, near, on, next to).]

SLIDE 25

Motivations

[Figure: four example images, (a) to (d), with labeled objects (sweater, boy, fire hydrant, car, wheel, building) and relationships (wear, behind, near, on, next to).]

1. Objects in a scene usually have relationships with others;

SLIDE 26

Motivations

[Figure: four example images, (a) to (d), with labeled objects (sweater, boy, fire hydrant, car, wheel, building) and relationships (wear, behind, near, on, next to).]

1. Objects in a scene usually have relationships with others;
2. Not all object pairs have relationships, so the scene graph is usually sparse;

SLIDE 27

Motivations

[Figure: four example images, (a) to (d), with labeled objects (sweater, boy, fire hydrant, car, wheel, building) and relationships (wear, behind, near, on, next to).]

1. Objects in a scene usually have relationships with others;
2. Not all object pairs have relationships, so the scene graph is usually sparse;
3. The existence of a relationship depends highly on the object categories, and the type of relationship depends highly on the context.

SLIDE 28

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

SLIDE 29

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

SLIDE 30

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

1. Relation proposal network (RePN) to learn to prune the densely connected scene graph;

SLIDE 31

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

1. Relation proposal network (RePN) to learn to prune the densely connected scene graph;
2. Attentional graph convolutional networks (aGCN) to incorporate the contextual information.

SLIDE 32

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

1. Relation proposal network (RePN) to learn to prune the densely connected scene graph;
2. Attentional graph convolutional networks (aGCN) to incorporate the contextual information.
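The pruning idea behind the relation proposal network can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes each object already has a projected "subject" vector and a projected "object" vector (the fc and ReLU layers in the diagram are not modeled), scores every ordered pair with a sigmoid over their dot product, and keeps only the top-scoring pairs. The function names and all numbers are hypothetical.

```python
import math


def relatedness(subject_vec, object_vec):
    """Sigmoid of the dot product between a subject and an object vector."""
    dot = sum(s * o for s, o in zip(subject_vec, object_vec))
    return 1.0 / (1.0 + math.exp(-dot))


def prune_pairs(subject_vecs, object_vecs, top_k):
    """Score every ordered pair (i, j) with i != j and keep the top_k pairs."""
    n = len(subject_vecs)
    scored = [
        (relatedness(subject_vecs[i], object_vecs[j]), i, j)
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(i, j) for _, i, j in scored[:top_k]]


# Toy projected vectors for three objects (made-up numbers).
subj = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
obj = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

kept = prune_pairs(subj, obj, top_k=2)
print(kept)  # [(0, 1), (1, 2)]
```

With three objects the dense graph has six ordered pairs; the sketch keeps two of them, which is the dense-to-sparse step shown in the diagram.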

SLIDE 33

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

$P(S \mid I)$

I: Input image
S: Scene graph
V: Scene graph vertices (objects)
E: Scene graph edges (relationships)
O: Scene graph object labels
R: Scene graph relationship labels

SLIDE 34

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

$P(S \mid I) = \underbrace{P(V \mid I)}_{\text{Region Proposal}} \cdots$

I: Input image
S: Scene graph
V: Scene graph vertices (objects)
E: Scene graph edges (relationships)
O: Scene graph object labels
R: Scene graph relationship labels

SLIDE 35

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

$P(S \mid I) = \underbrace{P(V \mid I)}_{\text{Region Proposal}} \, \underbrace{P(E \mid V, I)}_{\text{Relation Proposal}} \cdots$

I: Input image
S: Scene graph
V: Scene graph vertices (objects)
E: Scene graph edges (relationships)
O: Scene graph object labels
R: Scene graph relationship labels

SLIDE 36

Framework

[Figure: Graph R-CNN framework diagram. Labels: Conv Feature, Relation Proposal Network (RePN) with object/subject score matrix, Dense graph, Sparse graph, Attentional GCNs (aGCN) with 1st, 2nd, and 3rd layers, Attentional graph, Scene Graph. Example graph nodes: bird, head, wings, tails, branch, tree, leaf; edges: has, of, stand on, in, behind.]

$P(S \mid I) = \underbrace{P(V \mid I)}_{\text{Region Proposal}} \, \underbrace{P(E \mid V, I)}_{\text{Relation Proposal}} \, \underbrace{P(R, O \mid V, E, I)}_{\text{Graph Labeling}}$

I: Input image
S: Scene graph
V: Scene graph vertices (objects)
E: Scene graph edges (relationships)
O: Scene graph object labels
R: Scene graph relationship labels

SLIDE 37

Training

$P(V \mid I) \, P(E \mid V, I) \, P(R, O \mid V, E, I) = P(S \mid I)$

SLIDE 38

Training

$P(V \mid I) \, P(E \mid V, I) \, P(R, O \mid V, E, I) = P(S \mid I)$

Region Proposal Network

Binary Cross Entropy Loss

SLIDE 39

Training

$P(V \mid I) \, P(E \mid V, I) \, P(R, O \mid V, E, I) = P(S \mid I)$

Relation Proposal Network

Binary Cross Entropy Loss

Region Proposal Network

Binary Cross Entropy Loss

SLIDE 40

Training

$P(V \mid I) \, P(E \mid V, I) \, P(R, O \mid V, E, I) = P(S \mid I)$

Graph Labeling Network

Two Cross Entropy Losses, one for nodes and one for edges

Region Proposal Network

Binary Cross Entropy Loss

Relation Proposal Network

Binary Cross Entropy Loss
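The four loss terms named on the training slides can be combined as in the sketch below. This is an illustration only: it assumes the terms are averaged per stage and summed with equal weights, which the slides do not state, and the helper names and toy predictions are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(prob, label):
    """BCE for one proposal: label is 1 (keep) or 0 (discard)."""
    return -(label * math.log(prob + EPS) + (1 - label) * math.log(1 - prob + EPS))


def cross_entropy(probs, target_index):
    """Multi-class CE for one node or edge given class probabilities."""
    return -math.log(probs[target_index] + EPS)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def total_loss(rpn, repn, node, edge):
    """Sum of the four stage losses (equal weighting is an assumption).

    rpn, repn: lists of (probability, binary label) pairs.
    node, edge: lists of (class probabilities, target index) pairs.
    """
    return (
        mean(binary_cross_entropy(p, y) for p, y in rpn)      # region proposals
        + mean(binary_cross_entropy(p, y) for p, y in repn)   # relation proposals
        + mean(cross_entropy(p, t) for p, t in node)          # object labels
        + mean(cross_entropy(p, t) for p, t in edge)          # relationship labels
    )


loss = total_loss(
    rpn=[(0.9, 1), (0.2, 0)],
    repn=[(0.7, 1)],
    node=[([0.1, 0.8, 0.1], 1)],
    edge=[([0.6, 0.4], 0)],
)
print(round(loss, 4))
```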

SLIDE 41

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

SLIDE 42

Metrics

Assume there are $N$ objects extracted from an image; then there are $N(N-1)$ edges.

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

SLIDE 43

Metrics

Assume there are $N$ objects extracted from an image; then there are $N(N-1)$ edges.

Step 1: Take the maximum of the object scores and predicate scores, excluding the background class.

[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

SLIDE 44

Metrics

Assume there are $N$ objects extracted from an image; then there are $N(N-1)$ edges.

Step 1: Take the maximum of the object scores and predicate scores, excluding the background class.
Step 2: Compute relationship scores: $\mathrm{Rel}(i, j) = \mathrm{Subj}(i) \cdot \mathrm{Obj}(j) \cdot \mathrm{Pred}(i, j)$

[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

SLIDE 45

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Assume there are $N$ objects extracted from an image; then there are $N(N-1)$ edges.

Step 1: Take the maximum of the object scores and predicate scores, excluding the background class.
Step 2: Compute relationship scores: $\mathrm{Rel}(i, j) = \mathrm{Subj}(i) \cdot \mathrm{Obj}(j) \cdot \mathrm{Pred}(i, j)$
Step 3: Sort the relationship triplets in descending order.

SLIDE 46

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Assume there are $N$ objects extracted from an image; then there are $N(N-1)$ edges.

Step 1: Take the maximum of the object scores and predicate scores, excluding the background class.
Step 2: Compute relationship scores: $\mathrm{Rel}(i, j) = \mathrm{Subj}(i) \cdot \mathrm{Obj}(j) \cdot \mathrm{Pred}(i, j)$
Step 3: Sort the relationship triplets in descending order.
Step 4: Compute the triplet recalls (Recall@50, Recall@100) based on the ground truth: $\mathrm{Recall} = \frac{C(T_{\text{pred}} \text{ and } T_{\text{GT}})}{N(T_{\text{GT}})}$

SGGen: IoU > 0.5

SLIDE 47

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Assume there are $N$ objects extracted from an image; then there are $N(N-1)$ edges.

Step 1: Take the maximum of the object scores and predicate scores, excluding the background class.
Step 2: Compute relationship scores: $\mathrm{Rel}(i, j) = \mathrm{Subj}(i) \cdot \mathrm{Obj}(j) \cdot \mathrm{Pred}(i, j)$
Step 3: Sort the relationship triplets in descending order.
Step 4: Compute the triplet recalls (Recall@50, Recall@100) based on the ground truth: $\mathrm{Recall} = \frac{C(T_{\text{pred}} \text{ and } T_{\text{GT}})}{N(T_{\text{GT}})}$

SGGen: IoU > 0.5
PhrCls: all object locations are known
PredCls: all object locations and labels are known
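The four evaluation steps can be sketched in a few lines. This is a simplified illustration, assuming label-only triplet matching (the IoU > 0.5 box check used for SGGen is left out) and made-up scores; `rank_triplets` and `recall_at_k` are hypothetical helper names.

```python
def rank_triplets(object_scores, predicate_scores):
    """Rank triplets by Rel(i, j) = Subj(i) * Obj(j) * Pred(i, j).

    object_scores: list of (label, score), the max over non-background classes.
    predicate_scores: dict mapping (i, j) to (predicate label, score).
    """
    ranked = []
    for (i, j), (pred_label, pred_score) in predicate_scores.items():
        subj_label, subj_score = object_scores[i]
        obj_label, obj_score = object_scores[j]
        rel_score = subj_score * obj_score * pred_score
        ranked.append((rel_score, (subj_label, pred_label, obj_label)))
    ranked.sort(key=lambda item: item[0], reverse=True)  # descending order
    return [triplet for _, triplet in ranked]


def recall_at_k(ranked_triplets, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    hits = set(ranked_triplets[:k]) & set(ground_truth)
    return len(hits) / len(ground_truth)


# Toy detections (made-up scores).
objects = [("man", 0.9), ("horse", 0.8), ("hat", 0.5)]
predicates = {
    (0, 1): ("feeding", 0.7),
    (0, 2): ("wearing", 0.6),
    (1, 2): ("near", 0.1),
}
ground_truth = [("man", "feeding", "horse"), ("hat", "on", "man")]

ranked = rank_triplets(objects, predicates)
print(ranked[0])                             # ('man', 'feeding', 'horse')
print(recall_at_k(ranked, ground_truth, 1))  # 0.5
```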

SLIDE 48

Experiments

Table. Implementation Details.

Dataset: Visual Genome [1] (Train: 75,651; Test: 32,422)
Backbone: VGG-16 Faster R-CNN [2]
#objects: 150
#predicates: 50
Metrics: PredCls, SGCls, SGGen, SGGen+, mAP

[1] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Krishna et al.
[2] A Faster Implementation of Faster R-CNN. Yang and Lu et al.

SLIDE 49

Comparing with Previous Work

Recall@100

Method        PredCls  PhrCls  SGGen  SGGen+
IMP [1]       45.2     22.4    8.0    27.7
MSDN [2]      57.9     29.9    9.1    28.2
NM-Freq [3]   48.8     27.2    9.1    27.8
Ours          59.1     31.6    13.7   35.9

Our model improves SGGen by over four points and SGGen+ by two points.

SGGen+ is a newly proposed metric; details are in our paper.

[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017
[2] Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017
[3] Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018

SLIDE 50

Qualitative Results

surfboard

f

has near has bear ear leg flower head

n

has near bear ear leg flower

f

has in has bird head tree wing behind behind branch

n

behind leaf

n

in ride has man water wave arm short

n

SLIDE 51

Ablation Study

[Bar chart: Recall@50 on PredCls, PhrCls, SGGen, and SGGen+, plus mAP, for Base, Base+RePN, Base+RePN+GCN, and Base+RePN+aGCN.]

SLIDE 52

Ablation Study

[Bar chart: Recall@50 on PredCls, PhrCls, SGGen, and SGGen+, plus mAP, for Base, Base+RePN, Base+RePN+GCN, and Base+RePN+aGCN.]

RePN improves SGGen, SGGen+ and mAP

SLIDE 53

Ablation Study

[Bar chart: Recall@50 on PredCls, PhrCls, SGGen, and SGGen+, plus mAP, for Base, Base+RePN, Base+RePN+GCN, and Base+RePN+aGCN.]

RePN improves SGGen, SGGen+ and mAP
GCN/aGCN improves PredCls and PhrCls

SLIDE 54

Ablation Study

[Bar chart: Recall@50 on PredCls, PhrCls, SGGen, and SGGen+, plus mAP, for Base, Base+RePN, Base+RePN+GCN, and Base+RePN+aGCN.]

RePN improves SGGen, SGGen+ and mAP
GCN/aGCN improves PredCls and PhrCls

SLIDE 55

A new codebase for scene graph generation

https://github.com/jwyang/graph-rcnn.pytorch

The goal of gathering all these representative methods into a single repo is to establish a fairer comparison across methods under the same settings. Contributions are welcome!

SLIDE 56

Summary

Takeaways:
Introduced a general base model for scene graph generation
Pruning the fully connected graph is important for scene graph generation
Exploiting the context across objects and predicates is crucial
Scene graph generation helps to improve object detection
Challenges:
The dataset is noisy (incomplete and inconsistent annotations)
Relationships need more fine-grained categorization (spatial, semantic, etc.)
Rare/novel relationships are hard to detect

SLIDE 57

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator

SLIDE 58

Visual Question Answering

Visual Question Answering is a challenging task that involves full visual understanding, language understanding, and reasoning.

SLIDE 59

Methodology: Graph Reasoning Machine

Visual Understanding: Scene Graph Generator (extracts objects, attributes, and relationships between objects)
Language Understanding: Program Generator (extracts the logical reasoning chain in the question)
Reasoning: Learnable Neural-Symbolic Executor (executes programs on the scene graph)
Pros:
Makes the VQA model more interpretable;
Makes it easy to diagnose and analyze the model predictions;
Modules are disentangled from each other;
Introduces few or no language priors, with (probably) better generalization ability.

*Joint work with Chuang Gan et al.

SLIDE 60

Compositional Reasoning VQA Dataset

[1] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Hudson et al. CVPR 2019

SLIDE 61

SLIDE 62

SLIDE 63

Q: The shorts have what color?
P: Filter(shorts)->Query(color)

SLIDE 64

Q: The shorts have what color?
P: Filter(shorts)->Query(color)
SG: shorts: 0.54, gray: 0.47, brown: 0.19

SLIDE 65

Q: The shorts have what color?
P: Filter(shorts)->Query(color)
SG: shorts: 0.54, gray: 0.47, brown: 0.19
A: gray

SLIDE 66

Q: What color is the frisbee?
P: Filter(frisbee)->Query(color)
SG: frisbee: 0.85, yellow: 0.54
A: yellow

SLIDE 67

Q: Who wears shorts?
P: Filter(shorts)->Relate_Subject(wears)
SG: man: 0.78
SG: shorts: 0.54, gray: 0.47, brown: 0.19, wear: 0.45
A: man
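The programs in these examples can be run over a predicted scene graph with a few small functions. The sketch below is a hard symbolic version, not the learnable executor described in the talk; the graph layout, the function names, and the choice to pick the highest-scoring attribute value are all assumptions made for illustration. Scores are copied from the slides.

```python
def filter_objects(graph, name):
    """Filter(name): ids of the nodes labeled `name`."""
    return [oid for oid, node in graph["nodes"].items() if node["label"] == name]


def query(graph, object_ids, attribute):
    """Query(attribute): highest-scoring value of the attribute on the first match."""
    candidates = graph["nodes"][object_ids[0]]["attributes"][attribute]
    return max(candidates, key=candidates.get)


def relate_subject(graph, object_ids, predicate):
    """Relate_Subject(predicate): subjects with a `predicate` edge into the objects."""
    return [s for (s, p, o) in graph["edges"] if p == predicate and o in object_ids]


graph = {
    "nodes": {
        0: {"label": "man", "score": 0.78, "attributes": {}},
        1: {
            "label": "shorts",
            "score": 0.54,
            "attributes": {"color": {"gray": 0.47, "brown": 0.19}},
        },
    },
    "edges": [(0, "wears", 1)],
}

# Filter(shorts)->Query(color)
color = query(graph, filter_objects(graph, "shorts"), "color")

# Filter(shorts)->Relate_Subject(wears)
wearer_ids = relate_subject(graph, filter_objects(graph, "shorts"), "wears")
wearer = graph["nodes"][wearer_ids[0]]["label"]

print(color, wearer)  # gray man
```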

SLIDE 68

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator

SLIDE 69

The Open-World Recognition Problem

C

SLIDE 70

C

The Open-World Recognition Problem

SLIDE 71

C

The Open-World Recognition Problem

Human (Oracle)

What is the black object on top of the table on the left side?
That’s a coffee bottle.

SLIDE 72

C

The Open-World Recognition Problem

Human (Oracle)

What is the black object on top of the table on the left side?
That’s a coffee bottle.

How can we train an agent to ask questions about its unknowns, based on its knowns, to improve its visual understanding capabilities?

SLIDE 73

Visual Question Generation: Visual Curiosity

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang and Lu et al. CoRL 2018

SLIDE 74

Oracle/ Human

Image

Same Content

Agent

Visual System Graph Memory Question Generator

Image

Agent Architecture

C

Slide Credit: Stefan Lee

Answer Digestor

Question Answer

SLIDE 75

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Bottom-up Update

Visual Graph Graph Memory

SLIDE 76

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Bottom-up Update

Visual Memory Visual Graph Graph Memory Bottom-Up Update

SLIDE 77

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Top-down Update

Visual Memory Memory Graph

SLIDE 78

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Top-down Update

Visual Memory Graph Memory Top-Down Update

SLIDE 79

Question Generator

[Figure: scene with objects numbered 1 to 6 and their graph memory nodes.]

Graph Memory (one entry per object):
Color: UNK, Size: UNK, Shape: Cube, Mat: UNK
Color: Red, Size: Large, Shape: UNK, Mat: UNK
Color: UNK, Size: UNK, Shape: UNK, Mat: UNK
Color: Pink, Size: Small, Shape: Cube, Mat: UNK
Color: UNK, Size: UNK, Shape: Cube, Mat: Metal
Color: UNK, Size: Small, Shape: UNK, Mat: UNK

Target  Attribute  Reference  Question
1       Shape      None       What is the shape of the frontmost large red object?
3       Color      2          What is the color of the metal cube on the left side of a small object?
5       Material   3          What is the material of the object at the left side of the metal cube?

SLIDE 80

Training Objective: Visual System

Update the visual system after a dialog with the Oracle/Human.

Loss: cross-entropy loss between the graph memory and the visual predictions over all images, objects, and attributes

[Equation: $\theta^* = \arg\min_{\theta} -\sum \sum \log(\cdot)$, summed over images and objects, with the visual attribute predictions inside the log and the graph memory as the target.]

SLIDE 81

Training Objective: Questioner Policy

Use A2C and update the policy after each episode based on all rounds.

Optimal Policy: $\theta_\pi^* = \arg\max_{\theta_\pi} \mathbb{E}\left[\sum_{i} \sum_{t} r_i^t\right]$, with questions sampled as $q_i^t \sim \pi(g_i^t; \theta_\pi)$

Reward: $r_i^t = S(G_i^t, G_i^*) - S(G_i^{t-1}, G_i^*)$, where $G_i^*$ is the Oracle Graph

The reward can be improved by:

Asking unambiguous, informative questions (top-down)
Improving the visual system quickly (bottom-up)
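The reward on this slide is the change in similarity between the agent's graph memory and the oracle graph from one round to the next. The sketch below illustrates that difference; the similarity function S is not defined on the slide, so the fraction of correctly filled attribute values is used here as a stand-in assumption, and the function names are hypothetical.

```python
def graph_similarity(memory, oracle):
    """Assumed S: fraction of oracle attribute values the memory has right."""
    total = 0
    correct = 0
    for obj_id, attrs in oracle.items():
        for name, value in attrs.items():
            total += 1
            if memory.get(obj_id, {}).get(name) == value:
                correct += 1
    return correct / total


def round_reward(memory_before, memory_after, oracle):
    """Reward for one round: S(after, oracle) - S(before, oracle)."""
    return graph_similarity(memory_after, oracle) - graph_similarity(memory_before, oracle)


oracle = {1: {"color": "red", "shape": "cube"}}
before = {1: {"color": "red", "shape": "UNK"}}   # shape still unknown
after = {1: {"color": "red", "shape": "cube"}}   # shape learned from the answer

print(round_reward(before, after, oracle))  # 0.5
```

A question whose answer fills in nothing new leaves the memory unchanged and earns zero reward, which is why informative questions are favored.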

SLIDE 82

Synthesized Dataset

Different shapes, colors, materials, and sizes. Extended from the CLEVR dataset [1]

[1] CLEVR. Johnson et al.

Realistic Dataset

Various real indoor scenes. Annotated based on the ARID dataset [2]

[2] Recognizing Objects In-the-Wild. Loghmani et al.

Experiments: Environments

SLIDE 83

Experiment: Standard Training + Standard Testing

Graph Recovery

Method             R@10   R@20   R@50
Random             28.3   36.5   59.4
Entropy            29.5   39.1   65.5
Entropy + Context  38     52.5   67.1
Ours               42.1   59.1   89.3
Ours w/o v         25.8   50.6   84.1

Consistent improvements over heuristic baselines, especially over longer dialogs

SLIDE 84

Experiments: Novel Object Environments

Novel: new colors and shapes; 600 images for test (12 episodes)
Mixed: mix of novel and standard colors and shapes; 600 images for test (12 episodes)
Realistic: 51 categories, 11 colors, 6 materials; 1200 images for test (24 episodes)

SLIDE 85

Experiments: Standard Train – New Test Environments

Graph Recovery

Setting        R@10   R@20   R@50
Std-Std        42.1   59.1   89.3
Std-Novel      43.3   58.4   88.9
Std-Mixed      42.9   60.1   90.3
Std-Realistic  35.6   53.4   86.2

Practically no loss of performance in synthetic settings and small reductions for realistic (many more categories)

SLIDE 86

Experiments: Visual System Performance

[Figure panels: Standard, Mixed, Novel, Realistic.]

SLIDE 87

cereal

Experiments: Qualitative Example

food potato brown paper cereal yellow plastic ball

What is the closest thing that is in front of the yellow plastic ball? paper What material is the leftmost thing? food There is a leftmost object; what is it? potato The leftmost object is what color? What is the closest thing that is in front of the yellow plastic ball made of?

…

brown

SLIDE 88

Summary for this part

Takeaways:
Scene graphs can be used as a comprehensive semantic abstraction of images
Scene graphs provide grounding information for language-based interaction with humans, especially visual question answering and generation
Scene graphs offer a chance to make models more interpretable and explainable
Potential Directions:
Leverage scene graphs for explicit and effective reasoning on more vision-language tasks, such as expression coreference
Language-context-dependent scene graph generation
Combine scene graphs and knowledge graphs for common sense reasoning

SLIDE 89

Summary for all

Leveraging external “knowledge” when interpreting images
Specifically, using richer vision, language models/data to improve vision+language models
Representing internal structure in images
Specifically, scene graphs: generating, evaluating and using them for vision+language

SLIDE 90