Image as a single label (king crab, Image Source: ImageNet) - PowerPoint PPT Presentation



slide-1
SLIDE 1

Image as a single label

“king crab”

Image Source: ImageNet

slide-2
SLIDE 2

Image as an object set

Objects: king crab, man, person, box, coat, girl, woman, woman

Image Source: ImageNet

slide-3
SLIDE 3

Image as a scene graph

[Scene graph figure: nodes king crab, man, woman, woman, girl, box, coat; edges hold, embrace, look at, wear]

Image Source: ImageNet

Relationships: “Man hold king crab”, “Man embrace woman”, “Woman look at box”, “Woman wear coat”

slide-4
SLIDE 4

Image as a scene graph

[Scene graph figure: same nodes and edges as the previous slide]

Image Source: ImageNet

Relationships: “Man hold king crab”, “Man embrace woman”, “Woman look at box”, “Woman wear coat”

Attributes: “Red king crab”, “Blue coat”, “Transparent box”, “Smiling woman”, “Smiling man”

slide-5
SLIDE 5

Why do we need scene graphs?

[Two images, each containing a man, a hat, and a horse; they differ only in the predicate: “walking with” vs. “feeding”]

Distinguish images more accurately

[1] Image Retrieval using Scene Graphs. Johnson et al. CVPR 2015


Right: https://www.videoblocks.com/video/the-man-in-hat-feed-a-brown-horse-with-flowers-on-the-meadow-supmox_3xj0tvkb67 Left: https://cals.ncsu.edu/wp-content/uploads/2016/08/horse-1500x931.png

slide-6
SLIDE 6

Why do we need scene graphs?

Describe images with better grounding: “a man is walking with a horse”; “the man is feeding a horse”

[1]. Auto-Encoding Scene Graphs for Image Captioning. Yang et al. arXiv 2018 [2]. Exploring Visual Relationship for Image Captioning. Yao et al. ECCV 2018


slide-7
SLIDE 7

Why do we need scene graphs?

Q: What is the man walking with? A: A horse

Answer questions more precisely

[1] Graph-Structured Representations for Visual Question Answering. Teney et al. CVPR 2017

Q: Is the man feeding a horse? A: Yes

[2] Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. Yi et al. Neurips 2018


slide-8
SLIDE 8

Why do we need scene graphs?

Q: What animal is the man walking with?

Generate better-grounded questions

[1] Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang et al. CoRL 2018 [2] Information Maximizing Visual Question Generation. Krishna et al. CVPR 2019

Q: What is the man doing with the horse?


slide-9
SLIDE 9

Human

Communication

Visual System

Scene Graph generator

slide-10
SLIDE 10

Human

Answer Questions

Visual Question Answering

Visual System

Scene Graph generator

slide-11
SLIDE 11

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator


slide-13
SLIDE 13

Skeleton Model

slide-14
SLIDE 14

Skeleton Model

Input

slide-15
SLIDE 15

Skeleton Model

Input Region Proposals

RPN

slide-16
SLIDE 16

Skeleton Model

Object Features Relationship Features

ROI Pooling ROI Pooling

Input Region Proposals

RPN

slide-17
SLIDE 17

Skeleton Model

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN

slide-18
SLIDE 18

Skeleton Model

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN

[Example output scene graph: nodes cat, cat, TV, dog, person, cup, book; predicates watch, left of, right of, in, hold, on]
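A hypothetical sketch of this skeleton pipeline; the box format, helper names, and union-box pairing are illustrative assumptions, not the actual implementation:

```python
from itertools import permutations

def union_box(a, b):
    """Tightest box (x1, y1, x2, y2) covering both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def skeleton(proposals, obj_scorer, rel_scorer):
    """Score every proposal as an object and every ordered (subject, object)
    pair as a candidate relationship, pooled from the pair's union box."""
    obj_scores = {i: obj_scorer(box) for i, box in enumerate(proposals)}
    rel_scores = {(i, j): rel_scorer(union_box(proposals[i], proposals[j]))
                  for i, j in permutations(range(len(proposals)), 2)}
    return obj_scores, rel_scores
```

With N proposals this scores all N × (N − 1) ordered pairs, which is exactly the dense graph the later slides set out to prune.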

slide-19
SLIDE 19

Iterative Message Passing (IMP)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating


Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Feature Updating

Message Passing
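One round of this feature updating can be sketched as follows; scalar features and a fixed mixing weight `alpha` are simplifying assumptions, not the paper's GRU-based update:

```python
def message_passing_step(node_feats, edge_feats, edges, alpha=0.5):
    """One IMP-style refinement round: nodes pool messages from incident
    edges, edges pool from their two endpoint nodes."""
    new_nodes = list(node_feats)
    for i in range(len(node_feats)):
        msgs = [edge_feats[k] for k, (s, o) in enumerate(edges) if i in (s, o)]
        if msgs:
            new_nodes[i] = (1 - alpha) * node_feats[i] + alpha * sum(msgs) / len(msgs)
    new_edges = [(1 - alpha) * e + alpha * 0.5 * (node_feats[s] + node_feats[o])
                 for e, (s, o) in zip(edge_feats, edges)]
    return new_nodes, new_edges
```

Iterating this step lets object and relationship features condition on each other before classification.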

slide-20
SLIDE 20

Multi-level Scene Description Network (MSDN)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating


Feature Updating

Message Passing

Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017

Region Captions

slide-21
SLIDE 21

Neural Motif Network

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating


Score Updating

Frequency Prior

Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018

slide-22
SLIDE 22

Graph R-CNN (Our work)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating


Score Updating


Message Passing Message Passing

Feature Updating Score Updating

slide-23
SLIDE 23

Graph R-CNN (Our work)

Object Features Relationship Features Object Scores Relationship Scores

ROI Pooling ROI Pooling

Input Region Proposals

RPN Feature Updating


Score Updating

Message Passing Message Passing

Feature Updating Score Updating Relation Proposal Network (RePN)

Jianwei Yang*, Jiasen Lu*, Stefan Lee, Dhruv Batra, Devi Parikh. Graph R-CNN for Scene Graph Generation. ECCV 2018.

slide-24
SLIDE 24

[Figure (a)-(d): example object pairs (sweater, boy, fire hydrant, car, wheel, building, car) with predicates wear, behind, near, next to]

Motivations

slide-25
SLIDE 25


Motivations

1. Objects in a scene usually have relationships with other objects;

slide-26
SLIDE 26


Motivations

1. Objects in a scene usually have relationships with other objects; 2. Not all object pairs have a relationship, so the scene graph is usually sparse;

slide-27
SLIDE 27


Motivations

1. Objects in a scene usually have relationships with other objects; 2. Not all object pairs have a relationship, so the scene graph is usually sparse; 3. The existence of a relationship depends strongly on the object categories, and the type of relationship depends strongly on the context.
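As a toy illustration of motivation 3, a RePN-style pruning step can score pair relatedness from the objects' class distributions; the bilinear weight `W` and the helper names are simplified assumptions, not the paper's implementation:

```python
import math

def relatedness(p_subj, p_obj, W):
    """Relatedness of a (subject, object) pair from their class
    distributions; W[a][b] is a class-pair weight (learned in practice)."""
    s = sum(p_subj[a] * W[a][b] * p_obj[b]
            for a in range(len(p_subj)) for b in range(len(p_obj)))
    return 1.0 / (1.0 + math.exp(-s))  # squash to (0, 1)

def prune_pairs(scored_pairs, keep):
    """Sparsify the dense graph: keep only the top-`keep` scoring pairs."""
    return sorted(scored_pairs, key=lambda kv: kv[1], reverse=True)[:keep]
```

With weights that favor plausible category pairs (boy and sweater, say), likely relationships survive pruning and unlikely ones are dropped.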

slide-28
SLIDE 28

Framework

[Framework figure: conv features -> region proposals -> dense graph -> Relation Proposal Network (RePN) computes a subject-object score matrix and prunes the dense graph to a sparse graph -> attentional GCN (aGCN) layers (fc + attention over source/target nodes, e.g. weights 0.2, 0.3, 0.05) refine the graph -> scene graph. Example output: bird has head, wings, tail; bird stands on branch; branch in tree; tree behind leaf]

slide-29
SLIDE 29


slide-30
SLIDE 30


  • 1. A relation proposal network (RePN) learns to prune the densely connected scene graph;

slide-31
SLIDE 31


  • 1. A relation proposal network (RePN) learns to prune the densely connected scene graph;
  • 2. Attentional graph convolutional networks (aGCN) incorporate contextual information.
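A minimal sketch of an attentional graph-convolution layer on scalar node features; the attention scorer and the residual-style update are illustrative assumptions, not the exact aGCN update:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    t = sum(es)
    return [e / t for e in es]

def agcn_layer(feats, neighbors, score):
    """One attentional GCN layer: each node aggregates neighbor features
    weighted by learned attention instead of fixed 1/degree weights."""
    out = []
    for i, f in enumerate(feats):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            out.append(f)
            continue
        att = softmax([score(feats[i], feats[j]) for j in nbrs])
        out.append(f + sum(a * feats[j] for a, j in zip(att, nbrs)))
    return out
```

Setting `score` to a constant recovers a plain GCN with uniform neighbor weights; a learned scorer lets each node attend more to its informative neighbors.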


slide-32
SLIDE 32


slide-33
SLIDE 33


P(S | I)

I: input image; S: scene graph; V: scene graph vertices (objects); E: scene graph edges (relationships); O: object labels; R: relationship labels

slide-34
SLIDE 34


P(S | I) = P(V | I) ...


Region Proposal

slide-35
SLIDE 35


P(S | I) = P(V | I) · P(E | V, I) ...

Relation Proposal


Region Proposal

slide-36
SLIDE 36


P(S | I) = P(V | I) · P(E | V, I) · P(R, O | V, E, I)

Graph Labeling


Relation Proposal Region Proposal

slide-37
SLIDE 37

Training

P(V | I) · P(E | V, I) · P(R, O | V, E, I) = P(S | I)

slide-38
SLIDE 38

Training

P(V | I) · P(E | V, I) · P(R, O | V, E, I) = P(S | I)

Region Proposal Network

Binary Cross Entropy Loss

slide-39
SLIDE 39

Training

P(V | I) · P(E | V, I) · P(R, O | V, E, I) = P(S | I)

Relation Proposal Network

Binary Cross Entropy Loss

Region Proposal Network

Binary Cross Entropy Loss

slide-40
SLIDE 40

Training

P(V | I) · P(E | V, I) · P(R, O | V, E, I) = P(S | I)

Graph Labeling Network

Two cross-entropy losses, one for nodes and one for edges

Region Proposal Network

Binary Cross Entropy Loss

Relation Proposal Network

Binary Cross Entropy Loss
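Numerically, the training terms above combine into one objective; this is a toy scalar version (the helper names and the plain unweighted summation are simplifying assumptions):

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1), label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def ce(probs, label):
    """Multi-class cross-entropy for one softmax distribution."""
    return -math.log(probs[label])

def total_loss(rpn_terms, repn_terms, node_terms, edge_terms):
    """BCE over RPN and RePN proposals plus CE over node (object)
    and edge (predicate) labels, summed."""
    return (sum(bce(p, y) for p, y in rpn_terms)
            + sum(bce(p, y) for p, y in repn_terms)
            + sum(ce(pr, l) for pr, l in node_terms)
            + sum(ce(pr, l) for pr, l in edge_terms))
```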

slide-41
SLIDE 41

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

slide-42
SLIDE 42

Metrics

Assume O objects are extracted from an image; then there are O × (O − 1) candidate edges.

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

slide-43
SLIDE 43

Metrics

Step 1: Take the maximum of the object scores and the predicate scores, excluding the background class.

[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

slide-44
SLIDE 44

Metrics

Step 1: Take the maximum of the object scores and the predicate scores, excluding the background class. Step 2: Compute relationship scores: Rel(i, j) = Subj(i) × Obj(j) × Pred(i, j)

[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

slide-45
SLIDE 45

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Step 1: Take the maximum of the object scores and the predicate scores, excluding the background class. Step 2: Compute relationship scores: Rel(i, j) = Subj(i) × Obj(j) × Pred(i, j). Step 3: Sort the relationship triplets in descending order.


slide-46
SLIDE 46

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017

Step 1: Take the maximum of the object scores and the predicate scores, excluding the background class. Step 2: Compute relationship scores: Rel(i, j) = Subj(i) × Obj(j) × Pred(i, j). Step 3: Sort the relationship triplets in descending order. Step 4: Compute the triplet recalls (Recall@50, Recall@100) against the ground truth.


Recall@K = |T_pred ∩ T_GT| / |T_GT|

SGGen: predicted boxes must match the ground truth with IoU > 0.5
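Steps 3 and 4 amount to the following computation (label-only matching here; the full SGGen metric additionally requires IoU > 0.5 on both boxes):

```python
def triplet_recall(pred_triplets, gt_triplets, k):
    """Recall@K: fraction of ground-truth (subject, predicate, object)
    triplets found among the top-K score-sorted predictions."""
    topk = set(pred_triplets[:k])  # assumes predictions are pre-sorted by score
    hits = sum(1 for t in gt_triplets if t in topk)
    return hits / len(gt_triplets)
```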

slide-47
SLIDE 47

Metrics

[1]. Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017



SGGen: predicted boxes must match the ground truth with IoU > 0.5; PhrCls: all object locations are known; PredCls: all object locations and labels are known

slide-48
SLIDE 48

Experiments

Table: Implementation Details.
Dataset: Visual Genome [1] (train: 75,651; test: 32,422)
Backbone: VGG-16 Faster R-CNN [2]
#objects: 150; #predicates: 50
Metrics: PredCls, SGCls, SGGen, SGGen+, mAP

[1] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Krishna et al.
[2] A Faster Implementation of Faster R-CNN. Yang and Lu et al.

slide-49
SLIDE 49

Comparing with Previous Work

[Bar chart, Recall@100 on PredCls, PhrCls, SGGen, and SGGen+ for IMP [1], MSDN [2], NM-Freq [3], and Ours; values: 45.2 22.4 8.0 27.7 / 57.9 29.9 9.1 28.2 / 48.8 27.2 9.1 27.8 / 13.7 35.9 31.6 59.1]

[1] Scene Graph Generation by Iterative Message Passing. Xu et al. CVPR 2017
[2] Scene Graph Generation from Objects, Phrases and Region Captions. Li et al. ICCV 2017
[3] Neural Motifs: Scene Graph Parsing with Global Context. Zellers et al. CVPR 2018

Our model improves SGGen by over four points and SGGen+ (our proposed new metric; details in the paper) by two points.

slide-50
SLIDE 50

Qualitative Results

[Qualitative scene-graph outputs, e.g.: bear has ear/leg, near flower, near head; bird has head/wing, in tree, behind branch/leaf; man in water, ride wave, has arm; surfboard; short]

slide-51
SLIDE 51

Ablation Study

[Bar chart, Recall@50 on PredCls, PhrCls, SGGen, SGGen+, and mAP for Base, Base+RePN, Base+RePN+GCN, and Base+RePN+aGCN]

slide-52
SLIDE 52

Ablation Study


RePN improves SGGen, SGGen+ and mAP

slide-53
SLIDE 53

Ablation Study


RePN improves SGGen, SGGen+, and mAP; GCN/aGCN improves PredCls and PhrCls.

slide-54
SLIDE 54


slide-55
SLIDE 55

A new codebase for scene graph generation

https://github.com/jwyang/graph-rcnn.pytorch The goal of gathering these representative methods into a single repo is to establish a fairer comparison across methods under the same settings. Contributions are welcome!

slide-56
SLIDE 56

Summary

  • Takeaways:
  • Introducing a general base model for scene graph generation
  • Pruning the fully-connected graph is important for scene graph generation
  • Exploiting the context across objects and predicates is crucial
  • Scene graph generation helps to improve object detection
  • Challenges:
  • The dataset is noisy (incomplete and inconsistent annotations)
  • Relationships need more fine-grained categorization (spatial, semantic, etc.)
  • Rare/novel relationships are hard to detect
slide-57
SLIDE 57

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator

slide-58
SLIDE 58

Visual Question Answering

Visual Question Answering is a challenging task that involves full visual understanding, language understanding, and reasoning.

slide-59
SLIDE 59

Methodology: Graph Reasoning Machine

  • Visual Understanding: Scene Graph Generator (extracts objects, attributes, and relationships between objects)
  • Language Understanding: Program Generator (extracts the logical reasoning chain from the question)
  • Reasoning: Learnable Neural-Symbolic Executor (executes programs on the scene graph)
  • Pros:
  • Makes the VQA model more interpretable;
  • Easy to diagnose and analyze the model predictions;
  • Modules are disentangled from each other;
  • Introduces few or no language priors, so (probably) better generalization ability;

*Joint work with Chuang Gan et al

slide-60
SLIDE 60

Compositional Reasoning VQA Dataset

[1] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Hudson et al. CVPR 2019

slide-61
SLIDE 61
slide-62
SLIDE 62
slide-63
SLIDE 63

Q: The shorts have what color? P: Filter(shorts)->Query(color)

slide-64
SLIDE 64

Q: The shorts have what color? P: Filter(shorts)->Query(color) SG: shorts: 0.54 gray: 0.47 brown: 0.19

slide-65
SLIDE 65

Q: The shorts have what color? P: Filter(shorts)->Query(color) SG: shorts: 0.54 gray: 0.47 brown: 0.19 A: gray
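The Filter->Query chain on this slide can be executed mechanically over a scene graph; the dictionary layout below is an illustrative assumption about how nodes, scores, and attribute distributions might be stored:

```python
def execute(program, scene_graph):
    """Run a tiny Filter -> Query program: pick the best-scoring node whose
    name matches the filter, then return its best-scoring attribute value."""
    assert program[0][0] == "Filter" and program[1][0] == "Query"
    name, attr = program[0][1], program[1][1]
    node = max((n for n in scene_graph if n["name"] == name),
               key=lambda n: n["score"])
    value, _ = max(node[attr].items(), key=lambda kv: kv[1])
    return value
```

On this slide's example, Filter(shorts) selects the "shorts" node (0.54) and Query(color) returns "gray" (0.47 > 0.19).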

slide-66
SLIDE 66

SG: frisbee: 0.85 yellow: 0.54 Q: What color is the frisbee? P: Filter(frisbee)->Query(color) A: yellow

slide-67
SLIDE 67

Q: Who wears shorts? P: Filter(shorts)->Relate_Subject(wears) SG: shorts: 0.54, gray: 0.47, brown: 0.19; wear: 0.45; man: 0.78 A: man

slide-68
SLIDE 68

Human

Ask Questions Answer Questions

Visual Question Answering Visual Question Generation

Visual System

Scene Graph generator

slide-69
SLIDE 69

The Open-World Recognition Problem


slide-70
SLIDE 70


The Open-World Recognition Problem

slide-71
SLIDE 71


The Open-World Recognition Problem

Human (Oracle)

What is the black object on top of the table on the left side? That’s a coffee bottle.

slide-72
SLIDE 72


The Open-World Recognition Problem

Human (Oracle)

What is the black object on top of the table on the left side? That’s a coffee bottle.

How do we train an agent to ask questions about its unknowns, based on its knowns, to improve its visual understanding capabilities?

slide-73
SLIDE 73

Visual Question Generation: Visual Curiosity

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. Yang and Lu et al. CoRL 2018

slide-74
SLIDE 74

Oracle/ Human

Image

Same Content

Agent

Visual System Graph Memory Question Generator

Image

Agent Architecture


Slide Credit: Stefan Lee

Answer Digestor

Question Answer

slide-75
SLIDE 75

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Bottom-up Update

Visual Graph Graph Memory

slide-76
SLIDE 76

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Bottom-up Update

Visual Memory Visual Graph Graph Memory Bottom-Up Update

slide-77
SLIDE 77

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Top-down Update

Visual Memory Memory Graph

slide-78
SLIDE 78

Oracle/ Human

Image

Same Content

Agent

Visual System Answer Digestor Question Generator

Image

Top-down Update

Visual Memory Graph Memory Top-Down Update

slide-79
SLIDE 79

Question Generator

[Figure: graph memory over six objects, each with Color/Size/Shape/Material slots, many still UNK. Example generated questions:
Target 1, attribute Shape, reference none: “What is the shape of the front-most large red object?”
Target 3, attribute Color, reference 2: “What is the color of the metal cube on the left side of a small object?”
Target 5, attribute Material, reference 3: “What is the material of the object at the left side of the metal cube?”]

slide-80
SLIDE 80

Update the visual system after a round of dialog with the Oracle/Human.

Training Objective: Visual System

Loss: cross-entropy between the graph memory and the visual predictions, over all images, objects, and attributes:

θ_v* = arg min_θ − Σ_{I ∈ D} Σ_j m_j log p_j(I; θ_v)

(p: visual attribute predictions; m: graph-memory targets)

slide-81
SLIDE 81

Training Objective: Questioner Policy

Use A2C and update the policy after each episode, based on all rounds.

Optimal policy: θ_q* = arg max E_I E_π [ Σ_t s_t ], with questions sampled from the policy π(h_t; θ_q)

Reward: s_t = S(G_t, G*) − S(G_{t−1}, G*), where G* is the oracle graph and S is a graph-similarity measure

The reward can be improved by:
  • Asking unambiguous, informative questions (top-down)
  • Improving the visual system quickly (bottom-up)
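The per-round reward can be made concrete with a simple similarity measure; representing both graphs as (object, attribute) -> value maps is an illustrative assumption, not the paper's exact metric:

```python
def graph_similarity(memory, oracle):
    """Fraction of oracle (object, attribute) -> value entries that the
    agent's graph memory currently gets right."""
    hits = sum(1 for k, v in oracle.items() if memory.get(k) == v)
    return hits / len(oracle)

def round_reward(mem_before, mem_after, oracle):
    """Reward for one question: improvement in similarity to the oracle graph."""
    return graph_similarity(mem_after, oracle) - graph_similarity(mem_before, oracle)
```

A question that fills in a correct, previously unknown attribute earns positive reward; an uninformative question earns zero.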

slide-82
SLIDE 82

Synthesized Dataset

Different shapes, colors, materials, and sizes. Extended from the CLEVR dataset [1]

[1] CLEVR. Johnson et al.

Realistic Dataset

Various real indoor scenes. Annotated based on the ARID dataset [2]

[2] Recognizing Objects In-the-Wild. Loghmani et al.

Experiments: Environments

slide-83
SLIDE 83

[Bar chart, graph recovery R@10/R@20/R@50: Random 28.3/36.5/59.4; Entropy 29.5/39.1/65.5; Entropy + Context 38/52.5/67.1; Ours 42.1/59.1/89.3; Ours w/o v 25.8/50.6/84.1]

Consistent improvements over heuristic baselines, especially over longer dialogs

Experiment: Standard Training + Standard Testing

Graph Recovery

slide-84
SLIDE 84

Novel

New colors and shapes 600 images for test (12 episodes)

Mixed

Mix of novel and standard colors and shapes 600 images for test (12 episodes)

Experiments: Novel Object Environments

Realistic

51 categories, 11 colors, 6 materials 1200 images for test (24 episodes)

slide-85
SLIDE 85

[Bar chart, graph recovery R@10/R@20/R@50: Std-Std 42.1/59.1/89.3; Std-Novel 43.3/58.4/88.9; Std-Mixed 42.9/60.1/90.3; Std-Realistic 35.6/53.4/86.2]

Experiments: Standard Train – New Test Environments

Practically no loss of performance in synthetic settings and small reductions for realistic (many more categories)

Graph Recovery

slide-86
SLIDE 86

Experiments: Visual System Performance

[Chart: visual-system accuracy across the Standard (≈87), Mixed, Novel, and Realistic environments]

slide-87
SLIDE 87

Experiments: Qualitative Example

[Scene objects: food, potato, brown paper, cereal, yellow plastic ball. Dialog:
Q: What is the closest thing that is in front of the yellow plastic ball? A: paper
Q: What material is the leftmost thing? A: food
Q: There is a leftmost object; what is it? A: potato
Q: The leftmost object is what color? A: brown
Q: What is the closest thing that is in front of the yellow plastic ball made of?]

slide-88
SLIDE 88

Summary for this part

  • Takeaways:
  • A scene graph can be used as a comprehensive semantic abstraction of an image
  • Scene graphs provide grounding information for language-based interaction with humans, especially visual question answering and question generation
  • Scene graphs offer a chance to make models more interpretable and explainable
  • Potential Directions:
  • Leverage scene graphs for explicit and effective reasoning on more vision-and-language tasks, such as expression coreference

  • Language-context-dependent scene graph generation
  • Combine scene graphs and knowledge graphs for common-sense reasoning
slide-89
SLIDE 89


Summary for all

Leveraging external “knowledge” when interpreting images; specifically, using richer vision and language models/data to improve vision+language models. Representing internal structure in images; specifically, scene graphs: generating, evaluating, and using them for vision+language tasks.

slide-90
SLIDE 90

Thanks!