Language-Driven Visual Reasoning for Referring Expression Comprehension


SLIDE 1

Guanbin Li

School of Data and Computer Science, Sun Yat-sen University

VALSE 2019-12-18

Language-Driven Visual Reasoning for Referring Expression Comprehension

SLIDE 2

Outline

  • Introduction and Related Work
  • Cross-Modal Relationship Inference for Grounding Referring Expressions, CVPR 2019
  • Dynamic Graph Attention for Referring Expression Comprehension, ICCV 2019
  • Conclusion and Future Work Discussion

SLIDE 3

Introduction

Classic Image Understanding vs. Referring Expression Comprehension

  • 1. The sheep in the middle
  • 2. The fattest sheep
  • 3. The sheep farthest from the grass

(Figure: an image of three sheep, labeled Sheep 1, Sheep 2, and Sheep 3.)

SLIDE 4

Introduction

Requires Relationship Modeling and Reasoning

  • 1. The hat worn by the man bending over and stroking the dog
  • 2. The hat on the guy to the left of the man in the yellow shirt


May also require commonsense knowledge

  • 1. The lady to the right of the waiter
  • 2. The person who ordered the dish served by the waiter


SLIDE 5

Related Work

(Nagaraja et al., ECCV 2016) (Rohrbach et al., ECCV 2016)

SLIDE 6

Related Work

Modular Attention Network (CVPR 2018); Accumulated Co-Attention Method (CVPR 2018)


SLIDE 7

Outline

  • Introduction and Related Work
  • Cross-Modal Relationship Inference for Grounding Referring Expressions, CVPR 2019
  • Dynamic Graph Attention for Referring Expression Comprehension, ICCV 2019
  • Conclusion and Future Work Discussion

SLIDE 8

Cross-Modal Relationship Inference (CVPR 2019)

Motivation:

The extraction and modeling of relationships (both first-order and multi-order) is essential for visual grounding, and graph-based information propagation helps to explicitly capture multi-order relationships.

Our Proposed Method:

  • Language-guided visual relation graph
  • Gated Graph Convolutional Network (GGCN) based feature propagation for semantic context modeling
  • Triplet loss with online hard negative sample mining
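As a rough illustration of the last component, here is a minimal PyTorch sketch of a triplet-style ranking loss with online hard negative mining; the margin value and the max-over-negatives mining rule are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch (PyTorch) of a triplet-style ranking loss with online hard
# negative mining. The margin value and the max-over-negatives mining rule
# are illustrative assumptions, not the paper's exact settings.
import torch

def triplet_loss_hard_negative(scores, gt_index, margin=0.1):
    """scores: (num_objects,) matching scores between the expression and all
    candidate objects; gt_index: index of the ground-truth object."""
    pos = scores[gt_index]
    negatives = torch.cat([scores[:gt_index], scores[gt_index + 1:]])
    hard_negative = negatives.max()   # online hard negative: best-scoring wrong object
    return torch.clamp(margin + hard_negative - pos, min=0.0)

scores = torch.tensor([0.2, 0.7, 0.4])        # scores for 3 candidate objects
loss = triplet_loss_hard_negative(scores, 1)  # ground truth is object 1
```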

SLIDE 9

Language-Guided Visual Relation Graph

Spatial Relation Graph Construction

Objects are the vertices of the graph, directed edges connect pairs of objects, and each edge carries an index label identifying the type of spatial relationship between the two objects.
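To make the construction concrete, here is a minimal sketch of deriving edge labels from bounding-box geometry; the inside/cover checks and the 8-way directional binning are illustrative assumptions, not necessarily the paper's exact relation set.

```python
# Minimal sketch of spatial relation graph construction: vertices are objects,
# and every ordered pair of objects gets a relation index derived from box
# geometry. The inside/cover checks and the 8-way directional binning are
# illustrative assumptions, not necessarily the paper's exact relation set.
import math

def spatial_relation(box_a, box_b):
    """Boxes are (x1, y1, x2, y2); returns an integer relation label."""
    if box_b[0] <= box_a[0] and box_a[2] <= box_b[2] and \
       box_b[1] <= box_a[1] and box_a[3] <= box_b[3]:
        return 0                                   # box_a is inside box_b
    if box_a[0] <= box_b[0] and box_b[2] <= box_a[2] and \
       box_a[1] <= box_b[1] and box_b[3] <= box_a[3]:
        return 1                                   # box_a covers box_b
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    angle = math.atan2(by - ay, bx - ax)           # direction from box_a to box_b
    return 2 + int((angle + math.pi) / (2 * math.pi) * 8) % 8  # 8 directional bins

boxes = [(10, 10, 50, 50), (20, 20, 40, 40), (60, 10, 90, 50)]
edges = [(i, j, spatial_relation(a, b))
         for i, a in enumerate(boxes) for j, b in enumerate(boxes) if i != j]
```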

SLIDE 10

Language-Guided Visual Relation Graph

Language-Guided Visual Relation Graph Construction

  • 1. Given an expression, a bidirectional LSTM extracts a feature for each word.
  • 2. Each word is assigned a type (i.e., entity, relation, absolute location, or unnecessary word).

The global language context is an attention-weighted sum of the word features, and the language context at each vertex is obtained from the normalized attention of every word with respect to that vertex.
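A minimal PyTorch sketch of this language side follows, assuming a simple dot-product attention between projected vertex features and Bi-LSTM word features; all dimensions and the attention form are illustrative.

```python
# Minimal sketch (PyTorch) of the language side: a bidirectional LSTM encodes
# the expression, and each visual vertex gathers a language context as an
# attention-weighted sum of word features. All dimensions and the dot-product
# attention form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_word, d_hid, d_vis, num_objects, seq_len = 1000, 300, 256, 512, 5, 8

embed = nn.Embedding(vocab, d_word)
bilstm = nn.LSTM(d_word, d_hid, bidirectional=True, batch_first=True)
proj = nn.Linear(d_vis, 2 * d_hid)            # map vertex features into word space

words = torch.randint(0, vocab, (1, seq_len))
word_feats, _ = bilstm(embed(words))          # (1, seq_len, 2 * d_hid)

vertex_feats = torch.randn(1, num_objects, d_vis)
attn = torch.bmm(proj(vertex_feats), word_feats.transpose(1, 2))
attn = F.softmax(attn, dim=-1)                # attention of each word w.r.t. each vertex
lang_context = torch.bmm(attn, word_feats)    # (1, num_objects, 2 * d_hid)
```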

SLIDE 11

Language-Guided Visual Relation Graph

Language-Guided Visual Relation Graph Construction

  • 3. The language context is used to weight the vertices and edges of the spatial relation graph, giving the language-guided visual relation graph; fusing each vertex's visual feature with its language context then defines the language-guided multi-modal graph.

SLIDE 12

Language-Guided Visual Relation Graph

The n-th gated graph convolution at a vertex updates its feature by aggregating edge-gated messages from its neighbors, so that after N propagation steps each vertex encodes multi-order relational context.
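A minimal PyTorch sketch of one such propagation step, in which messages from neighbors are modulated by sigmoid gates computed from edge features; the gating form is an illustrative assumption, not the exact CMRIN equation.

```python
# Minimal sketch (PyTorch) of one gated graph convolution step: each vertex
# aggregates neighbor messages modulated by sigmoid gates computed from edge
# features. The gating form is an illustrative assumption, not the exact
# CMRIN formulation.
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.msg = nn.Linear(d, d)    # transforms neighbor features into messages
        self.gate = nn.Linear(d, d)   # produces per-edge gates from edge features
        self.out = nn.Linear(2 * d, d)

    def forward(self, x, edge_index, edge_feat):
        # x: (V, d) vertex features; edge_index: (E, 2) of (src, dst); edge_feat: (E, d)
        agg = torch.zeros_like(x)
        for (src, dst), e in zip(edge_index.tolist(), edge_feat):
            agg[dst] += torch.sigmoid(self.gate(e)) * self.msg(x[src])
        return torch.relu(self.out(torch.cat([x, agg], dim=-1)))

x = torch.randn(5, 64)                                 # 5 vertices
edge_index = torch.tensor([[0, 1], [1, 2], [2, 0]])
edge_feat = torch.randn(3, 64)
x = GatedGraphConv(64)(x, edge_index, edge_feat)       # one propagation step
```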

SLIDE 13

Language-Guided Visual Relation Graph

SLIDE 14

Experiments

Evaluation Datasets: RefCOCO, RefCOCO+, and RefCOCOg
Evaluation Metric: Precision@1 (the fraction of correct predictions)
Comparison with state-of-the-art approaches on RefCOCO, RefCOCO+, and RefCOCOg.
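For reference, a minimal sketch of the Precision@1 computation; the IoU >= 0.5 correctness criterion is the usual convention when predictions come from detected rather than ground-truth boxes.

```python
# Minimal sketch of the Precision@1 metric: the top-scoring prediction counts
# as correct when it overlaps the ground-truth box with IoU >= 0.5, the usual
# criterion when detected (rather than ground-truth) boxes are evaluated.
def iou(a, b):
    """Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / float(area(a) + area(b) - inter)

def precision_at_1(predicted_boxes, gt_boxes, thresh=0.5):
    correct = sum(iou(p, g) >= thresh for p, g in zip(predicted_boxes, gt_boxes))
    return correct / float(len(gt_boxes))

print(precision_at_1([(0, 0, 10, 10), (5, 5, 15, 15)],
                     [(1, 1, 10, 10), (20, 20, 30, 30)]))   # 0.5
```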

SLIDE 15

Experiments

  • global langcxt + vis instance: visual feature + location feature, last hidden unit of the LSTM, direct matching
  • global langcxt + global viscxt(2): GCN on the spatial relation graph
  • weighted langcxt + guided viscxt: gated GCN on the language-guided visual relation graph
  • weighted langcxt + guided viscxt + fusion: gated GCN on the cross-modal relation graph

Ablation study on variants of our proposed CMRIN on RefCOCO, RefCOCO+, and RefCOCOg.

SLIDE 16

Visualization Results

"an elephant between two other elephants"

(Figure panels: Input Image, Objects, Initial Attention Score, Final Matching Score for "left" and "right", Result.)

SLIDE 17

Visualization Results

"green plant behind a table visible behind a lady's head" / "sandwich in center row all the way on right"

(Figure panels for each example: Input Image, Objects, Initial Attention Score, Final Matching Score, Result.)

SLIDE 18

Outline

  • Introduction and Related Work
  • Cross-Modal Relationship Inference for Grounding Referring Expressions, CVPR 2019
  • Dynamic Graph Attention for Referring Expression Comprehension, ICCV 2019
  • Conclusion and Future Work Discussion

SLIDE 19

Dynamic Graph Attention (ICCV 2019)

Motivation:

  • Referring expression comprehension inherently requires visual reasoning on top of the relationships among the objects in the image, e.g., "the umbrella held by the person in the pink hat".
  • Human visual reasoning for grounding is guided by the linguistic structure of the referring expression.

Our Proposed Method:

  • Specify the reasoning process as a sequence of constituent expressions.
  • A dynamic graph attention network performs multi-step visual reasoning to identify compound objects by following the predicted reasoning process.
SLIDE 20

Dynamic Graph Attention Network

  • 1. Graph construction: visual graph → multi-modal graph
  • 2. Linguistic structure analysis: constituent expressions → guidance of reasoning
  • 3. Step-wise dynamic reasoning: performed on top of the graph under that guidance, highlighting edges and nodes to identify compound objects

SLIDE 21

Graph construction

A directed graph is first constructed over the objects in the image; attaching word-embedding-based language features to its nodes and edges turns it into the multi-modal graph.
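A minimal sketch of the directed visual-graph part follows, using a hypothetical k-nearest-neighbor rule over box centers to pick edges; DGA's actual edge construction and the attachment of word-embedding features to nodes and edges may differ.

```python
# Minimal sketch of the directed visual graph underlying the multi-modal
# graph. The k-nearest-neighbor edge rule over box centers is a hypothetical
# choice for illustration; DGA's actual edge construction may differ.
import torch

def build_graph_edges(boxes, k=2):
    """boxes: (V, 4) as (x1, y1, x2, y2); returns directed edges (i -> j)."""
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2
    dist = torch.cdist(centers, centers)        # pairwise center distances
    edges = []
    for i in range(len(boxes)):
        nearest = dist[i].argsort()[1:k + 1]    # skip self at position 0
        edges += [(i, int(j)) for j in nearest]
    return edges

boxes = torch.tensor([[0., 0., 2., 2.], [1., 1., 3., 3.],
                      [5., 5., 7., 7.], [0., 5., 2., 7.]])
print(build_graph_edges(boxes))
```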

SLIDE 22

Language Guided Visual Reasoning Process

The expression is modeled as a sequence of constituent expressions, each represented as a soft distribution over the words of the expression and predicted from bidirectional LSTM word features; together, the constituents cover the overall expression.
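A minimal PyTorch sketch of predicting T constituent expressions as soft word distributions from Bi-LSTM features; the per-step linear scorers are an illustrative assumption.

```python
# Minimal sketch (PyTorch) of predicting T constituent expressions as soft
# distributions over the words of the expression, from bidirectional LSTM
# features. The per-step linear scorers are an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, d_word, d_hid, num_steps = 8, 300, 256, 3
bilstm = nn.LSTM(d_word, d_hid, bidirectional=True, batch_first=True)
scorers = nn.ModuleList([nn.Linear(2 * d_hid, 1) for _ in range(num_steps)])

word_embs = torch.randn(1, seq_len, d_word)
word_feats, _ = bilstm(word_embs)               # (1, seq_len, 2 * d_hid)

constituents = [F.softmax(s(word_feats).squeeze(-1), dim=-1)  # (1, seq_len) each
                for s in scorers]               # one soft word distribution per step
```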
SLIDE 23

Step-wisely Dynamic Reasoning

  • The probability of the l-th word referring to each node and to each edge type.
  • The weight of each node (or edge type) being mentioned at the current time step.
  • Update the gates for every node and edge type.
  • Identify the compound object corresponding to each node.
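A minimal PyTorch sketch of one reasoning step under these definitions; all tensor shapes and the additive gate update are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of one dynamic reasoning step: the constituent's
# word distribution is converted into node/edge-type weights, accumulated
# into gates, and used to attend over neighbors to form compound-object
# features. All shapes and the additive gate update are illustrative.
import torch
import torch.nn.functional as F

V, E_types, L, d = 5, 4, 8, 64
word_to_node = F.softmax(torch.randn(L, V), dim=-1)        # p(word l refers to node v)
word_to_edge = F.softmax(torch.randn(L, E_types), dim=-1)  # p(word l refers to edge type)
constituent = F.softmax(torch.randn(L), dim=0)             # soft word distribution, this step

node_weight = constituent @ word_to_node                   # (V,) nodes mentioned now
edge_weight = constituent @ word_to_edge                   # (E_types,)

node_gate = torch.zeros(V) + node_weight                   # gates accumulate over steps

adj = torch.rand(V, V)                                     # relation strength between nodes
node_feats = torch.randn(V, d)
attn = F.softmax(adj * node_gate.unsqueeze(0), dim=-1)     # gated attention over neighbors
compound = attn @ node_feats                               # (V, d) compound object per node
```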

SLIDE 24

Experiments

Comparison with state-of-the-art methods on RefCOCO, RefCOCO+ and RefCOCOg when ground-truth bounding boxes are used.

SLIDE 25

Experiments

Comparison with the state-of-the-art methods on RefCOCO, RefCOCO+ and RefCOCOg when detected objects are used.

SLIDE 26

Explainable Visualization

SLIDE 27

Visualization Results

"a lady wearing a purple shirt with a birthday cake" (tree structure) / "the elephant behind the man wearing a gray shirt" (chain structure)

(Figure: attention over reasoning steps T = 1, 2, 3, highlighting "lady", "purple shirt", and "cake" in the first example and "elephant", "man", and "gray shirt" in the second, followed by the final matching.)

SLIDE 28

Outline

  • Introduction and Related Work
  • Cross-Modal Relationship Inference for Grounding Referring Expressions, CVPR 2019
  • Dynamic Graph Attention for Referring Expression Comprehension, ICCV 2019
  • Conclusion and Future Work Discussion

SLIDE 29

Conclusion

  • Cross-modal relationship modeling helps to enhance contextual feature representations and improves the performance of visual grounding.
  • Language-guided reasoning over an object relation graph helps to better locate the objects referred to in complex language descriptions and to generate interpretable results.

SLIDE 30

Future Work Discussion

Spatio-Temporal Reasoning in Video Grounding

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video, ACL 2019

SLIDE 31

Future Work Discussion

Embodied Referring Expression Comprehension

RERERE: Remote Embodied Referring Expressions in Real indoor Environments, arXiv 2019

SLIDE 32

Future Work Discussion

  • 1. The lady to the right of the waiter
  • 2. The person who ordered the dish served by the waiter

Commonsense Reasoning for Visual Grounding
(From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019)

SLIDE 33

Future Work Discussion

Task-Driven Object Detection (Sawatzky et al., CVPR 2019)

  • "What object in the scene would a human choose to serve wine?"
  • "I want to watch 'The Big Bang Theory' now; by the way, the room is too bright."

SLIDE 34

Thank You!

http://guanbinli.com/