Language-Driven Visual Reasoning for Referring Expression Comprehension
Guanbin Li (李冠彬), School of Data and Computer Science, Sun Yat-sen University
VALSE, 2019-12-18

Outline
- Introduction and Related Work
- Cross-Modal Relationship Inference for Grounding Referring Expressions, CVPR 2019
- Dynamic Graph Attention for Referring Expression Comprehension, ICCV 2019
- Conclusion and Future Work Discussion

Introduction
Referring Expression Comprehension versus Classic Image Understanding
[Figure: image with instances labeled Sheep 1, Sheep 2, Sheep 3]
Example referring expressions:
1. The sheep in the middle
2. The fattest sheep
3. The sheep farthest from the grass

Introduction
Referring expression comprehension requires relationship modeling and reasoning, and may also require common sense knowledge.
[Figure: two example images, each with targets marked 1 and 2]
One image:
1. The hat worn by the man bending over and stroking the dog
2. The hat on the guy to the left of the man in the yellow shirt
The other image:
1. The lady to the right of the waiter
2. The person who ordered the dish served by the waiter

Related Work
(Nagaraja et al., ECCV 2016), (Rohrbach et al., ECCV 2016)

Related Work
- Modular Attention Network (CVPR 2018)
- Accumulated Co-Attention Method (CVPR 2018)
[Figure labels: S, V, O]

Cross-Modal Relationship Inference (CVPR 2019)
Motivation:
- The extraction and modeling of relationships (both first-order and multi-order) is essential for visual grounding.
- Graph-based information propagation helps to explicitly capture multi-order relationships.
Our Proposed Method:
- Language-Guided Visual Relation Graph
- Gated Graph Convolutional Network based feature propagation for semantic context modeling
- Triplet loss with online hard negative sample mining
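The slide names a triplet loss with online hard negative sample mining but does not give its form. The following is a minimal sketch of one common formulation, not necessarily the one used in the paper. It assumes a square matrix of matching scores in which entry (i, i) is the ground-truth pair; the function name, the `margin` value and the batch layout are all hypothetical.

```python
def hard_negative_triplet_loss(scores, margin=0.1):
    """Triplet loss using the hardest negative in the batch.

    scores[i][j] is the matching score of expression i with object j.
    Pair (i, i) is assumed to be the ground truth (illustrative layout).
    Needs at least two pairs so that a negative exists.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Hardest negative object: the best-scoring wrong object for expression i.
        neg_obj = max(scores[i][j] for j in range(n) if j != i)
        # Hardest negative expression: the best-scoring wrong expression for object i.
        neg_expr = max(scores[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + neg_obj - pos)
        total += max(0.0, margin + neg_expr - pos)
    return total / n
```

Only the hardest negative contributes a gradient, which is what "online hard negative mining" usually refers to; averaging over all negatives would be the softer alternative.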

Language-Guided Visual Relation Graph
Spatial Relation Graph Construction
Each edge carries the index label of its relationship type.
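The construction rule itself did not survive on this slide, so the following is only a sketch of how a spatial relation graph with integer edge labels could be built from bounding boxes. The label set (inside, cover, overlap, eight directions), the IoU threshold and the distance cut-off are assumptions made for illustration, as are all function names.

```python
import math

# Hypothetical label set; the index into LABELS is the edge label.
DIRECTIONS = ["right", "top right", "top", "top left",
              "left", "bottom left", "bottom", "bottom right"]
LABELS = ["none", "inside", "cover", "overlap"] + DIRECTIONS

def contains(outer, inner):
    """True if box outer = (x1, y1, x2, y2) fully contains box inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def relation_label(box_i, box_j, iou_thresh=0.5, max_dist=200.0):
    """Index into LABELS describing box_j relative to box_i."""
    if contains(box_i, box_j):
        return LABELS.index("inside")   # j lies inside i
    if contains(box_j, box_i):
        return LABELS.index("cover")    # j covers i
    if iou(box_i, box_j) >= iou_thresh:
        return LABELS.index("overlap")
    dx = (box_j[0] + box_j[2]) / 2 - (box_i[0] + box_i[2]) / 2
    # Image y grows downward, so flip the sign to make "up" positive.
    dy = (box_i[1] + box_i[3]) / 2 - (box_j[1] + box_j[3]) / 2
    if math.hypot(dx, dy) > max_dist:
        return LABELS.index("none")     # too far apart: no edge
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    sector = int(((angle + 22.5) % 360.0) // 45.0)  # eight 45-degree sectors
    return LABELS.index(DIRECTIONS[sector])

def build_relation_graph(boxes):
    """Directed edges {(i, j): label}, skipping pairs labeled "none"."""
    edges = {}
    for i, box_i in enumerate(boxes):
        for j, box_j in enumerate(boxes):
            if i != j:
                label = relation_label(box_i, box_j)
                if label != LABELS.index("none"):
                    edges[(i, j)] = label
    return edges
```

For two boxes side by side, `build_relation_graph` yields one edge labeled "right" and the reverse edge labeled "left", so the graph is directed.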

Language-Guided Visual Relation Graph
Language-Guided Visual Relation Graph Construction
1. Given the expression, a bidirectional LSTM extracts a feature for each word.
2. Each word is assigned a type: entity, relation, absolute location, or unnecessary word.
From these, the slide defines in order: the global language context; the weighted, normalized attention of each word referring to each vertex; and the language context at each vertex.
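The formulas for the word-to-vertex attention and the per-vertex language context are missing from this slide. As a rough sketch of the idea, assume each word has a feature vector and a probability of being an entity word, and each vertex has a feature of the same dimension. A word's attention is normalized over vertices and then weighted by its entity probability, and the language context at a vertex is the attention-weighted sum of word features. The dot-product scoring and every name below are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def language_context(word_feats, entity_prob, vertex_feats):
    """Return (attention, contexts).

    attention[t][i]: weight of word t referring to vertex i.
    contexts[i]: language context vector at vertex i.
    """
    dim = len(word_feats[0])
    contexts = [[0.0] * dim for _ in vertex_feats]
    attention = []
    for h, p in zip(word_feats, entity_prob):
        # Normalize over vertices, then weight by the word's entity probability.
        alpha = softmax([dot(h, x) for x in vertex_feats])
        weights = [p * a for a in alpha]
        attention.append(weights)
        for i, w in enumerate(weights):
            for d in range(dim):
                contexts[i][d] += w * h[d]
    return attention, contexts
```

A word with entity probability 0 contributes nothing to any vertex, which is how non-entity words would be kept out of the vertex contexts under this assumption.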

Language-Guided Visual Relation Graph
Language-Guided Visual Relation Graph Construction
3. The language-guided multi-modal graph is defined as:

Language-Guided Visual Relation Graph The n-th gated graph convolution operation at vertex :
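The equation for the gated graph convolution is not preserved here. The sketch below shows one generic form of a gated propagation step: each vertex combines its own transformed feature with gate-scaled messages from its in-neighbors, followed by a ReLU. The weight matrices, the scalar gates and the function names are illustrative assumptions, not the paper's definition.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def gated_graph_conv(features, edges, gates, w_self, w_neigh):
    """One gated propagation step over a directed graph.

    features: list of vertex feature vectors.
    edges: list of (src, dst) pairs.
    gates: dict mapping (src, dst) to a scalar in [0, 1]; in a
        language-guided graph these would come from the expression.
    w_self, w_neigh: weight matrices for the vertex itself and its neighbors.
    """
    out = []
    for i, x in enumerate(features):
        acc = matvec(w_self, x)
        for src, dst in edges:
            if dst == i:
                msg = matvec(w_neigh, features[src])
                g = gates[(src, dst)]
                # A gate of 0 blocks the neighbor; a gate of 1 passes it fully.
                acc = [a + g * m for a, m in zip(acc, msg)]
        out.append([max(0.0, a) for a in acc])  # ReLU
    return out
```

Stacking n such steps lets information travel along paths of length n, which is how graph propagation can capture multi-order relationships.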

Language-Guided Visual Relation Graph

Experiments
Evaluation datasets: RefCOCO, RefCOCO+ and RefCOCOg
Evaluation metric: Precision@1 (the fraction of correct predictions)
[Table: comparison with state-of-the-art approaches on RefCOCO, RefCOCO+ and RefCOCOg]
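Precision@1 as described here is simply the fraction of expressions whose top-scoring candidate is the correct one. A minimal sketch, assuming each expression comes with a list of matching scores over candidate objects and the index of the ground-truth object (the function and argument names are hypothetical):

```python
def precision_at_1(score_lists, target_indices):
    """Fraction of expressions whose top-scoring candidate is the target.

    score_lists[k]: matching scores of expression k over its candidates.
    target_indices[k]: index of the ground-truth candidate for expression k.
    """
    correct = 0
    for scores, target in zip(score_lists, target_indices):
        # Index of the highest-scoring candidate.
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == target)
    return correct / len(target_indices)
```

This covers the setting where candidates are ground-truth boxes. With detected boxes, a prediction would instead need to be judged by its overlap with the annotated box.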

Experiments
Ablation variants:
- global langcxt+vis instance: visual feature + location feature, matched against the last hidden unit of the LSTM
- global langcxt+global viscxt(2): GCN on the spatial relation graph
- weighted langcxt+guided viscxt: Gated GCN on the language-guided visual relation graph
- weighted langcxt+guided viscxt+fusion: Gated GCN on the cross-modal relation graph
[Table: ablation study on variants of the proposed CMRIN on RefCOCO, RefCOCO+ and RefCOCOg]

Visualization Results
Expression: “an elephant between two other elephants”
[Figure panels: Input Image, Objects, Initial Attention Score, Final Matching Score, Result; objects labeled left and right]

Visualization Results
Expression: “green plant behind a table visible behind a lady’s head”
[Figure panels: Input Image, Objects, Initial Attention Score, Final Matching Score, Result]
Expression: “sandwich in center row all the way on right”
[Figure panels: Input Image, Objects, Initial Attention Score, Final Matching Score, Result]

Dynamic Graph Attention (ICCV 2019)
Motivation:
- Referring expression comprehension inherently requires visual reasoning on top of the relationships among the objects in the image. Example: “the umbrella held by the person in the pink hat”
- Human visual reasoning for grounding is guided by the linguistic structure of the referring expression.
Our Proposed Method:
- Specify the reasoning process as a sequence of constituent expressions.
- A dynamic graph attention network performs multi-step visual reasoning to identify compound objects by following the predicted reasoning process.

Dynamic Graph Attention Network
1. Graph construction
- Visual graph
- Multi-modal graph
2. Linguistic structure analysis
- Constituent expressions
- Guidance of reasoning
3. Step-wise dynamic reasoning
- Performs on top of the graph under the guidance
- Highlights edges and nodes
- Identifies compound objects

Graph construction Directed graph: Multi-modal graph: word embedding :

Language Guided Visual Reasoning Process
Model the expression as a sequence of constituent expressions, each a soft distribution over the words in the expression.
[Figure labels: bi-directional LSTM, overall expression]
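To make "soft distribution over words" concrete, the sketch below represents each constituent expression as a softmax over per-word scores, plus the resulting weighted average of word features. How the per-step query vectors are produced is not shown on the slide, so `step_queries` and the dot-product scoring are purely illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def constituent_expressions(word_feats, step_queries):
    """Return (dists, embeds), one entry per reasoning step.

    dists[t][l]: soft weight of word l in the t-th constituent expression.
    embeds[t]: weighted average of word features for step t.
    """
    dim = len(word_feats[0])
    dists, embeds = [], []
    for query in step_queries:
        scores = [sum(q * h for q, h in zip(query, feat)) for feat in word_feats]
        dist = softmax(scores)
        embed = [sum(dist[l] * word_feats[l][d] for l in range(len(word_feats)))
                 for d in range(dim)]
        dists.append(dist)
        embeds.append(embed)
    return dists, embeds
```

Because the distributions are soft, no hard parse of the expression is needed and the whole pipeline stays differentiable.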

Step-wise Dynamic Reasoning
At each time step the model computes:
- The probability of the l-th word referring to each node and to each type of edge
- The weight of each node (or edge type) being mentioned at that time step
- The updated gates for every node and edge type
- The compound object corresponding to each node
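The update equations are not preserved on this slide. As a loose sketch of the first three quantities, assume the weight of a node at a step is the expectation of the word-to-node probabilities under that step's word distribution, and that gates accumulate these weights across steps, clipped at 1. The accumulation rule and all names are assumptions for illustration only.

```python
def step_node_weights(word_dist, word_to_node):
    """Weight of each node being mentioned at the current step.

    word_dist[l]: weight of word l in the current constituent expression.
    word_to_node[l][k]: probability that word l refers to node k.
    """
    n_nodes = len(word_to_node[0])
    return [sum(word_dist[l] * word_to_node[l][k] for l in range(len(word_dist)))
            for k in range(n_nodes)]

def update_gates(gates, weights):
    """Accumulate step weights into gates, clipped to at most 1 (assumed rule)."""
    return [min(1.0, g + w) for g, w in zip(gates, weights)]

def run_steps(word_dists, word_to_node):
    """Apply the gate update for each step; return the gates after every step."""
    gates = [0.0] * len(word_to_node[0])
    history = []
    for word_dist in word_dists:
        gates = update_gates(gates, step_node_weights(word_dist, word_to_node))
        history.append(gates)
    return history
```

Under this rule, nodes mentioned in earlier steps stay open in later steps, so a later step can aggregate over everything mentioned so far when identifying a compound object.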

Experiments Comparison with state-of-the-art methods on RefCOCO, RefCOCO+ and RefCOCOg when ground-truth bounding boxes are used.

Experiments Comparison with the state-of-the-art methods on RefCOCO, RefCOCO+ and RefCOCOg when detected objects are used.

Explainable Visualization

Visualization Results
Tree structure: “a lady wearing a purple shirt with a birthday cake”, matching “lady”, “purple shirt” and “cake”
Chain structure: “the elephant behind the man wearing a gray shirt”, matching “elephant”, “man” and “gray shirt”
[Figure: attention at reasoning steps T = 1, T = 2, T = 3]

Conclusion
- Cross-modal relationship modeling helps to enhance the contextual feature representation and improves the performance of visual grounding.
- Language-guided reasoning over the object relation graph helps to better locate the objects referred to in complex language descriptions and to generate interpretable results.

Future Work Discussion
Spatio-Temporal Reasoning in Video Grounding
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video, ACL 2019

Future Work Discussion
Embodied Referring Expression Comprehension
RERERE: Remote Embodied Referring Expressions in Real indoor Environments, arXiv 2019

Future Work Discussion
Commonsense Reasoning for Visual Grounding
From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019
[Figure: example image with targets marked 1 and 2]
1. The lady to the right of the waiter
2. The person who ordered the dish served by the waiter

Future Work Discussion
Task Driven Object Detection
“What object in the scene would a human choose to serve wine?” [Sawatzky et al. CVPR 2019]
“I want to watch ‘The Big Bang Theory’ now, by the way, the room is too bright.”

Thank You! http://guanbinli.com/
