Towards X Visual Reasoning
Hanwang Zhang (hanwangzhang@ntu.edu.sg)
Pattern Recognition vs. Reasoning
- Caption: Lu et al. Neural Baby Talk. CVPR'18
- VQA: Teney et al. Graph-Structured Representations for Visual Question Answering. CVPR'17
- Cond. Image Generation: Johnson et al. Image Generation from Scene Graphs. CVPR'18
Reasoning: Core Problems
- Compositionality
- Learning to Reason
Three Examples
- Visual Relation Detection [CVPR'17, ICCV'17]
- Referring Expression Grounding [CVPR'18]
(Compositionality / Learning to Reason)
Three Examples
- Sequence-level Image Captioning [MM'18 submission]
(Learning to Reason)
Two Future Works
- Scene Dynamics
- Design-free NMN for VQA
Three Examples
- Visual Relation Detection [CVPR'17, ICCV'17]
- Referring Expression Grounding [CVPR'18]
(Compositionality / Learning to Reason)
Challenges in Visual Relation Detection
- Modeling <Subject, Predicate, Object>
– Joint Model: direct triplet modeling
- Complexity O(N²R) → hard to scale up
– Separate Model: separate objects & predicate
- Complexity O(N+R) → visual diversity (counts sketched below)
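As a back-of-the-envelope illustration of the two complexities (using the VRD dataset sizes quoted later in the evaluation slide; the numbers only show the scaling, not any particular model):

```python
# Rough count of output classes a relation model must handle, assuming a joint
# model needs one class per <subject, predicate, object> combination while a
# separate model only needs object classes plus predicate classes.
N, R = 100, 70                 # VRD: 100 object classes, 70 predicates

joint_classes = N * N * R      # O(N^2 R): one class per possible triplet
separate_classes = N + R       # O(N + R): detector classes + predicate classes

print(joint_classes)           # 700000 -> hard to scale up, most triplets unseen
print(separate_classes)        # 170    -> scalable, but each predicate is visually diverse
```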
TransE: Translation Embedding
[Bordes et al. NIPS’13]
Head + Relation ≈ Tail
Example: WALL-E + _has_genre ≈ {Animation, Computer Animation, Comedy film, Adventure film, Science Fiction, Fantasy, Stop motion, Satire, Drama}
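A minimal sketch of the TransE scoring idea (Bordes et al. NIPS'13) behind "Head + Relation ≈ Tail"; the embeddings below are made-up 3-d vectors for illustration, not trained values:

```python
import numpy as np

def transe_score(head, relation, tail):
    """A triple is plausible when head + relation lands near tail,
    i.e. the translation residual has a small norm."""
    return -np.linalg.norm(head + relation - tail, ord=1)

# Illustrative (untrained) embeddings for the WALL-E example.
wall_e    = np.array([0.2, 0.7, 0.1])
has_genre = np.array([0.5, -0.3, 0.4])
animation = np.array([0.7, 0.4, 0.5])

print(transe_score(wall_e, has_genre, animation))  # near 0 -> plausible triple
```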
Visual Translation Embedding
[Zhang et al. CVPR’17, ICCV’17]
- VTransE: Visual extension of TransE
VTransE Network
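A minimal PyTorch sketch of the VTransE idea as I read it from the slides: project subject and object visual features into a common relation space and treat the predicate as a translation, W_s·x_s + t_p ≈ W_o·x_o. Layer sizes and the exact feature composition are assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class VTransEHead(nn.Module):
    """Sketch of a VTransE-style predicate head: score predicates from the
    translation residual W_o x_o - W_s x_s in the relation space."""

    def __init__(self, feat_dim=4096, rel_dim=500, num_predicates=70):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, rel_dim)                       # W_s
        self.proj_o = nn.Linear(feat_dim, rel_dim)                       # W_o
        self.predicate = nn.Linear(rel_dim, num_predicates, bias=False)  # rows act like t_p

    def forward(self, x_subject, x_object):
        diff = self.proj_o(x_object) - self.proj_s(x_subject)
        return self.predicate(diff)   # predicate logits

logits = VTransEHead()(torch.randn(8, 4096), torch.randn(8, 4096))
print(logits.shape)  # torch.Size([8, 70])
```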
Evaluation: Relation Datasets
- Visual Relationship Lu et al. ECCV’16
- Visual Genome Krishna et al. IJCV’16
Main Deficiency: Incomplete Annotation

Dataset | Images | Objects | Predicates | Unique Relations | Relations/Object
VRD     | 5,000  | 100     | 70         | 6,672            | 24.25
VG      | 99,658 | 200     | 100        | 19,237           | 57
- Predicate Prediction
Does TransE work in the visual domain?
Demo link: cvpr.zl.io
- VTransE was the best separate model in 2017 (Li et al. and Dai et al., CVPR'17, are partially joint models).
- New state of the art: Neural Motifs (Zellers et al. CVPR'18), 27.2/30.3 R@50/R@100.
- Bad retrieval on VRD is due to incomplete annotation.
- Phrase Detection: only need to detect the <subject, object> joint box.
- Relation Detection: detect both subject and object boxes.
- Retrieval: given a query relation, return images.
Two follow-up works
- The key: pure visual pair model f(x1, x2)
- f(x1, x2) underpins almost every VRD model
- Evaluation: predicate classification
- 1. Faster pairwise modeling (ICCV’17)
- 2. Object-agnostic modeling (ECCV'18 submission)
Parallel Pairwise R-FCN (Zhang et al. ICCV’17)
Model   | VRD R@50 | VRD R@100 | VG R@50 | VG R@100
VTransE | 44.76    | 44.76     | 62.63   | 62.87
PPR-FCN | 47.43    | 47.43     | 64.17   | 64.86
Shuffle-Then-Assemble (Yang et al. '18)
Three Examples
- Visual Relation Detection [CVPR'17, ICCV'17]
- Referring Expression Grounding [CVPR'18]
(Compositionality / Learning to Reason)
What is grounding? Object Detection
Link words (from a fixed vocab.) to visual objects
O(N)
R. Girshick. ICCV'15
What is grounding? Phrase-to-Region
Link phrases to visual objects
O(N)
Plummer et al. ICCV’15
What is grounding? Visual Relation Detection
O(N²)
Zhang et al. CVPR’17
What’s referring expression grounding?
O(2^N)
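A hedged reading of where these complexity figures come from (the exponents were flattened in the transcript): a word or phrase is scored once per region, a relation once per ordered region pair, and a referring expression may need the referent plus an arbitrary subset of the remaining regions as context:

```latex
\begin{align*}
\text{phrase grounding}      &: \text{score each of the } N \text{ regions once} &&\Rightarrow O(N)\\
\text{relation detection}    &: \text{score ordered region pairs}                &&\Rightarrow O(N^2)\\
\text{referring expressions} &: \text{context can be any subset of the regions}  &&\Rightarrow O(2^N)
\end{align*}
```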
Prior Work: Multiple Instance Learning
O(N²)
MIL Bag
O(2^N)
Bad Approximation:
- 1. Context z is not necessarily a single region.
- 2. Turning log-sum directly into sum-log is too coarse, i.e., it forces every pair to be equally possible.
Max-Pool [Hu et al. CVPR’17]
Noisy-Or [Nagaraja et al. ECCV’16]
Our Work: Variational Context [Zhang et al. CVPR'18]
Variational lower-bound: Sum-log
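A sketch of the kind of bound this slide refers to, in my own notation (y: the referred region, x: the expression and image, z: the latent context). Prior MIL-style methods effectively use a uniform q(z), which is the coarse "log-sum to sum-log" step criticized above, whereas a learned q(z | x, y) tightens the bound:

```latex
\begin{align*}
\log p(y \mid x) \;=\; \log \sum_{z} p(y, z \mid x)
\;\ge\; \mathbb{E}_{q(z \mid x, y)}\big[\log p(y \mid z, x)\big]
        \;-\; \mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z \mid x)\big)
\end{align*}
```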
SGD Details
- z: reasoning over 2^N contexts
- REINFORCE with baseline (MC, hard sampling)
- Deterministic function (soft attention); both options sketched below
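A short sketch (my own shapes and names, not the paper's code) contrasting the two gradient options above for the latent context:

```python
import torch
import torch.nn.functional as F

def context_feature(scores, region_feats, mode="soft"):
    """scores: (N,) unnormalized scores over candidate context regions.
    region_feats: (N, D) region features.
    Returns a context feature and, for the sampling path, the log-prob
    that REINFORCE weights by (reward - baseline)."""
    probs = F.softmax(scores, dim=0)
    if mode == "soft":
        # Deterministic soft attention: z is replaced by its expectation,
        # so gradients flow through the softmax directly.
        return probs @ region_feats, None
    # Hard Monte-Carlo sampling: draw one region and keep its log-prob.
    idx = torch.multinomial(probs, num_samples=1).item()
    return region_feats[idx], torch.log(probs[idx])

feats = torch.randn(5, 16)
ctx_soft, _ = context_feature(torch.randn(5), feats, mode="soft")
ctx_hard, logp = context_feature(torch.randn(5), feats, mode="hard")
print(ctx_soft.shape, ctx_hard.shape, float(logp))
```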
Network Details
Grounding Accuracy
The best single VGG model to date.
Best ResNet model: Licheng Yu et al. MAttNet: Modular Attention Network for Referring Expression Comprehension. CVPR'18
More effective than MIL
- R. Hu et al. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017
Qualitative Results
A dark horse between three lighter horses
Three Examples
Neural Image Captioning
Encoder (Image → CNN → Vector), Decoder (Vector → Word Seq.)
Google NIC (Vinyals et al. 2014)
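A minimal encoder-decoder captioner in the spirit of Google NIC, to make the "Image → CNN → Vector → Word Seq." pipeline concrete; the backbone, sizes, and wiring below are illustrative choices, not the original model:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)                        # stand-in encoder CNN
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # image -> global feature
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        v = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)  # image as first "word"
        h, _ = self.lstm(torch.cat([v, self.embed(captions)], dim=1))
        return self.out(h)                                               # next-word logits

model = ShowAndTellSketch(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 7)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```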
Sequence-level Image Captioning
Context in Image Captioning
Context-Aware Visual Policy Network
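For the "sequence-level" part, the generic recipe is to treat the captioner as a policy, sample a whole caption, score it with a sentence-level metric (e.g. CIDEr), and apply a policy gradient with a baseline such as the greedy caption's score. The sketch below shows only this generic loss; the context-aware policy details of this work are not reproduced here:

```python
import torch

def sequence_level_loss(log_probs, sampled_reward, baseline_reward):
    """REINFORCE-with-baseline loss over a sampled caption.
    log_probs:       (B, T) log-probs of the sampled words
    sampled_reward:  (B,)   metric score of each sampled caption
    baseline_reward: (B,)   e.g. score of the greedy caption (self-critical baseline)
    """
    advantage = (sampled_reward - baseline_reward).detach().unsqueeze(1)  # (B, 1)
    return -(advantage * log_probs).sum(dim=1).mean()
```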
MS-COCO Leaderboard
Ours is a single model.
Compare with Academic Peers
Detailed Comparison with Up-Down
- P. Anderson et al. Bottom-up and top-down attention for image captioning and VQA. In CVPR’18
Visual Reasoning: A Desired Pipeline
- Configurable NN for various reasoning applications: captioning, VQA, and visual dialogue
Visual Knowledge Graph → Configurable Network → Task
Visual Reasoning: Future Directions
- Compositionality
– Good SG generation
– Robust SG representation
– Task-specific SG generation
- Learning to reason
– Task-specific network
– Good policy-gradient RL for large SG
Hard-design X Module Network
Johnson et al. ICCV'17, Hu et al. ICCV'17, Mascharka et al. CVPR'18
- Q → Program, not X
- Module X, but hard-designed
- CLEVR hacker
- Poor generalization to COCO-VQA
(a toy hard-wired module pipeline is sketched below)
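To make the "hard-design" point concrete, here is a toy, hand-wired module pipeline of the kind those papers build on: the question is parsed into a program, and each step is executed by a module from a fixed, manually specified inventory. The modules, program format, and numbers below are invented for illustration:

```python
import numpy as np

def find(feats, concept_vec):
    """Attend to regions whose features match a concept vector."""
    scores = feats @ concept_vec
    scores = scores - scores.max()                # numerical stability
    return np.exp(scores) / np.exp(scores).sum()  # attention over regions

def describe(feats, attention):
    """Pool region features under the attention map."""
    return attention @ feats

def execute(program, feats):
    att = None
    for op, arg in program:
        if op == "find":
            att = find(feats, arg)
        elif op == "describe":
            return describe(feats, att)           # hand-wired: describe ends the program

feats = np.random.randn(10, 8)                     # 10 regions, 8-d features
program = [("find", np.random.randn(8)), ("describe", None)]
print(execute(program, feats).shape)               # (8,)
```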
Design-Free Module Network
Choi et al. Learning to compose task-specific tree structures. AAAI’17