SLIDE 1

Towards X Visual Reasoning

Hanwang Zhang 张含望 hanwangzhang@ntu.edu.sg

SLIDE 2

Pattern Recognition vs. Reasoning

SLIDE 3

Pattern Recognition vs. Reasoning

  • Caption: Lu et al. Neural Baby Talk. CVPR'18
  • VQA: Teney et al. Graph-Structured Representations for Visual Question Answering. CVPR'17

  • Cond. Image Generation: Johnson et al. Image Generation from Scene Graphs. CVPR'18
SLIDE 4

Reasoning: Core Problems

  • Compositionality
  • Learning to Reason

SLIDE 5

Three Examples

  • Visual Relation Detection [CVPR'17, ICCV'17]
  • Referring Expression Grounding [CVPR'18]

(Core problems: Compositionality, Learning to Reason)

SLIDE 6

Three Examples

  • Sequence-level Image Captioning [MM'18 submission]

(Core problem: Learning to Reason)

SLIDE 7

Two Future Directions

  • Scene Dynamics
  • Design-free NMN for VQA
SLIDE 8

Three Examples

  • Visual Relation Detection [CVPR'17, ICCV'17]
  • Referring Expression Grounding [CVPR'18]

(Core problems: Compositionality, Learning to Reason)

SLIDE 9

Challenges in Visual Relation Detection

  • Modeling <Subject, Predicate, Object>

– Joint Model: direct triplet modeling

  • Complexity O(N²R) → hard to scale up

– Separate Model: separate objects & predicate

  • Complexity O(N+R) → but predicates show large visual diversity
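For scale, a rough worked example using the VRD statistics quoted later in this talk: with N = 100 object classes and R = 70 predicates, a joint model must distinguish on the order of N²R = 700,000 triplet classes, whereas a separate model trains only N + R = 170 classifiers.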
SLIDE 10

TransE: Translation Embedding

[Bordes et al. NIPS’13]

Head + Relation ≈ Tail

Example from the knowledge-graph figure: WALL-E + _has_genre ≈ {Animation, Computer Anim., Comedy film, Adventure film, Science Fiction, Fantasy, Stop motion, Satire, Drama}
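To make the translation concrete, here is a minimal sketch of the TransE score with toy random vectors standing in for trained embeddings (the function name and dimensions are illustrative assumptions):

```python
import numpy as np

# TransE (Bordes et al. NIPS'13) models a triple (head, relation, tail) as a
# translation in embedding space: head + relation ≈ tail, so a plausible
# triple should have a small translation distance.
def transe_score(head, relation, tail, norm=1):
    """Lower is better: distance between (head + relation) and tail."""
    return np.linalg.norm(head + relation - tail, ord=norm)

# Toy usage: with trained embeddings, ("WALL-E", _has_genre, "Animation")
# would score lower than a random triple.
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=50) for _ in range(3))
print(transe_score(h, r, t))
```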

SLIDE 11

Visual Translation Embedding

[Zhang et al. CVPR’17, ICCV’17]

  • VTransE: Visual extension of TransE
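A hedged sketch of the VTransE idea, assuming subject/object ROI features are given; layer names, sizes, and the linear classification head below are illustrative, not the paper's exact network:

```python
import torch
import torch.nn as nn

# VTransE intuition (Zhang et al. CVPR'17): subject + predicate ≈ object in a
# learned low-dimensional relation space, i.e. the translation vector
# (W_o x_o - W_s x_s) should align with the embedding of the correct predicate.
class VTransEHead(nn.Module):
    def __init__(self, feat_dim=4096, rel_dim=500, num_predicates=70):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, rel_dim)   # W_s: subject -> relation space
        self.proj_o = nn.Linear(feat_dim, rel_dim)   # W_o: object -> relation space
        # One embedding per predicate; scoring is a dot product with the translation.
        self.predicates = nn.Linear(rel_dim, num_predicates, bias=False)

    def forward(self, x_subject, x_object):
        translation = self.proj_o(x_object) - self.proj_s(x_subject)
        return self.predicates(translation)          # predicate logits
```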
SLIDE 12

VTransE Network

SLIDE 13

Evaluation: Relation Datasets


  • Visual Relationship: Lu et al. ECCV'16
  • Visual Genome: Krishna et al. IJCV'16

Main Deficiency: Incomplete Annotation

Dataset | Images | Objects | Predicates | Unique Relations | Relations/Object
VRD     | 5,000  | 100     | 70         | 6,672            | 24.25
VG      | 99,658 | 200     | 100        | 19,237           | 57

SLIDE 14
  • Predicate Prediction

Does TransE work in the visual domain?

SLIDE 15

Does TransE work in the visual domain?

SLIDE 16

Demo link: cvpr.zl.io

SLIDE 17

Demo link: cvpr.zl.io

SLIDE 18

VTransE was the best separate model in 2017 (Li et al. and Dai et al., CVPR'17, are partially joint models).

New state of the art: Neural Motifs (Zellers et al. CVPR'18): 27.2/30.3 R@50/R@100.

Poor retrieval on VRD is due to incomplete annotation.

  • Phrase detection: only needs to detect the joint <subject, object> box
  • Relation detection: detect both the subject and object boxes
  • Retrieval: given a query relation, return matching images

SLIDE 19

Two Follow-up Works

  • The key: a purely visual pair model f(x1, x2)
  • f(x1, x2) underpins almost every VRD model
  • Evaluation: predicate classification
  • 1. Faster pairwise modeling (ICCV'17)
  • 2. Object-agnostic modeling (ECCV'18 submission)

SLIDE 20

Parallel Pairwise R-FCN (Zhang et al. ICCV’17)

Method  | VRD R@50 | VRD R@100 | VG R@50 | VG R@100
VTransE | 44.76    | 44.76     | 62.63   | 62.87
PPR-FCN | 47.43    | 47.43     | 64.17   | 64.86

SLIDE 21

Shuffle-Then-Assemble (Yang et al. '18)

SLIDE 22

Shuffle-Then-Assemble (Yang et al. '18)

SLIDE 23

Three Examples

  • Visual Relation Detection [CVPR'17, ICCV'17]
  • Referring Expression Grounding [CVPR'18]

(Core problems: Compositionality, Learning to Reason)

SLIDE 24

What is grounding? Object Detection

Link words (from a fixed vocab.) to visual objects

O(N)

R. Girshick. Fast R-CNN. ICCV'15

SLIDE 25

What is grounding? Phrase-to-Region

Link phrases to visual objects

O(N)

Plummer et al. ICCV’15

SLIDE 26

What is grounding? Visual Relation Detection

O(N²)

Zhang et al. CVPR’17

SLIDE 27

What’s referring expression grounding?

O(2^N)

SLIDE 28

Prior Work: Multiple Instance Learning

MIL bag: approximates the O(2^N) context space with O(N²) <referent, context> pairs

Bad approximation:

  • 1. The context z is not necessarily a single region
  • 2. Relaxing the log-sum directly to a sum-log is too coarse, i.e., it forces every pair to be equally probable

Max-Pool [Hu et al. CVPR'17]

Noisy-Or [Nagaraja et al. ECCV'16]
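For concreteness, a small sketch of the two bag-pooling choices cited above; tensor shapes and function names are assumptions:

```python
import torch

# Two standard MIL poolings over a bag of candidate <referent, context> pairs.
def max_pool_mil(pair_scores: torch.Tensor) -> torch.Tensor:
    # Max-pool style (cf. Hu et al. CVPR'17): the bag score is its best pair.
    return pair_scores.max(dim=-1).values

def noisy_or_mil(pair_probs: torch.Tensor) -> torch.Tensor:
    # Noisy-or style (cf. Nagaraja et al. ECCV'16): the bag is positive if
    # at least one of its pairs is positive.
    return 1.0 - (1.0 - pair_probs).prod(dim=-1)
```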

SLIDE 29

Our Work: Variational Context [Zhang et al. CVPR'18]

Variational lower bound: replaces the intractable log-sum with a tractable sum-log
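A hedged reconstruction of the bound behind this slogan (Jensen's inequality), with q(z | x, y) as an assumed variational posterior over context configurations z:

```latex
\log p(y \mid x) \;=\; \log \sum_{z} p(y, z \mid x)
\;\ge\; \sum_{z} q(z \mid x, y)\,\log \frac{p(y, z \mid x)}{q(z \mid x, y)}
```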

SLIDE 30

SGD Details

z: reasoning over 2^N contexts

  • REINFORCE with baseline (MC, hard sampling)
  • Deterministic function (soft attention)
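A minimal sketch contrasting the two estimators listed above; shapes, the scalar reward, and the baseline are illustrative assumptions:

```python
import torch

# (1) REINFORCE with a baseline: sample a hard context z, weight its log-prob
# by the advantage (reward minus baseline) to reduce gradient variance.
def reinforce_loss(logits: torch.Tensor, reward: torch.Tensor,
                   baseline: torch.Tensor) -> torch.Tensor:
    dist = torch.distributions.Categorical(logits=logits)
    z = dist.sample()                              # hard Monte Carlo sample
    advantage = (reward - baseline).detach()       # no gradient through the critic
    return -(advantage * dist.log_prob(z)).mean()

# (2) Deterministic alternative: soft attention replaces sampling with a
# differentiable weighted average over region features.
def soft_attention(logits: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
    weights = torch.softmax(logits, dim=-1)        # "soft" context selection
    return weights @ region_feats                  # (N,) @ (N, D) -> (D,)
```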

SLIDE 31

Network Details

SLIDE 32

Network Details

SLIDE 33

Grounding Accuracy

The best VGG single model to date. (Best ResNet model: Yu et al. MAttNet: Modular Attention Network for Referring Expression Comprehension. CVPR'18.)

SLIDE 34

More effective than MIL

  • R. Hu et al. Modeling Relationships in Referential Expressions with Compositional Modular Networks. CVPR'17
SLIDE 35

Qualitative Results

“A dark horse between three lighter horses”

SLIDE 36
SLIDE 37
SLIDE 38

Three Examples

SLIDE 39

Neural Image Captioning

Encoder (Image → CNN → Vector) → Decoder (Vector → Word Seq.)

Google NIC (Vinyals et al. 2014)
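A minimal encoder-decoder sketch in the NIC spirit, assuming precomputed CNN features and teacher forcing; the class name and sizes are illustrative, not Google NIC's exact configuration:

```python
import torch
import torch.nn as nn

# Encoder-decoder captioning: the image vector conditions an LSTM that emits
# the caption one word at a time.
class ShowAndTell(nn.Module):
    def __init__(self, cnn_dim=2048, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(cnn_dim, embed_dim)   # image vector -> LSTM input
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, cnn_feat, captions):
        # Feed the image as the first "word", then teacher-force the caption.
        img = self.img_proj(cnn_feat).unsqueeze(1)        # (B, 1, E)
        seq = torch.cat([img, self.embed(captions)], 1)   # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                           # next-word logits
```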

SLIDE 40

Sequence-level Image Captioning

SLIDE 41

Context in Image Captioning

SLIDE 42

Context-Aware Visual Policy Network

SLIDE 43

Context-Aware Policy Network

SLIDE 44

Context-Aware Policy Network

SLIDE 45
SLIDE 46
SLIDE 47
SLIDE 48
SLIDE 49

MS-COCO Leaderboard

Ours is a SINGLE model.

SLIDE 50

Compare with Academic Peers

SLIDE 51

Detailed Comparison with Up-Down

  • P. Anderson et al. Bottom-Up and Top-Down Attention for Image Captioning and VQA. CVPR'18
SLIDE 52

Visual Reasoning: A Desired Pipeline

  • Configurable NN for various reasoning applications: captioning, VQA, and visual dialogue

Visual Knowledge Graph → Configurable Network → Task

SLIDE 53

Visual Reasoning: Future Directions

  • Compositionality

– Good scene graph (SG) generation
– Robust SG representation
– Task-specific SG generation

  • Learning to reason

– Task-specific network
– Good policy-gradient RL for large SGs

SLIDE 54
SLIDE 55
SLIDE 56
SLIDE 57
SLIDE 58
SLIDE 59
SLIDE 60
SLIDE 61
SLIDE 62
SLIDE 63

Hard-Design X Module Network

Johnson et al. ICCV'17; Hu et al. ICCV'17; Mascharka et al. CVPR'18

  • Q → Program: not X
  • Modules: X, but hard-designed
  • CLEVR hacker
  • Poor generalization to COCO-VQA
SLIDE 64

Design-Free Module Network


Choi et al. Learning to Compose Task-Specific Tree Structures. AAAI'18
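A toy sketch of one way such design-free composition can work, loosely following Choi et al.'s Gumbel tree idea; `scorer` and `merge` are assumed user-supplied modules, and this simplified version drops the straight-through trick that keeps the choice differentiable:

```python
import torch
import torch.nn.functional as F

# Design-free composition: instead of a hand-designed module layout,
# repeatedly choose which adjacent pair of nodes to merge, sampling the
# merge point with Gumbel noise.
def compose_step(nodes: torch.Tensor, scorer, merge) -> torch.Tensor:
    # nodes: (n, d); scorer maps two (d,) vectors to a scalar logit;
    # merge maps two (d,) vectors to one merged (d,) vector.
    logits = torch.stack([scorer(nodes[i], nodes[i + 1])
                          for i in range(len(nodes) - 1)])
    choice = F.gumbel_softmax(logits, hard=True)   # one-hot over merge positions
    i = int(choice.argmax())                       # toy: take the sampled index
    merged = merge(nodes[i], nodes[i + 1]).unsqueeze(0)
    return torch.cat([nodes[:i], merged, nodes[i + 2:]])
```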