NEURO-SYMBOLIC VISUAL REASONING: DISENTANGLING VISUAL FROM REASONING

NEURO-SYMBOLIC VISUAL REASONING: DISENTANGLING VISUAL FROM REASONING HAMID PALANGI SAEED AMIZADEH ALEX POLOZOV HPALANGI@MICROSOFT.COM SAAMIZAD@MICROSOFT.COM POLOZOV@MICROSOFT.COM YICHEN HUANG KAZUHITO KOISHIDA 8/14/2020


SLIDE 1

8/14/2020 NEURO-SYMBOLIC VISUAL REASONING 1

NEURO-SYMBOLIC VISUAL REASONING:

DISENTANGLING “VISUAL” FROM “REASONING”

ALEX POLOZOV

POLOZOV@MICROSOFT.COM

SAEED AMIZADEH

SAAMIZAD@MICROSOFT.COM

HAMID PALANGI

HPALANGI@MICROSOFT.COM

KAZUHITO KOISHIDA

KAZUKOI@MICROSOFT.COM

YICHEN HUANG

YICHUANG@MIT.EDU

SLIDE 2

VISUAL QUESTION ANSWERING

[GQA: Hudson & Manning, 2019]

Q: “What color is the food on the red object left of the small girl that is holding a hamburger?”
A: “Yellow.”

[Figure: VQA model diagram. Inputs: Visual Signal, Language Signal. Components: Visual Perception, Reasoning. Output: Answer.]

SLIDE 3

REASONING LOGICAL REASONING + EXTRA CAPABILITIES

Example: imperfect visual perception may leave an object's classification ambiguous. Yet “in the living room” or the visual context should resolve the ambiguity.

Pure logical reasoning often does not suffice for visual reasoning because visual perception is noisy and uncertain.

SLIDE 4

RESEARCH QUESTIONS

  • 1. Given a visual featurization of a visual scene, how informative is it on its own for answering a question about the scene without learned reasoning?
  • 2. How solvable is VQA/GQA given perfect vision?
  • 3. For an arbitrary VQA model, how much can its reasoning abilities compensate for the imperfections in perception to solve the task?

SLIDE 5

OUR CONTRIBUTIONS

(I) Differentiable First-Order Logic (∇-FOL) for Visual Description & Reasoning
(II) Evaluation of Reasoning vs. Perception for VQA models using ∇-FOL

[Figure: the Base Model $\mathcal{M}_{\phi}$ splits Test-Dev into an Easy Set and a Hard Set.]

SLIDE 6

FIRST ORDER LOGIC FOR SCENE DESCRIPTION

“There is a cat to the left of all objects.”

  • Variables enumerate over detected objects.
  • Atomic predicates represent object names, attributes and binary relations.
  • Formulas represent a statement or a question about the scene.

[Figure: Scene Graph Representation (nodes Cat, Mug, Phone, Pen; edges labeled Left) next to the corresponding FOL Representation.]
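As a sketch of how the statement “There is a cat to the left of all objects.” could be written as an FOL formula: the slide's own formula did not survive extraction, so the variable names $X$, $Y$ and the predicate spellings below are assumptions.

```latex
\exists X \,\forall Y :\; \mathrm{Cat}(X) \wedge \mathrm{Left}(X, Y)
```

Here $X$ and $Y$ enumerate over the detected objects, Cat is a unary predicate (an object name) and Left is a binary relation.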

SLIDE 7

FOL FOR POSING A HYPOTHETICAL QUESTION

Statement: “There is a cat to the left of all objects.”
Question: “Is there a cat to the left of all objects?”

This question can be answered probabilistically by evaluating the likelihood that the corresponding FOL formula holds for the scene. This likelihood is exponentially hard to calculate directly.

[Figure: Scene Graph Representation (nodes Cat, Mug, Phone, Pen; edges labeled Left) next to the corresponding FOL Representation.]

SLIDE 8
∇-FOL: INFERENCE IN POLYNOMIAL TIME

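In order to do inference in polynomial time, we introduce the intermediate notion of attention on the object $x_i$ w.r.t. formula $F$:

$$\alpha(F \mid x_i) \triangleq \Pr\big(F_{X = x_i} \Leftrightarrow \top \mid \mathcal{V}\big)$$

Then the answer likelihood can be reduced to computing attention via the aggregation operators $A_\forall$ and $A_\exists$:

$$A_\forall\big(\alpha(F \mid X)\big) = \prod_{i=1}^{N} \alpha(F \mid x_i), \qquad A_\exists\big(\alpha(F \mid X)\big) = 1 - \prod_{i=1}^{N} \big(1 - \alpha(F \mid x_i)\big)$$

where $F_{X = x_i}$ is the formula $F$ with the variable $X$ bound to the object $x_i$, $\mathcal{V}$ is the visual featurization and $N$ is the number of detected objects.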
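A minimal Python sketch of the two aggregation operators, assuming the product forms commonly used for a soft ∀ and a soft ∃ (per-object attentions are treated as independent probabilities). The function names and the sample attention values are illustrative and not taken from the slides.

```python
import math

def aggregate_forall(attention):
    """Soft 'for all': chance that the formula holds for every object."""
    return math.prod(attention)

def aggregate_exists(attention):
    """Soft 'exists': chance that the formula holds for at least one object."""
    return 1.0 - math.prod(1.0 - a for a in attention)

# Hypothetical attentions alpha(F | x_i) for N = 4 detected objects.
attention = [0.9, 0.2, 0.1, 0.05]
print("forall:", aggregate_forall(attention))
print("exists:", aggregate_exists(attention))
```

Each operator is a single pass over the N objects, which is what keeps inference polynomial instead of enumerating every joint assignment of objects to variables.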

SLIDE 9
∇-FOL: RECURSIVE CALCULATION OF ATTENTION

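Every FOL formula $F$ is built from a smaller FOL formula $G$ in one of three ways:

  • NOT (smaller FOL formula)
  • Unary Predicate AND (smaller FOL formula)
  • Binary Predicate AND (smaller FOL formula)

Negation Operator

$$\alpha(F \mid x_i) = 1 - \alpha(G \mid x_i) \triangleq \mathrm{Neg}[\alpha(G \mid x_i)]$$

Filter Operator

$$\alpha(F \mid x_i) = \alpha(\pi \mid x_i) \cdot \alpha(G \mid x_i) \triangleq \mathrm{Filter}_{\pi}[\alpha(G \mid x_i)]$$

Relate Operator

$$\alpha(F \mid x_i) = A_q\Big(\bigodot_{\pi \in \Pi_{XY}} \alpha(\pi \mid x_i, Y) \odot \alpha(G \mid Y)\Big) \triangleq \mathrm{Relate}_{q, \Pi_{XY}}[\alpha(G \mid Y)], \quad \forall i \in 1..N, \; q = \mathrm{Quantifier}(Y) \in \{\exists, \forall\}$$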
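The three operators can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: attentions are lists of probabilities, the Relate operator is shown for a single binary predicate rather than a set of them, and all function names are hypothetical.

```python
import math

def aggregate(quantifier, attention):
    """Soft quantifier over a list of per-object attentions."""
    if quantifier == "forall":
        return math.prod(attention)
    if quantifier == "exists":
        return 1.0 - math.prod(1.0 - a for a in attention)
    raise ValueError("quantifier must be 'forall' or 'exists'")

def neg(att_g):
    """Negation: alpha(F | x_i) = 1 - alpha(G | x_i)."""
    return [1.0 - g for g in att_g]

def filter_(att_pi, att_g):
    """Filter: alpha(F | x_i) = alpha(pi | x_i) * alpha(G | x_i)."""
    return [p * g for p, g in zip(att_pi, att_g)]

def relate(quantifier, att_rel, att_g):
    """Relate with one binary predicate.

    att_rel[i][j] = alpha(pi | x_i, y_j), att_g[j] = alpha(G | y_j).
    For each x_i, the elementwise product over y_j is aggregated
    with the quantifier attached to Y.
    """
    return [
        aggregate(quantifier, [r * g for r, g in zip(row, att_g)])
        for row in att_rel
    ]

# Hypothetical two-object scene.
att_g = [1.0, 0.0]
att_rel = [[0.0, 0.5], [0.8, 0.0]]
print(relate("exists", att_rel, att_g))
```

Because each operator maps an attention vector to another attention vector, a formula can be evaluated by applying them from the innermost sub-formula outward.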

SLIDE 10

THE LANGUAGE SYSTEM: FROM NATURAL LANGUAGE TO FOL FORMULA

Natural Language → (Semantic parsing) → Task-dependent DSL → (Compilation) → Task-independent ∇-FOL

“Is there a ball on the table?” → Select(Table) → Relate(on, Ball) → Exists(?) → $A_{\exists}\big(\mathrm{Filter}_{\mathrm{Ball}}[\mathrm{Relate}_{\mathrm{on},\exists}[\mathrm{Filter}_{\mathrm{Table}}[\mathbf{1}]]]\big)$

First-order Logic Equivalence
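To make the compilation step concrete, here is a hedged Python sketch that evaluates the example question “Is there a ball on the table?” as a chain of soft operators over a toy scene. The mapping from the DSL steps to operators, the function name `is_ball_on_table`, and every attention value are assumptions made for illustration.

```python
import math

def soft_exists(attention):
    """Chance that at least one entry holds, treating entries as independent."""
    return 1.0 - math.prod(1.0 - a for a in attention)

def is_ball_on_table(ball, table, on):
    """ball[i] = alpha(Ball | x_i), table[j] = alpha(Table | y_j),
    on[i][j] = alpha(on | x_i, y_j)."""
    # Select(Table): start from all-ones attention and filter by Table.
    selected = [1.0 * t for t in table]
    # Relate(on, Ball): for each x_i, soft "some selected y_j has x_i on it",
    # then filter the result by the Ball predicate.
    related = [soft_exists([r * s for r, s in zip(row, selected)]) for row in on]
    balls = [b * r for b, r in zip(ball, related)]
    # Exists(?): aggregate over all objects into one answer likelihood.
    return soft_exists(balls)

# Hypothetical scene with three detected objects: a table, a ball, a cup.
ball = [0.05, 0.95, 0.10]
table = [0.90, 0.02, 0.05]
on = [
    [0.00, 0.00, 0.00],
    [0.85, 0.00, 0.10],
    [0.20, 0.00, 0.00],
]
print("P(yes) =", is_ball_on_table(ball, table, on))
```

The returned value is a likelihood in [0, 1]; a binary answer could be obtained by thresholding it, which is a design choice left open here.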

SLIDE 11

GQA DOMAIN SPECIFIC LANGUAGE

SLIDE 12

VISUAL SYSTEM: FROM IMAGE TO PREDICATES

[Figure: visual system pipeline. Off-the-shelf Object Detection (e.g. Faster-RCNN, Ren et al. 2015) and Object Featurization produce per-object features; the Neural Visual Oracle takes these features together with the Queried Predicates (Cat, Dog, Man, …).]

SLIDE 13

THE WHOLE SYSTEM

SLIDE 14

USING ∇-FOL TO EVALUATE PERCEPTION

Q1: Given a visual featurization for a certain VR task, how informative is it on its own for solving the task using mere FOL for reasoning?

For GQA: The visual featurization is the Faster-RCNN featurization [Ren et al., 2015].

SLIDE 15

BUILDING THE BASE MODEL

The Base Model
1) Put ∇-FOL on top of a neural Visual Oracle.
2) Train the resulting architecture using the Faster-RCNN featurization, the golden programs and the golden answers in GQA, via indirect supervision from the answer.
3) Denote the result as the Base Model $\mathcal{M}_{\phi}$.

SLIDE 16

USING ∇-FOL TO EVALUATE PERCEPTION

Q1: Given a visual featurization for a certain VR task, how informative is it on its own for solving the task using mere FOL for reasoning?

  • ∇-FOL has no trainable parameters, so the accuracy of the Base Model $\mathcal{M}_{\phi}$ on test data indirectly captures the amount of information in the visual featurization.

SLIDE 17

USING ∇-FOL TO MEASURE THE IMPORTANCE OF PERCEPTION

Q2: How well can a VR task be solved given perfect vision?

For GQA: What happens if we replace the visual system with the Golden Scene Graphs?

SLIDE 18

BUILDING THE PERFECT MODEL

The Perfect Model
1) Replace the trained Visual Oracle in the Base Model $\mathcal{M}_{\phi}$ with the golden GQA scene graphs, denoted as $\mathcal{O}^{*}$.
2) Denote the result as the Perfect Model $\mathcal{M}^{*}$.

SLIDE 19

USING ∇-FOL TO MEASURE THE IMPORTANCE OF PERCEPTION

Q2: How well can a VR task be solved given perfect vision?

The accuracy of the Perfect Model $\mathcal{M}^{*}$ on the GQA validation set is 96%. Achieving such a high upper bound shows that:

  • ∇-FOL is sound.
  • The GQA task is heavily vision-dependent.
SLIDE 20

USING ∇-FOL TO EVALUATE REASONING

Q3: How much can the reasoning abilities of a candidate model compensate for the imperfections in perception to solve the task?

Important: the candidate model is arbitrary! It need not be DFOL-based.

For GQA: we compare MAC Network [Hudson & Manning, 2018] vs LXMERT [Tan & Bansal, 2019].

SLIDE 21

HARD SET VS EASY SET

[Figure: the Base Model $\mathcal{M}_{\phi}$ splits Test-Dev into an Easy Set and a Hard Set.]

The accuracy of a candidate model $\mathcal{M}$ on the hard set ($\mathrm{Acc}_h$) captures how much the reasoning process of $\mathcal{M}$ compensates for its imperfect perception.

The error of $\mathcal{M}$ on the easy set ($\mathrm{Err}_e$) captures the degree to which the reasoning process of $\mathcal{M}$ distorts the informative visual signals.
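A small Python sketch of how the two quantities could be computed from per-question correctness flags. It assumes the easy set is the set of Test-Dev questions the Base Model answers correctly and the hard set is the rest; the slides do not spell out the split rule, so treat that, the function name and the sample data as assumptions.

```python
def hard_accuracy_and_easy_error(base_correct, candidate_correct):
    """base_correct[i] / candidate_correct[i]: whether each model
    answers question i correctly (lists of booleans, same length)."""
    pairs = list(zip(base_correct, candidate_correct))
    # Assumed split: hard = Base Model wrong, easy = Base Model right.
    hard = [cand for base, cand in pairs if not base]
    easy = [cand for base, cand in pairs if base]
    acc_hard = sum(hard) / len(hard) if hard else float("nan")
    err_easy = 1.0 - sum(easy) / len(easy) if easy else float("nan")
    return acc_hard, err_easy

# Hypothetical outcomes on six questions.
base = [True, True, False, False, True, False]
candidate = [True, False, True, False, True, True]
acc_h, err_e = hard_accuracy_and_easy_error(base, candidate)
print("accuracy on hard set:", acc_h)
print("error on easy set:", err_e)
```

A higher hard-set accuracy suggests reasoning that compensates for weak perception, while a higher easy-set error suggests reasoning that harms an already informative visual signal.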

SLIDE 22

USING ∇-FOL TO EVALUATE REASONING

Q3: How much can the reasoning abilities of a candidate model compensate for the imperfections in perception to solve the task?

SLIDE 23

CONCLUSION REMARKS

In this work, we

  • 1. Proposed a differentiable visual description and reasoning formalism directly derived from first-order logic.
  • 2. Proposed a coherent methodology for separately evaluating perception and reasoning using our differentiable first-order logic formalism.
  • 3. Applied our framework to the GQA task and two of its well-known models, and arrived at insightful observations.

Thank you

SLIDE 24

SUPPLEMENTAL MATERIALS

SLIDE 25

MODELING OPEN QUESTIONS USING FOL

For open questions, we generate all potential options for the answer, treat each option as a binary question and choose the one with the highest likelihood.

For example: “What is the color of the ball on the left of all objects?” can be answered by answering a set of binary questions:

  • “Is the ball on the left of all objects blue?” → $Q_1$
  • “Is the ball on the left of all objects red?” → $Q_2$
  • “Is the ball on the left of all objects green?” → $Q_3$
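The selection step can be sketched in Python as an argmax over candidate options. The `likelihood` callable stands in for evaluating each binary question with the FOL machinery; it, the function name and the toy scores are hypothetical.

```python
def answer_open_question(options, likelihood):
    """Score every candidate answer as a binary question and
    return the best option together with all scores."""
    scores = {option: likelihood(option) for option in options}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical likelihoods for "Is the ball on the left of all objects <color>?"
toy_scores = {"blue": 0.15, "red": 0.70, "green": 0.05}
best, scores = answer_open_question(["blue", "red", "green"], toy_scores.get)
print(best, scores)
```

The cost grows linearly with the number of candidate options, since each option triggers one binary evaluation.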

SLIDE 26

BEYOND PURE LOGICAL REASONING: TOP-DOWN CONTEXTUAL CALIBRATION

Example of a reasoning technique beyond pure DFOL: recall the earlier example in which imperfect perception leaves the answer ambiguous. The context “in the living room” should help resolve the ambiguity. In other words, the context can be used to calibrate the attention values in a top-down manner.

SLIDE 27

BEYOND PURE LOGICAL REASONING: TOP-DOWN CONTEXTUAL CALIBRATION

Instead of a uniform prior, assume the attention values are Beta distributed. The posterior then combines the Beta likelihood with a prior whose parameters are estimated from the question context using a bi-LSTM.

[Figure: the Visual Oracle supplies the likelihood attentions $\alpha(\mathrm{man} \mid x_i)$ and $\alpha(\mathrm{Left} \mid x_i, y_j)$; attentions $\alpha(F \mid x_1), \dots, \alpha(F \mid x_4)$ and $\alpha(F \mid y_1), \dots, \alpha(F \mid y_4)$ are shown alongside a chain of LSTM cells that emit the parameter pairs $(c_t, d_t)$ and $(w_t, v_t)$ for $t = 1, 2, 3$.]

SLIDE 28

EFFECT OF TOP-DOWN CONTEXTUAL CALIBRATION
