From Recognition to Cognition: Visual Commonsense Reasoning (Rowan Zellers et al.)
- Chinmoy Samant and Anurag Patil
Outline
○ The Problem
○ Motivation and Importance
○ Dataset
○ The Approach
    ○ Adversarial Matching
    ○ Recognition to Cognition (R2C) Model
Running example (a dinner scene):
1. Three people are dining and have already ordered food.
2. Person 3 is serving and is not with the group.
3. Person 1 ordered the pancakes and bacon.
4. How do we know? Person 4 is pointing at Person 1 while looking at the server (Person 3).
➔ The task requires both recognition (finding objects) and cognition (inferring interactions).
➔ We have good vision systems for recognition.
➔ We still lack cognition systems that reason, at scale, about how the world works.
Every 48 hours, the authors manually verified the annotators' work.
Human-written wrong answers carry annotation artifacts that make them predictive of the correct label, so wrong answers are instead recycled from other questions' correct answers. A wrong answer A′ must be:
○ Relevant to the question → question relevance, P_rel(Q, A′)
○ Different from the correct answer → entailment/similarity, P_sim(A, A′)
These two metrics let us recycle right answers from other questions as distractors, via a maximum-weight bipartite matching over the weight matrix
w_ij = log(P_rel(q_i, r_j)) + λ · log(1 − P_sim(r_i, r_j))
where w_ij is an element of the weight matrix, P_rel is the relevance score, P_sim is the similarity score, and λ trades the two terms off:
○ Small λ: question relevance dominates → hard for machines
○ Large λ: entailment penalty dominates → easy for humans
A minimal sketch of the matching step follows.
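To make the matching step concrete, here is a minimal Python sketch. It assumes P_rel and P_sim have already been computed by trained relevance and entailment models (the paper uses neural scorers for both); the placeholder matrices, the function name, and the λ value are ours.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def adversarial_matching(p_rel, p_sim, lam=0.5, eps=1e-8):
        """Recycle each question's correct answer as some other question's
        distractor, via maximum-weight bipartite matching.

        p_rel[i, j] = P_rel(q_i, r_j): relevance of response j to question i.
        p_sim[i, j] = P_sim(r_i, r_j): similarity of response j to response i.
        """
        # w_ij = log(P_rel(q_i, r_j)) + lam * log(1 - P_sim(r_i, r_j))
        w = np.log(p_rel + eps) + lam * np.log(1.0 - p_sim + eps)
        np.fill_diagonal(w, -1e9)  # a question must never receive its own answer
        rows, cols = linear_sum_assignment(w, maximize=True)
        return cols  # cols[i] = index of the answer assigned to question i as a distractor

    # Toy usage with random placeholder scores for 4 questions.
    rng = np.random.default_rng(0)
    p_rel = rng.uniform(0.05, 0.95, size=(4, 4))
    p_sim = rng.uniform(0.05, 0.95, size=(4, 4))
    print(adversarial_matching(p_rel, p_sim))

To collect several distractors per question, one option is to rerun the matching while masking previously assigned pairs; λ then controls how adversarial (for machines) versus unambiguous (for humans) the resulting dataset is.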
We then modify the detection tags in the recycled candidate answer, using heuristics, to better match the new question and image (a hedged sketch of one such heuristic follows).
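As an illustration: the recycled answer's object tags must point at detections that actually exist in the new image. The match-by-class-label rule below is our guess at one plausible heuristic, not the paper's exact procedure; a detection here is any object with a .label class attribute.

    def remap_tags(tokens, old_objects, new_objects):
        # tokens: words (str) mixed with object tags (int indices into old_objects).
        remapped = []
        for tok in tokens:
            if isinstance(tok, int):
                label = old_objects[tok].label  # e.g. "person"
                # Prefer a detection of the same class in the new image ...
                same = [i for i, o in enumerate(new_objects) if o.label == label]
                # ... otherwise fall back to the plain class word.
                remapped.append(same[0] if same else label)
            else:
                remapped.append(tok)
        return remapped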
Task definition:
○ I: image, with a sequence of object detections o; each detection has a bounding box b, a segmentation mask m, and a class label.
○ Query q: a sequence of tokens, each either a word from the vocabulary or a tag referring to an object.
○ Responses: N candidates, each with the same structure as the query.
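Pictured as plain Python dataclasses (the field names are illustrative, not the paper's actual code):

    from dataclasses import dataclass
    from typing import List, Tuple, Union

    @dataclass
    class Detection:
        box: Tuple[float, float, float, float]  # bounding box b: (x1, y1, x2, y2)
        mask: object                            # segmentation mask m
        label: str                              # class label, e.g. "person"

    # A token is either a vocabulary word or an integer tag
    # pointing into the detection list (e.g. "person1" -> tag 0).
    Token = Union[str, int]

    @dataclass
    class VCRExample:
        image: str                    # image I (a path or id)
        objects: List[Detection]      # object detections o
        query: List[Token]            # query q: words and/or object tags
        responses: List[List[Token]]  # N responses, same structure as the query
        label: int                    # index of the correct response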
Dataset: 290K questions over 110K images.
○ 38% ask why or how
○ 24% involve cognition-level activities
○ 13% require temporal reasoning
The model must:
1. Figure out the meanings of the query and the responses, with respect to the image and to each other (grounding and contextualization).
2. Perform inference over this joint representation (reasoning).
A simplified sketch of these steps follows.
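A heavily simplified PyTorch sketch of those two steps. The real R2C grounds tokens with BERT embeddings fused with CNN features of the tagged objects and adds attention over objects; every name and dimension below is ours.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class R2CSketch(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            # Step 1a: ground query and response tokens (shared BiLSTM).
            self.ground = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
            # Step 2: reason over the contextualized sequence.
            self.reason = nn.LSTM(3 * dim, dim // 2, bidirectional=True, batch_first=True)
            self.score = nn.Linear(dim, 1)

        def forward(self, query, response):
            # query: (B, Lq, dim); response: (B, Lr, dim) -- token embeddings
            # already fused with image features for any object tags.
            q, _ = self.ground(query)
            r, _ = self.ground(response)
            # Step 1b: contextualize -- each response token attends to the query.
            att = F.softmax(r @ q.transpose(1, 2), dim=-1)  # (B, Lr, Lq)
            q_ctx = att @ q                                  # (B, Lr, dim)
            # Step 2: reason over [response; attended query; their product].
            fused = torch.cat([r, q_ctx, r * q_ctx], dim=-1)
            out, _ = self.reason(fused)
            return self.score(out.mean(dim=1)).squeeze(-1)   # one logit per response

Running this once per candidate response and taking a softmax over the four logits gives the multiple-choice prediction.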
Two subtasks, each handled by its own R2C model:
○ R2C Model 1 (Q→A): query q = Question (Q) + Image; responses r = candidate Answers (A).
○ R2C Model 2 (QA→R): query q = Question + correct Answer (QA) + Image; responses r = candidate Rationales (R).
The two compose into the joint Q→AR metric, as sketched below.
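A hedged sketch of that composition (model and field names are illustrative):

    def q2ar_correct(model_qa, model_qar, ex):
        # Q->A: pick the best of the 4 candidate answers.
        a_pred = model_qa(ex.image, ex.query, ex.answers).argmax()
        # QA->R: pick the best rationale given question + correct answer.
        # (If a_pred is wrong, the conjunction below already fails, so using
        # the ground-truth answer here yields the same joint accuracy.)
        qa = ex.query + ex.answers[ex.answer_label]
        r_pred = model_qar(ex.image, qa, ex.rationales).argmax()
        # Q->AR is correct only if BOTH choices are right.
        return a_pred == ex.answer_label and r_pred == ex.rationale_label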
Text-only baselines:
○ BERT
○ BERT (response only)
○ ESIM+ELMo
○ LSTM+ELMo
VQA baselines:
○ RevisitedVQA
○ Bottom-Up and Top-Down Attention
○ MLB (Multimodal Low-rank Bilinear Attention)
○ MUTAN (Multimodal Tucker Fusion)
(Chance: 6.2% on joint Q→AR; a random guess picks the right answer with probability 1/4 and the right rationale with probability 1/4, so 1/16 ≈ 6.25%.)
Yes (that's why it got published!)
*As of Feb 29th 2020, may have changed since then.
[Leaderboard slide: this paper's entry, alongside the top entry from Microsoft Dynamics 365 AI Research (https://arxiv.org/abs/1909.11740)]
The “Good”
○ A comprehensive pipeline to select and annotate interesting images at scale: 290K questions over 110K images.
○ Leverages the Natural Language Inference task to build a robust dataset with minimized annotation artifacts; a better BERT yields a better (more adversarial) dataset.
○ Ablation study of all the important model components, as well as alternatives that were not included, with the reasoning behind the model design decisions.
○ Main paper: 8 pages; appendix: 17 pages.
○ Detailed description of the dataset, Adversarial Matching, the R2C model, and even the hyperparameters.
○ Analysis of correct as well as incorrect results, giving more insight into the model's understanding of the world.
The “Not So Good”
○ Abstract (synthetic) images help reduce language bias, e.g., priors such as fire hydrants being red or the sky being blue. Could such images have been added to the dataset?
○ The P_rel computation limits how easily a bigger BERT can be applied.
○ Knowledge bases could provide world knowledge and automatic reasoning generation.
○ Better handling of the subtasks Q→A and QA→R, rather than naive composition.
○ Better baseline VQA models could have been selected to ensure a fair comparison with R2C.
○ No masking approach or masking-based results in the paper to ensure complete attention coverage of the images.
References
1. https://arxiv.org/pdf/1811.10830.pdf (the paper)
2. https://www.youtube.com/watch?v=Je5LlZlqUt8&t=4776s
3. https://www.youtube.com/watch?v=nl6IsjfWKms