SLIDE 1

From Recognition to Cognition: Visual Commonsense Reasoning (Rowan Zellers et al.)

  • Presented by Chinmoy Samant and Anurag Patil
SLIDE 2

Outline

  • The Problem
  • Motivation and Importance
  • Dataset
  • The Approach

○ Adversarial Matching
○ Recognition and Cognition Model

  • Results
  • Critique
SLIDE 3

The Problem

  • Infer the entire situation: what is happening and why is it happening

1. Three people are dining and have already ordered food.
2. Person 3 is serving and is not with the group.
3. Person 1 ordered the pancakes and bacon.
4. How do we know? Person 4 is pointing to Person 1 while looking at the server (Person 3).

SLIDE 4

Motivation and Importance

➔ Recognition (find objects) and cognition (infer interactions)
➔ Good vision systems enable good cognition systems at scale.

  • Image captioning: high-level understanding, but difficult to evaluate.
  • VQA: easy to evaluate, but provides no rationale.
  • VCR uses a multiple-choice setting.
  • The justification has to include details about the scene and background knowledge about how the world works.

SLIDE 5

SLIDE 6

Collecting commonsense inferences

SLIDE 7

Collecting commonsense inferences

SLIDE 8

Collecting commonsense inferences

Every 48 hours, the authors manually verified the annotators' work.

SLIDE 9

Adversarial Matching

  • Annotation artifacts: subtle patterns that are, by themselves, highly predictive of the correct label.
  • Wrong answers must be relevant to the question yet different from the correct answer:

○ (Q, A′): question relevance
○ (A, A′): entailment

SLIDE 10

Adversarial Matching

We’ll use these two metrics to recycle right answers to other questions, using a minimum weight bipartite matching.

SLIDE 11

Adversarial Matching

Wrong answers must be relevant to the question yet different from the correct answer: (Q, A′) measures question relevance, and (A, A′) measures entailment. The weight for recycling answer r_j as a distractor for question q_i is

w_i,j = log(P_rel(q_i, r_j)) + λ log(1 − P_sim(r_i, r_j))

where w_i,j is an element of the weight matrix, P_rel is the relevance score, and P_sim is the similarity (entailment) score.

When question relevance dominates, the distractors are hard for machines; when the entailment penalty dominates, they are easy for humans.
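The matching step can be sketched in a few lines. This is a brute-force illustration, assuming precomputed `p_rel` and `p_sim` score matrices (names hypothetical); the paper solves the same objective at scale with a bipartite matching solver:

```python
import math
from itertools import permutations

def adversarial_match(p_rel, p_sim, lam=0.5, eps=1e-8):
    """Assign each question q_i a recycled answer r_j as a distractor.

    p_rel[i][j]: relevance of response j to question i
    p_sim[i][j]: similarity (entailment) between responses i and j
    lam: trades off machine difficulty vs. human easiness
    """
    n = len(p_rel)

    def weight(i, j):
        # w_ij = log Prel(q_i, r_j) + lam * log(1 - Psim(r_i, r_j))
        return math.log(p_rel[i][j] + eps) + lam * math.log(1 - p_sim[i][j] + eps)

    # Maximize the total weight over derangements (equivalently, a
    # minimum-weight matching on the negated weights).
    best, best_score = None, -math.inf
    for perm in permutations(range(n)):
        if any(perm[i] == i for i in range(n)):
            continue  # a question must not receive its own correct answer
        score = sum(weight(i, perm[i]) for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return best  # best[i] = index of the answer recycled as a distractor for q_i
```

Brute force is O(n!), so this only works on toy inputs; a Hungarian-algorithm solver (e.g. `scipy.optimize.linear_sum_assignment`) handles dataset-scale matching.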

SLIDE 12

What about the people tags (e.g. [person5])?

We modify the detection tags in each candidate answer, using heuristics, to better match the new question/image.

SLIDE 13

Unique Contributions of Adversarial Matching

SLIDE 14

Thus, the task is defined as follows:

  • Image (I): a sequence of object detections o, each with a bounding box b, a segmentation mask m, and a class label.
  • Query (q): a sequence of tokens, each either a word in the vocabulary or a tag referring to an object.
  • Responses (r): N candidates with the same structure as the query.

Dataset statistics: 290K questions over 110K images; 38% ask "why" or "how", 24% involve cognition-level activities, and 13% require temporal reasoning.
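These definitions can be sketched as plain data types. This is a hypothetical layout for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Detection:
    """One object detection o: bounding box b, segmentation mask m, class label."""
    box: tuple   # (x1, y1, x2, y2)
    mask: list   # segmentation polygon (format assumed)
    label: str   # e.g. "person"

# A query/response token is either a vocabulary word or an integer tag
# pointing into the detection list (e.g. 0 means the first detected person).
Token = Union[str, int]

@dataclass
class VCRExample:
    objects: List[Detection]
    query: List[Token]            # the question
    responses: List[List[Token]]  # N candidate answers, same structure as the query
    label: int                    # index of the correct response

ex = VCRExample(
    objects=[Detection((0, 0, 50, 120), [], "person"),
             Detection((60, 10, 110, 130), [], "person")],
    query=["Why", "is", 0, "pointing", "at", 1, "?"],
    responses=[["He", "is", "ordering", "food", "."],
               ["He", "is", "telling", 1, "to", "leave", "."]],
    label=0,
)
assert isinstance(ex.query[2], int)  # tags refer back to detections
```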

SLIDE 15

Setting up the Task

SLIDE 16

Problem Description

SLIDE 17

Problem Description

SLIDE 18

Problem Description

SLIDE 19

Problem Description

SLIDE 20

Problem Description

SLIDE 21

How do we arrive at the correct answer and reasoning?

1. Figure out the meanings of the query and the responses with respect to the image and to each other.
2. Perform inference on this representation.

SLIDE 22

R2C model stages

  1. Grounding
  2. Contextualization
  3. Reasoning
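The three stages can be illustrated with a toy numpy sketch of the attention step at the heart of contextualization. All shapes, names, and the plain dot-product scoring here are illustrative assumptions, not the paper's exact architecture (R2C first grounds tokens using BERT embeddings and CNN object features):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Each query vector gathers a weighted summary of `values`,
    with weights given by query-key similarity."""
    scores = queries @ keys.T           # (n_q, n_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ values             # (n_q, d)

rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=(3, d))     # grounded query tokens (stage 1 output)
response = rng.normal(size=(5, d))  # grounded response tokens
objects = rng.normal(size=(4, d))   # object features from the image

# Stage 2 (contextualization): response tokens attend over the query and
# over the image objects; the summaries are concatenated per token.
r_q = attend(response, query, query)
r_o = attend(response, objects, objects)
contextualized = np.concatenate([response, r_q, r_o], axis=-1)  # (5, 3*d)

# Stage 3 (reasoning) would run a recurrent layer over `contextualized`
# and pool it into a single score for this (question, response) pair.
```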
SLIDE 23 – SLIDE 35 (figure-only slides; no text extracted)

SLIDE 36

Final architecture

Two R2C models are chained:

  • R2C Model 1 (Q→A): q = Question (Q), r = Answer (A), given the image.
  • R2C Model 2 (QA→R): q = Question + Answer (QA), r = Rationale (R), given the image.
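Chaining the two models implies the holistic Q→AR metric: a question only counts as correct when model 1 picks the right answer and model 2, conditioned on that answer, picks the right rationale. A minimal sketch (the function name and list-based inputs are illustrative):

```python
def q2ar_accuracy(answer_preds, rationale_preds, answer_gold, rationale_gold):
    """Q->AR: a question is a hit only if BOTH chained predictions are right."""
    hits = sum(a == ag and r == rg
               for a, ag, r, rg in zip(answer_preds, answer_gold,
                                       rationale_preds, rationale_gold))
    return hits / len(answer_gold)

# With 4 answer choices and 4 rationale choices, chance Q->AR performance
# is (1/4) * (1/4) = 6.25%.
```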

SLIDE 37

Results

SLIDE 38

Baselines used

  • Text-only baselines:

○ BERT
○ BERT (response only)
○ ESIM + ELMo
○ LSTM + ELMo

  • VQA baselines:

○ RevisitedVQA
○ Bottom-Up and Top-Down Attention
○ MLB (Multimodal Low-rank Bilinear Attention)
○ MUTAN (Multimodal Tucker Fusion)

SLIDE 39

Results

SLIDE 40

SLIDE 41

SLIDE 42

(Chance: 6.2)

Yes (that's why it got published!)

SLIDE 43

Ablation Results

SLIDE 44

Okay, but….

SLIDE 45

SLIDE 46

VCR Leaderboard*

*As of Feb 29th 2020, may have changed since then.

SLIDE 47

VCR Leaderboard*


This paper (marked on the leaderboard)

SLIDE 48

(UNITER, from Microsoft Dynamics 365 AI Research: https://arxiv.org/abs/1909.11740)

SLIDE 49

Sample Results (Correct)

SLIDE 50

Sample Results (Incorrect)

SLIDE 51

Critique

The “Good”

  • Dataset:

○ A comprehensive pipeline to select and annotate interesting images at scale: 290K questions over 110K images.

  • Adversarial Matching:

○ Leverages the Natural Language Inference task to build a robust dataset with minimal annotation artifacts; a better BERT yields a better dataset.

  • Ablation study:

○ Ablates all the important model components, as well as alternatives that were not included, and gives the reasoning behind the model design decisions.

  • Details:

○ Main paper: 8 pages; appendix: 17 pages.
○ Detailed description of the dataset, Adversarial Matching, the R2C model, and even the hyperparameters used, ensuring detailed insights and easy reproducibility of the results.

  • Output analysis:
  • Output analysis:

○ Performed analysis of correct as well as incorrect results, giving more insight into the model’s understanding of the world.

SLIDE 52

Critique

The “Not So Good”

  • Language bias in the dataset:

○ Abstract (artificial) images help reduce language bias, e.g. a fire hydrant being red or the sky being blue. Could these have been added?

  • Adversarial Matching:

○ The P_rel computation makes it hard to swap in a bigger BERT.

  • R2C Model:

○ Could use knowledge bases for world knowledge and automatic reasoning generation.
○ Could handle the subtasks Q→A and QA→R better than by naive composition.

  • Evaluation Methodology:

○ Better baseline VQA models could have been selected to ensure a fair comparison with R2C.
○ No masking approach or masking-based results in the paper to ensure complete attention coverage of the images.

SLIDE 53

Questions?

SLIDE 54

References

1. https://arxiv.org/pdf/1811.10830.pdf
2. https://www.youtube.com/watch?v=Je5LlZlqUt8&t=4776s
3. https://www.youtube.com/watch?v=nl6IsjfWKms