Designing deep architectures for Visual Question Answering Matthieu - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Designing deep architectures for Visual Question Answering

Matthieu Cord Sorbonne University valeo.ai research lab. Paris Thanks to H. Ben-younes, R. Cadène

slide-2
SLIDE 2

Visual Question Answering

Question Answering: What does Claudia do?

+

slide-3
SLIDE 3

Visual Question Answering: What does Claudia do?

+

Visual Question Answering

slide-4
SLIDE 4

Visual Question Answering: What does Claudia do?

+

Visual Question Answering

Sitting at the bottom Standing at the back …

slide-5
SLIDE 5

Visual Question Answering: What does Claudia do?

+

Visual Question Answering

Sitting at the bottom Standing at the back …

Solving this task is interesting for:

  • Studying deep learning models in a multimodal context
  • Improving human-machine interaction
  • Taking one step toward a visual assistant for blind people

Deep ML

slide-6
SLIDE 6

Outline

  • 1. Multimodal embedding
  • Deep nets to align text+image
  • Learning
  • 2. VQA framework
  • Task modeling
  • Fusion in VQA
  • Reasoning in VQA
slide-7
SLIDE 7

Deep semantic-visual embedding

ConvNet RNN

slide-8
SLIDE 8

ConvNet RNN

Deep semantic-visual embedding

Distances carry semantic meaning; retrieval by nearest-neighbor (NN) search

slide-9
SLIDE 9

A cat on a sofa A dog playing A car

2D Semantic visual space example:

  • Distance in the space has a semantic interpretation
  • Retrieval is done by finding nearest neighbors
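
The two points above can be illustrated with a minimal sketch, assuming cosine similarity as the distance and made-up 2D coordinates for the three captions. The function names and all numeric values are hypothetical, not taken from the talk.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def retrieve(query, gallery, k=2):
    """Return the indices of the k gallery items most similar to the query."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-sims)[:k]

# Toy 2D embeddings (made-up coordinates) for the three example captions.
captions = ["A cat on a sofa", "A dog playing", "A car"]
caption_emb = np.array([[0.9, 0.1], [0.7, 0.7], [-0.8, 0.3]])

# A hypothetical image embedding that lies close to the "cat" caption.
image_emb = np.array([1.0, 0.2])
ranked = retrieve(image_emb, caption_emb, k=2)
print([captions[i] for i in ranked])
```

Because both modalities live in the same space, the same function serves image-to-text and text-to-image retrieval.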

Deep semantic-visual embedding

slide-10
SLIDE 10
  • Designing image and text embedding architectures
  • Learning scheme for these deep hybrid nets

Deep semantic-visual embedding

slide-11
SLIDE 11

Visual pipeline:

  • ResNet-152 pretrained
  • Weldon spatial pooling
  • Affine projection
  • Normalization

Textual pipeline:

  • Pretrained word embedding
  • Simple Recurrent Unit (SRU)
  • Normalization

ResNet conv pool affine+ norm.

(a, man, in, ski, gear, skiing, on, snow)

w2v SRU+norm

θ0:2 and ϕ are the trained parameters
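
As a rough sketch of how the two pipelines end in a shared space: the code below applies an affine projection and L2 normalization on the visual side, and L2 normalization on the textual side. Random arrays stand in for the pooled ResNet features and for the SRU sentence vectors. The embedding size of 512 and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def visual_branch(pooled_features, W, b):
    """Affine projection of pooled ConvNet features, then normalization."""
    return l2_normalize(pooled_features @ W + b)

def textual_branch(sentence_vectors):
    """Normalization of the sentence vectors produced by the recurrent unit."""
    return l2_normalize(sentence_vectors)

feat_dim, embed_dim = 2048, 512                   # embed_dim is a hypothetical choice
pooled = rng.standard_normal((4, feat_dim))       # stand-in for 4 pooled image features
sentences = rng.standard_normal((3, embed_dim))   # stand-in for 3 sentence vectors
W = rng.standard_normal((feat_dim, embed_dim)) * 0.01
b = np.zeros(embed_dim)

img = visual_branch(pooled, W, b)
txt = textual_branch(sentences)
similarity = img @ txt.T   # cosine similarity between every image and sentence
```

With both outputs normalized, the dot product is a cosine similarity, which is what a nearest-neighbor search over the joint space would rank by.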

Deep semantic-visual embedding

DeViSE: A Deep Visual-Semantic Embedding Model, A. Frome et al., NIPS 2013

Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018

RNN

slide-12
SLIDE 12

Visual pipeline:

  • ResNet-152 pretrained
  • Weldon spatial pooling
  • Affine projection
  • Normalization

Textual pipeline:

  • Pretrained word embedding
  • Simple Recurrent Unit (SRU)
  • Normalization

ResNet conv pool affine+ norm.

(a, man, in, ski, gear, skiing, on, snow)

w2v SRU+norm

Deep semantic-visual embedding

Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018
slide-13
SLIDE 13

Visual pipeline:

  • ResNet-152 pretrained
  • Weldon spatial pooling
  • Affine projection
  • Normalization

Textual pipeline:

  • Pretrained word embedding
  • Simple Recurrent Unit (SRU)
  • Normalization

ResNet conv pool affine+ norm.

(a, man, in, ski, gear, skiing, on, snow)

w2v SRU+norm

θ0:2 and ϕ => learned using a training set

Deep semantic-visual embedding

Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al., CVPR 2018
slide-14
SLIDE 14

How to get large training datasets?

Learning Cross-modal Embeddings for Cooking Recipes and Food Images, A. Salvador et al., CVPR 2017

Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings, M. Carvalho, R. Cadene, D. Picard, L. Soulier, N. Thome, M. Cord, SIGIR 2018

Cooking recipes: easy to get large multimodal datasets with aligned data

slide-15
SLIDE 15

Deep semantic-visual embedding

Demo Visiir.lip6.fr

slide-16
SLIDE 16

A dog playing with a frisbee A plane in a cloudy sky

  • 1. A herd of sheep standing on top of snow covered field.
  • 2. There are sheep standing in the grass near a fence.
  • 3. some black and white sheep a fence dirt and grass

Query Closest elements

Cross-modal retrieval

slide-17
SLIDE 17

Visual grounding examples:

  • Generating multiple heat maps with different textual queries

Cross-modal retrieval and localization

Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al, CVPR 2018

slide-18
SLIDE 18

Emergence of color understanding:

Cross-modal retrieval and localization

slide-19
SLIDE 19

Outline

  • 1. Multimodal embedding
  • Deep nets to align text+image
  • Learning
  • 2. Visual Question Answering
  • Task modeling
  • Fusion in VQA
  • Reasoning in VQA
slide-20
SLIDE 20

VQA

slide-21
SLIDE 21

VQA

What color is the fire hydrant

What color is the fire Hydrant on the left?

Green

slide-22
SLIDE 22

VQA

What color is the fire hydrant

What color is the fire Hydrant on the right?

Yellow

slide-23
SLIDE 23

woman

Different answers Similar images

man

Who is wearing glasses?

@VQA workshop, CVPR 2017

=> Need very good visual and question (deep) representations

=> Full scene understanding

=> Need high-level multimodal interaction modeling

=> Merging operators, attention and reasoning

slide-24
SLIDE 24

Vanilla VQA scheme: 2 deep + fusion

Question Representation Image

slide-25
SLIDE 25

VQA

Question: Is the lady with the blue fur wearing glasses ? Image representation Question representation

VQA: the output space

Yes

slide-26
SLIDE 26

VQA: the output space

slide-27
SLIDE 27

VQA: the output space

Output space representation: => Classify over the most frequent answers (the 3000 most frequent answers cover about 95% of cases)
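
A minimal sketch of how such a fixed answer vocabulary could be built and its coverage measured. The toy answer list and the function name are hypothetical; on a real dataset the same counting would be run over all training answers.

```python
from collections import Counter

def build_answer_vocab(answers, vocab_size):
    """Keep the vocab_size most frequent answers and report their coverage."""
    counts = Counter(answers)
    vocab = [answer for answer, _ in counts.most_common(vocab_size)]
    covered = sum(counts[answer] for answer in vocab)
    return vocab, covered / len(answers)

# Toy training answers (made up for illustration).
train_answers = ["yes"] * 5 + ["no"] * 3 + ["2"] + ["green"]
vocab, coverage = build_answer_vocab(train_answers, vocab_size=2)
print(vocab, coverage)
```

Questions whose ground-truth answer falls outside the vocabulary cannot be answered correctly by the classifier, which is the price paid for a closed output space.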

slide-28
SLIDE 28

VQA

Classes Question: Is the lady with the blue fur wearing glasses ? Image representation Question representation

VQA: the output space

slide-29
SLIDE 29

Image

  • Convolutional Network (VGG, ResNet, …)
  • Detection system (EdgeBoxes, Faster-RCNN, …)

Question

  • Bag-of-words
  • Recurrent Network (RNN, LSTM, GRU, SRU, …)

Learning

  • Fixed answer vocabulary
  • Classification (cross-entropy)
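
The learning setup above amounts to multi-class classification over the answer vocabulary. A minimal sketch of the softmax cross-entropy for one example follows, with a made-up four-answer vocabulary and made-up logits.

```python
import numpy as np

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example, computed stably."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_index]

answer_vocab = ["yes", "no", "green", "yellow"]   # hypothetical fixed vocabulary
logits = np.array([2.0, 0.5, -1.0, 0.0])          # hypothetical model outputs

loss = cross_entropy(logits, answer_vocab.index("yes"))
predicted = answer_vocab[int(np.argmax(logits))]
print(predicted, float(loss))
```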

VQA processing

Multimodal Fusion Reasoning

slide-30
SLIDE 30

Fusion in VQA

slide-31
SLIDE 31

VQA: fusion

Fusion: Concat + Proj, Element-wise, Concat + MLP

Is the lady with the purple fur wearing glasses ?
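
The three baseline fusion operators named on this slide can be sketched as follows. The dimensions, the random weights and the ReLU in the MLP are assumptions chosen for illustration; q and v stand for the question and image representations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_v, d_out, d_hidden = 6, 8, 5, 16   # hypothetical sizes
q = rng.standard_normal(d_q)              # question representation
v = rng.standard_normal(d_v)              # image representation

def concat_proj(q, v, W):
    """Concatenate both vectors, then apply one linear projection."""
    return W @ np.concatenate([q, v])

def element_wise(q, v, W_q, W_v):
    """Project both vectors to a common size, then multiply element-wise."""
    return (W_q @ q) * (W_v @ v)

def concat_mlp(q, v, W1, W2):
    """Concatenate, then apply a two-layer perceptron with a ReLU."""
    hidden = np.maximum(W1 @ np.concatenate([q, v]), 0.0)
    return W2 @ hidden

y_proj = concat_proj(q, v, rng.standard_normal((d_out, d_q + d_v)))
y_elem = element_wise(q, v, rng.standard_normal((d_out, d_q)),
                      rng.standard_normal((d_out, d_v)))
y_mlp = concat_mlp(q, v, rng.standard_normal((d_hidden, d_q + d_v)),
                   rng.standard_normal((d_out, d_hidden)))
```

Concat + Proj is purely additive in q and v, while the element-wise product introduces multiplicative interactions between the two modalities, which is the property bilinear models generalize.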

slide-32
SLIDE 32

VQA: fusion

Fusion: Concat + Proj, Element-wise, Concat + MLP

Is the lady with the purple fur wearing glasses ?

slide-33
SLIDE 33

VQA: bilinear fusion

[Fukui, Akira et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016]

[Kim, Jin-Hwa et al., Hadamard Product for Low-rank Bilinear Pooling, ICLR 2017]

Bilinear model:

Is the lady with the purple fur wearing glasses ?
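
The equation of the bilinear model did not survive in this transcript. A common way to write it, assumed here, is y_k = sum over i, j of T[i, j, k] * q[i] * v[j], where T is a 3-way tensor of learned coefficients. The sketch below computes this with toy dimensions and checks it against the per-output quadratic form q^T T[:, :, k] v.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_v, d_y = 4, 5, 3                     # toy sizes, chosen for illustration
T = rng.standard_normal((d_q, d_v, d_y))    # 3-way tensor of coefficients
q = rng.standard_normal(d_q)                # question representation
v = rng.standard_normal(d_v)                # image representation

def bilinear_fusion(T, q, v):
    """y[k] = sum_ij T[i, j, k] * q[i] * v[j]."""
    return np.einsum("ijk,i,j->k", T, q, v)

y = bilinear_fusion(T, q, v)

# Equivalent view: one bilinear form per output dimension.
y_check = np.array([q @ T[:, :, k] @ v for k in range(d_y)])
```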

slide-34
SLIDE 34

VQA: bilinear fusion

Learn the coefficients of the 3-way tensor.

  • Different from tensor analysis in signal processing (representation)

Problem: q, v and y are of dimension ~2000 => 8 billion free parameters in the tensor

Need to reduce the tensor size:

  • Idea: structure the tensor to reduce the number of parameters
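
The figure of 8 billion follows directly from the three dimensions. A short arithmetic check, where the 4 bytes per parameter (float32 storage) is an assumption added for illustration:

```python
def full_tensor_params(d_q, d_v, d_y):
    """Number of free coefficients in an unstructured 3-way tensor."""
    return d_q * d_v * d_y

n_params = full_tensor_params(2000, 2000, 2000)
memory_gb = n_params * 4 / 1e9   # assuming 4 bytes per parameter (float32)
print(n_params, memory_gb)
```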

slide-35
SLIDE 35

Tucker decomposition: structuring the tensor T ⇔ constraining the rank of each unfolding of T
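
A minimal sketch of a Tucker-structured bilinear fusion, assuming the usual form of the decomposition: three factor matrices (named W_q, W_v, W_o here) and a small core tensor. All names and sizes are hypothetical. The code checks that applying the factors step by step gives the same output as contracting the reconstructed full tensor, and compares parameter counts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_v, d_y = 6, 7, 5        # input and output sizes (toy values)
t_q, t_v, t_o = 3, 4, 2        # sizes of the core tensor (toy values)

W_q = rng.standard_normal((d_q, t_q))       # question factor matrix
W_v = rng.standard_normal((d_v, t_v))       # image factor matrix
W_o = rng.standard_normal((d_y, t_o))       # output factor matrix
core = rng.standard_normal((t_q, t_v, t_o)) # core tensor

def tucker_fusion(q, v):
    """Project q and v, fuse them through the core tensor, project to the output."""
    q_t = q @ W_q                                   # (t_q,)
    v_t = v @ W_v                                   # (t_v,)
    z = np.einsum("abc,a,b->c", core, q_t, v_t)     # (t_o,)
    return W_o @ z                                  # (d_y,)

q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)
y = tucker_fusion(q, v)

# Reconstruct the full tensor implied by the factors and contract it directly.
full = np.einsum("abc,ia,jb,kc->ijk", core, W_q, W_v, W_o)
y_full = np.einsum("ijk,i,j->k", full, q, v)

tucker_params = W_q.size + W_v.size + W_o.size + core.size
print(tucker_params, full.size)
```

The full tensor never needs to be materialized at training time; only the factors and the core are stored and learned.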

VQA: bilinear fusion

slide-36
SLIDE 36

=

VQA: bilinear fusion

slide-37
SLIDE 37

=

VQA: bilinear fusion

slide-38
SLIDE 38

=

VQA: bilinear fusion

slide-39
SLIDE 39

=

VQA: bilinear fusion

slide-40
SLIDE 40

Other ways of structuring the tensor of parameters:

  • Compact Bilinear Pooling (MCB)
  • Low-Rank Bilinear Pooling (MLB)
  • Tucker Decomposition (MUTAN)
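
To illustrate one of these structures: a low-rank bilinear fusion in the spirit of MLB replaces the full tensor with two projections, a Hadamard (element-wise) product and an output projection. The sketch below omits the nonlinearities used in the published model and uses hypothetical names and toy sizes. It checks that this form equals a full bilinear model whose tensor is T[i, j, k] = sum over r of U[i, r] * V[j, r] * P[r, k].

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_v, d_y, rank = 6, 7, 5, 3   # toy sizes, chosen for illustration

U = rng.standard_normal((d_q, rank))   # question projection
V = rng.standard_normal((d_v, rank))   # image projection
P = rng.standard_normal((rank, d_y))   # output projection

def low_rank_fusion(q, v):
    """Hadamard product of the two projected inputs, then output projection."""
    return ((q @ U) * (v @ V)) @ P

q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)
y = low_rank_fusion(q, v)

# The implied full tensor, built only to verify the equivalence.
full = np.einsum("ir,jr,rk->ijk", U, V, P)
y_full = np.einsum("ijk,i,j->k", full, q, v)

low_rank_params = U.size + V.size + P.size
```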

VQA: bilinear fusion

Ben-younes H.* Cadene R.*, Thome N., Cord M., MUTAN: Multimodal Tucker Fusion for Visual Question Answering, ICCV 2017

slide-41
SLIDE 41

VQA: bilinear fusion [AAAI 2019]

slide-42
SLIDE 42

VQA: BLOCK fusion [AAAI 2019]

A B C

slide-43
SLIDE 43
slide-44
SLIDE 44

VQA: bilinear fusion

Is the lady with the purple fur wearing glasses ?

slide-45
SLIDE 45

Reasoning in VQA

slide-46
SLIDE 46

What is reasoning (for VQA)?

  • Attentional reasoning
  • Relational reasoning
  • Iterative reasoning
  • Compositional reasoning

VQA: reasoning

slide-47
SLIDE 47

What is reasoning (for VQA)?

  • Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
  • Relational reasoning
  • Iterative reasoning
  • Compositional reasoning

VQA: reasoning

slide-48
SLIDE 48

Idea: focusing only on parts of the image relevant to Q

  • Each region scored according to the question
  • Representation = sum of all (weighted) embeddings
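
The two steps above can be sketched as follows. The scoring function used here, a simple bilinear form between each region and the question, is a stand-in chosen for brevity and is an assumption; the region features, the question vector and the weights are random placeholders.

```python
import numpy as np

def softmax(x):
    """Turn raw scores into weights that sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(regions, q, W):
    """Score every region against the question, then take the weighted sum."""
    scores = regions @ W @ q          # one scalar per region
    weights = softmax(scores)         # attention distribution over regions
    attended = weights @ regions      # weighted sum of region embeddings
    return weights, attended

rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 8))   # 5 image regions, 8-d embeddings
q = rng.standard_normal(6)              # 6-d question embedding
W = rng.standard_normal((8, 6))         # hypothetical scoring weights

weights, attended = attend(regions, q, W)
```

The attended vector has the same size as a single region embedding, so it can be fused with the question exactly like a global image representation.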

VQA: attentional reasoning

What is sitting on the desk in front of the boys ?

slide-49
SLIDE 49

VQA: attentional reasoning

What is sitting on the desk in front of the boys ?

GRU ResNet

slide-50
SLIDE 50

VQA: attentional reasoning

What is sitting on the desk in front of the boys ?

GRU ResNet

slide-51
SLIDE 51

MUTAN Fusion MUTAN Fusion

What is sitting on the desk in front of the boys ?

GRU ResNet “laptop” Attention mechanism

An attentional glimpse is used in most recent strategies [MLB, MCB, MUTAN…]

VQA: attentional reasoning

slide-52
SLIDE 52

VQA: attentional reasoning

slide-53
SLIDE 53

Focusing on multiple regions: Multi-glimpse attention

Where is the smoke coming from ?
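
A minimal sketch of multi-glimpse attention, assuming each glimpse has its own scoring weights and that the attended vectors are concatenated. The bilinear scoring and all names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_glimpse_attention(regions, q, W_glimpses):
    """Compute one attention map per glimpse and concatenate the results."""
    # scores[g, n]: relevance of region n for glimpse g
    scores = np.einsum("nd,gde,e->gn", regions, W_glimpses, q)
    weights = softmax(scores, axis=1)   # each glimpse sums to 1 over regions
    glimpses = weights @ regions        # (num_glimpses, d)
    return weights, glimpses.reshape(-1)

rng = np.random.default_rng(0)
n_regions, d, d_q, n_glimpses = 5, 8, 6, 2
regions = rng.standard_normal((n_regions, d))
q = rng.standard_normal(d_q)
W_glimpses = rng.standard_normal((n_glimpses, d, d_q))   # hypothetical weights

weights, visual = multi_glimpse_attention(regions, q, W_glimpses)
```

With two glimpses, one map is free to concentrate on one region (for instance the train) while the other concentrates elsewhere (for instance the smoke).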

VQA: attentional reasoning

slide-54
SLIDE 54

MUTAN Fusion MUTAN Fusion

Where is the smoke coming from ?

GRU ResNet “train” Attention mechanism

Focus on the train Focus on the smoke

VQA: attentional reasoning with Multi-glimpse attention

slide-55
SLIDE 55

VQA: attentional reasoning with Multi-glimpse attention

slide-56
SLIDE 56

VQA: attentional reasoning

Evaluation on the VQA dataset:

  • Best MUTAN score: 67.36% on test-std
  • Human performance: about 83% on this dataset
  • The winner of the VQA Challenge at CVPR 2017 (and CVPR 2018) integrates adaptive grid selection from an additional region-detection learning process

slide-57
SLIDE 57

VQA: attentional reasoning

slide-58
SLIDE 58

What is reasoning (for VQA)?

  • Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
  • Relational reasoning: object detection + mutual relationships (spatial, semantic, …), merging both with Q
  • Iterative reasoning
  • Compositional reasoning

VQA: reasoning

slide-59
SLIDE 59

Determine the answer using relevant objects and relationships

Bottom-up and Relational reasoning

slide-60
SLIDE 60

What is reasoning (for VQA)?

  • Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
  • Relational reasoning: object detection + mutual relationships (spatial, semantic, …), merging both with Q
  • Iterative reasoning: refining the attention step-by-step, each time extracting a different piece of information from the image

VQA: reasoning

slide-61
SLIDE 61

Iterative Reasoning

At least 3 elementary steps are required to answer the question

  • Find bicycles
  • Find the bicycle that has a basket
  • Find what is in this basket

Stacked attention: iteratively refining visual attention and question representation

What are sitting in the basket on a bicycle ?

Zichao Yang et al., Stacked Attention Networks for Image Question Answering, CVPR 2016
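
A simplified sketch of the stacked-attention idea: at each step the current query attends over the regions, and the attended visual vector is added to the query before the next step. Dot-product scoring and a shared dimension for regions and question are simplifying assumptions made here; the published model uses learned layers for scoring.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def stacked_attention(regions, q, num_steps=3):
    """Refine the query over several attention steps; return all attention maps."""
    query = q.copy()
    history = []
    for _ in range(num_steps):
        weights = softmax(regions @ query)   # attention given the current query
        attended = weights @ regions         # visual evidence for this step
        query = query + attended             # refined query for the next step
        history.append(weights)
    return query, history

rng = np.random.default_rng(0)
regions = rng.standard_normal((6, 4))   # 6 regions, 4-d (toy values)
q = rng.standard_normal(4)

final_query, history = stacked_attention(regions, q, num_steps=3)
```

For the bicycle example, the three steps would ideally correspond to finding bicycles, then the one with a basket, then the content of the basket, although nothing in this sketch forces that decomposition.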

slide-62
SLIDE 62

Multimodal Relational Reasoning for VQA

MuRel system:

  • Vector representation for the attention process
  • Spatial and semantic contexts to model relations between image regions
  • Iterative process / multistep reasoning

Cadene et al., MuRel: Multimodal Relational Reasoning for Visual Question Answering, CVPR 2019

slide-63
SLIDE 63

MuRel: Multimodal Relational Reasoning for VQA

Bilinear Fusion Pairwise Relational Modeling

+

skip-connection point-wise addition
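
A loose sketch of a cell combining the three ingredients named on this slide: fusion of each region with the question, pairwise relational modeling, and a skip connection by point-wise addition. Several choices are assumptions made for illustration and should not be read as the published architecture: element-wise products stand in for bilinear fusion, max pooling aggregates over neighbours, and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, d = 4, 6

# Hypothetical projection matrices; the small scale keeps toy values tame.
shapes = {"W_s": (d, d), "W_q": (d, d), "W_m1": (d, d), "W_m2": (d, d),
          "W_b1": (4, d), "W_b2": (4, d)}
params = {name: rng.standard_normal(shape) * 0.1 for name, shape in shapes.items()}

def murel_like_cell(regions, boxes, q, p):
    """One reasoning step over region features, returning updated features."""
    # 1. Fuse every region with the question (element-wise stand-in).
    m = (regions @ p["W_s"]) * (q @ p["W_q"])                        # (n, d)
    # 2. Pairwise relations: semantic (fused features) plus spatial (boxes).
    semantic = (m @ p["W_m1"])[:, None, :] * (m @ p["W_m2"])[None, :, :]
    spatial = (boxes @ p["W_b1"])[:, None, :] * (boxes @ p["W_b2"])[None, :, :]
    relations = semantic + spatial                                   # (n, n, d)
    # 3. Aggregate each region's relations to all regions (max pooling, assumed).
    context = relations.max(axis=1)                                  # (n, d)
    # 4. Skip connection: point-wise addition to the input features.
    return regions + m + context

regions = rng.standard_normal((n_regions, d))   # region features
boxes = rng.random((n_regions, 4))              # box coordinates in [0, 1)
q = rng.standard_normal(d)                      # question embedding

# Iterative process: apply the same cell several times.
state = regions
for _ in range(3):
    state = murel_like_cell(state, boxes, q, params)
```

Because the cell maps region features to region features of the same shape, it can be applied repeatedly, which is what makes the multistep process possible.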

slide-64
SLIDE 64

MuRel: Multimodal Relational Reasoning for VQA

slide-65
SLIDE 65

MuRel: Multimodal Relational Reasoning for VQA

slide-66
SLIDE 66

TDIUC dataset (12 different categories)

VQA v2.0 dataset

slide-67
SLIDE 67

Datasets and challenges

Many initiatives aim to improve datasets and evaluate reasoning, such as:

  • VQA v2.0 [Y. Goyal, D. Batra, D. Parikh, CVPR 2017]
  • TDIUC dataset and challenge (Task Driven Image Understanding Challenge)
  • CLEVR dataset [J. Johnson et al, CVPR 2017]: questions about visual reasoning including attribute identification, counting, comparison, spatial relationships, and logical operations
  • GQA dataset (2019) for compositional question answering over real-world images: 22M diverse reasoning questions generated from a scene graph
  • Visual dialogue task: to hold a dialog with humans in natural, conversational language about visual content

Are there an equal number of large things and metal spheres?

slide-68
SLIDE 68

MLIA/Chordettes team: Matthieu Cord http://webia.lip6.fr/~cord

  • A. Dapogny (Postdoc)
  • PhD: T. Robert, T. Mordan, H. BenYounes, R. Cadene, E. Mehr, M. Engilberge, Y. Chen, A. Saporta
  • N. Thome (CNAM Pr 10% associate)

CVPR 2019 MUREL: Multimodal Relational Reasoning for Visual Question Answering, R. Cadene, H. Ben-younes, N. Thome, M. Cord

AAAI 2019 BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection, H. Ben-younes, R. Cadene, N. Thome, M. Cord

ICCV 2017 MUTAN: Multimodal Tucker Fusion for Visual Question Answering, H. Ben-Younes*, R. Cadene*, N. Thome, M. Cord

PyTorch code: https://github.com/Cadene

Our Deep Recipe Reco on your mobile: visiir.lip6.fr