Multimodal Compact Bilinear Pooling for VQA



SLIDE 1

Multimodal Compact Bilinear Pooling for VQA

Akira Fukui¹,², Dong Huk Park¹, Daylen Yang¹, Anna Rohrbach¹,³, Trevor Darrell¹, Marcus Rohrbach¹

¹ UC Berkeley EECS, CA, United States
² Sony Corp., Tokyo, Japan
³ Max Planck Institute for Informatics, Saarbrücken, Germany

SLIDE 2

Multimodal language and visual understanding

A table full of food for a feast

Description

SLIDE 3

Multimodal language and visual understanding

The bowl with the brown sauce

Grounding

SLIDE 4

Multimodal language and visual understanding

Gravy

Visual Question Answering

What is the brown sauce?

SLIDE 5

How to Combine Image Representation and Question Representation?

[Figure: image → CNN; question "Is this going to be a feast?" → LSTM; answer vocabulary: spoon, plate, bowl, table, food, corn, …, person; predicted answer: "Yes"]

  ☐ All elements can interact
  ☐ Multiplicative interaction

SLIDE 6

How to Combine Image Representation and Question Representation?

Concatenate

[Figure: CNN image features and LSTM question features for "Is this going to be a feast?" are concatenated and passed through FC and ReLU layers; predicted answer: "Yes"]

  ☑ All elements can interact
  ☐ Multiplicative interaction

  • Difficult to learn output classification

SLIDE 7

How to Combine Image Representation and Question Representation?

Elementwise Multiplication

[Figure: CNN image features and LSTM question features for "Is this going to be a feast?" are combined by elementwise multiplication (⨀), with FC and ReLU layers; predicted answer: "Yes"]

  ☐ All elements can interact
  ☑ Multiplicative interaction

  • Difficult to learn input embedding

SLIDE 8

How to Combine Image Representation and Question Representation?

Outer Product / Bilinear Pooling

[Figure: the outer product of CNN image features and LSTM question features for "Is this going to be a feast?" feeds an FC layer; predicted answer: "Yes"]

  ☑ All elements can interact
  ☑ Multiplicative interaction

[Lin ICCV 2015]

[Lin ICCV 2015] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015
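
The three fusion options on the preceding slides can be compared on toy vectors. The sketch below is illustrative only: it assumes NumPy, the function names are hypothetical, and the FC and ReLU layers from the figures are left out so that only the combination step is shown.

```python
import numpy as np

def concat_fusion(v, q):
    # Concatenate: both vectors are kept, but no products between them are formed.
    return np.concatenate([v, q])

def elementwise_fusion(v, q):
    # Elementwise multiplication: v[i] is only ever multiplied with q[i].
    return v * q

def bilinear_fusion(v, q):
    # Outer product: every v[i] is multiplied with every q[j], then flattened.
    return np.outer(v, q).ravel()

v = np.array([1.0, 2.0, 3.0])     # toy image feature (assumed values)
q = np.array([10.0, 20.0, 30.0])  # toy question feature (assumed values)

print(concat_fusion(v, q).shape)       # (6,)
print(elementwise_fusion(v, q).shape)  # (3,)
print(bilinear_fusion(v, q).shape)     # (9,)
```

The output sizes grow as n + m, n, and n * m respectively, which is why the outer product becomes expensive for 2048-dimensional inputs.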

SLIDE 9

How to Combine Image Representation and Question Representation?

Outer Product / Bilinear Pooling [Lin ICCV 2015]

[Figure: 2048-d image features ⊗ 2048-d question features → 4 million activations → FC (4 million x 1000) → predicted answer: "Yes"]

  ☑ All elements can interact
  ☑ Multiplicative interaction
  ☐ High #activations & computation
  ☐ High #parameters

[Lin ICCV 2015] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015

SLIDE 10

Multimodal Compact Bilinear Pooling

Compact Bilinear Pooling [Gao CVPR 16]

[Figure: 2048-d image features and 2048-d question features → MCB → 16k → FC (16k x 1000) → predicted answer: "Yes"]

  ☑ All elements can interact
  ☑ Multiplicative interaction
  ☑ Low #activations & computation
  ☑ Low #parameters

[Gao CVPR 16] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. CVPR 2016
[ICLR Workshops 2016] N. Zhang, E. Shelhamer, Y. Gao, T. Darrell. Fine-grained pose prediction, normalization, and recognition. ICLR Workshops 2016
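
The size figures quoted for full bilinear pooling (2048 x 2048 inputs, about 4 million activations, a 4 million x 1000 FC layer) and for the compact version (16k x 1000) can be checked with a few lines of arithmetic. This is a sketch with hypothetical helper names; it assumes "16k" means 16000 and counts FC weights only, ignoring biases.

```python
def bilinear_cost(dim_image, dim_question, num_outputs):
    # Full outer product: one activation per (i, j) pair, each connected to every output.
    activations = dim_image * dim_question
    fc_weights = activations * num_outputs
    return activations, fc_weights

def compact_cost(sketch_dim, num_outputs):
    # Compact pooling: the FC layer only sees the sketch_dim-dimensional projection.
    return sketch_dim, sketch_dim * num_outputs

full_act, full_params = bilinear_cost(2048, 2048, 1000)
mcb_act, mcb_params = compact_cost(16000, 1000)

print(full_act, full_params)      # 4194304 4194304000
print(mcb_act, mcb_params)        # 16000 16000000
print(full_params / mcb_params)   # 262.144
```

Under these assumptions the FC layer after compact pooling has roughly 260 times fewer weights than the one after the full outer product.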

SLIDE 11

Multimodal Compact Bilinear Pooling

[Figure: image → CNN; question "Is this going to be a feast?" → LSTM; fused representation → FC (16k x 1000) → predicted answer: "Yes"]

  ☑ All elements can interact
  ☑ Multiplicative interaction
  ☐ Low #activations & computation
  ☑ Low #parameters

Random Projection: Count Sketch Ψ

Pham & Pagh (2013): Ψ(x ⊗ z) = Ψ(x) ∗ Ψ(z)

[Countsketch] M. Charikar, K. Chen, M. Farach-Colton. Finding frequent items in data streams. Automata, languages and programming 2002.
[Pham&Pagh 13] N. Pham, R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD 2013

SLIDE 12

Multimodal Compact Bilinear Pooling

[Figure: image → CNN → Ψ; question "Is this going to be a feast?" → LSTM → Ψ; the two sketches are combined by convolution → FC (16k x 1000) → predicted answer: "Yes"]

  ☑ All elements can interact
  ☑ Multiplicative interaction
  ☑ Low #activations & computation
  ☑ Low #parameters

Pham & Pagh (2013): Ψ(x ⊗ z) = Ψ(x) ∗ Ψ(z)

[Countsketch] M. Charikar, K. Chen, M. Farach-Colton. Finding frequent items in data streams. Automata, languages and programming 2002.
[Pham&Pagh 13] N. Pham, R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD 2013

SLIDE 13

Multimodal Compact Bilinear Pooling

[Figure: image → CNN → Ψ → FFT; question "Is this going to be a feast?" → LSTM → Ψ → FFT; elementwise product (⨀) of the two spectra → FFT⁻¹, which implements the convolution → FC (16k x 1000) → predicted answer: "Yes"]

  ☑ All elements can interact
  ☑ Multiplicative interaction
  ☑ Low #activations & computation
  ☑ Low #parameters

Pham & Pagh (2013): Ψ(x ⊗ z) = Ψ(x) ∗ Ψ(z)

[Countsketch] M. Charikar, K. Chen, M. Farach-Colton. Finding frequent items in data streams. Automata, languages and programming 2002.
[Pham&Pagh 13] N. Pham, R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD 2013
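
The pipeline on these slides (Count Sketch Ψ of each input, FFT, elementwise product, inverse FFT) can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function names and toy sizes are assumptions, and the hash and sign vectors are simply drawn once from a seeded generator. The last function builds the Count Sketch of the explicit outer product so the Pham & Pagh identity can be checked numerically.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    # Fixed random hash h: index -> bucket in [0, d) and sign s: index -> {-1, +1}.
    h = rng.integers(0, d, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    return h, s

def count_sketch(x, h, s, d):
    # Psi(x): accumulate s[i] * x[i] into bucket h[i] of a d-dimensional vector.
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(x, z, params_x, params_z, d):
    # Circular convolution of the two sketches: FFT, elementwise product, inverse FFT.
    fx = np.fft.rfft(count_sketch(x, *params_x, d))
    fz = np.fft.rfft(count_sketch(z, *params_z, d))
    return np.fft.irfft(fx * fz, n=d)

def sketch_of_outer_product(x, z, params_x, params_z, d):
    # Reference: Count Sketch of the explicit outer product, using the combined
    # hash (h_x[i] + h_z[j]) mod d and the combined sign s_x[i] * s_z[j].
    (hx, sx), (hz, sz) = params_x, params_z
    y = np.zeros(d)
    for i in range(len(x)):
        for j in range(len(z)):
            y[(hx[i] + hz[j]) % d] += sx[i] * sz[j] * x[i] * z[j]
    return y

rng = np.random.default_rng(0)
n, d = 8, 16  # toy sizes; the slides use 2048-d inputs and a 16k output
x, z = rng.standard_normal(n), rng.standard_normal(n)
params_x = make_sketch_params(n, d, rng)
params_z = make_sketch_params(n, d, rng)

fast = mcb(x, z, params_x, params_z, d)
slow = sketch_of_outer_product(x, z, params_x, params_z, d)
print(np.allclose(fast, slow))  # True
```

The fast path never materializes the n x n outer product, which is the source of the savings in activations and computation.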

SLIDE 14

Related work

  • Alternative approach to multiplicative interactions
  • DPP Net: Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

SLIDE 15

Experimental setup (without Attention)

  • Solver
      • Cross-entropy loss, Adam, learning rate 0.0007
  • Feature Extraction
      • ResNet 152, image: 448x448
  • Answers
      • 3000 most frequent on train
      • Sampling with probability of answers
  • Trained on train / validated on val / tested on test-dev

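[Figure: architecture without attention
  Question ("Is this going to be a feast?") → Embed, Tanh (13k ~ 20k → 300) → LSTM, drop (1024) → LSTM, drop (1024) → 2048
  Image → CNN (ResNet152) → 2048 → L2 Normalization
  Both 2048-d vectors → MCB → 16k → Signed Sqrt → L2 Normalization → Fully Connected (16k → 3000) → Softmax → "Yes"]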
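
The two normalization steps applied to the MCB output in the figure, Signed Sqrt followed by L2 Normalization, can be written as below. This is a sketch under assumptions: NumPy, hypothetical function names, and a small `eps` added to guard against division by zero, which the slides do not mention.

```python
import numpy as np

def signed_sqrt(x):
    # Keep the sign of each entry and take the square root of its magnitude.
    return np.sign(x) * np.sqrt(np.abs(x))

def l2_normalize(x, eps=1e-12):
    # Scale the vector to unit Euclidean length; eps is an assumed safeguard.
    return x / (np.linalg.norm(x) + eps)

features = np.array([4.0, -9.0, 0.0, 16.0])  # toy stand-in for a 16k MCB output
rooted = signed_sqrt(features)
normalized = l2_normalize(rooted)

print(rooted)                      # [ 2. -3.  0.  4.]
print(np.linalg.norm(normalized))  # approximately 1.0
```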

SLIDE 16

Ablation: Comparison to other multimodal methods

[Bar chart] Trained on train, evaluated on test-dev

Method                            test-dev Acc. [%]
MCB (128x128 -> 4k)               58.7
Full Bilinear (128x128 -> 16k)    58.5
MCB (2048x2048 -> 16k)            59.8
Eltwise Product + FC + FC         57.8
Eltwise Product + FC              56.4
Eltwise Product                   58.6
Concat + FC + FC                  57.1
Concat + FC                       58.4
Concat                            57.5
Eltwise Sum                       56.5

  • MCB achieves highest accuracy
SLIDE 17

Ablation: Comparison to other multimodal methods

[Bar chart: same data as on Slide 16]

  • MCB comparable to Full Bilinear: MCB (128x128 -> 4k) 58.7 vs. Full Bilinear (128x128 -> 16k) 58.5
SLIDE 18

Dimensionality of MCB

  • Dimensionality of MCB determines the quality of the outer product approximation

[Figure: two 2048-d inputs → MCB → output of dimension "Dim size"]

VQA Open-Ended test-dev accuracy

MCB dim size    test-dev Acc. [%]
1024            58.4
2048            58.8
4096            59.4
8192            59.7
16000           59.8
32000           59.7
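
The link between output dimension and approximation quality can be illustrated numerically. For unit vectors x and z the outer product has squared norm 1, so the deviation of the squared norm of the MCB output from 1 gives a rough error measure. The sketch below is an assumption-laden toy experiment (NumPy, 64-d inputs, hypothetical function names, random hashes redrawn per trial); it measures approximation error only, not VQA accuracy.

```python
import numpy as np

def mcb(x, z, h1, s1, h2, s2, d):
    # Count Sketch both inputs, then circular convolution via FFT.
    sx = np.zeros(d)
    np.add.at(sx, h1, s1 * x)
    sz = np.zeros(d)
    np.add.at(sz, h2, s2 * z)
    return np.fft.irfft(np.fft.rfft(sx) * np.fft.rfft(sz), n=d)

def mean_norm_error(n, d, trials, rng):
    # Average | ||MCB(x, z)||^2 - 1 | over random unit vectors and random hashes.
    errs = []
    for _ in range(trials):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)
        z = rng.standard_normal(n)
        z /= np.linalg.norm(z)
        h1, h2 = rng.integers(0, d, n), rng.integers(0, d, n)
        s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)
        y = mcb(x, z, h1, s1, h2, s2, d)
        errs.append(abs(y @ y - 1.0))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
errors = {d: mean_norm_error(n=64, d=d, trials=50, rng=rng) for d in (64, 1024, 16000)}
for d, err in errors.items():
    print(d, round(err, 4))
```

The error is expected to shrink as the sketch dimension grows, while the accuracy table above suggests that the benefit for VQA levels off around 16000.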

SLIDE 19

Multimodal language and visual understanding

Gravy

Visual Question Answering

What is the brown sauce?

SLIDE 20

MCB with Attention

  • Predict spatial attentions with MCB

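[Figure: attention architecture for the question "What is the yellow food?"
  Image → CNN (ResNet152) → 2048x14x14
  Question → WE, LSTM → 2048 → Tile → 2048x14x14
  Attention: MCB → 16k x 14 x 14 → Signed Sqrt → L2 Normalization → Conv, Relu → 512 x 14 x 14 → Conv → Softmax → 1 x 14 x 14 → Weighted Sum over the image features → 2048
  Answer: MCB of the attended 2048-d image feature and the 2048-d question feature → 16k → Signed Sqrt → L2 Normalization → Fully Connected (16k → 3000) → Softmax → "Corn"]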

Attention for captioning:

  • K. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Attention for VQA:

  • H. Xu, K. Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
  • J. Lu et al. Hierarchical Question-Image Co-Attention for Visual Question Answering
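
The final attention steps on this slide, a Softmax over the 14 x 14 grid followed by a Weighted Sum of the image features, can be sketched as follows. The code assumes NumPy and hypothetical function names; the score map that the MCB and Conv layers would produce is replaced by random numbers, so only the pooling mechanics are shown.

```python
import numpy as np

def spatial_softmax(scores):
    # Softmax over all H * W locations of an (H, W) score map.
    flat = scores.ravel()
    e = np.exp(flat - flat.max())
    return (e / e.sum()).reshape(scores.shape)

def attend(features, scores):
    # features: (C, H, W); scores: (H, W).
    # Returns the attention map and the (C,) attention-weighted sum of features.
    attn = spatial_softmax(scores)
    pooled = np.tensordot(features, attn, axes=([1, 2], [0, 1]))
    return attn, pooled

rng = np.random.default_rng(0)
features = rng.standard_normal((2048, 14, 14))  # stand-in for ResNet152 features
scores = rng.standard_normal((14, 14))          # stand-in for the Conv output

attn, pooled = attend(features, scores)
print(attn.shape, pooled.shape)  # (14, 14) (2048,)
```

When one location dominates the score map, the pooled vector approaches the feature vector at that location, which is what the attention visualizations depict.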
SLIDE 21

Attention Visualizations

Is this person wearing a hat? Yes [Groundtruth: Yes]

SLIDE 22

Results on MCB with Attention

Performance of Attention Models

  • MCB performs well with Attention

Model                        test-dev Acc. [%]
Concat + FC                  58.4
MCB                          59.8
Concat + FC + Attention      58.4
MCB + Attention              62.5

SLIDE 23

Techniques to improve performance

  • Data Augmentation
      • VQA data from Visual Genome Dataset
      • Additional 1M question and answer pairs
      • Removed articles, single-word answers
  • Ensembles
      • Average the output of Softmax over models

VQA Open-Ended accuracy for genome and ensemble

Model (training data)                                                 test-dev Acc. [%]
MCB + Attention (train)                                               62.5
MCB + Attention (train + val)                                         64.2
MCB + Attention (train + val + genome)                                65.1
MCB + Attention + Ensemble (train + val + genome)                     66.7
MCB + Attention + Ensemble (train + val + genome), Multiple Choice    70.2

Visual Genome: Connecting language and vision using crowdsourced dense image annotations.
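
The ensembling step, averaging the Softmax output over models, can be sketched as below. The code assumes NumPy; the function names and the toy logits for three models and three candidate answers are made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with the usual max subtraction for numerical stability.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model):
    # logits_per_model: (num_models, num_answers).
    # Average the per-model probabilities, then pick the most likely answer.
    probs = softmax(np.asarray(logits_per_model, dtype=float))
    mean_probs = probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

# Toy logits (assumed): model 0 prefers answer 0, models 1 and 2 prefer answer 1.
logits = [[2.0, 1.0, 0.0],
          [0.0, 3.0, 0.0],
          [1.0, 1.5, 0.0]]
answer, mean_probs = ensemble_predict(logits)
print(answer)  # 1
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh several uncertain ones.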

SLIDE 24

MCB on other Datasets and Tasks

  • Visual7W (Multiple Choice)
  • Visual Grounding

Our architecture for Visual7W: MCB with Attention and Answer Encoding.

Accuracy on Visual7W

Method                   Accuracy
Zhu et al.               54.3
Concat + Attention       52.8
MCB + Attention          62.2

Accuracy on Flickr30k Entities

Method                   Accuracy
Plummer et al.           43.8
Wang et al.              43.9
Rohrbach et al.          47.7
Concat                   46.5
Eltwise Prod             47.4
Eltwise Prod + Conv      47.9
MCB                      48.7

Visual7W: Grounded Question Answering in Images
Grounding of textual phrases in images by reconstruction.

SLIDE 25

Examples for VQA

SLIDE 26

What is the woman feeding the giraffe? Carrot [Groundtruth: Carrot]

Attention Visualizations

SLIDE 27

What color is her shirt? Purple [Groundtruth: Purple]

Attention Visualizations

SLIDE 28

What is her hairstyle for the picture? Ponytail [Groundtruth: Ponytail]

Attention Visualizations

SLIDE 29

What color is the chain on the red dress? Pink [Groundtruth: Gold]

Attention Visualizations

  • Correct Attention, Incorrect Fine-grained Recognition
SLIDE 30

Is the man going to fall down? No [Groundtruth: No]

Attention Visualizations

SLIDE 31

What is the surface of the court made of? Clay [Groundtruth: Clay]

Attention Visualizations

SLIDE 32

What sport is being played? Tennis [Groundtruth: Tennis]

Attention Visualizations

SLIDE 33

What does the shop sell? Clocks [Groundtruth: Hot Dogs]

Attention Visualizations

  • Incorrect Attention
SLIDE 34

What credit card company is on the banner in the background? Budweiser [Groundtruth: Mastercard]

Attention Visualizations

  • Correct Attention, Incorrect Concept Association
SLIDE 35

Conclusions

  • Multimodal Compact Bilinear Pooling
      • All elements interact multiplicatively
      • Compact and efficient
  • MCB with Attention
      • Successfully predicts spatial attention
  • Generalization Capability
      • Performance improvement in other vision and language tasks
      • Visual7W, Visual Grounding
  • Compatible with other models
      • Applicable to general multimodal tasks, not only vision and language
SLIDE 36

Thank you for your attention!

Demo: demo.berkeleyvision.org
Code: https://github.com/akirafukui/vqa-mcb/

Multimodal Compact Bilinear Pooling for Visual Question Answering and Grounding. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach. arXiv 2016