Ask Your Neurons: A Neural-based Approach to Answering Questions about Images



SLIDE 1

Mateusz Malinowski [1] Marcus Rohrbach [2] Mario Fritz [1]

Ask Your Neurons: A Neural-based Approach to Answering Questions about Images

[1] Max Planck Institute for Informatics [2] Berkeley University of California, ICSI

SLIDE 2
  • M. Malinowski, M. Rohrbach, M. Fritz. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images

Human-like Comprehension

Is the water boiling?

[Figure: image shown as raw bits (011101011100 1011000100100 010011110000) ≠ human-like comprehension]

  • How far are machines from human quality understanding?
  • How can we monitor progress and evaluate architectures?
SLIDE 3

Visual Turing Test (NIPS’14)

  • Holistic, open-ended task
  • Visual scene understanding
  • Natural language understanding
  • Deduction
  • No internal representation is evaluated
  • Challenge is open to diverse approaches
  • Scalable annotation and evaluation effort
  • Only question-answer pairs

Example question-answer pairs:
  • What is behind the table? sofa
  • What color are the cabinets? brown
  • How many lamps are there? 2
  • What is on the refrigerator? magnet, paper

SLIDE 4

Related Work

  • Symbolic-based Approaches
  • Large Scale Datasets
  • Neural-based Approaches
  • Attention-based Approaches
  • Hybrid Approaches

Works cited on the slide:

  • S. Antol et al. Visual QA. ICCV’15
  • L. Yu et al. Visual Madlibs. ICCV’15
  • D. Geman et al. Visual Turing Test. PNAS’15
  • M. Ren et al. Image QA. NIPS’15
  • H. Gao et al. Are You Talking to a Machine? NIPS’15
  • Y. Zhu et al. Visual7W. arXiv’15
  • L. Zhu et al. Uncovering Temporal Context. arXiv’15
  • M. Malinowski et al. Multiworld. NIPS’14
  • L. Ma et al. Learning to Answer Questions From Images. arXiv’15
  • Z. Yang et al. Stacked Attention Networks. arXiv’15
  • J. Andreas et al. Deep Compositional QA. arXiv’15
  • H. Xu et al. Ask, Attend and Answer. arXiv’15
  • K. Chen et al. ABC-CNN. arXiv’15
  • K. J. Shih et al. Where To Look. arXiv’15
  • H. Noh et al. Dynamic Parameter Prediction. arXiv’15

[Figures: thumbnails from the cited works, including a semantic parse of "What …?" into λx.Behind(x,Table) over facts such as chair(1, brown, position X, Y, Z); CNN + LSTM answer generators with a softmax over answer words; two stacked attention layers over feature vectors of different parts of the image; and a layout parser for "Where is the dog?"]
SLIDE 5

Outline

  • Neural approach to answer questions about images
      [Figure: a CNN encodes the image and a chain of LSTM units reads "What is behind the table ?" and then emits "chair window <end>"]
  • Performance metrics based on additional annotations
      What is the object on the floor in front of the wall?
        Human 1: bed
        Human 2: shelf
        Human 3: bed
        Human 4: bookshelf

SLIDE 6

Method: Ask Your Neurons

[Figure: a CNN encodes the image and a chain of LSTM units reads the question "What is behind the table ?" and then emits the answer "chairs window <end>"]

SLIDE 7

Method: Ask Your Neurons

  • Predicting answer sequence
  • Recursive formulation

$$\hat{a}_t = \arg\max_{a \in \mathcal{V}} \; p(a \mid x, q, \hat{A}_{t-1}; \theta)$$

where
  • $x$ is the image representation
  • $q = [q_1, \ldots, q_{n-1}, \llbracket ? \rrbracket]$ is the question, with $q_j$ a question word index and the final token $\llbracket ? \rrbracket$ encoding the question mark
  • $\hat{A}_{t-1} = \{\hat{a}_1, \ldots, \hat{a}_{t-1}\}$ is the set of previous answer words
  • $\mathcal{V}$ is the vocabulary

[Figure: the CNN image representation x and the question words ..., q_{n-1}, q_n are fed to LSTM units that emit the answer words a_1, ..., a_t, with a_{t-1} fed back as input]
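
The recursive formulation can be read as a greedy decoding loop. The sketch below is a minimal illustration, not the authors' implementation: `next_word_probs` is a hypothetical stand-in for the trained CNN + LSTM, and `toy_model`, the `<end>` stopping rule, and `max_len` are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # end-of-answer token, as shown on the slides

# Hypothetical interface: (image x, question q, previous answer words) -> p(a | ...)
Model = Callable[[Sequence[float], List[str], List[str]], Dict[str, float]]


def greedy_answer(next_word_probs: Model, x: Sequence[float], q: List[str],
                  max_len: int = 10) -> List[str]:
    """Repeatedly pick a_t = argmax_a p(a | x, q, A_{t-1}) until <end>."""
    answer: List[str] = []
    for _ in range(max_len):
        probs = next_word_probs(x, q, answer)
        best = max(probs, key=probs.get)  # arg max over the vocabulary
        if best == END:
            break
        answer.append(best)  # the prediction becomes part of A_t
    return answer


def toy_model(x: Sequence[float], q: List[str], prev: List[str]) -> Dict[str, float]:
    """Scripted stand-in for the network: favours 'chairs', 'window', then <end>."""
    script = ["chairs", "window", END]
    target = script[min(len(prev), len(script) - 1)]
    return {w: (0.8 if w == target else 0.1) for w in script}


question = ["What", "is", "behind", "the", "table", "?"]
print(greedy_answer(toy_model, x=[0.0], q=question))
```

Each predicted word is appended to the list of previous answer words before the next step, which is what makes the formulation recursive.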
SLIDE 12

Symbolic vs Neural-based Approaches

  • Symbolic approach (NIPS’14)
      • Explicit representation
      • Independent components
          • Detectors, Semantic Parser, Database
      • Components trained separately
      • Many ‘hard’ design decisions
  • Ask Your Neurons (Our)
      • Implicit representation
      • End-to-end formulation
          • From images and questions to answers
      • Joint training
      • Fewer design decisions

End-to-end, jointly trained architecture

[Figure: symbolic pipeline, where "What is behind the table ?" is mapped to the logical representation λx.Behind(x,Table) and evaluated against a knowledge base to give "chairs, window"; neural pipeline, where a CNN and LSTM units read "What is … ?" and emit "chairs window <end>"]

  • M. Malinowski et al. “A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input”. NIPS’14

SLIDE 14

Neural Visual QA vs Neural Image Description

  • Neural Image Description
      • Conditions on an image
      • Generates a description
          • Sequence of words
      • Loss at every step
  • Ask Your Neurons (Our)
      • Conditions on an image and a question
      • Generates an answer
          • Sequence of answer words
      • Loss only at answer words

[Figure: image description, where a CNN and LSTM units generate "Large building with a clock <end>" with a loss at every word; Ask Your Neurons, where a CNN and LSTM units read "What is … ?" and generate "chairs window <end>" with a loss only at the answer words]

  • J. Donahue et al. “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”. CVPR’15
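
The difference between "loss at every step" and "loss only at answer words" can be sketched as a masked negative log-likelihood. This is an illustrative sketch under assumptions: the per-step probabilities are made-up numbers, and averaging over the kept steps (rather than summing) is an assumption, not something the slides state.

```python
import math
from typing import Sequence


def masked_nll(p_correct: Sequence[float], keep: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the steps where keep is True."""
    terms = [-math.log(p) for p, k in zip(p_correct, keep) if k]
    if not terms:
        raise ValueError("mask selects no steps")
    return sum(terms) / len(terms)


# Hypothetical probabilities assigned to the correct word at each step
# of the sequence "What is ... ? chairs window <end>".
tokens = ["What", "is", "...", "?", "chairs", "window", "<end>"]
p_correct = [0.2, 0.5, 0.4, 0.9, 0.6, 0.3, 0.8]
is_answer = [False, False, False, False, True, True, True]

# Image description style: every step contributes to the loss.
description_style = masked_nll(p_correct, [True] * len(tokens))
# Question answering style: only the answer words contribute.
qa_style = masked_nll(p_correct, is_answer)

print(f"every step: {description_style:.3f}, answer words only: {qa_style:.3f}")
```

With the mask, the question words still condition the LSTM state but are never themselves prediction targets.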

SLIDE 15

Visual Turing Test: DAQUAR (NIPS’14)

Example question-answer pairs:
  • What is behind the table? sofa
  • How many doors are open? 1
  • What is the object on the counter in the corner? microwave

  • Dataset for Question Answering on Real-world images
  • 1449 RGBD indoor images (NYU-Depth V2 dataset)
  • 12.5k question-answer pairs about colors, numbers, objects
  • Human-type subjectivity is common in the dataset
SLIDE 16

Evaluation: WUPS (NIPS’14)

[Figure: ground truth vs. predictions (Armchair, Wardrobe, Chair) scored under three measures: Accuracy, Wu-Palmer Similarity [1], and WUPS @0.9 (NIPS’14)]

[1] Wu, Z., Palmer, M.: Verbs semantics and lexical selection. ACL. 1994.
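
To make the three measures concrete, the sketch below computes Wu-Palmer similarity, 2·depth(lcs) / (depth(a) + depth(b)), on a tiny hand-made taxonomy and then applies a 0.9 threshold. Everything in it is an assumption for illustration: the taxonomy is hypothetical (the real measure is computed over WordNet, so the numbers differ), and the rule of scaling below-threshold similarities by 0.1 follows our reading of the NIPS’14 DAQUAR paper rather than anything shown on this slide.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); not WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "object": "entity",
    "artifact": "object",
    "furniture": "artifact",
    "seat": "furniture",
    "chair": "seat",
    "armchair": "chair",
    "wardrobe": "furniture",
}


def path_to_root(word: str) -> List[str]:
    """Return [word, parent, ..., root]; its length is the depth of word."""
    path = [word]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def wup(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    lcs = next(node for node in path_a if node in ancestors_b)  # deepest common ancestor
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def mu(a: str, b: str, threshold: float = 0.9) -> float:
    """Thresholded similarity: scores below the threshold are scaled by 0.1."""
    score = wup(a, b)
    return score if score >= threshold else 0.1 * score


for prediction in ["armchair", "chair", "wardrobe"]:
    exact = float(prediction == "armchair")
    print(prediction, exact, round(wup("armchair", prediction), 3),
          round(mu("armchair", prediction), 3))
```

In this toy setting exact-match accuracy treats "chair" and "wardrobe" as equally wrong for the ground truth "armchair", plain Wu-Palmer gives both a fairly high score, and the thresholded variant keeps "chair" close to correct while pushing "wardrobe" towards zero.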

SLIDE 17

Results on Full DAQUAR

Methods                        Accuracy   WUPS @0.9
Baseline: Symbolic (NIPS’14)   7.86%      11.86%
Language Only (Our)            17.15%     22.80%
Vision + Language (Our)        19.43%     25.28%
Human performance (NIPS’14)    50.20%     50.82%

Example question-answer pairs:
  • What is on the refrigerator? magnet, paper
  • What is the color of the comforter? blue, white
  • How many drawers are there? 3
  • What is the largest object? bed

SLIDE 20

Qualitative Results

  • What is on the right side of the cabinet?
      Vision + Language: bed
      Language Only: bed
  • What objects are found on the bed?
      Vision + Language: bed sheets, pillow
      Language Only: doll, pillow
  • How many burner knobs are there?
      Vision + Language: 4
      Language Only: 6

SLIDE 21

Qualitative Results: Failure Cases

  • How many chairs are there?
      Vision + Language: 1
  • How many glass cups are there?
      Vision + Language: 2
  • What is on the left side of the bed?
      Vision + Language: night stand

Language Only and Human answers, in extracted order: 2, 4, 4, 6, night stand, ball

SLIDE 22

1. New Performance Metric: Min Consensus

  • WUPS handles word-level ambiguities
  • But how to embrace many possible interpretations of both a question and a scene?

What is the object on the floor in front of the wall?
  Human 1: bed
  Human 2: shelf
  Human 3: bed
  Human 4: bookshelf

SLIDE 23

1. New Performance Metric: Min Consensus

  • We extend WUPS scores by Min Consensus
  • Finding at least one human answer that matches the predicted one
  • Treats all possible interpretations equally

$$\frac{1}{N} \sum_{i=1}^{N} \max_{k=1}^{K} \left( \min\left\{ \prod_{a \in A^i} \max_{t \in T^i_k} \mu(a, t),\ \prod_{t \in T^i_k} \max_{a \in A^i} \mu(a, t) \right\} \right) \qquad (10)$$

What is the object on the floor in front of the wall?
  Human 1: bed
  Human 2: shelf
  Human 3: bed
  Human 4: bookshelf
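
The Min Consensus score can be sketched directly from the formula: for each question, take the best-matching human answer set and average over questions. This is a minimal sketch, not the authors' evaluation script; `exact_match` is a stand-in for the similarity μ (the slides use a Wu-Palmer based score), and the function and variable names are assumptions.

```python
from math import prod
from typing import Callable, List, Sequence, Set

Similarity = Callable[[str, str], float]


def exact_match(a: str, t: str) -> float:
    """Stand-in for mu: 1 for identical words, 0 otherwise."""
    return 1.0 if a == t else 0.0


def set_score(A: Set[str], T: Set[str], mu: Similarity = exact_match) -> float:
    """min{ prod_{a in A} max_{t in T} mu(a,t), prod_{t in T} max_{a in A} mu(a,t) }.

    Assumes both sets are non-empty.
    """
    left = prod(max(mu(a, t) for t in T) for a in A)
    right = prod(max(mu(a, t) for a in A) for t in T)
    return min(left, right)


def min_consensus(answers: Sequence[Set[str]],
                  humans: Sequence[List[Set[str]]],
                  mu: Similarity = exact_match) -> float:
    """Average over questions of the best score against any human answer set."""
    per_question = [max(set_score(A, T, mu) for T in Ts)
                    for A, Ts in zip(answers, humans)]
    return sum(per_question) / len(per_question)


# "What is the object on the floor in front of the wall?"
humans = [[{"bed"}, {"shelf"}, {"bed"}, {"bookshelf"}]]
print(min_consensus([{"shelf"}], humans))  # matches Human 2
print(min_consensus([{"table"}], humans))  # matches nobody
```

Because of the max over human answers, predicting "shelf" is scored as fully correct even though only one of the four annotators gave that answer.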
SLIDE 24

Results on DAQUAR-Consensus

Methods (Min Consensus)        Accuracy   WUPS @0.9
Language Only (Our)            22.56%     30.93%
Vision + Language (Our)        26.53%     34.87%
Human performance (Our)        60.50%     69.65%

Methods (Old Metric)           Accuracy   WUPS @0.9
Language Only (Our)            17.15%     22.80%
Vision + Language (Our)        19.43%     25.28%
Human performance (NIPS’14)    50.20%     50.82%

SLIDE 25

Results on DAQUAR-Consensus

  • What is in front of the curtain?
      Model: chair
      Human 1: guitar
      Human 2: chair
  • What color are the beds?
      Model: white
      Human 1: white
      Human 2: pink
  • How many steel chairs are there?
      Model: 4
      Human 1: 2
      Human 2: 4
  • What is the largest object?
      Model: bed
      Human 1: bed
      Human 2: quilt

SLIDE 26

2. New Performance Metric: Average Consensus

  • We extend WUPS scores by Average Consensus
  • Averaging over multiple possible human answers
  • Encourages the most agreeable answers

$$\frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \min\left\{ \prod_{a \in A^i} \max_{t \in T^i_k} \mu(a, t),\ \prod_{t \in T^i_k} \max_{a \in A^i} \mu(a, t) \right\} \qquad (9)$$

What is in front of table?
  Human 1: chair
  Human 2: chair
  Human 3: chair, bag
  Human 4: wall

For the Average Consensus: answer chair is better than wall
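
The "chair is better than wall" example can be checked with a short sketch of the Average Consensus formula. As before, this is an illustration under assumptions rather than the authors' code: `exact_match` replaces the Wu-Palmer based μ, and the per-question mean over human answers equals the 1/(NK) normalisation only when every question has the same number K of annotators.

```python
from math import prod
from typing import Callable, List, Sequence, Set

Similarity = Callable[[str, str], float]


def exact_match(a: str, t: str) -> float:
    """Stand-in for mu: 1 for identical words, 0 otherwise."""
    return 1.0 if a == t else 0.0


def set_score(A: Set[str], T: Set[str], mu: Similarity = exact_match) -> float:
    """Symmetric set similarity from the formula; assumes non-empty sets."""
    left = prod(max(mu(a, t) for t in T) for a in A)
    right = prod(max(mu(a, t) for a in A) for t in T)
    return min(left, right)


def average_consensus(answers: Sequence[Set[str]],
                      humans: Sequence[List[Set[str]]],
                      mu: Similarity = exact_match) -> float:
    """Average over questions of the mean score against all human answer sets."""
    per_question = [sum(set_score(A, T, mu) for T in Ts) / len(Ts)
                    for A, Ts in zip(answers, humans)]
    return sum(per_question) / len(per_question)


# "What is in front of table?"
humans = [[{"chair"}, {"chair"}, {"chair", "bag"}, {"wall"}]]
print("chair:", average_consensus([{"chair"}], humans))
print("wall:", average_consensus([{"wall"}], humans))
```

Under exact matching, "chair" agrees with two of the four annotators and "wall" with one, so the more widely shared answer scores higher; the answer "chair, bag" from Human 3 is not matched by "chair" alone because the second product penalises the missing word.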
SLIDE 27

Results on DAQUAR-Consensus

Methods (Average Consensus)    Accuracy   WUPS @0.9
Language Only (Our)            11.57%     18.97%
Vision + Language (Our)        13.51%     21.36%
Human performance (Our)        36.78%     45.68%

[Figure: fraction of data (0 to 100) plotted against human agreement level (0 to 100). Amount of subjectivity in the task captured by the Consensus metric]

SLIDE 28

Conclusions

  • Towards a Visual Turing Test
  • Can machines answer questions about images?
  • Novel Neural-based architecture
  • End-to-end training on Image-Question-Answer triples
  • Doubles the performance of the previous work on DAQUAR
  • New Consensus Metrics to deal with many interpretations
  • Outlook: Explore spectrum between classic AI and Deep Learning

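[Figure: the CNN image representation x and the question words ..., q_{n-1}, q_n are fed to LSTM units that emit the answer words a_1, ..., a_i, with a_{i-1} fed back as input]

  • What is on the right side of the cabinet?
      Vision + Language: bed
      Language Only: bed
  • How many burner knobs are there?
      Vision + Language: 4
      Language Only: 6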

SLIDE 29

Thank you for your attention!

https://www.d2.mpi-inf.mpg.de/visual-turing-challenge

Ask Your Neurons: A Neural-based Approach to Answering Questions about Images

Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

I am expecting to finish my PhD in 2016 and am looking for new opportunities.