Mateusz Malinowski [1] Marcus Rohrbach [2] Mario Fritz [1]
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
[1] Max Planck Institute for Informatics [2] University of California, Berkeley, ICSI
2
Is the water boiling?
3
What is behind the table? sofa
What color are the cabinets? brown
How many lamps are there? 2
What is on the refrigerator? magnet, paper
4
What …? λx.Behind(x,Table)
chair(1, brown, position X, Y, Z) window(1, blue, position X, Y, Z)
window
What is the mustache made of? Person A is …
[Figures: diagrams of related neural question answering architectures. Recoverable components: image CNN, word embedding, LSTM, linear layer and softmax over answer words, two stacked attention layers over feature vectors of different parts of the image, and a layout parser.]
Example questions and answers shown in the figures:
What is the cat doing? Sitting
What kind of animal is in the photo? A cat.
Why is the person holding a knife? To cut the cake with.
Where are the carrots? At the top.
How many people are there? Three.
Where is the dog? couch
5
What is the object on the floor in front of the wall?
Human 1: bed Human 2: shelf Human 3: bed Human 4: bookshelf
[Figure: a CNN encodes the image; a chain of LSTM units reads the question "What is behind the table ?" word by word and then outputs the answer words "chair window <end>".]
7
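\[
\hat{a}_t = \arg\max_{a \in \mathcal{V}} \; p(a \mid x, q, \hat{A}_{t-1}; \theta)
\]
where $\hat{A}_{t-1} = \{\hat{a}_1, \dots, \hat{a}_{t-1}\}$ is the set of previous answer words, $x$ is the image, $q = [q_1, \dots, q_{n-1}, [\![?]\!]]$ is the question, $q_j$ is a question word index, $[\![?]\!]$ encodes the question mark, and $\mathcal{V}$ is the vocabulary.
[Figure: the CNN encoding of image x and the question words ..., q_{n-1}, q_n are fed to a chain of LSTM units, which outputs answer words a_1, ..., a_t, each conditioned on the previous answer word a_{t-1}.]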
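The arg max rule on this slide amounts to greedy decoding: at each step the most probable answer word is chosen given the image, the question, and the answer words chosen so far. A minimal Python sketch of that loop follows. `toy_model`, its vocabulary, and the `<end>` stopping convention are hypothetical stand-ins; in the talk the distribution p(a | x, q, Â; θ) would come from the trained CNN + LSTM, so this only illustrates the decoding step.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # assumed end-of-answer token, as drawn in the slide diagrams

# Hypothetical signature: (image, question, previous answers) -> {word: probability}
Model = Callable[[object, Sequence[str], Sequence[str]], Dict[str, float]]


def greedy_decode(model: Model, image: object, question: Sequence[str],
                  max_len: int = 10) -> List[str]:
    """Repeatedly pick a_t = argmax_a p(a | x, q, A_{t-1}) until <end>."""
    answers: List[str] = []
    for _ in range(max_len):
        probs = model(image, question, answers)
        best = max(probs, key=probs.get)  # arg max over the vocabulary
        if best == END:
            break
        answers.append(best)
    return answers


def toy_model(image, question, previous):
    """Stand-in distribution; NOT the trained network from the talk."""
    script = ["chairs", "window", END]
    target = script[min(len(previous), len(script) - 1)]
    vocab = ["chairs", "window", "sofa", END]
    return {w: (0.7 if w == target else 0.1) for w in vocab}


predicted = greedy_decode(toy_model, image=None,
                          question=["What", "is", "behind", "the", "table", "?"])
print(predicted)  # ['chairs', 'window']
```

Because each step takes only the single best word, the loop never revisits earlier choices; that matches the arg max formulation on the slide.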
11
Malinowski, M., Fritz, M.: “A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input”. NIPS’14
What is behind the table ?
Logical Representation: λx.Behind(x,Table)
Knowledge base (Database)
chairs, window
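To make the symbolic pipeline concrete, here is a toy Python sketch of evaluating the logical form λx.Behind(x,Table) against a knowledge base. The fact tuples, the object list, and the `behind` helper are hypothetical illustrations and not the representation used in the NIPS’14 system.

```python
# Hypothetical knowledge base: (relation, subject, object) facts about one scene.
facts = {
    ("behind", "chairs", "table"),
    ("behind", "window", "table"),
    ("on", "lamp", "table"),  # invented distractor fact
}
objects = {"chairs", "window", "lamp", "table"}


def behind(x: str, landmark: str) -> bool:
    """True if the knowledge base states that x is behind the landmark."""
    return ("behind", x, landmark) in facts


# λx.Behind(x, Table): a predicate over candidate objects x.
query = lambda x: behind(x, "table")

answers = sorted(o for o in objects if query(o))
print(answers)  # ['chairs', 'window']
```

The answer is whatever set of objects satisfies the predicate, so every error in the parsed logical form or in the extracted facts propagates directly to the answer.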
12
Symbolic approach (NIPS’14):
What is behind the table ?
Logical Representation: λx.Behind(x,Table)
Knowledge base (Database)
answers: chairs, window
End-to-end, jointly trained architecture:
[Figure: a CNN encodes the image and a chain of LSTM units reads "What is … ?" and outputs "chairs window <end>".]
13
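[Figure: a CNN encodes the image and a chain of LSTM units generates the caption "Large building with a clock <end>", trained with a Loss on the output.]
Donahue et al.: “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”. CVPR’15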
14
[Figure: the captioning model (CNN, LSTM chain, output "Large building with a clock <end>", Loss) shown next to the question answering model, which takes an image and a question: its LSTM chain reads "What is … ?" and generates "chairs window <end>", again with a Loss on the output.]
15
What is behind the table? sofa
How many doors are open? 1
What is the object on the counter in the corner? microwave
16
[1] Wu, Z., Palmer, M.: Verbs semantics and lexical selection. ACL. 1994.
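Metrics: Accuracy; Wu-Palmer Similarity [1]; WUPS @0.9 (NIPS’14)
[Figure: Wu-Palmer similarity between a ground-truth word and predicted words, illustrated with Armchair, Wardrobe and Chair; similarity values shown: 0.8, 0.9, 0.9.]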
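A small Python sketch of how a thresholded Wu-Palmer score differs from plain accuracy for single-word answers. The similarity table is made up for illustration (real values would come from WordNet), and the 0.1 down-weighting of similarities below the threshold is an assumption taken from the NIPS’14 metric definition rather than from this slide.

```python
from typing import Dict, Tuple

# Made-up Wu-Palmer similarities for illustration only (not WordNet output).
TOY_WUP: Dict[Tuple[str, str], float] = {
    ("armchair", "chair"): 0.9,
    ("armchair", "wardrobe"): 0.8,
}


def wup(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_WUP.get((a, b), TOY_WUP.get((b, a), 0.0))


def thresholded_wup(a: str, b: str, threshold: float = 0.9) -> float:
    """Keep similarities at or above the threshold, scale the rest by 0.1 (assumed)."""
    score = wup(a, b)
    return score if score >= threshold else 0.1 * score


def accuracy(pred: str, truth: str) -> float:
    """Exact-match score: 1 for the identical word, 0 otherwise."""
    return 1.0 if pred == truth else 0.0


for pred in ["armchair", "chair", "wardrobe"]:
    print(pred, accuracy(pred, "armchair"),
          round(thresholded_wup(pred, "armchair"), 2))
```

With these toy values, "chair" gets no credit under accuracy but keeps its similarity under WUPS @0.9, while "wardrobe" falls below the threshold and is scaled down.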
17
Methods                        Accuracy   WUPS @0.9
Baseline: Symbolic (NIPS’14)     7.86%     11.86%
Language Only (Our)             17.15%     22.80%
Vision + Language (Our)         19.43%     25.28%
Human performance (NIPS’14)     50.20%     50.82%
What is on the refrigerator? magnet, paper
What is the color of the comforter? blue, white
How many drawers are there? 3
What is the largest …? bed
20
What is on the right side of the cabinet?
  Vision + Language: bed
  Language Only: bed
What objects are found on the bed?
  Vision + Language: bed sheets, pillow
  Language Only: doll, pillow
How many burner knobs are there?
  Vision + Language: 4
  Language Only: 6
21
How many chairs are there?
  Vision + Language: 1
How many glass cups are there?
  Vision + Language: 2
What is on the left side of the bed?
  Vision + Language: night stand
[Language Only and Human answers, column order lost in extraction: 2, 4, 4, 6, night stand, ball]
and a scene?
23
\[
\frac{1}{N} \sum_{i=1}^{N} \max_{k=1}^{K} \left( \min\left\{ \prod_{a \in A^i} \max_{t \in T^i_k} \mu(a, t),\ \prod_{t \in T^i_k} \max_{a \in A^i} \mu(a, t) \right\} \right) \tag{10}
\]
What is the object on the floor in front of the wall?
Human 1: bed Human 2: shelf Human 3: bed Human 4: bookshelf
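Equation (10), which appears to be the Min Consensus reported in the results, can be sketched in Python as follows. The reading of the symbols is an assumption: A^i is taken to be the predicted answer set for question i, T^i_k the answer set given by the k-th human, and μ a word similarity. Here μ defaults to exact match, a simplification; the WUPS variant would plug in a thresholded Wu-Palmer similarity instead.

```python
from math import prod
from typing import Callable, List, Sequence, Set

Similarity = Callable[[str, str], float]


def exact_match(a: str, t: str) -> float:
    """Simplest choice of mu: 1 for identical words, 0 otherwise (an assumption)."""
    return 1.0 if a == t else 0.0


def set_score(answer: Set[str], truth: Set[str], mu: Similarity) -> float:
    """min{ prod_a max_t mu(a,t), prod_t max_a mu(a,t) }; both sets must be non-empty."""
    left = prod(max(mu(a, t) for t in truth) for a in answer)
    right = prod(max(mu(a, t) for a in answer) for t in truth)
    return min(left, right)


def min_consensus(answers: Sequence[Set[str]],
                  truths: Sequence[List[Set[str]]],
                  mu: Similarity = exact_match) -> float:
    """Average over questions of the best score against any of the K human answers."""
    per_question = [max(set_score(a, t_k, mu) for t_k in t_list)
                    for a, t_list in zip(answers, truths)]
    return sum(per_question) / len(per_question)


# The four human answers from the slide.
humans = [{"bed"}, {"shelf"}, {"bed"}, {"bookshelf"}]
print(min_consensus([{"shelf"}], [humans]))  # 1.0: agrees with Human 2
print(min_consensus([{"sofa"}], [humans]))   # 0.0: agrees with nobody
```

Because of the max over k, agreeing with a single annotator is enough for full credit on that question.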
24
Methods (Min Consensus)        Accuracy   WUPS @0.9
Language Only (Our)             22.56%     30.93%
Vision + Language (Our)         26.53%     34.87%
Human performance (Our)         60.50%     69.65%

Methods (Old Metric)           Accuracy   WUPS @0.9
Language Only (Our)             17.15%     22.80%
Vision + Language (Our)         19.43%     25.28%
Human performance (NIPS’14)     50.20%     50.82%
25
What is in front of the curtain?
  Model: chair
  Human 1: guitar
  Human 2: chair
What color are the beds?
  Model: white
  Human 1: white
  Human 2: pink
How many steel chairs are there?
  Model: 4
  Human 1: 2
  Human 2: 4
What is the largest object?
  Model: bed
  Human 1: bed
  Human 2: quilt
26
\[
\frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \min\left\{ \prod_{a \in A^i} \max_{t \in T^i_k} \mu(a, t),\ \prod_{t \in T^i_k} \max_{a \in A^i} \mu(a, t) \right\} \tag{9}
\]
For the Average Consensus, the answer "chair" is better than "wall".
What is in front of table?
Human 1: chair Human 2: chair Human 3: chair, bag Human 4: wall
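The claim that "chair" beats "wall" under the Average Consensus can be checked with a short Python sketch of equation (9). As an assumption, μ is exact match here (the accuracy variant), A^i is read as the predicted answer set, and T^i_k as the k-th human's answer set; the function names are illustrative only.

```python
from math import prod
from typing import Callable, List, Sequence, Set

Similarity = Callable[[str, str], float]


def exact_match(a: str, t: str) -> float:
    """Simplest choice of mu: 1 for identical words, 0 otherwise (an assumption)."""
    return 1.0 if a == t else 0.0


def set_score(answer: Set[str], truth: Set[str], mu: Similarity) -> float:
    """min{ prod_a max_t mu(a,t), prod_t max_a mu(a,t) }; both sets must be non-empty."""
    left = prod(max(mu(a, t) for t in truth) for a in answer)
    right = prod(max(mu(a, t) for a in answer) for t in truth)
    return min(left, right)


def average_consensus(answers: Sequence[Set[str]],
                      truths: Sequence[List[Set[str]]],
                      mu: Similarity = exact_match) -> float:
    """Mean score over all (question, human) pairs; equals 1/(NK) * sum when every question has K humans."""
    total, count = 0.0, 0
    for a, t_list in zip(answers, truths):
        for t_k in t_list:
            total += set_score(a, t_k, mu)
            count += 1
    return total / count


# The four human answers from the slide.
humans = [{"chair"}, {"chair"}, {"chair", "bag"}, {"wall"}]
chair_score = average_consensus([{"chair"}], [humans])
wall_score = average_consensus([{"wall"}], [humans])
print(chair_score, wall_score)  # 0.5 0.25
```

Under exact match, "chair" fully agrees with two of the four annotators and "wall" with one, so averaging over annotators rewards the more widely shared answer.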
27
Methods (Average Consensus)    Accuracy   WUPS @0.9
Language Only (Our)             11.57%     18.97%
Vision + Language (Our)         13.51%     21.36%
Human performance (Our)         36.78%     45.68%
[Figure: Amount of subjectivity in the task captured by the Consensus metric. Axes: fraction of data, human agreement level.]
28
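[Figure: the CNN encoding of image x and the question words ..., q_{n-1}, q_n are fed to a chain of LSTM units, which outputs answer words a_1, ..., a_i, each conditioned on the previous answer word a_{i-1}.]
What is on the right side of the cabinet?
  Vision + Language: bed
  Language Only: bed
How many burner knobs are there?
  Vision + Language: 4
  Language Only: 6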
29
https://www.d2.mpi-inf.mpg.de/visual-turing-challenge
Mateusz Malinowski, Marcus Rohrbach, Mario Fritz
I am expecting to finish my PhD in 2016 and am looking for new opportunities.