Embodied Question Answering
Abhishek Das
PhD student, Georgia Tech
NVIDIA GTC
March 26, 2018
Samyak Datta
Georgia Tech
Devi Parikh
FAIR/Georgia Tech
Stefan Lee
Georgia Tech
Georgia Gkioxari
FAIR
Dhruv Batra
FAIR/Georgia Tech
To appear in CVPR 2018 (Oral).
embodiedqa.org/paper.pdf
Forward
Forward
Turn Left
What is to the left of the shower? Cabinet
Slide credit: Devi Parikh
Language: Single-Shot QA, Dialog
Vision: Single Frame, Video
Action: Passive, Active
VQA: [Antol and Agrawal et al., ICCV 2015] [Malinowski et al., ICCV 2015] …
VideoQA: [Ye et al., SIGIR 2017] [Jang et al., CVPR 2017] [Tapaswi et al., CVPR 2016] …
Visual Dialog: [Das et al., CVPR 2017] [Das and Kottur et al., ICCV 2017] …
Embodied QA
Yi Wu (UC Berkeley); Yuxin Wu, Georgia Gkioxari, Yuandong Tian (Facebook AI Research)
Slide credit: Georgia Gkioxari
[Song et al., CVPR 2017] Manually designed using an online interior design interface (Planner5D)
Rooms (12): gym, dining room, patio, living room, bathroom, lobby, bedroom, garage, elevator, kitchen, balcony
Homes (767): train: 643 homes, val: 67 homes, test: 57 homes
Objects (50): rug, piano, dryer, computer, fireplace, whiteboard, bookshelf, wardrobe, cabinet, pan, toilet, plates, fish tank, dishwasher, microwave, water dispenser, bed, table, mirror, tv stand, stereo set, chessboard, playstation, vacuum cleaner, cup, xbox, heater, bathtub, shoe rack, range oven, refrigerator, coffee machine, sink, sofa, kettle, dresser, knife rack, towel rack, loudspeaker, utensil holder, desk, vase, shower, washer, fruit bowl, television, dressing table, cutting board, ironing board, food processor
Test for generalization to novel environments!
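Testing on novel environments means the split is made over homes, not over questions, so no test home is ever seen in training. A minimal sketch of such a split using the home counts above; the helper name `split_homes`, the integer home ids, and the fixed shuffling seed are illustrative assumptions, not the authors' procedure:

```python
import random

def split_homes(home_ids, n_train=643, n_val=67, n_test=57, seed=0):
    """Partition home ids into disjoint train/val/test sets (hypothetical helper)."""
    home_ids = list(home_ids)
    if len(home_ids) != n_train + n_val + n_test:
        raise ValueError("split sizes must sum to the number of homes")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(home_ids)
    train = set(home_ids[:n_train])
    val = set(home_ids[n_train:n_train + n_val])
    test = set(home_ids[n_train + n_val:])
    return train, val, test

# Assumed ids 0..766 stand in for the 767 homes.
train, val, test = split_homes(range(767))
```

Because the three sets are disjoint, every question asked at test time is grounded in a home layout the agent has not navigated before.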
Objects: fish tank, piano, pedestal fan, candle, air conditioner
Rooms: bedroom, kitchen, living room
location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?
Skill combinations Varying navigation and memory …
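The question templates can be read as strings with slots that are filled from a home's objects and rooms. A minimal sketch of that idea, copying four templates from the slide; the `TEMPLATES` dictionary and the `instantiate` helper are hypothetical names used only for illustration:

```python
TEMPLATES = {
    "location": "What room is the <OBJ> located in?",
    "color": "What color is the <OBJ>?",
    "color_room": "What color is the <OBJ> in the <ROOM>?",
    "existence": "Is there a(n) <OBJ> in the <ROOM>?",
}

def instantiate(question_type, obj=None, room=None):
    """Fill the <OBJ> and <ROOM> slots of a template (hypothetical helper)."""
    question = TEMPLATES[question_type]
    if obj is not None:
        question = question.replace("<OBJ>", obj)
    if room is not None:
        question = question.replace("<ROOM>", room)
    if "<" in question:
        # A slot was left unfilled, so the question is not well formed.
        raise ValueError(f"unfilled slot in: {question}")
    return question

print(instantiate("color_room", obj="piano", room="living room"))
# What color is the piano in the living room?
```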
EQA v1
Questions (5281): train: 4246, val: 506, test: 529
Remove questions with peaky answer distributions
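One way to make "peaky" concrete is the normalized entropy of a question's answers across environments: a question that almost always has the same answer can be answered without looking. The sketch below follows that idea; the 0.5 threshold, the helper names, and the toy data are assumptions for illustration, so see the paper for the actual criterion:

```python
import math
from collections import Counter

def normalized_entropy(answers):
    """Entropy of the answer distribution, scaled to [0, 1]."""
    counts = Counter(answers)
    if len(counts) < 2:
        return 0.0  # a single distinct answer is maximally peaky
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

def filter_peaky(answers_by_question, threshold=0.5):
    """Keep questions whose answers across environments are spread out enough."""
    return {q: a for q, a in answers_by_question.items()
            if normalized_entropy(a) >= threshold}

# Toy data (made up): answers to the same question in different homes.
toy = {
    "What room is the toilet located in?": ["bathroom"] * 19 + ["bedroom"],
    "What color is the sofa?": ["brown", "grey", "white", "brown"],
}
kept = filter_peaky(toy)
```

In this toy example only the sofa question survives, since the toilet question has the same answer in 19 of 20 homes.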
can serve as a performance reference (see paper)
[Figure: CNN encoder. Input: RGB, 224 x 224. Feature maps: Conv_1 110 x 110 x 8, Conv_2 53 x 53 x 16, Conv_3 24 x 24 x 32, Conv_4 10 x 10 x 32. Output heads: Autoencoder, Segmentation, Depth.]
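The feature-map sizes in the encoder figure are consistent with four blocks of an unpadded 5 x 5 convolution followed by 2 x 2 max-pooling. That layer configuration is an assumption here (the slide shows only the sizes), and `encoder_shapes` is a hypothetical helper that checks the arithmetic:

```python
def encoder_shapes(input_size=224, channels=(8, 16, 32, 32), kernel=5, pool=2):
    """Return (height, width, channels) after each conv block.

    Assumes each block is an unpadded, stride-1 convolution followed by
    non-overlapping max-pooling; these layer settings are an assumption.
    """
    shapes = []
    size = input_size
    for out_channels in channels:
        size = size - kernel + 1  # unpadded convolution shrinks the map
        size = size // pool       # pooling downsamples by the pool factor
        shapes.append((size, size, out_channels))
    return shapes

print(encoder_shapes())
# [(110, 110, 8), (53, 53, 16), (24, 24, 32), (10, 10, 32)]
```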
Stop Repeat
[Figure: hierarchical navigation model unrolled over time. The planner (PLNR) takes the question Q, CNN features of the current image I, and its previous hidden state h and action a, and selects the next action (FORWARD, TURN RIGHT, TURN LEFT, STOP). The controller (CTRL) takes CNN features of the current image and the planner's action, and either repeats the action (1) or returns control to the planner (RETURN).]
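The planner-controller hierarchy in the figure can be sketched as two nested loops: the planner picks an action, and the controller decides how many times to repeat it before handing control back. This is a schematic sketch only; the function signatures, the environment hooks, and the scripted toy modules are assumptions standing in for the learned networks:

```python
def navigate(planner, controller, observe, step, max_steps=100):
    """Run a planner-controller loop and return the executed actions.

    planner(obs, prev_action) -> action name, or "STOP"
    controller(obs, action)   -> True to repeat the action, False to return
    observe() / step(action)  -> hypothetical environment hooks
    """
    executed = []
    prev_action = None
    while len(executed) < max_steps:
        action = planner(observe(), prev_action)
        if action == "STOP":
            break
        # Execute the planner's action once, then let the controller decide
        # whether to repeat it or hand control back to the planner.
        step(action)
        executed.append(action)
        while len(executed) < max_steps and controller(observe(), action):
            step(action)
            executed.append(action)
        prev_action = action
    return executed

# Toy usage with scripted stand-ins for the learned modules.
script = iter(["TURN RIGHT", "FORWARD", "TURN LEFT", "STOP"])
trace = []  # actions the toy environment has received

def toy_observe():
    """Observation = how many times the latest action was just repeated."""
    count = 0
    for action in reversed(trace):
        if action != trace[-1]:
            break
        count += 1
    return count

def toy_planner(obs, prev_action):
    return next(script)

def toy_controller(obs, action):
    return action == "FORWARD" and obs < 3  # repeat FORWARD up to 3 times

actions = navigate(toy_planner, toy_controller, toy_observe, trace.append)
print(actions)
# ['TURN RIGHT', 'FORWARD', 'FORWARD', 'FORWARD', 'TURN LEFT']
```

Splitting the decision this way lets the planner operate over fewer, longer steps, while the controller handles the short-horizon choice of when to stop repeating.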
Softmax over 172 answers
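Answering is thus treated as classification over a fixed vocabulary of 172 answers. A minimal sketch of the final softmax step in plain Python; the scores are made up, and how the model produces them is not shown:

```python
import math

NUM_ANSWERS = 172  # size of the answer vocabulary stated on the slide

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: all zeros except answer index 3.
scores = [0.0] * NUM_ANSWERS
scores[3] = 5.0
probs = softmax(scores)
predicted = max(range(NUM_ANSWERS), key=probs.__getitem__)
```

Here `predicted` is the index of the highest-probability answer, which is index 3 for these made-up scores.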
Does memory help? Does the hierarchy help?
[Bar chart: Distance to target in meters (lower is better). Models: Reactive CNN, LSTM, Reactive CNN + Question, LSTM + Question, Us.]
* Preliminary, somewhat cherry-picked, see full results in paper
[Bar chart: % stopped (higher is better). Models: Reactive CNN, LSTM, Reactive CNN + Question, LSTM + Question, Us.]
* Preliminary, somewhat cherry-picked, see full results in paper
[Bar chart: Mean rank of true answer (lower is better). Models: Us, Us+RL.]
* Preliminary, somewhat cherry-picked, see full results in paper
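The answering metric, mean rank of the true answer, can be computed from the per-answer scores for each question. A minimal sketch under stated assumptions: ranks are 1-based, ties are resolved in favor of the true answer, and the example scores are made up:

```python
def rank_of_true_answer(scores, true_index):
    """1-based rank of the true answer when answers are sorted by score."""
    true_score = scores[true_index]
    # Count answers scored strictly higher; ties do not hurt the rank.
    return 1 + sum(1 for s in scores if s > true_score)

def mean_rank(all_scores, true_indices):
    """Average rank of the true answer over a set of questions."""
    ranks = [rank_of_true_answer(s, t) for s, t in zip(all_scores, true_indices)]
    return sum(ranks) / len(ranks)

example_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
example_truth = [2, 0]
print(mean_rank(example_scores, example_truth))  # 1.5
```

A mean rank of 1 would mean the true answer is always scored highest, which is why lower is better.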