Embodied Question Answering | NVIDIA GTC | March 26, 2018 | Abhishek Das



SLIDE 1

Embodied Question Answering

Abhishek Das

PhD student, Georgia Tech

NVIDIA GTC

March 26, 2018

SLIDE 2

Samyak Datta

Georgia Tech

Devi Parikh

FAIR/Georgia Tech

Stefan Lee

Georgia Tech

Georgia Gkioxari

FAIR

Dhruv Batra

FAIR/Georgia Tech

To appear in CVPR 2018 (Oral).

Embodied Question Answering

embodiedqa.org/paper.pdf

SLIDE 3
SLIDE 4
SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8
SLIDE 9

Forward

SLIDE 10

Forward

SLIDE 11

Turn Left

SLIDE 12
SLIDE 13
SLIDE 14
SLIDE 15

Q. What is to the left of the shower?  A. Cabinet

Slide credit: Devi Parikh

SLIDE 16

EmbodiedQA: AI Challenges

  • Language understanding
  • Visual understanding
  • Active perception
  • Common sense reasoning
  • Grounding into actions
  • Selective memory
  • Credit assignment

Slide credit: Devi Parikh

SLIDE 17

EmbodiedQA: Context

[Chart: tasks arranged on a Vision axis (Single Frame → Video) and a Language axis (Single-Shot QA → Dialog)]

Slide credit: Devi Parikh

SLIDE 18

EmbodiedQA: Context

[Chart: an Action axis (Passive → Active) is added to the Vision (Single Frame → Video) and Language (Single-Shot QA → Dialog) axes]

Slide credit: Devi Parikh

SLIDE 19

EmbodiedQA: Context

[Chart: VQA placed at Single Frame, Single-Shot QA, Passive]

VQA

  • Q. What is the mustache made of?

[Antol and Agrawal et al., ICCV 2015] [Malinowski et al., ICCV 2015] …

Slide credit: Devi Parikh

SLIDE 20

EmbodiedQA: Context

[Chart: VideoQA added at Video, Single-Shot QA, Passive]

VideoQA

[Ye et al., SIGIR 2017] [Jang et al., CVPR 2017]

  • Q. How many times does the cat touch the dog?
  • A. 4 times

Attribute: “dog”, “egg”, “bowl”, “woman”, “plate”

  • Q. What is a woman boiling in a pot of water?
  • A. Eggs

[Tapaswi et al., CVPR 2016] …

Slide credit: Devi Parikh

SLIDE 21

EmbodiedQA: Context

[Chart: Visual Dialog added at Single Frame, Dialog, Passive]

Visual Dialog

[Das et al., CVPR 2017] [Das and Kottur et al., ICCV 2017] …

Slide credit: Devi Parikh

SLIDE 22

EmbodiedQA: Context

[Chart: Embodied QA occupies the Active regime, beyond VQA, VideoQA, and Visual Dialog]

Embodied QA

  • Goal specified via reward
  • e.g., [Gupta et al., CVPR17, Zhu et al., ICCV17]
  • Goal specified via visual target
  • e.g., [Zhu et al., ICRA17]
  • Fully observable environment
  • e.g., [Wang et al., ACL16]
  • Recent
  • [Hermann et al., 2017, Chaplot et al., 2017]
  • More complex environments
  • Higher level tasks
  • [Anderson et al., CVPR18]
  • Interactive downstream tasks

Slide credit: Devi Parikh

SLIDE 23

EQA Dataset

  • Questions in environments

Slide credit: Devi Parikh

SLIDE 24

EQA Dataset

  • Questions in environments

Slide credit: Devi Parikh

SLIDE 25

EQA Dataset: Environments

Yi Wu (UC Berkeley), Yuxin Wu, Georgia Gkioxari, Yuandong Tian (Facebook AI Research)

House3D: A Rich and Realistic 3D environment

https://github.com/facebookresearch/House3D

Slide credit: Georgia Gkioxari

SLIDE 26

SUNCG dataset

[Song et al., CVPR 2017] Manually designed using an online interior design interface (Planner5D)

Slide credit: Georgia Gkioxari

SLIDE 27

45,622 indoor scenes · 404,058 rooms · 5,697,217 object instances · 2,644 unique objects · 80 object categories

SUNCG dataset

[Song et al., CVPR 2017] Manually designed using an online interior design interface (Planner5D)

Slide credit: Georgia Gkioxari

SLIDE 28
  • Collision and free space prediction
  • OpenGL
  • Linux/MacOS compatible

House3D

  • On Tesla M40 GPU (120x90 resolution)
  • 600fps single process
  • 1800fps multi process

Slide credit: Georgia Gkioxari

SLIDE 29

  • RGB image
  • Depth maps
  • Semantic segmentation masks
  • Top-down 2D views

House3D

Slide credit: Georgia Gkioxari

SLIDE 30

EQA Dataset: Environments

  • Subset of House3D: typical home environments
  • Realistic layout according to all three SUNCG annotators
  • Not too large or too small (300–800 m², covering 1/3rd of ground area)
  • Have at least one kitchen, living room, dining room, bedroom
  • Ignore obscure rooms (e.g., loggia) and tiny objects (e.g., light switches)

Slide credit: Devi Parikh

SLIDE 31

Rooms (12): gym, dining room, patio, living room, office, bathroom, lobby, bedroom, garage, elevator, kitchen, balcony

Homes (767): train: 643 homes, val: 67 homes, test: 57 homes

Objects (50): rug, piano, dryer, computer, fireplace, whiteboard, bookshelf, wardrobe, cabinet, pan, toilet, plates, ottoman, fish tank, dishwasher, microwave, water dispenser, bed, table, mirror, tv stand, stereo set, chessboard, playstation, vacuum cleaner, cup, xbox, heater, bathtub, shoe rack, range oven, refrigerator, coffee machine, sink, sofa, kettle, dresser, knife rack, towel rack, loudspeaker, utensil holder, desk, vase, shower, washer, fruit bowl, television, dressing tab., cutting board, ironing board, food processor

EQA Dataset: Environments

Test for generalization to novel environments!

Slide credit: Devi Parikh

SLIDE 32

Slide credit: Devi Parikh

EQA Dataset: Environments

[Examples: fish tank, piano, pedestal fan, candle, air conditioner; bedroom, kitchen, living room]

SLIDE 33

EQA Dataset

  • Questions in environments

Slide credit: Devi Parikh

SLIDE 34

EQA Dataset

  • Questions in environments

Slide credit: Devi Parikh

SLIDE 35

Slide credit: Devi Parikh

EQA Dataset: Questions

  • Programmatically generate questions and answers

location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?

  • Skill combinations
  • Varying navigation and memory
  • …
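To make the generation step concrete, here is a minimal sketch of template-based question generation. The annotation format (a dict mapping room names to object lists) and the small template subset are illustrative assumptions, not the actual EQA pipeline.

```python
# Illustrative sketch only: the house annotation format and the
# template subset below are assumptions, not the real EQA code.
TEMPLATES = {
    "location":  "what room is the {obj} located in?",
    "existence": "is there a(n) {obj} in the {room}?",
    "count":     "how many {obj}s in the {room}?",
}

def generate_questions(house):
    """house: dict mapping room name -> list of object names.
    Returns (question_type, question, answer) triples."""
    questions = []
    for room, objects in house.items():
        for obj in set(objects):
            questions.append(
                ("location", TEMPLATES["location"].format(obj=obj), room))
            questions.append(
                ("existence", TEMPLATES["existence"].format(obj=obj, room=room), "yes"))
            questions.append(
                ("count", TEMPLATES["count"].format(obj=obj, room=room),
                 str(objects.count(obj))))
    return questions

qs = generate_questions({"kitchen": ["refrigerator", "sink"], "bedroom": ["bed"]})
```

Because questions and answers come from templates over known scene annotations, ground truth is free and the same generator scales to every environment.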

SLIDE 36

EQA Dataset: Questions

  • Programmatically generate questions and answers

Slide credit: Devi Parikh

SLIDE 37

EQA Dataset: Questions

  • Programmatically generate questions and answers

location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?

EQA v1

Slide credit: Devi Parikh

SLIDE 38

EQA Dataset: Questions

  • Programmatically generate questions and answers

Questions (5281): train: 4246, val: 506, test: 529
Remove questions with peaky answer distributions

location: What room is the <OBJ> located in?
color: What color is the <OBJ>?
color_room: What color is the <OBJ> in the <ROOM>?
preposition: What is <on/above/below/next-to> the <OBJ> in the <ROOM>?
existence: Is there a(n) <OBJ> in the <ROOM>?
logical: Is there a(n) <OBJ1> and a(n) <OBJ2> in the <ROOM>?
count: How many <OBJs> in the <ROOM>?
room_count: How many <ROOMs> in the house?
distance: Is the <OBJ1> closer to the <OBJ2> than to the <OBJ3> in the <ROOM>?

EQA v1

Slide credit: Devi Parikh
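The "peaky answer distribution" filter can be sketched with normalized entropy: a template whose answers are nearly constant across environments (e.g., existence questions that are almost always "yes") is trivially answerable without navigating. The entropy criterion and the threshold value here are assumptions; the actual filter used for EQA v1 may differ.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Normalized entropy of an answer distribution: 0 = peaky, 1 = uniform."""
    counts = Counter(answers)
    n = len(answers)
    ent = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_ent = math.log(len(counts)) if len(counts) > 1 else 1.0
    return ent / max_ent

def keep_question_type(answers, threshold=0.5):
    # Drop a template whose answers are too predictable. The threshold
    # is an assumed value for illustration.
    return answer_entropy(answers) >= threshold
```

For example, a color question whose answers spread evenly over many colors passes, while an existence question answered "yes" 99% of the time is dropped.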

SLIDE 39

EQA Dataset: Expert Demonstrations

  • Connected House3D to Amazon Mechanical Turk

Slide credit: Devi Parikh

SLIDE 40

EQA Dataset: Expert Demonstrations

Slide credit: Devi Parikh

SLIDE 41

Slide credit: Devi Parikh

SLIDE 42

Slide credit: Devi Parikh

SLIDE 43

EQA Dataset: Expert Demonstrations

  • Connected House3D to Amazon Mechanical Turk
  • Currently: demonstrations for 1162 questions across 70 environments
  • Can be used for training
  • Learn how to explore
  • Capture human common sense
  • Can serve as a performance reference

Slide credit: Devi Parikh

SLIDE 44

EQA Dataset: Expert Demonstrations

  • Connected House3D to Amazon Mechanical Turk
  • Currently: demonstrations for 1162 questions across 70 environments
  • Can be used for training
  • Learn how to explore
  • Capture human common sense
  • Can serve as a performance reference (see paper)

Slide credit: Devi Parikh

SLIDE 45

Model:

Vision, Language, Navigation, Answering

Slide credit: Devi Parikh

SLIDE 46

Encoder

Model:

Vision, Language, Navigation, Answering

[Encoder CNN diagram: 224×224 RGB input → Conv_1 (110×110×8) → Conv_2 (53×53×16) → Conv_3 (24×24×32) → Conv_4 (10×10×32); pretrained with autoencoder, semantic segmentation, and depth decoders]

Slide credit: Devi Parikh
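The shrinking spatial sizes in the encoder come from strided convolutions, and the standard output-size formula reproduces the pyramid approximately. The 5×5 kernel and stride 2 below are assumed values chosen to illustrate the arithmetic, not the paper's exact hyperparameters (which evidently differ slightly, since this choice yields 25 and 11 where the slide shows 24 and 10).

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed 5x5 kernels, stride 2, no padding: 224 -> 110 -> 53 -> 25 -> 11
sizes = [224]
for _ in range(4):
    sizes.append(conv_out(sizes[-1], kernel=5, stride=2))
```

Four stride-2 layers shrink the map by roughly 2× each time, which is why a 224×224 frame ends up as a small grid of feature vectors.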

SLIDE 47

Model:

Vision, Language, Navigation, Answering

Slide credit: Devi Parikh

SLIDE 48

Model:

Vision, Language, Navigation, Answering

  • Planner: direction or intention
  • Controller: velocity or primitive actions

[Controller outputs at each step: Repeat the action, or Stop and return control to the planner]

Slide credit: Devi Parikh
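The planner/controller split can be sketched as two nested loops: the planner emits an intention, and the controller repeats the corresponding primitive action until it decides to return control. The toy environment and scripted policies below are stand-ins for the learned modules, not the actual model.

```python
ACTIONS = ["forward", "turn-left", "turn-right", "stop"]

def run_episode(planner, controller, env, max_steps=100):
    """Outer loop: planner picks an action given the current frame;
    inner loop: controller repeats it until it signals RETURN (False)."""
    trajectory = []
    frame = env.observe()
    while len(trajectory) < max_steps:
        action = planner(frame)            # high-level intention
        if action == "stop":
            break
        while len(trajectory) < max_steps:
            trajectory.append(action)
            frame = env.step(action)       # execute one primitive
            if not controller(frame, action):
                break                      # RETURN control to the planner

    return trajectory

# Toy stand-ins: a scripted planner and a controller that repeats on odd frames.
class ToyEnv:
    def __init__(self):
        self.t = 0
    def observe(self):
        return self.t
    def step(self, action):
        self.t += 1
        return self.t

plans = iter(["forward", "turn-left", "stop"])
traj = run_episode(lambda f: next(plans), lambda f, a: f % 2 == 1, ToyEnv())
```

The hierarchy means the planner makes far fewer decisions than there are primitive steps, which shortens the credit-assignment horizon.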

SLIDE 49

Model:

Vision, Language, Navigation, Answering

[Unrolled planner–controller diagram: the planner (PLNR), conditioned on the question Q and CNN features of the current frame I_t, emits actions (FORWARD, TURN LEFT, TURN RIGHT, STOP); the controller (CTRL) repeats each action over subsequent frames until it RETURNs control to the planner]

Slide credit: Devi Parikh

SLIDE 50

Model:

Vision, Language, Navigation, Answering

Slide credit: Devi Parikh


SLIDE 55

Model:

Vision, Language, Navigation, Answering

Softmax over 172 answers

Slide credit: Devi Parikh

SLIDE 56

Training

  • Pre-train CNN
  • Supervised learning
  • “Expert demonstrations”: shortest path
  • Curriculum
  • Pre-train (and freeze) answering module
  • Reinforcement learning
  • REINFORCE
  • Terminal reward: answering accuracy
  • Intermediate reward shaping: progress towards target

Slide credit: Devi Parikh
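The reward structure for the RL stage can be sketched as follows. Here `dists` stands for the agent's distance to the target at each step; the terminal-reward magnitude and discount factor are assumed values for illustration, not the paper's exact settings.

```python
def shaped_rewards(dists, answered_correctly, terminal_reward=5.0):
    """Per-step rewards: the intermediate shaping reward is the progress
    made toward the target; the terminal reward is for answering correctly."""
    rewards = [dists[t] - dists[t + 1] for t in range(len(dists) - 1)]
    if answered_correctly:
        rewards[-1] += terminal_reward
    return rewards

def returns(rewards, gamma=0.99):
    """Discounted returns G_t that weight the REINFORCE log-prob gradients."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]
```

In REINFORCE, each action's log-probability gradient is scaled by its return, so steps that move toward the target (and episodes answered correctly) are reinforced.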

SLIDE 57

Baselines

Predict action from:
  • Reactive CNN: 5 images (+ question)
  • LSTM navigator: images (+ question) + previous action

Does memory help? Does the hierarchy help?

Slide credit: Devi Parikh

SLIDE 58

Metrics

  • Navigation:
  • Final distance to target
  • Improvement in distance to target
  • Min. distance to target during navigation
  • % ended in the right room
  • % entered the right room
  • % stopped
  • Answering:
  • Mean rank of ground truth answer
  • For varying initial positions
  • Q. What color is the fish tank/bowl in the living room?
  • A. Light blue

Slide credit: Devi Parikh
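The distance-based navigation metrics reduce to simple functions of the agent's distance-to-target sequence. A minimal sketch (the room-based and %-stopped metrics need extra annotations and are omitted; the metric names are shorthand for the bullets on this slide):

```python
def navigation_metrics(dists):
    """dists: distance (meters) from agent to target at each step,
    including the start and final positions."""
    return {
        "d_T":     dists[-1],             # final distance to target
        "d_delta": dists[0] - dists[-1],  # improvement in distance to target
        "d_min":   min(dists),            # minimum distance during navigation
    }

m = navigation_metrics([5.0, 3.0, 4.0])
```

Reporting all three matters: an agent can get close (`d_min`) but overshoot (`d_T`), and `d_delta` controls for how far away the episode started.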

SLIDE 59

Results*: Distance to target

[Bar chart: distance to target in meters (lower is better) for Reactive CNN, LSTM, Reactive CNN + Question, LSTM + Question, and our model]

* Preliminary, somewhat cherry-picked, see full results in paper

Slide credit: Devi Parikh

SLIDE 60

Results*: % stopped

[Bar chart: % stopped (higher is better) for Reactive CNN, LSTM, Reactive CNN + Question, LSTM + Question, and our model]

* Preliminary, somewhat cherry-picked, see full results in paper

Slide credit: Devi Parikh

SLIDE 61

Results*: Mean rank of true answer

[Bar chart: mean rank of the true answer (lower is better) for our model with and without RL finetuning]

* Preliminary, somewhat cherry-picked, see full results in paper

Slide credit: Devi Parikh

SLIDE 62
SLIDE 63
SLIDE 64

Summary

  • Embodied Question Answering (EmbodiedQA) – new AI task.
  • Navigate, gather information through active perception, answer the question

  • EQA v1 dataset on House3D environments

Slide credit: Devi Parikh

SLIDE 65

Summary

  • Embodied Question Answering (EmbodiedQA) – new AI task.
  • Navigate, gather information through active perception, answer the question

  • EQA v1 dataset on House3D environments
  • Human demonstrations
  • Hierarchical EmbodiedQA model

Slide credit: Devi Parikh

SLIDE 66

Summary

  • Embodied Question Answering (EmbodiedQA) – new AI task.
  • Navigate, gather information through active perception, answer the question

  • EQA v1 dataset on House3D environments
  • Human demonstrations
  • Hierarchical EmbodiedQA model
  • Imitation learning + RL finetuning
  • Only scratching the surface…

Slide credit: Devi Parikh

SLIDE 67

Thank you.

embodiedqa.org


Slide credit: Devi Parikh