Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks (PowerPoint PPT Presentation)



SLIDE 1

Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang MONASH INFORMATION TECHNOLOGY

SLIDE 2

Outline

  • 1. Embodied Navigation
  • 2. Vision-language Navigation Task
  • 3. Related Works
  • 4. Our Methods
  • 5. Conclusion
SLIDE 3

Embodied Navigation Problem

  • 1. datasets providing 3D assets with semantic annotations
  • 2. simulators render these assets and simulate an embodied agent
  • 3. tasks that define evaluable problems that enable us to benchmark scientific progress
SLIDE 4

Synthetic Image / Real Image

Synthetic Image

Advantages

  • More data
  • Faster rendering

Disadvantages

  • Limited real-world applicability

Real Image

Advantages

  • Close to real applications

Disadvantages

  • Less data
  • Prone to overfitting

Transfer: Sim-Real Joint Reinforcement Transfer for 3D Indoor Navigation (Zhu et al., CVPR 2019)

SLIDE 5

Matterport3D

SLIDE 6

Habitat Simulator

A flexible, high-performance 3D simulator with configurable agents, multiple sensors, and generic 3D dataset handling, with built-in support for Matterport3D, Gibson, Replica, and other datasets.

Advantages:

  • Real image
  • Fast rendering
  • Continuous action space

Disadvantage:

  • Low rendering quality
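The reset/step interaction loop such a simulator exposes can be sketched with a toy stand-in. `GridSim` below is a hypothetical 1-D corridor, not Habitat's actual API; it only illustrates how an agent, a simulator, and a success criterion fit together.

```python
class GridSim:
    """Toy simulator: the agent moves along a 1-D corridor toward a goal cell."""
    def __init__(self, length=5):
        self.length = length

    def reset(self):
        self.pos = 0
        return {"pos": self.pos, "goal": self.length - 1}

    def step(self, action):
        # actions: +1 = forward, -1 = backward, 0 = stop
        self.pos = max(0, min(self.length - 1, self.pos + action))
        obs = {"pos": self.pos, "goal": self.length - 1}
        done = action == 0
        success = done and self.pos == self.length - 1
        return obs, done, success

def run_episode(sim, policy, max_steps=20):
    """Standard embodied-navigation loop: reset, act until stop or timeout."""
    obs = sim.reset()
    for _ in range(max_steps):
        action = policy(obs)
        obs, done, success = sim.step(action)
        if done:
            return success
    return False

# A trivial policy: walk forward until at the goal, then stop.
greedy = lambda obs: 0 if obs["pos"] == obs["goal"] else 1
```

Real simulators differ in observation spaces and action sets, but benchmarks are built on exactly this episode structure.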
SLIDE 7

PointGoal Task
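PointGoal navigation is commonly evaluated with success rate and SPL (Success weighted by Path Length, Anderson et al., 2018). A minimal sketch of the SPL computation:

```python
def spl(successes, shortest_lengths, agent_lengths):
    """SPL = mean over episodes of S_i * l_i / max(p_i, l_i),
    where S_i is binary success, l_i the shortest-path length,
    and p_i the length of the path the agent actually took."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, agent_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

An agent that succeeds via a path twice the shortest length scores 0.5 on that episode, so SPL rewards efficiency, not just reaching the goal.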

SLIDE 8

ObjectGoal Task

SLIDE 9

Vision Language Navigation (VLN) Task

  • Natural-language instructions
  • More detailed descriptions
  • Requires complex scene understanding

Room-to-room (R2R) dataset

  • 90 houses
  • 7k trajectories
  • 21k instructions

Computer Vision + Natural Language Processing + Reinforcement Learning

SLIDE 10

VLN baseline (seq-to-seq)

Disadvantages:

  • 1. Supervised learning easily overfits
  • 2. Does not sufficiently exploit the panoramic view
  • 3. The action space is redundant
  • 4. Training-testing domain gap
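The baseline can be sketched as an instruction-encoding LSTM plus an attentive action decoder. The PyTorch model below is a minimal illustration with made-up dimensions (e.g. 2048-d visual features), not the exact architecture of the R2R paper:

```python
import torch
import torch.nn as nn

class Seq2SeqAgent(nn.Module):
    """Minimal seq-to-seq VLN sketch: an LSTM encodes the instruction,
    and a decoder LSTM attends over the encoding at every step to
    produce action logits. All hyperparameters are illustrative."""
    def __init__(self, vocab=1000, emb=64, hid=128, n_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTMCell(hid + 2048, hid)  # 2048 = assumed image feature dim
        self.attn = nn.Linear(hid, hid)
        self.out = nn.Linear(hid, n_actions)

    def forward(self, instr, img_feats):
        # instr: (B, L) token ids; img_feats: (B, T, 2048) per-step views
        ctx, (h, c) = self.encoder(self.embed(instr))      # ctx: (B, L, hid)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(img_feats.size(1)):
            # dot-product attention of the agent state over instruction words
            scores = torch.bmm(ctx, self.attn(h).unsqueeze(2)).squeeze(2)
            att = torch.softmax(scores, dim=1)
            attended = torch.bmm(att.unsqueeze(1), ctx).squeeze(1)
            h, c = self.decoder(torch.cat([attended, img_feats[:, t]], 1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, T, n_actions)
```

Training with teacher-forced cross-entropy on these logits is exactly the supervised setup whose overfitting the slide criticizes.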
SLIDE 11

Speaker-Follower Model

Speaker: trajectory to instruction
Follower: instruction to trajectory
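The data-augmentation loop this enables can be sketched as follows; `speaker` and the trajectory sampler are stubs standing in for the trained speaker model and the shortest-path sampler of the actual Speaker-Follower pipeline (Fried et al., 2018):

```python
import random

def sample_trajectories(houses, n):
    """Stub: sample n shortest-path trajectories from the training houses."""
    return [(random.choice(houses), f"traj-{i}") for i in range(n)]

def augment(human_pairs, speaker, houses, n_synthetic):
    """Grow the follower's training set with speaker-labeled trajectories."""
    synthetic = []
    for _, traj in sample_trajectories(houses, n_synthetic):
        synthetic.append((traj, speaker(traj)))  # speaker: trajectory -> instruction
    return human_pairs + synthetic               # follower trains on the union
```

The follower is then trained on both human-annotated and synthetic (trajectory, instruction) pairs, which is the main source of the model's gains.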

SLIDE 12

Speaker-Follower Model

SLIDE 13

Reinforced Cross-Modal Matching (RCM)

Advantages:

  • 1. Uses cross-modal attention
  • 2. Combines RL with supervised learning
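The cross-modal attention idea can be illustrated with a minimal two-stage module: the agent state attends over instruction words, and the attended text then attends over the panoramic views. Dimensions and the exact attention form are placeholders, not RCM's precise equations:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of RCM-style cross-modal grounding with dot-product attention."""
    def __init__(self, d=128):
        super().__init__()
        self.q_text = nn.Linear(d, d)
        self.q_vis = nn.Linear(d, d)

    @staticmethod
    def attend(query, keys):
        # query: (B, d), keys: (B, N, d) -> attention-weighted sum (B, d)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)
        w = torch.softmax(scores, dim=1)
        return torch.bmm(w.unsqueeze(1), keys).squeeze(1)

    def forward(self, state, words, views):
        text_ctx = self.attend(self.q_text(state), words)   # language attention
        vis_ctx = self.attend(self.q_vis(text_ctx), views)  # visual attention
        return text_ctx, vis_ctx
```

Conditioning the visual attention on the attended text is what makes the grounding cross-modal rather than two independent attentions.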
SLIDE 14

Environmental Dropout (Envdrop)
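The key trick in Envdrop (Tan et al., 2019) is that one dropout mask is sampled per environment and shared across all viewpoints and views, so the perturbed features stay internally consistent, as if rendered from a new house. A NumPy sketch, with the feature-shape convention assumed:

```python
import numpy as np

def environmental_dropout(feats, p=0.5, rng=None):
    """feats: (n_viewpoints, n_views, dim) features of ONE environment.
    One mask over the feature dimension is applied everywhere, unlike
    standard dropout, which samples independently per element."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(feats.shape[-1]) >= p) / (1.0 - p)  # inverted dropout scaling
    return feats * mask  # broadcast: the same channels are dropped at every view
```

Each sampled mask effectively yields an extra training environment, mitigating the small number of houses in R2R.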

SLIDE 15

Self-Supervised Auxiliary Reasoning Tasks


Rich information to explore:

  • Semantics of the route
  • Navigation Progress
  • Vision Language Consistency
  • Room Structure

[Figure: navigation graph with navigation nodes, navigation edges, and feasible edges. Example instruction: "Please turn left and walk through the living room. Exit the room and turn right into the bedroom."]

SLIDE 16

Self-Supervised Auxiliary Reasoning Tasks


We require the agent to:

  • Interpret its actions
  • Reason about the past
  • Align vision and language explicitly
  • Predict the future

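The "reason about the past" requirement is often made concrete as progress estimation: a self-supervised label for how much of the path is complete, e.g. the fraction of the initial distance to the goal already covered. This is one common definition; the paper's exact formulation may differ:

```python
def progress_labels(dist_to_goal):
    """dist_to_goal: remaining geodesic distance at each step, with the
    starting distance d_0 first. Returns a progress label in [0, 1]
    per step; no human annotation is needed, hence self-supervised."""
    d0 = dist_to_goal[0]
    return [1.0 - d / d0 for d in dist_to_goal]
```

The agent's state is trained to regress these labels alongside the main navigation objective.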

SLIDE 17

Self-Supervised Auxiliary Reasoning Tasks

SLIDE 18

Self-Supervised Auxiliary Reasoning Tasks

SLIDE 19

Self-Supervised Auxiliary Reasoning Tasks

SLIDE 20

Self-Supervised Auxiliary Reasoning Tasks

Demo Code
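The demo boils down to training with a weighted sum of the main navigation loss and the four auxiliary reasoning losses. The argument names below (retelling, progress, matching, angle prediction) follow the tasks listed earlier; the default weights are placeholders, not the paper's values:

```python
def total_loss(nav, retell, progress, matching, angle,
               w=(1.0, 1.0, 1.0, 1.0)):
    """Navigation loss plus weighted self-supervised auxiliary losses."""
    return nav + w[0] * retell + w[1] * progress + w[2] * matching + w[3] * angle
```

Because every auxiliary label is derived from the trajectory itself, the extra supervision comes for free at training time and the auxiliary heads can be discarded at inference.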

SLIDE 21

Thank You