

SLIDE 1

Cognitive Mapping and Planning for Visual Navigation

Saurabh Gupta¹,²  James Davidson²  Sergey Levine¹,²  Rahul Sukthankar²  Jitendra Malik¹,²

¹UC Berkeley  ²Google

Presented by Kent Sommer

Korea Advanced Institute of Science and Technology

SLIDE 2

Table of contents

  1. Problem Statement
  2. Related Work
  3. Contribution
  4. Results
  5. Video Demo
  6. Summary

SLIDE 3

Problem Statement

SLIDE 4

Problem Statement

Robot navigation in novel environments:

  • Robot equipped with a first-person camera
  • Dropped into a novel environment
  • Must navigate in the environment

SLIDE 5

Motivation: Intelligent Navigation

What does it mean to navigate intelligently?

  • Navigate through novel environments
  • Draw on prior experience or similar conditions
  • Reason about free-space, obstacle-space, topology

SLIDE 6

Motivation: Why Are Humans So Good?

Humans can often reason about their environment, while classical agents can at best perform uninformed exploration:

  • Know where we are likely to find a chair
  • Know that hallways often lead to other hallways
  • Know common building patterns

SLIDE 7

Related Work

SLIDE 8

Classical Work

  • Over-complete: precise reconstruction of everything is not necessary
  • Incomplete: nothing is known until it is explicitly observed, so these methods fail to exploit the structure of the world
  • Only geometry, no semantics
  • Unnecessarily fragile due to the separation between mapping and planning

[Figures: LSD-SLAM reconstruction; RRT planning]

SLIDE 9

Contemporary Work

  • Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning, Zhu et al., ICRA 2017
  • End-to-End Training of Deep Visuomotor Policies, Levine et al., JMLR 2016
  • Human-level control through deep reinforcement learning, Mnih et al., Nature 2015

[Figure: memory architectures from Oh et al. — DQN, DRQN, MQN, MRQN, and FRMQN, each mapping observations xₜ (with optional context and memory) through a CNN to Q-values]

Control of Memory, Active Perception, and Action in Minecraft, Oh et al., ICML 2016

SLIDE 10

Contemporary Work

[Figure: Zhu et al. architecture — generic siamese ResNet-50 layers embed the 224x224x3 observation and target images into fc(512) features, which are fused into a joint embedding that feeds scene-specific policy (4) and value (1) heads, one set per scene #1 … #N]

Feed-forward architecture without memory:

  • Agent can't systematically explore a new environment or backtrack.
  • Agent needs experience with a new environment before it can start navigating successfully.

SLIDE 11

Contribution

SLIDE 12

Contribution

Neural network policy for visual navigation

  • Joint architecture for mapping and planning
  • Spatial memory with the ability to plan given partial observations
  • End-to-end trainable

SLIDE 13

Cognitive Mapping and Planning: System Overview

[Figure: CMP system overview — at each time step (here with 90° turns), the Differentiable Mapper takes the current frame and egomotion and updates a multiscale belief of the world in an egocentric coordinate frame; the Differentiable Hierarchical Planner takes this belief and the goal and emits the next action. A minimal sketch of this loop follows below.]
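As a conceptual outline only (the module and method names below are hypothetical, not from the authors' released code), one CMP control step could look like:

```python
def cmp_step(mapper, planner, belief, frame, egomotion, goal):
    """One CMP step: fold the new observation into the egocentric
    belief, then plan an action toward the goal on that belief."""
    # Warp the previous belief by egomotion and fuse the new frame.
    belief = mapper.update(belief, frame, egomotion)
    # Hierarchical value-iteration planning on the multiscale belief.
    action = planner.plan(belief, goal)
    return belief, action
```

Because both stages are differentiable, the planning loss can flow back through the planner into the mapper.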

SLIDE 14

Differentiable Mapper

[Figure: Differentiable Mapper — past frames pass through an encoder network (ResNet-50), fully connected layers with ReLUs, and a decoder network with residual connections to predict a belief update; the confidence and belief about the world from the previous time step are differentiably warped using egomotion and combined with this prediction to give the updated confidence and belief about the world. Inputs: past frames and egomotion. A sketch of the warp-and-combine step follows below.]
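The warp is plain bilinear resampling and the fusion is a confidence-weighted average, so both are differentiable. A minimal PyTorch-style sketch, assuming egomotion is expressed as a 2D affine transform of the egocentric grid (tensor shapes and function names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def warp(belief, conf, ego_affine):
    """Differentiably warp last step's belief and confidence maps
    into the current egocentric frame via bilinear sampling."""
    # ego_affine: (B, 2, 3) affine matrices encoding the agent's motion.
    grid = F.affine_grid(ego_affine, belief.size(), align_corners=False)
    return (F.grid_sample(belief, grid, align_corners=False),
            F.grid_sample(conf, grid, align_corners=False))

def combine(belief_w, conf_w, belief_new, conf_new):
    """Confidence-weighted fusion of the warped past belief with the
    mapper's prediction from the current frame."""
    conf = conf_w + conf_new
    belief = (belief_w * conf_w + belief_new * conf_new) / conf.clamp(min=1e-6)
    return belief, conf
```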

SLIDE 15

Differentiable Planner

Value Iteration Network¹

  • Q_n(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) V_n(s′)
    • Computed as convolutions
  • V_{n+1}(s) = max_a Q_n(s, a), ∀s
    • Computed as max pooling over channels

¹Aviv Tamar et al. "Value Iteration Networks". In: Advances in Neural Information Processing Systems. 2016, pp. 2146–2154.

SLIDE 16

Differentiable Planner: Value Iteration Network

  • Q_n(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) V_n(s′)
    • Computed as convolutions
  • V_{n+1}(s) = max_a Q_n(s, a), ∀s
    • Computed as max pooling over channels

Trainable using simulated data (a minimal sketch follows below)
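A minimal PyTorch sketch of this recurrence, assuming a 2D grid world and illustrative tensor shapes (a toy instance of a VIN block, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ValueIterationBlock(nn.Module):
    """K value-iteration steps: Q = conv([R; V]), V = max over the
    action channels, mirroring the two equations above."""
    def __init__(self, num_actions=4, k=20):
        super().__init__()
        # Input channels: reward map R plus current value map V.
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size=3,
                                padding=1, bias=False)
        self.k = k

    def forward(self, reward):                  # reward: (B, 1, H, W)
        value = torch.zeros_like(reward)        # V_0 = 0
        for _ in range(self.k):
            q = self.q_conv(torch.cat([reward, value], dim=1))  # (B, A, H, W)
            value, _ = q.max(dim=1, keepdim=True)  # max pool over channels
        return value
```

The learned convolution kernels play the role of the transition model P(s′ | s, a), which is what makes the whole planner trainable from data.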

SLIDE 17

Experimental Setup: Overview

  • Trained and tested in static, simulated real-world environments
  • Testing environments are different from the training environments
  • Robot:
    • Lives in a grid world; motion is discrete
    • Has 4 macro-actions: go forward, turn left, turn right, stay in place
    • Has access to precise egomotion
    • Has RGB and/or depth cameras
  • All models are trained using DAGGER
  • Geometric task: the goal is sampled to be at most 32 time steps away; the agent is run for 39 time steps
  • Semantic task: 'go to a chair'; the agent is run for 39 time steps

SLIDE 18

Experimental Setup: Dataset

Stanford Building Parser Dataset

SLIDE 19

Experimental Setup: Policy Training

Use DAGGER²

[Figure: illustration of the DAGGER data-aggregation loop³ — a sketch follows below]

²Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning". In: AISTATS. Vol. 1, no. 2, 2011, p. 6.

³Image from John Schulman's lecture on reinforcement learning
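In outline, DAGGER rolls out the current policy, asks an expert (here, a shortest-path planner on the known training maps) to label every visited state, and retrains on the aggregated dataset. A schematic sketch, with all names illustrative:

```python
def dagger(policy, expert, envs, n_iters=10, episode_len=39):
    """DAGGER: aggregate expert-labeled states from rollouts of the
    learned policy, then retrain the policy on everything collected."""
    dataset = []
    for _ in range(n_iters):
        for env in envs:
            state = env.reset()
            for _ in range(episode_len):
                # The learned policy decides where the agent actually goes,
                # so training states match the states it visits at test time.
                action = policy.act(state)
                # The expert supplies the supervision for that state.
                dataset.append((state, expert.act(state)))
                state = env.step(action)
        policy.fit(dataset)   # supervised learning on the aggregate
    return policy
```

Because the policy, not the expert, drives the rollouts, the agent learns to recover from its own mistakes, which is what enables the backtracking behavior shown later.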

SLIDE 20

Results

SLIDE 21

Mapper Unit Test

[Figure: mapper unit test — ground truth free space vs. analytic projection, prediction from RGB, and prediction from depth]

SLIDE 22

Navigation Results: Geometric Task

Method                Mean         75th %ile     Success %
                      RGB   Depth  RGB   Depth   RGB   Depth
Initial               25.3  25.3   30    30      0.7   0.7
No Image LSTM         20.8  20.8   28    28      6.2   6.2
Reactive (1 frame)    20.9  17.0   28    26      8.2   21.9
Reactive (4 frames)   14.4   8.8   25    18      31.4  56.9
LSTM                  10.3   5.9   21     5      53.0  71.8
Our (CMP)              7.7   4.8   14     1      62.5  78.3

Geometric results: mean distance to the goal location, 75th percentile distance to the goal, and success rate after executing the policy for 39 time steps.

SLIDE 23

Navigation Results: Semantic Task

Method       Mean         75th %ile     Success %
             RGB   Depth  RGB   Depth   RGB   Depth
Initial      16.2  16.2   25    25      11.3  11.3
Reactive     14.2  14.2   22    23      23.4  22.3
LSTM         13.5  13.4   20    23      23.5  27.2
Our (CMP)    11.3  11.0   18    19      34.2  40.0

Semantic results (aggregate): mean distance to the goal location, 75th percentile distance to the goal, and success rate after executing the policy for 39 time steps.

SLIDE 24

Successful Navigations

Agents exhibit backtracking behavior!

SLIDE 25

Failure Cases

Three failure modes: missing the goal, thrashing, and tight spaces.

SLIDE 26

Video Demo

SLIDE 27

Demo

Video Demonstration

SLIDE 28

Summary

SLIDE 29

Summary

  • Joint, fully end-to-end neural network policy for mapping and planning
  • Uses a mapping module to map from RGB and/or depth images to a top-down egocentric belief map
  • Uses a Value Iteration Network to plan in the belief map generated by the mapper
  • Trains the end-to-end policy using DAGGER

SLIDE 30

Questions?

SLIDE 31

Quiz

  • Why was DAGGER used to train the models?
    1. Other training methods were not possible
    2. To allow the agent to recover from bad decisions (backtracking)
    3. To minimize crashes in simulation
    4. Because it has a cool name
  • True or false: the model was trained end-to-end, allowing the mapping module to encode whatever was most useful to the planning module.
    1. True
    2. False