Few-shot Object Reasoning for Robot Instruction Following




slide-1
SLIDE 1

Few-shot Object Reasoning for Robot Instruction Following

Yoav Artzi Workshop on Spatial Language Understanding EMNLP 2020

slide-2
SLIDE 2
Task

  • Navigation between landmarks
  • Agent: quadcopter drone
  • Inputs: poses, raw RGB camera images, and natural language instructions

slide-3
SLIDE 3

Task

go straight and stop before reaching the planter turn left towards the globe and go forward until just before it

slide-4
SLIDE 4

Mapping Instructions to Control

  • The drone maintains a configuration of target velocities (v, ω): linear forward velocity v and angular yaw rate ω
  • Each action updates the configuration or stops
  • Goal: learn a mapping from inputs to configuration updates

go straight and stop before reaching the planter turn left globe …

$$f(\text{instruction}, \text{image}, \text{pose}) = (v_t, \omega_t) \ \text{or STOP}$$
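The action interface described on this slide can be sketched as follows. This is a minimal illustration, not the talk's implementation: `VelocityConfig`, `STOP`, and `apply_action` are hypothetical names, and zeroing the velocities on stop is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VelocityConfig:
    v: float = 0.0      # linear forward velocity
    omega: float = 0.0  # angular yaw rate


# Hypothetical sentinel for the stop action.
STOP = None


def apply_action(config, action):
    """Return (new_config, stopped) after applying one action.

    An action is either STOP or a (v, omega) pair that replaces the
    current target velocities.
    """
    if action is STOP:
        # Assumption: stopping zeroes both target velocities.
        return VelocityConfig(0.0, 0.0), True
    v, omega = action
    return VelocityConfig(v, omega), False
```

A policy would call `apply_action` once per step until it emits `STOP`.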

slide-5
SLIDE 5

Modular Approach

Figure: Instruction feeding separate components: Language Understanding, Mapping, Perception, Planning, Control

  • Build/train separate components
  • Symbolic meaning representation
  • Complex integration
slide-6
SLIDE 6

Single-model Approach (a.k.a. end-to-end)

Instruction → f → Action

How to think of extensibility, interpretability, and modularity when packing everything in a single model?

slide-7
SLIDE 7

Single-model Approach

  • Extensibility: extending the model to reason about new objects after training
  • Interpretability: viewing how the model reasons about object grounding and trajectories
  • Modularity: re-using parts of the model

Within a representation learning framework

slide-8
SLIDE 8

Representation: Design vs. Learning

  • Systems that use symbolic representations are interpretable and (potentially) extensible
  • However: representation design of every possible concept is brittle and hard to scale
  • Instead: design the most general concepts and let representation learning fill them with content
  • Today, two concepts: objects and trajectories
slide-9
SLIDE 9

Today

Few-shot instruction following:

  • Few-shot language-conditioned object segmentation
  • Object context mapping
  • Integration into a visitation-prediction policy for mapping instructions to drone control

slide-10
SLIDE 10

Language-conditioned Object Segmentation

  • Input: instruction and observation images
  • Goal: identify and align objects and references
slide-11
SLIDE 11

Few-shot Version

  • Input: instruction, observation images, and database
  • Goal: identify previously unseen objects and mentions and align them

Database: orange cup plant pot blue ball planet earth

slide-12
SLIDE 12

Alignment via a Database

  • Approach: align observations and references through the database
  • Adding objects to the database extends the alignment ability
  • Requires only adding a few image and language exemplars

Database: orange cup plant pot blue ball planet earth

slide-13
SLIDE 13

Alignment via a Database

  • Approach: align observations and references through the database
  • Adding objects to the database extends the alignment ability
  • Requires only adding a few image and language exemplars

Database: orange cup plant pot blue ball planet earth
New entries: Melon wedge the fruit slice watermelon; the red lego red cube red brick

slide-14
SLIDE 14

Alignment Score

go straight and stop before reaching the planter turn left towards the globe and go forward until just before it

Database: orange cup plant pot blue ball planet earth

Figure labels: bounding box, reference, object record

$$\mathrm{Align}(b, r) = \sum_{o} P(b \mid o)\, P(o \mid r)$$

where b is a bounding box, r is a reference, and o is a database object
slide-15
SLIDE 15

Alignment Score

go straight and stop before reaching the planter turn left towards the globe and go forward until just before it

Database: orange cup plant pot blue ball planet earth

Figure labels: bounding box, reference, object record

$$\mathrm{Align}(b, r) = \sum_{o} \frac{P(o \mid b)\, P(b)\, P(o \mid r)}{P(o)}$$

where b is a bounding box, r is a reference, and o is a database object
slide-16
SLIDE 16

Alignment Score

  • Region proposal network gives bounding boxes and P(b)
  • P(o) is uniform

$$\mathrm{Align}(b, r) = \sum_{o} \frac{P(o \mid b)\, P(b)\, P(o \mid r)}{P(o)}$$

where b is a bounding box, r is a reference, and o is a database object

Database: orange cup plant pot blue ball planet earth
slide-17
SLIDE 17

Alignment Score

  • P(o | b) is computed using visual similarity
  • Using Kernel Density Estimation with a symmetric multivariate Gaussian kernel
  • P(o | r) is computed similarly using text similarity with pre-trained embeddings

$$\mathrm{Align}(b, r) = \sum_{o} \frac{P(o \mid b)\, P(b)\, P(o \mid r)}{P(o)}$$

where b is a bounding box, r is a reference, and o is a database object

Database: orange cup plant pot blue ball planet earth
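As a rough illustration of the alignment score, the sketch below estimates P(o | b) and P(o | r) by normalizing Gaussian kernel density scores over database objects and then evaluates the sum over o. Everything here is an assumption for illustration: the function names, the isotropic kernel with bandwidth `sigma`, the toy 2-D feature vectors, and treating both P(b) and P(o) as uniform. The actual feature extractors and bandwidths are not given in the slides.

```python
import math


def gaussian_kde(x, exemplars, sigma=1.0):
    """Mean (unnormalized) isotropic Gaussian kernel value between x and exemplars."""
    total = 0.0
    for e in exemplars:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, e))
        total += math.exp(-sq_dist / (2 * sigma ** 2))
    return total / len(exemplars)


def posterior_over_objects(x, database, sigma=1.0):
    """P(o | x): KDE scores normalized over objects, assuming uniform P(o)."""
    scores = {o: gaussian_kde(x, ex, sigma) for o, ex in database.items()}
    z = sum(scores.values())
    return {o: s / z for o, s in scores.items()}


def align(p_o_given_b, p_o_given_r, num_boxes):
    """Align(b, r) = sum_o P(o|b) P(b) P(o|r) / P(o) with uniform P(b), P(o)."""
    p_b = 1.0 / num_boxes
    p_o = 1.0 / len(p_o_given_b)
    return sum(p_o_given_b[o] * p_b * p_o_given_r[o] / p_o for o in p_o_given_b)
```

Adding an object then amounts to adding a new key with a few exemplar vectors to each database; no parameters are updated.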
slide-18
SLIDE 18

Mask Refinement

  • Refine each bounding box with a UNet model
  • Gives a tight object mask
  • Paired with a bounded alignment score to a reference in the text

go straight and stop before reaching the planter turn left towards the globe and go forward until just before it

Figure: UNet output mask with Align = 0.7

slide-19
SLIDE 19

Learning

  • Region proposal network parameters for bounding box proposal
  • Image similarity measure for P(o | b)
  • UNet parameters for mask refinement
  • Text similarity uses pre-trained embeddings
  • Challenge: need large-scale heavily annotated visual data

$$\mathrm{Align}(b, r) = \sum_{o} \frac{P(o \mid b)\, P(b)\, P(o \mid r)}{P(o)}$$

where b is a bounding box, r is a reference, and o is a database object
slide-20
SLIDE 20

Augmented Reality Training Data

Figure panels: FPV, Overlay, Composite, Mask labels

slide-21
SLIDE 21

Augmented Reality Training Data

Figure panels: Composite, Mask labels

Learned representations generalize beyond specific objects for:

  • Region proposal network for bounding boxes
  • Image similarity measure for P(o | b)
  • UNet parameters for mask refinement

Large-scale generation with ShapeNet objects

slide-22
SLIDE 22

Today

Few-shot instruction following:

  • Few-shot language-conditioned object segmentation
  • Object context mapping
  • Integration into a visitation-prediction policy for mapping instructions to drone control

slide-23
SLIDE 23

Object Context Mapping

  • 1. Identify and align object mentions to observations
  • 2. Compute abstract contextual representations for object references
  • 3. Project and aggregate masks over time
  • 4. Combine aggregated masks with contextual representations to create a map

Goal: create maps that capture object location and the instruction behavior around objects

slide-24
SLIDE 24

Object Context Mapping

Step I: Identify and Align

  • Bounding box proposals from Region Proposal Network
  • Object references from tagger
  • Align with language-conditioned segmentation and the database
  • To compute: first-person masks aligned to instruction references

Database: orange cup plant pot blue ball planet earth

slide-25
SLIDE 25

Object Context Mapping

Step II: Abstract Contextual Representations

  • Replace references with object placeholders
  • Compute bi-directional RNN representations for all tokens
  • The hidden state for each placeholder is the object context representation

… reaching the planter turn left towards the globe and …

… reaching ObjectA left towards ObjectB and …

Abstract references
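The placeholder substitution can be sketched in a few lines. This is an illustrative sketch only: `abstract_references` is a hypothetical helper, the reference spans are assumed to come from an upstream tagger as non-overlapping `(start, end)` token indices, and the letter-based naming limits it to 26 objects.

```python
def abstract_references(tokens, reference_spans):
    """Replace each (start, end) token span with ObjectA, ObjectB, ...

    Returns the abstracted token list and a map from placeholder name to
    the original reference text.
    """
    out, mapping, cursor = [], {}, 0
    for k, (start, end) in enumerate(sorted(reference_spans)):
        out.extend(tokens[cursor:start])        # copy tokens before the span
        name = "Object" + chr(ord("A") + k)     # ObjectA, ObjectB, ...
        out.append(name)
        mapping[name] = " ".join(tokens[start:end])
        cursor = end
    out.extend(tokens[cursor:])                 # copy the remaining tokens
    return out, mapping


tokens = "reaching the planter turn left towards the globe and".split()
abstracted, mapping = abstract_references(tokens, [(1, 3), (6, 8)])
```

The bi-directional RNN would then run over `abstracted`, and the hidden state at each placeholder position serves as that object's context representation.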

slide-26
SLIDE 26

Object Context Mapping

Step III: Projection and Aggregation

  • Projection from first-person camera masks to third-person environment ground with pinhole camera model
  • Deterministic aggregation

Figure: First-person Masks → Pinhole camera projection

slide-27
SLIDE 27

Object Context Mapping

Step III: Projection and Aggregation

  • Projection from first-person camera masks to third-person environment ground with pinhole camera model
  • Deterministic aggregation

Figure: First-person Masks → Pinhole camera projection → Projected Masks (time t) → Integrator

slide-28
SLIDE 28

Object Context Mapping

Step III: Projection and Aggregation

  • Projection from first-person camera masks to third-person environment ground with pinhole camera model
  • Deterministic aggregation

Figure: Pinhole camera → Projected Masks (time t) → Integrator, combining Masks (time t-1) into Masks (time t)
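The projection step can be illustrated with a standard pinhole model: each mask pixel defines a viewing ray, and the ray is intersected with the ground plane. The sketch below assumes a camera at height `cam_height` pitched down by `pitch` radians, with focal length `f` and principal point `(cx, cy)` in pixels. The frame conventions and the function name are assumptions for illustration, not details taken from the talk.

```python
import math


def project_pixel_to_ground(u, v, f, cx, cy, cam_height, pitch):
    """Intersect the viewing ray of pixel (u, v) with the ground plane z = 0.

    Camera frame: x right, y down, z forward.
    World frame: x forward, y left, z up; camera at (0, 0, cam_height).
    Returns (x, y) on the ground, or None if the ray never reaches it.
    """
    xc = (u - cx) / f
    yc = (v - cy) / f
    # Ray direction in the world frame for a camera pitched down by `pitch`.
    dx = math.cos(pitch) - yc * math.sin(pitch)
    dy = -xc
    dz = -(math.sin(pitch) + yc * math.cos(pitch))
    if dz >= 0:
        return None  # ray points at or above the horizon
    t = cam_height / -dz
    return (t * dx, t * dy)
```

Applying this to every pixel of a first-person mask gives its footprint on the ground map; aggregation over time is a separate step not shown here.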

slide-29
SLIDE 29

Object Context Mapping

Step IV: Combine Object Representations

  • Each position is a product of a mask value and its aligned object context representation

… reaching ObjectA left towards ObjectB and …
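A minimal sketch of this combination, assuming each object has one aggregated ground-plane mask (an H×W grid of values in [0, 1]) and one context vector of length D. Summing the contributions of different objects at the same position is an assumption here; `build_context_map` is a hypothetical name.

```python
def build_context_map(masks, contexts):
    """Return an H x W grid of D-vectors: sum_i masks[i][y][x] * contexts[i]."""
    h, w, d = len(masks[0]), len(masks[0][0]), len(contexts[0])
    out = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    for mask, ctx in zip(masks, contexts):
        for y in range(h):
            for x in range(w):
                m = mask[y][x]
                if m:  # skip empty cells
                    for k in range(d):
                        out[y][x][k] += m * ctx[k]
    return out
```

Cells where no object is observed stay at the zero vector, so the map carries only location and instruction context, not object identity.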

slide-30
SLIDE 30

Object Context Map

  • Map information abstracts over reference content stripped from instruction
  • Includes for each object the context of its reference in the instruction
  • Tells the agent how to behave around the object
  • Policy remains blind to the object itself
slide-31
SLIDE 31

Today

Few-shot instruction following:

  • Few-shot language-conditioned object segmentation
  • Object context mapping
  • Integration into a visitation-prediction policy for mapping instructions to drone control

slide-32
SLIDE 32

Two-stage Policy

Mapping and Plan Generation Action Generation Stage I Stage II

Instruction Action

  • 1. Map and predict states likely to visit + track observability
  • 2. Generate actions to visit high-probability states and explore

Visitation Distributions Observation Mask Few-shot Segmentation

slide-33
SLIDE 33
Visitation Distributions

  • The state-visitation distribution d(s; π, s0) is the probability of visiting state s following policy π from start state s0
  • Predicting d(s; π*, s0) for an expert policy π* tells us the states to visit to complete the task
  • We compute two distributions: trajectory-visitation and goal-visitation
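To make the two distributions concrete, here is a toy estimate from sampled trajectories over discrete states. It is a sketch under stated assumptions: visit counts are normalized to sum to one over states, and the goal distribution is taken over final states only. The function names are hypothetical, and the talk predicts these distributions with a network rather than counting.

```python
from collections import Counter


def trajectory_visitation(trajectories):
    """Normalized visit counts over all states in the sampled trajectories."""
    counts = Counter(s for traj in trajectories for s in traj)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def goal_visitation(trajectories):
    """Normalized counts over the final state of each trajectory."""
    counts = Counter(traj[-1] for traj in trajectories)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


demos = [[(0, 0), (0, 1), (1, 1)], [(0, 0), (1, 0), (1, 1)]]
d_traj = trajectory_visitation(demos)
d_goal = goal_visitation(demos)
```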

slide-34
SLIDE 34

Visitation Distributions

  • Distributions reflect the agent plan
  • Model path and goal observability
  • Refined as observing more of the environment

Figure panels: Trajectory distribution, Goal distribution

slide-35
SLIDE 35

Stage I: Mapping and Plan Generation

Plan Generation Action Generation Stage II

Instruction

Mapping

  • Few-shot language-conditioned segmentation to construct an object context map
  • Predict distribution over map positions

Trajectory distribution Goal distribution

Few-shot Segmentation

Abstract Instruction

slide-36
SLIDE 36

Plan Generation

  • Cast distribution prediction as image generation
  • LingUNet: an image-to-image encoder-decoder
  • Visual reasoning at multiple image scales
  • Conditioned on language input at all levels of reasoning using text-based convolutions
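The idea of a text-based convolution can be sketched as a 1×1 convolution whose kernel is produced from a text embedding by a linear map. This is a simplified, single-scale illustration in plain Python; the names, shapes, and the choice of a 1×1 kernel are assumptions, and the full LingUNet applies this at several image scales inside an encoder-decoder.

```python
def text_kernel(text_vec, weight, c_out, c_in):
    """Linear map from a text embedding to a (c_out x c_in) 1x1 kernel.

    `weight` has c_out * c_in rows, each of length len(text_vec).
    """
    flat = [sum(w * t for w, t in zip(row, text_vec)) for row in weight]
    return [flat[i * c_in:(i + 1) * c_in] for i in range(c_out)]


def conv1x1(features, kernel):
    """Apply the kernel at every position of an H x W grid of c_in-vectors."""
    return [
        [[sum(k * f for k, f in zip(krow, px)) for krow in kernel] for px in row]
        for row in features
    ]


kernel = text_kernel([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], c_out=1, c_in=2)
filtered = conv1x1([[[3.0, 4.0]]], kernel)
```

Because the kernel depends on the instruction, the same feature map is filtered differently for different instructions.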

slide-37
SLIDE 37

LingUNet

Figure: LingUNet architecture. Labels: Object Map, Convolutions, Instruction, RNN, Text Kernels, Text Convolutions, Deconvolutions, SoftMax, Visitation Distributions

slide-38
SLIDE 38

Two-stage Policy

Mapping and Plan Generation Action Generation Stage I Stage II

Instruction Action

  • 1. Map and predict states likely to visit + track observability
  • 2. Generate actions to visit high-probability states and explore

Visitation Distributions Observation Mask Few-shot Segmentation

slide-39
SLIDE 39

Stage II: Action Generation

  • Relatively simple control problem without language
  • Transform and crop to agent perspective and generate configuration update

Trajectory distribution Goal distribution

Egocentric Transform

Control Network
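The egocentric transform can be sketched as a 2-D rigid transform from the world frame into the agent frame. The function below is a hypothetical illustration that operates on point coordinates; the talk applies the transform (and a crop) to the predicted distribution images, whose exact resampling is not described in the slides.

```python
import math


def to_egocentric(points, agent_xy, agent_yaw):
    """Express world-frame (x, y) points in the agent frame (x forward, y left)."""
    ax, ay = agent_xy
    c, s = math.cos(agent_yaw), math.sin(agent_yaw)
    out = []
    for x, y in points:
        dx, dy = x - ax, y - ay
        # Rotate by -agent_yaw so the agent's heading becomes the +x axis.
        out.append((c * dx + s * dy, -s * dx + c * dy))
    return out
```

Cropping then reduces to keeping only the points (or map cells) within a fixed window around the origin of the agent frame.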

slide-40
SLIDE 40

Training

Instruction

Control Network

Mask Visitations

Plan Generation Mapping

Trained separately

Object Context Map

RNN LingUNet CNN+MLP

Few-shot Segmentation

Abstract Instruction

slide-41
SLIDE 41

Training in Simulation

  • Language-conditioned segmentation trained separately for simulation and real environment
  • Policy training does not require access to real world
  • After training: swap the segmentation component
  • Data: demonstrations and experience

Go between the mushroom and flower chair the tree all the way up to the phone booth

slide-42
SLIDE 42

Supervised Learning Reinforcement Learning

SuReAL

Supervised and Reinforcement Asynchronous Learning

Instruction

Control Network

Mask Visitations

Plan Generation Mapping

Trained separately

Object Context Map

RNN LingUNet CNN+MLP

Few-shot Segmentation

Abstract Instruction

slide-43
SLIDE 43

Supervised Learning

Instruction

Plan Generation Mapping

Trained separately RNN LingUNet

Few-shot Segmentation

Objective: generate visitation distributions
Data: simulation states paired with visitation predictions

Cross-entropy loss

Demonstration Visitations

slide-44
SLIDE 44

Reinforcement Learning

RL for Control

Instruction

Control Network

Mask Visitations

Plan Generation Mapping

Trained separately

Object Context Map

RNN LingUNet CNN+MLP

Few-shot Segmentation

Abstract Instruction

Intrinsic reward

slide-45
SLIDE 45

Supervised Learning Reinforcement Learning

SuReAL

Supervised and Reinforcement Asynchronous Learning

Instruction

Control Network Plan Generation Mapping

Trained separately RNN LingUNet CNN+MLP

Few-shot Segmentation Periodic parameter updates Sampled action sequences

slide-46
SLIDE 46

SuReAL

Supervised and Reinforcement Asynchronous Learning

  • Stage I: learn to predict visitation distributions based on noisy predicted execution trajectories
  • Stage II: learn to predict actions using predicted visitation distributions

Periodic parameter updates
Replace gold action sequences with sampled

slide-47
SLIDE 47

Experimental Setup

  • Intel Aero quadcopter
  • Vicon motion capture for pose estimate
  • Simulation with Microsoft AirSim
  • Drone cage is 4.7x4.7m
  • All evaluation with eight new objects
  • Database includes five images and five phrases for each object
  • Training data: 41k instruction-demonstration pairs in simulation, no demonstration data in the real world

slide-48
SLIDE 48

Human Evaluation

  • Score path and goal on a 5-point Likert scale for 63 examples
  • Our model receives path scores of 4-5 53% of the time, twice as often as PVN2-SEEN, showing effective generalization to unknown objects
  • Outperforming PVN2-ALL illustrates the benefit of the object-centric inductive bias

slide-49
SLIDE 49

Example

slide-50
SLIDE 50

Messy Example

slide-51
SLIDE 51

Failure

slide-52
SLIDE 52

Today

Few-shot instruction following:

  • Few-shot language-conditioned object segmentation
    Modeling objects and aligning their references and observations + training with augmented reality data
  • Object context mapping
    Incorporate contextual text information into spatial map without specific object information
  • Integration into a visitation-prediction policy for mapping instructions to drone control
    Generate trajectory plans over object context map + train in simulation only by swapping the segmentation component

slide-53
SLIDE 53

Some Open Questions

  • How to elicit exemplars to add to the database from human users, potentially within interaction?
  • How to generalize from objects to more general object types?
  • What other object properties should we model? Such as permanence and reference consistency

slide-54
SLIDE 54

The Papers

  • Few-shot Object Grounding for Mapping Natural Language Instructions to Robot Control. Valts Blukis, Ross A. Knepper, and Yoav Artzi. CoRL, 2020.
  • Learning to Map Natural Language Instructions to Physical Quadcopter Control Using Simulated Flight. Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross A. Knepper, and Yoav Artzi. CoRL, 2019.
  • Mapping Navigation Instructions to Continuous Control Actions with Position Visitation Prediction. Valts Blukis, Dipendra Misra, Ross A. Knepper, and Yoav Artzi. CoRL, 2018.
  • Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning. Valts Blukis, Nataly Brukhim, Andrew Bennett, Ross A. Knepper, and Yoav Artzi. RSS, 2018.

slide-55
SLIDE 55

Valts Blukis

And collaborators: Dipendra Misra, Eyvind Niklasson, Nataly Brukhim, Andrew Bennett, and Ross Knepper

Thank you! Questions? https://github.com/lil-lab/drif

slide-56
SLIDE 56

[fin]

slide-57
SLIDE 57

Object Database

slide-58
SLIDE 58

The object database used during development in the physical environment.

slide-59
SLIDE 59

The object database used during testing, containing previously unseen physical objects.

slide-60
SLIDE 60

Visitation Distributions

slide-61
SLIDE 61
Visitation Distribution

  • Given a Markov Decision Process (MDP) with states S, actions A, reward R, and horizon H:
  • The state-visitation distribution d(s; π, s0) is the probability of visiting state s following policy π from start state s0
  • Predicting d(s; π*, s0) for an expert policy π* tells us the states to visit to complete the task
  • Can learn from demonstrations, but prediction is generally impossible: S is very large!

slide-62
SLIDE 62

Approximating Visitation Distributions

  • Solution: approximate the state space
  • Use an approximate state space S̃ and a mapping ϕ : S → S̃ between the state spaces
  • For a well-chosen ϕ, a policy π with a state-visitation distribution close to d(s̃; π*, s̃0) has bounded sub-optimality

MDP: states S, actions A, reward R, horizon H

slide-63
SLIDE 63

Visitation Distribution for Navigation

  • S̃ is a set of discrete positions in the world
  • We compute two distributions: trajectory-visitation and goal-visitation

MDP: states S, actions A, reward R, horizon H

Figure: two-stage policy from Instruction to Action. Stage I: Planning with Position Visitation Prediction, producing trajectory probability and goal probability over S̃. Stage II: Action Generation.
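One simple way to picture the mapping ϕ for navigation is a function that discretizes a continuous position into a grid cell. The sketch below is only an illustration: `phi`, the square grid, the cell size, and the clamping at the borders are all assumptions, and the slides do not specify the discretization used.

```python
def phi(x, y, cell_size, grid_size):
    """Map a continuous position (meters) to a clamped grid cell (row, col)."""
    col = min(max(int(x // cell_size), 0), grid_size - 1)
    row = min(max(int(y // cell_size), 0), grid_size - 1)
    return row, col
```

A visitation distribution over the approximate state space is then an image-like grid with one probability per cell, which is what makes casting its prediction as image generation possible.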

slide-64
SLIDE 64

Drone Related Work

(Somewhat outdated)

slide-65
SLIDE 65

Related Work: Task

  • Mapping instructions to actions with robotic agents
    Tellex et al. 2011; Matuszek et al. 2012; Duvallet et al. 2013; Walter et al. 2013; Misra et al. 2014; Hemachandra et al. 2015; Lignos et al. 2015
  • Mapping instructions to actions in software and simulated environments
    MacMahon et al. 2006; Branavan et al. 2010; Matuszek et al. 2010, 2012; Artzi et al. 2013, 2014; Misra et al. 2017, 2018; Anderson et al. 2017; Suhr and Artzi 2018
  • Learning visuomotor policies for robotic agents
    Lenz et al. 2015; Levine et al. 2016; Bhatti et al. 2016; Nair et al. 2017; Tobin et al. 2017; Quillen et al. 2018; Sadeghi et al. 2017

slide-66
SLIDE 66

Related Work: Method

  • Mapping and planning in neural networks
    Bhatti et al. 2016; Gupta et al. 2017; Khan et al. 2018; Savinov et al. 2018; Srinivas et al. 2018
  • Model and learning decomposition
    Pastor et al. 2009, 2011; Konidaris et al. 2012; Paraschos et al. 2013; Maeda et al. 2017
  • Learning to explore
    Knepper et al. 2015; Nyga et al. 2018

slide-67
SLIDE 67

Drone Data Collection

slide-68
SLIDE 68

Data

  • Crowdsourced with a simplified environment and agent
  • Two-step data collection: writing and validation/segmentation

Go towards the pink flowers and pass them on your left, between them and the ladder. Go left around the flower until you're pointed towards the bush, going between the gorilla and the traffic cone. Go around the bush, and go in between it and the apple, with the apple on your right. Turn right and go around the apple.

slide-69
SLIDE 69

Data

  • Crowdsourced with a simplified environment and agent
  • Two-step data collection: writing and validation/segmentation

Go towards the pink flowers and pass them on your left, between them and the ladder. Go left around the flower until you're pointed towards the bush, going between the gorilla and the traffic cone. Go around the bush, and go in between it and the apple, with the apple on your right. Turn right and go around the apple.

slide-70
SLIDE 70

CoRL 2018 Experiments

slide-71
SLIDE 71

Experimental Setup

  • Crowdsourced instructions and demonstrations
  • 19,758/4,135/4,072 train/dev/test examples
  • Each environment includes 6-13 landmarks
  • Quadcopter simulation with AirSim
  • Metric: task-completion accuracy
slide-72
SLIDE 72

Test Results

Success rate (%):

  • STOP: 5.72
  • Average: 16.43
  • Chaplot et al. 2018: 21.34
  • Blukis et al. 2018: 24.36
  • Our Approach: 41.21

  • Explicit mapping helps performance
  • Explicit planning further improves performance

slide-73
SLIDE 73

Synthetic vs. Natural Language

  • Synthetically generated instructions with templates
  • Evaluated with explicit mapping (Blukis et al. 2018)
  • Using natural language is significantly more challenging
  • Not only a language problem, trajectories become more complex

Success rate (%):

  • Synthetic Language: 79.2
  • Natural Language: 24.36

slide-74
SLIDE 74

Ablations

Development Results

  • The language is being used effectively
  • Auxiliary objectives help with credit assignment

Success rate (%):

  • Our Approach: 40.44
  • w/o imitation learning: 38.87
  • w/o goal distribution: 35.98
  • w/o auxiliary objectives: 30.77
  • w/o language: 23.07

slide-75
SLIDE 75

Analysis

Development Results

  • Better control can improve performance
  • Observing the environment, potentially through exploration, remains a challenge

Success rate (%):

  • Our Approach: 40.44
  • Ideal Actions: 45.7
  • Fully Observable: 60.59

slide-76
SLIDE 76

CoRL 2019 Experiments

slide-77
SLIDE 77

Environment

  • Drone cage is 4.7x4.7m
  • Created in reality and simulation
  • 15 possible landmarks, 5-8 in each environment
  • Also: larger 50x50m simulation-only environment with 6-13 landmarks out of possible 63

slide-78
SLIDE 78

Data

  • Real environment training data includes 100 instruction paragraphs, segmented to 402 instructions
  • Evaluation with 20 paragraphs
  • Evaluate on concatenated consecutive segments
  • Oracle trajectories from a simple carrot planner
  • Much more data in simulation, including for a larger 50x50m environment

slide-79
SLIDE 79

Evaluation

  • Two automated metrics
  • SR: success rate
  • EMD: path earth mover's distance
  • Human evaluation: score path and goal on a 5-point Likert scale

slide-80
SLIDE 80

Human Evaluation

  • Score path and goal on a 5-point Likert scale for 73 examples
  • Our model receives five-point path scores 37.8% of the time, a 24.8% improvement over PVN2-BC
  • Improvements over PVN2-BC illustrate the benefit of SuReAL and the exploration reward

slide-81
SLIDE 81

Observability

  • Big benefit when goal is not immediately observed
  • However, complexity comes at small performance cost on easier examples

slide-82
SLIDE 82

Test Results

Success rate (%):

  • Average: 16.7
  • PVN-BC: 20.8
  • PVN2-BC: 29.2
  • Our Approach: 30.6

EMD:

  • Average: 0.71
  • PVN-BC: 0.61
  • PVN2-BC: 0.59
  • Our Approach: 0.52

  • SR often too strict: 30.6% compared to 39.7% five-point scores on goal
  • EMD performance generally more reliable, but still fails to account for semantic correctness

slide-83
SLIDE 83

Simple vs. Complex Instructions

  • Performance on easier single-segment instructions is much higher
  • Instructions are shorter and trajectories simpler

Success rate (%):

  • 1-segment Instructions: 56.5
  • 2-segment Instructions: 30.6

EMD:

  • 1-segment Instructions: 0.34
  • 2-segment Instructions: 0.52

slide-84
SLIDE 84

Transfer Effects

  • Visual and flight dynamics transfer challenges remain
  • Even Oracle shows a drop in performance from 0.17 EMD in the simulation to 0.23 in the real environment

Success rate (%):

  • Simulator: 33.3
  • Real: 30.6

EMD:

  • Simulator: 0.42
  • Real: 0.52

slide-85
SLIDE 85

CoRL 2019 Examples

slide-86
SLIDE 86

Cool Example

once near the rear of the gorilla turn right and head towards the rock stopping once near it

slide-87
SLIDE 87

Failure

head towards the area just to the left of the mushroom and then loop around it

slide-88
SLIDE 88

CoRL 2019 Sim-real Shift Examples

slide-89
SLIDE 89

Sim-real Control Shift

when you reach the right of the palm tree take a sharp right when you see a blue box head toward it

slide-90
SLIDE 90

Sim-real Control Shift

make a right at the rock and head towards the banana