SLIDE 1

Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation

Alane Suhr and Yoav Artzi

SLIDE 2

Executing Context-Dependent Instructions

Task: map a sequence of instructions to actions

Existing work: symbolic representations; modeling context
Today: system actions; learning from exploration

SLIDE 3

Executing a Sequence of Instructions

[Figure: seven beakers, numbered 1–7]

1. Empty out the leftmost beaker of purple chemical
2. Then, add the contents of the first beaker to the second
3. Mix it
4. Then, drain 1 unit from it
5. Same for 1 more unit


SLIDE 11

Problem Setup

  • Task: follow a sequence of instructions
  • Learning from instructions and corresponding world states

Empty out the leftmost beaker of purple chemical. Then, add the contents of the first beaker to the second. Mix it. Then, drain 1 unit from it. Same for 1 more unit.


SLIDE 17

Related Work

  • Context-dependent language understanding
    • Static environments (e.g., a large database): Miller et al. 1996, Zettlemoyer and Collins 2009, Suhr et al. 2018
    • Environments that change over time while instructions are given: Long et al. 2016, Guu et al. 2017, Fried et al. 2018
  • Following instructions in isolation, with varying levels of supervision: Chen and Mooney 2011, Chen 2012, Artzi and Zettlemoyer 2013, Artzi et al. 2014, Andreas and Klein 2015, Bisk et al. 2016, Misra et al. 2017

SLIDE 18

Today

  1. Attention-based model for generating sequences of system actions that modify the environment
  2. Exploration-based learning procedure that avoids biases learned early in training

SLIDE 19

System Actions

Mix it
pop 2; pop 2; pop 2; push 2 brown; push 2 brown; push 2 brown;

  • Each beaker is a stack
  • Actions are pop and push
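The stack semantics above can be sketched in a few lines of Python. This is a minimal illustration of the action space, not the authors' implementation; the state layout and action tuples are my own assumptions.

```python
# Minimal sketch of the Alchemy action space: each beaker is a stack of
# colored chemical units; the two system actions are pop and push.

def execute(state, action):
    """Apply one system action and return the new world state.

    state: dict mapping beaker index -> list of colors, bottom to top
    action: ("pop", beaker) or ("push", beaker, color)
    """
    state = {k: list(v) for k, v in state.items()}  # copy; states stay immutable
    if action[0] == "pop":
        state[action[1]].pop()
    else:  # push
        state[action[1]].append(action[2])
    return state

# "Mix it" on a three-unit beaker compiles to three pops and three pushes
# of the mixed color, as on the slide:
s = {2: ["purple", "purple", "green"]}
for a in [("pop", 2)] * 3 + [("push", 2, "brown")] * 3:
    s = execute(s, a)
# s[2] is now ["brown", "brown", "brown"]
```

Keeping states immutable (copying before each action) makes it cheap to revisit earlier states, which matters for the exploration procedure later in the talk.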

SLIDE 20

Meaning Representation

Mix it

High-level program (requires representation engineering): mix(prevArg2(2))
vs.
System actions (require learning abstractions): pop 2; pop 2; pop 2; push 2 brown; push 2 brown; push 2 brown;


SLIDE 22

Model

[Figure: previous instructions ("Throw out first beaker", "It turns brown"), current instruction ("Pour sixth beaker into last one"), initial state, and current state]

  • Four inputs
  • Output: a sequence of actions
  • Attend over each input when generating actions

SLIDE 23–37 (model walkthrough)

  • Encode the instructions (previous and current) and the states (initial and current)
  • Initialize the decoder
  • At each step, attend over the current instruction, previous instructions, initial state, and current state
  • Concatenate the decoder state with the four attention vectors; an MLP predicts the next action (e.g., pop 7)
  • Execute the action, update the current state, and attend over the new state at the next step
  • The action decoder continues until the full sequence is generated: pop 7; pop 7; pop 7; push 7 brown; push 7 brown; push 7 brown
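One decoder step of the walkthrough above can be sketched roughly as follows. This is a simplified, dimension-agnostic illustration; the helper names and plain dot-product attention are my own assumptions, not the released model.

```python
import math

def attend(query, keys):
    """Dot-product attention: softmax-weighted sum of the key vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]

def decoder_step(h, current_instr, prev_instrs, init_state, curr_state, mlp):
    """h: decoder state vector; the four inputs are lists of encoded vectors;
    mlp: callable mapping the concatenated features to action scores."""
    contexts = [attend(h, keys) for keys in
                (current_instr, prev_instrs, init_state, curr_state)]
    features = h + [x for c in contexts for x in c]  # concatenate all vectors
    return mlp(features)  # unnormalized scores, one per action
```

After the highest-scoring action is executed, the current-state encoding is recomputed and the next call to `decoder_step` attends over the new state, matching the execute-then-reattend loop on the slides.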

SLIDE 38

Learning from World State Annotation

  • Goal: learn a policy that maps from instructions and environment states to actions
  • Approach: learn through exploring the environment and observing rewards
  • Policy gradient with a contextual bandit objective
  • Challenge: overcome biases acquired early during learning

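The "policy gradient with contextual bandit" objective can be illustrated with a tabular softmax policy. This is my own minimal sketch; the real model is a neural policy conditioned on instructions and world states.

```python
import math

class SoftmaxPolicy:
    """Tabular softmax policy over a fixed action set (illustrative only)."""
    def __init__(self, actions, lr=0.1):
        self.actions, self.lr = actions, lr
        self.theta = {a: 0.0 for a in actions}  # one weight per action

    def probs(self):
        z = sum(math.exp(v) for v in self.theta.values())
        return {a: math.exp(v) / z for a, v in self.theta.items()}

    def update(self, action, reward):
        # Contextual bandit policy gradient: each action is a one-step
        # decision, so the gradient of log pi(action) is weighted by the
        # immediate reward, with no credit assignment over a trajectory.
        # d/d theta_a log pi(action) = 1[a == action] - pi(a)
        p = self.probs()
        for a in self.actions:
            grad = (1.0 if a == action else 0.0) - p[a]
            self.theta[a] += self.lr * reward * grad
```

Repeatedly rewarding one action raises its probability, which is exactly the mechanism behind the early-training bias discussed on the next slides.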

SLIDE 40

Reward Function

R(s, a, s′) = P(s, a, s′) + φ(s′) − φ(s)

s: source state; a: action; s′: target state

SLIDE 41

Reward Function: Problem Reward

P(s, a, s′) = +1 if a stops the sequence and s′ is the goal state
P(s, a, s′) = −1 if a stops the sequence and s′ is not the goal state

SLIDE 42

Reward Function: Shaping Term

φ(s′) − φ(s) = +1 if s′ is closer to the goal state than s (moved closer)
φ(s′) − φ(s) = −1 if s is closer to the goal state than s′ (moved further)
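Under the definitions on these slides, the reward for one transition could be sketched as below. This is a hedged illustration; the `distance` argument is a hypothetical stand-in for the slides' notion of closeness to the goal, and the `"stop"` action name is my own.

```python
# R(s, a, s') = P(s, a, s') + phi(s') - phi(s): the problem reward P fires
# only when the action stops the sequence; the shaping term is +1/-1
# depending on whether the step moved closer to or further from the goal.

def reward(s, a, s_next, goal, distance):
    # Problem reward P: only awarded for the stop action.
    if a == "stop":
        p = 1.0 if s_next == goal else -1.0
    else:
        p = 0.0
    # Shaping term phi(s') - phi(s): +1 if moved closer, -1 if moved further.
    d_next, d_prev = distance(s_next, goal), distance(s, goal)
    shaping = 1.0 if d_next < d_prev else (-1.0 if d_next > d_prev else 0.0)
    return p + shaping
```

Because the shaping term is potential-based (a difference of a state potential), it rewards progress at every step without changing which stopping states are optimal.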

SLIDE 43–45

Learning Example

Add the third beaker to the first
Iteration #1

[Figure: start and goal states, beakers 1–3]

Rollout: pop 2; push 1 green; pop 3; push 1 yellow;
Rewards: −1; −1; +1; −1

Action reward counts so far:
        +   −
pop     1   1
push    0   2

No positive reward for push actions

SLIDE 46–48

Learning Example

Add the third beaker to the first
Iteration #2

Rollout: pop 3; push 1 green; pop 1; push 1 green;
Rewards: +1; −1; +1; −1

Action reward counts so far:
        +   −
pop     3   1
push    0   4

No positive reward for push actions

SLIDE 49

Learning Example

Add the third beaker to the first
Iteration #3

Rollout: pop 3; pop 3; pop 1;
Rewards: +1; +1; −1

Quickly learned a strong bias against push actions

SLIDE 50

Learned Biases

  • Early during learning, the model learns it can get positive reward by predicting pop actions
  • Less likely to get positive reward with push actions
  • Becomes biased against push: during later exploration, push is never sampled!
  • Compounding effect: never learns to generate push actions

SLIDE 51

Single-step Reward Observation

  • Our approach: observe the reward of all actions by looking one step ahead during exploration
  • Observe reward for actions like push

SLIDE 52

Learning Algorithm: Single-step Observation

For each training example:
  1. Rollout: sample a sequence of state-action pairs from the current policy
  2. For each state visited in the rollout, and for each possible action, execute the action and observe the reward
  3. Update parameters based on the observed rewards for all states and actions
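The three steps above can be sketched schematically. The toy one-dimensional environment and tally policy here are purely illustrative stand-ins, not the paper's model.

```python
class TinyEnv:
    """Hypothetical 1-D environment: the state is an int, the goal is 0."""
    def actions(self, s):
        return ["dec", "inc"]
    def execute(self, s, a):
        return s - 1 if a == "dec" else s + 1
    def reward(self, s, a, s_next):
        return 1.0 if abs(s_next) < abs(s) else -1.0

class TallyPolicy:
    """Hypothetical policy: tallies the observed reward of each action."""
    def __init__(self):
        self.score = {"dec": 0.0, "inc": 0.0}
    def sample(self, s):
        return max(self.score, key=self.score.get)
    def update(self, observations):
        for s, a, r in observations:
            self.score[a] += r

def train_example(policy, env, start_state, num_steps=3):
    visited, state = [], start_state
    for _ in range(num_steps):        # 1. rollout from the current policy
        a = policy.sample(state)
        visited.append(state)
        state = env.execute(state, a)
    observations = []                 # 2. one step ahead from every visited state
    for s in visited:
        for a in env.actions(s):
            observations.append((s, a, env.reward(s, a, env.execute(s, a))))
    policy.update(observations)       # 3. update on all observed rewards
```

The key point is step 2: every action at every visited state contributes an observed reward to the update, even actions the policy would currently never sample.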

SLIDE 53

Simple Exploration vs. Single-step Reward Observation

  • Simple exploration: only observe states along the sampled trajectory from the start state
  • Single-step reward observation: observe the sampled states and one step ahead from each


SLIDE 60

Single-step Observation

Add the third beaker to the first
Iteration #4

[Figure: start and goal states, beakers 1–3]

SLIDE 61–71

Single-step Observation

Add the third beaker to the first
Iteration #4

Rollout: pop 3; pop 3; pop 1;

From each state visited in the rollout, observe the single-step reward of every action:
  • At one state: −1 pop 1; −1 pop 2; +1 pop 3; +1 push 1 orange
  • At a later state: −1 pop 1; −1 pop 2; −1 pop 3; +1 push 1 orange
  • At the final state: −1 pop 1; −1 pop 2; −1 pop 3; −1 push 1 orange; +1 push 1 yellow

SLIDE 72

Experimental Setup

  • SCONE (Long et al. 2016): Alchemy, Scene, Tangrams
  • Training data: a start state and a sequence of instructions with goal states
  • Standard evaluation metric: after following a sequence of instructions, is the world state correct?
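The metric can be sketched as follows. The helper names are hypothetical; `execute_sequence` stands in for whatever model maps a start state and instruction sequence to a final world state.

```python
# Final-state accuracy: execute the predicted actions for each full
# instruction sequence and check whether the resulting world state
# exactly matches the annotated goal state.

def final_state_accuracy(examples, execute_sequence):
    """examples: list of (start_state, instructions, goal_state) triples;
    execute_sequence: callable (start_state, instructions) -> final state."""
    correct = sum(
        1 for start, instructions, goal in examples
        if execute_sequence(start, instructions) == goal
    )
    return correct / len(examples)
```

Note that only the final state is checked, so the model gets no credit for partially correct executions of the sequence.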

SLIDE 73

Alchemy

Mix it
pop 2; pop 2; pop 2; push 2 brown; push 2 brown; push 2 brown;

SLIDE 74

Scene

The person with a red shirt and a blue hat moves to the right end
remove_person 7; remove_hat 7; add_person 10 red; add_hat 10 blue;

SLIDE 75

Tangrams

Swap the third and fourth figures
remove 4; insert 3 boat;

SLIDE 76–77

Results

[Bar chart: final state accuracy (test results) on Alchemy, Scene, and Tangrams for Long et al. 2016, Guu et al. 2017, SESTRA, and a supervised model; plotted values: 60.1, 66.4, 62.3, 62.4, 62.0, 62.7, 37.1, 46.2, 52.9, 27.6, 14.7, 52.3]

  • Outperforms previous methods by up to 25%, while mapping directly to system actions
  • Performance is comparable to direct supervision

SLIDE 78

Learning Methods

[Bar chart: final state accuracy (development results) on Alchemy, Scene, and Tangrams for SESTRA, policy gradient, and contextual bandit; plotted values: 52.6, 1.5, 5.7, 52.3, 0.5, 60.3, 56.1, 71.8]

  • Single-step observations overcome biases that get the model stuck

SLIDE 79

Ablations

[Bar chart: final state accuracy (development results) on Alchemy, Scene, and Tangrams for SESTRA, without previous instructions, and without world state context; plotted values: 3.5, 3.3, 27.6, 45.5, 66.1, 60.3, 56.1, 71.8]

  • Need access to previous instructions
  • Need access to world state

SLIDE 80

Conclusion

  • Attention-based model for generating sequences of atomic actions that modify the environment
  • Exploration-based learning procedure that avoids biases learned early in training

https://github.com/clic-lab/scone