Using Natural Language for Reward Shaping in Reinforcement Learning
Prasoon Goyal, Scott Niekum and Raymond J. Mooney
The University of Texas at Austin
Motivation
In sparse reward settings, random exploration has very high sample complexity.
Reward shaping: providing intermediate rewards to guide the agent towards the goal.
Designing intermediate rewards by hand is challenging.
Idea: use natural language to describe the desired behavior, e.g.:
“Jump over the skull while going to the left”
The agent receives the environment reward plus a natural language command describing the task.
Use the relatedness between the agent's past actions and the command to generate rewards. For example:

Past actions → Reward
LLLJLLL → High
RRRUULL → Low
[L: Left, R: Right, U: Up, J: Jump]
In the environment, actions are represented as integers:

Past actions → Reward
4441444 → High
3332244 → Low
[4: Left, 3: Right, 2: Up, 1: Jump]
Problem: Given a sequence of actions (e.g. 4441444) and a command (e.g. “Jump over the skull while going to the left”), are they related?
Encode the action sequence as an action-frequency vector:

ϵ ⇒ [0 0 0 0 0 0 0 0]
4 ⇒ [0 0 0 0 1 0 0 0]
42 ⇒ [0 0 0.5 0 0.5 0 0 0]
422 ⇒ [0 0 0.7 0 0.3 0 0 0]
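A minimal Python sketch of this encoding (the 8-action space and the action-index convention follow the examples above; the helper name is ours):

    from collections import Counter

    NUM_ACTIONS = 8  # the examples above use 8-dimensional vectors

    def action_frequency_vector(actions):
        """Map an action sequence, e.g. [4, 2, 2], to a frequency vector.

        The empty sequence maps to the all-zeros vector, matching the
        epsilon example above.
        """
        vec = [0.0] * NUM_ACTIONS
        if not actions:
            return vec
        for action, count in Counter(actions).items():
            vec[action] = count / len(actions)
        return vec

    # "422" => [0, 0, 0.67, 0, 0.33, 0, 0, 0] (0.7 / 0.3 above is rounded)
    print(action_frequency_vector([4, 2, 2]))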
Train a neural network (LEARN: LanguagE-Action Reward Network) on the action-frequency vector and the command to predict whether they are related or not.
Neural Network Architecture
The action-frequency vector is passed through 3 linear layers.
The language command is encoded using one of:
○ InferSent
○ GloVe+RNN
○ RNNOnly
Combine the encoded action-frequency vector and the encoded language.
The combined representation is followed by a softmax layer that outputs the relatedness prediction.
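A hedged PyTorch sketch of an architecture in this shape (hidden sizes, the GRU standing in for the InferSent / GloVe+RNN / RNNOnly encoders, and all names are our assumptions, not the exact published configuration):

    import torch
    import torch.nn as nn

    class RelatednessNet(nn.Module):
        """Action-frequency branch + language branch + softmax head."""

        def __init__(self, num_actions=8, vocab_size=1000, hidden=128):
            super().__init__()
            # Action-frequency vector passed through 3 linear layers.
            self.action_branch = nn.Sequential(
                nn.Linear(num_actions, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            # Stand-in for the language encoder.
            self.embed = nn.Embedding(vocab_size, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            # Classifier over the combined representation.
            self.head = nn.Linear(2 * hidden, 2)

        def forward(self, action_freq, token_ids):
            a = self.action_branch(action_freq)
            _, h = self.rnn(self.embed(token_ids))
            combined = torch.cat([a, h[-1]], dim=-1)
            return torch.softmax(self.head(combined), dim=-1)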
Data Collection
Used Amazon Mechanical Turk to collect language descriptions for trajectories.
Filtered the collected descriptions to remove low quality data.
Paired descriptions with unrelated trajectories to generate negative examples.
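One simple way to construct such pairs (a sketch, assuming each description is re-paired with a randomly chosen other trajectory; the function name is ours):

    import random

    def make_dataset(pairs):
        """pairs: list of (trajectory, description) positives, len > 1.

        For each positive example, sample a description belonging to a
        different trajectory to serve as an unrelated negative example.
        """
        data = []
        for i, (traj, desc) in enumerate(pairs):
            data.append((traj, desc, 1))  # related
            j = random.choice([k for k in range(len(pairs)) if k != i])
            data.append((traj, pairs[j][1], 0))  # unrelated
        return data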
Reward Shaping with LEARN
At each timestep, from the agent's past actions, generate an action-frequency vector.
LEARN predicts the relatedness between the action-frequency vector and the language command.
The predictions are converted into potential-based intermediate rewards, such that the optimal policy does not change.
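This policy-invariance guarantee is the hallmark of potential-based reward shaping (Ng et al., 1999); a minimal sketch, assuming LEARN's relatedness prediction serves as the potential:

    GAMMA = 0.99  # assumed discount factor

    def shaping_reward(phi_prev, phi_curr, gamma=GAMMA):
        """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).

        Adding F to the environment reward provably leaves the optimal
        policy unchanged. Here phi is LEARN's relatedness prediction
        for the actions taken so far.
        """
        return gamma * phi_curr - phi_prev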
Experiments
Example task description: “DOWNWARDS IN LADDER”
Trained the agent using the PPO algorithm with and without language-based reward.
Extrinsic reward only: reward of 1 for reaching the goal, reward of 0 in all other cases.
With language: the extrinsic reward plus language-based intermediate rewards.
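A sketch of the two reward settings being compared (the weighting coefficient lam and the additive combination are our assumptions):

    def total_reward(ext_reward, lang_shaping, use_language, lam=1.0):
        """Extrinsic-only: the sparse environment reward alone.

        With language: the environment reward plus the language-based
        shaping term, scaled by a hypothetical coefficient lam.
        """
        if not use_language:
            return ext_reward
        return ext_reward + lam * lang_shaping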
Analysis
Take a fixed natural language description.
For many trajectories, compute the action-frequency vector and the corresponding prediction from LEARN:

Action-frequency vector → Prediction
[0 0 0 0 1 0 0 0] → 0.2
[0 0 0.5 0 0.5 0 0 0] → 0.1
...
[0.1 0.1 0.1 0.3 0.1 0.1 0.1 0.1] → 0.3
...
Compute the correlation coefficient between each component (action) and the prediction.
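A NumPy sketch of this analysis (array shapes are our assumptions):

    import numpy as np

    def per_action_correlation(freq_vectors, predictions):
        """freq_vectors: (N, num_actions) action-frequency vectors.
        predictions:  (N,) relatedness predictions from LEARN.

        Returns the Pearson correlation between each action's frequency
        component and the prediction.
        """
        X = np.asarray(freq_vectors, dtype=float)
        y = np.asarray(predictions, dtype=float)
        return np.array([np.corrcoef(X[:, a], y)[0, 1]
                         for a in range(X.shape[1])])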
Sample descriptions:
“go to the left and go under skulls and then down the ladder”
“go to the left and then go down the ladder”
“move to the left and go under the skulls”
Related Work
Language to Reward
[Williams et al. 2017, Arumugam et al. 2017]
Language to Subgoals
[Kaplan et al. 2017]
Adversarial Reward Induction
[Bahdanau et al. 2018]
Conclusion
LEARN uses natural language to generate intermediate rewards from the agent's past actions.
Future work:
○ Temporal information
○ Continuous action space
○ State information
○ Multi-step instructions
Code and Data available at www.cs.utexas.edu/~pgoyal