A Survey of Reinforcement Learning Informed by Natural Language
Maria Fabiano
Luketina et al., IJCAI 2019
Outline
1. Motivation
2. Background
3. Current Use of Natural Language in RL
4. Trends for Natural Language in RL
5. Future Work
6. Critique
Current Problems in RL
RL's heavy data requirements and poor generalization limit its real-world practicality.
Solutions with Natural Language
Transfer world knowledge from text corpora into decision-making problems.
Use language to specify goals and constraints, and take advantage of human priors.
Agents learn what actions to take in various states to maximize a cumulative reward.
Goal: find a policy π(a|s) that maximizes the expected discounted cumulative return.
Applications: continuous control, dialogue, board games, video games
Limitations: real-world use is limited by data requirements and poor generalization
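The objective on this slide can be written out explicitly in standard notation, assuming a discount factor γ ∈ [0, 1) and per-step reward r_t:

```latex
J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi)
```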
Recent NLP work has seen models transfer syntactic and semantic knowledge to downstream tasks.
Goal: transfer world and task-specific knowledge to sequential decision-making processes.
Example of such world knowledge: “scorpions are fast”
Agents could learn to use NLP and information retrieval to seek information in order to make progress on a task.
Language-conditional RL (language is part of the task formulation):
○ Instruction following
○ Rewards from instructions
○ Language in the observation & action space

Language-assisted RL (language facilitates learning):
○ Communicating domain knowledge
○ Structuring policies

Language-assisted methods help by structuring the policy or providing auxiliary rewards. In both cases, language information can be task-independent (e.g., conveying general priors) or task-dependent (e.g., instructions). These categories are not mutually exclusive.
Instruction Following: high-level instruction sequences (actions, goals, or policies)
Rewards from Instructions: learn a reward function from instructions
Observation & Action Space: environments use language to drive the interaction with the agent
Instructions can be specific actions, goal states, or desired policies.
Effective agents can:
1. Execute the instruction
2. Generalize to unseen instructions
Ties to hierarchical RL.
○ Oh et al., 2017
○ Parameterized skill performs different subtasks
○ Objective function makes analogies between similar subtasks to try to learn the entire subtask space
○ Meta controller reads the instructions, decides which subtask to perform, and passes subtask parameters to the parameterized skill
○ Parameterized skill executes the given subtask
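The Oh et al. (2017) pipeline can be sketched at a very high level. Everything below (function names, the toy skill table) is illustrative Python, not the paper's implementation; in the paper both modules are learned neural networks and the analogy-making objective shapes the skill's subtask embedding space.

```python
# Illustrative sketch: a meta controller dispatches instructions to a
# parameterized skill. The skill table and names are hypothetical.

SKILLS = {
    "visit": lambda obj: f"walked to {obj}",
    "pick":  lambda obj: f"picked up {obj}",
}

def meta_controller(instruction):
    """Read one instruction; decide which subtask to run and with what parameter."""
    subtask, target = instruction.split()
    return subtask, target

def parameterized_skill(subtask, target):
    """A single skill module that behaves differently per (subtask, target) pair."""
    return SKILLS[subtask](target)

def execute(instructions):
    """Run an instruction sequence by chaining meta controller and skill."""
    return [parameterized_skill(*meta_controller(ins)) for ins in instructions]
```

For example, `execute(["visit egg", "pick egg"])` yields `["walked to egg", "picked up egg"]`.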
Challenge: it is hard to automatically evaluate if an instruction was completed.
○ A learned reward model maps an instruction to a goal, then generates a reward for a policy-learning module
○ The reward learner is the discriminator that discerns between goal states and visited states. The agent is rewarded for visiting states the discriminator cannot discern from goal states.
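A minimal sketch of that discriminator-style reward, with illustrative names and data (not the paper's model): the discriminator is a simple logistic scorer trained to separate goal states from visited states, and the agent's reward is the discriminator's "looks like a goal" probability.

```python
# Toy discriminator-as-reward sketch. States are feature vectors; the
# discriminator separates goal states (label 1) from visited states (label 0).
import math

def score(w, state):
    """Discriminator probability that `state` looks like a goal state."""
    z = w[0] + sum(wi * si for wi, si in zip(w[1:], state))
    return 1.0 / (1.0 + math.exp(-z))

def train_discriminator(goal_states, visited_states, steps=2000, lr=0.5):
    """Fit the logistic scorer with per-example gradient steps."""
    w = [0.0] * (len(goal_states[0]) + 1)
    data = [(s, 1.0) for s in goal_states] + [(s, 0.0) for s in visited_states]
    for _ in range(steps):
        for s, y in data:
            g = y - score(w, s)          # gradient of the log-likelihood
            w[0] += lr * g               # bias term
            for i, si in enumerate(s):
                w[i + 1] += lr * g * si
    return w

def reward(w, state):
    """Agent reward: high when the discriminator can't tell the state from a goal."""
    return score(w, state)
```

States that the discriminator cannot distinguish from goal states receive reward near 1, so the policy is pushed toward goal-like states.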
Instructions can also provide auxiliary rewards to help the agent learn efficiently.
Use the instructions to induce a reward function.
Difficulty scales with vocabulary size and grammar complexity.
○ Cardinal directions (“Go north”) vs. relative directions (“Go to the blue ball southwest of the green box”)
○ Multiple-choice nature makes these problems similar to instruction following
Text games behave as RL environments.
Environments use language to drive interaction with the agent
Language assists the task via transfer learning
Language is not essential to the task, but assists via transfer of knowledge
Task-relevant information could be available in text form.
○ Advice about the policy, information about the environment
○ Must retrieve useful information for a given context
○ Must ground that information with respect to observations
○ Ground the meaning of text to the dynamics of the environment
○ Allows an agent to bootstrap policy learning in a new environment
Language can communicate information about the state or dynamics of an environment.
○ Shape representations to be more generalized abstractions
○ Make a representation space more interpretable to humans
○ Efficiently structure computations within a model
Neural Networks for Question Answering
1. Language-conditional RL is more studied than language-assisted RL
2. Learning from task-dependent text is more common than task-independent
3. Little research on how to use unstructured text for knowledge transfer
4. Little research on using language structure to build compositional representations and internal plans
5. Synthetically generated languages (instead of natural language) are the standard for instruction following
Task-independent
Agents generalize poorly outside the training distribution without transfer from a language model.
○ “Fetch a stick” vs. “Return with a stick” vs. “Grab a stick and come back”
Task-dependent
Pair the agent with an information retrieval system. The RL agent queries the retrieval system and uses relevant information.
○ Example: game manuals
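A toy version of that retrieval loop, assuming the external text is a game manual stored as a list of sentences and using plain word-overlap scoring. All data and names here are illustrative, not from any specific benchmark:

```python
# Toy information-retrieval step over a game manual. The agent describes its
# current context; the retriever returns the sentence with the most shared words.

MANUAL = [
    "scorpions are fast and poisonous",
    "keys open locked doors",
    "the princess is in the castle",
]

def tokenize(text):
    """Lowercase bag-of-words representation of a sentence."""
    return set(text.lower().split())

def retrieve(query, manual=MANUAL):
    """Return the manual sentence sharing the most words with the query."""
    return max(manual, key=lambda sent: len(tokenize(sent) & tokenize(query)))
```

For example, `retrieve("how to open a locked door")` returns the sentence about keys; the agent must still ground the retrieved text in its observations.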
This is measured only by instruction following benchmarks in closed-task domains (navigation, object manipulation, etc.) and closed worlds.
Agents can memorize each word. To generalize, RL needs more diverse environments with complex composition.
A central promise of language in RL is helping agents adapt to new goals, reward functions, and environment dynamics.
semantics
Benchmarks should measure the progress of natural language and RL integration.
○ Pre-trained information retrieval systems
“Good”
○ Explains why RL + NLP is worth studying
○ Frames language as a next step to elevate RL
○ Cites works that show the feasibility of this research
“Not so Good”
○ Q-learning, imitation learning
○ Reduces many problems to instruction following
○ Argues it is worthwhile to work in this space
○ Success in multimodal NLP work
○ Success in other modalities with RL
○ Is the converse true?