A Survey of Reinforcement Learning Informed by Natural Language


SLIDE 1

A Survey of Reinforcement Learning Informed by Natural Language

Maria Fabiano

Luketina et al., IJCAI 2019

SLIDE 2

Outline

1. Motivation 2. Background 3. Current Use of Natural Language in RL 4. Trends for Natural Language in RL 5. Future Work 6. Critique

SLIDE 3

Motivation

Current Problems in RL

  • Most real-world tasks require some kind of language processing.
  • RL generalizes poorly even to tasks very similar to those it trains on, which limits its real-world practicality.
  • Previous research has been limited by small corpora or synthetic language.

Solutions with Natural Language

  • Advances in language representation learning allow models to integrate world knowledge from text corpora into decision-making problems.
  • Potential to improve generalization, overcome issues related to data constraints, and take advantage of human priors.

SLIDE 4

Background: RL

Agents learn what actions to take in various states to maximize a cumulative reward.

Goal: Find a policy π(a|s) that maximizes the expected discounted cumulative return.
Applications: continuous control, dialogue, board games, video games
Limitations: real-world use is limited by data requirements and poor generalization
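The goal above can be written out as the standard RL objective (a sketch, writing π for the policy and γ for the discount factor):

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \qquad 0 \le \gamma < 1,
```

where the expectation is over trajectories τ = (s₀, a₀, s₁, …) generated by following π, and the optimal policy is the one maximizing J(π).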

SLIDE 5

Background: Knowledge Transfer

Recent NLP work has shown models transferring syntactic and semantic knowledge to downstream tasks. The same idea can transfer world and task-specific knowledge to sequential decision-making processes:

  • Understanding explicit goals (“go to the door”)
  • Policy constraints (“avoid the scorpion”)
  • Generic information about the reward or policy (“scorpions are fast”)

  • Object affordances (what can be done with an object)

Agents could learn to use NLP and information retrieval to seek information in order to make progress on a task.
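As a concrete (and heavily simplified) illustration of transferring language information into a policy, the sketch below conditions action probabilities on an instruction encoded with word embeddings. The vocabulary, dimensions, and random stand-in for "pretrained" vectors are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pretrained word embeddings (e.g., GloVe-style
# vectors); random here purely for illustration.
VOCAB = {"go": 0, "to": 1, "the": 2, "door": 3, "avoid": 4, "scorpion": 5}
EMBED = rng.normal(size=(len(VOCAB), 8))  # 8-dim embedding per word

def encode_instruction(text):
    """Average the word embeddings of an instruction like "go to the door"."""
    ids = [VOCAB[w] for w in text.lower().split()]
    return EMBED[ids].mean(axis=0)

def policy(state, instruction_vec, W):
    """Language-conditional policy: action logits from [state; instruction]."""
    x = np.concatenate([state, instruction_vec])
    logits = W @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over actions

n_actions, state_dim = 4, 5
W = rng.normal(size=(n_actions, state_dim + 8))
probs = policy(rng.normal(size=state_dim), encode_instruction("go to the door"), W)
print(probs)  # a probability distribution over the 4 actions
```

In a real system the embedding table would come from a pretrained language model and W would be learned by RL; the point is only the shape of the conditioning.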

SLIDE 6

Current Use of Natural Language in RL

Natural Language in RL

  • Language-conditional: instruction following; rewards from instructions; language in the observation & action space
  • Language-assisted: communicating domain knowledge; structuring policies

SLIDE 7

Current Use of Natural Language in RL

  • Language-conditional: language is part of the task formulation
  • Language-assisted: language facilitates learning

In both cases, language information can be task-independent (e.g., conveying general priors) or task-dependent (e.g., instructions). These categories are not mutually exclusive.

SLIDE 8

Current Use of Natural Language in RL

SLIDE 9

Language-Conditional RL

Language is part of the task

  • Interpret and execute instructions given in language
  • Language is part of the state and action space
  • Often, the full language isn’t needed to solve the problem, but it assists by structuring the policy or providing auxiliary rewards

Instruction Following: high-level instruction sequences (actions, goals, or policies)
Rewards from Instructions: learn a reward function
Observation & Action Space: environments use language for driving the interaction with the agent

SLIDE 10

Language-Conditional: Instruction Following

Instructions can be specific actions, goal states, or desired policies. Effective agents can:

1. Execute the instruction
2. Generalize to unseen instructions

Ties to hierarchical RL (Oh et al., 2017):

  ○ A parameterized skill performs different subtasks
  ○ An objective function makes analogies between similar subtasks to try to learn the entire subtask space
  ○ A meta controller reads the instructions, decides which subtask to perform, and passes subtask parameters to the parameterized skill
  ○ The parameterized skill executes the given subtask
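The meta-controller / parameterized-skill split can be sketched as a toy control flow. All names and the subtask table below are invented for illustration; in Oh et al. both levels are learned neural networks, not hand-written rules:

```python
# Toy two-level hierarchy: a meta controller maps each instruction to a
# (subtask, parameter) pair; a parameterized skill turns that pair into
# primitive actions. Purely illustrative, not the paper's actual model.

def meta_controller(instruction):
    """Read one instruction and choose a parameterized subtask."""
    verb, _, obj = instruction.partition(" ")
    return verb, obj  # e.g., "pick apple" -> ("pick", "apple")

def parameterized_skill(subtask, param):
    """Execute the chosen subtask as a short sequence of primitive actions."""
    if subtask == "goto":
        return [f"move_toward({param})"]
    if subtask == "pick":
        return [f"move_toward({param})", f"grasp({param})"]
    raise ValueError(f"unknown subtask: {subtask}")

for instruction in ["goto door", "pick apple"]:
    subtask, param = meta_controller(instruction)
    print(instruction, "->", parameterized_skill(subtask, param))
```

The analogy-making objective of the paper has no counterpart here; this only shows how instructions decompose into reusable parameterized subtasks.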

SLIDE 11
Language-Conditional: Rewards from Instructions

Use the instructions to induce a reward function

  • To apply instruction following in a broader context, we need a way to automatically evaluate whether an instruction was completed.
  • Common architecture: a reward-learning module learns to ground an instruction to a goal, then generates a reward for a policy-learning module.
  • Use standard IRL or an adversarial process.

  ○ The reward learner is a discriminator that discerns between goal states and visited states. The agent is rewarded for visiting states the discriminator cannot distinguish from goal states.

  • When environment rewards are sparse, instructions can help generate auxiliary rewards to help learn efficiently.
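The adversarial scheme can be sketched in a few lines of numpy: a logistic discriminator is trained to separate goal states from visited states, and the agent's reward rises as its states fool the discriminator. The features, learning rate, and training loop below are illustrative choices, not the method of any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3) * 0.01  # discriminator weights, near-zero init

def D(s, w):
    """Probability the discriminator assigns to s being a goal state."""
    return 1.0 / (1.0 + np.exp(-(w @ s)))

def discriminator_step(goal_states, visited_states, w, lr=0.1):
    """One pass of logistic-regression updates: goals labeled 1, visited 0."""
    for s, y in [(g, 1.0) for g in goal_states] + [(v, 0.0) for v in visited_states]:
        w = w + lr * (y - D(s, w)) * s
    return w

def agent_reward(s, w):
    """Reward is high when the discriminator mistakes s for a goal state."""
    return np.log(D(s, w) + 1e-8)

goals = [np.array([1.0, 1.0, 0.0])]      # toy goal-state features
visited = [np.array([-1.0, 0.0, 1.0])]   # toy visited-state features
for _ in range(50):
    w = discriminator_step(goals, visited, w)

print(agent_reward(goals[0], w) > agent_reward(visited[0], w))  # True
```

In the full loop the policy would then be updated against `agent_reward`, pushing visited states toward goal states while the discriminator keeps adapting.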

SLIDE 12
Language-Conditional: Observation & Action Space

Environments use language to drive interaction with the agent

  • Much more challenging – observation and action spaces grow combinatorially with vocabulary size and grammar complexity

  ○ Cardinal directions (“go north”) vs. relative directions (“go to the blue ball southwest of the green box”)

  • Dialogue systems, QA, VQA, EQA

  ○ Their multiple-choice nature makes these problems similar to instruction following

  • To help create consistent benchmarks in this space, TextWorld generates text games that behave as RL environments
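TextWorld itself generates such games; the hand-rolled stand-in below only illustrates the shape of the interaction loop (text observation in, command string out, scalar reward back). The rooms, commands, and reward scheme are invented, and this is not TextWorld's actual API:

```python
# A minimal text-game environment with an RL-style reset/step interface.
# Observations and actions are both natural-language strings.

class ToyTextEnv:
    def __init__(self):
        self.location = "kitchen"

    def reset(self):
        self.location = "kitchen"
        return "You are in the kitchen. A corridor leads east."

    def step(self, command):
        """Return (observation, reward, done) for a command string."""
        if command == "go east" and self.location == "kitchen":
            self.location = "pantry"
            return "You enter the pantry. A key lies here.", 0.0, False
        if command == "take key" and self.location == "pantry":
            return "You take the key. You win!", 1.0, True
        return "Nothing happens.", 0.0, False

env = ToyTextEnv()
obs = env.reset()
for cmd in ["go east", "take key"]:
    obs, reward, done = env.step(cmd)
print(reward, done)  # 1.0 True
```

An RL agent in this setting must both read the observation text and emit a valid command, which is exactly where vocabulary size and grammar make the spaces combinatorial.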

SLIDE 13

TextWorld Example 1

SLIDE 14

TextWorld Example 2

SLIDE 15

Language-Assisted RL

Language assists the task via transfer learning

Language is not essential to the task, but assists via transfer of knowledge

  • Specifies features, annotates states or entities, describes subtasks
  • Most cases are task-specific
  • Pre-trained embeddings and parsers provide task-independent information
SLIDE 16

Language-Assisted: Communicating Domain Knowledge

  • For more general settings outside instruction following, potentially task-relevant information could be available

  ○ Advice about the policy, information about the environment

  • Unstructured, descriptive language is more available than instructive language

  ○ Must retrieve useful information for a given context
  ○ Must ground that information with respect to observations

  • Narasimhan et al., 2018

  ○ Ground the meaning of text to the dynamics of the environment
  ○ Allows an agent to bootstrap policy learning in a new environment

SLIDE 17

Language-Assisted: Structuring Policies

  • Construct priors on the model by communicating information about the state or dynamics of an environment

  ○ Shape representations into more general abstractions
  ○ Make a representation space more interpretable to humans
  ○ Efficiently structure computations within a model

  • Example: Learning to Compose Neural Networks for Question Answering
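The module-composition idea behind that example can be sketched with plain functions: a layout derived from the language picks which modules run and in what order. The symbolic world, module names, and layouts below are invented for illustration; the real system composes learned neural modules:

```python
# Toy policy structuring via module composition: language determines the
# computation graph, and each step's output feeds the next module.

WORLD = {"red ball": (0, 1), "blue box": (2, 3)}

MODULES = {
    "find": lambda world, arg: world[arg],            # locate a named entity
    "describe": lambda pos: f"at {pos[0]},{pos[1]}",  # report its position
}

def execute(layout, world):
    """Run a linear layout like [("find", "red ball"), ("describe",)]."""
    value = world
    for step in layout:
        name, *args = step
        value = MODULES[name](value, *args)
    return value

print(execute([("find", "red ball"), ("describe",)], WORLD))  # at 0,1
```

The payoff described on the slide is that the structure itself (find-then-describe) transfers across questions, even when the arguments change.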

SLIDE 18

Trends for Natural Language in RL

1. Language-conditional RL is more studied than language-assisted RL
2. Learning from task-dependent text is more common than task-independent
3. Little research has been done on using unstructured, task-dependent text for knowledge transfer
4. Little research on using language structure to build compositional representations and internal plans
5. Synthetically generated languages (instead of natural language) are the standard for instruction following

SLIDE 19

Learning from Text Corpora in the Wild

Task-independent

  • RL systems can’t generalize to language outside of the training distribution without transfer from a language model

  ○ “Fetch a stick” vs. “Return with a stick” vs. “Grab a stick and come back”

  • Would enable agents to better utilize task-dependent corpora

Task-dependent

  • Transfer task-specific corpora and fine-tune a pre-trained information retrieval system. The RL agent queries the retrieval system and uses relevant information.

  ○ Example: game manuals
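The query-a-manual idea can be sketched with plain word-overlap scoring standing in for a fine-tuned retrieval system; the toy "manual" passages are invented for illustration:

```python
# A toy retrieval step: the agent sends a query and gets back the manual
# passage sharing the most words with it. A real system would use learned
# retrieval, but the agent-side interface is the same.

MANUAL = [
    "Scorpions are fast and should be avoided.",
    "Keys open doors of the matching color.",
    "Sticks can be picked up and carried back.",
]

def retrieve(query, corpus):
    """Return the passage with the largest word overlap with the query."""
    q = set(query.lower().split())
    return max(corpus, key=lambda p: len(q & set(p.lower().rstrip(".").split())))

print(retrieve("what do keys do", MANUAL))
# -> "Keys open doors of the matching color."
```

The retrieved passage would then be fed to the policy as extra context, e.g. concatenated with the observation, closing the loop between retrieval and control.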

SLIDE 20

Diverse Environments with Real Semantics

Progress is currently measured only by instruction-following benchmarks in closed task domains (navigation, object manipulation, etc.) and closed worlds:

  • Small vocabulary sizes
  • Multiple pieces of evidence to ground each word

To generalize, RL needs more diverse environments with complex composition:

  • 3D house simulation
  • Minecraft

A central promise of language in RL is helping agents adapt to new goals, reward functions, and environment dynamics.

SLIDE 21

Future Work

  • Use pre-trained language models to transfer world knowledge
  • Learn from natural text rather than instructions or synthetic language
  • Use more diverse environments with complex composition and real-world semantics
  • Develop standardized environments and evaluations to properly measure progress of natural language and RL integration
  • Build agents that can query knowledge more explicitly and reason with it

  ○ Pre-trained information retrieval systems

SLIDE 22

Critique

“Good”

  • Provides compelling motivation for why RL + NLP is worth studying
  • Challenges the field to take the next step to elevate RL
  • Provides many positive examples that show the feasibility of this work

“Not so Good”

  • More background for RL would have helped

  ○ Q-learning, imitation learning

  • The similarity of multiple-choice QA problems to instruction following deserves more discussion
  • Factors they missed that make it worthwhile to work in this space

  ○ Success in multimodal NLP work
  ○ Success in other modalities with RL

  • Language can inform RL; is the converse true?