Fast Adaptation via Policy-Dynamics Value Functions - PowerPoint PPT Presentation

SLIDE 1

Fast Adaptation via Policy-Dynamics Value Functions

Roberta Raileanu NYU Max Goldstein NYU Arthur Szlam FAIR Rob Fergus NYU ICML 2020

SLIDE 2

Dynamics Often Change in the Real World

SLIDE 3

How can agents rapidly adapt to changes in the environment’s dynamics?

Learn a General Value Function in the Space of Policies and Dynamics

SLIDE 4

Policy-Dynamics Value Function (PD-VF)

  • Value Function: total future reward for a fixed policy under fixed dynamics
  • Policy-Dynamics Value Function: total future reward as a function of the policy and the dynamics
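As a minimal sketch of the distinction (toy stand-in, not the paper's architecture): a standard value function scores a state for one fixed policy in one fixed environment, while a PD-VF additionally takes a policy embedding z_pi and a dynamics embedding z_d, so a single function covers a whole family of policies and transition dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

def pd_value(state, z_pi, z_d, W1, W2):
    # Tiny two-layer MLP on the concatenated [state, z_pi, z_d] vector.
    # Output: predicted total future reward of policy z_pi under dynamics z_d.
    x = np.concatenate([state, z_pi, z_d])
    h = np.maximum(0.0, W1 @ x)   # ReLU hidden layer
    return float(W2 @ h)          # scalar value estimate

# Dimensions and weights are arbitrary placeholders.
state_dim, emb_dim, hidden = 4, 8, 16
W1 = rng.normal(size=(hidden, state_dim + 2 * emb_dim))
W2 = rng.normal(size=(hidden,))
v = pd_value(rng.normal(size=state_dim),
             rng.normal(size=emb_dim),
             rng.normal(size=emb_dim), W1, W2)
```

Holding the weights fixed and varying only z_pi or z_d already gives value estimates for unseen policy-dynamics pairs, which is what makes the adaptation step below possible.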

SLIDE 5

Fast Adaptation to New Dynamics

  • Family of Environments: each environment has a different (unobserved) transition function
  • Train on a family of different but related dynamics
  • Test on new dynamics

SLIDE 6

Training Recipe

  1. Reinforcement Learning Phase
     • Train individual policies on each training environment
  2. Self-Supervised Learning Phase
     • Learn policy and dynamics embeddings from the collected trajectories
  3. Supervised Learning Phase
     • Learn a value function over this space of policies and environments
  4. Evaluation Phase
     • Infer the dynamics of a new environment from a few interaction steps
     • Find the policy that maximizes the learned value function
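The four phases can be strung together on toy data as follows. Everything here is a stand-in (random "trajectories", linear encoders, a least-squares value fit) chosen only to make the data flow between phases concrete, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_envs, emb_dim, feat_dim = 5, 3, 10

# Phase 1 (RL): pretend one policy was trained per training environment and
# collected one trajectory each; a trajectory is just a feature vector here.
trajectories = rng.normal(size=(n_envs, feat_dim))

# Phase 2 (self-supervised): embed each trajectory into a policy embedding
# and a dynamics embedding; stand-in: fixed random linear encoders.
enc_pi = rng.normal(size=(emb_dim, feat_dim))
enc_d = rng.normal(size=(emb_dim, feat_dim))
z_pi = trajectories @ enc_pi.T
z_d = trajectories @ enc_d.T

# Phase 3 (supervised): fit value = f(z_pi, z_d) to observed returns,
# here by least squares on the concatenated embeddings.
returns = rng.normal(size=n_envs)
X = np.concatenate([z_pi, z_d], axis=1)
w, *_ = np.linalg.lstsq(X, returns, rcond=None)

# Phase 4 (evaluation): embed a few steps from a new environment, then pick
# the training policy whose embedding maximizes the learned value.
z_d_new = rng.normal(size=feat_dim) @ enc_d.T
scores = np.concatenate([z_pi, np.tile(z_d_new, (n_envs, 1))], axis=1) @ w
best_policy = int(np.argmax(scores))
```

Note that adaptation in phase 4 involves no gradient updates: only a forward pass through the encoders and the value function.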
SLIDE 7

Learning Policy and Dynamics Embeddings

  • Learn Policy Embedding
  • Learn Dynamics Embedding
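One natural reading of these two objectives (an illustrative sketch, not necessarily the paper's exact losses): a good policy embedding should help predict the policy's actions from states, and a good dynamics embedding should help predict next states from state-action pairs. Below, both "decoders" are plain linear maps on one toy trajectory, and we only compute the two squared-error losses one would minimize.

```python
import numpy as np

rng = np.random.default_rng(1)
T, s_dim, a_dim, emb = 20, 3, 2, 4

# One toy trajectory: states, the policy's actions, and next states.
states = rng.normal(size=(T, s_dim))
actions = rng.normal(size=(T, a_dim))
next_states = rng.normal(size=(T, s_dim))

z_pi = rng.normal(size=emb)  # policy embedding (would come from an encoder)
z_d = rng.normal(size=emb)   # dynamics embedding (likewise)

Dec_pi = rng.normal(size=(a_dim, s_dim + emb))
Dec_d = rng.normal(size=(s_dim, s_dim + a_dim + emb))

# Policy decoder: predict a_t from (s_t, z_pi).
pred_a = np.concatenate([states, np.tile(z_pi, (T, 1))], axis=1) @ Dec_pi.T
loss_pi = float(np.mean((pred_a - actions) ** 2))

# Dynamics decoder: predict s_{t+1} from (s_t, a_t, z_d).
inp = np.concatenate([states, actions, np.tile(z_d, (T, 1))], axis=1)
loss_d = float(np.mean((inp @ Dec_d.T - next_states) ** 2))
```

Minimizing these losses over encoder and decoder weights forces the embeddings to carry exactly the information that distinguishes one policy (or one environment's dynamics) from another.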

SLIDE 8

Learning the Policy-Dynamics Value Function


SLIDE 9

Evaluation Phase

Optimal Policy Embedding (OPE)

Closed-form solution: the top singular vector of the SVD of A
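A hedged sketch of why the closed form holds: suppose the learned value of a unit-norm policy embedding z is the quadratic form V(z) = zᵀAz for a symmetric positive semi-definite A built from the inferred dynamics embedding (A is a random stand-in below). Then the maximizer over the unit sphere is A's top singular vector, which for such A coincides with its top eigenvector, and we can check numerically that no random unit vector beats it.

```python
import numpy as np

rng = np.random.default_rng(3)

B = rng.normal(size=(6, 6))
A = B @ B.T                  # symmetric PSD stand-in for the value matrix

U, S, Vt = np.linalg.svd(A)  # singular values sorted in descending order
z_star = U[:, 0]             # top singular vector = optimal policy embedding

def value(z):
    # Quadratic-form value of a policy embedding z.
    return float(z @ A @ z)

# Sanity check: the closed-form solution beats 200 random unit vectors.
best_random = max(value(z / np.linalg.norm(z))
                  for z in rng.normal(size=(200, 6)))
assert value(z_star) >= best_random
```

The attained value is the top singular value itself, so picking the optimal embedding at test time costs one SVD rather than any policy-gradient steps.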

SLIDE 10

Environments

  • Continuous Dynamics: Spaceship, Swimmer, Ant-Wind
  • Discrete Dynamics: Ant-Legs

SLIDE 11

Evaluation on Unseen Environments

SLIDE 12

Evaluation on Unseen Environments

SLIDE 13

Learned Embeddings

[Figure: Policy Embeddings (colored by policy) and Dynamics Embeddings (colored by dynamics)]

SLIDE 14

Takeaways

  • Learn a value function in a space of policies and dynamics
  • Infer the dynamics of a new environment from only a few interactions
  • Improved performance on unseen environments
  • No need for parameter updates, long rollouts, or dense rewards to adapt

SLIDE 15

Future Work

  • Reward function variation → condition W on a task embedding
  • Multi-agent settings → dynamics given by the others’ policies
  • Continual learning
  • Integrate prior knowledge / constraints
  • Estimate other metrics apart from reward
SLIDE 16

Thank you!