  1. Fast Adaptation via Policy-Dynamics Value Functions. Roberta Raileanu (NYU), Max Goldstein (NYU), Arthur Szlam (FAIR), Rob Fergus (NYU). ICML 2020.

  2. Dynamics Often Change in the Real World

  3. How can agents rapidly adapt to changes in the environment’s dynamics? Learn a General Value Function in the Space of Policies and Dynamics.

  4. Policy-Dynamics Value Function (PD-VF). A standard value function predicts the total future reward of one fixed policy under one fixed dynamics; the PD-VF predicts total future reward as a function of both the policy and the dynamics.
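
In symbols (notation assumed here for illustration, not taken verbatim from the slides; discounting omitted to match the slide's "total future reward"): a standard value function is tied to one policy π and one transition function T, whereas the PD-VF conditions on learned embeddings z_π of the policy and z_d of the dynamics:

V^{\pi,T}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} r_t \,\middle|\, s_0 = s,\; a_t \sim \pi,\; s_{t+1} \sim T\right],
\qquad
V_{\mathrm{PD}}(s, z_\pi, z_d) \approx V^{\pi,T}(s) \ \text{for any pair } (\pi, T) \text{ in the family.}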

  5. Fast Adaptation to New Dynamics. A family of environments, where each environment has a different, unobserved transition function: train on a family of different but related dynamics, then test on new dynamics.
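
As a concrete, hypothetical instance of such a family (loosely in the spirit of Ant-Wind from slide 10, not the authors' actual environments), here is a minimal gym-style Python environment whose transition function depends on a hidden wind direction the agent never observes:

import numpy as np

class WindyEnv:
    """Toy environment family: each instance has a hidden wind angle that
    changes the transition function; the agent only observes its position."""
    def __init__(self, wind_angle):
        self.wind = np.array([np.cos(wind_angle), np.sin(wind_angle)])
        self.pos = np.zeros(2)

    def reset(self):
        self.pos = np.zeros(2)
        return self.pos.copy()

    def step(self, action):
        # The hidden wind perturbs every transition.
        self.pos += action + 0.1 * self.wind
        reward = -np.linalg.norm(self.pos - np.array([1.0, 1.0]))
        return self.pos.copy(), reward, False, {}

train_envs = [WindyEnv(a) for a in np.linspace(0, np.pi, 10)]         # seen dynamics
test_envs = [WindyEnv(a) for a in np.linspace(np.pi, 2 * np.pi, 5)]   # unseen dynamics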

  6. Training Recipe (a toy end-to-end sketch follows below):
     1. Reinforcement Learning Phase: train an individual policy on each training environment.
     2. Self-Supervised Learning Phase: learn policy and dynamics embeddings from the collected trajectories.
     3. Supervised Learning Phase: learn a value function over this space of policies and environments.
     4. Evaluation Phase: infer the dynamics of a new environment from a few steps, then find the policy that maximizes the learned value function.
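
A self-contained toy run of all four phases in NumPy. Everything here is a simplification for illustration: 2-D policy and dynamics vectors serve as their own "embeddings", a closed-form best response stands in for the RL phase, and the value function is a quadratic form in the policy embedding (an assumption consistent with the closed-form solution on slide 9):

import numpy as np

rng = np.random.default_rng(0)

# Toy family: "dynamics" are unit vectors d; running a unit policy vector
# z_pi under d yields return (z_pi . d)^2 plus a little noise.
def rollout_return(z_pi, d):
    return float(z_pi @ d) ** 2 + 0.01 * rng.standard_normal()

# Phase 1 (stand-in for per-environment RL): the best response is z_pi = d.
train_dirs = [np.array([np.cos(a), np.sin(a)]) for a in np.linspace(0.1, np.pi, 8)]
policies = list(train_dirs)

# Phase 2 (stand-in for the self-supervised phase): policies and dynamics
# are already low-dimensional vectors, so they serve as their own embeddings.

# Phase 3 (supervised): fit V(z_pi, z_d) = z_pi^T A(z_d) z_pi, with A
# bilinear in z_d, by least squares; V is linear in the unknown tensor W.
X, y = [], []
for z_pi in policies:
    for d in train_dirs:
        X.append(np.kron(np.kron(z_pi, z_pi), np.kron(d, d)))
        y.append(rollout_return(z_pi, d))
w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
W = w.reshape(2, 2, 2, 2)  # W[i, j, k, l]

# Phase 4 (evaluation): infer z_d for a held-out environment (here: given),
# form A, and take the top eigenvector of its symmetric part, i.e. the
# closed-form Optimal Policy Embedding of slide 9 (up to sign).
d_new = np.array([np.cos(2.5), np.sin(2.5)])
A = np.einsum('ijkl,k,l->ij', W, d_new, d_new)
_, vecs = np.linalg.eigh((A + A.T) / 2)
z_star = vecs[:, -1]
print("recovered policy:", z_star, " true optimum (up to sign):", d_new)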

  7. Learning Policy and Dynamics Embeddings: learn a policy embedding and a dynamics embedding from the collected trajectories.
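
One common way to realize this phase, sketched in PyTorch (the architecture sizes, the mean-pooling, and all names below are my assumptions, not the paper's exact design): a policy encoder compresses (state, action) pairs into z_pi such that a decoder can reproduce the actions, and a dynamics encoder compresses (s, a, s') transitions into z_d such that a decoder can predict next states:

import torch
import torch.nn as nn
import torch.nn.functional as F

S, A, Z = 8, 2, 4  # illustrative state, action, and embedding sizes

# z_pi must summarize how the policy acts: decoder recovers a from (s, z_pi).
policy_enc = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, Z))
policy_dec = nn.Sequential(nn.Linear(S + Z, 64), nn.ReLU(), nn.Linear(64, A))

# z_d must summarize the transition function: decoder predicts s' from (s, a, z_d).
dyn_enc = nn.Sequential(nn.Linear(2 * S + A, 64), nn.ReLU(), nn.Linear(64, Z))
dyn_dec = nn.Sequential(nn.Linear(S + A + Z, 64), nn.ReLU(), nn.Linear(64, S))

def embedding_losses(s, a, s_next):
    # Pool per-step codes into a single embedding for the batch; the unit
    # norm on z_pi matches the constraint assumed by the closed-form
    # optimal-policy-embedding step (slide 9).
    z_pi = F.normalize(policy_enc(torch.cat([s, a], -1)).mean(0, keepdim=True), dim=-1)
    z_d = dyn_enc(torch.cat([s, a, s_next], -1)).mean(0, keepdim=True)
    a_hat = policy_dec(torch.cat([s, z_pi.expand(len(s), -1)], -1))
    s_hat = dyn_dec(torch.cat([s, a, z_d.expand(len(s), -1)], -1))
    return F.mse_loss(a_hat, a), F.mse_loss(s_hat, s_next)

# Dummy batch of transitions from one (policy, environment) pair:
s, a, s_next = torch.randn(32, S), torch.randn(32, A), torch.randn(32, S)
pi_loss, dyn_loss = embedding_losses(s, a, s_next)
(pi_loss + dyn_loss).backward()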

  8. Learning the Policy-Dynamics Value Function.
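
Since slide 9 reads the optimal policy embedding off a matrix A in closed form, the value network presumably outputs A from the initial state and z_d, and the value is the quadratic form z_pi^T A z_pi, regressed onto observed returns. A hedged PyTorch sketch (shapes and sizes again mine, not the authors' implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

S, Z = 8, 4  # illustrative state and embedding sizes

class PDVF(nn.Module):
    """Maps (s0, z_d) to a matrix A; the value of a policy embedding z_pi
    is the quadratic form z_pi^T A z_pi, so the best unit-norm z_pi has a
    closed form (slide 9)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(S + Z, 64), nn.ReLU(),
                                 nn.Linear(64, Z * Z))

    def forward(self, s0, z_d, z_pi):
        A = self.net(torch.cat([s0, z_d], -1)).view(-1, Z, Z)
        return torch.einsum('bi,bij,bj->b', z_pi, A, z_pi)

pdvf = PDVF()
opt = torch.optim.Adam(pdvf.parameters(), lr=1e-3)

# One supervised step on dummy data; in the real recipe the targets are
# returns measured by running embedded policies in embedded environments.
s0 = torch.randn(32, S)
z_d = torch.randn(32, Z)
z_pi = F.normalize(torch.randn(32, Z), dim=-1)
returns = torch.randn(32)  # stand-in targets
loss = F.mse_loss(pdvf(s0, z_d, z_pi), returns)
opt.zero_grad(); loss.backward(); opt.step()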

  9. Evaluation Phase. Closed-form solution for the Optimal Policy Embedding (OPE): the top singular vector of the SVD of A.
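
Numerically, maximizing z^T A z over unit-norm z is a symmetric eigenproblem; when the top eigenvalue of A's symmetric part is positive, the top eigenvector coincides with the top singular vector from the SVD, which is how the slide phrases it. A NumPy sketch with a placeholder A:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))  # stands in for A(s0, z_d) from the PD-VF

# Optimal Policy Embedding: argmax over unit-norm z of z^T A z. Only the
# symmetric part of A contributes to the quadratic form; its top
# eigenvector is the maximizer.
M = (A + A.T) / 2
vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
z_ope = vecs[:, -1]
print("OPE:", z_ope, " value:", z_ope @ M @ z_ope)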

  10. Environments. Continuous dynamics: Spaceship, Swimmer, Ant-Wind. Discrete dynamics: Ant-Legs.

  11. Evaluation on Unseen Environments

  12. Evaluation on Unseen Environments

  13. Learned Embeddings: visualizations of the learned policy embeddings and dynamics embeddings, colored by policy and by dynamics respectively.

  14. Takeaways ● Learn a value function in a space of policies and dynamics ● Infer the dynamics of a new environment from only a few interactions ● No need for parameter updates, long rollouts, or dense rewards in order to adapt ● Improved performance on unseen environments

  15. Future Work ● Reward function variation → condition W on a task embedding ● Multi-agent settings → dynamics given by the others’ policies ● Continual learning ● Integrate prior knowledge / constraints ● Estimate other metrics apart from reward

  16. Thank you!
