Policy Consolidation for Continual Reinforcement Learning


  1. Policy Consolidation for Continual Reinforcement Learning
  Christos Kaplanis¹, Murray Shanahan¹,² and Claudia Clopath¹ (¹Imperial College London, ²DeepMind), 11th June 2019

  2–10. Motivation
  ◮ Catastrophic Forgetting in Artificial Neural Networks
  ◮ Agents should cope with:
    ◮ Both discrete and continuous changes to the data distribution
    ◮ No prior knowledge of when/how changes occur
  ◮ Test beds: alternating task, single task and multi-agent RL

  11. Policy Consolidation
  [Diagram: a cascade of policies π₁, π₂, π₃, …, π_N. The first policy π₁ plays the game and is trained with the agent (RL) loss; adjacent policies, and each policy's copy from the previous update (π₁ old, …, π_N old), are coupled by KL-distillation losses. Knowledge flows down the cascade to store the policy and back up to recall it.]
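  The cascade on this slide can be expressed as a set of KL-distillation terms added to the usual RL loss on the first policy. Below is a minimal, illustrative PyTorch sketch assuming discrete actions; the helper name consolidation_loss, the shared omega/beta coefficients and the exact KL directions are simplifications for illustration, not the paper's precise objective.

  ```python
  # Minimal sketch of the cascaded KL-distillation terms (illustrative only).
  # Each policy maps states to action logits; old_policies hold the pre-update copies.
  import torch
  from torch.distributions import Categorical, kl_divergence

  def consolidation_loss(states, policies, old_policies, omega, beta):
      """KL terms coupling a cascade of policies pi_1..pi_N.

      omega[k]: weight coupling adjacent policies ("store"/"recall").
      beta[k]:  weight keeping pi_k close to its own pre-update copy.
      """
      dists = [Categorical(logits=p(states)) for p in policies]
      with torch.no_grad():
          old_dists = [Categorical(logits=p(states)) for p in old_policies]

      loss = 0.0
      for k in range(len(policies)):
          # Each policy stays close to its own copy from before the update.
          loss = loss + beta[k] * kl_divergence(old_dists[k], dists[k]).mean()
          # Adjacent policies distil into one another along the cascade.
          if k + 1 < len(policies):
              loss = loss + omega[k] * kl_divergence(old_dists[k], dists[k + 1]).mean()
              loss = loss + omega[k] * kl_divergence(old_dists[k + 1], dists[k]).mean()
      return loss

  # In training this term would be added to the RL objective computed on pi_1,
  # e.g. total_loss = ppo_surrogate_loss(...) + consolidation_loss(...)  (hypothetical helper).
  ```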

  12. Alternating task experiments
  [Plots: reward vs. training steps (up to 2 × 10⁷) on three alternating task pairs: [Walker2d-v2, Walker2dBigLeg-v0], [HalfCheetah-v2, HalfCheetahBigLeg-v0] and [HumanoidSmallLeg-v0, HumanoidBigLeg-v0]. Each panel compares PC against KL-penalty baselines (β = 1, 5, 10, 20, 50) and clipped-PPO baselines (clip = 0.2, 0.1, 0.03); the Humanoid pair also includes an adaptive-β baseline.]
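  For reference, the alternating-task setting switches the environment between the two variants of each pair at fixed intervals during training, without signalling the switch to the agent. A minimal sketch of such a schedule, assuming the Gym API and that the BigLeg variants are custom-registered environments, could look like:

  ```python
  # Illustrative alternating-task schedule; the switch interval is an assumption,
  # and the *BigLeg-v0 variants are custom environments, not part of standard Gym.
  import gym

  TASK_PAIR = ["Walker2d-v2", "Walker2dBigLeg-v0"]
  SWITCH_EVERY = 2_000_000  # environment steps per task before alternating (assumed)

  def current_env(total_steps):
      """Return the environment active at this point in training."""
      task = TASK_PAIR[(total_steps // SWITCH_EVERY) % len(TASK_PAIR)]
      return gym.make(task)
  ```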

  13. Single task experiments
  [Plots: reward vs. training steps on [Walker2d-v2] and [HalfCheetahBigLeg-v0] (up to 2 × 10⁷ steps) and [RoboschoolHumanoid-v1] (up to 5 × 10⁷ steps), comparing PC against the β = 1, 5, 10, 20, 50, clip = 0.2, 0.1, 0.03 and adaptive-β baselines.]

  14. Multi-agent self-play experiments
  [Plots: mean score vs. training steps (up to 6 × 10⁸). (a) Final model vs. self history, for PC1, PC2, PC3, Clip = 0.2, Clip = 0.1, β = 0.5, 1.0, 2.0, 5.0 and Adaptive β. (b) PC vs. baselines over training: PC vs. Clip = 0.2, Clip = 0.1, β = 0.5, β = 1.0, β = 2.0, β = 5.0 and Adaptive β.]

  15–17. Future work
  ◮ Prioritised consolidation
  ◮ Adapt for off-policy learning
