Policy Consolidation for Continual Reinforcement Learning
Christos Kaplanis1, Murray Shanahan1,2 and Claudia Clopath1
1Imperial College London, 2DeepMind
11th June 2019

Motivation
◮ Catastrophic Forgetting in Artificial Neural Networks
◮ Agents should cope with:
  ◮ Both discrete and continuous changes to the data distribution
  ◮ No prior knowledge of when/how changes occur
◮ Test beds: alternating task, single task and multi-agent RL
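The baselines in the experiments below are standard PPO variants, which the figure legends refer to by their hyperparameters: the clipped surrogate (clip = 0.2, 0.1, 0.03) and the fixed-KL-penalty surrogate (β = 1 to 50). As a minimal sketch of what those two objectives compute (function names and the toy batch are illustrative, not from the talk):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, clip=0.2):
    """Clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(ratio * adv,
                      np.clip(ratio, 1 - clip, 1 + clip) * adv).mean()

def ppo_kl_objective(ratio, adv, kl, beta=1.0):
    """KL-penalised surrogate: mean(r*A) - beta * mean KL(pi_old || pi)."""
    return (ratio * adv).mean() - beta * kl.mean()

# Toy batch: probability ratios pi_new/pi_old for the sampled actions,
# advantage estimates, and per-state KL divergences.
ratio = np.array([0.9, 1.3, 1.05])
adv   = np.array([1.0, -0.5, 2.0])
kl    = np.array([0.01, 0.08, 0.002])

print(ppo_clip_objective(ratio, adv, clip=0.2))
print(ppo_kl_objective(ratio, adv, kl, beta=5.0))
```

A larger β (or smaller clip) constrains each update more tightly to the previous policy, which is why these hyperparameters appear as the axis of comparison against PC in the plots.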
[Figure: alternating-task experiments. Reward vs. training steps (0 to 2×10⁷) for the task pairs [Walker2d-v2, Walker2dBigLeg-v0], [HalfCheetah-v2, HalfCheetahBigLeg-v0], and [HumanoidSmallLeg-v0, HumanoidBigLeg-v0]. Curves compare PC against PPO baselines with fixed KL penalty (β = 1, 5, 10, 20, 50), clipping (clip = 0.2, 0.1, 0.03), and, in the humanoid panel, adaptive β.]
[Figure: single-task experiments. Reward vs. training steps for Walker2d-v2 and HalfCheetahBigLeg-v0 (0 to 2×10⁷ steps) and RoboschoolHumanoid-v1 (0 to 5×10⁷ steps), with the same comparison: PC vs. PPO baselines (β = 1, 5, 10, 20, 50; clip = 0.2, 0.1, 0.03; adaptive β).]
[Figure: multi-agent (self-play) experiments. Mean score vs. training steps (0 to 6×10⁸). One panel compares PC variants (PC1, PC2, PC3) with PPO baselines (clip = 0.2, 0.1; β = 0.5, 1.0, 2.0, 5.0; adaptive β); the other shows head-to-head mean score of PC against each of those baselines.]
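The mechanism behind PC is a cascade of policies consolidating knowledge over multiple timescales, inspired by the Benna-Fusi model of synaptic memory. Below is a minimal sketch of that cascade idea only, not the paper's actual loss: levels exchange "flow" with their neighbours at a rate that halves with depth, so deeper levels change on exponentially longer timescales (in the paper the levels are policies linked by KL penalties; here they are plain logit vectors, and all names and rates are illustrative).

```python
import numpy as np

def cascade_step(levels, target, dt=0.1, g0=1.0, decay=0.5):
    """One synchronous update of a Benna-Fusi-style chain.

    levels[0] tracks `target` (standing in for the freshly trained
    policy); level k exchanges with its neighbours at conductance
    g0 * decay**k, so depth k has timescale ~ 1 / decay**k.
    """
    new = levels.copy()
    K = len(levels)
    for k in range(K):
        upstream = target if k == 0 else levels[k - 1]
        flow = (g0 * decay**k) * (upstream - levels[k])
        if k + 1 < K:
            flow += (g0 * decay**(k + 1)) * (levels[k + 1] - levels[k])
        new[k] = levels[k] + dt * flow
    return new

levels = np.zeros((4, 3))             # 4 cascade levels, 3 action logits each
target = np.array([1.0, -1.0, 0.0])   # logits of the current behavioural policy
for _ in range(50):
    levels = cascade_step(levels, target)

# Shallow levels have moved much closer to the target than deep ones,
# which retain a longer memory of the (zero-initialised) past.
print(np.round(levels[:, 0], 3))
```

The point of the sketch: if `target` later switches to a different task, the deep levels still hold the old solution and slowly pull the shallow ones back toward it, which is the multi-timescale memory PC exploits against forgetting.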
◮ Prioritised consolidation
◮ Adapt for off-policy learning