SLIDE 1

Policy Consolidation for Continual Reinforcement Learning

Christos Kaplanis¹, Murray Shanahan¹,² and Claudia Clopath¹

¹Imperial College London, ²DeepMind

11th June 2019

SLIDES 2-10

Motivation

◮ Catastrophic Forgetting in Artificial Neural Networks

◮ Agents should cope with:

  ◮ Both discrete and continuous changes to the data distribution

  ◮ No prior knowledge of when/how changes occur

◮ Test beds: alternating-task, single-task and multi-agent RL

SLIDE 11

Policy Consolidation

[Figure: Policy Consolidation schematic. A cascade of policies ρ1, ρ2, ρ3, ..., ρN; the agent plays the game and trains on the front of the cascade, storing and recalling its policy along the chain via KL distillation losses between adjacent policies.]
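The mechanism chains each policy to its neighbours with KL distillation penalties, so behaviour is stored into progressively slower-changing copies and recalled back toward the front. Below is a minimal sketch of such a chained distillation loss, assuming categorical action distributions; the names (categorical_kl, consolidation_loss, omega) are illustrative, and the exact weighting and direction of each KL term in the paper may differ.

import numpy as np

def categorical_kl(p, q, eps=1e-8):
    # KL(p || q) between categorical distributions along the last axis.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def consolidation_loss(policies, omega):
    # policies: list of N arrays of action probabilities, shape (batch, n_actions),
    #           with policies[0] the behavioural policy and deeper entries the
    #           slower-changing hidden copies.
    # omega:    N-1 coefficients weighting each link in the cascade (illustrative;
    #           the paper ties these weights to a cascade of increasing timescales).
    loss = 0.0
    for k in range(len(policies) - 1):
        # Distill in both directions: store behaviour into deeper policies
        # and recall old behaviour back toward the front of the cascade.
        loss += omega[k] * categorical_kl(policies[k], policies[k + 1]).mean()
        loss += omega[k] * categorical_kl(policies[k + 1], policies[k]).mean()
    return loss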

SLIDE 12

Alternating task experiments

[Figure: reward vs. training steps (0 to 2e7) on three alternating-task pairs: [Walker2d-v2, Walker2dBigLeg-v0], [HalfCheetah-v2, HalfCheetahBigLeg-v0] and [HumanoidSmallLeg-v0, HumanoidBigLeg-v0]. PC is compared with PPO baselines: fixed KL penalty β = 1, 5, 10, 20, 50; clip = 0.2, 0.1, 0.03; and, for the Humanoid pair, adaptive β.]
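The baseline legends refer to the two standard PPO objectives: the clipped surrogate (clip = ε) and the fixed KL penalty (coefficient β). A minimal sketch of both, with illustrative names; advantage normalisation, the value loss and the entropy bonus are omitted.

import numpy as np

def ppo_surrogates(ratio, advantage, kl_old_new, clip_eps=0.2, beta=1.0):
    # ratio:      pi_new(a|s) / pi_old(a|s) for the sampled actions
    # advantage:  advantage estimates, same shape as ratio
    # kl_old_new: per-state KL(pi_old || pi_new), for the penalty variant
    # Both quantities below are objectives to be maximised.

    # Clipped surrogate: the "clip = 0.2 / 0.1 / 0.03" baselines in the plots.
    clipped = np.minimum(
        ratio * advantage,
        np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage,
    ).mean()

    # Fixed KL penalty: the "β = 1 ... β = 50" baselines in the plots.
    penalized = (ratio * advantage - beta * kl_old_new).mean()

    return clipped, penalized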

SLIDE 13

Single task experiments

[Figure: reward vs. training steps on three single tasks: Walker2d-v2 and HalfCheetahBigLeg-v0 (0 to 2e7 steps) and RoboschoolHumanoid-v1 (0 to 5e7 steps). PC is compared with the same PPO baselines (β = 1, 5, 10, 20, 50; clip = 0.2, 0.1, 0.03; adaptive β).]
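The adaptive β baseline matches PPO's adaptive KL-penalty variant, in which the penalty coefficient is doubled or halved depending on whether the measured KL overshoots or undershoots a target. A minimal sketch of that update rule, under the assumption that the slide uses the original PPO heuristic; the target value itself is not given on the slide.

def update_beta(beta, measured_kl, kl_target):
    # PPO's adaptive KL-penalty heuristic: strengthen the penalty when the
    # policy update overshoots the KL target, weaken it when it undershoots.
    # The factors 1.5 and 2 are the constants from the original PPO paper.
    if measured_kl > 1.5 * kl_target:
        beta *= 2.0
    elif measured_kl < kl_target / 1.5:
        beta /= 2.0
    return beta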

SLIDE 14

Multi-agent self-play experiments

[Figure: mean score vs. training steps (0 to 6e8) in multi-agent self-play. (a) Final model vs. self history: PC1, PC2, PC3, Clip = 0.2, Clip = 0.1, β = 0.5, 1.0, 2.0, 5.0 and adaptive β. (b) PC vs. baselines over training: head-to-head mean score of PC against each baseline.]
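Panel (a) measures forgetting by playing each agent's final policy against snapshots of its own earlier selves. A minimal sketch of that evaluation loop; play_match is a hypothetical helper, and the win/draw/loss scoring convention is an assumption, not taken from the slide.

def score_vs_history(final_policy, snapshots, play_match, episodes=10):
    # snapshots:  past checkpoints of the same agent, saved during training
    # play_match: hypothetical helper returning 1.0 if the first player wins,
    #             0.5 for a draw and 0.0 for a loss (assumed convention)
    # A mean score well above 0.5 means the final policy still beats its own
    # history, i.e. it has not catastrophically forgotten how to play it.
    per_snapshot = []
    for old in snapshots:
        results = [play_match(final_policy, old) for _ in range(episodes)]
        per_snapshot.append(sum(results) / episodes)
    return sum(per_snapshot) / len(per_snapshot)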

SLIDES 15-17

Future work

◮ Prioritised consolidation

◮ Adapt for off-policy learning