SLIDE 1 Off-Policy Deep Reinforcement Learning without Exploration
Scott Fujimoto, David Meger, Doina Precup
Mila, McGill University
SLIDE 2
SLIDE 3 Surprise!
Agent orange and agent blue are trained with…
- 1. The same off-policy algorithm (DDPG).
- 2. The same dataset.
SLIDE 4 The Difference?
- 1. Agent orange: Interacted with the environment.
- Standard RL loop.
- Collect data, store data in buffer, train, repeat.
- 2. Agent blue: Never interacted with the environment.
- Trained with data collected by agent orange concurrently.
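The two-agent setup above can be sketched as a single loop. This is a minimal illustration, not the authors' experimental code: `ToyEnv`, `Agent`, and `run_concurrent` are hypothetical names, and `train` is a placeholder that only counts updates rather than running DDPG.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment, used only to make the loop runnable."""

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        reward = -abs(self.state)
        return self.state, reward


class Agent:
    """Placeholder learner: counts updates instead of running DDPG."""

    def __init__(self):
        self.updates = 0
        self.env_steps = 0

    def act(self, state):
        self.env_steps += 1
        return random.uniform(-1.0, 1.0)

    def train(self, buffer):
        batch = random.sample(buffer, min(4, len(buffer)))
        self.updates += 1
        return batch


def run_concurrent(steps, seed=0):
    random.seed(seed)
    env, buffer = ToyEnv(), []
    orange, blue = Agent(), Agent()
    state = env.reset()
    for _ in range(steps):
        # Orange: standard RL loop (act, store, train).
        action = orange.act(state)
        next_state, reward = env.step(action)
        buffer.append((state, action, reward, next_state))
        state = next_state
        orange.train(buffer)
        # Blue: trains on the same buffer at the same time, but never acts.
        blue.train(buffer)
    return orange, blue, buffer
```

Both agents receive the same number of updates from the same buffer; the only difference is that every transition in the buffer was produced by orange's actions.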
SLIDE 5
- 1. Trained with the same off-policy algorithm.
- 2. Trained with the same dataset.
- 3. One interacts with the environment. One doesn't.
SLIDE 6
Off-policy deep RL fails when truly off-policy.
SLIDE 7
Value Predictions
SLIDE 8
Extrapolation Error
$Q(s, a) \leftarrow r + \gamma Q(s', a')$
SLIDE 9
Extrapolation Error
$Q(s, a) \leftarrow r + \gamma Q(s', a')$
GIVEN: $s, a, r, s'$. GENERATED: $a'$.
SLIDE 10
Extrapolation Error
$Q(s, a) \leftarrow r + \gamma Q(s', a')$
1. $(s, a, r, s') \sim \text{Dataset}$
2. $a' \sim \pi(s')$
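As a rough sketch of where each term in the update comes from, the function below builds the target for one transition. The names `td_target`, `q_table`, and the toy states and actions are hypothetical and chosen only for illustration.

```python
def td_target(transition, policy, q, gamma=0.99):
    """Return r + gamma * Q(s', a') for one dataset transition."""
    s, a, r, s_next = transition   # given: stored in the dataset
    a_next = policy(s_next)        # generated: chosen by the current policy
    return r + gamma * q(s_next, a_next)


# Hypothetical toy inputs.
q_table = {("s1", "left"): 2.0, ("s1", "right"): 0.0}
q = lambda s, a: q_table[(s, a)]
policy = lambda s: "left"
target = td_target(("s0", "right", 1.0, "s1"), policy, q, gamma=0.5)
```

Nothing in the dataset guarantees that the pair `(s_next, a_next)` was ever observed, since `a_next` comes from the policy rather than from the data.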
SLIDE 11
Extrapolation Error
$Q(s, a) \leftarrow r + \gamma Q(s', a')$
$(s', a') \notin \text{Dataset} \rightarrow Q(s', a') = \text{bad} \rightarrow Q(s, a) = \text{bad}$
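The chain above can be reproduced in a small tabular example. This is an illustrative sketch under stated assumptions: the two-state problem, the function name `fitted_q`, and the value 10.0 (standing in for an arbitrary estimation error on an unseen pair) are all hypothetical.

```python
def fitted_q(dataset, q, gamma=0.9, sweeps=50):
    """Q-learning over a fixed dataset with a greedy target action."""
    actions = sorted({a for (_, a) in q})
    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            # The target action is generated by the greedy policy,
            # whether or not (s_next, a_next) is in the dataset.
            a_next = max(actions, key=lambda b: q[(s_next, b)])
            q[(s, a)] = r + gamma * q[(s_next, a_next)]
    return q


q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
q[(1, 1)] = 10.0  # arbitrary wrong estimate; (1, 1) never appears in the dataset
dataset = [(0, 0, 0.0, 1), (1, 0, 0.0, 1)]  # every observed reward is 0
fitted_q(dataset, q)
```

Because `(1, 1)` is never in the dataset, its estimate is never corrected, and every pair that bootstraps from state 1 inherits the error even though all observed rewards are zero.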
SLIDE 12
SLIDE 13
SLIDE 14
Extrapolation Error
Attempting to evaluate $\pi$ without (sufficient) access to the $(s, a)$ pairs $\pi$ visits.
SLIDE 15
Batch-Constrained Reinforcement Learning
Only choose $\pi$ such that we have access to the $(s, a)$ pairs $\pi$ visits.
SLIDE 16 Batch-Constrained Reinforcement Learning
- 1. $a \sim \pi(s)$ such that $(s, a) \in \text{Dataset}$.
- 2. $a \sim \pi(s)$ such that $(s', \pi(s')) \in \text{Dataset}$.
- 3. $a \sim \pi(s)$ such that $Q(s, a)$ is maxed.
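In a tabular setting, the three conditions above can be sketched by restricting the target maximization to actions that appear in the dataset for the next state. This is a simplified illustration, not the authors' implementation: `batch_constrained_q` and the toy data are hypothetical, and treating a never-seen next state as terminal is an assumption made here for brevity.

```python
def batch_constrained_q(dataset, q, gamma=0.9, sweeps=50):
    """Q-learning whose target only considers (s', a') pairs in the dataset."""
    seen = {}
    for s, a, _, _ in dataset:
        seen.setdefault(s, set()).add(a)
    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            candidates = seen.get(s_next)
            if candidates:
                # Maximize Q only over actions observed in s_next.
                target_q = max(q[(s_next, b)] for b in candidates)
            else:
                target_q = 0.0  # assumption: unseen next state treated as terminal
            q[(s, a)] = r + gamma * target_q
    return q


q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
q[(1, 1)] = 10.0  # arbitrary wrong estimate for a pair outside the dataset
dataset = [(0, 0, 0.0, 1), (1, 0, 0.0, 1)]
batch_constrained_q(dataset, q)
```

The wrong estimate for `(1, 1)` is still stored, but it is never used as a target, so it cannot propagate to the pairs that are in the dataset.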
SLIDE 17 Batch-Constrained Deep Q-Learning (BCQ)
First imitate dataset via generative model: $G(a \mid s) \approx P_{\text{Dataset}}(a \mid s)$.
$\pi(s) = \arg\max_{a_i} Q(s, a_i)$, where $a_i \sim G(\cdot \mid s)$
(I.e. select the best action that is likely under the dataset)
(+ some additional deep RL magic)
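The action-selection rule on this slide can be sketched in a few lines. This covers only the sample-then-argmax step and omits the additional components the slide alludes to. `bcq_select_action`, `generator`, `q`, and `n_samples` are hypothetical names: `generator(state)` stands in for a sampler from whatever generative model was fit to the dataset, and `q(state, action)` for a learned value estimate.

```python
import random


def bcq_select_action(state, generator, q, n_samples=10):
    """Sample candidate actions from the generative model, return the best by Q."""
    candidates = [generator(state) for _ in range(n_samples)]
    return max(candidates, key=lambda a: q(state, a))


# Hypothetical stand-ins: dataset actions cluster around 0.3,
# while the value estimate prefers actions near 1.0.
random.seed(0)
generator = lambda s: random.gauss(0.3, 0.05)
q = lambda s, a: -(a - 1.0) ** 2
action = bcq_select_action(0.0, generator, q)
```

The selected action is the highest-valued of the sampled candidates, so it stays close to what the generative model considers likely even though the value estimate alone would push it toward 1.0.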
SLIDE 18
[Plot legend: BCQ, DDPG]
SLIDE 19
[Plot legend: BCQ, DDPG]
SLIDE 20 Come say Hi @ Pacific Ballroom #38 (6:30 Tonight)
(Artist's rendition of poster session)
https://github.com/sfujim/BCQ