Off-Policy Deep Reinforcement Learning without Exploration: Scott Fujimoto, David Meger, Doina Precup (PowerPoint PPT Presentation)



SLIDE 1

Off-Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto, David Meger, Doina Precup
Mila, McGill University

SLIDE 2
SLIDE 3

Surprise!

Agent orange and agent blue are trained with…

  • 1. The same off-policy algorithm (DDPG).
  • 2. The same dataset.
SLIDE 4

The Difference?

  • 1. Agent orange: Interacted with the environment.
  • Standard RL loop.
  • Collect data, store data in buffer, train, repeat.
  • 2. Agent blue: Never interacted with the environment.
  • Trained concurrently, using the data collected by agent orange.
SLIDE 5
  • 1. Trained with the same off-policy algorithm.
  • 2. Trained with the same dataset.
  • 3. One interacts with the environment. One doesn’t. (See the sketch below.)
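A minimal sketch of the two training loops, assuming hypothetical `agent`, `env`, and `buffer` objects with Gym/DDPG-style interfaces (this is not the authors' code):

```python
def train_online(agent, env, buffer, steps):
    """Agent orange: the standard RL loop (collect, store, train, repeat)."""
    state = env.reset()
    for _ in range(steps):
        action = agent.select_action(state)          # interacts with the env
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        agent.train(buffer.sample())                 # off-policy update (DDPG)
        state = env.reset() if done else next_state

def train_batch(agent, buffer, steps):
    """Agent blue: never interacts; trains only on agent orange's buffer."""
    for _ in range(steps):
        agent.train(buffer.sample())                 # identical update rule
```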
SLIDE 6

Off-policy deep RL fails when truly off-policy.

SLIDE 7

Value Predictions

SLIDE 8

Extrapolation Error

Q(s, a) ← r + γ Q(s′, a′)

SLIDE 9

Extrapolation Error

Q(s, a) ← r + γ Q(s′, a′)

GIVEN: s, a, r, s′ (from the dataset)    GENERATED: a′ (by the policy)

SLIDE 10

Extrapolation Error

Q(s, a) ← r + γ Q(s′, a′)

1. (s, a, r, s′) ~ Dataset    2. a′ ~ π(s′)
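As a minimal sketch (hypothetical `q_net` and `policy` callables, not the paper's code), one backup target is computed as:

```python
# One step of the backup Q(s, a) <- r + gamma * Q(s', a').
# (s, a, r, s') are given by the dataset; a' is generated by the current
# policy, so the pair (s', a') may never occur in the dataset, and
# Q(s', a') is then a pure extrapolation of the network.
def backup_target(q_net, policy, reward, next_state, gamma=0.99):
    next_action = policy(next_state)                 # a' ~ pi(s')
    return reward + gamma * q_net(next_state, next_action)
```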

SLIDE 11

Extrapolation Error

Q(s, a) ← r + γ Q(s′, a′)

(s′, a′) ∉ Dataset → Q(s′, a′) = bad → Q(s, a) = bad
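A tiny worked example (numbers invented for illustration) of how one bad extrapolated value contaminates pairs that are in the dataset:

```python
# Suppose the true value of every pair is near 1.0, but the network
# extrapolates Q(s', a') = 100.0 for an unseen pair (s', a').
gamma = 0.99
r = 1.0
q_unseen = 100.0                  # erroneous extrapolated estimate
target = r + gamma * q_unseen     # 1.0 + 0.99 * 100.0 = 100.0
# The seen pair (s, a) now regresses toward 100.0, and every pair that
# bootstraps from (s, a) inherits the error in turn.
print(target)                     # 100.0
```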


SLIDE 14

Extrapolation Error

Attempting to evaluate π without (sufficient) access to the (s, a) pairs π visits.

SLIDE 15

Batch-Constrained Reinforcement Learning

Only choose π such that we have access to the (s, a) pairs π visits.

SLIDE 16

Batch-Constrained Reinforcement Learning

  • 1. a ~ π(s) such that (s, a) ∈ Dataset.
  • 2. a ~ π(s) such that (s′, π(s′)) ∈ Dataset.
  • 3. a ~ π(s) such that Q(s, a) is maximized (see the sketch below).
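A minimal sketch of conditions 1 and 3 for a small discrete action set (condition 2 applies the same constraint at the next state during training); `q_net` and `dataset_pairs` are hypothetical stand-ins:

```python
def batch_constrained_action(q_net, dataset_pairs, state, actions):
    # Condition 1: keep only actions whose (state, action) pair was seen.
    feasible = [a for a in actions if (state, a) in dataset_pairs]
    # Condition 3: among those, pick the action maximizing Q(s, a).
    return max(feasible, key=lambda a: q_net(state, a))
```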
SLIDE 17

Batch-Constrained Deep Q-Learning (BCQ)

First imitate the dataset via a generative model: G(a|s) ≈ P_Dataset(a|s).

π(s) = argmax_{aᵢ} Q(s, aᵢ), where aᵢ ~ G(·|s)

(I.e. select the best action that is likely under the dataset)

(+ some additional deep RL magic)
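A minimal sketch of the selection rule above (omitting the "additional deep RL magic", e.g. the perturbation network and clipped double Q-learning used in the paper); `q_net` and `gen_model` are hypothetical stand-ins:

```python
import numpy as np

def bcq_select_action(q_net, gen_model, state, n=10):
    """pi(s) = argmax over a_i of Q(s, a_i), with candidates a_i ~ G(.|s)."""
    candidates = [gen_model.sample(state) for _ in range(n)]  # likely actions
    values = np.array([q_net(state, a) for a in candidates])
    return candidates[int(values.argmax())]
```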

SLIDE 18

[Results plot. Legend: ■ BCQ, ■ DDPG]

SLIDE 19

[Results plot. Legend: ■ BCQ, ■ DDPG]

SLIDE 20

Come say Hi @ Pacific Ballroom #38 (6:30 Tonight)

(Artist’s rendition of poster session)

https://github.com/sfujim/BCQ