

  1. Off-Policy Deep Reinforcement Learning without Exploration. Scott Fujimoto, David Meger, Doina Precup. Mila, McGill University.

  2. Surprise! Agent orange and agent blue are trained with… 1. The same off-policy algorithm (DDPG). 2. The same dataset.

  3. The Difference? 1. Agent orange: interacted with the environment. • Standard RL loop: collect data, store it in the buffer, train, repeat. 2. Agent blue: never interacted with the environment. • Trained concurrently on the data collected by agent orange.

  4. 1. Trained with the same off-policy algorithm. 2. Trained with the same dataset. 3. One interacts with the environment. One doesn’t.

  5. Off-policy deep RL fails when truly off-policy.

  6. Value Predictions

  7. Extrapolation Error: Q(s, a) ← r + γ Q(s′, a′)

  8. Extrapolation Error: Q(s, a) ← r + γ Q(s′, a′). The transition (s, a, r, s′) is GIVEN by the data; the next action a′ is GENERATED by the policy.

  9. Extrapolation Error: Q(s, a) ← r + γ Q(s′, a′), where 1. (s, a, r, s′) ~ Dataset, and 2. a′ ~ π(s′).

  10.–12. Extrapolation Error: Q(s, a) ← r + γ Q(s′, a′). (s′, a′) ∉ Dataset → Q(s′, a′) = bad → Q(s, a) = bad.

  13. Extrapolation Error: attempting to evaluate π without (sufficient) access to the (s, a) pairs π visits.
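
To make this concrete, here is a minimal PyTorch sketch of the target computation above (the networks and dimensions are illustrative stand-ins, not the authors' code): the next action a′ comes from the policy rather than from the data, so the Q-network is queried at pairs it may never have seen during training.

```python
import torch
import torch.nn as nn

gamma = 0.99
state_dim, action_dim = 3, 1

# Hypothetical stand-ins for the learned networks.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())

def td_target(r, s_next):
    """Q(s, a) <- r + gamma * Q(s', a'), with a' ~ pi(s')."""
    a_next = policy(s_next)                              # GENERATED by the policy
    q_next = q_net(torch.cat([s_next, a_next], dim=-1))  # may be pure extrapolation
    # If (s', a') never occurs in the dataset, q_next is unconstrained by the
    # training data, and any error in it is copied into Q(s, a) by this backup.
    return r + gamma * q_next.squeeze(-1)

# (s, a, r, s') are GIVEN by a batch sampled from the fixed dataset.
r, s_next = torch.zeros(8), torch.randn(8, state_dim)
print(td_target(r, s_next))
```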

  14. Batch-Constrained Reinforcement Learning: only choose π such that we have access to the (s, a) pairs π visits.

  15. Batch-Constrained Reinforcement Learning: 1. a ~ π(s) such that (s, a) ∈ Dataset. 2. a ~ π(s) such that (s′, π(s′)) ∈ Dataset. 3. a ~ π(s) such that Q(s, a) is maximized.
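
In a discrete toy setting, constraint 1 is easy to picture. The sketch below (hypothetical data and Q-values, for illustration only) restricts the argmax to actions that actually co-occur with the state in the batch, so a spuriously extrapolated Q-value can never be selected.

```python
from collections import defaultdict

# Toy dataset of (s, a, r, s') transitions (illustrative values).
dataset = [(0, 0, 1.0, 1), (0, 1, 0.0, 1), (1, 0, 1.0, 0)]

# Q-table: the pair (1, 1) never appears in the data, so its entry is
# pure extrapolation (here, wildly optimistic).
Q = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 99.0}

seen = defaultdict(set)
for s, a, r, s2 in dataset:
    seen[s].add(a)                      # actions observed together with state s

def batch_constrained_policy(s):
    # Maximize Q only over in-batch actions for this state.
    return max(seen[s], key=lambda a: Q[(s, a)])

print(batch_constrained_policy(1))      # -> 0; the bogus Q[(1, 1)] is never consulted
```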

  16. Batch-Constrained Deep Q-Learning (BCQ). First imitate the dataset via a generative model: G(a|s) ≈ P_Dataset(a|s). Then π(s) = argmax_{a_i} Q(s, a_i), where a_i ~ G. (I.e., select the best action among those that are likely under the dataset.) (+ some additional deep RL magic)
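
For continuous actions, that selection rule might look like the following minimal PyTorch sketch (illustrative networks only; the full method additionally trains G as a conditional VAE and adds a perturbation network and clipped double-Q estimates, the "magic" above): sample a handful of candidate actions from the generative model and take the one the Q-network scores highest.

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, n = 3, 1, 2, 10

# Hypothetical stand-in for the trained generative model G(a|s): a decoder
# mapping (state, latent noise) to an action, as a VAE decoder would.
decoder = nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                        nn.Linear(64, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1))

def select_action(s):
    """pi(s) = argmax_{a_i} Q(s, a_i), where a_i ~ G(a|s)."""
    s_rep = s.unsqueeze(0).repeat(n, 1)              # n copies of the state
    z = torch.randn(n, latent_dim).clamp(-0.5, 0.5)  # latent noise for sampling
    a_cand = decoder(torch.cat([s_rep, z], dim=-1))  # candidates plausible under the data
    q_vals = q_net(torch.cat([s_rep, a_cand], dim=-1)).squeeze(-1)
    return a_cand[q_vals.argmax()]                   # greedy only over in-distribution candidates

print(select_action(torch.randn(state_dim)))
```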

  17. [Results plot; legend: ∎ BCQ, ∎ DDPG]

  18. [Results plot; legend: ∎ BCQ, ∎ DDPG]

  19. Come say Hi @ Pacific Ballroom #38 (6:30 Tonight) https://github.com/sfujim/BCQ (Artist’s rendition of poster session)
