  1. Online Exploration in Least-Squares Policy Iteration
     Lihong Li, Michael L. Littman, and Christopher R. Mansley
     Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)
     AAMAS, Budapest, 5/14/2009

  2. Contributions
     Reinforcement learning poses two challenges:
     • Challenge I: the exploration/exploitation tradeoff
       – addressed by Rmax [Brafman & Tennenholtz 02] (provably efficient, finite)
     • Challenge II: value-function approximation
       – addressed by LSPI [Lagoudakis & Parr 03] (continuous, offline)
     • This work: LSPI-Rmax, combining the two

  3. Outline
     • Introduction
       – LSPI
       – Rmax
     • LSPI-Rmax
     • Experiments
     • Conclusions

  4. Basic Terminology
     • Markov decision process
       – States: S
       – Actions: A
       – Reward function: -1 ≤ R(s,a) ≤ 1
       – Transition probabilities: T(s'|s,a)
       – Discount factor: 0 < γ < 1
     • Optimal value function: Q*(s,a) (defined below)
     • Optimal policy: π*(s) (defined below)
     • Goal: approximate these quantities
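In this notation, the optimal value function and optimal policy referred to above are given by the standard Bellman optimality conditions:

```latex
% Bellman optimality equation for the action-value function
Q^*(s,a) = R(s,a) + \gamma \sum_{s'} T(s' \mid s, a)\, \max_{a'} Q^*(s', a')

% Optimal (greedy) policy
\pi^*(s) = \operatorname*{arg\,max}_{a}\, Q^*(s,a)
```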

  5. Linear Function Approximation
     • Features: φ(s,a) = (φ_1(s,a), …, φ_k(s,a))
       – a.k.a. "basis functions"; predefined
     • Weights: w = (w_1, …, w_k)
       – w_i measures the contribution of φ_i to approximating Q*
     • Approximation: Q̂(s,a) = w · φ(s,a) = Σ_i w_i φ_i(s,a)
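A minimal sketch of how such a linear approximation is evaluated and used to act greedily; the toy feature map `phi`, the action set, and the state encoding below are assumptions for illustration only:

```python
import numpy as np

def q_value(w, phi, s, a):
    """Linear approximation: Q(s,a) = w · φ(s,a)."""
    return float(np.dot(w, phi(s, a)))

def greedy_action(w, phi, s, actions):
    """Greedy policy: π(s) = argmax_a w · φ(s,a)."""
    return max(actions, key=lambda a: q_value(w, phi, s, a))

# Toy feature map (an assumption): one indicator block per action,
# filled with a simple encoding of the state.
def phi(s, a, num_actions=2, state_dim=2):
    feats = np.zeros(num_actions * state_dim)
    feats[a * state_dim:(a + 1) * state_dim] = s
    return feats

w = np.zeros(2 * 2)                  # one weight per feature
s = np.array([0.3, -0.7])            # a toy continuous state
print(greedy_action(w, phi, s, actions=[0, 1]))
```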

  6. LSPI [Lagoudakis & Parr 03]
     • Initialize policy π
     • Repeat:
       – Evaluate π: compute the weights w from a set of samples D
       – Improve π: π'(s) = argmax_a w·φ(s,a)
       – π ← π'
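The evaluation step ("compute w") is LSTDQ, which solves a linear system built from the samples. A minimal sketch, assuming D is a list of (s, a, r, s') transitions, `phi` maps a state-action pair to a length-k feature vector, `policy` maps a state to an action, and the small ridge term is an added assumption for numerical stability:

```python
import numpy as np

def lstdq(D, phi, policy, k, gamma=0.95, reg=1e-6):
    """Least-squares policy evaluation: solve A w = b for the weights of Q^π.

    A = Σ φ(s,a) (φ(s,a) - γ φ(s', π(s')))ᵀ
    b = Σ φ(s,a) r
    """
    A = reg * np.eye(k)              # small ridge term (an assumption, for stability)
    b = np.zeros(k)
    for (s, a, r, s_next) in D:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)
```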

  7. LSPI [Lagoudakis & Parr 03]
     • The same evaluate/improve loop: compute w, then π'(s) = argmax_a w·φ(s,a)
     • But LSPI does not specify how to collect the samples D
       – a fundamental challenge in online reinforcement learning
       – an agent only collects samples in states it visits…

  8. Exploration/Exploitation Tradeoff
     [Figure: a chain MDP with states 1, 2, 3, …, 98, 99, 100, where rewards are 0 in most
     states, one state gives a small reward of 0.001, and another gives a reward of 1000;
     accompanied by a plot of total reward vs. time contrasting efficient exploration,
     inefficient exploration, and the optimal policy.]

  9. Rmax [Brafman & Tennenholtz 02]
     • Rmax is for finite-state, finite-action MDPs
     • Learns T and R by counting/averaging
     • Partitions the state-actions S × A into "known" and "unknown"
     • In s_t, takes the optimal action in an optimistic model
       ("optimism in the face of uncertainty")
       – either: explore the "unknown" region
       – or: exploit the "known" region
     • Theorem: Rmax is provably efficient
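A minimal sketch of the counting-based "known"/"unknown" distinction that Rmax relies on; the visit threshold `m` and the tabular dictionary are assumptions for illustration (and require hashable states and actions):

```python
from collections import defaultdict

class KnownnessTracker:
    """Counts visits to (s, a) pairs; a pair is 'known' after m visits (Rmax-style)."""

    def __init__(self, m=5):
        self.m = m
        self.counts = defaultdict(int)

    def update(self, s, a):
        self.counts[(s, a)] += 1

    def is_known(self, s, a):
        return self.counts[(s, a)] >= self.m
```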

  10. LSPI-Rmax
      • Similar to LSPI
      • But distinguishes known and unknown state-actions (s,a), based on the samples in D
        – unknown state-actions: treat their Q-value as Q_max (like Rmax)
      • Implemented via modifications of LSTDQ

  11. LSTDQ-Rmax
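One plausible way to realize this modification, sketched here under stated assumptions rather than as the paper's exact update: unknown (s,a) pairs are excluded from the least-squares fit and valued at the optimistic constant Q_max = Rmax / (1 - γ), and backups whose next state-action is unknown use Q_max in place of the linear estimate; `is_known` is the counting-style test sketched above.

```python
import numpy as np

def lstdq_rmax(D, phi, policy, k, is_known, gamma=0.95, r_max=1.0, reg=1e-6):
    """LSTDQ with an Rmax-style optimistic treatment of unknown (s, a) pairs (sketch)."""
    q_max = r_max / (1.0 - gamma)
    A = reg * np.eye(k)
    b = np.zeros(k)
    for (s, a, r, s_next) in D:
        if not is_known(s, a):
            continue                      # unknown pairs are valued at Q_max directly,
                                          # not fitted from data (handled at action selection)
        f = phi(s, a)
        a_next = policy(s_next)
        if is_known(s_next, a_next):
            A += np.outer(f, f - gamma * phi(s_next, a_next))   # standard LSTDQ term
            b += f * r
        else:
            A += np.outer(f, f)           # treat the unknown successor as terminal...
            b += f * (r + gamma * q_max)  # ...with optimistic value Q_max
    return np.linalg.solve(A, b)
```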

  12. LSPI-Rmax for Online RL
      • D = empty set
      • Initialize w
      • for t = 1, 2, 3, …
        – Take greedy action: a_t = argmax_a w·φ(s_t, a)
        – D = D ∪ {(s_t, a_t, r_t, s_{t+1})}
        – Run LSPI using LSTDQ-Rmax
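An end-to-end sketch of this online loop, using the helpers sketched above (`phi`, `lstdq_rmax`, `KnownnessTracker`); the environment interface (`env.reset()` returning a state, `env.step(a)` returning next state, reward, and a done flag) and the optimistic action selection for unknown pairs are assumptions for illustration:

```python
import numpy as np

def run_lspi_rmax(env, phi, actions, k, steps=10_000, gamma=0.95,
                  r_max=1.0, m=5, pi_iters=3):
    """Online LSPI-Rmax loop (sketch): act greedily w.r.t. an optimistic Q,
    store every transition, and re-run approximate policy iteration."""
    D = []
    w = np.zeros(k)
    tracker = KnownnessTracker(m)        # counting-based knownness (an assumption;
                                         # states/actions must be hashable, e.g. discretized)
    q_max = r_max / (1.0 - gamma)

    def q(s, a):
        # Optimism in the face of uncertainty: unknown pairs are worth Q_max.
        return q_max if not tracker.is_known(s, a) else float(np.dot(w, phi(s, a)))

    def policy(s):
        return max(actions, key=lambda a: q(s, a))

    s = env.reset()
    for t in range(steps):
        a = policy(s)                    # greedy action: argmax_a w·φ(s_t, a), optimistic
        s_next, r, done = env.step(a)    # assumed toy interface: (next_state, reward, done)
        D.append((s, a, r, s_next))
        tracker.update(s, a)
        for _ in range(pi_iters):        # run LSPI using LSTDQ-Rmax on all samples so far
            w = lstdq_rmax(D, phi, policy, k, tracker.is_known, gamma, r_max)
        s = env.reset() if done else s_next
    return w
```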

  13. Experiments
      • Problems
        – MountainCar
        – Bicycle
        – Continuous Combination Lock
        – ExpressWorld (a variant of PuddleWorld)
          • four actions, stochastic transitions
          • reward: -1 per step; -0.5 per step in the "expresslane"; a penalty for stepping into puddles
          • random start states

  14. Various Exploration Rules with LSPI
      [Plot comparing exploration rules used with LSPI; annotation: "converges to better policies".]

  15. A Closer Look
      [Figure: states visited in the first 3 episodes, contrasting efficient exploration with
      inefficient exploration; efficient exploration helps discovery of the goal and the
      expresslane.]

  16. More Experiments

  17. Effect of Rmax Threshold

  18. Conclusions
      • We proposed LSPI-Rmax
        – LSPI + Rmax
        – encourages active exploration
        – with linear function approximation
      • Future directions
        – applying a similar idea to Gaussian-process RL
        – comparison to model-based RL


  20. Where are features from?
      • Hand-crafted features
        – expert knowledge required
        – expensive and error-prone
      • Generic features
        – RBF, CMAC, polynomial, etc.
        – may not always work well
      • Automatic feature selection using
        – Bellman error [Parr et al. 07]
        – spectral graph analysis [Mahadevan & Maggioni 07]
        – TD approximation [Li, Williams & Balakrishnan 09]
        – L1 regularization for LSPI [Kolter & Ng 09]

  21. LSPI-Rmax vs. MBRL
      • Model-based RL (e.g., Rmax)
        – learns an MDP model
        – computes a policy with the approximate model
        – can use function approximation in model learning
          • Rmax with many compact representations [Li 09]
      • LSPI-Rmax is model-free RL
        – avoids the expensive "planning" step
        – has weaker theoretical guarantees
