
Module 9: LAO* – CS 886 Sequential Decision Making and Reinforcement Learning (presentation transcript)



  1. Module 9: LAO*
  CS 886 Sequential Decision Making and Reinforcement Learning
  University of Waterloo

  2. Large State Space
  • Value Iteration, Policy Iteration and Linear Programming
    – Complexity at least quadratic in |S|
  • Problem: |S| may be very large
    – Queuing problems: infinite state space
    – Factored problems: exponentially many states
  CS886 (c) 2013 Pascal Poupart

  3. Mitigate Size of State Space
  • Two ideas:
  • Exploit the initial state
    – Not all states are reachable
  • Exploit a heuristic h
    – Approximation of the optimal value function
    – Usually an upper bound: h(s) ≥ V*(s) ∀s

  4–5. State Space
  [Diagram, built up over two slides: the full state space S, containing the set of states reachable from s0, which in turn contains the states reachable by the optimal policy π*]

  6. LAO* Algorithm
  • Related to:
    – A*: heuristic search over paths
    – AO*: heuristic search over trees
    – LAO*: heuristic search over cyclic graphs
  • LAO* alternates between:
    – State space expansion
    – Policy optimization
      • value iteration, policy iteration, linear programming

  7. Terminology
  • S: state space
  • S_E ⊆ S: envelope
    – Growing set of states
  • S_T ⊆ S_E: terminal states
    – States whose children are not in the envelope
  • S_{s0}^π ⊆ S_E: states reachable from s0 by following π
  • h(s): heuristic such that h(s) ≥ V*(s) ∀s
    – E.g., h(s) = max_{s,a} R(s,a) / (1 − γ)
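The example heuristic above, h(s) = max_{s,a} R(s,a)/(1 − γ), is the discounted value of earning the largest immediate reward forever, so it upper-bounds V*(s) for every state. A minimal sketch, assuming a tabular MDP whose rewards are stored in a dict keyed by (state, action) (this interface is an assumption, not from the slides):

```python
def rmax_heuristic(rewards, gamma):
    """Admissible upper bound on V*: largest reward, earned forever, discounted.

    rewards: dict mapping (state, action) -> immediate reward (assumed interface)
    gamma:   discount factor in [0, 1)
    """
    r_max = max(rewards.values())
    # Geometric series: r_max * (1 + gamma + gamma^2 + ...) = r_max / (1 - gamma)
    return r_max / (1.0 - gamma)
```

Since the bound ignores the transition structure entirely, it is cheap but loose; any tighter admissible estimate (e.g. from a relaxed problem) makes LAO* expand fewer states.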

  8. LAO* Algorithm
  LAO*(MDP, heuristic h)
    S_E ← {s0}, S_T ← {s0}
    Repeat
      Let R_E(s, a) = h(s) if s ∈ S_T, R(s, a) otherwise
      Let T_E(s′|s, a) = 0 if s ∈ S_T, Pr(s′|s, a) otherwise
      Find optimal policy π for ⟨S_E, R_E, T_E⟩
      Find reachable states S_{s0}^π
      Select reachable terminal states {s1, …, sk} ⊆ S_{s0}^π ∩ S_T
      S_T ← (S_T ∖ {s1, …, sk}) ∪ (children(s1, …, sk) ∖ S_E)
      S_E ← S_E ∪ children(s1, …, sk)
    Until S_{s0}^π ∩ S_T is empty
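The loop above can be sketched in Python for a tabular MDP. This is an illustrative reconstruction, not Poupart's code: the `trans`/`reward`/`children` callables are an assumed interface, value iteration is used as the inner policy-optimization step, and for simplicity every reachable terminal state is expanded (the pseudocode allows selecting a subset):

```python
def lao_star(s0, actions, trans, reward, children, h, gamma=0.95, tol=1e-6):
    """LAO* sketch for a tabular MDP (hypothetical interface).

    actions(s)   -> iterable of actions available in s
    trans(s, a)  -> dict {s': Pr(s'|s, a)}  (successors must be children of s)
    reward(s, a) -> float
    children(s)  -> iterable of all successor states of s
    h(s)         -> heuristic with h(s) >= V*(s)
    """
    envelope = {s0}            # S_E
    terminal = {s0}            # S_T
    V = {s0: h(s0)}
    policy = {}

    while True:
        # --- Policy optimization: value iteration on the envelope MDP.
        # Terminal states are frozen at their heuristic value (reward h(s),
        # no outgoing transitions).
        while True:
            delta = 0.0
            for s in envelope:
                if s in terminal:
                    V[s] = h(s)
                    continue
                best_a, best_q = None, float('-inf')
                for a in actions(s):
                    q = reward(s, a) + gamma * sum(
                        p * V[sp] for sp, p in trans(s, a).items())
                    if q > best_q:
                        best_a, best_q = a, q
                delta = max(delta, abs(best_q - V[s]))
                V[s], policy[s] = best_q, best_a
            if delta < tol:
                break

        # --- Find states reachable from s0 under the current policy.
        reachable, frontier = {s0}, [s0]
        while frontier:
            s = frontier.pop()
            if s in terminal:      # don't look past the envelope's frontier
                continue
            for sp in trans(s, policy[s]):
                if sp not in reachable:
                    reachable.add(sp)
                    frontier.append(sp)

        expand = reachable & terminal
        if not expand:             # no reachable terminal states: converged
            return policy, V

        # --- Expansion: expanded terminals become internal; their
        # previously unseen children join the envelope as new terminals.
        terminal -= expand
        for s in expand:
            for c in children(s):
                if c not in envelope:
                    envelope.add(c)
                    terminal.add(c)
                    V[c] = h(c)
```

Because terminal states keep the value h(s) ≥ V*(s), the value function computed on the envelope stays an upper bound on V* throughout, which is what the convergence argument on slide 10 relies on.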

  9. Efficiency
  Efficiency is influenced by:
  1. Choice of terminal states to add to the envelope
  2. Algorithm used to find the optimal policy
    – Can use value iteration, policy iteration, modified policy iteration, linear programming
    – Key: reuse previous computation
      • E.g., start with the previous policy or value function at each iteration
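The reuse point can be made concrete with value iteration: after an envelope expansion, restarting from the previous value function instead of from scratch typically needs far fewer sweeps, because the values of most states barely change. A small sketch (the interface is hypothetical, mirroring no particular library):

```python
def value_iteration(states, actions, trans, reward, gamma, V0=None, tol=1e-8):
    """Value iteration that can be warm-started from a previous estimate V0.

    Returns (V, sweeps): the value function and the number of sweeps used.
    trans(s, a) -> dict {s': prob}; reward(s, a) -> float (assumed interface)
    """
    # Warm start: reuse the previous value function if one is supplied.
    V = dict(V0) if V0 is not None else {s: 0.0 for s in states}
    sweeps = 0
    while True:
        sweeps += 1
        delta = 0.0
        for s in states:
            q = max(reward(s, a) + gamma * sum(
                        p * V[sp] for sp, p in trans(s, a).items())
                    for a in actions(s))
            delta = max(delta, abs(q - V[s]))
            V[s] = q
        if delta < tol:
            return V, sweeps
```

In LAO*, passing the previous iteration's V (with heuristic values for the newly added states) as `V0` is exactly the kind of reuse the slide suggests.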

  10. Convergence
  • Theorem: LAO* converges to the optimal policy
  • Proof:
    – Fact: at each iteration, the value function V is an upper bound on V*, due to the heuristic function h
    – Proof by contradiction: suppose the algorithm stops, but π is not optimal
      • Since the algorithm stopped, all states reachable by π are in S_E ∖ S_T
      • Hence the value function V is the value of π, and since π is suboptimal, V < V*, which contradicts the fact that V is an upper bound on V*

  11. Summary
  • LAO*
    – Extension of the basic solution algorithms (value iteration, policy iteration, linear programming)
    – Exploits the initial state and a heuristic function
    – Gradually grows an envelope of states
    – Complexity depends on the number of reachable states instead of the size of the state space
