
Module 9: LAO* - CS 886 Sequential Decision Making and Reinforcement Learning - PowerPoint PPT Presentation



  1. Module 9 LAO* CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

  2. Large State Space • Value Iteration, Policy Iteration and Linear Programming – Complexity at least quadratic in |S| • Problem: |S| may be very large – Queuing problems: infinite state space – Factored problems: exponentially many states CS886 (c) 2013 Pascal Poupart
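As a concrete illustration of the per-sweep cost mentioned above, here is a hedged value-iteration sketch; the 3-state, 2-action MDP and the arrays `P` and `R` are invented for illustration, not from the course:

```python
import numpy as np

# Minimal value-iteration sketch on a made-up 3-state, 2-action MDP.
# Each Bellman sweep touches every (s, a, s') triple, so one sweep costs
# O(|S|^2 |A|) work -- at least quadratic in |S|, as the slide notes.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)      # one full backup, shape (|S|, |A|)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # sup-norm convergence test
        break
    V = V_new
```

For queuing problems (infinite |S|) or factored problems (exponential |S|) the dense arrays above cannot even be stored, which motivates the envelope-based approach in the rest of the deck.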

  3. Mitigate Size of State Space • Two ideas: • Exploit initial state – Not all states are reachable • Exploit heuristic h – approximation of the optimal value function – usually an upper bound: h(s) ≥ V*(s) ∀s

  4. State Space [Figure: nested regions – the state space S, the states reachable from s0, and the states reachable by π*]
  6. LAO* Algorithm • Related to – A*: path heuristic search – AO*: tree heuristic search – LAO*: cyclic graph heuristic search • LAO* alternates between – State space expansion – Policy optimization • value iteration, policy iteration, linear programming

  7. Terminology • S: state space • S_E ⊆ S: envelope – growing set of states • S_T ⊆ S_E: terminal states – states whose children are not in the envelope • S_s0^π ⊆ S_E: states reachable from s0 by following π • h(s): heuristic such that h(s) ≥ V*(s) ∀s – E.g., h(s) = max_{s,a} R(s,a) / (1 - γ)
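The example heuristic above is an upper bound because every discounted return sums rewards that are each at most max_{s,a} R(s,a). A tiny numeric sanity check (the values R_max = 1 and γ = 0.9 are made up):

```python
# Why h(s) = max_{s,a} R(s,a) / (1 - gamma) upper-bounds V*(s):
# every discounted return sums rewards each at most R_max, so
# V*(s) <= sum_t gamma^t * R_max = R_max / (1 - gamma).
gamma = 0.9
R_max = 1.0                      # assumed bound on rewards (made up)
h = R_max / (1 - gamma)          # admissible upper-bound heuristic
# Even a long run of maximal rewards stays below h:
ret = sum(gamma ** t * R_max for t in range(10_000))
assert ret <= h
```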

  8. LAO* Algorithm
     LAO*(MDP, heuristic h)
       S_E ← {s0}, S_T ← {s0}
       Repeat
         Let R_E(s, a) = h(s) if s ∈ S_T, R(s, a) otherwise
         Let T_E(s' | s, a) = 0 if s ∈ S_T, Pr(s' | s, a) otherwise
         Find optimal policy π for (S_E, R_E, T_E)
         Find reachable states S_s0^π
         Select reachable terminal states {s1, ..., sk} ⊆ S_s0^π ∩ S_T
         S_T ← (S_T \ {s1, ..., sk}) ∪ (children of {s1, ..., sk} \ S_E)
         S_E ← S_E ∪ children of {s1, ..., sk}
       Until S_s0^π ∩ S_T is empty
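The pseudocode above can be sketched in Python for a small enumerated MDP; this is a simplified reading, not the course's reference implementation. It freezes terminal states at their heuristic value V[s] = h(s) (equivalent to the slide's R_E(s,a) = h(s) and T_E(s'|s,a) = 0 for s ∈ S_T), uses value iteration restricted to the envelope as the policy-optimization step, and expands all reachable terminal states at once:

```python
import numpy as np

def lao_star(P, R, gamma, h, s0, n_vi_iters=200):
    """Sketch of LAO* for an enumerated MDP with P[s, a, s'] and R[s, a]."""
    n_states, n_actions = R.shape
    envelope = {s0}              # S_E: growing set of states
    terminal = {s0}              # S_T: states whose children are outside S_E
    V = h.astype(float).copy()   # initialized to the heuristic (upper bound)
    while True:
        # Policy optimization: value iteration over the envelope interior.
        # Terminal states keep V[s] = h(s).
        interior = envelope - terminal
        for _ in range(n_vi_iters):
            for s in interior:
                V[s] = max(R[s, a] + gamma * P[s, a] @ V
                           for a in range(n_actions))
        policy = {s: int(np.argmax([R[s, a] + gamma * P[s, a] @ V
                                    for a in range(n_actions)]))
                  for s in interior}
        # Reachable states S_s0^pi (traversal stops at terminal states).
        reachable, frontier = {s0}, [s0]
        while frontier:
            s = frontier.pop()
            if s in terminal:
                continue
            for s2 in map(int, np.flatnonzero(P[s, policy[s]])):
                if s2 not in reachable:
                    reachable.add(s2)
                    frontier.append(s2)
        expand = reachable & terminal
        if not expand:           # no reachable terminal states: converged
            return policy, V
        # Grow the envelope with the children of the expanded states.
        children = {int(s2) for s in expand for a in range(n_actions)
                    for s2 in np.flatnonzero(P[s, a])}
        terminal = (terminal - expand) | (children - envelope)
        envelope |= children
```

On a small MDP this converges after expanding only the states the greedy policy actually reaches; states never reached simply keep their heuristic value, which is exactly the complexity benefit the summary slide claims.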

  9. Efficiency • Efficiency influenced by 1. Choice of terminal states to add to the envelope 2. Algorithm to find the optimal policy – Can use value iteration, policy iteration, modified policy iteration, linear programming – Key: reuse previous computation • E.g., start with the previous policy or value function at each iteration

  10. Convergence • Theorem: LAO* converges to the optimal policy • Proof: – Fact: at each iteration, the value function V is an upper bound on V* due to the heuristic function h – Proof by contradiction: suppose the algorithm stops, but π is not optimal. • Since the algorithm stopped, all states reachable by π are in S_E \ S_T • Hence, the value function V is the value of π, and since π is suboptimal, V < V*, which contradicts the fact that V is an upper bound on V*

  11. Summary • LAO* – Extension of basic solution algorithms (value iteration, policy iteration, linear programming) – Exploits the initial state and a heuristic function – Gradually grows an envelope of states – Complexity depends on the number of reachable states instead of the size of the state space
