Module 9: LAO*
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
(c) 2013 Pascal Poupart
Large State Space

• Value Iteration, Policy Iteration and Linear Programming
  – Complexity at least quadratic in |S|
• Problem: |S| may be very large
  – Queuing problems: infinite state space
  – Factored problems: exponentially many states
Mitigate Size of State Space

• Two ideas:
• Exploit initial state
  – Not all states are reachable
• Exploit heuristic h
  – Approximation of optimal value function
  – Usually an upper bound: h(s) ≥ V*(s) ∀s
State Space

[Figure: the full state space S, containing the set of states reachable from s0, which in turn contains the smaller set of states reachable by the optimal policy π*.]
LAO* Algorithm

• Related to
  – A*: path heuristic search
  – AO*: tree heuristic search
  – LAO*: cyclic graph heuristic search
• LAO* alternates between
  – State space expansion
  – Policy optimization
    • value iteration, policy iteration, linear programming
Terminology

• S: state space
• S_E ⊆ S: envelope
  – Growing set of states
• S_T ⊆ S_E: terminal states
  – States whose children are not in the envelope
• S^π_{s0} ⊆ S_E: states reachable from s0 by following π
• h(s): heuristic such that h(s) ≥ V*(s) ∀s
  – E.g., h(s) = max_{s,a} R(s,a) / (1 − γ), as computed in the sketch below
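This particular heuristic is an upper bound because no discounted return can exceed the geometric sum of the largest one-step reward. A minimal sketch of computing it, assuming a tabular MDP with rewards stored in a NumPy array R of shape |S| × |A| and discount factor gamma (function and variable names are hypothetical):

```python
import numpy as np

def upper_bound_heuristic(R, gamma):
    """Constant heuristic h(s) = max_{s,a} R(s,a) / (1 - gamma).

    Any discounted return is at most R_max * (1 + gamma + gamma^2 + ...)
    = R_max / (1 - gamma), so h(s) >= V*(s) for every state s.
    """
    r_max = R.max()  # largest one-step reward over all (s, a) pairs
    return np.full(R.shape[0], r_max / (1.0 - gamma))
```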
LAO* Algorithm

LAO*(MDP, heuristic h)
  S_E ← {s0}, S_T ← {s0}
  Repeat
    Let R_E(s, a) = h(s) if s ∈ S_T, R(s, a) otherwise
    Let Pr_E(s'|s, a) = 0 if s ∈ S_T, Pr(s'|s, a) otherwise
    Find optimal policy π for ⟨S_E, R_E, Pr_E⟩
    Find reachable states S^π_{s0}
    Select reachable terminal states s1, …, sk ∈ S^π_{s0} ∩ S_T
    S_T ← (S_T − {s1, …, sk}) ∪ (children of {s1, …, sk} − S_E)
    S_E ← S_E ∪ children of {s1, …, sk}
  Until S^π_{s0} ∩ S_T is empty
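A minimal executable sketch of this loop, assuming a tabular MDP given as dicts P[s][a] = list of (next_state, prob) pairs and R[s][a] = reward, with value iteration as the inner solver; all names are hypothetical, and for simplicity it expands every reachable terminal state each round rather than a selected subset:

```python
def lao_star(s0, P, R, h, gamma, tol=1e-6):
    """LAO* sketch for a tabular MDP.

    P[s][a]: list of (next_state, prob) pairs; R[s][a]: reward;
    h: dict with h[s] >= V*(s) for all s; gamma: discount factor.
    """
    env = {s0}        # S_E: the envelope
    terminal = {s0}   # S_T: envelope states whose children are outside S_E

    while True:
        # Policy optimization on the restricted MDP via value iteration.
        # Terminal states are absorbing and collect h(s) once, so their
        # value is fixed at h(s).
        V = {s: h[s] for s in env}
        while True:
            delta = 0.0
            for s in env - terminal:
                v = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                        for a in P[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        pi = {s: max(P[s], key=lambda a: R[s][a] + gamma *
                     sum(p * V[s2] for s2, p in P[s][a]))
              for s in env - terminal}

        # Find the states reachable from s0 by following pi.
        reachable, stack = {s0}, [s0]
        while stack:
            s = stack.pop()
            if s in terminal:
                continue          # frontier reached: do not look past S_T
            for s2, p in P[s][pi[s]]:
                if p > 0 and s2 not in reachable:
                    reachable.add(s2)
                    stack.append(s2)

        # Expansion: reachable terminal states become interior states;
        # their children join the envelope, new ones as terminal states.
        frontier = reachable & terminal
        if not frontier:
            return pi, V          # S^pi_{s0} and S_T are disjoint: done
        for s in frontier:
            terminal.discard(s)
            children = {s2 for a in P[s] for s2, p in P[s][a] if p > 0}
            terminal |= children - env
            env |= children
```

Making terminal states absorbing with a one-time payoff of h(s) is what keeps V an upper bound on V* throughout; this is the invariant the convergence proof below relies on.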
Efficiency

Efficiency is influenced by
1. Choice of terminal states to add to the envelope
2. Algorithm to find the optimal policy
   – Can use value iteration, policy iteration, modified policy iteration, linear programming
   – Key: reuse previous computation
     • E.g., start with the previous policy or value function at each iteration
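As one concrete instance of the reuse point, a hypothetical tweak to the sketch above: carry V between rounds instead of rebuilding it, so only newly added states start from the heuristic and value iteration re-converges in few sweeps:

```python
def warm_start_values(env, h, V_prev):
    """Initialize a round's value function from the previous round.

    States already in the envelope keep their previous estimate; states
    just added start at h(s), which is itself an upper bound on V*(s).
    """
    return {s: V_prev.get(s, h[s]) for s in env}
```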
Convergence

• Theorem: LAO* converges to the optimal policy
• Proof:
  – Fact: at each iteration, the value function V is an upper bound on V*, due to the heuristic function h
  – Proof by contradiction: suppose the algorithm stops, but π is not optimal
    • Since the algorithm stopped, all states reachable by π are in S_E − S_T, where rewards and transitions are exact
    • Hence the value function V is the true value of π, and since π is suboptimal, V(s0) = V^π(s0) < V*(s0), which contradicts the fact that V is an upper bound on V*
Summary

• LAO*
  – Extension of basic solution algorithms (value iteration, policy iteration, linear programming)
  – Exploit initial state and heuristic function
  – Gradually grow an envelope of states
  – Complexity depends on # of reachable states instead of size of state space