
CS 730/830: Intro AI. Solving MDPs; MDP Extras. Wheeler Ruml (UNH).



  1. CS 730/830: Intro AI. Solving MDPs; MDP Extras. Wheeler Ruml (UNH), Lecture 20, CS 730.

  2. Solving MDPs: Definition, What to do?, Value Iteration, Stopping, Sweeping, SSPs,
     RTDP, UCT, Break, Policy Iteration, Policy Evaluation, Summary.
     MDP Extras.

  3. Markov Decision Process (MDP)
     initial state: s_0
     transition model: T(s, a, s') = probability of going from s to s' after doing a
     reward function: R(s) for landing in state s
     terminal states: sinks = absorbing states (end the trial)
     objective:
       total reward over a (finite) trajectory: R(s_0) + R(s_1) + R(s_2) + ...
       discounted reward: penalize the future by γ: R(s_0) + γ R(s_1) + γ^2 R(s_2) + ...
     find:
       policy: π(s) = a
       optimal policy: π*
       proper policy: reaches a terminal state
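To make the pieces concrete, here is a minimal sketch of how the MDP components above might be represented in Python. The particular states, actions, probabilities, and rewards are invented for illustration; only the structure (T, R, terminal states, γ) comes from the slide.

# illustrative MDP representation; states/actions/numbers are made up
GAMMA = 0.9  # discount factor

# transition model: T[s][a] is a list of (s_prime, probability) pairs
T = {
    "s0":   {"left":  [("s0", 0.2), ("s1", 0.8)],
             "right": [("s2", 1.0)]},
    "s1":   {"left":  [("s0", 1.0)],
             "right": [("goal", 1.0)]},
    "s2":   {"left":  [("s1", 0.5), ("s2", 0.5)],
             "right": [("goal", 1.0)]},
    "goal": {},  # absorbing terminal state: no actions
}

# reward function: R(s) for landing in state s
R = {"s0": -0.1, "s1": -0.1, "s2": -0.1, "goal": 1.0}

TERMINALS = {"goal"}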

  4. What to do?
     π*(s) = argmax_a Σ_{s'} T(s, a, s') U^{π*}(s')

     U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]

     The key:
     U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
     (Richard Bellman, 1957)
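The two equations above translate directly into code. A minimal sketch, assuming the dict-based T/R representation from the previous example; the function names bellman_backup and greedy_policy are mine, not from the slides.

def bellman_backup(s, U, T, R, gamma):
    """U(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U(s')."""
    if not T[s]:                      # terminal/absorbing state: no actions
        return R[s]
    return R[s] + gamma * max(
        sum(p * U[sp] for sp, p in outcomes)
        for outcomes in T[s].values())

def greedy_policy(U, T, R, gamma):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') U(s') (one-step look-ahead)."""
    pi = {}
    for s, actions in T.items():
        if not actions:
            continue                  # no choice to make at terminal states
        pi[s] = max(actions,
                    key=lambda a: sum(p * U[sp] for sp, p in actions[a]))
    return pi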

  5. Value Iteration
     Repeated Bellman updates:

     repeat until happy:
       for each state s:
         U'(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
       U ← U'

     With infinitely many updates everywhere, guaranteed to reach equilibrium.
     The equilibrium is the unique solution to the Bellman equations!
     Asynchronous updating also works: it converges if every state is updated
     infinitely often (no state permanently ignored).
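A minimal value iteration sketch following the pseudocode above, again assuming the dict-based model from the earlier example. The stopping test uses the bound from the next slide: stop once the maximum update, scaled by γ/(1 − γ), falls below ε (this assumes γ < 1).

def value_iteration(T, R, gamma, epsilon=1e-4):
    U = {s: 0.0 for s in T}                      # arbitrary initial values
    while True:
        U_new = {}
        for s in T:
            if not T[s]:                         # terminal state
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[sp] for sp, p in outcomes)
                    for outcomes in T[s].values())
        max_update = max(abs(U_new[s] - U[s]) for s in T)
        U = U_new
        if max_update * gamma / (1 - gamma) < epsilon:   # then ||U_i - U*|| < epsilon
            return U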

  6. Stopping
     ||U_i − U_{i−1}|| = max difference between corresponding elements
     U* = U^{π*}

     if ||U_i − U_{i−1}|| · γ/(1 − γ) < ε, then ||U_i − U*|| < ε

  7. Stopping (continued)
     if ||U_i − U*|| < ε, then ||U^{π_i} − U^{π*}|| < 2εγ/(1 − γ)

  8. Stopping (continued)
     loss < 2 · maxUpdate · γ/(1 − γ)
     maxUpdate > loss · (1 − γ)/(2γ)

  9. Stopping
     maxUpdate > loss · (1 − γ)/(2γ)
     [plot: (1 − x)/(2x) as a function of x = γ, for γ from 0 to 1]

  10. Prioritized Sweeping
      Concentrate updates on states whose value changes!

      to update state s with change δ in U(s):
        update U(s)
        priority of s ← 0
        for each predecessor s' of s:
          priority of s' ← max of its current priority and max_a δ · T̂(s', a, s)
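A rough sketch of the bookkeeping behind prioritized sweeping, assuming the dict-based model from the earlier examples plus a precomputed predecessor map. It uses heapq as a max-priority queue (with negated priorities) and allows duplicate queue entries rather than a true decrease-key, which is a simplification; the helper names are mine.

import heapq

def predecessors(T):
    """Map each state s to the set of (s_pred, a) pairs that can reach it."""
    preds = {s: set() for s in T}
    for s_pred, actions in T.items():
        for a, outcomes in actions.items():
            for s, p in outcomes:
                if p > 0:
                    preds[s].add((s_pred, a))
    return preds

def prioritized_sweeping(T, R, gamma, n_updates=1000, theta=1e-5):
    U = {s: R[s] if not T[s] else 0.0 for s in T}
    preds = predecessors(T)
    # seed the queue with every non-terminal state at equal priority
    queue = [(-1.0, s) for s in T if T[s]]
    heapq.heapify(queue)
    for _ in range(n_updates):
        if not queue:
            break
        _, s = heapq.heappop(queue)          # popping resets s's priority
        old = U[s]
        U[s] = R[s] + gamma * max(           # Bellman backup at s
            sum(p * U[sp] for sp, p in outcomes)
            for outcomes in T[s].values())
        delta = abs(U[s] - old)
        for s_pred, a in preds[s]:           # bump predecessors whose value may change
            p_reach = dict(T[s_pred][a]).get(s, 0.0)
            priority = delta * p_reach
            if priority > theta:
                heapq.heappush(queue, (-priority, s_pred))
    return U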

  11. Stochastic Shortest Path Problems (SSPs)
      minimize the sum of action costs
      all action costs ≥ 0
      non-empty set of (absorbing, zero-cost) goal states
      there exists at least one proper policy
        proper policy: eventually brings the agent to a goal from any state with probability 1

  12. Real-time Dynamic Programming (RTDP)
      Which states to update?

      initialize U to an upper bound
      do trials until happy:
        s ← s_0
        until at a goal:
          a, u_a ← argmin_a and min_a of c(s, a) + Σ_{s'} T(s, a, s') U(s')
          U(s) ← u_a
          s ← pick among s' weighted by T(s, a, s')

      updates the states the agent is likely to visit under the current policy
      nice anytime profile
      in practice, do updates backward from the end of the trajectory
      convergence guaranteed by optimism
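A sketch of the RTDP trial loop above for a stochastic shortest path setting, where U(s) is a cost-to-go estimate initialized from a bound h. The cost table C[s][a], goal set GOALS, heuristic h, and function names are assumptions for illustration; T uses the same (s', probability) list format as the earlier examples.

import random

def q_value(s, a, U, T, C):
    """c(s, a) + sum_{s'} T(s, a, s') U(s')."""
    return C[s][a] + sum(p * U[sp] for sp, p in T[s][a])

def rtdp(T, C, GOALS, h, s0, n_trials=100):
    U = {s: h(s) for s in T}                 # initialize U from the bound h
    for _ in range(n_trials):
        s = s0
        while s not in GOALS:                # one trial: walk until a goal is reached
            # best action under the current U, and its backed-up value
            a = min(T[s], key=lambda act: q_value(s, act, U, T, C))
            U[s] = q_value(s, a, U, T, C)    # update U(s) in place
            # sample the successor weighted by T(s, a, s')
            succs, probs = zip(*T[s][a])
            s = random.choices(succs, weights=probs, k=1)[0]
    return U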

  13. Upper Confidence Bounds on Trees (UCT, 2006)
      on-line action selection
      Monte Carlo tree search (MCTS): descent, roll-out, update, growth

      W(s, a) = total reward
      N(s, a) = number of times a was tried in s
      N(s) = number of times s was visited

      Z(s, a) = W(s, a)/N(s, a) + C √( log N(s) / N(s, a) )

      roll-out policy
      add one node after each roll-out
      consistent!
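A small sketch of the action-selection rule Z(s, a) above, assuming the node statistics are kept in dictionaries W, N_sa, and N_s; the exploration constant C and these structures are illustrative choices, not from the slides.

import math

def uct_select(s, actions, W, N_sa, N_s, C=1.4):
    """Pick the action maximizing W/N + C * sqrt(log N(s) / N(s,a))."""
    def z(a):
        if N_sa.get((s, a), 0) == 0:
            return float("inf")              # try untried actions first
        return (W[(s, a)] / N_sa[(s, a)]
                + C * math.sqrt(math.log(N_s[s]) / N_sa[(s, a)]))
    return max(actions, key=z)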

  14. Break
      asst 9
      project: reading, prep, final proposal
      wildcard topic

  15. Policy Iteration
      repeat until π doesn't change:
        given π, compute U^π(s) for all states
        given U, calculate the policy by one-step look-ahead

      If π doesn't change, U doesn't either.
      We are at an equilibrium (= optimal π)!
      (a code sketch follows slide 18)

  16. Policy Evaluation
      computing U^π(s):

      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s')

  17. Policy Evaluation
      computing U^π(s):

      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s')

      solve exactly with linear algebra (O(N^3)), or ...

  18. Policy Evaluation
      computing U^π(s):

      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s')

      solve exactly with linear algebra (O(N^3)), or use simplified value iteration:

      do a few times:
        U_{i+1}(s) ← R(s) + γ Σ_{s'} T(s, π(s), s') U_i(s')

      (simplified because we are given π: no max over a)
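A sketch combining slides 15-18: iterative policy evaluation (simplified value iteration, with no max over a) inside the policy iteration loop, assuming the dict-based model from the earlier examples; the function names are mine.

def policy_evaluation(pi, U, T, R, gamma, k=20):
    """Do a few sweeps of U(s) <- R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s')."""
    for _ in range(k):
        U = {s: R[s] if not T[s] else
                R[s] + gamma * sum(p * U[sp] for sp, p in T[s][pi[s]])
             for s in T}
    return U

def policy_iteration(T, R, gamma):
    U = {s: 0.0 for s in T}
    pi = {s: next(iter(T[s])) for s in T if T[s]}      # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, U, T, R, gamma)      # given pi, compute U^pi
        # improvement: one-step look-ahead, greedy with respect to U
        new_pi = {s: max(T[s], key=lambda a: sum(p * U[sp] for sp, p in T[s][a]))
                  for s in T if T[s]}
        if new_pi == pi:                               # pi unchanged: equilibrium
            return pi, U
        pi = new_pi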

  19. Summary of MDP Solving
      value iteration: compute U^{π*}
        prioritized sweeping
        RTDP
      policy iteration: compute U^π using
        linear algebra
        simplified value iteration
        a few updates (modified PI)

  20. MDP Extras: ADP, Bandits, Q-Learning, RL Summary, Approx U, Deep RL, EOLQs
