
CS 730/830: Intro AI. Solving MDPs; MDP Extras. Wheeler Ruml (UNH).



  1. CS 730/830: Intro AI. Solving MDPs; MDP Extras. Wheeler Ruml (UNH), Lecture 20, CS 730.

  2. Solving MDPs: Definition, What to do?, Value Iteration, Stopping, Sweeping, SSPs,
     RTDP, UCT, Break, Policy Iteration, Policy Evaluation, Summary.
     MDP Extras.

  3. Markov Decision Process (MDP)
     initial state: s_0
     transition model: T(s, a, s') = probability of going from s to s' after doing a
     reward function: R(s) for landing in state s
     terminal states: sinks = absorbing states (end the trial)
     objective:
       total reward over a (finite) trajectory: R(s_0) + R(s_1) + R(s_2) + ...
       discounted reward: penalize the future by γ: R(s_0) + γ R(s_1) + γ^2 R(s_2) + ...
     find:
       policy: π(s) = a
       optimal policy: π*
       proper policy: reaches a terminal state
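To make the pieces concrete, here is a minimal sketch of how the MDP components above might be represented in Python. The particular states, actions, probabilities, and rewards are invented for illustration; only the structure (T, R, terminal states, γ) comes from the slide.

# illustrative MDP representation; states/actions/numbers are made up
GAMMA = 0.9  # discount factor

# transition model: T[s][a] is a list of (s_prime, probability) pairs
T = {
    "s0":   {"left":  [("s0", 0.2), ("s1", 0.8)],
             "right": [("s2", 1.0)]},
    "s1":   {"left":  [("s0", 1.0)],
             "right": [("goal", 1.0)]},
    "s2":   {"left":  [("s1", 0.5), ("s2", 0.5)],
             "right": [("goal", 1.0)]},
    "goal": {},  # absorbing terminal state: no actions
}

# reward function: R(s) for landing in state s
R = {"s0": -0.1, "s1": -0.1, "s2": -0.1, "goal": 1.0}

TERMINALS = {"goal"}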

  4. What to do?
     π*(s) = argmax_a Σ_{s'} T(s, a, s') U^{π*}(s')

     U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]

     The key:
     U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
     (Richard Bellman, 1957)
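The two equations above translate directly into code. A minimal sketch, assuming the dict-based T/R representation from the previous example; the function names bellman_backup and greedy_policy are mine, not from the slides.

def bellman_backup(s, U, T, R, gamma):
    """U(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U(s')."""
    if not T[s]:                      # terminal/absorbing state: no actions
        return R[s]
    return R[s] + gamma * max(
        sum(p * U[sp] for sp, p in outcomes)
        for outcomes in T[s].values())

def greedy_policy(U, T, R, gamma):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') U(s') (one-step look-ahead)."""
    pi = {}
    for s, actions in T.items():
        if not actions:
            continue                  # no choice to make at terminal states
        pi[s] = max(actions,
                    key=lambda a: sum(p * U[sp] for sp, p in actions[a]))
    return pi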

  5. Value Iteration
     Repeated Bellman updates:

     repeat until happy:
       for each state s:
         U'(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
       U ← U'

     With infinitely many updates everywhere, guaranteed to reach equilibrium.
     The equilibrium is the unique solution to the Bellman equations!
     Asynchronous updating also works: it converges if every state is updated
     infinitely often (no state permanently ignored).
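A minimal value iteration sketch following the pseudocode above, again assuming the dict-based model from the earlier example. The stopping test uses the bound from the next slide: stop once the maximum update, scaled by γ/(1 − γ), falls below ε (this assumes γ < 1).

def value_iteration(T, R, gamma, epsilon=1e-4):
    U = {s: 0.0 for s in T}                      # arbitrary initial values
    while True:
        U_new = {}
        for s in T:
            if not T[s]:                         # terminal state
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[sp] for sp, p in outcomes)
                    for outcomes in T[s].values())
        max_update = max(abs(U_new[s] - U[s]) for s in T)
        U = U_new
        if max_update * gamma / (1 - gamma) < epsilon:   # then ||U_i - U*|| < epsilon
            return U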

  6. Stopping
     ||U_i − U_{i−1}|| = max difference between corresponding elements
     U* = U^{π*}

     if ||U_i − U_{i−1}|| · γ/(1 − γ) < ε, then ||U_i − U*|| < ε

  7. Stopping (continued)
     if ||U_i − U*|| < ε, then ||U^{π_i} − U^{π*}|| < 2εγ/(1 − γ)

  8. Stopping (continued)
     loss < 2 · maxUpdate · γ/(1 − γ)
     maxUpdate > loss · (1 − γ)/(2γ)

  9. Stopping
     maxUpdate > loss · (1 − γ)/(2γ)
     [plot: (1 − x)/(2x) as a function of x = γ, for γ from 0 to 1]

  10. Prioritized Sweeping
      Concentrate updates on states whose value changes!

      to update state s with change δ in U(s):
        update U(s)
        priority of s ← 0
        for each predecessor s' of s:
          priority of s' ← max of its current priority and max_a δ · T̂(s', a, s)
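A rough sketch of the bookkeeping behind prioritized sweeping, assuming the dict-based model from the earlier examples plus a precomputed predecessor map. It uses heapq as a max-priority queue (with negated priorities) and allows duplicate queue entries rather than a true decrease-key, which is a simplification; the helper names are mine.

import heapq

def predecessors(T):
    """Map each state s to the set of (s_pred, a) pairs that can reach it."""
    preds = {s: set() for s in T}
    for s_pred, actions in T.items():
        for a, outcomes in actions.items():
            for s, p in outcomes:
                if p > 0:
                    preds[s].add((s_pred, a))
    return preds

def prioritized_sweeping(T, R, gamma, n_updates=1000, theta=1e-5):
    U = {s: R[s] if not T[s] else 0.0 for s in T}
    preds = predecessors(T)
    # seed the queue with every non-terminal state at equal priority
    queue = [(-1.0, s) for s in T if T[s]]
    heapq.heapify(queue)
    for _ in range(n_updates):
        if not queue:
            break
        _, s = heapq.heappop(queue)          # popping resets s's priority
        old = U[s]
        U[s] = R[s] + gamma * max(           # Bellman backup at s
            sum(p * U[sp] for sp, p in outcomes)
            for outcomes in T[s].values())
        delta = abs(U[s] - old)
        for s_pred, a in preds[s]:           # bump predecessors whose value may change
            p_reach = dict(T[s_pred][a]).get(s, 0.0)
            priority = delta * p_reach
            if priority > theta:
                heapq.heappush(queue, (-priority, s_pred))
    return U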

  11. Stochastic Shortest Path Problems (SSPs)
      minimize the sum of action costs
      all action costs ≥ 0
      non-empty set of (absorbing, zero-cost) goal states
      there exists at least one proper policy
        proper policy: eventually brings the agent to a goal from any state with probability 1

  12. Real-time Dynamic Programming (RTDP)
      Which states to update?

      initialize U to an upper bound
      do trials until happy:
        s ← s_0
        until at a goal:
          a, u_a ← argmin_a and min_a of c(s, a) + Σ_{s'} T(s, a, s') U(s')
          U(s) ← u_a
          s ← pick among s' weighted by T(s, a, s')

      updates the states the agent is likely to visit under the current policy
      nice anytime profile
      in practice, do updates backward from the end of the trajectory
      convergence guaranteed by optimism
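A sketch of the RTDP trial loop above for a stochastic shortest path setting, where U(s) is a cost-to-go estimate initialized from a bound h. The cost table C[s][a], goal set GOALS, heuristic h, and function names are assumptions for illustration; T uses the same (s', probability) list format as the earlier examples.

import random

def q_value(s, a, U, T, C):
    """c(s, a) + sum_{s'} T(s, a, s') U(s')."""
    return C[s][a] + sum(p * U[sp] for sp, p in T[s][a])

def rtdp(T, C, GOALS, h, s0, n_trials=100):
    U = {s: h(s) for s in T}                 # initialize U from the bound h
    for _ in range(n_trials):
        s = s0
        while s not in GOALS:                # one trial: walk until a goal is reached
            # best action under the current U, and its backed-up value
            a = min(T[s], key=lambda act: q_value(s, act, U, T, C))
            U[s] = q_value(s, a, U, T, C)    # update U(s) in place
            # sample the successor weighted by T(s, a, s')
            succs, probs = zip(*T[s][a])
            s = random.choices(succs, weights=probs, k=1)[0]
    return U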

  13. Upper Confidence Bounds on Trees (UCT, 2006)
      on-line action selection
      Monte Carlo tree search (MCTS): descent, roll-out, update, growth

      W(s, a) = total reward
      N(s, a) = number of times a was tried in s
      N(s) = number of times s was visited

      Z(s, a) = W(s, a)/N(s, a) + C √( log N(s) / N(s, a) )

      roll-out policy
      add one node after each roll-out
      consistent!
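A small sketch of the action-selection rule Z(s, a) above, assuming the node statistics are kept in dictionaries W, N_sa, and N_s; the exploration constant C and these structures are illustrative choices, not from the slides.

import math

def uct_select(s, actions, W, N_sa, N_s, C=1.4):
    """Pick the action maximizing W/N + C * sqrt(log N(s) / N(s,a))."""
    def z(a):
        if N_sa.get((s, a), 0) == 0:
            return float("inf")              # try untried actions first
        return (W[(s, a)] / N_sa[(s, a)]
                + C * math.sqrt(math.log(N_s[s]) / N_sa[(s, a)]))
    return max(actions, key=z)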

  14. Break
      asst 9
      project: reading, prep, final proposal
      wildcard topic

  15. Policy Iteration
      repeat until π doesn't change:
        given π, compute U^π(s) for all states
        given U, calculate the policy by one-step look-ahead

      If π doesn't change, U doesn't either.
      We are at an equilibrium (= optimal π)!
      (a code sketch follows slide 18)

  16. Policy Evaluation
      computing U^π(s):

      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s')

  17. Policy Evaluation
      computing U^π(s):

      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s')

      solve exactly with linear algebra (O(N^3)), or ...

  18. Policy Evaluation
      computing U^π(s):

      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s')

      solve exactly with linear algebra (O(N^3)), or use simplified value iteration:

      do a few times:
        U_{i+1}(s) ← R(s) + γ Σ_{s'} T(s, π(s), s') U_i(s')

      (simplified because we are given π: no max over a)
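A sketch combining slides 15-18: iterative policy evaluation (simplified value iteration, with no max over a) inside the policy iteration loop, assuming the dict-based model from the earlier examples; the function names are mine.

def policy_evaluation(pi, U, T, R, gamma, k=20):
    """Do a few sweeps of U(s) <- R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s')."""
    for _ in range(k):
        U = {s: R[s] if not T[s] else
                R[s] + gamma * sum(p * U[sp] for sp, p in T[s][pi[s]])
             for s in T}
    return U

def policy_iteration(T, R, gamma):
    U = {s: 0.0 for s in T}
    pi = {s: next(iter(T[s])) for s in T if T[s]}      # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, U, T, R, gamma)      # given pi, compute U^pi
        # improvement: one-step look-ahead, greedy with respect to U
        new_pi = {s: max(T[s], key=lambda a: sum(p * U[sp] for sp, p in T[s][a]))
                  for s in T if T[s]}
        if new_pi == pi:                               # pi unchanged: equilibrium
            return pi, U
        pi = new_pi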

  19. Summary of MDP Solving
      value iteration: compute U^{π*}
        prioritized sweeping
        RTDP
      policy iteration: compute U^π using
        linear algebra
        simplified value iteration
        a few updates (modified PI)

  20. MDP Extras: ADP, Bandits, Q-Learning, RL Summary, Approx U, Deep RL, EOLQs
