
Planning with MDPs

(Markov Decision Processes)

Héctor Geffner
ICREA and Universitat Pompeu Fabra
Barcelona, Spain


Status of Classical Planning

  • Classical planning works!!

– Large problems solved very fast (non-optimally)

  • Model simple but useful

– Operators not primitive; can be policies themselves
– Fast closed-loop replanning able to cope with uncertainty sometimes

  • Limitations

– Does not model Uncertainty (no probabilities)
– Does not deal with Incomplete Information (no sensing)
– Deals with very Simple Cost Structure (no state-dependent costs)


Beyond Classical Planning: Two Strategies

  • 1. Develop solver for more general models; e.g., MDPs and POMDPs

+: generality
−: complexity

  • 2. Extend the scope of current ’classical’ solvers

+: efficiency
−: generality

We will pursue the first approach here . . .


Reminder: Basic State Models

  • Characterized by:

– a finite and discrete state space S
– an initial state s0 ∈ S
– a set G ⊆ S of goal states
– actions A(s) ⊆ A applicable in each state s ∈ S
– a deterministic transition function s′ = f(a, s) for s ∈ S and a ∈ A(s)
– action costs c(a, s) > 0

  • A solution is a sequence of applicable actions ai, i = 0, . . . , n, that maps the initial state s0 into a goal state; i.e., si+1 = f(ai, si) and ai ∈ A(si) for i = 0, . . . , n, and sn+1 ∈ G

  • Optimal solutions minimize the total cost ∑i=0..n c(ai, si), and can be computed by shortest-path or heuristic search algorithms . . .
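
To make this concrete, here is a minimal Python sketch (an illustration, not part of the slides) of a deterministic state model and of computing an optimal solution with uniform-cost (Dijkstra) search; the 3x3 grid problem and the names A, f, c, GOAL are assumptions chosen for the example.

    import heapq

    # A tiny deterministic state model: states are (x, y) cells of a 3x3 grid,
    # the goal is cell (2, 2), and every move costs 1.
    GOAL = (2, 2)
    MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

    def A(s):                                # A(s): actions applicable in state s
        x, y = s
        return [a for a, (dx, dy) in MOVES.items()
                if 0 <= x + dx <= 2 and 0 <= y + dy <= 2]

    def f(a, s):                             # deterministic transition function s' = f(a, s)
        dx, dy = MOVES[a]
        return (s[0] + dx, s[1] + dy)

    def c(a, s):                             # positive action costs c(a, s) > 0
        return 1

    def uniform_cost_search(s0):
        """Return (optimal cost, action sequence) from s0 to GOAL via Dijkstra's algorithm."""
        frontier, closed = [(0, s0, [])], set()
        while frontier:
            g, s, plan = heapq.heappop(frontier)
            if s == GOAL:
                return g, plan
            if s in closed:
                continue
            closed.add(s)
            for a in A(s):
                heapq.heappush(frontier, (g + c(a, s), f(a, s), plan + [a]))
        return None

    print(uniform_cost_search((0, 0)))       # -> (4, [...]) : some 4-step path to (2, 2)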


Markov Decision Processes (MDPs)

MDPs are fully observable, probabilistic state models:

  • a state space S
  • a set G ⊆ S of goal states
  • actions A(s) ⊆ A applicable in each state s ∈ S
  • transition probabilities Pa(s′|s) for s ∈ S and a ∈ A(s)
  • action costs c(a, s) > 0

– Solutions are functions (policies) mapping states into actions
– Optimal solutions have minimum expected costs
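
As a purely illustrative companion to the definition above, the sketch below writes a toy goal-directed MDP as Python dictionaries and estimates the expected cost of a policy by sampling; the two-action problem and the names S, G, A, P, cost, rollout are assumptions, not material from the slides.

    import random

    # A toy goal-directed MDP with states 0, 1, 2, where 2 is the goal.
    S = [0, 1, 2]                                   # state space
    G = {2}                                         # goal states
    A = {0: ['a', 'b'], 1: ['a'], 2: []}            # A(s): actions applicable in s
    P = {                                           # transition probabilities Pa(s'|s)
        (0, 'a'): {1: 0.8, 0: 0.2},
        (0, 'b'): {2: 0.1, 0: 0.9},
        (1, 'a'): {2: 1.0},
    }
    cost = {(0, 'a'): 1.0, (0, 'b'): 1.0, (1, 'a'): 1.0}   # action costs c(a, s) > 0

    def rollout(policy, s=0, max_steps=1000):
        """Sample the cost of following a policy (dict: state -> action) from s to a goal."""
        total = 0.0
        for _ in range(max_steps):
            if s in G:
                break
            a = policy[s]
            total += cost[(s, a)]
            nxt = P[(s, a)]
            s = random.choices(list(nxt), weights=list(nxt.values()))[0]
        return total

    policy = {0: 'a', 1: 'a'}                       # a solution: a function from states to actions
    print(sum(rollout(policy) for _ in range(2000)) / 2000)   # ≈ 2.25, its expected cost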


Partially Observable MDPs (POMDPs)

POMDPs are partially observable, probabilistic state models:

  • states s ∈ S
  • actions A(s) ⊆ A
  • costs c(a, s) > 0
  • transition probabilities Pa(s′|s) for s ∈ S and a ∈ A(s)
  • initial belief state b0
  • final belief states bF
  • sensor model given by probabilities Pa(o|s), o ∈ Obs

– Belief states are probability distributions over S
– Solutions are policies that map belief states into actions
– Optimal policies minimize expected cost to go from b0 to bF
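
Since solutions now map belief states rather than states into actions, the agent has to track its belief. The following sketch of the standard Bayesian belief update (predict with Pa(s′|s), weight by the sensor model Pa(o|s), normalize) is only an illustration under an assumed dictionary representation (P, Obs):

    def belief_update(b, a, o, P, Obs):
        """Bayesian update of a belief b (dict: state -> probability) after doing a and observing o.

        P[(s, a)]   : dict s' -> Pa(s'|s)   (transition probabilities)
        Obs[(s, a)] : dict o  -> Pa(o|s)    (sensor model)
        """
        # Prediction: push the belief through the transition probabilities.
        predicted = {}
        for s, p in b.items():
            for s2, q in P[(s, a)].items():
                predicted[s2] = predicted.get(s2, 0.0) + p * q
        # Correction: weight each state by the probability of the observation, then normalize.
        unnorm = {s2: p * Obs[(s2, a)].get(o, 0.0) for s2, p in predicted.items()}
        z = sum(unnorm.values())
        return {s2: p / z for s2, p in unnorm.items()} if z > 0 else predicted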


Illustration: Navigation Problems

Consider a robot that has to reach the target G when

  • 1. initial state is known and actions are deterministic
  • 2. initial state is unknown and actions are deterministic
  • 3. states are fully observable and actions are stochastic
  • 4. states are partially observable and actions are stochastic . . .

[Figure: navigation environment with target location G]

– How do these problems map into the models considered?
– What is the form of the solutions?


Solving State Models by Dynamic Programming

  • Solutions to a wide range of state models can be expressed in terms of the solution of the Bellman equation over non-terminal states s:

V (s) = mina∈A(s) QV (a, s)

where the cost-to-go term QV (a, s) depends on the model:

– c(a, s) + ∑s′∈F(a,s) Pa(s′|s) V (s′)  for MDPs
– c(a, s) + maxs′∈F(a,s) V (s′)  for Max AND/OR Graphs
– c(a, s) + V (s′), s′ ∈ F(a, s)  for OR Graphs . . .

(F(a, s): set of successor states; for terminal states, V (s) = V ∗(s) assumed)

  • The greedy policy πV (s) = argmina∈A(s) QV (a, s) is optimal when V = V ∗ solves the Bellman equation

  • Question: how to get V ∗?
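
For the MDP case, the cost-to-go term and the greedy policy above translate directly into code. The short sketch below assumes the dictionary representation of the earlier toy MDP (the names Q, greedy_action, A, P, cost are illustrative, not prescribed by the slides):

    def Q(V, a, s, P, cost):
        """Cost-to-go QV(a, s) = c(a, s) + sum over s' of Pa(s'|s) V(s') (MDP case)."""
        return cost[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())

    def greedy_action(V, s, A, P, cost):
        """Greedy policy piV(s) = argmin over a in A(s) of QV(a, s)."""
        return min(A[s], key=lambda a: Q(V, a, s, P, cost))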


Value Iteration (VI)

  • Value Iteration finds V ∗ by successive approximations
  • Starting with an arbitrary V , it uses the Bellman equation to update V

V (s) := mina∈A(s) QV (a, s)

  • If all states are updated a sufficient number of times (and certain general conditions hold), the left- and right-hand sides converge to V = V ∗

  • Example: . . .
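
As a stand-in for the example, here is a minimal sketch of the Value Iteration loop in Python on the toy MDP used earlier; the stopping threshold eps and the problem itself are assumptions made for illustration.

    def value_iteration(S, G, A, P, cost, eps=1e-6, max_iters=10000):
        """Approximate V* for a goal-directed MDP by repeated Bellman updates."""
        V = {s: 0.0 for s in S}                              # arbitrary initial value function
        for _ in range(max_iters):
            residual = 0.0
            for s in S:
                if s in G:                                   # terminal states: V(s) fixed at 0
                    continue
                q = min(cost[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
                        for a in A[s])
                residual = max(residual, abs(q - V[s]))
                V[s] = q                                     # V(s) := min over a of QV(a, s)
            if residual < eps:                               # updates no longer change V
                break
        return V

    # Same toy MDP as before: from 0, 'a' reaches the goal 2 via 1; 'b' is a slow gamble.
    S, G = [0, 1, 2], {2}
    A = {0: ['a', 'b'], 1: ['a']}
    P = {(0, 'a'): {1: 0.8, 0: 0.2}, (0, 'b'): {2: 0.1, 0: 0.9}, (1, 'a'): {2: 1.0}}
    cost = {(0, 'a'): 1.0, (0, 'b'): 1.0, (1, 'a'): 1.0}
    print(value_iteration(S, G, A, P, cost))                 # -> roughly {0: 2.25, 1: 1.0, 2: 0.0}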


Value Iteration: Benefits and Limitations

  • VI is a very simple and general algorithm (can solve wide range of models)
  • Problem: VI is exhaustive; the value function V (s) is a vector of the size of the problem space

  • In particular, it does not compete with heuristic search algorithms such as A* or IDA* for solving OR-graphs (deterministic problems) . . .
  • Question: can VI be ’modified’ to deal with larger state spaces that do not fit into memory, without giving up optimality?

  • Yes, use Lower Bounds and Initial State as in Heuristic Search methods . . .


Focusing Value Iteration using LBs and s0: Find and Update

  • Say that a state s is

– greedy if reachable from s0 using the greedy policy πV , and
– inconsistent if V (s) ≠ mina∈A(s) QV (a, s)

  • Then, starting with an admissible and monotone V , follow the loop (sketched in code below):

– Find an inconsistent greedy state s and Update it

  • The Find-and-Update loop delivers a greedy policy that is optimal even if some states are not updated or visited at all!

  • Recent heuristic search algorithms for MDPs, like RTDP, LAO*, and LDFS, all implement this loop in various forms

  • We will focus here on RTDP (Barto, Bradtke, and Singh, 1995)
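
A naive rendering of the Find-and-Update loop in Python is sketched below; the names and the eps tolerance are assumptions, and algorithms such as RTDP, LAO* and LDFS organize the "find" step far more cleverly than this explicit reachability sweep.

    def find_and_update(s0, S, G, A, P, cost, h, eps=1e-6, max_updates=100000):
        """Focused value iteration: repeatedly find an inconsistent state that is reachable
        from s0 under the greedy policy and update it; stop when no such state remains."""
        V = {s: h(s) for s in S}                             # admissible, monotone initial V
        V.update({s: 0.0 for s in G})                        # terminal states: V(s) = 0

        def Q(a, s):
            return cost[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())

        for _ in range(max_updates):
            # Collect the non-goal states reachable from s0 under the current greedy policy.
            greedy, stack = set(), [s0]
            while stack:
                s = stack.pop()
                if s in greedy or s in G:
                    continue
                greedy.add(s)
                best = min(A[s], key=lambda a: Q(a, s))
                stack.extend(P[(s, best)])
            # Find one inconsistent greedy state and update it.
            s = next((s for s in greedy
                      if abs(V[s] - min(Q(a, s) for a in A[s])) > eps), None)
            if s is None:
                return V                                     # greedy policy is (eps-)optimal
            V[s] = min(Q(a, s) for a in A[s])
        return V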


Greedy Policy for Deterministic MDPs

The Greedy policy is a closed-loop version of greedy search

  • 1. Evaluate each action a applicable in s

Q(a, s) = c(a, s) + h(sa) where sa is next state

  • 2. Apply action a that minimizes Q(a, s)
  • 3. Observe resulting state s′
  • 4. Exit if s′ is goal, else go to 1 with s := s′
  • Greedy policy based on h can be written as πh(s) = argmina∈A(s)Q(a, s)
  • πh is optimal when h = h∗; otherwise it is non-optimal and may get trapped in loops


Modifiable Greedy Policy for Deterministic MDPs (LRTA*)

Update the heuristic h as you move, to make it consistent with the Bellman equation (a code sketch of this loop follows below)

  • 1. Evaluate each action a applicable in s

Q(a, s) = c(a, s) + h(sa) where sa is next state

  • 2. Apply action a that minimizes Q(a, s)
  • 3. Update V (s) to Q(a, s)
  • 4. Observe resulting state s′
  • 5. Exit if s′ is goal, else go to 1 with s := s′
  • Greedy policy based on h can be written as πh(s) = argmina∈A(s)Q(a, s)
  • πh is optimal when h = h∗; otherwise it is non-optimal and may get trapped in loops
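
A minimal sketch of one LRTA* trial, under assumed functional names (A, f, c, h for the problem, V for the learned table); dropping the update step recovers the plain greedy policy of the previous slide.

    def lrta_star_trial(s0, goal, A, f, c, h, V, max_steps=1000):
        """One LRTA* trial on a deterministic problem.

        A(s): applicable actions, f(a, s): next state, c(a, s): cost, h(s): heuristic.
        V is a dict holding the learned value function, filled in lazily from h and
        shared across trials.
        """
        s = s0
        for _ in range(max_steps):
            if goal(s):
                return True                                  # goal reached
            # 1. Evaluate Q(a, s) = c(a, s) + V(f(a, s)) for each applicable action
            def q(a):
                s2 = f(a, s)
                return c(a, s) + (0.0 if goal(s2) else V.setdefault(s2, h(s2)))
            a = min(A(s), key=q)                             # 2. pick the greedy action
            V[s] = q(a)                                      # 3. update V(s) to Q(a, s)
            s = f(a, s)                                      # 4-5. move to s' and repeat
        return False

Running repeated trials from s0 keeps improving the stored values along the paths actually followed, which is what prevents the greedy loops mentioned above.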


Real Time Dynamic Programming (RTDP)

Same as LRTA* but deals with true (probabilistic) MDP

  • 1. Evaluate each action a applicable in s as

Q(a, s) = c(a, s) + ∑s′∈S Pa(s′|s) V (s′)

  • 2. Apply action a that minimizes Q(a, s)
  • 3. Update V (s) to Q(a, s)
  • 4. Observe resulting state s′
  • 5. Exit if s′ is goal, else go to 1 with s := s′

V (s) is initialized to h(s); if h ≤ V ∗ (i.e., h is admissible), RTDP is eventually optimal
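
A minimal sketch of one RTDP trial, mirroring the LRTA* sketch but taking expectations over assumed transition dictionaries P and sampling the observed next state; the names are illustrative, not from the slides.

    import random

    def rtdp_trial(s0, G, A, P, cost, h, V, max_steps=1000):
        """One RTDP trial for a goal-directed MDP.

        A[s]: applicable actions, P[(s, a)]: dict s' -> Pa(s'|s), cost[(s, a)]: c(a, s).
        V is a dict filled in lazily from the heuristic h and shared across trials.
        """
        def q(a, s):
            # 1. Q(a, s) = c(a, s) + sum over s' of Pa(s'|s) V(s'), with V = 0 at goal states
            return cost[(s, a)] + sum(p * (0.0 if s2 in G else V.setdefault(s2, h(s2)))
                                      for s2, p in P[(s, a)].items())

        s = s0
        for _ in range(max_steps):
            if s in G:
                return V
            a = min(A[s], key=lambda act: q(act, s))         # 2. greedy action
            V[s] = q(a, s)                                   # 3. update V(s) to Q(a, s)
            nxt = P[(s, a)]                                  # 4. sample the observed next state
            s = random.choices(list(nxt), weights=list(nxt.values()))[0]
        return V

Repeated trials from s0 sharpen V along the states the greedy policy actually visits, which is what makes RTDP focused compared to exhaustive VI.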


Variations on RTDP: Reinforcement Learning

Q-learning is a model-free version of RTDP

  • 1. Apply action a that minimizes Q(a, s) with probability 1 − ε; with probability ε, choose a randomly

  • 2. Observe resulting state s′
  • 3. Update Q(a, s) to

(1 − α) Q(a, s) + α [c(a, s) + mina′∈A(s′) Q(a′, s′)]

  • 4. Exit if s′ is goal, else with s := s′ go to 1

Q-learning learns asymptotically to solve MDPs optimally (Watkins 89)
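
A minimal sketch of one Q-learning episode in the cost-based setting of these slides; the environment call step(s, a) and the names actions, cost are assumptions (the agent never touches the transition probabilities directly), and the min in the update is the cost-minimizing counterpart of the usual max in reward formulations.

    import random

    def q_learning_episode(s0, G, actions, step, cost, Q, alpha=0.1, eps=0.1, max_steps=1000):
        """One Q-learning episode for a goal-directed (cost-based) MDP.

        step(s, a) is an environment/simulator call returning the observed next state s'.
        Q is a dict mapping (s, a) to the current estimate, updated in place.
        """
        s = s0
        for _ in range(max_steps):
            if s in G:
                return Q
            # 1. epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions[s])
            else:
                a = min(actions[s], key=lambda act: Q.get((s, act), 0.0))
            s2 = step(s, a)                                  # 2. observe the resulting state
            # 3. Q(a, s) := (1 - alpha) Q(a, s) + alpha [c(a, s) + min over a' of Q(a', s')]
            target = cost[(s, a)] + (0.0 if s2 in G else
                                     min(Q.get((s2, act), 0.0) for act in actions[s2]))
            Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
            s = s2                                           # 4. continue from s'
        return Q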


Bibliography

  • TEXT: Malik Ghallab et al. Automated Planning: Theory & Practice. Morgan Kaufmann, 2004.
  • Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence 72(1-2): 81-138, 1995.
  • Chris Watkins and Peter Dayan. Q-learning. Machine Learning 8: 279-292, 1992.
  • Blai Bonet and Hector Geffner. Learning Depth-First Search: A Unified Approach to Heuristic Search in Deterministic and Non-Deterministic Settings, and its Application to MDPs. Proc. 16th Int. Conf. on Automated Planning and Scheduling (ICAPS-06), 2006.
