Partially observable Markov decision processes


  1. Partially observable Markov decision processes. Matthijs Spaan, Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal. Reading group meeting, February 12, 2007.

  2. Overview
     Partially observable Markov decision processes:
     • Model.
     • Belief states.
     • MDP-based algorithms.
     • Other sub-optimal algorithms.
     • Optimal algorithms.
     • Application to robotics.

  3. A planning problem
     Task: start at a random position (×) → pick up mail at P → deliver mail at D (△).
     Characteristics: motion noise, perceptual aliasing.

  4. Planning under uncertainty
     • Uncertainty is abundant in real-world planning domains.
     • Bayesian approach ⇒ probabilistic models.
     • Common approach in robotics, e.g., robot localization.

  5. POMDPs
     Partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998):
     • Framework for agent planning under uncertainty.
     • Typically assumes discrete sets of states S, actions A and observations O.
     • Transition model p(s'|s,a): models the effect of actions.
     • Observation model p(o|s,a): relates observations to states.
     • Task is defined by a reward model r(s,a).
     • Goal is to compute a plan, or policy π, that maximizes long-term reward.
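
A concrete encoding helps when reading the algorithms later in the deck. The sketch below writes a tiny, entirely hypothetical two-state POMDP as plain Python dictionaries; the state, action and observation names and all probabilities are made up for illustration and are not the mail-delivery domain from these slides.

```python
# Hypothetical toy POMDP: discrete sets plus transition, observation and
# reward models stored as nested dictionaries.
S = ["s1", "s2"]            # states
A = ["a1", "a2"]            # actions
O = ["o1", "o2"]            # observations

# Transition model p(s'|s,a), stored as T[a][s][s'].
T = {
    "a1": {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.2, "s2": 0.8}},
    "a2": {"s1": {"s1": 0.1, "s2": 0.9}, "s2": {"s1": 0.8, "s2": 0.2}},
}

# Observation model p(o|s,a), stored as Z[a][s][o]; here the observation is
# conditioned on the state the agent ends up in, matching the belief update
# on slide 10.
Z = {
    "a1": {"s1": {"o1": 0.85, "o2": 0.15}, "s2": {"o1": 0.15, "o2": 0.85}},
    "a2": {"s1": {"o1": 0.85, "o2": 0.15}, "s2": {"o1": 0.15, "o2": 0.85}},
}

# Reward model r(s,a), stored as R[a][s].
R = {
    "a1": {"s1": 1.0, "s2": -1.0},
    "a2": {"s1": -1.0, "s2": 1.0},
}

gamma = 0.95  # discount factor
```

Later sketches in this document reuse this dictionary layout.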

  6. POMDP applications
     • Robot navigation (Simmons and Koenig, 1995; Theocharous and Mahadevan, 2002).
     • Visual tracking (Darrell and Pentland, 1996).
     • Dialogue management (Roy et al., 2000).
     • Robot-assisted health care (Pineau et al., 2003b; Boger et al., 2005).
     • Machine maintenance (Smallwood and Sondik, 1973), structural inspection (Ellis et al., 1995).
     • Inventory control (Treharne and Sox, 2002), dynamic pricing strategies (Aviv and Pazgal, 2005), marketing campaigns (Rusmevichientong and Van Roy, 2001).
     • Medical applications (Hauskrecht and Fraser, 2000; Hu et al., 1996).

  7. Transition model
     • For instance, robot motion is inaccurate.
     • Transitions between states are stochastic.
     • p(s'|s,a) is the probability to jump from state s to state s' after taking action a.

  8. Observation model
     • Imperfect sensors.
     • Partially observable environment:
       ◮ Sensors are noisy.
       ◮ Sensors have a limited view.
     • p(o|s,a) is the probability the agent receives observation o in state s after taking action a.

  9. Memory
     A POMDP example that requires memory (Singh et al., 1994).
     [Figure: two states s1 and s2 that produce identical observations; taking a2 in s1 or a1 in s2 yields reward +r and moves to the other state, while taking a1 in s1 or a2 in s2 yields reward −r.]

     Method                                 | Value
     MDP policy                             | V = r/(1−γ)
     Memoryless deterministic POMDP policy  | V_max = r − γr/(1−γ)
     Memoryless stochastic POMDP policy     | V = 0
     Memory-based POMDP policy              | V_min = γr/(1−γ) − r
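
To make the comparison concrete, the short script below (a sketch; the choices r = 1 and γ = 0.95 are arbitrary) evaluates the four closed-form values from the table and shows that even the worst case of a memory-based policy beats the best case of a memoryless deterministic policy.

```python
# Closed-form values from the table above, for example parameters.
r, gamma = 1.0, 0.95

v_mdp = r / (1 - gamma)                   # fully observable: +r every step
v_det_max = r - gamma * r / (1 - gamma)   # best case: one +r, then -r forever
v_stoch = 0.0                             # 50/50 action choice: expected reward 0 per step
v_mem_min = gamma * r / (1 - gamma) - r   # worst case: one -r, then +r forever

print(f"MDP policy:                      {v_mdp:7.2f}")       #  20.00
print(f"Memoryless deterministic (best): {v_det_max:7.2f}")   # -18.00
print(f"Memoryless stochastic:           {v_stoch:7.2f}")     #   0.00
print(f"Memory-based (worst case):       {v_mem_min:7.2f}")   #  18.00
```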

  10. Beliefs
     • The agent maintains a belief b(s) of being at state s.
     • After action a ∈ A and observation o ∈ O the belief b(s) can be updated using Bayes' rule:
       b'(s') ∝ p(o|s') Σ_s p(s'|s,a) b(s)
     • The belief vector is a Markov signal for the planning task.
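
A minimal sketch of this update in Python, assuming the hypothetical dictionary layout from the sketch after slide 5 (T[a][s][s'] for the transition model and Z[a][s'][o] for the observation model); the names are illustrative, not from the slides.

```python
def belief_update(b, a, o, S, T, Z):
    """Bayes' rule: b'(s') ∝ p(o|s',a) * sum_s p(s'|s,a) * b(s).

    b is a dict mapping each state to its probability.
    """
    b_next = {}
    for s_next in S:
        predicted = sum(T[a][s][s_next] * b[s] for s in S)   # prediction step
        b_next[s_next] = Z[a][s_next][o] * predicted         # correction step
    norm = sum(b_next.values())                              # equals p(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / norm for s, p in b_next.items()}

# Example with the toy model from the earlier sketch:
# b1 = belief_update({"s1": 0.5, "s2": 0.5}, "a1", "o1", S, T, Z)
```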

  11. Belief update example (slides 11-14 repeat this slide with the figure updated after each step)
     [Figure: the robot's true situation in a corridor vs. the robot's belief over positions.]
     • Observations: door or corridor, 10% noise.
     • Action: moves 3 (20%), 4 (60%), or 5 (20%) states.

  15. Solving POMDPs
     • A solution to a POMDP is a policy, i.e., a mapping a = π(b) from beliefs to actions.
     • An optimal policy is characterized by a value function that maximizes:
       V^π(b_0) = E[ Σ_{t=0}^∞ γ^t r(b_t, π(b_t)) ]
     • Computing the optimal value function is a hard problem (PSPACE-complete for finite horizon).
     • In robotics: a policy is often computed using simple MDP-based approximations.
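
The expectation above can also be estimated by simulation: sample a hidden state, run the policy on the tracked belief, and average the discounted belief rewards. The sketch below does exactly that; it reuses belief_update and the hypothetical model layout from the earlier sketches, uses the standard belief reward r(b,a) = Σ_s b(s) r(s,a), and truncates the infinite sum at a finite horizon (a negligible error for γ well below 1).

```python
import random

def simulate_value(b0, policy, S, T, Z, R, gamma=0.95,
                   episodes=10_000, horizon=200):
    """Monte Carlo estimate of V^pi(b0) = E[ sum_t gamma^t r(b_t, pi(b_t)) ]."""
    def sample(dist):
        # dist maps outcomes to probabilities
        return random.choices(list(dist), weights=list(dist.values()))[0]

    total = 0.0
    for _ in range(episodes):
        s = sample(b0)                       # hidden true state
        b = dict(b0)                         # the agent's belief
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(b)
            ret += discount * sum(b[x] * R[a][x] for x in S)   # r(b, a)
            s = sample(T[a][s])              # true state transition (hidden)
            o = sample(Z[a][s])              # observation from the new state
            b = belief_update(b, a, o, S, T, Z)
            discount *= gamma
        total += ret
    return total / episodes
```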

  16. MDP-based algorithms
     • Use the solution to the underlying MDP as a heuristic.
     • Most likely state (Cassandra et al., 1996): π_MLS(b) = π*(argmax_s b(s)).
     • Q_MDP (Littman et al., 1995): π_QMDP(b) = argmax_a Σ_s b(s) Q*(s,a).
     [Figure: example domain with rewards +1 and −1, from Parr and Russell (1995).]
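
Both heuristics take only a few lines once the underlying MDP has been solved. The sketch below assumes the MDP solution is available as a hypothetical Q-table Q[s][a] (computed, for instance, by standard value iteration on the fully observable MDP) and beliefs stored as dicts, as in the earlier sketches.

```python
def pi_mls(b, Q):
    """Most likely state: act optimally for the state with the highest belief."""
    s_ml = max(b, key=b.get)                 # argmax_s b(s)
    return max(Q[s_ml], key=Q[s_ml].get)     # pi*(s_ml)

def pi_qmdp(b, Q, A):
    """Q_MDP: maximize the belief-weighted MDP Q-values."""
    return max(A, key=lambda a: sum(b[s] * Q[s][a] for s in b))
```

Both heuristics effectively assume the state uncertainty disappears after the next step, so they never select actions purely to gather information; that is the price paid for their simplicity.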

  17. Other sub-optimal techniques
     • Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002).
     • Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004).
     • Gradient ascent (Ng and Jordan, 2000; Aberdeen and Baxter, 2002).
     • Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a; Smith and Simmons, 2004).
     • Compressing the POMDP (Roy et al., 2005; Poupart and Boutilier, 2003).
     • Point-based techniques (Pineau et al., 2003a; Spaan and Vlassis, 2005).

  18. Optimal value functions
     The optimal value function of a (finite horizon) POMDP is piecewise linear and convex: V(b) = max_α b · α.
     [Figure: value function over the belief space of a two-state POMDP, between beliefs (1,0) and (0,1), shown as the upper surface of vectors α_1, α_2, α_3, α_4.]
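
Each α-vector assigns a value to every state, so evaluating V at a belief is a maximum over dot products. A minimal sketch (the vector set and the belief are made-up numbers for a two-state POMDP):

```python
import numpy as np

# Hypothetical set of alpha-vectors; each row gives a value per state.
alphas = np.array([
    [ 2.0, -1.0],   # alpha_1
    [ 1.0,  1.0],   # alpha_2
    [-1.0,  2.0],   # alpha_3
])

def value(b, alphas):
    """Piecewise-linear convex value function: V(b) = max_alpha b . alpha."""
    return float(np.max(alphas @ b))

def best_alpha(b, alphas):
    """Index of the maximizing vector (each vector is associated with an
    action, so this also yields the greedy policy)."""
    return int(np.argmax(alphas @ b))

b = np.array([0.7, 0.3])                         # example belief over the two states
print(value(b, alphas), best_alpha(b, alphas))   # 1.1 0
```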

  19. Exact value iteration
     Value iteration computes a sequence of value function estimates: V_1, V_2, ..., V_n.
     [Figure: successive estimates V_1, V_2, V_3 over the belief space between (1,0) and (0,1).]

  20. Optimal POMDP methods
     Enumerate and prune:
     • Most straightforward: Monahan's (1982) enumeration algorithm. Generates a maximum of |A| |V_n|^|O| vectors at each iteration, hence requires pruning.
     • Incremental pruning (Zhang and Liu, 1996; Cassandra et al., 1997).
     Search for witness points:
     • One Pass (Sondik, 1971; Smallwood and Sondik, 1973).
     • Relaxed Region, Linear Support (Cheng, 1988).
     • Witness (Cassandra et al., 1994).
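
For concreteness, a sketch of one enumeration-style backup in the spirit of Monahan's algorithm, using the hypothetical dictionary model layout from the earlier sketches; it builds every candidate vector and leaves out the pruning of dominated vectors, which is exactly why it produces up to |A| |V_n|^|O| vectors per iteration.

```python
from itertools import product
import numpy as np

def enumeration_backup(V, S, A, O, T, Z, R, gamma):
    """One exact value-iteration step by enumeration (pruning omitted).

    V is a list of alpha-vectors (numpy arrays aligned with S); the model
    layout (T[a][s][s'], Z[a][s'][o], R[a][s]) follows the earlier sketches.
    """
    new_V = []
    for a in A:
        # g[o][i](s) = sum_{s'} p(o|s',a) * p(s'|s,a) * alpha_i(s')
        g = {}
        for o in O:
            g[o] = []
            for alpha in V:
                vec = np.array([
                    sum(Z[a][s2][o] * T[a][s][s2] * alpha[j]
                        for j, s2 in enumerate(S))
                    for s in S])
                g[o].append(vec)
        r_a = np.array([R[a][s] for s in S])
        # one candidate vector for every way of picking an alpha-vector per observation
        for choice in product(range(len(V)), repeat=len(O)):
            new_V.append(r_a + gamma * sum(g[o][i] for o, i in zip(O, choice)))
    return new_V   # up to |A| * |V|^|O| vectors before pruning
```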
