

SLIDE 1

ARTIFICIAL INTELLIGENCE
Planning under uncertainty: POMDPs

INFOB2KI 2019-2020
Lecturer: Silja Renooij
Utrecht University, The Netherlands

These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

SLIDE 2

Markov model types

                       | Prediction          | Planning
Fully observable       | Markov chain        | MDP (Markov decision process)
Partially observable   | Hidden Markov model | POMDP (partially observable Markov decision process)

Prediction models can be represented at variable level by a (Dynamic) Bayesian network:

[Figure: two DBNs: a Markov chain S1 → S2 → S3 → … and a hidden Markov model with hidden states S1, S2, S3, … emitting observations O1, O2, O3, …]

SLIDE 3

Recap: Markov Decision Process

An MDP is defined by:

  • A set of states s ∈ S; typically finite
  • A set of actions a ∈ A; typically finite
  • A transition function T: S × A × S → [0,1]
    T(s, a, s') = P(St+1 = s' | St = s and At = a)
  • A reward function R: S × A → ℝ (Note: simpler than before)

Stationary process: T and R are time-independent.

Note the representation at variable level!
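As a minimal illustration (not from the slides; all states, actions and numbers are invented), these ingredients can be written down directly in Python:

```python
# A tiny, hypothetical MDP; names and numbers are invented for illustration.
S = ["sunny", "rainy"]   # set of states (finite)
A = ["walk", "drive"]    # set of actions (finite)

# Transition function T(s, a, s') = P(St+1 = s' | St = s and At = a)
T = {
    ("sunny", "walk"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "drive"): {"sunny": 0.8, "rainy": 0.2},
    ("rainy", "walk"):  {"sunny": 0.4, "rainy": 0.6},
    ("rainy", "drive"): {"sunny": 0.4, "rainy": 0.6},
}

# Reward function R: S x A -> reals
R = {
    ("sunny", "walk"): 2.0,  ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}

# Sanity check: every T(s, a, .) must be a probability distribution
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```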

SLIDE 4

Solving a Markov System

Goal: find an optimal policy π*

(recall: best action to choose in each state s)

  • compute, for each state s, the expected sum of future rewards when acting optimally (this can be done by e.g. Value Iteration (DP))
  • for π* take the argmax, as before

    V*(s) = max_a Σs' T(s, a, s') · ( R(s, a) + γ · V*(s') )
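A minimal sketch of value iteration over the dictionary representation sketched above (the discount factor γ = 0.9 and the fixed number of sweeps are invented for the example):

```python
def value_iteration(S, A, T, R, gamma=0.9, n_sweeps=100):
    """Iterate V(s) <- max_a sum_s' T(s,a,s') * (R(s,a) + gamma * V(s'))."""
    V = {s: 0.0 for s in S}
    for _ in range(n_sweeps):
        V = {s: max(sum(p * (R[(s, a)] + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
                    for a in A)
             for s in S}
    return V

def greedy_policy(S, A, T, R, V, gamma=0.9):
    """pi*(s): take the argmax over actions, as on the slide."""
    return {s: max(A, key=lambda a: sum(p * (R[(s, a)] + gamma * V[s2])
                                        for s2, p in T[(s, a)].items()))
            for s in S}
```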

SLIDE 5

Planning under partial observability

[Figure: agent-environment loop; the agent sends actions to the environment and receives imperfect observations in return, guided by a goal.]

POMDP allows partial satisfaction of goals and trade-offs among competing goals.

SLIDE 6

POMDP

  • Goal is to maximize expected long-term reward from the initial state distribution
  • But: state is not directly observed

[Figure: the agent chooses action a, acting on a world whose state it cannot observe directly.]
SLIDE 7

Definition of POMDP

A Partially Observable Markov Decision Process (POMDP) is defined by:

  • All MDP variables/sets and functions
  • A set of observations o ∈ O; typically finite
  • An observation function Z: S × A × O → [0,1]
    Z(s, a, o) = P(Ot = o | St = s and At-1 = a)

[Figure: DBN representation with a hidden states layer.]
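Continuing the earlier Python sketch (observations and probabilities invented for illustration), the POMDP adds O and Z on top of S, A, T and R:

```python
# Hypothetical observation set and observation function for the tiny example.
O = ["umbrellas", "no_umbrellas"]

# Z(s, a, o) = P(Ot = o | St = s and At-1 = a); independent of a here
Z = {(s, a): dist
     for a in ["walk", "drive"]
     for s, dist in [("sunny", {"umbrellas": 0.1, "no_umbrellas": 0.9}),
                     ("rainy", {"umbrellas": 0.8, "no_umbrellas": 0.2})]}
```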

SLIDE 8

Memory vs Markov

The POMDP is non-Markovian from the viewpoint of the agent:

  • any of the past actions and observations may influence the agent's belief concerning the current state
  • if action choices are based on only the most recent observation, the policy becomes memoryless

SLIDE 9

Two sources of POMDP complexity

  • Curse of dimensionality
    – size of state space
    – shared by other planning problems
  • Curse of memory
    – size of value function (number of vectors)
    – or equivalently, size of controller (memory)
    – unique to POMDPs

Complexity of each iteration of DP: |S|² · |A| · |Γn-1|^|O|
(the |S|² factor reflects the dimensionality, the |Γn-1|^|O| factor the memory)


Γ is a set of vectors, each giving a value for every possible state (the true state being unknown).
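As an invented worked example (numbers not from the slides): with |S| = 10, |A| = 4, |O| = 3 and |Γn-1| = 100 vectors, a single DP iteration already costs on the order of 10² · 4 · 100³ = 4 · 10⁸ operations, and the number of vectors itself can grow as fast as |A| · |Γn-1|^|O| before pruning.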

SLIDE 10

Note

The following slides contain several formulas; you don't have to understand these in detail if you grasp the general idea:

  • Rather than states, we use a probability distribution over states
  • The transition and observation functions of the POMDP contain all information necessary to update this distribution

SLIDE 11

Solution: Belief state

In a regular MDP we track and update the current state. Since in a POMDP the actual state is uncertain, we instead maintain a belief state b: S → [0,1] (note: this is a probability distribution!).

The belief state contains all relevant information from the history: consider belief b; after subsequent action a and observation o, this belief can be updated using the POMDP ingredients:

    b'(s') = P(s' | b, a, o) = P(o | s', a) · P(s' | b, a) / P(o | b, a)    (Bayes' rule)

This is also expressible in Z, T and b (see next slide).
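A minimal sketch of this update, written against the T and Z dictionaries from the earlier sketches (all names invented):

```python
def update_belief(b, a, o, S, T, Z):
    """b'(s') = Z(s', a, o) * sum_s T(s, a, s') * b(s) / P(o | b, a)."""
    b_new = {}
    for s2 in S:
        predicted = sum(T[(s, a)][s2] * b[s] for s in S)  # P(s' | b, a)
        b_new[s2] = Z[(s2, a)][o] * predicted             # weight by observation
    norm = sum(b_new.values())                            # norm = P(o | b, a)
    return {s2: p / norm for s2, p in b_new.items()}

# Example: start maximally uncertain, act, then observe umbrellas
b0 = {"sunny": 0.5, "rainy": 0.5}
b1 = update_belief(b0, "walk", "umbrellas", S, T, Z)
```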

SLIDE 12

A belief-state MDP

A belief-state MDP is a continuous-space MDP, completely specified with ingredients from the POMDP. It contains:

  • A set of states b ∈ B, where B is the space of probability distributions over S
  • A set of actions a ∈ A; same as the original POMDP
  • A transition function τ: B × A × B → [0,1]

    τ(b, a, b') = P(b' | b, a) = Σo∈O P(b' | b, a, o) · P(o | b, a)
                = Σo∈O P(b' | b, a, o) · Σs'∈S Z(s', a, o) · Σs∈S T(s, a, s') · b(s)

    (Note: P(b' | b, a, o) is zero or one, since updating b with a and o yields exactly one new belief.)

  • A reward function ρ: B × A → ℝ

    ρ(b, a) = Σs∈S b(s) · R(s, a)

where S, T, R and Z are defined by the corresponding POMDP.
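A short sketch of these two ingredients over the earlier dictionaries (names invented); P(b' | b, a, o) itself stays implicit, since update_belief is deterministic:

```python
def belief_reward(b, a, S, R):
    """rho(b, a) = sum_s b(s) * R(s, a)."""
    return sum(b[s] * R[(s, a)] for s in S)

def obs_probability(b, a, o, S, T, Z):
    """P(o | b, a) = sum_s' Z(s', a, o) * sum_s T(s, a, s') * b(s)."""
    return sum(Z[(s2, a)][o] * sum(T[(s, a)][s2] * b[s] for s in S)
               for s2 in S)
```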

SLIDE 13

Belief state and Markov property

The process of maintaining the belief state is Markovian!

  • For any belief state, the successor belief state depends only on the action and observation

[Figure: belief-state transitions in a two-state example under actions a1 and a2; the horizontal axis ranges from P(s0) = 0 to P(s0) = 1.]

SLIDE 14

Solving the POMDP

P(b|b,a,o) Current Belief State (Register) Policy

  • Obs. o

b b a Action

 Update belief state after action and observation  Policy maps belief state to action  Policy is found by solving the belief‐state MDP
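As a hedged sketch of this loop in Python (env is a stand-in for the environment; update_belief is the function sketched after slide 11):

```python
def run_agent(b, policy, env, n_steps, S, T, Z):
    """Act, observe, update the belief register, repeat."""
    for _ in range(n_steps):
        a = policy(b)                         # policy maps belief to action
        o = env.step(a)                       # imperfect observation comes back
        b = update_belief(b, a, o, S, T, Z)   # revise the current belief state
    return b
```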

SLIDE 15

Solving a belief-state MDP

For solving the continuous-space MDP → use the value iteration algorithm... after some adaptations to cope with the continuous space:

  • We cannot find a state's new value by looping over all the possible (= infinitely many) next states
  • Representing the value function in tabular form is not possible; POMDP restrictions cause the finite-horizon value function to be piecewise linear and convex
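A piecewise linear and convex value function can be stored as a finite set Γ of vectors, one value per state; a minimal sketch of evaluating it at a belief (vectors invented for the example):

```python
def value_at(b, Gamma, S):
    """V(b) = max over alpha in Gamma of sum_s alpha(s) * b(s)."""
    return max(sum(alpha[s] * b[s] for s in S) for alpha in Gamma)

# Two invented vectors over the two example states
Gamma = [{"sunny": 2.0, "rainy": -1.0},
         {"sunny": 0.5, "rainy": 0.5}]
print(value_at({"sunny": 0.3, "rainy": 0.7}, Gamma, ["sunny", "rainy"]))  # 0.5
```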

SLIDE 16

Conclusions

  • MDPs have some efficient solution methods, but require fully observable states
  • POMDPs are used in more realistic settings, but require sophisticated solution methods

SLIDE 17

Let’s learn!

  • We now know how to plan, if we have a fully specified (PO)MDP
  • But what if we don't…?
