Module 15 POMDP Bounds CS 886 Sequential Decision Making and - - PowerPoint PPT Presentation

SLIDE 1

Module 15 POMDP Bounds

CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

SLIDE 2

CS886 (c) 2013 Pascal Poupart

Bounds

  • POMDP algorithms typically find approximations to the optimal value function or optimal policy
    – Need some performance guarantees
  • Lower bounds on V*
    – Vπ for any policy π
    – Point-based value iteration
  • Upper bounds on V*
    – QMDP
    – Fast informed bound
    – Finite belief-state MDP

SLIDE 3

Lower Bounds

  • Lower bounds are easy to obtain
  • For any policy π, Vπ is a lower bound, since Vπ(b) ≤ V*(b) ∀π, b
  • The main issue is to evaluate a policy π
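The policy-evaluation point can be made concrete. Below is a minimal sketch (not the course's code; the 2-state model and all numbers are made up): evaluating a "blind" policy that always repeats one action yields an α-vector, and the best of these α-vectors at a belief b is a valid lower bound on V*(b).

```python
# Hypothetical 2-state, 2-action POMDP with made-up numbers.
GAMMA = 0.9
S = [0, 1]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]
T = [[[0.9, 0.1], [0.2, 0.8]],   # T[s][a][s']
     [[0.3, 0.7], [0.6, 0.4]]]

def blind_alpha(a, n_iters=500):
    """Evaluate the 'blind' policy that always takes action a:
    alpha(s) = R(s, a) + gamma * sum_s' T(s, a, s') * alpha(s')."""
    alpha = [0.0, 0.0]
    for _ in range(n_iters):
        alpha = [R[s][a] + GAMMA * sum(T[s][a][s2] * alpha[s2] for s2 in S)
                 for s in S]
    return alpha

def v_lower(b, alphas):
    """max over evaluated policies of sum_s b(s) * alpha(s): a lower
    bound on V*(b), since each alpha is the value of some policy."""
    return max(sum(b[s] * al[s] for s in S) for al in alphas)

alphas = [blind_alpha(a) for a in [0, 1]]
```

Since v_lower is a max over linear functions of b, it is convex in the belief, just like V* itself.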
SLIDE 4

Point-based Value Iteration

  • Theorem: If π‘Š

0 is a lower bound, then the value

functions π‘Š

π‘œ produced by point-based value

iteration at each iteration π‘œ are lower bounds.

  • Proof by induction

– Base case: pick π‘Š

0 to be a lower bound

– Inductive assumption: π‘Š

π‘œ 𝑐 ≀ π‘Šβˆ— 𝑐 βˆ€π‘

– Induction:

  • Let Ξ€π‘œ+1 be the set of 𝛽-vectors for some set 𝐢 of beliefs
  • Let Ξ€π‘œ+1

βˆ—

be the set of 𝛽-vectors for all beliefs

  • Hence π‘Š

π‘œ+1 𝑐 = max π›½βˆˆΞ€π‘œ+1 𝛽(𝑐) ≀ max π›½βˆˆΞ€π‘œ+1

βˆ—

𝛽 𝑐 ≀ π‘Šβˆ—(𝑐)
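A single point-based backup can be sketched as follows (an illustrative toy, not the course's implementation; the model, including the observation function Z, is made up). Rewards here are nonnegative, so the zero α-vector is a valid lower bound V_0, and each backup produces α-vectors whose pointwise max remains a lower bound.

```python
# Toy point-based backup on a made-up 2-state / 2-action / 2-observation POMDP.
GAMMA = 0.9
S, A, O = [0, 1], [0, 1], [0, 1]
R = [[1.0, 0.0], [0.0, 2.0]]            # R[s][a]
T = [[[0.9, 0.1], [0.2, 0.8]],          # T[s][a][s']
     [[0.3, 0.7], [0.6, 0.4]]]
Z = [[[0.8, 0.2], [0.5, 0.5]],          # Z[s'][a][o] = Pr(o | s', a)
     [[0.3, 0.7], [0.1, 0.9]]]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def backup(b, Gamma):
    """One point-based Bellman backup at belief b against alpha-set Gamma."""
    best = None
    for a in A:
        alpha_a = [R[s][a] for s in S]
        for o in O:
            # project every alpha through (a, o) and keep the best one at b
            projs = [[sum(T[s][a][s2] * Z[s2][a][o] * al[s2] for s2 in S)
                      for s in S] for al in Gamma]
            g = max(projs, key=lambda v: dot(v, b))
            alpha_a = [alpha_a[s] + GAMMA * g[s] for s in S]
        if best is None or dot(alpha_a, b) > dot(best, b):
            best = alpha_a
    return best

def value(b, Gamma):
    return max(dot(al, b) for al in Gamma)

B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # chosen belief points
Gamma = [[0.0, 0.0]]                       # V_0 = 0, a lower bound here
Gamma = [backup(b, Gamma) for b in B]      # Gamma_1: still a lower bound
```

Backing up only at the beliefs in B is what makes V_{n+1} a max over a subset of the α-vectors that exact value iteration would produce, which is exactly the inequality used in the proof.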

SLIDE 5

Upper Bounds

  • Idea: make decisions based on more information than is normally available; the resulting value is at least as high as the optimal value
  • POMDP: states are hidden
  • MDP: states are observable
  • Hence V_MDP ≥ V_POMDP

SLIDE 6

QMDP Algorithm

  • Derive upper bound from the MDP Q-function by allowing the state to be observable
  • Policy: s_t → a_t

    QMDP(POMDP)
      Solve the MDP to find Q_MDP:
        Q_MDP(s,a) = R(s,a) + γ Σ_{s'} Pr(s'|s,a) max_{a'} Q_MDP(s',a')
      Let V(b) = max_a Σ_s b(s) Q_MDP(s,a)
      Return V

SLIDE 7

Fast Informed Bound

  • QMDP upper bound is too loose
    – Actions depend on the current state (too informative)
  • Tighter upper bound: fast informed bound (FIB)
    – Actions depend on the previous state (less informative)

    V_MDP ≥ V_FIB ≥ V*

SLIDE 8

FIB Algorithm

  • Derive upper bound by allowing the previous state to be observable
  • Policy: s_{t-1}, a_{t-1}, o_t → a_t

    FIB(POMDP)
      Find Q_FIB by value iteration:
        Q_FIB(s,a) = R(s,a) + γ Σ_{o'} max_{a'} Σ_{s'} Pr(s'|s,a) Pr(o'|s',a) Q_FIB(s',a')
      Let V(b) = max_a Σ_s b(s) Q_FIB(s,a)
      Return V

SLIDE 9

FIB Analysis

  • Theorem: π‘Š

𝑁𝐸𝑄 β‰₯ π‘Š 𝐺𝐽𝐢 β‰₯ π‘Šβˆ—

  • Proof:

1) 𝑅𝑁𝐸𝑄 𝑑, 𝑏 = 𝑆 𝑑, 𝑏 + 𝛿 Pr 𝑑′ 𝑑, 𝑏 max

𝑏′ 𝑅 𝑑′, 𝑏′ 𝑑′

= 𝑆 𝑑, 𝑏 + 𝛿 Pr 𝑑′ 𝑑, 𝑏 Pr 𝑝′ 𝑑′, 𝑏 max

𝑏′ 𝑅(𝑑′, 𝑏′) 𝑑′𝑝′

β‰₯ 𝑆 𝑑, 𝑏 + 𝛿 max

𝑏′

Pr 𝑑′ 𝑑, 𝑏 Pr 𝑝′ 𝑑′, 𝑏

𝑑′

𝑅(𝑑′, 𝑏′)

𝑝′

= 𝑅𝐺𝐽𝐢(𝑑, 𝑏)

2) π‘Š

𝐺𝐽𝐢 β‰₯ π‘Šβˆ— since π‘Š 𝐺𝐽𝐢 is based on observing the

previous state (too informative)

SLIDE 10

Finite Belief-State MDP

  • Belief-state MDP: all beliefs are treated as states

    V*(b) = max_a Q*(b,a)

  • QMDP and FIB: the value of each interior belief is interpolated, i.e., V(b) = max_a Σ_s b(s) Q_FIB(s,a)
  • Idea: retain a subset of beliefs
    – Interpolate the value of the remaining beliefs

SLIDE 11

Finite Belief-State MDP

  • Belief-state MDP

    Q(b,a) = R(b,a) + γ Σ_{o'} Pr(o'|b,a) max_{a'} Q(b^{a,o'}, a')

  • Let B be a subset of representative beliefs
  • Approximate Q(b^{a,o'}, a') with the lowest interpolation
    – Linear program:
        Q(b^{a,o'}, a') = min_c Σ_{b_i∈B} c_i Q(b_i, a')
        such that Σ_i c_i b_i = b^{a,o'}, Σ_i c_i = 1, and c_i ≥ 0 ∀i

SLIDE 12

Finite Belief-State MDP Algorithm

  • Derive upper bound by interpolating values based on a finite subset of beliefs

    FiniteBeliefStateMDP(POMDP)
      Find Q_B by value iteration:
        Q_B(b,a) = R(b,a) + γ Σ_{o'} Pr(o'|b,a) max_{a'} Q_B(b^{a,o'}, a')   ∀b ∈ B, a
      where Q_B(b^{a,o'}, a') = min_c Σ_{b_i∈B} c_i Q_B(b_i, a')
        such that Σ_i c_i b_i = b^{a,o'}, Σ_i c_i = 1, and c_i ≥ 0 ∀i
      Let V(b) = max_a Q_B(b,a), interpolating as above when b ∉ B
      Return V