Module 15: POMDP Bounds
CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
Bounds
- POMDP algorithms typically find approximations to the optimal value function or the optimal policy
- Need some performance guarantees
- Lower bounds on $V^*$:
  - $V^\pi$ for any policy $\pi$
  - Point-based value iteration
- Upper bounds on $V^*$:
  - QMDP
  - Fast informed bound
  - Finite belief-state MDP
Lower Bounds
- Lower bounds are easy to obtain
- For any policy $\pi$, $V^\pi$ is a lower bound, since $V^\pi(b) \le V^*(b)$ $\forall \pi, b$
- The main issue is how to evaluate a policy $\pi$ (a sketch follows below)
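One standard route is to represent the policy as a finite-state controller and solve a linear system for its $\alpha$-vectors. Below is a minimal Python sketch of this idea; the controller encoding (`nodes_a`, `eta`) and the array layout are illustrative assumptions, not from the slides.

    import numpy as np

    def evaluate_fsc(nodes_a, eta, R, T, O, gamma):
        """Evaluate a finite-state controller policy: node n takes action
        nodes_a[n], then moves to node eta[n, o'] after observing o'.
        Solves the linear system
          alpha[n,s] = R(s,a_n)
                       + gamma * sum_{s',o'} Pr(s'|s,a_n) Pr(o'|s',a_n) alpha[eta[n,o'], s']
        R: |S|x|A|; T[a,s,s'] = Pr(s'|s,a); O[a,s',o'] = Pr(o'|s',a)."""
        N, (S, _) = len(nodes_a), R.shape
        nO = O.shape[2]
        A_lin = np.eye(N * S)
        r = np.zeros(N * S)
        for n, a in enumerate(nodes_a):
            for s in range(S):
                r[n * S + s] = R[s, a]
                for sp in range(S):
                    for o in range(nO):
                        A_lin[n * S + s, eta[n, o] * S + sp] -= gamma * T[a, s, sp] * O[a, sp, o]
        return np.linalg.solve(A_lin, r).reshape(N, S)  # one alpha-vector per node

Starting the controller in node $n_0$ then gives the lower bound $V^\pi(b) = \sum_s b(s)\,\alpha_{n_0}(s) \le V^*(b)$.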
Point-based Value Iteration
- Theorem: If $V_0$ is a lower bound, then the value functions $V_n$ produced by point-based value iteration at each iteration $n$ are lower bounds.
- Proof by induction:
  - Base case: pick $V_0$ to be a lower bound
  - Inductive assumption: $V_n(b) \le V^*(b)$ $\forall b$
  - Induction:
    - Let $\Gamma_{n+1}$ be the set of $\alpha$-vectors for some set $B$ of beliefs
    - Let $\Gamma_{n+1}^*$ be the set of $\alpha$-vectors for all beliefs
    - Hence $V_{n+1}(b) = \max_{\alpha \in \Gamma_{n+1}} \alpha(b) \le \max_{\alpha \in \Gamma_{n+1}^*} \alpha(b) \le V^*(b)$, where the first inequality holds because $\Gamma_{n+1} \subseteq \Gamma_{n+1}^*$, and the second because the exact backup of $V_n$ is at most the exact backup of $V^*$ (monotonicity of the Bellman backup plus the inductive assumption), which equals $V^*$.
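A tiny numerical illustration of the subset argument (the $\alpha$-vectors and belief below are made-up values, not from the slides): evaluating with a subset of the $\alpha$-vectors can only decrease the pointwise maximum.

    import numpy as np

    def value_at(alpha_vectors, b):
        """V(b) = max_{alpha in Gamma} alpha . b"""
        return max(float(np.dot(a, b)) for a in alpha_vectors)

    # Hypothetical 2-state example: a full backup produced three alpha-vectors,
    # but point-based value iteration retained only the first two.
    gamma_full = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
    gamma_pbvi = gamma_full[:2]

    b = np.array([0.5, 0.5])
    assert value_at(gamma_pbvi, b) <= value_at(gamma_full, b)  # subset => lower value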
Upper Bounds
- Idea: make decisions based on more information than is normally available, to obtain a value at least as high as the optimal one
- POMDP: states are hidden
- MDP: states are observable
- Hence $V_{MDP} \ge V^*_{POMDP}$
QMDP Algorithm
- Derive an upper bound from the MDP Q-function by allowing the state to be observable
- Policy: $s_t \to a_t$

QMDP(POMDP):
    Solve the MDP to find $Q_{MDP}$:
        $Q_{MDP}(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a) \max_{a'} Q_{MDP}(s',a')$
    Let $V(b) = \max_a \sum_s b(s)\, Q_{MDP}(s,a)$
    Return $V$
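A minimal Python sketch of QMDP (the function name, array layout, and test instance are assumptions for illustration): value iteration on the underlying MDP, followed by the belief-space bound $V(b) = \max_a \sum_s b(s)\,Q_{MDP}(s,a)$.

    import numpy as np

    def qmdp(R, T, gamma, n_iters=1000):
        """R: |S|x|A| rewards; T: |A|x|S|x|S| transitions, T[a,s,s'] = Pr(s'|s,a).
        Returns Q_MDP of shape |S|x|A| and the induced upper bound on V*."""
        S, A = R.shape
        Q = np.zeros((S, A))
        for _ in range(n_iters):
            V = Q.max(axis=1)                              # V(s') = max_a' Q(s',a')
            Q = R + gamma * np.einsum('ast,t->sa', T, V)   # Bellman optimality backup
        def V_upper(b):
            return float((b @ Q).max())                    # max_a sum_s b(s) Q(s,a)
        return Q, V_upper

    # Hypothetical 2-state, 2-action instance
    R = np.array([[1.0, 0.0], [0.0, 2.0]])
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    Q_mdp, V_upper = qmdp(R, T, gamma=0.95)
    print(V_upper(np.array([0.5, 0.5])))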
Fast Informed Bound
- The QMDP upper bound is too loose:
  - actions depend on the current state (too informative)
- Tighter upper bound: the fast informed bound (FIB):
  - actions depend on the previous state (less informative)
- $Q_{MDP} \ge Q_{FIB} \ge V^*$
FIB Algorithm
- Derive an upper bound by allowing the previous state to be observable
- Policy: $s_{t-1}, a_{t-1}, o_t \to a_t$

FIB(POMDP):
    Find $Q_{FIB}$ by value iteration:
        $Q_{FIB}(s,a) = R(s,a) + \gamma \sum_{o'} \max_{a'} \sum_{s'} \Pr(s'|s,a) \Pr(o'|s',a)\, Q_{FIB}(s',a')$
    Let $V(b) = \max_a \sum_s b(s)\, Q_{FIB}(s,a)$
    Return $V$
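A matching Python sketch of the FIB backup (same assumed array conventions as the QMDP snippet above; `O[a,s',o']` denotes $\Pr(o'|s',a)$):

    import numpy as np

    def fib(R, T, O, gamma, n_iters=1000):
        """R: |S|x|A|; T[a,s,s'] = Pr(s'|s,a); O[a,s',o'] = Pr(o'|s',a).
        Returns Q_FIB of shape |S|x|A|."""
        S, A = R.shape
        Q = np.zeros((S, A))
        for _ in range(n_iters):
            # M[a,s,o',a'] = sum_{s'} Pr(s'|s,a) Pr(o'|s',a) Q(s',a')
            M = np.einsum('ast,ato,tb->asob', T, O, Q)
            Q = R + gamma * M.max(axis=3).sum(axis=2).T    # max over a', sum over o'
        return Q

As before, the bound at a belief $b$ is $V(b) = \max_a \sum_s b(s)\,Q_{FIB}(s,a)$.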
FIB Analysis
- Theorem: $Q_{MDP} \ge Q_{FIB} \ge V^*$
- Proof:
  1) $Q_{MDP}(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a) \max_{a'} Q(s',a')$
     $= R(s,a) + \gamma \sum_{s'} \sum_{o'} \Pr(s'|s,a) \Pr(o'|s',a) \max_{a'} Q(s',a')$  (since $\sum_{o'} \Pr(o'|s',a) = 1$)
     $\ge R(s,a) + \gamma \sum_{o'} \max_{a'} \sum_{s'} \Pr(s'|s,a) \Pr(o'|s',a)\, Q(s',a')$  (a sum of maxima over $s'$ dominates the maximum of the sum)
     $= Q_{FIB}(s,a)$
  2) $Q_{FIB} \ge V^*$ since $Q_{FIB}$ is based on observing the previous state (too informative)
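A quick numerical sanity check of part 1, reusing the hypothetical `qmdp` and `fib` sketches above on a random instance:

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, nO = 4, 3, 2
    R = rng.normal(size=(S, A))
    T = rng.dirichlet(np.ones(S), size=(A, S))    # T[a,s,:] = Pr(.|s,a)
    O = rng.dirichlet(np.ones(nO), size=(A, S))   # O[a,s',:] = Pr(.|s',a)

    Q_mdp, _ = qmdp(R, T, gamma=0.9)
    Q_fib = fib(R, T, O, gamma=0.9)
    assert (Q_mdp >= Q_fib - 1e-8).all()          # Q_MDP dominates Q_FIB elementwise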
Finite Belief-State MDP
- Belief-state MDP: all beliefs are treated as states
  $V^*(b) = \max_a Q^*(b,a)$
- QMDP and FIB: the value of each interior belief is interpolated, i.e., $V(b) = \max_a \sum_s b(s)\, Q_{FIB}(s,a)$
- Idea: retain a subset of beliefs
  - Interpolate the value of the remaining beliefs
Finite Belief-State MDP
- Belief-state MDP:
  $Q(b,a) = R(b,a) + \gamma \sum_{o'} \Pr(o'|b,a) \max_{a'} Q(b^{a,o'}, a')$
- Let $B$ be a subset of representative beliefs
- Approximate $Q(b^{a,o'}, a')$ with the lowest interpolation, as sketched below
  - Linear program:
    $Q(b^{a,o'}, a') = \min_c \sum_{\bar b \in B} c_{\bar b}\, Q(\bar b, a')$
    such that $\sum_{\bar b \in B} c_{\bar b}\, \bar b = b^{a,o'}$, $\sum_{\bar b \in B} c_{\bar b} = 1$, and $c_{\bar b} \ge 0$ $\forall \bar b$
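A minimal sketch of this LP with scipy.optimize.linprog (the function name and data layout are illustrative assumptions):

    import numpy as np
    from scipy.optimize import linprog

    def interpolate_value(B, q, b_target):
        """Lowest convex interpolation at b_target of the values q[i] = Q(B[i], a').
        B: |B|x|S| matrix whose rows are the representative beliefs."""
        nB, S = B.shape
        # Equality constraints: sum_i c_i B[i] = b_target and sum_i c_i = 1
        A_eq = np.vstack([B.T, np.ones((1, nB))])
        b_eq = np.concatenate([b_target, [1.0]])
        res = linprog(c=q, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * nB)
        if not res.success:
            raise ValueError("b_target is not in the convex hull of B")
        return float(res.fun)

Including all corner beliefs (one per state) in $B$ guarantees the LP is feasible, since every belief is then a convex combination of points in $B$.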
Finite Belief-State MDP Algorithm
- Derive an upper bound by interpolating values based on a finite subset of beliefs
FiniteBeliefStateMDP(POMDP):
    Find $Q_B$ by value iteration:
        $Q_B(b,a) = R(b,a) + \gamma \sum_{o'} \Pr(o'|b,a) \max_{a'} Q_B(b^{a,o'}, a')$ $\forall b \in B, a$
        where $Q_B(b^{a,o'}, a') = \min_c \sum_{\bar b \in B} c_{\bar b}\, Q_B(\bar b, a')$
        such that $\sum_{\bar b \in B} c_{\bar b}\, \bar b = b^{a,o'}$, $\sum_{\bar b \in B} c_{\bar b} = 1$, and $c_{\bar b} \ge 0$ $\forall \bar b \in B$
    Let $V(b) = \max_a \sum_s b(s)\, Q_B(s,a)$
    Return $V$
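Putting the pieces together, a sketch of the value-iteration loop over a finite belief set, reusing the hypothetical `interpolate_value` and the array conventions from the earlier snippets:

    import numpy as np

    def belief_update(b, a, o, T, O):
        """Bayes update: b'(s') is proportional to Pr(o|s',a) * sum_s Pr(s'|s,a) b(s)."""
        bp = O[a, :, o] * (b @ T[a])
        return bp / bp.sum()

    def finite_belief_vi(B, R, T, O, gamma, n_iters=100):
        """Value iteration over a finite belief set B (|B|x|S| matrix).
        Returns Q_B of shape |B|x|A|."""
        nB, S = B.shape
        A = R.shape[1]
        nO = O.shape[2]
        Q = np.zeros((nB, A))
        for _ in range(n_iters):
            Qn = np.zeros_like(Q)
            for i, b in enumerate(B):
                for a in range(A):
                    val = float(b @ R[:, a])                # R(b,a) = sum_s b(s) R(s,a)
                    for o in range(nO):
                        po = float(b @ T[a] @ O[a, :, o])   # Pr(o'|b,a)
                        if po > 1e-12:
                            bp = belief_update(b, a, o, T, O)
                            # lowest interpolation of Q_B(b^{a,o'}, a'), maximized over a'
                            val += gamma * po * max(
                                interpolate_value(B, Q[:, ap], bp) for ap in range(A))
                    Qn[i, a] = val
            Q = Qn
        return Q

If the corner beliefs $e_s$ are included in $B$, the final bound $V(b) = \max_a \sum_s b(s)\, Q_B(s,a)$ reads the values $Q_B(s,a)$ off the rows of the returned matrix that correspond to those corners.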