Module 14: Introduction to Partially Observable Markov Decision Processes
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Pascal Poupart (c) 2013
Markov Decision Processes
- MDPs: Fully Observable MDPs
  – The decision maker knows the state at each time step
- POMDPs: Partially Observable MDPs
  – The decision maker does not know the state
  – But makes observations that are correlated with the underlying state
  – E.g., sensors provide noisy information about the state
Applications
- Robotic control
- Dialog systems
- Assistive technologies
- Operations research
Model Description
- Definition
– Set of states: $S$
– Set of actions (i.e., decisions): $A$
– Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
– Reward model (i.e., utility): $R(s_t, a_t)$
– Discount factor: $0 \le \gamma \le 1$
– Horizon (i.e., # of time steps): $h$
– Set of observations: $O$
– Observation model: $\Pr(o_t \mid s_t, a_{t-1})$
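For concreteness, here is one way these ingredients can be encoded as arrays; a minimal sketch, not from the slides. The array layout ($T[a,s,s'] = \Pr(s' \mid s, a)$, $Z[a,s',o] = \Pr(o \mid s', a)$, $R[s,a]$) and the classic two-state "tiger" problem used as the example are assumptions for illustration, reused in the later sketches.

```python
import numpy as np

# Hypothetical array encoding of a POMDP, illustrated on the classic
# two-state "tiger" problem (actions: listen, open-left, open-right).
S, A, O = 2, 3, 2                       # |S| states, |A| actions, |O| observations
LISTEN, OPEN_L, OPEN_R = 0, 1, 2

T = np.empty((A, S, S))                 # T[a, s, s'] = Pr(s' | s, a)
T[LISTEN] = np.eye(S)                   # listening does not move the tiger
T[OPEN_L] = T[OPEN_R] = np.full((S, S), 0.5)   # opening a door resets the problem

Z = np.empty((A, S, O))                 # Z[a, s', o] = Pr(o | s', a)
Z[LISTEN] = [[0.85, 0.15],              # hear the tiger on the correct side 85% of the time
             [0.15, 0.85]]
Z[OPEN_L] = Z[OPEN_R] = np.full((S, O), 0.5)   # opening yields no information

R = np.array([[-1.0, -100.0,   10.0],   # R[s, a]: tiger-left row
              [-1.0,   10.0, -100.0]])  #          tiger-right row
gamma, h = 0.95, 20                     # discount factor and horizon
```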
Graphical Model
- Fully Observable MDP
[Figure: two-slice dynamic Bayesian network with states $s_0, \ldots, s_4$, actions $a_0, \ldots, a_3$, and rewards $r_0, \ldots, r_3$; the states are observed.]
Graphical Model
- Partially Observable MDP
[Figure: the same network with hidden states $s_0, \ldots, s_4$ and observations $o_1, \ldots, o_4$ added; the decision maker sees only the actions, observations, and rewards.]
Policies
- MDP policies: $\pi: S \to A$
– Markovian policy
- But the state is unknown in POMDPs
- POMDP policies: $\pi: B_0 \times H_t \to A_t$
  – $B_0$ is the space of initial beliefs $b_0$, where $b_0 = \Pr(s_0)$
  – $H_t$ is the space of histories $h_t$ of observables up to time $t$: $h_t \triangleq a_0, o_1, a_1, o_2, \ldots, a_{t-1}, o_t$
  – Non-Markovian policy
Policy Trees
- Policy $\pi: B \times H_t \to A_t$
- Consider a single initial belief $b$
- Then $\pi$ can be represented by a tree

[Figure: a depth-two policy tree; each node is labeled with an action ($a_1$ or $a_2$) and each edge with an observation ($o_1$ or $o_2$).]
Policy Trees (continued)
- Policy $\pi: B \times H_t \to A_t$
  – Set of trees: let $B = B_1 \cup B_2 \cup B_3$

[Figure: three depth-two policy trees, one followed in each region of the belief space: $b \in B_1$, $b \in B_2$, $b \in B_3$.]
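As a sketch, a policy tree can be encoded as nested tuples and executed against a stream of observations. The encoding below is hypothetical (not from the slides) and is reused by the later $\alpha$-vector sketch.

```python
# A policy tree as nested tuples: (action, [subtree for each observation]).
# Leaves carry an action and an empty list of subtrees.
tree = (0, [(1, []),    # root: take a1; after observing o1, take a2
            (0, [])])   # after observing o2, take a1

def run_tree(tree, observations):
    """Execute a policy tree: take its action, then descend along each observation."""
    actions, node = [], tree
    for o in observations:
        a, children = node
        actions.append(a)
        if not children:      # ran out of tree before observations
            return actions
        node = children[o]
    actions.append(node[0])   # action at the node we ended on
    return actions

print(run_tree(tree, [1]))    # -> [0, 0]: take a1, observe o2, take a1
```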
Beliefs
- Belief $b_t(s) = \Pr(s_t = s)$
  – Distribution over states at time $t$
- Belief about the underlying state based on history $h_t$:
  $$b_t(s) = \Pr(s_t = s \mid h_t, b_0)$$
Belief Update
- Belief update: $b_t, a_t, o_{t+1} \to b_{t+1}$

$$
\begin{aligned}
b_{t+1}(s_{t+1}) &= \Pr(s_{t+1} \mid h_{t+1}, b_0) \\
&= \Pr(s_{t+1} \mid o_{t+1}, a_t, h_t, b_0) && h_{t+1} \equiv o_{t+1}, a_t, h_t \\
&= \Pr(s_{t+1} \mid o_{t+1}, a_t, b_t) && b_t \equiv b_0, h_t \\
&= \frac{\Pr(s_{t+1}, o_{t+1} \mid a_t, b_t)}{\Pr(o_{t+1} \mid a_t, b_t)} && \text{Bayes' theorem} \\
&= \frac{\Pr(o_{t+1} \mid s_{t+1}, a_t)\,\Pr(s_{t+1} \mid a_t, b_t)}{\Pr(o_{t+1} \mid a_t, b_t)} && \text{chain rule} \\
&= \frac{\Pr(o_{t+1} \mid s_{t+1}, a_t)\sum_{s_t}\Pr(s_{t+1} \mid s_t, a_t)\,b_t(s_t)}{\Pr(o_{t+1} \mid a_t, b_t)} && \text{belief definition} \\
&\propto \Pr(o_{t+1} \mid s_{t+1}, a_t)\sum_{s_t}\Pr(s_{t+1} \mid s_t, a_t)\,b_t(s_t)
\end{aligned}
$$
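This update is a one-liner with the array encoding assumed earlier ($T[a,s,s']$, $Z[a,s',o]$); a minimal sketch:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """b'(s') ∝ Pr(o | s', a) * sum_s Pr(s' | s, a) b(s), as derived above."""
    b_next = Z[a, :, o] * (T[a].T @ b)   # observation likelihood times predicted belief
    norm = b_next.sum()                  # = Pr(o | b, a), the normalization constant
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return b_next / norm

# With the tiger arrays from the earlier sketch:
# belief_update(np.array([0.5, 0.5]), LISTEN, 0, T, Z) -> array([0.85, 0.15])
```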
Markovian Policies
- Beliefs are sufficient statistics equivalent to histories (together with the initial belief):
  $$b_0, h_t \Leftrightarrow b_t$$
- Policies:
  – Based on histories: $\pi: B_0 \times H_t \to A_t$
    - Non-Markovian
  – Based on beliefs: $\pi: B \to A$
    - Markovian
Belief State MDPs
- POMDPs can be viewed as belief state MDPs
  – States: $B$ (beliefs)
  – Actions: $A$
  – Transitions: $\Pr(b_{t+1} \mid b_t, a_t) = \begin{cases} \Pr(o_{t+1} \mid b_t, a_t) & \text{if } b_t, a_t, o_{t+1} \to b_{t+1} \\ 0 & \text{otherwise} \end{cases}$
  – Rewards: $R(b, a) = \sum_s b(s)\, R(s, a)$
- Belief state MDPs are
  – Fully observable
  – Continuous (the belief space is continuous)
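Under the same assumed array layout, the belief-MDP reward and transition probabilities are easy to compute, since $(b, a, o')$ determines the successor belief deterministically; a sketch:

```python
import numpy as np

def belief_reward(b, a, R):
    """R(b, a) = sum_s b(s) R(s, a)."""
    return b @ R[:, a]

def obs_probs(b, a, T, Z):
    """Pr(o' | b, a) for every o' -- the belief-MDP transition probabilities."""
    predicted = T[a].T @ b        # Pr(s' | b, a)
    return Z[a].T @ predicted     # sum_{s'} Pr(o' | s', a) Pr(s' | b, a)
```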
Policy Evaluation
- Value $V^\pi$ of a POMDP policy $\pi$
  – Expected sum of rewards:
    $$V^\pi(b) = E\!\left[\textstyle\sum_t \gamma^t R(b_t, \pi(b_t))\right]$$
  – Policy evaluation: Bellman's equation
    $$V^\pi(b) = R(b, \pi(b)) + \gamma \sum_{b'} \Pr(b' \mid b, \pi(b))\, V^\pi(b') \quad \forall b$$
  – Equivalent equation:
    $$V^\pi(b) = R(b, \pi(b)) + \gamma \sum_{o'} \Pr(o' \mid b, \pi(b))\, V^\pi\!\left(b^{\pi(b),o'}\right) \quad \forall b$$
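A sketch of finite-horizon evaluation using the second equation, again assuming the earlier array encoding; `policy` here is any hypothetical function from beliefs to actions. The recursion branches over all observations at every step, so it is exponential in the depth and is meant only to make the equation concrete.

```python
import numpy as np

def evaluate_policy(b, policy, depth, T, Z, R, gamma):
    """V(b) = R(b, a) + gamma * sum_{o'} Pr(o' | b, a) V(b^{a,o'}), a = policy(b)."""
    if depth == 0:
        return 0.0
    a = policy(b)
    value = b @ R[:, a]                        # immediate belief reward R(b, a)
    predicted = T[a].T @ b                     # Pr(s' | b, a)
    for o in range(Z.shape[2]):
        p_o = Z[a, :, o] @ predicted           # Pr(o | b, a)
        if p_o > 0.0:
            b_next = Z[a, :, o] * predicted / p_o   # belief update
            value += gamma * p_o * evaluate_policy(
                b_next, policy, depth - 1, T, Z, R, gamma)
    return value
```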
Policy Tree Value Function
- Theorem: The value function $V^\pi(b)$ of a policy tree is linear in $b$
  – i.e., $V^\pi(b) = \sum_s \alpha(s)\, b(s)$
- Proof by induction:
  – Base case: at the leaves,
    $$V_0(b) = R(b, \pi(b)) = \sum_s b(s)\, R(s, \pi(b))$$
  – Hence $\alpha(s) = R(s, \pi(b))$ (the action $\pi(b)$ is fixed by the tree)
  – Induction hypothesis: for all trees of depth $n$, there exists an $\alpha$-vector such that $V_n(b) = \sum_s b(s)\, \alpha(s)$
Proof (continued)

- Induction step:

$$
\begin{aligned}
V_{n+1}(b) &= R(b, \pi(b)) + \gamma \sum_{o'} \Pr(o' \mid b, \pi(b))\, V_n\!\left(b^{\pi(b),o'}\right) \\
&= R(b, \pi(b)) + \gamma \sum_{o'} \Pr(o' \mid b, \pi(b)) \sum_{s'} b^{\pi(b),o'}(s')\, \alpha_{o'}(s') \\
&= R(b, \pi(b)) + \gamma \sum_{o'} \Pr(o' \mid b, \pi(b)) \sum_{s,s'} \frac{b(s)\,\Pr(s' \mid s, \pi(b))\,\Pr(o' \mid s', \pi(b))}{\Pr(o' \mid b, \pi(b))}\, \alpha_{o'}(s') \\
&= \sum_s b(s) \underbrace{\left[R(s, \pi(b)) + \gamma \sum_{o',s'} \Pr(s' \mid s, \pi(b))\,\Pr(o' \mid s', \pi(b))\, \alpha_{o'}(s')\right]}_{\alpha(s)}
\end{aligned}
$$
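The induction translates directly into a recursive computation of a tree's $\alpha$-vector; a sketch using the nested-tuple tree encoding and array layout assumed earlier:

```python
import numpy as np

def tree_alpha(tree, T, Z, R, gamma):
    """alpha(s) = R(s, a) + gamma * sum_{o',s'} Pr(s'|s,a) Pr(o'|s',a) alpha_{o'}(s'),
    where a is the action at the root and alpha_{o'} comes from the o'-subtree."""
    a, children = tree
    alpha = R[:, a].copy()                     # base case: alpha(s) = R(s, a)
    for o, child in enumerate(children):
        alpha_o = tree_alpha(child, T, Z, R, gamma)
        # sum over s': Pr(s'|s,a) Pr(o|s',a) alpha_o(s')
        alpha += gamma * T[a] @ (Z[a, :, o] * alpha_o)
    return alpha

# The tree's value at belief b is then linear, as the theorem states:
# V(b) = b @ tree_alpha(tree, T, Z, R, gamma)
```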
Value Function
- Corollary: The value function of a policy made up of a set of trees is piece-wise linear
- Proof:
  – Each tree contributes a linear piece over a region of the belief space
  – Hence the value function is made up of several linear pieces
Optimal Value Function
- Theorem: The optimal value function $V^*(b)$ for a finite horizon is piece-wise linear and convex in $b$
- Proof:
  – There are finitely many trees of finite depth
  – Each tree gives rise to a linear piece $\alpha$
  – At each belief, select the highest linear piece
Value Iteration
- Bellman's equation:
  $$V^*(b) = \max_a R(b, a) + \gamma \sum_{o'} \Pr(o' \mid b, a)\, V^*\!\left(b^{a,o'}\right)$$
- Value iteration:
  – Idea: repeat
    $$V^*(b) \leftarrow \max_a R(b, a) + \gamma \sum_{o'} \Pr(o' \mid b, a)\, V^*\!\left(b^{a,o'}\right) \quad \forall b$$
  – But we can't enumerate all beliefs
  – Instead, compute linear pieces $\alpha$ for a subset of beliefs
Point-Based Value Iteration

- Let $B = \{b_1, b_2, \ldots, b_k\}$ be a subset of beliefs
- Let $\Gamma = \{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ be a set of $\alpha$-vectors such that $\alpha_i$ is associated with $b_i$
- Point-based value iteration:
  – Repeatedly improve $V(b_i)$ at each $b_i$:
    $$V(b_i) = \max_a R(b_i, a) + \gamma \sum_{o'} \Pr(o' \mid b_i, a) \max_{\alpha \in \Gamma} \alpha\!\left(b_i^{a,o'}\right)$$
  – Find $\alpha_i$ such that $V(b_i) = \sum_s b_i(s)\, \alpha_i(s)$:
    - $\alpha_{a,o'} \leftarrow \operatorname{argmax}_{\alpha \in \Gamma} \sum_{s'} b_i^{a,o'}(s')\, \alpha(s')$
    - $a^* \leftarrow \operatorname{argmax}_a R(b_i, a) + \gamma \sum_{o'} \Pr(o' \mid b_i, a)\, \alpha_{a,o'}\!\left(b_i^{a,o'}\right)$
    - $\alpha_i(s) \leftarrow R(s, a^*) + \gamma \sum_{s',o'} \Pr(s' \mid s, a^*)\, \Pr(o' \mid s', a^*)\, \alpha_{a^*,o'}(s')$
Algorithm
PointBasedValueIteration($B$, $h$)
  Let $B$ be a set of beliefs
  $\alpha_{\text{init}}(s) \leftarrow \min_{a,s} R(s,a) / (1 - \gamma) \quad \forall s$
  $\Gamma^0 \leftarrow \{\alpha_{\text{init}}\}$
  For $n = 1$ to $h$ do
    For each $b_i \in B$ do
      $\alpha_{a,o'} \leftarrow \operatorname{argmax}_{\alpha \in \Gamma^{n-1}} \sum_{s'} b_i^{a,o'}(s')\, \alpha(s')$
      $a^* \leftarrow \operatorname{argmax}_a R(b_i, a) + \gamma \sum_{o'} \Pr(o' \mid b_i, a)\, \alpha_{a,o'}\!\left(b_i^{a,o'}\right)$
      $\alpha_i(s) \leftarrow R(s, a^*) + \gamma \sum_{s',o'} \Pr(s' \mid s, a^*)\, \Pr(o' \mid s', a^*)\, \alpha_{a^*,o'}(s')$
    $\Gamma^n \leftarrow \{\alpha_i \ \forall i\}$
  Return $\Gamma^h$
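Putting the pieces together, a compact, unoptimized Python sketch of the algorithm under the array assumptions from the earlier sketches:

```python
import numpy as np

def pbvi(beliefs, T, Z, R, gamma, h):
    """Point-based value iteration; returns one alpha-vector per belief point."""
    A, S, _ = T.shape
    O = Z.shape[2]
    Gamma = [np.full(S, R.min() / (1.0 - gamma))]     # alpha_init
    for n in range(h):
        new_Gamma = []
        for b in beliefs:
            best_value, best_alpha = -np.inf, None
            for a in range(A):
                predicted = T[a].T @ b                # Pr(s' | b, a)
                value, alpha = b @ R[:, a], R[:, a].copy()
                for o in range(O):
                    p_o = Z[a, :, o] @ predicted      # Pr(o | b, a)
                    if p_o == 0.0:
                        continue
                    b_next = Z[a, :, o] * predicted / p_o
                    # alpha_{a,o} <- argmax_{alpha in Gamma} sum_{s'} b_next(s') alpha(s')
                    alpha_ao = max(Gamma, key=lambda al: b_next @ al)
                    value += gamma * p_o * (b_next @ alpha_ao)
                    alpha += gamma * T[a] @ (Z[a, :, o] * alpha_ao)
                if value > best_value:                # pick a* and its alpha_i
                    best_value, best_alpha = value, alpha
            new_Gamma.append(best_alpha)
        Gamma = new_Gamma
    return Gamma

# e.g., with the tiger arrays from the first sketch:
# Gamma = pbvi([np.array([0.5, 0.5]), np.array([0.9, 0.1])], T, Z, R, gamma, h)
```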