
Introduction to Partially Observable Markov Decision Processes (CS 886) - PowerPoint PPT Presentation



  1. Module 14 Introduction to Partially Observable Markov Decision Processes CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

  2. Markov Decision Processes • MDPs: – Fully Observable MDPs – Decision maker knows the state at each time step • POMDPs: – Partially Observable MDPs – Decision maker does not know the state – But makes observations that are correlated with the underlying state • E.g. sensors provide noisy information about the state 2 CS886 (c) 2013 Pascal Poupart

  3. Applications • Robotic control • Dialog systems • Assistive Technologies • Operations Research 3 CS886 (c) 2013 Pascal Poupart

  4. Model Description • Definition – Set of states: S – Set of actions (i.e., decisions): A – Transition model: Pr(s_t | s_{t-1}, a_{t-1}) – Reward model (i.e., utility): R(s_t, a_t) – Discount factor: 0 ≤ γ ≤ 1 – Horizon (i.e., # of time steps): h – Set of observations: O – Observation model: Pr(o_t | s_t, a_{t-1}) 4 CS886 (c) 2013 Pascal Poupart
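
As an illustration only (not part of the slides), the tuple above can be written down concretely. A minimal sketch in Python, assuming dense NumPy arrays and tiger-problem-style numbers that are assumptions of this sketch, not from the course:

```python
# Sketch: encoding the POMDP tuple (S, A, Pr(s'|s,a), R(s,a), gamma, h, O, Pr(o|s',a))
# as dense NumPy arrays. The two-state "tiger" numbers below are illustrative assumptions.
import numpy as np

S = 2          # states: 0 = tiger-left, 1 = tiger-right
A = 3          # actions: 0 = listen, 1 = open-left, 2 = open-right
O = 2          # observations: 0 = hear-left, 1 = hear-right

# Transition model T[a, s, s'] = Pr(s' | s, a): opening a door resets the tiger uniformly.
T = np.empty((A, S, S))
T[0] = np.eye(S)                     # listening does not move the tiger
T[1] = T[2] = np.full((S, S), 0.5)

# Observation model Z[a, s', o] = Pr(o | s', a): listening is 85% accurate.
Z = np.empty((A, S, O))
Z[0] = np.array([[0.85, 0.15], [0.15, 0.85]])
Z[1] = Z[2] = np.full((S, O), 0.5)

# Reward model R[s, a]
R = np.array([[-1.0, -100.0,   10.0],    # tiger behind left door
              [-1.0,   10.0, -100.0]])   # tiger behind right door

gamma = 0.95   # discount factor
```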

  5. Graphical Model • Fully Observable MDP: states s_0, s_1, s_2, s_3, s_4, actions a_0, …, a_3, rewards r_0, …, r_3 (figure) 5 CS886 (c) 2013 Pascal Poupart

  6. Graphical Model • Partially Observable MDP: hidden states s_0, …, s_4, actions a_0, …, a_3, observations o_1, …, o_4, rewards r_0, …, r_3 (figure) 6 CS886 (c) 2013 Pascal Poupart

  7. Policies • MDP policies: π: S → A – Markovian policy • But the state is unknown in POMDPs • POMDP policies: π: B_0 × H_t → A_t – B_0 is the space of initial beliefs b_0, where b_0(s_0) = Pr(s_0) – H_t is the space of histories h_t of observables up to time t, h_t ≝ (a_0, o_1, a_1, o_2, …, a_{t-1}, o_t) – Non-Markovian policy 7 CS886 (c) 2013 Pascal Poupart

  8. Policy Trees • Policy π: B × H_t → A_t • Consider a single initial belief b • Then π can be represented by a tree (figure: root action a_1; each observation o_1 or o_2 leads to the next action a_1 or a_2, which branches again on o_1/o_2, and so on) 8 CS886 (c) 2013 Pascal Poupart
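
A minimal sketch of how such a policy tree could be represented in code; the class and field names are assumptions of this sketch, not from the slides:

```python
# Sketch: a policy tree as a recursive structure. The root prescribes an action;
# each observation leads to the subtree to follow at the next step.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTree:
    action: int                                                        # action taken at this node
    children: Dict[int, "PolicyTree"] = field(default_factory=dict)    # observation -> subtree

# Example: a depth-2 tree with actions a1=0, a2=1 and observations o1=0, o2=1.
tree = PolicyTree(0, {0: PolicyTree(0), 1: PolicyTree(1)})
```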

  9. Policy Trees (continued) • Policy π: B × H_t → A_t – Set of trees • Let B = B_1 ∪ B_2 ∪ B_3; a different policy tree is followed depending on whether b ∈ B_1, b ∈ B_2, or b ∈ B_3 (figure: three policy trees, one per region of the belief space) 9 CS886 (c) 2013 Pascal Poupart

  10. Beliefs • Belief b_t(s) = Pr(s_t) – Distribution over states at time t • Belief about the underlying state based on history h_t: b_t(s) = Pr(s_t | h_t, b_0) 10 CS886 (c) 2013 Pascal Poupart

  11. Belief Update • Belief update: b_t, a_t, o_{t+1} → b_{t+1}
      b_{t+1}(s_{t+1}) = Pr(s_{t+1} | h_{t+1}, b_0)
      = Pr(s_{t+1} | o_{t+1}, a_t, h_t, b_0)                                            [h_{t+1} ≡ (h_t, a_t, o_{t+1})]
      = Pr(s_{t+1} | o_{t+1}, a_t, b_t)                                                 [b_t ≡ (b_0, h_t)]
      = Pr(s_{t+1}, o_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)                        [Bayes' theorem]
      = Pr(o_{t+1} | s_{t+1}, a_t) Pr(s_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)      [chain rule]
      = Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t) / Pr(o_{t+1} | a_t, b_t)   [belief definition]
      ∝ Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t)
      11 CS886 (c) 2013 Pascal Poupart
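
The final proportional form of the update translates directly into code. A minimal sketch, assuming the T, Z array layout from the earlier model sketch:

```python
# Sketch of the belief update b_t, a_t, o_{t+1} -> b_{t+1} derived above,
# with T[a, s, s'] = Pr(s'|s,a) and Z[a, s', o] = Pr(o|s',a).
import numpy as np

def belief_update(b, a, o, T, Z):
    """Return b'(s') proportional to Pr(o | s', a) * sum_s Pr(s' | s, a) * b(s)."""
    b_next = Z[a, :, o] * (b @ T[a])      # unnormalized posterior over s'
    norm = b_next.sum()                   # = Pr(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief/action")
    return b_next / norm

# Example with a 2-state model: a uniform belief sharpened by one noisy observation.
T = np.array([np.eye(2)])                       # single action: state unchanged
Z = np.array([[[0.85, 0.15], [0.15, 0.85]]])    # observation 85% accurate
print(belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, Z=Z))   # -> [0.85, 0.15]
```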

  12. Markovian Policies • Beliefs are sufficient statistics equivalent to histories (with the initial belief): (b_0, h_t) ⇔ b_t • Policies: – Based on histories: π: B_0 × H_t → A_t • Non-Markovian – Based on beliefs: π: B → A • Markovian 12 CS886 (c) 2013 Pascal Poupart

  13. Belief State MDPs • POMDPs can be viewed as belief state MDPs – States: B (beliefs) – Actions: A – Transitions: Pr(b_{t+1} | b_t, a_t) = Pr(o_{t+1} | b_t, a_t) if b_t, a_t, o_{t+1} → b_{t+1}, and 0 otherwise – Rewards: R(b, a) = Σ_s b(s) R(s, a) • Belief state MDPs – Fully observable – Continuous belief space 13 CS886 (c) 2013 Pascal Poupart
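
A small sketch of the two belief-MDP quantities used here, R(b, a) and Pr(o' | b, a), again assuming the array layout of the earlier sketches:

```python
# Sketch: belief reward R(b,a) = sum_s b(s) R(s,a) and observation probability
# Pr(o' | b, a) = sum_{s'} Pr(o'|s',a) * sum_s Pr(s'|s,a) b(s).
import numpy as np

def belief_reward(b, a, R):
    # R[s, a] is the state-action reward table
    return float(b @ R[:, a])

def obs_prob(b, a, o, T, Z):
    # T[a, s, s'] = Pr(s'|s,a), Z[a, s', o] = Pr(o|s',a)
    return float(Z[a, :, o] @ (b @ T[a]))
```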

  14. Policy Evaluation • Value V^π of a POMDP policy π – Expected sum of rewards: V^π(b) = E[ Σ_t γ^t R(b_t, π(b_t)) ] – Policy evaluation: Bellman's equation V^π(b) = R(b, π(b)) + γ Σ_{b'} Pr(b' | b, π(b)) V^π(b')  ∀b – Equivalent equation V^π(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V^π(b^{π(b), o'})  ∀b 14 CS886 (c) 2013 Pascal Poupart
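
The second (observation-based) form of the Bellman equation can be evaluated recursively at a single belief. A finite-horizon sketch, with the policy passed in as an assumed callable from beliefs to actions:

```python
# Sketch: V^pi(b) = R(b, pi(b)) + gamma * sum_o Pr(o | b, pi(b)) * V^pi(b^{pi(b), o}),
# unrolled for a fixed number of steps. Array layout matches the earlier sketches.
import numpy as np

def evaluate_policy(b, policy, T, Z, R, gamma, horizon):
    if horizon == 0:
        return 0.0
    a = policy(b)
    value = float(b @ R[:, a])                       # immediate reward R(b, a)
    pred = b @ T[a]                                  # Pr(s' | b, a)
    for o in range(Z.shape[2]):
        p_o = float(Z[a, :, o] @ pred)               # Pr(o | b, a)
        if p_o > 0.0:
            b_next = Z[a, :, o] * pred / p_o         # belief update b^{a,o}
            value += gamma * p_o * evaluate_policy(b_next, policy, T, Z, R, gamma, horizon - 1)
    return value
```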

  15. Policy Tree Value Function • Theorem: The value function V^π(b) of a policy tree is linear in b – i.e. V^π(b) = Σ_s α(s) b(s) • Proof by induction: – Base case: at the leaves, V_0(b) = R(b, π(b)) = Σ_s b(s) R(s, π(b)) – Hence α(s) = R(s, π(b)) – Assumption: for all trees of depth n, there exists an α-vector such that V_n(b) = Σ_s b(s) α(s) 15 CS886 (c) 2013 Pascal Poupart

  16. Proof continued • Induction:
      V_{n+1}(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V_n(b^{π(b), o'})
      = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s'} b^{π(b), o'}(s') α_{o'}(s')
      = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s, s'} b(s) Pr(s' | s, π(b)) Pr(o' | s', π(b)) α_{o'}(s') / Pr(o' | b, π(b))
      = Σ_s b(s) [ R(s, π(b)) + γ Σ_{o', s'} Pr(s' | s, π(b)) Pr(o' | s', π(b)) α_{o'}(s') ]
      = Σ_s b(s) α(s)
      16 CS886 (c) 2013 Pascal Poupart
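
The induction step is also exactly the recursion one would use to compute the α-vector of a policy tree. A sketch assuming the PolicyTree structure and array layout from the earlier sketches:

```python
# Sketch: alpha(s) = R(s, a) + gamma * sum_{o', s'} Pr(s'|s,a) Pr(o'|s',a) alpha_{o'}(s'),
# with a the root action of the tree and alpha_{o'} the alpha-vector of the subtree for o'.
import numpy as np

def tree_alpha(tree, T, Z, R, gamma):
    a = tree.action
    alpha = R[:, a].astype(float).copy()             # base case: alpha(s) = R(s, a)
    for o, subtree in tree.children.items():
        alpha_o = tree_alpha(subtree, T, Z, R, gamma)
        # add gamma * sum_{s'} Pr(s'|s,a) Pr(o|s',a) alpha_o(s') for all s at once
        alpha += gamma * T[a] @ (Z[a, :, o] * alpha_o)
    return alpha

# The value of the tree at a belief b is then simply b @ tree_alpha(tree, T, Z, R, gamma).
```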

  17. Value Function • Corollary: The value function of a policy made up of a set of trees is piecewise-linear • Proof: – Each tree leads to a linear piece for a region of the belief space – Hence the value function is made up of several linear pieces 17 CS886 (c) 2013 Pascal Poupart

  18. Optimal Value Function • Theorem: The optimal value function V*(b) for a finite horizon is piecewise-linear and convex in b • Proof: – There are finitely many trees of finite depth – Each tree gives rise to a linear piece α – At each belief, select the highest linear piece 18 CS886 (c) 2013 Pascal Poupart
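
Given a set of α-vectors, evaluating the piecewise-linear convex value function at a belief is just a max over dot products; a minimal sketch:

```python
# Sketch: the PWLC value at belief b is the highest linear piece, max_alpha (alpha . b).
import numpy as np

def pwlc_value(b, alphas):
    """alphas: array of shape (num_pieces, |S|); returns max_i alphas[i] . b."""
    return float(np.max(alphas @ b))

# Example: two linear pieces over a 2-state belief space.
alphas = np.array([[1.0, 0.0],
                   [0.2, 0.8]])
print(pwlc_value(np.array([0.5, 0.5]), alphas))   # -> 0.5
```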

  19. Value Iteration • Bellman's equation: V*(b) = max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a, o'}) ] • Value Iteration: – Idea: repeat V*(b) ← max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a, o'}) ]  ∀b – But we can't enumerate all beliefs – Instead compute linear pieces α for a subset of beliefs 19 CS886 (c) 2013 Pascal Poupart

  20. Point-Based Value Iteration • Let B = {b_1, b_2, …, b_k} be a subset of beliefs • Let Γ = {α_1, α_2, …, α_k} be a set of α-vectors such that α_i is associated with b_i • Point-based value iteration: – Repeatedly improve V(b_i) at each b_i:
      V(b_i) = max_a [ R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) max_{α∈Γ} α(b_i^{a, o'}) ]
      – Find α_i such that V(b_i) = Σ_s b_i(s) α_i(s):
      • α^{a, o'} ← argmax_{α∈Γ} Σ_{s'} b_i^{a, o'}(s') α(s')
      • a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α^{a, o'}(b_i^{a, o'})
      • α_i(s) ← R(s, a*) + γ Σ_{s', o'} Pr(s' | s, a*) Pr(o' | s', a*) α^{a*, o'}(s')
      20 CS886 (c) 2013 Pascal Poupart
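
The three update steps above amount to one point-based backup at b_i. A sketch assuming the same array layout as before, with Γ stored as a matrix whose rows are α-vectors:

```python
# Sketch of one point-based backup at belief b_i: pick alpha^{a,o'} from Gamma for each
# action/observation, pick the best action a*, and assemble the new alpha_i.
import numpy as np

def point_backup(b_i, Gamma, T, Z, R, gamma):
    num_a, num_o = R.shape[1], Z.shape[2]
    best_value, best_action, best_alpha = -np.inf, None, None
    for a in range(num_a):
        value = float(b_i @ R[:, a])                      # R(b_i, a)
        alpha_i = R[:, a].astype(float).copy()
        pred = b_i @ T[a]                                 # Pr(s' | b_i, a)
        for o in range(num_o):
            p_o = float(Z[a, :, o] @ pred)                # Pr(o | b_i, a)
            if p_o == 0.0:
                continue
            b_next = Z[a, :, o] * pred / p_o              # updated belief b_i^{a,o}
            alpha_ao = Gamma[np.argmax(Gamma @ b_next)]   # argmax_{alpha in Gamma} alpha . b^{a,o}
            value += gamma * p_o * float(alpha_ao @ b_next)
            alpha_i += gamma * T[a] @ (Z[a, :, o] * alpha_ao)
        if value > best_value:                            # keep a* and its alpha_i
            best_value, best_action, best_alpha = value, a, alpha_i
    return best_alpha, best_action, best_value
```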

  21. Algorithm Point-Based Value Iteration(B, h)
      Let B be a set of beliefs
      α_init(s) = min_{a, s} R(s, a) / (1 − γ)  ∀s
      Γ_0 ← {α_init}
      For n = 1 to h do
        For each b_i ∈ B do
          α^{a, o'} ← argmax_{α∈Γ_{n−1}} Σ_{s'} b_i^{a, o'}(s') α(s')
          a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α^{a, o'}(b_i^{a, o'})
          α_i(s) ← R(s, a*) + γ Σ_{s', o'} Pr(s' | s, a*) Pr(o' | s', a*) α^{a*, o'}(s')
        Γ_n ← {α_i ∀i}
      Return Γ_h
      21 CS886 (c) 2013 Pascal Poupart
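
Putting the backup into the loop gives the full algorithm. A sketch assuming the point_backup function and the T, Z, R, gamma definitions from the previous sketches are in scope; the choice of belief set B is left to the caller:

```python
# Sketch of the point-based value iteration loop above. The alpha-vectors are
# initialized with the lower bound min_{s,a} R(s,a) / (1 - gamma) on any value.
import numpy as np

def pbvi(beliefs, T, Z, R, gamma, horizon):
    alpha_init = np.full(R.shape[0], R.min() / (1.0 - gamma))
    Gamma = np.array([alpha_init])
    for _ in range(horizon):
        new_alphas = []
        for b_i in beliefs:
            alpha_i, _, _ = point_backup(b_i, Gamma, T, Z, R, gamma)
            new_alphas.append(alpha_i)
        Gamma = np.array(new_alphas)
    return Gamma

# Usage: the value at any belief b is then the highest linear piece,
# e.g. float(np.max(pbvi(beliefs, T, Z, R, gamma, horizon=20) @ b)).
```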
