Introduction to Partially Observable Markov Decision Processes (CS 886) - PowerPoint PPT Presentation



SLIDE 1

Module 14 Introduction to Partially Observable Markov Decision Processes

CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

SLIDE 2

CS886 (c) 2013 Pascal Poupart

Markov Decision Processes

  • MDPs:

– Fully Observable MDPs – Decision maker knows the state at each time step

  • POMDPs:

– Partially Observable MDPs – Decision maker does not know the state – But makes observations that are correlated with the underlying state

  • E.g. sensors provide noisy information about the state
SLIDE 3

Applications

  • Robotic control
  • Dialog systems
  • Assistive Technologies
  • Operations Research
SLIDE 4

Model Description

  • Definition

– Set of states: S
– Set of actions (i.e., decisions): A
– Transition model: Pr(s_t | s_{t-1}, a_{t-1})
– Reward model (i.e., utility): R(s_t, a_t)
– Discount factor: 0 ≤ γ ≤ 1
– Horizon (i.e., # of time steps): h
– Set of observations: O
– Observation model: Pr(o_t | s_t, a_{t-1})
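The tuple above maps directly onto a small data structure. The sketch below is a minimal Python container; the concrete numbers are the classic two-state "tiger" problem from the POMDP literature, used purely as an illustration (they are not from these slides).

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    # The tuple from the slide: states S, actions A, observations O,
    # transition model, observation model, reward model, discount, horizon.
    states: list
    actions: list
    observations: list
    T: dict        # T[s, a, s2] = Pr(s2 | s, a)
    Z: dict        # Z[s2, a, o] = Pr(o | s2, a)
    R: dict        # R[s, a] = reward
    gamma: float
    horizon: int

# Illustrative instance: the classic "tiger" problem.
S = ["left", "right"]                       # where the tiger is
A = ["listen", "open-left", "open-right"]
O = ["hear-left", "hear-right"]
T = {(s, a, s2): ((1.0 if s == s2 else 0.0) if a == "listen" else 0.5)
     for s in S for a in A for s2 in S}     # opening a door resets the problem
Z = {(s2, a, o): ((0.85 if o == "hear-" + s2 else 0.15) if a == "listen" else 0.5)
     for s2 in S for a in A for o in O}     # listening is 85% accurate
R = {(s, a): (-1.0 if a == "listen" else (-100.0 if a == "open-" + s else 10.0))
     for s in S for a in A}
tiger = POMDP(S, A, O, T, Z, R, gamma=0.95, horizon=20)
```

Each row of the transition and observation models is a proper distribution, which is easy to sanity-check before running any algorithm on the model.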

SLIDE 5

Graphical Model

  • Fully Observable MDP

[Figure: two-slice dynamic Bayesian network with states s0…s4, actions a0…a3, and rewards r0…r3; each action depends on the current (observed) state]

SLIDE 6

Graphical Model

  • Partially Observable MDP

[Figure: the same network with observations o1…o4 added; the states s0…s4 are now hidden, and each action depends only on the observations received so far]
SLIDE 7

Policies

  • MDP policies: π: S → A

– Markovian policy

  • But the state is unknown in POMDPs
  • POMDP policies: π: B_0 × H_t → A_t

– B_0 is the space of initial beliefs b_0 = Pr(s_0)
– H_t is the space of histories h_t of observables up to time t:
  h_t ≝ a_0, o_1, a_1, o_2, …, a_{t-1}, o_t
– Non-Markovian policy

SLIDE 8

Policy Trees

  • Policy π: B × H_t → A_t
  • Consider a single initial belief b
  • Then π can be represented by a tree

[Figure: a policy tree with an action at each node and one branch per observation; e.g., root action a1, then a2 after observing o1 and a1 after observing o2, and so on down the tree]

SLIDE 9

Policy Trees (continued)

  • Policy π: B × H_t → A_t

– Set of trees: let B = B_1 ∪ B_2 ∪ B_3

[Figure: three policy trees, one per region of the belief space; one tree is followed when b ∈ B_1, another when b ∈ B_2, and a third when b ∈ B_3]

SLIDE 10

Beliefs

  • Belief b_t(s) = Pr(s_t)

– Distribution over states at time t

  • Belief about the underlying state based on history h_t:

b_t(s) = Pr(s_t | h_t, b_0)

SLIDE 11

Belief Update

  • Belief update: b_t, a_t, o_{t+1} → b_{t+1}

b_{t+1}(s_{t+1})
= Pr(s_{t+1} | h_{t+1}, b_0)
= Pr(s_{t+1} | o_{t+1}, a_t, h_t, b_0)                                            [h_{t+1} ≡ o_{t+1}, a_t, h_t]
= Pr(s_{t+1} | o_{t+1}, a_t, b_t)                                                 [b_t ≡ b_0, h_t]
= Pr(s_{t+1}, o_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)                        [Bayes' theorem]
= Pr(o_{t+1} | s_{t+1}, a_t) Pr(s_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)      [chain rule]
= Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t) / Pr(o_{t+1} | a_t, b_t)   [belief definition]
∝ Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t)
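The final (proportional) line of the derivation is what an implementation actually computes: multiply observation and prediction terms, then renormalize, with the normalizer being exactly Pr(o_{t+1} | a_t, b_t). A minimal sketch in Python; the two-state numbers below are hypothetical, tiger-style, for illustration only.

```python
def belief_update(states, T, Z, b, a, o):
    # b'(s') ∝ Pr(o | s', a) * Σ_s Pr(s' | s, a) b(s);
    # the normalizer is exactly Pr(o | b, a).
    unnorm = {s2: Z[s2, a, o] * sum(T[s, a, s2] * b[s] for s in states)
              for s2 in states}
    p_o = sum(unnorm.values())
    return {s2: unnorm[s2] / p_o for s2 in states}, p_o

# Hypothetical two-state example: a "listen" action that leaves the state
# unchanged, and an 85%-accurate observation.
states = ["left", "right"]
T = {(s, "listen", s2): (1.0 if s == s2 else 0.0) for s in states for s2 in states}
Z = {("left", "listen", "hear-left"): 0.85, ("left", "listen", "hear-right"): 0.15,
     ("right", "listen", "hear-left"): 0.15, ("right", "listen", "hear-right"): 0.85}
b0 = {"left": 0.5, "right": 0.5}
b1, p = belief_update(states, T, Z, b0, "listen", "hear-left")
# b1["left"] = 0.85*0.5 / (0.85*0.5 + 0.15*0.5) = 0.85
```

Returning the normalizer alongside the new belief is convenient later, since Pr(o' | b, a) is reused by the belief-state MDP transition model and by value iteration.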

SLIDE 12

Markovian Policies

  • Beliefs are sufficient statistics equivalent to histories (together with the initial belief):

(b_0, h_t) ⇔ b_t

  • Policies:

– Based on histories: π: B_0 × H_t → A_t

  • Non-Markovian

– Based on beliefs: π: B → A

  • Markovian
SLIDE 13

Belief State MDPs

  • POMDPs can be viewed as belief state MDPs

– States: B (beliefs)
– Actions: A
– Transitions: Pr(b_{t+1} | b_t, a_t) = Pr(o_{t+1} | b_t, a_t) if b_t, a_t, o_{t+1} → b_{t+1}, and 0 otherwise
– Rewards: R(b, a) = Σ_s b(s) R(s, a)

  • Belief state MDPs are:

– Fully observable
– Defined over a continuous belief space
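The reward line above is just an expectation of the state reward under the belief, which is a one-line computation. A sketch with hypothetical numbers for illustration:

```python
def belief_reward(b, R, a):
    # R(b, a) = Σ_s b(s) R(s, a): expected immediate reward under belief b.
    return sum(b[s] * R[s, a] for s in b)

# Hypothetical numbers (tiger-style, illustration only):
R = {("left", "open-right"): 10.0, ("right", "open-right"): -100.0}
b = {"left": 0.85, "right": 0.15}
r = belief_reward(b, R, "open-right")   # 0.85*10 + 0.15*(-100) = -6.5
```

Note how an action that is good in the most likely state can still have negative expected reward when the penalty in the unlikely state is large; this is why POMDP policies gather information before acting.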

SLIDE 14

Policy Evaluation

  • Value V^π of a POMDP policy π

– Expected sum of rewards:

V^π(b) = E[Σ_t γ^t R(b_t, π(b_t))]

– Policy evaluation: Bellman's equation

V^π(b) = R(b, π(b)) + γ Σ_{b'} Pr(b' | b, π(b)) V^π(b')   ∀b

– Equivalent equation (summing over observations instead of next beliefs, where b^{a,o'} is the belief obtained from b after action a and observation o'):

V^π(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V^π(b^{π(b),o'})   ∀b

SLIDE 15

Policy Tree Value Function

  • Theorem: the value function V^π(b) of a policy tree is linear in b

– i.e., V^π(b) = Σ_s α(s) b(s) for some α-vector

  • Proof by induction:

– Base case: at the leaves,

V^π_0(b) = R(b, π(b)) = Σ_s b(s) R(s, π(b))

– Hence α(s) = R(s, π(b))

– Induction hypothesis: for every tree of depth n, there exists an α-vector such that V^π_n(b) = Σ_s b(s) α(s)

SLIDE 16

Proof continued

  • Induction step (write a = π(b) for the root action, and α_{o'} for the α-vector of the depth-n subtree followed after observing o'):

V^π_{n+1}(b)
= R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V^π_n(b^{π(b),o'})
= R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s'} b^{π(b),o'}(s') α_{o'}(s')
= R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s,s'} [b(s) Pr(s' | s, π(b)) Pr(o' | s', π(b)) / Pr(o' | b, π(b))] α_{o'}(s')
= Σ_s b(s) [ R(s, π(b)) + γ Σ_{o',s'} Pr(s' | s, π(b)) Pr(o' | s', π(b)) α_{o'}(s') ]

The bracketed term is the new α(s), so V^π_{n+1} is again linear in b.
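The bracketed expression is directly computable: given a root action and one α-vector per observation subtree, it produces the tree's own α-vector. A sketch below; the tiny two-state model is hypothetical, chosen with a single uninformative observation so the backup reduces to an ordinary MDP backup that is easy to check by hand.

```python
def tree_alpha(states, obs, T, Z, R, gamma, a, child_alpha):
    # α(s) = R(s, a) + γ Σ_{s',o'} Pr(s'|s,a) Pr(o'|s',a) α_{o'}(s'),
    # where child_alpha[o'] is the α-vector of the subtree taken after o'.
    return {s: R[s, a] + gamma * sum(T[s, a, s2] * Z[s2, a, o] * child_alpha[o][s2]
                                     for s2 in states for o in obs)
            for s in states}

# Hypothetical model: two states, one action "a", one observation "o".
states, obs = ["s0", "s1"], ["o"]
T = {("s0", "a", "s0"): 0.0, ("s0", "a", "s1"): 1.0,
     ("s1", "a", "s0"): 0.0, ("s1", "a", "s1"): 1.0}   # s1 is absorbing
Z = {("s0", "a", "o"): 1.0, ("s1", "a", "o"): 1.0}     # observation carries no information
R = {("s0", "a"): 1.0, ("s1", "a"): 2.0}
leaf = {s: R[s, "a"] for s in states}                  # base case: α(s) = R(s, a)
alpha = tree_alpha(states, obs, T, Z, R, 0.5, "a", {"o": leaf})
# alpha["s0"] = 1 + 0.5*2 = 2.0 ; alpha["s1"] = 2 + 0.5*2 = 3.0
```

This function is the core backup reused by point-based value iteration later in the deck.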

SLIDE 17

Value Function

  • Corollary: the value function of a policy made up of a set of trees is piecewise linear

  • Proof:

– Each tree contributes a linear piece over a region of the belief space
– Hence the value function is made up of several linear pieces

SLIDE 18

Optimal Value Function

  • Theorem: the optimal value function V*(b) for a finite horizon is piecewise linear and convex in b

  • Proof:

– There are finitely many trees of finite depth
– Each tree gives rise to a linear piece α
– At each belief, select the highest linear piece
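Piecewise linearity and convexity show up directly when the value function is stored as a set of α-vectors: evaluating V*(b) is just the maximum of dot products. A sketch with two hypothetical α-vectors over a two-state belief simplex:

```python
def value(b, Gamma):
    # V(b) = max_{α∈Γ} Σ_s α(s) b(s): the upper surface of the linear pieces.
    return max(sum(alpha[s] * b[s] for s in b) for alpha in Gamma)

# Two hypothetical α-vectors; along b = (p, 1-p) this gives V(b) = max(p, 1-p),
# i.e., two linear pieces meeting at p = 0.5 (the kink of the convex surface).
Gamma = [{"s1": 1.0, "s2": 0.0}, {"s1": 0.0, "s2": 1.0}]
v_low = value({"s1": 0.2, "s2": 0.8}, Gamma)   # 0.8
v_mid = value({"s1": 0.5, "s2": 0.5}, Gamma)   # 0.5
```

Since a max of linear functions is always convex, convexity of V* is automatic in this representation.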

SLIDE 19

Value Iteration

  • Bellman's equation:

V*(b) = max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a,o'}) ]

  • Value iteration:

– Idea: repeat

V*(b) ← max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a,o'}) ]   ∀b

– But we can't enumerate all beliefs
– Instead, compute linear pieces α for a subset of beliefs

SLIDE 20

Point-Based Value Iteration

  • Let B = {b_1, b_2, …, b_k} be a subset of beliefs
  • Let Γ = {α_1, α_2, …, α_k} be a set of α-vectors such that α_i is associated with b_i

  • Point-based value iteration:

– Repeatedly improve V(b_i) at each b_i:

V(b_i) = max_a [ R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) max_{α∈Γ} α(b_i^{a,o'}) ]

– Find α_i such that V(b_i) = Σ_s b_i(s) α_i(s):

  • α_{a,o'} ← argmax_{α∈Γ} Σ_{s'} b_i^{a,o'}(s') α(s')
  • a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α_{a,o'}(b_i^{a,o'})
  • α_i(s) ← R(s, a*) + γ Σ_{s',o'} Pr(s' | s, a*) Pr(o' | s', a*) α_{a*,o'}(s')

SLIDE 21

Algorithm

Point-Based Value Iteration(B, h)
  Let B be a set of beliefs
  α_init(s) = min_{a,s} R(s, a) / (1 − γ)   ∀s
  Γ_0 ← {α_init}
  For n = 1 to h do
    For each b_i ∈ B do
      α_{a,o'} ← argmax_{α∈Γ_{n−1}} Σ_{s'} b_i^{a,o'}(s') α(s')   ∀a, o'
      a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α_{a,o'}(b_i^{a,o'})
      α_i(s) ← R(s, a*) + γ Σ_{s',o'} Pr(s' | s, a*) Pr(o' | s', a*) α_{a*,o'}(s')
    Γ_n ← {α_i ∀i}
  Return Γ_h
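Putting the pseudocode together, here is a self-contained sketch of the loop above. It is a minimal illustration rather than an optimized implementation; the model it runs on is the standard "tiger" problem (the numbers come from the literature, not from these slides), and observations with zero probability at a belief point are simply skipped in the backup.

```python
def belief_update(states, T, Z, b, a, o):
    # b^{a,o}(s') ∝ Pr(o | s', a) Σ_s Pr(s' | s, a) b(s); normalizer is Pr(o | b, a).
    unnorm = {s2: Z[s2, a, o] * sum(T[s, a, s2] * b[s] for s in states)
              for s2 in states}
    p_o = sum(unnorm.values())
    b2 = {s2: unnorm[s2] / p_o for s2 in states} if p_o > 0 else None
    return b2, p_o

def pbvi(states, actions, obs, T, Z, R, gamma, beliefs, h):
    # Γ_0: a single conservative vector, α_init(s) = min_{a,s} R(s, a) / (1 - γ).
    worst = min(R[s, a] for s in states for a in actions) / (1.0 - gamma)
    Gamma = [{s: worst for s in states}]
    for _ in range(h):
        new_Gamma = []
        for b in beliefs:                       # one backup per belief point
            best_val, best_alpha = None, None
            for a in actions:
                chosen = {}                     # α_{a,o'} ← argmax_{α∈Γ} α(b^{a,o'})
                val = sum(b[s] * R[s, a] for s in states)
                for o in obs:
                    b2, p_o = belief_update(states, T, Z, b, a, o)
                    if p_o > 0:
                        al = max(Gamma,
                                 key=lambda al: sum(al[s] * b2[s] for s in states))
                        chosen[o] = al
                        val += gamma * p_o * sum(al[s] * b2[s] for s in states)
                if best_val is None or val > best_val:   # a* ← argmax_a ...
                    best_val = val
                    # α_i(s) ← R(s,a*) + γ Σ_{s',o'} Pr(s'|s,a*) Pr(o'|s',a*) α_{a*,o'}(s')
                    best_alpha = {s: R[s, a] + gamma * sum(
                        T[s, a, s2] * Z[s2, a, o] * chosen[o][s2]
                        for s2 in states for o in chosen) for s in states}
            new_Gamma.append(best_alpha)
        Gamma = new_Gamma
    return Gamma

# The classic tiger problem as the example model (illustrative numbers).
S = ["left", "right"]
A = ["listen", "open-left", "open-right"]
O = ["hear-left", "hear-right"]
T = {(s, a, s2): ((1.0 if s == s2 else 0.0) if a == "listen" else 0.5)
     for s in S for a in A for s2 in S}
Z = {(s2, a, o): ((0.85 if o == "hear-" + s2 else 0.15) if a == "listen" else 0.5)
     for s2 in S for a in A for o in O}
R = {(s, a): (-1.0 if a == "listen" else (-100.0 if a == "open-" + s else 10.0))
     for s in S for a in A}
beliefs = [{"left": 0.5, "right": 0.5}, {"left": 0.95, "right": 0.05},
           {"left": 0.05, "right": 0.95}]
Gamma = pbvi(S, A, O, T, Z, R, 0.95, beliefs, 30)
v_uniform = max(sum(al[s] * 0.5 for s in S) for al in Gamma)
```

Because the initial α-vector is a uniform lower bound on any achievable value, each backup can only raise the estimated value at the chosen belief points, matching the improvement argument in the pseudocode.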