CMU-Q 15-381
Lecture 15: Predictions in Markov Chains, Markov Decision Processes
Teacher: Gianni A. Di Caro
Making predictions: general two-state MC
§ Eigenvector matrix: $U = [u_1 \; u_2] = \begin{pmatrix} 1 & -\alpha \\ 1 & \beta \end{pmatrix}$, with $U^{-1} = \dfrac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ -1 & 1 \end{pmatrix}$
§ Diagonalization: $U^{-1} T U = \begin{pmatrix} 1 & 0 \\ 0 & 1-\alpha-\beta \end{pmatrix} = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} = \Lambda$, the eigenvalue matrix
§ Pre-multiplying both terms by $U$ and post-multiplying by $U^{-1}$: $T = U \Lambda U^{-1}$
§ $T^2 = (U\Lambda U^{-1})(U\Lambda U^{-1}) = (U\Lambda)(U^{-1}U)(\Lambda U^{-1}) = U \Lambda \Lambda U^{-1} = U \Lambda^2 U^{-1}$, with $\Lambda^2 = \begin{pmatrix} \lambda_1^2 & 0 \\ 0 & \lambda_2^2 \end{pmatrix}$
✓ $p^{(n)} = p^{(0)} T^n$: probability distribution over the states after $n$ steps, given the initial distribution
✓ $p_j^{(n)} = P(X_n = j)$: absolute probability of state $j$ at step $n$, given the initial distribution
§ How do we compute $T^n$? $T = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$, $0 < \alpha, \beta < 1$
$T = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$, $0 < \alpha, \beta < 1$
§ $T^n = U \Lambda^n U^{-1}$, with $\Lambda^n = \begin{pmatrix} \lambda_1^n & 0 \\ 0 & \lambda_2^n \end{pmatrix}$, $U = \begin{pmatrix} 1 & -\alpha \\ 1 & \beta \end{pmatrix}$, $U^{-1} = \dfrac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ -1 & 1 \end{pmatrix}$
§ $T^n = \cdots = \dfrac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix} + \dfrac{\lambda^n}{\alpha+\beta} \begin{pmatrix} \alpha & -\alpha \\ -\beta & \beta \end{pmatrix}$, where $\lambda = 1 - \alpha - \beta$
§ $\lambda^n \to 0$ as $n \to \infty$
§ $T^n \to \dfrac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix} = Q$, the matrix $T^n$ in the limit of large $n$
§ Probability distribution over the states after $n$ steps, given the initial distribution $p^{(0)}$:
$$p^{(n)} = p^{(0)} T^n = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} \end{bmatrix} T^n \to \begin{bmatrix} p_1^{(0)} & p_2^{(0)} \end{bmatrix} Q = \frac{1}{\alpha+\beta} \begin{bmatrix} \beta p_1^{(0)} + \beta p_2^{(0)} & \;\; \alpha p_1^{(0)} + \alpha p_2^{(0)} \end{bmatrix} = \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix}$$
as $n \to \infty$, and given that $p_1^{(0)} + p_2^{(0)} = 1$ (see the numerical check below)
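As a quick numerical check (not part of the original slides; a minimal NumPy sketch, with illustrative values for $\alpha$ and $\beta$), the closed form above can be compared against direct matrix powers:

```python
import numpy as np

# Illustrative sketch: verify T^n = U Lambda^n U^{-1} for the 2-state chain.
# alpha and beta are example values; any 0 < alpha, beta < 1 works.
alpha, beta = 0.3, 0.6
T = np.array([[1 - alpha, alpha],
              [beta,      1 - beta]])

lam = 1 - alpha - beta                      # second eigenvalue (lambda_1 = 1)
U = np.array([[1.0, -alpha],
              [1.0,  beta]])                # columns = eigenvectors
U_inv = np.array([[beta, alpha],
                  [-1.0, 1.0]]) / (alpha + beta)

n = 50
Tn_closed = U @ np.diag([1.0, lam**n]) @ U_inv
Tn_power = np.linalg.matrix_power(T, n)
print(np.allclose(Tn_closed, Tn_power))     # True

# As n grows, lam**n -> 0 and both rows of T^n approach the limit matrix Q
Q = np.array([[beta, alpha],
              [beta, alpha]]) / (alpha + beta)
print(np.allclose(Tn_power, Q))             # True for large n
```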
§ State distribution over the states after $n$ steps, given the initial distribution $p^{(0)}$:
$$\lim_{n\to\infty} p^{(0)} T^n = \lim_{n\to\infty} p^{(n)} = \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix} = p$$
→ The chain has a limiting state probability distribution, denoted here as $p$
§ $p$ is independent of $p^{(0)}$
§ → $p$ is an invariant limiting distribution of the chain: the limit exists and it is invariant with respect to the initial distribution
§ The limiting distribution $p$ is also a stationary distribution: if the chain starts (or arrives) in $p$ as a state probability distribution, it stays in $p$ (i.e., the distribution becomes stationary, it won't change): $pT = p$
$$\begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix} \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} = \begin{bmatrix} \frac{\beta(1-\alpha)+\alpha\beta}{\alpha+\beta} & \frac{\alpha\beta+\alpha(1-\beta)}{\alpha+\beta} \end{bmatrix} = \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix}$$
§ To study the long-term behavior of a generic MC with one-step transition matrix $T$ and $m$ states, consider the limit of the $n$-step conditional transition probabilities, denoted by $Q$:
$$\lim_{n\to\infty} p_{ij}^{(n)} = \lim_{n\to\infty} P(X_n = j \mid X_0 = i) = Q_{ij}$$
Three different cases can arise from the limit:
1) A limiting distribution exists
2) Limiting but no invariant distribution
3) No limiting (but possibly stationary) distribution
$$\lim_{n\to\infty} T^n = \lim_{n\to\infty} \begin{pmatrix} p_{11}^{(n)} & p_{12}^{(n)} & \cdots & p_{1m}^{(n)} \\ p_{21}^{(n)} & p_{22}^{(n)} & \cdots & p_{2m}^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1}^{(n)} & p_{m2}^{(n)} & \cdots & p_{mm}^{(n)} \end{pmatrix} = \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1m} \\ Q_{21} & Q_{22} & \cdots & Q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{m1} & Q_{m2} & \cdots & Q_{mm} \end{pmatrix}$$
1. Limiting distribution: consider the case when, for all $i, j$, the limit converges to a value that does not depend on the starting state $i$, that is, $Q_{ij} = Q_j$, with $\sum_{j=1}^m Q_j = 1$:
$$\lim_{n\to\infty} T^n = \lim_{n\to\infty} \begin{pmatrix} p_{11}^{(n)} & p_{12}^{(n)} & \cdots & p_{1m}^{(n)} \\ p_{21}^{(n)} & p_{22}^{(n)} & \cdots & p_{2m}^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1}^{(n)} & p_{m2}^{(n)} & \cdots & p_{mm}^{(n)} \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 & \cdots & Q_m \\ Q_1 & Q_2 & \cdots & Q_m \\ \vdots & \vdots & \ddots & \vdots \\ Q_1 & Q_2 & \cdots & Q_m \end{pmatrix}$$
$$\lim_{n\to\infty} p_{ij}^{(n)} = \lim_{n\to\infty} P(X_n = j \mid X_0 = i) = Q_j$$
→ The (unconditional) convergence values of the limits of the $n$-step conditional transition probabilities define the limiting distribution of the chain, which is invariant with respect to the initial conditions
§ After the process has been in operation for some long duration, the probability of finding it in state $j$ is $Q_j$, irrespective of the starting state:
$$\lim_{n\to\infty} p^{(0)} T^n = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} & \cdots & p_m^{(0)} \end{bmatrix} \begin{pmatrix} Q_1 & Q_2 & \cdots & Q_m \\ Q_1 & Q_2 & \cdots & Q_m \\ \vdots & \vdots & \ddots & \vdots \\ Q_1 & Q_2 & \cdots & Q_m \end{pmatrix} = \begin{bmatrix} Q_1 \sum_{i=1}^m p_i^{(0)} & Q_2 \sum_{i=1}^m p_i^{(0)} & \cdots & Q_m \sum_{i=1}^m p_i^{(0)} \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 & \cdots & Q_m \end{bmatrix} = p$$
§ From $p^{(n)} = p^{(n-1)} T$, for $n \to \infty$ also $p^{(n)} = p^{(n-1)} = p$ → the limiting distribution is the solution of the fixed point equation $pT = p$
§ Because of the above equation, the limiting distribution is always also a stationary distribution: if the chain starts with, or arrives at any step $n$ to, a state probability distribution equal to $p$, it doesn't change anymore
§ $p = pT$ looks similar to an eigenvector equation $Av = \lambda v$, with eigenvalue $\lambda = 1$
§ Transposing both sides: $p^\top = (pT)^\top \Rightarrow T^\top p^\top = p^\top$, which is a "regular" (right) eigenvector equation
§ → The transposed transition matrix $T^\top$ has eigenvectors with eigenvalue 1 that are stationary distributions expressed as column vectors
§ Therefore, if the eigenvectors of the transposed transition matrix $T^\top$ are known, then so are the stationary distributions of the Markov chain. This can save a lot of computation, avoiding computing powers of $T$!
§ The stationary distribution is a left eigenvector (as opposed to the usual right eigenvectors) of the transition matrix: $p = pT$
§ Note: when there are multiple eigenvectors associated with an eigenvalue of value 1, each such eigenvector gives rise to an associated stationary distribution; this happens when the chain is reducible, i.e., has multiple communicating classes.
✓ Using $p = pT$ we can easily find the stationary distribution (assuming there is one, and independently from the limiting distribution) either:
✓ by solving the linear equation $p = pT$
✓ or by using the eigenvectors of the transposed transition matrix $T^\top$ (see the sketch below)
§ For instance, in the case of the general 2-state MC, let $p = [x \;\; 1-x]$; solving the matrix equation $pT = p$ gives $x(1-\alpha) + (1-x)\beta = x \Rightarrow x = \frac{\beta}{\alpha+\beta}$, so the stationary distribution is $p = \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix}$
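A minimal NumPy sketch of this recipe (the parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

# Stationary distribution as the left eigenvector of T with eigenvalue 1,
# found from the (right) eigenvectors of T^T.
alpha, beta = 0.3, 0.6                  # example parameters
T = np.array([[1 - alpha, alpha],
              [beta,      1 - beta]])

eigvals, eigvecs = np.linalg.eig(T.T)   # T^T v = lambda v
idx = np.argmin(np.abs(eigvals - 1.0))  # locate the eigenvalue lambda = 1
p = np.real(eigvecs[:, idx])
p /= p.sum()                            # normalize to a probability vector

print(p)                                                # [0.6667 0.3333]
print([beta / (alpha + beta), alpha / (alpha + beta)])  # closed form, same
```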
2. Limiting but no invariant distribution: consider the case when, for all $i, j$, the limit converges to values $Q_{ij}$ that, for each $j$, depend on the initial state $i$, so that we cannot write $Q_{ij}$ as $Q_j$ as before; $\sum_{j=1}^m Q_{ij} = 1, \; \forall i$ must hold:
$$\lim_{n\to\infty} T^n = \lim_{n\to\infty} \begin{pmatrix} p_{11}^{(n)} & p_{12}^{(n)} & \cdots & p_{1m}^{(n)} \\ p_{21}^{(n)} & p_{22}^{(n)} & \cdots & p_{2m}^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1}^{(n)} & p_{m2}^{(n)} & \cdots & p_{mm}^{(n)} \end{pmatrix} = \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1m} \\ Q_{21} & Q_{22} & \cdots & Q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{m1} & Q_{m2} & \cdots & Q_{mm} \end{pmatrix}$$
$$\lim_{n\to\infty} p^{(0)} T^n = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} & \cdots & p_m^{(0)} \end{bmatrix} \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1m} \\ Q_{21} & Q_{22} & \cdots & Q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{m1} & Q_{m2} & \cdots & Q_{mm} \end{pmatrix}$$
→ Each different initial distribution $p^{(0)}$ defines a possibly different limiting (stationary) distribution
§ Example: $T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I$, the 2-state MC in the boundary case $\alpha = \beta = 0$
§ $T^n = T$ for all $n$, so a limiting distribution does exist, but it always depends on $p^{(0)}$:
$$\begin{bmatrix} p_1^{(0)} & p_2^{(0)} \end{bmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} \end{bmatrix}$$
3. No limiting distribution: the limit doesn't converge to a value $Q_{ij}$ for all $i, j$; therefore a limiting distribution as defined doesn't exist.
§ $T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$; in this case $T^{2n} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $T^{2n+1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = T$
§ → the succession of $T$'s powers oscillates between the two matrices; the MC is periodic with period 2
§ However, a stationary distribution can still exist
§ Limiting ⇒ stationary, but the opposite doesn't necessarily hold
§ $T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, with $T^{2n} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $T^{2n+1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = T$
§ The solution of the fixed point equation $pT = p$:
$$\begin{bmatrix} a & 1-a \end{bmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{bmatrix} a & 1-a \end{bmatrix} \;\to\; \begin{bmatrix} 1-a & a \end{bmatrix} = \begin{bmatrix} a & 1-a \end{bmatrix}$$
The resulting equation system, $1 - a = a$ and $a = 1 - a$, is satisfied by $a = 0.5$ → $p = [0.5 \;\; 0.5]$ is a stationary distribution
§ This is intuitively expected: the oscillating behavior of the powers of $T$, which results in pairwise symmetric matrices, perfectly balances the probabilities of the two states of the chain (see the sketch below).
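A short NumPy sketch of this case (illustrative, not from the slides): the powers oscillate, yet $p = [0.5 \;\; 0.5]$ is left unchanged by $T$.

```python
import numpy as np

# Case 3: period-2 chain, powers of T oscillate, so no limiting
# distribution exists, but p = [0.5, 0.5] is still stationary.
T = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(np.linalg.matrix_power(T, 10))   # even power -> identity
print(np.linalg.matrix_power(T, 11))   # odd power  -> T again

p = np.array([0.5, 0.5])
print(p @ T)                           # [0.5 0.5]: unchanged, stationary
```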
§ What about $m$-state chains? → Same analysis as for the 2-state case, with $m$-dimensional matrices
§ $T^n = U \Lambda^n U^{-1}$, $\Lambda^n = \begin{pmatrix} \lambda_1^n & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_m^n \end{pmatrix}$
§ Example: $T = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/2 \end{pmatrix}$
§ Eigenvectors: $v_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$, $v_2 = \begin{pmatrix} -1 \\ -1 \\ 2 \end{pmatrix}$, $v_3 = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}$
§ Eigenvalues: $\lambda_1 = 1$, $\lambda_2 = 1/4$, $\lambda_3 = -1/4$
§
$$T^n = \frac{1}{6} \begin{pmatrix} 1 & -1 & 1 \\ 1 & -1 & -1 \\ 1 & 2 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & (1/4)^n & 0 \\ 0 & 0 & (-1/4)^n \end{pmatrix} \begin{pmatrix} 2 & 2 & 2 \\ -1 & -1 & 2 \\ 3 & -3 & 0 \end{pmatrix} \;\xrightarrow{\; n \to \infty \;}\; Q = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$$
§ $p^{(n)} = p^{(0)} T^n$ → the limiting distribution: $p = \lim_{n\to\infty} p^{(n)} = \lim_{n\to\infty} p^{(0)} T^n = p^{(0)} Q$
§ $p = [1/3 \;\; 1/3 \;\; 1/3]$, which is also a stationary distribution
§ $p = \lim_{n\to\infty} p^{(n)} = \lim_{n\to\infty} p^{(0)} T^n = p^{(0)} Q = [1/3 \;\; 1/3 \;\; 1/3]$ is an invariant limiting distribution (a stationary distribution); written out (numerical check below):
$$\begin{bmatrix} p_1^{(0)} & p_2^{(0)} & p_3^{(0)} \end{bmatrix} \begin{pmatrix} q_{11} & q_{12} & q_{13} \\ q_{21} & q_{22} & q_{23} \\ q_{31} & q_{32} & q_{33} \end{pmatrix} = \begin{bmatrix} p_1^{(0)} q_{11} + p_2^{(0)} q_{21} + p_3^{(0)} q_{31} \\ p_1^{(0)} q_{12} + p_2^{(0)} q_{22} + p_3^{(0)} q_{32} \\ p_1^{(0)} q_{13} + p_2^{(0)} q_{23} + p_3^{(0)} q_{33} \end{bmatrix}^\top = \begin{bmatrix} \tfrac{1}{3}\big(p_1^{(0)} + p_2^{(0)} + p_3^{(0)}\big) \\ \tfrac{1}{3}\big(p_1^{(0)} + p_2^{(0)} + p_3^{(0)}\big) \\ \tfrac{1}{3}\big(p_1^{(0)} + p_2^{(0)} + p_3^{(0)}\big) \end{bmatrix}^\top = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{bmatrix}$$
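A small NumPy check of this example (the starting distribution p0 below is an arbitrary illustration):

```python
import numpy as np

# The 3-state example above: all rows of T^n converge to the
# limiting (and stationary) distribution [1/3, 1/3, 1/3].
T = np.array([[1/4, 1/2, 1/4],
              [1/2, 1/4, 1/4],
              [1/4, 1/4, 1/2]])

Tn = np.linalg.matrix_power(T, 50)
print(Tn)                          # every row ~ [0.3333 0.3333 0.3333]

# Any initial distribution is mapped to the same limit
p0 = np.array([0.9, 0.05, 0.05])   # arbitrary example start
print(p0 @ Tn)                     # ~ [0.3333 0.3333 0.3333]
```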
§ Fundamental prediction queries:
§ What will be the probability of each state in the long run?
§ What will be the probability of state $j$ in the long run?
§ Will the state probability distribution be stationary?
§ Under which conditions does a MC have a limiting (and therefore a stationary) distribution?
§ Under which conditions is the limiting distribution invariant?
§ How long does it take to (approximately) reach the limiting distribution?
[Figure: snapshots of the state distribution at t = 0, 100, 1000, and 100000 vs. 100001]
§ Answering these questions requires introducing a state classification (a code sketch for some of these checks follows this list):
§ Absorbing states: once entered, there's no escape: $p_{ii} = 1$, $p_{ij} = 0 \;\; \forall j \neq i$
§ Periodic states: the probability of a return to state $i$ satisfies $p_{ii}^{(n)} > 0$ only for $n = d, 2d, 3d, \ldots$ (periodic with period $d$)
§ Persistent states (also referred to as recurrent states): following a first visit, a return to the state at some step is certain
§ Non-null (persistent) states: starting in state $i$, the mean number of steps $m_i$ to return to $i$ is finite, $m_i < \infty$
§ Null (persistent) states: starting in state $i$, the mean number of steps $m_i$ to return to $i$ is infinite, $m_i = \infty$
§ Transient states: a return to the state is not certain
§ Ergodic states: aperiodic + persistent + non-null
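A hedged sketch of how some of these classes can be read off a transition matrix; the helper names (absorbing_states, reachable, is_transient) and the example matrix are illustrative additions, and the transience test relies on the chain being finite:

```python
import numpy as np

def absorbing_states(T):
    """States i with T[i, i] == 1 (hence T[i, j] == 0 for j != i)."""
    return [i for i in range(len(T)) if np.isclose(T[i, i], 1.0)]

def reachable(T, i):
    """Set of states reachable from i in any number of steps (DFS)."""
    seen, stack = {i}, [i]
    while stack:
        s = stack.pop()
        for j in np.nonzero(T[s] > 0)[0]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_transient(T, i):
    # In a finite chain, i is transient iff some reachable state j
    # cannot lead back to i (i's communicating class is not closed).
    return any(i not in reachable(T, j) for j in reachable(T, i))

T = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 absorbs; 0 and 1 are transient
print(absorbing_states(T))                      # [2]
print([is_transient(T, i) for i in range(3)])   # [True, True, False]
```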
§ Irreducible chain: every state can be reached (is accessible) from every other state of the chain in a finite number of steps
§ Since any state $s_j$ can be reached from any other state $s_i$, irreducibility means: $p_{ij}^{(n)} > 0$ for some integer $n$
§ A matrix $Q = (q_{ij})$ is said to be positive if $q_{ij} > 0$ for all $i, j$
§ Regular Markov chain: there exists an integer $n$ such that $T^n$ is positive (see the sketch below)
§ Regular chain ⇒ irreducible
§ Irreducible ⇏ regular chain (not necessarily):
$$T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad T^{2n} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad T^{2n+1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = T$$
$T$ is irreducible, but no power of $T$ is a positive matrix
§ Theorem: all states of an irreducible chain are of the same type, either all transient or all persistent, and all have the same period.
§ However, they cannot all be transient, since that would mean that a return to any state is not certain even though all states are accessible from all other states in a finite number of steps ⇒ all states are recurrent
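A sketch of a regularity check along these lines (the function name is an illustrative addition; the power bound $(m-1)^2 + 1$ is the standard Wielandt bound for primitive matrices):

```python
import numpy as np

def is_regular(T, max_power=None):
    """A chain is regular if some power of T is strictly positive.

    For an m-state chain it suffices to check powers up to (m-1)^2 + 1.
    """
    m = len(T)
    k = max_power or (m - 1) ** 2 + 1
    P = np.eye(m)
    for _ in range(k):
        P = P @ T
        if np.all(P > 0):
            return True
    return False

T_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])   # irreducible but not regular
T_general = np.array([[0.7, 0.3],
                      [0.6, 0.4]])    # regular: T itself is positive
print(is_regular(T_periodic), is_regular(T_general))   # False True
```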
§ Ergodic chain: all states are ergodic, that is, persistent, non-null, and aperiodic
§ Irreducible + aperiodic states ⇒ ergodic
§ Regular ⇒ irreducible (⇒ recurrent states) + aperiodic ⇒ ergodic
§ Note: ergodic ⇏ regular
§ A MC is ergodic if there is a number $n$ such that any state can be reached from any other state in any number of steps greater than or equal to $n$
§ In the case of a fully connected transition matrix, where all transitions have a non-zero probability, this condition is trivially fulfilled with $n = 1$
§ Ergodic Markov chains have a limiting invariant distribution $p$
§ → they have a stationary distribution $p$
§ → regardless of the initial state, the time-$t$ distribution of the chain converges to $p$ as $t$ tends to infinity
§ How large must $t$ be until the time-$t$ distribution is approximately $p$? → Mixing time
§ For an ergodic chain, the invariant distribution $p$ is the vector of the reciprocals of the mean recurrence times
§ Check for ergodicity: if only one eigenvalue of $T$ takes value 1, then the Markov chain is ergodic (this derives from the eigenvector equation $T^\top p^\top = p^\top$); a numerical sketch follows below
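A numerical sketch of the eigenvalue check and of mixing, reusing the 3-state example chain from earlier (counting eigenvalues of modulus 1, which also rules out periodic chains, is an illustrative strengthening of the check above):

```python
import numpy as np

T = np.array([[1/4, 1/2, 1/4],
              [1/2, 1/4, 1/4],
              [1/4, 1/4, 1/2]])

# Ergodicity check: exactly one eigenvalue on the unit circle
eigvals = np.linalg.eigvals(T.T)
num_unit = np.sum(np.isclose(np.abs(eigvals), 1.0))
print("ergodic" if num_unit == 1 else "not ergodic")   # ergodic

# Crude look at mixing: distance to the invariant p shrinks geometrically
p = np.full(3, 1/3)              # invariant distribution of this chain
pt = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0
for t in range(1, 11):
    pt = pt @ T
    print(t, 0.5 * np.abs(pt - p).sum())   # total-variation distance
```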
§ Markov chain: prediction, what is the state distribution at time $t$? Common ingredients: a discrete-time random process, a countable state set, and a transition matrix that defines the internal stochastic dynamics (the environment's dynamics)
§ Markov reward process: MC ∪ {rewards}: prediction, what is the expected cumulative reward at time $t$? What is the state distribution at $t$?
[Figure: chain diagrams annotated with rewards (+1, +2, +3) and actions]
§ Markov decision process (MDP): MC ∪ {rewards} ∪ {actions}: control, what is the optimal decision policy to optimize the collected rewards?
[Figure: deterministic actions vs. uncertain actions]
Action effect is stochastic: a probability distribution over next states.
In general the process is non-Markov: the outcome can depend on the entire state-action history:
$$P(s_{t+1} = s' \mid s_t, s_{t-1}, \ldots, s_0, a_t, a_{t-1}, \ldots, a_0) = P(s_{t+1} = s' \mid s_{t:0}, a_{t:0})$$
✓ Deterministic: one single successor state, $(s, a) \to s'$
✓ Probabilistic: conditional distribution over successor states + Markov property: $(s_t, a_t) \to P(s_{t+1} = s' \mid s_t = s, a_t = a)$, i.e., $(s, a) \to P(s' \mid s, a)$
§ A maze-like problem:
§ The agent lives in a grid world
§ Walls block the agent's path
§ The agent receives rewards each time step:
§ a small "living" reward $R$ each step (can be negative)
§ big rewards come at the end (good or bad)
§ Goal: maximize the sum of rewards
§ Potentially unlimited horizon
§ Noisy movement: actions do not always go as planned:
§ 80% of the time, the action takes the agent in the desired direction (if there is no wall there)
§ 10% of the time, the action takes the agent in the direction perpendicular to the right; 10% perpendicular to the left (see the transition-model sketch below)
§ If there is a wall in the direction the agent would have gone, the agent stays put
[Figure: grid world with Exit states (one with reward +1) and living reward R]
How do we formalize it and find the optimal policy?
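A hedged sketch of the 80/10/10 transition model just described; the function name, direction encoding, and grid layout are illustrative assumptions, not the course's code:

```python
# Directions as (row, col) offsets; perpendicular pairs for the noise model
UP, RIGHT, DOWN, LEFT = (-1, 0), (0, 1), (1, 0), (0, -1)
PERP = {UP: (LEFT, RIGHT), DOWN: (LEFT, RIGHT),
        LEFT: (UP, DOWN),  RIGHT: (UP, DOWN)}

def successor_dist(state, action, grid):
    """P(s' | s, a) for the noisy grid world, as a dict {state: prob}."""
    dist = {}
    side1, side2 = PERP[action]
    for d, prob in [(action, 0.8), (side1, 0.1), (side2, 0.1)]:
        r, c = state[0] + d[0], state[1] + d[1]
        # stay put if the move hits a wall ('#') or leaves the grid
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == '#':
            r, c = state
        dist[(r, c)] = dist.get((r, c), 0.0) + prob
    return dist

grid = ["...+",    # '+' exit with reward +1 (illustrative layout)
        ".#.-",    # '#' wall, '-' bad exit
        "...."]
print(successor_dist((2, 0), UP, grid))
# {(1, 0): 0.8, (2, 0): 0.1, (2, 1): 0.1}
```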
Goal: define the action decision policy $\pi(s, a)$ that maximizes a given (utility) function of the rewards, potentially for $t \to \infty$
§ A set $S$ of world states
§ A set $A$ of feasible actions
§ A stochastic transition matrix $T$, $T: S \times S \times A \times \{0, 1, \ldots, t\} \mapsto [0, 1]$, $T(s, s', a) = P(s' \mid s, a)$
§ A reward function $R$: $R(s)$, $R(s, a)$, or $R(s, a, s')$, $R: S \times A \times S \times \{0, 1, \ldots, t\} \longmapsto \mathbb{R}$
§ A start state (or a distribution over initial states), optional
§ Terminal/absorbing states, optional
The presence of $t$ accounts for non-homogeneous Markov processes (a container sketch follows below)
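A minimal sketch of this tuple as a Python container; the field names and types are my assumptions, and the time index of non-homogeneous processes is omitted for simplicity:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State, Action = str, str   # illustrative; any hashable types work

@dataclass
class MDP:
    states: List[State]                                     # S
    actions: List[Action]                                   # A
    # T(s, a) -> distribution over successor states s'
    transition: Callable[[State, Action], Dict[State, float]]
    # R(s, a, s') -> immediate reward (the R(s) and R(s, a) variants
    # are special cases that ignore some arguments)
    reward: Callable[[State, Action, State], float]
    start: Optional[State] = None                           # optional
    terminals: Tuple[State, ...] = ()                       # optional
```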
General model for probabilistic planning: MDP = $\langle S, s_{start}, S_{terminal}, A, T, R \rangle$: find the policy optimizing the expected utility
Classical deterministic planning: P = $\langle S, s_{start}, S_{goal}, A, T, c, G \rangle$: find the action sequence achieving the (best) goal state (least-cost path)
Example from Sutton and Barto.
Note: the "state" (the robot's battery status) is a parameter of the agent itself, not a property of the physical environment.
§ At each step, a recycling robot has to decide whether it should: search for a can; wait for someone to bring it a can; or go to the home base and recharge.
§ Searching is better but runs down the battery; if the battery runs out while searching, the robot has to be rescued.
§ States are battery levels: high, low.
§ Reward = (expected) number of cans collected (a dynamics sketch follows below)
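A hedged sketch of the resulting dynamics, following Sutton and Barto's recycling-robot example (Example 3.3 in their book); the numeric values of alpha, beta, r_search, and r_wait are illustrative placeholders, not fixed by the problem:

```python
# Recycling-robot dynamics: P(s', r | s, a) as {(s, a): [(prob, s', reward)]}
alpha, beta = 0.9, 0.4           # P(battery stays high / stays low) on search
r_search, r_wait = 2.0, 1.0      # expected cans per step, search > wait

dynamics = {
    ("high", "search"):   [(alpha,     "high", r_search),
                           (1 - alpha, "low",  r_search)],
    ("low",  "search"):   [(beta,      "low",  r_search),
                           (1 - beta,  "high", -3.0)],    # rescued: penalty
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "wait"):     [(1.0, "low",  r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
```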
§ In deterministic single-agent search problems, we were looking for an optimal plan, or sequence of actions, from the start to a goal
§ In MDPs we (usually) don't have a specific goal; instead we look for a policy, a mapping from states to actions: $\pi: S \to A$
§ $\pi(s)$ deterministically specifies what action to take in each state → deterministic policy
§ An explicit policy defines a reflex agent
§ A policy can also be stochastic: $\pi(s, a)$ specifies the probability of taking action $a$ in state $s$
§ In MDPs, if $T$ is deterministic, the optimal policy is deterministic
§ How many non-terminal (absorbing) states? How many actions? How many deterministic policies?
§ 9, 4, and $4^9 = 262144$
§ For a grid of 100×100 cells, the number of policies is $4^{10000}$, a huge number!