CMU-Q 15-381
Lecture 16: Markov Decision Processes I
Teacher: Gianni A. Di Caro
Goal: Define the action decision policy that maximizes a given (utility) function of the rewards, potentially for t → ∞
§ A set S of world states
§ A set A of feasible actions
§ A stochastic transition matrix P, P: S×S×A×{0, 1, …, T} ↦ [0, 1], P(s, s′, a) = p(s′ | s, a)
§ A reward function R: R(s), R(s, a), R(s, a, s′), R: S×A×S×{0, 1, …, T} ⟼ ℝ
§ A start state (or a distribution of initial states), optional
§ Terminal/Absorbing states, optional
§ Deterministic Policy π(s): a mapping from states to actions, π: S → A
§ Stochastic Policy, π(s, a): a mapping from states to a probability distribution over actions
Example from Sutton and Barto
Note: the “state” (robot’s battery status) is a parameter of the agent itself, not a property of the physical environment
§ At each step, a recycling robot has to decide whether it should: search for a can; wait for someone to bring it a can; go to home base and recharge.
§ Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued.
§ States are battery levels: high, low.
§ Reward = number of cans collected (expected)
§ Starting from s₀, applying the policy π generates a sequence of states s₀, s₁, ⋯, s_T, and of rewards r₀, r₁, ⋯, r_T
§ For the (rational) decision-maker each sequence has a utility based on its preferences
§ Utility is a function of the sequence of rewards: U(reward sequence) → "additive function of the rewards"
§ The expected utility, or value, of a policy π starting in state s₀ is the expected utility over all the state sequences generated by applying π, which depend on the state transition dynamics:

V^π(s₀) = Σ_{seq ∈ {all state sequences starting from s₀}} P^π(seq) U(seq)
§ An optimal policy π* yields the maximal utility = maximal expected utility function of the rewards from following the policy starting from the initial state
✓ Principle of maximum expected utility: a rational agent should choose the action(s) that maximize its expected utility
§ Note: Different optimal policies arise from different reward models, which, in turn, determine different utilities for the same action sequence → Let's look at the grid world…
(Four grid-world policies compared, for R(s) = −2.0, R(s) = −0.4, R(s) = −0.04, R(s) = −0.01)
Balance between risk and reward changes depending on the value of R(s)
R(s) > 0
§ A robot car wants to travel far and quickly; it gets higher rewards for moving fast
§ Three states: Cool, Warm, Overheated (terminal state, ends the process)
§ Two actions: Slow, Fast
§ Going faster gets double reward
§ Green numbers are rewards
(Diagram: states Cool, Warm, Overheated. From Cool: Slow → Cool (1.0, +1); Fast → Cool (0.5, +2) or Warm (0.5, +2). From Warm: Slow → Cool (0.5, +1) or Warm (0.5, +1); Fast → Overheated (1.0).)
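A compact way to hold this MDP in code is a nested transition table. The sketch below is one possible encoding; the arc structure follows the diagram, but the slide shows no reward on the overheating arc, so the −10 penalty used here is an assumption:

```python
# Racing-car MDP as a nested dict:
# transitions[state][action] -> list of (probability, next_state, reward).
transitions = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],  # penalty value is an assumption
    },
    "overheated": {},  # terminal/absorbing state: no feasible actions
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
for state, acts in transitions.items():
    for action, outcomes in acts.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```

This representation makes the chance nodes explicit: each `(state, action)` pair maps to the distribution over successors.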
(Search tree: from a state, the actions slow and fast lead to chance nodes, whose outcomes are the successor states.)
§ What preferences should an agent have over reward sequences?
§ More or less? [1, 2, 2] or [2, 3, 4]?
§ Now or later? [0, 0, 1] or [1, 0, 0]?
§ It’s reasonable to maximize the sum of rewards
§ It’s also reasonable to prefer rewards now to rewards later
§ One solution: values of rewards decay exponentially by a factor γ — worth 1 now, γ one step from now, γ² two steps from now
§ How to discount? Each time we descend a level, we multiply in the discount γ once
§ Why discount? Sooner rewards probably do have higher utility than later rewards; discounting also helps our algorithms converge
§ Example: with a discount of γ = 0.5,
  U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
  U([1, 2, 3]) < U([3, 2, 1])
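The arithmetic above can be checked with a one-line helper (a sketch; `discounted_utility` is a name chosen here):

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

u_123 = discounted_utility([1, 2, 3], 0.5)  # 1*1 + 0.5*2 + 0.25*3 = 2.75
u_321 = discounted_utility([3, 2, 1], 0.5)  # 3*1 + 0.5*2 + 0.25*1 = 4.25
```

With γ = 0.5 the sequence [3, 2, 1] indeed beats [1, 2, 3], since the larger rewards arrive earlier.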
§ Theorem: if we assume stationary preferences between sequences, then there are only two ways to define utilities over sequences of rewards:
§ Additive utility: U([r₀, r₁, r₂, …]) = r₀ + r₁ + r₂ + ⋯
§ Discounted utility: U([r₀, r₁, r₂, …]) = r₀ + γr₁ + γ²r₂ + ⋯
§ MDP:
§ Actions: East, West
§ Terminal states: a and e (the episode ends when one or the other is reached)
§ Transitions: deterministic
§ Reward for reaching a is 10; reward for reaching e is 1; reward for reaching all other states is 0
§ For γ = 1, what is the optimal policy?
§ For γ = 0.1, what is the optimal policy for states b, c and d?
§ For which γ are West and East equally good when in state d?
(Diagram: states a, b, c, d, e in a row; Exit actions at a and e with rewards 10 and 1.)
Answer to the last question: γ = √(1/10)
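Under the convention that the first reward is undiscounted, going East from d earns 1 immediately, while going West earns 10 discounted by two extra steps. A quick numeric check of the balance point (the step-counting convention is an assumption):

```python
import math

def values_from_d(gamma):
    # East: exit at e right away, reward 1 (undiscounted, by assumption).
    # West: pass through c and b, exit at a two steps later: 10 * gamma^2.
    return 10 * gamma ** 2, 1.0

g_balance = math.sqrt(1 / 10)       # gamma claimed to balance the two
west, east = values_from_d(g_balance)

assert abs(west - east) < 1e-9      # equally good at gamma = sqrt(1/10)
assert values_from_d(1.0)[0] > 1.0  # gamma = 1: West (reward 10) wins
assert values_from_d(0.1)[0] < 1.0  # gamma = 0.1: East wins from d
```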
§ Problem: What if the process can last forever? Do we get infinite rewards?
§ Possible solutions:
§ Finite horizon: terminate episodes after a fixed number of steps T (e.g., a lifetime). Gives nonstationary policies (π depends on the time left)
§ Discounting: with 0 < γ < 1,
  U([r₀, ⋯, r_∞]) = Σ_{t=0}^∞ γᵗ r_t ; if r_t = r for all t, then Σ_{t=0}^∞ γᵗ r = r/(1 − γ), so U([r₀, ⋯, r_∞]) ≤ R_max/(1 − γ)
§ Smaller γ means a shorter horizon: the far future will matter less
§ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
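The geometric-series bound above can be verified numerically (a minimal sketch):

```python
def discounted_sum(r, gamma, n_terms):
    """Partial sum of r + gamma*r + gamma^2*r + ..."""
    return sum(r * gamma ** t for t in range(n_terms))

r_max, gamma = 2.0, 0.9
bound = r_max / (1 - gamma)                   # closed form: r/(1 - gamma)
partial = discounted_sum(r_max, gamma, 1000)  # long but finite truncation

# The partial sums stay below the bound and approach it from below.
assert partial < bound
assert abs(bound - partial) < 1e-6
```

So even an infinitely long episode has bounded utility once 0 < γ < 1.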
§ The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally (according to π*)
§ The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally. Action a is not necessarily the optimal action; Q*(s, a) is the best we can get after taking a in s
§ The optimal policy: π*(s) = optimal action from state s
(Diagram: s is a state; (s, a) is a q-state; (s, a, s′) is a transition.)
Functional relation between V*(s) and Q*(s, a)?
§ Markov decision processes (MDPs):
§ Set of states S
§ Start state s₀ (optional)
§ Set of actions A
§ Transitions p(s′ | s, a) or P(s′, s, a)
§ Rewards R(s, a, s′) (and discount γ)
§ Terminal states (optional)
§ Markov / memoryless property
§ Policy π = choice of action for each state
§ Utility / Value = sum of (discounted) rewards
§ Value of a state, V(s), and value of a Q-state, Q(s, a)
§ Optimal policy π* = best choice, the one that maximizes utility
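The recap maps naturally onto a small container type; a sketch (the class and field names are choices made here, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list                 # set of states S
    actions: dict                # state -> feasible actions A(s)
    transitions: dict            # (s, a) -> [(prob, s_next, reward), ...]
    gamma: float = 0.9           # discount
    terminal: set = field(default_factory=set)  # terminal states (optional)

    def is_terminal(self, s):
        return s in self.terminal

# A two-state instance just to exercise the container.
tiny = MDP(
    states=["a", "end"],
    actions={"a": ["go"], "end": []},
    transitions={("a", "go"): [(1.0, "end", 5.0)]},
    terminal={"end"},
)
```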
§ Fundamental operation: compute the value V*(s) of a state
✓ Expected utility under optimal action
✓ Average of sum of (discounted) rewards
§ Recursive definition of value of a state:

V*(s) = max_a Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]

Sub-problem: R(s, a, s′) for the current state + γ·V*(next state)
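The recursive definition can be turned into repeated synchronous sweeps (value iteration); a minimal sketch, with function and parameter names chosen here:

```python
def value_iteration(states, actions, P, gamma, n_sweeps=100):
    """Apply V(s) <- max_a sum_s' p(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        V = {
            s: max(
                (sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                 for a in actions[s]),
                default=0.0,  # states with no actions (terminal) keep 0
            )
            for s in states
        }
    return V

# Tiny deterministic check: one move into a terminal state, reward 5.
V = value_iteration(
    states=["a", "end"],
    actions={"a": ["go"], "end": []},
    P={("a", "go"): [(1.0, "end", 5.0)]},
    gamma=0.5,
)
```

Each sweep reads the old value function and writes a fresh one, so every state is backed up from the same iterate.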
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 1. Living reward: R = 0.
Forget about this for now… It “means” that the optimal policy has been found, which is the one shown with ▲▼◄►
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 1. Living reward: R = 0.
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = 0.
V(3,3): best action a = Right
V*((3,3)) for Right = 0.8(0 + 0.9·1) + 0.1(0 + 0.9·0.57) + 0.1(0 + 0.9·0.85) ≅ 0.85
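The backup above is just an expectation over the three slip outcomes; redoing the arithmetic (the 0.57 and 0.85 successor values are taken from the slide):

```python
gamma = 0.9
outcomes = [  # (probability, immediate reward, value of successor state)
    (0.8, 0.0, 1.00),  # intended move: into the +1 exit
    (0.1, 0.0, 0.57),  # slip to one side
    (0.1, 0.0, 0.85),  # slip to the other side
]
q_right = sum(p * (r + gamma * v) for p, r, v in outcomes)
# q_right = 0.72 + 0.0513 + 0.0765 = 0.8478, which rounds to 0.85
```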
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = 0.
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = −0.1.
Probabilistic dynamics: 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = −0.1.
§ The value V^π(s) of a state s under the policy π is the expected value of its return, the utility of all state sequences starting in s and applying π (State-Value function):

V^π(s) = E_π[ Σ_{t=0}^∞ γᵗ R(s_{t+1}) | s₀ = s ]

§ The value Q^π(s, a) of taking an action a in state s under policy π is the expected return starting from s, taking action a, and thereafter following π (Action-Value function):

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γᵗ R(s_{t+1}) | s₀ = s, a₀ = a ]
BELLMAN EQUATION FOR VALUE FUNCTION
Expected immediate reward (short-term utility) for taking action π(s) prescribed by π for state s
Expected future discounted reward (long-term utility) we get after taking that action from that state and following π
V^π(s) = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ] = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ V^π(s′) )  ∀s ∈ S
✓ Under a given policy π, an MDP is equivalent to a Markov reward process (MRP), and the question of interest is the prediction of the expected cumulative reward that results from a state s, which is the same as computing V^π(s)
V^π(s) = E_π[ Σ_{t=0}^∞ γᵗ R(s_{t+1}) | s₀ = s ]
       = E_π[ Σ_{k=0}^∞ γᵏ R(s_{t+k+1}) | s_t = s ]
       = E_π[ R(s_{t+1}) + γ R(s_{t+2}) + γ² R(s_{t+3}) + ⋯ | s_t = s ]
       = E_π[ R(s_{t+1}) + γ Σ_{k=0}^∞ γᵏ R(s_{t+k+2}) | s_t = s ]
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ E_π[ Σ_{k=0}^∞ γᵏ R(s_{t+k+2}) | s_{t+1} = s′ ] )
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ V^π(s′) )
       = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ]
V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]  ∀s ∈ S

§ How do we find V^π values for all states? |S| linear equations in |S| unknowns!
  V_π = P_π(R_π + γ V_π)  →  V_π = [I − γ P_π]⁻¹ [P_π R_π]
§ Complexity: O(|S|³) for inverting an |S|×|S| matrix
§ Prediction problem: computing the value of a policy
§ Exact or numeric solution
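In matrix form the |S| equations can be solved directly. The sketch below folds the expected one-step reward into a vector `r_pi`, a common rearrangement of the V = P(R + γV) form shown above; all names are choices made here:

```python
import numpy as np

def evaluate_policy_exact(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V  ->  V = (I - gamma P_pi)^-1 r_pi.

    P_pi[i, j] = p(s_j | s_i, pi(s_i)); r_pi[i] = expected immediate
    reward from s_i under pi. Cost is O(|S|^3), as for the inversion.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Two-state chain: s0 -> s1 with reward 1; s1 absorbs with reward 0.
P = np.array([[0.0, 1.0], [0.0, 1.0]])
r = np.array([1.0, 0.0])
V = evaluate_policy_exact(P, r, gamma=0.5)   # expect V = [1, 0]
```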
Values V^π for the policy π shown, with γ = 1, R(s) = −0.04:

0.812  0.868  0.918  +1
0.762  (wall) 0.660  −1
0.705  0.655  0.611  0.388
(π is also optimal in this example case)
Example from Sutton and Barto
§ Value of a state, V(s = ball location): negative of the number of strokes to the hole from that location → a scalar field for the expected utility
§ Actions: which club to use {putter, driver} (assuming that we know how to swing once the club is decided)
§ Policy for the value function: only use the putter (off the green we cannot reach the hole with a putt, while from anywhere on the green we assume we can make a putt)
✓ The equations suggest an iterative, recursive update approach that exploits the sub-problem structure and the relations among sub-problems
✓ k = updating step for the value of a state
✓ Given an expected value function V_k at iteration k, we can back up the expected value function V_{k+1} at iteration k + 1:

V_{k+1}(s) ← Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V_k(s′) ]  ∀ s ∈ S

§ Expected value function at iteration k
§ Bellman backup operator T^π: V_{k+1}(s) = (T^π V_k)(s)
§ Sweep: apply the backup operator to all states, V_{k+1} = T^π V_k
Input π, the policy to be evaluated
Initialize V(s) ∀s ∈ S (e.g., V(s) = 0)
Repeat
  Δ ← 0
  Foreach s ∈ S
    v ← V(s)
    V(s) ← Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V(s′) ]
    Δ ← max(Δ, |v − V(s)|)
Until Δ < θ (a small positive number)
Output V ≈ V^π
V₀ ↦ V₁ ↦ V₂ ↦ ⋯ ↦ V_k, with V_k → V^π for k → ∞ (in practice, for large finite k)
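The pseudocode above transcribes directly into Python (in-place updates, as in the algorithm; function and parameter names are chosen here):

```python
def iterative_policy_evaluation(states, policy, P, gamma, theta=1e-10):
    """Sweep V(s) <- sum_s' p(s'|s,pi(s)) [R + gamma V(s')] until the
    largest per-sweep change Delta falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = sum(p * (r + gamma * V[s2])
                       for p, s2, r in P.get((s, policy.get(s)), []))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

# Two-state chain: a -> end (reward 1), end absorbs with no actions.
V = iterative_policy_evaluation(
    states=["a", "end"],
    policy={"a": "go"},
    P={("a", "go"): [(1.0, "end", 1.0)]},
    gamma=0.5,
)
```

States with no entry in `policy` (terminal states) simply keep value 0.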
Backup diagram: state s at the root; under π(s), transitions lead to successors s′, s′′, s′′′ with rewards R(s′), R(s′′), R(s′′′) and values V^π(s′), V^π(s′′), V^π(s′′′), averaged at the chance node.

V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]  ∀s ∈ S

§ Given an expected value function, we use it to back up the value of a state s
§ Update of a state s value: a sub-problem related to state s, a backup operation
§ Relation between the value of a state and that of its successor states
§ The Bellman equation results from additivity of utility + the Markov property
§ The optimal solution can be decomposed into sub-problems
§ Recursive state equations that need to be mutually consistent
§ Solutions for a sub-problem can be cached and reused
Backup diagram, deterministic vs. stochastic policy:
§ Deterministic policy: V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]  ∀s ∈ S
§ Stochastic policy: V^π(s) = Σ_{a∈A} π(a | s) Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]  ∀s ∈ S
V^π is the value of a policy π, but what we are looking for is the value (i.e., the expected utility) from applying the best policy, π*. We need to find / compute the following functions:
§ V*(s) = Optimal state-value function: V*(s) = max_π V^π(s)  ∀s ∈ S
§ Q*(s, a) = Optimal action-value function: Q*(s, a) = max_π Q^π(s, a)  ∀s ∈ S, a ∈ A
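The two optimal functions are linked by V*(s) = max_a Q*(s, a), with Q*(s, a) = Σ_{s′} p(s′|s,a)[R(s,a,s′) + γV*(s′)]. A sketch of that relation on a toy MDP (all names and numbers chosen here):

```python
def q_from_v(V_star, actions, P, gamma):
    """Q*(s,a) = sum_s' p(s'|s,a) [R(s,a,s') + gamma V*(s')]."""
    return {
        (s, a): sum(p * (r + gamma * V_star[s2]) for p, s2, r in P[(s, a)])
        for s in actions for a in actions[s]
    }

# Toy MDP: from "a", fast pays 4 and slow pays 1, both ending in "b".
P = {("a", "fast"): [(1.0, "b", 4.0)], ("a", "slow"): [(1.0, "b", 1.0)]}
V_star = {"a": 4.0, "b": 0.0}   # optimal values for this toy problem
Q = q_from_v(V_star, {"a": ["fast", "slow"]}, P, gamma=0.9)

# Consistency check: V*(s) = max_a Q*(s,a)
assert max(Q[("a", "fast")], Q[("a", "slow")]) == V_star["a"]
```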
§ Optimal action-values for choosing club = driver, and afterward selecting either the driver or the putter, whichever is better based