 
              CMU-Q 15-381 Lecture 16: Markov Decision Processes I Teacher: Gianni A. Di Caro
RECAP: MARKOV DECISION PROCESSES (MDP)
§ A set $S$ of world states
§ A set $A$ of feasible actions
§ A stochastic transition matrix $T$, with $T(s, s', a) = P(s' \mid s, a)$ and $T: S \times S \times A \times \{0, 1, \ldots, H\} \mapsto [0, 1]$
§ A reward function $R$, in one of the forms $R(s)$, $R(s, a)$, $R(s, a, s')$, with $R: S \times A \times S \times \{0, 1, \ldots, H\} \mapsto \mathbb{R}$
§ A start state (or a distribution over initial states), optional
§ Terminal/absorbing states, optional
Goal: define the action decision policy that maximizes a given (utility) function of the rewards, potentially for $H \to \infty$
§ Deterministic policy $\pi(s)$: a mapping from states to actions, $\pi: S \to A$
§ Stochastic policy $\pi(s, a)$: a mapping from states to a probability distribution over the actions feasible in the state
RECYCLING ROBOT
§ At each step, a recycling robot has to decide whether it should: search for a can; wait for someone to bring it a can; or go to home base and recharge.
§ Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued.
§ States are battery levels: high, low.
§ Reward = number of cans collected (expected)
Note: the "state" (the robot's battery status) is a parameter of the agent itself, not a property of the physical environment.
Example from Sutton and Barto
UTILITY OF A POLICY
§ Starting from $s_0$, applying the policy $\pi$ generates a sequence of states $s_0, s_1, \cdots, s_t$ and of rewards $r_0, r_1, \cdots, r_t$
§ For the (rational) decision-maker, each sequence has a utility based on its preferences
§ Utility is a function of the sequence of rewards: $U(\text{reward sequence})$ → "additive function of the rewards"
§ The expected utility, or value, of a policy $\pi$ starting in state $s_0$ is the expected utility over all the state sequences generated by applying $\pi$, which depend on the state transition dynamics:
$U^\pi(s_0) = \sum_{\mathbf{s} \in \{\text{all state sequences starting from } s_0\}} P(\mathbf{s})\, U(\mathbf{s})$
OPTIMAL POLICIES
§ An optimal policy $\pi^*$ yields the maximal utility, that is, the maximal expected utility function of the rewards obtained by following the policy from the initial state
✓ Principle of maximum expected utility: a rational agent should choose the action(s) that maximize its expected utility
§ Note: different optimal policies arise from different reward models, which in turn determine different utilities for the same action sequence → Let's look at the grid world…
OPTIMAL POLICIES
[Figure: optimal grid-world policies for R(s) = -0.01, R(s) = -0.04, R(s) = -0.4, R(s) = -2.0, and R(s) > 0]
The balance between risk and reward changes depending on the value of R(s).
EXAMPLE: CAR RACING
§ A robot car wants to travel far, quickly; it gets higher rewards for moving fast
§ Three states: Cool, Warm, Overheated (terminal state, ends the process)
§ Two actions: Slow, Fast
§ Going faster gets double reward
§ Green numbers are rewards
[Figure: transition diagram over Cool, Warm, and Overheated; edges are labelled with actions (Slow, Fast), probabilities (0.5, 1.0), and rewards (+1, +2, -10)]
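A minimal sketch of how the racing MDP could be written down as a transition table. The probabilities and rewards are an assumed reading of the slide's diagram (Cool/Slow stays Cool; Cool/Fast and Warm/Slow split 0.5/0.5 between Cool and Warm; Warm/Fast overheats with reward -10). The names `RACING`, `check_mdp`, and `expected_reward` are hypothetical, introduced only for illustration.

```python
# Racing MDP as nested dicts: RACING[state][action] is a list of
# (probability, next_state, reward) triples.
# ASSUMPTION: the numbers are read off the slide's diagram.
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal state: no actions available
}


def check_mdp(mdp):
    """Return True if every (state, action) row is a distribution over known states."""
    for actions in mdp.values():
        for outcomes in actions.values():
            total = sum(p for p, _, _ in outcomes)
            if abs(total - 1.0) > 1e-9:
                return False
            if any(nxt not in mdp for _, nxt, _ in outcomes):
                return False
    return True


def expected_reward(mdp, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in mdp[state][action])
```

Under this reading, going fast doubles the immediate reward from Cool, while from Warm it costs 10 and ends the process.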
RACING SEARCH TREE (~EXPECTIMAX)
[Figure: search tree for the racing MDP; each state branches on the actions slow and fast, followed by chance nodes]
UTILITIES OF SEQUENCES
§ What preferences should an agent have over reward sequences?
§ More or less? [1, 2, 2] or [2, 3, 4]
§ Now or later? [1, 0, 0] or [0, 0, 1]
DISCOUNTING
§ It's reasonable to maximize the sum of rewards
§ It's also reasonable to prefer rewards now to rewards later
§ One solution: values of rewards decay exponentially by a factor $\gamma$
[Figure: a reward is worth 1 now, $\gamma$ at the next step, and $\gamma^2$ in two steps]
DISCOUNTING
§ How to discount?
§ Each time we descend a level, we multiply in the discount $\gamma$ once
§ Why discount?
§ Sooner rewards probably do have higher utility than later rewards
§ Also helps our algorithms converge
§ Example: discount of $\gamma = 0.5$
§ $U([1, 2, 3]) = 1 \cdot 1 + 0.5 \cdot 2 + 0.25 \cdot 3$
§ $U([1, 2, 3]) < U([3, 2, 1])$
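The slide's example can be checked with a few lines of Python. This is only a sketch; the function name `discounted_utility` is hypothetical, and the first reward is assumed to be undiscounted, as in the slide's formula.

```python
def discounted_utility(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


u_123 = discounted_utility([1, 2, 3], 0.5)  # 1*1 + 0.5*2 + 0.25*3
u_321 = discounted_utility([3, 2, 1], 0.5)  # 1*3 + 0.5*2 + 0.25*1
print(u_123, u_321)  # 2.75 4.25
```

The same rewards in a different order give a different utility: the sequence that delivers the large reward first is preferred.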
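STATIONARY PREFERENCES
§ Theorem: if we assume stationary preferences between sequences, $[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots] \Leftrightarrow [r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]$, then there are only two ways to define utilities over sequences of rewards
§ Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
§ Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$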
EFFECT OF DISCOUNTING ON OPTIMAL POLICY
[Figure: a row of five states a, b, c, d, e; exiting at a yields 10, exiting at e yields 1]
§ MDP:
§ Actions: East, West
§ Terminal states: a and e (the process ends when either one is reached)
§ Transitions: deterministic
§ Reward for reaching a is 10
§ Reward for reaching e is 1; reward for reaching any other state is 0
§ For $\gamma = 1$, what is the optimal policy?
§ For $\gamma = 0.1$, what is the optimal policy for states b, c, and d?
§ For which $\gamma$ are West and East equally good when in state d? Since a is two steps farther from d than e is, equality requires $10\gamma^2 = 1$, so $\gamma = \sqrt{1/10}$
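The three questions can be explored numerically. The sketch below assumes the agent commits to one direction, that the reward is collected on the step that reaches a or e, and that this first collected reward is discounted by $\gamma^{k-1}$ when it takes $k$ steps to arrive. The names `STEPS` and `best_direction` are hypothetical.

```python
import math

# Steps needed to reach a (going West) and e (going East) from each inner state.
STEPS = {"b": (1, 3), "c": (2, 2), "d": (3, 1)}


def best_direction(state, gamma):
    """Compare the discounted payoff of heading West (10 at a) and East (1 at e)."""
    to_a, to_e = STEPS[state]
    west = 10.0 * gamma ** (to_a - 1)
    east = 1.0 * gamma ** (to_e - 1)
    if math.isclose(west, east):
        return "tie"
    return "West" if west > east else "East"


for gamma in (1.0, 0.1, math.sqrt(0.1)):
    print(gamma, {s: best_direction(s, gamma) for s in STEPS})
```

With these assumptions, $\gamma = 1$ sends every state West, $\gamma = 0.1$ sends b and c West but d East, and $\gamma = \sqrt{1/10}$ makes the two directions tie in d.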
INFINITE UTILITIES?!
§ Problem: what if the process can last forever? Do we get infinite rewards?
§ Possible solutions:
1. Finite horizon (similar to depth-limited search)
§ Terminate episodes after a fixed number of steps (e.g., life)
§ Gives nonstationary policies ($\pi$ depends on time left)
2. Discounting: use $0 < \gamma < 1$
$U([r_0, \cdots, r_\infty]) = \sum_{t=0}^{\infty} \gamma^t r_t$; if $r_t = r$ for all $t$, then $\sum_{t=0}^{\infty} \gamma^t r_t = \frac{r}{1 - \gamma}$ $\Rightarrow$ $U([r_0, \cdots, r_\infty]) \le \frac{R_{\max}}{1 - \gamma}$
§ Smaller $\gamma$ means a shorter horizon: the far future matters less
3. Absorbing states: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
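A small numeric check of the discounting bound: with a constant reward equal to the maximum, the partial sums grow toward $R_{\max} / (1 - \gamma)$ and never exceed it. The values $\gamma = 0.9$ and $R_{\max} = 1$ are arbitrary choices for illustration, and `discounted_sum` is a hypothetical helper.

```python
def discounted_sum(reward, gamma, horizon):
    """Discounted sum of a constant reward over `horizon` steps."""
    return sum(reward * gamma ** t for t in range(horizon))


gamma, r_max = 0.9, 1.0            # illustrative values, not from the slides
bound = r_max / (1.0 - gamma)      # R_max / (1 - gamma)
partial = [discounted_sum(r_max, gamma, h) for h in (10, 100, 1000)]
print(partial, bound)
```

Even though the process never stops, the utility stays finite as long as $\gamma < 1$ and the rewards are bounded.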
USE OF UTILITIES: V AND Q FUNCTIONS
§ The value (utility) of a state $s$: $V^*(s)$ = expected utility starting in $s$ and acting optimally (according to $\pi^*$)
§ The value (utility) of a q-state $(s, a)$: $Q^*(s, a)$ = expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally. Action $a$ is not necessarily the optimal one: $Q^*(s, a)$ says what is the best we can get after taking $a$ in $s$
§ The optimal policy: $\pi^*(s)$ = optimal action from state $s$, the one that returns $V^*(s)$
§ Functional relation between $V^*(s)$ and $Q^*(s, a)$?
[Figure: $s$ is a state, $(s, a)$ is a q-state, $(s, a, s')$ is a transition]
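One way to write the relation this slide asks about, using the slide's own symbols. It follows from the two definitions: acting optimally from $s$ means first picking the action whose q-state has the highest value.

```latex
V^*(s) = \max_{a} Q^*(s, a),
\qquad
\pi^*(s) = \arg\max_{a} Q^*(s, a)
```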
MDPS SUMMARY
§ Markov decision processes (MDPs):
§ Set of states $S$
§ Start state $s_0$ (optional)
§ Set of actions $A$
§ Transitions $P(s' \mid s, a)$ or $P(s', s, a)$
§ Rewards $R(s, a, s')$ (and discount $\gamma$)
§ Terminal states (optional)
§ Markov / memoryless property
§ Policy $\pi$ = choice of action for each state
§ Utility / Value = sum of (discounted) rewards
§ Value of a state, $V(s)$, and value of a Q-state, $Q(s, a)$
§ Optimal policy $\pi^*$ = best choice, the one that maximizes utility
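OPTIMAL VALUES OF STATES
§ Fundamental operation (sub-problem): compute the value $V^*(s)$ of a state
✓ Expected utility under optimal action
✓ Average of sum of (discounted) rewards
§ Recursive definition of the value of a state:
$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\left[ R(s, a, s') + \gamma V^*(s') \right]$
that is, [reward from the current state + $\gamma\, V^*$(next state)], averaged over next states and maximized over actions
[Figure: one level of the search tree, from state $s$ through q-state $(s, a)$ and transition $(s, a, s')$ to next state $s'$]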
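The recursive definition can be sketched as a depth-limited recursion over the search tree, here on the car-racing example. The transition table is an assumed reading of the racing diagram, the recursion is cut off after `depth` steps (a finite-horizon approximation, not the infinite-horizon value), and the names `RACING`, `q_value`, and `state_value` are hypothetical.

```python
# RACING[state][action] is a list of (probability, next_state, reward) triples.
# ASSUMPTION: numbers read off the car-racing diagram.
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal
}


def q_value(mdp, state, action, gamma, depth):
    """Chance node: average of [reward + gamma * value(next state)]."""
    return sum(
        p * (r + gamma * state_value(mdp, nxt, gamma, depth - 1))
        for p, nxt, r in mdp[state][action]
    )


def state_value(mdp, state, gamma, depth):
    """Max node: best q-value over the available actions, `depth` steps to go."""
    if depth == 0 or not mdp[state]:
        return 0.0
    return max(q_value(mdp, state, a, gamma, depth) for a in mdp[state])


print(state_value(RACING, "cool", 1.0, 2), state_value(RACING, "warm", 1.0, 2))
```

With two steps to go and no discounting, the recursion prefers Fast from Cool and Slow from Warm, since overheating costs far more than the extra reward is worth.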
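GRIDWORLD V-VALUES
[Figure: grid world annotated with V-values and with the optimal action in each state, marked ▲▼◄►]
Annotation on the figure: forget about this for now … It "means" that the optimal policy has been found, which is the one shown with ▲▼◄►
Probabilistic dynamics: 80% correct, 20% L/R
Discount: $\gamma = 1$
Living reward: $R = 0$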
GRIDWORLD Q-VALUES
[Figure: grid world annotated with Q-values]
Probabilistic dynamics: 80% correct, 20% L/R
Discount: $\gamma = 1$
Living reward: $R = 0$
GRIDWORLD V-VALUES
[Figure: grid world annotated with V-values]
$V^*(3,3)$: $\arg\max_a$ = Right
$V^*(3,3)_{\text{Right}} = 0.8\,[0 + 0.9 \cdot 1] + 0.1\,[0 + 0.9 \cdot 0.57] + 0.1\,[0 + 0.9 \cdot 0.85] \cong 0.85$
Probabilistic dynamics: 80% correct, 20% L/R
Discount: $\gamma = 0.9$
Living reward: $R = 0$
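The arithmetic for cell (3,3) can be reproduced directly. The sketch below takes the three successor values exactly as they appear in the slide's formula (1, 0.57, and 0.85) without assuming which neighbouring cell each one belongs to; the variable names are hypothetical.

```python
gamma = 0.9          # discount from the slide
living_reward = 0.0  # living reward from the slide

# (probability, value of the successor state) for action Right in cell (3,3),
# copied from the slide's formula.
outcomes = [(0.8, 1.0), (0.1, 0.57), (0.1, 0.85)]

q_right = sum(p * (living_reward + gamma * v) for p, v in outcomes)
print(round(q_right, 4))  # 0.8478
```

The result, about 0.85, equals the value already assigned to the cell, which is what a fixed point of the recursive definition looks like.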
GRIDWORLD Q-VALUES
[Figure: grid world annotated with Q-values]
Probabilistic dynamics: 80% correct, 20% L/R
Discount: $\gamma = 0.9$
Living reward: $R = 0$
GRIDWORLD V-VALUES
[Figure: grid world annotated with V-values]
Probabilistic dynamics: 80% correct, 20% L/R
Discount: $\gamma = 0.9$
Living reward: $R = -0.1$
GRIDWORLD Q-VALUES
[Figure: grid world annotated with Q-values]
Probabilistic dynamics: 80% correct, 20% L/R
Discount: $\gamma = 0.9$
Living reward: $R = -0.1$
VALUE FUNCTION AND Q-FUNCTION
§ The value $V^\pi(s)$ of a state $s$ under the policy $\pi$ is the expected value of its return, the utility of all state sequences starting in $s$ and applying $\pi$
State-value function: $V^\pi(s) = E_\pi\left[ \sum_{t=0}^{\infty} \gamma^t R(s_{t+1}) \mid s_0 = s \right]$
§ The value $Q^\pi(s, a)$ of taking an action $a$ in state $s$ under policy $\pi$ is the expected return starting from $s$, taking action $a$, and thereafter following $\pi$:
Action-value function: $Q^\pi(s, a) = E_\pi\left[ \sum_{t=0}^{\infty} \gamma^t R(s_{t+1}) \mid s_0 = s, a_0 = a \right]$
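Because $V^\pi(s)$ is an expectation over sampled trajectories, it can be approximated by averaging the discounted returns of simulated episodes. The sketch below does this on the car-racing example; the transition table is an assumed reading of the racing diagram, rewards are attached to transitions rather than to the next state alone, episodes are truncated at a finite `horizon`, and all names are hypothetical.

```python
import random

# RACING[state][action] is a list of (probability, next_state, reward) triples.
# ASSUMPTION: numbers read off the car-racing diagram.
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal
}


def sample_return(mdp, policy, start, gamma, horizon, rng):
    """Discounted return of one simulated episode that follows `policy`."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        if state not in policy:  # terminal state: the episode is over
            break
        outcomes = mdp[state][policy[state]]
        weights = [p for p, _, _ in outcomes]
        _, state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        total += discount * reward
        discount *= gamma
    return total


def estimate_value(mdp, policy, start, gamma, episodes=1000, horizon=60, seed=0):
    """Monte Carlo estimate of V^pi(start): the mean of sampled returns."""
    rng = random.Random(seed)
    returns = [
        sample_return(mdp, policy, start, gamma, horizon, rng)
        for _ in range(episodes)
    ]
    return sum(returns) / len(returns)


always_slow = {"cool": "slow", "warm": "slow"}
v_cool = estimate_value(RACING, always_slow, "cool", gamma=0.5)
```

Under the always-slow policy every step pays 1, so with $\gamma = 0.5$ each sampled return is essentially $1 / (1 - 0.5) = 2$; policies with random rewards would need many episodes to average out the noise.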
BELLMAN (EXPECTATION) EQUATION FOR VALUE FUNCTION
$V^\pi(s) = E_\pi\left[ R(s_{t+1}) + \gamma V^\pi(s_{t+1}) \mid s_t = s \right] = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left( R(s', s, \pi(s)) + \gamma V^\pi(s') \right) \quad \forall s \in S$
The right-hand side is the expected immediate reward (short-term utility) for taking the action $\pi(s)$ prescribed by $\pi$ for state $s$, plus the expected future discounted reward (long-term utility) obtained after taking that action from that state and following $\pi$.
✓ Under a given policy $\pi$, an MDP is equivalent to an MRP (Markov reward process), and the question of interest is the prediction of the expected cumulative reward that results from a state $s$, which is the same as computing $V^\pi(s)$
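The expectation equation can be used as an update rule: start from zero values and repeatedly replace each $V^\pi(s)$ by the right-hand side until the values stop changing. This is a minimal sketch on the car-racing example, not a method the slides prescribe; the transition table is an assumed reading of the racing diagram, the fixed number of sweeps is an arbitrary choice, and the names are hypothetical.

```python
# RACING[state][action] is a list of (probability, next_state, reward) triples.
# ASSUMPTION: numbers read off the car-racing diagram.
RACING = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal
}


def evaluate_policy(mdp, policy, gamma, sweeps=200):
    """Apply the Bellman expectation equation as an update, `sweeps` times."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        new_values = {}
        for s in mdp:
            if s not in policy:  # terminal state: value stays 0
                new_values[s] = 0.0
                continue
            new_values[s] = sum(
                p * (r + gamma * values[nxt])
                for p, nxt, r in mdp[s][policy[s]]
            )
        values = new_values
    return values


always_slow = {"cool": "slow", "warm": "slow"}
v_slow = evaluate_policy(RACING, always_slow, gamma=0.5)
print(v_slow)
```

For the always-slow policy with $\gamma = 0.5$ the equation can also be solved by hand: $V^\pi(\text{cool}) = 1 + 0.5\,V^\pi(\text{cool})$ gives 2, and substituting into the equation for Warm gives 2 as well, which matches the iterated values.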