

  1. CMU-Q 15-381 Lecture 16: Markov Decision Processes I Teacher: Gianni A. Di Caro

  2. RECAP: MARKOV DECISION PROCESSES (MDP)
§ A set S of world states
§ A set A of feasible actions
§ A stochastic transition function P, P(s, s′, a) = Pr(s′ | s, a), P: S × S × A × {0, 1, …, T} ↦ [0, 1]
§ A reward function R: R(s), R(s, a), R(s, a, s′), R: S × A × S × {0, 1, …, T} ⟼ ℝ
§ A start state (or a distribution of initial states), optional
§ Terminal/absorbing states, optional
Goal: define the action decision policy that maximizes a given (utility) function of the rewards, potentially for T → ∞
§ Deterministic policy π(s): a mapping from states to actions, π: S → A
§ Stochastic policy π(s, a): a mapping from states to a probability distribution over the actions feasible in that state
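As a concrete illustration, here is a minimal sketch of how such an MDP could be held in plain Python data structures; the state/action names, probabilities and rewards are made-up assumptions, not taken from the course.

```python
# Minimal sketch of an MDP as plain data structures (illustrative; names are assumptions).
# States S, actions A, transition probabilities P(s' | s, a), rewards R(s, a, s').

S = ["s0", "s1"]                      # set of world states
A = ["a0", "a1"]                      # set of feasible actions

# P[(s, a)] maps each next state s' to Pr(s' | s, a); every row must sum to 1.
P = {
    ("s0", "a0"): {"s0": 0.9, "s1": 0.1},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s1": 1.0},
}

# R[(s, a, s')] is the reward collected on that transition.
R = {(s, a, s2): 1.0 if s2 == "s1" else 0.0
     for (s, a), nxt in P.items() for s2 in nxt}

# A deterministic policy pi: S -> A is just a dict from states to actions.
pi = {"s0": "a1", "s1": "a0"}
```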

  3. RECYCLING ROBOT
§ At each step, a recycling robot has to decide whether it should: search for a can; wait for someone to bring it a can; or go to home base and recharge.
§ Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued.
§ States are battery levels: high, low.
§ Reward = number of cans collected (expected)
Note: the "state" (the robot's battery status) is a parameter of the agent itself, not a property of the physical environment.
Example from Sutton and Barto
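The robot's problem fits the MDP template above. A hedged sketch of its dynamics table follows; the probabilities and reward values (ALPHA, BETA, R_SEARCH, R_WAIT, the -3 rescue penalty) are illustrative placeholders, since the slide gives no numbers.

```python
# Recycling-robot MDP sketch (probabilities and rewards are placeholder assumptions).
ALPHA, BETA = 0.8, 0.6          # chance the battery stays high/low while searching
R_SEARCH, R_WAIT = 2.0, 1.0     # expected cans collected per step

states = ["high", "low"]
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# (state, action) -> list of (prob, next_state, reward)
dynamics = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],  # -3: rescued
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
```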

  4. UTILITY OF A POLICY
§ Starting from s_0, applying the policy π generates a sequence of states s_0, s_1, ⋯, s_T, and of rewards r_0, r_1, ⋯, r_T
§ For the (rational) decision-maker each sequence has a utility based on its preferences
§ Utility is a function of the sequence of rewards: U(reward sequence) → additive function of the rewards
§ The expected utility, or value, of a policy π starting in state s_0 is the expected utility over all the state sequences generated by applying π, which depend on the state transition dynamics:
U^π(s_0) = Σ_{seq ∈ {all state sequences starting from s_0}} P(seq) · U(seq)
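A rough way to see this definition operationally: sample many state sequences under π and average their discounted utilities. The sketch below assumes the (state, action) → [(prob, next_state, reward), …] layout from the recycling-robot sketch; the function names and parameters are illustrative.

```python
import random

def sample_step(dynamics, s, a):
    # Draw one successor (next_state, reward) according to the transition probabilities.
    r = random.random()
    acc = 0.0
    for p, s2, rew in dynamics[(s, a)]:
        acc += p
        if r <= acc:
            return s2, rew
    return dynamics[(s, a)][-1][1:]          # numerical-safety fallback

def estimate_value(dynamics, pi, s0, gamma=0.9, horizon=50, episodes=10_000):
    # Monte Carlo estimate of U^pi(s0): average discounted utility over sampled sequences.
    total = 0.0
    for _ in range(episodes):
        s, u, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            s, rew = sample_step(dynamics, s, pi[s])
            u += disc * rew
            disc *= gamma
        total += u
    return total / episodes

# e.g. estimate_value(dynamics, {"high": "search", "low": "recharge"}, "high")
```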

  5. OPTIMAL POLICIES
§ An optimal policy π* yields the maximal utility = the maximal expected utility of the rewards obtained by following the policy starting from the initial state
✓ Principle of maximum expected utility: a rational agent should choose the action(s) that maximize its expected utility
§ Note: different optimal policies arise from different reward models, which in turn determine different utilities for the same action sequence
→ Let's look at the grid world…

  6. OPTIMAL POLICIES
[Figure: optimal gridworld policies for R(s) = -0.01, R(s) = -0.04, R(s) = -0.4, R(s) = -2.0, and R(s) > 0]
The balance between risk and reward changes depending on the value of R(s).

  7. EXAMPLE: CAR RACING
§ A robot car wants to travel far, quickly, and gets higher rewards for moving fast
§ Three states: Cool, Warm, Overheated (terminal state, ends the process)
§ Two actions: Slow, Fast
§ Going faster gets double reward
§ Green numbers are rewards
[Transition diagram: Cool --Slow (1.0, +1)--> Cool; Cool --Fast (0.5, +2)--> Cool or Warm; Warm --Slow (0.5, +1)--> Cool or Warm; Warm --Fast (1.0, -10)--> Overheated]
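For later examples, the diagram can be encoded as a small transition table. The entries below follow the standard version of this racing example; treat them as an assumption and check them against the slide's diagram.

```python
# Racing MDP encoded as (state, action) -> [(prob, next_state, reward), ...].
racing = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
    # "overheated" is terminal: no actions, the process ends there.
}
```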

  8. RACING SEARCH TREE (~EXPECTIMAX)
[Figure: expectimax-style search tree with max nodes for the slow/fast decisions and chance nodes for the stochastic outcomes]
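A sketch of the expectimax-style computation the tree depicts: maximize over actions at state nodes and take expectations at chance nodes, down to a fixed depth. It reuses the hypothetical racing table above.

```python
def expectimax_value(mdp, s, depth, gamma=1.0, terminal=("overheated",)):
    # Terminal states and exhausted depth contribute no further reward.
    if depth == 0 or s in terminal:
        return 0.0
    actions = {a for (st, a) in mdp if st == s}
    # Max node over actions; each action is a chance node: expectation over outcomes.
    return max(
        sum(p * (r + gamma * expectimax_value(mdp, s2, depth - 1, gamma, terminal))
            for p, s2, r in mdp[(s, a)])
        for a in actions
    )

# Example: value of starting "cool" with two decisions left.
# print(expectimax_value(racing, "cool", depth=2))   # -> 3.5 (go fast, then adapt)
```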

  9. UTILITIES OF SEQUENCES
§ What preferences should an agent have over reward sequences?
§ More or less? [1, 2, 2] or [2, 3, 4]?
§ Now or later? [1, 0, 0] or [0, 0, 1]?

  10. DISCOUNTING
§ It's reasonable to maximize the sum of rewards
§ It's also reasonable to prefer rewards now to rewards later
§ One solution: values of rewards decay exponentially by a factor γ
[Figure: a reward is worth 1 now, γ one step from now, γ² two steps from now]

  11. DISCOUNTING
§ How to discount? Each time we descend a level in the search tree (state → action → chance outcome), we multiply in the discount γ once
§ Why discount?
§ Sooner rewards probably do have higher utility than later rewards
§ It also helps our algorithms converge
§ Example: discount of γ = 0.5
§ U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
§ U([1, 2, 3]) < U([3, 2, 1])
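The arithmetic in the example can be checked with a couple of lines (a trivial sketch, nothing course-specific):

```python
# Discounted utility of a reward sequence: U([r0, r1, r2, ...]) = sum_t gamma^t * r_t.
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

gamma = 0.5
print(discounted_utility([1, 2, 3], gamma))   # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], gamma))   # 3*1 + 0.5*2 + 0.25*1 = 4.25  (larger)
```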

  12. STATIONARY PREFERENCES
§ Theorem: if we assume stationary preferences between sequences, i.e. [r, r_1, r_2, …] ≻ [r, r_1′, r_2′, …] ⇔ [r_1, r_2, …] ≻ [r_1′, r_2′, …], then there are only two ways to define utilities over sequences of rewards
§ Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + ⋯
§ Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ·r_1 + γ²·r_2 + ⋯

  13. EFFECT OF DISCOUNTING ON THE OPTIMAL POLICY
[Figure: a chain of states a, b, c, d, e; exiting at a pays 10, exiting at e pays 1]
§ MDP:
§ Actions: East, West
§ Terminal states: a and e (the episode ends when one or the other is reached)
§ Transitions: deterministic
§ Reward for reaching a is 10, reward for reaching e is 1, reward for reaching all other states is 0
§ For γ = 1, what is the optimal policy?
§ For γ = 0.1, what is the optimal policy for states b, c and d?
§ For which γ are West and East equally good when in state d? (10·γ³ = γ, i.e. γ = √(1/10))
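A small check of the three questions for state d, under the assumption (consistent with the slide) that exiting via a is three moves away and exiting via e is one move away:

```python
import math

def returns_at_d(gamma):
    # From d: West reaches a in 3 moves (reward 10), East reaches e in 1 move (reward 1).
    return 10 * gamma ** 3, 1 * gamma

for gamma in (1.0, 0.1, math.sqrt(1 / 10)):
    west, east = returns_at_d(gamma)
    print(f"gamma={gamma:.3f}  West={west:.3f}  East={east:.3f}")
# gamma=1.000 -> West=10.000, East=1.000   (prefer West)
# gamma=0.100 -> West=0.010,  East=0.100   (prefer East)
# gamma=0.316 -> West=0.316,  East=0.316   (indifferent: 10*g^3 = g  =>  g = sqrt(1/10))
```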

  14. INFINITE UTILITIES?!
§ Problem: what if the process can last forever? Do we get infinite rewards?
§ Possible solutions:
1. Finite horizon (similar to depth-limited search)
§ Terminate episodes after a fixed number of steps (e.g., life)
§ Gives nonstationary policies (π depends on the time left)
2. Discounting: use 0 < γ < 1
U([r_0, ⋯, r_∞]) = Σ_{t≥0} γ^t·r_t ≤ R_max / (1 − γ) if |r_t| ≤ R_max
§ Smaller γ means a shorter horizon: the far future matters less
3. Absorbing states: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
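The geometric-series bound in option 2 can be sanity-checked numerically (the γ and R_max values here are arbitrary):

```python
# With 0 < gamma < 1 and |r_t| <= R_MAX, the discounted sum converges:
# sum_t gamma^t * r_t  <=  R_MAX / (1 - gamma)   (geometric-series bound).
GAMMA, R_MAX = 0.9, 1.0
partial = sum(GAMMA ** t * R_MAX for t in range(1000))   # truncated series
bound = R_MAX / (1 - GAMMA)
print(partial, bound)    # both ~10.0: the infinite sum stays finite
```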

  15. USE OF UTILITIES: V AND Q FUNCTIONS
§ The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally (according to π*)
§ The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally.
(s is a state, (s, a) is a q-state, (s, a, s′) is a transition.)
Action a is not necessarily the optimal one: Q*(s, a) says what is the best we can get after taking a in s.
§ The optimal policy: π*(s) = the optimal action from state s, the one that returns V*(s)
Functional relation between V*(s) and Q*(s, a)?
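The slide leaves the functional relation as a question; for reference, the standard relations (stated here as an addition, not from the slide) are:

```latex
% Standard relations between V*, Q*, and the optimal policy:
V^*(s) = \max_{a} Q^*(s,a), \qquad
Q^*(s,a) = \sum_{s'} P(s' \mid s,a)\,\big[\, R(s,a,s') + \gamma\, V^*(s') \,\big], \qquad
\pi^*(s) = \arg\max_{a} Q^*(s,a).
```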

  16. MDPS SUMMARY
§ Markov decision processes (MDPs):
§ Set of states S
§ Start state s_0 (optional)
§ Set of actions A
§ Transitions P(s′ | s, a) or P(s′, s, a)
§ Rewards R(s, a, s′) (and discount γ)
§ Terminal states (optional)
§ Markov / memoryless property
§ Policy π = choice of action for each state
§ Utility / value = sum of (discounted) rewards
§ Value of a state, V(s), and value of a Q-state, Q(s, a)
§ Optimal policy π* = the best choice, the one that maximizes utility

  17. OPTIMAL VALUES OF STATES
§ Fundamental operation (sub-problem): compute the value V*(s) of a state
✓ Expected utility under the optimal action
✓ Average of the sum of (discounted) rewards
§ Recursive definition of the value of a state:
V*(s) = max_a Σ_{s′} P(s, a, s′) · [R(s, a, s′) + γ·V*(next state)]
i.e., [reward collected in the current state + γ · V*(next state)], averaged over the outcomes of the best action
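Value iteration turns this recursive definition into an algorithm by applying it repeatedly as an update. A sketch, assuming the (state, action) → [(prob, next_state, reward), …] layout used earlier:

```python
def value_iteration(dynamics, states, gamma=0.9, iters=1000, tol=1e-8):
    # Repeatedly apply V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')].
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            acts = [a for (st, a) in dynamics if st == s]
            if not acts:                      # terminal/absorbing state
                new_V[s] = 0.0
                continue
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)])
                for a in acts
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
    return V

# e.g. value_iteration(racing, ["cool", "warm", "overheated"], gamma=0.9)
```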

  18. GRIDWORLD V-VALUES
[Figure: gridworld V-values; the arrows ▲▼◄► show the optimal policy. Forget about this for now: it just means that the optimal policy has been found, namely the one shown by the arrows.]
Probabilistic dynamics: 80% correct, 20% left/right slip
Discount: γ = 1
Living reward: R = 0

  19. GRIDWORLD Q-VALUES
[Figure: gridworld Q-values]
Probabilistic dynamics: 80% correct, 20% left/right slip
Discount: γ = 1
Living reward: R = 0

  20. GRIDWORLD V-VALUES
V*(3,3): best action = Right
V*_Right(3,3) = 0.8·[0 + 0.9·1] + 0.1·[0 + 0.9·0.57] + 0.1·[0 + 0.9·0.85] ≅ 0.85
Probabilistic dynamics: 80% correct, 20% left/right slip
Discount: γ = 0.9
Living reward: R = 0
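The slide's arithmetic, reproduced (the neighbour values 0.57 and 0.85 come from the slide's figure):

```python
# V*(3,3) with gamma = 0.9 and the Right action: 80% intended move toward the +1 exit,
# 10% slip into each perpendicular neighbour, whose values are 0.57 and 0.85.
gamma = 0.9
v = 0.8 * (0 + gamma * 1.0) + 0.1 * (0 + gamma * 0.57) + 0.1 * (0 + gamma * 0.85)
print(round(v, 2))   # ~0.85, matching V*(3,3) on the slide
```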

  21. GRIDWORLD Q-VALUES
[Figure: gridworld Q-values]
Probabilistic dynamics: 80% correct, 20% left/right slip
Discount: γ = 0.9
Living reward: R = 0

  22. GRIDWORLD V-VALUES
[Figure: gridworld V-values]
Probabilistic dynamics: 80% correct, 20% left/right slip
Discount: γ = 0.9
Living reward: R = −0.1

  23. GRIDWORLD Q-VALUES
[Figure: gridworld Q-values]
Probabilistic dynamics: 80% correct, 20% left/right slip
Discount: γ = 0.9
Living reward: R = −0.1

  24. VALUE FUNCTION AND Q-FUNCTION
§ The value V^π(s) of a state s under the policy π is the expected value of its return, the utility over all state sequences starting in s and applying π:
State-value function: V^π(s) = E_π[ Σ_{t=0}^∞ γ^t R(s_{t+1}) | s_0 = s ]
§ The value Q^π(s, a) of taking an action a in state s under policy π is the expected return starting from s, taking action a, and thereafter following π:
Action-value function: Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t R(s_{t+1}) | s_0 = s, a_0 = a ]
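Given V^π, Q^π follows by a one-step lookahead; this identity is standard but not spelled out on the slide, so treat the sketch (and its data layout) as an assumption:

```python
# Q^pi(s, a) = sum_{s'} P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
def q_from_v(dynamics, V, s, a, gamma=0.9):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)])
```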

  25. BELLMAN (EXPECTATION) EQUATION FOR THE VALUE FUNCTION
V^π(s) = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ]
       = Σ_{s′ ∈ S} p(s′ | s, π(s)) · [ R(s′, s, π(s)) + γ V^π(s′) ]   ∀ s ∈ S
= expected immediate reward (short-term utility) for taking the action π(s) prescribed by π for state s
+ expected future discounted reward (long-term utility) obtained after taking that action from that state and then following π
✓ Under a given policy π, an MDP is equivalent to an MRP (Markov reward process), and the question of interest is the prediction of the expected cumulative reward that results from a state s, which is the same as computing V^π(s)
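Iterative policy evaluation uses this equation as an update rule until V^π stops changing; a sketch on the same hypothetical data layout as the earlier snippets:

```python
def policy_evaluation(dynamics, pi, states, gamma=0.9, tol=1e-8):
    # Repeatedly apply V(s) <- sum_{s'} P(s'|s,pi(s)) [R + gamma V(s')] until convergence.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if (s, pi.get(s)) not in dynamics:     # terminal state: value stays 0
                continue
            v_new = sum(p * (r + gamma * V[s2])
                        for p, s2, r in dynamics[(s, pi[s])])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# e.g. policy_evaluation(racing, {"cool": "fast", "warm": "slow"},
#                        ["cool", "warm", "overheated"], gamma=0.9)
```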
