
SLIDE 1

CMU-Q 15-381

Lecture 16: Markov Decision Processes I

Teacher: Gianni A. Di Caro

SLIDE 2

RECAP: MARKOV DECISION PROCESSES (MDP)

Goal: Define the action decision policy that maximizes a given (utility) function of the rewards, potentially for t → ∞

§ A set S of world states
§ A set A of feasible actions
§ A stochastic transition matrix T, T: S × S × A × {0, 1, …, t} ↦ [0, 1], T(s, s′, a) = P(s′ | s, a)
§ A reward function R: R(s), R(s, a), R(s, a, s′), with R: S × A × S × {0, 1, …, t} ⟼ ℝ
§ A start state (or a distribution over initial states), optional
§ Terminal/absorbing states, optional
§ Deterministic policy π(s): a mapping from states to actions, π: S → A
§ Stochastic policy π(s, a): a mapping from states to a probability distribution over the actions feasible in the state
SLIDE 3

RECYCLING ROBOT

Example from Sutton and Barto

Note: the “state” (robot’s battery status) is a parameter of the agent itself, not a property of the physical environment

§ At each step, a recycling robot has to decide whether it should: search for a can; wait for someone to bring it a can; or go to home base and recharge.
§ Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued.
§ States are battery levels: high, low.
§ Reward = number of cans collected (expected)
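As a concrete illustration (not from the slides), here is a minimal Python sketch of how this recycling-robot MDP could be encoded in the (S, A, T, R) form recalled on the previous slide. The probabilities and rewards below are hypothetical placeholders, not the values used by Sutton and Barto.

```python
# Hypothetical encoding of the recycling-robot MDP (S, A, T, R).
# ALPHA/BETA and the reward values are made-up placeholders.
ALPHA, BETA = 0.8, 0.6          # P(battery stays high | search), P(stays low | search)
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0

# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait":   [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],  # rescued
        "wait":   [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action) pair.
for s, acts in transitions.items():
    for a, outcomes in acts.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```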

SLIDE 4

UTILITY OF A POLICY

§ Starting from s₀, applying the policy π generates a sequence of states s₀, s₁, ⋯, s_t and of rewards r₀, r₁, ⋯, r_t
§ For the (rational) decision-maker, each sequence has a utility based on its preferences
§ Utility is a function of the sequence of rewards: U(reward sequence) → "additive function of the rewards"
§ The expected utility, or value, of a policy π starting in state s₀ is the expected utility over all the state sequences generated by applying π, which depend on the state transition dynamics:

U^π(s₀) = Σ_{S ∈ {all state sequences starting from s₀}} P^π(S) U(S)
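A small, hedged Python sketch of this definition (my own illustration): estimate U^π(s₀) by sampling state sequences under π and averaging their discounted utilities. The two-state MDP, policy, horizon, and γ below are illustrative choices, not values from the lecture.

```python
import random

# Hypothetical 2-state MDP: transitions[s][a] -> list of (prob, next_state, reward).
transitions = {
    "high": {"search": [(0.8, "high", 2.0), (0.2, "low", 2.0)],
             "wait":   [(1.0, "high", 1.0)]},
    "low":  {"search": [(0.6, "low", 2.0), (0.4, "high", -3.0)],
             "wait":   [(1.0, "low", 1.0)],
             "recharge": [(1.0, "high", 0.0)]},
}
policy = {"high": "search", "low": "recharge"}   # an arbitrary deterministic policy

def rollout_utility(s0, policy, gamma=0.9, horizon=50):
    """Sample one state sequence from s0 under the policy; return its discounted utility."""
    s, utility, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        outcomes = transitions[s][policy[s]]
        probs = [p for p, _, _ in outcomes]
        _, s_next, r = random.choices(outcomes, weights=probs)[0]
        utility += discount * r
        discount *= gamma
        s = s_next
    return utility

def estimate_policy_value(s0, policy, n_sequences=10_000):
    """Monte Carlo estimate of U^pi(s0): average utility over sampled sequences."""
    return sum(rollout_utility(s0, policy) for _ in range(n_sequences)) / n_sequences

print(estimate_policy_value("high", policy))
```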

SLIDE 5

OPTIMAL POLICIES

§ An optimal policy π* yields the maximal utility, i.e., the maximal expected utility of the rewards obtained by following the policy starting from the initial state
ü Principle of maximum expected utility: a rational agent should choose the action(s) that maximize its expected utility
§ Note: Different optimal policies arise from different reward models, which in turn determine different utilities for the same action sequence → Let's look at the grid world…

SLIDE 6

OPTIMAL POLICIES

[Figure: optimal grid-world policies for living rewards R(s) = −2.0, R(s) = −0.4, R(s) = −0.04, R(s) = −0.01, and R(s) > 0]

Balance between risk and reward changes depending on the value of R(s)

SLIDE 7

EXAMPLE: CAR RACING

§ A robot car wants to travel far and quickly; it gets higher rewards for moving fast
§ Three states: Cool, Warm, Overheated (terminal state, ends the process)
§ Two actions: Slow, Fast
§ Going faster gets double reward
§ Green numbers are rewards

[Figure: transition diagram over Cool, Warm, Overheated, with transition probabilities 0.5 and 1.0 and rewards +1, +2, and −10 on the arcs]
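For concreteness, a hedged Python encoding of this MDP. The slide's figure gives the probabilities (0.5, 1.0) and rewards (+1, +2, −10), but the exact assignment of these numbers to arcs below follows the standard version of this example and should be treated as an assumption.

```python
# racing[state][action] -> list of (probability, next_state, reward)
# Arc assignment is an assumption based on the standard version of this example.
racing = {
    "cool": {
        "slow": [(1.0, "cool", +1)],
        "fast": [(0.5, "cool", +2), (0.5, "warm", +2)],
    },
    "warm": {
        "slow": [(0.5, "cool", +1), (0.5, "warm", +1)],
        "fast": [(1.0, "overheated", -10)],
    },
    "overheated": {},  # terminal state: the process ends
}
```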
SLIDE 8

RACING SEARCH TREE (~EXPECTIMAX)

[Figure: racing search tree; action branches (slow, fast) followed by chance nodes]

SLIDE 9

UTILITIES OF SEQUENCES

§ What preferences should an agent have over reward sequences?
§ More or less? [1, 2, 2] or [2, 3, 4]
§ Now or later? [0, 0, 1] or [1, 0, 0]
SLIDE 10

DISCOUNTING

§ It’s reasonable to maximize the sum of rewards
§ It’s also reasonable to prefer rewards now to rewards later
§ One solution: values of rewards decay exponentially by a factor γ

[Figure: a reward is worth its full value now, γ times as much at the next step, γ² as much in two steps]

SLIDE 11

DISCOUNTING

§ How to discount?
   § Each time we descend a level, we multiply in the discount γ once
§ Why discount?
   § Sooner rewards probably do have higher utility than later rewards
   § Also helps our algorithms converge
§ Example: discount of γ = 0.5
   § U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
   § U([1, 2, 3]) < U([3, 2, 1])

[Figure: search tree with state, action, and chance nodes, rooted at time 0]
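A minimal Python check of this example (my own illustration, not from the slides): compute discounted utilities with γ = 0.5 and confirm U([1, 2, 3]) < U([3, 2, 1]).

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

gamma = 0.5
u_123 = discounted_utility([1, 2, 3], gamma)   # 1*1 + 0.5*2 + 0.25*3 = 2.75
u_321 = discounted_utility([3, 2, 1], gamma)   # 3*1 + 0.5*2 + 0.25*1 = 4.25
print(u_123, u_321, u_123 < u_321)             # 2.75 4.25 True
```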

SLIDE 12

STATIONARY PREFERENCES

§ Theorem: if we assume stationary preferences between reward sequences, then there are only two ways to define utilities over sequences of rewards

§ Additive utility: U([r₀, r₁, r₂, …]) = r₀ + r₁ + r₂ + ⋯
§ Discounted utility: U([r₀, r₁, r₂, …]) = r₀ + γ r₁ + γ² r₂ + ⋯

SLIDE 13

EFFECT OF DISCOUNTING ON OPTIMAL POLICY

§ MDP:
   § Actions: East, West
   § Terminal states: a and e (the episode ends when reaching one or the other)
   § Transitions: deterministic
   § Reward for reaching a is 10
   § Reward for reaching e is 1, reward for reaching all other states is 0

[Figure: five states in a row, a b c d e; a and e are exit states with exit rewards 10 and 1]

§ For γ = 1, what is the optimal policy?
§ For γ = 0.1, what is the optimal policy for states b, c and d?
§ For which γ are West and East equally good when in state d? γ = √(1/10)
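A quick sanity check of the last question (my own sketch; it assumes the exit reward is received on entering the terminal state and discounted once per step, which matches the √(1/10) answer above): from d, West reaches a in 3 steps and East reaches e in 1 step.

```python
import math

def west_vs_east_from_d(gamma):
    """Discounted utility of heading West (exit reward 10 after 3 steps)
    vs East (exit reward 1 after 1 step), starting in state d."""
    west = 10 * gamma**3
    east = 1 * gamma**1
    return west, east

g = math.sqrt(1 / 10)
print(west_vs_east_from_d(g))     # both ~0.316: indifferent
print(west_vs_east_from_d(1.0))   # (10, 1): West better
print(west_vs_east_from_d(0.1))   # (0.01, 0.1): East better
```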

SLIDE 14

INFINITE UTILITIES?!

§ Problem: What if the process can last forever? Do we get infinite rewards?

§ Possible solutions:

1. Finite horizon: (similar to depth-limited search)
   § Terminate episodes after a fixed number of steps (e.g., life)
   § Gives nonstationary policies (π depends on the time left)

2. Discounting: use 0 < γ < 1

   U([r₀, ⋯, r∞]) = Σ_{t=0}^{∞} γ^t r_t ;  if r_t ≤ R_max ∀t, then Σ_{t=0}^{∞} γ^t r_t ≤ R_max / (1 − γ)  ⇒  U([r₀, ⋯, r∞]) ≤ R_max / (1 − γ)

   § Smaller γ means a shorter horizon, the far future will matter less

3. Absorbing states: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)

SLIDE 15

USE OF UTILITIES: V AND Q FUNCTIONS

§ The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally (according to π*)
§ The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally. Action a is not necessarily the optimal one; Q*(s, a) says what is the best we can get after taking a in s
§ The optimal policy: π*(s) = optimal action from state s, the one that returns V*(s)

[Figure: expectimax-style diagram; s is a state, (s, a) is a q-state, (s, a, s′) is a transition]

Functional relation between V*(s) and Q*(s, a)?
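The slide leaves this question open; as a reminder (these are the standard relations, consistent with the Bellman backups later in this deck):

V*(s) = max_{a∈A} Q*(s, a)        Q*(s, a) = Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]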

SLIDE 16

MDPS SUMMARY

§ Markov decision processes (MDPs):
   § Set of states S
   § Start state s₀ (optional)
   § Set of actions A
   § Transitions p(s′ | s, a), or T(s, a, s′)
   § Rewards R(s, a, s′) (and discount γ)
   § Terminal states (optional)
§ Markov / memoryless property
§ Policy π = choice of action for each state
§ Utility / Value = sum of (discounted) rewards
§ Value of a state, V(s), and value of a q-state, Q(s, a)
§ Optimal policy π* = best choice, the one that maximizes utility

SLIDE 17

OPTIMAL VALUES OF STATES

§ Fundamental operation: compute the value V*(s) of a state

ü Expected utility under optimal action
ü Average of sum of (discounted) rewards

§ Recursive definition of value of a state:

[Diagram: V*(s) expanded over actions a and successor states s′; each sub-problem combines R(s, a, s′) for the current transition with γ·V*(next state)]
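To make the recursion concrete, here is a small Python sketch (my own, not from the slides) of the one-step optimal backup it describes: the value of a state is the best, over actions, of the expected immediate reward plus γ times the value of the next state.

```python
# transitions[s][a] -> list of (probability, next_state, reward); gamma is the discount.
def optimal_backup(s, transitions, values, gamma):
    """One-step optimal backup: best expected [R(s,a,s') + gamma * V(s')] over actions a."""
    return max(
        sum(p * (r + gamma * values[s_next]) for p, s_next, r in outcomes)
        for outcomes in transitions[s].values()
    )
```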

SLIDE 18

GRIDWORLD V-VALUES

Probabilistic dynamics, 80% correct, 20% L/R. Discount: γ = 1. Living reward: R = 0

Forget about this for now … it “means” that the optimal policy has been found, which is the one shown with ▲▼◄►

SLIDE 19

GRIDWORLD Q-VALUES

Probabilistic dynamics, 80% correct, 20% L/R. Discount: γ = 1. Living reward: R = 0

SLIDE 20

GRIDWORLD V-VALUES

Probabilistic dynamics, 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = 0

V(3,3): max over actions a = Right
V*((3,3))_Right = 0.8·(0 + 0.9·1) + 0.1·(0 + 0.9·0.57) + 0.1·(0 + 0.9·0.85) ≅ 0.85
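A quick numeric check of this backup (my own sketch; the successor values 1, 0.57 and 0.85 are read off the slide's figure):

```python
# Expected value of action Right in (3,3): 80% intended move, 10% slip left, 10% slip right.
gamma, living_reward = 0.9, 0.0
v_right = (0.8 * (living_reward + gamma * 1.00)    # intended: reach the +1 exit state
           + 0.1 * (living_reward + gamma * 0.57)  # slip: neighbor with V = 0.57
           + 0.1 * (living_reward + gamma * 0.85)) # slip: neighbor with V = 0.85
print(round(v_right, 2))  # ~0.85
```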

SLIDE 21

GRIDWORLD Q-VALUES

Probabilistic dynamics, 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = 0

SLIDE 22

GRIDWORLD V-VALUES

Probabilistic dynamics, 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = −0.1

SLIDE 23

GRIDWORLD Q-VALUES

Probabilistic dynamics, 80% correct, 20% L/R. Discount: γ = 0.9. Living reward: R = −0.1

SLIDE 24

VALUE FUNCTION AND Q-FUNCTION

§ The value V^π(s) of a state s under the policy π is the expected value of its return, the utility of all state sequences starting in s and applying π

State-Value function:

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R(s_{t+1}) | s₀ = s ]

§ The value Q^π(s, a) of taking an action a in state s under policy π is the expected return starting from s, taking action a, and thereafter following π:

Action-Value function:

Q^π(s, a) = E_π[ Σ_{t=0}^{∞} γ^t R(s_{t+1}) | s₀ = s, a₀ = a ]

SLIDE 25

BELLMAN (EXPECTATION) EQUATION FOR VALUE FUNCTION

Expected immediate reward (short-term utility) for taking the action π(s) prescribed by π for state s
+
Expected future discounted reward (long-term utility) obtained after taking that action from that state and following π

V^π(s) = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ]
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ V^π(s′) )    ∀s ∈ S

ü Under a given policy π, an MDP is equivalent to an MRP (Markov Reward Process), and the question of interest is the prediction of the expected cumulative reward obtained from a state s, which is the same as computing V^π(s)

SLIDE 26

VALUE FUNCTION

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R(s_{t+1}) | s₀ = s ]
       = E_π[ Σ_{k=0}^{∞} γ^k R(s_{t+k+1}) | s_t = s ]
       = E_π[ R(s_{t+1}) + γ R(s_{t+2}) + γ² R(s_{t+3}) + … | s_t = s ]
       = E_π[ R(s_{t+1}) + γ Σ_{k=0}^{∞} γ^k R(s_{t+k+2}) | s_t = s ]
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ E_π[ Σ_{k=0}^{∞} γ^k R(s_{t+k+2}) | s_{t+1} = s′ ] )
       = Σ_{s′∈S} p(s′ | s, π(s)) ( R(s, π(s), s′) + γ V^π(s′) )
       = E_π[ R(s_{t+1}) + γ V^π(s_{t+1}) | s_t = s ]

SLIDE 27

BELLMAN EXPECTATION EQUATIONS FOR VALUE FUNCTION

V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]    ∀s ∈ S

§ How do we find the V^π values for all states?
§ n linear equations in n unknowns!
   V^π = P^π[R^π + γ V^π]  →  V^π = [I − γ P^π]^{-1}[P^π R^π]
§ Complexity: O(n³) for inverting an n×n matrix
§ Prediction problem: computing the value of a policy
§ Exact or numeric solution
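A hedged NumPy sketch of this exact solution for a tiny, made-up two-state MDP under a fixed policy (the transition probabilities and rewards are illustrative, not from the lecture): build the policy's transition matrix P^π and expected-reward vector, then solve (I − γP^π) V = P^π-averaged rewards.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state MDP under a fixed policy pi:
# P_pi[i, j] = p(s_j | s_i, pi(s_i)); R_pi[i, j] = R(s_i, pi(s_i), s_j)
P_pi = np.array([[0.8, 0.2],
                 [0.4, 0.6]])
R_pi = np.array([[2.0, 2.0],
                 [-3.0, 2.0]])

# Expected immediate reward per state: r_pi(s) = sum_s' p(s'|s,pi(s)) * R(s,pi(s),s')
r_pi = (P_pi * R_pi).sum(axis=1)

# Solve V = r_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)  # V^pi for the two states
```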

SLIDE 28

VALUES FOR THE GRID WORLD STATES

γ = 1, R(s) = −0.04

[Figure: the 4×3 grid world with terminal states +1 and −1, showing the policy π and the resulting state values 0.812, 0.868, 0.918, 0.762, 0.660, 0.705, 0.655, 0.611, 0.388]

(π is also optimal in this example case)

SLIDE 29

A GOLF CLUB EXAMPLE

Example from Sutton and Barto

§ Value of a state s (the ball's location), V(s): negative of the number of strokes to the hole from that location → a scalar field for the expected utility
§ Actions: which club to use {putter, driver} (assuming we know how to swing once the club is chosen)
§ Policy for the value function: only use the putter (off the green we cannot reach the hole with a putt, while from anywhere on the green we assume we can hole a putt)

SLIDE 30

ITERATIVE POLICY EVALUATION

ü The equations suggest an iterative, recursive update approach that exploits the sub-problem structure and their relations

ü k = index of the updating step for the value of a state
ü Given an expected value function V_k at iteration k, we can back up the expected value function V_{k+1} at iteration k + 1:

V_{k+1}(s) ← Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V_k(s′) ]    ∀ s ∈ S

(V_k is the expected value function at iteration k; the update defines the Bellman backup operator B)

§ V_{k+1}(s) = B[V_k] = B V_k
§ Sweep: apply the backup operator to all states: V_{k+1} = B[V_k] = B V_k

SLIDE 31

ITERATIVE POLICY EVALUATION

1. Initialization:
   Input π, the policy to be evaluated
   Initialize V(s) ∀s ∈ S (e.g., V(s) = 0)

2. Policy Evaluation:
   k ← 0
   Repeat
      Δ ← 0
      Foreach s ∈ S
         v ← V_k(s)
         V_{k+1}(s) ← Σ_{s′∈S} p(s′ | s, a) [ R(s, a, s′) + γ V_k(s′) ]
         Δ ← max(Δ, |v − V_{k+1}(s)|)
      k ← k + 1
   Until Δ < θ (a small positive number)
   Output V ≈ V^π

V_0 ↦ V_1 ↦ V_2 ↦ ⋯ ↦ V_k, with V_k → V^π for k → ∞ (in practice, for large, finite k)

SLIDE 32

OBSERVATIONS ON THE BELLMAN EQUATIONS FOR VALUE FUNCTION

[Backup diagram: state s at the root; π(s) selects the action; successor states s′, s″, s‴ are averaged with their transition probabilities, each contributing R(·) + γ V^π(·)]

V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]    ∀s ∈ S

§ Given an expected value function, we use it to back up the value of a state s
§ Update of a state s value: sub-problem related to state s, backup operation
§ Relation between the value of a state and that of its successors in the state space
§ The Bellman equation results from additivity of utility + the Markov property
§ The optimal solution can be decomposed into overlapping sub-problems
§ Recursive state equations that need to be mutually consistent
§ Solutions for a sub-problem can be cached and reused

SLIDE 33

BACKUP DIAGRAMS (FOR V^π)

V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]    ∀s ∈ S

[Backup diagrams for V^π: state s at the root, averaging over successors s′, s″, s‴ with rewards R(·) and values V^π(·); shown for a deterministic policy and for a stochastic policy]

SLIDE 34

BACKUP DIAGRAMS

V^π(s) = Σ_{s′∈S} p(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ]    ∀s ∈ S

[Backup diagrams for V^π(s) under π(s): deterministic policy vs stochastic policy]

SLIDE 35

OPTIMAL STATE AND ACTION VALUE FUNCTIONS

V^π is the value of a policy π, but what we are looking for is the value (i.e., the expected utility) obtained from applying the best policy, π*. We need to find / compute the following functions:

§ V*(s) = highest possible expected utility from s:

V*(s) = max_π V^π(s)    ∀s ∈ S

§ Q*(s, a) = optimal action-value function:

Q*(s, a) = max_π Q^π(s, a)    ∀ s ∈ S, a ∈ A

SLIDE 36

OPTIMAL ACTION-VALUE EXAMPLE

§ Optimal action-values for choosing club = driver, and afterward selecting either the driver or the putter, whichever is better based on the optimal policy.