SLIDE 1

Markov Decision Processes

2/23/18

SLIDE 2

Recall: State Space Search Problems

  • A set of discrete states
  • A distinguished start state
  • A set of actions available to the agent in each state
  • An action function that, given a state and an action, returns a new state

  • A set of goal states, often specified as a function
  • A way to measure solution quality
SLIDE 3

What if actions aren’t perfect?

  • We might not know exactly which next state will result from an action.
  • We can model this as a probability distribution over next states.

SLIDE 4

SLIDE 5

Search with Non-Deterministic Actions

  • A set of discrete states
  • A distinguished start state
  • A set of actions available to the agent in each state
  • An action function that, given a state and an action, returns a probability distribution over next states (rather than a single new state)
  • A set of goal states, often specified as a function
  • A way to measure solution quality
  • A set of terminal states
  • A reward function that gives a utility for each state

SLIDE 6

Markov Decision Processes (MDPs)

Named after the “Markov property”: if you know the current state, you don’t need to remember the history.

  • We still represent states and actions.
  • Actions no longer lead to a single next state.
  • Instead they lead to one of several possible states, determined randomly.

  • We’re now working with utilities instead of goals.
  • Expected utility works well for handling randomness.
  • We need to plan for unintended consequences.
  • We need to plan over an indefinite horizon.
  • Even an optimal agent may run forever!
SLIDE 7
State Space Search vs. MDPs

State Space Search:
  • States: S
  • Actions: A_s
  • Transition function: F(s, a) = s’
  • Start ∈ S
  • Goals ⊂ S
  • Action Costs: C(a)

MDPs:
  • States: S
  • Actions: A_s
  • Transition probabilities: P(s’ | s, a)
  • Start ∈ S
  • Terminal ⊂ S (can be empty)
  • State Rewards: R(s), or action costs: C(a)
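
As an illustration, here is a minimal sketch of how the MDP column might be represented in code; the dictionary layout and the tiny two-state example are assumptions made for this sketch, not part of the slides.

# A minimal MDP representation: states, per-state actions, transition
# probabilities P(s' | s, a), state rewards R(s), terminal states, and a
# discount factor gamma. The two-state example is purely illustrative.
mdp = {
    "states": ["A", "B"],
    "actions": {"A": ["go", "stay"], "B": []},    # B is terminal
    "P": {                                        # P[(s, a)] -> {s': prob}
        ("A", "go"):   {"B": 0.8, "A": 0.2},      # actions can fail
        ("A", "stay"): {"A": 1.0},
    },
    "R": {"A": 0.0, "B": 1.0},                    # state rewards R(s)
    "terminal": {"B"},                            # may be empty
    "gamma": 0.9,
}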

SLIDE 8

We can’t rely on a single plan!

Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in.

Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions.

  • For each state we could end up in, the policy tells us which action to take.

SLIDE 9

A simple example: Grid World

[Grid World figure: a small grid with a start cell and two end cells, one labeled +1 and one labeled −1]

If actions were deterministic, we could solve this with state space search.

  • (3,2) would be a goal state
  • (3,1) would be a dead end
SLIDE 10

A simple example: Grid World

[Grid World figure]

  • Suppose instead that the move we try to make only works correctly 80% of the time.
  • 10% of the time, we go in each perpendicular direction, e.g. try to go right, go up instead.
  • If impossible, stay in place.
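
A sketch of this transition model in code, assuming (x, y) grid coordinates and a caller-supplied is_valid check; both are assumptions made for illustration.

# The intended move works 80% of the time; each perpendicular direction
# happens 10% of the time; a blocked move leaves the agent where it is.
MOVES = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transition_probs(state, action, is_valid):
    """Return {next_state: probability} for taking `action` in `state`.
    `is_valid(cell)` reports whether a cell is on the grid and not a wall."""
    probs = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]
    for direction, p in outcomes:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if not is_valid(nxt):
            nxt = state            # impossible moves keep us in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
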
SLIDE 11

A simple example: Grid World

[Grid World figure]

  • Before, we had two equally-good alternatives.
  • Which path is better when actions are uncertain?
  • What should we do if we find ourselves in (2,1)?
SLIDE 12

New Objective: Find an optimal policy

We can’t just rely on a single plan, since we might end up in an unintended state. A policy is a function that maps every state to an action:

π(s) = a

We want policies that yield high reward.

  • In expectation (since transitions are random).
  • Over time (we may be willing to accept low reward now to achieve high reward later, but not always).
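
In code, a policy can be as simple as a lookup table from states to actions; the coordinates and actions below are hypothetical, purely for illustration.

# pi(s) = a: the policy prescribes one action for every state the agent
# might reach.
policy = {(0, 0): "up", (0, 1): "up", (0, 2): "right", (2, 1): "left"}

def pi(state):
    """Look up the action the policy prescribes for this state."""
    return policy[state]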

SLIDE 13

Expected Value

  • Since future states are uncertain, we can’t perfectly maximize future reward.
  • Instead, we maximize expected reward, a probability-weighted average over state rewards.

E(R_t) = Σ_s Pr(s_t = s) · R(s)

where E(R_t) is the expected reward at time t, Pr(s_t = s) is the probability of being in state s at time t, and R(s) is the reward of state s.
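
As a small worked example of this probability-weighted average in code (the state_probs argument, mapping each state to Pr(s_t = s), is a hypothetical input):

def expected_reward(state_probs, R):
    """E(R_t) = sum over s of Pr(s_t = s) * R(s)."""
    return sum(p * R[s] for s, p in state_probs.items())

# e.g. 70% chance of a state worth 0, 30% chance of a state worth 1 -> 0.3
print(expected_reward({"A": 0.7, "B": 0.3}, {"A": 0.0, "B": 1.0}))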

SLIDE 14

Discounting

How do we trade off short-term vs long-term reward? Key idea: reward now is better than reward later.

  • Rewards in the future are exponentially decayed.
  • Reward t steps in the future is discounted by γ^t, where 0 < γ < 1.

V = γ^t · R_t

where V is the value now and R_t is the reward at time-step t.
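
For instance, a quick sketch of how exponential discounting plays out over a stream of future rewards (the rewards and γ = 0.9 are made-up values):

def discounted_value(rewards, gamma=0.9):
    """Sum of gamma**t * R_t over a sequence of future rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 three steps from now is worth 0.9**3 = 0.729 today.
print(discounted_value([0, 0, 0, 1]))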

SLIDE 15

Value of a policy

  • Value depends on what state the agent is in.
  • Value depends on what the policy tells the agent to do in the future.

Value is the sum over all timesteps of the expected discounted reward at that timestep:

V^π(s) = Σ_{t=0}^{∞} γ^t · Σ_{s’} Pr_π(s_t = s’ | s_0 = s) · R(s’)
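
One standard way to compute V^π is iterative policy evaluation, sketched below under the same hypothetical dictionary layout used earlier; terminal states (those the policy assigns no action) are simply left at their immediate reward.

def policy_value(states, policy, P, R, gamma=0.9, sweeps=500):
    """Iteratively approximate V^pi(s) for a fixed policy.
    P[(s, a)] is a dict {next_state: probability}; R[s] is the state reward."""
    V = {s: R[s] for s in states}
    for _ in range(sweeps):
        prev = dict(V)
        for s in states:
            if s in policy:               # terminal states keep their reward
                V[s] = R[s] + gamma * sum(p * prev[ns]
                                          for ns, p in P[(s, policy[s])].items())
    return V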

SLIDE 16

Optimal Policy

  • The optimal policy is the one that maximizes value.
  • If we knew the optimal policy, we could easily find the true value of any state.
  • If we knew the true value of every state, we could easily find the optimal policy.

[Grid World figure]

V*(s) = V^π*(s)
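
The last bullet can be made concrete with a short sketch: given values V for every state, pick the action with the highest expected next-state value. Names and the dictionary layout are the same assumptions used in the earlier sketches.

def greedy_policy(states, actions, P, V, gamma=0.9):
    """Recover a policy from state values: in each non-terminal state,
    choose the action whose expected next-state value is highest."""
    policy = {}
    for s in states:
        if actions[s]:                    # terminal states need no action
            policy[s] = max(actions[s],
                            key=lambda a: sum(p * V[ns]
                                              for ns, p in P[(s, a)].items()))
    return policy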

SLIDE 17

Value Iteration

  • The value of state s depends on the value of other states s’.
  • The value of s’ may depend on the value of s.

We can iteratively approximate the values using dynamic programming.

  • Initialize all values to the immediate rewards.
  • Update each state’s value using the best expected next-state value.
  • Repeat until convergence (values don’t change).
SLIDE 18

Value Iteration Pseudocode

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Approximate state values by repeated Bellman backups.
    P[(s, a)] is a dict {next_state: probability}; R[s] is the state reward."""
    values = {s: R[s] for s in states}         # initialize to immediate rewards
    while True:
        prev = dict(values)                    # copy of last sweep's values
        for s in states:
            best_EV = float("-inf")
            for a in actions[s]:
                EV = sum(prob * prev[ns] for ns, prob in P[(s, a)].items())
                best_EV = max(EV, best_EV)
            if actions[s]:                     # terminal states keep R(s)
                values[s] = R[s] + gamma * best_EV
        if all(abs(values[s] - prev[s]) < tol for s in states):
            return values
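
Assuming the hypothetical mdp dictionary sketched earlier, the pieces could be wired together like this:

V = value_iteration(mdp["states"], mdp["actions"], mdp["P"], mdp["R"], mdp["gamma"])
pi = greedy_policy(mdp["states"], mdp["actions"], mdp["P"], V, mdp["gamma"])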