SLIDE 1

Markov Decision Processes

2/23/18

SLIDE 2

Recall: State Space Search Problems

  • A set of discrete states
  • A distinguished start state
  • A set of actions available to the agent in each state
  • An action function that, given a state and an action, returns a new state

  • A set of goal states, often specified as a function
  • A way to measure solution quality
SLIDE 3

What if actions aren’t perfect?

  • We might not know exactly which next state will result from an action.
  • We can model this as a probability distribution over next states.

SLIDE 4

SLIDE 5

Search with Non-Deterministic Actions

  • A set of discrete states
  • A distinguished start state
  • A set of actions available to the agent in each state
  • An action function that, given a state and an action, returns a probability distribution over next states (rather than a single new state)
  • A set of goal states, often specified as a function
  • A way to measure solution quality
  • A set of terminal states
  • A reward function that gives a utility for each state

SLIDE 6

Markov Decision Processes (MDPs)

Named after the “Markov property”: if you know the current state, you don’t need to remember the history.

  • We still represent states and actions.
  • Actions no longer lead to a single next state.
  • Instead they lead to one of several possible states, determined randomly.

  • We’re now working with utilities instead of goals.
  • Expected utility works well for handling randomness.
  • We need to plan for unintended consequences.
  • We need to plan over an indefinite horizon.
  • Even an optimal agent may run forever!
SLIDE 7
State Space Search vs. MDPs

State Space Search:
  • States: S
  • Actions: A_s
  • Transition function: F(s, a) = s’
  • Start ∈ S
  • Goals ⊂ S
  • Action Costs: C(a)

MDPs:
  • States: S
  • Actions: A_s
  • Transition probabilities: P(s’ | s, a)
  • Start ∈ S
  • Terminal ⊂ S (can be empty)
  • State Rewards: R(s), or action costs: C(a)
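
As an illustration, here is a minimal sketch of how the MDP column might be represented in code; the dictionary layout and the tiny two-state example are assumptions made for this sketch, not part of the slides.

# A minimal MDP representation: states, per-state actions, transition
# probabilities P(s' | s, a), state rewards R(s), terminal states, and a
# discount factor gamma. The two-state example is purely illustrative.
mdp = {
    "states": ["A", "B"],
    "actions": {"A": ["go", "stay"], "B": []},    # B is terminal
    "P": {                                        # P[(s, a)] -> {s': prob}
        ("A", "go"):   {"B": 0.8, "A": 0.2},      # actions can fail
        ("A", "stay"): {"A": 1.0},
    },
    "R": {"A": 0.0, "B": 1.0},                    # state rewards R(s)
    "terminal": {"B"},                            # may be empty
    "gamma": 0.9,
}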

SLIDE 8

We can’t rely on a single plan!

Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in.

Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions.

  • For each state we could end up in, the policy tells us which action to take.

SLIDE 9

A simple example: Grid World

[Grid World figure: a small grid with a start cell and two end cells, one labeled +1 and one labeled −1]

If actions were deterministic, we could solve this with state space search.

  • (3,2) would be a goal state
  • (3,1) would be a dead end
SLIDE 10

A simple example: Grid World

[Grid World figure]

  • Suppose instead that the move we try to make only works correctly 80% of the time.
  • 10% of the time, we go in each perpendicular direction, e.g. try to go right, go up instead.
  • If impossible, stay in place.
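
A sketch of this transition model in code, assuming (x, y) grid coordinates and a caller-supplied is_valid check; both are assumptions made for illustration.

# The intended move works 80% of the time; each perpendicular direction
# happens 10% of the time; a blocked move leaves the agent where it is.
MOVES = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transition_probs(state, action, is_valid):
    """Return {next_state: probability} for taking `action` in `state`.
    `is_valid(cell)` reports whether a cell is on the grid and not a wall."""
    probs = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]
    for direction, p in outcomes:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if not is_valid(nxt):
            nxt = state            # impossible moves keep us in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
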
SLIDE 11

A simple example: Grid World

[Grid World figure]

  • Before, we had two equally-good alternatives.
  • Which path is better when actions are uncertain?
  • What should we do if we find ourselves in (2,1)?
SLIDE 12

New Objective: Find an optimal policy

We can’t just rely on a single plan, since we might end up in an unintended state. A policy is a function that maps every state to an action:

π(s) = a

We want policies that yield high reward.

  • In expectation (since transitions are random).
  • Over time (we may be willing to accept low reward now to achieve high reward later, but not always).
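
In code, a policy can be as simple as a lookup table from states to actions; the coordinates and actions below are hypothetical, purely for illustration.

# pi(s) = a: the policy prescribes one action for every state the agent
# might reach.
policy = {(0, 0): "up", (0, 1): "up", (0, 2): "right", (2, 1): "left"}

def pi(state):
    """Look up the action the policy prescribes for this state."""
    return policy[state]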

SLIDE 13

Expected Value

  • Since future states are uncertain, we can’t perfectly maximize future reward.
  • Instead, we maximize expected reward, a probability-weighted average over state rewards.

E(R_t) = Σ_s Pr(s_t = s) · R(s)

where E(R_t) is the expected reward at time t, Pr(s_t = s) is the probability of being in state s at time t, and R(s) is the reward of state s.
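
As a small worked example of this probability-weighted average in code (the state_probs argument, mapping each state to Pr(s_t = s), is a hypothetical input):

def expected_reward(state_probs, R):
    """E(R_t) = sum over s of Pr(s_t = s) * R(s)."""
    return sum(p * R[s] for s, p in state_probs.items())

# e.g. 70% chance of a state worth 0, 30% chance of a state worth 1 -> 0.3
print(expected_reward({"A": 0.7, "B": 0.3}, {"A": 0.0, "B": 1.0}))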

SLIDE 14

Discounting

How do we trade off short-term vs long-term reward? Key idea: reward now is better than reward later.

  • Rewards in the future are exponentially decayed.
  • Reward t steps in the future is discounted by γ^t, where 0 < γ < 1.

V = γ^t · R_t

where V is the value now and R_t is the reward at time-step t.
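
For instance, a quick sketch of how exponential discounting plays out over a stream of future rewards (the rewards and γ = 0.9 are made-up values):

def discounted_value(rewards, gamma=0.9):
    """Sum of gamma**t * R_t over a sequence of future rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 three steps from now is worth 0.9**3 = 0.729 today.
print(discounted_value([0, 0, 0, 1]))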

SLIDE 15

Value of a policy

  • Value depends on what state the agent is in.
  • Value depends on what the policy tells the agent to do in the future.

Value is the sum over all timesteps of the expected discounted reward at that timestep:

V^π(s) = Σ_{t=0}^{∞} γ^t · Σ_{s’} Pr_π(s_t = s’ | s_0 = s) · R(s’)
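
One standard way to compute V^π is iterative policy evaluation, sketched below under the same hypothetical dictionary layout used earlier; terminal states (those the policy assigns no action) are simply left at their immediate reward.

def policy_value(states, policy, P, R, gamma=0.9, sweeps=500):
    """Iteratively approximate V^pi(s) for a fixed policy.
    P[(s, a)] is a dict {next_state: probability}; R[s] is the state reward."""
    V = {s: R[s] for s in states}
    for _ in range(sweeps):
        prev = dict(V)
        for s in states:
            if s in policy:               # terminal states keep their reward
                V[s] = R[s] + gamma * sum(p * prev[ns]
                                          for ns, p in P[(s, policy[s])].items())
    return V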

SLIDE 16

Optimal Policy

  • The optimal policy is the one that maximizes value.
  • If we knew the optimal policy, we could easily find the true value of any state.
  • If we knew the true value of every state, we could easily find the optimal policy.

[Grid World figure]

V*(s) = V^π*(s)
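
The last bullet can be made concrete with a short sketch: given values V for every state, pick the action with the highest expected next-state value. Names and the dictionary layout are the same assumptions used in the earlier sketches.

def greedy_policy(states, actions, P, V, gamma=0.9):
    """Recover a policy from state values: in each non-terminal state,
    choose the action whose expected next-state value is highest."""
    policy = {}
    for s in states:
        if actions[s]:                    # terminal states need no action
            policy[s] = max(actions[s],
                            key=lambda a: sum(p * V[ns]
                                              for ns, p in P[(s, a)].items()))
    return policy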

SLIDE 17

Value Iteration

  • The value of state s depends on the value of other states s’.
  • The value of s’ may depend on the value of s.

We can iteratively approximate the values using dynamic programming.

  • Initialize all values to the immediate rewards.
  • Update each state’s value using the best expected next-state value.
  • Repeat until convergence (values don’t change).
SLIDE 18

Value Iteration Pseudocode

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Approximate state values by repeated Bellman backups.
    P[(s, a)] is a dict {next_state: probability}; R[s] is the state reward."""
    values = {s: R[s] for s in states}         # initialize to immediate rewards
    while True:
        prev = dict(values)                    # copy of last sweep's values
        for s in states:
            best_EV = float("-inf")
            for a in actions[s]:
                EV = sum(prob * prev[ns] for ns, prob in P[(s, a)].items())
                best_EV = max(EV, best_EV)
            if actions[s]:                     # terminal states keep R(s)
                values[s] = R[s] + gamma * best_EV
        if all(abs(values[s] - prev[s]) < tol for s in states):
            return values
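
Assuming the hypothetical mdp dictionary sketched earlier, the pieces could be wired together like this:

V = value_iteration(mdp["states"], mdp["actions"], mdp["P"], mdp["R"], mdp["gamma"])
pi = greedy_policy(mdp["states"], mdp["actions"], mdp["P"], V, mdp["gamma"])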