

SLIDE 1

MDPs and Value Iteration

2/20/17

SLIDE 2

Recall: State Space Search Problems

  • A set of discrete states
  • A distinguished start state
  • A set of actions available to the agent in each state
  • An action function that, given a state and an action, returns a new state

  • A set of goal states, often specified as a function
  • A way to measure solution quality
SLIDE 3

What if actions aren’t perfect?

  • We might not know exactly which next state will result from an action.
  • We can model this as a probability distribution over next states.

SLIDE 4

Search with Non-Deterministic Actions

  • A set of discrete states
  • A distinguished start state
  • A set of actions available to the agent in each state
  • An action function that, given a state and an action, returns a probability distribution over next states (no longer a single new state)
  • A set of terminal states (in place of goal states)
  • A reward function that gives a utility for each state (in place of a single solution-quality measure)

SLIDE 5

Markov Decision Processes (MDPs)

Named after the “Markov property”: the transition probabilities depend only on the current state, not on the history of how you got there.

  • We still represent states and actions.
  • Actions no longer lead to a single next state.
  • Instead they lead to one of several possible states, determined randomly.

  • We’re now working with utilities instead of goals.
  • Expected utility works well for handling randomness.
  • We need to plan for unintended consequences.
  • Even an optimal agent may run forever!
SLIDE 6
State Space Search              MDPs
States: S                       States: S
Actions: A_s                    Actions: A_s
Transition function:            Transition probabilities:
  F(s, a) = s′                    P(s′ | s, a)
Start ∈ S                       Start ∈ S
Goals ⊂ S                       Terminal ⊂ S
Action costs: C(a)              State rewards: R(s)
                                (can also have action costs: C(a))
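
As a concrete sketch of the MDP column, these ingredients might look like the following in Python. The grid layout, names, and reward placement are illustrative assumptions borrowed from the upcoming Grid World example, not a fixed API:

    # Illustrative MDP ingredients (names and layout are assumptions).
    states = [(x, y) for x in range(4) for y in range(3)]  # 4x3 Grid World cells
    start = (0, 0)                                         # lower-left corner
    terminal = {(3, 2), (3, 1)}                            # the +1 and -1 exits
    actions = ['up', 'down', 'left', 'right']
    gamma = 0.9                                            # discount factor

    def R(s):
        """State reward: +1 and -1 at the exits, 0 everywhere else."""
        return {(3, 2): 1.0, (3, 1): -1.0}.get(s, 0.0)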

SLIDE 7

We can’t rely on a single plan!

Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in. Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions.

  • For each state we could end up in, the policy tells us which action to take (a tiny example follows).
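
For a finite state space, a policy can literally be a lookup table. A minimal hypothetical sketch (the entries here are made up):

    # A policy maps each state to the action to take there.
    policy = {(0, 0): 'up', (0, 1): 'up', (0, 2): 'right'}
    print(policy[(0, 1)])   # -> 'up'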

SLIDE 8

A simple example: Grid World

[Grid World figure: a 4×3 grid; terminal cells +1 at (3,2) and −1 at (3,1); start in the lower-left corner]

If actions were deterministic, we could solve this with state space search.

  • (3,2) would be a goal state
  • (3,1) would be a dead end
SLIDE 9

A simple example: Grid World

[Grid World figure, as before]

  • Suppose instead that the move we try to make only works correctly 80% of the time.
  • 10% of the time we go in each perpendicular direction instead, e.g. we try to go right but go up.
  • If the resulting move is impossible (e.g. it would leave the grid), we stay in place (a sketch of this model follows).
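
A sketch of this transition model in Python; the grid bounds and the blocked cell are assumptions read off the figure, not stated in the text:

    # 80% intended move, 10% each perpendicular move; blocked moves stay put.
    MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    PERP = {'up': ('left', 'right'), 'down': ('left', 'right'),
            'left': ('up', 'down'), 'right': ('up', 'down')}
    WALLS = {(1, 1)}   # assumed blocked cell in the 4x3 grid

    def move(s, direction):
        """Deterministic effect of one step, staying in place if blocked."""
        dx, dy = MOVES[direction]
        ns = (s[0] + dx, s[1] + dy)
        if ns in WALLS or not (0 <= ns[0] < 4 and 0 <= ns[1] < 3):
            return s   # off the grid or into a wall: stay in place
        return ns

    def P(s, a):
        """P(s' | s, a) as a dict {next_state: probability}."""
        dist = {}
        perp1, perp2 = PERP[a]
        for d, p in [(a, 0.8), (perp1, 0.1), (perp2, 0.1)]:
            ns = move(s, d)
            dist[ns] = dist.get(ns, 0.0) + p
        return dist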
SLIDE 10

A simple example: Grid World

[Grid World figure, as before]

  • Before, we had two equally-good alternatives.
  • Which path is better when actions are uncertain?
  • What should we do if we find ourselves in (2,1)?
SLIDE 11

Discount Factor

Specifies how impatient the agent is. Key idea: reward now is better than reward later.

  • Rewards in the future are exponentially decayed.
  • A reward t steps in the future is discounted by γ^t.
  • Why do we need a discount factor?

U = γ^t · R_t
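
One standard answer to the question above: with γ = 0.9, a reward of 1 received three steps from now is worth 0.9³ = 0.729 today, and the same reward 20 steps out is worth only about 0.12. The discount also keeps utilities finite even if the agent runs forever, since the geometric series Σ_t γ^t · R_max converges whenever γ < 1.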

SLIDE 12

Value of a State

  • To come up with an optimal policy, we start by determining a value for each state.
  • The value of a state is reward now, plus discounted future reward:
  • Assume we’ll do the best thing in the future.

V(s) = R(s) + γ · [future value]

SLIDE 13

Future Value

  • If we know the value of other states, we can calculate the expected value of each action:

E(s, a) = Σ_{s′} P(s′ | s, a) · V(s′)

  • Future value is the expected value of the best action:

future value = max_a E(s, a)
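
In Python, these two definitions might look like the following sketch, assuming P returns a {next_state: probability} dict as in the earlier transition sketch:

    def expected_value(s, a, V, P):
        """E(s, a): probability-weighted value of the possible next states."""
        return sum(p * V[ns] for ns, p in P(s, a).items())

    def future_value(s, V, P, actions):
        """Future value: the expected value of the best available action."""
        return max(expected_value(s, a, V, P) for a in actions)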

SLIDE 14

Value Iteration

  • The value of state s depends on the value of other states s’.
  • The value of s’ may depend on the value of s.

We can iteratively approximate the value using dynamic programming.

  • Initialize all values to the immediate rewards.
  • Update each state’s value using the expected value of its best action.
  • Repeat until convergence (values don’t change).
SLIDE 15

Value Iteration Pseudocode

    # Assumes P(s, a) returns a {next_state: probability} dict, as sketched earlier.
    values = {s: R(s) for s in states}       # initialize to immediate rewards
    while True:                              # until values don't change
        prev = values.copy()
        for s in states:
            if s in terminal:                # terminal states keep V = R(s)
                continue
            best_EV = float('-inf')          # initialize best_EV
            for a in actions:
                # EV: expected value of action a under the previous values
                EV = sum(p * prev[ns] for ns, p in P(s, a).items())
                best_EV = max(EV, best_EV)
            values[s] = R(s) + gamma * best_EV
        if values == prev:                   # converged: no value changed
            break
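
One practical note: with floating-point values, exact equality between iterations may never occur, so implementations typically stop instead when the largest change across all states falls below a small tolerance ε.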

SLIDE 16

Value Iteration on Grid World

discount γ = .9

[Grid World figure: terminal cells +1 and −1; all other values start at 0]

V(3, 0) = 0 + γ · max[ E((3,0), u), E((3,0), d), E((3,0), l), E((3,0), r) ]
V(2, 1) = 0 + γ · max[ E((2,1), u), E((2,1), d), E((2,1), l), E((2,1), r) ]
V(2, 2) = 0 + γ · max[ E((2,2), u), E((2,2), d), E((2,2), l), E((2,2), r) ]

SLIDE 17

Value Iteration on Grid World

discount γ = .9

[Grid World figure after one update: V(2,2) = .72; terminal cells +1 and −1]

V(3, 0) = γ · max[ .8·(−1) + .1·0 + .1·0,   .8·0 + .1·0 + .1·0,   .8·0 + .1·0 + .1·(−1),   .8·0 + .1·(−1) + .1·0 ]
V(2, 1) = γ · max[ .8·0 + .1·0 + .1·(−1),   .8·0 + .1·(−1) + .1·0,   .8·0 + .1·0 + .1·0,   .8·(−1) + .1·0 + .1·0 ]
V(2, 2) = γ · max[ .8·0 + .1·0 + .1·1,   .8·0 + .1·1 + .1·0,   .8·0 + .1·0 + .1·0,   .8·1 + .1·0 + .1·0 ]
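
Reading off the maxima: for (3,0) and (2,1) the best option is 0, so both values stay 0. For (2,2) the best action is to move right, with expected value .8·1 + .1·0 + .1·0 = .8, giving V(2,2) = .9 · .8 = .72, the value shown in the figure.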

SLIDE 18

Value Iteration on Grid World

discount γ = .9

[Grid World figure after two updates: V(1,2) = .5184, V(2,2) = .7848, V(2,1) = .4284; terminal cells +1 and −1]

Exercise: Continue value iteration

SLIDE 19

What do we do with the values?

When values have converged, the optimal policy is to select the action with the highest expected value at each state.

  • What should we do if we find ourselves in (2,1)?

[Grid World figure with converged values:
   .64   .74   .85    +1
   .57    ##   .57    −1
   .49   .43   .48   .28
 ## = blocked cell, inferred from the figure]
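
As a sketch of that final step, reusing the names assumed in the earlier snippets (states, actions, terminal, P, and the converged values dict):

    def extract_policy(values, states, actions, terminal, P):
        """Pick the action with the highest expected value in each state."""
        policy = {}
        for s in states:
            if s in terminal:
                continue   # no action is taken in a terminal state
            policy[s] = max(actions,
                            key=lambda a: sum(p * values[ns]
                                              for ns, p in P(s, a).items()))
        return policy

Consistent with the figure, this answers the earlier question about (2,1): assuming the blocked middle cell, moving up gives E = .8·.85 + .1·.57 + .1·(−1) ≈ .64, the best of the four actions, and indeed .9 · .64 ≈ .57 matches the value shown for (2,1).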