MDPs and Value Iteration
2/20/17
Recall: State Space Search Problems
A set of discrete states
A distinguished start state
A set of actions available to the agent in each state
An action function that, given a state and an action, returns a new state

Markov Decision Processes (MDPs)
Like state space search, except actions are stochastic: several states could result from an action. Instead of an action function that, given a state and an action, returns a new state, an MDP has a transition function that, given a state and an action, returns a probability distribution over next states.
Named after the “Markov property”: if you know the current state, then you know the transition probabilities; the earlier history doesn't matter.
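As a concrete illustration (the dictionary layout and the example states below are my own, not from the slides), a transition function can be stored as a mapping from (state, action) pairs to distributions over next states:

# Illustrative sketch: a transition function T maps (state, action) pairs
# to a probability distribution over next states. States are (x, y) cells.
T = {
    ((0, 0), "right"): {(1, 0): 0.8, (0, 1): 0.1, (0, 0): 0.1},
    ((0, 0), "up"):    {(0, 1): 0.8, (1, 0): 0.1, (0, 0): 0.1},
    # ...one entry per (state, action) pair
}

# Markov property: this distribution depends only on the current state and
# action, never on the earlier history of states.
print(T[((0, 0), "right")])   # {(1, 0): 0.8, (0, 1): 0.1, (0, 0): 0.1}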
The next state is determined randomly, according to the transition probabilities.
Actions might not have the outcome we expect, so we can't know in advance which state we could end up in. Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions: whatever state we find ourselves in, the policy tells us which action to take.
[Figure: a grid world with a start state and two terminal end states, worth +1 and −1.]
If actions were deterministic, we could solve this with state space search.
But each action works correctly only 80% of the time. The other 20% of the time, the agent slips in a perpendicular direction (10% each way), e.g. try to go right, go up instead.
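A minimal sketch of this noise model in Python (consistent with the .8/.1/.1 probabilities used in the calculations below; the helper names are illustrative):

# Sketch of the slide's noise model: the intended move succeeds with
# probability 0.8; with probability 0.1 each, the agent slips sideways.
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def outcome_distribution(action):
    """Return {direction actually moved: probability} for an attempted action."""
    side1, side2 = PERPENDICULAR[action]
    return {action: 0.8, side1: 0.1, side2: 0.1}

print(outcome_distribution("right"))   # {'right': 0.8, 'up': 0.1, 'down': 0.1}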
The discount factor γ specifies how impatient the agent is. Key idea: reward now is better than reward later.
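For example, with discount γ = 0.9, a +1 reward received two steps from now is worth 0.9 · 0.9 = 0.81 today, while a +1 reward ten steps away is worth only 0.9^10 ≈ 0.35.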
We solve an MDP by determining a value for each state.
The value of a state is its immediate reward plus the discounted expected future reward:

V(s) = R(s) + γ · max_a E(s, a)

To take that max, we calculate the expected value of each action: the probability-weighted average of the values of the next states s':

E(s, a) = Σ_s' P(s' | s, a) · V(s')
We can iteratively approximate these values using dynamic programming (value iteration):
values = {state : R(state) for each state}
until values don't change:
    prev = copy of values
    for each state s:
        initialize best_EV
        for each action:
            EV = 0
            for each next state ns:
                EV += prob * prev[ns]
            best_EV = max(EV, best_EV)
        values[s] = R(s) + gamma*best_EV
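A runnable Python sketch of the same loop for the grid world above. The grid layout, the zero reward for non-terminal states, the stay-in-place behavior at walls and edges, and all names here are assumptions chosen to be consistent with the slides, not code taken from them:

# Value iteration sketch for the 4x3 grid world (assumed layout: wall at (1, 1),
# terminals +1 at (3, 2) and -1 at (3, 1), reward 0 elsewhere, discount 0.9).
GAMMA = 0.9
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}
TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) != (1, 1)]

def reward(s):
    return TERMINALS.get(s, 0.0)

def step(s, direction):
    """Where the agent ends up if it actually moves in `direction` from s."""
    dx, dy = MOVES[direction]
    ns = (s[0] + dx, s[1] + dy)
    return ns if ns in STATES else s   # bumping a wall or edge: stay put

def transition(s, a):
    """Distribution over next states when attempting action a in state s."""
    dist = {}
    for direction, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        ns = step(s, direction)
        dist[ns] = dist.get(ns, 0.0) + p
    return dist

values = {s: reward(s) for s in STATES}
while True:
    prev = dict(values)
    for s in STATES:
        if s in TERMINALS:
            continue   # terminal values stay fixed at their reward
        best_ev = max(sum(p * prev[ns] for ns, p in transition(s, a).items())
                      for a in MOVES)
        values[s] = reward(s) + GAMMA * best_ev
    if all(abs(values[s] - prev[s]) < 1e-6 for s in STATES):
        break

print(round(values[(2, 2)], 2))   # ~.85, matching the converged grid below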
[Figure: the grid world with discount = 0.9, values initialized to 0, and the +1 terminal marked.]
First iteration (showing the states adjacent to the terminals):

V(3, 0) = 0 + γ · max[ E((3, 0), u), E((3, 0), d), E((3, 0), l), E((3, 0), r) ]
V(2, 1) = 0 + γ · max[ E((2, 1), u), E((2, 1), d), E((2, 1), l), E((2, 1), r) ]
V(2, 2) = 0 + γ · max[ E((2, 2), u), E((2, 2), d), E((2, 2), l), E((2, 2), r) ]
Plugging in the transition probabilities and the values of the terminal states (discount = 0.9, all other values still 0):

V(3, 0) = γ · max[ .8 · −1 + .1 · 0 + .1 · 0,  .8 · 0 + .1 · 0 + .1 · 0,  .8 · 0 + .1 · 0 + .1 · −1,  .8 · 0 + .1 · −1 + .1 · 0 ] = 0
V(2, 1) = γ · max[ .8 · 0 + .1 · 0 + .1 · −1,  .8 · 0 + .1 · −1 + .1 · 0,  .8 · 0 + .1 · 0 + .1 · 0,  .8 · −1 + .1 · 0 + .1 · 0 ] = 0
V(2, 2) = γ · max[ .8 · 0 + .1 · 0 + .1 · 1,  .8 · 0 + .1 · 1 + .1 · 0,  .8 · 0 + .1 · 0 + .1 · 0,  .8 · 1 + .1 · 0 + .1 · 0 ] = .9 · .8 = .72

[Figure: the grid after one iteration; V(2, 2) = .72, next to the +1 terminal.]
[Figure: the grid after the next iteration (discount = 0.9): .5184 at (1, 2), .7848 at (2, 2), .4284 at (2, 1), beside the +1 terminal.]
Exercise: Continue value iteration
When values have converged, the optimal policy is to select the action with the highest expected value at each state.
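Continuing the sketch above (same illustrative names), the greedy policy is read off the converged values by taking, in each state, the action with the highest expected next-state value:

def extract_policy(values, states, actions, transition, terminals):
    """Greedy policy: in each non-terminal state, pick the action whose
    expected value under `values` is largest."""
    policy = {}
    for s in states:
        if s in terminals:
            continue   # no action is needed in a terminal state
        policy[s] = max(actions,
                        key=lambda a: sum(p * values[ns]
                                          for ns, p in transition(s, a).items()))
    return policy

# With the grid-world sketch above:
# policy = extract_policy(values, STATES, list(MOVES), transition, TERMINALS)
# policy[(2, 2)] -> 'right'  (head toward the +1 terminal)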
Converged values (discount = 0.9), with the +1 and −1 terminals and the blocked cell shown:

.64   .74   .85   +1
.57         .57   −1
.49   .43   .48   .28