Markov Decision Processes
2/23/18
Recall: State Space Search Problems
- A set of discrete states
- A distinguished start state
- A set of actions available to the agent in each state
- An action function that, given a state and an action, returns a new state
Markov Decision Processes (MDPs)
An MDP is like a state space search problem, except that we can’t be sure which state will result from an action. Instead of an action function that, given a state and an action, returns a new state, an MDP has a transition function that, given a state and an action, returns a probability distribution over next states. The state that actually results from an action is determined randomly according to that distribution.
MDPs are named after the “Markov property”: if you know the state, you don’t need to remember history.
Actions might not have the outcome we expect, so we can’t know in advance which state we could end up in. Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions: whatever state we find ourselves in, the policy tells us which action to take.
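For concreteness, here is a minimal sketch of these two objects in Python, assuming a toy two-state MDP (the names T, policy, "s0", etc. are illustrative, not from the slides):

    # Transition function as a dict: T[(state, action)] is a probability
    # distribution over next states.
    T = {
        ("s0", "right"): {"s1": 0.8, "s0": 0.2},  # 80% intended, 20% slip
        ("s0", "stay"):  {"s0": 1.0},
        ("s1", "right"): {"s1": 1.0},
        ("s1", "stay"):  {"s1": 1.0},
    }

    # A policy maps every state to an action.
    policy = {"s0": "right", "s1": "stay"}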
[Figure: a grid world with a start cell and terminal “end” cells, one of which is worth +1.]
If actions were deterministic, we could solve this with state space search.
In this grid world, each action works correctly 80% of the time. The other 20% of the time, the agent moves in an unintended direction, e.g. it tries to go right but goes up instead.
We can’t just rely on a single plan, since we might end up in an unintended state. A policy is a function that maps every state to an action, so it tells us what to do no matter where we end up. We want policies that yield high reward (sometimes that means accepting low reward now to achieve high reward later, but not always). The agent’s goal is to maximize future reward. Because outcomes are random, we reason about expected reward: a probability-weighted average over state rewards.
The expected reward at time t is

E(R_t) = Σ_s Pr(s_t = s) · R(s)

where Pr(s_t = s) is the probability of being in state s at time t, and R(s) is the reward of state s.
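As a quick sketch, this expectation is a one-line computation if the distribution over states and the rewards are stored as dicts (dist and R here are illustrative names, matching the toy MDP above):

    # Expected reward at time t: a probability-weighted average over states.
    def expected_reward(dist, R):
        return sum(p * R[s] for s, p in dist.items())

    dist = {"s0": 0.2, "s1": 0.8}    # Pr(s_t = s)
    R = {"s0": 0.0, "s1": 1.0}       # R(s)
    print(expected_reward(dist, R))  # 0.8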
How do we trade off short-term vs. long-term reward? Key idea: reward now is better than reward later.
We discount future rewards by a factor γ^t, where 0 < γ < 1:

value now = γ^t · (reward at time-step t)
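For example, with γ = 0.9, a reward of 1 received three timesteps from now contributes γ^3 = 0.729 to the value now, while the same reward received immediately contributes the full 1.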
The value of a state under a policy depends on what the agent will do in the future. Value is the sum over all timesteps of the expected discounted reward at that timestep.
V^π(s) = Σ_{t=0}^{∞} γ^t · Σ_{s'} Pr_π(s_t = s' | s_0 = s) · R(s')
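One way to make this sum concrete is to estimate it by simulation. The following Monte Carlo sketch (an illustration, not a method from the slides) averages discounted returns over sampled trajectories, reusing the assumed T, policy, and R dicts from above:

    import random

    # Sample a next state from a {state: probability} dict.
    def sample_next(dist):
        r, acc = random.random(), 0.0
        for ns, p in dist.items():
            acc += p
            if r <= acc:
                return ns
        return ns  # guard against floating-point rounding

    # Estimate V^π(s): average the discounted return over many episodes,
    # truncating each at an arbitrary horizon (illustrative parameters).
    def estimate_value(s, policy, T, R, gamma=0.9, horizon=100, episodes=1000):
        total = 0.0
        for _ in range(episodes):
            state, discount = s, 1.0
            for _ in range(horizon):
                total += discount * R[state]  # γ^t · R(s_t)
                state = sample_next(T[(state, policy[state])])
                discount *= gamma
        return total / episodes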
If we knew the true value of any state, we could easily find the optimal policy: in each state, choose the action with the highest expected value over next states.
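As a sketch of that last step (assuming the T dict from above, plus an illustrative actions(s) helper listing the actions available in state s), the greedy policy just takes an argmax in each state:

    # Extract a greedy policy from state values.
    def greedy_policy(values, states, actions, T):
        return {
            s: max(actions(s),
                   key=lambda a: sum(p * values[ns]
                                     for ns, p in T[(s, a)].items()))
            for s in states
        }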
V*(s) = V^{π*}(s)

The optimal value of a state is its value under the optimal policy π*, and it can be written recursively in terms of the values of the next states s':

V*(s) = R(s) + γ · max_a Σ_{s'} Pr(s' | s, a) · V*(s')
We can iteratively approximate the value using dynamic programming.
# Value iteration (the slide’s pseudocode made runnable; assumes a set of
# states, an actions(s) helper, a reward function R(s), a discount gamma,
# and a transition dict T[(s, a)] -> {next_state: probability}).
values = {s: R(s) for s in states}
while True:
    prev = dict(values)                  # copy of values
    for s in states:
        best_EV = float("-inf")          # best expected value over actions
        for a in actions(s):
            EV = 0.0
            for ns, prob in T[(s, a)].items():
                EV += prob * prev[ns]
            best_EV = max(EV, best_EV)
        values[s] = R(s) + gamma * best_EV
    if values == prev:                   # until values don’t change
        break
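One note on the stopping test: because 0 < γ < 1 the values converge asymptotically, so in practice the loop usually stops when the largest change in any state’s value falls below a small tolerance, rather than waiting for exact equality.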