Online Planning
3/1/17
Q-Learning vs MCTS

Q-learning (dynamic programming):
- Update depends on prior estimates for other states:
  Q(s, a) ← α [R + γ V(s′)] + (1 − α) [old Q(s, a)]
- The update can happen as soon as R and s′ are known.
- Q estimates are kept for every (s, a) pair we encounter.

MCTS (backpropagation):
- Update uses all rewards from a full rollout:
  Q(s, a) ← average of Σ_{t=T}^{end} γ^(t−T) R_t and old Q(s, a)
- The update can only happen once the rollout finishes.
- Q estimates are kept only for nodes already in the tree.

Both converge to correct Q(s, a) estimates! Which method should we use in MCTS for MDPs?
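As a minimal sketch, here are the two updates in Python, assuming tabular estimates stored in dictionaries; the step size alpha and the visit-count table N used for the running average are my assumptions, not from the slides.

```python
gamma = 0.9   # discount factor
alpha = 0.1   # Q-learning step size (assumed value)

def q_learning_update(Q, V, s, a, R, s_next):
    # Dynamic programming: bootstrap off the prior estimate V(s') for another state.
    Q[(s, a)] = alpha * (R + gamma * V[s_next]) + (1 - alpha) * Q[(s, a)]

def mcts_backprop_update(Q, N, s, a, rewards, T):
    # Backpropagation: use all rewards from step T to the end of the rollout.
    rollout_value = sum(gamma ** (t - T) * rewards[t] for t in range(T, len(rewards)))
    # Fold the new rollout value into the old Q(s, a) as a running average.
    N[(s, a)] += 1
    Q[(s, a)] += (rollout_value - Q[(s, a)]) / N[(s, a)]
```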
Our approach to MDPs so far: learn the value model completely, then pick optimal actions.

Alternative approach: learn the (local) value model well enough to find a good action for the current state, take that action, then continue learning.

When is online reasoning a good idea?

Note: online learning (taking actions while you’re still learning) comes up in many machine learning contexts.
So far, we’ve been blurring an important distinction. Does the agent:
- take actions in the real world and learn from their consequences, or
- simulate actions and their consequences while deciding how to act?

Q-learning can be applied in either case. For online learning, we care about the difference.
Offline planning works when we know the full model (and can fit the value table in memory). Online planning is attractive when we don’t know the full set of possible states in advance.
In the online planning setting, every time we need to choose an action, we stop and think about it first. “Thinking about it” means simulating future actions to learn from their consequences.
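A rough sketch of that loop, under assumed interfaces: env, simulator, and the helper estimate_q_by_simulation are hypothetical names standing in for whatever simulation-based planner (e.g., MCTS) does the "thinking".

```python
def act_with_online_planning(env, simulator, num_rollouts=100):
    s = env.reset()
    done = False
    while not done:
        # "Thinking about it": simulate future actions from s to get local Q estimates.
        Q = estimate_q_by_simulation(simulator, s, num_rollouts)  # hypothetical helper
        # Take the action that currently looks best, then replan from the next state.
        a = max(Q, key=Q.get)  # Q maps actions available in s to value estimates
        s, reward, done = env.step(a)
```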
Observe sequence of (state, action) pairs and corresponding rewards.
Want to compute value (on the current rollout) for each (s, a) pair, then average with old values.

states:  [s0, s7, s3, s5]
actions: [a0, a0, a2, a1]
rewards: [0, -1, +2, 0, 0, +1, -1]

Compute values for the current rollout, with γ = 0.9.
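A short script for this exercise; variable names are mine, and I assume the rollout continued past the last state shown, which is why the reward list is longer than the state list.

```python
gamma = 0.9
states  = ["s0", "s7", "s3", "s5"]
actions = ["a0", "a0", "a2", "a1"]
rewards = [0, -1, +2, 0, 0, +1, -1]

# Value from step T is the discounted sum of remaining rewards: sum_{t>=T} gamma^(t-T) * R_t.
for T, (s, a) in enumerate(zip(states, actions)):
    value = sum(gamma ** (t - T) * R for t, R in enumerate(rewards[T:], start=T))
    print(f"rollout value for ({s}, {a}): {value:.4f}")
    # Each printed value would then be averaged with the old Q(s, a) estimate.
```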