Online Planning, 3/1/17 (PowerPoint PPT presentation transcript)

SLIDE 1

Online Planning

3/1/17

SLIDE 2
Q-Learning vs MCTS

Q-Learning (dynamic programming):
  • Update depends on prior estimates for other states.
  • Updates immediately.
  • Try action a in state s, then update Q(s,a).

MCTS (backpropagation):
  • Update uses all rewards from a full rollout.
  • Updates after the rollout.
  • Save the path of (s,a) pairs; update when all rewards are known.

MCTS update (for a pair visited at time T, on a rollout ending at time "end"):

  Q(s, a) ← average of Σ_{t=T}^{end} γ^(t−T) R_t and old Q(s, a)

Q-learning update:

  Q(s, a) ← α [R + γ V(s′)] + (1 − α) [old Q(s, a)]

Both converge to correct Q(s,a) estimates!
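The two update rules above can be sketched side by side. This is a minimal illustration assuming dictionary-backed tables; the helper names, ALPHA value, and call signatures are assumptions, not from the slides:

```python
GAMMA = 0.9   # discount factor, matching the gamma in the slides
ALPHA = 0.1   # Q-learning step size (an assumed value)

def q_learning_update(Q, s, a, reward, next_value):
    """One-step backup: mix the new sample R + gamma*V(s') with the old estimate."""
    sample = reward + GAMMA * next_value
    Q[(s, a)] = ALPHA * sample + (1 - ALPHA) * Q.get((s, a), 0.0)

def mcts_backup(Q, counts, path, rewards):
    """Backpropagation: every saved (s, a) pair receives the discounted sum of
    ALL rewards from its step to the end of the rollout, averaged into Q."""
    G = 0.0
    for t in range(len(rewards) - 1, -1, -1):   # walk the rollout backwards
        G = rewards[t] + GAMMA * G              # G = return from step t onward
        if t < len(path):                       # only pairs saved on the path
            s, a = path[t]
            counts[(s, a)] = counts.get((s, a), 0) + 1
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + (G - old) / counts[(s, a)]
```

Note the structural difference: the Q-learning update touches one pair using a bootstrapped estimate V(s′), while the MCTS backup waits for the whole reward sequence and touches every saved pair.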

SLIDE 3

Demo: Q-Learning vs. MCTS

SLIDE 4

What about expansion?

  • In MCTS for game playing, we only update values for nodes already in the tree.
  • On each rollout we expanded exactly one node.
  • In Q-learning, we update values for every node we encounter.

Which method should we use in MCTS for MDPs?

  • Hint: either is appropriate under the right circumstances. What are those circumstances?
SLIDE 5

Online vs. Offline Decision-Making

Our approach to MDPs so far: learn the value model completely, then pick optimal actions.

Alternative approach: learn the (local) value model well enough to find a good action for the current state, take that action, then continue learning.

When is online reasoning a good idea?

Note: online learning (taking actions while you’re still learning) comes up in many machine learning contexts.
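The alternative approach can be sketched as a plan-act-replan loop. Here `step`, `is_terminal`, `candidate_actions`, and `evaluate` are hypothetical hooks for illustration, not an API from the slides:

```python
def act_online(step, is_terminal, candidate_actions, evaluate, s, budget=50):
    """Plan just well enough for the CURRENT state, act, then replan:
    simulate each candidate action `budget` times, take the best-looking
    one for real, and repeat from the state actually reached."""
    trajectory = [s]
    while not is_terminal(s):
        # "Learn the local value model well enough": score each action by
        # the average evaluation of the states its simulations reach.
        best_a = max(candidate_actions(s),
                     key=lambda a: sum(evaluate(step(s, a)[0])
                                       for _ in range(budget)) / budget)
        s, _ = step(s, best_a)          # take the action in the real world
        trajectory.append(s)
    return trajectory
```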

SLIDE 6

Simulated vs. Real World Actions

So far, we’ve been blurring an important distinction. Does the agent:

  • take actions in the world and learn from the consequences, or
  • simulate the effect of possible actions before deciding how to act?

Q-learning can be applied in either case. For online learning, we care about the difference.

SLIDE 7

Model Simulations

  • Value iteration is great when we know the whole model (and can fit the value table in memory).
  • Q-learning is great when we don’t know anything.
  • Simulation is a middle ground.
  • We might want to use simulation when:
    • We know the MDP, but it’s huge.
    • We have a function that generates successor states, but don’t know the full set of possible states in advance.
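The "function that generates successor states" case can be sketched as follows. The generative model `sample_successor` and the noisy-walk dynamics are purely illustrative assumptions:

```python
import random

def sample_successor(s, a):
    """Generative model: we can't enumerate the state set, but we CAN
    sample a next state and reward for any (s, a) we ask about."""
    s_next = s + a + random.choice([-1, 0, 1])   # stochastic transition
    return s_next, (1.0 if s_next == 0 else 0.0)

def estimate_q(s, a, n=500, gamma=0.9, depth=20):
    """Monte Carlo Q estimate using only the generative model:
    average discounted return over n truncated random rollouts."""
    total = 0.0
    for _ in range(n):
        state, r = sample_successor(s, a)
        ret, disc = 0.0, 1.0
        for _ in range(depth):
            ret += disc * r
            disc *= gamma
            state, r = sample_successor(state, random.choice([-1, 1]))
        total += ret
    return total / n
```

Nothing here needs the transition table or the full state set; that is what makes simulation a middle ground between value iteration and model-free Q-learning.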

SLIDE 8

MCTS for Online Planning

In the online planning setting, every time we need to choose an action, we stop and think about it first. “Thinking about it” means simulating future actions to learn from their consequences.

SLIDE 9

MCTS Review

  • Selection
    • Runs in the already-explored part of the state space.
    • Choose a random action, according to UCB weights.
  • Expansion
    • When we first encounter something unexplored.
    • Choose an unexplored action uniformly at random.
  • Simulation
    • After we’ve left the known region.
    • Select actions randomly according to the default policy.
  • Backpropagation
    • Update values for states visited in selection/expansion.
    • Average previous values with the value on the current rollout.
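The selection step above is commonly realized with UCB1. This sketch uses an argmax over UCB scores rather than weighted sampling; the exploration constant `c` and the table names `Q` and `N` are assumptions for illustration:

```python
import math

def ucb_select(Q, N, state, actions, c=1.4):
    """Pick an action in the explored region: value estimate plus an
    exploration bonus that shrinks as the (state, action) pair is visited."""
    total = sum(N.get((state, a), 0) for a in actions)
    def score(a):
        n = N.get((state, a), 0)
        if n == 0:
            return float('inf')          # try never-taken actions first
        return Q.get((state, a), 0.0) + c * math.sqrt(math.log(total) / n)
    return max(actions, key=score)
```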
SLIDE 10

Differences from game-playing MCTS

  • Learning state/action values instead of state values.
  • The next state is non-deterministic.
  • Simulation may never reach a terminal state.
  • There is no longer a tree structure to the states.
  • Non-terminal states can have rewards.
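Two of these differences (rollouts that may never terminate, rewards at non-terminal states) force a truncated, discounted rollout in the simulation phase. A minimal sketch, assuming hypothetical `step` and `policy` hooks:

```python
def rollout_return(step, policy, s, gamma=0.9, max_depth=100):
    """Simulation-phase return when episodes may not terminate: collect a
    (possibly nonzero) reward at EVERY step, discount it, and cut the
    rollout off at max_depth if no terminal state is reached."""
    ret, disc = 0.0, 1.0
    for _ in range(max_depth):
        s, r, done = step(s, policy(s))
        ret += disc * r
        disc *= gamma
        if done:
            return ret
    return ret                           # truncated: treat the tail as 0
```

Because γ < 1, the truncation error is bounded by γ^max_depth times the largest possible tail return, so a modest depth is usually enough.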
SLIDE 11

Online MCTS Value Backup

Observe a sequence of (state, action) pairs and the corresponding rewards.

  • Save (state, action, reward) during selection/expansion.
  • Save only the reward during simulation.

Want to compute the value (on the current rollout) for each (s,a) pair, then average with the old values.

states:  [s0, s7, s3, s5]
actions: [a0, a0, a2, a1]
rewards: [ 0, -1, +2,  0,  0, +1, -1]

Compute values for the current rollout, with γ = 0.9.
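Working through this example (4 saved pairs, 7 observed rewards, discount 0.9), the per-pair values can be computed with one backward pass; `rollout_values` is an illustrative helper name, not from the slides:

```python
def rollout_values(path_len, rewards, gamma=0.9):
    """Value of the t-th saved (s, a) pair on this rollout: the discounted
    sum of ALL rewards from step t to the end, including rewards earned
    during the simulation phase after the last saved pair."""
    G, returns = 0.0, [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):   # accumulate backwards
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns[:path_len]                   # one value per saved pair

# Values for (s0,a0), (s7,a0), (s3,a2), (s5,a1):
values = rollout_values(4, [0, -1, 2, 0, 0, 1, -1])
# approximately [0.779049, 0.86561, 2.0729, 0.081]
```

These are the per-rollout values that then get averaged into the stored Q(s,a) estimates during backpropagation.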

SLIDE 12

Demo: Online MCTS