N-step bootstrapping
Robert Platt, Northeastern University

"I'd love to use my experiences more efficiently..."
Motivation
Problem: standard Q-Learning/SARSA “propagates rewards” only one state back per time step
– n-step bootstrapping is one way to address this problem
– we will see other ways in subsequent slide decks
Left: path taken by the agent in a grid world; the reward is zero everywhere except at the goal state, where it is positive. Middle: 1-step SARSA updates only the penultimate state/action pair. Right: n-step SARSA updates the last n state/action pairs along the path.
TD and MC are two extremes of a continuum

[Figure: backup diagrams for 1-step TD, 2-step TD, 3-step TD, ..., n-step TD, ..., and Monte Carlo]

Update equation: V(S_t) ← V(S_t) + α [ G − V(S_t) ]
– the quantity G is called the target of the update

Target for 1-step TD: G_{t:t+1} = R_{t+1} + γ V(S_{t+1})
Target for Monte Carlo: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{T−t−1} R_T
Target for n-step TD: G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n V(S_{t+n})

Complete update equation: V(S_t) ← V(S_t) + α [ G_{t:t+n} − V(S_t) ]
Notice that for n = 3 you can’t do this update until time step t+3
– the 1-step TD update happens on the next time step
– the MC update happens at the end of the episode
– the n-step TD update happens at time step t+n
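To make the update timing concrete, here is a minimal Python sketch of n-step TD policy evaluation (prediction). The env/policy interface, the name n_step_td_prediction, and the assumption that states are integer indices are illustrative choices, not part of the slides; the point is that the estimate for time tau = t − n + 1 is only updated once the n-th subsequent reward is available.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes, num_states):
    """Sketch of n-step TD prediction: V is updated toward the n-step return.
    Assumes env.reset() -> state, env.step(action) -> (state, reward, done),
    and that states are integers usable as indices into V."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]            # states[t] holds S_t
        rewards = [0.0]                   # rewards[t] holds R_t (index 0 unused)
        T = float('inf')                  # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                next_state, reward, done = env.step(policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1               # the time whose estimate can now be updated
            if tau >= 0:
                # n-step target: up to n discounted rewards plus a bootstrapped value
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```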
How well does this work?
This comparison is for:
– a 19-state random walk task
– n-step TD policy evaluation
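A minimal sketch of that setup, reusing the hypothetical n_step_td_prediction function from the earlier sketch. The RandomWalk19 class, the −1/+1 terminal rewards, and the specific α, n, and episode counts below are assumptions for illustration, not values taken from the slides; the published comparison presumably averages many runs and sweeps α, whereas the single run here only shows the moving parts.

```python
import numpy as np

class RandomWalk19:
    """19-state random walk (sketch): states 1..19, start at 10, move left or
    right with equal probability; stepping off the left end gives reward -1,
    off the right end gives +1, and all other transitions give 0."""
    def reset(self):
        self.state = 10
        return self.state

    def step(self, action):               # the action is ignored: the walk is random
        self.state += np.random.choice([-1, 1])
        if self.state == 0:
            return self.state, -1.0, True
        if self.state == 20:
            return self.state, 1.0, True
        return self.state, 0.0, False

def uniform_policy(state):
    return None                           # actions are irrelevant in this task

env = RandomWalk19()
true_values = np.arange(-20, 22, 2) / 20.0  # true V: -0.9 ... 0.9 on the 19 interior states

for n in [1, 2, 4, 8]:                    # illustrative values of n
    V = n_step_td_prediction(env, uniform_policy, n=n, alpha=0.2,
                             gamma=1.0, num_episodes=10, num_states=21)
    rms = np.sqrt(np.mean((V[1:20] - true_values[1:20]) ** 2))
    print(f"n={n:2d}  RMS error over the 19 non-terminal states: {rms:.3f}")
```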