N-step bootstrapping
  1. N-step bootstrapping Robert Platt Northeastern University I’d love to use my experiences more efficiently...

  2. Motivation Left: path taken by the agent in a grid world; reward is zero everywhere except the goal state, where it is positive. Middle: 1-step SARSA updates only the penultimate state/action pair. Problem: standard Q-learning/SARSA propagates reward only one state back per time step – n-step bootstrapping is one way to address this problem – we will see other ways in subsequent slide decks.

  3. TD and MC are two extremes of a continuum

  4. TD and MC are two extremes of a continuum What are these?


  6. TD and MC are two extremes of a continuum Update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

  7. TD and MC are two extremes of a continuum Update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$ – the quantity $R_{t+1} + \gamma V(S_{t+1})$ is called the target of the update.
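The 1-step TD update and its target can be sketched in a few lines of Python (a minimal illustration; the list-of-values representation and the step-size values are assumptions, not from the slides):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V[s] toward the target r + gamma * V[s_next]."""
    target = r + gamma * V[s_next]   # the "target" of the update
    V[s] += alpha * (target - V[s])
    return V

# Tiny example: reward 1.0 on the transition from state 0 to state 1.
V = [0.0, 0.0]
td0_update(V, s=0, r=1.0, s_next=1)   # V[0] moves a step toward the target 1.0
```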

  8. TD and MC are two extremes of a continuum What’s the target for this one? Update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$ – at the Monte Carlo extreme, the target is the complete return $G_t$.

  9. TD and MC are two extremes of a continuum What’s the target for this one? Complete update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_{t:t+n} - V(S_t)]$ where the n-step target is $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$


  12. TD and MC are two extremes of a continuum What’s the target for this one? Complete update equation: $V(S_t) \leftarrow V(S_t) + \alpha\,[G_{t:t+n} - V(S_t)]$ Notice that you can’t do this update until time step t+3 (here n = 3) – the TD update happens on the next time step – the MC update happens at the end of the episode – the n-step TD update happens at time step t+n
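The n-step target described above can be written down directly (a sketch; the dict-of-values `V` and the discount value are illustrative assumptions):

```python
def n_step_return(rewards, V, s_tpn, gamma=0.9):
    """Compute G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n}
    + gamma^n * V(S_{t+n}), where `rewards` holds R_{t+1}..R_{t+n} and
    s_tpn is the state reached at time t+n."""
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return G + gamma ** len(rewards) * V[s_tpn]

# With n = 3 the target needs R_{t+1}..R_{t+3} and V(S_{t+3}), so it only
# becomes available at time step t+3 -- exactly the point made on the slide.
```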

  13. How well does this work? This comparison is for: – a 19-state random walk – n-step TD policy evaluation

  14. n-step TD algorithm
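The algorithm on this slide (pseudocode not reproduced in the transcript) can be sketched as follows, following the standard n-step TD prediction scheme. The environment interface (`reset()`/`step()` returning `(state, reward, done)`) and the toy `Chain` task are assumptions for illustration, not part of the slides:

```python
def n_step_td(env, policy, n=4, alpha=0.1, gamma=0.9, episodes=300):
    """n-step TD prediction sketch: store each episode's states and rewards,
    and update V(S_tau) once its n-step return becomes available."""
    V = [0.0] * env.num_states
    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[k] holds R_k (k >= 1)
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(policy(states[t]))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                       # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                   # bootstrap if episode not over
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

class Chain:
    """Toy task: walk right through states 0..4; reward 1 on reaching state 4."""
    num_states = 5
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += 1
        return self.s, float(self.s == 4), self.s == 4
```

On this 4-step chain with n = 4 the returns are complete before any bootstrapping is needed, so the estimates converge to the true discounted values.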

  15. n-step SARSA Same idea as in n-step TD – how is this backup diagram different from that of n-step TD? – why is it different?

  16. n-step SARSA Why does the backup start with a dot rather than a circle? Same idea as in n-step TD – how is this backup diagram different from that of n-step TD? – why is it different?

  17. n-step SARSA Left: path taken by the agent in a grid world; reward is zero everywhere except the goal state, where it is positive. Middle: 1-step SARSA updates only the penultimate state/action pair. Right: 10-step SARSA updates the last 10 state/action pairs.
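The behavior on the right panel can be sketched with the n-step SARSA target, which has the same form as the n-step TD target but bootstraps from Q(S_{t+n}, A_{t+n}) rather than V(S_{t+n}) — this is also why its backup diagram differs from n-step TD's. The dict-keyed Q table here is an illustrative assumption:

```python
def n_step_sarsa_target(rewards, Q, s_n, a_n, gamma=0.9):
    """G_{t:t+n}: discounted rewards R_{t+1}..R_{t+n} plus gamma^n * Q(S_{t+n}, A_{t+n})."""
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return G + gamma ** len(rewards) * Q[(s_n, a_n)]

def n_step_sarsa_update(Q, s, a, G, alpha=0.1):
    """Move Q(s, a) toward its n-step target."""
    Q[(s, a)] += alpha * (G - Q[(s, a)])
```

With n = 10, every state/action pair from the last 10 steps of the episode gets a target that already contains the goal reward, which is why the whole tail of the path is updated after a single episode.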
