
Reducing Sampling Error in Batch Temporal Difference Learning
Brahma S. Pavse¹, Ishan Durugkar¹, Josiah Hanna², Peter Stone¹ ³
¹The University of Texas at Austin  ²The University of Edinburgh  ³Sony AI
ICML, July 2020

1. Title slide: Reducing Sampling Error in Batch Temporal Difference Learning. Contact: brahmasp@cs.utexas.edu

2–5. Reinforcement Learning Successes (image slides; only the title survives in this transcript)

6–7. How can RL agents make the most of a finite amount of experience? By learning an accurate estimate of the value function from a finite amount of data.

8–10. Spotlight Overview
• With a finite batch of data, on-policy single-step temporal difference learning converges to the value function of the wrong policy.
• We propose a more efficient estimator and prove that it converges to the value function of the true policy.

11–15. Spotlight Overview: Flaw in Batch TD(0)
[Diagram: state s with action a1 (reward +30, leading to s1) and action a2 (reward +60, leading to s2); the true policy and true value function are shown alongside what batch TD(0) computes from a finite-sized batch.]
Given a finite-sized batch, batch TD(0) estimates the value function of the wrong policy. Our estimator estimates the value function of the true policy. A worked version of this example follows.
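To make the flaw concrete, suppose (an illustrative assumption on my part; the action probabilities live in the slide images, not in this transcript) that the true policy takes each action with probability 1/2 and that the finite batch happens to contain a1 twice and a2 once:

```latex
% True value under the assumed 50/50 policy vs. the batch TD(0) fixed point
% on a batch containing a1 twice and a2 once (illustrative numbers only).
\begin{align*}
  v_\pi(s) &= \tfrac{1}{2}(+30) + \tfrac{1}{2}(+60) = 45
    && \text{(true value function)} \\
  \hat{v}_{\mathrm{TD}(0)}(s) &= \tfrac{2}{3}(+30) + \tfrac{1}{3}(+60) = 40
    && \text{(value of the empirical, i.e.\ MLE, policy)}
\end{align*}
```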

16–21. Batch Linear* Value Function Learning
Setup: a policy interacting with the environment's transition dynamics generates a batch of m episodes, and from that batch we estimate the value function (the slide equations are reconstructed below).
Assumptions:
1. The policy is known (it is the policy we want to learn about).
2. The transition dynamics are unknown (model-free setting).
3. The reward function is unknown.
4. On-policy evaluation (the focus of this talk).
*Empirical analysis also considers non-linear TD(0).
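The equations on these slides do not survive as text; the following LaTeX is my reconstruction in standard notation and should be read as an assumption about what the slides show:

```latex
% Assumed standard notation for the batch linear value-estimation setup.
\begin{align*}
  &\text{Policy and transition dynamics:} &&
    a_t \sim \pi(\cdot \mid s_t), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t) \\
  &\text{Batch of } m \text{ episodes:} &&
    \mathcal{D} = \{h_i\}_{i=1}^{m}, \quad
    h_i = (s_0^i, a_0^i, r_0^i, s_1^i, a_1^i, r_1^i, \dots) \\
  &\text{Target value function:} &&
    v_\pi(s) = \mathbb{E}_\pi\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t
      \,\Big|\, s_0 = s\Big] \\
  &\text{Linear approximation:} &&
    \hat{v}_{\mathbf{w}}(s) = \mathbf{w}^{\top} \mathbf{x}(s), \quad
    \mathbf{x}(s) \in \mathbb{R}^d \text{ a fixed feature map}
\end{align*}
```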

22–29. Batch Linear* TD(0)
Batch TD(0) takes a fixed, finite batch as input and repeats the following until convergence:
1. For each transition in the batch, compute the TD error and accumulate the corresponding update.
2. Make one aggregated update to the weights.
3. Clear the accumulated update.
A runnable sketch of these steps follows this slide.
*Empirical analysis also considers non-linear TD(0).
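As a concrete illustration of these steps, here is a minimal Python sketch of batch linear TD(0). The function name, transition format, and hyperparameter values are my own choices for the example, not taken from the slides.

```python
import numpy as np

def batch_linear_td0(batch, x, w, gamma=0.99, alpha=0.01,
                     tol=1e-8, max_sweeps=100_000):
    """Batch linear TD(0): sweep a fixed, finite batch, accumulate the TD
    updates, apply them as one aggregated step, clear the accumulator, and
    repeat until the weights converge.

    batch: list of transitions (s, r, s_next, done)
    x:     feature map, x(s) -> np.ndarray of shape (d,)
    w:     initial weight vector of shape (d,)
    """
    w = np.asarray(w, dtype=float).copy()
    for _ in range(max_sweeps):                    # until convergence
        g = np.zeros_like(w)                       # accumulated update
        for s, r, s_next, done in batch:           # for each transition
            v_next = 0.0 if done else w @ x(s_next)
            delta = r + gamma * v_next - w @ x(s)  # TD error
            g += delta * x(s)                      # accumulate TD update
        w_new = w + alpha * g / len(batch)         # aggregated update (g is
        if np.linalg.norm(w_new - w) < tol:        # recreated next sweep,
            return w_new                           # i.e. cleared)
        w = w_new
    return w
```

With one-hot (tabular) features this reduces to tabular batch TD(0); the accumulate-then-update-then-clear structure mirrors the loop annotated on the slides.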

30–35. Batch TD(0) Value Function
Given a finite-sized batch, TD(0) converges to the certainty-equivalence estimate for the MDP*: the value function of the maximum-likelihood estimates (MLE) of the policy and transition dynamics computed from the batch (reconstructed below).
Problem: the MLE policy and transition dynamics carry sampling error.
*Sutton (1988) proved a similar result for a Markov reward process.
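In tabular notation (my reconstruction; the transcript does not preserve the slide's equations), the MLE quantities computed from the batch and the certainty-equivalence fixed point that batch TD(0) converges to are:

```latex
% Tabular MLE quantities computed from the batch D (assumed notation;
% #(.) denotes a count of occurrences in the batch).
\begin{align*}
  \hat{\pi}_{\mathcal{D}}(a \mid s) &= \frac{\#(s,a)}{\#(s)}, &
  \hat{P}_{\mathcal{D}}(s' \mid s,a) &= \frac{\#(s,a,s')}{\#(s,a)}, &
  \hat{r}_{\mathcal{D}}(s,a) &= \text{mean observed reward at } (s,a).
\end{align*}
% Batch TD(0)'s fixed point is the value function of this estimated MDP:
\[
  \hat{v}(s) = \sum_{a} \hat{\pi}_{\mathcal{D}}(a \mid s)
  \Big[ \hat{r}_{\mathcal{D}}(s,a)
  + \gamma \sum_{s'} \hat{P}_{\mathcal{D}}(s' \mid s,a)\, \hat{v}(s') \Big].
\]
```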

36–39. Policy Sampling Error in Batch TD(0)
[Diagram, repeating the earlier example: state s with action a1 (reward +30, leading to s1) and action a2 (reward +60, leading to s2); the true policy and true value function are contrasted with the MLE policy.]
Given a finite-sized batch, batch TD(0) computes the value function of the MLE policy rather than the true policy, so it estimates the value function of the wrong policy. The sketch below makes this concrete.
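Here is a minimal, self-contained Python sketch of the diagram's example, reusing the earlier illustrative assumptions (a 50/50 true policy and a batch containing a1 twice and a2 once). Reweighting each TD error by pi(a|s) / pi_mle(a|s) is shown as one way to correct the policy sampling error, in the spirit of the paper's proposed estimator; consult the paper for the actual algorithm.

```python
# One non-terminal state s; action a1 yields reward +30, action a2 yields
# +60, and both end the episode. Assumed true policy: 50/50 over the two
# actions, so v_pi(s) = 0.5 * 30 + 0.5 * 60 = 45.

batch = [("a1", 30.0), ("a1", 30.0), ("a2", 60.0)]    # a1 over-represented

pi = {"a1": 0.5, "a2": 0.5}                           # true policy
counts = {"a1": 0, "a2": 0}
for a, _ in batch:
    counts[a] += 1
pi_mle = {a: counts[a] / len(batch) for a in counts}  # MLE policy from batch

def batch_td0(correct_sampling_error, alpha=0.1, sweeps=10_000):
    """Tabular batch TD(0) on this one-state problem. If
    correct_sampling_error is True, each TD error is reweighted by
    pi(a|s) / pi_mle(a|s) to correct the policy sampling error."""
    v = 0.0                                 # value of s; successors terminal
    for _ in range(sweeps):                 # until convergence
        delta_sum = 0.0                     # accumulated TD errors
        for a, r in batch:
            rho = pi[a] / pi_mle[a] if correct_sampling_error else 1.0
            delta_sum += rho * (r - v)      # gamma * v(terminal) = 0
        v += alpha * delta_sum / len(batch) # aggregated update
    return v

print(round(batch_td0(False), 2))  # 40.0: value of the MLE policy (wrong)
print(round(batch_td0(True), 2))   # 45.0: value of the true policy
```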
