Renewal Monte Carlo: Renewal theory based reinforcement learning


  1. Renewal Monte Carlo: Renewal theory based reinforcement learning. Jayakumar Subramanian and Aditya Mahajan. 57th IEEE Conference on Decision and Control, Miami Beach, FL, USA, December 17-19, 2018.

  2-5. RL has achieved considerable success… (Image credits: Towards Data Science, MIT Technology Review, Popular Science.)
 Salient features: ⊕ Model-free method. ⊕ Uses policy search.
 Limitation: ⊖ Learning is slow (takes ∼10^9 to 10^15 iterations to converge).
 Can we exploit features of the model to make it learn faster, without sacrificing generality?

  6-11. An RL problem can be formulated as… an agent interacting with an environment, modeled as an infinite-horizon Markov decision process (MDP). Model: state space, action space, transition probability, per-step reward. In RL, this model is unknown to the agent.
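
 To make the notation concrete, a minimal Python sketch of such an MDP follows; the class, the random_mdp helper, and all sizes are illustrative assumptions, not part of the talk. In the RL setting, the agent only observes the outputs of step, never P or r.

```python
import numpy as np

class MDP:
    """Illustrative infinite-horizon MDP: transition kernel P[a, s, s']
    and per-step reward r[s, a]."""

    def __init__(self, P, r, rng=None):
        self.P, self.r = P, r
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.state = 0

    def step(self, action):
        """Collect the per-step reward, then sample the next state."""
        reward = self.r[self.state, action]
        self.state = self.rng.choice(self.P.shape[1], p=self.P[action, self.state])
        return self.state, reward

def random_mdp(num_states=10, num_actions=3, seed=0):
    """Randomly generated instance (as in the example at the end of the talk)."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_actions, num_states, num_states))
    P /= P.sum(axis=-1, keepdims=True)   # normalize each row into a distribution
    r = rng.random((num_states, num_actions))
    return MDP(P, r, rng)
```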

  12-15. Policy parametrization: π_θ is a parametrized policy. Examples: the Gibbs (softmax) policy, and a neural network (NN) policy, where θ are the weights of the NN.
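
 A minimal sketch of the Gibbs (softmax) parametrization for the tabular case, assuming one parameter θ[s, a] per state-action pair (the names are mine); an NN policy would instead compute the logits with a network whose weights are θ:

```python
import numpy as np

def gibbs_policy(theta, state):
    """Gibbs (softmax) policy: pi_theta(a | s) proportional to exp(theta[s, a])."""
    logits = theta[state] - theta[state].max()   # shift by max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(theta, state, rng):
    """Sample an action from pi_theta(. | state)."""
    return rng.choice(theta.shape[1], p=gibbs_policy(theta, state))
```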

  16-23. Policy gradient. Performance: J(θ). Gradient estimate: ∇̂J(θ), an estimate of ∇_θ J(θ). Stochastic gradient ascent: θ_{n+1} = θ_n + α_n ∇̂J(θ_n), with step sizes satisfying Σ_n α_n = ∞ and Σ_n α_n² < ∞. How do we estimate ∇_θ J(θ)?
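
 In code, the ascent loop itself is generic; the methods on the next slide differ only in how grad_estimate is computed (a sketch under that assumption):

```python
def stochastic_gradient_ascent(theta, grad_estimate, num_iters=1000):
    """theta_{n+1} = theta_n + alpha_n * grad_estimate(theta_n).
    alpha_n = 1/n satisfies sum alpha_n = inf and sum alpha_n^2 < inf."""
    for n in range(1, num_iters + 1):
        theta = theta + (1.0 / n) * grad_estimate(theta)
    return theta
```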

  24-27. How to estimate ∇_θ J(θ)? Three standard options: a Monte Carlo estimate (REINFORCE; sketched below); an actor-critic estimate (temporal difference / SARSA); an actor-critic estimate with eligibility traces (SARSA-λ).
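
 As an example of the first option, a one-rollout Monte Carlo (REINFORCE) gradient estimate, reusing the MDP and softmax-policy sketches above; the horizon and discount factor are illustrative:

```python
import numpy as np

def reinforce_grad(mdp, theta, gamma=0.9, horizon=100, rng=None):
    """REINFORCE: grad J is estimated by sum_t gamma^t * G_t * grad log pi(a_t | s_t),
    where G_t is the discounted return from time t."""
    rng = rng if rng is not None else np.random.default_rng()
    states, actions, rewards = [], [], []
    s = mdp.state
    for _ in range(horizon):
        a = sample_action(theta, s, rng)
        states.append(s)
        actions.append(a)
        s, rew = mdp.step(a)
        rewards.append(rew)

    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(horizon)):
        G = rewards[t] + gamma * G                  # discounted return from t
        grad_log = -gibbs_policy(theta, states[t])  # d log pi / d theta[s_t, :]
        grad_log[actions[t]] += 1.0
        grad[states[t]] += (gamma ** t) * G * grad_log
    return grad
```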

  28-31. MC vs. TD.
 MC: ⊕ Unbiased. ⊕ Simple & easy to implement. ⊕ Handles discounted & average reward cases. ⊖ High variance. ⊖ End-of-episode updates. ⊖ Not asymptotically optimal for infinite horizon.
 TD: ⊕ Low variance. ⊕ Per-step updates. ⊕ Asymptotically optimal for infinite horizon. ⊖ Biased. ⊖ Often requires function approximation. ⊖ Additional effort for average reward.
 Can we get the best of both worlds?

  32-38. Renewal Monte Carlo. [Figure: a sample state trajectory over times 0-7, cut into cycles at successive returns to a designated renewal state.] Each return to the renewal state regenerates the process, so the cycles are i.i.d. By the renewal reward theorem, the average-reward performance is J(θ) = R(θ)/T(θ), the ratio of the expected per-cycle reward to the expected cycle length, estimated by the ratio of their sample means over observed cycles.
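
 A minimal sketch of this estimate, assuming state 0 is chosen as the renewal state and is revisited under the current policy (positive recurrence); it reuses the MDP and policy sketches above:

```python
import numpy as np

def estimate_performance(mdp, theta, renewal_state=0, num_cycles=100, rng=None):
    """Renewal estimate of the average reward: J(theta) is approximated by
    (sample mean of per-cycle reward) / (sample mean of cycle length),
    where a cycle runs between successive visits to the renewal state."""
    rng = rng if rng is not None else np.random.default_rng()
    mdp.state = renewal_state
    cycle_rewards, cycle_lengths = [], []
    for _ in range(num_cycles):
        total, length = 0.0, 0
        while True:
            a = sample_action(theta, mdp.state, rng)
            s, rew = mdp.step(a)
            total += rew
            length += 1
            if s == renewal_state:   # the process regenerates here
                break
        cycle_rewards.append(total)
        cycle_lengths.append(length)
    return np.mean(cycle_rewards) / np.mean(cycle_lengths)
```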

  39-45. RMC-based policy gradient. Performance: J(θ) = R(θ)/T(θ). Gradient: by the quotient rule, ∇_θ J(θ) = (∇_θ R(θ) − J(θ) ∇_θ T(θ)) / T(θ), with estimate ∇̂J built from per-cycle estimates of R, T, ∇R, and ∇T; the gradients ∇R and ∇T are estimated using MC or TD via the RL policy-gradient (likelihood-ratio) trick. Stochastic gradient ascent: θ_{n+1} = θ_n + α_n ∇̂J(θ_n), with Σ_n α_n = ∞ and Σ_n α_n² < ∞.
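
 Putting the pieces together, a simplified single-timescale sketch of an RMC-style update; the per-cycle likelihood-ratio estimators of ∇R and ∇T and all names are my assumptions, and the algorithm in the paper is more refined:

```python
import numpy as np

def rmc_cycle(mdp, theta, renewal_state, rng):
    """Run one renewal cycle; return its total reward R, its length T,
    and the score sum_t grad log pi(a_t | s_t) accumulated over the cycle."""
    mdp.state = renewal_state
    R, T = 0.0, 0
    score = np.zeros_like(theta)
    while True:
        s = mdp.state
        a = sample_action(theta, s, rng)
        score[s] -= gibbs_policy(theta, s)   # accumulate grad log pi
        score[s, a] += 1.0
        s_next, rew = mdp.step(a)
        R += rew
        T += 1
        if s_next == renewal_state:
            break
    return R, T, score

def rmc_update(mdp, theta, alpha, renewal_state=0, num_cycles=10, rng=None):
    """One step of theta <- theta + alpha * grad_J_hat, with
    grad_J_hat = (grad_R_hat - J_hat * grad_T_hat) / T_hat."""
    rng = rng if rng is not None else np.random.default_rng()
    cycles = [rmc_cycle(mdp, theta, renewal_state, rng) for _ in range(num_cycles)]
    R_hat = np.mean([R for R, T, sc in cycles])
    T_hat = np.mean([T for R, T, sc in cycles])
    J_hat = R_hat / T_hat
    grad_R = np.mean([R * sc for R, T, sc in cycles], axis=0)  # likelihood ratio
    grad_T = np.mean([T * sc for R, T, sc in cycles], axis=0)
    return theta + alpha * (grad_R - J_hat * grad_T) / T_hat
```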

  46-52. Convergence. Assume: R̂ and T̂ are unbiased estimators of R(θ) and T(θ); the resulting ∇̂J is an unbiased estimator of ∇_θ J(θ); it has bounded variance and ∇_θ J(θ) is continuous; and the limiting ODE has locally asymptotically stable isolated limit points. Then the iteration for θ converges a.s. to a value θ* where ∇_θ J(θ*) = 0.
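
 In symbols, the claim has the shape of a standard stochastic-approximation theorem; this rendering of the conditions is a reconstruction from the slide fragments, not a quotation:

```latex
% Reconstruction of the convergence conditions (stochastic approximation form)
\[
  \theta_{n+1} = \theta_n + \alpha_n \widehat{\nabla J}(\theta_n),
  \qquad \sum_n \alpha_n = \infty, \qquad \sum_n \alpha_n^2 < \infty,
\]
\[
  \mathbb{E}\bigl[\widehat{\nabla J}(\theta_n) \mid \theta_n\bigr]
     = \nabla_\theta J(\theta_n)
  \quad\text{(unbiased)},
  \qquad
  \sup_n \mathbb{E}\bigl[\lVert \widehat{\nabla J}(\theta_n) \rVert^2\bigr] < \infty.
\]
% If, in addition, the ODE  \dot{\theta} = \nabla_\theta J(\theta)  has locally
% asymptotically stable isolated limit points, then theta_n converges a.s. to
% a point theta* with  \nabla_\theta J(\theta^*) = 0.
```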

  53-56. Example: a randomly generated MDP. [Plot: performance (0 to 300) vs. number of samples (0 to 2×10^5), comparing the exact optimal performance (Exact) against S-0, S-0.25, S-0.5, S-0.75, S-1, RMC, and RMC-B.]

  57. Related work
