  1. Value Function Methods. CS 285, Instructor: Sergey Levine, UC Berkeley

  2. Recap: actor-critic. The recurring loop: generate samples (i.e. run the policy), fit a model to estimate the return, improve the policy.

  3. Can we omit policy gradient completely? Forget policies, let's just do this: generate samples (i.e. run the policy), fit a model to estimate the return, improve the policy.
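
A restatement of the idea (my paraphrase of the slide, using the lecture's advantage notation): if we have the advantage of the current policy, the improved policy needs no explicit representation, it is just the argmax:

```latex
\pi'(a_t \mid s_t) =
\begin{cases}
1, & \text{if } a_t = \arg\max_{a_t} A^{\pi}(s_t, a_t) \\
0, & \text{otherwise}
\end{cases}
```

This greedy policy is at least as good as \pi, which is why the explicit policy-gradient step can be dropped.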

  4. Policy iteration. High-level idea: alternate between evaluating the current policy (fit a model to estimate the return) and improving the policy by acting greedily with respect to that estimate. The question: how do we do the evaluation step?

  5. Dynamic programming. [Gridworld example: a 4x4 table of state values.] Bootstrapped policy evaluation: just use the current estimate of the value function here.
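
A sketch of the backup the slide is illustrating, assuming known dynamics p(s' | s, a) and, for simplicity, a deterministic policy \pi(s):

```latex
V^{\pi}(s) \leftarrow r\big(s, \pi(s)\big) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}\!\left[ V^{\pi}(s') \right]
```

On the right-hand side we just use the current estimate of V^\pi rather than unrolling the recursion.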

  6. Policy iteration with dynamic programming: plug the bootstrapped evaluation into the loop (generate samples, fit a model to estimate the return, improve the policy). [Same 4x4 gridworld value table.]
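
A minimal tabular sketch of the procedure, assuming known dynamics and rewards stored as arrays `P[s, a, s']` and `R[s, a]` (these names and the sweep counts are my own, not from the lecture):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_sweeps=50, iters=100):
    """Tabular policy iteration: bootstrapped evaluation + greedy improvement.

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)          # deterministic policy
    V = np.zeros(S)                      # current value estimate
    for _ in range(iters):
        # policy evaluation: repeatedly apply the bootstrapped backup
        for _ in range(eval_sweeps):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
        # policy improvement: act greedily with respect to the implied Q
        Q = R + gamma * P @ V            # shape (S, A)
        pi = Q.argmax(axis=1)
    return pi, V
```

The inner loop is the bootstrapped evaluation from the previous slide; the outer loop is the greedy improvement.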

  7. Even simpler dynamic programming. Same loop (generate samples, fit a model to estimate the return, improve the policy), but the backup takes the max over actions directly, which approximates the new value!
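
Written out (my paraphrase), the even simpler update skips the explicit policy and backs up the max directly:

```latex
V(s) \leftarrow \max_{a} \left[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\!\left[ V(s') \right] \right]
```

The max over actions plays the role of the greedy improvement, so it approximates the value of the new policy without ever representing that policy.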

  8. Fitted Value Iteration & Q-Iteration

  9. Fitted value iteration: represent the value function with a function approximator (fit a model to estimate the return) instead of a table, which avoids the curse of dimensionality.
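
A sketch of the fitted version under the same known-dynamics assumption, with \phi denoting the parameters of a neural network V_\phi (the squared-error loss form is an illustration, not necessarily the exact slide):

```latex
y_i \leftarrow \max_{a_i} \left[ r(s_i, a_i) + \gamma \, \mathbb{E}\!\left[ V_\phi(s_i') \right] \right],
\qquad
\phi \leftarrow \arg\min_{\phi} \tfrac{1}{2} \sum_i \big\| V_\phi(s_i) - y_i \big\|^2
```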

  10. What if we don't know the transition dynamics? The max requires knowing the outcomes of different actions! Back to policy iteration: the Q-function used for evaluation can be fit using samples.
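
The sample-friendly evaluation the slide points back to, in symbols (my paraphrase): evaluate Q^\pi instead of V^\pi, so the quantity inside the expectation can be estimated from sampled transitions without a dynamics model:

```latex
Q^{\pi}(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\!\left[ Q^{\pi}\big(s', \pi(s')\big) \right]
```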

  11. Can we do the “max” trick again? Forget the policy, compute the value directly. Can we do this with Q-values as well, without knowing the transitions? + doesn't require simulation of actions; + works even for off-policy samples (unlike actor-critic); + only one network, no high-variance policy gradient; - no convergence guarantees for non-linear function approximation (more on this later).

  12. Fitted Q-iteration
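
A minimal sketch of the full fitted Q-iteration loop in PyTorch. The dataset layout, hyperparameters, and function name are my own assumptions, not the course code; `q_net` is any `nn.Module` mapping a batch of states to one Q-value per action.

```python
import torch

def fitted_q_iteration(q_net, dataset, num_iters=100, inner_steps=50,
                       gamma=0.99, lr=1e-3):
    """dataset: tensors (s, a, r, s_next, done); a holds integer action indices."""
    s, a, r, s_next, done = dataset
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(num_iters):
        # 1. set targets y <- r + gamma * max_a' Q(s', a'); no gradient through them
        with torch.no_grad():
            y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
        # 2. regress Q(s, a) onto the fixed targets
        for _ in range(inner_steps):
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = 0.5 * (q_sa - y).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_net
```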

  13. Review. • Value-based methods: don't learn a policy explicitly, just learn a value or Q-function. • If we have the value function, we have a policy. • Fitted Q-iteration.

  14. From Q-Iteration to Q-Learning

  15. Why is this algorithm off-policy? Fitted Q-iteration works from a dataset of transitions; neither the targets nor the fit depend on which policy collected that data.

  16. What is fitted Q-iteration optimizing? It is reducing the Bellman error of the Q-function, but most guarantees are lost when we leave the tabular case (e.g., when we use neural networks).
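
In symbols (a hedged reconstruction of the definition), fitted Q-iteration is reducing the Bellman error under whatever distribution \beta the dataset was collected from:

```latex
\mathcal{E} \;=\; \tfrac{1}{2}\, \mathbb{E}_{(s,a) \sim \beta}\!\left[ \Big( Q_\phi(s, a) - \big[ r(s, a) + \gamma \max_{a'} Q_\phi(s', a') \big] \Big)^{2} \right]
```

If the error is driven to zero, Q_\phi is the optimal Q-function; away from zero, and outside the tabular case, little can be guaranteed.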

  17. Online Q-learning algorithms: the same loop (generate samples by running the policy, fit a model to estimate the return, improve the policy), but with one transition and one update at a time. The method is off-policy, so there are many choices for how to collect samples!
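
A minimal tabular sketch of the online version, assuming a Gymnasium-style discrete environment; the epsilon-greedy rule in step 1 is just one of the many possible choices, and all names and constants here are mine:

```python
import numpy as np

def online_q_learning(env, num_steps=100_000, gamma=0.99, alpha=0.1, eps=0.1):
    """Tabular online Q-learning with epsilon-greedy data collection."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    s, _ = env.reset()
    for _ in range(num_steps):
        # 1. take one action and observe one transition (off-policy collection)
        a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        # 2. one-sample target, 3. one update toward it
        target = r + gamma * (0.0 if terminated else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next if not (terminated or truncated) else env.reset()[0]
    return Q
```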

  18. Exploration with Q-learning. The final policy is the greedy argmax policy; why is using it for step 1 (collecting data) a bad idea? Common remedies: “epsilon-greedy” and “Boltzmann exploration.” We'll discuss exploration in detail in a later lecture!
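
Sketches of the two exploration rules named on the slide; the epsilon and temperature values are placeholders, not recommendations from the lecture:

```python
import numpy as np

def epsilon_greedy(q_values, eps=0.1):
    """With probability eps pick a random action, otherwise the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```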

  19. Review. • Value-based methods: don't learn a policy explicitly, just learn a value or Q-function; if we have the value function, we have a policy. • Fitted Q-iteration: a batch-mode, off-policy method. • Q-learning: the online analogue of fitted Q-iteration.

  20. Value Functions in Theory

  21. Value function learning theory. [Gridworld example: a 4x4 table of state values under value iteration.]

  22. Value function learning theory (continued). [Same 4x4 gridworld example.]
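
The objects behind these two slides, reconstructed from my reading of the lecture (treat the notation as approximate): the Bellman backup operator and its contraction property in the infinity norm,

```latex
\mathcal{B}V = \max_{a}\big( r_a + \gamma\, \mathcal{T}_a V \big),
\qquad
\big\| \mathcal{B}V - \mathcal{B}\bar{V} \big\|_{\infty} \le \gamma\, \big\| V - \bar{V} \big\|_{\infty},
```

where \mathcal{T}_a is the transition matrix for action a. Since V* is the fixed point of \mathcal{B}, every backup shrinks the infinity-norm distance to V*, which is why tabular value iteration converges.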

  23. Non-tabular value function learning

  24. Non-tabular value function learning. Conclusions: value iteration converges in the tabular case; fitted value iteration does not converge, not in general, and often not in practice.
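
The argument in symbols (my reconstruction): fitted value iteration composes the backup with a projection \Pi onto the set \Omega of value functions the approximator can represent,

```latex
V \leftarrow \Pi \mathcal{B} V,
\qquad
\Pi V = \arg\min_{V' \in \Omega} \big\| V' - V \big\|^2 .
```

\mathcal{B} is a contraction in the infinity norm and \Pi is a contraction in the least-squares norm, but they contract in different norms, so the composition \Pi \mathcal{B} is not in general a contraction in any norm, and the iteration can diverge.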

  25. What about fitted Q-iteration? The same argument applies, and it applies also to online Q-learning.

  26. But… it's just regression! Not quite: Q-learning is not gradient descent, because no gradient flows through the target value.
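
A minimal PyTorch illustration of the point (names are mine): the target is computed with the same network but held constant, so the update ignores the target's dependence on the parameters and is not gradient descent on any fixed objective:

```python
import torch

def q_learning_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """The regression target is computed without tracking gradients."""
    with torch.no_grad():        # <- the "no gradient through target value"
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return 0.5 * (q_sa - target).pow(2).mean()
```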

  27. A sad corollary, plus an aside regarding terminology.

  28. Review. • Value iteration theory: an operator for the backup and an operator for the projection; the backup is a contraction, so value iteration converges. • Convergence with function approximation: the projection is also a contraction, but projection + backup together is not a contraction, so fitted value iteration does not in general converge. • Implications for Q-learning: Q-learning, fitted Q-iteration, etc. do not converge with function approximation, but we can make them work in practice! Sometimes; tune in next time.
