
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections



  1. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. Ofir Nachum,* Yinlam Chow,* Bo Dai, Lihong Li. Google Research. (*Equal contribution)

  2. Reinforcement Learning
     ● A policy π acts on an environment: an initial state s_0 is drawn from the initial state distribution β; at each step t, the policy samples an action a_t ~ π(·|s_t), the environment emits a reward r_t ~ R(·|s_t, a_t), and the next state follows s_{t+1} ~ T(·|s_t, a_t).
     ● Question: What is the value (average reward) of the policy?
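The value being asked about appears only as an equation image in the original slides; a reconstruction consistent with the paper's discounted formulation is:

```latex
% Average discounted per-step reward of policy \pi (reconstructed;
% the slide's own equation image does not survive the transcript)
\[
  \rho(\pi) = (1-\gamma)\,\mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t} r_{t}\Big],
  \qquad s_0 \sim \beta,\quad a_t \sim \pi(\cdot\,|\,s_t),\quad
  r_t \sim R(\cdot\,|\,s_t, a_t),\quad s_{t+1} \sim T(\cdot\,|\,s_t, a_t).
\]
```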

  3. Off-policy Policy Estimation
     ● Want to estimate the average discounted per-step reward ρ(π) of the policy.
     ● Only have access to a finite experience dataset of transitions (s, a, r, s′), where transitions are drawn from some unknown distribution.
     ● Don't even know the behavior policy!
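Stated formally (a reconstruction of the slide's missing equation, writing d^D for the unknown dataset distribution that the later slides name):

```latex
% Behavior-agnostic off-policy evaluation: given only the samples
\[
  \mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N},
  \qquad (s_i, a_i) \sim d^{D},\quad r_i \sim R(\cdot\,|\,s_i, a_i),\quad
  s'_i \sim T(\cdot\,|\,s_i, a_i),
\]
% estimate \rho(\pi) without access to the environment or the behavior policy.
```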

  4. Reduction of OPE to Density Ratio Estimation
     ● Can write ρ(π) as an expectation under d^π, the discounted on-policy distribution.
     ● Using the importance-weighting trick, this becomes an expectation under the dataset distribution d^D, weighted by the ratio d^π/d^D.
     ● Given a finite dataset, this corresponds to a weighted average of the observed rewards.
     ● Problem reduces to estimating the weights (density ratios), as made explicit below.
     ● Difficult because we don't have access to the environment, and we don't have explicit knowledge of d^D(s, a), only samples.
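The equations on these slides are images in the original; reconstructed from the paper's definitions, the chain of identities is:

```latex
\begin{align*}
  % discounted on-policy distribution
  d^{\pi}(s,a) &= (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,
      \Pr\big(s_t = s,\, a_t = a \,\big|\, s_0 \sim \beta,\ \pi\big), \\
  % value as an on-policy expectation, importance-weighted into an
  % off-policy one under the dataset distribution d^D
  \rho(\pi) &= \mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big]
    = \mathbb{E}_{(s,a,r)\sim d^{D}}\big[w(s,a)\, r\big],
  \qquad w(s,a) := \frac{d^{\pi}(s,a)}{d^{D}(s,a)}, \\
  % finite-sample weighted average
  \hat{\rho}(\pi) &= \frac{1}{N}\sum_{i=1}^{N} w(s_i, a_i)\, r_i.
\end{align*}
```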

  5. The DualDICE Objective
     ● Define the zero-reward Bellman operator, and optimize a two-part objective: minimize the squared Bellman error while maximizing the initial "nu-values".
       [Diagram: example states s_0 through s_4, with Bellman errors minimized along sampled transitions and nu-values maximized at initial states.]
     ● Nice: the objective is based on expectations under d^D, β, and π, all of which we have access to. (See the reconstruction below.)
     ● Extension 1: Can remove the appearance of the Bellman operator from both the objective and the solution by applying the Fenchel conjugate!
     ● Extension 2: Can generalize this result to any convex function (not just the square)!
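The slide's equations are again images; a reconstruction of the operator, the objective, its key optimality property, and the Fenchel min-max form of Extension 1, following the paper's notation:

```latex
\begin{align*}
  % zero-reward Bellman operator
  \mathcal{B}^{\pi}\nu\,(s,a) &= \gamma\,
      \mathbb{E}_{s' \sim T(\cdot|s,a),\ a' \sim \pi(\cdot|s')}\big[\nu(s', a')\big], \\
  % DualDICE objective: squared Bellman error minus initial nu-values
  \min_{\nu}\ J(\nu) &= \tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{D}}
      \big[\big(\nu - \mathcal{B}^{\pi}\nu\big)(s,a)^{2}\big]
      - (1-\gamma)\,\mathbb{E}_{s_0 \sim \beta,\ a_0 \sim \pi(\cdot|s_0)}
      \big[\nu(s_0, a_0)\big], \\
  % the minimizer's Bellman residual is exactly the desired ratio
  \big(\nu^{*} - \mathcal{B}^{\pi}\nu^{*}\big)(s,a) &= \frac{d^{\pi}(s,a)}{d^{D}(s,a)}. \\
  % Extension 1: the Fenchel conjugate of x^2/2 removes B^pi, giving
  \min_{\nu}\max_{\zeta}\
    &\mathbb{E}_{(s,a,s') \sim d^{D},\ a' \sim \pi(\cdot|s')}
      \big[\big(\nu(s,a) - \gamma\,\nu(s',a')\big)\,\zeta(s,a)
        - \tfrac{1}{2}\,\zeta(s,a)^{2}\big] \\
    &\quad - (1-\gamma)\,\mathbb{E}_{s_0 \sim \beta,\ a_0 \sim \pi(\cdot|s_0)}
      \big[\nu(s_0, a_0)\big],
  \qquad \zeta^{*}(s,a) = \frac{d^{\pi}(s,a)}{d^{D}(s,a)}.
\end{align*}
```

To make the saddle point concrete, here is a minimal tabular sketch in NumPy: simultaneous gradient descent on ν and ascent on ζ for the min-max objective above, followed by the weighted-average estimate from slide 4. The function name, full-batch gradients, and the small finite MDP are illustrative assumptions, not the authors' released code (the paper optimizes the same objective with neural networks for ν and ζ).

```python
import numpy as np

def dualdice_tabular(dataset, init_sa, pi, gamma,
                     n_states, n_actions, steps=20000, lr=0.1):
    """Estimate rho(pi) from off-policy data via the DualDICE min-max objective.

    dataset: list of (s, a, r, s_next) index tuples with (s, a) ~ d^D
    init_sa: list of (s0, a0) pairs with s0 ~ beta, a0 ~ pi(.|s0)
    pi:      (n_states, n_actions) array of target-policy probabilities
    """
    S, A, R, S2 = (np.array(x) for x in zip(*dataset))
    S0, A0 = (np.array(x) for x in zip(*init_sa))
    n, m = len(S), len(S0)
    nu = np.zeros((n_states, n_actions))    # primal variable
    zeta = np.zeros((n_states, n_actions))  # dual variable -> density ratio

    for _ in range(steps):
        # Bellman residual: nu(s,a) - gamma * E_{a'~pi(.|s')}[nu(s',a')]
        resid = nu[S, A] - gamma * (pi[S2] * nu[S2]).sum(axis=1)
        z = zeta[S, A]

        # Ascent on zeta: the objective is strongly concave in zeta,
        # with pointwise gradient (resid - zeta) under d^D.
        g_zeta = np.zeros_like(zeta)
        np.add.at(g_zeta, (S, A), (resid - z) / n)
        zeta += lr * g_zeta

        # Descent on nu: zeta enters through the residual term, and the
        # initial (s0, a0) pairs contribute the -(1 - gamma) "nu-value" mass.
        g_nu = np.zeros_like(nu)
        np.add.at(g_nu, (S, A), z / n)
        np.add.at(g_nu, S2, -gamma * pi[S2] * (z / n)[:, None])
        np.add.at(g_nu, (S0, A0), -(1.0 - gamma) / m)
        nu -= lr * g_nu

    # At the saddle point zeta(s,a) ~= d^pi(s,a) / d^D(s,a), so the
    # weighted average of observed rewards estimates rho(pi).
    return float(np.mean(zeta[S, A] * R)), zeta
```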

  6. DualDICE Results
     ● DualDICE accuracy during training compared to existing methods.
     ● Poster #205, East Exhibition Hall B+C.
