Data-Efficient Policy Evaluation through Behavior Policy Search

  1. Data-Efficient Policy Evaluation through Behavior Policy Search. Josiah Hanna¹, Philip Thomas², Peter Stone¹, Scott Niekum¹. ¹ University of Texas at Austin; ² University of Massachusetts, Amherst. August 8th, 2017.

  2. Policy Evaluation

  3. Outline
     1. Demonstrate that importance-sampling policy evaluation can outperform on-policy policy evaluation.
     2. Show how to improve the behavior policy for importance-sampling policy evaluation.
     3. Empirically evaluate (1) and (2).

  4. Background. Finite-horizon MDP. The agent selects actions with a stochastic policy $\pi$. The policy and the environment determine a distribution over trajectories $H: S_0, A_0, R_0, S_1, A_1, R_1, \ldots, S_L, A_L, R_L$.

  5. Policy Evaluation. Policy performance:
$$\rho(\pi) := \mathbf{E}\!\left[\sum_{t=0}^{L} \gamma^t R_t \,\middle|\, H \sim \pi\right].$$
Given a target policy $\pi_e$, estimate $\rho(\pi_e)$. Let $\pi_e \equiv \pi_{\theta_e}$.

  6. Monte Carlo Policy Evaluation. Given a dataset $\mathcal{D}$ of trajectories where $\forall H \in \mathcal{D},\ H \sim \pi_e$:
$$\mathrm{MC}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{H_i \in \mathcal{D}} \sum_{t=0}^{L} \gamma^t R_t^{(i)}.$$
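A minimal Python sketch of this estimator, assuming trajectories are stored as plain reward sequences (the function and argument names here are illustrative, not from the talk):

```python
import numpy as np

def monte_carlo_estimate(trajectories, gamma=1.0):
    """On-policy Monte Carlo estimate of rho(pi_e).

    `trajectories` is a list of reward sequences [R_0, ..., R_L], each
    collected by running the target policy pi_e itself.
    """
    returns = []
    for rewards in trajectories:
        rewards = np.asarray(rewards, dtype=float)
        discounts = gamma ** np.arange(len(rewards))   # gamma^t for t = 0..L
        returns.append(np.sum(discounts * rewards))    # discounted return of one trajectory
    return float(np.mean(returns))                     # average over |D| trajectories
```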

  7. Example with two actions: Action 1 yields +100 and Action 2 yields +1. The target policy $\pi_e$ samples the high-reward first action with probability 0.01. Monte Carlo evaluation of $\pi_e$ has high variance. Importance sampling with a behavior policy that samples either action with equal probability gives a low-variance evaluation.
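To make the variance gap concrete, here is a small simulation of this example treated as a one-step problem. The 0.01/0.99 action probabilities, the +100/+1 rewards, and the uniform behavior policy come from the slide; the sample size and random seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                              # number of one-step trajectories (arbitrary)

rewards = np.array([100.0, 1.0])         # action 1 -> +100, action 2 -> +1
pi_e = np.array([0.01, 0.99])            # target policy from the slide

# Monte Carlo: run the target policy itself.
mc_actions = rng.choice(2, size=n, p=pi_e)
mc_returns = rewards[mc_actions]

# Importance sampling: run a uniform behavior policy and re-weight each return.
pi_b = np.array([0.5, 0.5])
is_actions = rng.choice(2, size=n, p=pi_b)
is_returns = (pi_e[is_actions] / pi_b[is_actions]) * rewards[is_actions]

print(mc_returns.mean(), mc_returns.var())   # both means are ~1.99 ...
print(is_returns.mean(), is_returns.var())   # ... but variance drops from ~97 to ~1e-4
```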

  8. Importance-Sampling Policy Evaluation (Precup, Sutton, and Singh, 2000). Given a dataset $\mathcal{D}$ of trajectories where each $H_i \in \mathcal{D}$ is sampled from a behavior policy $\pi_i$:
$$\mathrm{IS}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{H_i \in \mathcal{D}} \underbrace{\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_i(A_t \mid S_t)}}_{\text{re-weighting factor}} \sum_{t=0}^{L} \gamma^t R_t^{(i)}.$$
For convenience, for a single trajectory:
$$\mathrm{IS}(H, \pi) := \prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi(A_t \mid S_t)} \sum_{t=0}^{L} \gamma^t R_t.$$
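A sketch of this (ordinary, full-trajectory-weight) estimator in Python, assuming each logged transition stores the behavior policy's action probability and that the target policy is available as a callable `pi_e(state, action)`; the data layout and names are illustrative:

```python
import numpy as np

def importance_sampling_estimate(dataset, pi_e, gamma=1.0):
    """Off-policy IS estimate of rho(pi_e).

    `dataset` is a list of trajectories; each trajectory is a sequence of
    (state, action, reward, behavior_prob) tuples, where behavior_prob is
    pi_i(A_t | S_t) recorded when the trajectory was collected.
    `pi_e(state, action)` returns the target policy's probability of `action`.
    """
    estimates = []
    for trajectory in dataset:
        weight, ret = 1.0, 0.0
        for t, (s, a, r, behavior_prob) in enumerate(trajectory):
            weight *= pi_e(s, a) / behavior_prob   # re-weighting factor
            ret += (gamma ** t) * r                # discounted return
        estimates.append(weight * ret)             # IS(H_i, pi_i)
    return float(np.mean(estimates))
```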

  9. The Optimal Behavior Policy. Importance sampling can achieve zero mean-squared-error policy evaluation with only a single trajectory! We cannot analytically determine this policy: it requires $\rho(\pi_e)$ to be known, requires the reward function to be known, and requires deterministic transitions.
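The slide does not show why such a policy exists; the following is a sketch of the standard zero-variance importance-sampling argument, under the added assumption that every trajectory return is strictly positive:

```latex
% Let g(H) := \sum_{t=0}^{L} \gamma^t R_t and assume g(H) > 0 for all H.
% If a behavior policy \pi^* induces the trajectory distribution
\Pr(H \mid \pi^*) \;=\; \frac{\Pr(H \mid \pi_e)\, g(H)}{\rho(\pi_e)},
% then every single trajectory H \sim \pi^* gives the exact answer:
\mathrm{IS}(H, \pi^*) \;=\; \frac{\Pr(H \mid \pi_e)}{\Pr(H \mid \pi^*)}\, g(H) \;=\; \rho(\pi_e).
% Writing \pi^* down requires \rho(\pi_e) (the very quantity being estimated)
% and the reward function (to evaluate g(H)); realizing this trajectory
% distribution by changing only action probabilities further requires
% deterministic transitions.
```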

  10. Behavior Policy Search. Adapt the behavior policy towards the optimal behavior policy. At each iteration $i$:
      1. Choose behavior policy parameters $\theta_i$ based on all observed data $\mathcal{D}$.
      2. Sample $m$ trajectories $H \sim \pi_{\theta_i}$ and add them to the dataset $\mathcal{D}$.
      3. Estimate $\rho(\pi_e)$ with the trajectories in $\mathcal{D}$.
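A minimal Python sketch of this loop. All callables and the choice to initialize the behavior policy at the target policy's parameters are illustrative assumptions, not specified on the slide:

```python
def behavior_policy_search(theta_e, sample_trajectories, update_theta,
                           estimate_rho, m=10, iterations=100):
    """Sketch of the behavior policy search loop.

    The callables are placeholders supplied by the caller:
      sample_trajectories(theta, m): run pi_theta for m episodes and return them
      update_theta(theta, data):     choose the next behavior policy from all data
      estimate_rho(data):            off-policy estimate of rho(pi_e), e.g. importance sampling
    """
    theta = theta_e      # assumption: start the behavior policy at the target policy
    data = []            # D: all observed trajectories (with their behavior policies)
    estimates = []
    for _ in range(iterations):
        theta = update_theta(theta, data)              # 1. choose theta_i from D
        data.extend(sample_trajectories(theta, m))     # 2. sample m trajectories, add to D
        estimates.append(estimate_rho(data))           # 3. estimate rho(pi_e) from D
    return estimates
```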

  11. Behavior Policy Gradient. Key idea: adapt the behavior policy parameters $\theta$ with gradient descent on the mean squared error of the importance-sampling estimator:
$$\theta_{i+1} = \theta_i - \alpha \frac{\partial}{\partial \theta} \mathrm{MSE}[\mathrm{IS}(H_i, \theta)].$$
$\mathrm{MSE}[\mathrm{IS}(H, \theta)]$ is not computable, but $\frac{\partial}{\partial \theta} \mathrm{MSE}[\mathrm{IS}(H, \theta)]$ is.

  12. Behavior Policy Gradient Theorem.
$$\frac{\partial}{\partial \theta} \mathrm{MSE}(\mathrm{IS}(H, \theta)) = \mathbf{E}_{\pi_\theta}\!\left[ -\mathrm{IS}(H, \theta)^2 \sum_{t=0}^{L} \frac{\partial}{\partial \theta} \log \pi_\theta(A_t \mid S_t) \right].$$
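A sketch of how the theorem yields a sample-based gradient for the update on the previous slide, assuming the caller has already computed, for each trajectory sampled from $\pi_\theta$, its IS value and the summed score-function terms. The function and argument names are illustrative:

```python
import numpy as np

def bpg_gradient_estimate(is_values, score_sums):
    """Sample-based estimate of d/dtheta MSE[IS(H, theta)] via the theorem above.

    is_values[i]:  IS(H_i, theta) for trajectory H_i sampled from pi_theta
    score_sums[i]: sum over t of d/dtheta log pi_theta(A_t | S_t) for H_i
                   (one entry per policy parameter)
    """
    is_values = np.asarray(is_values, dtype=float)        # shape (n,)
    score_sums = np.asarray(score_sums, dtype=float)      # shape (n, d)
    per_trajectory = -(is_values ** 2)[:, None] * score_sums
    return per_trajectory.mean(axis=0)                    # shape (d,)

def bpg_step(theta, is_values, score_sums, alpha=1e-3):
    """One update: theta_{i+1} = theta_i - alpha * (estimated gradient of MSE)."""
    return np.asarray(theta) - alpha * bpg_gradient_estimate(is_values, score_sums)
```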

  13. Empirical Results: Cartpole Swing-up and Acrobot (results plots).

  14. GridWorld Results: high-variance policy and low-variance policy (results plots).
