

SLIDE 1

Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator

Josiah Hanna and Peter Stone

Department of Computer Science The University of Texas at Austin

SLIDE 2


50 million actions taken
Millions of games
21 days, 1.5 years of compute

SLIDE 3

Can reinforcement learning be data-efficient enough for real-world applications?


SLIDE 4


Reinforcement Learning

Learn a policy that maps the world state to an action that maximizes long-term utility.


SLIDE 5

Reinforcement Learning

[Driving example: Crash with probability 0.15 and reward −100, or Reach Destination with probability 0.85 and reward +1.]

$v(\pi_\theta) = \mathbb{E}[Q^{\pi_\theta}(S, A)]$

"How good is taking action A in state S"

$v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$

$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$

Unknown: $\Pr(s \mid \pi_\theta)$. Known: $\pi_\theta(a \mid s)$.

SLIDE 6

Policy Gradient Reinforcement Learning

$v(\pi_\theta) = \mathbb{E}[Q^{\pi_\theta}(S, A)]$

$\nabla_\theta v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)$

$\nabla_\theta v(\pi_\theta) = \mathbb{E}[Q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S)]$

$\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i)$

Unknown: $\Pr(s \mid \pi_\theta)$. Known: $\pi_\theta(a \mid s)$.
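The step from the summation form to the score-function expectation uses the likelihood-ratio identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. A short derivation of that step (standard material, not from the slides; the state distribution is treated as fixed, which the policy gradient theorem justifies):

```latex
\begin{align*}
\nabla_\theta v(\pi_\theta)
  &= \sum_s \Pr(s \mid \pi_\theta) \sum_a Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) \\
  &= \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\,
     \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= \mathbb{E}\!\left[ Q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right].
\end{align*}
```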

SLIDE 7


Monte Carlo Policy Gradient

1. Execute the current policy for m steps.
2. Update the policy with the Monte Carlo policy gradient estimate (see the sketch after this list).
3. Throw away the observed data and repeat (on-policy).
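A minimal sketch of these steps for a tabular softmax policy. The `(s, a, q)` batch format, with `q` a Monte Carlo return standing in for $Q^{\pi_\theta}(s, a)$, is my assumption for illustration, not the authors' exact setup:

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy, where theta is
    an [n_states, n_actions] array of action preferences."""
    prefs = theta[s]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def mc_policy_gradient(theta, batch):
    """Vanilla Monte Carlo estimate:
    (1/m) * sum_i q_i * grad log pi_theta(a_i | s_i)."""
    grad = np.zeros_like(theta)
    for s, a, q in batch:
        probs = softmax_policy(theta, s)
        # grad_theta log pi(a|s) for a softmax policy: indicator(a) - pi(.|s)
        glog = -probs
        glog[a] += 1.0
        grad[s] += q * glog
    return grad / len(batch)
```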
SLIDE 8

Sampling Error

[The driving example again: the true probabilities are 0.15 (Crash, reward −100) and 0.85 (Reach Destination, reward +1), but the sampled proportions are 0.1 and 0.9.]

For a finite amount of data, it may appear that the wrong policy generated the data.
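To make this concrete, here is a toy simulation (my own, using the example's numbers): draw 20 outcomes from the true 0.15/0.85 distribution, and the observed proportions will often look like 0.1/0.9.

```python
import numpy as np

rng = np.random.default_rng(0)
p_crash = 0.15                        # true probability under the policy
crashes = rng.random(20) < p_crash    # 20 sampled trajectories
print(crashes.mean())                 # observed proportion, often not 0.15
# A naive Monte Carlo estimate weights the crash outcome by the observed
# proportion rather than the known probability 0.15, so with finite data
# it behaves as if a different policy had generated the data.
```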

SLIDE 9


Correcting Sampling Error

Pretend the data was generated by the policy that most closely matches the observed data.

Correct the weight on each state-action pair toward the policy that we actually took actions with.

$\pi_\phi = \operatorname{argmax}_{\phi'} \sum_{i=1}^{m} \log \pi_{\phi'}(a_i \mid s_i)$

$\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\pi_\theta(A_i \mid S_i)}{\pi_\phi(A_i \mid S_i)}\, Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i)$

The ratio $\pi_\theta / \pi_\phi$ is the importance sampling correction.
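For discrete states and actions, the maximum-likelihood policy $\pi_\phi$ reduces to the observed action frequencies, so the correction is a ratio of a known probability to an observed proportion. A hedged sketch (the function names and `(s, a, q)` batch format are mine):

```python
from collections import Counter

def empirical_policy(batch):
    """MLE policy pi_phi for discrete (s, a): the observed frequency
    of each action among the visits to its state."""
    sa_counts = Counter((s, a) for s, a, _ in batch)
    s_counts = Counter(s for s, _, _ in batch)
    return {(s, a): n / s_counts[s] for (s, a), n in sa_counts.items()}

def importance_weights(pi_theta, batch):
    """Per-sample corrections pi_theta(a|s) / pi_phi(a|s), where
    pi_theta(s, a) returns the current policy's probability."""
    pi_phi = empirical_policy(batch)
    return [pi_theta(s, a) / pi_phi[(s, a)] for s, a, _ in batch]
```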

SLIDE 10


Is this method on-policy or off-policy?


On-policy: can only use data from the current policy.
Off-policy: can use data from any policy.
Our method pretends on-policy data is off-policy data and uses importance sampling to correct!

SLIDE 11


Sampling Error Corrected Policy Gradient

1. Execute the current policy for m steps.
2. Estimate the empirical policy with maximum likelihood estimation.
3. Update the policy with the Sampling Error Corrected (SEC) policy gradient estimate (sketched after this list).
4. Throw away the data and repeat (on-policy).
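Putting steps 2 and 3 together for the tabular softmax policy sketched earlier; again a minimal illustration under the same assumptions, not the paper's reference implementation:

```python
import numpy as np
from collections import Counter

def sec_policy_gradient(theta, batch):
    """SEC estimate: reweight each sample by pi_theta(a|s) / pi_phi(a|s)
    before averaging, where pi_phi is the MLE (empirical) policy."""
    # Step 2: empirical policy from the observed batch.
    sa = Counter((s, a) for s, a, _ in batch)
    sc = Counter(s for s, _, _ in batch)
    # Step 3: importance-weighted Monte Carlo gradient.
    grad = np.zeros_like(theta)
    for s, a, q in batch:
        e = np.exp(theta[s] - theta[s].max())
        probs = e / e.sum()
        w = probs[a] / (sa[(s, a)] / sc[s])   # pi_theta / pi_phi
        glog = -probs                         # grad log pi for softmax
        glog[a] += 1.0
        grad[s] += w * q * glog
    return grad / len(batch)
```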
SLIDE 12


Empirical Results

GridWorld: discrete states and actions

SLIDE 13


Empirical Results

Cartpole: continuous state and discrete actions

SLIDE 14


Related Work


1. Expected SARSA (van Seijen et al. 2009).
2. Expected Policy Gradients (Ciosek and Whiteson 2018).
3. Estimated Propensity Scores (Hirano et al. 2003, Li et al. 2015).
4. Much related work outside of RL and bandits: black-box importance sampling (Liu and Lee 2017), Bayesian Monte Carlo (Ghahramani and Rasmussen 2003).

SLIDE 15


1. Any Monte Carlo method will have sampling error with finite data.
2. Sampling error can slow down learning in policy gradient methods.
3. We introduced the sampling error corrected policy gradient estimator to address this problem.
4. A similar approach can be used for other Monte Carlo estimators, for example on- and off-policy policy evaluation (Josiah Hanna, Scott Niekum, Peter Stone; to appear at ICML 2019). See the sketch after this list.
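For the policy-evaluation case in the last point, the same trick replaces the true behavior policy in an importance-sampling value estimate with its MLE from the data. A single-step (bandit-style) sketch under my own naming; the full method handles multi-step trajectories:

```python
from collections import Counter

def corrected_value_estimate(pi_e, batch):
    """Estimate v(pi_e) from batch [(s, a, r), ...] by importance
    sampling with the *estimated* behavior policy (empirical action
    frequencies) in place of the true behavior policy."""
    sa = Counter((s, a) for s, a, _ in batch)
    sc = Counter(s for s, _, _ in batch)
    return sum(pi_e(s, a) * r / (sa[(s, a)] / sc[s])
               for s, a, r in batch) / len(batch)
```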


SLIDE 16


Open Questions

1. Finite-sample bias / variance analysis.
2. Correcting sampling error in online RL methods.

SLIDE 17


Thank you! Questions? jphanna@cs.utexas.edu

SLIDE 18


This is not a blank slide.

SLIDE 19


Empirical Results

GridWorld: discrete states and actions