

SLIDE 1

Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator

Josiah Hanna and Peter Stone

Department of Computer Science The University of Texas at Austin

SLIDE 2


50 million actions taken
Millions of games
21 days, 1.5 years of compute

SLIDE 3

Can reinforcement learning be data-efficient enough for real-world applications?


SLIDE 4


Reinforcement Learning

Learn a policy that maps the world state to an action that maximizes long-term utility.


SLIDE 5

Reinforcement Learning

[Driving example: Crash with probability 0.15 and reward −100, or Reach Destination with probability 0.85 and reward +1.]

$v(\pi_\theta) = \mathbb{E}[Q^{\pi_\theta}(S, A)]$

"How good is taking action A in state S"

$v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$

$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$

Unknown: $\Pr(s \mid \pi_\theta)$. Known: $\pi_\theta(a \mid s)$.

SLIDE 6

Policy Gradient Reinforcement Learning

$v(\pi_\theta) = \mathbb{E}[Q^{\pi_\theta}(S, A)]$

$\nabla_\theta v(\pi_\theta) = \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)$

$\nabla_\theta v(\pi_\theta) = \mathbb{E}[Q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S)]$

$\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i)$

Unknown: $\Pr(s \mid \pi_\theta)$. Known: $\pi_\theta(a \mid s)$.
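The step from the summation form to the score-function expectation uses the likelihood-ratio identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. A short derivation of that step (standard material, not from the slides; the state distribution is treated as fixed, which the policy gradient theorem justifies):

```latex
\begin{align*}
\nabla_\theta v(\pi_\theta)
  &= \sum_s \Pr(s \mid \pi_\theta) \sum_a Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) \\
  &= \sum_s \Pr(s \mid \pi_\theta) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\,
     \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= \mathbb{E}\!\left[ Q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right].
\end{align*}
```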

SLIDE 7


Monte Carlo Policy Gradient

1. Execute the current policy for m steps.
2. Update the policy with the Monte Carlo policy gradient estimate (see the sketch after this list).
3. Throw away the observed data and repeat (on-policy).
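A minimal sketch of these steps for a tabular softmax policy. The `(s, a, q)` batch format, with `q` a Monte Carlo return standing in for $Q^{\pi_\theta}(s, a)$, is my assumption for illustration, not the authors' exact setup:

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy, where theta is
    an [n_states, n_actions] array of action preferences."""
    prefs = theta[s]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def mc_policy_gradient(theta, batch):
    """Vanilla Monte Carlo estimate:
    (1/m) * sum_i q_i * grad log pi_theta(a_i | s_i)."""
    grad = np.zeros_like(theta)
    for s, a, q in batch:
        probs = softmax_policy(theta, s)
        # grad_theta log pi(a|s) for a softmax policy: indicator(a) - pi(.|s)
        glog = -probs
        glog[a] += 1.0
        grad[s] += q * glog
    return grad / len(batch)
```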
SLIDE 8

Sampling Error

[The driving example again: the true probabilities are 0.15 (Crash, reward −100) and 0.85 (Reach Destination, reward +1), but the sampled proportions are 0.1 and 0.9.]

For a finite amount of data, it may appear that the wrong policy generated the data.
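To make this concrete, here is a toy simulation (my own, using the example's numbers): draw 20 outcomes from the true 0.15/0.85 distribution, and the observed proportions will often look like 0.1/0.9.

```python
import numpy as np

rng = np.random.default_rng(0)
p_crash = 0.15                        # true probability under the policy
crashes = rng.random(20) < p_crash    # 20 sampled trajectories
print(crashes.mean())                 # observed proportion, often not 0.15
# A naive Monte Carlo estimate weights the crash outcome by the observed
# proportion rather than the known probability 0.15, so with finite data
# it behaves as if a different policy had generated the data.
```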

SLIDE 9


Correcting Sampling Error

Pretend the data was generated by the policy that most closely matches the observed data.

Correct the weight on each state-action pair toward the policy that we actually took actions with.

$\pi_\phi = \operatorname{argmax}_{\phi'} \sum_{i=1}^{m} \log \pi_{\phi'}(a_i \mid s_i)$

$\nabla_\theta v(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\pi_\theta(A_i \mid S_i)}{\pi_\phi(A_i \mid S_i)}\, Q^{\pi_\theta}(S_i, A_i)\, \nabla_\theta \log \pi_\theta(A_i \mid S_i)$

The ratio $\pi_\theta / \pi_\phi$ is the importance sampling correction.
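For discrete states and actions, the maximum-likelihood policy $\pi_\phi$ reduces to the observed action frequencies, so the correction is a ratio of a known probability to an observed proportion. A hedged sketch (the function names and `(s, a, q)` batch format are mine):

```python
from collections import Counter

def empirical_policy(batch):
    """MLE policy pi_phi for discrete (s, a): the observed frequency
    of each action among the visits to its state."""
    sa_counts = Counter((s, a) for s, a, _ in batch)
    s_counts = Counter(s for s, _, _ in batch)
    return {(s, a): n / s_counts[s] for (s, a), n in sa_counts.items()}

def importance_weights(pi_theta, batch):
    """Per-sample corrections pi_theta(a|s) / pi_phi(a|s), where
    pi_theta(s, a) returns the current policy's probability."""
    pi_phi = empirical_policy(batch)
    return [pi_theta(s, a) / pi_phi[(s, a)] for s, a, _ in batch]
```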

SLIDE 10


Is this method on-policy or off-policy?


On-policy: can only use data from the current policy.
Off-policy: can use data from any policy.
Our method pretends on-policy data is off-policy data and uses importance sampling to correct!

SLIDE 11


Sampling Error Corrected Policy Gradient

1. Execute the current policy for m steps.
2. Estimate the empirical policy with maximum likelihood estimation.
3. Update the policy with the Sampling Error Corrected (SEC) policy gradient estimate (sketched after this list).
4. Throw away the data and repeat (on-policy).
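Putting steps 2 and 3 together for the tabular softmax policy sketched earlier; again a minimal illustration under the same assumptions, not the paper's reference implementation:

```python
import numpy as np
from collections import Counter

def sec_policy_gradient(theta, batch):
    """SEC estimate: reweight each sample by pi_theta(a|s) / pi_phi(a|s)
    before averaging, where pi_phi is the MLE (empirical) policy."""
    # Step 2: empirical policy from the observed batch.
    sa = Counter((s, a) for s, a, _ in batch)
    sc = Counter(s for s, _, _ in batch)
    # Step 3: importance-weighted Monte Carlo gradient.
    grad = np.zeros_like(theta)
    for s, a, q in batch:
        e = np.exp(theta[s] - theta[s].max())
        probs = e / e.sum()
        w = probs[a] / (sa[(s, a)] / sc[s])   # pi_theta / pi_phi
        glog = -probs                         # grad log pi for softmax
        glog[a] += 1.0
        grad[s] += w * q * glog
    return grad / len(batch)
```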
SLIDE 12


Empirical Results

GridWorld: discrete states and actions

SLIDE 13


Empirical Results

Cartpole: continuous state and discrete actions

SLIDE 14


Related Work


1. Expected SARSA (van Seijen et al. 2009).
2. Expected Policy Gradients (Ciosek and Whiteson 2018).
3. Estimated Propensity Scores (Hirano et al. 2003, Li et al. 2015).
4. Much related work outside of RL and bandits: black-box importance sampling (Liu and Lee 2017), Bayesian Monte Carlo (Ghahramani and Rasmussen 2003).

SLIDE 15


1. Any Monte Carlo method will have sampling error with finite data.
2. Sampling error can slow down learning in policy gradient methods.
3. We introduced the sampling error corrected policy gradient estimator to address this problem.
4. A similar approach can be used for other Monte Carlo estimators, for example on- and off-policy policy evaluation (Josiah Hanna, Scott Niekum, Peter Stone; to appear at ICML 2019). See the sketch after this list.
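For the policy-evaluation case in the last point, the same trick replaces the true behavior policy in an importance-sampling value estimate with its MLE from the data. A single-step (bandit-style) sketch under my own naming; the full method handles multi-step trajectories:

```python
from collections import Counter

def corrected_value_estimate(pi_e, batch):
    """Estimate v(pi_e) from batch [(s, a, r), ...] by importance
    sampling with the *estimated* behavior policy (empirical action
    frequencies) in place of the true behavior policy."""
    sa = Counter((s, a) for s, a, _ in batch)
    sc = Counter(s for s, _, _ in batch)
    return sum(pi_e(s, a) * r / (sa[(s, a)] / sc[s])
               for s, a, r in batch) / len(batch)
```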


SLIDE 16


Open Questions

1. Finite-sample bias / variance analysis.
2. Correcting sampling error in online RL methods.

SLIDE 17


Thank you! Questions? jphanna@cs.utexas.edu

SLIDE 18


This is not a blank slide.

SLIDE 19


Empirical Results

GridWorld: discrete states and actions