Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator
Josiah Hanna and Peter Stone
Department of Computer Science, The University of Texas at Austin
50 million games, millions of actions taken, 21 days, 1.5 years of compute.
[Illustration: a driving policy either crashes (probability 0.15) or reaches its destination (probability 0.85).]
Q(s, a): "How good is taking action a in state s?"
States s come from an unknown distribution; actions a come from the known policy.
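For reference, a standard definition of the action value (not spelled out on the slide; this assumes the discounted infinite-horizon setting):

$$Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; s_0 = s,\ a_0 = a,\ \pi \right]$$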
The policy gradient:

$$\nabla_\theta v(\pi_\theta) \;=\; \sum_{s} \Pr(s \mid \pi_\theta) \sum_{a} \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

Here $\Pr(s \mid \pi_\theta)$ is unknown, while $\pi_\theta(a \mid s)$ is known. Sampling both states and actions gives the Monte Carlo estimate:

$$\hat{\nabla}_\theta v(\pi_\theta) \;=\; \frac{1}{m} \sum_{i=1}^{m} Q^{\pi_\theta}(s_i, a_i)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)$$
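A minimal NumPy sketch of the Monte Carlo estimator above, assuming a tabular softmax policy over discrete states and actions and taking the Q-values as given (all names are illustrative, not from the talk):

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular softmax policy; theta is (n_states, n_actions)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for the tabular softmax."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)   # d log pi(a|s) / d theta[s, b] = 1[a=b] - pi(b|s)
    g[s, a] += 1.0
    return g

def mc_policy_gradient(theta, states, actions, q_values):
    """(1/m) * sum_i Q(s_i, a_i) * grad log pi_theta(a_i | s_i)."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, q_values):
        grad += q * grad_log_pi(theta, s, a)
    return grad / len(states)
```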
[Illustration: Crash (probability 0.15) vs. Reach Destination (probability 0.85). In expectation the observed proportions match these probabilities (0.15 / 0.85), but in a finite sample they need not (here 0.1 / 0.9).]
For a finite amount of data, it may appear that the wrong policy generated the data.
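The effect is easy to reproduce. A small simulation with the slide's numbers (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
p_crash = 0.15            # true probability under the current policy
m = 20                    # a small, finite sample
crashes = rng.random(m) < p_crash
print("true Pr(crash):", p_crash)
print("empirical proportion:", crashes.mean())  # may be 0.10, as if a safer policy generated the data
```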
Pretend the data was generated by the policy that most closely matches it: the maximum-likelihood policy

$$\pi_\phi \;=\; \operatorname*{argmax}_{\phi'} \sum_{i=1}^{m} \log \pi_{\phi'}(a_i \mid s_i)$$

Then correct the weight on each state-action pair toward the policy we actually took actions with.
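In the tabular case the maximum-likelihood policy is simply the empirical action frequency in each state. A sketch of that special case (count-based MLE; in general one would fit the parameters by gradient ascent on the log-likelihood):

```python
import numpy as np

def mle_behavior_policy(states, actions, n_states, n_actions):
    """Tabular MLE pi_phi(a|s): fraction of visits to s in which a was taken."""
    counts = np.zeros((n_states, n_actions))
    for s, a in zip(states, actions):
        counts[s, a] += 1.0
    visits = counts.sum(axis=1, keepdims=True)
    # States never visited contribute no importance weights; leave them at zero.
    return np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
```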
Importance Sampling Correction
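The corrected estimator itself did not survive extraction; reconstructed from the slide's label and the definitions above, it reweights each sampled term by the ratio of the target policy to the maximum-likelihood policy:

$$\hat{\nabla}_\theta v(\pi_\theta) \;=\; \frac{1}{m} \sum_{i=1}^{m} \frac{\pi_\theta(a_i \mid s_i)}{\pi_\phi(a_i \mid s_i)}\, Q^{\pi_\theta}(s_i, a_i)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)$$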
On-policy: can only use data from the current policy.
Off-policy: can use data from any policy.
Our method pretends on-policy data is off-policy data and uses importance sampling to correct!
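Putting the pieces together, a sketch of the corrected estimate, reusing the helpers from the earlier snippets (a simplified illustration of the idea, not the authors' exact implementation):

```python
def corrected_policy_gradient(theta, states, actions, q_values, pi_phi):
    """Monte Carlo policy gradient with sampling-error correction.

    pi_phi: tabular MLE behavior policy, e.g. from mle_behavior_policy(...).
    Each term is importance-weighted by pi_theta(a|s) / pi_phi(a|s); the
    ratio is well-defined because every (s, a) in the data has pi_phi > 0.
    """
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, q_values):
        w = softmax_policy(theta, s)[a] / pi_phi[s, a]
        grad += w * q * grad_log_pi(theta, s, a)
    return grad / len(states)
```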
Correcting for sampling error yields a more accurate policy gradient estimate.
GridWorld: discrete states and actions
Cartpole: continuous states, discrete actions
Related work: Bayesian Monte Carlo (Ghahramani and Rasmussen, 2003). In follow-up work, Josiah Hanna, Scott Niekum, and Peter Stone (to appear, ICML 2019) introduce a similar estimator to address this problem in off-policy evaluation.
Thank you! Questions? jphanna@cs.utexas.edu
This is not a blank slide.