SLIDE 1
Maximum Entropy-Regularized Multi-Goal Reinforcement Learning
Rui Zhao*, Xudong Sun, Volker Tresp
Siemens AG & Ludwig Maximilian University of Munich | June 2019 | ICML 2019
Introduction
In multi-goal reinforcement learning, an agent learns a goal-conditioned policy to achieve multiple different goals.
SLIDE 2
SLIDE 3
Motivation
- We observed that the achieved goals in the replay buffer are often biased
towards the behavior policies.
- From a Bayesian perspective (Murphy, 2012), when there is no prior
knowledge of the target goal distribution, the agent should learn uniformly from diverse achieved goals.
- We want to encourage the agent to achieve a diverse set of goals while
maximizing the expected return.
SLIDE 4
Contributions
- First, we propose a novel multi-goal RL objective based on weighted entropy,
which is essentially a reward-weighted entropy objective.
- Secondly, we derive a safe surrogate objective, that is, a lower bound of the
original objective, to achieve stable optimization.
- Thirdly, we develop a Maximum Entropy-based Prioritization (MEP)
framework to optimize the derived surrogate objective.
- We evaluate the proposed method in the OpenAI Gym robotic simulations.
SLIDE 5
A Novel Multi-Goal RL Objective Based on Weighted Entropy
Guiaşu (1971) proposed weighted entropy, which is an extension of Shannon entropy. The definition of weighted entropy is given by

H^w_p = -\sum_{k=1}^{K} w_k \, p_k \log p_k,

where w_k is the weight of the k-th event and p_k is the probability of the k-th event.

We use \tau^g to denote all the achieved goals in the trajectory \tau, i.e., \tau^g = (g^s_0, \ldots, g^s_T). Our multi-goal RL objective instantiates the weights as accumulated rewards:

\eta^H(\theta) = H^w_p(\mathcal{T}^g) = \mathbb{E}_p\!\left[\log\frac{1}{p(\tau^g)} \sum_{t=1}^{T} r(S_t, G_e) \,\middle|\, \theta\right].

This objective encourages the agent to maximize the expected return as well as to achieve more diverse goals.
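As a quick sanity check, the weighted-entropy definition above can be computed directly. The sketch below is ours, not from the paper (the function name and toy numbers are illustrative assumptions); with uniform weights it reduces to ordinary Shannon entropy.

```python
import numpy as np

def weighted_entropy(p, w):
    """Weighted entropy (Guiasu, 1971): H^w_p = -sum_k w_k * p_k * log(p_k)."""
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    nz = p > 0  # p_k log p_k -> 0 as p_k -> 0, so zero-probability events drop out
    return -np.sum(w[nz] * p[nz] * np.log(p[nz]))

p = [0.5, 0.25, 0.25]
uniform = weighted_entropy(p, [1.0, 1.0, 1.0])  # Shannon entropy, ~1.0397 nats
skewed = weighted_entropy(p, [2.0, 1.0, 1.0])   # up-weighting the first event raises H^w_p
```

Choosing the weights to be accumulated rewards, as in the objective above, is what couples goal diversity to expected return.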
SLIDE 6
A Safe Surrogate Objective
The surrogate \eta^L(\theta) is a lower bound of the original objective, i.e., \eta^L(\theta) < \eta^H(\theta), where

\eta^L(\theta) = Z \cdot \mathbb{E}_q\!\left[\sum_{t=1}^{T} r(S_t, G_e) \,\middle|\, \theta\right], \qquad q(\tau^g) = \frac{1}{Z}\, p(\tau^g)\left(1 - p(\tau^g)\right),

and Z is the normalization factor for q(\tau^g). Here \eta^H(\theta) = H^w_p(\mathcal{T}^g) is the weighted entropy (Guiaşu, 1971; Kelbert et al., 2017), where the weight is, in our case, the accumulated reward \sum_{t=1}^{T} r(S_t, G_e).
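The bound can be demonstrated numerically on a toy discrete case: for probabilities in (0, 1], -log p >= 1 - p term by term, so with non-negative returns the surrogate sits below the weighted-entropy objective. The probabilities and returns below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))       # toy trajectory probabilities p(tau^g)
R = rng.uniform(0.1, 1.0, size=5)   # toy accumulated rewards sum_t r(S_t, G_e)

eta_H = np.sum(p * np.log(1.0 / p) * R)  # weighted-entropy objective E_p[log(1/p) * R]
eta_L = np.sum(p * (1.0 - p) * R)        # Z * E_q[R] with q = p(1 - p) / Z
assert eta_L < eta_H                      # holds since -log p >= 1 - p on (0, 1]
```

Because the bound holds term by term, it is safe to optimize \eta^L in place of \eta^H without risking divergence from the original objective.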
SLIDE 7
Maximum Entropy-based Prioritization (MEP)
MEP Algorithm: We update the density model to construct a higher-entropy distribution
of achieved goals, and update the agent with the more diversified training distribution.
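A minimal sketch of the prioritized-sampling step inside MEP, assuming a density model has already produced per-trajectory estimates p(tau^g); the function name and toy data are ours, and the agent update itself is abstracted away.

```python
import numpy as np

def mep_prioritized_sample(goal_probs, rng, batch_size=64):
    """Sample replay-buffer trajectory indices from the higher-entropy
    proposal q(tau^g) proportional to p(tau^g) * (1 - p(tau^g))."""
    p = np.asarray(goal_probs, dtype=float)
    w = p * (1.0 - p)              # surrogate prioritization weights
    q = w / w.sum()                # normalized sampling distribution
    idx = rng.choice(len(p), size=batch_size, p=q, replace=True)
    return idx, q

rng = np.random.default_rng(1)
p_est = rng.dirichlet(np.ones(100))  # toy density estimates for 100 trajectories
idx, q = mep_prioritized_sample(p_est, rng, batch_size=32)
# Rarely achieved goals are up-weighted relative to their likelihood: q/p ~ (1 - p).
```

Each MEP iteration alternates this prioritized sampling with refitting the density model on newly achieved goals, so the training distribution stays diverse as the policy improves.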
SLIDE 8
Mean success rate and training time
SLIDE 9
Entropy of achieved goals versus training epoch
- No MEP: 5.13 ± 0.33 | With MEP: 5.59 ± 0.34
- No MEP: 5.73 ± 0.33 | With MEP: 5.81 ± 0.30
- No MEP: 5.78 ± 0.21 | With MEP: 5.81 ± 0.18
SLIDE 10
Summary and Take-home Message
- Our approach improves performance by nine percentage points and sample-
efficiency by a factor of two while keeping computational time under control.
- Training the agent with many different kinds of goals, i.e., a higher entropy
goal distribution, helps the agent to learn.
- The code is available on GitHub: https://github.com/ruizhaogit/mep
- Poster: 06:30 -- 09:00 PM @ Pacific Ballroom #32