SLIDE 1
Maximum Entropy-Regularized Multi-Goal Reinforcement Learning
Rui Zhao*, Xudong Sun, Volker Tresp
Siemens AG & Ludwig Maximilian University of Munich | June 2019 | ICML 2019
Introduction
In multi-goal reinforcement learning, an agent learns a goal-conditioned policy to achieve multiple different goals.
SLIDE 2
SLIDE 3
Motivation
- We observed that the achieved goals in the replay buffer are often biased
towards the behavior policies.
- From a Bayesian perspective (Murphy, 2012), when there is no prior
knowledge of the target goal distribution, the agent should learn uniformly from diverse achieved goals.
- We want to encourage the agent to achieve a diverse set of goals while
maximizing the expected return.
SLIDE 4
Contributions
- First, we propose a novel multi-goal RL objective based on weighted entropy,
which is essentially a reward-weighted entropy objective.
- Secondly, we derive a safe surrogate objective, that is, a lower bound of the
original objective, to achieve stable optimization.
- Thirdly, we develop a Maximum Entropy-based Prioritization (MEP)
framework to optimize the derived surrogate objective.
- We evaluate the proposed method in the OpenAI Gym robotic simulations.
SLIDE 5
A Novel Multi-Goal RL Objective Based on Weighted Entropy
Guiaşu (1971) proposed weighted entropy, which is an extension of Shannon entropy. The definition of weighted entropy is given by

H^w_p = -\sum_{k=1}^{K} w_k \, p_k \log p_k,

where w_k is the weight of the k-th event and p_k is the probability of the k-th event.

We use \tau^g to denote all the achieved goals in the trajectory \tau, i.e., \tau^g = (g^s_0, \ldots, g^s_T). Our multi-goal RL objective instantiates the weights as accumulated rewards:

\eta^H(\theta) = H^w_p(\mathcal{T}^g) = \mathbb{E}_p\!\left[\log\frac{1}{p(\tau^g)} \sum_{t=1}^{T} r(S_t, G_e) \,\middle|\, \theta\right].

This objective encourages the agent to maximize the expected return as well as to achieve more diverse goals.
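As a quick sanity check, the weighted-entropy definition above can be computed directly. The sketch below is ours, not from the paper (the function name and toy numbers are illustrative assumptions); with uniform weights it reduces to ordinary Shannon entropy.

```python
import numpy as np

def weighted_entropy(p, w):
    """Weighted entropy (Guiasu, 1971): H^w_p = -sum_k w_k * p_k * log(p_k)."""
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    nz = p > 0  # p_k log p_k -> 0 as p_k -> 0, so zero-probability events drop out
    return -np.sum(w[nz] * p[nz] * np.log(p[nz]))

p = [0.5, 0.25, 0.25]
uniform = weighted_entropy(p, [1.0, 1.0, 1.0])  # Shannon entropy, ~1.0397 nats
skewed = weighted_entropy(p, [2.0, 1.0, 1.0])   # up-weighting the first event raises H^w_p
```

Choosing the weights to be accumulated rewards, as in the objective above, is what couples goal diversity to expected return.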
SLIDE 6
A Safe Surrogate Objective
The surrogate \eta^L(\theta) is a lower bound of the original objective, i.e., \eta^L(\theta) < \eta^H(\theta), where

\eta^L(\theta) = Z \cdot \mathbb{E}_q\!\left[\sum_{t=1}^{T} r(S_t, G_e) \,\middle|\, \theta\right], \qquad q(\tau^g) = \frac{1}{Z}\, p(\tau^g)\left(1 - p(\tau^g)\right),

and Z is the normalization factor for q(\tau^g). Here \eta^H(\theta) = H^w_p(\mathcal{T}^g) is the weighted entropy (Guiaşu, 1971; Kelbert et al., 2017), where the weight is, in our case, the accumulated reward \sum_{t=1}^{T} r(S_t, G_e).
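The bound can be demonstrated numerically on a toy discrete case: for probabilities in (0, 1], -log p >= 1 - p term by term, so with non-negative returns the surrogate sits below the weighted-entropy objective. The probabilities and returns below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))       # toy trajectory probabilities p(tau^g)
R = rng.uniform(0.1, 1.0, size=5)   # toy accumulated rewards sum_t r(S_t, G_e)

eta_H = np.sum(p * np.log(1.0 / p) * R)  # weighted-entropy objective E_p[log(1/p) * R]
eta_L = np.sum(p * (1.0 - p) * R)        # Z * E_q[R] with q = p(1 - p) / Z
assert eta_L < eta_H                      # holds since -log p >= 1 - p on (0, 1]
```

Because the bound holds term by term, it is safe to optimize \eta^L in place of \eta^H without risking divergence from the original objective.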
SLIDE 7
Maximum Entropy-based Prioritization (MEP)
MEP Algorithm: We update the density model to construct a higher-entropy distribution
of achieved goals, and update the agent with the more diversified training distribution.
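A minimal sketch of the prioritized-sampling step inside MEP, assuming a density model has already produced per-trajectory estimates p(tau^g); the function name and toy data are ours, and the agent update itself is abstracted away.

```python
import numpy as np

def mep_prioritized_sample(goal_probs, rng, batch_size=64):
    """Sample replay-buffer trajectory indices from the higher-entropy
    proposal q(tau^g) proportional to p(tau^g) * (1 - p(tau^g))."""
    p = np.asarray(goal_probs, dtype=float)
    w = p * (1.0 - p)              # surrogate prioritization weights
    q = w / w.sum()                # normalized sampling distribution
    idx = rng.choice(len(p), size=batch_size, p=q, replace=True)
    return idx, q

rng = np.random.default_rng(1)
p_est = rng.dirichlet(np.ones(100))  # toy density estimates for 100 trajectories
idx, q = mep_prioritized_sample(p_est, rng, batch_size=32)
# Rarely achieved goals are up-weighted relative to their likelihood: q/p ~ (1 - p).
```

Each MEP iteration alternates this prioritized sampling with refitting the density model on newly achieved goals, so the training distribution stays diverse as the policy improves.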
SLIDE 8
Mean success rate and training time
SLIDE 9
Entropy of achieved goals versus training epoch
- No MEP: 5.13 ± 0.33 | With MEP: 5.59 ± 0.34
- No MEP: 5.73 ± 0.33 | With MEP: 5.81 ± 0.30
- No MEP: 5.78 ± 0.21 | With MEP: 5.81 ± 0.18
SLIDE 10
Summary and Take-home Message
- Our approach improves performance by nine percentage points and sample-
efficiency by a factor of two while keeping computational time under control.
- Training the agent with many different kinds of goals, i.e., a higher entropy
goal distribution, helps the agent to learn.
- The code is available on GitHub: https://github.com/ruizhaogit/mep
- Poster: 06:30 -- 09:00 PM @ Pacific Ballroom #32