SLIDE 1

Bidirectional Model-based Policy Optimization

Hang Lai, Jian Shen, Weinan Zhang, Yong Yu Shanghai Jiao Tong University

SLIDE 2

Content

  • 1. Background & Motivation
  • 2. Method
  • 3. Theoretical Analysis
  • 4. Experiment Result
SLIDE 3

Background

Model-based reinforcement learning (MBRL):

  • Build a model of the environment dynamics.
  • Use the model to help decision making.

Challenge: compounding error. Small one-step prediction errors accumulate as model rollouts grow longer.

[Figure: a model rollout gradually diverging from the real trajectory.]

SLIDE 4

Motivation

Humans in the real world:

  • Predict future consequences by reasoning forward.
  • Imagine traces leading to a goal by reasoning backward.

Existing methods:

  • Learn a forward model to plan ahead.

This paper:

  • Additionally learn a backward model to reduce the reliance on the accuracy of the forward model.

SLIDE 5

Motivation

SLIDE 6

Content

  • 1. Background & Motivation
  • 2. Method
  • 3. Theoretical Analysis
  • 4. Experiment Result
SLIDE 7

Method

Bidirectional models + the MBPO [1] method = Bidirectional Model-based Policy Optimization (BMPO).

Other components:

  • State sampling strategy
  • Incorporating model predictive control (MPC)
SLIDE 8

Preliminary: MBPO

  • Interact with the environment using the current policy.
  • Train an ensemble of forward models on real data.
  • Generate short branched model rollouts with the current policy.
  • Improve the policy with real & generated data (a schematic of this loop follows).
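A minimal schematic of the MBPO loop, assuming duck-typed `env`, `model_ensemble`, `policy`, buffer objects, and a `collect_transitions` helper (all hypothetical stand-ins, not the authors' implementation):

```python
def mbpo_iteration(env, model_ensemble, policy, real_buffer, model_buffer,
                   rollout_length=1, n_rollouts=400):
    # 1. Interact with the environment using the current policy.
    real_buffer.extend(collect_transitions(env, policy))
    # 2. Train the forward model ensemble on real transitions.
    model_ensemble.fit(real_buffer)
    # 3. Generate short branched rollouts starting from buffered real states.
    for _ in range(n_rollouts):
        s = real_buffer.sample_state()
        for _ in range(rollout_length):
            a = policy.act(s)
            s_next, r = model_ensemble.step(s, a)  # predict with one member
            model_buffer.add((s, a, r, s_next))
            s = s_next
    # 4. Improve the policy (e.g., SAC [4]) on real and generated data.
    policy.update(real_buffer, model_buffer)
```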
SLIDE 9

Model Learning

  • Use an ensemble of probabilistic networks for both the forward model and the backward model.
  • The corresponding loss functions are Gaussian negative log-likelihoods (sketched below); for the forward model:

$\mathcal{L}(\theta) = \sum_{n=1}^{N} \left[\mu_\theta(s_n, a_n) - s_{n+1}\right]^\top \Sigma_\theta^{-1}(s_n, a_n) \left[\mu_\theta(s_n, a_n) - s_{n+1}\right] + \log \det \Sigma_\theta(s_n, a_n)$

$\mu$ and $\Sigma$: predicted mean and covariance; $N$: number of real transitions. The backward model uses the same loss with $(s_{n+1}, a_n)$ as input and $s_n$ as target.
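A sketch of this loss for a diagonal-Gaussian network head (a common parameterization, as in PETS [3]; not necessarily the authors' exact code):

```python
import torch

def gaussian_nll(mean, log_var, target):
    # Negative log-likelihood of a diagonal Gaussian prediction (up to
    # constants). mean, log_var, target: tensors of shape (batch, state_dim).
    inv_var = torch.exp(-log_var)
    return (((mean - target) ** 2) * inv_var + log_var).sum(dim=-1).mean()
```

Each ensemble member minimizes this loss on the real transition buffer.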

SLIDE 10

Backward Policy

Backward policy: takes an action given the next state; used to generate backward rollouts. Two training choices (a loss sketch follows):

  • By maximum likelihood estimation: maximize the log-likelihood of real actions given their successor states.
  • By conditional GAN: train the backward policy to fool a discriminator that distinguishes real (next state, action) pairs from generated ones.
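A sketch of the MLE variant, assuming the backward policy network returns a per-dimension `torch.distributions.Normal` over actions (our assumption, not the paper's stated interface):

```python
import torch

def backward_policy_mle_loss(backward_policy, next_states, actions):
    # Maximize the log-probability of the real action given the next
    # state, i.e., minimize the negative log-likelihood.
    dist = backward_policy(next_states)        # distribution over actions
    return -dist.log_prob(actions).sum(dim=-1).mean()
```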
SLIDE 11

State Sampling Strategy

MBPO: randomly select states from the environment data buffer to begin model rollouts.

BMPO (ours): sample high-value states, with the probability of selecting a state given by a softmax over estimated values (a sampling sketch follows):

$P(s_i) = \frac{e^{V(s_i)}}{\sum_j e^{V(s_j)}}$

$P(s_i)$: probability of $s_i$ being selected; $V(s_i)$: estimated value of $s_i$.
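A sketch of such value-weighted sampling (the `temperature` knob is our addition for numerical control, not a parameter from the slides):

```python
import numpy as np

def sample_start_states(states, values, n, temperature=1.0):
    # Softmax over estimated values; higher-value states are sampled
    # more often as rollout starting points.
    logits = np.asarray(values, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())   # subtract max for stability
    probs /= probs.sum()
    idx = np.random.choice(len(states), size=n, p=probs)
    return [states[i] for i in idx]
```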

SLIDE 12

Environment Interaction

MBPO: directly use the current policy to act.

BMPO: use model predictive control (MPC), as sketched below:

  • Generate candidate action sequences from the current policy.
  • Simulate the corresponding trajectories in the model.
  • Select the first action of the sequence that yields the highest return.
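A minimal sketch of this selection rule (argument names and the one-step `model.step` interface are assumptions):

```python
import numpy as np

def mpc_action(policy, model, state, n_candidates=10, horizon=5):
    # Sample candidate action sequences from the policy, score each by
    # its simulated return in the model, and return the first action of
    # the best sequence.
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        s, total, first_action = state, 0.0, None
        for t in range(horizon):
            a = policy.act(s)
            if t == 0:
                first_action = a
            s, r = model.step(s, a)   # simulated transition and reward
            total += r
        if total > best_return:
            best_return, best_action = total, first_action
    return best_action
```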

SLIDE 13

Overall algorithm

[Figure: pseudocode of MBPO and BMPO (ours) shown side by side; a schematic of the BMPO loop follows.]
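A schematic of one BMPO iteration combining the pieces above (a hedged sketch reusing the hypothetical helpers from the earlier snippets; not the authors' code):

```python
def bmpo_iteration(env, fwd_model, bwd_model, backward_policy, policy,
                   real_buffer, model_buffer, k1=2, k2=2, n_rollouts=400):
    # 1. Interact with the environment, selecting actions via MPC.
    real_buffer.extend(collect_transitions(env, policy, use_mpc=True))
    # 2. Fit forward model, backward model, and backward policy on real data.
    fwd_model.fit(real_buffer)
    bwd_model.fit(real_buffer)
    backward_policy.fit(real_buffer)
    # 3. Branch bidirectional rollouts from high-value start states:
    #    k1 steps backward, then k2 steps forward.
    for _ in range(n_rollouts):
        states, values = real_buffer.states_and_values()
        s0 = sample_start_states(states, values, n=1)[0]
        s = s0
        for _ in range(k1):                       # backward segment
            a = backward_policy.act(s)
            s_prev, r = bwd_model.step(s, a)      # predecessor state, reward
            model_buffer.add((s_prev, a, r, s))
            s = s_prev
        s = s0
        for _ in range(k2):                       # forward segment
            a = policy.act(s)
            s_next, r = fwd_model.step(s, a)
            model_buffer.add((s, a, r, s_next))
            s = s_next
    # 4. Improve the policy (SAC [4]) on real and generated data.
    policy.update(real_buffer, model_buffer)
```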

SLIDE 18

Content

  • 1. Motivation
  • 2. Method
  • 3. Theoretical Analysis
  • 4. Experiment Result
slide-19
SLIDE 19

Theoretical Analysis

For the bidirectional model with backward rollout length $k_1$ and forward rollout length $k_2$, and for the forward-only model with rollout length $k = k_1 + k_2$, the analysis lower-bounds the expected return in the environment, $\eta[\pi]$, by the expected return in the branched rollout, $\eta^{\mathrm{branch}}[\pi]$, minus error terms in the policy shift $\epsilon_\pi$ and the model error $\epsilon_m$.

[Equations: return discrepancy bounds for both cases; the bidirectional bound is tighter than the forward-only bound for the same total rollout length.]


SLIDE 21

Content

  • 1. Motivation
  • 2. Method
  • 3. Theoretical Analysis
  • 4. Experiment Result
SLIDE 22

Settings

  • Environments: Pendulum, Hopper, Walker, Ant
  • Baselines: MBPO [1], SLBO [2], PETS [3], SAC [4]

SLIDE 23

Comparison Result

SLIDE 24

Model Error

Model validation loss (single-step error).

SLIDE 25

Compounding Model Error

Assume a real trajectory of length $h$ is $(s_0, a_0, s_1, \ldots, s_h)$; the compounding error measures how far multi-step model predictions drift from this trajectory.
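One plausible way to compute such a metric (a sketch; the paper's exact definition may differ, and `model.step` is an assumed interface):

```python
import numpy as np

def compounding_error(model, states, actions):
    # Roll the learned model along the real action sequence and record
    # the deviation from the real trajectory at each step.
    s = states[0]
    errors = []
    for t, a in enumerate(actions):
        s, _ = model.step(s, a)                  # multi-step prediction
        errors.append(np.linalg.norm(s - states[t + 1]))
    return errors
```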

SLIDE 26

Backward Policy Choice

Figure 1: Comparison of two heuristic design choices for the backward policy loss: MLE loss and GAN loss.

SLIDE 27

Ablation study

Figure 2: Ablation study of three crucial components: forward model, backward model, and MPC.

SLIDE 28

Hyperparameter study

Figure 3: Sensitivity of our algorithm to the hyper-parameter.

SLIDE 29

Hyperparameter study

Figure 4: Average return with different backward rollout lengths $k_1$ and fixed forward rollout length $k_2$.

SLIDE 30

Reference

[1] Janner, Michael, et al. "When to trust your model: Model-based policy optimization." Advances in Neural Information Processing Systems. 2019.
[2] Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." arXiv preprint arXiv:1807.03858 (2018).
[3] Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." Advances in Neural Information Processing Systems. 2018.
[4] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).

SLIDE 31

Thanks for your interest!

Please feel free to contact me at laihang@apex.sjtu.edu.cn