SLIDE 1

Predictor-Corrector Policy Optimization (PicCoLO)

Ching-An Cheng, Xinyan Yan, Nathan Ratliff, Byron Boots. Jun 11, 2019 @ ICML

SLIDE 2

Policy optimization

We consider episodic learning: optimizing a policy for sequential decision making in an MDP.

SLIDE 3

Learning efficiency

  • Cost of interactions > cost of computation

learning efficiency = sample efficiency

We should perhaps spend time on planning before real interactions.

To do so we need models. But should we use them?

SLIDE 4

Why we should use models

  • A way to summarize prior knowledge & past experiences
  • Can optimize the policy indirectly without costly real-world interactions
  • Can be provably more sample-efficient (Sun et al., 2019)

[Images: AirSim, FleX, Miniatur Wunderland]

SLIDE 5

Why we should NOT use models

  • Models are, by definition, inexact
  • Weaknesses of the model can be exploited in policy optimization
  • This results in biased performance of the trained policy

"The reality gap"

[Image: ItsJerryAndHarry (YouTube)]

SLIDE 6

Toward reconciling both sides

[Diagram: a spectrum from the "Unbiased kingdom" (model-free: unbiased but inefficient) to the "Biased world" (model-based: efficient but biased), with hybrid methods in between: Dyna, dual policy iteration, control variates, truncating horizon, learning to plan.]

SLIDE 7

A new direction

[Diagram: PicCoLO sits between model-free (unbiased, inefficient) and model-based (efficient, biased) methods, giving new unbiased, efficient hybrids. *It can be combined with control variates and learning to plan.]

SLIDE 8

A new direction

  • Main idea

We should not fully trust a model (as the methods in the biased world do), but leverage only its correct part

  • How?
  • 1. Frame policy optimization as predictable online learning (POL)
  • 2. Design a reduction-based algorithm for POL to reuse known algorithms
  • 3. When translated back, this gives a meta-algorithm for policy optimization

SLIDE 9

Online learning

[Diagram: online learning as a game between a LEARNER (the policy optimization algorithm) and an OPPONENT.]

SLIDE 10

Online learning

In each round:

  • the learner makes a decision (tries a policy)
  • the opponent chooses a loss
  • the learner suffers the loss, observes statistics, and updates the policy
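
For concreteness, here is this protocol as a minimal code sketch. It instantiates the loop with online gradient descent on quadratic losses; everything concrete in it (the losses, the step size) is an illustrative assumption, not from the talk.

```python
import numpy as np

# Minimal online-learning loop (illustrative): online gradient descent
# against a sequence of quadratic losses chosen by the "opponent".

def opponent_loss(n):
    """Round-n loss l_n(x) = 0.5 * ||x - c_n||^2 and its gradient."""
    c_n = np.array([np.sin(0.1 * n), np.cos(0.1 * n)])
    loss = lambda x: 0.5 * np.sum((x - c_n) ** 2)
    grad = lambda x: x - c_n
    return loss, grad

x = np.zeros(2)      # learner's decision (think: policy parameters)
eta = 0.5            # step size
total_loss = 0.0
for n in range(100):
    loss, grad = opponent_loss(n)   # opponent chooses a loss
    total_loss += loss(x)           # learner suffers the loss
    x = x - eta * grad(x)           # learner observes and updates
print("accumulated loss:", total_loss)
```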

SLIDE 11

Online learning

  • The loss sequence can be adversarially chosen by the opponent
  • Common performance measure: regret, $\mathrm{Regret}_N = \sum_{n=1}^{N} \ell_n(\pi_n) - \min_{\pi} \sum_{n=1}^{N} \ell_n(\pi)$
  • For convex losses, algorithms with sublinear regret are well known

e.g. mirror descent, follow-the-regularized-leader, etc.
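
For reference, the mirror descent update in its standard textbook form (not copied from the slides), with step size $\eta_n$ and Bregman divergence $B_\psi$:

```latex
% Mirror descent (standard form): follow the linearized loss while
% staying close to the current decision in Bregman divergence.
\pi_{n+1} = \operatorname*{arg\,min}_{\pi}\;
  \langle \nabla \ell_n(\pi_n),\, \pi \rangle
  + \tfrac{1}{\eta_n}\, B_{\psi}(\pi \,\|\, \pi_n)
```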

SLIDE 12

Policy optimization as online learning

  • Define online losses such that sublinear regret implies policy learning
  • This idea started in the context of imitation learning (Ross et al., 2011)
  • We show that episodic policy optimization can be viewed similarly

SLIDE 13

Policy optimization as online learning

In each round, the learner tries a policy and updates it after observing statistics; the opponent's loss checks whether the current policy is better than the previous one. By the performance difference lemma, this amounts to evaluating the advantage function of the previous policy on the states of the current policy:

$\ell_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}} \big[ A_{\pi_{n-1}}(s, \pi) \big]$

The gradient of this loss is the actor-critic gradient (what is implemented in practice).
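
Spelling out that last claim with the standard likelihood-ratio identity (my derivation, consistent with the slide rather than quoted from it): for a parameterized policy $\pi_\theta$,

```latex
% Gradient of the per-round loss at the current parameters \theta_n,
% via the likelihood-ratio (score-function) trick:
\nabla_\theta\, \ell_n(\pi_\theta) \big|_{\theta_n}
  = \mathbb{E}_{s \sim d_{\pi_n}}\, \mathbb{E}_{a \sim \pi_{\theta_n}(\cdot \mid s)}
    \big[ \nabla_\theta \log \pi_\theta(a \mid s) \big|_{\theta_n}\, A_{\pi_{n-1}}(s, a) \big]
```

which is the familiar actor-critic gradient, with the critic estimating the advantage of the previous policy.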

SLIDE 14

Possible algorithms

  • We can try typical no-regret algorithms

e.g. mirror descent in online learning -> actor-critic update

  • But it turns out they are not optimal. We can actually learn faster!
  • Insight

The loss functions here are not adversarial; they can be partially inferred from the past (e.g. similar policies visit similar states). But these typical algorithms were designed for adversarial setups.

SLIDE 15

Predictability and predictive models

  • We can view predictability, e.g., as the ability to predict future gradients
  • Predictive model: a function that estimates the gradient of the future loss
  • Examples
  • (averaged) past gradients
  • replay buffer
  • inexact simulator
  • function approximator
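
A code sketch of this idea: a predictive model is anything exposing a "predict the next gradient" interface. The class names and methods below are hypothetical illustrations of two of the listed examples, not an API from the paper.

```python
import numpy as np

class AveragedPastGradients:
    """Predict the future gradient as a running average of past gradients."""
    def __init__(self, dim, decay=0.9):
        self.avg = np.zeros(dim)
        self.decay = decay

    def update(self, observed_grad):
        # Exponentially weighted average of observed gradients.
        self.avg = self.decay * self.avg + (1.0 - self.decay) * observed_grad

    def predict(self, policy_params):
        return self.avg  # ignores the query point

class SimulatorGradient:
    """Predict the future gradient by differentiating a policy's
    performance through an (inexact) simulator."""
    def __init__(self, sim_grad_fn):
        self.sim_grad_fn = sim_grad_fn  # params -> estimated gradient

    def update(self, observed_grad):
        pass  # fixed simulator here; it could also be refit from data

    def predict(self, policy_params):
        return self.sim_grad_fn(policy_params)
```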

SLIDE 16

Policy optimization is predictable online learning

  • We need algorithms that consider predictability
  • There are two-step algorithms for predictable setups, but …
  • We have more sophisticated algorithms for adversarial problems, but …
  • That is, we need a reduction from predictable to adversarial problems


This is PicCoLO

SLIDE 17

The idea behind PicCoLO

Decompose each loss into a predicted part and a prediction error:

$\ell_n = \hat{\ell}_n + (\ell_n - \hat{\ell}_n)$

The predicted part is known in advance, while the prediction error behaves like an adversarial loss. It suffices to run an adversarial algorithm on the error sequence, if we wisely select $\hat{\ell}_n$ via the predictive model.
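
The payoff, stated in the form known from optimistic online learning (the standard bound from that literature, paraphrased rather than quoted from the talk): with suitably adapted step sizes, the regret scales with the prediction errors instead of the raw gradients,

```latex
% It is the prediction errors g_n - \hat{g}_n, not the gradients g_n,
% that determine the regret (standard optimistic-online-learning bound).
\mathrm{Regret}_N = O\!\left( \sqrt{ \sum_{n=1}^{N} \big\| g_n - \hat{g}_n \big\|_*^2 } \right)
```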

SLIDE 18

PicCoLO is a meta-algorithm

Each round has two steps:

  • Prediction Step: take a step along the predicted gradient $\hat{g}_n$
  • Correction Step: after observing the true gradient $g_n$, correct with the error $g_n - \hat{g}_n$

That is, apply a standard method (e.g. gradient descent) to this new sequence.

(PicCoLO can use a control variate to further reduce the variance of the gradient estimate)

Trick: adapt the step size based on the size of the gradient error; take larger steps when the prediction is accurate and smaller steps when it is not.
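
A minimal sketch of this two-step update with vanilla gradient descent as the base algorithm. The error-adaptive step size is omitted for brevity, and `predictor` follows the hypothetical interface sketched earlier; this is an illustration of the scheme, not the authors' implementation.

```python
def piccolo_gd(theta, predictor, true_grad_fn, eta, num_rounds):
    """PicCoLO-style prediction/correction loop around gradient descent."""
    for n in range(num_rounds):
        # Prediction Step: move along the predicted future gradient.
        g_hat = predictor.predict(theta)
        theta_hat = theta - eta * g_hat

        # Deploy theta_hat and observe the true gradient (e.g. a sampled
        # actor-critic gradient from real rollouts).
        g = true_grad_fn(theta_hat)

        # Correction Step: move along the prediction error.
        theta = theta_hat - eta * (g - g_hat)

        # Improve the predictive model with the observed gradient.
        predictor.update(g)
    return theta
```

With a zero prediction this reduces to the plain base algorithm; predicting with the last observed gradient recovers an optimistic/extra-gradient-style update.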

SLIDE 19

PicCoLO

The same idea applies to any algorithm in the family of (adaptive) mirror descent and follow-the-regularized-leader. PicCoLO recovers existing algorithms, e.g. extra-gradient and optimistic mirror descent, and provides their adaptive generalizations.

SLIDE 20

PicCoLO

Theoretically we can show

  • the performance is unbiased, even when the prediction (model) is incorrect
  • learning accelerates when the prediction is relatively accurate

SLIDE 21

How to compute the prediction

  • We want to set the prediction $\hat{g}_n$ equal to the gradient we will actually receive, $\nabla \ell_n(\hat{\pi}_n)$, because then the prediction error, and hence the regret, is small
  • We can use the predictive model to realize this!

SLIDE 22

How to compute the prediction

  • We want to set $\hat{g}_n = \nabla \ell_n(\hat{\pi}_n)$, because then the prediction error, and hence the regret, is small
  • Because $\hat{\pi}_n$ is determined by $\hat{g}_n$ through the Prediction Step, and the predictive model $\Phi_n$ estimates $\nabla \ell_n$, we can set $\hat{g}_n = \Phi_n(\hat{\pi}_n(\hat{g}_n))$
  • We can select $\hat{g}_n$ by solving this fixed-point problem (FP)

SLIDE 23

How to compute the prediction

  • When the predictive model is itself a gradient, $\Phi_n = \nabla f_n$, the FP becomes an optimization problem: solving it amounts to optimizing $f_n$ with a regularization toward the current policy (this is regularized optimal control)
  • Heuristic: set $\hat{g}_n = \Phi_n(\pi_n)$ (evaluate at the previous decision) or just do a few iterations of the fixed-point update; see the sketch below
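
A sketch of those two heuristics, reusing the hypothetical `predictor` interface and a gradient-descent Prediction Step from the earlier snippets:

```python
def previous_decision_heuristic(theta, predictor):
    """Evaluate the predictive model at the previous decision."""
    return predictor.predict(theta)

def fixed_point_heuristic(theta, predictor, eta, num_iters=3):
    """Approximate the fixed point g_hat = Phi(theta - eta * g_hat)
    by iterating the map a few times."""
    g_hat = predictor.predict(theta)
    for _ in range(num_iters):
        theta_hat = theta - eta * g_hat       # tentative Prediction Step
        g_hat = predictor.predict(theta_hat)  # re-evaluate the model there
    return g_hat
```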

SLIDE 24

Experiments

  • For example, with ADAM as the base algorithm

[Plot: accumulated reward vs. iteration on cartpole, comparing PicCoLO with the previous-decision heuristic against baselines; higher is better.]

PicCoLO shows acceleration when the predictions are accurate and is robust against model error. Similar properties are observed for other base algorithms (e.g. natural gradient descent, TRPO).

SLIDE 25

Experiments

  • For example, with ADAM as the base algorithm

[Plots: accumulated reward vs. iteration on hopper and snake, comparing the approximate fixed-point heuristic with the previous-decision heuristic; higher is better.]

The fixed-point formulation converges even faster.

SLIDE 26

Summary

  • “PicCoLOed” model-free algorithms can learn faster without bias
  • The predictive model can be viewed as a unified interface for injecting prior knowledge
  • As PicCoLO is designed for general predictable online learning, we expect applications to other problems and domains
  • Learning and parameterizing predictive models are of practical importance

SLIDE 27

Thanks for your attention

Please come to our poster # 106