SLIDE 1

Predictor-Corrector Policy Optimization (PicCoLO)

Ching-An Cheng, Xinyan Yan, Nathan Ratliff, Byron Boots. Jun 11, 2019 @ ICML

SLIDE 2

Policy optimization

We consider episodic learning: optimizing a policy for sequential decision making in an MDP.

SLIDE 3

Learning efficiency

  • Cost of interactions > cost of computation

learning efficiency = sample efficiency

We should perhaps spend time on planning before real interactions.

To do so we need models. But should we use them?

SLIDE 4

Why we should use models

  • A way to summarize prior knowledge & past experiences
  • Can optimize the policy indirectly without costly real-world interactions
  • Can be provably more sample-efficient (Sun et al., 2019)

[Images: AirSim, FleX, Miniatur Wunderland]

SLIDE 5

Why we should NOT use models

  • Models are, by definition, inexact
  • Weaknesses of the model can be exploited in policy optimization
  • This results in biased performance of the trained policy

"The reality gap"

[Image: ItsJerryAndHarry (YouTube)]

SLIDE 6

Toward reconciling both sides

[Diagram: a spectrum from the "Unbiased kingdom" (model-free: unbiased but inefficient) to the "Biased world" (model-based: efficient but biased), with hybrid methods in between: Dyna, dual policy iteration, control variates, truncating horizon, learning to plan.]

SLIDE 7

A new direction

[Diagram: PicCoLO sits between model-free (unbiased, inefficient) and model-based (efficient, biased) methods, giving new unbiased, efficient hybrids. *It can be combined with control variates and learning to plan.]

SLIDE 8

A new direction

  • Main idea

We should not fully trust a model (as the methods in the biased world do), but leverage only its correct part

  • How?
  • 1. Frame policy optimization as predictable online learning (POL)
  • 2. Design a reduction-based algorithm for POL to reuse known algorithms
  • 3. When translated back, this gives a meta-algorithm for policy optimization

SLIDE 9

Online learning

[Diagram: online learning as a game between a LEARNER (the policy optimization algorithm) and an OPPONENT.]

SLIDE 10

Online learning

In each round:

  • the learner makes a decision (tries a policy)
  • the opponent chooses a loss
  • the learner suffers the loss, observes statistics, and updates the policy
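
For concreteness, here is this protocol as a minimal code sketch. It instantiates the loop with online gradient descent on quadratic losses; everything concrete in it (the losses, the step size) is an illustrative assumption, not from the talk.

```python
import numpy as np

# Minimal online-learning loop (illustrative): online gradient descent
# against a sequence of quadratic losses chosen by the "opponent".

def opponent_loss(n):
    """Round-n loss l_n(x) = 0.5 * ||x - c_n||^2 and its gradient."""
    c_n = np.array([np.sin(0.1 * n), np.cos(0.1 * n)])
    loss = lambda x: 0.5 * np.sum((x - c_n) ** 2)
    grad = lambda x: x - c_n
    return loss, grad

x = np.zeros(2)      # learner's decision (think: policy parameters)
eta = 0.5            # step size
total_loss = 0.0
for n in range(100):
    loss, grad = opponent_loss(n)   # opponent chooses a loss
    total_loss += loss(x)           # learner suffers the loss
    x = x - eta * grad(x)           # learner observes and updates
print("accumulated loss:", total_loss)
```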

SLIDE 11

Online learning

  • The loss sequence can be adversarially chosen by the opponent
  • Common performance measure: regret, $\mathrm{Regret}_N = \sum_{n=1}^{N} \ell_n(\pi_n) - \min_{\pi} \sum_{n=1}^{N} \ell_n(\pi)$
  • For convex losses, algorithms with sublinear regret are well known

e.g. mirror descent, follow-the-regularized-leader, etc.
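
For reference, the mirror descent update in its standard textbook form (not copied from the slides), with step size $\eta_n$ and Bregman divergence $B_\psi$:

```latex
% Mirror descent (standard form): follow the linearized loss while
% staying close to the current decision in Bregman divergence.
\pi_{n+1} = \operatorname*{arg\,min}_{\pi}\;
  \langle \nabla \ell_n(\pi_n),\, \pi \rangle
  + \tfrac{1}{\eta_n}\, B_{\psi}(\pi \,\|\, \pi_n)
```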

SLIDE 12

Policy optimization as online learning

  • Define online losses such that sublinear regret implies policy learning
  • This idea started in the context of imitation learning (Ross et al., 2011)
  • We show that episodic policy optimization can be viewed similarly

SLIDE 13

Policy optimization as online learning

In each round, the learner tries a policy and updates it after observing statistics; the opponent's loss checks whether the current policy is better than the previous one. By the performance difference lemma, this amounts to evaluating the advantage function of the previous policy on the states of the current policy:

$\ell_n(\pi) = \mathbb{E}_{s \sim d_{\pi_n}} \big[ A_{\pi_{n-1}}(s, \pi) \big]$

The gradient of this loss is the actor-critic gradient (what is implemented in practice).
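
Spelling out that last claim with the standard likelihood-ratio identity (my derivation, consistent with the slide rather than quoted from it): for a parameterized policy $\pi_\theta$,

```latex
% Gradient of the per-round loss at the current parameters \theta_n,
% via the likelihood-ratio (score-function) trick:
\nabla_\theta\, \ell_n(\pi_\theta) \big|_{\theta_n}
  = \mathbb{E}_{s \sim d_{\pi_n}}\, \mathbb{E}_{a \sim \pi_{\theta_n}(\cdot \mid s)}
    \big[ \nabla_\theta \log \pi_\theta(a \mid s) \big|_{\theta_n}\, A_{\pi_{n-1}}(s, a) \big]
```

which is the familiar actor-critic gradient, with the critic estimating the advantage of the previous policy.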

SLIDE 14

Possible algorithms

  • We can try typical no-regret algorithms

e.g. mirror descent in online learning -> actor-critic update

  • But it turns out they are not optimal. We can actually learn faster!
  • Insight

The loss functions here are not adversarial; they can be partially inferred from the past (e.g. similar policies visit similar states). But these typical algorithms were designed for adversarial setups.

SLIDE 15

Predictability and predictive models

  • We can view predictability, e.g., as the ability to predict future gradients
  • Predictive model: a function that estimates the gradient of the future loss
  • Examples
  • (averaged) past gradients
  • replay buffer
  • inexact simulator
  • function approximator
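
A code sketch of this idea: a predictive model is anything exposing a "predict the next gradient" interface. The class names and methods below are hypothetical illustrations of two of the listed examples, not an API from the paper.

```python
import numpy as np

class AveragedPastGradients:
    """Predict the future gradient as a running average of past gradients."""
    def __init__(self, dim, decay=0.9):
        self.avg = np.zeros(dim)
        self.decay = decay

    def update(self, observed_grad):
        # Exponentially weighted average of observed gradients.
        self.avg = self.decay * self.avg + (1.0 - self.decay) * observed_grad

    def predict(self, policy_params):
        return self.avg  # ignores the query point

class SimulatorGradient:
    """Predict the future gradient by differentiating a policy's
    performance through an (inexact) simulator."""
    def __init__(self, sim_grad_fn):
        self.sim_grad_fn = sim_grad_fn  # params -> estimated gradient

    def update(self, observed_grad):
        pass  # fixed simulator here; it could also be refit from data

    def predict(self, policy_params):
        return self.sim_grad_fn(policy_params)
```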

SLIDE 16

Policy optimization is predictable online learning

  • We need algorithms that consider predictability
  • There are two-step algorithms for predictable setups, but …
  • We have more sophisticated algorithms for adversarial problems, but …
  • That is, we need a reduction from predictable to adversarial problems


This is PicCoLO

SLIDE 17

The idea behind PicCoLO

Decompose each loss into a predicted part and a prediction error:

$\ell_n = \hat{\ell}_n + (\ell_n - \hat{\ell}_n)$

The predicted part is known in advance, while the prediction error behaves like an adversarial loss. It suffices to run an adversarial algorithm on the error sequence, if we wisely select $\hat{\ell}_n$ via the predictive model.
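
The payoff, stated in the form known from optimistic online learning (the standard bound from that literature, paraphrased rather than quoted from the talk): with suitably adapted step sizes, the regret scales with the prediction errors instead of the raw gradients,

```latex
% It is the prediction errors g_n - \hat{g}_n, not the gradients g_n,
% that determine the regret (standard optimistic-online-learning bound).
\mathrm{Regret}_N = O\!\left( \sqrt{ \sum_{n=1}^{N} \big\| g_n - \hat{g}_n \big\|_*^2 } \right)
```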

SLIDE 18

PicCoLO is a meta-algorithm

Each round has two steps:

  • Prediction Step: take a step along the predicted gradient $\hat{g}_n$
  • Correction Step: after observing the true gradient $g_n$, correct with the error $g_n - \hat{g}_n$

That is, apply a standard method (e.g. gradient descent) to this new sequence.

(PicCoLO can use a control variate to further reduce the variance of the gradient estimate)

Trick: adapt the step size based on the size of the gradient error; take larger steps when the prediction is accurate and smaller steps when it is not.
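
A minimal sketch of this two-step update with vanilla gradient descent as the base algorithm. The error-adaptive step size is omitted for brevity, and `predictor` follows the hypothetical interface sketched earlier; this is an illustration of the scheme, not the authors' implementation.

```python
def piccolo_gd(theta, predictor, true_grad_fn, eta, num_rounds):
    """PicCoLO-style prediction/correction loop around gradient descent."""
    for n in range(num_rounds):
        # Prediction Step: move along the predicted future gradient.
        g_hat = predictor.predict(theta)
        theta_hat = theta - eta * g_hat

        # Deploy theta_hat and observe the true gradient (e.g. a sampled
        # actor-critic gradient from real rollouts).
        g = true_grad_fn(theta_hat)

        # Correction Step: move along the prediction error.
        theta = theta_hat - eta * (g - g_hat)

        # Improve the predictive model with the observed gradient.
        predictor.update(g)
    return theta
```

With a zero prediction this reduces to the plain base algorithm; predicting with the last observed gradient recovers an optimistic/extra-gradient-style update.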

SLIDE 19

PicCoLO

The same idea applies to any algorithm in the family of (adaptive) mirror descent and follow-the-regularized-leader. PicCoLO recovers existing algorithms, e.g. extra-gradient and optimistic mirror descent, and provides their adaptive generalizations.

SLIDE 20

PicCoLO

Theoretically we can show

  • the performance is unbiased, even when the prediction (model) is incorrect
  • learning accelerates when the prediction is relatively accurate

SLIDE 21

How to compute the prediction

  • We want to set the prediction $\hat{g}_n$ equal to the gradient we will actually receive, $\nabla \ell_n(\hat{\pi}_n)$, because then the prediction error, and hence the regret, is small
  • We can use the predictive model to realize this!

SLIDE 22

How to compute the prediction

  • We want to set $\hat{g}_n = \nabla \ell_n(\hat{\pi}_n)$, because then the prediction error, and hence the regret, is small
  • Because $\hat{\pi}_n$ is determined by $\hat{g}_n$ through the Prediction Step, and the predictive model $\Phi_n$ estimates $\nabla \ell_n$, we can set $\hat{g}_n = \Phi_n(\hat{\pi}_n(\hat{g}_n))$
  • We can select $\hat{g}_n$ by solving this fixed-point problem (FP)

SLIDE 23

How to compute the prediction

  • When the predictive model is itself a gradient, $\Phi_n = \nabla f_n$, the FP becomes an optimization problem: solving it amounts to optimizing $f_n$ with a regularization toward the current policy (this is regularized optimal control)
  • Heuristic: set $\hat{g}_n = \Phi_n(\pi_n)$ (evaluate at the previous decision) or just do a few iterations of the fixed-point update; see the sketch below
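
A sketch of those two heuristics, reusing the hypothetical `predictor` interface and a gradient-descent Prediction Step from the earlier snippets:

```python
def previous_decision_heuristic(theta, predictor):
    """Evaluate the predictive model at the previous decision."""
    return predictor.predict(theta)

def fixed_point_heuristic(theta, predictor, eta, num_iters=3):
    """Approximate the fixed point g_hat = Phi(theta - eta * g_hat)
    by iterating the map a few times."""
    g_hat = predictor.predict(theta)
    for _ in range(num_iters):
        theta_hat = theta - eta * g_hat       # tentative Prediction Step
        g_hat = predictor.predict(theta_hat)  # re-evaluate the model there
    return g_hat
```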

SLIDE 24

Experiments

  • For example, with ADAM as the base algorithm

[Plot: accumulated reward vs. iteration on cartpole, comparing PicCoLO with the previous-decision heuristic against baselines; higher is better.]

PicCoLO shows acceleration when the predictions are accurate and is robust against model error. Similar properties are observed for other base algorithms (e.g. natural gradient descent, TRPO).

SLIDE 25

Experiments

  • For example, with ADAM as the base algorithm

[Plots: accumulated reward vs. iteration on hopper and snake, comparing the approximate fixed-point heuristic with the previous-decision heuristic; higher is better.]

The fixed-point formulation converges even faster.

SLIDE 26

Summary

  • “PicCoLOed” model-free algorithms can learn faster without bias
  • The predictive model can be viewed as a unified interface for injecting prior knowledge
  • As PicCoLO is designed for general predictable online learning, we expect applications to other problems and domains
  • Learning and parameterizing predictive models are of practical importance

SLIDE 27

Thanks for your attention

Please come to our poster # 106