Predictor-Corrector Policy Optimization (PicCoLO)
Ching-An Cheng#$, Xinyan Yan#, Nathan Ratliff$, Byron Boots#$ Jun 11 2019 @ ICML
Policy optimization
We consider episodic learning: optimize a policy for sequential decision making (MDP).
○ Learning efficiency = sample efficiency
○ We should maybe spend time on planning before real interactions
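For reference, the objective in standard episodic-MDP notation (a sketch; the symbols below are ours, not from the slides):

```latex
% Episodic policy optimization objective (standard MDP notation, ours):
% maximize the expected cumulative reward of policy pi over a horizon T.
\max_{\pi}\; J(\pi) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right],
\qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t).
```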
To do so we need models. But should we?
[Images: AirSim, FleX, Miniatur Wunderland]
"The reality gap"
[Video: ItsJerryAndHarry (YouTube)]
Model-free methods: unbiased but inefficient (the "unbiased kingdom").
Model-based methods: efficient but biased (the "biased world").
Existing bridges between the two: Dyna, dual policy iteration, control variates, truncating the horizon, learning to plan.
Model-free: unbiased but inefficient. Model-based: efficient but biased. Our goal: new unbiased, efficient hybrids.*
*Can be combined with control variates and learning to plan.
We should not fully trust a model (as the methods in the biased world do), but leverage only its correct part.
Policy optimization as online learning: LEARNER vs. OPPONENT
The learner is the policy optimization algorithm. In each round:
○ the learner makes a decision: try a policy;
○ the opponent chooses a loss;
○ the learner suffers the loss;
○ the learner updates the policy, e.g. by mirror descent, follow-the-regularized-leader, etc.
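As a minimal sketch of this protocol, here is online gradient descent (a special case of mirror descent) playing against a fixed sequence of quadratic losses; all names are illustrative, not from the talk:

```python
import numpy as np

# Online learning protocol: each round the learner plays a decision, the
# opponent reveals a loss, the learner suffers it and updates (sketch).

def online_gradient_descent(grad_fns, x0, step=0.1):
    """Each round: play x, suffer ell_n(x), observe its gradient, update."""
    x = np.asarray(x0, dtype=float)
    for grad in grad_fns:      # opponent reveals one loss per round
        g = grad(x)            # learner suffers ell_n(x) and sees the gradient
        x = x - step * g       # mirror descent step with Euclidean regularizer
    return x

# Example: losses ell_n(x) = ||x - c_n||^2 with slowly drifting targets c_n.
targets = [np.array([np.cos(n / 5.0), np.sin(n / 5.0)]) for n in range(50)]
grad_fns = [lambda x, c=c: 2.0 * (x - c) for c in targets]
print(online_gradient_descent(grad_fns, x0=np.zeros(2)))
```

Swapping the Euclidean update for a Bregman/proximal one gives general mirror descent; follow-the-regularized-leader instead replays all past losses at each round.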
For policy optimization, the opponent's round-n loss checks whether the new policy is better than the previous one: it averages the advantage function of the current policy over the states the current policy visits.
○ The learner makes a decision: try a policy, then update it from the observed loss.
○ The gradient of this loss is the actor-critic gradient (what is implemented in practice).
○ E.g., mirror descent in online learning -> the actor-critic update.
However, the loss functions here are not adversarial: they can be partially inferred from the past (e.g. similar policies visit similar states), whereas these typical algorithms were designed for adversarial setups.
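In symbols, this per-round loss is commonly written as follows (standard notation, paraphrased from the paper rather than copied from the slide):

```latex
% Per-round online loss: advantage of the candidate policy pi under the
% current policy pi_n's state distribution d_{pi_n}.
\ell_n(\pi) \;=\; \mathbb{E}_{s \sim d_{\pi_n}}\!\big[ A^{\pi_n}\!\big(s, \pi(s)\big) \big],
\qquad \nabla \ell_n(\pi_n) \;=\; \text{the actor-critic gradient}.
```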
Where can the predictive model come from? (Averaged) past gradients, a replay buffer, an inexact simulator, or a function approximator.
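As one concrete instance, a sketch of the simplest choice above: predicting the next gradient by averaging the gradients observed so far (the class name is ours; a replay buffer or simulator would slot in behind the same interface):

```python
import numpy as np

# Hypothetical predictive model: guess the next gradient as a running
# average of past observed gradients ("(averaged) past gradients").

class AveragedGradientPredictor:
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    def predict(self):
        """Predicted gradient g_hat for the upcoming round."""
        return self.mean.copy()

    def update(self, g):
        """Fold the newly observed true gradient into the running average."""
        self.count += 1
        self.mean += (np.asarray(g, dtype=float) - self.mean) / self.count
```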
This is PicCoLO
Split each loss into a predictable part and an adversarial part; the adversarial part is the prediction error. It suffices to consider the prediction error alone, if we wisely select the predictable part via a predictive model.
Apply a standard method (e.g. gradient descent) to this new sequence:
○ Prediction step: update using the predicted gradient, before the true loss is revealed.
○ Correction step: after observing the true gradient, update using the prediction error (true gradient minus predicted gradient).
(PicCoLO can use a control variate to further reduce the variance of the correction.)
Trick: adapt the step size based on the size of the gradient error; take larger steps when the prediction is accurate, and vice versa.
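Putting the two steps together, a minimal sketch with a plain gradient-descent base update; this is our simplification (the paper wraps general mirror-descent/FTRL updates), and the adaptive step-size rule shown is only schematic:

```python
import numpy as np

# Minimal PicCoLO sketch with a gradient-descent base update (ours).
# `predictor` needs predict()/update(), e.g. the averaging sketch above.

def piccolo_gd(x0, predictor, observe_gradient, num_rounds, base_step=0.5):
    x = np.asarray(x0, dtype=float)
    err_sq = 0.0                           # accumulated squared prediction error
    for n in range(num_rounds):
        g_hat = predictor.predict()        # Prediction step: move on the model's
        x = x - base_step * g_hat          # guess before seeing the true loss.
        g = observe_gradient(x)            # Try the new decision, observe g_n.
        e = g - g_hat                      # Prediction error e_n = g_n - g_hat_n.
        err_sq += float(e @ e)
        eta = base_step / np.sqrt(1.0 + err_sq)  # Schematic adaptive rule: steps
        x = x - eta * e                    # stay large while predictions are good.
        predictor.update(g)
    return x
```

With a perfect model the error e is zero and the learner simply follows the predictions; with an uninformative model (g_hat = 0) it degenerates to plain gradient descent, i.e. the unbiased but inefficient fallback.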
The same idea applies to any algorithm in the family of (adaptive) mirror descent and follow-the-regularized-leader. With its prediction and correction steps, PicCoLO recovers existing algorithms, e.g. extra-gradient and optimistic mirror descent, and provides their adaptive generalizations.
Theoretically we can show that, because each correction step cancels the error introduced by the preceding prediction step, PicCoLO's regret depends only on the prediction errors: when the model predicts the gradients well, learning accelerates; when it predicts poorly, PicCoLO still converges at the rate of the base algorithm.
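Paraphrasing the guarantee (exact norms and constants are in the paper):

```latex
% Paraphrased guarantee: PicCoLO's regret scales with the prediction errors
% e_n = g_n - \hat{g}_n, not with the raw gradients g_n.
\mathrm{Regret}_N
\;=\; \sum_{n=1}^{N} \ell_n(\pi_n) \;-\; \min_{\pi} \sum_{n=1}^{N} \ell_n(\pi)
\;\le\; O\!\left(\sqrt{\sum_{n=1}^{N} \big\| g_n - \hat{g}_n \big\|^2}\right).
```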
When the predictive model is a simulator, solving the prediction step is regularized optimal control; the correction step then compensates for the model's error.
Experiments: cartpole
[Plot: accumulated reward vs. iteration; higher is better; legend includes the previous-decision heuristic for the prediction step]
○ PicCoLO shows acceleration when predictions are accurate, and is robust against model error.
○ Similar properties are observed for other base algorithms (e.g. natural gradient descent, TRPO).
Experiments: hopper and snake
[Plots: accumulated reward vs. iteration; higher is better; legends include the approximate fixed-point heuristic and the previous-decision heuristic]
○ The fixed-point formulation converges even faster.
○ The approach extends to applications in other problems and domains.
Learning and parameterizing predictive models are of practical importance.
Please come to our poster #106.