SLIDE 1

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

(M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016

SLIDE 2

PILCO Graphical Model

PILCO: Probabilistic Inference for Learning COntrol.

• Latent states {Xt} evolve through time based on the previous states and controls.
• The policy π maps Zt, a noisy observation of Xt, to a control Ut.
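
As a concrete illustration of this generative structure, here is a minimal simulation sketch; the dynamics f, the observation-noise scale, and the linear policy are invented stand-ins, not PILCO's own components.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u):
    # Invented latent dynamics, standing in for the unknown f.
    return x + 0.1 * np.tanh(u - x)

def pi(z, theta):
    # Invented linear policy acting on the noisy observation z.
    return theta @ z

theta = np.array([0.5])
x = np.array([1.0])                               # latent state X_0
for t in range(5):
    z = x + rng.normal(scale=0.01, size=x.shape)  # noisy observation Z_t of X_t
    u = pi(z, theta)                              # control U_t = pi(Z_t)
    x = f(x, u)                                   # X_{t+1} = f(X_t, U_t)
    print(t, x)
```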

SLIDE 3

PILCO Objective

Transitions follow the dynamical system xt = f(xt−1, ut−1), where x ∈ R^D, u ∈ R^F, and f is a latent function. Let π be parameterized by θ, with ut = π(xt, θ). The objective is to find the π that minimizes the expected cost of following π for T steps, J^π(θ) = Σ_{t=1}^{T} E[c(xt)]. The cost function encodes information about a target state, e.g., the saturating cost c(x) = 1 − exp(−‖x − x_target‖²/σc²).
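
A direct transcription of this saturating cost (note that some presentations place a factor of 1/2 in the exponent; this follows the form above):

```python
import numpy as np

def saturating_cost(x, x_target, sigma_c):
    """c(x) = 1 - exp(-||x - x_target||^2 / sigma_c^2): near 0 at the
    target, saturating at 1 far away."""
    d2 = np.sum((np.asarray(x) - np.asarray(x_target)) ** 2)
    return 1.0 - np.exp(-d2 / sigma_c**2)

print(saturating_cost([0.05, 0.0], [0.0, 0.0], sigma_c=0.25))  # near target: ~0.04
print(saturating_cost([2.00, 0.0], [0.0, 0.0], sigma_c=0.25))  # far away:   ~1.0
```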

SLIDE 4

Algorithm

PILCO iterates between model learning, policy evaluation, and policy improvement (Section 2 of reference 1):
1. Apply random controls to the system and record the resulting data.
2. Learn a GP dynamics model from all data collected so far.
3. Policy evaluation: predict the state distributions p(x1), . . . , p(xT) under π and compute J^π(θ).
4. Policy improvement: update θ using the analytic gradient dJ^π(θ)/dθ.
5. Apply the improved policy to the system, record the new data, and return to step 2.


SLIDE 6

Dynamics Model Learning

Given a finite set of observed transitions, there are multiple plausible function approximators of f.

SLIDE 8

Dynamics Model Learning

Define a Gaussian process (GP) prior on the latent dynamics function f.

SLIDE 9

Dynamics Model Learning

Let the prior on f be GP(0, k(x̃, x̃′)), where x̃ = [xᵀ uᵀ]ᵀ and the squared exponential kernel is given by k(x̃, x̃′) = σf² exp(−½ (x̃ − x̃′)ᵀ Λ⁻¹ (x̃ − x̃′)), with Λ a diagonal matrix of squared characteristic length-scales and σf² the signal variance.
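
A minimal numpy transcription of this kernel; the parameterization Λ = diag(lengthscales²) is the standard automatic-relevance-determination convention assumed here.

```python
import numpy as np

def se_kernel(xa, xb, sigma_f, lengthscales):
    """k(xa, xb) = sigma_f^2 * exp(-0.5 * (xa-xb)^T Lambda^{-1} (xa-xb)),
    with Lambda = diag(lengthscales^2)."""
    d = (np.asarray(xa) - np.asarray(xb)) / np.asarray(lengthscales)
    return sigma_f**2 * np.exp(-0.5 * np.dot(d, d))

print(se_kernel([0.0, 0.0], [0.0, 0.0], sigma_f=1.0, lengthscales=[1.0, 0.5]))  # = sigma_f^2
print(se_kernel([0.0, 0.0], [1.0, 1.0], sigma_f=1.0, lengthscales=[1.0, 0.5]))  # decays with distance
```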

SLIDE 10

Dynamics Model Learning

Let ∆t = xt − xt−1 + ε, where ε ∼ N(0, Σε) and Σε = diag(σ²ε1, . . . , σ²εD). The GP yields one-step predictions (see Section 2.2 of reference 3). Given n training inputs X̃ = [x̃1, . . . , x̃n] and corresponding training targets y = [∆1, . . . , ∆n], the GP hyperparameters are learned by evidence maximization (type-II maximum likelihood), yielding the posterior GP.
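
A sketch of this model-learning step using scikit-learn as a stand-in for the paper's own GP machinery: the toy transition data are invented, and while PILCO trains one GP per target dimension, this sketch uses a single 1-D target. GaussianProcessRegressor fits the kernel hyperparameters by maximizing the log marginal likelihood, i.e., evidence maximization.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)

# Invented toy system: inputs x_tilde = [x, u], targets Delta = x' - x + noise.
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = 0.1 * np.sin(X[:, 0]) + 0.05 * X[:, 1] + rng.normal(scale=0.01, size=50)

# SE kernel with per-dimension length-scales plus a noise term; .fit()
# maximizes the log marginal likelihood over the hyperparameters.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0]) + WhiteKernel(1e-4)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# One-step prediction of Delta at a new (x, u), with model uncertainty.
mu, std = gp.predict(np.array([[0.3, -0.2]]), return_std=True)
print(mu, std)
```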

SLIDE 11

Algorithm

SLIDE 12

Policy Evaluation

Evaluating the objective J^π(θ) = Σ_{t=1}^{T} E[c(xt)] requires the predictive state distributions p(x1), . . . , p(xT). We have xt = xt−1 + ∆t − ε, where, in general, computing p(∆t) is analytically intractable. Instead, p(∆t) is approximated with a Gaussian via moment matching.

SLIDE 13

Moment Matching

• The input distribution p(xt−1, ut−1) is assumed Gaussian.
• Propagating it through the GP model yields p(∆t), which is non-Gaussian.
• p(∆t) is approximated by a Gaussian via moment matching.

SLIDE 14

Moment Matching

p(xt) can now be approximated by N(µt, Σt), where the moments µ∆ and Σ∆ of p(∆t) are computed exactly via the laws of iterated expectation and (co)variance; then µt = µt−1 + µ∆ and Σt = Σt−1 + Σ∆ + Cov[xt−1, ∆t] + Cov[∆t, xt−1].
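
PILCO obtains µ∆ and Σ∆ in closed form; the Monte Carlo sketch below only illustrates what moment matching means (the nonlinearity g and the input moments are invented, and sampling replaces the exact computation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian input (1-D for illustration): p(x_{t-1}) = N(0.5, 0.2).
mu_in, var_in = 0.5, 0.2

# Invented nonlinearity standing in for the GP's predictive map.
g = lambda x: np.sin(3.0 * x)

# The pushforward distribution p(Delta) is non-Gaussian; moment matching
# replaces it with the Gaussian N(mu_d, var_d) sharing its first two moments.
x = rng.normal(mu_in, np.sqrt(var_in), size=200_000)
delta = g(x)
mu_d, var_d = delta.mean(), delta.var()
print(f"moment-matched approximation: N({mu_d:.4f}, {var_d:.4f})")
```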

SLIDE 15

Algorithm

SLIDE 16

Analytic Gradient for Policy Improvement

Let Et = E[c(xt)], so that J^π(θ) = Σ_{t=1}^{T} Et.

Et depends on θ through p(xt); p(xt) depends on θ through p(xt−1); p(xt−1) depends on θ through its moments µt−1 and Σt−1, and so on, down to the moments µu and Σu of the control distribution, since ut = π(xt, θ). The chain rule is applied along this chain of dependencies to compute dJ^π(θ)/dθ analytically. Analytic gradients allow gradient-based non-convex optimization methods, e.g., CG or L-BFGS.
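
A sketch of this optimization pattern with SciPy's L-BFGS; the toy objective and its hand-coded gradient are invented stand-ins for J^π(θ) and dJ^π(θ)/dθ as produced by the chain rule in PILCO.

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_grad(theta):
    # Invented toy objective and its analytic gradient.
    J = np.sum((theta - 1.0) ** 2) + 0.1 * np.sum(np.sin(5.0 * theta))
    dJ = 2.0 * (theta - 1.0) + 0.5 * np.cos(5.0 * theta)
    return J, dJ

# jac=True tells SciPy the function returns (value, gradient) as a pair.
res = minimize(objective_and_grad, x0=np.zeros(3), jac=True, method="L-BFGS-B")
print(res.x, res.fun)
```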

SLIDE 17

Data-Efficiency

SLIDE 18

Advantages and Disadvantages

Advantages:
• Data-efficient.
• Incorporates model uncertainty into long-term planning.
• Does not rely on expert knowledge, e.g., demonstrations or task-specific prior knowledge.

Disadvantages:
• Not an optimal-control method.
• If the p(xt) do not cover the target region and σc induces a cost that is very peaked around the target, PILCO gets stuck in a local optimum because of zero gradients.
• Learned dynamics models are only confident in areas of the state space previously observed.
• Does not take temporal correlation into account: model uncertainty is treated as uncorrelated noise.

SLIDE 19

Extension: PILCO with Bayesian Filtering

• R. McAllister and C.E. Rasmussen, "Data-Efficient Reinforcement Learning in Continuous-State POMDPs." https://arxiv.org/abs/1602.02523

SLIDE 20

References

1. M.P. Deisenroth and C.E. Rasmussen, "PILCO: A Model-Based and Data-Efficient Approach to Policy Search," in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.
2. R. McAllister and C.E. Rasmussen, "Data-Efficient Reinforcement Learning in Continuous-State POMDPs." https://arxiv.org/abs/1602.02523
3. C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006. www.gaussianprocess.org/gpml/chapters
4. C.M. Bishop, Pattern Recognition and Machine Learning, Chapter 6.4. Springer, 2006. ISBN 0-387-31073-8.