SLIDE 1

Adaptive importance sampling for control and inference∗

Bert Kappen SNN Donders Institute, Radboud University, Nijmegen Gatsby Unit, UCL London December 10, 2016

∗Joint work with Hans Ruiz, Dominik Thalmeier

Bert Kappen

SLIDE 2

Optimal control theory

Hard problems:

  • a learning and exploration problem
  • a stochastic optimal control computation
  • a representation problem for the control function u(x, t)

SLIDE 3

PICE: integrating Control, Inference and Learning

Path integral control theory

  • Express a control computation as an inference computation.
  • Compute the optimal control using MC sampling.

SLIDE 4

PICE: Integrating Control, Inference and Learning

Path integral control theory

  • Express a control computation as an inference computation.
  • Compute the optimal control using MC sampling.

Importance sampling

  • Accelerate sampling with importance sampling (= a state-feedback controller).
  • The optimal importance sampler is the optimal control.

SLIDE 5

PICE: Integrating Control, Inference and Learning

Path integral control theory

  • Express a control computation as an inference computation.
  • Compute the optimal control using MC sampling.

Importance sampling

  • Accelerate sampling with importance sampling (= a state-feedback controller).
  • The optimal importance sampler is the optimal control.

Learning

  • Learn the controller from self-generated data.
  • Use the Cross Entropy method for a parametrized controller.

SLIDE 6

PICE: Integrating control, inference and learning

Massively parallel computation

SLIDE 7

PICE: Integrating control, inference and learning

Massively parallel computation

The Monte Carlo sampling serves two purposes:

  • Planning: compute the control for the current state
  • Learning: improve the sampler/controller for future control computations

SLIDE 8

Path integral control theory

The uncontrolled dynamics specifies a distribution q(τ|x, t) over trajectories τ starting from x, t. The cost of a trajectory τ is

S(τ|x, t) = φ(x_T) + ∫_t^T ds V(x_s, s)

Find the optimal distribution p(τ|x, t) that minimizes E_p S and is 'close' to q(τ|x, t).

SLIDE 9

KL control

Find p∗ that minimizes

C(p) = KL(p|q) + E_p S,    KL(p|q) = ∫ dτ p(τ|x, t) log [p(τ|x, t) / q(τ|x, t)]

The optimal solution is given by

p∗(τ|x, t) = (1/ψ(x, t)) q(τ|x, t) exp(−S(τ|x, t)),    ψ(x, t) = ∫ dτ q(τ|x, t) exp(−S(τ|x, t)) = E_q e^{−S}

The optimal cost is:

C(p∗) = − log ψ(x, t)
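The fixed point and its cost are easy to check numerically. Below is a minimal sanity check in Python on an assumed discrete toy problem (not from the slides), where the "trajectories" are just three outcomes with prior q and made-up costs S:

```python
import numpy as np

# Toy discrete analogue of the KL-control solution: three "trajectories"
# with prior probabilities q and path costs S (illustrative numbers).
q = np.array([0.5, 0.3, 0.2])
S = np.array([3.0, 1.0, 0.1])

psi = np.sum(q * np.exp(-S))        # psi = E_q exp(-S)
p_star = q * np.exp(-S) / psi       # optimal distribution p* = q exp(-S) / psi
C_star = -np.log(psi)               # optimal cost C(p*) = -log psi

# Verify that C(p*) = KL(p*|q) + E_{p*} S indeed equals -log psi
kl = np.sum(p_star * np.log(p_star / q))
cost = kl + np.sum(p_star * S)
print(np.isclose(cost, C_star))  # True
```

The check works because log(p∗/q) = −S − log ψ, so the KL term and the expected cost cancel up to the constant −log ψ.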

SLIDE 10

Controlled diffusions

p(τ|x, t) is parametrized by a function u(x, t):

dX_t = f(X_t, t) dt + g(X_t, t) (u(X_t, t) dt + dW_t),    E(dW_t²) = dt

C(u|x, t) = E_u [ S(τ|x, t) + ∫_t^T ds ½ u(X_s, s)² ]

  • q(τ|x, t) corresponds to u = 0.

The goal is to find the function u(x, t) that minimizes C.

SLIDE 11

Solution

The optimal control problem is solved as a Feynman-Kac path integral. The optimal cost-to-go is

J(x, t) = − log ∫ dτ q(τ|x, t) e^{−S(τ|x, t)} = − log E_q [e^{−S}]

and the optimal control is

u∗(x, t) dt = E_{p∗}(dW_t) = E_q [dW e^{−S}] / E_q [e^{−S}]

Both ψ and u∗ can be computed by forward sampling from q.

SLIDE 12

Sampling

Sample trajectories τ_i, i = 1, . . . , N ∼ q(τ|x, t):

E_q e^{−S} ≈ (1/N) Σ_{i=1}^N e^{−S(τ_i|x, t)}

Sampling is unbiased but inefficient (large variance).
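The naive sampler is a few lines of Python. The setup below is an assumed toy problem (not from the slides): dX = dW with u = 0, V = 0, end cost φ(x) = x²/2 and T = 1, for which ψ = E_q e^{−S} = 1/√2 exactly:

```python
import numpy as np

# Naive Monte Carlo estimate of psi = E_q[exp(-S)] for an assumed toy
# 1-d problem: dX = dW (uncontrolled), V = 0, end cost phi(x) = x^2/2.
rng = np.random.default_rng(0)

def estimate_psi(n_samples, n_steps=100, T=1.0, x0=0.0):
    dt = T / n_steps
    x = np.full(n_samples, x0)
    for _ in range(n_steps):
        x += rng.normal(scale=np.sqrt(dt), size=n_samples)  # Euler step of dX = dW
    S = 0.5 * x**2                 # path cost reduces to the end cost (V = 0)
    return np.mean(np.exp(-S))

print(estimate_psi(100_000))       # close to the exact value 1/sqrt(2) ~ 0.707
```

With 100 000 samples the estimate is accurate here, but for harder problems most trajectories carry a negligible weight e^{−S}, which is exactly the variance problem the next slides address.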

SLIDE 13

Importance sampling

Consider a simple 1-d sampling problem. Given q(x), compute

a = Prob(x < 0) = ∫_{−∞}^{∞} I(x) q(x) dx

with I(x) = 1 if x < 0 and I(x) = 0 if x > 0. Naive method: generate N samples X_i ∼ q and estimate

â = (1/N) Σ_{i=1}^N I(X_i)

SLIDE 14

Importance sampling

Consider another distribution p(x). Then

a = Prob(x < 0) = ∫_{−∞}^{∞} I(x) [q(x)/p(x)] p(x) dx

Importance sampling: generate N samples X_i ∼ p and estimate

â = (1/N) Σ_{i=1}^N I(X_i) q(X_i)/p(X_i)

Unbiased (= correct) for any distribution p!
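The two estimators can be compared directly. The densities below are assumptions for illustration (the slide leaves q unspecified): target q = N(1, 1), so a = Φ(−1) ≈ 0.159, and proposal p = N(−1, 1), which concentrates samples in the region where I(x) = 1:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
N = 100_000

def q_pdf(x):  # assumed target density N(1, 1)
    return np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)

def p_pdf(x):  # assumed proposal density N(-1, 1)
    return np.exp(-0.5 * (x + 1) ** 2) / np.sqrt(2 * np.pi)

# Naive estimator: sample q, count how often x < 0.
x_q = rng.normal(1.0, 1.0, N)
a_naive = np.mean(x_q < 0)

# Importance sampling: sample p, reweight each sample by q(x)/p(x).
x_p = rng.normal(-1.0, 1.0, N)
a_is = np.mean((x_p < 0) * q_pdf(x_p) / p_pdf(x_p))

a_exact = 0.5 * (1 + erf(-1 / sqrt(2)))  # Phi(-1) ~ 0.1587
print(a_naive, a_is)                     # both close to 0.1587
```

Both estimators are unbiased; the importance-sampled one has the smaller variance because its samples land mostly where the indicator is nonzero.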

SLIDE 15

Optimal importance sampling

The distribution

p∗(x) = q(x) I(x) / a

is the optimal importance sampler. One sample X ∼ p∗ is sufficient to estimate a:

â = I(X) q(X) / p∗(X) = a

SLIDE 16

Importance sampling and control

In the case of control we must compute

J(x, t) = − log E_q e^{−S},    u∗(x, t) dt = E_q [dW e^{−S}] / E_q [e^{−S}]

Instead of samples from the uncontrolled dynamics q (u = 0), we sample with p (u ≠ 0):

E_q e^{−S} = E_p e^{−S_u},    e^{−S_u} = e^{−S} dq/dp = exp( −S − ∫_t^T ds ½ u(x_s, s)² − ∫_t^T u(x_s, s) dW_s )

We can choose any p, i.e. any sampling control u, to compute the expectation values.
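A sketch of this in Python, on an assumed toy problem (not from the slides): dX = u dt + dW, V = 0, end cost φ(x) = x²/2, x0 = 3, with a hand-picked linear feedback u(x) = −x as the sampling control. Both samplers estimate the same ψ = E_q e^{−S}; the controlled one spreads its weights more evenly:

```python
import numpy as np

# Importance-sampled estimate of psi = E_q[exp(-S)] using a sampling control.
# Assumed toy problem: dX = u dt + dW, V = 0, end cost phi(x) = x^2/2, x0 = 3.
rng = np.random.default_rng(2)

def estimate_psi(n_samples, use_control, n_steps=200, T=1.0, x0=3.0):
    dt = T / n_steps
    x = np.full(n_samples, x0)
    S_u = np.zeros(n_samples)            # S_u = S + int 1/2 u^2 ds + int u dW
    for _ in range(n_steps):
        u = -x if use_control else 0.0   # hand-picked sampling control u(x) = -x
        dW = rng.normal(scale=np.sqrt(dt), size=n_samples)
        S_u += 0.5 * u**2 * dt + u * dW  # accumulate the likelihood-ratio terms
        x += u * dt + dW                 # Euler step of the controlled dynamics
    S_u += 0.5 * x**2                    # add the end cost phi(x_T)
    w = np.exp(-S_u)
    return w.mean(), w

psi_q, w_q = estimate_psi(10_000, use_control=False)  # plain sampling from q
psi_p, w_p = estimate_psi(10_000, use_control=True)   # importance sampling
ess = lambda w: w.sum() ** 2 / (w ** 2).sum()         # effective sample size
print(psi_q, psi_p)          # both estimate the same psi (~0.075 for this setup)
print(ess(w_q), ess(w_p))    # the controlled sampler has the larger ESS
```

The discrete weight exp(−½u²dt − u dW) is exactly the density ratio dq/dp of the Euler-discretized processes, so the reweighted estimate stays unbiased for any sampling control.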

SLIDE 17

Relation between optimal sampling and optimal control

Define

α_i = e^{−S_u(τ_i|x,t)} / Σ_{j=1}^N e^{−S_u(τ_j|x,t)},    ESS = 1 / Σ_{j=1}^N α_j²,    (1 ≤ ESS ≤ N)

Thm:

  • 1. A better u (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
  • 2. The optimal u = u∗ (in the sense of optimal control) requires only one sample: α_i = 1/N and S_u(τ|x, t) is deterministic, where

S_u(τ|x, t) = S(τ|x, t) + ∫_t^T ds ½ u(x_s, s)² + ∫_t^T u(x_s, s) dW_s
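The effective sample size is straightforward to compute from the trajectory weights; a small sketch, with hypothetical log-weight vectors (i.e. the −S_u values) as input:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = 1 / sum_i alpha_i^2 for normalized weights alpha_i; 1 <= ESS <= N."""
    w = np.exp(log_w - np.max(log_w))  # subtract the max for numerical stability
    alpha = w / w.sum()                # normalized importance weights alpha_i
    return 1.0 / np.sum(alpha ** 2)

# Uniform weights: the best case, ESS = N.
print(effective_sample_size(np.zeros(5)))                          # 5.0
# One trajectory dominates: the worst case, ESS ~ 1.
print(effective_sample_size(np.array([0.0, -50.0, -50.0, -50.0, -50.0])))
```

Working with log-weights and subtracting the maximum avoids overflow, which matters in practice because the costs S_u can differ by hundreds of nats between trajectories.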

SLIDE 18

So far

  • Optimal control can be computed by MC sampling
  • Sampling can be accelerated by using 'good' controls
  • The optimal control for sampling is also the optimal control solution

How to learn a good controller?

SLIDE 19

The Cross-entropy method

Let p_u(x) be a family of probability density functions parametrized by u, and let h(x) be a positive function. Consider the expectation value

a = E_0 h = ∫ dx p_0(x) h(x)

for the particular value u = 0. The optimal importance sampling distribution is p∗(x) = h(x) p_0(x)/a. The cross entropy method minimizes the KL divergence

KL(p∗|p_u) = ∫ dx p∗(x) log [p∗(x)/p_u(x)] ∝ −E_{p∗} log p_u(X) ∝ −E_0 [h(X) log p_u(X)] = −E_v [h(X) (p_0(X)/p_v(X)) log p_u(X)]

which is minimized iteratively: p_0 → p_1 → p_2 → . . .

SLIDE 20

The CE method for PI control

Sample pu using

dXt = f(Xt, t)dt + g(Xt, t) (u(Xt, t)dt + dWt)

We wish to compute a close-to-optimal control u such that p_u is close to p∗. Following the CE argument, we minimize

KL(p∗|p_u) = (1/ψ(t, x)) E_v [ e^{−S(t,x,v)} ∫_t^T ds ½ ( u(X_s, s) − v(X_s, s) − dW_s/ds )² ]

Here v is the importance sampling control. The expected value is independent of v, but the variance/accuracy depends on v.

SLIDE 21

The CE method for PI control

We parametrize the control u(x, t|θ). The gradient is given by:

∂KL(p∗|p_u)/∂θ = ⟨ ∫_t^T (u(X_s, s) ds − v(X_s, s) ds − dW_s) ∂u(X_s, s)/∂θ ⟩_v
               = −⟨ ∫_t^T dW_s ∂u(X_s, s)/∂θ ⟩_u    (taking v = u)

Gradient descent update:

θ := θ − ε ∂KL(p∗|p_u)/∂θ

We refer to the method as PICE (Path Integral Cross Entropy).
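A minimal PICE-style learning loop can be sketched on an assumed toy problem (not the paper's benchmarks): dX = u dt + dW, V = 0, end cost φ(x) = x²/2, x0 = 3, with a one-parameter linear feedback u(x|θ) = θx. The gradient is the importance-weighted estimate of −⟨∫ dW_s ∂u/∂θ⟩:

```python
import numpy as np

# PICE-style gradient descent sketch on an assumed toy problem:
# dX = u dt + dW, V = 0, end cost phi(x) = x^2/2, x0 = 3, and a
# one-parameter linear feedback u(x | theta) = theta * x.
rng = np.random.default_rng(3)

def pice_step(theta, n_samples=5000, n_steps=100, T=1.0, x0=3.0, eps=0.1):
    dt = T / n_steps
    x = np.full(n_samples, x0)
    S_u = np.zeros(n_samples)        # per-trajectory cost S_u
    grad_w = np.zeros(n_samples)     # accumulates int dW_s * du/dtheta
    for _ in range(n_steps):
        u = theta * x
        dW = rng.normal(scale=np.sqrt(dt), size=n_samples)
        S_u += 0.5 * u**2 * dt + u * dW
        grad_w += dW * x             # du/dtheta = x for u = theta * x
        x += u * dt + dW
    S_u += 0.5 * x**2                # add the end cost
    w = np.exp(-(S_u - S_u.min()))   # unnormalized weights, stabilized
    alpha = w / w.sum()              # normalized weights alpha_i
    grad = -np.sum(alpha * grad_w)   # weighted estimate of dKL/dtheta
    return theta - eps * grad        # theta := theta - eps * dKL/dtheta

theta = 0.0
for _ in range(50):
    theta = pice_step(theta)
print(theta)   # settles at a negative (stabilizing) feedback gain
```

Each iteration samples with the current controller and reuses the same rollouts both to estimate the weights α_i and to take a gradient step, which is exactly the planning/learning dual use of the samples described earlier.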

SLIDE 22

Model based motor learning

Compute the control:

for k = 0, 1, . . . do
    data_k = generate_data(model, u_k)    % Monte Carlo importance sampler
    u_{k+1} = learn_control(data_k, u_k)  % deep or recurrent learning
end for

SLIDE 23

Parallel implementation

  • Massively parallel sampling on CPUs
  • Massively parallel gradient computation on CPU/GPU
  • Goal: provide a generic solver for any PI control problem to arbitrary precision.

SLIDE 24

Acrobot

A 2-DOF, second-order, underactuated, continuous stochastic control problem. The task is to swing up from the down position and stabilize.

SLIDE 25

Acrobot

(acrobot.mp4)

  • Neural network: 2 hidden layers, 50 neurons per layer; input is position and velocity.
  • 2000 iterations, with 30000 rollouts per iteration.
  • 100 cores, 15 minutes.

SLIDE 26

More samples per iteration is better :)

Fraction ESS versus IS iteration:

  • 100k samples (green, cyan)
  • 300k samples (red, blue)
  • 1000k samples (black, yellow)

SLIDE 27

Trust region

The initial gradient computation is too hard. Introduce a (KL) trust region.

Control cost vs. IS iteration:

  • Blue line: small trust region (ESS ≈ 50%, 30k samples) (= video)
  • Red line: intermediate trust region (ESS ≈ 1%, 100k samples)
  • Green line: large trust region (ESS ≈ 0.1%, 300k samples)

Trade-off between speed and optimality.

SLIDE 28

Discussion

Continuous time SOC is very hard to compute.

  • PI control: Control ↔ inference
  • Better sampling (ESS) ↔ better control (control objective)
  • IS: Learning control solution also increases efficiency of (future) control computations

SLIDE 29

Discussion

Continuous time SOC is very hard to compute.

  • PI control: Control ↔ inference
  • Better sampling (ESS) ↔ better control (control objective)
  • IS: Learning control solution also increases efficiency of (future) control computations

Continuous time SOC is very hard to represent.

  • CE for parameter estimation → deep neural network

SLIDE 30

Discussion

Continuous time SOC is very hard to compute.

  • PI control: Control ↔ inference
  • Better sampling (ESS) ↔ better control (control objective)
  • IS: Learning control solution also increases efficiency of (future) control computations

Continuous time SOC is very hard to represent.

  • CE for parameter estimation → deep neural network

For robotics? Embed PICE as the inner loop and estimate the model in the outer loop:

for t = 0, 1, . . . do
    dataset = real_world_samples(u_t)
    model = compute_model(model, dataset)
    for k = 0, 1, . . . do
        data_k = generate_data(model, u_k)
        u_{k+1} = learn_control(data_k, u_k)
    end for
end for

SLIDE 31

Thank you!

Kappen, Hilbert Johan, and Hans Christian Ruiz. ”Adaptive importance sampling for control and inference.” Journal of Statistical Physics 162.5 (2016): 1244-1266.
