SLIDE 1

Path integral control

Minimization wrt u yields: 11

u = −R⁻¹ g′ ∇J

−∂tJ = −½ (∇J)′ g R⁻¹ g′ (∇J) + V + (∇J)′ f + ½ Tr(g ν g′ ∇²J)

Define ψ(x, t) through J(x, t) = −λ log ψ(x, t) and impose a relation between R and ν:

R = λ ν⁻¹

with λ a positive number.

11 In components: ua = −∑_{b,i} R⁻¹_{ab} gib(x, t) ∂J(x, t)/∂xi

Bert Kappen ML 348

SLIDE 2

Path integral control

Then the HJB becomes linear in ψ:

−∂tψ = (−V/λ + f′∇ + ½ Tr(g ν g′ ∇²)) ψ

with end condition ψ(x, T) = exp(−φ(x)/λ). 12

12 We sketch the derivation for g = 1. Substituting J = −λ log ψ,

−½ (∇J)′ R⁻¹ (∇J) + ½ Tr(ν ∇²J)
    = −½ ∑_{ij} ∇iJ R⁻¹_{ij} ∇jJ + ½ λ ∑_{ij} R⁻¹_{ij} ∇_{ij}J
    = ½ ∑_{ij} R⁻¹_{ij} (−∇iJ ∇jJ + λ ∇_{ij}J)
    = ½ ∑_{ij} R⁻¹_{ij} (−λ² (1/ψ) ∇_{ij}ψ)

since

−∇iJ ∇jJ = −λ² (1/ψ²) ∇iψ ∇jψ
∇_{ij}J = −λ ∇i∇j log ψ = −λ ∇i ((1/ψ) ∇jψ) = λ (1/ψ²) ∇iψ ∇jψ − λ (1/ψ) ∇_{ij}ψ
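The log-transform step in the footnote can be checked numerically in one dimension. The test function ψ(x) = exp(−x²) and the parameter values below are arbitrary choices for illustration, not from the slides:

```python
import math

# Numerical check (1-d, g = 1) of the identity used above:
#   -(J')^2/(2R) + (nu/2) J'' = -(lambda^2/(2R)) * psi''/psi
# with J = -lambda*log(psi) and R = lambda/nu.
lam, nu = 1.7, 0.4
R = lam / nu  # R = lambda * nu^-1

def psi(x):   return math.exp(-x * x)
def dpsi(x):  return -2.0 * x * psi(x)
def d2psi(x): return (4.0 * x * x - 2.0) * psi(x)

def lhs(x):
    J1 = -lam * dpsi(x) / psi(x)                               # J'
    J2 = -lam * (d2psi(x) / psi(x) - (dpsi(x) / psi(x)) ** 2)  # J''
    return -J1 ** 2 / (2 * R) + (nu / 2) * J2

def rhs(x):
    return -(lam ** 2 / (2 * R)) * d2psi(x) / psi(x)

for x in [-1.3, -0.2, 0.5, 2.0]:
    assert abs(lhs(x) - rhs(x)) < 1e-10
```

The cross terms in ∇iψ∇jψ cancel exactly, which is why the two sides agree to machine precision.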

SLIDE 3

Path integral control

We identify ψ(x, t) ∝ p(z, T|x, t); then the linear Bellman equation

−∂tψ = (−V/λ + f′∇ + ½ Tr(g ν g′ ∇²)) ψ

can be interpreted as a Kolmogorov backward equation for the process

dxi = fi(x, t) dt + ∑_a gia(x, t) dξa
x(t) = † with probability V(x, t) dt/λ
x(T) = † with probability φ(x)/λ

The corresponding forward equation is

∂tρ = −(V/λ) ρ − ∇(f ρ) + ½ Tr(∇²(g ν g′ ρ))

with ρ(x, t) = p(x, t|z, 0) and ρ(x, 0) = δ(x − z).

SLIDE 4

Feynman-Kac formula

Denote Q(τ|x, t) the distribution over uncontrolled trajectories that start at x, t:

dx = f(x, t) dt + g(x, t) dξ

with τ a trajectory x(t → T). Then

ψ(x, t) = ∫ dQ(τ|x, t) exp(−S(τ)/λ)

S(τ) = φ(x(T)) + ∫_t^T ds V(x(s), s)

ψ can be computed by forward sampling the uncontrolled process.
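A minimal numerical sketch of this (my own toy setup, not from the slides): take f = 0, g = 1, ν = λ = 1, V = 0 and end cost φ(x) = x². Then the uncontrolled end point is X_T ∼ N(x, T − t) and ψ has the closed form ψ(x, t) = exp(−x²/(1 + 2σ²))/√(1 + 2σ²) with σ² = T − t:

```python
import math, random

random.seed(0)

# Forward-sample the uncontrolled process (f = 0, g = 1, nu = 1) and
# estimate psi(x,t) = E_q exp(-S(tau)/lambda), lambda = 1, V = 0, phi(x) = x^2.
# With V = 0 only the end point matters: X_T ~ N(x, T - t).
x0, t0, T = 0.5, 0.0, 1.0
sigma2 = T - t0
N = 200_000

est = 0.0
for _ in range(N):
    xT = random.gauss(x0, math.sqrt(sigma2))  # uncontrolled end point
    est += math.exp(-xT ** 2)                 # exp(-phi(X_T)/lambda)
est /= N

exact = math.exp(-x0 ** 2 / (1 + 2 * sigma2)) / math.sqrt(1 + 2 * sigma2)
assert abs(est - exact) < 0.01
```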

SLIDE 5

Alternative derivation

Uncontrolled dynamics specifies a distribution q(τ|x, t) over trajectories τ from x, t. The cost of trajectory τ is

S(τ|x, t) = φ(xT) + ∫_t^T ds V(xs, s)

Find the optimal distribution p(τ|x, t) that minimizes Ep S and is ’close’ to q(τ|x, t).

SLIDE 6

KL control

Find p∗ that minimizes

C(p) = KL(p|q) + Ep S

KL(p|q) = ∫ dτ p(τ|x, t) log [p(τ|x, t)/q(τ|x, t)]

The optimal solution is given by

p∗(τ|x, t) = (1/ψ(x, t)) q(τ|x, t) exp(−S(τ|x, t))

ψ(x, t) = ∫ dτ q(τ|x, t) exp(−S(τ|x, t)) = Eq e−S

The optimal cost is:

C(p∗) = −log ψ(x, t)

SLIDE 7

Controlled diffusions

In the case of controlled diffusions, p(τ|x, t) is parametrised by functions u(x, t); q(τ|x, t) corresponds to u(x, t) = 0:

dXt = f(Xt, t) dt + g(Xt, t)(u(Xt, t) dt + dWt)    E(dWi dWj) = νij dt

C(p) = Ep [∫ dt ½ u(Xt, t)′ ν⁻¹ u(Xt, t) + S(τ|x, t)]

J(x, t) = −log ψ(x, t) is the solution of the Bellman equation.

p∗ is generated by the optimal control u∗(x, t):

u∗(x, t) dt = Ep∗(dWt) = Eq[dWt e−S] / Eq[e−S]

ψ and u∗ can be computed by forward sampling from q.
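A sketch of this estimator on a toy problem of my own choosing (not from the slides): f = 0, g = 1, ν = λ = 1, V = 0, φ(x) = ½x², for which u∗(x, t) = −x/(1 + T − t) in closed form:

```python
import math, random

random.seed(1)

# Estimate u*(x,t) dt = E_q[dW_t e^{-S}] / E_q[e^{-S}] by forward sampling
# the uncontrolled process. Closed form here: u*(x,t) = -x / (1 + T - t).
x0, t0, T, dt = 1.0, 0.0, 1.0, 0.02
N = 100_000
nsteps = int(round((T - t0) / dt))

num = den = 0.0
for _ in range(N):
    x = x0
    dW0 = None
    for k in range(nsteps):
        dW = random.gauss(0.0, math.sqrt(dt))
        if k == 0:
            dW0 = dW              # first noise increment dW_t
        x += dW                   # uncontrolled dynamics
    w = math.exp(-0.5 * x * x)    # e^{-S} = e^{-phi(X_T)}
    num += w * dW0
    den += w

u_est = (num / den) / dt
u_exact = -x0 / (1 + (T - t0))    # = -0.5
assert abs(u_est - u_exact) < 0.15
```

Note the Monte Carlo error of the instantaneous increment scales like 1/√dt, so many weighted samples are needed; this is exactly the inefficiency that the importance-sampling slides below address.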

SLIDE 8

Recap of the main idea


Consider a stochastic dynamical system

dXt = f(Xt, u) dt + dWt    E(dWt,i dWt,j) = νij dt

Given X0, find the control function u(x, t) that minimizes the expected future cost

C = E [φ(XT) + ∫_0^T dt R(Xt, u(Xt, t))]

SLIDE 9

Control theory


Standard approach: define J(x, t) as the optimal cost-to-go from x, t:

J(x, t) = min_{ut:T} Eu [φ(XT) + ∫_t^T dt R(Xt, u(Xt, t)) | Xt = x]

J satisfies a partial differential equation

−∂tJ(t, x) = min_u [R(x, u) + f(x, u)∇xJ(x, t) + ½ ν ∇²xJ(x, t)]

J(x, T) = φ(x)

with u = u(x, t). This is the HJB equation. The optimal control u∗(x, t) defines a distribution over trajectories p∗(τ) (= p(τ|x0, 0)).

SLIDE 10

Path integral control theory


dXt = f(Xt) dt + g(Xt)(u(Xt, t) dt + dWt)    X0 = x0

(here f(Xt) dt + g(Xt) u(Xt, t) dt plays the role of the controlled drift f(Xt, u) dt above)

The goal is to find the function u(x, t) that minimizes

C = E [φ(XT) + ∫_0^T dt (V(Xt, t) + ½ u(Xt, t)²)]
  = E [S(τ) + ∫_0^T dt ½ u(Xt, t)²]

where V(Xt, t) + ½ u(Xt, t)² plays the role of R(Xt, u(Xt, t)) and

S(τ) = φ(XT) + ∫_0^T dt V(Xt, t)

SLIDE 11

Path integral control theory

0.5 1 1.5 2 −2 −1 1 2 0.5 1 1.5 2 −2 −1 1 2

Equivalent formulation: Find the distribution over trajectories p that minimizes 13

C(p) = ∫ dτ p(τ) [S(τ) + log (p(τ)/q(τ))]

where q(τ|x0, 0) is the distribution over uncontrolled trajectories.

The optimal solution is given by p∗(τ) = (1/ψ) q(τ) e−S(τ)

13 Eu ∫_0^T dt ½ u(Xt, t)² = ∫ dτ p(τ) log (p(τ)/q(τ)).

SLIDE 12

Path integral control theory

Equivalent formulation: Find the distribution over trajectories p that minimizes

C(p) = ∫ dτ p(τ) [S(τ) + log (p(τ)/q(τ))]

where q(τ|x0, 0) is the distribution over uncontrolled trajectories.

The optimal solution is given by p∗(τ) = (1/ψ) q(τ) e−S(τ) = p(τ|u∗).

Equivalence of optimal control and discounted cost (Girsanov).

SLIDE 13

Path integral control theory


The optimal control cost is C(p∗) = −log ψ = J(x0, 0) with

ψ = ∫ dτ q(τ) e−S(τ) = Eq e−S

J(x, t) can be computed by forward sampling from q.

SLIDE 14

Delayed choice

Time-to-go T = 2 − t.

(figures: sample trajectories, and J(x, t) versus x for T = 2, T = 1, T = 0.5)

J(x, t) = −ν log Eq exp(−φ(X2)/ν)

The decision is made at time-to-go T = 1/ν.

SLIDE 15

Delayed choice

Time-to-go T = 2 − t.

(figures: sample trajectories, and J(x, t) versus x for T = 2, T = 1, T = 0.5)

J(x, t) = −ν log Eq exp(−φ(X2)/ν)

“When the future is uncertain, delay your decisions.”

SLIDE 18

Delayed choice (details)

dXt = u dt + dWt    E dWt² = ν dt

V = 0, the path cost is ½u², and the end cost φ(z = ±1) = 0, φ(z) = ∞ otherwise, encodes two targets at z = ±1 at t = T. Since R = 1, we have λ = ν.

PI recipe:

1. ψ(x, t) = ∫ dQ(τ|x, t) exp(−S(τ)/λ)    S(τ) = φ(x(T))

ψ(x, t) = ∫ dz q(z, T|x, t) exp(−φ(z)/λ) = q(1, T|x, t) + q(−1, T|x, t)

q(z, T|x, t) = N(z|x, ν(T − t))

2. Compute

J(x, t) = −λ log ψ(x, t) = (1/(T − t)) [½ x² − ν(T − t) log 2 cosh (x/(ν(T − t)))]

SLIDE 19

3. The optimal control is

u(x, t) = −∇J(x, t) = (1/(T − t)) [tanh (x/(ν(T − t))) − x]

(figures: controlled trajectories, stochastic (left) and deterministic (right))
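The two closed-form expressions can be cross-checked numerically (a quick sanity check, not part of the slides): differentiate J by central differences and compare with u:

```python
import math

# Check u(x,t) = -dJ/dx for the delayed-choice solution
#   J(x,t) = x^2/(2(T-t)) - nu*log(2*cosh(x/(nu*(T-t))))
# by central differences. Parameter values are arbitrary.
nu, T, t = 0.5, 2.0, 0.7
s = T - t  # time-to-go

def J(x):
    return 0.5 * x * x / s - nu * math.log(2.0 * math.cosh(x / (nu * s)))

def u(x):
    return (math.tanh(x / (nu * s)) - x) / s

h = 1e-6
for x in [-1.5, -0.3, 0.4, 2.0]:
    dJ = (J(x + h) - J(x - h)) / (2 * h)  # numerical dJ/dx
    assert abs(u(x) + dJ) < 1e-6          # u = -dJ/dx
```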

SLIDE 20

Coordination of UAVs

(AAMAS 2015.mp4)

≈ 10,000 trajectories per iteration, 3 iterations per second.

Video at: http://www.snn.ru.nl/~bertk/control_theory/PI_quadrotors.mp4 (Gomez et al. 2015)

SLIDE 21

Coordination of UAVs

Chao Xu ACC 2017

SLIDE 22

Importance sampling and control


ψ(x, t) = Eq e−S    S(τ|x, t) = φ(xT) + ∫_t^T ds V(xs, s)

Sampling is ’correct’ but inefficient.

SLIDE 23

“To compute or not to compute, that is the question”

There are two extreme approaches to compute actions:

  • precompute the appropriate action u(x) for any possible situation x. Complex to learn and to store. Fast to execute.
  • compute the appropriate action u(x) for the current situation x. Low learning and storage cost. Slow execution.

Intuitively, one can imagine that the most efficient approach is to combine both ideas (like ’just-in-time’ manufacturing):

  • precompute ’basic motor skills’, the ’halffabrikaat’ (Dutch: semi-finished product)
  • compute the appropriate action u(x) from the basic motor skills

SLIDE 24

Importance sampling


Consider a simple 1-d sampling problem. Given q(x), compute

a = Prob(x < 0) = ∫_{−∞}^{∞} I(x) q(x) dx

with I(x) = 1 if x < 0 and I(x) = 0 if x > 0. Naive method: generate N samples Xi ∼ q

â = (1/N) ∑_{i=1}^{N} I(Xi)    E â = a    Var(â) = (1/N) Var(I)

SLIDE 25

Importance sampling


Consider another distribution p(x). Then

a = Prob(x < 0) = ∫_{−∞}^{∞} I(x) (q(x)/p(x)) p(x) dx

Importance sampling: generate N samples Xi ∼ p

â = (1/N) ∑_{i=1}^{N} I(Xi) q(Xi)/p(Xi)    E â = a    Var(â) = (1/N) Var(I q/p)

Unbiased (= correct) for any p.
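A small sketch of both estimators (my own toy numbers): a rare event a = Prob(x < −3) under q = N(0, 1), with proposal p = N(−3, 1). The naive estimator barely ever sees the event; the importance sampler is accurate with the same N:

```python
import math, random

random.seed(2)

# Rare-event probability a = Prob(x < -3) under q = N(0,1).
a_true = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # about 1.35e-3

N = 100_000

# Naive: X_i ~ q, count hits (expect only ~135 hits in 100k samples).
naive = sum(1 for _ in range(N) if random.gauss(0, 1) < -3.0) / N

# Importance sampling: X_i ~ p = N(-3,1), weight q(X)/p(X) = exp(3X + 4.5).
acc = 0.0
for _ in range(N):
    x = random.gauss(-3.0, 1.0)
    if x < -3.0:
        acc += math.exp(3.0 * x + 4.5)
imp = acc / N

assert abs(imp - a_true) / a_true < 0.05  # IS: small relative error
```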

SLIDE 26

Optimal importance sampling


The distribution

p∗(x) = q(x) I(x)/a

is the optimal importance sampler. One sample X ∼ p∗ is sufficient to estimate a:

â = I(X) q(X)/p∗(X) = a    E â = a    Var(â) = 0

SLIDE 27

Estimating ψ = Ee−S

(figure: trajectories sampled from q; ESS = 1.8, C = 31.7)

Sample N trajectories from the uncontrolled dynamics:

τi ∼ q(τ)    wi = e−S(τi)    ψ̂ = (1/N) ∑_i wi

ψ̂ is an unbiased estimate of ψ.

Sampling efficiency is inversely proportional to the variance of the (normalized) weights wi:

ESS = N / (1 + N² Var(w))
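With normalized weights (∑ wi = 1, mean 1/N), this ESS expression is algebraically identical to the familiar 1/∑ wi². A quick check with arbitrary example weights:

```python
# Two equivalent expressions for the effective sample size,
# using normalized weights wn_i (sum to 1, mean 1/N):
#   ESS = 1 / sum(wn_i^2)   and   ESS = N / (1 + N^2 Var(wn)).
w = [1.0, 2.0, 3.0, 4.0]          # raw importance weights (arbitrary example)
N = len(w)
tot = sum(w)
wn = [wi / tot for wi in w]       # normalized weights

ess1 = 1.0 / sum(wi ** 2 for wi in wn)
var = sum((wi - 1.0 / N) ** 2 for wi in wn) / N   # population variance
ess2 = N / (1.0 + N * N * var)

assert abs(ess1 - ess2) < 1e-12
assert abs(ess1 - 10.0 / 3.0) < 1e-9
```

Equal weights give ESS = N; a single dominant weight gives ESS ≈ 1.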

SLIDE 28

Importance sampling

(figures: trajectories sampled with increasingly good controls; ESS = 1.8, C = 31.7; ESS = 3.5, C = 5.0; ESS = 9.5, C = 2.0)

Sampling N trajectories from the controlled dynamics and reweighting yields an unbiased estimate of the cost-to-go:

τi ∼ p(τ)    wi = e−S(τi) q(τi)/p(τi) = e−Su(τi)    ψ̂ = (1/N) ∑_i wi

Su(τ) = S(τ) + ∫_0^T dt ½ u(Xt, t)² + ∫_0^T u(Xt, t) dWt
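A minimal simulation of this reweighting (my own toy setup: f = 0, g = 1, ν = λ = 1, V = 0, φ(x) = x²): estimate ψ once from q (u = 0) and once from a controlled process with a constant control, reweighting with e−Su. Both agree with the closed form ψ(x) = exp(−x²/(1 + 2(T − t)))/√(1 + 2(T − t)):

```python
import math, random

random.seed(3)

# psi-hat from controlled dynamics dX = u dt + dW, reweighted with
#   S_u(tau) = phi(X_T) + int 1/2 u^2 dt + int u dW
# (f = 0, g = 1, nu = lambda = 1, V = 0, phi(x) = x^2).
x0, T, dt = 0.5, 1.0, 0.02
nsteps = int(round(T / dt))
N = 50_000

def psi_hat(u_const):
    acc = 0.0
    for _ in range(N):
        x, int_u2, int_udW = x0, 0.0, 0.0
        for _ in range(nsteps):
            dW = random.gauss(0.0, math.sqrt(dt))
            int_u2 += 0.5 * u_const ** 2 * dt
            int_udW += u_const * dW
            x += u_const * dt + dW        # controlled dynamics
        S_u = x ** 2 + int_u2 + int_udW   # S(tau) = phi(X_T) = X_T^2
        acc += math.exp(-S_u)
    return acc / N

exact = math.exp(-x0 ** 2 / (1 + 2 * T)) / math.sqrt(1 + 2 * T)
for u in [0.0, -0.7]:
    assert abs(psi_hat(u) - exact) < 0.02
```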

SLIDE 29

Importance sampling

(figures: trajectories sampled with increasingly good controls; ESS = 1.8, C = 31.7; ESS = 3.5, C = 5.0; ESS = 9.5, C = 2.0)

Su(τ) = S(τ) + ∫_0^T dt ½ u(Xt, t)² + ∫_0^T u(Xt, t) dWt

Thm:

  • A better u (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
  • The optimal u = u∗ (in the sense of optimal control) requires only one sample, and Su∗(τ) is deterministic!

Thijssen, Kappen 2015

SLIDE 30

Proof

Control cost is

C(p) = Ep [S(τ) + log (p(τ)/q(τ))] = E Su

Using Jensen’s inequality:

C∗ = −log ∫ dτ q(τ) e−S(τ) = −log ∫ dτ p(τ) e−[S(τ) + log(p(τ)/q(τ))] ≤ ∫ dτ p(τ) [S(τ) + log (p(τ)/q(τ))] = C(p)

SLIDE 31

Proof

Control cost is

C(p) = Ep [S(τ) + log (p(τ)/q(τ))] = E Su

Using Jensen’s inequality:

C∗ = −log ∫ dτ q(τ) e−S(τ) = −log ∫ dτ p(τ) e−[S(τ) + log(p(τ)/q(τ))] ≤ ∫ dτ p(τ) [S(τ) + log (p(τ)/q(τ))] = C(p)

The inequality is saturated when S(τ) + log (p(τ)/q(τ)) has zero variance: the left- and right-hand sides then both evaluate to this constant value.

This is realized when p = p∗. 14

14 p∗ exists when ∫ dτ q(τ) e−S(τ) < ∞
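The bound and its saturation can be checked on a toy discrete trajectory space (my own numbers, standing in for the path integral):

```python
import math, random

random.seed(5)

# Discrete check of C* = -log sum_tau q(tau) e^{-S(tau)} <= C(p), with
#   C(p) = sum_tau p(tau) [S(tau) + log p(tau)/q(tau)],
# and equality at p = p*.
q = [0.5, 0.3, 0.2]               # arbitrary base distribution
S = [1.0, 0.2, 2.5]               # arbitrary trajectory costs

psi = sum(qi * math.exp(-Si) for qi, Si in zip(q, S))
C_star = -math.log(psi)

def C(p):
    return sum(pi * (Si + math.log(pi / qi)) for pi, qi, Si in zip(p, q, S))

p_star = [qi * math.exp(-Si) / psi for qi, Si in zip(q, S)]
assert abs(C(p_star) - C_star) < 1e-12   # equality at p*

for _ in range(100):                     # every p satisfies the bound
    r = [random.random() + 1e-12 for _ in q]
    p = [ri / sum(r) for ri in r]
    assert C(p) >= C_star - 1e-12
```

At p = p∗ every term S(τ) + log (p∗(τ)/q(τ)) equals −log ψ, which is the zero-variance statement above.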

SLIDE 32

Example

Geometric Brownian motion on the interval t = 0 to T:

dXt = Xt (u(Xt, t) dt + dWt)    C = E ½ log(XT)²    u(x, t) = a(t) + b(t) x + c(t) x²

(figures: control u(t, x) for the fitted parametrisations, and particles x(t) at t = 1/2)

          u = 0    constant   linear   quadratic   optimal
C         7.526    5.139      1.507    1.461       1.420
ESS (%)   34.3     42.08      87.5     95.2        99.3

SLIDE 33

The Path Integral Cross Entropy (PICE) method

We wish to estimate

ψ = ∫ dτ q(τ) e−S(τ)

The optimal (zero variance) importance sampler is p∗(τ) = (1/ψ) q(τ) e−S(τ).

We approximate p∗(τ) with pu(τ), where u(x, t|θ) is a parametrized control function. Following the Cross Entropy method, we minimise KL(p∗|pu):

Δθ ∝ −∂KL(p∗|pu)/∂θ ∝ −Eu [e−Su ∫_0^T dWt ∂u(Xt, t|θ)/∂θ]

u(x, t|θ) is arbitrary. Estimate the gradient by sampling.

Kappen, Ruiz 2016
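A minimal PICE-style sketch (my own toy problem, not from the slides): dX = u dt + dW, ν = λ = 1, V = 0, φ(x) = ½x², x0 = 1, with a one-parameter control u(x, t|θ) = θ. The self-consistent solution is θ = −1/2. The update uses normalized weights and the sign convention under which the iteration converges for this parametrisation:

```python
import math, random

random.seed(4)

# PICE-style adaptive importance sampling on a toy problem:
#   dX = u dt + dW, nu = lambda = 1, V = 0, phi(x) = x^2/2, x0 = 1,
#   one-parameter control u(x,t|theta) = theta  (so du/dtheta = 1).
# Gradient estimate: sum_i wbar_i * (sum_t dW_t * du/dtheta),
# with normalized weights wbar_i proportional to exp(-S_u(tau_i)).
x0, T, dt = 1.0, 1.0, 0.05
nsteps = int(round(T / dt))
N, lr, iters = 2_000, 0.3, 30

theta = 0.0
for _ in range(iters):
    ws, grads = [], []
    for _ in range(N):
        x, sum_dW, S_u = x0, 0.0, 0.0
        for _ in range(nsteps):
            dW = random.gauss(0.0, math.sqrt(dt))
            S_u += 0.5 * theta ** 2 * dt + theta * dW
            sum_dW += dW
            x += theta * dt + dW
        S_u += 0.5 * x ** 2                 # end cost phi(X_T)
        ws.append(math.exp(-S_u))
        grads.append(sum_dW)                # int dW * du/dtheta
    tot = sum(ws)
    theta += lr * sum(w * g for w, g in zip(ws, grads)) / tot

# Fixed point of the iteration is theta = -1/2 for this problem.
assert -0.7 < theta < -0.3
```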

SLIDE 34

Adaptive importance sampling

for k = 0, . . . do
    datak = generate_data(model, uk)    % importance sampler
    uk+1 = learn_control(datak, uk)     % gradient descent
end for

Parallel sampling. Parallel gradient computation.

SLIDE 35

Inverted pendulum

Simple second-order pendulum with noise, X = (α, α̇):

α̈ = −cos α + u    C = E ∫_0^T dt (V(Xt) + ½ u(Xt, t)²)

Naive grid: u(x) = ∑_k uk δ_{x,xk}.

ESS < 1 due to time discretization, finite sample size effects and u(x, t) = u(x).

Illustration of gradient descent learning Eq. ?? for a second-order inverted pendulum problem. Left: entropic sample size versus importance-sampling iteration. Middle: optimal cost-to-go versus importance-sampling iteration. Right: optimal control solution û(x1, x2) versus x1, x2 with 0 ≤ x1 ≤ 2π and −2 ≤ x2 ≤ 2.

SLIDE 36

Acrobot

2 DOF, second order, under actuated, continuous stochastic control problem. Task is swing-up from down position.

SLIDE 37

(video: acrobot.mp4) Neural network: 10 layers, 25 neurons per layer. Input is the sine and cosine of both angles as well as the angular velocities. No time as input. 100 iterations, with 10000 rollouts per iteration. Annealing such that the ESS stays larger than 10%. Took around 15 min with 100 CPUs.

SLIDE 38

Acrobot (details)

q1(0) = q2(0) = −π/2, q̇1(0) = q̇2(0) = 0; maximize the final height H = l1 sin q1(T) + l2 sin q2(T)

SLIDE 39

Acrobot (details)

d11(q) q̈1 + d12(q) q̈2 + h1(q, q̇) + φ1(q) = 0
d21(q) q̈1 + d22(q) q̈2 + h2(q, q̇) + φ2(q) = u

We can write these equations in standard form

dxi = fi(x) dt + gi(x) u dt

with x1 = q1, x2 = q2, x3 = q̇1, x4 = q̇2 and

f1(x) = x3    g1(x) = 0
f2(x) = x4    g2(x) = 0
f3(x) = (−d22(h1 + φ1) + d12(h2 + φ2))/D    g3(x) = −d12/D
f4(x) = (d12(h1 + φ1) − d11(h2 + φ2))/D    g4(x) = d11/D

where D = d11 d22 − d12 d21 is the determinant of the mass matrix.

SLIDE 40

Acrobot (details)

(figures: final height, sample size ss, cost-to-go J, and control increments versus iteration)

100 iterations. At each iteration 50 stochastic trajectories were generated. The new control was computed from a deterministic trajectory. Noise was lowered at each iteration. Top left: final height for each stochastic trajectory for each iteration (red) and for each deterministic solution (blue).

SLIDE 41

Integrated sensorimotor control

Initialize control u0
for t = 0, . . . do
    datat = act_in_the_world(ut)
    modelt = learn_model(ut, datat)
    ut+1 = compute_control(modelt)
end for

compute_control:
for k = 0, . . . do
    datak = generate_data(model, uk)    % Monte Carlo importance sampler
    uk+1 = learn_control(datak, uk)     % deep or recurrent learning
end for

SLIDE 42

Integrated sensorimotor control

Initialize control u0
for t = 0, . . . do
    datat = act_in_the_world(ut)
    modelt = learn_model(ut, datat)
    ut+1 = compute_control(modelt)
end for

compute_control:
for k = 0, . . . do
    datak = generate_data(model, uk)    % Monte Carlo importance sampler
    uk+1 = learn_control(datak, uk)     % deep or recurrent learning
end for

  • generate infinite data to learn infinitely complex

SLIDE 43

Integrated sensorimotor control

Initialize control u0
for t = 0, . . . do
    datat = act_in_the_world(ut)
    modelt = learn_model(ut, datat)
    ut+1 = compute_control(modelt)
end for

compute_control:
for k = 0, . . . do
    datak = generate_data(model, uk)    % Monte Carlo importance sampler
    uk+1 = learn_control(datak, uk)     % deep or recurrent learning
end for

  • generate infinite data to learn infinitely complex
  • datat and datak are the two realities of the brain

SLIDE 44

Towards sensorimotor integration

The brain is a Monte Carlo sampler

  • Perception: Bayesian posterior computation
  • Action: solving an optimal control problem through sampling

Both require the learning of a world model

SLIDE 45

Towards sensorimotor integration

The brain is a Monte Carlo sampler

  • Perception: Bayesian posterior computation
  • Action: solving an optimal control problem through sampling

Both require the learning of a world model Action computation is optimized by adaptive importance sampling,

  • this is a type of motor learning
  • but is complemented by sampling (’halffabrikaat’)

SLIDE 46

Towards sensorimotor integration

The brain is a Monte Carlo sampler

  • Perception: Bayesian posterior computation
  • Action: solving an optimal control problem through sampling

Both require the learning of a world model Action computation is optimized by adaptive importance sampling,

  • this is a type of motor learning
  • but is complemented by sampling (’halffabrikaat’)

Many open problems

  • Sensing, acting interdependence
  • action hierarchies in terms of action building blocks

SLIDE 47

Thank you!

  • S. Thijssen and H. J. Kappen. “Path Integral Control and State Dependent Feedback.” Phys. Rev. E 91, 032104 (2015).
  • H. J. Kappen and H. C. Ruiz. “Adaptive importance sampling for control and inference.” Journal of Statistical Physics 162.5 (2016): 1244-1266.
  • H.-C. Ruiz and H. J. Kappen. “Particle Smoothing for Hidden Diffusion Processes: Adaptive Path Integral Smoother.” IEEE Transactions on Signal Processing 65.12 (2017): 3191-3203.
  • D. Thalmeier, M. Uhlmann, H. J. Kappen, and R.-M. Memmesheimer. “Learning universal computations with spikes.” PLoS Computational Biology (2016).
  • D. Thalmeier, V. Gomez, and H. J. Kappen. “Action selection in growing state spaces: Control of Network Structure Growth.” Journal of Physics A (arXiv:1606.07777).

www.snn.ru.nl/~bertk
