
Adaptive importance sampling for control and inference*
Bert Kappen
SNN Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London
December 10, 2016
* Joint work with Hans Ruiz, Dominik Thalmeier

Optimal control theory
Hard problems:
- a learning and exploration problem
- a stochastic optimal control computation
- a representation problem: $u(x,t)$


PICE: Integrating Control, Inference and Learning
Path integral control theory:
- Express a control computation as an inference computation.
- Compute the optimal control using MC sampling.
Importance sampling:
- Accelerate the sampling with importance sampling (= a state-feedback controller).
- The optimal importance sampler is the optimal control.
Learning:
- Learn the controller from self-generated data.
- Use the cross-entropy method for a parametrized controller.


PICE: Integrating control, inference and learning
Massively parallel computation.
The Monte Carlo sampling serves two purposes:
• Planning: compute the control for the current state
• Learning: improve the sampler/controller for future control computations

Path integral control theory
Uncontrolled dynamics specifies a distribution $q(\tau|x,t)$ over trajectories $\tau$ starting from $x, t$.
The cost of a trajectory $\tau$ is
$$S(\tau|x,t) = \phi(x_T) + \int_t^T ds\, V(x_s, s).$$
Find the optimal distribution $p(\tau|x,t)$ that minimizes $E_p S$ and is 'close' to $q(\tau|x,t)$.

Controlled diffusions
$p(\tau|x,t)$ is parametrised by a function $u(x,t)$:
$$dX_t = f(X_t,t)\,dt + g(X_t,t)\left(u(X_t,t)\,dt + dW_t\right), \qquad E(dW_t^2) = dt$$
$$C(u|x,t) = E_u\left[S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(X_s,s)^2\right]$$
$q(\tau|x,t)$ corresponds to $u = 0$.
The goal is to find the function $u(x,t)$ that minimizes $C$.
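
As a rough illustration of the controlled diffusion and its cost $C(u|x,t)$, the sketch below simulates Euler-Maruyama rollouts and averages the path cost. The toy problem is an assumption made here for illustration and is not from the slides: $f = 0$, $g = 1$, $V = 0$, $\phi(x) = x^2/2$, start $x = 1$ at $t = 0$, horizon $T = 1$. The helper names `rollout_cost` and `expected_cost` are hypothetical.

```python
import math
import random


def rollout_cost(f, g, u, phi, V, x0, t0, T, n_steps, rng):
    """Simulate one Euler-Maruyama path of dX = f dt + g (u dt + dW) and
    return S + int 0.5 u^2 ds, where S = phi(X_T) + int V ds."""
    dt = (T - t0) / n_steps
    x, t, cost = x0, t0, 0.0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))
        ux = u(x, t)
        cost += (V(x, t) + 0.5 * ux * ux) * dt
        x += f(x, t) * dt + g(x, t) * (ux * dt + dW)
        t += dt
    return cost + phi(x)


def expected_cost(u, n_paths=10000, seed=0):
    """Monte Carlo estimate of C(u | x=1, t=0) for the assumed toy problem."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += rollout_cost(
            f=lambda x, t: 0.0, g=lambda x, t: 1.0, u=u,
            phi=lambda x: 0.5 * x * x, V=lambda x, t: 0.0,
            x0=1.0, t0=0.0, T=1.0, n_steps=20, rng=rng)
    return total / n_paths


cost_uncontrolled = expected_cost(lambda x, t: 0.0)     # u = 0, i.e. sampling q
cost_feedback = expected_cost(lambda x, t: -0.5 * x)    # a simple linear feedback
print(cost_uncontrolled, cost_feedback)
```

For this toy problem the uncontrolled cost is $E[X_T^2/2] = 1$ in closed form, and the linear feedback should come out noticeably cheaper, which gives a quick sanity check of the simulator.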

Solution
The optimal control problem is solved as a Feynman-Kac path integral.
The optimal cost-to-go:
$$J(x,t) = -\log \int d\tau\, q(\tau|x,t)\, e^{-S(\tau|x,t)} = -\log E_q\left[e^{-S}\right]$$
Optimal control:
$$u^*(x,t)\,dt = E_{p^*}(dW_t) = \frac{E_q\left[dW\, e^{-S}\right]}{E_q\left[e^{-S}\right]}$$
$\psi = E_q\left[e^{-S}\right]$ (so that $J = -\log \psi$) and $u^*$ can be computed by forward sampling from $q$.
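
The forward-sampling recipe can be checked on a case with a known answer. The sketch below assumes a toy problem chosen here for illustration, not taken from the slides: $dX = u\,dt + dW$, $V = 0$, $\phi(x) = c\,x^2/2$, for which $J(x,0) = \tfrac12 \log(1 + cT) + \tfrac{c x^2}{2(1 + cT)}$ and $u^*(x,0) = -\tfrac{c x}{1 + cT}$. The function names are hypothetical.

```python
import math
import random


def estimate_J_and_u(x0, c=1.0, T=1.0, n_steps=10, n_paths=50000, seed=0):
    """Estimate J(x0, 0) and u*(x0, 0) by sampling the uncontrolled
    dynamics dX = dW and weighting each path with e^{-S}."""
    rng = random.Random(seed)
    dt = T / n_steps
    sd = math.sqrt(dt)
    sum_w = 0.0       # accumulates e^{-S}
    sum_dw_w = 0.0    # accumulates (first noise increment) * e^{-S}
    for _ in range(n_paths):
        x = x0
        first_dW = 0.0
        for k in range(n_steps):
            dW = rng.gauss(0.0, sd)
            if k == 0:
                first_dW = dW
            x += dW
        w = math.exp(-0.5 * c * x * x)   # S = phi(X_T) because V = 0
        sum_w += w
        sum_dw_w += first_dW * w
    psi = sum_w / n_paths
    J = -math.log(psi)
    u_star = (sum_dw_w / n_paths) / (psi * dt)   # u* dt = E[dW e^{-S}] / E[e^{-S}]
    return J, u_star


def exact_J_and_u(x0, c=1.0, T=1.0):
    """Closed form for the assumed linear-quadratic toy problem."""
    d = 1.0 + c * T
    return 0.5 * math.log(d) + 0.5 * c * x0 * x0 / d, -c * x0 / d


J_mc, u_mc = estimate_J_and_u(1.0)
J_exact, u_exact = exact_J_and_u(1.0)
print(J_mc, J_exact, u_mc, u_exact)
```

The estimate of $u^*$ is much noisier than the estimate of $J$, because it divides a small weighted noise increment by $dt$.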

Sampling
Sample trajectories $\tau_i \sim q(\tau|x,t)$, $i = 1, \ldots, N$:
$$E_q\left[e^{-S}\right] \approx \frac{1}{N} \sum_{i=1}^N e^{-S(\tau_i|x,t)}$$
Sampling is unbiased but inefficient (large variance).

Importance sampling
Consider a simple 1-d sampling problem. Given $q(x)$, compute
$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, q(x)\, dx$$
with $I(x) = 0, 1$ if $x > 0$, $x < 0$, respectively.
Naive method: generate $N$ samples $X_i \sim q$:
$$\hat{a} = \frac{1}{N} \sum_{i=1}^N I(X_i)$$

Importance sampling
Consider another distribution $p(x)$. Then
$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, \frac{q(x)}{p(x)}\, p(x)\, dx$$
Importance sampling: generate $N$ samples $X_i \sim p$:
$$\hat{a} = \frac{1}{N} \sum_{i=1}^N I(X_i)\, \frac{q(X_i)}{p(X_i)}$$
Unbiased (= correct) for any distribution $p$ with $p(x) > 0$ wherever $I(x)\, q(x) > 0$!
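
A small numerical comparison of the naive and importance sampling estimators. The densities are assumptions made here for illustration, since the slides do not say which $q$ and $p$ are plotted: $q = N(2, 1)$, so that $x < 0$ is a rare event, and $p = N(0, 1)$.

```python
import math
import random


def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def naive_estimate(mu_q, n, rng):
    """Fraction of samples from q that land in x < 0."""
    return sum(1.0 for _ in range(n) if rng.gauss(mu_q, 1.0) < 0.0) / n


def importance_estimate(mu_q, mu_p, n, rng):
    """Sample from p and reweight each hit by q(x) / p(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, 1.0)
        if x < 0.0:
            total += normal_pdf(x, mu_q) / normal_pdf(x, mu_p)
    return total / n


rng = random.Random(0)
mu_q = 2.0
a_exact = 0.5 * (1.0 + math.erf(-mu_q / math.sqrt(2.0)))   # Phi(-mu_q)
a_naive = naive_estimate(mu_q, 20000, rng)
a_is = importance_estimate(mu_q, 0.0, 20000, rng)
print(a_exact, a_naive, a_is)
```

Both estimators target the same value. With this choice of $p$ about half of the samples fall in the region of interest, so the importance sampling estimate should have the smaller variance.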

Optimal importance sampling
The distribution
$$p^*(x) = \frac{q(x)\, I(x)}{a}$$
is the optimal importance sampler. One sample $X \sim p^*$ is sufficient to estimate $a$:
$$\hat{a} = \frac{I(X)\, q(X)}{p^*(X)} = a$$
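
The zero-variance property can be seen directly in code. The sketch assumes $q = N(1, 1)$ (an illustrative choice, not from the slides) and draws from $p^*$ by rejection. Note the circularity that makes $p^*$ unusable in practice: evaluating $p^*(X)$ already requires knowing $a$.

```python
import math
import random


def q_pdf(x, mu=1.0):
    """Density of the assumed q = N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def sample_p_star(rng, mu=1.0):
    """Draw from p*(x) = q(x) I(x) / a by rejection: keep q-samples with x < 0."""
    while True:
        x = rng.gauss(mu, 1.0)
        if x < 0.0:
            return x


mu = 1.0
a = 0.5 * (1.0 + math.erf(-mu / math.sqrt(2.0)))   # exact Prob(x < 0) under q
rng = random.Random(0)
estimates = []
for _ in range(5):
    x = sample_p_star(rng, mu)
    p_star = q_pdf(x, mu) / a             # I(x) = 1 on the support of p*
    estimates.append(q_pdf(x, mu) / p_star)   # single-sample estimate of a
print(a, estimates)
```

Every single-sample estimate equals $a$ up to floating-point rounding, whatever value of $X$ was drawn.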

Importance sampling and control
In the case of control we must compute
$$J(x,t) = -\log E_q\left[e^{-S}\right], \qquad u^*(x,t)\,dt = \frac{E_q\left[dW\, e^{-S}\right]}{E_q\left[e^{-S}\right]}$$
Instead of samples from the uncontrolled dynamics $q$ ($u = 0$), we sample with $p$ ($u \neq 0$):
$$E_q\left[e^{-S}\right] = E_p\left[e^{-S_u}\right]$$
$$e^{-S_u} = e^{-S}\, \frac{dq}{dp} = \exp\left(-S - \int_t^T \tfrac{1}{2} u(x_s,s)^2\, ds - \int_t^T u(x_s,s)\, dW_s\right)$$
We can choose any $p$, i.e. any sampling control $u$, to compute the expectation values.
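
A numerical check of the identity $E_q[e^{-S}] = E_p[e^{-S_u}]$, as a sketch under assumptions made here for illustration: $dX = u\,dt + dW$, $V = 0$, $\phi(x) = c\,x^2/2$, and a hand-picked linear sampling control $u = -\mathrm{gain} \cdot x$. For this toy problem $E_q[e^{-S}] = e^{-c x^2 / (2(1 + cT))} / \sqrt{1 + cT}$ in closed form.

```python
import math
import random


def is_estimate(gain, x0=1.0, c=1.0, T=1.0, n_steps=20, n_paths=10000, seed=0):
    """Estimate E_q[e^{-S}] by sampling dX = u dt + dW with u = -gain * x
    and averaging e^{-S_u}. gain = 0 recovers naive sampling from q."""
    rng = random.Random(seed)
    dt = T / n_steps
    sd = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        x, S_u = x0, 0.0
        for _step in range(n_steps):
            dW = rng.gauss(0.0, sd)
            u = -gain * x
            S_u += 0.5 * u * u * dt + u * dW   # change-of-measure terms
            x += u * dt + dW
        S_u += 0.5 * c * x * x                 # S itself (end cost only, V = 0)
        total += math.exp(-S_u)
    return total / n_paths


x0, c, T = 1.0, 1.0, 1.0
psi_exact = math.exp(-0.5 * c * x0 * x0 / (1.0 + c * T)) / math.sqrt(1.0 + c * T)
psi_naive = is_estimate(0.0)
psi_controlled = is_estimate(0.5)
print(psi_exact, psi_naive, psi_controlled)
```

Both runs should agree with the closed form within Monte Carlo error. The controlled sampler steers paths toward low end cost and corrects for that with the extra terms in $S_u$.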

Relation between optimal sampling and optimal control
Define
$$\alpha_i = \frac{e^{-S_u(\tau_i|x,t)}}{\sum_{j=1}^N e^{-S_u(\tau_j|x,t)}}, \qquad ESS = \frac{1}{\sum_{j=1}^N \alpha_j^2} \quad (1 \le ESS \le N)$$
Thm:
1. Better $u$ (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
2. Optimal $u = u^*$ (in the sense of optimal control) requires only one sample: $\alpha_i = 1/N$ and $S_u(\tau|x,t)$ is deterministic!
$$S_u(\tau|x,t) = S(\tau|x,t) + \int_t^T ds\, \tfrac{1}{2} u(x_s,s)^2 + \int_t^T u(x_s,s)\, dW_s$$
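
The effect of the control on the effective sample size can be illustrated on a toy problem where $u^*$ is known. The setup is an assumption made here, not taken from the slides: $dX = u\,dt + dW$, $V = 0$, $\phi(x) = c\,x^2/2$ with $c = 10$ and start $x = 2$, for which $u^*(x,t) = -c\,x / (1 + c\,(T - t))$.

```python
import math
import random


def ess_fraction(control, x0=2.0, c=10.0, T=1.0, n_steps=100, n_paths=2000, seed=0):
    """Return ESS / N for rollouts of dX = u dt + dW with end cost 0.5 * c * x^2."""
    rng = random.Random(seed)
    dt = T / n_steps
    sd = math.sqrt(dt)
    costs = []
    for _ in range(n_paths):
        x, S_u = x0, 0.0
        for k in range(n_steps):
            dW = rng.gauss(0.0, sd)
            u = control(x, k * dt)
            S_u += 0.5 * u * u * dt + u * dW
            x += u * dt + dW
        costs.append(S_u + 0.5 * c * x * x)
    s_min = min(costs)                      # shift exponents for numerical stability
    weights = [math.exp(-(s - s_min)) for s in costs]
    total = sum(weights)
    alphas = [w / total for w in weights]   # normalised importance weights
    return 1.0 / sum(a * a for a in alphas) / n_paths


def no_control(x, t):
    return 0.0


def optimal_control(x, t, c=10.0, T=1.0):
    # closed-form u* for the assumed toy problem
    return -c * x / (1.0 + c * (T - t))


ess_uncontrolled = ess_fraction(no_control)
ess_optimal = ess_fraction(optimal_control)
print(ess_uncontrolled, ess_optimal)
```

With $u = 0$ only a small fraction of the rollouts carries weight. With $u^*$ the fraction is close to 1, and the remaining gap comes from the time discretization, which makes $S_u$ only approximately deterministic.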

So far
• Optimal control can be computed by MC sampling
• Sampling can be accelerated by using 'good' controls
• The optimal control for sampling is also the optimal control solution
How to learn a good controller?

The Cross-entropy method
Let $p_u(x)$ be a family of probability density functions parametrized by $u$, and let $h(x)$ be a positive function.
Consider the expectation value
$$a = E_0 h = \int dx\, p_0(x)\, h(x)$$
for a particular value $u = 0$.
The optimal importance sampling distribution is $p^*(x) = h(x)\, p_0(x) / a$.
The cross entropy method minimises the KL divergence
$$KL(p^*|p_u) = \int dx\, p^*(x) \log \frac{p^*(x)}{p_u(x)} \propto -E_{p^*} \log p_u(X) \propto -E_0\, h(X) \log p_u(X) = -E_v\, h(X)\, \frac{p_0(X)}{p_v(X)} \log p_u(X)$$
Using each new estimate as the next sampler $p_v$ gives a sequence of samplers $p_0 \to p_1 \to p_2 \ldots$
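
A minimal sketch of the iteration $p_0 \to p_1 \to p_2 \ldots$ for a family where the minimisation can be done by hand. Everything specific here is an assumption for illustration: $p_u = N(u, 1)$ and $h(x) = e^{-(x - 4)^2 / 2}$. For this family, minimising $-E_v\, h\, \frac{p_0}{p_v} \log p_u$ over $u$ gives the weighted sample mean, and $p^* = N(2, 1/2)$, so the iteration should settle near $u = 2$.

```python
import math
import random


def h(x, centre=4.0):
    """Assumed positive function whose expectation under p_0 is wanted."""
    return math.exp(-0.5 * (x - centre) ** 2)


def log_ratio_p0_pv(x, v):
    """log p_0(x) - log p_v(x) for unit-variance Gaussians with means 0 and v."""
    return -0.5 * x * x + 0.5 * (x - v) ** 2


def cross_entropy_iterations(n_iter=6, n_samples=5000, seed=0):
    """Iterate p_0 -> p_1 -> ... within the family p_u = N(u, 1)."""
    rng = random.Random(seed)
    v = 0.0
    history = [v]
    for _ in range(n_iter):
        xs = [rng.gauss(v, 1.0) for _ in range(n_samples)]
        ws = [h(x) * math.exp(log_ratio_p0_pv(x, v)) for x in xs]
        # argmax_u of sum_i w_i log p_u(x_i) for N(u, 1) is the weighted mean
        v = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
        history.append(v)
    return history


history = cross_entropy_iterations()
print(history)
```

The first update, sampled from $p_0$, is the noisiest because few samples reach the region where $h$ is large. Later updates sample close to $p^*$ and are much more accurate.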

The CE method for PI control
Sample $p_u$ using
$$dX_t = f(X_t,t)\,dt + g(X_t,t)\left(u(X_t,t)\,dt + dW_t\right)$$
We wish to compute a close to optimal control $u$ such that $p_u$ is close to $p^*$.
Following the CE argument, we minimise
$$KL(p^*|p_u) = \frac{1}{\psi(t,x)}\, E_v\, e^{-S(t,x,v)} \int_t^T ds\, \frac{1}{2} \left(u(X_s,s) - v(X_s,s) - \frac{dW_s}{ds}\right)^2$$
$v$ is the importance sampling control. The expected value is independent of $v$, but the variance/accuracy depends on $v$.

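The CE method for PI control
We parametrize the control as $u(x,t|\theta)$. The gradient is given by:
$$\frac{\partial KL(p^*|p_u)}{\partial \theta} = \left\langle \int_t^T \left(u(X_s,s)\,ds - v(X_s,s)\,ds - dW_s\right) \frac{\partial u(X_s,s)}{\partial \theta} \right\rangle_v = -\left\langle \int_t^T dW_s\, \frac{\partial u(X_s,s)}{\partial \theta} \right\rangle_u$$
where $\langle F \rangle_v = \frac{1}{\psi(t,x)} E_v\left[e^{-S(t,x,v)} F\right]$ and the second equality takes $v = u$. The update is
$$\theta := \theta - \epsilon\, \frac{\partial KL(p^*|p_u)}{\partial \theta}$$
We refer to the method as PICE (Path Integral Cross Entropy).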

Model based motor learning
compute control:
  for k = 0, ... do
    data_k = generate_data(model, u_k)       % Monte Carlo importance sampler
    u_{k+1} = learn_control(data_k, u_k)     % Deep or recurrent learning
  end for
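
The loop can be sketched end to end with the PICE gradient $-\langle \int dW_s\, \partial u / \partial \theta \rangle_u$ on a one-parameter controller. This is an illustration under assumptions made here, not the implementation behind the talk: the model is $dX = u\,dt + dW$ with $V = 0$ and $\phi(x) = x^2/2$, the controller is $u(x|\theta) = \theta x$ (so $\partial u / \partial \theta = x$) in place of a neural network, and the step size and sample counts are arbitrary. Only the names `generate_data` and `learn_control` follow the slide.

```python
import math
import random


def generate_data(theta, n_paths, rng, x0=1.0, c=1.0, T=1.0, n_steps=20):
    """Roll out dX = u dt + dW with u = theta * x. Returns, per path, the cost
    S_u and the gradient statistic sum_k dW_k * du/dtheta (= dW_k * X_k)."""
    dt = T / n_steps
    sd = math.sqrt(dt)
    data = []
    for _ in range(n_paths):
        x, S_u, stat = x0, 0.0, 0.0
        for _step in range(n_steps):
            dW = rng.gauss(0.0, sd)
            u = theta * x
            S_u += 0.5 * u * u * dt + u * dW
            stat += dW * x
            x += u * dt + dW
        data.append((S_u + 0.5 * c * x * x, stat))
    return data


def learn_control(data, theta, step_size=0.5):
    """One gradient step theta := theta - eps * dKL/dtheta, where
    dKL/dtheta = -sum_i alpha_i * stat_i and alpha_i are normalised e^{-S_u}."""
    s_min = min(S for S, _ in data)          # shift for numerical stability
    weights = [math.exp(-(S - s_min)) for S, _ in data]
    total = sum(weights)
    grad = -sum(w * stat for w, (_, stat) in zip(weights, data)) / total
    return theta - step_size * grad


rng = random.Random(0)
theta = 0.0                                  # start from the uncontrolled sampler
for k in range(30):
    data_k = generate_data(theta, 1000, rng)
    theta = learn_control(data_k, theta)
print(theta)
```

For this toy problem the learned gain should become negative, pushing the state toward the origin. Each iteration reuses the current controller as the importance sampler, so the rollouts serve both planning and learning.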

Parallel implementation
Massively parallel sampling on CPUs.
Massively parallel gradient computation on CPU/GPU.
Goal: provide a generic solver for any PI control problem to arbitrary precision.

Acrobot
2 DOF, second order, underactuated, continuous stochastic control problem.
Task is to swing up from the down position and stabilize.

Acrobot
[Video: acrobot.mp4]
Neural network: 2 hidden layers, 50 neurons per layer. Input is position and velocity.
2000 iterations, with 30000 rollouts per iteration.
100 cores, 15 minutes.

More samples per iteration is better :)
[Figure: fraction ESS versus IS iteration]
- 100k samples (green, cyan)
- 300k samples (red, blue)
- 1000k samples (black, yellow)

Trust region
Initial gradient computation too hard. Introduce (KL) trust region.
[Figure: control cost vs. IS iteration]
- Blue line: small trust region (ESS ≈ 50%, 30k samples) (= video)
- Red line: intermediate trust region (ESS ≈ 1%, 100k samples)
- Green line: large trust region (ESS ≈ 0.1%, 300k samples)
Trade-off between speed and optimality.


Discussion
Continuous time SOC is very hard to compute.
- PI control: Control ↔ inference
- Better sampling (ESS) ↔ better control (control objective)
- IS: Learning the control solution also increases the efficiency of (future) control computations
Continuous time SOC is very hard to represent.
- CE for parameter estimation → deep neural network
