

1. Control, inference and learning. Bert Kappen: SNN, Donders Institute, Radboud University, Nijmegen; Gatsby Unit, UCL London. July 21, 2015.

2-3. Why control theory? A theory for intelligent behaviour:
- neuroscience
- robotics

4. Control theory. Given a current state and a future desired state, what is the best/cheapest/fastest way to get there?

5. Why stochastic control?

6. How to control? Hard problems:
- a learning and exploration problem
- a stochastic optimal control computation
- a representation problem for u(x, t)

7-12. The idea: Control, Inference and Learning
- Linear Bellman equation and path integral solution: express a control computation as an inference computation.
- Compute the optimal control using MC sampling.
- Importance sampling: accelerate with importance sampling, a state-feedback controller.
- Learn the controller from self-generated data.
- The optimal importance sampler is the optimal control.
- Learn a good importance sampler using PICE.

13. Outline
- Introduction to control theory
- Link between control theory, inference and statistical physics (Schrödinger; Fleming & Mitter '82; Kappen '05; Todorov '06)
- Importance sampling: the relation between optimal sampling and optimal control
- Cross entropy method for adaptive importance sampling (PICE): a criterion for parametrized control optimization; learning by gradient descent
- Some examples

14. Discrete time optimal control. Consider the control of a discrete time deterministic dynamical system:

x_{t+1} = x_t + f(x_t, u_t),   t = 0, 1, \ldots, T-1

x_t describes the state and u_t specifies the control or action at time t. Given x_0 and u_{0:T-1}, we can compute x_{1:T}. Define a cost for each sequence of controls:

C(x_0, u_{0:T-1}) = \sum_{t=0}^{T-1} V(x_t, u_t)

Find the sequence u_{0:T-1} that minimizes C(x_0, u_{0:T-1}).
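As a sanity check, here is a minimal sketch of this setup; the dynamics f, the running cost V, and the horizon are illustrative choices, not taken from the slides.

```python
import numpy as np

def f(x, u):          # deterministic dynamics increment (assumed 1-D example)
    return u

def V(x, u):          # running cost: stay near 0, penalize control effort
    return x**2 + 0.1 * u**2

def rollout_cost(x0, u_seq):
    """Compute x_{1:T} and C(x0, u_{0:T-1}) = sum_t V(x_t, u_t)."""
    x, cost = x0, 0.0
    for u in u_seq:
        cost += V(x, u)
        x = x + f(x, u)   # x_{t+1} = x_t + f(x_t, u_t)
    return cost

print(rollout_cost(1.0, np.zeros(10)))  # cost of applying no control
```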

15. Dynamic programming. Find the minimal cost path from A to J. For example,

C(F) = \min(6 + C(H), 3 + C(I)) = 7

The minimal cost at time t is easily expressible in terms of the minimal cost at time t+1.

16. Discrete time optimal control. Dynamic programming uses the concept of the optimal cost-to-go J(t, x). One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(t, x_t) = \min_{u_{t:T-1}} \sum_{s=t}^{T-1} V(x_s, u_s) = \min_{u_t} \left( V(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right)

J(T, x) = 0,   J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})

This is called the Bellman equation. It computes u_t(x) for all intermediate t, x.
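A minimal tabular sketch of this backward recursion, assuming a discretized 1-D state and control grid; the grids, dynamics, and costs below are illustrative, not from the slides.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 81)          # state grid (assumed)
us = np.linspace(-0.5, 0.5, 11)          # control grid (assumed)
T = 20                                    # horizon

def step(x, u):                           # x_{t+1} = x_t + f(x_t, u_t)
    return x + u

def V(x, u):
    return x**2 + 0.1 * u**2

J = np.zeros(len(xs))                     # boundary condition J(T, x) = 0
policy = np.zeros((T, len(xs)))           # stores u_t(x) for all t, x
for t in reversed(range(T)):
    Jnew = np.empty(len(xs))
    for i, x in enumerate(xs):
        # Bellman: J(t,x) = min_u [ V(x,u) + J(t+1, x + f(x,u)) ]
        xnext = np.clip(step(x, us), xs[0], xs[-1])
        q = V(x, us) + np.interp(xnext, xs, J)
        k = np.argmin(q)
        Jnew[i], policy[t, i] = q[k], us[k]
    J = Jnew

print(J[len(xs) // 2])                    # optimal cost-to-go from x = 0
```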

17. Stochastic optimal control. Consider a stochastic dynamical system

dX_i = f_i(X_t, u) dt + dW_i,   E(dW_i dW_j) = \nu_{ij} dt

Given x(0), find the control function u(x, t) that minimizes the expected future cost

C = E\left( \phi(X_T) + \int_0^T dt\, V(X_t, u(X_t, t)) \right)

where the expectation is over all trajectories given the control path. Dynamic programming gives

J(t, x) = \min_u \left( V(x, u) dt + E\, J(t + dt, x + dx) \right)

-\partial_t J(t, x) = \min_u \left( V(x, u) + f(x, u) \nabla_x J(x, t) + \tfrac{1}{2} \nu \nabla_x^2 J(x, t) \right)

with u = u(x, t) and boundary condition J(x, T) = \phi(x). This is the HJB equation.

18. Computing the optimal control solution is hard:
- solve a Bellman equation, a PDE
- scales badly with dimension
Efficient solutions exist for:
- linear dynamical systems with quadratic costs (Gaussians)
- deterministic systems (no noise)

19. Path integral control theory

dX_t = f(X_t, t) dt + g(X_t, t)(u dt + dW_t)

C = E\left( \phi(X_T) + \int_t^T ds\, \left[ V(X_s, s) + \tfrac{1}{2} u(X_s, s)^T R\, u(X_s, s) \right] \right)

with E(dW_a dW_b) = \nu_{ab} dt and R = \lambda \nu^{-1}, \lambda > 0; f \in R^n, g \in R^{n \times m}, u \in R^m. The HJB equation becomes

-\partial_t J = \min_u \left( \tfrac{1}{2} u^T R u + V + (f + gu)^T \nabla J + \tfrac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 J \right) \right)

with boundary condition J(x, T) = \phi(x).

20. Path integral control theory. Minimization wrt u yields:

u(x, t) = -R^{-1} g^T(x, t) \nabla J(x, t)

-\partial_t J = -\tfrac{1}{2} (\nabla J)^T g R^{-1} g^T (\nabla J) + V + f^T \nabla J + \tfrac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 J \right)

Define \psi(x, t) through J(x, t) = -\lambda \log \psi(x, t). Since R = \lambda \nu^{-1}, the quadratic term cancels against the second-order term and we obtain a linear HJB equation:

\partial_t \psi = \left( \frac{V}{\lambda} - f^T \nabla - \tfrac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 \right) \right) \psi

21. Feynman-Kac formula. Denote by q(\tau | x, t) the distribution over uncontrolled trajectories that start at x, t:

dX_t = f(X_t, t) dt + g(X_t, t) dW_t

with \tau a trajectory x(t \to T). Then

\psi(x, t) = \int dq(\tau | x, t) \exp\left( -\frac{S(\tau)}{\lambda} \right) = E_q\left( e^{-S/\lambda} \right)

S(\tau) = \phi(x(T)) + \int_t^T ds\, V(x(s), s)

22. Posterior distribution over optimal trajectories. \psi(x, t) is the partition sum for the distribution over paths under optimal control:

p^*(\tau | x, t) = \frac{1}{\psi(x, t)} q(\tau | x, t) \exp\left( -\frac{S(\tau)}{\lambda} \right)

The optimal cost-to-go is a free energy:

J(x, t) = -\lambda \log E_q\left( e^{-S/\lambda} \right)

The optimal control is an expectation wrt p^*:

u^*(x, t)\, dt = E_{p^*}(dW_t) = \frac{E_q\left( dW\, e^{-S/\lambda} \right)}{E_q\left( e^{-S/\lambda} \right)}

J and u^* can be computed by forward sampling from q.
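A naive Monte Carlo sketch of these two estimators, assuming f = 0, g = 1, V = 0, and a made-up two-target end cost phi; none of these choices are from the slides.

```python
import numpy as np

# Estimate J(x,t) = -lam * log E_q[exp(-S/lam)] and
# u*(x,t) dt = E_q[dW exp(-S/lam)] / E_q[exp(-S/lam)] by forward sampling.
rng = np.random.default_rng(0)
lam, nu, T, dt, N = 1.0, 1.0, 2.0, 0.01, 10000
steps = int(T / dt)

def phi(x):                       # hypothetical end cost: targets at +/-1
    return np.minimum((x - 1.0)**2, (x + 1.0)**2) / 0.1

x = np.zeros(N)                   # N uncontrolled trajectories from x = 0
dW0 = None
for s in range(steps):
    dW = rng.normal(0.0, np.sqrt(nu * dt), N)
    if s == 0:
        dW0 = dW                  # keep the first noise increment
    x = x + dW                    # uncontrolled dynamics: dX = dW

w = np.exp(-phi(x) / lam)         # S(tau) = phi(X_T) since V = 0
J = -lam * np.log(w.mean())
u_star = (dW0 * w).mean() / (w.mean() * dt)
print(f"J(0,0) ~ {J:.3f},  u*(0,0) ~ {u_star:.3f}")
```

At the symmetric starting point x = 0 the weighted noise increments cancel, so the estimated u* is close to zero, as it should be.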

23. Delayed choice

dX_t = u(X_t, t) dt + dW_t,   E(dW_t^2) = \nu dt

C(p) = E_p\left( \phi(x_T) + \int_0^2 dt\, \tfrac{1}{2} u(t)^2 \right)

The cost encodes targets at t = 2.

[Figure: sample trajectories x(t), x \in [-3, 3], t \in [0, 2]]

24-25. Delayed choice. Time-to-go T = 2 - t.

[Figure: optimally controlled trajectories x(t) for t \in [0, 2], and J(x, t) versus x for time-to-go T = 2, 1, 0.5]

J(x, t) = -\nu \log E_q \exp(-\phi(X_2)/\nu)

The decision is made at time-to-go T = 1/\nu: "When the future is uncertain, delay your decisions."
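Since X_2 given (x, t) is Gaussian under the uncontrolled dynamics, J(x, t) can be evaluated by one-dimensional quadrature. The sketch below uses an assumed two-target end cost, as the exact phi of the slides is not shown; for small time-to-go the minimum of J bifurcates into two branches.

```python
import numpy as np

nu = 1.0
y = np.linspace(-5.0, 5.0, 2001)          # endpoint grid for quadrature
dy = y[1] - y[0]

def phi(y):
    # hypothetical two-target end cost at x = +/-1
    return np.minimum((y - 1.0)**2, (y + 1.0)**2) / 0.1

def J(x, T):
    # J(x,t) = -nu * log E[exp(-phi(X_2)/nu)], X_2 ~ N(x, nu*T), T = 2 - t
    p = np.exp(-(y - x)**2 / (2 * nu * T)) / np.sqrt(2 * np.pi * nu * T)
    return -nu * np.log(np.sum(p * np.exp(-phi(y) / nu)) * dy)

for T in (2.0, 1.0, 0.5):                  # J flattens near x = 0 for large T
    print(T, [round(J(x, T), 3) for x in (-0.5, 0.0, 0.5)])
```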

26. KL control. The uncontrolled dynamics specify a distribution q(\tau | x, t) over trajectories \tau from t \to T. The cost of a trajectory \tau is S(\tau) = \phi(x_T) + \int_t^T ds\, V(x_s, s). Find the optimal distribution p(\tau | x, t) that minimizes E_p S and is 'close' to q(\tau | x, t).

27. KL control. Find p^* that minimizes

C(p) = KL(p \| q) + E_p S,   KL(p \| q) = \int d\tau\, p(\tau | x, t) \log \frac{p(\tau | x, t)}{q(\tau | x, t)}

The optimal solution is given by

p^*(\tau | x, t) = \frac{1}{\psi(x, t)} q(\tau | x, t) \exp(-S(\tau | x, t)),   \psi(x, t) = \int d\tau\, q(\tau | x, t) \exp(-S(\tau | x, t))

The optimal cost is C(p^*) = -\log \psi(x, t).
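On a finite sample from q, the optimal reweighting takes a few lines; the path costs S below are stand-ins, not from the slides.

```python
import numpy as np

# Given samples tau_i ~ q with path costs S_i, the KL-control optimum
# reweights q by exp(-S), and C(p*) = -log psi with psi = E_q[exp(-S)].
rng = np.random.default_rng(1)
S = rng.normal(2.0, 1.0, size=100000)     # stand-in path costs S(tau_i)

w = np.exp(-S)
psi = w.mean()                            # psi(x,t) = E_q[exp(-S)]
p_star = w / w.sum()                      # weights of p* on the samples
print("C(p*) = -log psi =", -np.log(psi))
print("E_{p*}[S] =", np.sum(p_star * S))  # expected cost under p*
```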

28. Controlled diffusions are a special case. For controlled diffusions, p is parametrised by functions u(x, t):

dX_t = f(X_t, t) dt + g(X_t, t)\left( u(X_t, t) dt + dW_t \right),   E(dW_i dW_j) = \nu_{ij} dt

C(p) = E_p\left( \phi(X_T) + \int_t^T ds\, \left[ \tfrac{1}{2} u(X_s, s)^T \nu^{-1} u(X_s, s) + V(X_s, s) \right] \right)

\psi(x, t) is the solution of the linear Bellman equation and J(x, t) = -\log \psi(x, t) is the optimal cost-to-go.

29. Sampling efficiency

[Figure: sample trajectories x(t), x \in [-10, 10], t \in [0, 2]]

Sampling with the uncontrolled dynamics is theoretically correct, but inefficient in practice.
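One way to quantify this inefficiency is the effective sample size of the importance weights w = exp(-S/\lambda), a standard diagnostic that is not from the slides: when the path costs have high variance, a few trajectories carry almost all the weight.

```python
import numpy as np

def ess(w):
    # effective sample size: (sum w)^2 / sum w^2
    return w.sum()**2 / (w**2).sum()

rng = np.random.default_rng(2)
S = rng.normal(0.0, 5.0, size=10000)      # high-variance stand-in path costs
w = np.exp(-S)
print(f"ESS = {ess(w):.1f} out of {len(w)} samples")
```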
