  1. Structured Policy Iteration for Linear Quadratic Regulator
     Youngsuk Park [1], with R. Rossi [2], Z. Wen [3], G. Wu [2], and H. Zhao [2]
     [1] Stanford University, [2] Adobe Research, [3] DeepMind
     July 14, 2020

  2. Introduction
◮ reinforcement learning (RL) is about learning from interaction with delayed feedback
  – decide which action to take, which affects the next state of the environment
  – requires sequential decision making
◮ most discrete RL algorithms scale poorly for tasks in continuous spaces
  – discretize the state and/or action space
  – curse of dimensionality
  – sample inefficiency

  3. Linear Quadratic Regulator
◮ the Linear Quadratic Regulator (LQR) has rich applications in continuous-space tasks
  – e.g., motion planning, trajectory optimization, portfolio optimization
◮ infinite-horizon (undiscounted) LQR problem

      minimize_π   E[ Σ_{t=0}^∞ ( x_t^T Q x_t + u_t^T R u_t ) ]                      (1)
      subject to   x_{t+1} = A x_t + B u_t,   u_t = π(x_t),   x_0 ∼ D,

  where A ∈ R^{n×n}, B ∈ R^{n×m}, Q ⪰ 0, and R ≻ 0.
  – quadratic cost (Q, R) and linear dynamics (A, B)
  – Q and R set the relative weights of state deviation and input usage
Preliminary
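
To make the objective in (1) concrete, here is a minimal NumPy sketch (not from the slides) that estimates the cost of a fixed linear policy u_t = K x_t by rolling out the dynamics and truncating the infinite sum at a finite horizon; the horizon length and any example matrices are illustrative assumptions.

```python
import numpy as np

def lqr_cost(A, B, Q, R, K, x0, horizon=200):
    """Approximate the infinite-horizon LQR cost of u_t = K x_t
    by truncating the sum in (1) at a finite horizon."""
    x, cost = np.asarray(x0, dtype=float), 0.0
    for _ in range(horizon):
        u = K @ x
        cost += x @ Q @ x + u @ R @ u   # stage cost x^T Q x + u^T R u
        x = A @ x + B @ u               # dynamics x_{t+1} = A x_t + B u_t
    return cost
```

In practice, the expectation over x_0 ∼ D would be estimated by averaging this cost over sampled initial states.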

  4. Linear Quadratic Regulator (Continued)
◮ LQR problem

      minimize_π   E[ Σ_{t=0}^∞ ( x_t^T Q x_t + u_t^T R u_t ) ]
      subject to   x_{t+1} = A x_t + B u_t,   u_t = π(x_t),   x_0 ∼ D,

  where A ∈ R^{n×n}, B ∈ R^{n×m}, Q ⪰ 0, and R ≻ 0.
◮ well-known facts
  – the optimal policy (or control gain) is linear: π^⋆(x) = Kx
  – the optimal value function (cost-to-go) is quadratic: V^⋆(x) = x^T P x, where

        P = A^T P A + Q − A^T P B (B^T P B + R)^{-1} B^T P A,
        K = −(B^T P B + R)^{-1} B^T P A.

  – P can be computed efficiently, e.g., via Riccati recursion, SDP, etc.
◮ many variants and extensions
  – e.g., time-varying, average-cost or discounted, jumping LQR, etc.
Preliminary
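
As a concrete complement (not part of the slides), P and K above can be obtained numerically with SciPy's discrete algebraic Riccati solver; the small system at the bottom is made up purely for demonstration.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Return (P, K) with P solving the discrete Riccati equation above
    and K the optimal gain for u_t = K x_t (same sign convention as the slide)."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    return P, K

# tiny illustrative example (values chosen arbitrarily)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P, K = lqr_gain(A, B, Q, R)  # the closed loop A + B K is then stable
```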

  5. Structured Linear Policy
◮ can we find a structured linear policy for LQR?
◮ structure can mean (block) sparsity, low rank, etc.
  – more interpretable, memory- and computation-efficient, well-suited to distributed settings
  – often, the policy structure is tied to the physical decision system
    ◮ e.g., a data-center cooling system needs to install/arrange cooling infrastructure
◮ to tackle this, we develop
  – formulation, algorithm, theory, and practice
Preliminary

  6. Formulation
◮ regularized LQR problem

      minimize_K   f(K) + λ r(K)                                                     (2)
      subject to   x_{t+1} = A x_t + B u_t,   u_t = K x_t,   x_0 ∼ D,

  where f(K) = E[ Σ_{t=0}^∞ ( x_t^T Q x_t + u_t^T R u_t ) ] is the LQR cost.
  – explicitly restrict the policy to the linear class, i.e., u_t = K x_t
  – the value function is still quadratic, i.e., V(x) = x^T P x for some P
  – convex regularizer r with (scalar) parameter λ ≥ 0
◮ the regularizer r(K) induces the policy structure
  – lasso ||K||_1 = Σ_{i,j} |K_{ij}| for sparse structure
  – group lasso ||K||_{G,2} = Σ_{g∈G} ||K_g||_2 for block-diagonal structure
  – nuclear norm ||K||_* = Σ_i σ_i(K) for low-rank structure
  – proximity ||K − K_ref||_F^2 for some reference gain K_ref ∈ R^{m×n}
Preliminary
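
For reference (not from the slides), two of these regularizers have closed-form proximal operators, which is what a proximal gradient step on λ r(K) relies on; a minimal NumPy sketch:

```python
import numpy as np

def prox_lasso(K, t):
    """prox of t*||K||_1: elementwise soft-thresholding, promotes sparsity."""
    return np.sign(K) * np.maximum(np.abs(K) - t, 0.0)

def prox_nuclear(K, t):
    """prox of t*||K||_*: singular-value thresholding, promotes low rank."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```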

  7. Structured Policy Iteration (S-PI)
◮ when the model is known, S-PI repeats
  – (1) policy (and covariance) evaluation
    ◮ solve the Lyapunov equations to return (P^i, Σ^i):

          (A + B K^i)^T P^i (A + B K^i) − P^i + Q + (K^i)^T R K^i = 0,
          (A + B K^i) Σ^i (A + B K^i)^T − Σ^i + Σ_0 = 0.

  – (2) policy improvement
    ◮ compute the gradient  ∇_K f(K^i) = 2 [ (R + B^T P^i B) K^i + B^T P^i A ] Σ^i
    ◮ apply a proximal gradient step with linesearch
◮ note that
  – each Lyapunov equation takes O(n^3) to solve
  – (almost) no hyperparameters to tune under linesearch (LS)
  – LS keeps the stability condition ρ(A + B K^i) < 1 satisfied
Part 1: Model-based approach for regularized LQR
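
A minimal sketch of one model-based S-PI iteration for the lasso regularizer, written under assumptions rather than taken from the authors' code: it uses SciPy's discrete Lyapunov solver for the two equations above and a fixed step size eta in place of the linesearch described on the slide.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def soft_threshold(K, t):
    """Proximal operator of t*||K||_1 (lasso)."""
    return np.sign(K) * np.maximum(np.abs(K) - t, 0.0)

def spi_step(A, B, Q, R, Sigma0, K, lam, eta):
    """One S-PI iteration: policy/covariance evaluation, then a proximal gradient step."""
    Acl = A + B @ K
    # (1) evaluation: the two discrete Lyapunov equations for P and Sigma
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    Sigma = solve_discrete_lyapunov(Acl, Sigma0)
    # (2) improvement: gradient of f at K, then proximal step on lam*||K||_1
    grad = 2 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
    return soft_threshold(K - eta * grad, eta * lam)
```

Starting from a stabilizing K (e.g., the Riccati solution of slide 4), the slides choose the step size by linesearch precisely so that the updated gain remains stabilizing.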

  8. Convergence
Theorem (Park et al. '20). Assume K^0 is such that ρ(A + B K^0) < 1. Then the iterates K^i of the S-PI algorithm converge to the stationary point K^⋆. Moreover, the convergence is linear, i.e., after N iterations,

      ||K^N − K^⋆||_F^2 ≤ (1 − 1/κ)^N ||K^0 − K^⋆||_F^2.

Here κ = 1 / ( η_min σ_min(Σ_0) σ_min(R) ) > 1, where

      η_min = h_η( 1/λ, 1/||A||, 1/||B||, 1/||R||, σ_min(Σ_0), σ_min(Q), 1/Δ, 1/F(K^0) )        (3)

for some function h_η that is non-decreasing in each argument.
  – Riccati recursion can provide a stabilizing initial policy K^0
  – the (global bound on the) fixed stepsize η_min depends on the model parameters
  – note that η_min ∝ 1/λ
  – in practice, with LS, the stepsize does not have to be tuned or calculated
Part 1: Model-based approach for regularized LQR

  9. Model-free Structured Policy Iteration
◮ when the model is unknown, S-PI repeats
  – (1) perturbed policy evaluation
    ◮ collect perturbations and (perturbed) costs-to-go { (f̂_j, U_j) }_{j=1}^{N_traj}:
      for each j = 1, …, N_traj,
        sample U_j ∼ Uniform(S_r) to get a perturbed policy K̂^i = K^i + U_j
        roll out K̂^i over the horizon H to estimate the cost-to-go
            f̂_j = Σ_{t=0}^{H} g(x_t, K̂^i x_t)
  – (2) policy improvement
    ◮ compute the (noisy) gradient
            ∇̂_K f(K^i) = (1/N_traj) Σ_{j=1}^{N_traj} (n/r^2) f̂_j U_j
    ◮ apply a proximal gradient step
◮ note that
  – a smoothing procedure is adapted to estimate the noisy gradient
  – (N_traj, H, r) are additional hyperparameters to tune
  – LS is not applicable
Part 2: Model-free approach for regularized LQR
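
A minimal sketch of the perturbation-based (zeroth-order) gradient estimate above, written under assumptions: `rollout_cost` stands for a hypothetical black-box simulator returning the truncated cost-to-go of a given gain, and the scaling uses the full dimension of K, which may differ from the n/r^2 factor written on the slide.

```python
import numpy as np

def estimate_gradient(K, rollout_cost, n_traj, horizon, r):
    """Smoothing-based gradient estimate of f at K from perturbed rollouts."""
    d = K.size                       # dimension used in the scaling (assumption)
    grad = np.zeros_like(K, dtype=float)
    for _ in range(n_traj):
        U = np.random.randn(*K.shape)
        U *= r / np.linalg.norm(U)   # uniform direction on the radius-r sphere S_r
        grad += (d / r**2) * rollout_cost(K + U, horizon) * U
    return grad / n_traj
```

The proximal gradient step is then applied to this estimate in place of the exact gradient used in the model-based case.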

  10. Convergence
Theorem (Park et al. '20). Suppose F(K^0) is finite, Σ_0 ≻ 0, and that x_0 ∼ D has norm bounded by D almost surely. Suppose the parameters of the model-free S-PI algorithm are chosen as

      (N_traj, H, 1/r) = h( n, 1/ε, 1/(σ_min(Σ_0) σ_min(R)), D^2/σ_min(Σ_0) )

for some polynomials h. Then, with the same stepsize as in Eq. (3), there exists an iteration N of at most 4κ log( ||K^0 − K^⋆||_F / ε ) such that ||K^N − K^⋆||_F ≤ ε, with probability at least 1 − o(ε^{n−1}). Moreover, it converges linearly,

      ||K^i − K^⋆||^2 ≤ (1 − 1/(2κ))^i ||K^0 − K^⋆||^2,

for the iterations i = 1, …, N, where κ = 1/(η σ_min(Σ_0) σ_min(R)) > 1.
  – K^0 is assumed to be a stabilizing policy, but Riccati recursion cannot be used here
  – here (N_traj, H, r) are hyperparameters to tune
Part 2: Model-free approach for regularized LQR

  11. Experiment (Setting)
◮ consider the unstable Laplacian system A ∈ R^{n×n} with

      A_ij = 1.1  if i = j,
             0.1  if i = j + 1 or j = i + 1,
             0    otherwise,

  B = Q = I_n ∈ R^{n×n}, and R = 1000 · I_n ∈ R^{n×n}.
  – unstable open-loop system, i.e., ρ(A) ≥ 1
  – extremely sensitive to parameters (even in the known-model setting)
  – less favorable for deploying generic model-free RL approaches
◮ model and S-PI algorithm parameters under the known model
  – system size n ∈ [3, 500]
  – lasso penalty with λ ∈ [10^{-2}, 10^6]
  – LS with initial stepsize η = 1/λ and backtracking factor β = 1/2
  – for the fixed stepsize, select η = O(1/λ)
Experiment
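
The experimental system above is straightforward to reproduce; a minimal NumPy sketch building (A, B, Q, R) as described on the slide:

```python
import numpy as np

def laplacian_system(n):
    """Unstable Laplacian dynamics: 1.1 on the diagonal, 0.1 on the first
    off-diagonals, with B = Q = I_n and R = 1000 * I_n."""
    A = 1.1 * np.eye(n) + 0.1 * (np.eye(n, k=1) + np.eye(n, k=-1))
    B = np.eye(n)
    Q = np.eye(n)
    R = 1000.0 * np.eye(n)
    return A, B, Q, R

A, B, Q, R = laplacian_system(20)   # e.g., n = 20; the slides use n in [3, 500]
```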

  12. Experiment (Continued)
◮ convergence behavior under LS, and scalability
  [Figure: f(K^i)/f(K^⋆) vs. iteration i for λ ∈ {588, 597, 606, 615, 624}; wall-clock time (sec) vs. dimension n.]
  – S-PI with LS converges very fast across various n and λ
  – scales well to large systems, even though solving the Lyapunov equations is the computational bottleneck
  – for n = 500, it takes less than 2 minutes (MacBook Air)
Experiment

  13. Experiment (Continued)
◮ dependency of the stepsize η on λ
  [Figure: largest fixed stepsize η_fixed yielding a stable closed-loop system vs. λ.]
  – vary λ for the same system
  – the largest (fixed) stepsize that keeps the closed-loop system stable, i.e., ρ(A + B K^i) < 1, is non-increasing in λ, i.e., η_fixed ∝ 1/λ
Experiment

  14. Experiment (Continued)
◮ trade-off between LQR performance and the structure of K
  [Figure: F(K^⋆)/F(K_lqr) and card(K^⋆)/card(K_lqr) as functions of λ.]
  – K_lqr is the (unregularized) LQR solution and K^⋆ the S-PI solution
  – as λ increases, the LQR cost f(K^⋆) increases whereas the cardinality decreases (sparsity improves)
  – in this range, S-PI barely changes the LQR cost but improves the sparsity by more than 50%
Experiment

  15. Experiment (Continued)
◮ sparsity pattern of the policy matrix
  [Figure: sparsity patterns for λ = 600 (card(K) = 132) and λ = 620 (card(K) = 62).]
  – sparsity pattern (locations of non-zero elements) of the policy matrix for λ = 600 and λ = 620
Experiment

  16. Challenges of the Model-free Approach
◮ the model-free approach is challenging and can be unstable
  – especially for an unstable open-loop system, ρ(A) ≥ 1
  – suffers from difficulties similar to the model-free policy gradient method for LQR [Fazel et al., 2018]
  – finding a stabilizing initial policy K^0 is non-trivial unless ρ(A) < 1
  – suffers from high variance and is especially sensitive to the smoothing parameter r
◮ open problems and algorithmic effort needed in practice
  – variance reduction
  – rules of thumb for tuning hyperparameters
◮ still promising as a different class of model-free approach
  – no discretization
  – no need to compute Q(s, a) values (as in REINFORCE-style methods)
  – seems to work for the average-cost LQR (an easier class of LQR)
  – more in the longer version of the paper
Experiment
