SLIDE 1

Structured Policy Iteration for Linear Quadratic Regulator

Youngsuk Park¹ with R. Rossi², Z. Wen³, G. Wu², and H. Zhao²

Stanford University¹, Adobe Research², DeepMind³

July 14, 2020

SLIDE 2

Introduction

◮ reinforcement learning (RL) is about learning from interaction with delayed feedback

– decide which action to take, which affects the next state of the environment
– requires sequential decision making

◮ most discrete RL algorithms scale poorly for tasks in continuous spaces

– discretize the state and/or action space
– curse of dimensionality
– sample inefficiency

2

SLIDE 3

Linear Quadratic Regulator

◮ Linear Quadratic Regulator (LQR) has rich applications in continuous-space tasks

– e.g., motion planning, trajectory optimization, portfolio optimization

◮ Infinite horizon (undiscounted) LQR problem

    minimize_π   E ∑_{t=0}^∞ ( x_t^T Q x_t + u_t^T R u_t )        (1)
    subject to   x_{t+1} = A x_t + B u_t,  u_t = π(x_t),  x_0 ∼ D,

where A ∈ R^{n×n}, B ∈ R^{n×m}, Q ⪰ 0, and R ≻ 0.

– quadratic cost Q, R and linear dynamics A, B
– Q, R set the relative weights of state deviation and input usage

Preliminary 3

SLIDE 4

Linear Quadratic Regulator (Continued)

◮ LQR problem

    minimize_π   E ∑_{t=0}^∞ ( x_t^T Q x_t + u_t^T R u_t )
    subject to   x_{t+1} = A x_t + B u_t,  u_t = π(x_t),  x_0 ∼ D,

where A ∈ R^{n×n}, B ∈ R^{n×m}, Q ⪰ 0, and R ≻ 0.

◮ well-known facts

– linear optimal policy (or control gain): π⋆(x) = Kx
– quadratic optimal value function (cost-to-go): V⋆(x) = x^T P x, where

    P = A^T P A + Q − A^T P B (B^T P B + R)^{−1} B^T P A,
    K = −(B^T P B + R)^{−1} B^T P A.

– P can be derived efficiently, e.g., via Riccati recursion, SDP, etc. (see the sketch below)
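A minimal sketch (not from the slides) of computing (P, K) with SciPy's discrete Riccati solver; A, B, Q, R are assumed to satisfy the conditions above:

    # Sketch: optimal LQR gain via the discrete algebraic Riccati equation (DARE).
    import numpy as np
    from scipy.linalg import solve_discrete_are

    def lqr_gain(A, B, Q, R):
        """Return (P, K) with V*(x) = x^T P x and optimal policy u = K x."""
        P = solve_discrete_are(A, B, Q, R)  # solves the Riccati equation above
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)  # K = -(B^T P B + R)^{-1} B^T P A
        return P, K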

◮ many variants and extensions

– e.g., time-varying, averaged or discounted, jumping LQR etc.

Preliminary 4

SLIDE 5

Structured Linear Policy

◮ can we find a structured linear policy for LQR?

◮ structure can mean (block) sparse, low-rank, etc.

– more interpretable, memory- and computation-efficient, well-suited for distributed settings
– often, a structured policy reflects the physical decision system

◮ e.g., a data-center cooling system needs to install/arrange cooling infrastructure

◮ To tackle this, we develop

– formulation, algorithm, theory and practice

Preliminary 5

SLIDE 6

Formulation

◮ regularized LQR problem

    minimize_K   f(K) := E ∑_{t=0}^∞ ( x_t^T Q x_t + u_t^T R u_t ) + λ r(K)        (2)
    subject to   x_{t+1} = A x_t + B u_t,  u_t = K x_t,  x_0 ∼ D,

– explicitly restrict the policy to the linear class, i.e., u_t = K x_t
– value function is still quadratic, i.e., V(x) = x^T P x for some P
– convex regularizer r with (scalar) parameter λ ≥ 0

◮ regularizer r(K) induces the policy structure (prox operators are sketched after this list)

– lasso ‖K‖_1 = ∑_{i,j} |K_{i,j}| for sparse structure
– group lasso ‖K‖_{G,2} = ∑_{g∈G} ‖K_g‖_2 for block-diagonal structure
– nuclear norm ‖K‖_* = ∑_i σ_i(K) for low-rank structure
– proximity ‖K − K_ref‖_F^2 for some K_ref ∈ R^{m×n}
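The proximal gradient step in S-PI (next slide) only needs each regularizer's prox operator; a minimal sketch (not from the slides) for two of the regularizers above:

    # Sketch: proximal operators prox_{t*r}(K) for two regularizers on this slide.
    import numpy as np

    def prox_lasso(K, t):
        # prox of t*||K||_1: entrywise soft-thresholding
        return np.sign(K) * np.maximum(np.abs(K) - t, 0.0)

    def prox_nuclear(K, t):
        # prox of t*||K||_*: soft-thresholding of the singular values
        U, s, Vt = np.linalg.svd(K, full_matrices=False)
        return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt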

Preliminary 6

SLIDE 7

Structured Policy Iteration (S-PI)

◮ When model is known, S-PI repeats

– (1) Policy (and covariance) evaluation

◮ solve Lyapunov equations to return (P^i, Σ^i):

    (A + BK^i)^T P^i (A + BK^i) − P^i + Q + (K^i)^T R K^i = 0,
    (A + BK^i) Σ^i (A + BK^i)^T − Σ^i + Σ_0 = 0.

– (2) Policy improvement

◮ compute gradient

    ∇_K f(K^i) = 2 ( (R + B^T P^i B) K^i + B^T P^i A ) Σ^i

◮ apply proximal gradient step under linesearch (one full iteration is sketched below, after the notes)

◮ note that

– Lyapunov equation requires O(n³) to solve
– (almost) no hyperparameter to tune under linesearch (LS)
– LS keeps the stability condition ρ(A + BK^i) < 1 satisfied
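A minimal sketch of one model-based S-PI iteration, assuming a lasso regularizer and a fixed stepsize eta (the slides use linesearch instead):

    # Sketch: one S-PI iteration (policy evaluation + proximal policy improvement).
    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def spi_step(A, B, Q, R, Sigma0, K, lam, eta):
        Acl = A + B @ K
        # (1) policy (and covariance) evaluation: the two Lyapunov equations above
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        Sigma = solve_discrete_lyapunov(Acl, Sigma0)
        # (2) policy improvement: gradient step, then prox of eta*lam*||.||_1
        grad = 2 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
        G = K - eta * grad
        return np.sign(G) * np.maximum(np.abs(G) - eta * lam, 0.0)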

Part 1: Model-based approach for regularized LQR 7

SLIDE 8

Convergence

Theorem (Park et al. ’20) Assume K^0 s.t. ρ(A + BK^0) < 1. Then K^i from the S-PI algorithm converges to the stationary point K⋆. Moreover, it converges linearly, i.e., after N iterations,

    ‖K^N − K⋆‖_F^2 ≤ (1 − 1/κ)^N ‖K^0 − K⋆‖_F^2.

Here κ = 1/(η_min σ_min(Σ_0) σ_min(R)) > 1, where

    η_min = h_η( σ_min(Σ_0), σ_min(Q), 1/λ, 1/‖A‖, 1/‖B‖, 1/‖R‖, 1/∆, 1/F(K^0) ),        (3)

for some function h_η non-decreasing in each argument.

– Riccati recursion can give a stabilizing initial policy K^0
– the (global bound on the) fixed stepsize η_min depends on model parameters
– note η_min ∝ 1/λ
– in practice, using LS, the stepsize does not have to be tuned or calculated

Part 1: Model-based approach for regularized LQR 8

SLIDE 9

Model-free Structured Policy Iteration

◮ when model is unknown, S-PI repeats

– (1) Perturbed policy evaluation

◮ get perturbations and (perturbed) costs-to-go {f̂_j, U_j}_{j=1}^{N_traj}:

    for each j = 1, …, N_traj:
        sample U_j ∼ Uniform(S_r) to get a perturbed K̂^i = K^i + U_j
        roll out K̂^i over the horizon H to estimate the cost-to-go f̂_j = ∑_{t=0}^{H} g(x_t, K̂^i x_t)

– (2) Policy improvement

◮ compute the (noisy) gradient estimate (sketched below, after the notes)

    ∇̂_K f(K^i) = (1/N_traj) ∑_{j=1}^{N_traj} (n/r²) f̂_j U_j

◮ apply proximal gradient step

◮ note that

– smoothing procedure adapted to estimate the noisy gradient
– (N_traj, H, r) are additional hyperparameters to tune
– LS is not applicable
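A hedged sketch of the smoothed (zeroth-order) gradient estimate above; rollout_cost is a hypothetical black-box simulator returning the H-step cost-to-go of a policy:

    # Sketch: smoothed gradient estimate for model-free S-PI.
    import numpy as np

    def smoothed_gradient(K, rollout_cost, n_traj, r, rng):
        n = K.size  # dimension of the perturbation space (entries of K)
        grad = np.zeros_like(K)
        for _ in range(n_traj):
            U = rng.standard_normal(K.shape)
            U *= r / np.linalg.norm(U)   # U ~ Uniform(sphere S_r)
            f_hat = rollout_cost(K + U)  # perturbed cost-to-go estimate
            grad += (n / r**2) * f_hat * U
        return grad / n_traj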

Part 2: Model-free approach for regularized LQR 9

SLIDE 10

Convergence

Theorem (Park et al. ’20) Suppose F(K^0) is finite, Σ_0 ≻ 0, and that x_0 ∼ D has norm bounded by D almost surely. Suppose the parameters in the algorithm are chosen as

    (N_traj, H, 1/r) = h( n, 1/ε, 1/(σ_min(Σ_0) σ_min(R)), D²/σ_min(Σ_0) )

for some polynomials h. Then, with the same stepsize as in Eq. (3), there exists an iteration N, at most 4κ log( ‖K^0 − K⋆‖_F / ε ), such that ‖K^N − K⋆‖ ≤ ε with probability at least 1 − o(ε^{n−1}). Moreover, it converges linearly,

    ‖K^i − K⋆‖^2 ≤ (1 − 1/(2κ))^i ‖K^0 − K⋆‖^2,

for iterations i = 1, …, N, where κ = 1/(η σ_min(Σ_0) σ_min(R)) > 1.

– assume K^0 is a stabilizing policy, but Riccati recursion cannot be used here (model unknown)
– here (N_traj, H, r) are hyperparameters to tune

Part 2: Model-free approach for regularized LQR 10

SLIDE 11

Experiment (Setting)

◮ Consider the unstable Laplacian system A ∈ R^{n×n} where

    A_ij = 1.1 if i = j,
           0.1 if i = j + 1 or j = i + 1,
           0   otherwise,

with B = Q = I_n ∈ R^{n×n} and R = 1000 × I_n ∈ R^{n×n} (see the sketch below).

– unstable open-loop system, i.e., ρ(A) ≥ 1
– extremely sensitive to parameters (even under the known-model setting)
– not favorable for deploying generic model-free RL approaches
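A small sketch (not from the slides) constructing this test system with NumPy:

    # Sketch: the unstable Laplacian system from this slide.
    import numpy as np

    def laplacian_system(n):
        A = 1.1 * np.eye(n) + 0.1 * (np.eye(n, k=1) + np.eye(n, k=-1))
        B = np.eye(n)           # B = Q = I_n
        Q = np.eye(n)
        R = 1000.0 * np.eye(n)  # heavy penalty on input usage
        return A, B, Q, R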

◮ Model and S-PI algorithm parameters under the known model

– system size n ∈ [3, 500]
– lasso penalty with λ ∈ [10^−2, 10^6]
– LS with initial stepsize η = 1/λ and backtracking factor β = 1/2
– for a fixed stepsize, select η = O(1/λ)

Experiment 11

SLIDE 12

Experiment (Continued)

◮ Convergence behavior under LS and scalability

[Figure: left, convergence of f(K^i)/f(K⋆) vs. iteration i over different λ (588, 597, 606, 615, 624); right, runtime (sec) vs. dimension n]

– S-PI with LS converges very fast over various n and λ
– scales well to large systems, even with the computational bottleneck of solving Lyapunov equations
– for n = 500, takes less than 2 minutes (MacBook Air)

Experiment 12

SLIDE 13

Experiment (Continued)

◮ Dependency of stepsize η on λ.

[Figure: largest fixed stepsize η_fixed with stable system vs. λ (log-log scale)]

– vary λ under the same system
– the largest (fixed) stepsize keeping the closed-loop system stable, i.e., ρ(A + BK^i) < 1, is non-increasing in λ: η_fixed ∝ 1/λ

Experiment 13

SLIDE 14

Experiment (Continued)

◮ Trade-off between LQR performance and the structure of K

[Figure: "The effect of different λ": F(K⋆)/F(K_lqr) (top) and card(K⋆)/card(K_lqr) (bottom) vs. λ ∈ [590, 630]]

– K_lqr is the (unregularized) LQR solution and K⋆ the S-PI solution
– as λ increases, the LQR cost f(K⋆) increases whereas the cardinality decreases (sparsity improves)
– in this range, S-PI barely changes the LQR cost but improves sparsity by more than 50%

Experiment 14

SLIDE 15

Experiment (Continued)

◮ sparsity pattern of policy matrix

[Figure: sparsity patterns of the policy matrix]

– sparsity pattern (location of non-zero elements) of the policy matrix for λ = 600 (card(K) = 132) and λ = 620 (card(K) = 62)

Experiment 15

SLIDE 16

Challenge on model-free approach

◮ model-free approach is challenging and unstable

– especially for unstable open-loop systems, i.e., ρ(A) ≥ 1
– suffers difficulties similar to the model-free policy gradient method [Fazel et al., 2018] for LQR
– finding a stabilizing initial policy K^0 is non-trivial unless ρ(A) < 1
– suffers high variance and is especially sensitive to the smoothing parameter r

◮ open problems and algorithmic efforts needed in practice

– variance reduction
– rules of thumb to tune hyperparameters

◮ still promising as a different class of model-free approach

– no discretization
– no need to compute Q(s, a) pairs (as in REINFORCE)
– seems to work for the averaged-cost LQR (an easier class of LQR)
– more in the longer version of the paper

Experiment 16

SLIDE 17

Summary

◮ formulate the regularized LQR problem to derive structured policies

◮ develop the S-PI algorithm for both model-based and model-free approaches, with theoretical guarantees

◮ model-based S-PI works well in practice with (almost) no hyperparameter tuning

◮ model-free S-PI is still promising but challenging

Summary 17

SLIDE 18

Thank you!

Summary 18