SLIDE 1

Learning Linear Quadratic Regulators Efficiently with Only √T Regret

Alon Cohen

Joint work with: Tomer Koren and Yishay Mansour

SLIDE 2

Control Theory · Multi-armed Bandits · Reinforcement Learning

SLIDE 3

Linear Quadratic Control

[Diagram: the agent/environment interaction loop]

SLIDE 4

Linear Quadratic Control

[Diagram: the agent sends a control u_t ∈ ℝ^k; the environment returns a state x_t ∈ ℝ^d]

SLIDE 5

Linear Quadratic Control

[Diagram: agent/environment loop with control, state, and noise]

Control: u_t ∈ ℝ^k
Noise: w_t ∼ 𝒩(0, W)
State: x_{t+1} = A⋆ x_t + B⋆ u_t + w_t ∈ ℝ^d

SLIDE 6

Linear Quadratic Control

[Diagram: agent/environment loop with control, state, cost, and noise]

Control: u_t ∈ ℝ^k
Noise: w_t ∼ 𝒩(0, W)
State: x_{t+1} = A⋆ x_t + B⋆ u_t + w_t ∈ ℝ^d
Cost: c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t
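To make the setup concrete, here is a minimal simulation of these dynamics; all matrices (A_star, B_star, Q, R, W) and the zero policy are illustrative placeholders, not values from the talk.

```python
# A minimal simulation of the LQ dynamics and cost above.
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 3, 2, 1000
A_star = 0.9 * np.eye(d)                    # assumed stable open-loop dynamics
B_star = 0.1 * rng.standard_normal((d, k))
Q, R = np.eye(d), np.eye(k)
W = 0.01 * np.eye(d)                        # noise covariance

x, total_cost = np.zeros(d), 0.0
for t in range(T):
    u = np.zeros(k)                              # placeholder policy u_t = 0
    total_cost += x @ Q @ x + u @ R @ u          # c_t = x_t^T Q x_t + u_t^T R u_t
    w = rng.multivariate_normal(np.zeros(d), W)  # w_t ~ N(0, W)
    x = A_star @ x + B_star @ u + w              # x_{t+1} = A* x_t + B* u_t + w_t
print(total_cost / T)                            # average cost of the zero policy
```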

SLIDE 7

Applications

SLIDE 8

Planning in LQRs

[Diagram: agent/environment loop with control, state, and cost]

x_{t+1} = A⋆ x_t + B⋆ u_t + w_t ∈ ℝ^d,  u_t ∈ ℝ^k,  c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t

Policy: π : x_t ⟼ u_t. The optimal policy stabilizes the system at minimum cost; for the infinite horizon it is linear in the state:

π⋆(x) = Kx

Dimitri P. Bertsekas, Dynamic Programming and Optimal Control, 2005.
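For the infinite horizon, the optimal gain can be computed from the discrete algebraic Riccati equation (see Bertsekas, 2005). A minimal sketch, assuming scipy is available; the usual minus sign is folded into K so that u_t = K x_t matches the slide's convention:

```python
# Sketch: the optimal gain K of pi*(x) = Kx via the discrete Riccati equation.
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    P = solve_discrete_are(A, B, Q, R)                  # Riccati fixed point
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal linear gain
    return K, P
```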

SLIDE 9

Learning in LQRs

[Diagram: agent/environment loop as before, with the transition matrices blanked out]

x_{t+1} = □ x_t + □ u_t + w_t ∈ ℝ^d,  u_t ∈ ℝ^k,  c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t

(A⋆ and B⋆ are unknown to the learner.)

Abbasi-Yadkori and Szepesvári, 2011; Ibrahimi et al., 2012; Faradonbeh et al., 2017; Ouyang et al., 2017; Abeille and Lazaric, 2017, 2018; Dean et al., 2018, 2019.

Goal: minimize the regret

R_T = Σ_{t=1}^T cost_t(Alg) − min_K Σ_{t=1}^T cost_t(K)
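One hedged way to read this definition in code: roll out a fixed gain on a shared noise sequence and compare cumulative costs. `best_K` stands in for the hindsight-optimal fixed gain (e.g. the Riccati gain above); all names are illustrative.

```python
# Evaluating the two sums in the regret definition on one noise realization.
import numpy as np

def rollout_cost(K, A, B, Q, R, noise):
    """Cumulative cost of the fixed linear policy u_t = K x_t."""
    x, total = np.zeros(A.shape[0]), 0.0
    for w in noise:
        u = K @ x
        total += x @ Q @ x + u @ R @ u   # c_t
        x = A @ x + B @ u + w            # x_{t+1}
    return total

# R_T ~ (learner's cumulative cost) - rollout_cost(best_K, A_star, B_star, Q, R, noise)
```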

SLIDE 10–15

Our Result

First poly-time algorithm for online learning of linear-quadratic control systems with Õ(√T) regret. Resolves an open question of Abbasi-Yadkori and Szepesvári (2011) and of Dean, Mania, Matni, Recht, and Tu (2018).

The comparison table (built up one row per slide):

                                        Regret             Efficient
  Abbasi-Yadkori and Szepesvári, 2011   exp(d) · √T        No
  Ibrahimi et al., 2012                 poly(d) · √T       No
  Dean et al., 2018                     poly(d) · T^(2/3)  Yes
  Ours                                  poly(d) · √T       Yes

* A recent paper by Mania et al., 2019 can be used to derive a result similar to ours.

SLIDE 16–19

Solution Techniques

Explore-then-Exploit (Dean et al., 2018), built up one step per slide (a code sketch of the estimation step follows the list):

1. Execute K_0 + Gaussian noise: u_t = K_0 x_t + 𝒩(0, ε²I), collecting the data (x_t, u_t)_{t=1}^T.
2. Model Estimation (Åström, 1968): (Â, B̂) = arg min_{(A,B)} Σ_{t=1}^T ‖A x_t + B u_t − x_{t+1}‖².
3. Solve Model: compute the gain K̂ for the estimated system (Â, B̂).
4. Execute K̂.

Regret: R_T = O(T^(2/3))
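A minimal sketch of the estimation step (step 2), assuming the exploration data is stored as arrays; the names are illustrative, not from the paper:

```python
# Least-squares model estimate: regress x_{s+1} on z_s = (x_s, u_s).
import numpy as np

def estimate_model(xs, us):
    """xs: (T+1, d) array of states; us: (T, k) array of controls."""
    d = xs.shape[1]
    Z = np.hstack([xs[:-1], us])                        # rows z_s = (x_s, u_s)
    Theta, *_ = np.linalg.lstsq(Z, xs[1:], rcond=None)  # Z @ Theta ~ x_{s+1}
    A_hat, B_hat = Theta[:d].T, Theta[d:].T             # Theta^T = (A_hat B_hat)
    return A_hat, B_hat
```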

SLIDE 20–25

Solution Techniques

Optimism in the Face of Uncertainty (Abbasi-Yadkori and Szepesvári, 2011), based on UCRL. Repeat, building up one step per slide (a sketch of a version-space membership test follows the list):

1. Find Optimistic Policy: π_t = arg min_{π, (A B) ∈ Θ_t} J_{(A B)}(π), where Θ_t ∋ (A⋆ B⋆) is a version space of plausible models.
2. Execute π_t.
3. Update the version space Θ_t from the observations (x_t, u_t).

Regret: R_T = O(√T)

Optimistic in the sense that: min_{π, (A B) ∈ Θ_t} J_{(A B)}(π) ≤ J(π⋆).

Caveat: J_{(A B)}(π) is not convex in the policy parameters.
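For illustration only, here is what a membership test for the version space could look like, assuming the ellipsoidal confidence set suggested by the estimation lemma on slide 36 (the radius is a placeholder):

```python
# Version-space membership: (A B) is plausible if its deviation from the
# least-squares estimate is small in the V_t norm, tr(Delta V Delta^T) <= r.
import numpy as np

def in_version_space(AB, AB_hat, V, radius=1.0):
    Delta = AB - AB_hat                       # (A B) - (A_hat B_hat), shape (d, d+k)
    return np.trace(Delta @ V @ Delta.T) <= radius
```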

SLIDE 26–27

Convex (SDP) Formulation (Cohen et al., 2018)

LQ Control:

x_{t+1} = A⋆ x_t + B⋆ u_t + w_t,  c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t

Convex re-parameterization via the steady-state covariance matrix of (x, u):

Σ = 𝔼[(x; u)(x; u)^⊤] = ( Σ_xx  Σ_xu ; Σ_ux  Σ_uu )

The SDP:

min_{Σ ⪰ 0}  Σ ∙ diag(Q, R)
s.t.  Σ_xx = (A⋆ B⋆) Σ (A⋆ B⋆)^⊤ + W

Lemma: K = Σ_ux Σ_xx^{−1} is optimal for LQR.
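A hedged sketch of this program, assuming cvxpy with an SDP-capable solver; it illustrates the formulation and the lemma's recovery of K, and is not code from the paper:

```python
# The steady-state SDP: minimize Sigma . diag(Q, R) subject to the
# covariance fixed-point condition, then read off K = Sigma_ux Sigma_xx^{-1}.
import numpy as np
import cvxpy as cp
from scipy.linalg import block_diag

def lqr_sdp_gain(A, B, Q, R, W):
    d, k = B.shape
    AB = np.hstack([A, B])                            # (A* B*)
    S = cp.Variable((d + k, d + k), PSD=True)         # Sigma >= 0
    objective = cp.trace(block_diag(Q, R) @ S)        # Sigma . diag(Q, R)
    steady_state = [S[:d, :d] == AB @ S @ AB.T + W]   # Sigma_xx condition
    cp.Problem(cp.Minimize(objective), steady_state).solve()
    Sigma = S.value
    return Sigma[d:, :d] @ np.linalg.inv(Sigma[:d, :d])  # K = Sigma_ux Sigma_xx^{-1}
```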

SLIDE 28–33

Intuition for Our Algorithm

[Diagram, built up across the slides: a warm start produces K_0; the algorithm then runs in epochs, switching through policies K_1, K_2, K_3, …]

O(log T) epochs with high probability. Õ(√T) regret in total.
SLIDE 34

Our Algorithm: OSLO (i)

After warm start: ‖(A_0 B_0) − (A⋆ B⋆)‖²_F ≤ O(1/√T).

Maintain: V_t = λI + (1/β) Σ_{s=1}^{t−1} z_s z_s^⊤, where z_s = (x_s; u_s).

Run in epochs (a sketch of this loop follows):
- Compute an optimistic K_t using a semidefinite program.
- Execute K_t, fixed during the epoch.
- The epoch ends when det(V_t) is doubled.
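A hedged sketch of the epoch loop, with `optimistic_gain` standing in for the epoch-start SDP of the next slide; `env_step`, `lam`, and `beta` are placeholders:

```python
# OSLO's epoch structure: the gain is recomputed only when det(V_t) doubles,
# which happens O(log T) times (cf. the intuition slides above).
import numpy as np

def run_oslo(env_step, x0, T, dim_z, optimistic_gain, lam=1.0, beta=1.0):
    V = lam * np.eye(dim_z)                 # V_t = lam*I + (1/beta) sum z_s z_s^T
    det_epoch = np.linalg.det(V)
    K = optimistic_gain(V)                  # first epoch's policy
    x = x0
    for t in range(T):
        if np.linalg.det(V) > 2 * det_epoch:    # epoch ends: det(V_t) doubled
            det_epoch = np.linalg.det(V)
            K = optimistic_gain(V)              # K_t is fixed within an epoch
        u = K @ x
        z = np.concatenate([x, u])              # z_t = (x_t, u_t)
        V += np.outer(z, z) / beta
        x = env_step(x, u)                      # environment returns x_{t+1}
    return K
```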

SLIDE 35

Our Algorithm: OSLO (ii)

At epoch start (a sketch of the SDP follows):

1. Estimate A⋆, B⋆ from past observations:

   (A_t B_t) = arg min_{(A B)} (1/β) Σ_{s=1}^{t−1} ‖(A B) z_s − x_{s+1}‖² + λ ‖(A B) − (A_0 B_0)‖²_F

2. Compute an optimistic policy by solving

   Σ_t = arg min_{Σ ⪰ 0}  Σ ∙ diag(Q, R)
   s.t.  Σ_xx ⪰ (A_t B_t) Σ (A_t B_t)^⊤ + W − μ (Σ ∙ V_t^{−1}) I

   where Σ = ( Σ_xx  Σ_xu ; Σ_ux  Σ_uu ). This SDP replaces the hard optimistic problem in Abbasi-Yadkori and Szepesvári.

3. Output: K_t = (Σ_t)_ux (Σ_t)_xx^{−1}.
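A hedged sketch of this epoch-start SDP in the same cvxpy style as before; `mu` is the slide's optimism parameter, and the explicit symmetrization is an implementation detail, not part of the formulation:

```python
# The relaxed SDP: the planning SDP's equality becomes an inequality,
# loosened by the optimism term mu * (Sigma . V^{-1}).
import numpy as np
import cvxpy as cp
from scipy.linalg import block_diag

def oslo_sdp_gain(A_t, B_t, Q, R, W, V, mu):
    d, k = B_t.shape
    AB = np.hstack([A_t, B_t])
    S = cp.Variable((d + k, d + k), PSD=True)
    slack = mu * cp.trace(np.linalg.inv(V) @ S)            # mu * (Sigma . V^{-1})
    expr = S[:d, :d] - (AB @ S @ AB.T + W) + slack * np.eye(d)
    lmi = [(expr + expr.T) / 2 >> 0]                       # symmetrized PSD constraint
    cp.Problem(cp.Minimize(cp.trace(block_diag(Q, R) @ S)), lmi).solve()
    Sigma = S.value
    return Sigma[d:, :d] @ np.linalg.inv(Sigma[:d, :d])    # optimistic K_t
```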

SLIDE 36

Parameter Estimation

Lemma (Abbasi-Yadkori and Szepesvári, 2011): Let Δ_t = (A_t B_t) − (A⋆ B⋆). With high probability, tr(Δ_t V_t Δ_t^⊤) ≤ 1.

Since ‖V_t‖ = Θ(t), the lemma gives ‖Δ_t‖ = Θ(1/√t).

This is "almost" the regret:

Σ_{t=1}^T ‖Δ_t‖ = O(√T)

(disregarding switches and warm start)
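The last display hides one standard step, spelled out here (this derivation is ours, not on the slide):

```latex
\sum_{t=1}^{T} \|\Delta_t\|
  = \sum_{t=1}^{T} \Theta\!\left(\frac{1}{\sqrt{t}}\right)
  \le C \sum_{t=1}^{T} \frac{1}{\sqrt{t}}
  \le C \int_{0}^{T} \frac{\mathrm{d}t}{\sqrt{t}}
  = 2C\sqrt{T}
  = O\!\left(\sqrt{T}\right).
```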

SLIDE 37

MDP vs. LQR: Boundedness of States

Unlike in MDPs, the states may be unbounded. Large states have low probability when K is stable, but they can have an unpredictable effect on the expectation, and the system may destabilize when switching between policies too often. Main technique: generate "sequentially stable" policies that keep the states bounded with high probability:

‖x_t‖ ⪅ (κ/γ) √(d log T) w.h.p.

SLIDE 38

Summary

First efficient algorithm for learning LQRs with Õ(√T) regret; solves an open problem. Shows a connection between multi-armed bandits, RL, control, and convex optimization.

Open problems: no lower bound! There is evidence that the correct rate is O(log T) (Mania et al., 2019).

SLIDE 39

Thank You!

Poster #159