Learning Linear Quadratic Regulators Efficiently with Only √T Regret
Alon Cohen
Joint work with: Tomer Koren and Yishay Mansour
[Venn diagram: Linear Quadratic Control at the intersection of Reinforcement Learning, Control Theory, and Multi-armed Bandits.]
The setting: an agent interacts with an environment in a loop.
- Control: $u_t \in \mathbb{R}^k$
- State: $x_{t+1} = A_\star x_t + B_\star u_t + w_t \in \mathbb{R}^d$
- Noise: $w_t \sim \mathcal{N}(0, W)$
- Cost: $c_t = x_t^\top Q x_t + u_t^\top R u_t$
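To make the setting concrete, here is a minimal simulation sketch, assuming numpy (the matrices are illustrative placeholders, not from the talk):

```python
# Simulate x_{t+1} = A x_t + B u_t + w_t with cost c_t = x_t' Q x_t + u_t' R u_t.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # placeholder A*
B = np.array([[0.0], [1.0]])             # placeholder B*
Q, R, W = np.eye(2), np.eye(1), np.eye(2)

x, total_cost = np.zeros(2), 0.0
for t in range(1000):
    u = np.zeros(1)                      # placeholder policy (do nothing)
    total_cost += x @ Q @ x + u @ R @ u
    w = rng.multivariate_normal(np.zeros(2), W)   # w_t ~ N(0, W)
    x = A @ x + B @ u + w
print("average cost:", total_cost / 1000)
```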
Policy: a mapping $\pi : x_t \mapsto u_t$. The optimal policy stabilizes the system at minimum cost. For the infinite-horizon problem, the optimal policy is linear (Dimitri P. Bertsekas, Dynamic Programming and Optimal Control, 2005):
$$\pi_\star(x) = Kx$$
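For a concrete sense of how $K$ is obtained when $A_\star, B_\star$ are known, here is a sketch using scipy's Riccati solver (the matrices are placeholders, not the talk's):

```python
# Compute the optimal LQR gain K (u_t = K x_t) via the discrete
# algebraic Riccati equation for a known system.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # placeholder A*
B = np.array([[0.0], [0.1]])             # placeholder B*
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)
# With cost x'Qx + u'Ru, the optimal policy is u = -(R + B'PB)^{-1} B'PA x.
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("K =", K)
```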
This talk: the system matrices are unknown,
$$x_{t+1} = \square\, x_t + \square\, u_t + w_t, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,$$
and the learner must estimate $A_\star, B_\star$ while controlling the system.
Prior work: Abbasi-Yadkori and Szepesvári, 2011; Ibrahimi et al., 2012; Faradonbeh et al., 2017; Ouyang et al., 2017; Abeille and Lazaric, 2017, 2018; Dean et al., 2018, 2019.
Goal: minimize the regret
$$R_T = \sum_{t=1}^{T} \mathrm{cost}_t(\mathrm{Alg}) - \min_{K} \sum_{t=1}^{T} \mathrm{cost}_t(K)$$
Our results (regret / computational efficiency):
- Abbasi-Yadkori and Szepesvári, 2011: regret exp(d)√T, not computationally efficient
- Ibrahimi et al., 2012: regret poly(d)√T, not computationally efficient
- Dean et al., 2018: regret poly(d)T^{2/3}, efficient
- Ours: regret poly(d)√T, efficient

This is the first poly-time algorithm for online learning of linear-quadratic control systems with Õ(√T) regret, resolving an open question of Abbasi-Yadkori and Szepesvári (2011) and of Dean, Mania, Matni, Recht, and Tu (2018).
* A recent paper by Mania et al., 2019 can be used to derive a result similar to ours.
Explore-then-Exploit (Dean et al., 2018)
1. Execute a fixed policy $K_0$ plus Gaussian exploration noise, $u_t = K_0 x_t + \mathcal{N}(0, \varepsilon^2 I)$, and collect $(x_t, u_t)_{t=1}^{T}$.
2. Model estimation (Åström, 1968) by least squares:
$$(\hat{A}\ \hat{B}) = \arg\min_{(A\ B)} \sum_{t=1}^{T} \|A x_t + B u_t - x_{t+1}\|^2$$
3. Solve the estimated model $(\hat{A}\ \hat{B})$ for a policy $\hat{K}$, and execute $\hat{K}$.
Regret: $R_T = O(T^{2/3})$.
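A minimal sketch of this pipeline in numpy (matrices illustrative; $K_0 = 0$ happens to stabilize the example system):

```python
# Explore with u_t = K0 x_t + N(0, eps^2 I), then estimate (A*, B*) by
# least squares over the collected trajectory.
import numpy as np

rng = np.random.default_rng(0)
d, k, T, eps = 2, 1, 5000, 0.5
A_star = np.array([[0.9, 0.1], [0.0, 0.8]])
B_star = np.array([[0.0], [1.0]])
K0 = np.zeros((k, d))                    # assumed stabilizing policy

X, U, Xnext = [], [], []
x = np.zeros(d)
for t in range(T):
    u = K0 @ x + eps * rng.standard_normal(k)
    x_next = A_star @ x + B_star @ u + rng.standard_normal(d)  # w_t ~ N(0, I)
    X.append(x); U.append(u); Xnext.append(x_next)
    x = x_next

Z = np.hstack([np.array(X), np.array(U)])          # rows z_t = (x_t, u_t)
# (A_hat B_hat) = argmin_(A B) sum_t ||(A B) z_t - x_{t+1}||^2
Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
A_hat, B_hat = Theta.T[:, :d], Theta.T[:, d:]
print(np.linalg.norm(A_hat - A_star), np.linalg.norm(B_hat - B_star))
```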
Optimism in the Face of Uncertainty (Abbasi-Yadkori and Szepesvári, 2011), based on UCRL:
1. Maintain a version space $\Theta_t \ni (A_\star\ B_\star)$.
2. Find an optimistic policy:
$$\pi_t = \arg\min_{\pi,\ (A\ B) \in \Theta_t} J_{(A\ B)}(\pi)$$
3. Execute $\pi_t$, observe $(x_t, u_t)$, and update the version space.
Regret: $R_T = O(\sqrt{T})$.
Optimistic in the sense that
$$\min_{\pi,\ (A\ B) \in \Theta_t} J_{(A\ B)}(\pi) \le J(\pi_\star).$$
Caveat: $J_{(A\ B)}(\pi)$ is not convex in the policy parameters.
Convex re-parameterization (Cohen et al., 2018)
LQ control: $x_{t+1} = A_\star x_t + B_\star u_t + w_t$, $c_t = x_t^\top Q x_t + u_t^\top R u_t$.
Parameterize a policy by its steady-state covariance matrix
$$\Sigma = \mathbb{E}\!\left[\begin{pmatrix} x \\ u \end{pmatrix}\begin{pmatrix} x \\ u \end{pmatrix}^{\!\top}\right] = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xu} \\ \Sigma_{ux} & \Sigma_{uu} \end{pmatrix},$$
and minimize the linear objective $\Sigma \bullet \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix}$ over $\Sigma \succeq 0$ satisfying the steady-state condition $\Sigma_{xx} = (A_\star\ B_\star)\,\Sigma\,(A_\star\ B_\star)^\top + W$.
Lemma: $K = \Sigma_{ux} \Sigma_{xx}^{-1}$ is optimal for LQR.
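A sketch of this program, assuming cvxpy and illustrative matrices; the steady-state condition is affine in $\Sigma$, so the whole problem is a semidefinite program:

```python
# SDP re-parameterization of LQR for a known system (Cohen et al., 2018 style):
# minimize  Sigma . diag(Q, R)
# s.t.      Sigma >= 0,  Sigma_xx = (A B) Sigma (A B)' + W
import cvxpy as cp
import numpy as np

d, k = 2, 1
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # placeholder A*
B = np.array([[0.0], [1.0]])             # placeholder B*
W, Q, R = np.eye(d), np.eye(d), np.eye(k)
AB = np.hstack([A, B])

Sigma = cp.Variable((d + k, d + k), PSD=True)
cost_mat = np.block([[Q, np.zeros((d, k))], [np.zeros((k, d)), R]])
constraints = [Sigma[:d, :d] == AB @ Sigma @ AB.T + W]
prob = cp.Problem(cp.Minimize(cp.trace(cost_mat @ Sigma)), constraints)
prob.solve()

S = Sigma.value
K = S[d:, :d] @ np.linalg.inv(S[:d, :d])   # K = Sigma_ux Sigma_xx^{-1}
print("steady-state cost:", prob.value)
print("K =", K)
```

The linear objective equals the steady-state average cost $\mathbb{E}[c_t]$, which is why its minimizer recovers the optimal $K$.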
[Timeline: a warm-start policy $K_0$, followed by epochs with fixed policies $K_1, K_2, K_3, \ldots$ There are O(log T) epochs with high probability, and Õ(√T) regret in total.]
Warm Start: obtain an initial estimate $(A_0\ B_0)$ with
$$\|(A_0\ B_0) - (A_\star\ B_\star)\|_F^2 \le O(1/\sqrt{T}).$$
After warm start:
- Maintain $V_t = \lambda I + \frac{1}{\beta}\sum_{s=1}^{t-1} z_s z_s^\top$, where $z_s = \begin{pmatrix} x_s \\ u_s \end{pmatrix}$.
- Run in epochs: compute $K_t$ using a semidefinite program; execute the fixed $K_t$ during the epoch; the epoch ends when $\det(V_t)$ is doubled.
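A sketch of the epoch bookkeeping alone (get_z is a hypothetical stand-in for the control loop producing $z_t = (x_t, u_t)$):

```python
# Maintain V_t = lam*I + (1/beta) sum_s z_s z_s' and end an epoch whenever
# det(V_t) has doubled since the epoch began.
import numpy as np

def count_epochs(T, n, lam=1.0, beta=1.0, get_z=None):
    V = lam * np.eye(n)
    _, logdet_start = np.linalg.slogdet(V)
    epochs = 1
    for t in range(T):
        z = get_z(t)
        V += np.outer(z, z) / beta
        _, logdet = np.linalg.slogdet(V)
        if logdet - logdet_start >= np.log(2):   # det(V_t) doubled -> new epoch
            epochs += 1                          # (recompute K_t here via the SDP)
            logdet_start = logdet
    return epochs

rng = np.random.default_rng(1)
print(count_epochs(10_000, n=3, get_z=lambda t: rng.standard_normal(3)))
```

Since $\det(V_t)$ grows at most polynomially in $t$ when the states stay bounded, it can double only O(log T) times, which caps the number of policy switches.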
The optimistic program. At epoch start:
- Estimate $A_\star, B_\star$ from past observations:
$$(A_t\ B_t) = \arg\min_{(A\ B)} \frac{1}{\beta}\sum_{s=1}^{t-1} \|(A\ B) z_s - x_{s+1}\|^2 + \lambda \|(A\ B) - (A_0\ B_0)\|_F^2$$
- Compute an optimistic policy by solving
$$\Sigma_t = \arg\min_{\Sigma \succeq 0}\ \Sigma \bullet \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \quad \text{s.t.}\quad \Sigma_{xx} \succeq (A_t\ B_t)\,\Sigma\,(A_t\ B_t)^\top + W - \mu\,(\Sigma \bullet V_t^{-1})\, I$$
- Output: $K_t = (\Sigma_t)_{ux} (\Sigma_t)_{xx}^{-1}$.
This convex program replaces the hard (nonconvex) optimistic problem in Abbasi-Yadkori and Szepesvári.
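A sketch of both per-epoch steps under the same assumptions (cvxpy; placeholder data). The regularized least squares has a closed form, and the relaxed constraint is a linear matrix inequality in $\Sigma$, hence convex:

```python
# (1) Ridge estimate of (A B), shrunk toward the warm start (A0 B0).
# (2) Optimistic SDP with the relaxed constraint
#     Sigma_xx >= (At Bt) Sigma (At Bt)' + W - mu * (Sigma . V_t^{-1}) I.
import cvxpy as cp
import numpy as np

def optimistic_gain(Z, Xnext, Theta0, W, Q, R, lam=1.0, beta=1.0, mu=1.0):
    d, n = W.shape[0], Z.shape[1]                    # n = d + k
    V = lam * np.eye(n) + (Z.T @ Z) / beta
    # Minimizer of (1/beta) sum_s ||Theta z_s - x_{s+1}||^2 + lam ||Theta - Theta0||_F^2:
    Theta = ((Xnext.T @ Z) / beta + lam * Theta0) @ np.linalg.inv(V)
    V_inv = np.linalg.inv(V)

    Sigma = cp.Variable((n, n), PSD=True)
    cost_mat = np.block([[Q, np.zeros((d, n - d))], [np.zeros((n - d, d)), R]])
    slack = mu * cp.trace(V_inv @ Sigma)             # mu * (Sigma . V_t^{-1})
    gap = Sigma[:d, :d] - (Theta @ Sigma @ Theta.T + W - slack * np.eye(d))
    lmi = (gap + gap.T) / 2 >> 0                     # gap is symmetric; symmetrized for the solver
    cp.Problem(cp.Minimize(cp.trace(cost_mat @ Sigma)), [lmi]).solve()

    S = Sigma.value
    return S[d:, :d] @ np.linalg.inv(S[:d, :d])      # K_t = (Sigma_t)_ux (Sigma_t)_xx^{-1}
```

The slack term grows with the uncertainty measure $\Sigma \bullet V_t^{-1}$, so the program is allowed to be more optimistic in poorly explored directions.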
Lemma (Abbasi-Yadkori and Szepesvári, 2011): Let $\Delta_t = (A_t\ B_t) - (A_\star\ B_\star)$. Then with high probability, $\mathrm{tr}(\Delta_t V_t \Delta_t^\top) \le 1$.
Since $\|V_t\| = \Theta(t)$, the lemma gives $\|\Delta_t\| = \Theta(1/\sqrt{t})$.
“Almost” the regret (disregarding policy switches and the warm start):
$$\text{regret} = \sum_{t=1}^{T} \|\Delta_t\| = O(\sqrt{T}).$$
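Spelling out the arithmetic behind this step (a filled-in detail, not on the slide):
$$\sum_{t=1}^{T} \|\Delta_t\| = O\!\left(\sum_{t=1}^{T} \frac{1}{\sqrt{t}}\right) = O\!\left(\int_{0}^{T} \frac{ds}{\sqrt{s}}\right) = O(\sqrt{T}).$$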
Challenges: unlike in MDPs, the states may be unbounded. A blow-up has low probability when $K$ is stable, but may have an unpredictable effect on the expectation, and the system may destabilize when switching between policies too often. Main technique: generate “sequentially stable” policies that keep the states bounded with high probability:
$$\|x_t\| \lesssim \kappa \gamma\, d \log T \quad \text{w.h.p.}$$
Summary: the first efficient algorithm for learning LQRs with Õ(√T) regret; resolves an open problem; shows a connection between multi-armed bandits, reinforcement learning, control, and convex optimization.
Open problems: no lower bound is known! There is evidence that the correct rate is O(log T) (Mania et al., 2019).
Poster #159