Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Asaf Cassel
Joint work with: Alon Cohen, Tomer Koren
Reinforcement Learning

At each round, the learner plays an action $u_t$; the environment responds with the next state $x_{t+1}$ and a cost $c_t$.
Discrete MDP vs. Linear Quadratic Regulator (LQR)

Discrete MDP:
- Space: $x_t \in S$, $u_t \in A$
- Transition: unstructured, $x_{t+1} \sim P(\cdot \mid x_t, u_t)$
- Costs: unstructured, $c_t = c(x_t, u_t)$
- Optimal policy: dynamic programming
- Problem size: $|S|, |A|$

Linear Quadratic Regulator (LQR):
- Space: $x_t \in \mathbb{R}^d$, $u_t \in \mathbb{R}^k$
- Transition: linear, $x_{t+1} = A_\star x_t + B_\star u_t + w_t$
- Costs: quadratic, $c_t = x_t^\top Q x_t + u_t^\top R u_t$
- Optimal policy: linear, $u_t = -K_\star x_t$
- Problem size: $d, k, \|A_\star\|, \|B_\star\|$
Adaptive LQR: minimize regret (cumulative cost) when $A_\star, B_\star$ are unknown.

Model: $x_{t+1} = A_\star x_t + B_\star u_t + w_t$, $c_t = x_t^\top Q x_t + u_t^\top R u_t$, with noise $w_t \sim \mathcal{N}(0, \sigma^2 I)$ and optimal policy $u_t = -K_\star x_t$.

Important Milestones:
- $\sqrt{T}$ regret - Abbasi-Yadkori and Szepesvári (2011)
- $T^{2/3}$ regret - Dean et al. (2018)
- $\sqrt{T}$ regret - Cohen et al. (2019), Mania et al. (2019)

Is $\sqrt{T}$ regret optimal? There were no previous lower bounds, and two sources of structure suggest it might be beatable:
- Noise: in stochastic bandits, $\sqrt{T}$ regret typically improves to $\log T$ regret.
- Objective structure: for strongly convex costs, $\sqrt{T}$ regret typically improves to $\log T$ regret.
$\tilde O(\log T)$ regret is possible, sometimes…
- $A_\star$ unknown ($B_\star$ known) $\implies$ efficient algorithm with $\tilde O(\log T)$ regret.
- $B_\star$ unknown ($A_\star$ known) $\implies$ efficient algorithm with $\tilde O\big(\log T / \lambda_{\min}(K_\star K_\star^\top)\big)$ regret.

* concurrently with Simchowitz and Foster (2020)

… but in general, $\sqrt{T}$ regret is unavoidable:
- $\Omega(\sqrt{T})$ regret lower bound for the adaptive LQR problem, even when $A_\star$ is known and $\lambda_{\min}(K_\star K_\star^\top)$ is small.
Linear Quadratic Control

$$x_{t+1} = A_\star x_t + B_\star u_t + w_t, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t, \qquad w_t \sim \mathcal{N}(0, \sigma^2 I).$$

Choose $u_1, u_2, \ldots$ that minimize the infinite-horizon average cost
$$J = \lim_{T \to \infty} \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} c_t\right].$$

The optimal policy is $u_t = -K_\star x_t$, with optimal average cost $J(K_\star)$, where $K_\star := K_\star(A_\star, B_\star, Q, R)$.
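The optimal gain $K_\star$ is obtained from a discrete algebraic Riccati equation. A minimal numerical sketch (toy matrices of my choosing, assuming SciPy is available; not code from the talk):

```python
# Sketch: compute the optimal LQR gain K_star for the u_t = -K x_t convention.
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Solve the DARE for the cost-to-go matrix P, then return
    K = (R + B'PB)^{-1} B'PA, the optimal gain for u_t = -K x_t."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

A = np.array([[0.9, 0.1], [0.0, 0.8]])  # hypothetical dynamics
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K_star, P = lqr_gain(A, B, Q, R)
J_star = 1.0 * np.trace(P)  # J(K_star) = sigma^2 tr(P) under w_t ~ N(0, sigma^2 I); sigma = 1 here
```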
Learning Objective

Regret minimization under parameter uncertainty:
$$\mathrm{Regret} = \mathbb{E}\left[\sum_{t=1}^{T} \big(c_t - J(K_\star)\big)\right].$$

Regret Reparameterization

Playing $u_t = -K_t x_t$ gives
$$\mathrm{Regret} \approx \mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big)\right],$$
as long as $K_t$ does not change too often. The reason: under a fixed strongly stable policy $u_t = -K x_t$, the average cost $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} c_t\big]$ converges to $J(K)$ exponentially fast.

Strong Stability (Cohen et al. 2018): $K \in \mathbb{R}^{k \times d}$ is $(\kappa, \gamma)$-strongly stable for $A_\star, B_\star$ if $\|K\| \le \kappa$ and there exist matrices $H \succ 0$ and $L$ such that $A_\star - B_\star K = H L H^{-1}$, with $\|L\| \le 1 - \gamma$ and $\|H\| \|H^{-1}\| \le \kappa$.
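Strong stability is a quantitative refinement of the spectral radius condition $\rho(A_\star - B_\star K) < 1$; per Cohen et al. (2018), any stabilizing $K$ is $(\kappa, \gamma)$-strongly stable for suitable $\kappa, \gamma$. A quick numerical check (my sketch, not code from the talk):

```python
# Sketch: check that a policy K stabilizes (A, B) under u_t = -K x_t.
import numpy as np

def closed_loop_radius(A, B, K):
    """Spectral radius of the closed-loop matrix A - B K; < 1 means stable."""
    return max(abs(np.linalg.eigvals(A - B @ K)))
```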
First Order Estimation

Assuming $J(K)$ is Lipschitz:
$$\mathrm{Regret} \approx \mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big)\right] \lesssim \mathbb{E}\left[\sum_{t=1}^{T} \|K_t - K_\star\|\right].$$

Perform minimal exploration to get $\|K_t - K_\star\| \le 1/\sqrt{T}$ and then play $K_t$: Regret $\approx \sqrt{T}$ + exploration cost.

Challenges: the estimation error cannot be driven below $\|K_t - K_\star\| \gtrsim 1/\sqrt{T}$, and minimal exploration only guarantees $\|K_t - K_\star\| \le T^{-1/4}$.
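Spelling out why this route is stuck at $\sqrt{T}$: even with the best achievable rate $\|K_t - K_\star\| \le 1/\sqrt{T}$ at every round, the Lipschitz bound gives
$$\mathbb{E}\left[\sum_{t=1}^{T} \|K_t - K_\star\|\right] \le T \cdot \frac{1}{\sqrt{T}} = \sqrt{T},$$
so a first-order analysis alone cannot yield logarithmic regret.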
$B_\star$ known $\implies$ observe $y_t = x_{t+1} - B_\star u_t = A_\star x_t + w_t$, where only $A_\star$ is unknown and the noise is observed.

Free exploration: the noise $w_{t-1}$ excites the state, so we "sense" $A_\star$ through $x_t$ without paying for exploration.

Least squares estimation ($\hat A_t$) error:
$$\|\hat A_t - A_\star\| \propto \frac{\sigma}{\sqrt{\lambda_{\min}\big(\sum_{s=1}^{t} w_s w_s^\top\big)}} \propto t^{-1/2}.$$
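A minimal sketch of this least squares estimator (array shapes and names are my own; assumes a recorded trajectory $x_1, \ldots, x_{T+1}$ and inputs $u_1, \ldots, u_T$):

```python
# Sketch: least squares estimate of A_star when B_star is known.
import numpy as np

def estimate_A(xs, us, B_star, reg=1e-6):
    """Regress y_t = x_{t+1} - B_star u_t on x_t to recover A_star.
    xs: states, shape (T+1, d); us: inputs, shape (T, k); reg: small ridge term."""
    X = xs[:-1]                             # regressors x_1, ..., x_T
    Y = xs[1:] - us @ B_star.T              # targets y_t = A_star x_t + w_t
    G = X.T @ X + reg * np.eye(X.shape[1])  # Gram matrix sum_t x_t x_t'
    return np.linalg.solve(G, X.T @ Y).T    # A_hat, shape (d, d)
```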
Results by Mania et al. (2019):
- Quadratic cost gap: $J(K) - J(K_\star) \le c_1 \|K - K_\star\|^2$.
- Policy estimation: $\|K_\star(\hat A, \hat B) - K_\star(A_\star, B_\star)\| \le c_2 \max\{\|\hat A - A_\star\|, \|\hat B - B_\star\|\}$.

So $1/\sqrt{t}$ estimation $\implies 1/t$ instantaneous regret $\implies \sum_t 1/t = \log T$?

Not quite… with some small probability the estimate is bad and the resulting $K_t$ destabilizes the system, in which case $J(K_t) = \infty$.
"Abort"

At every round, before playing: check $\|x_t\|$ and $\|K_t\|$ against fixed thresholds, a low probability trigger. If it fires, "abort": play the assumed-stable $K_0$ forever $\implies$ constant regret.

Overall, a low order regret term!
Algorithm for Unknown $A_\star$

After a warm-up period, proceed in epochs of doubling length ($2^i$ rounds in epoch $i$) covering the horizon $T$:
- At the start of epoch $i$: estimate $\hat A_i$ (least squares), calculate the greedy policy $K_\star(\hat A_i, B_\star)$.
- Play $K_\star(\hat A_i, B_\star)$ throughout the epoch, if no "abort"; see the sketch after this list.
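A schematic implementation under my own assumptions (the `env_step` callback applies the true dynamics; `x_max` and `K_max` are hypothetical abort thresholds; `lqr_gain` and `estimate_A` are the helpers sketched above):

```python
# Sketch of the doubling-epoch scheme: certainty-equivalent control with
# least squares re-estimation at epoch starts and an abort fallback.
import numpy as np

def run_unknown_A(env_step, x0, B_star, Q, R, K0, T, x_max, K_max):
    xs, us = [x0], []
    x, K, aborted = x0, K0, False
    t, i = 0, 0
    while t < T:
        if not aborted and t > 0:
            # Epoch start: re-estimate A_star, then play the greedy gain.
            A_hat = estimate_A(np.array(xs), np.array(us), B_star)
            K = lqr_gain(A_hat, B_star, Q, R)[0]
        for _ in range(2 ** i):  # epoch i lasts 2^i rounds
            if t >= T:
                break
            # Abort check before playing: a low probability trigger.
            if not aborted and (np.linalg.norm(x) > x_max
                                or np.linalg.norm(K, 2) > K_max):
                K, aborted = K0, True  # fall back to the stable K0 forever
            u = -K @ x
            x = env_step(x, u)  # environment: x' = A_star x + B_star u + w
            xs.append(x)
            us.append(u)
            t += 1
        i += 1
    return np.array(xs), np.array(us)
```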
Regret Decomposition

$$\mathrm{Regret} \lesssim \mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big) \,\middle|\, \text{no abort}\right] + \text{Switching Cost} + \text{Abort Cost}$$

- Switching Cost $\le$ constant $\cdot$ #epochs $\approx \log T$ (the policy changes only at epoch boundaries).
- Abort Cost $\le$ constant $\cdot$ low probability $\approx$ constant.

Putting it all together: epoch $i$ lasts $2^i$ rounds, and at its start the estimate satisfies $\|\hat A_i - A_\star\|^2 \lesssim 2^{-(i-1)}$ (one over the number of rounds observed so far), hence
$$\mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big) \,\middle|\, \text{no abort}\right] \lesssim \sum_{i=1}^{\#\text{epochs}} 2^i \,\|\hat A_i - A_\star\|^2 \lesssim \#\text{epochs} \approx \log T.$$
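Spelling out the last inequality: each epoch contributes a constant,
$$\sum_{i=1}^{\#\text{epochs}} 2^i \cdot 2^{-(i-1)} = \sum_{i=1}^{\#\text{epochs}} 2 = 2 \cdot \#\text{epochs} \approx 2 \log_2 T.$$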
Now assume $A_\star$ is known $\implies$ observe $y_t = x_{t+1} - A_\star x_t = B_\star u_t + w_t$, where only $B_\star$ is unknown and the noise is observed.

Free exploration again comes from the noise, but now through the actions: $u_t = -K_t x_t$ carries $K_t w_{t-1}$, so exploration is governed by $K_t K_t^\top$. Since $K_t K_t^\top \to K_\star K_\star^\top$, identifying $B_\star$ requires non-degeneracy: $K_\star K_\star^\top \succ \mu_\star I$ for some $\mu_\star > 0$.
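The mirror-image estimator (a sketch under the same assumptions as `estimate_A` above); note that its Gram matrix $\sum_t u_t u_t^\top$ inherits $K_t K_t^\top$, which is exactly where the non-degeneracy condition enters:

```python
# Sketch: least squares estimate of B_star when A_star is known.
import numpy as np

def estimate_B(xs, us, A_star, reg=1e-6):
    """Regress y_t = x_{t+1} - A_star x_t on u_t to recover B_star."""
    U = np.asarray(us)                      # inputs u_1, ..., u_T, shape (T, k)
    Y = xs[1:] - xs[:-1] @ A_star.T         # targets y_t = B_star u_t + w_t
    G = U.T @ U + reg * np.eye(U.shape[1])  # Gram matrix; degenerates with K_star K_star'
    return np.linalg.solve(G, U.T @ Y).T    # B_hat, shape (d, k)
```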
Construction in 1-D

Main idea: the construction is inspired by the upper bound $\tilde O\big(\log T / \lambda_{\min}(K_\star K_\star^\top)\big)$, so make $k_\star$ near degenerate:
$$x_{t+1} = \tfrac{1}{2} x_t \pm \varepsilon u_t + w_t, \qquad c_t = x_t^2 + u_t^2 \quad \implies \quad k_\star \approx \mp \varepsilon.$$
Learner's Dilemma

- Bad exploration ($\sum_{t=1}^{T} u_t^2$ too small) $\implies$ failed to identify $\mathrm{sign}(k_\star)$.
- Good exploration identifies $\mathrm{sign}(k_\star)$, but then $\mathrm{Regret} \gtrsim \sum_{t=1}^{T} u_t^2$.

Best tradeoff: $\varepsilon = T^{-1/4} \implies \Omega(\sigma^2 \sqrt{T})$ regret lower bound.
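A heuristic for why $\varepsilon = T^{-1/4}$ balances the dilemma (my back-of-the-envelope, not the talk's formal argument): distinguishing $b_\star = +\varepsilon$ from $b_\star = -\varepsilon$ out of $y_t = b_\star u_t + w_t$ requires signal-to-noise $\varepsilon^2 \sum_t u_t^2 / \sigma^2 \gtrsim 1$, i.e. exploration cost $\sum_t u_t^2 \gtrsim \sigma^2/\varepsilon^2$, while mis-identifying $\mathrm{sign}(k_\star)$ costs on the order of $\sigma^2 \varepsilon^2 T$. Equating the two,
$$\frac{\sigma^2}{\varepsilon^2} \approx \sigma^2 \varepsilon^2 T \iff \varepsilon = T^{-1/4} \implies \mathrm{Regret} \gtrsim \sigma^2 \sqrt{T}.$$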
Summary

$\tilde O(\log T)$ regret is achievable when i) $A_\star$ is unknown ($B_\star$ known), or ii) $B_\star$ is unknown ($A_\star$ known) and $K_\star$ is non-degenerate; in general, $\sqrt{T}$ regret is unavoidable.