

SLIDE 1

Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently

Asaf Cassel

Joint work with: Alon Cohen, Tomer Koren

SLIDE 2

Reinforcement Learning

At each round the learner plays an action ut, then observes the next state xt+1 and incurs a cost ct.

SLIDES 3–4

Reinforcement Learning

At each round the learner plays an action ut, then observes the next state xt+1 and incurs a cost ct.

|                | Discrete MDP                      | Linear Quadratic Regulator (LQR)   |
| Space          | xt ∈ S, ut ∈ A                    | xt ∈ ℝ^d, ut ∈ ℝ^k                 |
| Transition     | Unstructured: xt+1 ∼ P( ⋅ |xt, ut) | Linear: xt+1 = A⋆xt + B⋆ut + wt    |
| Costs          | Unstructured: ct = c(xt, ut)      | Quadratic: ct = xt⊤Q xt + ut⊤R ut  |
| Optimal Policy | Dynamic programming               | Linear: ut = −K⋆xt                 |
| Problem Size   | |S|, |A|                          | d, k, ∥A⋆∥, ∥B⋆∥                   |

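The LQR column above can be made concrete with a tiny simulation of the dynamics and cost; a minimal sketch, assuming made-up matrices A⋆, B⋆, Q, R and an arbitrary stabilizing gain K (all illustrative, not from the talk):

```python
import numpy as np

def rollout(A, B, Q, R, K, T, sigma=1.0, seed=0):
    """Simulate x_{t+1} = A x_t + B u_t + w_t with u_t = -K x_t and
    return the empirical average of c_t = x_t'Q x_t + u_t'R u_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[0])
    total = 0.0
    for _ in range(T):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + sigma * rng.standard_normal(len(x))
    return total / T

# Illustrative (made-up) system and a stabilizing gain:
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[1.0, 2.0]])           # spectral radius of A - BK is 0.9
avg_cost = rollout(A, B, Q, R, K, T=5000)
print(avg_cost)                      # finite for a stable closed loop
```

For a stable closed loop the empirical average converges to the infinite-horizon average cost J(K) defined later in the talk.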
SLIDES 5–8

“Adaptive Control”

Minimize regret (costs) when A⋆, B⋆ are unknown:

  • Transition: xt+1 = A⋆xt + B⋆ut + wt
  • Cost: ct = xt⊤Q xt + ut⊤R ut
  • Optimal Policy: ut = −K⋆xt
  • i.i.d noise: wt ∼ 𝒩(0, σ²I)

Important Milestones:

  • 1. Non-efficient √T regret - Abbasi-Yadkori and Szepesvári (2011)
  • 2. Efficient T^{2/3} regret - Dean et al. (2018)
  • 3. First efficient √T regret - Cohen et al. (2019), Mania et al. (2019)

Is √T regret optimal? No previous lower bounds.

  • Noise: typically √T regret, yet log T regret is achievable in stochastic bandits
  • Objective structure: typically √T regret, yet log T regret is achievable for strongly convex costs

SLIDES 9–10

Main Results

  • If A⋆ is unknown (B⋆ known) ⟹ efficient algorithm with Õ(log T) regret
  • If B⋆ is unknown (A⋆ known) ⟹ efficient algorithm with Õ(log T / λmin(K⋆K⋆⊤)) regret
  • Õ(⋅) only hides polynomial dependence on problem parameters

log T regret is possible, sometimes… but in general, √T regret is unavoidable:

  • First* Ω(√T) regret lower bound for the adaptive LQR problem
  • Holds even when A⋆ is known
  • Construction relies on small λmin(K⋆K⋆⊤)

* concurrently with Simchowitz and Foster (2020)

SLIDES 11–12

Formalities

Linear Quadratic Control

Choose u1, u2, … that minimize

J = lim_{T→∞} 𝔼[(1/T) ∑_{t=1}^{T} ct]

  • Optimal policy: ut = −K⋆xt, with optimal infinite-horizon average cost J(K⋆)
  • K⋆ := K⋆(A⋆, B⋆, Q, R) can be efficiently calculated (Riccati equation)

Learning Objective

Regret minimization under parameter uncertainty:

Regret = 𝔼[∑_{t=1}^{T} (ct − J(K⋆))]

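The Riccati computation of K⋆ can be sketched with a simple fixed-point iteration; a minimal illustration with made-up system matrices (a real implementation would use a dedicated DARE solver such as SciPy's):

```python
import numpy as np

def dare_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati equation
        P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA
    and return K* = (R + B'PB)^{-1} B'PA, so the optimal policy
    is u_t = -K* x_t."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
    return K, P

A = np.array([[1.0, 0.1], [0.0, 1.0]])  # illustrative A*
B = np.array([[0.0], [0.1]])            # illustrative B*
Q, R = np.eye(2), np.eye(1)
K, P = dare_gain(A, B, Q, R)
# The closed loop A - B K* should be stable (spectral radius < 1):
print(np.max(np.abs(np.linalg.eigvals(A - B @ K))))
```

The iteration converges for stabilizable, detectable systems, which is the standing assumption in this setting.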
SLIDES 13–14

Formalities

Regret Reparameterization

Playing ut = −Ktxt ⟹ Regret ≈ 𝔼[∑_{t=1}^{T} (J(Kt) − J(K⋆))]*

*As long as Kt does not change too often

Strong Stability (Cohen et al. 2018)

Playing ut = −Kxt ⟹ 𝔼[(1/T) ∑_{t=1}^{T} ct] → J(K) exponentially fast.

Definition: K ∈ ℝ^{k×d} is (κ, γ)-strongly stable for (A⋆, B⋆) if ∃ H, L such that:

  • 1. A⋆ + B⋆K = HLH⁻¹
  • 2. ∥L∥ ≤ 1 − γ, and ∥H∥, ∥H⁻¹∥, ∥K∥ ≤ κ

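The strong-stability definition can be checked mechanically when (H, L) come from an eigendecomposition of the closed loop; a minimal sketch, assuming a diagonalizable closed loop and using the definition's convention A + BK (policy u_t = K x_t), with made-up matrices:

```python
import numpy as np

def strong_stability_witness(A, B, K):
    """Diagonalize the closed loop M = A + B K as M = H L H^{-1}
    (the eigendecomposition is one convenient witness) and return
    (kappa, gamma) for which K is (kappa, gamma)-strongly stable.
    Assumes M is diagonalizable."""
    M = A + B @ K
    eigvals, H = np.linalg.eig(M)
    gamma = 1.0 - np.max(np.abs(eigvals))   # ||L|| is the spectral radius here
    kappa = max(np.linalg.norm(H, 2),
                np.linalg.norm(np.linalg.inv(H), 2),
                np.linalg.norm(K, 2))
    return kappa, gamma

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-2.0, -5.0]])                # convention here: u_t = K x_t
kappa, gamma = strong_stability_witness(A, B, K)
print(kappa, gamma)                         # gamma > 0 means strongly stable
```

Note the definition absorbs the sign convention: for a policy written as u_t = −K⋆x_t, plug in K = −K⋆.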
SLIDES 15–17

A Recipe for √T Regret?

Assuming J(K) is Lipschitz:

Regret ≈ 𝔼[∑_{t=1}^{T} (J(Kt) − J(K⋆))] ⪅ 𝔼[∑_{t=1}^{T} ∥Kt − K⋆∥]

First order estimation: perform minimal exploration to get ∥Kt − K⋆∥ ≤ 1/√T and then play Kt:

Regret ≈ √T + exploration cost

Challenges

  • Estimation rate is ∥Kt − K⋆∥ ⪆ 1/√T
  • Exploration can be expensive! e.g., ∥Kt − K⋆∥ ≤ T^{−1/4} in previous work

SLIDES 18–19

Case 1: Unknown A⋆ (Known B⋆)

B⋆ known ⟹ yt = xt+1 − B⋆ut = A⋆xt + wt, where A⋆ is unknown, xt is observed, and wt is noise.

  • We “sense” A⋆ via xt
  • Free exploration by wt−1!

Least Squares Estimation (Ât)

Error: ∥Ât − A⋆∥ ∝ σ / √(λmin(∑_{s=1}^{t} ws ws⊤)) ∝ t^{−1/2}

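The least-squares step above can be sketched in a few lines on synthetic data; made-up A⋆, B⋆ (assumptions of this sketch), with the regression targets yt = xt+1 − B⋆ut exactly as in the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, sigma = 2, 20000, 1.0
A_star = np.array([[0.5, 0.1], [0.0, 0.4]])   # illustrative, stable A* (unknown)
B_star = np.array([[0.0], [0.1]])             # known B*
K = np.zeros((1, d))                          # even K = 0 works: the noise excites x_t

X, Y = [], []
x = np.zeros(d)
for _ in range(T):
    u = -K @ x
    x_next = A_star @ x + B_star @ u + sigma * rng.standard_normal(d)
    X.append(x)
    Y.append(x_next - B_star @ u)   # y_t = x_{t+1} - B* u_t = A* x_t + w_t
    x = x_next

X, Y = np.array(X), np.array(Y)
# Least squares: A_hat = argmin_A sum_t ||y_t - A x_t||^2
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
print(np.linalg.norm(A_hat - A_star))   # shrinks like 1/sqrt(T)
```

This is the "free exploration" point: the covariates xt are excited by the past noise, so no costly deliberate exploration is needed to identify A⋆.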
SLIDES 20–22

Objective Structure

Results by Mania et al. (2019):

  • “Strong Convexity”: J(K) − J(K⋆) ≤ c1 ∥K − K⋆∥²
  • System estimation ⟹ policy estimation: ∥K⋆(Â, B̂) − K⋆(A⋆, B⋆)∥ ≤ c2 max{∥Â − A⋆∥, ∥B̂ − B⋆∥}

So: 1/√t estimation ⟹ 1/√t optimal-policy error ⟹ ∑_t 1/t = log T regret?

Not Quite…

  • If Kt is not stable ⟹ J(Kt) = ∞
  • A low probability event contributes unbounded regret

SLIDES 23–25

Algorithm and Abort Mechanism

At every round, before playing:

  • Are ∥xt∥, ∥Kt∥ bounded within their high probability bounds?
  • Otherwise “abort”: play K0 (assumed stable) forever ⟹ constant regret

“Abort” is a low probability trigger ⟹ overall a low order regret term!

Algorithm for Unknown A⋆

Epochs double in length (epoch i lasts 2^i rounds) up to horizon T:

  • Warm-up
  • Epoch i start: estimate Âi (LSE), calculate greedy K⋆(Âi, B⋆)
  • Play K⋆(Âi, B⋆) if no “abort”

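The epoch scheme can be sketched as follows. This is a structural sketch only, not the paper's exact algorithm: the warm-up gain, abort threshold, and Riccati/least-squares subroutines are illustrative assumptions.

```python
import numpy as np

def riccati_gain(A, B, Q, R, iters=300):
    # Value-iterate the Riccati equation to get the greedy gain K*(A, B).
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
    return K

def epoch_doubling_lqr(A_star, B_star, Q, R, T, x_bound=50.0, seed=0):
    """Sketch for unknown A* (B* known): at each epoch start, re-estimate
    A by least squares on y_t = x_{t+1} - B* u_t, recompute the greedy
    gain, and play it for 2^i rounds; "abort" to K0 if ||x_t|| blows up."""
    rng = np.random.default_rng(seed)
    d = A_star.shape[0]
    K0 = riccati_gain(np.zeros_like(A_star), B_star, Q, R)  # warm-up gain (assumption)
    K, aborted = K0, False
    X, Y = [], []
    x, t, i = np.zeros(d), 0, 0
    while t < T:
        for _ in range(2 ** i):                 # epoch i has length 2^i
            if t >= T:
                break
            if np.linalg.norm(x) > x_bound:     # abort trigger
                K, aborted = K0, True
            u = -K @ x
            x_next = A_star @ x + B_star @ u + rng.standard_normal(d)
            X.append(x); Y.append(x_next - B_star @ u)
            x, t = x_next, t + 1
        if not aborted:
            A_hat = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)[0].T
            K = riccati_gain(A_hat, B_star, Q, R)  # greedy w.r.t. the estimate
        i += 1
    return K, aborted

A_star = np.array([[0.5, 0.1], [0.0, 0.4]])     # illustrative system
B_star = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K_learned, aborted = epoch_doubling_lqr(A_star, B_star, Q, R, T=2000)
print(aborted)
```

Because the gain only switches at epoch boundaries, the number of policy switches is O(log T), which is what makes the regret reparameterization on the earlier slide valid.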
SLIDES 26–27

Analysis Overview

Regret Decomposition

Regret ⪅ 𝔼[∑_{t=1}^{T} (J(Kt) − J(K⋆)) | no abort] + Switching Cost + Abort Cost

  • Switching Cost ≤ constant ⋅ #epochs ≈ log T
  • Abort Cost ≤ constant ⋅ low probability ≈ constant

Putting it all together

Since ∥Âi − A⋆∥² ⪅ 2^{−(i−1)} (one over the preceding epoch length):

𝔼[∑_{t=1}^{T} (J(Kt) − J(K⋆)) | no abort] ⪅ ∑_{i=1}^{#epochs} 2^i ∥Âi − A⋆∥² ⪅ #epochs ≈ log T

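Spelled out, the per-epoch terms telescope; a short derivation, assuming the estimation rate ∥Âi − A⋆∥² ⪅ 2^{−(i−1)} stated on the slide:

```latex
\mathbb{E}\Big[\sum_{t=1}^{T}\big(J(K_t)-J(K_\star)\big)\,\Big|\,\text{no abort}\Big]
\;\lesssim\; \sum_{i=1}^{\#\text{epochs}}
\underbrace{2^{i}}_{\text{epoch length}}\cdot
\underbrace{\|\hat{A}_i-A_\star\|^{2}}_{\lesssim\,2^{-(i-1)}}
\;\lesssim\; \sum_{i=1}^{\#\text{epochs}} 2
\;=\; 2\,\#\text{epochs}
\;\approx\; \log T.
```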
SLIDE 28

Case 2: Unknown B⋆ (Known A⋆)

A⋆ known ⟹ yt = xt+1 − A⋆xt = B⋆ut + wt, where B⋆ is unknown, ut is observed, and wt is noise.

  • Free exploration by Ktwt−1: exploration scales with KtKt⊤
  • Must have K⋆K⋆⊤ ≻ μ⋆I
  • Convergence KtKt⊤ → K⋆K⋆⊤ ensured by Adaptive Warm-up!
  • No need to know μ⋆

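Mirroring Case 1, a minimal regression sketch for B⋆ (made-up matrices, all assumptions of this sketch); note the covariates are now the actions ut, so identification hinges on ∑ ut ut⊤ being well conditioned, i.e., on KtKt⊤ being non-degenerate:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 2, 20000
A_star = np.array([[0.5, 0.1], [0.0, 0.4]])   # known A*
B_star = np.array([[0.2, 0.0], [0.0, 0.3]])   # unknown B* (illustrative)
K = np.eye(d)                                 # K K^T well conditioned: good exploration

U, Y = [], []
x = np.zeros(d)
for _ in range(T):
    u = -K @ x                                # u_t = -K x_t carries K w_{t-1}
    x_next = A_star @ x + B_star @ u + rng.standard_normal(d)
    U.append(u)
    Y.append(x_next - A_star @ x)             # y_t = x_{t+1} - A* x_t = B* u_t + w_t
    x = x_next

U, Y = np.array(U), np.array(Y)
B_hat = np.linalg.lstsq(U, Y, rcond=None)[0].T
print(np.linalg.norm(B_hat - B_star))
```

If instead K had a (near-)zero singular direction, the actions would never excite that direction of B⋆, which is exactly the degeneracy the lower bound exploits.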
SLIDES 29–30

Lower Bound

Construction in 1-D

Construction inspired by the upper bound ⟹ near degenerate k⋆:

xt+1 = (1/2) xt ± ε ut + wt,  ct = xt² + ut²  ⟹  k⋆ ≈ ∓ε

Main Ideas: Learner’s Dilemma

  • Bad exploration ⟹ failed to identify sign(k⋆)
  • Good exploration, but Regret ⪆ ∑_{t=1}^{T} ut²

Best tradeoff: ε = T^{−1/4} gives an Ω(σ²√T) regret lower bound

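The near-degeneracy of k⋆ in this construction can be checked numerically with the scalar Riccati equation; a quick illustration (sign conventions may differ from the slides, but the point is that |k⋆| = O(ε) and its sign flips with the dynamics):

```python
def scalar_lqr_gain(a, b, q=1.0, r=1.0, iters=200):
    """Scalar Riccati iteration for x_{t+1} = a x + b u + w with
    cost c = q x^2 + r u^2; returns k* for the policy u = -k* x."""
    p = q
    for _ in range(iters):
        k = a * b * p / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
    return a * b * p / (r + b * b * p)

eps = 1e-2
k_plus = scalar_lqr_gain(0.5, +eps)   # dynamics x_{t+1} = x/2 + eps*u + w
k_minus = scalar_lqr_gain(0.5, -eps)  # dynamics x_{t+1} = x/2 - eps*u + w
print(k_plus, k_minus)                # both O(eps) in magnitude, opposite signs
```

Since the two systems are nearly indistinguishable without spending ∑ ut² on exploration, yet demand oppositely-signed controls, the learner faces exactly the dilemma above.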
SLIDE 31

Summary

  • log T regret is possible sometimes:
      i) A⋆ unknown (B⋆ known)
      ii) B⋆ unknown (A⋆ known) & K⋆ non-degenerate
  • In general, √T regret is unavoidable

See you at the Q&A session!