Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Asaf Cassel; joint work with Alon Cohen and Tomer Koren


  1. Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently. Asaf Cassel; joint work with Alon Cohen and Tomer Koren.

  2. Reinforcement Learning: at each step the learner plays an action u_t, transitions to the next state x_{t+1}, and incurs a cost c_t.

  3. Reinforcement Learning, from discrete MDPs to the Linear Quadratic Regulator (LQR). The discrete MDP column of the comparison:
     • Space: x_t ∈ S, u_t ∈ A
     • Transition: unstructured, x_{t+1} ∼ P(⋅ | x_t, u_t)
     • Costs: unstructured, c_t = c(x_t, u_t)
     • Optimal policy: dynamic programming
     • Problem size: |S|, |A|

  4. Reinforcement Learning, discrete MDP vs. Linear Quadratic Regulator (LQR):
     • Space: MDP: x_t ∈ S, u_t ∈ A; LQR: x_t ∈ ℝ^d, u_t ∈ ℝ^k
     • Transition: MDP: unstructured, x_{t+1} ∼ P(⋅ | x_t, u_t); LQR: linear, x_{t+1} = A_⋆ x_t + B_⋆ u_t + w_t
     • Costs: MDP: unstructured, c_t = c(x_t, u_t); LQR: quadratic, c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t
     • Optimal policy: MDP: dynamic programming; LQR: u_t = −K_⋆ x_t
     • Problem size: MDP: |S|, |A|; LQR: d, k, ∥A_⋆∥, ∥B_⋆∥
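To make the LQR column concrete, here is a minimal simulation sketch of the model above. The matrices A_star, B_star, Q, R and the gain K are illustrative placeholders of my own, not values from the talk.

```python
# Minimal LQR simulation sketch (placeholder matrices, not from the talk).
import numpy as np

rng = np.random.default_rng(0)
d, k, T, sigma = 2, 1, 1000, 0.1

A_star = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed example dynamics
B_star = np.array([[0.0], [1.0]])
Q, R = np.eye(d), np.eye(k)
K = np.array([[0.5, 1.0]])                    # some stabilizing gain (placeholder)

x = np.zeros(d)
total_cost = 0.0
for t in range(T):
    u = -K @ x                                # linear state-feedback policy u_t = -K x_t
    total_cost += x @ Q @ x + u @ R @ u       # quadratic cost c_t
    w = sigma * rng.standard_normal(d)        # i.i.d. Gaussian noise w_t
    x = A_star @ x + B_star @ u + w           # linear transition

print("average cost J ≈", total_cost / T)
```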

  5. “Adaptive Control”. The setting (carried through the following slides):
     • Transition: x_{t+1} = A_⋆ x_t + B_⋆ u_t + w_t
     • Cost: c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t
     • Optimal policy: u_t = −K_⋆ x_t
     • i.i.d. noise: w_t ∼ 𝒩(0, σ²I)
     Goal: minimize regret (cumulative cost) when A_⋆, B_⋆ are unknown.
     Important milestones:
     1. Non-efficient √T regret - Abbasi-Yadkori and Szepesvári (2011)
     2. Efficient T^{2/3} regret - Dean et al. (2018)
     3. First efficient √T regret - Cohen et al. (2019), Mania et al. (2019)

  6. “Adaptive Control”, continued (same setting and milestones as slide 5): Is √T regret optimal? There were no previous lower bounds.

  7. “Adaptive Control”, continued. Noise structure: typically √T regret, as in stochastic bandits.

  8. “Adaptive Control”, continued. Noise structure: typically √T regret, as in stochastic bandits. Objective structure: typically log T regret for strongly convex costs.

  9. Main Results: log T regret is possible, sometimes…
     • If A_⋆ is unknown (B_⋆ known) ⟹ an efficient algorithm with Õ(log T) regret.
     • If B_⋆ is unknown (A_⋆ known) ⟹ an efficient algorithm with Õ(log T / λ_min(K_⋆ K_⋆^⊤)) regret.
     Õ(⋅) hides only polynomial dependence on the problem parameters.

  10. Main Results, continued: … but in general, √T regret is unavoidable.
     • First* Ω(√T) regret lower bound for the adaptive LQR problem.
     • Holds even when A_⋆ is known.
     • The construction relies on λ_min(K_⋆ K_⋆^⊤) being small.
     * Concurrently with Simchowitz and Foster (2020).

  11. Formalities: Linear Quadratic Control.
     • Choose u_1, u_2, … that minimize J = lim_{T→∞} 𝔼[ (1/T) ∑_{t=1}^T c_t ].
     • Optimal policy: u_t = −K_⋆ x_t, with optimal infinite-horizon average cost J(K_⋆).
     • K_⋆ := K_⋆(A_⋆, B_⋆, Q, R) can be computed efficiently (Riccati equation).

  12. Formalities, continued. Learning objective: regret minimization under parameter uncertainty,
     Regret = 𝔼[ ∑_{t=1}^T (c_t − J(K_⋆)) ].
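As a concrete illustration of the Riccati step on slide 11, the sketch below computes K_⋆ with SciPy's discrete-time Riccati solver. The slides only say "Riccati equation"; using this solver is my choice, and the matrices are placeholder values.

```python
# Sketch: optimal LQR gain via the discrete algebraic Riccati equation (DARE).
import numpy as np
from scipy.linalg import solve_discrete_are

A_star = np.array([[1.0, 0.1], [0.0, 1.0]])   # placeholder system
B_star = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A_star, B_star, Q, R)            # positive-definite DARE solution
K_star = np.linalg.solve(R + B_star.T @ P @ B_star,
                         B_star.T @ P @ A_star)         # K_* = (R + B^T P B)^{-1} B^T P A
# Optimal policy: u_t = -K_star @ x_t; with w_t ~ N(0, sigma^2 I) the optimal
# average cost is J(K_*) = sigma^2 * trace(P).
print(K_star)
```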

  13. Formalities: Regret Reparameterization.
     Playing u_t = −K_t x_t ⟹ Regret ≈ 𝔼[ ∑_{t=1}^T (J(K_t) − J(K_⋆)) ]*
     *As long as K_t does not change too often.

  14. Formalities, continued. Strong Stability (Cohen et al. 2018):
     Playing u_t = −K x_t ⟹ 𝔼[c_t] → J(K) exponentially fast.
     Definition: K ∈ ℝ^{k×d} is (κ, γ)-strongly stable for (A_⋆, B_⋆) if there exist H, L such that:
     1. A_⋆ + B_⋆ K = H L H^{−1}
     2. ∥L∥ ≤ 1 − γ, and ∥H∥, ∥H^{−1}∥, ∥K∥ ≤ κ
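A rough numerical check of this definition, sketched below: take the eigendecomposition of the closed-loop matrix as one candidate (H, L) pair. This is only a sufficient test for a given (κ, γ) (a different H could still certify strong stability when this candidate fails), it is not the construction from the paper, and the function name is mine.

```python
# Sufficient numerical check of (kappa, gamma)-strong stability using the
# eigendecomposition of the closed-loop matrix as the candidate (H, L).
import numpy as np

def is_strongly_stable(A, B, K, kappa, gamma):
    M = A + B @ K                      # closed-loop matrix from the definition
    eigvals, H = np.linalg.eig(M)      # candidate: H = eigenvector matrix, L = diag(eigvals)
    L = np.diag(eigvals)
    Hinv = np.linalg.inv(H)
    ok_L = np.linalg.norm(L, 2) <= 1 - gamma
    ok_norms = max(np.linalg.norm(H, 2),
                   np.linalg.norm(Hinv, 2),
                   np.linalg.norm(K, 2)) <= kappa
    return bool(ok_L and ok_norms)
```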

  15. A Recipe for √T Regret? First-order estimation: assuming J(K) is Lipschitz,
     Regret ≈ 𝔼[ ∑_{t=1}^T (J(K_t) − J(K_⋆)) ] ⪅ 𝔼[ ∑_{t=1}^T ∥K_t − K_⋆∥ ].

  16. A Recipe for √T Regret?, continued: perform minimal exploration to get ∥K_t − K_⋆∥ ≤ 1/√T, then play K_t:
     Regret ≈ √T + exploration cost.

  17. A Recipe for √T Regret?, continued. Challenges:
     • The estimation rate is ∥K_t − K_⋆∥ ⪆ 1/√T.
     • Exploration can be expensive! E.g., ∥K_t − K_⋆∥ ≤ T^{−1/4} in previous work.
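To see why expensive exploration hurts, here is a back-of-the-envelope explore-then-commit calculation under the first-order bound above (my illustration, not from the slides): exploring for T_0 rounds costs about T_0 and yields ∥K − K_⋆∥ ⪅ T_0^{−1/2}, so

     Regret ⪅ T_0 + T ⋅ T_0^{−1/2},

which is minimized at T_0 ≈ T^{2/3} and gives T^{2/3} regret rather than the √T target.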

  18. Case 1: Unknown A_⋆ (known B_⋆).
     B_⋆ known ⟹ y_t := x_{t+1} − B_⋆ u_t = A_⋆ x_t + noise is observed.
     We “sense” the unknown A_⋆ via x_t.

  19. Case 1, continued. Least squares estimation of A_⋆ (estimate Â_t):
     Free exploration by the noise w_t! Error: ∥Â_t − A_⋆∥ ∝ σ / √(λ_min(∑_{s=1}^{t−1} w_s w_s^⊤)) ∝ T^{−1/2}.
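A minimal least-squares sketch for this step, assuming a recorded trajectory; the array layout and function name are mine, not from the talk. Form y_t = x_{t+1} − B_⋆ u_t and regress it on x_t.

```python
# Least-squares estimate of A_star when B_star is known.
import numpy as np

def estimate_A(xs, us, B_star):
    """xs: (T+1, d) states, us: (T, k) actions; returns the least-squares estimate of A_star."""
    Y = xs[1:] - us @ B_star.T           # targets y_t = x_{t+1} - B_star u_t, shape (T, d)
    X = xs[:-1]                          # regressors x_t, shape (T, d)
    # Solve min_A sum_t ||y_t - A x_t||^2, i.e. Y ≈ X A^T in matrix form.
    A_hat_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A_hat_T.T
```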
