Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Asaf Cassel
Joint work with: Alon Cohen, Tomer Koren
Reinforcement Learning

At each round, the learner plays an action $u_t$; the environment responds with the next state $x_{t+1}$ and a cost $c_t$.
Discrete MDP vs. Linear Quadratic Regulator (LQR)

Discrete MDP:
- Space: $x_t \in S$, $u_t \in A$
- Transition: unstructured, $x_{t+1} \sim P(\cdot \mid x_t, u_t)$
- Costs: unstructured, $c_t = c(x_t, u_t)$
- Optimal policy: dynamic programming
- Problem size: $|S|, |A|$

Linear Quadratic Regulator (LQR):
- Space: $x_t \in \mathbb{R}^d$, $u_t \in \mathbb{R}^k$
- Transition: linear, $x_{t+1} = A_\star x_t + B_\star u_t + w_t$
- Costs: quadratic, $c_t = x_t^\top Q x_t + u_t^\top R u_t$
- Optimal policy: linear, $u_t = -K_\star x_t$
- Problem size: $d, k, \|A_\star\|, \|B_\star\|$
Adaptive LQR: minimize regret (cumulative cost) when $A_\star, B_\star$ are unknown.

Model: $x_{t+1} = A_\star x_t + B_\star u_t + w_t$, $c_t = x_t^\top Q x_t + u_t^\top R u_t$, with noise $w_t \sim \mathcal{N}(0, \sigma^2 I)$ and optimal policy $u_t = -K_\star x_t$.

Important Milestones:
- $\sqrt{T}$ regret - Abbasi-Yadkori and Szepesvári (2011)
- $T^{2/3}$ regret - Dean et al. (2018)
- $\sqrt{T}$ regret - Cohen et al. (2019), Mania et al. (2019)

Is $\sqrt{T}$ regret optimal? There were no previous lower bounds, and two sources of structure suggest it might be beatable:
- Noise: in stochastic bandits, $\sqrt{T}$ regret typically improves to $\log T$ regret.
- Objective structure: for strongly convex costs, $\sqrt{T}$ regret typically improves to $\log T$ regret.
$\tilde O(\log T)$ regret is possible, sometimes…
- $A_\star$ unknown ($B_\star$ known) $\implies$ efficient algorithm with $\tilde O(\log T)$ regret.
- $B_\star$ unknown ($A_\star$ known) $\implies$ efficient algorithm with $\tilde O\big(\log T / \lambda_{\min}(K_\star K_\star^\top)\big)$ regret.

* concurrently with Simchowitz and Foster (2020)

… but in general, $\sqrt{T}$ regret is unavoidable:
- $\Omega(\sqrt{T})$ regret lower bound for the adaptive LQR problem, even when $A_\star$ is known and $\lambda_{\min}(K_\star K_\star^\top)$ is small.
Linear Quadratic Control

$$x_{t+1} = A_\star x_t + B_\star u_t + w_t, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t, \qquad w_t \sim \mathcal{N}(0, \sigma^2 I).$$

Choose $u_1, u_2, \ldots$ that minimize the infinite-horizon average cost
$$J = \lim_{T \to \infty} \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} c_t\right].$$

The optimal policy is $u_t = -K_\star x_t$, with optimal average cost $J(K_\star)$, where $K_\star := K_\star(A_\star, B_\star, Q, R)$.
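The optimal gain $K_\star$ is obtained from a discrete algebraic Riccati equation. A minimal numerical sketch (toy matrices of my choosing, assuming SciPy is available; not code from the talk):

```python
# Sketch: compute the optimal LQR gain K_star for the u_t = -K x_t convention.
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Solve the DARE for the cost-to-go matrix P, then return
    K = (R + B'PB)^{-1} B'PA, the optimal gain for u_t = -K x_t."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

A = np.array([[0.9, 0.1], [0.0, 0.8]])  # hypothetical dynamics
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K_star, P = lqr_gain(A, B, Q, R)
J_star = 1.0 * np.trace(P)  # J(K_star) = sigma^2 tr(P) under w_t ~ N(0, sigma^2 I); sigma = 1 here
```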
Learning Objective

Regret minimization under parameter uncertainty:
$$\mathrm{Regret} = \mathbb{E}\left[\sum_{t=1}^{T} \big(c_t - J(K_\star)\big)\right].$$

Regret Reparameterization

Playing $u_t = -K_t x_t$ gives
$$\mathrm{Regret} \approx \mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big)\right],$$
as long as $K_t$ does not change too often. The reason: under a fixed strongly stable policy $u_t = -K x_t$, the average cost $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} c_t\big]$ converges to $J(K)$ exponentially fast.

Strong Stability (Cohen et al. 2018): $K \in \mathbb{R}^{k \times d}$ is $(\kappa, \gamma)$-strongly stable for $A_\star, B_\star$ if $\|K\| \le \kappa$ and there exist matrices $H \succ 0$ and $L$ such that $A_\star - B_\star K = H L H^{-1}$, with $\|L\| \le 1 - \gamma$ and $\|H\| \|H^{-1}\| \le \kappa$.
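Strong stability is a quantitative refinement of the spectral radius condition $\rho(A_\star - B_\star K) < 1$; per Cohen et al. (2018), any stabilizing $K$ is $(\kappa, \gamma)$-strongly stable for suitable $\kappa, \gamma$. A quick numerical check (my sketch, not code from the talk):

```python
# Sketch: check that a policy K stabilizes (A, B) under u_t = -K x_t.
import numpy as np

def closed_loop_radius(A, B, K):
    """Spectral radius of the closed-loop matrix A - B K; < 1 means stable."""
    return max(abs(np.linalg.eigvals(A - B @ K)))
```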
First Order Estimation

Assuming $J(K)$ is Lipschitz:
$$\mathrm{Regret} \approx \mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big)\right] \lesssim \mathbb{E}\left[\sum_{t=1}^{T} \|K_t - K_\star\|\right].$$

Perform minimal exploration to get $\|K_t - K_\star\| \le 1/\sqrt{T}$ and then play $K_t$: Regret $\approx \sqrt{T}$ + exploration cost.

Challenges: the estimation error cannot be driven below $\|K_t - K_\star\| \gtrsim 1/\sqrt{T}$, and minimal exploration only guarantees $\|K_t - K_\star\| \le T^{-1/4}$.
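Spelling out why this route is stuck at $\sqrt{T}$: even with the best achievable rate $\|K_t - K_\star\| \le 1/\sqrt{T}$ at every round, the Lipschitz bound gives
$$\mathbb{E}\left[\sum_{t=1}^{T} \|K_t - K_\star\|\right] \le T \cdot \frac{1}{\sqrt{T}} = \sqrt{T},$$
so a first-order analysis alone cannot yield logarithmic regret.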
$B_\star$ known $\implies$ observe $y_t = x_{t+1} - B_\star u_t = A_\star x_t + w_t$, where only $A_\star$ is unknown and the noise is observed.

Free exploration: the noise $w_{t-1}$ excites the state, so we "sense" $A_\star$ through $x_t$ without paying for exploration.

Least squares estimation ($\hat A_t$) error:
$$\|\hat A_t - A_\star\| \propto \frac{\sigma}{\sqrt{\lambda_{\min}\big(\sum_{s=1}^{t} w_s w_s^\top\big)}} \propto t^{-1/2}.$$
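A minimal sketch of this least squares estimator (array shapes and names are my own; assumes a recorded trajectory $x_1, \ldots, x_{T+1}$ and inputs $u_1, \ldots, u_T$):

```python
# Sketch: least squares estimate of A_star when B_star is known.
import numpy as np

def estimate_A(xs, us, B_star, reg=1e-6):
    """Regress y_t = x_{t+1} - B_star u_t on x_t to recover A_star.
    xs: states, shape (T+1, d); us: inputs, shape (T, k); reg: small ridge term."""
    X = xs[:-1]                             # regressors x_1, ..., x_T
    Y = xs[1:] - us @ B_star.T              # targets y_t = A_star x_t + w_t
    G = X.T @ X + reg * np.eye(X.shape[1])  # Gram matrix sum_t x_t x_t'
    return np.linalg.solve(G, X.T @ Y).T    # A_hat, shape (d, d)
```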
Results by Mania et al. (2019):
- Quadratic cost gap: $J(K) - J(K_\star) \le c_1 \|K - K_\star\|^2$.
- Policy estimation: $\|K_\star(\hat A, \hat B) - K_\star(A_\star, B_\star)\| \le c_2 \max\{\|\hat A - A_\star\|, \|\hat B - B_\star\|\}$.

So $1/\sqrt{t}$ estimation $\implies 1/t$ instantaneous regret $\implies \sum_t 1/t = \log T$?

Not quite… with some small probability the estimate is bad and the resulting $K_t$ destabilizes the system, in which case $J(K_t) = \infty$.
"Abort"

At every round, before playing: check $\|x_t\|$ and $\|K_t\|$ against fixed thresholds, a low probability trigger. If it fires, "abort": play the assumed-stable $K_0$ forever $\implies$ constant regret.

Overall, a low order regret term!
Algorithm for Unknown $A_\star$

After a warm-up period, proceed in epochs of doubling length ($2^i$ rounds in epoch $i$) covering the horizon $T$:
- At the start of epoch $i$: estimate $\hat A_i$ (least squares), calculate the greedy policy $K_\star(\hat A_i, B_\star)$.
- Play $K_\star(\hat A_i, B_\star)$ throughout the epoch, if no "abort"; see the sketch after this list.
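A schematic implementation under my own assumptions (the `env_step` callback applies the true dynamics; `x_max` and `K_max` are hypothetical abort thresholds; `lqr_gain` and `estimate_A` are the helpers sketched above):

```python
# Sketch of the doubling-epoch scheme: certainty-equivalent control with
# least squares re-estimation at epoch starts and an abort fallback.
import numpy as np

def run_unknown_A(env_step, x0, B_star, Q, R, K0, T, x_max, K_max):
    xs, us = [x0], []
    x, K, aborted = x0, K0, False
    t, i = 0, 0
    while t < T:
        if not aborted and t > 0:
            # Epoch start: re-estimate A_star, then play the greedy gain.
            A_hat = estimate_A(np.array(xs), np.array(us), B_star)
            K = lqr_gain(A_hat, B_star, Q, R)[0]
        for _ in range(2 ** i):  # epoch i lasts 2^i rounds
            if t >= T:
                break
            # Abort check before playing: a low probability trigger.
            if not aborted and (np.linalg.norm(x) > x_max
                                or np.linalg.norm(K, 2) > K_max):
                K, aborted = K0, True  # fall back to the stable K0 forever
            u = -K @ x
            x = env_step(x, u)  # environment: x' = A_star x + B_star u + w
            xs.append(x)
            us.append(u)
            t += 1
        i += 1
    return np.array(xs), np.array(us)
```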
Regret Decomposition

$$\mathrm{Regret} \lesssim \mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big) \,\middle|\, \text{no abort}\right] + \text{Switching Cost} + \text{Abort Cost}$$

- Switching Cost $\le$ constant $\cdot$ #epochs $\approx \log T$ (the policy changes only at epoch boundaries).
- Abort Cost $\le$ constant $\cdot$ low probability $\approx$ constant.

Putting it all together: epoch $i$ lasts $2^i$ rounds, and at its start the estimate satisfies $\|\hat A_i - A_\star\|^2 \lesssim 2^{-(i-1)}$ (one over the number of rounds observed so far), hence
$$\mathbb{E}\left[\sum_{t=1}^{T} \big(J(K_t) - J(K_\star)\big) \,\middle|\, \text{no abort}\right] \lesssim \sum_{i=1}^{\#\text{epochs}} 2^i \,\|\hat A_i - A_\star\|^2 \lesssim \#\text{epochs} \approx \log T.$$
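Spelling out the last inequality: each epoch contributes a constant,
$$\sum_{i=1}^{\#\text{epochs}} 2^i \cdot 2^{-(i-1)} = \sum_{i=1}^{\#\text{epochs}} 2 = 2 \cdot \#\text{epochs} \approx 2 \log_2 T.$$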
Now assume $A_\star$ is known $\implies$ observe $y_t = x_{t+1} - A_\star x_t = B_\star u_t + w_t$, where only $B_\star$ is unknown and the noise is observed.

Free exploration again comes from the noise, but now through the actions: $u_t = -K_t x_t$ carries $K_t w_{t-1}$, so exploration is governed by $K_t K_t^\top$. Since $K_t K_t^\top \to K_\star K_\star^\top$, identifying $B_\star$ requires non-degeneracy: $K_\star K_\star^\top \succ \mu_\star I$ for some $\mu_\star > 0$.
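The mirror-image estimator (a sketch under the same assumptions as `estimate_A` above); note that its Gram matrix $\sum_t u_t u_t^\top$ inherits $K_t K_t^\top$, which is exactly where the non-degeneracy condition enters:

```python
# Sketch: least squares estimate of B_star when A_star is known.
import numpy as np

def estimate_B(xs, us, A_star, reg=1e-6):
    """Regress y_t = x_{t+1} - A_star x_t on u_t to recover B_star."""
    U = np.asarray(us)                      # inputs u_1, ..., u_T, shape (T, k)
    Y = xs[1:] - xs[:-1] @ A_star.T         # targets y_t = B_star u_t + w_t
    G = U.T @ U + reg * np.eye(U.shape[1])  # Gram matrix; degenerates with K_star K_star'
    return np.linalg.solve(G, U.T @ Y).T    # B_hat, shape (d, k)
```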
Construction in 1-D

Main idea: the construction is inspired by the upper bound $\tilde O\big(\log T / \lambda_{\min}(K_\star K_\star^\top)\big)$, so make $k_\star$ near degenerate:
$$x_{t+1} = \tfrac{1}{2} x_t \pm \varepsilon u_t + w_t, \qquad c_t = x_t^2 + u_t^2 \quad \implies \quad k_\star \approx \mp \varepsilon.$$
Learner's Dilemma

- Bad exploration ($\sum_{t=1}^{T} u_t^2$ too small) $\implies$ failed to identify $\mathrm{sign}(k_\star)$.
- Good exploration identifies $\mathrm{sign}(k_\star)$, but then $\mathrm{Regret} \gtrsim \sum_{t=1}^{T} u_t^2$.

Best tradeoff: $\varepsilon = T^{-1/4} \implies \Omega(\sigma^2 \sqrt{T})$ regret lower bound.
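A heuristic for why $\varepsilon = T^{-1/4}$ balances the dilemma (my back-of-the-envelope, not the talk's formal argument): distinguishing $b_\star = +\varepsilon$ from $b_\star = -\varepsilon$ out of $y_t = b_\star u_t + w_t$ requires signal-to-noise $\varepsilon^2 \sum_t u_t^2 / \sigma^2 \gtrsim 1$, i.e. exploration cost $\sum_t u_t^2 \gtrsim \sigma^2/\varepsilon^2$, while mis-identifying $\mathrm{sign}(k_\star)$ costs on the order of $\sigma^2 \varepsilon^2 T$. Equating the two,
$$\frac{\sigma^2}{\varepsilon^2} \approx \sigma^2 \varepsilon^2 T \iff \varepsilon = T^{-1/4} \implies \mathrm{Regret} \gtrsim \sigma^2 \sqrt{T}.$$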
Summary

$\tilde O(\log T)$ regret is achievable when i) $A_\star$ is unknown ($B_\star$ known), or ii) $B_\star$ is unknown ($A_\star$ known) and $K_\star$ is non-degenerate; in general, $\sqrt{T}$ regret is unavoidable.