  1. Matrix-Free Preconditioning in Online Learning. Ashok Cutkosky, Tamas Sarlos. Google Research.

  2-3. Online Optimization. For t = 1, ..., T, repeat:
  1: Learner chooses a point $w_t$.
  2: Environment presents the learner with a gradient $g_t$ (think $\mathbb{E}[g_t] = \nabla F(w_t)$).
  3: Learner suffers loss $\langle g_t, w_t \rangle$.
  The objective is to minimize regret:
  $$R_T(w^\star) = \sum_{t=1}^{T} \underbrace{\langle g_t, w_t \rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w^\star \rangle}_{\text{benchmark loss}}$$
  Running an online algorithm on a stochastic optimization problem guarantees $F(\bar{w}_T) - F(w^\star) \le R_T(w^\star)/T$, where $\bar{w}_T$ is the average iterate.
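As a quick sketch (not from the slides), the protocol and the online-to-batch conversion above can be simulated in a few lines, assuming a simple quadratic objective $F(w) = \tfrac{1}{2}\|w - w^\star\|^2$ with noisy gradients and plain gradient descent as the learner:

```python
import numpy as np

# Online protocol on a stochastic problem (assumed example: quadratic F
# with Gaussian gradient noise), with gradient descent as the learner.
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
T = 2000
w = np.zeros(2)
regret_terms = []
iterates = []
for t in range(1, T + 1):
    iterates.append(w.copy())
    g = (w - w_star) + rng.normal(scale=0.1, size=2)  # E[g] = grad F(w)
    regret_terms.append(g @ w - g @ w_star)           # <g_t, w_t> - <g_t, w_star>
    w = w - 0.5 / np.sqrt(t) * g                      # learner's update

R_T = sum(regret_terms)
w_bar = np.mean(iterates, axis=0)                     # average iterate

def F(u):
    return 0.5 * np.sum((u - w_star) ** 2)

# Online-to-batch: F(w_bar) - F(w_star) is controlled by R_T / T.
print(F(w_bar) - F(w_star), R_T / T)
```

The step size $0.5/\sqrt{t}$ is an illustrative choice; any standard decaying schedule gives the same qualitative picture.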

  4-5. The Classic Algorithm: Gradient Descent. $w_{t+1} = w_t - \eta_t g_t$. Gradient descent obtains regret:
  $$R_T(w^\star) \le \sqrt{\sum_{t=1}^{T} \|w^\star\|^2 \|g_t\|^2}$$
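The bound above can be checked numerically. The sketch below runs gradient descent on a fixed sequence of linear losses with the oracle-tuned constant step size $\eta = \|w^\star\|/\sqrt{\sum_t \|g_t\|^2}$ (oracle knowledge a practical algorithm would not have; the gradient sequence and comparator here are arbitrary synthetic choices):

```python
import numpy as np

# Oracle-tuned gradient descent on fixed linear losses, checking
# R_T <= ||w_star|| * sqrt(sum ||g_t||^2).
rng = np.random.default_rng(1)
T, d = 500, 3
G = rng.normal(size=(T, d))          # gradient sequence, fixed in advance
w_star = np.array([2.0, 0.0, -1.0])  # an arbitrary comparator
eta = np.linalg.norm(w_star) / np.sqrt(np.sum(G ** 2))

w = np.zeros(d)
regret = 0.0
for g in G:
    regret += g @ w - g @ w_star     # <g_t, w_t> - <g_t, w_star>
    w = w - eta * g                  # gradient descent step

bound = np.linalg.norm(w_star) * np.sqrt(np.sum(G ** 2))
print(regret, bound)
```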

  6. Gradient Descent. [figure slide]

  7-8. Preconditioning (Deterministic). The gradient $\nabla F(w)$ may not point towards the minimum $w^\star$. Key idea: "preconditioning" means ignoring irrelevant directions.

  9. Preconditioning (Stochastic). Noise can also make $g_t$ not point towards the minimum.

  10. Regret Bounds. The regret of un-preconditioned stochastic gradient descent (with the appropriate learning rate) is
  $$R_T(w^\star) \le \sqrt{\sum_{t=1}^{T} \|w^\star\|^2 \|g_t\|^2} = O(\sqrt{T})$$
  An ideal preconditioned algorithm should obtain regret
  $$R_T(w^\star) \le \sqrt{\sum_{t=1}^{T} \langle w^\star, g_t \rangle^2} = O(\sqrt{T})$$

  11. Regret Bound Picture. [figure slide]

  12. Goals. Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs).

  13. Using the Covariance Matrix. The typical approach to preconditioning maintains the matrix
  $$G = \sum_{t=1}^{T} g_t g_t^\top$$
  and computes various inverses and square roots of $G$. This can obtain the guarantee [CO18; KL17]:
  $$R_T(w^\star) \le \sqrt{d \sum_{t=1}^{T} \langle w^\star, g_t \rangle^2}$$
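A minimal sketch of this style of update (a generic full-matrix-AdaGrad-like step, assumed for illustration rather than any specific cited algorithm) makes the cost visible: each step touches a $d \times d$ matrix, and a naive inverse square root is cubic in $d$:

```python
import numpy as np

# Full-matrix preconditioning sketch: accumulate G = sum g g^T and
# precondition the step by G^{-1/2}.  eps regularizes the inverse root.
def full_matrix_step(w, g, G_accum, eta=0.1, eps=1e-8):
    G_accum += np.outer(g, g)                 # O(d^2) memory and time
    vals, vecs = np.linalg.eigh(G_accum)      # O(d^3) if done naively
    inv_root = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return w - eta * inv_root @ g, G_accum

d = 4
w = np.zeros(d)
G_accum = np.zeros((d, d))
rng = np.random.default_rng(3)
for _ in range(50):
    g = rng.normal(size=d)
    w, G_accum = full_matrix_step(w, g, G_accum)
print(w)
```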

  14-15. Issues with Using the Covariance Matrix. $d^2$ time is too slow; there is a lot of work on compressing the matrix to make some tradeoff [Luo+16; GKS18; Aga+18]. And the regret bound might not even be better!
  $$\sqrt{d \sum_{t=1}^{T} \langle w^\star, g_t \rangle^2} \;\overset{?}{\le}\; \sqrt{\sum_{t=1}^{T} \|w^\star\|^2 \|g_t\|^2}$$
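The question mark is warranted: neither side dominates. A small synthetic check (two hand-picked gradient sequences, not from the slides) shows the $\sqrt{d \cdot}$ bound losing when gradients align with $w^\star$ and winning when they are orthogonal to it:

```python
import numpy as np

# Compare sqrt(d * sum <w*, g_t>^2) against sqrt(sum ||w*||^2 ||g_t||^2).
rng = np.random.default_rng(2)
d, T = 10, 1000
w_star = np.zeros(d)
w_star[0] = 1.0

def bounds(G):
    precond = np.sqrt(d * np.sum((G @ w_star) ** 2))
    plain = np.sqrt(np.sum(np.linalg.norm(w_star) ** 2
                           * np.linalg.norm(G, axis=1) ** 2))
    return precond, plain

# Case 1: gradients along w_star -> the covariance bound is sqrt(d) worse.
G_aligned = np.outer(rng.normal(size=T), w_star)
p1, u1 = bounds(G_aligned)

# Case 2: gradients orthogonal to w_star -> the covariance bound is 0.
G_orth = rng.normal(size=(T, d))
G_orth[:, 0] = 0.0
p2, u2 = bounds(G_orth)
print(p1 > u1, p2 < u2)
```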

  16-17. Goals. 1: Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs). 2: Want an efficient algorithm ($O(d)$ time per update in $d$ dimensions). 3: Want to never do worse than non-preconditioned algorithms. We will achieve 2 and 3, and sometimes 1.

  18. Our Contribution. We provide an online learning algorithm that:
  • Runs in $O(d)$ time per update.
  • Always achieves regret
  $$R_T(w^\star) \le \|w^\star\| \sqrt{\sum_{t=1}^{T} \|g_t\|^2}$$
  • When $\left\langle -\sum_{t=1}^{T} g_t, \; w^\star/\|w^\star\| \right\rangle \ge \sqrt{\sum_{t=1}^{T} \|g_t\|^2}$, achieves
  $$R_T(w^\star) \le \sqrt{\sum_{t=1}^{T} \langle w^\star, g_t \rangle^2}$$

  19-20. Unpacking the Condition. We need $\left\langle -\sum_{t=1}^{T} g_t, \; w^\star/\|w^\star\| \right\rangle \ge \sqrt{\sum_{t=1}^{T} \|g_t\|^2}$ for preconditioned regret. If the $g_t$ are mean-zero independent random variables, then by Cauchy-Schwarz and standard concentration results,
  $$\left\langle -\sum_{t=1}^{T} g_t, \; w^\star/\|w^\star\| \right\rangle \le \left\| \sum_{t=1}^{T} g_t \right\| = \Theta\!\left( \sqrt{\sum_{t=1}^{T} \|g_t\|^2} \right)$$
  So we achieve preconditioning whenever there is any "signal" in the gradients.
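The concentration claim is easy to check empirically. A sketch with mean-zero Gaussian gradients (an assumed distribution; the constants are illustrative, not tight) shows $\|\sum_t g_t\|$ and $\sqrt{\sum_t \|g_t\|^2}$ staying within a constant factor of each other:

```python
import numpy as np

# Empirical check: for mean-zero independent g_t, ||sum g_t|| is on the
# order of sqrt(sum ||g_t||^2).
rng = np.random.default_rng(4)
T, d = 5000, 8
G = rng.normal(size=(T, d))
lhs = np.linalg.norm(G.sum(axis=0))   # ||sum of gradients||
rhs = np.sqrt(np.sum(G ** 2))         # sqrt(sum of squared norms)
ratio = lhs / rhs
print(ratio)  # O(1): neither vanishing nor growing with T
```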

  21-23. Coin Betting [OP16]. Define wealth:
  $$\text{Wealth}_T = 1 - \sum_{t=1}^{T} \langle g_t, w_t \rangle$$
  High wealth implies low regret:
  $$R_T(w^\star) = \underbrace{1 - \sum_{t=1}^{T} \langle g_t, w^\star \rangle}_{\text{out of our control}} - \text{Wealth}_T$$
  At every iteration, choose a betting fraction $v_t \in \mathbb{R}^d$ and use $w_t = v_t \, \text{Wealth}_{t-1}$.
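The betting mechanics can be sketched in one dimension. The fraction below is the classical Krichevsky-Trofimov (KT) choice $v_t = (\sum_{s<t} c_s)/t$, used here purely for illustration (it is a standard coin-betting baseline, not necessarily the slides' algorithm); the "coin" is $c_t = -g_t \in [-1, 1]$:

```python
import numpy as np

# 1-D coin betting with the KT fraction.  Wealth compounds multiplicatively:
# Wealth_t = Wealth_{t-1} * (1 + c_t * v_t), via the bet w_t = v_t * Wealth_{t-1}.
rng = np.random.default_rng(5)
T = 2000
coins = np.clip(0.3 + rng.normal(scale=0.5, size=T), -1, 1)  # biased coins

wealth = 1.0
coin_sum = 0.0
for t, c in enumerate(coins, start=1):
    v = coin_sum / t        # KT betting fraction; always |v| < 1
    w = v * wealth          # the bet (the learner's prediction)
    wealth += c * w         # wealth update
    coin_sum += c
print(wealth)
```

Because the coins are biased, the wealth grows exponentially in $T$, which by the identity above translates into low regret.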

  24. Oracle Value for $v$ Yields a Good Algorithm. Set
  $$v_t = v^\star \approx \frac{w^\star}{\|w^\star\| \sqrt{\sum_{t=1}^{T} \langle g_t, w^\star \rangle^2}}$$
  Then
  $$R_T(w^\star) \le \sqrt{\sum_{t=1}^{T} \langle w^\star, g_t \rangle^2}$$
  There are no matrices here! But we don't know this magic value for $v$.

  25-26. Online Learning Inside Online Learning [CO18]. Define $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$ and
  $$R^v_T(v^\star) := \sum_{t=1}^{T} \ell_t(v_t) - \ell_t(v^\star)$$
  If $R^v_T(v^\star) = O(\log T)$, then the final regret $R_T(w^\star)$ is the same as if we'd used the constant $v_t = v^\star$. So we can use online learning to choose the $v_t$!
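A rough sketch of the reduction, with online gradient descent on the surrogate losses as the inner learner (the step size, projection radius, and gradient distribution are all illustrative assumptions, not the slides' actual inner algorithm):

```python
import numpy as np

# Inner online learning on l_t(v) = -log(1 - <g_t, v>), producing betting
# fractions v_t, which become iterates via w_t = v_t * Wealth_{t-1}.
rng = np.random.default_rng(6)
T, d = 1000, 2
G = -0.2 * np.array([1.0, 0.0]) + rng.normal(scale=0.3, size=(T, d))

v = np.zeros(d)
wealth = 1.0
eta, radius = 0.02, 0.3                  # assumed inner-learner parameters
for g in G:
    w = v * wealth                       # bet: w_t = v_t * Wealth_{t-1}
    margin = 1.0 - g @ v                 # stays positive inside the ball
    wealth *= margin                     # Wealth_t = Wealth_{t-1}(1 - <g_t, v_t>)
    v = v - eta * g / margin             # OGD step on l_t (its gradient is g/margin)
    nrm = np.linalg.norm(v)
    if nrm > radius:                     # project to keep 1 - <g, v> > 0
        v *= radius / nrm
print(wealth, v)
```

The gradients here have a consistent bias along the first coordinate, so the inner learner pushes $v$ in that direction and the wealth compounds.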

  27. Overview of Algorithm Strategy. There exists an unknown $v^\star$ that would give preconditioned regret. We can choose $v_t$ using online convex optimization on the losses $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$. If we get $R^v_T(v^\star) = \sum_{t=1}^{T} \ell_t(v_t) - \ell_t(v^\star) = O(\log T)$, then we are as good as picking $v^\star$ from the beginning. So how can we obtain logarithmic regret?

  28. How to Obtain Logarithmic Regret? Strategy: remember that the constant $v^\star$ we need to compete with is
  $$v^\star = \frac{w^\star}{\|w^\star\| \sqrt{\sum_{t=1}^{T} \langle g_t, w^\star \rangle^2}},$$
  so $\|v^\star\| = O(1/\sqrt{T})$ usually. This means that we can use a non-preconditioned online learning algorithm to obtain logarithmic regret:
  $$R^v_T(v^\star) \le \|v^\star\| \sqrt{T} = O(1)$$
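The $O(1/\sqrt{T})$ scaling of $\|v^\star\|$ is easy to verify on synthetic gradients (the Gaussian gradient model below is an assumption; the point is the scaling in $T$, not the constant):

```python
import numpy as np

# Check that ||v_star|| * sqrt(T) stays roughly constant as T grows, for
# v_star = w_star / (||w_star|| * sqrt(sum <g_t, w_star>^2)).
rng = np.random.default_rng(7)
w_star = np.array([1.0, 1.0, 0.0])
norms = []
for T in (100, 400, 1600, 6400):
    G = rng.normal(size=(T, 3))
    denom = np.linalg.norm(w_star) * np.sqrt(np.sum((G @ w_star) ** 2))
    v_star = w_star / denom
    norms.append(np.linalg.norm(v_star) * np.sqrt(T))
print(norms)  # roughly constant across T
```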
