Matrix-Free Preconditioning in Online Learning, by Ashok Cutkosky and Tamas Sarlos (PowerPoint presentation)
SLIDE 1

Matrix-Free Preconditioning in Online Learning

Ashok Cutkosky, Tamas Sarlos Google Research


SLIDE 3

Online Optimization

For t = 1 … T, repeat:
1: Learner chooses a point $w_t$.
2: Environment presents learner with a gradient $g_t$ (think $\mathbb{E}[g_t] = \nabla F(w_t)$).
3: Learner suffers loss $\langle g_t, w_t \rangle$.
The objective is to minimize regret:

$$R_T(w_\star) = \sum_{t=1}^T \underbrace{\langle g_t, w_t \rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w_\star \rangle}_{\text{benchmark loss}}$$

Running an online algorithm on a stochastic optimization problem guarantees
$$F(w_T) - F(w_\star) \le \frac{R_T(w_\star)}{T}.$$

Ashok Cutkosky, Tamas Sarlos Matrix-Free Preconditioning in Online Learning 1 of 20
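The protocol above can be sketched directly in code (a minimal illustration with a made-up two-round example; the helper names are ours, not from the talk):

```python
def dot(a, b):
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))

def regret(ws, gs, w_star):
    """R_T(w*) = sum_t (<g_t, w_t> - <g_t, w*>): loss suffered minus benchmark loss."""
    return sum(dot(g, w) - dot(g, w_star) for w, g in zip(ws, gs))

# Two rounds in R^2: the learner's points, the revealed gradients, and a fixed comparator.
ws = [(0.0, 0.0), (1.0, 0.0)]
gs = [(1.0, 1.0), (-1.0, 0.0)]
print(regret(ws, gs, (1.0, 0.0)))  # -1.0: the learner beat the comparator in this toy run
```

Negative regret is possible on a fixed sequence; the bounds in the following slides control the worst case over sequences.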


SLIDE 5

The Classic Algorithm: Gradient Descent

$$w_{t+1} = w_t - \eta_t g_t$$

Gradient descent obtains regret:
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$
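A runnable sketch of the update above (the decaying schedule $\eta_t = \eta/\sqrt{t}$ is a standard choice we assume for illustration; the slide does not specify one):

```python
import math

def ogd(gradients, eta=1.0):
    """Online gradient descent: w_{t+1} = w_t - eta_t * g_t, with eta_t = eta / sqrt(t)."""
    d = len(gradients[0])
    w = [0.0] * d
    iterates = []
    for t, g in enumerate(gradients, start=1):
        iterates.append(list(w))          # w_t is played before g_t is revealed
        step = eta / math.sqrt(t)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return iterates

# The same constant gradient every round pushes the iterate steadily in one direction.
its = ogd([(1.0, 0.0)] * 3)
print(its)
```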

SLIDE 6

Gradient Descent



SLIDE 8

Preconditioning (Deterministic)

  • The gradient ∇F(w) may not point towards the minimum w⋆

Key idea: “Preconditioning” means ignoring irrelevant directions.


SLIDE 9

Preconditioning (Stochastic)

  • Noise can also make gt not point towards the minimum.


SLIDE 10

Regret Bounds

  • Regret of un-preconditioned stochastic gradient descent (with the appropriate learning rate) is
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} = O(\sqrt{T})$$
  • An ideal preconditioned algorithm should obtain regret
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \langle w_\star, g_t \rangle^2} = O(\sqrt{T})$$

SLIDE 11

Regret Bound Picture


SLIDE 12

Goals

  • Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs).

SLIDE 13

Using the Covariance Matrix

The typical approach to preconditioning maintains the matrix
$$G = \sum_{t=1}^T g_t g_t^\top$$
and computes various inverses and square roots of G. This can obtain the guarantee [CO18; KL17]:
$$R_T(w_\star) \le \sqrt{d \sum_{t=1}^T \langle w_\star, g_t \rangle^2}$$
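The cost of this approach is easy to see in code (a schematic sketch, not the cited algorithms: we show only the covariance update itself; the inverse and square-root computations on G are more expensive still):

```python
def update_covariance(G, g):
    """Rank-one update G += g g^T; touching every entry costs O(d^2) time per gradient."""
    d = len(g)
    for i in range(d):
        for j in range(d):
            G[i][j] += g[i] * g[j]
    return G

# Accumulate G = sum_t g_t g_t^T for two gradients in R^2.
G = [[0.0, 0.0], [0.0, 0.0]]
for g in [(1.0, 0.0), (1.0, 1.0)]:
    update_covariance(G, g)
print(G)  # [[2.0, 1.0], [1.0, 1.0]]
```

Even storing G takes O(d^2) memory, which is why the next slide calls this too slow in high dimensions.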


SLIDE 15

Issues with Using Covariance Matrix

  • $d^2$ time is too slow; there is a lot of work on compressing the matrix to try to make some tradeoff [Luo+16; GKS18; Aga+18].
  • The regret bound might not even be better!
$$\sqrt{d \sum_{t=1}^T \langle w_\star, g_t \rangle^2} \;\overset{?}{\le}\; \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$


SLIDE 17

Goals

1: Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs).
2: Want an efficient algorithm (O(d) time per update in d dimensions).
3: Want to never do worse than non-preconditioned algorithms.

  • We will achieve 2 and 3, and sometimes 1.

SLIDE 18

Our Contribution

We provide an online learning algorithm that:

  • Runs in $O(d)$ time per update.
  • Always achieves regret:
$$R_T(w_\star) \le \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$$
  • When $-\sum_{t=1}^T \langle g_t, w_\star \rangle / \|w_\star\| \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$, achieves:
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \langle w_\star, g_t \rangle^2}$$


SLIDE 20

Unpacking the Condition

  • We need $-\sum_{t=1}^T \langle g_t, w_\star \rangle / \|w_\star\| \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$ for preconditioned regret.
  • If the $g_t$ are mean-zero independent random variables, then standard concentration results say:
$$-\sum_{t=1}^T \langle g_t, w_\star \rangle / \|w_\star\| = \Theta\!\left(\sqrt{\sum_{t=1}^T \|g_t\|^2}\right)$$
  • We achieve preconditioning whenever there is any “signal” in the gradients.
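A quick numeric illustration of the condition (a made-up example, not from the talk): with pure-noise gradients the left-hand side hovers around the $\sqrt{\sum_t \|g_t\|^2}$ scale, while gradients with a consistent signal component exceed it by a wide margin.

```python
import math
import random

def condition_ratio(gs, w_star):
    """LHS/RHS of the condition: (-sum_t <g_t, w*> / ||w*||) / sqrt(sum_t ||g_t||^2).
    A value >= 1 means the preconditioned-regret condition holds."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    lhs = -sum(dot(g, w_star) for g in gs) / norm(w_star)
    rhs = math.sqrt(sum(norm(g) ** 2 for g in gs))
    return lhs / rhs

random.seed(0)
T, w_star = 1000, (1.0, 0.0)
noise = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(T)]   # pure noise
signal = [(x - 1.0, y) for x, y in noise]  # same noise plus a consistent descent direction
print(condition_ratio(noise, w_star), condition_ratio(signal, w_star))
```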



SLIDE 23

Coin Betting [OP16]

  • Define wealth:
$$\text{Wealth}_T = 1 - \sum_{t=1}^T \langle g_t, w_t \rangle$$
  • High wealth implies low regret:
$$R_T(w_\star) = \underbrace{1 - \sum_{t=1}^T \langle g_t, w_\star \rangle}_{\text{out of our control}} - \text{Wealth}_T$$
  • At every iteration, choose a betting fraction $v_t \in \mathbb{R}^d$ and use
$$w_t = v_t \, \text{Wealth}_{t-1}$$
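A minimal coin-betting sketch under the definitions above (one-dimensional, with a fixed betting fraction v; the fixed fraction stands in for the $v_t$ chosen adaptively later in the talk):

```python
def coin_betting(gradients, v):
    """1-d coin betting: start with Wealth_0 = 1 and bet w_t = v * Wealth_{t-1}."""
    wealth = 1.0
    bets = []
    for g in gradients:
        w = v * wealth
        bets.append(w)
        wealth -= g * w          # Wealth_t = Wealth_{t-1} - g_t * w_t
    return bets, wealth

# Gradients consistently equal to -1 ("heads on every flip"): wealth compounds as (1 + v)^T.
bets, wealth = coin_betting([-1.0] * 10, v=0.5)
print(wealth)  # 1.5 ** 10 = 57.665...
```

Exponentially growing wealth is exactly what the regret identity above converts into low regret.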

SLIDE 24

Oracle value for v yields a good algorithm

Set
$$v_t = v_\star \approx \frac{w_\star}{\|w_\star\| \sqrt{\sum_{t=1}^T \langle g_t, w_\star \rangle^2}}.$$
Then
$$R_T(w_\star) \le \sqrt{\sum_{t=1}^T \langle w_\star, g_t \rangle^2}$$

  • There are no matrices here!
  • But we don’t know this magic value for v.


SLIDE 26

Online Learning Inside Online Learning [CO18]

  • Define $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$. Then:
$$R^v_T(v_\star) := \sum_{t=1}^T \ell_t(v_t) - \ell_t(v_\star)$$
  • If $R^v_T(v_\star) = O(\log(T))$, then the final regret $R_T(w_\star)$ is the same as if we’d used the constant $v_t = v_\star$.
  • We can use online learning to choose the $v_t$!
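To make the inner loop concrete, here is a sketch that picks the $v_t$ by plain gradient descent on $\ell_t$ (the fixed step size and the use of vanilla gradient descent are our simplifications for illustration; the actual inner algorithm in the talk differs):

```python
def inner_ogd(gradients, eta=0.1):
    """Choose betting fractions v_t by gradient descent on l_t(v) = -log(1 - <g_t, v>).
    The gradient of l_t at v is g_t / (1 - <g_t, v>)."""
    d = len(gradients[0])
    v = [0.0] * d
    fractions = []
    for g in gradients:
        fractions.append(list(v))          # v_t is chosen before g_t arrives
        dot = sum(gi * vi for gi, vi in zip(g, v))
        scale = 1.0 / (1.0 - dot)          # assumes <g_t, v> < 1 so the loss is defined
        v = [vi - eta * scale * gi for vi, gi in zip(v, g)]
    return fractions

# Repeated gradient -e_1 drives the betting fraction toward that direction.
fr = inner_ogd([(-1.0, 0.0)] * 3)
print(fr)
```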

SLIDE 27

Overview of Algorithm Strategy

  • There exists an unknown $v_\star$ that would give preconditioned regret.
  • We can choose $v_t$ using online convex optimization on the losses $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$.
  • If we get $R^v_T(v_\star) = \sum_{t=1}^T \ell_t(v_t) - \ell_t(v_\star) = O(\log(T))$, then we are as good as picking $v_\star$ from the beginning.
  • So how can we obtain logarithmic regret?


SLIDE 29

How to obtain logarithmic regret?

  • Strategy: Remember that the constant $v_\star$ we need to compete with is
$$v_\star = \frac{w_\star}{\|w_\star\| \sqrt{\sum_{t=1}^T \langle g_t, w_\star \rangle^2}},$$
so $\|v_\star\| = O(1/\sqrt{T})$ usually.
  • This means that we can use a non-preconditioned online learning algorithm to obtain logarithmic regret:
$$R^v_T(v_\star) \le \|v_\star\| \sqrt{T} = O(1)$$
  • Sometimes the best v is not small; this is why we do not always obtain preconditioned regret.

SLIDE 30

Experiments

[Figure: test accuracy (roughly 0.175 to 0.35) versus training steps (up to 300,000) on the LM1B dataset with a Transformer model, comparing Adam, Adagrad, Recursive, and Recursive+Momentum.]


SLIDE 32

Summary

  • When the gradients are “obviously non-random”, we obtain preconditioned regret bounds without any bad $\sqrt{d}$ constant factors.
  • Otherwise, we decay to the ordinary non-preconditioned regret bounds (actually, we improve log factors).
  • The algorithm runs in the same time complexity as ordinary gradient descent.
  • The empirical performance is promising.

Thank you!