Weighted SGD for ℓp Regression with Randomized Preconditioning


  1. Weighted SGD for ℓp Regression with Randomized Preconditioning. Jiyan Yang (Stanford University). SODA, Jan 2016. Joint work with Yin-Lam Chow (Stanford), Christopher Ré (Stanford), and Michael W. Mahoney (Berkeley).

  2. Outline
  ◮ Overview
  ◮ A Perspective of Stochastic Optimization
  ◮ Preliminaries in Randomized Linear Algebra
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results

  3. Problem formulation
  Definition. Given a matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$, a vector $b \in \mathbb{R}^n$, and a number $p \in [1, \infty]$, the constrained overdetermined $\ell_p$ regression problem is
  $$\min_{x \in \mathcal{Z}} \|Ax - b\|_p.$$
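  To make the setup concrete, here is a minimal numpy sketch of our own (sizes and noise level are invented for illustration), solving the unconstrained p = 2 case exactly as a baseline:

```python
import numpy as np

# A hypothetical overdetermined instance with n >> d (sizes are ours).
n, d = 10000, 20
rng = np.random.default_rng(0)
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# For p = 2 and an unconstrained Z = R^d, this is ordinary least squares.
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
print("l2 residual at the optimum:", np.linalg.norm(A @ x_opt - b))
```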

  4. Overview
  Precision spectrum: SGD targets low precision (around $10^{-1}$), pwSGD targets medium precision, RLA targets high precision (around $10^{-8}$).
  SGD: efficient, easy to implement, and scalable; widely used for convex objectives; asymptotic convergence rate; guarantees formulated in terms of assumptions on the input.
  RLA: requires an IPM to solve the constrained subproblem; works well for problems with linear structure; better worst-case theoretical guarantees; formulated for worst-case inputs.
  pwSGD (preconditioned weighted SGD):
  ◮ It preserves the simplicity of SGD and the high-quality theoretical guarantees of RLA.
  ◮ It is preferable when medium precision, e.g., $10^{-3}$, is desired.

  5. Our main algorithm: pwSGD
  Algorithm:
  1. Apply RLA techniques to construct a preconditioner and an importance sampling distribution.
  2. Run an SGD-like iterative phase with weighted sampling on the preconditioned system.
  Properties:
  ◮ The preconditioner and the importance sampling distribution can be computed in $O(\mathrm{nnz}(A) \cdot \log n)$ time.
  ◮ The convergence rate of the SGD phase depends only on the low dimension $d$; it is independent of the high dimension $n$.
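  As a rough illustration of the two phases (a sketch of ours, not the authors' reference implementation): the unconstrained ℓ2 case, with a plain Gaussian sketch standing in for the faster transforms discussed later, and a heuristic step size in place of the one analyzed in the paper.

```python
import numpy as np

def pwsgd_l2(A, b, num_iters=5000, seed=0):
    """Simplified pwSGD sketch for unconstrained l2 regression.

    Phase 1: sketch + QR gives a preconditioner R^{-1} and
    leverage-style row-sampling probabilities. Phase 2: weighted
    SGD on the preconditioned system min_y ||Uy - b||_2^2 with
    U = A R^{-1}. The step size is a heuristic, not the one
    analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Phase 1: Gaussian sketch (slower than SRHT/count-sketch but simple).
    r = 4 * d
    Phi = rng.standard_normal((r, n)) / np.sqrt(r)
    _, R = np.linalg.qr(Phi @ A)
    U = A @ np.linalg.inv(R)              # approximately orthonormal columns
    p = np.sum(U ** 2, axis=1)
    p /= p.sum()                          # importance sampling distribution

    # Phase 2: weighted SGD; the 1/(n p_i) factor makes each step an
    # unbiased estimate of the gradient of (1/n) ||Uy - b||_2^2.
    y = np.zeros(d)
    for t in range(1, num_iters + 1):
        i = rng.choice(n, p=p)
        grad = 2.0 * (U[i] @ y - b[i]) * U[i] / (n * p[i])
        y -= grad / np.sqrt(t)            # heuristic O(1/sqrt(t)) step size
    return np.linalg.solve(R, y)          # map back: x = R^{-1} y
```

  With a count-sketch or SRHT in place of the Gaussian matrix, phase 1 reaches the stated $O(\mathrm{nnz}(A) \cdot \log n)$ running time.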

  6. Complexity comparisons

  | solver           | complexity (general)                             | complexity (sparse)                           |
  | RLA sampling     | time(R) + O(nnz(A) log n + poly(d, κ̄₁, 1/ε))    | O(nnz(A) log n + poly(d, 1/ε))                |
  | randomized IPCPM | time(R) + nd² + O((nd + poly(d)) log(κ̄₁ d/ε))   | O(nd log(d/ε))                                |
  | pwSGD            | time(R) + O(nnz(A) log n + d³ κ̄₁/ε²)            | O(nnz(A) log n + d^{13/2} log^{5/2} d / ε²)   |

  Table: Summary of the complexity of several unconstrained ℓ1 solvers that use randomized linear algebra. pwSGD has a uniformly better complexity than the RLA sampling methods in terms of both d and ε, no matter which underlying preconditioning method is used.

  | solver                     | complexity (SRHT)                       | complexity (sparse)                    |
  | RLA projection             | O(nd log(d/ε) + d³ log(nd)/ε)           | O(nnz(A) + d⁴/ε²)                      |
  | RLA sampling               | O(nd log n + d³ log d + d³ log d/ε)     | O(nnz(A) log n + d⁴ + d³ log d/ε)      |
  | RLA high-precision solvers | O(nd log d + d³ log d + nd log(1/ε))    | O(nnz(A) + d⁴ + nd log(1/ε))           |
  | pwSGD                      | O(nd log n + d³ log d + d³ log(1/ε)/ε)  | O(nnz(A) log n + d⁴ + d³ log(1/ε)/ε)   |

  Table: Summary of the complexity of several unconstrained ℓ2 solvers that use randomized linear algebra. When d ≥ 1/ε and n ≥ d²/ε, pwSGD is asymptotically better than the other solvers listed above.

  7. Outline
  ◮ Overview
  ◮ A Perspective of Stochastic Optimization
  ◮ Preliminaries in Randomized Linear Algebra
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results

  8. Viewing ℓp regression as stochastic optimization
  $$\min_{x \in \mathcal{Z}} \|Ax - b\|_p^p \;\overset{(a)}{\longrightarrow}\; \min_{y \in \mathcal{Y}} \|Uy - b\|_p^p \;\overset{(b)}{\longrightarrow}\; \min_{y \in \mathcal{Y}} \mathbb{E}_{\xi \sim P}\left[H(y, \xi)\right].$$
  ◮ (a) is done by changing to a different basis $U$ for the range space of $A$ (with $\mathcal{Y}$ the correspondingly transformed constraint set).
  ◮ (b) is true for any sampling distribution $\{p_i\}_{i=1}^n$ over the rows, by setting $H(y, \xi) = |U_\xi y - b_\xi|^p / p_\xi$.
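  A quick numerical check of claim (b): the sampling weights cancel in expectation for any positive distribution over the rows (sizes here are arbitrary, our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 1000, 5, 1.0
U = rng.standard_normal((n, d))
b = rng.standard_normal(n)
y = rng.standard_normal(d)

# Any distribution with strictly positive entries works; try a random one.
probs = rng.random(n) + 0.1
probs /= probs.sum()

# E_{xi ~ P}[|U_xi y - b_xi|^p / p_xi] equals ||Uy - b||_p^p exactly:
exact = np.sum(np.abs(U @ y - b) ** p)
expectation = np.sum(probs * (np.abs(U @ y - b) ** p / probs))
print(np.isclose(exact, expectation))  # True: the weights cancel
```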

  9. Solving ℓp regression via stochastic optimization
  To solve this stochastic optimization problem, one typically needs to answer the following three questions.
  ◮ (C1): How to sample: SAA (i.e., draw samples in batch mode and solve the resulting subproblem) or SA (i.e., draw a mini-batch of samples in an online fashion and update the weight vector after extracting useful information)?
  ◮ (C2): Which probability distribution P (uniform or not) and which basis U (preconditioned or not) to use?
  ◮ (C3): Which solver to use (e.g., how to solve the subproblem in SAA, or how to update the weight vector in SA)?

  10. A unified framework for RLA and SGD
  Starting from the stochastic-optimization view of ℓp regression, $\min_{y} \mathbb{E}_{\xi \sim P}\left[|U_\xi y - b_\xi|^p / p_\xi\right]$, the choices in (C1)-(C3) recover known solvers:

  | (C1) How to sample? | (C2) Which U and P to use?                 | (C3) How to solve?           | resulting solver                        |
  | SA (online)         | U = A, naive uniform P                     | gradient descent             | vanilla SGD (fast)                      |
  | SA (online)         | well-conditioned U, non-uniform P via RLA  | gradient descent             | pwSGD (fast; this presentation)         |
  | SAA (batch)         | well-conditioned U, non-uniform P via RLA  | exact solution of subproblem | vanilla RLA sampling algorithm (slow)   |

  11. Outline
  ◮ Overview
  ◮ A Perspective of Stochastic Optimization
  ◮ Preliminaries in Randomized Linear Algebra
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results

  12. ℓp well-conditioned basis
  Definition (ℓp-norm condition number (Clarkson et al., 2013)). Given a matrix $A \in \mathbb{R}^{n \times d}$ and $p \in [1, \infty]$, let
  $$\sigma_p^{\max}(A) = \max_{\|x\|_2 = 1} \|Ax\|_p \quad \text{and} \quad \sigma_p^{\min}(A) = \min_{\|x\|_2 = 1} \|Ax\|_p.$$
  Then we denote by $\kappa_p(A)$ the ℓp-norm condition number of $A$, defined to be
  $$\kappa_p(A) = \sigma_p^{\max}(A) / \sigma_p^{\min}(A).$$
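  For p = 2 this is the usual ratio of extreme singular values. For general p, a crude numerical estimate can be had by sampling random unit vectors (a sketch of ours; the true extrema require optimization over the unit sphere, so this only yields a lower bound):

```python
import numpy as np

def kappa_2(A):
    # Exact for p = 2: ratio of extreme singular values.
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

def kappa_p_estimate(A, p, trials=20000, seed=0):
    # Crude lower bound on kappa_p(A): max/min of ||Ax||_p over
    # random unit vectors x.
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    X = rng.standard_normal((trials, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    norms = np.linalg.norm(A @ X.T, ord=p, axis=0)  # ||Ax||_p per sample
    return norms.max() / norms.min()
```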

  13. Motivation
  ◮ For ℓ2, a perfect preconditioner is one that transforms A into an orthonormal basis.
  ◮ However, doing so requires a factorization of A such as QR or SVD, which is expensive.
  ◮ Can we instead do a QR factorization of a similar but much smaller matrix?
  ◮ Idea: use randomized linear algebra to compute a sketch of A and perform QR on the sketch.

  14. An important tool: the sketch
  ◮ Given a matrix $A \in \mathbb{R}^{n \times d}$, a sketch can be viewed as a compressed representation of A, denoted ΦA.
  ◮ The matrix $\Phi \in \mathbb{R}^{r \times n}$ preserves the norms of vectors in the range space of A up to small constants. That is,
  $$(1 - \epsilon) \|Ax\| \le \|\Phi A x\| \le (1 + \epsilon) \|Ax\|, \quad \forall x \in \mathbb{R}^d.$$
  ◮ $r \ll n$.

  15. Types of ℓ2 sketches
  ◮ Sub-Gaussian sketch, e.g., Gaussian transform: ΦA = GA. Time: O(nd²); r = O(d/ε²).
  ◮ Sketch based on randomized orthonormal systems [Tropp, 2011], e.g., subsampled randomized Hadamard transform (SRHT): ΦA = SDHA. Time: O(nd log n); r = O(d log(nd) log(d/ε²)/ε²).
  ◮ Sketch based on a sparse transform [Clarkson and Woodruff, 2013], e.g., count-sketch-like transform (CW): ΦA = RDA. Time: O(nnz(A)); r = O((d² + d)/ε²).
  ◮ Sampling with approximate leverage scores [Drineas et al., 2012]. Leverage scores can be viewed as a measure of the importance of the rows in the least-squares fit; e.g., use the CW transform to estimate them. Time: t_proj + O(nnz(A) log n); r = O(d log d/ε²).
  Normally, when ε is fixed, the required sketch size r depends only on d, independent of n.
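  The CW transform is particularly simple: each row of A is hashed into one of r buckets with a random sign, so forming ΦA takes one pass over the nonzeros. A minimal dense-input implementation of ours, with an empirical distortion check:

```python
import numpy as np

def countsketch(A, r, seed=0):
    # CW-style sparse transform: row i of A is added to bucket h(i)
    # with sign s(i); one pass over the entries of A.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    h = rng.integers(0, r, size=n)          # bucket for each row
    s = rng.choice([-1.0, 1.0], size=n)     # random signs
    PhiA = np.zeros((r, d))
    np.add.at(PhiA, h, s[:, None] * A)      # scatter-add signed rows
    return PhiA

# Empirical distortion over a few random directions x:
rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 10))
PhiA = countsketch(A, r=2000)
for _ in range(3):
    x = rng.standard_normal(10)
    print(np.linalg.norm(PhiA @ x) / np.linalg.norm(A @ x))  # roughly 1
```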

  16. Randomized preconditioners
  Algorithm:
  1. Compute a sketch ΦA.
  2. Compute the economy QR factorization ΦA = QR.
  3. Return R⁻¹.
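  A direct transcription of these three steps (our sketch, with a Gaussian Φ standing in for the faster transforms above; the final lines check preconditioning quality via the ℓ2 condition number):

```python
import numpy as np

def randomized_preconditioner(A, r=None, seed=0):
    """Sketch-and-QR preconditioner: returns R^{-1} such that
    A R^{-1} is well-conditioned. Gaussian Phi for simplicity;
    SRHT or a sparse transform would be used for speed."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    r = r or 4 * d
    Phi = rng.standard_normal((r, n)) / np.sqrt(r)  # step 1: sketch
    _, R = np.linalg.qr(Phi @ A)                    # step 2: economy QR
    return np.linalg.inv(R)                         # step 3: R^{-1}

rng = np.random.default_rng(2)
A = rng.standard_normal((5000, 30)) @ np.diag(np.logspace(0, 6, 30))
Rinv = randomized_preconditioner(A)
s = lambda M: np.linalg.svd(M, compute_uv=False)
print("kappa_2(A)        =", s(A)[0] / s(A)[-1])              # huge (~1e6)
print("kappa_2(A R^{-1}) =", s(A @ Rinv)[0] / s(A @ Rinv)[-1])  # small, O(1)
```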

  17. Randomized preconditioners (cont'd)
  Analysis
  ◮ Since A and ΦA are "similar", $AR^{-1} \approx \Phi A R^{-1} = Q$.
  ◮ Using the norm-preservation property of the sketch and the equivalence of vector norms, for all $x \in \mathbb{R}^d$,
  $$\|AR^{-1}x\|_p \le \|\Phi A R^{-1} x\|_p / \sigma_\Phi \le r^{\max\{0,\, 1/p - 1/2\}} \cdot \|\Phi A R^{-1} x\|_2 / \sigma_\Phi = r^{\max\{0,\, 1/p - 1/2\}} \cdot \|x\|_2 / \sigma_\Phi,$$
  and
  $$\|AR^{-1}x\|_p \ge \|\Phi A R^{-1} x\|_p / (\sigma_\Phi \kappa_\Phi) \ge r^{\min\{0,\, 1/p - 1/2\}} \cdot \|\Phi A R^{-1} x\|_2 / (\sigma_\Phi \kappa_\Phi) = r^{\min\{0,\, 1/p - 1/2\}} \cdot \|x\|_2 / (\sigma_\Phi \kappa_\Phi).$$
  Together, these bound $\kappa_p(AR^{-1})$ by $r^{|1/p - 1/2|} \kappa_\Phi$, a quantity that depends on the sketch size r (and hence on d), not on n.

  18. Qualities of preconditioners

  | name                          | running time            | κ̄₁(U)                    |
  | Dense Cauchy [SW11]           | O(nd² log d + d³ log d) | O(d^{5/2} log^{3/2} d)   |
  | Fast Cauchy [CDM+12]          | O(nd log d + d³ log d)  | O(d^{11/2} log^{9/2} d)  |
  | Sparse Cauchy [MM12]          | O(nnz(A) + d⁷ log⁵ d)   | O(d^{13/2} log^{11/2} d) |
  | Reciprocal Exponential [WZ13] | O(nnz(A) + d³ log d)    | O(d^{7/2} log^{5/2} d)   |

  Table: Summary of running time and condition number for several different ℓ1 conditioning methods.

  | name                       | running time           | κ₂(U) | κ̄₂(U) |
  | sub-Gaussian               | O(nd²)                 | O(1)  | O(√d) |
  | SRHT [Tropp11]             | O(nd log n + d³ log d) | O(1)  | O(√d) |
  | Sparse ℓ2 embedding [CW12] | O(nnz(A) + d⁴)         | O(1)  | O(√d) |

  Table: Summary of running time and condition number for several different ℓ2 conditioning methods.

  19. Why is preconditioning useful?
  We can show that the convergence rate of SGD for the ℓp regression problem depends on the ℓp condition number of the linear system. Using such a randomized preconditioner therefore drastically reduces the number of iterations needed.
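  A toy experiment of our own making illustrates the point for p = 2: uniform-sampling SGD on an ill-conditioned A stalls, while the same iteration on the preconditioned basis U = AR⁻¹ gets close to the optimum.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 20
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 4, d))  # kappa_2 ~ 1e4
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def sgd(M, c, iters=20000, seed=0):
    # Uniform-sampling SGD on min_y ||My - c||_2^2 with a 1/sqrt(t)
    # step, scaled by a crude smoothness proxy (max squared row norm).
    r = np.random.default_rng(seed)
    L = (np.linalg.norm(M, axis=1) ** 2).max()
    y = np.zeros(M.shape[1])
    for t in range(1, iters + 1):
        i = r.integers(M.shape[0])
        y -= (M[i] @ y - c[i]) * M[i] / (L * np.sqrt(t))
    return y

# Plain SGD on the original, badly conditioned system.
x_plain = sgd(A, b)

# SGD on the preconditioned basis U = A R^{-1} (Gaussian sketch + QR).
Phi = rng.standard_normal((4 * d, n)) / np.sqrt(4 * d)
_, R = np.linalg.qr(Phi @ A)
x_prec = np.linalg.solve(R, sgd(A @ np.linalg.inv(R), b))

resid = lambda x: np.linalg.norm(A @ x - b)
print("plain SGD:         ", resid(x_plain))   # typically far from optimal
print("preconditioned SGD:", resid(x_prec))    # typically much closer
print("optimum:           ", resid(np.linalg.lstsq(A, b, rcond=None)[0]))
```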

  20. Outline
  ◮ Overview
  ◮ A Perspective of Stochastic Optimization
  ◮ Preliminaries in Randomized Linear Algebra
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results
