
Linear and Sublinear Linear Algebra Algorithms: Preconditioning - PowerPoint PPT Presentation



  1. Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra. Michael W. Mahoney, ICSI and Dept. of Statistics, UC Berkeley. (For more info, see: http://www.stat.berkeley.edu/~mmahoney or Google on “Michael Mahoney”.) August 2015. Joint work with Jiyan Yang, Yin-Lam Chow, and Christopher Ré.

  2. Outline
  ◮ Background
  ◮ A perspective of stochastic optimization
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results
  ◮ Connection with Coreset Methods

  3. RLA and SGD
  ◮ SGD (Stochastic Gradient Descent) methods¹
    ◮ Widely used in practice because of their scalability, efficiency, and ease of implementation.
    ◮ Work for problems with a general convex objective function.
    ◮ Usually provide asymptotic bounds on the convergence rate.
    ◮ Typically formulated in terms of differentiability assumptions, smoothness assumptions, etc.
  ◮ RLA (Randomized Linear Algebra) methods²
    ◮ Better worst-case theoretical guarantees and better control over solution precision.
    ◮ Less flexible (thus far), e.g., in the presence of constraints.
    ◮ E.g., may use an interior point method for solving a constrained subproblem, and this may be less efficient than SGD.
    ◮ Typically formulated (either TCS-style or NLA-style) for worst-case inputs.
  ¹ SGD: iteratively solve the problem by approximating the true gradient by the gradient at a single example.
  ² RLA: construct (with sampling/projections) a random sketch, and use that sketch to solve the subproblem or to construct preconditioners for the original problem.

  4. Can we get the “best of both worlds”? Consider problems where both methods have something nontrivial to say.
  Definition. Given a matrix A ∈ R^{n×d}, where n ≫ d, a vector b ∈ R^n, and a number p ∈ [1, ∞], the overdetermined ℓp regression problem is
      min_{x∈Z} f(x) = ‖Ax − b‖_p.
  Important special cases:
  ◮ Least Squares: Z = R^d and p = 2.
    ◮ Solved by eigenvector methods with O(nd²) worst-case running time, or by iterative methods whose running time depends on κ(A).
  ◮ Least Absolute Deviations: Z = R^d and p = 1.
    ◮ The unconstrained ℓ1 regression problem can be formulated as a linear program and solved by an interior-point method.
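To make the two solution routes for the least-squares special case concrete, here is a minimal Python sketch (not part of the slides; numpy/scipy and the problem sizes are illustrative choices) that solves the same overdetermined system with a direct, factorization-based method and with the iterative solver LSQR, whose iteration count depends on κ(A).

# Minimal sketch: the least-squares special case (Z = R^d, p = 2) solved
# two ways: a direct O(nd^2) factorization-based solve, and an iterative
# solver whose convergence depends on the condition number kappa(A).
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n, d = 10_000, 50                      # overdetermined: n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Direct method: SVD/QR-based solve, O(n d^2) worst case.
x_direct, *_ = np.linalg.lstsq(A, b, rcond=None)

# Iterative method: LSQR; its iteration count grows with kappa(A).
x_iter = lsqr(A, b, atol=1e-10, btol=1e-10)[0]

print(np.linalg.cond(A), np.linalg.norm(x_direct - x_iter))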

  5. Outline
  ◮ Background
  ◮ A perspective of stochastic optimization
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results
  ◮ Connection with Coreset Methods

  6. Deterministic ℓp regression as stochastic optimization
  ◮ Let U ∈ R^{n×d} be a basis of the range space of A in the form U = AF, where F ∈ R^{d×d}.
  ◮ The constrained overdetermined (deterministic) ℓp regression problem is equivalent to the (stochastic) optimization problem
      min_{x∈Z} ‖Ax − b‖_p^p = min_{y∈Y} ‖Uy − b‖_p^p = min_{y∈Y} E_{ξ∼P}[H(y, ξ)],
    where H(y, ξ) = |U_ξ y − b_ξ|^p / p_ξ is the randomized integrand and ξ is a random variable over {1, . . . , n} with distribution P = {p_i}_{i=1}^n.
  ◮ The constraint set of y is given by Y = {y ∈ R^d | y = F^{-1} x, x ∈ Z}.
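A quick numerical check of this equivalence (a sketch, not from the slides; the matrix, vector, and distribution below are made up): for any strictly positive sampling distribution P, the 1/p_ξ weight in H(y, ξ) exactly cancels the sampling probability, so the expectation equals ‖Uy − b‖_p^p.

# Verify E_{xi~P}[ |U_xi y - b_xi|^p / p_xi ] = ||U y - b||_p^p for an
# arbitrary strictly positive distribution P (here p = 1).
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 1000, 5, 1
U = rng.standard_normal((n, d))
b = rng.standard_normal(n)
y = rng.standard_normal(d)

probs = rng.random(n) + 0.1                  # any strictly positive distribution
probs /= probs.sum()

H = np.abs(U @ y - b) ** p / probs           # H(y, xi) for every xi
expectation = np.sum(probs * H)              # E_{xi~P}[H(y, xi)], computed exactly
objective = np.sum(np.abs(U @ y - b) ** p)   # ||Uy - b||_p^p

print(np.isclose(expectation, objective))    # True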

  7. Brief overview of stochastic optimization
  The standard stochastic optimization problem is of the form
      min_{x∈X} f(x) = E_{ξ∼P}[F(x, ξ)],     (1)
  where ξ is a random data point with underlying distribution P.
  Two computational approaches for solving stochastic optimization problems of the form (1) based on Monte Carlo sampling techniques:
  ◮ SA (Stochastic Approximation):
    ◮ Start with an initial weight x_0, and solve (1) iteratively.
    ◮ In each iteration, a new sample point ξ_t is drawn from distribution P and the current weight is updated by its information (e.g., the (sub)gradient of F(x, ξ_t)).
  ◮ SAA (Sampling Average Approximation):
    ◮ Sample n points ξ_1, . . . , ξ_n from distribution P independently, and solve the Empirical Risk Minimization (ERM) problem
        min_{x∈X} f̂(x) = (1/n) Σ_{i=1}^n F(x, ξ_i).
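The following sketch (an illustrative toy setup, not from the slides) contrasts the two approaches on a synthetic least-squares instance with F(x, ξ) = (a_ξ^⊤ x − b_ξ)²: SA takes one gradient step per freshly drawn sample, while SAA draws a batch once and solves the resulting ERM problem exactly.

# SA vs. SAA on a toy problem; the data-generating distribution, step size,
# and sample counts are all made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
d = 10
x_true = rng.standard_normal(d)

def draw(m):
    """Draw m i.i.d. data points (a_i, b_i) from the assumed distribution P."""
    A = rng.standard_normal((m, d))
    return A, A @ x_true + 0.1 * rng.standard_normal(m)

# SA: stochastic approximation (here, plain SGD with a decaying step size).
x_sa = np.zeros(d)
for t in range(1, 5001):
    a, b = draw(1)
    grad = 2.0 * (a[0] @ x_sa - b[0]) * a[0]   # (sub)gradient of F(x, xi_t)
    x_sa -= grad / (2.0 * (t + 10.0))          # decaying step size (illustrative)

# SAA: sample a batch once, then solve the ERM problem (here, least squares).
A_n, b_n = draw(2000)
x_saa, *_ = np.linalg.lstsq(A_n, b_n, rcond=None)

print(np.linalg.norm(x_sa - x_true), np.linalg.norm(x_saa - x_true))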

  8. Solving ℓp regression via stochastic optimization
  To solve this stochastic optimization problem, typically one needs to answer the following three questions.
  ◮ (C1): How to sample: SAA (i.e., draw samples in a batch mode and deal with the subproblem) or SA (i.e., draw a mini-batch of samples in an online fashion and update the weight after extracting useful information)?
  ◮ (C2): Which probability distribution P (uniform distribution or not) and which basis U (preconditioning or not) to use?
  ◮ (C3): Which solver to use (e.g., how to solve the subproblem in SAA or how to update the weight in SA)?

  9. A unified framework for RLA and SGD
  (“Weighted SGD for Lp Regression with Randomized Preconditioning,” Yang, Chow, Ré, and Mahoney, 2015.)
  [Figure: flowchart mapping the three choices, (C1) how to sample (SA, online, vs. SAA, batch), (C2) which U and P to use (naive vs. RLA-based well-conditioned U and non-uniform P), and (C3) how to solve, for min_x ‖Ax − b‖_p viewed as min_y E_{ξ∼P}[|U_ξ y − b_ξ|^p / p_ξ], to the resulting solvers: vanilla SGD, pwSGD (this presentation), and vanilla RLA with algorithmic leveraging.]
  ◮ SA + “naive” P and U: vanilla SGD, whose convergence rate depends (without additional niceness assumptions) on n.
  ◮ SA + “smart” P and U: pwSGD.
  ◮ SAA + “naive” P: uniform-sampling RLA algorithm, which may fail if some rows are extremely important (not shown).
  ◮ SAA + “smart” P: RLA (with algorithmic leveraging or random projections), which has strong worst-case theoretical guarantees and high-quality numerical implementations.
  ◮ For unconstrained ℓ2 regression (i.e., LS), SA + “smart” P + “naive” U recovers the weighted randomized Kaczmarz algorithm [Strohmer-Vershynin]; see the sketch after this list.
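For the last bullet, here is a minimal sketch of the weighted randomized Kaczmarz method of Strohmer and Vershynin for a consistent system Ax = b (the problem sizes and iteration count are illustrative): rows are sampled with probability proportional to ‖A_i‖_2², and the iterate is projected onto the hyperplane A_i x = b_i.

# Weighted randomized Kaczmarz: sample row i with probability
# ||A_i||_2^2 / ||A||_F^2 and project onto the hyperplane A_i x = b_i.
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                                  # consistent system

row_norms_sq = np.sum(A * A, axis=1)
probs = row_norms_sq / row_norms_sq.sum()       # "smart" non-uniform P

x = np.zeros(d)
for i in rng.choice(n, size=5000, p=probs):
    x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]   # project onto A_i x = b_i

print(np.linalg.norm(x - x_true))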

  10. A combined algorithm: pwSGD
  (“Weighted SGD for Lp Regression with Randomized Preconditioning,” Yang, Chow, Ré, and Mahoney, 2015.)
  pwSGD (preconditioned weighted SGD) consists of two main steps:
  1. Apply RLA techniques for preconditioning and construct an importance sampling distribution.
  2. Apply an SGD-like iterative phase with weighted sampling on the preconditioned system.

  11. A closer look: “naive” choices of U and P in SA
  Consider solving ℓ1 regression, and let U = A. If we apply SGD with some distribution P = {p_i}_{i=1}^n, then the relative approximation error is
      (f(x̂) − f(x*)) / f(x̂) = O( (‖x*‖_2 · max_{1≤i≤n} ‖A_i‖_1 / p_i) / ‖Ax* − b‖_1 ),
  where f(x) = ‖Ax − b‖_1 and x* is the optimal solution.
  ◮ If {p_i}_{i=1}^n is the uniform distribution, i.e., p_i = 1/n, then
      (f(x̂) − f(x*)) / f(x̂) = O( n ‖x*‖_2 · M / ‖Ax* − b‖_1 ),
    where M = max_{1≤i≤n} ‖A_i‖_1 is the maximum ℓ1 row norm of A.
  ◮ If {p_i}_{i=1}^n is proportional to the row norms of A, i.e., p_i = ‖A_i‖_1 / Σ_{j=1}^n ‖A_j‖_1, then
      (f(x̂) − f(x*)) / f(x̂) = O( ‖x*‖_2 · ‖A‖_1 / ‖Ax* − b‖_1 ).
  In either case, the expected convergence time for SGD might blow up (i.e., grow with n) as the size of the matrix grows (unless one makes extra assumptions).
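The factor max_{1≤i≤n} ‖A_i‖_1 / p_i driving these bounds is easy to compute. The sketch below (an illustrative example with one deliberately heavy row, not from the slides) evaluates it for both naive choices of P; either way it grows with n.

# Evaluate max_i ||A_i||_1 / p_i under the two "naive" choices of P, for a
# matrix with one disproportionately heavy row (parameters are made up).
import numpy as np

rng = np.random.default_rng(4)
n, d = 10_000, 10
A = rng.standard_normal((n, d))
A[0] *= 1_000                                   # one extremely important row

row_norms = np.abs(A).sum(axis=1)               # ||A_i||_1 for every row

p_uniform = np.full(n, 1.0 / n)                 # p_i = 1/n
p_rownorm = row_norms / row_norms.sum()         # p_i proportional to ||A_i||_1

print(np.max(row_norms / p_uniform))            # = n * M, M = max_i ||A_i||_1
print(np.max(row_norms / p_rownorm))            # = sum_j ||A_j||_1 = ||A||_1
# Both factors enter the numerator of the bounds above and grow with n.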

  12. A closer look: “smart” choices of U and P in SA
  ◮ Recall that if U is a well-conditioned basis, then (by definition) ‖U‖_1 ≤ α and ‖y‖_∞ ≤ β ‖Uy‖_1 for all y, with α and β depending on the small dimension d and not on the large dimension n.
  ◮ If we use a well-conditioned basis U for the range space of A, and if we choose the sampling probabilities proportional to the row norms of U, i.e., the leverage scores of A, then the resulting convergence rate on the relative error of the objective becomes
      (f(x̂) − f(x*)) / f(x̂) = O( ‖y*‖_2 · ‖U‖_1 / ‖Ūy*‖_1 ),
    where y* is an optimal solution to the transformed problem.
  ◮ Since the condition number αβ of a well-conditioned basis depends only on d, the resulting SGD inherits a convergence rate, in a relative scale, that depends on d and is independent of n.
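For p = 2 these quantities are concrete: an orthonormal basis from a thin QR factorization is well-conditioned, and its squared row norms are exactly the leverage scores of A. The sketch below computes them and the induced sampling distribution (illustrative; practical RLA methods approximate the scores via random projections rather than an exact O(nd²) QR).

# Leverage scores for p = 2: take U = Q from a thin QR of A, an orthonormal
# (hence well-conditioned) basis for range(A); the row norms ||U_i||_2^2 are
# the leverage scores, and P is proportional to them.
import numpy as np

rng = np.random.default_rng(5)
n, d = 5000, 20
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 3, d))  # ill-conditioned

Q, R = np.linalg.qr(A)                     # A = QR, so U = A R^{-1} = Q
leverage = np.sum(Q * Q, axis=1)           # leverage scores; they sum to d
probs = leverage / leverage.sum()          # sampling distribution P

print(leverage.sum(), np.linalg.cond(A), np.linalg.cond(Q))  # d, large, ~1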

  13. Outline
  ◮ Background
  ◮ A perspective of stochastic optimization
  ◮ Main Algorithm and Theoretical Results
  ◮ Empirical Results
  ◮ Connection with Coreset Methods

  14. A combined algorithm: pwSGD
  (“Weighted SGD for Lp Regression with Randomized Preconditioning,” Yang, Chow, Ré, and Mahoney, 2015.)
  1. Compute R ∈ R^{d×d} such that U = AR^{-1} is an (α, β) well-conditioned basis U for the range space of A.
  2. Compute or estimate ‖U_i‖_p^p by the leverage scores λ_i, for i ∈ [n].
  3. Let p_i = λ_i / Σ_{j=1}^n λ_j, for i ∈ [n].
  4. Construct the preconditioner F ∈ R^{d×d} based on R.
  5. For t = 1, . . . , T:
     Pick ξ_t from [n] according to the distribution {p_i}_{i=1}^n, and set
         c_t = sgn(A_{ξ_t} x_t − b_{ξ_t}) / p_{ξ_t}   if p = 1;
         c_t = 2 (A_{ξ_t} x_t − b_{ξ_t}) / p_{ξ_t}    if p = 2.
     Update x by
         x_{t+1} = x_t − η c_t H^{-1} A_{ξ_t}^⊤                           if Z = R^d;
         x_{t+1} = argmin_{x∈Z} η c_t A_{ξ_t} x + (1/2) ‖x_t − x‖_H^2     otherwise,
     where H = (F F^⊤)^{-1}.
  6. x̄ ← (1/T) Σ_{t=1}^T x_t.
  7. Return x̄ for p = 1 or x_T for p = 2.
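A minimal Python sketch of these steps for the unconstrained ℓ2 case (Z = R^d), assuming an exact thin QR in place of a randomized preconditioner, F = R^{-1} (so H^{-1} = R^{-1} R^{-⊤}), and an untuned constant step size; everything here is an illustrative rendering of the pseudocode above, not the paper's implementation.

# pwSGD sketch for unconstrained l2 regression (Z = R^d, p = 2).
import numpy as np

rng = np.random.default_rng(6)
n, d, T = 20_000, 20, 20_000
eta = 1.0 / (4 * d)                          # illustrative constant step size
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 3, d))  # ill-conditioned
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

# Steps 1-4: well-conditioned basis U = A R^{-1}, leverage scores, P, preconditioner.
Q, R = np.linalg.qr(A)                       # U = Q is orthonormal
lev = np.sum(Q * Q, axis=1)                  # lambda_i = ||U_i||_2^2
probs = lev / lev.sum()                      # p_i
Rinv = np.linalg.inv(R)                      # F = R^{-1}, so H^{-1} = R^{-1} R^{-T}

# Step 5: SGD with weighted sampling on the preconditioned system.
x = np.zeros(d)
for i in rng.choice(n, size=T, p=probs):
    c = 2.0 * (A[i] @ x - b[i]) / probs[i]           # c_t for p = 2
    x = x - eta * c * (Rinv @ (Rinv.T @ A[i]))       # x_{t+1} = x_t - eta c_t H^{-1} A_i^T

# Step 7: for p = 2, return the last iterate x_T; compare with a direct solve.
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact))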
