

  1. Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise. Andrei Kulunchakov and Julien Mairal, Inria Grenoble. ML in the real world, Criteo.

  2. Publications
     A. Kulunchakov and J. Mairal. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization. International Conference on Machine Learning (ICML), 2019.
     A. Kulunchakov and J. Mairal. Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise. Preprint arXiv:1901.08788, 2019.

  3. Context. Many subspace identification approaches require solving a composite optimization problem
     $\min_{x \in \mathbb{R}^p} \{ F(x) := f(x) + \psi(x) \},$
     where $f$ is $L$-smooth and convex, and $\psi$ is convex.

  4. Context. Many subspace identification approaches require solving a composite optimization problem
     $\min_{x \in \mathbb{R}^p} \{ F(x) := f(x) + \psi(x) \},$
     where $f$ is $L$-smooth and convex, and $\psi$ is convex.
     Two settings of interest. Particularly interesting structures in machine learning are
     $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$  or  $f(x) = \mathbb{E}[\tilde{f}(x, \xi)].$
     Those can typically be addressed with
     - variants of SGD for the general stochastic case;
     - variance-reduced algorithms such as SVRG, SAGA, MISO, SARAH, SDCA, Katyusha...
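To make the composite finite-sum structure concrete, here is a minimal sketch (not from the talk) of an l1-regularized logistic regression written as $F(x) = \frac{1}{n}\sum_i f_i(x) + \psi(x)$; all names (A, b, lam) and the synthetic data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: l1-regularized logistic regression in composite finite-sum form.
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))            # features (synthetic, illustrative)
b = rng.choice([-1.0, 1.0], size=n)        # labels
lam = 0.01                                 # l1 regularization weight

def f_i(x, i):
    """Smooth part: logistic loss on example i (convex and L-smooth)."""
    return np.log1p(np.exp(-b[i] * A[i].dot(x)))

def psi(x):
    """Non-smooth convex part: l1 penalty."""
    return lam * np.abs(x).sum()

def F(x):
    """Composite objective F(x) = f(x) + psi(x), with f a finite sum."""
    return np.mean([f_i(x, i) for i in range(n)]) + psi(x)

print(F(np.zeros(p)))   # value at the origin: log(2) + 0
```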

  5. Basics of gradient-based optimization. Smooth vs non-smooth. [Figures: (a) smooth, (b) non-smooth.] An important quantity to quantify smoothness is the Lipschitz constant of the gradient:
     $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.$

  6. Basics of gradient-based optimization. Smooth vs non-smooth. [Figures: (a) smooth, (b) non-smooth.] An important quantity to quantify smoothness is the Lipschitz constant of the gradient:
     $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.$
     If $f$ is twice differentiable, $L$ may be chosen as the largest eigenvalue of the Hessian $\nabla^2 f$. This is an upper bound on the function curvature.
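A small numeric sketch (not from the slides, synthetic data): for the least-squares objective $f(x) = \frac{1}{2n}\|Ax - b\|^2$ the Hessian is $\frac{1}{n}A^\top A$, so $L$ can be taken as its largest eigenvalue.

```python
import numpy as np

# Illustrative: f(x) = (1/(2n)) * ||A x - b||^2 has Hessian (1/n) * A^T A,
# and L is its largest eigenvalue.
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))

H = A.T @ A / n                      # Hessian of the least-squares loss
L = np.linalg.eigvalsh(H).max()      # Lipschitz constant of the gradient
print(f"L = {L:.3f}")
```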

  7. Basics of gradient-based optimization. Convex vs non-convex. [Figures: (a) non-convex, (b) convex, (c) strongly convex.] An important quantity to quantify convexity is the strong-convexity constant µ:
     $f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\mu}{2}\|x - y\|^2.$

  8. Basics of gradient-based optimization. Convex vs non-convex. [Figures: (a) non-convex, (b) convex, (c) strongly convex.] An important quantity to quantify convexity is the strong-convexity constant µ:
     $f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\mu}{2}\|x - y\|^2.$
     If $f$ is twice differentiable, µ may be chosen as the smallest eigenvalue of the Hessian $\nabla^2 f$. This is a lower bound on the function curvature.
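Continuing the illustrative least-squares sketch: adding an l2 penalty $\frac{\mu_0}{2}\|x\|^2$ shifts every Hessian eigenvalue up by $\mu_0$, which gives a strictly positive strong-convexity constant and a finite condition number $L/\mu$; the constants below are assumptions, not values from the talk.

```python
import numpy as np

# Illustrative: l2-regularized least squares; mu is the smallest Hessian eigenvalue.
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))
mu0 = 0.1                                  # illustrative l2 regularization weight

H = A.T @ A / n + mu0 * np.eye(p)          # Hessian of the regularized objective
eig = np.linalg.eigvalsh(H)
L, mu = eig.max(), eig.min()
print(f"L = {L:.3f}, mu = {mu:.3f}, condition number L/mu = {L/mu:.1f}")
```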

  9. Basics of gradient-based optimization. [Picture from F. Bach.] Why is the condition number $L/\mu$ important?

  10. Basics of gradient-based optimization. [Picture from F. Bach.] Trajectory of gradient descent with optimal step size.
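A small sketch (my own, illustrative) of why the condition number matters for this trajectory: on a two-dimensional quadratic, gradient descent with the optimal constant step size $2/(L+\mu)$ contracts the error by $(L-\mu)/(L+\mu)$ per step, so the number of iterations grows with $L/\mu$.

```python
import numpy as np

# Illustrative: gradient descent on f(x) = 0.5 * x^T diag(L, mu) x with the
# optimal constant step size 2/(L+mu); iteration counts grow with L/mu.
def gd_iters(L, mu, tol=1e-6):
    x = np.array([1.0, 1.0])
    step = 2.0 / (L + mu)
    grad = lambda z: np.array([L, mu]) * z      # gradient of the quadratic
    k = 0
    while np.linalg.norm(x) > tol:
        x = x - step * grad(x)
        k += 1
    return k

for kappa in [10, 100, 1000]:
    print(f"L/mu = {kappa:5d} -> {gd_iters(L=float(kappa), mu=1.0)} iterations")
```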

  11. Variance reduction (1/2). Consider two random variables $X, Y$ and define $Z = X - Y + \mathbb{E}[Y]$. Then,
     $\mathbb{E}[Z] = \mathbb{E}[X]$  and  $\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y).$
     The variance of $Z$ may be smaller if $X$ and $Y$ are positively correlated.

  12. Variance reduction (1/2). Consider two random variables $X, Y$ and define $Z = X - Y + \mathbb{E}[Y]$. Then,
     $\mathbb{E}[Z] = \mathbb{E}[X]$  and  $\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y).$
     The variance of $Z$ may be smaller if $X$ and $Y$ are positively correlated.
     Why is it useful for stochastic optimization?
     - step sizes for SGD have to decrease to ensure convergence;
     - with variance reduction, one may use larger constant step sizes.
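A quick numeric check of the identity above (illustrative numbers, not from the slides): with $X$ and $Y$ strongly positively correlated, $Z = X - Y + \mathbb{E}[Y]$ keeps the mean of $X$ but has much smaller variance.

```python
import numpy as np

# Numeric illustration of the variance-reduction trick Z = X - Y + E[Y].
rng = np.random.default_rng(0)
noise = rng.standard_normal(100_000)
X = 3.0 + noise                          # E[X] = 3
Y = 1.0 + 0.9 * noise                    # positively correlated with X, E[Y] = 1
Z = X - Y + Y.mean()                     # Y.mean() stands in for E[Y]

print(f"E[X] ~ {X.mean():.3f}, E[Z] ~ {Z.mean():.3f}")     # same mean
print(f"Var(X) ~ {X.var():.3f}, Var(Z) ~ {Z.var():.3f}")   # Var(Z) << Var(X)
```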

  13. Variance reduction for smooth functions (2/2)
     SVRG: $x_t = x_{t-1} - \gamma\left(\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(y) + \nabla f(y)\right),$
     where $y$ is updated every epoch and $\mathbb{E}[\nabla f_{i_t}(y) \mid \mathcal{F}_{t-1}] = \nabla f(y)$.
     SAGA: $x_t = x_{t-1} - \gamma\left(\nabla f_{i_t}(x_{t-1}) - y_{i_t}^{t-1} + \frac{1}{n}\sum_{i=1}^n y_i^{t-1}\right),$
     where $\mathbb{E}[y_{i_t}^{t-1} \mid \mathcal{F}_{t-1}] = \frac{1}{n}\sum_{i=1}^n y_i^{t-1}$ and $y_i^t = \nabla f_i(x_{t-1})$ if $i = i_t$, $y_i^t = y_i^{t-1}$ otherwise.
     MISO/Finito: for $n \ge L/\mu$, same form as SAGA but $\frac{1}{n}\sum_{i=1}^n y_i^{t-1} = -\mu x_{t-1}$ and $y_i^t = \nabla f_i(x_{t-1}) - \mu x_{t-1}$ if $i = i_t$, $y_i^t = y_i^{t-1}$ otherwise.
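A compact sketch of the SVRG update above (my own, not the authors' reference code), on a smooth least-squares finite sum without $\psi$; the synthetic data, step-size rule, and epoch count are illustrative assumptions.

```python
import numpy as np

# Illustrative SVRG on f(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p)

def grad_i(x, i):
    """Gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2."""
    return (A[i].dot(x) - b[i]) * A[i]

def full_grad(x):
    """Gradient of the full average f(x)."""
    return A.T @ (A @ x - b) / n

L_max = (np.linalg.norm(A, axis=1) ** 2).max()   # largest per-example smoothness
gamma = 1.0 / (3.0 * L_max)                      # constant step size (illustrative)
x = np.zeros(p)
for epoch in range(30):
    y, gy = x.copy(), full_grad(x)               # anchor point y and grad f(y)
    for _ in range(n):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(y, i) + gy     # variance-reduced gradient estimate
        x = x - gamma * g
print(f"final objective: {0.5 * np.mean((A @ x - b) ** 2):.2e}")
```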

  14. Complexity of SGD variants. We consider the worst-case complexity for finding a point $\bar{x}$ such that $\mathbb{E}[F(\bar{x}) - F^\star] \le \varepsilon$ for
     $\min_{x \in \mathbb{R}^p} \{ F(x) := \mathbb{E}[\tilde{f}(x, \xi)] + \psi(x) \}.$
     In this talk, we consider the µ-strongly convex case only.
     Complexity of SGD with iterate averaging:
     $O\!\left(\frac{L}{\mu}\log\frac{C_0}{\varepsilon}\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right),$
     under the (strong) assumption that the gradient estimates have bounded variance $\sigma^2$.

  15. Complexity of SGD variants. We consider the worst-case complexity for finding a point $\bar{x}$ such that $\mathbb{E}[F(\bar{x}) - F^\star] \le \varepsilon$ for
     $\min_{x \in \mathbb{R}^p} \{ F(x) := \mathbb{E}[\tilde{f}(x, \xi)] + \psi(x) \}.$
     In this talk, we consider the µ-strongly convex case only.
     Complexity of SGD with iterate averaging:
     $O\!\left(\frac{L}{\mu}\log\frac{C_0}{\varepsilon}\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right),$
     under the (strong) assumption that the gradient estimates have bounded variance $\sigma^2$.
     Complexity of accelerated SGD [Ghadimi and Lan, 2013]:
     $O\!\left(\sqrt{\frac{L}{\mu}}\log\frac{C_0}{\varepsilon}\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right).$
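A small worked comparison of the two bounds, with made-up constants (not from the talk): acceleration improves the deterministic term from $L/\mu$ to $\sqrt{L/\mu}$, while the noise term $\sigma^2/(\mu\varepsilon)$ is common to both and dominates as $\varepsilon$ shrinks.

```python
import math

# Made-up constants, purely to compare the shapes of the two bounds above.
L, mu, C0, sigma2 = 100.0, 1.0, 1.0, 1.0

for eps in [1e-1, 1e-3, 1e-5]:
    log_term = math.log(C0 / eps)
    sgd = (L / mu) * log_term + sigma2 / (mu * eps)          # SGD with averaging
    acc = math.sqrt(L / mu) * log_term + sigma2 / (mu * eps) # accelerated SGD
    print(f"eps={eps:.0e}: SGD ~ {sgd:.1e}, accelerated SGD ~ {acc:.1e}")
```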

  16. Complexity for finite sums. We consider the worst-case complexity for finding a point $\bar{x}$ such that $\mathbb{E}[F(\bar{x}) - F^\star] \le \varepsilon$ for
     $\min_{x \in \mathbb{R}^p} \left\{ F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \right\}.$
     Complexity of SAGA/SVRG/SDCA/MISO/S2GD:
     $O\!\left(\left(n + \frac{\bar{L}}{\mu}\right)\log\frac{C_0}{\varepsilon}\right)$ with $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$.
     Complexity of GD and acc-GD:
     $O\!\left(n\frac{L}{\mu}\log\frac{C_0}{\varepsilon}\right)$ vs. $O\!\left(n\sqrt{\frac{L}{\mu}}\log\frac{C_0}{\varepsilon}\right).$
     See also SDCA [Shalev-Shwartz and Zhang, 2014] and Catalyst [Lin et al., 2018].

  17. Complexity for finite sums. We consider the worst-case complexity for finding a point $\bar{x}$ such that $\mathbb{E}[F(\bar{x}) - F^\star] \le \varepsilon$ for
     $\min_{x \in \mathbb{R}^p} \left\{ F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \right\}.$
     Complexity of SAGA/SVRG/SDCA/MISO/S2GD:
     $O\!\left(\left(n + \frac{\bar{L}}{\mu}\right)\log\frac{C_0}{\varepsilon}\right)$ with $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$.
     Complexity of Katyusha [Allen-Zhu, 2017]:
     $O\!\left(\left(n + \sqrt{\frac{n\bar{L}}{\mu}}\right)\log\frac{C_0}{\varepsilon}\right).$
     See also SDCA [Shalev-Shwartz and Zhang, 2014] and Catalyst [Lin et al., 2018].
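For intuition on these finite-sum rates, here is a plug-in comparison of the leading factors with made-up values of $n$ and $\bar{L}/\mu$ (the common $\log(C_0/\varepsilon)$ factor is dropped, and $L$ is set equal to $\bar{L}$ for simplicity).

```python
import math

# Made-up constants; ill-conditioned regime where acceleration pays off.
n, Lbar, mu = 10_000, 1e6, 1.0

gd = n * Lbar / mu                         # gradient descent (per-eps target)
vr = n + Lbar / mu                         # SAGA/SVRG/SDCA/MISO/S2GD
katyusha = n + math.sqrt(n * Lbar / mu)    # accelerated variance reduction
print(f"GD ~ {gd:.2e}, variance reduction ~ {vr:.2e}, Katyusha ~ {katyusha:.2e}")
```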

  18. Contributions without acceleration. We extend and generalize the concept of estimate sequences introduced by Nesterov to
     - provide a unified proof of convergence for SAGA/random-SVRG/MISO;
     - make them adaptive to an unknown strong-convexity constant µ (previously known for SAGA only);
     - make them robust to stochastic noise, e.g., for solving
     $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ with $f_i(x) = \mathbb{E}[\tilde{f}_i(x, \xi)],$
     with complexity
     $O\!\left(\left(n + \frac{\bar{L}}{\mu}\right)\log\frac{C_0}{\varepsilon}\right) + O\!\left(\frac{\tilde{\sigma}^2}{\mu\varepsilon}\right)$ with $\tilde{\sigma}^2 \ll \sigma^2$,
     where $\tilde{\sigma}^2$ is the variance due to small perturbations;
     - obtain new variants of the above algorithms with the same guarantees.

  19. The stochastic finite-sum problem
     $\min_{x \in \mathbb{R}^p} \left\{ F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \right\}$ with $f_i(x) = \mathbb{E}[\tilde{f}_i(x, \xi)].$
     [Figures: data augmentation on digits (left); Dropout on text (right).]
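A minimal sketch of this structure (illustrative: simple additive noise stands in for the digit/text augmentations shown on the slide): each $f_i$ is an expectation over a perturbation $\xi$, and a stochastic gradient is computed on a randomly perturbed copy of example $i$.

```python
import numpy as np

# Illustrative: f_i(x) = E_xi[ tilde_f_i(x, xi) ], with xi perturbing example i.
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p)

def perturbed_grad(x, i, noise_level=0.01):
    """Unbiased gradient of f_i at x, evaluated on an augmented copy of example i."""
    a_tilde = A[i] + noise_level * rng.standard_normal(p)   # draw a perturbation xi
    return (a_tilde.dot(x) - b[i]) * a_tilde                # grad of 0.5*(a_tilde^T x - b_i)^2

x = np.zeros(p)
print(perturbed_grad(x, 0)[:3])   # one stochastic, perturbed gradient sample
```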

  20. Contributions with acceleration.
     We propose a new accelerated SGD algorithm for composite optimization with optimal complexity
     $O\!\left(\sqrt{\frac{L}{\mu}}\log\frac{C_0}{\varepsilon}\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right).$
     We propose an accelerated variant of SVRG for the stochastic finite-sum problem with complexity
     $O\!\left(\left(n + \sqrt{\frac{n\bar{L}}{\mu}}\right)\log\frac{C_0}{\varepsilon}\right) + O\!\left(\frac{\tilde{\sigma}^2}{\mu\varepsilon}\right)$ with $\tilde{\sigma}^2 \ll \sigma^2$.
     When $\tilde{\sigma} = 0$, the complexity matches that of Katyusha.

  21. A classical iteration
     $x_k \leftarrow \mathrm{Prox}_{\eta_k \psi}\!\left[x_{k-1} - \eta_k g_k\right]$ with $\mathbb{E}[g_k \mid \mathcal{F}_k] = \nabla f(x_{k-1}).$

  22. A classical iteration
     $x_k \leftarrow \mathrm{Prox}_{\eta_k \psi}\!\left[x_{k-1} - \eta_k g_k\right]$ with $\mathbb{E}[g_k \mid \mathcal{F}_k] = \nabla f(x_{k-1}),$
     which covers SGD, SAGA, SVRG, and composite variants.
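A minimal sketch of this iteration (illustrative data, step sizes, and constants) with $\psi = \lambda\|\cdot\|_1$, whose proximal operator is soft-thresholding. Here $g_k$ is a plain single-example SGD gradient; any of the variance-reduced estimates from the earlier slides could be plugged into the same template.

```python
import numpy as np

# Illustrative proximal stochastic gradient iteration for
#   F(x) = (1/(2n))*||A x - b||^2 + lam*||x||_1.
rng = np.random.default_rng(0)
n, p = 1000, 50
A = rng.standard_normal((n, p))
b = A @ (rng.standard_normal(p) * (rng.random(p) < 0.2))   # sparse ground truth
lam = 0.01

def prox_l1(z, t):
    """Proximal operator of t * lam * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)

x = np.zeros(p)
for k in range(1, 50_001):
    i = rng.integers(n)
    g = (A[i].dot(x) - b[i]) * A[i]            # E[g | past] = grad f(x)
    eta = 0.01 / (1.0 + k / 5000.0)            # decreasing step size for plain SGD
    x = prox_l1(x - eta * g, eta)              # x <- Prox_{eta*psi}[x - eta*g]

obj = 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()
print(f"final objective ~ {obj:.4f}")
```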
