SLIDE 1

Stochastic Composite Optimization:
Variance Reduction, Acceleration, and Robustness to Noise

Andrei Kulunchakov, Julien Mairal

Inria Grenoble

ML in the real world, Criteo

Julien Mairal Stochastic Composite Optimization 1/24

SLIDE 2

Publications

Andrei Kulunchakov

  • A. Kulunchakov and J. Mairal. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization. International Conference on Machine Learning (ICML), 2019.
  • A. Kulunchakov and J. Mairal. Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise. arXiv preprint arXiv:1901.08788, 2019.


SLIDE 4

Context

Many subspace identification approaches require solving a composite optimization problem

min_{x ∈ R^p} { F(x) := f(x) + ψ(x) },

where f is L-smooth and convex, and ψ is convex.

Two settings of interest

Particularly interesting structures in machine learning are

  • finite sums, f(x) = (1/n) Σ_{i=1}^n f_i(x), which can be addressed with variance-reduced algorithms such as SVRG, SAGA, MISO, SARAH, SDCA, Katyusha...
  • expectations, f(x) = E[f̃(x, ξ)], which can typically be addressed with variants of SGD for the general stochastic case.


SLIDE 6

Basics of gradient-based optimization

Smooth vs non-smooth

(a) smooth (b) non-smooth

An important quantity to quantify smoothness is the Lipschitz constant of the gradient:

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

If f is twice differentiable, L may be chosen as the largest eigenvalue of the Hessian ∇²f. This is an upper bound on the function curvature.
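The link between the Hessian spectrum and the constants L and µ can be checked numerically. A minimal sketch (the least-squares objective and the random matrix A are illustrative assumptions, not from the talk):

```python
import numpy as np

# For f(x) = 0.5 * ||A x - b||^2, the Hessian is A^T A (constant in x), so the
# Lipschitz constant L and strong-convexity constant mu are its extreme eigenvalues.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
H = A.T @ A                        # Hessian of the least-squares objective
eigs = np.linalg.eigvalsh(H)       # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]
print(f"mu = {mu:.2f}, L = {L:.2f}, condition number L/mu = {L / mu:.1f}")
```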


SLIDE 8

Basics of gradient-based optimization

Convex vs non-convex

(a) non-convex (b) convex (c) strongly-convex

An important quantity to quantify convexity is the strong-convexity constant µ:

f(x) ≥ f(y) + ∇f(y)⊤(x − y) + (µ/2) ‖x − y‖².

If f is twice differentiable, µ may be chosen as the smallest eigenvalue of the Hessian ∇²f. This is a lower bound on the function curvature.

SLIDE 9

Basics of gradient-based optimization

Picture from F. Bach

Why is the condition number L/µ important?

SLIDE 10

Basics of gradient-based optimization

Picture from F. Bach

Trajectory of gradient descent with optimal step size.
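The effect of the condition number on plain gradient descent is easy to reproduce. A sketch on a diagonal quadratic (the constants µ = 1, L = 100 are illustrative):

```python
import numpy as np

# Gradient descent with step 1/L on f(x) = 0.5 * (mu * x1^2 + L * x2^2):
# the high-curvature coordinate converges in one step, while the
# low-curvature one contracts by only (1 - mu/L) per iteration.
mu, L = 1.0, 100.0
x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - (1.0 / L) * np.array([mu, L]) * x   # gradient of the diagonal quadratic
print(x)   # second coordinate is 0; first is (1 - mu/L)^100 ≈ 0.37
```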


SLIDE 12

Variance reduction (1/2)

Variance reduction

Consider two random variables X, Y and define Z = X − Y + E[Y]. Then,

E[Z] = E[X],
Var(Z) = Var(X) + Var(Y) − 2 Cov(X, Y).

The variance of Z may be smaller if X and Y are positively correlated.

Why is it useful for stochastic optimization?

  • step sizes for SGD have to decrease to ensure convergence;
  • with variance reduction, one may use larger constant step sizes.
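The identity above is easy to verify by simulation. A minimal sketch (the particular coupling Y = 0.9·X + noise is an illustrative assumption):

```python
import numpy as np

# Z = X - Y + E[Y] is unbiased for E[X]; its variance is
# Var(X) + Var(Y) - 2 Cov(X, Y), which is small when X and Y
# are strongly positively correlated.
rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)
Y = 0.9 * X + 0.1 * rng.standard_normal(100_000)   # positively correlated with X
Z = X - Y + Y.mean()                               # E[Y] replaced by its empirical mean
print(Z.mean() - X.mean())    # ~0: same expectation
print(X.var(), Z.var())       # Var(Z) is roughly 2% of Var(X) here
```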

SLIDE 13

Variance reduction for smooth functions (2/2)

SVRG

x_t = x_{t−1} − γ (∇f_{i_t}(x_{t−1}) − ∇f_{i_t}(y) + ∇f(y)),

where y is updated every epoch and E[∇f_{i_t}(y) | F_{t−1}] = ∇f(y).

SAGA

x_t = x_{t−1} − γ (∇f_{i_t}(x_{t−1}) − y^{i_t}_{t−1} + (1/n) Σ_{i=1}^n y^i_{t−1}),

where E[y^{i_t}_{t−1} | F_{t−1}] = (1/n) Σ_{i=1}^n y^i_{t−1} and

y^i_t = ∇f_i(x_{t−1}) if i = i_t, and y^i_t = y^i_{t−1} otherwise.

MISO/Finito: for n ≥ L/µ, same form as SAGA but

(1/n) Σ_{i=1}^n y^i_{t−1} = −µ x_{t−1}, and y^i_t = ∇f_i(x_{t−1}) − µ x_{t−1} if i = i_t, and y^i_t = y^i_{t−1} otherwise.
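The SVRG estimator translates directly into code. A sketch under illustrative assumptions (a toy least-squares finite sum, constant step size, anchor refreshed every epoch); none of the names come from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def svrg(grad_i, full_grad, x0, n, step, epochs, m):
    # Each inner step uses the variance-reduced estimator
    # g = grad_i(x) - grad_i(y) + full_grad(y), unbiased given the anchor y.
    x = x0.copy()
    for _ in range(epochs):
        y = x.copy()
        gy = full_grad(y)          # exact gradient at the anchor, once per epoch
        for _ in range(m):
            i = rng.integers(n)
            x = x - step * (grad_i(i, x) - grad_i(i, y) + gy)
    return x

# Toy finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2, f = average of the f_i.
A = rng.standard_normal((200, 5))
b = rng.standard_normal(200)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / len(b)
x = svrg(grad_i, full_grad, np.zeros(5), n=200, step=0.01, epochs=50, m=200)
# x converges linearly to the least-squares solution despite the constant step.
```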


SLIDE 15

Complexity of SGD variants

We consider the worst-case complexity for finding a point x̄ such that E[F(x̄) − F⋆] ≤ ε for

min_{x ∈ R^p} { F(x) := E[f̃(x, ξ)] + ψ(x) }.

In this talk, we consider the µ-strongly convex case only.

Complexity of SGD with iterate averaging

O((L/µ) log(C0/ε)) + O(σ²/(µε)),

under the (strong) assumption that the gradient estimates have bounded variance σ².

Complexity of accelerated SGD [Ghadimi and Lan, 2013]

O(√(L/µ) log(C0/ε)) + O(σ²/(µε)).

SLIDE 16

Complexity for finite sums

We consider the worst-case complexity for finding a point x̄ such that E[F(x̄) − F⋆] ≤ ε for

min_{x ∈ R^p} { F(x) := (1/n) Σ_{i=1}^n f_i(x) + ψ(x) }.

Complexity of SAGA/SVRG/SDCA/MISO/S2GD

O((n + L̄/µ) log(C0/ε)) with L̄ = (1/n) Σ_{i=1}^n L_i.

Complexity of GD and acc-GD

O(n (L/µ) log(C0/ε)) vs. O(n √(L/µ) log(C0/ε)).

see also SDCA [Shalev-Shwartz and Zhang, 2014] and Catalyst [Lin et al., 2018].

SLIDE 17

Complexity for finite sums

Complexity of Katyusha [Allen-Zhu, 2017]

O((n + √(nL/µ)) log(C0/ε)).

see also SDCA [Shalev-Shwartz and Zhang, 2014] and Catalyst [Lin et al., 2018].

SLIDE 18

Contributions without acceleration

We extend and generalize the concept of estimate sequences introduced by Nesterov to

  • provide a unified proof of convergence for SAGA/random-SVRG/MISO;
  • provide them adaptivity to unknown µ (known before for SAGA only);
  • make them robust to stochastic noise, e.g., for solving f(x) = (1/n) Σ_{i=1}^n f_i(x) with f_i(x) = E[f̃_i(x, ξ)], with complexity O((n + L̄/µ) log(C0/ε)) + O(σ̃²/(µε)) with σ̃² ≪ σ², where σ̃² is the variance due to small perturbations;
  • obtain new variants of the above algorithms with the same guarantees.

SLIDE 19

The stochastic finite-sum problem

min_{x ∈ R^p} { F(x) := (1/n) Σ_{i=1}^n f_i(x) + ψ(x) } with f_i(x) = E[f̃_i(x, ξ)].

Data augmentation on digits (left); Dropout on text (right).

SLIDE 20

Contributions with acceleration

  • we propose a new accelerated SGD algorithm for composite optimization with optimal complexity O(√(L/µ) log(C0/ε)) + O(σ²/(µε));
  • we propose an accelerated variant of SVRG for the stochastic finite-sum problem with complexity O((n + √(nL/µ)) log(C0/ε)) + O(σ̃²/(µε)) with σ̃² ≪ σ². When σ̃ = 0, the complexity matches that of Katyusha.


SLIDE 24

A classical iteration

x_k ← Prox_{η_k ψ}[x_{k−1} − η_k g_k] with E[g_k | F_k] = ∇f(x_{k−1}),

covers SGD, SAGA, SVRG, and composite variants.

Interpretation

x_k minimizes the quadratic function d_k, defined as

d_k(x) = (1 − δ_k) d_{k−1}(x) + δ_k [ f(x_{k−1}) + g_k⊤(x − x_{k−1}) + (µ/2) ‖x − x_{k−1}‖² + ψ(x_k) + ψ′(x_k)⊤(x − x_k) ],

where δ_k = µ η_k, ψ′(x_k) is a subgradient in ∂ψ(x_k), and d_0(x) = d_0⋆ + (µ/2) ‖x − x_0‖².

This is similar to the construction of estimate sequences by Nesterov; see also [Devolder, 2011, Lin et al., 2014] for stochastic problems.
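For a concrete instance of the proximal step, take ψ = λ‖·‖₁, whose proximal operator is soft-thresholding. A minimal sketch (the numbers are illustrative):

```python
import numpy as np

def prox_l1(z, t):
    # Proximal operator of t * ||.||_1: componentwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_step(x, g, eta, lam):
    # One composite step: x_k = Prox_{eta * psi}[x_{k-1} - eta * g_k].
    return prox_l1(x - eta * g, eta * lam)

x = np.array([0.5, -0.2, 1.0])
g = np.array([1.0, -1.0, 0.0])                  # a (stochastic) gradient estimate
print(prox_grad_step(x, g, eta=0.1, lam=0.5))   # thresholds at eta * lam = 0.05
```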

SLIDE 25

A less classical iteration

x_k = Prox_{ψ/µ}[x̄_k] with x̄_k ← (1 − δ_k) x̄_{k−1} + δ_k x_{k−1} − η_k g_k and E[g_k | F_k] = ∇f(x_{k−1}),

covers MISO/Finito/primal SDCA with δ_k = µ η_k.

Interpretation

x_k minimizes the function d_k, defined as

d_k(x) = (1 − δ_k) d_{k−1}(x) + δ_k [ f(x_{k−1}) + g_k⊤(x − x_{k−1}) + (µ/2) ‖x − x_{k−1}‖² + ψ(x) ].

With estimate sequences, convergence proofs for both types of iterations are identical.


SLIDE 27

Convergence results

General convergence result

If η_t ≤ 1/L for all t ≥ 0, then for all k ≥ 1,

E[F(x̂_k) − F⋆ + (µ/2) ‖x_k − x⋆‖²] ≤ Γ_k ( F(x_0) − F⋆ + (µ/2) ‖x_0 − x⋆‖² + Σ_{t=1}^k δ_t η_t σ_t² / Γ_t ),

where Γ_k = Π_{t=1}^k (1 − δ_t), x̂_k = (1 − δ_k) x̂_{k−1} + δ_k x_k, and σ_t² = E[‖g_t − ∇f(x_{t−1})‖²].

Corollary: SGD with constant step size η_k = 1/L

E[F(x̂_k) − F⋆ + (µ/2) ‖x_k − x⋆‖²] ≤ 2 (1 − µ/L)^k (F(x_0) − F⋆) + σ²/L.

SLIDE 28

Convergence results

Corollary: SGD with constant step size η_k = 1/L

#Comp = O((L/µ) log(C0/ε)) with Bias = σ²/L.

SLIDE 29

Convergence results

Corollary: two-stage SGD with (i) constant step size; then (ii) decreasing step sizes

#Comp = O((L/µ) log(C0/ε)) + O(σ²/(µε)).

SLIDE 30

An accelerated SGD algorithm

An algorithm derived from the estimate sequence method:

x_k = Prox_{η_k ψ}[y_{k−1} − η_k g_k] with E[g_k | F_{k−1}] = ∇f(y_{k−1}),
y_k = x_k + β_k (x_k − x_{k−1}) with β_k = δ_k (1 − δ_k) η_{k+1} / (η_k δ_{k+1} + η_{k+1} δ_k²).

Interpretation

x_k minimizes the quadratic function d_k, defined as

d_k(x) = (1 − δ_k) d_{k−1}(x) + δ_k [ f(y_{k−1}) + g_k⊤(x − y_{k−1}) + (µ/2) ‖x − y_{k−1}‖² + ψ(x_k) + ψ′(x_k)⊤(x − x_k) ],

where δ_k = µ η_k, ψ′(x_k) is a subgradient in ∂ψ(x_k), and d_0(x) = d_0⋆ + (µ/2) ‖x − x_0‖².

SLIDE 31

An accelerated SGD algorithm

Complexity: acc-SGD with constant step size η_k = 1/L

E[F(x_k) − F⋆] ≤ 2 (1 − √(µ/L))^k (F(x_0) − F⋆) + σ²/√(µL).

Note that the bias is larger than for regular SGD by a factor √(L/µ).

SLIDE 32

An accelerated SGD algorithm

Corollary: acc-SGD with constant step size η_k = 1/L

#Comp = O(√(L/µ) log(C0/ε)) with Bias = σ²/√(µL).

SLIDE 33

An accelerated SGD algorithm

Corollary: two-stage acc-SGD with (i) constant step size; then (ii) decreasing step sizes

#Comp = O(√(L/µ) log(C0/ε)) + O(σ²/(µε)).
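With a constant step size η = 1/L (hence δ = √(µ/L)), the momentum formula above reduces to β = (1 − δ)/(1 + δ), and the iteration can be sketched as follows. The quadratic test problem is an illustrative assumption, with ψ = 0 so the proximal step disappears:

```python
import numpy as np

def acc_sgd(grad, x0, L, mu, iters, rng):
    # Constant eta = 1/L gives delta = sqrt(mu/L) and, from the momentum formula
    # above, beta = delta * (1 - delta) / (delta + delta^2) = (1 - delta) / (1 + delta).
    x = x0.copy()
    y = x0.copy()
    delta = np.sqrt(mu / L)
    beta = (1.0 - delta) / (1.0 + delta)
    for _ in range(iters):
        g = grad(y, rng)                 # gradient estimate at the extrapolated point
        x_new = y - (1.0 / L) * g        # proximal step omitted since psi = 0
        y = x_new + beta * (x_new - x)   # extrapolation
        x = x_new
    return x

# Noiseless sanity check on a quadratic with mu = 1, L = 10.
rng = np.random.default_rng(0)
H = np.diag([1.0, 10.0])
grad = lambda y, r: H @ y                # exact gradient; add noise to observe the bias
x = acc_sgd(grad, np.array([5.0, 5.0]), L=10.0, mu=1.0, iters=200, rng=rng)
# x is now very close to the minimizer (0, 0).
```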

SLIDE 34

An accelerated SVRG algorithm for stochastic finite-sum problems

  • Choose the extrapolation point y_{k−1} = θ_k v_{k−1} + (1 − θ_k) x̃_{k−1};
  • Compute the noisy gradient estimator g_k = ∇̃f_{i_k}(y_{k−1}) − ∇̃f_{i_k}(x̃_{k−1}) + ∇̃f(x̃_{k−1});
  • Obtain the new iterate x_k ← Prox_{η_k ψ}[y_{k−1} − η_k g_k];
  • Find the minimizer v_k of the estimate sequence: v_k = (1 − δ_k) v_{k−1} + δ_k y_{k−1} + (δ_k/(γ_k η_k))(x_k − y_{k−1});
  • Update the anchor point x̃_k with probability 1/n.

Output x_k (no averaging needed).

SLIDE 35

An accelerated SVRG algorithm for stochastic finite-sum problems

Remarks

  • design of the algorithm and convergence proofs are based on estimate sequences;
  • with two stages, the algorithm achieves the optimal complexity O((n + √(nL/µ)) log(C0/ε)) + O(σ̃²/(µε)) with σ̃² ≪ σ².

SLIDE 36

A few experiments

[Figure: objective gap F/F⋆ − 1 (log scale) vs. effective passes over the data, on datasets alpha and ckn-cifar, for rand-SVRG 1/12L, rand-SVRG 1/3L, acc-SVRG 1/3L, SGD 1/L, SGD-d, acc-SGD-d, acc-mb-SGD-d.]

ℓ2-logistic regression on two datasets, with µ = 1/(10n).

  • no big difference between the variants of SGD with decreasing step sizes;
  • variance reduction makes a huge difference;
  • acceleration helps on ckn-cifar.

SLIDE 37

A few experiments

[Figure: objective gap F/F⋆ − 1 (log scale) vs. effective passes over the data, on datasets alpha and ckn-cifar, same methods as above.]

ℓ2-logistic regression on two datasets, with µ = 1/(100n).

  • as conditioning worsens, the benefits of acceleration are larger;
  • accelerated SGD with mini-batches takes the lead among SGD methods.

SLIDE 38

A few experiments

[Figure: objective gap F/F⋆ − 1 (log scale) vs. effective passes over the data, on two datasets, same methods as above.]

SVM with squared hinge loss on two datasets, with µ = 1/(10n).

  • here, gradients are potentially unbounded and accelerated SGD diverges!
  • accelerated SGD with mini-batches is stable and faster than SGD.


SLIDE 40

Remark about accelerated SGD

It does not always work. Why?

  • the bounded noise variance assumption is not safe;
  • the accelerated algorithm with constant step size (which is used to forget the initial condition) has a much worse dependence on σ² (see below).

Convergence of SGD with η_t = 1/L

E[f(x̂_t) − f⋆] ≤ 2 (1 − µ/L)^t (f(x_0) − f⋆) + σ²/L.

Convergence of accelerated SGD with η_t = 1/L

E[f(x̂_t) − f⋆] ≤ 2 (1 − √(µ/L))^t (f(x_0) − f⋆) + σ²/√(µL).

SLIDE 41

Remark about accelerated SGD

Is it worthless?

  • removing the need for averaging is great for sparse problems;
  • with a mini-batch of size √(L/µ), we obtain the same complexity as the unaccelerated algorithm and the same stability w.r.t. σ², and we can parallelize for free!

SLIDE 42

References from this talk

The botany of incremental methods

SAG [Schmidt et al., 2017], SAGA [Defazio et al., 2014a], SVRG [Xiao and Zhang, 2014], SDCA [Shalev-Shwartz and Zhang, 2014], Finito [Defazio et al., 2014b], MISO [Mairal, 2015], S2GD [Konečný and Richtárik, 2017], SARAH [Nguyen et al., 2017], MiG [Zhou et al., 2018], Katyusha [Allen-Zhu, 2017], Catalyst [Lin et al., 2018], ...

SLIDE 43

Conclusion

  • The estimate sequence method is a generic tool, which can be applied to stochastic optimization problems, including finite sums.
  • We use it to develop and analyze algorithms without and with acceleration.
  • We discuss empirical findings regarding the stability of accelerated stochastic algorithms... but stability issues can be fixed with mini-batching.

SLIDE 44

References I

Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the Symposium on Theory of Computing (STOC), 2017.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.

A. J. Defazio, T. S. Caetano, and J. Domke. Finito: a faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014b.

O. Devolder. Stochastic first order methods in smooth convex optimization. CORE Discussion Papers 2011070, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.

SLIDE 45

References II

J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.

H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research (JMLR), 18(212):1–54, 2018.

Q. Lin, X. Chen, and J. Peña. A sparsity preserving stochastic gradient method for sparse regression. Computational Optimization and Applications, 58(2):455–482, 2014.

J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1–41, 2014.

SLIDE 46

References III

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. arXiv preprint arXiv:1806.11027, 2018.