

  1. Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization
  Vien V. Mai and Mikael Johansson
  KTH - Royal Institute of Technology

  2. Stochastic optimization
  Stochastic optimization problem:
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  Stochastic gradient descent (SGD):
      x_{k+1} = x_k − α_k g_k,  g_k ∈ ∂f(x_k; S_k)
  SGD with momentum:
      x_{k+1} = x_k − α_k z_k,  z_{k+1} = β_k g_{k+1} + (1 − β_k) z_k
  Includes Polyak's heavy ball, Nesterov's fast gradient, and more
  • widespread empirical success
  • theory less clear than for the deterministic counterpart
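  To make the two update rules concrete, here is a minimal Python sketch (an illustration, not code from the talk); the stochastic-gradient oracle, the constant step size, and the momentum schedule used in the example are assumptions of this sketch.

      import numpy as np

      def sgd(x0, stoch_grad, alpha, K):
          """Plain SGD: x_{k+1} = x_k - alpha_k * g_k."""
          x = np.asarray(x0, dtype=float).copy()
          for k in range(K):
              x -= alpha(k) * stoch_grad(x)
          return x

      def sgd_momentum(x0, stoch_grad, alpha, beta, K):
          """SGD with momentum: x_{k+1} = x_k - alpha_k * z_k,
          z_{k+1} = beta_k * g_{k+1} + (1 - beta_k) * z_k."""
          x = np.asarray(x0, dtype=float).copy()
          z = stoch_grad(x)                 # assumed initialization of the direction
          for k in range(K):
              x -= alpha(k) * z
              z = beta(k) * stoch_grad(x) + (1 - beta(k)) * z
          return x

      # Example: noisy gradients of f(x) = 0.5*||x||^2
      rng = np.random.default_rng(0)
      g = lambda x: x + 0.1 * rng.standard_normal(x.shape)
      x_hat = sgd_momentum(np.ones(5), g, alpha=lambda k: 0.1, beta=lambda k: 0.2, K=1000)
      print(x_hat)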

  3. Stochastic optimization: sample complexity
  For SGD, the sample complexity is known under various assumptions:
  • convexity [Nemirovski et al., 2009]
  • smoothness [Ghadimi-Lan, 2013]
  • weak convexity [Davis-Drusvyatskiy, 2019]
  Much less is known for momentum-based methods, in particular for
  • constrained problems
  • non-smooth non-convex problems

  4. Our contributions
  A novel Lyapunov analysis for (projected) stochastic heavy ball (SHB):
  • sample complexity of SHB for stochastic weakly convex minimization
  • analysis of the smooth non-convex case under less restrictive assumptions

  5. Outline
  • Background and motivation
  • SHB for non-smooth non-convex optimization
  • Sharper results for smooth non-convex optimization
  • Numerical examples
  • Summary and conclusions

  6. Problem formulation
  Problem:
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  X is closed and convex; f is ρ-weakly convex, meaning that x ↦ f(x) + (ρ/2)‖x‖²₂ is convex.
  Such functions are easy to recognize, e.g., convex compositions f(x) = h(c(x)) with h convex and L_h-Lipschitz and c smooth with an L_c-Lipschitz Jacobian (then ρ = L_h L_c).
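  A standard concrete instance, and the type of problem that reappears in the experiments later, is the robust phase-retrieval loss
      f(x; a, b) = |⟨a, x⟩² − b| = h(c(x)),  with h(t) = |t| and c(x) = ⟨a, x⟩² − b.
  Here h is convex and 1-Lipschitz, and ∇c(x) = 2⟨a, x⟩ a is 2‖a‖²₂-Lipschitz, so f(·; a, b) is ρ-weakly convex with ρ = L_h L_c = 2‖a‖²₂.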

  7. Algorithm
  Consider
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  Algorithm:
      x_{k+1} = argmin_{x ∈ X} { ⟨z_k, x − x_k⟩ + (1/(2α))‖x − x_k‖²₂ }
      z_{k+1} = β g_{k+1} + (1 − β)(x_k − x_{k+1})/α
  Recovers SHB when X = R^n; setting β = 1 gives (projected) SGD.
  Goal: establish the sample complexity.
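  Since X is convex, the argmin step is simply the Euclidean projection of x_k − α z_k onto X. Below is a minimal Python sketch of the iteration (an illustration, not the authors' implementation); the choice of X as a Euclidean ball, the stochastic subgradient oracle, and the initialization of z are assumptions of this sketch.

      import numpy as np

      def proj_ball(y, R=1.0):
          """Projection onto {x : ||x||_2 <= R} (illustrative choice of X)."""
          nrm = np.linalg.norm(y)
          return y if nrm <= R else (R / nrm) * y

      def projected_shb(x0, stoch_subgrad, alpha, beta, K, R=1.0):
          """x_{k+1} = proj_X(x_k - alpha*z_k);
          z_{k+1} = beta*g_{k+1} + (1 - beta)*(x_k - x_{k+1})/alpha."""
          x = np.asarray(x0, dtype=float).copy()
          z = stoch_subgrad(x)                      # assumed initialization
          for k in range(K):
              x_next = proj_ball(x - alpha * z, R)
              z = beta * stoch_subgrad(x_next) + (1 - beta) * (x - x_next) / alpha
              x = x_next
          return x

      # Example usage with a noisy subgradient oracle for f(x) = ||x||_1
      rng = np.random.default_rng(0)
      x_out = projected_shb(np.ones(5),
                            lambda x: np.sign(x) + 0.1 * rng.standard_normal(5),
                            alpha=0.01, beta=0.1, K=2000)
      print(x_out)

  When X = R^n the projection is the identity and (x_k − x_{k+1})/α = z_k, so the recursion reduces to the SHB update on slide 2; β = 1 drops the momentum term and gives projected SGD.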

  8. Roadmap and challenges
  Most complexity results for subgradient-based methods rely on forming a recursion
      E[V_{k+1}] ≤ E[V_k] − α E[e_k] + α² C²,
  which immediately yields O(1/ε²) complexity for E[e_k].
  Stationarity measure:
  • f convex ⇒ e_k = f(x_k) − f(x⋆); f smooth ⇒ e_k = ‖∇f(x_k)‖²₂
  • f weakly convex ⇒ e_k = ‖∇F_λ(x_k)‖²₂
  Lyapunov analysis (for SGD):
  • f convex ⇒ V_k = ‖x_k − x⋆‖²₂ [Shor, 1964]
  • f smooth ⇒ V_k = f(x_k) [Ghadimi-Lan, 2013]
  • f weakly convex ⇒ V_k = F_λ(x_k) [Davis-Drusvyatskiy, 2019]
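  To spell out the step from the recursion to the rate (standard reasoning, implicit in the slide), assume a constant step size α and V_k ≥ 0. Summing the recursion over k = 0, …, K and dropping E[V_{K+1}] ≥ 0 gives
      α ∑_{k=0}^{K} E[e_k] ≤ V_0 + (K + 1) α² C²,
  so
      min_{0≤k≤K} E[e_k] ≤ (1/(K+1)) ∑_{k=0}^{K} E[e_k] ≤ V_0/(α(K + 1)) + α C².
  Choosing α ∝ 1/√(K + 1) balances the two terms, giving min_k E[e_k] ≤ O(1/√(K + 1)), i.e., E[e_k] ≤ ε after K = O(1/ε²) iterations.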

  9. Convergence to stationarity in weakly convex cases
  Moreau envelope:
      F_λ(x) = inf_y { F(y) + (1/(2λ))‖x − y‖²₂ }
  Proximal mapping:
      x̂ := argmin_{y ∈ R^n} { F(y) + (1/(2λ))‖x − y‖²₂ }
  Connection to near-stationarity:
  • λ⁻¹(x − x̂) = ∇F_λ(x), so ‖x − x̂‖₂ = λ‖∇F_λ(x)‖₂
  • dist(0, ∂F(x̂)) ≤ ‖∇F_λ(x)‖₂
  Small ‖∇F_λ(x)‖₂ ⇒ x is close to a near-stationary point.
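  As a concrete illustration of this stationarity measure (not part of the talk), the following Python sketch estimates ∇F_λ(x) by approximately solving the proximal subproblem with an inner subgradient method; the inner solver, its step-size rule, and the ℓ1 example are assumptions of this sketch.

      import numpy as np

      def moreau_grad(F_subgrad, x, lam=0.5, inner_iters=1000):
          """Estimate grad F_lam(x) = (x - xhat)/lam by approximately solving
          min_y F(y) + ||y - x||^2/(2*lam) with a subgradient method."""
          y = x.astype(float).copy()
          for t in range(1, inner_iters + 1):
              g = F_subgrad(y) + (y - x) / lam      # subgradient of the prox objective
              y -= (lam / t) * g                    # ~1/(mu*t) step; mu = 1/lam for convex F
          return (x - y) / lam

      # Example: F(x) = ||x||_1, whose proximal point is soft-thresholding
      subgrad_l1 = lambda y: np.sign(y)
      x = np.array([0.3, -2.0, 0.0])
      print(moreau_grad(subgrad_l1, x))             # approx. [0.6, -1.0, 0.0]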

  10. Lyapunov analysis for SHB
  Recall that we wanted
      E[V_{k+1}] ≤ E[V_k] − α E[e_k] + α² C²
  SGD works with e_k = ‖∇F_λ(x_k)‖²₂ and V_k = F_λ(x_k), so it seems natural to take e_k = ‖∇F_λ(·)‖²₂ here as well.
  Two questions:
  • at which point should we evaluate ∇F_λ(·)?
  • can we find a corresponding Lyapunov function V_k?

  11. Lyapunov analysis for SHB
  Our approach: evaluate ∇F_λ(·) at the extrapolated iterate
      x̄_k := x_k + ((1 − β)/β)(x_k − x_{k−1})
  and define the corresponding proximal point
      x̂_k = argmin_{y ∈ R^n} { F(y) + (1/(2λ))‖y − x̄_k‖²₂ }
  This gives
      e_k = ‖∇F_λ(x̄_k)‖²₂,  with ∇F_λ(x̄_k) = λ⁻¹(x̄_k − x̂_k)

  12. Lyapunov analysis for SHB
  Let β = να so that β ∈ (0, 1] and define ξ = (1 − β)/ν. Consider a Lyapunov function of the form
      V_k = F_λ(x̄_k) + c_p ‖p_k‖²₂ + c_d ‖d_k‖²₂ + c_f f(x_{k−1}),
  where p_k = ((1 − β)/β)(x_k − x_{k−1}), d_k = (x_{k−1} − x_k)/α, and c_p, c_d, c_f are explicit constants built from α, β, ν, ξ, and λ.
  Theorem. For any k ∈ N, it holds that
      E[V_{k+1}] ≤ E[V_k] − (α/2) E[‖∇F_λ(x̄_k)‖²₂] + α²CL²/(2λ).

  13. Main result: sample complexity
  Taking α = α₀/√K and β = O(1/√K) ∈ (0, 1] yields
      E[‖∇F_{1/(2ρ)}(x̄_{k*})‖²₂] ≤ O( (ρΔ + L²)/√(K + 1) ),  where Δ = f(x₀) − inf_{x ∈ X} f(x).
  Note:
  • same worst-case complexity as SGD (β = 1)
  • β can be as small as O(1/√K)
  • (much) more weight on the momentum term than on the fresh subgradient
  This rate is, in general, not possible to improve [Arjevani et al., 2019].
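  A tiny Python sketch (illustrative, not from the paper) of how these parameter choices could be instantiated for a fixed budget K; the constants alpha0 and nu, and the uniform sampling of the reported iterate k*, are assumptions of this sketch.

      import math, random

      def shb_parameters(K, alpha0=1.0, nu=10.0):
          """alpha = alpha0/sqrt(K); beta = nu*alpha = O(1/sqrt(K)), clipped into (0, 1]."""
          alpha = alpha0 / math.sqrt(K)
          beta = min(1.0, nu * alpha)
          return alpha, beta

      K = 10_000
      alpha, beta = shb_parameters(K)
      k_star = random.randint(0, K)   # iterate to report (uniform sampling is assumed here)
      print(alpha, beta, k_star)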

  14. Outline
  • Background and motivation
  • SHB for non-smooth non-convex optimization
  • Sharper results for smooth non-convex optimization
  • Numerical examples
  • Summary and conclusions

  15. Smooth and non-convex optimization
  Problem:
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  X is closed and convex; f is ρ-smooth:
      ‖∇f(x) − ∇f(y)‖₂ ≤ ρ‖x − y‖₂  for all x, y ∈ dom f.
  Assumption. There exists a real σ > 0 such that for all x ∈ X:
      E[‖f′(x, S) − ∇f(x)‖²₂] ≤ σ².
  Note:
  • the complexity of SHB is not known (even in the deterministic case)
  • when X = R^n, O(1/ε²) was obtained under a bounded-gradients assumption [Yan et al., 2018]

  16. Improved complexities on smooth non-convex problems
  Constrained case: suppose that ‖∇f(x)‖₂ ≤ G for all x ∈ X. If we set α = α₀/√(K + 1), then
      E[‖∇F_λ(x̄_{k*})‖²₂] ≤ O( (ρΔ + σ² + G²)/√(K + 1) ).
  Unconstrained case: if we set α = α₀/√(K + 1) with α₀ ∈ (0, 1/(4ρ)], then
      E[‖∇F_λ(x̄_{k*})‖²₂] ≤ O( ((1 + 8ρ²α₀²)Δ + (ρ + 16α₀ρ²)σ²α₀³) / (α₀√(K + 1)) ).

  17. Experiments: convergence behavior on phase retrieval
  Figure: function gap vs. #iterations for phase retrieval with p_fail = 0.2 and β = 10/√K; (a) κ = 1, α₀ = 0.1; (b) κ = 1, α₀ = 0.15.
  • Exponential growth before eventual convergence (also observed in [Asi-Duchi, 2019]) is not shown.
  • SGD is competitive if well-tuned, but sensitive to the stepsize choice.
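  For orientation, a minimal Python sketch (not the authors' experiment code) of a robust phase-retrieval instance like the one in this figure, run with plain SHB; the problem sizes, corruption model, initialization, and the reading of κ = 1 as isotropic Gaussian measurements are all assumptions of this sketch.

      import numpy as np

      rng = np.random.default_rng(0)
      n, m, p_fail = 50, 500, 0.2                  # illustrative sizes; isotropic A (kappa = 1)
      x_true = rng.standard_normal(n)
      A = rng.standard_normal((m, n))
      b = (A @ x_true) ** 2
      corrupt = rng.random(m) < p_fail
      b[corrupt] = np.abs(rng.standard_normal(corrupt.sum()))   # corrupted measurements

      def loss(x):
          """Robust phase-retrieval loss: mean_i |<a_i, x>^2 - b_i| (weakly convex)."""
          return np.abs((A @ x) ** 2 - b).mean()

      def stoch_subgrad(x, i):
          """Subgradient of |<a_i, x>^2 - b_i| at x."""
          r = A[i] @ x
          return np.sign(r ** 2 - b[i]) * 2.0 * r * A[i]

      K, alpha0 = 20_000, 0.1
      alpha, beta = alpha0 / np.sqrt(K), min(1.0, 10.0 / np.sqrt(K))   # beta = 10/sqrt(K)
      x, z = rng.standard_normal(n), np.zeros(n)
      for k in range(K):
          z = beta * stoch_subgrad(x, rng.integers(m)) + (1 - beta) * z
          x = x - alpha * z
      print("final loss:", loss(x))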

  18. Experiments: sensitivity to initial stepsize
  Figure: #epochs to achieve ε-accuracy vs. initial stepsize α₀ with κ = 10; (a) β = 1/√K; (b) β = 1/(α₀√K).

  19. Experiments: popular momentum parameter
  Figure: #epochs to achieve ε-accuracy vs. initial stepsize α₀ with κ = 10; (a) 1 − β = 0.9; (b) 1 − β = 0.99.

  20. Conclusion
  SGD with momentum:
  • a simple modification of SGD
  • good performance and less sensitivity to algorithm parameters
  Novel Lyapunov analysis:
  • sample complexity of SHB for weakly convex and constrained optimization
  • improved rates on smooth non-convex problems
