

  1. Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization
  Vien V. Mai and Mikael Johansson
  KTH - Royal Institute of Technology

  2. Stochastic optimization
  Stochastic optimization problem:
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  Stochastic gradient descent (SGD):
      x_{k+1} = x_k − α_k g_k,  g_k ∈ ∂f(x_k; S_k)
  SGD with momentum:
      x_{k+1} = x_k − α_k z_k,  z_{k+1} = β_k g_{k+1} + (1 − β_k) z_k
  Includes Polyak's heavy ball, Nesterov's fast gradient, and more
  • widespread empirical success
  • theory less clear than for the deterministic counterpart
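  To make the two update rules concrete, here is a minimal Python sketch (an illustration, not code from the talk); the stochastic-gradient oracle, the constant step size, and the momentum schedule used in the example are assumptions of this sketch.

      import numpy as np

      def sgd(x0, stoch_grad, alpha, K):
          """Plain SGD: x_{k+1} = x_k - alpha_k * g_k."""
          x = np.asarray(x0, dtype=float).copy()
          for k in range(K):
              x -= alpha(k) * stoch_grad(x)
          return x

      def sgd_momentum(x0, stoch_grad, alpha, beta, K):
          """SGD with momentum: x_{k+1} = x_k - alpha_k * z_k,
          z_{k+1} = beta_k * g_{k+1} + (1 - beta_k) * z_k."""
          x = np.asarray(x0, dtype=float).copy()
          z = stoch_grad(x)                 # assumed initialization of the direction
          for k in range(K):
              x -= alpha(k) * z
              z = beta(k) * stoch_grad(x) + (1 - beta(k)) * z
          return x

      # Example: noisy gradients of f(x) = 0.5*||x||^2
      rng = np.random.default_rng(0)
      g = lambda x: x + 0.1 * rng.standard_normal(x.shape)
      x_hat = sgd_momentum(np.ones(5), g, alpha=lambda k: 0.1, beta=lambda k: 0.2, K=1000)
      print(x_hat)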

  3. Stochastic optimization: sample complexity
  For SGD, the sample complexity is known under various assumptions:
  • convexity [Nemirovski et al., 2009]
  • smoothness [Ghadimi-Lan, 2013]
  • weak convexity [Davis-Drusvyatskiy, 2019]
  Much less is known for momentum-based methods, in particular for
  • constrained problems
  • non-smooth non-convex problems

  4. Our contributions
  A novel Lyapunov analysis for (projected) stochastic heavy ball (SHB):
  • sample complexity of SHB for stochastic weakly convex minimization
  • analysis of the smooth non-convex case under less restrictive assumptions

  5. Outline
  • Background and motivation
  • SHB for non-smooth non-convex optimization
  • Sharper results for smooth non-convex optimization
  • Numerical examples
  • Summary and conclusions

  6. Problem formulation
  Problem:
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  X is closed and convex; f is ρ-weakly convex, meaning that x ↦ f(x) + (ρ/2)‖x‖²₂ is convex.
  Such functions are easy to recognize, e.g., convex compositions f(x) = h(c(x)) with h convex and L_h-Lipschitz and c smooth with an L_c-Lipschitz Jacobian (then ρ = L_h L_c).
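  A standard concrete instance, and the type of problem that reappears in the experiments later, is the robust phase-retrieval loss
      f(x; a, b) = |⟨a, x⟩² − b| = h(c(x)),  with h(t) = |t| and c(x) = ⟨a, x⟩² − b.
  Here h is convex and 1-Lipschitz, and ∇c(x) = 2⟨a, x⟩ a is 2‖a‖²₂-Lipschitz, so f(·; a, b) is ρ-weakly convex with ρ = L_h L_c = 2‖a‖²₂.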

  7. Algorithm
  Consider
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  Algorithm:
      x_{k+1} = argmin_{x ∈ X} { ⟨z_k, x − x_k⟩ + (1/(2α))‖x − x_k‖²₂ }
      z_{k+1} = β g_{k+1} + (1 − β)(x_k − x_{k+1})/α
  Recovers SHB when X = R^n; setting β = 1 gives (projected) SGD.
  Goal: establish the sample complexity.
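  Since X is convex, the argmin step is simply the Euclidean projection of x_k − α z_k onto X. Below is a minimal Python sketch of the iteration (an illustration, not the authors' implementation); the choice of X as a Euclidean ball, the stochastic subgradient oracle, and the initialization of z are assumptions of this sketch.

      import numpy as np

      def proj_ball(y, R=1.0):
          """Projection onto {x : ||x||_2 <= R} (illustrative choice of X)."""
          nrm = np.linalg.norm(y)
          return y if nrm <= R else (R / nrm) * y

      def projected_shb(x0, stoch_subgrad, alpha, beta, K, R=1.0):
          """x_{k+1} = proj_X(x_k - alpha*z_k);
          z_{k+1} = beta*g_{k+1} + (1 - beta)*(x_k - x_{k+1})/alpha."""
          x = np.asarray(x0, dtype=float).copy()
          z = stoch_subgrad(x)                      # assumed initialization
          for k in range(K):
              x_next = proj_ball(x - alpha * z, R)
              z = beta * stoch_subgrad(x_next) + (1 - beta) * (x - x_next) / alpha
              x = x_next
          return x

      # Example usage with a noisy subgradient oracle for f(x) = ||x||_1
      rng = np.random.default_rng(0)
      x_out = projected_shb(np.ones(5),
                            lambda x: np.sign(x) + 0.1 * rng.standard_normal(5),
                            alpha=0.01, beta=0.1, K=2000)
      print(x_out)

  When X = R^n the projection is the identity and (x_k − x_{k+1})/α = z_k, so the recursion reduces to the SHB update on slide 2; β = 1 drops the momentum term and gives projected SGD.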

  8. Roadmap and challenges
  Most complexity results for subgradient-based methods rely on forming a recursion
      E[V_{k+1}] ≤ E[V_k] − α E[e_k] + α² C²,
  which immediately yields O(1/ε²) complexity for E[e_k].
  Stationarity measure:
  • f convex ⇒ e_k = f(x_k) − f(x⋆); f smooth ⇒ e_k = ‖∇f(x_k)‖²₂
  • f weakly convex ⇒ e_k = ‖∇F_λ(x_k)‖²₂
  Lyapunov analysis (for SGD):
  • f convex ⇒ V_k = ‖x_k − x⋆‖²₂ [Shor, 1964]
  • f smooth ⇒ V_k = f(x_k) [Ghadimi-Lan, 2013]
  • f weakly convex ⇒ V_k = F_λ(x_k) [Davis-Drusvyatskiy, 2019]
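  To spell out the step from the recursion to the rate (standard reasoning, implicit in the slide), assume a constant step size α and V_k ≥ 0. Summing the recursion over k = 0, …, K and dropping E[V_{K+1}] ≥ 0 gives
      α ∑_{k=0}^{K} E[e_k] ≤ V_0 + (K + 1) α² C²,
  so
      min_{0≤k≤K} E[e_k] ≤ (1/(K+1)) ∑_{k=0}^{K} E[e_k] ≤ V_0/(α(K + 1)) + α C².
  Choosing α ∝ 1/√(K + 1) balances the two terms, giving min_k E[e_k] ≤ O(1/√(K + 1)), i.e., E[e_k] ≤ ε after K = O(1/ε²) iterations.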

  9. Convergence to stationarity in weakly convex cases
  Moreau envelope:
      F_λ(x) = inf_y { F(y) + (1/(2λ))‖x − y‖²₂ }
  Proximal mapping:
      x̂ := argmin_{y ∈ R^n} { F(y) + (1/(2λ))‖x − y‖²₂ }
  Connection to near-stationarity:
  • λ⁻¹(x − x̂) = ∇F_λ(x), so ‖x − x̂‖₂ = λ‖∇F_λ(x)‖₂
  • dist(0, ∂F(x̂)) ≤ ‖∇F_λ(x)‖₂
  Small ‖∇F_λ(x)‖₂ ⇒ x is close to a near-stationary point.
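  As a concrete illustration of this stationarity measure (not part of the talk), the following Python sketch estimates ∇F_λ(x) by approximately solving the proximal subproblem with an inner subgradient method; the inner solver, its step-size rule, and the ℓ1 example are assumptions of this sketch.

      import numpy as np

      def moreau_grad(F_subgrad, x, lam=0.5, inner_iters=1000):
          """Estimate grad F_lam(x) = (x - xhat)/lam by approximately solving
          min_y F(y) + ||y - x||^2/(2*lam) with a subgradient method."""
          y = x.astype(float).copy()
          for t in range(1, inner_iters + 1):
              g = F_subgrad(y) + (y - x) / lam      # subgradient of the prox objective
              y -= (lam / t) * g                    # ~1/(mu*t) step; mu = 1/lam for convex F
          return (x - y) / lam

      # Example: F(x) = ||x||_1, whose proximal point is soft-thresholding
      subgrad_l1 = lambda y: np.sign(y)
      x = np.array([0.3, -2.0, 0.0])
      print(moreau_grad(subgrad_l1, x))             # approx. [0.6, -1.0, 0.0]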

  10. Lyapunov analysis for SHB
  Recall that we wanted
      E[V_{k+1}] ≤ E[V_k] − α E[e_k] + α² C²
  SGD works with e_k = ‖∇F_λ(x_k)‖²₂ and V_k = F_λ(x_k), so it seems natural to take e_k = ‖∇F_λ(·)‖²₂ here as well.
  Two questions:
  • at which point should we evaluate ∇F_λ(·)?
  • can we find a corresponding Lyapunov function V_k?

  11. Lyapunov analysis for SHB
  Our approach: evaluate ∇F_λ(·) at the extrapolated iterate
      x̄_k := x_k + ((1 − β)/β)(x_k − x_{k−1})
  and define the corresponding proximal point
      x̂_k = argmin_{y ∈ R^n} { F(y) + (1/(2λ))‖y − x̄_k‖²₂ }
  This gives
      e_k = ‖∇F_λ(x̄_k)‖²₂,  with ∇F_λ(x̄_k) = λ⁻¹(x̄_k − x̂_k)

  12. Lyapunov analysis for SHB
  Let β = να so that β ∈ (0, 1] and define ξ = (1 − β)/ν. Consider a Lyapunov function of the form
      V_k = F_λ(x̄_k) + c_p ‖p_k‖²₂ + c_d ‖d_k‖²₂ + c_f f(x_{k−1}),
  where p_k = ((1 − β)/β)(x_k − x_{k−1}), d_k = (x_{k−1} − x_k)/α, and c_p, c_d, c_f are explicit constants built from α, β, ν, ξ, and λ.
  Theorem. For any k ∈ N, it holds that
      E[V_{k+1}] ≤ E[V_k] − (α/2) E[‖∇F_λ(x̄_k)‖²₂] + α²CL²/(2λ).

  13. Main result: sample complexity
  Taking α = α₀/√K and β = O(1/√K) ∈ (0, 1] yields
      E[‖∇F_{1/(2ρ)}(x̄_{k*})‖²₂] ≤ O( (ρΔ + L²)/√(K + 1) ),  where Δ = f(x₀) − inf_{x ∈ X} f(x).
  Note:
  • same worst-case complexity as SGD (β = 1)
  • β can be as small as O(1/√K)
  • (much) more weight on the momentum term than on the fresh subgradient
  This rate is, in general, not possible to improve [Arjevani et al., 2019].
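  A tiny Python sketch (illustrative, not from the paper) of how these parameter choices could be instantiated for a fixed budget K; the constants alpha0 and nu, and the uniform sampling of the reported iterate k*, are assumptions of this sketch.

      import math, random

      def shb_parameters(K, alpha0=1.0, nu=10.0):
          """alpha = alpha0/sqrt(K); beta = nu*alpha = O(1/sqrt(K)), clipped into (0, 1]."""
          alpha = alpha0 / math.sqrt(K)
          beta = min(1.0, nu * alpha)
          return alpha, beta

      K = 10_000
      alpha, beta = shb_parameters(K)
      k_star = random.randint(0, K)   # iterate to report (uniform sampling is assumed here)
      print(alpha, beta, k_star)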

  14. Outline
  • Background and motivation
  • SHB for non-smooth non-convex optimization
  • Sharper results for smooth non-convex optimization
  • Numerical examples
  • Summary and conclusions

  15. Smooth and non-convex optimization
  Problem:
      minimize_{x ∈ X} f(x) := E_P[f(x; S)] = ∫_S f(x; s) dP(s)
  X is closed and convex; f is ρ-smooth:
      ‖∇f(x) − ∇f(y)‖₂ ≤ ρ‖x − y‖₂  for all x, y ∈ dom f.
  Assumption. There exists a real σ > 0 such that for all x ∈ X:
      E[‖f′(x, S) − ∇f(x)‖²₂] ≤ σ².
  Note:
  • the complexity of SHB is not known (even in the deterministic case)
  • when X = R^n, O(1/ε²) was obtained under a bounded-gradients assumption [Yan et al., 2018]

  16. Improved complexities on smooth non-convex problems
  Constrained case: suppose that ‖∇f(x)‖₂ ≤ G for all x ∈ X. If we set α = α₀/√(K + 1), then
      E[‖∇F_λ(x̄_{k*})‖²₂] ≤ O( (ρΔ + σ² + G²)/√(K + 1) ).
  Unconstrained case: if we set α = α₀/√(K + 1) with α₀ ∈ (0, 1/(4ρ)], then
      E[‖∇F_λ(x̄_{k*})‖²₂] ≤ O( ((1 + 8ρ²α₀²)Δ + (ρ + 16α₀ρ²)σ²α₀³) / (α₀√(K + 1)) ).

  17. Experiments: convergence behavior on phase retrieval
  Figure: function gap vs. #iterations for phase retrieval with p_fail = 0.2 and β = 10/√K; (a) κ = 1, α₀ = 0.1; (b) κ = 1, α₀ = 0.15.
  • Exponential growth before eventual convergence (also observed in [Asi-Duchi, 2019]) is not shown.
  • SGD is competitive if well-tuned, but sensitive to the stepsize choice.
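  For orientation, a minimal Python sketch (not the authors' experiment code) of a robust phase-retrieval instance like the one in this figure, run with plain SHB; the problem sizes, corruption model, initialization, and the reading of κ = 1 as isotropic Gaussian measurements are all assumptions of this sketch.

      import numpy as np

      rng = np.random.default_rng(0)
      n, m, p_fail = 50, 500, 0.2                  # illustrative sizes; isotropic A (kappa = 1)
      x_true = rng.standard_normal(n)
      A = rng.standard_normal((m, n))
      b = (A @ x_true) ** 2
      corrupt = rng.random(m) < p_fail
      b[corrupt] = np.abs(rng.standard_normal(corrupt.sum()))   # corrupted measurements

      def loss(x):
          """Robust phase-retrieval loss: mean_i |<a_i, x>^2 - b_i| (weakly convex)."""
          return np.abs((A @ x) ** 2 - b).mean()

      def stoch_subgrad(x, i):
          """Subgradient of |<a_i, x>^2 - b_i| at x."""
          r = A[i] @ x
          return np.sign(r ** 2 - b[i]) * 2.0 * r * A[i]

      K, alpha0 = 20_000, 0.1
      alpha, beta = alpha0 / np.sqrt(K), min(1.0, 10.0 / np.sqrt(K))   # beta = 10/sqrt(K)
      x, z = rng.standard_normal(n), np.zeros(n)
      for k in range(K):
          z = beta * stoch_subgrad(x, rng.integers(m)) + (1 - beta) * z
          x = x - alpha * z
      print("final loss:", loss(x))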

  18. Experiments: sensitivity to initial stepsize
  Figure: #epochs to achieve ε-accuracy vs. initial stepsize α₀ with κ = 10; (a) β = 1/√K; (b) β = 1/(α₀√K).

  19. Experiments: popular momentum parameter
  Figure: #epochs to achieve ε-accuracy vs. initial stepsize α₀ with κ = 10; (a) 1 − β = 0.9; (b) 1 − β = 0.99.

  20. Conclusion
  SGD with momentum:
  • a simple modification of SGD
  • good performance and less sensitivity to algorithm parameters
  Novel Lyapunov analysis:
  • sample complexity of SHB for weakly convex and constrained optimization
  • improved rates on smooth non-convex problems
