
SGD without Replacement: Sharper Rates for General Smooth Convex Functions



  1. SGD without Replacement: Sharper Rates for General Smooth Convex Functions
     Dheeraj Nagaraj, Massachusetts Institute of Technology
     June 12, 2019
     Joint work with Praneeth Netrapalli and Prateek Jain (MSR India)

  2. Overview
     1. Introduction
     2. Current Results
     3. Our Results
     4. Main Techniques

  3. SGD with Replacement (SGD)
     Consider observations $\xi_1, \ldots, \xi_n$ and a convex loss function $f(\cdot, \xi_i) : \mathbb{R}^d \to \mathbb{R}$.
     Empirical Risk Minimization: $x^* = \arg\min_{x \in D} \frac{1}{n} \sum_{i=1}^n f(x, \xi_i) =: \arg\min_{x \in D} \hat{F}(x)$.
     SGD with replacement (SGD): fix a step-size sequence $\alpha_t \geq 0$ and start at $x_0 \in D$. At every time step, draw an independent random variable $I_t \sim \mathrm{unif}([n])$ and update $x_{t+1} = x_t - \alpha_t \nabla f(x_t, \xi_{I_t})$.
     Easy to analyze, since the independence of $I_t$ ensures that $\mathbb{E}_{I_t} \nabla f(x_t, \xi_{I_t}) = \nabla \hat{F}(x_t)$.
     Sharp non-asymptotic guarantees are available, but this variant is seldom used in practice.
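The with-replacement update above is easy to state in code. Below is a minimal sketch, not taken from the talk, for an illustrative least-squares loss $f(x, \xi_i) = \frac{1}{2}(a_i^\top x - b_i)^2$; the function name, the constant step size, and the omission of the projection onto $D$ are simplifying assumptions.

    # Minimal sketch of SGD with replacement on a least-squares loss (illustrative only).
    import numpy as np

    def sgd_with_replacement(A, b, steps, alpha=0.01, rng=None):
        rng = np.random.default_rng(rng)
        n, d = A.shape
        x = np.zeros(d)
        for t in range(steps):
            i = rng.integers(n)              # I_t ~ unif([n]), drawn independently at every step
            grad = (A[i] @ x - b[i]) * A[i]  # gradient of f(x, xi_i) for the least-squares loss
            x = x - alpha * grad             # x_{t+1} = x_t - alpha_t * grad f(x_t, xi_{I_t})
        return x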

  4. SGD without Replacement (SGDo)
     In practice, the order of the data is fixed (say $\xi_1, \ldots, \xi_n$) and the data points are processed in this order, one after the other. One such pass is called an epoch, and the algorithm is run for $K$ epochs. A randomized version 'gets rid' of the bad orderings.
     SGD without Replacement (SGDo): at the beginning of the $k$-th epoch, draw an independent uniformly random permutation $\sigma_k$ and update $x_{k,i} = x_{k,i-1} - \alpha_{k,i} \nabla f(x_{k,i-1}; \xi_{\sigma_k(i)})$.
     This is closer to the algorithm implemented in practice, but harder to analyze since $\mathbb{E}\, \nabla f(x_{k,i}; \xi_{\sigma_k(i)}) \neq \mathbb{E}\, \nabla \hat{F}(x_{k,i})$.
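For comparison, here is a sketch of SGDo under the same illustrative assumptions as the with-replacement sketch above: a fresh uniform permutation $\sigma_k$ is drawn at the start of every epoch and the data are visited in that order. Again the names, constant step size, and omitted projection are assumptions for illustration.

    # Minimal sketch of SGD without replacement (random reshuffling), illustrative only.
    import numpy as np

    def sgd_without_replacement(A, b, epochs, alpha=0.01, rng=None):
        rng = np.random.default_rng(rng)
        n, d = A.shape
        x = np.zeros(d)
        for k in range(epochs):
            sigma = rng.permutation(n)           # independent uniform permutation for epoch k
            for i in sigma:                      # one epoch = one full pass in the order sigma_k
                grad = (A[i] @ x - b[i]) * A[i]
                x = x - alpha * grad             # x_{k,i} = x_{k,i-1} - alpha_{k,i} * grad f(x_{k,i-1}; xi_{sigma_k(i)})
        return x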

  5. Experimental Observations
     Experiments [Bottou 2009] found that on many problems SGDo converges as $O(1/K^2)$, which is faster than SGD, which converges at $O(1/K)$ ($K$ = number of epochs).
     Theoretically, it had not even been shown that SGDo matches the rate of SGD for all $K$.
     [Léon Bottou. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.]
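One way to see the qualitative gap is to run the two sketches above on a synthetic least-squares problem and print the suboptimality as the number of epochs $K$ grows. This is only an illustrative harness with arbitrary sizes, step size, and seeds; it is not a reproduction of Bottou's experiments.

    # Illustrative comparison of the two samplers on a random least-squares problem.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 10
    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)    # empirical risk minimizer
    F_hat = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

    for K in (1, 2, 4, 8, 16):
        x_wr = sgd_with_replacement(A, b, steps=n * K, alpha=0.005, rng=1)   # n*K stochastic steps
        x_wo = sgd_without_replacement(A, b, epochs=K, alpha=0.005, rng=1)   # K full passes
        print(K, F_hat(x_wr) - F_hat(x_star), F_hat(x_wo) - F_hat(x_star))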

  6. Currently Known Bounds

  7. Small Number of Epochs
     Assumptions: $f(\cdot; \xi_i)$ is $L$-smooth, $\|\nabla f(\cdot; \xi_i)\| \leq G$, and $\mathrm{diam}(W) \leq D$.
     Suboptimality (leading order, general convex case): $O\!\left(\frac{GD}{\sqrt{nK}}\right)$.
     Suboptimality (leading order, $\mu$-strongly convex case): $O\!\left(\frac{G^2 \log(nK)}{\mu n K}\right)$.
     Shamir's result [Shamir 2016] only works for generalized linear functions and when $K = 1$. All other "acceleration" results hold only when $K$ is very large.
     [Ohad Shamir. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46-54.]

  8. Large Number of Epochs
     Assumptions: $f(\cdot; \xi_i)$ is $L$-smooth, $\|\nabla f(\cdot; \xi_i)\| \leq G$, and $\hat{F}$ is $\mu$-strongly convex.
     When $K \gtrsim \kappa^2$, suboptimality: $O\!\left(\frac{\kappa^2 G^2 (\log nK)^2}{\mu n K^2}\right)$.
     Previous results [HaoChen and Sra 2018] require Hessian smoothness and $K \geq \kappa^{1.5} \sqrt{n}$ to give a suboptimality of $O\!\left(\frac{\kappa^4}{n^2 K^2} + \frac{\kappa^4}{K^3}\right)$.
     Without the smoothness assumption, there can be no acceleration.
     [Jeffery Z. HaoChen and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).]

  9. Main Techniques
     Main bottleneck in analysis: $\mathbb{E}\, \nabla f(x_{k,i}; \xi_{\sigma_k(i)}) \neq \mathbb{E}\, \nabla \hat{F}(x_{k,i})$.
     If $\sigma'_k$ is independent of $\sigma_k$, then $\mathbb{E}\, \nabla f(x_{k,i}; \xi_{\sigma'_k(i)}) = \mathbb{E}\, \nabla \hat{F}(x_{k,i})$.
     Therefore, $\mathbb{E}\, \nabla f(x_{k,i}; \xi_{\sigma_k(i)}) = \mathbb{E}\, \nabla \hat{F}(x_{k,i}) + O\!\left(d_W\!\left(x_{k,i} \mid \sigma_k(i) = r,\; x_{k,i}\right)\right)$.
     Through coupling arguments: $d_W\!\left(x_{k,i} \mid \sigma_k(i) = r,\; x_{k,i}\right) \lesssim \alpha_{k,0}\, G$.

  10. Automatic Variance Reduction and Acceleration
     For the smooth and strongly convex case, $\nabla \hat{F}(x^*) = 0 = \frac{1}{n} \sum_{i=1}^n \nabla f(x^*, \xi_{\sigma_k(i)})$. (Note that this does not hold with independent sampling.)
     Therefore, when $x_{k,0} \approx x^*$, we show by coupling arguments that $0 \approx \nabla \hat{F}(x_{k,0}) \approx \frac{1}{n} \sum_{i=1}^n \nabla f(x_{k,i}, \xi_{\sigma_k(i)})$.
     This is similar to the variance reduction seen in modifications of SGD such as SAGA, SVRG, etc.
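The exact cancellation $\frac{1}{n}\sum_{i=1}^n \nabla f(x^*, \xi_{\sigma_k(i)}) = 0$ over a full permutation, versus the nonzero fluctuation under independent sampling, can be checked numerically. The following small check uses the same illustrative least-squares setup as the sketches above; the sizes and seeds are arbitrary.

    # Numerical check: per-sample gradients at x* cancel exactly over one permutation,
    # but an average of n independent with-replacement draws generally does not.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 5
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    grads = (A @ x_star - b)[:, None] * A            # per-sample gradients at x*

    perm = rng.permutation(n)                        # one SGDo epoch visits every index exactly once
    iid = rng.integers(n, size=n)                    # n independent with-replacement draws
    print(np.linalg.norm(grads[perm].mean(axis=0)))  # ~0 up to numerical error
    print(np.linalg.norm(grads[iid].mean(axis=0)))   # typically of order G / sqrt(n)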

  11. References
     Bottou, Léon. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.
     Gürbüzbalaban, Mert, Asu Ozdaglar, and Pablo Parrilo. "Why random reshuffling beats stochastic gradient descent". In: arXiv preprint arXiv:1510.08560 (2015).
     HaoChen, Jeffery Z. and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).
     Shamir, Ohad. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46-54.

  12. Questions?
