  1. Optimal Mini-Batch and Step Sizes for SAGA
     Nidham Gazagnadou (1, a), joint work with Robert M. Gower (1) and Joseph Salmon (2)
     (1) LTCI, Télécom Paris, Institut Polytechnique de Paris, France
     (2) IMAG, Univ Montpellier, CNRS, Montpellier, France
     (a) This work was supported by grants from Région Île-de-France

  2. The Optimization Problem
     • Goal: find w* ∈ arg min_{w ∈ R^d} f(w) = (1/n) Σ_{i=1}^n f_i(w)
       where
       – n i.i.d. observations: (a_i, y_i) ∈ R^d × R or R^d × {−1, 1}
       – f_i : R^d → R is L_i-smooth for all i ∈ [n]
       – f is L-smooth and μ-strongly convex
     • Covered problems
       – Ridge regression
       – Regularized logistic regression
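To make the covered problems concrete, here is a minimal Julia sketch (illustrative, not from the talk) that generates a random ridge-regression instance and computes the constants used throughout: for f_i(w) = ½(a_iᵀw − y_i)² + (λ/2)‖w‖² one has L_i = ‖a_i‖² + λ, while f itself has Hessian AAᵀ/n + λI, so L and μ come from its extreme eigenvalues.

```julia
using LinearAlgebra, Random

# Ridge regression: f_i(w) = ½(aᵢᵀw - yᵢ)² + (λ/2)‖w‖².
Random.seed!(42)
n, d, λ = 100, 20, 1e-1
A = randn(d, n)                           # columns are the observations aᵢ
y = randn(n)

Li   = [norm(A[:, i])^2 + λ for i in 1:n] # individual smoothness constants Lᵢ
H    = Symmetric(A * A' / n)              # Hessian of the unregularized loss
L    = eigmax(H) + λ                      # smoothness constant of f
μ    = eigmin(H) + λ                      # strong-convexity constant of f
Lmax = maximum(Li)
```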

  3. Reformulation of the ERM
     • Sampling vector: let v ∈ R^n with distribution D such that E_D[v_i] = 1 for all i ∈ [n] := {1, ..., n}
     • ERM stochastic reformulation:
       find w* ∈ arg min_{w ∈ R^d} E_D[f_v(w)], where f_v(w) := (1/n) Σ_{i=1}^n v_i f_i(w),
       leading to an unbiased gradient estimate
       E_D[∇f_v(w)] = (1/n) Σ_{i=1}^n E_D[v_i] ∇f_i(w) = ∇f(w)
     • Arbitrary sampling covers all mini-batching strategies, e.g. sampling b ∈ [n] elements without replacement:
       v = (n/b) Σ_{i ∈ B} e_i with probability 1/(n choose b), for all B ⊆ [n] with |B| = b
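The b-element sampling at the end of the slide is easy to sanity-check numerically. The following Julia snippet (my own illustration; variable names are arbitrary) draws many mini-batches without replacement and confirms that E_D[v_i] ≈ 1, which is exactly what makes the gradient estimate unbiased:

```julia
using Random

# v = (n/b) Σ_{i∈B} e_i for a uniformly random B with |B| = b.
# Monte-Carlo check that E_D[vᵢ] = 1, hence E_D[∇f_v(w)] = ∇f(w).
Random.seed!(0)
n, b, trials = 10, 3, 100_000
vbar = zeros(n)
for _ in 1:trials
    B = randperm(n)[1:b]          # sample b indices without replacement
    v = zeros(n); v[B] .= n / b   # sampling vector for this draw
    vbar .+= v ./ trials          # accumulate the empirical mean of v
end
@show maximum(abs.(vbar .- 1))    # ≈ 0 up to Monte-Carlo noise
```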

  4. Focus on b-Mini-Batch SAGA
     The algorithm:
     – Sample a mini-batch B ⊂ [n] := {1, ..., n} s.t. |B| = b
     – Build the gradient estimate
       g(w^k) = (1/b) Σ_{i ∈ B} ∇f_i(w^k) − (1/b) Σ_{i ∈ B} J^k_{:i} + (1/n) J^k e,
       where e is the all-ones vector and J^k_{:i} is the i-th column of J^k ∈ R^{d×n}
     – Take a step: w^{k+1} = w^k − γ g(w^k)
     – Update the Jacobian estimate: J^{k+1}_{:i} = ∇f_i(w^k), ∀ i ∈ B
     [Figure: example SAGA run on real data (slice dataset), relative distance to optimum (residual) vs. epochs, comparing b_Defazio = 1 with γ_Defazio = 6.10e−05; b_practical = 70 with γ_practical = 1.20e−02 (our contribution: optimal mini-batch and step size); b_practical = 70 with grid-searched γ = 3.13e−02; and b_Hofmann = 20 with γ_Hofmann = 1.59e−03.]
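Below is a minimal Julia sketch of this update rule, specialized to ridge regression so that the per-example gradients are ∇f_i(w) = a_i(a_iᵀw − y_i) + λw. The function name `saga_ridge` and the fixed iteration budget are my own choices; the step size γ is left as an input, since choosing it is the subject of the next slides.

```julia
using LinearAlgebra, Random, Statistics

# b-mini-batch SAGA for ridge regression, where ∇fᵢ(w) = aᵢ(aᵢᵀw − yᵢ) + λw.
# A is d×n with observations as columns; γ is the step size, b the batch size.
function saga_ridge(A, y, λ, γ, b; iters = 10_000)
    d, n = size(A)
    w = zeros(d)
    J = zeros(d, n)               # Jacobian estimate; column i stores a past ∇fᵢ
    Je = vec(sum(J, dims = 2))    # running column sum J e (all zeros at start)
    for _ in 1:iters
        B = randperm(n)[1:b]      # sample the mini-batch without replacement
        # Fresh gradients ∇fᵢ(w) for i ∈ B, stacked as the columns of G:
        G = A[:, B] .* (A[:, B]' * w .- y[B])' .+ λ .* w
        # Gradient estimate: batch mean of fresh gradients minus batch mean of
        # stored columns, plus (1/n) J e, exactly as on the slide:
        g = vec(mean(G, dims = 2)) .- vec(mean(J[:, B], dims = 2)) .+ Je ./ n
        w .-= γ .* g                              # take a step
        Je .+= vec(sum(G .- J[:, B], dims = 2))   # keep the column sum in sync
        J[:, B] .= G                              # overwrite updated columns
    end
    return w
end
```

For example, `w = saga_ridge(A, y, λ, 1e-2, 70)` with the data from the first sketch runs the b = 70 setting shown in the figure (with an arbitrary γ here).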

  5. Key Constant: Expected Smoothness
     Definition (expected smoothness constant): f is L-smooth in expectation if, for every w ∈ R^d,
       E_D[ ‖∇f_v(w) − ∇f_v(w*)‖_2² ] ≤ 2L (f(w) − f(w*))
     • The total complexity of b-mini-batch SAGA, for a given ε > 0, is
       K_total(b) = max{ 4b(L + λ)/μ , n + ((n − b)/(n − 1)) · 4(L_max + λ)/μ } · log(1/ε),
       where λ is the regularizer and L_max := max_{i=1,...,n} L_i
     • This holds for the step size
       γ = 1 / ( 4 max{ L + λ , (1/b)·((n − b)/(n − 1))·L_max + μn/(4b) } )
     Problem: computing L is intractable most of the time.
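Read as code, the step-size rule is a one-liner. A hedged Julia transcription (argument names are mine; `Lexp` stands for whichever value of the expected smoothness constant L is plugged in):

```julia
# Step size from the slide: γ = 1 / (4 max{L + λ, (1/b)((n-b)/(n-1))·Lmax + μn/(4b)}),
# with Lexp playing the role of the expected smoothness constant L.
stepsize(Lexp, λ, μ, Lmax, n, b) =
    1 / (4 * max(Lexp + λ, (n - b) / (b * (n - 1)) * Lmax + μ * n / (4 * b)))
```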

  6. Our Estimates of the Expected Smoothness
     Theorem (upper bounds on L): when sampling b points without replacement, we have
     • Simple bound:
       L ≤ L_simple(b) := (n/b)·((b − 1)/(n − 1))·L̄ + (1/b)·((n − b)/(n − 1))·L_max
     • Bernstein bound:
       L ≤ L_Bernstein(b) := 2·(n/b)·((b − 1)/(n − 1))·L̄ + (1/b)·( (n − b)/(n − 1) + (4/3)·log d )·L_max
     where L̄ := (1/n) Σ_{i=1}^n L_i and L_max := max_{i ∈ [n]} L_i
     • Practical estimate:
       L_practical(b) := (n/b)·((b − 1)/(n − 1))·L + (1/b)·((n − b)/(n − 1))·L_max
     [Figure: estimates of L vs. mini-batch size on artificial data (n = d = 24).]
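A direct Julia transcription of the two bounds and the practical estimate (naming is mine: `Lbar` for L̄, `d` for the dimension in the Bernstein bound). Note that L_practical(1) = L_max and L_practical(n) = L, so the estimate interpolates between the single-example and full-batch regimes.

```julia
# Upper bounds and practical estimate of the expected smoothness constant
# for sampling b points without replacement, as on the slide.
Lsimple(b, n, Lbar, Lmax) =
    (n / b) * (b - 1) / (n - 1) * Lbar + (1 / b) * (n - b) / (n - 1) * Lmax

Lbernstein(b, n, d, Lbar, Lmax) =
    2 * (n / b) * (b - 1) / (n - 1) * Lbar +
    (1 / b) * ((n - b) / (n - 1) + (4 / 3) * log(d)) * Lmax

Lpractical(b, n, L, Lmax) =
    (n / b) * (b - 1) / (n - 1) * L + (1 / b) * (n - b) / (n - 1) * Lmax
```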

  7. Optimal Mini-Batch from the Practical Estimate
     For a precision ε > 0, the total complexity is
       K_total(b) = max{ 4b(L_practical + λ)/μ , n + ((n − b)/(n − 1)) · 4(L_max + λ)/μ } · log(1/ε),
     leading to the optimal mini-batch size
       b*_practical ∈ arg min_{b ∈ [n]} K_total(b)  ⇒  b*_practical = ⌊ 1 + μ(n − 1)/(4L) ⌋
     [Figure: empirical total complexity vs. mini-batch size (slice dataset, λ = 10^{−1}), with markers at b_empirical = 2 and b_practical = 70.]
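Putting the pieces together, here is an illustrative Julia sketch of K_total with the practical estimate and of the closed-form minimizer b*_practical; the function names are mine, and ϵ is the target precision.

```julia
# Practical estimate (repeated from the previous slide) and total complexity.
Lpractical(b, n, L, Lmax) =
    (n / b) * (b - 1) / (n - 1) * L + (1 / b) * (n - b) / (n - 1) * Lmax

Ktotal(b, n, L, Lmax, λ, μ, ϵ) =
    max(4 * b * (Lpractical(b, n, L, Lmax) + λ) / μ,
        n + (n - b) / (n - 1) * 4 * (Lmax + λ) / μ) * log(1 / ϵ)

# Closed-form optimal mini-batch size b* = ⌊1 + μ(n-1)/(4L)⌋ from the slide.
bstar(n, L, μ) = floor(Int, 1 + μ * (n - 1) / (4 * L))

# Example usage: scan K_total over powers of two and compare with b*:
# Ks    = [Ktotal(b, n, L, Lmax, λ, μ, 1e-4) for b in 2 .^ (0:7)]
# b_opt = bstar(n, L, μ)
```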

  8. Summary
     Take-home message
     • Use the optimal mini-batch and step sizes now available for SAGA!
     What was done
     • Built estimates of the expected smoothness constant L
     • Derived optimal settings (b, γ) for mini-batch SAGA ⇒ faster convergence of w^k to w* as k → ∞
     • Provided convincing numerical improvements on real datasets
     • All the Julia code is available at https://github.com/gowerrobert/StochOpt.jl

  9. References (1/2)
     • F. Bach. "Sharp analysis of low-rank kernel matrix approximations". In: ArXiv e-prints (Aug. 2012). arXiv: 1208.2015 [cs.LG].
     • C. C. Chang and C. J. Lin. "LIBSVM: A library for support vector machines". In: ACM Transactions on Intelligent Systems and Technology 2.3 (Apr. 2011), pp. 1–27.
     • A. Defazio, F. Bach, and S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". In: Advances in Neural Information Processing Systems 27. 2014, pp. 1646–1654.
     • R. M. Gower, P. Richtárik, and F. Bach. "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching". In: arXiv preprint arXiv:1805.02632 (2018).
     • D. Gross and V. Nesme. "Note on sampling without replacing from a finite collection of matrices". In: arXiv preprint arXiv:1001.2738 (2010).
     • W. Hoeffding. "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association 58.301 (1963), pp. 13–30.

  10. References (2/2)
     • R. Johnson and T. Zhang. "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction". In: Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 315–323.
     • H. Robbins and S. Monro. "A stochastic approximation method". In: Annals of Mathematical Statistics 22 (1951), pp. 400–407.
     • M. Schmidt, N. Le Roux, and F. Bach. "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming 162.1 (2017), pp. 83–112.
     • J. A. Tropp. "An Introduction to Matrix Concentration Inequalities". In: ArXiv e-prints (Jan. 2015). arXiv: 1501.01571 [math.PR].
     • J. A. Tropp. "Improved analysis of the subsampled randomized Hadamard transform". In: Advances in Adaptive Data Analysis 3.1–2 (2011), pp. 115–126.
     • J. A. Tropp. "User-Friendly Tail Bounds for Sums of Random Matrices". In: Foundations of Computational Mathematics 12.4 (2012), pp. 389–434.
