SLIDE 1

SGD without Replacement: Sharper Rates for General Smooth Convex Functions

Dheeraj Nagaraj

Massachusetts Institute of Technology

June 12, 2019 Joint work with Praneeth Netrapalli and Prateek Jain (MSR India)

SLIDE 2

Overview

1. Introduction
2. Current Results
3. Our Results
4. Main Techniques

SLIDE 3

SGD with Replacement (SGD)

Consider observations $\xi_1, \dots, \xi_n$. Convex loss function $f(\cdot, \xi_i) : \mathbb{R}^d \to \mathbb{R}$. Empirical Risk Minimization:

$$x^* = \arg\min_{x \in \mathcal{D}} \frac{1}{n} \sum_{i=1}^{n} f(x, \xi_i) =: \arg\min_{x \in \mathcal{D}} \hat{F}(x).$$

SGD with replacement (SGD): fix a step size sequence $\alpha_t \geq 0$ and start at $x_0 \in \mathcal{D}$. At every time step, generate an independent random variable $I_t \sim \mathrm{unif}([n])$ and update

$$x_{t+1} = x_t - \alpha_t \nabla f(x_t, \xi_{I_t}).$$

Easy to analyze, since the independence of $I_t$ ensures that $\mathbb{E}_{I_t} \nabla f(x_t, \xi_{I_t}) = \nabla \hat{F}(x_t)$. Sharp non-asymptotic guarantees are available, but this variant is seldom used in practice.
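To make the procedure concrete, here is a minimal sketch of with-replacement SGD. The helper names (`grad_f`, `data`) and the constant step size are illustrative assumptions, not from the slides.

```python
import numpy as np

def sgd_with_replacement(grad_f, data, x0, step_size, num_steps, rng=None):
    """With-replacement SGD: at each step, sample one data point
    uniformly at random (independently) and take a gradient step."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        i = rng.integers(n)  # I_t ~ unif([n]), independent across steps
        x = x - step_size * grad_f(x, data[i])  # x_{t+1} = x_t - alpha_t * grad f(x_t, xi_{I_t})
    return x
```

Because each index is drawn independently, the sampled gradient is an unbiased estimate of $\nabla \hat{F}$, which is exactly what makes this variant easy to analyze.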

SLIDE 4

SGD without Replacement (SGDo)

In practice, the order of the data is fixed (say $\xi_1, \dots, \xi_n$) and the data is processed in this order, one point after the other. One such pass is called an epoch. The algorithm is run for $K$ epochs. A randomized version of this 'gets rid' of the bad orderings.

SGD without Replacement (SGDo): at the beginning of the $k$-th epoch, draw an independent uniformly random permutation $\sigma_k$ and, for $i = 1, \dots, n$, update

$$x_{k,i} = x_{k,i-1} - \alpha_{k,i} \nabla f(x_{k,i-1}; \xi_{\sigma_k(i)}).$$

This is closer to the algorithm implemented in practice. It is harder to analyze, since $\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma_k(i)}) \neq \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.
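A matching sketch of SGDo under the same illustrative assumptions (`grad_f`, `data`, constant step size): each epoch draws a fresh uniform permutation and makes one pass over the data in that order.

```python
import numpy as np

def sgd_without_replacement(grad_f, data, x0, step_size, num_epochs, rng=None):
    """SGDo / random reshuffling: one fresh uniform permutation per epoch,
    then a single pass over the data in that order."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.array(x0, dtype=float)
    for k in range(num_epochs):
        sigma = rng.permutation(n)  # sigma_k: independent uniform permutation
        for i in sigma:             # one epoch: each point used exactly once
            x = x - step_size * grad_f(x, data[i])
    return x
```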

SLIDE 5

Experimental Observations

Experiments¹ found that on many problems SGDo converges at rate O(1/K²), which is faster than SGD, which converges at O(1/K) (K = number of epochs). Theoretically, it had not even been shown that SGDo 'matches' the rate of SGD for all K.

¹Léon Bottou. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.
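The effect is easy to probe empirically. Below is a minimal sketch reusing the two samplers above on a synthetic least-squares problem; the problem, step size, and epoch counts are my own illustrative choices, not from the slides. Both methods get the same number of gradient evaluations (n per epoch).

```python
import numpy as np

# Synthetic least squares: f(x, (a, b_i)) = 0.5 * (a @ x - b_i)**2,
# so F_hat is the usual least-squares objective up to a factor of 1/n.
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def grad_f(x, point):
    a, b_i = point
    return (a @ x - b_i) * a

def subopt(x):  # F_hat(x) - F_hat(x*), up to the constant 1/(2n) factor
    return np.linalg.norm(A @ x - b) ** 2 - np.linalg.norm(A @ x_star - b) ** 2

data = list(zip(A, b))
for K in (1, 10, 100):
    x_o = sgd_without_replacement(grad_f, data, np.zeros(d), 1e-2, K, rng)
    x_w = sgd_with_replacement(grad_f, data, np.zeros(d), 1e-2, K * n, rng)
    print(f"K={K:4d}  SGDo: {subopt(x_o):.3e}  SGD: {subopt(x_w):.3e}")
```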

SLIDE 6

Currently Known Bounds

SLIDE 7

Small number of Epochs

Assumptions: $f(\cdot\,; \xi_i)$ is $L$-smooth, $\|\nabla f(\cdot\,; \xi_i)\| \leq G$, and $\mathrm{diam}(\mathcal{W}) \leq D$.

Suboptimality $O\!\left(\frac{GD}{\sqrt{nK}}\right)$ (leading order, general case).

Suboptimality $O\!\left(\frac{G^2 \log(nK)}{\mu nK}\right)$ (leading order, $\mu$-strongly convex case).

Shamir's result² only works for generalized linear functions and when K = 1. All other "acceleration" results hold only when K is very large.

²Ohad Shamir. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46–54.

SLIDE 8

Large number of Epochs

Assumptions: $f(\cdot\,; \xi_i)$ is $L$-smooth, $\|\nabla f(\cdot\,; \xi_i)\| \leq G$, and $\hat{F}$ is $\mu$-strongly convex. When $K \gtrsim \kappa^2$,

Suboptimality: $O\!\left(\frac{\kappa^2 G^2}{\mu} \cdot \frac{(\log nK)^2}{nK^2}\right)$.

Previous results³ require Hessian smoothness and $K \geq \kappa^{1.5}\sqrt{n}$ to give suboptimality of $O\!\left(\frac{\kappa^4}{n^2 K^2} + \frac{\kappa^4}{K^3}\right)$.

Without a smoothness assumption, there can be no acceleration.

³Jeffery Z. HaoChen and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).

SLIDE 9

Main Techniques

Main bottleneck in the analysis: $\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma_k(i)}) \neq \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.

If $\sigma'_k$ is independent of $\sigma_k$, then $\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma'_k(i)}) = \mathbb{E}\,\nabla \hat{F}(x_{k,i})$. Therefore,

$$\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma_k(i)}) = \mathbb{E}\,\nabla \hat{F}(x_{k,i}) + O\!\left( d_W\big( \mathcal{L}(x_{k,i} \mid \sigma_k(i) = r),\ \mathcal{L}(x_{k,i}) \big) \right),$$

where $d_W$ denotes the Wasserstein distance between the conditional and unconditional laws of $x_{k,i}$. Through coupling arguments:

$$d_W\big( \mathcal{L}(x_{k,i} \mid \sigma_k(i) = r),\ \mathcal{L}(x_{k,i}) \big) \lesssim \alpha_{k,0}\, G.$$
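The bottleneck itself can be seen numerically. Here is a minimal Monte Carlo sketch on a least-squares problem of my own construction (all names and constants are illustrative): it estimates both sides of the first display above partway through an epoch and shows that they differ.

```python
import numpy as np

# Monte Carlo illustration of the bias: after i_step steps of an epoch,
# the gradient at the point about to be used is NOT an unbiased
# estimate of the full gradient.
rng = np.random.default_rng(1)
n, d, i_step, trials = 20, 5, 10, 5000
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad(x, j):  # gradient of f(x, xi_j) = 0.5 * (A[j] @ x - b[j])**2
    return (A[j] @ x - b[j]) * A[j]

bias_term = np.zeros(d)
full_term = np.zeros(d)
for _ in range(trials):
    sigma = rng.permutation(n)
    x = np.zeros(d)
    for i in range(i_step):                  # first i_step steps of the epoch
        x = x - 0.05 * grad(x, sigma[i])
    bias_term += grad(x, sigma[i_step])      # E grad f(x_{k,i}; xi_{sigma_k(i)})
    full_term += (A.T @ (A @ x - b)) / n     # E grad F_hat(x_{k,i})
print(np.linalg.norm((bias_term - full_term) / trials))  # nonzero: the bottleneck
```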

SLIDE 10

Automatic Variance Reduction and Acceleration

For the smooth and strongly convex case,

$$\nabla \hat{F}(x^*) = 0 = \frac{1}{n} \sum_{i=1}^{n} \nabla f(x^*, \xi_{\sigma_k(i)}).$$

(Note that this does not hold with independent sampling.) Therefore, when $x_{k,0} \approx x^*$, we show by coupling arguments that

$$0 \approx \nabla \hat{F}(x_{k,0}) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_{k,i}, \xi_{\sigma_k(i)}).$$

This is similar to the variance reduction seen in modifications of SGD such as SAGA, SVRG, etc.
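The identity is exact over a full permutation and fails under i.i.d. sampling, which a few lines verify; the least-squares setup below is an illustrative assumption, not from the slides.

```python
import numpy as np

# At x*, per-sample gradients summed over one full permutation cancel
# exactly, while an average of n i.i.d.-sampled gradients does not.
rng = np.random.default_rng(2)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # grad F_hat(x*) = 0

grads = (A @ x_star - b)[:, None] * A            # per-sample gradients at x*

sigma = rng.permutation(n)                       # without replacement: exact cancellation
print(np.linalg.norm(grads[sigma].mean(axis=0))) # ~1e-16 (floating point zero)

iid = rng.integers(n, size=n)                    # with replacement: O(1/sqrt(n)) noise
print(np.linalg.norm(grads[iid].mean(axis=0)))   # noticeably nonzero
```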

SLIDE 11

References

Bottou, Léon. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.

Gürbüzbalaban, Mert, Asu Ozdaglar, and Pablo Parrilo. "Why random reshuffling beats stochastic gradient descent". In: arXiv preprint arXiv:1510.08560 (2015).

HaoChen, Jeffery Z. and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).

Shamir, Ohad. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46–54.

SLIDE 12

Questions?
