SLIDE 1

SGD without Replacement: Sharper Rates for General Smooth Convex Functions

Dheeraj Nagaraj

Massachusetts Institute of Technology

June 12, 2019 Joint work with Praneeth Netrapalli and Prateek Jain (MSR India)

SLIDE 2

Overview

1. Introduction
2. Current Results
3. Our Results
4. Main Techniques

SLIDE 3

SGD with Replacement (SGD)

Consider observations $\xi_1, \dots, \xi_n$. Convex loss function $f(\cdot, \xi_i) : \mathbb{R}^d \to \mathbb{R}$. Empirical Risk Minimization:

$$x^* = \arg\min_{x \in \mathcal{D}} \frac{1}{n} \sum_{i=1}^{n} f(x, \xi_i) =: \arg\min_{x \in \mathcal{D}} \hat{F}(x).$$

SGD with replacement (SGD): fix a step size sequence $\alpha_t \geq 0$ and start at $x_0 \in \mathcal{D}$. At every time step, generate an independent random variable $I_t \sim \mathrm{unif}([n])$ and update

$$x_{t+1} = x_t - \alpha_t \nabla f(x_t, \xi_{I_t}).$$

Easy to analyze, since the independence of $I_t$ ensures that $\mathbb{E}_{I_t} \nabla f(x_t, \xi_{I_t}) = \nabla \hat{F}(x_t)$. Sharp non-asymptotic guarantees are available, but this variant is seldom used in practice.
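To make the procedure concrete, here is a minimal sketch of with-replacement SGD. The helper names (`grad_f`, `data`) and the constant step size are illustrative assumptions, not from the slides.

```python
import numpy as np

def sgd_with_replacement(grad_f, data, x0, step_size, num_steps, rng=None):
    """With-replacement SGD: at each step, sample one data point
    uniformly at random (independently) and take a gradient step."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        i = rng.integers(n)  # I_t ~ unif([n]), independent across steps
        x = x - step_size * grad_f(x, data[i])  # x_{t+1} = x_t - alpha_t * grad f(x_t, xi_{I_t})
    return x
```

Because each index is drawn independently, the sampled gradient is an unbiased estimate of $\nabla \hat{F}$, which is exactly what makes this variant easy to analyze.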

SLIDE 4

SGD without Replacement (SGDo)

In practice, the order of the data is fixed (say $\xi_1, \dots, \xi_n$) and the data is processed in this order, one point after the other. One such pass is called an epoch. The algorithm is run for $K$ epochs. A randomized version of this 'gets rid' of the bad orderings.

SGD without Replacement (SGDo): at the beginning of the $k$-th epoch, draw an independent uniformly random permutation $\sigma_k$ and, for $i = 1, \dots, n$, update

$$x_{k,i} = x_{k,i-1} - \alpha_{k,i} \nabla f(x_{k,i-1}; \xi_{\sigma_k(i)}).$$

This is closer to the algorithm implemented in practice. It is harder to analyze, since $\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma_k(i)}) \neq \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.
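A matching sketch of SGDo under the same illustrative assumptions (`grad_f`, `data`, constant step size): each epoch draws a fresh uniform permutation and makes one pass over the data in that order.

```python
import numpy as np

def sgd_without_replacement(grad_f, data, x0, step_size, num_epochs, rng=None):
    """SGDo / random reshuffling: one fresh uniform permutation per epoch,
    then a single pass over the data in that order."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.array(x0, dtype=float)
    for k in range(num_epochs):
        sigma = rng.permutation(n)  # sigma_k: independent uniform permutation
        for i in sigma:             # one epoch: each point used exactly once
            x = x - step_size * grad_f(x, data[i])
    return x
```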

SLIDE 5

Experimental Observations

Experiments¹ found that on many problems SGDo converges at rate O(1/K²), which is faster than SGD, which converges at O(1/K) (K = number of epochs). Theoretically, it had not even been shown that SGDo 'matches' the rate of SGD for all K.

¹Léon Bottou. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.
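The effect is easy to probe empirically. Below is a minimal sketch reusing the two samplers above on a synthetic least-squares problem; the problem, step size, and epoch counts are my own illustrative choices, not from the slides. Both methods get the same number of gradient evaluations (n per epoch).

```python
import numpy as np

# Synthetic least squares: f(x, (a, b_i)) = 0.5 * (a @ x - b_i)**2,
# so F_hat is the usual least-squares objective up to a factor of 1/n.
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def grad_f(x, point):
    a, b_i = point
    return (a @ x - b_i) * a

def subopt(x):  # F_hat(x) - F_hat(x*), up to the constant 1/(2n) factor
    return np.linalg.norm(A @ x - b) ** 2 - np.linalg.norm(A @ x_star - b) ** 2

data = list(zip(A, b))
for K in (1, 10, 100):
    x_o = sgd_without_replacement(grad_f, data, np.zeros(d), 1e-2, K, rng)
    x_w = sgd_with_replacement(grad_f, data, np.zeros(d), 1e-2, K * n, rng)
    print(f"K={K:4d}  SGDo: {subopt(x_o):.3e}  SGD: {subopt(x_w):.3e}")
```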

SLIDE 6

Currently Known Bounds

SLIDE 7

Small number of Epochs

Assumptions: $f(\cdot\,; \xi_i)$ is $L$-smooth, $\|\nabla f(\cdot\,; \xi_i)\| \leq G$, and $\mathrm{diam}(\mathcal{W}) \leq D$.

Suboptimality $O\!\left(\frac{GD}{\sqrt{nK}}\right)$ (leading order, general case).

Suboptimality $O\!\left(\frac{G^2 \log(nK)}{\mu nK}\right)$ (leading order, $\mu$-strongly convex case).

Shamir's result² only works for generalized linear functions and when K = 1. All other "acceleration" results hold only when K is very large.

²Ohad Shamir. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46–54.

SLIDE 8

Large number of Epochs

Assumptions: $f(\cdot\,; \xi_i)$ is $L$-smooth, $\|\nabla f(\cdot\,; \xi_i)\| \leq G$, and $\hat{F}$ is $\mu$-strongly convex. When $K \gtrsim \kappa^2$,

Suboptimality: $O\!\left(\frac{\kappa^2 G^2}{\mu} \cdot \frac{(\log nK)^2}{nK^2}\right)$.

Previous results³ require Hessian smoothness and $K \geq \kappa^{1.5}\sqrt{n}$ to give suboptimality of $O\!\left(\frac{\kappa^4}{n^2 K^2} + \frac{\kappa^4}{K^3}\right)$.

Without a smoothness assumption, there can be no acceleration.

³Jeffery Z. HaoChen and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).

SLIDE 9

Main Techniques

Main bottleneck in the analysis: $\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma_k(i)}) \neq \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.

If $\sigma'_k$ is independent of $\sigma_k$, then $\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma'_k(i)}) = \mathbb{E}\,\nabla \hat{F}(x_{k,i})$. Therefore,

$$\mathbb{E}\,\nabla f(x_{k,i}; \xi_{\sigma_k(i)}) = \mathbb{E}\,\nabla \hat{F}(x_{k,i}) + O\!\left( d_W\big( \mathcal{L}(x_{k,i} \mid \sigma_k(i) = r),\ \mathcal{L}(x_{k,i}) \big) \right),$$

where $d_W$ denotes the Wasserstein distance between the conditional and unconditional laws of $x_{k,i}$. Through coupling arguments:

$$d_W\big( \mathcal{L}(x_{k,i} \mid \sigma_k(i) = r),\ \mathcal{L}(x_{k,i}) \big) \lesssim \alpha_{k,0}\, G.$$
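The bottleneck itself can be seen numerically. Here is a minimal Monte Carlo sketch on a least-squares problem of my own construction (all names and constants are illustrative): it estimates both sides of the first display above partway through an epoch and shows that they differ.

```python
import numpy as np

# Monte Carlo illustration of the bias: after i_step steps of an epoch,
# the gradient at the point about to be used is NOT an unbiased
# estimate of the full gradient.
rng = np.random.default_rng(1)
n, d, i_step, trials = 20, 5, 10, 5000
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad(x, j):  # gradient of f(x, xi_j) = 0.5 * (A[j] @ x - b[j])**2
    return (A[j] @ x - b[j]) * A[j]

bias_term = np.zeros(d)
full_term = np.zeros(d)
for _ in range(trials):
    sigma = rng.permutation(n)
    x = np.zeros(d)
    for i in range(i_step):                  # first i_step steps of the epoch
        x = x - 0.05 * grad(x, sigma[i])
    bias_term += grad(x, sigma[i_step])      # E grad f(x_{k,i}; xi_{sigma_k(i)})
    full_term += (A.T @ (A @ x - b)) / n     # E grad F_hat(x_{k,i})
print(np.linalg.norm((bias_term - full_term) / trials))  # nonzero: the bottleneck
```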

SLIDE 10

Automatic Variance Reduction and Acceleration

For the smooth and strongly convex case,

$$\nabla \hat{F}(x^*) = 0 = \frac{1}{n} \sum_{i=1}^{n} \nabla f(x^*, \xi_{\sigma_k(i)}).$$

(Note that this does not hold with independent sampling.) Therefore, when $x_{k,0} \approx x^*$, we show by coupling arguments that

$$0 \approx \nabla \hat{F}(x_{k,0}) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_{k,i}, \xi_{\sigma_k(i)}).$$

This is similar to the variance reduction seen in modifications of SGD such as SAGA, SVRG, etc.
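The identity is exact over a full permutation and fails under i.i.d. sampling, which a few lines verify; the least-squares setup below is an illustrative assumption, not from the slides.

```python
import numpy as np

# At x*, per-sample gradients summed over one full permutation cancel
# exactly, while an average of n i.i.d.-sampled gradients does not.
rng = np.random.default_rng(2)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # grad F_hat(x*) = 0

grads = (A @ x_star - b)[:, None] * A            # per-sample gradients at x*

sigma = rng.permutation(n)                       # without replacement: exact cancellation
print(np.linalg.norm(grads[sigma].mean(axis=0))) # ~1e-16 (floating point zero)

iid = rng.integers(n, size=n)                    # with replacement: O(1/sqrt(n)) noise
print(np.linalg.norm(grads[iid].mean(axis=0)))   # noticeably nonzero
```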

SLIDE 11

References

Bottou, Léon. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.

Gürbüzbalaban, Mert, Asu Ozdaglar, and Pablo Parrilo. "Why random reshuffling beats stochastic gradient descent". In: arXiv preprint arXiv:1510.08560 (2015).

HaoChen, Jeffery Z. and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).

Shamir, Ohad. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46–54.

SLIDE 12

Questions?
