Optimal Mini-Batch and Step Sizes for SAGA


SLIDE 1

Optimal Mini-Batch and Step Sizes for SAGA

Nidham Gazagnadou (1,a), joint work with Robert M. Gower (1) & Joseph Salmon (2)

(1) LTCI, Télécom Paris, Institut Polytechnique de Paris, France
(2) IMAG, Univ Montpellier, CNRS, Montpellier, France
(a) This work was supported by grants from Région Île-de-France

SLIDE 2

The Optimization Problem

  • Goal

find $w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$

SLIDE 3

The Optimization Problem

  • Goal

find $w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$, where

– $n$ i.i.d. observations: $(a_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ or $\mathbb{R}^d \times \{-1, 1\}$
– $f_i : \mathbb{R}^d \to \mathbb{R}$ is $L_i$-smooth for all $i \in [n]$
– $f$ is $L$-smooth and $\mu$-strongly convex

SLIDE 4

The Optimization Problem

  • Goal

find $w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$, where

– $n$ i.i.d. observations: $(a_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ or $\mathbb{R}^d \times \{-1, 1\}$
– $f_i : \mathbb{R}^d \to \mathbb{R}$ is $L_i$-smooth for all $i \in [n]$
– $f$ is $L$-smooth and $\mu$-strongly convex

  • Covered problems

– Ridge regression (see the Julia sketch below)
– Regularized logistic regression
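To make the setting concrete, here is a minimal Julia sketch (illustrative only, not the StochOpt.jl API) of the finite-sum objective for ridge regression, together with the smoothness constants used later; the data A, y and the regularizer λ are made up for the example.

    # Finite-sum ERM objective f(w) = (1/n) Σ_i f_i(w) for ridge regression,
    # with f_i(w) = 0.5 * (a_i' w - y_i)^2 + (λ/2) * ||w||^2.
    using LinearAlgebra, Random

    n, d, λ = 100, 10, 1e-1
    A = randn(n, d)                              # rows are the observations a_i
    y = A * randn(d) .+ 0.1 .* randn(n)          # synthetic targets

    fi(w, i) = 0.5 * (dot(A[i, :], w) - y[i])^2 + (λ / 2) * dot(w, w)
    f(w)     = sum(fi(w, i) for i in 1:n) / n

    # For this model each f_i is L_i-smooth with L_i = ||a_i||^2 + λ.
    Li   = [norm(A[i, :])^2 + λ for i in 1:n]
    Lmax = maximum(Li)                           # L_max := max_i L_i
    Lbar = sum(Li) / n                           # L̄ := (1/n) Σ_i L_i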

SLIDE 5

Reformulation of the ERM

  • Sampling vector

Let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for all $i \in [n] := \{1, \ldots, n\}$.

SLIDE 6

Reformulation of the ERM

  • Sampling vector

Let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for all $i \in [n] := \{1, \ldots, n\}$.

  • ERM stochastic reformulation

find $w^* \in \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}[f_v(w)]$, where $f_v(w) := \frac{1}{n} \sum_{i=1}^{n} v_i f_i(w)$

  • leading to an unbiased gradient estimate

$$\mathbb{E}_{\mathcal{D}}[\nabla f_v(w)] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\mathcal{D}}[v_i] \, \nabla f_i(w) = \nabla f(w)$$

SLIDE 7

Reformulation of the ERM

  • Sampling vector

Let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for all $i \in [n] := \{1, \ldots, n\}$.

  • ERM stochastic reformulation

find $w^* \in \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}[f_v(w)]$, where $f_v(w) := \frac{1}{n} \sum_{i=1}^{n} v_i f_i(w)$

  • leading to an unbiased gradient estimate

$$\mathbb{E}_{\mathcal{D}}[\nabla f_v(w)] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\mathcal{D}}[v_i] \, \nabla f_i(w) = \nabla f(w)$$

  • Arbitrary sampling includes all mini-batching strategies, such as sampling $b \in [n]$ elements without replacement (see the Julia sketch below):

$$\mathbb{P}\left[ v = \frac{n}{b} \sum_{i \in B} e_i \right] = \frac{1}{\binom{n}{b}}, \quad \text{for all } B \subseteq [n],\ |B| = b.$$
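The Julia sketch below (continuing the ridge example above and reusing A, y, λ, n, d from it; all names are illustrative) builds the stochastic gradient this sampling vector induces, $\nabla f_v(w) = \frac{1}{b}\sum_{i \in B} \nabla f_i(w)$, and checks its unbiasedness empirically by averaging over many draws.

    using LinearAlgebra, Random, Statistics

    # Gradient of one ridge component and of the full objective f.
    grad_fi(w, i) = (dot(A[i, :], w) - y[i]) .* A[i, :] .+ λ .* w
    grad_f(w)     = sum(grad_fi(w, i) for i in 1:n) ./ n

    # ∇f_v(w) for b-without-replacement sampling: v_i = n/b on B, 0 elsewhere,
    # so (1/n) Σ_i v_i ∇f_i(w) is just the mini-batch average over B.
    function grad_fv(w, b)
        B = randperm(n)[1:b]
        return sum(grad_fi(w, i) for i in B) ./ b
    end

    # Empirical check of E_D[∇f_v(w)] = ∇f(w).
    w = randn(d)
    est = mean([grad_fv(w, 5) for _ in 1:100_000])
    println(norm(est - grad_f(w)) / norm(grad_f(w)))   # should be close to 0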

SLIDE 8

Focus on b-Mini-Batch SAGA

The algorithm (sketched in Julia below)

– Sample a mini-batch $B \subset [n] := \{1, \ldots, n\}$ s.t. $|B| = b$
– Build the gradient estimate
$$g(w^k) = \frac{1}{b} \sum_{i \in B} \nabla f_i(w^k) - \frac{1}{b} \sum_{i \in B} J^k_{:i} + \frac{1}{n} J^k e$$
where $e$ is the all-ones vector and $J^k_{:i}$ the $i$-th column of $J^k \in \mathbb{R}^{d \times n}$
– Take a step: $w^{k+1} = w^k - \gamma g(w^k)$
– Update the Jacobian estimate $J^k$: set $J^{k+1}_{:i} = \nabla f_i(w^k)$ for all $i \in B$ (the other columns are left unchanged)

Our contribution:

  • Optimal mini-batch and step size

[Figure: example SAGA runs on real data (slice data set), relative distance to optimum (residual) vs. epochs, comparing (b_Defazio = 1, γ_Defazio = 6.10e-05), (b_practical = 70, γ_practical = 1.20e-02), (b_practical = 70, γ_gridsearch = 3.13e-02) and (b_Hofmann = 20, γ_Hofmann = 1.59e-03).]
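A minimal Julia sketch of these b-mini-batch SAGA updates (not the StochOpt.jl implementation; it reuses grad_fi, n and d from the ridge example above, and the step size γ is supplied by the caller):

    using LinearAlgebra, Random

    function minibatch_saga(w0, b, γ, iters)
        w = copy(w0)
        J = zeros(d, n)                       # J[:, i] ≈ ∇f_i at the last point where i was sampled
        for _ in 1:iters
            B = randperm(n)[1:b]              # sample a mini-batch without replacement
            g = vec(sum(J, dims=2)) ./ n      # (1/n) J^k e
            for i in B
                gi = grad_fi(w, i)
                g .+= (gi .- J[:, i]) ./ b    # add (1/b) Σ_{i∈B} (∇f_i(w^k) − J^k_{:i})
                J[:, i] .= gi                 # update only the sampled columns
            end
            w .-= γ .* g                      # w^{k+1} = w^k − γ g(w^k)
        end
        return w
    end

    # e.g. w_final = minibatch_saga(zeros(d), 20, 5e-3, 50 * n)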

SLIDE 9

Key Constant: Expected Smoothness

Definition (Expected Smoothness constant). The function $f$ is $\mathcal{L}$-smooth in expectation (w.r.t. $\mathcal{D}$) if for every $w \in \mathbb{R}^d$
$$\mathbb{E}_{\mathcal{D}}\left[ \left\| \nabla f_v(w) - \nabla f_v(w^*) \right\|_2^2 \right] \le 2\mathcal{L}\,(f(w) - f(w^*))$$

SLIDE 10

Key Constant: Expected Smoothness

Definition (Expected Smoothness constant). The function $f$ is $\mathcal{L}$-smooth in expectation (w.r.t. $\mathcal{D}$) if for every $w \in \mathbb{R}^d$
$$\mathbb{E}_{\mathcal{D}}\left[ \left\| \nabla f_v(w) - \nabla f_v(w^*) \right\|_2^2 \right] \le 2\mathcal{L}\,(f(w) - f(w^*))$$
  • Total complexity of b-mini-batch SAGA, for a given $\epsilon > 0$, is
$$K_{\text{total}}(b) = \max\left\{ \frac{4b(\mathcal{L} + \lambda)}{\mu},\ n + \frac{n-b}{n-1}\,\frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right),$$
where $\lambda$ is the regularizer and $L_{\max} := \max_{i=1,\ldots,n} L_i$
  • For a step size
$$\gamma = \frac{1}{4 \max\left\{ \mathcal{L} + \lambda,\ \frac{1}{b}\,\frac{n-b}{n-1}\,L_{\max} + \frac{\mu}{4}\,\frac{n}{b} \right\}}.$$

SLIDE 11

Key Constant: Expected Smoothness

Definition (Expected Smoothness constant). The function $f$ is $\mathcal{L}$-smooth in expectation (w.r.t. $\mathcal{D}$) if for every $w \in \mathbb{R}^d$
$$\mathbb{E}_{\mathcal{D}}\left[ \left\| \nabla f_v(w) - \nabla f_v(w^*) \right\|_2^2 \right] \le 2\mathcal{L}\,(f(w) - f(w^*))$$
  • Total complexity of b-mini-batch SAGA, for a given $\epsilon > 0$, is
$$K_{\text{total}}(b) = \max\left\{ \frac{4b(\mathcal{L} + \lambda)}{\mu},\ n + \frac{n-b}{n-1}\,\frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right),$$
where $\lambda$ is the regularizer and $L_{\max} := \max_{i=1,\ldots,n} L_i$
  • For a step size (both formulas are spelled out in the Julia sketch below)
$$\gamma = \frac{1}{4 \max\left\{ \mathcal{L} + \lambda,\ \frac{1}{b}\,\frac{n-b}{n-1}\,L_{\max} + \frac{\mu}{4}\,\frac{n}{b} \right\}}.$$

Problem: calculating $\mathcal{L}$ is most of the time intractable.
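The complexity and step-size formulas above, written out as plain Julia functions (a sketch with illustrative names; Lexp stands for the expected smoothness constant or any upper bound on it):

    # Step size γ(b) and total complexity K_total(b) of b-mini-batch SAGA.
    stepsize(Lexp, Lmax, μ, λ, n, b) =
        1 / (4 * max(Lexp + λ, (1 / b) * (n - b) / (n - 1) * Lmax + (μ / 4) * (n / b)))

    total_complexity(Lexp, Lmax, μ, λ, n, b, ϵ) =
        max(4 * b * (Lexp + λ) / μ, n + (n - b) / (n - 1) * 4 * (Lmax + λ) / μ) * log(1 / ϵ)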

SLIDE 12

Our Estimates of the Expected Smoothness

Theorem (Upper bounds on $\mathcal{L}$). When sampling $b$ points without replacement we have

  • Simple bound
$$\mathcal{L} \le L_{\text{simple}}(b) := \frac{n}{b}\,\frac{b-1}{n-1}\,\bar{L} + \frac{1}{b}\,\frac{n-b}{n-1}\,L_{\max}$$
  • Bernstein bound
$$\mathcal{L} \le L_{\text{Bernstein}}(b) := 2\,\frac{b-1}{b}\,\frac{n}{n-1}\,L + \frac{1}{b}\left(\frac{n-b}{n-1} + \frac{4}{3}\log d\right) L_{\max}$$
where $\bar{L} := \frac{1}{n}\sum_{i=1}^{n} L_i$ and $L_{\max} := \max_{i \in [n]} L_i$

Practical estimate (all three estimates are coded in the Julia sketch below):
$$L_{\text{practical}}(b) := \frac{n}{b}\,\frac{b-1}{n-1}\,L + \frac{1}{b}\,\frac{n-b}{n-1}\,L_{\max}$$

[Figure: estimates of $\mathcal{L}$ vs. mini-batch size on artificial data (n = d = 24).]
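A small Julia sketch of the three estimates above (illustrative function names; Lbar, Lmax, L and the dimension d are the constants defined on this slide):

    # Upper bounds / estimates of the expected smoothness constant for b-without-replacement sampling.
    L_simple(b, n, Lbar, Lmax)    = (n / b) * (b - 1) / (n - 1) * Lbar + (1 / b) * (n - b) / (n - 1) * Lmax
    L_bernstein(b, n, d, L, Lmax) = 2 * (b - 1) / b * n / (n - 1) * L +
                                    (1 / b) * ((n - b) / (n - 1) + (4 / 3) * log(d)) * Lmax
    L_practical(b, n, L, Lmax)    = (n / b) * (b - 1) / (n - 1) * L + (1 / b) * (n - b) / (n - 1) * Lmax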

SLIDE 13

Optimal Mini-Batch from the Practical Estimate

For a precision $\epsilon > 0$, the total complexity is
$$K_{\text{total}}(b) = \max\left\{ \frac{4b(L_{\text{practical}}(b) + \lambda)}{\mu},\ n + \frac{n-b}{n-1}\,\frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right)$$
  • Leading to the optimal mini-batch size (see the Julia sketch below)
$$b^*_{\text{practical}} \in \arg\min_{b \in [n]} K_{\text{total}}(b) \ \Longrightarrow\ b^*_{\text{practical}} = \left\lfloor 1 + \frac{\mu(n-1)}{4L} \right\rfloor$$

[Figure: empirical total complexity vs. mini-batch size on the slice dataset (λ = 10^{-1}), marking b_empirical = 2 and b_practical = 70.]
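Putting the pieces together in Julia (a sketch reusing the stepsize and L_practical helpers from the previous slides; the floor simply rounds the batch size down to an integer):

    # Optimal mini-batch size from the practical estimate, and the matching step size.
    function optimal_minibatch_practical(n, L, Lmax, μ, λ)
        b = floor(Int, 1 + μ * (n - 1) / (4 * L))
        γ = stepsize(L_practical(b, n, L, Lmax), Lmax, μ, λ, n, b)
        return b, γ
    end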

SLIDE 14

Summary

Take Home Message

  • Use the optimal mini-batch and step sizes available for SAGA!

What was done

  • Build estimates of the expected smoothness constant $\mathcal{L}$
  • Give optimal settings (b, γ) for mini-batch SAGA

$\Longrightarrow$ Faster convergence of $w^k \to w^*$ as $k \to \infty$

  • Provide convincing numerical improvements on real datasets
  • All the Julia code is available at https://github.com/gowerrobert/StochOpt.jl


SLIDE 15

References (1/2)

  • F. Bach. "Sharp analysis of low-rank kernel matrix approximations". In: ArXiv e-prints (Aug. 2012). arXiv: 1208.2015 [cs.LG].
  • C. C. Chang and C. J. Lin. "LIBSVM: A library for support vector machines". In: ACM Transactions on Intelligent Systems and Technology 2.3 (Apr. 2011), pp. 1–27.
  • A. Defazio, F. Bach, and S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". In: Advances in Neural Information Processing Systems 27. 2014, pp. 1646–1654.
  • R. M. Gower, P. Richtárik, and F. Bach. "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching". In: arXiv preprint arXiv:1805.02632 (2018).
  • D. Gross and V. Nesme. "Note on sampling without replacing from a finite collection of matrices". In: arXiv preprint arXiv:1001.2738 (2010).
  • W. Hoeffding. "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association 58.301 (1963), pp. 13–30.

SLIDE 16

References (2/2)

  • R. Johnson and T. Zhang. "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction". In: Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 315–323.
  • H. Robbins and S. Monro. "A stochastic approximation method". In: Annals of Mathematical Statistics 22 (1951), pp. 400–407.
  • M. Schmidt, N. Le Roux, and F. Bach. "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming 162.1 (2017), pp. 83–112.
  • J. A. Tropp. "An Introduction to Matrix Concentration Inequalities". In: ArXiv e-prints (Jan. 2015). arXiv: 1501.01571 [math.PR].
  • J. A. Tropp. "Improved analysis of the subsampled randomized Hadamard transform". In: Advances in Adaptive Data Analysis 3.1–2 (2011), pp. 115–126.
  • J. A. Tropp. "User-Friendly Tail Bounds for Sums of Random Matrices". In: Foundations of Computational Mathematics 12.4 (2012), pp. 389–434.