SLIDE 1

On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent

Aymeric Dieuleveut CMAP, École Polytechnique, Institut Polytechnique de Paris Joint work with Scott Pesme and Nicolas Flammarion (EPFL)

10/03/2020 Cirm Luminy - Optimization for Machine Learning

SLIDE 3

Questions

  • 1. Feel free to ask any question.
  • 2. Let me ask a few first:
  • Who knows about Stochastic Gradient Descent?
  • Who knows the convergence rate for the last iterate instead of the averaged iterate?
  • Who knows about Pflug’s convergence diagnostic?

SLIDE 4

Why would we still talk about SGD?

Objective function f : D → R to minimize

θ_{n+1} = θ_n − γ_{n+1} f′_{n+1}(θ_n) = θ_n − γ_{n+1} [ f′(θ_n) + ξ_{n+1}(θ_n) ].

What choice for the learning rate (γ_n)_{n∈ℕ}? As often:

  • Theoreticians (♥) came up with optimal answers (convex setting).
  • Practitioners do not use them!

If it works in theory it also works in practice – in theory. Why not?

  • 1. The step size in SGD often depends on unknown parameters (esp. µ-strong convexity).
  • 2. It may be very sensitive to those parameters.
  • 3. It does not adapt to the noise and function regularity.
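To make the recursion concrete, here is a minimal sketch in Python; the least-squares oracle ls_grad, the constants and the problem sizes are illustrative assumptions, not from the slides.

```python
import numpy as np

def sgd(grad, theta0, gamma, n_iters, rng):
    """Constant-step SGD: theta_{n+1} = theta_n - gamma * f'_{n+1}(theta_n),
    where grad returns a stochastic gradient f'(theta) + xi."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta -= gamma * grad(theta, rng)
    return theta

# Toy oracle: least squares f(theta) = 0.5 * E[(x^T theta - y)^2]
rng = np.random.default_rng(0)
d = 10
theta_star = rng.normal(size=d)

def ls_grad(theta, rng):
    x = rng.normal(size=d)                    # one random feature vector
    y = x @ theta_star + 0.1 * rng.normal()   # noisy label
    return (x @ theta - y) * x                # stochastic gradient of f

theta = sgd(ls_grad, np.zeros(d), gamma=0.05, n_iters=10_000, rng=rng)
```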

SLIDE 5

A few observations

a) Large learning rates often converge faster at the beginning.
b) But they then result in saturation: a two-phase behavior.
c) Theory suggests using the Polyak-Ruppert averaged iterate, but the final one might not be that bad.
d) In deep learning, common practice is to use a constant learning rate, reduced occasionally.

SLIDE 6

a) Large learning rates often converge faster at the beginning

SGD nearly always results in a Bias (initial condition) - Variance (noise) tradeoff. A large initial learning rate maximizes the decay of the bias.

[Plots: excess loss vs. iteration; left: step 1/R² vs. decaying steps; right: 1/(2R²) vs. 1/(2R²√n).]

Figure 1: Logistic regression on the Covertype dataset / synthetic dataset.

SLIDE 7

b) Saturation and limit distribution: two phases

  • “Transient phase” during which the initial conditions are forgotten exponentially fast.
  • “Stationary phase” where the iterates oscillate around θ∗.

[Plots: SGD path (θn, transient vs. stationary, θ0, θ∗) and excess loss f(θn) − f(θ∗) vs. iteration on a synthetic logistic dataset (d = 2), step size 1/(128R²), with the averaged iterate.]

Figure 2: Constant step size SGD (2-dimensional) path illustration.

For smooth and strongly convex functions, θn converges in distribution to πγ, the “limit distribution”; πγ is a stationary distribution.

SLIDE 8

c) Polyak-Ruppert averaged iterate vs final one.

Instead of just the final iterate θ_n^{(γ)}, we can consider the Polyak-Ruppert average:

θ̄_n^{(γ)} = (1/n) Σ_{k=0}^{n−1} θ_k^{(γ)}.

Averaging strongly reduces the impact of the noise, but slows down the decay of the bias term. How bad is the last iterate...? It depends!
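A minimal sketch of maintaining this average online (same illustrative grad-oracle convention as the SGD sketch above):

```python
import numpy as np

def sgd_with_pr_average(grad, theta0, gamma, n_iters, rng):
    """Constant-step SGD that also maintains the Polyak-Ruppert running
    average theta_bar_n = (1/n) * sum_{k=0}^{n-1} theta_k."""
    theta = theta0.copy()
    theta_bar = np.zeros_like(theta0)
    for n in range(1, n_iters + 1):
        theta_bar += (theta - theta_bar) / n  # fold theta_{n-1} into the average
        theta -= gamma * grad(theta, rng)
    return theta, theta_bar
```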

SLIDE 9

c) Polyak-Ruppert averaged iterate vs final one. 2

Settings compared, final iterate vs. average (rates filled in on the slide):

  • Convex & smooth
  • Strongly convex & smooth
  • No noise (deterministic)
  • Finite-dimensional quadratic
  • Kernel regression

The proof by Shamir & Zhang is nice!

SLIDE 10

c) Polyak-Ruppert averaged iterate vs final one. 2

SLIDE 11

Previous work: with decreasing step-sizes

(Moulines & Bach 2011), smooth + strongly convex. Setting γn = 1/(µn) we get:

E‖θn − θ∗‖² = O(1/(µ²n)).

(Shamir & Zhang 2012), bounded gradients + strongly convex. Setting γn = 1/(µn) we get:

E[f(θn) − f(θ∗)] = O(log(n)/(µn)).

(Shamir & Zhang 2012), bounded gradients + weakly convex. Setting γn = 1/√n we get:

E[f(θn) − f(θ∗)] = O(log(n)/√n).

SLIDE 12

d) Deep Learning: training NN

[Plot: (1 − test_accuracy) over training for run classical_1_wdb; the learning rate is divided by 10 at scheduled points.]

Figure 3: Typical accuracy curve in deep learning (Cifar10 dataset, Resnet18).

SLIDE 13

Overall...

  • in the strongly convex case, µ is often unknown and hard to evaluate;
  • a slight misspecification of µ can lead to arbitrarily slow convergence rates (see Moulines & Bach 2011);
  • we would like to make use of the uniform convexity assumption;
  • ideally, we would like a learning rate sequence that adapts to f;
  • these step-size sequences are not used in practice for deep learning.

SLIDE 14

Outline

Natural strategy: decrease the learning rate when there is no more progress. Hopes: adaptive “restarts” to

  • use the “maximal step size” as long as it is useful;
  • adapt to unknown parameters.

Outline:

  • 1. Convergence properties of SGD with piecewise constant learning rates.
  • 2. Detecting stationarity: Pflug’s statistic.
  • 3. Detecting stationarity: a new heuristic.

“Restart”: nothing to restart, just changing the learning rate!

SLIDE 15

“Omniscient strategies”. What can we achieve with piecewise constant step sizes?

SLIDE 16

What rate can you get if you use a large step size for as long as possible and decrease it when the loss saturates?

SLIDE 17

Oracle algorithm

Theorem (Needell 2014)

E‖θn − θ∗‖² ≤ (1 − bγ)ⁿ ‖θ0 − θ∗‖² + cσ²γ + O(γ²),

where b, c depend on f and σ² = E‖ξ(θ∗)‖².

Theoretical procedure: let p, r ∈ [0,1]. Start with learning rate γ0 and run while

E‖θn − θ∗‖² ≤ (1 − 2γ0µ)ⁿ E‖θ0 − θ∗‖² + (σ²/µ)γ0,

stopping at ∆n1 such that (bias term) = p × (variance term), i.e. (1 − 2γ0µ)^{∆n1} E‖θ0 − θ∗‖² = p × (σ²/µ)γ0.

Set γ1 = rγ0 and restart from θn1 = θ∆n1:

E‖θn − θ∗‖² ≤ (1 − 2γ1µ)^{n−n1} E‖θn1 − θ∗‖² + (σ²/µ)γ1,

with ∆n2 again such that (bias term) = p × (variance term), etc.

(Related but slightly different from Hazan & Kale 2010, e.g.)
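A sketch of this thought experiment in Python. It assumes oracle access to µ, σ² and θ∗ (exactly what makes the procedure purely theoretical) and tracks the bias contraction in closed form instead of evaluating the loss.

```python
import numpy as np

def oracle_piecewise_sgd(grad, theta0, theta_star, mu, sigma2,
                         gamma0, p, r, n_iters, rng):
    """Decrease gamma by factor r whenever the (oracle-tracked) bias term
    falls to p times the variance term sigma2 * gamma / mu."""
    theta, gamma = theta0.copy(), gamma0
    bias = float(np.sum((theta0 - theta_star) ** 2))  # E||theta_0 - theta*||^2
    for _ in range(n_iters):
        theta -= gamma * grad(theta, rng)
        bias *= (1 - 2 * gamma * mu)                  # contraction of the bias
        if bias <= p * sigma2 * gamma / mu:           # bias = p * variance
            gamma *= r                                # restart with smaller step
            bias = float(np.sum((theta - theta_star) ** 2))
    return theta
```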

SLIDE 18

Oracle algorithm analysis, good news

Theorem (strongly convex + smooth). Following the previous oracle procedure and assuming that ‖θ0 − θ∗‖² ≤ (p + 1)(σ²/µ)γ0:

E‖θnk − θ∗‖² ≤ [(p + 1)σ² ln(1 + 1/p) / ((1 − r) r)] · 1/(µ²nk) = O(1/(µ²nk)).

  • The upper bound can be optimized over p and r.
  • Purely theoretical result, since none of these constants are known.
  • The step size sequence produced is piecewise constant and “imitates” γn = 1/(µn).

Beyond the smooth & strongly convex setting: uniformly convex functions.

SLIDE 19

Assumptions on f

Convexity:

  • Weak convexity: f(θ1) ≥ f(θ2) + ⟨f′(θ2), θ1 − θ2⟩
  • Strong convexity, µ > 0: f(θ1) ≥ f(θ2) + ⟨f′(θ2), θ1 − θ2⟩ + (µ/2)‖θ1 − θ2‖²
  • Uniform convexity: f is uniformly convex with parameters µ > 0, ρ ∈ [2, +∞) if: f(θ1) ≥ f(θ2) + ⟨f′(θ2), θ1 − θ2⟩ + (µ/ρ)‖θ1 − θ2‖^ρ

Smoothness:

  • (L-smoothness) for any n ∈ ℕ, fn is L-smooth: ‖f′n(θ1) − f′n(θ2)‖ ≤ L‖θ1 − θ2‖ a.s.
  • (Non-smooth, bounded gradients) bounded gradients framework: E‖f′n(θn−1)‖² ≤ G²
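As a quick added illustration (not from the slides), the vanilla example f(θ) = (1/ρ)‖θ‖^ρ used later is uniformly convex with exponent ρ; a numeric probe of the Bregman gap gives an empirical lower bound on the constant µ.

```python
import numpy as np

# Numeric probe (not a proof): estimate the uniform convexity constant mu of
# f(theta) = (1/rho) * ||theta||^rho for rho = 2.5 by sampling pairs of points.
rng = np.random.default_rng(1)
rho, d = 2.5, 5
f = lambda t: np.linalg.norm(t) ** rho / rho
grad_f = lambda t: np.linalg.norm(t) ** (rho - 2) * t  # gradient of f

ratios = []
for _ in range(10_000):
    t1, t2 = rng.normal(size=d), rng.normal(size=d)
    bregman = f(t1) - f(t2) - grad_f(t2) @ (t1 - t2)   # >= 0 by convexity
    ratios.append(bregman / (np.linalg.norm(t1 - t2) ** rho / rho))
print(f"empirical mu over samples: {min(ratios):.3f}")
```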

SLIDE 20

Non-smooth, uniformly convex setting

Proposition (PDF 2020). If f is a uniformly convex function with parameter ρ > 2 with G-bounded gradients, then:

E[f(θn) − f(θ∗)] ≤ C (1/(γn))^{1/τ} + G² log(n) γ,

where τ = 1 − 2/ρ ∈ [0,1].

In the finite horizon framework, this results in:

E[f(θN) − f(θ∗)] ≤ O(log N / N^{1/(1+τ)})

  • Notice that 1/(1+τ) ∈ [0.5, 1]: we get an interpolation between the weakly convex and strongly convex cases.
  • Juditsky & Nesterov 2014 obtain a similar rate with a different algorithm.
  • Roulet & d’Aspremont obtain the N^{−1/τ} rate for GD.

SLIDE 21

Restart at saturation

Considering the previous upper bound and following the previous “oracle” procedure (restart when Bias = p × Variance):

Theorem (PDF 20)

f(θnk) − f(θ∗) ≤ O( log(nk) / nk^{1/(1+τ)} )

As before, the strategy of constant steps with “restart at saturation” gives satisfying rates (as good as the best known strategy for decaying steps).

SLIDE 22

Numerical simulation in the quadratic case

[Plot: f(θn) − f(θ∗) vs. iteration for synthetic least squares; oracle piecewise-constant SGD with r = 1/4 vs. constant step 1/(2R²), with restart milestones marked.]

Figure 4: Oracle piecewise-constant SGD.

SLIDE 23

Numerical simulation in the uniformly convex case

Vanilla example: f(θ) = (1/ρ)‖θ‖^ρ with ρ = 2.5, giving a rate of ∼ n^{−0.8}.

[Plots: f(θn) and γn vs. iteration (d = 200, τ = 1 − 2/ρ); the restart strategy vs. γn = n^{−1/(τ+1)} vs. γn = 1/√n.]

Figure 5: Oracle piecewise-constant SGD for a uniformly convex function.

SLIDE 24

Conclusion 1

The oracle procedure has good theoretical guarantees and adapts to the framework (smoothness, uniform convexity, deterministic case). But:

  • the constants are unknown;
  • computing the loss to detect saturation would be very time consuming.

Can we detect saturation without having access to the loss values?

SLIDE 25

Detecting stationarity with statistics. Pflug’s statistic:

S_n^{(γ)} = (1/n) Σ_{k=0}^{n−1} ⟨f′_{k+1}, f′_{k+2}⟩

SLIDE 26

Pflug’s statistic: S_n^{(γ)} = (1/n) Σ_{k=0}^{n−1} ⟨f′_{k+1}, f′_{k+2}⟩

Pflug’s idea:

  • During the transient phase: E⟨f′_{n+1}, f′_{n+2}⟩ > 0
  • During the stationary phase: E⟨f′_{n+1}, f′_{n+2}⟩ < 0

[Plots: SGD path and excess loss on the synthetic logistic dataset (d = 2), transient vs. stationary phases, as in Figure 2.]
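A sketch of computing this statistic along the SGD trajectory (same illustrative grad-oracle convention as before):

```python
import numpy as np

def pflug_statistic_along_sgd(grad, theta0, gamma, n_iters, rng):
    """Runs constant-step SGD and returns Pflug's statistic: the running
    average of inner products of consecutive stochastic gradients."""
    theta = theta0.copy()
    g_prev = grad(theta, rng)
    theta = theta - gamma * g_prev
    inner_sum, count = 0.0, 0
    for _ in range(n_iters - 1):
        g = grad(theta, rng)
        inner_sum += g_prev @ g          # <f'_{k+1}, f'_{k+2}>
        count += 1
        theta -= gamma * g
        g_prev = g
    return theta, inner_sum / count      # S_n: positive transient, negative stationary
```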

SLIDE 27

Pflug’s algorithm (1983)

Algorithm 1: Piecewise-constant SGD using Pflug’s statistic
INPUT: θ0, γ0 > 0, nb > 0, r ∈ [0,1], N > 0
OUTPUT: θN

S ← 0
last_restart ← 0
γ ← γ0
θ1 ← θ0 − γ f′1(θ0)
for n = 2 to N do
    θn ← θn−1 − γ f′n(θn−1)
    S ← S + ⟨f′n(θn−1), f′n−1(θn−2)⟩
    if n > last_restart + nb and S < 0 then
        last_restart ← n
        S ← 0
        γ ← r × γ
    end if
end for
return θN
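A runnable Python transcription of Algorithm 1, under the same illustrative conventions (the stochastic-gradient oracle grad is an assumption of the sketch):

```python
import numpy as np

def pflug_sgd(grad, theta0, gamma0, r, n_burn, n_iters, rng):
    """Algorithm 1 sketch: decrease gamma by factor r whenever the accumulated
    inner product of consecutive stochastic gradients turns negative, after a
    burn-in of n_burn iterations since the last restart."""
    theta, gamma = theta0.copy(), gamma0
    g_prev = grad(theta, rng)
    theta = theta - gamma * g_prev
    S, last_restart = 0.0, 0
    for n in range(2, n_iters + 1):
        g = grad(theta, rng)
        S += g_prev @ g                      # running (unnormalized) statistic
        theta -= gamma * g
        g_prev = g
        if n > last_restart + n_burn and S < 0:
            last_restart, S = n, 0.0         # reset the statistic
            gamma *= r                       # decrease the step size
    return theta
```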

SLIDE 28

Our results

Two main results:

  • 1. Proving that it makes sense.
  • 2. Proving that it fails.

Why?

SLIDE 31

Formalization

Proposition (Pflug 1990), (Chee & Toulis 2018), (PDF 2020). In the quadratic semi-stochastic setting where f(θ) = (1/2)θᵀHθ with i.i.d. noise ξi (E[ξξᵀ] = C):

E_{πγ}⟨f′1, f′2⟩ = E_{πγ}⟨f′1(θ), f′2(θ − γf′1(θ))⟩ = −γ Tr[HC(2I − γH)⁻¹] < 0.

  • 1. Proves that asymptotically, under the stationary distribution, the inner product is negative on average.
  • 2. The proof in Chee & Toulis (AISTATS 18) is incomplete.
  • 3. We also extend the result to a non-asymptotic version of the expectation under the restart strategy: if θ_restart ∼ πγ and we restart with a new constant step size γ_new = r × γ, then:

E_{θ0∼πγ}[S_n^{(rγ)}] = (1/(4n)) (1/r − 1) Tr[(I − (I − rγH)^{2n}) C] − (1/2) rγ Tr[HC] + o_n(γ)

SLIDE 32

General loss function

We extend the proof to general functions, exhibiting the same balance between the positive and negative parts.

Theorem (general smooth + strongly convex setting) (PDF 2020). For f verifying adequate assumptions:

E_{πγ}⟨f′1, f′2⟩ = −(1/2) γ Tr[f″(θ∗) C(θ∗)] + O(γ^{3/2}),

where C(θ∗) = E[ξ(θ∗)ξ(θ∗)ᵀ].

Conclusion: “it makes sense”: the mean of Pflug’s statistic is negative once we have reached the stationary distribution.

So why does it fail?

SLIDE 33

Implementation of Pflug’s algorithm

[Plots: ‖θn − θ∗‖² vs. iteration for SGD with Pflug’s statistic vs. the averaged iterate with step 1/(2R²), with the 34 restarts marked; rescaled Pflug statistic nSn since the last restart.]

Figure 6: Pflug SGD: way too many restarts.

SLIDE 34

Implementation of Pflug’s algorithm

[Plots: the same quantities for (r = 1/4, nb = 10²) and (r = 1/10, nb = 10⁴); restarts remain far too frequent in both settings.]

Figure 7: Pflug SGD: way too many restarts.

SLIDE 35

Taking a closer look

  • E_{πγ}⟨f′1, f′2⟩ ∝ γ.
  • Var⟨f′1, f′2⟩ = C, independent of γ.

To detect Sn < 0 we typically need:

E[S_n^{(γ)}] + √Var(S_n^{(γ)}) < 0 ⇔ n ≳ 1/γ² ≫ n_opt = O(1/γ)

[Plot: ⟨∇f_{n+1}, ∇f_{n+2}⟩ along the iterations on a semi-stochastic least-squares dataset (d = 10), for step sizes 1/(2R²) and 1/(512R²).]

Figure 8: High variance of ⟨f′k, f′k+1⟩. Figure 9: High variance of Sn.

SLIDE 36

Formalization

Theorem (quadratic semi-stochastic framework). Under symmetry assumptions on the noise, the following holds: for all A > 0 and 0 ≤ α < 2, let nγ = ⌊A/γ^α⌋. Then:

P_{θ0∼π_{γ/r}}( S_{nγ}^{(γ)} ≤ 0 ) → 1/2 as γ → 0

  • Therefore no fixed burn-in nb can solve the variance issue.
  • We would have to use a burn-in scaling at least as nγ = 1/γ², which is useless since n_opt ∝ 1/γ.

Conclusion: it fails... :( (badly... Even mini-batching is not enough... It works with purely multiplicative noise, but is then useless...)

SLIDE 37

Another heuristic: use Ωn² = ‖θn − θ0‖².

SLIDE 38

Intuition

[Plots: SGD path and excess loss on the synthetic logistic dataset (d = 2), transient vs. stationary phases, as in Figure 2.]

Writing ηn = θn − θ∗:

Ωn² = ‖ηn‖² + ‖η0‖² − 2⟨ηn, η0⟩

E[Ωn²] = E‖ηn‖² + E‖η0‖² − 2 η0ᵀ(I − γH)ⁿ η0.

The cross term η0ᵀ(I − γH)ⁿη0 vanishes exponentially fast, so Ωn² keeps increasing during the transient phase and plateaus once the iterates reach stationarity.

SLIDE 39

First few plots

[Plot: ‖θn − θ0‖² (plain) and f(θn) − f(θ∗) (dotted) vs. iteration, synthetic logistic regression, step sizes 1/(8R²) and 1/(512R²).]

Figure 10: ‖θn − θ0‖² in plain, ‖H^{1/2}(θn − θ∗)‖² in dotted.

SLIDE 40

Algorithm

Algorithm 2: Piecewise-constant SGD with the new diagnostic
INPUT: θ0, γ0 > 0, r ∈ [0,1], N > 0, q > 1, threshold ∈ [0,1]
OUTPUT: θN

θ_restart ← θ0
γ ← γ0
for n = 2 to N do
    θn ← θn−1 − γ f′n(θn−1)
    Compute Ωn² = ‖θn − θ_restart‖²
    if Ωn² “has stopped increasing” then
        γ ← r × γ
        θ_restart ← θn
    end if
end for
return θN
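One plausible Python instantiation of the saturation test “Ωn² has stopped increasing”; comparing the statistic at geometrically spaced checkpoints (factor q) against a relative-growth threshold is an assumption of this sketch, and the exact criterion used in the paper may differ.

```python
import numpy as np

def distance_diagnostic_sgd(grad, theta0, gamma0, r, q, threshold, n_iters, rng):
    """Algorithm 2 sketch: restart (gamma <- r * gamma, theta_restart <- theta)
    when Omega_n^2 = ||theta_n - theta_restart||^2 stops increasing, tested at
    checkpoints spaced by factor q since the last restart."""
    theta, gamma = theta0.copy(), gamma0
    theta_restart = theta0.copy()
    last_restart, omega_prev, next_check = 0, 0.0, 8  # 8: hypothetical first checkpoint
    for n in range(1, n_iters + 1):
        theta -= gamma * grad(theta, rng)
        if n - last_restart == next_check:
            omega = float(np.sum((theta - theta_restart) ** 2))
            if omega_prev > 0.0 and omega < (1.0 + threshold) * omega_prev:
                gamma *= r                            # saturation: smaller step
                theta_restart = theta.copy()
                last_restart, omega_prev, next_check = n, 0.0, 8
            else:
                omega_prev = omega
                next_check = int(np.ceil(q * next_check))  # next check at ~q * n
    return theta
```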

SLIDE 41

Experiments: Least squares (smooth, strongly convex, synthetic dataset)

[Plots: f(θn) − f(θ∗) and the Ω statistic ‖θn − θ_restart‖² vs. iteration for (r = 1/2, q = 1.5), (r = 1/4, q = 2) and (r = 1/8, q = 2), compared with averaged SGD at 1/(2R²); restarts marked.]

SLIDE 42

Experiments: Logistic regression (smooth, weakly convex, synthetic dataset)

[Plots: f(θn) − f(θ∗) and ‖θn − θ_restart‖² vs. iteration for (r = 1/2, q = 1.5), (r = 1/4, q = 2) and (r = 1/16, q = 2), compared with online Newton; restarts marked.]

SLIDE 43

Experiments: Logistic regression COVERTYPE dataset

[Plots: f(θn) − f(θ∗) and the distance-based statistic ‖θn − θ_restart‖² vs. iteration for (r = 1/2, q = 1.5, thresh = 0.4) and (r = 1/4, q = 2, thresh = 0.9), compared with online Newton and averaged SGD with 1/(R²√n) and C/(R²√n) steps; restarts marked.]

SLIDE 44

Experiments: SVM (non-smooth, strongly-convex, synthetic dataset)

[Plots: f(θn) − f(θ∗) and ‖θn − θ_restart‖² vs. iteration for (r = 1/4, q = 2) and (r = 1/16, q = 2), compared with 1/(µn) steps; restarts marked.]

SLIDE 45

Experiments: LASSO (non-smooth, weakly convex, synthetic dataset)

[Plots: f(θn) − f(θ∗) and ‖θn − θ_restart‖² vs. iteration for (r = 1/2, q = 2) and (r = 1/4, q = 2), compared with 1/√n step sizes; restarts marked.]

SLIDE 46

Experiments: Uniformly convex ρ = 2.5

SLIDE 47

Back to the beginning: training a ResNet18 on Cifar10

[Plots: test loss and (1 − test accuracy) on Cifar10: omega_stat_r=10 vs. the state_of_art schedule.]

Figure 11: Single statistic for the whole network.

SLIDE 48

Back to the beginning: training a ResNet18 on Cifar10

[Plots: test loss and (1 − test accuracy) on Cifar10: mult_omegas_r=10 vs. the state_of_art schedule.]

Figure 12: Statistic for each layer (multiple learning rates).

SLIDE 49

Conclusions

  • 1. Constant step size strategies for SGD restarting “at saturation” result in good convergence rates (in both smooth + strongly convex and uniformly convex settings).
  • 2. Pflug’s strategy for detecting convergence seems sound but cannot work a priori.
  • 3. We propose a new statistic, based on heuristic arguments, that works well in practice.

SLIDE 50

Directions

Open directions:

  • 1. Theoretical analysis for the “new restart” strategy.
  • 2. Restart for the averaged iterate?
  • 3. Better understanding in deep learning.

SLIDE 51

Shameless advertisement

Positions at Polytechnique:

  • 2 tenure track assistant professors (Stat & Stat + Energy)
  • Postdoc & PhD

Optimization, Learning, Federated Learning, High dimensional statistics.

Figure 13: The place to be

SLIDE 52

Thank you for listening!

SLIDE 53

On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent

Aymeric Dieuleveut CMAP, École Polytechnique, Institut Polytechnique de Paris Joint work with Scott Pesme and Nicolas Flammarion (EPFL)

10/03/2020 Cirm Luminy - Optimization for Machine Learning