

SLIDE 1

AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes

Xiaoxia (Shirley) Wu⋆

PhD Candidate, The University of Texas at Austin

June 11th, 2019

⋆ Joint work with Rachel Ward and Léon Bottou, at Facebook AI Research.

SLIDE 2

Outline

Motivations

Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.

Practical Implications

SLIDE 3

Outline

Motivations

Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.

Practical Implications

SLIDE 4

Motivation

Problem Setup

Given a differentiable non-convex function F : R^d → R,

◮ ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ R^d

SLIDE 5

Motivation

Problem Setup

Given a differentiable non-convex function F : R^d → R,

◮ ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ R^d

Our desired goal ⇒ min_{x ∈ R^d} F(x)

What we can achieve ⇒ ‖∇F(x)‖^2 ≤ ε

SLIDE 6

Motivation

Problem Setup

Given a differentiable non-convex function F : R^d → R,

◮ ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ R^d

Our desired goal ⇒ min_{x ∈ R^d} F(x)

What we can achieve ⇒ ‖∇F(x)‖^2 ≤ ε

Algorithm

Stochastic Gradient Descent (SGD) at the j-th iteration:

x_{j+1} ← x_j − η_j G(x_j),  (1)

where E[G(x_j)] = ∇F(x_j) and η_j > 0 is the stepsize.
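To make update (1) concrete, here is a minimal runnable sketch in Python; the quadratic toy objective, the noise model, and all names are illustrative assumptions, not the talk's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_F(x):
    # Toy objective F(x) = 0.5 * ||x||^2, whose true gradient is x.
    return x

def stochastic_grad(x, sigma=0.1):
    # Unbiased estimator G(x): E[G(x)] = grad_F(x), noise variance ~ sigma^2.
    return grad_F(x) + sigma * rng.standard_normal(x.shape)

def sgd(x0, stepsize, num_iters):
    # Update (1): x_{j+1} <- x_j - eta_j * G(x_j), with eta_j = stepsize(j).
    x = x0.copy()
    for j in range(num_iters):
        x -= stepsize(j) * stochastic_grad(x)
    return x

x_final = sgd(np.ones(10), stepsize=lambda j: 0.1, num_iters=1000)
print(np.linalg.norm(x_final))  # should be driven toward 0
```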

SLIDE 7

Motivation

Algorithm: SGD

Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j)

Q: How to set the sequence {η_j}_{j≥0}?

1E[‖G(x) − ∇F(x)‖^2] ≤ σ^2

SLIDE 8

Motivation

Algorithm: SGD

Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j)

Q: How to set the sequence {η_j}_{j≥0}?

Difficulty in Choosing Stepsizes

The classical Robbins/Monro theory (Robbins and Monro, 1951): if

Σ_{j=1}^∞ η_j = ∞  and  Σ_{j=1}^∞ η_j^2 < ∞,  (2)

(for example, η_j = η/j satisfies both conditions; a quick numerical check follows below) and the variance of the gradient is bounded1, then lim_{j→∞} E[‖∇F(x_j)‖^2] = 0.

1E[‖G(x) − ∇F(x)‖^2] ≤ σ^2
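Condition (2) is easy to sanity-check numerically for the classical choice η_j = 1/j (a sketch; the horizon 10^6 is arbitrary):

```python
import numpy as np

j = np.arange(1, 10**6 + 1)
eta = 1.0 / j                # eta_j = 1/j
print(eta.sum())             # partial sums grow like log(j): divergent
print((eta ** 2).sum())      # converges (to pi^2/6 ~ 1.6449)
```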

SLIDE 9

Motivation

Algorithm: SGD

Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j)

Q: How to set the sequence {η_j}_{j≥0}?

Difficulty in Choosing Stepsizes

The classical Robbins/Monro theory (Robbins and Monro, 1951): if

Σ_{j=1}^∞ η_j = ∞  and  Σ_{j=1}^∞ η_j^2 < ∞,  (3)

and the variance of the gradient is bounded, then lim_{j→∞} E[‖∇F(x_j)‖^2] = 0. However, the rule is too general for practical applications.

SLIDE 10

Motivation

Algorithm: SGD

Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j)

Possible Choice: Manual Tuning

η_j = η      if j ≤ T_1
    = α_1 η  if T_1 < j ≤ T_2
    = α_2 η  if T_2 < j ≤ T_3
      ⋯

2‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ R^d

SLIDE 11

Motivation

Algorithm: SGD

Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j)

Possible Choice: Manual Tuning

η_j = η      if j ≤ T_1
    = α_1 η  if T_1 < j ≤ T_2
    = α_2 η  if T_2 < j ≤ T_3
      ⋯

However, tuning η, α_1, α_2, T_1, T_2, … is computationally costly. In particular, it requires η ≤ 2/L.2 (See the sketch below.)

2‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ R^d
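A hedged sketch of such a piecewise-constant schedule; the milestones T_1, T_2 and decay factors α_1, α_2 below are illustrative placeholders, not tuned values.

```python
def manual_stepsize(j, eta=0.1, milestones=(1000, 5000), factors=(0.1, 0.01)):
    """Piecewise-constant schedule: eta for j <= T_1, alpha_1 * eta for
    T_1 < j <= T_2, alpha_2 * eta afterwards. All constants are assumptions."""
    scale = 1.0
    for T, alpha in zip(milestones, factors):
        if j > T:
            scale = alpha
    return scale * eta

# usage: pass stepsize=manual_stepsize to the SGD loop sketched earlier
```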

SLIDE 12

Motivation

Algorithm: SGD with Adaptive Stepsize

Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, …, d:

[x_{j+1}]_ℓ ← [x_j]_ℓ − (η/[b_{j+1}]_ℓ) [G(x_j)]_ℓ

Possible Choice: Adaptive Gradient Methods

Among many variants, one is AdaGrad:

([b_{j+1}]_ℓ)^2 = ([b_j]_ℓ)^2 + ([G(x_j)]_ℓ)^2
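In code, the coordinate-wise update reads roughly as follows (a minimal sketch; initialization and epsilon conventions vary across real implementations):

```python
import numpy as np

def adagrad_step(x, grad, b_sq, eta=0.1):
    # ([b_{j+1}]_l)^2 = ([b_j]_l)^2 + ([G(x_j)]_l)^2, per coordinate l
    b_sq += grad ** 2
    # [x_{j+1}]_l = [x_j]_l - (eta / [b_{j+1}]_l) * [G(x_j)]_l
    x -= eta * grad / np.sqrt(b_sq)
    return x, b_sq

# usage (hypothetical): b_sq = np.full_like(x, b0 ** 2) once, then call
# x, b_sq = adagrad_step(x, g, b_sq) at every iteration.
```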

SLIDE 13

Motivation

Algorithm: SGD with Adaptive Stepsize

Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, …, d:

[x_{j+1}]_ℓ ← [x_j]_ℓ − (η/[b_{j+1}]_ℓ) [G(x_j)]_ℓ

Possible Choice: Adaptive Gradient Methods

Among many variants, one is AdaGrad:

([b_{j+1}]_ℓ)^2 = ([b_j]_ℓ)^2 + ([G(x_j)]_ℓ)^2

◮ It helps with “increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones.” (Duchi et al., 2011)

SLIDE 14

Motivation

Algorithm: SGD with Adaptive Stepsize

Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, …, d:

[x_{j+1}]_ℓ ← [x_j]_ℓ − (η/[b_{j+1}]_ℓ) [G(x_j)]_ℓ

Possible Choice: Adaptive Gradient Methods

Among many variants, one is AdaGrad:

([b_{j+1}]_ℓ)^2 = ([b_j]_ℓ)^2 + ([G(x_j)]_ℓ)^2

◮ It helps with “increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones.” (Duchi et al., 2011)

◮ However, “co-ordinate” AdaGrad changes the optimization problem by introducing a “bias” in the solutions, leading to worse generalization (Wilson et al., 2017)

SLIDE 15

Motivation

Algorithm: SGD with Adaptive Stepsize

Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, …, d:

[x_{j+1}]_ℓ ← [x_j]_ℓ − (η/b_{j+1}) [G(x_j)]_ℓ

Possible Variant: Norm Version of AdaGrad

(AdaGrad-Norm)  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2
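A minimal sketch of the AdaGrad-Norm recursion above; note the single scalar b_j shared by all coordinates, in contrast to the per-coordinate accumulators of “co-ordinate” AdaGrad (the loop structure and defaults are illustrative):

```python
import numpy as np

def adagrad_norm(x0, stochastic_grad, eta=1.0, b0=0.1, num_iters=1000):
    # x_{j+1} = x_j - (eta / b_{j+1}) * G(x_j),
    # with b_{j+1}^2 = b_j^2 + ||G(x_j)||^2 (one scalar for all coordinates).
    x = x0.copy()
    b_sq = b0 ** 2
    for _ in range(num_iters):
        g = stochastic_grad(x)
        b_sq += np.dot(g, g)          # accumulate the squared gradient norm
        x -= eta / np.sqrt(b_sq) * g
    return x
```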

SLIDE 16

Motivation

Algorithm: SGD with Adaptive Stepsize

Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, …, d:

[x_{j+1}]_ℓ ← [x_j]_ℓ − (η/b_{j+1}) [G(x_j)]_ℓ

Possible Variant: Norm Version of AdaGrad

(AdaGrad-Norm)  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

◮ Auto-tuning property (Wu, Ward, and Bottou, 2018): robustness to the choices of hyper-parameters (b_0 and η); connection to Weight/Layer/Batch Normalization;

◮ Does not affect generalization.

SLIDE 17

Outline

Motivations

Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.

Practical Implications

SLIDE 18

Theory

Algorithm: SGD with Adaptive Stepsize

x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

What is the convergence rate of AdaGrad-Norm?

◮ Intuition: if E[‖G(x_j)‖^2] ≤ γ^2, then by Jensen's inequality the effective stepsize η/b_j satisfies

E[η/b_j] ≥ η/√(jγ^2 + b_0^2)

(a quick numerical illustration follows below)
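The decay of the effective stepsize is easy to illustrate numerically; a small sketch under the bounded-second-moment assumption (the values of η, b_0, γ and the noise model are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, b0, gamma, T = 1.0, 0.1, 1.0, 10_000

b_sq = b0 ** 2
for j in range(T):
    g_norm_sq = gamma ** 2 * rng.uniform()  # any draw with E[||G||^2] <= gamma^2
    b_sq += g_norm_sq

# realized effective stepsize vs. the eta / sqrt(T*gamma^2 + b0^2) envelope
print(eta / np.sqrt(b_sq), eta / np.sqrt(T * gamma ** 2 + b0 ** 2))
```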
SLIDE 19

Theory

Algorithm: SGD with Adaptive Stepsize

x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

What is the convergence rate of AdaGrad-Norm?

◮ Intuition: if E[‖G(x_j)‖^2] ≤ γ^2, then by Jensen's inequality the effective stepsize η/b_j satisfies E[η/b_j] ≥ η/√(jγ^2 + b_0^2)

◮ Convex landscapes: O(1/√T) (Levy, 2018)
SLIDE 20

Theory

Algorithm: SGD with Adaptive Stepsize

x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

What is the convergence rate of AdaGrad-Norm?

◮ Intuition: if E[‖G(x_j)‖^2] ≤ γ^2, then by Jensen's inequality the effective stepsize η/b_j satisfies E[η/b_j] ≥ η/√(jγ^2 + b_0^2)

◮ Convex landscapes: O(1/√T) (Levy, 2018)

◮ Nonconvex landscapes: O(log(T)/√T) (Ours, Theorem 2.1)
SLIDE 21

Theory

Algorithm: SGD with Adaptive Stepsize

(1) At the j-th iteration, generate ξ_j and set G(x_j) = G(x_j, ξ_j)
(2) x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

Theorem

Under the assumptions:

1. The random vectors ξ_j, j = 0, 1, 2, …, are mutually independent and also independent of x_j;
2. Bounded variance3: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2;
3. Bounded gradient norm: ‖∇F(x_j)‖ ≤ γ uniformly;

3E_{ξ_j} denotes the expectation with respect to ξ_j conditional on x_j.

SLIDE 22

Theory

Algorithm: SGD with Adaptive Stepsize

(1) At the j-th iteration, generate ξ_j and set G(x_j) = G(x_j, ξ_j)
(2) x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

Theorem

Under the assumptions:

1. The random vectors ξ_j, j = 0, 1, 2, …, are mutually independent and also independent of x_j;
2. Bounded variance3: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2;
3. Bounded gradient norm: ‖∇F(x_j)‖ ≤ γ uniformly;

AdaGrad-Norm converges to a stationary point w.h.p. at the rate

min_{ℓ=0,1,…,T−1} ‖∇F(x_ℓ)‖^2 ≤ C^2/T + σC/√T,

where C = O(log(T/b_0 + 1)) and O hides η, L and F(x_0) − F*.

3E_{ξ_j} denotes the expectation with respect to ξ_j conditional on x_j.

SLIDE 23

Theory

Algorithm: SGD with Adaptive Stepsize

(1) At the j-th iteration, generate ξ_j and set G(x_j) = G(x_j, ξ_j)
(2) x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

Challenges in the proof:

b_{j+1} is a random variable correlated with ∇F(x_j) and G(x_j)

◮ The L-Lipschitz continuous gradient gives4

(F_{j+1} − F_j)/η ≤ −‖∇F_j‖^2/b_{j+1} + ⟨∇F_j, ∇F_j − G_j⟩/b_{j+1} + ηL‖G_j‖^2/(2b_{j+1}^2),

where the middle term is the KeyTerm (a derivation sketch follows below).

4We write F(x_j) = F_j, ∇F(x_j) = ∇F_j and G(x_j) = G_j.
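For completeness, the displayed inequality follows by plugging the update into the standard smoothness bound; a short derivation sketch in LaTeX (my expansion of the slide's one-liner, using the F_j, ∇F_j, G_j shorthand):

```latex
% Plug x_{j+1} = x_j - (\eta / b_{j+1}) G_j into L-smoothness of F.
\begin{align*}
F_{j+1} &\le F_j + \langle \nabla F_j,\, x_{j+1} - x_j \rangle
            + \tfrac{L}{2}\,\|x_{j+1} - x_j\|^2 \\
        &=  F_j - \tfrac{\eta}{b_{j+1}}\langle \nabla F_j,\, G_j \rangle
            + \tfrac{\eta^2 L}{2 b_{j+1}^2}\,\|G_j\|^2 \\
        &=  F_j - \tfrac{\eta}{b_{j+1}}\,\|\nabla F_j\|^2
            + \tfrac{\eta}{b_{j+1}}\langle \nabla F_j,\, \nabla F_j - G_j \rangle
            + \tfrac{\eta^2 L}{2 b_{j+1}^2}\,\|G_j\|^2 .
\end{align*}
% Dividing through by \eta gives the inequality on the slide.
```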

SLIDE 24

Theory

Algorithm: SGD with Adaptive Stepsize

(1) At the j-th iteration, generate ξ_j and set G(x_j) = G(x_j, ξ_j)
(2) x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

Challenges in the proof:

b_{j+1} is a random variable correlated with ∇F(x_j) and G(x_j)

◮ The L-Lipschitz continuous gradient gives4

(F_{j+1} − F_j)/η ≤ −‖∇F_j‖^2/b_{j+1} + ⟨∇F_j, ∇F_j − G_j⟩/b_{j+1} + ηL‖G_j‖^2/(2b_{j+1}^2),

where the middle term is the KeyTerm.

◮ Unlike standard SGD with a constant stepsize, E_{ξ_j}[⟨∇F_j, ∇F_j − G_j⟩/b_{j+1}] ≠ 0;

4We write F(x_j) = F_j, ∇F(x_j) = ∇F_j and G(x_j) = G_j.

SLIDE 25

Theory

Algorithm: SGD with Adaptive Stepsize

(1) At the j-th iteration, generate ξ_j and set G(x_j) = G(x_j, ξ_j)
(2) x_{j+1} ← x_j − (η/b_{j+1}) G(x_j)  with  b_{j+1}^2 = b_j^2 + ‖G(x_j)‖^2

Challenges in the proof:

b_{j+1} is a random variable correlated with ∇F(x_j) and G(x_j)

◮ The L-Lipschitz continuous gradient gives4

(F_{j+1} − F_j)/η ≤ −‖∇F_j‖^2/b_{j+1} + ⟨∇F_j, ∇F_j − G_j⟩/b_{j+1} + ηL‖G_j‖^2/(2b_{j+1}^2),

where the middle term is the KeyTerm.

◮ Unlike standard SGD with a constant stepsize, E_{ξ_j}[⟨∇F_j, ∇F_j − G_j⟩/b_{j+1}] ≠ 0;

◮ New techniques are needed to bound the KeyTerm: a careful use of the tower rule, Cauchy–Schwarz, Hölder's inequality, etc.

4We write F(x_j) = F_j, ∇F(x_j) = ∇F_j and G(x_j) = G_j.

SLIDE 26

Outline

Motivations

Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm to emphasize its robustness to hyper-parameter tuning over nonconvex landscapes.

Practical Implications

SLIDE 27

Practice

AdaGrad-Norm

We show that AdaGrad-Norm converges5 at the rate

min_{ℓ=0,1,…,T−1} ‖∇F(x_ℓ)‖^2 ≤ O(C_1/T + σ C_2/√T),

where the constants C_1 and C_2 are explicit and robust to the hyper-parameters b_0 and η. Recall: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2

5Note we combine Theorem 2.1 and Theorem 2.2.
6For the case b_1 ≥ ηL ≈ ∆L.

SLIDE 28

Practice

AdaGrad-Norm

We show that AdaGrad-Norm converges5 at the rate

min_{ℓ=0,1,…,T−1} ‖∇F(x_ℓ)‖^2 ≤ O(C_1/T + σ C_2/√T),

where the constants C_1 and C_2 are explicit and robust to the hyper-parameters b_0 and η. Recall: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2

◮ For σ ≈ 0: suppose we know F* and set η = F(x_0) − F*; the constant C_1 almost matches GD with the best stepsize.6

5Note we combine Theorem 2.1 and Theorem 2.2.
6For the case b_1 ≥ ηL ≈ ∆L.

SLIDE 29

Practice

AdaGrad-Norm

We show that AdaGrad-Norm converges5 at the rate

min_{ℓ=0,1,…,T−1} ‖∇F(x_ℓ)‖^2 ≤ O(C_1/T + σ C_2/√T),

where the constants C_1 and C_2 are explicit and robust to the hyper-parameters b_0 and η. Recall: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖^2] ≤ σ^2

◮ For σ ≈ 0: suppose we know F* and set η = F(x_0) − F*; the constant C_1 almost matches GD with the best stepsize.6

◮ For σ > 0: set η = 1; the constant C_2 almost matches SGD with a well-tuned stepsize, up to a factor of L log(T/b_0 + 1)

5Note we combine Theorem 2.1 and Theorem 2.2.
6For the case b_1 ≥ ηL ≈ ∆L.

SLIDE 30

Practice: Synthetic Data with Linear Regression

[Figure: six panels plotted against b_0 (log scale, 10^−1 to 10^5). Top row: GradNorm at iterations 10, 2000, and 5000; bottom row: Effective LR at the same iterations. Legend: AdaGrad_Norm, SGD_Constant, SGD_DecaySqrt.]

Figure 1: Randomly initialized x_0 with η = F(x_0) − F* = 650 − 0. Stepsizes: (AdaGrad-Norm) 650/b_j; (SGD-Constant) 650/b_0; (SGD-DecaySqrt) 650/(b_0 √j).

SLIDE 31

Practice: ResNet-18 on CIFAR10

[Figure: six panels plotted against b^2 (log scale). Top row: Train Accuracy for “ResNet at 10”, “ResNet at 60”, “ResNet at 120”; bottom row: Test Accuracy, for the three methods in the caption.]

Figure 2: Randomly initialized x_0 with η = 1. Stepsizes: (AdaGrad-Norm) 1/b_j; (SGD-Constant) 1/b_0; (SGD-DecaySqrt) 1/(b_0 √j).

AdaGrad-Norm code: https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py
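For reference, a self-contained PyTorch-style sketch of the same update; this class is my illustration, not the linked implementation, so consult the URL above for the authors' version.

```python
import torch
from torch.optim import Optimizer

class AdaGradNorm(Optimizer):
    """Sketch of AdaGrad-Norm: one scalar b accumulates the squared norm
    of the full stochastic gradient across all parameters."""

    def __init__(self, params, lr=1.0, b0=0.1):
        super().__init__(params, dict(lr=lr))
        self.b_sq = b0 ** 2  # b_j^2

    @torch.no_grad()
    def step(self):
        # b_{j+1}^2 = b_j^2 + ||G(x_j)||^2 over the concatenated gradient
        grad_norm_sq = 0.0
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    grad_norm_sq += p.grad.pow(2).sum().item()
        self.b_sq += grad_norm_sq
        denom = self.b_sq ** 0.5
        # x_{j+1} = x_j - (lr / b_{j+1}) * G(x_j)
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"] / denom)

# usage: opt = AdaGradNorm(model.parameters(), lr=1.0, b0=0.1);
# call opt.step() after loss.backward(), as with any optimizer.
```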

SLIDE 32

Practice: ResNet-50 on ImageNet

[Figure: six panels plotted against b^2 (log scale). Top row: Train Accuracy for “ResNet at 30”, “ResNet at 50”, “ResNet at 90”; bottom row: Test Accuracy, for the three methods in the caption.]

Figure 3: Randomly initialized x_0 with η = 1. Stepsizes: (AdaGrad-Norm) 1/b_j; (SGD-Constant) 1/b_0; (SGD-DecaySqrt) 1/(b_0 √j).

SLIDE 33

Conclusion

◮ We provide a novel convergence result for AdaGrad-Norm in non-convex optimization. The analysis is also useful for other adaptive-type methods.

SLIDE 34

Conclusion

◮ We provide a novel convergence result for AdaGrad-Norm in non-convex optimization. The analysis is also useful for other adaptive-type methods.

◮ The convergence bound for AdaGrad-Norm is explicit and comparable with a well-tuned stepsize choice in SGD, but without careful tuning of AdaGrad-Norm's hyper-parameters.

SLIDE 35

Conclusion

◮ We provide a novel convergence result for AdaGrad-Norm in non-convex optimization. The analysis is also useful for other adaptive-type methods.

◮ The convergence bound for AdaGrad-Norm is explicit and comparable with a well-tuned stepsize choice in SGD, but without careful tuning of AdaGrad-Norm's hyper-parameters.

◮ Numerical experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.

SLIDE 36

See you

at the poster session: Pacific Ballroom #56 (today, 6:30–9:00 PM).

SLIDE 37

Practice: ResNet-50 on ImageNet

[Figure: six panels plotted against b^2 (log scale). Top row: Train Accuracy for “ResNet at 30”, “ResNet at 50”, “ResNet at 90”; bottom row: Test Accuracy. Legend: AdaGrad_Norm, SGD_Constant, SGD_DecaySqrt, AdaGrad_Coordinate.]

Figure 4: Randomly initialized x_0 with η = 1. Stepsizes: (AdaGrad-Norm) 1/b_j; (SGD-Constant) 1/b_0; (SGD-DecaySqrt) 1/(b_0 √j).

SLIDE 38

Theory

Difficulty: proofs for SGD do not straightforwardly extend because b_{k+1} is a random variable correlated with ∇F(x_k), i.e.,

E_{ξ_j}[⟨∇F_j, ∇F_j − G_j⟩/b_{j+1}] ≠ E_{ξ_j}[⟨∇F_j, ∇F_j − G_j⟩]/b_{j+1} = (1/b_{j+1}) · 0;

(Cauchy–Schwarz)

E_{ξ_j}[(1/√(b_j^2 + C^2) − 1/b_{j+1}) ⟨∇F_j, G_j⟩] ≤ E_{ξ_j}[|1/√(b_j^2 + C^2) − 1/b_{j+1}| ‖∇F_j‖ ‖G_j‖]

(Hölder's inequality)

E[‖∇F_k‖^2/b_{k+1}] ≥ (E[‖∇F_k‖^{4/3}])^{3/2} / (E[b_{k+1}^2])^{1/2}
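One way to obtain an inequality of this shape (a LaTeX sketch of the Hölder step, with exponents as in my reconstruction above, so treat it as an assumption rather than the paper's exact lemma):

```latex
% Hölder with exponents p = 3/2, q = 3, applied to X = \|\nabla F_k\|^2, b = b_{k+1}:
%   E[X^{2/3}] = E[(X/b)^{2/3} \, b^{2/3}] \le (E[X/b])^{2/3} (E[b^2])^{1/3}.
% Raising both sides to the power 3/2 and rearranging yields
\mathbb{E}\!\left[\frac{\|\nabla F_k\|^2}{b_{k+1}}\right]
  \;\ge\;
  \frac{\bigl(\mathbb{E}\,\|\nabla F_k\|^{4/3}\bigr)^{3/2}}
       {\bigl(\mathbb{E}\, b_{k+1}^2\bigr)^{1/2}} .
```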