SLIDE 1

Stochastic Cubic Regularization for Fast Nonconvex Optimization

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier and Michael I. Jordan

Achin Jain, University of Pennsylvania (STAT991, Spring 2019)

SLIDE 2

Outline

  • 1. Motivation
  • 2. Objectives
  • 3. Algorithm
  • 4. Experiments
  • 5. References

SLIDE 3

Outline

  • 1. Motivation
  • 2. Objectives
  • 3. Algorithm
  • 4. Experiments
  • 5. References

SLIDE 4

Motivation

$$\min_{x\in\mathbb{R}^d} f(x) := \mathbb{E}_{\xi\sim\mathcal{D}}\big[f(x;\xi)\big]$$

with f non-convex and f(x; ξ) stochastic.

Variants of stochastic optimization

  • 1. Offline setting: minimize the empirical loss over a fixed amount of data
  • 2. Online setting: minimize the empirical loss when data arrives sequentially

Applications

  • Large-scale statistics and machine learning problems
  • Example: optimization of deep neural networks

SLIDE 5

Outline

  • 1. Motivation
  • 2. Objectives
  • 3. Algorithm
  • 4. Experiments
  • 5. References

SLIDE 6

Survey of (stochastic) gradient descent algorithms for ε-second-order stationary points

SLIDE 7

Cubic-regularized gradient descent

Gradient descent

$$x_{t+1} = \arg\min_x \; f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{L}{2}\|x - x_t\|^2$$

  • State-of-the-art convergence:
  • 1. Hessian-free perturbed GD: $O(\epsilon^{-2})$ [Jin et al., 2017]

Cubic-regularized gradient descent

$$x_{t+1} = \arg\min_x \; f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2}(x - x_t)^\top \nabla^2 f(x_t)(x - x_t) + \frac{\rho}{6}\|x - x_t\|^3$$

  • State-of-the-art convergence:
  • 1. full Hessian: $O(\epsilon^{-1.5})$ [Nesterov and Polyak, 2006]
  • 2. Hessian-vector product evaluations w/o acceleration: $O(\epsilon^{-2})$ [Carmon and Duchi, 2016]
  • 3. Hessian-vector product evaluations w/ acceleration: $O(\epsilon^{-1.75})$ [Carmon et al., 2018]
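
Where the $\epsilon^{-1.5}$ rate comes from (the standard Nesterov–Polyak descent argument, sketched here with constants suppressed; this step is implicit on the slide): while $x_t$ is not an ε-second-order stationary point, each cubic-regularized step makes definite progress,

$$f(x_t) - f(x_{t+1}) \;\gtrsim\; \sqrt{\frac{\epsilon^3}{\rho}} \quad\Longrightarrow\quad \#\text{iterations} \;\lesssim\; \frac{f(x_0) - f^*}{\sqrt{\epsilon^3/\rho}} \;=\; \frac{\sqrt{\rho}\,(f(x_0) - f^*)}{\epsilon^{1.5}}.$$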

SLIDE 8

Cubic-regularized stochastic gradient descent

Stochastic gradient descent

$$x_{t+1} = \arg\min_x \; f(x_t) + g(x_t, \xi_t)^\top (x - x_t) + \frac{L}{2}\|x - x_t\|^2, \qquad \mathbb{E}\,g(x_t, \xi_t) = \nabla f(x_t)$$

  • State-of-the-art convergence:
  • 1. noisy SGD: $O(\epsilon^{-4})$ [Ge et al., 2015]
  • 2. Hessian-vector product evaluations w/ variance reduction: $O(\epsilon^{-3.5})$ [Allen-Zhu, 2018]
  • 3. gradient evaluations w/ variance reduction: $O(\epsilon^{-3.5})$ [Allen-Zhu and Li, 2018]

Stochastic cubic-regularized gradient descent [this paper]

$$x_{t+1} = \arg\min_x \; f(x_t) + g_t^\top (x - x_t) + \frac{1}{2}(x - x_t)^\top B_t (x - x_t) + \frac{\rho}{6}\|x - x_t\|^3$$

  • State-of-the-art convergence:
  • 1. Hessian-vector product evaluations: $O(\epsilon^{\,?})$ [Tripuraneni et al., 2018]

SLIDE 9

Problem statement

$$\min_{x\in\mathbb{R}^d} f(x) := \mathbb{E}_{\xi\sim\mathcal{D}}\big[f(x;\xi)\big]$$

f non-convex and f(x; ξ) stochastic

  • 1. Can we design a fully stochastic variant of the cubic-regularized Newton method?
  • 2. Is such an algorithm faster than stochastic gradient descent?

SLIDE 10

What’s coming up

$$\min_{x\in\mathbb{R}^d} f(x) := \mathbb{E}_{\xi\sim\mathcal{D}}\big[f(x;\xi)\big]$$

f non-convex and f(x; ξ) stochastic. Comparison of different stochastic optimization algorithms to find an ε-second-order stationary point:

| Method                         | Run-time             | Variance Reduction | Type      |
| SGD [Ge et al., 2015]          | $O(\epsilon^{-4})$   | not needed         | 1st order |
| Natasha 2 [Allen-Zhu, 2018]    | $O(\epsilon^{-3.5})$ | needed             | 2nd order |
| Neon2 [Allen-Zhu and Li, 2018] | $O(\epsilon^{-3.5})$ | needed             | 2nd order |
| SCR [this paper]               | $O(\epsilon^{-3.5})$ | not needed         | 2nd order |

SLIDE 11

Outline

  • 1. Motivation
  • 2. Objectives
  • 3. Algorithm
  • 4. Experiments
  • 5. References

SLIDE 12

Assumptions

Assumption 1. The function f(x) has L-Lipschitz gradients and ρ-Lipschitz Hessians: for all x₁ and x₂,

$$\|\nabla f(x_1) - \nabla f(x_2)\| \le L\,\|x_1 - x_2\|, \qquad \|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho\,\|x_1 - x_2\|$$

Assumption 2. The stochastic gradients and stochastic Hessians have bounded variance and are bounded almost surely:

$$\mathbb{E}\,\|\nabla f(x,\xi) - \nabla f(x)\|^2 \le \sigma_1^2, \qquad \|\nabla f(x,\xi) - \nabla f(x)\| \le M_1$$

$$\mathbb{E}\,\|\nabla^2 f(x,\xi) - \nabla^2 f(x)\|^2 \le \sigma_2^2, \qquad \|\nabla^2 f(x,\xi) - \nabla^2 f(x)\| \le M_2$$
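
Assumption 2 is what mini-batching controls in practice: averaging n₁ stochastic gradients shrinks the gradient variance to σ₁²/n₁, and similarly for Hessian-vector products. A minimal numpy sketch, where the per-sample objective f(x; ξ) = ½ xᵀ(ξξᵀ)x is purely illustrative (not from the paper), and the gradient-difference HVP stands in for the exact autodiff computation:

```python
# Hedged sketch: mini-batch estimators for the gradient and the
# Hessian-vector product. The per-sample objective is an assumption
# made only so the example runs.
import numpy as np

def grad_f(x, xi):
    # stochastic gradient of the hypothetical f(x; xi) = 0.5*x^T (xi xi^T) x
    return np.outer(xi, xi) @ x

def batch_gradient(x, samples):
    # g_t: average of n1 stochastic gradients (variance ~ sigma_1^2 / n1)
    return np.mean([grad_f(x, xi) for xi in samples], axis=0)

def batch_hvp(x, v, samples, h=1e-5):
    # B_t[v]: Hessian-vector product estimated by a gradient difference,
    # averaged over n2 samples; autodiff (Pearlmutter's trick) computes
    # the same quantity exactly at roughly the cost of a gradient.
    return np.mean([(grad_f(x + h * v, xi) - grad_f(x, xi)) / h
                    for xi in samples], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
samples = rng.normal(size=(64, 5))        # a fresh mini-batch of xi's
g = batch_gradient(x, samples)
Bv = batch_hvp(x, g, samples)
```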

SLIDE 13

Cubic-regularized gradient descent

In the deterministic setting, we minimize a local upper bound on the function (justified below): the second-order Taylor expansion plus a cubic penalty [Nesterov and Polyak, 2006].

$$m_t(x) = f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2}(x - x_t)^\top \nabla^2 f(x_t)(x - x_t) + \frac{\rho}{6}\|x - x_t\|^3, \qquad x_{t+1} = \arg\min_x m_t(x)$$

In the stochastic setting,

  • 1. we only have access to stochastic gradients and Hessians, not the true gradient and Hessian,
  • 2. our only means of interaction with the Hessian is through Hessian-vector products, and
  • 3. the cubic submodel mt(x) cannot be solved exactly in practice, only up to some tolerance.
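
The upper-bound claim is exactly the Taylor-remainder consequence of the ρ-Lipschitz Hessian in Assumption 1 (a one-line justification not spelled out on the slide):

$$f(x) \;\le\; f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2}(x - x_t)^\top \nabla^2 f(x_t)(x - x_t) + \frac{\rho}{6}\|x - x_t\|^3 \;=\; m_t(x) \quad \text{for all } x.$$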

SLIDE 14

Stochastic Cubic Regularization (Meta-algorithm)

In the deterministic setting, we minimize

$$m_t(x) = f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2}(x - x_t)^\top \nabla^2 f(x_t)(x - x_t) + \frac{\rho}{6}\|x - x_t\|^3, \qquad x_{t+1} = \arg\min_x m_t(x)$$

In the stochastic setting, we minimize (a sketch of the full loop follows this list)

  • 1. $\tilde m(\Delta) = \Delta^\top g_t + \frac{1}{2}\Delta^\top B_t[\Delta] + \frac{\rho}{6}\|\Delta\|^3$ with ∆ := x − xt, where $B_t[\Delta]$ is a Hessian-vector product (we need a cubic solver)
  • 2. $\Delta_{t+1} = \arg\min_\Delta \tilde m(\Delta)$, $x_{t+1} = x_t + \Delta_{t+1}$
  • 3. $\Delta^\star = \arg\min_\Delta \tilde m(\Delta)$ (the cubic solver will not solve this exactly)
  • 4. $\tilde m(\Delta) = m_t(x_t + \Delta) - m_t(x_t)$
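
A compact Python sketch of the outer loop, paraphrasing the paper's Algorithm 1: the 1/100 stopping constant follows the pseudocode as I read it, while `grad_f`, `hvp_f`, `sample`, and the two solver routines (sketched on the next slides) are assumed inputs rather than the paper's verbatim code.

```python
# Hedged sketch of the SCR meta-algorithm (paraphrase of Algorithm 1).
import numpy as np

def scr(x0, grad_f, hvp_f, sample, subsolver, finalsolver,
        n1, n2, rho, eps, max_outer=100):
    x = x0
    for _ in range(max_outer):
        # fresh mini-batches for the gradient and Hessian-vector estimates
        g = np.mean([grad_f(x, xi) for xi in sample(n1)], axis=0)
        batch = sample(n2)
        B = lambda v: np.mean([hvp_f(x, v, xi) for xi in batch], axis=0)

        # approximately minimize the stochastic cubic submodel
        delta, dm = subsolver(g, B, rho, eps)

        # a small submodel decrease signals an eps-second-order stationary
        # point: polish the final step to higher precision and stop
        if dm >= -np.sqrt(eps**3 / rho) / 100:
            return x + finalsolver(g, B, rho, eps)
        x = x + delta
    return x
```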

SLIDE 15

Stochastic cubic regularization (meta-algorithm)

$$\Delta m := \tilde m(\Delta) = \Delta^\top g_t + \frac{1}{2}\Delta^\top B_t[\Delta] + \frac{\rho}{6}\|\Delta\|^3 = m_t(x_t + \Delta) - m_t(x_t), \qquad \Delta = \arg\min_\Delta \tilde m(\Delta)$$

SLIDE 16

Gradient descent as a cubic subsolver

  • lines 1–3: when g is large, the submodel $\tilde m(\Delta)$ may be ill-conditioned, so instead of running gradient descent, the iterate moves just one step in the g direction, which already guarantees sufficient descent [Carmon and Duchi, 2016]
  • line 6: the algorithm adds a small perturbation to g to avoid the hard case for the cubic submodel (a hedged reconstruction of the subsolver follows)
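
The listing of Algorithm 2 itself is a slide image that does not survive in this transcript. Below is a hedged Python reconstruction from the paper's description: the Cauchy-like step for large ‖g‖, the perturbation scale σ = c′√(ερ)/L, and the step size η = 1/(20L) follow the paper, while defaults such as `T_eps=100` are illustrative assumptions.

```python
# Hedged reconstruction of Cubic-Subsolver via gradient descent
# (Algorithm 2 in the paper); default constants are illustrative.
import numpy as np

def cubic_subsolver(g, B, rho, eps, T_eps=100, L=1.0, c_prime=1.0, rng=None):
    rng = rng or np.random.default_rng()
    if np.linalg.norm(g) >= L**2 / rho:
        # lines 1-3: large-gradient regime -- take a single Cauchy-like
        # step along g; this alone gives sufficient submodel descent
        gBg = g @ B(g) / (rho * (g @ g))
        R_c = -gBg + np.sqrt(gBg**2 + 2.0 * np.linalg.norm(g) / rho)
        delta = -R_c * g / np.linalg.norm(g)
    else:
        # line 6: perturb g slightly to escape the "hard case"
        sigma = c_prime * np.sqrt(eps * rho) / L
        zeta = rng.normal(size=g.shape)
        g_tilde = g + sigma * zeta / np.linalg.norm(zeta)
        delta = np.zeros_like(g)
        eta = 1.0 / (20.0 * L)                  # step size from the paper
        for _ in range(int(T_eps)):
            # gradient of m~ at delta: g~ + B[delta] + (rho/2)||delta||delta
            delta -= eta * (g_tilde + B(delta)
                            + 0.5 * rho * np.linalg.norm(delta) * delta)
    # report the submodel decrease alongside the step
    dm = g @ delta + 0.5 * delta @ B(delta) \
         + rho / 6.0 * np.linalg.norm(delta)**3
    return delta, dm
```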

SLIDE 17

Cubic final solver

  • Algorithm 2 may produce an inexact ∆
  • lines 2, 4: gradient descent on the same submodel, but run to higher precision (a hedged reconstruction follows)
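
Likewise, a hedged reconstruction of the final solver (Algorithm 3 in the paper): plain gradient descent on the submodel, stopped once the submodel gradient falls below ε/2. The `max_steps` guard is an added safety assumption, not part of the paper's pseudocode.

```python
# Hedged reconstruction of Cubic-Finalsolver (Algorithm 3 in the paper).
import numpy as np

def cubic_finalsolver(g, B, rho, eps, L=1.0, max_steps=10_000):
    delta = np.zeros_like(g)
    g_m = g.copy()                       # gradient of m~ at delta = 0
    eta = 1.0 / (20.0 * L)
    for _ in range(max_steps):
        if np.linalg.norm(g_m) <= eps / 2:
            break                        # submodel solved to high precision
        delta -= eta * g_m
        # recompute the submodel gradient at the new delta
        g_m = g + B(delta) + 0.5 * rho * np.linalg.norm(delta) * delta
    return delta
```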

SLIDE 18

Claims

Condition 1. For a small constant c, Cubic-Subsolver(g, B[·], ε) terminates within $\mathcal{T}(\epsilon)$ gradient iterations (which may depend on c), and returns a ∆ satisfying at least one of the following:

  • 1. the parameter change results in submodel and function decreases that are both sufficiently large
  • 2. if that fails to hold, the second condition ensures that ∆ is not too large relative to the true solution ∆⋆, and that the cubic submodel is solved to precision $c\,\rho\,\|\Delta^\star\|^3$ when ∆⋆ is large

Theorem 1. There exists an absolute constant c such that if f(x) satisfies Assumptions 1 and 2 and Cubic-Subsolver satisfies Condition 1 with c, then for all δ > 0, $\Delta_f \ge f(x_0) - f^*$, and

$$\epsilon \le \min\left\{\frac{\sigma_1^2}{c\,M_1},\; \frac{\sigma_2^4}{c^2 M_2^2\,\rho}\right\},$$

Algorithm 1 will output an ε-second-order stationary point of f with probability at least 1 − δ within

$$O\!\left(\frac{\sqrt{\rho}\,\Delta_f}{\epsilon^{1.5}}\left(\frac{\sigma_1^2}{\epsilon^2} + \frac{\sigma_2^2}{\rho\,\epsilon}\,\mathcal{T}(\epsilon)\right)\right)$$

total stochastic gradient and Hessian-vector product evaluations.

SLIDE 19

Claims

Lemma 1. There exists an absolute constant c such that, under the same assumptions on f(x) and the same choice of parameters n₁, n₂ as in Theorem 1, Algorithm 2 satisfies Condition 1 with probability at least 1 − δ with

$$\mathcal{T}(\epsilon) \le O\!\left(\frac{L}{\sqrt{\rho\,\epsilon}}\right).$$

Corollary 1. Under the same settings as Theorem 1, if we instantiate Cubic-Subsolver with Algorithm 2 and

$$\epsilon \le \min\left\{\frac{\sigma_1^2}{c\,M_1},\; \frac{\sigma_2^4}{c^2 M_2^2\,\rho}\right\},$$

then Algorithm 1 will output an ε-second-order stationary point of f with probability at least 1 − δ within

$$O\!\left(\frac{\sqrt{\rho}\,\Delta_f}{\epsilon^{1.5}}\left(\frac{\sigma_1^2}{\epsilon^2} + \frac{\sigma_2^2}{\rho\,\epsilon^{1.5}}\cdot\frac{L}{\sqrt{\rho}}\right)\right)$$

total stochastic gradient and Hessian-vector product evaluations.

SLIDE 20

Proof Sketch

Claim 1. If $x_{t+1}$ is not an ε-second-order stationary point of f(x), the cubic submodel has large descent $m_t(x_{t+1}) - m_t(x_t)$.

Claim 2. If the cubic submodel has large descent $m_t(x_{t+1}) - m_t(x_t)$, then the true function also has large descent $f(x_{t+1}) - f(x_t)$.
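
Quantitatively (a paraphrase of the analysis, constants suppressed), the two claims chain together at the characteristic scale of cubic regularization:

$$m_t(x_{t+1}) - m_t(x_t) \le -c\,\sqrt{\frac{\epsilon^3}{\rho}} \quad\Longrightarrow\quad f(x_{t+1}) - f(x_t) \lesssim -\sqrt{\frac{\epsilon^3}{\rho}},$$

so at most $O(\sqrt{\rho}\,\Delta_f/\epsilon^{1.5})$ outer iterations are possible before Claim 1 must fail, i.e. before some $x_{t+1}$ is an ε-second-order stationary point.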

SLIDE 21

Outline

  • 1. Motivation
  • 2. Objectives
  • 3. Algorithm
  • 4. Experiments
  • 5. References

SLIDE 22

Synthetic Nonconvex Problem

Piece-wise cubic function w(x₁):

$$\min_{x\in\mathbb{R}^2} \; w(x_1) + 10\,x_2^2$$

  • Algorithm 1 is able to escape the saddle point at the origin and converge to one of the global minima faster than SGD (a toy demo follows).
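
The paper's exact piecewise-cubic w(x₁) is not reproduced on the slide, so the demo below substitutes a smooth double well w(u) = u⁴/4 − u² (an assumption) that preserves the key features: a strict saddle at the origin and two global minima at (±√2, 0).

```python
# Hedged demo on a stand-in for the synthetic problem. The paper uses a
# piecewise cubic w(x1); here w(u) = u**4/4 - u**2 is an assumption.
import numpy as np

def grad(x):      # gradient of f(x) = w(x1) + 10*x2**2
    return np.array([x[0]**3 - 2.0 * x[0], 20.0 * x[1]])

def hvp(x, v):    # Hessian-vector product of the same f
    return np.array([(3.0 * x[0]**2 - 2.0) * v[0], 20.0 * v[1]])

rng = np.random.default_rng(0)
x = np.zeros(2)                               # start exactly at the saddle
rho, eps, eta = 10.0, 1e-2, 0.01
for t in range(200):
    g = grad(x) + 0.1 * rng.normal(size=2)    # noisy gradient oracle
    delta = np.zeros(2)
    for _ in range(20):                       # crude inner solve of m~
        delta -= eta * (g + hvp(x, delta)
                        + 0.5 * rho * np.linalg.norm(delta) * delta)
    x = x + delta
print(x)  # typically lands near a global minimum, not the saddle
```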

SLIDE 23

Deep Autoencoder

Encoder: (28 × 28) → 512 → 256 → 128 → 32
Decoder: (28 × 28) ← 512 ← 256 ← 128 ← 32

  • minimize the pixelwise L2 reconstruction loss (a model sketch follows)
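
A minimal PyTorch sketch of this architecture, assuming ReLU hidden activations and a sigmoid output; the slide specifies only the layer widths and the pixelwise L2 loss, so everything else is an illustrative assumption.

```python
# Hedged sketch of the slide's autoencoder: 28x28 inputs, a
# 512-256-128-32 encoder, a mirrored decoder, and pixelwise L2 loss.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        dims = [28 * 28, 512, 256, 128, 32]
        enc, dec = [], []
        for i in range(len(dims) - 1):
            enc += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
            dec += [nn.Linear(dims[-1 - i], dims[-2 - i]), nn.ReLU()]
        dec[-1] = nn.Sigmoid()                    # pixels in [0, 1]
        self.encoder = nn.Sequential(*enc[:-1])   # linear 32-dim code
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x.flatten(1)))

model = AutoEncoder()
x = torch.rand(16, 28 * 28)               # stand-in batch of images
loss = nn.MSELoss()(model(x), x)          # pixelwise L2 loss
loss.backward()
```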

SLIDE 24

Outline

  • 1. Motivation
  • 2. Objectives
  • 3. Algorithm
  • 4. Experiments
  • 5. References

SLIDE 25

References I

Allen-Zhu, Z. (2018). Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2675–2686.

Allen-Zhu, Z. and Li, Y. (2018). Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3716–3726.

Carmon, Y. and Duchi, J. C. (2016). Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2018). Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772.

Ge, R., Huang, F., Jin, C., and Yuan, Y. (2015). Escaping from saddle points – online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. (2017). How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732. JMLR.org.

SLIDE 26

References II

Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. (2018). Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908.
