

SLIDE 1

Katalyst: Boosting Convex Katayusha for Non-Convex Problems with a Large Condition Number

Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang

zaiyi.czy@alibaba-inc.com

2019-06-10

SLIDE 2

Overview

1. Introduction
2. Katalyst Algorithm and Theoretical Guarantee
3. Experiments

SLIDE 3

Problem Definition

$$\min_{x \in \mathbb{R}^d} \; \phi(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \qquad (1)$$

For this problem we can obtain a better gradient complexity, with respect to the sample size $n$ and the accuracy $\epsilon$, via variance-reduced (SVRG-type) methods (Johnson & Zhang, 2013). We name the proposed algorithm Katalyst after Katyusha (Allen-Zhu, 2017) and Catalyst (Lin et al., 2015).
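To make the SVRG-type variance reduction mentioned above concrete, here is a minimal NumPy sketch of the variance-reduced stochastic gradient such methods use for the smooth part of (1). The function names and the toy least-squares components are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of an SVRG-type variance-reduced gradient estimator for the smooth
# part of phi(x) = (1/n) sum_i f_i(x) + psi(x). The interface grad_fi(i, x)
# and the toy data below are assumptions made for illustration.

def svrg_gradient(grad_fi, x, snapshot, full_grad_at_snapshot, i):
    """Unbiased, variance-reduced estimate of (1/n) sum_i grad f_i(x)."""
    return grad_fi(i, x) - grad_fi(i, snapshot) + full_grad_at_snapshot

# Toy usage with least squares f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 100, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]

x_tilde = np.zeros(d)                                  # snapshot point
full_grad = (A.T @ (A @ x_tilde - b)) / n              # full gradient at snapshot
i = rng.integers(n)                                    # random component
g = svrg_gradient(grad_fi, x_tilde + 0.1, x_tilde, full_grad, i)
```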

SLIDE 5

Assumptions

{f_i} are L-smooth; ψ can be non-smooth but convex; φ is µ-weakly convex.

Definition 1 (L-smoothness)

A function f is Lipschitz smooth with constant L if its gradient is Lipschitz continuous with constant L, that is, $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Definition 2 (Weak convexity)

A function φ is µ-weakly convex if $\phi(x) + \frac{\mu}{2}\|x\|^2$ is convex.
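As a concrete illustration of Definition 2, the following small check numerically probes µ-weak convexity by testing midpoint convexity of φ(x) + (µ/2)‖x‖² on random pairs of points. The toy φ below is an assumption for illustration, not an objective from the paper.

```python
import numpy as np

# Numerically probe mu-weak convexity of a toy phi: check that
# g(x) = phi(x) + (mu/2)||x||^2 satisfies midpoint convexity on random pairs.
# phi(x) = -0.5*||x||^2 + |x_1| (1-weakly convex) is an illustrative choice.

def is_weakly_convex(phi, mu, dim=3, trials=10_000, seed=0, tol=1e-10):
    rng = np.random.default_rng(seed)
    g = lambda x: phi(x) + 0.5 * mu * np.dot(x, x)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        if g(0.5 * (x + y)) > 0.5 * (g(x) + g(y)) + tol:
            return False                     # midpoint convexity violated
    return True

phi = lambda x: -0.5 * np.dot(x, x) + abs(x[0])
print(is_weakly_convex(phi, mu=1.0))   # True: phi + 0.5*||x||^2 = |x_1| is convex
print(is_weakly_convex(phi, mu=0.5))   # False: mu = 0.5 is too small here
```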

SLIDE 6

Comparisons with Related Work

Table 1: Comparison of gradient complexities of variance-reduction-based algorithms for finding an ǫ-stationary point of (1). ∗ marks a result that is only valid when L/µ ≤ √n.

| Algorithms | L/µ ≥ Ω(n) | L/µ ≤ O(n) | Non-smooth ψ |
| --- | --- | --- | --- |
| SAGA (Reddi et al., 2016) | O(n^{2/3} L/ǫ²) | O(n^{2/3} L/ǫ²) | Yes |
| RapGrad (Lan & Yang, 2018) | O(√(nLµ)/ǫ²) | O((µn + √(nLµ))/ǫ²) | indicator function |
| SVRG (Reddi et al., 2016) | O(n^{2/3} L/ǫ²) | O(n^{2/3} L/ǫ²) | Yes |
| Natasha1 (Allen-Zhu, 2017) | NA | O(n^{2/3} L^{2/3} µ^{1/3}/ǫ²) | Yes |
| RepeatSVRG (Allen-Zhu, 2017) | O(n^{3/4} √(Lµ)/ǫ²) | O((µn + n^{3/4} √(Lµ))/ǫ²) | Yes |
| 4WD-Catalyst (Paquette et al., 2018) | O(nL/ǫ²) | O(nL/ǫ²) | Yes |
| SPIDER (Fang et al., 2018) | O(√n L/ǫ²) | O(√n L/ǫ²) | No |
| SNVRG (Zhou et al., 2018) | O(√n L/ǫ²) | O(√n L/ǫ²) | No |
| Katalyst (this work) | O(√(nLµ)/ǫ²) | O((µn + L)/ǫ²) | Yes |

Our bound is shown to be optimal, up to a logarithmic factor, by a recent lower-bound result (Zhou & Gu, 2019).

SLIDE 7

Overview

1. Introduction
2. Katalyst Algorithm and Theoretical Guarantee
3. Experiments

SLIDE 8

Interpretation - Our Basic Idea

[Figure: illustration of the basic idea, Step 1: moving from x0 to x1. Plot data omitted.]

SLIDE 9

Interpretation - Our Basic Idea

[Figure: illustration of the basic idea, Step > 1: moving from x1 to x2. Plot data omitted.]

SLIDE 10

A Unified Framework

Meta Algorithm

Algorithm 1: Stagewise-SA(w0, {ηs}, µ, {ws})
Input: a non-increasing sequence {w_s}, x_0 ∈ dom(ψ), γ = (2µ)^{-1}
1: for s = 1, ..., S do
2:     f_s(·) = φ(·) + (1/(2γ)) ‖· − x_{s−1}‖²
3:     x_s = Katyusha(f_s, x_{s−1}, K_s, µ, L + µ)   // x_s is usually an averaged solution
4: end for
Output: x_τ, where τ is randomly chosen from {0, ..., S} with probabilities p_τ = w_{τ+1} / Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S.

Each stage objective decomposes into a finite-sum part and a composite part:

$$f_s(x) = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\Big( f_i(x) + \frac{\mu}{2}\|x - x_{s-1}\|^2 \Big)}_{\hat f_i(x)} \;+\; \underbrace{\frac{\gamma^{-1} - \mu}{2}\|x - x_{s-1}\|^2 + \psi(x)}_{\hat \psi(x)}$$
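The Python sketch below mirrors the structure of Algorithm 1: each stage adds a proximal term centered at the previous iterate and passes the resulting better-conditioned problem to an inner solver, and the output stage is sampled with the stated weights. The `inner_solver` placeholder stands in for Katyusha; its interface and everything else here are schematic assumptions, not the authors' code.

```python
import numpy as np

# Schematic sketch of Algorithm 1 (Stagewise-SA). `inner_solver` stands in for
# Katyusha applied to the regularized stage objective; its interface is an
# assumption made for illustration.

def stagewise_sa(phi_grad, psi_prox, x0, mu, L, S, K, inner_solver,
                 alpha=1.0, seed=0):
    gamma = 1.0 / (2.0 * mu)
    xs = [np.asarray(x0, dtype=float)]
    for s in range(1, S + 1):
        x_prev = xs[-1]
        # Stage objective f_s(x) = phi(x) + (1/(2*gamma)) ||x - x_prev||^2,
        # which is (1/gamma - mu)-strongly convex (= mu here) when phi is
        # mu-weakly convex.
        fs_grad = lambda x, x_prev=x_prev: phi_grad(x) + (x - x_prev) / gamma
        xs.append(inner_solver(fs_grad, psi_prox, x_prev, K, sigma=mu, L=L + mu))
    # Output x_tau with tau drawn proportionally to w_{tau+1} = (tau+1)^alpha.
    w = np.array([(t + 1) ** alpha for t in range(S + 1)])
    tau = np.random.default_rng(seed).choice(S + 1, p=w / w.sum())
    return xs[tau]
```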

SLIDE 11

Algorithm

Algorithm 2: Katyusha(f, x_0, K, σ, L)
Initialize: τ_2 = 1/2, τ_1 = min{√(nσ/(3L)), 1/2}, η = 1/(3τ_1 L), θ = 1 + ησ, m = ⌈log(2τ_1 + 2/θ − 1)/log θ⌉ + 1, y_0 = ζ_0 = x̃_0 ← x_0
1: for k = 0, ..., K − 1 do
2:     u_k = ∇f̂(x̃_k)
3:     for t = 0, ..., m − 1 do
4:         j = km + t
5:         x_{j+1} = τ_1 ζ_j + τ_2 x̃_k + (1 − τ_1 − τ_2) y_j
6:         ∇̃_{j+1} = u_k + ∇f̂_i(x_{j+1}) − ∇f̂_i(x̃_k)
7:         ζ_{j+1} = argmin_ζ { (1/(2η)) ‖ζ − ζ_j‖² + ⟨∇̃_{j+1}, ζ⟩ + ψ̂(ζ) }
8:         y_{j+1} = argmin_y { (3L/2) ‖y − x_{j+1}‖² + ⟨∇̃_{j+1}, y⟩ + ψ̂(y) }
9:     end for
10:    x̃_{k+1} = (Σ_{t=0}^{m−1} θ^t y_{km+t+1}) / (Σ_{t=0}^{m−1} θ^t)
11: end for
Output: x̃_K
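For concreteness, below is a minimal NumPy sketch of Algorithm 2 for a stage objective f̂(x) = (1/n) Σ_i f̂_i(x) + ψ̂(x), with ψ̂ handled through its proximal operator (lines 7 and 8 above reduce to prox steps). The interfaces `grad_i(i, x)` and `prox_psi(v, step)` and the uniform component sampling are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of Algorithm 2 (Katyusha) for
# f_hat(x) = (1/n) sum_i fhat_i(x) + psi_hat(x).
# grad_i(i, x) returns grad fhat_i(x); prox_psi(v, step) is prox of step*psi_hat.

def katyusha(grad_i, prox_psi, x0, n, K, sigma, L, seed=0):
    rng = np.random.default_rng(seed)
    tau2 = 0.5
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    eta = 1.0 / (3.0 * tau1 * L)
    theta = 1.0 + eta * sigma
    m = int(np.ceil(np.log(2.0 * tau1 + 2.0 / theta - 1.0) / np.log(theta))) + 1
    x_tilde = np.asarray(x0, dtype=float)
    y, zeta = x_tilde.copy(), x_tilde.copy()
    for _ in range(K):
        # Full gradient of the smooth part at the snapshot (line 2).
        u = np.mean([grad_i(i, x_tilde) for i in range(n)], axis=0)
        ys = []
        for _ in range(m):
            x = tau1 * zeta + tau2 * x_tilde + (1.0 - tau1 - tau2) * y  # line 5
            i = rng.integers(n)
            g = u + grad_i(i, x) - grad_i(i, x_tilde)   # variance-reduced gradient
            zeta = prox_psi(zeta - eta * g, eta)        # line 7: mirror step
            y = prox_psi(x - g / (3.0 * L), 1.0 / (3.0 * L))  # line 8: gradient step
            ys.append(y)
        w = theta ** np.arange(m)                        # line 10: weighted average
        x_tilde = (w[:, None] * np.array(ys)).sum(0) / w.sum()
    return x_tilde
```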

SLIDE 12

Theory

Theorem 3

Let w_s = s^α with α > 0, γ = 1/(2µ), L̂ = L + µ, σ = µ, and in each call of Katyusha let τ_1 = min{√(nσ/(3L̂)), 1/2}, step size η = 1/(3τ_1 L̂), τ_2 = 1/2, θ = 1 + ησ, and K_s = ⌈log(D_s)/(m log θ)⌉, m = ⌈log(2τ_1 + 2/θ − 1)/log θ⌉ + 1, where D_s = max{4L̂/µ, L̂³/µ³, L̂² s/µ²}. Then

$$\max\{\mathbb{E}[\|\nabla \phi_\gamma(x_{\tau+1})\|^2],\, \mathbb{E}[L^2\|x_{\tau+1} - z_{\tau+1}\|^2]\} \le \frac{34\mu\Delta_\phi(\alpha+1)}{S+1} + \frac{98\mu\Delta_\phi(\alpha+1)}{(S+1)^\alpha}\,\mathbb{I}_{\alpha<1},$$

where z = prox_{γφ}(x) and τ is randomly chosen from {0, ..., S} with probabilities p_τ = w_{τ+1}/Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S. Furthermore, the total gradient complexity for finding x_{τ+1} such that max(E[‖∇φ_γ(x_{τ+1})‖²], L² E[‖x_{τ+1} − z_{τ+1}‖²]) ≤ ǫ² is

$$N(\epsilon) = \begin{cases} O\!\left(\big(\mu n + \sqrt{n\mu L}\big)\log\!\big(\tfrac{L}{\mu\epsilon}\big)\,\tfrac{1}{\epsilon^2}\right), & n \ge \tfrac{3L}{4\mu},\\[4pt] O\!\left(\sqrt{nL\mu}\,\log\!\big(\tfrac{L}{\mu\epsilon}\big)\,\tfrac{1}{\epsilon^2}\right), & n \le \tfrac{3L}{4\mu}. \end{cases}$$

SLIDE 13

Theory

Theorem 4

Suppose ψ = 0. With the same parameter values as in Theorem 3, except that K = ⌈log(D)/(m log θ)⌉ where D = max(48L̂/µ, 2L̂³/µ³), the total gradient complexity for finding x_{τ+1} such that E[‖∇φ(x_{τ+1})‖²] ≤ ǫ² is

$$N(\epsilon) = \begin{cases} O\!\left(\big(\mu n + \sqrt{n\mu L}\big)\log\!\big(\tfrac{L}{\mu}\big)\,\tfrac{1}{\epsilon^2}\right), & n \ge \tfrac{3L}{4\mu},\\[4pt] O\!\left(\sqrt{nL\mu}\,\log\!\big(\tfrac{L}{\mu}\big)\,\tfrac{1}{\epsilon^2}\right), & n \le \tfrac{3L}{4\mu}. \end{cases}$$

SLIDE 14

Overview

1. Introduction
2. Katalyst Algorithm and Theoretical Guarantee
3. Experiments

SLIDE 15

Experiments I

Squared hinge loss with a log-sum penalty (LSP) or a transformed ℓ1 penalty (TL1).
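For reference, here is a sketch of these non-convex objectives: squared hinge loss plus a TL1 or LSP penalty. The slide does not give the exact penalty parameterizations, so the forms below (standard ones from the sparse-regularization literature) and their parameters a and θ are assumptions for illustration.

```python
import numpy as np

# Sketch of the Experiments I objectives: squared hinge loss plus a non-convex
# sparsity penalty. The TL1/LSP parameterizations are standard forms and are
# assumptions; the slide does not specify the constants used.

def squared_hinge(w, A, y):
    # y in {-1, +1}; loss_i = max(0, 1 - y_i * a_i^T w)^2
    margins = np.maximum(0.0, 1.0 - y * (A @ w))
    return np.mean(margins ** 2)

def tl1_penalty(w, lam, a=1.0):
    # Transformed l1: lam * sum_j (a + 1)|w_j| / (a + |w_j|)
    return lam * np.sum((a + 1.0) * np.abs(w) / (a + np.abs(w)))

def lsp_penalty(w, lam, theta=1.0):
    # Log-sum penalty: lam * sum_j log(1 + |w_j| / theta)
    return lam * np.sum(np.log1p(np.abs(w) / theta))

def objective(w, A, y, lam, penalty=tl1_penalty):
    return squared_hinge(w, A, y) + penalty(w, lam)
```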

[Figure 1: Comparison of different algorithms (Katalyst, proxSVRG, proxSVRG-mb, 4WD-Catalyst) for the two tasks (TL1 and LSP regularization) on the rcv1 and realsim datasets, with λ = 1/n and λ = 0.1/n. Each panel plots log10(objective) against the number of gradients/n; plot data omitted.]

SLIDE 16

Experiments II

We use the smoothed SCAD penalty given in (Lan & Yang, 2018):

$$R_{\lambda,\gamma,\epsilon}(x) = \begin{cases} \lambda (x^2 + \epsilon)^{1/2}, & \text{if } (x^2+\epsilon)^{1/2} \le \lambda,\\[4pt] \dfrac{2\gamma\lambda (x^2+\epsilon)^{1/2} - (x^2+\epsilon) - \lambda^2}{2(\gamma-1)}, & \text{if } \lambda < (x^2+\epsilon)^{1/2} < \gamma\lambda,\\[4pt] \dfrac{\lambda^2(\gamma+1)}{2}, & \text{otherwise}, \end{cases}$$

where γ > 2, λ > 0, and ǫ > 0. The problem is then

$$\min_{x \in \mathbb{R}^d} \; \phi(x) := \frac{1}{2n}\sum_{i=1}^{n} (a_i^\top x - b_i)^2 + \frac{\rho}{2}\sum_{i=1}^{d} R_{\lambda,\gamma,\epsilon}(x_i).$$
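The piecewise definition above translates directly into code; the vectorized sketch below transcribes R_{λ,γ,ǫ} as written (a transcription of the formula for illustration, not the authors' code).

```python
import numpy as np

# Vectorized transcription of the smoothed SCAD penalty R_{lambda,gamma,eps}
# defined above (gamma > 2, lam > 0, eps > 0).

def smoothed_scad(x, lam, gamma, eps):
    x = np.asarray(x, dtype=float)
    t = np.sqrt(x ** 2 + eps)                    # (x^2 + eps)^{1/2}
    small = lam * t                              # region: t <= lam
    mid = (2 * gamma * lam * t - (x ** 2 + eps) - lam ** 2) / (2 * (gamma - 1))
    large = np.full_like(t, lam ** 2 * (gamma + 1) / 2)   # region: t >= gamma*lam
    return np.where(t <= lam, small, np.where(t < gamma * lam, mid, large))

# Example: the full regularizer (rho/2) * sum_i R(x_i) on a toy vector.
x = np.linspace(-3, 3, 7)
rho, lam, gamma, eps = 0.1, 0.5, 3.0, 1e-3
reg = 0.5 * rho * smoothed_scad(x, lam, gamma, eps).sum()
```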

SLIDE 17

Experiments II.1

[Figure 2: Theoretical performances of RapGrad and Katalyst; plot data omitted.]

SLIDE 18

Experiments II.2

[Figure 3: Empirical performances of RapGrad and Katalyst with early termination; plot data omitted.]

SLIDE 19

The End

SLIDE 20

References I

Allen-Zhu, Z. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 89–97, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1200–1205, 2017.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In NeurIPS, pp. 687–697, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Lan, G. and Yang, Y. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/1805.05411, 2018.

SLIDE 21

References II

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.

Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pp. 613–622, 2018.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pp. 1145–1153, 2016.

Zhou, D. and Gu, Q. Lower bounds for smooth nonconvex finite-sum optimization. arXiv preprint arXiv:1901.11224, 2019.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In NeurIPS, pp. 3925–3936, 2018.
