Katalyst: Boosting Convex Katayusha for Non-Convex Problems with a Large Condition Number
Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang
zaiyi.czy@alibaba-inc.com
2019-06-10
Outline
1. Introduction
2. Katalyst Algorithm and Theoretical Guarantee
3. Experiments
Problem Definition

$$\min_{x \in \mathbb{R}^d} \; \phi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \qquad (1)$$

For problems of this form we can obtain a better gradient complexity, with respect to the sample size n and the accuracy ǫ, via variance-reduced (SVRG-type) methods (Johnson & Zhang, 2013). We name the proposed algorithm Katalyst after Katyusha (Allen-Zhu, 2017) and Catalyst (Lin et al., 2015).
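To make (1) concrete, here is a minimal sketch in Python. The least-squares components f_i and the ℓ1 choice of ψ are our illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Minimal instance of problem (1): phi(x) = (1/n) * sum_i f_i(x) + psi(x).
# Illustrative assumptions: f_i(x) = 0.5 * (a_i^T x - b_i)^2 (L-smooth) and
# psi(x) = lam * ||x||_1 (non-smooth but convex).

rng = np.random.default_rng(0)
n, d, lam = 100, 20, 0.01
A = rng.standard_normal((n, d))       # rows are the a_i
b = rng.standard_normal(n)

def f_i(x, i):
    return 0.5 * (A[i] @ x - b[i]) ** 2

def psi(x):
    return lam * np.abs(x).sum()

def phi(x):
    return np.mean([f_i(x, i) for i in range(n)]) + psi(x)

print(phi(np.zeros(d)))               # objective value at the origin
```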
Assumptions: {f_i} are L-smooth; ψ can be non-smooth but convex; φ is µ-weakly convex.

Definition 1 (L-smoothness)
A function f is Lipschitz smooth with constant L if its gradient is Lipschitz continuous with constant L, that is,
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

Definition 2 (Weak convexity)
A function φ is µ-weakly convex if φ(x) + (µ/2)‖x‖² is convex.
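A quick example of Definition 2 (ours, for intuition): any L-smooth function is L-weakly convex; e.g. φ(x) = cos(x) with L = µ = 1:

$$\frac{d^2}{dx^2}\left(\cos(x) + \frac{1}{2}x^2\right) = 1 - \cos(x) \ge 0 \quad\Rightarrow\quad \cos(x) \text{ is } 1\text{-weakly convex.}$$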
Table 1: Comparison of gradient complexities of variance-reduction-based algorithms for finding an ǫ-stationary point of (1). * marks a result that is only valid when L/µ ≤ √n.

| Algorithms | L/µ ≥ Ω(n) | L/µ ≤ O(n) | Non-smooth ψ |
| SAGA (Reddi et al., 2016) | O(n^{2/3}L/ǫ²) | O(n^{2/3}L/ǫ²) | Yes |
| SVRG (Reddi et al., 2016) | O(n^{2/3}L/ǫ²) | O(n^{2/3}L/ǫ²) | Yes |
| Natasha1 (Allen-Zhu, 2017) | NA | O(n^{2/3}L^{2/3}µ^{1/3}/ǫ²)* | Yes |
| RepeatSVRG (Allen-Zhu, 2017) | – | – | Yes |
| RapGrad (Lan & Yang, 2018) | – | – | indicator function only |
| 4WD-Catalyst (Paquette et al., 2018) | O(nL/ǫ²) | O(nL/ǫ²) | Yes |
| SPIDER (Fang et al., 2018) | O(√n L/ǫ²) | O(√n L/ǫ²) | No |
| SNVRG (Zhou et al., 2018) | O(√n L/ǫ²) | O(√n L/ǫ²) | No |
| Katalyst (this work) | Õ(√(nL/µ)/ǫ²) | Õ(n/ǫ²) | Yes |

(The Katalyst entries follow from Theorem 3 below.) Our bound is proved optimal up to a logarithmic factor by a recent work (Zhou & Gu, 2019).
2. Katalyst Algorithm and Theoretical Guarantee
Step 1
[Figure: one proximal-point stage on a one-dimensional example; starting from x0, minimizing the regularized objective produces x1.]
Step > 1
[Figure: the same construction at a later stage; the quadratic is re-centered at x1 and the stage produces x2.]
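Why each stage in the pictures is tractable for a convex solver (a step the slides leave implicit): stage s minimizes f_s(x) = φ(x) + µ‖x − x_{s−1}‖², and expanding the quadratic shows that µ-weak convexity of φ is exactly neutralized:

$$f_s(x) = \underbrace{\Big(\phi(x) + \tfrac{\mu}{2}\|x\|^2\Big)}_{\text{convex by Definition 2}} + \underbrace{\tfrac{\mu}{2}\|x\|^2 - 2\mu\langle x, x_{s-1}\rangle + \mu\|x_{s-1}\|^2}_{\mu\text{-strongly convex quadratic}},$$

so f_s is µ-strongly convex, and each smooth component f_i + (µ/2)‖· − x_{s−1}‖² is (L + µ)-smooth. This is precisely the strongly convex regime where Katyusha's accelerated rate applies; Algorithm 1 below makes the construction precise.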
Meta Algorithm

Algorithm 1: Stagewise-SA(x0, {Ks}, µ, {ws})
Input: a non-decreasing sequence {ws}, x0 ∈ dom(ψ), γ = (2µ)^{-1}
1: for s = 1, ..., S do
2:     f_s(·) = φ(·) + (1/(2γ))‖· − x_{s−1}‖²
3:     x_s = Katyusha(f_s, x_{s−1}, K_s, µ, L + µ)   // x_s is usually an averaged solution
4: end for
Output: x_τ, where τ is randomly chosen from {0, ..., S} according to the probabilities p_τ = w_{τ+1} / Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S

Each stage objective is passed to Katyusha in the composite finite-sum form

$$f_s(x) = \frac{1}{n}\sum_{i=1}^{n}\underbrace{\Big(f_i(x) + \frac{\mu}{2}\|x - x_{s-1}\|^2\Big)}_{\hat f_i(x)} + \underbrace{\frac{\gamma^{-1} - \mu}{2}\|x - x_{s-1}\|^2 + \psi(x)}_{\hat\psi(x)}.$$
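A compact runnable sketch of Algorithm 1, under the same illustrative least-squares/ℓ1 setup as before. To keep it short, the inner solver is plain proximal gradient standing in for Katyusha; `mu` is an assumed weak-convexity parameter, and none of these names come from the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 50, 1e-3
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2 / n          # smoothness of (1/(2n))||Ax - b||^2
mu = 0.1 * L                               # assumed weak-convexity parameter

def grad_smooth(x):                        # gradient of the smooth part of phi
    return A.T @ (A @ x - b) / n

def prox_psi(v, eta):                      # prox of eta * lam * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def inner_solver(center, K):
    # Stand-in for Katyusha: proximal gradient on the stage objective
    # f_s(x) = phi(x) + mu * ||x - center||^2, which is mu-strongly convex.
    x, eta = center.copy(), 1.0 / (L + 2 * mu)
    for _ in range(K):
        x = prox_psi(x - eta * (grad_smooth(x) + 2 * mu * (x - center)), eta)
    return x

S, alpha = 30, 1.0
xs = [np.zeros(d)]
for s in range(1, S + 1):                  # stage s re-centers at x_{s-1}
    xs.append(inner_solver(xs[-1], K=50))
w = np.arange(1, S + 2, dtype=float) ** alpha   # w_s = s^alpha (Theorem 3)
x_out = xs[rng.choice(S + 1, p=w / w.sum())]    # p_tau proportional to w_{tau+1}
```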
Algorithm 2: Katyusha(f̂, x0, K, σ, L̂)
Initialize: τ2 = 1/2, τ1 = min{√(nσ/(3L̂)), 1/2}, η = 1/(3τ1L̂), θ = 1 + ησ, m = ⌈log(2τ1 + 2/θ − 1)/log θ⌉ + 1, y0 = ζ0 = x̃0 ← x0
1: for k = 0, ..., K − 1 do
2:     u_k = ∇f̂(x̃_k)                                   // full gradient at the snapshot
3:     for t = 0, ..., m − 1 do
4:         j = km + t
5:         x_{j+1} = τ1 ζ_j + τ2 x̃_k + (1 − τ1 − τ2) y_j
6:         ∇_{j+1} = u_k + ∇f̂_i(x_{j+1}) − ∇f̂_i(x̃_k)   // i sampled uniformly from {1, ..., n}
7:         ζ_{j+1} = argmin_ζ { (1/(2η))‖ζ − ζ_j‖² + ⟨∇_{j+1}, ζ⟩ + ψ̂(ζ) }
8:         y_{j+1} = argmin_y { (3L̂/2)‖y − x_{j+1}‖² + ⟨∇_{j+1}, y⟩ + ψ̂(y) }
9:     end for
10:    x̃_{k+1} = (Σ_{t=0}^{m−1} θ^t y_{km+t+1}) / (Σ_{t=0}^{m−1} θ^t)
11: end for
Output: x̃_K
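And a sketch of the inner Katyusha loop itself, transcribed from Algorithm 2. `grad_full`, `grad_i`, and `prox` are assumed callables for ∇f̂, ∇f̂_i, and the proximal operator of ψ̂; our reading of the epoch-length formula m is noted in a comment:

```python
import numpy as np

def katyusha(grad_full, grad_i, prox, x0, K, sigma, L_hat, n, rng):
    """Sketch of Algorithm 2; not the authors' reference implementation."""
    tau2 = 0.5
    tau1 = min(np.sqrt(n * sigma / (3.0 * L_hat)), 0.5)
    eta = 1.0 / (3.0 * tau1 * L_hat)
    theta = 1.0 + eta * sigma
    # Epoch length m = ceil(log(2*tau1 + 2/theta - 1) / log(theta)) + 1
    # (our reading of the slide's formula).
    m = int(np.ceil(np.log(2 * tau1 + 2 / theta - 1) / np.log(theta))) + 1
    y, zeta, x_tilde = x0.copy(), x0.copy(), x0.copy()
    weights = theta ** np.arange(m)             # theta^t, t = 0..m-1
    for _ in range(K):
        u = grad_full(x_tilde)                  # full gradient at the snapshot
        ys = []
        for _ in range(m):
            x = tau1 * zeta + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = rng.integers(n)                 # uniform component index
            g = u + grad_i(x, i) - grad_i(x_tilde, i)   # variance-reduced grad
            zeta = prox(zeta - eta * g, eta)            # mirror-descent step
            y = prox(x - g / (3 * L_hat), 1.0 / (3 * L_hat))  # gradient step
            ys.append(y)
        # Snapshot update: theta^t-weighted average of this epoch's y iterates.
        x_tilde = sum(w * yy for w, yy in zip(weights, ys)) / weights.sum()
    return x_tilde
```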
Theorem 3
Let ws = s^α with α > 0, γ = 1/(2µ), L̂ = L + µ, σ = µ, and in each call of Katyusha let τ1 = min{√(nµ/(3L̂)), 1/2}, step size η = 1/(3τ1L̂), τ2 = 1/2, θ = 1 + ησ, and K_s = ⌈log(D_s)/(m log θ)⌉ with m = ⌈log(2τ1 + 2/θ − 1)/log θ⌉ + 1 and D_s = max{48L̂/µ, 2L̂³/µ³, L̂²s/µ²}. Then we have

$$\max\{\mathbb{E}[\|\nabla \phi_\gamma(x_{\tau+1})\|^2], \; \mathbb{E}[L^2\|x_{\tau+1} - z_{\tau+1}\|^2]\} \le \frac{34\mu\Delta_\phi(\alpha+1)}{S+1} + \frac{98\mu\Delta_\phi(\alpha+1)}{(S+1)^{\alpha}}\,\mathbb{I}_{\alpha<1},$$

where z = prox_{γφ}(x) (φ_γ is the Moreau envelope of φ with parameter γ, and Δ_φ := φ(x0) − min_x φ(x)), and τ is randomly chosen from {0, ..., S} according to the probabilities p_τ = w_{τ+1}/Σ_{k=0}^{S} w_{k+1}, τ = 0, ..., S. Furthermore, the total gradient complexity for finding x_{τ+1} such that max(E[‖∇φ_γ(x_{τ+1})‖²], L²E[‖x_{τ+1} − z_{τ+1}‖²]) ≤ ǫ² is

$$N(\epsilon) = \begin{cases} O\!\Big(n \log\frac{L}{\mu\epsilon} \cdot \frac{1}{\epsilon^2}\Big), & n \ge \frac{3L}{4\mu}, \\[4pt] O\!\Big(\sqrt{\frac{nL}{\mu}}\, \log\frac{L}{\mu\epsilon} \cdot \frac{1}{\epsilon^2}\Big), & n \le \frac{3L}{4\mu}. \end{cases}$$
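A back-of-the-envelope gloss (ours, with constants suppressed) on where N(ǫ) comes from: the bound above requires S = O(µΔ_φ/ǫ²) stages, and each stage runs Katyusha on a µ-strongly convex, (L + µ)-smooth finite sum to an accuracy controlled by D_s, costing O((n + √(nL/µ)) log(L/(µǫ))) gradients. Hence

$$N(\epsilon) = O\!\Big(\frac{\mu\Delta_\phi}{\epsilon^2}\Big)\cdot O\!\Big(\big(n + \sqrt{nL/\mu}\big)\log\tfrac{L}{\mu\epsilon}\Big) = \tilde O\!\Big(\frac{\max\{n, \sqrt{nL/\mu}\}}{\epsilon^2}\Big),$$

and the maximum switches between n and √(nL/µ) at n ≍ L/µ (the theorem's threshold 3L/(4µ)), giving the two cases.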
Theorem 4
Suppose ψ = 0. With the same parameter values as in Theorem 3, except that K = ⌈log(D)/(m log θ)⌉ where D = max(48L̂/µ, 2L̂³/µ³), the total gradient complexity for finding x_{τ+1} such that E[‖∇φ(x_{τ+1})‖²] ≤ ǫ² is

$$N(\epsilon) = \begin{cases} O\!\Big(n \log\frac{L}{\mu} \cdot \frac{1}{\epsilon^2}\Big), & n \ge \frac{3L}{4\mu}, \\[4pt] O\!\Big(\sqrt{\frac{nL}{\mu}}\, \log\frac{L}{\mu} \cdot \frac{1}{\epsilon^2}\Big), & n \le \frac{3L}{4\mu}. \end{cases}$$
3. Experiments
Task: squared hinge loss plus a non-convex regularizer, either the log-sum penalty (LSP) or the transformed ℓ1 penalty (TL1).
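For concreteness, the loss and the two penalties in (assumed) standard form; the paper's exact parameterizations of TL1 and LSP may differ:

```python
import numpy as np

def squared_hinge(x, A, b):
    # (1/n) * sum_i max(0, 1 - b_i * a_i^T x)^2, with labels b_i in {-1, +1}.
    return np.mean(np.maximum(0.0, 1.0 - b * (A @ x)) ** 2)

def lsp(x, lam, theta=1.0):
    # Log-sum penalty: lam * sum_i log(1 + |x_i| / theta).
    return lam * np.sum(np.log1p(np.abs(x) / theta))

def tl1(x, lam, a=1.0):
    # Transformed l1 penalty: lam * sum_i (a + 1)|x_i| / (a + |x_i|).
    return lam * np.sum((a + 1.0) * np.abs(x) / (a + np.abs(x)))

def objective(x, A, b, lam, penalty=lsp):
    return squared_hinge(x, A, b) + penalty(x, lam)
```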
Figure 1: Comparison of different algorithms (Katalyst, proxSVRG, proxSVRG-mb, 4WD-Catalyst) for the two tasks on different datasets. Panels: {TL1, LSP} × {rcv1, realsim} × λ ∈ {1/n, 0.1/n}; x-axis: number of gradients / n; y-axis: log10(objective).
We use the smoothed SCAD penalty given in (Lan & Yang, 2018):

$$R_{\lambda,\gamma,\epsilon}(x) = \begin{cases} \lambda (x^2 + \epsilon)^{1/2}, & (x^2 + \epsilon)^{1/2} \le \lambda, \\[4pt] \dfrac{2\gamma\lambda (x^2+\epsilon)^{1/2} - (x^2+\epsilon) - \lambda^2}{2(\gamma - 1)}, & \lambda < (x^2+\epsilon)^{1/2} < \gamma\lambda, \\[4pt] \dfrac{\lambda^2(\gamma+1)}{2}, & (x^2+\epsilon)^{1/2} \ge \gamma\lambda, \end{cases}$$

where γ > 2, λ > 0, and ǫ > 0. The problem is then

$$\min_{x \in \mathbb{R}^d} \; \phi(x) := \frac{1}{2n}\sum_{i=1}^{n} (a_i^\top x - b_i)^2 + \frac{\rho}{2}\sum_{i=1}^{d} R_{\lambda,\gamma,\epsilon}(x_i).$$
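A direct transcription of the penalty and the objective, vectorized over coordinates; parameter names mirror the formula above:

```python
import numpy as np

def smoothed_scad(x, lam, gam, eps):
    # Smoothed SCAD R_{lam,gam,eps} applied coordinate-wise (gam > 2,
    # lam > 0, eps > 0); u plays the role of (x^2 + eps)^{1/2}.
    u = np.sqrt(np.square(x) + eps)
    mid = (2.0 * gam * lam * u - np.square(u) - lam ** 2) / (2.0 * (gam - 1.0))
    big = lam ** 2 * (gam + 1.0) / 2.0
    return np.where(u <= lam, lam * u, np.where(u < gam * lam, mid, big))

def scad_objective(x, A, b, rho, lam, gam, eps):
    # phi(x) = (1/(2n)) * sum_i (a_i^T x - b_i)^2
    #          + (rho/2) * sum_i R_{lam,gam,eps}(x_i)
    return 0.5 * np.mean((A @ x - b) ** 2) \
           + 0.5 * rho * smoothed_scad(x, lam, gam, eps).sum()
```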
Figure 2: Theoretical performances of RapGrad and Katalyst.
Figure 3: Empirical performances of RapGrad and Katalyst with early termination.
References
Allen-Zhu, Z. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 89–97, 2017.
Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM Symposium on Theory of Computing (STOC), 2017.
Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pp. 315–323, 2013.
Lan, G. and Yang, Y. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/1805.05411, 2018.
Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 3384–3392, 2015.
Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84, pp. 613–622, 2018.
Reddi, S. J., Sra, S., Póczos, B., and Smola, A. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 1145–1153, 2016.
Zhou, D. and Gu, Q. Lower bounds for smooth nonconvex finite-sum optimization. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3925–3936, 2018.