Regularized Nonlinear Acceleration.
Alexandre d’Aspremont, CNRS & D.I. École Normale Supérieure, with Damien Scieur & Francis Bach. Support from ERC SIPA and ITN SpaRTaN.
Alex d’Aspremont Huatulco, January 2018. 1/30
Optimization problem:

    min_{x ∈ R^n} f(x)
Classical example [Aitken, 1927]: estimate π from the partial sums

    π = lim_{k→∞} 4 Σ_{i=0}^k (−1)^i / (2i+1),

which converge very slowly; extrapolating from a few partial sums gives a much more accurate estimate.
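A minimal numerical sketch (my illustration, not from the slides): Aitken’s Δ² transform applied to the partial sums of the Leibniz series for π.

```python
import numpy as np

# Partial sums S_k = 4 * sum_{i=0}^{k} (-1)^i / (2i+1), converging slowly to pi.
S = 4 * np.cumsum([(-1) ** i / (2 * i + 1) for i in range(20)])

# Aitken's Delta-squared transform:
# T_k = S_k - (S_{k+1} - S_k)^2 / (S_{k+2} - 2 S_{k+1} + S_k).
T = S[:-2] - (S[1:-1] - S[:-2]) ** 2 / (S[2:] - 2 * S[1:-1] + S[:-2])

print(abs(S[-1] - np.pi))  # raw partial-sum error, roughly 5e-2
print(abs(T[-1] - np.pi))  # extrapolated error, much smaller
```

The extrapolated estimate is accurate to several more digits than the raw partial sums, from the same data.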
Similar results apply to sequences whose errors satisfy a low-order linear recurrence. This produces Wynn’s ε-algorithm [Wynn, 1956]; see [Brezinski, 1977] for a survey on acceleration and extrapolation. Directly related to the Levinson-Durbin algorithm on AR processes. Vector case: focus on Minimal Polynomial Extrapolation [Sidi et al., 1986].
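Wynn’s ε-algorithm is short enough to sketch. This toy implementation of the standard ε-table recurrence ε_{k+1}^{(n)} = ε_{k−1}^{(n+1)} + 1/(ε_k^{(n+1)} − ε_k^{(n)}) is my illustration; the function name and the test sequence are assumptions.

```python
import math

def wynn_epsilon(S):
    """Build the columns of Wynn's epsilon table from a scalar sequence S."""
    prev2 = [0.0] * (len(S) + 1)   # epsilon_{-1} column: all zeros
    prev1 = list(S)                # epsilon_0 column: the sequence itself
    cols = [prev1]
    for _ in range(len(S) - 1):
        cur = [prev2[i + 1] + 1.0 / (prev1[i + 1] - prev1[i])
               for i in range(len(prev1) - 1)]
        prev2, prev1 = prev1, cur
        cols.append(cur)
    return cols  # even-index columns hold the accelerated estimates

# Accelerate the Leibniz series for pi from only 9 partial sums.
S = [4 * sum((-1) ** i / (2 * i + 1) for i in range(n + 1)) for n in range(9)]
eps = wynn_epsilon(S)
print(abs(S[-1] - math.pi), abs(eps[8][0] - math.pi))
```

The odd columns of the table are only intermediate quantities; the even columns (here the last, ε₈) carry the extrapolated values.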
Outline: Introduction; Minimal Polynomial Extrapolation; Regularized MPE; Numerical results.

Minimal Polynomial Extrapolation
Linear case: the iterates satisfy x_{k+1} = A x_k + b, with fixed point x∗ = A x∗ + b, so that

    x_k − x∗ = A^k (x_0 − x∗).

Consider a polynomial p(x) = Σ_{i=0}^k c_i x^i, normalized so that p(1) = Σ_{i=0}^k c_i = 1. Setting c such that p(A)(x_0 − x∗) = 0, we would have

    Σ_{i=0}^k c_i x_i − x∗ = Σ_{i=0}^k c_i A^i (x_0 − x∗) = p(A)(x_0 − x∗) = 0.

Get the limit by averaging iterates (using weights depending on x_k). We typically do not observe A (or x∗). How do we extract c from the iterates x_k?
Equivalently, working with differences of iterates: find c such that

    Σ_{i=1}^k c_i (x_i − x_{i−1}) = 0,  with  Σ_{i=1}^k c_i = 1.
For k smaller than the degree of the minimal polynomial, we find c that minimizes the residual ‖Uc‖₂. Setting U ∈ R^{n×(k+1)}, with U_i = x_{i+1} − x_i, this means solving

    minimize ‖Uc‖₂  subject to 1^T c = 1    (AMPE)

in the variable c ∈ R^{k+1}. Also known as the Eddy-Mešina method [Eddy, 1979, Mešina, 1977].
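To make (AMPE) concrete, a small NumPy sketch on a synthetic symmetric linear iteration; the problem sizes and the closed-form solve z = (U^T U)^{−1}1, c = z/(1^T z) (valid when U^T U is invertible) are my choices.

```python
import numpy as np

# AMPE on iterates of x_{k+1} = A x_k + b, with symmetric A, spectrum in [0.05, 0.9].
rng = np.random.default_rng(0)
n, k = 20, 6
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(np.linspace(0.05, 0.9, n)) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(np.eye(n) - A, b)      # fixed point x* = A x* + b

xs = [np.zeros(n)]
for _ in range(k + 1):
    xs.append(A @ xs[-1] + b)
X = np.stack(xs, axis=1)                        # columns x_0, ..., x_{k+1}
U = X[:, 1:] - X[:, :-1]                        # U_i = x_{i+1} - x_i

# min ||U c||_2  s.t.  1^T c = 1   <=>   z = (U^T U)^{-1} 1,  c = z / (1^T z)
z = np.linalg.solve(U.T @ U, np.ones(k + 1))
c = z / z.sum()
x_ampe = X[:, :-1] @ c                          # extrapolated point

print(np.linalg.norm(xs[-1] - x_star))          # last plain iterate
print(np.linalg.norm(x_ampe - x_star))          # extrapolation: much closer
```

On exactly linear iterates like these, the unregularized problem is solvable; the conditioning issues discussed later are why a regularized variant is needed in practice.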
With c∗ the solution of (AMPE), the error ‖Σ_{i=0}^k c∗_i x_i − x∗‖₂ is controlled by the residual, with

    ‖Σ_{i=0}^k c∗_i U_i‖₂ ≤ max_{i=1,...,n} |p(λ_i)| ‖x_1 − x_0‖₂,

where the λ_i are the eigenvalues of A. When the spectrum of A lies in [0, σ], the worst case is bounded by the Chebyshev-type problem

    min_{p ∈ R_k[x]: p(1)=1}  max_{v ∈ [0,σ]} |p(v)|.
Rescaled Chebyshev polynomials solve this problem explicitly, giving a bound of the form

    ‖Σ_{i=0}^k c∗_i x_i − x∗‖₂ ≤ (2β^k / (1 + β^{2k})) ‖x_0 − x∗‖₂,  β = (1 − √(µ/L)) / (1 + √(µ/L)),

i.e. the accelerated convergence rate.
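For reference, the optimal polynomial in this min–max problem is a rescaled Chebyshev polynomial; the closed form below is the standard result, spelled out here for completeness, with σ the upper bound on the spectrum:

```latex
p_k^*(v) \;=\; \frac{T_k\!\left(\tfrac{2v-\sigma}{\sigma}\right)}{T_k\!\left(\tfrac{2-\sigma}{\sigma}\right)},
\qquad
\min_{p(1)=1}\ \max_{v\in[0,\sigma]} |p(v)| \;=\; \frac{1}{T_k\!\left(\tfrac{2-\sigma}{\sigma}\right)}
\;=\; \frac{2\beta^{k}}{1+\beta^{2k}},
\qquad
\beta \;=\; \frac{1-\sqrt{1-\sigma}}{1+\sqrt{1-\sigma}},
```

where T_k is the Chebyshev polynomial of the first kind; with σ = 1 − µ/L this gives β = (1 − √(µ/L))/(1 + √(µ/L)).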
Key difference with conjugate gradient: we do not observe A... Chebyshev polynomials satisfy a two-step recurrence, so for quadratic f the optimal method can be written with a gradient step

    x_k = y_{k−1} − (1/L)(B y_{k−1} − b),

combined with a momentum term whose weights follow the Chebyshev recurrence α_k = (2(2 − σ)/σ) α_{k−1} − α_{k−2}. Nesterov’s acceleration recursively computes a similar polynomial, using the same gradient step x_k = y_{k−1} − (1/L)(B y_{k−1} − b) with fixed momentum weights.
For smooth strongly convex f, the gradient method iterates x_{k+1} = g(x_k) with g(x) = x − ∇f(x)/L. This means x_{k+1} − x∗ = A(x_k − x∗) + O(‖x_k − x∗‖₂²), where

    A = g′(x∗) = I − ∇²f(x∗)/L,

so the spectrum of A lies in [0, σ] with σ = 1 − µ/L, whenever µI ⪯ ∇²f(x) ⪯ LI. Approximation error is a sum of three terms.
The iterations span a Krylov subspace: in the linear case, U_i = x_{i+1} − x_i = A^i (x_1 − x_0). Similar to the Hankel or Toeplitz case, U^T U has a condition number typically growing exponentially with k. In fact, the Hankel, Toeplitz and Krylov problems are directly connected, hence share the same conditioning issues [Tyrtyshnikov, 1994, Heinig and Rost, 2011]. For generic optimization problems, eigenvalues are further perturbed by deviations from linearity, making (AMPE) highly unstable numerically.
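A quick numerical check of this conditioning blow-up (synthetic setup; the dimensions and spectrum are my choices):

```python
import numpy as np

# Condition number of U^T U for gradient-descent differences on a toy quadratic:
# it grows by orders of magnitude with k, which is why (AMPE) needs regularization.
rng = np.random.default_rng(0)
n = 50
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
H = Q @ np.diag(np.linspace(0.1, 1.0, n)) @ Q.T    # Hessian with mu = 0.1, L = 1
b = rng.standard_normal(n)

x, diffs = np.zeros(n), []
for _ in range(8):
    x_new = x - (H @ x - b)                        # gradient step, step size 1/L
    diffs.append(x_new - x)
    x = x_new
U = np.stack(diffs, axis=1)

for k in (2, 4, 6, 8):
    print(k, np.linalg.cond(U[:, :k].T @ U[:, :k]))
```

The columns of U are (near) powers of a contraction applied to one vector, so they align quickly and the Gram matrix degenerates, just as for Hankel and Toeplitz matrices.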
Regularized MPE
Regularized formulation of problem (AMPE),

    minimize ‖Uc‖₂² + λ‖c‖₂²  subject to 1^T c = 1,    (RMPE)

in the variable c ∈ R^{k+1}. Solution given by a linear system of size k + 1:

    c∗_λ = (U^T U + λI)^{−1} 1 / (1^T (U^T U + λI)^{−1} 1).
Regularized Nonlinear Acceleration (RNA):
 1: Form U = [x_1 − x_0, ..., x_{k+1} − x_k]
 2: Solve the linear system (U^T U + λI)z = 1
 3: Set c = z/(z^T 1)
Output: Σ_{i=0}^k c_i x_i, approximating the optimum x∗
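A compact NumPy sketch of these three steps (the quadratic test problem and the fixed choice λ = 1e−8 are my assumptions; in practice λ is tuned or scaled relative to ‖U‖²):

```python
import numpy as np

def rna(xs, lam=1e-8):
    """Regularized Nonlinear Acceleration: extrapolate from iterates xs[0..k+1]."""
    X = np.stack(xs, axis=1)              # columns x_0, ..., x_{k+1}
    U = X[:, 1:] - X[:, :-1]              # step 1: U_i = x_{i+1} - x_i
    m = U.shape[1]                        # m = k + 1 weights
    z = np.linalg.solve(U.T @ U + lam * np.eye(m), np.ones(m))  # step 2
    c = z / z.sum()                       # step 3: normalize so 1^T c = 1
    return X[:, :-1] @ c                  # output: sum_i c_i x_i

# Toy problem: gradient descent on a strongly convex quadratic.
rng = np.random.default_rng(1)
n = 30
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
H = Q @ np.diag(np.linspace(0.05, 1.0, n)) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(H, b)

xs = [np.zeros(n)]
for _ in range(10):
    xs.append(xs[-1] - (H @ xs[-1] - b))  # gradient step, step size 1/L = 1
x_acc = rna(xs)
print(np.linalg.norm(xs[-1] - x_star))    # plain gradient error
print(np.linalg.norm(x_acc - x_star))     # extrapolated error, much smaller
```

The extrapolation is a pure postprocessing step: the gradient iterates are produced first, untouched, and the weighted average is computed from them afterwards.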
Define the regularized Chebyshev problem

    S(k, α) ≜ min_{q ∈ R_k[x]: q(1)=1}  max_{x ∈ [0,σ]} ((1 − x)q(x))² + α‖q‖₂².

Suppose c∗_λ solves problem (RMPE) and assume A = g′(x∗). The extrapolated point x̃_λ = Σ_{i=0}^k (c∗_λ)_i x_i then satisfies

    ‖x̃_λ − x∗‖₂ ≤ C(E, P, λ) S(k, λ/‖x_0 − x∗‖₂²)^{1/2} ‖x_0 − x∗‖₂,

for an explicit constant C(E, P, λ) depending on the perturbations.
With λ chosen appropriately, this yields a bound of the form

    ‖x̃_λ − x∗‖₂ ≲ κ β^k ‖x_0 − x∗‖₂,  β = (1 − √(µ/L)) / (1 + √(µ/L)),

i.e. an accelerated rate, where κ = L/µ is the condition number.
Stochastic case: we only observe noisy iterates x̃_i, and extrapolate using the perturbed combination Σ_{i=0}^k c̃_i x̃_i.
Two limiting regimes for the regularization parameter. When the noise scale τ → 0, with λ in a suitable range, we recover the accelerated rate,

    ‖Σ_{i=0}^k c̃_i x̃_i − x∗‖ = O(β^k),  β = (√κ − 1)/(√κ + 1),

with κ = L/µ the condition number. If λ → ∞, we recover the averaged gradient,

    Σ_{i=0}^k c̃_i x̃_i = (1/(k+1)) Σ_{i=0}^k x̃_i,

since all the weights then tend to 1/(k+1).
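A quick sanity check of the λ → ∞ limit (toy matrix, my setup): the normalized weights flatten out to 1/(k+1).

```python
import numpy as np

# As lambda -> infinity, the RMPE weights
#   c = (U^T U + lam I)^{-1} 1 / (1^T (U^T U + lam I)^{-1} 1)
# tend to the uniform weights 1/(k+1), i.e. plain iterate averaging.
rng = np.random.default_rng(0)
U = rng.standard_normal((20, 5))                 # k + 1 = 5 weights
ones = np.ones(5)
for lam in (1e-2, 1e2, 1e8):
    z = np.linalg.solve(U.T @ U + lam * np.eye(5), ones)
    c = z / (z @ ones)
    print(lam, np.round(c, 4))
```

With lam = 1e8 the printed weights are uniform to several digits, since the λI term completely dominates U^T U.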
Numerical results
[Figure: error versus iterations for Gradient, Nesterov, RMPE 5 and RMPE 5 + LS (line search), on two problem instances.]
[Figure: f(x) − f(x∗) versus epochs and versus time (sec) for SAGA, SGD, SVRG and Katyusha, with and without extrapolation (AccSAGA, AccSGD, AccSVRG, AccKat.).]
[Figure: SGD + momentum versus RNA + SGD + momentum, two training-curve panels over 200 epochs.]
Simple postprocessing step. Marginal complexity, can be performed in parallel. Significant convergence speedup over optimal methods.
Open directions: extrapolating accelerated methods; constrained problems; better handling of smooth functions; ...
The VAR matrix A is formed implicitly, but we have some information on its spectrum. Explicit bounds on the regularized Chebyshev problem,

    min_{q ∈ R_k[x]: q(1)=1}  max_{x ∈ [0,σ]} ((1 − x)q(x))² + α‖q‖₂²,

control the performance of the extrapolation.
References

A. C. Aitken. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.
C. Brezinski. Accélération de la convergence en analyse numérique. Springer, 1977.
R. P. Eddy. Extrapolating to the limit of a vector sequence. Information Linkage between Applied Mathematics and Industry, pages 387–396, 1979.
G. Heinig and K. Rost. Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra and its Applications, 435(1):1–59, 2011.
M. Mešina. Convergence acceleration for the iterative solution of the equations X = AX + f. Computer Methods in Applied Mechanics and Engineering, 10(2):165–173, 1977.
A. S. Nemirovskiy and B. T. Polyak. Iterative methods for solving linear ill-posed problems under precise information. Engineering Cybernetics, (4):50–56, 1984.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
D. Scieur, A. d'Aspremont, and F. Bach. Nonlinear acceleration of stochastic algorithms. arXiv preprint arXiv:1706.07270, 2017.
A. Sidi, W. F. Ford, and D. A. Smith. Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis, 23(1):178–196, 1986.
D. A. Smith, W. F. Ford, and A. Sidi. Extrapolation methods for vector sequences. SIAM Review, 29(2):199–233, 1987.
E. E. Tyrtyshnikov. How bad are Hankel matrices? Numerische Mathematik, 67(2):261–269, 1994.
P. Wynn. On a device for computing the e_m(S_n) transformation. Mathematical Tables and Other Aids to Computation, 10(54):91–96, 1956.