Regularized Nonlinear Acceleration.

Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure. With Damien Scieur & Francis Bach. Support from ERC SIPA and ITN SpaRTaN.

Alex d'Aspremont, Huatulco, January 2018.


Introduction

Generic convex optimization problem:
\[
\min_{x \in \mathbb{R}^n} f(x)
\]


Introduction

Algorithms produce a sequence of iterates. We only keep the last (or best) one...


Introduction

Aitken's Δ² [Aitken, 1927]. Given a sequence $\{s_k\}_{k=1,\dots} \subset \mathbb{R}$ with limit $s^*$, suppose
\[
s_{k+1} - s^* = a\,(s_k - s^*), \quad k = 1, \dots
\]
We can compute $a$ using
\[
s_{k+1} - s_k = a\,(s_k - s_{k-1}) \;\Rightarrow\; a = \frac{s_{k+1} - s_k}{s_k - s_{k-1}}
\]
and get the limit $s^*$ by solving
\[
s_{k+1} - s^* = \frac{s_{k+1} - s_k}{s_k - s_{k-1}}\,(s_k - s^*),
\]
which yields
\[
s^* = \frac{s_{k-1}\, s_{k+1} - s_k^2}{s_{k+1} - 2 s_k + s_{k-1}}.
\]
This is Aitken's Δ², and it allows us to compute $s^*$ from $\{s_{k+1}, s_k, s_{k-1}\}$.


Introduction

Convergence acceleration. Consider the partial sums
\[
s_k = \sum_{i=0}^{k} \frac{(-1)^i}{2i + 1} \;\xrightarrow{\;k \to \infty\;}\; \frac{\pi}{4} = 0.785398\dots
\]
We have:

  k    1/(2k+1)     s_k        Δ²
  0    1.0000       1.00000    -
  1    0.33333      0.66667    -
  2    0.2          0.86667    0.79167
  3    0.14286      0.72381    0.78333
  4    0.11111      0.83492    0.78631
  5    0.090909     0.74401    0.78492
  6    0.076923     0.82093    0.78568
  7    0.066667     0.75427    0.78522
  8    0.058824     0.81309    0.78552
  9    0.052632     0.76046    0.78531

The Δ² column converges to π/4 much faster than the partial sums themselves.
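As a quick illustration, here is a minimal Python sketch of Aitken's Δ² applied to these partial sums; it reproduces the last column of the table above (the function and variable names are ours, not from the slides):

```python
import numpy as np

def aitken(s):
    """Aitken's Delta^2: extrapolate the limit from consecutive triples
    (s[k-1], s[k], s[k+1]) via s* = (s[k-1]*s[k+1] - s[k]^2) / (s[k+1] - 2*s[k] + s[k-1])."""
    s = np.asarray(s, dtype=float)
    return (s[:-2] * s[2:] - s[1:-1] ** 2) / (s[2:] - 2 * s[1:-1] + s[:-2])

# Partial sums of the Leibniz series, converging slowly to pi/4.
terms = [(-1) ** i / (2 * i + 1) for i in range(10)]
s = np.cumsum(terms)
print(s[-1])          # 0.76046  (plain partial sum, k = 9)
print(aitken(s)[-1])  # 0.78531  (Delta^2, much closer to pi/4 = 0.785398...)
```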


Introduction

Convergence acceleration.

Similar results apply to sequences satisfying
\[
\sum_{i=0}^{k} a_i (s_{n+i} - s^*) = 0,
\]
using Aitken's ideas recursively. This produces Wynn's ε-algorithm [Wynn, 1956]. See [Brezinski, 1977] for a survey on acceleration and extrapolation. Directly related to the Levinson-Durbin algorithm on AR processes. Vector case: focus on Minimal Polynomial Extrapolation [Sidi et al., 1986].

Overall: a simple postprocessing step.


Outline

Introduction
Minimal Polynomial Extrapolation
Regularized MPE
Numerical results


Minimal Polynomial Extrapolation

Quadratic example. Minimize
\[
f(x) = \tfrac{1}{2}\|Bx - b\|_2^2
\]
using the basic gradient algorithm, with
\[
x_{k+1} := x_k - \tfrac{1}{L}\big(B^T B x_k - B^T b\big).
\]
We get
\[
x_{k+1} - x^* = \underbrace{\big(I - \tfrac{1}{L} B^T B\big)}_{A}\,(x_k - x^*)
\]
since $B^T B x^* = B^T b$. This means $x_{k+1} - x^*$ follows a vector autoregressive process.
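A two-line numerical check of this linear iteration (a sketch with made-up data; nothing here is from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((8, n))
b = rng.standard_normal(8)

L = np.linalg.norm(B.T @ B, 2)          # Lipschitz constant of the gradient
x_star = np.linalg.lstsq(B, b, rcond=None)[0]
A = np.eye(n) - (B.T @ B) / L           # the VAR matrix from the slide

x = np.zeros(n)
x_next = x - (B.T @ (B @ x - b)) / L    # one gradient step
# x_{k+1} - x* equals A (x_k - x*), up to numerical error:
print(np.allclose(x_next - x_star, A @ (x - x_star)))  # True
```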


Minimal Polynomial Extrapolation

We have
\[
\sum_{i=0}^{k} c_i (x_i - x^*) = \sum_{i=0}^{k} c_i A^i (x_0 - x^*),
\]
and setting $\mathbf{1}^T c = 1$ yields
\[
\Big(\sum_{i=0}^{k} c_i x_i\Big) - x^* = p(A)(x_0 - x^*),
\]
where $p(v) = \sum_{i=0}^{k} c_i v^i$. Choosing $c$ such that $p(A)(x_0 - x^*) = 0$, we would have
\[
x^* = \sum_{i=0}^{k} c_i x_i.
\]
Get the limit by averaging iterates (using weights depending on the $x_k$). We typically do not observe $A$ (or $x^*$). How do we extract $c$ from the iterates $x_k$?


Minimal Polynomial Extrapolation

We have
\[
x_{i+1} - x_i = (x_{i+1} - x^*) - (x_i - x^*) = (A - I)\,A^{i}\,(x_0 - x^*),
\]
hence, multiplying by $c_i$ and summing,
\[
\sum_{i=0}^{k} c_i (x_{i+1} - x_i) = (A - I)\,p(A)\,(x_0 - x^*).
\]
So if $(A - I)$ is nonsingular, $p(A)(x_0 - x^*) = 0$ holds exactly when the coefficient vector $c$ solves the linear system
\[
\begin{cases}
\sum_{i=0}^{k} c_i (x_{i+1} - x_i) = 0\\[2pt]
\sum_{i=0}^{k} c_i = 1
\end{cases}
\]
and $p(\cdot)$ is the minimal polynomial of $A$ w.r.t. $(x_0 - x^*)$.
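A minimal numerical sketch of this system (the function name and setup are ours); it stacks the normalization row under the differences and solves in the least squares sense, which matches the exact system whenever it is feasible:

```python
import numpy as np

def mpe_coeffs(X):
    """Given iterates X = [x_0, ..., x_{k+1}] as columns, solve
    sum_i c_i (x_{i+1} - x_i) = 0  with  sum_i c_i = 1  (least squares sense)."""
    U = np.diff(X, axis=1)                   # columns U_i = x_{i+1} - x_i
    M = np.vstack([U, np.ones((1, U.shape[1]))])   # append normalization row
    rhs = np.zeros(M.shape[0]); rhs[-1] = 1.0
    return np.linalg.lstsq(M, rhs, rcond=None)[0]
```

When $k$ is at least the degree of the minimal polynomial of $A$ w.r.t. $(x_0 - x^*)$, the difference block has zero residual and `X[:, :-1] @ c` recovers $x^*$ up to rounding; below that degree this stacked solve is only a heuristic, and the proper least squares formulation is (AMPE) on the next slide.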


Approximate Minimal Polynomial Extrapolation

Approximate MPE. For $k$ smaller than the degree of the minimal polynomial, we find $c$ that minimizes the residual
\[
\big\|(A - I)\,p(A)\,(x_0 - x^*)\big\|_2 = \Big\|\sum_{i=0}^{k} c_i (x_{i+1} - x_i)\Big\|_2.
\]
Setting $U \in \mathbb{R}^{n \times (k+1)}$ with $U_i = x_{i+1} - x_i$, this means solving
\[
c^* = \mathop{\rm argmin}_{\mathbf{1}^T c = 1} \|Uc\|_2 \tag{AMPE}
\]
in the variable $c \in \mathbb{R}^{k+1}$.

Also known as the Eddy-Mešina method [Mešina, 1977, Eddy, 1979] or Reduced Rank Extrapolation with arbitrary $k$ (see [Smith et al., 1987, §10]). Very similar to Anderson acceleration, GMRES, etc.
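A minimal sketch of (AMPE) in Python; the equality constraint is handled by eliminating one coefficient (helper names are ours, not from the slides):

```python
import numpy as np

def ampe(X):
    """Solve min ||U c||_2 s.t. 1^T c = 1 and return the extrapolated point
    sum_i c_i x_i, given iterates X = [x_0, ..., x_{k+1}] as columns."""
    U = np.diff(X, axis=1)                 # U_i = x_{i+1} - x_i
    k1 = U.shape[1]
    # Eliminate the constraint: c = e_0 + N d, where columns of N span {c : 1^T c = 0}.
    N = np.vstack([-np.ones((1, k1 - 1)), np.eye(k1 - 1)])
    e0 = np.zeros(k1); e0[0] = 1.0
    d = np.linalg.lstsq(U @ N, -U @ e0, rcond=None)[0]
    c = e0 + N @ d
    return X[:, :k1] @ c                   # weighted average of x_0, ..., x_k
```

On the quadratic example above, feeding `ampe` a handful of gradient iterates typically lands far closer to $x^*$ than the last iterate does.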


Uniform Bound

Chebyshev polynomials. Crude bound on $\|Uc^*\|_2$ using Chebyshev polynomials, to bound the error as a function of $k$, with
\[
\Big\|\sum_{i=0}^{k} c_i^* x_i - x^*\Big\|_2 = \Big\|(I - A)^{-1} \sum_{i=0}^{k} c_i^* U_i\Big\|_2 \le \|(I - A)^{-1}\|_2\, \|p(A)(x_1 - x_0)\|_2.
\]
We have
\[
\|p(A)(x_1 - x_0)\|_2 \le \|p(A)\|_2\, \|x_1 - x_0\|_2 = \max_{i = 1, \dots, n} |p(\lambda_i)|\; \|x_1 - x_0\|_2,
\]
where $0 \le \lambda_i \le \sigma$ are the eigenvalues of $A$. It suffices to find $p(\cdot) \in \mathbb{R}_k[x]$ solving
\[
\inf_{\{p \in \mathbb{R}_k[x] :\, p(1) = 1\}} \;\sup_{v \in [0, \sigma]} |p(v)|.
\]
Explicit solution using modified Chebyshev polynomials.
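The minimax value has a classical closed form via the rescaled Chebyshev polynomial $T_k((2x - \sigma)/\sigma)\,/\,T_k((2 - \sigma)/\sigma)$; a small sketch evaluating it (the helper name is ours):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_minimax(k, sigma):
    """Value of min_{p(1)=1} max_{x in [0, sigma]} |p(x)|, attained by the
    Chebyshev polynomial T_k rescaled from [-1, 1] to [0, sigma]."""
    t1 = (2 - sigma) / sigma            # image of x = 1 under the affine map
    Tk = C.Chebyshev.basis(k)
    return 1.0 / Tk(t1)

for k in [1, 3, 5, 7]:
    print(k, cheb_minimax(k, 0.85))     # decreases geometrically with k, cf. next slide
```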


Uniform Bound using Chebyshev Polynomials

[Figure: rescaled Chebyshev polynomials $T_3(x, \sigma)$ and $T_5(x, \sigma)$ for $x \in [0, 1]$ and $\sigma = 0.85$. The maximum value of $T_k$ on $[0, \sigma]$ decreases geometrically fast as $k$ grows.]


Approximate Minimal Polynomial Extrapolation

Proposition (AMPE convergence). Let $A$ be symmetric with $0 \preceq A \preceq \sigma I$, $\sigma < 1$, and let $c^*$ solve (AMPE). Then
\[
\Big\|\sum_{i=0}^{k} c_i^* x_i - x^*\Big\|_2 \le \kappa(A - I)\, \frac{2\zeta^k}{1 + \zeta^{2k}}\, \|x_0 - x^*\|_2, \tag{1}
\]
where $\kappa(A - I)$ is the condition number of the matrix $A - I$ and $\zeta$ is given by
\[
\zeta = \frac{1 - \sqrt{1 - \sigma}}{1 + \sqrt{1 - \sigma}} < \sigma. \tag{2}
\]
See also [Nemirovskiy and Polyak, 1984]. For the gradient method, $\sigma = 1 - \mu/L$, so
\[
\Big\|\sum_{i=0}^{k} c_i^* x_i - x^*\Big\|_2 \le \kappa(A - I) \left( \frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}} \right)^{k} \|x_0 - x^*\|_2.
\]


Approximate Minimal Polynomial Extrapolation

AMPE versus Nesterov and conjugate gradient.

Key difference with conjugate gradient: we do not observe $A$...

Chebyshev polynomials satisfy a two-step recurrence. For quadratic minimization using the gradient method:
\[
\begin{cases}
z_{k-1} = y_{k-1} - \tfrac{1}{L}(B y_{k-1} - b)\\[2pt]
y_k = \tfrac{\alpha_{k-1}}{\alpha_k}\Big(\tfrac{2 z_{k-1}}{\sigma} - y_{k-1}\Big) - \tfrac{\alpha_{k-2}}{\alpha_k}\, y_{k-2},
\end{cases}
\quad \text{where } \alpha_k = \tfrac{2 - \sigma}{\sigma}\,\alpha_{k-1} - \alpha_{k-2}.
\]
Nesterov's acceleration recursively computes a similar polynomial with
\[
\begin{cases}
z_{k-1} = y_{k-1} - \tfrac{1}{L}(B y_{k-1} - b)\\[2pt]
y_k = z_{k-1} + \beta_k (z_{k-1} - z_{k-2}),
\end{cases}
\]
see also [Hardt, 2013].
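For concreteness, a short sketch of the Nesterov recursion above on a strongly convex quadratic, with the constant momentum $\beta = (1 - \sqrt{\mu/L})/(1 + \sqrt{\mu/L})$ (our choice; the slides leave $\beta_k$ unspecified):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((50, 20))
B = B.T @ B + 0.1 * np.eye(20)      # symmetric positive definite
b = rng.standard_normal(20)
x_star = np.linalg.solve(B, b)

evals = np.linalg.eigvalsh(B)
mu, L = evals[0], evals[-1]
beta = (1 - np.sqrt(mu / L)) / (1 + np.sqrt(mu / L))

y = np.zeros(20)
z_prev = y.copy()
for _ in range(200):
    z = y - (B @ y - b) / L         # gradient step on f(y) = y^T B y / 2 - b^T y
    y = z + beta * (z - z_prev)     # momentum step
    z_prev = z
print(np.linalg.norm(z - x_star))   # far smaller than plain gradient descent's error
```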


Approximate Minimal Polynomial Extrapolation

Accelerating optimization algorithms. For gradient descent, we have
\[
\tilde{x}_{k+1} := \tilde{x}_k - \tfrac{1}{L} \nabla f(\tilde{x}_k).
\]
This means
\[
\tilde{x}_{k+1} - x^* = A(\tilde{x}_k - x^*) + O\big(\|\tilde{x}_k - x^*\|_2^2\big), \quad \text{where } A = I - \tfrac{1}{L} \nabla^2 f(x^*),
\]
so that $\|A\|_2 \le 1 - \mu/L$ whenever $\mu I \preceq \nabla^2 f(x) \preceq L I$.

The approximation error is a sum of three terms:
\[
\Big\|\sum_{i=0}^{k} \tilde{c}_i \tilde{x}_i - x^*\Big\|_2 \le \underbrace{\Big\|\sum_{i=0}^{k} c_i x_i - x^*\Big\|_2}_{\text{AMPE}} + \underbrace{\Big\|\sum_{i=0}^{k} (\tilde{c}_i - c_i)\, x_i\Big\|_2}_{\text{Stability}} + \underbrace{\Big\|\sum_{i=0}^{k} \tilde{c}_i (\tilde{x}_i - x_i)\Big\|_2}_{\text{Nonlinearity}}
\]
Stability is key here.


Approximate Minimal Polynomial Extrapolation

Stability. The iterations span a Krylov subspace
\[
\mathcal{K}_k = \mathop{\rm span}\big\{U_0,\, A U_0,\, \dots,\, A^{k-1} U_0\big\},
\]
so the matrix $U$ in (AMPE) is a Krylov matrix. Similar to the Hankel or Toeplitz case: the condition number of $U^T U$ typically grows exponentially with dimension [Tyrtyshnikov, 1994]. In fact, the Hankel, Toeplitz and Krylov problems are directly connected, hence the link with Levinson-Durbin [Heinig and Rost, 2011].

For generic optimization problems, eigenvalues are perturbed by deviations from the linear model, which can make the situation even worse. Be wise, regularize...
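A quick numerical illustration of this blow-up, using a random symmetric $A$ and a vector $u_0$ (a sketch; the exact growth rate depends on the spectrum):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(0, 0.9, n)) @ Q.T   # symmetric, 0 <= A <= 0.9 I
u0 = rng.standard_normal(n)

# Condition number of U^T U for growing Krylov matrices U = [u0, A u0, ..., A^{k-1} u0]
cols = [u0]
for k in range(2, 11):
    cols.append(A @ cols[-1])
    U = np.column_stack(cols)
    print(k, np.linalg.cond(U.T @ U))   # grows by orders of magnitude with k
```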


Outline

Introduction
Minimal Polynomial Extrapolation
Regularized MPE
Numerical results


Regularized Minimal Polynomial Extrapolation

Regularized AMPE. Add a regularization term to (AMPE). Regularized formulation of problem (AMPE):
\[
\begin{array}{ll}
\text{minimize} & c^T (U^T U + \lambda I)\, c\\
\text{subject to} & \mathbf{1}^T c = 1
\end{array} \tag{RMPE}
\]
The solution is given by a linear system of size $k + 1$:
\[
c_\lambda^* = \frac{(U^T U + \lambda I)^{-1} \mathbf{1}}{\mathbf{1}^T (U^T U + \lambda I)^{-1} \mathbf{1}}. \tag{3}
\]


Regularized Minimal Polynomial Extrapolation

RMPE algorithm.

Input: sequence $\{x_0, x_1, \dots, x_{k+1}\}$, parameter $\lambda > 0$.

1: Form $U = [x_1 - x_0, \dots, x_{k+1} - x_k]$
2: Solve the linear system $(U^T U + \lambda I) z = \mathbf{1}$
3: Set $c = z / (z^T \mathbf{1})$

Output: return $\sum_{i=0}^{k} c_i x_i$, approximating the optimum $x^*$.
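A direct transcription of this algorithm in Python, with a usage sketch on gradient descent iterates for a quadratic (the test problem and the choice of $\lambda$ proportional to $\|U^T U\|_2$ are ours, not from the slides):

```python
import numpy as np

def rmpe(X, lam):
    """RMPE: from iterates X = [x_0, ..., x_{k+1}] (columns) and lambda > 0,
    solve (U^T U + lam I) z = 1, normalize c = z / (1^T z), return sum_i c_i x_i."""
    U = np.diff(X, axis=1)                      # U = [x_1 - x_0, ..., x_{k+1} - x_k]
    z = np.linalg.solve(U.T @ U + lam * np.eye(U.shape[1]), np.ones(U.shape[1]))
    c = z / z.sum()
    return X[:, :-1] @ c                        # weighted combination of x_0, ..., x_k

# Usage sketch: accelerate plain gradient descent on a strongly convex quadratic.
rng = np.random.default_rng(3)
H = rng.standard_normal((30, 30)); H = H.T @ H + 1e-2 * np.eye(30)
b = rng.standard_normal(30)
L = np.linalg.eigvalsh(H)[-1]
x_star = np.linalg.solve(H, b)

xs = [np.zeros(30)]
for _ in range(10):
    xs.append(xs[-1] - (H @ xs[-1] - b) / L)    # 10 plain gradient steps
X = np.column_stack(xs)

lam = 1e-8 * np.linalg.norm(np.diff(X, axis=1), 2) ** 2
print(np.linalg.norm(X[:, -1] - x_star))        # error of the last gradient iterate
print(np.linalg.norm(rmpe(X, lam) - x_star))    # extrapolated point: typically much smaller
```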


Regularized Minimal Polynomial Extrapolation

Regularized AMPE. Define
\[
S(k, \alpha) \triangleq \min_{\{q \in \mathbb{R}_k[x] :\, q(1) = 1\}} \left\{ \max_{x \in [0, \sigma]} \big((1 - x)\, q(x)\big)^2 + \alpha \|q\|_2^2 \right\}.
\]

Proposition [Scieur, d'Aspremont, and Bach, 2016] (error bound). Let $X = [x_0, x_1, \dots, x_k]$, $\tilde{X} = [x_0, \tilde{x}_1, \dots, \tilde{x}_k]$ and $\kappa = \|(A - I)^{-1}\|_2$. Suppose $\tilde{c}_\lambda^*$ solves problem (RMPE) and assume $A = g'(x^*)$ is symmetric with $0 \preceq A \preceq \sigma I$, $\sigma < 1$. Writing the perturbation matrices $P = \tilde{U}^T \tilde{U} - U^T U$ and $E = X - \tilde{X}$, we have
\[
\|\tilde{X} \tilde{c}_\lambda^* - x^*\|_2 \le C(E, P, \lambda)\; S\big(k, \lambda / \|x_0 - x^*\|_2^2\big)^{1/2}\, \|x_0 - x^*\|_2,
\]
where
\[
C(E, P, \lambda) = \left( \kappa^2 + \frac{1}{\lambda}\Big(1 + \frac{\|P\|_2}{\lambda}\Big)^2 \|E\|_2^2 + \frac{\kappa \|P\|_2^2}{2\sqrt{\lambda}} \right)^{1/2}.
\]


Regularized Minimal Polynomial Extrapolation

Proposition [Scieur et al., 2016] (asymptotic acceleration). Use the gradient method with stepsize in $]0, 2/L[$ on an $L$-smooth, $\mu$-strongly convex function $f$ with $M$-Lipschitz-continuous Hessian. Then
\[
\|\tilde{X} \tilde{c}_\lambda^* - x^*\|_2 \le \kappa \left( 1 + \frac{(1 + \tfrac{1}{\beta})^2}{4 \beta^2} \right)^{1/2} \frac{2\zeta^k}{1 + \zeta^{2k}}\, \|x_0 - x^*\|, \quad \text{with } \zeta = \frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}},
\]
for $\|x_0 - x^*\|$ small enough, where $\lambda = \beta \|P\|_2$ and $\kappa = L/\mu$ is the condition number of the function $f$.

We (asymptotically) recover the accelerated rate in [Nesterov, 1983].


Regularized Minimal Polynomial Extrapolation

Stochastic optimization. Noisy oracles on the iterates (in practice, on the gradients):
\[
\tilde{x}_{t+1} = g(\tilde{x}_t) + \eta_{t+1},
\]
where $\eta_t$ is an independent noise term. Equivalently,
\[
\tilde{x}_{t+1} = x^* + G(\tilde{x}_t - x^*) + \varepsilon_{t+1},
\]
where $\|\mathbb{E}[\varepsilon_t]\| \le \nu$ and $\varepsilon_t$ has bounded variance, $\Sigma_t \preceq (\sigma^2/d)\, I$, with the noise-to-signal ratio $\tau \triangleq (\nu + \sigma)/\|x_0 - x^*\|$.

Proposition [Scieur, d'Aspremont, and Bach, 2017] (error bound). The accuracy of AMPE applied to the sequence $\{\tilde{x}_0, \dots, \tilde{x}_k\}$ is bounded by
\[
\frac{\mathbb{E}\big[\big\|\sum_{i=0}^{k} \tilde{c}_i\, \tilde{x}_i - x^*\big\|\big]}{\|x_0 - x^*\|} \le S_\kappa(k, \bar{\lambda}) \left( \frac{1}{\kappa^2} + \frac{O\big(\tau^2 (1 + \tau)^2\big)}{\bar{\lambda}^3} + O\Big( \frac{\tau^2 + \tau^2 (1 + \tau^2)}{\bar{\lambda}} \Big) \right).
\]


Regularized Minimal Polynomial Extrapolation

Stochastic optimization (continued).

When the noise scale $\tau \to 0$, if $\bar{\lambda} = \Theta(\tau^s)$ with $s \in \,]0, \tfrac{2}{3}[\,$, we recover the accelerated rate
\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i\, \tilde{x}_i - x^*\Big\|\Big] \le \frac{1}{\kappa} \left( \frac{1 - \sqrt{\kappa}}{1 + \sqrt{\kappa}} \right)^{k} \|x_0 - x^*\|
\]
(with $\kappa$ here denoting the inverse condition number $\mu/L$, so the rate lies in $(0, 1)$).

If $\bar{\lambda} \to \infty$, we recover the averaged gradient:
\[
\mathbb{E}\Big[\Big\|\sum_{i=0}^{k} \tilde{c}_i\, \tilde{x}_i - x^*\Big\|\Big] \;\to\; \mathbb{E}\Big[\Big\|\frac{1}{k+1}\sum_{i=0}^{k} \tilde{x}_i - x^*\Big\|\Big].
\]
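The $\bar{\lambda} \to \infty$ limit is easy to check numerically: as $\lambda$ grows, the RMPE weights collapse to uniform averaging (a sketch reusing the RMPE solve from above; the random $U$ is a stand-in for noisy differences):

```python
import numpy as np

rng = np.random.default_rng(4)
U = rng.standard_normal((30, 6))            # stand-in for noisy differences x_{i+1} - x_i

for lam in [1e-2, 1e2, 1e6]:
    z = np.linalg.solve(U.T @ U + lam * np.eye(6), np.ones(6))
    c = z / z.sum()
    print(lam, np.round(c, 3))
# As lam grows, c approaches the uniform weights (1/6, ..., 1/6): averaged iterates.
```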


Outline

Introduction
Minimal Polynomial Extrapolation
Regularized MPE
Numerical results


Numerical Results

[Figure: $f(x_k) - f(x^*)$ versus gradient oracle calls (left) and CPU time in seconds (right), for gradient descent, Nesterov's method, Nesterov with backtracking, RMPE with $k = 5$, and RMPE with $k = 5$ plus line search.]

Logistic regression with $\ell_2$ regularization on the Madelon dataset (500 features, 2000 data points), solved using several algorithms. The penalty parameter has been set to $10^2$ in order to have a condition number equal to $1.2 \times 10^9$.


Numerical Results

[Figure: algorithms shown are SAGA, SGD, SVRG, Katyusha and their accelerated variants AccSAGA, AccSGD, AccSVRG, AccKat.]

Optimization of quadratic loss (top) and logistic loss (bottom) with several algorithms, using the Sid dataset with bad conditioning. The experiments are done in Matlab. Left: error vs. epoch number. Right: error vs. time.


Numerical Results

[Figure: training loss (left) and test error in % (right) versus epoch, comparing SGD + momentum against RNA + SGD + momentum.]

Convergence acceleration. Training a ResNet-28-10 on the CIFAR dataset: value reached by the current iterate versus the extrapolated one (from the last 15 iterates). Training loss on the left, test error on the right. The training is restarted periodically at the extrapolated point; vertical lines mark learning rate decreases.


Conclusion

Postprocessing works.

- Regularized MPE yields asymptotically optimal rates.
- Simple postprocessing step: marginal complexity, can be performed in parallel.
- Significant convergence speedup over optimal methods.
- Adaptive: does not need knowledge of smoothness parameters.

Work in progress...

- Extrapolating accelerated methods.
- Constrained problems.
- Better handling of smooth functions.
- ...


Open problems

- Regularization: how do we account for the fact that we are estimating the limit of a VAR sequence with a fixed point?
- The VAR matrix $A$ is formed implicitly, but we have some information on its spectrum through smoothness.
- Explicit bounds on the regularized Chebyshev problem,
\[
S(k, \alpha) \triangleq \min_{\{q \in \mathbb{R}_k[x] :\, q(1) = 1\}} \left\{ \max_{x \in [0, \sigma]} \big((1 - x)\, q(x)\big)^2 + \alpha \|q\|_2^2 \right\}.
\]
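While the open question asks for explicit bounds, the value $S(k, \alpha)$ itself is a convex problem in the coefficients of $q$ and can be evaluated numerically. A discretized sketch, assuming cvxpy is available (the grid size, $\sigma$ and names are our choices, not from the slides):

```python
import numpy as np
import cvxpy as cp

def S(k, alpha, sigma=0.9, grid=200):
    """Numerically evaluate min_{q(1)=1} max_{x in [0,sigma]} ((1-x) q(x))^2 + alpha ||q||_2^2
    by discretizing [0, sigma]."""
    xs = np.linspace(0.0, sigma, grid)
    V = np.vander(xs, k + 1, increasing=True)      # V @ c evaluates q on the grid
    c = cp.Variable(k + 1)                         # coefficients of q
    obj = cp.max(cp.square(cp.multiply(1 - xs, V @ c))) + alpha * cp.sum_squares(c)
    prob = cp.Problem(cp.Minimize(obj), [cp.sum(c) == 1])  # q(1) = sum of coefficients
    prob.solve()
    return prob.value

print(S(5, 1e-6))   # close to the unregularized Chebyshev value
print(S(5, 1.0))    # regularization dominates and the value increases
```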

Preprints on ArXiv, NIPS 2016, 2017.



References

A. C. Aitken. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.

C. Brezinski. Accélération de la convergence en analyse numérique. Lecture Notes in Mathematics, 584, 1977.

R. P. Eddy. Extrapolating to the limit of a vector sequence. Information Linkage between Applied Mathematics and Industry, pages 387–396, 1979.

M. Hardt. The zen of gradient descent. Mimeo, 2013.

G. Heinig and K. Rost. Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra and its Applications, 435(1):1–59, 2011.

M. Mešina. Convergence acceleration for the iterative solution of the equations X = AX + f. Computer Methods in Applied Mechanics and Engineering, 10(2):165–173, 1977.

A. S. Nemirovskiy and B. T. Polyak. Iterative methods for solving linear ill-posed problems under precise information. Engineering Cybernetics, (4):50–56, 1984.

Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

D. Scieur, A. d'Aspremont, and F. Bach. Regularized nonlinear acceleration. NIPS, 2016.

D. Scieur, A. d'Aspremont, and F. Bach. Nonlinear acceleration of stochastic algorithms. arXiv preprint arXiv:1706.07270, 2017.

A. Sidi, W. F. Ford, and D. A. Smith. Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis, 23(1):178–196, 1986.

D. A. Smith, W. F. Ford, and A. Sidi. Extrapolation methods for vector sequences. SIAM Review, 29(2):199–233, 1987.

E. E. Tyrtyshnikov. How bad are Hankel matrices? Numerische Mathematik, 67(2):261–269, 1994.

P. Wynn. On a device for computing the e_m(S_n) transformation. Mathematical Tables and Other Aids to Computation, 10(54):91–96, 1956.