SLIDE 1

Exact rate of Nesterov Scheme

Vasileios Apidopoulos, Jean-François Aujol, Charles Dossal, Aude Rondepierre

INSA de Toulouse, Institut de Mathématiques de Toulouse

January 2019, MIA Conference

SLIDE 2

The setting

Minimize a differentiable function
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. We want to build an efficient sequence to estimate

$$\operatorname*{arg\,min}_{x \in \mathbb{R}^n} F(x) \qquad (1)$$

SLIDE 3

Gradient descent

Explicit Gradient Descent
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. Gradient descent: for $h < \frac{2}{L}$,

$$x_{n+1} = x_n - h\nabla F(x_n) \qquad (2)$$

The sequence $(x_n)_{n\in\mathbb{N}}$ converges to a minimizer of F and

$$F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn} \qquad (3)$$
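Scheme (2) and the bound (3) can be checked on a toy problem. The quadratic F, step size, and starting point below are illustrative assumptions for the demo, not taken from the slides:

```python
import numpy as np

def gradient_descent(grad_F, x0, h, n_iter):
    """Explicit gradient descent: x_{n+1} = x_n - h * grad_F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - h * grad_F(x)
    return x

# Toy example: F(x) = 0.5 ||x||^2, so grad F(x) = x, L = 1 and x* = 0.
F = lambda x: 0.5 * np.dot(x, x)
grad_F = lambda x: x
x0, h, n = np.array([3.0, -4.0]), 0.3, 50   # h = 0.3 < 2/L = 2

x_n = gradient_descent(grad_F, x0, h, n)
bound = np.dot(x0, x0) / (2 * h * n)        # ||x0 - x*||^2 / (2 h n), cf. (3)
assert F(x_n) <= bound
```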

SLIDE 4

Nesterov inertial scheme

Nesterov Scheme: for $h < \frac{1}{L}$ and $\alpha \ge 3$,

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) \qquad (4)$$

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right) \qquad (5)$$

Nesterov (84) proposes α = 3.
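A minimal sketch of scheme (4), again on an illustrative toy quadratic (the function, step size, starting point, and iteration count are assumptions for the demo):

```python
import numpy as np

def nesterov(grad_F, x0, h, alpha, n_iter):
    """Nesterov scheme (4): y_n = x_n + n/(n + alpha) * (x_n - x_{n-1}),
    then x_{n+1} = y_n - h * grad_F(y_n), started with x_1 = x_0."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(1, n_iter + 1):
        y = x + (n / (n + alpha)) * (x - x_prev)
        x_prev, x = x, y - h * grad_F(y)
    return x

# Toy example: F(x) = 0.5 ||x||^2 (L = 1, minimizer x* = 0)
F = lambda x: 0.5 * np.dot(x, x)
x_n = nesterov(lambda x: x, np.array([3.0, -4.0]), h=0.5, alpha=3.0, n_iter=200)
# Consistent with the O(1/n^2) rate (5), the error is tiny after 200 steps
assert F(x_n) < 1e-2
```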

SLIDE 5

The questions

Can we be more precise than $O(1/n^2)$ with more information on F?
Is Nesterov really an acceleration of Gradient descent?

The answers

Yes... with strong convexity: Su et al. (15), Attouch et al. (17). We give a more accurate answer for more general geometry.
In many numerical problems Nesterov is more efficient, but not always. The real answer is... Nesterov may be more efficient than GD, or not.
SLIDE 6

Outline

• Gradient descent and growth condition.
• State of the art on the Nesterov scheme.
• New rates for Nesterov schemes.
• Proofs coming from an ODE study.

SLIDE 7

Gradient Descent and Geometry

Growth condition
A function F satisfies condition L(γ) if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$

$$d(x, X^*)^\gamma \le K\,(F(x) - F(x^*)) \qquad (6)$$

Theorem (Garrigos et al.)
If F satisfies condition L(γ) with γ > 2 then

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{\gamma}{\gamma-2}}}\right) \qquad (7)$$

If F satisfies condition L(2) then there exists $a > 0$ such that

$$F(x_n) - F(x^*) = O\!\left(e^{-an}\right) \qquad (8)$$
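The γ > 2 case can be observed empirically. Below, F(x) = x⁴ satisfies L(4) (with K = 1, since d(x, X*)⁴ = x⁴ = F(x) − F(x*)), and the gradient-descent error decays with a log-log slope close to −γ/(γ−2) = −2. The step size and iteration range are illustrative assumptions:

```python
import numpy as np

# F(x) = x^4 satisfies L(4): d(x, X*)^4 <= 1 * (F(x) - F(x*)) with X* = {0}.
F = lambda x: x**4
grad_F = lambda x: 4 * x**3

x, h = 1.0, 0.05
vals = []
for n in range(1, 2001):
    x -= h * grad_F(x)
    vals.append(F(x))

# Empirical decay exponent of F(x_n) between n = 1000 and n = 2000;
# it comes out close to -gamma/(gamma-2) = -2 for gamma = 4
slope = np.log(vals[1999] / vals[999]) / np.log(2000 / 1000)
```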

SLIDE 8

Geometric convergence of GD with L(2)

$$F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn} \quad \text{and} \quad \|x - x^*\|^2 \le K\,(F(x) - F(x^*))$$

No-memory algorithm ⇒ ∀j ≤ n,

$$F(x_n) - F(x^*) \le \frac{\|x_{n-j} - x^*\|^2}{2hj} \le \frac{K}{2hj}\,(F(x_{n-j}) - F(x^*))$$

If

$$\frac{K}{2hj} \le \frac{1}{2} \iff j \ge \frac{K}{h},$$

then

$$F(x_n) - F(x^*) \le \frac{F(x_{n-j}) - F(x^*)}{2}$$

Conclusion: the decay is geometric.
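The halving argument predicts geometric decay, and on a quadratic the per-step ratio of F(x_n) − F(x*) is in fact exactly constant. The function and step below are illustrative assumptions:

```python
# On F(x) = 0.5 x^2 (which satisfies L(2) with K = 2), gradient descent
# gives x_{n+1} = (1 - h) x_n, so F(x_{n+1}) / F(x_n) = (1 - h)^2: geometric.
F = lambda x: 0.5 * x * x
grad_F = lambda x: x

h, x = 0.1, 5.0
vals = [F(x)]
for _ in range(100):
    x -= h * grad_F(x)
    vals.append(F(x))

ratios = [vals[n + 1] / vals[n] for n in range(100)]
assert all(abs(r - (1 - h) ** 2) < 1e-12 for r in ratios)
```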

SLIDE 9

Back to Nesterov scheme

State of the art
Nesterov Scheme: for $h < \frac{1}{L}$ and $\alpha \ge 3$,

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) \qquad (9)$$

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right) \qquad (10)$$

Chambolle, D. (14) and Attouch, Peypouquet (15): $\alpha > 3$ ⇒ convergence of $(x_n)_{n\ge 1}$ and $F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$.

If $\alpha \le 3$, Apidopoulos et al. and Attouch et al. (17):

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right) \qquad (11)$$
SLIDE 10

Nesterov, with strong convexity

Theorem (Su, Boyd, Candès (15); Attouch, Cabot (17))
If F satisfies L(2) and has a unique minimizer, then ∀α > 0

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right) \qquad (12)$$
SLIDE 11

Another geometrical condition

Flatness condition
F satisfies condition H(γ) if ∀x ∈ $\mathbb{R}^n$ and all $x^* \in X^*$

$$F(x) - F(x^*) \le \frac{1}{\gamma}\,\langle \nabla F(x), x - x^* \rangle \qquad (13)$$

Flatness and growth properties
• If $(F - F^*)^{\frac{1}{\gamma}}$ is convex, then F satisfies H(γ).
• If F satisfies H(γ) then there exists $K_2 > 0$ such that

$$F(x) - F(x^*) \le K_2\, d(x, X^*)^\gamma \qquad (14)$$

• If $F(x) = \|x - x^*\|^r$, with r > 1, F satisfies H(γ) for all γ ∈ [1, r] ... and L(p) for all p ≥ γ.
• If F satisfies L(2) and ∇F is L-Lipschitz then F satisfies $H\!\left(1 + \frac{L}{2K_2}\right)$.
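The flatness inequality (13) can be verified directly for the model example F(x) = |x − x*|^r: there ⟨∇F(x), x − x*⟩ = r(F(x) − F(x*)), so H(γ) holds exactly when γ ≤ r. A numerical check on a grid (r, x*, and the grid are illustrative assumptions):

```python
import numpy as np

# Check H(gamma):  F(x) - F(x*) <= (1/gamma) <grad F(x), x - x*>
# for F(x) = |x - x*|^r with r = 4 and x* = 0, on a grid of points.
r, x_star = 4.0, 0.0
F = lambda x: np.abs(x - x_star) ** r
grad_F = lambda x: r * np.abs(x - x_star) ** (r - 1) * np.sign(x - x_star)

xs = np.linspace(-2, 2, 401)
for gamma in (1.0, 2.0, 4.0):          # any gamma in [1, r] should work
    lhs = F(xs)
    rhs = grad_F(xs) * (xs - x_star) / gamma
    assert np.all(lhs <= rhs + 1e-12), gamma

# gamma = 5 > r fails at every x != x*
gamma = 5.0
assert np.any(F(xs) > grad_F(xs) * (xs - x_star) / gamma)
```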

SLIDE 12

Nesterov, flatness may improve convergence rate

Theorem: Apidopoulos et al. (18)
Let F be a differentiable convex function whose gradient is L-Lipschitz.

1. If F satisfies H(γ), with γ > 1:

   1. if $\alpha \le 1 + \frac{2}{\gamma}$, then

   $$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\gamma\alpha}{\gamma+2}}}\right) \qquad (15)$$

   2. if $\alpha > 1 + \frac{2}{\gamma}$, and thus in particular if α = 3, then

   $$F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right) \qquad (16)$$

   and the sequence $(x_n)_{n\ge 1}$ converges.

2. If F satisfies L(2), the previous points apply for some γ > 1.

SLIDE 13

Nesterov, rate for sharp functions

Theorem for sharp functions, Apidopoulos et al. (18)
If F satisfies L(2), H(γ) and has a unique minimizer $x^*$, then ∀α > 0

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\gamma\alpha}{\gamma+2}}}\right) \qquad (17)$$

Comments
• For γ = 1 we recover the decay $O\!\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right)$: Attouch and Cabot.
• For quadratic functions, γ = 2 and thus we get $O\!\left(\frac{1}{n^{\alpha}}\right)$.
• Since ∇F is L-Lipschitz (and F satisfies L(2)), F satisfies H(γ) for some γ > 1, and thus $\frac{2\gamma\alpha}{\gamma+2} > \frac{2\alpha}{3}$.
• For $F(x) = \|Ax - y\|^2$ the decay is $O\!\left(\frac{1}{n^{\alpha}}\right)$.
SLIDE 14

Nesterov, rate for flat functions

Theorem for flat functions, Apidopoulos et al. (18)
If F satisfies H(γ) and L(γ) with γ > 2, if F has a unique minimizer and if $\alpha > \frac{\gamma+2}{\gamma-2}$, then

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\gamma}{\gamma-2}}}\right) \qquad (18)$$

Gradient descent rate
If F satisfies L(γ) with γ > 2,

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{\gamma}{\gamma-2}}}\right) \qquad (19)$$
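For a concrete flat example, F(x) = x⁴ satisfies H(4) and L(4) (γ = 4), so with α > (γ+2)/(γ−2) = 3 Nesterov is predicted to decay like n^{−4}, versus n^{−2} for gradient descent. A rough numerical comparison (step size, α, starting point, and iteration count are illustrative assumptions):

```python
import numpy as np

F = lambda x: x**4
grad_F = lambda x: 4 * x**3
h, n_iter = 0.05, 2000

# Gradient descent, predicted rate n^{-gamma/(gamma-2)} = n^-2
x = 0.5
for _ in range(n_iter):
    x -= h * grad_F(x)
gd_val = F(x)

# Nesterov with alpha = 4 > (gamma+2)/(gamma-2) = 3, predicted rate n^-4
alpha, x_prev = 4.0, 0.5
xn = x_prev
for n in range(1, n_iter + 1):
    y = xn + (n / (n + alpha)) * (xn - x_prev)
    x_prev, xn = xn, y - h * grad_F(y)
nest_val = F(xn)

# Nesterov ends far below gradient descent, consistent with n^-4 vs n^-2
assert nest_val < gd_val
```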
SLIDE 15

Nesterov, continuous and discrete

Discretization of an ODE, Su, Boyd and Candès (15)
The scheme defined by

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) \qquad (20)$$

is a discretization of a solution of

$$\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla F(x(t)) = 0 \qquad \text{(ODE)}$$

with $\dot{x}(t_0) = 0$: the motion of a solid in a potential field with a vanishing viscosity α/t.

Advantages of the discrete setting
1. A simpler Lyapunov analysis, better insight
2. Optimality of bounds
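A crude sketch of the ODE on a toy potential shows $t^2(F(x(t)) - F(x^*))$ staying bounded, consistent with the O(1/t²) rate; the integrator, step, horizon, and quadratic F are illustrative assumptions:

```python
# Semi-implicit Euler for  x'' + (alpha/t) x' + grad F(x) = 0,  x'(t0) = 0,
# with F(x) = 0.5 x^2 (x* = 0) and alpha = 3.
alpha, t0, dt, T = 3.0, 1.0, 1e-3, 50.0
F = lambda x: 0.5 * x * x
grad_F = lambda x: x

t, x, v = t0, 2.0, 0.0
max_scaled = 0.0
while t < T:
    v += dt * (-(alpha / t) * v - grad_F(x))   # acceleration from the ODE
    x += dt * v
    t += dt
    max_scaled = max(max_scaled, t**2 * F(x))

# t^2 (F(x(t)) - F*) stays bounded along the trajectory (here below the
# continuous Lyapunov bound t0^2 F(x0) + 0.5 |(alpha-1) x0|^2 = 2 + 8 = 10)
assert max_scaled <= 10.0
```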

SLIDE 16

Nesterov, Continuous vs discrete

$$\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla F(x(t)) = 0 \qquad \text{(ODE)}$$

Nesterov, Continuous
If F is convex and if α ≥ 3, the solution of (ODE) satisfies

$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^2}\right) \qquad (21)$$

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$$

Nesterov, Discrete
If F is convex and if α ≥ 3, the sequence $(x_n)_{n\ge 1}$ satisfies

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right) \qquad (22)$$
SLIDE 17

Nesterov, Proof of the continuous theorem

We define

$$\mathcal{E}(t) = t^2\,(F(x(t)) - F(x^*)) + \frac{1}{2}\,\|(\alpha-1)(x(t) - x^*) + t\,\dot{x}(t)\|^2$$

Using (ODE) and the following convex inequality

$$F(x(t)) - F(x^*) \le \langle x(t) - x^*, \nabla F(x(t)) \rangle$$

we get

$$\mathcal{E}'(t) \le (3-\alpha)\,t\,(F(x(t)) - F(x^*)) \qquad (23)$$

1. If α ≥ 3, ∀t ≥ t₀: $t^2(F(x(t)) - F(x^*)) \le \mathcal{E}(t_0)$.
2. If α > 3: $\int_{t_0}^{+\infty} (\alpha-3)\,t\,(F(x(t)) - F(x^*))\,dt \le \mathcal{E}(t_0)$.

SLIDE 18

Nesterov, Proof of the discrete theorem

We define

$$E_n = n^2\,(F(x_n) - F(x^*)) + \frac{1}{2h}\,\|(\alpha-1)(x_n - x^*) + n(x_n - x_{n-1})\|^2$$

Using the definition of $(x_n)_{n\ge 1}$ and the following convex inequality

$$F(x_n) - F(x^*) \le \langle x_n - x^*, \nabla F(x_n) \rangle$$

we get

$$E_{n+1} - E_n \le (3-\alpha)\,n\,(F(x_n) - F(x^*)) \qquad (24)$$

1. If α ≥ 3, ∀n ≥ 1: $n^2(F(x_n) - F(x^*)) \le E_1$.
2. If α > 3: $\sum_{n\ge 1} (\alpha-3)\,n\,(F(x_n) - F(x^*)) \le E_1$.
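The discrete conclusion $n^2(F(x_n) - F(x^*)) \le E_1$ can be checked directly along the iterates. The quadratic F and the parameters below are illustrative assumptions:

```python
import numpy as np

# Check: for alpha >= 3,  n^2 (F(x_n) - F(x*)) <= E_1  for every n, where
# E_n = n^2 (F(x_n) - F*) + (1/2h) ||(alpha-1)(x_n - x*) + n (x_n - x_{n-1})||^2.
F = lambda x: 0.5 * np.dot(x, x)     # F(x) = 0.5 ||x||^2, x* = 0, L = 1
grad_F = lambda x: x
h, alpha = 0.5, 3.0                  # h < 1/L = 1
x_star = np.zeros(2)

x_prev = np.array([3.0, -4.0])
x = x_prev.copy()
E1, worst = None, 0.0
for n in range(1, 501):
    y = x + (n / (n + alpha)) * (x - x_prev)
    x_prev, x = x, y - h * grad_F(y)
    E_n = n**2 * F(x) + np.sum(((alpha - 1) * (x - x_star)
                                + n * (x - x_prev))**2) / (2 * h)
    if n == 1:
        E1 = E_n
    worst = max(worst, n**2 * F(x))

assert worst <= E1
```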

SLIDE 19

Nesterov, Proof of convergence rate

1. We define, for $(p, \xi, \lambda) \in \mathbb{R}^3$,

$$\mathcal{H}(t) = t^p\left(t^2\,(F(x(t)) - F(x^*)) + \frac{1}{2}\,\|\lambda(x(t) - x^*) + t\,\dot{x}(t)\|^2 + \frac{\xi}{2}\,\|x(t) - x^*\|^2\right)$$

2. We choose $(p, \xi, \lambda) \in \mathbb{R}^3$ depending on the hypotheses to ensure that $\mathcal{H}$ is bounded. $\mathcal{H}$ may not be non-increasing.

3. We deduce there is $A \in \mathbb{R}$ such that

$$t^{2+p}\,(F(x(t)) - F(x^*)) \le A - t^p\,\frac{\xi}{2}\,\|x(t) - x^*\|^2$$

4. If ξ ≥ 0 then $F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^{p+2}}\right)$.

5. If ξ < 0 we must use condition L(γ) to conclude.

SLIDE 20

Nesterov, Example

Theorem (Su, Boyd, Candès (15))
If F is convex and α ≥ 3,

$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^2}\right) \qquad (25)$$

Proof: p = 0, λ = α − 1, ξ = 0.

Theorem (Aujol, D., Rondepierre (18))
If F is convex, satisfies H(γ) and L(2), and has a unique minimizer,

$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^{\frac{2\alpha\gamma}{\gamma+2}}}\right) \qquad (26)$$

Proof: $p = \frac{2\alpha\gamma}{\gamma+2} - 2$, $\lambda = \frac{2\alpha}{\gamma+2}$, $\xi = \lambda(\lambda + 1 - \alpha)$.