

SLIDE 1

Is Nesterov's acceleration really an acceleration?
(L'accélération de Nesterov est-elle vraiment une accélération ?)

Jean-François Aujol¹, in collaboration with Vassilis Apidopoulos¹, Charles Dossal² and Aude Rondepierre².

¹ IMB, Université de Bordeaux
² INSA Toulouse, IMT

27 May 2019

SLIDE 2

Setting

Minimize a differentiable function. Let $F$ be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$. We want to build an efficient sequence to estimate
$$\arg\min_{x \in \mathbb{R}^n} F(x).$$

SLIDE 3

Gradient descent

Explicit Gradient Descent. Let $F$ be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$. Gradient descent: for $h < \frac{2}{L}$,
$$x_{n+1} = x_n - h\nabla F(x_n).$$
The sequence $(x_n)_{n\in\mathbb{N}}$ converges to a minimizer of $F$ and
$$F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn}.$$
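As a concrete illustration of this scheme (an addition to this transcript, not part of the original talk), here is a minimal Python sketch on a least-squares toy problem; the function names and all parameter values are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad_F, x0, h, n_iter):
    """Explicit gradient descent: x_{n+1} = x_n - h * grad_F(x_n)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - h * grad_F(x)
    return x

# Toy problem: F(x) = 0.5 * ||A x - b||^2, whose gradient A^T (A x - b)
# is L-Lipschitz with L = largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b),
                         x0=np.zeros(5), h=1.0 / L, n_iter=500)
```

Here $h = 1/L$ satisfies the step-size condition $h < 2/L$ of the slide.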

SLIDE 4

Gradient descent

Inertial Gradient Descent. Let $F$ be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$. Inertial gradient descent: for $h < \frac{1}{L}$,
$$y_n = x_n + \alpha(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h\nabla F(y_n).$$
If $\alpha \in [0, 1]$, the sequence $(x_n)_{n\in\mathbb{N}}$ converges to a minimizer of $F$ and $F(x_n) - F(x^*) = O\!\left(\frac{1}{n}\right)$.

SLIDE 5

Nesterov inertial scheme

A specific class of inertial gradient schemes: the Nesterov scheme. For $h < \frac{1}{L}$ and $\alpha \ge 3$,
$$y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h\nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right).$$
• Nesterov (83) proposes α = 3.
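To make the update rule concrete, here is a short Python sketch of the scheme (an illustration added to this transcript; names and defaults are assumptions):

```python
import numpy as np

def nesterov(grad_F, x0, h, alpha=3.0, n_iter=500):
    """Nesterov inertial scheme:
    y_n = x_n + n/(n+alpha) * (x_n - x_{n-1}); x_{n+1} = y_n - h * grad_F(y_n)."""
    x_prev = x0.copy()
    x = x0.copy()
    for n in range(1, n_iter + 1):
        y = x + (n / (n + alpha)) * (x - x_prev)
        x_prev, x = x, y - h * grad_F(y)
    return x
```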
SLIDE 6

Introduction

Non-smooth convex optimization

[Figure: (a) input y, motion blur + noise (σ = 2); (b) convergence profiles, ISTA vs FISTA; (c) deconvolution ISTA(300)+UDWT; (d) deconvolution FISTA(300)+UDWT.]

SLIDE 7

Introduction

Some questions:
• How can we prove these decays?
• What is the role of the inertial parameter α?
• Can we get more accurate rates than $O\!\left(\frac{1}{n^2}\right)$ with more information on $F$ (i.e. assuming more than convexity of $F$)?
• Are these bounds tight?
• Is the Nesterov scheme really an acceleration of gradient descent?

SLIDE 8

Outline

1 Introduction
2 Geometry assumptions
3 Geometry for Nesterov inertial scheme
4 ODEs and Lyapunov functions
5 The non-differentiable setting
6 Finite error analysis

SLIDE 9

Outline

1 Introduction
2 Geometry assumptions
3 Geometry for Nesterov inertial scheme
4 ODEs and Lyapunov functions
5 The non-differentiable setting
6 Finite error analysis

SLIDE 10

Convex functions

Definition. $F$ is a convex function if: $\forall (x, y)$, $\forall \lambda \in [0, 1]$,
$$F(\lambda x + (1-\lambda)y) \le \lambda F(x) + (1-\lambda)F(y).$$

Properties of differentiable convex functions. $F$ is bounded below by its affine approximation:
$$\forall (x, y), \quad F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle.$$
If $x \mapsto \nabla F(x)$ is $L$-Lipschitz, $F$ is bounded above by its quadratic approximation:
$$\forall (x, y), \quad F(y) \le F(x) + \langle \nabla F(x), y - x \rangle + \frac{L}{2}\|x - y\|^2.$$
In particular, $F(x) - F(x^*) \le \frac{L}{2}\|x - x^*\|^2$.

SLIDE 11

Classical geometric assumptions

Strong convexity:
$$F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\mu}{2}\|x - y\|^2.$$
Strong minimizer ($x^*$ the minimizer of $F$):
$$\|x - x^*\|^2 \le \frac{2}{\mu}(F(x) - F(x^*)).$$
In both cases, the minimizer is unique.

Remark. If $F$ is strongly convex with $L$-Lipschitz gradient:
$$\frac{\mu}{2}\|x - x^*\|^2 \le F(x) - F(x^*) \le \frac{L}{2}\|x - x^*\|^2.$$

SLIDE 12

Refined geometric assumptions

Growth condition (sharpness). Let $X^*$ be the set of minimizers of $F$. A function $F$ satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$,
$$d(x, X^*)^p \le K\,(F(x) - F(x^*)).$$
The smaller $p$, the sharper $F$.

Remark. $L(2)$ together with uniqueness of the minimizer $\iff$ strong minimizer.

Remark. If $F$, convex with $L$-Lipschitz gradient, satisfies the growth condition $L(p)$ for some $p > 0$, then necessarily $p \ge 2$.
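For instance (an illustration added here, not on the original slide): for $F(x) = |x|^p$ on $\mathbb{R}$ with $p \ge 1$, we have $X^* = \{0\}$ and
$$d(x, X^*)^p = |x|^p = F(x) - F(x^*),$$
so $F$ satisfies $L(p)$ with $K = 1$. Consistently with the last remark, for $p < 2$ the gradient of $|x|^p$ is not Lipschitz around $0$.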

SLIDE 13

Another geometrical condition

Flatness condition. Let $X^*$ be the set of minimizers of $F$. $F$ satisfies condition $H(\gamma)$ if for all $x \in \mathbb{R}^n$ and all $x^* \in X^*$,
$$F(x) - F(x^*) \le \frac{1}{\gamma}\langle \nabla F(x), x - x^* \rangle.$$

Flatness properties:
• If $(F - F^*)^{1/\gamma}$ is convex, then $F$ satisfies $H(\gamma)$.
• If $F$ satisfies $H(\gamma)$, then there exists $K_2 > 0$ such that $F(x) - F(x^*) \le K_2\, d(x, X^*)^\gamma$.

The hypothesis $H(\gamma)$ can be seen as a "flatness" condition on the function $F$, in the sense that it ensures that $F$ is sufficiently flat (at least as flat as $x \mapsto |x|^\gamma$) in the neighborhood of its minimizers. The larger $\gamma$, the flatter $F$.

SLIDE 14

Flatness and growth geometrical conditions

Flatness and growth properties:
• If $F(x) = \|x - x^*\|^r$ with $r > 1$, then $F$ satisfies $H(\gamma)$ for all $\gamma \in [1, r]$ ... and $L(p)$ for all $p \ge r$.
• If $F$ satisfies $H(\gamma)$ and $L(p)$, then $p \ge \gamma$.
• If $F$ satisfies $L(2)$ and $\nabla F$ is $L$-Lipschitz, then $F$ satisfies $H(\gamma)$ for some explicit $\gamma > 1$ depending on $L$ and $K_2$.

Remark. For the explicit gradient descent, only the sharpness (growth) assumption $L(p)$ is used. For inertial methods, the flatness condition $H(\gamma)$ also plays a key role.

SLIDE 15

Łojasiewicz property and growth condition

Łojasiewicz property. A differentiable function $F : \mathbb{R}^n \to \mathbb{R}$ is said to have the Łojasiewicz property with exponent $\theta \in [0, 1)$ if, for any critical point $x^*$, there exist $c > 0$ and $\varepsilon > 0$ such that:
$$\forall x \in B(x^*, \varepsilon), \quad \|\nabla F(x)\| \ge c\,|F(x) - F^*|^\theta.$$

Lemma. Let $F : \mathbb{R}^n \to \mathbb{R}$ be a convex differentiable function. Then $F$ has the Łojasiewicz property with exponent $\theta \in [0, 1)$ iff $F$ satisfies the growth condition $L(r)$ with $\theta = 1 - \frac{1}{r}$.

SLIDE 16

Gradient Descent and Geometry

Growth condition. A function $F$ satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$: $d(x, X^*)^p \le K(F(x) - F(x^*))$.

Theorem (Garrigos et al. 2017, gradient descent).
• If $F$ satisfies condition $L(p)$ with $p > 2$, then $F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{p/(p-2)}}\right)$.
• If $F$ satisfies condition $L(2)$, then there exists $a > 0$ such that $F(x_n) - F(x^*) = O\!\left(e^{-an}\right)$.
SLIDE 17

Geometric convergence of GD with L(2)

$$F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn} \quad \text{and} \quad \|x - x^*\|^2 \le K(F(x) - F(x^*)).$$
Since the algorithm has no memory, for all $j \le n$:
$$F(x_n) - F(x^*) \le \frac{\|x_{n-j} - x^*\|^2}{2hj} \le \frac{K}{2hj}\,(F(x_{n-j}) - F(x^*)).$$
If $\frac{K}{2hj} \le \frac{1}{2}$, i.e. $j \ge \frac{K}{h}$, then
$$F(x_n) - F(x^*) \le \frac{F(x_{n-j}) - F(x^*)}{2}.$$
Conclusion: the decay is geometric.

SLIDE 18

Nesterov scheme for strongly convex functions

Nesterov inertial scheme: for $h < \frac{1}{L}$ and $\alpha_n = \frac{n}{n+\alpha}$, with $\alpha \ge 3$:
$$y_n = x_n + \alpha_n(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h\nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right).$$

Nesterov scheme for strongly convex functions: for $h < \frac{1}{L}$ and $\rho = \frac{\mu}{L}$:
$$y_n = x_n + \frac{1-\sqrt{\rho}}{1+\sqrt{\rho}}(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h\nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\!\left((1 - \sqrt{\rho})^n\right).$$
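A minimal Python sketch of the strongly convex variant (an added illustration; the function name and defaults are assumptions):

```python
import numpy as np

def nesterov_strongly_convex(grad_F, x0, h, mu, L, n_iter=500):
    """Nesterov scheme with constant momentum (1 - sqrt(rho)) / (1 + sqrt(rho)),
    where rho = mu / L, for mu-strongly convex F with L-Lipschitz gradient."""
    rho = mu / L
    beta = (1 - np.sqrt(rho)) / (1 + np.sqrt(rho))
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(n_iter):
        y = x + beta * (x - x_prev)
        x_prev, x = x, y - h * grad_F(y)
    return x
```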

SLIDE 19

Outline

1 Introduction
2 Geometry assumptions
3 Geometry for Nesterov inertial scheme
4 ODEs and Lyapunov functions
5 The non-differentiable setting
6 Finite error analysis

SLIDE 20

Back to Nesterov scheme

State of the art. Nesterov scheme for $h < \frac{1}{L}$ and $\alpha \ge 3$:
$$y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}), \qquad x_{n+1} = y_n - h\nabla F(y_n),$$
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right).$$
• Chambolle-Dossal (14) and Attouch-Peypouquet (15): $\alpha > 3$ ⇒ convergence of $(x_n)_{n\ge 1}$ and $F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$.
• If $\alpha \le 3$, Apidopoulos et al. and Attouch et al. (17): $F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{2\alpha/3}}\right)$.

SLIDE 21

Nesterov, with strong convexity

Theorem (Su-Boyd-Candès (15), Attouch-Cabot (17)). If $F$ satisfies $L(2)$ and has a unique minimizer, then for all $\alpha > 0$:
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{2\alpha/3}}\right).$$

SLIDE 22

Geometrical conditions

Growth condition. A function $F$ satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$: $d(x, X^*)^p \le K(F(x) - F(x^*))$.

Flatness condition. $F$ satisfies condition $H(\gamma)$ if for all $x \in \mathbb{R}^n$ and all $x^* \in X^*$: $F(x) - F(x^*) \le \frac{1}{\gamma}\langle \nabla F(x), x - x^* \rangle$.

SLIDE 23

Nesterov: flatness may improve the convergence rate

Theorem (Aujol et al. (18)). Let $F$ be a differentiable convex function whose gradient is $L$-Lipschitz.
1. If $F$ satisfies $H(\gamma)$ with $\gamma > 1$, then:
 (a) if $\alpha \le 1 + \frac{2}{\gamma}$, then $F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{2\gamma\alpha/(\gamma+2)}}\right)$;
 (b) if $\alpha > 1 + \frac{2}{\gamma}$, and thus in particular if $\alpha = 3$, then $F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$ and the sequence $(x_n)_{n\ge 1}$ converges.
2. If $F$ satisfies $L(2)$, then there exists $\gamma > 1$ such that $F$ satisfies $H(\gamma)$.

SLIDE 24

Nesterov: flatness may improve the convergence rate

[Figure: decay rate $r(\alpha, \gamma) = \frac{2\alpha\gamma}{\gamma+2}$ as a function of $\alpha$, for $\alpha \le \frac{\gamma+2}{\gamma}$, when $F$ satisfies $H(\gamma)$, for four values of $\gamma$: $\gamma_1 = 1.5$ (dashed), $\gamma_2 = 2$ (solid), $\gamma_3 = 3$ (dotted), $\gamma_4 = 5$ (dash-dotted).]

SLIDE 25

Nesterov: rate for slightly flat and very sharp functions

Theorem for sharp functions (Aujol et al. (18)). If $F$ satisfies $L(2)$ and $H(\gamma)$ with $\gamma \le 2$, and has a unique minimizer $x^*$, then for all $\alpha > 0$:
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{2\gamma\alpha/(\gamma+2)}}\right).$$

Comments:
• For $\gamma = 1$, same decay $O\!\left(\frac{1}{n^{2\alpha/3}}\right)$ as in Attouch-Cabot 2017.
• Since $\nabla F$ is $L$-Lipschitz, $F$ satisfies $H(\gamma)$ for some $\gamma > 1$, and thus $\frac{2\gamma\alpha}{\gamma+2} > \frac{2\alpha}{3}$.
• For quadratic functions, $\gamma = 2$ and thus we get $O\!\left(\frac{1}{n^{\alpha}}\right)$ ⇒ this does not compare well with the geometric convergence of the explicit gradient descent algorithm... For $F(x) = \|Ax - y\|^2$ the decay is $O\!\left(\frac{1}{n^{\alpha}}\right)$.
SLIDE 26

Nesterov: rate for slightly flat and very sharp functions

[Figure: decay rate $r(\alpha, \gamma) = \frac{2\alpha\gamma}{\gamma+2}$ as a function of $\alpha$ when $F$ satisfies $H(\gamma)$ and $L(2)$ with $\gamma \le 2$, for two values of $\gamma$: $\gamma_1 = 1.5$ (dashed), $\gamma_2 = 2$ (solid).]

SLIDE 27

Nesterov: rate for very flat and slightly sharp functions

Theorem for flat functions (Aujol et al. (18)). If $F$ satisfies $H(\gamma)$ and $L(\gamma)$ with $\gamma > 2$, if $F$ has a unique minimizer $x^*$ and if $\alpha > \frac{\gamma+2}{\gamma-2}$, then
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{2\gamma/(\gamma-2)}}\right).$$

Gradient descent rate (Garrigos et al. 2017). If $F$ satisfies $L(\gamma)$ with $\gamma > 2$:
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\gamma/(\gamma-2)}}\right).$$

SLIDE 28

Nesterov: rate for very flat and slightly sharp functions

[Figure: decay rate $r(\alpha, \gamma) = \frac{2\gamma}{\gamma-2}$ as a function of $\alpha$, for $\alpha \ge \frac{\gamma+2}{\gamma-2}$, when $F$ satisfies $H(\gamma)$ and $L(\gamma)$, for two values of $\gamma$: $\gamma_3 = 3$ (dotted), $\gamma_4 = 5$ (dash-dotted).]

SLIDE 29

Nesterov: rate of convergence

[Figure: decay rate $r(\alpha, \gamma)$ as a function of $\alpha$ when $F$ satisfies $H(\gamma)$ and $L(r)$ with $r = \max(2, \gamma)$, for four values of $\gamma$: $\gamma_1 = 1.5$ (dashed), $\gamma_2 = 2$ (solid), $\gamma_3 = 3$ (dotted), $\gamma_4 = 5$ (dash-dotted).]

SLIDE 30

Nesterov: flatness may give convergence rates for non-convex functions

Theorem (Aujol-Dossal 2017). Let $F$ be a differentiable function that satisfies $H(\gamma)$ with $\gamma \in (0, 1]$.
1. If $\alpha \le 1 + \frac{2}{\gamma}$, then $F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{2\gamma\alpha/(\gamma+2)}}\right)$.
2. If $\alpha > 1 + \frac{2}{\gamma}$, then $F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$ and the sequence $(x_n)_{n\ge 1}$ converges.

Remark. If $F$ satisfies $H(\gamma)$ with $\gamma \in (0, 1)$, then $F$ may be non-convex (e.g. $F(x) = \sqrt{|x|}$ satisfies $H(\frac{1}{2})$).
SLIDE 31

Outline

1 Introduction
2 Geometry assumptions
3 Geometry for Nesterov inertial scheme
4 ODEs and Lyapunov functions
5 The non-differentiable setting
6 Finite error analysis

SLIDE 32

Introduction

Gradient descent. To study gradient descent, we can study the solutions of
$$\dot{x}(t) + \nabla F(x(t)) = 0.$$
Nesterov scheme. To study the Nesterov acceleration, we can study the solutions of
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0,$$
for a suitable choice of α...

SLIDE 33

The gradient descent as a discretization of an ODE

Gradient scheme: $x_{n+1} = x_n - \gamma \nabla F(x_n)$. This can be seen as an explicit Euler scheme, with step $h = \gamma$, for the ODE
$$\dot{x}(t) + \nabla F(x(t)) = 0.$$

SLIDE 34

Gradient Descent

Minimization of a convex function $F$: $F : \mathbb{R}^n \to \mathbb{R}$ convex, differentiable, having at least one minimizer $x^*$, with $\nabla F$ $L$-Lipschitz. Gradient descent: $x_{n+1} = x_n - \gamma \nabla F(x_n)$ with $\gamma \le \frac{2}{L}$. Associated ODE:
$$\dot{x}(t) + \nabla F(x(t)) = 0.$$
Convergence rates: $F(x_n) - F(x^*) = O\!\left(\frac{1}{n}\right)$ and $F(x(t)) - F(x^*) = O\!\left(\frac{1}{t}\right)$.

SLIDE 35

Rate for the gradient descent

$$\dot{x}(t) + \nabla F(x(t)) = 0.$$
We introduce the Lyapunov function
$$\mathcal{E}(t) = t\,W(t) + \frac{1}{2}\|x(t) - x^*\|^2, \quad \text{with } W(t) = F(x(t)) - F(x^*).$$
We have
$$\mathcal{E}'(t) = W(t) + t\langle \dot{x}(t), \nabla F(x(t)) \rangle + \langle \dot{x}(t), x(t) - x^* \rangle.$$
By convexity of $F$, we get $W(t) \le \langle \nabla F(x(t)), x(t) - x^* \rangle$. Hence, using the ODE:
$$\mathcal{E}'(t) \le -t\|\nabla F(x(t))\|^2 \le 0.$$
We deduce that $\mathcal{E}(t) \le \mathcal{E}(t_0)$ for $t \ge t_0$, and thus $W(t) \le \frac{\mathcal{E}(t_0)}{t}$.

SLIDE 36

Nesterov acceleration as a discretization of an ODE

The accelerated Nesterov rule
$$x_{n+1} = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) - \gamma \nabla F(y_n)$$
can be written
$$x_{n+1} - 2x_n + x_{n-1} + \frac{\alpha}{n}(x_{n+1} - x_n) + \gamma\,\frac{n+\alpha}{n}\,\nabla F(y_n) = 0.$$
The Nesterov scheme can thus be seen as a semi-implicit discretization, with step $h = \sqrt{\gamma}$, of the ODE
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0. \tag{Nes}$$
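As an illustration (added here; not from the slides), one can integrate (Nes) numerically and compare its trajectory with the discrete scheme; the crude integrator below and its parameters are assumptions made for the sketch.

```python
import numpy as np

def nesterov_ode_trajectory(grad_F, x0, alpha=3.0, t0=1.0, T=50.0, dt=1e-3):
    """Crude explicit integration of x'' + (alpha/t) x' + grad_F(x) = 0,
    started at rest: x(t0) = x0, x'(t0) = 0."""
    x, v, t = x0.copy(), np.zeros_like(x0), t0
    ts, xs = [t], [x.copy()]
    while t < T:
        a = -(alpha / t) * v - grad_F(x)   # acceleration given by the ODE
        v = v + dt * a
        x = x + dt * v
        t += dt
        ts.append(t)
        xs.append(x.copy())
    return np.array(ts), np.array(xs)
```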

SLIDE 37

Nesterov and the heavy ball equation

A physical interpretation. The equation
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0 \tag{Nes}$$
models the trajectory of a heavy ball in a potential field $F$ with a viscosity term $\frac{\alpha}{t}$. The system oscillates, and the oscillations depend on $\alpha$. This gives useful insight about Lyapunov functions.

SLIDE 38

Nesterov Acceleration

Minimization of a convex function $F$. Accelerated (fast) gradient descent, Nesterov [83]:
$$x_{n+1} = y_n - \gamma \nabla F(y_n), \quad \text{with } y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}).$$
Associated ODE:
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0. \tag{Nes}$$
Convergence rates for $\alpha \ge 3$: $F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right)$ and $F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^2}\right)$. Optimal rate.
SLIDE 39

Nesterov Acceleration: optimality

$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0, \quad \text{with } F(x) = c\,|x|^p,\ x \in \mathbb{R},\ p > 2.$$
Look for an explicit solution of the type $x(t) = \frac{1}{t^\theta}$ ⇒ conditions on $c$, $\theta$, $p$ and $\alpha$. We have $\min F = 0$ and
$$F(x(t)) = \frac{2}{p(p-2)}\left(\alpha - \frac{p}{p-2}\right)\frac{1}{t^{2p/(p-2)}}.$$
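For the record, the verification behind this claim (an added derivation, under the stated ansatz): with $x(t) = t^{-\theta}$ one gets $\dot x(t) = -\theta t^{-\theta-1}$, $\ddot x(t) = \theta(\theta+1)t^{-\theta-2}$ and $F'(x(t)) = c\,p\,t^{-\theta(p-1)}$. Matching the powers of $t$ in (Nes) forces $\theta(p-1) = \theta + 2$, i.e. $\theta = \frac{2}{p-2}$; matching the coefficients gives $cp = \theta(\alpha - \theta - 1)$. Then $F(x(t)) = c\,t^{-\theta p}$, which is exactly the displayed rate since $\theta + 1 = \frac{p}{p-2}$ and $\theta p = \frac{2p}{p-2}$.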

SLIDE 40

State of the art, α ≥ 3

$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0. \tag{Nes}$$
Lyapunov functions (Su-Boyd-Candès 2015, Attouch et al. 2015). Define $W(t) := F(x(t)) - F(x^*)$ and
$$\mathcal{E}_\lambda(t) = t^2 W(t) + \frac{1}{2}\|\lambda(x(t) - x^*) + t\dot{x}(t)\|^2. \tag{Ener}$$
If $\alpha \ge 3$ and $\lambda = \alpha - 1$, $\mathcal{E}_\lambda$ is non-increasing, which leads to
$$\forall t \ge t_0, \quad W(t) \le \frac{\mathcal{E}_\lambda(t)}{t^2} \le \frac{\mathcal{E}_\lambda(t_0)}{t^2}.$$

SLIDE 41

State of the art, α ≥ 3

$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla F(x(t)) = 0. \tag{Nes}$$
Lyapunov function:
$$\mathcal{E}_\lambda(t) = t^2 W(t) + \frac{1}{2}\|\lambda(x(t) - x^*) + t\dot{x}(t)\|^2. \tag{Ener}$$
Differentiating:
$$\mathcal{E}'_\lambda(t) = 2tW(t) + t^2\langle \nabla F(x(t)), \dot{x}(t)\rangle + \langle \lambda \dot{x}(t) + t\ddot{x}(t) + \dot{x}(t),\ \lambda(x(t) - x^*) + t\dot{x}(t)\rangle,$$
and using the ODE:
$$\mathcal{E}'_\lambda(t) = 2tW(t) - \lambda t\langle \nabla F(x(t)), x(t) - x^*\rangle + (\lambda + 1 - \alpha)\left(t\|\dot{x}(t)\|^2 + \lambda\langle \dot{x}(t), x(t) - x^*\rangle\right).$$
Using $\lambda = \alpha - 1$ and $W(t) \le \langle \nabla F(x(t)), x(t) - x^*\rangle$:
$$\mathcal{E}'_{\alpha-1}(t) \le (3 - \alpha)\,t\,W(t) \le 0.$$

SLIDE 42

Nesterov: proof of the continuous theorem

We define
$$\mathcal{E}(t) = t^2(F(x(t)) - F(x^*)) + \frac{1}{2}\|(\alpha - 1)(x(t) - x^*) + t\dot{x}(t)\|^2.$$
Using the ODE and the convexity inequality $F(x(t)) - F(x^*) \le \langle x(t) - x^*, \nabla F(x(t))\rangle$, we get
$$\mathcal{E}'(t) \le (3 - \alpha)\,t\,(F(x(t)) - F(x^*)).$$
If $\alpha \ge 3$: $\forall t \ge t_0$, $t^2(F(x(t)) - F(x^*)) \le \mathcal{E}(t_0)$.

SLIDE 43

Nesterov: proof of the discrete theorem

We define
$$E_n = n^2(F(x_n) - F(x^*)) + \frac{1}{2h}\|(\alpha - 1)(x_n - x^*) + n(x_n - x_{n-1})\|^2.$$
Using the definition of $(x_n)_{n\ge 1}$ and the convexity inequality $F(x_n) - F(x^*) \le \langle x_n - x^*, \nabla F(x_n)\rangle$, we get
$$E_{n+1} - E_n \le (3 - \alpha)\,n\,(F(x_n) - F(x^*)).$$
If $\alpha \ge 3$: $\forall n \ge 1$, $n^2(F(x_n) - F(x^*)) \le E_1$.

SLIDE 44

Replacing the convexity assumption with H(γ)

$$\mathcal{E}_{\alpha-1}(t) = t^2 W(t) + \frac{1}{2}\|(\alpha - 1)(x(t) - x^*) + t\dot{x}(t)\|^2,$$
$$\mathcal{E}'_{\alpha-1}(t) = 2tW(t) - (\alpha - 1)\,t\,\langle \nabla F(x(t)), x(t) - x^*\rangle.$$
If $F$ is convex, then $W(t) \le \langle \nabla F(x(t)), x(t) - x^*\rangle$ and then $\mathcal{E}'_{\alpha-1}(t) \le (3 - \alpha)tW(t)$.
If $H(\gamma)$ holds, then $W(t) = F(x(t)) - F(x^*)$ satisfies $W(t) \le \frac{1}{\gamma}\langle \nabla F(x(t)), x(t) - x^*\rangle$ and then
$$\mathcal{E}'_{\alpha-1}(t) \le \gamma\left(\frac{2}{\gamma} + 1 - \alpha\right)tW(t).$$
SLIDE 45

Replacing the convexity assumption with H(γ)

$$\mathcal{E}_{\alpha-1}(t) = t^2 W(t) + \frac{1}{2}\|(\alpha - 1)(x(t) - x^*) + t\dot{x}(t)\|^2,$$
$$\mathcal{E}'_{\alpha-1}(t) = 2tW(t) - (\alpha - 1)\,t\,\langle \nabla F(x(t)), x(t) - x^*\rangle.$$
If $H(\gamma)$ holds, then $W(t) = F(x(t)) - F(x^*)$ satisfies $W(t) \le \frac{1}{\gamma}\langle \nabla F(x(t)), x(t) - x^*\rangle$ and then
$$\mathcal{E}'_{\alpha-1}(t) \le \gamma\left(\frac{2}{\gamma} + 1 - \alpha\right)tW(t).$$
We deduce that if $\alpha \ge \frac{2}{\gamma} + 1$, then $W(t) = O\!\left(\frac{1}{t^2}\right)$.

SLIDE 46

Yet another Lyapunov function for α ∈ (0, 3)

$$\mathcal{E}_{\lambda,\xi}(t) = t^2 W(t) + \frac{1}{2}\|\lambda(x(t) - x^*) + t\dot{x}(t)\|^2 + \frac{\xi}{2}\|x(t) - x^*\|^2.$$
For $\lambda = \frac{2\alpha}{3}$ and $\xi = \frac{2\alpha}{3}\left(1 - \frac{\alpha}{3}\right)$:
$$\mathcal{E}'_{\lambda,\xi}(t) \le \left(2 - \frac{2\alpha}{3}\right)\frac{\mathcal{E}_{\lambda,\xi}(t)}{t}.$$
It follows that $t \mapsto H(t) = t^{\frac{2\alpha}{3} - 2}\,\mathcal{E}_{\lambda,\xi}(t)$ is non-increasing.

Theorem (Aujol-Dossal 17 and Attouch et al. 17). If $F$ is convex and $\alpha \in (0, 3)$:
$$\forall t \ge t_0, \quad W(t) \le \frac{H(t_0)}{t^{2\alpha/3}}.$$

SLIDE 47

Nesterov: general proof of convergence rates

1. We define, for $(p, \xi, \lambda) \in \mathbb{R}^3$ and $W(t) = F(x(t)) - F(x^*)$:
$$H(t) = t^p\left(t^2 W(t) + \frac{1}{2}\|\lambda(x(t) - x^*) + t\dot{x}(t)\|^2 + \frac{\xi}{2}\|x(t) - x^*\|^2\right).$$
2. We choose $(p, \xi, \lambda) \in \mathbb{R}^3$ depending on the hypotheses to ensure that $H$ is bounded ($H$ may not be non-increasing).
3. We deduce that there is $A \in \mathbb{R}$ such that
$$t^{2+p}(F(x(t)) - F(x^*)) \le A - t^p\,\frac{\xi}{2}\|x(t) - x^*\|^2.$$
4. If $\xi \ge 0$, then $F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^{p+2}}\right)$.
5. If $\xi < 0$, we need to use condition $L(\gamma)$ to conclude.

SLIDE 48

Nesterov: examples

Theorem (Su, Boyd, Candès (15)). If $F$ is convex and $\alpha \ge 3$:
$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^2}\right).$$
Proof: $p = 0$, $\lambda = \alpha - 1$, $\xi = 0$.

Theorem (Aujol, Dossal, Rondepierre (18)). If $F$ is convex, satisfies $H(\gamma)$ and $L(2)$, and has a unique minimizer:
$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^{2\alpha\gamma/(\gamma+2)}}\right).$$
Proof: $p = \frac{2\alpha\gamma}{\gamma+2} - 2$, $\lambda = \frac{2\alpha}{\gamma+2}$, $\xi = \lambda(\lambda + 1 - \alpha)$.

SLIDE 49

Outline

1 Introduction
2 Geometry assumptions
3 Geometry for Nesterov inertial scheme
4 ODEs and Lyapunov functions
5 The non-differentiable setting
6 Finite error analysis

SLIDE 50

Deblurring

Left: the blurred image $y_0$. Right: an image $\hat{x}$ estimated by minimizing
$$F(x) = \frac{1}{2}\|y_0 - h \star x\|_2^2 + \lambda \|\Psi x\|_1.$$

SLIDE 51

Inpainting

Left: image with missing pixels $y_0$. Right: the image $\hat{x}$ estimated by minimizing
$$F(x) = \frac{1}{2}\|y_0 - Mx\|_2^2 + \lambda \|\Psi x\|_1.$$

SLIDE 52

Non-differentiable convex functions

Examples: $f(x) = \|x\|_1$, $f(x) = \|Dx\|_1$, and
$$f(x) = \chi_C(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{else} \end{cases} \quad (C \text{ a closed convex set}).$$
The subdifferential generalizes the notion of gradient:
$$\partial f(x) = \{u \in X \text{ such that } \forall y \in X,\ f(y) \ge f(x) + \langle u, y - x\rangle\}.$$
If $f(x) = \|x\|_1$:
$$\partial f(x)_i = \begin{cases} \{+1\} & \text{if } x_i > 0 \\ \{-1\} & \text{if } x_i < 0 \\ [-1, 1] & \text{if } x_i = 0 \end{cases}$$
If $F = f + g$ with $f$ differentiable, $\partial F = \nabla f + \partial g$.
$x^*$ is a minimizer of $F$ ⇔ $0 \in \partial F(x^*)$.
$x^*$ is a minimizer of $F = f + g$ ⇔ $-\nabla f(x^*) \in \partial g(x^*)$.

SLIDE 53

Proximal Operator (Moreau 1962)

Proximal operator of a convex function. For any convex function $g$ one can define $\mathrm{prox}_g$ by
$$\mathrm{prox}_g(x) = \arg\min_{y \in X}\ g(y) + \frac{1}{2}\|y - x\|^2.$$
If we denote $p = \mathrm{prox}_g(x)$:
$$0 \in \partial g(p) + p - x \iff x \in \partial g(p) + p \iff p = (I + \partial g)^{-1}(x).$$
If $g$ is differentiable: $\mathrm{prox}_{\tau g}$ is an implicit gradient step with step $\tau$.
If $g(x) = \|x\|_1$ (soft thresholding):
$$\mathrm{prox}_{\tau g}(x)_i = \begin{cases} x_i - \tau & \text{if } x_i > \tau \\ x_i + \tau & \text{if } x_i < -\tau \\ 0 & \text{else} \end{cases}$$
If $g = \chi_C$: $\mathrm{prox}_g = \mathrm{proj}_C$.
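A minimal Python sketch of the two proximal maps just described (an added illustration; the function names are mine):

```python
import numpy as np

def prox_l1(x, tau):
    """Soft thresholding: prox of tau * ||.||_1, applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_indicator_box(x, lo, hi):
    """prox of the indicator of the box C = [lo, hi]^d is the projection onto C."""
    return np.clip(x, lo, hi)
```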

SLIDE 54

Splitting

Setting. Minimize $F(x) = f(x) + g(x)$, $x \in X$ a Hilbert space, with $f$ and $g$ proper, convex, lsc functions. Moreover $F$ is coercive (so it has at least one minimizer), $\nabla f$ is $L$-Lipschitz and $g$ is "simple", meaning that
$$\mathrm{prox}_{\tau g}(x) := \arg\min_z\ g(z) + \frac{1}{2\tau}\|x - z\|^2$$
can be evaluated. Many proximal algorithms use splitting: Primal-Dual, Douglas-Rachford, Forward-Backward, ADMM, Split-Bregman.

SLIDE 55

Forward-Backward Splitting

Forward-Backward algorithm (ISTA). Let $T$ be the operator defined from $X$ to $X$ by
$$x \mapsto Tx := \mathrm{prox}_{\tau g}\big((I - \tau \nabla f)(x)\big), \quad \text{for } \tau \le 1/L.$$
Any sequence $(x_n)_{n\in\mathbb{N}}$ defined by $x_{n+1} = T(x_n)$ weakly converges to a minimizer of $F$. For any minimizer $x^*$ of $F$:
$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n}\right).$$
• See e.g. Combettes-Wajs 2005.
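A minimal sketch of this forward-backward iteration (an added illustration; `prox_g(v, tau)` is assumed to compute $\mathrm{prox}_{\tau g}(v)$, as in the `prox_l1` sketch above):

```python
def ista(grad_f, prox_g, x0, tau, n_iter=300):
    """Forward-backward (ISTA): x_{n+1} = prox_{tau g}(x_n - tau * grad_f(x_n))."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_g(x - tau * grad_f(x), tau)
    return x
```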
SLIDE 56

Inertial Method: FISTA

Over-relaxation of FB: FISTA (Beck-Teboulle 2009, after Nesterov 1983):
$$x_1 = x_0 \in X, \qquad x_{n+1} = T(x_n + \alpha_n(x_n - x_{n-1})), \quad \forall n \ge 1.$$
Beck and Teboulle Theorem (2009). If $\tau \le \frac{1}{L}$ and $\alpha_n = \frac{t_n - 1}{t_{n+1}}$ with $t_1 = 1$ and $t_{n+1} = \frac{1 + \sqrt{1 + 4t_n^2}}{2}$, then
$$F(x_n) - F(x^*) \le \frac{2L\|x_0 - x^*\|^2}{(n+1)^2}.$$
The rate $O\!\left(\frac{1}{n^2}\right)$ is optimal (Nesterov 1983).
Same bound if $t_n = \frac{n+1}{2}$.
Convergence proof for $t_n = \frac{n+a-1}{a}$ and $\alpha_n = \frac{n-1}{n+a}$ with $a > 2$ in Chambolle-Dossal 2015, with a similar convergence rate. Study of the robustness to noise in Aujol-Dossal 2015.
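A minimal FISTA sketch in Python (an added illustration, reusing the `prox_l1`-style helper above; all names and defaults are assumptions):

```python
import numpy as np

def fista(grad_f, prox_g, x0, tau, n_iter=300):
    """FISTA: x_{n+1} = prox_{tau g}(y_n - tau * grad_f(y_n)),
    with the Beck-Teboulle momentum t_{n+1} = (1 + sqrt(1 + 4 t_n^2)) / 2."""
    x_prev = x0.copy()
    x = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)
        t = t_next
    return x

# Example use on the LASSO F(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1,
# with grad_f(x) = A^T (A x - b) and prox_g = prox_l1 (illustrative setup).
```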

SLIDE 57

The minimization problem

$$\min\{F(x) : x \in X\} \tag{P}$$
$X = \mathbb{R}^d$, $d \ge 1$, and $F = f + g : X \to \mathbb{R} \cup \{+\infty\}$, where:
• $f : X \to \mathbb{R}$ is proper, convex, differentiable with $\nabla f$ $L$-Lipschitz;
• $g : X \to \mathbb{R} \cup \{+\infty\}$ is proper, convex, lower semi-continuous;
• $x^* \in \arg\min F \ne \emptyset$.
Solutions of (P): $\partial F(x) = \{\nabla f(x)\} + \partial g(x) \ni 0$.
Dynamical system:
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \partial F(x(t)) \ni 0, \tag{DI}$$
where $\alpha > 0$.

SLIDE 58

Discretization of (DI): FISTA

Time step $h > 0$; explicit discretization with respect to $\nabla f$, implicit discretization with respect to $\partial g$.

Algorithm 1: FISTA (Beck et al. ’09, Nesterov ’83/’04). Let $x_0, x_1 \in \mathbb{R}^d$ and $\gamma = h^2$. For all $n \ge 1$, define $\{x_n\}_{n\in\mathbb{N}}$, $\{y_n\}_{n\in\mathbb{N}}$ as follows:
$$y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}),$$
$$x_{n+1} = \mathrm{prox}_{\gamma g}\big(y_n - \gamma \nabla f(y_n)\big) = (\mathrm{Id} + \gamma \partial g)^{-1}\big(y_n - \gamma \nabla f(y_n)\big),$$
where
$$\mathrm{prox}_{\gamma g}(x) = (\mathrm{Id} + \gamma \partial g)^{-1}(x) = \arg\min_y \left\{ g(y) + \frac{\|x - y\|^2}{2\gamma} \right\}.$$

SLIDE 59

The differential inclusion

For $t_0 > 0$, $\alpha > 0$ and $t \ge t_0$:
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \partial F(x(t)) \ni 0, \qquad x(t_0) = x_0, \quad \dot{x}(t_0) = v_0. \tag{DI}$$
General framework for (DI) (Paoli 2000):
$$\ddot{x}(t) + \partial F(x(t)) \ni u(t, x(t), \dot{x}(t)), \qquad x(t_0) = x_0, \quad \dot{x}(t_0) = v_0, \tag{GDI}$$
where $u$ is Lipschitz continuous in its last two arguments, uniformly with respect to the first one.

SLIDE 60

Definition (shock solution of (DI); Paoli ’00, Attouch et al. ’02). A function $x : [t_0, +\infty) \to \mathbb{R}^d$ is an energy-conserving shock solution of (DI) if:
1. $x \in C^{0,1}([t_0, T]; \mathbb{R}^d)$, for all $T > t_0$;
2. $\dot{x} \in BV([t_0, T]; \mathbb{R}^d)$, for all $T > t_0$;
3. $x(t) \in D(F)$, for all $t \ge t_0$;
4. for all $\varphi \in C^1_c([t_0, +\infty), \mathbb{R}_+)$ and $v \in C([t_0, +\infty), D(F))$, it holds:
$$\int_{t_0}^{T} \big(F(x(t)) - F(v(t))\big)\,\varphi(t)\,dt \le \left\langle \ddot{x} + \frac{\alpha}{t}\dot{x},\ (v - x)\varphi \right\rangle_{M \times C};$$
5. $x$ satisfies the following energy equation for a.e. $t \ge t_0$:
$$F(x(t)) - F(x_0) + \frac{1}{2}\|\dot{x}(t)\|^2 - \frac{1}{2}\|v_0\|^2 + \int_{t_0}^{t} \frac{\alpha}{s}\|\dot{x}(s)\|^2\,ds = 0.$$

SLIDE 61

Approximating ODE

We consider the Moreau-Yosida approximations $\{F_\gamma\}_{\gamma>0}$ of $F$:
$$F_\gamma(x) = \min_y \left\{ F(y) + \frac{1}{2\gamma}\|x - y\|^2 \right\},$$
and the following approximating ODE:
$$\ddot{x}_\gamma(t) + \frac{\alpha}{t}\dot{x}_\gamma(t) + \nabla F_\gamma(x_\gamma(t)) = 0, \qquad x_\gamma(t_0) = x_0, \quad \dot{x}_\gamma(t_0) = v_0. \tag{ADE}$$
(ADE) falls into the classical theory of differential equations and admits a unique solution $x_\gamma \in C^2([t_0, +\infty); \mathbb{R}^d)$, for all $\gamma > 0$.

SLIDE 62

Existence of a shock solution (Paoli 2000)

Theorem. Let $\{F_\gamma\}_{\gamma>0}$ be the Moreau-Yosida approximations of $F$. There exists a subsequence $\{x_\gamma\}_{\gamma>0}$ of solutions of (ADE) that converges to a shock solution of (DI), according to the following scheme:
• $x_\gamma \to x$ as $\gamma \to 0$, uniformly on $[t_0, T]$, $\forall T > t_0$;
• $\dot{x}_\gamma \to \dot{x}$ as $\gamma \to 0$, in $L^p([t_0, T]; \mathbb{R}^d)$, $\forall p \in [1, +\infty)$, $\forall T > t_0$;
• $F_\gamma(x_\gamma) \to F(x)$ as $\gamma \to 0$, in $L^p([t_0, T])$, $\forall p \in [1, +\infty)$, $\forall T > t_0$. (AS)

Lemma (a priori estimates).
$$\sup_{\gamma > 0}\ \big\{\|x_\gamma\|_\infty,\ \|\dot{x}_\gamma\|_\infty,\ \|\nabla F_\gamma(x_\gamma)\|_1,\ \|\ddot{x}_\gamma\|_1\big\} < +\infty.$$

SLIDE 63

Asymptotic analysis

Let $x$ be a shock solution of (DI) obtained as a limit of the approximation scheme (AS) and $x^*$ a minimizer of $F$. Notation: $W(t) = F(x(t)) - F(x^*)$. For all $\lambda, \xi \ge 0$ we consider the Lyapunov function:
$$\mathcal{E}_{\lambda,\xi}(t) = t^2 W(t) + \frac{1}{2}\|\lambda(x(t) - x^*) + t\dot{x}(t)\|^2 + \frac{\xi}{2}\|x(t) - x^*\|^2.$$
Lemma. For $\alpha \ge 3$, $2 \le \lambda \le \alpha - 1$ and $\xi = \lambda(\alpha - \lambda - 1)$, $\mathcal{E}_{\lambda,\xi}$ is essentially non-increasing on $[t_0, +\infty)$:
$$\mathcal{E}_{\lambda,\xi}(t) \le \mathcal{E}_{\lambda,\xi}(s), \quad \text{for a.e. } t_0 \le s \le t.$$

SLIDE 64

Convergence rates for W(t) and ẋ(t)

Theorem (Apidopoulos-Aujol-Dossal ’17). Let $x$ be a shock solution of (DI) obtained as a limit of the approximation scheme (AS) and $x^*$ a minimizer of $F$. There exist $C_1, C_2 > 0$ such that:
• If $\alpha \ge 3$: $\sup_{t \ge t_0} \|x(t) - x^*\| < +\infty$, and for a.e. $t \ge t_0$,
$$W(t) \le \frac{C_1}{t^2} \quad \text{and} \quad \|\dot{x}(t)\| \le \frac{C_2}{t}.$$
• If $\alpha > 3$:
$$\int_{t_0}^{+\infty} t\,W(t)\,dt < +\infty \quad \text{and} \quad \int_{t_0}^{+\infty} t\,\|\dot{x}(t)\|^2\,dt < +\infty.$$
The trajectory $\{x(t)\}_{t \ge t_0}$ converges asymptotically to $x^*$, and
$$\operatorname*{ess\,lim}_{t\to\infty}\ t^2 W(t) = 0 \quad \text{and} \quad \operatorname*{ess\,lim}_{t\to\infty}\ t\|\dot{x}(t)\| = 0.$$

SLIDE 65

The case of low friction α < 3

For $\lambda = \frac{2\alpha}{3}$, $\xi = \frac{2\alpha(3-\alpha)}{9} > 0$ and $c = 2 - \frac{2\alpha}{3} > 0$, for all $t \ge t_0$ we consider the energy function:
$$H(t) = t^{-c}\,\mathcal{E}_{\lambda,\xi}(t).$$
Lemma. For $\alpha \le 3$, $H$ is essentially non-increasing on $[t_0, +\infty)$.

Corollary (Apidopoulos-Aujol-Dossal ’17). Let $\alpha < 3$, $x$ a shock solution of (DI) obtained as a limit of the approximation scheme (AS) and $x^*$ a minimizer of $F$. Then
$$W(t) \le C\,t^{-2\alpha/3} \quad \text{for a.e. } t \ge t_0.$$

SLIDE 66

Optimal convergence rate for W(t) when α < 3

We consider $F(x) = |x|$, for all $x \in \mathbb{R}$, and study (DI) for $\alpha < 3$.

Theorem (Apidopoulos et al. ’17). Let $x$ be a solution of (DI) with $F(x) = |x|$ and $\alpha < 3$, such that $x(t_0) \ne 0$. There exists a constant $K > 0$ such that for any $T > t_0$, there exists $t > T$ such that:
$$W(t) = |x(t)| \ge \frac{K}{t^{2\alpha/3}}.$$

SLIDE 67

Outline

1 Introduction
2 Geometry assumptions
3 Geometry for Nesterov inertial scheme
4 ODEs and Lyapunov functions
5 The non-differentiable setting
6 Finite error analysis

SLIDE 68

Paradox?

Non-smooth convex optimization

[Figure: (a) input y, motion blur + noise (σ = 2); (b) convergence profiles, ISTA vs FISTA; (c) deconvolution ISTA(300)+UDWT; (d) deconvolution FISTA(300)+UDWT.]

SLIDE 69

Work in progress: finite error analysis

Let $\varepsilon > 0$. What is the minimal time $t_\varepsilon$ such that $\|x(t) - x^*\| \le \varepsilon$ for all $t \ge t_\varepsilon$? For a finite error, Nesterov's scheme seems to be better, most of the time, than non-accelerated first-order methods.

Conjecture. Anisotropy seems to play a key role: the Nesterov scheme seems to act as a way to precondition the algorithm.

SLIDE 70

The End!

Questions?

For more details: http://www.math.u-bordeaux.fr/~jaujol/

• The differential inclusion modeling the FISTA algorithm and optimality of convergence rate in the case b ≤ 3, V. Apidopoulos, J-F. Aujol and C. Dossal, SIAM Journal on Optimization, 2018.
• Optimal rate of convergence of an ODE associated to the Fast Gradient Descent schemes for b > 0, J-F. Aujol and C. Dossal, HAL preprint 01547251, 2017.
• Convergence rate of inertial Forward-Backward algorithm beyond Nesterov's rule, V. Apidopoulos, J-F. Aujol and C. Dossal, Mathematical Programming, in press.
• Optimal convergence rates for Nesterov acceleration, J-F. Aujol, C. Dossal and A. Rondepierre, HAL preprint 01786117, 2018.
• Convergence rates of an inertial gradient descent algorithm under growth condition, V. Apidopoulos, J-F. Aujol, C. Dossal and A. Rondepierre, 2018.