

slide-1
SLIDE 1

An adaptive backtracking strategy for non-smooth composite optimisation problems

Luca Calatroni

Centre de Mathématiques Appliquées (CMAP), École Polytechnique, Palaiseau

joint work with: A. Chambolle.

CMIPI 2018 Workshop

University of Insubria, DISAT, Como (IT), July 16-18, 2018

slide-2
SLIDE 2

Table of contents

  • 1. Introduction
  • 2. GFISTA with backtracking
  • 3. Accelerated convergence rates
  • 4. Imaging applications
  • 5. Conclusions & outlook

1

slide-3
SLIDE 3

Introduction

SLIDE 5

Gradient based methods: a review

$(X, \|\cdot\|)$ Hilbert space. Given $f : X \to \mathbb{R}$ convex, l.s.c., with $x^* \in \arg\min f$, we want to solve:

$$\min_{x \in X} f(x)$$

If $f$ is differentiable with $L_f$-Lipschitz gradient, explicit gradient descent reads:

Algorithm 1: Gradient descent with fixed step.
Input: $0 < \tau \le 2/L_f$, $x^0 \in X$.
for $k \ge 0$ do
    $x^{k+1} = x^k - \tau\,\nabla f(x^k)$
end for

Quite restrictive smoothness assumption!
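For illustration, a minimal NumPy sketch of Algorithm 1; the quadratic example, the helper name `gradient_descent` and the parameters are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, tau, n_iter=100):
    """Fixed-step gradient descent: x_{k+1} = x_k - tau * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = x - tau * grad_f(x)
    return x

# Example: f(x) = 0.5*||A x - b||^2 with gradient A^T (A x - b) and
# Lipschitz constant L_f = ||A^T A||_2; here tau = 1/L_f <= 2/L_f.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L_f = np.linalg.norm(A.T @ A, 2)
x_approx = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), tau=1.0 / L_f)
```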

2

slide-6
SLIDE 6

Gradient based methods: a review

$(X, \|\cdot\|)$ Hilbert space. Given $f : X \to \mathbb{R}$ convex, l.s.c., with $x^* \in \arg\min f$, we want to solve:

$$\min_{x \in X} f(x)$$

No further assumptions on $\nabla f$: use implicit gradient descent.

Algorithm 2: Implicit (proximal) gradient descent with fixed step.
Input: $\tau > 0$, $x^0 \in X$.
for $k \ge 0$ do
    $x^{k+1} = \mathrm{prox}_{\tau f}(x^k)\;(= x^k - \tau\,\nabla f(x^{k+1}))$
end for

Note: the iteration can be rewritten as $x^{k+1} = x^k - \tau\,\nabla f_\tau(x^k)$, with
$$f_\tau(x^k) := \min_{x \in X}\; f(x) + \frac{\|x - x^k\|^2}{2\tau},$$
the Moreau-Yosida regularisation of $f$, which has a $1/\tau$-Lipschitz gradient ⇒ explicit gradient descent on $f_\tau$. Same theory applies!

References: Brézis-Lions ('73, '78), Güler ('91), ...

SLIDE 8

Convergence rates

Theorem (O(1/k) rate). Let $x^0 \in X$ and $\tau \le 2/L_f$. Then the sequence $(x^k)$ of iterates of gradient descent converges to $x^*$ and satisfies:
$$f(x^k) - f(x^*) \le \frac{1}{2\tau k}\,\|x^* - x^0\|^2.$$

Assume now that $f$ is $\mu_f$-strongly convex, $\mu_f > 0$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu_f}{2}\,\|y - x\|^2 \quad \text{for all } x, y \in X.$$

Theorem (Linear rate for strongly convex objectives). Let $f$ be $\mu_f$-strongly convex, $x^0 \in X$ and $\tau \le 2/(L_f + \mu_f)$. Then the sequence $(x^k)$ of iterates of gradient descent satisfies:
$$f(x^k) - f(x^*) + \frac{1}{2\tau}\,\|x^k - x^*\|^2 \le \frac{\omega^k}{2\tau}\,\|x^* - x^0\|^2, \qquad \omega = \frac{1 - \mu_f/L_f}{1 + \mu_f/L_f} < 1.$$

References: Bertsekas '15, Nesterov '04.

3

slide-9
SLIDE 9

Lower bounds1

Theorem (Lower bounds)

Let $x^0 \in \mathbb{R}^n$, $L_f > 0$ and $k < n$. Then, for any first-order method there exists a convex $C^1$ function $f$ with $L_f$-Lipschitz gradient such that:

  • 1. convex case:
$$f(x^k) - f(x^*) \ge \frac{L_f}{8(k+1)^2}\,\|x^* - x^0\|^2.$$
  • 2. strongly convex case:
$$f(x^k) - f(x^*) \ge \frac{\mu_f}{2}\left(\frac{\sqrt{q} - 1}{\sqrt{q} + 1}\right)^{2k}\|x^* - x^0\|^2, \qquad q = L_f/\mu_f \ge 1.$$

Remark: if $k \ge n$ we could use conjugate gradient! However, for imaging $n \gg 1$!

Usually k < n: can we improve convergence speed?

1Nesterov, ’04

4

SLIDE 11

Nesterov acceleration for gradient descent2

To make it faster, build an extrapolated sequence (inertia).

Algorithm 4: Nesterov accelerated gradient descent with fixed step.
Input: $0 < \tau \le 1/L_f$, $x^0 = x^{-1} = y^0 \in X$, $t_0 = 0$.
for $k \ge 0$ do
    $t_{k+1} = \dfrac{1 + \sqrt{1 + 4 t_k^2}}{2}$
    $y^k = x^k + \dfrac{t_k - 1}{t_{k+1}}\,(x^k - x^{k-1})$
    $x^{k+1} = y^k - \tau\,\nabla f(y^k)$
end for

Theorem (Acceleration). Let $\tau \le 1/L_f$ and $(x^k)$ be the sequence generated by the accelerated gradient descent algorithm. Then:
$$f(x^k) - f(x^*) \le \frac{2}{\tau(k+1)^2}\,\|x^0 - x^*\|^2.$$
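A minimal sketch of Algorithm 4 in the same spirit; function and parameter names are illustrative:

```python
import numpy as np

def nesterov_gradient_descent(grad_f, x0, tau, n_iter=100):
    """Accelerated gradient descent with the t_k extrapolation rule of Algorithm 4 (sketch)."""
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()
    t = 0.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x + (t - 1.0) / t_next * (x - x_prev)   # extrapolation (inertia)
        x_prev, x = x, y - tau * grad_f(y)          # gradient step at the extrapolated point
        t = t_next
    return x
```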

2Nesterov, ’83, ’04, G¨

uler ’92

5

SLIDE 13

Standard problem in imaging: composite structure

Variational regularisation of ill-posed inverse problems

Compute a reconstructed version of a given degraded image $f$ by solving:
$$\min_{u \in X}\; \{F(u) := R(u) + \lambda D(u, f)\}, \qquad \lambda > 0,$$
with non-smooth regularisation $R$ and smooth data fidelity $D$.

Examples in inverse problems/imaging:

  • $R(u) = \mathrm{TV}, \mathrm{ICTV}, \mathrm{TGV}, \ell_1$ (Rudin, Osher, Fatemi '92; Chambolle, Lions '97; Bredies '10);
  • $D(u, f) = \|u - f\|_2^2$ (Gaussian: Rudin, Osher, Fatemi '92), $D(u, f) = \|u - f\|_{1,\gamma}$ (Laplace/impulse: Nikolova '04), $D(u, f) = \mathrm{KL}_\gamma(u, f)$ (Poisson: Burger, Sawatzky, Brune, Müller '09), ...

6

SLIDE 16

Composite optimisation

We want to solve:
$$\min_{x \in X}\; \{F(x) := f(x) + g(x)\}$$

  • $f$ is smooth: differentiable, convex, with Lipschitz gradient
    $$\|\nabla f(y) - \nabla f(x)\| \le L_f\,\|y - x\| \quad \text{for any } x, y \in X;$$
  • $g$ is convex, l.s.c., non-smooth, with easy proximal map.

Composite optimisation problem ⇒ Forward-Backward splitting³:

  • forward gradient descent step in $f$;
  • backward implicit gradient descent step in $g$.

Basic algorithm: take $x^0 \in X$, fix $\tau > 0$ and for $k \ge 0$ do:
$$x^{k+1} = \mathrm{prox}_{\tau g}\!\left(x^k - \tau\,\nabla f(x^k)\right) =: T_\tau x^k.$$

Rate of convergence: O(1/k).
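A minimal sketch of the basic forward-backward iteration $T_\tau$, assuming a user-supplied `prox_g(v, tau)` that evaluates $\mathrm{prox}_{\tau g}(v)$; the soft-thresholding example is illustrative:

```python
import numpy as np

def forward_backward(grad_f, prox_g, x0, tau, n_iter=200):
    """Forward-backward splitting: x_{k+1} = prox_{tau*g}(x_k - tau*grad_f(x_k)) = T_tau x_k."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = prox_g(x - tau * grad_f(x), tau)   # forward step on f, backward step on g
    return x

# Example (LASSO-type): f(x) = 0.5*||A x - b||^2, g(x) = lam*||x||_1;
# prox_{tau*g} is soft-thresholding with threshold tau*lam.
soft_threshold = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
# x_hat = forward_backward(lambda x: A.T @ (A @ x - b),
#                          lambda v, tau: soft_threshold(v, tau * lam),
#                          x0=np.zeros(n), tau=1.0 / L_f)
```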

3Combettes, Wajs '05, Nesterov '13, ...

7

SLIDE 18

Accelerated forward-backward, FISTA: previous work

In Nesterov ’04 and Beck, Teboulle ’09, accelerated O(1/k2) convergence of is achieved by extrapolation (as above). Further properties:

  • convergence of iterates (Chambolle, Dossal ’15);
  • monotone variants (Beck, Teboulle ’09, Tseng ’08, Tao, Boley, Zhang ’15)
  • acceleration for inexact evaluation of operators (Villa, Salzo, Baldassarre,

Verri ’13, Bonettini, Prato, Rebegoldi, ’18) Questions

  • 1. Can we say more when f and/or g are strongly convex? Linear

convergence?

  • 2. Can we let the gradient step (proximal parameter) vary along the

iterations AND preserving acceleration?

8

slide-19
SLIDE 19

A strongly convex variant of FISTA (GFISTA)

Let $\mu_f, \mu_g \ge 0$ and $\mu = \mu_f + \mu_g$. For $\tau > 0$ define:
$$q := \frac{\tau\mu}{1 + \tau\mu_g} \in [0, 1).$$

Algorithm 5: GFISTA⁴ (no backtracking)
Input: $0 < \tau \le 1/L_f$, $x^0 = x^{-1} \in X$ and $t_0 \in \mathbb{R}$ s.t. $0 \le t_0 \le 1/\sqrt{q}$.
for $k \ge 0$ do
    $y^k = x^k + \beta_k\,(x^k - x^{k-1})$
    $x^{k+1} = T_\tau y^k = \mathrm{prox}_{\tau g}(y^k - \tau\,\nabla f(y^k))$
    $t_{k+1} = \dfrac{1 - q t_k^2 + \sqrt{(1 - q t_k^2)^2 + 4 t_k^2}}{2}$
    $\beta_k = \dfrac{t_k - 1}{t_{k+1}}\cdot\dfrac{1 + \tau\mu_g - t_{k+1}\tau\mu}{1 - \tau\mu_f}$
end for

Remark: $\mu = q = 0 \;\Rightarrow\;$ standard FISTA.
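A sketch of Algorithm 5, assuming `prox_g(v, tau)` evaluates $\mathrm{prox}_{\tau g}(v)$; $t_{k+1}$ and $\beta_k$ are computed before the extrapolation so that the formulas above can be applied directly:

```python
import numpy as np

def gfista(grad_f, prox_g, x0, tau, mu_f=0.0, mu_g=0.0, t0=0.0, n_iter=200):
    """Sketch of GFISTA without backtracking (Algorithm 5)."""
    mu = mu_f + mu_g
    q = tau * mu / (1.0 + tau * mu_g)              # inverse condition number, in [0, 1)
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()
    t = t0
    for _ in range(n_iter):
        t_next = (1.0 - q * t**2 + np.sqrt((1.0 - q * t**2)**2 + 4.0 * t**2)) / 2.0
        beta = (t - 1.0) / t_next * (1.0 + tau * mu_g - t_next * tau * mu) / (1.0 - tau * mu_f)
        y = x + beta * (x - x_prev)                # extrapolation
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)
        t = t_next
    return x
```

With `mu_f = mu_g = 0` this reduces to standard FISTA, as noted in the remark above.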

4Chambolle, Pock ’16

9

SLIDE 21

GFISTA: acceleration results

Theorem [Chambolle, Pock ’16] Let τ ≤ 1/Lf and 0 ≤ t0√q ≤ 1. Then, the sequence (xk) of iterates of GFISTA satisfies F(xk) − F(x∗) ≤ rk(q)

  • t2

0(F(x0) − F(x∗)) + 1 + τµg

2 x − x∗2

  • ,

where x∗ is a minimiser of F and: rk(q) = min

  • 4

(k + 1)2 , (1 + √q)(1 − √q)k, (1 − √q)k t2

  • .

Note: for µ = q = 0, t0 = 0 this is the standard FISTA convergence result.

Question: What if an estimate of Lf is not available? Backtracking!

10

SLIDE 24

Backtracking idea

For plain 2D gradient descent: [figures: step size too small vs. too big τ; Armijo line-search]

Armijo rule: choose $\tau_k = 1/2^i$, where $i \in \mathbb{N}$ is the smallest integer s.t.
$$f(x^{k+1}) - f(x^k) \le \beta\,\tau_k\,\nabla f(x^k)^\top(x^{k+1} - x^k), \qquad 0 < \beta < 1.$$

FISTA + backtracking:

  • Armijo-type (Beck, Teboulle, ’09): τk+1 ≤ τk for every k.
  • Full backtracking (Scheinberg, Goldfarb, Bai, ’14): larger steps in “flat” areas!
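For concreteness, a sketch of one gradient step with the Armijo rule exactly as stated above (including the $\tau_k$ factor on the right-hand side); the helper name and parameters are illustrative:

```python
import numpy as np

def armijo_gradient_step(f, grad_f, x, beta=0.5, i_max=50):
    """One gradient step with tau_k = 1/2**i, i the smallest integer passing the decrease test."""
    g = grad_f(x)
    fx = f(x)
    for i in range(i_max):
        tau = 1.0 / 2**i
        x_new = x - tau * g
        if f(x_new) - fx <= beta * tau * np.vdot(g, x_new - x).real:
            return x_new, tau
    return x, 0.0   # no admissible step found within i_max trials
```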

11

slide-25
SLIDE 25

GFISTA with backtracking

SLIDE 27

Backtracking strategy and Bregman distance

General idea: check if for every $x \in X$:
$$F(x^{k+1}) + (1 + \tau\mu_g)\frac{\|x - x^{k+1}\|^2}{2\tau} + \frac{\|x^{k+1} - y^k\|^2}{2\tau} - D_f(x^{k+1}, y^k) \;\le\; F(x) + (1 - \tau\mu_f)\frac{\|x - y^k\|^2}{2\tau},$$
where $D_f(x^{k+1}, y^k) := f(x^{k+1}) - f(y^k) - \langle \nabla f(y^k), x^{k+1} - y^k\rangle$ is the Bregman distance of $f$ between $x^{k+1} = T_\tau y^k$ and $y^k$.

Constant steps: such a condition is verified as long as
$$\tau \le \frac{\|x^{k+1} - y^k\|^2}{2\,D_f(x^{k+1}, y^k)} \sim \frac{1}{L_k}, \qquad (*)$$
which is always true if $\tau \le 1/L_f$ with a known estimate $L_f$. However, one can alternatively check (*) along the iterations. This corresponds to computing a local Lipschitz Constant Estimate (LCE).
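A sketch of how condition (*) and the local Lipschitz constant estimate $L_k$ can be checked numerically; names are illustrative:

```python
import numpy as np

def check_condition(f, grad_f, x_new, y, tau):
    """Check condition (*) and return (holds, local Lipschitz constant estimate L_k)."""
    bregman = f(x_new) - f(y) - np.vdot(grad_f(y), x_new - y).real   # D_f(x_new, y)
    sq_dist = np.vdot(x_new - y, x_new - y).real                     # ||x_new - y||^2
    if bregman <= 0.0:
        return True, 0.0            # any step-size is admissible at this point
    L_k = 2.0 * bregman / sq_dist   # local Lipschitz constant estimate (LCE)
    return tau <= 1.0 / L_k, L_k
```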

12

slide-28
SLIDE 28

GFISTA with backtracking: Algorithm

For any $k \ge 0$ we let $\tau = \tau_k$ and define:
$$\tau'_k = \frac{\tau_k}{1 - \tau_k\mu_f}, \qquad q_k = \frac{\mu\tau_k}{1 + \tau_k\mu_g}.$$

Update rule for extrapolation: for any $k \ge 0$ set
$$t_{k+1} = \frac{1 - \frac{q_{k+1}}{1 - q_{k+1}}\frac{\tau'_k}{\tau'_{k+1}}\,t_k^2 + \sqrt{\left(\frac{q_{k+1}}{1 - q_{k+1}}\frac{\tau'_k}{\tau'_{k+1}}\,t_k^2 - 1\right)^2 + \frac{4\,\frac{\tau'_k}{\tau'_{k+1}}\,t_k^2}{1 - q_{k+1}}}}{2}.$$
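The update rule as a small helper function (a sketch; argument names are illustrative):

```python
import numpy as np

def t_update(t_k, tau_prime_k, tau_prime_next, q_next):
    """Extrapolation update for t_{k+1}; arguments are t_k, tau'_k, tau'_{k+1} and q_{k+1}."""
    r = tau_prime_k / tau_prime_next               # ratio tau'_k / tau'_{k+1}
    a = q_next / (1.0 - q_next) * r * t_k**2       # term (q_{k+1}/(1-q_{k+1})) (tau'_k/tau'_{k+1}) t_k^2
    return (1.0 - a + np.sqrt((a - 1.0)**2 + 4.0 * r * t_k**2 / (1.0 - q_next))) / 2.0

# Sanity check: with q_next = 0 and constant steps (r = 1) this reduces to the FISTA rule
# t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2.
```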

13

slide-29
SLIDE 29

GFISTA with backtracking: Algorithm

Algorithm 6: GFISTA with backtracking

Input: $\mu_f, \mu_g$, $\tau_0 > 0$, $q_0$, $\rho \in (0, 1)$, $x^0 = x^{-1} \in X$ and $t_0 \in \mathbb{R}$ s.t. $0 \le t_0 \le 1/\sqrt{q_0}$.
for $k \ge 0$ do
    $y^k = x^k + \beta_k\,(x^k - x^{k-1})$; set $i_{bt} = 0$;
    if too close to LCE then
        while backtracking condition (*) is not verified & $i_{bt} \le i_{\max}$ do
            keep/reduce step-size: $\tau_{k+1} = \rho^{i_{bt}}\,\tau_k$;
            compute $x^{k+1} = T_{\tau_{k+1}} y^k = \mathrm{prox}_{\tau_{k+1} g}(y^k - \tau_{k+1}\,\nabla f(y^k))$   (1)
            $i_{bt} = i_{bt} + 1$;
        end while
    else if far enough from LCE then
        increase step-size: $\tau_{k+1} = \tau_k/\rho$;
        compute $x^{k+1}$ using (1);
    end if
    Update $q_{k+1}$, $\tau'_{k+1}$, $t_{k+1}$.
    Set $\beta_{k+1} = \dfrac{1 - q_{k+1} t_{k+1}}{1 - q_{k+1}}\cdot\dfrac{t_k - 1}{t_{k+1}}$.
end for

Too close/too far: how tight is (*)? Reduce costs due to (1).
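A simplified sketch of Algorithm 6: instead of the explicit "too close / too far from the LCE" test, this version always tries the larger step $\tau_k/\rho$ and then shrinks by $\rho$ until (*) holds; this simplification and all names are mine, not the authors':

```python
import numpy as np

def gfista_backtracking(f, grad_f, prox_g, x0, tau0, mu_f=0.0, mu_g=0.0,
                        rho=0.9, t0=1.0, n_iter=200, i_max=50):
    """Simplified sketch of GFISTA with adaptive backtracking (Algorithm 6)."""
    mu = mu_f + mu_g
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()
    tau, t, beta = tau0, t0, 0.0
    for _ in range(n_iter):
        y = x + beta * (x - x_prev)
        f_y, grad_y = f(y), grad_f(y)

        # try a larger step, then shrink by rho until the descent condition (*) holds
        tau_new = tau / rho
        for _ in range(i_max):
            x_new = prox_g(y - tau_new * grad_y, tau_new)
            bregman = f(x_new) - f_y - np.vdot(grad_y, x_new - y).real   # D_f(x_new, y)
            if 2.0 * tau_new * bregman <= np.vdot(x_new - y, x_new - y).real:
                break                                                    # (*) is verified
            tau_new *= rho

        # update q_{k+1}, tau'_{k+1}, t_{k+1} and beta_{k+1}
        tau_p, tau_p_new = tau / (1.0 - tau * mu_f), tau_new / (1.0 - tau_new * mu_f)
        q_new = mu * tau_new / (1.0 + tau_new * mu_g)
        a = q_new / (1.0 - q_new) * (tau_p / tau_p_new) * t**2
        t_new = (1.0 - a + np.sqrt((a - 1.0)**2
                                   + 4.0 * (tau_p / tau_p_new) * t**2 / (1.0 - q_new))) / 2.0
        beta = (1.0 - q_new * t_new) / (1.0 - q_new) * (t - 1.0) / t_new

        x_prev, x, tau, t = x, x_new, tau_new, t_new
    return x
```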

14

SLIDE 32

Analogies/differences with FISTA-type algorithms: update rule

$$t_{k+1} = \frac{1 - \frac{q_{k+1}}{1 - q_{k+1}}\frac{\tau'_k}{\tau'_{k+1}}\,t_k^2 + \sqrt{\left(\frac{q_{k+1}}{1 - q_{k+1}}\frac{\tau'_k}{\tau'_{k+1}}\,t_k^2 - 1\right)^2 + \frac{4\,\frac{\tau'_k}{\tau'_{k+1}}\,t_k^2}{1 - q_{k+1}}}}{2}$$

No backtracking, convex case: if $\mu = q_k = 0$ and $\tau_k = \tau_{k+1}$ for any $k \ge 0$, this is the FISTA update rule.

FISTA with adaptive backtracking: if $\mu = q_k = 0$ for any $k \ge 0$, the rule reduces to:
$$t_{k+1} = \frac{1 + \sqrt{1 + 4\,\frac{\tau_k}{\tau_{k+1}}\,t_k^2}}{2},$$
which is the same as the one proposed by Scheinberg et al. '14 for fast adaptive backtracking.

15

slide-33
SLIDE 33

Accelerated convergence rates

SLIDE 35

Convergence rates: worst-case analysis

Define:
$$L_w := \max\left\{\frac{L_f}{\rho},\, \rho L_0\right\}, \qquad q_w := \frac{\mu}{L_w + \mu_g},$$
with $q_w$ being the worst-case inverse condition number.

Theorem. Let $x^0 \in X$, $\tau_0 > 0$ and let $(x^k)$ be the sequence produced by the GFISTA algorithm with backtracking. If $t_0 \ge 0$ and $\sqrt{q_0}\,t_0 \le 1$, we have:
$$F(x^k) - F(x^*) \le r_k\,(L_w - \mu_f)\left[ \frac{\tau_0\,t_0^2}{1 - \mu_f\tau_0}\,\big(F(x^0) - F(x^*)\big) + \frac{1}{2}\,\|x^0 - x^*\|^2 \right],$$
where the decay rate is defined as:
$$r_k := \min\left\{ \frac{4}{(k+1)^2},\; (1 - \sqrt{q_w})^{k-1},\; \frac{(1 - \sqrt{q_w})^k}{t_0^2} \right\}.$$

Disclaimer: compare the recent work by Florea, Vorobyov (preprint, '17), where the same result is obtained via a generalised estimate sequence argument.

16

slide-36
SLIDE 36

Monotone variants (M-GFISTA)

In order to make the objective values non-increasing⁵, we can simply set:
$$y^k = x^k + \beta_k\left[ (x^k - x^{k-1}) + \frac{t_k}{t_k - 1}\left(T_{\tau_k}(y^{k-1}) - x^k\right) \right].$$
This suggests an easy rule to select $x^{k+1}$ at any iteration:
$$x^{k+1} = \begin{cases} T_{\tau_{k+1}}(y^k) & \text{if } F(T_{\tau_{k+1}}(y^k)) \le F(x^k), \\ x^k & \text{otherwise.} \end{cases}$$
Same computations, same convergence rates.
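The selection rule as a small helper (a sketch; `F` and `T_tau_next` are assumed callables evaluating the objective and $T_{\tau_{k+1}}$):

```python
def monotone_step(F, T_tau_next, x_k, y_k):
    """Monotone (M-GFISTA) selection rule: keep the previous iterate if the candidate increases F."""
    candidate = T_tau_next(y_k)                      # candidate x^{k+1} = T_{tau_{k+1}}(y^k)
    return candidate if F(candidate) <= F(x_k) else x_k
```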

5Beck, Teboulle ’09, Tseng ’08, Tao, Boley, Zhang ’16

17

slide-37
SLIDE 37

Imaging applications

SLIDE 39

Huber-TV Gaussian denoising

Given a noisy image $u^0 \in \mathbb{R}^{m\times n}$ corrupted by Gaussian noise $\mathcal{N}(0, \sigma^2)$, use the TV ROF⁶ model:
$$\min_u\; \lambda\|Du\|_{2,1} + \frac{1}{2}\|u - u^0\|_2^2, \qquad \|Du\|_{2,1} = \sum_{i,j=1}^{m,n}\sqrt{(Du)_{i,j,1}^2 + (Du)_{i,j,2}^2},$$
where $Du$ is the finite-difference-discretised gradient and $\lambda > 0$.

Strongly convex variant: for $\varepsilon \ll 1$, $C^1$ Huber regularisation
$$h_\varepsilon(t) := \begin{cases} \dfrac{t^2}{2\varepsilon} & \text{for } |t| \le \varepsilon, \\[4pt] |t| - \dfrac{\varepsilon}{2} & \text{for } |t| > \varepsilon. \end{cases}$$
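The Huber function as a NumPy one-liner (a minimal sketch):

```python
import numpy as np

def huber(t, eps):
    """C^1 Huber function h_eps: quadratic near zero, linear with slope 1 away from it."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= eps, t**2 / (2.0 * eps), np.abs(t) - eps / 2.0)
```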

6Rudin, Osher, Fatemi, ’92

18

slide-40
SLIDE 40

Huber-TV Gaussian denoising

Given a noisy image $u^0 \in \mathbb{R}^{m\times n}$ corrupted by Gaussian noise $\mathcal{N}(0, \sigma^2)$, use the Huberised TV ROF⁶ model:
$$\min_u\; \lambda H_\varepsilon(u) + \frac{1}{2}\|u - u^0\|_2^2, \qquad H_\varepsilon(u) := \sum_{i,j=1}^{m,n} h_\varepsilon\!\left(\sqrt{(Du)_{i,j,1}^2 + (Du)_{i,j,2}^2}\right),$$
where $Du$ is the finite-difference-discretised gradient, $\lambda > 0$ and $h_\varepsilon$ is the Huber function defined above.

6Rudin, Osher, Fatemi, ’92

18

SLIDE 43

Huber-TV Gaussian denoising: dual formulation

The Huber-TV dual problem reads:
$$\min_p\; \underbrace{\frac{1}{2}\|D^*p - u^0\|_2^2}_{\text{"}f\text{"}} + \underbrace{\frac{\varepsilon}{2\lambda}\|p\|_2^2 + \delta_{\{\|\cdot\|_{2,\infty}\le\lambda\}}(p)}_{\text{"}g\text{"}},$$
where $D^*$ is the discretised negative finite-difference divergence and:
$$\delta_{\{\|\cdot\|_{2,\infty}\le\lambda\}}(p) = \begin{cases} 0 & \text{if } |p_{i,j}|_2 \le \lambda \text{ for any } i, j, \\ +\infty & \text{otherwise.} \end{cases}$$

Note:

  • $\nabla f(p) = D(D^*p - u^0) \;\Rightarrow\; L_f \le 8$;
  • $\mathrm{prox}_{\tau g}$ is easy to compute and $\mu_g = \mu = \varepsilon/\lambda$.

Use monotone GFISTA with backtracking. . .
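One way to evaluate $\mathrm{prox}_{\tau g}$ for this dual $g$ is a pointwise shrinkage followed by a projection onto the $\lambda$-ball. A sketch, assuming the dual variable is stored as an array of shape `(m, n, 2)` (this storage convention is an assumption, not from the slides):

```python
import numpy as np

def prox_dual_g(p, tau, eps, lam):
    """Sketch of prox_{tau*g} for g(p) = eps/(2*lam)*||p||_2^2 + indicator{|p_ij|_2 <= lam}."""
    p = p / (1.0 + tau * eps / lam)                        # shrinkage from the quadratic term
    norms = np.sqrt(np.sum(p**2, axis=-1, keepdims=True))  # pointwise 2-norms |p_ij|_2
    return p / np.maximum(1.0, norms / lam)                # pixel-wise projection onto the lam-ball
```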

19

slide-44
SLIDE 44

Huber-TV Gaussian denoising: results

Parameters: $u^0 \in \mathbb{R}^{256\times256}$, $\sigma^2 = 0.005$, $\varepsilon = 0.01$, $\lambda = 0.1$; hence $\mu = \varepsilon/\lambda = 0.1$.

[Figures: (a) ground truth, (b) noisy data $u^0$, (c) reference solution $u^*$]

20

slide-45
SLIDE 45

Huber-TV Gaussian denoising: results

[Figure 1, underestimating $L_0 = 5$: (a) convergence rates, (b) Lipschitz constant estimate. GFISTA parameters: $\rho = 0.9$, $t_0 = 1$, $p^0 = Du^0$.]

20

SLIDE 47

Huber-TV Gaussian denoising: results

[Figure 1, overestimating $L_0 = 20$: (a) convergence rates, (b) Lipschitz constant estimate. GFISTA parameters: $\rho = 0.9$, $t_0 = 1$, $p^0 = Du^0$.]

Remark: $O(1/k^2)$ convergence of naive FISTA ($\mu = 0$).

20

SLIDE 49

Strongly convex TV-Poisson denoising: primal formulation

Poisson noise is typical in astronomy and microscopy imaging... For $\varepsilon \ll 1$, consider the $\varepsilon$-strongly convex TV-Poisson denoising model:
$$\min_u\; \underbrace{\lambda\|Du\|_{2,1} + \frac{\varepsilon}{2}\|u\|_2^2}_{\text{"}g\text{"}} + \underbrace{\widetilde{\mathrm{KL}}(u^0, u)}_{\text{"}f\text{"}},$$
where $\widetilde{\mathrm{KL}}(u^0, u)$ is a differentiable version of the Kullback-Leibler function⁷:
$$\widetilde{\mathrm{KL}}(u^0, u) := \sum_{i,j=1}^{m,n} \begin{cases} u_{i,j} + b_{i,j} - u^0_{i,j} + u^0_{i,j}\log\dfrac{u^0_{i,j}}{u_{i,j} + b_{i,j}} & \text{if } u_{i,j} \ge 0, \\[6pt] \dfrac{u^0_{i,j}}{2 b_{i,j}^2}\,u_{i,j}^2 + \left(1 - \dfrac{u^0_{i,j}}{b_{i,j}}\right) u_{i,j} + b_{i,j} - u^0_{i,j} + u^0_{i,j}\log\dfrac{u^0_{i,j}}{b_{i,j}} & \text{else,} \end{cases}$$
and $b \in \mathbb{R}^{m\times n}$ is the background image. We can crudely estimate:
$$L_f = \max_{i,j} \frac{u^0_{i,j}}{b_{i,j}^2}, \qquad \text{for } u^0, b > 0.$$
Moreover, $\mathrm{prox}_{\tau g}$ can be computed by solving a TV ROF model.
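A sketch of the gradient of $\widetilde{\mathrm{KL}}$ and of the crude Lipschitz estimate above; array names are illustrative, and both branches are evaluated elementwise by `np.where`:

```python
import numpy as np

def kl_smooth_grad(u, u0, b):
    """Gradient of the smoothed KL term used as the smooth part "f" (sketch)."""
    return np.where(u >= 0.0,
                    1.0 - u0 / (u + b),              # derivative of the KL branch (u >= 0)
                    u0 / b**2 * u + 1.0 - u0 / b)    # derivative of the quadratic extension (u < 0)

def lipschitz_estimate(u0, b):
    """Crude global estimate from the slide: L_f = max_{i,j} u0_{i,j} / b_{i,j}^2 (u0, b > 0)."""
    return np.max(u0 / b**2)
```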

7Chambolle, Ehrhardt, Richtárik, Schönlieb '17

21

slide-50
SLIDE 50

Strongly convex TV-Poisson denoising: results

Parameters: $u^0 \in \mathbb{R}^{256\times256}$, $\varepsilon = \mu = 0.15$, $\lambda = 0.2$, $b$ constant, $L_f \le 45$.

[Figures: (a) ground truth, (b) noisy data $u^0$, (c) reference solution $u^*$]

22

slide-51
SLIDE 51

Strongly convex TV-Poisson denoising: results

[Figure 2, overestimating $L_0 = 60$: (a) convergence rates, (b) Lipschitz constant estimate. GFISTA parameters: $\rho = 0.8$, $t_0 = 1$, initialisation at the noisy image $u^0$. Relative objective: $(F(u^k) - F(u^*))/(F(u^0) - F(u^*))$.]

slide-52
SLIDE 52

Strongly convex TV-Poisson denoising: results

[Figure 2: monotone decay of the objective with/without backtracking.]

22

slide-53
SLIDE 53

Conclusions & outlook

SLIDE 55

Conclusions & outlook

Take-home messages:

  • If $\mu_f, \mu_g > 0$, linear convergence can be shown for GFISTA;
  • adaptive backtracking provides a local estimate $L_k$ along the iterations;
  • GFISTA with backtracking can be easily implemented and used for imaging applications!

Outlook:

  • Estimate of $\mu_f$ and $\mu_g$? Restarting! (O'Donoghue, Candès '12);
  • milder (non-Lipschitz) differentiability assumptions (Salzo '17)?

23

slide-56
SLIDE 56

Main references

  • L. Calatroni, A. Chambolle, Backtracking strategies for accelerated descent methods with smooth composite objectives, arXiv:1709.09004, 2017.
  • K. Scheinberg, D. Goldfarb, X. Bai, Fast first-order methods for composite convex optimization with backtracking, Foundations of Computational Mathematics 14(3), 2014.
  • M. I. Florea, S. Vorobyov, A generalized accelerated composite gradient method: uniting Nesterov's fast gradient method and FISTA, arXiv:1705.10266, 2017.
  • A. Chambolle, T. Pock, An introduction to continuous optimization for imaging, Acta Numerica 25, 2016.

24

slide-57
SLIDE 57

Thanks for your attention! Questions? luca.calatroni@polytechnique.edu

24