Convergence rates in convex optimization: beyond the worst-case with the help of geometry
SLIDE 1

Convergence rates in convex optimization: beyond the worst-case with the help of geometry

Guillaume Garrigos, with Lorenzo Rosasco and Silvia Villa

École Normale Supérieure

Journées du GdR MOA/MIA, Bordeaux, 19 Oct 2017

SLIDES 2-6

Introduction

Setting: X a Hilbert space, f : X → ℝ ∪ {+∞} convex and l.s.c.
Problem: minimize f(x) over x ∈ X.
Tool: my favorite algorithm.

As optimizers, we often face the same questions concerning the convergence of an algorithm:
- (Qualitative result) Do the iterates (xₙ)ₙ∈ℕ converge weakly or strongly?
- (Quantitative result) For the iterates and/or the values: sublinear O(n^(−α)) rates, linear O(εⁿ), superlinear?

The answer depends on the algorithm and on the assumptions made on f. Here we essentially consider first-order descent methods, and more specifically the forward-backward method.

SLIDE 7

Contents
1. Classic theory
2. Better rates with the help of geometry: identifying the geometry of a function; exploiting the geometry
3. Inverse problems in Hilbert spaces: linear inverse problems; sparse inverse problems

SLIDES 8-9

Classic convergence results

Let f = g + h be convex, with h L-Lipschitz smooth, and let
  xₙ₊₁ = prox_{λg}(xₙ − λ∇h(xₙ)),  λ ∈ ]0, 2/L[.

Theorem (general convex case)
- If argmin f = ∅: (xₙ) diverges, and there are no rates for f(xₙ) − inf f.
- If argmin f ≠ ∅: (xₙ) converges weakly to some x∞ ∈ argmin f, and f(xₙ) − inf f = o(n⁻¹).

Theorem (strongly convex case)
Assume that f is strongly convex. Then (xₙ) converges strongly to x∞ ∈ argmin f, and both iterates and values converge linearly.
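To fix ideas, here is a minimal NumPy sketch of the forward-backward iteration (an illustrative sketch, not code from the talk; the Lasso instance, names, and parameters below are assumptions):

```python
import numpy as np

def forward_backward(grad_h, prox_g, x0, lam, n_iter=500):
    """Forward-backward: x_{n+1} = prox_{lam*g}(x_n - lam*grad_h(x_n)).
    For h with L-Lipschitz gradient, take lam in ]0, 2/L[."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = prox_g(x - lam * grad_h(x), lam)
    return x

# Example: the Lasso f(x) = alpha*||x||_1 + 0.5*||Ax - y||^2, whose prox step
# is soft-thresholding; forward-backward then specializes to ISTA.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
y = rng.standard_normal(20)
alpha = 0.1
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad h
grad_h = lambda x: A.T @ (A @ x - y)
prox_g = lambda z, lam: np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)

x_sol = forward_backward(grad_h, prox_g, np.zeros(50), lam=1.0 / L)
```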

SLIDES 10-13

Classic convergence results

Assume f is convex and (xₙ)ₙ∈ℕ is generated by forward-backward.

                         function values     iterates
  argmin f = ∅           O(1)                diverge
  argmin f ≠ ∅           o(n⁻¹)              weak convergence
  ???                    ???                 ???
  f strongly convex      linear              linear

What can we say in the gap between the general convex case and the strongly convex case? → Use geometry!

SLIDES 14-18

Known examples

Let A ∈ L(X, Y) and y ∈ Y.

Least squares: f(x) = ½‖Ax − y‖², with iteration xₙ₊₁ = xₙ − τA*(Axₙ − y).
- If R(A) is closed: linear convergence.
- Else: strong convergence of the iterates, but possibly arbitrarily slow.

Lasso: f(x) = α‖x‖₁ + ½‖Ax − y‖², with iteration xₙ₊₁ = S_{ατ}(xₙ − τA*(Axₙ − y)) (ISTA).
- In X = ℝᴺ, the convergence is linear.¹
- In X = ℓ²(ℕ), ISTA converges strongly.² Linear rates can also be obtained under some conditions³, which are in fact not necessary.⁴

Gap between theory and practice.

¹ Bolte, Nguyen, Peypouquet, Suter (2015), based on Li (2012)
² Daubechies, Defrise, De Mol (2004)
³ Bredies, Lorenz (2008)
⁴ End of this talk
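To see the non-closed-range slowdown concretely, here is a hedged numeric sketch (every choice below, in particular the diagonal surrogate operator, is an illustrative assumption): in finite dimension R(A) is always closed, but letting the singular values decay to 0 mimics the infinite-dimensional behavior.

```python
import numpy as np

# Diagonal surrogate for an operator whose range is "barely" closed:
# singular values sigma_k = 1/k decay to 0, so the k-th component of the
# error contracts only by (1 - tau*sigma_k^2) per iteration.
n = 200
sigma = 1.0 / np.arange(1, n + 1)
x_dag = 1.0 / np.arange(1, n + 1) ** 1.5     # the solution we hope to recover
y = sigma * x_dag                             # y = A x_dag for the diagonal A

tau = 1.0 / sigma.max() ** 2                  # step in ]0, 2/L[ with L = ||A||^2
x = np.zeros(n)
for _ in range(10_000):
    x -= tau * sigma * (sigma * x - y)        # x_{n+1} = x_n - tau*A*(A x_n - y)

print(np.linalg.norm(x - x_dag))              # still visibly nonzero after 10^4 steps
```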

SLIDE 19

Contents (section divider: 2. Better rates with the help of geometry)

SLIDES 20-23

Conditioned and Łojasiewicz functions

Let p ≥ 1 and let Ω ⊂ X be an arbitrary set.

Definition. We say that f is p-conditioned on Ω if there exists γ_Ω > 0 such that
  for all x ∈ Ω,  (γ_Ω/p) · dist(x, argmin f)ᵖ ≤ f(x) − inf f.

- The exponent p governs the local geometry of f, and hence the rates of convergence. Easy to get.
- γ_Ω governs the constant in the rates. Hard to estimate properly.
- "Equivalent" to the Łojasiewicz inequality / metric subregularity.¹

¹ Bolte, Nguyen, Peypouquet, Suter (2015); Garrigos, Rosasco, Villa (2016).
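Because p drives the rates, it is sometimes useful to estimate it numerically. A minimal sketch, assuming a one-dimensional slice and a known minimizer (both assumptions for illustration): the definition gives f(x) − inf f ≈ c · dist(x, argmin f)ᵖ near argmin f, so p is the slope of a log-log fit.

```python
import numpy as np

def estimate_p(f, f_min=0.0, x_min=0.0, radii=np.logspace(-4, -1, 30)):
    """Estimate p in f(x) - inf f ~ c * dist(x, argmin f)^p along a 1-d slice,
    assuming inf f = f_min is attained at x_min (illustrative assumptions)."""
    gaps = np.array([f(x_min + r) - f_min for r in radii])
    p, _ = np.polyfit(np.log(radii), np.log(gaps), 1)   # slope = exponent p
    return p

print(estimate_p(lambda x: abs(x) ** 4))   # ~4.0: f(x) = |x|^4 is 4-conditioned at 0
```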

SLIDES 24-29

Identifying the geometry: some examples

- Strongly convex functions are 2-conditioned on X, with γ_X = γ.
- f(x) = ½‖Ax − y‖²:
  if R(A) is closed, f is 2-conditioned on X, with γ_X = σ*min(A*A), the smallest nonzero eigenvalue of A*A;
  else, complicated (see later).
- In ℝᴺ, convex piecewise polynomial functions (degree d) are p-conditioned¹ on sublevel sets, with p = 1 + (d − 1)ᴺ, but γ_[f≤r] is unknown. Example: f(x) = α‖x‖₁ + ½‖Ax − y‖².
- Almost any simple function used in practice: powers of norms ‖x‖ᵖ_α, the KL divergence, etc.
- Semi-algebraic functions are conditioned around minimizers.² p and γ unknown.

¹ Yang (2009) + Li (2012)
² Bolte, Daniilidis, Lewis, Shiota (2007)

SLIDES 30-31

Identifying the geometry: two rules

Theorem: sum rule¹
Assume that f₁ and f₂ are respectively p₁- and p₂-conditioned, up to linear perturbations, on Ω ⊂ X. Then, under a qualification condition, f₁ + f₂ is p-conditioned on Ω with p = max{p₁, p₂}.

Theorem: composition with a linear operator (closed range)¹
Assume that f is p-conditioned and smooth, up to linear perturbations, on Ω ⊂ X. Then, under some qualification conditions, f ∘ A is p-conditioned on A⁻¹Ω.

This is not always true without the qualification condition! See the example ‖M‖_* + ‖AM − Y‖².

¹ Lewis, Drusvyatskiy (2016) for p = 2; G., Rosasco, Villa (2016) for p ≥ 1.

SLIDE 32

Contents (section divider: exploiting the geometry)

SLIDES 33-36

Exploiting the geometry: convergence result

Theorem (G., Rosasco, Villa, 2016) and (Frankel, G., Peypouquet, 2014)
Let (xₙ)ₙ∈ℕ be generated by forward-backward, and suppose:
(Localization) (xₙ)ₙ∈ℕ ⊂ Ω;
(Geometry) f is p-conditioned on Ω.
Then xₙ converges strongly to a minimizer x† of f. Moreover, for all n ∈ ℕ:
1. if p = 2, linear convergence: there exist ε ∈ ]0, 1[ and C > 0 with
   f(xₙ₊₁) − inf f ≤ ε(f(xₙ) − inf f)  and  ‖xₙ − x†‖ ≤ C(√ε)ⁿ;
2. if p > 2, sublinear convergence: there exist C₁, C₂ > 0 with
   f(xₙ) − inf f ≤ C₁ n^(−p/(p−2))  and  ‖xₙ − x†‖ ≤ C₂ n^(−1/(p−2)).

NB: all the constants depend on (L, λ, p, γ_{f,Ω}, f(x₀) − inf f).
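A quick numeric sanity check of the p > 2 regime, under the assumption that gradient descent on the model function f(x) = |x|ᵖ is representative (a sketch, not the paper's experiment): for p = 4 the predicted value rate is O(n^(−p/(p−2))) = O(n⁻²).

```python
import numpy as np

p = 4.0
grad = lambda x: p * np.sign(x) * abs(x) ** (p - 1)   # f(x) = |x|^p, inf f = 0

x, step, vals = 1.0, 0.05, []
for _ in range(100_000):
    x -= step * grad(x)                               # plain gradient descent
    vals.append(abs(x) ** p)

# Empirical decay exponent: slope of log f(x_n) against log n, expected ~ -2.
ns = np.arange(1, 100_001)
slope, _ = np.polyfit(np.log(ns[1_000:]), np.log(np.array(vals)[1_000:]), 1)
print(slope)
```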

SLIDES 37-41

Exploiting the geometry: convergence result (remarks)

Recall the two hypotheses: (Localization) (xₙ)ₙ∈ℕ ⊂ Ω; (Geometry) f is p-conditioned on Ω. Then p = 2 gives linear rates, p > 2 sublinear rates.

Some remarks on the convergence result:
- These rates are optimal (see f(x) = |x|ᵖ).
- The rates involve a generalized condition number κ ∝ L/γ_{f,Ω}. For p = 2, ε = κ/(κ + 1).
- These results extend to the nonconvex setting.
- These results extend to general first-order descent methods.

SLIDES 42-53

On the localization/geometry trade-off

Recall the hypotheses: (Localization) (xₙ)ₙ∈ℕ ⊂ Ω; (Geometry) f is p-conditioned on Ω. The localization hypothesis may look like a trick. And why allow a general Ω ⊂ X?

1. It clarifies local vs global rates.
- Local: if there exist (δ, r) ∈ ]0, +∞[² such that f is p-conditioned on Ω := B(x̄, δ) ∩ [f − inf f ≤ r], then Fejér monotonicity + descent give some N ∈ ℕ with xₙ ∈ Ω for all n ≥ N, hence local rates.
- Global: if for all (δ, r) ∈ ]0, +∞[², f is p-conditioned on Ω := B(x̄, δ) ∩ [f − inf f ≤ r], then Fejér + descent give xₙ ∈ Ω for all n ≥ 0, hence global rates.

2. Some functions have a nonlocal geometry Ω: f(x) = ‖Ax − y‖².
- If Im A is not closed, Haraux and Jendoubi show that no conditioning holds on any ball B(x̄, δ).
- We prove that conditioning holds on "Sobolev"-like subspaces.

3. We can restrict to low-dimensional sets.
- If f = g + h with h smooth and g partially smooth, plus a qualification condition, then there is N ∈ ℕ with xₙ ∈ M for all n ≥ N (identification of the active manifold).
- → Conditioning on M is enough; no need for strong convexity.

SLIDES 54-58

Updated results

Assume f convex and (xₙ)ₙ∈ℕ generated by forward-backward.

                          function values      iterates
  argmin f = ∅            O(1)                 diverge
  argmin f ≠ ∅            o(n⁻¹)               weak convergence
  geometry, p > 2         O(n^(−p/(p−2)))      O(n^(−1/(p−2)))
  geometry, p = 2         linear               linear
  geometry, 1 < p < 2     superlinear          superlinear
  geometry, p = 1         finite               finite

- This spectrum covers "almost" all convex functions in finite dimension.¹
- The hypothesis needed for linear rates is minimal:
  Proposition. If linear rates hold on Ω, i.e. (∃ε ∈ ]0, 1[)(∀x ∈ Ω) dist(FB(x), argmin f) ≤ ε dist(x, argmin f), then f is 2-conditioned on Ω.
- Up to now, the infinite-dimensional setting is less well understood.

¹ Bolte, Daniilidis, Ley, Mazet (2010)

SLIDE 59

Contents (section divider: 3. Inverse problems in Hilbert spaces)

SLIDES 60-62

Least squares: f(x) = ½‖Ax − y‖²

Assume that R(A) is not closed and that y ∈ dom A†. The forward-backward method becomes
  xₙ₊₁ = xₙ − λA*(Axₙ − y),  x₀ = 0.
Then xₙ converges to x† := A†y. But how fast?
→ Old answer: it depends on the regularity of x†.

In inverse problems, the spaces R((A*A)^µ) play the role that Sobolev spaces play in L². Example (Sobolev regularity): if X = Y = L²([0, 2π]) and A is the integration operator, then R((A*A)^µ) = H^{2µ}([0, 2π]).

SLIDE 63

Least squares: convergence analysis

Theorem: geometry on Sobolev spaces
The least-squares objective f is p-conditioned on each affine space x† + R((A*A)^µ), with exponent p = 2 + µ⁻¹.

Fact: if x† ∈ R((A*A)^µ) and x₀ = 0, then (xₙ)ₙ∈ℕ ⊂ x† + R((A*A)^µ).

Theorem: convergence for Landweber's algorithm
If x₀ = 0 and x† ∈ R((A*A)^µ), then the convergence is sublinear:
  f(xₙ) − inf f = O(n^(−(1+2µ)))  and  ‖xₙ − x†‖ = O(n^(−µ)).

NB: the exponent p = 2 + µ⁻¹ and the rates are tight.
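A hedged numeric check of the value rate, again on a diagonal surrogate (every choice below is an illustrative assumption): with σ_k = 1/k and a source element chosen so that x† sits right at the edge of R((A*A)^µ), the measured log-log slope of f(xₙ) should come out close to −(1 + 2µ).

```python
import numpy as np

n, mu = 400, 0.5
k = np.arange(1, n + 1)
sigma = 1.0 / k                          # singular values of a diagonal A
w = k ** -0.51                           # barely square-summable source element
x_dag = sigma ** (2 * mu) * w            # x_dag in R((A*A)^mu), saturating the rate
y = sigma * x_dag

lam = 1.0                                # step in ]0, 2/L[ with L = sigma.max()**2 = 1
x, gaps = np.zeros(n), []
for _ in range(20_000):
    x -= lam * sigma * (sigma * x - y)               # Landweber, x_0 = 0
    gaps.append(0.5 * np.sum((sigma * x - y) ** 2))  # f(x_n) - inf f, inf f = 0

it = np.arange(1, 20_001)
slope, _ = np.polyfit(np.log(it[100:]), np.log(np.array(gaps)[100:]), 1)
print(slope)   # expected close to -(1 + 2*mu) = -2
```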

SLIDES 64-67

Least squares: what if argmin ‖Ax − y‖² = ∅?

It might be that x† = A†y doesn't exist...
- Typically, in learning we look for a function in L²(X × Y, ρ), but in practice we work in an RKHS X ⊂ L².
- Even if f has no minimizers, we still want to estimate f(xₙ) − inf f → 0. It will depend on how far the solution is from X.
- We look at how regular y† := proj(y, Im A) is within Im A ⊂ Y.

SLIDE 68

Least squares without minimizers: convergence analysis

Theorem: geometry on Sobolev spaces (w.r.t. the data space Y)
The least-squares f is "p-conditioned" on each affine space A⁻¹(y† + R((AA*)^ν)), ν > 0, with exponent p = 2 + (ν − 1/2)⁻¹.

Fact: if ν < 1/2 then p < 0!! In that case f behaves like 1/t^|p|.

Theorem: convergence for Landweber's algorithm
If x₀ = 0 and y† ∈ R((AA*)^ν), then the convergence is sublinear:
  f(xₙ) − inf f = O(n^(−2ν)).

SLIDE 69

Updated results

Assume f convex and (xₙ)ₙ∈ℕ generated by a first-order descent method.

                          function values      iterates
  argmin f = ∅            O(1)                 diverge
  geometry, p < 0         O(n^(−p/(p−2)))      diverge
  argmin f ≠ ∅            o(n⁻¹)               weak convergence
  geometry, p > 2         O(n^(−p/(p−2)))      O(n^(−1/(p−2)))
  geometry, p = 2         linear               linear
  geometry, 1 < p < 2     superlinear          superlinear
  geometry, p = 1         finite               finite

SLIDE 70

Contents (section divider: sparse inverse problems)

SLIDES 71-74

Lasso in Hilbert spaces

Consider the Lasso in ℓ²(ℕ):  f(x) = α‖x‖₁ + ½‖Ax − y‖².
How fast does ISTA converge? O(1/n)? Linearly?
- linear rates if A is injective on finite supports;
- linear rates if a qualification condition holds.

Theorem (G., Rosasco, Villa, 2017)
There exists Ω such that (xₙ)ₙ∈ℕ ⊂ Ω and f is 2-conditioned on Ω. So ISTA always converges linearly.

A similar result holds when replacing ‖·‖₁ by ‖·‖₁ + ‖·‖ₚᵖ.
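A hedged finite-dimensional illustration of the always-linear behavior (the instance and all names are assumptions; a wide A makes f neither strongly convex nor the smooth part injective): the gap f(xₙ) − min f should decay geometrically, i.e. linearly on a log scale.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, alpha = 30, 100, 0.2
A = rng.standard_normal((m, d))     # wide matrix: f is not strongly convex
y = rng.standard_normal(m)
tau = 1.0 / np.linalg.norm(A, 2) ** 2

soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
f = lambda x: alpha * np.abs(x).sum() + 0.5 * np.sum((A @ x - y) ** 2)

x, vals = np.zeros(d), []
for _ in range(3_000):
    x = soft(x - tau * A.T @ (A @ x - y), alpha * tau)   # one ISTA step
    vals.append(f(x))

gaps = np.array(vals) - vals[-1]            # last value as a proxy for min f
rho = (gaps[200] / gaps[100]) ** (1 / 100)  # mean per-step contraction factor
print(rho)                                  # strictly below 1: linear convergence
```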

SLIDES 75-76

Conclusion: if you had to remember ONE thing

You have a descent-related (dissipative?) algorithm, and strong convexity would give you strong convergence and better rates? Try to use 2-conditioning instead:
  γ dist(x, argmin f)² ≤ f(x) − inf f.
→ It gives the same results as strong convexity.
→ It applies to a far more general class of functions (and is actually sharp for linear rates).

SLIDES 77-83

Conclusion/Discussion

- Structural results allow a practical identification of the geometry.
- Geometry sheds new light on a priori unrelated results.
- Quantitative characterization of the geometry in the nonconvex case is an active topic. E.g.: f(w) = Σᵢ ℓ(⟨xᵢ, w⟩ − yᵢ).
- Descent methods are very well understood. The analysis holds for any general first-order descent method, i.e. any sequence satisfying
  1. (descent) a‖xₙ₊₁ − xₙ‖² ≤ f(xₙ) − f(xₙ₊₁);
  2. (first order) ‖∂f(xₙ₊₁)‖₋ ≤ b‖xₙ₊₁ − xₙ‖, where ‖∂f(x)‖₋ denotes the minimal norm of the subgradients of f at x.
  This even allows more structured methods (decomposition by blocks) or variants (variable metric, inexact computations).
- Recently: applications to stochastic gradient methods.
- Geometry is a powerful tool not only for rates, but also for regularization! (See Silvia's talk.)
- Can inertial methods benefit from this analysis? Are they adaptive?

SLIDE 84

Thanks for your attention!