Iterative regularization for general inverse problems
Guillaume Garrigos, with L. Rosasco and S. Villa
CNRS, École Normale Supérieure
Séminaire CVN, CentraleSupélec, 23 Jan 2018
Outline
1. Regularization of inverse problems
2. Regularization by penalization and early stopping
3. Iterative regularization for general models
Guillaume Garrigos 2/21
Intro: Inverse Problems

An ill-posed inverse problem. Given A : X → Y and ȳ ∈ Y, we want to solve

    x† = arg min { R(x) : x ∈ arg min D(A·; ȳ) }    (P)

(P) might be ill-posed:
- It might have no solutions → introduce a discrepancy D(Ax; ȳ) = ‖Ax − ȳ‖², ‖Ax − ȳ‖₁, or D_KL(Ax; ȳ), ...
- The solution x† might not be unique → introduce a prior: R is a convex functional (‖x‖², ‖Wx‖₁, ‖∇x‖, ...)

Typically ȳ = Ax̄:
- Signal/image processing: x̄ is the original signal, deteriorated by A
- Linear regression: the (a_i, y_i) are the data, A = (a₁; ...; a_i; ...)
- Non-linear/kernel regression/SVM: the same, but the a_i are sent to a feature space

(P) is our model.
Intro: Inverse Problems

What about stability to noise? A noisy example: ŷ = ȳ + ε, and x̂† is the solution of (P) computed from ŷ instead of ȳ = Ax̄. [Figure: x̂† can be far from x̄.] We need to impose well-posedness!
Regularization

Regularization is a parametrization of a low-dimensional subset of the space of solutions, balancing between fitting the data and fitting the model. We want a map (y, λ) ∈ Y × P ↦ x_λ(y) ∈ X such that:
1. lim_{λ} x_λ(ȳ) = x†
2. ‖ŷ − ȳ‖ ≤ δ ⇒ ∃λ_δ ∈ P such that ‖x_{λδ}(ŷ) − x†‖ = O(δ^α)

A good regularization method is one for which α is large.
2. Regularization by penalization and early stopping
Regularization

Which regularization method for our model problem x† = arg min { R(x) : x ∈ arg min D(A·; ȳ) } (P)?
Regularization via Penalization (Tikhonov)

Penalization method: x_λ(y) := arg min_{x ∈ X} λR(x) + D(Ax; y)    (P_λ)

In practice, starting from (P): pick parameters λ₁, λ₂, λ₃, ..., solve each (P_{λi}) with an optimization algorithm to obtain x_{λ1}, x_{λ2}, x_{λ3}, ...; these form the regularization path, from which parameter selection returns x_{λδ}.
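The penalization pipeline above can be mimicked in a few lines. A minimal numerical sketch, assuming the quadratic case R(x) = ‖x‖² and D(Ax; y) = ‖Ax − y‖², where each (P_λ) reduces to the normal equations (AᵀA + λI)x = Aᵀy; the toy operator and noise level are hypothetical:

```python
import numpy as np

def tikhonov_path(A, y, lambdas):
    """Solve x_lam = argmin_x lam*||x||^2 + ||A x - y||^2 for each lam,
    via the normal equations (A^T A + lam*I) x = A^T y."""
    AtA, Aty = A.T @ A, A.T @ y
    n = A.shape[1]
    return {lam: np.linalg.solve(AtA + lam * np.eye(n), Aty) for lam in lambdas}

# Hypothetical toy problem: an ill-conditioned A and noisy data y = A x_bar + eps.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20)) / np.arange(1, 21)  # decaying column scales
x_bar = rng.standard_normal(20)
y = A @ x_bar + 0.01 * rng.standard_normal(30)

path = tikhonov_path(A, y, lambdas=[1.0, 0.3, 0.01])
# Smaller lambda fits the data better: the residual ||A x_lam - y||
# is nondecreasing in lambda.
residuals = {lam: np.linalg.norm(A @ x - y) for lam, x in path.items()}
```

Parameter selection (e.g. a discrepancy principle) would then pick, along the path, the λ whose residual matches the noise level δ.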
Regularization via Penalization (Tikhonov)

Example (figure): solutions x_λ of (P_λ) for λ = 1, λ = 0.3, λ = 0.01.
Regularization via Penalization (Tikhonov)

Tikhonov regularization is a regularization method (linear case). Assume R(x) = ‖x‖², D(Ax; y) = ‖Ax − y‖², and x† ∈ Range(A*). Let ‖ŷ − ȳ‖ ≤ δ and let x̂_λ be generated from the data ŷ. If λ_δ = O(δ), then

    ‖x̂_{λδ} − x†‖ = O(δ^{1/2}).

The exponent 1/2 is optimal; very few results exist for other models...
Iterative Regularization (Early Stopping)

Early stopping: take any (robust) algorithm solving (P) directly: arg min { R(x) : x ∈ arg min D(A·; ȳ) }. The regularization path is the sequence of iterates (x_n)_{n∈N}; the regularization parameter is the iteration counter n.

In practice: run the algorithm on (P) to produce the iterates (x_n)_{n∈N} (the regularization path), then parameter selection picks a stopping index n_δ and returns x_{nδ}.
Iterative Regularization (Early Stopping)

Example (figure): iterates x_n for n = 300, n = 500, n = 1000.
Iterative Regularization (Robust Optimization)

The algorithm(s). If D(Ax; y) = ‖Ax − y‖², the constraint is linear, so the dual of (P) is

    min_u R*(−A*u) + ⟨u, y⟩,

which can be solved by gradient descent on the dual:

    x_n = ∇R*(−A*u_n),    u_{n+1} = u_n + τ(Ax_n − y).

NB: if R = ‖·‖², this becomes the Landweber algorithm x_{n+1} = x_n − τA*(Ax_n − y).
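The Landweber iteration above is just gradient descent on the least-squares fit, so it is a few lines of code. A self-contained sketch; the operator, step size, and iteration budget are illustrative choices, not tied to any experiment in the talk:

```python
import numpy as np

def landweber(A, y, tau, n_iters):
    """Landweber iteration x_{n+1} = x_n - tau * A^T (A x_n - y),
    i.e. gradient descent for (1/2)||A x - y||^2 started at x_0 = 0.
    Converges for 0 < tau < 2 / ||A||^2."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x -= tau * (A.T @ (A @ x - y))
    return x

# Hypothetical consistent system: y = A x_bar exactly (no noise).
A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_bar = np.array([1.0, -1.0])
y = A @ x_bar
tau = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size, below 2/||A||^2
x_hat = landweber(A, y, tau, n_iters=2000)
```

On this noiseless, full-column-rank example the iterates converge to the exact solution x̄.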
Iterative Regularization (Robust Optimization)

Gradient descent is a regularization method. Assume R(x) = ‖x‖², D(Ax; y) = ‖Ax − y‖², and x† ∈ Range(A*). Let ‖ŷ − ȳ‖ ≤ δ and let x̂_n be generated from the data ŷ via x̂_{n+1} = x̂_n − γA*(Ax̂_n − ŷ). If n_δ = O(δ^{−1}), then

    ‖x̂_{nδ} − x†‖ = O(δ^{1/2}).
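With noisy data the iterates semiconverge: the error to the noiseless solution first decreases, then grows as the iterates begin to fit the noise, which is exactly why the iteration counter n acts as a regularization parameter. A toy illustration of this effect (the diagonal operator and the noise are hypothetical):

```python
import numpy as np

# Diagonal operator with one very small singular value: the problem is
# ill-conditioned, and noise on that component is amplified by 1/0.02 = 50.
s = np.array([1.0, 0.1, 0.02])
A = np.diag(s)
x_bar = np.ones(3)
y_noisy = A @ x_bar + np.array([0.0, 0.0, 0.05])  # noise on the weak component

def landweber_path(A, y, tau, n_iters):
    """Return the Landweber iterates x_1, ..., x_n for the data y."""
    x, path = np.zeros(A.shape[1]), []
    for _ in range(n_iters):
        x = x - tau * (A.T @ (A @ x - y))
        path.append(x.copy())
    return path

path = landweber_path(A, y_noisy, tau=1.0, n_iters=50000)
errors = [np.linalg.norm(x - x_bar) for x in path]
# Early stopping: somewhere along the path the error is far smaller than at
# the over-fitted limit, which converges to the noise-amplified solution.
best_n = int(np.argmin(errors)) + 1
```

Here `errors[best_n - 1]` is well below `errors[-1]`: stopping around n_δ trades data fit against noise amplification.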
Iterative Regularization (Robust Optimization)

Gradient descent on the dual is a regularization method [Matet et al., 2016]. Assume R is strongly convex, D(Ax; y) = ‖Ax − y‖², and ∂R(x†) ∩ Range(A*) ≠ ∅. Let ‖ŷ − ȳ‖ ≤ δ and let x̂_n be generated from the data ŷ via gradient descent on the dual. If n_δ = O(δ^{−1}), then

    ‖x̂_{nδ} − x†‖ = O(δ^{1/2}).

What about other models for D?
The Learning Setting

Let ρ be a distribution on X × Y (with Y ⊂ R). We want to solve

    arg min_{f : X→Y} ∫_{X×Y} (f(x) − y)² dρ(x, y).    (P)
The Learning Setting

Restricting to a feature space H_φ, we solve

    arg min_{w ∈ H_φ} ∫_{X×Y} (⟨w, φ(x)⟩ − y)² dρ(x, y).    (P)

But we actually only have access to a sample of the data (x_i, y_i)_{i=1}^m. This means we pass from minimizing ‖Xw − Y‖ to ‖X_m w − Y_m‖. Define a regularization method by looking at m → +∞ instead of δ → 0. Under reasonable assumptions, the same type of results holds: both Tikhonov and gradient descent give optimal rates for ‖ŵ_{λ(m)} − w†‖ and ‖ŵ_{n(m)} − w†‖ [Caponnetto, De Vito, 2006]. Other algorithms are regularizing too, and other parameters act as regularization parameters (e.g. the number of passes over the data [Rosasco, Villa, 2015]).
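In the empirical setting the stopping index n can be selected on held-out data, which is the early-stopping analogue of cross-validating λ. A minimal sketch with synthetic data; the split, step size, and iteration budget are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic linear model y = <w_bar, x> + noise, split into train/validation.
X = rng.standard_normal((120, 15))
w_bar = rng.standard_normal(15)
Y = X @ w_bar + 0.5 * rng.standard_normal(120)
Xtr, Ytr, Xva, Yva = X[:80], Y[:80], X[80:], Y[80:]

tau = 1.0 / np.linalg.norm(Xtr, 2) ** 2   # step size below 1/||X_m||^2
w = np.zeros(15)
val_errors = []
for n in range(500):
    w -= tau * Xtr.T @ (Xtr @ w - Ytr)    # gradient step on the empirical risk
    val_errors.append(np.linalg.norm(Xva @ w - Yva))

# Early-stopping rule: keep the iteration with the best validation error.
n_stop = int(np.argmin(val_errors)) + 1
```

By construction the selected iterate is at least as good, on the held-out data, as both the first and the last iterate of the path.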
Let's Take Stock

For the model x† = arg min { R(x) : x ∈ arg min D(A·; ȳ) } (P):
- Penalization and early stopping are two different regularization methods.
- Early stopping seems to have better complexity in practice.
- Penalization lacks theoretical guarantees for general models.
- It is not even clear which algorithm to use for early stopping in general!
3. Iterative regularization for general models
Iterative Regularization for General Discrepancies D(Ax; y)

When D(Ax; y) ≠ ‖Ax − y‖², how do we solve arg min { R(x) : x ∈ arg min D(A·; ȳ) }? We can no longer use the dual of (P) → diagonal approach! (An old idea; see e.g. Lemaire in the 80s.)
Iterative Regularization for General Discrepancies D(Ax; y)

Diagonal method (heuristic): consider any algorithm x_{n+1} = Algo(x_n, y, λ) for solving x_λ(y) := arg min_{x ∈ X} λR(x) + D(Ax; y) (P_λ). Instead, run x_{n+1} = Algo(x_n, y, λ_n) with λ_n → 0.
- Includes the warm restart strategy
- See [Attouch, Czarnecki, Peypouquet, ...] on diagonal forward-backward methods

Issues:
- How to deal with D(A·; y) if D is nonsmooth and A ≠ Id?
- Requires knowing the conditioning of D(A·; y), which might not exist.
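The heuristic is easy to instantiate. A sketch, assuming R(x) = ‖x‖₁ and D(Ax; y) = (1/2)‖Ax − y‖², where one step of Algo is a proximal-gradient (ISTA) step with threshold τλ; the schedule λ_n = 1/(n+1) and the toy system are illustrative choices:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def diagonal_ista(A, y, tau, n_iters):
    """Diagonal method: ISTA steps for lam_n*||x||_1 + (1/2)||A x - y||^2
    with a vanishing penalty lam_n = 1/(n+1)."""
    x = np.zeros(A.shape[1])
    for n in range(n_iters):
        lam_n = 1.0 / (n + 1)
        x = soft(x - tau * A.T @ (A @ x - y), tau * lam_n)
    return x

# Hypothetical underdetermined consistent system A x = y.
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0])
x_hat = diagonal_ista(A, y, tau=0.3, n_iters=3000)
residual = np.linalg.norm(A @ x_hat - y)
```

As λ_n vanishes the iteration increasingly enforces the data constraint, so the residual ‖Ax_n − y‖ decays, while the ℓ₁ penalty keeps biasing the iterates toward sparse solutions.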
Our Approach: Diagonal Method on the Dual Problem

Take the dual of (P_λ), min_x λR(x) + D(Ax; y):

    min_u R*(−A*u) + (1/λ) D*(λu; y).    (D_λ)

Do a diagonal proximal-gradient (forward-backward) method on (D_λ), with λ_n → 0:

    x_n = ∇R*(−A*u_n)    (dual-to-primal step)
    w_{n+1} = u_n + τA x_n    (forward step)
    u_{n+1} = w_{n+1} − τ prox_{(1/(τλ_n)) D(·; y)}(τ^{−1} w_{n+1})    (backward step)

- Activates D only via its prox
- If R = J + (1/2)‖·‖², then ∇R* = prox_J
- Does it work?
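To make the three steps concrete, here is a minimal sketch assuming R(x) = (1/2)‖x‖² (so ∇R*(z) = z) and D(v; y) = ‖v − y‖₁, whose prox is a soft-threshold centred at y. The toy operator A = I, the step size, and the schedule λ_n are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def prox_l1_around(v, y, c):
    """prox of c*||. - y||_1: soft-threshold v around y with threshold c."""
    d = v - y
    return y + np.sign(d) * np.maximum(np.abs(d) - c, 0.0)

def diagonal_dual_fb(A, y, tau, n_iters):
    """Diagonal forward-backward on the dual (D_lam), lam_n -> 0, for
    R = (1/2)||.||^2 and D(.; y) = ||. - y||_1."""
    u = np.zeros(A.shape[0])
    for n in range(n_iters):
        lam_n = 1.0 / (n + 1) ** 2
        x = -A.T @ u                    # dual-to-primal: x_n = grad R*(-A^T u_n)
        w = u + tau * (A @ x)           # forward step
        c = 1.0 / (tau * lam_n)        # prox parameter 1/(tau * lam_n)
        u = w - tau * prox_l1_around(w / tau, y, c)   # backward step
    return -A.T @ u

# Toy problem where the answer is known: with A = I,
# argmin ||x - y||_1 = {y}, so the R-minimal solution is x_dag = y.
A = np.eye(2)
y = np.array([1.0, 2.0])
x_hat = diagonal_dual_fb(A, y, tau=0.5, n_iters=200)
```

On this toy example `x_hat` recovers x† = y; for a general A only ∇R* and the prox need to change.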
Main Result on the Diagonal Dual Descent Method

Assumptions:
- R is strongly convex and x̄ ∈ dom R
- D(·; ȳ) is coercive and p-conditioned
- Qualification condition: ∂R(x†) ∩ Range(A*) ≠ ∅

The qualification condition holds, e.g., if R is continuous at x† and Range(A*) is closed.
Theorem (optimization, aka the no-noise case) [G., Rosasco, Villa, 2017]. Under the assumptions above, suppose λ_n → 0 fast enough (i.e. λ_n ∈ ℓ^{1/(p−1)}(N)). Let x_n be generated from the true data ȳ. Then ‖x_n − x†‖ = o(1/√n).
Now assume an additive discrepancy: D(Ax; y) = L(Ax − y).

Theorem (regularization) [G., Rosasco, Villa, 2017]. Suppose λ_n → 0 fast enough (i.e. λ_n ∈ ℓ^{1/(p−1)}(N)) and ‖ŷ − ȳ‖ ≤ δ. Let x̂_n be generated from the noisy data ŷ. Then there exists n_δ = O(δ^{−2/3}) such that ‖x̂_{nδ} − x†‖ = O(δ^{1/3}).
- Similar results hold for other discrepancies, such as D_KL(Ax; y)
- Less sharp results suggest that a slower λ_n → 0 leads to a larger n_δ but a more accurate x̂_{nδ}.
Experiments

512 × 512 images, blurred and corrupted by impulse noise (35% intensity); D(Ax; y) = ‖Ax − y‖₁ and R(x) = ‖Wx‖₁ or ‖x‖_TV.
Experiments (figures): reconstructions of the blurred, impulse-noise-corrupted 512 × 512 image at t = 0 and t = 10.
Comments on the Experiments

Early stopping vs penalization: who wins?
- Early stopping achieves the same reconstruction error as the penalization method.
- Early stopping requires far fewer computations than naive penalization, and is comparable to the warm restart strategy.
- With early stopping we have direct control over the computations, but not over the reconstruction quality: fix a budget of iterations and pick the best solution.
- With penalization it is the reverse: fix a stopping criterion for the problems (P_λ) and let the algorithm run.

How to choose the parameters?
- Any technique used for penalization applies to early stopping.
- In learning, cross-validation works very well.
- In imaging it is more delicate (SURE? discrepancy principle?).
Conclusions

- Early stopping is not limited to linear inverse problems; it applies to general models.
- It allows better control of the computational cost than penalization methods.

What's next? (work in progress)
- The learning scenario (what if A ↔ A_m?)
- Accelerated methods: same reconstruction bound, but faster?
- Removing the strong convexity assumption by using another algorithm?
Thanks for your attention!