Iterative regularization for general inverse problems
Guillaume Garrigos, with L. Rosasco and S. Villa
CNRS, École Normale Supérieure
Séminaire CVN, CentraleSupélec, 23 Jan 2018
Outline
1. Regularization of inverse problems
2. Regularization by penalization and early stopping
3. Iterative regularization for general models
Guillaume Garrigos 2/21
Intro: Inverse Problems

An ill-posed inverse problem. Given A : X → Y and ȳ ∈ Y, we want to solve

    x† = arg min { R(x) : x ∈ arg min D(A·; ȳ) }    (P)

(P) might be ill-posed:
- It might have no solutions → introduce a discrepancy D(Ax; ȳ) = ‖Ax − ȳ‖², ‖Ax − ȳ‖₁, or D_KL(Ax; ȳ), ...
- The solution x† might not be unique → introduce a prior: R is a convex functional (‖x‖², ‖Wx‖₁, ‖∇x‖, ...)

Typically ȳ = Ax̄:
- Signal/image processing: x̄ is the original signal, deteriorated by A
- Linear regression: the (a_i, y_i) are the data, A = (a₁; ...; a_i; ...)
- Non-linear/kernel regression/SVM: the same, but the a_i are sent to a feature space

(P) is our model.
Intro: Inverse Problems

What about stability to noise? A noisy example: ŷ = ȳ + ε, and x̂† is the solution of (P) computed from ŷ instead of ȳ = Ax̄. [Figure: x̂† can be far from x̄.] We need to impose well-posedness!
Regularization

Regularization is a parametrization of a low-dimensional subset of the space of solutions, balancing between fitting the data and fitting the model. We want a map (y, λ) ∈ Y × P ↦ x_λ(y) ∈ X such that:
1. lim_{λ} x_λ(ȳ) = x†
2. ‖ŷ − ȳ‖ ≤ δ ⇒ ∃λ_δ ∈ P such that ‖x_{λδ}(ŷ) − x†‖ = O(δ^α)

A good regularization method is one for which α is large.
2. Regularization by penalization and early stopping
Regularization

Which regularization method for our model problem x† = arg min { R(x) : x ∈ arg min D(A·; ȳ) } (P)?
Regularization via Penalization (Tikhonov)

Penalization method: x_λ(y) := arg min_{x ∈ X} λR(x) + D(Ax; y)    (P_λ)

In practice, starting from (P): pick parameters λ₁, λ₂, λ₃, ..., solve each (P_{λi}) with an optimization algorithm to obtain x_{λ1}, x_{λ2}, x_{λ3}, ...; these form the regularization path, from which parameter selection returns x_{λδ}.
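The penalization pipeline above can be mimicked in a few lines. A minimal numerical sketch, assuming the quadratic case R(x) = ‖x‖² and D(Ax; y) = ‖Ax − y‖², where each (P_λ) reduces to the normal equations (AᵀA + λI)x = Aᵀy; the toy operator and noise level are hypothetical:

```python
import numpy as np

def tikhonov_path(A, y, lambdas):
    """Solve x_lam = argmin_x lam*||x||^2 + ||A x - y||^2 for each lam,
    via the normal equations (A^T A + lam*I) x = A^T y."""
    AtA, Aty = A.T @ A, A.T @ y
    n = A.shape[1]
    return {lam: np.linalg.solve(AtA + lam * np.eye(n), Aty) for lam in lambdas}

# Hypothetical toy problem: an ill-conditioned A and noisy data y = A x_bar + eps.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20)) / np.arange(1, 21)  # decaying column scales
x_bar = rng.standard_normal(20)
y = A @ x_bar + 0.01 * rng.standard_normal(30)

path = tikhonov_path(A, y, lambdas=[1.0, 0.3, 0.01])
# Smaller lambda fits the data better: the residual ||A x_lam - y||
# is nondecreasing in lambda.
residuals = {lam: np.linalg.norm(A @ x - y) for lam, x in path.items()}
```

Parameter selection (e.g. a discrepancy principle) would then pick, along the path, the λ whose residual matches the noise level δ.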
Regularization via Penalization (Tikhonov)

Example (figure): solutions x_λ of (P_λ) for λ = 1, λ = 0.3, λ = 0.01.
Regularization via Penalization (Tikhonov)

Tikhonov regularization is a regularization method (linear case). Assume R(x) = ‖x‖², D(Ax; y) = ‖Ax − y‖², and x† ∈ Range(A*). Let ‖ŷ − ȳ‖ ≤ δ and let x̂_λ be generated from the data ŷ. If λ_δ = O(δ), then

    ‖x̂_{λδ} − x†‖ = O(δ^{1/2}).

The exponent 1/2 is optimal; very few results exist for other models...
Iterative Regularization (Early Stopping)

Early stopping: take any (robust) algorithm solving (P) directly: arg min { R(x) : x ∈ arg min D(A·; ȳ) }. The regularization path is the sequence of iterates (x_n)_{n∈N}; the regularization parameter is the iteration counter n.

In practice: run the algorithm on (P) to produce the iterates (x_n)_{n∈N} (the regularization path), then parameter selection picks a stopping index n_δ and returns x_{nδ}.
Iterative Regularization (Early Stopping)

Example (figure): iterates x_n for n = 300, n = 500, n = 1000.
Iterative Regularization (Robust Optimization)

The algorithm(s). If D(Ax; y) = ‖Ax − y‖², the constraint is linear, so the dual of (P) is

    min_u R*(−A*u) + ⟨u, y⟩,

which can be solved by gradient descent on the dual:

    x_n = ∇R*(−A*u_n),    u_{n+1} = u_n + τ(Ax_n − y).

NB: if R = ‖·‖², this becomes the Landweber algorithm x_{n+1} = x_n − τA*(Ax_n − y).
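The Landweber iteration above is just gradient descent on the least-squares fit, so it is a few lines of code. A self-contained sketch; the operator, step size, and iteration budget are illustrative choices, not tied to any experiment in the talk:

```python
import numpy as np

def landweber(A, y, tau, n_iters):
    """Landweber iteration x_{n+1} = x_n - tau * A^T (A x_n - y),
    i.e. gradient descent for (1/2)||A x - y||^2 started at x_0 = 0.
    Converges for 0 < tau < 2 / ||A||^2."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x -= tau * (A.T @ (A @ x - y))
    return x

# Hypothetical consistent system: y = A x_bar exactly (no noise).
A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_bar = np.array([1.0, -1.0])
y = A @ x_bar
tau = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size, below 2/||A||^2
x_hat = landweber(A, y, tau, n_iters=2000)
```

On this noiseless, full-column-rank example the iterates converge to the exact solution x̄.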
Iterative Regularization (Robust Optimization)

Gradient descent is a regularization method. Assume R(x) = ‖x‖², D(Ax; y) = ‖Ax − y‖², and x† ∈ Range(A*). Let ‖ŷ − ȳ‖ ≤ δ and let x̂_n be generated from the data ŷ via x̂_{n+1} = x̂_n − γA*(Ax̂_n − ŷ). If n_δ = O(δ^{−1}), then

    ‖x̂_{nδ} − x†‖ = O(δ^{1/2}).
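With noisy data the iterates semiconverge: the error to the noiseless solution first decreases, then grows as the iterates begin to fit the noise, which is exactly why the iteration counter n acts as a regularization parameter. A toy illustration of this effect (the diagonal operator and the noise are hypothetical):

```python
import numpy as np

# Diagonal operator with one very small singular value: the problem is
# ill-conditioned, and noise on that component is amplified by 1/0.02 = 50.
s = np.array([1.0, 0.1, 0.02])
A = np.diag(s)
x_bar = np.ones(3)
y_noisy = A @ x_bar + np.array([0.0, 0.0, 0.05])  # noise on the weak component

def landweber_path(A, y, tau, n_iters):
    """Return the Landweber iterates x_1, ..., x_n for the data y."""
    x, path = np.zeros(A.shape[1]), []
    for _ in range(n_iters):
        x = x - tau * (A.T @ (A @ x - y))
        path.append(x.copy())
    return path

path = landweber_path(A, y_noisy, tau=1.0, n_iters=50000)
errors = [np.linalg.norm(x - x_bar) for x in path]
# Early stopping: somewhere along the path the error is far smaller than at
# the over-fitted limit, which converges to the noise-amplified solution.
best_n = int(np.argmin(errors)) + 1
```

Here `errors[best_n - 1]` is well below `errors[-1]`: stopping around n_δ trades data fit against noise amplification.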
Iterative Regularization (Robust Optimization)

Gradient descent on the dual is a regularization method [Matet et al., 2016]. Assume R is strongly convex, D(Ax; y) = ‖Ax − y‖², and ∂R(x†) ∩ Range(A*) ≠ ∅. Let ‖ŷ − ȳ‖ ≤ δ and let x̂_n be generated from the data ŷ via gradient descent on the dual. If n_δ = O(δ^{−1}), then

    ‖x̂_{nδ} − x†‖ = O(δ^{1/2}).

What about other models for D?
The Learning Setting

Let ρ be a distribution on X × Y (with Y ⊂ R). We want to solve

    arg min_{f : X→Y} ∫_{X×Y} (f(x) − y)² dρ(x, y).    (P)
The Learning Setting

Restricting to a feature space H_φ, we solve

    arg min_{w ∈ H_φ} ∫_{X×Y} (⟨w, φ(x)⟩ − y)² dρ(x, y).    (P)

But we actually only have access to a sample of the data (x_i, y_i)_{i=1}^m. This means we pass from minimizing ‖Xw − Y‖ to ‖X_m w − Y_m‖. Define a regularization method by looking at m → +∞ instead of δ → 0. Under reasonable assumptions, the same type of results holds: both Tikhonov and gradient descent give optimal rates for ‖ŵ_{λ(m)} − w†‖ and ‖ŵ_{n(m)} − w†‖ [Caponnetto, De Vito, 2006]. Other algorithms are regularizing too, and other parameters act as regularization parameters (e.g. the number of passes over the data [Rosasco, Villa, 2015]).
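In the empirical setting the stopping index n can be selected on held-out data, which is the early-stopping analogue of cross-validating λ. A minimal sketch with synthetic data; the split, step size, and iteration budget are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic linear model y = <w_bar, x> + noise, split into train/validation.
X = rng.standard_normal((120, 15))
w_bar = rng.standard_normal(15)
Y = X @ w_bar + 0.5 * rng.standard_normal(120)
Xtr, Ytr, Xva, Yva = X[:80], Y[:80], X[80:], Y[80:]

tau = 1.0 / np.linalg.norm(Xtr, 2) ** 2   # step size below 1/||X_m||^2
w = np.zeros(15)
val_errors = []
for n in range(500):
    w -= tau * Xtr.T @ (Xtr @ w - Ytr)    # gradient step on the empirical risk
    val_errors.append(np.linalg.norm(Xva @ w - Yva))

# Early-stopping rule: keep the iteration with the best validation error.
n_stop = int(np.argmin(val_errors)) + 1
```

By construction the selected iterate is at least as good, on the held-out data, as both the first and the last iterate of the path.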
Let's Take Stock

For the model x† = arg min { R(x) : x ∈ arg min D(A·; ȳ) } (P):
- Penalization and early stopping are two different regularization methods.
- Early stopping seems to have better complexity in practice.
- Penalization lacks theoretical guarantees for general models.
- It is not even clear which algorithm to use for early stopping in general!
3. Iterative regularization for general models
Iterative Regularization for General Discrepancies D(Ax; y)

When D(Ax; y) ≠ ‖Ax − y‖², how do we solve arg min { R(x) : x ∈ arg min D(A·; ȳ) }? We can no longer use the dual of (P) → diagonal approach! (An old idea; see e.g. Lemaire in the 80s.)
Iterative Regularization for General Discrepancies D(Ax; y)

Diagonal method (heuristic): consider any algorithm x_{n+1} = Algo(x_n, y, λ) for solving x_λ(y) := arg min_{x ∈ X} λR(x) + D(Ax; y) (P_λ). Instead, run x_{n+1} = Algo(x_n, y, λ_n) with λ_n → 0.
- Includes the warm restart strategy
- See [Attouch, Czarnecki, Peypouquet, ...] on diagonal forward-backward methods

Issues:
- How to deal with D(A·; y) if D is nonsmooth and A ≠ Id?
- Requires knowing the conditioning of D(A·; y), which might not exist.
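The heuristic is easy to instantiate. A sketch, assuming R(x) = ‖x‖₁ and D(Ax; y) = (1/2)‖Ax − y‖², where one step of Algo is a proximal-gradient (ISTA) step with threshold τλ; the schedule λ_n = 1/(n+1) and the toy system are illustrative choices:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def diagonal_ista(A, y, tau, n_iters):
    """Diagonal method: ISTA steps for lam_n*||x||_1 + (1/2)||A x - y||^2
    with a vanishing penalty lam_n = 1/(n+1)."""
    x = np.zeros(A.shape[1])
    for n in range(n_iters):
        lam_n = 1.0 / (n + 1)
        x = soft(x - tau * A.T @ (A @ x - y), tau * lam_n)
    return x

# Hypothetical underdetermined consistent system A x = y.
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0])
x_hat = diagonal_ista(A, y, tau=0.3, n_iters=3000)
residual = np.linalg.norm(A @ x_hat - y)
```

As λ_n vanishes the iteration increasingly enforces the data constraint, so the residual ‖Ax_n − y‖ decays, while the ℓ₁ penalty keeps biasing the iterates toward sparse solutions.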
Our Approach: Diagonal Method on the Dual Problem

Take the dual of (P_λ), min_x λR(x) + D(Ax; y):

    min_u R*(−A*u) + (1/λ) D*(λu; y).    (D_λ)

Do a diagonal proximal-gradient (forward-backward) method on (D_λ), with λ_n → 0:

    x_n = ∇R*(−A*u_n)    (dual-to-primal step)
    w_{n+1} = u_n + τA x_n    (forward step)
    u_{n+1} = w_{n+1} − τ prox_{(1/(τλ_n)) D(·; y)}(τ^{−1} w_{n+1})    (backward step)

- Activates D only via its prox
- If R = J + (1/2)‖·‖², then ∇R* = prox_J
- Does it work?
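To make the three steps concrete, here is a minimal sketch assuming R(x) = (1/2)‖x‖² (so ∇R*(z) = z) and D(v; y) = ‖v − y‖₁, whose prox is a soft-threshold centred at y. The toy operator A = I, the step size, and the schedule λ_n are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def prox_l1_around(v, y, c):
    """prox of c*||. - y||_1: soft-threshold v around y with threshold c."""
    d = v - y
    return y + np.sign(d) * np.maximum(np.abs(d) - c, 0.0)

def diagonal_dual_fb(A, y, tau, n_iters):
    """Diagonal forward-backward on the dual (D_lam), lam_n -> 0, for
    R = (1/2)||.||^2 and D(.; y) = ||. - y||_1."""
    u = np.zeros(A.shape[0])
    for n in range(n_iters):
        lam_n = 1.0 / (n + 1) ** 2
        x = -A.T @ u                    # dual-to-primal: x_n = grad R*(-A^T u_n)
        w = u + tau * (A @ x)           # forward step
        c = 1.0 / (tau * lam_n)        # prox parameter 1/(tau * lam_n)
        u = w - tau * prox_l1_around(w / tau, y, c)   # backward step
    return -A.T @ u

# Toy problem where the answer is known: with A = I,
# argmin ||x - y||_1 = {y}, so the R-minimal solution is x_dag = y.
A = np.eye(2)
y = np.array([1.0, 2.0])
x_hat = diagonal_dual_fb(A, y, tau=0.5, n_iters=200)
```

On this toy example `x_hat` recovers x† = y; for a general A only ∇R* and the prox need to change.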
Main Result on the Diagonal Dual Descent Method

Assumptions:
- R is strongly convex and x̄ ∈ dom R
- D(·; ȳ) is coercive and p-conditioned
- Qualification condition: ∂R(x†) ∩ Range(A*) ≠ ∅

The qualification condition holds, e.g., if R is continuous at x† and Range(A*) is closed.
Theorem (optimization, aka the no-noise case) [G., Rosasco, Villa, 2017]. Under the assumptions above, suppose λ_n → 0 fast enough (i.e. λ_n ∈ ℓ^{1/(p−1)}(N)). Let x_n be generated from the true data ȳ. Then ‖x_n − x†‖ = o(1/√n).
Now assume an additive discrepancy: D(Ax; y) = L(Ax − y).

Theorem (regularization) [G., Rosasco, Villa, 2017]. Suppose λ_n → 0 fast enough (i.e. λ_n ∈ ℓ^{1/(p−1)}(N)) and ‖ŷ − ȳ‖ ≤ δ. Let x̂_n be generated from the noisy data ŷ. Then there exists n_δ = O(δ^{−2/3}) such that ‖x̂_{nδ} − x†‖ = O(δ^{1/3}).
- Similar results hold for other discrepancies, such as D_KL(Ax; y)
- Less sharp results suggest that a slower λ_n → 0 leads to a larger n_δ but a more accurate x̂_{nδ}.
Experiments

512 × 512 images, blurred and corrupted by impulse noise (35% intensity); D(Ax; y) = ‖Ax − y‖₁ and R(x) = ‖Wx‖₁ or ‖x‖_TV.
Experiments (figures): reconstructions of the blurred, impulse-noise-corrupted 512 × 512 image at t = 0 and t = 10.
Comments on the Experiments

Early stopping vs penalization: who wins?
- Early stopping achieves the same reconstruction error as the penalization method.
- Early stopping requires far fewer computations than naive penalization, and is comparable to the warm restart strategy.
- With early stopping we have direct control over the computations, but not over the reconstruction quality: fix a budget of iterations and pick the best solution.
- With penalization it is the reverse: fix a stopping criterion for the problems (P_λ) and let the algorithm run.

How to choose the parameters?
- Any technique used for penalization applies to early stopping.
- In learning, cross-validation works very well.
- In imaging it is more delicate (SURE? discrepancy principle?).
Conclusions

- Early stopping is not limited to linear inverse problems; it applies to general models.
- It allows better control of the computational cost than penalization methods.

What's next? (work in progress)
- The learning scenario (what if A ↔ A_m?)
- Accelerated methods: same reconstruction bound, but faster?
- Removing the strong convexity assumption by using another algorithm?
Thanks for your attention!