


Optimization for data processing at a large scale
Sparsity4PSL Summer School

Emilie Chouzenoux
Center for Visual Computing, CentraleSupélec, INRIA Saclay

24 June 2019


Inverse problems and large scale optimization

[Microscopy, ISBI Challenge 2013, F. Soulez]

Original image x̄ ∈ R^N; degraded image z = D(H x̄) ∈ R^M.
◮ H ∈ R^{M×N}: matrix associated with the degradation operator.
◮ D : R^M → R^M: noise degradation.
Inverse problem: Find a good estimate of x̄ from the observations z, using some a priori knowledge on x̄ and on the noise characteristics.


Inverse problems and large scale optimization

Inverse problem: Find an estimate x̂ close to x̄ from the observations z = D(H x̄).
◮ Inverse filtering (if M = N and H is invertible):
x̂ = H^{-1} z = H^{-1}(H x̄ + b) = x̄ + H^{-1} b,   if b ∈ R^M is an additive noise.
→ Closed-form expression, but amplification of the noise if H is ill-conditioned (ill-posed problem).
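To make the amplification concrete, here is a minimal numerical sketch (not from the slides; the ill-conditioned matrix construction and noise level are illustrative assumptions): a tiny perturbation b produces a huge error in H^{-1} z.

```python
# Illustrative sketch of noise amplification by inverse filtering; the
# ill-conditioned H built here is an assumption, not the slides' operator.
import numpy as np

rng = np.random.default_rng(0)
N = 50
U, _ = np.linalg.qr(rng.standard_normal((N, N)))
H = U @ np.diag(np.logspace(0, -8, N)) @ U.T      # condition number ~ 1e8

x_bar = rng.standard_normal(N)                    # original signal
b = 1e-6 * rng.standard_normal(N)                 # tiny additive noise
z = H @ x_bar + b                                 # observations

x_hat = np.linalg.solve(H, z)                     # H^{-1} z = x_bar + H^{-1} b
print(np.linalg.norm(b), np.linalg.norm(x_hat - x_bar))  # error >> noise level
```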


Inverse problems and large scale optimization

Inverse problem: Find an estimate x̂ close to x̄ from the observations z = D(H x̄).
◮ Inverse filtering (ruled out)
◮ Variational approach:
x̂ ∈ Argmin_{x ∈ R^N} f1(x) + f2(x)
where f1 is the data fidelity term and f2 the regularization term.

Examples of data fidelity terms
◮ Gaussian noise: (∀x ∈ R^N) f1(x) = (1/σ²) ‖Hx − z‖²
◮ Poisson noise: (∀x ∈ R^N) f1(x) = Σ_{m=1}^{M} [Hx]^{(m)} − z^{(m)} log([Hx]^{(m)})
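As a rough sketch, the two fidelity terms above translate to Python as follows (the array shapes, variable names, and the eps guard on the log are assumptions for illustration):

```python
import numpy as np

def gaussian_fidelity(x, H, z, sigma=1.0):
    """f1(x) = (1/sigma^2) * ||Hx - z||^2  (Gaussian noise)."""
    r = H @ x - z
    return (r @ r) / sigma**2

def poisson_fidelity(x, H, z, eps=1e-12):
    """f1(x) = sum_m [Hx]^(m) - z^(m) * log([Hx]^(m))  (Poisson noise);
    eps guards the log and assumes Hx > 0."""
    Hx = H @ x
    return np.sum(Hx - z * np.log(Hx + eps))
```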

Examples of regularization terms (1)

◮ Admissibility constraints: Find x ∈ C = ∩_{m=1}^{M} C_m where (∀m ∈ {1, . . . , M}) C_m ⊂ R^N.
◮ Variational formulation: (∀x ∈ R^N) f2(x) = Σ_{m=1}^{M} ι_{C_m}(x)
where, for all m ∈ {1, . . . , M}, ι_{C_m} is the indicator function of C_m:
(∀x ∈ R^N) ι_{C_m}(x) = 0 if x ∈ C_m, +∞ otherwise.
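For instance, with box constraints C_m, the sum of indicator functions is 0 exactly on the admissible set. A minimal sketch (the box constraint is an illustrative assumption):

```python
import numpy as np

def indicator_box(x, lo, hi):
    """iota_C(x): 0 if lo <= x <= hi componentwise, +inf otherwise."""
    return 0.0 if np.all((x >= lo) & (x <= hi)) else np.inf

def f2(x, constraints):
    """f2(x) = sum_m iota_{C_m}(x): 0 iff x lies in every C_m."""
    return sum(c(x) for c in constraints)

x = np.array([0.2, 0.8])
print(f2(x, [lambda v: indicator_box(v, 0.0, 1.0)]))   # 0.0: x is admissible
```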

Examples of regularization terms (2)

◮ ℓ1 norm (analysis approach): (∀x ∈ R^N) f2(x) = Σ_{k=1}^{K} |[Fx]^{(k)}| = ‖Fx‖_1
where F ∈ R^{K×N} is a frame decomposition operator (K ≥ N) mapping the signal x to its frame coefficients.
◮ Total variation: (∀x = (x^{(i1,i2)})_{1≤i1≤N1, 1≤i2≤N2} ∈ R^{N1×N2})
f2(x) = tv(x) = Σ_{i1=1}^{N1} Σ_{i2=1}^{N2} ‖∇x^{(i1,i2)}‖_2
where ∇x^{(i1,i2)} is the discrete gradient at pixel (i1, i2).
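Both regularizers are a few lines of numpy. The forward-difference gradient with a replicated boundary used below is one common discretization, chosen here as an assumption:

```python
import numpy as np

def l1_analysis(x, F):
    """f2(x) = ||Fx||_1 for an analysis operator F of shape (K, N)."""
    return np.sum(np.abs(F @ x))

def total_variation(x):
    """tv(x): sum over pixels of the Euclidean norm of the discrete gradient."""
    d1 = np.diff(x, axis=0, append=x[-1:, :])   # vertical differences
    d2 = np.diff(x, axis=1, append=x[:, -1:])   # horizontal differences
    return np.sum(np.sqrt(d1**2 + d2**2))
```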


Inverse problems and large scale optimization

Inverse problem: Find an estimate x̂ close to x̄ from the observations z = D(H x̄).
◮ Inverse filtering (ruled out)
◮ Variational approach (more general context):
x̂ ∈ Argmin_{x ∈ R^N} Σ_{i=1}^{m} fi(x)
where fi may denote a data fidelity term, a (hybrid) regularization term, or a constraint.
→ Often there is no closed-form expression, or the solution is expensive to compute (especially in a large scale context).
◮ Need for an efficient iterative minimization strategy!


Main challenges

◮ How to exploit the mathematical properties of each term involved in f? How to handle constraints efficiently? How to deal with nondifferentiable terms in f? Which convergence result can be expected if f is nonconvex?
◮ How to reduce the memory requirements of an optimization algorithm? How to avoid large-size matrix inversion?
◮ What are the benefits of block alternating strategies? What are their convergence guarantees?
◮ How to accelerate the convergence speed of a first-order (gradient-like) optimization method?


Outline

1. Introduction to optimization
◮ Notation/definitions
◮ Existence and uniqueness of minimizers
◮ Differential/subdifferential
◮ Optimality conditions

2. Majorization-Minimization approaches
◮ Majorization-Minimization principle
◮ Majorization techniques
◮ MM quadratic methods
◮ Forward-backward algorithm
◮ Block-coordinate MM algorithms


Introduction to optimization


Domain of a function

Let f : R^N → R ∪ {+∞}.
◮ The domain of f is dom f = {x ∈ R^N | f(x) < +∞}.
◮ The function f is proper if dom f ≠ ∅.


Indicator function

Let C ⊂ R^N. The indicator function of C is
(∀x ∈ R^N) ι_C(x) = 0 if x ∈ C, +∞ otherwise.


Epigraph

Let f : R^N → R ∪ {+∞}. The epigraph of f is
epi f = {(x, ζ) ∈ dom f × R | f(x) ≤ ζ}.

Lower semi-continuous function

Let f : R^N → R ∪ {+∞}. f is a lower semi-continuous function on R^N if and only if epi f is closed.


Convex set

C ⊂ R^N is a convex set if (∀(x, y) ∈ C²)(∀α ∈ ]0, 1[) αx + (1 − α)y ∈ C.


Coercive function

Let f : R^N → R ∪ {+∞}. f is coercive if lim_{‖x‖→+∞} f(x) = +∞.


Convex function

f : R^N → R ∪ {+∞} is a convex function if
(∀(x, y) ∈ (R^N)²)(∀α ∈ ]0, 1[) f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).
◮ f is convex ⇔ its epigraph is convex.


Strictly convex function

f : R^N → R ∪ {+∞} is strictly convex if
(∀x ∈ dom f)(∀y ∈ dom f)(∀α ∈ ]0, 1[) x ≠ y ⇒ f(αx + (1 − α)y) < αf(x) + (1 − α)f(y).


Existence/uniqueness of minimizers

Theorem. Let f : R^N → R ∪ {+∞} be a proper l.s.c. coercive function. Then, the set of minimizers of f is a nonempty compact set.

Convex case
◮ Let f : R^N → R ∪ {+∞} be a proper convex function such that µ = inf f > −∞. Then, every local minimizer of f is a global minimizer. Moreover, if f is strictly convex, then there exists at most one minimizer.
◮ Let C be a closed convex subset of R^N. Let f : R^N → R ∪ {+∞} be proper, convex, and l.s.c. such that dom f ∩ C ≠ ∅. If f is coercive or C is bounded, then there exists x̂ ∈ C such that f(x̂) = inf_{x∈C} f(x). If, moreover, f is strictly convex, this minimizer x̂ is unique.


Subdifferential

Let f : R^N → R ∪ {+∞} be a proper function. The (Moreau) subdifferential of f, denoted by ∂f, is the set-valued map
∂f : R^N → 2^{R^N} : x ↦ {u ∈ R^N | (∀y ∈ R^N) ⟨y − x | u⟩ + f(x) ≤ f(y)}.

[Figure: the affine function f(x) + ⟨y − x | t⟩ minorizes f and touches it at x, for every subgradient t ∈ ∂f(x).]
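A one-dimensional example: for f = |·|, ∂f(0) = [−1, 1], and any u in that interval satisfies the defining inequality. A quick numerical check (illustrative sketch):

```python
import numpy as np

f = np.abs
x, u = 0.0, 0.3                   # any u in [-1, 1] is a subgradient of |.| at 0
ys = np.linspace(-2.0, 2.0, 401)
assert np.all((ys - x) * u + f(x) <= f(ys) + 1e-12)  # <y - x | u> + f(x) <= f(y)
```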


Optimality conditions

Fermat's rule: 0 ∈ ∂f(x̂) ⇔ x̂ ∈ Argmin f.

Differentiable case. Let C be a nonempty convex subset of R^N. Let f : R^N → R ∪ {+∞} be Gâteaux differentiable at x̂ ∈ C. If x̂ is a local minimizer of f over C, then
(∀y ∈ C) ∇f(x̂)^⊤ (y − x̂) ≥ 0.
If x̂ ∈ int(C), then the condition reduces to ∇f(x̂) = 0.
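In the smooth unconstrained case this is the familiar ∇f(x̂) = 0: e.g. for f(x) = (1/2)‖Hx − z‖², the least squares minimizer makes H^⊤(Hx̂ − z) vanish. A minimal check (random data as an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((30, 10))
z = rng.standard_normal(30)
x_hat, *_ = np.linalg.lstsq(H, z, rcond=None)   # minimizer of 0.5*||Hx - z||^2
print(np.linalg.norm(H.T @ (H @ x_hat - z)))    # ~ 0: Fermat's rule holds
```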


Majorization-Minimization approaches


Majorant function

Let f : R^N → R. Let y ∈ R^N. h(·, y) : R^N → R is a majorant function of f at y if:
(∀x ∈ R^N) f(x) ≤ h(x, y),  and  f(y) = h(y, y).

[Figure: h(·, y) lies above f and touches it at y.]


Majorization-Minimization algorithm

Problem: Minimization of a function f : R^N → R.

MM Algorithm: x_{n+1} ∈ Argmin_{x ∈ R^N} h(x, x_n), where h(·, x_n) is a majorant function of f at x_n.

[Figure: the majorant h(·, x_n) is minimized to obtain x_{n+1}, and the construction is repeated at x_{n+1}, x_{n+2}, . . .]

⇒ The sequence (f(x_n))_{n∈N} is decreasing:
(∀n ∈ N) f(x_{n+1}) ≤ h(x_{n+1}, x_n) ≤ h(x_n, x_n) = f(x_n),
where the first inequality uses the majorization property and the second the minimization step.
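The scheme itself is a few lines once a majorant minimizer is available. A generic sketch (both callables are placeholders to be supplied, and the toy instance below is an illustrative assumption):

```python
import numpy as np

def mm(f, majorant_argmin, x0, n_iter=50):
    """Generic MM loop: majorant_argmin(y) must return a minimizer of a
    majorant h(., y) of f at y; then (f(x_n)) is nonincreasing."""
    x, values = x0, [f(x0)]
    for _ in range(n_iter):
        x = majorant_argmin(x)          # x_{n+1} in Argmin h(., x_n)
        values.append(f(x))
    return x, values

# Toy instance: f(x) = log(cosh(x)); since |f''| <= 1, the descent-lemma
# quadratic majorant with curvature 1 gives the update y - f'(y).
f = lambda x: np.log(np.cosh(x))
x_star, vals = mm(f, lambda y: y - np.tanh(y), x0=3.0)
assert all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))   # monotone decrease
```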


Majorization techniques

◮ Subdifferential inequality
◮ Descent lemma
◮ Proximity operator
◮ Even smooth functions
◮ Jensen's inequality


Majorization techniques

Even differentiable function. Let f be defined as (∀x ∈ R) f(x) = ψ(|x|) where
(i) ψ is differentiable on ]0, +∞[,
(ii) ψ(√·) is concave on ]0, +∞[,
(iii) (∀x ∈ [0, +∞[) ψ̇(x) ≥ 0,
(iv) lim_{x→0, x>0} ω(x) ∈ R, where ω(x) := ψ̇(x)/x.
Then, for all y ∈ R,
(∀x ∈ R) f(x) ≤ f(y) + ḟ(y)(x − y) + (1/2) ω(|y|)(x − y)².

[Figure: the quadratic majorant h(·, y) lies above f and touches it at y.]
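A sketch of this majorant for ψ(|x|) = log(cosh(x)), where ω(x) = ψ̇(x)/x = tanh(x)/x (one of the pairs tabulated on the next slide); the grid and tolerance are assumptions:

```python
import numpy as np

f  = lambda x: np.log(np.cosh(x))
df = np.tanh                                   # f'(x)
omega = lambda t: np.tanh(t) / t               # valid for t > 0

def h(x, y):
    """Quadratic majorant: f(y) + f'(y)(x - y) + 0.5 * omega(|y|) * (x - y)^2."""
    return f(y) + df(y) * (x - y) + 0.5 * omega(abs(y)) * (x - y) ** 2

xs = np.linspace(-5.0, 5.0, 1001)
y = 1.5
assert np.all(h(xs, y) >= f(xs) - 1e-12)       # majorization over the grid
assert np.isclose(h(y, y), f(y))               # tangency at y
```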


Examples of functions f

Convex:
f(x) = |x| − δ log(|x|/δ + 1)                        ω(x) = (|x| + δ)^{-1}
f(x) = x² if |x| < δ, 2δ|x| − δ² otherwise           ω(x) = 2 if |x| < δ, 2δ/|x| otherwise
f(x) = log(cosh(x))                                  ω(x) = tanh(x)/x
f(x) = (1 + x²/δ²)^{κ/2} − 1                         ω(x) = (κ/δ²)(1 + x²/δ²)^{κ/2 − 1}

Nonconvex:
f(x) = 1 − exp(−x²/(2δ²))                            ω(x) = (1/δ²) exp(−x²/(2δ²))
f(x) = x²/(2δ² + x²)                                 ω(x) = 4δ²/(2δ² + x²)²
f(x) = 1 − (1 − x²/(6δ²))³ if |x| ≤ √6 δ, 1 otherwise    ω(x) = (1/δ²)(1 − x²/(6δ²))² if |x| ≤ √6 δ, 0 otherwise
f(x) = tanh(x²/(2δ²))                                ω(x) = (1/δ²)(cosh(x²/(2δ²)))^{-2}
f(x) = log(1 + x²/δ²)                                ω(x) = 2/(δ² + x²)

with (λ, δ) ∈ ]0, +∞[², κ ∈ [1, 2].
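A couple of rows of the table as callables, e.g. the (convex) Huber and the (nonconvex) Cauchy potentials; a sketch, where the max-guard in huber_omega only avoids a spurious division warning on the unused branch:

```python
import numpy as np

def huber(x, d):
    return np.where(np.abs(x) < d, x**2, 2*d*np.abs(x) - d**2)

def huber_omega(x, d):
    return np.where(np.abs(x) < d, 2.0, 2*d / np.maximum(np.abs(x), d))

def cauchy(x, d):
    return np.log(1 + x**2 / d**2)

def cauchy_omega(x, d):
    return 2.0 / (d**2 + x**2)
```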


Examples of functions f

[Figure: plots of f(x) = (1 + x²/δ²)^{1/2} − 1, f(x) = log(1 + x²/δ²), and f(x) = 1 − exp(−x²/(2δ²)).]


Majorization techniques

Consequences of Jensen's inequality. Let ψ : R → R be a convex function.
◮ (∀(x, y, c) ∈ (]0, +∞[^N)³)
ψ(c^⊤ x) ≤ Σ_{i=1}^{N} (c^{(i)} y^{(i)} / c^⊤ y) ψ((c^⊤ y / y^{(i)}) x^{(i)}).
◮ Let ω ∈ [0, +∞[^N be such that Σ_{i=1}^{N} ω^{(i)} = 1 and ω^{(i)} ≠ 0 iff c^{(i)} ≠ 0. Then
(∀(x, y, c) ∈ (R^N)³) ψ(c^⊤ x) ≤ Σ_{i=1}^{N} ω^{(i)} ψ((c^{(i)}/ω^{(i)})(x^{(i)} − y^{(i)}) + c^⊤ y).
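A numerical sanity check of the second inequality for ψ(t) = t², with the common weight choice ω^{(i)} = |c^{(i)}|/‖c‖₁ (an assumption; equality holds at x = y):

```python
import numpy as np

rng = np.random.default_rng(2)
psi = lambda t: t**2                       # convex test function
c = rng.standard_normal(5)                 # entries almost surely nonzero
x, y = rng.standard_normal(5), rng.standard_normal(5)
w = np.abs(c) / np.sum(np.abs(c))          # w_i != 0 iff c_i != 0, sums to 1
lhs = psi(c @ x)
rhs = np.sum(w * psi((c / w) * (x - y) + c @ y))
assert lhs <= rhs + 1e-10                  # the majorizing inequality holds
```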

MM algorithms

◮ Separable MM approach
◮ MM quadratic algorithm
◮ 3MG algorithm
◮ Forward-backward algorithm
◮ Block-alternating MM schemes


Acceleration via block-alternation

Problem: Minimization of f : R^N → R.

The variable x ∈ R^N is split into J blocks, x = (x^{(1)}, . . . , x^{(J)}) with x^{(j)} ∈ R^{N_j} and ×_{j=1}^{J} R^{N_j} = R^N.

⇒ Block-coordinate strategy: Instead of updating the whole vector x at iteration n ∈ N, restrict the update to a block j_n ∈ {1, . . . , J}.

Concluding remarks

◮ In large scale optimization, we search for the best possible tradeoff between computational complexity and convergence rate.
◮ The availability of theoretical convergence results is important to assess the reliability of an optimization scheme.
◮ There is rarely a single technique available for the resolution of an optimization problem.
◮ It is thus always recommended to test and compare different strategies for a given application.

Not treated in this course: stochastic optimization, distributed algorithms, primal-dual strategies, etc.