SLIDE 1

Optimization considerations for regularizations of inverse and learning problems

Hugo Raguet¹
Statistics seminar at LIRMM, Montpellier, April 11, 2018

¹hugo.raguet@gmail.com

SLIDES 2–4

Let me introduce myself briefly

Ph.D. at Paris-Dauphine University
  • structured sparse modeling for neuroimaging
[figure: spatio-temporal brain activity maps, 420–470 ms, scales 5×10⁻³ and 400 µm]

Lecturer at Aix-Marseille University
  • optimization for signal and learning on graphs

Postdoc at French Commission for Atomic Energy
  • dependence measures for sensitivity analysis
[figure: reproducing kernel Hilbert space H_k, embeddings μ_k(P) and μ_k(Q) of two distributions P and Q, and their distance γ_k(P, Q)]

SLIDE 5

Outline
  • Some Motivation
  • Proximal Splitting
  • Variants and Accelerations
  • Cut-pursuit Algorithm

SLIDES 6–10

An Example in functional MRI

Observing the brain at work
[figure: brain activity maps x^(1), ..., x^(6) ∈ R^V, shown on a high/low color scale]

SLIDES 11–17

An Example in functional MRI

A binary logistic classification problem

For n ∈ {1, ..., N}, c^(n) = sign⟨w, x^(n)⟩
[figure: maps x^(1), x^(3), x^(4) labeled c^(n) = +1; maps x^(2), x^(5), x^(6) labeled c^(n) = −1]

Probabilistic model:
P(c^(n) | x^(n); w) = σ(c^(n)⟨w, x^(n)⟩),  where σ: t ↦ 1/(1 + exp(−t))
[figure: σ(c⟨w, x⟩) as a function of c⟨w, x⟩, rising from 0 through .5 to 1]

Maximize the log-likelihood:
find w ∈ arg max_{R^V} Σ_n log P(c^(n) | x^(n); w),
or equivalently, find w ∈ arg min_{R^V} Σ_n −log σ(c^(n)⟨w, x^(n)⟩)
[figure: the loss −log σ(c⟨w, x⟩) as a function of c⟨w, x⟩]
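For readers who want to experiment, here is a minimal NumPy sketch of the model above; the names sigmoid, neg_log_likelihood, X (samples as rows) and c (labels in {−1, +1}) are ours, not from the talk:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_likelihood(w, X, c):
    """Sum_n -log sigma(c^(n) <w, x^(n)>), for labels c^(n) in {-1, +1}.
    Uses -log sigma(t) = log(1 + exp(-t)) = logaddexp(0, -t) for stability."""
    margins = c * (X @ w)                     # c^(n) <w, x^(n)>
    return np.sum(np.logaddexp(0.0, -margins))
```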

SLIDES 18–22

Optimization

Simple, smooth and convex

Maximize the log-likelihood:
find w ∈ arg min_{R^V} F: w ↦ Σ_n −log σ(c^(n)⟨w, x^(n)⟩)

  • First order: (∇F(w))_v = Σ_n −c^(n) x^(n)_v (1 − σ(c^(n)⟨w, x^(n)⟩))
  • Second order: (∇²F(w))_uv = Σ_n (c^(n))² x^(n)_u x^(n)_v σ(1 − σ), with σ evaluated at c^(n)⟨w, x^(n)⟩

Gradient descent: w^(k+1) = w^(k) − γ∇F(w^(k))
(Quasi-)Newton method: w^(k+1) = w^(k) − γ(∇²F)⁻¹∇F(w^(k))
[figure: iterates converging from w^(0) to w⋆]
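A small sketch of both updates on this objective, implementing the gradient and Hessian formulas above (the helper names and the fixed step size gamma are our own choices):

```python
import numpy as np

def grad(w, X, c):
    """(grad F(w))_v = sum_n -c^(n) x^(n)_v (1 - sigma(c^(n) <w, x^(n)>))."""
    s = 1.0 / (1.0 + np.exp(-c * (X @ w)))
    return X.T @ (-c * (1.0 - s))

def hess(w, X, c):
    """(grad^2 F(w))_uv = sum_n x^(n)_u x^(n)_v sigma (1 - sigma); (c^(n))^2 = 1."""
    s = 1.0 / (1.0 + np.exp(-c * (X @ w)))
    return X.T @ (X * (s * (1.0 - s))[:, None])

def gradient_descent(X, c, gamma=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - gamma * grad(w, X, c)   # w^(k+1) = w^(k) - gamma grad F(w^(k))
    return w

def newton(X, c, gamma=1.0, iters=20):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - gamma * np.linalg.solve(hess(w, X, c), grad(w, X, c))
    return w
```

For the plain Newton method one takes gamma = 1; quasi-Newton methods replace hess by a cheaper approximation.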

SLIDES 23–28

Regularization

Stability and prior knowledge

−log σ(c^(n)⟨w, x^(n)⟩) = log(1 + exp(−c^(n)⟨w, x^(n)⟩))
When N ≪ V, the maximum-likelihood problem is unstable: ‖w⋆‖ → +∞
[figure: the loss −log σ(c⟨w, x⟩) as a function of c⟨w, x⟩]

Add a penalty to the logistic loss ℓ:
  • ‘‘Ridge’’: F(w) = ℓ(w) + λ‖w‖²
  • ‘‘LASSO’’: F(w) = ℓ(w) + λ Σ_v |w_v|
  • ‘‘Group LASSO’’: F(w) = ℓ(w) + λ Σ_b ‖w_b‖
  • ‘‘Total variation’’: F(w) = ℓ(w) + λ Σ_b ‖(Dw)_b‖
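These penalties are typically handled through their proximity operators, introduced on the following slides. As a preview, a sketch of two textbook closed forms (standard results, not specific to this talk): the prox of the squared norm and the component-wise soft-thresholding for the ℓ₁ norm:

```python
import numpy as np

def prox_ridge(x, gamma, lam):
    """prox of gamma * lam * ||.||^2: argmin_z 0.5||z - x||^2 + gamma lam ||z||^2."""
    return x / (1.0 + 2.0 * gamma * lam)

def prox_l1(x, gamma, lam):
    """prox of gamma * lam * ||.||_1: component-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)
```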

SLIDE 29

Outline
  • Some Motivation
  • Proximal Splitting
  • Variants and Accelerations
  • Cut-pursuit Algorithm

SLIDES 30–35

Proximal Point Algorithm

Fixed-point algorithm for nonsmooth optimization

  • Gradient and subgradient (definitions):
    ∇F(x) = u ⇐⇒ ∀y, F(y) = F(x) + ⟨u | y − x⟩ + o(‖y − x‖)
    u ∈ ∂F(x) ⇐⇒ ∀y, F(y) ≥ F(x) + ⟨u | y − x⟩
  • First-order optimality: 0 = ∇F(x⋆);  0 ∈ ∂F(x⋆)
  • Fixed-point equation: x⋆ = x⋆ − γ∇F(x⋆);  x⋆ + γ∂F(x⋆) ∋ x⋆
  • Algorithm: x^(k+1) = (Id − γ∇F)x^(k);  x^(k+1) = (Id + γ∂F)⁻¹x^(k)

(Id + γ∂F)⁻¹x^(k) = arg min_x ½‖x^(k) − x‖² + γF(x) = prox_{γF}(x^(k))
[figure: graph of a nonsmooth F with supporting lines at a kink]
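To make the prox definition concrete, a small self-contained check on a toy example of our own (F = |·|, whose prox is the soft-thresholding): the closed form should match a brute-force minimization of the defining objective.

```python
import numpy as np

gamma, x_k = 0.5, 1.3
# Closed form for F = |.|: prox_{gamma F}(x) = sign(x) max(|x| - gamma, 0)
closed = np.sign(x_k) * max(abs(x_k) - gamma, 0.0)

# Definition: argmin_x 0.5 (x_k - x)^2 + gamma |x|, by grid search
grid = np.linspace(-3.0, 3.0, 600001)
numeric = grid[np.argmin(0.5 * (x_k - grid) ** 2 + gamma * np.abs(grid))]

print(closed, numeric)  # both close to 0.8
```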

SLIDES 36–42

Proximal Splitting Algorithms

Primal algorithms

F = f + g, where:
  • f smooth (Lipschitz-continuous gradient)
  • g simple (proximity operator easy to compute)

Derivation of the fixed point:
0 ∈ ∂F(x⋆) = (∇f + ∂g)x⋆
−∇f(x⋆) ∈ ∂g(x⋆)
(Id − γ∇f)x⋆ ∈ (Id + γ∂g)x⋆
(Id + γ∂g)⁻¹(Id − γ∇f)x⋆ = x⋆

Forward-Backward Splitting Algorithm
x^(k+1) = prox_{γg}(x^(k) − γ∇f(x^(k)))
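A minimal sketch of the forward-backward loop, on an instance of our own choosing (f(x) = ½‖Ax − b‖², g = λ‖·‖₁, step γ = 1/L with L the Lipschitz constant of ∇f):

```python
import numpy as np

def forward_backward_lasso(A, b, lam, iters=1000):
    """x^(k+1) = prox_{gamma g}(x^(k) - gamma grad f(x^(k)))."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, with L = ||A||^2 = Lip(grad f)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)              # forward (explicit) step on f
        z = x - gamma * grad
        x = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # backward step: prox of gamma lam ||.||_1
    return x
```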

SLIDES 43–49

Proximal Splitting Algorithms

Primal algorithms

F = f + g: Forward-Backward (Lions and Mercier, 1979)
x^(k+1) = prox_{γg}(x^(k) − γ∇f(x^(k)))

F = g + h, g and h simple: Douglas–Rachford (Lions and Mercier, 1979), with rprox ≝ 2 prox − Id
y^(k+1) = ½ rprox_{γg}(rprox_{γh}(y^(k))) + ½ y^(k);  x^(k+1) = prox_{γh}(y^(k+1))
(covers, e.g., the penalties Σ_b ‖w_b‖ and Σ_b ‖(Dw)_b‖)

F = Σ_i g_i, each g_i simple: D.–R. on Product Space (Spingarn, 1983)
min_x F(x) = min_{(x_i)} Σ_i g_i(x_i) subject to ∀i, j, x_i = x_j (equality enforced by the indicator ι of the diagonal of the product space)
∀i, y_i^(k+1) = y_i^(k) + prox_{(γ/w_i) g_i}(2x^(k) − y_i^(k)) − x^(k);  x^(k+1) = Σ_i w_i y_i^(k+1)

F = f + Σ_i g_i, f smooth, each g_i simple: Generalized F.-B. (Raguet et al., 2013)
∀i, y_i^(k+1) = y_i^(k) + prox_{(γ/w_i) g_i}(2x^(k) − y_i^(k) − γ∇f(x^(k))) − x^(k);  x^(k+1) = Σ_i w_i y_i^(k+1)
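A sketch of Douglas–Rachford on a toy instance of our own (g = λ‖·‖₁ and h = ½‖· − b‖², both simple); the iterates x^(k) converge to the soft-thresholding of b:

```python
import numpy as np

def douglas_rachford(b, lam, gamma=1.0, iters=200):
    """DR for F = g + h with g = lam ||.||_1 and h = 0.5 ||. - b||^2.
    The minimizer is the soft-thresholding of b at level lam."""
    prox_g = lambda x: np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)
    prox_h = lambda x: (x + gamma * b) / (1.0 + gamma)
    rprox = lambda prox, x: 2.0 * prox(x) - x      # rprox = 2 prox - Id
    y = np.zeros_like(b)
    for _ in range(iters):
        y = 0.5 * rprox(prox_g, rprox(prox_h, y)) + 0.5 * y
        x = prox_h(y)
    return x
```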

SLIDES 50–57

Proximal Splitting Algorithms

Primal algorithms (continued)

What about g ∘ L, with g simple and L a bounded linear operator?

  • ‘‘tight frame’’: if ∀y ∈ ran L, LL*y = y, then
    prox_{g∘L}(x) = x + L*(prox_g − Id)(Lx)
  • ‘‘split’’: g ∘ L = Σ_i g_i ∘ L_i, with g_i simple and L_i tight frames
  • ‘‘augment space’’:
    min_x g(Lx) = min_{x,y} g(y) subject to Lx = y
                = min_{x,y} g(y) + ι_{{(x,y) | Lx=y}}(x, y)
    but proj_{{(x,y) | Lx=y}} involves (Id + L*L)⁻¹ or (Id + LL*)⁻¹
  • otherwise: primal-dual algorithms
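A small numerical check of the tight-frame formula, on an example of our own where L is a rotation (so LL* = Id): the formula is compared against a brute-force minimization of the prox objective on a grid.

```python
import numpy as np

theta = 0.3
L = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: a tight frame, L L* = Id
lam = 0.5
prox_g = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # g = lam ||.||_1

x = np.array([1.0, -0.7])
formula = x + L.T @ (prox_g(L @ x) - L @ x)       # x + L*(prox_g - Id)(Lx)

# brute-force check of argmin_z 0.5||z - x||^2 + lam ||Lz||_1 on a grid
g1, g2 = np.meshgrid(np.linspace(-2, 2, 801), np.linspace(-2, 2, 801))
Z = np.stack([g1.ravel(), g2.ravel()])
vals = 0.5 * np.sum((Z - x[:, None]) ** 2, axis=0) + lam * np.sum(np.abs(L @ Z), axis=0)
print(formula, Z[:, np.argmin(vals)])             # agree up to grid resolution
```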
SLIDES 58–65

Proximal Splitting Algorithms

Primal-dual algorithms

Canonical form: F = g ∘ L + h, with g, h simple and L a linear operator
Split as min_{x,y} g(y) + h(x) subject to y = Lx

Alternating-Direction Method of Multipliers? (Gabay and Mercier, 1976)
x^(k+1) = arg min_x (1/ρ)h(x) + ½‖Lx − (y^(k) − λ^(k))‖²
y^(k+1) = arg min_y (1/ρ)g(y) + ½‖y − (Lx^(k+1) + λ^(k))‖²
λ^(k+1) = λ^(k) + (Lx^(k+1) − y^(k+1))

Drawbacks of the update on x:
  • well defined only for L injective
  • more complicated than prox_{(1/ρ)h}
  • requires storing both y and λ

Instead of ADMM? (Gabay and Mercier, 1976):
F = g ∘ L + h: Primal-Dual of Chambolle and Pock (2011), or more generally F = Σ_i g_i ∘ L_i

And if f is smooth but not simple?
F = f + g ∘ L + h: Primal-Dual of Condat (2013); Vũ (2013), or more generally F = f + Σ_i g_i ∘ L_i
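A sketch of ADMM in the common scaled form (parameter rho, scaled dual variable u playing the role of λ; equivalent to the updates above up to rescaling), on a 1-D total-variation denoising instance of our own: h(x) = ½‖x − b‖², g = λ‖·‖₁, L = D the finite-difference operator:

```python
import numpy as np

def admm_tv_denoise(b, lam, rho=1.0, iters=300):
    """ADMM (scaled form) for min_x 0.5||x - b||^2 + lam ||D x||_1."""
    n = b.size
    D = np.diff(np.eye(n), axis=0)             # (n-1) x n finite differences
    M = np.eye(n) + rho * D.T @ D              # x-update system matrix
    y = np.zeros(n - 1); u = np.zeros(n - 1)   # split variable and scaled dual
    for _ in range(iters):
        x = np.linalg.solve(M, b + rho * D.T @ (y - u))
        z = D @ x + u
        y = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)  # prox of (lam/rho)||.||_1
        u = u + D @ x - y                      # dual ascent
    return x
```

Note that the x-update is indeed a linear solve rather than a prox of h, illustrating the first two drawbacks listed above.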

SLIDE 66

Proximal Splitting Algorithms

Summary

  • F = f + g: Forward-Backward (Lions and Mercier, 1979), a.k.a. proximal gradient algorithm
  • F = g + h: Douglas–Rachford (Lions and Mercier, 1979)
  • F = Σ_i g_i: D.–R. on Product Space (Spingarn, 1983), a.k.a. Parallel Proximal Algorithm
  • F = f + Σ_i g_i: Generalized F.-B. (Raguet et al., 2013), a.k.a. Forward-Douglas–Rachford
  • F = g ∘ L + h: Primal-Dual of Chambolle and Pock (2011), a.k.a. Primal-Dual Hybrid Gradient
  • F = f + g ∘ L + h: Primal-Dual of Condat (2013); Vũ (2013), a.k.a. Forward-Backward Primal-Dual

SLIDE 67

Outline
  • Some Motivation
  • Proximal Splitting
  • Variants and Accelerations
  • Cut-pursuit Algorithm

SLIDE 68

Proximal Splitting Algorithms

Overrelaxation and Inertial Forces

All methods:
  • y^(k+1) = Tx^(k)
  • x^(k+1) = y^(k+1) + α_k(y^(k+1) − y^(k))
Acceleration observed in practice (Iutzeler and Hendrickx, 2018)

F = f + g, Forward-Backward:
theoretical acceleration on the functional values F(x^(k)) − F(x⋆) (Beck and Teboulle, 2009)
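For F = f + g, a sketch of the inertial scheme with the extrapolation sequence of Beck and Teboulle (2009), reusing the LASSO instance from the forward-backward sketch above:

```python
import numpy as np

def fista_lasso(A, b, lam, iters=500):
    """Inertial forward-backward: y^(k+1) = T x^(k) (a forward-backward step),
    then x^(k+1) = y^(k+1) + alpha_k (y^(k+1) - y^(k))."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = y_old = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        z = x - gamma * A.T @ (A @ x - b)
        y = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alpha = (t - 1.0) / t_new              # alpha_k from the FISTA sequence
        x = y + alpha * (y - y_old)
        y_old, t = y, t_new
    return y_old
```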

SLIDES 69–71

Proximal Splitting Algorithms

Metric Conditioning

F = f + g, Forward-Backward:
  • variable metric forward-backward (Chen and Rockafellar, 1997)
  • quasi-Newton forward-backward (Becker and Fadili, 2012)

F = f + Σ_i g_i, Generalized Forward-Backward (Raguet and Landrieu, 2015):
∀i, y_i^(k+1) = y_i^(k) + prox^{Γ⁻¹W_i}_{g_i}(2x^(k) − y_i^(k) − Γ∇f(x^(k))) − x^(k);  x^(k+1) = Σ_i W_i y_i^(k+1)
  • Γ approximates ‘‘(∇²F)⁻¹’’
  • Σ_i W_i = Id, but each W_i might be only semidefinite
  • prox^{Γ⁻¹W_i}_{g_i} might be computable when prox_{g_i} is not

F = g ∘ L + h, Primal-Dual Hybrid Gradient: preconditioning on L (Pock and Chambolle, 2011)
F = f + g ∘ L + h, Forward-Backward Primal-Dual: preconditioning on both L and ‘‘∇²f’’ (Lorenz and Pock, 2015)
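As a minimal illustration of the idea, our own reduction to the simplest setting: plain forward-backward with a fixed diagonal metric Γ. For a diagonal metric, the prox of λ‖·‖₁ in the metric is still a component-wise soft-thresholding, with per-coordinate thresholds:

```python
import numpy as np

def preconditioned_fb_lasso(A, b, lam, iters=500):
    """Forward-backward with a fixed diagonal metric Gamma ~ diag(grad^2 f)^-1.
    For f(x) = 0.5||Ax - b||^2, diag(grad^2 f) holds the squared column norms of A.
    This per-coordinate choice is a heuristic; the steps may need shrinking
    to guarantee convergence in general."""
    gamma = 1.0 / np.sum(A * A, axis=0)        # per-coordinate steps Gamma_vv
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - gamma * (A.T @ (A @ x - b))    # forward step in the metric Gamma
        x = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # prox in the metric
    return x
```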

SLIDE 72

Proximal Splitting Algorithms

Stochastic and distributed versions

Douglas–Rachford and ADMM: seminal work of Iutzeler et al. (2013)

All methods fall within the scope of stochastic fixed-point algorithms (Combettes and Pesquet, 2015)

Special case of Forward-Douglas–Rachford: replace ∇f by a random variable G
Typical convergence conditions (Cevher et al., 2016):
  • E[G^(k) | X^(1), ..., X^(k)] = ∇f(X^(k)) a.s.
  • Σ_k E[‖G^(k) − ∇f(X^(k))‖² | X^(1), ..., X^(k)] < +∞ a.s.
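A sketch of the simplest estimator G satisfying the first (unbiasedness) condition: a uniformly sampled, rescaled mini-batch gradient for the logistic loss of the first part (the function and argument names are ours):

```python
import numpy as np

def minibatch_gradient(w, X, c, batch, rng):
    """Unbiased estimator G of grad f for the logistic loss: sample a
    mini-batch uniformly without replacement and rescale by n/batch,
    so that E[G | w] = grad f(w)."""
    n = X.shape[0]
    idx = rng.choice(n, size=batch, replace=False)
    s = 1.0 / (1.0 + np.exp(-c[idx] * (X[idx] @ w)))
    return (n / batch) * X[idx].T @ (-c[idx] * (1.0 - s))
```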

SLIDE 73

Proximal Splitting Algorithms

Nonconvex cases

F = f + g, Forward-Backward:
  • both functions possibly nonconvex (Attouch et al., 2013)
  • f smooth, g convex (Ochs et al., 2014; Chouzenoux et al., 2014)

F = g ∘ L + h, Primal-Dual Hybrid Gradient:
  • g semiconvex, h strongly convex (Möllenhoff et al., 2015)
  • h smooth, L surjective (with ADMM; Li and Pong, 2015)

But my classification of proximal algorithms is no longer relevant in the absence of convexity.

SLIDE 74

Outline
  • Some Motivation
  • Proximal Splitting
  • Variants and Accelerations
  • Cut-pursuit Algorithm

SLIDES 75–78

Cut-pursuit Algorithm

Enhancing proximal algorithms with combinatorial optimization

G = (V, E)
F: (x_v)_{v∈V} ↦ f(x) + Σ_{v∈V} g_v(x_v) + Σ_{(u,v)∈E} w_(u,v)|x_u − x_v|,  with f smooth and g separable

Typical proximal algorithms:
  • GFB (preconditioning)
  • PDHG (if prox_f available)
  • PDFB (use ∇f)
These visit the entire graph at each iteration!

Use the fact that the solution has few constant components:
  • block coordinate methods
  • ‘‘working set’’ (Landrieu and Obozinski, 2017)
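To fix notation, a small sketch (helper names are our own) evaluating the objective F on a weighted graph given as an edge list:

```python
import numpy as np

def graph_tv_energy(x, edges, weights, f, g):
    """Evaluate F(x) = f(x) + sum_v g_v(x_v) + sum_(u,v) w_(u,v) |x_u - x_v|.
    edges is an array of vertex pairs, weights the matching w_(u,v);
    g returns the vector of per-vertex values g_v(x_v)."""
    u, v = edges[:, 0], edges[:, 1]
    tv = np.sum(weights * np.abs(x[u] - x[v]))
    return f(x) + np.sum(g(x)) + tv

# e.g. on a 3-vertex path graph, with quadratic fidelity and no separable term:
x = np.array([1.0, 1.0, 0.0])
edges = np.array([[0, 1], [1, 2]]); weights = np.array([1.0, 1.0])
energy = graph_tv_energy(x, edges, weights,
                         f=lambda x: 0.5 * np.sum((x - 1.0) ** 2),
                         g=lambda x: np.zeros_like(x))
```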
SLIDES 79–84

Cut-pursuit

Working set approach

G = (V, E)
F: (x_v)_{v∈V} ↦ f(x) + Σ_{v∈V} g_v(x_v) + Σ_{(u,v)∈E} w_(u,v)|x_u − x_v|,  with f smooth and g separable

Let 𝒱 be a partition of V, and restrict x = Σ_{U∈𝒱} ξ_U 1_U, constant on each component. On the reduced graph 𝒢 = (𝒱, ℰ), the problem becomes

F^(𝒱): (ξ_U)_{U∈𝒱} ↦ F(Σ_{U∈𝒱} ξ_U 1_U)
  = f(Σ_{U∈𝒱} ξ_U 1_U) + Σ_{U∈𝒱} Σ_{v∈U} g_v(ξ_U) + Σ_{(U,U′)∈ℰ} Σ_{(u,v)∈E∩(U×U′)} w_(u,v)|ξ_U − ξ_{U′}|

Finding ξ^(𝒱) ∈ arg min F^(𝒱) is efficient with a proximal algorithm (if correctly conditioned).

Algorithmic scheme:
  1. solve the reduced problem
  2. refine the partition 𝒱
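A sketch of the reduced-graph construction behind F^(𝒱) (our own minimal version: components are given as vertex labels, and reduced edge weights accumulate the w_(u,v) crossing each pair of components):

```python
import numpy as np
from collections import defaultdict

def reduce_graph(labels, edges, weights):
    """Build the reduced graph (components and inter-component edges):
    reduced edge weights sum w_(u,v) over the edges of E crossing each
    pair of distinct components."""
    reduced = defaultdict(float)
    for (u, v), w in zip(edges, weights):
        a, b = labels[u], labels[v]
        if a != b:
            reduced[(min(a, b), max(a, b))] += w
    return dict(reduced)

# e.g. path 0-1-2 with components {0, 1} and {2}:
print(reduce_graph(np.array([0, 0, 1]), [(0, 1), (1, 2)], [1.0, 2.0]))
# {(0, 1): 2.0}
```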
SLIDES 85–91

Cut-pursuit

Refining the partition

F: (x_v)_{v∈V} ↦ f(x) + Σ_{v∈V} g_v(x_v) + Σ_{(u,v)∈E} w_(u,v)|x_u − x_v|

Directional derivative F′(x, d), term by term:
  • smooth part: ∇_v f(x) d_v
  • separable part: g′_v(x_v, +1) d_v or g′_v(x_v, −1) d_v (one-sided derivatives)
  • graph part: w_(u,v) sign(x_v − x_u) d_v where x_u ≠ x_v; w_(u,v)|d_u − d_v| where x_u = x_v

Steepest descent direction? Writing E₌(x) for the edges with x_u = x_v, and collecting the linear terms into δ⁺_v(x) and δ⁻_v(x):
arg min_{d∈R^V} F′(x, d),  F′(x, d) = Σ_{v∈V, d_v>0} δ⁺_v(x) d_v + Σ_{v∈V, d_v<0} δ⁻_v(x) d_v + Σ_{(u,v)∈E₌(x)} w_(u,v)|d_u − d_v|

Steepest binary descent direction:
arg min_{d∈{−1,+1}^V} F′(x, d) = Σ_{v: d_v=+1} δ⁺_v(x) − Σ_{v: d_v=−1} δ⁻_v(x) + Σ_{(u,v)∈E₌(x)} w_(u,v)|d_u − d_v|
This can be solved by a minimal cut in an appropriate flow graph (a code sketch of this construction follows below).
[figure: flow graph with source s, sink t, vertices u, v, w; capacities δ⁺_u(x), −δ⁻_u(x), and 2w_(u,v)]

Steepest ternary descent direction:
arg min_{d∈{−1,0,+1}^V} F′(x, d), of the same form; also solved by a minimal cut in an appropriate flow graph.
Theorem: this set of descent directions is rich enough to ensure optimality.
[figure: flow graph with duplicated vertices u^(1), v^(1), w^(1), u^(2), v^(2), w^(2); capacities w_(u,v), w_(v,u), −δ⁻_u(x) + m_u, δ⁺_u(x) + m_u, m_u]
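A sketch of the binary case with networkx (our own toy construction; delta_plus and delta_minus stand for δ⁺_v(x) and δ⁻_v(x), and unary costs are encoded as source/sink capacities in the usual way, with 2w_(u,v) on the edges of E₌(x)):

```python
import networkx as nx

def steepest_binary_direction(vertices, edges, weights, delta_plus, delta_minus):
    """Solve arg min over d in {-1,+1}^V of
    sum_{d_v=+1} delta_plus[v] - sum_{d_v=-1} delta_minus[v]
      + sum_(u,v) 2 w_(u,v) [d_u != d_v]
    by a minimal s-t cut (vertices on the source side get d_v = -1)."""
    G = nx.DiGraph()
    for v in vertices:
        diff = delta_plus[v] + delta_minus[v]     # cost(+1) - cost(-1)
        if diff >= 0:
            G.add_edge("s", v, capacity=diff)     # paid if v takes label +1
        else:
            G.add_edge(v, "t", capacity=-diff)    # paid if v takes label -1
    for (u, v), w in zip(edges, weights):
        G.add_edge(u, v, capacity=2.0 * w)        # |d_u - d_v| = 2 on a cut edge
        G.add_edge(v, u, capacity=2.0 * w)
    _, (source_side, _) = nx.minimum_cut(G, "s", "t")
    return {v: -1 if v in source_side else +1 for v in vertices}
```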

SLIDE 92

Cut-pursuit

Preliminary results

Brain source identification in electroencephalography

F: x ↦ ½‖y − Φx‖² + Σ_{v∈V} (λ_v|x_v| + ι_{R₊}(x_v)) + Σ_{(u,v)∈E} w_(u,v)|x_u − x_v|
|V| = 19 626, |E| = 29 439

SLIDES 93–94

Cut-pursuit

Preliminary results

Regularization of 3D point cloud classification, given probabilistic assignments q ∈ R^{V×K}

F: p ↦ Σ_{v∈V} KL(q_v, p_v) + Σ_{v∈V} ι_{Δ_K}(p_v) + Σ_{(u,v)∈E} w_(u,v)‖p_u − p_v‖₁
|V| = 3 000 111, |E| = 17 206 938

Next: parallelize graph cuts along the components of 𝒱
  • almost linear acceleration
  • distributed optimization
SLIDE 95

Integration in ICAR team

Strengths

  • continuous methods
  • regularization techniques
  • convex optimization

Weaknesses

  • not (yet) an expert in (deep) learning
  • not familiar with ‘‘discrete formulations’’

Research interest

  • registration and inverse problems for medical imaging
  • high-resolution satellite image segmentation
  • dependence measures for identifying functional

relationship between data with statistical tools

SLIDES 96–101

References

Attouch, H., Bolte, J., and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2627–2635.

Cevher, V., Vũ, B. C., and Yurtsever, A. (2016). Stochastic forward-Douglas–Rachford splitting for monotone inclusions. Technical report, EPFL.

Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145.

Chen, G. H.-G. and Rockafellar, R. T. (1997). Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 7(2):421–444.

Chouzenoux, E., Pesquet, J.-C., and Repetti, A. (2014). Variable metric forward-backward algorithm for minimizing the sum of a differentiable function and a convex function. Journal of Optimization Theory and Applications, 162(1):107–132.

Combettes, P. L. and Pesquet, J.-C. (2015). Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM Journal on Optimization, 25:1221–1248.

Condat, L. (2013). A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479.

Gabay, D. and Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40.

Iutzeler, F., Bianchi, P., and Hachem, W. (2013). Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In IEEE Conference on Decision and Control.

Iutzeler, F. and Hendrickx, J. M. (2018). A generic online acceleration scheme for optimization algorithms via relaxation and inertia.

Landrieu, L. and Obozinski, G. (2017). Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs. SIAM Journal on Imaging Sciences, 10(4):1724–1766.

Li, G. and Pong, T. K. (2015). Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460.

Lions, P.-L. and Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979.

Lorenz, D. A. and Pock, T. (2015). An inertial forward-backward algorithm for monotone inclusions. Journal of Mathematical Imaging and Vision, 51(2):311–325.

Möllenhoff, T., Strekalovskiy, E., Moeller, M., and Cremers, D. (2015). The primal-dual hybrid gradient method for semiconvex splittings. SIAM Journal on Imaging Sciences, 8(2):827–857.

Ochs, P., Chen, Y., Brox, T., and Pock, T. (2014). iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419.

Pock, T. and Chambolle, A. (2011). Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In IEEE International Conference on Computer Vision, pages 1762–1769. IEEE.

Raguet, H., Fadili, J., and Peyré, G. (2013). A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 6(3):1199–1226.

Raguet, H. and Landrieu, L. (2015). Preconditioning of a generalized forward-backward splitting and application to optimization on graphs. SIAM Journal on Imaging Sciences, 8(4):2706–2739.

Spingarn, J. E. (1983). Partial inverse of a monotone operator. Applied Mathematics and Optimization, 10(1):247–265.

Vũ, B. C. (2013). A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681.