SLIDE 1

Primal-dual algorithms for the sum of two and three functions¹

Ming Yan, Michigan State University, CMSE/Mathematics

¹ This work is partially supported by NSF.

SLIDE 2

optimization problems for primal-dual algorithms

    minimize_x  f(x) + g(x) + h(Ax)

  • f, g, and h are convex.
  • X and Y are two Hilbert spaces (e.g., R^m, R^n).
  • f : X → R is differentiable with a 1/β-Lipschitz continuous gradient for some β ∈ (0, +∞).
  • A : X → Y is a bounded linear operator.

SLIDE 3

applications: statistics

Elastic net regularization (Zou-Hastie '05):

    minimize_x  µ₂‖x‖₂² + µ₁‖x‖₁ + l(Ax, b),

where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and l is the loss function, which may be nondifferentiable.

SLIDE 4

applications: statistics

Fused lasso (Tibshirani et al. '05):

    minimize_x  (1/2)‖Ax − b‖₂² + µ₁‖x‖₁ + µ₂‖Dx‖₁,

where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and

    D = [ −1   1
               −1   1
                  ⋱    ⋱
                     −1   1 ]

is a matrix in R^{(p−1)×p}.
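
As a quick illustration, the following NumPy sketch builds the difference matrix D and evaluates the fused-lasso objective; the data sizes and parameter values are placeholders, not taken from the slides.

import numpy as np

def difference_matrix(p):
    """(p-1) x p matrix with rows (-1, 1): (Dx)_i = x_{i+1} - x_i."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def fused_lasso_objective(x, A, b, mu1, mu2):
    D = difference_matrix(x.size)
    return (0.5 * np.linalg.norm(A @ x - b) ** 2
            + mu1 * np.linalg.norm(x, 1)
            + mu2 * np.linalg.norm(D @ x, 1))

# illustrative random data
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x = rng.standard_normal(100)
b = A @ x + 0.01 * rng.standard_normal(50)
print(fused_lasso_objective(x, A, b, mu1=0.1, mu2=0.1))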

SLIDE 5

applications: decentralized optimization

    minimize_x  ∑_{i=1}^n fᵢ(x) + gᵢ(x)

  • fᵢ and gᵢ are known at node i only.
  • Nodes 1, · · · , n are connected in an undirected graph.
  • fᵢ is differentiable with a Lipschitz continuous gradient.

SLIDE 6

applications: decentralized optimization

Introduce a copy xᵢ at node i:

    minimize_x  f(x) + g(x) := ∑_{i=1}^n fᵢ(xᵢ) + gᵢ(xᵢ)   s.t. Wx = x

  • xᵢ ∈ R^p, x = [x₁ x₂ · · · xₙ]⊤ ∈ R^{n×p}.
  • W is a symmetric doubly stochastic mixing matrix.

SLIDE 7

applications: decentralized optimization

The sum of three functions:

    minimize_x  f(x) + g(x) + ι₀((I − W)^{1/2} x)
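
The slides do not spell out how W is constructed; below is a hedged sketch of one standard construction that satisfies the symmetric, doubly stochastic requirement (Metropolis weights, an assumption rather than the talk's choice).

import numpy as np

def metropolis_weights(adj):
    """Symmetric doubly stochastic mixing matrix from a 0/1 adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # put the remaining mass on the diagonal
    return W

# 4-node path graph as an example
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
W = metropolis_weights(adj)
print(np.allclose(W, W.T), np.allclose(W.sum(axis=1), 1.0))  # True True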

SLIDE 8

applications: imaging

Image restoration with two regularizations:

    minimize_x  (1/2)‖Ax − b‖₂² + ι_C(x) + µ‖Dx‖₁,

where x ∈ R^n is the image to be reconstructed, A ∈ R^{m×n} is the forward projection matrix, b ∈ R^m is the measured data with noise, D is a discrete gradient operator, and ι_C is the indicator function that returns zero if x ∈ C (here, C is the set of nonnegative vectors in R^n) and +∞ otherwise.

SLIDE 9

applications: imaging

Other problems:

  • f: data fitting term (infimal convolution for mixed noise)
  • h ∘ A: total variation; other transforms
  • g: nonnegativity; box constraint

SLIDE 10

primal-dual formulation

Original problem:

    minimize_x  f(x) + g(x) + h(Ax)

SLIDE 11

primal-dual formulation

Introduce a dual variable s:

    minimize_x max_s  f(x) + g(x) + ⟨Ax, s⟩ − h*(s)

Here h* is the conjugate function of h, which is defined as

    h*(s) = max_t  ⟨s, t⟩ − h(t).
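
In computations, the resolvent of ∂h* is usually obtained from the prox of h itself through the Moreau decomposition, prox_{σh*}(s) = s − σ prox_{h/σ}(s/σ). A minimal sketch, with h = ‖·‖₁ as an illustrative choice:

import numpy as np

def prox_l1(t, tau):
    """prox of tau*||.||_1: componentwise soft-thresholding."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def prox_conjugate(s, sigma, prox_h):
    """prox of sigma*h^* via the Moreau decomposition."""
    return s - sigma * prox_h(s / sigma, 1.0 / sigma)

s = np.array([3.0, -0.2, 0.5])
# for h = ||.||_1 this equals the projection onto the unit l_inf ball
print(prox_conjugate(s, 2.0, prox_l1))   # [ 1.  -0.2  0.5]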

SLIDE 12

primal-dual formulation

The saddle-point problem is equivalent to (using s* ∈ ∂h(Ax*) ⟺ Ax* ∈ ∂h*(s*)):

    0 ∈ ∇f(x*) + ∂g(x*) + A⊤s*
    0 ∈ ∂h*(s*) − Ax*

All primal-dual algorithms try to find (x*, s*).

SLIDE 13

existing algorithms: Condat-Vu, AFBA, and PDFP

Condat-Vu (Condat '13, Vu '13):

  • Convergence condition: λ‖AA⊤‖ + γ/(2β) ≤ 1
  • Per-iteration computations: A, A⊤, ∇f, one (I + γ∂g)⁻¹,² one (I + (λ/γ)∂h*)⁻¹

² (I + γ∂g)⁻¹(x̃) = argmin_x γg(x) + (1/2)‖x − x̃‖². This is a backward step (or implicit step) because (I + γ∂g)⁻¹(x̃) ∈ x̃ − γ∂g((I + γ∂g)⁻¹(x̃)).
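
A small numerical sanity check (illustrative, not from the talk) that this resolvent, for g = ‖·‖₁, is soft-thresholding, i.e. that the closed form really solves the argmin above:

import numpy as np
from scipy.optimize import minimize

gamma = 0.7
x_tilde = np.array([2.0, -0.3, 1.1])

# closed form: componentwise soft-thresholding at level gamma
soft = np.sign(x_tilde) * np.maximum(np.abs(x_tilde) - gamma, 0.0)

# numerical argmin of gamma*||x||_1 + 0.5*||x - x_tilde||^2
obj = lambda x: gamma * np.abs(x).sum() + 0.5 * ((x - x_tilde) ** 2).sum()
num = minimize(obj, x0=np.zeros(3), method="Nelder-Mead").x
print(soft, num)   # the two agree up to solver tolerance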

SLIDE 14

existing algorithms: Condat-Vu, AFBA, and PDFP

AFBA (Latafat-Patrinos '16):

  • Convergence condition: λ‖AA⊤‖/2 + √((λ‖AA⊤‖/2)² + γ/(2β)) ≤ 1
  • Per-iteration computations: A, A⊤, ∇f, one (I + γ∂g)⁻¹, one (I + (λ/γ)∂h*)⁻¹

SLIDE 15

existing algorithms: Condat-Vu, AFBA, and PDFP

PDFP (Chen-Huang-Zhang '16):

  • Convergence conditions: λ‖AA⊤‖ < 1; γ/(2β) < 1
  • Per-iteration computations: A, A⊤, ∇f, two (I + γ∂g)⁻¹, one (I + (λ/γ)∂h*)⁻¹

SLIDE 16

PDHG (Zhu-Chan '08)

When f = 0, we have

    [ ∂g    A⊤  ] [ x* ]
    [ −A    ∂h* ] [ s* ]  ∋  0

SLIDE 17

PDHG (Zhu-Chan '08)

It is equivalent to

    [ ∂g    0   ] [ x* ]     [ 0   −A⊤ ] [ x* ]
    [ −A    ∂h* ] [ s* ]  ∋  [ 0    0  ] [ s* ]

SLIDE 18

PDHG (Zhu-Chan '08)

It is equivalent to

    [ (1/γ)I + ∂g                 ] [ x* ]     [ (1/γ)I   −A⊤    ] [ x* ]
    [ −A           (γ/λ)I + ∂h* ] [ s* ]  ∋  [ 0        (γ/λ)I ] [ s* ]

SLIDE 19

PDHG (Zhu-Chan '08)

Primal-dual hybrid gradient (PDHG):

    x⁺ = (I + γ∂g)⁻¹(x − γA⊤s)
    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax⁺)
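
A minimal PDHG sketch for one assumed instance, f = 0, g = µ‖·‖₁, and h(y) = ‖y − b‖₁ (so the resolvent of (λ/γ)∂h* is a shift and clip); everything here is illustrative, not code from the talk.

import numpy as np

def soft(t, tau):
    # soft-thresholding: the resolvent (I + tau*d||.||_1)^{-1}
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def pdhg(A, b, mu, gamma, lam, iters=1000):
    # min_x mu*||x||_1 + ||Ax - b||_1, i.e. g = mu*||.||_1, h = ||. - b||_1
    m, n = A.shape
    x, s = np.zeros(n), np.zeros(m)
    sigma = lam / gamma
    for _ in range(iters):
        x = soft(x - gamma * (A.T @ s), gamma * mu)         # backward step in x
        s = np.clip(s + sigma * (A @ x - b), -1.0, 1.0)     # resolvent of sigma*dh*
    return x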

SLIDE 20

PDHG (Zhu-Chan '08)

Chambolle-Pock (Chambolle et al. '09, Esser-Zhang-Chan '10):

    x⁺ = (I + γ∂g)⁻¹(x − γA⊤s)
    x̄⁺ = x⁺ + x⁺ − x = 2x⁺ − x
    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄⁺)
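
For contrast, a sketch of the Chambolle-Pock variant under the same assumed instance as the PDHG code above (again illustrative); the only change is the extrapolated point fed to the dual update.

import numpy as np

def soft(t, tau):
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def chambolle_pock(A, b, mu, gamma, lam, iters=1000):
    # same assumed instance: g = mu*||.||_1, h = ||. - b||_1
    m, n = A.shape
    x, s = np.zeros(n), np.zeros(m)
    sigma = lam / gamma
    for _ in range(iters):
        x_new = soft(x - gamma * (A.T @ s), gamma * mu)
        x_bar = 2.0 * x_new - x                             # extrapolation
        s = np.clip(s + sigma * (A @ x_bar - b), -1.0, 1.0)
        x = x_new
    return x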

SLIDE 21

Chambolle-Pock '11 as proximal point

Chambolle-Pock (x − s order):

    x⁺ = (I + γ∂g)⁻¹(x − γA⊤s)
    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)A(2x⁺ − x))

SLIDE 22

Chambolle-Pock '11 as proximal point

CP is equivalent to the backward operator applied on the KKT system.

SLIDE 23

Chambolle-Pock '11 as proximal point

    ( [ (1/γ)I   −A⊤    ]     [ ∂g    A⊤  ] ) [ x⁺ ]     [ (1/γ)I   −A⊤    ] [ x ]
    ( [ −A       (γ/λ)I ]  +  [ −A    ∂h* ] ) [ s⁺ ]  ∋  [ −A       (γ/λ)I ] [ s ]

SLIDE 24

Chambolle-Pock '11 as proximal point

  • CP is 1/2-averaged under the metric induced by the matrix if λ satisfies the condition λ‖AA⊤‖ ≤ 1.

SLIDE 25

Condat-Vu (Condat '13, Vu '13)

The optimality condition:

    [ ∂g    A⊤  ] [ x* ]     [ ∇f(x*) ]
    [ −A    ∂h* ] [ s* ]  +  [    0   ]  ∋  0

CV is equivalent to the forward-backward applied on the KKT system.

    ( [ (1/γ)I   −A⊤    ]     [ ∂g    A⊤  ] ) [ x⁺ ]     [ (1/γ)I   −A⊤    ] [ x ]     [ ∇f(x) ]
    ( [ −A       (γ/λ)I ]  +  [ −A    ∂h* ] ) [ s⁺ ]  ∋  [ −A       (γ/λ)I ] [ s ]  −  [    0   ]

SLIDE 26

Condat-Vu (Condat '13, Vu '13)

That is:

    x⁺ = (I + γ∂g)⁻¹(x − γ∇f(x) − γA⊤s)
    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)A(2x⁺ − x))

It is equivalent to (by changing the update order):

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γ∇f(x) − γA⊤s⁺)
    x̄⁺ = 2x⁺ − x
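
A generic sketch of the s-first form above; grad_f, prox_g, and prox_hconj are user-supplied callables (prox_hconj(t, σ) is the resolvent of σ∂h*), all assumptions for illustration.

import numpy as np

def condat_vu(grad_f, prox_g, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    x, s = x0.copy(), s0.copy()
    x_bar = x.copy()
    sigma = lam / gamma
    for _ in range(iters):
        s = prox_hconj(s + sigma * (A @ x_bar), sigma)        # dual backward step
        x_new = prox_g(x - gamma * grad_f(x) - gamma * (A.T @ s), gamma)
        x_bar = 2.0 * x_new - x                               # extrapolation
        x = x_new
    return x, s

Only one resolvent of each type and one gradient evaluation are needed per iteration, matching the per-iteration cost listed on SLIDE 13.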

SLIDE 27

Condat-Vu (Condat '13, Vu '13)

  • CV is non-expansive (forward-backward) under the metric induced by the matrix if γ and λ satisfy the condition λ‖AA⊤‖ + γ/(2β) ≤ 1.

SLIDE 28

PDFP2O/PAPC (Loris-Verhoeven '11, Chen-Huang-Zhang '13, Drori-Sabach-Teboulle '15)

When g = 0, the optimality condition becomes:

    [ 0     A⊤  ] [ x* ]     [ ∇f(x*) ]
    [ −A    ∂h* ] [ s* ]  +  [    0   ]  ∋  0

SLIDE 29

PDFP2O/PAPC (Loris-Verhoeven '11, Chen-Huang-Zhang '13, Drori-Sabach-Teboulle '15)

PAPC is equivalent to the forward-backward applied on the KKT system.

    [ (1/γ)I   A⊤                     ] [ x⁺ ]     [ (1/γ)I                    ] [ x ]     [ ∇f(x) ]
    [ −A       (γ/λ)I − γAA⊤ + ∂h* ] [ s⁺ ]  ∋  [          (γ/λ)I − γAA⊤ ] [ s ]  −  [    0   ]

SLIDE 30

PDFP2O/PAPC (Loris-Verhoeven '11, Chen-Huang-Zhang '13, Drori-Sabach-Teboulle '15)

Equivalently,

    [ (1/γ)I   A⊤             ] [ x⁺ ]     [ (1/γ)I                   ] [ x ]     [  ∇f(x)   ]
    [          (γ/λ)I + ∂h* ] [ s⁺ ]  ∋  [ A        (γ/λ)I − γAA⊤ ] [ s ]  −  [ γA∇f(x) ]

  • PAPC is non-expansive (forward-backward) under the metric induced by the matrix if γ and λ satisfy the conditions λ‖AA⊤‖ ≤ 1 and γ/(2β) ≤ 1.


SLIDE 31

PAPC

PAPC can be expressed as

    s⁺ = (I + (λ/γ)∂h*)⁻¹((I − λAA⊤)s + (λ/γ)A(x − γ∇f(x)))
    x⁺ = x − γ∇f(x) − γA⊤s⁺

SLIDE 32

PAPC

It is equivalent to

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x − γ∇f(x) − γA⊤s⁺
    x̄⁺ = x⁺ − γ∇f(x⁺) − γA⊤s⁺

SLIDE 33

PAPC

  • PAPC is α-averaged under the metric induced by the matrix.
  • PAPC converges if γ and λ satisfy the conditions λ‖AA⊤‖ < 4/3 and γ/(2β) < 1 (Li-Yan '17).
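
A hedged sketch of PAPC as expressed on SLIDE 31 (g = 0); grad_f and prox_hconj are user-supplied callables, with prox_hconj(t, σ) the resolvent of σ∂h*.

import numpy as np

def papc(grad_f, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    x, s = x0.copy(), s0.copy()
    sigma = lam / gamma
    for _ in range(iters):
        w = x - gamma * grad_f(x)                    # forward (gradient) step
        s = prox_hconj(s - lam * (A @ (A.T @ s)) + sigma * (A @ w), sigma)
        x = w - gamma * (A.T @ s)
    return x, s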

SLIDE 34

PDFP (Chen-Huang-Zhang '16)

Rewrite PDFP2O as

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x − γ∇f(x) − γA⊤s⁺
    x̄⁺ = x⁺ − γ∇f(x⁺) − γA⊤s⁺

SLIDE 35

PDFP (Chen-Huang-Zhang '16)

PDFP, as a generalization of PDFP2O, is

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γ∇f(x) − γA⊤s⁺)
    x̄⁺ = (I + γ∂g)⁻¹(x⁺ − γ∇f(x⁺) − γA⊤s⁺)

  • When g is the indicator function, PDFP reduces to the Preconditioned Alternating Projection Algorithm (PAPA) (Krol-Li-Shen-Xu '12).
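
A hedged PDFP sketch following the display above (same callable conventions as the earlier sketches); the initialization of x̄ is an assumption.

import numpy as np

def pdfp(grad_f, prox_g, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    x, s = x0.copy(), s0.copy()
    x_bar = x.copy()                     # initialization choice is illustrative
    sigma = lam / gamma
    for _ in range(iters):
        s = prox_hconj(s + sigma * (A @ x_bar), sigma)
        x = prox_g(x - gamma * grad_f(x) - gamma * (A.T @ s), gamma)      # first prox of g (old x)
        x_bar = prox_g(x - gamma * grad_f(x) - gamma * (A.T @ s), gamma)  # second prox of g (new x)
    return x, s

Note the two resolvents of ∂g per iteration, matching the cost listed on SLIDE 15.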

SLIDE 36

AFBA (Latafat-Patrinos '16)

Rewrite PAPC as

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x̄ − γA⊤(s⁺ − s)
    x̄⁺ = x⁺ − γ∇f(x⁺) − γA⊤s⁺

SLIDE 37

AFBA (Latafat-Patrinos '16)

AFBA, as a generalization of PAPC, is

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x̄ − γA⊤(s⁺ − s)
    x̄⁺ = (I + γ∂g)⁻¹(x⁺ − γ∇f(x⁺) − γA⊤s⁺)

Convergence condition: λ‖AA⊤‖/2 + √((λ‖AA⊤‖/2)² + γ/(2β)) ≤ 1
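
A hedged AFBA sketch following the display above (same callable conventions as the earlier sketches).

import numpy as np

def afba(grad_f, prox_g, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    x_bar, s = x0.copy(), s0.copy()
    sigma = lam / gamma
    for _ in range(iters):
        s_new = prox_hconj(s + sigma * (A @ x_bar), sigma)
        x = x_bar - gamma * (A.T @ (s_new - s))      # correction with the dual change
        s = s_new
        x_bar = prox_g(x - gamma * grad_f(x) - gamma * (A.T @ s), gamma)
    return x, s
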
SLIDE 38

Chambolle-Pock and PAPC

Chambolle-Pock:

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γA⊤s⁺)
    x̄⁺ = 2x⁺ − x

SLIDE 39

Chambolle-Pock and PAPC

PAPC:

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x − γ∇f(x) − γA⊤s⁺
    x̄⁺ = x⁺ − γ∇f(x⁺) − γA⊤s⁺

SLIDE 40

Chambolle-Pock and PAPC

PAPC (x̄⁺ rewritten):

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x − γ∇f(x) − γA⊤s⁺
    x̄⁺ = 2x⁺ − x − γ∇f(x⁺) + γ∇f(x)

SLIDE 41

Chambolle-Pock and PAPC

PD3O (Yan '16):

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γ∇f(x) − γA⊤s⁺)
    x̄⁺ = 2x⁺ − x − γ∇f(x⁺) + γ∇f(x)
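
A hedged PD3O sketch following the display above; caching the gradient makes clear that only one ∇f, one resolvent of ∂g, and one resolvent of ∂h* are needed per iteration.

import numpy as np

def pd3o(grad_f, prox_g, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    x, s = x0.copy(), s0.copy()
    x_bar, gx = x.copy(), grad_f(x)      # cache the gradient: one grad_f per iteration
    sigma = lam / gamma
    for _ in range(iters):
        s = prox_hconj(s + sigma * (A @ x_bar), sigma)
        x_new = prox_g(x - gamma * gx - gamma * (A.T @ s), gamma)
        gx_new = grad_f(x_new)
        x_bar = 2.0 * x_new - x - gamma * gx_new + gamma * gx
        x, gx = x_new, gx_new
    return x, s

Compared with Chambolle-Pock, the only extra memory cost is storing ∇f(x) from the previous iteration.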

SLIDE 42

Chambolle-Pock

Chambolle-Pock (x − s order):

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γA⊤s⁺)
    x̄⁺ = 2x⁺ − x

SLIDE 43

Chambolle-Pock

(Diagram: one iteration traced through the points (z, s) → (x⁺, s) → (2x⁺ − z, s) → (2x⁺ − z − γA⊤s⁺, s⁺) → (z⁺, s⁺).)

Chambolle-Pock (x − s order):

    z  = x − γA⊤s
    x⁺ = (I + γ∂g)⁻¹(z)
    s⁺ = (I + (λ/γ)∂h*)⁻¹((I − λAA⊤)s + (λ/γ)A(2x⁺ − z))
    z⁺ = z + 2x⁺ − z − γA⊤s⁺ − x⁺

SLIDE 44

C-P and PAPC

PAPC:

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = x − γ∇f(x) − γA⊤s⁺
    x̄⁺ = 2x⁺ − x − γ∇f(x⁺) + γ∇f(x)

SLIDE 45

C-P and PAPC

(Diagram: one iteration traced through (z, s) → (x⁺, s) → (2x⁺ − z, s) → (2x⁺ − z − γ∇f(x⁺), s) → (2x⁺ − z − γ∇f(x⁺) − γA⊤s⁺, s⁺) → (z⁺, s⁺).)

PAPC:

    x⁺ = z = x − γ∇f(x) − γA⊤s
    s⁺ = (I + (λ/γ)∂h*)⁻¹((I − λAA⊤)s + (λ/γ)A(2x⁺ − z − γ∇f(x⁺)))
    z⁺ = z + 2x⁺ − z − γ∇f(x⁺) − γA⊤s⁺ − x⁺

SLIDE 46

PD3O

PD3O:

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γ∇f(x) − γA⊤s⁺)
    x̄⁺ = 2x⁺ − x − γ∇f(x⁺) + γ∇f(x)

SLIDE 47

PD3O

PD3O:

    z  = x − γ∇f(x) − γA⊤s
    x⁺ = (I + γ∂g)⁻¹(z)
    s⁺ = (I + (λ/γ)∂h*)⁻¹((I − λAA⊤)s + (λ/γ)A(2x⁺ − z − γ∇f(x⁺)))
    z⁺ = z + 2x⁺ − z − γ∇f(x⁺) − γA⊤s⁺ − x⁺

SLIDE 48

PD3O vs Condat-Vu vs AFBA vs PDFP

Algorithms:

    s⁺ = (I + (λ/γ)∂h*)⁻¹(s + (λ/γ)Ax̄)
    x⁺ = (I + γ∂g)⁻¹(x − γ∇f(x) − γA⊤s⁺)

with

    PDFP:       x̄⁺ = (I + γ∂g)⁻¹(x⁺ − γ∇f(x⁺) − γA⊤s⁺)
    Condat-Vu:  x̄⁺ = 2x⁺ − x
    PD3O:       x̄⁺ = 2x⁺ − x + γ∇f(x) − γ∇f(x⁺)

SLIDE 49

PD3O vs Condat-Vu vs AFBA vs PDFP

Parameters and special cases:

    Algorithm    Parameters                                  f = 0    g = 0
    PDFP         λ‖AA⊤‖ < 1;  γ/(2β) < 1                              PAPC
    Condat-Vu    λ‖AA⊤‖ + γ/(2β) ≤ 1                         C-P
    AFBA         λ‖AA⊤‖/2 + √((λ‖AA⊤‖/2)² + γ/(2β)) ≤ 1               PAPC
    PD3O         λ‖AA⊤‖ ≤ 1;  γ/(2β) < 1                     C-P      PAPC

SLIDE 50

convergence results: summary

Let z = x − γ∇f(x) − γA⊤s and x⁺ → x:

    x  = (I + γ∂g)⁻¹(z)
    s⁺ = (I + (λ/γ)∂h*)⁻¹((I − λAA⊤)s + (λ/γ)A(2x − z − γ∇f(x)))
    z⁺ = x − γ∇f(x) − γA⊤s⁺

SLIDE 51

convergence results: summary

  • ‖(z^{k+1}, s^{k+1}) − (z^k, s^k)‖²_M = o(1/(k+1)), and (z^k, s^k) weakly converges to a fixed point (z*, s*).

SLIDE 52

convergence results: summary

  • Let L(x, s) = f(x) + g(x) + ⟨Ax, s⟩ − h*(s); then

        L(x̄^k, s) − L(x, s̄^{k+1}) ≤ ‖(z¹, s¹) − (z, s)‖² / k,

    where (x̄^k, s̄^{k+1}) = (1/k) ∑_{i=1}^k (x^i, s^{i+1}) and z = x − γ∇f(x) − γA⊤s.

SLIDE 53

convergence results: summary

  • Linear convergence with additional assumptions on f, g, and h.

SLIDE 54

convergence analysis: the general case

  • Let M = (γ²/λ)(I − λAA⊤) be positive definite. Define ‖s‖_M = √⟨s, Ms⟩ and ‖(z, s)‖_M = √(‖z‖² + ‖s‖²_M).

Lemma

The iteration T mapping (z, s) to (z⁺, s⁺) is a nonexpansive operator under the metric defined by M if γ ≤ 2β. Furthermore, it is α-averaged with α = 2β/(4β − γ).

  • Chambolle-Pock is firmly non-expansive under the new metric, which is different from the previous metric.

SLIDE 55

convergence analysis: the general case

Theorem

1) Let (z*, s*) be any fixed point of T. Then (‖(z^k, s^k) − (z*, s*)‖_M)_{k≥0} is monotonically nonincreasing.
2) The sequence (‖T(z^k, s^k) − (z^k, s^k)‖_M)_{k≥0} is monotonically nonincreasing and converges to 0.
3) We have the convergence rate ‖T(z^k, s^k) − (z^k, s^k)‖²_M = o(1/(k+1)).
4) (z^k, s^k) weakly converges to a fixed point of T, and if X has finite dimension (e.g., R^m), then it is strongly convergent.

SLIDE 56

convergence analysis: linear convergence

Denote (with z̃ = 2x − z − γ∇f(x)):

    u_h = (γ/λ)(I − λAA⊤)s + Az̃ − (γ/λ)s⁺ ∈ ∂h*(s⁺),
    u_g = (1/γ)(z − x) ∈ ∂g(x),
    u*_h = A(z̃* − γA⊤s*) = Ax* ∈ ∂h*(s*),
    u*_g = (1/γ)(z* − x*) ∈ ∂g(x*),

and assume

    ‖∇g(x) − ∇g(y)‖ ≤ L_g‖x − y‖,
    ⟨s⁺ − s*, u_h − u*_h⟩ ≥ τ_h‖s⁺ − s*‖²_M,
    ⟨x − x*, u_g − u*_g⟩ ≥ τ_g‖x − x*‖²,
    ⟨x − x*, ∇f(x) − ∇f(x*)⟩ ≥ τ_f‖x − x*‖².

SLIDE 57

convergence analysis: linear convergence

Theorem

We have

    ‖z⁺ − z*‖² + (1 + 2γτ_h)‖s⁺ − s*‖²_M ≤ ρ (‖z − z*‖² + (1 + 2γτ_h)‖s − s*‖²_M),

where

    ρ = max { 1/(1 + 2γτ_h),  1 − ((2γ − γ²/β)τ_f + 2γτ_g)/(1 + γL_g) }.

When, in addition, γ < 2β, τ_h > 0, and τ_f + τ_g > 0, we have ρ < 1 and the algorithm converges linearly.

SLIDE 58

numerical experiment: fused lasso

    minimize_x  (1/2)‖Ax − b‖₂² + µ₁‖x‖₁ + µ₂ ∑_{i=1}^{p−1} |x_{i+1} − x_i|,

where x = (x₁, · · · , x_p) ∈ R^p, A ∈ R^{n×p}, b ∈ R^n.

Figure: The true sparse signal and the reconstructed results using PD3O, PDFP, and Condat-Vu. The right figure is a zoom-in of the signal in [3000, 5500].

SLIDE 59

numerical experiment: fused lasso

(Figure: relative objective error (f − f*)/f* versus iteration, on a log scale from about 10³ down to 10⁻⁴, for PD3O, PDFP, and Condat-Vu under several stepsize choices.)

Figure: In the left figure, we fix λ = 1/8 and let γ = β, 1.5β, 1.9β. In the right figure, we fix γ = 1.9β and let λ = 1/80, 1/8, 1/4.
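
An illustrative reconstruction of this experiment's setup, reusing difference_matrix and pd3o from the earlier sketches; the problem sizes, signal, and regularization weights are assumptions, and only the stepsizes γ = 1.9β and λ = 1/8 come from the caption.

import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 1000                                  # sizes are assumptions
A = rng.standard_normal((n, p))
x_true = np.zeros(p); x_true[100:150] = 2.0       # piecewise-constant signal
b = A @ x_true + 0.01 * rng.standard_normal(n)
mu1, mu2 = 20.0, 200.0                            # placeholder weights

beta = 1.0 / np.linalg.norm(A, 2) ** 2            # 1/(Lipschitz const. of grad f)
gamma, lam = 1.9 * beta, 1.0 / 8.0                # one stepsize pair from the figure

D = difference_matrix(p)                          # from the earlier sketch
grad_f = lambda x: A.T @ (A @ x - b)              # f = 0.5*||Ax - b||^2
prox_g = lambda t, tau: np.sign(t) * np.maximum(np.abs(t) - tau * mu1, 0.0)
prox_hconj = lambda t, sigma: np.clip(t, -mu2, mu2)   # h = mu2*||.||_1
x_hat, _ = pd3o(grad_f, prox_g, prox_hconj, D,
                np.zeros(p), np.zeros(p - 1), gamma, lam)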

SLIDE 60

applications: decentralized optimization

    minimize_x  ∑_{i=1}^n fᵢ(x) + gᵢ(x)

  • fᵢ and gᵢ are known at node i only.
  • Nodes 1, · · · , n are connected in an undirected graph.
  • fᵢ is differentiable with a Lipschitz continuous gradient.

SLIDE 61

applications: decentralized optimization

Introduce a copy xᵢ at node i:

    minimize_x  f(x) + g(x) := ∑_{i=1}^n fᵢ(xᵢ) + gᵢ(xᵢ)   s.t. Wx = x

  • xᵢ ∈ R^p, x = [x₁ x₂ · · · xₙ]⊤ ∈ R^{n×p}.
  • W is a symmetric doubly stochastic mixing matrix.

SLIDE 62

applications: decentralized optimization

The sum of three functions:

    minimize_x  f(x) + g(x) + ι₀((I − W)^{1/2} x)

SLIDE 63

comparing PG-EXTRA and NIDS (Li-Shi-Yan '17)

    minimize_x  (1/2) ∑_{i=1}^n ‖Aᵢxᵢ − bᵢ‖² + µ₁ ∑_{i=1}^n ‖xᵢ‖₁ + ι₀((I − W)^{1/2} x)

SLIDE 64

comparing PG-EXTRA and NIDS (Li-Shi-Yan '17)

(Figure: error versus number of iterations, up to 2 × 10⁴ on a log scale from 10² down to 10⁻⁸, for NIDS with stepsizes 1/L, 1.5/L, 1.9/L and PG-EXTRA with stepsizes 1/L, 1.2/L, 1.3/L, 1.4/L.)

SLIDE 65

conclusion

  • a new primal-dual algorithm for minimizing the sum of three functions.
  • a new interpretation of Chambolle-Pock: Douglas-Rachford splitting on the KKT system under a new metric induced by a block diagonal matrix.
  • PAPC is forward-backward splitting applied on the KKT system under the same metric; we proved the optimal bound for the parameters (dual stepsize).
  • PD3O is a generalization of both Chambolle-Pock and PAPC, and it has the advantages of both Condat-Vu (a generalization of Chambolle-Pock) and AFBA and PDFP (two generalizations of PAPC).
  • In decentralized consensus optimization, we derive a fast method whose stepsize does not depend on the network structure; we provide an optimal bound for the stepsize in PG-EXTRA (Shi et al. '15).

SLIDE 66

Thank You!

Paper 1: M. Yan, A new primal-dual method for minimizing the sum of three functions with a linear operator, arXiv:1611.09805. Code: https://github.com/mingyan08/PD3O
Paper 2: Z. Li, W. Shi and M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, arXiv:1704.07807. Code: https://github.com/mingyan08/NIDS
Paper 3: Z. Li and M. Yan, A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization, arXiv:1711.06785.