SLIDE 1 Primal-dual algorithms for the sum of two and three functions¹
Ming Yan, Michigan State University, CMSE/Mathematics
¹This work is partially supported by NSF.
SLIDE 2 optimization problems for primal-dual algorithms
minimize_x f(x) + g(x) + h(Ax)
- f, g, and h are convex.
- X and Y are two Hilbert spaces (e.g., R^n and R^m).
- f : X → R is differentiable with a 1/β-Lipschitz continuous gradient for some β ∈ (0, +∞).
- A : X → Y is a bounded linear operator.
SLIDES 3–4 applications: statistics
Elastic net regularization (Zou-Hastie '05):
minimize_x μ₂‖x‖₂² + μ₁‖x‖₁ + l(Ax, b),
where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and l is the loss function, which may be nondifferentiable.
Fused lasso (Tibshirani et al. '05):
minimize_x ½‖Ax − b‖₂² + μ₁‖x‖₁ + μ₂‖Dx‖₁,
where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and D ∈ R^{(p−1)×p} is the difference matrix

    D = [ −1   1              ]
        [      −1   1         ]
        [        ⋱    ⋱       ]
        [           −1   1    ]
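A minimal numpy sketch of the fused-lasso pieces above, the difference matrix D and the objective; the function names and the dense construction of D are illustrative choices, not from the slides:

```python
import numpy as np

def difference_matrix(p):
    """The (p-1) x p first-difference matrix D with rows (-1, 1)."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def fused_lasso_objective(x, A, b, mu1, mu2):
    """(1/2)*||Ax - b||_2^2 + mu1*||x||_1 + mu2*||Dx||_1."""
    D = difference_matrix(x.size)
    return (0.5 * np.sum((A @ x - b) ** 2)
            + mu1 * np.sum(np.abs(x))
            + mu2 * np.sum(np.abs(D @ x)))
```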
SLIDES 5–7 applications: decentralized optimization
minimize_x Σ_{i=1}^n (f_i(x) + g_i(x))
- f_i and g_i are known at node i only.
- Nodes 1, …, n are connected in an undirected graph.
- f_i is differentiable with a Lipschitz continuous gradient.
Introduce a copy x_i at node i:
minimize_x f(x) + g(x) := Σ_{i=1}^n (f_i(x_i) + g_i(x_i))  s.t.  Wx = x
- x_i ∈ R^p, x = [x₁ x₂ ⋯ x_n]^⊤ ∈ R^{n×p}.
- W is a symmetric doubly stochastic mixing matrix.
The sum of three functions (Wx = x ⟺ (I − W)^{1/2}x = 0, since I − W is positive semidefinite):
minimize_x f(x) + g(x) + ι₀((I − W)^{1/2}x)
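The slides do not prescribe a particular construction of W; as one standard illustrative choice, a short numpy sketch of Metropolis-Hastings weights, which always give a symmetric doubly stochastic W for an undirected graph:

```python
import numpy as np

def metropolis_mixing_matrix(edges, n):
    """Symmetric doubly stochastic W from an undirected edge list.

    Metropolis-Hastings weights: W_ij = 1/(1 + max(deg_i, deg_j)) for
    each edge (i, j); W_ii = 1 minus the sum of row i's off-diagonals.
    """
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        w = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, j] = W[j, i] = w
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)
    return W

# Example: a path graph on 4 nodes; rows and columns each sum to one.
W = metropolis_mixing_matrix([(0, 1), (1, 2), (2, 3)], 4)
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W, W.T)
```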
SLIDES 8–9 applications: imaging
Image restoration with two regularizations:
minimize_x ½‖Ax − b‖₂² + ι_C(x) + μ‖Dx‖₁,
where x ∈ R^n is the image to be reconstructed, A ∈ R^{m×n} is the forward projection matrix, b ∈ R^m is the measured data with noise, D is a discrete gradient operator, and ι_C is the indicator function that returns zero if x ∈ C (here, C is the set of nonnegative vectors in R^n) and +∞ otherwise.
Other problems:
- f: data fitting term (infimal convolution for mixed noise)
- h ∘ A: total variation; other transforms
- g: nonnegativity; box constraint
SLIDES 10–12 primal-dual formulation
Original problem:
minimize_x f(x) + g(x) + h(Ax)
Introduce a dual variable s:
minimize_x max_s f(x) + g(x) + ⟨Ax, s⟩ − h*(s)
Here h* is the conjugate function of h, defined as h*(s) = max_t ⟨s, t⟩ − h(t).
The saddle-point problem is equivalent to the KKT system (using s* ∈ ∂h(Ax*) ⟺ Ax* ∈ ∂h*(s*)):
0 ∈ ∇f(x*) + ∂g(x*) + A^⊤s*
0 ∈ ∂h*(s*) − Ax*
All primal-dual algorithms try to find such a pair (x*, s*).
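As a standard worked example (not from the slides): for h = ‖·‖₁ the conjugate is the indicator of the unit ℓ∞ ball, so the dual resolvent used by the algorithms below is a simple projection:

```latex
h(t) = \|t\|_1
\;\Longrightarrow\;
h^*(s) = \max_t\, \langle s, t\rangle - \|t\|_1
       = \iota_{\{\|s\|_\infty \le 1\}}(s),
\qquad
\Bigl(I + \tfrac{\lambda}{\gamma}\,\partial h^*\Bigr)^{-1}(s)
       = \operatorname{proj}_{[-1,1]^m}(s).
```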
SLIDES 13–15 existing algorithms: Condat-Vu, AFBA, and PDFP
Condat-Vu (Condat '13, Vu '13):
- Convergence condition: λ‖AA^⊤‖ + γ/(2β) ≤ 1
- Per-iteration computations: A, A^⊤, ∇f, one (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1} ²
AFBA (Latafat-Patrinos '16):
- Convergence condition: λ‖AA^⊤‖/2 + … + γ/(2β) ≤ 1
- Per-iteration computations: A, A^⊤, ∇f, one (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1}
PDFP (Chen-Huang-Zhang '16):
- Convergence conditions: λ‖AA^⊤‖ < 1; γ/(2β) < 1
- Per-iteration computations: A, A^⊤, ∇f, two (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1}
² (I + γ∂g)^{-1}(x̃) = argmin_x γg(x) + ½‖x − x̃‖². This is a backward step (or implicit step) because (I + γ∂g)^{-1}(x̃) ∈ x̃ − γ∂g((I + γ∂g)^{-1}(x̃)).
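A minimal numpy illustration of the backward step in footnote 2 for the common case g = μ‖·‖₁, whose resolvent has the closed-form soft-thresholding solution; the function name is mine:

```python
import numpy as np

def prox_l1(x_tilde, gamma, mu=1.0):
    """Backward step (I + gamma*mu*d||.||_1)^{-1}(x_tilde):
    argmin_x gamma*mu*||x||_1 + 0.5*||x - x_tilde||^2,
    solved componentwise by soft-thresholding."""
    return np.sign(x_tilde) * np.maximum(np.abs(x_tilde) - gamma * mu, 0.0)

# The defining inclusion x = x_tilde - gamma*u with u in mu*d||x||_1
# can be checked componentwise: wherever x != 0, u = mu*sign(x) exactly.
x = prox_l1(np.array([3.0, -0.5, 1.2]), gamma=1.0)
```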
SLIDES 16–20 PDHG (Zhu-Chan '08)
When f = 0, the optimality condition is

    0 ∈ [ ∂g   A^⊤ ] [ x* ]
        [ −A   ∂h* ] [ s* ]

It is equivalent to

    [ (1/γ)I + ∂g        0         ] [ x* ]     [ (1/γ)I   −A^⊤   ] [ x* ]
    [     −A       (γ/λ)I + ∂h*    ] [ s* ]  ∋  [   0      (γ/λ)I ] [ s* ]

- Primal-dual hybrid gradient (PDHG):
x+ = (I + γ∂g)^{-1}(x − γA^⊤s)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)Ax+)
- Chambolle-Pock (Chambolle et al. '09, Esser-Zhang-Chan '10):
x+ = (I + γ∂g)^{-1}(x − γA^⊤s)
x̄+ = 2x+ − x
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄+)
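A compact numpy sketch of the Chambolle-Pock iteration just described, written for generic prox callables; the function and parameter names are illustrative assumptions layered on the slides' update formulas:

```python
import numpy as np

def chambolle_pock(A, prox_g, prox_hconj, x0, s0, gamma, lam, iters=500):
    """Chambolle-Pock for min_x g(x) + h(Ax) (the f = 0 case).

    prox_g(v, gamma)    computes (I + gamma*dg)^{-1}(v).
    prox_hconj(v, tau)  computes (I + tau*dh*)^{-1}(v), with tau = lam/gamma.
    Convergence needs lam * ||A A^T|| <= 1.
    """
    x, s = x0.copy(), s0.copy()
    for _ in range(iters):
        x_new = prox_g(x - gamma * (A.T @ s), gamma)
        x_bar = 2 * x_new - x                      # extrapolation step
        s = prox_hconj(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        x = x_new
    return x, s
```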
SLIDES 21–24 Chambolle-Pock '11 as proximal point
Chambolle-Pock (x − s order):
x+ = (I + γ∂g)^{-1}(x − γA^⊤s)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A(2x+ − x))
CP is equivalent to the backward operator applied to the KKT system:

    ( [ (1/γ)I   −A^⊤   ]   [ ∂g   A^⊤ ] ) [ x+ ]     [ (1/γ)I   −A^⊤   ] [ x ]
    ( [  −A      (γ/λ)I ] + [ −A   ∂h* ] ) [ s+ ]  ∋  [  −A      (γ/λ)I ] [ s ]

- CP is 1/2-averaged under the metric induced by the matrix if λ satisfies the condition λ‖AA^⊤‖ ≤ 1.
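Writing out the two rows of this inclusion is a quick check of the equivalence, spelled out here since the slides skip it: in the first row the −A^⊤s+ and +A^⊤s+ terms cancel, and in the second row moving the two −Ax+ terms to the right combines with −Ax to give A(2x+ − x):

```latex
\tfrac{1}{\gamma}x^+ + \partial g(x^+) \ni \tfrac{1}{\gamma}x - A^\top s
\;\Longleftrightarrow\;
x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s),
\\[4pt]
\tfrac{\gamma}{\lambda}s^+ + \partial h^*(s^+) \ni \tfrac{\gamma}{\lambda}s + A(2x^+ - x)
\;\Longleftrightarrow\;
s^+ = \bigl(I + \tfrac{\lambda}{\gamma}\partial h^*\bigr)^{-1}
      \bigl(s + \tfrac{\lambda}{\gamma}A(2x^+ - x)\bigr).
```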
SLIDES 25–27 Condat-Vu (Condat '13, Vu '13)
The optimality condition:

    0 ∈ [ ∇f + ∂g   A^⊤ ] [ x* ]
        [   −A      ∂h* ] [ s* ]

CV is equivalent to forward-backward applied to the KKT system:

    ( [ (1/γ)I   −A^⊤   ]   [ ∂g   A^⊤ ] ) [ x+ ]     [ (1/γ)I   −A^⊤   ] [ x ]     [ ∇f(x) ]
    ( [  −A      (γ/λ)I ] + [ −A   ∂h* ] ) [ s+ ]  ∋  [  −A      (γ/λ)I ] [ s ]  −  [   0   ]

x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A(2x+ − x))
It is equivalent to (by changing the update order)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
x̄+ = 2x+ − x
- CV is nonexpansive (forward-backward) under the metric induced by the matrix if γ and λ satisfy the condition λ‖AA^⊤‖ + γ/(2β) ≤ 1.
SLIDES 28–30 PDFP2O/PAPC (Loris-Verhoeven '11, Chen-Huang-Zhang '13, Drori-Sabach-Teboulle '15)
When g = 0, the optimality condition becomes:

    0 ∈ [ ∇f   A^⊤ ] [ x* ]
        [ −A   ∂h* ] [ s* ]

PAPC is equivalent to forward-backward applied to the KKT system:

    [ (1/γ)I          A^⊤             ] [ x+ ]     [ (1/γ)I         0           ] [ x ]     [ ∇f(x) ]
    [  −A      (γ/λ)I − γAA^⊤ + ∂h*   ] [ s+ ]  ∋  [   0      (γ/λ)I − γAA^⊤    ] [ s ]  −  [   0   ]

equivalently,

    [ (1/γ)I       A^⊤       ] [ x+ ]     [ (1/γ)I        0          ] [ x ]     [  ∇f(x)  ]
    [   0     (γ/λ)I + ∂h*   ] [ s+ ]  ∋  [   A     (γ/λ)I − γAA^⊤   ] [ s ]  −  [ γA∇f(x) ]

- PAPC is nonexpansive (forward-backward) under the metric induced by the matrix if γ and λ satisfy the conditions λ‖AA^⊤‖ ≤ 1 and γ/(2β) ≤ 1.
SLIDES 31–33 PAPC
PAPC can be expressed as
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(x − γ∇f(x)))
x+ = x − γ∇f(x) − γA^⊤s+
It is equivalent to
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x − γ∇f(x) − γA^⊤s+
x̄+ = x+ − γ∇f(x+) − γA^⊤s+
- PAPC is α-averaged under the metric induced by the matrix.
- PAPC converges if γ and λ satisfy the conditions λ‖AA^⊤‖ < 4/3 and γ/(2β) < 1 (Li-Yan '17).
SLIDES 34–35 PDFP (Chen-Huang-Zhang '16)
Rewrite PDFP2O as
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x − γ∇f(x) − γA^⊤s+
x̄+ = x+ − γ∇f(x+) − γA^⊤s+
PDFP, as a generalization of PDFP2O, is
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
x̄+ = (I + γ∂g)^{-1}(x+ − γ∇f(x+) − γA^⊤s+)
- When g is an indicator function, PDFP reduces to the Preconditioned Alternating Projection Algorithm (PAPA) (Krol-Li-Shen-Xu '12).
SLIDES 36–37 AFBA (Latafat-Patrinos '16)
Rewrite PAPC as
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x̄ − γA^⊤(s+ − s)
x̄+ = x+ − γ∇f(x+) − γA^⊤s+
AFBA, as a generalization of PAPC, is
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x̄ − γA^⊤(s+ − s)
x̄+ = (I + γ∂g)^{-1}(x+ − γ∇f(x+) − γA^⊤s+)
Convergence condition: λ‖AA^⊤‖/2 + … + γ/(2β) ≤ 1
SLIDES 38–41 Chambolle-Pock and PAPC
Chambolle-Pock:
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γA^⊤s+)
x̄+ = 2x+ − x
PAPC:
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x − γ∇f(x) − γA^⊤s+
x̄+ = 2x+ − x − γ∇f(x+) + γ∇f(x)
PD3O (Yan '16):
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
x̄+ = 2x+ − x − γ∇f(x+) + γ∇f(x)
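A minimal numpy sketch of the PD3O updates above with generic prox callables; the names and the fixed iteration count are illustrative, and the stepsize conditions quoted in the docstring come from the summary later in the talk:

```python
import numpy as np

def pd3o(grad_f, prox_g, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    """PD3O for min_x f(x) + g(x) + h(Ax).

    Stepsize conditions: lam*||A A^T|| <= 1 and gamma/(2*beta) < 1.
    prox_g(v, gamma)    computes (I + gamma*dg)^{-1}(v).
    prox_hconj(v, tau)  computes (I + tau*dh*)^{-1}(v), tau = lam/gamma.
    """
    x, s = x0.copy(), s0.copy()
    x_bar = x.copy()
    for _ in range(iters):
        s = prox_hconj(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        gfx = grad_f(x)
        x_new = prox_g(x - gamma * gfx - gamma * (A.T @ s), gamma)
        # x_bar^+ = 2x^+ - x - gamma*grad_f(x^+) + gamma*grad_f(x)
        x_bar = 2 * x_new - x - gamma * grad_f(x_new) + gamma * gfx
        x = x_new
    return x, s
```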
SLIDES 42–43 Chambolle-Pock
Chambolle-Pock (x − s order):
z = x − γA^⊤s
x+ = (I + γ∂g)^{-1}(z)
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x+ − z))
z+ = z + (2x+ − z − γA^⊤s+) − x+
(Diagram: the iterates move (z, s) → (x+, s) → (2x+ − z, s) → (2x+ − z − γA^⊤s+, s+) → (z+, s+).)
SLIDES 44–45 C-P and PAPC
PAPC:
x+ = z = x − γ∇f(x) − γA^⊤s
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x+ − z − γ∇f(x+)))
z+ = z + (2x+ − z − γ∇f(x+) − γA^⊤s+) − x+
(Diagram: the iterates move (z, s) → (x+, s) → (2x+ − z, s) → (2x+ − z − γ∇f(x+), s) → (2x+ − z − γ∇f(x+) − γA^⊤s+, s+) → (z+, s+).)
SLIDES 46–47 PD3O
PD3O:
z = x − γ∇f(x) − γA^⊤s
x+ = (I + γ∂g)^{-1}(z)
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x+ − z − γ∇f(x+)))
z+ = z + (2x+ − z − γ∇f(x+) − γA^⊤s+) − x+
(Diagram: same iterate path as for PAPC, now with x+ = (I + γ∂g)^{-1}(z).)
SLIDES 48–49 PD3O vs Condat-Vu vs AFBA vs PDFP
Algorithms (shared updates):
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
followed by one of:
- PDFP: x̄+ = (I + γ∂g)^{-1}(x+ − γ∇f(x+) − γA^⊤s+)
- Condat-Vu: x̄+ = 2x+ − x
- PD3O: x̄+ = 2x+ − x + γ∇f(x) − γ∇f(x+)
Parameters and special cases:
- PDFP: λ‖AA^⊤‖ < 1; γ/(2β) < 1. Reduces to PAPC when g = 0.
- Condat-Vu: λ‖AA^⊤‖ + γ/(2β) ≤ 1. Reduces to C-P when f = 0.
- AFBA: λ‖AA^⊤‖/2 + … Reduces to PAPC when g = 0.
- PD3O: λ‖AA^⊤‖ ≤ 1; γ/(2β) < 1. Reduces to C-P when f = 0 and to PAPC when g = 0.
The comparison is easy to express in code; see the sketch below.
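Since the methods share the s- and x-updates, only the x̄-update changes; a sketch of the three variants, with x_new for x+, s_new for s+, gfx = grad_f(x), and prox_g, grad_f, A as in the PD3O sketch above (all names illustrative):

```python
def xbar_condat_vu(x, x_new):
    # Condat-Vu: plain extrapolation.
    return 2 * x_new - x

def xbar_pd3o(x, x_new, gamma, gfx, grad_f):
    # PD3O: extrapolation corrected by the gradient difference.
    return 2 * x_new - x + gamma * (gfx - grad_f(x_new))

def xbar_pdfp(x_new, s_new, gamma, prox_g, grad_f, A):
    # PDFP: a second backward step instead of extrapolation.
    return prox_g(x_new - gamma * grad_f(x_new) - gamma * (A.T @ s_new), gamma)
```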
SLIDES 50–53 convergence results: summary
Let z = x − γ∇f(x) − γA^⊤s and relabel x+ as x; the iteration becomes:
x = (I + γ∂g)^{-1}(z)
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x − z − γ∇f(x)))
z+ = x − γ∇f(x) − γA^⊤s+
- ‖T(z^k, s^k) − (z^k, s^k)‖²_M = o(1/(k+1)), and (z^k, s^k) weakly converges to a fixed point (z*, s*).
- Let L(x, s) = f(x) + g(x) + ⟨Ax, s⟩ − h*(s); then
  L(x̄^k, s) − L(x, s̄^{k+1}) ≤ ‖(z^1, s^1) − (z, s)‖² / k,
  where (x̄^k, s̄^{k+1}) = (1/k) Σ_{i=1}^k (x^i, s^{i+1}) and z = x − γ∇f(x) − γA^⊤s.
- Linear convergence holds with additional assumptions on f, g, and h.
SLIDE 54 convergence analysis: the general case
Let M = (γ/λ)(I − λAA^⊤) be positive definite. Define
⟨s, s̃⟩_M = ⟨s, M s̃⟩,  ‖s‖²_M = ⟨s, Ms⟩,  and  ‖(z, s)‖²_M = ‖z‖² + ‖s‖²_M.
Lemma
The iteration T mapping (z, s) to (z+, s+) is a nonexpansive operator under the metric defined by M if γ ≤ 2β. Furthermore, it is α-averaged with α = 2β/(4β − γ).
- Chambolle-Pock is firmly nonexpansive under this new metric, which is different from the previous metric.
SLIDE 55 convergence analysis: the general case
Theorem
1) Let (z*, s*) be any fixed point of T. Then (‖(z^k, s^k) − (z*, s*)‖_M)_{k≥0} is monotonically nonincreasing.
2) The sequence (‖T(z^k, s^k) − (z^k, s^k)‖_M)_{k≥0} is monotonically nonincreasing and converges to 0.
3) We have the convergence rate ‖T(z^k, s^k) − (z^k, s^k)‖²_M = o(1/(k+1)).
4) (z^k, s^k) weakly converges to a fixed point of T, and if X has finite dimension (e.g., R^m), the convergence is strong.
SLIDE 56 convergence analysis: linear convergence
Denote (with z̃ = 2x − z − γ∇f(x) from the s-update):
u_h = (γ/λ)(I − λAA^⊤)s + A z̃ − (γ/λ)s+ ∈ ∂h*(s+),
u_g = (1/γ)(z − x) ∈ ∂g(x),
u*_h = A(z̃* − γA^⊤s*) = Ax* ∈ ∂h*(s*),
u*_g = (1/γ)(z* − x*) ∈ ∂g(x*),
and assume
‖∇g(x) − ∇g(y)‖ ≤ L_g ‖x − y‖,
⟨s+ − s*, u_h − u*_h⟩ ≥ τ_h ‖s+ − s*‖²_M,
⟨x − x*, u_g − u*_g⟩ ≥ τ_g ‖x − x*‖²,
⟨x − x*, ∇f(x) − ∇f(x*)⟩ ≥ τ_f ‖x − x*‖².
SLIDE 57 convergence analysis: linear convergence
Theorem
We have
‖z+ − z*‖² + (1 + 2γτ_h)‖s+ − s*‖²_M ≤ ρ (‖z − z*‖² + (1 + 2γτ_h)‖s − s*‖²_M),
with ρ = max{ 1/(1 + 2γτ_h), 1 − … }.
When, in addition, γ < 2β, τ_h > 0, and τ_f + τ_g > 0, we have ρ < 1 and the algorithm converges linearly.
SLIDE 58 numerical experiment: fused lasso
minimize_x ½‖Ax − b‖₂² + μ₁‖x‖₁ + μ₂ Σ_{i=1}^{p−1} |x_{i+1} − x_i|
- x = (x₁, …, x_p) ∈ R^p, A ∈ R^{n×p}, b ∈ R^n
Figure: The true sparse signal and the reconstructed results using PD3O, PDFP, and Condat-Vu. The right figure is a zoom-in of the signal in [3000, 5500].
SLIDE 59 numerical experiment: fused lasso
Figure: Relative objective error (f − f*)/f* versus iteration for PD3O, PDFP, and Condat-Vu. In the left figure, we fix λ = 1/8 and let γ = β, 1.5β, 1.9β. In the right figure, we fix γ = 1.9β and let λ = 1/80, 1/8, 1/4.
SLIDES 60–62 applications: decentralized optimization
minimize_x Σ_{i=1}^n (f_i(x) + g_i(x))
- f_i and g_i are known at node i only.
- Nodes 1, …, n are connected in an undirected graph.
- f_i is differentiable with a Lipschitz continuous gradient.
Introduce a copy x_i at node i:
minimize_x f(x) + g(x) := Σ_{i=1}^n (f_i(x_i) + g_i(x_i))  s.t.  Wx = x
- x_i ∈ R^p, x = [x₁ x₂ ⋯ x_n]^⊤ ∈ R^{n×p}.
- W is a symmetric doubly stochastic mixing matrix.
The sum of three functions:
minimize_x f(x) + g(x) + ι₀((I − W)^{1/2}x)
SLIDES 63–64 comparing PG-EXTRA and NIDS (Li-Shi-Yan '17)
minimize_x ½ Σ_{i=1}^n ‖A_i x_i − b_i‖² + μ₁ Σ_{i=1}^n ‖x_i‖₁ + ι₀((I − W)^{1/2}x)
Figure: Convergence versus number of iterations for NIDS with stepsizes 1/L, 1.5/L, 1.9/L and PG-EXTRA with stepsizes 1/L, 1.2/L, 1.3/L, 1.4/L.
SLIDE 65 conclusion
- A new primal-dual algorithm for minimizing the sum of three functions.
- A new interpretation of Chambolle-Pock: Douglas-Rachford splitting on the KKT system under a new metric induced by a block diagonal matrix.
- PAPC is forward-backward splitting applied to the KKT system under the same metric; we proved the optimal bound for the parameters (dual stepsize).
- PD3O is a generalization of both Chambolle-Pock and PAPC, and it has the advantages of Condat-Vu (a generalization of Chambolle-Pock) and of AFBA and PDFP (two generalizations of PAPC).
- In decentralized consensus optimization, we derive a fast method whose stepsize does not depend on the network structure; we provide an optimal bound for the stepsize in PG-EXTRA (Shi et al. '15).
SLIDE 66 Thank You!
Paper 1: M. Yan, A new primal-dual method for minimizing the sum of three functions with a linear operator, arXiv:1611.09805. Code: https://github.com/mingyan08/PD3O
Paper 2: Z. Li, W. Shi, and M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, arXiv:1704.07807. Code: https://github.com/mingyan08/NIDS
Paper 3: Z. Li and M. Yan, A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization, arXiv:1711.06785.