SLIDE 1 Primal-dual algorithms for the sum of two and three functions¹
Ming Yan, Michigan State University, CMSE/Mathematics
¹This work is partially supported by NSF.
SLIDE 2 optimization problems for primal-dual algorithms
minimize_x f(x) + g(x) + h(Ax)
- f, g, and h are convex.
- X and Y are two Hilbert spaces (e.g., R^n and R^m).
- f : X → R is differentiable with a 1/β-Lipschitz continuous gradient for some β ∈ (0, +∞).
- A : X → Y is a bounded linear operator.
SLIDES 3–4 applications: statistics
Elastic net regularization (Zou-Hastie '05):
minimize_x μ₂‖x‖₂² + μ₁‖x‖₁ + l(Ax, b),
where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and l is the loss function, which may be nondifferentiable.
Fused lasso (Tibshirani et al. '05):
minimize_x ½‖Ax − b‖₂² + μ₁‖x‖₁ + μ₂‖Dx‖₁,
where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and D ∈ R^{(p−1)×p} is the difference matrix

    D = [ −1   1              ]
        [      −1   1         ]
        [        ⋱    ⋱       ]
        [           −1   1    ]
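A minimal numpy sketch of the fused-lasso pieces above, the difference matrix D and the objective; the function names and the dense construction of D are illustrative choices, not from the slides:

```python
import numpy as np

def difference_matrix(p):
    """The (p-1) x p first-difference matrix D with rows (-1, 1)."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def fused_lasso_objective(x, A, b, mu1, mu2):
    """(1/2)*||Ax - b||_2^2 + mu1*||x||_1 + mu2*||Dx||_1."""
    D = difference_matrix(x.size)
    return (0.5 * np.sum((A @ x - b) ** 2)
            + mu1 * np.sum(np.abs(x))
            + mu2 * np.sum(np.abs(D @ x)))
```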
SLIDES 5–7 applications: decentralized optimization
minimize_x Σ_{i=1}^n (f_i(x) + g_i(x))
- f_i and g_i are known at node i only.
- Nodes 1, …, n are connected in an undirected graph.
- f_i is differentiable with a Lipschitz continuous gradient.
Introduce a copy x_i at node i:
minimize_x f(x) + g(x) := Σ_{i=1}^n (f_i(x_i) + g_i(x_i))  s.t.  Wx = x
- x_i ∈ R^p, x = [x₁ x₂ ⋯ x_n]^⊤ ∈ R^{n×p}.
- W is a symmetric doubly stochastic mixing matrix.
The sum of three functions (Wx = x ⟺ (I − W)^{1/2}x = 0, since I − W is positive semidefinite):
minimize_x f(x) + g(x) + ι₀((I − W)^{1/2}x)
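The slides do not prescribe a particular construction of W; as one standard illustrative choice, a short numpy sketch of Metropolis-Hastings weights, which always give a symmetric doubly stochastic W for an undirected graph:

```python
import numpy as np

def metropolis_mixing_matrix(edges, n):
    """Symmetric doubly stochastic W from an undirected edge list.

    Metropolis-Hastings weights: W_ij = 1/(1 + max(deg_i, deg_j)) for
    each edge (i, j); W_ii = 1 minus the sum of row i's off-diagonals.
    """
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        w = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, j] = W[j, i] = w
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)
    return W

# Example: a path graph on 4 nodes; rows and columns each sum to one.
W = metropolis_mixing_matrix([(0, 1), (1, 2), (2, 3)], 4)
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W, W.T)
```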
SLIDES 8–9 applications: imaging
Image restoration with two regularizations:
minimize_x ½‖Ax − b‖₂² + ι_C(x) + μ‖Dx‖₁,
where x ∈ R^n is the image to be reconstructed, A ∈ R^{m×n} is the forward projection matrix, b ∈ R^m is the measured data with noise, D is a discrete gradient operator, and ι_C is the indicator function that returns zero if x ∈ C (here, C is the set of nonnegative vectors in R^n) and +∞ otherwise.
Other problems:
- f: data fitting term (infimal convolution for mixed noise)
- h ∘ A: total variation; other transforms
- g: nonnegativity; box constraint
SLIDES 10–12 primal-dual formulation
Original problem:
minimize_x f(x) + g(x) + h(Ax)
Introduce a dual variable s:
minimize_x max_s f(x) + g(x) + ⟨Ax, s⟩ − h*(s)
Here h* is the conjugate function of h, defined as h*(s) = max_t ⟨s, t⟩ − h(t).
The saddle-point problem is equivalent to the KKT system (using s* ∈ ∂h(Ax*) ⟺ Ax* ∈ ∂h*(s*)):
0 ∈ ∇f(x*) + ∂g(x*) + A^⊤s*
0 ∈ ∂h*(s*) − Ax*
All primal-dual algorithms try to find such a pair (x*, s*).
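As a standard worked example (not from the slides): for h = ‖·‖₁ the conjugate is the indicator of the unit ℓ∞ ball, so the dual resolvent used by the algorithms below is a simple projection:

```latex
h(t) = \|t\|_1
\;\Longrightarrow\;
h^*(s) = \max_t\, \langle s, t\rangle - \|t\|_1
       = \iota_{\{\|s\|_\infty \le 1\}}(s),
\qquad
\Bigl(I + \tfrac{\lambda}{\gamma}\,\partial h^*\Bigr)^{-1}(s)
       = \operatorname{proj}_{[-1,1]^m}(s).
```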
SLIDES 13–15 existing algorithms: Condat-Vu, AFBA, and PDFP
Condat-Vu (Condat '13, Vu '13):
- Convergence condition: λ‖AA^⊤‖ + γ/(2β) ≤ 1
- Per-iteration computations: A, A^⊤, ∇f, one (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1} ²
AFBA (Latafat-Patrinos '16):
- Convergence condition: λ‖AA^⊤‖/2 + … + γ/(2β) ≤ 1
- Per-iteration computations: A, A^⊤, ∇f, one (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1}
PDFP (Chen-Huang-Zhang '16):
- Convergence conditions: λ‖AA^⊤‖ < 1; γ/(2β) < 1
- Per-iteration computations: A, A^⊤, ∇f, two (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1}
² (I + γ∂g)^{-1}(x̃) = argmin_x γg(x) + ½‖x − x̃‖². This is a backward step (or implicit step) because (I + γ∂g)^{-1}(x̃) ∈ x̃ − γ∂g((I + γ∂g)^{-1}(x̃)).
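A minimal numpy illustration of the backward step in footnote 2 for the common case g = μ‖·‖₁, whose resolvent has the closed-form soft-thresholding solution; the function name is mine:

```python
import numpy as np

def prox_l1(x_tilde, gamma, mu=1.0):
    """Backward step (I + gamma*mu*d||.||_1)^{-1}(x_tilde):
    argmin_x gamma*mu*||x||_1 + 0.5*||x - x_tilde||^2,
    solved componentwise by soft-thresholding."""
    return np.sign(x_tilde) * np.maximum(np.abs(x_tilde) - gamma * mu, 0.0)

# The defining inclusion x = x_tilde - gamma*u with u in mu*d||x||_1
# can be checked componentwise: wherever x != 0, u = mu*sign(x) exactly.
x = prox_l1(np.array([3.0, -0.5, 1.2]), gamma=1.0)
```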
SLIDES 16–20 PDHG (Zhu-Chan '08)
When f = 0, the optimality condition is

    0 ∈ [ ∂g   A^⊤ ] [ x* ]
        [ −A   ∂h* ] [ s* ]

It is equivalent to

    [ (1/γ)I + ∂g        0         ] [ x* ]     [ (1/γ)I   −A^⊤   ] [ x* ]
    [     −A       (γ/λ)I + ∂h*    ] [ s* ]  ∋  [   0      (γ/λ)I ] [ s* ]

- Primal-dual hybrid gradient (PDHG):
x+ = (I + γ∂g)^{-1}(x − γA^⊤s)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)Ax+)
- Chambolle-Pock (Chambolle et al. '09, Esser-Zhang-Chan '10):
x+ = (I + γ∂g)^{-1}(x − γA^⊤s)
x̄+ = 2x+ − x
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄+)
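A compact numpy sketch of the Chambolle-Pock iteration just described, written for generic prox callables; the function and parameter names are illustrative assumptions layered on the slides' update formulas:

```python
import numpy as np

def chambolle_pock(A, prox_g, prox_hconj, x0, s0, gamma, lam, iters=500):
    """Chambolle-Pock for min_x g(x) + h(Ax) (the f = 0 case).

    prox_g(v, gamma)    computes (I + gamma*dg)^{-1}(v).
    prox_hconj(v, tau)  computes (I + tau*dh*)^{-1}(v), with tau = lam/gamma.
    Convergence needs lam * ||A A^T|| <= 1.
    """
    x, s = x0.copy(), s0.copy()
    for _ in range(iters):
        x_new = prox_g(x - gamma * (A.T @ s), gamma)
        x_bar = 2 * x_new - x                      # extrapolation step
        s = prox_hconj(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        x = x_new
    return x, s
```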
SLIDES 21–24 Chambolle-Pock '11 as proximal point
Chambolle-Pock (x − s order):
x+ = (I + γ∂g)^{-1}(x − γA^⊤s)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A(2x+ − x))
CP is equivalent to the backward operator applied to the KKT system:

    ( [ (1/γ)I   −A^⊤   ]   [ ∂g   A^⊤ ] ) [ x+ ]     [ (1/γ)I   −A^⊤   ] [ x ]
    ( [  −A      (γ/λ)I ] + [ −A   ∂h* ] ) [ s+ ]  ∋  [  −A      (γ/λ)I ] [ s ]

- CP is 1/2-averaged under the metric induced by the matrix if λ satisfies the condition λ‖AA^⊤‖ ≤ 1.
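Writing out the two rows of this inclusion is a quick check of the equivalence, spelled out here since the slides skip it: in the first row the −A^⊤s+ and +A^⊤s+ terms cancel, and in the second row moving the two −Ax+ terms to the right combines with −Ax to give A(2x+ − x):

```latex
\tfrac{1}{\gamma}x^+ + \partial g(x^+) \ni \tfrac{1}{\gamma}x - A^\top s
\;\Longleftrightarrow\;
x^+ = (I + \gamma\partial g)^{-1}(x - \gamma A^\top s),
\\[4pt]
\tfrac{\gamma}{\lambda}s^+ + \partial h^*(s^+) \ni \tfrac{\gamma}{\lambda}s + A(2x^+ - x)
\;\Longleftrightarrow\;
s^+ = \bigl(I + \tfrac{\lambda}{\gamma}\partial h^*\bigr)^{-1}
      \bigl(s + \tfrac{\lambda}{\gamma}A(2x^+ - x)\bigr).
```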
SLIDES 25–27 Condat-Vu (Condat '13, Vu '13)
The optimality condition:

    0 ∈ [ ∇f + ∂g   A^⊤ ] [ x* ]
        [   −A      ∂h* ] [ s* ]

CV is equivalent to forward-backward applied to the KKT system:

    ( [ (1/γ)I   −A^⊤   ]   [ ∂g   A^⊤ ] ) [ x+ ]     [ (1/γ)I   −A^⊤   ] [ x ]     [ ∇f(x) ]
    ( [  −A      (γ/λ)I ] + [ −A   ∂h* ] ) [ s+ ]  ∋  [  −A      (γ/λ)I ] [ s ]  −  [   0   ]

x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A(2x+ − x))
It is equivalent to (by changing the update order)
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
x̄+ = 2x+ − x
- CV is nonexpansive (forward-backward) under the metric induced by the matrix if γ and λ satisfy the condition λ‖AA^⊤‖ + γ/(2β) ≤ 1.
SLIDES 28–30 PDFP2O/PAPC (Loris-Verhoeven '11, Chen-Huang-Zhang '13, Drori-Sabach-Teboulle '15)
When g = 0, the optimality condition becomes:

    0 ∈ [ ∇f   A^⊤ ] [ x* ]
        [ −A   ∂h* ] [ s* ]

PAPC is equivalent to forward-backward applied to the KKT system:

    [ (1/γ)I          A^⊤             ] [ x+ ]     [ (1/γ)I         0           ] [ x ]     [ ∇f(x) ]
    [  −A      (γ/λ)I − γAA^⊤ + ∂h*   ] [ s+ ]  ∋  [   0      (γ/λ)I − γAA^⊤    ] [ s ]  −  [   0   ]

equivalently,

    [ (1/γ)I       A^⊤       ] [ x+ ]     [ (1/γ)I        0          ] [ x ]     [  ∇f(x)  ]
    [   0     (γ/λ)I + ∂h*   ] [ s+ ]  ∋  [   A     (γ/λ)I − γAA^⊤   ] [ s ]  −  [ γA∇f(x) ]

- PAPC is nonexpansive (forward-backward) under the metric induced by the matrix if γ and λ satisfy the conditions λ‖AA^⊤‖ ≤ 1 and γ/(2β) ≤ 1.
SLIDES 31–33 PAPC
PAPC can be expressed as
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(x − γ∇f(x)))
x+ = x − γ∇f(x) − γA^⊤s+
It is equivalent to
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x − γ∇f(x) − γA^⊤s+
x̄+ = x+ − γ∇f(x+) − γA^⊤s+
- PAPC is α-averaged under the metric induced by the matrix.
- PAPC converges if γ and λ satisfy the conditions λ‖AA^⊤‖ < 4/3 and γ/(2β) < 1 (Li-Yan '17).
SLIDES 34–35 PDFP (Chen-Huang-Zhang '16)
Rewrite PDFP2O as
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x − γ∇f(x) − γA^⊤s+
x̄+ = x+ − γ∇f(x+) − γA^⊤s+
PDFP, as a generalization of PDFP2O, is
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
x̄+ = (I + γ∂g)^{-1}(x+ − γ∇f(x+) − γA^⊤s+)
- When g is an indicator function, PDFP reduces to the Preconditioned Alternating Projection Algorithm (PAPA) (Krol-Li-Shen-Xu '12).
SLIDES 36–37 AFBA (Latafat-Patrinos '16)
Rewrite PAPC as
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x̄ − γA^⊤(s+ − s)
x̄+ = x+ − γ∇f(x+) − γA^⊤s+
AFBA, as a generalization of PAPC, is
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x̄ − γA^⊤(s+ − s)
x̄+ = (I + γ∂g)^{-1}(x+ − γ∇f(x+) − γA^⊤s+)
Convergence condition: λ‖AA^⊤‖/2 + … + γ/(2β) ≤ 1
SLIDES 38–41 Chambolle-Pock and PAPC
Chambolle-Pock:
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γA^⊤s+)
x̄+ = 2x+ − x
PAPC:
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = x − γ∇f(x) − γA^⊤s+
x̄+ = 2x+ − x − γ∇f(x+) + γ∇f(x)
PD3O (Yan '16):
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
x̄+ = 2x+ − x − γ∇f(x+) + γ∇f(x)
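A minimal numpy sketch of the PD3O updates above with generic prox callables; the names and the fixed iteration count are illustrative, and the stepsize conditions quoted in the docstring come from the summary later in the talk:

```python
import numpy as np

def pd3o(grad_f, prox_g, prox_hconj, A, x0, s0, gamma, lam, iters=1000):
    """PD3O for min_x f(x) + g(x) + h(Ax).

    Stepsize conditions: lam*||A A^T|| <= 1 and gamma/(2*beta) < 1.
    prox_g(v, gamma)    computes (I + gamma*dg)^{-1}(v).
    prox_hconj(v, tau)  computes (I + tau*dh*)^{-1}(v), tau = lam/gamma.
    """
    x, s = x0.copy(), s0.copy()
    x_bar = x.copy()
    for _ in range(iters):
        s = prox_hconj(s + (lam / gamma) * (A @ x_bar), lam / gamma)
        gfx = grad_f(x)
        x_new = prox_g(x - gamma * gfx - gamma * (A.T @ s), gamma)
        # x_bar^+ = 2x^+ - x - gamma*grad_f(x^+) + gamma*grad_f(x)
        x_bar = 2 * x_new - x - gamma * grad_f(x_new) + gamma * gfx
        x = x_new
    return x, s
```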
SLIDES 42–43 Chambolle-Pock
Chambolle-Pock (x − s order):
z = x − γA^⊤s
x+ = (I + γ∂g)^{-1}(z)
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x+ − z))
z+ = z + (2x+ − z − γA^⊤s+) − x+
(Diagram: the iterates move (z, s) → (x+, s) → (2x+ − z, s) → (2x+ − z − γA^⊤s+, s+) → (z+, s+).)
SLIDES 44–45 C-P and PAPC
PAPC:
x+ = z = x − γ∇f(x) − γA^⊤s
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x+ − z − γ∇f(x+)))
z+ = z + (2x+ − z − γ∇f(x+) − γA^⊤s+) − x+
(Diagram: the iterates move (z, s) → (x+, s) → (2x+ − z, s) → (2x+ − z − γ∇f(x+), s) → (2x+ − z − γ∇f(x+) − γA^⊤s+, s+) → (z+, s+).)
SLIDES 46–47 PD3O
PD3O:
z = x − γ∇f(x) − γA^⊤s
x+ = (I + γ∂g)^{-1}(z)
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x+ − z − γ∇f(x+)))
z+ = z + (2x+ − z − γ∇f(x+) − γA^⊤s+) − x+
(Diagram: same iterate path as for PAPC, now with x+ = (I + γ∂g)^{-1}(z).)
SLIDES 48–49 PD3O vs Condat-Vu vs AFBA vs PDFP
Algorithms (shared updates):
s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄)
x+ = (I + γ∂g)^{-1}(x − γ∇f(x) − γA^⊤s+)
followed by one of:
- PDFP: x̄+ = (I + γ∂g)^{-1}(x+ − γ∇f(x+) − γA^⊤s+)
- Condat-Vu: x̄+ = 2x+ − x
- PD3O: x̄+ = 2x+ − x + γ∇f(x) − γ∇f(x+)
Parameters and special cases:
- PDFP: λ‖AA^⊤‖ < 1; γ/(2β) < 1. Reduces to PAPC when g = 0.
- Condat-Vu: λ‖AA^⊤‖ + γ/(2β) ≤ 1. Reduces to C-P when f = 0.
- AFBA: λ‖AA^⊤‖/2 + … Reduces to PAPC when g = 0.
- PD3O: λ‖AA^⊤‖ ≤ 1; γ/(2β) < 1. Reduces to C-P when f = 0 and to PAPC when g = 0.
The comparison is easy to express in code; see the sketch below.
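Since the methods share the s- and x-updates, only the x̄-update changes; a sketch of the three variants, with x_new for x+, s_new for s+, gfx = grad_f(x), and prox_g, grad_f, A as in the PD3O sketch above (all names illustrative):

```python
def xbar_condat_vu(x, x_new):
    # Condat-Vu: plain extrapolation.
    return 2 * x_new - x

def xbar_pd3o(x, x_new, gamma, gfx, grad_f):
    # PD3O: extrapolation corrected by the gradient difference.
    return 2 * x_new - x + gamma * (gfx - grad_f(x_new))

def xbar_pdfp(x_new, s_new, gamma, prox_g, grad_f, A):
    # PDFP: a second backward step instead of extrapolation.
    return prox_g(x_new - gamma * grad_f(x_new) - gamma * (A.T @ s_new), gamma)
```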
SLIDES 50–53 convergence results: summary
Let z = x − γ∇f(x) − γA^⊤s and relabel x+ as x; the iteration becomes:
x = (I + γ∂g)^{-1}(z)
s+ = (I + (λ/γ)∂h*)^{-1}((I − λAA^⊤)s + (λ/γ)A(2x − z − γ∇f(x)))
z+ = x − γ∇f(x) − γA^⊤s+
- ‖T(z^k, s^k) − (z^k, s^k)‖²_M = o(1/(k+1)), and (z^k, s^k) weakly converges to a fixed point (z*, s*).
- Let L(x, s) = f(x) + g(x) + ⟨Ax, s⟩ − h*(s); then
  L(x̄^k, s) − L(x, s̄^{k+1}) ≤ ‖(z^1, s^1) − (z, s)‖² / k,
  where (x̄^k, s̄^{k+1}) = (1/k) Σ_{i=1}^k (x^i, s^{i+1}) and z = x − γ∇f(x) − γA^⊤s.
- Linear convergence holds with additional assumptions on f, g, and h.
SLIDE 54 convergence analysis: the general case
Let M = (γ/λ)(I − λAA^⊤) be positive definite. Define
⟨s, s̃⟩_M = ⟨s, M s̃⟩,  ‖s‖²_M = ⟨s, Ms⟩,  and  ‖(z, s)‖²_M = ‖z‖² + ‖s‖²_M.
Lemma
The iteration T mapping (z, s) to (z+, s+) is a nonexpansive operator under the metric defined by M if γ ≤ 2β. Furthermore, it is α-averaged with α = 2β/(4β − γ).
- Chambolle-Pock is firmly nonexpansive under this new metric, which is different from the previous metric.
SLIDE 55 convergence analysis: the general case
Theorem
1) Let (z*, s*) be any fixed point of T. Then (‖(z^k, s^k) − (z*, s*)‖_M)_{k≥0} is monotonically nonincreasing.
2) The sequence (‖T(z^k, s^k) − (z^k, s^k)‖_M)_{k≥0} is monotonically nonincreasing and converges to 0.
3) We have the convergence rate ‖T(z^k, s^k) − (z^k, s^k)‖²_M = o(1/(k+1)).
4) (z^k, s^k) weakly converges to a fixed point of T, and if X has finite dimension (e.g., R^m), the convergence is strong.
SLIDE 56 convergence analysis: linear convergence
Denote (with z̃ = 2x − z − γ∇f(x) from the s-update):
u_h = (γ/λ)(I − λAA^⊤)s + A z̃ − (γ/λ)s+ ∈ ∂h*(s+),
u_g = (1/γ)(z − x) ∈ ∂g(x),
u*_h = A(z̃* − γA^⊤s*) = Ax* ∈ ∂h*(s*),
u*_g = (1/γ)(z* − x*) ∈ ∂g(x*),
and assume
‖∇g(x) − ∇g(y)‖ ≤ L_g ‖x − y‖,
⟨s+ − s*, u_h − u*_h⟩ ≥ τ_h ‖s+ − s*‖²_M,
⟨x − x*, u_g − u*_g⟩ ≥ τ_g ‖x − x*‖²,
⟨x − x*, ∇f(x) − ∇f(x*)⟩ ≥ τ_f ‖x − x*‖².
SLIDE 57 convergence analysis: linear convergence
Theorem
We have
‖z+ − z*‖² + (1 + 2γτ_h)‖s+ − s*‖²_M ≤ ρ (‖z − z*‖² + (1 + 2γτ_h)‖s − s*‖²_M),
with ρ = max{ 1/(1 + 2γτ_h), 1 − … }.
When, in addition, γ < 2β, τ_h > 0, and τ_f + τ_g > 0, we have ρ < 1 and the algorithm converges linearly.
SLIDE 58 numerical experiment: fused lasso
minimize_x ½‖Ax − b‖₂² + μ₁‖x‖₁ + μ₂ Σ_{i=1}^{p−1} |x_{i+1} − x_i|
- x = (x₁, …, x_p) ∈ R^p, A ∈ R^{n×p}, b ∈ R^n
Figure: The true sparse signal and the reconstructed results using PD3O, PDFP, and Condat-Vu. The right figure is a zoom-in of the signal in [3000, 5500].
SLIDE 59 numerical experiment: fused lasso
Figure: Relative objective error (f − f*)/f* versus iteration for PD3O, PDFP, and Condat-Vu. In the left figure, we fix λ = 1/8 and let γ = β, 1.5β, 1.9β. In the right figure, we fix γ = 1.9β and let λ = 1/80, 1/8, 1/4.
SLIDES 60–62 applications: decentralized optimization
minimize_x Σ_{i=1}^n (f_i(x) + g_i(x))
- f_i and g_i are known at node i only.
- Nodes 1, …, n are connected in an undirected graph.
- f_i is differentiable with a Lipschitz continuous gradient.
Introduce a copy x_i at node i:
minimize_x f(x) + g(x) := Σ_{i=1}^n (f_i(x_i) + g_i(x_i))  s.t.  Wx = x
- x_i ∈ R^p, x = [x₁ x₂ ⋯ x_n]^⊤ ∈ R^{n×p}.
- W is a symmetric doubly stochastic mixing matrix.
The sum of three functions:
minimize_x f(x) + g(x) + ι₀((I − W)^{1/2}x)
SLIDES 63–64 comparing PG-EXTRA and NIDS (Li-Shi-Yan '17)
minimize_x ½ Σ_{i=1}^n ‖A_i x_i − b_i‖² + μ₁ Σ_{i=1}^n ‖x_i‖₁ + ι₀((I − W)^{1/2}x)
Figure: Convergence versus number of iterations for NIDS with stepsizes 1/L, 1.5/L, 1.9/L and PG-EXTRA with stepsizes 1/L, 1.2/L, 1.3/L, 1.4/L.
SLIDE 65 conclusion
- A new primal-dual algorithm for minimizing the sum of three functions.
- A new interpretation of Chambolle-Pock: Douglas-Rachford splitting on the KKT system under a new metric induced by a block diagonal matrix.
- PAPC is forward-backward splitting applied to the KKT system under the same metric; we proved the optimal bound for the parameters (dual stepsize).
- PD3O is a generalization of both Chambolle-Pock and PAPC, and it has the advantages of Condat-Vu (a generalization of Chambolle-Pock) and of AFBA and PDFP (two generalizations of PAPC).
- In decentralized consensus optimization, we derive a fast method whose stepsize does not depend on the network structure; we provide an optimal bound for the stepsize in PG-EXTRA (Shi et al. '15).
SLIDE 66 Thank You!
Paper 1: M. Yan, A new primal-dual method for minimizing the sum of three functions with a linear operator, arXiv:1611.09805. Code: https://github.com/mingyan08/PD3O
Paper 2: Z. Li, W. Shi, and M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, arXiv:1704.07807. Code: https://github.com/mingyan08/NIDS
Paper 3: Z. Li and M. Yan, A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization, arXiv:1711.06785.