SLIDE 1
Anderson Accelerated Douglas-Rachford Splitting

Anqi Fu, Junzi Zhang, Stephen Boyd
EE & ICME Departments, Stanford University
March 10, 2020
SLIDE 2
SLIDE 3
Outline
Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion
SLIDE 4
Prox-Affine Problem
Prox-affine convex optimization problem:

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} f_i(x_i) \\ \text{subject to} & \sum_{i=1}^{N} A_i x_i = b \end{array}$$

with variables $x_i \in \mathbf{R}^{n_i}$ for $i = 1, \ldots, N$
◮ $A_i \in \mathbf{R}^{m \times n_i}$ and $b \in \mathbf{R}^m$ are given data
◮ $f_i : \mathbf{R}^{n_i} \to \mathbf{R} \cup \{+\infty\}$ are closed, convex, and proper
◮ Each $f_i$ can only be accessed via its proximal operator

$$\mathbf{prox}_{tf_i}(v_i) = \operatorname*{argmin}_{x_i} \left( f_i(x_i) + \frac{1}{2t} \|x_i - v_i\|_2^2 \right),$$

where $t > 0$ is a parameter
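For instance, for $f(x) = \|x\|_1$ the proximal operator is elementwise soft-thresholding; a minimal NumPy sketch:

```python
import numpy as np

def prox_norm1(v, t):
    # Soft-thresholding: argmin_x ||x||_1 + (1/(2t)) * ||x - v||_2^2
    return np.maximum(v - t, 0) - np.maximum(-v - t, 0)
```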
SLIDE 5
Why This Formulation?
◮ Encompasses many classes of convex problems (conic programs, consensus optimization)
◮ Block separable form is ideal for distributed optimization
◮ Proximal operator can be provided as a “black box”, enabling privacy-preserving implementation
SLIDE 6
Previous Work
◮ Alternating direction method of multipliers (ADMM)
◮ Douglas-Rachford splitting (DRS)
◮ Augmented Lagrangian method (ALM)
SLIDE 7
Previous Work
◮ Alternating direction method of multipliers (ADMM)
◮ Douglas-Rachford splitting (DRS)
◮ Augmented Lagrangian method (ALM)

These methods are typically slow to converge, prompting research into acceleration techniques:
◮ Adaptive penalty parameters
◮ Momentum methods
◮ Quasi-Newton methods with line search
SLIDE 8
Our Method
◮ A2DR: Anderson acceleration (AA) applied to DRS
◮ DRS is a non-expansive fixed-point (NEFP) method that fits the prox-affine framework
◮ AA is fast and efficient, and can be applied to NEFP iterations, but is unstable without modification
◮ We introduce a type-II AA variant that converges globally in non-smooth, potentially pathological settings
SLIDE 9
Main Advantages
◮ A2DR produces primal and dual solutions, or a certificate of infeasibility/unboundedness
◮ Consistently converges faster, with no parameter tuning
◮ Memory efficient ⇒ little extra cost per iteration
◮ Scales to large problems and is easily parallelized
◮ Python implementation: https://github.com/cvxgrp/a2dr
SLIDE 10
Outline
Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion
SLIDE 11
DRS Algorithm
◮ Define $A = [A_1 \; \cdots \; A_N]$ and $x = (x_1, \ldots, x_N)$
◮ Rewrite the problem using a set indicator $I_S$:

$$\text{minimize} \quad \sum_{i=1}^{N} f_i(x_i) + I_{\{Ax = b\}}(x)$$

◮ DRS iterates for $k = 1, 2, \ldots$:

$$\begin{aligned} x_i^{k+1/2} &= \mathbf{prox}_{tf_i}(v_i^k), \quad i = 1, \ldots, N \\ v^{k+1/2} &= 2x^{k+1/2} - v^k \\ x^{k+1} &= \Pi_{\{Av = b\}}(v^{k+1/2}) \\ v^{k+1} &= v^k + x^{k+1} - x^{k+1/2} \end{aligned}$$

where $\Pi_S(v)$ is the Euclidean projection of $v$ onto $S$
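A minimal sketch of these iterations for a single block ($N = 1$), assuming $A$ has full row rank so the projection onto $\{v : Av = b\}$ has the closed form $v - A^T(AA^T)^{-1}(Av - b)$; an illustration, not the library implementation:

```python
import numpy as np

def drs(prox_f, A, b, v0, t=1.0, iters=100):
    # One-block DRS: prox step, reflection, projection onto {v : Av = b},
    # then the averaging update. Assumes A has full row rank.
    v = v0
    AAt = A @ A.T
    for _ in range(iters):
        x_half = prox_f(v, t)                                     # x^{k+1/2}
        v_half = 2 * x_half - v                                   # v^{k+1/2}
        x = v_half - A.T @ np.linalg.solve(AAt, A @ v_half - b)   # x^{k+1}
        v = v + x - x_half                                        # v^{k+1}
    return x_half, v
```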
SLIDE 12
Convergence of DRS
◮ The DRS iterations can be viewed as a fixed-point iteration $v^{k+1} = F(v^k)$, where $F$ is firmly non-expansive
◮ $v^k$ converges to a fixed point of $F$ (if one exists)
◮ $x^k$ and $x^{k+1/2}$ converge to a solution of our problem
SLIDE 13
Convergence of DRS
◮ The DRS iterations can be viewed as a fixed-point iteration $v^{k+1} = F(v^k)$, where $F$ is firmly non-expansive
◮ $v^k$ converges to a fixed point of $F$ (if one exists)
◮ $x^k$ and $x^{k+1/2}$ converge to a solution of our problem

In practice, this convergence is often slow...
SLIDE 14
Outline
Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion
SLIDE 15
Type-II AA
◮ Quasi-Newton method for accelerating fixed-point iterations
◮ Extrapolates the next iterate using the $M + 1$ most recent iterates:

$$v^{k+1} = \sum_{j=0}^{M} \alpha_j^k F(v^{k-M+j})$$

◮ Let $G(v) = v - F(v)$; then $\alpha^k \in \mathbf{R}^{M+1}$ is the solution to

$$\begin{array}{ll} \text{minimize} & \left\| \sum_{j=0}^{M} \alpha_j^k G(v^{k-M+j}) \right\|_2^2 \\ \text{subject to} & \sum_{j=0}^{M} \alpha_j^k = 1 \end{array}$$

◮ Typically only $M \approx 10$ is needed for good performance
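The constrained least-squares problem for $\alpha^k$ has only $M + 1$ variables and can be solved through its KKT system; a sketch, where the columns of G hold $G(v^{k-M}), \ldots, G(v^k)$:

```python
import numpy as np

def aa_weights(G):
    # Solve: minimize ||G @ alpha||_2^2 subject to sum(alpha) == 1
    # via the KKT system [2 G^T G, 1; 1^T, 0] [alpha; nu] = [0; 1].
    m = G.shape[1]
    K = np.block([[2 * G.T @ G, np.ones((m, 1))],
                  [np.ones((1, m)), np.zeros((1, 1))]])
    rhs = np.append(np.zeros(m), 1.0)
    return np.linalg.lstsq(K, rhs, rcond=None)[0][:m]
```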
SLIDE 16
Adaptive Regularization
◮ Type-II AA is unstable (Scieur, d’Aspremont, Bach 2016) and can provably diverge (Mai, Johansson 2019)
◮ Add an adaptive regularization term to an unconstrained reformulation
SLIDE 17
Adaptive Regularization
◮ Type-II AA is unstable (Scieur, d’Aspremont, Bach 2016) and can provably diverge (Mai, Johansson 2019)
◮ Add an adaptive regularization term to an unconstrained reformulation
◮ Change variables to $\gamma^k \in \mathbf{R}^M$:

$$\alpha_0^k = \gamma_0^k, \qquad \alpha_i^k = \gamma_i^k - \gamma_{i-1}^k, \quad i = 1, \ldots, M-1, \qquad \alpha_M^k = 1 - \gamma_{M-1}^k$$

◮ The unconstrained AA problem is

$$\text{minimize} \quad \|g^k - Y_k \gamma^k\|_2^2,$$

where we define $g^k = G(v^k)$, $y^k = g^{k+1} - g^k$, $Y_k = [y^{k-M} \; \cdots \; y^{k-1}]$
SLIDE 18
Adaptive Regularization
◮ Type-II AA is unstable (Scieur, d’Aspremont, Bach 2016) and can provably diverge (Mai, Johansson 2019)
◮ Add an adaptive regularization term to an unconstrained reformulation
◮ Change variables to $\gamma^k \in \mathbf{R}^M$:

$$\alpha_0^k = \gamma_0^k, \qquad \alpha_i^k = \gamma_i^k - \gamma_{i-1}^k, \quad i = 1, \ldots, M-1, \qquad \alpha_M^k = 1 - \gamma_{M-1}^k$$

◮ The stabilized AA problem is

$$\text{minimize} \quad \|g^k - Y_k \gamma^k\|_2^2 + \eta \left( \|S_k\|_F^2 + \|Y_k\|_F^2 \right) \|\gamma^k\|_2^2,$$

where $\eta \geq 0$ is a parameter and $g^k = G(v^k)$, $y^k = g^{k+1} - g^k$, $Y_k = [y^{k-M} \; \cdots \; y^{k-1}]$, $s^k = v^{k+1} - v^k$, $S_k = [s^{k-M} \; \cdots \; s^{k-1}]$
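The stabilized problem is a ridge-regularized least squares, solvable in closed form via the normal equations; a sketch that also maps $\gamma^k$ back to $\alpha^k$ using the change of variables above:

```python
import numpy as np

def aa_weights_reg(g, Y, S, eta=1e-8):
    # Minimize ||g - Y @ gamma||^2 + eta * (||S||_F^2 + ||Y||_F^2) * ||gamma||^2.
    lam = eta * (np.linalg.norm(S, 'fro')**2 + np.linalg.norm(Y, 'fro')**2)
    M = Y.shape[1]
    gamma = np.linalg.solve(Y.T @ Y + lam * np.eye(M), Y.T @ g)
    # Map back: alpha_0 = gamma_0, alpha_i = gamma_i - gamma_{i-1},
    # alpha_M = 1 - gamma_{M-1}.
    return np.concatenate([[gamma[0]], np.diff(gamma), [1 - gamma[-1]]])
```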
SLIDE 19
A2DR
◮ Parameters: $M$ = max memory, $R$ = safeguarding parameter
◮ A2DR iterates for $k = 1, 2, \ldots$:

1. $v_{\mathrm{DRS}}^{k+1} = F(v^k)$, $\quad g^k = v^k - v_{\mathrm{DRS}}^{k+1}$
2. Compute $\alpha^k$ by solving the stabilized AA problem
3. $v_{\mathrm{AA}}^{k+1} = \sum_{j=0}^{M} \alpha_j^k v_{\mathrm{DRS}}^{k-M+j+1}$
4. Safeguard check: if $\|G(v^k)\|_2$ is small enough, set $v^{k+i} = v_{\mathrm{AA}}^{k+i}$ for $i = 1, \ldots, R$; otherwise set $v^{k+1} = v_{\mathrm{DRS}}^{k+1}$

◮ The safeguard ensures convergence in the infeasible/unbounded case
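Putting the pieces together, a simplified sketch of the loop, reusing aa_weights_reg from the previous slide. The threshold eps_safe / (k + 1) is only a stand-in for the paper's safeguard schedule, and the $R$-iteration acceptance window is omitted:

```python
import numpy as np

def a2dr_sketch(F, v0, M=10, eta=1e-8, iters=1000, eps_safe=1e6):
    v, V, V_drs, G = v0, [v0], [], []
    for k in range(iters):
        v_drs = F(v)                          # step 1: one DRS pass
        g = v - v_drs
        V_drs, G = (V_drs + [v_drs])[-(M + 1):], (G + [g])[-(M + 1):]
        if len(G) >= 2 and np.linalg.norm(g) <= eps_safe / (k + 1):
            Y = np.diff(np.column_stack(G), axis=1)            # Y_k
            S = np.diff(np.column_stack(V[-len(G):]), axis=1)  # S_k
            alpha = aa_weights_reg(g, Y, S, eta)               # step 2
            v = np.column_stack(V_drs) @ alpha                 # step 3: AA candidate
        else:
            v = v_drs                         # step 4: safeguarded fallback to DRS
        V = (V + [v])[-(M + 1):]
    return v
```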
SLIDE 20
Stopping Criterion of A2DR
◮ Stop and output $x^{k+1/2}$ when $\|r^k\|_2 \leq \epsilon_{\mathrm{tol}}$, where

$$r_{\mathrm{prim}}^k = Ax^{k+1/2} - b, \qquad r_{\mathrm{dual}}^k = \frac{1}{t}(v^k - x^{k+1/2}) + A^T \lambda^k, \qquad r^k = (r_{\mathrm{prim}}^k, r_{\mathrm{dual}}^k)$$

◮ The dual variable is the minimizer of the dual residual norm:

$$\lambda^k = \operatorname*{argmin}_{\lambda} \left\| \frac{1}{t}(v^k - x^{k+1/2}) + A^T \lambda \right\|_2$$

◮ Note that this is a simple least-squares problem
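A sketch of the residual computation, with dense $A$ for illustration:

```python
import numpy as np

def residual_norm(A, b, v, x_half, t):
    # lambda^k minimizes ||(1/t)(v - x_half) + A^T lambda||_2: an ordinary
    # least-squares problem in lambda, solved here with lstsq.
    r = (v - x_half) / t
    lam = np.linalg.lstsq(A.T, -r, rcond=None)[0]
    r_prim = A @ x_half - b
    r_dual = r + A.T @ lam
    return np.linalg.norm(np.concatenate([r_prim, r_dual]))
```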
SLIDE 21
Convergence of A2DR

Theorem (Solvable Case). If the problem is feasible and bounded, then

$$\liminf_{k \to \infty} \|r^k\|_2 = 0$$

and the AA candidates are adopted infinitely often. Furthermore, if $F$ has a fixed point,

$$\lim_{k \to \infty} v^k = v^\star \quad \text{and} \quad \lim_{k \to \infty} x^{k+1/2} = x^\star,$$

where $v^\star$ is a fixed point of $F$ and $x^\star$ is a solution to the problem.
SLIDE 22
Convergence of A2DR

Theorem (Pathological Case). If the problem is pathological (infeasible or unbounded), then

$$\lim_{k \to \infty} (v^k - v^{k+1}) = \delta v \neq 0.$$

Furthermore, if $\lim_{k \to \infty} Ax^{k+1/2} = b$, the problem is unbounded and $\|\delta v\|_2 = t \, \mathbf{dist}(\mathbf{dom}\, f^*, \mathcal{R}(A^T))$. Otherwise, it is infeasible and $\|\delta v\|_2 \geq \mathbf{dist}(\mathbf{dom}\, f, \{x : Ax = b\})$. Here $f(x) = \sum_{i=1}^{N} f_i(x_i)$.
SLIDE 23
Preconditioning
◮ Convergence is greatly improved by rescaling the problem
◮ Replace the original $A$, $b$, $f_i$ with $\hat{A} = DAE$, $\hat{b} = Db$, $\hat{f}_i(\hat{x}_i) = f_i(e_i \hat{x}_i)$
◮ $D$ and $E$ are diagonal positive; $e_i > 0$ corresponds to the $i$th block diagonal entry of $E$
◮ $D$ and $E$ are chosen by equilibrating $A$ (see the paper for details)
◮ The proximal operator of $\hat{f}_i$ can be evaluated using the proximal operator of $f_i$:

$$\mathbf{prox}_{t\hat{f}_i}(\hat{v}_i) = \frac{1}{e_i} \mathbf{prox}_{(e_i^2 t) f_i}(e_i \hat{v}_i)$$
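This identity is a one-line wrapper around the original prox; a minimal sketch:

```python
def scale_prox(prox_f, e_i):
    # prox of f_hat(x) = f(e_i * x), via the identity above:
    # prox_{t f_hat}(v) = (1 / e_i) * prox_{(e_i^2 t) f}(e_i * v)
    return lambda v, t: prox_f(e_i * v, e_i**2 * t) / e_i
```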
SLIDE 24
Outline
Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion
SLIDE 25
Python Solver Interface
result = a2dr(prox_list, A_list, b)

Input arguments:
◮ prox_list is a list of proximal function handles, e.g., $f_i(x_i) = x_i$ ⇒ prox_list[i] = lambda v, t: v - t
◮ A_list is a list of the matrices $A_i$; b is the vector $b$

Output dictionary keys:
◮ num_iters is the total number of iterations $K$
◮ x_vals is a list of the final values $x_i^K$
◮ primal and dual are vectors containing the residuals $r_{\mathrm{prim}}^k$ and $r_{\mathrm{dual}}^k$ for $k = 1, \ldots, K$
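A tiny end-to-end call in this interface, following the import path from the repository's README; the toy problem and prox handles are illustrative:

```python
import numpy as np
from scipy import sparse
from a2dr import a2dr

# Toy problem: minimize ||x_1||_1 + I_{x >= 0}(x_2) subject to x_1 - x_2 = 0.
n = 10
prox_list = [
    lambda v, t: np.maximum(v - t, 0) - np.maximum(-v - t, 0),  # prox of ||.||_1
    lambda v, t: np.maximum(v, 0),                              # projection onto R^n_+
]
A_list = [sparse.eye(n), -sparse.eye(n)]
b = np.zeros(n)

result = a2dr(prox_list, A_list, b)
print(result["num_iters"], result["x_vals"][0])
```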
SLIDE 26
Proximal Library
We provide an extensive proximal library in a2dr.proximal:

| $f(x)$ | $\mathbf{prox}_{tf}(v)$ | Function handle |
| --- | --- | --- |
| $x$ | $v - t$ | prox_identity |
| $\|x\|_1$ | $(v - t)_+ - (-v - t)_+$ | prox_norm1 |
| $\|x\|_2$ | $(1 - t/\|v\|_2)_+ \, v$ | prox_norm2 |
| $\|x\|_\infty$ | bisection | prox_norm_inf |
| $e^x$ | $v - W(te^v)$ | prox_exp |
| $-\log(x)$ | $\left(v + \sqrt{v^2 + 4t}\right)/2$ | prox_neg_log |
| $\sum_i \log(1 + e^{x_i})$ | Newton-CG | prox_logistic |
| $\|Fx - g\|_2^2$ | LSQR | prox_sum_squares_affine |
| $I_{\mathbf{R}^n_+}(x)$ | $\max(v, 0)$ | prox_nonneg_constr |

...and much more! See the documentation for the full list.
SLIDE 27
Nonnegative Least Squares (NNLS)
$$\begin{array}{ll} \text{minimize} & \|Fz - g\|_2^2 \\ \text{subject to} & z \geq 0 \end{array}$$

with respect to $z \in \mathbf{R}^q$
◮ Problem data: $F \in \mathbf{R}^{p \times q}$ and $g \in \mathbf{R}^p$
◮ Can be written in standard form with

$$f_1(x_1) = \|Fx_1 - g\|_2^2, \qquad f_2(x_2) = I_{\mathbf{R}^q_+}(x_2), \qquad A_1 = I, \quad A_2 = -I, \quad b = 0$$

◮ We evaluate the proximal operator of $f_1$ using LSQR
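The prox of $f_1$ is itself a least-squares problem, since $\|Fx - g\|_2^2 + \frac{1}{2t}\|x - v\|_2^2$ can be written as a single stacked system; a sketch using SciPy's LSQR (the helper name is ours):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

def make_prox_sum_squares(F, g):
    # prox_{tf}(v) for f(x) = ||F x - g||_2^2, via the stacked least-squares
    # problem: minimize ||[F; I/sqrt(2t)] x - [g; v/sqrt(2t)]||_2^2.
    F = sparse.csr_matrix(F)
    q = F.shape[1]
    def prox(v, t):
        A_stack = sparse.vstack([F, sparse.eye(q) / np.sqrt(2 * t)])
        b_stack = np.concatenate([g, v / np.sqrt(2 * t)])
        return lsqr(A_stack, b_stack)[0]
    return prox
```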
SLIDE 28
NNLS: Convergence of $\|r^k\|_2$

$p = 10^4$, $q = 8000$, $F$ has 0.1% nonzeros

[Figure: residual norm $\|r^k\|_2$ vs. iteration $k$ for DRS and A2DR]
SLIDE 29
NNLS: Regularization Effect on $\|r^k\|_2$

$p = 300$, $q = 500$, $F$ has 0.1% nonzeros

[Figure: residual norm $\|r^k\|_2$ vs. iteration $k$ with no regularization, constant regularization, and adaptive regularization]
SLIDE 30
Sparse Inverse Covariance Estimation
◮ Samples $z_1, \ldots, z_p$ drawn IID from $N(0, \Sigma)$
◮ Know the covariance $\Sigma \in \mathbf{S}^q_+$ has a sparse inverse $S = \Sigma^{-1}$
◮ One way to estimate $S$ is by solving the penalized log-likelihood problem

$$\text{minimize} \quad -\log\det(S) + \mathbf{tr}(SQ) + \alpha \sum_{i,j} |S_{ij}|,$$

where $Q$ is the sample covariance and $\alpha \geq 0$ is a parameter
◮ Note that $\log\det(S) = -\infty$ when $S \not\succ 0$
SLIDE 31
Sparse Inverse Covariance Estimation
◮ The problem can be written in standard form with

$$f_1(S_1) = -\log\det(S_1) + \mathbf{tr}(S_1 Q), \qquad f_2(S_2) = \alpha \sum_{i,j} |(S_2)_{ij}|, \qquad A_1 = I, \quad A_2 = -I, \quad b = 0$$

◮ Both proximal operators have closed-form solutions
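For instance, the prox of $f_1$ follows from the optimality condition $-S^{-1} + Q + \frac{1}{t}(S - V) = 0$: eigendecompose $V - tQ$ and solve a scalar quadratic per eigenvalue. A sketch (helper name ours):

```python
import numpy as np

def make_prox_neg_log_det(Q):
    # prox_{tf}(V) for f(S) = -log det(S) + tr(SQ): with V - tQ = U diag(d) U^T,
    # each eigenvalue of the minimizer solves s^2 - d*s - t = 0, so
    # s = (d + sqrt(d^2 + 4t)) / 2 > 0, guaranteeing positive definiteness.
    def prox(V, t):
        d, U = np.linalg.eigh((V + V.T) / 2 - t * Q)   # symmetrized argument
        s = (d + np.sqrt(d**2 + 4 * t)) / 2
        return (U * s) @ U.T                           # U @ diag(s) @ U^T
    return prox
```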
SLIDE 32
Covariance Estimation: Convergence of $\|r^k\|_2$

$p = 1000$, $q = 100$, $S$ has 10% nonzeros

[Figure: residual norm $\|r^k\|_2$ vs. iteration $k$ for DRS and A2DR]
SLIDE 33
Multi-Task Logistic Regression
$$\text{minimize} \quad \phi(W\theta, Y) + \alpha \sum_{l=1}^{L} \|\theta_l\|_2 + \beta \|\theta\|_*$$

with respect to $\theta = [\theta_1 \; \cdots \; \theta_L] \in \mathbf{R}^{s \times L}$
◮ Problem data: $W \in \mathbf{R}^{p \times s}$ and $Y = [y_1 \; \cdots \; y_L] \in \mathbf{R}^{p \times L}$
◮ Regularization parameters: $\alpha \geq 0$, $\beta \geq 0$
◮ Logistic loss function:

$$\phi(Z, Y) = \sum_{l=1}^{L} \sum_{i=1}^{p} \log\left(1 + \exp(-Y_{il} Z_{il})\right)$$
SLIDE 34
Multi-Task Logistic Regression
◮ Rewrite the problem in standard form with

$$f_1(Z) = \phi(Z, Y), \qquad f_2(\theta) = \alpha \sum_{l=1}^{L} \|\theta_l\|_2, \qquad f_3(\tilde{\theta}) = \beta \|\tilde{\theta}\|_*,$$

$$A = \begin{bmatrix} I & -W & 0 \\ 0 & I & -I \end{bmatrix}, \qquad x = \begin{pmatrix} Z \\ \theta \\ \tilde{\theta} \end{pmatrix}, \qquad b = 0$$

◮ We evaluate the proximal operator of $f_1$ using a Newton-CG method; the rest have closed-form solutions
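For instance, the prox of $f_3(\tilde{\theta}) = \beta\|\tilde{\theta}\|_*$ is singular value soft-thresholding; a minimal sketch (helper name ours):

```python
import numpy as np

def make_prox_nuclear(beta):
    # prox_{tf}(V) for f(theta) = beta * ||theta||_*: soft-threshold the
    # singular values of V by beta * t.
    def prox(V, t):
        U, s, Vt = np.linalg.svd(V, full_matrices=False)
        return (U * np.maximum(s - beta * t, 0)) @ Vt
    return prox
```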
SLIDE 35
Multi-Task Logistic: Convergence of $\|r^k\|_2$

$p = 300$, $s = 500$, $L = 10$, $\alpha = \beta = 0.1$

[Figure: residual norm $\|r^k\|_2$ vs. iteration $k$ for DRS and A2DR]
SLIDE 36
Solver Comparisons
◮ Compare with the generic convex solvers OSQP (Stellato et al. 2017) and SCS (O’Donoghue et al. 2016)
◮ NNLS ($p = 10^4$, $q = 8000$): OSQP and SCS took 349 and 327 seconds, respectively; A2DR took 55 seconds
◮ Covariance estimation:
  ◮ $q = 1200$: SCS took 11 hours to achieve $\epsilon = 10^{-1}$; A2DR took 1 hour to achieve $\epsilon = 10^{-3}$
  ◮ $q = 2000$: SCS failed with an out-of-memory error; A2DR took 2.6 hours to achieve $\epsilon = 10^{-3}$
SLIDE 37
Outline
Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion
SLIDE 38
Conclusion
◮ A2DR is a fast, robust algorithm for solving generic convex optimization problems in prox-affine form
◮ Can be easily scaled up and parallelized
◮ Open-source Python library a2dr: https://github.com/cvxgrp/a2dr

Reference: A. Fu*, J. Zhang*, and S. Boyd (2019). “Anderson Accelerated Douglas-Rachford Splitting.” arXiv:1908.11482. (*Equal contribution.)
SLIDE 39