SLIDE 1

Anderson Accelerated Douglas-Rachford Splitting

Anqi Fu, Junzi Zhang, Stephen Boyd
EE & ICME Departments, Stanford University
March 10, 2020

SLIDE 2

Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion

SLIDE 3

Outline

Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion

SLIDE 4

Prox-Affine Problem

Prox-affine convex optimization problem:

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} f_i(x_i) \\ \text{subject to} & \sum_{i=1}^{N} A_i x_i = b \end{array}$$

with variables $x_i \in \mathbf{R}^{n_i}$ for $i = 1, \ldots, N$

◮ $A_i \in \mathbf{R}^{m \times n_i}$ and $b \in \mathbf{R}^m$ are given data
◮ $f_i : \mathbf{R}^{n_i} \to \mathbf{R} \cup \{+\infty\}$ are closed, convex, and proper
◮ Each $f_i$ can only be accessed via its proximal operator
$$\mathrm{prox}_{tf_i}(v_i) = \operatorname*{argmin}_{x_i} \left\{ f_i(x_i) + \tfrac{1}{2t}\|x_i - v_i\|_2^2 \right\},$$
where $t > 0$ is a parameter (a small numerical check of this definition follows below)
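To make the black-box access concrete, here is a minimal sketch (our own code, not from the deck) of one such proximal operator, soft thresholding for $f(x) = \|x\|_1$, verified against the optimality condition of the defining minimization:

```python
import numpy as np

def prox_norm1(v, t):
    """Soft thresholding: prox of f(x) = ||x||_1, i.e. the minimizer of
    ||x||_1 + ||x - v||_2^2 / (2t)."""
    return np.maximum(v - t, 0) - np.maximum(-v - t, 0)

# Optimality condition of the prox problem, coordinate by coordinate:
#   x_i != 0  =>  sign(x_i) + (x_i - v_i)/t == 0
#   x_i == 0  =>  |v_i| <= t
v, t = np.array([1.5, -0.2, 0.9, -2.0]), 0.5
x = prox_norm1(v, t)
for xi, vi in zip(x, v):
    if xi != 0:
        assert abs(np.sign(xi) + (xi - vi) / t) < 1e-12
    else:
        assert abs(vi) <= t
print(x)  # [ 1.   0.   0.4 -1.5]
```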

SLIDE 5

Why This Formulation?

◮ Encompasses many classes of convex problems (conic programs, consensus optimization)
◮ Block-separable form is ideal for distributed optimization
◮ Proximal operators can be provided as “black boxes”, enabling privacy-preserving implementation

SLIDE 7

Previous Work

◮ Alternating direction method of multipliers (ADMM)
◮ Douglas-Rachford splitting (DRS)
◮ Augmented Lagrangian method (ALM)

These methods are typically slow to converge, prompting research into acceleration techniques:

◮ Adaptive penalty parameters
◮ Momentum methods
◮ Quasi-Newton methods with line search

SLIDE 8

Our Method

◮ A2DR: Anderson acceleration (AA) applied to DRS
◮ DRS is a non-expansive fixed-point (NEFP) method that fits the prox-affine framework
◮ AA is fast, efficient, and can be applied to NEFP iterations, but is unstable without modification
◮ We introduce a type-II AA variant that converges globally in non-smooth, potentially pathological settings

SLIDE 9

Main Advantages

◮ A2DR produces primal and dual solutions, or a certificate of infeasibility/unboundedness
◮ Consistently converges faster, with no parameter tuning
◮ Memory efficient ⇒ little extra cost per iteration
◮ Scales to large problems and is easily parallelized
◮ Python implementation: https://github.com/cvxgrp/a2dr

SLIDE 10

Outline

Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion

SLIDE 11

DRS Algorithm

◮ Define $A = [A_1 \cdots A_N]$ and $x = (x_1, \ldots, x_N)$
◮ Rewrite the problem using a set indicator $I_S$:
$$\text{minimize} \quad \textstyle\sum_{i=1}^{N} f_i(x_i) + I_{\{Ax = b\}}(x)$$
◮ DRS iterates, for $k = 1, 2, \ldots$:
$$\begin{aligned} x_i^{k+1/2} &= \mathrm{prox}_{tf_i}(v_i^k), \quad i = 1, \ldots, N \\ v^{k+1/2} &= 2x^{k+1/2} - v^k \\ x^{k+1} &= \Pi_{\{Av = b\}}(v^{k+1/2}) \\ v^{k+1} &= v^k + x^{k+1} - x^{k+1/2} \end{aligned}$$
Here $\Pi_S(v)$ is the Euclidean projection of $v$ onto $S$ (a NumPy sketch of one iteration follows below)
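A minimal NumPy sketch of these iterates (our own code), assuming a single block ($N = 1$) and a dense $A$ with full row rank, so that the projection onto $\{v : Av = b\}$ has the closed form $v - A^T(AA^T)^{-1}(Av - b)$:

```python
import numpy as np

def project_affine(v, A, b):
    """Euclidean projection of v onto {x : Ax = b} (A assumed full row rank)."""
    return v - A.T @ np.linalg.solve(A @ A.T, A @ v - b)

def drs_step(v, prox, A, b, t):
    """One DRS iteration v^{k+1} = F(v^k) for: minimize f(x) s.t. Ax = b."""
    x_half = prox(v, t)                    # x^{k+1/2} = prox_{tf}(v^k)
    v_half = 2 * x_half - v                # v^{k+1/2} = 2 x^{k+1/2} - v^k
    x_next = project_affine(v_half, A, b)  # x^{k+1} = Pi_{Av=b}(v^{k+1/2})
    return v + x_next - x_half             # v^{k+1} = v^k + x^{k+1} - x^{k+1/2}

# Toy run: f(x) = ||x||_1 subject to a random consistent system Ax = b.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8))
b = A @ rng.standard_normal(8)
prox = lambda v, t: np.maximum(v - t, 0) - np.maximum(-v - t, 0)
v = np.zeros(8)
for _ in range(500):
    v = drs_step(v, prox, A, b, t=1.0)
x = prox(v, 1.0)
print(np.linalg.norm(A @ x - b))  # small: x^{k+1/2} is (near) feasible
```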

SLIDE 13

Convergence of DRS

◮ The DRS iterations can be viewed as a fixed-point mapping $v^{k+1} = F(v^k)$, where $F$ is firmly non-expansive
◮ $v^k$ converges to a fixed point of $F$ (if one exists)
◮ $x^k$ and $x^{k+1/2}$ converge to a solution of our problem

In practice, this convergence is often slow...

SLIDE 14

Outline

Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion

SLIDE 15

Type-II AA

◮ A quasi-Newton method for accelerating fixed-point iterations
◮ Extrapolates the next iterate from the $M + 1$ most recent iterates:
$$v^{k+1} = \sum_{j=0}^{M} \alpha_j^k F(v^{k-M+j})$$
◮ Let $G(v) = v - F(v)$; then $\alpha^k \in \mathbf{R}^{M+1}$ is the solution to
$$\begin{array}{ll} \text{minimize} & \left\| \sum_{j=0}^{M} \alpha_j^k \, G(v^{k-M+j}) \right\|_2^2 \\ \text{subject to} & \sum_{j=0}^{M} \alpha_j^k = 1 \end{array}$$
◮ Typically only $M \approx 10$ is needed for good performance
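As a sketch (our own code), the constrained least-squares problem for $\alpha^k$ can be solved directly through the KKT system of its Lagrangian; the next slide replaces this with a regularized unconstrained variant that is better behaved:

```python
import numpy as np

def aa_step(V_mapped, G_resid):
    """One type-II AA extrapolation.

    Columns of V_mapped are F(v^{k-M}), ..., F(v^k); columns of G_resid
    are the residuals G(v^j) = v^j - F(v^j). Solves
        minimize ||G_resid @ alpha||_2^2  subject to  sum(alpha) = 1
    through the KKT system of the Lagrangian.
    """
    m = G_resid.shape[1]
    K = np.zeros((m + 1, m + 1))
    K[:m, :m] = 2 * G_resid.T @ G_resid
    K[:m, m] = 1.0
    K[m, :m] = 1.0
    rhs = np.zeros(m + 1)
    rhs[m] = 1.0
    alpha = np.linalg.lstsq(K, rhs, rcond=None)[0][:m]  # lstsq tolerates a singular Gram block
    return V_mapped @ alpha  # extrapolated v^{k+1}
```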

SLIDE 18

Adaptive Regularization

◮ Type-II AA is unstable (Scieur, d’Aspremont, Bach 2016) and can provably diverge (Mai, Johansson 2019)
◮ Add an adaptive regularization term to the unconstrained formulation
◮ Change variables to $\gamma^k \in \mathbf{R}^M$:
$$\alpha_0^k = \gamma_0^k, \qquad \alpha_i^k = \gamma_i^k - \gamma_{i-1}^k \;\;\text{for } i = 1, \ldots, M-1, \qquad \alpha_M^k = 1 - \gamma_{M-1}^k$$
◮ The unconstrained AA problem is
$$\text{minimize} \quad \|g^k - Y_k \gamma^k\|_2^2,$$
where $g^k = G(v^k)$, $y^k = g^{k+1} - g^k$, and $Y_k = [y^{k-M} \cdots y^{k-1}]$
◮ The stabilized AA problem adds a regularization term:
$$\text{minimize} \quad \|g^k - Y_k \gamma^k\|_2^2 + \eta \left( \|S_k\|_F^2 + \|Y_k\|_F^2 \right) \|\gamma^k\|_2^2,$$
where $\eta \geq 0$ is a parameter, $s^k = v^{k+1} - v^k$, and $S_k = [s^{k-M} \cdots s^{k-1}]$
(a NumPy sketch of this step follows below)
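A NumPy sketch of this stabilized step (our own code): solve the regularized least-squares problem for $\gamma^k$ via the normal equations, then map back to $\alpha^k$, which sums to one by construction:

```python
import numpy as np

def aa_weights(g, Y, S, eta=1e-8):
    """Solve min_gamma ||g - Y @ gamma||_2^2
             + eta * (||S||_F^2 + ||Y||_F^2) * ||gamma||_2^2,
    then recover alpha via alpha_0 = gamma_0,
    alpha_i = gamma_i - gamma_{i-1} (i = 1, ..., M-1),
    alpha_M = 1 - gamma_{M-1}.
    """
    M = Y.shape[1]
    reg = eta * (np.linalg.norm(S, "fro") ** 2 + np.linalg.norm(Y, "fro") ** 2)
    gamma = np.linalg.solve(Y.T @ Y + reg * np.eye(M), Y.T @ g)
    alpha = np.empty(M + 1)
    alpha[0] = gamma[0]
    alpha[1:M] = gamma[1:] - gamma[:-1]
    alpha[M] = 1 - gamma[M - 1]
    return alpha  # telescoping sum gives sum(alpha) == 1

# Quick sanity check on random data.
rng = np.random.default_rng(0)
Y, S = rng.standard_normal((20, 5)), rng.standard_normal((20, 5))
g = rng.standard_normal(20)
print(np.isclose(aa_weights(g, Y, S).sum(), 1.0))  # True
```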

SLIDE 19

A2DR

◮ Parameters: $M$ = max memory, $R$ = safeguarding parameter
◮ A2DR iterates, for $k = 1, 2, \ldots$:

1. $v_{\mathrm{DRS}}^{k+1} = F(v^k)$, $\;\; g^k = v^k - v_{\mathrm{DRS}}^{k+1}$
2. Compute $\alpha^k$ by solving the stabilized AA problem
3. $v_{\mathrm{AA}}^{k+1} = \sum_{j=0}^{M} \alpha_j^k \, v_{\mathrm{DRS}}^{k-M+j+1}$
4. Safeguard check: if $\|G(v^k)\|_2$ is small enough, set $v^{k+i} = v_{\mathrm{AA}}^{k+i}$ for $i = 1, \ldots, R$; otherwise set $v^{k+1} = v_{\mathrm{DRS}}^{k+1}$

◮ The safeguard ensures convergence in the infeasible/unbounded case (a toy end-to-end sketch follows below)
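Putting the pieces together, a toy sketch of the loop under our own simplifications: memory is capped at $M + 1$ iterates, and the safeguard test below is a crude placeholder, not the schedule from the paper:

```python
import numpy as np
from collections import deque

def a2dr_sketch(F, v0, M=10, eta=1e-8, iters=300):
    """Toy A2DR loop; F is the DRS fixed-point map v -> F(v)."""
    v = v0
    vs = deque([v0], maxlen=M + 1)   # recent iterates v^k
    cands = deque(maxlen=M + 1)      # DRS candidates F(v^k)
    gs = deque(maxlen=M + 1)         # residuals g^k = v^k - F(v^k)
    g0 = None
    for k in range(iters):
        v_drs = F(v)                 # step 1
        g = v - v_drs
        cands.append(v_drs)
        gs.append(g)
        g0 = np.linalg.norm(g) if g0 is None else g0
        v_next = v_drs               # default: plain DRS step
        m = len(gs) - 1
        if m >= 1:
            # step 2: stabilized weights (cf. the previous sketch)
            Y = np.column_stack([gs[j + 1] - gs[j] for j in range(m)])
            S = np.column_stack([vs[j + 1] - vs[j] for j in range(m)])
            reg = eta * (np.linalg.norm(S, "fro")**2 + np.linalg.norm(Y, "fro")**2) + 1e-16
            gamma = np.linalg.solve(Y.T @ Y + reg * np.eye(m), Y.T @ g)
            alpha = np.concatenate(([gamma[0]], np.diff(gamma), [1 - gamma[-1]]))
            v_aa = sum(a * c for a, c in zip(alpha, cands))   # step 3
            # step 4: placeholder safeguard -- accept the AA candidate only
            # while the residual stays below a slowly decaying threshold
            if np.linalg.norm(g) <= 10 * g0 / np.sqrt(k + 1):
                v_next = v_aa
        v = v_next
        vs.append(v)
    return v
```

With F bound to the drs_step map from the earlier sketch (problem data fixed via a lambda), this loop can be run directly on the toy ℓ1 problem above.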

SLIDE 20

Stopping Criterion of A2DR

◮ Stop and output $x^{k+1/2}$ when $\|r^k\|_2 \leq \epsilon_{\mathrm{tol}}$, where
$$r_{\mathrm{prim}}^k = Ax^{k+1/2} - b, \qquad r_{\mathrm{dual}}^k = \tfrac{1}{t}(v^k - x^{k+1/2}) + A^T\lambda^k, \qquad r^k = (r_{\mathrm{prim}}^k, r_{\mathrm{dual}}^k)$$
◮ The dual variable is the minimizer of the dual residual norm,
$$\lambda^k = \operatorname*{argmin}_{\lambda} \left\| \tfrac{1}{t}(v^k - x^{k+1/2}) + A^T\lambda \right\|_2$$
◮ Note that this is a simple least-squares problem (a sketch using LSQR follows below)
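A sketch of the residual computation (our own glue code); the dual least-squares problem is solved iteratively with scipy.sparse.linalg.lsqr:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def residuals(A, b, v, x_half, t):
    """Primal and dual residuals of the stopping criterion.

    lambda^k solves min_lam ||(v - x_half)/t + A.T @ lam||_2, a plain
    least-squares problem in A.T, handled here by LSQR.
    """
    w = (v - x_half) / t
    lam = lsqr(A.T, -w)[0]        # minimizes ||A.T @ lam - (-w)||_2
    r_prim = A @ x_half - b
    r_dual = w + A.T @ lam
    return np.concatenate([r_prim, r_dual])

# stop when np.linalg.norm(residuals(A, b, v, x_half, t)) <= eps_tol
```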

SLIDE 21

Convergence of A2DR

Theorem (Solvable Case)

If the problem is feasible and bounded, then
$$\liminf_{k \to \infty} \|r^k\|_2 = 0$$
and the AA candidates are adopted infinitely often. Furthermore, if $F$ has a fixed point, then
$$\lim_{k \to \infty} v^k = v^\star \quad \text{and} \quad \lim_{k \to \infty} x^{k+1/2} = x^\star,$$
where $v^\star$ is a fixed point of $F$ and $x^\star$ is a solution to the problem.

SLIDE 22

Convergence of A2DR

Theorem (Pathological Case)

If the problem is pathological (infeasible or unbounded), then
$$\lim_{k \to \infty} \left( v^k - v^{k+1} \right) = \delta v \neq 0.$$
Furthermore, if $\lim_{k \to \infty} Ax^{k+1/2} = b$, the problem is unbounded and $\|\delta v\|_2 = t \, \mathrm{dist}(\mathrm{dom}\, f^*, \mathcal{R}(A^T))$. Otherwise, it is infeasible and $\|\delta v\|_2 \geq \mathrm{dist}(\mathrm{dom}\, f, \{x : Ax = b\})$. Here $f(x) = \sum_{i=1}^{N} f_i(x_i)$.

SLIDE 23

Preconditioning

◮ Convergence is greatly improved by rescaling the problem
◮ Replace the original $A$, $b$, $f_i$ with $\hat{A} = DAE$, $\hat{b} = Db$, and $\hat{f}_i(\hat{x}_i) = f_i(e_i \hat{x}_i)$
◮ $D$ and $E$ are diagonal positive; $e_i > 0$ corresponds to the $i$th block diagonal entry of $E$
◮ $D$ and $E$ are chosen by equilibrating $A$ (see the paper for details)
◮ The proximal operator of $\hat{f}_i$ can be evaluated using the proximal operator of $f_i$:
$$\mathrm{prox}_{t\hat{f}_i}(\hat{v}_i) = \frac{1}{e_i}\,\mathrm{prox}_{(e_i^2 t) f_i}(e_i \hat{v}_i)$$
(a one-line code sketch of this identity follows below)
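This identity is one line of code given the original operator; a sketch with our own names:

```python
import numpy as np

def prox_scaled(prox_fi, e_i):
    """Given prox_fi(v, t) for f_i, return the prox of the rescaled
    f_hat_i(x) = f_i(e_i * x), using
    prox_{t f_hat_i}(v) = (1/e_i) * prox_{(e_i^2 t) f_i}(e_i * v)."""
    return lambda v, t: prox_fi(e_i * v, (e_i ** 2) * t) / e_i

# e.g. soft thresholding rescaled by e_i = 2:
prox_l1 = lambda v, t: np.maximum(v - t, 0) - np.maximum(-v - t, 0)
prox_l1_hat = prox_scaled(prox_l1, 2.0)
```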

SLIDE 24

Outline

Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion

SLIDE 25

Python Solver Interface

result = a2dr(prox_list, A_list, b)

Input arguments:
◮ prox_list is a list of proximal function handles, e.g., $f_i(x_i) = x_i$ $\Rightarrow$ prox_list[i] = lambda v,t: v - t
◮ A_list is a list of the matrices $A_i$; b is the vector $b$

Output dictionary keys:
◮ num_iters is the total number of iterations $K$
◮ x_vals is a list of the final values $x_i^K$
◮ primal and dual are vectors containing $r_{\mathrm{prim}}^k$ and $r_{\mathrm{dual}}^k$ for $k = 1, \ldots, K$

(A small usage example follows below.)
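For instance, a toy instance of $\|x_1\|_1 + \tfrac{1}{2}\|x_2\|_2^2$ subject to $x_1 = x_2$ in this interface. This is a sketch: the call and output keys follow the slide, while the import path and the use of sparse inputs are assumptions to check against the package documentation:

```python
import numpy as np
from scipy import sparse
from a2dr import a2dr   # pip install a2dr; import path assumed from the repo README

# minimize ||x_1||_1 + (1/2)||x_2||_2^2  subject to  x_1 - x_2 = 0,
# whose solution is x_1 = x_2 = 0.
n = 50
prox_list = [
    lambda v, t: np.maximum(v - t, 0) - np.maximum(-v - t, 0),  # prox of ||.||_1
    lambda v, t: v / (1 + t),                                   # prox of (1/2)||.||_2^2
]
A_list = [sparse.eye(n), -sparse.eye(n)]
b = np.zeros(n)

result = a2dr(prox_list, A_list, b)
print(result["num_iters"])
print([np.linalg.norm(x) for x in result["x_vals"]])  # both near 0
```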

SLIDE 26

Proximal Library

We provide an extensive proximal library in a2dr.proximal (here W denotes the Lambert W function):

f(x)                      prox_tf(v)                        Function handle
------------------------  --------------------------------  -----------------------
x                         v − t                             prox_identity
‖x‖₁                      (v − t)₊ − (−v − t)₊              prox_norm1
‖x‖₂                      (1 − t/‖v‖₂)₊ v                   prox_norm2
‖x‖∞                      bisection                         prox_norm_inf
exp(x)                    v − W(t exp(v))                   prox_exp
−log(x)                   (v + √(v² + 4t))/2                prox_neg_log
Σᵢ log(1 + exp(xᵢ))       Newton-CG                         prox_logistic
‖Fx − g‖₂²                LSQR                              prox_sum_squares_affine
I_{R₊ⁿ}(x)                max(v, 0)                         prox_nonneg_constr

...and much more! See the documentation for the full list. (Two of the closed forms above are re-derived in the sketch below as a cross-check.)
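As a cross-check of two of these rows, here are the closed forms implemented directly from the formulas (our own code, not the library's):

```python
import numpy as np

def prox_norm2(v, t):
    """prox of f(x) = ||x||_2: block soft thresholding, (1 - t/||v||_2)_+ v."""
    nv = np.linalg.norm(v)
    return np.maximum(1 - t / nv, 0) * v if nv > 0 else v

def prox_neg_log(v, t):
    """prox of f(x) = -log(x) (elementwise): (v + sqrt(v^2 + 4t)) / 2."""
    return (v + np.sqrt(v ** 2 + 4 * t)) / 2

# prox_neg_log satisfies the optimality condition -t/x + x - v = 0:
v = np.array([0.3, -1.0, 2.0])
x = prox_neg_log(v, 0.5)
print(np.max(np.abs(-0.5 / x + x - v)))  # ~1e-16
```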

SLIDE 27

Nonnegative Least Squares (NNLS)

$$\begin{array}{ll} \text{minimize} & \|Fz - g\|_2^2 \\ \text{subject to} & z \geq 0 \end{array}$$

with respect to $z \in \mathbf{R}^q$

◮ Problem data: $F \in \mathbf{R}^{p \times q}$ and $g \in \mathbf{R}^p$
◮ Can be written in standard form with
$$f_1(x_1) = \|Fx_1 - g\|_2^2, \qquad f_2(x_2) = I_{\mathbf{R}_+^q}(x_2), \qquad A_1 = I, \quad A_2 = -I, \quad b = 0$$
◮ We evaluate the proximal operator of $f_1$ using LSQR (a sketch of both operators follows below)
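A sketch of both proximal operators for this splitting (our own code). Writing $x = v + z$ turns the prox problem for $f_1$ into a damped least-squares problem, which is exactly what LSQR solves:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

def make_prox_sum_squares(F, g):
    """prox of f(x) = ||F x - g||_2^2 via LSQR.

    With x = v + z, the prox problem becomes the damped least squares
    min_z ||sqrt(2t) F z - sqrt(2t)(g - F v)||_2^2 + ||z||_2^2.
    """
    def prox(v, t):
        c = np.sqrt(2 * t)
        z = lsqr(c * F, c * (g - F @ v), damp=1.0)[0]
        return v + z
    return prox

prox_nonneg = lambda v, t: np.maximum(v, 0)  # prox of the indicator of z >= 0

# Assemble the inputs in the interface described on the previous slide:
p, q = 100, 80
rng = np.random.default_rng(0)
F = sparse.random(p, q, density=0.1, random_state=0, format="csr")
g = rng.standard_normal(p)
prox_list = [make_prox_sum_squares(F, g), prox_nonneg]
A_list = [sparse.eye(q), -sparse.eye(q)]
b = np.zeros(q)
```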

SLIDE 28

NNLS: Convergence of $\|r^k\|_2$

$p = 10^4$, $q = 8000$; $F$ has 0.1% nonzero entries

[Figure: log-scale plot of the residual norm $\|r^k\|_2$ versus iteration (up to 1000), from $10^{-7}$ to $10^1$, for DRS and A2DR]

SLIDE 29

NNLS: Effect of Regularization on $\|r^k\|_2$

$p = 300$, $q = 500$; $F$ has 0.1% nonzero entries

[Figure: log-scale plot of the residual norm $\|r^k\|_2$ versus iteration (up to 1000) for AA with no regularization, constant regularization, and adaptive regularization]

SLIDE 30

Sparse Inverse Covariance Estimation

◮ Samples $z_1, \ldots, z_p$ are drawn IID from $N(0, \Sigma)$
◮ We know the covariance $\Sigma \in \mathbf{S}_+^q$ has a sparse inverse $S = \Sigma^{-1}$
◮ One way to estimate $S$ is to solve the penalized log-likelihood problem
$$\text{minimize} \quad -\log\det(S) + \mathrm{tr}(SQ) + \alpha \textstyle\sum_{i,j} |S_{ij}|,$$
where $Q$ is the sample covariance and $\alpha \geq 0$ is a parameter
◮ Note that $\log\det(S) = -\infty$ when $S \not\succ 0$

SLIDE 31

Sparse Inverse Covariance Estimation

◮ The problem can be written in standard form with
$$f_1(S_1) = -\log\det(S_1) + \mathrm{tr}(S_1 Q), \qquad f_2(S_2) = \alpha \textstyle\sum_{i,j} |(S_2)_{ij}|, \qquad A_1 = I, \quad A_2 = -I, \quad b = 0$$
◮ Both proximal operators have closed-form solutions (sketched below)
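A sketch of both closed forms (our own code). For $f_1$, the optimality condition $S - tS^{-1} = V - tQ$ diagonalizes, leaving one scalar quadratic per eigenvalue; for $f_2$, the prox is elementwise soft thresholding:

```python
import numpy as np

def prox_neg_log_det(V, Q, t):
    """prox of f_1(S) = -log det(S) + tr(SQ) at V.

    Optimality gives S - t S^{-1} = V - t Q. Diagonalizing the right-hand
    side as U diag(lam) U^T reduces this to s_i - t/s_i = lam_i per
    eigenvalue, whose positive root is (lam_i + sqrt(lam_i^2 + 4t)) / 2.
    """
    B = V - t * Q
    B = (B + B.T) / 2                           # symmetrize for eigh
    lam, U = np.linalg.eigh(B)
    s = (lam + np.sqrt(lam ** 2 + 4 * t)) / 2   # always > 0, so S is PD
    return (U * s) @ U.T

def prox_scaled_l1(V, t, alpha):
    """prox of f_2(S) = alpha * sum_ij |S_ij|: elementwise soft thresholding."""
    return np.maximum(V - alpha * t, 0) - np.maximum(-V - alpha * t, 0)
```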

SLIDE 32

Covariance Estimation: Convergence of $\|r^k\|_2$

$p = 1000$, $q = 100$; $S$ has 10% nonzero entries

[Figure: log-scale plot of the residual norm $\|r^k\|_2$ versus iteration (up to 1000), from $10^{-7}$ to $10^1$, for DRS and A2DR]

SLIDE 33

Multi-Task Logistic Regression

$$\text{minimize} \quad \phi(W\theta, Y) + \alpha \sum_{l=1}^{L} \|\theta_l\|_2 + \beta \|\theta\|_*$$

with respect to $\theta = [\theta_1 \cdots \theta_L] \in \mathbf{R}^{s \times L}$

◮ Problem data: $W \in \mathbf{R}^{p \times s}$ and $Y = [y_1 \cdots y_L] \in \mathbf{R}^{p \times L}$
◮ Regularization parameters: $\alpha \geq 0$, $\beta \geq 0$
◮ Logistic loss function:
$$\phi(Z, Y) = \sum_{l=1}^{L} \sum_{i=1}^{p} \log\left(1 + \exp(-Y_{il} Z_{il})\right)$$

SLIDE 34

Multi-Task Logistic Regression

◮ Rewrite the problem in standard form with
$$f_1(Z) = \phi(Z, Y), \qquad f_2(\theta) = \alpha \sum_{l=1}^{L} \|\theta_l\|_2, \qquad f_3(\tilde{\theta}) = \beta \|\tilde{\theta}\|_*,$$
$$A = \begin{bmatrix} I & -W & 0 \\ 0 & I & -I \end{bmatrix}, \qquad x = (Z, \theta, \tilde{\theta}), \qquad b = 0$$
◮ We evaluate the proximal operator of $f_1$ using a Newton-CG method; the rest have closed-form solutions (sketched below)
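Sketches of the two closed-form operators (our own code): column-wise block soft thresholding for $f_2$ and singular-value soft thresholding for $f_3$:

```python
import numpy as np

def prox_group_l2(Theta, t, alpha):
    """prox of alpha * sum_l ||theta_l||_2 over the columns theta_l:
    block soft thresholding applied column by column."""
    norms = np.linalg.norm(Theta, axis=0, keepdims=True)
    scale = np.maximum(1 - t * alpha / np.maximum(norms, 1e-16), 0)
    return Theta * scale

def prox_nuclear(Theta, t, beta):
    """prox of beta * ||Theta||_*: soft threshold the singular values."""
    U, sig, Vt = np.linalg.svd(Theta, full_matrices=False)
    return (U * np.maximum(sig - t * beta, 0)) @ Vt
```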

SLIDE 35

Multi-Task Logistic: Convergence of $\|r^k\|_2$

$p = 300$, $s = 500$, $L = 10$, $\alpha = \beta = 0.1$

[Figure: log-scale plot of the residual norm $\|r^k\|_2$ versus iteration (up to 1000), from $10^{-7}$ to $10^1$, for DRS and A2DR]

SLIDE 36

Solver Comparisons

◮ We compare with the generic convex solvers OSQP (Stellato et al., 2017) and SCS (O'Donoghue et al., 2016)
◮ NNLS ($p = 10^4$, $q = 8000$): OSQP and SCS took 349 and 327 seconds, respectively, while A2DR took 55 seconds
◮ Covariance estimation:
  ◮ $q = 1200$: SCS took 11 hours to reach $\epsilon = 10^{-1}$; A2DR took 1 hour to reach $\epsilon = 10^{-3}$
  ◮ $q = 2000$: SCS failed with an out-of-memory error; A2DR took 2.6 hours to reach $\epsilon = 10^{-3}$

SLIDE 37

Outline

Problem Overview
Douglas-Rachford Splitting
Anderson Acceleration
Numerical Experiments
Conclusion

SLIDE 38

Conclusion

◮ A2DR is a fast, robust algorithm for solving generic convex optimization problems in prox-affine form
◮ It can be easily scaled up and parallelized
◮ Open-source Python library a2dr: https://github.com/cvxgrp/a2dr

Reference: A. Fu*, J. Zhang*, and S. Boyd. "Anderson Accelerated Douglas-Rachford Splitting." arXiv:1908.11482, 2019. (*equal contribution)

SLIDE 39

Future Work

◮ More work on feasibility detection
◮ Expand the library of proximal operators
◮ User-friendly interface with CVXPY
◮ GPU parallelization and cloud computing