

SLIDE 1

Splitting Envelopes: Accelerated Second-order Proximal Methods

Panos Patrinos (joint work with Lorenzo Stella, Alberto Bemporad)
September 8, 2014

SLIDE 2

Outline

  • forward-backward envelope (FBE)
  • forward-backward Newton method (FBN)
  • dual FBE and Augmented Lagrangian: alternating minimization Newton method (AMNM)
  • Douglas-Rachford envelope (DRE): accelerated Douglas-Rachford splitting (ADRS)

based on

1. P. Patrinos and A. Bemporad. Proximal Newton methods for convex composite optimization. In Proc. 52nd IEEE Conference on Decision and Control (CDC), pages 2358-2363, Florence, Italy, 2013.
2. P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. Submitted, arXiv:1402.6655, 2014.
3. P. Patrinos, L. Stella, and A. Bemporad. Douglas-Rachford splitting: complexity estimates and accelerated variants. In Proc. 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, CA, arXiv:1407.6723, 2014.
4. L. Stella, P. Patrinos, and A. Bemporad. Alternating minimization Newton method for separable convex optimization. Submitted, 2014.

fixed-point implementation for MPC:

  • A. Guiggiani, P. Patrinos, and A. Bemporad. Fixed-point implementation of a proximal Newton method for embedded model predictive control. In 19th IFAC World Congress, Cape Town, South Africa, 2014.

SLIDE 3

Convex composite optimization

minimize F(x) = f(x) + g(x)

  • f : ℝⁿ → ℝ convex, twice continuously differentiable, with ‖∇f(x) − ∇f(y)‖ ≤ L_f ‖x − y‖ for all x, y ∈ ℝⁿ
  • g : ℝⁿ → ℝ ∪ {+∞} convex, nonsmooth, with inexpensive proximal mapping

      prox_{γg}(x) = argmin_{z ∈ ℝⁿ} { g(z) + (1/(2γ)) ‖z − x‖² }

  • many problem classes: QPs, cone programs, sparse least-squares, rank minimization, total variation minimization, ...
  • applications: control, system identification, signal processing, image analysis, machine learning, ...

SLIDE 4

Proximal mappings

prox_{γg}(x) = argmin_{z ∈ ℝⁿ} { g(z) + (1/(2γ)) ‖z − x‖² },   γ > 0

  • resolvent of the maximal monotone operator ∂g:  prox_{γg}(x) = (I + γ∂g)⁻¹(x)
  • single-valued and (firmly) nonexpansive
  • explicitly computable for many functions (see Parikh, Boyd ’14; Combettes, Pesquet ’10); two common cases are sketched below
  • reduces to projection when g is the indicator of a convex set:  prox_{γδ_C}(x) = Π_C(x)
  • z = prox_{γg}(x) is an implicit subgradient step (0 ∈ ∂g(z) + γ⁻¹(z − x)):
      z = x − γv,   v ∈ ∂g(z)
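To make this concrete, here is a minimal numpy sketch (not part of the original slides) of two proximal mappings used repeatedly in this talk; the names prox_l1 and prox_box are my own.

```python
import numpy as np

def prox_l1(x, gamma):
    """prox of g = ||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]: plain projection."""
    return np.clip(x, lo, hi)
```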

SLIDE 5

Proximal Minimization Algorithm

minimize g(x),   g : ℝⁿ → ℝ ∪ {+∞} closed proper convex

given x⁰ ∈ ℝⁿ, repeat  x^{k+1} = prox_{γg}(x^k),   γ > 0

  • fixed-point iteration for the optimality condition:
      0 ∈ ∂g(x⋆)  ⟺  x⋆ ∈ (I + γ∂g)(x⋆)  ⟺  x⋆ = prox_{γg}(x⋆)
  • special case of the proximal point algorithm (Martinet ’70, Rockafellar ’76)
  • converges under very general conditions
  • mostly a conceptual algorithm

SLIDE 6

Moreau envelope

Moreau envelope of closed proper convex g : ℝⁿ → ℝ ∪ {+∞}:

    g^γ(x) = inf_{z ∈ ℝⁿ} { g(z) + (1/(2γ)) ‖z − x‖² },   γ > 0

  • g^γ is real-valued, convex, differentiable with (1/γ)-Lipschitz gradient:
      ∇g^γ(x) = (1/γ)(x − prox_{γg}(x))
  • minimizing the nonsmooth g is equivalent to minimizing the smooth g^γ
  • proximal minimization algorithm = gradient method for g^γ (verified numerically in the sketch below):
      x^{k+1} = x^k − γ∇g^γ(x^k)
  • can use any method of unconstrained smooth minimization for g^γ
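A quick numerical check of the identity above, assuming g = ‖·‖₁ so that the prox is soft-thresholding; a sketch, not part of the original slides.

```python
import numpy as np

def prox_l1(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_grad_l1(x, gamma):
    # ∇g^γ(x) = (1/γ)(x − prox_{γg}(x))
    return (x - prox_l1(x, gamma)) / gamma

x, gamma = np.array([3.0, -0.5, 1.2]), 1.0
# one proximal-minimization step = one gradient step on the Moreau envelope
assert np.allclose(prox_l1(x, gamma), x - gamma * moreau_grad_l1(x, gamma))
```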

SLIDE 7

Forward-Backward Splitting (FBS)

minimize F(x) = f(x) + g(x)

  • optimality condition: x⋆ ∈ ℝⁿ is optimal if and only if
      x⋆ = prox_{γg}(x⋆ − γ∇f(x⋆)),   γ > 0
  • forward-backward splitting (aka proximal gradient), sketched below:
      x^{k+1} = prox_{γg}(x^k − γ∇f(x^k)),   γ ∈ (0, 2/L_f)
  • FBS is a fixed-point iteration
  • g = 0: gradient method;  g = δ_C: gradient projection;  f = 0: proximal minimization
  • accelerated versions (Nesterov)
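A minimal FBS (ISTA-style) sketch for the sparse least-squares problem used later in the talk, assuming f(x) = ½‖Ax − b‖² and g = λ‖·‖₁; the function name fbs_lasso is mine.

```python
import numpy as np

def fbs_lasso(A, b, lam, n_iter=500):
    """FBS / proximal gradient sketch for (1/2)||Ax-b||^2 + lam*||x||_1."""
    Lf = np.linalg.norm(A, 2) ** 2          # L_f = ||A||^2, Lipschitz const. of ∇f
    gamma = 1.0 / Lf                        # any γ in (0, 2/L_f) works
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        y = x - gamma * A.T @ (A @ x - b)   # forward (gradient) step
        x = np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)  # backward (prox) step
    return x
```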

SLIDE 8

Forward-Backward Envelope

  • start from the optimality condition  x − prox_{γg}(x − γ∇f(x)) = 0
  • use prox_{γg}(y) = y − γ∇g^γ(y) with y = x − γ∇f(x):
      γ∇f(x) + γ∇g^γ(x − γ∇f(x)) = 0
  • multiply by γ⁻¹(I − γ∇²f(x)) (positive definite for γ ∈ (0, 1/L_f)); the left-hand side becomes the gradient of the Forward-Backward Envelope (FBE)

      F^FB_γ(x) = f(x) − (γ/2)‖∇f(x)‖² + g^γ(x − γ∇f(x))

  • alternative expression for the FBE (linearize f around x):

      F^FB_γ(x) = inf_{z ∈ ℝⁿ} { f(x) + ⟨∇f(x), z − x⟩ + g(z) + (1/(2γ)) ‖z − x‖² }
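The FBE is cheap to evaluate once ∇f and the prox are available. A sketch for the lasso case, assuming f(x) = ½‖Ax − b‖² and g = λ‖·‖₁ (whose Moreau envelope has a closed, Huber-like form); the names env_l1 and fbe_lasso are mine.

```python
import numpy as np

def env_l1(y, gamma, lam):
    """Moreau envelope g^γ(y) of g = lam*||.||_1 (a Huber-like function)."""
    p = np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)   # prox_{γg}(y)
    return lam * np.sum(np.abs(p)) + np.sum((p - y) ** 2) / (2 * gamma)

def fbe_lasso(x, A, b, lam, gamma):
    """F^FB_γ(x) = f(x) - (γ/2)||∇f(x)||² + g^γ(x - γ∇f(x)) for f = (1/2)||Ax-b||²."""
    r = A @ x - b
    grad = A.T @ r
    return 0.5 * r @ r - 0.5 * gamma * grad @ grad + env_l1(x - gamma * grad, gamma, lam)
```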

SLIDE 9

Properties of FBE

  • stationary points of F^FB_γ = minimizers of F
  • reformulates the original nonsmooth problem into a smooth one:
      minimize_{x ∈ ℝⁿ} F^FB_γ(x)   is equivalent to   minimize_{x ∈ ℝⁿ} F(x)
  • F^FB_γ is real-valued and continuously differentiable, with
      ∇F^FB_γ(x) = γ⁻¹(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))
  • FBS is a variable metric gradient method for the FBE:
      x^{k+1} = x^k − γ D_k⁻¹ ∇F^FB_γ(x^k),   D_k = I − γ∇²f(x^k)

SLIDE 10

Forward-Backward Newton Method (FBN)

Input: x⁰ ∈ ℝⁿ, γ ∈ (0, 1/L_f), σ ∈ (0, 1/2)
for k = 0, 1, 2, . . . do
    Newton direction: choose H_k ∈ ∂̂²F^FB_γ(x^k) and compute d^k by solving (approximately)
        H_k d = −∇F^FB_γ(x^k)
    Line search: compute a stepsize τ_k by backtracking until
        F^FB_γ(x^k + τ_k d^k) ≤ F^FB_γ(x^k) + σ τ_k ⟨∇F^FB_γ(x^k), d^k⟩
    Update: x^{k+1} = x^k + τ_k d^k
end
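Putting slides 8-12 together, here is a self-contained (and deliberately naive, dense-matrix) sketch of FBN for the lasso; it assumes γ ∈ (0, 1/L_f), uses the generalized Hessian of slide 11 with the diagonal P of slide 12, and all names are mine.

```python
import numpy as np

def fbn_lasso(A, b, lam, gamma, sigma=0.1, n_iter=30):
    """FBN sketch for (1/2)||Ax-b||^2 + lam*||x||_1; needs γ in (0, 1/L_f)."""
    n = A.shape[1]
    Q = A.T @ A                                         # ∇²f(x)
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def fbe(x):                                         # F^FB_γ(x), cf. slide 8
        g = A.T @ (A @ x - b)
        p = soft(x - gamma * g, gamma * lam)
        return (0.5 * np.sum((A @ x - b) ** 2) - 0.5 * gamma * (g @ g)
                + lam * np.sum(np.abs(p))
                + np.sum((p - (x - gamma * g)) ** 2) / (2 * gamma))

    x = np.zeros(n)
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)
        y = x - gamma * g                               # forward step
        p = soft(y, gamma * lam)                        # backward step
        G = np.eye(n) - gamma * Q
        grad_fbe = G @ (x - p) / gamma                  # ∇F^FB_γ(x), slide 9
        P = np.diag((np.abs(y) > gamma * lam).astype(float))  # P ∈ ∂_C prox, slide 12
        H = G @ (np.eye(n) - P @ G) / gamma             # H ∈ ∂̂²F^FB_γ(x), slide 11
        d = np.linalg.solve(H + 1e-10 * np.eye(n), -grad_fbe)  # Newton direction
        tau, f0 = 1.0, fbe(x)                           # backtracking on the FBE
        while fbe(x + tau * d) > f0 + sigma * tau * (grad_fbe @ d) and tau > 1e-12:
            tau /= 2
        x = x + tau * d
    return x
```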

SLIDE 11

Linear Newton approximation

  • the FBE is C¹ but not C²; solve H d = −∇F^FB_γ(x), where
      ∇F^FB_γ(x) = γ⁻¹(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))
    and H is drawn from an approximate generalized Hessian ∂̂²F^FB_γ(x):
      H = γ⁻¹(I − γ∇²f(x))(I − P(I − γ∇²f(x))) ∈ ∂̂²F^FB_γ(x)
    with P ∈ ∂_C(prox_{γg})(x − γ∇f(x)), Clarke's generalized Jacobian
  • preserves all favorable properties of the Hessian for C² functions
  • "Gauss-Newton" generalized Hessian: 3rd-order terms are omitted

SLIDE 12

Generalized Jacobians of proximal mappings

∂_C prox_{γg}(x) is the following set of matrices (Clarke, 1983):
conv{ limits of (ordinary) Jacobians along every sequence converging to x whose points are points of differentiability of prox_{γg} }

  ◮ prox_{γg}(x) simple to compute ⟹ P ∈ ∂_C(prox_{γg})(x) for free
  ◮ g (block) separable ⟹ P ∈ ∂_C(prox_{γg})(x) (block) diagonal

example: the ℓ1 norm, g(x) = ‖x‖₁ (more examples in Patrinos, Stella, Bemporad (2014)):

    prox_{γg}(x)_i = x_i + γ  if x_i ≤ −γ;   0  if −γ ≤ x_i ≤ γ;   x_i − γ  if x_i ≥ γ

P ∈ ∂_C(prox_{γg})(x) are the diagonal matrices with

    P_ii = 1  if |x_i| > γ;   P_ii ∈ [0, 1]  if |x_i| = γ;   P_ii = 0  if |x_i| < γ

SLIDE 13

Convergence of FBN

  • every limit point of {x^k} belongs to argmin_{x ∈ ℝⁿ} F(x)
  • all H ∈ ∂̂²F^FB_γ(x⋆) nonsingular ⟹ Q-quadratic asymptotic rate
  • extension: FBN II applies an FB step after each Newton step; same asymptotic rate, plus global complexity estimates:
      ◮ non-strongly convex f: sublinear rate for F(x^k) − F(x⋆)
      ◮ strongly convex f: linear rate for F(x^k) − F(x⋆) and ‖x^k − x⋆‖²
SLIDE 14

FBN–CG

for large problems: run conjugate gradient (CG) on a regularized Newton system until the residual satisfies

    ‖(H_k + δ_k I) d^k + ∇F^FB_γ(x^k)‖ ≤ η_k ‖∇F^FB_γ(x^k)‖

with η_k = O(‖∇F^FB_γ(x^k)‖) and δ_k = O(‖∇F^FB_γ(x^k)‖)

properties
  • no need to form ∇²f(x) and H_k – only matrix-vector products (see the matrix-free sketch below)
  • same convergence properties
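A matrix-free sketch of the inexact Newton-direction solve; it assumes a recent SciPy (the rtol keyword of cg) and a user-supplied hess_matvec closure implementing v ↦ H_k v. The tolerance couplings mirror the slide, but the constants are arbitrary choices of mine.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_dir_cg(hess_matvec, grad_fbe, n):
    """Inexact Newton direction: solve (H_k + δ_k I) d = -∇F^FB_γ(x^k) by CG."""
    gnorm = np.linalg.norm(grad_fbe)
    delta = gnorm                    # δ_k = O(||∇F^FB_γ(x^k)||)
    eta = min(0.5, gnorm)            # η_k = O(||∇F^FB_γ(x^k)||), capped for safety
    H_reg = LinearOperator((n, n), matvec=lambda v: hess_matvec(v) + delta * v)
    d, info = cg(H_reg, -grad_fbe, rtol=eta)   # stop when residual ≤ η_k ||rhs||
    return d
```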

SLIDE 15

Box-constrained convex programs

minimize f(x) subject to ℓ ≤ x ≤ u

the Newton direction solves

    minimize  (1/2)⟨d, ∇²f(x^k) d⟩ + ⟨∇f(x^k), d⟩
    subject to  d_i = ℓ_i − x^k_i for i ∈ β₁,   d_i = u_i − x^k_i for i ∈ β₂

where
    β₁ = {i | x^k_i − γ∇_i f(x^k) ≤ ℓ_i}   (estimate of the indices with x⋆_i = ℓ_i)
    β₂ = {i | x^k_i − γ∇_i f(x^k) ≥ u_i}   (estimate of the indices with x⋆_i = u_i)

the Newton system becomes (see the sketch below)

    Q_{δδ} d_δ = −(∇_δ f(x^k) + ∇_{δβ} f(x^k) d_β),   β = β₁ ∪ β₂,   δ = [n] \ β
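A dense-matrix sketch of this active-set Newton direction; here Q stands for ∇²f(x^k), grad for ∇f(x^k), and the function name is mine.

```python
import numpy as np

def box_newton_direction(Q, grad, x, lo, hi, gamma):
    """Active-set Newton direction for: minimize f(x) s.t. lo <= x <= hi.
    Q = ∇²f(x^k), grad = ∇f(x^k)."""
    fb = x - gamma * grad                 # forward-backward point
    beta1 = fb <= lo                      # β₁: estimated active at the lower bound
    beta2 = fb >= hi                      # β₂: estimated active at the upper bound
    free = ~(beta1 | beta2)               # δ = [n] \ (β₁ ∪ β₂)
    d = np.where(beta1, lo - x, np.where(beta2, hi - x, 0.0))
    # reduced system: Q_δδ d_δ = -(∇_δ f + Q_δβ d_β)
    rhs = -(grad[free] + Q[np.ix_(free, ~free)] @ d[~free])
    d[free] = np.linalg.solve(Q[np.ix_(free, free)], rhs)
    return d
```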

SLIDE 16

Example

minimize (1/2)⟨x, Qx⟩ + ⟨q, x⟩ subject to ℓ ≤ x ≤ u,   n = 1000

[two figures: F(x^ν) − F⋆ vs. time in seconds, for PNM, PGNM, FGM, FBN, FBN II, AFBS]

  • cond(Q) = 10⁴:  GUROBI 4.87 s, CPLEX 3.73 s
  • cond(Q) = 10⁸:  GUROBI 5.96 s, CPLEX 4.83 s

FBN: much less sensitive to bad conditioning

SLIDE 17

Sparse least-squares

minimize (1/2)‖Ax − b‖² + λ‖x‖₁

the Newton system becomes

    d_β = −x_β
    A_{·δ}ᵀ A_{·δ} d_δ = −[ A_{·δ}ᵀ (A_{·δ} x_δ − b) + λ sign(x_δ − γ∇_δ f(x)) ]

  • δ = {i | |x_i − γ∇_i f(x)| > λγ} is an estimate of the nonzero components of x⋆
  • close to the solution, δ is small

SLIDE 18

FBN methods are robust

  • 548 datasets taken from http://wwwopt.mathematik.tu-darmstadt.de/spear/
  • compared against AFBS, YALL1 (ADMM-based), SpaRSA (Barzilai-Borwein), l1-ls (interior point)
  • performance plot: the point (x, y) indicates that the algorithm is at most x times slower in a fraction y of the problems

[figure: performance ratio vs. frequency, for FBN-CG I, FBN-CG II, Accelerated FBS, YALL1, SpaRSA, l1-ls]

SLIDE 19

Sparse Logistic Regression

minimize_{x,y}  ∑_{i=1}^m log(1 + e^{−b_i(a_iᵀx + y)}) + λ‖x‖₁,   λ > 0

[figure: F^k − F⋆ vs. time in seconds, for AFBS and FBN]

SLIDE 20

Augmented Lagrangians and Moreau envelopes

minimize f(x) subject to Ax = b,   f : ℝⁿ → ℝ ∪ {+∞} convex (can be nonsmooth), A ∈ ℝ^{p×n}

Augmented Lagrangian:
    L_γ(x, y) = f(x) + ⟨y, Ax − b⟩ + (γ/2)‖Ax − b‖²

Augmented Lagrangian method (ALM, or method of multipliers), Hestenes (1969), Powell (1969):
    x^k = argmin_{x ∈ ℝⁿ} L_γ(x, y^k)
    y^{k+1} = y^k + γ(Ax^k − b)

ALM = proximal minimization for the dual (Rockafellar 1973, 1976)
    = gradient method for the Moreau envelope of the dual
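A small sketch of ALM for an equality-constrained problem where the x-update has a closed form; the instance f(x) = ½‖x − c‖² and the name alm_eq_ls are mine.

```python
import numpy as np

def alm_eq_ls(A, b, c, gamma=1.0, n_iter=100):
    """ALM sketch for: minimize (1/2)||x - c||^2  subject to  Ax = b."""
    p, n = A.shape
    y = np.zeros(p)
    M = np.eye(n) + gamma * A.T @ A               # Hessian of L_γ in x
    for _ in range(n_iter):
        # x-update: x^k = argmin_x L_γ(x, y^k)  (closed form for quadratic f)
        x = np.linalg.solve(M, c - A.T @ y + gamma * A.T @ b)
        # multiplier update = gradient step on the Moreau envelope of the dual
        y = y + gamma * (A @ x - b)
    return x, y
```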

SLIDE 21

Separable convex problems

minimize f(x) + g(z) subject to Ax + Bz = b,   A ∈ ℝ^{p×n}, B ∈ ℝ^{p×m}

  • f, g convex (can be nonsmooth); f strongly convex with convexity parameter μ_f
  • f, g are nice, e.g. separable; coupling enters only through the constraints
  • ALM is not amenable to decomposition: the x- and z-updates are coupled

Alternating Minimization Method (AMM): FBS applied to the dual (sketched below)
    x^{k+1} = argmin_{x ∈ ℝⁿ} L₀(x, z^k, y^k)
    z^{k+1} = argmin_{z ∈ ℝ^m} L_γ(x^{k+1}, z, y^k),   γ ∈ (0, 2μ_f/‖A‖²)
    y^{k+1} = y^k + γ(Ax^{k+1} + Bz^{k+1} − b)

Augmented Lagrangian:
    L_γ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − b⟩ + (γ/2)‖Ax + Bz − b‖²
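An AMM sketch on a toy splitting, assuming f(x) = ½‖x − c‖² (so μ_f = 1), g = λ‖·‖₁, B = −I and b = 0, i.e. the constraint Ax = z; both the instance and the names are mine.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def amm_sketch(A, c, lam, n_iter=200):
    """AMM for: minimize (1/2)||x - c||^2 + lam*||z||_1  subject to  Ax = z.
    Here μ_f = 1, so any γ in (0, 2/||A||^2) is admissible."""
    p, n = A.shape
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    y = np.zeros(p)
    for _ in range(n_iter):
        x = c - A.T @ y                            # argmin_x L_0(x, z, y)
        z = soft(A @ x + y / gamma, lam / gamma)   # argmin_z L_γ(x, z, y)
        y = y + gamma * (A @ x - z)                # dual update
    return x, z, y
```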

SLIDE 22

Dual FBE

the FBE of the dual problem is the augmented Lagrangian!

    F^FB_γ(y) = L_γ(x(y), z(y), y)

where x(y), z(y) are the AMM updates:
    x(y) = argmin_{x ∈ ℝⁿ} { f(x) + ⟨y, Ax⟩ }
    z(y) = argmin_{z ∈ ℝ^m} { g(z) + ⟨y, Bz⟩ + (γ/2)‖Ax(y) + Bz − b‖² }

SLIDE 23

Connection between AMM and FBE

the dual problem is equivalent to

    maximize_{y ∈ ℝ^p} F^FB_γ(y) = L_γ(x(y), z(y), y),   γ ∈ (0, μ_f/‖A‖²)

f ∈ C²(ℝⁿ) ⟹ F^FB_γ ∈ C¹(ℝ^p), with

    ∇F^FB_γ(y) = D(y)(Ax(y) + Bz(y) − b),   D(y) = I − γA(∇²f(x(y)))⁻¹Aᵀ

AMM is a variable metric gradient method on the dual FBE:

    y^{k+1} = y^k + γ D(y^k)⁻¹ ∇F^FB_γ(y^k)

AMNM: FBN applied to the dual

SLIDE 24

Strictly convex QPs

minimize (1/2)⟨x, Qx⟩ + ⟨q, x⟩ subject to ℓ ≤ Ax ≤ u,   A ∈ ℝ^{2000×1000}, cond(Q) = 10⁴

[two figures: primal infeasibility and duality gap vs. time in seconds, for PNM, PGNM, FGM, AMNM, AMNM II, FAMM]

GUROBI: 6.7 s,  CPLEX: 60.8 s,  Fast AMM: 41 s,  AMNM: 8.1 s,  AMNM II: 11.3 s

SLIDE 25

Projection onto Convex Sets

minimize_x (1/2)‖x − p‖² + ∑_{i=1}^m δ_{C_i}(z_i) subject to x = z_i, i = 1, . . . , m

  • δ_C is the indicator of C
  • x⋆ is the projection of p onto C₁ ∩ . . . ∩ C_m
  • compared methods: Dykstra, Fast AMM, AMNM

SLIDE 26

Random problem with m = 100 random hyperplanes in ℝ¹²⁰

[figure: ‖x^k − x⋆‖ vs. time in seconds, for AMM, Fast AMM, Dykstra, ADMM, AMNM]

SLIDE 27

Distributed MPC

minimize  ∑_{i=1}^M [ ∑_{t=1}^{N−1} ( ‖ξ_i(t)‖²_{Q_i} + ‖u_i(t)‖²_{R_i} ) + ‖ξ_i(N)‖²_{P_i} ]
subject to  ξ_i(0) = ξ⁰_i,   i ∈ ℕ_[1,M]
            ξ_i(t+1) = ∑_{j ∈ N_i} Φ_ij ξ_j(t) + Γ_ij u_j(t),   t ∈ ℕ_[0,N−1], i ∈ ℕ_[1,M]
            (ξ_i(t), u_i(t)) ∈ Y_i,   t ∈ ℕ_[0,N−1], i ∈ ℕ_[1,M]
            ξ_i(N) ∈ Z_i,   i ∈ ℕ_[1,M]

  • solve an optimal control problem over a network of agents
  • the local constraint sets Y_i, Z_i are simple; the coupling is in the dynamics
  • can handle complicated local and coupled constraints as well

SLIDE 28

DMPC simulations

  • M = 100 subsystems: 23600 variables, 20600 constraints
  • [figure: primal residual vs. time in seconds, for Fast AMM and AMNM]
  • averages over 20 instances; 8 states, 3 inputs, N = 10

communication rounds (in thousands):

    M     FAMM (local)   AMNM (local)   AMNM (global)
    5         29.5            2.1            2.1
    10        47.0            2.2            2.1
    20        65.7            2.4            2.4
    50       104.8            3.3            3.3
    100      139.2            3.9            3.8
    200      159.3            4.4            4.3

SLIDE 29

Douglas-Rachford Splitting

minimize F(x) = f(x) + g(x)

  • optimality condition: x⋆ is optimal if and only if x⋆ = prox_{γf}(x̃), where x̃ solves
      prox_{γg}(2 prox_{γf}(x) − x) − prox_{γf}(x) = 0
  • DRS (a runnable sketch follows this slide):
      y^k = prox_{γf}(x^k)
      z^k = prox_{γg}(2y^k − x^k)
      x^{k+1} = x^k + λ_k(z^k − y^k)
    with γ > 0 and λ_k ∈ [0, 2] such that ∑_{k ∈ ℕ} λ_k(2 − λ_k) = +∞
  • DRS is a relaxed fixed-point iteration
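A DRS sketch for the lasso, assuming f(x) = ½‖Ax − b‖² (so prox_{γf} reduces to a linear solve) and g = λ‖·‖₁; the names and parameter defaults are mine.

```python
import numpy as np

def drs_lasso(A, b, lam, gamma=1.0, relax=1.0, n_iter=300):
    """DRS sketch for (1/2)||Ax-b||^2 + lam*||x||_1 (f quadratic, g = lam*||.||_1)."""
    n = A.shape[1]
    M = np.eye(n) + gamma * A.T @ A   # prox_{γf}: solve (I + γAᵀA) y = x + γAᵀb
    x = np.zeros(n)
    for _ in range(n_iter):
        y = np.linalg.solve(M, x + gamma * A.T @ b)                # y^k = prox_{γf}(x^k)
        w = 2 * y - x
        z = np.sign(w) * np.maximum(np.abs(w) - gamma * lam, 0.0)  # z^k = prox_{γg}(2y^k - x^k)
        x = x + relax * (z - y)                                    # relaxed fixed-point step
    return z
```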

SLIDE 30

ADMM

minimize f(x) + g(z) subject to Ax + Bz = b,   f, g convex (can be nonsmooth), A ∈ ℝ^{p×n}, B ∈ ℝ^{p×m}

Alternating Direction Method of Multipliers:
    x^{k+1} = argmin_{x ∈ ℝⁿ} L_γ(x, z^k, y^k)
    z^{k+1} = argmin_{z ∈ ℝ^m} L_γ(x^{k+1}, z, y^k)
    y^{k+1} = y^k + γ(Ax^{k+1} + Bz^{k+1} − b)

ADMM = DRS applied to the dual (Eckstein, Bertsekas, 1992)

SLIDE 31

Douglas-Rachford Envelope

assume f convex and C², with ‖∇f(x) − ∇f(y)‖ ≤ L_f‖x − y‖ for all x, y ∈ ℝⁿ ⟹ the Moreau envelope f^γ is C²

  • optimality condition:  prox_{γf}(x) − prox_{γg}(2 prox_{γf}(x) − x) = 0
  • use ∇h^γ(x) = γ⁻¹(x − prox_{γh}(x)):
      ∇f^γ(x) + ∇g^γ(x − 2γ∇f^γ(x)) = 0
  • multiply by (I − 2γ∇²f^γ(x)), γ ∈ (0, 1/L_f), and "integrate": the Douglas-Rachford Envelope (DRE)

      F^DR_γ(x) = f^γ(x) − γ‖∇f^γ(x)‖² + g^γ(x − 2γ∇f^γ(x))

SLIDE 32

Properties of DRE

for γ ∈ (0, 1/L_f): minimizing the DRE = minimizing the nonsmooth F:

    inf F(x) = inf F^DR_γ(x)
    argmin F(x) = prox_{γf}(argmin F^DR_γ(x))

  • f ∈ C²(ℝⁿ) ⟹ the DRE is C¹ on ℝⁿ
  • f quadratic ⟹ the DRE is convex with (1/γ)-Lipschitz gradient

SLIDE 33

Connection between DRE and FBE

partial linearization of F around x:
    ℓ_F(z; x) = f(x) + ⟨∇f(x), z − x⟩ + g(z)

the FBE is
    F^FB_γ(x) = min_{z ∈ ℝⁿ} { ℓ_F(z; x) + (1/(2γ)) ‖z − x‖² }

the DRE is [use ∇f^γ(x) = γ⁻¹(x − prox_{γf}(x)) = ∇f(prox_{γf}(x))]
    F^DR_γ(x) = min_{z ∈ ℝⁿ} { ℓ_F(z; prox_{γf}(x)) + (1/(2γ)) ‖z − prox_{γf}(x)‖² }

so the DRE equals the FBE evaluated at prox_{γf}(x):
    F^DR_γ(x) = F^FB_γ(prox_{γf}(x))

SLIDE 34

Connection between DRS and FBS

FBS (with relaxation):
    x^{k+1} = x^k + λ_k ( prox_{γg}(x^k − γ∇f(x^k)) − x^k )

DRS:
    y^k = prox_{γf}(x^k)
    x^{k+1} = x^k + λ_k ( prox_{γg}(2y^k − x^k) − y^k )

for f ∈ C¹(ℝⁿ):  y^k = prox_{γf}(x^k) = x^k − γ∇f(prox_{γf}(x^k)), hence 2y^k − x^k = y^k − γ∇f(y^k), and DRS becomes
    y^k = prox_{γf}(x^k)
    x^{k+1} = x^k + λ_k ( prox_{γg}(y^k − γ∇f(y^k)) − y^k )

DRS iteration = FBS iteration at the "shifted" point y^k = prox_{γf}(x^k)

SLIDE 35

DRS as a variable metric method

DRS:
    x^{k+1} = x^k + λ_k ( prox_{γg}(2 prox_{γf}(x^k) − x^k) − prox_{γf}(x^k) )

gradient of the DRE:
    ∇F^DR_γ(x^k) = (I − 2γ∇²f^γ(x^k)) ( prox_{γf}(x^k) − prox_{γg}(2 prox_{γf}(x^k) − x^k) )

DRS is a variable metric method applied to the DRE:
    x^{k+1} = x^k − λ_k D_k ∇F^DR_γ(x^k),   D_k = (I − 2γ∇²f^γ(x^k))⁻¹

  • the relaxation parameter λ_k of DRS becomes the stepsize of a gradient method
  • can use backtracking for selecting λ_k

SLIDE 36

DRS – complexity estimates

assume f quadratic ⟹ the DRE is convex for γ ∈ (0, 1/L_f)

  • DRS is a preconditioned gradient method under the change of variables x = Sw, S = D^{1/2}
  • convergence rate of DRS with λ_k = λ = (1 − γL_f)/(1 + γL_f):
      F(z^k) − F⋆ ≤ (1/(2γλk)) ‖x⁰ − x̃‖²
  • optimal prox-parameter γ for the DRE:  γ⋆ = (√2 − 1)/L_f
  • linear convergence rate if F is strongly convex

SLIDE 37

Accelerated DRS

the DRE is convex with (1/γ)-Lipschitz gradient for f quadratic and γ ∈ (0, 1/L_f); apply Nesterov's FGM to the (preconditioned) DRE, starting from x⁰ = x⁻¹ ∈ ℝⁿ:

    u^k = x^k + β_k(x^k − x^{k−1})
    y^k = prox_{γf}(u^k)
    z^k = prox_{γg}(2y^k − u^k)
    x^{k+1} = u^k + λ(z^k − y^k)

with λ = (1 − γL_f)/(1 + γL_f); one can choose β_k = (k − 1)/(k + 2) (a sketch follows this slide)

  • convergence rate is O(1/k²):  F(z^k) − F⋆ ≤ (4/(γλ(k + 2)²)) ‖x⁰ − x̃‖²
  • linear convergence for f or g strongly convex
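A sketch of this accelerated DRS for the lasso (f quadratic), reusing the γ⋆ choice from slide 36; the instance and the names are mine.

```python
import numpy as np

def adrs_lasso(A, b, lam, n_iter=300):
    """Accelerated DRS sketch for (1/2)||Ax-b||^2 + lam*||x||_1 (f quadratic)."""
    n = A.shape[1]
    Lf = np.linalg.norm(A, 2) ** 2
    gamma = (np.sqrt(2) - 1) / Lf                  # γ⋆ from slide 36
    relax = (1 - gamma * Lf) / (1 + gamma * Lf)    # λ
    M = np.eye(n) + gamma * A.T @ A                # prox_{γf} is a linear solve
    x = x_prev = np.zeros(n)
    for k in range(n_iter):
        u = x + ((k - 1) / (k + 2)) * (x - x_prev)                 # extrapolation β_k
        y = np.linalg.solve(M, u + gamma * A.T @ b)                # prox_{γf}(u^k)
        w = 2 * y - u
        z = np.sign(w) * np.maximum(np.abs(w) - gamma * lam, 0.0)  # prox_{γg}(2y^k - u^k)
        x_prev, x = x, u + relax * (z - y)
    return z
```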

SLIDE 38

Sparse least-squares

minimize (1/2)‖Ax − b‖² + λ‖x‖₁

[two figures: relative suboptimality vs. iterations; left: DRS with γ = 0.2/L_f, γ = γ⋆, γ = 0.6/L_f, γ = 0.8/L_f; right: DRS vs. Fast DRS]

SLIDE 39

Take home message I

  • proximal minimization = gradient method on the Moreau envelope (70's)
  • FBS = (variable metric) gradient method on the FBE (this talk)
  • DRS = (variable metric) gradient method on the DRE (this talk)

SLIDE 40

Take home message II

  • ALM = proximal minimization for the dual; the Moreau envelope of the dual is min_{x ∈ ℝⁿ} L_γ(x, y)  (Rockafellar, 1973)
  • AMM = FBS for the dual; the FBE of the dual is L_γ(x(y), z(y), y), where
      x(y) = argmin_{x ∈ ℝⁿ} L₀(x, z, y)
      z(y) = argmin_{z ∈ ℝ^m} L_γ(x(y), z, y)

to conclude
  • interpretation of operator splitting algorithms as gradient methods
  • splitting envelopes can lead to new exciting algorithms
  • examples: FBN, AMNM and ADRS