ADMM and Mirror Descent (Geoff Gordon & Ryan Tibshirani)



SLIDE 1

ADMM and Mirror Descent

Geoff Gordon & Ryan Tibshirani (I am Aaditya Ramdas and I approve this lecture) Optimization 10-725 / 36-725 Oct 30, 2012

SLIDE 2

Recap of Dual Ascent

For problems like

    min_x f(x)  s.t.  Ax = b

we defined the Lagrangian

    L(x, u) = f(x) + u⊤(Ax − b)

and the Lagrange dual function

    g(u) = inf_x L(x, u)

If x⁺ minimizes L(x, u), then Ax⁺ − b ∈ ∂g(u)

SLIDE 3

Recap of Dual Ascent

Dual problem: maximize g(u). Use subgradient ascent! This gives us the algorithm

    x^{t+1} = argmin_x L(x, u^t)
    u^{t+1} = u^t + η_t (Ax^{t+1} − b)

Under strong duality, x^* = argmin_x L(x, u^*), provided it is unique. For appropriate η_t (and some conditions), x^t, u^t converge to an optimal primal and dual point.

If g is not differentiable, convergence is not monotone, i.e. sometimes g(u^{t+1}) < g(u^t).
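As a minimal sketch (not from the slides), dual ascent for the strictly convex f(x) = (1/2)‖x‖₂² subject to Ax = b has a closed-form x-step, x = −A⊤u. The matrix A, vector b, and step size η below are arbitrary illustrative choices:

```python
import numpy as np

# Dual ascent sketch for min (1/2)‖x‖² s.t. Ax = b (illustrative A, b, step size).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

u = np.zeros(2)    # dual variable
eta = 0.1          # fixed step size, small enough for this A
for _ in range(500):
    x = -A.T @ u                  # x minimizes L(x, u) = (1/2)‖x‖² + uᵀ(Ax − b)
    u = u + eta * (A @ x - b)     # dual (sub)gradient ascent step

# At convergence x is the minimum-norm solution of Ax = b.
```

Here f is strictly convex, so the x-minimizer is unique and g is differentiable; the dual iteration is plain gradient ascent.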

SLIDE 4

Recap of Dual Decomposition Ascent

Suppose f(x) = Σ_i f_i(x_i), where the blocks x_i ∈ R^{n_i} are disjoint.

Write Ax = Σ_i A_i x_i, and so

    L(x, u) = Σ_i L_i(x_i, u) = Σ_i [ f_i(x_i) + u⊤A_i x_i − (1/N) u⊤b ]

The x-minimization step in dual ascent decomposes:

    x_i^{t+1} = argmin_{x_i} L_i(x_i, u^t)
    u^{t+1} = u^t + η_t (Ax^{t+1} − b)
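As a hedged sketch (not from the slides), dual decomposition for a separable quadratic with a single coupling constraint 1⊤x = b makes each block update independent and closed-form; the values c, b, and the step size η are illustrative choices:

```python
import numpy as np

# Dual decomposition sketch for min Σᵢ (xᵢ − cᵢ)²  s.t.  1ᵀx = b
# (c, b, and the step size are arbitrary illustrative choices).
c = np.array([1.0, 2.0, 3.0])
b = 3.0

u = 0.0
eta = 0.5
for _ in range(200):
    x = c - u / 2.0               # each block solves argmin (xᵢ − cᵢ)² + u·xᵢ in parallel
    u = u + eta * (x.sum() - b)   # dual update needs only the aggregated residual

# At convergence the coupling constraint holds: x.sum() ≈ b.
```

The point of the decomposition is that the x-step splits across blocks (here, coordinates); only the scalar residual 1⊤x − b is communicated back for the dual update.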

SLIDE 5

Recap of Augmented Lagrangian, Method of Multipliers

    L_ρ(x, u) = f(x) + u⊤(Ax − b) + (ρ/2)‖Ax − b‖₂²

This is the Lagrangian of min_x f(x) + (ρ/2)‖Ax − b‖₂² s.t. Ax = b, with associated dual function g_ρ(u) = min_x L_ρ(x, u).

Applying dual ascent:

    x^{t+1} = argmin_x L_ρ(x, u^t)
    u^{t+1} = u^t + ρ (Ax^{t+1} − b)

More robust than dual ascent (converges even if f is not strictly convex or when f can be infinite). However, decomposability is lost.
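To illustrate the robustness claim, here is a hedged sketch (not from the slides) of the method of multipliers on a linear objective, which is convex but not strictly convex, so plain dual ascent's x-step would be unbounded. The augmented term makes the x-step solvable; A, b, c, and ρ are illustrative, with A chosen to have full column rank:

```python
import numpy as np

# Method of multipliers on min cᵀx s.t. Ax = b, where f is linear (not
# strictly convex). A, b, c, rho are illustrative; A has full column rank
# so the augmented x-step has a unique solution.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])   # consistent: b = A @ [1, 2]
c = np.array([1.0, 1.0])
rho = 1.0

u = np.zeros(3)
for _ in range(100):
    # x minimizes L_rho(x, u) = cᵀx + uᵀ(Ax − b) + (rho/2)‖Ax − b‖²
    x = np.linalg.solve(rho * A.T @ A, rho * A.T @ b - c - A.T @ u)
    u = u + rho * (A @ x - b)
```

Without the (ρ/2)‖Ax − b‖² term, minimizing cᵀx + u⊤(Ax − b) over x gives −∞ for almost every u, so the x-step of plain dual ascent fails here.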

SLIDE 6

Alternating Direction Method of Multipliers

The augmented Lagrangian for f(x) = f₁(x₁) + f₂(x₂) is

    L_ρ(x₁, x₂, u) = f₁(x₁) + f₂(x₂) + u⊤(A₁x₁ + A₂x₂ − b) + (ρ/2)‖A₁x₁ + A₂x₂ − b‖₂²

"Alternating direction" minimization:

    x₁^{t+1} = argmin_{x₁} L_ρ(x₁, x₂^t, u^t)
    x₂^{t+1} = argmin_{x₂} L_ρ(x₁^{t+1}, x₂, u^t)
    u^{t+1} = u^t + ρ (A₁x₁^{t+1} + A₂x₂^{t+1} − b)

The normal method of multipliers would have done

    (x₁^{t+1}, x₂^{t+1}) = argmin_{x₁,x₂} L_ρ(x₁, x₂, u^t)
    u^{t+1} = u^t + ρ (A₁x₁^{t+1} + A₂x₂^{t+1} − b)
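The three updates above can be sketched on a toy scalar consensus problem (not from the slides): min (x₁ − a)² + (x₂ − c)² s.t. x₁ − x₂ = 0, i.e. A₁ = 1, A₂ = −1, b = 0, whose solution is x₁ = x₂ = (a + c)/2. The values a, c, ρ are illustrative, and both subproblems are solved in closed form:

```python
# ADMM sketch on a toy consensus problem (illustrative a, c, rho):
#   min (x1 − a)² + (x2 − c)²  s.t.  x1 − x2 = 0
# so A1 = 1, A2 = −1, b = 0, and the solution is x1 = x2 = (a + c)/2 = 2.
a, c, rho = 0.0, 4.0, 1.0
x1 = x2 = u = 0.0
for _ in range(200):
    x1 = (2 * a - u + rho * x2) / (2 + rho)   # argmin over x1 of L_rho(x1, x2, u)
    x2 = (2 * c + u + rho * x1) / (2 + rho)   # argmin over x2, using the fresh x1
    u = u + rho * (x1 - x2)                   # dual step on the residual x1 − x2
```

Note the Gauss-Seidel flavor: the x₂-update already uses the new x₁, which is exactly what distinguishes ADMM from the joint minimization of the method of multipliers.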

SLIDE 7

Convergence Guarantees of ADMM

Assumption 1: f₁, f₂ are closed, proper, convex (epigraphs are closed, nonempty, convex).

Assumption 2: the unaugmented Lagrangian L₀(x₁, x₂, u) has a saddle point (x₁^S, x₂^S, u^S):

    L₀(x₁^S, x₂^S, u) ≤ L₀(x₁^S, x₂^S, u^S) ≤ L₀(x₁, x₂, u^S)

Then:

Residual convergence: r^t = A₁x₁^t + A₂x₂^t − b → 0
Objective convergence: f₁(x₁^t) + f₂(x₂^t) → f^*
Dual variable convergence: u^t → u^*

The primal variables need not converge (more assumptions needed).

SLIDE 8

Example: Generalized Lasso with Repeated Ridge

    min_x (1/2)‖Ax − b‖₂² + λ‖Fx‖₁

In ADMM form,

    min_{x,z} (1/2)‖Ax − b‖₂² + λ‖z‖₁  s.t.  Fx − z = 0

ADMM updates:

    x^{t+1} = (A⊤A + ρF⊤F)⁻¹(A⊤b + ρF⊤(z^t − u^t))
    z^{t+1} = S_{λ/ρ}(Fx^{t+1} + u^t)
    u^{t+1} = u^t + Fx^{t+1} − z^{t+1}

For the group lasso (penalty λ Σ_i ‖x_i‖₂ for disjoint blocks x_i ∈ R^{n_i}), ADMM uses the vector soft-thresholding operator

    S_κ(a) = (1 − κ/‖a‖₂)₊ a
SLIDE 9

Break!

SLIDE 10

Bregman Divergence ∆g

If g is strongly convex wrt a norm ‖·‖, define

    Δ_g(x, y) = g(x) − [g(y) + ∇g(y)⊤(x − y)]

Read: "the distance between x and y as measured by the function g".

Eg: g(x) = ‖x‖₂², strongly convex wrt ‖·‖₂:

    Δ_g(x, y) = ‖x − y‖₂²

Eg: g(x) = Σ_i (x_i log x_i − x_i), strongly convex wrt ‖·‖₁:

    Δ_g(x, y) = Σ_i [ x_i log(x_i/y_i) + y_i − x_i ]
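The second example can be checked numerically with a short sketch (the vectors x, y below are arbitrary positive illustrative choices): the Bregman divergence of the negative-entropy g agrees with the generalized KL formula.

```python
import numpy as np

# For g(x) = Σᵢ (xᵢ log xᵢ − xᵢ), check that the Bregman divergence
# Δ_g(x, y) = g(x) − [g(y) + ∇g(y)ᵀ(x − y)] equals Σᵢ [xᵢ log(xᵢ/yᵢ) + yᵢ − xᵢ].
g = lambda v: np.sum(v * np.log(v) - v)
grad_g = lambda v: np.log(v)            # ∇g(v)ᵢ = log vᵢ

x = np.array([0.2, 0.3, 0.5])           # illustrative points (both on the simplex)
y = np.array([0.4, 0.4, 0.2])

bregman = g(x) - (g(y) + grad_g(y) @ (x - y))
kl = np.sum(x * np.log(x / y) + y - x)
```

The agreement is an identity: expanding g(x) − g(y) − log(y)⊤(x − y) term by term gives exactly the KL expression.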
SLIDE 11

Properties of Bregman Divergence

For a λ-strongly convex function g, we defined

    Δ_g(x, y) = g(x) − [g(y) + ∇g(y)⊤(x − y)]

So Δ_g(x, x) = 0, and by strong convexity,

    Δ_g(x, y) ≥ (λ/2)‖x − y‖² ≥ 0

Derivatives:

    ∇_x Δ_g(x, y) = ∇g(x) − ∇g(y)
    ∇²_x Δ_g(x, y) = ∇²g(x) ⪰ λI

Triangle inequality (kinda):

    Δ_g(x, y) + Δ_g(y, z) = Δ_g(x, z) + (∇g(z) − ∇g(y))⊤(x − y)

SLIDE 12

Recap of Gradient Descent

Consider the problem min_{x∈S} f(x), with S ⊆ R^n.

Gradient descent: minimize the quadratic approximation of f at x^t (with H_t = I):

    x^{t+1} = argmin_x f(x^t) + ∂f(x^t)⊤(x − x^t) + (1/2)‖x − x^t‖₂²

From HW2 (via regret), for projected subgradient descent,

    f(x^t) − f(x^*) ≤ L₂D₂ / √T

where max_{x∈S} ‖∂f(x)‖₂ ≤ L₂ and max_{x,y∈S} ‖x − y‖₂ ≤ D₂.

How does this scale with n? It depends on L₂(f, S) and D₂(S).

SLIDE 13

Mirror Descent

Given a norm ‖·‖ over the domain S,

    x^{t+1} = argmin_x f(x^t) + ∂f(x^t)⊤(x − x^t) + Δ_g(x, x^t)

where g is strongly convex wrt ‖·‖. Alternatively,

    x^{t+1} = argmin_x x⊤(∂f(x^t) − ∇g(x^t)) + g(x)

Hence ∂f(x^t) + ∇g(x^{t+1}) − ∇g(x^t) = 0. So we sometimes see

    x^{t+1} = (∇g)⁻¹(∇g(x^t) − η_t ∂f(x^t))
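The mirror-map mechanics can be checked numerically with a sketch (not from the slides). For the entropy g(x) = Σ x_i log x_i − x_i we have ∇g(x) = log x and (∇g)⁻¹(v) = exp(v), so the update is x^{t+1} = x^t ∘ exp(−η ∂f(x^t)); below we verify the optimality condition with the step size η folded into the gradient term. The function f, the point x, and η are illustrative choices:

```python
import numpy as np

# Entropy mirror map: g(x) = Σ xᵢ log xᵢ − xᵢ, ∇g(x) = log x, (∇g)⁻¹(v) = exp(v).
# Check that x⁺ = x ∘ exp(−η ∇f(x)) satisfies  η∇f(x) + ∇g(x⁺) − ∇g(x) = 0.
grad_f = lambda v: 2.0 * (v - 1.0)      # gradient of f(x) = ‖x − 1‖₂², illustrative
x = np.array([0.5, 1.5, 2.0])
eta = 0.1

x_next = x * np.exp(-eta * grad_f(x))   # mirror descent step
residual = eta * grad_f(x) + np.log(x_next) - np.log(x)
```

The residual vanishes by construction; the point of the check is to see the "map to the dual space via ∇g, take a gradient step there, map back via (∇g)⁻¹" structure concretely.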

SLIDE 14

Convergence Guarantees

Let ‖∂f(x)‖_* ≤ L_{‖·‖}, or equivalently f(x) − f(y) ≤ L_{‖·‖} ‖x − y‖.

If x_g = argmin_{x∈S} g(x), let D_{g,‖·‖} = √(2 max_y Δ_g(x_g, y) / λ); then

    ‖x − x_g‖ ≤ D_{g,‖·‖}

Choosing η_t = λ D_{g,‖·‖} / (‖∂f(x^t)‖_* √T) gives

    f(x^T) − f(x^*) ≤ L_{‖·‖} D_{g,‖·‖} / √T

Remember (HW2): η_t = D₂ / (L₂ √T) and D₂ = √(max_{x,y} ‖x − y‖₂²)

SLIDE 15

Example: Probability Simplex and ‖·‖₁

The n-dimensional simplex: x ≥ 0, 1⊤x = 1.

Functions are Lipschitz wrt ‖·‖₁: max_x ‖∂f(x)‖_∞ ≤ L₁.

If g(x) = Σ_i (x_i log x_i − x_i), we get exponentiated gradient:

    x^{t+1} = x^t ∘ exp(−η_t ∇f(x^t))

D_{g,‖·‖₁} ≤ √(2 log n), yielding a rate of √(log n / T).

g(x) = ‖x‖₂² (gradient descent) gives √(n/T) (since D₂ = 1 and L₂ ≤ √n L₁).
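As a hedged sketch (not from the slides), exponentiated gradient on the simplex for a linear objective f(x) = c⊤x takes only a few lines; c and the step size η are illustrative. The multiplicative update is followed by a renormalization, which is the Bregman projection back onto the simplex for the entropy mirror map:

```python
import numpy as np

# Exponentiated gradient sketch for min cᵀx over the probability simplex
# (c and eta are illustrative choices).
c = np.array([0.5, 0.2, 0.9])
x = np.ones(3) / 3.0          # start at the uniform distribution
eta = 0.1

for _ in range(500):
    x = x * np.exp(-eta * c)  # multiplicative update: ∇f(x) = c for linear f
    x = x / x.sum()           # renormalize onto the simplex

# Mass concentrates on the coordinate with the smallest cᵢ.
```

Since the minimizer of a linear function over the simplex is a vertex, the iterates concentrate all their mass on the coordinate with the smallest cost.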

SLIDE 16

References

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers - Boyd, Parikh, Chu, Peleato, and Eckstein, 2010

Lecture Notes on Modern Convex Optimization - Ben-Tal and Nemirovski, 2012
