SLIDE 1
ADMM and Mirror Descent
Geoff Gordon & Ryan Tibshirani
(I am Aaditya Ramdas and I approve this lecture)
Optimization 10-725 / 36-725
Oct 30, 2012

SLIDE 2
Recap of Dual Ascent
For problems like
min_x f(x)  s.t.  Ax = b
we defined the Lagrangian
L(x, u) = f(x) + u⊤(Ax − b)
and its dual function g(u) = min_x L(x, u).

SLIDE 3
Recap of Dual Ascent
Dual problem: maximize g(u). Use subgradient ascent! This gives the algorithm
x^{t+1} = argmin_x L(x, u^t)
u^{t+1} = u^t + η_t (Ax^{t+1} − b)
Under strong duality, x* = argmin_x L(x, u*), provided that minimizer is unique.
For appropriate η_t (and some conditions), x^t, u^t converge to an optimal primal and dual point.
If g is not differentiable, convergence is not monotone, i.e. sometimes g(u^{t+1}) < g(u^t).
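As a concrete illustration of these two updates, here is a minimal numerical sketch in Python (not from the lecture), assuming the toy objective f(x) = (1/2)‖x‖₂², for which the x-minimization has a closed form:

import numpy as np

# Toy problem: min (1/2)||x||^2  s.t.  Ax = b (illustrative data).
# For this f, x^{t+1} = argmin_x L(x, u^t) is available in closed form:
# grad_x L = x + A^T u = 0  =>  x = -A^T u.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])

u = np.zeros(2)
eta = 0.05                       # fixed step size, small enough to converge here
for t in range(5000):
    x = -A.T @ u                 # x^{t+1} = argmin_x L(x, u^t)
    u = u + eta * (A @ x - b)    # u^{t+1} = u^t + eta_t (A x^{t+1} - b)

print(np.round(A @ x - b, 4))    # constraint residual approaches 0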
SLIDE 4
Recap of Dual Decomposition Ascent
Suppose f(x) = Σ_i f_i(x_i), where the blocks x_i ∈ R^{n_i} are disjoint.
Write Ax = Σ_i A_i x_i, and so
L(x, u) = Σ_i L_i(x_i, u) = Σ_i [ f_i(x_i) + u⊤ A_i x_i − (1/N) u⊤ b ]
The x-minimization step in dual ascent decomposes:
x_i^{t+1} = argmin_{x_i} L_i(x_i, u^t)
u^{t+1} = u^t + η_t (Ax^{t+1} − b)
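Continuing the same toy setup, a sketch of the decomposed step: each x_i-update depends only on its own A_i (so the blocks could run in parallel), coordinated by a single dual update. The quadratic f_i and the data are my illustrative choices:

import numpy as np

# f(x) = sum_i (1/2) x_i^2, with A split columnwise into blocks A_i.
A_blocks = [np.array([[1.0], [3.0]]), np.array([[2.0], [4.0]])]   # A = [A_1 A_2]
b = np.array([1.0, 1.0])

u = np.zeros(2)
eta = 0.05
for t in range(5000):
    # each x_i^{t+1} = argmin_{x_i} (1/2)||x_i||^2 + u^T A_i x_i  =>  x_i = -A_i^T u
    x_blocks = [-Ai.T @ u for Ai in A_blocks]
    Ax = sum(Ai @ xi for Ai, xi in zip(A_blocks, x_blocks))
    u = u + eta * (Ax - b)        # single dual update coordinates the blocks

print(np.round(Ax - b, 4))        # residual approaches 0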
SLIDE 5
Recap of Augmented Lagrangian, Method of Multipliers
Lρ(x, u) = f(x) + u⊤(Ax − b) + (ρ/2)‖Ax − b‖₂²
This is the Lagrangian of min_x f(x) + (ρ/2)‖Ax − b‖₂² s.t. Ax = b
Associated dual function: gρ(u) = min_x Lρ(x, u)
Applying dual ascent:
x^{t+1} = argmin_x Lρ(x, u^t)
u^{t+1} = u^t + ρ(Ax^{t+1} − b)
More robust than dual ascent (converges even if f is not strictly convex or when f can be infinite). However, we lose decomposability.
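The same toy equality-constrained quadratic, now with the augmented term; note the dual step size is exactly ρ. A minimal sketch (data illustrative):

import numpy as np

# Method of multipliers on min (1/2)||x||^2 s.t. Ax = b.
# x-update: grad_x L_rho = x + A^T u + rho A^T (Ax - b) = 0
#           =>  (I + rho A^T A) x = A^T (rho b - u).
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])
rho = 1.0

u = np.zeros(2)
for t in range(200):
    x = np.linalg.solve(np.eye(2) + rho * A.T @ A, A.T @ (rho * b - u))
    u = u + rho * (A @ x - b)     # dual step size is exactly rho

print(np.round(A @ x - b, 6))     # residual approaches 0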
SLIDE 6
Alternating Direction Method of Multipliers
The augmented Lagrangian for f(x) = f1(x1) + f2(x2) is
Lρ(x1, x2, u) = f1(x1) + f2(x2) + u⊤(A1x1 + A2x2 − b) + (ρ/2)‖A1x1 + A2x2 − b‖₂²
"Alternating direction" minimization:
x1^{t+1} = argmin_{x1} Lρ(x1, x2^t, u^t)
x2^{t+1} = argmin_{x2} Lρ(x1^{t+1}, x2, u^t)
u^{t+1} = u^t + ρ(A1 x1^{t+1} + A2 x2^{t+1} − b)
The normal method of multipliers would have done
(x1^{t+1}, x2^{t+1}) = argmin_{x1,x2} Lρ(x1, x2, u^t)
u^{t+1} = u^t + ρ(A1 x1^{t+1} + A2 x2^{t+1} − b)
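A sketch of the three ADMM updates on a small instance with f1(x1) = (1/2)‖x1‖² and f2(x2) = (1/2)‖x2‖² (my illustrative choices), where both x-updates reduce to small linear solves:

import numpy as np

# ADMM for min (1/2)||x1||^2 + (1/2)||x2||^2  s.t.  A1 x1 + A2 x2 = b.
A1 = np.array([[1.0], [3.0]])
A2 = np.array([[2.0], [4.0]])
b = np.array([1.0, 1.0])
rho = 1.0

x1, x2, u = np.zeros(1), np.zeros(1), np.zeros(2)
for t in range(500):
    # x1-update: (I + rho A1^T A1) x1 = A1^T (rho (b - A2 x2^t) - u)
    x1 = np.linalg.solve(np.eye(1) + rho * A1.T @ A1,
                         A1.T @ (rho * (b - A2 @ x2) - u))
    # x2-update: same form, using the freshly updated x1
    x2 = np.linalg.solve(np.eye(1) + rho * A2.T @ A2,
                         A2.T @ (rho * (b - A1 @ x1) - u))
    u = u + rho * (A1 @ x1 + A2 @ x2 - b)

print(np.round(A1 @ x1 + A2 @ x2 - b, 4))   # residual r^t -> 0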
SLIDE 7
Convergence Guarantees of ADMM
Assumption 1: f1, f2 are closed, proper, convex (epigraphs are closed, nonempty, convex)
Assumption 2: the unaugmented Lagrangian L0(x1, x2, u) has a saddle point (x1^S, x2^S, u^S):
L0(x1^S, x2^S, u) ≤ L0(x1^S, x2^S, u^S) ≤ L0(x1, x2, u^S) for all x1, x2, u
Residual convergence: r^t = A1 x1^t + A2 x2^t − b → 0
Objective convergence: f1(x1^t) + f2(x2^t) → f*
Dual variable convergence: u^t → u*
The primal variables need not converge (more assumptions are needed for that).
SLIDE 8
Example: Generalized Lasso with Repeated Ridge
min_x (1/2)‖Ax − b‖₂² + λ‖Fx‖₁
In ADMM form:
min_{x,z} (1/2)‖Ax − b‖₂² + λ‖z‖₁  s.t.  Fx − z = 0
ADMM updates:
x^{t+1} = (A⊤A + ρF⊤F)^{−1} (A⊤b + ρF⊤(z^t − u^t))
z^{t+1} = S_{λ/ρ}(Fx^{t+1} + u^t)
u^{t+1} = u^t + Fx^{t+1} − z^{t+1}
For the group lasso (penalty λ Σ_i ‖x_i‖₂ over disjoint blocks x_i ∈ R^{n_i}), ADMM uses the vector soft-thresholding operator
S_κ(a) = (1 − κ/‖a‖₂)₊ a
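These three updates translate almost line for line into code. A sketch on random data, with F = I so the problem reduces to the ordinary lasso; soft_threshold and vector_soft_threshold are my names for S_κ in its elementwise and group forms (the latter included only for the group-lasso remark above):

import numpy as np

def soft_threshold(a, kappa):
    # elementwise S_kappa(a)_i = sign(a_i) * max(|a_i| - kappa, 0)
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def vector_soft_threshold(a, kappa):
    # group-lasso version: S_kappa(a) = (1 - kappa/||a||_2)_+ a
    na = np.linalg.norm(a)
    return np.maximum(1.0 - kappa / na, 0.0) * a if na > 0 else a

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
F = np.eye(10)                 # F = I recovers the ordinary lasso
lam, rho = 1.0, 1.0

x, z, u = np.zeros(10), np.zeros(10), np.zeros(10)
M = A.T @ A + rho * F.T @ F    # fixed across iterations; could be factored once
for t in range(500):
    x = np.linalg.solve(M, A.T @ b + rho * F.T @ (z - u))
    z = soft_threshold(F @ x + u, lam / rho)
    u = u + F @ x - z          # (scaled) dual update

print(np.round(np.abs(F @ x - z).max(), 6))   # primal residual -> 0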
SLIDE 9
Break!
SLIDE 10
Bregman Divergence ∆g
If g is strongly convex wrt a norm ‖·‖, define
∆_g(x, y) = g(x) − [g(y) + ∇g(y)⊤(x − y)]
Read: "the distance between x and y as measured by the function g".
Eg: g(x) = ‖x‖₂², strongly convex wrt ‖·‖₂:
∆_g(x, y) = ‖x − y‖₂²
Eg: g(x) = Σ_i (x_i log x_i − x_i), strongly convex wrt ‖·‖₁ (on the probability simplex):
∆_g(x, y) = Σ_i [ x_i log(x_i / y_i) + y_i − x_i ]
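Both running examples are easy to check numerically. A small sketch (function names are mine):

import numpy as np

def bregman_sq(x, y):
    # Delta_g for g(x) = ||x||_2^2: works out to ||x - y||_2^2
    g = lambda v: v @ v
    grad = lambda v: 2 * v
    return g(x) - g(y) - grad(y) @ (x - y)

def bregman_negentropy(x, y):
    # Delta_g for g(x) = sum_i (x_i log x_i - x_i): generalized KL divergence
    return np.sum(x * np.log(x / y) + y - x)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman_sq(x, y), np.sum((x - y) ** 2))   # these two agree
print(bregman_negentropy(x, y))                 # KL(x || y) on the simplex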
SLIDE 11
Properties of Bregman Divergence
For a λ-strongly convex function g, we defined
∆_g(x, y) = g(x) − [g(y) + ∇g(y)⊤(x − y)]
So ∆_g(x, x) = 0, and by strong convexity, ∆_g(x, y) ≥ (λ/2)‖x − y‖² ≥ 0
Derivatives:
∇_x ∆_g(x, y) = ∇g(x) − ∇g(y)
∇²_x ∆_g(x, y) = ∇²g(x) ⪰ λI
Triangle inequality (kinda):
∆_g(x, y) + ∆_g(y, z) = ∆_g(x, z) + (∇g(z) − ∇g(y))⊤(x − y)
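The "kinda" triangle inequality is pure algebra: expand the three definitions, let the g(y) terms cancel, and regroup the linear terms. A worked derivation:

\begin{aligned}
\Delta_g(x,y) + \Delta_g(y,z)
  &= g(x) - g(y) - \nabla g(y)^\top (x-y) + g(y) - g(z) - \nabla g(z)^\top (y-z) \\
  &= \underbrace{g(x) - g(z) - \nabla g(z)^\top (x-z)}_{\Delta_g(x,z)}
     + \nabla g(z)^\top (x-y) - \nabla g(y)^\top (x-y) \\
  &= \Delta_g(x,z) + (\nabla g(z) - \nabla g(y))^\top (x-y).
\end{aligned}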
SLIDE 12
Recap of Gradient Descent
Consider the problem min_{x∈S} f(x), with S ⊆ R^n.
Gradient descent: minimize the quadratic approximation of f at x^t (with H_t = I):
x^{t+1} = argmin_x f(x^t) + ∂f(x^t)⊤(x − x^t) + (1/2)‖x − x^t‖₂²
From HW2 (via regret), for projected subgradient descent:
f(x^t) − f(x*) ≤ L₂ D₂ / √T
where max_{x∈S} ‖∂f(x)‖₂ ≤ L₂ and max_{x,y∈S} ‖x − y‖₂ ≤ D₂.
How does this scale with n? It depends on L₂(f, S) and D₂(S).
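A sketch of the projected subgradient method behind this bound, with illustrative choices: f(x) = ‖x − c‖₁ (a valid subgradient is sign(x − c)) and S the Euclidean unit ball:

import numpy as np

c = np.array([2.0, -1.0, 0.5])
f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)     # subgradient of f (0 at the kinks is fine)

def project_ball(x):
    # Euclidean projection onto the unit ball
    n = np.linalg.norm(x)
    return x / n if n > 1 else x

x = np.zeros(3)
best = np.inf
for t in range(1, 2001):
    x = project_ball(x - subgrad(x) / np.sqrt(t))   # standard 1/sqrt(t) step size
    best = min(best, f(x))                          # track the best iterate

print(best)   # approaches the constrained minimum over the ball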
SLIDE 13
Mirror Descent
Given a norm ‖·‖ over the domain S,
x^{t+1} = argmin_x f(x^t) + ∂f(x^t)⊤(x − x^t) + ∆_g(x, x^t)
where g is strongly convex wrt ‖·‖. Alternatively,
x^{t+1} = argmin_x x⊤(∂f(x^t) − ∇g(x^t)) + g(x)
Hence, ∂f(x^t) + ∇g(x^{t+1}) − ∇g(x^t) = 0
So, with a step size η_t on the subgradient, we sometimes see
x^{t+1} = (∇g)^{−1}(∇g(x^t) − η_t ∂f(x^t))
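The last display is the "mirror map" form: push x^t through ∇g, take a gradient step there, and map back with (∇g)^{−1}. A tiny sketch with two standard choices of g (the dictionary and its names are mine, not the lecture's):

import numpy as np

# Each entry is (grad_g, inverse of grad_g).
mirror_maps = {
    # g(x) = (1/2)||x||_2^2: grad_g is the identity, recovering gradient descent
    "euclidean": (lambda x: x, lambda th: th),
    # g(x) = sum_i (x_i log x_i - x_i): grad_g = log, inverse = exp
    "negentropy": (lambda x: np.log(x), lambda th: np.exp(th)),
}

def mirror_step(x, grad_f, eta, mirror):
    # x^{t+1} = (grad_g)^{-1}( grad_g(x^t) - eta * df(x^t) )
    fwd, inv = mirror_maps[mirror]
    return inv(fwd(x) - eta * grad_f)

x = np.array([0.5, 0.5])
grad_f = np.array([1.0, -1.0])
print(mirror_step(x, grad_f, 0.1, "euclidean"))    # additive update
print(mirror_step(x, grad_f, 0.1, "negentropy"))   # multiplicative update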
SLIDE 14
Convergence Guarantees
Let ‖∂f(x)‖∗ ≤ L_‖·‖, or equivalently f(x) − f(y) ≤ L_‖·‖ ‖x − y‖.
If x_g = argmin_{x∈S} g(x), let D_{g,‖·‖} = √(2 max_{y∈S} ∆_g(x_g, y) / λ); then
‖x − x_g‖ ≤ D_{g,‖·‖}
Choosing η_t = λ D_{g,‖·‖} / (‖∂f(x^t)‖∗ √T) gives
f(x^T) − f(x*) ≤ L_‖·‖ D_{g,‖·‖} / √T
Remember (HW2): η_t = D₂ / (L₂ √T) and D₂ = max_{x,y} ‖x − y‖₂
SLIDE 15
Example: Probability Simplex and ‖·‖₁
The n-dimensional simplex: x ≥ 0, 1⊤x = 1
Functions are Lipschitz wrt ‖·‖₁: max_x ‖∂f(x)‖∞ ≤ L₁
If g(x) = Σ_i (x_i log x_i − x_i), we get exponentiated gradient:
x^{t+1} ∝ x^t ∘ exp(−η_t ∇f(x^t))   (renormalized onto the simplex)
D_{g,‖·‖₁} ≤ √(2 log n), yielding a rate of √(log n / T).
g(x) = ‖x‖₂² (gradient descent) gives √(n / T) (D₂ = 1, L₂ ≤ √n L₁)
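A sketch of exponentiated gradient minimizing a linear cost c⊤x over the simplex (c and the decaying step size are my illustrative choices); the iterate piles its mass onto argmin_i c_i:

import numpy as np

n = 5
c = np.array([0.3, 0.9, 0.1, 0.7, 0.5])   # grad f(x) = c everywhere

x = np.ones(n) / n                         # start at the simplex center
for t in range(1, 201):
    eta = np.sqrt(2 * np.log(n) / t)       # step scaling motivated by the D_g bound
    x = x * np.exp(-eta * c)               # multiplicative update
    x = x / x.sum()                        # renormalize onto the simplex

print(np.round(x, 3))   # mass concentrates on argmin_i c_i (here index 2)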
SLIDE 16