  1. ADMM and Mirror Descent
     Geoff Gordon & Ryan Tibshirani (I am Aaditya Ramdas and I approve this lecture)
     Optimization 10-725 / 36-725, Oct 30, 2012

  2. Recap of Dual Ascent
     For problems like min_x f(x) s.t. Ax = b
     We defined the Lagrangian L(x, u) = f(x) + u⊤(Ax − b)
     We defined the Lagrange dual function g(u) = inf_x L(x, u)
     If x⁺ minimizes L(x, u), then Ax⁺ − b ∈ ∂g(u)

  3. Recap of Dual Ascent
     Dual problem: maximize g(u). Use subgradient ascent!
     This gives us the algorithm
       x^{t+1} = arg min_x L(x, u^t)
       u^{t+1} = u^t + η_t (A x^{t+1} − b)
     Under strong duality, x* = arg min_x L(x, u*), provided the minimizer is unique.
     For appropriate η_t (and some conditions), x^t, u^t converge to an optimal primal-dual pair.
     If g is not differentiable, convergence is not monotone, i.e. sometimes g(u^{t+1}) < g(u^t).
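The two-step iteration above can be sketched in a few lines of NumPy. This is a minimal illustration on a made-up equality-constrained quadratic (all problem data below is invented); the closed-form x-minimization is specific to this toy f, not part of the general method.

```python
import numpy as np

# Dual ascent sketch for min_x (1/2) x'Qx + c'x  s.t.  Ax = b, with Q
# positive definite so the x-minimization has a closed form.  The small
# problem data below is synthetic, chosen purely for illustration.
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

u = np.zeros(2)
eta = 0.5
for t in range(500):
    x = np.linalg.solve(Q, -(c + A.T @ u))   # x^{t+1} = argmin_x L(x, u^t)
    u = u + eta * (A @ x - b)                # dual (sub)gradient ascent step

print(np.linalg.norm(A @ x - b))             # feasibility residual
```

The residual Ax − b is exactly the (sub)gradient of g at u, so the u-update is plain ascent on the dual, and feasibility is recovered only in the limit.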

  4. Recap of Dual Decomposition
     Suppose f(x) = Σ_i f_i(x_i), where the x_i ∈ R^{n_i} are disjoint blocks of x.
     Write Ax = Σ_i A_i x_i, and so
       L(x, u) = Σ_i L_i(x_i, u) = Σ_i [ f_i(x_i) + u⊤ A_i x_i − (1/N) u⊤ b ]
     The x-minimization step in dual ascent decomposes:
       x_i^{t+1} = arg min_{x_i} L_i(x_i, u^t)
       u^{t+1} = u^t + η_t (A x^{t+1} − b)
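A sketch of the decomposed updates, on an invented separable quadratic with two blocks (all data below is synthetic; the closed-form block updates x_i = c_i − A_i⊤u are specific to this toy choice of f_i):

```python
import numpy as np

# Dual decomposition sketch: f(x) = (1/2)||x1 - c1||^2 + (1/2)||x2 - c2||^2
# over two disjoint blocks, coupled only through A1 x1 + A2 x2 = b.
# Each block minimization is independent, so it could run in parallel.
A1 = np.eye(2)
A2 = np.array([[1.0, 1.0],
               [0.0, 1.0]])
c1 = np.array([1.0, 2.0])
c2 = np.array([-1.0, 0.5])
b = np.array([1.0, 1.0])

u = np.zeros(2)
eta = 0.25
for t in range(300):
    x1 = c1 - A1.T @ u                        # argmin_{x1} L_1(x1, u^t)
    x2 = c2 - A2.T @ u                        # argmin_{x2} L_2(x2, u^t)
    u = u + eta * (A1 @ x1 + A2 @ x2 - b)     # single coordinating dual step

print(np.linalg.norm(A1 @ x1 + A2 @ x2 - b))
```

Only the dual step needs the full residual; the primal work splits across blocks, which is the whole point of the decomposition.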

  5. Recap of Augmented Lagrangian, Method of Multipliers
       L_ρ(x, u) = f(x) + u⊤(Ax − b) + (ρ/2) ‖Ax − b‖₂²
     This is the Lagrangian of min_x f(x) + (ρ/2) ‖Ax − b‖₂² s.t. Ax = b
     Associated dual function: g_ρ(u) = min_x L_ρ(x, u)
     Applying dual ascent:
       x^{t+1} = arg min_x L_ρ(x, u^t)
       u^{t+1} = u^t + ρ (A x^{t+1} − b)
     More robust than dual ascent (converges even when f is not strictly convex or takes infinite values). However, decomposability is lost.
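The same toy quadratic as before works as a sketch of the method of multipliers; note the dual step size is now exactly ρ. All data below is synthetic, and the closed-form x-step (a ridge-type solve) is specific to this quadratic f.

```python
import numpy as np

# Method of multipliers sketch on min_x (1/2) x'Qx + c'x  s.t.  Ax = b:
# minimize the augmented Lagrangian in x, then take a dual step of size rho.
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

rho = 1.0
u = np.zeros(2)
for t in range(100):
    # x^{t+1} = argmin_x L_rho(x, u^t): solve (Q + rho A'A) x = -c - A'u + rho A'b
    x = np.linalg.solve(Q + rho * A.T @ A, -(c + A.T @ u) + rho * A.T @ b)
    u = u + rho * (A @ x - b)                 # dual step with step size rho

print(np.linalg.norm(A @ x - b))
```

The quadratic penalty makes the x-subproblem well conditioned even when f alone is not strictly convex, which is what buys the extra robustness; the price, as the slide says, is that the ‖Ax − b‖² term couples the blocks.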

  6. Alternating Direction Method of Multipliers
     The augmented Lagrangian for f(x) = f₁(x₁) + f₂(x₂) is
       L_ρ(x₁, x₂, u) = f₁(x₁) + f₂(x₂) + u⊤(A₁x₁ + A₂x₂ − b) + (ρ/2) ‖A₁x₁ + A₂x₂ − b‖₂²
     "Alternating direction" minimization:
       x₁^{t+1} = arg min_{x₁} L_ρ(x₁, x₂^t, u^t)
       x₂^{t+1} = arg min_{x₂} L_ρ(x₁^{t+1}, x₂, u^t)
       u^{t+1} = u^t + ρ (A₁x₁^{t+1} + A₂x₂^{t+1} − b)
     The normal method of multipliers would have done
       (x₁^{t+1}, x₂^{t+1}) = arg min_{x₁,x₂} L_ρ(x₁, x₂, u^t)
       u^{t+1} = u^t + ρ (A₁x₁^{t+1} + A₂x₂^{t+1} − b)
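A minimal sketch of the three-step iteration, on an invented consensus problem (A₁ = I, A₂ = −I, b = 0, quadratic f₁, f₂ chosen so both block updates are closed form; the true optimum is the average (a + c)/2):

```python
import numpy as np

# ADMM sketch on the toy consensus problem
#   min (1/2)||x1 - a||^2 + (1/2)||x2 - c||^2   s.t.  x1 - x2 = 0.
# All data below is synthetic, for illustration only.
a = np.array([1.0, -2.0, 0.5, 3.0])
c = np.array([2.0, 0.0, -1.0, 1.0])

rho = 1.0
x2 = np.zeros(4)
u = np.zeros(4)
for t in range(100):
    x1 = (a - u + rho * x2) / (1 + rho)   # argmin_{x1} L_rho(x1, x2^t, u^t)
    x2 = (c + u + rho * x1) / (1 + rho)   # argmin_{x2} L_rho(x1^{t+1}, x2, u^t)
    u = u + rho * (x1 - x2)               # dual update with step rho

print(np.max(np.abs(x1 - (a + c) / 2)))
```

The x₂ update already uses the fresh x₁^{t+1}: that Gauss-Seidel sweep is the only difference from the joint minimization of the ordinary method of multipliers, and it is what lets each block be handled separately.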

  7. Convergence Guarantees of ADMM
     Assumption 1: f₁, f₂ are closed, proper, convex (epigraphs are closed, nonempty, convex)
     Assumption 2: the unaugmented Lagrangian L₀(x₁, x₂, u) has a saddle point:
       L₀(x₁^S, x₂^S, u) ≤ L₀(x₁^S, x₂^S, u^S) ≤ L₀(x₁, x₂, u^S)
     Then:
     Residual convergence: r^t = A₁x₁^t + A₂x₂^t − b → 0
     Objective convergence: f₁(x₁^t) + f₂(x₂^t) → f*
     Dual variable convergence: u^t → u*
     The primal variables need not converge (more assumptions needed).

  8. Example: Generalized Lasso with Repeated Ridge
       min_x (1/2) ‖Ax − b‖₂² + λ ‖Fx‖₁
     In ADMM form,
       min_{x,z} (1/2) ‖Ax − b‖₂² + λ ‖z‖₁  s.t.  Fx − z = 0
     ADMM updates (scaled dual form):
       x^{t+1} = (A⊤A + ρF⊤F)⁻¹ (A⊤b + ρF⊤(z^t − u^t))
       z^{t+1} = S_{λ/ρ}(F x^{t+1} + u^t)
       u^{t+1} = u^t + F x^{t+1} − z^{t+1}
     For the group lasso (penalty λ Σ_i ‖x_i‖₂ for disjoint x_i ∈ R^{n_i}), ADMM uses the vector soft-thresholding operator
       S_κ(a) = (1 − κ/‖a‖₂)₊ a
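A direct transcription of these three updates, with elementwise soft thresholding S_κ(a) = sign(a) max(|a| − κ, 0) for the z-step. The data A, b, λ, ρ are synthetic, and F = I is chosen so this reduces to the ordinary lasso:

```python
import numpy as np

# Generalized-lasso ADMM sketch (scaled dual form).  The x-update is the
# "repeated ridge" solve; the z-update is soft thresholding.
def soft(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

rng = np.random.default_rng(4)
m, n = 20, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
F = np.eye(n)                      # F = I recovers the ordinary lasso
lam, rho = 0.5, 1.0

z = np.zeros(n)
u = np.zeros(n)
AtA, Atb = A.T @ A, A.T @ b
for t in range(500):
    x = np.linalg.solve(AtA + rho * F.T @ F, Atb + rho * F.T @ (z - u))
    z = soft(F @ x + u, lam / rho)
    u = u + F @ x - z

print(np.linalg.norm(F @ x - z))   # primal residual
```

In practice one would factor A⊤A + ρF⊤F once and reuse it every iteration, which is why the per-iteration cost of the "repeated ridge" is cheap.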

  9. Break!

  10. Bregman Divergence ∆_g
      If g is strongly convex wrt a norm ‖·‖, define
        ∆_g(x, y) = g(x) − [g(y) + ∇g(y)⊤(x − y)]
      Read: "the distance between x and y as measured by the function g".
      E.g. g(x) = ‖x‖₂², strongly convex wrt ‖·‖₂:
        ∆_g(x, y) = ‖x − y‖₂²
      E.g. g(x) = Σ_i (x_i log x_i − x_i), strongly convex wrt ‖·‖₁ (on the probability simplex):
        ∆_g(x, y) = Σ_i [ x_i log(x_i / y_i) + y_i − x_i ]
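Both examples follow mechanically from the definition; here is a small numerical check, with a generic `bregman()` that evaluates g(x) − g(y) − ∇g(y)⊤(x − y) directly (the points x, y are arbitrary positive vectors):

```python
import numpy as np

# Numerical check of the two Bregman-divergence examples above.
def bregman(g, grad_g, x, y):
    return g(x) - g(y) - grad_g(y) @ (x - y)

rng = np.random.default_rng(5)
x = rng.random(4) + 0.1            # strictly positive, so log is defined
y = rng.random(4) + 0.1

# g(x) = ||x||_2^2  ->  Delta_g(x, y) = ||x - y||_2^2
d1 = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)
print(d1, np.sum((x - y) ** 2))

# g(x) = sum_i (x_i log x_i - x_i), grad g = log  ->  generalized KL divergence
d2 = bregman(lambda v: np.sum(v * np.log(v) - v), np.log, x, y)
print(d2, np.sum(x * np.log(x / y) + y - x))
```

Note ∆_g is not symmetric in general: the second example already differs under swapping x and y, which is why "distance" deserves the scare quotes.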

  11. Properties of Bregman Divergence
      For a λ-strongly convex function g, we defined
        ∆_g(x, y) = g(x) − [g(y) + ∇g(y)⊤(x − y)]
      So ∆_g(x, x) = 0 and, by strong convexity, ∆_g(x, y) ≥ (λ/2) ‖x − y‖² ≥ 0
      Derivatives:
        ∇_x ∆_g(x, y) = ∇g(x) − ∇g(y)
        ∇²_x ∆_g(x, y) = ∇²g(x) ⪰ λI
      Triangle inequality (kinda):
        ∆_g(x, y) + ∆_g(y, z) = ∆_g(x, z) + (∇g(z) − ∇g(y))⊤(x − y)
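The "kinda" triangle inequality is an exact identity (expand the three definitions and the g terms cancel). A quick numerical check with g(v) = Σ_i (v_i log v_i − v_i), whose gradient is log v, on arbitrary positive points:

```python
import numpy as np

# Check the three-point identity
#   Delta(x,y) + Delta(y,z) = Delta(x,z) + (grad g(z) - grad g(y))'(x - y)
# for g(v) = sum_i (v_i log v_i - v_i).
def delta(x, y):
    return (np.sum(x * np.log(x) - x) - np.sum(y * np.log(y) - y)
            - np.log(y) @ (x - y))

rng = np.random.default_rng(6)
x, y, z = (rng.random(5) + 0.1 for _ in range(3))

lhs = delta(x, y) + delta(y, z)
rhs = delta(x, z) + (np.log(z) - np.log(y)) @ (x - y)
print(lhs - rhs)
```

The correction term vanishes exactly when ∇g(z) = ∇g(y), which is the sense in which the usual triangle relation only "kinda" holds.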

  12. Recap of Gradient Descent
      Consider the problem min_{x∈S} f(x), with S ⊆ Rⁿ.
      Gradient descent: minimize a quadratic approximation of f at x^t (with H_t = I):
        x^{t+1} = arg min_x f(x^t) + ∂f(x^t)⊤(x − x^t) + (1/2) ‖x − x^t‖₂²
      From HW2 (via regret): for projected subgradient descent,
        f(x^t) − f(x*) ≤ L₂ D₂ / √T
      where max_{x∈S} ‖∂f(x)‖₂ ≤ L₂ and max_{x,y∈S} ‖x − y‖₂ ≤ D₂.
      How does this scale with n? It depends on L₂(f, S) and D₂(S).
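For a constrained set S, each gradient step is followed by a Euclidean projection back onto S. A minimal sketch on an invented problem where everything is explicit: S = [0, 1]ⁿ (projection = clipping) and a smooth quadratic whose constrained minimizer is clip(p, 0, 1):

```python
import numpy as np

# Projected gradient descent sketch: min_{x in S} (1/2)||x - p||^2 with
# S = [0,1]^n.  The data p is synthetic; the solution is clip(p, 0, 1).
p = np.array([1.7, -0.4, 0.2, 2.5, -3.0, 0.6])

x = np.zeros_like(p)
eta = 0.5
for t in range(100):
    x = x - eta * (x - p)            # gradient step: grad f(x) = x - p
    x = np.clip(x, 0.0, 1.0)         # Euclidean projection onto S

print(np.max(np.abs(x - np.clip(p, 0, 1))))
```

Mirror descent, next, replaces the squared Euclidean proximity term (1/2)‖x − x^t‖₂² in this update with a Bregman divergence matched to the geometry of S.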

  13. Mirror Descent
      Given a norm ‖·‖ over the domain S,
        x^{t+1} = arg min_x f(x^t) + ∂f(x^t)⊤(x − x^t) + ∆_g(x, x^t)
      where g is strongly convex wrt ‖·‖. Alternatively,
        x^{t+1} = arg min_x x⊤(∂f(x^t) − ∇g(x^t)) + g(x)
      Hence ∂f(x^t) + ∇g(x^{t+1}) − ∇g(x^t) = 0, so we sometimes see
        x^{t+1} = ∇g⁻¹(∇g(x^t) − η_t ∂f(x^t))
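As a sanity check on the mirror-map form, note that choosing g(x) = (1/2)‖x‖₂² makes ∇g (and its inverse) the identity, so the iteration collapses to plain gradient descent. A minimal sketch on a made-up quadratic, with the mirror map written out explicitly so a different g would slot in:

```python
import numpy as np

# Mirror descent in mirror-map form:
#   x^{t+1} = (grad g)^{-1}( grad g(x^t) - eta_t * grad f(x^t) ).
# With g(x) = (1/2)||x||_2^2 both maps are the identity and we recover
# gradient descent on the toy objective f(x) = (1/2)||x - p||^2.
p = np.array([0.5, -1.0, 2.0, 0.3])          # synthetic data
grad_f = lambda x: x - p

mirror = lambda v: v                          # grad g for g = (1/2)||.||^2
mirror_inv = lambda v: v                      # its inverse

x = np.zeros_like(p)
eta = 0.3
for t in range(200):
    x = mirror_inv(mirror(x) - eta * grad_f(x))

print(np.max(np.abs(x - p)))
```

Swapping in ∇g = log and ∇g⁻¹ = exp (negative entropy) turns the same template into the multiplicative update on the next slide.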

  14. Convergence Guarantees
      Let ‖∂f(x)‖* ≤ L_‖·‖, or equivalently f(x) − f(y) ≤ L_‖·‖ ‖x − y‖.
      If x_g = arg min_{x∈S} g(x), let D_{g,‖·‖} = √(2 max_y ∆_g(x_g, y) / λ); then ‖x − x_g‖ ≤ D_{g,‖·‖}.
      Choosing η_t = λ D_{g,‖·‖} / (‖∂f(x^t)‖* √T),
        f(x^T) − f(x*) ≤ L_‖·‖ D_{g,‖·‖} / √T
      Remember (HW2): η_t = D₂ / (L₂ √T), where D₂ = max_{x,y} ‖x − y‖₂.

  15. Example: Probability Simplex and ‖·‖₁
      The n-dimensional simplex: x ≥ 0, 1⊤x = 1
      Functions are Lipschitz wrt ‖·‖₁: max_x ‖∂f(x)‖_∞ ≤ L₁
      If g(x) = Σ_i (x_i log x_i − x_i), we get exponentiated gradient (renormalized onto the simplex):
        x^{t+1} ∝ x^t ∘ exp(−η_t ∇f(x^t))
      D_{g,‖·‖₁} ≤ √(2 log n), yielding a rate √(log n / T).
      In contrast, g(x) = ‖x‖₂² (gradient descent) gives √(n/T), since D₂ = 1 and L₂ ≤ √n L₁.
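The multiplicative update is a one-liner; the Bregman projection onto the simplex is just renormalization. A sketch on an invented linear objective f(x) = c⊤x, whose minimum over the simplex puts all mass on the smallest c_i:

```python
import numpy as np

# Exponentiated gradient sketch on the probability simplex:
# multiplicative update x * exp(-eta * grad f), then renormalize.
c = np.array([0.9, 0.2, 0.7, 0.5])    # synthetic costs; grad f(x) = c
n = len(c)

x = np.ones(n) / n                    # start at the uniform distribution
eta = 0.5
for t in range(200):
    x = x * np.exp(-eta * c)          # elementwise multiplicative step
    x = x / x.sum()                   # Bregman projection onto the simplex

print(np.argmax(x), x.max())
```

Because the update only multiplies coordinates by positive factors, the iterates stay strictly inside the simplex automatically, which is exactly the geometry the entropy mirror map is chosen for.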

  16. References
      - Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Boyd, Parikh, Chu, Peleato, and Eckstein, 2010.
      - Lectures on Modern Convex Optimization. Ben-Tal and Nemirovski, 2012.
