Generalized gradient descent

Generalized gradient descent Geoff Gordon & Ryan Tibshirani - PowerPoint PPT Presentation



  1. Generalized gradient descent
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725

  2. Remember subgradient method
We want to solve min_{x ∈ R^n} f(x), for f convex, not necessarily differentiable.
Subgradient method: choose initial x^(0) ∈ R^n, repeat:
    x^(k) = x^(k−1) − t_k · g^(k−1),   k = 1, 2, 3, . . .
where g^(k−1) is a subgradient of f at x^(k−1).
If f is Lipschitz on a bounded set containing its minimizer, then the subgradient method has convergence rate O(1/√k).
Downside: can be very slow!
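As a concrete illustration (not from the slides), here is a minimal sketch of the subgradient method in Python, applied to f(x) = ‖x‖₁ with a diminishing step size; all names are illustrative.

```python
import numpy as np

# Subgradient method sketch on f(x) = ||x||_1, whose minimizer is x = 0.
# np.sign(x) is a valid subgradient of ||x||_1 at x.
def subgradient_method(x0, steps=1000):
    x = x0.astype(float).copy()
    for k in range(1, steps + 1):
        g = np.sign(x)            # a subgradient g^(k-1) of f at x^(k-1)
        t = 1.0 / np.sqrt(k)      # diminishing step size t_k
        x = x - t * g
    return x

x = subgradient_method(np.array([3.0, -2.0]))
```

With a diminishing step the iterates only hover near the minimizer, which is the slow O(1/√k) behavior noted above.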

  3. Outline
Today:
• Generalized gradient descent
• Convergence analysis
• ISTA, matrix completion
• Special cases

  4. Decomposable functions
Suppose f(x) = g(x) + h(x), where
• g is convex, differentiable
• h is convex, not necessarily differentiable
If f were differentiable, the gradient descent update would be: x⁺ = x − t ∇f(x)
Recall the motivation: minimize the quadratic approximation to f around x, replacing ∇²f(x) by (1/t) I:
    x⁺ = argmin_z  f̃_t(z),   where f̃_t(z) = f(x) + ∇f(x)ᵀ(z − x) + (1/(2t)) ‖z − x‖²

  5. In our case f is not differentiable, but f = g + h with g differentiable.
Why don't we make a quadratic approximation to g and leave h alone? I.e., update
    x⁺ = argmin_z  g̃_t(z) + h(z)
       = argmin_z  g(x) + ∇g(x)ᵀ(z − x) + (1/(2t)) ‖z − x‖² + h(z)
       = argmin_z  (1/(2t)) ‖z − (x − t ∇g(x))‖² + h(z)
The first term keeps z close to the gradient update for g; the h(z) term also makes h small.

  6. Generalized gradient descent
Define
    prox_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖x − z‖² + h(z)
Generalized gradient descent: choose initial x^(0), repeat:
    x^(k) = prox_{t_k}( x^(k−1) − t_k ∇g(x^(k−1)) ),   k = 1, 2, 3, . . .
To make the update step look familiar, we can write it as
    x^(k) = x^(k−1) − t_k · G_{t_k}(x^(k−1))
where G_t is the generalized gradient,
    G_t(x) = ( x − prox_t(x − t ∇g(x)) ) / t
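The update above can be sketched as a generic loop where the caller supplies ∇g and the prox operator. This is an illustrative sketch, not the authors' code; with h = 0 the prox is the identity and the loop reduces to plain gradient descent.

```python
import numpy as np

# Generic generalized gradient descent loop (sketch): the caller
# supplies grad_g and the prox operator; all names are illustrative.
def generalized_gradient_descent(x0, grad_g, prox, t, steps=500):
    x = x0.astype(float).copy()
    for _ in range(steps):
        x = prox(x - t * grad_g(x), t)   # x+ = prox_t(x - t*grad g(x))
    return x

# Example: g(x) = (1/2)||x - b||^2 and h = 0, so prox is the identity
# and the iteration reduces to plain gradient descent, converging to b.
b = np.array([1.0, -2.0])
x = generalized_gradient_descent(np.zeros(2),
                                 grad_g=lambda x: x - b,
                                 prox=lambda z, t: z,
                                 t=0.5)
```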

  7. What good did this do?
You have a right to be suspicious ... it looks like we just swapped one minimization problem for another.
The point is that the prox function prox_t(·) can be computed analytically for a lot of important functions h. Note:
• prox_t doesn't depend on g at all
• g can be very complicated, as long as we can compute its gradient
Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates prox_t(·) once, and this can be cheap or expensive, depending on h.

  8. ISTA
Consider the lasso criterion
    f(x) = (1/2) ‖y − Ax‖² + λ ‖x‖₁
with g(x) = (1/2) ‖y − Ax‖² and h(x) = λ ‖x‖₁. The prox function is now
    prox_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖x − z‖² + λ ‖z‖₁ = S_{λt}(x)
where S_λ(x) is the soft-thresholding operator,
    [S_λ(x)]_i =  x_i − λ   if x_i > λ
                  0         if −λ ≤ x_i ≤ λ
                  x_i + λ   if x_i < −λ
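The soft-thresholding operator has a compact vectorized form; a sketch:

```python
import numpy as np

# The soft-thresholding operator S_lambda, written elementwise:
# shrink each coordinate toward zero by lam, zeroing out [-lam, lam].
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

out = soft_threshold(np.array([3.0, 0.5, -2.0]), 1.0)   # -> [2.0, 0.0, -1.0]
```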

  9. Recall ∇g(x) = −Aᵀ(y − Ax). Hence the generalized gradient update step is:
    x⁺ = S_{λt}( x + t Aᵀ(y − Ax) )
The resulting algorithm is called ISTA (Iterative Soft-Thresholding Algorithm). It is a very simple algorithm for computing a lasso solution.
[Plot: f(k) − f⋆ versus iteration k, comparing generalized gradient descent (ISTA) with the subgradient method; ISTA converges much faster.]
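The ISTA update above fits in a few lines. The following is a sketch under illustrative assumptions (random data, a hypothetical `x_true`, and fixed step t = 1/L with L taken as the largest eigenvalue of AᵀA):

```python
import numpy as np

# ISTA sketch for the lasso: min (1/2)||y - Ax||^2 + lam*||x||_1,
# with fixed step t = 1/L, L = largest eigenvalue of A^T A.
def ista(A, y, lam, steps=2000):
    L = np.linalg.eigvalsh(A.T @ A).max()
    t = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        z = x + t * A.T @ (y - A @ x)                          # gradient step on g
        x = np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)  # prox step S_{lam*t}
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])   # illustrative sparse signal
y = A @ x_true
x_hat = ista(A, y, lam=0.1)
```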

  10. Convergence analysis
We have f(x) = g(x) + h(x), and assume
• g is convex, differentiable, and ∇g is Lipschitz continuous with constant L > 0
• h is convex, and prox_t(x) = argmin_z { ‖x − z‖²/(2t) + h(z) } can be evaluated
Theorem: Generalized gradient descent with fixed step size t ≤ 1/L satisfies
    f(x^(k)) − f(x⋆) ≤ ‖x^(0) − x⋆‖² / (2 t k)
I.e., generalized gradient descent has convergence rate O(1/k). Same as gradient descent! But remember, this counts the number of iterations, not the number of operations.

  11. Proof
Similar to the proof for gradient descent, but with the generalized gradient G_t replacing the gradient ∇f. Main steps:
• ∇g Lipschitz with constant L ⇒
    f(y) ≤ g(x) + ∇g(x)ᵀ(y − x) + (L/2) ‖y − x‖² + h(y)   for all x, y
• Plugging in y = x⁺ = x − t G_t(x),
    f(x⁺) ≤ g(x) − t ∇g(x)ᵀ G_t(x) + (L t²/2) ‖G_t(x)‖² + h(x − t G_t(x))
• By definition of prox,
    x − t G_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖z − (x − t ∇g(x))‖² + h(z)
  ⇒ ∇g(x) − G_t(x) + v = 0 for some v ∈ ∂h(x − t G_t(x))

  12. • Using G_t(x) − ∇g(x) ∈ ∂h(x − t G_t(x)), and convexity of g,
    f(x⁺) ≤ f(z) + G_t(x)ᵀ(x − z) − (1 − Lt/2) t ‖G_t(x)‖²   for all z
• Letting t ≤ 1/L and z = x⋆,
    f(x⁺) ≤ f(x⋆) + G_t(x)ᵀ(x − x⋆) − (t/2) ‖G_t(x)‖²
          = f(x⋆) + (1/(2t)) ( ‖x − x⋆‖² − ‖x⁺ − x⋆‖² )
The proof then proceeds just as with gradient descent.

  13. Backtracking line search
Same as with gradient descent, just replace ∇f with the generalized gradient G_t. I.e.,
• Fix 0 < β < 1
• Then at each iteration, start with t = 1, and while
    f(x − t G_t(x)) > f(x) − (t/2) ‖G_t(x)‖²,
  update t = βt
Theorem: Generalized gradient descent with backtracking line search satisfies
    f(x^(k)) − f(x⋆) ≤ ‖x^(0) − x⋆‖² / (2 t_min k)
where t_min = min{1, β/L}
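One step of the backtracking rule above can be sketched as follows. This is an illustrative implementation of the slide's sufficient-decrease condition, checked here on a smooth quadratic with h = 0 (so the prox is the identity):

```python
import numpy as np

# One generalized gradient step with backtracking (sketch): shrink t by
# beta until f(x - t*G_t(x)) <= f(x) - (t/2)*||G_t(x)||^2 holds.
def prox_grad_step_backtracking(x, f, grad_g, prox, beta=0.5):
    t = 1.0
    while True:
        G = (x - prox(x - t * grad_g(x), t)) / t   # generalized gradient G_t(x)
        if f(x - t * G) <= f(x) - (t / 2) * (G @ G):
            return x - t * G, t
        t *= beta

# Example: f = g = (1/2)||x - b||^2, h = 0; the first trial t = 1 is
# accepted because grad g is Lipschitz with L = 1.
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * np.sum((x - b) ** 2)
x_new, t = prox_grad_step_backtracking(np.zeros(2), f,
                                       grad_g=lambda x: x - b,
                                       prox=lambda z, t: z)
```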

  14. Matrix completion
Given a matrix A, m × n, we only observe the entries A_ij, (i, j) ∈ Ω. We want to fill in the missing entries, so we solve:
    min_{X ∈ R^{m×n}}  (1/2) Σ_{(i,j) ∈ Ω} (A_ij − X_ij)² + λ ‖X‖_*
Here ‖X‖_* is the nuclear norm of X,
    ‖X‖_* = Σ_{i=1}^r σ_i(X)
where r = rank(X) and σ_1(X), . . . , σ_r(X) are its singular values.

  15. Define P_Ω, the projection operator onto the observed set:
    [P_Ω(X)]_ij = X_ij if (i, j) ∈ Ω,  0 if (i, j) ∉ Ω
The criterion is
    f(X) = (1/2) ‖P_Ω(A) − P_Ω(X)‖²_F + λ ‖X‖_*
with g(X) = (1/2) ‖P_Ω(A) − P_Ω(X)‖²_F and h(X) = λ ‖X‖_*.
Two things are needed for generalized gradient descent:
• Gradient: ∇g(X) = −(P_Ω(A) − P_Ω(X))
• Prox function:
    prox_t(X) = argmin_{Z ∈ R^{m×n}}  (1/(2t)) ‖X − Z‖²_F + λ ‖Z‖_*

  16. Claim: prox_t(X) = S_{λt}(X), where the matrix soft-thresholding operator S_λ(X) is defined by
    S_λ(X) = U Σ_λ Vᵀ
where X = U Σ Vᵀ is a singular value decomposition, and Σ_λ is diagonal with
    (Σ_λ)_ii = max{ Σ_ii − λ, 0 }
Why? Note prox_t(X) = Z, where Z satisfies
    0 ∈ Z − X + λt · ∂‖Z‖_*
Fact: if Z = U Σ Vᵀ, then
    ∂‖Z‖_* = { U Vᵀ + W : W ∈ R^{m×n}, ‖W‖ ≤ 1, UᵀW = 0, W V = 0 }
Now plug in Z = S_{λt}(X) and check that we can get 0.

  17. Hence the generalized gradient update step is:
    X⁺ = S_{λt}( X + t (P_Ω(A) − P_Ω(X)) )
Note that ∇g(X) is Lipschitz continuous with L = 1, so we can choose the fixed step size t = 1. The update step is now:
    X⁺ = S_λ( P_Ω(A) + P_Ω⊥(X) )
where P_Ω⊥ projects onto the unobserved set, P_Ω(X) + P_Ω⊥(X) = X.
This is the soft-impute algorithm¹, a simple and effective method for matrix completion.
¹ Mazumder et al. (2011), "Spectral regularization algorithms for learning large incomplete matrices"
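The soft-impute update above can be sketched directly with an SVD; this is an illustrative toy example (a small rank-1 matrix with one hidden entry, and a tiny λ chosen for illustration), not the authors' implementation:

```python
import numpy as np

# Soft-impute sketch: X+ = S_lam(P_Omega(A) + P_Omega_perp(X)), where
# S_lam soft-thresholds the singular values.
def svd_soft_threshold(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(A, mask, lam, steps=2000):
    # mask is boolean: True on observed entries (the set Omega)
    X = np.zeros_like(A)
    for _ in range(steps):
        filled = np.where(mask, A, X)    # P_Omega(A) + P_Omega_perp(X)
        X = svd_soft_threshold(filled, lam)
    return X

# Rank-1 example with one hidden entry; with small lam the observed
# entries are nearly fit and the hidden entry gets a low-nuclear-norm value.
A = np.outer([1.0, 2.0], [1.0, 1.0])
mask = np.array([[True, True], [True, False]])
X_hat = soft_impute(A, mask, lam=0.01)
```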

  18. Why "generalized"?
Special cases of generalized gradient descent, on f = g + h:
• h = 0 → gradient descent
• h = I_C → projected gradient descent
• g = 0 → proximal minimization algorithm
Therefore these algorithms all have O(1/k) convergence rate.

  19. Projected gradient descent
Given a closed, convex set C ⊆ R^n,
    min_{x ∈ C} g(x)  ⇔  min_x g(x) + I_C(x)
where I_C(x) = 0 if x ∈ C, ∞ if x ∉ C, is the indicator function of C. Hence
    prox_t(x) = argmin_z  (1/(2t)) ‖x − z‖² + I_C(z)
              = argmin_{z ∈ C}  ‖x − z‖²
I.e., prox_t(x) = P_C(x), the projection operator onto C.

  20. Therefore the generalized gradient update step is:
    x⁺ = P_C( x − t ∇g(x) )
i.e., perform the usual gradient update and then project back onto C. This is called projected gradient descent.
[Figure: a gradient step leaving the convex set C, followed by projection back onto C.]
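The update above can be sketched on a toy problem: minimizing (1/2)‖x − b‖² over the unit Euclidean ball, where b lies outside the ball so the solution is b/‖b‖. This is an illustrative sketch, not from the slides:

```python
import numpy as np

# Projected gradient descent sketch: gradient step, then project onto C.
# Here C is the unit Euclidean ball.
def project_unit_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

b = np.array([3.0, 4.0])       # target outside C; solution is b/||b||
x = np.zeros(2)
for _ in range(200):
    x = project_unit_ball(x - 0.5 * (x - b))   # grad of (1/2)||x - b||^2
```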

  21. What sets C are easy to project onto? Lots, e.g.,
• Affine images C = { Ax + b : x ∈ R^n }
• Solution set of a linear system C = { x ∈ R^n : Ax = b }
• Nonnegative orthant C = { x ∈ R^n : x ≥ 0 } = R^n_+
• Norm balls C = { x ∈ R^n : ‖x‖_p ≤ 1 }, for p = 1, 2, ∞
• Some simple polyhedra and simple cones
Warning: it is easy to write down a seemingly simple set C for which P_C turns out to be very hard! E.g., it is generally hard to project onto the solution set of arbitrary linear inequalities, i.e., an arbitrary polyhedron C = { x ∈ R^n : Ax ≤ b }.
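Several of the easy projections listed above have closed forms of a line or two; a sketch (the affine-set projection assumes A has full row rank):

```python
import numpy as np

# Closed-form projections onto some of the "easy" sets above.
def proj_nonneg(x):                      # C = R^n_+
    return np.maximum(x, 0.0)

def proj_linf_ball(x):                   # C = { ||x||_inf <= 1 }
    return np.clip(x, -1.0, 1.0)

def proj_l2_ball(x):                     # C = { ||x||_2 <= 1 }
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def proj_affine(x, A, b):                # C = { x : Ax = b }, A full row rank
    # x - A^T (A A^T)^{-1} (A x - b)
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)
```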

  22. Proximal minimization algorithm
Consider, for h convex (not necessarily differentiable),
    min_x h(x)
The generalized gradient update step is just a prox update:
    x⁺ = argmin_z  (1/(2t)) ‖x − z‖² + h(z)
This is called the proximal minimization algorithm. It is faster than the subgradient method, but not implementable unless we know the prox in closed form.
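For h(x) = ‖x‖₁ the prox is known in closed form (soft-thresholding), so the proximal minimization algorithm is implementable; a sketch, not from the slides:

```python
import numpy as np

# Proximal minimization sketch for h(x) = ||x||_1: each step is a
# soft-threshold by t, so each coordinate shrinks toward zero by t
# per iteration and reaches the minimizer x = 0 in finitely many steps.
t = 0.5
x = np.array([2.2, -1.1])
for _ in range(10):
    x = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)   # prox_t of ||.||_1
```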
