SLIDE 1
Generalized gradient descent
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725

SLIDE 2

Remember subgradient method

We want to solve

    min_{x ∈ R^n} f(x),

for f convex, not necessarily differentiable

Subgradient method: choose initial x^(0), repeat:

    x^(k) = x^(k−1) − t_k · g^(k−1),  k = 1, 2, 3, ...

where g^(k−1) ∈ ∂f(x^(k−1))
SLIDE 3
Outline
Today:
- Generalized gradient descent
- Convergence analysis
- ISTA, matrix completion
- Special cases
SLIDE 4
Decomposable functions
Suppose f(x) = g(x) + h(x)
- g is convex, differentiable
- h is convex, not necessarily differentiable
If f were differentiable, the gradient descent update would be

    x^+ = x − t ∇f(x)

Recall the motivation: minimize the quadratic approximation to f around x, replacing ∇²f(x) by (1/t) I:

    x^+ = argmin_z  f(x) + ∇f(x)^T (z − x) + (1/(2t)) ‖z − x‖²

(the objective on the right is denoted f_t(z))
SLIDE 5
In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update

    x^+ = argmin_z  g_t(z) + h(z)
        = argmin_z  g(x) + ∇g(x)^T (z − x) + (1/(2t)) ‖z − x‖² + h(z)
        = argmin_z  (1/(2t)) ‖z − (x − t ∇g(x))‖² + h(z)

(the last step completes the square; the discarded terms do not depend on z, so the argmin is unchanged)

The first term (1/(2t)) ‖z − (x − t ∇g(x))‖² asks that z be close to the gradient update for g; the term h(z) asks that z also make h small.
SLIDE 6
Generalized gradient descent
Define

    prox_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖x − z‖² + h(z)

Generalized gradient descent: choose initial x^(0), repeat:

    x^(k) = prox_{t_k}( x^(k−1) − t_k ∇g(x^(k−1)) ),  k = 1, 2, 3, ...

To make the update step look familiar, we can write it as

    x^(k) = x^(k−1) − t_k · G_{t_k}(x^(k−1))

where G_t is the generalized gradient,

    G_t(x) = ( x − prox_t(x − t ∇g(x)) ) / t
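For concreteness, here is a minimal numpy sketch of this iteration; the callable names grad_g and prox are assumptions (not from the slides), standing for the gradient of g and the prox operator of h.

    import numpy as np

    def generalized_gradient_descent(x0, grad_g, prox, t, n_iter=1000):
        """Generalized (proximal) gradient descent with fixed step size t.

        grad_g(x): gradient of the smooth part g at x
        prox(x, t): evaluates prox_t(x) = argmin_z ||x - z||^2 / (2t) + h(z)
        """
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = prox(x - t * grad_g(x), t)   # gradient step on g, then prox step on h
        return x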
SLIDE 7
What good did this do?
You have a right to be suspicious ... it looks like we just swapped one minimization problem for another.

The point is that the prox function prox_t(·) can be computed analytically for a lot of important functions h. Note:
- prox_t doesn't depend on g at all
- g can be very complicated, as long as we can compute its gradient

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates prox_t(·) once, and this can be cheap or expensive, depending on h.
SLIDE 8
ISTA
Consider the lasso criterion

    f(x) = (1/2) ‖y − Ax‖² + λ ‖x‖_1

with g(x) = (1/2) ‖y − Ax‖² (the smooth part) and h(x) = λ ‖x‖_1 (the nonsmooth part).

The prox function is now

    prox_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖x − z‖² + λ ‖z‖_1  =  S_{λt}(x)

where S_λ(x) is the soft-thresholding operator,

    [S_λ(x)]_i =  x_i − λ   if x_i > λ
                  0         if −λ ≤ x_i ≤ λ
                  x_i + λ   if x_i < −λ
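As a sketch, the soft-thresholding operator is one line of numpy (the function name soft_threshold is mine, not from the slides):

    import numpy as np

    def soft_threshold(x, lam):
        """Elementwise soft-thresholding operator S_lam(x)."""
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)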
SLIDE 9
Recall ∇g(x) = −A^T (y − Ax). Hence the generalized gradient update step is:

    x^+ = S_{λt}( x + t A^T (y − Ax) )

The resulting algorithm is called ISTA (Iterative Soft-Thresholding Algorithm): a very simple algorithm for computing a lasso solution.
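A minimal ISTA sketch under these definitions (function and variable names are my own; a step size t ≤ 1/L, with L the largest eigenvalue of A^T A, guarantees convergence per the analysis below):

    import numpy as np

    def ista(A, y, lam, t, n_iter=1000):
        """ISTA for the lasso: minimize (1/2)||y - Ax||^2 + lam * ||x||_1."""
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            z = x + t * A.T @ (y - A @ x)                           # gradient step on g
            x = np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)   # soft-threshold at lam * t
        return x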
Generalized gradient (ISTA) vs subgradient descent:

[Figure: f(k) − fstar versus iteration k, comparing the subgradient method and generalized gradient descent]
SLIDE 10
Convergence analysis
We have f(x) = g(x) + h(x), and assume
- g is convex, differentiable, and ∇g is Lipschitz continuous with constant L > 0
- h is convex, and prox_t(x) = argmin_z { ‖x − z‖² / (2t) + h(z) } can be evaluated

Theorem: Generalized gradient descent with fixed step size t ≤ 1/L satisfies

    f(x^(k)) − f(x⋆) ≤ ‖x^(0) − x⋆‖² / (2tk)

I.e., generalized gradient descent has convergence rate O(1/k). Same as gradient descent! But remember, this counts the number of iterations, not the number of operations.
SLIDE 11
Proof
Similar to the proof for gradient descent, but with the generalized gradient G_t replacing the gradient ∇f. Main steps:
- ∇g Lipschitz with constant L ⇒

      f(y) ≤ g(x) + ∇g(x)^T (y − x) + (L/2) ‖y − x‖² + h(y)   for all x, y

- Plugging in y = x^+ = x − t G_t(x),

      f(x^+) ≤ g(x) − t ∇g(x)^T G_t(x) + (L t²/2) ‖G_t(x)‖² + h(x − t G_t(x))

- By definition of prox,

      x − t G_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖z − (x − t ∇g(x))‖² + h(z)

  ⇒ ∇g(x) − G_t(x) + v = 0 for some v ∈ ∂h(x − t G_t(x))
SLIDE 12
- Using G_t(x) − ∇g(x) ∈ ∂h(x − t G_t(x)), and convexity of g,

      f(x^+) ≤ f(z) + G_t(x)^T (x − z) − (1 − Lt/2) t ‖G_t(x)‖²   for all z

- Letting t ≤ 1/L and z = x⋆,

      f(x^+) ≤ f(x⋆) + G_t(x)^T (x − x⋆) − (t/2) ‖G_t(x)‖²
             = f(x⋆) + (1/(2t)) ( ‖x − x⋆‖² − ‖x^+ − x⋆‖² )

The proof then proceeds just as with gradient descent.
SLIDE 13
Backtracking line search
Same as with gradient descent, just replace ∇f with the generalized gradient G_t. I.e.,
- Fix 0 < β < 1
- Then at each iteration, start with t = 1, and while

      f(x − t G_t(x)) > f(x) − (t/2) ‖G_t(x)‖²,

  update t = βt

Theorem: Generalized gradient descent with backtracking line search satisfies

    f(x^(k)) − f(x⋆) ≤ ‖x^(0) − x⋆‖² / (2 t_min k)

where t_min = min{1, β/L}
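A sketch of this rule in numpy, under the same assumptions as before (f, grad_g, and prox are user-supplied callables; the names are mine). Note that G_t must be recomputed each time t shrinks, since it depends on t:

    import numpy as np

    def backtracking_prox_grad(x0, f, grad_g, prox, beta=0.5, n_iter=100):
        """Generalized gradient descent with the backtracking rule above.

        f(x): full objective g(x) + h(x); grad_g(x): gradient of g;
        prox(x, t): prox operator of h at step size t.
        """
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            t = 1.0
            while True:
                G = (x - prox(x - t * grad_g(x), t)) / t       # generalized gradient at this t
                if f(x - t * G) <= f(x) - (t / 2.0) * np.dot(G, G):
                    break                                       # sufficient decrease: accept t
                t *= beta                                       # otherwise shrink the step
            x = x - t * G
        return x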
SLIDE 14
Matrix completion
Given a matrix A, m × n, we only observe entries A_ij, (i, j) ∈ Ω. We want to fill in the missing entries, so we solve:

    min_{X ∈ R^{m×n}}  (1/2) Σ_{(i,j) ∈ Ω} (A_ij − X_ij)² + λ ‖X‖_*

Here ‖X‖_* is the nuclear norm of X,

    ‖X‖_* = Σ_{i=1}^{r} σ_i(X)

where r = rank(X) and σ_1(X), ..., σ_r(X) are its singular values
SLIDE 15
Define P_Ω, the projection operator onto the observed set:

    [P_Ω(X)]_ij = X_ij   if (i, j) ∈ Ω
                  0      if (i, j) ∉ Ω

The criterion is

    f(X) = (1/2) ‖P_Ω(A) − P_Ω(X)‖_F² + λ ‖X‖_*

with g(X) = (1/2) ‖P_Ω(A) − P_Ω(X)‖_F² (the smooth part) and h(X) = λ ‖X‖_* (the nonsmooth part).

Two things are needed for generalized gradient descent:
- Gradient: ∇g(X) = −(P_Ω(A) − P_Ω(X))
- Prox function:

      prox_t(X) = argmin_{Z ∈ R^{m×n}}  (1/(2t)) ‖X − Z‖_F² + λ ‖Z‖_*
SLIDE 16
Claim: prox_t(X) = S_{λt}(X), where the matrix soft-thresholding operator S_λ(X) is defined by

    S_λ(X) = U Σ_λ V^T

where X = U Σ V^T is a singular value decomposition, and Σ_λ is diagonal with (Σ_λ)_ii = max{Σ_ii − λ, 0}.

Why? Note prox_t(X) = Z, where Z satisfies

    0 ∈ Z − X + λt · ∂‖Z‖_*

Fact: if Z = U Σ V^T, then

    ∂‖Z‖_* = { U V^T + W : W ∈ R^{m×n}, ‖W‖ ≤ 1, U^T W = 0, W V = 0 }

Now plug in Z = S_{λt}(X) and check that we can get 0.
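A minimal numpy sketch of the matrix soft-thresholding operator (the function name is mine):

    import numpy as np

    def matrix_soft_threshold(X, lam):
        """Matrix soft-thresholding S_lam(X): soft-threshold the singular values of X."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U * np.maximum(s - lam, 0.0)) @ Vt   # rescale columns of U by thresholded singular values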
SLIDE 17
Hence the generalized gradient update step is:

    X^+ = S_{λt}( X + t (P_Ω(A) − P_Ω(X)) )

Note that ∇g(X) is Lipschitz continuous with L = 1, so we can choose a fixed step size t = 1. The update step is then:

    X^+ = S_λ( P_Ω(A) + P_Ω^⊥(X) )

where P_Ω^⊥ projects onto the unobserved set, P_Ω(X) + P_Ω^⊥(X) = X.

This is the soft-impute algorithm¹, a simple and effective method for matrix completion.
¹ Mazumder et al. (2011), "Spectral regularization algorithms for learning large incomplete matrices"
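A minimal sketch of this update with t = 1, using a boolean mask for Ω (all names and the fixed iteration count are my own; the full soft-impute algorithm of Mazumder et al. additionally uses warm starts over a path of λ values):

    import numpy as np

    def soft_impute(A, mask, lam, n_iter=100):
        """Proximal gradient (t = 1) for nuclear-norm-regularized matrix completion.

        A: m x n array; only the entries where mask is True are treated as observed.
        """
        X = np.zeros(A.shape)
        for _ in range(n_iter):
            Z = np.where(mask, A, X)                     # P_Omega(A) + P_Omega_perp(X)
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            X = (U * np.maximum(s - lam, 0.0)) @ Vt      # matrix soft-thresholding S_lam(Z)
        return X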
SLIDE 18
Why “generalized”?
Special cases of generalized gradient descent, on f = g + h:
- h = 0 → gradient descent
- h = I_C, the indicator function of a set C → projected gradient descent
- g = 0 → proximal minimization algorithm
Therefore these algorithms all have O(1/k) convergence rate
SLIDE 19
Projected gradient descent
Given a closed, convex set C ⊆ R^n,

    min_{x ∈ C} g(x)   ⇔   min_x  g(x) + I_C(x)

where

    I_C(x) = 0   if x ∈ C
             ∞   if x ∉ C

is the indicator function of C. Hence

    prox_t(x) = argmin_z  (1/(2t)) ‖x − z‖² + I_C(z) = argmin_{z ∈ C} ‖x − z‖²

I.e., prox_t(x) = P_C(x), the projection operator onto C.
SLIDE 20
Therefore the generalized gradient update step is:

    x^+ = P_C( x − t ∇g(x) )

i.e., perform the usual gradient update and then project back onto C. This is called projected gradient descent.
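A minimal sketch of the iteration, assuming a user-supplied projection routine (the names are mine):

    import numpy as np

    def projected_gradient_descent(x0, grad_g, project, t, n_iter=1000):
        """Projected gradient descent: gradient step on g, then project back onto C.

        project(x) should return P_C(x), the Euclidean projection of x onto C.
        """
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = project(x - t * grad_g(x))
        return x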
SLIDE 21
What sets C are easy to project onto? Lots, e.g.,
- Affine images C = {Ax + b : x ∈ Rn}
- Solution set of linear system C = {x ∈ Rn : Ax = b}
- Nonnegative orthant C = {x ∈ R^n : x ≥ 0} = R^n_+
- Norm balls C = {x ∈ R^n : ‖x‖_p ≤ 1}, for p = 1, 2, ∞
- Some simple polyhedra and simple cones
Warning: it is easy to write down a seemingly simple set C for which P_C turns out to be very hard to compute! E.g., it is generally hard to project onto the solution set of arbitrary linear inequalities, i.e., an arbitrary polyhedron C = {x ∈ R^n : Ax ≤ b}.
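For illustration, here are minimal numpy sketches of projections onto a few of the easy sets listed above (function names are mine; the linear-system case assumes A has full row rank):

    import numpy as np

    def proj_nonneg(x):
        """Nonnegative orthant: clip negative entries to zero."""
        return np.maximum(x, 0.0)

    def proj_l2_ball(x):
        """Unit Euclidean ball: rescale x if it lies outside."""
        nrm = np.linalg.norm(x)
        return x if nrm <= 1.0 else x / nrm

    def proj_linear_system(x, A, b):
        """Solution set {x : Ax = b}, assuming A has full row rank."""
        return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)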
SLIDE 22
Proximal minimization algorithm
Consider, for h convex (not necessarily differentiable),

    min_x h(x)

The generalized gradient update step is just a prox update:

    x^+ = argmin_z  (1/(2t)) ‖x − z‖² + h(z)

This is called the proximal minimization algorithm. It is faster than the subgradient method, but not implementable unless we know the prox in closed form.
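As a sketch, the iteration is simply a repeated prox evaluation (a special case of the earlier loop with ∇g ≡ 0; names are mine):

    def proximal_minimization(x0, prox, t, n_iter=100):
        """Proximal minimization: repeatedly apply the prox operator of h."""
        x = x0
        for _ in range(n_iter):
            x = prox(x, t)   # x^+ = argmin_z ||x - z||^2 / (2t) + h(z)
        return x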
SLIDE 23
What happens if we can’t evaluate prox?
The theory for the generalized gradient, with f = g + h, assumes that the prox function can be evaluated, i.e., assumes that the minimization

    prox_t(x) = argmin_{z ∈ R^n}  (1/(2t)) ‖x − z‖² + h(z)

can be done exactly.

Generally speaking, all bets are off if we just treat this as another minimization problem and obtain an approximate solution, and practical convergence can be very slow if we use an approximation to the prox. But there are exceptions (both in theory and in practice), e.g., partial proximal minimization².
² Bertsekas and Tseng (1994), "Partial proximal minimization algorithms for convex programming"
SLIDE 24
Almost cutting edge
We're almost at the cutting edge for first-order methods, but not quite ... we still require too many iterations.

Acceleration: use more than just x^(k−1) to compute x^(k) (e.g., also use x^(k−2)); these are sometimes called momentum terms or memory terms. There are many different flavors of acceleration (at least three, mostly due to Nesterov).

Accelerated generalized gradient descent achieves the optimal rate O(1/k²) among first-order methods for minimizing f = g + h!
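One common flavor of acceleration, sketched below with a FISTA-style momentum weight (k − 2)/(k + 1); this particular choice is my illustration, since the slide does not pin down a specific variant:

    import numpy as np

    def accelerated_prox_grad(x0, grad_g, prox, t, n_iter=1000):
        """Accelerated generalized gradient descent (one momentum flavor, as a sketch)."""
        x = np.asarray(x0, dtype=float)
        x_prev = x.copy()
        for k in range(1, n_iter + 1):
            v = x + (k - 2) / (k + 1) * (x - x_prev)   # momentum: combines x^(k-1) and x^(k-2)
            x_prev = x
            x = prox(v - t * grad_g(v), t)             # usual prox-gradient step at the point v
        return x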
SLIDE 25
[Figure: f(k) − fstar versus iteration k, comparing the subgradient method, generalized gradient descent, and Nesterov acceleration]
SLIDE 26
References
- E. Candes, Lecture Notes for Math 301, Stanford University, Winter 2010-2011
- L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring