SLIDE 1
Coordinate descent
Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725
1
SLIDE 2 Adding to the toolbox, with stats and ML in mind
We’ve seen several general and useful minimization tools
- First-order methods
- Newton’s method
- Dual methods
- Interior-point methods
These are some of the core methods in optimization, and they are the main objects of study in this field. In statistics and machine learning, there are a few other techniques that have received a lot of attention; these are not studied as much by those purely in optimization. They don't apply as broadly as the above methods, but are interesting and useful when they do apply ... our focus for the next 2 lectures
2
SLIDE 3 Coordinate-wise minimization
We've seen (and will continue to see) some pretty sophisticated methods. Today, we'll see an extremely simple technique that is surprisingly efficient and scalable

Focus is on coordinate-wise minimization

Q: Given convex, differentiable f : Rn → R, if we are at a point x such that f(x) is minimized along each coordinate axis, have we found a global minimizer? I.e., does

f(x + d · ei) ≥ f(x) for all d, i   ⇒   f(x) = min_z f(z)?

(Here ei = (0, . . . , 1, . . . , 0) ∈ Rn, the ith standard basis vector)
3
SLIDE 4
[Figure: surface plot of a differentiable convex f over (x1, x2)]
A: Yes! Proof: ∇f(x) = (∂f/∂x1(x), . . . , ∂f/∂xn(x)) = 0, since each partial derivative is zero when f is minimized along the corresponding coordinate axis
Q: Same question, but for f convex (not differentiable) ... ?
4
SLIDE 5
[Figure: surface and contour plots of a convex, nondifferentiable f over (x1, x2)]
A: No! Look at the above counterexample

Q: Same question again, but now f(x) = g(x) + Σ_{i=1}^n hi(xi), with g convex, differentiable and each hi convex ... ? (Non-smooth part here called separable)
5
SLIDE 6
[Figure: surface and contour plots of a convex f with separable non-smooth part over (x1, x2)]
A: Yes! Proof: for any y,

f(y) − f(x) ≥ ∇g(x)^T (y − x) + Σ_{i=1}^n [hi(yi) − hi(xi)]
           = Σ_{i=1}^n [∇ig(x)(yi − xi) + hi(yi) − hi(xi)]
           ≥ 0

(each bracketed term is ≥ 0, since x minimizes f over the ith coordinate)
6
SLIDE 7 Coordinate descent
This suggests that for f(x) = g(x) + Σ_{i=1}^n hi(xi) (with g convex, differentiable and each hi convex) we can use coordinate descent to find a minimizer: start with some initial guess x(0), and repeat for k = 1, 2, 3, . . .

x1(k) ∈ argmin_{x1} f(x1, x2(k−1), x3(k−1), . . . , xn(k−1))
x2(k) ∈ argmin_{x2} f(x1(k), x2, x3(k−1), . . . , xn(k−1))
x3(k) ∈ argmin_{x3} f(x1(k), x2(k), x3, . . . , xn(k−1))
. . .
xn(k) ∈ argmin_{xn} f(x1(k), x2(k), x3(k), . . . , xn)

Note: after we solve for xi(k), we use its new value from then on
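To make the one-at-a-time scheme concrete, here is a minimal Python sketch of the cyclic updates above; coord_min is a hypothetical user-supplied routine (not from the slides) that exactly minimizes f over a single coordinate:

import numpy as np

def coordinate_descent(coord_min, x0, n_cycles=100):
    """Exact cyclic coordinate descent.

    coord_min(x, i) returns a value of x_i minimizing f over the ith
    coordinate, with the other coordinates of x held fixed.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_cycles):
        for i in range(x.size):
            # one-at-a-time: the new x_i is used immediately by later updates
            x[i] = coord_min(x, i)
    return x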
7
SLIDE 8 Seminal work of Tseng (2001) proves that for such f (provided f is continuous on the compact set {x : f(x) ≤ f(x(0))} and f attains its minimum), any limit point of x(k), k = 1, 2, 3, . . . , is a minimizer of f. Now, citing real analysis facts:
- x(k) has subsequence converging to x⋆ (Bolzano-Weierstrass)
- f(x(k)) converges to f⋆ (monotone convergence)
Notes:
- Order of cycle through coordinates is arbitrary, can use any
permutation of {1, 2, . . . n}
- Can everywhere replace individual coordinates with blocks of
coordinates
- “One-at-a-time” update scheme is critical, and “all-at-once”
scheme does not necessarily converge
8
SLIDE 9
Linear regression
Let f(x) = (1/2)‖y − Ax‖², where y ∈ Rn, A ∈ Rn×p with columns A1, . . . , Ap

Consider minimizing over xi, with all xj, j ≠ i fixed:

0 = ∇if(x) = Ai^T (Ax − y) = Ai^T (Ai xi + A−i x−i − y)

i.e., we take

xi = Ai^T (y − A−i x−i) / (Ai^T Ai)
Coordinate descent repeats this update for i = 1, 2, . . . , p, 1, 2, . . .
9
SLIDE 10 Coordinate descent vs gradient descent for linear regression: 100 instances (n = 100, p = 20)
[Figure: f(k) − f⋆ versus iteration k (log scale), for gradient descent (GD) and coordinate descent (CD)]
Is it fair to compare 1 cycle of coordinate descent to 1 iteration of gradient descent? Yes, if we're clever:

xi = Ai^T (y − A−i x−i) / (Ai^T Ai) = Ai^T r / ‖Ai‖² + xi^old

where r = y − Ax. Therefore each coordinate update takes O(n) operations — O(n) to update r, and O(n) to compute Ai^T r — and one cycle requires O(np) operations, just like gradient descent
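As a concrete illustration of the residual trick, here is a minimal NumPy sketch of cyclic coordinate descent for least squares (function and variable names are mine, not from the slides; the columns of A are assumed nonzero):

import numpy as np

def coord_descent_ls(A, y, n_cycles=50):
    """Cyclic coordinate descent for (1/2)||y - Ax||^2, maintaining the
    residual r = y - Ax so that each coordinate update costs O(n)."""
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x                    # residual, updated incrementally
    col_sq = np.sum(A ** 2, axis=0)  # ||A_i||^2 for each column
    for _ in range(n_cycles):
        for i in range(p):
            x_old = x[i]
            x[i] = A[:, i] @ r / col_sq[i] + x_old   # O(n) inner product
            r -= A[:, i] * (x[i] - x_old)            # O(n) residual update
    return x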
10
SLIDE 11
[Figure: f(k) − f⋆ versus iteration k (log scale), for GD, CD, and accelerated GD]
Same example, but now with accelerated gradient descent for comparison. Is this contradicting the optimality of accelerated gradient descent? I.e., is coordinate descent a first-order method?

No. It uses much more than first-order information
11
SLIDE 12 Lasso regression
Consider the lasso problem

f(x) = (1/2)‖y − Ax‖² + λ‖x‖1

Note that the non-smooth part is separable: ‖x‖1 = Σ_{i=1}^p |xi|

Minimizing over xi, with xj, j ≠ i fixed:

0 = Ai^T Ai xi + Ai^T (A−i x−i − y) + λ si

where si ∈ ∂|xi|. Solution is given by soft-thresholding:

xi = S_{λ/‖Ai‖²}( Ai^T (y − A−i x−i) / (Ai^T Ai) )

Repeat this for i = 1, 2, . . . , p, 1, 2, . . .
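A minimal sketch of this soft-thresholding update, in the same style as the least squares example (soft_threshold and the other names are mine):

import numpy as np

def soft_threshold(u, t):
    """S_t(u) = sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def coord_descent_lasso(A, y, lam, n_cycles=100):
    """Cyclic coordinate descent for (1/2)||y - Ax||^2 + lam * ||x||_1."""
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x
    col_sq = np.sum(A ** 2, axis=0)
    for _ in range(n_cycles):
        for i in range(p):
            x_old = x[i]
            z = A[:, i] @ r / col_sq[i] + x_old        # least squares minimizer
            x[i] = soft_threshold(z, lam / col_sq[i])  # then soft-threshold
            r -= A[:, i] * (x[i] - x_old)
    return x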
12
SLIDE 13 Box-constrained regression
Consider box-constrained linear regression

min_{x∈Rn} (1/2)‖y − Ax‖² subject to ‖x‖∞ ≤ s

Note this fits our framework, as 1{‖x‖∞ ≤ s} = Σ_{i=1}^n 1{|xi| ≤ s}

Minimizing over xi with all xj, j ≠ i fixed: with the same basic steps, we get

xi = Ts( Ai^T (y − A−i x−i) / (Ai^T Ai) )

where Ts is the truncating operator:

Ts(u) = s if u > s,  u if −s ≤ u ≤ s,  −s if u < −s
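The truncating operator is simply a clip to [−s, s]; in code (name mine):

import numpy as np

def truncate(u, s):
    """Truncating operator T_s: clips u to the interval [-s, s]."""
    return float(np.clip(u, -s, s))

The coordinate update is then the same least squares step as in the lasso sketch, with truncate in place of soft_threshold.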
13
SLIDE 14 Support vector machines
A coordinate descent strategy can be applied to the SVM dual:

min_{α∈Rn} (1/2) α^T K α − 1^T α  subject to  y^T α = 0, 0 ≤ α ≤ C1

Sequential minimal optimization or SMO (Platt, 1998) is basically blockwise coordinate descent in blocks of 2. Instead of cycling, it chooses the next block greedily

Recall the complementary slackness conditions

αi · [(Av)i − yi d − (1 − si)] = 0,  i = 1, . . . , n   (1)
(C − αi) · si = 0,  i = 1, . . . , n   (2)

where v, d, s are the primal coefficients, intercept, and slacks, with v = A^T α, d computed from (1) using any i such that 0 < αi < C, and s computed from (1), (2)
14
SLIDE 15 SMO repeats the following two steps:
- Choose αi, αj that do not satisfy complementary slackness
- Minimize over αi, αj exactly, keeping all other variables fixed
The second step uses the equality constraint and reduces to minimizing a univariate quadratic over an interval. [Figure from Platt, 1998] The first step uses heuristics to choose αi, αj greedily

Note this does not meet the separability assumptions for convergence from Tseng (2001), and a different treatment is required
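The exact 2-variable solve, after one variable is eliminated with the equality constraint y^T α = 0, amounts to minimizing a convex univariate quadratic over an interval set by the box constraints. A minimal sketch of just that step (this is not Platt's full SMO, and the names are mine):

import numpy as np

def min_quadratic_on_interval(a, b, lo, hi):
    """Minimize q(t) = 0.5 * a * t**2 + b * t over [lo, hi], assuming a > 0.
    The unconstrained minimizer -b/a is simply clipped to the interval."""
    return float(np.clip(-b / a, lo, hi))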
15
SLIDE 16 Coordinate descent in statistics and ML
History in statistics:
- Idea appeared in Fu (1998), and again in Daubechies et al.
(2004), but was inexplicably ignored
- Three papers around 2007, and Friedman et al. (2007) really sparked interest in the statistics and ML community

Why is it used?
- Very simple and easy to implement
- Careful implementations can attain state-of-the-art
- Scalable, e.g., don’t need to keep data in memory
Some examples: lasso regression, SVMs, lasso GLMs, group lasso, fused lasso (total variation denoising), trend filtering, graphical lasso, regression with nonconvex penalties
16
SLIDE 17 Pathwise coordinate descent for lasso
Here is the basic outline for pathwise coordinate descent for lasso, from Friedman et al. (2007), Friedman et al. (2009); a rough code sketch of the outer warm-start loop follows the outline

Outer loop (pathwise strategy):
- Compute the solution at a sequence λ1 ≥ λ2 ≥ . . . ≥ λr of tuning parameter values
- For tuning parameter value λk, initialize coordinate descent algorithm at the computed solution for λk+1

Inner loop (active set strategy):
- Perform one coordinate cycle (or small number of cycles), and
record active set S of coefficients that are nonzero
- Cycle over coefficients in S until convergence
- Check KKT conditions over all coefficients; if not all satisfied,
add offending coefficients to S, go back one step
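Here is a rough sketch of the outer pathwise loop only, with the active set bookkeeping omitted; function and variable names are mine, not from the references:

import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_cd(A, y, lam, x0, n_cycles=100):
    """Warm-startable cyclic coordinate descent for the lasso, as in the
    earlier sketch but starting from x0 instead of zero."""
    x = x0.copy()
    r = y - A @ x
    col_sq = np.sum(A ** 2, axis=0)
    for _ in range(n_cycles):
        for i in range(A.shape[1]):
            x_old = x[i]
            x[i] = soft_threshold(A[:, i] @ r / col_sq[i] + x_old, lam / col_sq[i])
            r -= A[:, i] * (x[i] - x_old)
    return x

def pathwise_lasso(A, y, lambdas):
    """Outer pathwise loop: solve at a decreasing sequence of lambdas,
    warm-starting each solve at the previous solution."""
    x = np.zeros(A.shape[1])
    path = {}
    for lam in lambdas:          # lambdas assumed decreasing
        x = lasso_cd(A, y, lam, x0=x)
        path[lam] = x.copy()
    return path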
17
SLIDE 18
Even if the solution is only desired at one value of λ, the pathwise strategy (λ1 ≥ λ2 ≥ . . . ≥ λr = λ) is much faster than directly performing coordinate descent at λ

Active set strategy takes algorithmic advantage of sparsity; e.g., for large problems, coordinate descent for lasso is much faster than it is for ridge regression

With these strategies in place (and a few more tricks), coordinate descent is competitive with the fastest algorithms for 1-norm penalized minimization problems

Freely available via the glmnet package in MATLAB or R (Friedman et al., 2009)
18
SLIDE 19 Convergence rates?
Global convergence rates for coordinate descent have not yet been established as they have for first-order methods

Recently Saha et al. (2010) consider minimizing f(x) = g(x) + λ‖x‖1 and assume that
- g convex, ∇g Lipschitz with constant L > 0, and I − ∇g/L monotone increasing in each component
- there is z such that z ≥ Sλ(z − ∇g(z)) or z ≤ Sλ(z − ∇g(z)) (component-wise)

They show that for coordinate descent starting at x(0) = z, and generalized gradient descent starting at y(0) = z (step size 1/L),

f(x(k)) − f(x⋆) ≤ f(y(k)) − f(x⋆) ≤ L‖x(0) − x⋆‖² / (2k)
19
SLIDE 20
Graphical lasso
Consider a data matrix A ∈ Rn×p, whose rows a1, . . . , an ∈ Rp are independent observations from N(0, Σ), with unknown covariance matrix Σ

Want to estimate Σ; normality theory tells us that

(Σ−1)ij = 0  ⇔  Ai, Aj conditionally independent given Aℓ, ℓ ≠ i, j

If p is large, we believe the above to be true for many i, j, so we want a sparse estimate of Σ−1. We get this by solving the graphical lasso (Banerjee et al., 2007, Friedman et al., 2007) problem:

min_{Θ∈Rp×p} − log det Θ + tr(SΘ) + λ‖Θ‖1

Minimizer Θ⋆ is an estimate for Σ−1. (Note here S = A^T A/n is the empirical covariance matrix, and ‖Θ‖1 = Σ_{i,j=1}^p |Θij|)
20
SLIDE 21
Example from Friedman et al. (2007), cell-signaling network: [Figure: believed network vs. graphical lasso estimates]

Example from Liu et al. (2010), hub graph simulation: [Figure: true graph vs. graphical lasso estimate]
21
SLIDE 22 Graphical lasso KKT conditions (stationarity):

−Θ−1 + S + λΓ = 0

where Γij ∈ ∂|Θij|. Let W = Θ−1; we will solve in terms of W. Note Wii = Sii + λ, because Θii > 0 at solution. Now partition:

W = [W11, w12; w21, w22],  Θ = [Θ11, θ12; θ21, θ22],
S = [S11, s12; s21, s22],  Γ = [Γ11, γ12; γ21, γ22]

where W11 ∈ R(p−1)×(p−1), w12 ∈ R(p−1)×1, w21 ∈ R1×(p−1), and w22 ∈ R; same with the others

Coordinate descent strategy: solve for w12, the last column of W (note w22 is known), with all other columns fixed; then solve for the second-to-last column, etc., and cycle around until convergence. (Solve for Θ along the way, so we don't have to invert W to get Θ)
22
SLIDE 23 Now consider the 12-block of the KKT conditions:

−w12 + s12 + λγ12 = 0

Because

[W11, w12; w21, w22] [Θ11, θ12; θ21, θ22] = [I, 0; 0, 1]

we have w12 = −W11 θ12 / θ22. Substituting this into the above,

W11 θ12/θ22 + s12 + λγ12 = 0

Letting x = θ12/θ22 and noting that θ22 > 0 at solution, this is

W11 x + s12 + λρ = 0

where ρ ∈ ∂‖x‖1. What does this condition look like?
23
SLIDE 24 These are exactly the KKT conditions for

min_{x∈Rp−1} (1/2) x^T W11 x + s12^T x + λ‖x‖1

which is (basically) a lasso problem and can be solved quickly via coordinate descent

From x we get w12 = −W11 x, and θ12, θ22 are obtained from the identity

[W11, w12; w21, w22] [Θ11, θ12; θ21, θ22] = [I, 0; 0, 1]

Then set w21 = w12^T, θ21 = θ12^T, and move on to a different column; hence we have reduced the graphical lasso problem to a bunch of sequential lasso problems
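A minimal sketch of solving this per-column subproblem itself by coordinate descent; the update solves 0 ∈ W11,ii xi + ci + λ∂|xi|, where ci collects the terms not involving xi (names are mine, not from the slides):

import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def cd_quadratic_lasso(W11, s12, lam, n_cycles=100):
    """Cyclic coordinate descent for (1/2) x^T W11 x + s12^T x + lam*||x||_1.
    Assumes W11 is symmetric with positive diagonal (it is a principal
    submatrix of W = Theta^{-1})."""
    p1 = len(s12)
    x = np.zeros(p1)
    for _ in range(n_cycles):
        for i in range(p1):
            # terms of the gradient not involving x_i
            c = s12[i] + W11[i] @ x - W11[i, i] * x[i]
            x[i] = soft_threshold(-c, lam) / W11[i, i]
    return x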
24
SLIDE 25 This coordinate descent approach for the graphical lasso, usually called the glasso algorithm (Friedman et al., 2007), is very efficient and scales well

Meanwhile, people have noticed that using the glasso algorithm, it can happen that the objective function doesn't decrease monotonically across iterations — is this a bug?

No! The glasso algorithm makes a variable transformation and solves in terms of coordinate blocks of W; note that these are not coordinate blocks of the original variable Θ, so strictly speaking it is not a coordinate descent algorithm

However, it can be shown that glasso is doing coordinate ascent on the dual problem (Mazumder et al., 2011)
25
SLIDE 26 Screening rules for graphical lasso
Graphical lasso computations can be significantly accelerated by using a clever screening rule (this is analogous to the SAFE rules for the lasso)

Mazumder et al. (2011), Witten et al. (2011) examine the KKT conditions:

−Θ−1 + S + λΓ = 0

and conclude that Θ is block diagonal over variables C1, C2 if and only if |Sij| ≤ λ for all i ∈ C1, j ∈ C2. Why?
- If Θ is block diagonal, then so is Θ−1, and thus |Sij| ≤ λ for i ∈ C1, j ∈ C2
- If |Sij| ≤ λ for i ∈ C1, j ∈ C2, then the KKT conditions are satisfied with Θ−1 block diagonal, so Θ is block diagonal

Exact same idea extends to multiple blocks. Hence the group structure in the graphical lasso solution is just given by covariance thresholding
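A minimal sketch of this covariance thresholding step, assuming SciPy is available (names mine); the resulting labels identify the blocks, and the graphical lasso can then be solved separately on each block:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def glasso_blocks(S, lam):
    """Screening by covariance thresholding: connect variables i, j whenever
    |S_ij| > lam (off-diagonal); the connected components of this graph give
    the block structure of the graphical lasso solution at level lam."""
    adj = np.abs(S) > lam
    np.fill_diagonal(adj, False)    # ignore diagonal entries
    n_blocks, labels = connected_components(csr_matrix(adj), directed=False)
    return n_blocks, labels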
26
SLIDE 27 References
Early coordinate descent references in statistics and ML:
- I. Daubechies and M. Defrise and C. De Mol (2004), An
iterative thresholding algorithm for linear inverse problems with a sparsity constraint
- J. Friedman and T. Hastie and H. Hoefling and R. Tibshirani
(2007), Pathwise coordinate optimization
- W. Fu (1998), Penalized regressions: the bridge versus the
lasso
- T. Wu and K. Lange (2008), Coordinate descent algorithms
for lasso penalized regression
- A. van der Kooij (2007), Prediction accuracy and stability of
regression with optimal scaling transformations
27
SLIDE 28 Applications of coordinate descent:
- O. Banerjee and L. El Ghaoui and A. d'Aspremont (2007), Model
selection through sparse maximum likelihood estimation
- J. Friedman and T. Hastie and R. Tibshirani (2007), Sparse
inverse covariance estimation with the graphical lasso
- J. Friedman and T. Hastie and R. Tibshirani (2009),
Regularization paths for generalized linear models via coordinate descent
- J. Platt (1998), Sequential minimal optimization: a fast algorithm for training support vector machines

Theory for coordinate descent:
- R. Mazumder and J. Friedman and T. Hastie (2011),
SparseNet: coordinate descent with non-convex penalties
- A. Saha and A. Tewari (2010), On the finite time convergence of cyclic coordinate descent methods
- P. Tseng (2001), Convergence of a block coordinate descent
method for nondifferentiable minimization
28
SLIDE 29 More graphical lasso references:
- H. Liu and K. Roeder and L. Wasserman (2010), Stability
approach to regularization selection (StARS) for high dimensional graphical models
- R. Mazumder and T. Hastie (2011), The graphical lasso: new
insights and alternatives
- R. Mazumder and T. Hastie (2011), Exact covariance
thresholding into connected components for large-scale graphical Lasso
- D. Witten and J. Friedman and N. Simon (2011), New
insights and faster computations for the graphical lasso
29