Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization


  1. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization. Paul Tseng. Presenter: Lei Tang, Department of CSE, Arizona State University. Nov. 7th, 2008.

  2. Introduction. A popular method for minimizing a real-valued continuously differentiable function f of n variables, subject to bound constraints, is (block) coordinate descent (BCD). In this work, coordinate descent refers to alternating optimization (AO): each step finds the exact minimizer over one block of coordinates. The method is popular for its efficiency, simplicity and scalability, and has been applied to large-scale SVM, Lasso, etc. Unfortunately, unlike the steepest descent method, the convergence of coordinate descent is not clear in general. This work shows that if the function satisfies some mild conditions, BCD converges to a stationary point. (A minimal sketch of the exact per-coordinate minimization idea follows this slide.)
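
The following is a hedged illustration, not part of the original slides: cyclic coordinate descent on a small, hypothetical strictly convex quadratic, where each coordinate update is an exact minimization obtained by setting the corresponding partial derivative to zero.

    # Cyclic coordinate descent with exact per-coordinate minimization,
    # on the hypothetical strictly convex quadratic
    #   f(x, y) = x^2 + 2*y^2 + x*y - 3*x - 4*y.

    def f(x, y):
        return x**2 + 2*y**2 + x*y - 3*x - 4*y

    def argmin_x(y):
        # df/dx = 2x + y - 3 = 0  =>  x = (3 - y) / 2
        return (3.0 - y) / 2.0

    def argmin_y(x):
        # df/dy = 4y + x - 4 = 0  =>  y = (4 - x) / 4
        return (4.0 - x) / 4.0

    x, y = 0.0, 0.0
    for it in range(20):
        x = argmin_x(y)   # exact minimization over the first coordinate
        y = argmin_y(x)   # exact minimization over the second coordinate
        print(it, x, y, f(x, y))

On a strictly convex quadratic like this one, the iterates converge to the unique minimizer; the later slides examine when such convergence can fail.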

  3. Questions. (1) Does BCD converge? (2) Does BCD converge to a local minimizer? (3) When does BCD converge to a stationary point? (4) What is the convergence rate?

  4. Existing works. Convergence of the coordinate descent method typically requires that f be strictly convex (or quasiconvex and hemivariate) and differentiable; the strict convexity can be relaxed to pseudoconvexity, which allows f to have non-unique minima along coordinate directions. If f is not differentiable, the coordinate descent method may get stuck at a nonstationary point even when f is convex. However, the method still works when the nondifferentiable part of f is separable:

      f(x_1, ..., x_N) = f_0(x_1, ..., x_N) + \sum_{k=1}^{N} f_k(x_k),

  where each f_k may be non-differentiable and each x_k represents one block. This work shows that BCD converges to a stationary point if f_0 has a certain smoothness property. (A Lasso sketch of this separable structure follows this slide.)
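
To make the separable structure concrete, here is a hedged sketch, not taken from the slides: for the Lasso, f_0 is the smooth least-squares term and f_k(x_k) = lam * |x_k| is the separable nonsmooth part, so the exact minimizer over a single coordinate is a soft-thresholding step. The names (A, b, lam) are illustrative.

    import numpy as np

    def soft_threshold(z, t):
        # Exact minimizer of 0.5*(x - z)^2 + t*|x|: shrink z toward zero by t.
        return np.sign(z) * max(abs(z) - t, 0.0)

    def lasso_cd(A, b, lam, n_sweeps=100):
        # Cyclic coordinate descent for
        #   min_x 0.5*||A x - b||^2 + lam * sum_k |x_k|
        # (f_0 is the smooth quadratic; the l1 terms are the separable f_k).
        m, n = A.shape
        x = np.zeros(n)
        r = b - A @ x                    # running residual
        col_sq = (A ** 2).sum(axis=0)    # ||A_k||^2 for each column
        for _ in range(n_sweeps):
            for k in range(n):
                if col_sq[k] == 0.0:
                    continue
                r += A[:, k] * x[k]      # remove column k's current contribution
                z = A[:, k] @ r          # unregularized optimum times ||A_k||^2
                x[k] = soft_threshold(z, lam) / col_sq[k]   # exact 1-D minimization
                r -= A[:, k] * x[k]      # restore residual with the new x_k
        return x

Each inner step is an exact minimization of the full composite objective over one coordinate with the others held fixed, which is exactly the BCD update for this f_0 plus separable f_k structure.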

  5. An Example of Alternating Optimization. Consider

      \phi_1(x, y, z) = -xy - yz - zx + (x-1)_+^2 + (-x-1)_+^2 + (y-1)_+^2 + (-y-1)_+^2 + (z-1)_+^2 + (-z-1)_+^2,

  where (t)_+ = max(t, 0). Note that the optimal x given fixed y and z is

      x = sign(y + z) * (1 + |y + z| / 2).

  Suppose you start from (-1 - ε, 1 + ε/2, -1 - ε/4). Exact coordinate minimization over x, y, z in turn produces the iterates

      (1 + ε/8,   1 + ε/2,    -1 - ε/4)
      (1 + ε/8,   -1 - ε/16,  -1 - ε/4)
      (1 + ε/8,   -1 - ε/16,  1 + ε/32)
      (-1 - ε/64, -1 - ε/16,  1 + ε/32)
      (-1 - ε/64, 1 + ε/128,  1 + ε/32)
      (-1 - ε/64, 1 + ε/128,  -1 - ε/256)

  so the sequence cycles around 6 edges of the cube (±1, ±1, ±1) instead of converging. (A short numerical check follows this slide.)
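
The following short check is not part of the original slides; it is a hedged numerical reproduction of the cycling, using the closed-form coordinate minimizer stated above.

    # Alternating exact minimization on phi_1 from the slide above.
    # With the other two coordinates fixed, the exact minimizer of phi_1 over one
    # coordinate is sign(s) * (1 + |s|/2), where s is the sum of the other two.

    def coord_min(s):
        sign = 1.0 if s >= 0 else -1.0
        return sign * (1.0 + abs(s) / 2.0)

    eps = 1e-3
    x, y, z = -1 - eps, 1 + eps / 2, -1 - eps / 4   # starting point from the slide

    for sweep in range(4):            # a few full x, y, z sweeps
        x = coord_min(y + z)
        print('x-step:', (x, y, z))
        y = coord_min(x + z)
        print('y-step:', (x, y, z))
        z = coord_min(x + y)
        print('z-step:', (x, y, z))

    # The printed iterates match the sequence on the slide and keep hopping
    # between neighborhoods of the vertices (±1, ±1, ±1); phi_1 keeps decreasing,
    # but the iterates never approach a stationary point.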

  6. Some Examples. The gradient of \phi_1 is not zero at any of the points (±1, ±1, ±1), so the cycle is not around stationary points. The example shown is unstable to perturbations, and the function has non-smooth second derivatives. More complicated examples can be constructed to show that even for an infinitely differentiable function, stable cyclic behavior can occur with the gradient bounded away from zero along the limiting path. See Powell, On Search Directions for Minimization Algorithms, Mathematical Programming, 1973.

  7. Alternating Optimization Algorithm. (Figure: the alternating optimization algorithm.) A generic sketch of the loop follows.
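
The algorithm figure itself does not survive in this transcript; the following is a hedged Python sketch of a generic block alternating optimization loop. The block argmin oracles, the stopping tolerance, and the function names are assumptions, not taken from the slides.

    def block_coordinate_descent(x0_blocks, argmin_blocks, f, tol=1e-10, max_sweeps=1000):
        # Generic alternating optimization / BCD loop.
        #   x0_blocks     : initial block values [x_1, ..., x_N]
        #   argmin_blocks : argmin_blocks[k](blocks) returns the exact minimizer of f
        #                   over block k with all other blocks held fixed
        #   f             : the objective, used only for the stopping test
        blocks = list(x0_blocks)
        prev = f(blocks)
        for _ in range(max_sweeps):
            for k in range(len(blocks)):
                blocks[k] = argmin_blocks[k](blocks)   # exact minimization over block k
            cur = f(blocks)
            if prev - cur <= tol:                      # objective essentially stopped decreasing
                break
            prev = cur
        return blocks

Exact block minimization makes the objective nonincreasing, so a descent-based stopping test is natural; the example above is a reminder that a small decrease per sweep does not by itself certify that the iterate is near a stationary point.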

  8. EU Assumption. Before going into the proof details, some convergence properties of AO that will be useful are introduced. Typically, the following EU assumption is made (its statement is not reproduced in this transcript).

  9. Global Convergence.

  10. Indications. Under certain conditions, all limit points of an AO sequence are either saddle points of a special type or minimizers. However, not every saddle point can be captured by AO: only those that look like a minimizer along each grouped coordinate block (X_1, X_2, etc.) can be. The possibility of converging to a saddle point is a price that must be paid. What about strictly convex functions? Then AO converges to the global optimum q-linearly. (A numerical illustration of the q-linear rate follows this slide.)
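
As a hedged numerical illustration (the particular quadratic and the rate measurement are assumptions for illustration, not taken from the slides), alternating exact minimization on a strictly convex quadratic exhibits the q-linear rate: the distance to the optimum shrinks by an asymptotically constant factor per sweep.

    import numpy as np

    # Alternating exact minimization of the strictly convex quadratic
    #   f(u, v) = 0.5 * [u v] Q [u v]^T  with Q positive definite;
    # its unique minimizer is (0, 0), so the error is just the norm of (u, v).
    Q = np.array([[2.0, 1.0],
                  [1.0, 4.0]])

    u, v = 5.0, -3.0
    prev_err = np.hypot(u, v)
    for sweep in range(10):
        u = -Q[0, 1] * v / Q[0, 0]        # exact minimization over u given v
        v = -Q[1, 0] * u / Q[1, 1]        # exact minimization over v given u
        err = np.hypot(u, v)
        print(sweep, err / prev_err)      # the ratio settles near a constant < 1
        prev_err = err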

  11. Local Convergence.
