Nonconvex Distributed Optimization: Novel Algorithmic Design and Arbitrarily Precise Solutions


  1. Nonconvex Distributed Optimization: Novel Algorithmic Design and Arbitrarily Precise Solutions
Presenter: Zhiyu He. Coauthors: Jianping He*, Cailian Chen, and Xinping Guan.
Department of Automation, Shanghai Jiao Tong University. July 2020.
*Corresponding author: Jianping He, Email: jphe@sjtu.edu.cn

  2. Distributed Optimization
◮ What is distributed optimization? It enables agents in networked systems to collaboratively optimize the average of local objective functions.
◮ Why not centralized optimization?
• possible lack of a central authority
• efficiency, privacy-preserving, robustness, and scalability issues
Figure 1: An illustration of distributed optimization [1].
[1] A. Nedić et al., "Distributed optimization for control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.

  3. Distributed Optimization: Application Scenarios
• Distributed optimization empowers networked multi-agent systems.
Figure 2: Application scenarios of distributed optimization: (a) distributed learning [2]; (b) distributed localization in sensor networks [3]; (c) distributed coordination in smart grids [4]; (d) distributed control of multi-robot formations [5].
[2] S. Boyd et al., Found. Trends Mach. Learn., 2011. [3] Y. Zhang et al., IEEE Trans. Wireless Commun., 2015. [4] C. Zhao et al., IEEE Trans. Smart Grid, 2016. [5] W. Ren et al., Robot. Auton. Syst., 2008.

  4. Distributed Optimization: Application Scenarios
• Distributed Learning
Suppose that the training sets are so large that they are stored separately at multiple servers. We aim to train the model so that the overall loss function is minimized:
  \min_x F(x) = \sum_{i=1}^{N} f_i(x), \quad f_i(x) = \sum_{j \in \mathcal{D}_i} l_j(x),
where \mathcal{D}_i denotes the local dataset, and f_i(\cdot), l_j(\cdot) denote loss functions.
• Distributed Coordination in Smart Grids
We aim to coordinate the power generation of a set of distributed energy resources, so that
  ⊲ demand is met,
  ⊲ total cost is minimized:
  \min \sum_{i=1}^{N} f_i(P_i), \quad \text{s.t. } \sum_{i=1}^{N} P_i = P_d, \quad \underline{P}_i \le P_i \le \overline{P}_i,
where f_i(\cdot) denotes the generation cost function of each energy resource.
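To make the economic-dispatch formulation above concrete, the following is a minimal sketch that solves a small centralized instance with quadratic generation costs using SciPy. The cost coefficients, demand, and bounds are made-up illustrative values, and a distributed algorithm would replace this centralized solve.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for 3 distributed energy resources: quadratic costs
# f_i(P_i) = a_i * P_i^2 + b_i * P_i, total demand P_d, and capacity bounds.
a = np.array([0.10, 0.05, 0.08])
b = np.array([2.0, 3.0, 2.5])
P_d = 30.0
bounds = [(0.0, 15.0), (0.0, 20.0), (0.0, 12.0)]   # P_i lower/upper limits

cost = lambda P: np.sum(a * P**2 + b * P)                      # total generation cost
balance = {"type": "eq", "fun": lambda P: np.sum(P) - P_d}     # sum_i P_i = P_d

res = minimize(cost, x0=np.full(3, P_d / 3), bounds=bounds, constraints=[balance])
print(res.x, res.fun)   # cost-minimizing dispatch that meets the demand
```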

  5. Developments of Distributed Optimization
• DGD (A. Nedich, ASU, 2009): undirected graphs, sub-linear rate, first-order, convex [6]
• EXTRA (W. Shi, Princeton, 2015): undirected graphs, linear rate, first-order, convex [7]
• Push-DIGing (A. Olshevsky, BU, 2017): directed graphs, linear rate, first-order, convex [8]
• SONATA (G. Scutari, Purdue, 2019): directed graphs, first-order, nonconvex [9]
• ZONE (M. Hong, UMN, 2019): undirected graphs, zeroth-order, nonconvex [10]
[6] A. Nedic et al., IEEE Trans. Autom. Control, 2009. [7] W. Shi et al., SIAM J. Optim., 2015. [8] A. Nedic et al., SIAM J. Optim., 2017. [9] G. Scutari et al., Math. Program., 2019. [10] D. Hajinezhad et al., IEEE Trans. Autom. Control, 2019.

  6. Developments of Distributed Optimization
◮ We classify existing distributed optimization algorithms into two categories:
• Primal methods: Distributed (sub)Gradient Descent [11], Fast-DGD [12], EXTRA [13], DIGing [14], Acc-DNGD [15], ZONE [16], SONATA [17], ...
  feature: combine (sub)gradient descent with consensus, so as to drive local estimates to converge in the primal domain
• Dual-based methods: Dual Averaging [18], D-ADMM [19], DCS [20], MSDA [21], MSPD [22], ...
  feature: introduce consensus equality constraints, and then solve the dual problem or carry out primal-dual updates to reach a saddle point of the Lagrangian
◮ Please refer to [T. Yang et al., Annu. Rev. Control, 2019] for a recent comprehensive survey.
[11] A. Nedic et al., IEEE Trans. Autom. Control, 2009. [12] D. Jakovetić et al., IEEE Trans. Autom. Control, 2014. [13] W. Shi et al., SIAM J. Optim., 2015. [14] A. Nedic et al., SIAM J. Optim., 2017. [15] G. Qu et al., IEEE Trans. Autom. Control, 2019. [16] D. Hajinezhad et al., IEEE Trans. Autom. Control, 2019. [17] G. Scutari et al., Math. Program., 2019. [18] J. C. Duchi et al., IEEE Trans. Autom. Control, 2011. [19] W. Shi et al., IEEE Trans. Signal Process., 2014. [20] G. Lan et al., Math. Program., 2017. [21] K. Scaman et al., in Proc. Int. Conf. Mach. Learn., 2017. [22] K. Scaman et al., in Adv. Neural Inf. Process. Syst., 2018.

  7. Distributed Gradient Descent
Convex distributed optimization:
  \min_{x \in \mathbb{R}^n} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \quad \text{where every } f_i(x) \text{ is convex.}
Distributed Gradient Descent (DGD) [23]:
  x_i^{t+1} = \sum_{j} w_{ij} x_j^t - \alpha_t \nabla f_i(x_i^t)
  (averaging for reaching consensus + local gradient descent for reaching optimality)
Assumptions: diminishing step sizes; doubly stochastic W; bounded gradients \|\nabla f_i\| \le L.
Sub-linear convergence:
  f(\hat{x}_i^t) - f^* \sim O\left(\frac{1}{\sqrt{t}}\right), \quad \hat{x}_i^t = \frac{1}{t} \sum_{k=0}^{t-1} x_i^k.
⇒ Convergence can be improved to linear rates with gradient tracking [24].
[23] A. Nedic et al., IEEE Trans. Autom. Control, 2009. [24] P. Di Lorenzo et al., IEEE Trans. Signal Inf. Process. Netw., 2016; J. Xu et al., IEEE Trans. Autom. Control, 2017.
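A minimal sketch of the DGD update above, assuming a fixed undirected graph, a doubly stochastic mixing matrix W, and diminishing step sizes. The names (W, grads, alpha0) and the toy quadratic objectives are illustrative, not from the slides.

```python
import numpy as np

def dgd(W, grads, x0, num_iters=1000, alpha0=1.0):
    N, n = x0.shape                      # N agents, each with an n-dimensional estimate
    x = x0.copy()
    for t in range(num_iters):
        alpha = alpha0 / np.sqrt(t + 1)  # diminishing step size, as assumed above
        g = np.stack([grads[i](x[i]) for i in range(N)])
        x = W @ x - alpha * g            # consensus averaging + local gradient step
    return x.mean(axis=0)                # average of the local estimates

# Toy usage: agent i holds f_i(x) = (x - b_i)^2, so the global minimizer is mean(b).
if __name__ == "__main__":
    N = 4
    b = np.arange(N, dtype=float)
    grads = [lambda x, bi=bi: 2 * (x - bi) for bi in b]
    W = np.full((N, N), 1.0 / N)         # complete graph with uniform weights
    x0 = np.zeros((N, 1))
    print(dgd(W, grads, x0))             # approaches mean(b) = 1.5
```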

  8. Motivations
General distributed optimization:
  \min_{x \in X} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \quad \text{with possibly nonconvex } f_i.
Generic methods with gradient tracking:
  x_i^{t+1} = \mathcal{F}\left(\sum_{j} w_{ij} x_j^t, \, s_i^t\right),
  s_i^{t+1} = \sum_{j} w_{ij} s_j^t + \nabla f_i(x_i^{t+1}) - \nabla f_i(x_i^t) \quad \text{(evaluation of gradients at every iteration)}
Two notable unresolved issues within the existing works:
• growing load of oracle queries with respect to iterations
  ⊲ results from evaluations of gradients or values of local objectives within every iteration
• hardness of achieving iterative convergence to globally optimal points
  ⊲ results from the nonconvex nature of general objectives
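The per-iteration oracle load noted above can be seen in a sketch of one common instance of the generic scheme, where the x-update is a plain consensus step minus the tracked direction (a DIGing-style choice assumed here, not the paper's method): every iteration issues one fresh gradient query per agent.

```python
import numpy as np

def gradient_tracking(W, grads, x0, alpha=0.05, num_iters=500):
    N, n = x0.shape
    x = x0.copy()
    s = np.stack([grads[i](x[i]) for i in range(N)])   # s_i^0 = grad f_i(x_i^0)
    g_old = s.copy()
    for _ in range(num_iters):
        x = W @ x - alpha * s                          # consensus + tracked-gradient step
        g_new = np.stack([grads[i](x[i]) for i in range(N)])
        s = W @ s + g_new - g_old                      # track the average gradient
        g_old = g_new                                  # one gradient query per agent per iteration
    return x
```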

  9. Contributions
Main contributions of this work:
• We propose a novel algorithm, CPCA, leveraging polynomial approximation, consensus, and SDP theories.
• CPCA has the advantages of being
  ◦ able to obtain ε-globally optimal solutions, where ε is any arbitrarily small given tolerance;
  ◦ computationally efficient, since the required zeroth-order oracle queries are independent of the number of iterations;
  ◦ distributively terminable once the precision requirement is met.
• We provide a comprehensive analysis of the accuracy and complexities of CPCA.

  10. Problem Formulation
The constrained distributed nonconvex optimization problem we consider is
  \min_{x} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x),
  \text{s.t. } x \in X = \bigcap_{i=1}^{N} X_i, \quad X_i \subset \mathbb{R}.
Note:
• We only require the possibly nonconvex univariate f_i(x) to be Lipschitz continuous on the convex sets X_i.
• We assume that the communication graph G is undirected. The extension to time-varying directed graphs is presented in our recent work.

  11. Key Ideas • Inspirations
Approximation is closely linked with optimization.
Figure 3: Optimization algorithms based on approximation: (a) Newton's method (source: S. Boyd et al., Convex Optimization, 2004); (b) the Majorization-Minimization algorithm (source: Y. Sun et al., IEEE Trans. Signal Process., 2016).
Both of them are based on local approximations. What if we use global approximations instead?

  12. Key Ideas • Inspirations
Researchers use Chebyshev polynomial approximation to substitute for a target function defined on an interval, so as to make the study of its properties much easier (e.g., the Chebfun toolbox for MATLAB):
  f(x) \approx p(x) = \sum_{i=0}^{m} c_i T_i\left(\frac{2x - (a + b)}{b - a}\right), \quad x \in [a, b].
• Insights
  ◦ turn to optimizing the approximation (i.e., the proxy) of the global objective, to obtain ε-optimal solutions for any arbitrarily small given error tolerance ε
  ◦ use average consensus to enable every agent to obtain such a global proxy
  ◦ optimize the global proxy locally, by finding its stationary points or solving SDPs
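A small sketch of the Chebyshev-proxy idea using NumPy's Chebyshev class (an assumption here; the slides reference the MATLAB Chebfun toolbox): approximate a nonconvex univariate f on [a, b] and check the uniform approximation error. The test function and degree are illustrative.

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Illustrative nonconvex univariate objective on [a, b]; not from the paper.
a, b = -2.0, 3.0
f = lambda x: np.sin(3 * x) + 0.1 * x**2

# Chebyshev interpolant p(x) = sum_i c_i T_i((2x - (a + b)) / (b - a)).
p = Chebyshev.interpolate(f, deg=50, domain=[a, b])

# Check the uniform error |f - p| on a dense grid.
xs = np.linspace(a, b, 2001)
print(np.max(np.abs(f(xs) - p(xs))))   # tiny for smooth f
```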

  13. Overview of CPCA
Figure 4: The architecture of CPCA.
• Stage 1: Construction of local proxies — each agent uses adaptive Chebyshev interpolation to build a local proxy p_i and extracts its coefficient vector.
• Stage 2: Average consensus — agents run average consensus with distributed stopping on the coefficient vectors, terminating at the K-th iteration, so that every local vector converges to the coefficients of p = (1/N) Σ_i p_i, a proxy for the global objective f(x).
• Stage 3: Optimization of the global proxy — each agent locally optimizes p by solving SDPs, yielding an ε-globally optimal solution.
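The three stages in Figure 4 can be mocked up end to end as follows. This is only a schematic sketch under strong simplifying assumptions: a fixed interpolation degree, exact one-shot averaging in place of finite-time consensus with distributed stopping, and a stationary-point search in place of the SDP-based step. The function names and test objectives are hypothetical.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def cpca_sketch(local_fs, a, b, deg=40):
    # Stage 1: each agent builds a local Chebyshev proxy p_i of f_i on [a, b].
    coefs = np.stack([Chebyshev.interpolate(fi, deg, domain=[a, b]).coef
                      for fi in local_fs])
    # Stage 2: average consensus on coefficient vectors (idealized as the exact mean).
    p = Chebyshev(coefs.mean(axis=0), domain=[a, b])   # proxy for (1/N) sum_i f_i
    # Stage 3: optimize the common global proxy locally (stationary points + endpoints).
    crit = p.deriv().roots()
    crit = crit.real[np.abs(crit.imag) < 1e-10]        # keep real roots only
    crit = crit[(crit >= a) & (crit <= b)]
    candidates = np.concatenate([crit, [a, b]])
    return candidates[np.argmin(p(candidates))]

# Toy usage: two agents with different nonconvex objectives.
fs = [lambda x: np.sin(3 * x), lambda x: 0.2 * (x - 1)**2 + np.cos(2 * x)]
print(cpca_sketch(fs, a=-2.0, b=3.0))
```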

  14. Initialization: Construction of Local Chebyshev Proxies
• Goal: construct the Chebyshev polynomial approximation p_i(x) of f_i(x) such that
  |f_i(x) - p_i(x)| \le \epsilon_1, \quad \forall x \in X, \quad \text{where } X = \bigcap_{i=1}^{N} X_i \triangleq [a, b].
• Details
  1. Run a finite number of max/min-consensus iterations in advance to obtain the intersection set X.
  2. Use adaptive Chebyshev interpolation [25] to obtain p_i(x).
  3. Maintain the vector of Chebyshev coefficients of p_i(x)'s derivative, computed through a recurrence formula.
[25] J. P. Boyd, Solving Transcendental Equations: The Chebyshev Polynomial Proxy and Other Numerical Rootfinders, Perturbation Series, and Oracles. SIAM, 2014, vol. 139.
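A hedged sketch of steps 2 and 3: adapt the interpolation by doubling the degree until the trailing Chebyshev coefficients drop below a tolerance (one common adaptation rule, assumed here rather than taken from Boyd's book), then obtain the derivative's Chebyshev coefficients, for which NumPy's deriv() applies the standard recurrence internally. The test function, starting degree, and tolerance are illustrative.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def adaptive_cheb_proxy(f, a, b, eps1=1e-8, max_deg=2**12):
    """Double the interpolation degree until the trailing coefficients fall below eps1."""
    deg = 8
    while deg <= max_deg:
        p = Chebyshev.interpolate(f, deg, domain=[a, b])
        if np.max(np.abs(p.coef[-2:])) <= eps1:   # simple tail-coefficient test
            return p
        deg *= 2
    return p                                      # best effort if max_deg is reached

p_i = adaptive_cheb_proxy(lambda x: np.exp(-x**2) * np.sin(5 * x), a=-1.0, b=2.0)
dp_i = p_i.deriv()          # dp_i.coef: Chebyshev coefficients of p_i'(x), via the recurrence
print(len(p_i.coef), dp_i.coef[:5])
```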

  15. Initialization: Construction of Local Chebyshev Proxies
Figure 5: An illustration of adaptive Chebyshev interpolation (source: J. P. Boyd, SIAM, 2014, vol. 139).
