Nonconvex Distributed Optimization: Novel Algorithmic Design and Arbitrarily Precise Solutions


  1. Nonconvex Distributed Optimization: Novel Algorithmic Design and Arbitrarily Precise Solutions
Presenter: Zhiyu He. Coauthors: Jianping He*, Cailian Chen, and Xinping Guan.
Department of Automation, Shanghai Jiao Tong University. July 2020.
*Corresponding author: Jianping He, Email: jphe@sjtu.edu.cn

  2. Distributed Optimization
◮ What is distributed optimization? It enables agents in networked systems to collaboratively optimize the average of local objective functions.
◮ Why not centralized optimization?
• possible lack of a central authority
• efficiency, privacy-preserving, robustness, and scalability issues
Figure 1: An illustration of distributed optimization [1].
[1] A. Nedić et al., "Distributed optimization for control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.

  3. Distributed Optimization: Application Scenarios
• Distributed optimization empowers networked multi-agent systems.
Figure 2: Application scenarios of distributed optimization: (a) distributed learning [2]; (b) distributed localization in sensor networks [3]; (c) distributed coordination in smart grids [4]; (d) distributed control of multi-robot formations [5].
[2] S. Boyd et al., Found. Trends Mach. Learn., 2011. [3] Y. Zhang et al., IEEE Trans. Wireless Commun., 2015. [4] C. Zhao et al., IEEE Trans. Smart Grid, 2016. [5] W. Ren et al., Robot. Auton. Syst., 2008.

  4. Distributed Optimization: Application Scenarios
• Distributed Learning
Suppose that the training sets are so large that they are stored separately at multiple servers. We aim to train the model so that the overall loss function is minimized:
  \min_x F(x) = \sum_{i=1}^{N} f_i(x), \quad f_i(x) = \sum_{j \in \mathcal{D}_i} l_j(x),
where \mathcal{D}_i denotes the local dataset, and f_i(\cdot), l_j(\cdot) denote loss functions.
• Distributed Coordination in Smart Grids
We aim to coordinate the power generation of a set of distributed energy resources, so that
  ⊲ demand is met,
  ⊲ total cost is minimized:
  \min \sum_{i=1}^{N} f_i(P_i), \quad \text{s.t. } \sum_{i=1}^{N} P_i = P_d, \quad \underline{P}_i \le P_i \le \overline{P}_i,
where f_i(\cdot) denotes the generation cost function of each energy resource.
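To make the economic-dispatch formulation above concrete, the following is a minimal sketch that solves a small centralized instance with quadratic generation costs using SciPy. The cost coefficients, demand, and bounds are made-up illustrative values, and a distributed algorithm would replace this centralized solve.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for 3 distributed energy resources: quadratic costs
# f_i(P_i) = a_i * P_i^2 + b_i * P_i, total demand P_d, and capacity bounds.
a = np.array([0.10, 0.05, 0.08])
b = np.array([2.0, 3.0, 2.5])
P_d = 30.0
bounds = [(0.0, 15.0), (0.0, 20.0), (0.0, 12.0)]   # P_i lower/upper limits

cost = lambda P: np.sum(a * P**2 + b * P)                      # total generation cost
balance = {"type": "eq", "fun": lambda P: np.sum(P) - P_d}     # sum_i P_i = P_d

res = minimize(cost, x0=np.full(3, P_d / 3), bounds=bounds, constraints=[balance])
print(res.x, res.fun)   # cost-minimizing dispatch that meets the demand
```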

  5. Developments of Distributed Optimization
• DGD (A. Nedich, ASU, 2009): undirected graphs, sub-linear rate, first-order, convex [6]
• EXTRA (W. Shi, Princeton, 2015): undirected graphs, linear rate, first-order, convex [7]
• Push-DIGing (A. Olshevsky, BU, 2017): directed graphs, linear rate, first-order, convex [8]
• SONATA (G. Scutari, Purdue, 2019): directed graphs, first-order, nonconvex [9]
• ZONE (M. Hong, UMN, 2019): undirected graphs, zeroth-order, nonconvex [10]
[6] A. Nedic et al., IEEE Trans. Autom. Control, 2009. [7] W. Shi et al., SIAM J. Optim., 2015. [8] A. Nedic et al., SIAM J. Optim., 2017. [9] G. Scutari et al., Math. Program., 2019. [10] D. Hajinezhad et al., IEEE Trans. Autom. Control, 2019.

  6. Developments of Distributed Optimization
◮ We classify existing distributed optimization algorithms into two categories:
• Primal methods: Distributed (sub)Gradient Descent [11], Fast-DGD [12], EXTRA [13], DIGing [14], Acc-DNGD [15], ZONE [16], SONATA [17], ...
  feature: combine (sub)gradient descent with consensus, so as to drive local estimates to converge in the primal domain
• Dual-based methods: Dual Averaging [18], D-ADMM [19], DCS [20], MSDA [21], MSPD [22], ...
  feature: introduce consensus equality constraints, and then solve the dual problem or carry out primal-dual updates to reach a saddle point of the Lagrangian
◮ Please refer to [T. Yang et al., Annu. Rev. Control, 2019] for a recent comprehensive survey.
[11] A. Nedic et al., IEEE Trans. Autom. Control, 2009. [12] D. Jakovetić et al., IEEE Trans. Autom. Control, 2014. [13] W. Shi et al., SIAM J. Optim., 2015. [14] A. Nedic et al., SIAM J. Optim., 2017. [15] G. Qu et al., IEEE Trans. Autom. Control, 2019. [16] D. Hajinezhad et al., IEEE Trans. Autom. Control, 2019. [17] G. Scutari et al., Math. Program., 2019. [18] J. C. Duchi et al., IEEE Trans. Autom. Control, 2011. [19] W. Shi et al., IEEE Trans. Signal Process., 2014. [20] G. Lan et al., Math. Program., 2017. [21] K. Scaman et al., in Proc. Int. Conf. Mach. Learn., 2017. [22] K. Scaman et al., in Adv. Neural Inf. Process. Syst., 2018.

  7. Distributed Gradient Descent
Convex distributed optimization:
  \min_{x \in \mathbb{R}^n} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \quad \text{where every } f_i(x) \text{ is convex.}
Distributed Gradient Descent (DGD) [23]:
  x_i^{t+1} = \sum_{j} w_{ij} x_j^t - \alpha_t \nabla f_i(x_i^t)
  (averaging for reaching consensus + local gradient descent for reaching optimality)
Assumptions: diminishing step sizes; doubly stochastic W; bounded gradients \|\nabla f_i\| \le L.
Sub-linear convergence:
  f(\hat{x}_i^t) - f^* \sim O\left(\frac{1}{\sqrt{t}}\right), \quad \hat{x}_i^t = \frac{1}{t} \sum_{k=0}^{t-1} x_i^k.
⇒ Convergence can be improved to linear rates with gradient tracking [24].
[23] A. Nedic et al., IEEE Trans. Autom. Control, 2009. [24] P. Di Lorenzo et al., IEEE Trans. Signal Inf. Process. Netw., 2016; J. Xu et al., IEEE Trans. Autom. Control, 2017.
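A minimal sketch of the DGD update above, assuming a fixed undirected graph, a doubly stochastic mixing matrix W, and diminishing step sizes. The names (W, grads, alpha0) and the toy quadratic objectives are illustrative, not from the slides.

```python
import numpy as np

def dgd(W, grads, x0, num_iters=1000, alpha0=1.0):
    N, n = x0.shape                      # N agents, each with an n-dimensional estimate
    x = x0.copy()
    for t in range(num_iters):
        alpha = alpha0 / np.sqrt(t + 1)  # diminishing step size, as assumed above
        g = np.stack([grads[i](x[i]) for i in range(N)])
        x = W @ x - alpha * g            # consensus averaging + local gradient step
    return x.mean(axis=0)                # average of the local estimates

# Toy usage: agent i holds f_i(x) = (x - b_i)^2, so the global minimizer is mean(b).
if __name__ == "__main__":
    N = 4
    b = np.arange(N, dtype=float)
    grads = [lambda x, bi=bi: 2 * (x - bi) for bi in b]
    W = np.full((N, N), 1.0 / N)         # complete graph with uniform weights
    x0 = np.zeros((N, 1))
    print(dgd(W, grads, x0))             # approaches mean(b) = 1.5
```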

  8. Motivations
General distributed optimization:
  \min_{x \in X} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \quad \text{with possibly nonconvex } f_i.
Generic methods with gradient tracking:
  x_i^{t+1} = \mathcal{F}\left(\sum_{j} w_{ij} x_j^t, \, s_i^t\right),
  s_i^{t+1} = \sum_{j} w_{ij} s_j^t + \nabla f_i(x_i^{t+1}) - \nabla f_i(x_i^t) \quad \text{(evaluation of gradients at every iteration)}
Two notable unresolved issues within the existing works:
• growing load of oracle queries with respect to iterations
  ⊲ results from evaluations of gradients or values of local objectives within every iteration
• hardness of achieving iterative convergence to globally optimal points
  ⊲ results from the nonconvex nature of general objectives
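The per-iteration oracle load noted above can be seen in a sketch of one common instance of the generic scheme, where the x-update is a plain consensus step minus the tracked direction (a DIGing-style choice assumed here, not the paper's method): every iteration issues one fresh gradient query per agent.

```python
import numpy as np

def gradient_tracking(W, grads, x0, alpha=0.05, num_iters=500):
    N, n = x0.shape
    x = x0.copy()
    s = np.stack([grads[i](x[i]) for i in range(N)])   # s_i^0 = grad f_i(x_i^0)
    g_old = s.copy()
    for _ in range(num_iters):
        x = W @ x - alpha * s                          # consensus + tracked-gradient step
        g_new = np.stack([grads[i](x[i]) for i in range(N)])
        s = W @ s + g_new - g_old                      # track the average gradient
        g_old = g_new                                  # one gradient query per agent per iteration
    return x
```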

  9. Contributions
Main contributions of this work:
• We propose a novel algorithm, CPCA, leveraging polynomial approximation, consensus, and SDP theories.
• CPCA has the advantages of being
  ◦ able to obtain ε-globally optimal solutions, where ε is any arbitrarily small given tolerance;
  ◦ computationally efficient, since the required zeroth-order oracle queries are independent of the number of iterations;
  ◦ distributively terminable once the precision requirement is met.
• We provide a comprehensive analysis of the accuracy and complexities of CPCA.

  10. Problem Formulation
The constrained distributed nonconvex optimization problem we consider is
  \min_{x} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x),
  \text{s.t. } x \in X = \bigcap_{i=1}^{N} X_i, \quad X_i \subset \mathbb{R}.
Note:
• We only require the possibly nonconvex univariate f_i(x) to be Lipschitz continuous on the convex sets X_i.
• We assume that the communication graph G is undirected. The extension to time-varying directed graphs is presented in our recent work.

  11. Key Ideas • Inspirations
Approximation is closely linked with optimization.
Figure 3: Optimization algorithms based on approximation: (a) Newton's method (source: S. Boyd et al., Convex Optimization, 2004); (b) the Majorization-Minimization algorithm (source: Y. Sun et al., IEEE Trans. Signal Process., 2016).
Both of them are based on local approximations. What if we use global approximations instead?

  12. Key Ideas • Inspirations
Researchers use Chebyshev polynomial approximation to substitute for a target function defined on an interval, so as to make the study of its properties much easier (e.g., the Chebfun toolbox for MATLAB):
  f(x) \approx p(x) = \sum_{i=0}^{m} c_i T_i\left(\frac{2x - (a + b)}{b - a}\right), \quad x \in [a, b].
• Insights
  ◦ turn to optimizing the approximation (i.e., the proxy) of the global objective, to obtain ε-optimal solutions for any arbitrarily small given error tolerance ε
  ◦ use average consensus to enable every agent to obtain such a global proxy
  ◦ optimize the global proxy locally, by finding its stationary points or solving SDPs
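A small sketch of the Chebyshev-proxy idea using NumPy's Chebyshev class (an assumption here; the slides reference the MATLAB Chebfun toolbox): approximate a nonconvex univariate f on [a, b] and check the uniform approximation error. The test function and degree are illustrative.

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Illustrative nonconvex univariate objective on [a, b]; not from the paper.
a, b = -2.0, 3.0
f = lambda x: np.sin(3 * x) + 0.1 * x**2

# Chebyshev interpolant p(x) = sum_i c_i T_i((2x - (a + b)) / (b - a)).
p = Chebyshev.interpolate(f, deg=50, domain=[a, b])

# Check the uniform error |f - p| on a dense grid.
xs = np.linspace(a, b, 2001)
print(np.max(np.abs(f(xs) - p(xs))))   # tiny for smooth f
```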

  13. Overview of CPCA
Figure 4: The architecture of CPCA.
• Stage 1: Construction of local proxies — each agent uses adaptive Chebyshev interpolation to build a local proxy p_i and extracts its coefficient vector.
• Stage 2: Average consensus — agents run average consensus with distributed stopping on the coefficient vectors, terminating at the K-th iteration, so that every local vector converges to the coefficients of p = (1/N) Σ_i p_i, a proxy for the global objective f(x).
• Stage 3: Optimization of the global proxy — each agent locally optimizes p by solving SDPs, yielding an ε-globally optimal solution.
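The three stages in Figure 4 can be mocked up end to end as follows. This is only a schematic sketch under strong simplifying assumptions: a fixed interpolation degree, exact one-shot averaging in place of finite-time consensus with distributed stopping, and a stationary-point search in place of the SDP-based step. The function names and test objectives are hypothetical.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def cpca_sketch(local_fs, a, b, deg=40):
    # Stage 1: each agent builds a local Chebyshev proxy p_i of f_i on [a, b].
    coefs = np.stack([Chebyshev.interpolate(fi, deg, domain=[a, b]).coef
                      for fi in local_fs])
    # Stage 2: average consensus on coefficient vectors (idealized as the exact mean).
    p = Chebyshev(coefs.mean(axis=0), domain=[a, b])   # proxy for (1/N) sum_i f_i
    # Stage 3: optimize the common global proxy locally (stationary points + endpoints).
    crit = p.deriv().roots()
    crit = crit.real[np.abs(crit.imag) < 1e-10]        # keep real roots only
    crit = crit[(crit >= a) & (crit <= b)]
    candidates = np.concatenate([crit, [a, b]])
    return candidates[np.argmin(p(candidates))]

# Toy usage: two agents with different nonconvex objectives.
fs = [lambda x: np.sin(3 * x), lambda x: 0.2 * (x - 1)**2 + np.cos(2 * x)]
print(cpca_sketch(fs, a=-2.0, b=3.0))
```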

  14. Initialization: Construction of Local Chebyshev Proxies
• Goal: construct the Chebyshev polynomial approximation p_i(x) of f_i(x) such that
  |f_i(x) - p_i(x)| \le \epsilon_1, \quad \forall x \in X, \quad \text{where } X = \bigcap_{i=1}^{N} X_i \triangleq [a, b].
• Details
  1. Run a finite number of max/min-consensus iterations in advance to obtain the intersection set X.
  2. Use adaptive Chebyshev interpolation [25] to obtain p_i(x).
  3. Maintain the vector of Chebyshev coefficients of p_i(x)'s derivative, computed through a recurrence formula.
[25] J. P. Boyd, Solving Transcendental Equations: The Chebyshev Polynomial Proxy and Other Numerical Rootfinders, Perturbation Series, and Oracles. SIAM, 2014, vol. 139.
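A hedged sketch of steps 2 and 3: adapt the interpolation by doubling the degree until the trailing Chebyshev coefficients drop below a tolerance (one common adaptation rule, assumed here rather than taken from Boyd's book), then obtain the derivative's Chebyshev coefficients, for which NumPy's deriv() applies the standard recurrence internally. The test function, starting degree, and tolerance are illustrative.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def adaptive_cheb_proxy(f, a, b, eps1=1e-8, max_deg=2**12):
    """Double the interpolation degree until the trailing coefficients fall below eps1."""
    deg = 8
    while deg <= max_deg:
        p = Chebyshev.interpolate(f, deg, domain=[a, b])
        if np.max(np.abs(p.coef[-2:])) <= eps1:   # simple tail-coefficient test
            return p
        deg *= 2
    return p                                      # best effort if max_deg is reached

p_i = adaptive_cheb_proxy(lambda x: np.exp(-x**2) * np.sin(5 * x), a=-1.0, b=2.0)
dp_i = p_i.deriv()          # dp_i.coef: Chebyshev coefficients of p_i'(x), via the recurrence
print(len(p_i.coef), dp_i.coef[:5])
```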

  15. Initialization: Construction of Local Chebyshev Proxies
Figure 5: An illustration of adaptive Chebyshev interpolation (source: J. P. Boyd, SIAM, 2014, vol. 139).
