Stochastic Optimization for Learning over Networks
Guanghui (George) Lan
School of Industrial and Systems Engineering, Georgia Institute of Technology
East Coast Optimization Meeting, April 4-5, 2019, Department of Mathematical Sciences
Outline: Background, Centralized SGD, Distributed SGD, GEM, RGEM, Decentralized SGD, Summary
Machine Learning
Given a set of observed data S = {(u_i, v_i)}_{i=1}^m, drawn from a certain unknown distribution D on U × V.
Goal: to describe the relation between the u_i and v_i for prediction.
Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...
Classic models:
- Lasso regression: min_x E[(⟨x, u⟩ − v)^2] + ρ‖x‖_1.
- Support vector machine: min_x E_{u,v}[max{0, v⟨x, u⟩}] + ρ‖x‖_2^2.
- Deep learning: min_x E_{u,v}[(F(u, x) − v)^2] + ρ‖x‖_1.
Machine learning and stochastic optimization
Generic stochastic optimization model: min_{x∈X} {f(x) := E_ξ[F(x, ξ)]}.
In ML, F is the regularized loss function and ξ = (u, v):
- F(x, ξ) = (⟨x, u⟩ − v)^2 + ρ‖x‖_1
- F(x, ξ) = max{0, v⟨x, u⟩} + ρ‖x‖_2^2
Computing the gradient of f is expensive or impossible.
Stochastic first-order methods: iterative methods that operate with the stochastic gradients (subgradients) of f.
Learning over networks
What if the data are distributed over a multi-agent network?
min_x f(x) := Σ_{i=1}^m f_i(x)  s.t. x ∈ X, X := ∩_{i=1}^m X_i,
where f_i(x) = E[F_i(x, ζ_i)] is given in the form of expectation.
- Optimization defined over a complex multi-agent network.
- Each agent has its own data (observations of ζ_i).
- Data usually are private: no sharing.
- Agents can share knowledge learned from the data.
- Communication among agents can be expensive.
- Data can be captured online.
Example: SVM over networks
Three agents: min_x (1/3)[f_1(x) + f_2(x) + f_3(x)], where
f_j(x) = (1/N_j) Σ_{i=1}^{N_j} max{0, v_i^j ⟨x, u_i^j⟩} + ρ‖x‖_2^2, j = 1, 2, 3.
Dataset for agent j: {(u_i^j, v_i^j)}_{i=1}^{N_j}, j = 1, 2, 3.
Each agent accesses only its own dataset.
Example: SVM over networks with streaming data
min_x (1/3)[f_1(x) + f_2(x) + f_3(x)], where f_j(x) = E[max{0, v^j ⟨x, u^j⟩}] + ρ‖x‖_2^2, j = 1, 2, 3.
- The dataset for agent j can be viewed as samples of the random vector (u^j, v^j), j = 1, 2, 3.
- The (u^j, v^j) can follow different distributions.
- Samples can be collected in an online fashion.
- Agents can possibly share solutions, but not the samples.
- Need to minimize the communication costs.
Key questions:
- # samples: sampling complexity
- # communication rounds: communication complexity
Network topology?
- Centralized stochastic gradient descent: sampling complexity
- Distributed SGD and federated learning: sampling complexity, communication complexity
- Decentralized SGD: how to communicate, sampling complexity, communication complexity
Stochastic (sub)gradients
The problem: min_{x∈X} {f(x) := E_ξ[F(x, ξ)]}.
Stochastic (sub)gradients: at iteration t, with x_t ∈ X as the input, we have access to a vector G(x_t, ξ_t), where {ξ_t}_{t≥1} are i.i.d. random variables such that
E[G(x_t, ξ_t)] ≡ g(x_t) ∈ ∂f(x_t), E[‖G(x, ξ)‖^2] ≤ M^2.
Examples:
- Regression with batch data: min_x f(x) = (1/N) Σ_{i=1}^N (⟨x, u_i⟩ − v_i)^2.
  Stochastic gradient: 2(⟨x, u_{i_t}⟩ − v_{i_t}) u_{i_t}.
- Regression with streaming data: min_x f(x) = E[(⟨x, u⟩ − v)^2].
  Stochastic gradient: 2(⟨x, u⟩ − v) u.
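These one-sample gradients are one line of code. A minimal sketch in NumPy (the function name is illustrative, not from the slides):

```python
import numpy as np

def regression_sgrad(x, u, v):
    # Unbiased stochastic gradient of f(x) = E[(<x,u> - v)^2]
    # based on one fresh sample (u, v): G(x, xi) = 2(<x,u> - v) u,
    # so that E[G(x, xi)] = grad f(x).
    return 2.0 * (u @ x - v) * u
```

For batch data, drawing the index i_t uniformly from {1, ..., N} and applying the same formula to (u_{i_t}, v_{i_t}) gives an unbiased gradient of the empirical average.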
Stochastic (sub)gradient descent
The algorithm: x_{t+1} = argmin_{x∈X} ‖x − (x_t − γ_t G_t)‖_2^2, t = 1, 2, ...

Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09)). Let D_X ≥ max_{x_1,x_2∈X} ‖x_1 − x_2‖_2. If γ_t = √(D_X^2/(k M^2)), t = 1, ..., k, and x̄_k = Σ_{t=1}^k x_t / k, then
E[f(x̄_k) − f*] ≤ 2 M D_X / √k, ∀k ≥ 1.

Sampling complexity: # samples = # iterations = O(1) M^2 D_X^2 / ε^2,
to find a solution x̄ ∈ X such that E[f(x̄) − f*] ≤ ε.
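A minimal sketch of the averaged projected-SGD scheme from the theorem (NumPy; the toy objective in the usage note and the function names are illustrative assumptions):

```python
import numpy as np

def projected_sgd(sgrad, project, x0, gamma, k, rng):
    # x_{t+1} = Pi_X(x_t - gamma * G(x_t, xi_t)), returning the
    # averaged output bar{x}_k = (1/k) * sum_{t=1}^k x_t.
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for _ in range(k):
        x = project(x - gamma * sgrad(x, rng))
        avg += x / k
    return avg
```

For instance, minimizing E[(x − ξ)^2] with ξ ~ N(1, 0.1^2) over X = [0, 2] drives the averaged iterate toward the minimizer 1.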
Recent developments
Accelerated SGD (Lan 08 (12))
- Stochastic version of Nesterov's accelerated gradient method
- A universally optimal method for smooth, nonsmooth and stochastic optimization
- The impact of Lipschitz constants vanishes for stochastic problems
- Popular in training deep neural networks (Sutskever, Martens, Dahl, Hinton 13)
Other developments:
- Adaptive stochastic subgradient (Duchi, Hazan, Singer 11)
- Nonconvex SGD and its acceleration (Ghadimi and Lan 12)
- Adaptive sample sizes (Byrd, Chin, Nocedal and Wu 12)
- SGD for finite-sum problems (Schmidt, Roux and Bach 13)
- Optimal incremental gradient methods (Lan and Zhou 15)
The distributed structure - star topology
Figure: A cloud-device based distributed learning system
- Data sets are distributed over individual agents (devices) in the network.
- All devices are connected to a parameter server (or central cloud), which controls the learning process and updates solutions.
- One example: federated learning.
Stochastic finite sum optimization
Consider the convex programming (CP) problem given by
min_{x∈X} Ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + µ ω(x).
- X ⊆ R^n is a closed convex set.
- f_i : R^n → R, i = 1, ..., m, are smooth convex functions with Lipschitz constants L_i ≥ 0.
- f(x) := (1/m) Σ_{i=1}^m f_i(x) is smooth convex with Lipschitz constant L_f ≤ L := (1/m) Σ_{i=1}^m L_i.
- ω : X → R is a strongly convex function with modulus 1 w.r.t. an arbitrary norm ‖·‖.
- µ ≥ 0 is a given constant.
- f_i(x) = E[F_i(x, ξ_i)] can be represented by a stochastic oracle, providing stochastic (sub)gradients upon request.
Randomized incremental gradient (RIG) methods
Randomized incremental gradient (RIG) methods solve min_{x∈X} (1/m) Σ_{i=1}^m f_i(x) iteratively; at the k-th iteration:
1) Randomly select a component index i_k from {1, ..., m} (server).
2) Compute the gradient ∇f_{i_k}(x_k) of the selected component function (agents).
3) Set x_{k+1} = P_X(x_k − α_k ∇f_{i_k}(x_k)), where α_k is a positive step-size and P_X(·) denotes projection onto X (server).
- Potentially save the total number of gradient computations.
- Save communication costs in the distributed setting.
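The three steps above can be sketched in a few lines of NumPy. A hypothetical sketch, with the classical diminishing step-size α_k = 1/(k+1) as an assumption not specified on this slide:

```python
import numpy as np

def rig_step(x, grads, alpha, project, rng):
    # One RIG iteration: sample i_k uniformly from {1,...,m}, then
    # x_{k+1} = P_X(x_k - alpha_k * grad f_{i_k}(x_k)).
    i = rng.integers(len(grads))
    return project(x - alpha * grads[i](x))
```

With f_i(x) = (1/2)(x − c_i)^2 and α_k = 1/(k+1), the iterate tracks the running mean of the sampled c_i, approaching the minimizer of the average function.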
Existing RIG methods
- SAG/SAGA (Schmidt et al. 13; Defazio et al. 14) and SVRG (Johnson and Zhang 13) obtain an O((m + L/µ) log(1/ε)) rate of convergence, which is not optimal.
- RPDG (Lan and Zhou 15; precursors: Zhang and Xiao 14, Dang and Lan 14) requires an exact gradient evaluation at the initial point, and differentiability over R^n.
- The Catalyst scheme (Lin et al. 15) and Katyusha (Allen-Zhu 16) require re-evaluating exact gradients from time to time (synchronous delays).
- No existing studies on the stochastic finite-sum setting, where each f_i is represented by a stochastic oracle, or each agent only has access to noisy first-order information.
Road map
Goals:
- Fully distributed (no exact gradient evaluations)
- Direct acceleration with optimal communication costs
- Applicable to stochastic finite-sum problems, with optimal sampling complexity
Outline:
- Gradient Extrapolation Method (GEM)
- Interpretation of GEM
- Randomized Gradient Extrapolation Method (RGEM)
Prox-function and prox-mapping
We define the prox-function associated with ω as
P(x_0, x) := ω(x) − [ω(x_0) + ⟨ω′(x_0), x − x_0⟩],
and the prox-mapping associated with X and ω as
M_X(g, x_0, η) := argmin_{x∈X} {⟨g, x⟩ + µ ω(x) + η P(x_0, x)}.
- P(·, ·) is a generalization of Bregman's distance, since ω is not necessarily differentiable.
- P(·, ·) is strongly convex w.r.t. an arbitrary norm because of the strong convexity of ω.
- It is reasonable to assume the prox-mapping problem above is easy to solve.
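For the Euclidean choice ω(x) = (1/2)‖x‖^2, we get P(x_0, x) = (1/2)‖x − x_0‖^2 and the prox-mapping has a closed form over a box. A sketch under these assumptions (the function name and the box constraint are illustrative):

```python
import numpy as np

def prox_mapping_box(g, x0, mu, eta, lo, hi):
    # M_X(g, x0, eta) = argmin_{x in X} <g,x> + mu*omega(x) + eta*P(x0,x)
    # with omega(x) = 0.5*||x||^2 and X = [lo, hi]^n. The objective is a
    # separable quadratic; zeroing g + mu*x + eta*(x - x0) coordinate-wise
    # and clipping to the box gives the exact minimizer.
    return np.clip((eta * x0 - g) / (mu + eta), lo, hi)
```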
The algorithm: Gradient Extrapolation Method (GEM)
The problem: min_{x∈X} {f(x) + µω(x) := (1/m) Σ_{i=1}^m f_i(x) + µω(x)}

Gradient Extrapolation Method (GEM)
Initialization: x̄_0 = x_0 ∈ X and g_{−1} = g_0 = ∇f(x_0).  [exact gradient evaluation]
for t = 1, 2, ..., k do
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t)
  g_t = ∇f(x̄_t)  [one exact gradient evaluation]
end for
Output: x̄_k.
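The GEM loop is only a few lines of code. A minimal sketch with constant parameters and a Euclidean prox-mapping (these simplifications are assumptions; the paper's parameter choices differ):

```python
import numpy as np

def gem(grad_f, prox, x0, alpha, eta, tau, k):
    # Gradient Extrapolation Method (sketch): extrapolate the gradient
    # sequence, take a prox step in x, average, then evaluate one
    # gradient at the averaged point.
    x = np.asarray(x0, dtype=float).copy()
    xbar = x.copy()
    g_prev2 = g_prev = grad_f(x)
    for _ in range(k):
        g_tilde = g_prev + alpha * (g_prev - g_prev2)   # gradient extrapolation
        x = prox(g_tilde, x, eta)                       # x_t = M_X(g_tilde, x_{t-1}, eta_t)
        xbar = (x + tau * xbar) / (1.0 + tau)           # averaging step
        g_prev2, g_prev = g_prev, grad_f(xbar)          # one gradient evaluation
    return xbar
```

With f(x) = (1/2)(x − 1)^2, an unconstrained Euclidean prox step prox(g, x, eta) = x − g/eta, and (alpha, eta, tau) = (1, 2, 1), the averaged iterate converges to the minimizer 1.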
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dual player (g):
min_{x∈X} {f(x) + µ ω(x)} = min_{x∈X} max_{g∈G} {⟨x, g⟩ − J_f(g) + µ ω(x)}.
The primal player predicts the dual player's action based on historical information, and determines her/his own action by minimizing the predicted cost:
g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1},
x_t = argmin_{x∈X} {⟨g̃_t, x⟩ + µ ω(x) + η_t P(x_{t−1}, x)}.
The dual player determines her/his action g_t by maximizing the profit:
g_t = argmin_{g∈G} {−⟨x_t, g⟩ + J_f(g) + τ_t D_g(g_{t−1}, g)}
  ⇔ x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t), g_t = ∇f(x̄_t).
GEM: the dual of Nesterov’s accelerated gradient method
GEM:
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  g_t = M_G(−x_t, g_{t−1}, τ_t)
NEST:
  x̃_t = α_t(x_{t−1} − x_{t−2}) + x_{t−1}
  g_t = M_G(−x̃_t, g_{t−1}, τ_t)
  x_t = M_X(g_t, x_{t−1}, η_t)
Adding randomization...
The problem: min_{x∈X} {f(x) + µω(x) := (1/m) Σ_{i=1}^m f_i(x) + µω(x)}

GEM
Initialization: x̄_0 = x_0 ∈ X and g_{−1} = g_0 = ∇f(x_0).
for t = 1, 2, ..., k do
  g̃_t = α_t(g_{t−1} − g_{t−2}) + g_{t−1}
  x_t = M_X(g̃_t, x_{t−1}, η_t)
  x̄_t = (x_t + τ_t x̄_{t−1}) / (1 + τ_t)
  g_t = ∇f(x̄_t)
end for
Output: x̄_k.
The algorithm: Random Gradient Extrapolation Method (RGEM)
The problem: min_{x∈X} ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + µω(x)

RGEM
Initialization: x_i^0 = x_0 ∈ X for all i, and y^{−1} = y^0 = 0.  [no exact gradient evaluation]
for t = 1, ..., k do
  Choose i_t uniformly from {1, ..., m}
  ỹ^t = y^{t−1} + α_t(y^{t−1} − y^{t−2})
  x^t = M_X((1/m) Σ_{i=1}^m ỹ_i^t, x^{t−1}, η_t)
  x_i^t = (1 + τ_t)^{−1}(x^t + τ_t x_i^{t−1}) for i = i_t, and x_i^t = x_i^{t−1} otherwise
  y_i^t = ∇f_i(x_i^t) for i = i_t  [one gradient evaluation], and y_i^t = y_i^{t−1} otherwise
end for
Output: for some θ_t > 0, set x̄^k := (Σ_{t=1}^k θ_t)^{−1} Σ_{t=1}^k θ_t x^t.
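A sketch of the RGEM loop with constant parameters and a Euclidean prox step (illustrative assumptions; the convergence theorems prescribe different parameter choices):

```python
import numpy as np

def rgem(grads, prox, x0, m, alpha, eta, tau, k, rng):
    # RGEM sketch: only the sampled agent i_t refreshes its entry of
    # the gradient table y; the server averages the extrapolated table.
    x = np.asarray(x0, dtype=float).copy()
    x_local = np.tile(x, (m, 1))      # per-agent points x_i^t
    y = np.zeros((m, x.size))         # gradient table, y^{-1} = y^0 = 0
    y_prev = y.copy()
    for _ in range(k):
        i = rng.integers(m)
        y_tilde = y + alpha * (y - y_prev)         # table extrapolation
        x = prox(y_tilde.mean(axis=0), x, eta)     # server prox step
        x_local[i] = (x + tau * x_local[i]) / (1.0 + tau)
        y_new = y.copy()
        y_new[i] = grads[i](x_local[i])            # one component gradient
        y_prev, y = y, y_new
    return x
```

With a single agent and f_1(x) = (1/2)(x − 1)^2, the iterates converge to 1 even though the gradient table is initialized at zero rather than at the true gradient.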
RGEM from the server and activated agent perspective
The server updates the iterate x^t and computes the output solution x̄^k (as sumx/sumθ). One agent is activated at a time; it updates its local variables x_i^t and uploads the change of its gradient, ∆y, to the server.
RGEM for deterministic finite-sum optimization
Theorem. Let x* be an optimal solution, L̂ = max_{i=1,...,m} L_i, and set
τ_t = 1/(m(1−α)) − 1, η_t = α µ/(1−α), α_t ≡ mα, with α = 1 − 1/(m + √(m^2 + 16mL̂/µ)).
Then
E[P(x^k, x*)] ≤ (2∆_{0,σ_0}/µ) α^k,
E[ψ(x̄^k) − ψ(x*)] ≤ 16 max{m, L̂/µ} ∆_{0,σ_0} α^{k/2},
where ∆_{0,σ_0} := µP(x_0, x*) + ψ(x_0) − ψ(x*) + σ_0^2/(mµ), and σ_0 satisfies (1/m) Σ_{i=1}^m ‖∇f_i(x_0)‖_*^2 ≤ σ_0^2.

To obtain a stochastic ε-solution, the number of gradient evaluations of the f_i / communication rounds is
O{(m + √(mL̂/µ)) log(1/ε)} (not improvable, Lan and Zhou 17).
The problem: min_{x∈X} ψ(x) := (1/m) Σ_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + µω(x)

Assumption. At iteration t, for a given point x_i^t ∈ X, a stochastic first-order (SO) oracle outputs a vector G_i(x_i^t, ξ_i^t) such that
E_ξ[G_i(x_i^t, ξ_i^t)] = ∇f_i(x_i^t), i = 1, ..., m,
E_ξ[‖G_i(x_i^t, ξ_i^t) − ∇f_i(x_i^t)‖_*^2] ≤ σ^2, i = 1, ..., m.

RGEM for stochastic finite-sum optimization: the same as RGEM, except that the gradient update step is replaced by a mini-batch of stochastic gradients given by the SO:
y_i^t = (1/B_t) Σ_{j=1}^{B_t} G_i(x_i^t, ξ_{i,j}^t) for i = i_t, and y_i^t = y_i^{t−1} otherwise.
RGEM for stochastic finite-sum optimization
Theorem. Let τ_t, η_t and α_t be the same as before, and let B_t = ⌈k(1−α)^2 α^{−t}⌉, t = 1, ..., k. Then
E[P(x^k, x*)] ≤ (2∆_{0,σ_0,σ}/µ) α^k,
E[ψ(x̄^k) − ψ(x*)] ≤ 16 max{m, L̂/µ} ∆_{0,σ_0,σ} α^{k/2},
where ∆_{0,σ_0,σ} := µP(x_0, x*) + ψ(x_0) − ψ(x*) + (σ_0^2/m + 5σ^2)/µ.
- Communication rounds: O{(m + √(mL̂/µ)) log(1/ε)}.
- Stochastic gradient evaluations: Õ{∆_{0,σ_0,σ}/(µε) + m + √(mL̂/µ)}.
Note: the sampling complexity is asymptotically independent of m.
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:
- Requires no exact gradient evaluations of f
- Involves communication only between the server and the activated agent at each iteration, and tolerates communication failures
- Possesses a directly accelerated algorithmic scheme
- Handles stochastic/online optimization: minimization of the generalization risk
- Optimal O{(m + √(mL̂/µ)) log(1/ε)} communication complexity
- Nearly optimal Õ{∆_{0,σ_0,σ}/(µε)} sampling complexity
Network topology?
Example: Policy evaluation for multi-agent reinforcement learning (Wai, Yang, Wang, Hong 18).
Decentralized optimization techniques
Most studies focus on deterministic optimization (e.g., Nedic and Ozdaglar 09; Shi, Ling, Wu, and Yin 15):
- O(1/ε) communication rounds and gradient computations.
- O(log(1/ε)) communication rounds for unconstrained smooth and strongly convex problems.
For stochastic optimization problems:
- Direct extensions of SGD-type methods (e.g., Duchi, Agarwal, and Wainwright 12).
- O(1/ε^2) communication rounds and stochastic (sub)gradient computations.
Question: is SGD still a good algorithm for decentralized stochastic optimization and machine learning?
How to handle decentralized structure?
Dual decomposition (explicit):
min_x F(x) := Σ_{i=1}^m f_i(x_i)
s.t. x_1 = x_2 = ... = x_m, x_i ∈ X_i, ∀i = 1, ..., m,
where x = [x_1^T, ..., x_m^T]^T.
Background: Laplacian L
Let N_i denote the set of neighbors of agent i (including i itself): N_i = {j ∈ V | (i, j) ∈ E} ∪ {i}. Then the Laplacian L ∈ R^{m×m} of a graph G = (V, E) is defined by
L_ij = |N_i| − 1 if i = j; L_ij = −1 if i ≠ j and (i, j) ∈ E; L_ij = 0 otherwise.
For example, for a path on three nodes:
L = [ 1 −1 0; −1 2 −1; 0 −1 1 ],  with L1 = 0 ("agreement subspace").
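The Laplacian and the agreement identity L1 = 0 are easy to check numerically. A small sketch (the function name is illustrative):

```python
import numpy as np

def graph_laplacian(m, edges):
    # Laplacian of an undirected graph on m nodes:
    # L_ii = degree(i), L_ij = -1 if (i, j) is an edge, 0 otherwise.
    L = np.zeros((m, m))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L
```

The path graph 0-1-2 reproduces the 3 × 3 matrix above, and L applied to the all-ones vector vanishes, so (for a connected graph) the consensus vectors x_1 = ... = x_m are exactly the solutions of Lx = 0.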
Problem Formulation
min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. x_1 = ··· = x_m, x_i ∈ X_i, i = 1, ..., m
⇔ min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. x_i = x_j ∀(i, j) ∈ E, x_i ∈ X_i, i = 1, ..., m  (if G is connected)
⇔ min_x F(x) := Σ_{i=1}^m f_i(x_i)  s.t. Lx = 0, x_i ∈ X_i, i = 1, ..., m.
Using the Laplacian, with L := L ⊗ I_d, the consistency constraints can be compactly rewritten, yielding the equivalent saddle-point form
min_{x∈X^m} F(x) + max_{y∈R^{md}} ⟨Lx, y⟩.
Decentralized Primal-Dual (DPD): Vector Form
min_{x∈X^m} F(x) + max_{y∈R^{md}} ⟨Lx, y⟩,  x := [x_1^T ··· x_m^T]^T, y := [y_1^T ··· y_m^T]^T.

Let x_0 = x_{−1} ∈ X^m, y_0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given. For k = 1, ..., N, update z_k = (x_k, y_k):
  x̃_k = α_k(x_{k−1} − x_{k−2}) + x_{k−1}
  y_k = argmin_{y∈R^{md}} −⟨Lx̃_k, y⟩ + (τ_k/2)‖y − y_{k−1}‖^2
  x_k = argmin_{x∈X^m} ⟨Ly_k, x⟩ + F(x) + (η_k/2)‖x_{k−1} − x‖^2
Return z̄_N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z_k.
DPD: Agent i’s point of view
Let x_0 = x_{−1} ∈ X^m, y_0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given.
For k = 1, ..., N, update z_i^k = (x_i^k, y_i^k):
  x̃_i^k = α_k(x_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
  v_i^k = Σ_{j∈N_i} L_ij x̃_j^k
  y_i^k = y_i^{k−1} + (1/τ_k) v_i^k
  w_i^k = Σ_{j∈N_i} L_ij y_j^k
  x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2
Return z̄_N = (Σ_{k=1}^N θ_k)^{−1} Σ_{k=1}^N θ_k z_k.
Decentralized Communication Sliding (DCS)
Q: Is the subproblem
  x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2
always easy to solve?
A: No. Solve it iteratively, using linearizations of f_i(x_i):

Let u_0 = û_0 = x_i^{k−1}, and {β_t}, {λ_t} be given. For t = 1, ..., T_k,
  h_{t−1} ∈ ∂f_i(u_{t−1})
  u_t = argmin_{u∈X_i} ⟨h_{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖^2 + (η_k β_t/2)‖u_{t−1} − u‖^2
Return x_i^k = u_{T_k} and x̂_i^k = (Σ_{t=1}^{T_k} λ_t)^{−1} Σ_{t=1}^{T_k} λ_t u_t.

The same w_i^k is used throughout, so communication is skipped! There are two output points, x_i^k and x̂_i^k.
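Each inner step has a closed form once f_i is linearized, since the remaining objective is a separable quadratic. A sketch over a box X_i (the function names and the box constraint are illustrative assumptions):

```python
import numpy as np

def dcs_inner(subgrad_fi, w, x_prev, eta, betas, lambdas, T, lo, hi):
    # DCS inner loop (sketch): approximately solve
    #   argmin_{x in X_i} <w, x> + f_i(x) + (eta/2)*||x_prev - x||^2
    # by T linearized subgradient steps with no communication.
    # Each step minimizes <h + w, u> + (eta/2)*||x_prev - u||^2
    # + (eta*beta_t/2)*||u_{t-1} - u||^2 over the box [lo, hi]^n,
    # which reduces to a clipped weighted average.
    u = np.asarray(x_prev, dtype=float).copy()
    num = np.zeros_like(u)
    den = 0.0
    for t in range(T):
        h = subgrad_fi(u)
        u = np.clip((eta * x_prev + eta * betas[t] * u - h - w)
                    / (eta * (1.0 + betas[t])), lo, hi)
        num += lambdas[t] * u
        den += lambdas[t]
    return u, num / den   # (x_i^k, hat{x}_i^k)
```

With f_i = |·|, w = 0, x_prev = 2, η_k = 1, and the weights β_t = t/2, λ_t = t + 1 used in the convergence theorems, the exact subproblem solution is the soft-threshold value 1, and both outputs approach it.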
Decentralized Communication Sliding (DCS)
Let x_0 = x_{−1} ∈ X^m, y_0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k}, {T_k} be given.
For k = 1, ..., N, update z_i^k = (x̂_i^k, y_i^k):
  x̃_i^k = α_k(x̂_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
  v_i^k = Σ_{j∈N_i} L_ij x̃_j^k
  y_i^k = argmin_{y_i∈R^d} −⟨v_i^k, y_i⟩ + (τ_k/2)‖y_i − y_i^{k−1}‖^2
  w_i^k = Σ_{j∈N_i} L_ij y_j^k
  (x_i^k, x̂_i^k) = inner loop, run for T_k iterations
Convergence of DCS
Theorem. Let x̂^N = (1/N) Σ_{k=1}^N x̂^k and x* be an optimal solution. If α_k = θ_k = 1, η_k = 2L, τ_k = L, and T_k = ⌈m M^2 N/(L^2 D̃)⌉ for some D̃ > 0, then
F(x̂^N) − F(x*) ≤ (L/N)[(3/2)‖x_0 − x*‖^2 + (1/2)‖y_0‖^2 + 4D̃],
‖Lx̂^N‖ ≤ (L/N)[3‖x_0 − x*‖^2 + 8D̃ + 4‖y* − y_0‖].
- O(1/ε) iterations for an ε-optimal and ε-feasible solution.
- # of required communications is also O(1/ε).
- # of subgradient evaluations is O(1/ε^2).
Decentralized stochastic optimization: f_i(x) := E_{ξ_i}[F_i(x; ξ_i)], where ξ_i models agent i's uncertainty and the distribution P(ξ_i) is not known. Only noisy first-order information G_i(·, ξ_i^t) is available:
E[G_i(u_t, ξ_i^t)] = f_i′(u_t) ∈ ∂f_i(u_t),
E[‖G_i(u_t, ξ_i^t) − f_i′(u_t)‖_*^2] ≤ σ^2.
The primal subproblem
  x_i^k = argmin_{x_i∈X_i} ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖^2
is solved with noisy subgradients:

Let u_0 = û_0 = x_i^{k−1}, and {β_t}, {λ_t} be given. For t = 1, ..., T_k,
  h_{t−1} = G_i(u_{t−1}, ξ_i^{t−1})
  u_t = argmin_{u∈X_i} ⟨h_{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖^2 + (η_k β_t/2)‖u_{t−1} − u‖^2
Return x_i^k = u_{T_k} and x̂_i^k = (Σ_{t=1}^{T_k} λ_t)^{−1} Σ_{t=1}^{T_k} λ_t u_t.

Observations: the same w_i^k is used, so communication is skipped! There are two output points, x_i^k and x̂_i^k.
Convergence of SDCS
Theorem. Let x̂^N = (1/N) Σ_{k=1}^N x̂^k. If β_t = t/2, λ_t = t + 1, α_k = θ_k = 1, η_k = 2L, τ_k = L, and T_k = ⌈m(M^2 + σ^2)N/(L^2 D̃)⌉ for some D̃ > 0, then
E[F(x̂^N) − F(x*)] ≤ (L/N)[(3/2)‖x_0 − x*‖^2 + (1/2)‖y_0‖^2 + 4D̃],
E[‖Lx̂^N‖] ≤ (L/N)[3‖x_0 − x*‖^2 + 8D̃ + 4‖y* − y_0‖].
Conclusion:
- O(1/ε) iterations for an ε-optimal and ε-feasible solution.
- # of required communications is also O(1/ε).
- # of stochastic subgradient evaluations is O(1/ε^2).
Summary of convergence results
Table: Complexity for obtaining an ε-optimal and ε-feasible solution

Algorithm (problem type)  | # of communications       | # of subgradient evaluations
DCS (convex)              | O(L D_{X^m}^2 / ε)        | O(m M^2 D_{X^m}^2 / ε^2)
SDCS (convex)             | O(L D_{X^m}^2 / ε)        | O(m(M^2 + σ^2) D_{X^m}^2 / ε^2)
DCS (strongly convex)     | O(µ D_{X^m}^2 / ε)        | O(m M^2 / (µε))
SDCS (strongly convex)    | O(µ D_{X^m}^2 / ε)        | O(m(M^2 + σ^2) / (µε))
Comparisons with centralized SGD
Assumptions: in the worst case, M_f ≤ mM, µ_f ≥ mµ, D_X^2 / D_{X^m}^2 = O(1/m), and σ̃^2 ≤ mσ^2.

Table: # of stochastic subgradient evaluations

Problem type    | SDCS (individual agent)          | SGD
Convex          | O(m(M^2 + σ^2) D_{X^m}^2 / ε^2)  | O((M_f^2 + σ̃^2) D_X^2 / ε^2)
Strongly convex | O(m(M^2 + σ^2) / (µε))           | O((M_f^2 + σ̃^2) / (µ_f ε))

Conclusion: the sampling complexity is comparable to that of centralized SGD under reasonable assumptions, and hence not improvable in general.
Numerical example
Test problem: decentralized linear SVM model
min_x Σ_{i=1}^m E_{(u_i,v_i)}[max{0, 1 − v_i⟨x_i, u_i⟩}]  s.t. Lx = 0.
- Network structure: connected graph with 100 nodes.
- Data set: real data set "ijcnn1" from LIBSVM.
Figure: The underlying decentralized network
Comparing with distributed dual averaging
Conclusion: SDCS saves inter-node communication rounds while preserving the same order of sampling complexity.
Random gradient extrapolation for federated learning over networks:
- Optimal O((m + √(mL/µ)) log(1/ε)) communication complexity
- Nearly optimal O(1/ε) sampling complexity
Stochastic communication sliding for decentralized learning over networks:
- # of stochastic subgradient evaluations is comparable to centralized SGD.
- # of communication rounds is negligible in comparison with # of stochastic subgradient evaluations.
Thanks!
- G. Lan and Y. Zhou, "Random gradient extrapolation for distributed and stochastic optimization", SIAM Journal on Optimization, 28(4), 2753-2782, 2018.
- G. Lan, S. Lee and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization", Mathematical Programming, to appear.