Stochastic Optimization for Learning over Networks


  1. Stochastic Optimization for Learning over Networks. Guanghui (George) Lan, School of Industrial and Systems Engineering, Georgia Institute of Technology. East Coast Optimization Meeting, April 4-5, 2019, Department of Mathematical Sciences, George Mason University, Fairfax, Virginia (USA).

  2. Machine Learning
Given a set of observed data $S = \{(u_i, v_i)\}_{i=1}^m$, drawn from a certain unknown distribution $D$ on $U \times V$.
Goal: to describe the relation between the $u_i$'s and $v_i$'s for prediction.
Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...
Classic models:
- Lasso regression: $\min_x \mathbb{E}[(\langle x, u\rangle - v)^2] + \rho\|x\|_1$.
- Support vector machine: $\min_x \mathbb{E}_{u,v}[\max\{0, v\langle x, u\rangle\}] + \rho\|x\|_2^2$.
- Deep learning: $\min_x \mathbb{E}_{u,v}[(F(u, x) - v)^2] + \rho\|Ux\|_1$.

  3. Machine learning and stochastic optimization
Generic stochastic optimization model: $\min_{x \in X} \{ f(x) := \mathbb{E}_\xi[F(x, \xi)] \}$.
In ML, $F$ is the regularized loss function and $\xi = (u, v)$:
- $F(x, \xi) = (\langle x, u\rangle - v)^2 + \rho\|x\|_1$
- $F(x, \xi) = \max\{0, v\langle x, u\rangle\} + \rho\|x\|_2^2$.
Computing the gradient of $f$ exactly is expensive or impossible.
Stochastic first-order methods: iterative methods that operate with stochastic gradients (subgradients) of $f$.
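[Editorial sketch, not part of the original slides.] To make the stochastic oracle concrete, the snippet below computes a stochastic subgradient of the regularized least-squares (lasso-type) loss $F(x,\xi) = (\langle x,u\rangle - v)^2 + \rho\|x\|_1$ for a single sample $\xi = (u, v)$; the function name and the synthetic sample are illustrative assumptions.

```python
import numpy as np

def lasso_stochastic_subgradient(x, u, v, rho):
    """Stochastic subgradient of F(x, xi) = (<x, u> - v)^2 + rho * ||x||_1
    for a single sample xi = (u, v)."""
    residual = np.dot(x, u) - v
    # Gradient of the squared-loss term plus one valid subgradient of the l1 term.
    return 2.0 * residual * u + rho * np.sign(x)

# One oracle call on a synthetic sample (illustration only).
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
u, v = rng.standard_normal(5), rng.standard_normal()
print(lasso_stochastic_subgradient(x, u, v, rho=0.1))
```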

  4. Learning over networks
What if the data are distributed over a multi-agent network?
$\min_x f(x) := \sum_{i=1}^m f_i(x)$ s.t. $x \in X := \cap_{i=1}^m X_i$,
where $f_i(x) = \mathbb{E}[F_i(x, \zeta_i)]$ is given in the form of an expectation.
- Optimization is defined over a complex multi-agent network.
- Each agent has its own data (observations of $\zeta_i$).
- Data are usually private - no sharing.
- Agents can share knowledge learned from the data.
- Communication among agents can be expensive.
- Data can be captured online.

  5. Example: SVM over networks
Three agents: $\min_x \frac{1}{3}[f_1(x) + f_2(x) + f_3(x)]$, where, for $j = 1, 2, 3$,
$f_j(x) = \frac{1}{N_j}\sum_{i=1}^{N_j} \max\{0, v_i^j\langle x, u_i^j\rangle\} + \rho\|x\|_2^2$.
Dataset for agent $j$: $\{(u_i^j, v_i^j)\}_{i=1}^{N_j}$.
Each agent accesses only its own dataset.
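[Editorial sketch, not part of the original slides.] A minimal illustration of the three-agent objective above: each agent evaluates its local regularized hinge-type objective $f_j$ on its own dataset only, and the network objective is the average. The loss term follows the formula exactly as written on the slide; the function names and data layout are assumptions.

```python
import numpy as np

def local_svm_objective(x, U, V, rho):
    """Empirical objective f_j(x) for one agent, using only that agent's local
    dataset (U: N_j-by-d feature matrix, V: vector of N_j labels).
    The loss term is max{0, v <x, u>}, as written on the slide."""
    margins = V * (U @ x)
    return np.maximum(0.0, margins).mean() + rho * np.dot(x, x)

def network_objective(x, datasets, rho):
    """Global objective (1/3)[f_1(x) + f_2(x) + f_3(x)]: the average of the
    agents' local objectives; `datasets` is a list of (U_j, V_j) pairs."""
    return np.mean([local_svm_objective(x, U, V, rho) for U, V in datasets])
```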

  6. Example: SVM over networks with streaming data
$\min_x \frac{1}{3}[f_1(x) + f_2(x) + f_3(x)]$, where $f_j(x) = \mathbb{E}\big[\max\{0, v^j\langle x, u^j\rangle\} + \rho\|x\|_2^2\big]$, $j = 1, 2, 3$.
- The dataset of agent $j$ can be viewed as samples of the random vector $(u^j, v^j)$, $j = 1, 2, 3$.
- The $(u^j, v^j)$ may follow different distributions.
- Samples can be collected in an online fashion.
- Agents can possibly share solutions, but not samples.
- Communication costs need to be minimized.
Key questions: the number of samples (sampling complexity) and the number of communication rounds (communication complexity).

  7. Network topology?

  8. Outline:
- Centralized stochastic gradient descent: sampling complexity
- Distributed SGD and federated learning: sampling complexity, communication complexity
- Decentralized SGD: how to communicate, sampling complexity, communication complexity

  9. Stochastic (sub)gradients
The problem: $\min_{x\in X}\{f(x) := \mathbb{E}_\xi[F(x,\xi)]\}$.
Stochastic (sub)gradients: at iteration $t$, with $x_t \in X$ as input, we have access to a vector $G(x_t, \xi_t)$, where $\{\xi_t\}_{t\ge 1}$ are i.i.d. random variables such that
$\mathbb{E}[G(x_t, \xi_t)] \equiv g(x_t) \in \partial f(x_t)$ and $\mathbb{E}[\|G(x, \xi)\|^2] \le M^2$.
Examples:
- Regression with batch data: $\min_x f(x) = \frac{1}{N}\sum_{i=1}^N (\langle x, u_i\rangle - v_i)^2$; stochastic gradient: $2(\langle x, u_{i_t}\rangle - v_{i_t})\, u_{i_t}$.
- Regression with streaming data: $\min_x f(x) = \mathbb{E}\big[(\langle x, u\rangle - v)^2\big]$; stochastic gradient: $2(\langle x, u\rangle - v)\, u$.
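[Editorial sketch, not part of the original slides.] The two regression examples differ only in how the sample is obtained: a uniformly random index from a stored batch versus a fresh draw from the stream. The sketch below mirrors the stochastic gradients on the slide; `draw_sample` is a hypothetical callback standing in for the data stream.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_gradient_batch(x, U, V):
    """Batch data: sample an index i_t uniformly at random and return
    the stochastic gradient 2(<x, u_{i_t}> - v_{i_t}) u_{i_t}."""
    i = rng.integers(len(V))
    return 2.0 * (U[i] @ x - V[i]) * U[i]

def stochastic_gradient_streaming(x, draw_sample):
    """Streaming data: draw a fresh sample (u, v) from the stream and return
    the stochastic gradient 2(<x, u> - v) u."""
    u, v = draw_sample()
    return 2.0 * (u @ x - v) * u
```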

  10. Stochastic (sub)gradient descent
The algorithm: $x_{t+1} = \operatorname{argmin}_{x\in X} \|x - (x_t - \gamma_t G_t)\|_2^2$, $t = 1, 2, \ldots$
Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09)). Let $D_X \ge \max_{x_1, x_2\in X}\|x_1 - x_2\|_2$. If $\gamma_t = \sqrt{\Omega^2/(kM^2)}$ for $t = 1, \ldots, k$, and $\bar x_k = \sum_{t=1}^k x_t / k$, then
$\mathbb{E}[f(\bar x_k) - f^*] \le M D_X \sqrt{2/k}, \quad \forall k \ge 1.$
Sampling complexity: # samples = # iterations = $\mathcal{O}(1)\, \frac{M^2 D_X^2}{\epsilon^2}$ to find a solution $\bar x \in X$ such that $\mathbb{E}[f(\bar x) - f^*] \le \epsilon$.
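[Editorial sketch, not part of the original slides.] A minimal implementation of the scheme above: projected stochastic (sub)gradient steps with a constant step size and simple averaging of the iterates. The step size $\gamma = D_X/(M\sqrt{k})$ is used here as a stand-in for the slide's $\sqrt{\Omega^2/(kM^2)}$, and the ball projection is just one concrete choice of feasible set; all names are illustrative.

```python
import numpy as np

def projected_sgd(x0, stochastic_grad, project, M, D_X, k):
    """Projected SGD with a constant step size gamma = D_X / (M * sqrt(k))
    and simple averaging of the iterates."""
    x = np.array(x0, dtype=float)
    x_sum = np.zeros_like(x)
    gamma = D_X / (M * np.sqrt(k))
    for _ in range(k):
        g = stochastic_grad(x)        # stochastic (sub)gradient G(x_t, xi_t)
        x = project(x - gamma * g)    # Euclidean projection onto X
        x_sum += x
    return x_sum / k                  # averaged iterate bar{x}_k

def project_onto_ball(x, r=1.0):
    """Projection onto the Euclidean ball of radius r (so D_X = 2r)."""
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x
```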

  11. Recent developments
- Accelerated SGD (Lan 08 (12)): a stochastic version of Nesterov's accelerated gradient method; a universally optimal method for smooth, nonsmooth and stochastic optimization; the impact of Lipschitz constants vanishes for stochastic problems; popular in training deep neural networks (Sutskever, Martens, Dahl, Hinton 13).
- Adaptive stochastic subgradient (Duchi, Hazan, Singer 11).
- Nonconvex SGD and its acceleration (Ghadimi and Lan 12).
- Adaptive sample sizes (Byrd, Chin, Nocedal and Wu 12).
- SGD for finite-sum problems (Schmidt, Roux and Bach 13).
- Optimal incremental gradient methods (Lan and Zhou 15).

  12. The distributed structure - star topology
- Data sets are distributed over individual agents (devices) in the network.
- All devices are connected to a parameter server (or central cloud), which controls the learning process and updates solutions.
Figure: A cloud-device based distributed learning system.
One example: federated learning.

  13. Stochastic finite-sum optimization
Consider the convex programming (CP) problem given by
$\min_{x\in X} \Psi(x) := \frac{1}{m}\sum_{i=1}^m f_i(x) + \mu\, \omega(x)$.
- $X \subseteq \mathbb{R}^n$ is a closed convex set.
- $f_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, m$, are smooth convex functions with Lipschitz constants $L_i \ge 0$; $f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)$ is smooth convex with Lipschitz constant $L_f \le L = \frac{1}{m}\sum_{i=1}^m L_i$.
- $\omega: X \to \mathbb{R}$ is a strongly convex function with modulus 1 w.r.t. an arbitrary norm $\|\cdot\|$.
- $\mu \ge 0$ is a given constant.
- $f_i(x) = \mathbb{E}[F_i(x, \xi_i)]$ can be represented by a stochastic oracle, providing stochastic (sub)gradients upon request.
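[Editorial sketch, not part of the original slides.] For concreteness, the composite objective $\Psi$ can be evaluated as below, taking $\omega(x) = \tfrac{1}{2}\|x\|_2^2$ as one particular strongly convex prox-function (the slide allows an arbitrary $\omega$ with modulus 1); the function names are illustrative.

```python
import numpy as np

def composite_objective(x, component_fns, mu):
    """Psi(x) = (1/m) * sum_i f_i(x) + mu * omega(x), here with
    omega(x) = 0.5 * ||x||_2^2 (strongly convex with modulus 1 in the
    Euclidean norm); `component_fns` is a list of callables f_i."""
    f_avg = np.mean([f(x) for f in component_fns])
    return f_avg + mu * 0.5 * np.dot(x, x)
```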

  14. Motivation: Randomized incremental gradient (RIG) methods
RIG methods solve $\min_{x\in X} \frac{1}{m}\sum_{i=1}^m f_i(x)$ iteratively; at the $k$-th iteration:
1) Randomly select a component index $i_k$ from $\{1, \ldots, m\}$ (server).
2) Compute the gradient of the component function, $\nabla f_{i_k}(x_k)$ (agents).
3) Set $x_{k+1} = P_X(x_k - \alpha_k \nabla f_{i_k}(x_k))$, where $\alpha_k$ is a positive step size and $P_X(\cdot)$ denotes projection onto $X$ (server).
- Potentially saves on the total number of gradient computations.
- Saves communication costs in the distributed setting.
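[Editorial sketch, not part of the original slides.] The three steps above map directly onto a short loop: the server samples a component index, the corresponding agent supplies the component gradient, and the server takes a projected step. The interface (callable gradients, a projection routine, a step-size sequence) is an assumption for illustration.

```python
import numpy as np

def randomized_incremental_gradient(x0, component_grads, project, alphas, seed=0):
    """Sketch of the RIG scheme: at iteration k the server draws i_k, agent i_k
    returns grad f_{i_k}(x_k), and the server performs
    x_{k+1} = P_X(x_k - alpha_k * grad f_{i_k}(x_k))."""
    rng = np.random.default_rng(seed)
    m = len(component_grads)
    x = np.array(x0, dtype=float)
    for alpha_k in alphas:
        i_k = rng.integers(m)            # server: random component index
        g = component_grads[i_k](x)      # agent i_k: component gradient
        x = project(x - alpha_k * g)     # server: projected update
    return x
```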

  15. Motivation: Existing RIG methods
- SAG/SAGA (Schmidt et al. 13; Defazio et al. 14) and SVRG (Johnson and Zhang 13) obtained an $O\big((m + L/\mu)\log\frac{1}{\epsilon}\big)$ rate of convergence - not an optimal rate.
- RPDG (Lan and Zhou 15; precursors: Zhang and Xiao 14, Dang and Lan 14) - requires an exact gradient evaluation at the initial point and differentiability over $\mathbb{R}^n$.
- The Catalyst scheme (Lin et al. 15) and Katyusha (Allen-Zhu 16) require re-evaluating exact gradients from time to time - synchronous delays.
- No existing studies on the stochastic finite-sum setting, where each $f_i$ is represented by a stochastic oracle, i.e., each agent only has access to noisy first-order information.

  16. Motivation: Road map
Goals:
- Fully distributed (no exact gradient evaluations)
- Direct acceleration with optimal communication costs
- Applicable to stochastic finite-sum problems - optimal sampling complexity
Outline:
- Gradient Extrapolation Method (GEM)
- Interpretation of GEM
- Randomized Gradient Extrapolation Method (RGEM)
