Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers


  1. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers Serge Kas Hanna Email: serge.k.hanna@rutgers.edu Website: tiny.cc/serge-kas-hanna

  2. Joint work with Parimal Parag (IISc), Rawad Bitar (TUM), Venkat Dasari (US Army RL), and Salim El Rouayheb (Rutgers).

  3. Distributed Computing and Applications: the age of Big Data, the Internet of Things (IoT), cloud computing, and outsourcing computations to companies. Focus of this talk: Distributed Machine Learning.

  4. Speeding Up Distributed Machine Learning. The master wants to run a machine learning algorithm on a large dataset A, partitioned into blocks A_1, ..., A_n. The learning process can be made faster by outsourcing computations to worker nodes, which perform local computations on their blocks and communicate the results back to the master. Challenge, stragglers: slow or unresponsive workers can significantly delay the learning process, since the master is only as fast as the slowest worker!

  5. Distributed Machine Learning. The master has a dataset X ∈ ℝ^{m×d} (m data vectors of dimension d) and labels y ∈ ℝ^m, and wants to learn a model x* ∈ ℝ^d that best represents y as a function of X. Optimization problem: find the x* ∈ ℝ^d that minimizes a certain loss function F, i.e., x* = argmin_x F(X, y, x). When the dataset is large (m ≫ 1), computation is a bottleneck. Distributed learning: recruit n workers. 1) Distribute the augmented dataset A = [X | y], split into blocks A_1, ..., A_n, to the n workers; 2) the workers compute on their local data and send the results to the master; 3) the master aggregates the responses and updates the model.
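To make the setup concrete, here is a minimal sketch (not from the talk) of how the augmented dataset A = [X | y] could be partitioned row-wise across the n workers, assuming NumPy, an even split, and placeholder data; all names and sizes are illustrative.

```python
import numpy as np

# Illustrative sketch: row-wise partition of the augmented dataset A = [X | y]
# across n workers. Sizes follow the example used later in the talk.
m, d, n = 2000, 10, 50                       # data points, dimension, workers
rng = np.random.default_rng(0)
X = rng.standard_normal((m, d))              # placeholder data matrix
y = rng.standard_normal(m)                   # placeholder labels

A = np.hstack([X, y[:, None]])               # augmented dataset A = [X | y]
blocks = np.array_split(A, n, axis=0)        # local blocks A_1, ..., A_n

print(len(blocks), blocks[0].shape)          # 50 blocks of shape (40, 11)
```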

  6. GD, SGD & batch SGD. Gradient Descent (GD): choose x_0 randomly, then iterate x_{k+1} = x_k − η∇F(A, x_k), where η is the step size and ∇F is the gradient of F. When the dataset A is large, computing ∇F(A, x_k) is cumbersome. Stochastic Gradient Descent (SGD): at each iteration, update x_k based on one row a of A chosen uniformly at random, x_{k+1} = x_k − η∇F(a, x_k). Batch SGD: choose a batch S of s < m data vectors uniformly at random, x_{k+1} = x_k − η∇F(S, x_k). SGD and batch SGD can still converge to x*, but they require a higher number of iterations.
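The three update rules can be illustrated with a short sketch, assuming a least-squares loss so the gradient has a closed form; the function names and constants are illustrative, not from the talk.

```python
import numpy as np

def grad(A, x):
    """Gradient of the least-squares loss F(A, x) = ||Xb @ x - yb||^2 / (2 * rows)
    for an augmented block A = [Xb | yb]; another loss would be plugged in here."""
    Xb, yb = A[:, :-1], A[:, -1]
    return Xb.T @ (Xb @ x - yb) / len(yb)

def gd_step(A, x, eta):
    return x - eta * grad(A, x)                      # GD: full gradient on all of A

def sgd_step(A, x, eta, rng):
    row = A[rng.integers(len(A))][None, :]           # one row chosen uniformly at random
    return x - eta * grad(row, x)                    # SGD update

def batch_sgd_step(A, x, eta, s, rng):
    idx = rng.choice(len(A), size=s, replace=False)  # random batch S of s < m rows
    return x - eta * grad(A[idx], x)                 # batch SGD update

rng = np.random.default_rng(0)
A = np.hstack([rng.standard_normal((2000, 10)), rng.standard_normal((2000, 1))])
x = np.zeros(10)
x = gd_step(A, x, eta=0.1)
x = sgd_step(A, x, eta=0.1, rng=rng)
x = batch_sgd_step(A, x, eta=0.1, s=32, rng=rng)
```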

  7. Synchronous Distributed GD. Distributed GD: each worker computes a partial gradient on its local data, i.e., worker i computes g_i(x_k) = ∇F(A_i, x_k). The master computes the full gradient g(x_k) = g_1(x_k) + g_2(x_k) + ... + g_n(x_k). At iteration k: 1. the master sends the current model x_k to all workers; 2. the workers compute their partial gradients and send them to the master; 3. the master aggregates the partial gradients by summing them to obtain the full gradient. Aggregation with simple summation works if ∇F is additively separable, e.g. the ℓ2 loss. Straggler problem: the master is only as fast as the slowest worker.
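A minimal sketch of one synchronous distributed GD iteration, simulated serially: each "worker" computes an unnormalized least-squares partial gradient on its block, so simple summation at the master recovers the full gradient (additive separability). Names and constants are illustrative assumptions.

```python
import numpy as np

def partial_grad(A_i, x):
    """Worker i's partial gradient g_i(x) = ∇F(A_i, x) on its local block
    A_i = [X_i | y_i] (unnormalized least squares, so the sum over workers
    equals the full gradient)."""
    X_i, y_i = A_i[:, :-1], A_i[:, -1]
    return X_i.T @ (X_i @ x - y_i)

def sync_distributed_gd_step(blocks, x, eta, m):
    # Master broadcasts x_k, waits for every worker's g_i(x_k), sums them
    # (valid because the loss is additively separable), and takes one GD step.
    # A single straggler delays this whole step.
    g = sum(partial_grad(A_i, x) for A_i in blocks)
    return x - eta * g / m

rng = np.random.default_rng(0)
m, d, n = 2000, 10, 50
A = np.hstack([rng.standard_normal((m, d)), rng.standard_normal((m, 1))])
blocks = np.array_split(A, n)
x = sync_distributed_gd_step(blocks, np.zeros(d), eta=0.1, m=m)
```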

  8. Speeding up Distributed GD: Previous Work. Coding-theoretic approach, gradient coding [Tandon et al. '17], [Yu et al. '17], [Halbawi et al. '18], [Kumar et al. '18], ...: the main idea is to distribute the data redundantly and encode the partial gradients; responses from stragglers are treated as erasures and the full gradient is decoded from the responses of the non-stragglers. Approximate gradient coding [Chen et al. '17], [Wang et al. '19], [Bitar et al. '19], ...: the master does not need to compute the exact gradient (as in SGD); it ignores the responses of the stragglers and obtains an estimate of the full gradient. Fastest-ℓ SGD: wait for the responses of the fastest ℓ < n workers and ignore the responses of the n − ℓ stragglers. Mixed strategies: [Charles et al. '17], [Maity et al. '18], ...

  9. Fastest-ℓ SGD. Our question: how should the value of ℓ be chosen in fastest-ℓ SGD with a fixed step size? Numerical example on synthetic data (linear regression, ℓ2 loss function): error vs. time of fastest-ℓ SGD with n = 50 workers, m = 2000 data points, dimension d = 10, and worker response times iid ∼ Exp(1). Key observation, an error-runtime trade-off: convergence is faster for small ℓ, but the final accuracy is lower. What does theory say? Theorem [Murata 1998]: SGD with a fixed step size goes through an exponential phase, where the error decreases exponentially, and then enters a stationary phase, where x_k oscillates around x*. Previous work on fastest-ℓ SGD: analysis by [Bottou et al. '18] and [Dutta et al. '18] for a predetermined (fixed) ℓ.
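The error-runtime trade-off can be reproduced with a small simulation sketch under the same assumptions as the slide's example (least-squares loss, iid Exp(1) response times): per iteration the master's wait equals the ℓ-th order statistic of the workers' response times, and only the fastest ℓ partial gradients are aggregated. All names and constants are illustrative.

```python
import numpy as np

def fastest_l_sgd(blocks, x0, eta, ell, iters, rng):
    """Sketch of fastest-ell SGD: per iteration the master waits only for the
    ell fastest of the n workers (iid Exp(1) response times), so each iteration
    costs the ell-th order statistic in wall-clock time."""
    x, t, n = x0.copy(), 0.0, len(blocks)
    for _ in range(iters):
        times = rng.exponential(1.0, size=n)
        fastest = np.argsort(times)[:ell]      # indices of the ell fastest workers
        t += times[fastest[-1]]                # wait for the slowest of those ell
        g, rows = np.zeros_like(x), 0
        for i in fastest:                      # aggregate only their partial gradients
            X_i, y_i = blocks[i][:, :-1], blocks[i][:, -1]
            g += X_i.T @ (X_i @ x - y_i)
            rows += len(y_i)
        x -= eta * g / rows
    return x, t

rng = np.random.default_rng(0)
m, d, n = 2000, 10, 50
X = rng.standard_normal((m, d))
y = X @ rng.standard_normal(d)
blocks = np.array_split(np.hstack([X, y[:, None]]), n)
for ell in (1, 10, 50):                        # small ell: faster per iteration, noisier
    x, t = fastest_l_sgd(blocks, np.zeros(d), 0.05, ell, 200, rng)
    print(ell, round(t, 1), round(float(np.mean((X @ x - y) ** 2)), 4))
```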

  10. Our Contribution: Adaptive Fastest-ℓ SGD. Our goal: speed up distributed SGD in the presence of stragglers, i.e., achieve lower error in less time. Approach: adapt the value of ℓ throughout the runtime so as to follow the lower envelope of the fastest-ℓ error curves and maximize the time spent in the exponential-decrease phase. Adaptive scheme: start with the smallest ℓ and then increase ℓ gradually every time the error hits a plateau. Challenge: in practice we do not know the error, because we do not know x*. Our results: 1. Theoretical: derive an upper bound on the error of fastest-ℓ SGD as a function of time, and determine the bound-optimal switching times. 2. Practical: devise an algorithm for adaptive fastest-ℓ SGD based on a statistical heuristic.

  11. Our Theoretical Results.
Theorem 1 [Error vs. time of fastest-ℓ SGD]: Under certain assumptions on the loss function, the error of fastest-ℓ SGD with fixed step size η after wall-clock time t satisfies
𝔼[F(x_t) − F(x*)] ≤ ησ²L / (2cℓs) + (1 − ηc)^{(1−ε)t/μ_ℓ} · ( F(x_0) − F(x*) − ησ²L / (2cℓs) ),
with high probability for large t, where x_t is the model after the J(t) iterations completed in time t, 0 < ε ≪ 1 is a constant error term, μ_ℓ is the average of the ℓ-th order statistic of the workers' random response times, s is the batch size processed per worker, and L, c, and σ² are the smoothness, strong-convexity, and gradient-variance constants from the assumptions on F.
Theorem 2 [Bound-optimal switching times]: The bound-optimal switching times t_ℓ, ℓ = 1, ..., n − 1, at which the master should switch from waiting for the fastest ℓ workers to waiting for the fastest ℓ + 1 workers, are given by
t_ℓ = t_{ℓ−1} + ( μ_ℓ / (−ln(1 − ηc)) ) · [ ln(μ_{ℓ+1} − μ_ℓ) − ln(ησ²L μ_ℓ) + ln( 2cℓ(ℓ+1)s (F(x_{t_{ℓ−1}}) − F(x*)) − η(ℓ+1)Lσ² ) ],
where t_0 = 0.

  12. Example on Theorem 2. Theorem 2 [Bound-optimal switching times] (restated): the bound-optimal switching times t_ℓ, ℓ = 1, ..., n − 1, with t_0 = 0, are given by the closed-form expression on the previous slide. Example with iid exponential response times: evaluate the upper bound of Theorem 1 and apply Theorem 2 to obtain the switching times.
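As a worked illustration of Theorem 2 (a sketch, not the authors' code), the switching times can be evaluated numerically for iid Exp(1) response times, where μ_ℓ = Σ_{i=0}^{ℓ−1} 1/(n−i). The constants η, L, c, σ², s and the initial error are placeholder assumptions, and the unknown error F(x_{t_{ℓ−1}}) − F(x*) is replaced by the value of the Theorem 1 upper bound at the previous switch.

```python
import numpy as np

def switching_times(n, eta, L, c, sigma2, s, e0):
    """Evaluate the Theorem 2 switching times for n workers with iid Exp(1)
    response times, where mu_ell = sum_{i=0}^{ell-1} 1/(n-i) is the mean of the
    ell-th order statistic. The true error F(x_{t_{ell-1}}) - F(x*) is replaced
    by the value of the Theorem 1 bound at the previous switch."""
    mu = np.array([sum(1.0 / (n - i) for i in range(ell)) for ell in range(1, n + 1)])
    t, times, err = 0.0, [], e0
    for ell in range(1, n):                    # switch from fastest-ell to fastest-(ell+1)
        floor = eta * sigma2 * L / (2 * c * ell * s)            # noise floor for fastest-ell
        target = (eta * sigma2 * L * mu[ell - 1]
                  / (2 * c * s * ell * (ell + 1) * (mu[ell] - mu[ell - 1])))
        if err - floor <= target:              # bound-optimal point already passed: switch now
            times.append(t)
        else:
            t += mu[ell - 1] * np.log(target / (err - floor)) / np.log(1 - eta * c)
            times.append(t)
        err = floor + target                   # bound value at the switch
    return times

# Placeholder constants (not from the paper), matching n = 50 workers.
print(np.round(switching_times(n=50, eta=0.005, L=1.0, c=1.0, sigma2=1.0, s=40, e0=100.0), 2))
```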

  13. Algorithm for Adaptive Fastest-ℓ SGD. Start with ℓ = 1 and then increase ℓ every time a phase transition is detected. Phase-transition detection: monitor the sign of the inner product of consecutive gradients (stochastic approximation [Pflug 1990]; phase-transition detection [Chee and Toulis '18]). In the exponential phase, consecutive gradients are likely to point in the same direction, so ∇F(x_k)ᵀ∇F(x_{k+1}) > 0; in the stationary phase, consecutive gradients are likely to point in opposite directions due to oscillation, so ∇F(x_k)ᵀ∇F(x_{k+1}) < 0. Initialize a counter to zero and update it at every iteration: counter + 1 if ∇F(x_k)ᵀ∇F(x_{k+1}) < 0, and counter − 1 if ∇F(x_k)ᵀ∇F(x_{k+1}) > 0. Declare a phase transition if the counter goes above a certain threshold, and then increase ℓ.
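A compact sketch of the adaptive heuristic described above, combining fastest-ℓ aggregation with the gradient-sign counter; the threshold value, the least-squares gradient, and the reset of the counter after each switch are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np

def adaptive_fastest_l_sgd(blocks, x0, eta, iters, threshold, rng):
    """Sketch of adaptive fastest-ell SGD: start with ell = 1 and increase ell
    whenever a counter driven by the sign of inner products between consecutive
    aggregated gradients exceeds a threshold (Pflug-style phase-transition test)."""
    n = len(blocks)
    x, ell, counter, prev_g = x0.copy(), 1, 0, None
    for _ in range(iters):
        fastest = np.argsort(rng.exponential(1.0, size=n))[:ell]
        Xb = np.vstack([blocks[i][:, :-1] for i in fastest])
        yb = np.concatenate([blocks[i][:, -1] for i in fastest])
        g = Xb.T @ (Xb @ x - yb) / len(yb)           # aggregated gradient from the fastest ell
        if prev_g is not None:
            counter += 1 if g @ prev_g < 0 else -1   # opposite directions hint at stationarity
        if counter > threshold and ell < n:          # phase transition detected: enlarge ell
            ell, counter = ell + 1, 0
        prev_g = g
        x -= eta * g
    return x, ell

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 10))
y = X @ rng.standard_normal(10)
blocks = np.array_split(np.hstack([X, y[:, None]]), 50)
x, ell = adaptive_fastest_l_sgd(blocks, np.zeros(10), 0.05, 500, threshold=5, rng=rng)
print(ell, float(np.mean((X @ x - y) ** 2)))
```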

  14. Simulation Results: Non-adaptive vs. Adaptive Fastest-ℓ SGD. Simulation on synthetic data: generate X by picking m data vectors uniformly at random from {1, 2, ..., 10}^d; pick x* uniformly at random from {1, 2, ..., 100}^d; generate the labels as y ∼ 𝒩(Xx*, 1); loss function: ℓ2 loss (least squares); the workers' response times are iid ∼ Exp(1) and independent across iterations. Simulation results on adaptive fastest-ℓ SGD for n = 50 workers, m = 2000 data vectors, dimension d = 100, and step size η = 0.005.
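A sketch of the synthetic-data generation described above, assuming NumPy and a fixed seed (both assumptions): entries of X are drawn uniformly from {1, ..., 10}, entries of the true model from {1, ..., 100}, and the labels get unit-variance Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n_workers, eta = 2000, 100, 50, 0.005       # slide parameters (eta = step size)

X = rng.integers(1, 11, size=(m, d)).astype(float)   # entries uniform on {1, ..., 10}
x_star = rng.integers(1, 101, size=d).astype(float)  # true model, entries on {1, ..., 100}
y = X @ x_star + rng.standard_normal(m)              # labels y ~ N(X x*, 1)

A = np.hstack([X, y[:, None]])                    # augmented dataset [X | y]
blocks = np.array_split(A, n_workers)             # local blocks A_1, ..., A_50
```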
