SLIDE 1

Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers

2020 IEEE International Conference on Acoustics, Speech, and Signal Processing

Serge Kas Hanna

Email: serge.k.hanna@rutgers.edu Website: tiny.cc/serge-kas-hanna

SLIDE 2

Joint work with

Salim El Rouayheb (Rutgers), Rawad Bitar (TUM), Parimal Parag (IISc), Venkat Dasari (US Army RL)

SLIDE 3

Distributed Computing and Applications

Focus of this talk: Distributed Machine Learning

  • The Age of Big Data
  • Internet of Things (IoT)
  • Cloud computing
  • Outsourcing computations to companies
  • Distributed Machine Learning

SLIDE 4

Speeding Up Distributed Machine Learning

[Figure: the master holds a dataset A partitioned into A_1, A_2, …, A_n and distributes the parts to worker nodes 1, 2, 3, …]

The master wants to run a machine learning algorithm on a large dataset A

The learning process can be made faster by outsourcing computations to worker nodes: workers perform local computations and communicate the results back to the master

Challenge: stragglers


Stragglers: slow or unresponsive workers can significantly delay the learning process. The master is only as fast as the slowest worker!

SLIDE 5

Distributed Machine Learning

Ø Master has a data matrix X ∈ ℝ^{m×d}, labels y ∈ ℝ^m, and wants to learn a model w* ∈ ℝ^d that best represents y as a function of X
Ø When the dataset is large (m ≫ 1), computation is a bottleneck

Find w* ∈ ℝ^d that minimizes a certain loss function F:

w* = arg min_w F(X, y, w)

Ø Distributed learning: recruit workers

[Figure: the master holds X (m data vectors of dimension d) and labels y (m labels); the dataset A = [X | y] is partitioned into A_1, A_2, …, A_n and distributed to workers 1, 2, …, n]

1) Distribute the data to the n workers
2) Workers compute on their local data and send the results to the master
3) Master aggregates the responses and updates the model

Dataset: A = [X | y]

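The loss F is kept generic on the slides; as a concrete illustration, here is a minimal sketch assuming the ℒ_2 (least-squares) loss used later in the talk, with X, y, w as above (the function names are illustrative):

```python
import numpy as np

def loss(X, y, w):
    """L2 (least-squares) loss: F(X, y, w) = ||X w - y||^2 / (2m)."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def gradient(X, y, w):
    """Full gradient of the least-squares loss with respect to the model w."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m
```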
SLIDE 6

GD, SGD & batch SGD

Ø Gradient Descent (GD): choose w_0 randomly, then iterate

w_{j+1} = w_j − η ∇F(A, w_j),

where η is the step size and ∇F is the gradient of F

Ø When the dataset A is large, computing ∇F(A, w) is cumbersome

Ø Stochastic Gradient Descent (SGD): at each iteration, update w_j based on a single row a of A (a data vector chosen uniformly at random)

w_{j+1} = w_j − η ∇F(a, w_j)

Ø Batch SGD: choose a batch S of s < m data vectors uniformly at random

w_{j+1} = w_j − η ∇F(S, w_j)

Ø SGD & batch SGD can converge to w*, but with a higher number of iterations

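A minimal sketch of the batch SGD update above for the least-squares loss (illustrative names; setting s = 1 recovers SGD, and using all m rows each iteration recovers GD):

```python
import numpy as np

def batch_sgd(X, y, w0, eta, s, iters, seed=0):
    """Batch SGD: at every iteration sample s of the m rows uniformly at
    random and step along the negative stochastic gradient."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    w = w0.copy()
    for _ in range(iters):
        idx = rng.choice(m, size=s, replace=False)   # random batch of s rows
        g = X[idx].T @ (X[idx] @ w - y[idx]) / s     # stochastic gradient estimate
        w = w - eta * g                              # w_{j+1} = w_j - eta * g
    return w
```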
SLIDE 7

Synchronous Distributed GD

Ø Distributed GD: each worker computes a partial gradient on its local data

[Figure: the master sends the current model w_j to each worker; worker i computes a partial gradient g_i(w_j) on its local partition A_i of the dataset A and sends it back]

Worker i computes g_i(w_j) = ∇F(A_i, w_j); the master computes the full gradient

g(w_j) = g_1(w_j) + g_2(w_j) + ⋯ + g_n(w_j)

Ø Aggregation with simple summation works if ∇F is additively separable, e.g. the ℒ_2 loss
Ø Straggler problem: the master is as fast as the slowest worker


Ø At iteration j:

  • 1. Master sends the current model w_j to all workers
  • 2. Workers compute their partial gradients and send them to the master
  • 3. Master aggregates the partial gradients by summing them to obtain the full gradient
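A single-machine sketch of this synchronous loop, with the n workers simulated sequentially and ℒ_2 partial gradients assumed (names such as parts and sync_distributed_gd are illustrative):

```python
import numpy as np

def sync_distributed_gd(parts, w0, eta, iters):
    """Synchronous distributed GD. parts = [(X_1, y_1), ..., (X_n, y_n)] are the
    local partitions A_1, ..., A_n; at each iteration every worker computes a
    partial gradient g_i(w_j) = grad F(A_i, w_j) and the master sums them."""
    w = w0.copy()
    for _ in range(iters):
        partial = [Xi.T @ (Xi @ w - yi) for (Xi, yi) in parts]  # workers' partial gradients
        g = sum(partial)        # master aggregates: g = g_1 + ... + g_n
        w = w - eta * g         # master updates the model
    return w
```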
SLIDE 8

Speeding up Distributed GD: Previous Work

Ø Coding-theoretic approach: Gradient coding [Tandon et al. '17], [Yu et al. '17], [Halbawi et al. '18], [Kumar et al. '18], …

Ø Approximate gradient coding: [Chen et al. '17], [Wang et al. '19], [Bitar et al. '19], …

  • Main idea: the master does not need to compute the exact gradient, e.g. SGD
  • Ignore the responses of stragglers and obtain an estimate of the full gradient
  • Fastest-k SGD: wait for the responses of the fastest k < n workers and ignore the responses of the n − k stragglers

Ø Mixed strategies: [Charles et al. '17], [Maity et al. '18], …

  • Main idea: distribute the data redundantly and encode the partial gradients
  • Responses from stragglers are treated as erasures and the full gradient is decoded from the responses of the non-stragglers
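A sketch of one fastest-k iteration under the simulation model used later in the talk (iid Exp(1) response times); averaging the k returned partial gradients and the helper name fastest_k_step are assumptions made for illustration:

```python
import numpy as np

def fastest_k_step(parts, w, eta, k, rng):
    """One fastest-k SGD iteration: simulate a response time per worker, keep
    the partial gradients of the k fastest workers, ignore the n - k
    stragglers, and charge the k-th fastest time as the iteration's delay."""
    n = len(parts)
    times = rng.exponential(1.0, size=n)       # simulated iid Exp(1) response times
    fastest = np.argsort(times)[:k]            # indices of the k fastest workers
    g = sum(parts[i][0].T @ (parts[i][0] @ w - parts[i][1]) for i in fastest) / k
    return w - eta * g, np.sort(times)[k - 1]  # new model, wall-clock delay
```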

SLIDE 9

Fastest-k SGD

Ø Our question: how should the value of k be chosen in fastest-k SGD with a fixed step size?
Ø Previous work on fastest-k SGD: analysis by [Bottou et al. '18] & [Dutta et al. '18] for a predetermined (fixed) k

Ø Numerical example on synthetic data: linear regression with the ℒ_2 loss function

  • n = 50 workers, d = 10 dimension, m = 2000 data points
  • Response times of the workers are iid ∼ exp(1)

[Figure: error vs. time of fastest-k SGD for several values of k]

Key observation: error-runtime trade-off, convergence is faster for small k but the final accuracy is lower

Ø What does theory say?

Theorem [Murata 1998]: SGD with a fixed step size goes through an exponential phase where the error decreases exponentially, then enters a stationary phase where w_j oscillates around w*

SLIDE 10

Our Contribution: Adaptive fastest-k SGD

Ø Our goal: speed up distributed SGD in the presence of stragglers, i.e., achieve a lower error in less time

Ø Approach: adapt the value of k throughout the runtime to maximize the time spent in the exponential-decrease phase
Ø Adaptive: start with the smallest k and then increase k gradually every time the error hits a plateau
Ø Challenge: in practice we do not know the error because we do not know w*

Ø Our results:

  • 1. Theoretical:
    • Derive an upper bound on the error of fastest-k SGD as a function of time
    • Determine the bound-optimal switching times
  • 2. Practical: devise an algorithm for adaptive fastest-k SGD based on a statistical heuristic


[Figure: envelope of the fastest-k error curves]

SLIDE 11

Our Theoretical Results

Theorem 1 [Error vs. Time of fastest-k SGD]: Under certain assumptions on the loss function, the error of fastest-k SGD with fixed step size η after wall-clock time t satisfies

𝔼[F(w_{J(t)}) − F(w*) | J(t)] ≤ ηLσ²/(2cks) + (1 − ηc)^((1−ε)t/μ_k) · [F(w_0) − F(w*) − ηLσ²/(2cks)],

with high probability for large t, where 0 < ε ≪ 1 is a constant error term, J(t) is the number of iterations completed in time t, and μ_k is the average of the k-th order statistic of the random response times.

Theorem 2 [Bound-optimal switching times]: The bound-optimal switching times t_k, k = 1, …, n−1, at which the master should switch from waiting for the fastest k workers to waiting for the fastest k+1 workers, are given by

t_k = t_{k−1} + (μ_k / (−ln(1 − ηc))) · [ln(μ_{k+1} − μ_k) − ln(ηLσ²μ_k) + ln(2ck(k+1)s·(F(w_{J(t_{k−1})}) − F(w*)) − ηL(k+1)σ²)],

where t_0 = 0.

SLIDE 12

Example on Theorem 2

Theorem 2 [Bound-optimal switching times] (restated): the bound-optimal switching times t_k, k = 1, …, n−1, are given by the expression on the previous slide, with t_0 = 0.

Ø Example with iid exponential response times: evaluate the upper bound and apply Theorem 2
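The bound and Theorem 2 involve μ_k, the expected k-th order statistic of the response times. For iid Exp(1) times this has a simple closed form (a standard order-statistics identity, not spelled out on the slide); a minimal sketch:

```python
def mu(k, n):
    """Expected k-th order statistic of n iid Exp(1) response times:
    mu_k = 1/n + 1/(n-1) + ... + 1/(n-k+1)."""
    return sum(1.0 / (n - i) for i in range(k))

# With n = 50 workers, the average per-iteration delay of fastest-k SGD grows
# with k, e.g. mu_1 = 0.02 while mu_50 is the 50th harmonic number (about 4.5).
print([round(mu(k, 50), 3) for k in (1, 10, 25, 50)])
```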

SLIDE 13

Algorithm for Adaptive fastest-k SGD

Ø Start with k = 1 and then increase k every time a phase transition is detected
Ø Phase-transition detection: monitor the sign of the inner product of consecutive stochastic gradients

  • Exponential phase: consecutive gradients are likely to point in the same direction ⇒ ∇F(w_j)ᵀ ∇F(w_{j+1}) > 0
  • Stationary phase: consecutive gradients are likely to point in opposite directions due to oscillation ⇒ ∇F(w_j)ᵀ ∇F(w_{j+1}) < 0

Ø Initialize a counter to zero and update:

counter ← counter + 1, if ∇F(w_j)ᵀ ∇F(w_{j+1}) < 0
counter ← counter − 1, if ∇F(w_j)ᵀ ∇F(w_{j+1}) > 0

Ø Declare a phase transition if the counter goes above a certain threshold, then increase k

Stochastic approximation: [Pflug 1990]; phase-transition detection: [Chee and Toulis '18]
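A sketch of the resulting adaptive loop; step_fn is an assumed helper that runs one fastest-k iteration and returns the new model, the stochastic gradient it applied, and the iteration's delay (e.g. a variant of the fastest_k_step sketch above):

```python
def adaptive_fastest_k(step_fn, w0, n, threshold, max_time):
    """Adaptive fastest-k SGD sketch: start with k = 1 and raise k by one
    whenever a phase transition (error plateau) is detected from the signs of
    inner products of consecutive stochastic gradients."""
    w, k, counter, t, g_prev = w0.copy(), 1, 0, 0.0, None
    while t < max_time:
        w, g, delay = step_fn(w, k)      # one fastest-k iteration
        t += delay
        if g_prev is not None:
            # same direction -> exponential phase; opposite -> oscillation
            counter += 1 if float(g @ g_prev) < 0 else -1
        g_prev = g
        if counter > threshold and k < n:
            k, counter, g_prev = k + 1, 0, None   # plateau: wait for one more worker
    return w
```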

SLIDE 14

Simulation Results: Non-adaptive vs. Adaptive Fastest-k SGD

Ø Simulation on synthetic data:

  • Generate X: pick m data vectors with entries chosen uniformly at random from {1, 2, …, 10}
  • Pick w* with entries chosen uniformly at random from {1, 2, …, 100}
  • Generate labels: y ∼ 𝒩(Xw*, 1)
  • Loss function: ℒ_2 loss (least-squares error)

Ø Simulation results on adaptive fastest-k SGD for n = 50 workers

  • Workers' response times are iid ∼ exp(1) and independent across iterations

n = 50 workers, d = 100 dimension, m = 2000 data vectors, η = 0.005 step size

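A sketch of this data-generation step with the parameters listed above; the round-robin split of the dataset among the n workers is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, n = 2000, 100, 50                                  # data points, dimension, workers
X = rng.integers(1, 11, size=(m, d)).astype(float)       # entries uniform in {1, ..., 10}
w_star = rng.integers(1, 101, size=d).astype(float)      # entries uniform in {1, ..., 100}
y = X @ w_star + rng.standard_normal(m)                  # labels y ~ N(X w*, 1)

# Partition the dataset A = [X | y] among the n workers (round-robin split)
parts = [(X[i::n], y[i::n]) for i in range(n)]
```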
SLIDE 15

Simulation Results: Async vs. Adaptive Fastest-k SGD


Ø Asynchronous Stochastic Gradient Descent: the master updates the model w_j and sends the new model w_{j+1} every time a worker finishes its partial gradient computation
Ø Workers who have not finished continue working on the old model
Ø Simulation results:

n = 50 workers, d = 100 dimension, m = 2000 data vectors, η = 0.002 step size

SLIDE 16

Summary and Future Work

Ø Speeding up distributed machine learning

  • Straggler problem
  • Adaptive fastest-k SGD for minimizing delay in the presence of stragglers
  • Theoretical results: bounds on the error & bound-optimal switching times
  • Novel realizable algorithm based on a statistical heuristic
  • Numerical results showing the gain with respect to non-adaptive SGD

Ø Future work

  • Simulations on real data (MNIST, CIFAR, etc.)
  • Mixed strategies: coding + adaptivity
  • Variable step size