SLIDE 1

Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers

2020 IEEE International Conference on Acoustics, Speech, and Signal Processing

Serge Kas Hanna

Email: serge.k.hanna@rutgers.edu Website: tiny.cc/serge-kas-hanna

SLIDE 2

Joint work with

Salim El Rouayheb (Rutgers), Rawad Bitar (TUM), Parimal Parag (IISc), Venkat Dasari (US Army RL)

SLIDE 3

Distributed Computing and Applications

Focus of this talk: Distributed Machine Learning

  • The Age of Big Data
  • Internet of Things (IoT)
  • Cloud computing
  • Outsourcing computations to companies
  • Distributed Machine Learning

SLIDE 4

Speeding Up Distributed Machine Learning

[Figure: the master holds a dataset A partitioned into A_1, A_2, …, A_n and distributes the parts to worker nodes 1, 2, 3, …]

The master wants to run a machine learning algorithm on a large dataset A

The learning process can be made faster by outsourcing computations to worker nodes: workers perform local computations and communicate the results back to the master

Challenge: stragglers


Stragglers: slow or unresponsive workers can significantly delay the learning process. The master is only as fast as the slowest worker!

SLIDE 5

Distributed Machine Learning

Ø Master has a data matrix X ∈ ℝ^{m×d}, labels y ∈ ℝ^m, and wants to learn a model w* ∈ ℝ^d that best represents y as a function of X
Ø When the dataset is large (m ≫ 1), computation is a bottleneck

Find w* ∈ ℝ^d that minimizes a certain loss function F:

w* = arg min_w F(X, y, w)

Ø Distributed learning: recruit workers

[Figure: the master holds X (m data vectors of dimension d) and labels y (m labels); the dataset A = [X | y] is partitioned into A_1, A_2, …, A_n and distributed to workers 1, 2, …, n]

1) Distribute the data to the n workers
2) Workers compute on their local data and send the results to the master
3) Master aggregates the responses and updates the model

Dataset: A = [X | y]

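The loss F is kept generic on the slides; as a concrete illustration, here is a minimal sketch assuming the ℒ_2 (least-squares) loss used later in the talk, with X, y, w as above (the function names are illustrative):

```python
import numpy as np

def loss(X, y, w):
    """L2 (least-squares) loss: F(X, y, w) = ||X w - y||^2 / (2m)."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def gradient(X, y, w):
    """Full gradient of the least-squares loss with respect to the model w."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m
```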
SLIDE 6

GD, SGD & batch SGD

Ø Gradient Descent (GD): choose w_0 randomly, then iterate

w_{j+1} = w_j − η ∇F(A, w_j),

where η is the step size and ∇F is the gradient of F

Ø When the dataset A is large, computing ∇F(A, w) is cumbersome

Ø Stochastic Gradient Descent (SGD): at each iteration, update w_j based on a single row a of A (a data vector chosen uniformly at random)

w_{j+1} = w_j − η ∇F(a, w_j)

Ø Batch SGD: choose a batch S of s < m data vectors uniformly at random

w_{j+1} = w_j − η ∇F(S, w_j)

Ø SGD & batch SGD can converge to w*, but with a higher number of iterations

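A minimal sketch of the batch SGD update above for the least-squares loss (illustrative names; setting s = 1 recovers SGD, and using all m rows each iteration recovers GD):

```python
import numpy as np

def batch_sgd(X, y, w0, eta, s, iters, seed=0):
    """Batch SGD: at every iteration sample s of the m rows uniformly at
    random and step along the negative stochastic gradient."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    w = w0.copy()
    for _ in range(iters):
        idx = rng.choice(m, size=s, replace=False)   # random batch of s rows
        g = X[idx].T @ (X[idx] @ w - y[idx]) / s     # stochastic gradient estimate
        w = w - eta * g                              # w_{j+1} = w_j - eta * g
    return w
```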
SLIDE 7

Synchronous Distributed GD

Ø Distributed GD: each worker computes a partial gradient on its local data

[Figure: the master sends the current model w_j to each worker; worker i computes a partial gradient g_i(w_j) on its local partition A_i of the dataset A and sends it back]

Worker i computes g_i(w_j) = ∇F(A_i, w_j); the master computes the full gradient

g(w_j) = g_1(w_j) + g_2(w_j) + ⋯ + g_n(w_j)

Ø Aggregation with simple summation works if ∇F is additively separable, e.g. the ℒ_2 loss
Ø Straggler problem: the master is as fast as the slowest worker


Ø At iteration j:

  • 1. Master sends the current model w_j to all workers
  • 2. Workers compute their partial gradients and send them to the master
  • 3. Master aggregates the partial gradients by summing them to obtain the full gradient
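A single-machine sketch of this synchronous loop, with the n workers simulated sequentially and ℒ_2 partial gradients assumed (names such as parts and sync_distributed_gd are illustrative):

```python
import numpy as np

def sync_distributed_gd(parts, w0, eta, iters):
    """Synchronous distributed GD. parts = [(X_1, y_1), ..., (X_n, y_n)] are the
    local partitions A_1, ..., A_n; at each iteration every worker computes a
    partial gradient g_i(w_j) = grad F(A_i, w_j) and the master sums them."""
    w = w0.copy()
    for _ in range(iters):
        partial = [Xi.T @ (Xi @ w - yi) for (Xi, yi) in parts]  # workers' partial gradients
        g = sum(partial)        # master aggregates: g = g_1 + ... + g_n
        w = w - eta * g         # master updates the model
    return w
```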
SLIDE 8

Speeding up Distributed GD: Previous Work

Ø Coding-theoretic approach: Gradient coding [Tandon et al. '17], [Yu et al. '17], [Halbawi et al. '18], [Kumar et al. '18], …

Ø Approximate gradient coding: [Chen et al. '17], [Wang et al. '19], [Bitar et al. '19], …

  • Main idea: the master does not need to compute the exact gradient, e.g. SGD
  • Ignore the responses of stragglers and obtain an estimate of the full gradient
  • Fastest-k SGD: wait for the responses of the fastest k < n workers and ignore the responses of the n − k stragglers

Ø Mixed strategies: [Charles et al. '17], [Maity et al. '18], …

  • Main idea: distribute the data redundantly and encode the partial gradients
  • Responses from stragglers are treated as erasures and the full gradient is decoded from the responses of the non-stragglers
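A sketch of one fastest-k iteration under the simulation model used later in the talk (iid Exp(1) response times); averaging the k returned partial gradients and the helper name fastest_k_step are assumptions made for illustration:

```python
import numpy as np

def fastest_k_step(parts, w, eta, k, rng):
    """One fastest-k SGD iteration: simulate a response time per worker, keep
    the partial gradients of the k fastest workers, ignore the n - k
    stragglers, and charge the k-th fastest time as the iteration's delay."""
    n = len(parts)
    times = rng.exponential(1.0, size=n)       # simulated iid Exp(1) response times
    fastest = np.argsort(times)[:k]            # indices of the k fastest workers
    g = sum(parts[i][0].T @ (parts[i][0] @ w - parts[i][1]) for i in fastest) / k
    return w - eta * g, np.sort(times)[k - 1]  # new model, wall-clock delay
```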

SLIDE 9

Fastest-k SGD

Ø Our question: how should the value of k be chosen in fastest-k SGD with a fixed step size?
Ø Previous work on fastest-k SGD: analysis by [Bottou et al. '18] & [Dutta et al. '18] for a predetermined (fixed) k

Ø Numerical example on synthetic data: linear regression with the ℒ_2 loss function

  • n = 50 workers, d = 10 dimension, m = 2000 data points
  • Response times of the workers are iid ∼ exp(1)

[Figure: error vs. time of fastest-k SGD for several values of k]

Key observation: error-runtime trade-off, convergence is faster for small k but the final accuracy is lower

Ø What does theory say?

Theorem [Murata 1998]: SGD with a fixed step size goes through an exponential phase where the error decreases exponentially, then enters a stationary phase where w_j oscillates around w*

SLIDE 10

Our Contribution: Adaptive fastest-k SGD

Ø Our goal: speed up distributed SGD in the presence of stragglers, i.e., achieve a lower error in less time

Ø Approach: adapt the value of k throughout the runtime to maximize the time spent in the exponential-decrease phase
Ø Adaptive: start with the smallest k and then increase k gradually every time the error hits a plateau
Ø Challenge: in practice we do not know the error because we do not know w*

Ø Our results:

  • 1. Theoretical:
    • Derive an upper bound on the error of fastest-k SGD as a function of time
    • Determine the bound-optimal switching times
  • 2. Practical: devise an algorithm for adaptive fastest-k SGD based on a statistical heuristic


[Figure: envelope of the fastest-k error curves]

SLIDE 11

Our Theoretical Results

Theorem 1 [Error vs. Time of fastest-k SGD]: Under certain assumptions on the loss function, the error of fastest-k SGD with fixed step size η after wall-clock time t satisfies

𝔼[F(w_{J(t)}) − F(w*) | J(t)] ≤ ηLσ²/(2cks) + (1 − ηc)^((1−ε)t/μ_k) · [F(w_0) − F(w*) − ηLσ²/(2cks)],

with high probability for large t, where 0 < ε ≪ 1 is a constant error term, J(t) is the number of iterations completed in time t, and μ_k is the average of the k-th order statistic of the random response times.

Theorem 2 [Bound-optimal switching times]: The bound-optimal switching times t_k, k = 1, …, n−1, at which the master should switch from waiting for the fastest k workers to waiting for the fastest k+1 workers, are given by

t_k = t_{k−1} + (μ_k / (−ln(1 − ηc))) · [ln(μ_{k+1} − μ_k) − ln(ηLσ²μ_k) + ln(2ck(k+1)s·(F(w_{J(t_{k−1})}) − F(w*)) − ηL(k+1)σ²)],

where t_0 = 0.

SLIDE 12

Example on Theorem 2

Theorem 2 [Bound-optimal switching times] (restated): the bound-optimal switching times t_k, k = 1, …, n−1, are given by the expression on the previous slide, with t_0 = 0.

Ø Example with iid exponential response times: evaluate the upper bound and apply Theorem 2
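The bound and Theorem 2 involve μ_k, the expected k-th order statistic of the response times. For iid Exp(1) times this has a simple closed form (a standard order-statistics identity, not spelled out on the slide); a minimal sketch:

```python
def mu(k, n):
    """Expected k-th order statistic of n iid Exp(1) response times:
    mu_k = 1/n + 1/(n-1) + ... + 1/(n-k+1)."""
    return sum(1.0 / (n - i) for i in range(k))

# With n = 50 workers, the average per-iteration delay of fastest-k SGD grows
# with k, e.g. mu_1 = 0.02 while mu_50 is the 50th harmonic number (about 4.5).
print([round(mu(k, 50), 3) for k in (1, 10, 25, 50)])
```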

SLIDE 13

Algorithm for Adaptive fastest-k SGD

Ø Start with k = 1 and then increase k every time a phase transition is detected
Ø Phase-transition detection: monitor the sign of the inner product of consecutive stochastic gradients

  • Exponential phase: consecutive gradients are likely to point in the same direction ⇒ ∇F(w_j)ᵀ ∇F(w_{j+1}) > 0
  • Stationary phase: consecutive gradients are likely to point in opposite directions due to oscillation ⇒ ∇F(w_j)ᵀ ∇F(w_{j+1}) < 0

Ø Initialize a counter to zero and update:

counter ← counter + 1, if ∇F(w_j)ᵀ ∇F(w_{j+1}) < 0
counter ← counter − 1, if ∇F(w_j)ᵀ ∇F(w_{j+1}) > 0

Ø Declare a phase transition if the counter goes above a certain threshold, then increase k

Stochastic approximation: [Pflug 1990]; phase-transition detection: [Chee and Toulis '18]
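A sketch of the resulting adaptive loop; step_fn is an assumed helper that runs one fastest-k iteration and returns the new model, the stochastic gradient it applied, and the iteration's delay (e.g. a variant of the fastest_k_step sketch above):

```python
def adaptive_fastest_k(step_fn, w0, n, threshold, max_time):
    """Adaptive fastest-k SGD sketch: start with k = 1 and raise k by one
    whenever a phase transition (error plateau) is detected from the signs of
    inner products of consecutive stochastic gradients."""
    w, k, counter, t, g_prev = w0.copy(), 1, 0, 0.0, None
    while t < max_time:
        w, g, delay = step_fn(w, k)      # one fastest-k iteration
        t += delay
        if g_prev is not None:
            # same direction -> exponential phase; opposite -> oscillation
            counter += 1 if float(g @ g_prev) < 0 else -1
        g_prev = g
        if counter > threshold and k < n:
            k, counter, g_prev = k + 1, 0, None   # plateau: wait for one more worker
    return w
```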

SLIDE 14

Simulation Results: Non-adaptive vs. Adaptive Fastest-k SGD

Ø Simulation on synthetic data:

  • Generate X: pick m data vectors with entries chosen uniformly at random from {1, 2, …, 10}
  • Pick w* with entries chosen uniformly at random from {1, 2, …, 100}
  • Generate labels: y ∼ 𝒩(Xw*, 1)
  • Loss function: ℒ_2 loss (least-squares error)

Ø Simulation results on adaptive fastest-k SGD for n = 50 workers

  • Workers' response times are iid ∼ exp(1) and independent across iterations

n = 50 workers, d = 100 dimension, m = 2000 data vectors, η = 0.005 step size

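A sketch of this data-generation step with the parameters listed above; the round-robin split of the dataset among the n workers is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, n = 2000, 100, 50                                  # data points, dimension, workers
X = rng.integers(1, 11, size=(m, d)).astype(float)       # entries uniform in {1, ..., 10}
w_star = rng.integers(1, 101, size=d).astype(float)      # entries uniform in {1, ..., 100}
y = X @ w_star + rng.standard_normal(m)                  # labels y ~ N(X w*, 1)

# Partition the dataset A = [X | y] among the n workers (round-robin split)
parts = [(X[i::n], y[i::n]) for i in range(n)]
```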
SLIDE 15

Simulation Results: Async vs. Adaptive Fastest-k SGD


Ø Asynchronous Stochastic Gradient Descent: the master updates the model w_j and sends the new model w_{j+1} every time a worker finishes its partial gradient computation
Ø Workers who have not finished continue working on the old model
Ø Simulation results:

n = 50 workers, d = 100 dimension, m = 2000 data vectors, η = 0.002 step size

SLIDE 16

Summary and Future Work

Ø Speeding up distributed machine learning

  • Straggler problem
  • Adaptive fastest-k SGD for minimizing delay in the presence of stragglers
  • Theoretical results: bounds on the error & bound-optimal switching times
  • Novel realizable algorithm based on a statistical heuristic
  • Numerical results showing the gain with respect to non-adaptive SGD

Ø Future work

  • Simulations on real data (MNIST, CIFAR, etc.)
  • Mixed strategies: coding + adaptivity
  • Variable step size