Diversity vs. Parallelism in Distributed Computing with Redundancy
Pei Peng*, Emina Soljanin*, Philip Whiting†
* Rutgers University † Macquarie University
2020 IEEE International Symposium on Information Theory
Background and Problem
- Jobs have massive data requirements;
- distributed computing provides simultaneous execution of the smaller tasks that make up a large computing job;
- task service times fluctuate randomly;
- with redundancy, a job is done once a fixed-size subset of its tasks has been executed.
1/19
Distributed computing provides simultaneous execution of smaller tasks that make up a large computing job.
(Figure: Bob and Alice split the homework.) 2/19
Many factors, e.g., large-scale resource sharing, maintenance activities, and queueing, cause random fluctuations in task service times.
(Figure: a straggling worker delays the whole job.) 3/19
Redundancy, in the form of task replication or erasure coding, allows a job to be completed as soon as a subset of the redundant tasks has been executed, thus avoiding stragglers.
(Figure: the homework is split among Bob, Alice, and Eve, with a replica covering the straggler.) 4/19
Erasure coding is a potentially powerful way to shorten the job execution time, as in the previous example; however, it can be used only in certain scenarios.
(Figure: split and encode; the coded tasks tolerate a straggler.) 5/19
Redundancy spans a spectrum between two extremes: replication (maximum diversity, minimum parallelism) and splitting (maximum parallelism, minimum diversity), with coding in between. Diversity and parallelism are defined through the per-job redundancy: diversity increases with redundancy, while parallelism decreases with it.
6/19
Both parallelism and diversity are essential in reducing the job service time, but they pull in opposite directions. Comparing the two extremes:
- Task execution times: Tr_1, Tr_2, …, Tr_n under replication vs. Ts_1, Ts_2, …, Ts_n under splitting, where P(Tr_i > u) > P(Ts_i > u) for i = 1, 2, …, n (replicated tasks are larger, hence stochastically slower);
- Job completion time: min{Tr_1, …, Tr_n} under replication vs. max{Ts_1, …, Ts_n} under splitting.
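The min-vs-max comparison above can be sketched with a small Monte Carlo experiment (my own illustration, not from the slides): assuming Δ = 0, unit-rate exponential randomness, and server-dependent scaling, each replica's time is n·X_i while each split task's time is X_i, so replicated tasks are stochastically larger, yet the minimum over replicas can still beat the maximum over split tasks.

```python
import random

def simulate(n=10, trials=200_000, seed=1):
    """Compare job completion time under full replication (min of n
    whole-job times, each n*X) vs. full splitting (max of n single-CU
    times X), with X ~ Exp(1)."""
    rng = random.Random(seed)
    rep_total = split_total = 0.0
    for _ in range(trials):
        xs = [rng.expovariate(1.0) for _ in range(n)]
        rep_total += min(n * x for x in xs)   # each replica runs all n CUs
        split_total += max(xs)                # each worker runs one CU
    return rep_total / trials, split_total / trials

rep, spl = simulate()
print(rep, spl)  # ≈ 1.0 (replication) vs ≈ H_10 ≈ 2.93 (splitting)
```

Here diversity wins: the minimum of the n scaled times has mean 1/μ, while the maximum of the n split times has mean H_n/μ.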
To formalize this tradeoff problem, we should answer the following questions:
7/19
Jobs can be split into tasks that can be executed independently in parallel.
We consider the expected execution time for each job.
8/19
Fact: a job cannot be split into tasks of arbitrary size. There is a smallest task, a computing unit (CU); in the homework example, the smallest task is a single question. A job consists of tasks, and each task consists of CUs.
Notation:
- n: number of workers (equal to the number of CUs in a job);
- k: number of workers that have to execute their tasks for job completion;
- s: number of CUs per task, s = n/k;
- W: service time of each CU;
- X: an exponential random variable with rate μ;
- Y: task completion time at each worker;
- Y_{k:n}: the k-th order statistic of n task completion times;
- Z_{n,k}: job completion time when each worker's task size is n/k = s.
[1] G. Liang and U. C. Kozat, "Fast cloud: Pushing the envelope on delay performance of cloud storage with coding," IEEE/ACM Transactions on Networking.
[2] A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, "Low latency via redundancy," in Proceedings of the ninth ACM conference.
[3] A. Gorbunova, I. Zaryadov, S. Matyushenko, and E. Sopin, "The estimation of probability characteristics of cloud computing systems with splitting of requests," in International Conference on Distributed Computer and Communication Networks. Springer, 2016, pp. 418–429.
[4] A. Behrouzi-Far and E. Soljanin, "Redundancy scheduling in systems with bi-modal job service time distributions," in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 9–16.
[5] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, "Coded computation over heterogeneous clusters," IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, 2019.
[6] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," Advances in Neural Information Processing Systems, vol. 29, pp. 2100–2108, 2016.
[7] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012, pp. 326–333.
[8] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2017.
[9] K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, and B. Van Houdt, "A better model for job redundancy: Decoupling server slowdown and job size," IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3353–3367, 2017.
[10] G. Joshi, E. Soljanin, and G. Wornell, "Efficient redundancy techniques for latency reduction in cloud systems," ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), vol. 2, no. 2, pp. 1–30, 2017.
[11] M. F. Aktaş and E. Soljanin, "Straggler mitigation at scale," IEEE/ACM Transactions on Networking, vol. 27, no. 6, pp. 2266–2279, 2019.
[12] M. S. Klamkin and D. J. Newman, "Extensions of the birthday surprise," Journal of Combinatorial Theory, vol. 3, no. 3, pp. 279–282, 1967.
[13] P. Peng, E. Soljanin, and P. Whiting, "Diversity vs. parallelism in distributed computing with redundancy." [Online]. Available: https://emina.flywheelsites.com
The service time of each CU has two components: data processing (the deterministic component) and straggling (the random component), so W ~ Δ + Y. The random component has been modeled as exponential in e.g. [2], Erlang in e.g. [3], and bi-modal in e.g. [4]. Here we assume Y is exponential; the other distributions are analyzed in [13].
9/19
Prior work scales either the random or the deterministic component, e.g. [7, 8]. These papers provide the scaling for specific distributions, but none of them provides a model for general distributions. Let s be the number of CUs per task (the task size) and W ~ Δ + Y the service time of each CU. The task execution time Z is then
- scaling the random component: Z = Δ + s·Y;
- scaling the deterministic component: Z = s·Δ + Y.
10/19
Redundancy has also been studied in [9, 10, 11]; however, [9, 10] focus only on replication in queueing systems, and [11] only on the Pareto distribution. Other related works consider redundancy but do not focus on optimizing the code rate.
11/19
Model 1 (Server-Dependent): the straggling effect depends on the server and is identical for each CU executed on that server; Δ is an initial handshake time. Z = Δ + s·Y_i, where Y_i ~ Exp(μ) is the service time of a CU on server i. The task service time increases with the task size (the number of CUs s).
Model 2 (Data-Dependent): each CU in a task of s CUs takes Δ time to complete, and there is some inherent additive randomness at each server. Z = s·Δ + Y_i, where Y_i ~ Exp(μ) is the system randomness on server i.
Model 3 (Additive): the execution times of the CUs are independent and identically distributed. Z = W_1 + ⋯ + W_s, where W_i ~ Δ + Exp(μ) is the (shifted-exponential) service time of a CU on server i.
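The three execution models can be sketched as a small Python sampler (my own illustration of the reconstruction above; the function name and the values μ = 1, Δ = 0.1, s = 4 are arbitrary choices):

```python
import random

def task_time(model, s, delta=0.1, mu=1.0, rng=random):
    """Sample one worker's task execution time Z for a task of s CUs
    under the three execution models."""
    if model == "server":      # Model 1: Z = Δ + s·Y,  Y ~ Exp(μ)
        return delta + s * rng.expovariate(mu)
    if model == "data":        # Model 2: Z = s·Δ + Y,  Y ~ Exp(μ)
        return s * delta + rng.expovariate(mu)
    if model == "additive":    # Model 3: Z = Σ W_i,  W_i ~ Δ + Exp(μ)
        return sum(delta + rng.expovariate(mu) for _ in range(s))
    raise ValueError(model)

rng = random.Random(0)
for m in ("server", "data", "additive"):
    zs = [task_time(m, s=4, rng=rng) for _ in range(100_000)]
    print(m, sum(zs) / len(zs))
# means ≈ Δ + s/μ = 4.1,  s·Δ + 1/μ = 1.4,  s·(Δ + 1/μ) = 4.4
```

The three models share the same per-CU ingredients (Δ and Exp(μ)) but scale them differently, which is exactly what drives the different optima in the theorems that follow.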
12/19
Theorem 1. The expected job completion time under the server-dependent execution model is
E[Z_{n,k}] = Δ + (s/μ)(H_n − H_{n−k}) = Δ + (1/μ)·(n/k)·(H_n − H_{n−k}) ≥ Δ + 1/μ,
where H_n is the n-th harmonic number. Hence E[Z_{n,k}] is minimized by replication, i.e., maximal diversity (k = 1).
For each worker, Z = Δ + s·Y_i with Y_i ~ Exp(μ), so the job completion time Z_{n,k} is the k-th order statistic of the n workers' task execution times.
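Theorem 1's formula is easy to check numerically. The sketch below (helper names are my own; exact rational arithmetic) evaluates E[Z_{n,k}] = Δ + (n/k)·E[X]·(H_n − H_{n−k}) over the divisors of n = 12 and confirms that k = 1 minimizes it:

```python
from fractions import Fraction

def H(n):
    """n-th harmonic number, computed exactly."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def expected_Z(n, k, delta, mean_x):
    """Theorem 1 (server-dependent): E[Z_{n,k}] = Δ + (n/k)·E[X]·(H_n − H_{n−k})."""
    return delta + Fraction(n, k) * mean_x * (H(n) - H(n - k))

n, delta, mean_x = 12, Fraction(1, 10), Fraction(1)
vals = {k: expected_Z(n, k, delta, mean_x) for k in (1, 2, 3, 4, 6, 12)}
best = min(vals, key=vals.get)
print(best, float(vals[best]))  # k = 1 attains the lower bound Δ + E[X]
```

At k = 1 the factor (n/k)(H_n − H_{n−k}) equals n·(1/n) = 1, which is why the bound Δ + E[X] is attained exactly by replication.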
13/19
Replication (k = 1) is always optimal for this model; it pays off most when W/Δ is large and less when W/Δ is small. Note that different values of W and Δ with the same ratio W/Δ give different values of E[Z_{n,k}], but the comparison results are unchanged.
14/19
Theorem 2. The expected job completion time under the data-dependent execution model is
E[Z_{n,k}] = sΔ + (1/μ)(H_n − H_{n−k}) = (1/μ)[(n/k)·Δμ + (H_n − H_{n−k})].
For each worker, Z = s·Δ + Y_i with Y_i ~ Exp(μ). When the deterministic component dominates, the optimum moves toward maximal parallelism (splitting); when the random component dominates, it moves toward maximal diversity (replication).
By taking the log approximation of the harmonic numbers, we find the optimal k* that minimizes E[Z_{n,k}]:
k* = n·(√(c + c²/4) − c/2), where c = Δ/E[X] = Δμ.
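A quick way to validate this closed form is to brute-force the integer minimizer of the exact objective E[Z_{n,k}] = (n/k)·Δ + (H_n − H_{n−k})/μ. The sketch below uses illustrative parameters n = 100, Δ = 0.1, μ = 1 (function names are my own):

```python
import math

def expected_Z(n, k, delta, mu=1.0):
    """Theorem 2 (data-dependent): E[Z_{n,k}] = (n/k)·Δ + (H_n − H_{n−k})/μ."""
    tail = sum(1.0 / i for i in range(n - k + 1, n + 1))  # H_n − H_{n−k}
    return (n / k) * delta + tail / mu

def k_star(n, delta, mu=1.0):
    """Closed-form optimum from the log approximation of the harmonic numbers."""
    c = delta * mu  # c = Δ/E[X]
    return n * (math.sqrt(c + c * c / 4) - c / 2)

n, delta = 100, 0.1
brute = min(range(1, n + 1), key=lambda k: expected_Z(n, k, delta))
print(brute, k_star(n, delta))  # the approximation lands near the true argmin
```

Because the approximation only replaces H_n − H_{n−k} by a logarithm, the continuous k* should sit within a step or two of the exact integer minimizer.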
15/19
Splitting is optimal when Δ/E[X] is very large; replication is optimal when Δ/E[X] is small; otherwise a non-trivial code rate is optimal (recall that k/n is the code rate).
16/19
Theorem 3. Under the additive execution model with replication (each worker executes all n CUs, so Z = W_1 + ⋯ + W_n with W_i = Δ + Y_i and Y_i ~ Exp(μ)), the expected job completion time is
E[Z_{n,1}] = nΔ + (1/μ)∫₀^∞ e^{−nu}[S_n(u)]^n du, where S_n(u) = 1 + u/1! + u²/2! + ⋯ + u^{n−1}/(n−1)!.
The proof uses the birthday-problem analogy of [12].
Splitting: each worker has one CU, so the job completion time is the largest order statistic,
Z_{n,n} = Δ + Y_{1:n} + Y_{1:(n−1)} + ⋯ + Y_{1:2} + Y_{1:1}, and hence E[Z_{n,n}] = Δ + (1/μ)H_n.
Lemma 1. For n sufficiently large, splitting (maximal parallelism) outperforms replication (maximal diversity).
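The comparison in Lemma 1 can be checked numerically. The sketch below (my own helpers, illustrative parameters n = 20, Δ = 0.1, μ = 1) evaluates Theorem 3's integral by midpoint quadrature in log space, to avoid overflow of S_n(u)^n, and compares it with E[Z_{n,n}] = Δ + H_n/μ:

```python
import math

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

def e_rep(n, delta, mu=1.0, du=0.005):
    """Theorem 3 (additive model, replication):
    E[Z_{n,1}] = nΔ + (1/μ)∫₀^∞ e^{−nu} S_n(u)^n du,
    with S_n(u) = Σ_{j<n} u^j/j! evaluated by Horner's rule."""
    inv_fact = [1.0 / math.factorial(j) for j in range(n)]
    total, u, umax = 0.0, du / 2, 10.0 * n
    while u < umax:
        s = 0.0
        for c in reversed(inv_fact):   # Horner evaluation of S_n(u)
            s = s * u + c
        total += math.exp(n * (math.log(s) - u)) * du
        u += du
    return n * delta + total / mu

def e_split(n, delta, mu=1.0):
    """Additive model, splitting: E[Z_{n,n}] = Δ + H_n/μ."""
    return delta + harmonic(n) / mu

n, delta = 20, 0.1
print(e_rep(n, delta), e_split(n, delta))  # splitting wins already at n = 20
```

Intuitively, replication pays the deterministic cost nΔ plus the minimum of n Erlang(n) variables, which concentrates near n/μ, while splitting pays only Δ plus a maximum that grows like (ln n)/μ.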
17/19
Theorem 4. Suppose that n = 2k ≥ 4 is even. Then Z_{n,n} stochastically dominates Z_{n,n/2}; that is, P(Z_{n,n/2} > y) ≤ P(Z_{n,n} > y).
Consider the special case Δ = 0, so that the service time of each CU is exponential. For a system with exponential service times and sufficiently large n, half-rate coding is better than splitting, which in turn is better than replication.
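This ordering is visible in simulation. The Monte Carlo sketch below (my own illustration: Δ = 0, Exp(1) CU times, n = 10) estimates E[Z_{n,k}] for replication (k = 1), half-rate coding (k = n/2), and splitting (k = n):

```python
import random

def job_time(n, k, rng):
    """Δ = 0 special case: each of n workers gets s = n/k unit CUs with
    Exp(1) times; the job ends when k workers have finished, i.e. the
    k-th order statistic of the n per-worker task times."""
    s = n // k
    times = sorted(sum(rng.expovariate(1.0) for _ in range(s))
                   for _ in range(n))
    return times[k - 1]

rng = random.Random(7)
n, trials = 10, 40_000
means = {k: sum(job_time(n, k, rng) for _ in range(trials)) / trials
         for k in (1, n // 2, n)}
print(means)  # expect means[5] < means[10] < means[1]
```

The splitting estimate should sit near H_10 ≈ 2.93, with half-rate coding below it (Theorem 4) and replication well above it (Lemma 1).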
Conclusion:
Note that this conclusion has two limitations: the service time distribution is memoryless, and the result holds only for code rate 1/2.
18/19
19/19
Lemma 1 covers the case Δ = 0. Table I summarizes more general results reported in [13].
Conjecture 1. For a general distribution whose scaling parameter is the random component, replication is the optimal strategy; if the scaling parameter involves both the deterministic and the random component, coding or splitting is optimal.
Conclusion: for a general distribution with a finite first moment, splitting is optimal when Δ is much larger than the expectation of the random component; replication is optimal when Δ is much smaller; otherwise, coding with a proper code rate is optimal.
Conjecture 2. For a general distribution, only coding or splitting is the optimal strategy, and the …