Diversity vs. Parallelism in Distributed Computing with Redundancy

SLIDE 1

Diversity vs. Parallelism in Distributed Computing with Redundancy

Pei Peng*, Emina Soljanin*, Philip Whiting†

* Rutgers University † Macquarie University

2020 IEEE International Symposium on Information Theory

SLIDE 2

Background and Problem Description

SLIDE 3

Background

Distributed computing:

  • Numerous machine learning and other algorithms are growing in complexity and in their data requirements;
  • Distributed computing provides parallelism: simultaneous execution of the smaller tasks that make up a large computing job.

Redundancy:

  • The large-scale sharing of computing resources causes random fluctuations in task service times;
  • Redundancy provides diversity: a job is completed as soon as any fixed-size subset of its tasks has been executed.

SLIDE 4

Distributed Computing

Distributed computing provides simultaneous execution of smaller tasks that make up a large computing job.

[Figure: a homework job split between Bob and Alice]

SLIDE 5

Straggling Jobs

Many factors, e.g., large-scale resource sharing, maintenance activities, and queueing, cause random fluctuations in task service times.

[Figure: a straggling worker delays the job]

SLIDE 6

Redundancy

Redundancy, in the form of task replication or erasure coding, allows a job to be completed once only a subset of its (redundant) tasks has been executed, thus avoiding stragglers.

[Figure: homework split among Bob, Alice, and Eve, with one part replicated]

SLIDE 7

Straggler Mitigation: Example

[Figure: straggler mitigation example]

SLIDE 8

Worst Case for One Straggler


SLIDE 9

Coding

Erasure coding is a potentially powerful way to shorten job execution time, especially in the previous example. However, it can be used only in certain specific scenarios.

[Figure: a job is split and encoded]

SLIDE 10

Straggler Mitigation: Example

[Figure: straggler mitigation example with coding]

SLIDE 11

Diversity and Parallelism

[Diagram: a redundancy spectrum from replication (maximum diversity, minimum parallelism) through coding to splitting (minimum diversity, maximum parallelism)]

Diversity and parallelism are defined by the amount of redundancy per job: diversity increases with redundancy, while parallelism decreases with it.

SLIDE 12

Diversity vs. Parallelism Tradeoff

Both parallelism and diversity are essential in reducing job service time, but they pull in opposite directions.

                         Replication                     Splitting
Task execution times     Tr_1, Tr_2, ..., Tr_n           Ts_1, Ts_2, ..., Ts_n
Per-task comparison      P(Tr_j > t) > P(Ts_j > t), for j = 1, 2, ..., n
Job completion time      min{Tr_1, Tr_2, ..., Tr_n}      max{Ts_1, Ts_2, ..., Ts_n}

To formalize this tradeoff, we should answer the following questions:

  • 1. Which distribution should be used for the task service times (Tr, Ts)?
  • 2. How does that distribution change with task size?
SLIDE 13

System Model and Prior Work

SLIDE 14

System Model

Jobs can be split into tasks, which can be executed independently in parallel on n different workers:

  • 1. J1 is executed with splitting;
  • 2. J2 is executed with replication; the shadings are replicas;
  • 3. J3 is executed with a [4, 2] erasure code; the shadings are coded tasks.

We consider the expected execution time for each job.

SLIDE 15

Computing Unit (CU)

Fact: a job cannot be split into tasks of arbitrary size.

This raises the question of the smallest possible task, which motivates the computing unit (CU) as the atomic unit of work.

A job consists of tasks, and each task consists of CUs.

SLIDE 16

Parameters and Notations

n − number of workers (equal to the number of CUs in a job)
k − number of workers that have to execute their tasks for job completion
s − number of CUs per task, s = n/k
V − service time of each CU
X − exponential random variable (with mean W) modelling the straggling
Y − task completion time of each worker
Y_{k:n} − the k-th order statistic of Y_1, ..., Y_n
Z_{n,k} − job completion time when each worker's task size is n/k = s

SLIDE 17

References

[1] G. Liang and U. C. Kozat, "Fast cloud: Pushing the envelope on delay performance of cloud storage with coding," IEEE/ACM Transactions on Networking, vol. 22, no. 6, pp. 2012–2025, 2014.

[2] A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, "Low latency via redundancy," in Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies. ACM, 2013, pp. 283–294.

[3] A. Gorbunova, I. Zaryadov, S. Matyushenko, and E. Sopin, "The estimation of probability characteristics of cloud computing systems with splitting of requests," in International Conference on Distributed Computer and Communication Networks. Springer, 2016, pp. 418–429.

[4] A. Behrouzi-Far and E. Soljanin, "Redundancy scheduling in systems with bi-modal job service time distributions," in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 9–16.

[5] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, "Coded computation over heterogeneous clusters," IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, 2019.

[6] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," Advances in Neural Information Processing Systems, vol. 29, pp. 2100–2108, 2016.

[7] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012, pp. 326–333.

[8] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2017.

[9] K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, and B. Van Houdt, "A better model for job redundancy: Decoupling server slowdown and job size," IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3353–3367, 2017.

[10] G. Joshi, E. Soljanin, and G. Wornell, "Efficient redundancy techniques for latency reduction in cloud systems," ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), vol. 2, no. 2, pp. 1–30, 2017.

[11] M. F. Aktaş and E. Soljanin, "Straggler mitigation at scale," IEEE/ACM Transactions on Networking, vol. 27, no. 6, pp. 2266–2279, 2019.

[12] M. S. Klamkin and D. J. Newman, "Extensions of the birthday surprise," Journal of Combinatorial Theory, vol. 3, no. 3, pp. 279–282, 1967.

[13] P. Peng, E. Soljanin, and P. Whiting, "Diversity vs. parallelism in distributed computing with redundancy." [Online]. Available: https://emina.flywheelsites.com

SLIDE 18

Various Proposed Distributions

1. Which distribution should be used for the task service time?

  • For theoretical analysis, the shifted exponential was used in e.g. [1], the Pareto in e.g. [2], the Erlang in e.g. [3], and the bi-modal in e.g. [4].
  • In this paper, we assume the service time per CU is shifted exponential:

V ~ Δ + X

  • where Δ is a constant modelling the minimum service time (the deterministic component);
  • X is an exponentially distributed random variable with mean W, modelling the straggling (the random component).

The other distributions are analyzed in [13].
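
As a quick illustration, here is a minimal Python sketch of sampling shifted-exponential CU service times; the parameter values Δ = 1 and W = 2 are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def cu_service_times(m, delta=1.0, w=2.0):
    """Sample m CU service times V = delta + X, where X is exponential
    with mean w (delta and w are illustrative, not from the talk)."""
    return delta + rng.exponential(w, size=m)

v = cu_service_times(100_000)
print(v.mean())  # should be close to delta + w = 3.0
```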

SLIDE 19

Various Proposed Scaling Models

  • 2. How does the probability distribution scale (change) with task size?
  • There is no consensus on this question: some works scale the random component, e.g. [5, 6], while others scale the deterministic component, e.g. [7, 8]. These papers provide the scaling for specific distributions; none of them provides a model for general distributions.

The service time of each CU is V ~ Δ + X. For a task of s CUs (s is the task size), the task execution time Y is then:

Scaling the random component: Y = Δ + s · X
Scaling the deterministic component: Y = s · Δ + X

SLIDE 20

Prior Related Work

We can classify the related references into two categories.

  • Category 1: many papers design codes for given systems, e.g. [5, 6, 8]; however, they do not focus on optimizing the code rate.
  • Category 2: a few papers study how much redundancy should be introduced, e.g. [9, 10, 11]; however, [9, 10] focus only on applying replication in queueing systems, and [11] focuses only on the Pareto distribution.

SLIDE 21

Task Service Time Scaling Models

Model 1 (Server-Dependent): the straggling effect depends on the server and is identical for each CU executed on that server; here Δ is some initial handshake time. The service time increases with the task size (the number of CUs s):

Y = Δ + s · X_j, where X_j ~ Exp(W) is the per-CU service randomness at server j.

Model 2 (Data-Dependent): each CU in a task of s CUs takes Δ time to complete, and there is some inherent additive randomness at each server:

Y = s · Δ + X_j, where X_j ~ Exp(W) is the system randomness at server j.

Model 3 (Additive): the execution times of the CUs are independent and identically distributed:

Y = V_1 + ⋯ + V_s, where V_j ~ S-Exp(Δ, W) is the service time of a CU at server j.
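
The three scaling models can be compared side by side with a small Monte Carlo sketch in Python; the function name and the parameter values are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def task_time(model, s, delta=1.0, w=2.0, size=100_000):
    """Sample task execution times Y for a task of s CUs under the three
    scaling models (delta and w are illustrative, not from the talk)."""
    if model == "server-dependent":   # Model 1: Y = delta + s * X_j
        return delta + s * rng.exponential(w, size)
    if model == "data-dependent":     # Model 2: Y = s * delta + X_j
        return s * delta + rng.exponential(w, size)
    if model == "additive":           # Model 3: Y = sum of s i.i.d. S-Exp(delta, w)
        return (delta + rng.exponential(w, (size, s))).sum(axis=1)
    raise ValueError(f"unknown model: {model}")

for m in ("server-dependent", "data-dependent", "additive"):
    print(m, task_time(m, s=4).mean())
```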

SLIDE 22

Main Results

SLIDE 23

Server-Dependent Model

For each worker, the task execution time is Y = Δ + s · X_j, where X_j ~ Exp(W); the job completion time Z_{n,k} is then the k-th order statistic of the n workers' task execution times.

Theorem 1. The expected job completion time for the server-dependent execution model is

E[Z_{n,k}] = Δ + s W (H_n − H_{n−k}) = Δ + W (n/k)(H_n − H_{n−k}) ≥ Δ + W,

where H_m denotes the m-th harmonic number. E[Z_{n,k}] is minimized by replication/maximal diversity (k = 1).
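
Theorem 1 can be checked numerically; the Python sketch below (with illustrative n, k, Δ, W, not values from the talk) compares the closed form against a Monte Carlo estimate of the k-th order statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

def harmonic(m):
    return sum(1.0 / i for i in range(1, m + 1))

n, k, delta, w = 12, 4, 1.0, 2.0
s = n // k  # CUs per task

# Theorem 1: E[Z_{n,k}] = delta + s*W*(H_n - H_{n-k})
theory = delta + s * w * (harmonic(n) - harmonic(n - k))

# simulation: Z_{n,k} is the k-th smallest of n task times delta + s*X_j
y = delta + s * rng.exponential(w, size=(200_000, n))
simulated = np.sort(y, axis=1)[:, k - 1].mean()

print(theory, simulated)  # the two values should agree closely
```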

SLIDE 24

Numerical Analysis

Conclusions:

  • Replication/maximal diversity is always optimal;
  • Replication is much more effective when W/Δ is large, and has less payoff when W/Δ is small.

Note that different values of W and Δ with the same ratio W/Δ give different values of E[Z_{n,k}], so the results of different W/Δ scenarios should not be compared.
SLIDE 25

Data-Dependent Model

For each worker, the task execution time is Y = s · Δ + X_j, where X_j ~ Exp(W).

Theorem 2. The expected job completion time for the data-dependent execution model is

E[Z_{n,k}] = s Δ + W (H_n − H_{n−k}) = W [(n/k) · (Δ/W) + H_n − H_{n−k}].

  • When Δ ≫ W, the execution time is essentially deterministic, and it is optimal to use maximal parallelism (splitting);
  • When Δ ≪ W, the execution time is much more variable, and it is optimal to operate with maximal diversity (replication).

By taking the log approximation of the harmonic numbers, we find the optimal k* that minimizes E[Z_{n,k}]:

k* = n (−ρ/2 + √(ρ + ρ²/4)), where ρ = Δ/W.
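
The closed-form k* can be sanity-checked against a brute-force search over integer k; this Python sketch uses illustrative n, Δ, W (not values from the talk):

```python
import math

def harmonic(m):
    return sum(1.0 / i for i in range(1, m + 1))

def expected_z(n, k, delta, w):
    # Theorem 2: E[Z_{n,k}] = (n/k)*delta + W*(H_n - H_{n-k})
    return (n / k) * delta + w * (harmonic(n) - harmonic(n - k))

n, delta, w = 100, 0.5, 2.0
rho = delta / w

# closed-form optimum from the log approximation of harmonic numbers
k_star = n * (-rho / 2 + math.sqrt(rho + rho**2 / 4))

# brute-force optimum over integer k
k_best = min(range(1, n + 1), key=lambda k: expected_z(n, k, delta, w))

print(round(k_star), k_best)  # the two should roughly agree
```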

SLIDE 26

Numerical Analysis

Conclusions:

  • Replication is optimal when W/Δ is very large;
  • Splitting is optimal when W/Δ is small;
  • Otherwise, coding with a certain non-trivial rate is optimal (recall that k/n is the code rate).

SLIDE 27

Additive Model: Replication vs. Splitting

For each worker, the task execution time is Y = V_1 + ⋯ + V_s, where V_j = Δ + X_j and X_j ~ Exp(W).

Replication (k = 1): each worker executes all n CUs.

Theorem 3. The expected job completion time for the additive execution model under replication is

E[Z_{n,1}] = n Δ + (W/n) ∫_0^∞ e^{−t} [S_n(t/n)]^n dt, where S_n(x) = 1 + x/1! + x²/2! + ⋯ + x^{n−1}/(n−1)!.

Splitting (k = n): each worker executes one CU, so the job completion time is the largest order statistic, which decomposes into a sum of minima:

Z_{n,n} = Δ + X_{1:n} + X_{1:(n−1)} + ⋯ + X_{1:2} + X_{1:1}, and hence E[Z_{n,n}] = Δ + W H_n.

Lemma 1. For n sufficiently large, splitting (maximal parallelism) outperforms replication (maximal diversity). The proof uses the birthday-problem analogy of [12].
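
A small simulation sketch comparing the two extremes under the additive model (Python; the parameters are illustrative assumptions, with n = 12 chosen because the numerical analysis below notes that n = 12 already satisfies Lemma 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n, delta, w, trials = 12, 0.1, 2.0, 50_000

# replication (k = 1): each of the n workers executes all n CUs;
# the job finishes when the fastest worker finishes its sum of n CU times
v = delta + rng.exponential(w, size=(trials, n, n))
repl = v.sum(axis=2).min(axis=1).mean()

# splitting (k = n): each worker executes one CU;
# the job finishes when the slowest worker finishes
v1 = delta + rng.exponential(w, size=(trials, n))
split = v1.max(axis=1).mean()

print(repl, split)  # per Lemma 1, splitting should win for n this large
```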

SLIDE 28

Additive Model: Half-Rate Coding vs. Splitting

Consider the special case Δ = 0, so that the service time of each CU is exponential.

Theorem 4. Suppose that n = 2k ≥ 4 is even. Then Z_{n,n} stochastically dominates Z_{n,n/2}; that is,

P(Z_{n,n/2} > t) ≤ P(Z_{n,n} > t).

Conclusion: for a system with exponential service times and sufficiently large n, half-rate coding is better than splitting, which in turn is better than replication.

Note that this conclusion has two limitations: the service time distribution is memoryless, and the result holds only for code rate 1/2.
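
The dominance in Theorem 4 can be eyeballed with a quick Monte Carlo estimate of the two tail probabilities (Python; n, W, and the thresholds are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, w, trials = 8, 1.0, 200_000
k = n // 2

# delta = 0: CU times are plain exponentials (the memoryless case)
# half-rate coding: tasks of s = 2 CUs; job needs the fastest k = n/2 workers
tasks = rng.exponential(w, size=(trials, n, 2)).sum(axis=2)
z_half = np.sort(tasks, axis=1)[:, k - 1]

# splitting: one CU per worker; job needs all n workers
z_split = rng.exponential(w, size=(trials, n)).max(axis=1)

for t in (1.0, 2.0, 4.0):
    # Theorem 4 predicts the first tail is never larger than the second
    print(t, (z_half > t).mean(), (z_split > t).mean())
```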

SLIDE 29

Numerical Analysis

Conclusions:

  • Splitting is optimal when W/Δ is small;
  • Coding is optimal when W/Δ is large;
  • Half-rate coding is better than splitting when Δ = 0;
  • Even n = 12 is sufficiently large to satisfy Lemma 1.

SLIDE 30

Conclusions

Table I summarizes more general results reported in [13].

SLIDE 31

Conclusions and Future Work

Server-dependent and data-dependent models:

Conclusion: for a general distribution with a finite first moment, when Δ is much larger than the expectation of the random component, splitting is the optimal strategy; when Δ is much smaller, replication is optimal; otherwise, coding with a proper code rate is optimal.

Conjecture 1. For a general distribution, if the scaling parameter is the random component, replication is the optimal strategy; if the scaling parameter affects both the deterministic and the random components, coding or splitting is optimal.

Additive model:

Conjecture 2. For a general distribution, only coding or splitting is the optimal strategy, and the optimal code rate is always greater than 1/2.