Diversity vs. Parallelism in Distributed Computing with Redundancy


  1. Diversity vs. Parallelism in Distributed Computing with Redundancy Pei Peng*, Emina Soljanin*, Philip Whiting † * Rutgers University † Macquarie University 2020 IEEE International Symposium on Information Theory

  2. Background and Problem Description

  3. Background  Distributed computing:  Numerous machine learning and other algorithms are increasing in complexity and data requirements;  Distributed computing provides parallelism, i.e., the simultaneous execution of the smaller tasks that make up a large computing job.  Redundancy:  The large-scale sharing of computing resources causes random fluctuations in task service times;  Redundancy provides diversity, so that a job is completed as soon as any fixed-size subset of its tasks has been executed.

  4. Distributed Computing Distributed computing provides simultaneous execution of the smaller tasks that make up a large computing job. (figure: Bob splits his homework and shares it with Alice)

  5. Straggling Jobs Many factors, e.g., large-scale resource sharing, maintenance activities, and queueing, cause random fluctuations in task service times. (figure: a straggling worker)

  6. Redundancy Redundancy, in the form of task replication or erasure coding, allows a job to be completed as soon as only a subset of its redundant tasks has been executed, thus avoiding stragglers. (figure: Bob splits his homework and gives replicas to Alice and Eve)

  7. Straggler Mitigation: Example (figure: a straggling worker)

  8. Worst Case for One Straggler (figure: a straggling worker)

  9. Coding Erasure coding is a potentially powerful way to shorten the job execution time, especially in the previous example. However, it can only be used in certain specific scenarios. (figure: split and encode)

  10. Straggler Mitigation: Example (figure: a straggling worker)

  11. Diversity and Parallelism Diversity and parallelism are defined by the redundancy of each job: diversity increases with redundancy, while parallelism decreases with it. Splitting sits at minimum redundancy (maximum parallelism, minimum diversity), replication at maximum redundancy (maximum diversity, minimum parallelism), and coding in between.

  12. Diversity vs. Parallelism Tradeoff Both parallelism and diversity are essential in reducing job service time, but they act in opposite directions. Task execution times: replication T_r^1, T_r^2, …, T_r^n; splitting T_s^1, T_s^2, …, T_s^n. Comparison between individual tasks: P(T_r^i > t) > P(T_s^i > t) for i = 1, 2, …, n, since replicated tasks are larger than split tasks. Job completion time: Min{T_r^1, …, T_r^n} under replication vs. Max{T_s^1, …, T_s^n} under splitting. To formalize this tradeoff, we should answer the following questions: 1. Which distribution should be used for the task service times (T_r, T_s)? 2. How does the distribution change with task size?
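The min-vs-max tradeoff above can be checked with a quick Monte Carlo sketch. This is an illustration, not the authors' code: it assumes the server-dependent shifted-exponential model introduced later (task time Δ + s·X with X exponential of mean λ), and the function names are my own.

```python
import random

def job_time_replication(n, delta, lam, rng):
    # Replication (k = 1): every worker runs the whole job (s = n CUs);
    # the job finishes when the FASTEST worker finishes (min, diversity).
    return min(delta + n * rng.expovariate(1.0 / lam) for _ in range(n))

def job_time_splitting(n, delta, lam, rng):
    # Splitting (k = n): each worker runs one CU (s = 1); the job finishes
    # when the SLOWEST worker finishes (max, parallelism).
    return max(delta + rng.expovariate(1.0 / lam) for _ in range(n))
```

Averaged over many runs, replication approaches Δ + λ while splitting approaches Δ + λ·H_n under this model, previewing the result of Theorem 1.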

  13. System Model and Prior work

  14. System Model Jobs can be split into tasks, which can be executed independently in parallel on different workers: 1. J1 is executed with splitting; 2. J2 is executed with replication (the shadings are replicas); 3. J3 is executed with a [4, 2] erasure code (the shadings are coded tasks). We consider the expected execution time of each job.

  15. Computing Unit (CU) Fact: a job cannot be split into tasks of arbitrary size; there is a smallest unit of work (in the homework analogy, a single question). We call this smallest unit a computing unit (CU): a job consists of tasks, and each task consists of CUs.

  16. Parameters and Notation n − number of workers (equal to the number of CUs in a job); k − number of workers that have to execute their tasks for job completion; s − number of CUs per task, s = n/k; V − service time of each CU; X − exponential random variable with mean λ (the straggling component); Y − task completion time of each worker; Y_{k:n} − the k-th order statistic of the n task completion times; T_{n,k} − job completion time when each worker's task size is s = n/k.

  17. References
[1] G. Liang and U. C. Kozat, "Fast cloud: Pushing the envelope on delay performance of cloud storage with coding," IEEE/ACM Transactions on Networking, vol. 22, no. 6, pp. 2012–2025, 2014.
[2] A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, "Low latency via redundancy," in Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies. ACM, 2013, pp. 283–294.
[3] A. Gorbunova, I. Zaryadov, S. Matyushenko, and E. Sopin, "The estimation of probability characteristics of cloud computing systems with splitting of requests," in International Conference on Distributed Computer and Communication Networks. Springer, 2016, pp. 418–429.
[4] A. Behrouzi-Far and E. Soljanin, "Redundancy scheduling in systems with bi-modal job service time distributions," in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 9–16.
[5] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, "Coded computation over heterogeneous clusters," IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, 2019.
[6] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," Advances in Neural Information Processing Systems, vol. 29, pp. 2100–2108, 2016.
[7] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012, pp. 326–333.
[8] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2017.
[9] K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, and B. Van Houdt, "A better model for job redundancy: Decoupling server slowdown and job size," IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3353–3367, 2017.
[10] G. Joshi, E. Soljanin, and G. Wornell, "Efficient redundancy techniques for latency reduction in cloud systems," ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), vol. 2, no. 2, pp. 1–30, 2017.
[11] M. F. Aktaş and E. Soljanin, "Straggler mitigation at scale," IEEE/ACM Transactions on Networking, vol. 27, no. 6, pp. 2266–2279, 2019.
[12] M. S. Klamkin and D. J. Newman, "Extensions of the birthday surprise," Journal of Combinatorial Theory, vol. 3, no. 3, pp. 279–282, 1967.
[13] P. Peng, E. Soljanin, and P. Whiting, "Diversity vs. parallelism in distributed computing with redundancy." [Online]. Available: https://emina.flywheelsites.com

  18. Various Proposed Distributions 1. Which distribution should be used for the job service time?  For theoretical analysis, the shifted exponential was used in e.g. [1], Pareto in e.g. [2], Erlang in e.g. [3], and bi-modal in e.g. [4].  In this paper, we assume the service time per CU is shifted exponential: V ~ Δ + X, o where Δ is a constant modelling the minimum service time (the deterministic component); o X ~ Exp(λ) is an exponential random variable with mean λ modelling the straggling (the random component). The other distributions are analyzed in [13].
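As a concrete illustration of the assumed model (my own sketch, not from the paper), the shifted-exponential CU service time V = Δ + X can be sampled directly:

```python
import random

def cu_service_time(delta, lam, rng):
    # Shifted exponential: a constant minimum service time delta plus an
    # exponential straggling component X with mean lam.
    return delta + rng.expovariate(1.0 / lam)
```

Every sample is at least Δ, and the sample mean approaches Δ + λ, matching the deterministic-plus-random decomposition above.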

  19. Various Proposed Scaling Models 2. How does the probability distribution scale (change) with task size?  There is no consensus on this question: some papers scale the random component, e.g. [5, 6], while others scale the deterministic component, e.g. [7, 8]. These papers provide scalings for specific distributions; none of them provides a model for general distributions. With the CU service time V ~ Δ + X, the execution time Y of a task of s CUs is: scaling the random component: Y = Δ + s·X; scaling the deterministic component: Y = s·Δ + X, where s is the number of CUs per task (the task size).

  20. Prior Related Work We can classify the related references into two categories.  Category 1: many papers work on designing codes for given systems, e.g. [5, 6, 8]; however, they do not focus on optimizing the code rate.  Category 2: a few papers study how much redundancy should be introduced, e.g. [9, 10, 11]; however, [9, 10] only consider replication in queueing systems, and [11] only considers the Pareto distribution.

  21. Task Service Time Scaling Models The service time increases with the task size (the number of CUs s). Model 1 (Server-Dependent): the straggling effect depends on the server and is identical for each CU executed on that server; Δ is some initial handshake time. Y = Δ + s·X_i, where X_i ~ Exp(λ) is the straggling on server i. Model 2 (Data-Dependent): each CU in a task of s CUs takes Δ time to complete, and there is some inherent additive system randomness at each server. Y = s·Δ + X_i, where X_i ~ Exp(λ) is the system randomness on server i. Model 3 (Additive): the execution times of CUs are independent and identically distributed. Y = V_1 + ⋯ + V_s, where the V_i ~ S-Exp(Δ, λ) are the i.i.d. CU service times.

  22. Main Results

  23. Server-Dependent Model For each worker the task execution time is Y = Δ + s·X_i, where X_i ~ Exp(λ); the job completion time T_{n,k} is then the k-th order statistic of the n workers' task execution times. Theorem 1. The expected job completion time for the server-dependent execution model is given by E[T_{n,k}] = Δ + sλ(H_n − H_{n−k}) = Δ + (λn/k)(H_n − H_{n−k}) ≥ Δ + λ, where H_n denotes the n-th harmonic number. E[T_{n,k}] is minimized by replication, i.e., maximal diversity (k = 1).
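Theorem 1 can be sanity-checked numerically. The sketch below is illustrative (not the authors' code) and assumes Exp(λ) has mean λ, consistent with the formula above:

```python
import random

def harmonic(n):
    # n-th harmonic number H_n = 1 + 1/2 + ... + 1/n
    return sum(1.0 / i for i in range(1, n + 1))

def expected_job_time(n, k, delta, lam):
    # Theorem 1: E[T_{n,k}] = delta + s*lam*(H_n - H_{n-k}), with s = n/k.
    s = n // k
    return delta + s * lam * (harmonic(n) - harmonic(n - k))

def simulate_job_time(n, k, delta, lam, rng):
    # Server-dependent model: worker i needs delta + s*X_i; the job is done
    # when the k-th fastest worker finishes (k-th order statistic).
    s = n // k
    return sorted(delta + s * rng.expovariate(1.0 / lam)
                  for _ in range(n))[k - 1]
```

For k = 1 the formula collapses to Δ + λ, which illustrates why replication (maximal diversity) minimizes E[T_{n,k}] under this model.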
