Diversity vs. Parallelism in Distributed Computing with Redundancy

SLIDE 1

Diversity vs. Parallelism in Distributed Computing with Redundancy

Pei Peng*, Emina Soljanin*, Philip Whiting†

* Rutgers University † Macquarie University

2020 IEEE International Symposium on Information Theory

SLIDE 2

Background and Problem Description

SLIDE 3

Background

Distributed computing:

  • Numerous machine learning and other algorithms are growing in complexity and in their data requirements;
  • Distributed computing provides parallelism: simultaneous execution of the smaller tasks that make up a large computing job.

Redundancy:

  • The large-scale sharing of computing resources causes random fluctuations in task service times;
  • Redundancy provides diversity: a job is completed as soon as any fixed-size subset of its tasks has been executed.

SLIDE 4

Distributed Computing

Distributed computing provides simultaneous execution of smaller tasks that make up a large computing job.

[Figure: a homework job split between Bob and Alice]

SLIDE 5

Straggling Jobs

Many factors, e.g., large-scale resource sharing, maintenance activities, and queueing, cause random fluctuations in task service times.

[Figure: a straggling worker delays the job]

SLIDE 6

Redundancy

Redundancy, in the form of task replication or erasure coding, allows a job to be completed once only a subset of its (redundant) tasks has been executed, thus avoiding stragglers.

[Figure: homework split among Bob, Alice, and Eve, with one part replicated]

SLIDE 7

Straggler Mitigation: Example

[Figure: straggler mitigation example]

SLIDE 8

Worst Case for One Straggler


SLIDE 9

Coding

Erasure coding is a potentially powerful way to shorten job execution time, especially in the previous example. However, it can be used only in certain specific scenarios.

[Figure: a job is split and encoded]

SLIDE 10

Straggler Mitigation: Example

[Figure: straggler mitigation example with coding]

SLIDE 11

Diversity and Parallelism

[Diagram: a redundancy spectrum from replication (maximum diversity, minimum parallelism) through coding to splitting (minimum diversity, maximum parallelism)]

Diversity and parallelism are defined by the amount of redundancy per job: diversity increases with redundancy, while parallelism decreases with it.

SLIDE 12

Diversity vs. Parallelism Tradeoff

Both parallelism and diversity are essential in reducing job service time, but they pull in opposite directions.

                         Replication                     Splitting
Task execution times     Tr_1, Tr_2, ..., Tr_n           Ts_1, Ts_2, ..., Ts_n
Per-task comparison      P(Tr_j > t) > P(Ts_j > t), for j = 1, 2, ..., n
Job completion time      min{Tr_1, Tr_2, ..., Tr_n}      max{Ts_1, Ts_2, ..., Ts_n}

To formalize this tradeoff, we should answer the following questions:

  • 1. Which distribution should be used for the task service times (Tr, Ts)?
  • 2. How does that distribution change with task size?
SLIDE 13

System Model and Prior Work

SLIDE 14

System Model

Jobs can be split into tasks, which can be executed independently in parallel on n different workers:

  • 1. J1 is executed with splitting;
  • 2. J2 is executed with replication; the shadings are replicas;
  • 3. J3 is executed with a [4, 2] erasure code; the shadings are coded tasks.

We consider the expected execution time for each job.

SLIDE 15

Computing Unit (CU)

Fact: a job cannot be split into tasks of arbitrary size.

This raises the question of the smallest possible task, which motivates the computing unit (CU) as the atomic unit of work.

A job consists of tasks, and each task consists of CUs.

SLIDE 16

Parameters and Notations

n − number of workers (equal to the number of CUs in a job)
k − number of workers that have to execute their tasks for job completion
s − number of CUs per task, s = n/k
V − service time of each CU
X − exponential random variable (with mean W) modelling the straggling
Y − task completion time of each worker
Y_{k:n} − the k-th order statistic of Y_1, ..., Y_n
Z_{n,k} − job completion time when each worker's task size is n/k = s

SLIDE 17

References

[1] G. Liang and U. C. Kozat, "Fast cloud: Pushing the envelope on delay performance of cloud storage with coding," IEEE/ACM Transactions on Networking, vol. 22, no. 6, pp. 2012–2025, 2014.

[2] A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, "Low latency via redundancy," in Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies. ACM, 2013, pp. 283–294.

[3] A. Gorbunova, I. Zaryadov, S. Matyushenko, and E. Sopin, "The estimation of probability characteristics of cloud computing systems with splitting of requests," in International Conference on Distributed Computer and Communication Networks. Springer, 2016, pp. 418–429.

[4] A. Behrouzi-Far and E. Soljanin, "Redundancy scheduling in systems with bi-modal job service time distributions," in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 9–16.

[5] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, "Coded computation over heterogeneous clusters," IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, 2019.

[6] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," Advances in Neural Information Processing Systems, vol. 29, pp. 2100–2108, 2016.

[7] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012, pp. 326–333.

[8] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2017.

[9] K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, and B. Van Houdt, "A better model for job redundancy: Decoupling server slowdown and job size," IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3353–3367, 2017.

[10] G. Joshi, E. Soljanin, and G. Wornell, "Efficient redundancy techniques for latency reduction in cloud systems," ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), vol. 2, no. 2, pp. 1–30, 2017.

[11] M. F. Aktaş and E. Soljanin, "Straggler mitigation at scale," IEEE/ACM Transactions on Networking, vol. 27, no. 6, pp. 2266–2279, 2019.

[12] M. S. Klamkin and D. J. Newman, "Extensions of the birthday surprise," Journal of Combinatorial Theory, vol. 3, no. 3, pp. 279–282, 1967.

[13] P. Peng, E. Soljanin, and P. Whiting, "Diversity vs. parallelism in distributed computing with redundancy." [Online]. Available: https://emina.flywheelsites.com

SLIDE 18

Various Proposed Distributions

1. Which distribution should be used for the task service time?

  • For theoretical analysis, the shifted exponential was used in e.g. [1], the Pareto in e.g. [2], the Erlang in e.g. [3], and the bi-modal in e.g. [4].
  • In this paper, we assume the service time per CU is shifted exponential:

V ~ Δ + X

  • where Δ is a constant modelling the minimum service time (the deterministic component);
  • X is an exponentially distributed random variable with mean W, modelling the straggling (the random component).

The other distributions are analyzed in [13].
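
As a quick illustration, here is a minimal Python sketch of sampling shifted-exponential CU service times; the parameter values Δ = 1 and W = 2 are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def cu_service_times(m, delta=1.0, w=2.0):
    """Sample m CU service times V = delta + X, where X is exponential
    with mean w (delta and w are illustrative, not from the talk)."""
    return delta + rng.exponential(w, size=m)

v = cu_service_times(100_000)
print(v.mean())  # should be close to delta + w = 3.0
```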

SLIDE 19

Various Proposed Scaling Models

  • 2. How does the probability distribution scale (change) with task size?
  • There is no consensus on this question: some works scale the random component, e.g. [5, 6], while others scale the deterministic component, e.g. [7, 8]. These papers provide the scaling for specific distributions; none of them provides a model for general distributions.

The service time of each CU is V ~ Δ + X. For a task of s CUs (s is the task size), the task execution time Y is then:

Scaling the random component: Y = Δ + s · X
Scaling the deterministic component: Y = s · Δ + X

SLIDE 20

Prior Related Work

We can classify the related references into two categories.

  • Category 1: many papers design codes for given systems, e.g. [5, 6, 8]; however, they do not focus on optimizing the code rate.
  • Category 2: a few papers study how much redundancy should be introduced, e.g. [9, 10, 11]; however, [9, 10] focus only on applying replication in queueing systems, and [11] focuses only on the Pareto distribution.

SLIDE 21

Task Service Time Scaling Models

Model 1 (Server-Dependent): the straggling effect depends on the server and is identical for each CU executed on that server; here Δ is some initial handshake time. The service time increases with the task size (the number of CUs s):

Y = Δ + s · X_j, where X_j ~ Exp(W) is the per-CU service randomness at server j.

Model 2 (Data-Dependent): each CU in a task of s CUs takes Δ time to complete, and there is some inherent additive randomness at each server:

Y = s · Δ + X_j, where X_j ~ Exp(W) is the system randomness at server j.

Model 3 (Additive): the execution times of the CUs are independent and identically distributed:

Y = V_1 + ⋯ + V_s, where V_j ~ S-Exp(Δ, W) is the service time of a CU at server j.
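
The three scaling models can be compared side by side with a small Monte Carlo sketch in Python; the function name and the parameter values are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def task_time(model, s, delta=1.0, w=2.0, size=100_000):
    """Sample task execution times Y for a task of s CUs under the three
    scaling models (delta and w are illustrative, not from the talk)."""
    if model == "server-dependent":   # Model 1: Y = delta + s * X_j
        return delta + s * rng.exponential(w, size)
    if model == "data-dependent":     # Model 2: Y = s * delta + X_j
        return s * delta + rng.exponential(w, size)
    if model == "additive":           # Model 3: Y = sum of s i.i.d. S-Exp(delta, w)
        return (delta + rng.exponential(w, (size, s))).sum(axis=1)
    raise ValueError(f"unknown model: {model}")

for m in ("server-dependent", "data-dependent", "additive"):
    print(m, task_time(m, s=4).mean())
```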

SLIDE 22

Main Results

SLIDE 23

Server-Dependent Model

For each worker, the task execution time is Y = Δ + s · X_j, where X_j ~ Exp(W); the job completion time Z_{n,k} is then the k-th order statistic of the n workers' task execution times.

Theorem 1. The expected job completion time for the server-dependent execution model is

E[Z_{n,k}] = Δ + s W (H_n − H_{n−k}) = Δ + W (n/k)(H_n − H_{n−k}) ≥ Δ + W,

where H_m denotes the m-th harmonic number. E[Z_{n,k}] is minimized by replication/maximal diversity (k = 1).
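
Theorem 1 can be checked numerically; the Python sketch below (with illustrative n, k, Δ, W, not values from the talk) compares the closed form against a Monte Carlo estimate of the k-th order statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

def harmonic(m):
    return sum(1.0 / i for i in range(1, m + 1))

n, k, delta, w = 12, 4, 1.0, 2.0
s = n // k  # CUs per task

# Theorem 1: E[Z_{n,k}] = delta + s*W*(H_n - H_{n-k})
theory = delta + s * w * (harmonic(n) - harmonic(n - k))

# simulation: Z_{n,k} is the k-th smallest of n task times delta + s*X_j
y = delta + s * rng.exponential(w, size=(200_000, n))
simulated = np.sort(y, axis=1)[:, k - 1].mean()

print(theory, simulated)  # the two values should agree closely
```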

SLIDE 24

Numerical Analysis

Conclusions:

  • Replication/maximal diversity is always optimal;
  • Replication is much more effective when W/Δ is large, and has less payoff when W/Δ is small.

Note that different values of W and Δ with the same ratio W/Δ give different values of E[Z_{n,k}], so the results of different W/Δ scenarios should not be compared.
SLIDE 25

Data-Dependent Model

For each worker, the task execution time is Y = s · Δ + X_j, where X_j ~ Exp(W).

Theorem 2. The expected job completion time for the data-dependent execution model is

E[Z_{n,k}] = s Δ + W (H_n − H_{n−k}) = W [(n/k) · (Δ/W) + H_n − H_{n−k}].

  • When Δ ≫ W, the execution time is essentially deterministic, and it is optimal to use maximal parallelism (splitting);
  • When Δ ≪ W, the execution time is much more variable, and it is optimal to operate with maximal diversity (replication).

By taking the log approximation of the harmonic numbers, we find the optimal k* that minimizes E[Z_{n,k}]:

k* = n (−ρ/2 + √(ρ + ρ²/4)), where ρ = Δ/W.
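
The closed-form k* can be sanity-checked against a brute-force search over integer k; this Python sketch uses illustrative n, Δ, W (not values from the talk):

```python
import math

def harmonic(m):
    return sum(1.0 / i for i in range(1, m + 1))

def expected_z(n, k, delta, w):
    # Theorem 2: E[Z_{n,k}] = (n/k)*delta + W*(H_n - H_{n-k})
    return (n / k) * delta + w * (harmonic(n) - harmonic(n - k))

n, delta, w = 100, 0.5, 2.0
rho = delta / w

# closed-form optimum from the log approximation of harmonic numbers
k_star = n * (-rho / 2 + math.sqrt(rho + rho**2 / 4))

# brute-force optimum over integer k
k_best = min(range(1, n + 1), key=lambda k: expected_z(n, k, delta, w))

print(round(k_star), k_best)  # the two should roughly agree
```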

SLIDE 26

Numerical Analysis

Conclusions:

  • Replication is optimal when W/Δ is very large;
  • Splitting is optimal when W/Δ is small;
  • Otherwise, coding with a certain non-trivial rate is optimal (recall that k/n is the code rate).

SLIDE 27

Additive Model: Replication vs. Splitting

For each worker, the task execution time is Y = V_1 + ⋯ + V_s, where V_j = Δ + X_j and X_j ~ Exp(W).

Replication (k = 1): each worker executes all n CUs.

Theorem 3. The expected job completion time for the additive execution model under replication is

E[Z_{n,1}] = n Δ + (W/n) ∫_0^∞ e^{−t} [S_n(t/n)]^n dt, where S_n(x) = 1 + x/1! + x²/2! + ⋯ + x^{n−1}/(n−1)!.

Splitting (k = n): each worker executes one CU, so the job completion time is the largest order statistic, which decomposes into a sum of minima:

Z_{n,n} = Δ + X_{1:n} + X_{1:(n−1)} + ⋯ + X_{1:2} + X_{1:1}, and hence E[Z_{n,n}] = Δ + W H_n.

Lemma 1. For n sufficiently large, splitting (maximal parallelism) outperforms replication (maximal diversity). The proof uses the birthday-problem analogy of [12].
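
A small simulation sketch comparing the two extremes under the additive model (Python; the parameters are illustrative assumptions, with n = 12 chosen because the numerical analysis below notes that n = 12 already satisfies Lemma 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n, delta, w, trials = 12, 0.1, 2.0, 50_000

# replication (k = 1): each of the n workers executes all n CUs;
# the job finishes when the fastest worker finishes its sum of n CU times
v = delta + rng.exponential(w, size=(trials, n, n))
repl = v.sum(axis=2).min(axis=1).mean()

# splitting (k = n): each worker executes one CU;
# the job finishes when the slowest worker finishes
v1 = delta + rng.exponential(w, size=(trials, n))
split = v1.max(axis=1).mean()

print(repl, split)  # per Lemma 1, splitting should win for n this large
```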

SLIDE 28

Additive Model: Half-Rate Coding vs. Splitting

Consider the special case Δ = 0, so that the service time of each CU is exponential.

Theorem 4. Suppose that n = 2k ≥ 4 is even. Then Z_{n,n} stochastically dominates Z_{n,n/2}; that is,

P(Z_{n,n/2} > t) ≤ P(Z_{n,n} > t).

Conclusion: for a system with exponential service times and sufficiently large n, half-rate coding is better than splitting, which in turn is better than replication.

Note that this conclusion has two limitations: the service time distribution is memoryless, and the result holds only for code rate 1/2.
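
The dominance in Theorem 4 can be eyeballed with a quick Monte Carlo estimate of the two tail probabilities (Python; n, W, and the thresholds are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, w, trials = 8, 1.0, 200_000
k = n // 2

# delta = 0: CU times are plain exponentials (the memoryless case)
# half-rate coding: tasks of s = 2 CUs; job needs the fastest k = n/2 workers
tasks = rng.exponential(w, size=(trials, n, 2)).sum(axis=2)
z_half = np.sort(tasks, axis=1)[:, k - 1]

# splitting: one CU per worker; job needs all n workers
z_split = rng.exponential(w, size=(trials, n)).max(axis=1)

for t in (1.0, 2.0, 4.0):
    # Theorem 4 predicts the first tail is never larger than the second
    print(t, (z_half > t).mean(), (z_split > t).mean())
```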

SLIDE 29

Numerical Analysis

Conclusions:

  • Splitting is optimal when W/Δ is small;
  • Coding is optimal when W/Δ is large;
  • Half-rate coding is better than splitting when Δ = 0;
  • Even n = 12 is sufficiently large to satisfy Lemma 1.

SLIDE 30

Conclusions

Table I summarizes more general results reported in [13].

SLIDE 31

Conclusions and Future Work

Server-dependent and data-dependent models:

Conclusion: for a general distribution with a finite first moment, when Δ is much larger than the expectation of the random component, splitting is the optimal strategy; when Δ is much smaller, replication is optimal; otherwise, coding with a proper code rate is optimal.

Conjecture 1. For a general distribution, if the scaling parameter is the random component, replication is the optimal strategy; if the scaling parameter affects both the deterministic and the random components, coding or splitting is optimal.

Additive model:

Conjecture 2. For a general distribution, only coding or splitting is the optimal strategy, and the optimal code rate is always greater than 1/2.