A Network-aware Scheduler in Data-parallel Clusters for High - PowerPoint PPT Presentation

A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li , Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018

1/61 Introduction • Data-parallel clusters Introduction Introduction • Used to process large datasets efficiently Related Work • Deployed in many large organizations NAS • E.g., Facebook, Google and Yahoo! Evaluation • Shared by users from different groups Conclusion

2/61 Motivations • Network-intensive stages in data-parallel jobs Introduction Introduction Related Work NAS Evaluation Conclusion [1] M. Chowdhury, Y. Zhong, and I. Stoica . “Efficient coflow scheduling with varys ”. In: Proc. of SIGCOMM. 2014.

3/61 MapReduce A MapReduce job Map Shuffle Output/ Shuffle Introduction Introduction data Input Map task Reduce task Related block Work Input Map task Reduce task NAS block Evaluation Input Map task Reduce task block Conclusion Map stage Reduce stage

4/61 Motivations • Network-intensive stage • E.g., 60% and 20% of the jobs on the Yahoo and Facebook clusters, Introduction Introduction respectively, are reported to be shuffle-heavy • Jobs with large shuffle data size, generating a large amount of network Related traffic Work • More than 50% of time spent in network communication [2] Problem: A large number of shuffle-heavy jobs may cause bottleneck on the NAS cross-rack network • Oversubscribed network from rack-to-core in datacenter Evaluation • Oversubscription ratio ranging from 3:1 to 20:1 Conclusion • Nearly 50% of cross-rack bandwidth used by background transfer [1] M. Chowdhury, Y. Zhong, and I. Stoica . “Efficient coflow scheduling with varys ”. In: Proc. of SIGCOMM. 2014.

5/61 Outline • Introduction Introduction • Related Work Related Work • Network-Aware Scheduler Design (NAS) NAS • Evaluation Evaluation Conclusion • Conclusion

6/61 Related Work – Fair and Delay Map input data Introduction Map task Related Related Work Work Reduce task NAS Evaluation Rack 1 Rack 2 Rack 3 Conclusion Place the map task close to the input data – data locality Problem: Place the reduce task randomly

7/61 Related Work – ShuffleWatcher (ATC’14) Introduction Map input data Related Related Map task Work Work Reduce task NAS Evaluation Conclusion Rack 1 Rack 2 Rack 3 Pre-compute the map and reduce placement and attempt to place map and reduce on the same racks to minimize the cross-rack traffic

8/61 Related Work – ShuffleWatcher (ATC’14) Map input data Introduction Map task Related Related Work Work Reduce task NAS Evaluation Rack 1 Rack 2 Rack 3 Conclusion Problem: Reduce the cross-rack shuffle traffic at the cost of reading remote map input data.

9/61 Related Work – ShuffleWatcher (ATC’14) Map input data Introduction Map task Related Related Work Work Reduce task NAS Evaluation Rack 1 Rack 2 Rack 3 Conclusion Problem: Resource contention on the racks – intra-job and inter-job

11/61 Challenges • Network-aware scheduler • How to reduce cross-rack congestion Introduction • How to reduce cross-rack traffic Related Work • Idea NAS NAS • Network not saturated at all time Evaluation • Designing schedulers to place tasks • Balance the network load Conclusion • Consider shuffle data locality in addition to input data locality

12/61 Network-Aware Scheduler (NAS) • Map task scheduling (MTS) • Balance the network load Introduction • Congestion-avoidance reduce task scheduling (CA-RTS) Related Work • Consider shuffle data locality • adaptively adjusts the map completion threshold of jobs based on their NAS shuffle data sizes NAS • Congestion-reduction reduce task scheduling (CR-RTS) Evaluation • Balance the network load Conclusion

13/61 Map task scheduling (MTS) • Goal: balancing the network load Introduction • Set a TrafficThreshold for each node • Cannot process more shuffle data than this threshold at one time Related • Constrain the generated shuffle data size at a time Work • Map task scheduling NAS NAS • Map input data locality and fairness Evaluation • Whether the generated shuffle data size on a node exceeds the TrafficThreshold after placing a task Conclusion

14/61 Map task scheduling (MTS) • Setting the TrafficThreshold • Could be changed based on workloads Introduction • Distribute the shuffle data into each wave Related • Task wave Work • Number of tasks >> number of containers • Tasks scheduled to all available containers, forming the first wave NAS • Second wave, third wave … NAS 𝑈𝑇 • TrafficThreshold = 𝑂∗𝑋 Evaluation • TS – total shuffle data size of jobs in the cluster • N – the total number of nodes in the cluster Conclusion • W – the number of waves: the total number of map tasks/the total number of containers 14

15/61 Map task scheduling (MTS) • User 1: • Job1: 6 map tasks and 6 reduce tasks Introduction • Job3: 6 map tasks and 6 reduce tasks • User 2: Related • Job2: 6 map tasks and 6 reduce tasks Work • Job4: 6 map tasks and 6 reduce tasks NAS NAS • Each map -> each reduce Evaluation • Job1 and Job2: 8 • Job3 and Job4: 1 Conclusion 15

16/61 Congestion-avoidance Reduce Task Scheduling (CA-RTS) • Check network status -- CongestionThreshold (e.g., 80% of cross- rack bandwidth) Introduction • Used when the CongestionThreshold is NOT reached • Goal: reduce cross-rack traffic Related • A rack has more shuffle data of a job  assign more reduce Work tasks of the job on this rack to reduce cross-rack traffic NAS • NAS The number of reduce tasks of a job scheduled on a rack does not exceed ReduceNum Rack 1 Rack 2 • ReduceNum = TotalReduceNum * MapOutputPortion Evaluation 30% 10 reduce 70% tasks Conclusion 7 3

17/61 Congestion-reduction Reduce Task Scheduling (CR-RTS) • Used when the CongestionThreshold is reached Introduction • Goal: reduce cross-rack network congestion Related Work • Launch a reduce task from a shuffle-light job • Small shuffle data size NAS NAS • Minimal impact on the cross-rack traffic Evaluation • If no, search the next user until a reduce task from a shuffle- light job is found Conclusion

19/61 Evaluation • Real cluster experiment • Throughput Introduction • Average job completion time Related Work • Cross-rack congestion NAS • Cross-rack traffic Evaluation Evaluation • Sensitivity analysis Conclusion • Simulation study

20/61 Evaluation • Real cluster experiment • 40-node cluster organized into 8 racks, 5 nodes each rack Introduction • 8 racks interconnected by a core switch • Oversubscription 5:1 from the rack to core Related Work • Workload • 200 jobs from the Facebook synthesized execution framework [1] NAS • Baselines Evaluation Evaluation • Fair Scheduler (current scheduler in Hadoop) • Delay Scheduler (current scheduler in Hadoop) Conclusion • ShuffleWatcher (ATC’14) [1] Y. Chen, A. Ganapathi , R. Griffith, and R. Katz. “The Case for Evaluating MapReduce Performance Using Workload Suites.” In: Proc. of MASCOTS. 2011.

21/61 Throughput 2 Normalized throughput 1.63 Introduction 1.5 1.24 1.1 Related 1 Work 0.5 NAS 0 Fair Delay ShuffleWatcher NAS Evaluation Evaluation NAS improves the throughput over Fair, Delay and Conclusion ShuffleWatcher by 63%, 48%, 31%, respectively

22/61 Average Job Completion Time Normalized average job Introduction 1 0.89 completion time 0.83 Related 0.56 Work 0.5 NAS 0 Fair Delay ShuffleWatcher NAS Evaluation Evaluation NAS reduces the average job completion time over Fair, Delay Conclusion and ShuffleWatcher by 44%, 37%, 33%, respectively

23/61 Cross-rack congestion 1.2 occurrences of cross-rack 1 Introduction 0.91 Total number of 0.84 0.8 congestions Related 0.6 0.55 Work 0.4 0.2 NAS 0 Fair Delay ShuffleWatcher NAS Evaluation Evaluation Conclusion NAS reduces the cross-rack congestion over Fair, Delay and ShuffleWatcher by 45%, 40%, 34%. 23

24/61 Conclusion We can improve the performance of current state-of-the-art schedulers (e.g., Fair and Delay schedulers in Hadoop) by Introduction • balancing the network traffic and enforcing the data locality for shuffle Related data, Work NAS • aggregating the data transfers to efficiently exploit optical circuit switch in hybrid electrical/optical datacenter network while still guaranteeing Evaluation parallelism, Conclusion Conclusion • and adaptively scheduling a job to either scale-up machines or scale-out machines that benefit the job the most in hybrid scale-up/out cluster.

25/61 Introduction Related Work Zhuozhao Li NAS Ph.D. Candidate Evaluation Department of Computer Science University of Virginia Conclusion ZL5UQ@VIRGINIA.edu

Backup 26

Shuffle Data Size Predictor • MapOutput = (map output/input ratio) ∗ MapInput • Unpredicted and predicted job • Update in real time 27

A Network-aware Scheduler in Data-parallel Clusters for High - PowerPoint PPT Presentation

A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li , Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018 1/61 Introduction Data-parallel clusters Introduction

Preempting Scheduler Activations Scheduler activations are completely preemptable Deadlocks

WORK STEALING SCHEDULER 2 6/16/2010 Work Stealing Scheduler

Design and Implemention of a Plugin Scheduler for DIET March 11, 2005 Design and Implemention of

I nternational research The evidence on clusters is clear Firms located in clusters are more

Internet Server Clusters Internet Server Clusters Jeff Chase Duke University, Department of

LTE eNB Scheduler performance 3rd Fed4FIRE Engineering Conference experiments 14.03.2018

Avoiding Scheduler Subversion usin ing Scheduler-Cooperative Locks Yuvraj Patel, Leon Yang * ,

CPU Scheduling Schedulers Structure of a CPU scheduler Criteria for scheduling

GNU Radio Advanced Scheduler Dude: Josh Blum - New scheduler features and stuff GRAS - Project

scheduling 2 FCFS, RR, priority, SRTF 1 last time xv6 scheduler design separate scheduler

Three-Level Scheduling CPU CPU scheduler Scheduling Arriving jobs How to choose which of the

A Configurable Hardware Scheduler A Configurable Hardware Scheduler (CHS) for Real- -Time

Avoiding Scheduler Subversion using Scheduler - Cooperative Locks Yuvraj Patel , Leon Yang , Leo

CS 5220: Parallel machines and models David Bindel 2017-09-07 1 Why clusters? Clusters of

Toolkit to Support Intelligibility in Context Aware Applications Context-Aware Applications P

Locational narratives in creative clusters An exploration of place, reputation and creative

Presenting FreeNAS Olivier COCHARD-LABBE (olivier@freenas.org) Presentation available at:

Efficient Neural Architecture Search via Parameter Sharing ICML 2018 Authors: Hieu Pham, Melody

NAS-Bench-101 : Towards Reproducible Neural Architecture Search Chris Ying 1 , Aaron Klein 2 ,

Knowledge Distillation for Block-wisely Supervised NAS Xiaojun Chang / Monash University July 1

Dalhousie University Records Management Introduction to Records Management Dalhousie University

Running Debian on Inexpensive Network Attached Storage Devices Martin Michlmayr tbm@cyrius.com

OpenSDS An Indus try W ide Colla bora tion For SDS Ma na gement Cameron Bahar and Steven Tan

Hamsa Balakrishnan Massachuse1s Ins3tute of Technology Resilient Ops, Inc. (With Bala Chandran,

Sambuz

Useful Links

Newsletter

Mail Us

A Network-aware Scheduler in Data-parallel Clusters for High - PowerPoint PPT Presentation

A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li , Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018 1/61 Introduction Data-parallel clusters Introduction

Preempting Scheduler Activations Scheduler activations are completely preemptable Deadlocks

WORK STEALING SCHEDULER 2 6/16/2010 Work Stealing Scheduler

Design and Implemention of a Plugin Scheduler for DIET March 11, 2005 Design and Implemention of

I nternational research The evidence on clusters is clear Firms located in clusters are more

Internet Server Clusters Internet Server Clusters Jeff Chase Duke University, Department of

LTE eNB Scheduler performance 3rd Fed4FIRE Engineering Conference experiments 14.03.2018

Avoiding Scheduler Subversion usin ing Scheduler-Cooperative Locks Yuvraj Patel, Leon Yang * ,

CPU Scheduling Schedulers Structure of a CPU scheduler Criteria for scheduling

GNU Radio Advanced Scheduler Dude: Josh Blum - New scheduler features and stuff GRAS - Project

scheduling 2 FCFS, RR, priority, SRTF 1 last time xv6 scheduler design separate scheduler

Three-Level Scheduling CPU CPU scheduler Scheduling Arriving jobs How to choose which of the

A Configurable Hardware Scheduler A Configurable Hardware Scheduler (CHS) for Real- -Time

Avoiding Scheduler Subversion using Scheduler - Cooperative Locks Yuvraj Patel , Leon Yang , Leo

CS 5220: Parallel machines and models David Bindel 2017-09-07 1 Why clusters? Clusters of

Toolkit to Support Intelligibility in Context Aware Applications Context-Aware Applications P

Locational narratives in creative clusters An exploration of place, reputation and creative

Presenting FreeNAS Olivier COCHARD-LABBE (olivier@freenas.org) Presentation available at:

Efficient Neural Architecture Search via Parameter Sharing ICML 2018 Authors: Hieu Pham, Melody

NAS-Bench-101 : Towards Reproducible Neural Architecture Search Chris Ying *1 , Aaron Klein *2 ,

Knowledge Distillation for Block-wisely Supervised NAS Xiaojun Chang / Monash University July 1

Dalhousie University Records Management Introduction to Records Management Dalhousie University

Running Debian on Inexpensive Network Attached Storage Devices Martin Michlmayr tbm@cyrius.com

OpenSDS An Indus try W ide Colla bora tion For SDS Ma na gement Cameron Bahar and Steven Tan

Hamsa Balakrishnan Massachuse1s Ins3tute of Technology Resilient Ops, Inc. (With Bala Chandran,

Sambuz

Useful Links

Newsletter

Mail Us

NAS-Bench-101 : Towards Reproducible Neural Architecture Search Chris Ying 1 , Aaron Klein 2 ,