
Coupling Task Progress for MapReduce Resource-Aware Scheduling

Jian Tan, Xiaoqiao Meng, Li Zhang IBM T. J. Watson Research Center Yorktown Heights, New York, 10598 Email: {tanji, xmeng, zhangli}@us.ibm.com

Abstract—Schedulers are critical in enhancing the performance of MapReduce/Hadoop in the presence of multiple jobs with different characteristics and performance goals. Though current schedulers for Hadoop are quite successful, they still have room for improvement: map tasks (MapTasks) and reduce tasks (ReduceTasks) are not jointly optimized, although there is a strong dependence between them. This can cause job starvation and unfavorable data locality. In this paper, we design and implement a resource-aware scheduler for Hadoop. It couples the progress of MapTasks and ReduceTasks, utilizing Wait Scheduling for ReduceTasks and Random Peeking Scheduling for MapTasks to jointly optimize task placement. This mitigates the starvation problem and improves the overall data locality. Our extensive experiments demonstrate significant improvements in job response times.

I. INTRODUCTION

MapReduce [1] has emerged as a popular paradigm for processing large datasets in parallel over a cluster. As an open source implementation, Hadoop [2] has been successfully used in a variety of applications, such as social network mining, log processing, video and image analysis, search indexing, and recommendation systems. In many scenarios, long batch jobs and short interactive queries are submitted to the same MapReduce cluster, sharing limited common computing resources with different performance goals. To meet these imposed challenges, an efficient scheduler is critical to providing the desired quality of service for the MapReduce cluster. In the domain of MapReduce scheduling, Fair Scheduler [3] is the most widely used in practice. Other commonly used schedulers include the default FIFO Scheduler, Capacity Scheduler [4], and variations [5]–[7]. To improve the performance of large-scale MapReduce clusters, more complicated resource management schemes have also been proposed [8]–[11].

While focusing on Fair Scheduler, as it is the de facto standard in the Hadoop community, we observe that it, as well as many other schedulers, still exhibits room for improvement.

1) Map and reduce tasks are scheduled separately [5] without joint optimization. First, Fair Scheduler only guarantees the fairness of MapTasks and is not really fair for ReduceTasks. We observe that allocating excess resources to ReduceTasks without coordinating with the map progress leads to cluster-wide resource under-utilization, as evidenced by the starvation problem [12]. Second, most MapReduce schedulers only consider data locality for MapTasks and ignore that it is also an issue for ReduceTasks. Though it has recently been addressed in [13], [14], the adopted approaches are sensitive to future run-time information (e.g., map output distribution and competition among new jobs) that is difficult to predict.

2) Fair Scheduler uses Delay Scheduling [12], which allows MapTasks to wait for a period of time to find local data. It usually improves the data locality for MapTasks. However, we observe that the introduced delays may lead to under-utilization and instability, i.e., the number of MapTasks running simultaneously is far below a desired level and changes with large variations over time.

In view of these observations, we propose a resource-aware scheduler, termed Coupling Scheduler. It couples the progress of map and reduce tasks to mitigate starvation, and jointly optimizes the placements of both to improve the overall data locality. Specifically, we utilize Wait Scheduling for ReduceTasks and Random Peeking Scheduling for MapTasks, taking into consideration the interactions between them, to holistically optimize the data locality. Our extensive experiments demonstrate significant improvements in job processing times.

A. Scheduling ReduceTasks

While MapTasks are small tasks that can run independently in parallel, ReduceTasks are long-running tasks that contain copy/shuffle and reduce phases. In most existing schedulers, ReduceTasks are not preemptive, i.e., a ReduceTask does not release its occupied slot until its reduce phase completes. It is feasible to introduce ReduceTask preemption in an engineering realization [15]; however, this work adheres to the current non-preemption assumption. Under Fair Scheduler, once a certain percentage of a job's MapTasks finish, its ReduceTasks are launched greedily up to a maximum. This method overlaps the copy/shuffle phase with the map phase of a job and can greatly reduce job processing times. However, this approach can starve newly arrived jobs [12], and the problem is even more pronounced when many small jobs arrive after large ones [16]. The experiment in Fig. 1 further illustrates the problem.

[Fig. 1. Starvation problem with Fair Scheduler: the number of running map and reduce tasks over time for Job 1 and Job 2.]

Fig. 1 shows the number of map and reduce tasks running simultaneously at every time point for two Grep jobs. Job 1 grabs all the reduce slots at time 0.9 minutes, just before job 2 is submitted at time 1.0 minutes. Thus, when job 2 finishes its MapTasks at time 3.8 minutes, it cannot launch

its ReduceTasks until job 1 releases some reduce slots at time 11.3 minutes. This shows that Fair Scheduler is not really fair. An ultimate solution to this problem is to design preemptive ReduceTasks [15]. For non-preemptive ReduceTasks, a better approach is to avoid or alleviate the problem, if possible, from the very beginning. Since it is difficult to predict future job arrivals, we introduce a simple but effective approach: launch ReduceTasks gradually according to the progress of the MapTasks. If the reduce progress lags behind the map progress, the job should have a better chance to launch ReduceTasks; otherwise, it should not be aggressive in acquiring more slots that could be utilized by other, more urgent jobs.

[Fig. 2. Repeating the experiment of Fig. 1 with Coupling Scheduler: the number of running map and reduce tasks over time for Job 1 and Job 2.]

Repeating the same experiment as in Fig. 1, we show in Fig. 2 that the starvation problem can be mitigated with the new approach. Interestingly, both jobs run faster with Coupling Scheduler in this experiment; more similar observations are presented in Section III-A. This is partially due to a side effect of gradually launching ReduceTasks. Specifically, each ReduceTask has a set of threads responsible for fetching map output bytes through disks or networks. Greedily launching ReduceTasks to a maximum causes these fetching activities (I/O and network intensive) to synchronously compete for the same resources. Gradually launching ReduceTasks can alleviate this sync problem. We further investigate this issue by measuring disk I/O and CPU performance in Section II-A. In addition, observe that the gray area for the number of ReduceTasks running simultaneously in Fig. 2 is almost half of that in Fig. 1. This is because ReduceTasks are not work-conserving: sometimes ReduceTasks hold slots while waiting for map outputs. Therefore, postponing the launch of ReduceTasks in these cases does not increase job processing delays; it decreases the idle periods during which a job holds reduce slots and increases the fairness to ReduceTasks.

This method enables a new way to jointly optimize data locality for both MapTasks and ReduceTasks, which is absent in existing approaches [6], [9], [10], [12], [14]. MapTasks take the input data and ReduceTasks fetch the intermediate data. There should exist an optimization that places 1) MapTasks close to the input data and the running ReduceTasks, and 2) ReduceTasks close to a "centrality" of the already generated intermediate data stored in the Hadoop cluster. The placements of MapTasks determine where the intermediate data are generated and stored. In turn, the placements of ReduceTasks affect the cost of fetching the intermediate data from the MapTasks. Nevertheless, it is difficult to predict future information, e.g., intermediate data sizes and competition among future jobs. The greedy approach to launching ReduceTasks taken by existing schedulers can make wrong decisions at the beginning, and for long-running ReduceTasks an early mistake takes much more effort to correct. To detect such mistakes early, Mantri [9] was designed for Dryad [17] and the LATE scheduler [8] for Hadoop; these designs rely on re-computations and speculative executions. Our unique way of gradually launching ReduceTasks makes it possible to form a cooperation between the map and reduce phases: initially, neither MapTasks nor ReduceTasks may know their optimal placements under the common goal of better data locality; as the job evolves, new run-time information becomes available so that MapTasks and ReduceTasks can alternately optimize, step by step, depending on the decisions taken by the other phase.

B. Scheduling MapTasks

We focus on data locality in scheduling MapTasks. A common approach is to trade fairness off against data locality, e.g., Quincy [10] and Delay Scheduling [12]. Clearly there is a balance: achieving optimal data locality can largely delay a job, which in turn breaks the fairness constraint. Quincy [10] addresses this issue by mapping some aspects to a graph structure and relying on a min-flow solver; however, as pointed out in [10], not all aspects are suitable for this approach, and fairness is a rather ambiguous concept. Therefore, in the following we assume that some adopted algorithm is already available to compute a fair assignment of MapTasks when ignoring data locality constraints, e.g., the slot allocation algorithm of Fair Scheduler [12] or FLEX [5]. Based on the computed assignments of MapTasks that only consider fairness, we make further adjustments to improve the data locality by relaxing the fairness requirements.

Fair Scheduler uses Delay Scheduling to improve data locality for MapTasks [12]. We observe that the introduced delays can affect MapTask scheduling. As a consequence, the number of MapTasks running simultaneously can be far below a desired level (under-utilization) and fluctuate significantly over time (instability) when the input data are not uniformly distributed. This uneven storage situation is not uncommon. To mitigate this problem, we propose Random Peeking Scheduling, which avoids delays for MapTasks by making decisions immediately, randomly peeking at a number of other nodes to obtain data locality information.

C. Summary of design rationale

Mitigate starvation. Coupling Scheduler enables sequential cooperation between the map and reduce phases. Gradually launching reducers according to the map task progress effectively mitigates the starvation problem in the presence of multiple jobs.

Do no harm to single jobs. Regarding the concern that not using all available reduce slots immediately may increase the processing times of single jobs, our experiments in Section III-A show that the delays introduced to reducers do no harm in most cases, due to the non-work-conserving feature of reducers. Interestingly, they even expedite job processing in many scenarios due to reduced resource contention.

Improve data locality. Wait Scheduling for reduce tasks and Random Peeking Scheduling for map tasks, which take into account the interactions between them, can greatly improve the overall data locality.

Improved performance. We demonstrate the good performance of Coupling Scheduler compared to Fair Scheduler by repeating 22 jobs 5 times. Fig. 3 compares the job processing times. Each line shows the starting time and the minimum, mean and maximum delays over the 5 trials, with the three vertical marks from left to right representing the minimum, mean and maximum, respectively. On average, the job response times decrease by 39.1%. For some jobs, e.g., jobs 3, 6, 12 and 13, the response time decreases by more than 70%. The submitted workload includes CPU- and I/O-intensive, map-heavy, copy/merge-heavy and reduce-heavy jobs [16], with short jobs after large ones [18].

[Fig. 3. Comparison of job processing times for 22 jobs under Coupling and Fair Schedulers.]

Most jobs spend less time with Coupling Scheduler in this experiment; e.g., job 3 is 12.8 times faster. We also notice that Fair Scheduler results in a larger variation in job processing times. For example, the processing time of job 3 varies between 0.8 and 12.1 minutes when we repeat this test multiple times. This instability arises because job 3 arrives around a critical time point when jobs 1 and 2 are about to take all the reduce slots under the greedy strategy. If job 3 happens to see an available reduce slot soon after its submission, it can finish much earlier.

II. DESIGN DETAILS

This section describes Wait Scheduling for ReduceTasks and Random Peeking Scheduling for MapTasks, respectively. Fig. 4 plots the schematic diagram of Coupling Scheduler: it determines the job whose reduce progress lags the most behind its map progress as a candidate for launching ReduceTasks.

[Fig. 4. Schematic diagram of Coupling Scheduler.]

A. Details on scheduling ReduceTasks

Gradually launching ReduceTasks depending on the map progress has benefits. In summary, 1) it is fairer to ReduceTasks and can relieve starvation; 2) it allows sequential control between the map and reduce phases to jointly improve data locality; 3) it alleviates the sync problem in the copy/shuffle phase by reducing disk I/O and network contention.

For comparison, we first run 10 identical Grep jobs using Fair Scheduler on 8 nodes (4 map slots and 2 reduce slots per node). We plot the number of MapTasks and ReduceTasks of each job running simultaneously at each time in Fig. 5. These jobs are submitted one after another at 5-minute intervals, each containing 427 MapTasks and 8 ReduceTasks. MapTasks are processed according to processor sharing. Reducers are processed in a more complicated manner. For example, when the first two jobs are submitted, they grab all the reduce slots at time 5.0 minutes, which prevents job 3 from taking reduce slots until job 1 releases them around time 16.0 minutes. Immediately, the released reduce slots are allocated to job 3. As a result, only when job 2 finishes around time 30.0 minutes can jobs 4, 5 and 6 grab reduce slots, though these three jobs have already joined the queue before time 25.0 minutes.

[Fig. 5. 10 Grep jobs with Fair Scheduler: running maps and running reducers over time.]

1) Coordination between map/reduce phases: Coupling Scheduler uses a mismatch value to describe the progress difference between MapTasks and ReduceTasks. For job i, it matches the fraction x_i of MapTasks that have completed against the fraction y_i of ReduceTasks that have been launched. Job i is associated with a function f_i(x) : [0, 1] → [0, 1]. Only when y_i < f_i(x_i) should job i request to launch ReduceTasks. We simply choose f_i(x) ≡ f(x) for all i. We have provided a theoretical analysis of the performance of this algorithm under a general MapReduce model in [19], [20]. In this paper, we describe an engineering realization.


We compute the mismatch value in Algorithm 1. Therein, job.desiredMaps (job.desiredReduces) denotes the total number of map (reduce) tasks of the job. The parameter δ indicates that all ReduceTasks of a job should be launched before it finishes a fraction δ of its MapTasks, if there are available slots. Line 3 provides a simple way to compute δ: assign a smaller δ to jobs with fewer ReduceTasks; for example, set threshold = 3. A better way is to also consider the map output size: less intermediate data implies a larger δ. On line 9, we raise a job's mismatch when it has no pending MapTasks but still has pending ReduceTasks, with priority given to the job with the fewest pending ReduceTasks. We choose the form 4 + 1/job.pendingReduces because line 7 normalizes the mismatch value to [0, 1/δ] with δ > 0.25 and 1/0.25 = 4; thus jobs with only pending ReduceTasks have larger mismatch values than jobs with pending MapTasks.

Algorithm 1 Mismatch between the map/reduce progresses

1: Comment: It is computed for every job on a received heartbeat.
2: if job.pendingReduces > 0 then
3:   δ = 1 − exp(−job.desiredReduces/threshold)
4:   unit = δ × job.desiredMaps/job.desiredReduces
5:   mapProgress = job.finishedMaps/unit
6:   redProgress = job.finishedReduces + job.runningReduces + 1
7:   mismatch = (mapProgress − redProgress)/job.desiredReduces
8:   if job.pendingMaps == 0 then
9:     mismatch = 4 + 1/job.pendingReduces
10:   end if
11: else
12:   mismatch = 0
13: end if
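For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1 line for line; the Job fields correspond to the job.* counters above, and the default threshold = 3 follows the example in the text. It is an illustration of the computation, not the authors' Hadoop code.

import math
from dataclasses import dataclass

@dataclass
class Job:
    desired_maps: int      # total number of MapTasks
    desired_reduces: int   # total number of ReduceTasks
    finished_maps: int
    finished_reduces: int
    running_reduces: int
    pending_maps: int
    pending_reduces: int

def mismatch(job: Job, threshold: float = 3.0) -> float:
    """Mismatch between map and reduce progress (Algorithm 1).

    Computed for every job on each received heartbeat; a larger
    positive value means the reduce phase lags further behind.
    """
    if job.pending_reduces <= 0:
        return 0.0
    # Jobs with fewer ReduceTasks get a smaller delta, so all their
    # reducers are launched earlier in the map phase (line 3).
    delta = 1.0 - math.exp(-job.desired_reduces / threshold)
    unit = delta * job.desired_maps / job.desired_reduces
    map_progress = job.finished_maps / unit
    red_progress = job.finished_reduces + job.running_reduces + 1
    m = (map_progress - red_progress) / job.desired_reduces
    if job.pending_maps == 0:
        # All maps done but reducers still pending: boost above the
        # normalized range [0, 1/delta] (1/0.25 = 4), favoring jobs
        # with the fewest pending ReduceTasks (line 9).
        m = 4.0 + 1.0 / job.pending_reduces
    return m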

Repeating the same experiment with Coupling Scheduler shows that ReduceTasks can fairly share the limited reduce slots, as in Part (b) of Fig. 6. We compare the processing time of each job under the two schedulers in Part (a) of Fig. 6. Interestingly, all of the submitted jobs have a shorter processing time with Coupling Scheduler; on average it decreases by 16.1% compared to Fair Scheduler. This workload is not typical, since all jobs are identical. For a typical one that contains small jobs after large ones [16], the improvement can be up to an order of magnitude; see Fig. 3.

[Fig. 6. 10 Grep jobs with Coupling Scheduler: (a) comparison of job processing times; (b) task assignment (running mappers and reducers) using Coupling Scheduler.]

During the copy/shuffle phase, each ReduceTask has a set of threads responsible for fetching map output bytes through disks or networks. Greedily launching ReduceTasks to a maximum puts these fetching activities (I/O and network intensive) on different ReduceTasks in sync, which slows down the running threads that share the same resources on the nodes that host the MapTasks, as shown in Part (a) of Fig. 7. MapTasks generate intermediate data that may be constantly spilled to disk, after applying a quick-sort, when the buffer overflows. Gradually launching ReduceTasks can alleviate the sync problem. For example, as demonstrated in Part (b) of Fig. 7, between t1 and t2 only one ReduceTask fetches intermediate data under Coupling Scheduler, while three ReduceTasks establish HTTP connections simultaneously under Fair Scheduler. Furthermore, these intermediate data need to be sorted and merged after arriving at each ReduceTask. The merged files are processed in multiple iterations as new map outputs keep being generated, causing many paging activities. If these ReduceTasks happen to be located on the same node, then the assignment in Part (b) of Fig. 7 can alleviate disk I/O and CPU contention.

[Fig. 7. Mitigating resource contention: ReduceTask fetch activity under Fair Scheduler (a) and Coupling Scheduler (b).]

To evaluate the resource contention, we measure the CPU and disk I/O utilization on the 8 slave nodes in this experiment, as shown in Table I. On average, %iowait (the percentage of time that the CPU or CPUs were idle while the system had an outstanding disk I/O request) and await (the average time in milliseconds for I/O requests issued to the device to be served, including the time spent by the requests in queue and the time spent servicing them) decrease by 40.0% and 13.6%, respectively. Interestingly, svctm (the average service time in milliseconds for I/O requests issued to the device) also decreases by 13.2%.

TABLE I
CPU AND DISK I/O

                 Fair Scheduler                  Coupling Scheduler
Node ID   %iowait  %idle  await   svctm    %iowait  %idle  await   svctm
1         0.81     21.06  111.30  4.61     0.55     13.59  103.33  4.03
2         0.80     15.45  109.14  4.64     0.48     11.67  102.77  4.21
3         1.15     15.30  45.84   3.51     0.67     9.51   43.23   2.47
4         1.01     24.88  119.16  4.54     0.58     16.60  96.24   3.85
5         0.99     16.54  64.03   4.06     0.49     9.95   49.01   3.14
6         1.02     21.41  148.94  4.80     0.56     13.49  109.88  4.19
7         1.07     22.71  82.30   4.10     0.58     15.10  87.86   3.65
8         0.72     16.20  92.27   4.32     0.66     10.98  75.42   4.45
Average   0.95     19.19  96.62   4.32     0.57     12.61  83.47   3.75

These 10 submitted jobs have 4270 MapTasks in total. We plot the cumulative distribution of the processing times of these MapTasks in Fig. 8, which shows that the map task processing time under Coupling Scheduler is statistically smaller than that under Fair Scheduler. On average, it decreases by 10.1%.

[Fig. 8. Empirical map task processing time distribution: cumulative distribution of map task processing times (seconds) under Fair and Coupling Schedulers.]

2) Wait Scheduling for ReduceTasks: Wait Scheduling places ReduceTasks close to the "centrality" of the already generated intermediate data stored on the slave nodes, the topology of which is assumed to be a tree in Hadoop. We introduced this notion in [21]; details are provided in this paper. For related work on ReduceTask data locality, see [14], [22], [23]. Since ReduceTasks are launched gradually under Coupling Scheduler, those launched later know more about the distribution of the intermediate data.

Now we define the "centrality" of intermediate data. A given network G = (V, E) comprises a set V of nodes together with a set E of edges. Denote by h(u, v) the hop distance between nodes u and v, and by w_i(u) the size of the data stored on node u for job i. Furthermore, denote by V_d(i) the set of nodes that store the intermediate data for job i, and by V_r the set of nodes that have available reduce slots. When transferring the stored data on node u to node v, the cost is proportional to w_i(u)h(u, v). Thus, for a job i, we define the total cost of transferring its intermediate data to node v ∈ V_r by

C_i(v, V_d(i)) = Σ_{u ∈ V_d(i)} w_i(u) h(v, u).    (1)

We compute C_i(v, V_d(i)) for every v ∈ V_r, and only keep the smallest D (e.g., D = 7) values sorted in increasing order; the number of values in this list may be smaller than D if |V_r| < D. The corresponding nodes, sorted in the same order, form a list L(i). Accordingly, the "centrality" is defined to be the first node v* in L(i), which minimizes C_i(v, V_d(i)) over all v ∈ V_r. Ties are broken in a random order.

Even when a node has an available reduce slot, Hadoop may still reject a reduce request, since a resource estimator determines whether a reduce request can be granted on a given node. Therefore, the "centrality" node may not be able to run the requested ReduceTask. To this end, we split the list L(i) into three groups: L1(i) only contains the "centrality" for job i; L2(i) contains the second and third nodes of L(i); L3(i) contains the remaining nodes of L(i). Note that each of these groups may be empty if |V_r| < D.
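As an illustration of the centrality computation and the split of L(i) into L1, L2 and L3, here is a minimal Python sketch; the helper names (data_size, hop_distance) and the callback-style inputs are our own assumptions, not the paper's implementation.

import random

def centrality_groups(reduce_nodes, data_nodes, data_size, hop_distance, D=7):
    """Rank nodes with free reduce slots by the transfer cost of Eq. (1)
    and split the best D into groups L1 (the "centrality"), L2, L3.

    reduce_nodes: nodes with available reduce slots (V_r)
    data_nodes:   nodes holding intermediate data for the job (V_d(i))
    data_size:    data_size[u] = bytes of intermediate data on node u (w_i(u))
    hop_distance: hop_distance(u, v) = tree hop distance h(u, v)
    """
    def cost(v):
        # C_i(v, V_d(i)) = sum over u of w_i(u) * h(v, u)
        return sum(data_size[u] * hop_distance(v, u) for u in data_nodes)

    # Shuffling first gives random tie-breaking, since Python's sort is stable.
    candidates = list(reduce_nodes)
    random.shuffle(candidates)
    ranked = sorted(candidates, key=cost)[:D]   # keep the D cheapest nodes

    L1 = ranked[:1]    # the "centrality" node
    L2 = ranked[1:3]   # second and third best
    L3 = ranked[3:]    # the rest (may be empty if |V_r| < D)
    return L1, L2, L3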

In each round of communications between the master node and the N slave nodes, the master node sequentially processes the N received heartbeats, which arrive in a random order. When it is ready to launch a ReduceTask (mismatch > 0), the slave node that just sent the heartbeat is not necessarily close to the data "centrality" of the job under consideration. To resolve this issue, we propose Wait Scheduling, which allows ReduceTasks to check up to 3 × N heartbeats (three rounds). Within the first N received heartbeats, it only launches the ReduceTask on a node in L1(i). If that fails, it considers L2(i) for the next N heartbeats; if that fails again, L3(i) is considered. After the three rounds, it assigns the ReduceTask randomly to a node. We avoid placing multiple ReduceTasks of a job on the same node if possible, following the suggestion in [9]. When receiving a heartbeat, we find the job with the largest positive mismatch value (m◦) as a candidate for launching a ReduceTask. If the candidate cannot launch a ReduceTask close to its current data "centrality", it skips the current heartbeat. We introduce a counter (wait) to denote the number of skipped heartbeats. The details are shown in Algorithm 2.

Algorithm 2 Wait Scheduling

1: Static values: candidate = NULL, wait = 0, m◦ = 0
2: Input: Receive a heartbeat from node v; the cluster has N nodes.
3: Output: Compute the task assignment of ReduceTasks on node v.
4: if candidate == NULL then
5:   Loop over all jobs and find the one (J) with the largest mismatch (m◦), breaking ties by randomly selecting one.
6:   if m◦ > 0 then
7:     candidate = J
8:   end if
9: else
10:   wait = wait + 1
11:   for i = 1 to 3 do
12:     if (i − 1) × N < wait ≤ i × N and v ∈ Li(J) and node v has no running ReduceTasks of J then
13:       if successfully launch a ReduceTask of job J on node v then
14:         candidate = NULL, wait = 0
15:       else
16:         wait = i × N + 1
17:       end if
18:     end if
19:   end for
20:   if 3N < wait ≤ 4N then
21:     if successfully launch a ReduceTask of job J on node v then
22:       candidate = NULL, wait = 0
23:     end if
24:   end if
25:   if wait > 4N then
26:     candidate = NULL, wait = 0
27:   end if
28: end if
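A condensed Python rendering of Algorithm 2 may help; the per-job group lists L1(J)–L3(J) are exposed here as J.L[1..3], and the try_launch callback stands in for Hadoop's slot-granting logic. Both are assumed interfaces, and tie-breaking among equal mismatch values is omitted.

class WaitScheduler:
    """Sketch of Algorithm 2; state persists across heartbeats on the master."""
    def __init__(self, num_nodes, try_launch):
        self.N = num_nodes      # number of slave nodes
        self.try_launch = try_launch  # try_launch(job, node) -> bool (assumed hook)
        self.candidate = None   # job J with the largest positive mismatch
        self.wait = 0           # number of skipped heartbeats

    def on_heartbeat(self, node, jobs):
        if self.candidate is None:
            best = max(jobs, key=lambda j: j.mismatch, default=None)
            if best is not None and best.mismatch > 0:
                self.candidate = best
            return
        J = self.candidate
        self.wait += 1
        # Rounds 1-3: accept only nodes in L1(J), L2(J), L3(J) in turn.
        for i in (1, 2, 3):
            if (i - 1) * self.N < self.wait <= i * self.N:
                if node in J.L[i] and not J.has_running_reduce_on(node):
                    if self.try_launch(J, node):
                        self.candidate, self.wait = None, 0
                    else:
                        self.wait = i * self.N + 1  # move on to the next group
                return
        # Round 4: place the ReduceTask on any node that will grant a slot.
        if 3 * self.N < self.wait <= 4 * self.N:
            if self.try_launch(J, node):
                self.candidate, self.wait = None, 0
        elif self.wait > 4 * self.N:
            self.candidate, self.wait = None, 0  # give up and start over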

B. Details on scheduling MapTasks

The Hadoop file system [24] stores the input data as chunks distributed across the disks of the whole cluster. The input data of a map task reside on one of the slave nodes, which can differ from the node where the map task is launched. Running a MapTask with local data can greatly improve performance [10]–[12].

Fair Scheduler utilizes Delay Scheduling to improve data locality [12]. Roughly speaking, the job tracker maintains, for every job, a timer that keeps increasing from 0 towards T1 and T2. These two values T1 and T2 account for node-level and rack-level data locality, respectively [3]. If a job has local data on a node that sends out a heartbeat, the job immediately launches MapTasks on that node and resets the timer. If a job has no local data and the timer is less than T1, it can wait for a node with local data before the timer expires. In the same manner, T2 is used for rack-level data. If the timer expires, MapTasks can be launched even without local data.

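For reference, a minimal Python sketch of the Delay Scheduling rule just described, under our simplifying assumption that T1 and T2 act as two absolute thresholds on the same per-job timer; the job attributes (timer_reset_time, has_local_data, has_rack_data) are hypothetical names for this illustration.

def delay_schedule(job, node, now, T1, T2):
    """Delay Scheduling decision for one (job, heartbeat-node) pair.

    The job's timer keeps growing until a node-local launch resets it;
    T1 gates waiting for node-local data and T2 for rack-local data.
    """
    waited = now - job.timer_reset_time
    if job.has_local_data(node):
        job.timer_reset_time = now   # launch locally and reset the timer
        return "launch node-local"
    if waited < T1:
        return "skip"                # keep waiting for a node with local data
    if job.has_rack_data(node):
        return "launch rack-local"
    if waited < T2:
        return "skip"                # keep waiting for a rack-local node
    return "launch remote"           # both timers expired: run anywhere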

Interestingly, we find that MapTasks are sensitive to the introduced delay. For illustration, assume T = T1 = T2, e.g., T = 15 seconds in [12]. The waiting time T determines the rate of launching MapTasks. Suppose the MapTasks of the considered job last on average K seconds each. When the number S of nodes running local MapTasks exceeds a threshold, i.e., S/K > 1/T, the MapTask assignment rate of a job becomes less than the rate at which its MapTasks complete on the cluster. This can cause under-utilization and instability: the number of MapTasks running simultaneously can be far below the desired level and change with large variations. We illustrate this problem in Fig. 9.

[Fig. 9. When MapTasks running with local data always release at least one map slot before their timers expire, pending MapTasks cannot be launched on the available slots without local data.]

This phenomenon can be explained by the following analysis, which assumes i.i.d. exponential random variables for the MapTask processing times. Let µ be the MapTask completion rate and s the number of map slots with local data in Fig. 9. Since a job can wait T seconds before launching a remote MapTask, the timer resets to zero if one of these s MapTasks running on local data releases a map slot with local data within T seconds. Using the properties of exponential distributions, we know that, for a job with N MapTasks, the probability that all of the MapTasks will be launched on slots with local data is

p_s = (1 − e^{−sµT})^N.    (2)

We demonstrate this problem using an experiment (WordCount with 427 MapTasks on 15 nodes), shown in Fig. 10. We intentionally store all the input data on 7 nodes, with the other 8 nodes having no input data at all.

[Fig. 10. WordCount runs on 15 slave nodes with Fair Scheduler. Each node has 4 map slots (60 slots in total). Only 7 nodes have local data and 8 nodes do not store any input data. Most of the time, delay scheduling only allows 4 × 7 MapTasks to run simultaneously (under-utilization); there is a sudden big jump around time 8.2 minutes (instability).]

The measurement result in Fig. 10 shows µ = 1/48.5. Since s = 4 × 7 and T = 15.0 in this experiment, a simple calculation using Equation (2) yields p_s = 92.9%. This partially explains why Delay Scheduling can only launch 28 concurrent MapTasks most of the time. Therefore, improving data locality should be balanced against resource utilization; when computing resources are abundant, even launching a good number of remote MapTasks is beneficial.
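As a quick sanity check on Equation (2), the following snippet reproduces the 92.9% figure from the measured parameters; it is plain arithmetic, not part of the scheduler.

import math

mu = 1 / 48.5   # measured MapTask completion rate (1/seconds)
s = 4 * 7       # map slots with local data: 7 nodes x 4 slots each
T = 15.0        # Delay Scheduling wait time (seconds)
N = 427         # MapTasks of the WordCount job

p_s = (1 - math.exp(-s * mu * T)) ** N
print(f"p_s = {p_s:.1%}")   # -> p_s = 92.9%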

Since MapTasks are sensitive to delays, we propose Random Peeking Scheduling, which avoids delays, as shown in Algorithm 3. We start with an adopted algorithm that is already available to compute a list of MapTask candidates to be launched when ignoring data locality constraints, e.g., the slot allocation algorithm of Fair Scheduler [12] or FLEX [5].

Algorithm 3 Random Peeking Scheduling

1: Input: A list L of MapTask candidates computed when ignoring data locality constraints.
2: Output: A subset L◦ of L to be launched; initially L◦ = ∅.
3: for all j ∈ L do
4:   Compute pj using (3).
5:   if rand < pj then
6:     Comment: rand is uniformly random on [0, 1]
7:     L◦ = L◦ ∪ {j}
8:   end if
9: end for

Based on the list of MapTask candidates, we associate with each candidate of job j a launching probability p_j and then randomly launch each MapTask candidate of job j with probability p_j. When the slave node that just sent the heartbeat has local inputs for job j, p_j is set to 1; otherwise, p_j is set smaller than 1. In the latter case the scheduler launches a remote MapTask randomly, depending on 1) the fraction p̂_m^j of nodes that have local input data, 2) the number N̂_m of nodes that have available map slots, and 3) the number M_p^j of pending MapTasks of job j. Choosing a good p_j is based on the intuition that 1) if many other nodes have local data and available slots, then the scheduler should skip the current heartbeat with high probability; and 2) if job j has a large number of pending MapTasks compared with N̂_m, it is beneficial to launch remote MapTasks immediately with a certain probability.

On a heartbeat, the master node randomly selects K nodes and checks how many (say, K_1^j) have local map inputs for a given job j and how many (say, K_2) have available map slots. The random sampling reduces the time complexity for a large cluster with N nodes; a medium/small cluster can set K = N (our testbed has 62 nodes, and the scheduler checks every node). The master then estimates the fraction of nodes that have local map inputs for job j as p̂_m^j = K_1^j/K, and the total number of nodes with available map slots as N̂_m = N × K_2/K. Note that p̂_m^j tends to decrease as job j proceeds further. Our implementation sets

p_j = 1, if job j has local data on the given node;
p_j = 1 − α_j (p̂_m^j)^{β_j} (1 − e^{−N̂_m}), otherwise,    (3)

where β_j = 0.1 + 0.9(1 − exp(−M_p^j / max{N̂_m, 1})). We set α_j to 0.7 if job j has rack-local data and running ReduceTasks on the node, 0.8 if job j has rack-local data but no running ReduceTasks on the node, and 1.0 otherwise. This choice reflects our preference to place MapTasks close to the input data as well as to running ReduceTasks of the same job.
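The following Python sketch assembles Equation (3) from the sampled quantities; the predicate inputs for local data and free slots are hypothetical stand-ins for what the JobTracker actually knows, and this is only a sketch of the description above.

import math

def launch_probability(pending_maps, sampled_nodes, has_local_data,
                       has_free_slot, cluster_size, alpha):
    """Launching probability p_j of Eq. (3) for a non-local heartbeat.

    sampled_nodes: K randomly chosen nodes used for the estimates
    has_local_data(n): True if node n stores input data of job j
    has_free_slot(n):  True if node n has an available map slot
    alpha: 0.7 / 0.8 / 1.0 depending on rack data and running reducers
    """
    K = len(sampled_nodes)
    K1 = sum(1 for n in sampled_nodes if has_local_data(n))   # K_1^j
    K2 = sum(1 for n in sampled_nodes if has_free_slot(n))    # K_2
    p_m = K1 / K                   # estimated fraction with local inputs
    N_m = cluster_size * K2 / K    # estimated nodes with free map slots
    beta = 0.1 + 0.9 * (1.0 - math.exp(-pending_maps / max(N_m, 1.0)))
    return 1.0 - alpha * (p_m ** beta) * (1.0 - math.exp(-N_m))

# Algorithm 3 then launches a remote MapTask when rand < p_j, e.g.:
#   if random.random() < p_j: launch(job, node)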

The scheduler skips the current heartbeat for the considered job with probability 1 − p_j, expecting better data locality on the next heartbeat. This forms a Bernoulli sequence that takes 1/p_j heartbeats on average. Consider some extreme cases. When N̂_m is large, p_j is close to 1 − α_j (p̂_m^j)^{β_j}. When N̂_m = 0, p_j equals 1: in this case, even when there is no local data on the node, it is beneficial to launch a remote MapTask immediately, since there are not enough map slots on other nodes.

Repeating the WordCount experiment of Fig. 10, setting β_j ≡ β for all j and fixing β = 0.1 and β = 0.5 results in the MapTask assignments shown in Fig. 11, which decrease the job processing time by 18.3% and 38.7%, respectively.

[Fig. 11. Peeking algorithm for MapTasks: running maps over time with β = 0.5 and β = 0.1.]

Unlike Fig. 10, the assignment does not exhibit a sudden big jump, and the utilization of map slots increases. In Section III we compare the number of MapTasks with local input data when running 10 jobs under Fair Scheduler and Coupling Scheduler. Interestingly, this value for Fair Scheduler lies between the values obtained when fixing β = 0.1 and β = 0.5 for Coupling Scheduler in this experiment, as shown in Fig. 15. That is the reason we let β_j depend on the number of pending MapTasks M_p^j of job j in (3).

III. EXPERIMENTAL RESULTS

We evaluate Coupling Scheduler through extensive experiments. Our testbed has one master node and 62 slave nodes; a survey has shown that the average cluster size is 66 nodes [25]. Each node has 4 cores (2933 MHz, 32 KB cache), 6 GB of memory and 72 GB of disk. For comparison, both the Coupling and Fair Schedulers are configured identically, with 4 map slots and 2 reduce slots per node. In addition, speculative tasks are disabled, since failures seldom occur on our small/medium-size testbed and enabling speculative tasks actually increases the job processing times due to wasted slots. We design three sets of experiments by varying the size of the cluster and changing the locations of input data.

A: Set A contains 7 slave nodes. The following files are stored on these nodes with the replication value set to 1. We download the Wikipedia Free Encyclopedia (28G, named WikiInput). In addition, we generate RandomInput05, RandomInput10, RandomInput15, RandomInput30 and RandomInput of size 0.5G, 1.0G, 1.5G, 3.0G and 10.0G bytes, respectively, where each line of these five files contains only a randomly generated 10-digit integer. We build three files (named RandomPair1, RandomPair2 and RandomPair3, of size 15.0G, 29.0G and 4.3G bytes, respectively) using RandomWriter.

B: Set B contains 15 slave nodes. We add 8 nodes that do not store any input data to Set A, for testing the effectiveness of data locality.

C: Set C contains 62 slave nodes. We copy every file in Set A and store them across the whole cluster.

A. No harm to a single job's response time

Different from Fair Scheduler, which launches ReduceTasks greedily, Coupling Scheduler introduces delays and starts them step by step. Therefore, we first investigate whether these delays impact the job response time when running a single job in the cluster. We test both CPU-intensive and I/O-intensive jobs, as shown in Table II, repeating each 15 times on Set A to average out random factors. For example, the processing times of Sort2 vary considerably, ranging from 16.15 to 23.5 minutes under Fair Scheduler. Therefore, we plot the mean, minimum and maximum of the processing times over the 15 trials in Fig. 12 for each job.

TABLE II
JOB PROFILE

Job                        Map (#)  Reduce (#)
Grep [1-5]* randomInput    148      6
RandomWriter               70       -
QuasiMonteCarlo            50       1
Sort randomPair1 (Sort1)   224      12
Sort randomPair2 (Sort2)   352      27

Interestingly, not only do the introduced delays not slow down job processing on average (Sort2 runs slightly slower), they even expedite many of the jobs (e.g., Grep, MonteCarlo) for the reasons explained in Section II-A.

[Fig. 12. Mean and range of single-job processing times under Fair and Coupling Schedulers.]

In this figure, Sort1 decreases by 14.6% when running with Coupling Scheduler. This is because the job has 12 ReduceTasks while the cluster has 14 reduce slots, so the placement of ReduceTasks differentiates the performance. Sort2 has 27 ReduceTasks, and the placement of ReduceTasks has much less impact on its total processing time.

B. Better resource utilization for scheduling MapTasks

To test Random Peeking Scheduling (Algorithm 3), we submit 10 jobs without any reduce tasks on Set B. Trace studies on a production MapReduce cluster show that 77% of jobs are map-only [16]. The details of the 10 jobs are described in Table III; the third column lists the job submission times in seconds. We plot the total processing time of each job in Part (a) of Fig. 13 under the Coupling and Fair Schedulers.

TABLE III
GREP JOBS WITHOUT REDUCETASKS

JobID  Job                          Time (s)  Map (#)
01     Grep [a-f][a-z]* wikiInput   0         427
02     Grep [1-5]* randomInput10    120       15
03     Grep [1-4]* randomInput15    240       23
04     Grep [1-3]* randomInput30    360       45
05     Grep [a-f][a-z]* wikiInput   480       427
06     Grep [1-2]* randomInput10    600       15
07     Grep [1-4]* randomInput15    720       23
08     Grep [1-3]* randomInput20    840       30
09     Grep [a-f][a-z]* wikiInput   960       427
10     Grep [1-2]* randomInput30    1080      45

On average, Coupling Scheduler decreases the processing time by 24% compared to Fair Scheduler in this example.

[Fig. 13. Jobs with only MapTasks: (a) job processing times under Coupling and Fair Schedulers; (b) map task assignment over time under each scheduler.]

[Fig. 14. More efficient resource utilization with Coupling Scheduler: maximum network bytes (in/out) over all nodes under Fair Scheduler and Coupling Scheduler.]

The performance gain is due to the improvements to the under-utilization and instability problems of Fair Scheduler illustrated in Fig. 10. Since 8 out of the 15 nodes in Set B do not store any input data, the number of running MapTasks with Fair Scheduler can fall below the desired value with large variations over time, as shown in Part (b) of Fig. 13. For example, the number of running MapTasks of jobs 1, 5 and 9 varies between 20 and 60, which increases the total processing times. Random Peeking Scheduling can launch more non-local MapTasks when there are still many pending MapTasks, which decreases the job processing time at the expense of more network traffic; see Fig. 14. We plot the maximum number of bytes transmitted through the network on any node of the cluster at each time. This metric is more meaningful than the total network traffic in characterizing network utilization: if the network becomes an issue, the bottleneck is likely on the node with the largest network activity. To further illustrate this point, we compare the number of MapTasks that have local input data for each job when running Fair Scheduler, and Coupling Scheduler with β = 0.1 and β = 0.5, as shown in Fig. 15. Interestingly, the data locality performance of Fair Scheduler lies between that of Coupling Scheduler with β = 0.1 and β = 0.5 in this experiment.

[Fig. 15. The number of MapTasks with local data for each job under Fair Scheduler and Coupling Scheduler (β = 0.5 and β = 0.1).]

C. Multiplexing typical job flows

Trace studies show that in a production MapReduce cluster most jobs are map-intensive [16] and typical job sequences consist of large jobs followed by small ones [18]. Therefore, we submit a sequence of 22 jobs with both map-intensive jobs (e.g., Grep and QuasiMonteCarlo) and copy/shuffle-intensive jobs (e.g., Sort), exhibiting the observed statistical characteristics. The details are presented in Table IV: the third column (Time) lists the job arrival times in seconds, and the fourth (M) and fifth (R) columns contain the number of map and reduce tasks of each job. Fig. 3 in Section I-C plots the starting time point as well as the minimum, mean and maximum processing times of each job.

TABLE IV
JOB SEQUENCE

JobID  Job                         Time  M    R   SF   SC
01     Grep [1-5]* randomInput     0     148  15  0.0  0.0
02     Grep [5-9]* randomInput     30    148  15  0.0  0.0
03     QuasiMonteCarlo             150   5    1   6.1  0.0
04     WordCount randomInput05     170   8    1   4.0  0.0
05     Grep [2-6]* randomInput05   190   8    2   3.9  0.0
06     Grep [3-6]* randomInput05   210   8    2   2.8  0.0
07     Grep [4-6]* randomInput05   230   8    3   2.8  0.0
08     Grep [a-h][a-z]* wikiInput  470   427  15  0.0  0.2
09     Grep [a-g][a-z]* wikiInput  500   427  15  0.0  0.0
10     Sort randomPair1            800   224  27  4.9  4.5
11     Grep [1-2]* randomInput10   860   15   5   8.3  0.9
12     Grep [1-5]* randomInput05   880   8    3   8.6  0.6
13     Grep [6-9]* randomInput05   900   8    2   8.4  0.2
14     Sort randomPair3            1020  64   27  5.3  3.5
15     Grep [3-8]* randomInput20   1140  30   2   1.0  0.0
16     WordCount randomInput10     1440  15   1   0.1  0.0
17     Sort randomPair2            1710  352  27  0.3  5.0
18     QuasiMonteCarlo             2110  15   1   0.3  0.0
19     Grep [1-5]* randomInput05   2125  8    3   0.0  0.0
20     Sort randomPair3            2245  64   27  0.5  1.1
21     RandomWriter                2365  150  -   0.0  0.0
22     QuasiMonteCarlo             2485  10   1   0.0  0.0

To quantify the performance improvement from alleviating the starvation problem, we define a starvation time for each job: SC under Coupling Scheduler and SF under Fair Scheduler. The starvation time of a job is defined as the average of the durations from the completion time of its last MapTask to the starting times of each of its reduce tasks launched after the last MapTask completes. Table IV contains the values of SC and SF in minutes, which clearly show that Coupling Scheduler, compared to Fair Scheduler, greatly expedites the processing of most jobs, e.g., jobs 3, 4, 5, 6, 7, 11, 12 and 13 (job 3 runs 12.8 times faster), due to the largely reduced starvation times.
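In symbols (our notation, not the paper's): let t_map denote the completion time of a job's last MapTask, t_r the launch time of its ReduceTask r, and R = {r : t_r > t_map} the set of reduce tasks launched after the last MapTask completes. Then the starvation time is

S = (1/|R|) Σ_{r ∈ R} (t_r − t_map),

evaluated separately under each scheduler to give SC and SF.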

D. Experiments on 62 nodes

The last experiment is conducted on Set C with 62 nodes. We submit 200 jobs sequentially in 10 batches, where each batch contains 20 jobs with characteristics similar to the workflow in Table IV. These 200 jobs consist of Grep, WordCount, QuasiMonteCarlo and Sort jobs, with percentages 73.8%, 8.9%, 6.8% and 10.5%, respectively. This is in line with the traces collected on a production MapReduce cluster [16]. The job arrival intervals follow exponential distributions, with the average interval within each batch equal to 25 seconds and the average interval between the arrival times of the first jobs of two consecutive batches equal to 20 minutes.

[Fig. 16. Empirical job completion time distribution: cumulative distribution of job processing times (minutes) under Fair and Coupling Schedulers.]

Fig. 16 plots the cumulative distribution of the job processing time T under Fair Scheduler and Coupling Scheduler, respectively. It shows that P[T < t] under Coupling Scheduler is larger than under Fair Scheduler for every t. This implies that Coupling Scheduler leads to smaller response times in the average as well as in a stronger, probabilistic sense: roughly speaking, it has a higher probability of producing a shorter job processing time than Fair Scheduler in this experiment. On average, Coupling Scheduler decreases the job processing time by 21.3% compared with Fair Scheduler.

IV. CONCLUSION

The current practice with Hadoop schedulers still exhibits room for improvement: map and reduce tasks are scheduled separately without joint optimization, which can cause starvation and unfavorable data locality. To this end, we design a resource-aware scheduler for Hadoop that couples the progress of map and reduce tasks to alleviate starvation and jointly optimizes the data locality for both. Extensive experiments with this scheduler demonstrate significant improvements in job response times.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107–113, January 2008.
[2] Hadoop, http://hadoop.apache.org.
[3] Fair Scheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html.
[4] Capacity Scheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/capacity_scheduler.html.
[5] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin, "FLEX: A slot allocation scheduling optimizer for MapReduce workloads," in Middleware 2010, ser. Lecture Notes in Computer Science, vol. 6452. Springer Berlin/Heidelberg, 2010, pp. 1–20.
[6] J. Polo, C. Castillo, D. Carrera, Y. Becerra, I. Whalley, M. Steinder, J. Torres, and E. Ayguadé, "Resource-aware adaptive scheduling for MapReduce clusters," in Middleware, ser. Lecture Notes in Computer Science, vol. 7049. Springer, 2011, pp. 187–207.
[7] T. Sandholm and K. Lai, "Dynamic proportional share scheduling in Hadoop," in Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing (JSSPP'10). Berlin, Heidelberg: Springer-Verlag, 2010, pp. 110–131.
[8] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08), 2008, pp. 29–42.
[9] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI'10), 2010, pp. 1–16.
[10] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: fair scheduling for distributed computing clusters," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). ACM, 2009, pp. 261–276.
[11] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), 2011, pp. 22–22.
[12] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, April 2009.
[13] B. Palanisamy, A. Singh, L. Liu, and B. Jain, "Purlieus: locality-aware resource allocation for MapReduce in a cloud," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011, pp. 58:1–58:11.
[14] M. Hammoud and M. Sakr, "Locality-aware reduce task scheduling for MapReduce," in Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 570–576.
[15] Y. Wang, J. Tan, W. Yu, L. Zhang, X. Meng, and X. Li, "Preemptive ReduceTask scheduling for fair and fast job completion," 2013, under review.
[16] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production MapReduce cluster," in Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10), Washington, DC, USA, 2010, pp. 94–103.
[17] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys '07), 2007.
[18] Y. Chen, A. Ganapathi, R. Griffith, and R. H. Katz, "The case for evaluating MapReduce performance using workload suites," in Proceedings of the 19th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), July 2011.
[19] J. Tan, X. Meng, and L. Zhang, "Performance analysis of Coupling Scheduler for MapReduce/Hadoop," in Proceedings of the 31st IEEE International Conference on Computer Communications (INFOCOM'12), March 2012, pp. 2586–2590, mini conference.
[20] J. Tan, X. Meng, and L. Zhang, "Delay tails in MapReduce scheduling," in Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12). New York, NY, USA: ACM, 2012, pp. 5–16.
[21] J. Tan, X. Meng, and L. Zhang, "Coupling Scheduler for MapReduce/Hadoop," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC'12), 2012, pp. 129–130, poster.
[22] M. Hammoud, M. S. Rehman, and M. F. Sakr, "Center-of-gravity reduce task scheduling to lower MapReduce network traffic," in Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (IEEE CLOUD). IEEE, June 2012.
[23] J. Tan, X. Meng, and L. Zhang, "Improving ReduceTask data locality for sequential MapReduce jobs," in Proceedings of the 32nd IEEE International Conference on Computer Communications (INFOCOM'13), April 2013.
[24] Hadoop Distributed File System, http://hadoop.apache.org/hdfs/.
[25] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: Automatic Resource Inference and Allocation for MapReduce environments," in Proc. of the International Conference on Autonomic Computing (ICAC), 2011.