
Coupling Task Progress for MapReduce Resource-Aware Scheduling

Jian Tan, Xiaoqiao Meng, Li Zhang IBM T. J. Watson Research Center Yorktown Heights, New York, 10598 Email: {tanji, xmeng, zhangli}@us.ibm.com

Abstract—Schedulers are critical in enhancing the performance of MapReduce/Hadoop in the presence of multiple jobs with different characteristics and performance goals. Though current schedulers for Hadoop are quite successful, they still have room for improvement: map tasks (MapTasks) and reduce tasks (ReduceTasks) are not jointly optimized, although there is a strong dependence between them. This can cause job starvation and unfavorable data locality. In this paper, we design and implement a resource-aware scheduler for Hadoop. It couples the progress of MapTasks and ReduceTasks, utilizing Wait Scheduling for ReduceTasks and Random Peeking Scheduling for MapTasks to jointly optimize task placement. This mitigates the starvation problem and improves the overall data locality. Our extensive experiments demonstrate significant improvements in job response times.

I. INTRODUCTION

MapReduce [1] has emerged as a popular paradigm for processing large datasets in parallel over a cluster. As an open source implementation, Hadoop [2] has been successfully used in a variety of applications, such as social network mining, log processing, video and image analysis, search indexing, and recommendation systems. In many scenarios, long batch jobs and short interactive queries are submitted to the same MapReduce cluster, sharing limited common computing resources with different performance goals. To meet these imposed challenges, an efficient scheduler is critical to providing the desired quality of service for the MapReduce cluster. In the domain of MapReduce scheduling, Fair Scheduler [3] is the most widely used in practice. Other commonly used schedulers include the default FIFO Scheduler, Capacity Scheduler [4], and variations [5]–[7]. To improve the performance of large-scale MapReduce clusters, more complicated resource management schemes have also been proposed [8]–[11].

While focusing on Fair Scheduler, as it is the de facto standard in the Hadoop community, we observe that it, as well as many other schedulers, still exhibits room for improvement.

1) Map and reduce tasks are scheduled separately [5] without joint optimization. First, Fair Scheduler only guarantees the fairness of MapTasks and is not really fair for ReduceTasks. We observe that allocating excess resources to ReduceTasks without coordinating with the map progress leads to cluster-wide resource under-utilization, as evidenced by the starvation problem [12]. Second, most MapReduce schedulers only consider data locality for MapTasks and ignore that it is also an issue for ReduceTasks. Though it has recently been addressed in [13], [14], the adopted approaches are sensitive to future run-time information (e.g., map output distribution and competition among new jobs) that is difficult to predict.

2) Fair Scheduler uses Delay Scheduling [12], which allows MapTasks to wait for a period of time to find local data. It usually improves the data locality for MapTasks. However, we observe that the introduced delays may lead to under-utilization and instability, i.e., the number of MapTasks running simultaneously is far below a desired level and changes with large variations over time.

In view of these observations, we propose a resource-aware scheduler, termed Coupling Scheduler. It couples the progress of map and reduce tasks to mitigate starvation, and jointly optimizes the placements of both to improve the overall data locality. Specifically, we utilize Wait Scheduling for ReduceTasks and Random Peeking Scheduling for MapTasks, taking into consideration the interactions between them, to holistically optimize the data locality. Our extensive experiments demonstrate significant improvements in job processing times.

A. Scheduling ReduceTasks

While MapTasks are small tasks that can run independently in parallel, ReduceTasks are long-running tasks that contain copy/shuffle and reduce phases. In most existing schedulers, ReduceTasks are not preemptive, i.e., a ReduceTask does not release its occupied slot until its reduce phase completes. It is feasible to introduce ReduceTask preemption in an engineering realization [15]; however, this work adheres to the current non-preemption assumption. Under Fair Scheduler, once a certain percentage of a job's MapTasks finish, its ReduceTasks are launched greedily up to a maximum. This method overlaps the copy/shuffle phase with the map phase of a job and can greatly reduce job processing times. However, this approach can starve newly arrived jobs [12], and the problem is even more pronounced when many small jobs arrive after large ones [16]. The experiment in Fig. 1 further illustrates the problem.

[Fig. 1. Starvation problem with Fair Scheduler: the number of running map and reduce tasks over time for Job 1 and Job 2.]

Fig. 1 shows the number of map and reduce tasks running simultaneously at every time point for two Grep jobs. Job 1 grabs all the reduce slots at time 0.9 minutes, just before job 2 is submitted at time 1.0 minutes. Thus, when job 2 finishes its MapTasks at time 3.8 minutes, it cannot launch

its ReduceTasks until job 1 releases some reduce slots at time 11.3 minutes. This shows that Fair Scheduler is not really fair. An ultimate solution to this problem is to design preemptive ReduceTasks [15]. For non-preemptive ReduceTasks, a better approach is to avoid or alleviate the problem, if possible, from the very beginning. Since it is difficult to predict future job arrivals, we introduce a simple but effective approach: launch ReduceTasks gradually according to the progress of the MapTasks. If the reduce progress lags behind the map progress, the job should have a better chance to launch ReduceTasks; otherwise, it should not be aggressive in acquiring more slots that could be utilized by other, more urgent jobs.

[Fig. 2. Repeating the experiment of Fig. 1 with Coupling Scheduler: the number of running map and reduce tasks over time for Job 1 and Job 2.]

Repeating the same experiment as in Fig. 1, we show in Fig. 2 that the starvation problem can be mitigated with the new approach. Interestingly, both jobs run faster with Coupling Scheduler in this experiment; more similar observations are presented in Section III-A. This is partially due to a side effect of gradually launching ReduceTasks. Specifically, each ReduceTask has a set of threads responsible for fetching map output bytes through disks or networks. Greedily launching ReduceTasks to a maximum causes these fetching activities (I/O and network intensive) to synchronously compete for the same resources. Gradually launching ReduceTasks can alleviate this sync problem. We further investigate this issue by measuring disk I/O and CPU performance in Section II-A. In addition, observe that the gray area for the number of ReduceTasks running simultaneously in Fig. 2 is almost half of that in Fig. 1. This is because ReduceTasks are not work-conserving: sometimes ReduceTasks hold slots while waiting for map outputs. Therefore, postponing the launch of ReduceTasks in these cases does not increase job processing delays; it decreases the idle periods during which a job holds reduce slots and increases the fairness to ReduceTasks.

This method enables a new way to jointly optimize data locality for both MapTasks and ReduceTasks, which is absent in existing approaches [6], [9], [10], [12], [14]. MapTasks take the input data and ReduceTasks fetch the intermediate data. There should exist an optimization that places 1) MapTasks close to the input data and the running ReduceTasks, and 2) ReduceTasks close to a "centrality" of the already generated intermediate data stored in the Hadoop cluster. The placements of MapTasks determine where the intermediate data are generated and stored. In turn, the placements of ReduceTasks affect the cost of fetching the intermediate data from the MapTasks. Nevertheless, it is difficult to predict future information, e.g., intermediate data sizes and competition among future jobs. The greedy approach to launching ReduceTasks taken by existing schedulers can make wrong decisions at the beginning, and for long-running ReduceTasks an early mistake takes much more effort to correct. To detect such mistakes early, Mantri [9] was designed for Dryad [17] and the LATE scheduler [8] for Hadoop; these designs rely on re-computations and speculative executions. Our unique way of gradually launching ReduceTasks makes it possible to form a cooperation between the map and reduce phases: initially, neither MapTasks nor ReduceTasks may know their optimal placements under the common goal of better data locality; as the job evolves, new run-time information becomes available so that MapTasks and ReduceTasks can alternately optimize, step by step, depending on the decisions taken by the other phase.

B. Scheduling MapTasks

We focus on data locality in scheduling MapTasks. A common approach is to trade fairness off against data locality, e.g., Quincy [10] and Delay Scheduling [12]. Clearly there is a balance: achieving optimal data locality can largely delay a job, which in turn breaks the fairness constraint. Quincy [10] addresses this issue by mapping some aspects to a graph structure and relying on a min-flow solver; however, as pointed out in [10], not all aspects are suitable for this approach, and fairness is a rather ambiguous concept. Therefore, in the following we assume that some adopted algorithm is already available to compute a fair assignment of MapTasks when ignoring data locality constraints, e.g., the slot allocation algorithm of Fair Scheduler [12] or FLEX [5]. Based on the computed assignments of MapTasks that only consider fairness, we make further adjustments to improve the data locality by relaxing the fairness requirements.

Fair Scheduler uses Delay Scheduling to improve data locality for MapTasks [12]. We observe that the introduced delays can affect MapTask scheduling. As a consequence, the number of MapTasks running simultaneously can be far below a desired level (under-utilization) and fluctuate significantly over time (instability) when the input data are not uniformly distributed. This uneven storage situation is not uncommon. To mitigate this problem, we propose Random Peeking Scheduling, which avoids delays for MapTasks by making decisions immediately, randomly peeking at a number of other nodes to obtain data locality information.

C. Summary of design rationale

Mitigate starvation. Coupling Scheduler enables sequential cooperation between the map and reduce phases. Gradually launching reducers according to the map task progress effectively mitigates the starvation problem in the presence of multiple jobs.

Do no harm to single jobs. Regarding the concern that not using all available reduce slots immediately may increase the processing times of single jobs, our experiments in Section III-A show that the delays introduced to reducers do no harm in most cases, due to the non-work-conserving feature of reducers. Interestingly, they even expedite job processing in many scenarios due to reduced resource contention.

Improve data locality. Wait Scheduling for reduce tasks and Random Peeking Scheduling for map tasks, which take into account the interactions between them, can greatly improve the overall data locality.

Improved performance. We demonstrate the good performance of Coupling Scheduler compared to Fair Scheduler by repeating 22 jobs 5 times. Fig. 3 compares the job processing times. Each line shows the starting time and the minimum, mean and maximum delays over the 5 trials, with the three vertical marks from left to right representing the minimum, mean and maximum, respectively. On average, the job response times decrease by 39.1%. For some jobs, e.g., jobs 3, 6, 12 and 13, the response time decreases by more than 70%. The submitted workload includes CPU- and I/O-intensive, map-heavy, copy/merge-heavy and reduce-heavy jobs [16], with short jobs after large ones [18].

[Fig. 3. Comparison of job processing times for 22 jobs under Coupling and Fair Schedulers.]

Most jobs spend less time with Coupling Scheduler in this experiment; e.g., job 3 is 12.8 times faster. We also notice that Fair Scheduler results in a larger variation in job processing times. For example, the processing time of job 3 varies between 0.8 and 12.1 minutes when we repeat this test multiple times. This instability arises because job 3 arrives around a critical time point when jobs 1 and 2 are about to take all the reduce slots under the greedy strategy. If job 3 happens to see an available reduce slot soon after its submission, it can finish much earlier.

II. DESIGN DETAILS

This section describes Wait Scheduling for ReduceTasks and Random Peeking Scheduling for MapTasks, respectively. Fig. 4 plots the schematic diagram of Coupling Scheduler: it determines the job whose reduce progress lags the most behind its map progress as a candidate for launching ReduceTasks.

[Fig. 4. Schematic diagram of Coupling Scheduler.]

A. Details on scheduling ReduceTasks

Gradually launching ReduceTasks depending on the map progress has benefits. In summary, 1) it is fairer to ReduceTasks and can relieve starvation; 2) it allows sequential control between the map and reduce phases to jointly improve data locality; 3) it alleviates the sync problem in the copy/shuffle phase by reducing disk I/O and network contention.

For comparison, we first run 10 identical Grep jobs using Fair Scheduler on 8 nodes (4 map slots and 2 reduce slots per node). We plot the number of MapTasks and ReduceTasks of each job running simultaneously at each time in Fig. 5. These jobs are submitted one after another at 5-minute intervals, each containing 427 MapTasks and 8 ReduceTasks. MapTasks are processed according to processor sharing. Reducers are processed in a more complicated manner. For example, when the first two jobs are submitted, they grab all the reduce slots at time 5.0 minutes, which prevents job 3 from taking reduce slots until job 1 releases them around time 16.0 minutes. Immediately, the released reduce slots are allocated to job 3. As a result, only when job 2 finishes around time 30.0 minutes can jobs 4, 5 and 6 grab reduce slots, though these three jobs have already joined the queue before time 25.0 minutes.

[Fig. 5. 10 Grep jobs with Fair Scheduler: running maps and running reducers over time.]

1) Coordination between map/reduce phases: Coupling Scheduler uses a mismatch value to describe the progress difference between MapTasks and ReduceTasks. For job i, it matches the fraction x_i of MapTasks that have completed against the fraction y_i of ReduceTasks that have been launched. Job i is associated with a function f_i(x) : [0, 1] → [0, 1]. Only when y_i < f_i(x_i) should job i request to launch ReduceTasks. We simply choose f_i(x) ≡ f(x) for all i. We have provided a theoretical analysis of the performance of this algorithm under a general MapReduce model in [19], [20]. In this paper, we describe an engineering realization.


We compute the mismatch value in Algorithm 1. Therein, job.desiredMaps (job.desiredReduces) denotes the total number of map (reduce) tasks of the job. The parameter δ indicates that all ReduceTasks of a job should be launched before it finishes a fraction δ of its MapTasks, if there are available slots. Line 3 provides a simple way to compute δ: assign a smaller δ to jobs with fewer ReduceTasks; for example, set threshold = 3. A better way is to also consider the map output size: less intermediate data implies a larger δ. On line 9, we raise a job's mismatch when it has no pending MapTasks but still has pending ReduceTasks, with priority given to the job with the fewest pending ReduceTasks. We choose the form 4 + 1/job.pendingReduces because line 7 normalizes the mismatch value to [0, 1/δ] with δ > 0.25 and 1/0.25 = 4; thus jobs with only pending ReduceTasks have larger mismatch values than jobs with pending MapTasks.

Algorithm 1 Mismatch between the map/reduce progresses

1: Comment: It is computed for every job on a received heartbeat.
2: if job.pendingReduces > 0 then
3:   δ = 1 − exp(−job.desiredReduces/threshold)
4:   unit = δ × job.desiredMaps/job.desiredReduces
5:   mapProgress = job.finishedMaps/unit
6:   redProgress = job.finishedReduces + job.runningReduces + 1
7:   mismatch = (mapProgress − redProgress)/job.desiredReduces
8:   if job.pendingMaps == 0 then
9:     mismatch = 4 + 1/job.pendingReduces
10:   end if
11: else
12:   mismatch = 0
13: end if
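For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1 line for line; the Job fields correspond to the job.* counters above, and the default threshold = 3 follows the example in the text. It is an illustration of the computation, not the authors' Hadoop code.

import math
from dataclasses import dataclass

@dataclass
class Job:
    desired_maps: int      # total number of MapTasks
    desired_reduces: int   # total number of ReduceTasks
    finished_maps: int
    finished_reduces: int
    running_reduces: int
    pending_maps: int
    pending_reduces: int

def mismatch(job: Job, threshold: float = 3.0) -> float:
    """Mismatch between map and reduce progress (Algorithm 1).

    Computed for every job on each received heartbeat; a larger
    positive value means the reduce phase lags further behind.
    """
    if job.pending_reduces <= 0:
        return 0.0
    # Jobs with fewer ReduceTasks get a smaller delta, so all their
    # reducers are launched earlier in the map phase (line 3).
    delta = 1.0 - math.exp(-job.desired_reduces / threshold)
    unit = delta * job.desired_maps / job.desired_reduces
    map_progress = job.finished_maps / unit
    red_progress = job.finished_reduces + job.running_reduces + 1
    m = (map_progress - red_progress) / job.desired_reduces
    if job.pending_maps == 0:
        # All maps done but reducers still pending: boost above the
        # normalized range [0, 1/delta] (1/0.25 = 4), favoring jobs
        # with the fewest pending ReduceTasks (line 9).
        m = 4.0 + 1.0 / job.pending_reduces
    return m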

Repeating the same experiment with Coupling Scheduler shows that ReduceTasks can fairly share the limited reduce slots, as in Part (b) of Fig. 6. We compare the processing time of each job under the two schedulers in Part (a) of Fig. 6. Interestingly, all of the submitted jobs have a shorter processing time with Coupling Scheduler; on average it decreases by 16.1% compared to Fair Scheduler. This workload is not typical, since all jobs are identical. For a typical one that contains small jobs after large ones [16], the improvement can be up to an order of magnitude; see Fig. 3.

[Fig. 6. 10 Grep jobs with Coupling Scheduler: (a) comparison of job processing times; (b) task assignment (running mappers and reducers) using Coupling Scheduler.]

During the copy/shuffle phase, each ReduceTask has a set of threads responsible for fetching map output bytes through disks or networks. Greedily launching ReduceTasks to a maximum puts these fetching activities (I/O and network intensive) on different ReduceTasks in sync, which slows down the running threads that share the same resources on the nodes that host the MapTasks, as shown in Part (a) of Fig. 7. MapTasks generate intermediate data that may be constantly spilled to disk, after applying a quick-sort, when the buffer overflows. Gradually launching ReduceTasks can alleviate the sync problem. For example, as demonstrated in Part (b) of Fig. 7, between t1 and t2 only one ReduceTask fetches intermediate data under Coupling Scheduler, while three ReduceTasks establish HTTP connections simultaneously under Fair Scheduler. Furthermore, these intermediate data need to be sorted and merged after arriving at each ReduceTask. The merged files are processed in multiple iterations as new map outputs keep being generated, causing many paging activities. If these ReduceTasks happen to be located on the same node, then the assignment in Part (b) of Fig. 7 can alleviate disk I/O and CPU contention.

[Fig. 7. Mitigating resource contention: ReduceTask fetch activity under Fair Scheduler (a) and Coupling Scheduler (b).]

To evaluate the resource contention, we measure the CPU and disk I/O utilization on the 8 slave nodes in this experiment, as shown in Table I. On average, %iowait (the percentage of time that the CPU or CPUs were idle while the system had an outstanding disk I/O request) and await (the average time in milliseconds for I/O requests issued to the device to be served, including the time spent by the requests in queue and the time spent servicing them) decrease by 40.0% and 13.6%, respectively. Interestingly, svctm (the average service time in milliseconds for I/O requests issued to the device) also decreases by 13.2%.

TABLE I
CPU AND DISK I/O

                 Fair Scheduler                  Coupling Scheduler
Node ID   %iowait  %idle  await   svctm    %iowait  %idle  await   svctm
1         0.81     21.06  111.30  4.61     0.55     13.59  103.33  4.03
2         0.80     15.45  109.14  4.64     0.48     11.67  102.77  4.21
3         1.15     15.30  45.84   3.51     0.67     9.51   43.23   2.47
4         1.01     24.88  119.16  4.54     0.58     16.60  96.24   3.85
5         0.99     16.54  64.03   4.06     0.49     9.95   49.01   3.14
6         1.02     21.41  148.94  4.80     0.56     13.49  109.88  4.19
7         1.07     22.71  82.30   4.10     0.58     15.10  87.86   3.65
8         0.72     16.20  92.27   4.32     0.66     10.98  75.42   4.45
Average   0.95     19.19  96.62   4.32     0.57     12.61  83.47   3.75

These 10 submitted jobs have 4270 MapTasks in total. We plot the cumulative distribution of the processing times of these MapTasks in Fig. 8, which shows that the map task processing time under Coupling Scheduler is statistically smaller than that under Fair Scheduler. On average, it decreases by 10.1%.

[Fig. 8. Empirical map task processing time distribution: cumulative distribution of map task processing times (seconds) under Fair and Coupling Schedulers.]

2) Wait Scheduling for ReduceTasks: Wait Scheduling places ReduceTasks close to the "centrality" of the already generated intermediate data stored on the slave nodes, the topology of which is assumed to be a tree in Hadoop. We introduced this notion in [21]; details are provided in this paper. For related work on ReduceTask data locality, see [14], [22], [23]. Since ReduceTasks are launched gradually under Coupling Scheduler, those launched later know more about the distribution of the intermediate data.

Now we define the "centrality" of intermediate data. A given network G = (V, E) comprises a set V of nodes together with a set E of edges. Denote by h(u, v) the hop distance between nodes u and v, and by w_i(u) the size of the data stored on node u for job i. Furthermore, denote by V_d(i) the set of nodes that store the intermediate data for job i, and by V_r the set of nodes that have available reduce slots. When transferring the stored data on node u to node v, the cost is proportional to w_i(u)h(u, v). Thus, for a job i, we define the total cost of transferring its intermediate data to node v ∈ V_r by

C_i(v, V_d(i)) = Σ_{u ∈ V_d(i)} w_i(u) h(v, u).    (1)

We compute C_i(v, V_d(i)) for every v ∈ V_r, and only keep the smallest D (e.g., D = 7) values sorted in increasing order; the number of values in this list may be smaller than D if |V_r| < D. The corresponding nodes, sorted in the same order, form a list L(i). Accordingly, the "centrality" is defined to be the first node v* in L(i), which minimizes C_i(v, V_d(i)) over all v ∈ V_r. Ties are broken in a random order.

Even when a node has an available reduce slot, Hadoop may still reject a reduce request, since a resource estimator determines whether a reduce request can be granted on a given node. Therefore, the "centrality" node may not be able to run the requested ReduceTask. To this end, we split the list L(i) into three groups: L1(i) only contains the "centrality" for job i; L2(i) contains the second and third nodes of L(i); L3(i) contains the remaining nodes of L(i). Note that each of these groups may be empty if |V_r| < D.
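As an illustration of the centrality computation and the split of L(i) into L1, L2 and L3, here is a minimal Python sketch; the helper names (data_size, hop_distance) and the callback-style inputs are our own assumptions, not the paper's implementation.

import random

def centrality_groups(reduce_nodes, data_nodes, data_size, hop_distance, D=7):
    """Rank nodes with free reduce slots by the transfer cost of Eq. (1)
    and split the best D into groups L1 (the "centrality"), L2, L3.

    reduce_nodes: nodes with available reduce slots (V_r)
    data_nodes:   nodes holding intermediate data for the job (V_d(i))
    data_size:    data_size[u] = bytes of intermediate data on node u (w_i(u))
    hop_distance: hop_distance(u, v) = tree hop distance h(u, v)
    """
    def cost(v):
        # C_i(v, V_d(i)) = sum over u of w_i(u) * h(v, u)
        return sum(data_size[u] * hop_distance(v, u) for u in data_nodes)

    # Shuffling first gives random tie-breaking, since Python's sort is stable.
    candidates = list(reduce_nodes)
    random.shuffle(candidates)
    ranked = sorted(candidates, key=cost)[:D]   # keep the D cheapest nodes

    L1 = ranked[:1]    # the "centrality" node
    L2 = ranked[1:3]   # second and third best
    L3 = ranked[3:]    # the rest (may be empty if |V_r| < D)
    return L1, L2, L3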

In each round of communications between the master node and the N slave nodes, the master node sequentially processes the N received heartbeats, which arrive in a random order. When it is ready to launch a ReduceTask (mismatch > 0), the slave node that just sent the heartbeat is not necessarily close to the data "centrality" of the job under consideration. To resolve this issue, we propose Wait Scheduling, which allows ReduceTasks to check up to 3 × N heartbeats (three rounds). Within the first N received heartbeats, it only launches the ReduceTask on a node in L1(i). If that fails, it considers L2(i) for the next N heartbeats; if that fails again, L3(i) is considered. After the three rounds, it assigns the ReduceTask randomly to a node. We avoid placing multiple ReduceTasks of a job on the same node if possible, following the suggestion in [9]. When receiving a heartbeat, we find the job with the largest positive mismatch value (m◦) as a candidate for launching a ReduceTask. If the candidate cannot launch a ReduceTask close to its current data "centrality", it skips the current heartbeat. We introduce a counter (wait) to denote the number of skipped heartbeats. The details are shown in Algorithm 2.

Algorithm 2 Wait Scheduling

1: Static values: candidate = NULL, wait = 0, m◦ = 0
2: Input: Receive a heartbeat from node v; the cluster has N nodes.
3: Output: Compute the task assignment of ReduceTasks on node v.
4: if candidate == NULL then
5:   Loop over all jobs and find the one (J) with the largest mismatch (m◦), breaking ties by randomly selecting one.
6:   if m◦ > 0 then
7:     candidate = J
8:   end if
9: else
10:   wait = wait + 1
11:   for i = 1 to 3 do
12:     if (i − 1) × N < wait ≤ i × N and v ∈ Li(J) and node v has no running ReduceTasks of J then
13:       if successfully launch a ReduceTask of job J on node v then
14:         candidate = NULL, wait = 0
15:       else
16:         wait = i × N + 1
17:       end if
18:     end if
19:   end for
20:   if 3N < wait ≤ 4N then
21:     if successfully launch a ReduceTask of job J on node v then
22:       candidate = NULL, wait = 0
23:     end if
24:   end if
25:   if wait > 4N then
26:     candidate = NULL, wait = 0
27:   end if
28: end if
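A condensed Python rendering of Algorithm 2 may help; the per-job group lists L1(J)–L3(J) are exposed here as J.L[1..3], and the try_launch callback stands in for Hadoop's slot-granting logic. Both are assumed interfaces, and tie-breaking among equal mismatch values is omitted.

class WaitScheduler:
    """Sketch of Algorithm 2; state persists across heartbeats on the master."""
    def __init__(self, num_nodes, try_launch):
        self.N = num_nodes      # number of slave nodes
        self.try_launch = try_launch  # try_launch(job, node) -> bool (assumed hook)
        self.candidate = None   # job J with the largest positive mismatch
        self.wait = 0           # number of skipped heartbeats

    def on_heartbeat(self, node, jobs):
        if self.candidate is None:
            best = max(jobs, key=lambda j: j.mismatch, default=None)
            if best is not None and best.mismatch > 0:
                self.candidate = best
            return
        J = self.candidate
        self.wait += 1
        # Rounds 1-3: accept only nodes in L1(J), L2(J), L3(J) in turn.
        for i in (1, 2, 3):
            if (i - 1) * self.N < self.wait <= i * self.N:
                if node in J.L[i] and not J.has_running_reduce_on(node):
                    if self.try_launch(J, node):
                        self.candidate, self.wait = None, 0
                    else:
                        self.wait = i * self.N + 1  # move on to the next group
                return
        # Round 4: place the ReduceTask on any node that will grant a slot.
        if 3 * self.N < self.wait <= 4 * self.N:
            if self.try_launch(J, node):
                self.candidate, self.wait = None, 0
        elif self.wait > 4 * self.N:
            self.candidate, self.wait = None, 0  # give up and start over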

B. Details on scheduling MapTasks

The Hadoop file system [24] stores the input data as chunks distributed across the disks of the whole cluster. The input data of a map task reside on one of the slave nodes, which can differ from the node where the map task is launched. Running a MapTask with local data can greatly improve performance [10]–[12].

Fair Scheduler utilizes Delay Scheduling to improve data locality [12]. Roughly speaking, the job tracker maintains, for every job, a timer that keeps increasing from 0 towards T1 and T2. These two values T1 and T2 account for node-level and rack-level data locality, respectively [3]. If a job has local data on a node that sends out a heartbeat, the job immediately launches MapTasks on that node and resets the timer. If a job has no local data and the timer is less than T1, it can wait for a node with local data before the timer expires. In the same manner, T2 is used for rack-level data. If the timer expires, MapTasks can be launched even without local data.

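For reference, a minimal Python sketch of the Delay Scheduling rule just described, under our simplifying assumption that T1 and T2 act as two absolute thresholds on the same per-job timer; the job attributes (timer_reset_time, has_local_data, has_rack_data) are hypothetical names for this illustration.

def delay_schedule(job, node, now, T1, T2):
    """Delay Scheduling decision for one (job, heartbeat-node) pair.

    The job's timer keeps growing until a node-local launch resets it;
    T1 gates waiting for node-local data and T2 for rack-local data.
    """
    waited = now - job.timer_reset_time
    if job.has_local_data(node):
        job.timer_reset_time = now   # launch locally and reset the timer
        return "launch node-local"
    if waited < T1:
        return "skip"                # keep waiting for a node with local data
    if job.has_rack_data(node):
        return "launch rack-local"
    if waited < T2:
        return "skip"                # keep waiting for a rack-local node
    return "launch remote"           # both timers expired: run anywhere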

Interestingly, we find that MapTasks are sensitive to the introduced delay. For illustration, assume T = T1 = T2, e.g., T = 15 seconds in [12]. The waiting time T determines the rate of launching MapTasks. Suppose the MapTasks of the considered job last on average K seconds each. When the number S of nodes running local MapTasks exceeds a threshold, i.e., S/K > 1/T, the MapTask assignment rate of a job becomes less than the rate at which its MapTasks complete on the cluster. This can cause under-utilization and instability: the number of MapTasks running simultaneously can be far below the desired level and change with large variations. We illustrate this problem in Fig. 9.

[Fig. 9. When MapTasks running with local data always release at least one map slot before their timers expire, pending MapTasks cannot be launched on the available slots without local data.]

This phenomenon can be explained by the following analysis, which assumes i.i.d. exponential random variables for the MapTask processing times. Let µ be the MapTask completion rate and s the number of map slots with local data in Fig. 9. Since a job can wait T seconds before launching a remote MapTask, the timer resets to zero if one of these s MapTasks running on local data releases a map slot with local data within T seconds. Using the properties of exponential distributions, we know that, for a job with N MapTasks, the probability that all of the MapTasks will be launched on slots with local data is

p_s = (1 − e^{−sµT})^N.    (2)

We demonstrate this problem using an experiment (WordCount with 427 MapTasks on 15 nodes), shown in Fig. 10. We intentionally store all the input data on 7 nodes, with the other 8 nodes having no input data at all.

[Fig. 10. WordCount runs on 15 slave nodes with Fair Scheduler. Each node has 4 map slots (60 slots in total). Only 7 nodes have local data and 8 nodes do not store any input data. Most of the time, delay scheduling only allows 4 × 7 MapTasks to run simultaneously (under-utilization); there is a sudden big jump around time 8.2 minutes (instability).]

The measurement result in Fig. 10 shows µ = 1/48.5. Since s = 4 × 7 and T = 15.0 in this experiment, a simple calculation using Equation (2) yields p_s = 92.9%. This partially explains why Delay Scheduling can only launch 28 concurrent MapTasks most of the time. Therefore, improving data locality should be balanced against resource utilization; when computing resources are abundant, even launching a good number of remote MapTasks is beneficial.
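As a quick sanity check on Equation (2), the following snippet reproduces the 92.9% figure from the measured parameters; it is plain arithmetic, not part of the scheduler.

import math

mu = 1 / 48.5   # measured MapTask completion rate (1/seconds)
s = 4 * 7       # map slots with local data: 7 nodes x 4 slots each
T = 15.0        # Delay Scheduling wait time (seconds)
N = 427         # MapTasks of the WordCount job

p_s = (1 - math.exp(-s * mu * T)) ** N
print(f"p_s = {p_s:.1%}")   # -> p_s = 92.9%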

Since MapTasks are sensitive to delays, we propose Random Peeking Scheduling, which avoids delays, as shown in Algorithm 3. We start with an adopted algorithm that is already available to compute a list of MapTask candidates to be launched when ignoring data locality constraints, e.g., the slot allocation algorithm of Fair Scheduler [12] or FLEX [5].

Algorithm 3 Random Peeking Scheduling

1: Input: A list L of MapTask candidates computed when ignoring data locality constraints.
2: Output: A subset L◦ of L to be launched; initially L◦ = ∅.
3: for all j ∈ L do
4:   Compute pj using (3).
5:   if rand < pj then
6:     Comment: rand is uniformly random on [0, 1]
7:     L◦ = L◦ ∪ {j}
8:   end if
9: end for

Based on the list of MapTask candidates, we associate with each candidate of job j a launching probability p_j and then randomly launch each MapTask candidate of job j with probability p_j. When the slave node that just sent the heartbeat has local inputs for job j, p_j is set to 1; otherwise, p_j is set smaller than 1. In the latter case the scheduler launches a remote MapTask randomly, depending on 1) the fraction p̂_m^j of nodes that have local input data, 2) the number N̂_m of nodes that have available map slots, and 3) the number M_p^j of pending MapTasks of job j. Choosing a good p_j is based on the intuition that 1) if many other nodes have local data and available slots, then the scheduler should skip the current heartbeat with high probability; and 2) if job j has a large number of pending MapTasks compared with N̂_m, it is beneficial to launch remote MapTasks immediately with a certain probability.

On a heartbeat, the master node randomly selects K nodes and checks how many (say, K_1^j) have local map inputs for a given job j and how many (say, K_2) have available map slots. The random sampling reduces the time complexity for a large cluster with N nodes; a medium/small cluster can set K = N (our testbed has 62 nodes, and the scheduler checks every node). The master then estimates the fraction of nodes that have local map inputs for job j as p̂_m^j = K_1^j/K, and the total number of nodes with available map slots as N̂_m = N × K_2/K. Note that p̂_m^j tends to decrease as job j proceeds further. Our implementation sets

p_j = 1, if job j has local data on the given node;
p_j = 1 − α_j (p̂_m^j)^{β_j} (1 − e^{−N̂_m}), otherwise,    (3)

where β_j = 0.1 + 0.9(1 − exp(−M_p^j / max{N̂_m, 1})). We set α_j to 0.7 if job j has rack-local data and running ReduceTasks on the node, 0.8 if job j has rack-local data but no running ReduceTasks on the node, and 1.0 otherwise. This choice reflects our preference to place MapTasks close to the input data as well as to running ReduceTasks of the same job.
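The following Python sketch assembles Equation (3) from the sampled quantities; the predicate inputs for local data and free slots are hypothetical stand-ins for what the JobTracker actually knows, and this is only a sketch of the description above.

import math

def launch_probability(pending_maps, sampled_nodes, has_local_data,
                       has_free_slot, cluster_size, alpha):
    """Launching probability p_j of Eq. (3) for a non-local heartbeat.

    sampled_nodes: K randomly chosen nodes used for the estimates
    has_local_data(n): True if node n stores input data of job j
    has_free_slot(n):  True if node n has an available map slot
    alpha: 0.7 / 0.8 / 1.0 depending on rack data and running reducers
    """
    K = len(sampled_nodes)
    K1 = sum(1 for n in sampled_nodes if has_local_data(n))   # K_1^j
    K2 = sum(1 for n in sampled_nodes if has_free_slot(n))    # K_2
    p_m = K1 / K                   # estimated fraction with local inputs
    N_m = cluster_size * K2 / K    # estimated nodes with free map slots
    beta = 0.1 + 0.9 * (1.0 - math.exp(-pending_maps / max(N_m, 1.0)))
    return 1.0 - alpha * (p_m ** beta) * (1.0 - math.exp(-N_m))

# Algorithm 3 then launches a remote MapTask when rand < p_j, e.g.:
#   if random.random() < p_j: launch(job, node)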

The scheduler skips the current heartbeat for the considered job with probability 1 − p_j, expecting better data locality on the next heartbeat. This forms a Bernoulli sequence that takes 1/p_j heartbeats on average. Consider some extreme cases. When N̂_m is large, p_j is close to 1 − α_j (p̂_m^j)^{β_j}. When N̂_m = 0, p_j equals 1: in this case, even when there is no local data on the node, it is beneficial to launch a remote MapTask immediately, since there are not enough map slots on other nodes.

Repeating the WordCount experiment of Fig. 10, setting β_j ≡ β for all j and fixing β = 0.1 and β = 0.5 results in the MapTask assignments shown in Fig. 11, which decrease the job processing time by 18.3% and 38.7%, respectively.

[Fig. 11. Peeking algorithm for MapTasks: running maps over time with β = 0.5 and β = 0.1.]

Unlike Fig. 10, the assignment does not exhibit a sudden big jump, and the utilization of map slots increases. In Section III we compare the number of MapTasks with local input data when running 10 jobs under Fair Scheduler and Coupling Scheduler. Interestingly, this value for Fair Scheduler lies between the values obtained when fixing β = 0.1 and β = 0.5 for Coupling Scheduler in this experiment, as shown in Fig. 15. That is the reason we let β_j depend on the number of pending MapTasks M_p^j of job j in (3).

III. EXPERIMENTAL RESULTS

We evaluate Coupling Scheduler through extensive experiments. Our testbed has one master node and 62 slave nodes; a survey has shown that the average cluster size is 66 nodes [25]. Each node has 4 cores (2933 MHz, 32 KB cache), 6 GB of memory and 72 GB of disk. For comparison, both the Coupling and Fair Schedulers are configured identically, with 4 map slots and 2 reduce slots per node. In addition, speculative tasks are disabled, since failures seldom occur on our small/medium-size testbed and enabling speculative tasks actually increases the job processing times due to wasted slots. We design three sets of experiments by varying the size of the cluster and changing the locations of input data.

A: Set A contains 7 slave nodes. The following files are stored on these nodes with the replication value set to 1. We download the Wikipedia Free Encyclopedia (28G, named WikiInput). In addition, we generate RandomInput05, RandomInput10, RandomInput15, RandomInput30 and RandomInput of size 0.5G, 1.0G, 1.5G, 3.0G and 10.0G bytes, respectively, where each line of these five files contains only a randomly generated 10-digit integer. We build three files (named RandomPair1, RandomPair2 and RandomPair3, of size 15.0G, 29.0G and 4.3G bytes, respectively) using RandomWriter.

B: Set B contains 15 slave nodes. We add 8 nodes that do not store any input data to Set A, for testing the effectiveness of data locality.

C: Set C contains 62 slave nodes. We copy every file in Set A and store them across the whole cluster.

A. No harm to a single job's response time

Different from Fair Scheduler, which launches ReduceTasks greedily, Coupling Scheduler introduces delays and starts them step by step. Therefore, we first investigate whether these delays impact the job response time when running a single job in the cluster. We test both CPU-intensive and I/O-intensive jobs, as shown in Table II, repeating each 15 times on Set A to average out random factors. For example, the processing times of Sort2 vary considerably, ranging from 16.15 to 23.5 minutes under Fair Scheduler. Therefore, we plot the mean, minimum and maximum of the processing times over the 15 trials in Fig. 12 for each job.

TABLE II
JOB PROFILE

Job                        Map (#)  Reduce (#)
Grep [1-5]* randomInput    148      6
RandomWriter               70       -
QuasiMonteCarlo            50       1
Sort randomPair1 (Sort1)   224      12
Sort randomPair2 (Sort2)   352      27

Interestingly, not only do the introduced delays not slow down job processing on average (Sort2 runs slightly slower), they even expedite many of the jobs (e.g., Grep, MonteCarlo) for the reasons explained in Section II-A.

[Fig. 12. Mean and range of single-job processing times under Fair and Coupling Schedulers.]

In this figure, Sort1 decreases by 14.6% when running with Coupling Scheduler. This is because the job has 12 ReduceTasks while the cluster has 14 reduce slots, so the placement of ReduceTasks differentiates the performance. Sort2 has 27 ReduceTasks, and the placement of ReduceTasks has much less impact on its total processing time.

B. Better resource utilization for scheduling MapTasks

To test Random Peeking Scheduling (Algorithm 3), we submit 10 jobs without any reduce tasks on Set B. Trace studies on a production MapReduce cluster show that 77% of jobs are map-only [16]. The details of the 10 jobs are described in Table III; the third column lists the job submission times in seconds. We plot the total processing time of each job in Part (a) of Fig. 13 under the Coupling and Fair Schedulers.

TABLE III
GREP JOBS WITHOUT REDUCETASKS

JobID  Job                          Time (s)  Map (#)
01     Grep [a-f][a-z]* wikiInput   0         427
02     Grep [1-5]* randomInput10    120       15
03     Grep [1-4]* randomInput15    240       23
04     Grep [1-3]* randomInput30    360       45
05     Grep [a-f][a-z]* wikiInput   480       427
06     Grep [1-2]* randomInput10    600       15
07     Grep [1-4]* randomInput15    720       23
08     Grep [1-3]* randomInput20    840       30
09     Grep [a-f][a-z]* wikiInput   960       427
10     Grep [1-2]* randomInput30    1080      45

On average, Coupling Scheduler decreases the processing time by 24% compared to Fair Scheduler in this example.

[Fig. 13. Jobs with only MapTasks: (a) job processing times under Coupling and Fair Schedulers; (b) map task assignment over time under each scheduler.]

[Fig. 14. More efficient resource utilization with Coupling Scheduler: maximum network bytes (in/out) over all nodes under Fair Scheduler and Coupling Scheduler.]

The performance gain is due to the improvements to the under-utilization and instability problems of Fair Scheduler illustrated in Fig. 10. Since 8 out of the 15 nodes in Set B do not store any input data, the number of running MapTasks with Fair Scheduler can fall below the desired value with large variations over time, as shown in Part (b) of Fig. 13. For example, the number of running MapTasks of jobs 1, 5 and 9 varies between 20 and 60, which increases the total processing times. Random Peeking Scheduling can launch more non-local MapTasks when there are still many pending MapTasks, which decreases the job processing time at the expense of more network traffic; see Fig. 14. We plot the maximum number of bytes transmitted through the network on any node of the cluster at each time. This metric is more meaningful than the total network traffic in characterizing network utilization: if the network becomes an issue, the bottleneck is likely on the node with the largest network activity. To further illustrate this point, we compare the number of MapTasks that have local input data for each job when running Fair Scheduler, and Coupling Scheduler with β = 0.1 and β = 0.5, as shown in Fig. 15. Interestingly, the data locality performance of Fair Scheduler lies between that of Coupling Scheduler with β = 0.1 and β = 0.5 in this experiment.

[Fig. 15. The number of MapTasks with local data for each job under Fair Scheduler and Coupling Scheduler (β = 0.5 and β = 0.1).]

C. Multiplexing typical job flows

Trace studies show that in a production MapReduce cluster most jobs are map-intensive [16] and typical job sequences consist of large jobs followed by small ones [18]. Therefore, we submit a sequence of 22 jobs with both map-intensive jobs (e.g., Grep and QuasiMonteCarlo) and copy/shuffle-intensive jobs (e.g., Sort), exhibiting the observed statistical characteristics. The details are presented in Table IV: the third column (Time) lists the job arrival times in seconds, and the fourth (M) and fifth (R) columns contain the number of map and reduce tasks of each job. Fig. 3 in Section I-C plots the starting time point as well as the minimum, mean and maximum processing times of each job.

TABLE IV
JOB SEQUENCE

JobID  Job                         Time  M    R   SF   SC
01     Grep [1-5]* randomInput     0     148  15  0.0  0.0
02     Grep [5-9]* randomInput     30    148  15  0.0  0.0
03     QuasiMonteCarlo             150   5    1   6.1  0.0
04     WordCount randomInput05     170   8    1   4.0  0.0
05     Grep [2-6]* randomInput05   190   8    2   3.9  0.0
06     Grep [3-6]* randomInput05   210   8    2   2.8  0.0
07     Grep [4-6]* randomInput05   230   8    3   2.8  0.0
08     Grep [a-h][a-z]* wikiInput  470   427  15  0.0  0.2
09     Grep [a-g][a-z]* wikiInput  500   427  15  0.0  0.0
10     Sort randomPair1            800   224  27  4.9  4.5
11     Grep [1-2]* randomInput10   860   15   5   8.3  0.9
12     Grep [1-5]* randomInput05   880   8    3   8.6  0.6
13     Grep [6-9]* randomInput05   900   8    2   8.4  0.2
14     Sort randomPair3            1020  64   27  5.3  3.5
15     Grep [3-8]* randomInput20   1140  30   2   1.0  0.0
16     WordCount randomInput10     1440  15   1   0.1  0.0
17     Sort randomPair2            1710  352  27  0.3  5.0
18     QuasiMonteCarlo             2110  15   1   0.3  0.0
19     Grep [1-5]* randomInput05   2125  8    3   0.0  0.0
20     Sort randomPair3            2245  64   27  0.5  1.1
21     RandomWriter                2365  150  -   0.0  0.0
22     QuasiMonteCarlo             2485  10   1   0.0  0.0

To quantify the performance improvement from alleviating the starvation problem, we define a starvation time for each job: SC under Coupling Scheduler and SF under Fair Scheduler. The starvation time of a job is defined as the average of the durations from the completion time of its last MapTask to the starting times of each of its reduce tasks launched after the last MapTask completes. Table IV contains the values of SC and SF in minutes, which clearly show that Coupling Scheduler, compared to Fair Scheduler, greatly expedites the processing of most jobs, e.g., jobs 3, 4, 5, 6, 7, 11, 12 and 13 (job 3 runs 12.8 times faster), due to the largely reduced starvation times.
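In symbols (our notation, not the paper's): let t_map denote the completion time of a job's last MapTask, t_r the launch time of its ReduceTask r, and R = {r : t_r > t_map} the set of reduce tasks launched after the last MapTask completes. Then the starvation time is

S = (1/|R|) Σ_{r ∈ R} (t_r − t_map),

evaluated separately under each scheduler to give SC and SF.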

D. Experiments on 62 nodes

The last experiment is conducted on Set C with 62 nodes. We submit 200 jobs sequentially in 10 batches, where each batch contains 20 jobs with characteristics similar to the workflow in Table IV. These 200 jobs consist of Grep, WordCount, QuasiMonteCarlo and Sort jobs, with percentages 73.8%, 8.9%, 6.8% and 10.5%, respectively. This is in line with the traces collected on a production MapReduce cluster [16]. The job arrival intervals follow exponential distributions, with the average interval within each batch equal to 25 seconds and the average interval between the arrival times of the first jobs of two consecutive batches equal to 20 minutes.

[Fig. 16. Empirical job completion time distribution: cumulative distribution of job processing times (minutes) under Fair and Coupling Schedulers.]

Fig. 16 plots the cumulative distribution of the job processing time T under Fair Scheduler and Coupling Scheduler, respectively. It shows that P[T < t] under Coupling Scheduler is larger than under Fair Scheduler for every t. This implies that Coupling Scheduler leads to smaller response times in the average as well as in a stronger, probabilistic sense: roughly speaking, it has a higher probability of producing a shorter job processing time than Fair Scheduler in this experiment. On average, Coupling Scheduler decreases the job processing time by 21.3% compared with Fair Scheduler.

IV. CONCLUSION

The current practice with Hadoop schedulers still exhibits room for improvement: map and reduce tasks are scheduled separately without joint optimization, which can cause starvation and unfavorable data locality. To this end, we design a resource-aware scheduler for Hadoop that couples the progress of map and reduce tasks to alleviate starvation and jointly optimizes the data locality for both. Extensive experiments with this scheduler demonstrate significant improvements in job response times.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107–113, January 2008.
[2] Hadoop, http://hadoop.apache.org.
[3] Fair Scheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html.
[4] Capacity Scheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/capacity_scheduler.html.
[5] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin, "FLEX: A slot allocation scheduling optimizer for MapReduce workloads," in Middleware 2010, ser. Lecture Notes in Computer Science, vol. 6452. Springer Berlin/Heidelberg, 2010, pp. 1–20.
[6] J. Polo, C. Castillo, D. Carrera, Y. Becerra, I. Whalley, M. Steinder, J. Torres, and E. Ayguadé, "Resource-aware adaptive scheduling for MapReduce clusters," in Middleware, ser. Lecture Notes in Computer Science, vol. 7049. Springer, 2011, pp. 187–207.
[7] T. Sandholm and K. Lai, "Dynamic proportional share scheduling in Hadoop," in Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing (JSSPP'10). Berlin, Heidelberg: Springer-Verlag, 2010, pp. 110–131.
[8] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08), 2008, pp. 29–42.
[9] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI'10), 2010, pp. 1–16.
[10] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: fair scheduling for distributed computing clusters," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). ACM, 2009, pp. 261–276.
[11] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), 2011, pp. 22–22.
[12] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, April 2009.
[13] B. Palanisamy, A. Singh, L. Liu, and B. Jain, "Purlieus: locality-aware resource allocation for MapReduce in a cloud," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011, pp. 58:1–58:11.
[14] M. Hammoud and M. Sakr, "Locality-aware reduce task scheduling for MapReduce," in Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 570–576.
[15] Y. Wang, J. Tan, W. Yu, L. Zhang, X. Meng, and X. Li, "Preemptive ReduceTask scheduling for fair and fast job completion," 2013, under review.
[16] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production MapReduce cluster," in Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10), Washington, DC, USA, 2010, pp. 94–103.
[17] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys '07), 2007.
[18] Y. Chen, A. Ganapathi, R. Griffith, and R. H. Katz, "The case for evaluating MapReduce performance using workload suites," in Proceedings of the 19th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), July 2011.
[19] J. Tan, X. Meng, and L. Zhang, "Performance analysis of Coupling Scheduler for MapReduce/Hadoop," in Proceedings of the 31st IEEE International Conference on Computer Communications (INFOCOM'12), March 2012, pp. 2586–2590, mini conference.
[20] J. Tan, X. Meng, and L. Zhang, "Delay tails in MapReduce scheduling," in Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12). New York, NY, USA: ACM, 2012, pp. 5–16.
[21] J. Tan, X. Meng, and L. Zhang, "Coupling Scheduler for MapReduce/Hadoop," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC'12), 2012, pp. 129–130, poster.
[22] M. Hammoud, M. S. Rehman, and M. F. Sakr, "Center-of-gravity reduce task scheduling to lower MapReduce network traffic," in Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (IEEE CLOUD). IEEE, June 2012.
[23] J. Tan, X. Meng, and L. Zhang, "Improving ReduceTask data locality for sequential MapReduce jobs," in Proceedings of the 32nd IEEE International Conference on Computer Communications (INFOCOM'13), April 2013.
[24] Hadoop Distributed File System, http://hadoop.apache.org/hdfs/.
[25] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: Automatic Resource Inference and Allocation for MapReduce environments," in Proc. of the International Conference on Autonomic Computing (ICAC), 2011.