Resource Management
Marco Serafini
COMPSCI 532 Lecture 17
What Are the Functions of an OS?
- Virtualization
  - CPU scheduling
  - Memory management (e.g. virtual memory)
- Concurrency
  - E.g. allocate processes, threads
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today’s frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
1 Introduction
Clusters of commodity servers have become a major computing platform, powering both large Internet services and a growing number of data-intensive scientific applications.

Two common solutions for sharing a cluster today are either to statically partition the cluster and run one framework per partition, or to allocate a set of VMs to each framework. Unfortunately, these solutions achieve neither high utilization nor efficient data sharing. The main problem is the mismatch between the allocation granularities of these solutions and of existing frameworks. Many frameworks, such as Hadoop and Dryad, employ a fine-grained resource sharing model, where nodes are subdivided into “slots” and jobs are composed of short tasks that are matched to slots [25, 38]. The short duration of tasks and the ability to run multiple tasks per node allow jobs to achieve high data locality, as each job will quickly get a chance to run on nodes storing its input data. Short tasks also allow frameworks to achieve high utilization, as jobs can rapidly scale when new nodes become available.
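The slot model described here can be sketched in a few lines. This is an illustrative simplification, not any framework's real scheduler (the function and data-structure names are invented, and real systems use richer policies such as delay scheduling): each short task is preferentially matched to a free slot on a node that stores its input block, which is what yields high data locality.

```python
# Sketch of fine-grained slot scheduling with data locality (hypothetical,
# simplified): nodes are subdivided into slots, jobs into short tasks, and
# each task prefers a node holding a replica of its input block.

def schedule(free_slots, pending_tasks, block_locations):
    """free_slots: {node: free slot count};
    pending_tasks: [(task_id, input_block)];
    block_locations: {block: set of nodes holding a replica}.
    Returns [(task_id, node, is_local)]."""
    placements = []
    for task_id, block in pending_tasks:
        # Prefer a node that holds the task's input block (local read).
        replicas = block_locations.get(block, set())
        local = [n for n in replicas if free_slots.get(n, 0) > 0]
        candidates = local or [n for n, c in free_slots.items() if c > 0]
        if not candidates:
            break  # no free slot anywhere; the task waits for the next one
        node = candidates[0]
        free_slots[node] -= 1
        placements.append((task_id, node, node in replicas))
    return placements

slots = {"n1": 1, "n2": 1}
tasks = [("t1", "b1"), ("t2", "b2")]
blocks = {"b1": {"n2"}, "b2": {"n1"}}
print(schedule(slots, tasks, blocks))  # both tasks placed data-local
```

Because tasks are short, slots free up frequently, so even a job whose data sits on busy nodes soon gets a local slot.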
Unfortunately, because these frameworks are developed independently, there is no way to perform fine-grained sharing across frameworks, making it difficult to share clusters and data efficiently between them. In this paper, we propose Mesos, a thin resource sharing layer that enables fine-grained sharing across diverse frameworks.
[Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). The Hadoop and MPI schedulers register with the Mesos master (backed by standby masters coordinated through a ZooKeeper quorum); each Mesos slave runs the frameworks' executors (Hadoop executor, MPI executor), which run their tasks.]
[Figure 3: Resource offer example. (1) Slave 1 reports its free resources <s1, 4cpu, 4gb, …> to the Mesos master, whose allocation module decides to offer them to framework 1. (2) The master sends the resource offer <s1, 4cpu, 4gb, …> to framework 1's scheduler (which manages Job 1 and Job 2). (3) The scheduler replies with two tasks to launch: <fw1, task1, 2cpu, 1gb, …> and <fw1, task2, 1cpu, 2gb, …>. (4) The master sends the tasks <task1, s1, 2cpu, 1gb, …> and <task2, s1, 1cpu, 2gb, …> to slave 1, which allocates the resources to framework 1's executor, which launches the tasks. Framework 2 has its own scheduler and executors (e.g. on slave 2) and can receive the remaining resources in a later offer.]
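The exchange in Figure 3 can be traced step by step in a short sketch. The names below are hypothetical, not the Mesos API; the point is only that the framework, not the master, chooses the tasks, and that accepted tasks must fit within the offer.

```python
# The numbered steps of Figure 3 as a sketch (hypothetical names).

# (1)-(2) Slave s1 reports 4 free CPUs and 4 GB; the master's allocation
# module decides to offer them to framework 1.
offer = {"slave": "s1", "cpus": 4, "mem_gb": 4}

def fw1_scheduler(offer):
    # (3) Framework 1's scheduler answers with tasks to run on the offer:
    # (framework, task, cpus, mem_gb)
    return [("fw1", "task1", 2, 1), ("fw1", "task2", 1, 2)]

tasks = fw1_scheduler(offer)

# The accepted tasks must fit inside the offered resources.
used_cpus = sum(t[2] for t in tasks)
used_mem = sum(t[3] for t in tasks)
assert used_cpus <= offer["cpus"] and used_mem <= offer["mem_gb"]

# (4) The master forwards the tasks to slave s1, which hands the resources
# to framework 1's executor. The leftover 1 CPU and 1 GB can be offered to
# framework 2 in a later round.
print(offer["cpus"] - used_cpus, offer["mem_gb"] - used_mem)  # 1 1
```

This division of labor is the "two-level" part of the design: the master stays thin and policy-light, while each framework keeps its own sophisticated scheduler.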
Large-scale cluster management at Google with Borg
Abhishek Verma† Luis Pedrosa‡ Madhukar Korupolu David Oppenheimer Eric Tune John Wilkes
Google Inc.
Abstract
Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
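As an illustration of two of the ingredients named in the abstract, task-packing and over-commitment, here is a toy best-fit packer. This is not Borg's actual algorithm; the over-commitment factor and all names are invented for the example. Tasks are placed against capacities inflated by an over-commitment factor (most tasks use less than they reserve), and a task that fits nowhere is rejected, playing the role of admission control.

```python
# Toy best-fit task packing with over-commitment (illustrative only).

OVERCOMMIT = 1.2  # assumed over-commitment factor, for illustration

def pack(tasks, machines):
    """tasks: [(name, cpus)]; machines: {name: cpu_capacity}.
    Returns {machine: [task names]} using best fit against
    over-committed capacity; raises if a task fits nowhere."""
    free = {m: cap * OVERCOMMIT for m, cap in machines.items()}
    placement = {m: [] for m in machines}
    for name, cpus in sorted(tasks, key=lambda t: -t[1]):  # biggest first
        fits = [m for m in free if free[m] >= cpus]
        if not fits:
            # Stand-in for admission control: the cluster is full.
            raise RuntimeError(f"rejecting {name}: no machine can hold it")
        # Best fit: the machine left with the least slack after placement.
        m = min(fits, key=lambda m: free[m] - cpus)
        free[m] -= cpus
        placement[m].append(name)
    return placement

print(pack([("a", 3), ("b", 2), ("c", 2)], {"m1": 4, "m2": 4}))
```

Without the 1.2 factor, task "c" above would be rejected; over-commitment trades that rejection for a risk of contention when reservations are actually used, which is why it is paired with performance isolation.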
[Figure 1: The high-level architecture of Borg. Only a tiny fraction of the thousands of worker nodes are shown. Users submit work to a cell via a config file using borgcfg, command-line tools, or web browsers. The cell's BorgMaster (replicated, each replica with a link shard and a read/UI shard, backed by a Paxos-based persistent store) runs the scheduler and communicates with a Borglet on every machine.]
[Figure (b): CDF of additional machines that would be needed if we segregated the workload of 15 representative cells. Axes: percentage of cells vs. overhead from segregation [%].]
[Figure: resource reclamation experiment. CPU [%] and memory [%] panels show usage, reservation, capacity, and limit over four weeks (week 1: baseline; week 2: aggressive reclamation; week 3: medium; week 4: baseline), together with OOM counts per week.]
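The baseline/aggressive/medium settings in the experiment above control how closely reservations track actual usage. A minimal sketch of the idea (illustrative only, not Borg's resource-reclamation estimator; the function and its margin parameter are invented): the scheduler packs tasks against a reservation derived from observed usage plus a safety margin, rather than against the declared limit. A smaller margin reclaims more capacity but risks more OOMs when usage spikes.

```python
# Illustrative reservation estimate for resource reclamation (not Borg's
# estimator): peak observed usage plus a safety margin, capped at the
# task's declared limit.

def reservation(usage_samples, limit, margin):
    """usage_samples: observed usage over a window; limit: declared
    resource limit; margin: safety margin (e.g. 0.5 = +50%)."""
    return min(limit, max(usage_samples) * (1.0 + margin))

usage = [1.0, 1.4, 1.2]  # e.g. GB of memory observed over some window
print(reservation(usage, limit=4.0, margin=0.5))  # conservative: ~2.1 GB reserved
print(reservation(usage, limit=4.0, margin=0.1))  # aggressive: reclaims more of the 4 GB limit
```

The gap between the limit and the reservation is the reclaimed capacity that can be resold to lower-priority work; the OOM counts in the figure measure the cost of reclaiming too aggressively.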