Resource Management
Paige Calisi, Meghana Yadavalli, B Chase Babrich
Why is Resource Management Important?
- Companies pay for time and resources
- Important to understand workloads
- Traditional big-data analytics workloads differ from DL jobs
- GPUs have become the trend for high performance computing
- Thousands of parallel floating-point units can be packed into a single chip
- Makes it easy to parallelize a single task across many units and to optimize it
Key Challenges
- Many data analytics frameworks
- No one-size-fits-all solution
- Fairness
- Load balancing
- Fault tolerance
- Scalability
Existing Resource Schedulers
- YARN
○ Introduced to relieve Hadoop of resource management and job scheduling
○ Takes a job and distributes it among slave nodes
- Mesos
○ Resource offers: frameworks with low demands pick first
○ Delegates scheduling to the framework, so scheduling is not centralized
- Tetris
○ Packs tasks onto machines based on their resource requirements (see the sketch below)
○ Favors jobs with small resource demands
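To make the packing idea concrete, here is a minimal sketch of a Tetris-style alignment score; the dot-product heuristic follows our reading of Tetris, and all names and numbers are illustrative:

```python
# Sketch of a Tetris-style packing heuristic: score each (task, machine)
# pair by the dot product of the task's demand vector and the machine's
# remaining-resource vector, then place the best-aligned task first.
# This is an illustrative reading of Tetris, not its exact implementation.

def alignment(demand, available):
    """Dot-product alignment between a task's demands and free resources."""
    return sum(d * a for d, a in zip(demand, available))

def pick_placement(tasks, machines):
    """Return the (task, machine) pair with the highest score that fits."""
    best, best_score = None, -1.0
    for t in tasks:                        # t: (cpu, mem, gpu) demand
        for m_id, free in machines.items():
            fits = all(d <= a for d, a in zip(t, free))
            score = alignment(t, free)
            if fits and score > best_score:
                best, best_score = (t, m_id), score
    return best

# Example: two machines, two pending tasks (cpu, mem, gpu).
machines = {"m1": (8, 32, 2), "m2": (4, 64, 0)}
tasks = [(2, 16, 1), (4, 8, 0)]
print(pick_placement(tasks, machines))     # the GPU task aligns with m1
```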
Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications
Problem Statement
- GPU utilization for DL is different from traditional big-data analytics workloads
○ DL jobs run for hours to weeks, vs. milliseconds to hours for analytics tasks
- Identify the constraints
1) GPUs are a monolithic resource that cannot be shared in a fine-grained manner across users
2) Clusters are multi-tenant
3) DL frameworks use gang scheduling, which decreases scheduling flexibility
4) Synchronization of parameters makes jobs locality-sensitive
- Identify implications for future schedulers
Project Philly
Study 3 Things:
1. Queueing delays
   a. Delay incurred by users waiting for their fair share of resources
   b. Delay waiting for locality constraints to be met
2. How GPU utilization is affected by placement decisions for distributed training jobs
   a. Distributing individual jobs across servers, ignoring locality constraints, increases synchronization overheads
   b. Colocation, or packing of different jobs on the same server, leads to contention for shared resources
3. Jobs might fail to complete successfully
   a. Programming errors surface early in the training process
   b. Failures due to cluster components happen later in training
System Overview
- Agnostic to ML framework, all supervised learning tasks
- Distributed training across GPUs: aggregate training results from subsets of data, then perform synchronized updates
- Multiple GPUs on a server (PCIe), multiple servers on a rack (RDMA), multiple racks (Ethernet)
- Fair Scheduling
- Collect logs over 3 main sources
○ YARN scheduler logs
○ stdout and stderr
○ Ganglia monitoring system
Analysis of Queueing Delays
- 2 types of queuing delays:
1) Fair-share delay: a virtual cluster (VC) uses up its allocated GPUs, so jobs wait for GPUs to become available
2) Fragmentation delay: large jobs must be spread across many racks (low locality)
- Jobs requesting more GPUs have a higher probability of longer queueing delays
- Conclusion: need for gang-scheduling and locality introduces fragmentation
delay, so sometimes locality constraints need to be relaxed to mitigate delays
Analysis of GPU Utilization
- GPU utilization is low across all jobs
- Efficiency of allocated GPUs varies according to the locality and colocation scenarios that can occur in the cluster
- Observe whether a particular job requires a disproportionate amount of host memory, and isolate the memory used by jobs colocated on the same server
Training Progress and Completion
- Terminated jobs constituted 55% of GPU utilization
- A large fraction of jobs spent time training for longer than necessary
- User error is a big reason for job failure
- Semantic errors increase with a higher number of GPUs, because workers need to communicate and synchronize model parameters
Lessons Learned
1) Schedulers should trade queueing delay for adhering to locality constraints
a) Retry jobs without relaxing locality constraints
2) Aim to isolate jobs on dedicated servers, and implement migration for defragmentation to support locality constraints
3) Early failures should be caught on a smaller pool of GPUs before jobs are scheduled on larger clusters
a) Many user errors can be caught without deploying on large clusters
b) Classify errors, and don’t retry errors that will never pass (e.g., syntax errors)
Pros and Cons
Pros:
- Explained different scheduling concerns and gave us a very broad understanding of how scheduling jobs affects runtime
- The failure analysis section gives good insight into very easy ways to stop wasting GPU cycles
- Highlights the importance of dynamically checking for loss convergence
Cons:
- Didn’t explain much about the role preemption plays in job completion
- Flexible scheduling can lead to more time being spent saving model checkpoints
- Didn’t address scalability as an issue
Themis: Fair and Efficient GPU Cluster Scheduling
Themis image taken from https://en.wikipedia.org/wiki/Themis#/media/File:0029MAN-Themis.jpg
Motivation
- Two major problems with other scheduling
algorithms:
○ Do not account for the long-running nature of ML tasks
○ No attention is paid to the placement of the ML tasks
○ Example: DRF
- Alright for big data scheduling, but not for ML
○ Violates Pareto efficiency and envy-freedom
○ “Even with existing fair sharing schemes, we do find users frustrated with the inability to get their work done in a timely way…”
- We would like to maximize sharing incentive (SI)
https://www.economicshelp.org/blog/glossary/pareto-efficiency/
Formalization of Time
- ML App
○ One or more training jobs
■ Each job has several tasks that process a minibatch of data
- GPU Time
■ A task running 10 minutes on one GPU = 10 task GPU-mins
■ With 2 tasks per job: 10*2 = 20 job GPU-mins
■ With 2 jobs per app: 10*2*2 = 40 app GPU-mins
- Heterogeneity across apps
○ Analysis of workload traces from a large internet company
- Can be mitigated with LAS
○ Least Attained Service
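To make the GPU-time rollup concrete, here is a minimal sketch of the 10-minute example above (function names are ours):

```python
# Minimal sketch of the task -> job -> app GPU-time rollup from the
# example above: a 10-minute task on one GPU is 10 task GPU-minutes,
# a job with 2 such tasks is 20 job GPU-minutes, and an app with
# 2 such jobs is 40 app GPU-minutes.

def task_gpu_minutes(runtime_min, n_gpus=1):
    return runtime_min * n_gpus

def job_gpu_minutes(task_minutes):
    return sum(task_minutes)

def app_gpu_minutes(job_minutes):
    return sum(job_minutes)

task = task_gpu_minutes(10)            # 10 task GPU-minutes
job = job_gpu_minutes([task, task])    # 10*2 = 20 job GPU-minutes
app = app_gpu_minutes([job, job])      # 10*2*2 = 40 app GPU-minutes
print(task, job, app)                  # 10 20 40
```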
Attempts To Pay Attention To Time - Tiresias
- Uses job completion time and GPU usage as measures of service
- Implements a Least Attained Service (LAS) policy
○ Addresses starvation of jobs, and therefore fairness
- Does not encode the GPU placement preferences of jobs
○ Treats all GPU configurations as equivalent
Image taken from http://01greekmythology.blogspot.com/2014/06/teiresias.html
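As a rough illustration of the LAS policy Tiresias implements, here is a minimal sketch of a least-attained-service queue; the structure and names are ours, not Tiresias’ implementation:

```python
import heapq

# Sketch of a Least Attained Service (LAS) queue: the job that has
# consumed the least GPU service so far runs next, which bounds
# starvation without needing job-duration estimates.

class LASQueue:
    def __init__(self):
        self._heap = []  # (attained_gpu_time, job_id)

    def add(self, job_id, attained=0.0):
        heapq.heappush(self._heap, (attained, job_id))

    def next_job(self):
        """Pop the job with the least attained service."""
        attained, job_id = heapq.heappop(self._heap)
        return job_id, attained

    def requeue(self, job_id, attained, quantum):
        """Charge the job for the quantum it just ran, then requeue."""
        heapq.heappush(self._heap, (attained + quantum, job_id))

q = LASQueue()
for j in ["a", "b", "c"]:
    q.add(j)
job, t = q.next_job()      # all tied at 0 attained service; pops one
q.requeue(job, t, 10.0)    # after 10 GPU-minutes, it goes to the back
```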
The Importance of Space
- The placement of an app can heavily affect its performance
○ Again we see heterogeneity
- LAS and DRF will not achieve efficiency due to these issues
○ Instance 1 violates SI
○ Instance 2 violates PE and EF
Attempts To Pay Attention To Space - Gandiva
- Squeezes as much power out of GPUs as possible
by exploiting the cyclic nature of SGD
○ Uses a greedy scheduling policy that continuously optimizes for cluster efficiency
- Master scheduler assigns Docker containers as
they become available
○ Scheduling policy is built around early feedback and optimizing for efficiency
- Sets the theoretical groundwork for Themis
○ “The primary design goal of the Gandiva scheduler is to provide early feedback to jobs”
○ “Cluster level fairness is not a design goal in Gandiva”
The Themis Solution
- Presented in two parts
○ (1) An auction mechanism that allows apps to bid for resources
■ The “Partial Allocation auction” incentivizes truth-telling
○ (2) A two-level scheduling architecture
■ Allows for hyper-parameter optimization
Key Ideas for Partial Allocation Auction
- Finish-Time fairness
○ SI Achieved if ρ ≤ 1
- Requires the app to be able to express a preference for each allocation
■ Wider interface between app and allocation engine
- Hidden Payment incentivizes truth-telling
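For reference, the finish-time fairness metric can be written out as follows (notation follows the Themis notes at the end of this deck):

```latex
% Finish-time fairness: T_sh is the app's finish time in the shared
% cluster of N apps; T_id is its finish time on a dedicated 1/N cluster.
\rho = \frac{T_{sh}}{T_{id}}
% Sharing incentive (SI) holds when \rho \le 1; the scheduler's goal is
% to minimize the maximum \rho across all ML apps.
```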
Computation of ρ
- Recall that ρ is calculated for every permutation of the available GPUs
- This process is complicated by the presence of hyper-parameter optimization or early stopping
○ In this case, T_sh is calculated differently, with a slowdown term to account for system overhead
Rc = # of GPUs left in cluster
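A minimal sketch of how a per-allocation ρ table might be computed; the finish-time estimator is a hypothetical placeholder, not the paper’s performance model:

```python
# Sketch: compute rho for each candidate GPU allocation. The finish-time
# estimator is a hypothetical placeholder standing in for the app's own
# performance model; only rho = T_sh / T_id follows the paper.

def estimate_finish_time(gpus_allocated, work_remaining):
    # Placeholder: near-linear speedup with diminishing returns.
    return work_remaining / (gpus_allocated ** 0.9)

def rho_table(candidate_allocs, work_remaining, fair_share_gpus):
    t_id = estimate_finish_time(fair_share_gpus, work_remaining)
    return {
        g: estimate_finish_time(g, work_remaining) / t_id
        for g in candidate_allocs
    }

# An app whose fair share is 4 GPUs evaluates allocations of 2..8 GPUs:
print(rho_table(range(2, 9), work_remaining=100.0, fair_share_gpus=4))
# rho > 1 for allocations below fair share, rho < 1 above it.
```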
Multi-Round Auctions
- Single-round auctions do not guarantee SI (why?)
- Auctions are triggered by leases ending
- At each round, the 1-f fraction of apps with the greatest ρ values is selected to participate
○ Why do we do this?
○ What happens as we vary f?
■ Fairness vs. efficiency
- Resources left over from hidden payments are allocated at random (see the sketch below)
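A minimal sketch of one auction round under these rules; the even split stands in for the partial-allocation math, and `f` is the filtering knob:

```python
import random

# Sketch of one multi-round auction step: select the 1-f fraction of
# apps with the worst (largest) rho, auction resources among them, and
# randomly hand out any leftovers. This is a simplified illustration,
# not Themis' exact partial-allocation auction.

def auction_round(apps, free_gpus, f=0.2):
    # apps: dict app_id -> current rho estimate
    ranked = sorted(apps, key=apps.get, reverse=True)   # worst-off first
    k = max(1, int(len(ranked) * (1 - f)))
    bidders = ranked[:k]

    # Evenly split the auctioned pool among bidders (placeholder for the
    # partial-allocation computation).
    per_app, leftover = divmod(free_gpus, len(bidders))
    allocation = {a: per_app for a in bidders}

    # Leftover GPUs (e.g., from hidden payments) are assigned at random.
    for _ in range(leftover):
        allocation[random.choice(bidders)] += 1
    return allocation

apps = {"a": 1.8, "b": 1.2, "c": 0.9, "d": 0.7}
print(auction_round(apps, free_gpus=10, f=0.5))  # only "a" and "b" bid
```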
Themis Scheduling Architecture
- Current architectures cannot support multi-round auctioning
○ E.g. Mesos, Omega
○ Entirely pessimistic or entirely optimistic
- Themis has “semi-optimistic” concurrency control
○ Top level offers optimistic, bottom offers pessimistic
Widening the API Between the Apps and Scheduler
- A crucial aspect of Themis’ architecture design is that it requires each app to be able to see all resources but only use its own
○ Accomplished with the app/agent split
○ Agents are able to see all resources, and apps can only use their own resources
- Allows the Agent the ability to interact with
existing hyper-parameter optimizers
○ Introduces an overhead into the app writer’s process
■ Negligible?
Setting the Stage For Evaluation
- Implemented on top of a modified version of YARN
- Two types of experiments
○ “Real” experiment using a 64-GPU, 20-machine cluster
○ Simulation of a much larger system (256 GPUs)
- Baselines for each important measurement
○ Tiresias: ideal fairness
■ Uses LAS
○ Gandiva: ideal efficiency
○ Optimus: ideal time
○ SLAQ: ideal model quality
- Measured against three benchmarks
○ Finish-time fairness, GPU time, and Placement score
Evaluation of Fairness
- Fairness?
○ Tiresias “treats long apps and short apps as the same”
○ Themis is the only architecture to maintain max ρ values near 1
○ Does our fair-play mechanism actually work?
Evaluation of Time, Efficiency, and Placement
- Time/Efficiency?
○ Workload 1 had jobs that each required only one or two GPUs
- Placement?
○ Again, Themis and Gandiva are the only schedulers that pay attention to placement, so they do very well here
Summary and Discussion
- Pros:
○ Strives for fairness
○ Built on top of YARN
- Cons:
○ Moderate amount of overhead
○ Some amount of cherry-picking in baselines
- Themis presents itself as the most “fair” scheduling architecture
○ Chooses to define “fair” itself. Do we agree with the ρ definition?
○ Is there any other measure of fairness that we might want to use?
- It boasts a “Semi-optimistic” two-level design
○ Necessary to support partial auctions
○ Leases and partial auctions combat the effects of ML apps being so long-running
Summary and Discussion
- Pays attention to placement of apps
○ Leads to great speedup, but only for specific ML models
○ How does Themis’ placement compare to Gandiva’s?
- Requires a widening of the API between the apps and the scheduler
○ Are there any drawbacks to this?
○ Worth the extra overhead?
- Doesn’t mention fault tolerance at all
○ Also doesn’t mention inference… but what difference would this make?
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
Motivation
- Increased amount of data and scale of models has significantly increased
training time
- This has increased the popularity of using distributed systems for parallel
training
- In a shared deep learning cluster, efficient resource scheduling is key to maximizing utilization of expensive resources (e.g., GPUs and RDMA networks)
System Overview
- Designed for data-parallel deep learning tasks using the parameter server architecture
- Resource-performance models for each job
- Dynamic scheduling algorithm
○ Resource Allocation
○ Task Placement
- Scheduler integrated with Kubernetes
Deep Learning Model Training
- 1. Iterative
a. Model training is usually carried out in an iterative fashion
b. The dataset is divided into chunks, and chunks into mini-batches
c. One full pass over the dataset is an epoch
- 2. Trained until the model converges
a. Tens to hundreds of epochs
b. The performance metric stabilizes
Parameter Server Architecture
- Support for synchronous and asynchronous training
Synchronous vs. Asynchronous Training
Synchronous - the parameter server updates parameters after it collects gradients from all workers, before any worker moves on to the next step
Asynchronous - the parameter server updates its parameters every time it receives gradients from a worker; workers can progress to different steps
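A minimal sketch contrasting the two update disciplines (names and structure are ours):

```python
import numpy as np

# Sketch of the two parameter-server update rules described above.
# Gradients arrive one per worker; names and structure are illustrative.

class ParameterServer:
    def __init__(self, dim, n_workers, lr=0.1):
        self.params = np.zeros(dim)
        self.n_workers = n_workers
        self.lr = lr
        self._pending = []

    def push_sync(self, grad):
        """Synchronous: buffer gradients; update once all workers report."""
        self._pending.append(grad)
        if len(self._pending) == self.n_workers:
            self.params -= self.lr * np.mean(self._pending, axis=0)
            self._pending.clear()   # barrier: everyone proceeds together

    def push_async(self, grad):
        """Asynchronous: apply each worker's gradient as it arrives."""
        self.params -= self.lr * grad

ps = ParameterServer(dim=3, n_workers=2)
ps.push_sync(np.ones(3))   # no update yet: waiting on worker 2
ps.push_sync(np.ones(3))   # both arrived: one averaged update
ps.push_async(np.ones(3))  # applied immediately
print(ps.params)           # [-0.2 -0.2 -0.2]
```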
Modeling of DL Jobs - Steps to Reach Convergence
- The first model estimates the number of steps to reach convergence
- Using the fitted loss model and a convergence threshold, the total number of steps/epochs can be calculated, along with the number of steps remaining
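A minimal sketch of the online fitting step, assuming a loss model of the form l(k) = 1/(β₀k + β₁) + β₂ as we read the paper; the data here is synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

# Sketch of online loss-curve fitting: fit l(k) = 1/(b0*k + b1) + b2 to
# the (step, loss) points observed so far, then estimate how many steps
# remain until the loss falls below a convergence threshold.
# The functional form follows our reading of the paper; the data is fake.

def loss_model(k, b0, b1, b2):
    return 1.0 / (b0 * k + b1) + b2

np.random.seed(0)
steps = np.arange(1, 51)
observed = loss_model(steps, 0.05, 1.0, 0.1) + np.random.normal(0, 0.01, 50)

(b0, b1, b2), _ = curve_fit(loss_model, steps, observed, p0=[0.1, 1.0, 0.0])

threshold = 0.15                       # declare convergence below this loss
# Invert the model: l = 1/(b0*k + b1) + b2  =>  k = (1/(l - b2) - b1) / b0
k_conv = (1.0 / (threshold - b2) - b1) / b0
print(f"estimated steps to convergence: {k_conv:.0f}, "
      f"remaining: {max(0, k_conv - steps[-1]):.0f}")
```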
Performance Modeling - Training Speed
- The second model estimates training speed based on computation and communication patterns
- The model is trained on a sample set of data, with different combinations of parameter servers and workers
- A non-negative least squares (NNLS) solver is used to find the best-fitting parameters
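A minimal sketch of the NNLS fitting step using a simplified feature set; the real resource-to-speed model has more terms, and all numbers here are fake profiling samples:

```python
import numpy as np
from scipy.optimize import nnls

# Sketch of the NNLS fitting step: express per-step time as a linear
# combination of non-negative features of (workers w, parameter servers p)
# and solve for non-negative coefficients. The feature set here is a
# simplified stand-in, not the paper's exact resource-to-speed model.

def features(w, p):
    # compute term, PS communication term, worker fan-in term, constant
    return [1.0 / w, w / p, p / w, 1.0]

# Measured (w, p, seconds-per-step) samples from short profiling runs (fake).
samples = [(1, 1, 2.10), (2, 1, 1.35), (2, 2, 1.10),
           (4, 2, 0.85), (4, 4, 0.80), (8, 4, 0.75)]

A = np.array([features(w, p) for w, p, _ in samples])
b = np.array([t for _, _, t in samples])
theta, residual = nnls(A, b)           # theta >= 0 elementwise

def predicted_speed(w, p):
    """Training speed = 1 / predicted seconds-per-step."""
    return 1.0 / float(np.dot(features(w, p), theta))

print(theta, predicted_speed(8, 8))
```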
Optimus Dynamic Scheduling Algorithm
Resource Allocation and Task Placement
Resource Allocation Algorithm
- Jobs are sorted in order of their marginal gains
- The job with the largest marginal gain is selected, and either one worker or one parameter server is added, based on which adds more
- Marginal gains are then updated
*Continues until some resource in the cluster is used up, or the marginal gains of all jobs are negative (sketched below)
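A minimal sketch of this greedy loop; `marginal_gain` is a toy diminishing-returns stand-in for the fitted performance model:

```python
# Sketch of the greedy marginal-gain loop described above: repeatedly add
# one worker or one parameter server to the job with the largest estimated
# marginal gain, until resources run out or all gains are negative.

def marginal_gain(alloc, kind):
    """Toy diminishing-returns estimate of speedup from one more `kind`."""
    total = alloc["worker"] + alloc["ps"]
    bonus = 0.05 if kind == "worker" else 0.0   # toy: workers help a bit more
    return 1.0 / total - 0.1 + bonus

def allocate(jobs, total_slots):
    alloc = {j: {"worker": 1, "ps": 1} for j in jobs}   # minimal start
    used = 2 * len(jobs)
    while used < total_slots:
        # Find the (job, resource-kind) pair with the best marginal gain.
        g, j, k = max((marginal_gain(alloc[j], k), j, k)
                      for j in jobs for k in ("worker", "ps"))
        if g <= 0:
            break                       # all marginal gains negative: stop
        alloc[j][k] += 1
        used += 1
    return alloc

print(allocate(["j1", "j2"], total_slots=10))
```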
Task Placement
Goal: reduce training time by reducing the time spent on parameter exchange among workers
Algorithm steps:
- Sort servers in cluster based on current resource availability
- Place jobs in order of increasing resource demands (smallest jobs placed first)
- For each job, check if there are enough resources using first k servers
○ Increment k until there are enough servers to place the job
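A minimal sketch of this placement pass; the data shapes and single-dimension capacity are simplifications:

```python
# Sketch of the placement pass described above: sort servers by available
# resources, then place jobs smallest-first onto the fewest servers
# (smallest k) that can hold them, to keep each job's tasks close together.

def place(jobs, servers):
    """jobs: {job_id: demand}; servers: {server_id: free_capacity}."""
    placement = {}
    for job_id, demand in sorted(jobs.items(), key=lambda kv: kv[1]):
        ranked = sorted(servers.items(), key=lambda kv: kv[1], reverse=True)
        k, chosen, capacity = 0, [], 0
        # Grow k until the first k servers can hold the whole job.
        while capacity < demand and k < len(ranked):
            chosen.append(ranked[k][0])
            capacity += ranked[k][1]
            k += 1
        if capacity < demand:
            continue                    # not enough room anywhere: skip job
        placement[job_id] = chosen
        # Charge the demand against the chosen servers, fullest first.
        remaining = demand
        for sid in chosen:
            take = min(servers[sid], remaining)
            servers[sid] -= take
            remaining -= take
    return placement

servers = {"s1": 4, "s2": 4, "s3": 2}
jobs = {"small": 2, "big": 6}
print(place(jobs, servers))   # small fits on one server; big spans two
```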
Straggler Handling
Synchronous Training
- Stragglers can be a large bottleneck, because all workers must finish before moving on to the next step
- Detection: monitor the arrival time of each worker's gradients at the parameter server, and calculate the gap between arrivals
Asynchronous Training
- Goal: avoid stale parameters, which lead to an unstable training process
- Detection: monitor each worker's training speed
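A minimal sketch of both detection rules; the 2x threshold is our choice, not the paper’s:

```python
from statistics import median

# Sketch of straggler checks in the spirit of the detection rules above:
# flag a worker whose gradient-arrival gap (sync) or training speed (async)
# is far from the median of its peers.

def stragglers_sync(arrival_gaps, factor=2.0):
    """arrival_gaps: {worker_id: seconds between its last two pushes}."""
    med = median(arrival_gaps.values())
    return [w for w, gap in arrival_gaps.items() if gap > factor * med]

def stragglers_async(speeds, factor=2.0):
    """speeds: {worker_id: steps/sec}; slow workers cause stale gradients."""
    med = median(speeds.values())
    return [w for w, s in speeds.items() if s * factor < med]

print(stragglers_sync({"w1": 1.0, "w2": 1.1, "w3": 4.0}))   # ['w3']
print(stragglers_async({"w1": 10.0, "w2": 9.0, "w3": 3.0})) # ['w3']
```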
Dynamic Adjustment of Resources
- Model parameters are checkpointed when the number of workers or parameter servers assigned to a job changes
- The job is restarted from the checkpoint, and the parameter servers are redeployed
Evaluation
- Resources: 7 CPU Servers, 6 GPU servers
- Baseline: Compared to DRF Schedulers (Mesos/YARN) and Tetris Scheduler
Indicators of performance:
- JCT = average job completion time
- Makespan = total time between arrival of first job and completion of all jobs
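As a quick illustration, both metrics can be computed from (arrival, completion) timestamps (fake data):

```python
# Sketch of the two metrics: average job completion time (JCT) and
# makespan, computed from (arrival, completion) timestamps.

jobs = {"j1": (0, 40), "j2": (5, 25), "j3": (10, 70)}   # id: (arrive, done)

jct = sum(done - arrive for arrive, done in jobs.values()) / len(jobs)
makespan = max(done for _, done in jobs.values()) - \
           min(arrive for arrive, _ in jobs.values())

print(f"average JCT = {jct:.1f}, makespan = {makespan}")  # 40.0 and 70
```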
Performance
- Optimus reduced JCT by 2.4x and
makespan by 1.6x
- Experiments found a 2.5% scaling overhead for dynamic job adjustment
- Optimus can schedule 4,000 jobs
within 5 seconds on a cluster of 16,000 nodes
Performance - Utilization
- Optimus runs a small number of jobs
compared to DRF (a)
○ More resources != higher training speed
- Optimus has higher CPU utilization for
workers and parameter servers (b&c)
○ Utilizes resources more efficiently
Discussion
Strengths:
- Offers support for both resource allocation and task placement
- Support for both synchronous and asynchronous training
- Allows for dynamic shifting of resource allocation after jobs have started
- Prioritizes finishing jobs quickly; avoids job starvation
Weaknesses:
- Failures of workers during the training process are not addressed
- Checkpointing model parameters adds overhead
Compare Themis and Optimus
- Both papers perform scheduling at fixed time intervals
○ Auctioning vs Computing marginal gains (optimization)
- Themis emphasizes fairness/truth-telling, while Optimus optimizes resource allocation so that the maximum number of jobs complete
- Themis: ML, Optimus: DL
- Both pay attention to placement of jobs
Questions? Discussion?
Notes from Themis 1
Introduction: Existing scheduling algorithms for clusters do not optimize for GPU workloads, only for big-data workloads, which differ. Themis addresses many of these issues by supporting sharing incentive, Pareto efficiency, and envy-freedom, multiplexing the GPU cluster across ML applications. To capture the effect of long-running tasks and placement sensitivity, it introduces a new fairness metric, finish-time fairness: the ratio of the running time in a shared cluster of N apps to the running time when running alone on a 1/N cluster. The goal is to minimize the maximum finish-time fairness across all ML apps. This is done in two ways:

1. Widen the API between ML apps and the scheduler. Introduce the notion of a round-by-round auction where jobs bid for more allocation based on their finish-time fairness metric. The worse the number, the better their chance of getting a good GPU placement.
2. Two-level scheduling: a centralized inter-app scheduler at the bottom, and a narrow API to integrate with existing hyper-parameter tuning frameworks at the top level.

System Design: Go over the image at a high level and step through how jobs are auctioned and allocated. An app can have a single ML training job; that case is straightforward: you do the calculations to compute ρ. The multiple-ML-jobs case is when a hyper-parameter search is going on. They cover two stopping criteria: successive halving and performance-curve stopping.
Notes from Themis 2
Finish-Time Fair Allocation: Timely completion means that given N ML apps, they should not run slower on the shared cluster compared to a dedicated cluster with 1/N of the resources. This is the sharing incentive (SI) property.

Pareto efficiency (PE): a Pareto-efficient allocation is one where no app's allocation can be improved without hurting some other app.

Envy-freeness (EF): no app should prefer the resource allocation of another app.

In big-data analytics jobs, where task durations are short, redistribution of resources can happen quickly and often. Blindly applying the same approach to ML apps isn't wise, because jobs run for a much longer time, which can leave other jobs waiting in the queue for a long time. Another approach, Least Attained Service, redistributes resources based on leases; this avoids starvation but doesn't prioritize locality. ML models with many parameters may prefer better locality than models with fewer parameters, because of the overhead involved in synchronizing parameters and updating gradients. They argue that existing fair schemes violate SI, PE, and EF for ML apps, and ignore placement preference; that is why they introduce a new fairness metric and algorithm.

Finish-time fairness: To summarize, in Themis we propose a new finish-time fairness metric that captures fairness for long-running, placement-sensitive ML apps. To perform allocations, we propose a multi-round partial allocation auction that incentivizes truth-telling and provides Pareto-efficient, envy-free allocations. By filtering the apps considered in the auction, we maximize sharing incentive and hence satisfy all the properties necessary for fair sharing among ML applications.
Notes from Themis 3
Evaluation: Since bid values are affected by placement, Themis does well with globally optimal placement compared to Gandiva, which makes greedy, locally optimal decisions. Themis also had better ρ values than all of the benchmarks. This means its finish-time fairness was better than the other scheduling algorithms', and indicates that Themis would perform better as a scheduler. Themis offers better sharing incentive to apps and introduces more altruistic behavior for long apps, so shorter apps participate in the auction more. Other scheduling algorithms treat long- and short-duration apps the same, which leads to overall worse performance.

Regarding system overheads, computing bids and computing partial allocations don't take much time compared to a lease time of 10 minutes. However, allocating and relinquishing control over GPU resources does take a long time, at 5-10 seconds. In terms of cluster utilization, Themis does about the same as the other schedulers when apps are mostly compute-intensive; however, once network utilization increases, Themis does better than the others.

They conclude that there is a point where the fairness knob maxes out, because after a certain point only one app can participate in the auction (everything else has been filtered out). Also, smaller lease times promote better fairness, because apps can quickly be scheduled again. However, smaller lease times worsen system overhead, because during every one of these "context switches" a model checkpoint has to be generated and the app needs to communicate with HDFS.
Notes from Optimus 1
Introduction: Current schedulers are limited in that a job is submitted with resource requirements supplied by the job owner. These are static allocations: they cannot use new resources if they become available, and the jobs may not need all the resources allocated to them, reducing the resources available for other jobs. Optimus builds accurate performance models of the DL job while it runs online, calculated from the loss curve and resource utilization, and requires no knowledge of the model or of the hardware configuration of the cluster. Its goal is to minimize job completion time. (This goal may not be fair to other jobs that are also contending for resources. Perhaps all the short jobs finish quickly while one long job takes forever; the rate of jobs completing quickly is still high. This favors short jobs.)
Optimus Notes 2
Background and Motivation: This paper focuses on using convergence as the metric for measuring training time/completion of a DL job. In the parameter-server architecture, the parameters are split across multiple servers and the training data is split across multiple workers. Each worker computes gradients based on its subset of training data and pushes the gradients to the parameter servers. Once the parameter servers collect all of the gradients, they perform the update and send the new parameter values back to the workers. Optimus tries to maximally exploit varying runtime resource availability and adjust the number and placement of parameter servers and workers.

To perform online fitting (which can estimate time to job completion), after every training step they collect all points (k, l) and fit the loss curve (for SGD). The prediction error of this computation decreases as the number of training steps increases. They build a resource-to-speed model based on the number of workers, parameter servers, speed of the network, bandwidth, model size, etc. Training speed for a job depends on whether it uses asynchronous or synchronous training. It is important to note that for synchronous training, blindly increasing the number of parameter servers and workers doesn't decrease training time, because the mini-batch size will decrease, which can lead to underutilization of the GPU/CPU, and more workers means more synchronization-related communication.

Resource Allocation: Assign each job one worker and one parameter server. Sort all the jobs by their maximum marginal gain. Then, in that order, add either a worker or a parameter server based on which one adds more marginal gain. Keep doing this for each job until some resource in the cluster is used up, or until the marginal gain is negative. This process happens each scheduling interval. They add a little about job placement, saying they can reduce training time by specifically placing a job's parameter servers and workers on the same server. This would reduce training time, which indirectly changes the optimization problem and gives preference to colocated jobs.
Optimus Notes 3
System Implementation: Uses HDFS. They identify stragglers that can slow the system down and deploy a new worker. They also attempt to keep the load the same on all parameter servers. They checkpoint model parameters during each scheduling iteration.

Evaluation: They compare Optimus to a DRF (Dominant Resource Fairness) scheduler and to the Tetris scheduler, which prioritizes jobs with low duration, low resource demands, and low fragmentation. They use average job completion time (JCT) as their metric for performance, and also makespan: the time from the arrival of the first job to the completion of all jobs. Optimus seems to have a smaller throughput of jobs compared to DRF: Optimus prioritizes finishing jobs quickly, while DRF takes on as many jobs as it can and treats them all equally, which means a job may not complete as quickly as it would on Optimus.

Their conclusions: (1) Testbed experiments show that Optimus improves average job completion time and makespan by 139% and 63% compared to the fairness scheduler. Further, Optimus can scale to schedule 100,000 tasks on 16,000 nodes in 5 seconds, and its resource adjustment overhead is small, i.e., 2.54%. (2) Further improvement of estimation accuracy will not increase Optimus's performance much (15%), and Optimus performs better than DRF and Tetris under various workloads. (3) The resource allocation algorithm, the task placement scheme, and the parameter server load balancing algorithm contribute to Optimus's performance improvement by about 62%, 17%, and 20% respectively.