Resource Management - Paige Calisi, Meghana Yadavalli, B Chase Babrich (PowerPoint PPT presentation)


SLIDE 1

Resource Management

Paige Calisi, Meghana Yadavalli, B Chase Babrich

SLIDE 2

Why is Resource Management Important?

  • Companies pay for time and resources
  • Important to understand workloads
  • Traditional big-data analytics workloads vary from DL jobs
  • GPUs have become the trend for high performance computing
  • Thousands of parallel floating-point units can be packed into a single chip
  • Makes parallelizing the same task very easy and optimizable
SLIDE 3

Key Challenges

  • Many data analytics frameworks
  • No one-size-fits-all solution
  • Fairness
  • Load balancing
  • Fault tolerance
  • Scalability
SLIDE 4

Existing Resource Schedulers

  • YARN

○ Introduced to relieve Hadoop of resource management and job scheduling
○ Takes a job and distributes it among slave nodes

  • Mesos

○ Resource offers - low demands pick first
○ Delegates scheduling to the framework - not centralized

  • Tetris

○ Packs tasks onto clusters based on requirements
○ Favors small-resource jobs

SLIDE 5

Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications

SLIDE 6

Problem Statement

  • GPU utilization for DL is different from traditional big-data analytics workloads

○ Hours to weeks vs. milliseconds to hours

  • Identify the constraints

1) GPUs are a monolithic resource that cannot be shared in a fine-grained manner across all users
2) Multi-tenant clusters
3) With respect to workload, DL frameworks utilize gang scheduling, which decreases the flexibility of scheduling
4) Synchronization of parameters -> locality

  • Identify implications for future schedulers
SLIDE 7

Project Philly

Study 3 Things:

1. Queueing delays
   1. Delay incurred by users waiting for their fair share of resources
   2. Waiting for locality constraints to be met
2. How GPU utilization is affected by placement decisions for distributed training jobs
   1. Distribution of individual jobs across servers, ignoring locality constraints, increasing synchronization overheads
   2. Colocation, or packing of different jobs on the same server, leads to contention for shared resources
3. Jobs might fail to complete successfully
   1. Programming errors early in the training process
   2. Failures due to cluster components happen later in training

SLIDE 8

System Overview

  • Agnostic to ML framework, all supervised learning tasks
  • Distributed training across GPUs, aggregated subset training results, perform synchronized updates
  • Multiple GPUs on a server (PCIe), multiple servers on a rack (RDMA), multiple racks (Ethernet)

  • Fair Scheduling
  • Collect logs over 3 main sources

○ YARN scheduler logs
○ stdout and stderr
○ Ganglia monitoring system

SLIDE 9

Analysis of Queueing Delays

  • 2 types of queuing delays:

1) Fair-share delay: a VC has used up its GPUs, so jobs wait for GPUs to become available
2) Fragmentation delay: large jobs are spread across many racks (low locality)

  • Jobs with more GPUs have a higher probability of longer queuing delays
  • Conclusion: the need for gang-scheduling and locality introduces fragmentation delay, so locality constraints sometimes need to be relaxed to mitigate delays

SLIDE 10

Analysis of GPU Utilization

  • GPU utilization is low across all jobs
  • Efficiency of allocated GPUs varies according to locality and colocation scenarios that could occur in the cluster
  • Observe if a particular job requires a disproportionate amount of host memory and isolate memory used by jobs colocated on the same server

SLIDE 11

Training Progress and Completion

  • Terminated jobs constituted 55% of GPU utilization
  • A large fraction of jobs spent time training for longer than necessary
  • User error is a big reason for job failure
  • Semantic errors increase with a higher number of GPUs because they need to communicate and synchronize model parameters

SLIDE 12

Lessons Learned

1) Schedulers should trade queueing delay for adhering to locality constraints
   a) Retry jobs without relaxing locality constraints
2) Aim to isolate jobs on dedicated servers and implement migration for defragmentation to support locality constraints
3) Early failures should be caught on a smaller pool of GPUs before being scheduled on larger clusters
   a) Lots of user errors can be caught without deploying on large clusters
   b) Classify errors and don't retry errors that won't pass (e.g. syntax errors)

SLIDE 13

Pros and Cons

Pros:

  • Explained different scheduling concerns and gave us a very broad understanding of how scheduling jobs affects runtime
  • Failure analysis section gives good insight into very easy ways to stop wasting GPU cycles
  • Highlights the importance of dynamically checking for loss convergence

Cons:

  • Didn't explain much about the role preemption plays in job completion
  • Flexible scheduling can lead to more time being spent saving model checkpoints
  • Didn't address scalability as an issue
SLIDE 14

Themis: Fair and Efficient GPU Cluster Scheduling

Themis image taken from https://en.wikipedia.org/wiki/Themis#/media/File:00 29MAN-Themis.jpg

SLIDE 15

Motivation

  • Two major problems with other scheduling algorithms:

○ Do not account for the long-running length of ML tasks
○ No attention is paid to the placement of the ML tasks
○ Example: DRF

  • Alright for big-data scheduling, but not for ML

○ Violates Pareto efficiency and envy-freedom
○ "Even with existing fair sharing schemes, we do find users frustrated with the inability to get their work done in a timely way…"

  • We would like to maximize sharing incentive (SI)

https://www.economicshelp.org/blog/glossary/pareto-efficiency/

SLIDE 16

Formalization of Time

  • ML App

○ One or more training jobs
■ Each job has several tasks that process a minibatch of data

  • GPU Time

■ 10 task GPU mins
■ 10*2 = 20 job GPU mins
■ 10*2*2 = 40 app GPU mins

  • Heterogeneity across apps

○ Analysis of workload traces from a large internet company

  • Can be mitigated with LAS

○ Least Attained Service
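The GPU-time accounting above can be sketched in a few lines. This is a minimal illustration of the slide's arithmetic (task GPU minutes roll up into job GPU minutes, which roll up into app GPU minutes); the function names are ours, not from the paper.

```python
def task_gpu_minutes(gpus: int, minutes: float) -> float:
    """GPU time attained by a single task: GPUs held times minutes held."""
    return gpus * minutes

def job_gpu_minutes(tasks) -> float:
    """A job's GPU time is the sum over its tasks' (gpus, minutes) pairs."""
    return sum(task_gpu_minutes(g, m) for g, m in tasks)

def app_gpu_minutes(jobs) -> float:
    """An app's GPU time is the sum over its jobs."""
    return sum(job_gpu_minutes(j) for j in jobs)

# The slide's example: each task holds 1 GPU for 10 minutes,
# 2 tasks per job, 2 jobs per app.
job = [(1, 10), (1, 10)]
app = [job, job]
```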

SLIDE 17

Attempts To Pay Attention To Time - Tiresias

  • Uses job completion time and GPU usage as measures of service
  • Implements a Least Attained Service (LAS) policy

○ Addresses starvation of jobs and therefore fairness

  • Does not encode GPU placement preferences of jobs

○ Treats all GPU configurations as absolute

Image taken from http://01greekmythology.blogspot.com/2014/06/teiresias.html

SLIDE 18

The Importance of Space

  • The placement of an app can heavily affect its performance

○ Again we see heterogeneity

  • LAS and DRF will not achieve efficiency due to these issues

○ Instance 1 violates SI
○ Instance 2 violates PE and EF

SLIDE 19

Attempts To Pay Attention To Space - Gandiva

  • Squeezes as much power out of GPUs as possible by exploiting the cyclic nature of SGD

○ Uses a greedy scheduling policy that continuously optimizes for cluster efficiency

  • Master scheduler assigns Docker containers as they become available

○ Scheduling policy is built around early feedback and optimizing for efficiency

  • Sets the theoretical groundwork for Themis

○ "The primary design goal of the Gandiva scheduler is to provide early feedback to jobs"
○ "Cluster level fairness is not a design goal in Gandiva"

SLIDE 20

The Themis Solution

  • Presented in two parts

○ (1) An auction mechanism that allows apps to bid for resources
■ "Partial Allocation auction" incentivizes truth telling
○ (2) Two-level scheduling architecture
■ Allows for hyper-parameter optimization

SLIDE 21

Key Ideas for Partial Allocation Auction

  • Finish-time fairness

○ SI achieved if ρ ≤ 1

  • Requires the app to be able to express a preference for each allocation

■ Wider interface between app and allocation engine

  • Hidden payment incentivizes truth-telling
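The metric itself is simple to state in code. This is a minimal sketch of finish-time fairness as described above: ρ is the app's estimated finish time on the shared cluster divided by its finish time on a dedicated 1/N cluster, and sharing incentive (SI) holds when ρ ≤ 1. Function names are illustrative.

```python
def finish_time_fairness(t_shared: float, t_independent: float) -> float:
    """rho = finish time sharing the cluster / finish time alone on a 1/N slice."""
    if t_independent <= 0:
        raise ValueError("independent finish time must be positive")
    return t_shared / t_independent

def has_sharing_incentive(rho: float) -> bool:
    """SI is achieved when the app is no slower shared than on its 1/N slice."""
    return rho <= 1.0
```

So an app estimated to finish in 90 minutes on the shared cluster, versus 100 minutes on a dedicated 1/N slice, has ρ = 0.9 and keeps its sharing incentive.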
SLIDE 22

Computation of ρ

  • Recall that ρ is calculated for every permutation of the available GPUs
  • This process is complicated by the presence of hyper-parameter optimization or early stopping

○ In this case, TSh is calculated differently, with a slowdown to account for system overhead

Rc = # of GPUs left in cluster

SLIDE 23

Multi-Round Auctions

  • Single-round does not guarantee SI (why?)
  • Auctions are triggered by leases ending
  • At each round, the 1-⨏ fraction of apps with the greatest ρ values are kept for the auction

○ Why do we do this?
○ What happens as we vary ⨏?
■ Fairness vs efficiency

  • Random allocation of resources leftover from hidden payments
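The per-round filtering step can be sketched as follows, assuming (as the slide suggests) that the 1-⨏ fraction of apps with the largest ρ, i.e. the most unfairly treated apps, are the ones admitted to the auction. The helper name and the tie-breaking are our own choices, not from the paper.

```python
import math

def filter_for_auction(rhos: dict, f: float) -> list:
    """Return the app ids admitted to this round's auction.

    rhos: app id -> current finish-time fairness value rho.
    f: the fairness knob in [0, 1); larger f admits fewer, worse-off apps
       (more fairness), smaller f admits more apps (more efficiency).
    """
    n_admit = max(1, math.ceil((1.0 - f) * len(rhos)))
    ranked = sorted(rhos, key=rhos.get, reverse=True)  # largest rho first
    return ranked[:n_admit]
```

Sweeping f up toward 1 shrinks the auction to only the most starved apps, which is the fairness-vs-efficiency trade-off the slide asks about.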

SLIDE 24

Themis Scheduling Architecture

  • Current architectures cannot support multi-round auctioning

○ E.g. Mesos, Omega
○ Entirely pessimistic or entirely optimistic

  • Themis has "semi-optimistic" concurrency control

○ Top-level offers are optimistic, bottom-level offers are pessimistic

SLIDE 25

Widening the API Between the Apps and Scheduler

  • A crucial aspect of Themis' architecture design is that it requires the app to be able to see all other resources but only use its own resources

○ Accomplished with this app/agent idea
○ Agents are able to see all resources and apps can only use their own resources

  • Allows the Agent the ability to interact with existing hyper-parameter optimizers

○ Introduces an overhead into the app writer's process
■ Negligible?

SLIDE 26

Setting the Stage For Evaluation

  • Implemented on top of a modified version of YARN
  • Two types of experiments

○ "Real" experiment using a 64-GPU, 20-machine cluster
○ Simulation of a much larger system (256 GPUs)

  • Baselines for each important measurement

○ Tiresias: ideal fairness
■ Uses LAS
○ Gandiva: ideal efficiency
○ Optimus: ideal time
○ SLAQ: ideal model quality

  • Measured against three benchmarks

○ Finish-time fairness, GPU time, and placement score

SLIDE 27

Evaluation of Fairness

  • Fairness?

○ Tiresias "treats long apps and short apps as the same"
○ Themis is the only architecture to maintain max ρ values near 1
○ Does our fair play mechanism actually work?

SLIDE 28

Evaluation of Time, Efficiency, and Placement

  • Time/Efficiency?

○ Workload 1 had jobs that each required only one or two GPUs

  • Placement?

○ Themis and Gandiva are the only ones that pay attention to placement, so they do very well here

SLIDE 29

Summary and Discussion

  • Pros:

○ Strives for fairness
○ Built on top of YARN

  • Cons:

○ Moderate amount of overhead
○ Some amount of cherry-picking in baselines

  • Themis presents itself as the most "fair" scheduling architecture

○ Chooses to define "fair" itself. Do we agree with the ρ definition?
○ Is there any other measure of fairness that we might want to use?

  • It boasts a "semi-optimistic" two-level design

○ Necessary to support partial auctions
○ Leases and partial auctions combat the effects of ML apps being so long-running

SLIDE 30

Summary and Discussion

  • Pays attention to placement of apps

○ Leads to great speedups, but only for specific ML models
○ How does Themis' placement compare to that of Gandiva?

  • Requires a widening of the API between the apps and the scheduler

○ Are there any drawbacks to this?
○ Worth the extra overhead?

  • Doesn't mention fault tolerance at all

○ Also doesn't mention inference… but what difference would this make?

SLIDE 31

Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

SLIDE 32

Motivation

  • Increased amount of data and scale of models has significantly increased training time
  • This has increased the popularity of using distributed systems for parallel training
  • In a shared deep learning cluster, efficient resource scheduling is key to maximizing utilization of expensive resources (e.g. GPUs and RDMA networks)

SLIDE 33

System Overview

  • Designed for data-parallel deep learning tasks, using parameter servers
  • Resource-performance models for each job
  • Dynamic scheduling algorithm

○ Resource allocation
○ Task placement

  • Scheduler integrated with Kubernetes
SLIDE 34

Deep Learning Model Training

  • 1. Iterative

a. Model training is usually carried out in an iterative fashion
b. Dataset divided into chunks, chunks divided into mini-batches
c. One full pass over the dataset is an epoch

  • 2. Trained until the model converges

a. Tens to hundreds of epochs
b. Performance metric stabilizes

SLIDE 35

Parameter Server Architecture

  • Support for synchronous and asynchronous training
SLIDE 36

Synchronous vs. Asynchronous Training

Synchronous - the parameter server updates parameters after it collects gradients from all workers, before any worker moves on to the next step

Asynchronous - the parameter server updates its parameters every time it receives gradients from a worker; workers can progress onto different steps
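The two update rules above can be contrasted with a toy single-process sketch. This is illustrative only (a real parameter server shards parameters and runs over the network); the class and method names are our own.

```python
class ParamServer:
    """Toy parameter server holding a flat list of parameters."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr
        self._pending = []  # gradients buffered for the synchronous barrier

    def push_async(self, grads):
        # Asynchronous: apply each worker's gradients immediately;
        # workers may be on different steps.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

    def push_sync(self, grads, n_workers):
        # Synchronous: buffer gradients and update only once all
        # workers for this step have reported, then release the barrier.
        self._pending.append(grads)
        if len(self._pending) == n_workers:
            avg = [sum(gs) / n_workers for gs in zip(*self._pending)]
            self.params = [p - self.lr * g for p, g in zip(self.params, avg)]
            self._pending.clear()
```

With `push_sync`, the first worker's gradients leave the parameters untouched until the second worker reports; with `push_async`, every push updates the parameters right away.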

SLIDE 37

Modeling of DL Jobs - Steps to Reach Convergence

  • A first model is used to estimate the number of steps to reach convergence
  • Using the fitted loss model and a convergence threshold, the total number of steps/epochs can be calculated, along with the number of steps remaining
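The inversion step can be sketched as follows. We assume a hyperbolic loss curve of the form l(k) = 1/(b0*k + b1) + b2, which is the general shape Optimus's online fitting targets for SGD; the coefficient names and the helper functions are our own illustration, not the paper's API.

```python
import math

def loss_at_step(k, b0, b1, b2):
    """Predicted training loss after k steps under the fitted curve."""
    return 1.0 / (b0 * k + b1) + b2

def steps_to_reach(target_loss, b0, b1, b2):
    """Smallest step count k with predicted loss <= target_loss.

    Inverts l(k) = 1/(b0*k + b1) + b2 for k. The number of steps
    *remaining* is this value minus the steps already taken.
    """
    if target_loss <= b2:
        raise ValueError("target is below the model's asymptotic loss b2")
    k = (1.0 / (target_loss - b2) - b1) / b0
    return max(0, math.ceil(k))
```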

SLIDE 38

Performance Modeling - Training Speed

  • A second model is used to estimate training speed based on computation and communication patterns
  • The model is trained on a sample set of data, with different combinations of parameter servers and workers
  • An NNLS solve is used to find the best-fitting parameters

SLIDE 39

Optimus Dynamic Scheduling Algorithm

Resource Allocation and Task Placement

SLIDE 40

Resource Allocation Algorithm

  • Jobs are sorted in order of their marginal gains
  • The job with the largest marginal gain is selected, and either one worker or one parameter server is added based on which term is larger
  • Marginal gains are then updated

*Continues until some resource in the cluster has been used up or the marginal gains of all jobs are negative
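The greedy loop above can be sketched as follows. `estimate_gain` is a stand-in for Optimus's performance model (it returns the estimated marginal gain of adding one worker or one parameter server to a job); the function signature and a single abstract resource "capacity" are simplifying assumptions of ours.

```python
def allocate(jobs, capacity, estimate_gain):
    """Greedy marginal-gain allocation, in the spirit of the slide.

    jobs: dict job_id -> {'workers': int, 'ps': int} (mutated in place).
    capacity: total resource units left in the cluster.
    estimate_gain(job, kind): marginal gain of adding one 'worker' or 'ps'.
    """
    while capacity > 0:
        # Find, over all jobs, the single best addition and its gain.
        best = None  # (gain, job_id, kind)
        for jid, job in jobs.items():
            for kind in ("worker", "ps"):
                g = estimate_gain(job, kind)
                if best is None or g > best[0]:
                    best = (g, jid, kind)
        if best is None or best[0] <= 0:
            break  # all marginal gains are negative: stop
        _, jid, kind = best
        jobs[jid]["workers" if kind == "worker" else "ps"] += 1
        capacity -= 1
    return jobs
```

Each scheduling interval would rerun this loop from the jobs' current allocations.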

SLIDE 41

Task Placement

Goal: reduce training time by reducing time spent on parameter exchange among workers

Algorithm steps:

  • Sort servers in the cluster based on current resource availability
  • Place jobs in order of increasing resource demands (smallest jobs placed first)
  • For each job, check if there are enough resources using the first k servers

○ Increment k until there are enough servers to place the job
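The steps above can be sketched with abstract capacity units. This is our illustration of the "smallest jobs first, smallest prefix of k servers" idea, not the paper's implementation; how demand maps to units is an assumption.

```python
def place_jobs(server_free, demands):
    """Place jobs on the fewest servers that cover their demand.

    server_free: list of free units per server (mutated as jobs land).
    demands: dict job_id -> required units.
    Returns job_id -> list of server indices used.
    """
    placement = {}
    for jid in sorted(demands, key=demands.get):  # smallest jobs first
        need = demands[jid]
        # Rank servers by current availability, most free first.
        order = sorted(range(len(server_free)), key=lambda i: -server_free[i])
        k, total = 0, 0
        while k < len(order) and total < need:  # grow k until demand is met
            total += server_free[order[k]]
            k += 1
        if total < need:
            continue  # not enough capacity anywhere; skip this job
        chosen = order[:k]
        remaining = need
        for i in chosen:
            take = min(server_free[i], remaining)
            server_free[i] -= take
            remaining -= take
        placement[jid] = chosen
    return placement
```

A small job thus lands on a single server when one has room, while a large job spreads over the smallest prefix of servers that covers it.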

SLIDE 42

Straggler Handling

Synchronous training:

  • Stragglers can be a large bottleneck because you have to wait for all workers to finish before moving on to the next step
  • Detection: monitor the arrival time of each worker's gradients on the parameter server and calculate the gap between arrivals

Asynchronous training:

  • Goal: avoid stale parameters, which lead to an unstable training process
  • Detection: monitor each worker's training speed
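The synchronous-case detection rule can be sketched as: record when each worker's gradients arrive at the parameter server and flag workers that lag the earliest arrival by more than a threshold. The threshold and function name are illustrative assumptions, not values from the paper.

```python
def find_stragglers(arrival_times, max_gap):
    """Flag workers whose gradient arrival lags the earliest by > max_gap.

    arrival_times: dict worker_id -> arrival timestamp (seconds) for
    the current synchronous step. Returns a sorted list of worker ids.
    """
    first = min(arrival_times.values())
    return sorted(w for w, t in arrival_times.items() if t - first > max_gap)
```

A flagged worker would then be replaced by deploying a new worker, as Optimus's notes describe.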
SLIDE 43

Dynamic Adjustment of Resources

  • Model parameters are checkpointed when the number of workers or parameter servers assigned to a job changes
  • The job is restarted from the checkpoint and the parameter servers are redeployed
SLIDE 44

Evaluation

  • Resources: 7 CPU Servers, 6 GPU servers
  • Baseline: Compared to DRF Schedulers (Mesos/YARN) and Tetris Scheduler

Indicators of performance:

  • JCT = average job completion time
  • Makespan = total time between arrival of first job and completion of all jobs
SLIDE 45

Performance

  • Optimus reduced JCT by 2.4x and makespan by 1.6x
  • Experiments found a 2.5% scaling overhead for dynamic job adjustment
  • Optimus can schedule 4,000 jobs within 5 seconds on a cluster of 16,000 nodes

SLIDE 46

Performance - Utilization

  • Optimus runs a small number of jobs compared to DRF (a)

○ More resources != higher training speed

  • Optimus has higher CPU utilization for workers and parameter servers (b & c)

○ Utilizes resources more efficiently

SLIDE 47

Discussion

Strengths:

  • Offers support for both resource allocation and task placement
  • Support for both synchronous and asynchronous training
  • Allows for dynamic shifting of resource allocation after jobs have started
  • Prioritizes finishing jobs quickly, avoids job starvation

Weaknesses:

  • Failures in the training process for workers not mentioned
  • Overhead of checkpointing model parameters
SLIDE 48

Compare Themis and Optimus

  • Both papers perform scheduling at fixed time intervals

○ Auctioning vs computing marginal gains (optimization)

  • Themis emphasizes fairness/truth-telling, while Optimus tries to optimize resource allocation so the maximum number of jobs can be completed
  • Themis: ML, Optimus: DL
  • Both pay attention to placement of jobs
SLIDE 49

Questions? Discussion?

SLIDE 50

Notes from Themis 1

Introduction: Existing scheduling algorithms for clusters do not optimize for GPU workloads, only big-data workloads, which differ. Themis serves to address a lot of these issues by supporting sharing incentive, Pareto efficiency, and envy-freedom. It multiplexes a GPU cluster across ML applications. To capture the effect of long-running tasks and placement sensitivity, it introduces a new fairness metric, finish-time fairness: the ratio of the running time in a shared cluster of N apps to the running time when running alone on a 1/N cluster. The goal is to minimize the maximum finish-time fairness across all ML apps. This is done in 2 ways:

1. Widen the API between ML apps and the scheduler. Introduce the notion of a round-by-round auction where jobs bid on more allocation based on their finish-time fairness metric. The worse the number, the better their chance of getting a good GPU placement.
2. Two-level scheduling
   1. Centralized inter-app scheduler at the bottom
   2. Narrow API to integrate with existing hyper-parameter tuning frameworks at the top level

System Design: Go over the image at a high level and step through how jobs are auctioned and allocated. An app can have a single ML training job; that one is straightforward: you do the calculations to compute ρ. The multiple-ML-jobs case is when a hyperparameter search is going on. They cover 2 stopping criteria: successive halving and performance-curve stopping.
SLIDE 51

Notes from Themis 2

Finish-Time Fair Allocation: Timely completion means that given N ML apps, they should not run slower on the shared cluster compared to a dedicated cluster with 1/N of the resources. This is the sharing incentive property. Pareto efficiency: a Pareto-efficient allocation is one where no app's allocation can be improved without hurting some other app. Envy-freeness: no app should prefer the resource allocation of another app.

In big-data analytics jobs where task durations are short, redistribution of resources can happen quickly and often. Blindly applying the same approach to ML apps isn't smart because jobs run for a much longer amount of time, which can mean other jobs are waiting in the queue for a long time. Another algorithm, Least Attained Service, redistributes resources based on leases; this avoids starvation but doesn't prioritize locality. ML models with lots of parameters may prefer better locality than models without so many parameters because of the overhead involved in synchronizing parameters and updating gradients. They make the argument that existing fair schemes violate SI, PE, and EF for ML apps, and ignore placement preference. That's why they introduce a new fairness metric and algorithm.

Finish-time fairness: To summarize, in THEMIS we propose a new finish-time fairness metric that captures fairness for long-running, placement-sensitive ML apps. To perform allocations, we propose using a multi-round partial allocation auction that incentivizes truth telling and provides Pareto-efficient, envy-free allocations. By filtering the apps considered in the auction, we maximize sharing incentive and hence satisfy all the properties necessary for fair sharing among ML applications.

SLIDE 52

Notes from Themis 3

Evaluation: Since bid values are affected by placement, Themis does well with globally optimal placement compared to Gandiva, which takes greedy, locally optimal decisions. Themis had better ρ values compared to all of the benchmarks as well. This means its finish-time fairness was better than under the other scheduling algorithms, and indicates that Themis would perform better as a scheduler. Themis offers better sharing incentive to apps and introduces a more altruistic behavior for long apps, while shorter apps participate in the auction more. Other scheduling algorithms treat long- and short-duration apps the same, which leads to overall worse performance.

Regarding system overheads, it looks like computing bids and computing partial allocations don't take too much time, compared to a lease time of 10 minutes. However, allocating and relinquishing control over GPU resources does take a long time, at 5-10 seconds. In terms of cluster utilization, Themis does about the same as the other schedulers when the apps are mostly compute-intensive. However, once network utilization increases, Themis does better than the others.

They concluded that there is a point where the fairness knob maxes out because, after a certain point, only one app can participate in the auction (because everything else has been filtered out). Also, smaller lease times promote better fairness because apps can quickly be scheduled again. However, smaller lease times are worse for system overhead because during every one of these "context switches", a model checkpoint has to be generated and the app needs to communicate with HDFS.

SLIDE 53

Notes from Optimus 1

Introduction: A current limitation of schedulers is that a job gets submitted with resource requirements supplied by the job owner. These are static allocations, which cannot use new resources if they become available; also, perhaps these jobs don't need all the resources that were allocated for them, reducing the available resources for other jobs. Optimus builds accurate performance models of the DL job while online, which it can calculate through the learning rate and resource utilization. Optimus requires no knowledge of the model or the hardware configuration of the cluster. Their goal is to minimize job completion time. (This goal may not be fair to other jobs that are also contending for resources. Perhaps all the short jobs finish quickly, and 1 long job takes forever; the rate of jobs completing quickly is still high. This favors short jobs.)

SLIDE 54

Optimus Notes 2

Background and Motivation: This paper focuses on using convergence as a metric to measure training time/completion of a DL job. In the parameter-server architecture, the parameters are split across multiple servers, and the training data is split across multiple workers. Each worker computes gradients based on its subset of training data and pushes the gradients to the parameter servers. Once the parameter servers collect all of the gradients, they perform the update and send the new parameter values back to the workers. Optimus tries to maximally exploit varying runtime resource availability and adjust the number and placement of parameter servers and workers.

To perform online fitting (which can estimate time to job completion), after every training step they collect all points (k, l) and fit the loss curve (for SGD). The prediction error for this computation decreases as the number of training steps increases. They build a resource-to-speed model based on the number of workers, parameter servers, speed of the network, bandwidth, model size, etc. Training speed in a job depends on whether they are using asynchronous or synchronous training. It is important to note that for synchronous training, blindly increasing the number of parameter servers and workers doesn't decrease training time, because the mini-batch size will decrease, which can lead to underutilization of the GPU/CPU, and more workers means more synchronization-related communication.

Resource Allocation: Assign one worker and one parameter server. Sort all the jobs based on their maximum marginal gain if they were to run like that. Then, based on that order, add either a worker or a parameter server based on which one would add to the marginal gain the most. Keep doing this for each job until some resource in the cluster is used up, or until the marginal gain is negative. This process happens each scheduling interval. They add a little bit about job placement and basically say they can reduce training time by specifically placing jobs (parameter servers and workers) on the same server. This would reduce training time, which would indirectly change the optimization problem and attempt to give preference to colocated jobs.

SLIDE 55

Optimus Notes 3

System Implementation: They use HDFS. They identify stragglers that can slow the system down and deploy a new worker. They also attempt to keep the load the same on all parameter servers. They checkpoint model parameters during each scheduling iteration.

Evaluation: They compare Optimus to a DRF (Dominant Resource Fairness) scheduler and the scheduler Tetris, which prioritizes jobs with low duration and resources, and low fragmentation. They use average job completion time (JCT) as their metric for performance. They also use makespan: the time from when the first job arrived to the completion of all jobs. It seems that Optimus has a smaller throughput of jobs compared to DRF. Optimus prioritizes finishing jobs quickly, while DRF wants to take on as many jobs as it can and treats them equally fairly, which means they may not complete as quickly as a job on Optimus.

Their conclusions: (1) Testbed experiments show that Optimus improves average job completion time and makespan by 139% and 63% compared to the fairness scheduler. Further, Optimus can scale to schedule 100,000 tasks on 16,000 nodes in 5 seconds, and its resource adjustment overhead is small, i.e., 2.54%. (2) Further improvement of estimation accuracy will not increase Optimus's performance much (15%), and Optimus performs better than DRF and Tetris under various workloads. (3) The resource allocation algorithm, the task placement scheme, and the parameter server load balancing algorithm contribute to Optimus's performance improvement by about 62%, 17%, and 20% respectively.