Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang
Deep Learning at a Large Enterprise
– Production workloads: Cortana, speech, image, ads, NLP, web search, …
– DL training jobs require large GPU clusters
– Philly: cluster manager for DL workloads on large shared GPU clusters
Recent Cluster Managers (motivated by observations in Philly)

            Optimus [EuroSys 18]   Gandiva [OSDI 18]   Tiresias [NSDI 19]
Objective   Average JCT            Consolidation       Average JCT
Scheduler   SRTF                   Time-sharing        Gittins Index
Significant increase in scale during 2017:
– 10.5× in DL training jobs
– 5× in GPU cluster size
Philly cluster manager
(Figure: N-GPU DL jobs wait in a job queue; the Philly scheduler & job placement component assigns each job to free GPUs on 4-GPU machines in the GPU cluster.)
75-day period from Oct. 2017 to Dec. 2017
– Total of 96,260 jobs across thousands of users
Track scheduling decisions and utilization info during the job lifecycle:
– Job arrival, GPU allocation, finish status
– GPU, CPU, memory utilization
– stderr/stdout for executed jobs
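As a concrete illustration, here is a minimal sketch of deriving GPU hours from such lifecycle records. The record fields below are hypothetical and chosen for illustration; the published trace schema may differ (see the philly-traces repository linked at the end).

```python
# Minimal sketch: GPU hours from a job lifecycle record.
# Field names ("start_time", "num_gpus", ...) are hypothetical,
# not necessarily the published philly-traces schema.
from datetime import datetime

def gpu_hours(record):
    """GPU time consumed by one job: #GPUs x wall-clock run time."""
    start = datetime.fromisoformat(record["start_time"])
    end = datetime.fromisoformat(record["end_time"])
    return record["num_gpus"] * (end - start).total_seconds() / 3600.0

job = {
    "job_id": "app_0001", "num_gpus": 8,
    "start_time": "2017-10-01T08:00:00",
    "end_time": "2017-10-01T20:00:00",
    "status": "Pass",  # e.g., Pass / Killed / Failed
}
print(gpu_hours(job))  # 8 GPUs x 12 hours = 96.0 GPU-hours
```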
(Figure: distributions of per-job GPU utilization for 1-, 4-, 8-, and 16-GPU jobs; x-axis: GPU utilization, 20–100%.)

Mean GPU utilization by job size:

            1-GPU Jobs   4-GPU Jobs   8-GPU Jobs   16-GPU Jobs
Mean (%)    64.7         59.2         51.6         44.8
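For illustration, a sketch of how per-job utilization could be aggregated from periodic per-GPU samples; the sample format here is an assumption, not the cluster's actual telemetry schema.

```python
# Sketch: per-job mean GPU utilization from periodic samples.
# Sample tuples (job_id, gpu_id, util_percent) are a hypothetical format.
from collections import defaultdict
from statistics import mean

samples = [
    ("job1", 0, 70.0), ("job1", 1, 55.0),
    ("job1", 0, 80.0), ("job1", 1, 60.0),
    ("job2", 0, 45.0),
]

per_job = defaultdict(list)
for job_id, _gpu, util in samples:
    per_job[job_id].append(util)

job_means = {j: mean(utils) for j, utils in per_job.items()}
print(job_means)  # {'job1': 66.25, 'job2': 45.0}
```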
GPU utilization is low! (Even lower in distributed training.) Two reasons:
– Dedicated servers → no other jobs on this server
– Distributed training itself causes utilization to go lower!
High locality (all of a job's GPUs on one machine):
– High communication efficiency
– Long queueing time

Relaxed locality (GPUs spread across machines):
– Low queueing time
– Contention in the use of the network
– Risk of intra-server interference (across jobs)

Relaxing locality constraints trades communication efficiency for shorter queueing (see the sketch below).
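A minimal sketch of such a policy, under assumed names and an assumed relaxation threshold (Philly's actual mechanism and parameters may differ): insist on consolidated placement first, then accept a spread placement once the job has queued long enough.

```python
# Illustrative locality-relaxation policy (names and threshold are
# assumptions, not Philly's actual scheduler code).
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    free_gpus: int

@dataclass
class Job:
    n_gpus: int

RELAX_AFTER_HOURS = 1.0  # hypothetical relaxation threshold

def try_schedule(job, machines, waited_hours):
    """Consolidated placement first; spread across machines only after
    the job has queued past the relaxation threshold."""
    # 1) High locality: all GPUs on a single machine.
    for m in machines:
        if m.free_gpus >= job.n_gpus:
            return {m.name: job.n_gpus}
    # 2) Relaxed locality: spread over machines, most-free first.
    if waited_hours >= RELAX_AFTER_HOURS:
        placement, remaining = {}, job.n_gpus
        for m in sorted(machines, key=lambda m: -m.free_gpus):
            take = min(m.free_gpus, remaining)
            if take > 0:
                placement[m.name] = take
                remaining -= take
            if remaining == 0:
                return placement
    return None  # keep queueing

machines = [Machine("m0", 2), Machine("m1", 2)]
print(try_schedule(Job(4), machines, waited_hours=0.5))  # None: keep waiting
print(try_schedule(Job(4), machines, waited_hours=2.0))  # {'m0': 2, 'm1': 2}
```

Queueing longer buys locality; the worked example on a later slide quantifies the same trade-off.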
(Figure: job lifecycle timeline from "Training started" to "Training completed".)
A job is unsuccessful if it repeatedly fails, wasting resources.
Average number of failures per job, by job size:

               1 GPU   2-4 GPU   5-8 GPU   >8 GPU
Failures/job   0.33    0.98      1.09      1.11

On average: one failure per distributed training job.
Failures can originate anywhere in the stack: user program, AI engine, infrastructure, or the resource scheduler.
Our study: classify failures into failure types and identify their utilization impact, to improve failure handling.
Failure classification methodology:
– Match stderr/stdout against (signature, failure category) pairs; >230 signatures
– Who: job & user ID
– Where: infrastructure? AI engine? user program?
– Cost in GPU hours: # of GPUs × time to failure
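A minimal sketch of signature-based classification, with illustrative regex signatures (the study uses over 230 real ones; the patterns and names below are assumptions):

```python
# Illustrative failure classifier (hypothetical signatures, not the
# actual >230 used in the study).
import re

SIGNATURES = [  # (regex over stderr/stdout, failure category)
    (re.compile(r"CUDA out of memory|CUDNN_STATUS_ALLOC_FAILED"), "GPU OOM"),
    (re.compile(r"MemoryError|Cannot allocate memory"), "CPU OOM"),
    (re.compile(r"SyntaxError"), "Syntax error"),
    (re.compile(r"No such file or directory"), "Incorrect inputs"),
]

def classify(stderr_text):
    """First matching signature wins; unmatched logs need manual review."""
    for pattern, category in SIGNATURES:
        if pattern.search(stderr_text):
            return category
    return "Unknown"

def gpu_hours_lost(num_gpus, hours_until_failure):
    # Cost of a failure: #GPUs x time to failure.
    return num_gpus * hours_until_failure

log = "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB"
print(classify(log))           # GPU OOM
print(gpu_hours_lost(8, 1.5))  # 12.0 GPU-hours wasted
```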
Reason: user errors in code or configuration; these are repetitive and appear early.
Failure categories by share of occurrences:

Category             % of failure occurrences
CPU OOM              31.5
Incorrect inputs     25.4
Semantic error        7.7
Invalid mem access    6.8
Syntax error          2.9
GPU OOM               1.3
Reason: infrastructure failures and semantic errors.
Failure categories by share of GPU hours until failure:

Category              % of total GPU hours
Incorrect inputs      24.2
Semantic error        17.6
Model ckpt error      16.3
MPI runtime failure   15.3
These failures are spread across many layers of the system stack.
The scheduler needs to consider:
1) The trade-off between queueing delay and locality-aware scheduling
2) Incorporating job migration
Example:

                Queueing   Run time   Total
Low locality    0 hours    24 hours   24 hours
High locality   1 hour     16 hours   17 hours

Waiting one hour for a high-locality placement finishes the job 7 hours sooner.
Reason: user errors in code or configuration. Simple validation before scheduling (e.g., a pre-run) avoids a majority of these failures; see the sketch below.
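A minimal sketch of what such a pre-run validation could look like, catching cheap failures (syntax errors, missing inputs) before the job occupies GPUs. The function names and checks are illustrative, not Philly's actual mechanism:

```python
# Illustrative pre-run validation (hypothetical checks, not Philly's
# actual mechanism): catch cheap failures before the job gets GPUs.
import os
import py_compile

def validate_job(script_path, input_paths):
    """Return a list of problems; an empty list means OK to schedule."""
    problems = []
    # Syntax errors: compile the training script without running it.
    try:
        py_compile.compile(script_path, doraise=True)
    except (py_compile.PyCompileError, OSError) as e:
        problems.append(f"Script problem: {e}")
    # Incorrect inputs: verify training data paths exist up front.
    for path in input_paths:
        if not os.path.exists(path):
            problems.append(f"Missing input: {path}")
    return problems

issues = validate_job("train.py", ["/data/train.tfrecord"])
print(issues or "OK to schedule")
```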
More in the paper:
– Queueing: fair-share delay vs. fragmentation delay; impact of out-of-order scheduling on job queueing
– Failures: full classification of failures with detailed statistics; how to mitigate failures by proactively analyzing them at runtime
– Early stopping: the opportunity to skip the last few epochs of training
Traces available! https://github.com/msr-fiddle/philly-traces
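A hedged sketch of loading the published job log, assuming a JSON file of job records; the file name and field names below are assumptions to verify against the repository's README:

```python
# Hedged sketch: load the Philly job log and count jobs by final status.
# File name and field name ("cluster_job_log", "status") are assumptions;
# check https://github.com/msr-fiddle/philly-traces for the real schema.
import json
from collections import Counter

with open("trace-data/cluster_job_log", "r") as f:
    jobs = json.load(f)

print(Counter(job.get("status", "Unknown") for job in jobs))
```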