Tiresias
Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang (Harry) Liu, Chuanxiong Guo
A GPU Cluster Manager for Distributed Deep Learning
10.5× increase of DL training jobs in Microsoft [1]
[1]. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. https://arxiv.org/abs/1901.05758
How to efficiently manage a GPU cluster for DL training jobs?
[Images: Google Lens, Siri]
[Figure: GPU cluster of 4-GPU machines with free and occupied GPUs; a scheduler and a placement scheme serve N-GPU DL jobs from a job queue]
Design Objectives
Minimize cluster-wide average job completion time (JCT)
Achieve high resource (GPU) utilization
§ Unknown execution time of DL training jobs
§ Job execution time is useful when minimizing JCT
§ Predict job execution time
§ Use the smooth loss curve of DL training jobs (Optimus [1])
[Figure: training progress (0.0 to 1.0) curves for DSSM, ResNext, and Seq2Seq, and for Job1 and Job2]
[1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys’18
It’s hard to predict training time of DL jobs in many cases
§ Network overhead in DDL training
[Figure: four machines (Machine 1 to Machine 4) with free and occupied GPUs; a job queue of N-GPU jobs holding a 4-GPU job]
§ Consolidated placement for good training performance
§ Fragmented free GPUs in the cluster
§ Longer queuing delay
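The trade-off above can be sketched in a few lines. This is a minimal illustration, not the presenters' placement code: `place_consolidated` and the `free_gpus` layout are hypothetical names and numbers chosen for the example. A strictly consolidated job must wait even though the cluster has enough free GPUs in total.

```python
from typing import List, Optional


def place_consolidated(free_gpus: List[int], demand: int) -> Optional[int]:
    """Return the index of one machine that can host the whole job, or None.

    Among machines that fit, pick the one with the fewest free GPUs
    (best fit), an assumption made here to limit further fragmentation.
    """
    fitting = [(free, idx) for idx, free in enumerate(free_gpus) if free >= demand]
    if not fitting:
        return None
    return min(fitting)[1]


# Hypothetical cluster: four 4-GPU machines with 1, 2, 1, and 0 free GPUs.
free_gpus = [1, 2, 1, 0]

# Four GPUs are free in total, but no single machine has four,
# so a consolidated 4-GPU job stays in the queue.
print(sum(free_gpus), place_consolidated(free_gpus, 4))

# A 2-GPU job fits on the second machine (index 1).
print(place_consolidated(free_gpus, 2))
```

Relaxing consolidation for jobs that tolerate it is what removes this queuing delay.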
(Scheduling)
(Job Placement)
YARN-CS Optimus[1] Gandiva[2]
FIFO Time-sharing Trial-and-error None
[1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys’18
[2]. Gandiva: Introspective Cluster Scheduling for Deep Learning, OSDI’18
None None
A GPU cluster manager for Distributed Deep Learning Without Complete Knowledge
Minimize JCT without complete knowledge of jobs
Place jobs without additional information from users
§ Variations in both temporal and spatial aspects
[Figure: distributions of job execution time (min; axis from 10 to 10^5) and number of GPUs (1 to 128)]
Scheduler should consider both temporal and spatial aspects of DL training jobs
[Figure: timeline over time units 1 to 11 on GPUs G1, G2, and G3; axes: time and # of GPUs; annotation: age (executed time)]
[1]. Feedback queueing models for time-shared systems. JACM, 1968
[2]. Multi-armed bandit allocation indices. Wiley, Chichester, 1989
2D-Gittins index example (GPU time distribution: (4, 8, 12))

Job   # of GPUs   Duration
J1    2           2
J2    1           8
J3    2           6

Attained service / Gittins index in each successive snapshot:
Snapshot   J1         J2          J3
1          - / 0.25   - / 0.25    - / 0.25
2          4 / 0.2    - / 0.25    - / 0.25
3          4 / 0.2    4 / 0.2     - / 0.25
4          4 / 0.2    4 / 0.2     4 / 0.2
5          4 / 0.2    8 / 0.125   4 / 0.2
6          4 / 0.2    8 / 0.125   12 / N/A

[Figure: timeline over time units 1 to 16 on GPUs G1 and G2, annotated with J1 end, two job switches, J2 end, and J3 end]
[Figure: 2D-Gittins index value versus attained service]

Scheduler          Extra information       Avg. JCT
2D-Gittins Index   GPU time distribution   10.0
2D-LAS             None                    11.7
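The 10.0 average JCT of the 2D-Gittins example can be checked with a small simulation. This is a sketch under stated assumptions, not the authors' scheduler: the cluster has two GPUs (G1 and G2), all jobs arrive at time 0, ties keep submission order, a job that does not fit is skipped for that time unit, and the index is a step lookup read off the example tables (0.25 below 4 units of attained service, 0.2 from 4 up to 8, 0.125 from 8). The names `Job`, `index_from_example`, and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    gpus: int
    duration: int                  # time units needed on all requested GPUs
    executed: int = 0              # time units executed so far
    finish: Optional[int] = None   # completion time, set when the job ends

    @property
    def attained(self) -> int:
        # 2D attained service: number of GPUs x executed time
        return self.gpus * self.executed


def index_from_example(attained: int) -> float:
    """Step lookup of the index values shown in the example (assumption)."""
    if attained < 4:
        return 0.25
    if attained < 8:
        return 0.2
    return 0.125


def simulate(jobs: List[Job], total_gpus: int = 2) -> Dict[str, int]:
    t = 0
    while any(j.finish is None for j in jobs):
        active = [j for j in jobs if j.finish is None]
        # Highest index first; sort is stable, so ties keep submission order.
        active.sort(key=lambda j: -index_from_example(j.attained))
        free = total_gpus
        running = []
        for j in active:
            if j.gpus <= free:     # greedy: skip jobs that do not fit
                running.append(j)
                free -= j.gpus
        for j in running:
            j.executed += 1
            if j.executed == j.duration:
                j.finish = t + 1
        t += 1
    return {j.name: j.finish for j in jobs}


jcts = simulate([Job("J1", 2, 2), Job("J2", 1, 8), Job("J3", 2, 6)])
print(jcts)                              # {'J1': 2, 'J2': 12, 'J3': 16}
print(sum(jcts.values()) / len(jcts))    # 10.0
```

Under these assumptions J1 ends at time 2, J2 at 12, and J3 at 16, which matches the 16-unit timeline and the two job switches.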
Discretized-2DAS
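One way to read "discretized" is as a small set of priority queues split by attained-service thresholds, which is what the "Number of queues" and "Threshold" knobs on the sensitivity slides suggest. The sketch below assumes that reading; `queue_index`, the GPU-hour unit, and the threshold values are hypothetical and not taken from the slides.

```python
from bisect import bisect_right
from typing import Sequence


def attained_service(num_gpus: int, executed_hours: float) -> float:
    """2D attained service: number of GPUs x executed time (GPU-hours here)."""
    return num_gpus * executed_hours


def queue_index(service: float, thresholds: Sequence[float]) -> int:
    """Map attained service to a queue; 0 is the highest-priority queue.

    `thresholds` must be sorted ascending. A job drops one queue each time
    its attained service reaches the next threshold.
    """
    return bisect_right(thresholds, service)


# Hypothetical setting: two queues split at 1 GPU-hour.
thresholds = [1.0]
print(queue_index(attained_service(1, 0.5), thresholds))  # 0: still in the top queue
print(queue_index(attained_service(4, 0.5), thresholds))  # 1: 2 GPU-hours, demoted
```

Because priorities change only when a threshold is crossed, jobs are preempted far less often than with a continuously updated priority.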
(Scheduling)
(Job Placement)
YARN-CS Optimus[1] Gandiva[2]
Tiresias
FIFO Time-sharing Trial-and-error None
None None
Gittins Index
(Scheduling)
(Job Placement)
YARN-CS Optimus[1] Gandiva[2]
Tiresias
FIFO Time-sharing Trial-and-error None ?
None None LAS
[Figure: tensor sizes (MB; axis from 100 to 600) of VGG11, VGG16, VGG19, AlexNet, ResNet50, Inception3, ResNet101, ResNet152, Inception4, and GoogleNet]
Consolidated placement is needed when the model is highly skewed in its tensor size
Consolidation?
NO: ResNet50, Inception3, ResNet101, Inception4, GoogleNet, ResNet152
YES: VGG11, VGG16, AlexNet, VGG19
Model Profiler
[Figure: system architecture with a central master (Discretized-2DAS, placement scheme), a network-level model profiler, and the GPU cluster; labels: DL job (model, resource), placement, preemption]
60-GPU testbed experiment; large-scale and trace-driven simulation
JCT improvement (w.r.t. YARN-CS): 5.5×
Comparable performance to SRTF
[Figure: CDF of JCT (fraction of jobs versus JCT in seconds, log scale from 10 to 10^5) for YARN-CS, SRTF, and Tiresias]
JCT improvement (w.r.t. Gandiva): 2×
[Figure: CDF of JCT (fraction of jobs versus JCT in seconds, log scale from 10^2 to 10^7) for YARN-CS, SRTF, Gandiva, and Tiresias]
A GPU cluster manager for Distributed Deep Learning Without Complete Knowledge
https://github.com/SymbioticLab/Tiresias
[Figure: preemption and resumption time overhead (seconds; axis from 20 to 100) for VGG19, VGG16, VGG11, AlexNet, ResNet152, ResNet101, ResNet50, Inception4, Inception3, and GoogleNet]
Model        Total size (MB)   Largest tensor (MB)
VGG19        548               382
VGG16        527               392
VGG11        506               392
AlexNet      235               144
ResNet152    230               9
ResNet101    170               9
ResNet50     98                9
Inception4   163               6
Inception3   91                8
GoogleNet    27                4
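The model sizes listed above make the skew visible with a one-line ratio. The sketch below is an illustration only: the rule "largest tensor divided by total size" and the 0.5 cut-off are assumptions chosen for this example, not the profiler's actual criterion.

```python
# (total size MB, largest tensor MB), copied from the table of model sizes.
MODELS = {
    "VGG19": (548, 382),
    "VGG16": (527, 392),
    "VGG11": (506, 392),
    "AlexNet": (235, 144),
    "ResNet152": (230, 9),
    "ResNet101": (170, 9),
    "ResNet50": (98, 9),
    "Inception4": (163, 6),
    "Inception3": (91, 8),
    "GoogleNet": (27, 4),
}

SKEW_THRESHOLD = 0.5  # hypothetical cut-off, not from the slides


def needs_consolidation(total_mb: float, largest_mb: float,
                        threshold: float = SKEW_THRESHOLD) -> bool:
    """Flag a model whose largest tensor dominates its total size."""
    return largest_mb / total_mb > threshold


consolidate = sorted(
    name for name, (total, largest) in MODELS.items()
    if needs_consolidation(total, largest)
)
print(consolidate)  # ['AlexNet', 'VGG11', 'VGG16', 'VGG19']
```

With this cut-off the flagged models are the VGG family and AlexNet, whose largest tensor is more than 60% of the model; every other model stays below 15%.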
[Figure: CDF of JCT (fraction of jobs versus JCT in seconds, log scale from 10 to 10^5) for YARN-CS, SRTF, Tiresias-L, and Tiresias-G]
[Figure: factor of improvement (average and 95th percentile; axis from 1 to 8) for Bin 1 to Bin 4 and ALL; legend: YARN-CS, SRTF, Tiresias-G; bar labels 27.7 and 23.4]

Bin   Category      % of Jobs
1     Small-Short   63.5%
2     Small-Long    12.5%
3     Large-Short   16.5%
4     Large-Long    7.5%
[Figure: CDF of 10s-averaged GPU utilization (fraction of time versus utilization in %) for YARN-CS, SRTF, Tiresias-G, and Tiresias-L]
             Average   Median   95th
YARN-CS      8146s     7464s    15327s
SRTF         593s      32s      3133s
Tiresias-G   1005s     39s      7933s
Tiresias-L   963s      13s      7755s
[Figure: CDF of the ratio of job training time (fraction of DDL jobs versus ratio, axis from 0.8 to 1.8)]
[Figure: CDF of JCT (fraction of jobs versus JCT in seconds, log scale from 10^2 to 10^7) for YARN-CS, Best-effort, SRTF, Gandiva, Tiresias-G, and Tiresias-L]

             Average   Median   95th
YARN-CS      2.41×     30.85×   1.25×
SRTF         1.00×     1.00×    0.84×
Gandiva      2.00×     2.59×    2.08×
Tiresias-G   0.97×     1.00×    0.85×
[Figure: Tiresias-L sensitivity, three panels; bar values listed below]
Threshold (legend: (2, 1h)): 0.5h 1.03, 1h 1.00, 2h 1.00, 4h 1.00
Number of queues (legend: (2, 1h)): 2 1.00, 3 0.99, 4 0.99
PromoteKnob (legend: PromoteKnob = Inf): Inf 1.000, 8 0.952, 4 0.947, 2 0.943, 1 0.936
[Figure: Tiresias-G sensitivity, three panels; bar values listed below]
Service quantum Δ (legend: (2, 1h)): 0.5h 1.00, 1h 1.00, 2h 1.00, 4h 1.00
Number of queues (legend: (2, 1h)): 2 1.00, 3 1.00, 4 1.00
PromoteKnob (legend: PromoteKnob = Inf): Inf 1.00, 8 0.65, 4 0.65, 2 0.65, 1 0.65
$$ GI_J = \sup_{\Delta > 0} \frac{P(S - a_J \le \Delta \mid S > a_J)}{E[\min\{S - a_J, \Delta\} \mid S > a_J]} $$
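The Gittins index formula above can be evaluated directly on an empirical distribution of total service S. The sketch below is an illustration under assumptions: `gittins_index` is a hypothetical helper, S is taken as uniform over a list of past samples, and the sample values are invented for the example. For such a discrete distribution the ratio can only peak at a Δ equal to one of the remaining service values, so only those are tried.

```python
from typing import Optional, Sequence


def gittins_index(samples: Sequence[float], attained: float) -> Optional[float]:
    """Empirical Gittins index at attained service `attained`.

    `samples` are observed total service amounts S (assumed equally likely).
    Returns None when no sample exceeds `attained` (index undefined).
    """
    remaining = [s - attained for s in samples if s > attained]
    if not remaining:
        return None
    n = len(remaining)
    best = 0.0
    for delta in sorted(set(remaining)):
        # P(S - a <= delta | S > a)
        p_finish = sum(r <= delta for r in remaining) / n
        # E[min(S - a, delta) | S > a]
        expected_cost = sum(min(r, delta) for r in remaining) / n
        best = max(best, p_finish / expected_cost)
    return best


samples = [2, 4, 16]  # hypothetical service distribution
for a in (0, 2, 4, 16):
    print(a, gittins_index(samples, a))
```

With these samples the index is 0.2 at attained service 0, rises to 0.25 at 2, and falls to 1/12 at 4, so the index is not monotone in attained service the way an LAS priority is.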