slide-1
SLIDE 1

Tiresias

Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang (Harry) Liu, Chuanxiong Guo

A GPU Cluster Manager for Distributed Deep Learning

slide-2
SLIDE 2
GPU Cluster for Deep Learning Training

  • Deep learning (DL) is popular (e.g., Google Lens, Siri)
  • 10.5× increase of DL training jobs in Microsoft
  • DL training jobs require GPUs
  • Distributed deep learning (DDL) training uses multiple GPUs
  • GPU clusters for DL training
  • 5× increase of GPU cluster scale in Microsoft [1]

[1]. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. https://arxiv.org/abs/1901.05758

How to efficiently manage a GPU cluster for DL training jobs?

slide-3
SLIDE 3

GPU Cluster Manager

[Figure: a scheduler takes N-GPU DL jobs from a job queue and, via a placement scheme, assigns them to free GPUs on a cluster of 4-GPU machines.]

Design Objectives

  • Minimize cluster-wide average job completion time (JCT)
  • Achieve high resource (GPU) utilization
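As a quick illustration of the first objective, average JCT is just the mean of per-job completion minus submission times (toy values, not from the talk):

```python
# Toy illustration (values are made up, not from the talk): a job's JCT is
# its completion time minus its submission time; the scheduler's goal is to
# minimize the cluster-wide average.

def avg_jct(jobs):
    """jobs: list of (submit_time, finish_time) pairs."""
    return sum(finish - submit for submit, finish in jobs) / len(jobs)

# Three jobs, all submitted at t=0, finishing at t=4, 10, and 16.
print(avg_jct([(0, 4), (0, 10), (0, 16)]))  # 10.0
```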


slide-5
SLIDE 5

Challenge Ⅰ: Unpredictable Training Time

  • Unknown execution time of DL training jobs
  • Job execution time is useful when minimizing JCT
  • Optimus [1] predicts job execution time from the smooth loss curves of DL training jobs
  • But it's hard to predict the training time of DL jobs in many cases

[Figure: normalized training loss (0.0 to 1.0) vs. progress, for DSSM, ResNext, and Seq2Seq, and for two jobs (Job1, Job2) whose loss curves are not smooth.]

[1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18


slide-9
SLIDE 9

Challenge Ⅱ: Over-Aggressive Job Consolidation

  • Network overhead in DDL training
  • Consolidated placement for good training performance
  • But consolidation leaves fragmented free GPUs in the cluster and causes longer queuing delay

[Figure: a 4-GPU job waits in the job queue because, although four GPUs are free across Machines 1 to 4, no single machine has four free GPUs.]
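The fragmentation problem above can be made concrete with a toy feasibility check (hypothetical helpers, not Tiresias code) contrasting consolidated placement with spreading a job across machines:

```python
# Toy feasibility check (hypothetical helpers, not Tiresias code).

def can_consolidate(free_gpus, demand):
    """True if some single machine can host the whole job."""
    return any(f >= demand for f in free_gpus)

def can_fit_spread(free_gpus, demand):
    """True if the job fits when its workers may span machines."""
    return sum(free_gpus) >= demand

# Four 4-GPU machines, each with exactly one free GPU: four GPUs are free in
# total, yet no single machine can host a 4-GPU job.
cluster = [1, 1, 1, 1]
print(can_consolidate(cluster, 4))  # False: the job queues, free GPUs idle
print(can_fit_spread(cluster, 4))   # True: it runs, but pays network overhead
```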

slide-10
SLIDE 10

Prior Solutions

                                     YARN-CS   Optimus [1]    Gandiva [2]
  I. Unpredictable Training Time
     (Scheduling)                    FIFO      Time-sharing   Trial-and-error
  II. Over-Aggressive Job
      Consolidation (Job Placement)  None      None           None

[1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18
[2]. Gandiva: Introspective Cluster Scheduling for Deep Learning, OSDI'18

slide-11
SLIDE 11

Tiresias

A GPU cluster manager for Distributed Deep Learning Without Complete Knowledge

  • I. Age-Based Scheduler
    Minimize JCT without complete knowledge of jobs
  • II. Model Profile-Based Placement
    Place jobs without additional information from users

slide-12
SLIDE 12

Challenge I

How To Schedule DL Training Jobs Without Complete Job Information?

slide-14
SLIDE 14

Temporal and Spatial Co-scheduling

Characteristics of DL Training Jobs

  • Variations in both temporal (job execution time) and spatial (# of GPUs) aspects

Scheduler should consider both temporal and spatial aspects of DL training jobs

[Figure: scatter of job execution time (10 to 10⁵ min, log scale) vs. number of GPUs (1 to 128).]


slide-16
SLIDE 16

Available Job Information

  • 1. Spatial: number of GPUs
  • 2. Temporal: executed time

[Figure: a job occupying GPUs G1 to G3 from t = 1 to 11; the x-axis span is its executed time, the y-axis its # of GPUs.]

slide-17
SLIDE 17

Age-Based Schedulers

  • Least-Attained Service (LAS) [1]
    • Prioritize the job that has the shortest executed time
  • Gittins Index policy [2]
    • Needs the distribution of job execution time
    • Prioritize the job that has the highest probability to complete in the near future

[Figure: a job's age (executed time) accumulating over t = 1 to 11 on GPUs G1 to G3.]

[1]. Feedback queueing models for time-shared systems. JACM, 1968
[2]. Multi-armed bandit allocation indices. Wiley, Chichester, 1989

slide-18
SLIDE 18


Two-Dimensional Age-Based Scheduler (2DAS)

  • Age calculated by two-dimensional attained service
    • i.e., a job's total executed GPU time (# of GPUs × executed time)
  • No prior information: 2D-LAS
  • With partial information (distribution of job GPU time): 2D-Gittins Index
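A minimal sketch of how 2D-LAS might rank jobs, using the definition above (age = # of GPUs × executed time); this is an illustration, not Tiresias's implementation:

```python
# Sketch of 2D-LAS ranking (an illustration of the rule above, not Tiresias's
# implementation): a job's age is its 2D attained service, and the job with
# the least attained service gets priority.

class Job:
    def __init__(self, name, num_gpus):
        self.name = name
        self.num_gpus = num_gpus
        self.executed_time = 0.0

    @property
    def attained_service(self):
        # Two-dimensional age: # of GPUs x executed time.
        return self.num_gpus * self.executed_time

def pick_next(jobs):
    """2D-LAS: run the job with the least 2D attained service."""
    return min(jobs, key=lambda j: j.attained_service)

j1, j2 = Job("J1", num_gpus=2), Job("J2", num_gpus=1)
j1.executed_time, j2.executed_time = 2.0, 3.0  # ages: 4.0 vs 3.0
print(pick_next([j1, j2]).name)  # J2
```

Note that a 1-GPU job that has run longer in wall-clock time can still be younger in GPU time than a briefly run multi-GPU job, which is exactly the two-dimensional aspect.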
slide-29
SLIDE 29

2D-Gittins Index: Partial Information

  • Higher probability to complete (higher Gittins index), higher priority
  • Partial information: the distribution of job GPU time, here (4, 8, 12)

  Job   # of GPUs   Duration   Total GPU time
  J1    2           2          4
  J2    1           8          8
  J3    2           6          12

All three jobs start with a Gittins index of 0.25. As a job accumulates attained service, its index is recomputed from the distribution: at 4 GPU-time units the index drops to 0.2, at 8 units to 0.125, and at 12 units the job has exhausted the distribution's support (N/A). The scheduler runs the job with the highest index and switches jobs as the indexes change; here J1 ends first, then J2, then J3.

[Figure: 2D-Gittins index value vs. attained service (1 to 12).]
[Figure: schedule timeline on GPUs G1 and G2 over t = 1 to 16, with the two job switches and the J1, J2, J3 completions marked.]

  Scheduler          Extra information       Avg. JCT
  2D-Gittins Index   GPU time distribution   10.0
  2D-LAS             None                    11.7
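The index computation in this example can be sketched as follows, assuming a uniform empirical distribution of total GPU time and taking the supremum over the distribution's support; the resulting values depend on these assumptions and need not match the slide's numbers exactly:

```python
# Sketch of a 2D-Gittins index computation (assumptions: uniform empirical
# distribution of total job GPU time; supremum taken over the support).
# Values depend on these assumptions and the service quantum, so they need
# not match the deck's numbers exactly.

def gittins(attained, dist):
    """Gittins index of a job with `attained` GPU time, given the empirical
    distribution `dist` of total job GPU times (uniform over the list)."""
    remaining = [s for s in dist if s > attained]
    if not remaining:
        return None  # the job exceeded every observed service time
    best = 0.0
    for delta in sorted({s - attained for s in remaining}):
        # P: probability of completing within delta; E: expected service paid.
        p = sum(1 for s in remaining if s - attained <= delta) / len(remaining)
        e = sum(min(s - attained, delta) for s in remaining) / len(remaining)
        best = max(best, p / e)
    return best

dist = [4, 8, 12]        # observed total GPU times of past jobs
print(gittins(0, dist))  # 0.125
print(gittins(8, dist))  # 0.25
```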

slide-30
SLIDE 30


Two-Dimensional Age-Based Scheduler (2DAS)

  • Age calculated by two-dimensional attained service
    • i.e., a job's total executed GPU time (# of GPUs × executed time)
  • No prior information: 2D-LAS
  • With partial information (distribution of job GPU time): 2D-Gittins Index
  • Fewer job switches through priority discretization: Discretized-2DAS
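A sketch of the discretization step, with illustrative thresholds (the queue boundaries here are made up, not Tiresias's defaults): jobs fall into a small number of priority queues split by GPU-time thresholds, so a job's priority, and hence preemption, changes only when it crosses a threshold.

```python
# Sketch of priority discretization (illustrative thresholds, not Tiresias's
# defaults): a continuous 2D age is mapped to a small number of queues.

def queue_index(attained_gpu_time, thresholds=(3600, 14400)):
    """Return the discretized queue (0 = highest priority).

    thresholds are cumulative GPU-time boundaries, e.g. 1 GPU-hour and
    4 GPU-hours (illustrative values)."""
    for i, t in enumerate(thresholds):
        if attained_gpu_time < t:
            return i
    return len(thresholds)  # lowest-priority queue

print(queue_index(600))     # 0: young job, highest priority
print(queue_index(7200))    # 1
print(queue_index(100000))  # 2
```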
slide-31
SLIDE 31

Discretized-2DAS


slide-32
SLIDE 32

Prior Solutions

                                     YARN-CS   Optimus [1]    Gandiva [2]      Tiresias
  I. Unpredictable Training Time
     (Scheduling)                    FIFO      Time-sharing   Trial-and-error  LAS + Gittins Index
  II. Over-Aggressive Job
      Consolidation (Job Placement)  None      None           None             ?

[1]. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18
[2]. Gandiva: Introspective Cluster Scheduling for Deep Learning, OSDI'18

slide-33
SLIDE 33

Challenge II

How to Place DL Jobs Without Hurting Training Performance?

slide-34
SLIDE 34
Characteristics of DL Models

  • Tensor size in DL models
  • Large tensors cause network imbalance and contention

[Figure: per-model total size (MB, 0 to 600) for VGG11, VGG16, VGG19, AlexNet, ResNet50, ResNet101, ResNet152, Inception3, Inception4, and GoogleNet.]

Consolidated placement is needed when the model is highly skewed in its tensor size

slide-35
SLIDE 35

Model Profile-Based Placement

The model profiler decides whether a job needs consolidation:

  • Consolidation? NO: ResNet50, Inception3, ResNet101, Inception4, GoogleNet, ResNet152
  • Consolidation? YES: VGG11, VGG16, AlexNet, VGG19
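A toy version of the profiler's decision rule, assuming skew is measured as the largest tensor's share of total model size and using an illustrative 0.5 threshold (the exact metric and threshold are not specified on this slide):

```python
# Toy consolidation decision (the skew metric and 0.5 threshold are
# illustrative assumptions, not Tiresias's exact rule): consolidate only
# when one tensor dominates the model's total size.

def needs_consolidation(tensor_sizes_mb, skew_threshold=0.5):
    """True if the largest tensor dominates the model's total size."""
    return max(tensor_sizes_mb) / sum(tensor_sizes_mb) > skew_threshold

# VGG11-like profile: one 392 MB tensor out of ~506 MB total -> consolidate.
print(needs_consolidation([392, 60, 30, 24]))  # True
# ResNet50-like profile: largest tensor 9 MB of ~97 MB total -> no need.
print(needs_consolidation([9] + [4] * 22))     # False
```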

slide-36
SLIDE 36

Tiresias

[Figure: architecture. A Central Master (Discretized-2DAS scheduler plus placement scheme) receives DL jobs (model, resource) and issues placement and preemption decisions to the GPU cluster; a network-level model profiler feeds the placement scheme.]

Evaluation: 60-GPU testbed experiment and large-scale, trace-driven simulation
slide-37
SLIDE 37

JCT Improvements in Testbed Experiment

  • Testbed – Michigan ConFlux cluster
    • 15 machines (4 GPUs each)
    • 100 Gbps RDMA network
  • Avg. JCT improvement (w.r.t. YARN-CS): 5.5×
  • Comparable performance to SRTF

[Figure: CDF of JCT (10 to 10⁵ seconds, log scale) for YARN-CS, SRTF, and Tiresias.]

slide-38
SLIDE 38
JCT Improvements in Trace-Driven Simulation

  • Discrete-time simulator
    • 10-week job trace from Microsoft
    • 2,000-GPU cluster
  • Avg. JCT improvement (w.r.t. Gandiva): 2×

[Figure: CDF of JCT (10² to 10⁷ seconds, log scale) for YARN-CS, SRTF, Gandiva, and Tiresias.]

slide-39
SLIDE 39

Tiresias

A GPU cluster manager for Distributed Deep Learning Without Complete Knowledge

  • Optimizes JCT with no or partial job information
  • Relaxes placement constraints without hurting training performance
  • Simple, practical, and with significant performance improvements

https://github.com/SymbioticLab/Tiresias

slide-40
SLIDE 40


slide-41
SLIDE 41

Time Overhead of Job Switch

[Figure: per-model preemption and resumption time overhead (seconds, 0 to 100) for VGG19, VGG16, VGG11, AlexNet, ResNet152, ResNet101, ResNet50, Inception4, Inception3, and GoogleNet.]

slide-42
SLIDE 42

DL Models

  Model        Total size (MB)   Largest tensor (MB)
  VGG19        548               382
  VGG16        527               392
  VGG11        506               392
  AlexNet      235               144
  ResNet152    230               9
  ResNet101    170               9
  ResNet50     98                9
  Inception4   163               6
  Inception3   91                8
  GoogleNet    27                4

slide-43
SLIDE 43

JCT in Testbed Experiment

[Figure: CDF of JCT (10 to 10⁵ seconds, log scale) for YARN-CS, SRTF, Tiresias-L, and Tiresias-G.]

slide-44
SLIDE 44

JCT Improvements in Testbed Experiment

[Figure: factor of JCT improvement (average and 95th percentile) for YARN-CS, SRTF, and Tiresias-G across Bins 1 to 4 and ALL; the largest factors are 27.7× and 23.4×.]

  Bin               % of Jobs
  1 (Small-Short)   63.5%
  2 (Small-Long)    12.5%
  3 (Large-Short)   16.5%
  4 (Large-Long)    7.5%

slide-45
SLIDE 45

GPU Utilization in Testbed Experiment

[Figure: CDF (fraction of time) of 10s-averaged GPU utilization (%) for YARN-CS, SRTF, Tiresias-G, and Tiresias-L.]

  • The makespan is improved by 1.21× (w.r.t. YARN-CS)

slide-46
SLIDE 46

Queuing Delay in Testbed Experiment

              Average   Median   95th
  YARN-CS     8146s     7464s    15327s
  SRTF        593s      32s      3133s
  Tiresias-G  1005s     39s      7933s
  Tiresias-L  963s      13s      7755s

slide-47
SLIDE 47

Training Performance in Testbed Experiment

  • Ratio of job training time when Tiresias-L runs with and without the placement scheme

[Figure: CDF (fraction of DDL jobs) of the ratio of job training time (0.8 to 1.8).]
slide-48
SLIDE 48

JCT in Trace-Driven Simulation

[Figure: CDF of JCT (10² to 10⁷ seconds, log scale) for YARN-CS, best-effort SRTF, Gandiva, Tiresias-G, and Tiresias-L.]

slide-49
SLIDE 49

JCT Improvements in Trace-Driven Simulation

              Average   Median   95th
  YARN-CS     2.41×     30.85×   1.25×
  SRTF        1.00×     1.00×    0.84×
  Gandiva     2.00×     2.59×    2.08×
  Tiresias-G  0.97×     1.00×    0.85×

slide-50
SLIDE 50

Sensitivity Analysis of 2D-LAS

  Norm. Avg. JCT w.r.t. (2, 1h) Tiresias-L, varying the threshold:
    0.5h: 1.03   1h: 1.00   2h: 1.00   4h: 1.00

  Norm. Avg. JCT w.r.t. (2, 1h) Tiresias-L, varying the number of queues:
    2: 1.00   3: 0.99   4: 0.99

  Norm. 95th JCT w.r.t. PromoteKnob = Inf, varying PromoteKnob:
    Inf: 1.000   8: 0.952   4: 0.947   2: 0.943   1: 0.936

slide-51
SLIDE 51

Sensitivity Analysis of 2D-Gittins Index

  Norm. Avg. JCT w.r.t. (2, 1h) Tiresias-G, varying the service quantum Δ:
    0.5h: 1.00   1h: 1.00   2h: 1.00   4h: 1.00

  Norm. Avg. JCT w.r.t. (2, 1h) Tiresias-G, varying the number of queues:
    2: 1.00   3: 1.00   4: 1.00

  Norm. Max JCT w.r.t. PromoteKnob = Inf, varying PromoteKnob:
    Inf: 1.00   8: 0.65   4: 0.65   2: 0.65   1: 0.65

slide-52
SLIDE 52

Gittins Index

  • P is the probability that job j can complete within Δ
  • E is the expected service (cost) for job j to complete within Δ
  • Δ is the next service quantum
  • P and E are calculated from the distribution of job GPU time

$$\mathrm{GI}_j(a_j) = \sup_{\Delta \ge 0} \frac{\Pr\left[\, S - a_j \le \Delta \mid S > a_j \,\right]}{\mathbb{E}\left[\, \min(S - a_j,\, \Delta) \mid S > a_j \,\right]}$$

where S is the job's total required GPU time and a_j is its attained service.
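One concrete evaluation of the Gittins index, assuming the uniform GPU-time distribution (4, 8, 12) from the earlier example and a job with attained service a_j = 8, so that only S = 12 remains possible:

```latex
% Assumed setup: uniform GPU-time distribution (4, 8, 12), attained
% service a_j = 8, so S = 12 is the only remaining possibility.
\mathrm{GI}_j(8)
  = \sup_{\Delta \ge 0}
    \frac{\Pr\left[\, S - 8 \le \Delta \mid S > 8 \,\right]}
         {\mathbb{E}\left[\, \min(S - 8,\, \Delta) \mid S > 8 \,\right]}
  = \frac{\Pr[\, 4 \le \Delta \,]}{\mathbb{E}[\, \min(4, \Delta) \,]}
  = \frac{1}{4}
```

For Δ < 4 the numerator is 0; for any Δ ≥ 4 the ratio is 1/4, so the supremum is 1/4. This value follows the standard formula and need not match the indexes shown earlier in the deck, which depend on the exact service quantum used.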