Tiresias: A GPU Cluster Manager for Distributed Deep Learning (presentation transcript)


  1. Tiresias: A GPU Cluster Manager for Distributed Deep Learning
  Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang (Harry) Liu, Chuanxiong Guo

  2. GPU Cluster for Deep Learning Training
  • Deep learning (DL) is popular
    • 10.5× increase of DL training jobs in Microsoft
    • DL training jobs require GPUs (e.g., Google Lens, Siri)
  • Distributed deep learning (DDL) training uses multiple GPUs
  • GPU clusters for DL training
    • 5× increase of GPU cluster scale in Microsoft [1]
  How to efficiently manage a GPU cluster for DL training jobs?
  [1] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. https://arxiv.org/abs/1901.05758

  3. GPU Cluster Manager
  Design objectives:
  1. Minimize cluster-wide average job completion time (JCT)
  2. Achieve high resource (GPU) utilization
  [Figure: N-GPU DL jobs in a job queue feed a scheduler and a placement scheme, which assign them to free GPUs on 4-GPU machines in the GPU cluster]
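To make the first objective concrete, here is a toy example (mine, not from the talk): on a single shared GPU with all jobs arriving at time 0, the order in which queued jobs run changes the average JCT, which is why the scheduling policy matters.

```python
# Toy illustration of average JCT: jobs run back-to-back on one GPU.
# All jobs arrive at time 0, so a job's JCT equals its completion time.

def average_jct(durations, order):
    """Run jobs in the given order; return the average job completion time."""
    t = 0.0
    total = 0.0
    for i in order:
        t += durations[i]   # this job finishes at time t
        total += t
    return total / len(durations)

durations = [2, 8, 6]                                 # hypothetical job lengths
fifo = average_jct(durations, order=[0, 1, 2])        # completions 2, 10, 16
shortest_first = average_jct(durations, order=[0, 2, 1])  # completions 2, 8, 16
assert shortest_first < fifo  # running short jobs first lowers average JCT
```

Running the 2-unit job before the 8-unit job shaves the average from 28/3 to 26/3; the catch, as the next slides show, is that DL job durations are not known in advance.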

  4-5. Challenge I: Unpredictable Training Time
  • Unknown execution time of DL training jobs
    • Job execution time is useful when minimizing JCT
  • Predict job execution time?
    • Use the smooth loss curve of DL training jobs (Optimus [1])
  • It is hard to predict the training time of DL jobs in many cases
  [Figure: normalized training loss vs. progress; smooth curves for DSSM, ResNext, and Seq2Seq vs. erratic curves for Job 1 and Job 2]
  [1] Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18
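For context on why prediction works only for smooth curves, here is a rough sketch of curve-fit-based prediction (assumptions mine: the model form loss(k) ≈ 1/(a·k + b) and the grid-search fit are illustrative, not Optimus's actual solver):

```python
# Sketch: fit the observed loss curve with loss(k) = 1/(a*k + b),
# then invert the model to predict when a target loss will be reached.

def fit_inverse(losses):
    """Fit loss(k) = 1/(a*k + b) by grid search over (a, b)."""
    best, best_err = (None, None), float("inf")
    for a in (x / 100 for x in range(1, 200)):
        for b in (x / 10 for x in range(1, 100)):
            err = sum((1 / (a * k + b) - l) ** 2
                      for k, l in enumerate(losses, 1))
            if err < best_err:
                best, best_err = (a, b), err
    return best

def steps_to_reach(target_loss, a, b):
    """Invert 1/(a*k + b) = target  =>  k = (1/target - b) / a."""
    return (1 / target_loss - b) / a

# A smooth synthetic curve generated with a=0.5, b=1.0 is recovered well.
smooth = [1 / (0.5 * k + 1.0) for k in range(1, 21)]
a, b = fit_inverse(smooth)
```

On an erratic loss curve like Job 1 or Job 2 in the figure, the same fit produces a large residual and a meaningless extrapolation, which is the slide's point: prediction-based scheduling cannot be relied on in general.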

  6-9. Challenge II: Over-Aggressive Job Consolidation
  • Network overhead in DDL training
  • Consolidated placement (packing a job's GPUs onto as few machines as possible) gives good training performance
  • But consolidation fragments the free GPUs in the cluster
  • Result: longer queuing delay; e.g., a 4-GPU job waits even though 4 GPUs are free, because they are scattered one per machine
  [Figure: a 4-GPU job in the job queue; Machines 1-4 each hold a mix of occupied and free GPUs]
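The fragmentation problem above can be sketched with a simple placement check (my own toy model, not Tiresias code): a job that insists on consolidated placement must wait whenever the free GPUs are scattered, even if enough are free in total.

```python
import math

def can_place_consolidated(free, need, per_machine=4):
    """True if `need` GPUs fit on the minimal number of machines,
    i.e., ceil(need / per_machine), using the emptiest machines first."""
    k = math.ceil(need / per_machine)
    best = sorted(free, reverse=True)[:k]
    return len(best) == k and sum(best) >= need

# Fragmented cluster: 4 GPUs free in total, but only one per machine.
fragmented = [1, 1, 1, 1]
waits = not can_place_consolidated(fragmented, need=4)   # job must wait

# Compact cluster: one fully free machine fits the 4-GPU job.
compact = [4, 0, 0, 0]
runs = can_place_consolidated(compact, need=4)
```

Both clusters have four free GPUs, but only the compact one admits the job; the fragmented one forces queuing delay, which is the cost of over-aggressive consolidation.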

  10. Prior Solutions

              I. Unpredictable Training Time   II. Over-Aggressive Job Consolidation
              (Scheduling)                     (Job Placement)
  Optimus [1] None                             None
  YARN-CS     FIFO                             None
  Gandiva [2] Time-sharing                     Trial-and-error

  [1] Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, EuroSys'18
  [2] Gandiva: Introspective Cluster Scheduling for Deep Learning, OSDI'18

  11. Tiresias: A GPU Cluster Manager for Distributed Deep Learning, Without Complete Knowledge
  1. Age-based scheduler: minimize JCT without complete knowledge of jobs
  2. Model profile-based placement: place jobs without additional information from users

  12. Challenge I: How to schedule DL training jobs without complete job information?

  13-14. Characteristics of DL Training Jobs: Temporal and Spatial Co-scheduling
  • Variations in both temporal (execution time) and spatial (number of GPUs) aspects
  • The scheduler should therefore consider both temporal and spatial aspects of DL training jobs
  [Figure: scatter plot of number of GPUs (1 to 128) vs. job execution time (10 to 10^5 minutes), showing wide spread in both dimensions]

  15-16. Available Job Information
  1. Spatial: number of GPUs
  2. Temporal: executed time so far
  [Figure: timeline of jobs G1, G2, G3 with different GPU counts; each job's executed time is known, but its remaining time is not]

  17. Age-Based Schedulers
  • Least-Attained Service (LAS) [1]
    • Prioritize the job with the shortest executed time
  • Gittins index policy [2]
    • Needs the distribution of job execution times
    • Prioritize the job with the highest probability of completing in the near future
  [Figure: age (executed time) of jobs G1-G3 on a timeline; time beyond the current point is unknown]
  [1] Feedback queueing models for time-shared systems. JACM, 1968
  [2] Multi-armed bandit allocation indices. Wiley, Chichester, 1989
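A minimal single-resource LAS simulation (my sketch, assuming preemptive time-slicing with a fixed quantum) shows how the policy favors short jobs without ever knowing their durations:

```python
# LAS sketch: at every quantum, run the job with the least attained service.
# Durations are used only to detect completion, never to pick the next job.

def las_schedule(remaining, quantum=1):
    """Simulate LAS on one resource; return each job's completion time."""
    attained = {j: 0 for j in remaining}
    done = {}
    t = 0
    while attained:
        # Pick the job with the least attained service (ties broken by id).
        j = min(attained, key=lambda x: (attained[x], x))
        run = min(quantum, remaining[j])
        t += run
        attained[j] += run
        remaining[j] -= run
        if remaining[j] == 0:
            done[j] = t
            del attained[j]
    return done

completions = las_schedule({"A": 2, "B": 8})
# The 2-unit job A overtakes the long job B and finishes early.
```

Here A finishes at t=3 and B at t=10: LAS approximates shortest-job-first using only each job's past, which is exactly the information the previous slide says is available.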

  18. Two-Dimensional Age-Based Scheduler (2DAS)
  • Age calculated by two-dimensional attained service, i.e., a job's total executed GPU time (# of GPUs × executed time)
  • No prior information → 2D-LAS
  • With partial information (distribution of job GPU times) → 2D-Gittins index
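The two-dimensional age is simple to compute; this small sketch (job values are hypothetical) shows how 2D-LAS ranks jobs by GPU time rather than by wall-clock time or GPU count alone:

```python
# 2D attained service: GPU time = number of GPUs * executed time.
# Under 2D-LAS, the job with the smallest GPU time has the highest priority.

def gpu_time(num_gpus, executed_time):
    return num_gpus * executed_time

jobs = {
    "J1": (2, 3),  # 2 GPUs for 3 time units -> 6 GPU-time units
    "J2": (1, 8),  # 1 GPU  for 8 time units -> 8 GPU-time units
}
priority_order = sorted(jobs, key=lambda j: gpu_time(*jobs[j]))
# J1 comes first: 6 < 8, even though J2 occupies fewer GPUs.
```

Ranking by the product captures both the temporal and the spatial dimension from slide 13-14 in a single scalar.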

  19-25. 2D-Gittins Index: Partial Information
  • The higher a job's probability of completing soon (its Gittins index), the higher its priority
  • Example: three jobs, with a known GPU-time distribution of (4, 8, 12)

  Job  # of GPUs  Duration  Attained service  Gittins index
  J1   2          2         0                 0.25
  J2   1          8         0                 0.25
  J3   2          6         0                 0.25

  • As a job runs, its attained service grows and its Gittins index changes; the scheduler switches jobs whenever another job's index becomes the highest (e.g., after J1 ends)
  [Figure: Gittins index value vs. attained service for the (4, 8, 12) distribution; timeline of jobs on GPUs G1 and G2 showing a job switch after J1 ends]

  Scheduler          Extra information        Avg. JCT
  2D-Gittins index   GPU-time distribution    10.0
  2D-LAS             None                     11.7
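One standard rate form of the Gittins index for a discrete service distribution can be sketched as below (a sketch of the general technique, not Tiresias's implementation; the 0.25 values on the slide may come from a different quantum or normalization than this form produces):

```python
# Gittins index rate for attained service a over equally likely total
# GPU-time values:  G(a) = max over delta of
#   P(a < S <= a + delta) / E[min(S - a, delta) | S > a]

def gittins_index(dist, a):
    """dist: equally likely total GPU-time values; a: attained GPU time."""
    tail = [s for s in dist if s > a]  # jobs still running after a
    if not tail:
        return 0.0
    best = 0.0
    for delta in sorted({s - a for s in tail}):
        p_complete = sum(1 for s in tail if s - a <= delta) / len(tail)
        e_service = sum(min(s - a, delta) for s in tail) / len(tail)
        best = max(best, p_complete / e_service)
    return best

dist = (4, 8, 12)               # GPU-time distribution from the slides
early = gittins_index(dist, 0)  # far from any completion point
late = gittins_index(dist, 7)   # one unit away from the 8-unit point
assert late > early             # jobs near a likely completion get priority
```

The index rises as a job approaches a probable completion point in the distribution, which is why 2D-Gittins can finish near-done jobs first and beat 2D-LAS (10.0 vs. 11.7 average JCT in the slide's example).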
