

  1. DCUDA: Dynamic GPU Scheduling with Live Migration Support
     Fan Guo [1], Yongkun Li [1], John C.S. Lui [2], Yinlong Xu [1]
     [1] University of Science and Technology of China
     [2] The Chinese University of Hong Kong

  2. Outline
     1. Background & Problems
     2. DCUDA Design
     3. Evaluation
     4. Conclusion

  3. GPU Sharing and Scheduling
     - GPUs are underloaded without sharing
       - A server may contain multiple GPUs
       - Each GPU contains thousands of cores
     - GPU sharing allows multiple applications to run concurrently on one GPU
     - GPU scheduling is necessary for load balance and GPU utilization
     [Figure: applications issue calls through an API frontend; a scheduler at the API backend assigns them across GPU 1 ... GPU N]

  4. Current Scheduling Schemes
     - Current schemes are "static"
       - Round-robin, prediction-based, least-loaded
       - They assign applications to GPUs only once, before running them
     - State of the art: least-loaded scheduling
       - Assign a new application to the GPU with the least load (see the sketch below)
     [Figure: a new application arrives at the scheduler, which forwards its API calls to the least-loaded of GPU 1 ... GPU N]
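For reference, the baseline reduces to picking the minimum-load device. A minimal sketch, assuming the scheduler keeps a per-GPU load counter (the bookkeeping itself is hypothetical and not from the slides):

```c
/* Minimal sketch of the least-loaded baseline: assign a new application
 * to the GPU with the smallest current load estimate. The load[] array
 * is hypothetical bookkeeping, not from the slides. */
#define NGPU 4

int least_loaded_gpu(const long load[NGPU]) {
    int best = 0;
    for (int g = 1; g < NGPU; g++)
        if (load[g] < load[best])
            best = g;
    return best;  /* the new app is bound to this GPU before it runs */
}
```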

  5. Limitations of Static Scheduling
     - Load imbalance (under least-loaded scheduling)
       - The fraction of time in which at least one GPU is overloaded while some other GPU is underloaded reaches up to 41.7% (overloaded: demand > GPU cores)

  6. Limitations of Static Scheduling
     - Why does static scheduling result in load imbalance?
     - Applications are assigned before running
       - Hard to get the exact resource demand
       - The assignment is not optimal
     - No migration support
       - No way to adjust online

  7. Limitations of Static Scheduling
     - Fairness issue caused by contention
       - Applications with low resource demand may be blocked by those with high resource demand
       - May exist even with load-balancing schemes
     - Energy inefficiency
       - Compacting multiple small jobs onto one GPU saves energy
     [Figure: energy consumption (J) of single vs. concurrent (2-app.) execution for Triad, Kmeans, Mnist_mlp, BFS, Autoencoder, Sort, Reduction, and cifar10]

  8. Our Goal
     - Design a scheduling scheme that achieves better load balance, energy efficiency, and fairness
     - Key idea of DCUDA:
       - Dynamic scheduling: schedule after running, with fairness and energy awareness
       - Online migration: migrate running applications, not executing kernels

  9. Outline
     1. Background & Problems
     2. DCUDA Design
     3. Evaluation
     4. Conclusion

  10. Overall Design
     - DCUDA is implemented on top of the API-forwarding framework (see the interception sketch below)
     - Three key modules at the backend:
       - Monitor: tracks GPU utilization and each application's resource demand
       - Scheduler: load balance, energy efficiency, fairness
       - Migrator: migrates running applications
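As a rough illustration of API forwarding (a sketch under assumptions, not DCUDA's actual code), a frontend can interpose on CUDA runtime calls with an LD_PRELOAD shim; the forwarding step is indicated only by a comment:

```c
/* Hypothetical API-forwarding shim: preloaded before libcudart so that
 * kernel launches pass through the frontend on their way to the GPU. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

typedef cudaError_t (*launch_fn)(const void *, dim3, dim3,
                                 void **, size_t, cudaStream_t);

cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem,
                             cudaStream_t stream) {
    static launch_fn real_launch = NULL;
    if (!real_launch)  /* resolve the real implementation once */
        real_launch = (launch_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");
    /* A backend would receive the launch parameters here, so the
     * scheduler can account for (and later migrate) this application. */
    return real_launch(func, gridDim, blockDim, args, sharedMem, stream);
}
```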

  11. The Monitor
     - Resource demand of each application
       - GPU cores and GPU memory
       - Key challenge: monitoring must be lightweight
     - Demand on GPU cores
       - Existing tool (nvprof): large overhead, since it replays API calls
       - Optimization: track info only when the kernel function is called (via a timer function); estimate only at the first call, from the parameters of the intercepted API (#blocks, #threads), and reuse the recorded info next time
       - Rationale: GPU applications are iteration-based
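A minimal sketch of this first-call estimation, assuming the shim above hands us the launch parameters (the cache layout and the estimate_core_demand name are illustrative, not from the paper):

```c
/* Hypothetical first-call estimation: compute a kernel's thread demand
 * from its launch geometry once, then reuse the cached value, which is
 * reasonable because GPU applications are iteration-based. */
#include <cuda_runtime.h>

#define MAX_KERNELS 1024

struct KernelDemand { const void *func; long threads; };
static struct KernelDemand cache[MAX_KERNELS];
static int n_cached = 0;

long estimate_core_demand(const void *func, dim3 grid, dim3 block) {
    for (int i = 0; i < n_cached; i++)      /* seen before: reuse info */
        if (cache[i].func == func)
            return cache[i].threads;
    long threads = (long)grid.x * grid.y * grid.z     /* #blocks */
                 * block.x * block.y * block.z;       /* x #threads */
    if (n_cached < MAX_KERNELS)
        cache[n_cached++] = (struct KernelDemand){ func, threads };
    return threads;
}
```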

  12. The Monitor
     - Demand on GPU memory
       - Easy to know how much memory is allocated, but not all allocated memory is used
     - How to detect actual usage?
       - Pointer check with cuPointerGetAttribute() plus sampling (see the sketch below)
       - False negatives (used memory that goes undetected) are covered by on-demand paging with unified memory support
     - Estimation of GPU utilization
       - Periodically scan the resource demands of all applications and aggregate them
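The pointer check itself can be sketched with the driver API's cuPointerGetAttribute(); treating failures as "not used" is exactly the false-negative case the slide mentions (the helper name is hypothetical):

```c
/* Hypothetical sampled pointer check: ask the driver where an allocation
 * currently lives. A failed or host-side answer is treated as "unused",
 * which can under-count (false negative) but never over-counts. */
#include <cuda.h>

int is_device_resident(CUdeviceptr ptr) {
    CUmemorytype mem_type = (CUmemorytype)0;
    CUresult rc = cuPointerGetAttribute(&mem_type,
                                        CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                        ptr);
    if (rc != CUDA_SUCCESS)
        return 0;  /* unknown pointer: counted as unused (false negative) */
    return mem_type == CU_MEMORYTYPE_DEVICE;
}
```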

  13. The Scheduler
     - A multi-stage, multi-objective scheduling policy; first priority: load balance
       - Case 1: a (slightly) overloaded GPU, where low-demand tasks must not be blocked
       - Case 2: underloaded GPUs, which waste energy

  14. The Scheduler
     - Load balance
       - Which GPUs: check each GPU pair; feasible candidates pair an overloaded GPU with an underloaded one
       - Which applications to migrate: minimize migration frequency and avoid the ping-pong effect; greedily migrate the most heavyweight feasible application (see the sketch below)
     - Energy awareness
       - Compact lightweight applications onto fewer GPUs to save energy
     - Fairness awareness: grouping + time slicing
       - Trade-off between utilization and fairness: mixed packing vs. a priority-based scheme
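The greedy pairing step might look like the following sketch (the thresholds, structures, and the pick_migration name are assumptions, not the paper's code):

```c
/* Hypothetical greedy rebalancing step: for every (overloaded,
 * underloaded) GPU pair, pick the most heavyweight application that
 * still fits on the target; moving the heaviest feasible app keeps
 * the migration frequency low and avoids ping-pong. */
#define NGPU 4

struct App { long demand; int gpu; };
struct Gpu { long load; long capacity; };  /* capacity = #cores */

int pick_migration(const struct Gpu gpus[NGPU],
                   const struct App apps[], int n_apps, int *target) {
    int best = -1;
    for (int src = 0; src < NGPU; src++) {
        if (gpus[src].load <= gpus[src].capacity)
            continue;                      /* source must be overloaded */
        for (int dst = 0; dst < NGPU; dst++) {
            if (dst == src || gpus[dst].load >= gpus[dst].capacity)
                continue;                  /* target must be underloaded */
            for (int a = 0; a < n_apps; a++) {
                if (apps[a].gpu != src)
                    continue;
                if (gpus[dst].load + apps[a].demand > gpus[dst].capacity)
                    continue;              /* must fit after the move */
                if (best < 0 || apps[a].demand > apps[best].demand) {
                    best = a;
                    *target = dst;
                }
            }
        }
    }
    return best;  /* index of the app to migrate, or -1 if none */
}
```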

  15. The Migrator
     - Clone the runtime
       - Largest overhead: initializing libraries (>80%)
       - Handle pooling: maintain a pool of library handles for each GPU
     - Migrate memory data
       - Leverage unified memory: the task can run immediately on the target GPU without migrating data first
       - Transparent support: intercept allocation APIs and replace them (see the sketch below)
       - Pipeline: prefetching plus on-demand paging
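A minimal sketch of the unified-memory part (illustrative; the wrapper and migrate_buffer names are hypothetical): redirecting allocations to managed memory means that after switching devices the task can run at once, with pages arriving on demand or via an explicit prefetch:

```c
/* Hypothetical unified-memory redirection: an intercepted cudaMalloc is
 * served with managed memory, so data can follow the task across GPUs. */
#include <cuda_runtime.h>

cudaError_t wrapped_cudaMalloc(void **ptr, size_t size) {
    return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
}

/* Pipelined migration: start prefetching a buffer to the target GPU so
 * page migration overlaps with computation already running there. */
cudaError_t migrate_buffer(void *ptr, size_t size, int dst_gpu) {
    cudaError_t rc = cudaSetDevice(dst_gpu);
    if (rc != cudaSuccess)
        return rc;
    /* default stream; a dedicated stream on dst_gpu would allow overlap */
    return cudaMemPrefetchAsync(ptr, size, dst_gpu, 0);
}
```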

  16. The Migrator
     - Resume computing tasks
       - Tasks are in one of two states, running or waiting; only waiting tasks are migrated
       - Synchronize to wait for the completion of all running tasks
       - Redirect waiting tasks to the target GPU, preserving order with a FIFO queue (see the sketch below)
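Putting the resume step together (a sketch under assumptions; the queue structure and function name are hypothetical): drain the running kernels on the source GPU, then replay the waiting launches on the target in FIFO order:

```c
/* Hypothetical resume step: wait for running kernels on the source GPU,
 * then redirect the waiting launches to the target GPU in FIFO order. */
#include <cuda_runtime.h>

struct PendingLaunch {
    const void *func;
    dim3 grid, block;
    void **args;
    size_t shared_mem;
};

cudaError_t resume_on_target(int src_gpu, int dst_gpu,
                             struct PendingLaunch *fifo, int n) {
    cudaSetDevice(src_gpu);
    cudaError_t rc = cudaDeviceSynchronize();  /* running tasks finish */
    if (rc != cudaSuccess)
        return rc;
    cudaSetDevice(dst_gpu);
    for (int i = 0; i < n; i++) {              /* order-preserving replay */
        rc = cudaLaunchKernel(fifo[i].func, fifo[i].grid, fifo[i].block,
                              fifo[i].args, fifo[i].shared_mem,
                              (cudaStream_t)0); /* target's default stream */
        if (rc != cudaSuccess)
            return rc;
    }
    return cudaSuccess;
}
```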

  17. Outline
     1. Background & Problems
     2. DCUDA Design
     3. Evaluation
     4. Conclusion

  18. Experiment Setting
     - Testbed
       - Prototype implemented on top of the CUDA toolkit 8.0
       - Four NVIDIA 1080Ti GPUs, each with 3584 cores and 12 GB of memory
     - Workload
       - 20 benchmark programs representing a majority of GPU application domains (HPC, data mining, machine learning, graph algorithms, deep learning)
       - 50 randomly selected sequences, each combining the 20 programs with a fixed arrival interval
     - Baseline algorithm
       - Least-loaded: the most efficient static scheduling scheme

  19. Load Balance
     - Load states of a GPU: 0%-50% utilization, 50%-100% utilization, and overloaded (demand > GPU cores)
     - Overloaded time of each GPU
       - Least-loaded: 14.3%-51.4%
       - DCUDA: within 6%

  20. GPU Utilization
     - DCUDA improves average GPU utilization by 14.6%
     - It reduces overloaded time by 78.3% on average (over the 50 sequences/workloads)

  21. Application Execution Time
     - Execution times are normalized to single execution
     - DCUDA reduces the average execution time by up to 42.1%

  22. Impact of Different Loads
     - Largest performance improvement in the medium-load case [chart: average execution time]
     - Largest energy saving in the light-load case [chart: energy consumption]

  23. Outline
     1. Background & Problems
     2. DCUDA Design
     3. Evaluation
     4. Conclusion

  24. Conclusion & Future Work
     - Static GPU scheduling assigns applications once, which leads to load imbalance
       - Low GPU utilization and high energy consumption
     - We develop DCUDA, a dynamic scheduling algorithm
       - Monitors resource demand and utilization with low overhead
       - Supports migration of running applications
       - Transparently supports all CUDA applications
     - Limitation: DCUDA only considers scheduling within a single server and only the GPU-core resource

  25. Thanks! Q&A
     Yongkun Li
     ykli@ustc.edu.cn
     http://staff.ustc.edu.cn/~ykli
