

1. Kube-Knots: Resource Harvesting through Dynamic Container Orchestration in GPU-based Datacenters
Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Chita Das, Mahmut Kandemir
September 25th, IEEE CLUSTER’19

2. Motivation
[Figure: compute used for AI training over time, spanning the pre-GPU era, the GPU training era, and the algorithmic parallelism & TPUs era, with the sub-petaflop (Sub-PF) GPU regime marked.]
1. https://openai.com/blog/ai-and-compute/
2. Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019)

3. Motivation
• Compute demands for DNN training keep increasing, yet most of the contribution has gone into improving accuracy, not resource efficiency.
• Modern GPGPUs bridge the compute gap (~10 TFLOPS).
• GPU utilization efficiency is only 33%.
• Kube-Knots focuses on Green AI (efficiency) instead of Red AI (accuracy).
1. https://openai.com/blog/ai-and-compute/
2. Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019)

4. Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation-Based Provisioning and Peak Prediction
• Results: real system & scalability study
• Conclusion

5. Energy Proportionality

6. Need for GPU bin-packing
• CPUs operate at peak efficiency for average-load cases.
• GPUs show linear performance-per-watt scaling.
• It is crucial to pack GPUs and run them at 100% utilization.
• A real datacenter scenario!

7. Alibaba: Study of Over-commitment
• Average CPU utilization: ~47%.
• Average memory utilization: ~76%.
• Half of the scheduled containers consume < 45% of memory.
• Containers are provisioned for peak utilization in datacenters.
• An under-utilization epidemic!

8. Harvesting spare compute and memory
Under-utilization calls for resource harvesting at the cluster-scheduler level.

9. CPUs vs. GPUs
• CPUs have mature Docker / hypervisor layers for efficient resource management; enforcing bin-packing is the known solution.
• GPUs have limited support for virtualization.
• Context-switch overheads (VIPT vs. VIVT).
• Agnostic scheduling leads to QoS violations.
• Energy-proportional scheduling calls for a novel approach.

10. Workload heterogeneity
• Two different types of workload run in GPU-based datacenters:
• Batch workloads (HPC, DL training, etc.): long-running, typically hours to days.
• Latency-sensitive workloads (DL inference, etc.): short-lived, milliseconds to a few seconds.

11. How to Harvest Spare Cycles
• We can conservatively provision for only the average-case utilization, roughly 80% of what is asked for.
• But when peaks do occur, how do we resize the pods back?
• Are there any early markers we can use to harvest spare cycles?

12. Correlation of resource metrics: Alibaba
• Latency-sensitive workloads: tightly correlated utilization metrics.
• Batch/long-running workloads: no solid correlation leads, but load that is predictable over time.
• A short correlation sketch follows below.
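The correlation signal behind this split can be checked with a few lines of NumPy. This is only an illustrative sketch, not the paper's trace-analysis code; the synthetic traces and variable names below are assumptions.

```python
import numpy as np

def metric_correlation(series_a, series_b):
    """Pearson correlation between two per-container utilization time series."""
    return np.corrcoef(series_a, series_b)[0, 1]

# Hypothetical traces: one container whose CPU and memory rise and fall together,
# and one whose memory stays flat regardless of CPU activity.
rng = np.random.default_rng(0)
t = np.arange(100)
cpu_a = 40 + 30 * np.sin(t / 10.0)
mem_a = 35 + 28 * np.sin(t / 10.0) + rng.normal(0, 2, 100)
cpu_b = 20 + 15 * rng.random(100)
mem_b = 30 + rng.normal(0, 1, 100)

print(metric_correlation(cpu_a, mem_a))  # close to +1: tightly correlated metrics
print(metric_correlation(cpu_b, mem_b))  # near 0: no usable correlation
```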

13. Opportunities for harvesting in batch
• Phase changes are predictable.
• I/O peaks are succeeded by memory peaks (see the lead/lag sketch below).
• Average consumption is low compared to the peaks.
• Provisioning for the peak leads to over-commitment.
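The "I/O peaks are succeeded by memory peaks" observation is essentially a lead/lag relationship. Below is a hypothetical sketch of how such a lag could be measured; the synthetic traces and the 20-sample search window are assumptions, not data from the paper.

```python
import numpy as np

def best_lag(leading, trailing, max_lag=20):
    """Lag (in samples) at which `trailing` correlates best with `leading`.

    A positive result means peaks in `trailing` follow peaks in `leading`."""
    lags = list(range(1, max_lag + 1))
    scores = [np.corrcoef(leading[:-k], trailing[k:])[0, 1] for k in lags]
    return lags[int(np.argmax(scores))]

# Hypothetical batch-job trace where memory utilization echoes the I/O
# pattern five samples later.
rng = np.random.default_rng(1)
t = np.arange(200)
io_util = 50 + 40 * np.sin(t / 8.0)
mem_util = np.roll(io_util, 5) + rng.normal(0, 1, 200)

print(best_lag(io_util, mem_util))  # ~5: memory peaks trail the I/O peaks
```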

14. TensorFlow Inference on GPUs
[Figure: % GPU memory used vs. inference batch size (1–128) for the TF services face, imc, key, ner, pos, and chk.]

15. TensorFlow Inference on GPUs
• Inference queries are latency-sensitive (~200 ms).
• Each query consumes < 10% of the GPU; with batching this can be pushed up to 30%.
• When run inside TF, the GPU memory usually cannot be harvested (see the note below).
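The harvesting limitation comes from TensorFlow's habit of reserving most of the GPU's memory at startup. As a side note (not part of Kube-Knots), this is how that reservation can be relaxed with the TF 2.x configuration API; the 5000 MB cap below is an arbitrary example value.

```python
import tensorflow as tf

# By default TensorFlow maps nearly all GPU memory when the first op runs, so
# co-located pods cannot harvest the unused portion. Enabling memory growth
# makes TF allocate only what the inference model actually needs.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Alternative: pin the framework to a fixed slice of the device, leaving the
# rest harvestable by other containers.
# tf.config.set_logical_device_configuration(
#     tf.config.list_physical_devices('GPU')[0],
#     [tf.config.LogicalDeviceConfiguration(memory_limit=5000)])  # MB
```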

16. Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation-Based Provisioning and Peak Prediction
• Results: real system & scalability study
• Conclusion

17. Cluster-level workload setup
• Eight Rodinia (HPC) GPU applications: batch and long-running tasks.
• DjiNN and Tonic suite DNN inference queries: face recognition, key-point detection, speech recognition.
• We characterize and group them into three bins (App-Mix-1, App-Mix-2, App-Mix-3) by plotting the COV of GPU utilization (a COV sketch follows below).
• COV <= 1: static load with little variation.
• COV > 1: heavy-tailed, highly varying load.
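A minimal sketch of the COV-based binning rule, assuming toy utilization traces; the threshold of 1 comes from the slide, everything else is illustrative.

```python
import numpy as np

def cov(gpu_util):
    """Coefficient of variation (std / mean) of a GPU-utilization time series."""
    gpu_util = np.asarray(gpu_util, dtype=float)
    return gpu_util.std() / gpu_util.mean()

def classify(gpu_util):
    # COV <= 1: fairly static load; COV > 1: heavy-tailed, highly varying load.
    return "static" if cov(gpu_util) <= 1.0 else "heavy-tailed"

steady = [60, 62, 58, 61, 59, 63]      # assumed steady batch job
bursty = [2, 1, 95, 3, 2, 88, 1, 2]    # assumed bursty inference load
print(cov(steady), classify(steady))
print(cov(bursty), classify(bursty))
```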

18. Baseline: GPU-Agnostic Scheduler
[Figure: GPU utilization percentiles under App-Mix-1, App-Mix-2, and App-Mix-3]
• An ideal scheduler would strive to improve GPU utilization at all percentiles.
• With high COV, cluster utilization is not stable; applications have varying resource needs throughout their run.
• Keeping a GPU cluster busy throughout depends on the COV mix.
• A GPU-agnostic scheduler leads to QoS violations due to load imbalance.

19. Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation-Based Provisioning and Peak Prediction
• Results: real system & scalability study
• Conclusion

20. Kube-Knots Design
[Figure: Kube-Knots architecture]
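Knots exposes real-time GPU utilization to the Kubernetes scheduler. Below is a rough sketch of what such a per-node probe could look like using NVML via the pynvml bindings; the sampling loop and the printed tuple format are assumptions, not the actual Knots implementation.

```python
import time
import pynvml

def sample_gpus():
    """Return (gpu_util %, used memory MB, total memory MB) for every GPU on the node."""
    samples = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu, .memory (%)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # .used, .total (bytes)
        samples.append((util.gpu, mem.used // 2**20, mem.total // 2**20))
    return samples

if __name__ == "__main__":
    pynvml.nvmlInit()
    try:
        while True:
            # In Kube-Knots these samples would be pushed to the cluster
            # scheduler; here we simply print them once per second.
            print(sample_gpus())
            time.sleep(1)
    finally:
        pynvml.nvmlShutdown()
```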

21. Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation-Based Provisioning and Peak Prediction
• Results: real system & scalability study
• Conclusion

22. Correlation-Based Provisioning (CBP)
• Correlation between utilization metrics is considered for application placement.
• Two pods whose memory utilization is positively correlated are not co-located on the same GPU (a placement sketch follows below).
• Pods are always resized for average utilization, not peak utilization.
• GPUs remain under-utilized due to static provisioning.
• QoS violations arise from pending pods, since most of them contend for the same resource (positive correlation).
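A sketch of the CBP placement rule under stated assumptions: the pod histories and the per-GPU bookkeeping are toy structures, and the "positively correlated" test is simply a Pearson coefficient greater than zero, as the slide's "+ve correlation" wording suggests.

```python
import numpy as np

def positively_correlated(mem_a, mem_b):
    """True if two pods' GPU-memory utilization series rise and fall together."""
    return np.corrcoef(mem_a, mem_b)[0, 1] > 0

def place(pod_mem_hist, gpus):
    """Pick a GPU hosting no pod whose memory usage positively correlates with the new pod.

    `gpus` maps gpu_id -> list of memory-utilization histories of pods already there."""
    for gpu_id, resident in gpus.items():
        if not any(positively_correlated(pod_mem_hist, r) for r in resident):
            return gpu_id
    return None  # no safe GPU: the pod stays pending (a source of queueing delay)

def average_request(mem_hist):
    """Pods are resized to their average utilization, not their requested peak."""
    return float(np.mean(mem_hist))

new_pod = [30, 35, 60, 32, 31]
cluster = {0: [[28, 33, 58, 30, 29]],   # rises with the new pod: avoid
           1: [[80, 70, 20, 75, 78]]}   # anti-correlated: safe to co-locate
print(place(new_pod, cluster))          # -> 1
print(average_request(new_pod))
```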

23. Peak Prediction (PP) Scheduler
• PP allows two positively correlating pods to be placed on the same GPU.
• PP is built on the first principle that resource peaks do not happen at the same time for all co-located apps.
• PP uses ARIMA to predict peak utilization and resize the pods.
• The autocorrelation function predicts subsequent resource-demand trends: r = Σ_{t=1..n−1} (y_t − ȳ)(y_{t+1} − ȳ) / Σ_{t=1..n} (y_t − ȳ)², where n is the total number of events and ȳ is the moving average.
• When r > 0, ARIMA is used to forecast the resource utilization (a sketch follows below).
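A hedged sketch of the peak-prediction step with statsmodels' ARIMA. The lag-1 autocorrelation test, the (1,1,1) model order, the forecast horizon, and the fallback rule are assumptions for illustration, not the paper's exact configuration; the slide's ȳ (moving average) is approximated here by the series mean.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def lag1_autocorrelation(y):
    """Sample autocorrelation coefficient r at lag 1 (y_bar: series mean)."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    return np.sum((y[:-1] - y_bar) * (y[1:] - y_bar)) / np.sum((y - y_bar) ** 2)

def predict_peak(util_history, horizon=5):
    """Forecast the utilization peak over the next `horizon` samples.

    The ARIMA forecast is only trusted when the series is positively
    autocorrelated (r > 0); otherwise fall back to the observed peak."""
    if lag1_autocorrelation(util_history) > 0:
        model = ARIMA(util_history, order=(1, 1, 1)).fit()
        forecast = model.forecast(steps=horizon)
        return float(np.max(forecast))
    return float(np.max(util_history))

# Assumed example: a pod whose GPU-memory utilization ramps up over time.
history = [20, 22, 25, 30, 33, 38, 41, 47, 52, 58, 61, 67]
print(predict_peak(history))  # resize the pod to this predicted peak, not its request
```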

24. Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation-Based Provisioning and Peak Prediction
• Results: real system & scalability study
• Conclusion

25. CBP+PP Utilization Improvements
[Figure: cluster GPU utilization under App-Mix-1, App-Mix-2, and App-Mix-3]
• CBP+PP performs effective load consolidation for high and medium loads compared to the GPU-agnostic scheduler:
• 62% improvement in average utilization.
• 80% improvement at the median and 99th percentile.
• For the low and sporadic load scenario, CBP+PP effectively consolidates loads onto active GPUs.
• GPU nodes 1, 4, 8, and 10 are minimally used, for power efficiency.

26. GPU Utilization Breakdown
[Figure: per-GPU utilization breakdown for App-Mix-1, App-Mix-2, and App-Mix-3]
• CBP+PP consistently improved utilization in all cases, by up to 80% for the median and the tail.
• In low-load scenarios the scope for improvement is small, but CBP+PP still improved the average case.

27. Power & QoS Improvements
• Res-Ag consumes the least power on average (33% savings), but violates QoS for 53% of requests.
• PP consumes 10% more power than Res-Ag, yet ensures QoS for almost 100% of requests.
• CBP+PP can ensure QoS by predicting the GPU resource peaks; further power savings come from consolidation onto active GPUs.

28. Scalability of CBP+PP for DL workloads
• Deep learning training and inference workload mixes.
• 60% faster median JCT compared to DL-aware schedulers.
• 30% better than Gandiva; 11% better than Tiresias.
• QoS guarantees for DL inference (DLI) in the presence of DL training (DLT).
• Reduced QoS violations due to GPU-utilization-aware placement.

29. Conclusion
• There is a need for resource harvesting in GPU datacenters.
• Knots exposes real-time GPU utilization to Kubernetes.
• The CBP+PP scheduler improved GPU utilization by up to 80% for both average and tail-case utilization.
• QoS-aware workload consolidation led to 33% energy savings.
• Trace-driven scalability experiments show that Kube-Knots performs 36% better in terms of JCT compared to DLT schedulers.
• Kube-Knots also reduced overall QoS violations by up to 53%.

30. Contact
• prashanth@psu.edu
• http://www.cse.psu.edu/hpcl/index.html
• “Workload Setup Docker TensorFlow / HPC experiments used in evaluation of kube-knots”: https://hub.docker.com/r/prashanth5192/gpu
• September 25th, IEEE CLUSTER’19

31. Backup-1: Cluster Status COV
• COV of load across the different GPUs is effectively reduced from the 0.1–0.7 range to the 0–0.2 range.
• PP performs load balancing even in high-load scenarios.
• PP also harvests and consolidates under low load by keeping idle GPUs in P-state 12.

32. Difference Table (PPW: performance per watt)
• Uniform (Kubernetes default scheduler): GPUs cannot be shared. Low PPW, no QoS guarantees.
• Resource-Agnostic Sharing: First-Fit-Decreasing bin-packing. High PPW, but poor QoS and high queueing delays.
• Correlation-Based Provisioning: utilization-metrics-based bin-packing. High PPW, assured QoS, but high queueing delays due to affinity constraints.
• Peak Prediction: predicts the resource peaks of co-scheduled apps via the autocorrelation factor. High PPW and assured QoS guarantees.
