SLIDE 1

Kube-Knots: Resource Harvesting through Dynamic Container Orchestration in GPU-based Datacenters

Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Chita Das, Mahmut Kandemir

September 25th, IEEE CLUSTER’19

SLIDE 2

Motivation

[Figure: growth in compute used for AI training over time (source: openai.com/blog/ai-and-compute), spanning the Pre-GPU, Sub-PF GPU Training, and Algorithmic Parallelism & TPUs eras]

SLIDE 3

Motivation

[Figure: growth in compute used for AI training over time, spanning the Pre-GPU, Sub-PF GPU Training, and Algorithmic Parallelism & TPUs eras]

1. https://openai.com/blog/ai-and-compute/
2. Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).

  • Increasing compute demands for DNN training
  • Modern GPGPUs bridge the compute gap (~10 TFLOPS)
  • GPU utilization efficiency is only 33%

Most of the effort so far has gone into improving accuracy, not resource efficiency!

Kube-Knots focuses on Green AI (efficiency) instead of Red AI (accuracy).

SLIDE 4

Outline

  • Need for GPU resource harvesting
  • Cluster workload setup
  • Kube-Knots architecture
  • Correlation Based Provisioning and Peak Prediction
  • Results - Real system & Scalability study
  • Conclusion
SLIDE 5

Energy Proportionality

SLIDE 6

Need for GPU bin-packing

  • CPUs operate at peak efficiency under average-load conditions
  • GPUs exhibit roughly linear performance-per-watt scaling
  • It is crucial to pack work onto GPUs and run them at 100% utilization
  • A real data-center scenario!
SLIDE 7

Alibaba: Study of Over-commitment

  • Average CPU utilization ~47%
  • Average memory utilization ~76%
  • Half of the scheduled containers consume < 45% of memory
  • Containers are provisioned for peak utilization in datacenters
  • Under-utilization epidemic!
SLIDE 8

Harvesting spare compute and memory

Under-utilization calls for resource harvesting at the cluster scheduler level

SLIDE 9

CPUs vs GPUs

  • CPUs have mature Docker / hypervisor layers for efficient resource management.
  • Enforcing bin-packing is the known solution (a first-fit-decreasing sketch follows below).
  • GPUs have limited support for virtualization.
  • Context-switch overheads (VIPT vs. VIVT caches).
  • GPU-agnostic scheduling leads to QoS violations.
  • Energy-proportional scheduling calls for a novel approach.
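To make the bin-packing point concrete, here is a minimal first-fit-decreasing sketch in Python. It only illustrates the packing strategy; it is not Kube-Knots' actual code, and the pod names and GPU-fraction demands are made-up values for the example.

```python
# Minimal first-fit-decreasing (FFD) bin-packing sketch.
# Each pod is sized by its GPU demand as a fraction of one device.

def first_fit_decreasing(pod_demands, gpu_capacity=1.0):
    """Assign each pod to the first GPU with enough spare capacity."""
    gpus = []         # remaining capacity of each opened GPU
    placement = {}    # pod name -> GPU index
    # Sort pods by demand, largest first (the "decreasing" part).
    for pod, demand in sorted(pod_demands.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(gpus):
            if demand <= free:
                gpus[i] -= demand
                placement[pod] = i
                break
        else:
            # No existing GPU fits; open a new one.
            gpus.append(gpu_capacity - demand)
            placement[pod] = len(gpus) - 1
    return placement, gpus

if __name__ == "__main__":
    demands = {"dl-inference-a": 0.10, "dl-inference-b": 0.30,
               "hpc-batch-1": 0.70, "hpc-batch-2": 0.55}
    placement, leftover = first_fit_decreasing(demands)
    print(placement)   # pod -> GPU index
    print(leftover)    # spare capacity per GPU
```

With these example numbers, FFD packs the two inference pods alongside the large batch job on the first GPU and leaves the second GPU partly free, which is exactly the consolidation behavior the slide argues GPUs need.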
SLIDE 10

Workload heterogeneity

  • Two different types of workloads in GPU-based datacenters
  • Batch workloads: HPC, DL training, etc.
    • Long running: typically hours to days
  • Latency-sensitive workloads: DL inference, etc.
    • Short-lived: milliseconds to a few seconds
SLIDE 11

How to Harvest Spare Cycles

We could conservatively provision for average-case utilization only, roughly 80% of what pods ask for. But when peaks do arrive, how do we resize the pods back up? And are there early markers that tell us when spare cycles can be harvested?

SLIDE 12

Correlation of resource metrics: Alibaba

[Figure: correlation of resource metrics in the Alibaba trace for latency-sensitive vs. batch/long-running workloads; annotations: "Predictable load over time", "Tightly correlated metrics", "No solid leads"]

SLIDE 13

Opportunities for harvesting in batch

  • Phase changes are predictable
  • I/O peaks are succeeded by memory peaks
  • Average consumption is low when compared to peaks
  • Provisioning for peak leads to over-commitment

SLIDE 14

TensorFlow Inference on GPUs

[Figure: % GPU memory used vs. inference batch size (1-128) for TensorFlow services: face, imc, key, ner, pos, chk]

SLIDE 15

TensorFlow Inference on GPUs

[Figure: % GPU memory used vs. inference batch size (1-128) for TensorFlow services: face, imc, key, ner, pos, chk]

  • Inference queries are latency-sensitive (~200 ms).
  • Each consumes < 10% of a GPU.
  • With batching, this can be pushed up to about 30%.
  • Usually, when run inside TF, the GPU memory cannot be harvested (see the sketch below).
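The last point refers to TensorFlow's default behavior of reserving essentially all free GPU memory for a session, regardless of how little the model actually needs, which leaves nothing for a co-scheduled pod to harvest. A minimal sketch, assuming the TensorFlow 1.x API that was current at CLUSTER'19; the 30% fraction is an example value, not a setting from the paper:

```python
import tensorflow as tf  # TensorFlow 1.x style API

# By default, a TF session grabs (nearly) all free GPU memory up front,
# so even a small inference model makes the whole GPU unharvestable.

# Cap the session at a fixed fraction of GPU memory (example: 30%) and
# let it grow on demand instead of pre-allocating everything.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.30,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... load the frozen inference graph and serve queries here ...
    pass
```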

SLIDE 16

Outline

  • Need for GPU resource harvesting
  • Cluster Workload setup
  • Kube-Knots architecture
  • Correlation based Provisioning and Peak Prediction
  • Results - Real system & Scalability study
  • Conclusion
SLIDE 17

Cluster-level workload setup

  • Eight Rodinia (HPC) GPU applications
    • Batch and long-running tasks
  • Djinn and Tonic suite's DNN inference queries
    • Face recognition, key-point detection, speech recognition
  • We characterize the applications and group them into three bins
  • Plot the COV (coefficient of variation) of GPU utilization (see the sketch below)
    • COV <= 1: static load, not much variation
    • COV > 1: heavy-tailed, highly varying load

[Figure: COV of GPU utilization for App-Mix-1, App-Mix-2, and App-Mix-3]
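As a rough illustration of the binning rule above, the following Python sketch computes the coefficient of variation (COV = standard deviation / mean) of a GPU-utilization trace and labels the application accordingly. The sample traces are invented numbers, not measurements from the paper:

```python
import statistics

def cov(utilization_samples):
    """Coefficient of variation: stdev / mean of a utilization trace."""
    mean = statistics.mean(utilization_samples)
    return statistics.pstdev(utilization_samples) / mean if mean else 0.0

def classify(utilization_samples):
    """COV <= 1 -> fairly static load; COV > 1 -> heavy-tailed, highly varying load."""
    return "static" if cov(utilization_samples) <= 1.0 else "heavy-tailed"

if __name__ == "__main__":
    # Hypothetical per-minute GPU utilization (%) traces.
    traces = {
        "rodinia-batch": [92, 95, 90, 94, 93, 91],        # steady batch job
        "dnn-inference": [2, 0, 35, 0, 1, 80, 0, 0, 3],   # bursty inference
    }
    for app, samples in traces.items():
        print(app, round(cov(samples), 2), classify(samples))
```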

SLIDE 18

Baseline GPU Agnostic Scheduler

  • An ideal scheduler would strive to improve GPU utilization across all percentiles.
  • With high-COV mixes, cluster utilization is not stable.
  • Applications have varying resource needs over time.
  • Keeping a GPU cluster busy throughout depends on the COV mix.
  • A GPU-agnostic scheduler leads to QoS violations due to load imbalance.

[Figure: GPU utilization under the baseline GPU-agnostic scheduler for App-Mix-1, App-Mix-2, and App-Mix-3]

SLIDE 19

Outline

  • Need for GPU resource harvesting
  • Cluster workload setup
  • Kube-Knots architecture
  • Correlation Based Provisioning and Peak Prediction
  • Results - Real system & Scalability study
  • Conclusion
SLIDE 20

Kube-Knots Design
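The design figure itself is not captured in this transcript. As a rough sketch of the Knots idea of exposing real-time GPU utilization to the orchestrator (see the conclusion slide), here is a small Python loop that samples per-GPU compute and memory utilization via the NVML bindings (pynvml). This assumes an NVML-style query feeds the scheduler; it is an illustration, not the paper's actual implementation:

```python
import time
import pynvml  # NVIDIA Management Library bindings

def sample_gpu_utilization():
    """Return (gpu_util_percent, mem_used_fraction) for every GPU on this node."""
    pynvml.nvmlInit()
    stats = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu, .memory (%)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used, .total (bytes)
            stats.append((util.gpu, mem.used / mem.total))
    finally:
        pynvml.nvmlShutdown()
    return stats

if __name__ == "__main__":
    # A Knots-like node agent would push these samples to the cluster scheduler periodically.
    while True:
        print(sample_gpu_utilization())
        time.sleep(5)
```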

SLIDE 21

Outline

  • Need for GPU resource harvesting
  • Cluster workload setup
  • Kube-Knots architecture
  • Correlation Based Provisioning and Peak Prediction
  • Results - Real system & Scalability study
  • Conclusion
SLIDE 22

Correlation Based Provisioning

  • Correlation between utilization metrics is considered for application placement.
  • Two pods whose memory usage is positively correlated are not co-located on the same GPU.
  • Pods are always resized for average utilization, not peak utilization.
  • GPUs are still under-utilized due to static provisioning.
  • QoS violations arise from pending pods, since most of them contend for the same resource (positive correlation); see the placement sketch below.
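To make the co-location rule concrete, here is a minimal Python sketch of a CBP-style check: it computes the Pearson correlation between two pods' memory-utilization histories and only allows them onto the same GPU when the metric is not positively correlated. The threshold, pod names, and traces are illustrative assumptions, not values from the paper:

```python
import numpy as np

def can_colocate(mem_trace_a, mem_trace_b, threshold=0.0):
    """Allow co-location only if the pods' memory usage is not positively correlated."""
    r = np.corrcoef(mem_trace_a, mem_trace_b)[0, 1]
    return r <= threshold

if __name__ == "__main__":
    # Hypothetical per-minute memory utilization (%) of three pods.
    pod_a = [10, 20, 60, 20, 10, 55]
    pod_b = [12, 25, 58, 22, 11, 60]   # peaks together with pod_a -> positively correlated
    pod_c = [60, 50, 10, 55, 60, 12]   # peaks when pod_a is quiet

    print("A+B on one GPU?", can_colocate(pod_a, pod_b))  # False: peaks would collide
    print("A+C on one GPU?", can_colocate(pod_a, pod_c))  # True: peaks interleave
```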

SLIDE 23

Peak Prediction Scheduler

  • PP allows two positively correlating pods to share the same GPU.
  • PP is built on the first principle that resource peaks do not happen at the same time for all co-located apps.
  • PP uses ARIMA to predict peak utilization and resize the pods accordingly.
  • The autocorrelation function predicts the subsequent resource-demand trend, where n is the total number of events and ȳ is the moving average.
  • When the r value is > 0, ARIMA is used to forecast the resource utilization (see the sketch below).
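The autocorrelation formula itself is not reproduced in this transcript; the standard lag-k sample autocorrelation it refers to is r_k = Σ_{t=1}^{n-k} (y_t - ȳ)(y_{t+k} - ȳ) / Σ_{t=1}^{n} (y_t - ȳ)², with n events and mean ȳ. Below is a minimal Python sketch of the PP idea using that check (at lag 1) and statsmodels' ARIMA as a stand-in forecaster; the ARIMA order, horizon, and trace values are assumptions for illustration, not the paper's tuning:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def lag1_autocorrelation(y):
    """Sample autocorrelation at lag 1: sum((y_t - ym)(y_{t+1} - ym)) / sum((y_t - ym)^2)."""
    y = np.asarray(y, dtype=float)
    ym = y.mean()
    return np.sum((y[:-1] - ym) * (y[1:] - ym)) / np.sum((y - ym) ** 2)

def predict_peak(utilization_history, horizon=5):
    """If the trace shows positive autocorrelation, forecast the near-term peak with ARIMA."""
    if lag1_autocorrelation(utilization_history) <= 0:
        return None  # no usable trend; fall back to average-case provisioning
    model = ARIMA(utilization_history, order=(1, 1, 1)).fit()
    forecast = model.forecast(steps=horizon)
    return float(np.max(forecast))  # resize the pod for the predicted peak

if __name__ == "__main__":
    # Hypothetical GPU utilization (%) trace with a rising phase.
    history = [20, 22, 25, 30, 38, 45, 55, 68, 74, 81]
    print("predicted peak:", predict_peak(history))
```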
SLIDE 24

Outline

  • Need for GPU resource harvesting
  • Cluster workload setup
  • Kube-Knots Architecture
  • Correlation Based Provisioning and Peak Prediction
  • Results - Real System & Scalability Study
  • Conclusion
SLIDE 25

CBP+PP Utilization Improvements

  • CBP+PP consolidates load effectively under high and medium loads compared to the GPU-agnostic scheduler.
  • 62% improvement in average utilization.
  • 80% improvement at the median and the 99th percentile.
  • Under the low and sporadic load scenario, CBP+PP effectively consolidates load onto the active GPUs.
  • GPU nodes 1, 4, 8, and 10 are minimally used, for power efficiency.

[Figure: GPU utilization under CBP+PP for App-Mix-1, App-Mix-2, and App-Mix-3]

SLIDE 26

GPU Utilization Breakdown

  • CBP+PP consistently improved utilization in all cases.
  • By up to 80% for the median and the tail.
  • Under low-load scenarios the scope for improvement is small.
  • Even so, CBP+PP improved the average case.

[Figure: GPU utilization breakdown for App-Mix-1, App-Mix-2, and App-Mix-3]

SLIDE 27

Power & QoS Improvements

  • Res-Ag consumes the least power, about 33% savings on average.
  • But it violates QoS for 53% of requests.
  • PP consumes 10% more power than Res-Ag.
  • PP ensures QoS for almost 100% of requests.
  • CBP+PP can ensure QoS by predicting the GPU resource peaks.
  • Its further power savings come from consolidation onto the active GPUs.

SLIDE 28

Scalability of CBP+PP in case of DL

  • Deep-learning training (DLT) and inference (DLI) workload mixes.
  • 60% faster median JCT compared to DL-aware schedulers.
  • 30% better than Gandiva.
  • 11% better than Tiresias.
  • QoS guarantees for DLI in the presence of DLT.
  • Reduced QoS violations due to GPU-utilization-aware placement.

SLIDE 29

Conclusion

  • Need for resource harvesting in GPU datacenters.
  • Knots exposes real-time GPU utilization to Kubernetes.
  • The CBP+PP scheduler improved GPU utilization by up to 80% for both average and tail-case utilization.
  • QoS-aware workload consolidation led to 33% energy savings.
  • Trace-driven scalability experiments show that Kube-Knots performs 36% better in terms of JCT compared to DLT schedulers.
  • Kube-Knots also reduced the overall QoS violations by up to 53%.
SLIDE 30

prashanth@psu.edu
http://www.cse.psu.edu/hpcl/index.html

September 25th, IEEE CLUSTER’19

“Workload Setup Docker TensorFlow / HPC experiments used in evaluation of kube-knots,” https://hub.docker.com/r/prashanth5192/gpu

SLIDE 31

Backup-1 Cluster Status COV

  • COV of load across the different GPUs is effectively reduced to the 0-0.2 range, from 0.1-0.7.
  • PP performs load balancing even in high-load scenarios.
  • PP also harvests and consolidates under low load by keeping idle GPUs in P-state 12.

SLIDE 32

Difference Table

  • Uniform (Kubernetes default scheduler): GPUs cannot be shared; low PPW; no QoS guarantees.
  • Resource-Agnostic Sharing (Res-Ag): First-Fit-Decreasing bin-packing; high PPW; poor QoS and high queueing delays.
  • Correlation Based Provisioning (CBP): utilization-metric-based bin-packing; high PPW; assured QoS but high queueing delays due to affinity constraints.
  • Peak Prediction (PP): predicts the resource peaks of co-scheduled apps via the autocorrelation factor; high PPW and assured QoS guarantees.