DCUDA: Dynamic GPU Scheduling with Live Migration Support (PowerPoint PPT Presentation)



SLIDE 1

DCUDA: Dynamic GPU Scheduling with Live Migration Support

Fan Guo¹, Yongkun Li¹, John C.S. Lui², Yinlong Xu¹

¹University of Science and Technology of China
²The Chinese University of Hong Kong

SLIDE 2

Outline

1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion

SLIDE 3

GPU Sharing and Scheduling

- GPUs are underloaded without sharing
  - A server may contain multiple GPUs
  - Each GPU contains thousands of cores
- GPU sharing allows multiple apps to run concurrently on one GPU
- GPU scheduling is necessary

[Figure: application frontends forward API calls to a backend scheduler, which dispatches them across GPU1…GPUN for load balance and GPU utilization]

SLIDE 4

Current Scheduling Schemes

- Current schemes are "static"
  - Round-robin, prediction-based, least-loaded
  - They assign applications to GPUs only once, before running them
- State-of-the-art: least-loaded scheduling
  - Assign a new app to the GPU with the least load

[Figure: the scheduler assigns a new application's API calls to one of GPU1…GPUN]

SLIDE 5

Limitations of Static Scheduling

- Load imbalance (least-loaded scheduling)
  - The fraction of time in which at least one GPU is overloaded while some other GPU is underloaded accounts for up to 41.7% (overloaded: demand > GPU cores)

SLIDE 6

Limitations of Static Scheduling

- Why does static scheduling result in load imbalance?
  - Assign before running
    - Hard to get the exact resource demand
    - The assignment is not optimal
  - No migration support
    - No way to adjust online

SLIDE 7

Limitations of Static Scheduling

- Fairness issue caused by contention
  - Applications with low resource demand may be blocked by those with high resource demand
  - May exist even with load-balancing schemes
- Energy inefficiency
  - Compacting multiple small jobs on one GPU saves energy

[Figure: energy consumption (J, 500 to 4000) of Triad, Kmeans, Mnist_mlp, BFS, Autoencoder, Sort, Reduction, and cifar10 under single execution vs. concurrent execution (2 apps)]

SLIDE 8

Our Goal

- Our goal is to design a scheduling scheme that achieves better
  - Load balance, energy efficiency, fairness
- Key idea: DCUDA
  - Dynamic scheduling (schedule after running; fairness and energy awareness)
  - Online migration (of running applications, not of executing kernels)

SLIDE 9

Outline

1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion

SLIDE 10

Overall Design

- DCUDA is implemented on top of the API-forwarding framework
- Three key modules at the backend
  - Monitor
    - GPU utilization
    - App's resource demand
  - Scheduler
    - Load balance
    - Energy efficiency
    - Fairness
  - Migrator
    - Migration of running apps

SLIDE 11

The Monitor

- Resource demand of each application
  - GPU cores and GPU memory
  - Key challenge: must be lightweight
- Demand on GPU cores
  - Existing tool (nvprof): large overhead (replays API calls)
  - Timer function: track info only from the parameters of intercepted API calls (#blocks, #threads)
- Optimization
  - Estimate only the first time a kernel function is called
  - Reuse the recorded info on later calls
  - Rationale: GPU applications are iteration-based
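The slides describe this optimization in prose only; a minimal sketch of the first-call-then-cache estimation, with hypothetical names (`CoreDemandMonitor`, `on_kernel_launch`) standing in for the real intercepted CUDA launch path, could look like:

```python
# Sketch only: models the Monitor's caching logic, not the actual
# API interception. Names and the demand formula are assumptions.
class CoreDemandMonitor:
    def __init__(self):
        self._cache = {}  # kernel name -> estimated core demand

    def on_kernel_launch(self, kernel_name, num_blocks, threads_per_block):
        # Estimate core demand from launch parameters, but only the first
        # time a kernel is seen; later launches reuse the recorded value,
        # since GPU applications are typically iteration-based.
        if kernel_name not in self._cache:
            self._cache[kernel_name] = num_blocks * threads_per_block
        return self._cache[kernel_name]

monitor = CoreDemandMonitor()
d1 = monitor.on_kernel_launch("gemm", 128, 256)  # first call: estimate
d2 = monitor.on_kernel_launch("gemm", 128, 256)  # later calls: cached
```

This avoids re-estimating on every launch, which is what keeps the monitor lightweight compared with replay-based profiling.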

SLIDE 12

The Monitor

- Demand on GPU memory
  - Easy to know the allocated memory, but not all of it is actually used
- How to detect actual usage?
  - Pointer check with cuPointerGetAttribute() + sampling
  - False negatives: some used memory may be missed
    - Handled by on-demand paging (with unified memory support)
- Estimation of GPU utilization
  - Periodically scan the resource demands of applications and aggregate them
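As a rough illustration of the utilization estimate (the aggregation rule is from the slide; the function name and the normalization by core count are assumptions):

```python
def gpu_utilization(app_demands, gpu_cores):
    """Aggregate the per-application core demands on one GPU into a
    utilization estimate; a value above 1.0 means the GPU is overloaded
    (total demand exceeds the number of physical cores)."""
    return sum(app_demands) / gpu_cores

# e.g. three apps on a 3584-core GPU (the core count of a 1080Ti,
# as used later in the evaluation)
util = gpu_utilization([1024, 2048, 1024], 3584)
overloaded = util > 1.0
```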

SLIDE 13

The Scheduler

- A multi-stage, multi-objective scheduling policy
  - Case 1: a (slightly) overloaded GPU: first priority is load balance; must avoid low-demand tasks being blocked
  - Case 2: underloaded GPUs: waste energy

SLIDE 14

The Scheduler

- Load balance
  - Which GPUs: check each GPU pair
    - Feasible candidates: an overloaded GPU + an underloaded GPU
  - Which applications to migrate
    - Minimize migration frequency and avoid the ping-pong effect
    - Greedy: migrate the most heavyweight feasible applications
- Energy awareness
  - Compact lightweight apps onto fewer GPUs to save energy
- Fairness awareness: grouping + time slicing
  - Tradeoff: utilization vs. fairness
    - Utilization: mixed packing
    - Fairness: priority-based scheme
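A simplified sketch of the greedy candidate selection for one overloaded/underloaded GPU pair (the function name and signature are hypothetical; the real scheduler also accounts for memory demand, energy, and fairness):

```python
def pick_migration(src_apps, dst_load, capacity):
    """Greedy choice for one (overloaded, underloaded) GPU pair:
    migrate the most heavyweight application that still fits on the
    underloaded GPU without overloading it. Preferring one heavy app
    over many light ones keeps the migration frequency low and helps
    avoid the ping-pong effect.

    src_apps: {app_name: core demand} on the overloaded GPU.
    dst_load: current core demand on the underloaded GPU.
    capacity: core count of the underloaded GPU."""
    for app, demand in sorted(src_apps.items(), key=lambda kv: -kv[1]):
        if dst_load + demand <= capacity:
            return app
    return None  # no feasible candidate on this pair

choice = pick_migration({"a": 2000, "b": 800, "c": 300},
                        dst_load=1000, capacity=3584)
```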

SLIDE 15

The Migrator

- Clone runtime
  - Largest overhead: initializing libraries (>80%)
  - Handle pooling: maintain a pool of library handles for each GPU
- Migrate memory data
  - Leverage unified memory: the task can run immediately without first migrating its data
  - Transparent support: intercept API calls and replace them
  - Pipeline: prefetch + on-demand paging
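Handle pooling can be sketched as follows (all names are illustrative; in DCUDA the pooled objects are CUDA library handles, modeled here by an injected initializer):

```python
class HandlePool:
    """Sketch of handle pooling: pre-initialize library handles per GPU
    at startup so that cloning the runtime during a migration does not
    pay the dominant (>80%) library-initialization cost."""
    def __init__(self, gpu_ids, init_handle):
        # Eagerly create one ready handle per GPU up front.
        self._pool = {g: [init_handle(g)] for g in gpu_ids}
        self._init = init_handle

    def acquire(self, gpu_id):
        pool = self._pool[gpu_id]
        # Fast path: reuse a pooled handle; slow path: initialize one.
        return pool.pop() if pool else self._init(gpu_id)

    def release(self, gpu_id, handle):
        self._pool[gpu_id].append(handle)  # return handle for reuse

# usage sketch: a fake initializer standing in for real library init
inits = []
def fake_init(gpu_id):
    inits.append(gpu_id)
    return ("handle", gpu_id, len(inits))

pool = HandlePool([0, 1], fake_init)
h = pool.acquire(0)  # served from the pre-built pool, no new init
```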

SLIDE 16

The Migrator

- Resume computing tasks
  - Two states of tasks: running and waiting
    - Only migrate waiting tasks
  - Synchronize to wait for the completion of all running tasks
  - Redirect waiting tasks to the target GPU
    - Order preserving: FIFO queue
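The resume protocol above can be sketched as follows (hypothetical names; in DCUDA the tasks are CUDA kernels queued on the source GPU):

```python
from collections import deque

def migrate_waiting_tasks(task_queue, sync_running):
    """Sketch of task resumption during migration: running kernels are
    drained with a synchronization barrier, then the still-waiting tasks
    are redirected to the target GPU in FIFO order, preserving the
    original launch order."""
    sync_running()  # wait for all currently running tasks to finish
    redirected = deque()
    while task_queue:
        redirected.append(task_queue.popleft())  # FIFO: order preserved
    return redirected

# usage sketch with stand-in task names and a recording barrier
q = deque(["k1", "k2", "k3"])
synced = []
out = migrate_waiting_tasks(q, lambda: synced.append(True))
```

Because only waiting tasks are moved, no in-flight kernel state ever has to be checkpointed, which is what makes the migration of a running application feasible.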

SLIDE 17

Outline

1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion

SLIDE 18

Experiment Setting

- Testbed
  - Prototype implemented on CUDA toolkit 8.0
  - Four NVIDIA 1080Ti GPUs, each with 3584 cores and 12GB memory
- Workload
  - 20 benchmark programs representing a majority of GPU application domains (HPC, data mining, machine learning, graph algorithms, deep learning)
  - Focus on 50 randomly selected sequences, each combining the 20 programs with a fixed arrival interval
- Baseline algorithm
  - Least-loaded: the most efficient static scheduling scheme

SLIDE 19

Load Balance

- Load states of a GPU
  - 0%-50% utilization, 50%-100% utilization, and overloaded (demand > GPU cores)
- Overloaded time of each GPU
  - Least-loaded: 14.3%-51.4%
  - DCUDA: within 6%

SLIDE 20

GPU Utilization

- DCUDA improves the average GPU utilization by 14.6%
- DCUDA reduces the overloaded time by 78.3% on average (over the 50 sequences/workloads)

SLIDE 21

Application Execution Time

- Execution time is normalized to single execution
- DCUDA reduces the average execution time by up to 42.1%

SLIDE 22

Impact of Different Loads

- Largest performance improvement (average execution time) in the medium-load case
- Largest energy saving (energy consumption) in the light-load case

SLIDE 23

Outline

1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion

SLIDE 24

Conclusion & Future Work

- Static GPU scheduling algorithms, which assign applications only once, lead to load imbalance
  - Low GPU utilization & high energy consumption
- We develop DCUDA, a dynamic scheduling algorithm
  - Monitors resource demand and utilization with low overhead
  - Supports migration of running applications
  - Transparently supports all CUDA applications
- Limitation: DCUDA only considers scheduling within one server and only the GPU-core resource

SLIDE 25

Q&A

Yongkun Li
ykli@ustc.edu.cn
http://staff.ustc.edu.cn/~ykli

Thanks!