SLIDE 1

Scheduling DFRS Heuristics Experiments Conclusions

Dynamic Fractional Resource Scheduling for HPC Workloads

Mark Stillwell (1), Frédéric Vivien (2, 1), Henri Casanova (1)

(1) Department of Information and Computer Sciences, University of Hawai’i at Mānoa

(2) INRIA, France

Invited Talk, October 8, 2009

M Stillwell, F Vivien, H Casanova · UH Manoa ICS, INRIA · Dynamic Fractional Resource Scheduling for HPC Workloads

SLIDE 2

HPC Job Scheduling Problem

  • N > 0 homogeneous nodes
  • J > 0 jobs; each job j has:
      • an arrival time r_j ≥ 0
      • a number of tasks t_j, with 0 < t_j ≤ N
      • a compute time c_j > 0
  • J is not known in advance
  • r_j and t_j are not known before r_j
  • c_j is not known until j completes

SLIDE 3

Schedule Evaluation

  • makespan: not relevant for unrelated jobs
  • flow time: over-emphasizes very long jobs
  • stretch: re-balances in favor of short jobs
  • average stretch: prone to starvation
  • max stretch: helps with the average while bounding the worst case
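
The metrics listed for schedule evaluation can be made concrete with a small sketch. It assumes the usual definitions, which the slide does not spell out: flow time is completion time minus arrival time, and stretch is flow time divided by the job's compute time. The function names and the sample numbers are illustrative only.

```python
def flow_time(arrival, completion):
    """Time a job spends in the system (assumed definition)."""
    return completion - arrival


def stretch(arrival, completion, compute_time):
    """Flow time normalized by the job's compute time c_j (assumed definition)."""
    return flow_time(arrival, completion) / compute_time


# (arrival r_j, completion time, compute time c_j); made-up values
jobs = [(0.0, 10.0, 10.0), (0.0, 12.0, 2.0), (5.0, 105.0, 50.0)]

stretches = [stretch(r, done, c) for r, done, c in jobs]
average_stretch = sum(stretches) / len(stretches)
max_stretch = max(stretches)
```

In this made-up sample the short job (c_j = 2) has the largest stretch even though its flow time is small, which is the sense in which stretch favors short jobs.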

SLIDE 4

Current Approaches

Batch Scheduling, which no one likes

  • usually FCFS with backfilling
  • backfilling needs (unreliable) compute time estimates
  • unbounded wait times
  • poor resource utilization
  • no particular objective

Gang Scheduling, which no one uses

  • globally coordinated time sharing
  • complicated and slow
  • memory pressure a concern

SLIDE 5

VM Technology

  • basically, time sharing
  • pooling of discrete resources (e.g., multiple CPUs)
  • hard limits on resource consumption
  • job preemption and task migration

SLIDE 6

Problem Formulation

  • extends the basic HPC problem
  • jobs now have a per-task CPU need α_j and memory requirement m_j
  • multiple tasks can run on one node if their total memory requirement is ≤ 100%
  • all tasks of a job must be assigned equal amounts of CPU resource
  • assigning less than the need results in proportional slowdown
  • assigned allocations can change over time
  • there are no run-time estimates, so we need another metric to optimize
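
A minimal sketch of the per-node constraints in this formulation follows. The class and function names are hypothetical, and the rule that CPU allocations on a node total at most 100% is an assumption added here; the slide states only the memory constraint explicitly.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    cpu_need: float   # alpha_j: CPU fraction the task needs to run at full speed
    mem_req: float    # m_j: memory fraction the task requires
    cpu_alloc: float  # CPU fraction actually assigned (may be below the need)


def node_is_feasible(placements, eps=1e-9):
    """Check one node: memory and CPU allocations must each total at most 100%.

    The CPU cap is an assumption of this sketch, not stated on the slide.
    """
    mem = sum(p.mem_req for p in placements)
    cpu = sum(p.cpu_alloc for p in placements)
    return mem <= 1.0 + eps and cpu <= 1.0 + eps


def relative_speed(p):
    """Proportional slowdown: a task given half its CPU need runs at half speed."""
    return p.cpu_alloc / p.cpu_need


# Two tasks that each need 80% of a CPU share one node at 50% each.
node = [Placement(0.8, 0.3, 0.5), Placement(0.8, 0.4, 0.5)]
feasible = node_is_feasible(node)
speed = relative_speed(node[0])
```

Adding a third task with a memory requirement of 40% would make the node infeasible regardless of how little CPU it is given, because memory cannot be shared fractionally in this model.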

SLIDE 7

Yield

Definition: The yield y_j(t) of job j at time t is the ratio of the CPU allocation given to the job to the job’s CPU need.

  • requires no knowledge of flow or compute times
  • can be optimized for at each scheduling event
  • maximizing the minimum yield is related to minimizing the maximum stretch

How do we keep track of job progress when the yield can vary?

SLIDE 8

Virtual Time

Definition: The virtual time v_j(t) of job j at time t is the subjective time experienced by the job.

$$v_j(t) = \int_{r_j}^{t} y_j(\tau)\, d\tau$$

  • job completes when v_j(t) = c_j
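
As a sketch of how yield and virtual time fit together, assume the yield is piecewise constant between scheduling events, so the integral reduces to a sum of duration times yield. The function names and the sample history are illustrative assumptions.

```python
def job_yield(cpu_alloc, cpu_need):
    """Yield y_j(t): CPU allocation divided by CPU need."""
    return cpu_alloc / cpu_need


def virtual_time(segments):
    """Integrate a piecewise-constant yield.

    segments: list of (duration, yield) pairs covering [r_j, t].
    """
    return sum(duration * y for duration, y in segments)


def is_complete(segments, compute_time, eps=1e-9):
    """A job completes once its virtual time reaches its compute time c_j."""
    return virtual_time(segments) >= compute_time - eps


# A job with c_j = 10 runs 4 time units at full yield, is paused for 2,
# then runs 12 time units with an allocation of 0.4 against a need of 0.8.
history = [(4.0, 1.0), (2.0, 0.0), (12.0, job_yield(0.4, 0.8))]
v = virtual_time(history)
done = is_complete(history, compute_time=10.0)
```

The job has been in the system for 18 time units but has experienced only 10 units of virtual time, which is exactly its compute time.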

SLIDE 9

The Need for Preemption

  • final goal is to minimize the maximum stretch
  • without preemption, the stretch of non-clairvoyant on-line algorithms is unbounded
  • consider 2 jobs that both require all of the system resources
      • one has c_j = 1
      • the other has c_j = ∆
  • need criteria to decide which jobs should be preempted
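
One way to read the two-job example, under assumptions the slide leaves implicit: both jobs arrive at essentially the same time, and a non-clairvoyant scheduler cannot tell which one is short, so it may start the long one first. The sketch below compares the resulting maximum stretch for the two possible orders; the function names are hypothetical.

```python
def max_stretch_long_first(delta):
    """Long job (c_j = delta) runs first without preemption.

    The short job (c_j = 1) waits delta, then runs for 1, so its stretch
    is (delta + 1) / 1. The long job itself has stretch 1.
    """
    return max(1.0, (delta + 1.0) / 1.0)


def max_stretch_short_first(delta):
    """Short job runs first: the long job waits 1, then runs for delta."""
    return max(1.0, (1.0 + delta) / delta)


bad = max_stretch_long_first(1000.0)
good = max_stretch_short_first(1000.0)
```

Under these assumptions the wrong guess costs a maximum stretch that grows linearly with ∆, while the right guess stays close to 1. With preemption, the scheduler could correct the wrong guess after the fact.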

SLIDE 10

Priority

Jobs should be preempted in order of increasing priority.

  • newly arrived jobs may have infinite priority
  • 1/v_j(t) performs well, but is subject to starvation
  • (t − r_j)/v_j(t) avoids starvation, but does not perform well
  • (t − r_j)/(v_j(t))^2 seems a reasonable compromise
  • other possibilities exist
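
The compromise priority function can be sketched as follows. Treating a job with zero virtual time as having infinite priority is one reading of "newly arrived jobs may have infinite priority"; the function names and the sample jobs are assumptions for illustration.

```python
import math


def priority(t, arrival, vtime, power=2):
    """(t - r_j) / v_j(t)**power; a job that has not run yet gets infinite priority."""
    if vtime == 0:
        return math.inf
    return (t - arrival) / vtime ** power


def preemption_order(jobs, t):
    """Return job names sorted by increasing priority (first to be preempted first).

    jobs: dict mapping a job name to (arrival time r_j, virtual time v_j(t)).
    """
    return sorted(jobs, key=lambda name: priority(t, *jobs[name]))


# Made-up state at time t = 100.
jobs = {
    "a": (0.0, 50.0),   # old job that has made a lot of progress
    "b": (90.0, 5.0),   # recent job with little progress
    "c": (100.0, 0.0),  # just arrived, has not run
}
order = preemption_order(jobs, 100.0)
```

Setting `power=1` gives the (t − r_j)/v_j(t) variant from the list above.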

SLIDE 11

Greedy Scheduling Heuristics

  • GREEDY: put each task on the host with the lowest CPU demand on which it fits into memory; new jobs may have to be resubmitted using bounded exponential backoff
  • GREEDY-PMTN: like GREEDY, but older tasks may be preempted
  • GREEDY-PMTN-MIGR: like GREEDY-PMTN, but older tasks may be migrated as well as preempted
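
A minimal sketch of the GREEDY placement rule for a single task follows. The host representation and function name are assumptions; the sketch does not model resubmission with backoff, preemption, or the requirement that all tasks of a job be placed together.

```python
def greedy_place(task_cpu, task_mem, hosts, eps=1e-9):
    """Place one task on the feasible host with the lowest total CPU demand.

    hosts: list of dicts with running totals 'cpu' (sum of CPU needs, which
    may exceed 1.0 because CPU is shared fractionally) and 'mem' (must stay
    at or below 1.0). Returns the chosen host index, or None if the task
    fits in memory nowhere.
    """
    candidates = [i for i, h in enumerate(hosts) if h["mem"] + task_mem <= 1.0 + eps]
    if not candidates:
        return None  # the job would be resubmitted later (not modeled here)
    best = min(candidates, key=lambda i: hosts[i]["cpu"])
    hosts[best]["cpu"] += task_cpu
    hosts[best]["mem"] += task_mem
    return best


hosts = [
    {"cpu": 0.9, "mem": 0.2},
    {"cpu": 0.3, "mem": 0.8},
    {"cpu": 0.5, "mem": 0.4},
]
chosen = greedy_place(0.4, 0.5, hosts)    # host 1 is least loaded but lacks memory
rejected = greedy_place(0.2, 0.9, hosts)  # fits in memory on no host
```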

SLIDE 12

Connection to multi-capacity bin packing

For each discrete scheduling event:

  • the problem is similar to multi-capacity (vector) bin packing, but has an optimization target and variable CPU allocations
  • can be formulated as an MILP [Stillwell et al., 2009] (NP-complete)
  • relaxed LP heuristics are slow and give low-quality solutions

SLIDE 13

Applying MCB heuristics

  • yield is continuous, so choose a granularity (0.01)
  • perform a binary search on yield, seeking to maximize it
  • for each fixed yield, set the CPU requirements and apply the heuristic
  • the yield found is the maximized minimum; leftover CPU is used to improve the average
  • if a solution cannot be found at any yield, remove the lowest-priority job and try again
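
The binary search over yield can be sketched as below. The packer passed in as `pack` is a placeholder: a simple first-fit routine stands in for the MCB heuristic here, and all names are hypothetical. The sketch also assumes that if packing succeeds at some yield it succeeds at every lower yield, which need not hold for every heuristic. Redistributing leftover CPU and removing the lowest-priority job on failure are left to the caller.

```python
def first_fit(items, num_hosts, eps=1e-9):
    """Stand-in packer (not MCB8): first-fit on CPU and memory."""
    hosts = [[0.0, 0.0] for _ in range(num_hosts)]
    for cpu, mem in items:
        for h in hosts:
            if h[0] + cpu <= 1.0 + eps and h[1] + mem <= 1.0 + eps:
                h[0] += cpu
                h[1] += mem
                break
        else:
            return False  # this item fits on no host
    return True


def max_min_yield(tasks, num_hosts, pack, granularity=0.01):
    """Binary-search the largest yield (a multiple of granularity) that pack accepts.

    tasks: list of (cpu_need, mem_req). Returns the yield found, or None if
    packing fails even at the lowest yield tried.
    """
    lo, hi = 1, round(1.0 / granularity)  # search over integer steps
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        y = mid * granularity
        scaled = [(cpu * y, mem) for cpu, mem in tasks]
        if pack(scaled, num_hosts):
            best, lo = y, mid + 1
        else:
            hi = mid - 1
    return best


# Four tasks that each need a full CPU, on two hosts: best common yield is 0.5.
found = max_min_yield([(1.0, 0.3)] * 4, 2, first_fit)
# Three tasks needing 60% memory each cannot fit on two hosts at any yield.
stuck = max_min_yield([(1.0, 0.6)] * 3, 2, first_fit)
```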

SLIDE 14

MCB8 Heuristic

Based on [Leinberger et al., 1999], simplified to the 2-dimensional case:

1. Put job tasks in two lists: CPU-intensive and memory-intensive.
2. Sort the lists by “some criterion” (MCB8: descending order by maximum).
3. Starting with the first host, pick tasks that fit, in order, from the list that goes against the current imbalance. Example: the current host’s tasks total 50% CPU and 60% memory, so assign the next task that fits from the list of CPU-intensive jobs.
4. When no more tasks can fit on a host, go to the next host.
5. If all tasks can be placed, then success; otherwise failure.
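
The five steps above can be sketched as follows. This is an illustration of the description on the slide, not the authors' implementation. Several details are assumptions: ties between CPU and memory go to the CPU-intensive list, an empty host draws from the CPU-intensive list first, and if the preferred list has no task that fits, the other list is tried before moving to the next host.

```python
def mcb8_pack(tasks, num_hosts, eps=1e-9):
    """Pack (cpu, mem) tasks onto hosts; return per-host task indices or None."""
    def largest_dim(i):
        return max(tasks[i])

    # Steps 1 and 2: split by dominant resource, sort descending by maximum.
    cpu_list = sorted((i for i, (c, m) in enumerate(tasks) if c >= m),
                      key=largest_dim, reverse=True)
    mem_list = sorted((i for i, (c, m) in enumerate(tasks) if c < m),
                      key=largest_dim, reverse=True)

    hosts = []
    for _ in range(num_hosts):
        used_cpu = used_mem = 0.0
        placed = []
        while True:
            # Step 3: prefer the list that goes against the current imbalance.
            if used_mem >= used_cpu:
                order = (cpu_list, mem_list)
            else:
                order = (mem_list, cpu_list)
            pick = None
            for lst in order:
                pick = next((i for i in lst
                             if used_cpu + tasks[i][0] <= 1.0 + eps
                             and used_mem + tasks[i][1] <= 1.0 + eps), None)
                if pick is not None:
                    lst.remove(pick)
                    break
            if pick is None:
                break  # Step 4: nothing else fits, go to the next host
            used_cpu += tasks[pick][0]
            used_mem += tasks[pick][1]
            placed.append(pick)
        hosts.append(placed)

    # Step 5: success only if every task was placed.
    return hosts if not cpu_list and not mem_list else None


tasks = [(0.6, 0.2), (0.5, 0.1), (0.2, 0.7), (0.3, 0.6), (0.1, 0.1)]
two_hosts = mcb8_pack(tasks, 2)
one_host = mcb8_pack(tasks, 1)
```

In the sample, the first host alternates between the two lists (task 0 is CPU-heavy, task 2 memory-heavy, then the small task 4 fills the remainder), which is the balancing behavior the example in step 3 describes.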


SLIDE 21

MCB8 Scheduling Heuristics

  • DYNMCB8: apply the heuristic on every event
  • DYNMCB8-PER: apply the heuristic periodically
  • DYNMCB8-ASAP-PER: like DYNMCB8-PER, but try to greedily schedule incoming jobs
  • DYNMCB8-STRETCH-PER: like DYNMCB8-PER, but try to optimize the worst-case max stretch

SLIDE 22

Methodology

  • discrete event simulator takes a list of jobs and returns stretch values
  • workloads based on synthetic and real traces
  • synthetic workload arrival times scaled to show performance under different load conditions
  • algorithms evaluated by per-trace degradation factor
  • one experiment with “free” preemption/migration, and one where each preemption/migration costs the job a constant amount of wall-clock time

SLIDE 23

Average Maximum Yield, No preemption/migration penalty

[Figure: degradation factor (log scale, 1 to 10000) versus load (0.1 to 0.9)]


SLIDE 26

Average Maximum Yield, 5-minute preemption/migration penalty

[Figure: degradation factor (log scale, 1 to 10000) versus load (0.1 to 0.9)]


SLIDE 29

Comparison of Synthetic vs. Real workload results

Degradation factor (average and maximum) by workload type:

Algorithm   Scaled synth.    Unscaled synth.   Real-world
            avg.    max      avg.    max       avg.    max
EASY         167     560      139     443        94    1476
FCFS         186     569      154     476       118    2219
greedy       294    1093      249    1050       153    1527
greedyp       41     875       35     785         9     147
greedypm      62     835       37     773        17     759
mcb           32     162       11     162        11     231
mcbp           1      12        2      21         3      20
gmcbp          1       9        2      22         2      20
mcbsp          1      12        2      21         3      23

SLIDE 30

Computation Times

  • Most scheduling events involve 10 or fewer jobs and require negligible time for all schedulers.
  • Even when there are about 100 jobs, the time for MCB8 is under 5 seconds on a 3.2 GHz machine.

SLIDE 31

Costs

  • Greedy approaches use significantly less bandwidth than MCB approaches (<1 GB/s in the worst case)
  • MCB approaches cause jobs to be preempted around 5 times on average
  • DYNMCB8 uses 1.3 GB/s on average, 5.1 GB/s maximum
  • periodic algorithms use 0.6 GB/s on average, 2.1 GB/s maximum

SLIDE 32

Conclusions

  • DFRS is potentially much better than batch scheduling
  • multi-capacity bin packing heuristics perform best
  • targeting yield does as well as targeting worst-case max stretch
  • periodic MCB approaches perform nearly as well as aggressive ones when there is no migration cost, and much better when there is a fixed migration cost
  • adding an opportunistic greedy scheduling heuristic to DYNMCB8-PER gives no real benefit to max stretch
  • MCB approaches can calculate resource allocations reasonably quickly
  • MCB approaches need to try to mitigate migration/preemption costs

SLIDE 33

References

  • Leinberger, W., Karypis, G., and Kumar, V. (1999). Multi-capacity bin packing algorithms with applications to job scheduling under multiple constraints. In Proc. of the Intl. Conf. on Parallel Processing, pages 404–412. IEEE.
  • Stillwell, M., Schanzenbach, D., Vivien, F., and Casanova, H. (2009). Resource Allocation using Virtual Clusters. In Proc. of CCGrid 2009, pages 260–267. IEEE.