Co-scheduling algorithms for high-throughput workload execution


SLIDE 1

Problem definition Theoretical results Heuristics Simulations Conclusion

Co-scheduling algorithms for high-throughput workload execution

Guillaume Aupy1, Manu Shantharam2, Anne Benoit1,3, Yves Robert1,3,4 and Padma Raghavan5

  • 1. École Normale Supérieure de Lyon, France
  • 2. University of Utah, USA
  • 3. Institut Universitaire de France
  • 4. University of Tennessee Knoxville, USA
  • 5. Pennsylvania State University, USA

Anne.Benoit@ens-lyon.fr http://graal.ens-lyon.fr/~abenoit/

9th Scheduling for Large Scale Systems Workshop July 1-4, 2014 - Lyon, France

Anne.Benoit@ens-lyon.fr Lyon 2014 Co-scheduling algorithms 1/ 30

SLIDE 2

Motivation

Execution time of HPC applications:

  • Can be significantly reduced by using a large number of processors
  • But devoting all resources to a single application uses them inefficiently
    (non-linear decrease of execution time)

Pool of several applications:

  • Co-scheduling algorithms execute several applications concurrently
  • This increases the individual execution time of each application, but
    (i) improves the efficiency of parallelization,
    (ii) reduces the total execution time, and
    (iii) reduces the average response time
  • Increase platform yield, and save energy

SLIDE 3

1. Problem definition
2. Theoretical results
3. Heuristics
4. Simulations
5. Conclusion

SLIDE 4

Framework

  • Distributed-memory platform with p identical processors
  • Set of n independent tasks (or applications) T1, . . . , Tn; task Ti can be
    assigned σ(i) = j processors, where:
      • pi is the minimum number of processors required by Ti;
      • ti,j is the execution time of task Ti with j processors;
      • work(i, j) = j × ti,j is the corresponding work.

We assume the following for 1 ≤ i ≤ n and pi ≤ j < p:
  • Non-increasing execution time: ti,j+1 ≤ ti,j
  • Non-decreasing work: work(i, j + 1) ≥ work(i, j)
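Both assumptions hold, for instance, under an Amdahl-style speedup model. A minimal sketch in Python; the serial fraction f and the numeric values are illustrative, not from the slides:

```python
# Hypothetical Amdahl-style model used only to illustrate the two
# assumptions; f is the inherently serial fraction of a task.
def exec_time(t_seq, f, j):
    """Execution time on j processors, given sequential time t_seq."""
    return f * t_seq + (1 - f) * t_seq / j

def work(t_seq, f, j):
    """work(i, j) = j x t(i, j)."""
    return j * exec_time(t_seq, f, j)

t_seq, f = 100.0, 0.1
for j in range(1, 8):
    # Non-increasing execution time: t(i, j+1) <= t(i, j)
    assert exec_time(t_seq, f, j + 1) <= exec_time(t_seq, f, j)
    # Non-decreasing work: work(i, j+1) >= work(i, j)
    assert work(t_seq, f, j + 1) >= work(t_seq, f, j)
```

Under this model the execution time shrinks toward f × t_seq while the work grows by f × t_seq per extra processor, matching both monotonicity assumptions.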

SLIDE 5

Co-schedules

A co-schedule partitions the n tasks into groups (called packs):
  • All tasks from a given pack start their execution at the same time
  • Two tasks from different packs have disjoint execution intervals

[Figure: a co-schedule with four packs P1 to P4, shown as successive time
intervals sharing the processors]

SLIDE 6

Definition (k-in-p-CoSchedule optimization problem) Given a fixed constant k ≤ p, find a co-schedule with at most k tasks per pack that minimizes the execution time.

The most general problem is when k = p, but in some frameworks we may have an upper bound k < p on the maximum number of tasks within each pack.

SLIDE 7

Related work

  • Performance bounds for level-oriented two-dimensional packing algorithms
    (Coffman, Garey, Johnson): strip-packing problem with rigid parallel tasks
    (fixed number of processors); approximation algorithm based on "shelves"
  • Scheduling parallel tasks: approximation algorithms (Dutot, Mounié,
    Trystram): used this model to approximate the moldable model; they studied
    p-in-p-CoSchedule for identical moldable tasks (polynomial with dynamic
    programming)
  • Widely studied for sequential tasks

SLIDE 8

1. Problem definition
2. Theoretical results
3. Heuristics
4. Simulations
5. Conclusion

SLIDE 9

Complexity: Polynomial instances

Theorem The 1-in-p-CoSchedule and 2-in-p-CoSchedule problems can both be solved in polynomial time.

Proof. For 1-in-p-CoSchedule, each task simply runs alone on all p
processors. For 2-in-p-CoSchedule, if a pack contains exactly the tasks Ti
and Ti′, then its execution time is

    min over j = pi, . . . , p − pi′ of max(ti,j , ti′,p−j).

We then construct the complete weighted graph G = (V, E) with |V| = n and
edge weights

    ei,i′ = ti,p                                              if i = i′,
    ei,i′ = min over j = pi, . . . , p − pi′ of max(ti,j , ti′,p−j)  otherwise.

Finding a minimum-weight perfect matching in G yields an optimal solution
for 2-in-p-CoSchedule, and such a matching can be computed in polynomial
time.
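The inner minimization (the cost of a pack holding two given tasks) can be sketched as follows; `t_i` and `t_i2` are hypothetical dictionaries mapping a processor count to an execution time:

```python
def pair_cost(t_i, t_i2, p_i, p_i2, p):
    """Execution time of a pack holding exactly T_i and T_i' on p
    processors: min over j = p_i .. p - p_i2 of max(t_i[j], t_i2[p - j])."""
    return min(max(t_i[j], t_i2[p - j]) for j in range(p_i, p - p_i2 + 1))

# Example: p = 4 processors, both tasks need at least 1 processor.
t_a = {1: 8, 2: 5, 3: 4}
t_b = {1: 6, 2: 4, 3: 3}
print(pair_cost(t_a, t_b, 1, 1, 4))  # -> 5 (best split: 2 + 2 processors)
```

Filling the complete graph with these values and running a minimum-weight perfect matching (e.g. a blossom-algorithm implementation) then gives the optimal 2-in-p co-schedule.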

SLIDE 11

Complexity: NP-completeness

Theorem The 3-in-p-CoSchedule problem is strongly NP-complete.

Proof. We use a reduction from 3-Partition: given an integer B and 3n
integers a1, . . . , a3n, can we partition the 3n integers into n triplets,
each of sum B? This problem is strongly NP-hard, so the ai's and B can be
encoded in unary. We build an instance I2 of 3-in-p-CoSchedule with p = B
processors, a deadline D = n, and 3n tasks Ti such that

    ti,j = 1 + 1/ai  if j < ai,    ti,j = 1  otherwise.

(The ti,j's satisfy the constraints on work and execution time.) Any
solution of I2 has n packs, each of cost 1, with exactly 3 tasks per pack,
and the processor requirements of the tasks in a pack sum up to B.

SLIDE 13

Complexity: NP-completeness

Theorem For k ≥ 3, the k-in-p-CoSchedule problem is strongly NP-complete.

Proof. We reduce from the same instance of the 3-in-p-CoSchedule problem,
to which we add n(k − 3) buffer tasks such that

    ti,j = max( (B + 1)/j , 1 );

the number of processors is now p = B + (k − 3)(B + 1), and the deadline
remains D = n. Again, we need to execute each pack in unit time, with at
most n packs. The only way to proceed is to execute, within each pack,
k − 3 buffer tasks on B + 1 processors each.

SLIDE 15

Scheduling a pack of tasks

Theorem Given k tasks to be scheduled on p processors in a single pack
(1-pack-schedule), we can find in time O(p log k) the schedule that
minimizes the cost of the pack.

Greedy algorithm Optimal-1-pack-schedule:
  • Initially, each task Ti is assigned its minimum number of processors pi
  • While there remain available processors, assign one to the task with the
    longest execution time (under its current processor assignment)
  • This algorithm returns an optimal solution
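The greedy allotment can be sketched with a max-heap keyed on current execution times; this is a sketch under the slide's assumptions, with task times given as hypothetical dictionaries indexed by processor count:

```python
import heapq

def optimal_1_pack_schedule(tasks, p):
    """tasks: list of (p_min, t) pairs, where t[j] is the execution time
    on j processors. Returns (pack cost, processor allotment)."""
    alloc = [p_min for p_min, _ in tasks]
    # Max-heap on current execution time (heapq is a min-heap, so negate).
    heap = [(-t[alloc[i]], i) for i, (_, t) in enumerate(tasks)]
    heapq.heapify(heap)
    for _ in range(p - sum(alloc)):   # hand out one extra processor at a time
        _, i = heapq.heappop(heap)    # task with the longest current time
        alloc[i] += 1
        heapq.heappush(heap, (-tasks[i][1][alloc[i]], i))
    cost = max(tasks[i][1][alloc[i]] for i in range(len(tasks)))
    return cost, alloc

tasks = [(1, {1: 10, 2: 6, 3: 5, 4: 4}), (1, {1: 4, 2: 3, 3: 2, 4: 2})]
print(optimal_1_pack_schedule(tasks, 4))  # -> (5, [3, 1])
```

Each of the at most p − Σ pi iterations costs O(log k) heap work, matching the O(p log k) bound.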

SLIDE 17

Optimal solution

Theorem The following integer linear program characterizes the
k-in-p-CoSchedule problem, where the unknowns are the Boolean variables
xi,j,b and the rational variables yb, for 1 ≤ i, b ≤ n and 1 ≤ j ≤ p:

    Minimize Σb=1..n yb
    subject to
    (i)   Σj,b xi,j,b = 1,          1 ≤ i ≤ n
    (ii)  Σi,j xi,j,b ≤ k,          1 ≤ b ≤ n
    (iii) Σi,j j × xi,j,b ≤ p,      1 ≤ b ≤ n
    (iv)  xi,j,b × ti,j ≤ yb,       1 ≤ i, b ≤ n, 1 ≤ j ≤ p

Here xi,j,b = 1 iff Ti is in pack b and executed on j processors, and yb is
the execution time of pack b.
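On tiny instances, the value this ILP minimizes can be cross-checked by exhaustive search. A brute-force sketch (assuming pi = 1 for all tasks, for simplicity; it is not the algorithm from the slides):

```python
from itertools import product

def pack_cost(pack, t, p):
    """Best cost of one pack: minimize the longest execution time over all
    allotments of at most p processors (each task gets at least 1)."""
    return min(max(t[i][j] for i, j in zip(pack, alloc))
               for alloc in product(range(1, p + 1), repeat=len(pack))
               if sum(alloc) <= p)

def partitions(items, k):
    """All partitions of items into packs of at most k tasks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest, k):
        for g in range(len(part)):
            if len(part[g]) < k:
                yield part[:g] + [part[g] + [first]] + part[g + 1:]
        yield part + [[first]]

def opt_coschedule(t, p, k):
    """Minimum total execution time over all legal co-schedules."""
    return min(sum(pack_cost(pack, t, p) for pack in part)
               for part in partitions(list(range(len(t))), k))

# Three perfectly parallel tasks of work 4 on p = 4 processors:
t = [{j: 4 / j for j in range(1, 5)} for _ in range(3)]
print(opt_coschedule(t, p=4, k=3))  # -> 3.0 (a single pack would cost 4.0)
```

This enumeration is exponential and only useful for validating heuristics on very small inputs, which is exactly the role the exhaustive search plays for Workload-I later in the talk.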

SLIDE 18

Approximation algorithm

3-approximation algorithm for the p-in-p-CoSchedule problem:
  • Initialization: task Ti executed on pi processors
  • Greedy procedure Make-pack creates the packs (with k = p), given σ(i)
    processors for each task Ti

    procedure Make-pack(n, p, k, σ)
    begin
        L: list of tasks sorted by non-increasing execution times ti,σ(i);
        while L ≠ ∅ do
            Schedule the current task on the first pack with enough
            available processors and fewer than k tasks;
            Create a new pack if no existing pack fits;
            Remove the current task from L;
        end
        return the set of packs
    end
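A Python sketch of Make-pack (first-fit decreasing; the shapes of `sigma` and `t` are hypothetical):

```python
def make_pack(n, p, k, sigma, t):
    """Greedily pack tasks 0..n-1, task i needing sigma[i] processors,
    in non-increasing order of t[i][sigma[i]] (first-fit decreasing)."""
    packs, free = [], []              # pack contents / free processors
    for i in sorted(range(n), key=lambda i: -t[i][sigma[i]]):
        for b in range(len(packs)):
            if sigma[i] <= free[b] and len(packs[b]) < k:
                packs[b].append(i)
                free[b] -= sigma[i]
                break
        else:                         # no existing pack fits: open a new one
            packs.append([i])
            free.append(p - sigma[i])
    return packs

sigma = [3, 2, 2, 1]
t = [{3: 5}, {2: 4}, {2: 3}, {1: 2}]
print(make_pack(4, 4, 4, sigma, t))  # -> [[0, 3], [1, 2]]
```

In the example, the longest task (0) opens the first pack, task 1 does not fit there and opens a second, task 2 joins task 1, and the shortest task slips into the leftover processor of the first pack.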

SLIDE 19

pack-Approx: iteratively refine the solution, adding a processor to the
task with the longest execution time.

    procedure pack-Approx(T1, . . . , Tn)
    begin
        COST = +∞;
        for j = 1 to n do σ(j) ← pj;
        for i = 0 to Σj (p − pj) − 1 do
            Call Make-pack(n, p, p, σ);
            Let COSTi be the cost of the co-schedule;
            if COSTi < COST then COST ← COSTi;
            Let Atot(i) = Σj=1..n tj,σ(j) × σ(j);
            Let Tj⋆ be a task that maximizes tj,σ(j);
            if (Atot(i) > p × tj⋆,σ(j⋆)) or (σ(j⋆) = p) then
                return COST
            else σ(j⋆) ← σ(j⋆) + 1
        end
        return COST;
    end
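The refinement loop can be sketched as follows; this compact rendition inlines a first-fit packing with k = p, and the data shapes (per-task time dictionaries) are hypothetical:

```python
def pack_approx(t, p_min, p):
    """Start from minimal allotments; repeatedly give one extra processor
    to the currently longest task, re-pack greedily, keep the best cost."""
    n = len(t)
    sigma = list(p_min)

    def packed_cost():
        # First-fit decreasing packing (k = p: no limit on tasks per pack).
        packs, free = [], []
        for i in sorted(range(n), key=lambda i: -t[i][sigma[i]]):
            for b in range(len(packs)):
                if sigma[i] <= free[b]:
                    packs[b].append(i); free[b] -= sigma[i]; break
            else:
                packs.append([i]); free.append(p - sigma[i])
        # Cost of a co-schedule: sum over packs of the longest task.
        return sum(max(t[i][sigma[i]] for i in pack) for pack in packs)

    best = float("inf")
    for _ in range(sum(p - pm for pm in p_min)):
        best = min(best, packed_cost())
        a_tot = sum(t[i][sigma[i]] * sigma[i] for i in range(n))
        j_star = max(range(n), key=lambda i: t[i][sigma[i]])
        if a_tot > p * t[j_star][sigma[j_star]] or sigma[j_star] == p:
            return best
        sigma[j_star] += 1
    return best

t = [{1: 4, 2: 2}, {1: 4, 2: 2}]
print(pack_approx(t, p_min=[1, 1], p=2))  # -> 4
```

In the example, co-scheduling both tasks on one processor each (cost 4) ties with running them one after the other on both processors, and the loop keeps the best cost seen.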

SLIDE 20

Theorem pack-Approx is a 3-approximation algorithm for the
p-in-p-CoSchedule problem.

Involved proof, studying the different ways to exit algorithm pack-Approx:
  • The task with the longest execution time is already assigned p processors
  • The sum of the works of all tasks (Σi=1..n ti,σ(i) × σ(i)) is greater
    than p times the longest execution time
  • Each task has been assigned p processors

SLIDE 21

1. Problem definition
2. Theoretical results
3. Heuristics
4. Simulations
5. Conclusion

SLIDE 22

Heuristics

In all heuristics (even the random ones), once the packs are chosen, we
always run Optimal-1-pack-schedule on each pack.

  • Random-Pack: generates the packs randomly: randomly chooses an integer j
    between 1 and k, then randomly selects j tasks to form a pack.
  • Random-Proc: assigns the number of processors to each task randomly,
    then calls Make-pack to generate the packs.
  • pack-by-pack(ε): creates packs that are "well-balanced": the difference
    between the shortest and longest execution times within a pack is small
    (ratio of at most 1 + ε).
  • pack-Approx: an extension of the approximation algorithm to the case
    where there are at most k tasks per pack.
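The grouping step of Random-Pack can be sketched as follows (a sketch of the description above; the `rng` parameter is an addition for reproducibility):

```python
import random

def random_pack(n, k, rng=random):
    """Random-Pack grouping step: repeatedly draw a pack size j in 1..k,
    then draw j of the remaining tasks to form a pack."""
    remaining = list(range(n))
    packs = []
    while remaining:
        j = rng.randint(1, min(k, len(remaining)))
        pack = [remaining.pop(rng.randrange(len(remaining)))
                for _ in range(j)]
        packs.append(pack)
    return packs

print(random_pack(6, 3, random.Random(0)))
```

The processor allotment inside each resulting pack is then fixed by Optimal-1-pack-schedule, as stated above.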

SLIDE 23

Heuristic variants

Improvement of the heuristics by using up to 9 runs:
  • The random heuristics with either one or nine runs:
    Random-Pack-1, Random-Pack-9, Random-Proc-1, Random-Proc-9
  • pack-by-pack(ε) with either a single run with ε = 0.5 (pack-by-pack-1)
    or 9 runs with ε ∈ {0.1, 0.2, . . . , 0.9} (pack-by-pack-9)
  • Only one version of pack-Approx

Further variants: up to 99 runs, or a better choice when creating packs in
pack-by-pack, but only little improvement at the price of a much higher
running time.

SLIDE 25

1. Problem definition
2. Theoretical results
3. Heuristics
4. Simulations
5. Conclusion

SLIDE 26

Workloads

  • Workload-I: 10 parallel scientific applications (involving VASP, ABAQUS,
    LAMMPS, PETSc); execution times observed on a cluster with p = 16
    processors and 128 cores
  • Workload-II: synthetic test suite with 65 tasks for 128 cores (p = 16);
    execution time for problem size m on q cores:

        t(m, q) = f × t(m, 1) + (1 − f) × t(m, 1) / q + κ(m, q)

    where f is the inherently serial fraction and κ covers overheads related
    to synchronization and communication
  • Workload-III: similar to Workload-II, but with 260 tasks for 256 cores
    (p = 32)
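The synthetic-time formula translates directly into code; the overhead term κ is only described qualitatively on the slide, so the version below takes it as a caller-supplied function (a hypothetical placeholder):

```python
def synthetic_time(t_seq, f, q, kappa=lambda q: 0.0):
    """t(m, q) = f * t(m, 1) + (1 - f) * t(m, 1) / q + kappa(m, q),
    with t_seq standing for t(m, 1); kappa defaults to no overhead."""
    return f * t_seq + (1 - f) * t_seq / q + kappa(q)

# Serial fraction f = 0.2, sequential time 100, 4 cores, no overhead:
print(synthetic_time(100.0, 0.2, 4))  # -> 40.0 (20 serial + 80/4 parallel)
```

With f > 0 and a non-decreasing κ, tasks generated this way satisfy the non-increasing-time and non-decreasing-work assumptions of the model.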

SLIDE 27

Assessing the performance of heuristics

Seven heuristics and three measures:
  • Relative cost: cost divided by the cost of the schedule that runs each
    task on all p processors in sequence (the schedule used in practice,
    n-packs-schedule)
  • Packing ratio: total work Σi=1..n ti,σ(i) × σ(i) divided by p times the
    cost of the co-schedule; close to 1 if there is little idle time
  • Relative response time: mean response time compared to the
    n-packs-schedule executed in non-decreasing order of execution times
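For instance, the packing ratio can be computed as follows (the shapes of `t`, `sigma` and `packs` are hypothetical):

```python
def packing_ratio(t, sigma, packs, p):
    """Total work divided by (p x cost of the co-schedule); a value close
    to 1 means the processors are almost never idle."""
    total_work = sum(t[i][sigma[i]] * sigma[i] for i in range(len(t)))
    cost = sum(max(t[i][sigma[i]] for i in pack) for pack in packs)
    return total_work / (p * cost)

# Two tasks, each on 2 of the p = 4 processors, in a single pack:
t = [{2: 3.0}, {2: 3.0}]
print(packing_ratio(t, sigma=[2, 2], packs=[[0, 1]], p=4))  # -> 1.0
```

A ratio below 1 reveals idle processor-time, either inside a pack (unused processors) or because short tasks finish before the longest task of their pack.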

SLIDE 28

Results: Relative cost

[Figure: relative cost vs. pack size for Workload-I and Workload-II, for
the seven heuristics]

  • Horizontal line = optimal co-schedule (exhaustive search for W-I)
  • pack-Approx and pack-by-pack are close to optimal
  • Gain of more than 35% compared to n-packs-schedule for W-I
  • Huge gains for W-II (more than 80%, better for larger pack sizes)

SLIDE 29

Results: Packing ratio

[Figure: packing ratio vs. pack size for Workload-I and Workload-II, for
the seven heuristics]

  • Packing ratios very close to one for pack-by-pack and pack-Approx
  • High-quality packings

SLIDE 30

Results: Response time

[Figure: relative response time vs. pack size for Workload-I and
Workload-II, for the seven heuristics]

  • Values less than 1: improvements in response times
  • For Workload-II and larger pack sizes, response-time gains over 80%
  • k-in-p-CoSchedule is attractive from the user perspective

SLIDE 31

Results: Workload-III

[Figure: relative cost, packing ratio and relative response time vs. pack
size for Workload-III, for the seven heuristics]

  • Scalability trends with 260 tasks on 32 processors
  • pack-Approx and pack-by-pack are clearly superior

SLIDE 32

Results: Running times

                     Workload-I   Workload-II   Workload-III
    pack-Approx         0.50         0.30           5.12
    pack-by-pack-1      0.03         0.12           0.53
    pack-by-pack-9      0.30         1.17           5.07
    Random-Pack-1       0.07         0.34           9.30
    Random-Pack-9       0.67         2.71          87.25
    Random-Proc-1       0.05         0.26           4.49
    Random-Proc-9       0.47         2.26          39.54

Average running times in milliseconds:
  • All heuristics are fast (under 100 ms even for W-III)
  • Random heuristics are slower (cost of random number generation)
  • pack-by-pack-9 is comparable with pack-Approx

SLIDE 33

1. Problem definition
2. Theoretical results
3. Heuristics
4. Simulations
5. Conclusion

SLIDE 34

Conclusion

Theoretically: exhaustive complexity study
  • NP-completeness (need to choose, for each task, both the number of
    processors and the pack)
  • Optimal strategy once the packs are formed
  • Efficient algorithm to partition tasks with pre-assigned resources into
    packs (3-approximation algorithm for k = p)

Practically: heuristics building upon the theoretical study, with very good
performance
  • Heuristic of choice: pack-by-pack-9
  • Great improvement compared to existing schedulers (in terms of relative
    cost)
  • Corresponding savings in system energy cost
  • Measurable benefits in average response time

SLIDE 35

Future work

  • Combine with DVFS (dynamic voltage and frequency scaling) to obtain
    further gains in energy consumption
  • Experiment at a larger scale (university computing facilities), where
    workload attributes do not vary much over time and energy costs are a
    limiting factor
  • Theoretically, obtain more approximation results
