SLIDE 1

Practical Steady-State Scheduling for Tree-Shaped Task Graphs

Sékou DIAKITÉ1, Loris MARCHAL2, Jean-Marc NICOD1, Laurent PHILIPPE1 - 19/11/2009

1: Laboratoire d’Informatique de Franche-Comté Université de France Comté, France 2: Laboratoire de l’Informatique du Parallélisme CNRS - INRIA - Université de Lyon, France

slide-2
SLIDE 2

Outline

• Scheduling problem
• Principle of steady-state scheduling: overview, shortcomings
• Reducing the latency: dependencies, mixed integer program, heuristic approach
• Using non-conservative steady-state solutions
• Experimental results: simulation settings, inter-period dependencies, scheduling efficiency, number of running instances, running time of the algorithms
• Synthesis

DIAKITÉ, MARCHAL, NICOD, PHILIPPE ROMA/GRAAL working Group - 19/11/2009 2 / 43

SLIDE 3

Scheduling problem

Definitions

Execution platform

undirected graph, Gp = (Vp, Ep)

• Vp = {P1, ..., Pn}: n processors
• Ep: communication links between the processors
• bidirectional one-port model; ci,j is the time needed to send a unit of data from Pi to Pj

Example

(figure: example platform graph connecting processors P1, P2, P3, P4)

SLIDE 4

Scheduling problem

Definitions

Application

DAG with no forks (in-trees), Ga = (Va, Ea)

Va = {T1, ..., Tk} : k tasks

unrelated computation model, wi,k : time needed by Pi to execute Tk

Ea: dependencies between tasks

Fk,l is the amount of data (File) produced by Tk and consumed by Tl

(figure: example in-tree task graph with tasks T1, T2, T3, T4, to be executed 10 to 1000 times)
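The platform and application definitions above can be sketched as plain Python data structures. This is an illustrative encoding only: the processor/task names and all numeric values are assumptions, not taken from the paper.

```python
from collections import Counter

# Platform: undirected graph Gp = (Vp, Ep) with bidirectional one-port
# links; c[(i, j)] is the time to send one unit of data from Pi to Pj.
processors = ["P1", "P2", "P3", "P4"]
c = {("P1", "P3"): 2.0, ("P2", "P3"): 1.0, ("P3", "P4"): 3.0}

# Application: an in-tree Ga = (Va, Ea); w[(i, k)] is the time needed by
# Pi to execute Tk (unrelated machines), F[(k, l)] the size of the file
# produced by Tk and consumed by Tl.
tasks = ["T1", "T2", "T3", "T4"]
w = {("P1", "T1"): 2.0, ("P2", "T1"): 5.0}   # partial, for illustration
F = {("T1", "T4"): 1.0, ("T2", "T4"): 2.0, ("T3", "T4"): 1.0}

# In-tree check: every task has at most one successor (no forks).
out_deg = Counter(src for src, _ in F)
assert all(out_deg[t] <= 1 for t in tasks)
print(out_deg)
```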

SLIDE 5

Scheduling problem

How do we proceed?

Problem

Executing a batch of graphs (from 10 to 1000)

Objective

Minimizing the makespan Cmax

Chosen method

Steady-state technique which is asymptotically optimal (throughput)

SLIDE 7

Principle of steady-state scheduling

Overview

This study is based on

  • O. Beaumont, A. Legrand, L. Marchal and Y. Robert. Steady-state scheduling on heterogeneous clusters. Int. J. of Foundations of Computer Science, 16(2):163–194, 2005.

SLIDE 8

Principle of steady-state scheduling

Overview

Converting the scheduling problem into a linear program

the steady state is characterized by activity variables:

• the average number of Tk processed by Pi in one time unit
• the average number of Fk,l sent by Pi to Pj in one time unit

these activity variables allow us to write constraints:

• on processor speeds and link bandwidths
• "conservation laws" stating that each Fk,l has to be produced by Tk and consumed by Tl

these constraints describe a valid steady-state schedule; adding the objective of maximizing the steady-state throughput yields a linear program
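A minimal sketch of such a throughput-maximizing LP, under strong simplifying assumptions (one task type, two processors, communication ignored; this is not the paper's full formulation). The activity alpha_i is the average number of tasks processed by Pi per time unit, constrained by the processor speed.

```python
from scipy.optimize import linprog

# Unrelated processing times: w_i time units per task on processor Pi.
# Each processor can be busy at most 1 time unit per time unit, so
# w_i * alpha_i <= 1. Objective: maximize throughput sum(alpha_i).
w = [2.0, 4.0]
c = [-1.0, -1.0]                      # linprog minimizes, so negate
A_ub = [[w[0], 0.0], [0.0, w[1]]]     # w_i * alpha_i <= 1
b_ub = [1.0, 1.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
alpha, rho = res.x, -res.fun
print(alpha, rho)                     # alpha = [0.5, 0.25], rho = 0.75
```

Each processor saturates its own constraint, and the optimal throughput is the sum of the individual rates 1/w_i.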

SLIDE 9

Principle of steady-state scheduling

Overview

From the linear program to a periodic schedule (period)

the optimal solution of the linear program gives rational activities, but we cannot split tasks and files

→ the period length L is the LCM of the activities' denominators
→ multiplying every activity by L makes all activities integers

L is large but bounded; the period allows us to schedule any number of graphs. The final schedule consists of 3 phases:

• initialization
• steady state: n periods
• clean-up
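The LCM step can be sketched in a few lines of Python, with hypothetical rational activities standing in for an LP solution:

```python
from fractions import Fraction
from math import lcm

# Hypothetical rational activities from the LP solution: average number
# of occurrences of each activity per time unit.
activities = [Fraction(1, 2), Fraction(1, 3), Fraction(5, 6)]

# Period length L = LCM of the denominators; multiplying every activity
# by L turns it into an integer count per period.
L = lcm(*(a.denominator for a in activities))
counts = [int(a * L) for a in activities]
print(L, counts)   # 6 [3, 2, 5]
```

This also shows why L can blow up: one activity with a large denominator inflates the LCM, which is the motivation for the non-conservative reduction later in the talk.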

SLIDE 10

Principle of steady-state scheduling

Overview

Example

(figure: platform graph P1, P2, P3; task graph T1 → T2 with file F1,2; two allocations A1 and A2; Gantt chart of one steady-state period of length L showing T1, T2 and the transfers F1,2 on processors P1, P2, P3 and on links P1 → P3 and P2 → P3)

SLIDE 12

Principle of steady-state scheduling

Shortcomings

Long latency

several periods are necessary to process an instance

→ a drawback for interactive applications
→ leads to large buffers: at every time step, a large number of ongoing jobs has to be stored

Long initialization and clean-up phases

the period contains a large number of ongoing jobs

→ long initialization phase to enter steady-state → long clean-up phase to leave steady-state

initialization and clean-up are done with heuristic scheduling

→ we lose the benefit of the optimal steady-state phase

SLIDE 13

Principle of steady-state scheduling

Shortcomings

Addressing the shortcomings

the original steady-state algorithm reaches a good Cmax as soon as the number of instances is large enough; in this study, we aim at reducing this threshold

SLIDE 14

Principle of steady-state scheduling

Addressing the shortcomings

Means of action

decrease the length of the period

hard to do when we want to keep an optimal period

reduce the latency (inter/intra dependencies)

side benefit : less work to do in initialization and clean-up (gain on Cmax)

reduce the period length by allowing a small reduction of the throughput

side benefit : reducing the latency

SLIDE 16

Reducing the latency

Dependencies

How to reduce the latency?

Intra-period dependencies.

The original steady-state (only inter-period dependencies)

(figure: platform graph P1, P2, P3 and task graph T1 → T2 with file F1,2; Gantt chart of the original steady-state schedule, where each period runs two instances of each task (e.g. T1^2n, T1^2n+1) and every file F1,2 produced in one period is consumed only in a later period: all dependencies are inter-period)

SLIDE 17

Reducing the latency

Dependencies

The steady-state with intra-period dependencies

(figure: the same schedule after reorganization; the Gantt chart shows periods n, n+1, n+2 with both dependency kinds annotated: files such as F1,2^2n are now consumed by T2^2n within the period that produced them, turning inter-period dependencies into intra-period ones)

SLIDE 19

Reducing the latency

Mixed Integer Program

Ordering

Tasks (Tj, Tk) on the same processor Pi

• binary variable yj,k = 1 if and only if Tj is processed before Tk
• tj is the starting time of task Tj; L is the length of the period

tj − tk ≥ −yj,k × L   (1)
yj,k + yk,j = 1   (2)
tk − (tj + wi,j) ≥ (yj,k − 1) × L   (3)
tj + wi,j ≤ L   (4)
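Constraints (1)-(4) can be checked on toy values. The snippet below only verifies one candidate assignment of y and t (the values are assumptions for illustration); a real MIP solver such as CPLEX searches over these variables.

```python
# Two tasks Tj, Tk on the same processor Pi, period length L = 10.
def feasible(t, y, w, L):
    a, b = "Tj", "Tk"
    return (
        t[a] - t[b] >= -y[(a, b)] * L                     # (1)
        and y[(a, b)] + y[(b, a)] == 1                    # (2)
        and t[b] - (t[a] + w[a]) >= (y[(a, b)] - 1) * L   # (3)
        and t[a] - (t[b] + w[b]) >= (y[(b, a)] - 1) * L   # (3), other order
        and t[a] + w[a] <= L and t[b] + w[b] <= L         # (4)
    )

w = {"Tj": 3.0, "Tk": 4.0}                # processing times on Pi
t = {"Tj": 0.0, "Tk": 3.0}                # candidate starting times
y = {("Tj", "Tk"): 1, ("Tk", "Tj"): 0}    # Tj ordered before Tk
print(feasible(t, y, w, L=10.0))          # True
```

Note how (3) is inactive (right-hand side −L) when yj,k = 0, and becomes the no-overlap condition when yj,k = 1.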

SLIDE 20

Reducing the latency

Mixed Integer Program

Dependencies

For each dependency Tj → Tk

• binary variable ej,k = 1 for an intra-period dependency (ej,k = 0: inter-period)

tk − (tj + wi,j) ≥ (ej,k − 1) × L   (5)

Objective

Maximize Σj,k ej,k under the constraints (1), (2), (3), (4) and (5)

SLIDE 22

Reducing the latency

Heuristic approach

Limitations

The heuristic algorithm is not allowed to move tasks inside the period.

Algorithm 1: Heuristic algorithm

IntraDep ← {}
Prod ← set of all sources of the dependencies (sorted by completion time)
Cons ← set of all destinations of the dependencies (sorted by starting time)
forall Tsrc ∈ Prod do
    forall Tdst ∈ Cons do
        if there is a dependency Tsrc → Tdst then
            if end(Tsrc) ≤ start(Tdst) then
                remove Tdst from Cons
                IntraDep ← IntraDep ∪ {(Tsrc → Tdst)}
                continue with the next Tsrc
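A runnable sketch of Algorithm 1, with assumed task records (names and times are illustrative). A dependency (src, dst) becomes intra-period when the source occurrence finishes before the destination occurrence starts within the same period; tasks are never moved.

```python
def greedy_intra_deps(deps, end, start):
    """deps: list of (src, dst); end/start: completion and starting
    times of each task occurrence inside the period."""
    prod = sorted({s for s, _ in deps}, key=lambda t: end[t])
    cons = sorted({d for _, d in deps}, key=lambda t: start[t])
    dep_set = set(deps)
    intra = []
    for src in prod:
        for dst in list(cons):
            if (src, dst) in dep_set and end[src] <= start[dst]:
                cons.remove(dst)
                intra.append((src, dst))
                break   # "continue": move on to the next source
    return intra

# Toy period: T1a ends at 2, T1b ends at 5; consumers start at 3 and 4.
deps = [("T1a", "T2a"), ("T1b", "T2b")]
end = {"T1a": 2, "T1b": 5}
start = {"T2a": 3, "T2b": 4}
print(greedy_intra_deps(deps, end, start))   # [('T1a', 'T2a')]
```

Only the first dependency can stay inside the period; the second (source ending at 5, destination starting at 4) remains inter-period.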

SLIDE 24

Using non-conservative solutions

Motivation

How to reduce the period length?

• one of the main drawbacks of the method
• the period length comes from the solution of a linear program
• it is hard to find another optimal solution with a smaller period

→ modify the solution

A sub-optimal solution

• decreasing the system throughput gains flexibility on the period length
→ our claim: the period can be shortened a lot at the cost of a slight reduction of the throughput

side benefits: shorter latencies and smaller buffers

SLIDE 25

Using non-conservative solutions

Principle

Steady-state scheduling and allocations

• allocations: A1, …, Am
• throughput of Ak: ρk = αk/βk, and total throughput ρ = Σk ρk
• period length T = lcmk βk → in one period, Ak is processed T × αk/βk ∈ ℕ times

Influence of a large value of βk

• contributes a small amount to the total throughput
• responsible for a large period size
→ suppress βk from the computation of T (scale to ⌊(αk × T)/βk⌋)

hope: the loss in ρ is compensated by a much shorter T
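A numeric sketch of this trade-off, with assumed allocation throughputs αk/βk (the values below are illustrative, not from the experiments):

```python
from math import lcm

alloc = [(1, 2), (1, 3), (1, 100)]             # (alpha_k, beta_k)

T_full = lcm(*(b for _, b in alloc))           # 300, dominated by beta = 100
rho_full = sum(a / b for a, b in alloc)        # ~0.8433

# Drop the largest beta from the LCM and round the per-period counts
# down: floor(alpha_k * T / beta_k) occurrences of A_k per period.
T_short = lcm(2, 3)                            # 6
counts = [a * T_short // b for a, b in alloc]  # [3, 2, 0]
rho_short = sum(counts) / T_short              # ~0.8333

print(T_full, T_short)   # period 50x shorter for ~1.2% throughput loss
```

The allocation with β = 100 contributed only 1% of the throughput but multiplied the period by 50; sacrificing it shortens the period dramatically.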

SLIDE 26

Using non-conservative solutions

Algorithm

Algorithm 2: Shorten the period of the steady-state schedule

Data: Ntotal instances, m allocations Ak with throughput ρk = αk/βk.
Parameters: K = 0.25 (maximum ratio initialization/total) and L = 0.85 (maximum degradation).

Sort the allocations by non-increasing βk, so that β1 ≥ β2 ≥ ⋯ ≥ βm
Ninit ← estimateInitTermJobCount(ρ1, …, ρm)
i ← 1; ρorig ← Σ(k=1..m) αk/βk; ρ ← ρorig
while i < m − 1 and Ninit/Ntotal > K and ρ > L × ρorig do
    T ← lcm{βi, …, βm}
    foreach allocation Ak in {A1, …, Am} do
        ρk_rollback ← ρk
        ρk ← ⌊(αk × T)/βk⌋ / T
    ρ ← Σ(k=1..m) ρk
    Ninit ← estimateInitTermJobCount(ρ1, …, ρm)
    i ← i + 1
    if Ninit/Ntotal ≤ K or ρ ≤ L × ρorig then
        foreach allocation Ak in {A1, …, Am} do
            ρk ← ρk_rollback
return (ρ1, …, ρm)

SLIDE 28

Experimental results

Comparison of 6 algorithms

• the original steady-state implementation
• the steady-state with the reduction of inter-period dependencies using a MIP (steady-state+MIP)
• the steady-state with the reduction of inter-period dependencies using the greedy heuristic (steady-state+greedy)
• the steady-state with the non-conservative period reduction (steady-state+suboptimal)
• the steady-state with both the greedy heuristic and the non-conservative period reduction (steady-state+heuristic+suboptimal)
• a classical list-scheduling algorithm based on HEFT

SLIDE 29

Experimental results

Simulation settings

Simulator

• results are obtained with a simulator built on top of SimGrid
• simulations of 200 random platform/application scenarios
• batches from 1 to 1000 task graphs
• MIP solving using CPLEX

Limitations

• the MIP solver was able to find a solution within 15 minutes for 142 SIMPLE scenarios
• in the GENERAL case, we do not give results for the MIP approach

SLIDE 31

Experimental results

Inter-period dependencies

Results

• SIMPLE case: the MIP solves 32% of the intra-period dependencies
• SIMPLE case: the heuristic solves 26% of the intra-period dependencies
• GENERAL case: the heuristic solves 25% of the intra-period dependencies

Notes

both the MIP and the heuristic perform well at resolving intra-period dependencies; how do they perform on other metrics?

SLIDE 33

Experimental results

Scheduling efficiency

Efficiency

ratio to the optimal throughput obtained by an algorithm

One complex example

(figure: efficiency, as the ratio to the optimal throughput, versus number of jobs (1 to 10000) on one complex example, for steady-state, steady-state+heuristic, steady-state+suboptimal, steady-state+heuristic+suboptimal and the list-scheduling heuristic; efficiency ranges roughly from 0.4 to 0.9)

SLIDE 34

Experimental results

Scheduling efficiency

SIMPLE

(figure: percentage of SIMPLE experiments with efficiency above 90%, versus number of jobs (100 to 10000), for steady-state, steady-state+heuristic, steady-state+MIP, steady-state+suboptimal, steady-state+heuristic+suboptimal and list-scheduling)

GENERAL

(figure: percentage of GENERAL experiments with efficiency above 90%, versus number of jobs (100 to 10000), for the same algorithms without the MIP)

Notes

Proportion of scenarios where we reach 90% of the optimal throughput

SLIDE 36

Experimental results

Number of running instances

Example

(figure: number of jobs being processed (up to about 500) over simulation time, for one example run)

Results

• SIMPLE case: the MIP induces 30% fewer running instances
• SIMPLE case: the heuristic induces 24% fewer running instances
• GENERAL case: heuristic + non-conservative reduction induces 37% fewer running instances (548 down to 126)
→ smaller buffers

SLIDE 37

Experimental results

Latency and buffer size

SIMPLE and GENERAL

TAB.: Performance of the algorithms in latency and buffer size, relative to the original steady-state implementation (smaller latency and number of running jobs are better).

                         SIMPLE scenarios                       GENERAL scenarios
Algorithm                avg. lat.  max. lat.  max. jobs        avg. lat.  max. lat.  max. jobs
MIP                      94%        67%        70%              N/A        N/A        N/A
heuristic                95%        74%        76%              90%        90%        93%
suboptimal               100%       100%       100%             53%        93%        88%
heuristic+suboptimal     95%        74%        75%              33%        67%        63%

NB: GENERAL cases include the SIMPLE cases too, so the decrease is substantial for complex cases

SLIDE 39

Experimental results

Running time of the algorithms

SIMPLE

(figure: average CPU time in seconds (log scale, 1 to 1000) versus number of jobs (100 to 10000) for SIMPLE scenarios: steady-state, steady-state + heuristic, steady-state + MIP, steady-state suboptimal, steady-state suboptimal + heuristic, list-scheduling heuristic)

GENERAL

(figure: average CPU time in seconds versus number of jobs for GENERAL scenarios: the same algorithms without the MIP)

Notes

Average CPU-time in seconds

SLIDE 41

Synthesis

Conclusion

Summary

study of an adaptation of steady-state techniques to practical conditions:

• medium-size batches
• performance metric (throughput) and practical interest (latency and buffers)

two optimizations:

• dependency reorganization (NP-complete: MIP + heuristic)
• shortening the period by decreasing the throughput (< 15%)

measurement of the impact of our optimizations (efficiency, buffer size, latency)

Conclusion

steady-state scheduling is an efficient tool for dealing with collections of task graphs

SLIDE 42

Synthesis

Future work

• steady-state techniques and tolerance to variations of the platform capabilities
• evaluation in a Grid context (cf. the MAO project)

SLIDE 43

Thank you for your attention

Questions ?