On the Impact of Platform Models, EPIT 2007, Arnaud Legrand

SLIDE 1

On the Impact of Platform Models

EPIT 2007 Arnaud Legrand

CNRS/INRIA, LIG laboratory

June 6, 2007

  • A. Legrand (CNRS) INRIA-MESCAL

On the Impact of Platform Models 1 / 108

SLIDE 2

Motivation

◮ Scientific computing: large needs in computation or storage resources.
◮ Need to use systems with “several processors”:
    ◮ Parallel computers with shared/distributed memory
    ◮ Clusters
    ◮ Heterogeneous clusters
    ◮ Clusters of clusters
    ◮ Network of workstations
    ◮ The Grid
    ◮ Desktop Grids
◮ When modeling a platform, communication modeling seems to be the most controversial part.
◮ Two kinds of people produce communication models: those who are concerned with scheduling and those who are concerned with performance evaluation.
◮ All these models are imperfect and intractable.

SLIDE 3

Outline

Part I: Platform Model
Part II: Scheduling Case Study

SLIDE 4

Part I: Platform Model

SLIDE 5

Outline

1. Topology
2. Point to Point Communication Models
3. Modeling Concurrency
4. Remember: This Is a Model, Hence Imperfect

SLIDE 6

Various Topologies Used in the Literature

SLIDE 7

Outline

1. Topology
2. Point to Point Communication Models
    ◮ Hockney
    ◮ LogP and Friends
    ◮ TCP
3. Modeling Concurrency
4. Remember: This Is a Model, Hence Imperfect

SLIDE 8

UET-UCT

Hmm... This one (unit execution time, unit communication time) is mainly used by scheduling theoreticians to prove that their problem is hard and to find out whether there is any hope of proving some clever result.

SLIDE 9

“Hockney” Model

Hockney [Hoc94] proposed the following model for performance evaluation of the Paragon. A message of size $m$ from $P_i$ to $P_j$ requires:

$$t_{i,j}(m) = L_{i,j} + m / B_{i,j}$$

In scheduling, there are three types of “corresponding” models:

◮ Communications are not “splittable” and each communication $k$ is associated with a communication time $t_k$ (accounting for message size, latency, bandwidth, middleware, ...).
◮ Communications are “splittable” but latency is considered to be negligible (linear divisible model): $t_{i,j}(m) = m / B_{i,j}$
◮ Communications are “splittable” and latency cannot be neglected (affine divisible model): $t_{i,j}(m) = L_{i,j} + m / B_{i,j}$

SLIDE 10

LogP

The LogP model [CKP+96] is defined by 4 parameters:

◮ L is the network latency
◮ o is the middleware overhead (message splitting and packing, buffer management, connection, ...) for a message of size w
◮ g is the gap (the minimum time between two consecutive packet communications) between two messages of size w
◮ P is the number of processors/modules

[Figure: timeline of a transfer through Sender, Card, Network, Card and Receiver, annotated with the overheads o, the gaps g and the latency L.]

◮ Sending m bytes with packets of size w: $2o + L + \lceil m/w \rceil \cdot \max(o, g)$
◮ Occupation on the sender and on the receiver: $o + L + \left(\lceil m/w \rceil - 1\right) \cdot \max(o, g)$


SLIDE 12

LogGP & pLogP

The previous model works fine for short messages. However, many parallel machines have special support for long messages, hence a higher bandwidth. LogGP [AISS97] is an extension of LogP in which G captures the bandwidth for long messages:

    short messages: $2o + L + \lceil m/w \rceil \cdot \max(o, g)$
    long messages:  $2o + L + (m - 1)G$

There is no fundamental difference...

OK, it works for small and large messages. Does it work for average-size messages? pLogP [KBV00] is an extension of LogP in which L, o and g depend on the message size m. It also introduces a distinction between $o_s$ and $o_r$ (the send and receive overheads). This is more and more precise, but concurrency is still not taken into account.
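The two cost formulas above can be sketched as plain functions. This is only an illustration: the function names and the numeric parameter values are hypothetical, and the per-packet term assumes the message is cut into ceil(m/w) packets.

```python
import math


def logp_send_time(m, w, L, o, g):
    """LogP time to send m bytes in packets of w bytes: 2o + L + ceil(m/w) * max(o, g)."""
    return 2 * o + L + math.ceil(m / w) * max(o, g)


def loggp_long_message_time(m, L, o, G):
    """LogGP time for a long message of m bytes: 2o + L + (m - 1) * G."""
    return 2 * o + L + (m - 1) * G


# Hypothetical parameters (seconds and bytes), chosen only for illustration.
short_time = logp_send_time(m=4096, w=1024, L=50e-6, o=5e-6, g=10e-6)
long_time = loggp_long_message_time(m=1_000_000, L=50e-6, o=5e-6, G=1e-8)
print(f"short message: {short_time:.6f} s, long message: {long_time:.6f} s")
```

In LogGP the per-byte gap G is the inverse of the long-message bandwidth, so the long-message cost grows linearly with m instead of in packet-sized steps.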

SLIDE 13

Bandwidth as a Function of Message Size

With the Hockney model: $\dfrac{m}{L + m/B}$.

[Figure: bandwidth (Mbit/s, axis from 0 to 1000) as a function of message size (from 1 byte to 16 MB), for MPICH 1.2.6 without optimization and MPICH 1.2.6 with optimization. MPICH, TCP with Gigabit Ethernet.]
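To see how the Hockney expression shapes such a curve, here is a minimal sketch that evaluates m / (L + m/B) for a few message sizes. The latency and bandwidth values are hypothetical, not measurements from the talk.

```python
def hockney_time(m, latency, bandwidth):
    """Hockney transfer time for m bytes: L + m / B."""
    return latency + m / bandwidth


def effective_bandwidth(m, latency, bandwidth):
    """Bandwidth actually observed for a message of m bytes: m / (L + m / B)."""
    return m / hockney_time(m, latency, bandwidth)


# Hypothetical link: 100 microseconds of latency, 125e6 bytes/s peak bandwidth.
LATENCY = 100e-6
PEAK = 125e6
sizes = [1, 1024, 64 * 1024, 16 * 1024 * 1024]
observed = [effective_bandwidth(m, LATENCY, PEAK) for m in sizes]
for m, bw in zip(sizes, observed):
    print(f"{m:>10} bytes -> {bw / PEAK:6.1%} of peak")
```

Small messages are dominated by the latency term, so the observed bandwidth only approaches the peak value B for large m.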


SLIDE 15

What About TCP-based Networks?

The previous models work fine for parallel machines. Most networks use TCP, which has fancy flow-control mechanisms and slow start. Is it valid to use an affine model for such networks? The answer seems to be yes, but the latency and bandwidth parameters have to be carefully measured [LQDB05].

◮ Probing for m = 1b and m = 1Mb leads to bad results.
◮ The whole middleware stack should be benchmarked (theoretical latency is useless because of middleware; theoretical bandwidth is useless because of middleware and latency).

The slow start does not seem to be too harmful. Most people forget that the round-trip time has a huge impact on the bandwidth.

SLIDE 16

Outline

1. Topology
2. Point to Point Communication Models
3. Modeling Concurrency
    ◮ Multi-port
    ◮ Single-port (Pure and Full Duplex)
    ◮ Flows
4. Remember: This Is a Model, Hence Imperfect

SLIDE 17

Multi-ports

◮ A given processor can communicate with as many other processors as it wishes without any degradation.
◮ This model is widely used by scheduling theoreticians (think about all the DAG-with-communications scheduling problems) to prove that their problem is hard and to find out whether there is any hope of proving some clever result. Some theoreticians feel that this model is borderline, especially when allowing duplication or when trying to design algorithms with super tight approximation ratios [Yves Robert 01–??]. Frankly, such a model is totally unrealistic.
◮ Using MPI and synchronous communications, it may not be an issue. However, with multi-core, multi-processor machines, it cannot be ignored...

[Figure: multi-port example with nodes A, B and C; the three communications are labeled 1, 1, 1.]

SLIDE 18

Bounded Multi-port

◮ Assume now that we have threads or multi-core processors. We can write that the sum of the throughputs of all communications (incoming and outgoing) is bounded (by β in the figure). Such a model is OK for wide-area communications [HP04].
◮ Remember, the bounds due to the round-trip time must not be forgotten!

[Figure: bounded multi-port (β) example with nodes A, B and C; the three communications are labeled β/2, β/2, β/2.]

SLIDE 19

Single-port (Pure)

◮ A process can communicate with only one other process at a time. This constraint is generally written as a constraint on the sum of communication times and is thus rather easy to use in a scheduling context (even though it makes problems more complex).
◮ This model makes sense when using non-threaded versions of communication libraries (e.g., MPI). As soon as you are allowed to use threads, bounded multi-port seems a more reasonable option (both for performance and scheduling complexity).

[Figure: 1-port (pure) example with nodes A, B and C; the three communications are labeled 1/3, 1/3, 1/3.]

SLIDE 20

Single-port (Full-Duplex)

At a given time, a process can be engaged in at most one emission and one reception. This constraint is generally written as two constraints: one on the sum of incoming communication times and one on the sum of outgoing communication times.

[Figure: 1-port (full duplex) example with nodes A, B and C; the three communications are labeled 1/2, 1/2, 1/2.]

SLIDE 21

Single-port (Full-Duplex)

This model somehow makes sense when using networks like Myrinet that have few multiplexing units and with protocols without flow control [Mar07]. Even if it does not model complex situations well, such a model is not harmful.

SLIDE 22

Fluid Modeling

When using TCP-based networks, it is generally reasonable to use flows to model bandwidth sharing [MR99, Low03]:

$$\forall l \in L, \quad \sum_{r \in R \text{ s.t. } l \in r} \rho_r \le c_l$$

◮ Income Maximization: maximize $\sum_{r \in R} \rho_r$
◮ Max-Min Fairness: maximize $\min_{r \in R} \rho_r$ (ATM)
◮ Proportional Fairness: maximize $\sum_{r \in R} \log(\rho_r)$ (TCP Vegas)
◮ Potential Delay Minimization: minimize $\sum_{r \in R} \frac{1}{\rho_r}$
◮ Some weird function: minimize $\sum_{r \in R} \arctan(\rho_r)$ (TCP Reno)
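As an illustration of one of these sharing objectives, the sketch below computes a max-min fair allocation with the classical progressive-filling procedure, under the capacity constraints above. The function name, the dictionary-based data layout and the example network are assumptions made for this sketch; every flow is assumed to cross at least one link.

```python
def max_min_fair_rates(capacity, routes):
    """Max-min fair rates by progressive filling.

    capacity: dict link -> capacity c_l
    routes:   dict flow -> list of links used by the flow
    Returns a dict flow -> rate rho_r.
    """
    remaining = dict(capacity)
    active = {r: set(links) for r, links in routes.items()}
    rate = {}
    while active:
        # Equal share that each link can still offer to its unfrozen flows.
        share = {}
        for link, cap in remaining.items():
            n = sum(1 for links in active.values() if link in links)
            if n:
                share[link] = cap / n
        if not share:
            raise ValueError("some flow uses no known link")
        # The most constrained link saturates first and freezes its flows.
        bottleneck = min(share, key=share.get)
        s = share[bottleneck]
        for r in [r for r, links in active.items() if bottleneck in links]:
            rate[r] = s
            for link in active[r]:
                remaining[link] -= s
            del active[r]
    return rate


# Hypothetical two-link network: f2 crosses both links.
rates = max_min_fair_rates(
    {"a": 10.0, "b": 4.0},
    {"f1": ["a"], "f2": ["a", "b"], "f3": ["b"]},
)
print(rates)
```

Here link b is the first bottleneck, so f2 and f3 are frozen at its equal share and f1 picks up what is left on link a.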

SLIDE 24

Flows Extensions

◮ Note that this model is a multi-port model with capacity constraints (like the previous bounded multi-port model).
◮ When latencies are large, using multiple connections makes it possible to get more bandwidth. As a matter of fact, there is very little to lose in using multiple connections...
◮ Therefore many people enforce a sometimes artificial (but less intrusive) bound on the maximum number of connections per link [Wag05, MYCR06].

SLIDE 25

Outline

1. Topology
2. Point to Point Communication Models
3. Modeling Concurrency
4. Remember: This Is a Model, Hence Imperfect

SLIDE 26

Remember: This Is a Model, Hence Imperfect

◮ The previous sharing models are nice, but you generally do not know the other flows...
◮ Communications use the memory bus and hence interfere with computations. Taking such interference into account may become more and more important with multi-core architectures.
◮ Interference between communications is sometimes... surprising.

Modeling is an art. You have to know your platform and your application to know what is negligible and what is important. Even if your model is imperfect, you may still derive interesting results.

SLIDE 27

Part II: Scheduling Case Study

SLIDE 28

Outline

5. Scheduling Divisible Workload
    ◮ Star-like Network Under the Multi-port Model
    ◮ Bus-like Network
    ◮ Star-like Network Under the One-Port Model
    ◮ Multi-round algorithms
6. Iterative Algorithms
7. Data Redistribution

SLIDE 29

Context: Distributed Heterogeneous Platforms

Scheduling divisible load on various architectures [Rob, BGMR96, Bea05, Yan07].

Sources of problems:

◮ Point to point communication model (homogeneous/heterogeneous, with or without latency, ...)
◮ Concurrency impact.

SLIDE 30

Seismic Tomography of the Earth

◮ Model of the inner structure of the Earth
◮ The model is validated by comparing the propagation time of a seismic wave in the model to the actual propagation time.
◮ Set of all seismic events of the year 1999: 817101
◮ Original program written for a parallel computer:

    if (rank = ROOT)
        raydata ← read n lines from data file;
    MPI_Scatter(raydata, n/P, ..., rbuff, ..., ROOT, MPI_COMM_WORLD);
    compute_work(rbuff);

SLIDE 31

Applications Covered by The Divisible Loads Model

◮ Applications made of a very (very) large number of fine-grain computations.
◮ Computation time proportional to the size of the data to be processed.
◮ Independent computations: neither synchronizations nor communications.

SLIDE 32

Outline

5. Scheduling Divisible Workload
    ◮ Star-like Network Under the Multi-port Model
    ◮ Bus-like Network
    ◮ Star-like Network Under the One-Port Model
    ◮ Multi-round algorithms
6. Iterative Algorithms
7. Data Redistribution

SLIDE 33

Star-like Network

◮ The links between the master and the workers have different characteristics.
◮ The workers have different computational power.
◮ Communications from the master to the workers can be done in parallel.

SLIDE 34

Notations

◮ A set $P_1, \ldots, P_p$ of processors
◮ $P_1$ is the master processor: initially, it holds all the data.
◮ The overall amount of work: $W_{total}$.
◮ Processor $P_i$ receives an amount of work $\alpha_i W_{total}$ with $\alpha_i \in \mathbb{Q}$ and $\sum_i \alpha_i = 1$.
    Length of a unit-size work on processor $P_i$: $w_i$.
    Computation time on $P_i$: $\alpha_i W_{total} w_i$.
◮ Time needed to send a unit-message from $P_1$ to $P_i$: $c_i$.
    Communication time on $P_i$: $\alpha_i W_{total} c_i$.
    Multi-port model: $P_1$ can send messages in parallel to all workers.

SLIDE 35

“Optimization” Problem

If all communications start in parallel at time 0, the completion time $T_i$ of processor $P_i$ is equal to:

$$T_i = \alpha_i W_{total} c_i + \alpha_i W_{total} w_i$$

The makespan $T$ of a load distribution is thus equal to:

$$\max_i \alpha_i W_{total} (c_i + w_i) = T$$

Therefore this problem is really trivial, as we just need to note that $\alpha_i = T / (W_{total}(c_i + w_i))$ and $\sum_i \alpha_i = 1$ to get $T$. Hence, we minimize the makespan by setting:

$$\alpha_i = \frac{1}{\sum_j \frac{c_i + w_i}{c_j + w_j}}$$

[Figure: processors P0, P1, P2, P3, P4 and P5.]
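The closed form above is easy to check numerically. In this sketch the function name and the sample costs are hypothetical; the lists `c` and `w` hold the per-unit communication and computation times of the processors that receive load.

```python
def multiport_star_distribution(W_total, c, w):
    """Shares alpha_i proportional to 1 / (c_i + w_i), plus the common makespan T."""
    inverse_cost = [1.0 / (ci + wi) for ci, wi in zip(c, w)]
    total = sum(inverse_cost)
    alpha = [x / total for x in inverse_cost]
    T = W_total / total  # equals alpha_i * W_total * (c_i + w_i) for every i
    return alpha, T


# Hypothetical platform with two processors.
alpha, T = multiport_star_distribution(W_total=12.0, c=[1.0, 1.0], w=[1.0, 3.0])
completion = [a * 12.0 * (ci + wi) for a, ci, wi in zip(alpha, [1.0, 1.0], [1.0, 3.0])]
print(alpha, T, completion)
```

All completion times coincide with T, which is what makes the makespan minimal in this model.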

SLIDE 36

Latencies: just for fun

Let's assume that the time needed to send a message of size $\alpha_i W_{total}$ from $P_1$ to $P_i$ is now equal to:

$$L_i + c_i \alpha_i W_{total}$$

Therefore in the optimal solution, for all $i$ such that $\alpha_i > 0$:

$$L_i + \alpha_i W_{total} (c_i + w_i) = T$$

So just sort the processors by increasing latency and “fill” the $W_{total}$ units of fluid load (the “density” of one unit of load on $P_i$ being equal to $c_i + w_i$).

[Figure: processors P0, P1, P2, P3, P4 and P5.]
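The "fill" procedure can be sketched as follows, assuming the condition stated above: every processor that receives load finishes at the same time T, and a processor whose latency is not below T receives nothing. The function name and the sample values are hypothetical.

```python
def fill_with_latencies(W_total, L, c, w):
    """Return (T, alpha) such that L_i + alpha_i * W_total * (c_i + w_i) = T
    for every processor with alpha_i > 0, and sum(alpha) = 1."""
    order = sorted(range(len(L)), key=lambda i: L[i])  # increasing latency
    num, den = float(W_total), 0.0
    T = float("inf")
    used = []
    for i in order:
        if L[i] >= T:
            break  # this processor (and all later ones) would start too late
        rate = 1.0 / (c[i] + w[i])  # load absorbed per unit of time
        num += L[i] * rate
        den += rate
        used.append(i)
        T = num / den  # level reached once the whole load has been poured
    alpha = [0.0] * len(L)
    for i in used:
        alpha[i] = (T - L[i]) / (W_total * (c[i] + w[i]))
    return T, alpha


# Hypothetical platforms: in the first one the second latency is too large to be useful.
T_a, alpha_a = fill_with_latencies(10.0, [0.0, 100.0], [1.0, 1.0], [1.0, 1.0])
T_b, alpha_b = fill_with_latencies(10.0, [0.0, 4.0], [1.0, 1.0], [1.0, 1.0])
print(T_a, alpha_a, T_b, alpha_b)
```

Adding a processor whose latency is below the current level can only lower T, which is why a single pass in increasing latency order is enough in this sketch.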


SLIDE 41

Outline

5. Scheduling Divisible Workload
    ◮ Star-like Network Under the Multi-port Model
    ◮ Bus-like Network
    ◮ Star-like Network Under the One-Port Model
    ◮ Multi-round algorithms
6. Iterative Algorithms
7. Data Redistribution

SLIDE 42

Notations

◮ A set $P_1, \ldots, P_p$ of processors
◮ $P_1$ is the master processor: initially, it holds all the data.
◮ The overall amount of work: $W_{total}$.
◮ Processor $P_i$ receives an amount of work $\alpha_i W_{total}$ with $\alpha_i W_{total} \in \mathbb{Q}$ and $\sum_i \alpha_i = 1$.
    Length of a unit-size work on processor $P_i$: $w_i$.
    Computation time on $P_i$: $\alpha_i W_{total} w_i$.
◮ Time needed to send a unit-message from $P_1$ to $P_i$: $c$.
    One-port model: $P_1$ sends a single message at a time; all processors communicate at the same speed with the master.

SLIDE 43

Equations

For processor $P_i$ (with $c_1 = 0$ and $c_j = c$ otherwise):

$$T_i = \sum_{j=1}^{i} \alpha_j W_{total} c_j + \alpha_i W_{total} w_i$$

$$T = \max_{1 \le i \le p} \left( \sum_{j=1}^{i} \alpha_j W_{total} c_j + \alpha_i W_{total} w_i \right)$$

We look for a data distribution $\alpha_1, \ldots, \alpha_p$ which minimizes $T$.

SLIDE 44

Properties of Load-Balancing

Lemma 1. In an optimal solution, all processors end their processing at the same time.

SLIDE 45

Demonstration of Lemma 1

Two workers $i$ and $i + 1$ with $T_i < T_{i+1}$.

[Figure: Gantt chart of P1, P2, P3 and P4 over time, with the end time marked.]

◮ We decrease $\alpha_{i+1}$ by $\varepsilon$.
◮ We increase $\alpha_i$ by $\varepsilon$.
◮ The communication time for the following processors is unchanged.
◮ We end up with a better solution!

SLIDE 52

Property for the Resource Selection

Lemma 2. In an optimal solution all processors work.

Demonstration: this is just a corollary of Lemma 1...

SLIDE 54

Resolution

$T = \alpha_1 W_{total} w_1$.

$T = \alpha_2 (c + w_2) W_{total}$. Therefore $\alpha_2 = \frac{w_1}{c + w_2} \alpha_1$.

$T = (\alpha_2 c + \alpha_3 (c + w_3)) W_{total}$. Therefore $\alpha_3 = \frac{w_2}{c + w_3} \alpha_2$.

$\alpha_i = \frac{w_{i-1}}{c + w_i} \alpha_{i-1}$ for $i \ge 2$.

$\sum_{i=1}^{n} \alpha_i = 1$.

$$\alpha_1 \left( 1 + \frac{w_1}{c + w_2} + \ldots + \prod_{k=2}^{j} \frac{w_{k-1}}{c + w_k} + \ldots \right) = 1$$
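The recurrence above translates directly into a few lines of code. This is a sketch under the bus assumptions of this section (same per-unit communication time c for every worker, master listed first); the function name and the numeric values are hypothetical.

```python
def bus_distribution(W_total, c, w):
    """Shares alpha_i on a bus; w[0] is the master, w[1:] the workers in service order."""
    # ratios[i] = alpha_i / alpha_1, built with alpha_i = w_{i-1} / (c + w_i) * alpha_{i-1}
    ratios = [1.0]
    for i in range(1, len(w)):
        ratios.append(ratios[-1] * w[i - 1] / (c + w[i]))
    alpha_1 = 1.0 / sum(ratios)  # normalisation: the shares sum to 1
    alpha = [alpha_1 * r for r in ratios]
    T = alpha[0] * W_total * w[0]
    return alpha, T


# Hypothetical platform: one master and two workers.
W, c, w = 100.0, 1.0, [2.0, 3.0, 4.0]
alpha, T = bus_distribution(W, c, w)

# Completion times: the master does not communicate, worker i waits for the
# messages sent to workers 2..i before computing.
finish = [alpha[0] * W * w[0]]
sent = 0.0
for i in range(1, len(w)):
    sent += alpha[i] * W * c
    finish.append(sent + alpha[i] * W * w[i])
print(alpha, T, finish)
```

Every entry of `finish` matches T, in line with Lemma 1.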

SLIDE 60

Impact of the Order of Communications

How important is the influence of the ordering of the processors on the solution?

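SLIDE 61

No Impact of The Order of the Communications

Volume processed by processors $P_i$ and $P_{i+1}$ during a time $T$.

Processor $P_i$: $\alpha_i (c + w_i) W_{total} = T$. Therefore $\alpha_i = \frac{1}{c + w_i} \frac{T}{W_{total}}$.

Processor $P_{i+1}$: $\alpha_i c W_{total} + \alpha_{i+1} (c + w_{i+1}) W_{total} = T$. Thus $\alpha_{i+1} = \frac{1}{c + w_{i+1}} \left( \frac{T}{W_{total}} - \alpha_i c \right) = \frac{w_i}{(c + w_i)(c + w_{i+1})} \frac{T}{W_{total}}$.

Processors $P_i$ and $P_{i+1}$:

$$\alpha_i + \alpha_{i+1} = \frac{c + w_i + w_{i+1}}{(c + w_i)(c + w_{i+1})} \frac{T}{W_{total}}$$

This expression is symmetric in $w_i$ and $w_{i+1}$, so exchanging the two processors does not change the volume they process.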
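A quick numerical check of this property: the sketch below computes the bus makespan for every service order of three workers. The function name and the values are hypothetical, and the makespan is obtained from the recurrence $\alpha_i = \frac{w_{i-1}}{c + w_i} \alpha_{i-1}$ of the Resolution slide.

```python
from itertools import permutations


def bus_makespan(W_total, c, w_master, w_workers):
    """Makespan on a bus when the master serves the workers in the given order."""
    ratio, total, previous_w = 1.0, 1.0, w_master
    for wi in w_workers:
        ratio *= previous_w / (c + wi)  # alpha_i / alpha_1
        total += ratio
        previous_w = wi
    alpha_1 = 1.0 / total
    return alpha_1 * W_total * w_master


# Hypothetical platform: master with w = 2, three workers in every possible order.
makespans = [
    bus_makespan(100.0, 1.0, 2.0, order)
    for order in permutations([3.0, 4.0, 7.0])
]
print(makespans)
```

Up to rounding, the six makespans are identical, as the symmetry argument predicts.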

SLIDE 62

Choice of the Master Processor

We compare processors $P_1$ and $P_2$.

Processor $P_1$: $\alpha_1 w_1 W_{total} = T$. Then $\alpha_1 = \frac{1}{w_1} \frac{T}{W_{total}}$.

Processor $P_2$: $\alpha_2 (c + w_2) W_{total} = T$. Thus $\alpha_2 = \frac{1}{c + w_2} \frac{T}{W_{total}}$.

Total volume processed:

$$\alpha_1 + \alpha_2 = \frac{c + w_1 + w_2}{w_1 (c + w_2)} \frac{T}{W_{total}} = \frac{c + w_1 + w_2}{c w_1 + w_1 w_2} \frac{T}{W_{total}}$$

The denominator $c w_1 + w_1 w_2$ is minimal (so the processed volume is maximal) when $w_1 < w_2$. Master = the most powerful processor (for computations).
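The comparison can be illustrated with two hypothetical processors, one fast and one slow, taking each in turn as the master. The function name and all numeric values are assumptions for this sketch.

```python
def two_processor_volume(T, W_total, c, w_master, w_worker):
    """alpha_1 + alpha_2 processed within time T: (1/w_master + 1/(c + w_worker)) * T / W_total."""
    return (1.0 / w_master + 1.0 / (c + w_worker)) * T / W_total


# Hypothetical unit computation times: the fast processor needs 2, the slow one 5.
fast, slow = 2.0, 5.0
volume_fast_master = two_processor_volume(10.0, 100.0, 1.0, fast, slow)
volume_slow_master = two_processor_volume(10.0, 100.0, 1.0, slow, fast)
print(volume_fast_master, volume_slow_master)
```

With the fast processor as master, more work is completed in the same time, because the master is the only processor that pays no communication cost.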

SLIDE 67

Conclusion

◮ Closed-form expressions for the execution time and the distribution of data.
◮ Choice of the master.
◮ The ordering of the processors has no impact.
◮ All processors take part in the work.

SLIDE 68

Outline

5. Scheduling Divisible Workload
    ◮ Star-like Network Under the Multi-port Model
    ◮ Bus-like Network
    ◮ Star-like Network Under the One-Port Model
    ◮ Multi-round algorithms
6. Iterative Algorithms
7. Data Redistribution

SLIDE 69

Star-like Network

◮ The links between the master and the workers have different characteristics.
◮ The workers have different computational power.

SLIDE 70

Notations

◮ A set $P_1, \ldots, P_p$ of processors
◮ $P_1$ is the master processor: initially, it holds all the data.
◮ The overall amount of work: $W_{total}$.
◮ Processor $P_i$ receives an amount of work $n_i = \alpha_i W_{total}$ with $\sum_i n_i = W_{total}$, $\alpha_i W_{total} \in \mathbb{Q}$ and $\sum_i \alpha_i = 1$.
    Length of a unit-size work on processor $P_i$: $w_i$.
    Computation time on $P_i$: $n_i w_i$.
◮ Time needed to send a unit-message from $P_1$ to $P_i$: $c_i$.
    One-port model: $P_1$ sends a single message at a time.

SLIDE 71

Star Network and Linear Cost Model

Goal: maximize the number of processed tasks $\sum_i \alpha_i$ within a time bound $T_f$.

Lemma 3. In any optimal solution of the StarLinear problem, all workers participate in the computation, and all processors finish computing simultaneously.

Lemma 4. An optimal ordering for the StarLinear problem is obtained by serving the workers in order of non-decreasing $c_i$ (that is, the fastest links first, since $c_i$ is the time needed to send a unit-message).

slide-74
SLIDE 74

Sketch of the Proof of Lemma 3

Two steps :

◮ All

workers participate in the computation. . .

  • A. Legrand (CNRS) INRIA-MESCAL

On the Impact of Platform Models Divisible Workload 49 / 108

slide-75
SLIDE 75

Sketch of the Proof of Lemma 3

Two steps :

◮ All

workers participate in the computation. . . otherwise it would not be optimal.

α2w2 αpwp Tf T2 Tp T1 α1w1 α1c1 αpcp ... ... α2c2 P2 Pp Network P1 Pi

  • A. Legrand (CNRS) INRIA-MESCAL

On the Impact of Platform Models Divisible Workload 49 / 108

slide-76
SLIDE 76

Sketch of the Proof of Lemma 3

Two steps :

◮ All

workers participate in the computation. . . otherwise it would not be optimal.

◮ All

processors finish their work at the same time.

αiwi αici α2w2 αpwp Tf T2 Tp T1 α1w1 α1c1 αpcp ... ... α2c2 P2 Pp Network P1 Pi

slide-82
SLIDE 82

Sketch of the Proof of Lemma 3

Two steps:

◮ All workers participate in the computation; otherwise the schedule would not be optimal.
◮ All processors finish their work at the same time.

[Figure: plot in the $(\beta_1, \beta_2)$ plane, with the point $(\alpha_1, \alpha_2)$ marked.]

Maximize $\sum_i \beta_i$, subject to

LB(i): $\forall i,\ \beta_i \ge 0$
UB(i): $\forall i,\ \sum_{k=1}^{i} \beta_k c_k + \beta_i w_i \le T_f$
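When both steps of the proof hold, every constraint UB(i) is tight, so the loads follow one after the other. The sketch below illustrates this under stated assumptions: the function name `one_round_loads` is hypothetical, the workers are assumed to be listed in the order in which the master serves them, and exact rationals are used only to keep the arithmetic readable.

```python
from fractions import Fraction


def one_round_loads(c, w, T):
    """Loads alpha_i obtained when every constraint UB(i) is an equality.

    c[i], w[i]: unit communication and computation costs of worker i,
                listed in the order in which the master serves the workers.
    T:          common finish time T_f.
    """
    alphas = []
    comm_done = Fraction(0)  # time at which the master has finished earlier sends
    for ci, wi in zip(c, w):
        # Worker i receives from comm_done on, then computes until T:
        #   comm_done + alpha * (ci + wi) = T
        alpha = (T - comm_done) / (Fraction(ci) + Fraction(wi))
        alphas.append(alpha)
        comm_done += alpha * ci
    return alphas


if __name__ == "__main__":
    print(one_round_loads([1, 1], [1, 1], 1))
```

With two identical workers (unit costs 1 and 1) and T = 1, the first worker receives half a unit of load and the second a quarter, since it can only start receiving once the first transfer is over.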

slide-86
SLIDE 86

Sketch of the Proof of Lemma 4

The proof compares the amount of work performed by the first two workers under the two possible orderings, and then proceeds by induction.

[Figure: Gantt charts of the two orderings over a period $T$. (A) $P_1$ starts before $P_2$; (B) $P_2$ starts before $P_1$. Each chart shows the communication phases $\alpha_i c_i$ and computation phases $\alpha_i w_i$ of $P_1$ and $P_2$, and marks the time $t^{(A)}$, respectively $t^{(B)}$.]

$$t^{(A)} = t^{(B)}, \qquad (1)$$

and

$$\left(\alpha_1^{(A)} + \alpha_2^{(A)}\right) - \left(\alpha_1^{(B)} + \alpha_2^{(B)}\right) = \frac{T (c_2 - c_1)}{(c_1 + w_1)(c_2 + w_2)}. \qquad (2)$$
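Equations (1) and (2) can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, both workers are assumed to finish exactly at time T, and t is assumed to denote the time at which the master has finished sending to both workers (the extracted slide text does not spell this out).

```python
from fractions import Fraction


def two_worker_loads(first, second, T):
    """Loads of two workers served in the given order, both finishing at T.

    first, second: (c, w) pairs of unit communication and computation costs.
    """
    c_a, w_a = first
    c_b, w_b = second
    alpha_a = Fraction(T) / (c_a + w_a)    # served first: alpha_a * (c_a + w_a) = T
    alpha_b = alpha_a * w_a / (c_b + w_b)  # served next: starts at alpha_a * c_a
    return alpha_a, alpha_b


def comm_time(first, second, T):
    """Total time the master spends sending to the two workers."""
    alpha_a, alpha_b = two_worker_loads(first, second, T)
    return alpha_a * first[0] + alpha_b * second[0]


if __name__ == "__main__":
    P1, P2, T = (1, 2), (3, 1), 12
    work_A = sum(two_worker_loads(P1, P2, T))  # (A) P1 starts before P2
    work_B = sum(two_worker_loads(P2, P1, T))  # (B) P2 starts before P1
    print(comm_time(P1, P2, T), comm_time(P2, P1, T))  # same value, as in (1)
    print(work_A - work_B)                             # matches the right-hand side of (2)
```

With these sample costs the communication time is the same in both orderings, while serving the worker with the smaller c first processes strictly more work, which is what the induction exploits.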

slide-87
SLIDE 87

Conclusion

◮ The processors must be ordered by decreasing bandwidths
◮ All processors are working
◮ All processors end their work at the same time
◮ Formulas for the execution time and the distribution of data

slide-88
SLIDE 88

Outline

5 Scheduling Divisible Workload
   Star-like Network Under the Multi-port Model
   Bus-like Network
   Star-like Network Under the One-Port Model
   Multi-round algorithms
6 Iterative Algorithms
7 Data Redistribution

slide-91
SLIDE 91

One-round vs. Multi-round

[Figure: one-round schedule on the network and processors $P_1, \dots, P_p$: each $P_i$ receives for $\alpha_i g$, computes for $\alpha_i w_i$ and finishes at $T_i$; $T_f$ is the overall finish time.]

One round: long idle times.

[Figure: multi-round schedule on the same platform, with rounds $R_0, R_1, \dots, R_k$.]

Multi-round: efficient when $W_{total}$ is large.
Intuition: start with small rounds, then increase the chunk sizes.

Problems:

◮ the linear communication model leads to an absurd solution
◮ resource selection
◮ number of rounds
◮ size of each round

slide-92
SLIDE 92

Notations

◮ A set $P_1, \dots, P_p$ of processors.
◮ $P_1$ is the master processor: initially, it holds all the data.
◮ The overall amount of work: $W_{total}$.
◮ Processor $P_i$ receives an amount of work $n_i = \alpha_i W_{total}$, with $\sum_i n_i = W_{total}$, $\alpha_i W_{total} \in \mathbb{Q}$ and $\sum_i \alpha_i = 1$.
   Length of a unit-size work on processor $P_i$: $w_i$.
   Computation time on $P_i$: $n_i w_i$.
◮ Time needed to send a message of size $\alpha_i$ from $P_1$ to $P_i$: $L_i + c_i \times \alpha_i$.
   One-port model: $P_1$ sends and receives a single message at a time.

slide-95
SLIDE 95

Complexity

Definition: One round, $\forall i,\ c_i = 0$. Given $W_{total}$, $p$ workers, $(P_i)_{1 \le i \le p}$, $(L_i)_{1 \le i \le p}$, and a rational number $T \ge 0$, and assuming that bandwidths are infinite, is it possible to compute all $W_{total}$ load units within $T$ time units?

Theorem 1. The problem with one round and infinite bandwidths is NP-complete.

What is the complexity of the general problem with finite bandwidths and several rounds?

The general problem is NP-hard, but does not appear to be in NP (no polynomial bound on the number of activations).

slide-96
SLIDE 96

Fixed activation sequence

Hypotheses

1 Number of activations: $N_{act}$;
2 Whether $P_i$ is the processor used during activation $j$: $\chi_i^{(j)}$.

Minimize $T$, under the constraints

$$
\begin{cases}
\sum_{j=1}^{N_{act}} \sum_{i=1}^{p} \chi_i^{(j)} \alpha_i^{(j)} = W_{total} \\
\forall k \le N_{act},\ \forall l:\ \left( \sum_{j=1}^{k} \sum_{i=1}^{p} \chi_i^{(j)} \left(L_i + \alpha_i^{(j)} c_i\right) \right) + \sum_{j=k}^{N_{act}} \chi_l^{(j)} \alpha_l^{(j)} w_l \le T \\
\forall i, j:\ \alpha_i^{(j)} \ge 0
\end{cases}
\qquad (3)
$$

Can be solved in polynomial time.
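To make the time constraints of system (3) concrete, the sketch below evaluates the smallest T that satisfies them for a given activation sequence and given chunk sizes. It is an illustration rather than a solver: the function name `min_makespan` and the encoding of $\chi$ as a list of worker indices are assumptions, and optimizing the chunk sizes themselves would still require a linear-programming solver.

```python
def min_makespan(sequence, chunks, L, c, w):
    """Smallest T satisfying the time constraints of system (3).

    sequence[j]: index of the worker used by activation j (this encodes chi).
    chunks[j]:   load alpha sent during activation j.
    L, c, w:     per-worker latency, unit communication cost, unit computation cost.
    """
    n = len(sequence)
    workers = set(sequence)
    T = 0
    comm_end = 0  # time at which the master has completed activations 0..k
    for k in range(n):
        i = sequence[k]
        comm_end += L[i] + chunks[k] * c[i]
        for l in workers:
            # Work that worker l must still process from activation k onwards.
            remaining = sum(chunks[j] * w[l] for j in range(k, n) if sequence[j] == l)
            T = max(T, comm_end + remaining)
    return T


if __name__ == "__main__":
    # Two workers served once each, unit chunks, no latency.
    print(min_makespan([0, 1], [1, 1], L=[0, 0], c=[1, 1], w=[1, 1]))
```

In the sample call, the second worker receives its chunk during the interval [1, 2] and computes during [2, 3], so the returned value is 3.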

slide-97
SLIDE 97

Fixed number of activations

Minimize $T$, under the constraints

$$
\begin{cases}
\sum_{j=1}^{N_{act}} \sum_{i=1}^{p} \chi_i^{(j)} \alpha_i^{(j)} = W_{total} \\
\forall k \le N_{act},\ \forall l:\ \left( \sum_{j=1}^{k} \sum_{i=1}^{p} \chi_i^{(j)} \left(L_i + \alpha_i^{(j)} c_i\right) \right) + \sum_{j=k}^{N_{act}} \chi_l^{(j)} \alpha_l^{(j)} w_l \le T \\
\forall k \le N_{act}:\ \sum_{i=1}^{p} \chi_i^{(k)} \le 1 \\
\forall i, j:\ \chi_i^{(j)} \in \{0, 1\} \\
\forall i, j:\ \alpha_i^{(j)} \ge 0
\end{cases}
\qquad (4)
$$

Exact but exponential (branch-and-bound algorithms).

slide-99
SLIDE 99

Uniform Multi-Round

◮ In a round: all workers have the same computation time
◮ Geometrical increase of the round sizes
◮ No idle time in communications:

[Figure: Gantt chart of rounds $j$, $j+1$ and $j+2$ for workers $1, \dots, i, \dots, p$. In round $j$, worker $i$ receives its chunk (latency $L_i$, then transfer $\alpha_i^{(j)} c_i$) and computes for $\alpha_i^{(j)} w_i = \alpha_1^{(j)} w_1$, while the master already sends the chunks $\alpha_k^{(j+1)} c_k$ of round $j+1$.]

$$\alpha_i^{(j)} w_i = \sum_{k=1}^{p} \left(L_k + \alpha_k^{(j+1)} c_k\right).$$

Heuristic processor selection: by decreasing bandwidths.

No guarantee...
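The no-idle condition links the computation time of one round to that of the next. Writing $\tau_j = \alpha_i^{(j)} w_i$ for the common computation time of round $j$, the condition becomes $\tau_j = \sum_k L_k + \tau_{j+1} \sum_k c_k / w_k$. The sketch below unrolls this recurrence; the function names and the choice of the first-round time `tau0` as input are assumptions made for illustration.

```python
def umr_round_times(tau0, L, c, w, rounds):
    """Per-round computation times tau_j under the no-idle condition.

    tau_j = sum(L) + tau_{j+1} * sum(c_k / w_k), solved for tau_{j+1}.
    """
    total_latency = sum(L)
    rho = sum(ck / wk for ck, wk in zip(c, w))  # communication-to-computation ratio
    taus = [tau0]
    for _ in range(rounds - 1):
        taus.append((taus[-1] - total_latency) / rho)
    return taus


def umr_chunks(tau, w):
    """Chunk sizes of one round: every worker computes for the same time tau."""
    return [tau / wi for wi in w]


if __name__ == "__main__":
    taus = umr_round_times(1.0, L=[0, 0], c=[1, 1], w=[4, 4], rounds=3)
    print(taus)
    print([umr_chunks(t, [4, 4]) for t in taus])
```

When latencies are negligible and $\sum_k c_k / w_k < 1$, successive rounds grow by the constant factor $1 / \sum_k c_k / w_k$, which is the geometrical increase mentioned on the slide.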

slide-100
SLIDE 100

Periodic Schedule

[Figure: periodic schedule with period $T_p$: in every period the master serves each selected worker $P_i$ in turn (latency $L_i$, then transfer $\alpha_i c_i$), and each worker then computes for $\alpha_i w_i$.]

How to choose $T_p$? Which resources to select?

slide-101
SLIDE 101

With no Overlap (1/4)

Equations

◮ Divide the total execution time $T$ into $k$ periods of duration $T_p$.
◮ $I \subset \{1, \dots, p\}$: participating processors.
◮ Bandwidth limitation: $\sum_{i \in I} (L_i + \alpha_i c_i) \le T_p$.
◮ No overlap: $\forall i \in I,\ L_i + \alpha_i (c_i + w_i) \le T_p$.

slide-105
SLIDE 105

With no Overlap (2/4)

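Normalization

◮ $\beta_i$: average number of tasks processed by $P_i$ during one time unit.
◮ Linear program:

Maximize $\sum_{i=1}^{p} \beta_i$, subject to

$$
\begin{cases}
\forall i \in I,\ \beta_i (c_i + w_i) \le 1 - \frac{L_i}{T_p} \\
\sum_{i \in I} \beta_i c_i \le 1 - \frac{\sum_{i \in I} L_i}{T_p}
\end{cases}
$$

Relaxed version (over all $p$ processors, with the per-processor bound $1 - \frac{L_i}{T_p}$ replaced by $1 - \frac{\sum_{i=1}^{p} L_i}{T_p}$):

Maximize $\sum_{i=1}^{p} x_i$, subject to

$$
\begin{cases}
\forall\, 1 \le i \le p,\ x_i (c_i + w_i) \le 1 - \frac{\sum_{k=1}^{p} L_k}{T_p} \\
\sum_{i=1}^{p} x_i c_i \le 1 - \frac{\sum_{k=1}^{p} L_k}{T_p}
\end{cases}
$$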

slide-106
SLIDE 106

With no Overlap (3/4)

Bandwidth-centric solution

◮ Sort: $c_1 \le c_2 \le \dots \le c_p$.
◮ Let $q$ be the largest index so that $\sum_{i=1}^{q} \frac{c_i}{c_i + w_i} \le 1$.
◮ If $q < p$, $\varepsilon = 1 - \sum_{i=1}^{q} \frac{c_i}{c_i + w_i}$.
◮ Optimal solution to the relaxed program:

$$\forall\, 1 \le i \le q,\quad x_i = \frac{1 - \frac{\sum_{k=1}^{p} L_k}{T_p}}{c_i + w_i}$$

and (if $q < p$):

$$x_{q+1} = \left(1 - \frac{\sum_{k=1}^{p} L_k}{T_p}\right) \frac{\varepsilon}{c_{q+1}},$$

and $x_{q+2} = x_{q+3} = \dots = x_p = 0$.
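The bandwidth-centric solution above translates into a short greedy loop. The following sketch is an illustration under stated assumptions: the function name `bandwidth_centric` is hypothetical, the costs are assumed to be integers (so that exact rationals can be used), the workers are assumed to be already sorted by non-decreasing $c_i$, and $T_p$ is assumed to exceed the sum of the latencies.

```python
from fractions import Fraction


def bandwidth_centric(c, w, L, Tp):
    """Solution x of the relaxed program; workers sorted by non-decreasing c."""
    p = len(c)
    K = 1 - Fraction(sum(L)) / Tp  # common right-hand side 1 - (sum of L_k) / Tp
    x = [Fraction(0)] * p
    used = Fraction(0)             # running sum of c_i / (c_i + w_i)
    for i in range(p):
        ratio = Fraction(c[i], c[i] + w[i])
        if used + ratio <= 1:
            x[i] = K / (c[i] + w[i])  # worker i <= q is kept busy all the time
            used += ratio
        else:
            eps = 1 - used
            x[i] = K * eps / c[i]     # worker q+1 only gets the leftover bandwidth
            break                     # workers q+2, ..., p receive nothing
    return x


if __name__ == "__main__":
    x = bandwidth_centric(c=[2, 3], w=[1, 1], L=[0, 0], Tp=1)
    print(x)
    print(sum(xi * ci for xi, ci in zip(x, [2, 3])))  # bandwidth actually used
```

In the sample call, the first worker alone uses two thirds of the master's communication capacity, so the second worker is limited by the remaining third rather than by its own speed.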

slide-107
SLIDE 107

With no Overlap (4/4)

Asymptotic optimality

◮ Let $T_p = \sqrt{T^*_{max}}$ and $\alpha_i = x_i T_p$ for all $i$.
◮ Then $T \le T^*_{max} + O\left(\sqrt{T^*_{max}}\right)$.
◮ Closed-form expressions for resource selection and task assignment provided by the algorithm.

slide-108
SLIDE 108

With Overlap

Key points

◮ Still sort resources according to the $c_i$.
◮ Greedily select resources until the sum of the ratios $\frac{c_i}{w_i}$ (instead of $\frac{c_i}{c_i + w_i}$) exceeds 1.

slide-109
SLIDE 109

Conclusion

◮ NP-hardness comes from the one-port model and latencies.
◮ The problem is however rather easy to approximate. Rough solutions are more than sufficient.
◮ Communications are much more important than computations.

slide-111
SLIDE 111

Outline

5 Scheduling Divisible Workload
6 Iterative Algorithms
   The Problem
   Fully Homogeneous Network
   Heterogeneous Network (Complete)
   Heterogeneous Network (General Case)
7 Data Redistribution

slide-112
SLIDE 112

The Context: Distributed Heterogeneous Platforms

How to embed a ring in a complex network [LRRV04].

Sources of problems

◮ Heterogeneity of processors (computational power, memory, etc.)
◮ Heterogeneity of communication links.
◮ Irregularity of the interconnection network.

slide-114
SLIDE 114

Targeted Applications: Iterative Algorithms

◮ A set of data (typically, a matrix)
◮ Structure of the algorithms:
   1 Each processor performs a computation on its chunk of data
   2 Each processor exchanges the "border" of its chunk of data with its neighbor processors
   3 We go back to Step 1

Question: how can we efficiently execute such an algorithm on such a platform?

slide-115
SLIDE 115

The Questions

◮ Which processors should be used?
◮ What amount of data should we give them?
◮ How do we cut the set of data?

slide-121
SLIDE 121

First of All, a Simplification: Slicing the Data

◮ Data: a 2-D array

[Figure: a 2-D array cut into vertical slices assigned to P1, P2, P3 and P4, next to the ring formed by the processors P1, P2, P3 and P4.]

◮ Unidimensional cutting into vertical slices
◮ Consequences:
   1 Borders and neighbors are easily defined
   2 Constant volume of data exchanged between neighbors: $D_c$
   3 Processors are virtually organized into a ring

slide-122
SLIDE 122

Notations

◮ Processors: $P_1, \dots, P_p$
◮ Processor $P_i$ executes a unit task in a time $w_i$
◮ Overall amount of work $D_w$; share of $P_i$: $\alpha_i \cdot D_w$, processed in a time $\alpha_i \cdot D_w \cdot w_i$ ($\alpha_i \ge 0$, $\sum_j \alpha_j = 1$)
◮ Cost of a unit-size communication from $P_i$ to $P_j$: $c_{i,j}$
◮ Cost of a sending from $P_i$ to its successor in the ring: $D_c \cdot c_{i,succ(i)}$

slide-123
SLIDE 123

Communications: 1-Port Model (Full-Duplex)

A processor can:

◮ send at most one message at any time;
◮ receive at most one message at any time;
◮ send and receive a message simultaneously.

slide-127
SLIDE 127

Objective

1 Select $q$ processors among $p$
2 Order them into a ring
3 Distribute the data among them

So as to minimize:

$$\max_{1 \le i \le p} \mathbb{I}\{i\}\left[\alpha_i \cdot D_w \cdot w_i + D_c \cdot \left(c_{i,pred(i)} + c_{i,succ(i)}\right)\right]$$

where $\mathbb{I}\{i\}[x] = x$ if $P_i$ participates in the computation, and 0 otherwise.

slide-128
SLIDE 128

Outline

5 Scheduling Divisible Workload
6 Iterative Algorithms
   The Problem
   Fully Homogeneous Network
   Heterogeneous Network (Complete)
   Heterogeneous Network (General Case)
7 Data Redistribution

slide-129
SLIDE 129

Special Hypotheses

1 There exists a communication link between any two processors
2 All links have the same capacity ($\exists c,\ \forall i, j,\ c_{i,j} = c$)

slide-133
SLIDE 133

Consequences

◮ Either the most powerful processor performs all the work, or all the processors participate
◮ If all processors participate, all end their share of work simultaneously ($\exists \tau,\ \alpha_i \cdot D_w \cdot w_i = \tau$, so $1 = \sum_i \frac{\tau}{D_w \cdot w_i}$)
◮ Time of the optimal solution:

$$T_{step} = \min\left\{ D_w \cdot w_{min},\ D_w \cdot \frac{1}{\sum_i \frac{1}{w_i}} + 2 \cdot D_c \cdot c \right\}$$
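The closed form for the fully homogeneous network is easy to evaluate. The sketch below is a direct transcription; only the function name `t_step_homogeneous` is an assumption.

```python
def t_step_homogeneous(w, Dw, Dc, c):
    """T_step on a complete network whose links all have unit cost c.

    w:  list of per-processor times for a unit task.
    Dw: overall amount of work; Dc: volume exchanged between ring neighbors.
    """
    one_processor = Dw * min(w)                 # the fastest processor works alone
    w_cumul = 1 / sum(1 / wi for wi in w)       # aggregated speed of all processors
    all_processors = Dw * w_cumul + 2 * Dc * c  # everyone computes, then exchanges borders
    return min(one_processor, all_processors)


if __name__ == "__main__":
    print(t_step_homogeneous([1, 1], Dw=10, Dc=1, c=1))  # sharing the work pays off
    print(t_step_homogeneous([1, 1], Dw=2, Dc=1, c=1))   # communication dominates
```

The two sample calls show both regimes: with a large amount of work the ring is faster, while with little work the two border exchanges cost more than they save.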

slide-134
SLIDE 134

Outline

5 Scheduling Divisible Workload
6 Iterative Algorithms
   The Problem
   Fully Homogeneous Network
   Heterogeneous Network (Complete)
   Heterogeneous Network (General Case)
7 Data Redistribution

slide-135
SLIDE 135

Special hypothesis

1 There exists a communication link between any two processors


slide-136
SLIDE 136

All the Processors Participate: Study (1)

[Figure: Gantt chart (processors $P_1, \dots, P_5$ against time): each $P_i$ computes for $\alpha_i \cdot D_w \cdot w_i$ and communicates with its two ring neighbors for $D_c \cdot c_{i,succ(i)}$ and $D_c \cdot c_{i,pred(i)}$.]

All processors end simultaneously

slide-138
SLIDE 138

All the Processors Participate: Study (2)

◮ All processors end simultaneously:

$$T_{step} = \alpha_i \cdot D_w \cdot w_i + D_c \cdot \left(c_{i,succ(i)} + c_{i,pred(i)}\right)$$

◮ $\sum_{i=1}^{p} \alpha_i = 1 \Rightarrow \sum_{i=1}^{p} \frac{T_{step} - D_c \cdot \left(c_{i,succ(i)} + c_{i,pred(i)}\right)}{D_w \cdot w_i} = 1$. Thus

$$\frac{T_{step}}{D_w \cdot w_{cumul}} = 1 + \frac{D_c}{D_w} \sum_{i=1}^{p} \frac{c_{i,succ(i)} + c_{i,pred(i)}}{w_i}$$

where $w_{cumul} = \frac{1}{\sum_i \frac{1}{w_i}}$.

slide-142
SLIDE 142

All the Processors Participate: Interpretation

$$\frac{T_{step}}{D_w \cdot w_{cumul}} = 1 + \frac{D_c}{D_w} \sum_{i=1}^{p} \frac{c_{i,succ(i)} + c_{i,pred(i)}}{w_i}$$

$T_{step}$ is minimal when $\sum_{i=1}^{p} \frac{c_{i,succ(i)} + c_{i,pred(i)}}{w_i}$ is minimal.

Look for a Hamiltonian cycle of minimal weight in a graph where the edge from $P_i$ to $P_j$ has a weight of $d_{i,j} = \frac{c_{i,j}}{w_i} + \frac{c_{j,i}}{w_j}$.

NP-complete problem
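For a handful of processors, the Hamiltonian-cycle formulation can be explored exhaustively. The sketch below evaluates $T_{step}$ for a given ring with the formula above and then tries every ring; the function names are hypothetical, `cost[i][j]` stands for $c_{i,j}$, and the search is only meant to illustrate the objective, not to replace a traveling-salesman heuristic.

```python
from itertools import permutations


def t_step_ring(ring, w, cost, Dw, Dc):
    """T_step when all processors of `ring` participate and finish together."""
    q = len(ring)
    w_cumul = 1 / sum(1 / w[i] for i in ring)
    overhead = 0.0
    for pos, i in enumerate(ring):
        succ = ring[(pos + 1) % q]
        pred = ring[(pos - 1) % q]
        overhead += (cost[i][succ] + cost[i][pred]) / w[i]
    return Dw * w_cumul * (1 + Dc / Dw * overhead)


def best_ring(w, cost, Dw, Dc):
    """Exhaustive search over Hamiltonian cycles, with processor 0 fixed first."""
    p = len(w)
    rings = ([0] + list(rest) for rest in permutations(range(1, p)))
    return min(rings, key=lambda r: t_step_ring(r, w, cost, Dw, Dc))


if __name__ == "__main__":
    # Four identical processors: cheap links along 0-1-2-3-0, expensive diagonals.
    cost = [[0, 1, 10, 1],
            [1, 0, 1, 10],
            [10, 1, 0, 1],
            [1, 10, 1, 0]]
    w = [1, 1, 1, 1]
    ring = best_ring(w, cost, Dw=4, Dc=1)
    print(ring, t_step_ring(ring, w, cost, Dw=4, Dc=1))
```

Fixing processor 0 as the first element avoids enumerating rotations of the same cycle; the number of candidates still grows factorially, which is why the slides turn to heuristics.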

slide-143
SLIDE 143

All the Processors Participate: Linear Program

Minimize $\sum_{i=1}^{p} \sum_{j=1}^{p} d_{i,j} \cdot x_{i,j}$, satisfying the (in)equations

$$
\begin{cases}
(1)\ \sum_{j=1}^{p} x_{i,j} = 1 & 1 \le i \le p \\
(2)\ \sum_{i=1}^{p} x_{i,j} = 1 & 1 \le j \le p \\
(3)\ x_{i,j} \in \{0, 1\} & 1 \le i, j \le p \\
(4)\ u_i - u_j + p \cdot x_{i,j} \le p - 1 & 2 \le i, j \le p,\ i \ne j \\
(5)\ u_i \text{ integer},\ u_i \ge 0 & 2 \le i \le p
\end{cases}
$$

$x_{i,j} = 1$ if, and only if, the edge from $P_i$ to $P_j$ is used

slide-144
SLIDE 144

General Case: Linear program

Best ring made of $q$ processors

Minimize $T$ satisfying the (in)equations

$$
\begin{cases}
(1)\ x_{i,j} \in \{0, 1\} & 1 \le i, j \le p \\
(2)\ \sum_{i=1}^{p} x_{i,j} \le 1 & 1 \le j \le p \\
(3)\ \sum_{i=1}^{p} \sum_{j=1}^{p} x_{i,j} = q \\
(4)\ \sum_{i=1}^{p} x_{i,j} = \sum_{i=1}^{p} x_{j,i} & 1 \le j \le p \\
(5)\ \sum_{i=1}^{p} \alpha_i = 1 \\
(6)\ \alpha_i \le \sum_{j=1}^{p} x_{i,j} & 1 \le i \le p \\
(7)\ \alpha_i \cdot w_i + \frac{D_c}{D_w} \sum_{j=1}^{p} \left(x_{i,j} c_{i,j} + x_{j,i} c_{j,i}\right) \le T & 1 \le i \le p \\
(8)\ \sum_{i=1}^{p} y_i = 1 \\
(9)\ -p \cdot y_i - p \cdot y_j + u_i - u_j + q \cdot x_{i,j} \le q - 1 & 1 \le i, j \le p,\ i \ne j \\
(10)\ y_i \in \{0, 1\} & 1 \le i \le p \\
(11)\ u_i \text{ integer},\ u_i \ge 0 & 1 \le i \le p
\end{cases}
$$

slide-145
SLIDE 145

Linear Programming

◮ Problems with rational variables: can be solved in polynomial time (in the size of the problem).
◮ Problems with integer variables: solved in exponential time in the worst case.
◮ No relaxation in rationals seems possible here...

slide-148
SLIDE 148

And, in Practice ?

All processors participate. One can use a heuristic for the traveling salesman problem (such as Lin-Kernighan's). No guarantee, but excellent results in practice.

General case.

1 Exhaustive search: feasible up to about a dozen processors...
2 Greedy heuristic: initially we take the best pair of processors; for a given ring we try to insert any unused processor in between any pair of neighbor processors in the ring...

slide-149
SLIDE 149

Outline

5 Scheduling Divisible Workload
6 Iterative Algorithms
   The Problem
   Fully Homogeneous Network
   Heterogeneous Network (Complete)
   Heterogeneous Network (General Case)
7 Data Redistribution

slide-153
SLIDE 153

New Difficulty: Communication Links Sharing

[Figure: a heterogeneous platform connecting P1, P2, P3 and P4, and the virtual ring formed by P1, P2, P3 and P4.]

We must take communication link sharing into account.

slide-154
SLIDE 154

New Notations

◮ A set of communication links: $e_1, \dots, e_n$
◮ Bandwidth of link $e_m$: $b_{e_m}$
◮ There is a path $S_i$ from $P_i$ to $P_{succ(i)}$ in the network
   ◮ $S_i$ uses a fraction $s_{i,m}$ of the bandwidth $b_{e_m}$ of link $e_m$
   ◮ $P_i$ needs a time $D_c \cdot \frac{1}{\min_{e_m \in S_i} s_{i,m}}$ to send to its successor a message of size $D_c$
   ◮ Constraints on the bandwidth of $e_m$: $\sum_{1 \le i \le p} s_{i,m} \le b_{e_m}$
◮ Symmetrically, there is a path $\mathcal{P}_i$ from $P_i$ to $P_{pred(i)}$ in the network, which uses a fraction $p_{i,m}$ of the bandwidth $b_{e_m}$ of link $e_m$

slide-156
SLIDE 156

Toy Example: Choosing the Ring

[Figure: platform graph with processors P1, P2, P3, P4, P5, Q and R connected by the links a, b, c, d, e, f, g and h.]

◮ 7 processors and 8 bidirectional communication links
◮ We choose a ring of 5 processors: P1 → P2 → P3 → P4 → P5 (we use neither Q, nor R)

slide-160
SLIDE 160

Toy Example: Choosing the Paths

[Figure: the platform graph (processors P1, ..., P5, Q and R; links a, ..., h) with the chosen paths.]

From $P_1$ to $P_2$, we use the links a and b: $S_1 = \{a, b\}$.
From $P_2$ to $P_1$, we use the links b, g and h: $\mathcal{P}_2 = \{b, g, h\}$.

From $P_1$: to $P_2$, $S_1 = \{a, b\}$ and to $P_5$, $\mathcal{P}_1 = \{h\}$
From $P_2$: to $P_3$, $S_2 = \{c, d\}$ and to $P_1$, $\mathcal{P}_2 = \{b, g, h\}$
From $P_3$: to $P_4$, $S_3 = \{d, e\}$ and to $P_2$, $\mathcal{P}_3 = \{d, e, f\}$
From $P_4$: to $P_5$, $S_4 = \{f, b, g\}$ and to $P_3$, $\mathcal{P}_4 = \{e, d\}$
From $P_5$: to $P_1$, $S_5 = \{h\}$ and to $P_4$, $\mathcal{P}_5 = \{g, b, f\}$

slide-162
SLIDE 162

Toy Example: Bandwidth Sharing

From $P_1$ to $P_2$ we use links a and b: $c_{1,2} = \frac{1}{\min(s_{1,a}, s_{1,b})}$.

From $P_1$ to $P_5$ we use the link h: $c_{1,5} = \frac{1}{p_{1,h}}$.

Set of all sharing constraints:

Link a: $s_{1,a} \le b_a$
Link b: $s_{1,b} + s_{4,b} + p_{2,b} + p_{5,b} \le b_b$
Link c: $s_{2,c} \le b_c$
Link d: $s_{2,d} + s_{3,d} + p_{3,d} + p_{4,d} \le b_d$
Link e: $s_{3,e} + p_{3,e} + p_{4,e} \le b_e$
Link f: $s_{4,f} + p_{3,f} + p_{5,f} \le b_f$
Link g: $s_{4,g} + p_{2,g} + p_{5,g} \le b_g$
Link h: $s_{5,h} + p_{1,h} + p_{2,h} \le b_h$
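The sharing constraints can be derived mechanically from the chosen paths. The sketch below does so for the toy example; the names `TOY_PATHS`, `link_loads`, `unit_cost` and `feasible` are hypothetical, the keys `"S1"` to `"S5"` and `"P1"` to `"P5"` denote the paths towards the successor and the predecessor respectively, and the bandwidth shares used in the demonstration are arbitrary sample values.

```python
# Paths of the toy example: "Si" goes from Pi to its successor,
# "Pi" goes from Pi to its predecessor.
TOY_PATHS = {
    "S1": ["a", "b"], "P1": ["h"],
    "S2": ["c", "d"], "P2": ["b", "g", "h"],
    "S3": ["d", "e"], "P3": ["d", "e", "f"],
    "S4": ["f", "b", "g"], "P4": ["e", "d"],
    "S5": ["h"], "P5": ["g", "b", "f"],
}


def link_loads(paths, shares):
    """Total bandwidth requested on each link (left-hand sides of the constraints).

    shares[name][link]: bandwidth that path `name` uses on `link`.
    """
    load = {}
    for name, links in paths.items():
        for m in links:
            load[m] = load.get(m, 0) + shares[name][m]
    return load


def unit_cost(name, paths, shares):
    """Cost of a unit-size message along a path: 1 / (smallest share on the path)."""
    return 1 / min(shares[name][m] for m in paths[name])


def feasible(load, bandwidth):
    """Check every sharing constraint: requested bandwidth <= link bandwidth."""
    return all(load[m] <= bandwidth[m] for m in load)


if __name__ == "__main__":
    # Sample allocation: every path gets one bandwidth unit on each of its links.
    shares = {name: {m: 1 for m in links} for name, links in TOY_PATHS.items()}
    load = link_loads(TOY_PATHS, shares)
    print(load)
    print(feasible(load, {m: 4 for m in "abcdefgh"}))
```

With one unit per path and per link, the load on each link is simply the number of paths crossing it, which reproduces the number of terms in each of the eight constraints above.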

slide-163
SLIDE 163

Toy Example: Final Quadratic System

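Minimize $\max_{1 \le i \le 5} \left(\alpha_i \cdot D_w \cdot w_i + D_c \cdot (c_{i,i-1} + c_{i,i+1})\right)$ under the constraints

$\sum_{i=1}^{5} \alpha_i = 1$

$s_{1,a} \le b_a$
$s_{1,b} + s_{4,b} + p_{2,b} + p_{5,b} \le b_b$
$s_{2,c} \le b_c$
$s_{2,d} + s_{3,d} + p_{3,d} + p_{4,d} \le b_d$
$s_{3,e} + p_{3,e} + p_{4,e} \le b_e$
$s_{4,f} + p_{3,f} + p_{5,f} \le b_f$
$s_{4,g} + p_{2,g} + p_{5,g} \le b_g$
$s_{5,h} + p_{1,h} + p_{2,h} \le b_h$

$s_{1,a} \cdot c_{1,2} \ge 1$, $s_{1,b} \cdot c_{1,2} \ge 1$, $p_{1,h} \cdot c_{1,5} \ge 1$
$s_{2,c} \cdot c_{2,3} \ge 1$, $s_{2,d} \cdot c_{2,3} \ge 1$, $p_{2,b} \cdot c_{2,1} \ge 1$, $p_{2,g} \cdot c_{2,1} \ge 1$, $p_{2,h} \cdot c_{2,1} \ge 1$
$s_{3,d} \cdot c_{3,4} \ge 1$, $s_{3,e} \cdot c_{3,4} \ge 1$, $p_{3,d} \cdot c_{3,2} \ge 1$, $p_{3,e} \cdot c_{3,2} \ge 1$, $p_{3,f} \cdot c_{3,2} \ge 1$
$s_{4,f} \cdot c_{4,5} \ge 1$, $s_{4,b} \cdot c_{4,5} \ge 1$, $s_{4,g} \cdot c_{4,5} \ge 1$, $p_{4,e} \cdot c_{4,3} \ge 1$, $p_{4,d} \cdot c_{4,3} \ge 1$
$s_{5,h} \cdot c_{5,1} \ge 1$, $p_{5,g} \cdot c_{5,4} \ge 1$, $p_{5,b} \cdot c_{5,4} \ge 1$, $p_{5,f} \cdot c_{5,4} \ge 1$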

slide-165
SLIDE 165

Toy Example: Conclusion

The problem reduces to a quadratic system if:

1. The processors are selected;
2. The processors are ordered into a ring;
3. The communication paths between the processors are known.

In other words: a quadratic system if the ring is known.

If the ring is known:

◮ Complete graph: closed-form expression;
◮ General graph: quadratic system.


slide-166
SLIDE 166

And, in Practice?

We adapt our greedy heuristic:

1. Initially: best pair of processors
2. For each processor Pk (not already included in the ring)
   ◮ For each pair (Pi, Pj) of neighbors in the ring
     1. We build the graph of the unused bandwidths (without considering the paths between Pi and Pj)
     2. We compute the shortest paths (in terms of bandwidth) between Pk and Pi and Pj
     3. We evaluate the solution
3. We keep the best solution found at step 2 and we start again

+ refinements (max-min fairness, quadratic solving).
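The path computation in the inner loop can be sketched as follows. This assumes that "shortest path in terms of bandwidth" means the path whose bottleneck (smallest) link bandwidth is largest; that reading, the function name `widest_path` and the graph encoding are assumptions, not taken from the slides.

```python
import heapq


def widest_path(graph, src, dst):
    """Path from src to dst that maximizes the bottleneck bandwidth.

    graph: {u: {v: unused_bandwidth}}. Returns (bottleneck, path),
    or (0.0, []) when dst cannot be reached. Dijkstra-like search where
    the "distance" of a node is the best bottleneck found so far.
    """
    best = {src: float("inf")}
    prev = {}
    heap = [(-float("inf"), src)]  # max-heap through negated bandwidths
    while heap:
        neg_bw, u = heapq.heappop(heap)
        bw = -neg_bw
        if u == dst:
            break
        if bw < best.get(u, 0.0):
            continue  # stale heap entry
        for v, cap in graph.get(u, {}).items():
            cand = min(bw, cap)
            if cand > best.get(v, 0.0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    if dst not in best:
        return 0.0, []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return best[dst], path[::-1]


if __name__ == "__main__":
    # Hypothetical unused bandwidths (undirected links listed both ways).
    g = {
        "A": {"B": 10, "C": 3},
        "B": {"A": 10, "C": 5},
        "C": {"A": 3, "B": 5},
    }
    print(widest_path(g, "A", "C"))
```

In the heuristic, such a search would be run twice per candidate insertion (from Pk to Pi and from Pk to Pj) on the graph of unused bandwidths.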


slide-167
SLIDE 167

Is This Meaningful?

◮ No guarantee, neither theoretical nor practical
◮ Simple solution:
  1. We build the complete graph whose edges are labeled with the bandwidths of the best communication paths
  2. We apply the heuristic for complete graphs
  3. We allocate the bandwidths
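Step 1 of the simple solution can be sketched with a Floyd-Warshall-style closure in which "add" is replaced by "min" and "min" by "max". This is one possible way to label the complete graph, assuming the best path between two nodes is the one with the largest bottleneck bandwidth; names and values are hypothetical, and sharing between paths is deliberately ignored here.

```python
def best_path_bandwidths(nodes, links):
    """All-pairs bottleneck bandwidth of the best path between two nodes.

    nodes: list of node names
    links: {(u, v): bandwidth} for undirected physical links
    Returns bw such that bw[u][v] is the label of edge (u, v) in the
    complete graph (0.0 when v cannot be reached from u).
    """
    bw = {u: {v: 0.0 for v in nodes} for u in nodes}
    for (u, v), b in links.items():
        bw[u][v] = max(bw[u][v], b)
        bw[v][u] = max(bw[v][u], b)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if i != j:
                    # Going through k is worth it if its bottleneck is larger.
                    bw[i][j] = max(bw[i][j], min(bw[i][k], bw[k][j]))
    return bw


if __name__ == "__main__":
    nodes = ["A", "B", "C", "D"]
    links = {("A", "B"): 10, ("B", "C"): 5, ("A", "C"): 3, ("C", "D"): 8}
    labels = best_path_bandwidths(nodes, links)
    for u in nodes:
        print(u, labels[u])
```

The labels overestimate what is achievable when several ring edges end up on the same physical link, which is consistent with the absence of guarantee noted above.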


slide-168
SLIDE 168

Example: an Actual Platform (Lyon)

[Figure: Topology. Node labels: moby, canaria, myri0, popc0, sci0, Hub, Switch, sci3, sci2, sci4, sci5, sci6, sci1, myri1, myri2, Hub, router, backbone, routlhpc.]

P0 P1 P2 P3 P4 P5 P6 P7 P8
0.0206 0.0206 0.0206 0.0206 0.0291 0.0206 0.0087 0.0206 0.0206

P9 P10 P11 P12 P13 P14 P15 P16
0.0206 0.0206 0.0206 0.0291 0.0451

Processor processing times (in seconds per megaflop)


slide-169
SLIDE 169

Results

First heuristic (H1): builds the ring without taking link sharing into account.
Second heuristic (H2): takes link sharing into account (and uses quadratic programming).

Ratio Dc/Dw   H1              H2              Gain
0.64          0.008738 (1)    0.008738 (1)    0%
0.064         0.018837 (13)   0.006639 (14)   64.75%
0.0064        0.003819 (13)   0.001975 (14)   48.28%

Ratio Dc/Dw   H1              H2              Gain
0.64          0.005825 (1)    0.005825 (1)    0%
0.064         0.027919 (8)    0.004865 (6)    82.57%
0.0064        0.007218 (13)   0.001608 (8)    77.72%

Table: Tstep/Dw for each heuristic on Lyon’s and Strasbourg’s platforms (the numbers in parentheses show the size of the rings built).
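The Gain column appears to be the relative reduction of Tstep/Dw obtained by the second heuristic, that is (H1 − H2)/H1. This formula is inferred from the reported numbers and is not stated on the slide; the short check below reproduces the column to within 0.01 percentage point.

```python
def gain_percent(h1, h2):
    """Relative improvement of H2 over H1, in percent (inferred formula)."""
    return 100.0 * (h1 - h2) / h1


if __name__ == "__main__":
    # (H1, H2) pairs copied from the table above.
    rows = [
        (0.008738, 0.008738),
        (0.018837, 0.006639),
        (0.003819, 0.001975),
        (0.005825, 0.005825),
        (0.027919, 0.004865),
        (0.007218, 0.001608),
    ]
    for h1, h2 in rows:
        print(f"H1={h1:.6f}  H2={h2:.6f}  gain={gain_percent(h1, h2):.2f}%")
```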


slide-170
SLIDE 170

Conclusion

Even though this is a very basic application, it illustrates many difficulties encountered when:

◮ Processors have different characteristics
◮ Communication links have different characteristics
◮ There is an irregular interconnection network with complex bandwidth sharing issues.

We need to use a realistic model of networks... Even though a more realistic model leads to a much more complicated problem, this is worth the effort as derived solutions are more efficient in practice.


slide-171
SLIDE 171

Outline

5. Scheduling Divisible Workload
6. Iterative Algorithms
7. Data Redistribution

slide-173
SLIDE 173

A Data Redistribution Problem

When coupling code, data often have to be redistributed from one cluster to another. Using “Brute force” is generally not a good idea [Wag05].

[Figure: example platform. Data volumes: 200Mb, 100Mb, 100Mb. Bandwidths: 100Mb/s, 100Mb/s, 200Mb/s.]

slide-179
SLIDE 179

A Data Redistribution Problem

When coupling code, data often have to be redistributed from one cluster to another. Using “brute force” is generally not a good idea [Wag05].

Non-cooperative: Cmax = 2.5
Optimal: Cmax = 2

The bottleneck moves ⇒ resource waste

Moreover, opening dozens of connections at the same time is generally very intrusive for other users and often leads to performance degradation.


slide-180
SLIDE 180

Modeling

Input

◮ $b_1$ is the bandwidth of the sending cluster
◮ $b_2$ is the bandwidth of the receiving cluster
◮ $b_b$ is the bandwidth of the backbone
◮ $\beta$ is the latency of communications
◮ The redistribution is modeled by a bipartite graph $G = (V_1, V_2, m, E)$. $m(v_1, v_2)$ is the amount of data to transfer from $v_1$ to $v_2$.

A given processor can communicate with at most one processor at a time. Therefore we try to decompose our redistribution as a set of synchronous communication steps.

Output

We look for a set $D$ of $h$ matchings $M_1 = (E_1, m_1), \ldots, M_h = (E_h, m_h)$ such that:

$$\forall (v_1, v_2) \in E: \quad m(v_1, v_2) = \sum_{l=1}^{h} m_l(v_1, v_2)$$

slide-182
SLIDE 182

Modeling

Objective function

The time needed for a communication step $M_l$ is equal to...

It is unclear. It depends on the bandwidth sharing. This is why the problem has been modeled in a different way. Let’s do it one more time!

Let us denote by $w(v_1, v_2)$ the minimum communication time to transfer $m(v_1, v_2)$ from $v_1$ to $v_2$:

$$w(v_1, v_2) = \frac{m(v_1, v_2)}{\min(b_1, b_2, b_b)}$$

The maximum number of flows that can be sent at full speed is bounded by:

$$k = \left\lfloor \frac{b_b}{\min(b_1, b_2, b_b)} \right\rfloor$$
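As a quick numeric illustration of these two quantities, here is a minimal sketch. It assumes that the bound k is the integer part (floor) of the backbone bandwidth divided by the per-flow bandwidth; the function names and the numeric values are hypothetical.

```python
import math


def min_transfer_time(m, b1, b2, bb):
    """w(v1, v2): time to send m units when the flow runs at full speed."""
    return m / min(b1, b2, bb)


def max_full_speed_flows(b1, b2, bb):
    """k: how many full-speed flows fit in the backbone (assumed floor)."""
    return math.floor(bb / min(b1, b2, bb))


if __name__ == "__main__":
    # Hypothetical platform: 100 Mb/s on each cluster side, 200 Mb/s backbone.
    b1, b2, bb = 100, 100, 200
    print("k =", max_full_speed_flows(b1, b2, bb))
    print("w =", min_transfer_time(200, b1, b2, bb), "s for 200 Mb")
```

With non-integer bandwidths, floating-point division can land just below an integer, so exact rational arithmetic (for instance `fractions.Fraction`) may be preferable when computing k.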


slide-183
SLIDE 183

K-Preemptive Bipartite Scheduling

Input

◮ $\beta$ is the latency of communications
◮ The redistribution is modeled by a bipartite graph $G = (V_1, V_2, w, E)$. $w(v_1, v_2)$ is the time required to transfer data from $v_1$ to $v_2$.
◮ At most $k$ simultaneous communications can be done.

Output

We look for a set $D$ of $h$ matchings $M_1 = (E_1, w_1), \ldots, M_h = (E_h, w_h)$ such that:

$$\forall (v_1, v_2) \in E: \quad w(v_1, v_2) = \sum_{l=1}^{h} w_l(v_1, v_2) \qquad \text{and} \qquad \forall l: \ |E_l| \le k$$


slide-184
SLIDE 184

K-Preemptive Bipartite Scheduling

Objective function

The time needed for a communication step $M_l$ is equal to

$$c(M_l) = \max_{e \in E_l} w_l(e) + \beta$$

Therefore, the cost of a distribution $D$ is

$$c(D) = \sum_{l=1}^{h} c(M_l) = h\beta + \sum_{l=1}^{h} \max_{e \in E_l} w_l(e)$$

There are two difficulties:

◮ The trade-off between the number of steps and the latency.
◮ We look for bounded-size matchings.

PBS is the exact same problem where the bound on the size of matchings is removed.
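The definitions above translate directly into a cost function and a validity check for a candidate schedule. This is a sketch under assumptions: a step is encoded as a dictionary mapping an edge (v1, v2) to the transfer time scheduled in that step, and the function names and the example instance are hypothetical.

```python
def schedule_cost(steps, beta):
    """c(D): each step costs its latency plus its longest transfer."""
    return sum(beta + max(step.values(), default=0.0) for step in steps)


def is_valid_schedule(steps, w, k):
    """Check the KPBS constraints for a list of steps.

    Each step must be a matching with at most k edges, and the times
    scheduled for every edge must add up to its total time w[edge].
    """
    sent = {}
    for step in steps:
        if len(step) > k:
            return False
        senders = [v1 for v1, _ in step]
        receivers = [v2 for _, v2 in step]
        if len(set(senders)) < len(senders) or len(set(receivers)) < len(receivers):
            return False  # a node appears twice: not a matching
        for edge, t in step.items():
            sent[edge] = sent.get(edge, 0.0) + t
    return sent.keys() == w.keys() and all(abs(sent[e] - w[e]) < 1e-9 for e in w)


if __name__ == "__main__":
    # Hypothetical instance: two senders, two receivers, k = 2, beta = 0.5.
    w = {("s1", "r1"): 2.0, ("s1", "r2"): 1.0, ("s2", "r1"): 1.0}
    steps = [
        {("s1", "r1"): 2.0},
        {("s1", "r2"): 1.0, ("s2", "r1"): 1.0},
    ]
    print(is_valid_schedule(steps, w, k=2), schedule_cost(steps, beta=0.5))
```

Splitting a transfer over more steps (preemption) can shorten the longest transfer of each step, but every extra step adds β to the cost; this is the trade-off listed as the first difficulty.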


slide-185
SLIDE 185

A few results on the KPBS complexity

◮ KPBS is strongly NP-hard.
◮ PBS cannot be approximated with a ratio smaller than $\frac{7}{6}$.
◮ PBS can be approximated with a ratio $2 - \frac{1}{\beta+1}$.
◮ KPBS can be approximated with a ratio $\frac{8}{3}$.


slide-186
SLIDE 186

Getting Rid of Annoying Constraints

◮ The k bound is somewhat artificial but is due to the 1-port model.
◮ By getting rid of the latencies, you get a polynomial fractional matching problem (if you have understood the previous talk that used linear programming and the ellipsoid method, you should see why).
◮ With a few “standard tricks” you can even introduce release dates and optimize the maximum weighted flow instead of the makespan...
◮ However, taking the whole topology into account is more tricky.
  ◮ Indeed, under a bounded multiport model, the problem is trivial.
  ◮ However, if you want to keep the 1-port constraint, you need some matching with non-uniform bandwidth allocation, which seems to be more tricky.

Note there are also problems for which the latency is not an issue but where the hardness really comes from the bound on the number of simultaneous connections [MYCR06].

slide-187
SLIDE 187

Conclusion

Ensure that all parts of your modeling are mandatory. Maybe if k is large in practice and your latencies can be somehow overlapped, then they may not be worth being considered.

slide-188
SLIDE 188
A. Alexandrov, M. Ionescu, K. Schauser, and C. Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1):71–79, 1997.

Olivier Beaumont, Henri Casanova, Arnaud Legrand, Yves Robert, and Yang Yang. Scheduling Divisible Loads on Star and Tree Networks: Results and Open Problems. IEEE Trans. Parallel Distributed Systems, 16(3):207–218, 2005.

V. Bharadwaj, D. Ghose, V. Mani, and T.G. Robertazzi. Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE Computer Society Press, 1996.

D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian, and T. von Eicken. LogP: a practical model of parallel computation. Communications of the ACM, 39(11):78–85, 1996.

R. W. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20:389–398, 1994.

B. Hong and V.K. Prasanna. Distributed adaptive task allocation in heterogeneous computing environments to maximize throughput. In International Parallel and Distributed Processing Symposium IPDPS’2004. IEEE Computer Society Press, 2004.

T. Kielmann, H. E. Bal, and K. Verstoep. Fast measurement of LogP parameters for message passing platforms. In Proceedings of the 15th IPDPS. Workshops on Parallel and Distributed Processing, 2000.

Steven H. Low. A duality model of TCP and queue management algorithms. IEEE/ACM Transactions on Networking, 2003.

Dong Lu, Yi Qiao, Peter A. Dinda, and Fabián E. Bustamante. Characterizing and predicting TCP throughput on the wide area network. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05), 2005.

Arnaud Legrand, Hélène Renard, Yves Robert, and Frédéric Vivien. Mapping and load-balancing iterative computations on heterogeneous clusters with shared links. IEEE Trans. Parallel Distributed Systems, 15(6):546–558, 2004.

Maxime Martinasso. Analyse et modélisation des communications concurrentes dans les réseaux haute performance. PhD thesis, Université Joseph Fourier de Grenoble, 2007.

Laurent Massoulié and James Roberts. Bandwidth sharing: Objectives and algorithms. In INFOCOM (3), pages 1395–1403, 1999.

Loris Marchal, Yang Yang, Henri Casanova, and Yves Robert. Steady-state scheduling of multiple divisible load applications on wide-area distributed computing platforms. Int. Journal of High Performance Computing Applications, (3), 2006.

T.G. Robertazzi. Divisible Load Scheduling. http://www.ece.sunysb.edu/~tom/dlt.html.

Frédéric Wagner. Redistribution de données à travers un réseau haut débit. PhD thesis, Université Henri Poincaré Nancy 1, 2005.

Yang Yang, Henri Casanova, Maciej Drozdowski, Marcin Lawenda, and Arnaud Legrand. On the Complexity of Multi-Round Divisible Load Scheduling. Research Report 6096, INRIA, January 2007.