Methods for Partitioning Data to Improve Parallel Execution Time for Sorting on Heterogeneous Clusters (PowerPoint presentation)
SLIDE 1

Methods for Partitioning Data to Improve Parallel Execution Time for Sorting on Heterogeneous Clusters

C. Cérin (1), J.-C. Dubacq (1), J.-L. Roch (2)

(1) LIPN, Université de Paris Nord
(2) ID-IMAG, Université Joseph Fourier, Grenoble

Global and Pervasive Computing 2006 (Taichung, Taiwan)

SLIDE 2

Outline

1. Motivation
   • The partitioning problem
   • Splitting data

2. Contribution
   • General exact analytic approach
   • Dynamic evaluation of the complexity function
   • Non-uniformly related processors
   • Experiments


SLIDE 8

Partitioning large data sets for sorting

• Large data sets require a lot of computation time for sorting;
• Data chunks of equal size are used to do the job on parallel machines.

Model:
• Infinite point-to-point bandwidth;
• Heterogeneous speed: relative linear speed;
• No study of memory effects.

SLIDE 15

Methodology

1. Data chunks are sent from node 0 to nodes 1, …, p − 1;
2. Each processor sorts its data chunk locally;
3. Node 0 receives p − 1 pivots, sorts them, and broadcasts them;
4. Each processor uses the pivots to split its data;
5. Each processor transmits all its (split) data to the others;
6. Each processor merges all the data it received with its own.

Observation: with fixed p, the computation-intensive part is step 2.
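The six steps translate directly into code. Below is a minimal single-process sketch that simulates the p nodes with plain Python lists (no MPI; all names are illustrative, and the choice of one median per node as pivot is just one simple possibility consistent with step 3):

```python
import heapq
import random
from bisect import bisect_right

def simulated_parallel_sort(data, p):
    """Single-process simulation of the six-step methodology."""
    n = len(data)
    # Step 1: node 0 sends equal-size chunks to nodes 0..p-1.
    chunks = [data[i * n // p:(i + 1) * n // p] for i in range(p)]
    # Step 2: each node sorts its chunk locally (the dominant cost).
    chunks = [sorted(c) for c in chunks]
    # Step 3: node 0 receives p-1 pivots (here: each other node's
    # median), sorts them and broadcasts them.
    pivots = sorted(c[len(c) // 2] for c in chunks[1:])
    # Step 4: each node splits its sorted chunk at the pivots.
    def split(chunk):
        cuts = [0] + [bisect_right(chunk, piv) for piv in pivots] + [len(chunk)]
        return [chunk[cuts[j]:cuts[j + 1]] for j in range(p)]
    parts = [split(c) for c in chunks]
    # Step 5: all-to-all exchange: bucket j of every node goes to node j.
    # Step 6: each node merges the sorted runs it received.
    return [list(heapq.merge(*(parts[i][j] for i in range(p))))
            for j in range(p)]

data = [random.randrange(10**6) for _ in range(10**4)]
out = simulated_parallel_sort(data, p=4)
# Concatenating the per-node outputs yields the globally sorted sequence.
assert [x for bucket in out for x in bucket] == sorted(data)
```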

SLIDE 19

Context: Grid’5000, heterogeneous clusters

• GRID’5000: French national research project on grids;
• Goal: 5000 nodes dedicated to experimental development;
• Current state: 2300 nodes, 13+ separate clusters, 9 sites, a dedicated 10 Gb/s dark-fibre connection.

Heterogeneity: clusters have different processors, and same-family processors have different clock speeds.


SLIDE 23

From homogeneous to heterogeneous processors

Goal: We have N objects to transmit and transform using p nodes. We want all computations to end at exactly the same time. The final merging is not relevant here.

Theorem (homogeneous case): If all nodes work at the same speed, the splitting of the data is optimal if one uses chunks of size N/p.

We define the relative speed k_i of node i as the number of operations it can perform per unit of time compared to a reference node, and set K = Σ_j k_j.

SLIDE 26

Previous works

The naïve algorithm uses chunks of size (k_i/K)·N and yields inadequate computation times.

Example (naïve algorithm, with f(n) = n log n):
• Node 1: k_1 = 1, n_1 = N/3, so T_1 = n_1 log n_1;
• Node 2: k_2 = 2, n_2 = 2N/3, so T_2 = (n_2 log n_2)/k_2 = n_1 log(2n_1) = T_1 + n_1 log 2 > T_1.

The faster node finishes later: proportional chunks do not equalise completion times.

Theorem (Cérin, Koskas, Jemni, Fkaier): For large N, the optimal chunk size is

  n_i = (k_i/K)·N + ε_i  (1 ≤ i ≤ p),  where  ε_i = (N / ln N) · (k_i / K²) · Σ_{j=1}^{p} k_j ln(k_j / k_i).
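A quick numerical check of the example and of the theorem's correction term (a sketch; the speeds and N are made-up values, and f(n) = n ln n stands in for the sorting cost):

```python
import math

def finish_times(sizes, k):
    # T_i = f(n_i) / k_i with f(n) = n ln n.
    return [n * math.log(n) / ki for n, ki in zip(sizes, k)]

k = [1.0, 2.0]                       # relative speeds, as in the example
K = sum(k)
N = 1e7

naive = [ki / K * N for ki in k]     # chunks of size (k_i/K) * N

def eps(i):                          # correction term from the theorem
    return (N / math.log(N)) * (k[i] / K**2) * sum(
        kj * math.log(kj / k[i]) for kj in k)

corrected = [ki / K * N + eps(i) for i, ki in enumerate(k)]

print(finish_times(naive, k))        # the fast node finishes later
print(finish_times(corrected, k))    # nearly equal finish times
```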


SLIDE 33

Basic approach

We use f̃ as the complexity function (T_i = f̃(n_i)/k_i):

  T = f̃(n_1)/k_1 = f̃(n_2)/k_2 = ··· = f̃(n_p)/k_p
  n_1 + n_2 + ··· + n_p = N

Thus we can derive these compact equations for the equal-time condition:

  n_i = f̃⁻¹(T·k_i)  and  Σ_{i=1}^{p} f̃⁻¹(T·k_i) = N.

Only one unknown variable (T) is left!
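Since every f̃⁻¹(T·k_i) is increasing in T, that single unknown can be found numerically even when no closed form exists. A minimal sketch, assuming f̃(n) = n ln n for concreteness and inverting both f̃ and the sum by bisection:

```python
import math

def f(n):                            # assumed complexity function f~
    return n * math.log(n) if n > 1 else 0.0

def f_inv(y):
    """Invert the increasing function f by bisection."""
    lo, hi = 1.0, 1e15
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return lo

def chunk_sizes(N, k):
    """Solve sum_i f_inv(T * k_i) = N for the single unknown T."""
    lo, hi = 0.0, f(N) / min(k)      # T is at most the one-node time
    for _ in range(100):
        T = (lo + hi) / 2
        if sum(f_inv(T * ki) for ki in k) < N:
            lo = T
        else:
            hi = T
    return [f_inv(T * ki) for ki in k]

sizes = chunk_sizes(1e7, [1.0, 1.5, 2.0])
print(sizes, sum(sizes))             # sizes sum to N; f(n_i)/k_i are equal
```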

SLIDE 36

The polynomial case

Theorem (polynomial case): If f̃ : x → α·x^β, then the optimal division is obtained with chunk sizes

  n_i = ( k_i^{1/β} / Σ_{j=1}^{p} k_j^{1/β} ) · N.

Proof: f̃⁻¹ is multiplicative up to the constant α, which cancels in the final formula:

  Σ_{i=1}^{p} f̃⁻¹(T·k_i) = N  ⇒  N = f̃⁻¹(T) · Σ_{i=1}^{p} f̃⁻¹(k_i)  ⇒  T = f̃( N / Σ_{i=1}^{p} f̃⁻¹(k_i) ).
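For instance, with f̃(x) = x² (β = 2) and two nodes of speeds k = (1, 4), the chunk sizes are proportional to k_i^{1/2} = (1, 2), so n_1 = N/3 and n_2 = 2N/3, and both nodes finish at T = f̃(N/3)/1 = f̃(2N/3)/4 = N²/9.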

SLIDE 39

The polylog case

Theorem: Initial values of n_i can be asymptotically computed from

  Σ_{i=1}^{p} [ T·k_i / ln(T·k_i) + T·k_i · ln ln(T·k_i) / (ln(T·k_i))² ] = N,
  with  n_i = T·k_i / ln(T·k_i) + T·k_i · ln ln(T·k_i) / (ln(T·k_i))².

Proof: The inverse of x → x ln x is y → y/W(y), where W is the Lambert W function (the inverse of x → x·eˣ). A well-known approximation is W(x) = ln x − ln ln x + o(1), which gives the expansion above.
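The quality of the two-term expansion is easy to check numerically. A sketch, again assuming f̃(n) = n ln n and inverting it exactly with Newton's method for comparison:

```python
import math

def exact_inverse(y):
    """Solve n * ln(n) = y by Newton's method (y assumed large)."""
    n = y / math.log(y)              # first-order starting point
    for _ in range(50):
        n -= (n * math.log(n) - y) / (math.log(n) + 1)
    return n

def two_term(y):
    """The expansion from the theorem: y/ln y + y ln ln y / (ln y)^2."""
    L = math.log(y)
    return y / L + y * math.log(L) / L**2

for y in (1e6, 1e9, 1e12):           # y plays the role of T * k_i
    print(y, exact_inverse(y), two_term(y))
# The relative error shrinks as y grows, as the o(1) term suggests.
```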


SLIDE 44

Framework for unknown complexity function

Goal: We want to cope with unknown complexity functions. We have several batches of data.

• If the speed vector is unknown, first submit a batch assuming the vector is [1, …, 1]. The time differences will tell what the relative speeds are, so from then on we may assume the speed vector is known;
• Deduce the chunk sizes n_i to send to each node i (in parallel for all nodes). Node i measures the treatment time for its chunk and reports it at the end;
• A piecewise representation of the complexity function is built, and missing values are interpolated.

SLIDE 52

Detailed algorithm

1. For each node i, precompute the mapping (T, i) → n_i as above, using interpolated values for f̃ where necessary. Deduce a mapping T → n by summing these mappings over all i.
2. Use a dichotomic search through the T → n mapping to find the ideal value of T (and thus all the n_i), and assign the chunks of data to the nodes;
3. When the chunk of size n_i has been treated by node i:
   3.1 Record the cost C = T_{n_i}·k_i (measured time multiplied by relative speed) of the computation for size n_i;
   3.2 If n_i already had a non-interpolated value, choose a new value C′ according to some strategy;
   3.3 If n_i was not a known point, set C′ = C;
   3.4 Ensure that the mapping defined by n → C(n) together with the new value n_i → C′ is still monotonically increasing.
4. A new batch can begin.
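A condensed sketch of one run of this loop, under the simplest possible choices: a piecewise-linear cost model per node, the trivial update strategy C′ = C for step 3.2, and a hidden "true" cost n ln n behind the measurement. The class and function names are illustrative, not from the paper:

```python
import bisect
import math

class NodeModel:
    """Piecewise-linear estimate of one node's cost function C(n)."""

    def __init__(self):
        self.ns = [0.0, 1.0]                 # known sizes (seed points)
        self.cs = [0.0, 1.0]                 # estimated cost at those sizes

    def inv(self, c):
        """Size n whose estimated cost is c, i.e. the (T, i) -> n_i
        mapping evaluated at c = T * k_i, interpolating between points."""
        j = bisect.bisect_left(self.cs, c)
        if j == 0:
            return 0.0
        if j == len(self.cs):                # extrapolate past the last point
            j -= 1
        n0, n1 = self.ns[j - 1], self.ns[j]
        c0, c1 = self.cs[j - 1], self.cs[j]
        return n0 + (c - c0) * (n1 - n0) / (c1 - c0)

    def record(self, n, c):
        """Steps 3.1-3.4: insert a measured cost and keep C increasing."""
        j = bisect.bisect_left(self.ns, n)
        self.ns.insert(j, n)
        self.cs.insert(j, c)
        for i in range(1, len(self.cs)):     # enforce monotonicity (step 3.4)
            self.cs[i] = max(self.cs[i], self.cs[i - 1] + 1e-9)

def assign(models, k, N):
    """Step 2: dichotomic search for T such that sum_i n_i(T) = N."""
    total = lambda T: sum(m.inv(T * ki) for m, ki in zip(models, k))
    lo, hi = 0.0, 1.0
    while total(hi) < N:                     # grow until the bracket holds T
        hi *= 2
    for _ in range(60):
        T = (lo + hi) / 2
        lo, hi = (T, hi) if total(T) < N else (lo, T)
    return [m.inv(T * ki) for m, ki in zip(models, k)]

k = [1.0, 2.0, 4.0]
models = [NodeModel() for _ in k]
for batch in range(5):                       # step 4: successive batches
    sizes = assign(models, k, N=1e6)
    for m, ki, n in zip(models, k, sizes):
        t = n * math.log(max(n, 2.0)) / ki   # 'measured' time on node i
        m.record(n, t * ki)                  # step 3.1: cost C = T_{n_i} * k_i
    print([round(s) for s in sizes])
```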


SLIDE 57

Non-uniformly related processors

Goal: We want to cope with complexity functions that depend on the node characteristics.

We can minimise the following quantity by dynamic programming:

  T(N, p) = max_{i=1,…,p} f_i(n_i) = min over (x_1, …, x_p) ∈ ℕ^p with Σ_{i=1}^{p} x_i = N of max_{i=1,…,p} f_i(x_i),

using the recurrence

  T(m, i) = min_{n_i=0..m} max( f_i(n_i), T(m − n_i, i − 1) ).

Theorem: The optimal partition is computed in O(N²·p) time.
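The recurrence translates directly into a table computation. A sketch with made-up per-node cost functions f_i; the O(N²·p) running time from the theorem limits it to modest N:

```python
import math

def optimal_partition(N, fs):
    """Dynamic program for T(m, i) = min_n max(f_i(n), T(m - n, i - 1))."""
    p = len(fs)
    INF = float("inf")
    T = [[INF] * (N + 1) for _ in range(p + 1)]   # T[i][m]
    best = [[0] * (N + 1) for _ in range(p + 1)]  # chosen n_i
    T[0][0] = 0.0
    for i in range(1, p + 1):
        fi = fs[i - 1]
        for m in range(N + 1):
            for n in range(m + 1):                # O(N^2 p) overall
                v = max(fi(n), T[i - 1][m - n])
                if v < T[i][m]:
                    T[i][m], best[i][m] = v, n
    sizes, m = [], N                              # backtrack the chunk sizes
    for i in range(p, 0, -1):
        sizes.append(best[i][m])
        m -= best[i][m]
    return T[p][N], sizes[::-1]

# Illustrative, non-uniform costs: node speeds 1, 1.5 and 2.
fs = [lambda n, k=k: n * math.log(n + 1) / k for k in (1.0, 1.5, 2.0)]
print(optimal_partition(300, fs))
```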


SLIDE 64

Experiments

• Records 100 bytes long; two classes of computers (k = 1 and k = 1.5);
• 54 GB of data, 50 runs per experiment, bi-Opteron processors, CPU-burning;
• 96 nodes used;
• Minute Sort benchmark compliant.

Results:

  naive algorithm             125.4 s
  partitioning                112.7 s
  partitioning (2 threads)     69.4 s

SLIDE 68

Summary

• Polynomial complexity functions yield a simple formula:

    n_i = ( f̃⁻¹(k_i) / Σ_{j=1}^{p} f̃⁻¹(k_j) ) · N;

• Unknown complexity functions can still be managed, but require incremental construction of the cost model;
• Dynamic programming can also be used in more general cases.

Future work
• Limited-bandwidth models and heterogeneous network links;
• Non-linear computation-time models;
• Global optimisation.