Mapping skeleton workflows onto heterogeneous platforms (Anne Benoit)

SLIDE 1

Introduction Framework Example Part 1 - Coms, No Rep/DP, 1c Part 2 - No coms, Rep/DP, 2c Conclusion

Mapping skeleton workflows onto heterogeneous platforms

Anne Benoit and Yves Robert

GRAAL team, LIP, École Normale Supérieure de Lyon

June 2007

Anne.Benoit@ens-lyon.fr Cetraro, June 07 Mapping skeleton workflows WS’07 1/ 44

SLIDE 2

Introduction and motivation

  • Mapping applications onto parallel platforms: difficult challenge
  • Heterogeneous clusters, fully heterogeneous platforms: even more difficult!
  • Structured programming approach
      - Easier to program (deadlocks, process starvation)
      - Range of well-known paradigms (pipeline, farm)
      - Algorithmic skeleton: help for mapping

Mapping skeletons (pipeline, fork) onto heterogeneous platforms

SLIDE 9

Rule of the game

  • Map each pipeline stage on a single processor (extended later)
  • Goal: minimize execution time (extended later)
  • Several mapping strategies: One-to-one Mapping, Interval Mapping, General Mapping

[Figure: the pipeline application S_1, S_2, ..., S_k, ..., S_n, shown in turn under One-to-one Mapping, Interval Mapping and General Mapping]

SLIDE 14

Major contributions

Theory
  • Formal approach to the problem, definition of replication and data-parallelism
  • Problem complexity for several cases
  • Integer linear program for exact resolution

Practice
  • Heuristics for Interval Mapping on clusters
  • Experiments to compare heuristics and evaluate their absolute performance

SLIDE 16

Outline

1. Framework
2. Working out an example
3. Part 1 - Communications, monolithic stages, mono-criterion
4. Part 2 - Simpler model with no communications, but with replication/DP and bi-criteria
5. Conclusion

SLIDE 18

The application: pipeline graphs

[Figure: pipeline S_1 → S_2 → ... → S_k → ... → S_n; stage S_k has computation weight w_k, S_1 receives input of size δ_0 and S_k outputs data of size δ_k]

  • n stages S_k, 1 ≤ k ≤ n
  • Stage S_k:
      - receives input of size δ_{k-1} from S_{k-1}
      - performs w_k computations
      - outputs data of size δ_k to S_{k+1}

SLIDE 19

The application: fork graphs

[Figure: fork graph; the root stage S_0 (weight w_0) receives input of size δ_{-1} and sends data of size δ_0 to each of S_1, ..., S_n; stage S_k has weight w_k and outputs data of size δ_k]

  • n + 1 stages S_k, 0 ≤ k ≤ n
      - S_0: root stage
      - S_1 to S_n: independent stages
  • A data-set goes through stage S_0, then it can be executed simultaneously for all other stages

SLIDE 20

The platform

[Figure: processors P_in, P_u, P_v, P_out with speeds s_in, s_u, s_v, s_out, connected by links of bandwidth b_{in,u}, b_{u,v}, b_{v,out}]

  • p processors P_u, 1 ≤ u ≤ p, fully interconnected
  • s_u: speed of processor P_u
  • bidirectional link link_{u,v} : P_u → P_v, bandwidth b_{u,v}
  • one-port model: each processor can either send, receive or compute at any time-step

SLIDE 21

Different platforms

  • Fully Homogeneous – Identical processors (s_u = s) and links (b_{u,v} = b): typical parallel machines
  • Communication Homogeneous – Different-speed processors (s_u ≠ s_v), identical links (b_{u,v} = b): networks of workstations, clusters
  • Fully Heterogeneous – Fully heterogeneous architectures, s_u ≠ s_v and b_{u,v} ≠ b_{u′,v′}: hierarchical platforms, grids

SLIDE 22

Rule of the game

  • Consecutive data-sets fed into the workflow
  • Period T_period = time interval between the beginning of execution of two consecutive data sets (throughput = 1/T_period)
  • Latency T_latency(x) = time elapsed between beginning and end of execution for a given data set x, and T_latency = max_x T_latency(x)
  • Map each pipeline/fork stage on one or several processors
  • Goal: minimize T_period or T_latency, or bi-criteria minimization

SLIDE 24

Stage types

  • Monolithic stages: must be mapped on one single processor, since computation for a data-set may depend on the result of a previous computation
  • Replicable stages: can be replicated on several processors, but are not parallel, i.e. a data-set must be entirely processed on a single processor
  • Data-parallel stages: inherently parallel stages, one data-set can be computed in parallel by several processors

SLIDE 25

Replication

Replicate stage S_k on P_1, ..., P_q

              -- S_k on P_1: data sets 1, 4, 7, ... --
... S_{k-1} ---- S_k on P_2: data sets 2, 5, 8, ... ---- S_{k+1} ...
              -- S_k on P_3: data sets 3, 6, 9, ... --

  • S_{k+1} may be monolithic: output order must be respected
  • Round-robin rule to ensure output order
  • Cannot feed more data sets to fast processors than to slow ones
  • Most efficient with similar-speed processors
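
The round-robin rule can be illustrated with a short Python sketch. The function name `round_robin_assignment` and the 1-based numbering of data sets and replicas are illustrative choices, not notation from the talk.

```python
from typing import Dict, List


def round_robin_assignment(num_data_sets: int, q: int) -> Dict[int, List[int]]:
    """Distribute data sets 1..num_data_sets over q replicas in round-robin order."""
    if q < 1:
        raise ValueError("need at least one replica")
    assignment: Dict[int, List[int]] = {u: [] for u in range(1, q + 1)}
    for x in range(1, num_data_sets + 1):
        # Data set x goes to replica ((x - 1) mod q) + 1, so the outputs of the
        # replicas can be interleaved back into the original order.
        assignment[(x - 1) % q + 1].append(x)
    return assignment


print(round_robin_assignment(9, 3))
# {1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}
```

Every replica receives the same share of data sets, so the slowest replica sets the pace. This is why replication pays off mostly with similar-speed processors.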

SLIDE 27

Data-parallelism

Data-parallelize stage S_k on P_1, ..., P_q

Example: S_k with w = 16 work units, shared among three processors:

  P_1 (s_1 = 2): 8 units
  P_2 (s_2 = 1): 4 units
  P_3 (s_3 = 1): 4 units

  • Perfect sharing of the work
  • Data-parallelize single stage only
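
A minimal sketch of the sharing rule, assuming that "perfect sharing" means each processor receives work in proportion to its speed (which is consistent with the 8/4/4 split shown for speeds 2, 1, 1). The function name `data_parallel_shares` is hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def data_parallel_shares(w: int, speeds: Sequence[int]) -> Tuple[List[Fraction], Fraction]:
    """Split w work units proportionally to the speeds; return (shares, common finish time)."""
    total_speed = sum(speeds)
    # Processor u gets w * s_u / sum(s), so all processors finish at the same time.
    shares = [Fraction(w) * s / total_speed for s in speeds]
    finish_time = Fraction(w) / total_speed
    return shares, finish_time


shares, t = data_parallel_shares(16, [2, 1, 1])
print([int(x) for x in shares], t)  # [8, 4, 4] 4
```

Exact fractions are used so that the shares always sum to `w` without rounding error.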

SLIDE 29

Interval Mapping for pipeline graphs

  • Several consecutive stages onto the same processor
  • Increase computational load, reduce communications
  • Partition of [1..n] into m intervals I_j = [d_j, e_j] (with d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m − 1 and e_m = n)
  • Interval I_j mapped onto processor P_{alloc(j)}

$$T_{\text{period}} = \max_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{\text{alloc}(j-1),\text{alloc}(j)}} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\text{alloc}(j)}} + \frac{\delta_{e_j}}{b_{\text{alloc}(j),\text{alloc}(j+1)}} \right\}$$

$$T_{\text{latency}} = \sum_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{\text{alloc}(j-1),\text{alloc}(j)}} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\text{alloc}(j)}} \right\} + \frac{\delta_n}{b_{\text{alloc}(m),\text{alloc}(m+1)}}$$

SLIDE 33

Simpler problem, replication and data-parallelism

  • No communication costs nor overheads
  • Cost to execute S_i on P_u alone: w_i / s_u
  • Cost to data-parallelize [S_i, S_j] (i = j for pipeline; 0 < i ≤ j or i = j = 0 for fork) on k processors P_{q_1}, ..., P_{q_k}:

$$\frac{\sum_{\ell=i}^{j} w_\ell}{\sum_{u=1}^{k} s_{q_u}}$$

  • Cost = T_period of assigned processors
  • Cost = delay to traverse the interval

SLIDE 36

Simpler problem, replication and data-parallelism

  • Cost to replicate [S_i, S_j] on k processors P_{q_1}, ..., P_{q_k}:

$$\frac{\sum_{\ell=i}^{j} w_\ell}{k \times \min_{1 \le u \le k} s_{q_u}}$$

  • Cost = T_period of assigned processors
  • Delay to traverse the interval = time needed by slowest processor:

$$t_{\max} = \frac{\sum_{\ell=i}^{j} w_\ell}{\min_{1 \le u \le k} s_{q_u}}$$

  • With these formulas: easy to compute T_period and T_latency for pipeline graphs
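
As a rough illustration, the cost formulas of the simpler model (no communications) fit in three one-line functions. The function names and the sample numbers are hypothetical; only the formulas come from the slides.

```python
from typing import Sequence


def data_parallel_cost(weights: Sequence[float], speeds: Sequence[float]) -> float:
    """Period and traversal delay of a data-parallelized interval: sum(w) / sum(s)."""
    return sum(weights) / sum(speeds)


def replication_period(weights: Sequence[float], speeds: Sequence[float]) -> float:
    """Period of a replicated interval: sum(w) / (k * min(s))."""
    return sum(weights) / (len(speeds) * min(speeds))


def replication_delay(weights: Sequence[float], speeds: Sequence[float]) -> float:
    """Delay to traverse a replicated interval: time needed by the slowest processor."""
    return sum(weights) / min(speeds)


weights, speeds = [6, 2], [3, 1]   # made-up numbers, not from the talk
print(data_parallel_cost(weights, speeds))   # 2.0
print(replication_period(weights, speeds))   # 4.0
print(replication_delay(weights, speeds))    # 8.0
```

In this made-up case replication is bounded by the slow processor (speed 1), whereas data-parallelism uses the full aggregate speed of 4.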

SLIDE 39

Working out an example

Pipeline S_1 → S_2 → S_3 → S_4 with w_1 = 14, w_2 = 4, w_3 = 2, w_4 = 4
Interval mapping, 4 processors, s_1 = 2 and s_2 = s_3 = s_4 = 1

  • Optimal period? T_period = 7, S_1 → P_1, S_2S_3 → P_2, S_4 → P_3 (T_latency = 17)
  • Optimal latency? T_latency = 12, S_1S_2S_3S_4 → P_1 (T_period = 12)
  • Min. latency if T_period ≤ 10? T_latency = 14, S_1S_2S_3 → P_1, S_4 → P_2
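
The three answers can be checked by brute force, since the instance is tiny. The sketch below assumes the communication-free model (period = largest interval time, latency = sum of interval times) and enumerates every interval mapping on distinct processors; all helper names are hypothetical.

```python
from itertools import combinations, permutations

W = [14, 4, 2, 4]   # w_1..w_4 from the example
S = [2, 1, 1, 1]    # s_1..s_4 from the example


def all_interval_mappings(n, p):
    """Yield (intervals, procs): consecutive half-open intervals and distinct processors."""
    for m in range(1, min(n, p) + 1):
        for cuts in combinations(range(1, n), m - 1):
            bounds = (0, *cuts, n)
            intervals = [(bounds[j], bounds[j + 1]) for j in range(m)]
            for procs in permutations(range(p), m):
                yield intervals, procs


def period_and_latency(w, s, intervals, procs):
    """No communications: each interval costs (sum of its weights) / (processor speed)."""
    times = [sum(w[a:b]) / s[u] for (a, b), u in zip(intervals, procs)]
    return max(times), sum(times)


results = [period_and_latency(W, S, iv, pr)
           for iv, pr in all_interval_mappings(len(W), len(S))]
best_period = min(t for t, _ in results)
best_latency = min(lat for _, lat in results)
best_latency_bounded = min(lat for t, lat in results if t <= 10)
print(best_period, best_latency, best_latency_bounded)  # 7.0 12.0 14.0
```

Exhaustive enumeration is only practical for such small instances; it serves here as a sanity check, not as a mapping algorithm.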

SLIDE 43

Example with replication and data-parallelism

Pipeline S_1 → S_2 → S_3 → S_4 with w_1 = 14, w_2 = 4, w_3 = 2, w_4 = 4
Interval mapping, 4 processors, s_1 = 2 and s_2 = s_3 = s_4 = 1

Replicate interval [S_u..S_v] on P_1, ..., P_q (round-robin: the copy on P_1 handles data sets 1, 4, 7, ...; the copy on P_2 handles 2, 5, 8, ...; the copy on P_3 handles 3, 6, 9, ...):

$$T_{\text{period}} = \frac{\sum_{k=u}^{v} w_k}{q \times \min_i (s_i)} \quad\text{and}\quad T_{\text{latency}} = q \times T_{\text{period}}$$

Data-parallelize single stage S_k on P_1, ..., P_q (for instance w = 16 shared as 8, 4 and 4 units on processors of speed 2, 1 and 1):

$$T_{\text{period}} = \frac{w_k}{\sum_{i=1}^{q} s_i} \quad\text{and}\quad T_{\text{latency}} = T_{\text{period}}$$

Optimal period?

  • S_1 (DP) → P_1P_2, S_2S_3S_4 (REP) → P_3P_4:
    T_period = max(14/(2+1), (4+2+4)/(2×1)) = 5, T_latency = 14.67
  • S_1 (DP) → P_2P_3P_4, S_2S_3S_4 → P_1:
    T_period = max(14/(1+1+1), (4+2+4)/2) = 5, T_latency = 9.67 (optimal)
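
The arithmetic behind the two mappings can be replayed with a small evaluator. This is a sketch under the stated no-communication model; the `(mode, weights, speeds)` encoding and the name `evaluate` are hypothetical, and an interval on a single processor is encoded as a replication with one copy.

```python
def evaluate(mapping):
    """mapping: list of (mode, weights, speeds) with mode "dp" or "rep".

    Returns (T_period, T_latency): the period is the largest interval period,
    the latency is the sum of the traversal delays.
    """
    periods, delays = [], []
    for mode, weights, speeds in mapping:
        work = sum(weights)
        if mode == "dp":        # data-parallel: period = delay = work / sum of speeds
            period = delay = work / sum(speeds)
        elif mode == "rep":     # replicated: the slowest copy sets the pace
            period = work / (len(speeds) * min(speeds))
            delay = work / min(speeds)
        else:
            raise ValueError(f"unknown mode: {mode}")
        periods.append(period)
        delays.append(delay)
    return max(periods), sum(delays)


first = [("dp", [14], [2, 1]), ("rep", [4, 2, 4], [1, 1])]
second = [("dp", [14], [1, 1, 1]), ("rep", [4, 2, 4], [2])]
for mapping in (first, second):
    period, latency = evaluate(mapping)
    print(period, round(latency, 2))
# 5.0 14.67
# 5.0 9.67
```

Both mappings reach the same period; the second one wins on latency because the three-stage interval runs on the fast processor instead of being replicated on two slow ones.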

SLIDE 49

Part 1

  • Pipeline graph
  • Different platforms, with communications
  • Different mapping strategies
  • Only monolithic stages: no replication nor data-parallelism
  • Mono-criterion: period minimization
  • Complexity results, heuristics and experiments

SLIDE 51

Complexity results

                      Fully Hom.    Comm. Hom.
One-to-one Mapping    polynomial    polynomial
Interval Mapping      polynomial    NP-complete
General Mapping       same complexity as Interval

  • Binary search polynomial algorithm for One-to-one Mapping
  • Dynamic programming algorithm for Interval Mapping on Hom. platforms (NP-hard otherwise)
  • General mapping: same complexity as Interval Mapping
  • All problem instances NP-complete on Fully Heterogeneous platforms
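
The slides do not spell out the dynamic programming algorithm. The sketch below is one standard chain-partitioning recurrence that fits the Fully Homogeneous case (a single speed s and a single bandwidth b); it is offered as an assumption about what such an algorithm can look like, not as the authors' formulation. The function name and argument layout are hypothetical.

```python
from functools import lru_cache
from typing import Sequence


def min_period_homogeneous(w: Sequence[float], delta: Sequence[float],
                           p: int, s: float, b: float) -> float:
    """Minimum period of an interval mapping on p identical processors.

    w[k-1] is the weight of stage k (k = 1..n); delta[k] is the size of the
    data leaving stage k, with delta[0] the initial input (len(delta) == n + 1).
    """
    n = len(w)
    prefix = [0.0]
    for x in w:
        prefix.append(prefix[-1] + x)

    def cost(i: int, j: int) -> float:
        # Period of the interval [i..j]: input comm + computation + output comm.
        return delta[i - 1] / b + (prefix[j] - prefix[i - 1]) / s + delta[j] / b

    @lru_cache(maxsize=None)
    def best(j: int, q: int) -> float:
        # Smallest achievable period for stages 1..j using at most q processors.
        if j == 0:
            return 0.0
        if q == 0:
            return float("inf")
        return min(max(best(i - 1, q - 1), cost(i, j)) for i in range(1, j + 1))

    return best(n, p)


print(min_period_homogeneous([1, 2, 3, 4, 5], [0] * 6, p=2, s=1.0, b=1.0))  # 9.0
```

The recurrence visits O(n·p) states and tries O(n) split points per state, so it runs in polynomial time, in line with the "polynomial" entry of the table for homogeneous platforms.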

SLIDE 56

One-to-one/Comm. Hom.: binary search algorithm

  • Work with fastest n processors, numbered P_1 to P_n, where s_1 ≤ s_2 ≤ ... ≤ s_n
  • Mark all stages S_1 to S_n as free
  • For u = 1 to n
      - Pick up any free stage S_k s.t. δ_{k-1}/b + w_k/s_u + δ_k/b ≤ T_period
      - Assign S_k to P_u, and mark S_k as already assigned
      - If no stage found return "failure"
  • Proof: exchange argument
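
A possible Python rendering of the greedy test, wrapped in a binary search. The slide gives only the feasibility test for a fixed T_period; searching over the finite set of candidate values δ_{k-1}/b + w_k/s_u + δ_k/b is an assumption made here so that the search terminates exactly. Function names and 0-based indexing are also illustrative.

```python
def greedy_assignment(T, w, delta, fastest, b):
    """Greedy test from the slide: slowest processor first, any fitting free stage."""
    free = list(range(len(w)))
    assignment = {}
    for u, s in enumerate(fastest):
        # Stage k (0-based) reads delta[k] and writes delta[k + 1].
        fit = next((k for k in free
                    if delta[k] / b + w[k] / s + delta[k + 1] / b <= T), None)
        if fit is None:
            return None            # "failure": no free stage fits on this processor
        free.remove(fit)
        assignment[fit] = u        # stage index -> rank in `fastest`
    return assignment


def min_period_one_to_one(w, delta, speeds, b):
    """Binary search for the smallest period accepted by the greedy test."""
    n = len(w)
    if len(speeds) < n:
        raise ValueError("one-to-one mapping needs at least as many processors as stages")
    fastest = sorted(speeds)[-n:]  # the n fastest processors, by increasing speed
    # The optimal period is the cost of some (stage, processor) pair.
    candidates = sorted({delta[k] / b + w[k] / s + delta[k + 1] / b
                         for k in range(n) for s in fastest})
    lo, hi = 0, len(candidates) - 1   # the largest candidate is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if greedy_assignment(candidates[mid], w, delta, fastest, b) is not None:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo], greedy_assignment(candidates[lo], w, delta, fastest, b)


period, assignment = min_period_one_to_one([14, 4, 2, 4], [0] * 5, [2, 1, 1, 1], b=1.0)
print(period)  # 7.0
```

Processing the slowest processor first means a stage that fits there is never "wasted": by the exchange argument, swapping it with a stage placed on a faster processor cannot break feasibility.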

SLIDE 58

Greedy heuristics

Target clusters: Comm. hom. platforms and Interval Mapping

  • H1a-GR: random – fixed intervals
  • H1b-GRIL: random interval length
  • H2-GSW: biggest Σw – Place interval with most computations on fastest processor
  • H3-GSD: biggest δ_in + δ_out – Intervals are sorted by communications (δ_in + δ_out); in: first stage of interval; (out − 1): last one
  • H4-GP: biggest period on fastest processor – Balancing computation and communication: processors sorted by decreasing speed s_u; for current processor u, choose interval with biggest period (δ_in + δ_out)/b + Σ_{i ∈ Interval} w_i/s_u
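
To make the H4-GP selection rule concrete, here is a minimal sketch. It assumes the intervals have already been fixed (the slide does not say how they are built), so only the "biggest period on fastest processor" step is shown; the function name and data layout are hypothetical.

```python
def greedy_period_mapping(intervals, w, delta, speeds, b):
    """Assign pre-cut intervals to processors following the H4-GP selection rule.

    intervals: list of (first, last) stage indices, 0-based and inclusive.
    delta[k] is the input size of stage k, delta[k + 1] its output size.
    Returns a dict interval -> processor index.
    """
    remaining = list(intervals)
    mapping = {}
    # Processors are visited by decreasing speed.
    for u in sorted(range(len(speeds)), key=lambda i: -speeds[i]):
        if not remaining:
            break

        def period(interval, u=u):
            first, last = interval
            comm = (delta[first] + delta[last + 1]) / b
            return comm + sum(w[first:last + 1]) / speeds[u]

        # The current processor takes the interval with the biggest period on it.
        chosen = max(remaining, key=period)
        remaining.remove(chosen)
        mapping[chosen] = u
    if remaining:
        raise ValueError("more intervals than processors")
    return mapping


print(greedy_period_mapping([(0, 0), (1, 2), (3, 3)],
                            w=[14, 4, 2, 4], delta=[0] * 5,
                            speeds=[1, 2, 1, 1], b=1.0))
# {(0, 0): 1, (1, 2): 0, (3, 3): 2}
```

The heaviest interval lands on the only fast processor (index 1), and the remaining intervals are placed in decreasing order of period on the slower ones.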

SLIDE 59

Sophisticated heuristics

  • H5-BS121: binary search for One-to-one Mapping – optimal algorithm for One-to-one Mapping. When p < n, the application is cut into fixed intervals of length L.
  • H6-SPL: splitting intervals – Processors sorted by decreasing speed, all stages assigned to the first processor. At each step, select the used processor j with the largest period and split its interval (give a fraction of its stages to j′) so as to minimize max(period(j), period(j′)); split only if the maximum period improves.
  • H7a-BSL and H7b-BSC: binary search (longest/closest) – Binary search on period P: start with stage s = 1, build intervals (s, s′) fitting on processors. For each u and each s′ ≥ s, compute period(s..s′, u) and check whether it is smaller than P. H7a maximizes s′; H7b chooses the closest period.

slide-60
SLIDE 60

Introduction Framework Example Part 1 - Coms, No Rep/DP, 1c Part 2 - No coms, Rep/DP, 2c Conclusion

Plan of experiments

Assess performance of polynomial heuristics
Random applications, n = 1 to 50 stages
Random platforms, p = 10 and p = 100 processors
b = 10 (comm. hom.), proc. speed between 1 and 20
Relevant parameters: ratios $\delta/b$ and $w/s$
Average over 100 similar random appli/platform pairs
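
A sketch of how such an experiment could be driven. The slides give the ranges but not the distributions, so uniform sampling is an assumption here, as are the default ranges for δ and w (they change from one experiment to the next) and every function name.

```python
import random


def random_instance(n, p, rng, b=10.0,
                    delta_range=(1.0, 100.0),
                    w_range=(1.0, 20.0),
                    speed_range=(1.0, 20.0)):
    """Draw one application/platform pair (uniform sampling is an assumption)."""
    w = [rng.uniform(*w_range) for _ in range(n)]              # stage workloads
    delta = [rng.uniform(*delta_range) for _ in range(n + 1)]  # data sizes between stages
    speeds = [rng.uniform(*speed_range) for _ in range(p)]     # processor speeds
    return w, delta, speeds, b


def average_period(heuristic, n, p, trials=100, seed=0):
    """Average the period returned by `heuristic` over `trials` random pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w, delta, speeds, b = random_instance(n, p, rng)
        total += heuristic(w, delta, speeds, b)
    return total / trials


def all_on_fastest(w, delta, speeds, b):
    """Baseline used for illustration: the whole pipeline on the fastest processor."""
    return (delta[0] + delta[-1]) / b + sum(w) / max(speeds)
```

For example, `average_period(all_on_fastest, n=20, p=10)` gives a baseline value against which a heuristic with the same signature could be compared.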


Experiment 1 - balanced comm/comp, hom comm

$\delta_i = 10$, computation time between 1 and 20

[Figure, 10 processors: maximum period vs. number of stages (p=10), one curve per heuristic: H1a-GreedyRandom, H1b-GreedyRandomIntervalLength, H2-GreedySumW, H3-GreedySumDinDout, H4-GreedyPeriod, H5-BinarySearch1to1, H6-SPLitting, H7a-BinarySearchLongest, H7b-BinarySearchClosest]

[Figure, 100 processors: maximum period vs. number of stages (p=100), same heuristics]


Experiment 2 - balanced comm/comp, het comm

communication time between 1 and 100
computation time between 1 and 20

[Figure: maximum period vs. number of stages (p=10), one curve per heuristic, H1a to H7b]

[Figure: maximum period vs. number of stages (p=100), one curve per heuristic, H1a to H7b]


Experiment 3 - large computations

communication time between 1 and 20
computation time between 10 and 1000

[Figure: maximum period vs. number of stages (p=10), one curve per heuristic, H1a to H7b]

[Figure: maximum period vs. number of stages (p=100), one curve per heuristic, H1a to H7b]


Experiment 4 - small computations

communication time between 1 and 20
computation time between 0.01 and 10

[Figure: maximum period vs. number of stages (p=10), one curve per heuristic, H1a to H7b]

[Figure: maximum period vs. number of stages (p=100), one curve per heuristic, H1a to H7b]


Summary of experiments

Much more efficient than random mappings
Three dominant heuristics for different cases
Insignificant communications (hom. or small) and many processors: H5-BS121 (One-to-one Mapping)
Insignificant communications (hom. or small) and few processors: H7b-BSC (binary search: clever choice where to split)
Important communications (het. or big): H6-SPL (splitting choice relevant for any number of processors)


Outline

1. Framework
2. Working out an example
3. Part 1 - Communications, monolithic stages, mono-criterion
4. Part 2 - Simpler model with no communications, but with replication/DP and bi-criteria
5. Conclusion


Part 2

Pipeline and fork graphs (Part 1: pipeline graph only)
Different platforms, without communications (Part 1: with communications)
Interval Mapping only (Part 1: different mapping strategies)
Replicable stages, and either data-parallelism or not (Part 1: only monolithic stages, no replication nor data-parallelism)
Bi-criteria optimization (Part 1: mono-criterion, period minimization)
Complexity results only (Part 1: complexity results, heuristics and experiments)


Complexity results

Without data-parallelism, Homogeneous platforms
Objective: period | latency | bi-criteria

  • Hom. pipeline
  • Het. pipeline

Poly (str)

  • Hom. fork
  • Poly (DP)
  • Het. fork

Poly (str) NP-hard


Complexity results

With data-parallelism, Homogeneous platforms
Objective: period | latency | bi-criteria

  • Hom. pipeline
  • Het. pipeline

Poly (DP)

  • Hom. fork
  • Poly (DP)
  • Het. fork

Poly (str) NP-hard


Complexity results

Without data-parallelism, Heterogeneous platforms
Objective: period | latency | bi-criteria

  • Hom. pipeline

Poly (*)

  • Poly (*)
  • Het. pipeline

NP-hard (**) Poly (str) NP-hard

  • Hom. fork

Poly (*)

  • Het. fork

NP-hard

  •

Complexity results

With data-parallelism, Heterogeneous platforms
Objective: period | latency | bi-criteria

  • Hom. pipeline

NP-hard

  • Het. pipeline
  • Hom. fork

NP-hard

  • Het. fork
  •

Complexity results

Most interesting case: Without data-parallelism, Heterogeneous platforms
Objective: period | latency | bi-criteria

  • Hom. pipeline

Poly (*)

  • Poly (*)
  • Het. pipeline

NP-hard (**) Poly (str) NP-hard

  • Hom. fork

Poly (*)

  • Het. fork

NP-hard

  •

No data-parallelism, Heterogeneous platforms

For pipeline, minimizing the latency is straightforward: map all stages on fastest proc
Minimizing the period is NP-hard (involved reduction similar to the heterogeneous chain-to-chain one) for general pipeline
Homogeneous pipeline: all stages have same workload w: in this case, polynomial complexity.
Polynomial bi-criteria algorithm for homogeneous pipeline
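
As a short worked equation for the latency claim, assuming the Part 2 model with no communications: the latency of a mapping is the sum of the stage computation times on the processors that execute them. The notation alloc(i) (the processor running stage i, or the slowest one of its replica set) and $s_{\max}$ is introduced here for illustration.

```latex
L \;=\; \sum_{i=1}^{n} \frac{w_i}{s_{\mathrm{alloc}(i)}}
  \;\ge\; \sum_{i=1}^{n} \frac{w_i}{s_{\max}},
\qquad s_{\max} = \max_{1 \le u \le p} s_u .
```

Equality holds when every stage is mapped on the fastest processor, which is why latency minimization alone is straightforward for the pipeline.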


Lemma: form of the solution

Pipeline, no data-parallelism, Heterogeneous platform

Lemma: If an optimal solution which minimizes pipeline period uses q processors, consider the q fastest processors $P_1, \dots, P_q$, ordered by non-decreasing speeds: $s_1 \le \dots \le s_q$. There exists an optimal solution which replicates intervals of stages onto k intervals of processors $I_r = [P_{d_r}, P_{e_r}]$, with $1 \le r \le k \le q$, $d_1 = 1$, $e_k = q$, and $e_r + 1 = d_{r+1}$ for $1 \le r < k$.

Proof: exchange argument, which does not increase latency


Binary-search/Dynamic programming algorithm

Given latency L, given period K
Loop on number of processors q
Dynamic programming algorithm to minimize latency
Success if L is obtained
Binary search on L to minimize latency for fixed period
Binary search on K to minimize period for fixed latency
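
The outer binary search on K can be sketched independently of the dynamic program. In this illustration the dynamic program is passed in as a callable `min_latency_for(K)`, and the search runs over a finite, user-supplied list of candidate periods. Both choices, and all names, are assumptions. The sketch also assumes that a larger period never increases the minimum latency.

```python
from typing import Callable, Optional, Sequence


def smallest_feasible_period(candidates: Sequence[float],
                             min_latency_for: Callable[[float], float],
                             latency_bound: float) -> Optional[float]:
    """Smallest candidate period K whose minimum latency is within latency_bound.

    min_latency_for(K) stands for the dynamic program (loop on q included);
    it must be non-increasing in K for the binary search to be valid.
    """
    cands = sorted(set(candidates))
    lo, hi = 0, len(cands) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if min_latency_for(cands[mid]) <= latency_bound:
            best = cands[mid]   # feasible: try a smaller period
            hi = mid - 1
        else:
            lo = mid + 1        # infeasible: the period must grow
    return best
```

The symmetric search (minimize latency for a fixed period) needs no loop in this sketch: it is the value `min_latency_for(K)` itself.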


Dynamic programming algorithm

Compute L(n, 1, q), where L(m, i, j) = minimum latency to map m pipeline stages on processors $P_i$ to $P_j$, while fitting in period K.

\[
L(m, i, j) = \min_{\substack{1 \le m' < m \\ i \le k < j}}
\begin{cases}
\dfrac{m \cdot w}{s_i} \quad \text{if } \dfrac{m \cdot w}{(j - i + 1) \cdot s_i} \le K & (1) \\[2ex]
L(m', i, k) + L(m - m', k + 1, j) & (2)
\end{cases}
\]

Case (1): replicating m stages onto processors $P_i, \dots, P_j$
Case (2): splitting the interval

Initialization:

\[
L(1, i, j) =
\begin{cases}
\dfrac{w}{s_i} & \text{if } \dfrac{w}{(j - i + 1) \cdot s_i} \le K \\[2ex]
+\infty & \text{otherwise}
\end{cases}
\qquad
L(m, i, i) =
\begin{cases}
\dfrac{m \cdot w}{s_i} & \text{if } \dfrac{m \cdot w}{s_i} \le K \\[2ex]
+\infty & \text{otherwise}
\end{cases}
\]

Complexity of the dynamic programming: $O(n^2 \cdot p^4)$
Number of iterations of the binary search formally bounded, very small number of iterations in practice.
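
The recurrence can be written as a memoized Python function. This is a minimal sketch, not the authors' code: processors are 0-indexed, the number of replicas in the range P_i..P_j is taken as j − i + 1 (the number of processors in that range, which agrees with the L(m, i, i) initialization), and the function names are illustrative.

```python
import math
from functools import lru_cache


def min_latency_fixed_q(n, w, speeds, K):
    """L(n, 1, q) for q = len(speeds), speeds sorted in non-decreasing order."""
    q = len(speeds)

    @lru_cache(maxsize=None)
    def L(m, i, j):
        best = math.inf
        # Case (1): replicate the m stages onto P_i..P_j; the slowest one, P_i, sets the pace.
        if m * w / ((j - i + 1) * speeds[i]) <= K:
            best = m * w / speeds[i]
        # Case (2): split stages and processors. The loops are empty when m == 1
        # or i == j, which leaves exactly the two initialization cases.
        for m1 in range(1, m):
            for k in range(i, j):
                best = min(best, L(m1, i, k) + L(m - m1, k + 1, j))
        return best

    return L(n, 0, q - 1)


def min_latency(n, w, speeds, K):
    """Loop on q: try the q fastest processors for each q and keep the best latency."""
    ordered = sorted(speeds)
    return min(min_latency_fixed_q(n, w, ordered[len(ordered) - q:], K)
               for q in range(1, len(ordered) + 1))
```

With n = 2 stages of work w = 4, speeds (1, 2) and K = 4, using both processors costs latency 6 (one stage each), while using only the fastest one costs 4, so the loop on q returns 4.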


Related work

Subhlok and Vondran – Extension of their work (pipeline on hom platforms)
Chains-to-chains – In our work possibility to replicate or data-parallelize
Mapping pipelined computations onto clusters and grids – DAG [Taura et al.], DataCutter [Saltz et al.]
Energy-aware mapping of pipelined computations [Melhem et al.], three-criteria optimization
Mapping pipelined computations onto special-purpose architectures – FPGA arrays [Fabiani et al.]. Fault-tolerance for embedded systems [Zhu et al.]
Mapping skeletons onto clusters and grids – Use of stochastic process algebra [Benoit et al.]


Conclusion

Theoretical side – Complexity results for several cases
  • Solid theoretical foundation for study of single/bi-criteria mappings, with possibility to replicate and data-parallelize application stages
Practical side
  • Optimal polynomial algorithms, heuristics for NP-hard instances of the problem
  • Experiments: Comparison of heuristics performance
  • Linear program to assess the absolute performance of the heuristics, which turns out to be quite good


Future work

Short term
  • Heuristics for Fully Heterogeneous platforms and other NP-hard instances of the problem
  • Extension to DAG-trees (a DAG which is a tree when un-oriented)
Longer term
  • Heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels
  • Real experiments on heterogeneous clusters, using an already-implemented skeleton library and MPI
  • Comparison of effective performance against theoretical performance


Open problems

Replication for fault-tolerance vs replication for parallelism
  • compute the same data-set several times in case of failure
  • uses more resources and does not decrease period or latency
  • increases robustness

Energy savings
  • processors that can run at different frequencies
  • trade-off between energy consumption and speed

Simultaneous execution of several (concurrent) workflows
  • competition for CPU and network resources
  • fairness between applications (stretch)
  • sensitivity to application/platform parameter changes
