tel-00637362, version 1 - 1 Nov 2011

Steady-State Scheduling on Heterogeneous Platforms

Matthieu Gallet

Advisors: Yves Robert and Frédéric Vivien

École Normale Supérieure de Lyon
GRAAL team, Laboratoire de l'Informatique du Parallélisme

October 20, 2009

Introduction

◮ Scheduling large applications on complex heterogeneous platforms

◮ Stream of data to process: video or audio streams, on-the-fly processing of experimental data, . . .

◮ Structured applications: repeatedly apply several filters to each data set

◮ Several computing resources, of different kinds

◮ Heterogeneous communication network

◮ How to organize all these computations?

Introduction

◮ Makespan minimization is a difficult problem → relaxations

◮ Divisible Load scheduling:
  ◮ Presentation of the Divisible Load Theory
  ◮ Scheduling divisible loads on a processor chain

◮ Steady-state scheduling:
  ◮ Mono-allocation schedules of task graphs on heterogeneous platforms
  ◮ Dynamic bag-of-tasks applications
  ◮ Computing the throughput of replicated workflows
  ◮ Task graph scheduling on the Cell processor

Outline of this talk

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

General presentation

[Figure: application graph T1, . . . , T4 mapped onto a platform of processors P1, . . . , P8 with speeds s1, . . . , s8 and link bandwidths bw1, . . . , bw9]

◮ Structured application: directed acyclic graph
  G_A = (V_A = {T1, . . . , Tn}, E_A = (F_{k,l})_{k,l})

◮ Many data sets to process

◮ Heterogeneous platform, modeled by a graph
  G_P = (V_P = {P1, . . . , Pp}, E_P = (Pq → Pr))

Objective function

Makespan (maximum completion time)

◮ The most natural objective function
◮ A lot of work on its minimization, but a difficult problem
◮ But. . . is the makespan relevant in our case?
◮ Not really:
  ◮ undefined for a continuous flow of data sets
  ◮ does not benefit from the regular structure of the problem

Throughput

Average number of processed data sets per time unit

◮ Well suited to continuous flows of data sets
◮ Exploits the regularity of the problem

Steady-state scheduling

◮ Focus on the schedule's core
◮ Neglect initiation and termination phases
◮ Well adapted to throughput maximization

Periodic schedules

[Figure: a periodic schedule on P1, P2, P3 with period τ]

◮ Optimal for throughput → asymptotically optimal for makespan
◮ Independent of the number of data sets → compact description

Allocation

Definition

An allocation of the application graph to the platform graph is a function σ associating:

◮ to each task Ti: a processor σ(Ti) that processes all instances of Ti assigned to the allocation

◮ to each file Fi,j: a set of communication links σ(Fi,j) that transfers this file from processor σ(Ti) to processor σ(Tj)

Different mapping strategies

A small example

[Figure: task chain T1 → T2 → T3 → T4 with files F1,2, F2,3, F3,4, to be mapped on processors P1, . . . , P7]

Simple solution, with a single allocation

A mono-allocation schedule

[Figure: a single allocation of T1, . . . , T4, each task on one processor; resulting periodic schedule on P1, . . . , P7 with period τ]

Multi-allocation steady-state scheduling

An optimal solution: multi-allocation steady-state scheduling

[Figure: two different allocations of T1, . . . , T4 used concurrently on P1, . . . , P7; resulting periodic schedule with period τ]

Mappings with replications

Round-robin distribution of replicated tasks

[Figure: T2 replicated on two processors and T3 on three; successive data sets are distributed round-robin among the replicas; resulting periodic schedule on P1, . . . , P7 with period τ]

A short comparison of the three methods 1/3

Mono-allocation steady-state schedules

◮ Easy to implement
◮ Handle stateful nodes
◮ Smaller buffers
◮ Less efficient schedules (stronger constraints)

A short comparison of the three methods 2/3

General multi-allocation steady-state solution

◮ Optimal throughput
◮ Polynomial computation time in almost all cases
◮ Very long periods
◮ Huge response time
◮ Complex allocation schemes
◮ Never fully implemented

A short comparison of the three methods 3/3

Replication with round-robin distribution

◮ Natural extension of a mono-allocation solution
◮ No buffer required to keep the initial order of data sets
◮ Fully implemented solution (DataCutter)
◮ Simple control, with a closed form to determine processors
◮ Less efficient schedules (stronger constraints)
◮ May not fully exploit each resource
◮ Hard to determine the throughput

Communication model

◮ Trade-off between realism and tractability
◮ Many different models
◮ One-Port model:
  ◮ Strict One-Port: a processor can either compute or perform a single communication
  ◮ Full-Duplex One-Port: a processor can either compute or simultaneously send and receive data
  ◮ One-Port with overlap of computations by communications
◮ Bounded Multiport model: several concurrent communications, respecting resource bandwidths
◮ Linear cost model: communication time proportional to data size

Communication model

Which model to choose? It depends on:

◮ Computing resources (single processors vs. multi-core processors or dedicated co-processors, . . . )
◮ Network (homogeneous network vs. a server with large bandwidth connected to many light clients)
◮ Applications and software resources (blocking communications vs. multithreaded libraries)
◮ Previously studied models
◮ Algorithmic complexity! (non-trivial results with simpler models vs. realism with complex models)

Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

Mono-allocation steady-state scheduling

Main idea

All instances of a task are processed by the same resource

◮ Less efficient schedules (stronger constraints)
◮ A single allocation
◮ Bounded Multiport model:
  ◮ limited incoming bandwidth bw_q^in
  ◮ limited outgoing bandwidth bw_q^out
  ◮ limited bandwidth bw_{q,r} per link
◮ Period τ, throughput ρ: ρ = 1/τ, where τ is the cycle-time of the critical resource

Complexity

Problem DAG-Single-Alloc

Given a directed acyclic application graph, a platform graph, and a bound B, is there an allocation with throughput ρ ≥ B?

Theorem

DAG-Single-Alloc is NP-complete


Variables and constraints due to the application

◮ α_q^k = 1 if task Tk is processed on processor Pq, and α_q^k = 0 otherwise

◮ Each task is processed exactly once: for each Tk, Σ_{Pq} α_q^k = 1

◮ β_{q,r}^{k,l} = 1 if file F_{k,l} is transferred using the path from Pq to Pr, and β_{q,r}^{k,l} = 0 otherwise

◮ A file transfer must originate from the processor where the file was produced: β_{q,r}^{k,l} ≤ α_q^k

Constraints on computations

◮ The processor computing a task must hold all necessary input data, i.e., it either received or computed every required input:
  α_r^k + Σ_{paths Pq ⇝ Pr} β_{q,r}^{k,l} ≥ α_r^l

◮ The computing time of a processor is not larger than τ:
  Σ_{Tk} α_q^k × w_{q,k} ≤ τ

Constraints on communications

◮ The amount of data carried by the link Pq → Pr is:
  d_{q,r} = Σ over paths Ps ⇝ Pt containing Pq → Pr, and over files F_{k,l}, of β_{s,t}^{k,l} × data_{k,l}

◮ The link bandwidth must not be exceeded: d_{q,r} / bw_{q,r} ≤ τ

◮ The output bandwidth of a processor Pq must not be exceeded:
  Σ_{Pq → Pr ∈ E_P} d_{q,r} / bw_q^out ≤ τ

◮ The input bandwidth of a processor Pr must not be exceeded:
  Σ_{Pq → Pr ∈ E_P} d_{q,r} / bw_r^in ≤ τ

Objective

Minimize the maximum time τ spent by any resource

Theorem

An optimal solution of the above linear program describes an allocation with maximal throughput.

Summary

◮ Solutions based on mixed linear programs (MLP)
◮ NP-complete problem
◮ Need for clever heuristics
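On a toy instance, the notions of allocation, period and throughput can be illustrated by brute force. The sketch below (not the thesis' method; all instance numbers are invented) enumerates every mono-allocation of a 3-task chain on 2 processors, takes τ as the occupation time of the critical resource, and reports ρ = 1/τ. A single shared link of bandwidth `bw` stands in for the network, a deliberate simplification of the bounded multiport model.

```python
from itertools import product

# Toy instance: chain T1 -> T2 -> T3, two processors (hypothetical numbers).
speeds = {"P1": 2.0, "P2": 1.0}                 # work units per time unit
work = {"T1": 4.0, "T2": 2.0, "T3": 4.0}        # work of each task
data = {("T1", "T2"): 1.0, ("T2", "T3"): 1.0}   # file sizes
bw = 1.0                                        # bandwidth of the shared link

def period(alloc):
    """Cycle-time tau of the critical resource for one allocation."""
    compute = {p: 0.0 for p in speeds}
    for t, p in alloc.items():
        compute[p] += work[t] / speeds[p]
    comm = 0.0
    for (src, dst), size in data.items():
        if alloc[src] != alloc[dst]:            # file crosses the link
            comm += size / bw
    return max(max(compute.values()), comm)

# Enumerate all mono-allocations and keep the one with the smallest period.
best = min(
    (dict(zip(work, procs)) for procs in product(speeds, repeat=len(work))),
    key=period,
)
tau = period(best)
print(best, "throughput =", 1.0 / tau)
# {'T1': 'P1', 'T2': 'P1', 'T3': 'P2'} throughput = 0.25
```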

Greedy mapping strategies

◮ Simple greedy:
  ◮ assign the "largest" task to the best processor
  ◮ continue with the second "largest" task, assigning it to the processor that decreases the throughput the least
  ◮ . . .

◮ Refined greedy:
  ◮ take communication times into account when sorting tasks
  ◮ when mapping a task, select the processor such that the maximum occupation time over all resources (processors and links) is minimized
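The simple greedy idea can be sketched in a few lines. This is an illustrative reconstruction, not the exact heuristic of the thesis: link occupation is ignored for brevity, and the names `work` and `speeds` are hypothetical.

```python
def greedy_map(work, speeds):
    """Map each task to the processor minimizing the max occupation time."""
    load = {p: 0.0 for p in speeds}   # occupation time of each processor
    alloc = {}
    # consider tasks by non-increasing work ("largest" first)
    for t in sorted(work, key=work.get, reverse=True):
        best_p = min(load, key=lambda p: load[p] + work[t] / speeds[p])
        load[best_p] += work[t] / speeds[best_p]
        alloc[t] = best_p
    return alloc

print(greedy_map({"T1": 4.0, "T2": 2.0, "T3": 4.0}, {"P1": 2.0, "P2": 1.0}))
# {'T1': 'P1', 'T3': 'P1', 'T2': 'P2'}
```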

Rounding the linear program

1. Solve the linear program over the rationals
2. Based on the rational solution, select an integer variable α_i^k:
   RLP-max:
   ◮ select the α_i^k with maximum value
   ◮ set α_i^k to 1
   RLP-rand:
   ◮ select a task Tk not yet mapped
   ◮ randomly choose a processor Pi with probability α_i^k
   ◮ set α_i^k to 1
3. Go back to step 1 until all variables are set
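The RLP-rand selection step reads the rational LP values as a probability distribution: for a fixed task, the α_i^k sum to 1 over processors. A minimal sketch, with illustrative names (`alpha`, `unmapped`):

```python
import random

def rlp_rand_pick(alpha, unmapped, rng):
    """Choose one unmapped task and round it to a random processor,
    with probability proportional to the rational LP value alpha[t][p]."""
    task = sorted(unmapped)[0]          # any task not yet mapped
    procs = list(alpha[task])
    weights = [alpha[task][p] for p in procs]
    proc = rng.choices(procs, weights=weights)[0]
    return task, proc

# The LP "almost" placed T1 on P1, so the rounding usually agrees:
alpha = {"T1": {"P1": 0.9, "P2": 0.1}}
print(rlp_rand_pick(alpha, {"T1"}, random.Random(0)))
```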

Delegating computations

◮ Start from a solution where all tasks are processed by a single processor
◮ Try to move a (connected) subset of tasks to another processor to increase the throughput
◮ Repeat this process until no more improvement is found

Several issues to overcome:

◮ Find interesting groups of tasks to move
  ◮ for all tasks, we test all possible immediate neighborhoods, and then try to enlarge the group along chains
◮ Hard to find a good evaluation metric: some moves do not directly improve the throughput, but are still interesting
  ◮ for a given mapping, we sort all resource occupation times in lexicographical order, and use the ordered list instead of the throughput in comparisons
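The lexicographic metric just described can be made concrete in two lines; the resource counts and occupation values below are invented for illustration.

```python
def occupation_key(times):
    """Comparison key for a mapping: its resource occupation times
    (processors and links) sorted in non-increasing order. Comparing
    these lists lexicographically refines comparison by throughput."""
    return sorted(times, reverse=True)

# Both mappings have the same critical resource (5.0), hence the same
# throughput, but the first one unloads the second-busiest resource:
better = occupation_key([5.0, 3.0, 1.0])
worse = occupation_key([5.0, 4.0, 0.0])
print(better < worse)  # the move is still considered an improvement
```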

Neighborhood-centric strategy

◮ First, evaluate the cost of each task and its immediate neighbors on an idle platform
◮ Cost of a task: maximum occupation time over all resources
◮ Consider each task Tk in order of non-increasing cost:
  ◮ evaluate the mapping of Tk and its neighbors on each processor
  ◮ definitively assign Tk to the best processor
◮ Same evaluation metric as Delegate

Performance evaluation – methodology

◮ Reference heuristics: HEFT, Data-Parallel, Clustering
◮ LP and MLP solved with CPLEX 11
◮ Simulations done using SimGrid
◮ Platforms: actual grids, from the SimGrid repository (only a subset of processors is available for computation)
◮ Applications: random task graphs + one real application
  ◮ "small problems": 8–12 tasks
  ◮ "large problems": up to 47 tasks (MLP not used)
◮ For each application, we compute a CCR = communications / computations ratio, and we try to cover a large CCR range

Performance evaluation – results on small problems

[Plots: normalized throughput vs. log(CCR), from small to large communications; four panels compare Delegate with the upper bound, with HEFT and Data-Parallel, with Simple Greedy, Refined Greedy and Clustering, and with RLP-max, RLP-rand and Neighborhood; throughput is normalized to the optimal solution]

Summary

◮ Mono-allocation strategies are close to multi-allocation ones
◮ They outperform HEFT in most cases
◮ The optimal MLP solution is restricted to small problems
◮ Efficient heuristics handle larger problems

Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

A short introduction to the Cell

◮ Joint work of Sony, Toshiba and IBM
◮ Non-standard architecture

A theoretical vision of the Cell

[Figure: the PPE P0, the SPEs P1, . . . , P8, the memory and the EIB]

◮ One Power core (PPE) P0: standard processor, direct access to memory and L1/L2 cache
◮ Eight Synergistic Processing Elements (SPEs) P1, . . . , P8: 256-kB Local Stores, dedicated asynchronous DMA engine
◮ Element Interconnect Bus (EIB): 200 GB/s → no contention
◮ Bidirectional communication link between each element and the EIB: bandwidth bw = 25 GB/s
◮ Limited DMA stack:
  ◮ at most 16 simultaneous incoming communications for each SPE
  ◮ at most 8 simultaneous communications between an SPE and the PPE

Application model

◮ Previously-used task graph model: G_A = (V_A, E_A)
◮ Many data sets
◮ Some enhancements:
  ◮ read_k: data to read before executing Tk
  ◮ write_k: data to write after the execution of Tk
  ◮ peek_k: number of upcoming data sets to receive before executing Tk
◮ Two computation times for Tk: wPPE(Tk) and wSPE(Tk)

Preprocessing of the schedule

◮ Objective: compute minimal buffer sizes
◮ min_period_l = max_{m ∈ prec(l)} (min_period_m) + peek_l + 2
◮ buffer_{i,l} = min_period_l − min_period_i

Example (edges Ti → Tj, Ti → Tk, Ti → Tl, Tj → Tl, Tk → Tl):

peek_i = 0, min_period_i = 1
peek_j = 1, min_period_j = 4, buffer_{i,j} = 3
peek_k = 3, min_period_k = 6, buffer_{i,k} = 5
peek_l = 2, min_period_l = 10, buffer_{i,l} = 9, buffer_{j,l} = 6, buffer_{k,l} = 4

[Animation: data sets flow through the pipeline over periods 1 to 10]
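The min_period and buffer formulas can be checked against the example's values. A sketch, assuming (consistently with those values) that a task without predecessors gets min_period = peek + 1:

```python
def min_periods(preds, peek):
    """min_period_l = max over predecessors m of min_period_m + peek_l + 2;
    the base value peek_l + 1 for source tasks is an assumption that
    matches the slide's example."""
    mp = {}
    def mp_of(t):
        if t not in mp:
            mp[t] = max((mp_of(m) for m in preds[t]), default=-1) + peek[t] + 2
        return mp[t]
    for t in preds:
        mp_of(t)
    return mp

# The example DAG: Ti -> Tj, Ti -> Tk, Ti -> Tl, Tj -> Tl, Tk -> Tl
preds = {"Ti": [], "Tj": ["Ti"], "Tk": ["Ti"], "Tl": ["Ti", "Tj", "Tk"]}
peek = {"Ti": 0, "Tj": 1, "Tk": 3, "Tl": 2}

mp = min_periods(preds, peek)
# buffer_{i,l} = min_period_l - min_period_i, one buffer per edge
buffer = {(i, l): mp[l] - mp[i] for l in preds for i in preds[l]}
print(mp)      # {'Ti': 1, 'Tj': 4, 'Tk': 6, 'Tl': 10}
print(buffer)
```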

Objective

◮ Maximize the throughput
◮ Obtain a periodic schedule
◮ Use a single allocation: code size is critical
◮ Simplification: all communications within a period are considered simultaneous

Constraints 1/2

On the application structure:

◮ Each task is mapped on a processor:
  for each Tk, Σ_i α_i^k = 1

◮ Given a dependence Tk → Tl, the processor computing Tl must receive the corresponding file:
  for each (k, l) ∈ E and each Pj, Σ_i β_{i,j}^{k,l} ≥ α_j^l

◮ Given a dependence Tk → Tl, only the processor computing Tk can send the corresponding file:
  for each (k, l) ∈ E and each Pi, Σ_j β_{i,j}^{k,l} ≤ α_i^k

Constraints 2/2

◮ On a given processor, all tasks must be completed within τ:
  for each Pi, Σ_k α_i^k × t_i(k) ≤ τ

◮ All incoming communications must be completed within τ:
  for each Pj, (1/bw) × ( Σ_k α_j^k × read_k + Σ_{k,l} Σ_i β_{i,j}^{k,l} × data_{k,l} ) ≤ τ

◮ All outgoing communications must be completed within τ:
  for each Pi, (1/bw) × ( Σ_k α_i^k × write_k + Σ_{k,l} Σ_j β_{i,j}^{k,l} × data_{k,l} ) ≤ τ

+ constraints on the number of incoming/outgoing communications to respect the DMA requirements
+ constraints on the available memory on each SPE
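The three τ-constraints above (computation, incoming traffic, outgoing traffic per element) amount to a feasibility test for a candidate mapping. A simplified sketch, ignoring the DMA-count and Local Store memory constraints; all names and numbers are invented toy data, with `routes` mapping each file to the (source, destination) processors of its transfer.

```python
def feasible(tau, alloc, routes, w, read, write, data, bw):
    """True if the mapping fits within period tau on every processor."""
    for p in set(alloc.values()):
        tasks = [t for t in alloc if alloc[t] == p]
        if sum(w[t][p] for t in tasks) > tau:           # computation
            return False
        incoming = sum(read[t] for t in tasks)
        incoming += sum(data[e] for e, (_, dst) in routes.items() if dst == p)
        outgoing = sum(write[t] for t in tasks)
        outgoing += sum(data[e] for e, (src, _) in routes.items() if src == p)
        if incoming / bw > tau or outgoing / bw > tau:  # communications
            return False
    return True

alloc = {"T1": "P1", "T2": "P2"}
routes = {("T1", "T2"): ("P1", "P2")}   # file F(1,2) goes P1 -> P2
w = {"T1": {"P1": 2.0}, "T2": {"P2": 3.0}}
read, write = {"T1": 1.0, "T2": 0.0}, {"T1": 0.0, "T2": 1.0}
data, bw = {("T1", "T2"): 2.0}, 1.0
print(feasible(3.0, alloc, routes, w, read, write, data, bw))  # True
print(feasible(2.5, alloc, routes, w, read, write, data, bw))  # False
```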

Optimal mapping

◮ Constraints form a linear program
◮ Binary variables: exponential solving time
◮ Can we do better?
◮ NP-complete problem (reduction from 2-Partition)
◮ Reasonable running times (small number of cores)

Experiments

Hardware:

◮ Sony PlayStation 3
◮ Single Cell processor
◮ Only 6 available SPEs
◮ 256-MB memory

Software:

◮ New dedicated scheduling framework
◮ Requires a mono-allocation schedule
◮ Vocoder application (141 tasks) and random graphs
◮ Linear programs solved by CPLEX (using a 0.05-approximation)
◮ Greedy memory-aware heuristic as reference

The Vocoder application

[Figure: the Vocoder task graph (141 tasks), mixing stateful filters (StepSource, Delay, DFTFilter with 28-element peeks, PhaseUnwrapper, FirstDifference, Accumulator, . . . ), stateless filters (IntToFloat, RectangularToPolar, PolarToRectangular, LinearInterpolator, Decimator, . . . ) and duplicate/weighted round-robin splitters and joiners, ending in a FileWriter]

tel-00637362, version 1 - 1 Nov 2011

SLIDE 94

Throughput variation according to the number of SPEs

(Figure: throughput (data sets / sec) for 10,000 data sets vs. number of SPEs (1 to 6); curves: Experimental LP, Experimental Greedy, Theoretical LP, Theoretical Greedy)

SLIDE 95

Time to reach steady-state

(Figure: throughput (data sets / sec) vs. number of data sets (2,500 to 40,000); curves: Theoretical, Experimental)

SLIDE 96

Summary

◮ Heterogeneity is difficult to handle
◮ Innovative processor, but with strong hardware constraints
◮ Optimal solution to the steady-state mono-allocation scheduling problem
◮ New framework dedicated to mono-allocation schedules
◮ Outperforms a greedy memory-aware heuristic

SLIDE 97

Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

SLIDE 98

Application and platform

◮ A linear workflow with many data sets

(Figure: linear workflow T1 → T2 → T3 → T4 with intermediate data F1, F2, F3)

◮ Fully connected platform
◮ Heterogeneous processors and communication links
◮ Mapping is given
◮ Objective: determine throughput

Communication models

◮ Strict One-Port
◮ Overlap One-Port

SLIDE 99

Mapping

◮ A processor processes at most 1 task
◮ A task is mapped on possibly many processors
◮ Replication count of Ti: mi

(Figure: example mapping of T1, . . . , T4 onto processors P1, . . . , P7, with m1 = 1, m2 = 2, m3 = 3, m4 = 1)

SLIDE 100

Mapping

◮ A processor processes at most 1 task
◮ A task is mapped on possibly many processors
◮ Replication count of Ti: mi
◮ Round-Robin distribution of each task

(Figure: example mapping of T1, . . . , T4 onto processors P1, . . . , P7, with m1 = 1, m2 = 2, m3 = 3, m4 = 1)

SLIDE 106

Mapping

◮ A processor processes at most 1 task
◮ A task is mapped on possibly many processors
◮ Replication count of Ti: mi
◮ Round-Robin distribution of each task

Input data   Path in the system
1            P1 → P2 → P4 → P7
2            P1 → P3 → P5 → P7
3            P1 → P2 → P6 → P7
4            P1 → P3 → P4 → P7
5            P1 → P2 → P5 → P7
6            P1 → P3 → P6 → P7
7            P1 → P2 → P4 → P7
8            P1 → P3 → P5 → P7

SLIDE 107

Mapping

◮ A processor processes at most 1 task
◮ A task is mapped on possibly many processors
◮ Replication count of Ti: mi
◮ Round-Robin distribution of each task

Theorem

Assume that stage Ti is mapped onto mi distinct processors. Then the number of paths is equal to m = lcm (m1, . . . , mn).
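A quick illustration of the theorem, as a hypothetical Python sketch (the function name and representation are assumptions, not the thesis framework): under round-robin distribution, data set k is handled by replica k mod mi of stage Ti, so the sequence of replica paths repeats with period lcm(m1, . . . , mn).

```python
from math import lcm  # variadic lcm requires Python >= 3.9

def round_robin_paths(counts):
    """Distinct replica paths under round-robin distribution.

    counts[i] is the replication count m_{i+1} of stage T_{i+1}.
    Data set k uses replica (k mod m_i) of each stage T_i, so the
    path pattern repeats every lcm(m_1, ..., m_n) data sets."""
    m = lcm(*counts)
    return [tuple(k % mi for mi in counts) for k in range(m)]

# The mapping of the example slides: m1 = 1, m2 = 2, m3 = 3, m4 = 1
paths = round_robin_paths([1, 2, 3, 1])
print(len(paths))                     # 6 = lcm(1, 2, 3, 1)
assert len(set(paths)) == len(paths)  # all 6 paths are distinct
```

With these counts the six paths match the table of the previous slide: data sets 7 and 8 repeat the paths of data sets 1 and 2.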

SLIDE 108

Mapping

◮ A processor processes at most 1 task
◮ A task is mapped on possibly many processors
◮ Replication count of Ti: mi
◮ Critical cycle time = 215.8, period = 230.7

(Figure: the example mapping with m1 = 1, m2 = 2, m3 = 3, m4 = 1, annotated with computation and communication costs)

SLIDE 109

Short presentation of Timed Petri Nets (TPN)

◮ Some transitions

SLIDE 110

Short presentation of Timed Petri Nets (TPN)

◮ Some transitions
◮ Some places

SLIDE 111

Short presentation of Timed Petri Nets (TPN)

◮ Some transitions
◮ Some places
◮ Connections between transitions and places. . .

SLIDE 112

Short presentation of Timed Petri Nets (TPN)

◮ Some transitions
◮ Some places
◮ Connections between transitions and places. . . and between places and transitions

SLIDE 113

Short presentation of Timed Petri Nets (TPN)

◮ Some transitions
◮ Some places
◮ Connections between transitions and places. . . and between places and transitions
◮ Some tokens allowing transitions to be fired

SLIDE 118

Short presentation of Timed Petri Nets (TPN)

◮ Some transitions
◮ Some places
◮ Connections between transitions and places. . . and between places and transitions
◮ Some tokens allowing transitions to be fired
◮ Delay between the consumption of input tokens and the creation of output tokens

(Figure: example TPN with transitions τ1, τ2, τ3, τ4)

SLIDE 119

Timed Petri Net model

◮ Transitions: communications and computations
◮ Places: dependences between two successive operations
◮ Each path followed by the input data must be fully developed in the TPN
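As a sketch of how firing times evolve in such a TPN (a hypothetical Python helper, not the thesis framework): for a timed event graph — every place has one input and one output transition — the k-th firing of a transition can start once each input place holds a token, giving the classical recurrence x_t(k) = max over input places (src, tokens) of x_src(k − tokens) + delay(src).

```python
def firing_starts(delays, places, nfire):
    """Start times of the first `nfire` firings of each transition
    in a timed event graph.

    delays[i] -- firing duration of transition i
    places    -- list of (src, dst, tokens) arcs; every cycle must
                 carry at least one token (live marked graph)."""
    n = len(delays)
    preds = [[] for _ in range(n)]
    for src, dst, tok in places:
        preds[dst].append((src, tok))
    start = [[0.0] * nfire for _ in range(n)]
    for k in range(nfire):
        # Iterate to a fixed point: within one index k, zero-token
        # places create dependences that a single pass may miss.
        changed = True
        while changed:
            changed = False
            for t in range(n):
                best = 0.0
                for src, tok in preds[t]:
                    j = k - tok
                    if j >= 0:
                        best = max(best, start[src][j] + delays[src])
                if best > start[t][k]:
                    start[t][k] = best
                    changed = True
    return start

# Two transitions in a cycle carrying one token: the steady-state
# period equals L(C)/t(C) = (2 + 3)/1 = 5.
s = firing_starts([2.0, 3.0], [(0, 1, 0), (1, 0, 1)], 4)
print(s[0])  # [0.0, 5.0, 10.0, 15.0] -> period 5
```

The successive start times grow by exactly the critical-cycle ratio, which is the point of the throughput analysis that follows.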

SLIDE 120

Overlap One-Port model

Computations

(Figure: TPN of the example mapping, computation transitions only)

SLIDE 121

Overlap One-Port model

Computations and communications

(Figure: TPN with computation and communication transitions)

SLIDE 122

Overlap One-Port model

A communication cannot begin before the end of the computation

(Figure: TPN with the corresponding communication-after-computation dependences)

SLIDE 123

Overlap One-Port model

A computation cannot begin before the end of the communication

(Figure: TPN with the corresponding computation-after-communication dependences)

SLIDE 124

Overlap One-Port model

Dependences due to the round-robin distribution of computations

(Figure: TPN with the round-robin dependences between computations)

SLIDE 125

Overlap One-Port model

Dependences due to the round-robin distribution of outgoing communications

(Figure: TPN with the round-robin dependences between outgoing communications)

SLIDE 126

Overlap One-Port model

Dependences due to the round-robin distribution of incoming communications

(Figure: TPN with the round-robin dependences between incoming communications)

SLIDE 127

Overlap One-Port model

All dependences!

(Figure: complete TPN for the Overlap One-Port model)

SLIDE 128

Strict One-Port model

Dependences between communications and computations

(Figure: TPN with the dependences between communications and computations)

SLIDE 129

Strict One-Port model

Dependences due to the Strict One-Port model

(Figure: TPN with the dependences specific to the Strict One-Port model)

SLIDE 130

Strict One-Port model

All dependences!

(Figure: complete TPN for the Strict One-Port model)

SLIDE 131

Computing the throughput

◮ Equivalent to finding critical cycles
◮ C is a cycle of the TPN
◮ L(C) is its length (total time of transitions)
◮ t(C) is the total number of tokens in places traversed by C
◮ A critical cycle achieves the largest ratio max_{C cycle} L(C)/t(C)
◮ This ratio gives the period τ of the system
◮ Can be computed in time O(n³m³) (m = lcm(m1, . . . , mn))
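One standard way to compute this critical ratio in practice — a hedged Python sketch, not the algorithm used in the thesis — is parametric search: λ ≥ max_C L(C)/t(C) iff the graph with edge weights L − λ·t has no strictly positive cycle, which Bellman-Ford can detect, so a binary search on λ converges to the period τ.

```python
def _has_positive_cycle(n, edges, lam):
    # Bellman-Ford started from all nodes at once (dist = 0): a
    # strictly positive cycle exists iff some edge can still be
    # relaxed after n - 1 passes over weights L - lam * t.
    dist = [0.0] * n
    for _ in range(n - 1):
        for u, v, length, tokens in edges:
            w = length - lam * tokens
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
    return any(dist[u] + (length - lam * tokens) > dist[v] + 1e-9
               for u, v, length, tokens in edges)

def max_cycle_ratio(n, edges, hi=1e6, eps=1e-6):
    """max over cycles C of L(C)/t(C), i.e. the period tau.

    edges: (u, v, length, tokens); every cycle must carry >= 1 token
    (a live TPN), otherwise the ratio is unbounded."""
    lo = 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if _has_positive_cycle(n, edges, mid):
            lo = mid  # some cycle still has ratio > mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy cycle: lengths 3 and 5, one token -> period (3 + 5)/1 = 8.
tau = max_cycle_ratio(2, [(0, 1, 3.0, 0), (1, 0, 5.0, 1)])
print(round(tau, 3))  # ~8.0
```

Dedicated maximum-cycle-ratio algorithms achieve the polynomial bounds quoted on this slide; the binary search above is just the simplest correct method.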

SLIDE 132

Computing the throughput

◮ TPN has an exponential size
◮ Overlap One-Port model:

Theorem

Consider a pipeline of n tasks T1, . . . , Tn, such that stage Ti is mapped onto mi distinct processors. Then the average throughput of this system can be computed in time O(Σ_{i=1}^{n−1} (mi mi+1)³).

◮ Strict One-Port model: problem remains open

SLIDE 133

Key ideas of proof

◮ Split the TPN into 2n − 1 columns
◮ Computation columns: simple problem
◮ Communication columns: reduction to smaller TPNs with critical cycles of the same weight

SLIDE 134

The case of a communication column

◮ Several connected components
◮ Example of connected component: 9 columns, 7 rows, 55 patterns (m1 = 5, m2 = 21, m3 = 27, m4 = 11)

SLIDE 135

Summary

◮ Even when the mapping is given, the throughput is hard to determine
◮ Examples without a critical resource exist for both communication models
◮ Such examples remain rare

SLIDE 136

Outline

Introduction
Steady-state scheduling
Mono-allocation steady-state scheduling
Task graph scheduling on the Cell processor
Computing the throughput of replicated workflows
Conclusion

SLIDE 137

Conclusion

Presentation of three steady-state scheduling problems

◮ Mono-allocation task graph scheduling on large heterogeneous platforms
◮ Task graph scheduling on the Cell processor
◮ Computing the throughput of replicated workflows

Algorithms and methods:

◮ NP-completeness proofs
◮ Optimal algorithms, mainly using linear programs
◮ Heuristics
◮ Use of Timed Petri Nets to model a complete system

SLIDE 138

Conclusion

Presentation of three steady-state scheduling problems

◮ Mono-allocation task graph scheduling on large heterogeneous platforms
◮ Task graph scheduling on the Cell processor
◮ Computing the throughput of replicated workflows

Experimental study:

◮ Simple numerical simulations
◮ Simulation using the SimGrid framework
◮ Cell implementation

SLIDE 139

Ongoing Work and Perspectives

Mono-allocation steady-state scheduling

◮ Introduce duplication to increase the throughput

Scheduling task graphs on Cell processors

◮ Write new models to capture multi-Cell platforms
◮ Clever heuristics to handle very large task graphs
◮ Need to correctly model new heterogeneous architectures

Computing throughput of replicated workflows

◮ Determine the complexity of the Strict One-Port case
◮ Work with stochastic computation and communication times
◮ Efficient heuristics to find good mappings

SLIDE 140

Publications

Journal and book chapter

Comments on "Design and performance evaluation of load distribution strategies for multiple loads on heterogeneous linear daisy chain networks"
Matthieu Gallet, Yves Robert, Frédéric Vivien, Journal of Parallel and Distributed Computing, 68(2), 2008

Divisible Load Scheduling
Matthieu Gallet, Yves Robert, Frédéric Vivien, Introduction to Scheduling, 2009

International conferences

Scheduling communication requests traversing a switch: complexity and algorithms
Matthieu Gallet, Yves Robert, Frédéric Vivien, Proceedings of the 15th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP'2007), 2007

Scheduling multiple divisible loads on a linear processor network
Matthieu Gallet, Yves Robert, Frédéric Vivien, Proceedings of the 13th IEEE International Conference on Parallel and Distributed Systems (ICPADS'07), 2007

Efficient Scheduling of Task Graph Collections on Heterogeneous Resources
Matthieu Gallet, Loris Marchal, Frédéric Vivien, Proceedings of the 23rd International Parallel and Distributed Processing Symposium (IPDPS'09), 2009

Allocating Series of Workflows on Computing Grids
Matthieu Gallet, Loris Marchal, Frédéric Vivien, Proceedings of the 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS'08), 2008

Computing the throughput of replicated workflows on heterogeneous platforms
Anne Benoit, Matthieu Gallet, Bruno Gaujal, Yves Robert, Proceedings of the 38th International Conference on Parallel Processing (ICPP'09), 2009

SLIDE 141

Performance evaluation – running times

Average running times in seconds to schedule 1,000 data sets:

Algorithm        small task graphs   large task graphs
HEFT∗            14.30               83.36
MLP              49.45               n/a
Delegate         16.74               40.49
Simple-Greedy    0.11                0.61
Refined-Greedy   0.12                0.81
RLP-max          166.38              1301.80
RLP-rand         16.78               812.30

∗: HEFT running time grows with the number of data sets

SLIDE 142

Difference between experimental and theoretical values

Algorithm        Average error   Standard deviation
MLP              3%              3%
Simple-Greedy    8%              11%
Refined-Greedy   5%              6%
RLP-max          8%              12%
RLP-rand         16%             28%
Delegate         2%              2%
Neighborhood     6%              12%
