Optimizing the steady-state throughput of scatter and reduce operations on heterogeneous platforms - PowerPoint PPT Presentation

slide-1
SLIDE 1

Optimizing the steady-state throughput of scatter and reduce operations on heterogeneous platforms

Arnaud Legrand, Loris Marchal, Yves Robert

Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, France

APDCM workshop - April 2004


slide-2
SLIDE 2

Outline

Introduction
    Two Problems of Collective Communication
    Platform Model
    Framework

Series of Scatter
    Steady-state constraints
    Toy Example
    Building a schedule
    Asymptotic optimality

Series of Reduce
    Introduction to reduction trees
    Linear Program
    Periodic schedule - Asymptotic optimality
    Toy Example for Series of Reduce

Approximation for a fixed period

Conclusion

slide-8
SLIDE 8

Introduction

◮ Complex applications on grid environments require collective communication schemes:
    one to all:  Broadcast, Multicast, Scatter
    all to one:  Reduce
    all to all:  Gossip, All-to-All
◮ Numerous studies of a single communication scheme, mainly about one single broadcast
◮ Pipelining communications:
    ◮ data parallelism involves a large amount of data
    ◮ not a single communication, but a series of the same communication scheme (e.g. a series of broadcasts from the same source)
◮ Goal: maximize the throughput of the steady-state operation

slide-17
SLIDE 17

Two Problems of Collective Communication

◮ Scatter: one processor Psource sends distinct messages to the target processors Pt0, . . . , PtN
◮ Series of Scatter: Psource consecutively sends a large number of distinct messages to all targets

◮ Reduce: each participating processor Pri in Pr0, . . . , PrN owns a value vi
    ⇒ compute V = v0 ⊕ v1 ⊕ · · · ⊕ vN (⊕ is associative, non-commutative)
◮ Series of Reduce: several consecutive Reduce operations from the same set Pr0, . . . , PrN to the same target Ptarget

slide-21
SLIDE 21

Platform Model

◮ G = (V, E, c)
◮ P1, P2, . . . , Pn: processors
◮ (j, k) ∈ E: communication link between Pj and Pk
◮ c(j, k): time to transfer a unit-size message from Pj to Pk
◮ one-port for incoming communications
◮ one-port for outgoing communications

[Figure: example platform graph with processors P0, P1, P2, P3; edges labeled with the costs 10, 10, 30, 5, 5, 8]
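The platform model above is just a weighted directed graph plus the one-port rule. As a concrete reading of it, here is a minimal Python sketch (not part of the talk; the processor names reuse those of the later toy example and the edge costs are made up) that stores G = (V, E, c) and checks the one-port condition for a set of edge occupation times:

```python
from collections import defaultdict

# Hypothetical platform: c[(Pj, Pk)] = time to ship one unit-size message
# from Pj to Pk (these particular costs are illustrative only).
c = {
    ("Ps", "Pa"): 1.0,
    ("Ps", "Pb"): 2.0 / 3.0,
    ("Pa", "P0"): 4.0 / 3.0,
    ("Pb", "P0"): 4.0 / 3.0,
    ("Pb", "P1"): 1.0,
}

def transfer_time(edge, n_messages):
    """Time needed to push n unit-size messages over a single edge."""
    return n_messages * c[edge]

def one_port_ok(occupation):
    """One-port model: within one time unit, the total time a processor spends
    sending (resp. receiving) must not exceed 1.  `occupation` maps edges
    (Pj, Pk) to the fraction of time they are busy."""
    out_time, in_time = defaultdict(float), defaultdict(float)
    for (src, dst), t in occupation.items():
        out_time[src] += t
        in_time[dst] += t
    return all(t <= 1.0 for t in out_time.values()) and \
           all(t <= 1.0 for t in in_time.values())

print(transfer_time(("Ps", "Pa"), 3))                        # 3.0 time units
print(one_port_ok({("Ps", "Pa"): 0.3, ("Ps", "Pb"): 0.5}))   # True: Ps sends 0.8 <= 1
```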

slide-29
SLIDE 29

Framework

1. express the optimization problem as a set of linear constraints (variables = fraction of time a processor spends sending to one of its neighbors)
2. solve the linear program (in rational numbers)
3. use the solution to build a periodic schedule reaching the best throughput

slide-33
SLIDE 33

Series of Scatter

◮ mk: type of the messages with destination Pk
◮ s(Pi → Pj, mk): fractional number of messages of type mk sent on the edge Pi → Pj within one time unit
◮ t(Pi → Pj): fractional time spent by processor Pi to send data to its neighbor Pj within one time unit
◮ bound for this activity:

    ∀Pi, Pj,   0 ≤ t(Pi → Pj) ≤ 1

◮ on a link Pi → Pj during one time unit:

    t(Pi → Pj) = Σ_k s(Pi → Pj, mk) × c(i, j)

slide-38
SLIDE 38

Linear constraints

◮ one-port constraint for outgoing messages of Pi:

    ∀Pi,   Σ_{Pi→Pj} t(Pi → Pj) ≤ 1

◮ one-port constraint for incoming messages of Pi:

    ∀Pi,   Σ_{Pj→Pi} t(Pj → Pi) ≤ 1

◮ conservation law in node Pi for message mk (k ≠ i):

    Σ_{Pj→Pi} s(Pj → Pi, mk) = Σ_{Pi→Pj} s(Pi → Pj, mk)

[Figure: node Pi receiving 5 mk and 2 mk messages and forwarding 3 mk and 4 mk messages]

slide-44
SLIDE 44

Throughput and Linear Program

◮ throughput: total number of messages mk received by Pk per time unit

    TP = Σ_{Pj→Pk} s(Pj → Pk, mk)   (same throughput for every target node Pk)

◮ summarize these constraints in a linear program:

    Steady-State Scatter Problem on a Graph SSSP(G)
    Maximize TP, subject to
      ∀Pi, ∀Pj,   0 ≤ s(Pi → Pj) ≤ 1
      ∀Pi,   Σ_{Pj,(i,j)∈E} s(Pi → Pj) ≤ 1
      ∀Pi,   Σ_{Pj,(j,i)∈E} s(Pj → Pi) ≤ 1
      ∀Pi, Pj,   s(Pi → Pj) = Σ_{mk} send(Pi → Pj, mk) × c(i, j)
      ∀Pi ≠ Psource, ∀mk, k ≠ i,   Σ_{Pj,(j,i)∈E} send(Pj → Pi, mk) = Σ_{Pj,(i,j)∈E} send(Pi → Pj, mk)
      ∀Pk, k ∈ T,   Σ_{Pi,(i,k)∈E} send(Pi → Pk, mk) = TP
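For readers who want to experiment, the following self-contained Python sketch reconstructs SSSP(G) and solves it with scipy.optimize.linprog. It is my own reading of the slides, not the authors' code: the platform (edges and costs) is hypothetical, and the variables are the per-edge, per-message-type send rates s(Pi → Pj, mk) plus the throughput TP.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical platform: directed edges with cost c(i, j) = time per unit message.
edges = {("Ps", "Pa"): 1.0, ("Ps", "Pb"): 1.0,
         ("Pa", "P0"): 2.0, ("Pb", "P0"): 2.0, ("Pb", "P1"): 1.0}
nodes = sorted({n for e in edges for n in e})
source, targets = "Ps", ["P0", "P1"]

# Variables: s(e, mk) for every edge e and target k, plus the throughput TP (last).
var_idx = {(e, k): i for i, (e, k) in
           enumerate((e, k) for e in edges for k in targets)}
n_vars = len(var_idx) + 1
TP = n_vars - 1

A_ub, b_ub, A_eq, b_eq = [], [], [], []

# One-port constraints: total sending (resp. receiving) time of each node <= 1.
for n in nodes:
    out_row, in_row = np.zeros(n_vars), np.zeros(n_vars)
    for e, cost in edges.items():
        for k in targets:
            if e[0] == n: out_row[var_idx[(e, k)]] = cost
            if e[1] == n: in_row[var_idx[(e, k)]] = cost
    A_ub += [out_row, in_row]; b_ub += [1.0, 1.0]

# Conservation law: at every node other than the source and the final target of
# mk, messages of type mk that come in must go out.
for n in nodes:
    for k in targets:
        if n == source or n == k:
            continue
        row = np.zeros(n_vars)
        for e in edges:
            if e[1] == n: row[var_idx[(e, k)]] += 1.0
            if e[0] == n: row[var_idx[(e, k)]] -= 1.0
        A_eq.append(row); b_eq.append(0.0)

# Throughput: every target Pk receives TP messages of type mk per time unit.
for k in targets:
    row = np.zeros(n_vars)
    for e in edges:
        if e[1] == k: row[var_idx[(e, k)]] += 1.0
    row[TP] = -1.0
    A_eq.append(row); b_eq.append(0.0)

# Maximize TP  <=>  minimize -TP, all variables non-negative.
cost_vec = np.zeros(n_vars); cost_vec[TP] = -1.0
res = linprog(cost_vec, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, None)] * n_vars)
print("steady-state throughput TP =", res.x[TP])   # ~0.5 for this made-up platform
```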

slide-46
SLIDE 46

Series of Scatter - Toy Example

[Figure: platform graph with source Ps, intermediate nodes Pa and Pb, and targets P0 and P1; edges labeled with the costs c(i, j): 2/3, 4/3, 4/3, 1, 1]

slide-47
SLIDE 47

Series of Scatter - Toy Example

[Figure: value of s(Pi → Pj, mk) in the solution of the linear program: 1/4 message m0 per time unit on four edges and 1/2 message m1 per time unit on two edges]

slide-48
SLIDE 48

Series of Scatter - Toy Example

[Figure: occupation time t(Pi → Pj) of each edge: 1/3, 2/3, 3/4, 1/4, 1/6]

slide-49
SLIDE 49

Building a schedule

◮ consider the time needed for all transfers
◮ build a bipartite graph by splitting every node into a "send" copy and a "receive" copy
◮ extract matchings, using the weighted edge-coloring algorithm

[Figure: bipartite graph on Ps, Pa, Pb, P0, P1; each edge labeled with its occupation time and the messages it carries, e.g. 1/4 (1/4 m0), 1/3 (1/4 m0), 2/3 (1/2 m1), 1/6 (1/4 m0), 1/4 (1/4 m0), 1/2 (1/2 m1); the decomposition yields matchings of durations 1/2, 1/4, 1/6 and 1/12]
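The matching-extraction step can be illustrated with a small sketch. The talk relies on an optimal weighted edge-coloring algorithm; the greedy peeling below is only meant to show the shape of the output (a list of matchings with durations), and the assignment of the toy example's occupation times to particular edges is an assumption on my part.

```python
def peel_matchings(weights):
    """weights: dict {(sender, receiver): occupation time}.
    Returns a list of (duration, matching) pairs; inside one matching every
    sender and every receiver appears at most once, so all its transfers can
    run in parallel under the one-port model."""
    remaining = dict(weights)
    decomposition = []
    while any(w > 1e-12 for w in remaining.values()):
        matching, used_send, used_recv = [], set(), set()
        # Greedily pick heavy edges first, never reusing a sending or receiving port.
        for (s, r), w in sorted(remaining.items(), key=lambda kv: -kv[1]):
            if w > 1e-12 and s not in used_send and r not in used_recv:
                matching.append((s, r)); used_send.add(s); used_recv.add(r)
        duration = min(remaining[e] for e in matching)
        for e in matching:
            remaining[e] -= duration
        decomposition.append((duration, matching))
    return decomposition

# Occupation times t(Pi -> Pj) taken from the toy example (edge assignment assumed).
t = {("Ps", "Pa"): 1/4, ("Ps", "Pb"): 3/4,
     ("Pa", "P0"): 1/3, ("Pb", "P0"): 1/6, ("Pb", "P1"): 2/3}
for duration, matching in peel_matchings(t):
    print(f"{duration:.4f}: {matching}")
```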

slide-61
SLIDE 61

Building a schedule

[Figure: periodic schedule, one row per edge (Ps → Pa, Ps → Pb, Pa → P0, Pb → P0, Pb → P1), showing when each of the four matchings is scheduled within one period]

◮ least common multiple T = lcm{bi}, where ai/bi denotes the number of messages transferred in each matching
◮ ⇒ periodic schedule of period T with atomic (whole-message) transfers
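A tiny numeric sketch of this last step, with illustrative fractions (not necessarily the toy example's exact values): the period T is the lcm of the denominators, so every matching transfers a whole number of messages per period.

```python
# Requires Python 3.9+ for math.lcm with several arguments.
from fractions import Fraction
from math import lcm

per_unit = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 6), Fraction(1, 12)]
T = lcm(*(f.denominator for f in per_unit))
print("period T =", T)               # lcm(2, 4, 6, 12) = 12
print([f * T for f in per_unit])     # integral message counts per period
```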

slide-62
SLIDE 62

Asymptotic optimality

◮ No schedule can perform more tasks in K time units than the steady-state bound:

    Lemma.   opt(G, K) ≤ TP(G) × K

◮ periodic schedule ⇒ complete schedule:
    1. initialization phase (fill the buffers with messages)
    2. r periods of duration T (steady state)
    3. clean-up phase (empty the buffers)

    Lemma.   the previous algorithm is asymptotically optimal:

        lim_{K→+∞} steady(G, K) / opt(G, K) = 1

slide-80
SLIDE 80

Reduce - Reduction trees

◮ Reduce:
    ◮ each processor Pri owns a value vi
    ◮ compute V = v0 ⊕ v1 ⊕ · · · ⊕ vN (⊕ associative, non-commutative)
◮ partial result of the Reduce operation:

    v[k,m] = vk ⊕ vk+1 ⊕ · · · ⊕ vm

◮ two partial results can be merged:

    v[k,m] = v[k,l] ⊕ v[l+1,m]   (computational task Tk,l,m)

[Figure: example reduction tree on three processors holding v0, v1, v2: P2 sends v2 to P1, which computes T1,1,2; P0 sends v0 to P1, which computes T0,0,2; P1 then sends the result v[0,2] to P0]
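As a toy illustration of partial results and merge tasks (my own example, using string concatenation as a stand-in for the associative, non-commutative ⊕), the reduction tree sketched in the figure can be replayed as follows:

```python
def merge(v_kl, v_lm):        # task T_{k,l,m}: v[k,m] = v[k,l] ⊕ v[l+1,m]
    return v_kl + v_lm        # concatenation: associative but not commutative

values = ["v0", "v1", "v2"]   # v_i initially held by P0, P1, P2

# The reduction tree of the figure, as far as it can be read:
#   P2 ships v2 to P1, P1 computes T_{1,1,2}, P0 ships v0 to P1,
#   P1 computes T_{0,0,2}, and P1 ships v[0,2] to P0.
v_1_2 = merge(values[1], values[2])   # computed on P1
v_0_2 = merge(values[0], v_1_2)       # computed on P1, then sent to P0
print(v_0_2)                          # "v0v1v2" -- the order of operands is preserved
```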

slide-81
SLIDE 81

Series of Reduce

◮ each processor Pri owns a set of values v_i^t (e.g. produced at different time-steps t)
◮ perform a Reduce operation on each set {v_0^t, . . . , v_N^t} to compute V^t
◮ each reduction uses a reduction tree
◮ two reductions (t1 and t2) may use different trees

slide-82
SLIDE 82

Linear Program - Notations

◮ s(Pi → Pj, v[k,l]): fractional number of values v[k,l] sent on link Pi → Pj within one time unit
◮ t(Pi → Pj): fractional occupation time of link Pi → Pj within one time unit: 0 ≤ t(Pi → Pj) ≤ 1
◮ cons(Pi, Tk,l,m): fractional number of tasks Tk,l,m computed on processor Pi within one time unit
◮ α(Pi): time spent by processor Pi computing tasks within one time unit: 0 ≤ α(Pi) ≤ 1
◮ size(v[k,m]): size of a message containing a value v[k,m]^t
◮ w(Pi, Tk,l,m): time needed by processor Pi to compute one task Tk,l,m

slide-88
SLIDE 88

Linear Program - Constraints

◮ occupation of a link Pi → Pj:

    t(Pi → Pj) = Σ_{v[k,l]} s(Pi → Pj, v[k,l]) × size(v[k,l]) × c(i, j)

◮ occupation time of a processor Pi:

    α(Pi) = Σ_{Tk,l,m} cons(Pi, Tk,l,m) × w(Pi, Tk,l,m)

◮ "conservation law" for packets of type v[k,m]:

    Σ_{Pj→Pi} s(Pj → Pi, v[k,m]) + Σ_{k≤l<m} cons(Pi, Tk,l,m)
        = Σ_{Pi→Pj} s(Pi → Pj, v[k,m]) + Σ_{n>m} cons(Pi, Tk,m,n) + Σ_{n<k} cons(Pi, Tn,k−1,m)
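To make the conservation law concrete, the sketch below (my reading of the slide, not the authors' code) enumerates, for one value type v[k,m], which merge tasks produce it and which consume it on a processor Pi:

```python
def conservation_terms(k, m, N):
    """For the value v[k,m] (values indexed 0..N), list the merge tasks that
    produce it locally and the merge tasks that consume it locally, as in the
    conservation law: v[k,m] is produced by T_{k,l,m} with k <= l < m, and
    consumed either as the left operand of T_{k,m,n} (n > m) or as the right
    operand of T_{n,k-1,m} (n < k)."""
    produced_by = [(k, l, m) for l in range(k, m)]
    consumed_as_left = [(k, m, n) for n in range(m + 1, N + 1)]
    consumed_as_right = [(n, k - 1, m) for n in range(0, k)]
    return produced_by, consumed_as_left, consumed_as_right

# Example with N = 3 (values v0..v3) and the partial result v[1,2]:
prod, left, right = conservation_terms(1, 2, 3)
print("produced by:", prod)            # [(1, 1, 2)]
print("consumed by:", left + right)    # [(1, 2, 3), (0, 0, 2)]
```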

slide-91
SLIDE 91

Linear Program - Constraints

◮ definition of the throughput:

    TP = Σ_{Pj→Ptarget} s(Pj → Ptarget, v[0,N]) + Σ_{0≤l≤N−1} cons(Ptarget, T0,l,N)

◮ solve the following linear program over the rational numbers:

    Steady-State Reduce Problem on a Graph SSRP(G)
    Maximize TP, subject to all the previous constraints

slide-93
SLIDE 93

Building a schedule

◮ consider the reduction tree T^t associated with the computation of the t-th value V^t:
    ◮ a given tree may be used by many time-stamps t
◮ there exists an algorithm which extracts from the solution a set of weighted trees such that:
    ◮ the description is polynomial, and
    ◮ the sum of the weighted trees is equal to the original solution
◮ the same weighted edge-coloring algorithm is then used on a bipartite graph to orchestrate the communications

slide-99
SLIDE 99

Toy Example for Series of Reduce

[Figure: topology of the toy platform (processors P0, P1, P2), labeled with the costs 1, 1, 1, 2, 1, 1]

slide-100
SLIDE 100

Toy Example for Series of Reduce

[Figure: results of the linear program: fractional transfers of v[1,1], v[2,2], v[1,2] and fractional computations of T1,1,2 and T0,0,2, with rates 1/3, 2/3 and 1]

slide-101
SLIDE 101

Toy Example for Series of Reduce

[Figure: first reduction tree (weight 1/3), built from the transfers of v[2,2] and v[1,2] and the tasks T1,1,2 and T0,0,2]

slide-102
SLIDE 102

Toy Example for Series of Reduce

[Figure: second reduction tree (weight 2/3), built from the transfers of v[1,1] and v[1,2] and the tasks T1,1,2 and T0,0,2]

slide-103
SLIDE 103

Toy Example for Series of Reduce

[Figure: bipartite graph on P0, P1, P2 with edge weights 2, 2, 1, 1]

slide-104
SLIDE 104

Toy Example for Series of Reduce

[Figure: the two matchings (first matching, second matching) extracted from the bipartite graph on P0, P1, P2]

slide-107
SLIDE 107

Approximation for a fixed period

◮ our framework produces an asymptotically optimal schedule of period T, but T may be too large
◮ we can approximate the solution with a fixed period Tfixed:
    1. {T, weight(T)}: the weighted set of trees obtained by the decomposition algorithm
    2. compute r(T) = ⌊weight(T) / T × Tfixed⌋
    3. the one-port constraints are satisfied for {T, weight(T)} on a period T ⇒ they are satisfied for {T, r(T)} on a period Tfixed
    4. the performance loss is bounded:

        TP − TP* ≤ card(Trees) / Tfixed
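A small numeric sketch of the rescaling (with made-up weights and periods; I assume the rounding in step 2 is a floor, which is what the loss bound card(Trees)/Tfixed suggests):

```python
from math import floor

T = 36.0                      # period produced by the framework (may be huge)
T_fixed = 10.0                # period we are willing to pay for
weights = {"tree_1": 12.0, "tree_2": 24.0}    # uses of each tree per period T

# Each tree is replayed r(tree) = floor(weight(tree) / T * T_fixed) times per
# fixed period, so at most one instance per tree is lost per fixed period.
r = {name: floor(w / T * T_fixed) for name, w in weights.items()}
throughput_opt   = sum(weights.values()) / T
throughput_fixed = sum(r.values()) / T_fixed
print(r)                                      # {'tree_1': 3, 'tree_2': 6}
print(throughput_opt - throughput_fixed)      # ~0.1 <= len(weights) / T_fixed = 0.2
```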

slide-114
SLIDE 114

Conclusion

◮ new framework to study collective communications in a heterogeneous environment
◮ makespan is difficult to minimize ⇒ focus on throughput
◮ relaxation, use of linear programming
◮ asymptotically optimal algorithm
◮ can be extended to other communication schemes and scheduling problems