 
Two Problems of Collective Communication

Scatter: one processor $P_{\text{source}}$ sends distinct messages to the target processors $(P_{t_0}, \dots, P_{t_N})$
◮ Series of Scatter: $P_{\text{source}}$ consecutively sends a large number of distinct messages to all targets

Reduce: each participating processor $P_{r_i}$ in $P_{r_0}, \dots, P_{r_N}$ owns a value $v_i$ ⇒ compute $V = v_0 \oplus v_1 \oplus \dots \oplus v_N$ ($\oplus$ is associative, non-commutative)
◮ Series of Reduce: several consecutive reduce operations from the same set $P_{r_0}, \dots, P_{r_N}$ to the same target $P_{\text{target}}$

Loris Marchal   Steady state collective communications   5/27
Platform Model
◮ $G = (V, E, c)$
◮ $P_1, P_2, \dots, P_n$: processors
◮ $(j, k) \in E$: communication link between $P_j$ and $P_k$
◮ $c(j, k)$: time to transfer one unit message from $P_j$ to $P_k$
◮ one-port for incoming communications
◮ one-port for outgoing communications
[Figure: example platform with processors $P_0, \dots, P_3$; links labelled with their costs $c(j, k)$ (10, 10, 30, 5, 5, 8)]
Framework
1. express the optimization problem as a set of linear constraints (variables = fraction of time a processor spends sending to one of its neighbors)
2. solve the linear program (in rational numbers)
3. use the solution to build a periodic schedule reaching the best throughput
Outline
Introduction
    Two Problems of Collective Communication
    Platform Model
    Framework
Series of Scatter
    Steady-state constraints
    Toy Example
    Building a schedule
    Asymptotic optimality
Series of Reduce
    Introduction to reduction trees
    Linear Program
    Periodic schedule - Asymptotic optimality
    Toy Example for Series of Reduce
    Approximation for a fixed period
Conclusion
Series of Scatter
◮ $m_k$: type of the messages with destination $P_k$
◮ $s(P_i \to P_j, m_k)$: fractional number of messages of type $m_k$ sent on the edge $P_i \to P_j$ within one time unit
◮ $t(P_i \to P_j)$: fractional time spent by processor $P_i$ sending data to its neighbor $P_j$ within one time unit
◮ bound for this activity: $\forall P_i, P_j,\quad 0 \le t(P_i \to P_j) \le 1$
◮ on a link $P_i \to P_j$ during one time unit: $t(P_i \to P_j) = \sum_{k} s(P_i \to P_j, m_k) \times c(i, j)$
Linear constraints
◮ one-port constraint for outgoing messages of $P_i$: $\forall P_i,\quad \sum_{P_i \to P_j} t(P_i \to P_j) \le 1$
◮ one-port constraint for incoming messages of $P_i$: $\forall P_i,\quad \sum_{P_j \to P_i} t(P_j \to P_i) \le 1$
◮ conservation law at node $P_i$ for messages $m_k$ ($k \ne i$): $\sum_{P_j \to P_i} s(P_j \to P_i, m_k) = \sum_{P_i \to P_j} s(P_i \to P_j, m_k)$
[Figure: node $P_i$ with incoming and outgoing edges carrying messages of type $m_k$]
Throughput and Linear Program
◮ throughput: total number of messages $m_k$ received by $P_k$ within one time unit: $TP = \sum_{P_j \to P_k} s(P_j \to P_k, m_k)$ (same throughput for every target node $P_k$)
◮ summarize these constraints in a linear program:

Steady-State Scatter Problem on a Graph SSSP(G)
Maximize $TP$, subject to
$$
\begin{cases}
\forall P_i, \forall P_j, & 0 \le t(P_i \to P_j) \le 1 \\
\forall P_i, & \sum_{P_j,\,(i,j) \in E} t(P_i \to P_j) \le 1 \\
\forall P_i, & \sum_{P_j,\,(j,i) \in E} t(P_j \to P_i) \le 1 \\
\forall P_i, P_j, & t(P_i \to P_j) = \sum_{m_k} s(P_i \to P_j, m_k) \times c(i, j) \\
\forall P_i, \forall m_k,\ k \ne i, & \sum_{P_j,\,(j,i) \in E} s(P_j \to P_i, m_k) = \sum_{P_j,\,(i,j) \in E} s(P_i \to P_j, m_k) \\
\forall P_k,\ k \in T, & \sum_{P_i,\,(i,k) \in E} s(P_i \to P_k, m_k) = TP
\end{cases}
$$
where $T$ denotes the set of target indices.
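The constraints of SSSP(G) can be checked mechanically on a candidate solution. The sketch below is not a solver: it only validates given rates $s(P_i \to P_j, m_k)$ with exact rational arithmetic and returns the resulting throughput. The function name `check_sssp`, the dictionary layout and the two-target star platform are illustrative assumptions, and the conservation law is skipped at the source, which emits messages without receiving them.

```python
from collections import defaultdict
from fractions import Fraction as F


def check_sssp(cost, s, source, targets):
    """Validate a candidate solution of SSSP(G) and return its throughput TP.

    cost[(i, j)]  -- c(i, j), time to send one unit message on link i -> j
    s[(i, j, k)]  -- messages of type m_k sent on link i -> j per time unit
    Raises ValueError as soon as one constraint is violated.
    """
    # Link occupation: t(i -> j) = sum_k s(i -> j, m_k) * c(i, j)
    t = defaultdict(F)
    inflow, outflow = defaultdict(F), defaultdict(F)
    for (i, j, k), rate in s.items():
        if (i, j) not in cost or rate < 0:
            raise ValueError(f"invalid rate on {i} -> {j}")
        t[(i, j)] += rate * cost[(i, j)]
        outflow[(i, k)] += rate
        inflow[(j, k)] += rate

    # One-port constraints: a node sends (resp. receives) at most 100% of the time.
    sending, receiving = defaultdict(F), defaultdict(F)
    for (i, j), busy in t.items():
        sending[i] += busy
        receiving[j] += busy
    for node in set(sending) | set(receiving):
        if sending[node] > 1 or receiving[node] > 1:
            raise ValueError(f"one-port constraint violated at {node}")

    # Conservation law: a node that is neither the source nor the destination
    # of m_k forwards exactly what it receives.
    for node, k in set(inflow) | set(outflow):
        if node not in (source, k) and inflow[(node, k)] != outflow[(node, k)]:
            raise ValueError(f"conservation violated at {node} for m_{k}")

    # Every target must receive its own messages at the same rate TP.
    rates = {inflow[(k, k)] for k in targets}
    if len(rates) != 1:
        raise ValueError("targets do not share a common throughput")
    return rates.pop()


# Hypothetical platform: source S directly linked to targets A (cost 1) and B (cost 2).
cost = {("S", "A"): F(1), ("S", "B"): F(2)}
throughput = check_sssp(
    cost, {("S", "A", "A"): F(1, 3), ("S", "B", "B"): F(1, 3)}, "S", ["A", "B"]
)
print(throughput)  # 1/3: the outgoing port of S is busy 1/3 + 2/3 of the time

try:
    check_sssp(cost, {("S", "A", "A"): F(1, 2), ("S", "B", "B"): F(1, 2)}, "S", ["A", "B"])
    overloaded = False
except ValueError:
    overloaded = True  # S would have to send during 3/2 time units
```

Using `Fraction` rather than floats keeps the equality tests of the conservation law exact, in line with solving the linear program over the rational numbers.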
Series of Scatter - Toy Example
[Figure: platform graph with source $P_s$, intermediate nodes $P_a$ and $P_b$, and targets $P_0$ and $P_1$, shown in three views: edge costs $c(i, j)$, values of $s(P_i \to P_j, m_k)$ in the solution of the linear program, and occupation time $t(P_i \to P_j)$ of each edge]

edge         c(i,j)   s(P_i → P_j, m_k)      t(P_i → P_j)
P_s → P_a    1        1/4 m_0                1/4
P_s → P_b    1        1/4 m_0, 1/2 m_1       3/4
P_a → P_0    2/3      1/4 m_0                1/6
P_b → P_0    4/3      1/4 m_0                1/3
P_b → P_1    4/3      1/2 m_1                2/3
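The occupation times of the toy example follow from the relation $t(P_i \to P_j) = \sum_k s(P_i \to P_j, m_k) \times c(i, j)$. The short script below redoes this arithmetic with the costs and rates read from the figure; the node and message labels are plain strings chosen for illustration.

```python
from fractions import Fraction as F

# Link costs c(i, j) of the toy platform, as read from the figure.
cost = {
    ("Ps", "Pa"): F(1), ("Ps", "Pb"): F(1),
    ("Pa", "P0"): F(2, 3), ("Pb", "P0"): F(4, 3), ("Pb", "P1"): F(4, 3),
}
# s(Pi -> Pj, m_k): messages per time unit in the solution of the linear program.
s = {
    ("Ps", "Pa", "m0"): F(1, 4),
    ("Ps", "Pb", "m0"): F(1, 4), ("Ps", "Pb", "m1"): F(1, 2),
    ("Pa", "P0", "m0"): F(1, 4),
    ("Pb", "P0", "m0"): F(1, 4),
    ("Pb", "P1", "m1"): F(1, 2),
}

# t(Pi -> Pj) = sum_k s(Pi -> Pj, m_k) * c(i, j)
t = {edge: F(0) for edge in cost}
for (i, j, _msg), rate in s.items():
    t[(i, j)] += rate * cost[(i, j)]

# Load of each outgoing and incoming port (one-port model: must stay <= 1).
out_load, in_load = {}, {}
for (i, j), busy in t.items():
    out_load[i] = out_load.get(i, F(0)) + busy
    in_load[j] = in_load.get(j, F(0)) + busy

# Messages of its own type received by each target per time unit.
tp = {
    msg: sum(rate for (_i, j, m), rate in s.items() if j == dest and m == msg)
    for msg, dest in (("m0", "P0"), ("m1", "P1"))
}

for edge, busy in t.items():
    print(edge, busy)
print("outgoing load of Ps and Pb:", out_load["Ps"], out_load["Pb"])
print("throughput per target:", tp["m0"], tp["m1"])
```

The outgoing ports of $P_s$ and $P_b$ are both busy during the whole time unit, and each target receives 1/2 message per time unit.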
Building a schedule
◮ consider the time needed for all transfers
◮ build a bipartite graph by splitting all nodes
◮ extract matchings, using the weighted edge-coloring algorithm
[Figure: each edge of the toy platform is labelled with its occupation time and the messages it carries: $P_s \to P_a$: 1/4 (1/4 $m_0$); $P_s \to P_b$: 1/4 (1/4 $m_0$) and 1/2 (1/2 $m_1$); $P_a \to P_0$: 1/6 (1/4 $m_0$); $P_b \to P_0$: 1/3 (1/4 $m_0$); $P_b \to P_1$: 2/3 (1/2 $m_1$). Every node $P$ is then split into $P^{send}$ and $P^{recv}$, and the resulting bipartite graph is decomposed into four matchings of weights 1/2, 1/4, 1/6 and 1/12]
Building a schedule
[Figure: Gantt chart of the four matchings over one time unit, with one row per link ($P_b \to P_1$, $P_b \to P_0$, $P_a \to P_0$, $P_s \to P_b$, $P_s \to P_a$); matchings 1 to 4 have weights 1/2, 1/4, 1/6 and 1/12 and are separated by the instants $t = 0, 1/2, 3/4, 11/12, 1$. The last view rescales the time axis to $0 \dots 48$]
◮ least common multiple $T = \operatorname{lcm}\{b_i\}$, where $a_i / b_i$ denotes the number of messages transferred in each matching
◮ ⇒ periodic schedule of period $T$ with atomic transfers of messages
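To make the period concrete, the sketch below checks a decomposition into matchings and derives $T$ for the toy example. The weights 1/2, 1/4, 1/6 and 1/12 come from the figure, but the assignment of edges to matchings is an assumption: it is one decomposition compatible with those weights and with the occupation times, not necessarily the one drawn on the slide. The number of messages carried by an edge in a matching is taken to be the matching weight divided by $c(i, j)$.

```python
from fractions import Fraction as F
from math import lcm  # available since Python 3.9

cost = {
    ("Ps", "Pa"): F(1), ("Ps", "Pb"): F(1),
    ("Pa", "P0"): F(2, 3), ("Pb", "P0"): F(4, 3), ("Pb", "P1"): F(4, 3),
}
# Occupation times t(Pi -> Pj) of the toy example.
t = {
    ("Ps", "Pa"): F(1, 4), ("Ps", "Pb"): F(3, 4),
    ("Pa", "P0"): F(1, 6), ("Pb", "P0"): F(1, 3), ("Pb", "P1"): F(2, 3),
}
# Assumed decomposition: (weight, edges active during that fraction of the time unit).
matchings = [
    (F(1, 2), [("Ps", "Pb"), ("Pb", "P1")]),
    (F(1, 4), [("Ps", "Pa"), ("Pb", "P0")]),
    (F(1, 6), [("Ps", "Pb"), ("Pa", "P0"), ("Pb", "P1")]),
    (F(1, 12), [("Ps", "Pb"), ("Pb", "P0")]),
]

covered = {edge: F(0) for edge in t}
for weight, edges in matchings:
    senders = [i for i, _ in edges]
    receivers = [j for _, j in edges]
    # A matching uses each sending port and each receiving port at most once.
    assert len(set(senders)) == len(senders)
    assert len(set(receivers)) == len(receivers)
    for edge in edges:
        covered[edge] += weight
# The matchings must add up to the occupation time of every edge.
assert covered == t

# Fractional number of messages a_i / b_i moved on each edge of each matching.
counts = [weight / cost[edge] for weight, edges in matchings for edge in edges]
period = lcm(*(c.denominator for c in counts))

# Over one period every transfer carries a whole number of messages.
per_period = [
    [(edge, int(weight / cost[edge] * period)) for edge in edges]
    for weight, edges in matchings
]
print(period)
print(per_period[0])
```

With these assumptions the period evaluates to 48, which agrees with the rescaled time axis of the figure.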
Asymptotic optimality
◮ No schedule can perform more tasks than the steady state:
  Lemma. $opt(G, K) \le TP(G) \times K$
◮ periodic schedule ⇒ schedule:
  1. initialization phase (fill buffers of messages)
  2. $r$ periods of duration $T$ (steady state)
  3. clean-up phase (empty buffers)
  Lemma. The previous algorithm is asymptotically optimal: $\lim_{K \to +\infty} \dfrac{steady(G, K)}{opt(G, K)} = 1$
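The limit can be derived from the upper bound and the three-phase structure. The derivation below is a sketch under assumptions that the slide does not state: the initialization and clean-up phases last constant times $I$ and $C$ that do not depend on $K$, and each steady-state period delivers $T \times TP(G)$ messages.

```latex
r = \left\lfloor \frac{K - I - C}{T} \right\rfloor
\quad\Longrightarrow\quad
steady(G, K) \;\ge\; r \, T \, TP(G) \;\ge\; (K - I - C - T)\, TP(G)

1 \;\ge\; \frac{steady(G, K)}{opt(G, K)}
  \;\ge\; \frac{(K - I - C - T)\, TP(G)}{K \, TP(G)}
  \;=\; 1 - \frac{I + C + T}{K}
  \;\xrightarrow[K \to +\infty]{}\; 1
```

The first inequality of the second line holds because no schedule beats the optimal one, and the second combines the bound on $steady(G, K)$ with $opt(G, K) \le TP(G) \times K$.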
Reduce - Reduction trees
◮ Reduce:
  ◮ each processor $P_{r_i}$ owns a value $v_i$
  ◮ compute $V = v_0 \oplus v_1 \oplus \dots \oplus v_N$ ($\oplus$ associative, non-commutative)
◮ partial result of the Reduce operation: $v_{[k,m]} = v_k \oplus v_{k+1} \oplus \dots \oplus v_m$
◮ two partial results can be merged: $v_{[k,m]} = v_{[k,l]} \oplus v_{[l+1,m]}$ (computational task $T_{k,l,m}$)
[Figure: example reduction tree on three processors, where $P_0$, $P_1$, $P_2$ initially hold $v_0$, $v_1$, $v_2$: (1) transfer of $v_2$, $P_2 \to P_1$; (2) task $T_{1,1,2}$ on $P_1$; (3) transfer of $v_0$, $P_0 \to P_1$; (4) task $T_{0,0,2}$ on $P_1$; (5) transfer of $v_{[0,2]}$, $P_1 \to P_0$]
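The tasks $T_{k,l,m}$ can be replayed on a small example. The sketch below uses string concatenation as a stand-in for $\oplus$, since it is associative but not commutative; the helper `merge` and the sample values are illustrative. It applies the two tasks of the tree in the figure, then a second, different tree, and both produce the same $v_{[0,2]}$.

```python
def merge(partials, k, l, m, op):
    """Task T_{k,l,m}: combine v[k,l] and v[l+1,m] into v[k,m]."""
    partials[(k, m)] = op(partials[(k, l)], partials[(l + 1, m)])
    return partials[(k, m)]


def concat(a, b):
    # Associative, non-commutative operator standing in for the reduction.
    return a + b


values = ["a", "b", "c"]  # v_0, v_1, v_2

# Tree of the figure: T_{1,1,2}, then T_{0,0,2}.
tree_a = {(i, i): v for i, v in enumerate(values)}
merge(tree_a, 1, 1, 2, concat)  # v[1,2] = v_1 (+) v_2
merge(tree_a, 0, 0, 2, concat)  # v[0,2] = v_0 (+) v[1,2]

# Another valid tree: T_{0,0,1}, then T_{0,1,2}.
tree_b = {(i, i): v for i, v in enumerate(values)}
merge(tree_b, 0, 0, 1, concat)  # v[0,1] = v_0 (+) v_1
merge(tree_b, 0, 1, 2, concat)  # v[0,2] = v[0,1] (+) v_2

print(tree_a[(0, 2)], tree_b[(0, 2)])
```

Associativity is what allows different trees to compute the same result, while non-commutativity is why only contiguous ranges $[k, l]$ and $[l+1, m]$ may be merged, in that order.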
Series of Reduce
◮ each processor $P_{r_i}$ owns a set of values $v_i^t$ (e.g. produced at different time-steps $t$)
◮ perform a Reduce operation on each set $\{v_0^t, \dots, v_N^t\}$ to compute $V^t$
◮ each reduction uses a reduction tree
◮ two reductions ($t_1$ and $t_2$) may use different trees
Linear Program - Notations
◮ $s(P_i \to P_j, v_{[k,l]})$: fractional number of values $v_{[k,l]}$ sent on link $P_i \to P_j$ within one time unit
◮ $t(P_i \to P_j)$: fractional occupation time of link $P_i \to P_j$ within one time unit: $0 \le t(P_i \to P_j) \le 1$
◮ $cons(P_i, T_{k,l,m})$: fractional number of tasks $T_{k,l,m}$ computed on processor $P_i$ within one time unit
◮ $\alpha(P_i)$: time spent by processor $P_i$ computing tasks within one time unit: $0 \le \alpha(P_i) \le 1$
◮ $size(v_{[k,m]})$: size of a message containing a value $v^t_{[k,m]}$
◮ $w(P_i, T_{k,l,m})$: time needed by processor $P_i$ to compute one task $T_{k,l,m}$
Linear Program - Constraints
◮ occupation of a link $P_i \to P_j$:
  $t(P_i \to P_j) = \sum_{v_{[k,l]}} s(P_i \to P_j, v_{[k,l]}) \times size(v_{[k,l]}) \times c(i, j)$
◮ occupation time of a processor $P_i$:
  $\alpha(P_i) = \sum_{T_{k,l,m}} cons(P_i, T_{k,l,m}) \times w(P_i, T_{k,l,m})$
◮ "conservation law" for packets of type $v_{[k,m]}$:
  $\sum_{P_j \to P_i} s(P_j \to P_i, v_{[k,m]}) + \sum_{k \le l < m} cons(P_i, T_{k,l,m}) = \sum_{P_i \to P_j} s(P_i \to P_j, v_{[k,m]}) + \sum_{n > m} cons(P_i, T_{k,m,n}) + \sum_{n < k} cons(P_i, T_{n,k-1,m})$
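The index sets of the conservation law are easier to read once enumerated. The sketch below lists the tasks that produce and consume a partial result $v_{[k,m]}$ and evaluates the difference between the two sides of the law at one node. The function names, the dictionary layout, the rate 1/2 and the assumption that value indices run from 0 to `n_last` are illustrative choices, not part of the slides.

```python
from fractions import Fraction as F


def producers(k, m):
    """Tasks T_{k,l,m} (k <= l < m) whose output is v[k,m]."""
    return [(k, l, m) for l in range(k, m)]


def consumers(k, m, n_last):
    """Tasks that take v[k,m] as one of their two inputs."""
    as_left = [(k, m, n) for n in range(m + 1, n_last + 1)]  # T_{k,m,n}, n > m
    as_right = [(n, k - 1, m) for n in range(k)]             # T_{n,k-1,m}, n < k
    return as_left + as_right


def conservation_gap(node, k, m, s, cons, n_last):
    """(received + produced) - (sent + consumed) for v[k,m] at `node`."""
    received = sum(r for (_src, dst, val), r in s.items() if dst == node and val == (k, m))
    sent = sum(r for (src, _dst, val), r in s.items() if src == node and val == (k, m))
    produced = sum(cons.get((node, task), 0) for task in producers(k, m))
    consumed = sum(cons.get((node, task), 0) for task in consumers(k, m, n_last))
    return (received + produced) - (sent + consumed)


# Single reduction tree of the three-processor example, used at a
# hypothetical rate of 1/2 reduction per time unit.
rate = F(1, 2)
s = {
    ("P2", "P1", (2, 2)): rate,  # v_2 sent to P1
    ("P0", "P1", (0, 0)): rate,  # v_0 sent to P1
    ("P1", "P0", (0, 2)): rate,  # final result sent to P0
}
cons = {("P1", (1, 1, 2)): rate, ("P1", (0, 0, 2)): rate}

# Values that P1 receives or computes are all forwarded or consumed.
gaps = [conservation_gap("P1", k, m, s, cons, 2) for k, m in [(2, 2), (0, 0), (1, 2), (0, 2)]]
print(producers(0, 2), consumers(1, 1, 2), gaps)
```

The value $v_{[1,1]}$ is left out of the check at $P_1$ because $P_1$ owns it rather than receiving or computing it.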
Linear Program - Constraints
◮ definition of the throughput:
  $TP = \sum_{P_j \to P_{\text{target}}} s(P_j \to P_{\text{target}}, v_{[0,N]}) + \sum_{0 \le l < N} cons(P_{\text{target}}, T_{0,l,N})$
◮ solve the following linear program over the rational numbers:

Steady-State Reduce Problem on a Graph SSRP(G)
Maximize $TP$, subject to all previous constraints
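As a small worked instance of the throughput definition, the sketch below counts final results that either arrive at the target over a link or are computed on the target itself. It assumes the final result is v[0,N]; the function name `throughput` and the sample rates are hypothetical and do not come from a solver run.

```python
from fractions import Fraction as F

def throughput(s, cons, target, N):
    """TP: final results v[0,N] obtained by the target per time unit."""
    # Final results received by the target from its neighbours.
    arriving = sum(r for (i, j, v), r in s.items()
                   if j == target and v == (0, N))
    # Final results computed on the target by a last task T_{0,l,N}.
    computed = sum(r for (p, (k, l, m)), r in cons.items()
                   if p == target and k == 0 and m == N and 0 <= l < N)
    return arriving + computed

# Hypothetical rates with N = 2 and target P_0 (illustrative only).
s = {(1, 0, (1, 2)): F(1, 3),      # partial result v[1,2], not counted
     (2, 0, (0, 2)): F(1, 3)}      # final result v[0,2] sent to the target
cons = {(0, (0, 0, 2)): F(1, 3),   # final task run on the target
        (2, (0, 0, 2)): F(1, 3)}   # final task run elsewhere, not counted
tp = throughput(s, cons, target=0, N=2)
print(tp)
```

Solving SSRP(G) itself requires a linear programming solver and is not shown here; this only evaluates the objective on a given solution.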
Building a schedule
◮ consider the reduction tree $T_t$ associated with the computation of the $t$-th value ($V_t$):
  ◮ a given tree may be used by many time-stamps $t$
  ◮ there exists an algorithm which extracts from the solution a set of weighted trees such that
    ◮ this description is polynomial and
    ◮ the sum of the weighted trees is equal to the original solution
◮ same use of a weighted edge-coloring algorithm on a bipartite graph to orchestrate the communication
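The slides do not spell out the edge-coloring step, so the following is only a sketch of one standard way to do it: a Birkhoff-style decomposition of the bipartite sender/receiver graph into weighted matchings. The names `decompose` and `perfect_matching` and the matrix `W` are hypothetical. `W[i][j]` is assumed to be the total time per period that P_i must spend sending to P_j; each resulting slot pairs every sender with at most one receiver, which matches the one-port model.

```python
from fractions import Fraction as F

def perfect_matching(M):
    """Kuhn's augmenting-path algorithm on the positive entries of M.
    Returns match_col, where match_col[j] = i pairs sender i with receiver j."""
    n = len(M)
    match_col = [None] * n

    def try_row(i, seen):
        for j in range(n):
            if M[i][j] > 0 and j not in seen:
                seen.add(j)
                if match_col[j] is None or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, set()):
            raise ValueError("no perfect matching: matrix is not balanced")
    return match_col

def decompose(W):
    """Split communication times W into slots (duration, {(sender, receiver): time}).
    The durations sum to the largest send or receive load of any processor."""
    n = len(W)
    W = [[F(x) for x in row] for row in W]          # exact rational copy
    rows = [sum(r) for r in W]
    cols = [sum(W[i][j] for i in range(n)) for j in range(n)]
    delta = max(rows + cols)
    # Pad with dummy weight so that every row and column sums to delta.
    M = [row[:] for row in W]
    i = j = 0
    while i < n and j < n:
        add = min(delta - rows[i], delta - cols[j])
        M[i][j] += add
        rows[i] += add
        cols[j] += add
        if rows[i] == delta:
            i += 1
        elif cols[j] == delta:
            j += 1
    slots, remaining = [], delta
    while remaining > 0:
        match_col = perfect_matching(M)
        d = min(M[match_col[j]][j] for j in range(n))
        slot = {}
        for j in range(n):
            i = match_col[j]
            M[i][j] -= d
            real = min(d, W[i][j])                  # real part, rest is padding
            if real > 0:
                W[i][j] -= real
                slot[(i, j)] = real
        slots.append((d, slot))
        remaining -= d
    return delta, slots

W = [[0, 2, 1], [1, 0, 0], [0, 1, 0]]               # illustrative loads only
delta, slots = decompose(W)
for duration, slot in slots:
    print(duration, slot)
```

Padding the matrix makes a perfect matching exist at every step, so the schedule length equals the heaviest port load; a plain greedy peeling without padding would be shorter to write but may produce a longer schedule.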
Toy Example for Series of Reduce
[Figure: topology of the toy platform]
[Figure: results of the linear program; labels include messages v[1,1], v[1,2], v[2,2] and tasks T_{0,0,2}, T_{1,1,2}]