

SLIDE 1 (1/22)

  • Euro PVM/MPI 2003, Venezia, Italia

Efficient Parallel Implementation of Transitive Closure of Digraphs

  • C. E. R. Alves, Universidade São Judas Tadeu
  • E. N. Cáceres, Universidade Federal de Mato Grosso do Sul
  • A. A. Castro Jr., Universidade Católica Dom Bosco
  • S. W. Song, Universidade de São Paulo
  • J. L. Szwarcfiter, Universidade Federal do Rio de Janeiro

SLIDE 2 (2/22)

  • The Transitive Closure Problem
  • Used in many areas, such as
    – Network Planning
    – Distributed Systems Design
  • Used in problems such as
    – All Shortest Paths in a Directed Graph
    – Breadth-First Spanning Trees
  • Directed graph D(V, E) with |V| = n, |E| = m
  • We present a parallel algorithm to compute its transitive closure using
    – p processors
    – each with O(n²/p) local memory

SLIDE 3 (3/22)

  • Example

[Figure] A directed graph on vertices 1–6.

SLIDE 4 (4/22)

  • Example

[Figure] Its transitive closure: green edges join i to j whenever j can be reached from i.

SLIDE 5 (5/22)

  • BSP/CGM Model

CGM (Coarse Grained Multicomputer) model: p processors, each with its own local memory, communicating through a network. The algorithm alternates between:

  • Computation round: each processor computes independently on its local data.
  • Communication round: each processor sends/receives data to/from other processors.

Goals:

  • Obtain a linear speedup in p.
  • Minimize the number of rounds.
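To make the round structure concrete, here is a minimal C + MPI sketch of one CGM superstep. The function compute_locally, the buffer layout, and the use of MPI_Alltoall (whose collective completion also serves as the synchronization point) are illustrative assumptions, not part of the paper.

/* One CGM superstep (sketch): a computation round on local data,
 * followed by a communication round.  send_buf and recv_buf each hold
 * `count` ints per processor, i.e. count * p ints in total. */
#include <mpi.h>

static void compute_locally(int *data, int len)
{
    for (int i = 0; i < len; i++)   /* placeholder local work */
        data[i] += 1;
}

void cgm_superstep(int *send_buf, int *recv_buf, int count, int p)
{
    compute_locally(send_buf, count * p);        /* computation round   */
    MPI_Alltoall(send_buf, count, MPI_INT,       /* communication round */
                 recv_buf, count, MPI_INT, MPI_COMM_WORLD);
}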
SLIDE 6 (6/22)

  • The CGM Model

[Figure: processors P0, P1, P2, ..., Pp−1 alternate computation rounds (local computation) and communication rounds (global communication), separated by synchronization barriers.]

SLIDE 7 (7/22)

  • Previous Parallel Algorithms

1. PRAM:
  • Karp et al. (CREW): O(log² n) time with O(M(n))¹ processors.
  • JáJá (CRCW): O(log n) time with O(n³) processors.

2. Cáceres et al.: acyclic digraphs with a linear extension labeling; O(log p) rounds with O(n³/p) local computation time.

3. Dependency Graph Approach (Pagourtzis et al.): O(p) rounds with O(n³/p) local computation time.

¹ M(n) is the best known sequential bound for multiplying two n × n matrices over a ring.

SLIDE 8 (8/22)

  • Warshall’s Algorithm

Algorithm 1: Warshall’s Algorithm
Input: Adjacency matrix M (n × n) of graph G
Output: Transitive closure of graph G

1: for k ← 1 to n do
2:   for i ← 1 to n do
3:     for j ← 1 to n do
4:       M[i, j] ← M[i, j] or (M[i, k] and M[k, j])
5:     end for
6:   end for
7: end for
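For concreteness, a minimal sequential sketch of Algorithm 1 in C, with 0-based indices; the matrix layout and the 6-vertex demo graph are illustrative choices, not the example from the slides.

/* Sequential Warshall: M[i][j] becomes 1 iff j is reachable from i. */
#include <stdio.h>

#define N 6

static void warshall(unsigned char M[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                M[i][j] = M[i][j] || (M[i][k] && M[k][j]);
}

int main(void)
{
    /* Demo digraph: a directed path 0 -> 1 -> ... -> 5. */
    unsigned char M[N][N] = {0};
    for (int v = 0; v + 1 < N; v++)
        M[v][v + 1] = 1;

    warshall(M);

    for (int i = 0; i < N; i++) {   /* print the closure matrix */
        for (int j = 0; j < N; j++)
            printf("%d", M[i][j]);
        printf("\n");
    }
    return 0;
}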

SLIDE 9 (9/22)

  • Partitioning the Adjacency Matrix

[Figure: the adjacency matrix partitioned among p = 4 processors into row bands and column bands 1–4, with a row index i, a column index j, and the pivot index k marked.]

SLIDE 10 (10/22)

  • The Parallel Algorithm

Algorithm 2: Parallel Warshall
Input: Adjacency matrix M stored in the p processors: each processor q (1 ≤ q ≤ p) stores the submatrices M[(q−1)n/p + 1 .. qn/p][1..n] (a band of rows) and M[1..n][(q−1)n/p + 1 .. qn/p] (a band of columns).
Output: Transitive closure of graph G, represented by the transformed matrix M.
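As a side note, a one-line C sketch (0-based indices, hypothetical helper name, assuming p divides n) of which processor owns a given row or column under this band distribution:

/* Sketch: owner of row or column `index` (0-based) under the band
 * distribution above; assumes n is a multiple of p. */
int band_owner(int index, int n, int p)
{
    return index / (n / p);   /* processors numbered 0..p-1 */
}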

SLIDE 11 (11/22)

  • Algorithm 3: Parallel Warshall

Each processor q (1 ≤ q ≤ p) does the following.

1: repeat
2:   for k = (q−1)n/p + 1 to qn/p do
3:     for i = 0 to n−1 do
4:       for j = 0 to n−1 do
5:         if M[i][k] = 1 and M[k][j] = 1 then
6:           M[i][j] = 1 (if M[i][j] belongs to a processor other than q, store it for subsequent transmission to the corresponding processor)
7:         end if
8:         Send stored data to the corresponding processors.
9:         Receive data that belong to processor q from other processors.
10:      end for
11:    end for
12:  end for
13: until no new matrix entry updates are done
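A much-simplified C + MPI sketch in this spirit: it uses a block-row distribution and broadcasts each pivot row, i.e. the textbook variant rather than Algorithm 3's repeat-until-stable exchange, so it is one possible realization and not the authors' implementation. The demo input, buffer layout, and n % p == 0 are assumptions.

/* Simplified parallel Warshall: each process holds rows/p rows and
 * receives each pivot row k via MPI_Bcast from its owner. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, p, n = 8;                             /* demo matrix order */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = n / p;                               /* rows per process  */
    int first = rank * rows;                        /* first global row  */
    unsigned char *M = calloc((size_t)rows * n, 1); /* local row band    */
    unsigned char *pivot = malloc(n);               /* pivot-row buffer  */

    /* Demo input: directed cycle 0 -> 1 -> ... -> n-1 -> 0. */
    for (int i = 0; i < rows; i++)
        M[i * n + (first + i + 1) % n] = 1;

    for (int k = 0; k < n; k++) {
        int owner = k / rows;                       /* who holds row k   */
        if (rank == owner)
            memcpy(pivot, &M[(k - first) * n], n);
        MPI_Bcast(pivot, n, MPI_UNSIGNED_CHAR, owner, MPI_COMM_WORLD);

        for (int i = 0; i < rows; i++)              /* local update:     */
            if (M[i * n + k])                       /* if i reaches k,   */
                for (int j = 0; j < n; j++)         /* i reaches all     */
                    M[i * n + j] |= pivot[j];       /* that k reaches    */
    }

    if (rank == 0) {                 /* row 0 of the closure: all ones   */
        for (int j = 0; j < n; j++)  /* on the demo cycle                */
            printf("%d", M[j]);
        printf("\n");
    }

    free(M); free(pivot);
    MPI_Finalize();
    return 0;
}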

SLIDE 12 (12/22)

  • The Main Idea
  • Partition V(D).
  • For each part, construct the digraph formed by the edges of D that have at least one endpoint in the part.
  • Compute the transitive closure of each part's digraph.
  • Send the computed transitive edges to the proper part (see the sketch below).
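A small C sketch of the subgraph-construction step; the edge list, the part[] vertex assignment, and the duplication of crossing edges into both parts are illustrative assumptions.

/* Sketch: part q receives every edge (u, v) with at least one endpoint
 * assigned to q; edges crossing parts appear in both parts. */
#include <stdio.h>

typedef struct { int u, v; } Edge;

void list_part_edges(const Edge *edges, int m, const int *part, int q)
{
    for (int e = 0; e < m; e++)
        if (part[edges[e].u] == q || part[edges[e].v] == q)
            printf("part %d: edge (%d, %d)\n", q, edges[e].u, edges[e].v);
}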

SLIDE 13 (13/22)

  • Example

[Figure: an example digraph on vertices 1–8.]

SLIDE 14 (14/22)

  • Example

[Figure: the partition. Processor 0 holds the subgraph on vertices 1, 2, 3, 4, 5, 6, 7; Processor 1 holds the subgraph on vertices 2, 3, 5, 6, 7, 8.]

SLIDE 15 (15/22)

  • Example

[Figure: the transitive closures computed locally by Processor 0 (vertices 1, 2, 3, 4, 5, 6, 7) and Processor 1 (vertices 2, 3, 5, 6, 7, 8).]

SLIDE 16 (16/22)

  • Example

[Figure: after the computed transitive edges are exchanged, Processor 0 and Processor 1 each hold the result over all vertices 1–8.]

SLIDE 17 (17/22)

  • Implementation
  • 64-node Beowulf cluster of low-cost microcomputers, each with 256 MB RAM, 256 MB swap, an Intel Pentium III CPU at 448.956 MHz, and 512 KB cache.
  • 100 Mb fast-Ethernet switch.
  • Code in standard ANSI C with LAM-MPI version 6.5.6.
  • Tests on randomly generated digraphs with a 20% probability of an edge between any two vertices (see the sketch below).
  • In all tests, the number of communication rounds required was less than log p.
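A minimal C sketch of one way to generate such test inputs; the rand() seeding and flat matrix layout are illustrative choices, not the authors' generator.

/* Sketch: random digraph where each ordered pair (i, j), i != j,
 * receives an edge with probability `prob` (0.2 in the experiments). */
#include <stdlib.h>

void random_digraph(unsigned char *M, int n, double prob, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            M[i * n + j] = (i != j) && (rand() < prob * (RAND_MAX + 1.0));
}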

SLIDE 18 (18/22)

  • Implementation Results
[Plot: running time in seconds versus number of processors for the 480x480 and 512x512 inputs.]

SLIDE 19 (19/22)

  • Implementation Results
[Plot: running time in seconds versus number of processors for the 960x960, 1024x1024, and 1920x1920 inputs.]

SLIDE 20 (20/22)

  • Implementation Results
[Plot: speedup versus number of processors for the 480x480 and 512x512 inputs.]

SLIDE 21 (21/22)

  • Implementation Results
[Plot: speedup versus number of processors for the 960x960, 1024x1024, and 1920x1920 inputs.]

SLIDE 22 (22/22)

  • Conclusion

A BSP/CGM algorithm for the Transitive Closure problem.

  • Digraph with n vertices and m edges.
  • Measured number of communication rounds: O(log p).
  • Local computation time: O(mn/p).