SLIDE 1
The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ariful Azad, Lawrence Berkeley National Laboratory (LBNL)
SIAM CSC 2016, Albuquerque
Acknowledgements
- Joint work with Aydın Buluç, Mathias Jacquelin, and Esmond Ng
- Funding
SLIDE 2
SLIDE 3
Reordering a sparse matrix
- In this talk, I consider parallel algorithms for reordering sparse matrices.
- Goal: find a permutation P so that the bandwidth/profile of PAP^T is small.
[Figure: sparsity pattern of the matrix before and after permutation]
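A minimal sketch (using SciPy's built-in reverse_cuthill_mckee on a random symmetric matrix, not the distributed algorithm of this talk) of what a small bandwidth of PAP^T means in practice:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    """Half-bandwidth: max |i - j| over all nonzeros A[i, j]."""
    coo = A.tocoo()
    return int(np.abs(coo.row - coo.col).max()) if coo.nnz else 0

# A random sparse symmetric matrix standing in for a mesh/PDE matrix.
n = 200
A = sp.random(n, n, density=0.02, format="csr", random_state=0)
A = (A + A.T).tocsr()  # symmetrize

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
PAPt = A[perm][:, perm]  # apply P A P^T as a row and column permutation

print("bandwidth before:", bandwidth(A))
print("bandwidth after RCM:", bandwidth(PAPt))
```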
SLIDE 4
Why reorder a matrix?
- Better cache reuse in SpMV [Karantasis et al. SC '14]
- Faster iterative solvers, such as preconditioned conjugate gradients (PCG)
Example: PCG implementation in PETSc on Thermal2 (n = 1.2M, nnz = 4.9M)
[Plot: solver time (s) vs. number of cores; RCM ordering is roughly 4x faster than the natural ordering]
SLIDE 5
The case for the Reverse Cuthill-McKee (RCM) algorithm
- Finding a permutation that minimizes the bandwidth is NP-complete [Papadimitriou '76].
- Heuristics are used in practice
  – Examples: the Reverse Cuthill-McKee algorithm, Sloan's algorithm
- We focus on the Reverse Cuthill-McKee (RCM) algorithm
  – Simple to state
  – Easy to understand
  – Relatively easy to parallelize
SLIDE 6
The case for a distributed-memory algorithm
- Enables solving very large problems
- More practical: the matrix is already distributed
  – Gathering the distributed matrix onto a node for serial execution is expensive.
[Bar chart: time (sec) to gather a graph onto one node from 45 nodes of NERSC/Edison (Cray XC30), for ldoor, hugetrace-00020, Serena, dielFilterV3real, delaunay_n24, rgg_n_2_24_s0, and nlpkkt240]
Distributed algorithms are cheaper and scalable.
SLIDE 7
The RCM algorithm
- Pick a start vertex (a pseudo-peripheral vertex).
- Cuthill-McKee order:
  – In the first level, order vertices by increasing degree.
  – In subsequent levels, order vertices by (parents' order, degree).
- Reverse the order of the vertices to obtain the RCM ordering.
[Figure: example graph whose vertices are numbered 1-8 in Cuthill-McKee order, level by level]
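A minimal sequential sketch (plain Python on an adjacency-dict graph, an illustrative simplification) of the Cuthill-McKee procedure just described:

```python
from collections import deque

def cuthill_mckee(adj, start):
    """adj: {vertex: set of neighbors}; start: a pseudo-peripheral vertex.
    Returns vertices in Cuthill-McKee order; reverse it to get RCM."""
    order, visited, q = [start], {start}, deque([start])
    while q:
        v = q.popleft()  # parents are dequeued in increasing label order
        # Unvisited neighbors of the same parent are taken by increasing degree,
        # so each level ends up sorted by (parent's order, degree).
        for u in sorted((u for u in adj[v] if u not in visited),
                        key=lambda u: len(adj[u])):
            visited.add(u)
            order.append(u)
            q.append(u)
    return order

def rcm(adj, start):
    return list(reversed(cuthill_mckee(adj, start)))
```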
SLIDE 8
RCM: Challenges in parallelization (in addition to parallelizing BFS)
- Given a start vertex, the algorithm gives a fixed ordering except for tie-breaks. Not parallelization friendly.
- Unlike traditional BFS, the parent of a vertex is set to the candidate with the minimum label (i.e., bottom-up BFS is not beneficial).
- Within a level, vertices are labeled in lexicographical order of (parents' order, degree) pairs, which requires sorting.
[Figure: example graph with vertices a-h and their Cuthill-McKee labels 1-8]
SLIDE 9
Our approach to addressing the parallelization challenges
- We use a specialized level-synchronous BFS.
- Key differences from traditional BFS (Buluç and Madduri, SC '11):
  1. A parent with a smaller label is preferred over another vertex with a larger label.
  2. The labels of parents are passed to their children.
  3. Vertices within a BFS level are sorted lexicographically.
- The first two are addressed by sparse matrix-sparse vector multiplication (SpMSpV) over a semiring.
- The third is addressed by a lightweight sorting function.
SLIDE 10
Exploring the next-level vertices via SpMSpV
- Multiply the adjacency matrix by a sparse vector holding the current frontier to obtain the next frontier.
- Overload (multiply, add) with (select2nd, min): select2nd passes a frontier vertex's label to its neighbors, and min keeps the smallest candidate parent label per child.
[Figure: adjacency matrix over vertices a-h times the current frontier vector yields the next frontier, carrying the parents' labels]
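A minimal sketch of one SpMSpV step over the (select2nd, min) semiring (plain Python over an adjacency dict; the real implementation operates on the distributed sparse matrix):

```python
def spmspv_select2nd_min(adj, frontier, visited):
    """adj: {vertex: set of neighbors}; frontier: {vertex: its CM label}.
    Returns {child: minimum label among its frontier parents}."""
    next_frontier = {}
    for v, label in frontier.items():  # sparse vector: iterate frontier entries only
        for u in adj[v]:               # nonzeros of column v of the adjacency matrix
            if u in visited:
                continue
            # "add" = min: keep the smallest candidate parent label.
            if u not in next_frontier or label < next_frontier[u]:
                next_frontier[u] = label  # "multiply" = select2nd: pass the label
    return next_frontier
```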
SLIDE 11
Ordering vertices via partial sorting
Each vertex in the next frontier carries a (parent's label, own degree) pair.
Rules for ordering vertices:
1. A vertex whose parent has a smaller label comes first (e.g., c and h are ordered before f).
2. Among siblings, the vertex with the smaller degree comes first (e.g., h is ordered before c).
Sorting the degrees of the siblings gives many instances of small sorts, which avoids an expensive parallel sort.
[Figure: example frontier with parents' labels and degrees, illustrating labels 3, 4, 5 being assigned]
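A minimal sketch of the partial sort (plain Python, with a hypothetical helper name): vertices of the next frontier are binned by parent label, and each small bin is sorted by degree, instead of running one big sort over the whole level.

```python
from collections import defaultdict

def order_level(next_frontier, degree, next_label):
    """next_frontier: {vertex: parent's label}; degree: {vertex: its degree}.
    Assigns consecutive labels starting at next_label; returns {vertex: label}."""
    bins = defaultdict(list)
    for v, parent_label in next_frontier.items():
        bins[parent_label].append(v)
    labels = {}
    for parent_label in sorted(bins):  # rule 1: parents' order first
        for v in sorted(bins[parent_label], key=degree.__getitem__):  # rule 2: degree
            labels[v] = next_label
            next_label += 1
    return labels
```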
SLIDE 12
Distributed-memory parallelization (SpMSpV)
The p processors are arranged in a √p x √p processor grid; the adjacency matrix A and the frontier vector x are distributed across this grid.
ALGORITHM (see the sketch below):
1. Gather frontier vertices along the processor column [communication]
2. Local multiplication [computation]
3. Find the owners of the current frontier's adjacency and exchange adjacencies along the processor row [communication]
[Figure: 2D block distribution of A and the frontier x for y ← Ax]
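A hedged skeleton of the three steps using mpi4py (the function names local_multiply and owner_of are hypothetical placeholders; the actual implementation uses allgatherv/alltoallv-style collectives):

```python
from mpi4py import MPI
import math

comm = MPI.COMM_WORLD
p = comm.Get_size()
gd = math.isqrt(p)                # grid dimension; assumes p is a perfect square
my_row, my_col = divmod(comm.Get_rank(), gd)
row_comm = comm.Split(color=my_row, key=my_col)  # processors in my grid row
col_comm = comm.Split(color=my_col, key=my_row)  # processors in my grid column

def distributed_spmspv(local_A, local_x, local_multiply, owner_of):
    # 1. Gather frontier entries along the processor column [communication]
    col_x = [e for part in col_comm.allgather(local_x) for e in part]
    # 2. Local multiplication over the (select2nd, min) semiring [computation]
    local_y = local_multiply(local_A, col_x)   # list of (child, parent_label)
    # 3. Route each discovered (child, parent_label) pair to its owner
    #    along the processor row [communication]
    outgoing = [[] for _ in range(gd)]
    for child, label in local_y:
        outgoing[owner_of(child)].append((child, label))
    received = row_comm.alltoall(outgoing)
    return [pair for part in received for pair in part]
```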
SLIDE 13
Distributed-memory partial sorting
- Bin vertices by their parents' labels
  – All vertices in a bin are assigned to a single node
  – Needs an AllToAll communication
- Sequentially sort the degrees of the vertices within each node
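A hedged sketch of this step with mpi4py (mapping bins to ranks by modulo is an assumption for illustration, not necessarily the talk's mapping):

```python
def distributed_partial_sort(comm, my_pairs, degree):
    """my_pairs: locally held (parent_label, vertex) pairs; degree: {vertex: degree}."""
    p = comm.Get_size()
    outgoing = [[] for _ in range(p)]
    for parent_label, v in my_pairs:
        outgoing[parent_label % p].append((parent_label, v))  # bin by parent's label
    received = comm.alltoall(outgoing)  # the AllToAll communication
    mine = [pair for part in received for pair in part]
    # Every bin now lives on one node: sort sequentially by (parent label, degree).
    mine.sort(key=lambda pv: (pv[0], degree[pv[1]]))
    return mine
```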
SLIDE 14
Computation and communication complexity

Operation | Per-processor computation (lower bound) | Per-processor comm. (latency) | Per-processor comm. (bandwidth)
SpMSpV    | m/p                                     | diameter · α√p                | β(m/√p + n/√p)
Sorting   | (n/p) log(n/p)                          | diameter · αp                 | βn/√p

α: latency (0.25 μs to 3.7 μs MPI latency on Edison); β: inverse bandwidth (~8 GB/s MPI bandwidth on Edison); p: number of processors; n: number of vertices; m: number of edges.
SLIDE 15
Other aspects of the algorithm
- Finding a pseudo-peripheral start vertex
  – Repeated application of the usual BFS (no ordering of vertices within a level)
- Our SpMSpV is a hybrid OpenMP-MPI implementation
  – Multithreaded SpMSpV is itself fairly complicated and is the subject of separate work
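A minimal sketch of the pseudo-peripheral vertex search by repeated BFS (in the spirit of the George-Liu heuristic; illustrative plain Python, not the talk's distributed code):

```python
from collections import deque

def bfs_levels(adj, s):
    """Plain BFS; returns {vertex: level}, with no ordering within a level."""
    level, q = {s: 0}, deque([s])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in level:
                level[u] = level[v] + 1
                q.append(u)
    return level

def pseudo_peripheral(adj, s):
    """Restart BFS from a farthest low-degree vertex until eccentricity stops growing."""
    ecc = -1
    while True:
        level = bfs_levels(adj, s)
        new_ecc = max(level.values())
        if new_ecc <= ecc:
            return s
        ecc = new_ecc
        farthest = [v for v, d in level.items() if d == new_ecc]
        s = min(farthest, key=lambda v: len(adj[v]))
```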
SLIDE 16
Results: scalability on NERSC/Edison (6 threads per MPI process)
Matrix: dielFilterV3real (#vertices: 1.1M, #edges: 89M); bandwidth before: 1,036,475, after: 23,813
[Plot: time (sec) vs. number of cores (1 to 4,056), broken down into Peripheral: SpMSpV, Peripheral: Other, Ordering: SpMSpV, Ordering: Sorting, Ordering: Other; about 30x speedup overall]
At the highest core counts, communication dominates.
SLIDE 17
Scalability on NERSC/Edison (6 threads per MPI process)
Matrix: nlpkkt240 (#vertices: 78M, #edges: 760M); bandwidth before: 14,169,841, after: 361,755
[Plot: time (sec) vs. number of cores (54 to 4,056), with the same per-phase breakdown]
Larger graphs continue scaling.
SLIDE 18
Single-node performance on NERSC/Edison (2x12 cores)
- Compared against the SpMP (Sparse Matrix Pre-processing) package by Park et al. (https://github.com/jspark1105/SpMP)
- We switch to MPI+OpenMP after 12 cores
[Plot: time (s) vs. number of cores (1 to 32) for SpMP and our algorithm on ldoor (#vertices: 1M, #edges: 42M)]
If the matrix is already distributed over 1K cores (~45 nodes), the time to gather it is 0.82 s, making the distributed algorithm more profitable.
SLIDE 19
Conclusions
- For many practical problems, the RCM ordering expedites iterative solvers.
- No scalable distributed-memory algorithm for RCM ordering existed
  – forcing us to gather an already-distributed matrix onto a node and run a serial algorithm (e.g., in PETSc), which is expensive.
- We developed a distributed-memory RCM algorithm using SpMSpV and partial sorting.
- The algorithm scales up to 1K cores on modern supercomputers.
SLIDE 20