SLIDE 1
The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ariful Azad, Lawrence Berkeley National Laboratory (LBNL)
SIAM CSC 2016, Albuquerque
Acknowledgements
- Joint work with Aydın Buluç, Mathias Jacquelin, and Esmond Ng
- Funding
SLIDE 2
SLIDE 3
Reordering a sparse matrix
- In this talk, I consider parallel algorithms for reordering sparse matrices.
- Goal: find a permutation P so that the bandwidth/profile of PAP^T is small.
[Figure: sparsity pattern of the matrix before and after permutation]
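A minimal sketch (using SciPy's built-in reverse_cuthill_mckee on a random symmetric matrix, not the distributed algorithm of this talk) of what a small bandwidth of PAP^T means in practice:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    """Half-bandwidth: max |i - j| over all nonzeros A[i, j]."""
    coo = A.tocoo()
    return int(np.abs(coo.row - coo.col).max()) if coo.nnz else 0

# A random sparse symmetric matrix standing in for a mesh/PDE matrix.
n = 200
A = sp.random(n, n, density=0.02, format="csr", random_state=0)
A = (A + A.T).tocsr()  # symmetrize

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
PAPt = A[perm][:, perm]  # apply P A P^T as a row and column permutation

print("bandwidth before:", bandwidth(A))
print("bandwidth after RCM:", bandwidth(PAPt))
```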
SLIDE 4
Why reorder a matrix?
- Better cache reuse in SpMV [Karantasis et al. SC '14]
- Faster iterative solvers, such as preconditioned conjugate gradients (PCG)
Example: PCG implementation in PETSc on Thermal2 (n = 1.2M, nnz = 4.9M)
[Plot: solver time (s) vs. number of cores; RCM ordering is roughly 4x faster than the natural ordering]
SLIDE 5
The case for the Reverse Cuthill-McKee (RCM) algorithm
- Finding a permutation that minimizes the bandwidth is NP-complete [Papadimitriou '76].
- Heuristics are used in practice
  – Examples: the Reverse Cuthill-McKee algorithm, Sloan's algorithm
- We focus on the Reverse Cuthill-McKee (RCM) algorithm
  – Simple to state
  – Easy to understand
  – Relatively easy to parallelize
SLIDE 6
The case for a distributed-memory algorithm
- Enables solving very large problems
- More practical: the matrix is already distributed
  – Gathering the distributed matrix onto a node for serial execution is expensive.
[Bar chart: time (sec) to gather a graph onto one node from 45 nodes of NERSC/Edison (Cray XC30), for ldoor, hugetrace-00020, Serena, dielFilterV3real, delaunay_n24, rgg_n_2_24_s0, and nlpkkt240]
Distributed algorithms are cheaper and scalable.
SLIDE 7
The RCM algorithm
- Pick a start vertex (a pseudo-peripheral vertex).
- Cuthill-McKee order:
  – In the first level, order vertices by increasing degree.
  – In subsequent levels, order vertices by (parents' order, degree).
- Reverse the order of the vertices to obtain the RCM ordering.
[Figure: example graph whose vertices are numbered 1-8 in Cuthill-McKee order, level by level]
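A minimal sequential sketch (plain Python on an adjacency-dict graph, an illustrative simplification) of the Cuthill-McKee procedure just described:

```python
from collections import deque

def cuthill_mckee(adj, start):
    """adj: {vertex: set of neighbors}; start: a pseudo-peripheral vertex.
    Returns vertices in Cuthill-McKee order; reverse it to get RCM."""
    order, visited, q = [start], {start}, deque([start])
    while q:
        v = q.popleft()  # parents are dequeued in increasing label order
        # Unvisited neighbors of the same parent are taken by increasing degree,
        # so each level ends up sorted by (parent's order, degree).
        for u in sorted((u for u in adj[v] if u not in visited),
                        key=lambda u: len(adj[u])):
            visited.add(u)
            order.append(u)
            q.append(u)
    return order

def rcm(adj, start):
    return list(reversed(cuthill_mckee(adj, start)))
```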
SLIDE 8
RCM: Challenges in parallelization (in addition to parallelizing BFS)
- Given a start vertex, the algorithm gives a fixed ordering except for tie-breaks. Not parallelization friendly.
- Unlike traditional BFS, the parent of a vertex is set to the candidate with the minimum label (i.e., bottom-up BFS is not beneficial).
- Within a level, vertices are labeled in lexicographical order of (parents' order, degree) pairs, which requires sorting.
[Figure: example graph with vertices a-h and their Cuthill-McKee labels 1-8]
SLIDE 9
Our approach to addressing the parallelization challenges
- We use a specialized level-synchronous BFS.
- Key differences from traditional BFS (Buluç and Madduri, SC '11):
  1. A parent with a smaller label is preferred over another vertex with a larger label.
  2. The labels of parents are passed to their children.
  3. Vertices within a BFS level are sorted lexicographically.
- The first two are addressed by sparse matrix-sparse vector multiplication (SpMSpV) over a semiring.
- The third is addressed by a lightweight sorting function.
SLIDE 10
Exploring the next-level vertices via SpMSpV
- Multiply the adjacency matrix by a sparse vector holding the current frontier to obtain the next frontier.
- Overload (multiply, add) with (select2nd, min): select2nd passes a frontier vertex's label to its neighbors, and min keeps the smallest candidate parent label per child.
[Figure: adjacency matrix over vertices a-h times the current frontier vector yields the next frontier, carrying the parents' labels]
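A minimal sketch of one SpMSpV step over the (select2nd, min) semiring (plain Python over an adjacency dict; the real implementation operates on the distributed sparse matrix):

```python
def spmspv_select2nd_min(adj, frontier, visited):
    """adj: {vertex: set of neighbors}; frontier: {vertex: its CM label}.
    Returns {child: minimum label among its frontier parents}."""
    next_frontier = {}
    for v, label in frontier.items():  # sparse vector: iterate frontier entries only
        for u in adj[v]:               # nonzeros of column v of the adjacency matrix
            if u in visited:
                continue
            # "add" = min: keep the smallest candidate parent label.
            if u not in next_frontier or label < next_frontier[u]:
                next_frontier[u] = label  # "multiply" = select2nd: pass the label
    return next_frontier
```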
SLIDE 11
Ordering vertices via partial sorting
Each vertex in the next frontier carries a (parent's label, own degree) pair.
Rules for ordering vertices:
1. A vertex whose parent has a smaller label comes first (e.g., c and h are ordered before f).
2. Among siblings, the vertex with the smaller degree comes first (e.g., h is ordered before c).
Sorting the degrees of the siblings gives many instances of small sorts, which avoids an expensive parallel sort.
[Figure: example frontier with parents' labels and degrees, illustrating labels 3, 4, 5 being assigned]
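A minimal sketch of the partial sort (plain Python, with a hypothetical helper name): vertices of the next frontier are binned by parent label, and each small bin is sorted by degree, instead of running one big sort over the whole level.

```python
from collections import defaultdict

def order_level(next_frontier, degree, next_label):
    """next_frontier: {vertex: parent's label}; degree: {vertex: its degree}.
    Assigns consecutive labels starting at next_label; returns {vertex: label}."""
    bins = defaultdict(list)
    for v, parent_label in next_frontier.items():
        bins[parent_label].append(v)
    labels = {}
    for parent_label in sorted(bins):  # rule 1: parents' order first
        for v in sorted(bins[parent_label], key=degree.__getitem__):  # rule 2: degree
            labels[v] = next_label
            next_label += 1
    return labels
```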
SLIDE 12
Distributed-memory parallelization (SpMSpV)
The p processors are arranged in a √p x √p processor grid; the adjacency matrix A and the frontier vector x are distributed across this grid.
ALGORITHM (see the sketch below):
1. Gather frontier vertices along the processor column [communication]
2. Local multiplication [computation]
3. Find the owners of the current frontier's adjacency and exchange adjacencies along the processor row [communication]
[Figure: 2D block distribution of A and the frontier x for y ← Ax]
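A hedged skeleton of the three steps using mpi4py (the function names local_multiply and owner_of are hypothetical placeholders; the actual implementation uses allgatherv/alltoallv-style collectives):

```python
from mpi4py import MPI
import math

comm = MPI.COMM_WORLD
p = comm.Get_size()
gd = math.isqrt(p)                # grid dimension; assumes p is a perfect square
my_row, my_col = divmod(comm.Get_rank(), gd)
row_comm = comm.Split(color=my_row, key=my_col)  # processors in my grid row
col_comm = comm.Split(color=my_col, key=my_row)  # processors in my grid column

def distributed_spmspv(local_A, local_x, local_multiply, owner_of):
    # 1. Gather frontier entries along the processor column [communication]
    col_x = [e for part in col_comm.allgather(local_x) for e in part]
    # 2. Local multiplication over the (select2nd, min) semiring [computation]
    local_y = local_multiply(local_A, col_x)   # list of (child, parent_label)
    # 3. Route each discovered (child, parent_label) pair to its owner
    #    along the processor row [communication]
    outgoing = [[] for _ in range(gd)]
    for child, label in local_y:
        outgoing[owner_of(child)].append((child, label))
    received = row_comm.alltoall(outgoing)
    return [pair for part in received for pair in part]
```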
SLIDE 13
Distributed-memory partial sorting
- Bin vertices by their parents' labels
  – All vertices in a bin are assigned to a single node
  – Needs an AllToAll communication
- Sequentially sort the degrees of the vertices within each node
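A hedged sketch of this step with mpi4py (mapping bins to ranks by modulo is an assumption for illustration, not necessarily the talk's mapping):

```python
def distributed_partial_sort(comm, my_pairs, degree):
    """my_pairs: locally held (parent_label, vertex) pairs; degree: {vertex: degree}."""
    p = comm.Get_size()
    outgoing = [[] for _ in range(p)]
    for parent_label, v in my_pairs:
        outgoing[parent_label % p].append((parent_label, v))  # bin by parent's label
    received = comm.alltoall(outgoing)  # the AllToAll communication
    mine = [pair for part in received for pair in part]
    # Every bin now lives on one node: sort sequentially by (parent label, degree).
    mine.sort(key=lambda pv: (pv[0], degree[pv[1]]))
    return mine
```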
SLIDE 14
Computation and communication complexity

Operation | Per-processor computation (lower bound) | Per-processor comm. (latency) | Per-processor comm. (bandwidth)
SpMSpV    | m/p                                     | diameter · α√p                | β(m/√p + n/√p)
Sorting   | (n/p) log(n/p)                          | diameter · αp                 | βn/√p

α: latency (0.25 μs to 3.7 μs MPI latency on Edison); β: inverse bandwidth (~8 GB/s MPI bandwidth on Edison); p: number of processors; n: number of vertices; m: number of edges.
SLIDE 15
Other aspects of the algorithm
- Finding a pseudo-peripheral start vertex
  – Repeated application of the usual BFS (no ordering of vertices within a level)
- Our SpMSpV is a hybrid OpenMP-MPI implementation
  – Multithreaded SpMSpV is itself fairly complicated and is the subject of separate work
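A minimal sketch of the pseudo-peripheral vertex search by repeated BFS (in the spirit of the George-Liu heuristic; illustrative plain Python, not the talk's distributed code):

```python
from collections import deque

def bfs_levels(adj, s):
    """Plain BFS; returns {vertex: level}, with no ordering within a level."""
    level, q = {s: 0}, deque([s])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in level:
                level[u] = level[v] + 1
                q.append(u)
    return level

def pseudo_peripheral(adj, s):
    """Restart BFS from a farthest low-degree vertex until eccentricity stops growing."""
    ecc = -1
    while True:
        level = bfs_levels(adj, s)
        new_ecc = max(level.values())
        if new_ecc <= ecc:
            return s
        ecc = new_ecc
        farthest = [v for v, d in level.items() if d == new_ecc]
        s = min(farthest, key=lambda v: len(adj[v]))
```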
SLIDE 16
Results: scalability on NERSC/Edison (6 threads per MPI process)
Matrix: dielFilterV3real (#vertices: 1.1M, #edges: 89M); bandwidth before: 1,036,475, after: 23,813
[Plot: time (sec) vs. number of cores (1 to 4,056), broken down into Peripheral: SpMSpV, Peripheral: Other, Ordering: SpMSpV, Ordering: Sorting, Ordering: Other; about 30x speedup overall]
At the highest core counts, communication dominates.
SLIDE 17
Scalability on NERSC/Edison (6 threads per MPI process)
Matrix: nlpkkt240 (#vertices: 78M, #edges: 760M); bandwidth before: 14,169,841, after: 361,755
[Plot: time (sec) vs. number of cores (54 to 4,056), with the same per-phase breakdown]
Larger graphs continue scaling.
SLIDE 18
Single-node performance on NERSC/Edison (2x12 cores)
- Compared against the SpMP (Sparse Matrix Pre-processing) package by Park et al. (https://github.com/jspark1105/SpMP)
- We switch to MPI+OpenMP after 12 cores
[Plot: time (s) vs. number of cores (1 to 32) for SpMP and our algorithm on ldoor (#vertices: 1M, #edges: 42M)]
If the matrix is already distributed over 1K cores (~45 nodes), the time to gather it is 0.82 s, making the distributed algorithm more profitable.
SLIDE 19
Conclusions
- For many practical problems, the RCM ordering expedites iterative solvers.
- No scalable distributed-memory algorithm for RCM ordering existed
  – forcing us to gather an already-distributed matrix onto a node and run a serial algorithm (e.g., in PETSc), which is expensive.
- We developed a distributed-memory RCM algorithm using SpMSpV and partial sorting.
- The algorithm scales up to 1K cores on modern supercomputers.
SLIDE 20