Implementing a Parallel Graph Clustering Algorithm with Sparse Matrix Computation
Jun Chen, Peigang Zou
High Performance Computing Center, Institute of Applied Physics and Computational Mathematics
chenjun@iapcm.ac.cn
OUTLINE
- Graph clustering
  - Peer pressure clustering (PPCL)
- Challenges
- A solution: based on large graph platforms/libraries
  - Combinatorial BLAS
- Related works
- Parallel PPCL algorithm with matrix computation
- Numerical results
- Discussions
- Conclusion
- A wide class of algorithms to classify vertices.
  - Traditional clustering classifies points by the distances between them; graph clustering classifies vertices by the relationships between them.
- Application areas: machine learning, pattern recognition, bioinformatics, image analysis, ...
- Graph clustering methods: random walks, peer pressure clustering (PPCL), minimum cuts, multi-way partition, genetic algorithms, ...
- E.g., the number of Facebook users exceeds 1 billion: a big graph!
- Parallel graph clustering implementation is difficult. It requires:
  - a well-suited description of the natural sparse locality;
  - storing graphs effectively;
  - high performance computing.
- High performance challenges:
  - Scalability: running time should be <= O(n), and memory consumption should be < O(n x n).
  - The parallel patterns used for solving PDEs in typical scientific computing are based on dense computations; they are not suitable for the sparse characteristics of large graph computation.
  - The MapReduce pattern for big data problems has low efficiency.
A solution: based on a large graph platform/library
- ScaleGraph
  - TITECH, Japan.
  - Implements the Pregel model proposed by Google, and optimizes its collective communication and memory management methods.
  - Goal: to analyze graphs containing 10 billion nodes and edges.
- DistBelief
  - A parallel framework for deep learning.
- PBGL (Parallel Boost Graph Library)
  - Indiana University, USA.
  - A C++ graph library.
- GAPDT (Graph Algorithm and Pattern Discovery Toolbox)
  - UCSB, USA.
  - Provides interactive graph operations, and can run in parallel on Star-P, a parallel version of MATLAB.
  - Uses distributed sparse arrays to describe the parallel operations.
- GraphBLAS
  - Defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments.
- etc.
Table 1. Key basic linear algebra operations in CombBLAS.

CombBLAS:
- A representative implementation of GraphBLAS.
- Multi-core parallelism based on MPI.
- A collective set of basic linear algebra operations.
[Buluç A, Gilbert J R. The Combinatorial BLAS: design, implementation, and applications.]
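To make the "graph operations as matrix operations" idea concrete, the following minimal C++ sketch (an illustrative assumption, not the CombBLAS API) stores a small graph as a CSR sparse matrix and performs one BFS-style frontier expansion as a sparse matrix-vector product:

// Minimal illustration: adjacency matrix in CSR form, and one frontier
// expansion y = A^T x -- the style of operation GraphBLAS libraries build on.
#include <cstdio>
#include <vector>

struct CsrMatrix {                 // adjacency matrix in CSR form
    int n;                         // number of vertices
    std::vector<int> rowptr;       // size n + 1
    std::vector<int> colidx;       // column index of each nonzero
};

// next[j] = 1 if any frontier vertex i has an edge (i, j)
std::vector<int> Expand(const CsrMatrix& A, const std::vector<int>& frontier) {
    std::vector<int> next(A.n, 0);
    for (int i = 0; i < A.n; ++i) {
        if (!frontier[i]) continue;
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            next[A.colidx[k]] = 1;
    }
    return next;
}

int main() {
    // 4-vertex example graph: 0->1, 0->2, 1->3, 2->3
    CsrMatrix A{4, {0, 2, 3, 4, 4}, {1, 2, 3, 3}};
    std::vector<int> frontier = {1, 0, 0, 0};       // start from vertex 0
    std::vector<int> next = Expand(A, frontier);
    for (int v = 0; v < A.n; ++v)
        if (next[v]) std::printf("reached vertex %d\n", v);  // prints 1 and 2
}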
- Introduction of the random walks method.
  - The cluster assignment of a vertex tends to be the same as that of most of its neighbors.
- Current representative parallel PPCL methods:
  - Parallel PPCL in SPARQL
  - Parallel PPCL in STAR-P
PPCL in STAR-P
Results:
- Maximum processor number: 128
- R-MAT graph: 2,097,152 vertices, 18,305,177 edges (scale 21)
- Low performance.
- STAR-P is a parallel implementation of MATLAB.

PPCL in SPARQL
Results:
- Maximum processor number: 64
- RDF graph: 10,000 vertices, 232,000 edges
- 200 seconds
- SPARQL is an SQL-like query tool for RDF graphs.
RDF graph: stores metadata for web resources. A vertex identifies a resource; an arc describes the resource's attributes.
Algorithm 1 (vertex-based PPCL):

PeerPressure(G = (V, E), C_i)
1  for (u, v, w) ∈ E
2      do T(v)(C_i(u)) ← T(v)(C_i(u)) + w
3  for n ∈ V
4      do C_f(n) := j : T(n)(i) ≤ T(n)(j), ∀ i ∈ V
5  if C_i == C_f
6      then return C_f
7  else return PeerPressure(G = (V, E), C_f)

Iteration flow: given an approximation G', initialize, tally the votes, and form a new cluster approximation G''; if G' != G'', set G' = G'' and repeat; if G' == G'', stop.
It starts with an initial cluster assignment, e.g., each vertex being in its own cluster. Each iteration performs an election at each vertex to select its cluster number; the votes are the cluster assignments of its neighbors. Ties are settled by selecting the lowest cluster ID to keep the algorithm deterministic. The algorithm converges when two consecutive iterations have only a tiny difference between them.
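As a concrete illustration of this election step, here is a minimal sequential C++ sketch of one voting sweep over an adjacency-list graph; it is a simplified, assumed helper for exposition, not the parallel implementation presented in this talk.

// One peer-pressure voting sweep over an adjacency-list graph.
#include <map>
#include <vector>

// cluster[v] is the current cluster ID of vertex v; adj[v] lists v's neighbors.
std::vector<int> VoteOnce(const std::vector<std::vector<int>>& adj,
                          const std::vector<int>& cluster) {
    std::vector<int> next(cluster.size());
    for (size_t v = 0; v < adj.size(); ++v) {
        std::map<int, int> votes;                  // cluster ID -> vote count
        for (int u : adj[v]) ++votes[cluster[u]];  // neighbors vote with their clusters
        int best = cluster[v], bestCount = 0;
        for (const auto& [id, count] : votes)      // std::map iterates in ascending ID,
            if (count > bestCount) {               // so ties keep the lowest cluster ID
                best = id;
                bestCount = count;
            }
        next[v] = best;
    }
    return next;  // the caller repeats until next == cluster (convergence)
}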
Algorithm 2 (matrix-based PPCL):

PeerPressure(A : R^(N×N), C_i : B^(N×N))
1  T : R^(N×N), m : R^(N)
2  T = C_i A
3  m = T.max
4  C_f = (m .== T)
5  if C_i == C_f
6      then return C_f
7  else return PeerPressure(A, C_f)

Iteration flow: given an approximation G', initialize, tally the votes, and form a new cluster approximation G''; if G' != G'', set G' = G'' and repeat; if G' == G'', stop.
Annotations on Algorithm 2: in the initial assignment C_i, each vertex is its own cluster; the adjacency matrix A is normalized so that each vertex gives equal votes to its neighbors; the ties-settling step decides what to do if two clusters tie for the maximum number of votes for a vertex.
Fig. 1. The procedure of applying Algorithm 2 to an object graph G: (a) the object graph G; (b) the adjacency matrix A of G after initialization (rows normalized, with entries such as 0.25, 0.2, and 0.33); (c) temporary results of matrix C; (d) temporary matrix C after the first ties-settling.
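To make the data flow of Algorithm 2 easier to follow, the following dense, single-process C++ sketch mirrors one iteration (T = C * A, per-column maximum, lowest-index tie settling); it is only an illustration of the formulation above, not the distributed implementation, and its sizes and values are not taken from Fig. 1.

// Dense, single-process mirror of one Algorithm 2 iteration.
#include <vector>
using Matrix = std::vector<std::vector<double>>;   // row-major dense matrix

Matrix PeerPressureStep(const Matrix& C, const Matrix& A) {
    size_t N = A.size();
    Matrix T(N, std::vector<double>(N, 0.0));
    for (size_t c = 0; c < N; ++c)                  // T = C * A (vote tallying)
        for (size_t k = 0; k < N; ++k)
            for (size_t v = 0; v < N; ++v)
                T[c][v] += C[c][k] * A[k][v];

    Matrix Cf(N, std::vector<double>(N, 0.0));
    for (size_t v = 0; v < N; ++v) {                // per vertex (column of T)
        size_t best = 0;
        for (size_t c = 1; c < N; ++c)
            if (T[c][v] > T[best][v]) best = c;     // strict '>' keeps the lowest
        Cf[best][v] = 1.0;                          //   cluster index on a tie
    }
    return Cf;                                      // iterate until Cf == C
}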
Algorithm 3 (parallel PPCL with CombBLAS):

Input: a matrix-based targeted graph A : R^(N×N) and a matrix-based initial approximation graph C : B^(N×N).
Output: a matrix-based clustering result.

procedure PeerPressure(A : R^(N×N), C : B^(N×N))
1  SpParMat<unsigned, double, SpDCCols<unsigned, double>> A, C;
2  DenseParVec<unsigned, double> rowsums = A.Reduce(Row, plus<double>());   /* reduce to Row: columns are collapsed to single entries */
3  rowsums.Apply(multinv<double>());                                        /* multinv<double> is a user-defined function for double */
4  A.DimScale(Row, rowsums);                                                /* normalize A: scale each column with the given vector */
5  while (C != T) do
6      SpParMat<unsigned, double, SpDCCols<unsigned, double>> T = SpGEMM(C, A);   /* vote */
7      Renormalize(T);                                                      /* renormalize T */
8      settling_ties(T);                                                    /* settling ties */

Phases: initialization, vote, normalization, settling ties.
- Data distribution and storage
  - DCSC storage structure
- Algorithm expansion & MPI implementation
  - Parallel voting
  - Renormalization
  - Parallel ties-settling
- Distribute the sparse matrices on a 2D Pr x Pc processor grid
  - Processor P(i, j) stores the sub-matrix Aij of A.
- HyperSparseGEMM operates on O(nnz) data.
- CSC or CSR structure vs. DCSC structure
  - DCSC total storage: O(nnz)
  - DCSC is more efficient than CSR or CSC for storing large-scale sparse matrices.
Fig. 1.1. Triple description of matrix A; Fig. 1.2. CSC structure of matrix A; Fig. 1.3. DCSC structure of matrix A.
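The two storage layouts being compared can be sketched as C++ structs (field names are illustrative, not CombBLAS internals): CSC keeps a pointer for every column, so its overhead grows with the matrix dimension n, while DCSC keeps entries only for the columns that actually contain nonzeros, so its size depends on nnz alone. This matters in the hypersparse case that arises when a large matrix is split across many processors.

#include <vector>

struct CscMatrix {
    int n;                        // number of columns
    std::vector<int> colptr;      // size n + 1          -> O(n) even if nnz << n
    std::vector<int> rowidx;      // size nnz
    std::vector<double> val;      // size nnz
};

struct DcscMatrix {
    std::vector<int> jc;          // indices of the nonempty columns (size nzc)
    std::vector<int> cp;          // size nzc + 1; cp[k]..cp[k+1] indexes column jc[k]
    std::vector<int> rowidx;      // size nnz
    std::vector<double> val;      // size nnz            -> total O(nnz)
};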
(1) Voting: the sparse SUMMA algorithm, our MPI implementation. (2) Normalization: Reduce() and DimApply() in CombBLAS, our MPI implementation. (3) Settling ties: both MPI and hybrid MPI-OpenMP implementations (90% of the time in the full code).
- SpGEMM algorithm
- Leverage two primitives in CombBLAS to ensure that the elements in matrix T remain 1 or 0.

Algorithm 4. Renormalize the T matrix
Renormalize(SpParMat &T)
1  DenseParVec colmax = T.Reduce(Column, max);
2  T.DimApply(Column, colmax, equal_to<double>());
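As a sanity check on what these two primitives achieve together, here is a dense, single-process C++ mirror (illustrative only): the column-wise reduction finds each vertex's maximum vote, and the comparison keeps a 1 exactly where an entry equals that maximum, zeroing everything else.

#include <algorithm>
#include <vector>

void Renormalize(std::vector<std::vector<double>>& T) {
    size_t rows = T.size(), cols = T.empty() ? 0 : T[0].size();
    for (size_t v = 0; v < cols; ++v) {
        double colmax = 0.0;
        for (size_t c = 0; c < rows; ++c) colmax = std::max(colmax, T[c][v]);   // column max
        for (size_t c = 0; c < rows; ++c)
            T[c][v] = (T[c][v] == colmax && colmax > 0.0) ? 1.0 : 0.0;          // keep only winners
    }
}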
Algorithm 5. Settling ties
Settling_ties(SpParMat<unsigned, double, SpDCCols<unsigned, double>> &T)
1  for all processors P(i, j) in parallel do
2      vector<int> v = T.CreatVec();        /* create a vector tallying the processor number */
3      MPI_Allreduce(v, min_v, k, MPI_INT, MPI_MIN, ColWorld());   /* do MPI_Allreduce on every processor column */
4      T.PruneMat(min_v);                   /* generate a new clustering result matrix */
Function: selecting the lowest-numbered cluster with the highest number of votes. Phases: recording, communication, selection.
The "for ... in parallel do" construct indicates that all of the do-blocks are executed in parallel by all the processors. In line 2, we construct a vector in which every element corresponds to one column of the T matrix: if some column contains a 1, its processor ID is tallied in the corresponding element of the vector; otherwise the maximal integer is tallied.
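The communication pattern can be sketched in MPI as follows, assuming each processor in a grid column has already recorded, for every local matrix column, either its own rank (if it holds a candidate 1) or INT_MAX; the MPI_Allreduce with MPI_MIN then selects the lowest-ranked owner, mirroring the lowest-cluster-ID tie-breaking rule. The names here (comm_col, SettleTies) are illustrative, not taken from the actual code.

#include <climits>
#include <vector>
#include <mpi.h>

void SettleTies(std::vector<int>& tally, MPI_Comm comm_col) {
    std::vector<int> winner(tally.size());
    // element-wise minimum across all processors in this grid column
    MPI_Allreduce(tally.data(), winner.data(), static_cast<int>(tally.size()),
                  MPI_INT, MPI_MIN, comm_col);
    int my_rank = 0;
    MPI_Comm_rank(comm_col, &my_rank);
    for (size_t j = 0; j < tally.size(); ++j)
        if (winner[j] != my_rank)
            tally[j] = INT_MAX;   // a lower-ranked processor owns this column's winner
}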
- Platform and input graph
- Testing CombBLAS performance on Dawning
- Testing our parallel implementation

Dawning supercomputer:
- Intel Omni-Path network, 100 Gbps two-sided connection;
- 172 nodes, each with 24 Intel E5-2680 2.5 GHz processor cores and 64 GB memory;
- MVAPICH2.
Input graphs:
1. A permuted R-MAT graph of scale 16 with self-loops added:
   - 65,536 vertices, 490,563 edges.
2. A permuted R-MAT graph of scale 21 with self-loops added:
   - 2,097,152 vertices, 18,305,177 edges.
R-MAT graph: the R-MAT model is one of the graph generation models. Its algorithm is simple and the resulting graph fits the power-law distribution.
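A compact sketch of the R-MAT generation idea: each edge is placed by recursively choosing one of the four quadrants of the adjacency matrix with probabilities a, b, c, d (a + b + c + d = 1), which yields a power-law-like degree distribution. The parameter values below are common defaults, assumed here rather than taken from this talk.

#include <cstdio>
#include <random>
#include <utility>

std::pair<int, int> RmatEdge(int scale, std::mt19937& gen) {
    const double a = 0.57, b = 0.19, c = 0.19;           // d = 1 - a - b - c
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    int row = 0, col = 0;
    for (int level = 0; level < scale; ++level) {        // descend one quadrant per bit
        double r = uni(gen);
        row <<= 1;
        col <<= 1;
        if (r < a)              { /* top-left quadrant: no bits set */ }
        else if (r < a + b)     { col |= 1; }             // top-right
        else if (r < a + b + c) { row |= 1; }             // bottom-left
        else                    { row |= 1; col |= 1; }   // bottom-right
    }
    return {row, col};
}

int main() {
    std::mt19937 gen(42);
    for (int i = 0; i < 5; ++i) {
        auto [u, v] = RmatEdge(16, gen);                  // scale 16: 2^16 vertices
        std::printf("edge %d -> %d\n", u, v);
    }
}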
- Using the BFS code included in CombBLAS.
- Input:
  - R-MAT graph of scale 17 with self-loops added.
  - Corresponds to a sparse matrix of size 131,072 x 131,072.
- Performance:
  - Speedup: 15 on 64 processors.
  - MTEPS (Mega Traversed Edges Per Second).
Table 1. MTEPS of the BFS code of CombBLAS running on Dawning

# of processor cores    MTEPS
                   1       74
                   4      198
                  16      406
                  64      945

The results did not achieve as good a performance as predicted. The reason may be that the CombBLAS version we used is not optimized to match the Dawning supercomputer.
- Case 1: R-MAT graph of scale 16
- Case 2: R-MAT graph of scale 21

Fig. 2 and Fig. 3. Time and speedup vs. number of processors (Case 1).
Parallel efficiency on 4 processors: 92.5%; ...; parallel efficiency on 64 processors: 62.7%.
Fig. 4. Time vs. number of processors; Fig. 5. Speedup vs. number of processors (Case 2).
Parallel efficiency E on 4 processors: 87.5%; ...; E on 1024 processors: 75%.
E on 4 processors: 87.5%; ...; E on 1024 processors: 86.4%.
Fig. 8. MPI implementation: (left) parallel time; (right) speedup.
- The sparse degree of the input graph
  - Both input graphs have a sparse degree above 99% (= (# of zeros) / (# of total elements)); for example, for the scale-21 graph this is 1 - 18,305,177 / 2,097,152^2, about 99.9996%.
  - They are also highly irregular.
- Scalability of our algorithm
  - Under the same conditions apart from the graph size and sparse degree, the speedups of the small and large graphs we test do not differ much. This also demonstrates the scalability of the algorithm.
- Good scalability of our parallel algorithm.
- Limit on the number of cores: the processor grid must be m x m
  - # of grid columns = # of grid rows.
  - The current version uses an even distribution of the matrix between processors.
- The settling-ties step consumes most of the time.
  - Reason: all-to-all communication between processors.
- The BFS testing on Dawning indicates that optimizing the CombBLAS building blocks to match the Dawning architecture is needed.
- There are not yet many research results about parallel PPCL implementations.
- Our implementation vs. direct parallelization
  - It is possible to parallelize the PPCL algorithm directly.
  - But the tedious treatment of the irregular data structures of the graph cannot be ignored; it causes large time costs and memory consumption.
  - We translated the graph computations into a series of matrix operations, so the computations on the irregular data structures of the graph are transformed into a structured representation based on sparse matrices. Thus high performance can be achieved.
- We design and implement a parallel PPCL algorithm based on CombBLAS, which represents the PPCL algorithm as sparse matrix computations.
- When the input is a permuted R-MAT graph of scale 21 with self-loops, the MPI implementation achieves up to 809.6x speedup.
- Next work includes:
  - Optimizing the performance of the parallel PPCL implementation and testing larger graphs for higher scalability.
  - Optimizing the performance of the related matrix algebra building blocks in CombBLAS to match the underlying computer architecture.
  - Studying more parallel irregular algorithms represented by sparse matrices.